spaCy

mirror of https://github.com/explosion/spaCy.git synced 2025-07-24 07:00:04 +03:00

Author	SHA1	Message	Date
Edward	9dfb12e29f	Update universe example codes (#9422 ) * Update universe plugins * Adjust azure trigger * Add init to tests/universe * deliberately trying to break the universe to see if the CI catches it * revert Co-authored-by: svlandeg <svlandeg@github.com>	2021-10-14 09:37:05 +02:00
Paul O'Leary McCann	a3b7519aba	Fix JA Morph Values (#9449 ) * Don't set empty / weird values in morph * Update tests to handle empty morph values * Fix everything * Replace potentially problematic characters * Fix test	2021-10-14 09:21:36 +02:00
Ines Montani	c48564688f	Merge pull request #9423 from explosion/tests/issue-marker	2021-10-13 16:53:40 +02:00
Edward	72711dc2c9	Update universe example codes (#9422 ) * Update universe plugins * Adjust azure trigger * Add init to tests/universe * deliberately trying to break the universe to see if the CI catches it * revert Co-authored-by: svlandeg <svlandeg@github.com>	2021-10-13 16:29:19 +02:00
Jette16	78365452d3	Moved test for universe into .github folder (#9447 ) * Moved universe-test into .github folder * Cleaned code * Changed a file name	2021-10-13 14:13:06 +02:00
Sofie Van Landeghem	d2645b2e03	Fix test for spancat (#9446 ) * fix test for spancat * increase tolerance for almost equal checks * Update spacy/tests/test_models.py * Update spacy/tests/test_models.py	2021-10-13 10:48:35 +02:00
Sofie Van Landeghem	2e3d6b8b5a	Fix test for spancat (#9446 ) * fix test for spancat * increase tolerance for almost equal checks * Update spacy/tests/test_models.py * Update spacy/tests/test_models.py	2021-10-13 10:47:56 +02:00
Sofie Van Landeghem	5e8e8525f0	fix W108 filter (#9438 ) * remove text argument from W108 to enable 'once' filtering * include the option of partial POS annotation * fix typo * Update spacy/errors.py Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>	2021-10-12 19:56:44 +02:00
Lj Miranda	6425b9a1c4	Include JsonlCorpus from the imports (#9431 )	2021-10-12 15:39:14 +02:00
Ryn Daniels	bb6623bb2d	Merge pull request #9426 from explosion/rfd-bot-config Add allowed_teams to the explosion-bot config	2021-10-12 09:50:39 +02:00
Ryn Daniels	2fb420ec23	Add allowed_teams to the explosion-bot config	2021-10-11 18:20:48 +02:00
Ryn Daniels	f64e39fa49	Install explosionbot as a github action (#9420 )	2021-10-11 15:43:27 +02:00
Paul O'Leary McCann	efe5beefe0	Add test for case where parser overwrite annotations (#9406 ) * Add test for case where parser overwrite annotations * Move test to its own file Also add note about how other tokens modify results. * Fix xfail decorator	2021-10-11 14:57:45 +02:00
Ines Montani	1fa7c4e73b	Support issue marker via pytest	2021-10-11 13:56:24 +02:00
Paul O'Leary McCann	3b429619a8	Fix UD POS docs links (fix #9013 ) (#9407 ) * Fix UD POS docs links (fix #9013) The previous link seems to have been for UD v1. * Fix link	2021-10-11 11:51:59 +02:00
Paul O'Leary McCann	b53e39455e	Fix UD POS docs links (fix #9013 ) (#9407 ) * Fix UD POS docs links (fix #9013) The previous link seems to have been for UD v1. * Fix link	2021-10-11 11:51:19 +02:00
Paul O'Leary McCann	fd759a881b	Fix inconsistent lemmas (#9405 ) * Add util function to unique lists and preserve order * Use unique function instead of list(set()) list(set()) has the issue that it's not consistent between runs of the Python interpreter, so order can vary. list(set()) calls were left in a few places where they were behind calls to sorted(). I think in this case the calls to list() can be removed, but this commit doesn't do that. * Use the existing pattern for this	2021-10-11 11:38:45 +02:00
Adriane Boyd	fd91e6a33c	Fix types descriptions of sm and sent models (#9401 )	2021-10-11 11:18:10 +02:00
Adriane Boyd	fd7edbc645	Fix types descriptions of sm and sent models (#9401 )	2021-10-11 11:17:18 +02:00
Adriane Boyd	bbe4d3300a	Remove traces of lexemes from vocab serialization (#9400 )	2021-10-11 11:15:51 +02:00
Sofie Van Landeghem	a6ac36bcb3	Doc fixes in convert API (#9350 ) * add more info on the spacy debug command * formatting	2021-10-11 11:15:20 +02:00
Adriane Boyd	a5231cb044	Remove traces of lexemes from vocab serialization (#9400 )	2021-10-11 11:13:35 +02:00
Jette16	3b144a3a51	Add universe test (#9278 ) * Added test for universe.json * Added contributor agreement * Ran black on test_universe_json.py	2021-10-11 11:08:46 +02:00
Ines Montani	5003a9c3c7	Move core training logic in CLI into standalone function (#9398 )	2021-10-11 10:56:14 +02:00
Adriane Boyd	ae1b3e960b	Update overwrite and scorer in API docs (#9384 ) * Update overwrite and scorer in API docs * Rephrase morphologizer extend + example	2021-10-11 10:35:07 +02:00
Paul O'Leary McCann	2a7e327310	Fix Dependency Matcher Ordering Issue (#9337 ) * Fix inconsistency This makes the failing test pass, so that behavior is consistent whether patterns are added in one call or two. The issue is that the hash for patterns depended on the index of the pattern in the list of current patterns, not the list of total patterns, so a second call would get identical match ids. * Add illustrative test case * Add failing test for remove case Patterns are not removed from the internal matcher on calls to remove, which causes spurious weird matches (or misses). * Fix removal issue Remove patterns from the internal matcher. * Check that the single add call also gets no matches	2021-10-11 10:26:13 +02:00
Paul O'Leary McCann	5dbe4e8392	Update new issue config with Python 3.10 info Also adds note that Install issues go to Discussions.	2021-10-11 15:41:32 +09:00
Paul O'Leary McCann	48ba4e60f4	Add new style citation file (#9388 )	2021-10-07 17:47:39 +02:00
Paul O'Leary McCann	113d53ab6c	Fix tests for changes to inflection structure (#9390 )	2021-10-07 13:42:18 +02:00
Paul O'Leary McCann	c4e3b7a5db	Change JA inflection separator to semicolon Hyphen is unsuitable because of interactions with the JA data fields, but pipe is also unsuitable because it has a different meaning in UD data, so it's better to use something that has no significance in either case. So this uses semicolon.	2021-10-07 17:28:15 +09:00
Paul O'Leary McCann	227f98081b	Use a pipe for separating Japanese inflections Inflection values look like this pipe separated: 五段-ラ行|連用形-促音便 So using a hyphen erases the original fields.	2021-10-07 17:14:05 +09:00
Paul O'Leary McCann	f975690cc9	Use hyphen to join parts of inflection in JA tokenizer	2021-10-07 17:09:38 +09:00
Sofie Van Landeghem	f87ae3cb7d	Doc fixes in convert API (#9350 ) * add more info on the spacy debug command * formatting	2021-10-06 13:13:18 +09:00
Elia Robyn Lake (Robyn Speer)	53b5f245ed	Allow IETF language codes, aliases, and close matches (#9342 ) * use language-matching to allow language code aliases Signed-off-by: Elia Robyn Speer <elia@explosion.ai> * link to "IETF language tags" in docs Signed-off-by: Elia Robyn Speer <elia@explosion.ai> * Make requirements consistent Signed-off-by: Elia Robyn Speer <elia@explosion.ai> * change "two-letter language ID" to "IETF language tag" in language docs Signed-off-by: Elia Robyn Speer <elia@explosion.ai> * use langcodes 3.2 and handle language-tag errors better Signed-off-by: Elia Robyn Speer <elia@explosion.ai> * all unknown language codes are ImportErrors Signed-off-by: Elia Robyn Speer <elia@explosion.ai> Co-authored-by: Elia Robyn Speer <elia@explosion.ai>	2021-10-05 09:52:22 +02:00
Adriane Boyd	4192e71599	Sync vocab in vectors and components sourced in configs (#9335 ) Since a component may reference anything in the vocab, share the full vocab when loading source components and vectors (which will include `strings` as of #8909). When loading a source component from a config, save and restore the vocab state after loading source pipelines, in particular to preserve the original state without vectors, since `[initialize.vectors] = null` skips rather than resets the vectors. The vocab references are not synced for components loaded with `Language.add_pipe(source=)` because the pipelines are already loaded and not necessarily with the same vocab. A warning could be added in `Language.create_pipe_from_source` that it may be necessary to save and reload before training, but it's a rare enough case that this kind of warning may be too noisy overall.	2021-10-04 12:19:02 +02:00
Paul O'Leary McCann	1ee6541ab0	Moving Japanese tokenizer extra info to Token.morph (#8977 ) * Use morph for extra Japanese tokenizer info Previously Japanese tokenizer info that didn't correspond to Token fields was put in user data. Since spaCy core should avoid touching user data, this moves most information to the Token.morph attribute. It also adds the normalized form, which wasn't exposed before. The subtokens, which are a list of full tokens, are still added to user data, except with the default tokenizer granularity. With the default tokenizer settings the subtokens are all None, so in this case the user data is simply not set. * Update tests Also adds a new test for norm data. * Update docs * Add Japanese morphologizer factory Set the default to `extend=True` so that the morphologizer does not clobber the values set by the tokenizer. * Use the norm_ field for normalized forms Before this commit, normalized forms were put in the "norm" field in the morph attributes. I am not sure why I did that instead of using the token morph, I think I just forgot about it. * Skip test if sudachipy is not installed * Fix import Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>	2021-10-01 19:19:26 +02:00
Paul O'Leary McCann	8f2409e514	Don't serialize user data in DocBin if not saving it (fix #9190 ) (#9226 ) * Don't store user data if told not to (fix #9190) * Add unit tests for the store_user_data setting	2021-10-01 12:37:39 +02:00
Paul O'Leary McCann	23badbd55c	Updating Troubleshooting Docs (#9329 ) * Add link to Discussions FAQ * Remove old FAQ entries I think these are no longer relevant. - no-cache-dir: affected pip versions are very old now - narrow unicode: not an issue from py3.3+ - utf-8 osx: upstream bug closed in 2019 Some of the other issues are also maybe not frequent.	2021-10-01 12:31:41 +02:00
Paul O'Leary McCann	6e833b617a	Updating Troubleshooting Docs (#9329 ) * Add link to Discussions FAQ * Remove old FAQ entries I think these are no longer relevant. - no-cache-dir: affected pip versions are very old now - narrow unicode: not an issue from py3.3+ - utf-8 osx: upstream bug closed in 2019 Some of the other issues are also maybe not frequent.	2021-10-01 12:28:22 +02:00
github-actions[bot]	42a76c758f	Auto-format code with black (#9346 ) Co-authored-by: explosion-bot <explosion-bot@users.noreply.github.com>	2021-10-01 11:17:11 +02:00
Adriane Boyd	b3192ddea3	Sync thinc install dep in setup, fix test packaging (#9336 ) * Sync thinc install dep in setup * Add __init__.py to include package tests in package * Include *.toml in package	2021-09-30 19:02:10 +02:00
Adriane Boyd	03fefa37e2	Add overwrite settings for more components (#9050 ) * Add overwrite settings for more components For pipeline components where it's relevant and not already implemented, add an explicit `overwrite` setting that controls whether `set_annotations` overwrites existing annotation. For the `morphologizer`, add an additional setting `extend`, which controls whether the existing features are preserved. * +overwrite, +extend: overwrite values of existing features, add any new features * +overwrite, -extend: overwrite completely, removing any existing features * -overwrite, +extend: keep values of existing features, add any new features * -overwrite, -extend: do not modify the existing value if set In all cases an unset value will be set by `set_annotations`. Preserve current overwrite defaults: * True: morphologizer, entity linker * False: tagger, sentencizer, senter * Add backwards compat overwrite settings * Put empty line back Removed by accident in last commit * Set backwards-compatible defaults in __init__ Because the `TrainablePipe` serialization methods update `cfg`, there's no straightforward way to detect whether models serialized with a previous version are missing the overwrite settings. It would be possible in the sentencizer due to its separate serialization methods, however to keep the changes parallel, this also sets the default in `__init__`. * Remove traces Co-authored-by: Paul O'Leary McCann <polm@dampfkraft.com>	2021-09-30 15:35:55 +02:00
Jim O’Regan	8fe525beb5	Add an Irish lemmatiser, based on BuNaMo (#9102 ) * add tréis/théis * remove previous contents, add demutate/unponc * fmt off/on wrapping * type hints * IrishLemmatizer (sic) * Use spacy-lookups-data>=1.0.3 * Minor bug fixes, refactoring for IrishLemmatizer * Fix return type for ADP list lookups * Fix and refactor lookup table lookups for missing/string/list * Remove unused variables * skip lookup of verbal substantives and adjectives; just demutate * Fix morph checks API details * Add types and format * Move helper methods into lemmatizer Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>	2021-09-30 14:18:47 +02:00
Paul O'Leary McCann	0508795d67	Fix invalid json	2021-09-30 15:24:47 +09:00
Paul O'Leary McCann	78a88f7de7	Fix invalid json	2021-09-30 15:23:55 +09:00
Martin Vallone	f15bb40941	Adding PhruzzMatcher to spaCy universe (#9321 ) * Adding PhruzzMatcher to spaCy universe * Fixes to make the package work properly	2021-09-30 14:26:40 +09:00
Martin Vallone	a14ab7e882	Adding PhruzzMatcher to spaCy universe (#9321 ) * Adding PhruzzMatcher to spaCy universe * Fixes to make the package work properly	2021-09-30 13:46:53 +09:00
Elia Robyn Lake (Robyn Speer)	5b0b0ca809	Move WandB loggers into spacy-loggers (#9223 ) * factor out the WandB logger into spacy-loggers Signed-off-by: Elia Robyn Speer <gh@arborelia.net> * depend on spacy-loggers so they are available Signed-off-by: Elia Robyn Speer <gh@arborelia.net> * remove docs of spacy.WandbLogger.v2 (moved to spacy-loggers) Signed-off-by: Elia Robyn Speer <elia@explosion.ai> * Version number suggestions from code review Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com> * update references to WandbLogger Signed-off-by: Elia Robyn Speer <elia@explosion.ai> * make order of deps more consistent Signed-off-by: Elia Robyn Speer <elia@explosion.ai> Co-authored-by: Elia Robyn Speer <elia@explosion.ai> Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>	2021-09-29 11:12:50 +02:00
Adriane Boyd	e750c1760c	Restore tokenization timing in Language.evaluate (#9305 ) Restore tokenization timing steps that were accidentally removed in #6765.	2021-09-27 20:44:14 +02:00
Sofie Van Landeghem	a361df00cd	Raise E983 early on in docbin init (#9247 ) * raise E983 early on in docbin init * catch situation before error is raised * add more info on the spacy debug command	2021-09-27 20:43:03 +02:00
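
Commit fd759a881b above observes that list(set()) does not give a consistent order between runs of the Python interpreter. A minimal sketch of one order-preserving alternative follows. The function name `unique_preserving_order` and the sample lemmas are hypothetical and used only for illustration. The log does not show which pattern the commit finally adopted, so this is not necessarily spaCy's implementation.

```python
def unique_preserving_order(items):
    """Return items with duplicates removed, keeping first-seen order.

    Hypothetical helper for illustration; not taken from the spaCy codebase.
    """
    # dict keys are unique and, since Python 3.7, keep insertion order,
    # so the result does not depend on string hash randomization.
    return list(dict.fromkeys(items))


# Illustrative data only.
lemmas = ["run", "running", "run", "ran", "running"]
print(unique_preserving_order(lemmas))  # ['run', 'running', 'ran']
```

With list(set(lemmas)) the same three strings would come back, but their order could differ from one interpreter run to the next, which is the inconsistency the commit message describes.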

15324 Commits