spaCy

mirror of https://github.com/explosion/spaCy.git synced 2026-01-06 16:51:14 +03:00

Author	SHA1	Message	Date
Adriane Boyd	d98d525bc8	Merge remote-tracking branch 'upstream/master' into chore/update-develop-from-master-v3.1-3	2021-10-14 09:41:46 +02:00
Paul O'Leary McCann	a3b7519aba	Fix JA Morph Values (#9449 ) * Don't set empty / weird values in morph * Update tests to handy empty morph values * Fix everything * Replace potentially problematic characters * Fix test	2021-10-14 09:21:36 +02:00
Ines Montani	c48564688f	Merge pull request #9423 from explosion/tests/issue-marker	2021-10-13 16:53:40 +02:00
Jette16	78365452d3	Moved test for universe into .github folder (#9447 ) * Moved universe-test into .github folder * Cleaned code * CHanged a file name	2021-10-13 14:13:06 +02:00
Sofie Van Landeghem	2e3d6b8b5a	Fix test for spancat (#9446 ) * fix test for spancat * increase tolerance for almost equal checks * Update spacy/tests/test_models.py * Update spacy/tests/test_models.py	2021-10-13 10:47:56 +02:00
Paul O'Leary McCann	efe5beefe0	Add test for case where parser overwrite annotations (#9406 ) * Add test for case where parser overwrite annotations * Move test to its own file Also add note about how other tokens modify results. * Fix xfail decorator	2021-10-11 14:57:45 +02:00
Ines Montani	1fa7c4e73b	Support issue marker via pytest	2021-10-11 13:56:24 +02:00
Jette16	3b144a3a51	Add universe test (#9278 ) * Added test for universe.json * Added contributor agreement * Ran black on test_universe_json.py	2021-10-11 11:08:46 +02:00
Paul O'Leary McCann	2a7e327310	Fix Dependency Matcher Ordering Issue (#9337 ) * Fix inconsistency This makes the failing test pass, so that behavior is consistent whether patterns are added in one call or two. The issue is that the hash for patterns depended on the index of the pattern in the list of current patterns, not the list of total patterns, so a second call would get identical match ids. * Add illustrative test case * Add failing test for remove case Patterns are not removed from the internal matcher on calls to remove, which causes spurious weird matches (or misses). * Fix removal issue Remove patterns from the internal matcher. * Check that the single add call also gets no matches	2021-10-11 10:26:13 +02:00
Paul O'Leary McCann	113d53ab6c	Fix tests for changes to inflection structure (#9390 )	2021-10-07 13:42:18 +02:00
Elia Robyn Lake (Robyn Speer)	53b5f245ed	Allow IETF language codes, aliases, and close matches (#9342 ) * use language-matching to allow language code aliases Signed-off-by: Elia Robyn Speer <elia@explosion.ai> * link to "IETF language tags" in docs Signed-off-by: Elia Robyn Speer <elia@explosion.ai> * Make requirements consistent Signed-off-by: Elia Robyn Speer <elia@explosion.ai> * change "two-letter language ID" to "IETF language tag" in language docs Signed-off-by: Elia Robyn Speer <elia@explosion.ai> * use langcodes 3.2 and handle language-tag errors better Signed-off-by: Elia Robyn Speer <elia@explosion.ai> * all unknown language codes are ImportErrors Signed-off-by: Elia Robyn Speer <elia@explosion.ai> Co-authored-by: Elia Robyn Speer <elia@explosion.ai>	2021-10-05 09:52:22 +02:00
Paul O'Leary McCann	1ee6541ab0	Moving Japanese tokenizer extra info to Token.morph (#8977 ) * Use morph for extra Japanese tokenizer info Previously Japanese tokenizer info that didn't correspond to Token fields was put in user data. Since spaCy core should avoid touching user data, this moves most information to the Token.morph attribute. It also adds the normalized form, which wasn't exposed before. The subtokens, which are a list of full tokens, are still added to user data, except with the default tokenizer granualarity. With the default tokenizer settings the subtokens are all None, so in this case the user data is simply not set. * Update tests Also adds a new test for norm data. * Update docs * Add Japanese morphologizer factory Set the default to `extend=True` so that the morphologizer does not clobber the values set by the tokenizer. * Use the norm_ field for normalized forms Before this commit, normalized forms were put in the "norm" field in the morph attributes. I am not sure why I did that instead of using the token morph, I think I just forgot about it. * Skip test if sudachipy is not installed * Fix import Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>	2021-10-01 19:19:26 +02:00
Paul O'Leary McCann	8f2409e514	Don't serialize user data in DocBin if not saving it (fix #9190 ) (#9226 ) * Don't store user data if told not to (fix #9190) * Add unit tests for the store_user_data setting	2021-10-01 12:37:39 +02:00
Adriane Boyd	b3192ddea3	Sync thinc install dep in setup, fix test packaging (#9336 ) * Sync thinc install dep in setup * Add __init__.py to include package tests in package * Include *.toml in package	2021-09-30 19:02:10 +02:00
Adriane Boyd	03fefa37e2	Add overwrite settings for more components (#9050 ) * Add overwrite settings for more components For pipeline components where it's relevant and not already implemented, add an explicit `overwrite` setting that controls whether `set_annotations` overwrites existing annotation. For the `morphologizer`, add an additional setting `extend`, which controls whether the existing features are preserved. * +overwrite, +extend: overwrite values of existing features, add any new features * +overwrite, -extend: overwrite completely, removing any existing features * -overwrite, +extend: keep values of existing features, add any new features * -overwrite, -extend: do not modify the existing value if set In all cases an unset value will be set by `set_annotations`. Preserve current overwrite defaults: * True: morphologizer, entity linker * False: tagger, sentencizer, senter * Add backwards compat overwrite settings * Put empty line back Removed by accident in last commit * Set backwards-compatible defaults in __init__ Because the `TrainablePipe` serialization methods update `cfg`, there's no straightforward way to detect whether models serialized with a previous version are missing the overwrite settings. It would be possible in the sentencizer due to its separate serialization methods, however to keep the changes parallel, this also sets the default in `__init__`. * Remove traces Co-authored-by: Paul O'Leary McCann <polm@dampfkraft.com>	2021-09-30 15:35:55 +02:00
Adriane Boyd	effae12cbd	Update slow readers test to use textcat_multilabel (#9300 )	2021-09-27 20:04:02 +02:00
Adriane Boyd	fe5f5d6ac6	Update Catalan tokenizer (#9297 ) * Update Makefile For more recent python version * updated for bsc changes New tokenization changes * Update test_text.py * updating tests and requirements * changed failed test in test/lang/ca changed failed test in test/lang/ca * Update .gitignore deleted stashed changes line * back to python 3.6 and remove transformer requirements As per request * Update test_exception.py Change the test * Update test_exception.py Remove test print * Update Makefile For more recent python version * updated for bsc changes New tokenization changes * updating tests and requirements * Update requirements.txt Removed spacy-transfromers from requirements * Update test_exception.py Added final punctuation to ensure consistency * Update Makefile Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> * Format * Update test to check all tokens Co-authored-by: cayorodriguez <crodriguezp@gmail.com> Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>	2021-09-27 14:42:30 +02:00
Adriane Boyd	03f234b739	Merge remote-tracking branch 'upstream/master' into develop	2021-09-27 09:10:45 +02:00
github-actions[bot]	4da2af4e0e	Auto-format code with black (#9284 ) Co-authored-by: explosion-bot <explosion-bot@users.noreply.github.com>	2021-09-24 10:46:43 +02:00
Jette16	5eced281d8	Add universe test (#9278 ) * Added test for universe.json * Added contributor agreement * Ran black on test_universe_json.py	2021-09-23 14:31:42 +02:00
Adriane Boyd	2f0bb77920	Accept Doc input in pipelines (#9069 ) * Accept Doc input in pipelines Allow `Doc` input to `Language.__call__` and `Language.pipe`, which skips `Language.make_doc` and passes the doc directly to the pipeline. * ensure_doc helper function * avoid running multiple processes on GPU * Update spacy/tests/test_language.py Co-authored-by: svlandeg <svlandeg@github.com>	2021-09-22 09:41:05 +02:00
Adriane Boyd	00bdb31150	Fix vector for 0-length span (#9244 )	2021-09-20 20:22:49 +02:00
github-actions[bot]	015d439eb6	Auto-format code with black (#9234 ) Co-authored-by: explosion-bot <explosion-bot@users.noreply.github.com>	2021-09-20 08:49:19 +02:00
Paul O'Leary McCann	c4f0800fb8	Validate pos values when creating Doc (#9148 ) * Validate pos values when creating Doc * Add clear error when setting invalid pos This also changes the error language slightly. * Fix variable name * Update spacy/tokens/doc.pyx * Test that setting invalid pos raises an error Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>	2021-09-16 13:28:05 +02:00
Paul O'Leary McCann	0f01f46e02	Update Cython string types (#9143 ) * Replace all basestring references with unicode `basestring` was a compatability type introduced by Cython to make dealing with utf-8 strings in Python2 easier. In Python3 it is equivalent to the unicode (or str) type. I replaced all references to basestring with unicode, since that was used elsewhere, but we could also just replace them with str, which shoudl also be equivalent. All tests pass locally. * Replace all references to unicode type with str Since we only support python3 this is simpler. * Remove all references to unicode type This removes all references to the unicode type across the codebase and replaces them with `str`, which makes it more drastic than the prior commits. In order to make this work importing `unicode_literals` had to be removed, and one explicit unicode literal also had to be removed (it is unclear why this is necessary in Cython with language level 3, but without doing it there were errors about implicit conversion). When `unicode` is used as a type in comments it was also edited to be `str`. Additionally `coding: utf8` headers were removed from a few files.	2021-09-13 17:02:17 +02:00
Adriane Boyd	aba6ce3a43	Handle spacy-legacy in package CLI for dependencies (#9163 ) * Handle spacy-legacy in package CLI for dependencies * Implement legacy backoff in spacy registry.find * Remove unused import * Update and format test	2021-09-08 11:46:40 +02:00
github-actions[bot]	584fae5807	Auto-format code with black (#9130 ) Co-authored-by: explosion-bot <explosion-bot@users.noreply.github.com>	2021-09-03 10:47:03 +02:00
Kevin Humphreys	ca93504660	Pass alignments to Matcher callbacks (#9001 ) * pass alignments to callbacks * refactor for single callback loop * Update spacy/matcher/matcher.pyx Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>	2021-09-02 12:58:05 +02:00
Robyn Speer	d60b748e3c	Fix surprises when asking for the root of a git repo (#9074 ) * Fix surprises when asking for the root of a git repo In the case of the first asset I wanted to get from git, the data I wanted was the entire repository. I tried leaving "path" blank, which gave a less-than-helpful error, and then I tried `path: "/"`, which started copying my entire filesystem into the project. The path I should have used was "". I've made two changes to make this smoother for others: - The 'path' within a git clone defaults to "" - If the path points outside of the tmpdir that the git clone goes into, we fail with an error Signed-off-by: Elia Robyn Speer <elia@explosion.ai> * use a descriptive error instead of a default plus some minor fixes from PR review Signed-off-by: Elia Robyn Speer <elia@explosion.ai> * check for None values in assets Signed-off-by: Elia Robyn Speer <elia@explosion.ai> Co-authored-by: Elia Robyn Speer <elia@explosion.ai>	2021-09-01 22:52:08 +02:00
github-actions[bot]	fb9c31fbda	Auto-format code with black (#9065 ) Co-authored-by: explosion-bot <explosion-bot@users.noreply.github.com>	2021-08-27 11:42:27 +02:00
Ines Montani	4cd052e81d	Include component factories in third-party dependencies resolver (#9009 ) * Include component factories in third-party dependencies resolver * Increment catalogue and update test	2021-08-25 14:58:01 +02:00
Sofie Van Landeghem	4d52d7051c	Fix spancat training on nested entities (#9007 ) * overfitting test on non-overlapping entities * add failing overfitting test for overlapping entities * failing test for list comprehension * remove test that was put in separate PR * bugfix * cleanup	2021-08-20 12:37:50 +02:00
Sofie Van Landeghem	de025beb5f	Warn and document spangroup.doc weakref (#8980 ) * test for error after Doc has been garbage collected * warn about using a SpanGroup when the Doc has been garbage collected * add warning to the docs * rephrase slightly * raise error instead of warning * update * move warning to doc property	2021-08-20 11:06:19 +02:00
Adriane Boyd	c5de9b463a	Update custom tokenizer APIs and pickling (#8972 ) * Fix incorrect pickling of Japanese and Korean pipelines, which led to the entire pipeline being reset if pickled * Enable pickling of Vietnamese tokenizer * Update tokenizer APIs for Chinese, Japanese, Korean, Thai, and Vietnamese so that only the `Vocab` is required for initialization	2021-08-19 14:37:47 +02:00
Ines Montani	d94ddd5686	Auto-detect package dependencies in spacy package (#8948 ) * Auto-detect package dependencies in spacy package * Add simple get_third_party_dependencies test * Import packages_distributions explicitly * Inline packages_distributions * Fix docstring [ci skip] * Relax catalogue requirement * Move importlib_metadata to spacy.compat with note * Include license information [ci skip]	2021-08-17 14:05:13 +02:00
Sofie Van Landeghem	0a6b68848f	Fix making span_group (#8975 ) * fix _make_span_group * fix imports	2021-08-17 10:36:34 +02:00
github-actions[bot]	92071326d8	Auto-format code with black (#8950 ) Co-authored-by: explosion-bot <explosion-bot@users.noreply.github.com>	2021-08-13 11:48:38 +02:00
Adriane Boyd	f99d6d5e39	Refactor scoring methods to use registered functions (#8766 ) * Add scorer option to components Add an optional `scorer` parameter to all pipeline components. If a scoring function is provided, it overrides the default scoring method for that component. * Add registered scorers for all components * Add `scorers` registry * Move all scoring methods outside of components as independent functions and register * Use the registered scoring methods as defaults in configs and inits Additional: * The scoring methods no longer have access to the full component, so use settings from `cfg` as default scorer options to handle settings such as `labels`, `threshold`, and `positive_label` * The `attribute_ruler` scoring method no longer has access to the patterns, so all scoring methods are called * Bug fix: `spancat` scoring method is updated to set `allow_overlap` to score overlapping spans correctly * Update Russian lemmatizer to use direct score method * Check type of cfg in Pipe.score * Fix check * Update spacy/pipeline/sentencizer.pyx Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> * Remove validate_examples from scoring functions * Use Pipe.labels instead of Pipe.cfg["labels"] Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>	2021-08-10 15:13:39 +02:00
fgaim	ee011ca963	Update Tigrinya ትግርኛ language support (#8900 ) * Add missing punctuation for Tigrinya and Amharic * Fix numeral and ordinal numbers for Tigrinya - Amharic was used in many cases - Also fixed some typos * Update Tigrinya stop-words * Contributor agreement for fgaim * Fix typo in "ti" lang test * Remove multi-word entries from numbers and ordinals	2021-08-10 13:55:08 +02:00
Paul O'Leary McCann	6029cfc391	Add scores to output in spancat (#8855 ) * Add scores to output in spancat This exposes the scores as an attribute on the SpanGroup. Includes a basic test. * Add basic doc note * Vectorize score calcs * Add "annotation format" section * Update website/docs/api/spancategorizer.md Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com> * Clean up doc section * Ran prettier on docs * Get arrays off the gpu before iterating over them * Remove int() calls Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>	2021-08-10 13:47:49 +02:00
Adriane Boyd	a79888ed67	Merge remote-tracking branch 'upstream/master' into chore/update-develop-from-master-v3.1-1	2021-08-09 13:13:13 +02:00
github-actions[bot]	56d4d87aeb	Auto-format code with black (#8895 ) Co-authored-by: explosion-bot <explosion-bot@users.noreply.github.com>	2021-08-06 13:38:06 +02:00
Adriane Boyd	fa2e7a4bbf	Fix spancat tests on GPU (#8872 ) * Fix spancat tests on GPU * Fix more spancat tests	2021-08-04 14:29:43 +02:00
Adriane Boyd	941a591f3c	Pass excludes when serializing vocab (#8824 ) * Pass excludes when serializing vocab Additional minor bug fix: * Deserialize vocab in `EntityLinker.from_disk` * Add test for excluding strings on load * Fix formatting	2021-08-03 14:42:44 +02:00
Adriane Boyd	175847f92c	Support list values and INTERSECTS in Matcher (#8784 ) * Support list values and IS_INTERSECT in Matcher * Support list values as token attributes for set operators, not just as pattern values. * Add `IS_INTERSECT` operator. * Fix incorrect `ISSUBSET` and `ISSUPERSET` in schema and docs. * Rename IS_INTERSECT to INTERSECTS	2021-08-02 19:39:26 +02:00
Adriane Boyd	fbbbda1954	Fix start/end chars for empty and out-of-bounds spans (#8816 )	2021-08-02 19:07:19 +02:00
Adriane Boyd	81d3a1edb1	Use tokenizer URL_MATCH pattern in LIKE_URL (#8765 )	2021-07-27 12:07:01 +02:00
Sofie Van Landeghem	83e27d262e	negative tag annotation (#8731 ) * unit test to unlearn tag via negative annotation * bump thinc to 8.0.8	2021-07-19 14:39:11 +02:00
Ines Montani	f90482d077	Tidy up and auto-format	2021-07-18 15:44:56 +10:00
Ines Montani	15e6578f7d	Adjust formatting	2021-07-17 10:49:13 +10:00

1 2 3 4 5 ...

2307 Commits