spaCy

mirror of https://github.com/explosion/spaCy.git synced 2026-02-14 03:00:40 +03:00

Author	SHA1	Message	Date
Paul O'Leary McCann	1ee6541ab0	Moving Japanese tokenizer extra info to Token.morph (#8977 ) * Use morph for extra Japanese tokenizer info Previously Japanese tokenizer info that didn't correspond to Token fields was put in user data. Since spaCy core should avoid touching user data, this moves most information to the Token.morph attribute. It also adds the normalized form, which wasn't exposed before. The subtokens, which are a list of full tokens, are still added to user data, except with the default tokenizer granularity. With the default tokenizer settings the subtokens are all None, so in this case the user data is simply not set. * Update tests Also adds a new test for norm data. * Update docs * Add Japanese morphologizer factory Set the default to `extend=True` so that the morphologizer does not clobber the values set by the tokenizer. * Use the norm_ field for normalized forms Before this commit, normalized forms were put in the "norm" field in the morph attributes. I am not sure why I did that instead of using the token morph, I think I just forgot about it. * Skip test if sudachipy is not installed * Fix import Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>	2021-10-01 19:19:26 +02:00
Paul O'Leary McCann	8f2409e514	Don't serialize user data in DocBin if not saving it (fix #9190 ) (#9226 ) * Don't store user data if told not to (fix #9190) * Add unit tests for the store_user_data setting	2021-10-01 12:37:39 +02:00
Adriane Boyd	03fefa37e2	Add overwrite settings for more components (#9050 ) * Add overwrite settings for more components For pipeline components where it's relevant and not already implemented, add an explicit `overwrite` setting that controls whether `set_annotations` overwrites existing annotation. For the `morphologizer`, add an additional setting `extend`, which controls whether the existing features are preserved. * +overwrite, +extend: overwrite values of existing features, add any new features * +overwrite, -extend: overwrite completely, removing any existing features * -overwrite, +extend: keep values of existing features, add any new features * -overwrite, -extend: do not modify the existing value if set In all cases an unset value will be set by `set_annotations`. Preserve current overwrite defaults: * True: morphologizer, entity linker * False: tagger, sentencizer, senter * Add backwards compat overwrite settings * Put empty line back Removed by accident in last commit * Set backwards-compatible defaults in __init__ Because the `TrainablePipe` serialization methods update `cfg`, there's no straightforward way to detect whether models serialized with a previous version are missing the overwrite settings. It would be possible in the sentencizer due to its separate serialization methods, however to keep the changes parallel, this also sets the default in `__init__`. * Remove traces Co-authored-by: Paul O'Leary McCann <polm@dampfkraft.com>	2021-09-30 15:35:55 +02:00
Adriane Boyd	fe5f5d6ac6	Update Catalan tokenizer (#9297 ) * Update Makefile For more recent python version * updated for bsc changes New tokenization changes * Update test_text.py * updating tests and requirements * changed failed test in test/lang/ca changed failed test in test/lang/ca * Update .gitignore deleted stashed changes line * back to python 3.6 and remove transformer requirements As per request * Update test_exception.py Change the test * Update test_exception.py Remove test print * Update Makefile For more recent python version * updated for bsc changes New tokenization changes * updating tests and requirements * Update requirements.txt Removed spacy-transformers from requirements * Update test_exception.py Added final punctuation to ensure consistency * Update Makefile Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> * Format * Update test to check all tokens Co-authored-by: cayorodriguez <crodriguezp@gmail.com> Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>	2021-09-27 14:42:30 +02:00
Adriane Boyd	03f234b739	Merge remote-tracking branch 'upstream/master' into develop	2021-09-27 09:10:45 +02:00
github-actions[bot]	4da2af4e0e	Auto-format code with black (#9284 ) Co-authored-by: explosion-bot <explosion-bot@users.noreply.github.com>	2021-09-24 10:46:43 +02:00
Jette16	5eced281d8	Add universe test (#9278 ) * Added test for universe.json * Added contributor agreement * Ran black on test_universe_json.py	2021-09-23 14:31:42 +02:00
Adriane Boyd	2f0bb77920	Accept Doc input in pipelines (#9069 ) * Accept Doc input in pipelines Allow `Doc` input to `Language.__call__` and `Language.pipe`, which skips `Language.make_doc` and passes the doc directly to the pipeline. * ensure_doc helper function * avoid running multiple processes on GPU * Update spacy/tests/test_language.py Co-authored-by: svlandeg <svlandeg@github.com>	2021-09-22 09:41:05 +02:00
Adriane Boyd	00bdb31150	Fix vector for 0-length span (#9244 )	2021-09-20 20:22:49 +02:00
github-actions[bot]	015d439eb6	Auto-format code with black (#9234 ) Co-authored-by: explosion-bot <explosion-bot@users.noreply.github.com>	2021-09-20 08:49:19 +02:00
Paul O'Leary McCann	c4f0800fb8	Validate pos values when creating Doc (#9148 ) * Validate pos values when creating Doc * Add clear error when setting invalid pos This also changes the error language slightly. * Fix variable name * Update spacy/tokens/doc.pyx * Test that setting invalid pos raises an error Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>	2021-09-16 13:28:05 +02:00
Paul O'Leary McCann	0f01f46e02	Update Cython string types (#9143 ) * Replace all basestring references with unicode `basestring` was a compatibility type introduced by Cython to make dealing with utf-8 strings in Python2 easier. In Python3 it is equivalent to the unicode (or str) type. I replaced all references to basestring with unicode, since that was used elsewhere, but we could also just replace them with str, which should also be equivalent. All tests pass locally. * Replace all references to unicode type with str Since we only support python3 this is simpler. * Remove all references to unicode type This removes all references to the unicode type across the codebase and replaces them with `str`, which makes it more drastic than the prior commits. In order to make this work importing `unicode_literals` had to be removed, and one explicit unicode literal also had to be removed (it is unclear why this is necessary in Cython with language level 3, but without doing it there were errors about implicit conversion). When `unicode` is used as a type in comments it was also edited to be `str`. Additionally `coding: utf8` headers were removed from a few files.	2021-09-13 17:02:17 +02:00
Adriane Boyd	aba6ce3a43	Handle spacy-legacy in package CLI for dependencies (#9163 ) * Handle spacy-legacy in package CLI for dependencies * Implement legacy backoff in spacy registry.find * Remove unused import * Update and format test	2021-09-08 11:46:40 +02:00
github-actions[bot]	584fae5807	Auto-format code with black (#9130 ) Co-authored-by: explosion-bot <explosion-bot@users.noreply.github.com>	2021-09-03 10:47:03 +02:00
Kevin Humphreys	ca93504660	Pass alignments to Matcher callbacks (#9001 ) * pass alignments to callbacks * refactor for single callback loop * Update spacy/matcher/matcher.pyx Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>	2021-09-02 12:58:05 +02:00
Robyn Speer	d60b748e3c	Fix surprises when asking for the root of a git repo (#9074 ) * Fix surprises when asking for the root of a git repo In the case of the first asset I wanted to get from git, the data I wanted was the entire repository. I tried leaving "path" blank, which gave a less-than-helpful error, and then I tried `path: "/"`, which started copying my entire filesystem into the project. The path I should have used was "". I've made two changes to make this smoother for others: - The 'path' within a git clone defaults to "" - If the path points outside of the tmpdir that the git clone goes into, we fail with an error Signed-off-by: Elia Robyn Speer <elia@explosion.ai> * use a descriptive error instead of a default plus some minor fixes from PR review Signed-off-by: Elia Robyn Speer <elia@explosion.ai> * check for None values in assets Signed-off-by: Elia Robyn Speer <elia@explosion.ai> Co-authored-by: Elia Robyn Speer <elia@explosion.ai>	2021-09-01 22:52:08 +02:00
github-actions[bot]	fb9c31fbda	Auto-format code with black (#9065 ) Co-authored-by: explosion-bot <explosion-bot@users.noreply.github.com>	2021-08-27 11:42:27 +02:00
Ines Montani	4cd052e81d	Include component factories in third-party dependencies resolver (#9009 ) * Include component factories in third-party dependencies resolver * Increment catalogue and update test	2021-08-25 14:58:01 +02:00
Sofie Van Landeghem	4d52d7051c	Fix spancat training on nested entities (#9007 ) * overfitting test on non-overlapping entities * add failing overfitting test for overlapping entities * failing test for list comprehension * remove test that was put in separate PR * bugfix * cleanup	2021-08-20 12:37:50 +02:00
Sofie Van Landeghem	de025beb5f	Warn and document spangroup.doc weakref (#8980 ) * test for error after Doc has been garbage collected * warn about using a SpanGroup when the Doc has been garbage collected * add warning to the docs * rephrase slightly * raise error instead of warning * update * move warning to doc property	2021-08-20 11:06:19 +02:00
Adriane Boyd	c5de9b463a	Update custom tokenizer APIs and pickling (#8972 ) * Fix incorrect pickling of Japanese and Korean pipelines, which led to the entire pipeline being reset if pickled * Enable pickling of Vietnamese tokenizer * Update tokenizer APIs for Chinese, Japanese, Korean, Thai, and Vietnamese so that only the `Vocab` is required for initialization	2021-08-19 14:37:47 +02:00
Ines Montani	d94ddd5686	Auto-detect package dependencies in spacy package (#8948 ) * Auto-detect package dependencies in spacy package * Add simple get_third_party_dependencies test * Import packages_distributions explicitly * Inline packages_distributions * Fix docstring [ci skip] * Relax catalogue requirement * Move importlib_metadata to spacy.compat with note * Include license information [ci skip]	2021-08-17 14:05:13 +02:00
Sofie Van Landeghem	0a6b68848f	Fix making span_group (#8975 ) * fix _make_span_group * fix imports	2021-08-17 10:36:34 +02:00
github-actions[bot]	92071326d8	Auto-format code with black (#8950 ) Co-authored-by: explosion-bot <explosion-bot@users.noreply.github.com>	2021-08-13 11:48:38 +02:00
Adriane Boyd	f99d6d5e39	Refactor scoring methods to use registered functions (#8766 ) * Add scorer option to components Add an optional `scorer` parameter to all pipeline components. If a scoring function is provided, it overrides the default scoring method for that component. * Add registered scorers for all components * Add `scorers` registry * Move all scoring methods outside of components as independent functions and register * Use the registered scoring methods as defaults in configs and inits Additional: * The scoring methods no longer have access to the full component, so use settings from `cfg` as default scorer options to handle settings such as `labels`, `threshold`, and `positive_label` * The `attribute_ruler` scoring method no longer has access to the patterns, so all scoring methods are called * Bug fix: `spancat` scoring method is updated to set `allow_overlap` to score overlapping spans correctly * Update Russian lemmatizer to use direct score method * Check type of cfg in Pipe.score * Fix check * Update spacy/pipeline/sentencizer.pyx Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> * Remove validate_examples from scoring functions * Use Pipe.labels instead of Pipe.cfg["labels"] Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>	2021-08-10 15:13:39 +02:00
fgaim	ee011ca963	Update Tigrinya ትግርኛ language support (#8900 ) * Add missing punctuation for Tigrinya and Amharic * Fix numeral and ordinal numbers for Tigrinya - Amharic was used in many cases - Also fixed some typos * Update Tigrinya stop-words * Contributor agreement for fgaim * Fix typo in "ti" lang test * Remove multi-word entries from numbers and ordinals	2021-08-10 13:55:08 +02:00
Paul O'Leary McCann	6029cfc391	Add scores to output in spancat (#8855 ) * Add scores to output in spancat This exposes the scores as an attribute on the SpanGroup. Includes a basic test. * Add basic doc note * Vectorize score calcs * Add "annotation format" section * Update website/docs/api/spancategorizer.md Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com> * Clean up doc section * Ran prettier on docs * Get arrays off the gpu before iterating over them * Remove int() calls Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>	2021-08-10 13:47:49 +02:00
Adriane Boyd	a79888ed67	Merge remote-tracking branch 'upstream/master' into chore/update-develop-from-master-v3.1-1	2021-08-09 13:13:13 +02:00
github-actions[bot]	56d4d87aeb	Auto-format code with black (#8895 ) Co-authored-by: explosion-bot <explosion-bot@users.noreply.github.com>	2021-08-06 13:38:06 +02:00
Adriane Boyd	fa2e7a4bbf	Fix spancat tests on GPU (#8872 ) * Fix spancat tests on GPU * Fix more spancat tests	2021-08-04 14:29:43 +02:00
Adriane Boyd	941a591f3c	Pass excludes when serializing vocab (#8824 ) * Pass excludes when serializing vocab Additional minor bug fix: * Deserialize vocab in `EntityLinker.from_disk` * Add test for excluding strings on load * Fix formatting	2021-08-03 14:42:44 +02:00
Adriane Boyd	175847f92c	Support list values and INTERSECTS in Matcher (#8784 ) * Support list values and IS_INTERSECT in Matcher * Support list values as token attributes for set operators, not just as pattern values. * Add `IS_INTERSECT` operator. * Fix incorrect `ISSUBSET` and `ISSUPERSET` in schema and docs. * Rename IS_INTERSECT to INTERSECTS	2021-08-02 19:39:26 +02:00
Adriane Boyd	fbbbda1954	Fix start/end chars for empty and out-of-bounds spans (#8816 )	2021-08-02 19:07:19 +02:00
Adriane Boyd	81d3a1edb1	Use tokenizer URL_MATCH pattern in LIKE_URL (#8765 )	2021-07-27 12:07:01 +02:00
Sofie Van Landeghem	83e27d262e	negative tag annotation (#8731 ) * unit test to unlearn tag via negative annotation * bump thinc to 8.0.8	2021-07-19 14:39:11 +02:00
Ines Montani	f90482d077	Tidy up and auto-format	2021-07-18 15:44:56 +10:00
Ines Montani	15e6578f7d	Adjust formatting	2021-07-17 10:49:13 +10:00
explosion-bot	eff3d1088b	Auto-format code with black	2021-07-16 08:03:36 +00:00
Adriane Boyd	ac45c7c045	Add pre-commit to ignored requirements (#8728 )	2021-07-15 16:41:15 +02:00
jmyerston	993b0fab0e	Added ancient Greek language support (#8606 ) * Add ancient Greek language support Initial commit * Contributor Agreement * grc tokenizer test added and files formatted with black, unnecessary import removed Co-Authored-By: Sofie Van Landeghem <svlandeg@users.noreply.github.com> * Commas in lists fixed. __init__py added to test * Update lex_attrs.py * Update stop_words.py * Update stop_words.py Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>	2021-07-15 10:27:17 +02:00
Sofie Van Landeghem	77859beb99	spacy.ngram_range_suggester.v1 (#8699 )	2021-07-15 10:01:22 +02:00
Julien Rossi	e117573822	Adding noun_chunks to the DUTCH language model (nl) (#8529 ) * ✨ implement noun_chunks for dutch language * copy/paste FR and SV syntax iterators to accommodate UD tags * added tests with dutch text * signed contributor agreement * 🐛 fix noun chunks generator * built from scratch * define noun chunk as a single Noun-Phrase * includes some corner cases debugging (incorrect POS tagging) * test with provided annotated sample (POS, DEP) * ✅ fix failing test * CI pipeline did not like the added sample file * add the sample as a pytest fixture * Update spacy/lang/nl/syntax_iterators.py * Update spacy/lang/nl/syntax_iterators.py Code readability Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> * Update spacy/tests/lang/nl/test_noun_chunks.py correct comment Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> * finalize code * change "if next_word" into "if next_word is not None" Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>	2021-07-14 14:01:02 +02:00
Adriane Boyd	b8e720fdb9	Fix Azerbaijani init, extend lang init tests (#8656 ) * Extend langs in initialize tests * Fix az init	2021-07-09 15:36:35 +02:00
Sofie Van Landeghem	64fac754fe	add spacy prefix to ngram_suggester.v1 (#8623 )	2021-07-07 08:09:30 +02:00
Sofie Van Landeghem	733e8ceea9	fix spancat initialize with labels (#8620 )	2021-07-06 19:08:25 +02:00
Sofie Van Landeghem	3daf57d70c	Small spancat fixes (#8614 ) * two small fixes + additional tests * rename	2021-07-06 14:15:41 +02:00
Adriane Boyd	29906884c5	Raise an error for textcat with <2 labels (#8584 ) * Raise an error for textcat with <2 labels Raise an error if initializing a `textcat` component without at least two labels. * Add similar note to docs * Update positive_label description in API docs	2021-07-06 12:35:22 +02:00
explosion-bot	ee37288a1f	Auto-format code with black	2021-07-02 07:48:26 +00:00
Ines Montani	af9d984407	Merge pull request #8405 from svlandeg/fix/whitespace_tokenizer [ci skip]	2021-06-30 20:52:59 +10:00
Ines Montani	7f65902702	Merge pull request #8522 from adrianeboyd/chore/update-flake8 Update flake8 version in reqs and CI	2021-06-28 21:46:06 +10:00
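
Note on commit 03fefa37e2 (#9050): the four `overwrite`/`extend` combinations described for the `morphologizer` can be read as a merge rule over feature dictionaries. The following is a minimal sketch of that rule in plain Python, not spaCy's implementation; the function name `merge_features` and the use of plain dicts for morphological features are assumptions made for illustration only.

```python
from typing import Dict, Optional


def merge_features(
    existing: Optional[Dict[str, str]],
    predicted: Dict[str, str],
    *,
    overwrite: bool,
    extend: bool,
) -> Dict[str, str]:
    """Hypothetical helper mirroring the four cases in the commit message."""
    # "In all cases an unset value will be set": nothing to preserve.
    if not existing:
        return dict(predicted)
    if overwrite and extend:
        # Overwrite values of existing features, add any new features.
        return {**existing, **predicted}
    if overwrite:
        # Overwrite completely, removing any existing features.
        return dict(predicted)
    if extend:
        # Keep values of existing features, add any new features.
        return {**predicted, **existing}
    # Do not modify the existing value if set.
    return dict(existing)


if __name__ == "__main__":
    tokenizer_feats = {"Case": "Nom", "Number": "Sing"}
    model_feats = {"Number": "Plur", "Gender": "Fem"}
    for ow in (True, False):
        for ex in (True, False):
            merged = merge_features(
                tokenizer_feats, model_feats, overwrite=ow, extend=ex
            )
            print(f"overwrite={ow} extend={ex}: {merged}")
```

Under this reading, the `extend=True` default mentioned in commit 1ee6541ab0 keeps tokenizer-set features that the model does not predict, which matches the stated goal of not clobbering the tokenizer's values.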


2294 Commits
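
Note on commit d60b748e3c (#9074): the message describes failing with an error when an asset `path` points outside the temporary directory the git clone goes into. The check can be sketched with `pathlib` as below. This is an illustrative sketch under assumptions, not the code from the commit; the function name `resolve_asset_path` and the error wording are hypothetical.

```python
from pathlib import Path


def resolve_asset_path(clone_dir: Path, path: str) -> Path:
    """Hypothetical helper: resolve `path` inside `clone_dir` or raise."""
    root = clone_dir.resolve()
    # Joining with an absolute path such as "/" discards `root` entirely,
    # which is how `path: "/"` can end up pointing at the whole filesystem.
    candidate = (root / path).resolve()
    try:
        candidate.relative_to(root)
    except ValueError:
        raise ValueError(
            f"Asset path {path!r} points outside the clone directory {root}"
        ) from None
    return candidate


if __name__ == "__main__":
    clone = Path("/tmp/example-clone")
    print(resolve_asset_path(clone, ""))  # the repository root itself
    for bad in ("/", "../elsewhere"):
        try:
            resolve_asset_path(clone, bad)
        except ValueError as err:
            print("rejected:", err)
```

An empty string resolves to the clone root, which corresponds to the `""` value the commit message names as the way to request the entire repository.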