spaCy

mirror of https://github.com/explosion/spaCy.git synced 2025-11-10 21:07:53 +03:00

Author	SHA1	Message	Date
Matthew Honnibal	5bebbf7550	Python 3.13 support (#13823 ) In order to support Python 3.13, we had to migrate to Cython 3.0. This caused some tricky interaction with our Pydantic usage, because Cython 3 uses the from __future__ import annotations semantics, which causes type annotations to be saved as strings. The end result is that we can't have Language.factory decorated functions in Cython modules anymore, as the Language.factory decorator expects to inspect the signature of the functions and build a Pydantic model. If the function is implemented in Cython, an error is raised because the type is not resolved. To address this I've moved the factory functions into a new module, spacy.pipeline.factories. I've added __getattr__ importlib hooks to the previous locations, in case anyone was importing these functions directly. The change should have no backwards compatibility implications. Along the way I've also refactored the registration of functions for the config. Previously these ran as import-time side-effects, using the registry decorator. I've created instead a new module spacy.registrations. When the registry is accessed it calls a function ensure_populated(), which cases the registrations to occur. I've made a similar change to the Language.factory registrations in the new spacy.pipeline.factories module. I want to remove these import-time side-effects so that we can speed up the loading time of the library, which can be especially painful on the CLI. I also find that I'm often working to track down the implementations of functions referenced by strings in the config. Having the registrations all happen in one place will make this easier. With these changes I've fortunately avoided the need to migrate to Pydantic v2 properly --- we're still using the v1 compatibility shim. We might not be able to hold out forever though: Pydantic (reasonably) aren't actively supporting the v1 shims. I put a lot of work into v2 migration when investigating the 3.13 support, and it's definitely challenging. In any case, it's a relief that we don't have to do the v2 migration at the same time as the Cython 3.0/Python 3.13 support.	2025-05-22 13:47:21 +02:00
Adriane Boyd	55614d6799	Add profile=False to currently unprofiled cython	2023-09-28 17:09:41 +02:00
Basile Dura	b0228d8ea6	ci: add cython linter (#12694 ) * chore: add cython-linter dev dependency * fix: lexeme.pyx * fix: morphology.pxd * fix: tokenizer.pxd * fix: vocab.pxd * fix: morphology.pxd (line length) * ci: add cython-lint * ci: fix cython-lint call * Fix kb/candidate.pyx. * Fix kb/kb.pyx. * Fix kb/kb_in_memory.pyx. * Fix kb. * Fix training/ partially. * Fix training/. Ignore trailing whitespaces and too long lines. * Fix ml/. * Fix matcher/. * Fix pipeline/. * Fix tokens/. * Fix build errors. Fix vocab.pyx. * Fix cython-lint install and run. * Fix lexeme.pyx, parts_of_speech.pxd, vectors.pyx. Temporarily disable cython-lint execution. * Fix attrs.pyx, lexeme.pyx, symbols.pxd, isort issues. * Make cython-lint install conditional. Fix tokenizer.pyx. * Fix remaining files. Reenable cython-lint check. * Readded parentheses. * Fix test_build_dependencies(). * Add explanatory comment to cython-lint execution. --------- Co-authored-by: Raphael Mitsch <r.mitsch@outlook.com>	2023-07-19 12:03:31 +02:00
Daniël de Kok	e2b70df012	Configure isort to use the Black profile, recursively isort the `spacy` module (#12721 ) * Use isort with Black profile * isort all the things * Fix import cycles as a result of import sorting * Add DOCBIN_ALL_ATTRS type definition * Add isort to requirements * Remove isort from build dependencies check * Typo	2023-06-14 17:48:41 +02:00
Daniël de Kok	a06cbae70d	precompute_hiddens/Parser: do not look up CPU ops (3.4) (#11069 ) * precompute_hiddens/Parser: do not look up CPU ops `get_ops("cpu")` is quite expensive. To avoid this, we want to cache the result as in #11068. However, for 3.x we do not want to change the ABI. So we avoid the expensive lookup by using NumpyOps. This should have a minimal impact, since `get_ops("cpu")` was only used when the model ops were `CupyOps`. If the ops are `AppleOps`, we are still passing through the correct BLAS implementation. * _NUMPY_OPS -> NUMPY_OPS	2022-07-05 10:53:42 +02:00
Daniël de Kok	85dd2b6c04	Parser: use C saxpy/sgemm provided by the Ops implementation (#10773 ) * Parser: use C saxpy/sgemm provided by the Ops implementation This is a backport of https://github.com/explosion/spaCy/pull/10747 from the parser refactor branch. It eliminates the explicit calls to BLIS, instead using the saxpy/sgemm provided by the Ops implementation. This allows us to use Accelerate in the parser on M1 Macs (with an updated thinc-apple-ops). Performance of the de_core_news_lg pipe: BLIS 0.7.0, no thinc-apple-ops: 6385 WPS BLIS 0.7.0, thinc-apple-ops: 36455 WPS BLIS 0.9.0, no thinc-apple-ops: 19188 WPS BLIS 0.9.0, thinc-apple-ops: 36682 WPS This PR, thinc-apple-ops: 38726 WPS Performance of the de_core_news_lg pipe (only tok2vec -> parser): BLIS 0.7.0, no thinc-apple-ops: 13907 WPS BLIS 0.7.0, thinc-apple-ops: 73172 WPS BLIS 0.9.0, no thinc-apple-ops: 41576 WPS BLIS 0.9.0, thinc-apple-ops: 72569 WPS This PR, thinc-apple-ops: 87061 WPS * Require thinc >=8.1.0,<8.2.0 * Lower thinc lowerbound to 8.1.0.dev0 * Use best CPU ops for CBLAS when the parser model is on the GPU * Fix another unguarded cblas() call * Fix: use ops as a shorthand for self.model.ops Co-authored-by: Madeesh Kannan <shadeMe@users.noreply.github.com> Co-authored-by: Madeesh Kannan <shadeMe@users.noreply.github.com>	2022-05-27 11:20:52 +02:00
Adriane Boyd	b278f31ee6	Document scorers in registry and components from #8766 (#8929 ) * Document scorers in registry and components from #8766 * Update spacy/pipeline/lemmatizer.py Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> * Update website/docs/api/dependencyparser.md Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> * Reformat Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>	2021-08-12 12:50:03 +02:00
Adriane Boyd	f99d6d5e39	Refactor scoring methods to use registered functions (#8766 ) * Add scorer option to components Add an optional `scorer` parameter to all pipeline components. If a scoring function is provided, it overrides the default scoring method for that component. * Add registered scorers for all components * Add `scorers` registry * Move all scoring methods outside of components as independent functions and register * Use the registered scoring methods as defaults in configs and inits Additional: * The scoring methods no longer have access to the full component, so use settings from `cfg` as default scorer options to handle settings such as `labels`, `threshold`, and `positive_label` * The `attribute_ruler` scoring method no longer has access to the patterns, so all scoring methods are called * Bug fix: `spancat` scoring method is updated to set `allow_overlap` to score overlapping spans correctly * Update Russian lemmatizer to use direct score method * Check type of cfg in Pipe.score * Fix check * Update spacy/pipeline/sentencizer.pyx Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> * Remove validate_examples from scoring functions * Use Pipe.labels instead of Pipe.cfg["labels"] Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>	2021-08-10 15:13:39 +02:00
Adriane Boyd	941a591f3c	Pass excludes when serializing vocab (#8824 ) * Pass excludes when serializing vocab Additional minor bug fix: * Deserialize vocab in `EntityLinker.from_disk` * Add test for excluding strings on load * Fix formatting	2021-08-03 14:42:44 +02:00
Matthew Honnibal	6f5e308d17	Support negative examples in partial NER annotations (#8106 ) * Support a cfg field in transition system * Make NER 'has gold' check use right alignment for span * Pass 'negative_samples_key' property into NER transition system * Add field for negative samples to NER transition system * Check neg_key in NER has_gold * Support negative examples in NER oracle * Test for negative examples in NER * Fix name of config variable in NER * Remove vestiges of old-style partial annotation * Remove obsolete tests * Add comment noting lack of support for negative samples in parser * Additions to "neg examples" PR (#8201) * add custom error and test for deprecated format * add test for unlearning an entity * add break also for Begin's cost * add negative_samples_key property on Parser * rename * extend docs & fix some older docs issues * add subclass constructors, clean up tests, fix docs * add flaky test with ValueError if gold parse was not found * remove ValueError if n_gold == 0 * fix docstring * Hack in environment variables to try out training * Remove hack * Remove NER hack, and support 'negative O' samples * Fix O oracle * Fix transition parser * Remove 'not O' from oracle * Fix NER oracle * check for spans in both gold.ents and gold.spans and raise if so, to prevent memory access violation * use set instead of list in consistency check Co-authored-by: svlandeg <sofie.vanlandeghem@gmail.com> Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>	2021-06-17 17:33:00 +10:00
Adriane Boyd	39153ef90f	Update lexeme_norm checks * Add util method for check * Add new languages to list with lexeme norm tables * Add check to all relevant components * Add config details to warning message Note that we're not actually inspecting the model config to see if `NORM` is used as an attribute, so it may warn in cases where it's not relevant.	2021-03-19 10:59:27 +01:00
Sofie Van Landeghem	837a4f53c2	Error handling in nlp.pipe (#6817 ) * add error handler for pipe methods * add unit tests * remove pipe method that are the same as their base class * have Language keep track of a default error handler * cleanup * formatting * small refactor * add documentation	2021-01-29 08:51:21 +08:00
Sofie Van Landeghem	6b68ad027b	Fix beam NER resizing (#6834 ) * move label check to sub methods * add tests	2021-01-27 23:39:14 +11:00
Matthew Honnibal	1d20e21f3e	Add labels implicitly for parser and ner	2021-01-27 12:54:47 +11:00
Matthew Honnibal	f049df1715	Revert "Set annotations in update" (#6810 ) * Revert "Set annotations in update (#6767)" This reverts commit `e680efc7cc`. * Fix version * Update spacy/pipeline/entity_linker.py * Update spacy/pipeline/entity_linker.py * Update spacy/pipeline/tagger.pyx * Update spacy/pipeline/tok2vec.py * Update spacy/pipeline/tok2vec.py * Update spacy/pipeline/transition_parser.pyx * Update spacy/pipeline/transition_parser.pyx * Update website/docs/api/multilabel_textcategorizer.md * Update website/docs/api/tok2vec.md * Update website/docs/usage/layers-architectures.md * Update website/docs/usage/layers-architectures.md * Update website/docs/api/transformer.md * Update website/docs/api/textcategorizer.md * Update website/docs/api/tagger.md * Update spacy/pipeline/entity_linker.py * Update website/docs/api/sentencerecognizer.md * Update website/docs/api/pipe.md * Update website/docs/api/morphologizer.md * Update website/docs/api/entityrecognizer.md * Update spacy/pipeline/entity_linker.py * Update spacy/pipeline/multitask.pyx * Update spacy/pipeline/tagger.pyx * Update spacy/pipeline/tagger.pyx * Update spacy/pipeline/textcat.py * Update spacy/pipeline/textcat.py * Update spacy/pipeline/textcat.py * Update spacy/pipeline/tok2vec.py * Update spacy/pipeline/trainable_pipe.pyx * Update spacy/pipeline/trainable_pipe.pyx * Update spacy/pipeline/transition_parser.pyx * Update spacy/pipeline/transition_parser.pyx * Update website/docs/api/entitylinker.md * Update website/docs/api/dependencyparser.md * Update spacy/pipeline/trainable_pipe.pyx	2021-01-25 22:18:45 +08:00
Sofie Van Landeghem	e680efc7cc	Set annotations in update (#6767 ) * bump to 3.0.0rc4 * do set_annotations in component update calls * update docs and remove set_annotations flag * fix EL test	2021-01-20 11:49:25 +11:00
Matthew Honnibal	8656a08777	Add beam_parser and beam_ner components for v3 (#6369 ) * Get basic beam tests working * Get basic beam tests working * Compile _beam_utils * Remove prints * Test beam density * Beam parser seems to train * Draft beam NER * Upd beam * Add hypothesis as dev dependency * Implement missing is-gold-parse method * Implement early update * Fix state hashing * Fix test * Fix test * Default to non-beam in parser constructor * Improve oracle for beam * Start refactoring beam * Update test * Refactor beam * Update nn * Refactor beam and weight by cost * Update ner beam settings * Update test * Add __init__.pxd * Upd test * Fix test * Upd test * Fix test * Remove ring buffer history from StateC * WIP change arc-eager transitions * Add state tests * Support ternary sent start values * Fix arc eager * Fix NER * Pass oracle cut size for beam * Fix ner test * Fix beam * Improve StateC.clone * Improve StateClass.borrow * Work directly with StateC, not StateClass * Remove print statements * Fix state copy * Improve state class * Refactor parser oracles * Fix arc eager oracle * Fix arc eager oracle * Use a vector to implement the stack * Refactor state data structure * Fix alignment of sent start * Add get_aligned_sent_starts method * Add test for ae oracle when bad sentence starts * Fix sentence segment handling * Avoid Reduce that inserts illegal sentence * Update preset SBD test * Fix test * Remove prints * Fix sent starts in Example * Improve python API of StateClass * Tweak comments and debug output of arc eager * Upd test * Fix state test * Fix state test	2020-12-13 09:08:32 +08:00
Ines Montani	bfa3931c9d	Revert added_strings change (#6236 )	2020-10-10 18:55:07 +02:00
Sofie Van Landeghem	d093d6343b	TrainablePipe (#6213 ) * rename Pipe to TrainablePipe * split functionality between Pipe and TrainablePipe * remove unnecessary methods from certain components * cleanup * hasattr(component, "pipe") should be sufficient again * remove serialization and vocab/cfg from Pipe * unify _ensure_examples and validate_examples * small fixes * hasattr checks for self.cfg and self.vocab * make is_resizable and is_trainable properties * serialize strings.json instead of vocab * fix KB IO + tests * fix typos * more typos * _added_strings as a set * few more tests specifically for _added_strings field * bump to 3.0.0a36	2020-10-08 21:33:49 +02:00
svlandeg	ff9ac39c88	read entity_ruler patterns with srsly.read_jsonl.v1	2020-10-05 22:50:14 +02:00
svlandeg	65abd77779	add finish_update to Pipe	2020-10-05 16:23:33 +02:00
Matthew Honnibal	0b5c72fce2	Fix incorrect docstrings	2020-09-29 18:30:38 +02:00
Matthew Honnibal	1fd002180e	Allow more components to use labels	2020-09-29 16:48:56 +02:00
Matthew Honnibal	58c8d4b414	Add label_data property to pipeline	2020-09-29 16:22:13 +02:00
Ines Montani	f171903139	Clean up sgd and pipeline -> nlp	2020-09-29 12:20:26 +02:00
Matthew Honnibal	9c8b2524fe	Upd initialize args	2020-09-29 12:08:37 +02:00
Matthew Honnibal	f2d1b7feb5	Clean up sgd	2020-09-29 12:00:08 +02:00
Ines Montani	dec984a9c1	Update Language.initialize and support components/tokenizer settings	2020-09-29 11:52:45 +02:00
Matthew Honnibal	b3b6868639	Remove 'sgd' arg from component initialize	2020-09-29 11:42:35 +02:00
Ines Montani	ff9a63bfbd	begin_training -> initialize	2020-09-28 21:35:09 +02:00
Sofie Van Landeghem	8e7557656f	Renaming gold & annotation_setter (#6042 ) * version bump to 3.0.0a16 * rename "gold" folder to "training" * rename 'annotation_setter' to 'set_extra_annotations' * formatting	2020-09-09 10:31:03 +02:00
Sofie Van Landeghem	60f22e1800	Pipe API (#6034 ) * ensure Language passes on valid examples for initialization * fix tagger model initialization * check for valid get_examples across components * assume labels were added before begin_training * fix senter initialization * fix morphologizer initialization * use methods to check arguments * test textcat init, requires thinc>=8.0.0a31 * fix tok2vec init * fix entity linker init * use islice * fix simple NER * cleanup debug model * fix assert statements * fix tests * throw error when adding a label if the output layer can't be resized anymore * fix test * add failing test for simple_ner * UX improvements * morphologizer UX * assume begin_training gets a representative set and processes the labels * remove assumptions for output of untrained NER model * restore test for original purpose	2020-09-08 22:44:25 +02:00
Matthew Honnibal	737a1408d9	Improve implementation of fix #6010 Follow-ups to the parser efficiency fix. * Avoid introducing new counter for number of pushes * Base cut on number of transitions, keeping it more even * Reintroduce the randomization we had in v2.	2020-09-02 14:42:32 +02:00
Matthew Honnibal	c1bf3a5602	Fix significant performance bug in parser training (#6010 ) The parser training makes use of a trick for long documents, where we use the oracle to cut up the document into sections, so that we can have batch items in the middle of a document. For instance, if we have one document of 600 words, we might make 6 states, starting at words 0, 100, 200, 300, 400 and 500. The problem is for v3, I screwed this up and didn't stop parsing! So instead of a batch of [100, 100, 100, 100, 100, 100], we'd have a batch of [600, 500, 400, 300, 200, 100]. Oops. The implementation here could probably be improved, it's annoying to have this extra variable in the state. But this'll do. This makes the v3 parser training 5-10 times faster, depending on document lengths. This problem wasn't in v2.	2020-09-02 12:57:13 +02:00
Ines Montani	8128e5eb35	Replace lexeme_norm warning with logging	2020-08-14 15:00:52 +02:00
Ines Montani	950832f087	Tidy up pipes (#5906 ) * Tidy up pipes * Fix init, defaults and raise custom errors * Update docs * Update docs [ci skip] * Apply suggestions from code review Co-authored-by: Matthew Honnibal <honnibal+gh@gmail.com> * Tidy up error handling and validation, fix consistency * Simplify get_examples check * Remove unused import [ci skip] Co-authored-by: Matthew Honnibal <honnibal+gh@gmail.com>	2020-08-11 23:29:31 +02:00
Ines Montani	56c17973aa	Use "raise ... from" in custom errors for better tracebacks	2020-08-05 23:53:21 +02:00
Sofie Van Landeghem	ca491722ad	The Parser is now a Pipe (2) (#5844 ) * moving syntax folder to _parser_internals * moving nn_parser and transition_system * move nn_parser and transition_system out of internals folder * moving nn_parser code into transition_system file * rename transition_system to transition_parser * moving parser_model and _state to ml * move _state back to internals * The Parser now inherits from Pipe! * small code fixes * removing unnecessary imports * remove link_vectors_to_models * transition_system to internals folder * little bit more cleanup * newlines	2020-07-30 23:30:54 +02:00

38 Commits