spaCy

mirror of https://github.com/explosion/spaCy.git synced 2024-11-10 19:57:17 +03:00

Author	SHA1	Message	Date
Adriane Boyd	538304948e	Remove profile=True from currently profiled cython	2023-09-28 17:09:41 +02:00
Basile Dura	b0228d8ea6	ci: add cython linter (#12694 ) * chore: add cython-linter dev dependency * fix: lexeme.pyx * fix: morphology.pxd * fix: tokenizer.pxd * fix: vocab.pxd * fix: morphology.pxd (line length) * ci: add cython-lint * ci: fix cython-lint call * Fix kb/candidate.pyx. * Fix kb/kb.pyx. * Fix kb/kb_in_memory.pyx. * Fix kb. * Fix training/ partially. * Fix training/. Ignore trailing whitespaces and too long lines. * Fix ml/. * Fix matcher/. * Fix pipeline/. * Fix tokens/. * Fix build errors. Fix vocab.pyx. * Fix cython-lint install and run. * Fix lexeme.pyx, parts_of_speech.pxd, vectors.pyx. Temporarily disable cython-lint execution. * Fix attrs.pyx, lexeme.pyx, symbols.pxd, isort issues. * Make cython-lint install conditional. Fix tokenizer.pyx. * Fix remaining files. Reenable cython-lint check. * Readded parentheses. * Fix test_build_dependencies(). * Add explanatory comment to cython-lint execution. --------- Co-authored-by: Raphael Mitsch <r.mitsch@outlook.com>	2023-07-19 12:03:31 +02:00
Daniël de Kok	e2b70df012	Configure isort to use the Black profile, recursively isort the `spacy` module (#12721 ) * Use isort with Black profile * isort all the things * Fix import cycles as a result of import sorting * Add DOCBIN_ALL_ATTRS type definition * Add isort to requirements * Remove isort from build dependencies check * Typo	2023-06-14 17:48:41 +02:00
Vinit Ravishankar	28de85737f	Tagger label smoothing (#12293 ) * add label smoothing * use True/False instead of floats * add entropy to debug data * formatting * docs * change test to check difference in distributions * Update website/docs/api/tagger.mdx Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com> * Update spacy/pipeline/tagger.pyx Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com> * bool -> float * update docs * fix seed * black * update tests to use label_smoothing = 0.0 * set default to 0.0, update quickstart * Update spacy/pipeline/tagger.pyx Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com> * update morphologizer, tagger test * fix morph docs * add url to docs --------- Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>	2023-03-22 12:17:56 +01:00
Daniël de Kok	e5debc68e4	Tagger: use unnormalized probabilities for inference (#10197 ) * Tagger: use unnormalized probabilities for inference Using unnormalized softmax avoids use of the relatively expensive exp function, which can significantly speed up non-transformer models (e.g. I got a speedup of 27% on a German tagging + parsing pipeline). * Add spacy.Tagger.v2 with configurable normalization Normalization of probabilities is disabled by default to improve performance. * Update documentation, models, and tests to spacy.Tagger.v2 * Move Tagger.v1 to spacy-legacy * docs/architectures: run prettier * Unnormalized softmax is now a Softmax_v2 option * Require thinc 8.0.14 and spacy-legacy 3.0.9	2022-03-15 14:15:31 +01:00
Daniël de Kok	72f7f4e68a	morphologizer: avoid recreating label tuple for each token (#9764 ) * morphologizer: avoid recreating label tuple for each token The `labels` property converts the dictionary key set to a tuple. This property was used for every annotated token, recreating the tuple over and over again. Construct the tuple once in the set_annotations function and reuse it. On a Finnish pipeline that I was experimenting with, this results in a speedup of ~15% (~13000 -> ~15000 WPS). * tagger: avoid recreating label tuple for each token	2021-11-30 11:58:59 +01:00
Adriane Boyd	03fefa37e2	Add overwrite settings for more components (#9050 ) * Add overwrite settings for more components For pipeline components where it's relevant and not already implemented, add an explicit `overwrite` setting that controls whether `set_annotations` overwrites existing annotation. For the `morphologizer`, add an additional setting `extend`, which controls whether the existing features are preserved. * +overwrite, +extend: overwrite values of existing features, add any new features * +overwrite, -extend: overwrite completely, removing any existing features * -overwrite, +extend: keep values of existing features, add any new features * -overwrite, -extend: do not modify the existing value if set In all cases an unset value will be set by `set_annotations`. Preserve current overwrite defaults: * True: morphologizer, entity linker * False: tagger, sentencizer, senter * Add backwards compat overwrite settings * Put empty line back Removed by accident in last commit * Set backwards-compatible defaults in __init__ Because the `TrainablePipe` serialization methods update `cfg`, there's no straightforward way to detect whether models serialized with a previous version are missing the overwrite settings. It would be possible in the sentencizer due to its separate serialization methods, however to keep the changes parallel, this also sets the default in `__init__`. * Remove traces Co-authored-by: Paul O'Leary McCann <polm@dampfkraft.com>	2021-09-30 15:35:55 +02:00
Adriane Boyd	b278f31ee6	Document scorers in registry and components from #8766 (#8929 ) * Document scorers in registry and components from #8766 * Update spacy/pipeline/lemmatizer.py Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> * Update website/docs/api/dependencyparser.md Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> * Reformat Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>	2021-08-12 12:50:03 +02:00
Adriane Boyd	f99d6d5e39	Refactor scoring methods to use registered functions (#8766 ) * Add scorer option to components Add an optional `scorer` parameter to all pipeline components. If a scoring function is provided, it overrides the default scoring method for that component. * Add registered scorers for all components * Add `scorers` registry * Move all scoring methods outside of components as independent functions and register * Use the registered scoring methods as defaults in configs and inits Additional: * The scoring methods no longer have access to the full component, so use settings from `cfg` as default scorer options to handle settings such as `labels`, `threshold`, and `positive_label` * The `attribute_ruler` scoring method no longer has access to the patterns, so all scoring methods are called * Bug fix: `spancat` scoring method is updated to set `allow_overlap` to score overlapping spans correctly * Update Russian lemmatizer to use direct score method * Check type of cfg in Pipe.score * Fix check * Update spacy/pipeline/sentencizer.pyx Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> * Remove validate_examples from scoring functions * Use Pipe.labels instead of Pipe.cfg["labels"] Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>	2021-08-10 15:13:39 +02:00
Adriane Boyd	d2bdaa7823	Replace negative rows with 0 in StaticVectors (#7674 ) * Replace negative rows with 0 in StaticVectors Replace negative row indices with 0-vectors in `StaticVectors`. * Increase versions related to StaticVectors * Increase versions of all architctures and layers related to `StaticVectors` * Improve efficiency of 0-vector operations Parallel `spacy-legacy` PR: https://github.com/explosion/spacy-legacy/pull/5 * Update config defaults to new versions * Update docs	2021-04-22 18:04:15 +10:00
Adriane Boyd	39153ef90f	Update lexeme_norm checks * Add util method for check * Add new languages to list with lexeme norm tables * Add check to all relevant components * Add config details to warning message Note that we're not actually inspecting the model config to see if `NORM` is used as an attribute, so it may warn in cases where it's not relevant.	2021-03-19 10:59:27 +01:00
Ines Montani	d0c3775712	Replace links to nightly docs [ci skip]	2021-01-30 20:09:38 +11:00
Adriane Boyd	9328dd5625	Handle unset token.morph in Morphologizer (#6704 ) * Handle unset token.morph in Morphologizer Handle unset `token.morph` in `Morphologizer.initialize` and `Morphologizer.get_loss`. If both `token.morph` and `token.pos` are unset, treat the annotation as missing rather than empty. * Add token.has_morph()	2021-01-15 17:20:10 +01:00
Sofie Van Landeghem	75d9019343	Fix types of Tok2Vec encoding architectures (#6442 ) * fix TorchBiLSTMEncoder documentation * ensure the types of the encoding Tok2vec layers are correct * update references from v1 to v2 for the new architectures	2021-01-07 16:39:27 +11:00
svlandeg	73fc1ed963	remove labels from morphologizer constructor	2020-11-11 21:48:50 +01:00
Adriane Boyd	a7e7d6c6c9	Ignore misaligned in Morphologizer.get_loss (#6363 ) Fix bug where `Morphologizer.get_loss` treated misaligned annotation as `EMPTY_MORPH` rather than ignoring it. Remove unneeded default `EMPTY_MORPH` mappings.	2020-11-10 20:15:09 +08:00
Adriane Boyd	a4b32b9552	Handle missing reference values in scorer (#6286 ) * Handle missing reference values in scorer Handle missing values in reference doc during scoring where it is possible to detect an unset state for the attribute. If no reference docs contain annotation, `None` is returned instead of a score. `spacy evaluate` displays `-` for missing scores and the missing scores are saved as `None`/`null` in the metrics. Attributes without unset states: * `token.head`: relies on `token.dep` to recognize unset values * `doc.cats`: unable to handle missing annotation Additional changes: * add optional `has_annotation` check to `score_scans` to replace `doc.sents` hack * update `score_token_attr_per_feat` to handle missing and empty morph representations * fix bug in `Doc.has_annotation` for normalization of `IS_SENT_START` vs. `SENT_START` * Fix import * Update return types	2020-11-03 15:47:18 +01:00
Ines Montani	bfa3931c9d	Revert added_strings change (#6236 )	2020-10-10 18:55:07 +02:00
Adriane Boyd	39aabf50ab	Also rename to include_static_vectors in CharEmbed	2020-10-09 11:54:48 +02:00
Sofie Van Landeghem	d093d6343b	TrainablePipe (#6213 ) * rename Pipe to TrainablePipe * split functionality between Pipe and TrainablePipe * remove unnecessary methods from certain components * cleanup * hasattr(component, "pipe") should be sufficient again * remove serialization and vocab/cfg from Pipe * unify _ensure_examples and validate_examples * small fixes * hasattr checks for self.cfg and self.vocab * make is_resizable and is_trainable properties * serialize strings.json instead of vocab * fix KB IO + tests * fix typos * more typos * _added_strings as a set * few more tests specifically for _added_strings field * bump to 3.0.0a36	2020-10-08 21:33:49 +02:00
Sofie Van Landeghem	f4f49f5877	update blis (#6198 ) * allow higher blis version * fix typo * bump to 3.0.0a34 * fix pins in other files	2020-10-05 14:58:56 +02:00
Ines Montani	bcd52e5486	Tidy up errors and warnings	2020-10-04 11:16:31 +02:00
Matthew Honnibal	3b2a78720c	Upd morphologizer	2020-10-03 19:35:19 +02:00
Matthew Honnibal	8ea8b7d940	Support loading labels in morphologizer	2020-10-03 19:13:42 +02:00
Adriane Boyd	86c3ec9c2b	Refactor Token morph setting (#6175 ) * Refactor Token morph setting * Remove `Token.morph_` * Add `Token.set_morph()` * `0` resets `token.c.morph` to unset * Any other values are passed to `Morphology.add` * Add token.morph setter to set from MorphAnalysis	2020-10-01 22:21:46 +02:00
Matthew Honnibal	0b5c72fce2	Fix incorrect docstrings	2020-09-29 18:30:38 +02:00
Matthew Honnibal	58c8d4b414	Add label_data property to pipeline	2020-09-29 16:22:13 +02:00
Ines Montani	f171903139	Clean up sgd and pipeline -> nlp	2020-09-29 12:20:26 +02:00
Matthew Honnibal	b3b6868639	Remove 'sgd' arg from component initialize	2020-09-29 11:42:35 +02:00
Ines Montani	ff9a63bfbd	begin_training -> initialize	2020-09-28 21:35:09 +02:00
Ines Montani	ae51f580c1	Fix handling of score_weights	2020-09-24 10:27:33 +02:00
Adriane Boyd	7e4cd7575c	Refactor Docs.is_ flags (#6044 ) * Refactor Docs.is_ flags * Add derived `Doc.has_annotation` method * `Doc.has_annotation(attr)` returns `True` for partial annotation * `Doc.has_annotation(attr, require_complete=True)` returns `True` for complete annotation * Add deprecation warnings to `is_tagged`, `is_parsed`, `is_sentenced` and `is_nered` * Add `Doc._get_array_attrs()`, which returns a full list of `Doc` attrs for use with `Doc.to_array`, `Doc.to_bytes` and `Doc.from_docs`. The list is the `DocBin` attributes list plus `SPACY` and `LENGTH`. Notes on `Doc.has_annotation`: * `HEAD` is converted to `DEP` because heads don't have an unset state * Accept `IS_SENT_START` as a synonym of `SENT_START` Additional changes: * Add `NORM`, `ENT_ID` and `SENT_START` to default attributes for `DocBin` * In `Doc.from_array()` the presence of `DEP` causes `HEAD` to override `SENT_START` * In `Doc.from_array()` using `attrs` other than `Doc._get_array_attrs()` (i.e., a user's custom list rather than our default internal list) with both `HEAD` and `SENT_START` shows a warning that `HEAD` will override `SENT_START` * `set_children_from_heads` does not require dependency labels to set sentence boundaries and sets `sent_start` for all non-sentence starts to `-1` * Fix call to set_children_form_heads Co-authored-by: Matthew Honnibal <honnibal+gh@gmail.com>	2020-09-17 00:14:01 +02:00
Adriane Boyd	f3db3f6fe0	Add vectors option to CharacterEmbed (#6069 ) * Add vectors option to CharacterEmbed * Update spacy/pipeline/morphologizer.pyx * Adjust default morphologizer config Co-authored-by: Matthew Honnibal <honnibal+gh@gmail.com>	2020-09-16 17:45:04 +02:00
Adriane Boyd	d722a439aa	Remove unneeded methods in senter and morphologizer (#6074 ) Now that the tagger doesn't manage the tag map, the child classes senter and morphologizer don't need to override the serialization methods.	2020-09-16 17:39:41 +02:00
Sofie Van Landeghem	8e7557656f	Renaming gold & annotation_setter (#6042 ) * version bump to 3.0.0a16 * rename "gold" folder to "training" * rename 'annotation_setter' to 'set_extra_annotations' * formatting	2020-09-09 10:31:03 +02:00
Sofie Van Landeghem	60f22e1800	Pipe API (#6034 ) * ensure Language passes on valid examples for initialization * fix tagger model initialization * check for valid get_examples across components * assume labels were added before begin_training * fix senter initialization * fix morphologizer initialization * use methods to check arguments * test textcat init, requires thinc>=8.0.0a31 * fix tok2vec init * fix entity linker init * use islice * fix simple NER * cleanup debug model * fix assert statements * fix tests * throw error when adding a label if the output layer can't be resized anymore * fix test * add failing test for simple_ner * UX improvements * morphologizer UX * assume begin_training gets a representative set and processes the labels * remove assumptions for output of untrained NER model * restore test for original purpose	2020-09-08 22:44:25 +02:00
Ines Montani	ab1bb421ed	Update docs links in codebase	2020-09-04 12:58:50 +02:00
Ines Montani	950832f087	Tidy up pipes (#5906 ) * Tidy up pipes * Fix init, defaults and raise custom errors * Update docs * Update docs [ci skip] * Apply suggestions from code review Co-authored-by: Matthew Honnibal <honnibal+gh@gmail.com> * Tidy up error handling and validation, fix consistency * Simplify get_examples check * Remove unused import [ci skip] Co-authored-by: Matthew Honnibal <honnibal+gh@gmail.com>	2020-08-11 23:29:31 +02:00
Ines Montani	d5c78c7a34	Update docs and fix consistency	2020-08-09 22:31:52 +02:00
Ines Montani	56c17973aa	Use "raise ... from" in custom errors for better tracebacks	2020-08-05 23:53:21 +02:00
Ines Montani	7a21775cd0	Merge pull request #5834 from explosion/feature/vectors	2020-07-29 18:49:26 +02:00
Ines Montani	b0f57a0cac	Update docs and consistency	2020-07-29 15:14:07 +02:00
Matthew Honnibal	c27309f839	Merge branch 'develop' into feature/vectors	2020-07-29 14:54:10 +02:00
Ines Montani	ff0bc05da8	Fix docstrings [ci skip]	2020-07-29 14:09:37 +02:00
Matthew Honnibal	6a6b09bd32	Update morphologizer model	2020-07-29 14:01:12 +02:00
Matthew Honnibal	1784c95827	Clean up link_vectors_to_models unused stuff	2020-07-29 14:01:11 +02:00
Ines Montani	ae4d8a6ffd	Update docstrings, docs and pipe consistency	2020-07-28 13:37:31 +02:00
Ines Montani	894e20c466	Merge branch 'develop' into feature/component-scores	2020-07-27 18:14:39 +02:00
Ines Montani	d8b519c23c	API docs, docstrings and argument consistency	2020-07-27 18:11:45 +02:00
Adriane Boyd	8bb0507777	Add and update score methods and score weights Add and update `score` methods, provided `scores`, and default weights `default_score_weights` for pipeline components. * `scores` provides all top-level keys returned by `score` (merely informative, similar to `assigns`). * `default_score_weights` provides the default weights for a default config. * The keys from `default_score_weights` determine which values will be shown in the `spacy train` output, so keys with weight `0.0` will be displayed but not counted toward the overall score.	2020-07-27 14:44:53 +02:00

1 2

74 Commits