spaCy

mirror of https://github.com/explosion/spaCy.git synced 2024-12-29 11:26:28 +03:00

Author	SHA1	Message	Date
Adriane Boyd	378db0eb1e	Temporarily skip tests that require models/compat	2022-11-25 12:05:25 +01:00
Raphael Mitsch	c0fd8a2e71	find-threshold: CLI command for multi-label classifier threshold tuning (#11280 ) * Add foundation for find-threshold CLI functionality. * Finish first draft for find-threshold. * Add tests. * Revert adjusted import statements. * Fix mypy errors. * Fix imports. * Harmonize arguments with spacy evaluate command. * Generalize component and threshold handling. Harmonize arguments with 'spacy evaluate' CLI. * Fix Spancat test. * Add beta parameter to Scorer and PRFScore. * Make beta a component scorer setting. * Remove beta. * Update nlp.config (workaround). * Reload pipeline on threshold change. Adjust tests. Remove confection reference. * Remove assumption of component being a Pipe object or having a .cfg attribute. * Adjust test output and reference values. * Remove beta references. Delete universe.json. * Reverting unnecessary changes. Removing unused default values. Renaming variables in find-cli tests. * Update spacy/cli/find_threshold.py Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com> * Remove adding labels in tests. * Remove unused error * Undo changes to PRFScorer * Change default value for n_trials. Log table iteratively. * Add warnings for pointless applications of find_threshold(). * Fix imports. * Adjust type check of TextCategorizer to exclude subclasses. * Change check of if there's only one unique value in scores. * Update spacy/cli/find_threshold.py Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> * Incorporate feedback. * Fix test issue. Update docstring. * Update docs & docstring. * Update spacy/tests/test_cli.py Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com> * Add examples to docs. Rename _nlp to nlp in tests. * Update spacy/cli/find_threshold.py Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> * Update spacy/cli/find_threshold.py Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com> Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>	2022-11-25 11:44:55 +01:00
Adriane Boyd	30d31fd335	Update Russian and Ukrainian lemmatizers (#11811 ) * pymorphy2 issues #11620, #11626, #11625: - #11620: pymorphy2_lookup - #11626: handle multiple forms pointing to the same normal form + handling empty POS tag - #11625: matching DET that are labelled as PRON by pymorphy2 * Move lemmatizer algorithm changes back into RussianLemmatizer * Fix uk pymorphy3_lookup mode init * Move and update tests for ru/uk lookup lemmatizer modes * Fix typo * Remove traces of previous behavior for uninflected POS * Refactor to private generic-looking pymorphy methods * Remove xfailed uk lemmatizer cases * Update spacy/lang/ru/lemmatizer.py Co-authored-by: Richard Hudson <richard@explosion.ai> Co-authored-by: Dmytro S Lituiev <d.lituiev@gmail.com> Co-authored-by: Richard Hudson <richard@explosion.ai>	2022-11-25 11:12:46 +01:00
Adriane Boyd	8f062b849c	Fix Matcher cython profile=True header (#11867 )	2022-11-24 16:03:42 +01:00
Madeesh Kannan	5ea14af32b	Add `training.before_update` callback (#11739 ) * Add `training.before_update` callback This callback can be used to implement training paradigms like gradual (un)freezing of components (e.g: the Transformer) after a certain number of training steps to mitigate catastrophic forgetting during fine-tuning. * Fix type annotation, default config value * Generalize arguments passed to the callback * Update schema * Pass `epoch` to callback, rename `current_step` to `step` * Add test * Simplify test * Replace config string with `spacy.blank` * Apply suggestions from code review Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com> * Cleanup imports Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>	2022-11-23 17:54:58 +01:00
Edward	e79910d57e	Remove sentiment extension (#11722 ) * remove sentiment attribute * remove sentiment from docs * add test for backwards compatibility * replace from_disk with from_bytes * Fix docs and format file * Fix formatting	2022-11-23 13:09:32 +01:00
Paul O'Leary McCann	f1ddac187d	Remove unused error object (#11837 )	2022-11-23 10:51:31 +01:00
Marco Edward Gorelli	f0d8309a28	fix comparison of constants (#11834 ) Co-authored-by: MarcoGorelli <>	2022-11-21 08:12:03 +01:00
github-actions[bot]	89bfd06fbd	Auto-format code with black (#11826 ) Co-authored-by: explosion-bot <explosion-bot@users.noreply.github.com>	2022-11-18 18:24:13 +09:00
Adriane Boyd	a83463c5e0	Add transformer recommendation for ca (#11819 ) Model recommendation from @cayorodriguez.	2022-11-18 08:15:27 +01:00
Paul O'Leary McCann	75bb7ad541	Check textcat values for validity (#11763 ) * Check textcat values for validity * Fix error numbers * Clean up vals reference * Check category value validity through training The _validate_categories is called in update, which for multilabel is inherited from the single label component. * Formatting	2022-11-17 10:25:01 +01:00
Paul O'Leary McCann	c0c54e44bc	Add equality definition for vectors (#11806 ) * Add equality definition for vectors This re-uses the check from sourcing components. * Use the equality check * Format Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>	2022-11-16 09:44:42 +01:00
Sofie Van Landeghem	caa9efad59	prevent rewriting an already raw URL (#11810 )	2022-11-15 14:15:00 +01:00
Denis Bezykornov	7e684ad691	Update russian tokenizer exceptions (#11753 ) * Fix typos, add couple of new abbreviations, remove nonbreaking spaces * Remove space from abbreviation Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>	2022-11-15 11:37:25 +01:00
github-actions[bot]	188a7d00eb	Auto-format code with black (#11792 ) Co-authored-by: explosion-bot <explosion-bot@users.noreply.github.com>	2022-11-11 09:58:31 +01:00
Adriane Boyd	03eebe9d1c	Update warning, add tests for project requirements check (#11777 ) * Update warning, add tests for project requirements check * Make warning more general for differences between PEP 508 and pip * Add tests for _check_requirements * Parameterize test	2022-11-09 10:59:28 +01:00
Raphael Mitsch	20bbbe3e44	Revert disable/disabled merging behavior (#11745 ) * Merge disable with disabled. Adjust warnings, errors and tests. * Replace any() with set operation. * Update spacy/tests/pipeline/test_pipe_methods.py Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com> * Update docs. * Remove reference to config entry nlp.enabled from docs. Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>	2022-11-08 14:58:10 +01:00
Adriane Boyd	e116395f89	Add fallback in requirements check, only check once (#11735 ) * Add fallback in requirements check, only check once * Rename to skip_requirements_check * Update spacy/cli/project/run.py Co-authored-by: Paul O'Leary McCann <polm@dampfkraft.com> Co-authored-by: Paul O'Leary McCann <polm@dampfkraft.com>	2022-11-07 14:46:08 +01:00
Adriane Boyd	e91b47a226	Check for unsafe paths in tarfile.extractall (CVE-2007-4559) (#11746 ) * Adding tarfile member sanitization to extractall() * Format * Simplify and add error message * Fix import * Add comment about CVE Co-authored-by: TrellixVulnTeam <charles.mcfarland@trellix.com>	2022-11-07 10:43:34 +01:00
Adriane Boyd	ea326cf47d	Fix types for Span.id and Span.id_ (#11744 )	2022-11-07 08:11:13 +01:00
github-actions[bot]	bbf64cfc43	Auto-format code with black (#11749 ) Co-authored-by: explosion-bot <explosion-bot@users.noreply.github.com>	2022-11-04 11:17:43 +01:00
Adriane Boyd	40e1000db0	Restore Doc attr getter values in Doc.to_json (#11700 )	2022-11-03 11:49:08 +01:00
Paul O'Leary McCann	db56600536	Fix default parameters for load functions (fix #11706 ) (#11713 ) * Fix default parameters for load functions Some load functions used SimpleFrozenList() directly instead of the _DEFAULT_EMPTY_PIPES parameter. That mostly worked as intended, but the changes in #11459 check for equality using identity, not value, so a warning is incorrectly raised sometimes, as in #11706. This change just has all the load functions use the singleton value instead. * Add test that there are no warnings on module-based load This will succeed due to changes in this branch, but local tests with the latest release failed as intended. * Try reverting commit and see if CI changes There is an error in CI that is probably unrelated. Revert "Fix default parameters for load functions" This reverts commit `dc46b35687`. * Revert "Try reverting commit and see if CI changes" This reverts commit `2514ed07ef`. Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>	2022-11-03 10:52:59 +01:00
Adriane Boyd	68b8fa2df2	Merge remote-tracking branch 'upstream/master' into chore/update-v4-from-master-4	2022-11-03 09:42:36 +01:00
Adriane Boyd	420b1d854b	Update textcat scorer threshold behavior (#11696 ) * Update textcat scorer threshold behavior For `textcat` (with exclusive classes) the scorer should always use a threshold of 0.0 because there should be one predicted label per doc and the numeric score for that particular label should not matter. * Rename to test_textcat_multilabel_threshold * Remove all uses of threshold for multi_label=False * Update Scorer.score_cats API docs * Add tests for score_cats with thresholds * Update textcat API docs * Fix types * Convert threshold back to float * Fix threshold type in docstring * Improve formatting in Scorer API docs	2022-11-02 15:35:04 +01:00
Paul O'Leary McCann	d61e742960	Handle Docs with no entities in EntityLinker (#11640 ) * Handle docs with no entities If a whole batch contains no entities it won't make it to the model, but it's possible for individual Docs to have no entities. Before this commit, those Docs would cause an error when attempting to concatenate arrays because the dimensions didn't match. It turns out the process of preparing the Ragged at the end of the span maker forward was a little different from list2ragged, which just uses the flatten function directly. Letting list2ragged do the conversion avoids the dimension issue. This did not come up before because in NEL demo projects it's typical for data with no entities to be discarded before it reaches the NEL component. This includes a simple direct test that shows the issue and checks it's resolved. It doesn't check if there are any downstream changes, so a more complete test could be added. A full run was tested by adding an example with no entities to the Emerson sample project. * Add a blank instance to default training data in tests Rather than adding a specific test, since not failing on instances with no entities is basic functionality, it makes sense to add it to the default set. * Fix without modifying architecture If the architecture is modified this would have to be a new version, but this change isn't big enough to merit that.	2022-10-28 10:25:34 +02:00
Adriane Boyd	865691d169	Adjust default attrs for textcat configs (#11698 )	2022-10-26 08:43:00 +02:00
Adriane Boyd	88d35450dc	Rename test helper method with non-test_ name (#11701 )	2022-10-25 14:53:18 +02:00
Adriane Boyd	cae4589f5a	Replace EntityRuler with SpanRuler implementation (#11320 ) * Replace EntityRuler with SpanRuler implementation Remove `EntityRuler` and rename the `SpanRuler`-based `future_entity_ruler` to `entity_ruler`. Main changes: * It is no longer possible to load patterns on init as with `EntityRuler(patterns=)`. * The older serialization formats (`patterns.jsonl`) are no longer supported and the related tests are removed. * The config settings are only stored in the config, not in the serialized component (in particular the `phrase_matcher_attr` and overwrite settings). * Add migration guide to EntityRuler API docs * docs update * Minor edit Co-authored-by: svlandeg <svlandeg@github.com>	2022-10-24 09:11:35 +02:00
Adriane Boyd	a4bd890f32	Merge pull request #11686 from adrianeboyd/chore/update-v4-from-master Update v4 from master	2022-10-21 12:55:53 +02:00
github-actions[bot]	84d9cb6b38	Auto-format code with black (#11687 ) Co-authored-by: explosion-bot <explosion-bot@users.noreply.github.com>	2022-10-21 11:54:17 +02:00
Paul O'Leary McCann	0e2b7fb28b	Remove thinc util reimports (#11665 ) * Remove imports marked as v2 leftovers There are a few functions that were in `spacy.util` in v2, but were moved to Thinc. In v3 these were imported in `spacy.util` so that code could be used unchanged, but the comment over them indicates they should always be imported from Thinc. This commit removes those imports. It doesn't look like any DeprecationWarning was ever thrown for using these, but it is probably fine to remove them anyway with a major version. It is not clear that they were widely used. * Import fix_random_seed correctly This seems to be the only place in spaCy that was using the old import.	2022-10-21 11:01:18 +02:00
Adriane Boyd	103b24fb25	Merge remote-tracking branch 'upstream/master' into chore/update-v4-from-master	2022-10-21 09:13:32 +02:00
Adriane Boyd	7e56701057	Merge remote-tracking branch 'upstream/master' into chore/update-develop-from-master-v3.5	2022-10-20 13:38:49 +02:00
Adriane Boyd	3d0e895363	Set version to v3.4.2 (#11672 )	2022-10-19 17:33:55 +02:00
Edward	d66ccb8eb0	Fix multiple entries per custom extension in doc json (#11551 ) * Fix multiple extensions and character offset * Rename token_start/end to start/end * Refactor Doc.from_json based on review * Iterate over user_data items * Only add non-empty underscore entries Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>	2022-10-19 15:52:47 +02:00
Paul O'Leary McCann	858565a567	Fix issues with DVC commands (#11592 ) * Fix flag handling in dvc Prior to this commit, if a flag (--verbose or --quiet) was passed to DVC, it would be added to the end of the generated dvc command line. This would result in the command being interpreted as part of the actual command to run, rather than an argument to dvc. This would result in command lines like: spacy project run preprocess --verbose That would fail with an error that there's no such directory as `--verbose`. This change puts the flags at the front of the dvc command so that they are interpreted correctly. It removes the `run_dvc_commands` function, which had been reduced to just a for loop and wasn't used elsewhere. A separate problem is that there's no way to specify the quiet behaviour to dvc from the command line, though it's unclear if that's a bug. * Add dvc quiet flag to docs * Handle case in DVC where no commands are appropriate If you only have commands with no deps or outputs (admittedly unlikely), you get a weird error about the dvc file not existing. This gives explicit output instead. * Add support for quiet flag * Fix command execution Commands are strings now because they're joined further up.	2022-10-18 15:11:39 +09:00
Sofie Van Landeghem	2ce6aadda2	update default configs to recent versions (#11618 )	2022-10-17 12:10:03 +02:00
github-actions[bot]	ceb62352bf	Auto-format code with black (#11649 ) Co-authored-by: explosion-bot <explosion-bot@users.noreply.github.com>	2022-10-14 18:04:55 +09:00
Adriane Boyd	6b5a3e7219	Extend to pydantic v1.10 (#11635 ) * Update types in `spacy.schemas` for updated pydantic+mypy	2022-10-14 08:16:49 +02:00
Sofie Van Landeghem	4d869fcc11	Small fixes to docstrings (#11610 ) * add missing scorer arg to docstring * fix class names in textcat_multilabel * add missing scorer to docstrings	2022-10-12 15:17:40 +02:00
Adriane Boyd	fe06e037bc	Fix init for pymorphy2_lookup lemmatizer mode (#11631 )	2022-10-12 12:18:39 +02:00
Sofie Van Landeghem	29649589fc	remove dtype (#11615 )	2022-10-11 15:25:05 +02:00
Sofie Van Landeghem	ef74f8f5e4	Fix mypy error in edittree lemmatizer (#11612 ) * cleanup imports * try limiting Thinc to previous release * remove Model specification * fix code and revert Thinc constraint	2022-10-11 14:15:22 +02:00
Madeesh Kannan	446a3ecf34	`StringStore` refactoring (#11344 ) * `strings`: Remove unused `hash32_utf8` function * `strings`: Make `hash_utf8` and `decode_Utf8Str` private * `strings`: Reorganize private functions * 'strings': Raise error when non-string/-int types are passed to functions that don't accept them * `strings`: Add `items()` method, add type hints, remove unused methods, restrict inputs to specific types, reorganize methods * `Morphology`: Use `StringStore.items()` to enumerate features when pickling * `test_stringstore`: Update pre-Python 3 tests * Update `StringStore` docs * Fix `get_string_id` imports * Replace redundant test with tests for type checking * Rename `_retrieve_interned_str`, remove `.get` default arg * Add `get_string_id` to `strings.pyi` Remove `mypy` ignore directives from imports of the above * `strings.pyi`: Replace functions that consume `Union`-typed params with overloads * `strings.pyi`: Revert some function signatures * Update `SYMBOLS_BY_INT` lookups and error codes post-merge * Revert clobbered change introduced in a previous merge * Remove unnecessary type hint * Invert tuple order in `StringStore.items()` * Add test for `StringStore.items()` * Revert "`Morphology`: Use `StringStore.items()` to enumerate features when pickling" This reverts commit `1af9510ceb`. * Rename `keys` and `key_map` * Add `keys()` and `values()` * Add comment about the inverted key-value semantics in the API * Fix type hints * Implement `keys()`, `values()`, `items()` without generators * Fix type hints, remove unnecessary boxing * Update docs * Simplify `keys/values/items()` impl * `mypy` fix * Fix error message, doc fixes	2022-10-06 10:51:06 +02:00
svlandeg	d4922f25fc	fix test for EL activations with refactored KB	2022-10-03 14:41:15 +02:00
svlandeg	e3027c65b8	Merge branch 'copy_develop' into copy_v4	2022-10-03 14:12:16 +02:00
svlandeg	9c8cdb403e	Merge branch 'master_copy' into develop_copy	2022-09-30 15:40:26 +02:00
Sofie Van Landeghem	bcda8bc1e7	update mypy to latest version (#11546 ) * update mypy and disable it for python 3.6 * ignoring mypy's type redefinition error	2022-09-29 14:24:40 +02:00
Adriane Boyd	6d7630c5d3	Allow overriding spacy_version in spacy package meta (#11552 )	2022-09-29 10:44:06 +02:00
Peter Baumgartner	e794d4ae39	`debug data` Spancat Table Improvements (#11504 ) * update * fix format function * pull out _format_number * format with black	2022-09-28 17:16:05 +02:00
Raphael Mitsch	aea16719be	Simplify and clarify enable/disable behavior of spacy.load() (#11459 ) * Change enable/disable behavior so that arguments take precedence over config options. Extend error message on conflict. Add warning message in case of overwriting config option with arguments. * Fix tests in test_serialize_pipeline.py to reflect changes to handling of enable/disable. * Fix type issue. * Move comment. * Move comment. * Issue UserWarning instead of printing wasabi message. Adjust test. * Added pytest.warns(UserWarning) for expected warning to fix tests. * Update warning message. * Move type handling out of fetch_pipes_status(). * Add global variable for default value. Use id() to determine whether used values are default value. * Fix default value for disable. * Rename DEFAULT_PIPE_STATUS to _DEFAULT_EMPTY_PIPES.	2022-09-27 14:22:36 +02:00
Jacobo Myerston	3e8bc1272f	add punctuation to grc (#11426 ) * add punctuation to grc Add support for special editorial punctuation that is common in ancient Greek texts. Ancient Greek texts, as found in digital and print form, have been largely edited by scholars. Restorations and improvements are normally marked with special characters that need to be handled properly by the tokenizer. * add unit tests * simplify regex * move generic quotes to char classes * rename unit test * fix regex Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com> Co-authored-by: svlandeg <svlandeg@github.com> Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>	2022-09-27 11:38:56 +02:00
Adriane Boyd	877671e09a	Preserve missing entity annotation in augmenters (#11540 ) Preserve both `-` and `O` annotation in augmenters rather than relying on `Example.to_dict`'s default support for one option outside of labeled entity spans. This is intended as a temporary workaround for augmenters for v3.4.x. The behavior of `Example` and related IOB utils could be improved in the general case for v3.5.	2022-09-27 10:16:51 +02:00
Richard Hudson	6f692a06d5	Remove side effects from Doc.__init__() (#11506 ) * Remove side effects from Doc.__init__() * Changes based on review comment * Readd test * Change interface of Doc.__init__() * Simplify test Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com> * Update doc.md Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>	2022-09-26 15:58:21 +02:00
Raphael Mitsch	af9b01ef97	Add dependency check to project step runs (#11226 ) * Add dependency check to project step running. * Fix dependency mismatch warning. * Remove newline. * Add types-setuptools to setup.cfg. * Move types-setuptools to test requirements. Move warnings into _validate_requirements(). Handle file reading in project_run(). * Remove newline formatting for output of package conflicts. * Show full version conflict message instead of just package name. * Update spacy/cli/project/run.py Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com> * Fix typo. * Re-add rephrasing of message for conflicting packages. Remove requirements path redundancy. * Update spacy/cli/project/run.py Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com> * Update spacy/cli/project/run.py Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com> * Print unified message for requirement conflicts and missing requirements. * Update spacy/cli/project/run.py Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com> * Fix warning message. * Print conflict/missing messages individually. * Print conflict/missing messages individually. * Add check_requirements setting in project.yml to disable requirements check. * Update website/docs/usage/projects.md Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com> * Update website/docs/usage/projects.md Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com> * Update description of project.yml structure in projects.md. * Update website/docs/usage/projects.md Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> * Prettify projects docs. Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com> Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>	2022-09-16 16:54:31 +02:00
github-actions[bot]	279358be63	Auto-format code with black (#11513 ) Co-authored-by: explosion-bot <explosion-bot@users.noreply.github.com>	2022-09-16 11:50:19 +02:00
Sofie Van Landeghem	0509f90874	add dot (#11500 )	2022-09-15 17:29:42 +02:00
Adriane Boyd	7c98245c0c	Add levenshtein from polyleven (#11418 ) Add a simple levenshtein distance function using the implementation from the polyleven library as `spacy.matcher.levenshtein`.	2022-09-14 17:05:22 +02:00
Daniël de Kok	efdbb722c5	Store activations in `Doc`s when `save_activations` is enabled (#11002 ) * Store activations in Doc when `store_activations` is enabled This change adds the new `activations` attribute to `Doc`. This attribute can be used by trainable pipes to store their activations, probabilities, and guesses for downstream users. As an example, this change modifies the `tagger` and `senter` pipes to add a `store_activations` option. When this option is enabled, the probabilities and guesses are stored in `set_annotations`. * Change type of `store_activations` to `Union[bool, List[str]]` When the value is: - A bool: all activations are stored when set to `True`. - A List[str]: the activations named in the list are stored * Formatting fixes in Tagger * Support store_activations in spancat and morphologizer * Make Doc.activations type visible to MyPy * textcat/textcat_multilabel: add store_activations option * trainable_lemmatizer/entity_linker: add store_activations option * parser/ner: do not currently support returning activations * Extend tagger and senter tests So that they, like the other tests, also check that we get no activations if no activations were requested. * Document `Doc.activations` and `store_activations` in the relevant pipes * Start errors/warnings at higher numbers to avoid merge conflicts Between the master and v4 branches. * Add `store_activations` to docstrings. * Replace store_activations setter by set_store_activations method Setters that take a different type than what the getter returns are still problematic for MyPy. Replace the setter by a method, so that type inference works everywhere. * Use dict comprehension suggested by @svlandeg * Revert "Use dict comprehension suggested by @svlandeg" This reverts commit `6e7b958f70`. * EntityLinker: add type annotations to _add_activations * _store_activations: make kwarg-only, remove doc_scores_lens arg * set_annotations: add type annotations * Apply suggestions from code review Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> * TextCat.predict: return dict * Make the `TrainablePipe.store_activations` property a bool This means that we can also bring back `store_activations` setter. * Remove `TrainablePipe.activations` We do not need to enumerate the activations anymore since `store_activations` is `bool`. * Add type annotations for activations in predict/set_annotations * Rename `TrainablePipe.store_activations` to `save_activations` * Error E1400 is not used anymore This error was used when activations were still `Union[bool, List[str]]`. * Change wording in API docs after store -> save change * docs: tag (save_)activations as new in spaCy 4.0 * Fix copied line in morphologizer activations test * Don't train in any test_save_activations test * Rename activations - "probs" -> "probabilities" - "guesses" -> "label_ids", except in the edit tree lemmatizer, where "guesses" -> "tree_ids". * Remove unused W400 warning. This warning was used when we still allowed the user to specify which activations to save. * Formatting fixes Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> * Replace "kb_ids" by a constant * spancat: replace a cast by an assertion * Fix EOF spacing * Fix comments in test_save_activations tests * Do not set RNG seed in activation saving tests * Revert "spancat: replace a cast by an assertion" This reverts commit `0bd5730d16`. Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>	2022-09-13 09:51:12 +02:00
Sofie Van Landeghem	cc10a27c59	Prevent tok2vec to broadcast to listeners when predicting (#11385 ) * replicate bug with tok2vec in annotating components * add overfitting test with a frozen tok2vec * remove broadcast from predict and check doc.tensor instead * remove broadcast * proper error * slight rephrase of documentation	2022-09-12 15:36:48 +02:00
Madeesh Kannan	0ec9a696e6	Fix config validation failures caused by NVTX pipeline wrappers (#11460 ) * Enable Cython<->Python bindings for `Pipe` and `TrainablePipe` methods * `pipes_with_nvtx_range`: Skip hooking methods whose signature cannot be ascertained When loading pipelines from a config file, the arguments passed to individual pipeline components are validated by `pydantic` during init. For this, the validation model attempts to parse the function signature of the component's c'tor/entry point so that it can check if all mandatory parameters are present in the config file. When using the `models_and_pipes_with_nvtx_range` as an `after_pipeline_creation` callback, the methods of all pipeline components get replaced by an NVTX range wrapper before the above-mentioned validation takes place. This can be problematic for components that are implemented as Cython extension types - if the extension type is not compiled with Python bindings for its methods, they will have no signatures at runtime. This resulted in `pydantic` matching the wrapper's parameters with those in the config and raising errors. To avoid this, we now skip applying the wrapper to any (Cython) methods that do not have signatures.	2022-09-12 14:55:41 +02:00
kadarakos	6b83fee58d	Assets message (#11458 ) * new error message when 'project run assets' * new error message when 'project run assets' * Update spacy/cli/project/run.py Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>	2022-09-09 17:17:10 +02:00
Adriane Boyd	8a86a35eab	Remove has_letters in config template (#11465 ) Due to problems with the javascript conversion in the website quickstart, remove the `has_letters` setting to simplify generating `attrs` for the default `tok2vec`. Additionally reduce `PREFIX` as in the trained pipelines.	2022-09-09 15:10:04 +02:00
github-actions[bot]	0c72c6bb2c	Auto-format code with black (#11468 ) Co-authored-by: explosion-bot <explosion-bot@users.noreply.github.com>	2022-09-09 11:21:17 +02:00
Raphael Mitsch	1f23c615d7	Refactor KB for easier customization (#11268 ) * Add implementation of batching + backwards compatibility fixes. Tests indicate issue with batch disambiguation for custom singular entity lookups. * Fix tests. Add distinction w.r.t. batch size. * Remove redundant and add new comments. * Adjust comments. Fix variable naming in EL prediction. * Fix mypy errors. * Remove KB entity type config option. Change return types of candidate retrieval functions to Iterable from Iterator. Fix various other issues. * Update spacy/pipeline/entity_linker.py Co-authored-by: Paul O'Leary McCann <polm@dampfkraft.com> * Update spacy/pipeline/entity_linker.py Co-authored-by: Paul O'Leary McCann <polm@dampfkraft.com> * Update spacy/kb_base.pyx Co-authored-by: Paul O'Leary McCann <polm@dampfkraft.com> * Update spacy/kb_base.pyx Co-authored-by: Paul O'Leary McCann <polm@dampfkraft.com> * Update spacy/pipeline/entity_linker.py Co-authored-by: Paul O'Leary McCann <polm@dampfkraft.com> * Add error messages to NotImplementedErrors. Remove redundant comment. * Fix imports. * Remove redundant comments. * Rename KnowledgeBase to InMemoryLookupKB and BaseKnowledgeBase to KnowledgeBase. * Fix tests. * Update spacy/errors.py Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> * Move KB into subdirectory. * Adjust imports after KB move to dedicated subdirectory. * Fix config imports. * Move Candidate + retrieval functions to separate module. Fix other, small issues. * Fix docstrings and error message w.r.t. class names. Fix typing for candidate retrieval functions. * Update spacy/kb/kb_in_memory.pyx Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> * Update spacy/ml/models/entity_linker.py Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> * Fix typing. * Change typing of mentions to be Span instead of Union[Span, str]. * Update docs. * Update EntityLinker and _architecture docs. * Update website/docs/api/entitylinker.md Co-authored-by: Paul O'Leary McCann <polm@dampfkraft.com> * Adjust message for E1046. * Re-add section for Candidate in kb.md, add reference to dedicated page. * Update docs and docstrings. * Re-add section + reference for KnowledgeBase.get_alias_candidates() in docs. * Update spacy/kb/candidate.pyx * Update spacy/kb/kb_in_memory.pyx * Update spacy/pipeline/legacy/entity_linker.py * Remove candidate.md. Remove mistakenly added config snippet in entity_linker.py. Co-authored-by: Paul O'Leary McCann <polm@dampfkraft.com> Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>	2022-09-08 10:38:07 +02:00
shademe	977b847cce	Merge branch 'develop' into merge-develop-into-v4	2022-09-07 11:35:47 +02:00
Sofie Van Landeghem	d801cccd38	Merge pull request #11430 from rmitsch/chore/synch-develop Synch develop with master	2022-09-05 15:07:18 +02:00
Paul O'Leary McCann	977dc33312	Add a way to get the URL to download a pipeline to the CLI (#11175 ) * Add a dry run flag to download * Remove --dry-run, add --url option to `spacy info` instead * Make mypy happy * Print only the URL, so it's easier to use in scripts * Don't add the egg hash unless downloading an sdist * Update spacy/cli/info.py Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> * Add two implementations of requirements * Clean up requirements sample slightly This should make mypy happy * Update URL help string * Remove requirements option * Add url option to docs * Add URL to spacy info model output, when available * Add types-setuptools to testing reqs * Add types-setuptools to requirements * Add "compatible", expand docstring * Update spacy/cli/info.py Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com> * Run prettier on CLI docs * Update docs Add a sidebar about finding download URLs, with some examples of the new command. * Add download URLs to table on model page * Apply suggestions from code review Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com> * Updates from review * download url -> download link * Update docs Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>	2022-09-02 11:58:21 +02:00
github-actions[bot]	71884d0942	Auto-format code with black (#11427 ) Co-authored-by: explosion-bot <explosion-bot@users.noreply.github.com>	2022-09-02 11:43:20 +02:00
Madeesh Kannan	d1760ebe02	Better handling of unexpected types in `SetPredicate` (#11312 ) * `Matcher`: Better type checking of values in `SetPredicate` `SetPredicate`: Emit warning and return `False` on unexpected value types * Rename `value_type_mismatch` variable * Inline warning * Remove unexpected type warning from `_SetPredicate` * Ensure that `str` values are not interpreted as sequences Check elements of sequence values for convertibility to `str` or `int` * Add more `INTERSECT` and `IN` test cases * Test for inputs with multiple characters * Return `False` early instead of using a boolean flag * Remove superfluous `int` check, parentheses * Apply suggestions from code review Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com> * Apply suggestions from code review * Clarify test comment Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>	2022-09-02 09:09:48 +02:00
Adriane Boyd	4a615cacd2	Consolidate and freeze symbols (#11352 ) * Consolidate and freeze symbols Instead of having symbol values defined in three potentially conflicting places (`spacy.attrs`, `spacy.parts_of_speech`, `spacy.symbols`), define all symbols in `spacy.symbols` and reference those values in `spacy.attrs` and `spacy.parts_of_speech`. Remove deprecated and placeholder symbols from `spacy.attrs.IDS`. Make `spacy.attrs.NAMES` and `spacy.symbols.NAMES` reverse dicts rather than lists in order to support future use of hash values in `attr_id_t`. Minor changes: * Use `uint64_t` for attrs in `Doc.to_array` to support future use of hash values * Remove unneeded attrs filter for error message in `Doc.to_array` * Remove unused attr `SENT_END` * Handle dynamic size of attr_id_t in Doc.to_array * Undo added warnings * Refactor to make Doc.to_array more similar to Doc.from_array * Improve refactoring	2022-09-02 09:08:40 +02:00
Adriane Boyd	78f5503a29	Check for any non-Doc returned value for components (#11424 )	2022-09-01 19:37:23 +02:00
Madeesh Kannan	604a7c3c26	`SpanGroup(s)`-related optimizations (#11380 ) * `SpanGroup`: Add support for binding copies to a new reference document * `SpanGroups`: Replace superfluous serialize-deserialize roundtrip in `copy` Instead, directly copy the in-memory representations of the constituent `SpanGroup`s. * Update `SpanGroup.copy()` signature * Rename `new_doc` param to `doc` * Fix kwdarg * Update `.pyi` file and docstrings * `mypy` fix * Update spacy/tokens/span_group.pyx * Update docs Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>	2022-08-31 09:03:20 +02:00
Sofie Van Landeghem	8fc0efc502	Allow string argument for disable/enable/exclude (#11406 ) * adding unit test for spacy.load with disable/exclude string arg * allow pure strings in from_config * update docs * upstream type adjustements * docs update * make docstring more consistent * Update spacy/language.py Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com> * two more cleanups * fix type in internal method Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>	2022-08-31 09:02:34 +02:00
Paul O'Leary McCann	698b8b495f	Update/remove old Matcher syntax (#11370 ) * Clean up old Matcher call style related stuff In v2 Matcher.add was called with (key, on_match, *patterns). In v3 this was changed to (key, patterns, *, on_match=None), but there were various points where the old call syntax was documented or handled specially. This removes all those. The Matcher itself didn't need any code changes, as it just gives a generic type error. However the PhraseMatcher required some changes because it would automatically "fix" the old call style. Surprisingly, the tokenizer was still using the old call style in one place. After these changes tests failed in two places: 1. one test for the "new" call style, including the "old" call style. I removed this test. 2. deserializing the PhraseMatcher fails because the input docs are a set. I am not sure why 2 is happening - I guess it's a quirk of the serialization format? - so for now I just convert the set to a list when deserializing. The check that the input Docs are a List in the PhraseMatcher is a new check, but makes it parallel with the other Matchers, which seemed like the right thing to do. * Add notes related to input docs / deserialization type * Remove Typing import * Remove old note about call style change * Apply suggestions from code review Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com> * Use separate method for setting internal doc representations In addition to the title change, this changes the internal dict to be a defaultdict, instead of a dict with frequent use of setdefault. * Add _add_from_arrays for unpickling * Cleanup around adding from arrays This moves adding to internal structures into the private batch method, and removes the single-add method. This has one behavioral change for `add`, in that if something is wrong with the list of input Docs (such as one of the items not being a Doc), valid items before the invalid one will not be added. 
Also the callback will not be updated if anything is invalid. This change should not be significant. This also adds a test to check failure when given a non-Doc. * Update spacy/matcher/phrasematcher.pyx Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com> Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>	2022-08-30 15:40:31 +02:00
Daniël de Kok	3f4b4b7b4f	Fix `test_{prefer,require}_gpu` (#11390 ) * Fix `test_{prefer,require}_gpu` These tests assumed that GPUs are only supported with CuPy, but since Thinc 8.1 we also support Metal Performance Shaders. * test_misc: arrange thinc imports to be together	2022-08-30 14:21:02 +02:00
Patrick J. Burns	5ae63b1fbd	Add Latin language support (#11349 ) * Add lang folder for la (Latin) * Add Latin lang classes * Add minimal tokenizer exceptions * Add minimal stopwords * Add minimal lex_attrs * Update stopwords, tokenizer exceptions * Add la tests; register la_tokenizer in conftest.py * Update spacy/lang/la/lex_attrs.py Remove duplicate form in Latin lex_attrs Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> * Update natto-py version spec (#11222) * Update natto-py version spec * Update setup.cfg Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com> Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com> * Add scorer to textcat API docs config settings (#11263) * Update docs for pipeline initialize() methods (#11221) * Update documentation for dependency parser * Update documentation for trainable_lemmatizer * Update documentation for entity_linker * Update documentation for ner * Update documentation for morphologizer * Update documentation for senter * Update documentation for spancat * Update documentation for tagger * Update documentation for textcat * Update documentation for tok2vec * Run prettier on edited files * Apply similar changes in transformer docs * Remove need to say annotated example explicitly I removed the need to say "Must contain at least one annotated Example" because it's often a given that Examples will contain some gold-standard annotation. 
* Run prettier on transformer docs * chore: add 'concepCy' to spacy universe (#11255) * chore: add 'concepCy' to spacy universe * docs: add 'slogan' to concepCy * Support full prerelease versions in the compat table (#11228) * Support full prerelease versions in the compat table * Fix types * adding spans to doc_annotation in Example.to_dict (#11261) * adding spans to doc_annotation in Example.to_dict * to_dict compatible with from_dict: tuples instead of spans * use strings for label and kb_id * Simplify test * Update data formats docs Co-authored-by: Stefanie Wolf <stefanie.wolf@vitecsoftware.com> Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com> * Fix regex invalid escape sequences (#11276) * Add W605 to the errors raised by flake8 in the CI (#11283) * Clean up automated label-based issue handling (#11284) * Clean up automated label-based issue handling 1. upgrade tiangolo/issue-manager to latest 2. move needs-more-info to tiangolo 3. change needs-more-info close time to 7 days 4. 
delete old needs-more-info config * Use old, longer message * Fix label name * Fix Dutch noun chunks to skip overlapping spans (#11275) * Add test for overlapping noun chunks * Skip overlapping noun chunks * Update spacy/tests/lang/nl/test_noun_chunks.py Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> * Docs: displaCy documentation - data types, `parse_{deps,ents,spans}`, spans example (#10950) * add in spans example and parse references * rm autoformatter * rm extra ents copy * TypedDict draft * type fixes * restore non-documentation files * docs update * fix spans example * fix hyperlinks * add parse example * example fix + argument fix * fix api arg in docs * fix bad variable replacement * fix spacing in style Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> * fix spacing on table * fix spacing on table * rm temp files Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> * include span_ruler for default warning filter (#11333) * Add uk pipelines to website (#11332) * Check for . in factory names (#11336) * Make fixes for PR #11349 * Fix roman numeral coverage in #11349 Co-authored-by: Patrick J. Burns <patricks@diyclassics.org> Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> Co-authored-by: Paul O'Leary McCann <polm@dampfkraft.com> Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com> Co-authored-by: Lj Miranda <12949683+ljvmiranda921@users.noreply.github.com> Co-authored-by: Jules Belveze <32683010+JulesBelveze@users.noreply.github.com> Co-authored-by: stefawolf <wlf.ste@gmail.com> Co-authored-by: Stefanie Wolf <stefanie.wolf@vitecsoftware.com> Co-authored-by: Peter Baumgartner <5107405+pmbaumgartner@users.noreply.github.com>	2022-08-30 14:04:54 +02:00
Adriane Boyd	98a916e01a	Make stable private modules public and adjust names (#11353 ) * Make stable private modules public and adjust names * `spacy.ml._character_embed` -> `spacy.ml.character_embed` * `spacy.ml._precomputable_affine` -> `spacy.ml.precomputable_affine` * `spacy.tokens._serialize` -> `spacy.tokens.doc_bin` * `spacy.tokens._retokenize` -> `spacy.tokens.retokenize` * `spacy.tokens._dict_proxies` -> `spacy.tokens.span_groups` * Skip _precomputable_affine * retokenize -> retokenizer * Fix imports	2022-08-30 13:56:35 +02:00
Adriane Boyd	4bce8fa755	Remove setup_requires from setup.cfg (#11384 ) * Remove setup_requires from setup.cfg * Update requirements test to ignore cython in setup.cfg	2022-08-29 13:23:24 +02:00
Paul O'Leary McCann	aafee5e1b7	Fix lookup usage in French/Catalan (fix #11347 ) (#11382 ) * Fix lookup usage (fix #11347) Before using the lookups table in the French (and Catalan) lemmatizers, there's a check to see if the current term is in the table. But it's checking a string against hashes, so it's always false. Also the table lookup function is designed so you don't have to do that anyway. * Use the lookup table directly * Use string, not token	2022-08-29 10:32:38 +02:00
Edward	6723d76f24	Add ConsoleLogger.v2 (#11214 ) * Init * Change logger to ConsoleLogger.v2 * adjust naming * More naming adjustments * Fix output_file reference error * ignore type * Add basic test for logger * Hopefully fix mypy issue * mypy ignore line * Update mypy line Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com> * Update test method name Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com> * Change file saving logic * Fix finalize method * increase spacy-legacy version in requirements * Update docs * small adjustments Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>	2022-08-29 10:23:05 +02:00
Adriane Boyd	2a558a7cdc	Switch to mecab-ko as default Korean tokenizer (#11294 ) * Switch to mecab-ko as default Korean tokenizer Switch to the (confusingly-named) mecab-ko python module for default Korean tokenization. Maintain the previous `natto-py` tokenizer as `spacy.KoreanNattoTokenizer.v1`. * Temporarily run tests with mecab-ko tokenizer * Fix types * Fix duplicate test names * Update requirements test * Revert "Temporarily run tests with mecab-ko tokenizer" This reverts commit `d2083e7044`. * Add mecab_args setting, fix pickle for KoreanNattoTokenizer * Fix length check * Update docs * Formatting * Update natto-py error message Co-authored-by: Paul O'Leary McCann <polm@dampfkraft.com> Co-authored-by: Paul O'Leary McCann <polm@dampfkraft.com>	2022-08-26 10:11:18 +02:00
Adriane Boyd	740c33fe58	Merge remote-tracking branch 'upstream/develop' into chore/update-v4-from-develop	2022-08-24 20:43:07 +02:00
Adriane Boyd	81874265e9	Merge remote-tracking branch 'upstream/master' into chore/update-develop-from-master-v3.5-1	2022-08-24 12:47:42 +02:00
Adriane Boyd	c44d243f25	Merge remote-tracking branch 'upstream/master' into chore/update-v4-from-master	2022-08-24 07:15:41 +02:00
Tobius Saul	c09d2fa25b	luganda language extension (#10847 ) * luganda language extension * __init__.py changes * New enhancements * Lexical attribute changed * punctuation and sentence additions * Remove comment header * Fix typos, reformat * reformated version * Add tokenizer test * Remove contractions from stop words * Format * Add Luganda to website Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>	2022-08-23 13:09:36 +02:00
Edward	5afa98aabf	Support custom attributes for tokens and spans in json conversion (#11125 ) * Add token and span custom attributes to to_json() * Change logic for to_json * Add functionality to from_json * Small adjustments * Move token/span attributes to new dict key * Fix test * Fix the same test but much better * Add backwards compatibility tests and adjust logic * Add test to check if attributes not set in underscore are not saved in the json * Add tests for json compatibility * Adjust test names * Fix tests and clean up code * Fix assert json tests * small adjustment * adjust naming and code readability * Adjust naming, added more tests and changed logic * Fix typo * Adjust errors, naming, and small test optimization * Fix byte tests * Fix bytes tests * Change naming and json structure * update schema * Update spacy/schemas.py Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com> * Update spacy/tokens/doc.pyx Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com> * Update spacy/tokens/doc.pyx Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com> * Update spacy/schemas.py Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com> * Update schema for underscore attributes * Adjust underscore schema * adjust schema tests Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>	2022-08-23 10:05:02 +02:00
Adriane Boyd	bb0e178878	Make Span/Doc.ents more consistent for ent_kb_id and ent_id (#11328 ) * Map `Span.id` to `Token.ent_id` in all cases when setting `Doc.ents` * Reset `Token.ent_id` and `Token.ent_kb_id` when setting `Doc.ents` * Make `Span.ent_id` an alias of `Span.id` rather than a read-only view of the root token's `ent_id` annotation	2022-08-22 20:28:57 +02:00
Sofie Van Landeghem	1a5be63715	Cleanup Cython structs (#11337 ) * cleanup Tokenizer fields * remove unused object from vocab * remove IS_OOV_DEPRECATED * add back in as FLAG13 * FLAG 18 instead * import fix * fix clumsy fingers * revert symbol changes in favor of #11352 * bint instead of bool	2022-08-22 15:52:24 +02:00
Adriane Boyd	f55bb7470d	Clean up warnings in the test suite (#11331 )	2022-08-22 12:04:30 +02:00
Paul O'Leary McCann	0f07defe2c	Remove reference to voting on issue (#11335 ) Not clear which issue this refers to, we don't suggest this for any other issues, and we don't use votes in general.	2022-08-22 11:29:05 +02:00
Adriane Boyd	5fa8f4faca	Switch ru and uk lemmatizers to pymorphy3 (#11345 ) * Switch ru and uk lemmatizers to pymorphy3 * Switch to pymorphy3 in tests	2022-08-22 11:27:14 +02:00
Adriane Boyd	3e4cf1bbe1	Check for . in factory names (#11336 )	2022-08-19 09:52:12 +02:00
Sofie Van Landeghem	cab263791f	include span_ruler for default warning filter (#11333 )	2022-08-17 19:55:54 +02:00
Adriane Boyd	d757dec5c4	Remove intify_attrs(_do_deprecated) (#11319 )	2022-08-17 12:13:54 +02:00
Peter Baumgartner	db7b9938a4	Docs: displaCy documentation - data types, `parse_{deps,ents,spans}`, spans example (#10950 ) * add in spans example and parse references * rm autoformatter * rm extra ents copy * TypedDict draft * type fixes * restore non-documentation files * docs update * fix spans example * fix hyperlinks * add parse example * example fix + argument fix * fix api arg in docs * fix bad variable replacement * fix spacing in style Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> * fix spacing on table * fix spacing on table * rm temp files Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>	2022-08-16 11:23:34 -04:00
antonpibm	551e73ccfc	Match private networks as URLs (#11121 )	2022-08-11 11:26:26 +02:00
Sofie Van Landeghem	5d54c0e32a	Rename modules for consistency (#11286 ) * rename Python module to entity_ruler * rename Python module to attribute_ruler	2022-08-10 11:44:05 +02:00
Adriane Boyd	ed4ad309e6	Fix Dutch noun chunks to skip overlapping spans (#11275 ) * Add test for overlapping noun chunks * Skip overlapping noun chunks * Update spacy/tests/lang/nl/test_noun_chunks.py Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>	2022-08-10 09:49:08 +02:00
Adriane Boyd	fc4246558b	Fix regex invalid escape sequences (#11276 )	2022-08-09 10:59:36 +02:00
stefawolf	23749cfc91	adding spans to doc_annotation in Example.to_dict (#11261 ) * adding spans to doc_annotation in Example.to_dict * to_dict compatible with from_dict: tuples instead of spans * use strings for label and kb_id * Simplify test * Update data formats docs Co-authored-by: Stefanie Wolf <stefanie.wolf@vitecsoftware.com> Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>	2022-08-05 12:26:38 +02:00
Luka Dragar	b64243ed55	Updates to Slovenian language (#11162 ) * Added examples for Slovene * Update spacy/lang/sl/examples.py Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com> * Corrected a typo in one of the sentences * Updated support for Slovenian * Some minor changes to corrections * Added forint currency * Corrected HYPHENS_PERMITTED regex and some formatting * Minor changes * Un-xfail tokenizer test * Format Co-authored-by: Luka Dragar <D20124481@mytudublin.ie> Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>	2022-08-05 10:10:18 +02:00
Adriane Boyd	b07708d5d0	Support full prerelease versions in the compat table (#11228 ) * Support full prerelease versions in the compat table * Fix types	2022-08-04 15:14:19 +02:00
Daniël de Kok	e581eeac34	precompute_hiddens/Parser: look up CPU ops once (v4) (#11068 ) * precompute_hiddens/Parser: look up CPU ops once * precompute_hiddens: make cpu_ops private	2022-07-29 15:12:19 +02:00
Daniël de Kok	1ff683a50b	Merge remote-tracking branch 'upstream/master' into merge-master-v4-20220728	2022-07-28 13:53:59 +02:00
ninjalu	95a1b8aca6	add additional REL_OP (#10371 ) * add additional REL_OP * change to condition and new rel_op symbols * add operators to docs * add the anchor while we're in here * add tests Co-authored-by: Peter Baumgartner <5107405+pmbaumgartner@users.noreply.github.com>	2022-07-27 13:16:44 +02:00
Edward	360a702ecd	Add parent argument (#11210 )	2022-07-26 14:35:18 +02:00
Adriane Boyd	5c2a00cef0	Set version to v3.4.1 (#11209 )	2022-07-26 12:52:38 +02:00
Daniël de Kok	4ee8a06149	Fix compatibility with CuPy 9.x (#11194 ) After the precomputable affine table of shape [nB, nF, nO, nP] is computed, padding with shape [1, nF, nO, nP] is assigned to the first row of the precomputed affine table. However, when we are indexing the precomputed table, we get a row of shape [nF, nO, nP]. CuPy versions before 10.0 cannot paper over this shape difference. This change fixes compatibility with CuPy < 10.0 by squeezing the first dimension of the padding before assignment.	2022-07-26 10:52:01 +02:00
Adriane Boyd	e5990db713	Revert "Temporarily skip tests that require models/compat" This reverts commit `d9320db7db`.	2022-07-25 18:12:18 +02:00
Madeesh Kannan	ba18d2913d	`Morphology`/`Morphologizer` optimizations and refactoring (#11024 ) * `Morphology`: Refactor to use C types, reduce allocations, remove unused code * `Morphologizer`: Avoid unnecessary sorting of morpho features * `Morphologizer`: Remove excessive reallocations of labels, improve hash lookups of labels, coerce `numpy` numeric types to native ints Update docs * Remove unused method * Replace `unique_ptr` usage with `shared_ptr` * Add type annotations to internal Python methods, rename `hash` variable, fix typos * Add comment to clarify implementation detail * Fix return type * `Morphology`: Stop early when splitting fields and values	2022-07-15 11:14:08 +02:00
Nicolai Bjerre Pedersen	2fa983aa2e	Fix span typings (#11119 ) Add id, id_ to span.pyi.	2022-07-12 13:47:35 +02:00
Peter Baumgartner	36cb2029a9	displaCy Spans Vertical Alignment Fix 2 (#11092 ) * add in span render slot fix * fix spacing off by one * rm demo * adjust comments * fix whitespace and overlap issue	2022-07-08 19:20:13 +02:00
github-actions[bot]	e7fd06bdbe	Auto-format code with black (#11099 ) Co-authored-by: explosion-bot <explosion-bot@users.noreply.github.com>	2022-07-08 18:43:25 +09:00
Daniël de Kok	a06cbae70d	precompute_hiddens/Parser: do not look up CPU ops (3.4) (#11069 ) * precompute_hiddens/Parser: do not look up CPU ops `get_ops("cpu")` is quite expensive. To avoid this, we want to cache the result as in #11068. However, for 3.x we do not want to change the ABI. So we avoid the expensive lookup by using NumpyOps. This should have a minimal impact, since `get_ops("cpu")` was only used when the model ops were `CupyOps`. If the ops are `AppleOps`, we are still passing through the correct BLAS implementation. * _NUMPY_OPS -> NUMPY_OPS	2022-07-05 10:53:42 +02:00
Madeesh Kannan	d36d66b7ca	Increase test deadline to 30 minutes to prevent spurious test failures (#11070 ) * Increase test deadline to 30 minutes to prevent spurious test failures * Reduce deadline to 2 minutes	2022-07-04 18:37:09 +02:00
kadarakos	5240baccfe	dont use get_array_module (#11056 )	2022-07-04 17:15:33 +02:00
Raphael Mitsch	e9eb59699f	NEL confidence threshold (#11016 ) * Add base for NEL abstention threshold mechanism. * Add abstention threshold to entity linker. Add test. * Fix entity linking tests. * Changed abstention default threshold from 0 to None. * Fix default values for abstention thresholds. * Fix mypy errors. * Replace assertion with raise of proper error code. * Simplify threshold check. Remove thresholding from EntityLinker_v1. * Rename test. * Update spacy/pipeline/entity_linker.py Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> * Update spacy/pipeline/entity_linker.py Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> * Make E1043 configurable. * Update docs. * Rephrase description in docs. Adjusting error code message. Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>	2022-07-04 17:05:21 +02:00
Madeesh Kannan	59c763eec1	`StringStore`-related optimizations (#10938 ) * `strings`: More robust type checking of keys/IDs, coerce `int`-like types to `hash_t` * Preserve existing public API behaviour * Fix return type * Replace `bool` with `bint`, rename to `_try_coerce_to_hash`, replace `id` with `hash` * Avoid unnecessary re-encoding and re-calculation of strings and hashes respectively * Rename variables named `hash` Add comment on early return	2022-07-04 15:04:03 +02:00
explosion-bot	7e55a51314	Auto-format code with black	2022-07-01 08:04:32 +00:00
Madeesh Kannan	eaf66e7431	Add NVTX ranges to `TrainablePipe` components (#10965 ) * `TrainablePipe`: Add NVTX range decorator * Annotate `TrainablePipe` subclasses with NVTX ranges * Export function signature to allow introspection of args in tests * Revert "Annotate `TrainablePipe` subclasses with NVTX ranges" This reverts commit `d8684f7372`. * Revert "Export function signature to allow introspection of args in tests" This reverts commit `f4405ca3ad`. * Revert "`TrainablePipe`: Add NVTX range decorator" This reverts commit `26536eb6b8`. * Add `spacy.pipes_with_nvtx_range` pipeline callback * Show warnings for all missing user-defined pipe functions that need to be annotated Fix imports, typos * Rename `DEFAULT_ANNOTATABLE_PIPE_METHODS` to `DEFAULT_NVTX_ANNOTATABLE_PIPE_METHODS` Reorder import * Walk model nodes directly whilst applying NVTX ranges Ignore pipe method wrapper when applying range	2022-06-30 11:28:12 +02:00
Adriane Boyd	3fe9f47de4	Revert "disable failing test because Stanford servers are down (#11015 )" (#11054 ) This reverts commit `f8116078ce`.	2022-06-30 11:24:54 +02:00
Shen Qin	be00db6645	Addition of min_max quantifier in matcher {n,m} (#10981 ) * Min_max_operators 1. Modified API and Usage for spaCy website to include min_max operator 2. Modified matcher.pyx to include min_max function {n,m} and its variants 3. Modified schemas.py to include min_max validation error 4. Added test cases to test_matcher_api.py, test_matcher_logic.py and test_pattern_validation.py * attempt to fix mypy/pydantic compat issue * formatting * Update spacy/tests/matcher/test_pattern_validation.py Co-authored-by: Source-Shen <82353723+Source-Shen@users.noreply.github.com> Co-authored-by: svlandeg <svlandeg@github.com> Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>	2022-06-30 11:01:58 +02:00
Daniël de Kok	0ff14aabce	vectors: avoid expensive comparisons between numpy ints and Python ints (#10992 ) * vectors: avoid expensive comparisons between numpy ints and Python ints * vectors: avoid failure on lists of ints * Convert another numpy int to Python	2022-06-29 12:58:31 +02:00
Peter Baumgartner	dd038b536c	fix to horizontal space (#10994 )	2022-06-28 20:42:40 +02:00
Adriane Boyd	24f4908fce	Update vector handling in similarity methods (#11013 ) Distinguish between vectors that are 0 vs. missing vectors when warning about missing vectors. Update `Doc.has_vector` to match `Span.has_vector` and `Token.has_vector` for cases where the vocab has vectors but none of the tokens in the container have vectors.	2022-06-28 19:50:47 +02:00
Madeesh Kannan	1d5cad0b42	`Example.get_aligned_parse`: Handle unit and zero length vectors correctly (#11026 ) * `Example.get_aligned_parse`: Do not squeeze gold token idx vector Correctly handle zero-size vectors passed to `np.vectorize` * Add tests * Use `Doc` ctor to initialize attributes * Remove unintended change Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com> * Remove unused import Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>	2022-06-28 19:42:58 +02:00
Richard Hudson	a9559e7435	Handle Cyrillic combining diacritics (#10837 ) * Handle Russian, Ukrainian and Bulgarian * Corrections * Correction * Correction to comment * Changes based on review * Correction * Reverted irrelevant change in punctuation.py * Remove unnecessary group * Reverted accidental change	2022-06-28 15:35:32 +02:00
Zackere	8ffff18ac4	Try cloning repo from main & master (#10843 ) * Try cloning repo from main & master * fixup! Try cloning repo from main & master * fixup! fixup! Try cloning repo from main & master * refactor clone and check for repo:branch existence * spacing fix * make mypy happy * type util function * Update spacy/cli/project/clone.py Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> Co-authored-by: Peter Baumgartner <5107405+pmbaumgartner@users.noreply.github.com> Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>	2022-06-28 09:11:15 -04:00
Daniël de Kok	1605ef7319	Merge remote-tracking branch 'upstream/master' into merge-master-v4-20220627-2	2022-06-27 17:45:45 +02:00
github-actions[bot]	4155a59d47	Auto-format code with black (#11022 ) Co-authored-by: explosion-bot <explosion-bot@users.noreply.github.com>	2022-06-27 09:35:35 +02:00
Adriane Boyd	738b38064f	Merge pull request #11021 from adrianeboyd/chore/v3.4.0 Set version to v3.4.0	2022-06-24 14:54:16 +02:00
Madeesh Kannan	8f1ba4de58	Backport parser/alignment optimizations from `feature/refactor-parser` (#10952 )	2022-06-24 13:39:52 +02:00
Adriane Boyd	d9320db7db	Temporarily skip tests that require models/compat	2022-06-24 11:20:53 +02:00
Adriane Boyd	bffe54d02b	Set version to v3.4.0	2022-06-24 08:48:58 +02:00
Sofie Van Landeghem	f8116078ce	disable failing test because Stanford servers are down (#11015 )	2022-06-23 10:57:46 +02:00
Sofie Van Landeghem	f00254ae27	add counts to verbose list of NER labels (#10957 )	2022-06-20 09:48:40 +02:00
Raphael Mitsch	4c058eb40a	`enable` argument for spacy.load() (#10784 ) * Enable flag on spacy.load: foundation for include, enable arguments. * Enable flag on spacy.load: fixed tests. * Enable flag on spacy.load: switched from pretrained model to empty model with added pipes for tests. * Enable flag on spacy.load: switched to more consistent error on misspecification of component activity. Test refactoring. Added to default config. * Enable flag on spacy.load: added support for fields not in pipeline. * Enable flag on spacy.load: removed serialization fields from supported fields. * Enable flag on spacy.load: removed 'enable' from config again. * Enable flag on spacy.load: relaxed checks in _resolve_component_activation_status() to allow non-standard pipes. * Enable flag on spacy.load: fixed relaxed checks for _resolve_component_activation_status() to allow non-standard pipes. Extended tests. * Enable flag on spacy.load: comments w.r.t. resolution workarounds. * Enable flag on spacy.load: remove include fields. Update website docs. * Enable flag on spacy.load: updates w.r.t. changes in master. * Implement Doc.from_json(): update docstrings. Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com> * Implement Doc.from_json(): remove newline. Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com> * Implement Doc.from_json(): change error message for E1038. Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com> * Enable flag on spacy.load: wrapped docstring for _resolve_component_status() at 80 chars. * Enable flag on spacy.load: changed examples for enable flag. * Remove newline. Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> * Fix docstring for Language._resolve_component_status(). * Rename E1038 to E1042. Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com> Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>	2022-06-17 20:24:13 +01:00
Sofie Van Landeghem	eaeca5eb6a	account for NER labels with a hyphen in the name (#10960 ) * account for NER labels with a hyphen in the name * cleanup * fix docstring * add return type to helper method * shorter method and few more occurrences * user helper method across repo * fix circular import * partial revert to avoid circular import	2022-06-17 20:02:37 +01:00
github-actions[bot]	6313787fb6	Auto-format code with black (#10977 ) Co-authored-by: explosion-bot <explosion-bot@users.noreply.github.com>	2022-06-17 19:41:55 +01:00
Raphael Mitsch	d50668dbf0	Made _initialize_X() methods private. (#10978 )	2022-06-17 15:55:34 +02:00
Raphael Mitsch	a7f6bc5dfb	Workaround for Typer optional default values with Python calls (#10788 ) * Workaround for Typer optional default values with Python calls: added test and workaround. * @rmitsch Workaround for Typer optional default values with Python calls: reverting some black formatting changes. Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> * @rmitsch Workaround for Typer optional default values with Python calls: removing return type hint. Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> * Workaround for Typer optional default values with Python calls: fixed imports, added GitHub issue marker. * Workaround for Typer optional default values with Python calls: removed forcing of default values for optional arguments in init_config_cli(). Added default values for init_config(). Synchronized default values for init_config_cli() and init_config(). * Workaround for Typer optional default values with Python calls: removed unused import. * Workaround for Typer optional default values with Python calls: fixed usage of optimize in init_config_cli(). * Workaround for Typer optional default values with Python calls: remove output_file from InitDefaultValues. * Workaround for Typer optional default values with Python calls: rename class for default init values. * Workaround for Typer optional default values with Python calls: remove newline. * remove introduced newlines * Remove test_init_config_from_python_without_optional_args(). * remove leftover import * reformat import * remove duplicate Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>	2022-06-17 12:15:36 +02:00
Daniël de Kok	3d3fbeda9f	Update for CBlas changes in Thinc 8.1.0.dev2 (#10970 )	2022-06-16 11:42:34 +02:00
Daniël de Kok	0d352c46ed	vectors: remove use of float as row number (#10955 ) The float -1 was returned rather than the integer -1 as the row for unknown keys. This doesn't introduce a real bug, since such floats cast (without issues) to int in the conversion to NumPy arrays. Still, it's nice to do the correct thing :).	2022-06-15 15:32:02 +02:00
Madeesh Kannan	126d1db123	Add failing test: `test_matcher_extension_in_set_predicate` (#10948 )	2022-06-13 10:56:45 +02:00
Daniël de Kok	a83a501195	precomputable_biaffine: avoid concatenation (#10911 ) The `forward` of `precomputable_biaffine` performs matrix multiplication and then `vstack`s the result with padding. This creates a temporary array used for the output of matrix concatenation. This change avoids the temporary by pre-allocating an array that is large enough for the output of matrix multiplication plus padding and fills the array in-place. This gave me a small speedup (a bit over 100 WPS) on de_core_news_lg on M1 Max (after changing thinc-apple-ops to support in-place gemm as BLIS does).	2022-06-10 18:12:28 +02:00
github-actions[bot]	97e8a5041b	Auto-format code with black (#10945 ) Co-authored-by: explosion-bot <explosion-bot@users.noreply.github.com>	2022-06-10 13:21:33 +02:00
Daniël de Kok	2f05c6824c	Merge remote-tracking branch 'upstream/master' into merge-master-v4-20220609	2022-06-09 10:18:25 +02:00
kadarakos	1bb87f35bc	Detect cycle during projectivize (#10877 ) * detect cycle during projectivize * not complete test to detect cycle in projectivize * boolean to int type to propagate error * use unordered_set instead of set * moved error message to errors * removed cycle from test case * use find instead of count * cycle check: only perform one lookup * Return bool again from _has_head_as_ancestor Communicate presence of cycles through an output argument. * Switch to returning std::pair to encode presence of a cycle The has_cycle pointer is too easy to misuse. Ideally, we would have a sum type like Rust's `Result` here, but C++ is not there yet. * _is_non_proj_arc: clarify what we are returning * _has_head_as_ancestor: remove count We are now explicitly checking for cycles, so the algorithm must always terminate. Either we encounter the head, we find a root, or a cycle. * _is_nonproj_arc: simplify condition * Another refactor using C++ exceptions * Remove unused error code * Print graph with cycle on exception * Include .hh files in source package * Add FIXME comment * cycle detection test * find cycle when starting from problematic vertex Co-authored-by: Daniël de Kok <me@danieldk.eu>	2022-06-08 19:34:11 +02:00


9330 Commits