spaCy

mirror of https://github.com/explosion/spaCy.git synced 2025-04-07 18:54:15 +03:00

Author	SHA1	Message	Date
Denis Bezykornov	7e684ad691	Update russian tokenizer exceptions (#11753 ) * Fix typos, add couple of new abbreviations, remove nonbreaking spaces * Remove space from abbreviation Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>	2022-11-15 11:37:25 +01:00
github-actions[bot]	188a7d00eb	Auto-format code with black (#11792 ) Co-authored-by: explosion-bot <explosion-bot@users.noreply.github.com>	2022-11-11 09:58:31 +01:00
Adriane Boyd	03eebe9d1c	Update warning, add tests for project requirements check (#11777 ) * Update warning, add tests for project requirements check * Make warning more general for differences between PEP 508 and pip * Add tests for _check_requirements * Parameterize test	2022-11-09 10:59:28 +01:00
Raphael Mitsch	20bbbe3e44	Revert disable/disabled merging behavior (#11745 ) * Merge disable with disabled. Adjust warnings, errors and tests. * Replace any() with set operation. * Update spacy/tests/pipeline/test_pipe_methods.py Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com> * Update docs. * Remove reference to config entry nlp.enabled from docs. Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>	2022-11-08 14:58:10 +01:00
Adriane Boyd	e116395f89	Add fallback in requirements check, only check once (#11735 ) * Add fallback in requirements check, only check once * Rename to skip_requirements_check * Update spacy/cli/project/run.py Co-authored-by: Paul O'Leary McCann <polm@dampfkraft.com> Co-authored-by: Paul O'Leary McCann <polm@dampfkraft.com>	2022-11-07 14:46:08 +01:00
Adriane Boyd	e91b47a226	Check for unsafe paths in tarfile.extractall (CVE-2007-4559) (#11746 ) * Adding tarfile member sanitization to extractall() * Format * Simplify and add error message * Fix import * Add comment about CVE Co-authored-by: TrellixVulnTeam <charles.mcfarland@trellix.com>	2022-11-07 10:43:34 +01:00
Adriane Boyd	ea326cf47d	Fix types for Span.id and Span.id_ (#11744 )	2022-11-07 08:11:13 +01:00
github-actions[bot]	bbf64cfc43	Auto-format code with black (#11749 ) Co-authored-by: explosion-bot <explosion-bot@users.noreply.github.com>	2022-11-04 11:17:43 +01:00
Adriane Boyd	40e1000db0	Restore Doc attr getter values in Doc.to_json (#11700 )	2022-11-03 11:49:08 +01:00
Paul O'Leary McCann	db56600536	Fix default parameters for load functions (fix #11706 ) (#11713 ) * Fix default parameters for load functions Some load functions used SimpleFrozenList() directly instead of the _DEFAULT_EMPTY_PIPES parameter. That mostly worked as intended, but the changes in #11459 check for equality using identity, not value, so a warning is incorrectly raised sometimes, as in #11706. This change just has all the load functions use the singleton value instead. * Add test that there are no warnings on module-based load This will succeed due to changes in this branch, but local tests with the latest release failed as intended. * Try reverting commit and see if CI changes There is an error in CI that is probably unrelated. Revert "Fix default parameters for load functions" This reverts commit `dc46b35687`. * Revert "Try reverting commit and see if CI changes" This reverts commit `2514ed07ef`. Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>	2022-11-03 10:52:59 +01:00
Adriane Boyd	420b1d854b	Update textcat scorer threshold behavior (#11696 ) * Update textcat scorer threshold behavior For `textcat` (with exclusive classes) the scorer should always use a threshold of 0.0 because there should be one predicted label per doc and the numeric score for that particular label should not matter. * Rename to test_textcat_multilabel_threshold * Remove all uses of threshold for multi_label=False * Update Scorer.score_cats API docs * Add tests for score_cats with thresholds * Update textcat API docs * Fix types * Convert threshold back to float * Fix threshold type in docstring * Improve formatting in Scorer API docs	2022-11-02 15:35:04 +01:00
Paul O'Leary McCann	d61e742960	Handle Docs with no entities in EntityLinker (#11640 ) * Handle docs with no entities If a whole batch contains no entities it won't make it to the model, but it's possible for individual Docs to have no entities. Before this commit, those Docs would cause an error when attempting to concatenate arrays because the dimensions didn't match. It turns out the process of preparing the Ragged at the end of the span maker forward was a little different from list2ragged, which just uses the flatten function directly. Letting list2ragged do the conversion avoids the dimension issue. This did not come up before because in NEL demo projects it's typical for data with no entities to be discarded before it reaches the NEL component. This includes a simple direct test that shows the issue and checks it's resolved. It doesn't check if there are any downstream changes, so a more complete test could be added. A full run was tested by adding an example with no entities to the Emerson sample project. * Add a blank instance to default training data in tests Rather than adding a specific test, since not failing on instances with no entities is basic functionality, it makes sense to add it to the default set. * Fix without modifying architecture If the architecture is modified this would have to be a new version, but this change isn't big enough to merit that.	2022-10-28 10:25:34 +02:00
Adriane Boyd	865691d169	Adjust default attrs for textcat configs (#11698 )	2022-10-26 08:43:00 +02:00
Adriane Boyd	88d35450dc	Rename test helper method with non-test_ name (#11701 )	2022-10-25 14:53:18 +02:00
github-actions[bot]	84d9cb6b38	Auto-format code with black (#11687 ) Co-authored-by: explosion-bot <explosion-bot@users.noreply.github.com>	2022-10-21 11:54:17 +02:00
Adriane Boyd	7e56701057	Merge remote-tracking branch 'upstream/master' into chore/update-develop-from-master-v3.5	2022-10-20 13:38:49 +02:00
Adriane Boyd	3d0e895363	Set version to v3.4.2 (#11672 )	2022-10-19 17:33:55 +02:00
Edward	d66ccb8eb0	Fix multiple entries per custom extension in doc json (#11551 ) * Fix multiple extensions and character offset * Rename token_start/end to start/end * Refactor Doc.from_json based on review * Iterate over user_data items * Only add non-empty underscore entries Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>	2022-10-19 15:52:47 +02:00
Paul O'Leary McCann	858565a567	Fix issues with DVC commands (#11592 ) * Fix flag handling in dvc Prior to this commit, if a flag (--verbose or --quiet) was passed to DVC, it would be added to the end of the generated dvc command line. This would result in the flag being interpreted as part of the actual command to run, rather than an argument to dvc. This would result in command lines like: spacy project run preprocess --verbose That would fail with an error that there's no such directory as `--verbose`. This change puts the flags at the front of the dvc command so that they are interpreted correctly. It removes the `run_dvc_commands` function, which had been reduced to just a for loop and wasn't used elsewhere. A separate problem is that there's no way to specify the quiet behaviour to dvc from the command line, though it's unclear if that's a bug. * Add dvc quiet flag to docs * Handle case in DVC where no commands are appropriate If you only have commands with no deps or outputs (admittedly unlikely), you get a weird error about the dvc file not existing. This gives explicit output instead. * Add support for quiet flag * Fix command execution Commands are strings now because they're joined further up.	2022-10-18 15:11:39 +09:00
Sofie Van Landeghem	2ce6aadda2	update default configs to recent versions (#11618 )	2022-10-17 12:10:03 +02:00
github-actions[bot]	ceb62352bf	Auto-format code with black (#11649 ) Co-authored-by: explosion-bot <explosion-bot@users.noreply.github.com>	2022-10-14 18:04:55 +09:00
Adriane Boyd	6b5a3e7219	Extend to pydantic v1.10 (#11635 ) * Update types in `spacy.schemas` for updated pydantic+mypy	2022-10-14 08:16:49 +02:00
Sofie Van Landeghem	4d869fcc11	Small fixes to docstrings (#11610 ) * add missing scorer arg to docstring * fix class names in textcat_multilabel * add missing scorer to docstrings	2022-10-12 15:17:40 +02:00
Adriane Boyd	fe06e037bc	Fix init for pymorphy2_lookup lemmatizer mode (#11631 )	2022-10-12 12:18:39 +02:00
Sofie Van Landeghem	29649589fc	remove dtype (#11615 )	2022-10-11 15:25:05 +02:00
Sofie Van Landeghem	ef74f8f5e4	Fix mypy error in edittree lemmatizer (#11612 ) * cleanup imports * try limiting Thinc to previous release * remove Model specification * fix code and revert Thinc constraint	2022-10-11 14:15:22 +02:00
svlandeg	9c8cdb403e	Merge branch 'master_copy' into develop_copy	2022-09-30 15:40:26 +02:00
Sofie Van Landeghem	bcda8bc1e7	update mypy to latest version (#11546 ) * update mypy and disable it for python 3.6 * ignoring mypy's type redefinition error	2022-09-29 14:24:40 +02:00
Adriane Boyd	6d7630c5d3	Allow overriding spacy_version in spacy package meta (#11552 )	2022-09-29 10:44:06 +02:00
Peter Baumgartner	e794d4ae39	`debug data` Spancat Table Improvements (#11504 ) * update * fix format function * pull out _format_number * format with black	2022-09-28 17:16:05 +02:00
Raphael Mitsch	aea16719be	Simplify and clarify enable/disable behavior of spacy.load() (#11459 ) * Change enable/disable behavior so that arguments take precedence over config options. Extend error message on conflict. Add warning message in case of overwriting config option with arguments. * Fix tests in test_serialize_pipeline.py to reflect changes to handling of enable/disable. * Fix type issue. * Move comment. * Move comment. * Issue UserWarning instead of printing wasabi message. Adjust test. * Added pytest.warns(UserWarning) for expected warning to fix tests. * Update warning message. * Move type handling out of fetch_pipes_status(). * Add global variable for default value. Use id() to determine whether used values are default value. * Fix default value for disable. * Rename DEFAULT_PIPE_STATUS to _DEFAULT_EMPTY_PIPES.	2022-09-27 14:22:36 +02:00
Jacobo Myerston	3e8bc1272f	add punctuation to grc (#11426 ) * add punctuation to grc Add support for special editorial punctuation that is common in ancient Greek texts. Ancient Greek texts, as found in digital and print form, have been largely edited by scholars. Restorations and improvements are normally marked with special characters that need to be handled properly by the tokenizer. * add unit tests * simplify regex * move generic quotes to char classes * rename unit test * fix regex Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com> Co-authored-by: svlandeg <svlandeg@github.com> Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>	2022-09-27 11:38:56 +02:00
Adriane Boyd	877671e09a	Preserve missing entity annotation in augmenters (#11540 ) Preserve both `-` and `O` annotation in augmenters rather than relying on `Example.to_dict`'s default support for one option outside of labeled entity spans. This is intended as a temporary workaround for augmenters for v3.4.x. The behavior of `Example` and related IOB utils could be improved in the general case for v3.5.	2022-09-27 10:16:51 +02:00
Richard Hudson	6f692a06d5	Remove side effects from Doc.__init__() (#11506 ) * Remove side effects from Doc.__init__() * Changes based on review comment * Readd test * Change interface of Doc.__init__() * Simplify test Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com> * Update doc.md Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>	2022-09-26 15:58:21 +02:00
Raphael Mitsch	af9b01ef97	Add dependency check to project step runs (#11226 ) * Add dependency check to project step running. * Fix dependency mismatch warning. * Remove newline. * Add types-setuptools to setup.cfg. * Move types-setuptools to test requirements. Move warnings into _validate_requirements(). Handle file reading in project_run(). * Remove newline formatting for output of package conflicts. * Show full version conflict message instead of just package name. * Update spacy/cli/project/run.py Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com> * Fix typo. * Re-add rephrasing of message for conflicting packages. Remove requirements path redundancy. * Update spacy/cli/project/run.py Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com> * Update spacy/cli/project/run.py Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com> * Print unified message for requirement conflicts and missing requirements. * Update spacy/cli/project/run.py Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com> * Fix warning message. * Print conflict/missing messages individually. * Print conflict/missing messages individually. * Add check_requirements setting in project.yml to disable requirements check. * Update website/docs/usage/projects.md Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com> * Update website/docs/usage/projects.md Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com> * Update description of project.yml structure in projects.md. * Update website/docs/usage/projects.md Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> * Prettify projects docs. Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com> Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>	2022-09-16 16:54:31 +02:00
github-actions[bot]	279358be63	Auto-format code with black (#11513 ) Co-authored-by: explosion-bot <explosion-bot@users.noreply.github.com>	2022-09-16 11:50:19 +02:00
Sofie Van Landeghem	0509f90874	add dot (#11500 )	2022-09-15 17:29:42 +02:00
Adriane Boyd	7c98245c0c	Add levenshtein from polyleven (#11418 ) Add a simple levenshtein distance function using the implementation from the polyleven library as `spacy.matcher.levenshtein`.	2022-09-14 17:05:22 +02:00
Sofie Van Landeghem	cc10a27c59	Prevent tok2vec to broadcast to listeners when predicting (#11385 ) * replicate bug with tok2vec in annotating components * add overfitting test with a frozen tok2vec * remove broadcast from predict and check doc.tensor instead * remove broadcast * proper error * slight rephrase of documentation	2022-09-12 15:36:48 +02:00
Madeesh Kannan	0ec9a696e6	Fix config validation failures caused by NVTX pipeline wrappers (#11460 ) * Enable Cython<->Python bindings for `Pipe` and `TrainablePipe` methods * `pipes_with_nvtx_range`: Skip hooking methods whose signature cannot be ascertained When loading pipelines from a config file, the arguments passed to individual pipeline components are validated by `pydantic` during init. For this, the validation model attempts to parse the function signature of the component's c'tor/entry point so that it can check if all mandatory parameters are present in the config file. When using the `models_and_pipes_with_nvtx_range` as an `after_pipeline_creation` callback, the methods of all pipeline components get replaced by an NVTX range wrapper before the above-mentioned validation takes place. This can be problematic for components that are implemented as Cython extension types - if the extension type is not compiled with Python bindings for its methods, they will have no signatures at runtime. This resulted in `pydantic` matching the wrapper's parameters with those in the config and raising errors. To avoid this, we now skip applying the wrapper to any (Cython) methods that do not have signatures.	2022-09-12 14:55:41 +02:00
kadarakos	6b83fee58d	Assets message (#11458 ) * new error message when 'project run assets' * new error message when 'project run assets' * Update spacy/cli/project/run.py Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>	2022-09-09 17:17:10 +02:00
Adriane Boyd	8a86a35eab	Remove has_letters in config template (#11465 ) Due to problems with the javascript conversion in the website quickstart, remove the `has_letters` setting to simplify generating `attrs` for the default `tok2vec`. Additionally reduce `PREFIX` as in the trained pipelines.	2022-09-09 15:10:04 +02:00
github-actions[bot]	0c72c6bb2c	Auto-format code with black (#11468 ) Co-authored-by: explosion-bot <explosion-bot@users.noreply.github.com>	2022-09-09 11:21:17 +02:00
Raphael Mitsch	1f23c615d7	Refactor KB for easier customization (#11268 ) * Add implementation of batching + backwards compatibility fixes. Tests indicate issue with batch disambiguation for custom singular entity lookups. * Fix tests. Add distinction w.r.t. batch size. * Remove redundant and add new comments. * Adjust comments. Fix variable naming in EL prediction. * Fix mypy errors. * Remove KB entity type config option. Change return types of candidate retrieval functions to Iterable from Iterator. Fix various other issues. * Update spacy/pipeline/entity_linker.py Co-authored-by: Paul O'Leary McCann <polm@dampfkraft.com> * Update spacy/pipeline/entity_linker.py Co-authored-by: Paul O'Leary McCann <polm@dampfkraft.com> * Update spacy/kb_base.pyx Co-authored-by: Paul O'Leary McCann <polm@dampfkraft.com> * Update spacy/kb_base.pyx Co-authored-by: Paul O'Leary McCann <polm@dampfkraft.com> * Update spacy/pipeline/entity_linker.py Co-authored-by: Paul O'Leary McCann <polm@dampfkraft.com> * Add error messages to NotImplementedErrors. Remove redundant comment. * Fix imports. * Remove redundant comments. * Rename KnowledgeBase to InMemoryLookupKB and BaseKnowledgeBase to KnowledgeBase. * Fix tests. * Update spacy/errors.py Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> * Move KB into subdirectory. * Adjust imports after KB move to dedicated subdirectory. * Fix config imports. * Move Candidate + retrieval functions to separate module. Fix other, small issues. * Fix docstrings and error message w.r.t. class names. Fix typing for candidate retrieval functions. * Update spacy/kb/kb_in_memory.pyx Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> * Update spacy/ml/models/entity_linker.py Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> * Fix typing. * Change typing of mentions to be Span instead of Union[Span, str]. * Update docs. * Update EntityLinker and _architecture docs. * Update website/docs/api/entitylinker.md Co-authored-by: Paul O'Leary McCann <polm@dampfkraft.com> * Adjust message for E1046. * Re-add section for Candidate in kb.md, add reference to dedicated page. * Update docs and docstrings. * Re-add section + reference for KnowledgeBase.get_alias_candidates() in docs. * Update spacy/kb/candidate.pyx * Update spacy/kb/kb_in_memory.pyx * Update spacy/pipeline/legacy/entity_linker.py * Remove canididate.md. Remove mistakenly added config snippet in entity_linker.py. Co-authored-by: Paul O'Leary McCann <polm@dampfkraft.com> Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>	2022-09-08 10:38:07 +02:00
Sofie Van Landeghem	d801cccd38	Merge pull request #11430 from rmitsch/chore/synch-develop Synch develop with master	2022-09-05 15:07:18 +02:00
Paul O'Leary McCann	977dc33312	Add a way to get the URL to download a pipeline to the CLI (#11175 ) * Add a dry run flag to download * Remove --dry-run, add --url option to `spacy info` instead * Make mypy happy * Print only the URL, so it's easier to use in scripts * Don't add the egg hash unless downloading an sdist * Update spacy/cli/info.py Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> * Add two implementations of requirements * Clean up requirements sample slightly This should make mypy happy * Update URL help string * Remove requirements option * Add url option to docs * Add URL to spacy info model output, when available * Add types-setuptools to testing reqs * Add types-setuptools to requirements * Add "compatible", expand docstring * Update spacy/cli/info.py Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com> * Run prettier on CLI docs * Update docs Add a sidebar about finding download URLs, with some examples of the new command. * Add download URLs to table on model page * Apply suggestions from code review Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com> * Updates from review * download url -> download link * Update docs Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>	2022-09-02 11:58:21 +02:00
github-actions[bot]	71884d0942	Auto-format code with black (#11427 ) Co-authored-by: explosion-bot <explosion-bot@users.noreply.github.com>	2022-09-02 11:43:20 +02:00
Madeesh Kannan	d1760ebe02	Better handling of unexpected types in `SetPredicate` (#11312 ) * `Matcher`: Better type checking of values in `SetPredicate` `SetPredicate`: Emit warning and return `False` on unexpected value types * Rename `value_type_mismatch` variable * Inline warning * Remove unexpected type warning from `_SetPredicate` * Ensure that `str` values are not interpreted as sequences Check elements of sequence values for convertibility to `str` or `int` * Add more `INTERSECT` and `IN` test cases * Test for inputs with multiple characters * Return `False` early instead of using a boolean flag * Remove superfluous `int` check, parentheses * Apply suggestions from code review Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com> * Apply suggestions from code review * Clarify test comment Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>	2022-09-02 09:09:48 +02:00
Adriane Boyd	78f5503a29	Check for any non-Doc returned value for components (#11424 )	2022-09-01 19:37:23 +02:00
Madeesh Kannan	604a7c3c26	`SpanGroup(s)`-related optimizations (#11380 ) * `SpanGroup`: Add support for binding copies to a new reference document * `SpanGroups`: Replace superfluous serialize-deserialize roundtrip in `copy` Instead, directly copy the in-memory representations of the constituent `SpanGroup`s. * Update `SpanGroup.copy()` signature * Rename `new_doc` param to `doc` * Fix kwdarg * Update `.pyi` file and docstrings * `mypy` fix * Update spacy/tokens/span_group.pyx * Update docs Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>	2022-08-31 09:03:20 +02:00
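
Commit e91b47a226 in the list above describes checking for unsafe paths before calling `tarfile.extractall` (CVE-2007-4559). The following is a minimal sketch of that kind of check, not the code from the commit. The helper names `is_within_directory` and `safe_extract` are hypothetical, and the sketch only rejects members whose path would resolve outside the target directory; it does not inspect symlinks.

```python
import os
import tarfile


def is_within_directory(directory: str, target: str) -> bool:
    """Return True if `target` resolves to a path inside `directory`."""
    abs_directory = os.path.abspath(directory)
    abs_target = os.path.abspath(target)
    # The common path of both equals the directory only if target is inside it.
    return os.path.commonpath([abs_directory, abs_target]) == abs_directory


def safe_extract(tar: tarfile.TarFile, path: str) -> None:
    """Extract `tar` into `path`, refusing members that would escape it.

    Hypothetical helper for illustration; raises ValueError on the first
    member whose joined path falls outside `path`.
    """
    for member in tar.getmembers():
        member_path = os.path.join(path, member.name)
        if not is_within_directory(path, member_path):
            raise ValueError(f"Unsafe path in tar archive: {member.name}")
    tar.extractall(path)
```

A member named `../evil.txt` or an absolute path such as `/etc/passwd` would be rejected before anything is written to disk.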


9187 Commits