Commit Graph

9313 Commits

Author SHA1 Message Date
github-actions[bot]
abb0ab109d
Auto-format code with black (#12035)
Co-authored-by: explosion-bot <explosion-bot@users.noreply.github.com>
2023-01-02 11:59:57 +01:00
Adriane Boyd
ef9e504eac
Rename modified textcat scorer to v2 (#11971)
As a follow-up to #11696, rename the modified scorer to v2 and move the
v1 scorer to `spacy-legacy`.
2022-12-29 14:01:08 +01:00
Daniël de Kok
20b63943f5
Adjust to new Schedule class and pass scores to Optimizer (#12008)
* Adjust to new `Schedule` class and pass scores to `Optimizer`

Requires https://github.com/explosion/thinc/pull/804

* Bump minimum Thinc requirement to 9.0.0.dev1
2022-12-29 08:03:24 +01:00
kadarakos
933b54ac79
typo fix (#11995) 2022-12-26 13:26:35 +01:00
Madeesh Kannan
aa2b471a6e
New console logger with expanded progress tracking (#11972)
* Add `ConsoleLogger.v3`

This addition expands the progress bar feature to count up the training/distillation steps to either the next evaluation pass or the maximum number of steps.

* Rename progress bar types

* Add defaults to docs
Minor fixes

* Move comment

* Minor punctuation fixes

* Explicitly check for `None` when validating progress bar type

Co-authored-by: Paul O'Leary McCann <polm@dampfkraft.com>
2022-12-23 15:21:44 +01:00
github-actions[bot]
90896504a5
Auto-format code with black (#12019)
Co-authored-by: explosion-bot <explosion-bot@users.noreply.github.com>
2022-12-23 12:44:07 +01:00
Daniël de Kok
2f08deea2a Fix fallout from a previous merge 2022-12-22 10:23:31 +01:00
Daniël de Kok
207565a788 Merge remote-tracking branch 'upstream/master' into chore/v4-merge-master-20221222 2022-12-22 10:08:54 +01:00
Raphael Mitsch
eef3d950b4
Fix SpanGroup and Span typing (#12009)
* Correct Span.label, Span.kb_id types. Fix SpanGroup.__iter__().

* Extend test.

* Rename test. Fix typo.

* Add comment.

* Fix types for Span.label, Span.kb_id, Span.char_span().

* Update spacy/tests/doc/test_span_group.py

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

* Update docs.

* Fix typo.

* Update spacy/tokens/span_group.pyx

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
2022-12-21 18:54:27 +01:00
kadarakos
c223cd7a86
Add apply CLI (#11376)
* annotate cli first try

* add batch-size and n_process

* rename to apply

* typing fix

* handle file suffixes

* walk directories

* support jsonl

* typing fix

* remove debug

* make suffix optional for walk

* revert unrelated

* don't warn but raise

* better error message

* minor touch up

* Update spacy/tests/test_cli.py

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

* Update spacy/cli/apply.py

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Update spacy/cli/apply.py

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* update tests and bugfix

* add force_overwrite

* typo

* fix adding .spacy suffix

* Update spacy/cli/apply.py

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Update spacy/cli/apply.py

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Update spacy/cli/apply.py

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* store user data and rename cmd arg

* include test for user attr

* rename cmd arg

* better help message

* documentation

* prettier

* black

* link fix

* Update spacy/cli/apply.py

Co-authored-by: Paul O'Leary McCann <polm@dampfkraft.com>

* Update website/docs/api/cli.md

Co-authored-by: Paul O'Leary McCann <polm@dampfkraft.com>

* Update website/docs/api/cli.md

Co-authored-by: Paul O'Leary McCann <polm@dampfkraft.com>

* Update website/docs/api/cli.md

Co-authored-by: Paul O'Leary McCann <polm@dampfkraft.com>

* addressing reviews

* dont quit but warn

* prettier

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
Co-authored-by: Paul O'Leary McCann <polm@dampfkraft.com>
2022-12-20 17:11:33 +01:00
Jos Polfliet
18ffe5bbd6
Update stop_words.py (#11997)
fix typo in "aangaande"
2022-12-19 16:17:49 +01:00
Daniël de Kok
f9308aae13
Fix v4 branch to build against Thinc v9 (#11921)
* Move `thinc.extra.search` to `spacy.pipeline._parser_internals`

Backport of:
https://github.com/explosion/spaCy/pull/11317

Co-authored-by: Madeesh Kannan <shadeMe@users.noreply.github.com>

* Replace references to `thinc.backends.linalg` with `CBlas`

Backport of:
https://github.com/explosion/spaCy/pull/11292

Co-authored-by: Madeesh Kannan <shadeMe@users.noreply.github.com>

* Use cross entropy from `thinc.legacy`

* Require thinc>=9.0.0.dev0,<9.1.0

Co-authored-by: Madeesh Kannan <shadeMe@users.noreply.github.com>
2022-12-17 14:32:19 +01:00
Edward
ca75190a3d
Custom extensions for spans with equal boundaries (#11429)
* Init

* Fix return type for mypy

* adjust types and improve setting new attributes

* Add underscore changes to json conversion

* Add test and underscore changes to from_docs

* add underscore changes and test to span.to_doc

* update return values

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Add types to function

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* adjust formatting

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* shorten return type

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* add helper function to improve readability

* Improve code and add comments

* rerun azure tests

* Fix tests for json conversion

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
2022-12-12 08:55:53 +01:00
Adriane Boyd
0591e67265
Cast to uint64 for all array-based doc representations (#11933)
* Convert all individual values explicitly to uint64 for array-based doc representations

* Temporarily test with latest numpy v1.24.0rc

* Remove unnecessary conversion from attr_t

* Reduce number of individual casts

* Convert specifically from int32 to uint64

* Revert "Temporarily test with latest numpy v1.24.0rc"

This reverts commit eb0e3c5006.

* Also use int32 in tests
2022-12-12 08:45:35 +01:00
github-actions[bot]
f22fc7a113
Auto-format code with black (#11955)
Co-authored-by: explosion-bot <explosion-bot@users.noreply.github.com>
2022-12-09 10:15:52 +01:00
Madeesh Kannan
f5aabaf7d6
Remove unused, experimental multi-task components (#11919)
* Remove experimental multi-task components

These are incomplete implementations and are not usable in their current state.

* Remove orphaned error message

* Switch ubuntu-latest to ubuntu-20.04 in main tests (#11928)

* Switch ubuntu-latest to ubuntu-20.04 in main tests

* Only use 20.04 for 3.6

* Revert "Switch ubuntu-latest to ubuntu-20.04 in main tests (#11928)"

This reverts commit 77c0fd7b17.

Co-authored-by: Paul O'Leary McCann <polm@dampfkraft.com>
2022-12-08 13:24:45 +01:00
Paul O'Leary McCann
d60997febb
Remove old model shortcuts (#11916)
* Remove old model shortcuts

* Remove error, docs warnings about shortcuts

* Fix import in util

Accidentally deleted the whole import and not just the old part...

* Change universe example to v3 style

* Switch ubuntu-latest to ubuntu-20.04 in main tests (#11928)

* Switch ubuntu-latest to ubuntu-20.04 in main tests

* Only use 20.04 for 3.6

* Update some model loading in Universe

* Add v2 tag to neuralcoref

* Use the spacy-version feature instead of a v2 tag

Co-authored-by: svlandeg <svlandeg@github.com>
2022-12-08 11:45:52 +01:00
Paul O'Leary McCann
6b9af38eeb
Remove all references to "begin_training" (#11943)
When v3 was released, `begin_training` was renamed to `initialize`.
There were warnings in the code and docs about that. This PR removes
them.
2022-12-08 11:43:52 +01:00
Paul O'Leary McCann
5c3a60e8f4
Add in errors used in the beam code that were removed at some point (#11935)
I don't think there's any way to use the beam code at the moment, but as
long as it's around the errors it refers to should also be present.
2022-12-07 15:52:35 +01:00
Daniël de Kok
27fac7df2e
EditTreeLemmatizer: correctly add strings when initializing from labels (#11934)
Strings in replacement nodes were not added to the `StringStore`
when `EditTreeLemmatizer` was initialized from a set of labels. The
corresponding test did not capture this because it added the strings
through the examples that were passed to the initialization.

This change fixes both the bug in the initialization and the 'shadowing'
of the bug in the test.
2022-12-07 13:53:41 +09:00
Zhangrp
23085ffef4
Fix interpolation in directory names, see #11235. (#11914) 2022-12-06 17:42:12 +09:00
Adriane Boyd
8afa8b5a7b
Refactor kwargs in CLI msg for future wasabi compatibility (#11918)
Necessary for mypy with wasabi v1+.
2022-12-05 10:00:00 +01:00
svlandeg
04fea09ffd Merge branch 'copy_master' into copy_v4 2022-12-05 08:56:15 +01:00
github-actions[bot]
df0cb4b77b
Auto-format code with black (#11913)
Co-authored-by: explosion-bot <explosion-bot@users.noreply.github.com>
2022-12-02 14:49:12 +01:00
Paul O'Leary McCann
f9d17a644b
Config generation fails for GPU without transformers (#11899)
If you don't have spacy-transformers installed, but try to use `init
config` with the GPU flag, you'll get an error. The issue is that the
`use_transformers` flag in the config is conflated with the GPU flag,
and then there's an attempt to access transformers config info that may
not exist.

There may be a better way to do this, but this stops the error.
2022-12-02 10:17:11 +01:00
Adriane Boyd
445c670a2d
Fix spancat for zero suggestions (#11860)
* Add test for spancat predict with zero suggestions

* Fix spancat for zero suggestions

* Undo changes to extract_spans

* Use .sum() as in update
2022-12-02 09:33:52 +01:00
Adriane Boyd
6f9d630f7e
Replace Pipe type with Callable in Language (#11803)
* Replace Pipe type with Callable in Language

* Use Callable[[Doc], Doc] in the docstrings
2022-11-29 13:20:08 +01:00
Paul O'Leary McCann
f1e0243450
Remove macro auc per type from textcat defaults (#11887)
This appears to have been added by mistake and never used. Removing it
does not break validation.
2022-11-29 11:50:23 +01:00
Adriane Boyd
e0d43557b7
Merge pull request #11871 from adrianeboyd/chore/v3.5.0
Prepare for v3.5.0
2022-11-29 11:41:32 +01:00
Adriane Boyd
1ebe7db07c
Support local filesystem remotes for projects (#11762)
* Support local filesystem remotes for projects

* Fix support for local filesystem remotes for projects
  * Use `FluidPath` instead of `Pathy` to support both filesystem and
    remote paths
  * Create missing parent directories if required for local filesystem
  * Add a more general `_file_exists` method to support `Pathy`,
    `Path`, and `smart_open`-compatible URLs
* Add explicit `smart_open` dependency starting with support for
  `compression` flag
* Update `pathy` dependency to exclude older versions that aren't
  compatible with required `smart_open` version
* Update docs to refer to `Pathy` instead of `smart_open` for project
  remotes (technically you can still push to any `smart_open`-compatible
  path but you can't pull from them)
* Add tests for local filesystem remotes

* Update pathy for general BlobStat sorting

* Add import

* Remove _file_exists since only Pathy remotes are supported

* Format CLI docs

* Clean up merge
2022-11-29 11:40:58 +01:00
Paul O'Leary McCann
f54bfb56c9
Don't throw an error if using displacy on an unset span key (#11845)
* Don't throw an error if using displacy on an unset span key

* List available keys in W117
2022-11-28 10:01:09 +01:00
Adriane Boyd
681ec20914
Add smart_open requirement, update deprecated options (#11864)
* Switch from deprecated `ignore_ext` to `compression`
* Add upload/download test for local files
2022-11-25 13:00:57 +01:00
Adriane Boyd
32396e0bda Set version to v3.5.0 2022-11-25 12:05:25 +01:00
Adriane Boyd
378db0eb1e Temporarily skip tests that require models/compat 2022-11-25 12:05:25 +01:00
Raphael Mitsch
c0fd8a2e71
find-threshold: CLI command for multi-label classifier threshold tuning (#11280)
* Add foundation for find-threshold CLI functionality.

* Finish first draft for find-threshold.

* Add tests.

* Revert adjusted import statements.

* Fix mypy errors.

* Fix imports.

* Harmonize arguments with spacy evaluate command.

* Generalize component and threshold handling. Harmonize arguments with 'spacy evaluate' CLI.

* Fix Spancat test.

* Add beta parameter to Scorer and PRFScore.

* Make beta a component scorer setting.

* Remove beta.

* Update nlp.config (workaround).

* Reload pipeline on threshold change. Adjust tests. Remove confection reference.

* Remove assumption of component being a Pipe object or having a .cfg attribute.

* Adjust test output and reference values.

* Remove beta references. Delete universe.json.

* Reverting unnecessary changes. Removing unused default values. Renaming variables in find-cli tests.

* Update spacy/cli/find_threshold.py

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

* Remove adding labels in tests.

* Remove unused error

* Undo changes to PRFScorer

* Change default value for n_trials. Log table iteratively.

* Add warnings for pointless applications of find_threshold().

* Fix imports.

* Adjust type check of TextCategorizer to exclude subclasses.

* Change the check for whether there's only one unique value in scores.

* Update spacy/cli/find_threshold.py

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Incorporate feedback.

* Fix test issue. Update docstring.

* Update docs & docstring.

* Update spacy/tests/test_cli.py

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

* Add examples to docs. Rename _nlp to nlp in tests.

* Update spacy/cli/find_threshold.py

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Update spacy/cli/find_threshold.py

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
2022-11-25 11:44:55 +01:00
Adriane Boyd
30d31fd335
Update Russian and Ukrainian lemmatizers (#11811)
* pymorphy2 issues #11620, #11626, #11625:
- #11620: pymorphy2_lookup
- #11626: handle multiple forms pointing to the same normal form + handling empty POS tag
- #11625: matching DET that are labelled as PRON by pymorphy2

* Move lemmatizer algorithm changes back into RussianLemmatizer

* Fix uk pymorphy3_lookup mode init

* Move and update tests for ru/uk lookup lemmatizer modes

* Fix typo

* Remove traces of previous behavior for uninflected POS

* Refactor to private generic-looking pymorphy methods

* Remove xfailed uk lemmatizer cases

* Update spacy/lang/ru/lemmatizer.py

Co-authored-by: Richard Hudson <richard@explosion.ai>

Co-authored-by: Dmytro S Lituiev <d.lituiev@gmail.com>
Co-authored-by: Richard Hudson <richard@explosion.ai>
2022-11-25 11:12:46 +01:00
Adriane Boyd
8f062b849c
Fix Matcher cython profile=True header (#11867) 2022-11-24 16:03:42 +01:00
Madeesh Kannan
5ea14af32b
Add training.before_update callback (#11739)
* Add `training.before_update` callback

This callback can be used to implement training paradigms like gradual (un)freezing of components (e.g., the Transformer) after a certain number of training steps to mitigate catastrophic forgetting during fine-tuning.

* Fix type annotation, default config value

* Generalize arguments passed to the callback

* Update schema

* Pass `epoch` to callback, rename `current_step` to `step`

* Add test

* Simplify test

* Replace config string with `spacy.blank`

* Apply suggestions from code review

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

* Cleanup imports

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
2022-11-23 17:54:58 +01:00
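
Note on 5ea14af32b: the gradual-unfreezing idea from the commit message can be sketched in plain Python. This is a minimal illustration, not code from the PR. It assumes the callback receives the pipeline object plus a dict containing `step` and `epoch` (the names the commit mentions). The factory name `create_unfreeze_callback` and the `frozen` set are hypothetical stand-ins for whatever state a real training loop would consult.

```python
from typing import Any, Callable, Dict, Set


def create_unfreeze_callback(
    component: str, unfreeze_at_step: int, frozen: Set[str]
) -> Callable[[Any, Dict[str, Any]], None]:
    """Build a callback that unfreezes `component` once enough steps have run.

    `frozen` is a hypothetical set of frozen component names that the
    training loop is assumed to check before each update.
    """

    def before_update(nlp: Any, args: Dict[str, Any]) -> None:
        # Assumption: args carries the current "step" (and "epoch").
        if args["step"] >= unfreeze_at_step:
            frozen.discard(component)

    return before_update
```

Calling the returned function with `{"step": 0, "epoch": 0}` leaves the component frozen; once `step` reaches `unfreeze_at_step`, the name is removed from the set.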
Edward
e79910d57e
Remove sentiment extension (#11722)
* remove sentiment attribute

* remove sentiment from docs

* add test for backwards compatibility

* replace from_disk with from_bytes

* Fix docs and format file

* Fix formatting
2022-11-23 13:09:32 +01:00
Paul O'Leary McCann
f1ddac187d
Remove unused error object (#11837) 2022-11-23 10:51:31 +01:00
Marco Edward Gorelli
f0d8309a28
fix comparison of constants (#11834)
Co-authored-by: MarcoGorelli <>
2022-11-21 08:12:03 +01:00
github-actions[bot]
89bfd06fbd
Auto-format code with black (#11826)
Co-authored-by: explosion-bot <explosion-bot@users.noreply.github.com>
2022-11-18 18:24:13 +09:00
Adriane Boyd
a83463c5e0
Add transformer recommendation for ca (#11819)
Model recommendation from @cayorodriguez.
2022-11-18 08:15:27 +01:00
Paul O'Leary McCann
75bb7ad541
Check textcat values for validity (#11763)
* Check textcat values for validity

* Fix error numbers

* Clean up vals reference

* Check category value validity through training

The _validate_categories is called in update, which for multilabel is
inherited from the single label component.

* Formatting
2022-11-17 10:25:01 +01:00
Paul O'Leary McCann
c0c54e44bc
Add equality definition for vectors (#11806)
* Add equality definition for vectors

This re-uses the check from sourcing components.

* Use the equality check

* Format

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
2022-11-16 09:44:42 +01:00
Sofie Van Landeghem
caa9efad59
prevent rewriting an already raw URL (#11810) 2022-11-15 14:15:00 +01:00
Denis Bezykornov
7e684ad691
Update russian tokenizer exceptions (#11753)
* Fix typos, add couple of new abbreviations, remove nonbreaking spaces

* Remove space from abbreviation

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
2022-11-15 11:37:25 +01:00
github-actions[bot]
188a7d00eb
Auto-format code with black (#11792)
Co-authored-by: explosion-bot <explosion-bot@users.noreply.github.com>
2022-11-11 09:58:31 +01:00
Adriane Boyd
03eebe9d1c
Update warning, add tests for project requirements check (#11777)
* Update warning, add tests for project requirements check

* Make warning more general for differences between PEP 508 and pip
* Add tests for _check_requirements

* Parameterize test
2022-11-09 10:59:28 +01:00
Raphael Mitsch
20bbbe3e44
Revert disable/disabled merging behavior (#11745)
* Merge disable with disabled. Adjust warnings, errors and tests.

* Replace any() with set operation.

* Update spacy/tests/pipeline/test_pipe_methods.py

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

* Update docs.

* Remove reference to config entry nlp.enabled from docs.

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
2022-11-08 14:58:10 +01:00
Adriane Boyd
e116395f89
Add fallback in requirements check, only check once (#11735)
* Add fallback in requirements check, only check once

* Rename to skip_requirements_check

* Update spacy/cli/project/run.py

Co-authored-by: Paul O'Leary McCann <polm@dampfkraft.com>

Co-authored-by: Paul O'Leary McCann <polm@dampfkraft.com>
2022-11-07 14:46:08 +01:00
Adriane Boyd
e91b47a226
Check for unsafe paths in tarfile.extractall (CVE-2007-4559) (#11746)
* Adding tarfile member sanitization to extractall()

* Format

* Simplify and add error message

* Fix import

* Add comment about CVE

Co-authored-by: TrellixVulnTeam <charles.mcfarland@trellix.com>
2022-11-07 10:43:34 +01:00
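
Note on e91b47a226: the kind of member sanitization this commit describes can be sketched as below. This is a minimal illustration of guarding `tarfile.extractall` against path traversal, not the code merged in the PR. The helper names `is_within_directory` and `safe_extract` are assumptions, and the sketch only checks member names (it does not inspect symlink targets).

```python
import os
import tarfile


def is_within_directory(directory: str, target: str) -> bool:
    """Return True if `target` resolves to a location inside `directory`."""
    abs_directory = os.path.abspath(directory)
    abs_target = os.path.abspath(target)
    # commonpath equals the directory only when target lies at or below it.
    return os.path.commonpath([abs_directory, abs_target]) == abs_directory


def safe_extract(tar: tarfile.TarFile, path: str) -> None:
    """Extract all members, refusing any whose name escapes `path`."""
    for member in tar.getmembers():
        member_path = os.path.join(path, member.name)
        if not is_within_directory(path, member_path):
            raise ValueError(f"Unsafe path in tar archive: {member.name}")
    tar.extractall(path)
```

Checking every member before extracting anything means a single unsafe entry rejects the whole archive, rather than leaving a partially extracted tree behind.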
Adriane Boyd
ea326cf47d
Fix types for Span.id and Span.id_ (#11744) 2022-11-07 08:11:13 +01:00
github-actions[bot]
bbf64cfc43
Auto-format code with black (#11749)
Co-authored-by: explosion-bot <explosion-bot@users.noreply.github.com>
2022-11-04 11:17:43 +01:00
Adriane Boyd
40e1000db0
Restore Doc attr getter values in Doc.to_json (#11700) 2022-11-03 11:49:08 +01:00
Paul O'Leary McCann
db56600536
Fix default parameters for load functions (fix #11706) (#11713)
* Fix default parameters for load functions

Some load functions used SimpleFrozenList() directly instead of the
_DEFAULT_EMPTY_PIPES parameter. That mostly worked as intended, but
the changes in #11459 check for equality using identity, not value, so a
warning is incorrectly raised sometimes, as in #11706.

This change just has all the load functions use the singleton value
instead.

* Add test that there are no warnings on module-based load

This will succeed due to changes in this branch, but local tests with
the latest release failed as intended.

* Try reverting commit and see if CI changes

There is an error in CI that is probably unrelated.

Revert "Fix default parameters for load functions"

This reverts commit dc46b35687.

* Revert "Try reverting commit and see if CI changes"

This reverts commit 2514ed07ef.

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
2022-11-03 10:52:59 +01:00
Adriane Boyd
68b8fa2df2 Merge remote-tracking branch 'upstream/master' into chore/update-v4-from-master-4 2022-11-03 09:42:36 +01:00
Adriane Boyd
420b1d854b
Update textcat scorer threshold behavior (#11696)
* Update textcat scorer threshold behavior

For `textcat` (with exclusive classes) the scorer should always use a
threshold of 0.0 because there should be one predicted label per doc and
the numeric score for that particular label should not matter.

* Rename to test_textcat_multilabel_threshold

* Remove all uses of threshold for multi_label=False

* Update Scorer.score_cats API docs

* Add tests for score_cats with thresholds

* Update textcat API docs

* Fix types

* Convert threshold back to float

* Fix threshold type in docstring

* Improve formatting in Scorer API docs
2022-11-02 15:35:04 +01:00
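
Note on 420b1d854b: the reasoning about exclusive classes can be illustrated with a small sketch. It is not the implementation of `Scorer.score_cats`. The function `predicted_labels` and its parameters are hypothetical, and picking the highest-scoring label is used here as a simple way to model "one predicted label per doc" regardless of its numeric score.

```python
from typing import Dict, Set


def predicted_labels(
    scores: Dict[str, float], multi_label: bool, threshold: float = 0.5
) -> Set[str]:
    """Turn per-label scores into a set of predicted labels."""
    if multi_label:
        # Independent labels: each one is kept only if it clears the threshold.
        return {label for label, score in scores.items() if score >= threshold}
    # Exclusive classes: exactly one label, the highest-scoring one,
    # no matter how low its numeric score is.
    return {max(scores, key=scores.get)}
```

With scores `{"A": 0.4, "B": 0.35, "C": 0.25}`, a threshold of 0.5 would leave the multi-label case with no prediction, whereas the exclusive case still yields `{"A"}`.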
Paul O'Leary McCann
d61e742960
Handle Docs with no entities in EntityLinker (#11640)
* Handle docs with no entities

If a whole batch contains no entities it won't make it to the model, but
it's possible for individual Docs to have no entities. Before this
commit, those Docs would cause an error when attempting to concatenate
arrays because the dimensions didn't match.

It turns out the process of preparing the Ragged at the end of the span
maker forward was a little different from list2ragged, which just uses
the flatten function directly. Letting list2ragged do the conversion
avoids the dimension issue.

This did not come up before because in NEL demo projects it's typical
for data with no entities to be discarded before it reaches the NEL
component.

This includes a simple direct test that shows the issue and checks it's
resolved. It doesn't check if there are any downstream changes, so a
more complete test could be added. A full run was tested by adding an
example with no entities to the Emerson sample project.

* Add a blank instance to default training data in tests

Rather than adding a specific test, since not failing on instances with
no entities is basic functionality, it makes sense to add it to the
default set.

* Fix without modifying architecture

If the architecture is modified this would have to be a new version, but
this change isn't big enough to merit that.
2022-10-28 10:25:34 +02:00
Adriane Boyd
865691d169
Adjust default attrs for textcat configs (#11698) 2022-10-26 08:43:00 +02:00
Adriane Boyd
88d35450dc
Rename test helper method with non-test_ name (#11701) 2022-10-25 14:53:18 +02:00
Adriane Boyd
cae4589f5a
Replace EntityRuler with SpanRuler implementation (#11320)
* Replace EntityRuler with SpanRuler implementation

Remove `EntityRuler` and rename the `SpanRuler`-based
`future_entity_ruler` to `entity_ruler`.

Main changes:

* It is no longer possible to load patterns on init as with
`EntityRuler(patterns=)`.
* The older serialization formats (`patterns.jsonl`) are no longer
supported and the related tests are removed.
* The config settings are only stored in the config, not in the
serialized component (in particular the `phrase_matcher_attr` and
overwrite settings).

* Add migration guide to EntityRuler API docs

* docs update

* Minor edit

Co-authored-by: svlandeg <svlandeg@github.com>
2022-10-24 09:11:35 +02:00
Adriane Boyd
a4bd890f32
Merge pull request #11686 from adrianeboyd/chore/update-v4-from-master
Update v4 from master
2022-10-21 12:55:53 +02:00
github-actions[bot]
84d9cb6b38
Auto-format code with black (#11687)
Co-authored-by: explosion-bot <explosion-bot@users.noreply.github.com>
2022-10-21 11:54:17 +02:00
Paul O'Leary McCann
0e2b7fb28b
Remove thinc util reimports (#11665)
* Remove imports marked as v2 leftovers

There are a few functions that were in `spacy.util` in v2, but were
moved to Thinc. In v3 these were imported in `spacy.util` so that code
could be used unchanged, but the comment over them indicates they should
always be imported from Thinc. This commit removes those imports.

It doesn't look like any DeprecationWarning was ever thrown for using
these, but it is probably fine to remove them anyway with a major
version. It is not clear that they were widely used.

* Import fix_random_seed correctly

This seems to be the only place in spaCy that was using the old import.
2022-10-21 11:01:18 +02:00
Adriane Boyd
103b24fb25 Merge remote-tracking branch 'upstream/master' into chore/update-v4-from-master 2022-10-21 09:13:32 +02:00
Adriane Boyd
7e56701057 Merge remote-tracking branch 'upstream/master' into chore/update-develop-from-master-v3.5 2022-10-20 13:38:49 +02:00
Adriane Boyd
3d0e895363
Set version to v3.4.2 (#11672) 2022-10-19 17:33:55 +02:00
Edward
d66ccb8eb0
Fix multiple entries per custom extension in doc json (#11551)
* Fix multiple extensions and character offset

* Rename token_start/end to start/end

* Refactor Doc.from_json based on review

* Iterate over user_data items

* Only add non-empty underscore entries

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
2022-10-19 15:52:47 +02:00
Paul O'Leary McCann
858565a567
Fix issues with DVC commands (#11592)
* Fix flag handling in dvc

Prior to this commit, if a flag (--verbose or --quiet) was passed to
DVC, it would be added to the end of the generated dvc command line.
This would result in the command being interpreted as part of the actual
command to run, rather than an argument to dvc. This would result in
command lines like:

    spacy project run preprocess --verbose

That would fail with an error that there's no such directory as
`--verbose`.

This change puts the flags at the front of the dvc command so that they
are interpreted correctly. It removes the `run_dvc_commands` function,
which had been reduced to just a for loop and wasn't used elsewhere.

A separate problem is that there's no way to specify the quiet behaviour
to dvc from the command line, though it's unclear if that's a bug.

* Add dvc quiet flag to docs

* Handle case in DVC where no commands are appropriate

If you only have commands with no deps or outputs (admittedly unlikely), you
get a weird error about the dvc file not existing. This gives explicit
output instead.

* Add support for quiet flag

* Fix command execution

Commands are strings now because they're joined further up.
2022-10-18 15:11:39 +09:00
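
Note on 858565a567: the flag-ordering issue can be sketched as follows. The two builder functions are hypothetical and the `dvc run` prefix is heavily simplified (the real generated command carries more options). The sketch only shows why flags appended at the end are read as part of the wrapped command instead of as arguments to dvc.

```python
from typing import List


def build_dvc_command(project_cmd: List[str], flags: List[str]) -> List[str]:
    """Place dvc's own flags before the wrapped command."""
    return ["dvc", "run", *flags, *project_cmd]


def build_dvc_command_buggy(project_cmd: List[str], flags: List[str]) -> List[str]:
    """Old ordering: trailing flags become part of the wrapped command."""
    return ["dvc", "run", *project_cmd, *flags]
```

With `project_cmd = ["python", "-m", "spacy", "project", "run", "preprocess"]` and `flags = ["--verbose"]`, the buggy variant ends in `preprocess --verbose`, matching the failing command line quoted in the commit message.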
Sofie Van Landeghem
2ce6aadda2
update default configs to recent versions (#11618) 2022-10-17 12:10:03 +02:00
github-actions[bot]
ceb62352bf
Auto-format code with black (#11649)
Co-authored-by: explosion-bot <explosion-bot@users.noreply.github.com>
2022-10-14 18:04:55 +09:00
Adriane Boyd
6b5a3e7219
Extend to pydantic v1.10 (#11635)
* Update types in `spacy.schemas` for updated pydantic+mypy
2022-10-14 08:16:49 +02:00
Sofie Van Landeghem
4d869fcc11
Small fixes to docstrings (#11610)
* add missing scorer arg to docstring

* fix class names in textcat_multilabel

* add missing scorer to docstrings
2022-10-12 15:17:40 +02:00
Adriane Boyd
fe06e037bc
Fix init for pymorphy2_lookup lemmatizer mode (#11631) 2022-10-12 12:18:39 +02:00
Sofie Van Landeghem
29649589fc
remove dtype (#11615) 2022-10-11 15:25:05 +02:00
Sofie Van Landeghem
ef74f8f5e4
Fix mypy error in edittree lemmatizer (#11612)
* cleanup imports

* try limiting Thinc to previous release

* remove Model specification

* fix code and revert Thinc constraint
2022-10-11 14:15:22 +02:00
Madeesh Kannan
446a3ecf34
StringStore refactoring (#11344)
* `strings`: Remove unused `hash32_utf8` function

* `strings`: Make `hash_utf8` and `decode_Utf8Str` private

* `strings`: Reorganize private functions

* 'strings': Raise error when non-string/-int types are passed to functions that don't accept them

* `strings`: Add `items()` method, add type hints, remove unused methods, restrict inputs to specific types, reorganize methods

* `Morphology`: Use `StringStore.items()` to enumerate features when pickling

* `test_stringstore`: Update pre-Python 3 tests

* Update `StringStore` docs

* Fix `get_string_id` imports

* Replace redundant test with tests for type checking

* Rename `_retrieve_interned_str`, remove `.get` default arg

* Add `get_string_id` to `strings.pyi`
Remove `mypy` ignore directives from imports of the above

* `strings.pyi`: Replace functions that consume `Union`-typed params with overloads

* `strings.pyi`: Revert some function signatures

* Update `SYMBOLS_BY_INT` lookups and error codes post-merge

* Revert clobbered change introduced in a previous merge

* Remove unnecessary type hint

* Invert tuple order in `StringStore.items()`

* Add test for `StringStore.items()`

* Revert "`Morphology`: Use `StringStore.items()` to enumerate features when pickling"

This reverts commit 1af9510ceb.

* Rename `keys` and `key_map`

* Add `keys()` and `values()`

* Add comment about the inverted key-value semantics in the API

* Fix type hints

* Implement `keys()`, `values()`, `items()` without generators

* Fix type hints, remove unnecessary boxing

* Update docs

* Simplify `keys/values/items()` impl

* `mypy` fix

* Fix error message, doc fixes
2022-10-06 10:51:06 +02:00
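
Note on 446a3ecf34: the "inverted key-value semantics" mentioned in the bullets can be pictured with a toy store. This is an illustration under assumptions, not the `StringStore` implementation. `ToyStringStore` is hypothetical, it uses a truncated SHA-256 digest in place of the real hash function, and it assumes that `keys()` yields strings, `values()` yields hashes, and `items()` yields `(string, hash)` pairs even though lookups internally go from hash to string.

```python
import hashlib
from typing import Dict, List, Tuple


class ToyStringStore:
    """Toy hash-to-string table with string-first keys/values/items views."""

    def __init__(self) -> None:
        self._by_hash: Dict[int, str] = {}

    def add(self, string: str) -> int:
        # Stand-in hash: first 8 bytes of SHA-256 (not the real hash function).
        digest = hashlib.sha256(string.encode("utf8")).digest()
        key = int.from_bytes(digest[:8], "little")
        self._by_hash[key] = string
        return key

    def keys(self) -> List[str]:
        # "Keys" are the strings, although storage is keyed by hash.
        return list(self._by_hash.values())

    def values(self) -> List[int]:
        return list(self._by_hash.keys())

    def items(self) -> List[Tuple[str, int]]:
        return [(string, key) for key, string in self._by_hash.items()]
```

Returning plain lists rather than generators mirrors the bullet about implementing `keys()`, `values()`, and `items()` without generators.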
svlandeg
d4922f25fc fix test for EL activations with refactored KB 2022-10-03 14:41:15 +02:00
svlandeg
e3027c65b8 Merge branch 'copy_develop' into copy_v4 2022-10-03 14:12:16 +02:00
svlandeg
9c8cdb403e Merge branch 'master_copy' into develop_copy 2022-09-30 15:40:26 +02:00
Sofie Van Landeghem
bcda8bc1e7
update mypy to latest version (#11546)
* update mypy and disable it for python 3.6

* ignoring mypy's type redefinition error
2022-09-29 14:24:40 +02:00
Adriane Boyd
6d7630c5d3
Allow overriding spacy_version in spacy package meta (#11552) 2022-09-29 10:44:06 +02:00
Peter Baumgartner
e794d4ae39
debug data Spancat Table Improvements (#11504)
* update

* fix format function

* pull out _format_number

* format with black
2022-09-28 17:16:05 +02:00
Raphael Mitsch
aea16719be
Simplify and clarify enable/disable behavior of spacy.load() (#11459)
* Change enable/disable behavior so that arguments take precedence over config options. Extend error message on conflict. Add warning message in case of overwriting config option with arguments.

* Fix tests in test_serialize_pipeline.py to reflect changes to handling of enable/disable.

* Fix type issue.

* Move comment.

* Move comment.

* Issue UserWarning instead of printing wasabi message. Adjust test.

* Added pytest.warns(UserWarning) for expected warning to fix tests.

* Update warning message.

* Move type handling out of fetch_pipes_status().

* Add global variable for default value. Use id() to determine whether used values are default value.

* Fix default value for disable.

* Rename DEFAULT_PIPE_STATUS to _DEFAULT_EMPTY_PIPES.
2022-09-27 14:22:36 +02:00
Jacobo Myerston
3e8bc1272f
add punctuation to grc (#11426)
* add punctuation to grc

Add support for special editorial punctuation that is common in ancient Greek texts. Ancient Greek texts, as found in digital and print form, have been largely edited by scholars. Restorations and improvements are normally marked with special characters that need to be handled properly by the tokenizer.

* add unit tests

* simplify regex

* move generic quotes to char classes

* rename unit test

* fix regex

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

Co-authored-by: svlandeg <svlandeg@github.com>
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
2022-09-27 11:38:56 +02:00
Adriane Boyd
877671e09a
Preserve missing entity annotation in augmenters (#11540)
Preserve both `-` and `O` annotation in augmenters rather than relying
on `Example.to_dict`'s default support for one option outside of labeled
entity spans.

This is intended as a temporary workaround for augmenters for v3.4.x.
The behavior of `Example` and related IOB utils could be improved in the
general case for v3.5.
2022-09-27 10:16:51 +02:00
Richard Hudson
6f692a06d5
Remove side effects from Doc.__init__() (#11506)
* Remove side effects from Doc.__init__()

* Changes based on review comment

* Readd test

* Change interface of Doc.__init__()

* Simplify test

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

* Update doc.md

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
2022-09-26 15:58:21 +02:00
Raphael Mitsch
af9b01ef97
Add dependency check to project step runs (#11226)
* Add dependency check to project step running.

* Fix dependency mismatch warning.

* Remove newline.

* Add types-setuptools to setup.cfg.

* Move types-setuptools to test requirements. Move warnings into _validate_requirements(). Handle file reading in project_run().

* Remove newline formatting for output of package conflicts.

* Show full version conflict message instead of just package name.

* Update spacy/cli/project/run.py

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

* Fix typo.

* Re-add rephrasing of message for conflicting packages. Remove requirements path redundancy.

* Update spacy/cli/project/run.py

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

* Update spacy/cli/project/run.py

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

* Print unified message for requirement conflicts and missing requirements.

* Update spacy/cli/project/run.py

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

* Fix warning message.

* Print conflict/missing messages individually.

* Add check_requirements setting in project.yml to disable requirements check.

* Update website/docs/usage/projects.md

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

* Update website/docs/usage/projects.md

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

* Update description of project.yml structure in projects.md.

* Update website/docs/usage/projects.md

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Prettify projects docs.

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
2022-09-16 16:54:31 +02:00
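A rough sketch of the kind of check the commit above describes: compare a requirements list with the installed packages and report conflicts and missing packages as separate messages. The function `check_requirements`, the dictionary of installed versions, and the restriction to bare names and exact `name==version` pins are all assumptions for illustration. This is not the code in `spacy/cli/project/run.py`.

```python
from typing import Dict, Iterable, List, Tuple


def check_requirements(
    requirements: Iterable[str], installed: Dict[str, str]
) -> Tuple[List[str], List[str]]:
    """Return (conflicts, missing) for simple requirement lines.

    Hypothetical helper: only bare names and exact pins are understood.
    """
    conflicts: List[str] = []
    missing: List[str] = []
    for raw in requirements:
        # Drop trailing comments and skip blank lines.
        line = raw.split("#", 1)[0].strip()
        if not line:
            continue
        name, _, wanted = line.partition("==")
        name = name.strip().lower()
        wanted = wanted.strip()
        have = installed.get(name)
        if have is None:
            missing.append(line)
        elif wanted and have != wanted:
            # Keep the full conflict description, not just the package name.
            conflicts.append(f"{name} {have} is installed, but {line} is required")
    return conflicts, missing
```

A caller could print each entry of both lists on its own line, and skip the call entirely when a project-level switch such as the `check_requirements` setting mentioned in the commit is turned off.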
github-actions[bot]
279358be63
Auto-format code with black (#11513)
Co-authored-by: explosion-bot <explosion-bot@users.noreply.github.com>
2022-09-16 11:50:19 +02:00
Sofie Van Landeghem
0509f90874
add dot (#11500) 2022-09-15 17:29:42 +02:00
Adriane Boyd
7c98245c0c
Add levenshtein from polyleven (#11418)
Add a simple levenshtein distance function using the implementation from
the polyleven library as `spacy.matcher.levenshtein`.
2022-09-14 17:05:22 +02:00
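For reference, Levenshtein distance can be computed with the textbook dynamic-programming recurrence. The sketch below is plain Python for illustration only. It is not the polyleven implementation the commit above exposes as `spacy.matcher.levenshtein`, and it makes no claim about that function's exact signature.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character edits turning a into b."""
    # Keep the shorter string as the row, so memory is O(min(len(a), len(b))).
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))  # distances from "" to prefixes of b
    for i, ch_a in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to ""
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(
                min(
                    previous[j] + 1,         # delete ch_a
                    current[j - 1] + 1,      # insert ch_b
                    previous[j - 1] + cost,  # substitute or match
                )
            )
        previous = current
    return previous[-1]
```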
Daniël de Kok
efdbb722c5
Store activations in Docs when save_activations is enabled (#11002)
* Store activations in Doc when `store_activations` is enabled

This change adds the new `activations` attribute to `Doc`. This
attribute can be used by trainable pipes to store their activations,
probabilities, and guesses for downstream users.

As an example, this change modifies the `tagger` and `senter` pipes to
add a `store_activations` option. When this option is enabled, the
probabilities and guesses are stored in `set_annotations`.

* Change type of `store_activations` to `Union[bool, List[str]]`

When the value is:

- A bool: all activations are stored when set to `True`.
- A List[str]: the activations named in the list are stored

* Formatting fixes in Tagger

* Support store_activations in spancat and morphologizer

* Make Doc.activations type visible to MyPy

* textcat/textcat_multilabel: add store_activations option

* trainable_lemmatizer/entity_linker: add store_activations option

* parser/ner: do not currently support returning activations

* Extend tagger and senter tests

So that they, like the other tests, also check that we get no
activations if no activations were requested.

* Document `Doc.activations` and `store_activations` in the relevant pipes

* Start errors/warnings at higher numbers to avoid merge conflicts

Between the master and v4 branches.

* Add `store_activations` to docstrings.

* Replace store_activations setter by set_store_activations method

Setters that take a different type than what the getter returns are still
problematic for MyPy. Replace the setter by a method, so that type inference
works everywhere.

* Use dict comprehension suggested by @svlandeg

* Revert "Use dict comprehension suggested by @svlandeg"

This reverts commit 6e7b958f70.

* EntityLinker: add type annotations to _add_activations

* _store_activations: make kwarg-only, remove doc_scores_lens arg

* set_annotations: add type annotations

* Apply suggestions from code review

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* TextCat.predict: return dict

* Make the `TrainablePipe.store_activations` property a bool

This means that we can also bring back `store_activations` setter.

* Remove `TrainablePipe.activations`

We do not need to enumerate the activations anymore since `store_activations` is
`bool`.

* Add type annotations for activations in predict/set_annotations

* Rename `TrainablePipe.store_activations` to `save_activations`

* Error E1400 is not used anymore

This error was used when activations were still `Union[bool, List[str]]`.

* Change wording in API docs after store -> save change

* docs: tag (save_)activations as new in spaCy 4.0

* Fix copied line in morphologizer activations test

* Don't train in any test_save_activations test

* Rename activations

- "probs" -> "probabilities"
- "guesses" -> "label_ids", except in the edit tree lemmatizer, where
  "guesses" -> "tree_ids".

* Remove unused W400 warning.

This warning was used when we still allowed the user to specify
which activations to save.

* Formatting fixes

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Replace "kb_ids" by a constant

* spancat: replace a cast by an assertion

* Fix EOF spacing

* Fix comments in test_save_activations tests

* Do not set RNG seed in activation saving tests

* Revert "spancat: replace a cast by an assertion"

This reverts commit 0bd5730d16.

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
2022-09-13 09:51:12 +02:00
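The flow described in the commit above (a boolean `save_activations` flag, `predict` returning a dict, and `set_annotations` copying that dict onto the document under the pipe's name) can be illustrated with toy classes. `ToyDoc` and `ToyTagger` are hypothetical stand-ins with a fake deterministic "model". Only the attribute name `activations`, the flag name, and the keys `"probabilities"` and `"label_ids"` are taken from the commit message.

```python
from typing import Dict, List


class ToyDoc:
    """Stand-in for a document; not spaCy's Doc."""

    def __init__(self, words: List[str]):
        self.words = list(words)
        self.tags: List[str] = []
        # pipe name -> activation name -> per-token values
        self.activations: Dict[str, Dict[str, list]] = {}


class ToyTagger:
    """Hypothetical pipe with a fake, deterministic 'model'."""

    name = "tagger"

    def __init__(self, labels: List[str], save_activations: bool = False):
        self.labels = list(labels)
        self.save_activations = save_activations

    def predict(self, docs: List[ToyDoc]) -> Dict[str, list]:
        probabilities, label_ids = [], []
        for doc in docs:
            # Fake scores: all probability mass on label (word length mod n).
            ids = [len(word) % len(self.labels) for word in doc.words]
            probs = [
                [1.0 if k == i else 0.0 for k in range(len(self.labels))]
                for i in ids
            ]
            label_ids.append(ids)
            probabilities.append(probs)
        return {"probabilities": probabilities, "label_ids": label_ids}

    def set_annotations(self, docs: List[ToyDoc], activations: Dict[str, list]) -> None:
        for i, doc in enumerate(docs):
            doc.tags = [self.labels[j] for j in activations["label_ids"][i]]
            # Activations are only kept when explicitly requested.
            if self.save_activations:
                doc.activations[self.name] = {
                    key: values[i] for key, values in activations.items()
                }

    def __call__(self, doc: ToyDoc) -> ToyDoc:
        self.set_annotations([doc], self.predict([doc]))
        return doc
```

With the flag left at its default, `doc.activations` stays empty, which is the behavior the `test_save_activations` tests mentioned in the commit check for.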
Sofie Van Landeghem
cc10a27c59
Prevent tok2vec from broadcasting to listeners when predicting (#11385)
* replicate bug with tok2vec in annotating components

* add overfitting test with a frozen tok2vec

* remove broadcast from predict and check doc.tensor instead

* remove broadcast

* proper error

* slight rephrase of documentation
2022-09-12 15:36:48 +02:00
Madeesh Kannan
0ec9a696e6
Fix config validation failures caused by NVTX pipeline wrappers (#11460)
* Enable Cython<->Python bindings for `Pipe` and `TrainablePipe` methods

* `pipes_with_nvtx_range`: Skip hooking methods whose signature cannot be ascertained

When loading pipelines from a config file, the arguments passed to individual pipeline components are validated by `pydantic` during init. For this, the validation model attempts to parse the function signature of the component's constructor/entry point so that it can check if all mandatory parameters are present in the config file.

When using the `models_and_pipes_with_nvtx_range` as an `after_pipeline_creation` callback, the methods of all pipeline components get replaced by an NVTX range wrapper **before** the above-mentioned validation takes place. This can be problematic for components that are implemented as Cython extension types - if the extension type is not compiled with Python bindings for its methods, they will have no signatures at runtime. This resulted in `pydantic` matching the *wrapper's* parameters with those in the config and raising errors.

To avoid this, we now skip applying the wrapper to any (Cython) methods that do not have signatures.
2022-09-12 14:55:41 +02:00
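The "skip methods without signatures" rule from the commit above can be sketched with the standard library alone. `has_signature` and `wrap_methods` are hypothetical helpers, and the `on_call` hook stands in for opening an NVTX range. The real wrapper lives in spaCy and is not reproduced here.

```python
import functools
import inspect
from typing import Any, Callable, Iterable, List


def has_signature(func: Callable) -> bool:
    """True if inspect can determine a signature for func."""
    try:
        inspect.signature(func)
    except (ValueError, TypeError):
        # Raised for callables that expose no signature information.
        return False
    return True


def _make_wrapper(name: str, func: Callable, on_call: Callable[[str], None]) -> Callable:
    @functools.wraps(func)  # keeps the name, docstring and __wrapped__ link
    def wrapper(*args: Any, **kwargs: Any) -> Any:
        on_call(name)  # placeholder for opening a profiling range
        return func(*args, **kwargs)

    return wrapper


def wrap_methods(
    obj: Any, method_names: Iterable[str], on_call: Callable[[str], None]
) -> List[str]:
    """Wrap the named methods on obj; return the names actually wrapped."""
    wrapped = []
    for name in method_names:
        func = getattr(obj, name, None)
        # Leave missing attributes and signature-less callables untouched.
        if func is None or not has_signature(func):
            continue
        setattr(obj, name, _make_wrapper(name, func, on_call))
        wrapped.append(name)
    return wrapped
```

Leaving signature-less callables unwrapped means that later introspection sees the original object rather than the generic `(*args, **kwargs)` wrapper.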
kadarakos
6b83fee58d
Assets message (#11458)
* new error message when 'project run assets'

* Update spacy/cli/project/run.py

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
2022-09-09 17:17:10 +02:00
Adriane Boyd
8a86a35eab
Remove has_letters in config template (#11465)
Due to problems with the javascript conversion in the website
quickstart, remove the `has_letters` setting to simplify generating
`attrs` for the default `tok2vec`.

Additionally reduce `PREFIX` as in the trained pipelines.
2022-09-09 15:10:04 +02:00
github-actions[bot]
0c72c6bb2c
Auto-format code with black (#11468)
Co-authored-by: explosion-bot <explosion-bot@users.noreply.github.com>
2022-09-09 11:21:17 +02:00
Raphael Mitsch
1f23c615d7
Refactor KB for easier customization (#11268)
* Add implementation of batching + backwards compatibility fixes. Tests indicate issue with batch disambiguation for custom singular entity lookups.

* Fix tests. Add distinction w.r.t. batch size.

* Remove redundant and add new comments.

* Adjust comments. Fix variable naming in EL prediction.

* Fix mypy errors.

* Remove KB entity type config option. Change return types of candidate retrieval functions to Iterable from Iterator. Fix various other issues.

* Update spacy/pipeline/entity_linker.py

Co-authored-by: Paul O'Leary McCann <polm@dampfkraft.com>

* Update spacy/pipeline/entity_linker.py

Co-authored-by: Paul O'Leary McCann <polm@dampfkraft.com>

* Update spacy/kb_base.pyx

Co-authored-by: Paul O'Leary McCann <polm@dampfkraft.com>

* Update spacy/kb_base.pyx

Co-authored-by: Paul O'Leary McCann <polm@dampfkraft.com>

* Update spacy/pipeline/entity_linker.py

Co-authored-by: Paul O'Leary McCann <polm@dampfkraft.com>

* Add error messages to NotImplementedErrors. Remove redundant comment.

* Fix imports.

* Remove redundant comments.

* Rename KnowledgeBase to InMemoryLookupKB and BaseKnowledgeBase to KnowledgeBase.

* Fix tests.

* Update spacy/errors.py

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Move KB into subdirectory.

* Adjust imports after KB move to dedicated subdirectory.

* Fix config imports.

* Move Candidate + retrieval functions to separate module. Fix other, small issues.

* Fix docstrings and error message w.r.t. class names. Fix typing for candidate retrieval functions.

* Update spacy/kb/kb_in_memory.pyx

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Update spacy/ml/models/entity_linker.py

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Fix typing.

* Change typing of mentions to be Span instead of Union[Span, str].

* Update docs.

* Update EntityLinker and _architecture docs.

* Update website/docs/api/entitylinker.md

Co-authored-by: Paul O'Leary McCann <polm@dampfkraft.com>

* Adjust message for E1046.

* Re-add section for Candidate in kb.md, add reference to dedicated page.

* Update docs and docstrings.

* Re-add section + reference for KnowledgeBase.get_alias_candidates() in docs.

* Update spacy/kb/candidate.pyx

* Update spacy/kb/kb_in_memory.pyx

* Update spacy/pipeline/legacy/entity_linker.py

* Remove candidate.md. Remove mistakenly added config snippet in entity_linker.py.

Co-authored-by: Paul O'Leary McCann <polm@dampfkraft.com>
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
2022-09-08 10:38:07 +02:00
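The shape of the refactor in the commit above (an abstract base whose unimplemented methods raise a descriptive `NotImplementedError`, plus an in-memory subclass) might look like the toy sketch below. The class names mirror the commit message, but these are simplified stand-ins, not spaCy's classes: mentions are plain strings here, whereas the commit types them as `Span`, and the `add_alias` signature and the default batch method are assumptions for illustration.

```python
from typing import Dict, Iterable, List


class KnowledgeBase:
    """Toy abstract base: subclasses supply the candidate lookup."""

    def get_candidates(self, mention: str) -> Iterable[str]:
        # A descriptive message instead of a bare NotImplementedError.
        raise NotImplementedError(
            f"{type(self).__name__} must implement get_candidates()."
        )

    def get_candidates_batch(self, mentions: Iterable[str]) -> Iterable[Iterable[str]]:
        # Default batching falls back to one lookup per mention;
        # a custom KB could override this with a single bulk query.
        return [self.get_candidates(mention) for mention in mentions]


class InMemoryLookupKB(KnowledgeBase):
    """Toy dictionary-backed implementation."""

    def __init__(self) -> None:
        self._aliases: Dict[str, List[str]] = {}

    def add_alias(self, alias: str, entities: Iterable[str]) -> None:
        self._aliases[alias] = list(entities)

    def get_candidates(self, mention: str) -> Iterable[str]:
        # Returning a list satisfies the Iterable contract and can be
        # iterated more than once, unlike a one-shot iterator.
        return list(self._aliases.get(mention, []))
```

A custom knowledge base, for example one backed by a database, would then only need to subclass the base and override the lookup methods.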
shademe
977b847cce
Merge branch 'develop' into merge-develop-into-v4 2022-09-07 11:35:47 +02:00