Compare commits

..

165 Commits

Author SHA1 Message Date
Matthew Honnibal
f4c8fdfaad Update cli.package for removed spacy.vectors.name attr 2024-09-01 16:43:49 +02:00
Sofie Van Landeghem
818fdb537e
Merge pull request #13490 from svlandeg/feat/update_v4
Update v4 branch with latest from master
2024-05-14 22:41:17 +02:00
svlandeg
e32a394ff0 fix the fix for textcat init functionality 2024-05-14 18:45:51 +02:00
svlandeg
5992e927b9 fix textcat init functionality 2024-05-14 18:38:11 +02:00
svlandeg
c27679f210 Merge branch 'master' into feat/update_v4 2024-05-14 17:42:48 +02:00
Sofie Van Landeghem
287deee02c
remove empty file (#13458) 2024-04-26 10:04:16 +02:00
Daniël de Kok
b2ca7253d2
Document TrainablePipe.save_activations (#13452)
* Document `TrainablePipe.save_activations`

* Fully qualified links

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* prettier

---------

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
2024-04-23 09:21:23 +02:00
Daniël de Kok
f5918d4353
Update to Thinc 9.0.0 and set version to 4.0.0.dev3 (#13448)
* Update to Thinc 9.0.0 and set version to 4.0.0.dev3

* Set minimum Python version to 3.9
2024-04-22 09:40:55 +02:00
Daniël de Kok
5bd141013b
Remove apple from extras (#13439)
Account for merging of `thinc-apple-ops` into `thinc`.
2024-04-17 13:43:27 +02:00
Daniël de Kok
8696861c8c
Update spacy-curated-transformers docs for spaCy 4 (#13440)
- Update model constructors to v2 and add `dtype` argument.
- Update to `PyTorchCheckpointLoader` to `v2`.
- Add `transformer_discriminative.v1`.
2024-04-16 12:06:58 +02:00
Daniël de Kok
fbc14aea45
Add distill subcommand (#13431)
* Add distill subcommand

This subcommand distills a student model from a teacher model.

* Fixes from Sofie

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Type and doc fixes

* Wording

* distill: document missing `-o`

* Wording

* Small fix

---------

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
2024-04-11 19:33:46 +02:00
Raphael Mitsch
304b9331e6
Modify EL batching to doc-wise streaming approach (#12367)
* Convert Candidate from Cython to Python class.

* Format.

* Fix .entity_ typo in _add_activations() usage.

* Change type for mentions to look up entity candidates for to SpanGroup from Iterable[Span].

* Update docs.

* Update spacy/kb/candidate.py

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Update doc string of BaseCandidate.__init__().

* Update spacy/kb/candidate.py

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Rename Candidate to InMemoryCandidate, BaseCandidate to Candidate.

* Adjust Candidate to support and mandate numerical entity IDs.

* Format.

* Fix docstring and docs.

* Update website/docs/api/kb.mdx

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Rename alias -> mention.

* Refactor Candidate attribute names. Update docs and tests accordingly.

* Refacor Candidate attributes and their usage.

* Format.

* Fix mypy error.

* Update error code in line with v4 convention.

* Modify EL batching system.

* Update leftover get_candidates() mention in docs.

* Format docs.

* Format.

* Update spacy/kb/candidate.py

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Updated error code.

* Simplify interface for int/str representations.

* Update website/docs/api/kb.mdx

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Rename 'alias' to 'mention'.

* Port Candidate and InMemoryCandidate to Cython.

* Remove redundant entry in setup.py.

* Add abstract class check.

* Drop storing mention.

* Update spacy/kb/candidate.pxd

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Fix entity_id refactoring problems in docstrings.

* Drop unused InMemoryCandidate._entity_hash.

* Update docstrings.

* Move attributes out of Candidate.

* Partially fix alias/mention terminology usage. Convert Candidate to interface.

* Remove prior_prob from supported properties in Candidate. Introduce KnowledgeBase.supports_prior_probs().

* Update docstrings related to prior_prob.

* Update alias/mention usage in doc(strings).

* Update spacy/ml/models/entity_linker.py

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Update spacy/ml/models/entity_linker.py

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Mention -> alias renaming. Drop Candidate.mentions(). Drop InMemoryLookupKB.get_alias_candidates() from docs.

* Update docstrings.

* Fix InMemoryCandidate attribute names.

* Update spacy/kb/kb.pyx

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Update spacy/ml/models/entity_linker.py

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Update W401 test.

* Update spacy/errors.py

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Update spacy/kb/kb.pyx

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Use Candidate output type for toy generators in the test suite to mimick best practices

* fix docs

* fix import

* Fix merge leftovers.

* Update spacy/kb/kb.pyx

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Update spacy/kb/kb.pyx

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Update website/docs/api/kb.mdx

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Update website/docs/api/entitylinker.mdx

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Update spacy/kb/kb_in_memory.pyx

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Update website/docs/api/inmemorylookupkb.mdx

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Update get_candidates() docstring.

* Reformat imports in entity_linker.py.

* Drop valid_ent_idx_per_doc.

* Update docs.

* Format.

* Simplify doc loop in predict().

* Remove E1044 comment.

* Fix merge errors.

* Format.

* Format.

* Format.

* Fix merge error & tests.

* Format.

* Apply suggestions from code review

Co-authored-by: Madeesh Kannan <shadeMe@users.noreply.github.com>

* Use type alias.

* isort.

* isort.

* Lint.

* Add typedefs.pyx.

* Fix typedef import.

* Fix type aliases.

* Format.

* Update docstring and type usage.

* Add info on get_candidates(), get_candidates_batched().

* Readd get_candidates info to v3 changelog.

* Update website/docs/api/entitylinker.mdx

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Update factory functions for backwards compatibility.

* Format.

* Ignore mypy error.

* Fix mypy error.

* Format.

* Add test for multiple docs with multiple entities.

---------

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
Co-authored-by: Madeesh Kannan <shadeMe@users.noreply.github.com>
Co-authored-by: svlandeg <svlandeg@github.com>
2024-04-09 11:39:18 +02:00
Adriane Boyd
afb22ad491
Remove debug data normalization for span analysis (#13203)
* Remove debug data normalization for span analysis

As a result of this normalization, `debug data` could show a user tokens
that do not exist in their data.

* Update spacy/cli/debug_data.py

---------

Co-authored-by: svlandeg <svlandeg@github.com>
2024-02-06 14:14:55 +01:00
Daniël de Kok
1052cba9f3
Merge pull request #13299 from danieldk/copy/master
Sync main with latests changes from master (v3)
2024-02-04 15:40:55 +01:00
Daniël de Kok
2d4067d021 Test if closing explicitly solves recursive lock issues 2024-02-02 11:39:07 +01:00
Daniël de Kok
70e2f2a14a
Update spacy-legacy dependency to 4.0.0.dev1 (#13270)
This release is compatible with the parser refactor backout.
2024-01-25 18:24:22 +01:00
Daniël de Kok
ce9ea9629f
Set version to v4.0.0.dev2 (#13269) 2024-01-25 12:54:23 +01:00
Daniël de Kok
bbf38d4d0f
Merge pull request #13250 from danieldk/maintenance/v4-merge-master-20240119
Merge `master` into `v4`
2024-01-24 19:30:23 +01:00
Daniël de Kok
9e97c730be Fix up requirements test
To account for buil dependencies being removed from `setup.cfg`.
2024-01-24 17:18:49 +01:00
Daniël de Kok
36ee709390 Remove setup_requires from setup.cfg 2024-01-24 15:02:02 +01:00
Daniël de Kok
e722284ff4 Construct TextCatEnsemble.v2 using helper function 2024-01-24 14:59:01 +01:00
Daniël de Kok
ce4ea5ffa7 Py_UNICODE is not compatible with 3.12 2024-01-24 13:08:56 +01:00
Daniël de Kok
c621e251b8 Typing fixes 2024-01-24 12:20:01 +01:00
Daniël de Kok
82ef6783a8 Merge remote-tracking branch 'upstream/master' into maintenance/v4-merge-master-20240119 2024-01-24 09:09:01 +01:00
Daniël de Kok
81beaea70e Merge remote-tracking branch 'upstream/master' into maintenance/v4-merge-master-20240119 2024-01-19 12:34:29 +01:00
Daniël de Kok
2891e27421
Merge pull request #13191 from explosion/maintenance/revert-parser-refactor
Revert the parser refactor
2024-01-18 17:06:41 +01:00
Daniël de Kok
9972333ef9 Temporily xfail local remote storage test 2024-01-17 10:20:40 +01:00
Daniël de Kok
7351f6bbeb Update thinc dependency to 9.0.0.dev4 2024-01-16 15:56:09 +01:00
Daniël de Kok
7718886fa3
TransitionBasedParser.v2 in run example output
Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
2023-12-21 11:14:35 +01:00
Daniël de Kok
532225b955 Fix parser distillation test seed
The test would sometimes fail. Rather than increasing test by increasing
training iterations, use a known-good seed.
2023-12-21 10:06:28 +01:00
Daniël de Kok
7b689bde44 No need for Literal compat, since we only support >= 3.8 2023-12-21 09:47:38 +01:00
Daniël de Kok
57203fa0fc Fix TransitionBasedParser version in transformer embeddings docs 2023-12-19 09:28:20 +01:00
Daniël de Kok
5e8bafa5bb Bring back W401 2023-12-18 20:17:24 +01:00
Daniël de Kok
9b36729cbd Fix Cython lints 2023-12-18 20:02:15 +01:00
Daniël de Kok
42fe4edfd7 Add distillation tests with max cut size
And fix endless loop when the max cut size is 0 or 1.
2023-12-08 20:38:01 +01:00
Daniël de Kok
e2591cda36 isort 2023-12-08 20:24:09 +01:00
Daniël de Kok
e5ec45cb7e Revert "Merge the parser refactor into v4 (#10940)"
This reverts commit a183db3cef.
2023-12-08 20:23:08 +01:00
Daniël de Kok
05803cfe76 Revert "Reimplement distillation with oracle cut size (#12214)"
This reverts commit e27c60a702.
2023-12-08 14:38:05 +01:00
Adriane Boyd
1b2d66f98e
Switch zh tokenizer default pkuseg_model to spacy_ontonotes (#12896)
So that users can use `copy_from_base_model` for other segmenters
without having to override an irrelevant `pkuseg_model` setting, switch
the default `pkuseg_model` to `spacy_ontonotes`.
2023-08-09 10:55:52 +02:00
Paul O'Leary McCann
b4e457d9fe
Accept multiple code files in all CLI commands (#12101)
* Add support for multiple code files to all relevant commands

Prior to this, only the package command supported multiple code files.

* Update docs

* Add debug data test, plus generic fixtures

One tricky thing here: it's tempting to create the config by creating a
pipeline in code, but that requires declaring the custom components
here. However the CliRunner appears to be run in the same process or
otherwise have access to our registry, so it works even without any
code arguments. So it's necessary to avoid declaring the components in
the tests.

* Add debug config test and restructure

The code argument imports the provided file. If it adds item to the
registry, that affects global state, which CliRunner doesn't isolate.
Since there's no standard way to remove things from the registry, this
instead uses subprocess.run to run commands.

* Use a more generic, parametrized test

* Add output arg for assemble and pretrain

Assemble and pretrain require an output argument. This commit adds
assemble testing, but not pretrain, as that requires an actual trainable
component, which is not currently in the test config.

* Add evaluate test and some cleanup

* Mark tests as slow

* Revert argument name change

* Apply suggestions from code review

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

* Format API CLI docs

* isort

* Fix imports in tests

* isort

* Undo changes to package CLI help

* Fix python executable and lang code in test

* Fix executable in another test

---------

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
Co-authored-by: Raphael Mitsch <r.mitsch@outlook.com>
2023-08-01 15:24:02 +02:00
Adriane Boyd
2702db9fef
Recommend lookups tables from URLs or other loaders (#12283)
* Recommend lookups tables from URLs or other loaders

Shift away from the `lookups` extra (which isn't removed, just no longer
mentioned) and recommend loading data from the `spacy-lookups-data` repo
or other sources rather than the `spacy-lookups-data` package.

If the tables can't be loaded from the `lookups` registry in the
lemmatizer, show how to specify the tables in `[initialize]` rather than
recommending the `spacy-lookups-data` package.

* Add tests for some rule-based lemmatizers

* Apply suggestions from code review

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

---------

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
2023-07-31 15:54:35 +02:00
Sofie Van Landeghem
eaaac5a08c
Merge pull request #12842 from svlandeg/sync_v4
Sync v4 with latest from master and develop
2023-07-24 12:13:04 +02:00
Sofie Van Landeghem
f293386d3e
remove unnecessary line
Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
2023-07-20 14:08:29 +02:00
Adriane Boyd
4f37e4031c
Update spacy/ml/tb_framework.pyx
Co-authored-by: Raphael Mitsch <r.mitsch@outlook.com>
2023-07-20 09:59:19 +02:00
svlandeg
96f2e30c4b cython fixes and cleanup 2023-07-19 17:41:29 +02:00
svlandeg
846472129c merge fixes (2) 2023-07-19 16:38:37 +02:00
svlandeg
47a82c6164 merge fixes 2023-07-19 16:38:29 +02:00
svlandeg
0e3b6a87d6 Merge branch 'upstream_master' into sync_v4 2023-07-19 16:37:31 +02:00
Sofie Van Landeghem
536798f9e3
Disallow False for first/last arguments of add_pipe (#12793)
* Literal True for first/last options

* add test case

* update docs

* remove old redundant test case

* black formatting

* use Optional typing in docstrings

Co-authored-by: Raphael Mitsch <r.mitsch@outlook.com>

---------

Co-authored-by: Raphael Mitsch <r.mitsch@outlook.com>
2023-07-06 15:20:13 +02:00
Sofie Van Landeghem
b615964be7
Merge pull request #12752 from danieldk/maintenance/sync-v4-master-20230626
Sync `master` into `v4`
2023-06-28 08:56:54 +01:00
Daniël de Kok
8b2732e276 Fix training.callbacks <-> language import cycle 2023-06-26 12:43:45 +02:00
Daniël de Kok
122f3b32ad Fix span <-> underscore import cycle 2023-06-26 12:43:21 +02:00
Daniël de Kok
bf92ca4f10 Merge remote-tracking branch 'upstream/master' into v4-isort 2023-06-26 12:43:00 +02:00
Daniël de Kok
2468742cb8 isort all the things 2023-06-26 11:41:03 +02:00
Daniël de Kok
68089f65cd Configure isort to use the Black profile, recursively isort the spacy module 2023-06-26 11:40:32 +02:00
Daniël de Kok
17c4a3d646
Set version to v4.0.0.dev1 (#12748) 2023-06-23 09:43:41 +02:00
Sofie Van Landeghem
95619b6736
Merge pull request #12717 from danieldk/sync-v4-master-20230612
Merge master into v4
2023-06-22 17:44:57 +01:00
Daniël de Kok
096794dd74 Account for differences between Span.sents in spaCy 3/4 2023-06-22 15:38:22 +02:00
Daniël de Kok
4990cfefb4 spancat type fixes 2023-06-12 16:43:11 +02:00
Daniël de Kok
d82e167aea Remove Python 3.7 builds 2023-06-12 16:16:03 +02:00
Daniël de Kok
50c5e9a2dd Merge remote-tracking branch 'upstream/master' into sync-v4-master-20230612 2023-06-12 15:57:10 +02:00
Daniël de Kok
8a5814bf2c
Add distillation loop (#12542)
* Add distillation initialization and loop

* Fix up configuration keys

* Add docstring

* Type annotations

* init_nlp_distill -> init_nlp_student

* Do not resolve dot name distill corpus in initialization

(Since we don't use it.)

* student: do not request use of optimizer in student pipe

We apply finish up the updates once in the training loop instead.

Also add the necessary logic to `Language.distill` to mirror
`Language.update`.

* Correctly determine sort key in subdivide_batch

* Fix _distill_loop docstring wrt. stopping condition

* _distill_loop: fix distill_data docstring

Make similar changes in train_while_improving, since it also had
incorrect types and missing type annotations.

* Move `set_{gpu_allocator,seed}_from_config` to spacy.util

* Update Language.update docs for the sgd argument

* Type annotation

Co-authored-by: Madeesh Kannan <shadeMe@users.noreply.github.com>

---------

Co-authored-by: Madeesh Kannan <shadeMe@users.noreply.github.com>
2023-04-21 13:49:40 +02:00
Adriane Boyd
5d0f48fe69
Enforce that Span.start/end(_char) remain valid and in sync (#12268)
* Enforce that Span.start/end(_char) remain valid and in sync

Allowing span attributes to be writable starting in v3 has made it
possible for the internal `Span.start/end/start_char/end_char` to get
out-of-sync or have invalid values.

This checks that the values are valid and syncs the token and char
offsets if any attributes are modified directly. It does not yet handle
the case where the underlying doc is modified.

* Format
2023-04-06 16:01:59 +02:00
Daniël de Kok
b734e5314d
Avoid TrainablePipe.finish_update getting called twice during training (#12450)
* Avoid `TrainablePipe.finish_update` getting called twice during training

PR #12136 fixed an issue where the tok2vec pipe was updated before
gradient were accumulated. However, it introduced a new bug that cause
`finish_update` to be called twice when using the training loop. This
causes a fairly large slowdown.

The `Language.update` method accepts the `sgd` argument for passing an
optimizer. This argument has three possible values:

- `Optimizer`: use the given optimizer to finish pipe updates.
- `None`: use a default optimizer to finish pipe updates.
- `False`: do not finish pipe updates.

However, the latter option was not documented and not valid with the
existing type of `sgd`. I assumed that this was a remnant of earlier
spaCy versions and removed handling of `False`.

However, with that change, we are passing `None` to `Language.update`.
As a result, we were calling `finish_update` in both `Language.update`
and in the training loop after all subbatches are processed.

This change restores proper handling/use of `False`. Moreover, the role
of `False` is now documented and added to the type to avoid future
accidents.

* Fix typo

* Document defaults for `Language.update`
2023-03-30 09:30:42 +02:00
Edward
a653dec654
Add info that Vocab and StringStore are not static in docs (#12427)
* Add size increase info about vocab and stringstore

* Update website/docs/api/stringstore.mdx

Co-authored-by: Raphael Mitsch <r.mitsch@outlook.com>

* Update website/docs/api/vocab.mdx

Co-authored-by: Raphael Mitsch <r.mitsch@outlook.com>

* Change wording

---------

Co-authored-by: Raphael Mitsch <r.mitsch@outlook.com>
2023-03-27 09:18:23 +02:00
Raphael Mitsch
3102e2e27a
Entity linking: use SpanGroup instead of Iterable[Span] for mentions (#12344)
* Convert Candidate from Cython to Python class.

* Format.

* Fix .entity_ typo in _add_activations() usage.

* Change type for mentions to look up entity candidates for to SpanGroup from Iterable[Span].

* Update docs.

* Update spacy/kb/candidate.py

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Update doc string of BaseCandidate.__init__().

* Update spacy/kb/candidate.py

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Rename Candidate to InMemoryCandidate, BaseCandidate to Candidate.

* Adjust Candidate to support and mandate numerical entity IDs.

* Format.

* Fix docstring and docs.

* Update website/docs/api/kb.mdx

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Rename alias -> mention.

* Refactor Candidate attribute names. Update docs and tests accordingly.

* Refacor Candidate attributes and their usage.

* Format.

* Fix mypy error.

* Update error code in line with v4 convention.

* Reverse erroneous changes during merge.

* Update return type in EL tests.

* Re-add Candidate to setup.py.

* Format updated docs.

---------

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
2023-03-20 12:25:18 +01:00
Raphael Mitsch
9340eb8ad2
Introduce hierarchy for EL Candidate objects (#12341)
* Convert Candidate from Cython to Python class.

* Format.

* Fix .entity_ typo in _add_activations() usage.

* Update spacy/kb/candidate.py

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Update doc string of BaseCandidate.__init__().

* Update spacy/kb/candidate.py

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Rename Candidate to InMemoryCandidate, BaseCandidate to Candidate.

* Adjust Candidate to support and mandate numerical entity IDs.

* Format.

* Fix docstring and docs.

* Update website/docs/api/kb.mdx

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Rename alias -> mention.

* Refactor Candidate attribute names. Update docs and tests accordingly.

* Refacor Candidate attributes and their usage.

* Format.

* Fix mypy error.

* Update error code in line with v4 convention.

* Update spacy/kb/candidate.py

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Updated error code.

* Simplify interface for int/str representations.

* Update website/docs/api/kb.mdx

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Rename 'alias' to 'mention'.

* Port Candidate and InMemoryCandidate to Cython.

* Remove redundant entry in setup.py.

* Add abstract class check.

* Drop storing mention.

* Update spacy/kb/candidate.pxd

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Fix entity_id refactoring problems in docstrings.

* Drop unused InMemoryCandidate._entity_hash.

* Update docstrings.

* Move attributes out of Candidate.

* Partially fix alias/mention terminology usage. Convert Candidate to interface.

* Remove prior_prob from supported properties in Candidate. Introduce KnowledgeBase.supports_prior_probs().

* Update docstrings related to prior_prob.

* Update alias/mention usage in doc(strings).

* Update spacy/ml/models/entity_linker.py

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Update spacy/ml/models/entity_linker.py

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Mention -> alias renaming. Drop Candidate.mentions(). Drop InMemoryLookupKB.get_alias_candidates() from docs.

* Update docstrings.

* Fix InMemoryCandidate attribute names.

* Update spacy/kb/kb.pyx

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Update spacy/ml/models/entity_linker.py

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Update W401 test.

* Update spacy/errors.py

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Update spacy/kb/kb.pyx

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Use Candidate output type for toy generators in the test suite to mimick best practices

* fix docs

* fix import

---------

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
2023-03-20 00:34:35 +01:00
Adriane Boyd
6ae7618418
Clean up Vocab constructor (#12290)
* Clean up Vocab constructor

* Change effective type of `strings` from `Iterable[str]` to `Optional[StringStore]`
  * Don't automatically add strings to vocab
* Change default values to `None`
* Remove `**deprecated_kwargs`

* Format
2023-03-19 23:41:20 +01:00
Madeesh Kannan
520279ff7c
Tok2Vec: Add distill method (#12108)
* `Tok2Vec`: Add `distill` method

* `Tok2Vec`: Refactor `update`

* Add `Tok2Vec.distill` test

* Update `distill` signature to accept `Example`s instead of separate teacher and student docs

* Add docs

* Remove docstring

* Update test

* Remove `update` calls from test

* Update `Tok2Vec.distill` docstring
2023-03-09 09:37:19 +01:00
Raphael Mitsch
41b3a0d932
Drop support for EntityLinker_v1. (#12377) 2023-03-07 13:10:45 +01:00
Adriane Boyd
8ca71f9591
Merge pull request #12371 from rmitsch/sync/master-into-v4
Sync `v4` with latest from `master`
2023-03-06 17:10:19 +01:00
Raphael Mitsch
749e446ee3 Merge branch 'master' into sync/master-into-v4
# Conflicts:
#	.github/azure-steps.yml
2023-03-06 16:27:56 +01:00
Sofie Van Landeghem
04f41854c1
Merge pull request #12356 from rmitsch/sync/master-into-v4
Sync `v4` with latest from `master`
2023-03-03 09:31:45 +01:00
Raphael Mitsch
1ea31552be Merge branch 'master' into sync/master-into-v4
# Conflicts:
#	requirements.txt
#	spacy/pipeline/entity_linker.py
#	spacy/util.py
#	website/docs/api/entitylinker.mdx
2023-03-02 16:24:15 +01:00
Adriane Boyd
da75896ef5
Return Tuple[Span] for all Doc/Span attrs that provide spans (#12288)
* Return Tuple[Span] for all Doc/Span attrs that provide spans

* Update Span types
2023-03-01 16:00:02 +01:00
Adriane Boyd
df4c069a13
Remove backoff from .vector to .tensor (#12292) 2023-02-23 11:36:50 +01:00
Daniël de Kok
e27c60a702
Reimplement distillation with oracle cut size (#12214)
* Improve the correctness of _parse_patch

* If there are no more actions, do not attempt to make further
  transitions, even if not all states are final.
* Assert that the number of actions for a step is the same as
  the number of states.

* Reimplement distillation with oracle cut size

The code for distillation with an oracle cut size was not reimplemented
after the parser refactor. We did not notice, because we did not have
tests for this functionality. This change brings back the functionality
and adds this to the parser tests.

* Rename states2actions to _states_to_actions for consistency

* Test distillation max cuts in NER

* Mark parser/NER tests as slow

* Typo

* Fix invariant in _states_diff_to_actions

* Rename _init_batch -> _init_batch_from_teacher

* Ninja edit the ninja edit

* Check that we raise an exception when we pass the incorrect number or actions

* Remove unnecessary get

Co-authored-by: Madeesh Kannan <shadeMe@users.noreply.github.com>

* Write out condition more explicitly

---------

Co-authored-by: Madeesh Kannan <shadeMe@users.noreply.github.com>
2023-02-21 15:47:18 +01:00
Paul O'Leary McCann
dd3f138830
Use tempfile.TemporaryDirectory (#12285) 2023-02-16 11:08:55 +01:00
Adriane Boyd
b95123060a
Make Span.char_span optional args keyword-only (#12257)
* Make Span.char_span optional args keyword-only

* Make kb_id and following kw-only

* Format
2023-02-15 12:34:33 +01:00
Adriane Boyd
cbc2ae933e
Remove unused Span.char_span(id=) (#12250) 2023-02-08 14:46:07 +01:00
Adriane Boyd
cf85b81f34
Remove names for vectors (#12243)
* Remove names for vectors

Named vectors are basically a carry-over from v2 and aren't used for
anything.

* Format
2023-02-08 14:37:42 +01:00
Adriane Boyd
5089efa2d0
Use the same tuple in Span cmp and hash (#12251) 2023-02-08 14:28:34 +01:00
Daniël de Kok
eec5ccd72f
Language.update: ensure that tok2vec gets updated (#12136)
* `Language.update`: ensure that tok2vec gets updated

The components in a pipeline can be updated independently. However,
tok2vec implementations are an exception to this, since they depend on
listeners for their gradients. The update method of a tok2vec
implementation computes the tok2vec forward and passes this along with a
backprop function to the listeners. This backprop function accumulates
gradients for all the listeners. There are two ways in which the
accumulated gradients can be used to update the tok2vec weights:

1. Call the `finish_update` method of tok2vec *after* the `update`
   method is called on all of the pipes that use a tok2vec listener.
2. Pass an optimizer to the `update` method of tok2vec. In this
   case, tok2vec will give the last listener a special backprop
   function that calls `finish_update` on the tok2vec.

Unfortunately, `Language.update` did neither of these. Instead, it
immediately called `finish_update` on every pipe after `update`. As a
result, the tok2vec weights are updated when no gradients have been
accumulated from listeners yet. And the gradients of the listeners are
only used in the next call to `Language.update` (when `finish_update` is
called on tok2vec again).

This change fixes this issue by passing the optimizer to the `update`
method of trainable pipes, leading to use of the second strategy
outlined above.

The main updating loop in `Language.update` is also simplified by using
the `TrainableComponent` protocol consistently.

* Train loop: `sgd` is `Optional[Optimizer]`, do not pass false

* Language.update: call pipe finish_update after all pipe updates

This does correct and fast updates if multiple components update the
same parameters.

* Add comment why we moved `finish_update` to a separate loop
2023-02-03 15:22:25 +01:00
Sofie Van Landeghem
c47ec5b5c6
Merge pull request #12218 from adrianeboyd/chore/update-v4-from-master-7
Update v4 from master
2023-02-03 12:04:20 +01:00
Paul O'Leary McCann
89f974d4f5
Cleanup/remove backwards compat overwrite settings (#11888)
* Remove backwards-compatible overwrite from Entity Linker

This also adds a docstring about overwrite, since it wasn't present.

* Fix docstring

* Remove backward compat settings in Morphologizer

This also needed a docstring added.

For this component it's less clear what the right overwrite settings
are.

* Remove backward compat from sentencizer

This was simple

* Remove backward compat from senter

Another simple one

* Remove backward compat setting from tagger

* Add docstrings

* Update spacy/pipeline/morphologizer.pyx

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

* Update docs

---------

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
2023-02-02 14:13:38 +01:00
Adriane Boyd
cd95b29053 Merge remote-tracking branch 'upstream/master' into chore/update-v4-from-master-7 2023-02-02 13:06:15 +01:00
Paul O'Leary McCann
6920fb7baf
Move Entity Linker v1 to spacy-legacy (#12006)
* Move Entity Linker v1 component to spacy-legacy

This is a follow up to #11889 that moves the component instead of
removing it.

In general, we never import from spacy-legacy in spaCy proper. However,
to use this component, that kind of import will be necessary. I was able
to test this without issues, but is this current import strategy
acceptable? Or should we put the component in a registry?

* Use spacy-legacy pr for CI

This will need to be reverted before merging.

* Add temporary step to log installed spacy-legacy version

* Modify requirements.txt to trigger tests

* Add comment to Python to trigger tests

* TODO REVERT This is a commit with logic changes to trigger tests

* Remove pipe from YAML

Works locally, but possibly this is causing a quoting error or
something.

* Revert "TODO REVERT This is a commit with logic changes to trigger tests"

This reverts commit 689fae71f3.

* Revert "Add comment to Python to trigger tests"

This reverts commit 11840fc598.

* Add more logging

* Try installing directly in workflow

* Try explicitly uninstalling spacy-legacy first

* Cat requirements.txt to confirm contents

In the branch, the thinc version spec is `thinc>=8.1.0,<8.2.0`. But in
the logs, it's clear that a development release of 9.0 is being
installed. It's not clear why that would happen.

* Log requirements at start of build

* TODO REVERT Change thinc spec

Want to see what happens to the installed thinc spec with this change.

* Update thinc requirements

This makes it the same as it was before the merge, >=8.1.0,<8.2.0.

* Use same thinc version as v4 branch

* TODO REVERT Mark dependency check as xfail

spacy-legacy is specified as a git checkout in requirements.txt while
this PR is in progress, which makes the consistency check here fail.

* Remove debugging output / install step

* Revert "Remove debugging output / install step"

This reverts commit 923ea7448b.

* Clean up debugging output

The manual install step with the URL fragment seems to have caused
issues on Windows due to the = in the URL being misinterpreted. On the
other hand, removing it seems to mean the git version of spacy-legacy
isn't actually installed.

This PR removes the URL fragment but keeps the direct command-line
install. Additionally, since it looks like this job is configured to use
the default shell (and not bash), it removes a comment that upsets the
Windows cmd shell.

* Revert "TODO REVERT Mark dependency check as xfail"

This reverts commit d4863ec156.

* Fix requirements.txt, increasing spacy-legacy version

* Raise spacy legacy version in setup.cfg

* Remove azure build workarounds

* make spacy-legacy version explicit in error message

* Remove debugging line

* Suggestions from code review
2023-02-01 09:47:56 +01:00
Edward
360ccf628a
Rename language codes (Icelandic, multi-language) (#12149)
* Init

* fix tests

* Update spacy/errors.py

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

* Fix test_blank_languages

* Rename xx to mul in docs

* Format _util with black

* prettier formatting

---------

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
2023-01-31 17:30:43 +01:00
Daniël de Kok
c6cca4c00a
Language.distill: copy both reference and predicted (#12209)
* Language.distill: copy both reference and predicted

In distillation we also modify the teacher docs (e.g. in tok2vec
components), so we need to copy both the reference and predicted doc.

Problem caught by @shadeMe

* Make new `_copy_examples` args kwonly
2023-01-31 13:19:42 +01:00
Daniël de Kok
fb7f018ded
Add the configuration schema for distillation (#12201)
* Add the configuration schema for distillation

This also adds the default configuration and some tests. The schema will
be used by the training loop and `distill` subcommand.

* Format

* Change distillation shortopt to -d

* Fix descripion of max_epochs

* Rename distillation flag to -dt

* Rename `pipe_map` to `student_to_teacher`
2023-01-31 13:06:02 +01:00
Paul O'Leary McCann
1b5aba9e22
Don't re-download installed models (#12188)
* Don't re-download installed models

When downloading a model, this checks if the same version of the same
model is already installed. If it is then the download is skipped.

This is necessary because pip uses the final download URL for its
caching feature, but because of the way models are hosted on Github,
their URLs change every few minutes.

* Use importlib instead of meta.json

* Use get_package_version

* Add untested, disabled test

---------

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
2023-01-31 11:31:17 +01:00
Daniël de Kok
6b07be2110
Add Language.distill (#12116)
* Add `Language.distill`

This method is the distillation counterpart of `Language.update`.  It
takes a teacher `Language` instance and distills the student pipes on
the teacher pipes.

* Apply suggestions from code review

Co-authored-by: Madeesh Kannan <shadeMe@users.noreply.github.com>

* Clarify that how Example is used in distillation

* Update transition parser distill docstring for examples argument

* Pass optimizer to `TrainablePipe.distill`

* Annotate pipe before update

As discussed internally, we want to let a pipe annotate before doing an
update with gold/silver data. Otherwise, the output may be (too)
informed by the gold/silver data.

* Rename `component_map` to `student_to_teacher`

* Better synopsis in `Language.distill` docstring

* `name` -> `student_name`

* Fix labels type in docstring

* Mark distill test as slow

* Fix `student_to_teacher` type in docs

---------

Co-authored-by: Madeesh Kannan <shadeMe@users.noreply.github.com>
2023-01-30 12:44:11 +01:00
Adriane Boyd
ec45f704b1
Drop python 3.6/3.7, remove unneeded compat (#12187)
* Drop python 3.6/3.7, remove unneeded compat

* Remove unused import

* Minimal python 3.8+ docs updates
2023-01-27 15:48:20 +01:00
Sofie Van Landeghem
1678a98449
Merge pull request #12192 from adrianeboyd/chore/update-v4-from-master-5
Update v4 from master, format, update CI
2023-01-27 14:59:26 +01:00
Adriane Boyd
16609517f1 CI: Skip tests that require published pipelines 2023-01-27 08:37:02 +01:00
Adriane Boyd
fd911fe2af Format 2023-01-27 08:29:46 +01:00
Adriane Boyd
8548d4d16e Merge remote-tracking branch 'upstream/master' into update-v4-from-master-1 2023-01-27 08:29:09 +01:00
Paul O'Leary McCann
de360bc981
Refactor lexeme mem passing (#12125)
* Don't pass mem pool to new lexeme function

* Remove unused mem from function args

Two methods calling _new_lexeme, get and get_by_orth, took mem arguments
just to call the internal method. That's no longer necessary, so this
cleans it up.

* prettier formatting

* Remove more unused mem args
2023-01-25 12:50:21 +09:00
Daniël de Kok
6348a7a4b4
Set version to v4.0.0.dev0 (#12126) 2023-01-19 09:25:34 +01:00
Daniël de Kok
b052b1b47f
Fix batching regression (#12094)
* Fix batching regression

Some time ago, the spaCy v4 branch switched to the new Thinc v9
schedule. However, this introduced an error in how batching is handed.

In the PR, the batchers were changed to keep track of their step,
so that the step can be passed to the schedule. However, the issue
is that the training loop repeatedly calls the batching functions
(rather than using an infinite generator/iterator). So, the step and
therefore the schedule would be reset each epoch. Before the schedule
switch we didn't have this issue, because the old schedules were
stateful.

This PR fixes this issue by reverting the batching functions to use
a (stateful) generator. Their registry functions do accept a `Schedule`
and we convert `Schedule`s to generators.

* Update batcher docs

* Docstring fixes

* Make minibatch take iterables again as well

* Bump thinc requirement to 9.0.0.dev2

* Use type declaration

* Convert another comment into a proper type declaration
2023-01-18 18:28:30 +01:00
Daniël de Kok
a183db3cef
Merge the parser refactor into v4 (#10940)
* Try to fix doc.copy

* Set dev version

* Make vocab always own lexemes

* Change version

* Add SpanGroups.copy method

* Fix set_annotations during Parser.update

* Fix dict proxy copy

* Upd version

* Fix copying SpanGroups

* Fix set_annotations in parser.update

* Fix parser set_annotations during update

* Revert "Fix parser set_annotations during update"

This reverts commit eb138c89ed.

* Revert "Fix set_annotations in parser.update"

This reverts commit c6df0eafd0.

* Fix set_annotations during parser update

* Inc version

* Handle final states in get_oracle_sequence

* Inc version

* Try to fix parser training

* Inc version

* Fix

* Inc version

* Fix parser oracle

* Inc version

* Inc version

* Fix transition has_gold

* Inc version

* Try to use real histories, not oracle

* Inc version

* Upd parser

* Inc version

* WIP on rewrite parser

* WIP refactor parser

* New progress on parser model refactor

* Prepare to remove parser_model.pyx

* Convert parser from cdef class

* Delete spacy.ml.parser_model

* Delete _precomputable_affine module

* Wire up tb_framework to new parser model

* Wire up parser model

* Uncython ner.pyx and dep_parser.pyx

* Uncython

* Work on parser model

* Support unseen_classes in parser model

* Support unseen classes in parser

* Cleaner handling of unseen classes

* Work through tests

* Keep working through errors

* Keep working through errors

* Work on parser. 15 tests failing

* Xfail beam stuff. 9 failures

* More xfail. 7 failures

* Xfail. 6 failures

* cleanup

* formatting

* fixes

* pass nO through

* Fix empty doc in update

* Hackishly fix resizing. 3 failures

* Fix redundant test. 2 failures

* Add reference version

* black formatting

* Get tests passing with reference implementation

* Fix missing prints

* Add missing file

* Improve indexing on reference implementation

* Get non-reference forward func working

* Start rigging beam back up

* removing redundant tests, cf #8106

* black formatting

* temporarily xfailing issue 4314

* make flake8 happy again

* mypy fixes

* ensure labels are added upon predict

* cleanup remnants from merge conflicts

* Improve unseen label masking

Two changes to speed up masking by ~10%:

- Use a bool array rather than an array of float32.

- Let the mask indicate whether a label was seen, rather than
  unseen. The mask is most frequently used to index scores for
  seen labels. However, since the mask marked unseen labels,
  this required computing an intermittent flipped mask.

* Write moves costs directly into numpy array (#10163)

This avoids elementwise indexing and the allocation of an additional
array.

Gives a ~15% speed improvement when using batch_by_sequence with size
32.

* Temporarily disable ner and rehearse tests

Until rehearse is implemented again in the refactored parser.

* Fix loss serialization issue (#10600)

* Fix loss serialization issue

Serialization of a model fails with:

TypeError: array(738.3855, dtype=float32) is not JSON serializable

Fix this using float conversion.

* Disable CI steps that require spacy.TransitionBasedParser.v2

After finishing the refactor, TransitionBasedParser.v2 should be
provided for backwards compat.

* Add back support for beam parsing to the refactored parser (#10633)

* Add back support for beam parsing

Beam parsing was already implemented as part of the `BeamBatch` class.
This change makes its counterpart `GreedyBatch`. Both classes are hooked
up in `TransitionModel`, selecting `GreedyBatch` when the beam size is
one, or `BeamBatch` otherwise.

* Use kwarg for beam width

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Avoid implicit default for beam_width and beam_density

* Parser.{beam,greedy}_parse: ensure labels are added

* Remove 'deprecated' comments

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Parser `StateC` optimizations (#10746)

* `StateC`: Optimizations

Avoid GIL acquisition in `__init__`
Increase default buffer capacities on init
Reduce C++ exception overhead

* Fix typo

* Replace `set::count` with `set::find`

* Add exception attribute to c'tor

* Remove unused import

* Use a power-of-two value for initial capacity
Use default-insert to init `_heads` and `_unshiftable`

* Merge `cdef` variable declarations and assignments

* Vectorize `example.get_aligned_parses` (#10789)

* `example`: Vectorize `get_aligned_parse`
Rename `numpy` import

* Convert aligned array to lists before returning

* Revert import renaming

* Elide slice arguments when selecting the entire range

* Tagger/morphologizer alignment performance optimizations (#10798)

* `example`: Unwrap `numpy` scalar arrays before passing them to `StringStore.__getitem__`

* `AlignmentArray`: Use native list as staging buffer for offset calculation

* `example`: Vectorize `get_aligned`

* Hoist inner functions out of `get_aligned`

* Replace inline `if..else` clause in assignment statement

* `AlignmentArray`: Use raw indexing into offset and data `numpy` arrays

* `example`: Replace array unique value check with `groupby`

* `example`: Correctly exclude tokens with no alignment in `_get_aligned_vectorized`
Simplify `_get_aligned_non_vectorized`

* `util`: Update `all_equal` docstring

* Explicitly use `int32_t*`

* Restore C CPU inference in the refactored parser (#10747)

* Bring back the C parsing model

The C parsing model is used for CPU inference and is still faster for
CPU inference than the forward pass of the Thinc model.

* Use C sgemm provided by the Ops implementation

* Make tb_framework module Cython, merge in C forward implementation

* TransitionModel: raise in backprop returned from forward_cpu

* Re-enable greedy parse test

* Return transition scores when forward_cpu is used

* Apply suggestions from code review

Import `Model` from `thinc.api`

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Use relative imports in tb_framework

* Don't assume a default for beam_width

* We don't have a direct dependency on BLIS anymore

* Rename forwards to _forward_{fallback,greedy_cpu}

* Require thinc >=8.1.0,<8.2.0

* tb_framework: clean up imports

* Fix return type of _get_seen_mask

* Move up _forward_greedy_cpu

* Style fixes.

* Lower thinc lowerbound to 8.1.0.dev0

* Formatting fix

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

* Reimplement parser rehearsal function (#10878)

* Reimplement parser rehearsal function

Before the parser refactor, rehearsal was driven by a loop in the
`rehearse` method itself. For each parsing step, the loops would:

1. Get the predictions of the teacher.
2. Get the predictions and backprop function of the student.
3. Compute the loss and backprop into the student.
4. Move the teacher and student forward with the predictions of
   the student.

In the refactored parser, we cannot perform search stepwise rehearsal
anymore, since the model now predicts all parsing steps at once.
Therefore, rehearsal is performed in the following steps:

1. Get the predictions of all parsing steps from the student, along
   with its backprop function.
2. Get the predictions from the teacher, but use the predictions of
   the student to advance the parser while doing so.
3. Compute the loss and backprop into the student.

To support the second step a new method, `advance_with_actions` is
added to `GreedyBatch`, which performs the provided parsing steps.

* tb_framework: wrap upper_W and upper_b in Linear

Thinc's Optimizer cannot handle resizing of existing parameters. Until
it does, we work around this by wrapping the weights/biases of the upper
layer of the parser model in Linear. When the upper layer is resized, we
copy over the existing parameters into a new Linear instance. This does
not trigger an error in Optimizer, because it sees the resized layer as
a new set of parameters.

* Add test for TransitionSystem.apply_actions

* Better FIXME marker

Co-authored-by: Madeesh Kannan <shadeMe@users.noreply.github.com>

* Fixes from Madeesh

* Apply suggestions from Sofie

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Remove useless assignment

Co-authored-by: Madeesh Kannan <shadeMe@users.noreply.github.com>
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Rename some identifiers in the parser refactor (#10935)

* Rename _parseC to _parse_batch

* tb_framework: prefix many auxiliary functions with underscore

To clearly state the intent that they are private.

* Rename `lower` to `hidden`, `upper` to `output`

* Parser slow test fixup

We don't have TransitionBasedParser.{v1,v2} until we bring it back as a
legacy option.

* Remove last vestiges of PrecomputableAffine

This does not exist anymore as a separate layer.

* ner: re-enable sentence boundary checks

* Re-enable test that works now.

* test_ner: make loss test more strict again

* Remove commented line

* Re-enable some more beam parser tests

* Remove unused _forward_reference function

* Update for CBlas changes in Thinc 8.1.0.dev2

Bump thinc dependency to 8.1.0.dev3.

* Remove references to spacy.TransitionBasedParser.{v1,v2}

Since they will not be offered starting with spaCy v4.

* `tb_framework`: Replace references to `thinc.backends.linalg` with `CBlas`

* dont use get_array_module (#11056) (#11293)

Co-authored-by: kadarakos <kadar.akos@gmail.com>

* Move `thinc.extra.search` to `spacy.pipeline._parser_internals` (#11317)

* `search`: Move from `thinc.extra.search`
Fix NPE in `Beam.__dealloc__`

* `pytest`: Add support for executing Cython tests
Move `search` tests from thinc and patch them to run with `pytest`

* `mypy` fix

* Update comment

* `conftest`: Expose `register_cython_tests`

* Remove unused import

* Move `argmax` impls to new `_parser_utils` Cython module (#11410)

* Parser does not have to be a cdef class anymore

This also fixes validation of the initialization schema.

* Add back spacy.TransitionBasedParser.v2

* Fix a rename that was missed in #10878.

So that rehearsal tests pass.

* Remove module from setup.py that got added during the merge

* Bring back support for `update_with_oracle_cut_size` (#12086)

* Bring back support for `update_with_oracle_cut_size`

This option was available in the pre-refactor parser, but was never
implemented in the refactored parser. This option cuts transition
sequences that are longer than `update_with_oracle_cut` size into
separate sequences that have at most `update_with_oracle_cut`
transitions. The oracle (gold standard) transition sequence is used to
determine the cuts and the initial states for the additional sequences.

Applying this cut makes the batches more homogeneous in the transition
sequence lengths, making forward passes (and as a consequence training)
much faster.

Training time 1000 steps on de_core_news_lg:

- Before this change: 149s
- After this change: 68s
- Pre-refactor parser: 81s

* Fix a rename that was missed in #10878.

So that rehearsal tests pass.

* Apply suggestions from @shadeMe

* Use chained conditional

* Test with update_with_oracle_cut_size={0, 1, 5, 100}

And fix a git that occurs with a cut size of 1.

* Fix up some merge fall out

* Update parser distillation for the refactor

In the old parser, we'd iterate over the transitions in the distill
function and compute the loss/gradients on the go. In the refactored
parser, we first let the student model parse the inputs. Then we'll let
the teacher compute the transition probabilities of the states in the
student's transition sequence. We can then compute the gradients of the
student given the teacher.

* Add back spacy.TransitionBasedParser.v1 references

- Accordion in the architecture docs.
- Test in test_parse, but disabled until we have a spacy-legacy release.

Co-authored-by: Matthew Honnibal <honnibal+gh@gmail.com>
Co-authored-by: svlandeg <svlandeg@github.com>
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
Co-authored-by: Madeesh Kannan <shadeMe@users.noreply.github.com>
Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
Co-authored-by: kadarakos <kadar.akos@gmail.com>
2023-01-18 11:27:45 +01:00
Daniël de Kok
5e297aa20e
Add TrainablePipe.{distill,get_teacher_student_loss} (#12016)
* Add `TrainablePipe.{distill,get_teacher_student_loss}`

This change adds two methods:

- `TrainablePipe::distill` which performs a training step of a
   student pipe on a teacher pipe, giving a batch of `Doc`s.
- `TrainablePipe::get_teacher_student_loss` computes the loss
  of a student relative to the teacher.

The `distill` or `get_teacher_student_loss` methods are also implemented
in the tagger, edit tree lemmatizer, and parser pipes, to enable
distillation in those pipes and as an example for other pipes.

* Fix stray `Beam` import

* Fix incorrect import

* Apply suggestions from code review

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Apply suggestions from code review

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* TrainablePipe.distill: use `Iterable[Example]`

* Add Pipe.is_distillable method

* Add `validate_distillation_examples`

This first calls `validate_examples` and then checks that the
student/teacher tokens are the same.

* Update distill documentation

* Add distill documentation for all pipes that support distillation

* Fix incorrect identifier

* Apply suggestions from code review

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Add comment to explain `is_distillable`

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
2023-01-16 10:25:53 +01:00
Sofie Van Landeghem
c2f3e699ca
fix anchors (#12095) 2023-01-13 11:14:58 +01:00
Sofie Van Landeghem
2c2e66e145
Merge pull request #12096 from svlandeg/copy_v4
Sync with latest from master
2023-01-11 20:46:33 +01:00
svlandeg
fc2723925b update tests from master to follow v4 principles (2) 2023-01-11 19:04:06 +01:00
svlandeg
6ff5eb256c update tests from master to follow v4 principles 2023-01-11 18:57:50 +01:00
svlandeg
b2fd9490e3 Merge branch 'copy_master' into copy_v4 2023-01-11 18:40:55 +01:00
Madeesh Kannan
a231bf65af
Pass step=0 to Schedule class to yield initial learning rate (#12078) 2023-01-09 20:15:02 +01:00
Tetsuo Kiso
b510fbd0aa
Delete unused imports for StringStore (#12040) 2023-01-03 17:43:09 +01:00
Sofie Van Landeghem
326b541312
Merge pull request #12049 from svlandeg/copy_v4
Sync v4 with latest from master
2023-01-03 16:43:54 +01:00
svlandeg
6852adc8b7 Merge branch 'copy_master' into copy_v4 2023-01-03 13:34:05 +01:00
Daniël de Kok
20b63943f5
Adjust to new Schedule class and pass scores to Optimizer (#12008)
* Adjust to new `Schedule` class and pass scores to `Optimizer`

Requires https://github.com/explosion/thinc/pull/804

* Bump minimum Thinc requirement to 9.0.0.dev1
2022-12-29 08:03:24 +01:00
Daniël de Kok
d30ba9b7b8
Merge pull request #12015 from danieldk/chore/v4-merge-master-20221222
Merge master into v4
2022-12-22 11:22:33 +01:00
Daniël de Kok
2f08deea2a Fix fallout from a previous merge 2022-12-22 10:23:31 +01:00
Daniël de Kok
207565a788 Merge remote-tracking branch 'upstream/master' into chore/v4-merge-master-20221222 2022-12-22 10:08:54 +01:00
Daniël de Kok
f9308aae13
Fix v4 branch to build against Thinc v9 (#11921)
* Move `thinc.extra.search` to `spacy.pipeline._parser_internals`

Backport of:
https://github.com/explosion/spaCy/pull/11317

Co-authored-by: Madeesh Kannan <shadeMe@users.noreply.github.com>

* Replace references to `thinc.backends.linalg` with `CBlas`

Backport of:
https://github.com/explosion/spaCy/pull/11292

Co-authored-by: Madeesh Kannan <shadeMe@users.noreply.github.com>

* Use cross entropy from `thinc.legacy`

* Require thinc>=9.0.0.dev0,<9.1.0

Co-authored-by: Madeesh Kannan <shadeMe@users.noreply.github.com>
2022-12-17 14:32:19 +01:00
Edward
ca75190a3d
Custom extensions for spans with equal boundaries (#11429)
* Init

* Fix return type for mypy

* adjust types and improve setting new attributes

* Add underscore changes to json conversion

* Add test and underscore changes to from_docs

* add underscore changes and test to span.to_doc

* update return values

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Add types to function

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* adjust formatting

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* shorten return type

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* add helper function to improve readability

* Improve code and add comments

* rerun azure tests

* Fix tests for json conversion

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
2022-12-12 08:55:53 +01:00
Madeesh Kannan
f5aabaf7d6
Remove unused, experimental multi-task components (#11919)
* Remove experimental multi-task components

These are incomplete implementations and are not usable in their current state.

* Remove orphaned error message

* Switch ubuntu-latest to ubuntu-20.04 in main tests (#11928)

* Switch ubuntu-latest to ubuntu-20.04 in main tests

* Only use 20.04 for 3.6

* Revert "Switch ubuntu-latest to ubuntu-20.04 in main tests (#11928)"

This reverts commit 77c0fd7b17.

Co-authored-by: Paul O'Leary McCann <polm@dampfkraft.com>
2022-12-08 13:24:45 +01:00
Paul O'Leary McCann
d60997febb
Remove old model shortcuts (#11916)
* Remove old model shortcuts

* Remove error, docs warnings about shortcuts

* Fix import in util

Accidentally deleted the whole import and not just the old part...

* Change universe example to v3 style

* Switch ubuntu-latest to ubuntu-20.04 in main tests (#11928)

* Switch ubuntu-latest to ubuntu-20.04 in main tests

* Only use 20.04 for 3.6

* Update some model loading in Universe

* Add v2 tag to neuralcoref

* Use the spacy-version feature instead of a v2 tag

Co-authored-by: svlandeg <svlandeg@github.com>
2022-12-08 11:45:52 +01:00
Paul O'Leary McCann
6b9af38eeb
Remove all references to "begin_training" (#11943)
When v3 was released, `begin_training` was renamed to `initialize`.
There were warnings in the code and docs about that. This PR removes
them.
2022-12-08 11:43:52 +01:00
Sofie Van Landeghem
60379cec65
Merge pull request #11929 from svlandeg/copy_v4
sync v4 with latest master
2022-12-07 15:24:07 +01:00
Paul O'Leary McCann
8267aa1b65 Switch ubuntu-latest to ubuntu-20.04 in main tests (#11928)
* Switch ubuntu-latest to ubuntu-20.04 in main tests

* Only use 20.04 for 3.6
2022-12-05 09:44:19 +01:00
svlandeg
799d226676 prettier formatting 2022-12-05 08:57:24 +01:00
svlandeg
04fea09ffd Merge branch 'copy_master' into copy_v4 2022-12-05 08:56:15 +01:00
Edward
e79910d57e
Remove sentiment extension (#11722)
* remove sentiment attribute

* remove sentiment from docs

* add test for backwards compatibility

* replace from_disk with from_bytes

* Fix docs and format file

* Fix formatting
2022-11-23 13:09:32 +01:00
Adriane Boyd
d0fc871a1c
Merge pull request #11741 from adrianeboyd/chore/update-v4-from-master-4
Update v4 from master
2022-11-03 12:39:15 +01:00
Adriane Boyd
68b8fa2df2 Merge remote-tracking branch 'upstream/master' into chore/update-v4-from-master-4 2022-11-03 09:42:36 +01:00
Adriane Boyd
cae4589f5a
Replace EntityRuler with SpanRuler implementation (#11320)
* Replace EntityRuler with SpanRuler implementation

Remove `EntityRuler` and rename the `SpanRuler`-based
`future_entity_ruler` to `entity_ruler`.

Main changes:

* It is no longer possible to load patterns on init as with
`EntityRuler(patterns=)`.
* The older serialization formats (`patterns.jsonl`) are no longer
supported and the related tests are removed.
* The config settings are only stored in the config, not in the
serialized component (in particular the `phrase_matcher_attr` and
overwrite settings).

* Add migration guide to EntityRuler API docs

* docs update

* Minor edit

Co-authored-by: svlandeg <svlandeg@github.com>
2022-10-24 09:11:35 +02:00
Adriane Boyd
a4bd890f32
Merge pull request #11686 from adrianeboyd/chore/update-v4-from-master
Update v4 from master
2022-10-21 12:55:53 +02:00
Paul O'Leary McCann
0e2b7fb28b
Remove thinc util reimports (#11665)
* Remove imports marked as v2 leftovers

There are a few functions that were in `spacy.util` in v2, but were
moved to Thinc. In v3 these were imported in `spacy.util` so that code
could be used unchanged, but the comment over them indicates they should
always be imported from Thinc. This commit removes those imports.

It doesn't look like any DeprecationWarning was ever thrown for using
these, but it is probably fine to remove them anyway with a major
version. It is not clear that they were widely used.

* Import fix_random_seed correctly

This seems to be the only place in spaCy that was using the old import.
2022-10-21 11:01:18 +02:00
Adriane Boyd
103b24fb25 Merge remote-tracking branch 'upstream/master' into chore/update-v4-from-master 2022-10-21 09:13:32 +02:00
Madeesh Kannan
446a3ecf34
StringStore refactoring (#11344)
* `strings`: Remove unused `hash32_utf8` function

* `strings`: Make `hash_utf8` and `decode_Utf8Str` private

* `strings`: Reorganize private functions

* 'strings': Raise error when non-string/-int types are passed to functions that don't accept them

* `strings`: Add `items()` method, add type hints, remove unused methods, restrict inputs to specific types, reorganize methods

* `Morphology`: Use `StringStore.items()` to enumerate features when pickling

* `test_stringstore`: Update pre-Python 3 tests

* Update `StringStore` docs

* Fix `get_string_id` imports

* Replace redundant test with tests for type checking

* Rename `_retrieve_interned_str`, remove `.get` default arg

* Add `get_string_id` to `strings.pyi`
Remove `mypy` ignore directives from imports of the above

* `strings.pyi`: Replace functions that consume `Union`-typed params with overloads

* `strings.pyi`: Revert some function signatures

* Update `SYMBOLS_BY_INT` lookups and error codes post-merge

* Revert clobbered change introduced in a previous merge

* Remove unnecessary type hint

* Invert tuple order in `StringStore.items()`

* Add test for `StringStore.items()`

* Revert "`Morphology`: Use `StringStore.items()` to enumerate features when pickling"

This reverts commit 1af9510ceb.

* Rename `keys` and `key_map`

* Add `keys()` and `values()`

* Add comment about the inverted key-value semantics in the API

* Fix type hints

* Implement `keys()`, `values()`, `items()` without generators

* Fix type hints, remove unnecessary boxing

* Update docs

* Simplify `keys/values/items()` impl

* `mypy` fix

* Fix error message, doc fixes
2022-10-06 10:51:06 +02:00
Sofie Van Landeghem
c6704f368c
Merge pull request #11572 from svlandeg/copy_v4
update v4 with latest from master and develop
2022-10-03 15:30:55 +02:00
svlandeg
d4922f25fc fix test for EL activations with refactored KB 2022-10-03 14:41:15 +02:00
svlandeg
e3027c65b8 Merge branch 'copy_develop' into copy_v4 2022-10-03 14:12:16 +02:00
Sofie Van Landeghem
5157e4e823
disable mypy run for Python 3.10 (#11508) (#11512) 2022-09-15 17:06:58 +02:00
Daniël de Kok
efdbb722c5
Store activations in Docs when save_activations is enabled (#11002)
* Store activations in Doc when `store_activations` is enabled

This change adds the new `activations` attribute to `Doc`. This
attribute can be used by trainable pipes to store their activations,
probabilities, and guesses for downstream users.

As an example, this change modifies the `tagger` and `senter` pipes to
add an `store_activations` option. When this option is enabled, the
probabilities and guesses are stored in `set_annotations`.

* Change type of `store_activations` to `Union[bool, List[str]]`

When the value is:

- A bool: all activations are stored when set to `True`.
- A List[str]: the activations named in the list are stored

* Formatting fixes in Tagger

* Support store_activations in spancat and morphologizer

* Make Doc.activations type visible to MyPy

* textcat/textcat_multilabel: add store_activations option

* trainable_lemmatizer/entity_linker: add store_activations option

* parser/ner: do not currently support returning activations

* Extend tagger and senter tests

So that they, like the other tests, also check that we get no
activations if no activations were requested.

* Document `Doc.activations` and `store_activations` in the relevant pipes

* Start errors/warnings at higher numbers to avoid merge conflicts

Between the master and v4 branches.

* Add `store_activations` to docstrings.

* Replace store_activations setter by set_store_activations method

Setters that take a different type than what the getter returns are still
problematic for MyPy. Replace the setter by a method, so that type inference
works everywhere.

* Use dict comprehension suggested by @svlandeg

* Revert "Use dict comprehension suggested by @svlandeg"

This reverts commit 6e7b958f70.

* EntityLinker: add type annotations to _add_activations

* _store_activations: make kwarg-only, remove doc_scores_lens arg

* set_annotations: add type annotations

* Apply suggestions from code review

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* TextCat.predict: return dict

* Make the `TrainablePipe.store_activations` property a bool

This means that we can also bring back `store_activations` setter.

* Remove `TrainablePipe.activations`

We do not need to enumerate the activations anymore since `store_activations` is
`bool`.

* Add type annotations for activations in predict/set_annotations

* Rename `TrainablePipe.store_activations` to `save_activations`

* Error E1400 is not used anymore

This error was used when activations were still `Union[bool, List[str]]`.

* Change wording in API docs after store -> save change

* docs: tag (save_)activations as new in spaCy 4.0

* Fix copied line in morphologizer activations test

* Don't train in any test_save_activations test

* Rename activations

- "probs" -> "probabilities"
- "guesses" -> "label_ids", except in the edit tree lemmatizer, where
  "guesses" -> "tree_ids".

* Remove unused W400 warning.

This warning was used when we still allowed the user to specify
which activations to save.

* Formatting fixes

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Replace "kb_ids" by a constant

* spancat: replace a cast by an assertion

* Fix EOF spacing

* Fix comments in test_save_activations tests

* Do not set RNG seed in activation saving tests

* Revert "spancat: replace a cast by an assertion"

This reverts commit 0bd5730d16.

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
2022-09-13 09:51:12 +02:00
Madeesh Kannan
60c050e82b
Merge pull request #11448 from shadeMe/merge-develop-into-v4
Merge `develop` into `v4`
2022-09-07 13:26:11 +02:00
shademe
977b847cce
Merge branch 'develop' into merge-develop-into-v4 2022-09-07 11:35:47 +02:00
Adriane Boyd
4a615cacd2
Consolidate and freeze symbols (#11352)
* Consolidate and freeze symbols

Instead of having symbol values defined in three potentially conflicting
places (`spacy.attrs`, `spacy.parts_of_speech`, `spacy.symbols`), define
all symbols in `spacy.symbols` and reference those values in
`spacy.attrs` and `spacy.parts_of_speech`.

Remove deprecated and placeholder symbols from `spacy.attrs.IDS`.

Make `spacy.attrs.NAMES` and `spacy.symbols.NAMES` reverse dicts rather
than lists in order to support future use of hash values in `attr_id_t`.

Minor changes:

* Use `uint64_t` for attrs in `Doc.to_array` to support future use of
hash values
* Remove unneeded attrs filter for error message in `Doc.to_array`
* Remove unused attr `SENT_END`

* Handle dynamic size of attr_id_t in Doc.to_array

* Undo added warnings

* Refactor to make Doc.to_array more similar to Doc.from_array

* Improve refactoring
2022-09-02 09:08:40 +02:00
Paul O'Leary McCann
698b8b495f
Update/remove old Matcher syntax (#11370)
* Clean up old Matcher call style related stuff

In v2 Matcher.add was called with (key, on_match, *patterns). In v3 this
was changed to (key, patterns, *, on_match=None), but there were various
points where the old call syntax was documented or handled specially.
This removes all those.

The Matcher itself didn't need any code changes, as it just gives a
generic type error. However the PhraseMatcher required some changes
because it would automatically "fix" the old call style.

Surprisingly, the tokenizer was still using the old call style in one
place.

After these changes tests failed in two places:

1. one test for the "new" call style, including the "old" call style. I
   removed this test.
2. deserializing the PhraseMatcher fails because the input docs are a
   set.

I am not sure why 2 is happening - I guess it's a quirk of the
serialization format? - so for now I just convert the set to a list when
deserializing. The check that the input Docs are a List in the
PhraseMatcher is a new check, but makes it parallel with the other
Matchers, which seemed like the right thing to do.

* Add notes related to input docs / deserialization type

* Remove Typing import

* Remove old note about call style change

* Apply suggestions from code review

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

* Use separate method for setting internal doc representations

In addition to the title change, this changes the internal dict to be a
defaultdict, instead of a dict with frequent use of setdefault.

* Add _add_from_arrays for unpickling

* Cleanup around adding from arrays

This moves adding to internal structures into the private batch method,
and removes the single-add method.

This has one behavioral change for `add`, in that if something is wrong
with the list of input Docs (such as one of the items not being a Doc),
valid items before the invalid one will not be added. Also the callback
will not be updated if anything is invalid. This change should not be
significant.

This also adds a test to check failure when given a non-Doc.

* Update spacy/matcher/phrasematcher.pyx

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
2022-08-30 15:40:31 +02:00
Adriane Boyd
98a916e01a
Make stable private modules public and adjust names (#11353)
* Make stable private modules public and adjust names

* `spacy.ml._character_embed` -> `spacy.ml.character_embed`
* `spacy.ml._precomputable_affine` -> `spacy.ml.precomputable_affine`
* `spacy.tokens._serialize` -> `spacy.tokens.doc_bin`
* `spacy.tokens._retokenize` -> `spacy.tokens.retokenize`
* `spacy.tokens._dict_proxies` -> `spacy.tokens.span_groups`

* Skip _precomputable_affine

* retokenize -> retokenizer

* Fix imports
2022-08-30 13:56:35 +02:00
Adriane Boyd
4bce8fa755
Remove setup_requires from setup.cfg (#11384)
* Remove setup_requires from setup.cfg

* Update requirements test to ignore cython in setup.cfg
2022-08-29 13:23:24 +02:00
Adriane Boyd
2a558a7cdc
Switch to mecab-ko as default Korean tokenizer (#11294)
* Switch to mecab-ko as default Korean tokenizer

Switch to the (confusingly-named) mecab-ko python module for default Korean
tokenization.

Maintain the previous `natto-py` tokenizer as
`spacy.KoreanNattoTokenizer.v1`.

* Temporarily run tests with mecab-ko tokenizer

* Fix types

* Fix duplicate test names

* Update requirements test

* Revert "Temporarily run tests with mecab-ko tokenizer"

This reverts commit d2083e7044.

* Add mecab_args setting, fix pickle for KoreanNattoTokenizer

* Fix length check

* Update docs

* Formatting

* Update natto-py error message

Co-authored-by: Paul O'Leary McCann <polm@dampfkraft.com>

Co-authored-by: Paul O'Leary McCann <polm@dampfkraft.com>
2022-08-26 10:11:18 +02:00
Adriane Boyd
1eb7ce5ef7
Merge pull request #11377 from adrianeboyd/chore/update-v4-from-develop-2
Update v4 from develop
2022-08-25 08:26:55 +02:00
Adriane Boyd
740c33fe58 Merge remote-tracking branch 'upstream/develop' into chore/update-v4-from-develop 2022-08-24 20:43:07 +02:00
Sofie Van Landeghem
8dd1fa9896
Merge pull request #11366 from adrianeboyd/chore/update-v4-from-master
Update v4 from master
2022-08-24 09:45:55 +02:00
Adriane Boyd
c44d243f25 Merge remote-tracking branch 'upstream/master' into chore/update-v4-from-master 2022-08-24 07:15:41 +02:00
Adriane Boyd
bb0e178878
Make Span/Doc.ents more consistent for ent_kb_id and ent_id (#11328)
* Map `Span.id` to `Token.ent_id` in all cases when setting `Doc.ents`
* Reset `Token.ent_id` and `Token.ent_kb_id` when setting `Doc.ents`
* Make `Span.ent_id` an alias of `Span.id` rather than a read-only view
of the root token's `ent_id` annotation
2022-08-22 20:28:57 +02:00
Sofie Van Landeghem
1a5be63715
Cleanup Cython structs (#11337)
* cleanup Tokenizer fields

* remove unused object from vocab

* remove IS_OOV_DEPRECATED

* add back in as FLAG13

* FLAG 18 instead

* import fix

* fix clumpsy fingers

* revert symbol changes in favor of #11352

* bint instead of bool
2022-08-22 15:52:24 +02:00
Adriane Boyd
d757dec5c4
Remove intify_attrs(_do_deprecated) (#11319) 2022-08-17 12:13:54 +02:00
antonpibm
551e73ccfc
Match private networks as URLs (#11121) 2022-08-11 11:26:26 +02:00
Sofie Van Landeghem
5d54c0e32a
Rename modules for consistency (#11286)
* rename Python module to entity_ruler

* rename Python module to attribute_ruler
2022-08-10 11:44:05 +02:00
Daniël de Kok
e581eeac34
precompute_hiddens/Parser: look up CPU ops once (v4) (#11068)
* precompute_hiddens/Parser: look up CPU ops once

* precompute_hiddens: make cpu_ops private
2022-07-29 15:12:19 +02:00
Daniël de Kok
b2d05f9f66
Merge pull request #11242 from danieldk/merge-master-v4-20220728
Merge `master` into `v4`
2022-07-29 09:17:02 +02:00
Daniël de Kok
1ff683a50b Merge remote-tracking branch 'upstream/master' into merge-master-v4-20220728 2022-07-28 13:53:59 +02:00
Madeesh Kannan
ba18d2913d
Morphology/Morphologizer optimizations and refactoring (#11024)
* `Morphology`: Refactor to use C types, reduce allocations, remove unused code

* `Morphologzier`: Avoid unnecessary sorting of morpho features

* `Morphologizer`: Remove execessive reallocations of labels, improve hash lookups of labels, coerce `numpy` numeric types to native ints
Update docs

* Remove unused method

* Replace `unique_ptr` usage with `shared_ptr`

* Add type annotations to internal Python methods, rename `hash` variable, fix typos

* Add comment to clarify implementation detail

* Fix return type

* `Morphology`: Stop early when splitting fields and values
2022-07-15 11:14:08 +02:00
Daniël de Kok
851a7ca4fa
Merge pull request #11035 from danieldk/merge-master-v4-20220627-2
Merge `master` into `v4`
2022-06-27 19:20:21 +02:00
Daniël de Kok
1605ef7319 Merge remote-tracking branch 'upstream/master' into merge-master-v4-20220627-2 2022-06-27 17:45:45 +02:00
Daniël de Kok
7f3842f54d
Merge pull request #10936 from danieldk/merge-master-v4-20220609
Merge `master` into `v4`
2022-06-09 12:47:13 +02:00
Daniël de Kok
2f05c6824c Merge remote-tracking branch 'upstream/master' into merge-master-v4-20220609 2022-06-09 10:18:25 +02:00
Adriane Boyd
10b7223021
Merge pull request #10816 from danieldk/v4-merge-master-20220518
Merge `master` into `v4`
2022-05-20 09:56:06 +02:00
Daniël de Kok
5586fd9311 Merge remote-tracking branch 'upstream/master' into v4-merge-master-20220518 2022-05-18 11:34:54 +02:00
Adriane Boyd
0e71bd973f
Return doc offsets in Matcher on spans (#10576)
The returned match offsets were only adjusted for `as_spans`, not
generally. Because the `on_match` callbacks are always applied to the
doc, the `Matcher` matches on spans should consistently use the doc
offsets.
2022-04-15 15:34:58 +02:00
Daniël de Kok
75f7c15187
Span/SpanGroup: wrap SpanC in shared_ptr (#9869)
* Span/SpanGroup: wrap SpanC in shared_ptr

When a Span that was retrieved from a SpanGroup was modified, these
changes were not reflected in the SpanGroup because the underlying
SpanC struct was copied.

This change applies the solution proposed by @nrodnova, to wrap SpanC in
a shared_ptr. This makes a SpanGroup and Spans derived from it share the
same SpanC. So, changes made through a Span are visible in the SpanGroup
as well.

Fixes #9556

* Test that a SpanGroup is modified through its Spans

* SpanGroup.push_back: remove nogil

Modifying std::vector is not thread-safe.

* C++ >= 11 does not allow const T in vector<T>

* Add Span.span_c as a shorthand for Span.c.get

Since this method is cdef'ed, it is only visible from Cython, so we
avoid using raw pointers in Python

Replace existing uses of span.c.get() to use this new method.

* Fix formatting

* Style fix: pointer types

* SpanGroup.to_bytes: reduce number of shared_ptr::get calls

* Mark SpanGroup modification test with issue

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
2022-01-12 13:38:52 +01:00
332 changed files with 9120 additions and 12982 deletions

View File

@ -1,99 +0,0 @@
name: Build
on:
push:
tags:
# ytf did they invent their own syntax that's almost regex?
# ** matches 'zero or more of any character'
- 'release-v[0-9]+.[0-9]+.[0-9]+**'
- 'prerelease-v[0-9]+.[0-9]+.[0-9]+**'
jobs:
build_wheels:
name: Build wheels on ${{ matrix.os }}
runs-on: ${{ matrix.os }}
strategy:
matrix:
# macos-13 is an intel runner, macos-14 is apple silicon
os: [ubuntu-latest, windows-latest, macos-13, macos-14, ubuntu-24.04-arm]
steps:
- uses: actions/checkout@v4
# aarch64 (arm) is built via qemu emulation
# QEMU is sadly too slow. We need to wait for public ARM support
#- name: Set up QEMU
# if: runner.os == 'Linux'
# uses: docker/setup-qemu-action@v3
# with:
# platforms: all
- name: Build wheels
uses: pypa/cibuildwheel@v2.21.3
env:
CIBW_ARCHS_LINUX: auto
with:
package-dir: .
output-dir: wheelhouse
config-file: "{package}/pyproject.toml"
- uses: actions/upload-artifact@v4
with:
name: cibw-wheels-${{ matrix.os }}-${{ strategy.job-index }}
path: ./wheelhouse/*.whl
build_sdist:
name: Build source distribution
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v4
- name: Build sdist
run: pipx run build --sdist
- uses: actions/upload-artifact@v4
with:
name: cibw-sdist
path: dist/*.tar.gz
create_release:
needs: [build_wheels, build_sdist]
runs-on: ubuntu-latest
permissions:
contents: write
checks: write
actions: read
issues: read
packages: write
pull-requests: read
repository-projects: read
statuses: read
steps:
- name: Get the tag name and determine if it's a prerelease
id: get_tag_info
run: |
FULL_TAG=${GITHUB_REF#refs/tags/}
if [[ $FULL_TAG == release-* ]]; then
TAG_NAME=${FULL_TAG#release-}
IS_PRERELEASE=false
elif [[ $FULL_TAG == prerelease-* ]]; then
TAG_NAME=${FULL_TAG#prerelease-}
IS_PRERELEASE=true
else
echo "Tag does not match expected patterns" >&2
exit 1
fi
echo "FULL_TAG=$TAG_NAME" >> $GITHUB_ENV
echo "TAG_NAME=$TAG_NAME" >> $GITHUB_ENV
echo "IS_PRERELEASE=$IS_PRERELEASE" >> $GITHUB_ENV
- uses: actions/download-artifact@v4
with:
# unpacks all CIBW artifacts into dist/
pattern: cibw-*
path: dist
merge-multiple: true
- name: Create Draft Release
id: create_release
uses: softprops/action-gh-release@v2
if: startsWith(github.ref, 'refs/tags/')
env:
GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
with:
name: ${{ env.TAG_NAME }}
draft: true
prerelease: ${{ env.IS_PRERELEASE }}
files: "./dist/*"

View File

@ -1,29 +0,0 @@
# The cibuildwheel action triggers on creation of a release, this
# triggers on publication.
# The expected workflow is to create a draft release and let the wheels
# upload, and then hit 'publish', which uploads to PyPi.
on:
release:
types:
- published
jobs:
upload_pypi:
runs-on: ubuntu-latest
environment:
name: pypi
url: https://pypi.org/p/spacy
permissions:
id-token: write
contents: read
if: github.event_name == 'release' && github.event.action == 'published'
# or, alternatively, upload to PyPI on every tag starting with 'v' (remove on: release above to use this)
# if: github.event_name == 'push' && startsWith(github.ref, 'refs/tags/v')
steps:
- uses: robinraju/release-downloader@v1
with:
tag: ${{ github.event.release.tag_name }}
fileName: '*'
out-file-path: 'dist'
- uses: pypa/gh-action-pypi-publish@release/v1

View File

@ -2,8 +2,6 @@ name: tests
on:
push:
tags-ignore:
- '**'
branches-ignore:
- "spacy.io"
- "nightly.spacy.io"
@ -12,6 +10,7 @@ on:
- "*.md"
- "*.mdx"
- "website/**"
- ".github/workflows/**"
pull_request:
types: [opened, synchronize, reopened, edited]
paths-ignore:
@ -31,7 +30,7 @@ jobs:
- name: Configure Python version
uses: actions/setup-python@v4
with:
python-version: "3.10"
python-version: "3.9"
- name: black
run: |
@ -45,12 +44,11 @@ jobs:
run: |
python -m pip install flake8==5.0.4
python -m flake8 spacy --count --select=E901,E999,F821,F822,F823,W605 --show-source --statistics
# Unfortunately cython-lint isn't working after the shift to Cython 3.
#- name: cython-lint
# run: |
# python -m pip install cython-lint -c requirements.txt
# # E501: line too log, W291: trailing whitespace, E266: too many leading '#' for block comment
# cython-lint spacy --ignore E501,W291,E266
- name: cython-lint
run: |
python -m pip install cython-lint -c requirements.txt
# E501: line too log, W291: trailing whitespace, E266: too many leading '#' for block comment
cython-lint spacy --ignore E501,W291,E266
tests:
name: Test
@ -59,7 +57,14 @@ jobs:
fail-fast: true
matrix:
os: [ubuntu-latest, windows-latest, macos-latest]
python_version: ["3.9", "3.12", "3.13"]
python_version: ["3.12"]
include:
- os: ubuntu-20.04
python_version: "3.9"
- os: windows-latest
python_version: "3.10"
- os: macos-latest
python_version: "3.11"
runs-on: ${{ matrix.os }}
@ -84,7 +89,6 @@ jobs:
- name: Run mypy
run: |
python -m mypy spacy
if: matrix.python_version != '3.7'
- name: Delete source directory and .egg-info
run: |
@ -106,22 +110,22 @@ jobs:
- name: Test import
run: python -W error -c "import spacy"
- name: "Test download CLI"
run: |
python -m spacy download ca_core_news_sm
python -m spacy download ca_core_news_md
python -c "import spacy; nlp=spacy.load('ca_core_news_sm'); doc=nlp('test')"
if: matrix.python_version == '3.9'
- name: "Test download_url in info CLI"
run: |
python -W error -m spacy info ca_core_news_sm | grep -q download_url
if: matrix.python_version == '3.9'
- name: "Test no warnings on load (#11713)"
run: |
python -W error -c "import ca_core_news_sm; nlp = ca_core_news_sm.load(); doc=nlp('test')"
if: matrix.python_version == '3.9'
# - name: "Test download CLI"
# run: |
# python -m spacy download ca_core_news_sm
# python -m spacy download ca_core_news_md
# python -c "import spacy; nlp=spacy.load('ca_core_news_sm'); doc=nlp('test')"
# if: matrix.python_version == '3.9'
#
# - name: "Test download_url in info CLI"
# run: |
# python -W error -m spacy info ca_core_news_sm | grep -q download_url
# if: matrix.python_version == '3.9'
#
# - name: "Test no warnings on load (#11713)"
# run: |
# python -W error -c "import ca_core_news_sm; nlp = ca_core_news_sm.load(); doc=nlp('test')"
# if: matrix.python_version == '3.9'
- name: "Test convert CLI"
run: |
@ -145,19 +149,17 @@ jobs:
python -m spacy train ner.cfg --paths.train ner-token-per-line-conll2003.spacy --paths.dev ner-token-per-line-conll2003.spacy --training.max_steps 10 --gpu-id -1
if: matrix.python_version == '3.9'
- name: "Test assemble CLI"
run: |
python -c "import spacy; config = spacy.util.load_config('ner.cfg'); config['components']['ner'] = {'source': 'ca_core_news_sm'}; config.to_disk('ner_source_sm.cfg')"
python -m spacy assemble ner_source_sm.cfg output_dir
env:
PYTHONWARNINGS: "error,ignore::DeprecationWarning"
if: matrix.python_version == '3.9'
- name: "Test assemble CLI vectors warning"
run: |
python -c "import spacy; config = spacy.util.load_config('ner.cfg'); config['components']['ner'] = {'source': 'ca_core_news_md'}; config.to_disk('ner_source_md.cfg')"
python -m spacy assemble ner_source_md.cfg output_dir 2>&1 | grep -q W113
if: matrix.python_version == '3.9'
# - name: "Test assemble CLI"
# run: |
# python -c "import spacy; config = spacy.util.load_config('ner.cfg'); config['components']['ner'] = {'source': 'ca_core_news_sm'}; config.to_disk('ner_source_sm.cfg')"
# PYTHONWARNINGS="error,ignore::DeprecationWarning" python -m spacy assemble ner_source_sm.cfg output_dir
# if: matrix.python_version == '3.9'
#
# - name: "Test assemble CLI vectors warning"
# run: |
# python -c "import spacy; config = spacy.util.load_config('ner.cfg'); config['components']['ner'] = {'source': 'ca_core_news_md'}; config.to_disk('ner_source_md.cfg')"
# python -m spacy assemble ner_source_md.cfg output_dir 2>&1 | grep -q W113
# if: matrix.python_version == '3.9'
- name: "Install test requirements"
run: |
@ -166,10 +168,4 @@ jobs:
- name: "Run CPU tests"
run: |
python -m pytest --pyargs spacy -W error
if: "!(startsWith(matrix.os, 'macos') && matrix.python_version == '3.11')"
- name: "Run CPU tests with thinc-apple-ops"
run: |
python -m pip install 'spacy[apple]'
python -m pytest --pyargs spacy
if: startsWith(matrix.os, 'macos') && matrix.python_version == '3.11'
if: matrix.python_version == '3.11'

View File

@ -25,7 +25,7 @@ jobs:
- name: Configure Python version
uses: actions/setup-python@v4
with:
python-version: "3.7"
python-version: "3.9"
- name: Validate website/meta/universe.json
run: |

View File

@ -3,8 +3,8 @@ repos:
rev: 22.3.0
hooks:
- id: black
language_version: python3.7
additional_dependencies: ['click==8.0.4']
language_version: python3.9
additional_dependencies: ["click==8.0.4"]
- repo: https://github.com/pycqa/flake8
rev: 5.0.4
hooks:

View File

@ -35,7 +35,7 @@ so that more people can benefit from it.
When opening an issue, use a **descriptive title** and include your
**environment** (operating system, Python version, spaCy version). Our
[issue templates](https://github.com/explosion/spaCy/issues/new/choose) help you
[issue template](https://github.com/explosion/spaCy/issues/new) helps you
remember the most important details to include. If you've discovered a bug, you
can also submit a [regression test](#fixing-bugs) straight away. When you're
opening an issue to report the bug, simply refer to your pull request in the
@ -276,7 +276,7 @@ except: # noqa: E722
### Python conventions
All Python code must be written **compatible with Python 3.6+**. More detailed
All Python code must be written **compatible with Python 3.9+**. More detailed
code conventions can be found in the [developer docs](https://github.com/explosion/spaCy/blob/master/extra/DEVELOPER_DOCS/Code%20Conventions.md).
#### I/O and handling paths
@ -449,8 +449,8 @@ and plugins in spaCy v3.0, and we can't wait to see what you build with it!
[`spacy`](https://github.com/topics/spacy?o=desc&s=stars) and
[`spacy-extensions`](https://github.com/topics/spacy-extension?o=desc&s=stars)
to make it easier to find. Those are also the topics we're linking to from the
spaCy website. If you're sharing your project on X, feel free to tag
[@spacy_io](https://x.com/spacy_io) so we can check it out.
spaCy website. If you're sharing your project on Twitter, feel free to tag
[@spacy_io](https://twitter.com/spacy_io) so we can check it out.
- Once your extension is published, you can open a
[PR](https://github.com/explosion/spaCy/pulls) to suggest it for the

View File

@ -4,6 +4,5 @@ include README.md
include pyproject.toml
include spacy/py.typed
recursive-include spacy/cli *.yml
recursive-include spacy/tests *.json
recursive-include licenses *
recursive-exclude spacy *.cpp

View File

@ -5,7 +5,7 @@ override SPACY_EXTRAS = spacy-lookups-data==1.0.3
endif
ifndef PYVER
override PYVER = 3.8
override PYVER = 3.9
endif
VENV := ./env$(PYVER)

View File

@ -16,7 +16,7 @@ model packaging, deployment and workflow management. spaCy is commercial
open-source software, released under the
[MIT license](https://github.com/explosion/spaCy/blob/master/LICENSE).
💫 **Version 3.8 out now!**
💫 **Version 3.7 out now!**
[Check out the release notes here.](https://github.com/explosion/spaCy/releases)
[![tests](https://github.com/explosion/spaCy/actions/workflows/tests.yml/badge.svg)](https://github.com/explosion/spaCy/actions/workflows/tests.yml)
@ -28,11 +28,12 @@ open-source software, released under the
<br />
[![PyPi downloads](https://static.pepy.tech/personalized-badge/spacy?period=total&units=international_system&left_color=grey&right_color=orange&left_text=pip%20downloads)](https://pypi.org/project/spacy/)
[![Conda downloads](https://img.shields.io/conda/dn/conda-forge/spacy?label=conda%20downloads)](https://anaconda.org/conda-forge/spacy)
[![spaCy on Twitter](https://img.shields.io/twitter/follow/spacy_io.svg?style=social&label=Follow)](https://twitter.com/spacy_io)
## 📖 Documentation
| Documentation | |
| ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| ⭐️ **[spaCy 101]** | New to spaCy? Here's everything you need to know! |
| 📚 **[Usage Guides]** | How to use spaCy and its features. |
| 🚀 **[New in v3.0]** | New features, backwards incompatibilities and migration guide. |
@ -46,7 +47,6 @@ open-source software, released under the
| 👩‍🏫 **[Online Course]** | Learn spaCy in this free and interactive online course. |
| 📰 **[Blog]** | Read about current spaCy and Prodigy development, releases, talks and more from Explosion. |
| 📺 **[Videos]** | Our YouTube channel with video tutorials, talks and more. |
| 🔴 **[Live Stream]** | Join Matt as he works on spaCy and chat about NLP, live every week. |
| 🛠 **[Changelog]** | Changes and version history. |
| 💝 **[Contribute]** | How to contribute to the spaCy project and code base. |
| 👕 **[Swag]** | Support us and our work with unique, custom-designed swag! |
@ -62,7 +62,6 @@ open-source software, released under the
[universe]: https://spacy.io/universe
[spacy vs code extension]: https://github.com/explosion/spacy-vscode
[videos]: https://www.youtube.com/c/ExplosionAI
[live stream]: https://www.youtube.com/playlist?list=PLBmcuObd5An5_iAxNYLJa_xWmNzsYce8c
[online course]: https://course.spacy.io
[blog]: https://explosion.ai
[project templates]: https://github.com/explosion/projects
@ -80,14 +79,13 @@ more people can benefit from it.
| Type | Platforms |
| ------------------------------- | --------------------------------------- |
| 🚨 **Bug Reports** | [GitHub Issue Tracker] |
| 🎁 **Feature Requests & Ideas** | [GitHub Discussions] · [Live Stream] |
| 🎁 **Feature Requests & Ideas** | [GitHub Discussions] |
| 👩‍💻 **Usage Questions** | [GitHub Discussions] · [Stack Overflow] |
| 🗯 **General Discussion** | [GitHub Discussions] · [Live Stream] |
| 🗯 **General Discussion** | [GitHub Discussions] |
[github issue tracker]: https://github.com/explosion/spaCy/issues
[github discussions]: https://github.com/explosion/spaCy/discussions
[stack overflow]: https://stackoverflow.com/questions/tagged/spacy
[live stream]: https://www.youtube.com/playlist?list=PLBmcuObd5An5_iAxNYLJa_xWmNzsYce8c
## Features
@ -117,7 +115,7 @@ For detailed installation instructions, see the
- **Operating system**: macOS / OS X · Linux · Windows (Cygwin, MinGW, Visual
Studio)
- **Python version**: Python >=3.7, <3.13 (only 64 bit)
- **Python version**: Python 3.9+ (only 64 bit)
- **Package managers**: [pip] · [conda] (via `conda-forge`)
[pip]: https://pypi.org/project/spacy/

View File

@ -1,20 +0,0 @@
#!/usr/bin/env bash
set -e
# Insist repository is clean
git diff-index --quiet HEAD
version=$(grep "__version__ = " spacy/about.py)
version=${version/__version__ = }
version=${version/\'/}
version=${version/\'/}
version=${version/\"/}
version=${version/\"/}
echo "Pushing release-v"$version
git tag -d release-v$version || true
git push origin :release-v$version || true
git tag release-v$version
git push origin release-v$version

View File

@ -1,2 +1,2 @@
# build version constraints for use with wheelwright
numpy>=2.0.0,<3.0.0
numpy>=1.25.0; python_version>='3.9'

View File

@ -31,7 +31,6 @@ These are repos that can be used by spaCy but aren&#39;t part of a default insta
- [spacy-stanza](https://github.com/explosion/spacy-stanza): This is a wrapper that allows the use of Stanford&#39;s Stanza library in spaCy.
- [spacy-streamlit](https://github.com/explosion/spacy-streamlit): A wrapper for the Streamlit dashboard building library to help with integrating [displaCy](https://spacy.io/api/top-level/#displacy).
- [spacymoji](https://github.com/explosion/spacymoji): A library to add extra support for emoji to spaCy, such as including character names.
- [thinc-apple-ops](https://github.com/explosion/thinc-apple-ops): A special backend for OSX that uses Apple&#39;s native libraries for improved performance.
- [os-signpost](https://github.com/explosion/os-signpost): A Python package that allows you to use the `OSSignposter` API in OSX for performance analysis.
- [spacy-ray](https://github.com/explosion/spacy-ray): A wrapper to integrate spaCy with Ray, a distributed training framework. Currently a work in progress.
@ -79,4 +78,3 @@ Repos that don&#39;t fit in any of the above categories.
- [tokenizations](https://github.com/explosion/tokenizations): A library originally by Yohei Tamura to align strings with tolerance to some variations in features like case and diacritics, used for aligning tokens and wordpieces. Adopted and maintained by Explosion, but usually spacy-alignments is used instead.
- [conll-2012](https://github.com/explosion/conll-2012): A repo to hold some slightly cleaned up versions of the official scripts for the CoNLL 2012 shared task involving coreference resolution. Used in the coref project.
- [fastapi-explosion-extras](https://github.com/explosion/fastapi-explosion-extras): Some small tweaks to FastAPI used at Explosion.

View File

@ -1,67 +1,15 @@
[build-system]
requires = [
"setuptools",
"cython>=3.0,<4.0",
"cython>=0.25,<3.0",
"cymem>=2.0.2,<2.1.0",
"preshed>=3.0.2,<3.1.0",
"murmurhash>=0.28.0,<1.1.0",
"thinc>=8.3.4,<8.4.0",
"numpy>=2.0.0,<3.0.0"
"thinc>=9.0.0,<9.1.0",
"numpy>=1.15.0; python_version < '3.9'",
"numpy>=1.25.0; python_version >= '3.9'",
]
build-backend = "setuptools.build_meta"
[tool.cibuildwheel]
build = "*"
skip = "pp* cp36* cp37* cp38* *-win32 *i686*"
test-skip = ""
free-threaded-support = false
archs = ["native"]
build-frontend = "default"
config-settings = {}
dependency-versions = "pinned"
environment = { PIP_CONSTRAINT = "build-constraints.txt" }
environment-pass = []
build-verbosity = 0
before-all = "curl https://sh.rustup.rs -sSf | sh -s -- -y --profile minimal --default-toolchain stable"
before-build = "pip install -r requirements.txt && python setup.py clean"
repair-wheel-command = ""
test-command = ""
before-test = ""
test-requires = []
test-extras = []
container-engine = "docker"
manylinux-x86_64-image = "manylinux2014"
manylinux-i686-image = "manylinux2014"
manylinux-aarch64-image = "manylinux2014"
manylinux-ppc64le-image = "manylinux2014"
manylinux-s390x-image = "manylinux2014"
manylinux-pypy_x86_64-image = "manylinux2014"
manylinux-pypy_i686-image = "manylinux2014"
manylinux-pypy_aarch64-image = "manylinux2014"
musllinux-x86_64-image = "musllinux_1_2"
musllinux-i686-image = "musllinux_1_2"
musllinux-aarch64-image = "musllinux_1_2"
musllinux-ppc64le-image = "musllinux_1_2"
musllinux-s390x-image = "musllinux_1_2"
[tool.cibuildwheel.linux]
repair-wheel-command = "auditwheel repair -w {dest_dir} {wheel}"
[tool.cibuildwheel.macos]
repair-wheel-command = "delocate-wheel --require-archs {delocate_archs} -w {dest_dir} -v {wheel}"
[tool.cibuildwheel.windows]
[tool.cibuildwheel.pyodide]
[tool.isort]
profile = "black"

View File

@ -1,34 +1,36 @@
# Our libraries
spacy-legacy>=3.0.11,<3.1.0
spacy-legacy>=4.0.0.dev1,<4.1.0
spacy-loggers>=1.0.0,<2.0.0
cymem>=2.0.2,<2.1.0
preshed>=3.0.2,<3.1.0
thinc>=8.3.4,<8.4.0
thinc>=9.0.0,<9.1.0
ml_datasets>=0.2.0,<0.3.0
murmurhash>=0.28.0,<1.1.0
wasabi>=0.9.1,<1.2.0
srsly>=2.4.3,<3.0.0
catalogue>=2.0.6,<2.1.0
typer-slim>=0.3.0,<1.0.0
typer>=0.3.0,<1.0.0
weasel>=0.1.0,<0.5.0
# Third party dependencies
numpy>=2.0.0,<3.0.0
numpy>=1.15.0; python_version < "3.9"
numpy>=1.19.0; python_version >= "3.9"
requests>=2.13.0,<3.0.0
tqdm>=4.38.0,<5.0.0
pydantic>=1.7.4,!=1.8,!=1.8.1,<3.0.0
jinja2
langcodes>=3.2.0,<4.0.0
# Official Python utilities
setuptools
packaging>=20.0
# Development dependencies
pre-commit>=2.13.0
cython>=3.0,<4.0
cython>=0.25,<3.0
pytest>=5.2.0,!=7.1.0
pytest-timeout>=1.3.0,<2.0.0
mock>=2.0.0,<3.0.0
flake8>=3.8.0,<6.0.0
hypothesis>=3.27.0,<7.0.0
mypy>=1.5.0,<1.6.0; platform_machine != "aarch64" and python_version >= "3.8"
mypy>=1.5.0,<1.6.0; platform_machine != "aarch64"
types-mock>=0.1.1
types-setuptools>=57.0.0
types-requests

View File

@ -21,7 +21,6 @@ classifiers =
Programming Language :: Python :: 3.10
Programming Language :: Python :: 3.11
Programming Language :: Python :: 3.12
Programming Language :: Python :: 3.13
Topic :: Scientific/Engineering
project_urls =
Release notes = https://github.com/explosion/spaCy/releases
@ -30,32 +29,21 @@ project_urls =
[options]
zip_safe = false
include_package_data = true
python_requires = >=3.9,<3.14
# NOTE: This section is superseded by pyproject.toml and will be removed in
# spaCy v4
setup_requires =
cython>=3.0,<4.0
numpy>=2.0.0,<3.0.0; python_version < "3.9"
numpy>=2.0.0,<3.0.0; python_version >= "3.9"
# We also need our Cython packages here to compile against
cymem>=2.0.2,<2.1.0
preshed>=3.0.2,<3.1.0
murmurhash>=0.28.0,<1.1.0
thinc>=8.3.4,<8.4.0
python_requires = >=3.9
install_requires =
# Our libraries
spacy-legacy>=3.0.11,<3.1.0
spacy-legacy>=4.0.0.dev1,<4.1.0
spacy-loggers>=1.0.0,<2.0.0
murmurhash>=0.28.0,<1.1.0
cymem>=2.0.2,<2.1.0
preshed>=3.0.2,<3.1.0
thinc>=8.3.4,<8.4.0
thinc>=9.0.0,<9.1.0
wasabi>=0.9.1,<1.2.0
srsly>=2.4.3,<3.0.0
catalogue>=2.0.6,<2.1.0
weasel>=0.1.0,<0.5.0
# Third-party dependencies
typer-slim>=0.3.0,<1.0.0
typer>=0.3.0,<1.0.0
tqdm>=4.38.0,<5.0.0
numpy>=1.15.0; python_version < "3.9"
numpy>=1.19.0; python_version >= "3.9"
@ -65,6 +53,7 @@ install_requires =
# Official Python utilities
setuptools
packaging>=20.0
langcodes>=3.2.0,<4.0.0
[options.entry_points]
console_scripts =
@ -113,14 +102,12 @@ cuda12x =
cupy-cuda12x>=11.5.0,<13.0.0
cuda-autodetect =
cupy-wheel>=11.0.0,<13.0.0
apple =
thinc-apple-ops>=1.0.0,<2.0.0
# Language tokenizers with external dependencies
ja =
sudachipy>=0.5.2,!=0.6.1
sudachidict_core>=20211220
ko =
natto-py>=0.9.0
mecab-ko>=1.0.0
th =
pythainlp>=2.0

View File

@ -37,7 +37,6 @@ MOD_NAMES = [
"spacy.pipeline.dep_parser",
"spacy.pipeline._edit_tree_internals.edit_trees",
"spacy.pipeline.morphologizer",
"spacy.pipeline.multitask",
"spacy.pipeline.ner",
"spacy.pipeline.pipe",
"spacy.pipeline.trainable_pipe",
@ -48,6 +47,7 @@ MOD_NAMES = [
"spacy.pipeline._parser_internals.arc_eager",
"spacy.pipeline._parser_internals.ner",
"spacy.pipeline._parser_internals.nonproj",
"spacy.pipeline._parser_internals.search",
"spacy.pipeline._parser_internals._state",
"spacy.pipeline._parser_internals.stateclass",
"spacy.pipeline._parser_internals.transition_system",
@ -61,12 +61,13 @@ MOD_NAMES = [
"spacy.tokens.span_group",
"spacy.tokens.graph",
"spacy.tokens.morphanalysis",
"spacy.tokens._retokenize",
"spacy.tokens.retokenizer",
"spacy.matcher.matcher",
"spacy.matcher.phrasematcher",
"spacy.matcher.dependencymatcher",
"spacy.symbols",
"spacy.vectors",
"spacy.tests.parser._search",
]
COMPILE_OPTIONS = {
"msvc": ["/Ox", "/EHsc"],

View File

@ -17,7 +17,6 @@ from .cli.info import info # noqa: F401
from .errors import Errors
from .glossary import explain # noqa: F401
from .language import Language
from .registrations import REGISTRY_POPULATED, populate_registry
from .util import logger, registry # noqa: F401
from .vocab import Vocab

View File

@ -1,5 +1,9 @@
# fmt: off
__title__ = "spacy"
__version__ = "3.8.7"
__version__ = "4.0.0.dev3"
__download_url__ = "https://github.com/explosion/spacy-models/releases/download"
__compatibility__ = "https://raw.githubusercontent.com/explosion/spacy-models/master/compatibility.json"
__projects__ = "https://github.com/explosion/projects"
__projects_branch__ = "v3"
__lookups_tag__ = "v1.0.3"
__lookups_url__ = f"https://raw.githubusercontent.com/explosion/spacy-lookups-data/{__lookups_tag__}/spacy_lookups_data/data/"

View File

@ -1,99 +1,50 @@
# Reserve 64 values for flag features
from . cimport symbols
cdef enum attr_id_t:
NULL_ATTR
IS_ALPHA
IS_ASCII
IS_DIGIT
IS_LOWER
IS_PUNCT
IS_SPACE
IS_TITLE
IS_UPPER
LIKE_URL
LIKE_NUM
LIKE_EMAIL
IS_STOP
IS_OOV_DEPRECATED
IS_BRACKET
IS_QUOTE
IS_LEFT_PUNCT
IS_RIGHT_PUNCT
IS_CURRENCY
NULL_ATTR = 0
IS_ALPHA = symbols.IS_ALPHA
IS_ASCII = symbols.IS_ASCII
IS_DIGIT = symbols.IS_DIGIT
IS_LOWER = symbols.IS_LOWER
IS_PUNCT = symbols.IS_PUNCT
IS_SPACE = symbols.IS_SPACE
IS_TITLE = symbols.IS_TITLE
IS_UPPER = symbols.IS_UPPER
LIKE_URL = symbols.LIKE_URL
LIKE_NUM = symbols.LIKE_NUM
LIKE_EMAIL = symbols.LIKE_EMAIL
IS_STOP = symbols.IS_STOP
IS_BRACKET = symbols.IS_BRACKET
IS_QUOTE = symbols.IS_QUOTE
IS_LEFT_PUNCT = symbols.IS_LEFT_PUNCT
IS_RIGHT_PUNCT = symbols.IS_RIGHT_PUNCT
IS_CURRENCY = symbols.IS_CURRENCY
FLAG19 = 19
FLAG20
FLAG21
FLAG22
FLAG23
FLAG24
FLAG25
FLAG26
FLAG27
FLAG28
FLAG29
FLAG30
FLAG31
FLAG32
FLAG33
FLAG34
FLAG35
FLAG36
FLAG37
FLAG38
FLAG39
FLAG40
FLAG41
FLAG42
FLAG43
FLAG44
FLAG45
FLAG46
FLAG47
FLAG48
FLAG49
FLAG50
FLAG51
FLAG52
FLAG53
FLAG54
FLAG55
FLAG56
FLAG57
FLAG58
FLAG59
FLAG60
FLAG61
FLAG62
FLAG63
ID = symbols.ID
ORTH = symbols.ORTH
LOWER = symbols.LOWER
NORM = symbols.NORM
SHAPE = symbols.SHAPE
PREFIX = symbols.PREFIX
SUFFIX = symbols.SUFFIX
ID
ORTH
LOWER
NORM
SHAPE
PREFIX
SUFFIX
LENGTH = symbols.LENGTH
CLUSTER = symbols.CLUSTER
LEMMA = symbols.LEMMA
POS = symbols.POS
TAG = symbols.TAG
DEP = symbols.DEP
ENT_IOB = symbols.ENT_IOB
ENT_TYPE = symbols.ENT_TYPE
HEAD = symbols.HEAD
SENT_START = symbols.SENT_START
SPACY = symbols.SPACY
PROB = symbols.PROB
LENGTH
CLUSTER
LEMMA
POS
TAG
DEP
ENT_IOB
ENT_TYPE
HEAD
SENT_START
SPACY
PROB
LANG
LANG = symbols.LANG
ENT_KB_ID = symbols.ENT_KB_ID
MORPH
MORPH = symbols.MORPH
ENT_ID = symbols.ENT_ID
IDX
SENT_END
IDX = symbols.IDX

View File

@ -17,57 +17,11 @@ IDS = {
"LIKE_NUM": LIKE_NUM,
"LIKE_EMAIL": LIKE_EMAIL,
"IS_STOP": IS_STOP,
"IS_OOV_DEPRECATED": IS_OOV_DEPRECATED,
"IS_BRACKET": IS_BRACKET,
"IS_QUOTE": IS_QUOTE,
"IS_LEFT_PUNCT": IS_LEFT_PUNCT,
"IS_RIGHT_PUNCT": IS_RIGHT_PUNCT,
"IS_CURRENCY": IS_CURRENCY,
"FLAG19": FLAG19,
"FLAG20": FLAG20,
"FLAG21": FLAG21,
"FLAG22": FLAG22,
"FLAG23": FLAG23,
"FLAG24": FLAG24,
"FLAG25": FLAG25,
"FLAG26": FLAG26,
"FLAG27": FLAG27,
"FLAG28": FLAG28,
"FLAG29": FLAG29,
"FLAG30": FLAG30,
"FLAG31": FLAG31,
"FLAG32": FLAG32,
"FLAG33": FLAG33,
"FLAG34": FLAG34,
"FLAG35": FLAG35,
"FLAG36": FLAG36,
"FLAG37": FLAG37,
"FLAG38": FLAG38,
"FLAG39": FLAG39,
"FLAG40": FLAG40,
"FLAG41": FLAG41,
"FLAG42": FLAG42,
"FLAG43": FLAG43,
"FLAG44": FLAG44,
"FLAG45": FLAG45,
"FLAG46": FLAG46,
"FLAG47": FLAG47,
"FLAG48": FLAG48,
"FLAG49": FLAG49,
"FLAG50": FLAG50,
"FLAG51": FLAG51,
"FLAG52": FLAG52,
"FLAG53": FLAG53,
"FLAG54": FLAG54,
"FLAG55": FLAG55,
"FLAG56": FLAG56,
"FLAG57": FLAG57,
"FLAG58": FLAG58,
"FLAG59": FLAG59,
"FLAG60": FLAG60,
"FLAG61": FLAG61,
"FLAG62": FLAG62,
"FLAG63": FLAG63,
"ID": ID,
"ORTH": ORTH,
"LOWER": LOWER,
@ -93,12 +47,11 @@ IDS = {
}
# ATTR IDs, in order of the symbol
NAMES = [key for key, value in sorted(IDS.items(), key=lambda item: item[1])]
NAMES = {v: k for k, v in IDS.items()}
locals().update(IDS)
def intify_attrs(stringy_attrs, strings_map=None, _do_deprecated=False):
def intify_attrs(stringy_attrs, strings_map=None):
"""
Normalize a dictionary of attributes, converting them to ints.
@ -110,75 +63,6 @@ def intify_attrs(stringy_attrs, strings_map=None, _do_deprecated=False):
converted to ints.
"""
inty_attrs = {}
if _do_deprecated:
if "F" in stringy_attrs:
stringy_attrs["ORTH"] = stringy_attrs.pop("F")
if "L" in stringy_attrs:
stringy_attrs["LEMMA"] = stringy_attrs.pop("L")
if "pos" in stringy_attrs:
stringy_attrs["TAG"] = stringy_attrs.pop("pos")
if "morph" in stringy_attrs:
morphs = stringy_attrs.pop("morph") # no-cython-lint
if "number" in stringy_attrs:
stringy_attrs.pop("number")
if "tenspect" in stringy_attrs:
stringy_attrs.pop("tenspect")
morph_keys = [
"PunctType",
"PunctSide",
"Other",
"Degree",
"AdvType",
"Number",
"VerbForm",
"PronType",
"Aspect",
"Tense",
"PartType",
"Poss",
"Hyph",
"ConjType",
"NumType",
"Foreign",
"VerbType",
"NounType",
"Gender",
"Mood",
"Negative",
"Tense",
"Voice",
"Abbr",
"Derivation",
"Echo",
"Foreign",
"NameType",
"NounType",
"NumForm",
"NumValue",
"PartType",
"Polite",
"StyleVariant",
"PronType",
"AdjType",
"Person",
"Variant",
"AdpType",
"Reflex",
"Negative",
"Mood",
"Aspect",
"Case",
"Polarity",
"PrepCase",
"Animacy", # U20
]
for key in morph_keys:
if key in stringy_attrs:
stringy_attrs.pop(key)
elif key.lower() in stringy_attrs:
stringy_attrs.pop(key.lower())
elif key.upper() in stringy_attrs:
stringy_attrs.pop(key.upper())
for name, value in stringy_attrs.items():
int_key = intify_attr(name)
if int_key is not None:

View File

@ -14,6 +14,7 @@ from .debug_config import debug_config # noqa: F401
from .debug_data import debug_data # noqa: F401
from .debug_diff import debug_diff # noqa: F401
from .debug_model import debug_model # noqa: F401
from .distill import distill # noqa: F401
from .download import download # noqa: F401
from .evaluate import evaluate # noqa: F401
from .find_function import find_function # noqa: F401

View File

@ -11,6 +11,7 @@ from typing import (
Dict,
Iterable,
List,
Literal,
Optional,
Tuple,
Union,
@ -28,7 +29,7 @@ from wasabi import Printer, msg
from weasel import app as project_cli
from .. import about
from ..compat import Literal
from ..errors import RENAMED_LANGUAGE_CODES
from ..schemas import validate
from ..util import (
ENV_VARS,
@ -148,6 +149,16 @@ def _parse_override(value: Any) -> Any:
return str(value)
def _handle_renamed_language_codes(lang: Optional[str]) -> None:
# Throw error for renamed language codes in v4
if lang in RENAMED_LANGUAGE_CODES:
msg.fail(
title="Renamed language code",
text=f"Language code '{lang}' was replaced with '{RENAMED_LANGUAGE_CODES[lang]}' in spaCy v4. Update the language code from '{lang}' to '{RENAMED_LANGUAGE_CODES[lang]}'.",
exits=1,
)
@contextmanager
def show_validation_error(
file_path: Optional[Union[str, Path]] = None,
@ -192,6 +203,13 @@ def show_validation_error(
msg.fail("Config validation error", e, exits=1)
def import_code_paths(code_paths: str) -> None:
"""Helper to import comma-separated list of code paths."""
code_paths = [Path(p.strip()) for p in string_to_list(code_paths)]
for code_path in code_paths:
import_code(code_path)
def import_code(code_path: Optional[Union[Path, str]]) -> None:
"""Helper to import Python file provided in training commands / commands
using the config. This makes custom registered functions available.

View File

@ -11,7 +11,7 @@ from ._util import (
Arg,
Opt,
app,
import_code,
import_code_paths,
parse_config_overrides,
show_validation_error,
)
@ -26,7 +26,7 @@ def assemble_cli(
ctx: typer.Context, # This is only used to read additional arguments
config_path: Path = Arg(..., help="Path to config file", exists=True, allow_dash=True),
output_path: Path = Arg(..., help="Output directory to store assembled pipeline in"),
code_path: Optional[Path] = Opt(None, "--code", "-c", help="Path to Python file with additional code (registered functions) to be imported"),
code_path: str = Opt("", "--code", "-c", help="Comma-separated paths to Python files with additional code (registered functions) to be imported"),
verbose: bool = Opt(False, "--verbose", "-V", "-VV", help="Display more information for debugging purposes"),
# fmt: on
):
@ -46,7 +46,7 @@ def assemble_cli(
if not config_path or (str(config_path) != "-" and not config_path.exists()):
msg.fail("Config file not found", config_path, exits=1)
overrides = parse_config_overrides(ctx.args)
import_code(code_path)
import_code_paths(code_path)
with show_validation_error(config_path):
config = util.load_config(config_path, overrides=overrides, interpolate=False)
msg.divider("Initializing pipeline")

View File

@ -16,7 +16,7 @@ from ..training.converters import (
iob_to_docs,
json_to_docs,
)
from ._util import Arg, Opt, app, walk_directory
from ._util import Arg, Opt, _handle_renamed_language_codes, app, walk_directory
# Converters are matched by file extension except for ner/iob, which are
# matched by file extension and content. To add a converter, add a new
@ -116,6 +116,10 @@ def convert(
input_path = Path(input_path)
if not msg:
msg = Printer(no_print=silent)
# Throw error for renamed language codes in v4
_handle_renamed_language_codes(lang)
ner_map = srsly.read_json(ner_map) if ner_map is not None else None
doc_files = []
for input_loc in walk_directory(input_path, converter):

View File

@ -13,7 +13,7 @@ from ._util import (
Arg,
Opt,
debug_cli,
import_code,
import_code_paths,
parse_config_overrides,
show_validation_error,
)
@ -27,7 +27,7 @@ def debug_config_cli(
# fmt: off
ctx: typer.Context, # This is only used to read additional arguments
config_path: Path = Arg(..., help="Path to config file", exists=True, allow_dash=True),
code_path: Optional[Path] = Opt(None, "--code-path", "--code", "-c", help="Path to Python file with additional code (registered functions) to be imported"),
code_path: str = Opt("", "--code", "-c", help="Comma-separated paths to Python files with additional code (registered functions) to be imported"),
show_funcs: bool = Opt(False, "--show-functions", "-F", help="Show an overview of all registered functions used in the config and where they come from (modules, files etc.)"),
show_vars: bool = Opt(False, "--show-variables", "-V", help="Show an overview of all variables referenced in the config and their values. This will also reflect variables overwritten on the CLI.")
# fmt: on
@ -44,7 +44,7 @@ def debug_config_cli(
DOCS: https://spacy.io/api/cli#debug-config
"""
overrides = parse_config_overrides(ctx.args)
import_code(code_path)
import_code_paths(code_path)
debug_config(
config_path, overrides=overrides, show_funcs=show_funcs, show_vars=show_vars
)

View File

@ -7,6 +7,7 @@ from typing import (
Dict,
Iterable,
List,
Literal,
Optional,
Sequence,
Set,
@ -22,7 +23,6 @@ import typer
from wasabi import MESSAGES, Printer, msg
from .. import util
from ..compat import Literal
from ..language import Language
from ..morphology import Morphology
from ..pipeline import Morphologizer, SpanCategorizer, TrainablePipe
@ -40,7 +40,7 @@ from ._util import (
_format_number,
app,
debug_cli,
import_code,
import_code_paths,
parse_config_overrides,
show_validation_error,
)
@ -72,7 +72,7 @@ def debug_data_cli(
# fmt: off
ctx: typer.Context, # This is only used to read additional arguments
config_path: Path = Arg(..., help="Path to config file", exists=True, allow_dash=True),
code_path: Optional[Path] = Opt(None, "--code-path", "--code", "-c", help="Path to Python file with additional code (registered functions) to be imported"),
code_path: str = Opt("", "--code", "-c", help="Comma-separated paths to Python files with additional code (registered functions) to be imported"),
ignore_warnings: bool = Opt(False, "--ignore-warnings", "-IW", help="Ignore warnings, only show stats and errors"),
verbose: bool = Opt(False, "--verbose", "-V", help="Print additional information and explanations"),
no_format: bool = Opt(False, "--no-format", "-NF", help="Don't pretty-print the results"),
@ -92,7 +92,7 @@ def debug_data_cli(
"--help for an overview of the other available debugging commands."
)
overrides = parse_config_overrides(ctx.args)
import_code(code_path)
import_code_paths(code_path)
debug_data(
config_path,
config_overrides=overrides,
@ -1073,8 +1073,7 @@ def _get_distribution(docs, normalize: bool = True) -> Counter:
word_counts: Counter = Counter()
for doc in docs:
for token in doc:
# Normalize the text
t = token.text.lower().replace("``", '"').replace("''", '"')
t = token.text.lower()
word_counts[t] += 1
if normalize:
total = sum(word_counts.values(), 0.0)

View File

@ -170,7 +170,7 @@ def debug_model(
msg.divider(f"STEP 3 - prediction")
msg.info(str(prediction))
msg.good(f"Successfully ended analysis - model looks good.")
msg.good(f"Succesfully ended analysis - model looks good.")
def _sentences():

98
spacy/cli/distill.py Normal file
View File

@ -0,0 +1,98 @@
import logging
import sys
from pathlib import Path
from typing import Any, Dict, Optional, Union
import typer
from wasabi import msg
from .. import util
from ..pipeline.trainable_pipe import TrainablePipe
from ..schemas import ConfigSchemaDistill
from ..training.initialize import init_nlp_student
from ..training.loop import distill as distill_nlp
from ._util import (
Arg,
Opt,
app,
import_code_paths,
parse_config_overrides,
setup_gpu,
show_validation_error,
)
@app.command(
"distill",
context_settings={"allow_extra_args": True, "ignore_unknown_options": True},
)
def distill_cli(
# fmt: off
ctx: typer.Context, # This is only used to read additional arguments
teacher_model: str = Arg(..., help="Teacher model name or path"),
student_config_path: Path = Arg(..., help="Path to config file", exists=True, allow_dash=True),
output_path: Optional[Path] = Opt(None, "--output", "--output-path", "-o", help="Output directory to store trained pipeline in"),
code_path: str = Opt("", "--code", "-c", help="Comma-separated paths to Python files with additional code (registered functions) to be imported"),
verbose: bool = Opt(False, "--verbose", "-V", "-VV", help="Display more information for debugging purposes"),
use_gpu: int = Opt(-1, "--gpu-id", "-g", help="GPU ID or -1 for CPU")
# fmt: on
):
"""
Distill a spaCy pipeline from a teacher model.
DOCS: https://spacy.io/api/cli#distill
"""
util.logger.setLevel(logging.DEBUG if verbose else logging.INFO)
overrides = parse_config_overrides(ctx.args)
import_code_paths(code_path)
distill(
teacher_model,
student_config_path,
output_path,
use_gpu=use_gpu,
overrides=overrides,
)
def distill(
teacher_model: Union[str, Path],
student_config_path: Union[str, Path],
output_path: Optional[Union[str, Path]] = None,
*,
use_gpu: int = -1,
overrides: Dict[str, Any] = util.SimpleFrozenDict(),
):
student_config_path = util.ensure_path(student_config_path)
output_path = util.ensure_path(output_path)
# Make sure all files and paths exist if they are needed
if not student_config_path or (
str(student_config_path) != "-" and not student_config_path.exists()
):
msg.fail("Student config file not found", student_config_path, exits=1)
if not output_path:
msg.info("No output directory provided")
else:
if not output_path.exists():
output_path.mkdir(parents=True)
msg.good(f"Created output directory: {output_path}")
msg.info(f"Saving to output directory: {output_path}")
setup_gpu(use_gpu)
teacher = util.load_model(teacher_model)
with show_validation_error(student_config_path):
config = util.load_config(
student_config_path, overrides=overrides, interpolate=False
)
msg.divider("Initializing student pipeline")
with show_validation_error(student_config_path, hint_fill=False):
student = init_nlp_student(config, teacher, use_gpu=use_gpu)
msg.good("Initialized student pipeline")
msg.divider("Distilling student pipeline from teacher")
distill_nlp(
teacher,
student,
output_path,
use_gpu=use_gpu,
stdout=sys.stdout,
stderr=sys.stderr,
)

View File

@ -7,9 +7,10 @@ import typer
from wasabi import msg
from .. import about
from ..errors import OLD_MODEL_SHORTCUTS
from ..util import (
get_installed_models,
get_minor_version,
get_package_version,
is_in_interactive,
is_in_jupyter,
is_package,
@ -76,15 +77,17 @@ def download(
version = components[-1]
else:
model_name = model
if model in OLD_MODEL_SHORTCUTS:
msg.warn(
f"As of spaCy v3.0, shortcuts like '{model}' are deprecated. Please "
f"use the full pipeline package name '{OLD_MODEL_SHORTCUTS[model]}' instead."
)
model_name = OLD_MODEL_SHORTCUTS[model]
compatibility = get_compatibility()
version = get_version(model_name, compatibility)
# If we already have this version installed, skip downloading
installed = get_installed_models()
if model_name in installed:
installed_version = get_package_version(model_name)
if installed_version == version:
msg.warn(f"{model_name} v{version} already installed, skipping")
return
filename = get_model_filename(model_name, version, sdist)
download_model(filename, pip_args)

View File

@ -10,7 +10,7 @@ from .. import displacy, util
from ..scorer import Scorer
from ..tokens import Doc
from ..training import Corpus
from ._util import Arg, Opt, app, benchmark_cli, import_code, setup_gpu
from ._util import Arg, Opt, app, benchmark_cli, import_code_paths, setup_gpu
@benchmark_cli.command(
@ -22,7 +22,7 @@ def evaluate_cli(
model: str = Arg(..., help="Model name or path"),
data_path: Path = Arg(..., help="Location of binary evaluation data in .spacy format", exists=True),
output: Optional[Path] = Opt(None, "--output", "-o", help="Output JSON file for metrics", dir_okay=False),
code_path: Optional[Path] = Opt(None, "--code", "-c", help="Path to Python file with additional code (registered functions) to be imported"),
code_path: str = Opt("", "--code", "-c", help="Comma-separated paths to Python files with additional code (registered functions) to be imported"),
use_gpu: int = Opt(-1, "--gpu-id", "-g", help="GPU ID or -1 for CPU"),
gold_preproc: bool = Opt(False, "--gold-preproc", "-G", help="Use gold preprocessing"),
displacy_path: Optional[Path] = Opt(None, "--displacy-path", "-dp", help="Directory to output rendered parses as HTML", exists=True, file_okay=False),
@ -43,7 +43,7 @@ def evaluate_cli(
DOCS: https://spacy.io/api/cli#benchmark-accuracy
"""
import_code(code_path)
import_code_paths(code_path)
evaluate(
model,
data_path,

View File

@ -1,3 +1,4 @@
import importlib.metadata
import json
import platform
from pathlib import Path
@ -7,7 +8,6 @@ import srsly
from wasabi import MarkdownRenderer, Printer
from .. import about, util
from ..compat import importlib_metadata
from ._util import Arg, Opt, app, string_to_list
from .download import get_latest_version, get_model_filename
@ -137,7 +137,7 @@ def info_installed_model_url(model: str) -> Optional[str]:
dist-info available.
"""
try:
dist = importlib_metadata.distribution(model)
dist = importlib.metadata.distribution(model)
text = dist.read_text("direct_url.json")
if isinstance(text, str):
data = json.loads(text)

View File

@ -9,13 +9,14 @@ from thinc.api import Config
from wasabi import Printer, diff_strings
from .. import util
from ..language import DEFAULT_CONFIG_PRETRAIN_PATH
from ..language import DEFAULT_CONFIG_DISTILL_PATH, DEFAULT_CONFIG_PRETRAIN_PATH
from ..schemas import RecommendationSchema
from ..util import SimpleFrozenList
from ._util import (
COMMAND,
Arg,
Opt,
_handle_renamed_language_codes,
import_code,
init_cli,
show_validation_error,
@ -50,7 +51,7 @@ class InitValues:
def init_config_cli(
# fmt: off
output_file: Path = Arg(..., help="File to save the config to or - for stdout (will only output config and no additional logging info)", allow_dash=True),
lang: str = Opt(InitValues.lang, "--lang", "-l", help="Two-letter code of the language to use"),
lang: str = Opt(InitValues.lang, "--lang", "-l", help="Code of the language to use"),
pipeline: str = Opt(",".join(InitValues.pipeline), "--pipeline", "-p", help="Comma-separated names of trainable pipeline components to include (without 'tok2vec' or 'transformer')"),
optimize: Optimizations = Opt(InitValues.optimize, "--optimize", "-o", help="Whether to optimize for efficiency (faster inference, smaller model, lower memory consumption) or higher accuracy (potentially larger and slower model). This will impact the choice of architecture, pretrained weights and related hyperparameters."),
gpu: bool = Opt(InitValues.gpu, "--gpu", "-G", help="Whether the model can run on GPU. This will impact the choice of architecture, pretrained weights and related hyperparameters."),
@ -90,6 +91,7 @@ def init_fill_config_cli(
# fmt: off
base_path: Path = Arg(..., help="Path to base config to fill", exists=True, dir_okay=False),
output_file: Path = Arg("-", help="Path to output .cfg file (or - for stdout)", allow_dash=True),
distillation: bool = Opt(False, "--distillation", "-dt", help="Include config for distillation (with 'spacy distill')"),
pretraining: bool = Opt(False, "--pretraining", "-pt", help="Include config for pretraining (with 'spacy pretrain')"),
diff: bool = Opt(False, "--diff", "-D", help="Print a visual diff highlighting the changes"),
code_path: Optional[Path] = Opt(None, "--code-path", "--code", "-c", help="Path to Python file with additional code (registered functions) to be imported"),
@ -105,13 +107,20 @@ def init_fill_config_cli(
DOCS: https://spacy.io/api/cli#init-fill-config
"""
import_code(code_path)
fill_config(output_file, base_path, pretraining=pretraining, diff=diff)
fill_config(
output_file,
base_path,
distillation=distillation,
pretraining=pretraining,
diff=diff,
)
def fill_config(
output_file: Path,
base_path: Path,
*,
distillation: bool = False,
pretraining: bool = False,
diff: bool = False,
silent: bool = False,
@ -130,6 +139,9 @@ def fill_config(
# replaced with their actual config after loading, so we have to re-add them
sourced = util.get_sourced_components(config)
filled["components"].update(sourced)
if distillation:
distillation_config = util.load_config(DEFAULT_CONFIG_DISTILL_PATH)
filled = distillation_config.merge(filled)
if pretraining:
validate_config_for_pretrain(filled, msg)
pretrain_config = util.load_config(DEFAULT_CONFIG_PRETRAIN_PATH)
@ -165,6 +177,10 @@ def init_config(
msg = Printer(no_print=silent)
with TEMPLATE_PATH.open("r") as f:
template = Template(f.read())
# Throw error for renamed language codes in v4
_handle_renamed_language_codes(lang)
# Filter out duplicates since tok2vec and transformer are added by template
pipeline = [pipe for pipe in pipeline if pipe not in ("tok2vec", "transformer")]
defaults = RECOMMENDATIONS["__default__"]

View File

@ -12,6 +12,7 @@ from ..training.initialize import convert_vectors, init_nlp
from ._util import (
Arg,
Opt,
_handle_renamed_language_codes,
import_code,
init_cli,
parse_config_overrides,
@ -29,7 +30,6 @@ def init_vectors_cli(
prune: int = Opt(-1, "--prune", "-p", help="Optional number of vectors to prune to"),
truncate: int = Opt(0, "--truncate", "-t", help="Optional number of vectors to truncate to when reading in vectors file"),
mode: str = Opt("default", "--mode", "-m", help="Vectors mode: default or floret"),
name: Optional[str] = Opt(None, "--name", "-n", help="Optional name for the word vectors, e.g. en_core_web_lg.vectors"),
verbose: bool = Opt(False, "--verbose", "-V", "-VV", help="Display more information for debugging purposes"),
jsonl_loc: Optional[Path] = Opt(None, "--lexemes-jsonl", "-j", help="Location of JSONL-formatted attributes file", hidden=True),
attr: str = Opt("ORTH", "--attr", "-a", help="Optional token attribute to use for vectors, e.g. LOWER or NORM"),
@ -39,8 +39,11 @@ def init_vectors_cli(
you can use in the [initialize] block of your config to initialize
a model with vectors.
"""
if verbose:
util.logger.setLevel(logging.DEBUG)
util.logger.setLevel(logging.DEBUG if verbose else logging.INFO)
# Throw error for renamed language codes in v4
_handle_renamed_language_codes(lang)
msg.info(f"Creating blank nlp object for language '{lang}'")
nlp = util.get_lang_class(lang)()
if jsonl_loc is not None:
@ -50,7 +53,6 @@ def init_vectors_cli(
vectors_loc,
truncate=truncate,
prune=prune,
name=name,
mode=mode,
attr=attr,
)

View File

@ -1,3 +1,4 @@
import importlib.metadata
import os
import re
import shutil
@ -13,7 +14,6 @@ from thinc.api import Config
from wasabi import MarkdownRenderer, Printer, get_raw_input
from .. import about, util
from ..compat import importlib_metadata
from ..schemas import ModelMetaSchema, validate
from ._util import SDIST_SUFFIX, WHEEL_SUFFIX, Arg, Opt, app, string_to_list
@ -23,14 +23,13 @@ def package_cli(
# fmt: off
input_dir: Path = Arg(..., help="Directory with pipeline data", exists=True, file_okay=False),
output_dir: Path = Arg(..., help="Output parent directory", exists=True, file_okay=False),
code_paths: str = Opt("", "--code", "-c", help="Comma-separated paths to Python file with additional code (registered functions) to be included in the package"),
code_paths: str = Opt("", "--code", "-c", help="Comma-separated paths to Python files with additional code (registered functions) to be included in the package"),
meta_path: Optional[Path] = Opt(None, "--meta-path", "--meta", "-m", help="Path to meta.json", exists=True, dir_okay=False),
create_meta: bool = Opt(False, "--create-meta", "-C", help="Create meta.json, even if one exists"),
name: Optional[str] = Opt(None, "--name", "-n", help="Package name to override meta"),
version: Optional[str] = Opt(None, "--version", "-v", help="Package version to override meta"),
build: str = Opt("sdist", "--build", "-b", help="Comma-separated formats to build: sdist and/or wheel, or none."),
force: bool = Opt(False, "--force", "-f", "-F", help="Force overwriting existing data in output directory"),
require_parent: bool = Opt(True, "--require-parent/--no-require-parent", "-R", "-R", help="Include the parent package (e.g. spacy) in the requirements"),
# fmt: on
):
"""
@ -61,7 +60,6 @@ def package_cli(
create_sdist=create_sdist,
create_wheel=create_wheel,
force=force,
require_parent=require_parent,
silent=False,
)
@ -76,7 +74,6 @@ def package(
create_meta: bool = False,
create_sdist: bool = True,
create_wheel: bool = False,
require_parent: bool = False,
force: bool = False,
silent: bool = True,
) -> None:
@ -116,7 +113,7 @@ def package(
if not meta_path.exists() or not meta_path.is_file():
msg.fail("Can't load pipeline meta.json", meta_path, exits=1)
meta = srsly.read_json(meta_path)
meta = get_meta(input_dir, meta, require_parent=require_parent)
meta = get_meta(input_dir, meta)
if meta["requirements"]:
msg.good(
f"Including {len(meta['requirements'])} package requirement(s) from "
@ -189,7 +186,6 @@ def package(
imports.append(code_path.stem)
shutil.copy(str(code_path), str(package_path))
create_file(main_path / "meta.json", srsly.json_dumps(meta, indent=2))
create_file(main_path / "setup.py", TEMPLATE_SETUP)
create_file(main_path / "MANIFEST.in", TEMPLATE_MANIFEST)
init_py = TEMPLATE_INIT.format(
@ -254,9 +250,9 @@ def has_build() -> bool:
# in an editable install), so an import check is not sufficient; instead
# check that there is a package version
try:
importlib_metadata.version("build")
importlib.metadata.version("build")
return True
except importlib_metadata.PackageNotFoundError: # type: ignore[attr-defined]
except importlib.metadata.PackageNotFoundError: # type: ignore[attr-defined]
return False
@ -306,8 +302,6 @@ def get_third_party_dependencies(
modules.add(func_info["module"].split(".")[0]) # type: ignore[union-attr]
dependencies = []
for module_name in modules:
if module_name == about.__title__:
continue
if module_name in distributions:
dist = distributions.get(module_name)
if dist:
@ -338,9 +332,7 @@ def create_file(file_path: Path, contents: str) -> None:
def get_meta(
model_path: Union[str, Path],
existing_meta: Dict[str, Any],
require_parent: bool = False,
model_path: Union[str, Path], existing_meta: Dict[str, Any]
) -> Dict[str, Any]:
meta: Dict[str, Any] = {
"lang": "en",
@ -360,7 +352,6 @@ def get_meta(
"width": nlp.vocab.vectors_length,
"vectors": len(nlp.vocab.vectors),
"keys": nlp.vocab.vectors.n_keys,
"name": nlp.vocab.vectors.name,
}
if about.__title__ != "spacy":
meta["parent_package"] = about.__title__
@ -369,8 +360,6 @@ def get_meta(
existing_reqs = [util.split_requirement(req)[0] for req in meta["requirements"]]
reqs = get_third_party_dependencies(nlp.config, exclude=existing_reqs)
meta["requirements"].extend(reqs)
if require_parent and about.__title__ not in meta["requirements"]:
meta["requirements"].append(about.__title__ + meta["spacy_version"])
return meta
@ -545,11 +534,8 @@ def list_files(data_dir):
def list_requirements(meta):
# Up to version 3.7, we included the parent package
# in requirements by default. This behaviour is removed
# in 3.8, with a setting to include the parent package in
# the requirements list in the meta if desired.
requirements = []
parent_package = meta.get('parent_package', 'spacy')
requirements = [parent_package + meta['spacy_version']]
if 'setup_requires' in meta:
requirements += meta['setup_requires']
if 'requirements' in meta:

View File

@ -11,7 +11,7 @@ from ._util import (
Arg,
Opt,
app,
import_code,
import_code_paths,
parse_config_overrides,
setup_gpu,
show_validation_error,
@ -27,7 +27,7 @@ def pretrain_cli(
ctx: typer.Context, # This is only used to read additional arguments
config_path: Path = Arg(..., help="Path to config file", exists=True, dir_okay=False, allow_dash=True),
output_dir: Path = Arg(..., help="Directory to write weights to on each epoch"),
code_path: Optional[Path] = Opt(None, "--code", "-c", help="Path to Python file with additional code (registered functions) to be imported"),
code_path: str = Opt("", "--code", "-c", help="Comma-separated paths to Python files with additional code (registered functions) to be imported"),
resume_path: Optional[Path] = Opt(None, "--resume-path", "-r", help="Path to pretrained weights from which to resume pretraining"),
epoch_resume: Optional[int] = Opt(None, "--epoch-resume", "-er", help="The epoch to resume counting from when using --resume-path. Prevents unintended overwriting of existing weight files."),
use_gpu: int = Opt(-1, "--gpu-id", "-g", help="GPU ID or -1 for CPU"),
@ -56,7 +56,7 @@ def pretrain_cli(
DOCS: https://spacy.io/api/cli#pretrain
"""
config_overrides = parse_config_overrides(ctx.args)
import_code(code_path)
import_code_paths(code_path)
verify_cli_args(config_path, output_dir, resume_path, epoch_resume)
setup_gpu(use_gpu)
msg.info(f"Loading config from: {config_path}")

View File

@ -238,7 +238,7 @@ grad_factor = 1.0
{% if "entity_linker" in components -%}
[components.entity_linker]
factory = "entity_linker"
get_candidates = {"@misc":"spacy.CandidateGenerator.v1"}
get_candidates = {"@misc":"spacy.CandidateGenerator.v2"}
incl_context = true
incl_prior = true
@ -517,7 +517,7 @@ width = ${components.tok2vec.model.encode.width}
{% if "entity_linker" in components -%}
[components.entity_linker]
factory = "entity_linker"
get_candidates = {"@misc":"spacy.CandidateGenerator.v1"}
get_candidates = {"@misc":"spacy.CandidateGenerator.v2"}
incl_context = true
incl_prior = true

View File

@ -13,7 +13,7 @@ from ._util import (
Arg,
Opt,
app,
import_code,
import_code_paths,
parse_config_overrides,
setup_gpu,
show_validation_error,
@ -28,7 +28,7 @@ def train_cli(
ctx: typer.Context, # This is only used to read additional arguments
config_path: Path = Arg(..., help="Path to config file", exists=True, allow_dash=True),
output_path: Optional[Path] = Opt(None, "--output", "--output-path", "-o", help="Output directory to store trained pipeline in"),
code_path: Optional[Path] = Opt(None, "--code", "-c", help="Path to Python file with additional code (registered functions) to be imported"),
code_path: str = Opt("", "--code", "-c", help="Comma-separated paths to Python files with additional code (registered functions) to be imported"),
verbose: bool = Opt(False, "--verbose", "-V", "-VV", help="Display more information for debugging purposes"),
use_gpu: int = Opt(-1, "--gpu-id", "-g", help="GPU ID or -1 for CPU")
# fmt: on
@ -50,7 +50,7 @@ def train_cli(
if verbose:
util.logger.setLevel(logging.DEBUG)
overrides = parse_config_overrides(ctx.args)
import_code(code_path)
import_code_paths(code_path)
train(config_path, output_path, use_gpu=use_gpu, overrides=overrides)

View File

@ -23,19 +23,6 @@ try:
except ImportError:
cupy = None
if sys.version_info[:2] >= (3, 8): # Python 3.8+
from typing import Literal, Protocol, runtime_checkable
else:
from typing_extensions import Literal, Protocol, runtime_checkable # noqa: F401
# Important note: The importlib_metadata "backport" includes functionality
# that's not part of the built-in importlib.metadata. We should treat this
# import like the built-in and only use what's available there.
try: # Python 3.8+
import importlib.metadata as importlib_metadata
except ImportError:
from catalogue import _importlib_metadata as importlib_metadata # type: ignore[no-redef] # noqa: F401
from thinc.api import Optimizer # noqa: F401
pickle = pickle

View File

@ -0,0 +1,34 @@
[paths]
raw_text = null
[distillation]
corpus = "corpora.distillation"
dropout = 0.1
max_epochs = 1
max_steps = 0
student_to_teacher = {}
[distillation.batcher]
@batchers = "spacy.batch_by_words.v1"
size = 3000
discard_oversize = false
tolerance = 0.2
[distillation.optimizer]
@optimizers = "Adam.v1"
beta1 = 0.9
beta2 = 0.999
L2_is_weight_decay = true
L2 = 0.01
grad_clip = 1.0
use_averages = true
eps = 1e-8
learn_rate = 1e-4
[corpora]
[corpora.distillation]
@readers = "spacy.PlainTextCorpus.v1"
path = ${paths.raw_text}
min_length = 0
max_length = 0

View File

@ -1,6 +1,7 @@
import warnings
from typing import Literal
from .compat import Literal
from . import about
class ErrorsWithCodes(type):
@ -83,7 +84,7 @@ class Warnings(metaclass=ErrorsWithCodes):
"ignoring the duplicate entry.")
W021 = ("Unexpected hash collision in PhraseMatcher. Matches may be "
"incorrect. Modify PhraseMatcher._terminal_hash to fix.")
W024 = ("Entity '{entity}' - Alias '{alias}' combination already exists in "
W024 = ("Entity '{entity}' - alias '{alias}' combination already exists in "
"the Knowledge Base.")
W026 = ("Unable to set all sentence boundaries from dependency parses. If "
"you are constructing a parse tree incrementally by setting "
@ -104,13 +105,14 @@ class Warnings(metaclass=ErrorsWithCodes):
"table. This may degrade the performance of the model to some "
"degree. If this is intentional or the language you're using "
"doesn't have a normalization table, please ignore this warning. "
"If this is surprising, make sure you have the spacy-lookups-data "
"package installed and load the table in your config. The "
"languages with lexeme normalization tables are currently: "
"{langs}\n\nLoad the table in your config with:\n\n"
"If this is surprising, make sure you are loading the table in "
"your config. The languages with lexeme normalization tables are "
"currently: {langs}\n\nAn example of how to load a table in "
"your config :\n\n"
"[initialize.lookups]\n"
"@misc = \"spacy.LookupsDataLoader.v1\"\n"
"@misc = \"spacy.LookupsDataLoaderFromURL.v1\"\n"
"lang = ${{nlp.lang}}\n"
f'url = "{about.__lookups_url__}"\n'
"tables = [\"lexeme_norm\"]\n")
W035 = ("Discarding subpattern '{pattern}' due to an unrecognized "
"attribute or operator.")
@ -132,13 +134,6 @@ class Warnings(metaclass=ErrorsWithCodes):
"and make it independent. For example, `replace_listeners = "
"[\"model.tok2vec\"]` See the documentation for details: "
"https://spacy.io/usage/training#config-components-listeners")
W088 = ("The pipeline component {name} implements a `begin_training` "
"method, which won't be called by spaCy. As of v3.0, `begin_training` "
"has been renamed to `initialize`, so you likely want to rename the "
"component method. See the documentation for details: "
"https://spacy.io/api/language#initialize")
W089 = ("As of spaCy v3.0, the `nlp.begin_training` method has been renamed "
"to `nlp.initialize`.")
W090 = ("Could not locate any {format} files in path '{path}'.")
W091 = ("Could not clean/remove the temp directory at {dir}: {msg}.")
W092 = ("Ignoring annotations for sentence starts, as dependency heads are set.")
@ -222,6 +217,11 @@ class Warnings(metaclass=ErrorsWithCodes):
W126 = ("These keys are unsupported: {unsupported}")
W127 = ("Not all `Language.pipe` worker processes completed successfully")
# v4 warning strings
W401 = ("`incl_prior is True`, but the selected knowledge base type {kb_type} doesn't support prior probability "
"lookups so this setting will be ignored. If your KB does support prior probability lookups, make sure "
"to return `True` in `.supports_prior_probs`.")
class Errors(metaclass=ErrorsWithCodes):
E001 = ("No component '{name}' found in pipeline. Available names: {opts}")
@ -256,9 +256,7 @@ class Errors(metaclass=ErrorsWithCodes):
"https://spacy.io/usage/models")
E011 = ("Unknown operator: '{op}'. Options: {opts}")
E012 = ("Cannot add pattern for zero tokens to matcher.\nKey: {key}")
E016 = ("MultitaskObjective target should be function or one of: dep, "
"tag, ent, dep_tag_offset, ent_tag.")
E017 = ("Can only add unicode or bytes. Got type: {value_type}")
E017 = ("Can only add 'str' inputs to StringStore. Got type: {value_type}")
E018 = ("Can't retrieve string for hash '{hash_value}'. This usually "
"refers to an issue with the `Vocab` or `StringStore`.")
E019 = ("Can't create transition with unknown action ID: {action}. Action "
@ -470,13 +468,13 @@ class Errors(metaclass=ErrorsWithCodes):
"same, but found '{nlp}' and '{vocab}' respectively.")
E152 = ("The attribute {attr} is not supported for token patterns. "
"Please use the option `validate=True` with the Matcher, PhraseMatcher, "
"EntityRuler or AttributeRuler for more details.")
"SpanRuler or AttributeRuler for more details.")
E153 = ("The value type {vtype} is not supported for token patterns. "
"Please use the option validate=True with Matcher, PhraseMatcher, "
"EntityRuler or AttributeRuler for more details.")
"SpanRuler or AttributeRuler for more details.")
E154 = ("One of the attributes or values is not supported for token "
"patterns. Please use the option `validate=True` with the Matcher, "
"PhraseMatcher, or EntityRuler for more details.")
"PhraseMatcher, or SpanRuler for more details.")
E155 = ("The pipeline needs to include a {pipe} in order to use "
"Matcher or PhraseMatcher with the attribute {attr}. "
"Try using `nlp()` instead of `nlp.make_doc()` or `list(nlp.pipe())` "
@ -500,7 +498,7 @@ class Errors(metaclass=ErrorsWithCodes):
"Current DocBin: {current}\nOther DocBin: {other}")
E169 = ("Can't find module: {module}")
E170 = ("Cannot apply transition {name}: invalid for the current state.")
E171 = ("Matcher.add received invalid 'on_match' callback argument: expected "
E171 = ("{name}.add received invalid 'on_match' callback argument: expected "
"callable or None, but got: {arg_type}")
E175 = ("Can't remove rule for unknown match pattern ID: {key}")
E176 = ("Alias '{alias}' is not defined in the Knowledge Base.")
@ -739,13 +737,6 @@ class Errors(metaclass=ErrorsWithCodes):
"method in component '{name}'. If you want to use this "
"method, make sure it's overwritten on the subclass.")
E940 = ("Found NaN values in scores.")
E941 = ("Can't find model '{name}'. It looks like you're trying to load a "
"model from a shortcut, which is obsolete as of spaCy v3.0. To "
"load the model, use its full name instead:\n\n"
"nlp = spacy.load(\"{full}\")\n\nFor more details on the available "
"models, see the models directory: https://spacy.io/models and if "
"you want to create a blank model, use spacy.blank: "
"nlp = spacy.blank(\"{name}\")")
E942 = ("Executing `after_{name}` callback failed. Expected the function to "
"return an initialized nlp object but got: {value}. Maybe "
"you forgot to return the modified object in your function?")
@ -759,7 +750,7 @@ class Errors(metaclass=ErrorsWithCodes):
"loaded nlp object, but got: {source}")
E947 = ("`Matcher.add` received invalid `greedy` argument: expected "
"a string value from {expected} but got: '{arg}'")
E948 = ("`Matcher.add` received invalid 'patterns' argument: expected "
E948 = ("`{name}.add` received invalid 'patterns' argument: expected "
"a list, but got: {arg_type}")
E949 = ("Unable to align tokens for the predicted and reference docs. It "
"is only possible to align the docs when both texts are the same "
@ -933,8 +924,6 @@ class Errors(metaclass=ErrorsWithCodes):
E1021 = ("`pos` value \"{pp}\" is not a valid Universal Dependencies tag. "
"Non-UD tags should use the `tag` property.")
E1022 = ("Words must be of type str or int, but input is of type '{wtype}'")
E1023 = ("Couldn't read EntityRuler from the {path}. This file doesn't "
"exist.")
E1024 = ("A pattern with {attr_type} '{label}' is not present in "
"'{component}' patterns.")
E1025 = ("Cannot intify the value '{value}' as an IOB string. The only "
@ -945,7 +934,7 @@ class Errors(metaclass=ErrorsWithCodes):
E1029 = ("Edit tree cannot be applied to form.")
E1030 = ("Edit tree identifier out of range.")
E1031 = ("Could not find gold transition - see logs above.")
E1032 = ("`{var}` should not be {forbidden}, but received {value}.")
E1032 = ("Span {var} {value} is out of bounds for {obj} with length {length}.")
E1033 = ("Dimension {name} invalid -- only nO, nF, nP")
E1034 = ("Node index {i} out of bounds ({length})")
E1035 = ("Token index {i} out of bounds ({length})")
@ -962,7 +951,6 @@ class Errors(metaclass=ErrorsWithCodes):
"case pass an empty list for the previously not specified argument to avoid this error.")
E1043 = ("Expected None or a value in range [{range_start}, {range_end}] for entity linker threshold, but got "
"{value}.")
E1044 = ("Expected `candidates_batch_size` to be >= 1, but got: {value}")
E1045 = ("Encountered {parent} subclass without `{parent}.{method}` "
"method in '{name}'. If you want to use this method, make "
"sure it's overwritten on the subclass.")
@ -989,15 +977,35 @@ class Errors(metaclass=ErrorsWithCodes):
"reduction. Please enable one of `use_reduce_first`, "
"`use_reduce_last`, `use_reduce_max` or `use_reduce_mean`.")
# v4 error strings
E4000 = ("Expected a Doc as input, but got: '{type}'")
E4001 = ("Expected input to be one of the following types: ({expected_types}), "
"but got '{received_type}'")
E4002 = ("Pipe '{name}' requires a teacher pipe for distillation.")
E4003 = ("Training examples for distillation must have the exact same tokens in the "
"reference and predicted docs.")
E4004 = ("Backprop is not supported when is_train is not set.")
E4005 = ("EntityLinker_v1 is not supported in spaCy v4. Update your configuration.")
E4006 = ("Expected `entity_id` to be of type {exp_type}, but is of type {found_type}.")
E4007 = ("Span {var} {value} must be {op} Span {existing_var} "
"{existing_value}.")
E4008 = ("Span {pos}_char {value} does not correspond to a token {pos}.")
E4009 = ("The '{attr}' parameter should be 'None' or 'True', but found '{value}'.")
E4010 = ("Required lemmatizer table(s) {missing_tables} not found in "
"[initialize] or in registered lookups (spacy-lookups-data). An "
"example for how to load lemmatizer tables in [initialize]:\n\n"
"[initialize.components]\n\n"
"[initialize.components.{pipe_name}]\n\n"
"[initialize.components.{pipe_name}.lookups]\n"
'@misc = "spacy.LookupsDataLoaderFromURL.v1"\n'
"lang = ${{nlp.lang}}\n"
f'url = "{about.__lookups_url__}"\n'
"tables = {tables}\n"
"# or required tables only: tables = {required_tables}\n")
E4011 = ("Server error ({status_code}), couldn't fetch {url}")
# Deprecated model shortcuts, only used in errors and warnings
OLD_MODEL_SHORTCUTS = {
"en": "en_core_web_sm", "de": "de_core_news_sm", "es": "es_core_news_sm",
"pt": "pt_core_news_sm", "fr": "fr_core_news_sm", "it": "it_core_news_sm",
"nl": "nl_core_news_sm", "el": "el_core_news_sm", "nb": "nb_core_news_sm",
"lt": "lt_core_news_sm", "xx": "xx_ent_wiki_sm"
}
RENAMED_LANGUAGE_CODES = {"xx": "mul", "is": "isl"}
# fmt: on

View File

@ -1,11 +1,10 @@
from .candidate import Candidate, get_candidates, get_candidates_batch
from .candidate import Candidate, InMemoryCandidate
from .kb import KnowledgeBase
from .kb_in_memory import InMemoryLookupKB
__all__ = [
"Candidate",
"KnowledgeBase",
"InMemoryCandidate",
"InMemoryLookupKB",
"get_candidates",
"get_candidates_batch",
]

View File

@ -1,15 +1,17 @@
from libcpp.vector cimport vector
from ..typedefs cimport hash_t
from .kb cimport KnowledgeBase
from .kb_in_memory cimport InMemoryLookupKB
# Object used by the Entity Linker that summarizes one entity-alias candidate
# combination.
cdef class Candidate:
cdef readonly KnowledgeBase kb
cdef hash_t entity_hash
cdef float entity_freq
cdef vector[float] entity_vector
cdef hash_t alias_hash
cdef float prior_prob
pass
cdef class InMemoryCandidate(Candidate):
cdef readonly hash_t _entity_hash
cdef readonly hash_t _alias_hash
cdef vector[float] _entity_vector
cdef float _prior_prob
cdef readonly InMemoryLookupKB _kb
cdef float _entity_freq

View File

@ -1,90 +1,98 @@
# cython: infer_types=True
from typing import Iterable
from .kb_in_memory cimport InMemoryLookupKB
from .kb cimport KnowledgeBase
from ..tokens import Span
from ..errors import Errors
cdef class Candidate:
"""A `Candidate` object refers to a textual mention (`alias`) that may or
may not be resolved to a specific `entity` from a Knowledge Base. This
will be used as input for the entity linking algorithm which will
disambiguate the various candidates to the correct one.
Each candidate (alias, entity) pair is assigned a certain prior probability.
"""A `Candidate` object refers to a textual mention that may or may not be resolved
to a specific entity from a Knowledge Base. This will be used as input for the entity linking
algorithm which will disambiguate the various candidates to the correct one.
Each candidate, which represents a possible link between one textual mention and one entity in the knowledge base,
is assigned a certain prior probability.
DOCS: https://spacy.io/api/kb/#candidate-init
"""
def __init__(self):
# Make sure abstract Candidate is not instantiated.
if self.__class__ == Candidate:
raise TypeError(
Errors.E1046.format(cls_name=self.__class__.__name__)
)
@property
def entity_id(self) -> int:
"""RETURNS (int): Numerical representation of entity ID (if entity ID is numerical, this is just the entity ID,
otherwise the hash of the entity ID string)."""
raise NotImplementedError
@property
def entity_id_(self) -> str:
"""RETURNS (str): String representation of entity ID."""
raise NotImplementedError
@property
def entity_vector(self) -> vector[float]:
"""RETURNS (vector[float]): Entity vector."""
raise NotImplementedError
cdef class InMemoryCandidate(Candidate):
"""Candidate for InMemoryLookupKB."""
def __init__(
self,
KnowledgeBase kb,
entity_hash,
entity_freq,
entity_vector,
alias_hash,
prior_prob
kb: InMemoryLookupKB,
entity_hash: int,
alias_hash: int,
entity_vector: vector[float],
prior_prob: float,
entity_freq: float
):
self.kb = kb
self.entity_hash = entity_hash
self.entity_freq = entity_freq
self.entity_vector = entity_vector
self.alias_hash = alias_hash
self.prior_prob = prior_prob
"""
kb (InMemoryLookupKB]): InMemoryLookupKB instance.
entity_id (int): Entity ID as hash that can be looked up with InMemoryKB.vocab.strings.__getitem__().
entity_freq (int): Entity frequency in KB corpus.
entity_vector (List[float]): Entity embedding.
alias_hash (int): Alias hash.
prior_prob (float): Prior probability of entity for this alias. I. e. the probability that, independent of
the context, this alias - which matches one of this entity's aliases - resolves to one this entity.
"""
super().__init__()
self._entity_hash = entity_hash
self._entity_vector = entity_vector
self._prior_prob = prior_prob
self._kb = kb
self._alias_hash = alias_hash
self._entity_freq = entity_freq
@property
def entity(self) -> int:
"""RETURNS (uint64): hash of the entity's KB ID/name"""
return self.entity_hash
def entity_id(self) -> int:
return self._entity_hash
@property
def entity_(self) -> str:
"""RETURNS (str): ID/name of this entity in the KB"""
return self.kb.vocab.strings[self.entity_hash]
@property
def alias(self) -> int:
"""RETURNS (uint64): hash of the alias"""
return self.alias_hash
@property
def alias_(self) -> str:
"""RETURNS (str): ID of the original alias"""
return self.kb.vocab.strings[self.alias_hash]
@property
def entity_freq(self) -> float:
return self.entity_freq
@property
def entity_vector(self) -> Iterable[float]:
return self.entity_vector
def entity_vector(self) -> vector[float]:
return self._entity_vector
@property
def prior_prob(self) -> float:
return self.prior_prob
"""RETURNS (float): Prior probability that this alias, which matches one of this entity's synonyms, resolves to
this entity."""
return self._prior_prob
@property
def alias(self) -> str:
"""RETURNS (str): Alias."""
return self._kb.vocab.strings[self._alias_hash]
def get_candidates(kb: KnowledgeBase, mention: Span) -> Iterable[Candidate]:
"""
Return candidate entities for a given mention and fetching appropriate
entries from the index.
kb (KnowledgeBase): Knowledge base to query.
mention (Span): Entity mention for which to identify candidates.
RETURNS (Iterable[Candidate]): Identified candidates.
"""
return kb.get_candidates(mention)
@property
def entity_id_(self) -> str:
return self._kb.vocab.strings[self._entity_hash]
def get_candidates_batch(
kb: KnowledgeBase, mentions: Iterable[Span]
) -> Iterable[Iterable[Candidate]]:
"""
Return candidate entities for the given mentions and fetching appropriate entries
from the index.
kb (KnowledgeBase): Knowledge base to query.
mention (Iterable[Span]): Entity mentions for which to identify candidates.
RETURNS (Iterable[Iterable[Candidate]]): Identified candidates.
"""
return kb.get_candidates_batch(mentions)
@property
def entity_freq(self) -> float:
"""RETURNS (float): Entity frequency in KB corpus."""
return self._entity_freq

View File

@ -1,14 +1,14 @@
# cython: infer_types=True
from pathlib import Path
from typing import Iterable, Tuple, Union
from typing import Iterable, Iterator, Tuple, Union
from cymem.cymem cimport Pool
from ..errors import Errors
from ..tokens import Span
from ..tokens import SpanGroup
from ..util import SimpleFrozenList
from .candidate import Candidate
from .candidate cimport Candidate
cdef class KnowledgeBase:
@ -19,6 +19,8 @@ cdef class KnowledgeBase:
DOCS: https://spacy.io/api/kb
"""
CandidatesForMentionT = Iterable[Candidate]
CandidatesForDocT = Iterable[CandidatesForMentionT]
def __init__(self, vocab: Vocab, entity_vector_length: int):
"""Create a KnowledgeBase."""
@ -32,27 +34,15 @@ cdef class KnowledgeBase:
self.entity_vector_length = entity_vector_length
self.mem = Pool()
def get_candidates_batch(
self, mentions: Iterable[Span]
) -> Iterable[Iterable[Candidate]]:
def get_candidates(self, mentions: Iterator[SpanGroup]) -> Iterator[CandidatesForDocT]:
"""
Return candidate entities for specified texts. Each candidate defines
the entity, the original alias, and the prior probability of that
alias resolving to that entity.
If no candidate is found for a given text, an empty list is returned.
mentions (Iterable[Span]): Mentions for which to get candidates.
RETURNS (Iterable[Iterable[Candidate]]): Identified candidates.
"""
return [self.get_candidates(span) for span in mentions]
def get_candidates(self, mention: Span) -> Iterable[Candidate]:
"""
Return candidate entities for specified text. Each candidate defines
the entity, the original alias,
and the prior probability of that alias resolving to that entity.
If the no candidate is found for a given text, an empty list is returned.
mention (Span): Mention for which to get candidates.
RETURNS (Iterable[Candidate]): Identified candidates.
Return candidate entities for the specified groups of mentions (as SpanGroup) per Doc.
Each candidate for a mention defines at least the entity and the entity's embedding vector. Depending on the KB
implementation, further properties - such as the prior probability of the specified mention text resolving to
that entity - might be included.
If no candidates are found for a given mention, an empty list is returned.
mentions (Iterator[SpanGroup]): Mentions for which to get candidates.
RETURNS (Iterator[Iterable[Iterable[Candidate]]]): Identified candidates per mention/doc/doc batch.
"""
raise NotImplementedError(
Errors.E1045.format(
@ -128,3 +118,10 @@ cdef class KnowledgeBase:
parent="KnowledgeBase", method="from_disk", name=self.__name__
)
)
@property
def supports_prior_probs(self) -> bool:
"""RETURNS (bool): Whether this KB type supports looking up prior probabilities for entity mentions."""
raise NotImplementedError(
Errors.E1045.format(parent="KnowledgeBase", method="supports_prior_probs", name=self.__name__)
)

View File

@ -1,5 +1,5 @@
# cython: infer_types=True
from typing import Any, Callable, Dict, Iterable
from typing import Any, Callable, Dict, Iterable, Iterator
import srsly
@ -12,7 +12,7 @@ from preshed.maps cimport PreshMap
import warnings
from pathlib import Path
from ..tokens import Span
from ..tokens import SpanGroup
from ..typedefs cimport hash_t
@ -23,7 +23,7 @@ from ..util import SimpleFrozenList, ensure_path
from ..vocab cimport Vocab
from .kb cimport KnowledgeBase
from .candidate import Candidate as Candidate
from .candidate import InMemoryCandidate
cdef class InMemoryLookupKB(KnowledgeBase):
@ -255,10 +255,11 @@ cdef class InMemoryLookupKB(KnowledgeBase):
alias_entry.probs = probs
self._aliases_table[alias_index] = alias_entry
def get_candidates(self, mention: Span) -> Iterable[Candidate]:
return self.get_alias_candidates(mention.text) # type: ignore
def get_candidates(self, mentions: Iterator[SpanGroup]) -> Iterator[Iterable[Iterable[InMemoryCandidate]]]:
for mentions_for_doc in mentions:
yield [self._get_alias_candidates(span.text) for span in mentions_for_doc]
def get_alias_candidates(self, str alias) -> Iterable[Candidate]:
def _get_alias_candidates(self, str alias) -> Iterable[InMemoryCandidate]:
"""
Return candidate entities for an alias. Each candidate defines the
entity, the original alias, and the prior probability of that alias
@ -271,18 +272,18 @@ cdef class InMemoryLookupKB(KnowledgeBase):
alias_index = <int64_t>self._alias_index.get(alias_hash)
alias_entry = self._aliases_table[alias_index]
return [Candidate(kb=self,
return [
InMemoryCandidate(
kb=self,
entity_hash=self._entries[entry_index].entity_hash,
entity_freq=self._entries[entry_index].freq,
entity_vector=self._vectors_table[
self._entries[entry_index].vector_index
],
alias_hash=alias_hash,
prior_prob=prior_prob)
for (entry_index, prior_prob) in zip(
alias_entry.entry_indices, alias_entry.probs
entity_vector=self._vectors_table[self._entries[entry_index].vector_index],
prior_prob=prior_prob,
entity_freq=self._entries[entry_index].freq
)
if entry_index != 0]
for (entry_index, prior_prob) in zip(alias_entry.entry_indices, alias_entry.probs)
if entry_index != 0
]
def get_vector(self, str entity):
cdef hash_t entity_hash = self.vocab.strings[entity]
@ -316,6 +317,9 @@ cdef class InMemoryLookupKB(KnowledgeBase):
return 0.0
def supports_prior_probs(self) -> bool:
return True
def to_bytes(self, **kwargs):
"""Serialize the current state to a binary string.
"""

View File

@ -1,16 +0,0 @@
from ...language import BaseDefaults, Language
from .lex_attrs import LEX_ATTRS
from .stop_words import STOP_WORDS
class TibetanDefaults(BaseDefaults):
lex_attr_getters = LEX_ATTRS
stop_words = STOP_WORDS
class Tibetan(Language):
lang = "bo"
Defaults = TibetanDefaults
__all__ = ["Tibetan"]

View File

@ -1,16 +0,0 @@
"""
Example sentences to test spaCy and its language models.
>>> from spacy.lang.bo.examples import sentences
>>> docs = nlp.pipe(sentences)
"""
sentences = [
"དོན་དུ་རྒྱ་མཚོ་བླ་མ་ཞེས་བྱ་ཞིང༌།",
"ཏཱ་ལའི་ཞེས་པ་ནི་སོག་སྐད་ཡིན་པ་དེ་བོད་སྐད་དུ་རྒྱ་མཚོའི་དོན་དུ་འཇུག",
"སོག་པོ་ཨལ་ཐན་རྒྱལ་པོས་རྒྱལ་དབང་བསོད་ནམས་རྒྱ་མཚོར་ཆེ་བསྟོད་ཀྱི་མཚན་གསོལ་བ་ཞིག་ཡིན་ཞིང༌།",
"རྗེས་སུ་རྒྱལ་བ་དགེ་འདུན་གྲུབ་དང༌། དགེ་འདུན་རྒྱ་མཚོ་སོ་སོར་ཡང་ཏཱ་ལའི་བླ་མའི་སྐུ་ཕྲེང་དང་པོ་དང༌།",
"གཉིས་པའི་མཚན་དེ་གསོལ་ཞིང༌།༸རྒྱལ་དབང་སྐུ་ཕྲེང་ལྔ་པས་དགའ་ལྡན་ཕོ་བྲང་གི་སྲིད་དབང་བཙུགས་པ་ནས་ཏཱ་ལའི་བླ་མ་ནི་བོད་ཀྱི་ཆོས་སྲིད་གཉིས་ཀྱི་དབུ་ཁྲིད་དུ་གྱུར་ཞིང་།",
"ད་ལྟའི་བར་ཏཱ་ལའི་བླ་མ་སྐུ་ཕྲེང་བཅུ་བཞི་བྱོན་ཡོད།",
]

View File

@ -1,65 +0,0 @@
from ...attrs import LIKE_NUM
# reference 1: https://en.wikipedia.org/wiki/Tibetan_numerals
_num_words = [
"ཀླད་ཀོར་",
"གཅིག་",
"གཉིས་",
"གསུམ་",
"བཞི་",
"ལྔ་",
"དྲུག་",
"བདུན་",
"བརྒྱད་",
"དགུ་",
"བཅུ་",
"བཅུ་གཅིག་",
"བཅུ་གཉིས་",
"བཅུ་གསུམ་",
"བཅུ་བཞི་",
"བཅུ་ལྔ་",
"བཅུ་དྲུག་",
"བཅུ་བདུན་",
"བཅུ་པརྒྱད",
"བཅུ་དགུ་",
"ཉི་ཤུ་",
"སུམ་ཅུ",
"བཞི་བཅུ",
"ལྔ་བཅུ",
"དྲུག་ཅུ",
"བདུན་ཅུ",
"བརྒྱད་ཅུ",
"དགུ་བཅུ",
"བརྒྱ་",
"སྟོང་",
"ཁྲི་",
"ས་ཡ་",
" བྱེ་བ་",
"དུང་ཕྱུར་",
"ཐེར་འབུམ་",
"ཐེར་འབུམ་ཆེན་པོ་",
"ཁྲག་ཁྲིག་",
"ཁྲག་ཁྲིག་ཆེན་པོ་",
]
def like_num(text):
"""
Check if text resembles a number
"""
if text.startswith(("+", "-", "±", "~")):
text = text[1:]
text = text.replace(",", "").replace(".", "")
if text.isdigit():
return True
if text.count("/") == 1:
num, denom = text.split("/")
if num.isdigit() and denom.isdigit():
return True
if text in _num_words:
return True
return False
LEX_ATTRS = {LIKE_NUM: like_num}

View File

@ -1,198 +0,0 @@
# Source: https://zenodo.org/records/10148636
STOP_WORDS = set(
"""
གས
མས
འད
པས
གཞན
དང
གས
བཅས
ངས
ལས
ཙམ
ཡང
མཐའདག
འད
རང
ངམ
དག
འང
ལགས
ཚང
ཐམསཅད
དམ
འམ
བས
ལགས
གས
མས
བམ
ནམ
ནམ
ངམ
འགའ
ཤས
གམ
ལགས
ཅང
འགའ
སམ
འང
ལས
འཕ
བར
དང
འག
སམ
ཟད
འམ
མམ
དམ
དག
ལམ
ནང
ཙམ
རམ
ཨང
གས
ལགས
པས
རབ
རམ
བས
གཞན
འབའ
གམ
བམ
ཙམ
མམ
ཏམ
ཏམ
ཤས
""".split()
)

View File

@ -1,18 +0,0 @@
from typing import Optional
from ...language import BaseDefaults, Language
from .stop_words import STOP_WORDS
from .tokenizer_exceptions import TOKENIZER_EXCEPTIONS
class ScottishDefaults(BaseDefaults):
tokenizer_exceptions = TOKENIZER_EXCEPTIONS
stop_words = STOP_WORDS
class Scottish(Language):
lang = "gd"
Defaults = ScottishDefaults
__all__ = ["Scottish"]

View File

@ -1,388 +0,0 @@
STOP_WORDS = set(
"""
'ad
'ar
'd # iad
'g # ag
'ga
'gam
'gan
'gar
'gur
'm # am
'n # an
'n seo
'na
'nad
'nam
'nan
'nar
'nuair
'nur
's
'sa
'san
'sann
'se
'sna
a
a'
a'd # agad
a'm # agam
a-chèile
a-seo
a-sin
a-siud
a chionn
a chionn 's
a chèile
a chéile
a dh'
a h-uile
a seo
ac' # aca
aca
aca-san
acasan
ach
ag
agad
agad-sa
agads'
agadsa
agaibh
agaibhse
againn
againne
agam
agam-sa
agams'
agamsa
agus
aice
aice-se
aicese
aig
aig' # aige
aige
aige-san
aigesan
air
air-san
air neo
airsan
am
an
an seo
an sin
an siud
an uair
ann
ann a
ann a'
ann a shin
ann am
ann an
annad
annam
annam-s'
annamsa
anns
anns an
annta
aon
ar
as
asad
asda
asta
b'
bho
bhon
bhuaidhe # bhuaithe
bhuainn
bhuaipe
bhuaithe
bhuapa
bhur
brì
bu
c'à
car son
carson
cha
chan
chionn
choir
chon
chun
chèile
chéile
chòir
cia mheud
ciamar
co-dhiubh
cuide
cuin
cuin'
cuine
'
càil
càit
càit'
càite
mheud
d'
da
de
dh'
dha
dhaibh
dhaibh-san
dhaibhsan
dhan
dhasan
dhe
dhen
dheth
dhi
dhiom
dhiot
dhith
dhiubh
dhomh
dhomh-s'
dhomhsa
dhu'sa # dhut-sa
dhuibh
dhuibhse
dhuinn
dhuinne
dhuit
dhut
dhutsa
dhut-sa
dhà
dhà-san
dhàsan
dhòmhsa
diubh
do
docha
don
mar
mar
dòch'
dòcha
e
eadar
eatarra
eatorra
eile
esan
fa
far
feud
fhad
fheudar
fhearr
fhein
fheudar
fheàrr
fhèin
fhéin
fhìn
fo
fodha
fodhainn
foipe
fon
fèin
ga
gach
gam
gan
ge brith
ged
gu
gu
gu ruige
gun
gur
gus
i
iad
iadsan
innte
is
ise
le
leam
leam-sa
leamsa
leat
leat-sa
leatha
leatsa
leibh
leis
leis-san
leoth'
leotha
leotha-san
linn
m'
m'a
ma
mac
man
mar
mas
mathaid
mi
mis'
mise
mo
mu
mu 'n
mun
mur
mura
mus
na
na b'
na bu
na iad
nach
nad
nam
nan
nar
nas
neo
no
nuair
o
o'n
oir
oirbh
oirbh-se
oirnn
oirnne
oirre
on
orm
orm-sa
ormsa
orra
orra-san
orrasan
ort
os
r'
ri
ribh
rinn
ris
rithe
rithe-se
rium
rium-sa
riums'
riumsa
riut
riuth'
riutha
riuthasan
ro
ro'n
roimh
roimhe
romhainn
romham
romhpa
ron
ruibh
ruinn
ruinne
sa
san
sann
se
seach
seo
seothach
shin
sibh
sibh-se
sibhse
sin
sineach
sinn
sinne
siod
siodach
siud
siudach
sna # ann an
t'
tarsaing
tarsainn
tarsuinn
thar
thoigh
thro
thu
thuc'
thuca
thugad
thugaibh
thugainn
thugam
thugamsa
thuice
thuige
thus'
thusa
timcheall
toigh
toil
tro
tro' # troimh
troimh
troimhe
tron
tu
tusa
uair
ud
ugaibh
ugam-s'
ugam-sa
uice
uige
uige-san
umad
unnta # ann an
ur
urrainn
à
às
àsan
á
ás
è
ì
ò
ó
""".split(
"\n"
)
)

File diff suppressed because it is too large Load Diff

View File

@ -1,5 +1,5 @@
The list of Croatian lemmas was extracted from the reldi-tagger repository (https://github.com/clarinsi/reldi-tagger).
Reldi-tagger is licensed under the Apache 2.0 licence.
Reldi-tagger is licesned under the Apache 2.0 licence.
@InProceedings{ljubesic16-new,
author = {Nikola Ljubešić and Filip Klubička and Željko Agić and Ivo-Pavao Jazbec},

View File

@ -1,52 +0,0 @@
from typing import Callable, Optional
from thinc.api import Model
from ...language import BaseDefaults, Language
from .lemmatizer import HaitianCreoleLemmatizer
from .lex_attrs import LEX_ATTRS
from .punctuation import TOKENIZER_PREFIXES, TOKENIZER_INFIXES, TOKENIZER_SUFFIXES
from .stop_words import STOP_WORDS
from .syntax_iterators import SYNTAX_ITERATORS
from .tokenizer_exceptions import TOKENIZER_EXCEPTIONS
from .tag_map import TAG_MAP
class HaitianCreoleDefaults(BaseDefaults):
tokenizer_exceptions = TOKENIZER_EXCEPTIONS
prefixes = TOKENIZER_PREFIXES
infixes = TOKENIZER_INFIXES
suffixes = TOKENIZER_SUFFIXES
lex_attr_getters = LEX_ATTRS
syntax_iterators = SYNTAX_ITERATORS
stop_words = STOP_WORDS
tag_map = TAG_MAP
class HaitianCreole(Language):
lang = "ht"
Defaults = HaitianCreoleDefaults
@HaitianCreole.factory(
"lemmatizer",
assigns=["token.lemma"],
default_config={
"model": None,
"mode": "rule",
"overwrite": False,
"scorer": {"@scorers": "spacy.lemmatizer_scorer.v1"},
},
default_score_weights={"lemma_acc": 1.0},
)
def make_lemmatizer(
nlp: Language,
model: Optional[Model],
name: str,
mode: str,
overwrite: bool,
scorer: Optional[Callable],
):
return HaitianCreoleLemmatizer(
nlp.vocab, model, name, mode=mode, overwrite=overwrite, scorer=scorer
)
__all__ = ["HaitianCreole"]

View File

@ -1,18 +0,0 @@
"""
Example sentences to test spaCy and its language models.
>>> from spacy.lang.ht.examples import sentences
>>> docs = nlp.pipe(sentences)
"""
sentences = [
"Apple ap panse achte yon demaraj nan Wayòm Ini pou $1 milya dola",
"Machin otonòm fè responsablite asirans lan ale sou men fabrikan yo",
"San Francisco ap konsidere entèdi robo ki livre sou twotwa yo",
"Lond se yon gwo vil nan Wayòm Ini",
"Kote ou ye?",
"Kilès ki prezidan Lafrans?",
"Ki kapital Etazini?",
"Kile Barack Obama te fèt?",
]

View File

@ -1,51 +0,0 @@
from typing import List, Tuple
from ...pipeline import Lemmatizer
from ...tokens import Token
from ...lookups import Lookups
class HaitianCreoleLemmatizer(Lemmatizer):
"""
Minimal Haitian Creole lemmatizer.
Returns a word's base form based on rules and lookup,
or defaults to the original form.
"""
def is_base_form(self, token: Token) -> bool:
morph = token.morph.to_dict()
upos = token.pos_.lower()
# Consider unmarked forms to be base
if upos in {"noun", "verb", "adj", "adv"}:
if not morph:
return True
if upos == "noun" and morph.get("Number") == "Sing":
return True
if upos == "verb" and morph.get("VerbForm") == "Inf":
return True
if upos == "adj" and morph.get("Degree") == "Pos":
return True
return False
def rule_lemmatize(self, token: Token) -> List[str]:
string = token.text.lower()
pos = token.pos_.lower()
cache_key = (token.orth, token.pos)
if cache_key in self.cache:
return self.cache[cache_key]
forms = []
# fallback rule: just return lowercased form
forms.append(string)
self.cache[cache_key] = forms
return forms
@classmethod
def get_lookups_config(cls, mode: str) -> Tuple[List[str], List[str]]:
if mode == "rule":
required = ["lemma_lookup", "lemma_rules", "lemma_exc", "lemma_index"]
return (required, [])
return super().get_lookups_config(mode)

View File

@ -1,78 +0,0 @@
from ...attrs import LIKE_NUM, NORM
# Cardinal numbers in Creole
_num_words = set(
"""
zewo youn en de twa kat senk sis sèt uit nèf dis
onz douz trèz katoz kenz sèz disèt dizwit diznèf
vent trant karant sinkant swasant swasann-dis
san mil milyon milya
""".split()
)
# Ordinal numbers in Creole (some are French-influenced, some simplified)
_ordinal_words = set(
"""
premye dezyèm twazyèm katryèm senkyèm sizyèm sètvyèm uitvyèm nèvyèm dizyèm
onzèm douzyèm trèzyèm katozyèm kenzèm sèzyèm disetyèm dizwityèm diznèvyèm
ventyèm trantyèm karantyèm sinkantyèm swasantyèm
swasann-disyèm santyèm milyèm milyonnyèm milyadyèm
""".split()
)
NORM_MAP = {
"'m": "mwen",
"'w": "ou",
"'l": "li",
"'n": "nou",
"'y": "yo",
"m": "mwen",
"w": "ou",
"l": "li",
"n": "nou",
"y": "yo",
"m": "mwen",
"n": "nou",
"l": "li",
"y": "yo",
"w": "ou",
"t": "te",
"k": "ki",
"p": "pa",
"M": "Mwen",
"N": "Nou",
"L": "Li",
"Y": "Yo",
"W": "Ou",
"T": "Te",
"K": "Ki",
"P": "Pa",
}
def like_num(text):
text = text.strip().lower()
if text.startswith(("+", "-", "±", "~")):
text = text[1:]
text = text.replace(",", "").replace(".", "")
if text.isdigit():
return True
if text.count("/") == 1:
num, denom = text.split("/")
if num.isdigit() and denom.isdigit():
return True
if text in _num_words:
return True
if text in _ordinal_words:
return True
# Handle things like "3yèm", "10yèm", "25yèm", etc.
if text.endswith("yèm") and text[:-3].isdigit():
return True
return False
def norm_custom(text):
return NORM_MAP.get(text, text.lower())
LEX_ATTRS = {
LIKE_NUM: like_num,
NORM: norm_custom,
}

View File

@ -1,43 +0,0 @@
from ..char_classes import (
ALPHA,
ALPHA_LOWER,
ALPHA_UPPER,
CONCAT_QUOTES,
HYPHENS,
LIST_PUNCT,
LIST_QUOTES,
LIST_ELLIPSES,
LIST_ICONS,
merge_chars,
)
ELISION = "'".replace(" ", "")
_prefixes_elision = "m n l y t k w"
_prefixes_elision += " " + _prefixes_elision.upper()
TOKENIZER_PREFIXES = LIST_PUNCT + LIST_QUOTES + [
r"(?:({pe})[{el}])(?=[{a}])".format(
a=ALPHA, el=ELISION, pe=merge_chars(_prefixes_elision)
)
]
TOKENIZER_SUFFIXES = LIST_PUNCT + LIST_QUOTES + LIST_ELLIPSES + [
r"(?<=[0-9])%", # numbers like 10%
r"(?<=[0-9])(?:{h})".format(h=HYPHENS), # hyphens after numbers
r"(?<=[{a}])[']".format(a=ALPHA), # apostrophes after letters
r"(?<=[{a}])['][mwlnytk](?=\s|$)".format(a=ALPHA), # contractions
r"(?<=[{a}0-9])\)", # right parenthesis after letter/number
r"(?<=[{a}])\.(?=\s|$)".format(a=ALPHA), # period after letter if space or end of string
r"(?<=\))[\.\?!]", # punctuation immediately after right parenthesis
]
TOKENIZER_INFIXES = LIST_ELLIPSES + LIST_ICONS + [
r"(?<=[0-9])[+\-\*^](?=[0-9-])",
r"(?<=[{al}{q}])\.(?=[{au}{q}])".format(
al=ALPHA_LOWER, au=ALPHA_UPPER, q=CONCAT_QUOTES
),
r"(?<=[{a}]),(?=[{a}])".format(a=ALPHA),
r"(?<=[{a}0-9])(?:{h})(?=[{a}])".format(a=ALPHA, h=HYPHENS),
r"(?<=[{a}][{el}])(?=[{a}])".format(a=ALPHA, el=ELISION),
]

View File

@ -1,50 +0,0 @@
STOP_WORDS = set(
"""
a ak an ankò ant apre ap atò avan avanlè
byen byenke
chak
de depi deja deja
e en epi èske
fòk
gen genyen
ki kisa kilès kote koukou konsa konbyen konn konnen kounye kouman
la l laa le li lye
m m' mwen
nan nap nou n'
ou oumenm
pa paske pami pandan pito pou pral preske pwiske
se selman si sou sòt
ta tap tankou te toujou tou tan tout toutotan twòp tèl
w w' wi wè
y y' yo yon yonn
non o oh eh
sa san si swa si
men mèsi oswa osinon
"""
.split()
)
# Add common contractions, with and without apostrophe variants
contractions = ["m'", "n'", "w'", "y'", "l'", "t'", "k'"]
for apostrophe in ["'", "", ""]:
for word in contractions:
STOP_WORDS.add(word.replace("'", apostrophe))

View File

@ -1,74 +0,0 @@
from typing import Iterator, Tuple, Union
from ...errors import Errors
from ...symbols import NOUN, PRON, PROPN
from ...tokens import Doc, Span
def noun_chunks(doclike: Union[Doc, Span]) -> Iterator[Tuple[int, int, int]]:
"""
Detect base noun phrases from a dependency parse for Haitian Creole.
Works on both Doc and Span objects.
"""
# Core nominal dependencies common in Haitian Creole
labels = [
"nsubj",
"obj",
"obl",
"nmod",
"appos",
"ROOT",
]
# Modifiers to optionally include in chunk (to the right)
post_modifiers = ["compound", "flat", "flat:name", "fixed"]
doc = doclike.doc
if not doc.has_annotation("DEP"):
raise ValueError(Errors.E029)
np_deps = {doc.vocab.strings.add(label) for label in labels}
np_mods = {doc.vocab.strings.add(mod) for mod in post_modifiers}
conj_label = doc.vocab.strings.add("conj")
np_label = doc.vocab.strings.add("NP")
adp_pos = doc.vocab.strings.add("ADP")
cc_pos = doc.vocab.strings.add("CCONJ")
prev_end = -1
for i, word in enumerate(doclike):
if word.pos not in (NOUN, PROPN, PRON):
continue
if word.left_edge.i <= prev_end:
continue
if word.dep in np_deps:
right_end = word
# expand to include known modifiers to the right
for child in word.rights:
if child.dep in np_mods:
right_end = child.right_edge
elif child.pos == NOUN:
right_end = child.right_edge
left_index = word.left_edge.i
# Skip prepositions at the start
if word.left_edge.pos == adp_pos:
left_index += 1
prev_end = right_end.i
yield left_index, right_end.i + 1, np_label
elif word.dep == conj_label:
head = word.head
while head.dep == conj_label and head.head.i < head.i:
head = head.head
if head.dep in np_deps:
left_index = word.left_edge.i
if word.left_edge.pos == cc_pos:
left_index += 1
prev_end = word.i
yield left_index, word.i + 1, np_label
SYNTAX_ITERATORS = {"noun_chunks": noun_chunks}

View File

@ -1,21 +0,0 @@
from spacy.symbols import NOUN, VERB, AUX, ADJ, ADV, PRON, DET, ADP, SCONJ, CCONJ, PART, INTJ, NUM, PROPN, PUNCT, SYM, X
TAG_MAP = {
"NOUN": {"pos": NOUN},
"VERB": {"pos": VERB},
"AUX": {"pos": AUX},
"ADJ": {"pos": ADJ},
"ADV": {"pos": ADV},
"PRON": {"pos": PRON},
"DET": {"pos": DET},
"ADP": {"pos": ADP},
"SCONJ": {"pos": SCONJ},
"CCONJ": {"pos": CCONJ},
"PART": {"pos": PART},
"INTJ": {"pos": INTJ},
"NUM": {"pos": NUM},
"PROPN": {"pos": PROPN},
"PUNCT": {"pos": PUNCT},
"SYM": {"pos": SYM},
"X": {"pos": X},
}

View File

@ -1,121 +0,0 @@
from spacy.symbols import ORTH, NORM
def make_variants(base, first_norm, second_orth, second_norm):
return {
base: [
{ORTH: base.split("'")[0] + "'", NORM: first_norm},
{ORTH: second_orth, NORM: second_norm},
],
base.capitalize(): [
{ORTH: base.split("'")[0].capitalize() + "'", NORM: first_norm.capitalize()},
{ORTH: second_orth, NORM: second_norm},
]
}
TOKENIZER_EXCEPTIONS = {
"Dr.": [{ORTH: "Dr."}]
}
# Apostrophe forms
TOKENIZER_EXCEPTIONS.update(make_variants("m'ap", "mwen", "ap", "ap"))
TOKENIZER_EXCEPTIONS.update(make_variants("n'ap", "nou", "ap", "ap"))
TOKENIZER_EXCEPTIONS.update(make_variants("l'ap", "li", "ap", "ap"))
TOKENIZER_EXCEPTIONS.update(make_variants("y'ap", "yo", "ap", "ap"))
TOKENIZER_EXCEPTIONS.update(make_variants("m'te", "mwen", "te", "te"))
TOKENIZER_EXCEPTIONS.update(make_variants("m'pral", "mwen", "pral", "pral"))
TOKENIZER_EXCEPTIONS.update(make_variants("w'ap", "ou", "ap", "ap"))
TOKENIZER_EXCEPTIONS.update(make_variants("k'ap", "ki", "ap", "ap"))
TOKENIZER_EXCEPTIONS.update(make_variants("p'ap", "pa", "ap", "ap"))
TOKENIZER_EXCEPTIONS.update(make_variants("t'ap", "te", "ap", "ap"))
# Non-apostrophe contractions (with capitalized variants)
TOKENIZER_EXCEPTIONS.update({
"map": [
{ORTH: "m", NORM: "mwen"},
{ORTH: "ap", NORM: "ap"},
],
"Map": [
{ORTH: "M", NORM: "Mwen"},
{ORTH: "ap", NORM: "ap"},
],
"lem": [
{ORTH: "le", NORM: "le"},
{ORTH: "m", NORM: "mwen"},
],
"Lem": [
{ORTH: "Le", NORM: "Le"},
{ORTH: "m", NORM: "mwen"},
],
"lew": [
{ORTH: "le", NORM: "le"},
{ORTH: "w", NORM: "ou"},
],
"Lew": [
{ORTH: "Le", NORM: "Le"},
{ORTH: "w", NORM: "ou"},
],
"nap": [
{ORTH: "n", NORM: "nou"},
{ORTH: "ap", NORM: "ap"},
],
"Nap": [
{ORTH: "N", NORM: "Nou"},
{ORTH: "ap", NORM: "ap"},
],
"lap": [
{ORTH: "l", NORM: "li"},
{ORTH: "ap", NORM: "ap"},
],
"Lap": [
{ORTH: "L", NORM: "Li"},
{ORTH: "ap", NORM: "ap"},
],
"yap": [
{ORTH: "y", NORM: "yo"},
{ORTH: "ap", NORM: "ap"},
],
"Yap": [
{ORTH: "Y", NORM: "Yo"},
{ORTH: "ap", NORM: "ap"},
],
"mte": [
{ORTH: "m", NORM: "mwen"},
{ORTH: "te", NORM: "te"},
],
"Mte": [
{ORTH: "M", NORM: "Mwen"},
{ORTH: "te", NORM: "te"},
],
"mpral": [
{ORTH: "m", NORM: "mwen"},
{ORTH: "pral", NORM: "pral"},
],
"Mpral": [
{ORTH: "M", NORM: "Mwen"},
{ORTH: "pral", NORM: "pral"},
],
"wap": [
{ORTH: "w", NORM: "ou"},
{ORTH: "ap", NORM: "ap"},
],
"Wap": [
{ORTH: "W", NORM: "Ou"},
{ORTH: "ap", NORM: "ap"},
],
"kap": [
{ORTH: "k", NORM: "ki"},
{ORTH: "ap", NORM: "ap"},
],
"Kap": [
{ORTH: "K", NORM: "Ki"},
{ORTH: "ap", NORM: "ap"},
],
"tap": [
{ORTH: "t", NORM: "te"},
{ORTH: "ap", NORM: "ap"},
],
"Tap": [
{ORTH: "T", NORM: "Te"},
{ORTH: "ap", NORM: "ap"},
],
})

View File

@ -7,7 +7,7 @@ class IcelandicDefaults(BaseDefaults):
class Icelandic(Language):
lang = "is"
lang = "isl"
Defaults = IcelandicDefaults

View File

@ -32,6 +32,7 @@ split_mode = null
"""
@registry.tokenizers("spacy.ja.JapaneseTokenizer")
def create_tokenizer(split_mode: Optional[str] = None):
def japanese_tokenizer_factory(nlp):
return JapaneseTokenizer(nlp.vocab, split_mode=split_mode)

View File

@ -1,16 +0,0 @@
from ...language import BaseDefaults, Language
from .lex_attrs import LEX_ATTRS
from .stop_words import STOP_WORDS
class KurmanjiDefaults(BaseDefaults):
stop_words = STOP_WORDS
lex_attr_getters = LEX_ATTRS
class Kurmanji(Language):
lang = "kmr"
Defaults = KurmanjiDefaults
__all__ = ["Kurmanji"]

View File

@ -1,17 +0,0 @@
"""
Example sentences to test spaCy and its language models.
>>> from spacy.lang.kmr.examples import sentences
>>> docs = nlp.pipe(sentences)
"""
sentences = [
"Berê mirovan her tim li geşedana pêşerojê ye", # People's gaze is always on the development of the future
"Kawa Nemir di 14 salan de Ulysses wergerand Kurmancî.", # Kawa Nemir translated Ulysses into Kurmanji in 14 years.
"Mem Ararat hunermendekî Kurd yê bi nav û deng e.", # Mem Ararat is a famous Kurdish artist
"Firat Cewerî 40 sal e pirtûkên Kurdî dinivîsîne.", # Firat Ceweri has been writing Kurdish books for 40 years
"Rojnamegerê ciwan nûçeyeke balkêş li ser rewşa aborî nivîsand", # The young journalist wrote an interesting news article about the economic situation
"Sektora çandiniyê beşeke giring a belavkirina gaza serayê li seranserê cîhanê pêk tîne", # The agricultural sector constitutes an important part of greenhouse gas emissions worldwide
"Xwendekarên jêhatî di pêşbaziya matematîkê de serkeftî bûn", # Talented students succeeded in the mathematics competition
"Ji ber ji tunebûnê bavê min xwişkeke min nedan xwendin ew ji min re bû derd û kulek.", # Because of poverty, my father didn't send my sister to school, which became a pain and sorrow for me
]

View File

@ -1,138 +0,0 @@
from ...attrs import LIKE_NUM
_num_words = [
"sifir",
"yek",
"du",
"",
"çar",
"pênc",
"şeş",
"heft",
"heşt",
"neh",
"deh",
"yazde",
"dazde",
"sêzde",
"çarde",
"pazde",
"şazde",
"hevde",
"hejde",
"nozde",
"bîst",
"",
"çil",
"pêncî",
"şêst",
"heftê",
"heştê",
"nod",
"sed",
"hezar",
"milyon",
"milyar",
]
_ordinal_words = [
"yekem",
"yekemîn",
"duyem",
"duyemîn",
"sêyem",
"sêyemîn",
"çarem",
"çaremîn",
"pêncem",
"pêncemîn",
"şeşem",
"şeşemîn",
"heftem",
"heftemîn",
"heştem",
"heştemîn",
"nehem",
"nehemîn",
"dehem",
"dehemîn",
"yazdehem",
"yazdehemîn",
"dazdehem",
"dazdehemîn",
"sêzdehem",
"sêzdehemîn",
"çardehem",
"çardehemîn",
"pazdehem",
"pazdehemîn",
"şanzdehem",
"şanzdehemîn",
"hevdehem",
"hevdehemîn",
"hejdehem",
"hejdehemîn",
"nozdehem",
"nozdehemîn",
"bîstem",
"bîstemîn",
"sîyem",
"sîyemîn",
"çilem",
"çilemîn",
"pêncîyem",
"pênciyemîn",
"şêstem",
"şêstemîn",
"heftêyem",
"heftêyemîn",
"heştêyem",
"heştêyemîn",
"notem",
"notemîn",
"sedem",
"sedemîn",
"hezarem",
"hezaremîn",
"milyonem",
"milyonemîn",
"milyarem",
"milyaremîn",
]
def like_num(text):
if text.startswith(("+", "-", "±", "~")):
text = text[1:]
text = text.replace(",", "").replace(".", "")
if text.isdigit():
return True
if text.count("/") == 1:
num, denom = text.split("/")
if num.isdigit() and denom.isdigit():
return True
text_lower = text.lower()
if text_lower in _num_words:
return True
# Check ordinal number
if text_lower in _ordinal_words:
return True
if is_digit(text_lower):
return True
return False
def is_digit(text):
endings = ("em", "yem", "emîn", "yemîn")
for ending in endings:
to = len(ending)
if text.endswith(ending) and text[:-to].isdigit():
return True
return False
LEX_ATTRS = {LIKE_NUM: like_num}

View File

@ -1,44 +0,0 @@
STOP_WORDS = set(
"""
û
li
bi
di
da
de
ji
ku
ew
ez
tu
em
hûn
ew
ev
min
te
me
we
wan
va
çi
çawa
çima
kengî
li ku
çend
çiqas
her
hin
gelek
hemû
kes
tişt
""".split()
)

View File

@ -17,20 +17,100 @@ DEFAULT_CONFIG = """
[nlp.tokenizer]
@tokenizers = "spacy.ko.KoreanTokenizer"
mecab_args = ""
"""
def create_tokenizer():
@registry.tokenizers("spacy.ko.KoreanTokenizer")
def create_tokenizer(mecab_args: str):
def korean_tokenizer_factory(nlp):
return KoreanTokenizer(nlp.vocab)
return KoreanTokenizer(nlp.vocab, mecab_args=mecab_args)
return korean_tokenizer_factory
class KoreanTokenizer(DummyTokenizer):
def __init__(self, vocab: Vocab, *, mecab_args: str = ""):
self.vocab = vocab
mecab = try_mecab_import()
self.mecab_tokenizer = mecab.Tagger(mecab_args)
def __reduce__(self):
return KoreanTokenizer, (self.vocab,)
def __call__(self, text: str) -> Doc:
dtokens = list(self.detailed_tokens(text))
surfaces = [dt["surface"] for dt in dtokens]
doc = Doc(self.vocab, words=surfaces, spaces=list(check_spaces(text, surfaces)))
for token, dtoken in zip(doc, dtokens):
first_tag, sep, eomi_tags = dtoken["tag"].partition("+")
token.tag_ = first_tag # stem(어간) or pre-final(선어말 어미)
if token.tag_ in TAG_MAP:
token.pos = TAG_MAP[token.tag_][POS]
else:
token.pos = X
token.lemma_ = dtoken["lemma"]
doc.user_data["full_tags"] = [dt["tag"] for dt in dtokens]
return doc
def detailed_tokens(self, text: str) -> Iterator[Dict[str, Any]]:
# 품사 태그(POS)[0], 의미 부류(semantic class)[1], 종성 유무(jongseong)[2], 읽기(reading)[3],
# 타입(type)[4], 첫번째 품사(start pos)[5], 마지막 품사(end pos)[6], 표현(expression)[7], *
for line in self.mecab_tokenizer.parse(text).split("\n"):
if line == "EOS":
break
surface, _, expr = line.partition("\t")
features = expr.split("/")[0].split(",")
tag = features[0]
lemma = "*"
if len(features) >= 8:
lemma = features[7]
if lemma == "*":
lemma = surface
yield {"surface": surface, "lemma": lemma, "tag": tag}
def score(self, examples):
validate_examples(examples, "KoreanTokenizer.score")
return Scorer.score_tokenization(examples)
class KoreanDefaults(BaseDefaults):
config = load_config_from_str(DEFAULT_CONFIG)
lex_attr_getters = LEX_ATTRS
stop_words = STOP_WORDS
writing_system = {"direction": "ltr", "has_case": False, "has_letters": False}
infixes = TOKENIZER_INFIXES
class Korean(Language):
lang = "ko"
Defaults = KoreanDefaults
def try_mecab_import():
try:
import mecab_ko as MeCab
return MeCab
except ImportError:
raise ImportError(
'The Korean tokenizer ("spacy.ko.KoreanTokenizer") requires '
"the python package `mecab-ko`: pip install mecab-ko"
) from None
@registry.tokenizers("spacy.KoreanNattoTokenizer.v1")
def create_natto_tokenizer():
def korean_natto_tokenizer_factory(nlp):
return KoreanNattoTokenizer(nlp.vocab)
return korean_natto_tokenizer_factory
class KoreanNattoTokenizer(DummyTokenizer):
def __init__(self, vocab: Vocab):
self.vocab = vocab
self._mecab = try_mecab_import() # type: ignore[func-returns-value]
self._mecab = self._try_mecab_import() # type: ignore[func-returns-value]
self._mecab_tokenizer = None
@property
@ -46,7 +126,7 @@ class KoreanTokenizer(DummyTokenizer):
return self._mecab_tokenizer
def __reduce__(self):
return KoreanTokenizer, (self.vocab,)
return KoreanNattoTokenizer, (self.vocab,)
def __call__(self, text: str) -> Doc:
dtokens = list(self.detailed_tokens(text))
@ -73,7 +153,7 @@ class KoreanTokenizer(DummyTokenizer):
feature = node.feature
tag, _, expr = feature.partition(",")
lemma, _, remainder = expr.partition("/")
if lemma == "*":
if lemma == "*" or lemma == "":
lemma = surface
yield {"surface": surface, "lemma": lemma, "tag": tag}
@ -81,28 +161,14 @@ class KoreanTokenizer(DummyTokenizer):
validate_examples(examples, "KoreanTokenizer.score")
return Scorer.score_tokenization(examples)
class KoreanDefaults(BaseDefaults):
config = load_config_from_str(DEFAULT_CONFIG)
lex_attr_getters = LEX_ATTRS
stop_words = STOP_WORDS
writing_system = {"direction": "ltr", "has_case": False, "has_letters": False}
infixes = TOKENIZER_INFIXES
class Korean(Language):
lang = "ko"
Defaults = KoreanDefaults
def try_mecab_import() -> None:
def _try_mecab_import(self):
try:
from natto import MeCab
return MeCab
except ImportError:
raise ImportError(
'The Korean tokenizer ("spacy.ko.KoreanTokenizer") requires '
'The Korean Natto tokenizer ("spacy.ko.KoreanNattoTokenizer") requires '
"[mecab-ko](https://bitbucket.org/eunjeon/mecab-ko/src/master/README.md), "
"[mecab-ko-dic](https://bitbucket.org/eunjeon/mecab-ko-dic), "
"and [natto-py](https://github.com/buruzaemon/natto-py)"

View File

@ -24,6 +24,12 @@ class MacedonianDefaults(BaseDefaults):
tokenizer_exceptions = update_exc(BASE_EXCEPTIONS, TOKENIZER_EXCEPTIONS)
stop_words = STOP_WORDS
@classmethod
def create_lemmatizer(cls, nlp=None, lookups=None):
if lookups is None:
lookups = Lookups()
return MacedonianLemmatizer(lookups)
class Macedonian(Language):
lang = "mk"

View File

@ -3,10 +3,10 @@ from ...language import Language
class MultiLanguage(Language):
"""Language class to be used for models that support multiple languages.
This module allows models to specify their language ID as 'xx'.
This module allows models to specify their language ID as 'mul'.
"""
lang = "xx"
lang = "mul"
__all__ = ["MultiLanguage"]

View File

@ -13,6 +13,7 @@ DEFAULT_CONFIG = """
"""
@registry.tokenizers("spacy.th.ThaiTokenizer")
def create_thai_tokenizer():
def thai_tokenizer_factory(nlp):
return ThaiTokenizer(nlp.vocab)

View File

@ -16,10 +16,6 @@ URL_PATTERN = (
r"(?:\S+(?::\S*)?@)?"
r"(?:"
# IP address exclusion
# private & local networks
r"(?!(?:10|127)(?:\.\d{1,3}){3})"
r"(?!(?:169\.254|192\.168)(?:\.\d{1,3}){2})"
r"(?!172\.(?:1[6-9]|2\d|3[0-1])(?:\.\d{1,3}){2})"
# IP address dotted notation octets
# excludes loopback network 0.0.0.0
# excludes reserved space >= 224.0.0.0

View File

@ -22,6 +22,7 @@ use_pyvi = true
"""
@registry.tokenizers("spacy.vi.VietnameseTokenizer")
def create_vietnamese_tokenizer(use_pyvi: bool = True):
def vietnamese_tokenizer_factory(nlp):
return VietnameseTokenizer(nlp.vocab, use_pyvi=use_pyvi)

View File

@ -31,7 +31,7 @@ segmenter = "char"
[initialize]
[initialize.tokenizer]
pkuseg_model = null
pkuseg_model = "spacy_ontonotes"
pkuseg_user_dict = "default"
"""
@ -46,6 +46,7 @@ class Segmenter(str, Enum):
return list(cls.__members__.keys())
@registry.tokenizers("spacy.zh.ChineseTokenizer")
def create_chinese_tokenizer(segmenter: Segmenter = Segmenter.char):
def chinese_tokenizer_factory(nlp):
return ChineseTokenizer(nlp.vocab, segmenter=segmenter)

View File

@ -5,7 +5,7 @@ import multiprocessing as mp
import random
import traceback
import warnings
from contextlib import ExitStack, contextmanager
from contextlib import contextmanager
from copy import deepcopy
from dataclasses import dataclass
from itertools import chain, cycle
@ -18,6 +18,7 @@ from typing import (
Iterable,
Iterator,
List,
Literal,
NoReturn,
Optional,
Pattern,
@ -30,14 +31,10 @@ from typing import (
overload,
)
import numpy
import srsly
from cymem.cymem import Pool
from thinc.api import Config, CupyOps, Optimizer, get_current_ops
from thinc.util import convert_recursive
from . import about, ty, util
from .compat import Literal
from .errors import Errors, Warnings
from .git_info import GIT_VERSION
from .lang.punctuation import TOKENIZER_INFIXES, TOKENIZER_PREFIXES, TOKENIZER_SUFFIXES
@ -55,7 +52,7 @@ from .scorer import Scorer
from .tokenizer import Tokenizer
from .tokens import Doc
from .tokens.underscore import Underscore
from .training import Example, validate_examples
from .training import Example, validate_distillation_examples, validate_examples
from .training.initialize import init_tok2vec, init_vocab
from .util import (
_DEFAULT_EMPTY_PIPES,
@ -77,6 +74,9 @@ PipeCallable = Callable[[Doc], Doc]
# This is the base config will all settings (training etc.)
DEFAULT_CONFIG_PATH = Path(__file__).parent / "default_config.cfg"
DEFAULT_CONFIG = util.load_config(DEFAULT_CONFIG_PATH)
# This is the base config for the [distillation] block and currently not included
# in the main config and only added via the 'init fill-config' command
DEFAULT_CONFIG_DISTILL_PATH = Path(__file__).parent / "default_config_distillation.cfg"
# This is the base config for the [pretraining] block and currently not included
# in the main config and only added via the 'init fill-config' command
DEFAULT_CONFIG_PRETRAIN_PATH = Path(__file__).parent / "default_config_pretraining.cfg"
@ -104,6 +104,7 @@ class BaseDefaults:
writing_system = {"direction": "ltr", "has_case": True, "has_letters": True}
@registry.tokenizers("spacy.Tokenizer.v1")
def create_tokenizer() -> Callable[["Language"], Tokenizer]:
"""Registered function to create a tokenizer. Returns a factory that takes
the nlp object and returns a Tokenizer instance using the language detaults.
@ -129,19 +130,13 @@ def create_tokenizer() -> Callable[["Language"], Tokenizer]:
return tokenizer_factory
def load_lookups_data(lang, tables):
util.logger.debug("Loading lookups from spacy-lookups-data: %s", tables)
lookups = load_lookups(lang=lang, tables=tables)
return lookups
class Language:
"""A text-processing pipeline. Usually you'll load this once per process,
and pass the instance around your application.
Defaults (class): Settings, data and factory methods for creating the `nlp`
object and processing pipeline.
lang (str): Two-letter ISO 639-1 or three-letter ISO 639-3 language codes, such as 'en' and 'eng'.
lang (str): IETF language code, such as 'en'.
DOCS: https://spacy.io/api/language
"""
@ -183,9 +178,6 @@ class Language:
DOCS: https://spacy.io/api/language#init
"""
from .pipeline.factories import register_factories
register_factories()
# We're only calling this to import all factories provided via entry
# points. The factory decorator applied to these functions takes care
# of the rest.
@ -202,8 +194,7 @@ class Language:
if not isinstance(vocab, Vocab) and vocab is not True:
raise ValueError(Errors.E918.format(vocab=vocab, vocab_type=type(Vocab)))
if vocab is True:
vectors_name = meta.get("vectors", {}).get("name")
vocab = create_vocab(self.lang, self.Defaults, vectors_name=vectors_name)
vocab = create_vocab(self.lang, self.Defaults)
if not create_vectors:
vectors_cfg = {"vectors": self._config["nlp"]["vectors"]}
create_vectors = registry.resolve(vectors_cfg)["vectors"]
@ -261,7 +252,6 @@ class Language:
"width": self.vocab.vectors_length,
"vectors": len(self.vocab.vectors),
"keys": self.vocab.vectors.n_keys,
"name": self.vocab.vectors.name,
"mode": self.vocab.vectors.mode,
}
self._meta["labels"] = dict(self.pipe_labels)
@ -772,8 +762,8 @@ class Language:
*,
before: Optional[Union[str, int]] = None,
after: Optional[Union[str, int]] = None,
first: Optional[bool] = None,
last: Optional[bool] = None,
first: Optional[Literal[True]] = None,
last: Optional[Literal[True]] = None,
source: Optional["Language"] = None,
config: Dict[str, Any] = SimpleFrozenDict(),
raw_config: Optional[Config] = None,
@ -792,8 +782,8 @@ class Language:
component directly before.
after (Union[str, int]): Name or index of the component to insert new
component directly after.
first (bool): If True, insert component first in the pipeline.
last (bool): If True, insert component last in the pipeline.
first (Optional[Literal[True]]): If True, insert component first in the pipeline.
last (Optional[Literal[True]]): If True, insert component last in the pipeline.
source (Language): Optional loaded nlp object to copy the pipeline
component from.
config (Dict[str, Any]): Config parameters to use for this component.
@ -839,18 +829,22 @@ class Language:
self,
before: Optional[Union[str, int]] = None,
after: Optional[Union[str, int]] = None,
first: Optional[bool] = None,
last: Optional[bool] = None,
first: Optional[Literal[True]] = None,
last: Optional[Literal[True]] = None,
) -> int:
"""Determine where to insert a pipeline component based on the before/
after/first/last values.
before (str): Name or index of the component to insert directly before.
after (str): Name or index of component to insert directly after.
first (bool): If True, insert component first in the pipeline.
last (bool): If True, insert component last in the pipeline.
first (Optional[Literal[True]]): If True, insert component first in the pipeline.
last (Optional[Literal[True]]): If True, insert component last in the pipeline.
RETURNS (int): The index of the new pipeline component.
"""
if first is not None and first is not True:
raise ValueError(Errors.E4009.format(attr="first", value=first))
if last is not None and last is not True:
raise ValueError(Errors.E4009.format(attr="last", value=last))
all_args = {"before": before, "after": after, "first": first, "last": last}
if sum(arg is not None for arg in [before, after, first, last]) >= 2:
raise ValueError(
@ -1060,6 +1054,116 @@ class Language:
raise ValueError(Errors.E005.format(name=name, returned_type=type(doc)))
return doc
def distill(
self,
teacher: "Language",
examples: Iterable[Example],
*,
drop: float = 0.0,
sgd: Union[Optimizer, None, Literal[False]] = None,
losses: Optional[Dict[str, float]] = None,
component_cfg: Optional[Dict[str, Dict[str, Any]]] = None,
exclude: Iterable[str] = SimpleFrozenList(),
annotates: Iterable[str] = SimpleFrozenList(),
student_to_teacher: Optional[Dict[str, str]] = None,
):
"""Distill the models in a student pipeline from a teacher pipeline.
teacher (Language): Teacher to distill from.
examples (Iterable[Example]): Distillation examples. The reference
(teacher) and predicted (student) docs must have the same number of
tokens and the same orthography.
drop (float): The dropout rate.
sgd (Union[Optimizer, None, Literal[False]]): An optimizer. Will
be created via create_optimizer if 'None'. No optimizer will
be used when set to 'False'.
losses (Optional(Dict[str, float])): Dictionary to update with the loss,
keyed by component.
component_cfg (Optional[Dict[str, Dict[str, Any]]]): Config parameters
for specific pipeline components, keyed by component name.
exclude (Iterable[str]): Names of components that shouldn't be updated.
annotates (Iterable[str]): Names of components that should set
annotations on the predicted examples after updating.
student_to_teacher (Optional[Dict[str, str]]): Map student pipe name to
teacher pipe name, only needed for pipes where the student pipe
name does not match the teacher pipe name.
RETURNS (Dict[str, float]): The updated losses dictionary
DOCS: https://spacy.io/api/language#distill
"""
if student_to_teacher is None:
student_to_teacher = {}
if losses is None:
losses = {}
if isinstance(examples, list) and len(examples) == 0:
return losses
validate_distillation_examples(examples, "Language.distill")
examples = _copy_examples(examples, copy_x=True, copy_y=True)
if sgd is None:
if self._optimizer is None:
self._optimizer = self.create_optimizer()
sgd = self._optimizer
if component_cfg is None:
component_cfg = {}
pipe_kwargs = {}
for student_name, student_proc in self.pipeline:
component_cfg.setdefault(student_name, {})
pipe_kwargs[student_name] = deepcopy(component_cfg[student_name])
component_cfg[student_name].setdefault("drop", drop)
pipe_kwargs[student_name].setdefault("batch_size", self.batch_size)
teacher_pipes = dict(teacher.pipeline)
for student_name, student_proc in self.pipeline:
if student_name in annotates:
for doc, eg in zip(
_pipe(
(eg.predicted for eg in examples),
proc=student_proc,
name=student_name,
default_error_handler=self.default_error_handler,
kwargs=pipe_kwargs[student_name],
),
examples,
):
eg.predicted = doc
if (
student_name not in exclude
and isinstance(student_proc, ty.DistillableComponent)
and student_proc.is_distillable
):
# A missing teacher pipe is not an error, some student pipes
# do not need a teacher, such as tok2vec layer losses.
teacher_name = (
student_to_teacher[student_name]
if student_name in student_to_teacher
else student_name
)
teacher_pipe = teacher_pipes.get(teacher_name, None)
student_proc.distill(
teacher_pipe,
examples,
sgd=None,
losses=losses,
**component_cfg[student_name],
)
# Only finish the update after all component updates are done. Some
# components may share weights (such as tok2vec) and we only want
# to apply weight updates after all gradients are accumulated.
for student_name, student_proc in self.pipeline:
if (
student_name not in exclude
and isinstance(student_proc, ty.DistillableComponent)
and student_proc.is_distillable
and sgd not in (None, False)
):
student_proc.finish_update(sgd)
return losses
def disable_pipes(self, *names) -> "DisabledPipes":
"""Disable one or more pipeline components. If used as a context
manager, the pipeline will be restored to the initial state at the end
@ -1148,7 +1252,7 @@ class Language:
_: Optional[Any] = None,
*,
drop: float = 0.0,
sgd: Optional[Optimizer] = None,
sgd: Union[Optimizer, None, Literal[False]] = None,
losses: Optional[Dict[str, float]] = None,
component_cfg: Optional[Dict[str, Dict[str, Any]]] = None,
exclude: Iterable[str] = SimpleFrozenList(),
@ -1159,7 +1263,9 @@ class Language:
examples (Iterable[Example]): A batch of examples
_: Should not be set - serves to catch backwards-incompatible scripts.
drop (float): The dropout rate.
sgd (Optimizer): An optimizer.
sgd (Union[Optimizer, None, Literal[False]]): An optimizer. Will
be created via create_optimizer if 'None'. No optimizer will
be used when set to 'False'.
losses (Dict[str, float]): Dictionary to update with the loss, keyed by
component.
component_cfg (Dict[str, Dict]): Config parameters for specific pipeline
@ -1192,17 +1298,12 @@ class Language:
component_cfg[name].setdefault("drop", drop)
pipe_kwargs[name].setdefault("batch_size", self.batch_size)
for name, proc in self.pipeline:
# ignore statements are used here because mypy ignores hasattr
if name not in exclude and hasattr(proc, "update"):
proc.update(examples, sgd=None, losses=losses, **component_cfg[name]) # type: ignore
if sgd not in (None, False):
if (
name not in exclude
and isinstance(proc, ty.TrainableComponent)
and proc.is_trainable
and proc.model not in (True, False, None)
):
proc.finish_update(sgd)
proc.update(examples, sgd=None, losses=losses, **component_cfg[name])
if name in annotates:
for doc, eg in zip(
_pipe(
@ -1215,7 +1316,19 @@ class Language:
examples,
):
eg.predicted = doc
return _replace_numpy_floats(losses)
# Only finish the update after all component updates are done. Some
# components may share weights (such as tok2vec) and we only want
# to apply weight updates after all gradients are accumulated.
for name, proc in self.pipeline:
if (
name not in exclude
and isinstance(proc, ty.TrainableComponent)
and proc.is_trainable
and sgd not in (None, False)
):
proc.finish_update(sgd)
return losses
def rehearse(
self,
@ -1281,25 +1394,20 @@ class Language:
sgd(key, W, dW) # type: ignore[call-arg, misc]
return losses
def begin_training(
self,
get_examples: Optional[Callable[[], Iterable[Example]]] = None,
*,
sgd: Optional[Optimizer] = None,
) -> Optimizer:
warnings.warn(Warnings.W089, DeprecationWarning)
return self.initialize(get_examples, sgd=sgd)
def initialize(
self,
get_examples: Optional[Callable[[], Iterable[Example]]] = None,
*,
labels: Optional[Dict[str, Any]] = None,
sgd: Optional[Optimizer] = None,
) -> Optimizer:
"""Initialize the pipe for training, using data examples if available.
get_examples (Callable[[], Iterable[Example]]): Optional function that
returns gold-standard Example objects.
labels (Optional[Dict[str, Any]]): Labels to pass to pipe initialization,
using the names of the pipes as keys. Overrides labels that are in
the model configuration.
sgd (Optional[Optimizer]): An optimizer to use for updates. If not
provided, will be created using the .create_optimizer() method.
RETURNS (thinc.api.Optimizer): The optimizer.
@ -1347,6 +1455,8 @@ class Language:
for name, proc in self.pipeline:
if isinstance(proc, ty.InitializableComponent):
p_settings = I["components"].get(name, {})
if labels is not None and name in labels:
p_settings["labels"] = labels[name]
p_settings = validate_init_settings(
proc.initialize, p_settings, section="components", name=name
)
@ -1466,7 +1576,7 @@ class Language:
results = scorer.score(examples, per_component=per_component)
n_words = sum(len(eg.predicted) for eg in examples)
results["speed"] = n_words / (end_time - start_time)
return _replace_numpy_floats(results)
return results
def create_optimizer(self):
"""Create an optimizer, usually using the [training.optimizer] config."""
@ -1820,6 +1930,7 @@ class Language:
# using the nlp.config with all defaults.
config = util.copy_config(config)
orig_pipeline = config.pop("components", {})
orig_distill = config.pop("distillation", None)
orig_pretraining = config.pop("pretraining", None)
config["components"] = {}
if auto_fill:
@ -1828,6 +1939,9 @@ class Language:
filled = config
filled["components"] = orig_pipeline
config["components"] = orig_pipeline
if orig_distill is not None:
filled["distillation"] = orig_distill
config["distillation"] = orig_distill
if orig_pretraining is not None:
filled["pretraining"] = orig_pretraining
config["pretraining"] = orig_pretraining
@ -2095,38 +2209,6 @@ class Language:
util.replace_model_node(pipe.model, listener, new_model) # type: ignore[attr-defined]
tok2vec.remove_listener(listener, pipe_name)
@contextmanager
def memory_zone(self, mem: Optional[Pool] = None) -> Iterator[Pool]:
"""Begin a block where all resources allocated during the block will
be freed at the end of it. If a resources was created within the
memory zone block, accessing it outside the block is invalid.
Behaviour of this invalid access is undefined. Memory zones should
not be nested.
The memory zone is helpful for services that need to process large
volumes of text with a defined memory budget.
Example
-------
>>> with nlp.memory_zone():
... for doc in nlp.pipe(texts):
... process_my_doc(doc)
>>> # use_doc(doc) <-- Invalid: doc was allocated in the memory zone
"""
if mem is None:
mem = Pool()
# The ExitStack allows programmatic nested context managers.
# We don't know how many we need, so it would be awkward to have
# them as nested blocks.
with ExitStack() as stack:
contexts = [stack.enter_context(self.vocab.memory_zone(mem))]
if hasattr(self.tokenizer, "memory_zone"):
contexts.append(stack.enter_context(self.tokenizer.memory_zone(mem)))
for _, pipe in self.pipeline:
if hasattr(pipe, "memory_zone"):
contexts.append(stack.enter_context(pipe.memory_zone(mem)))
yield mem
def to_disk(
self, path: Union[str, Path], *, exclude: Iterable[str] = SimpleFrozenList()
) -> None:
@ -2144,9 +2226,7 @@ class Language:
serializers["tokenizer"] = lambda p: self.tokenizer.to_disk( # type: ignore[union-attr]
p, exclude=["vocab"]
)
serializers["meta.json"] = lambda p: srsly.write_json(
p, _replace_numpy_floats(self.meta)
)
serializers["meta.json"] = lambda p: srsly.write_json(p, self.meta)
serializers["config.cfg"] = lambda p: self.config.to_disk(p)
for name, proc in self._components:
if name in exclude:
@ -2214,9 +2294,6 @@ class Language:
if path.exists():
data = srsly.read_json(path)
self.meta.update(data)
# self.meta always overrides meta["vectors"] with the metadata
# from self.vocab.vectors, so set the name directly
self.vocab.vectors.name = data.get("vectors", {}).get("name")
def deserialize_vocab(path: Path) -> None:
if path.exists():
@ -2260,9 +2337,7 @@ class Language:
serializers: Dict[str, Callable[[], bytes]] = {}
serializers["vocab"] = lambda: self.vocab.to_bytes(exclude=exclude)
serializers["tokenizer"] = lambda: self.tokenizer.to_bytes(exclude=["vocab"]) # type: ignore[union-attr]
serializers["meta.json"] = lambda: srsly.json_dumps(
_replace_numpy_floats(self.meta)
)
serializers["meta.json"] = lambda: srsly.json_dumps(self.meta)
serializers["config.cfg"] = lambda: self.config.to_bytes()
for name, proc in self._components:
if name in exclude:
@ -2287,9 +2362,6 @@ class Language:
def deserialize_meta(b):
data = srsly.json_loads(b)
self.meta.update(data)
# self.meta always overrides meta["vectors"] with the metadata
# from self.vocab.vectors, so set the name directly
self.vocab.vectors.name = data.get("vectors", {}).get("name")
deserializers: Dict[str, Callable[[bytes], Any]] = {}
deserializers["config.cfg"] = lambda b: self.config.from_bytes(
@ -2313,12 +2385,6 @@ class Language:
return self
def _replace_numpy_floats(meta_dict: dict) -> dict:
return convert_recursive(
lambda v: isinstance(v, numpy.floating), lambda v: float(v), dict(meta_dict)
)
@dataclass
class FactoryMeta:
"""Dataclass containing information about a component and its defaults
@ -2362,13 +2428,18 @@ class DisabledPipes(list):
self[:] = []
def _copy_examples(examples: Iterable[Example]) -> List[Example]:
def _copy_examples(
examples: Iterable[Example], *, copy_x: bool = True, copy_y: bool = False
) -> List[Example]:
"""Make a copy of a batch of examples, copying the predicted Doc as well.
This is used in contexts where we need to take ownership of the examples
so that they can be mutated, for instance during Language.evaluate and
Language.update.
"""
return [Example(eg.x.copy(), eg.y) for eg in examples]
return [
Example(eg.x.copy() if copy_x else eg.x, eg.y.copy() if copy_y else eg.y)
for eg in examples
]
def _apply_pipes(

View File

@ -12,7 +12,6 @@ from .attrs cimport (
SUFFIX,
attr_id_t,
)
from .strings cimport StringStore
from .structs cimport LexemeC
from .typedefs cimport attr_t, flags_t, hash_t, len_t, tag_t
from .vocab cimport Vocab
@ -35,7 +34,7 @@ cdef class Lexeme:
return self
@staticmethod
cdef inline void set_struct_attr(LexemeC* lex, attr_id_t name, attr_t value) noexcept nogil:
cdef inline void set_struct_attr(LexemeC* lex, attr_id_t name, attr_t value) nogil:
if name < (sizeof(flags_t) * 8):
Lexeme.c_set_flag(lex, name, value)
elif name == ID:
@ -54,7 +53,7 @@ cdef class Lexeme:
lex.lang = value
@staticmethod
cdef inline attr_t get_struct_attr(const LexemeC* lex, attr_id_t feat_name) noexcept nogil:
cdef inline attr_t get_struct_attr(const LexemeC* lex, attr_id_t feat_name) nogil:
if feat_name < (sizeof(flags_t) * 8):
if Lexeme.c_check_flag(lex, feat_name):
return 1
@ -82,7 +81,7 @@ cdef class Lexeme:
return 0
@staticmethod
cdef inline bint c_check_flag(const LexemeC* lexeme, attr_id_t flag_id) noexcept nogil:
cdef inline bint c_check_flag(const LexemeC* lexeme, attr_id_t flag_id) nogil:
cdef flags_t one = 1
if lexeme.flags & (one << flag_id):
return True
@ -90,7 +89,7 @@ cdef class Lexeme:
return False
@staticmethod
cdef inline bint c_set_flag(LexemeC* lex, attr_id_t flag_id, bint value) noexcept nogil:
cdef inline bint c_set_flag(LexemeC* lex, attr_id_t flag_id, bint value) nogil:
cdef flags_t one = 1
if value:
lex.flags |= one << flag_id

View File

@ -19,7 +19,6 @@ class Lexeme:
def vector_norm(self) -> float: ...
vector: Floats1d
rank: int
sentiment: float
@property
def orth_(self) -> str: ...
@property

View File

@ -57,7 +57,7 @@ cdef class Lexeme:
"""
self.vocab = vocab
self.orth = orth
self.c = <LexemeC*><void*>vocab.get_by_orth(vocab.mem, orth)
self.c = <LexemeC*><void*>vocab.get_by_orth(orth)
if self.c.orth != orth:
raise ValueError(Errors.E071.format(orth=orth, vocab_orth=self.c.orth))
@ -70,7 +70,7 @@ cdef class Lexeme:
if isinstance(other, Lexeme):
a = self.orth
b = other.orth
elif isinstance(other, int):
elif isinstance(other, long):
a = self.orth
b = other
elif isinstance(other, str):
@ -104,7 +104,7 @@ cdef class Lexeme:
# skip PROB, e.g. from lexemes.jsonl
if isinstance(value, float):
continue
elif isinstance(value, int):
elif isinstance(value, (int, long)):
Lexeme.set_struct_attr(self.c, attr, value)
else:
Lexeme.set_struct_attr(self.c, attr, self.vocab.strings.add(value))
@ -193,20 +193,6 @@ cdef class Lexeme:
def rank(self, value):
self.c.id = value
@property
def sentiment(self):
"""RETURNS (float): A scalar value indicating the positivity or
negativity of the lexeme."""
sentiment_table = self.vocab.lookups.get_table("lexeme_sentiment", {})
return sentiment_table.get(self.c.orth, 0.0)
@sentiment.setter
def sentiment(self, float x):
if "lexeme_sentiment" not in self.vocab.lookups:
self.vocab.lookups.add_table("lexeme_sentiment")
sentiment_table = self.vocab.lookups.get_table("lexeme_sentiment")
sentiment_table[self.c.orth] = x
@property
def orth_(self):
"""RETURNS (str): The original verbatim text of the lexeme

View File

@ -2,16 +2,40 @@ from collections import OrderedDict
from pathlib import Path
from typing import Any, Dict, List, Optional, Union
import requests
import srsly
from preshed.bloom import BloomFilter
from .errors import Errors
from .strings import get_string_id
from .util import SimpleFrozenDict, ensure_path, load_language_data, registry
from .util import SimpleFrozenDict, ensure_path, load_language_data, logger, registry
UNSET = object()
@registry.misc("spacy.LookupsDataLoader.v1")
def load_lookups_data(lang, tables):
logger.debug(f"Loading lookups from spacy-lookups-data: {tables}")
lookups = load_lookups(lang=lang, tables=tables)
return lookups
@registry.misc("spacy.LookupsDataLoaderFromURL.v1")
def load_lookups_data_from_url(lang, tables, url):
logger.debug(f"Loading lookups from {url}: {tables}")
lookups = Lookups()
for table in tables:
table_url = url + lang + "_" + table + ".json"
r = requests.get(table_url)
if r.status_code != 200:
raise ValueError(
Errors.E4011.format(status_code=r.status_code, url=table_url)
)
table_data = r.json()
lookups.add_table(table, table_data)
return lookups
def load_lookups(lang: str, tables: List[str], strict: bool = True) -> "Lookups":
"""Load the data from the spacy-lookups-data package for a given language,
if available. Returns an empty `Lookups` container if there's no data or if the package

View File

@ -175,9 +175,9 @@ cdef class DependencyMatcher:
on_match (callable): Optional callback executed on match.
"""
if on_match is not None and not hasattr(on_match, "__call__"):
raise ValueError(Errors.E171.format(arg_type=type(on_match)))
if patterns is None or not isinstance(patterns, List): # old API
raise ValueError(Errors.E948.format(arg_type=type(patterns)))
raise ValueError(Errors.E171.format(name="DependencyMatcher", arg_type=type(on_match)))
if patterns is None or not isinstance(patterns, List):
raise ValueError(Errors.E948.format(name="DependencyMatcher", arg_type=type(patterns)))
for pattern in patterns:
if len(pattern) == 0:
raise ValueError(Errors.E012.format(key=key))

View File

@ -1,4 +1,4 @@
# cython: binding=True, infer_types=True, language_level=3
# cython: binding=True, infer_types=True
from cpython.object cimport PyObject
from libc.stdint cimport int64_t
@ -27,5 +27,6 @@ cpdef bint levenshtein_compare(input_text: str, pattern_text: str, fuzzy: int =
return levenshtein(input_text, pattern_text, max_edits) <= max_edits
@registry.misc("spacy.levenshtein_compare.v1")
def make_levenshtein_compare():
return levenshtein_compare

View File

@ -5,13 +5,13 @@ from typing import (
Iterable,
Iterator,
List,
Literal,
Optional,
Tuple,
Union,
overload,
)
from ..compat import Literal
from ..tokens import Doc, Span
from ..vocab import Vocab

View File

@ -20,6 +20,12 @@ from ..tokens.span cimport Span
from ..tokens.token cimport Token
from ..typedefs cimport attr_t
from ..errors import Errors, MatchPatternError, Warnings
from ..schemas import validate_token_pattern
from .levenshtein import levenshtein_compare
from ..strings cimport get_string_id
from ..attrs import IDS
from ..errors import Errors, MatchPatternError, Warnings
from ..schemas import validate_token_pattern
@ -113,9 +119,9 @@ cdef class Matcher:
"""
errors = {}
if on_match is not None and not hasattr(on_match, "__call__"):
raise ValueError(Errors.E171.format(arg_type=type(on_match)))
if patterns is None or not isinstance(patterns, List): # old API
raise ValueError(Errors.E948.format(arg_type=type(patterns)))
raise ValueError(Errors.E171.format(name="Matcher", arg_type=type(on_match)))
if patterns is None or not isinstance(patterns, List):
raise ValueError(Errors.E948.format(name="Matcher", arg_type=type(patterns)))
if greedy is not None and greedy not in ["FIRST", "LONGEST"]:
raise ValueError(Errors.E947.format(expected=["FIRST", "LONGEST"], arg=greedy))
for i, pattern in enumerate(patterns):
@ -275,6 +281,10 @@ cdef class Matcher:
# non-overlapping ones this `match` can be either (start, end) or
# (start, end, alignments) depending on `with_alignments=` option.
for key, *match in matches:
# Adjust span matches to doc offsets
if isinstance(doclike, Span):
match[0] += doclike.start
match[1] += doclike.start
span_filter = self._filter.get(key)
if span_filter is not None:
pairs = pairs_by_id.get(key, [])
@ -305,9 +315,6 @@ cdef class Matcher:
if as_spans:
final_results = []
for key, start, end, *_ in final_matches:
if isinstance(doclike, Span):
start += doclike.start
end += doclike.start
final_results.append(Span(doc, start, end, label=key))
elif with_alignments:
# convert alignments List[Dict[str, int]] --> List[int]
@ -625,7 +632,7 @@ cdef action_t get_action(
const TokenC * token,
const attr_t * extra_attrs,
const int8_t * predicate_matches
) noexcept nogil:
) nogil:
"""We need to consider:
a) Does the token match the specification? [Yes, No]
b) What's the quantifier? [1, 0+, ?]
@ -740,7 +747,7 @@ cdef int8_t get_is_match(
const TokenC* token,
const attr_t* extra_attrs,
const int8_t* predicate_matches
) noexcept nogil:
) nogil:
for i in range(state.pattern.nr_py):
if predicate_matches[state.pattern.py_predicates[i]] == -1:
return 0
@ -755,14 +762,14 @@ cdef int8_t get_is_match(
return True
cdef inline int8_t get_is_final(PatternStateC state) noexcept nogil:
cdef inline int8_t get_is_final(PatternStateC state) nogil:
if state.pattern[1].quantifier == FINAL_ID:
return 1
else:
return 0
cdef inline int8_t get_quantifier(PatternStateC state) noexcept nogil:
cdef inline int8_t get_quantifier(PatternStateC state) nogil:
return state.pattern.quantifier
@ -805,7 +812,7 @@ cdef TokenPatternC* init_pattern(Pool mem, attr_t entity_id, object token_specs)
return pattern
cdef attr_t get_ent_id(const TokenPatternC* pattern) noexcept nogil:
cdef attr_t get_ent_id(const TokenPatternC* pattern) nogil:
while pattern.quantifier != FINAL_ID:
pattern += 1
id_attr = pattern[0].attrs[0]

View File

@ -1,6 +1,5 @@
from typing import Any, Callable, Dict, List, Optional, Tuple, Union, overload
from typing import Any, Callable, Dict, List, Literal, Optional, Tuple, Union, overload
from ..compat import Literal
from ..tokens import Doc, Span
from ..vocab import Vocab
from .matcher import Matcher
@ -21,6 +20,15 @@ class PhraseMatcher:
Callable[[Matcher, Doc, int, List[Tuple[Any, ...]]], Any]
] = ...,
) -> None: ...
def _add_from_arrays(
self,
key: str,
specs: List[List[int]],
*,
on_match: Optional[
Callable[[Matcher, Doc, int, List[Tuple[Any, ...]]], Any]
] = ...,
) -> None: ...
def remove(self, key: str) -> None: ...
@overload
def __call__(

View File

@ -1,4 +1,7 @@
# cython: infer_types=True
from collections import defaultdict
from typing import List
from preshed.maps cimport map_clear, map_get, map_init, map_iter, map_set
import warnings
@ -39,7 +42,7 @@ cdef class PhraseMatcher:
"""
self.vocab = vocab
self._callbacks = {}
self._docs = {}
self._docs = defaultdict(set)
self._validate = validate
self.mem = Pool()
@ -47,7 +50,7 @@ cdef class PhraseMatcher:
self._terminal_hash = 826361138722620965
map_init(self.mem, self.c_map, 8)
if isinstance(attr, int):
if isinstance(attr, (int, long)):
self.attr = attr
else:
if attr is None:
@ -155,41 +158,69 @@ cdef class PhraseMatcher:
del self._callbacks[key]
del self._docs[key]
def add(self, key, docs, *_docs, on_match=None):
"""Add a match-rule to the phrase-matcher. A match-rule consists of: an ID
key, an on_match callback, and one or more patterns.
Since spaCy v2.2.2, PhraseMatcher.add takes a list of patterns as the
second argument, with the on_match callback as an optional keyword
argument.
def _add_from_arrays(self, key, specs, *, on_match=None):
"""Add a preprocessed list of specs, with an optional callback.
key (str): The match ID.
docs (list): List of `Doc` objects representing match patterns.
specs (List[List[int]]): A list of lists of hashes to match.
on_match (callable): Callback executed on match.
*_docs (Doc): For backwards compatibility: list of patterns to add
as variable arguments. Will be ignored if a list of patterns is
provided as the second argument.
DOCS: https://spacy.io/api/phrasematcher#add
"""
if docs is None or hasattr(docs, "__call__"): # old API
on_match = docs
docs = _docs
_ = self.vocab[key]
self._callbacks[key] = on_match
self._docs.setdefault(key, set())
cdef MapStruct* current_node
cdef MapStruct* internal_node
cdef void* result
self._callbacks[key] = on_match
for spec in specs:
self._docs[key].add(tuple(spec))
current_node = self.c_map
for token in spec:
if token == self._terminal_hash:
warnings.warn(Warnings.W021)
break
result = <MapStruct*>map_get(current_node, token)
if not result:
internal_node = <MapStruct*>self.mem.alloc(1, sizeof(MapStruct))
map_init(self.mem, internal_node, 8)
map_set(self.mem, current_node, token, internal_node)
result = internal_node
current_node = <MapStruct*>result
result = <MapStruct*>map_get(current_node, self._terminal_hash)
if not result:
internal_node = <MapStruct*>self.mem.alloc(1, sizeof(MapStruct))
map_init(self.mem, internal_node, 8)
map_set(self.mem, current_node, self._terminal_hash, internal_node)
result = internal_node
map_set(self.mem, <MapStruct*>result, self.vocab.strings[key], NULL)
def add(self, key, docs, *, on_match=None):
"""Add a match-rule to the phrase-matcher. A match-rule consists of: an ID
key, a list of one or more patterns, and (optionally) an on_match callback.
key (str): The match ID.
docs (list): List of `Doc` objects representing match patterns.
on_match (callable): Callback executed on match.
If any of the input Docs are invalid, no internal state will be updated.
DOCS: https://spacy.io/api/phrasematcher#add
"""
if isinstance(docs, Doc):
raise ValueError(Errors.E179.format(key=key))
if docs is None or not isinstance(docs, List):
raise ValueError(Errors.E948.format(name="PhraseMatcher", arg_type=type(docs)))
if on_match is not None and not hasattr(on_match, "__call__"):
raise ValueError(Errors.E171.format(name="PhraseMatcher", arg_type=type(on_match)))
_ = self.vocab[key]
specs = []
for doc in docs:
if len(doc) == 0:
continue
if isinstance(doc, Doc):
if not isinstance(doc, Doc):
raise ValueError(Errors.E4000.format(type=type(doc)))
attrs = (TAG, POS, MORPH, LEMMA, DEP)
has_annotation = {attr: doc.has_annotation(attr) for attr in attrs}
for attr in attrs:
@ -208,30 +239,9 @@ cdef class PhraseMatcher:
and self.attr not in attrs:
string_attr = self.vocab.strings[self.attr]
warnings.warn(Warnings.W012.format(key=key, attr=string_attr))
keyword = self._convert_to_array(doc)
else:
keyword = doc
self._docs[key].add(tuple(keyword))
specs.append(self._convert_to_array(doc))
current_node = self.c_map
for token in keyword:
if token == self._terminal_hash:
warnings.warn(Warnings.W021)
break
result = <MapStruct*>map_get(current_node, token)
if not result:
internal_node = <MapStruct*>self.mem.alloc(1, sizeof(MapStruct))
map_init(self.mem, internal_node, 8)
map_set(self.mem, current_node, token, internal_node)
result = internal_node
current_node = <MapStruct*>result
result = <MapStruct*>map_get(current_node, self._terminal_hash)
if not result:
internal_node = <MapStruct*>self.mem.alloc(1, sizeof(MapStruct))
map_init(self.mem, internal_node, 8)
map_set(self.mem, current_node, self._terminal_hash, internal_node)
result = internal_node
map_set(self.mem, <MapStruct*>result, self.vocab.strings[key], NULL)
self._add_from_arrays(key, specs, on_match=on_match)
def __call__(self, object doclike, *, as_spans=False):
"""Find all sequences matching the supplied patterns on the `Doc`.
@ -345,7 +355,7 @@ def unpickle_matcher(vocab, docs, callbacks, attr):
matcher = PhraseMatcher(vocab, attr=attr)
for key, specs in docs.items():
callback = callbacks.get(key, None)
matcher.add(key, specs, on_match=callback)
matcher._add_from_arrays(key, specs, on_match=callback)
return matcher

View File

@ -3,6 +3,7 @@ from thinc.api import Model, normal_init
from ..util import registry
@registry.layers("spacy.PrecomputableAffine.v1")
def PrecomputableAffine(nO, nI, nF, nP, dropout=0.1):
model = Model(
"precomputable_affine",

View File

@ -23,6 +23,7 @@ DEFAULT_NVTX_ANNOTATABLE_PIPE_METHODS = [
"update",
"rehearse",
"get_loss",
"get_teacher_student_loss",
"initialize",
"begin_update",
"finish_update",
@ -50,6 +51,7 @@ def models_with_nvtx_range(nlp, forward_color: int, backprop_color: int):
return nlp
@registry.callbacks("spacy.models_with_nvtx_range.v1")
def create_models_with_nvtx_range(
forward_color: int = -1, backprop_color: int = -1
) -> Callable[["Language"], "Language"]:
@ -109,6 +111,7 @@ def pipes_with_nvtx_range(
return nlp
@registry.callbacks("spacy.models_and_pipes_with_nvtx_range.v1")
def create_models_and_pipes_with_nvtx_range(
forward_color: int = -1,
backprop_color: int = -1,

View File

@ -7,6 +7,7 @@ from ..tokens import Doc
from ..util import registry
@registry.layers("spacy.CharEmbed.v1")
def CharacterEmbed(nM: int, nC: int) -> Model[List[Doc], List[Floats2d]]:
# nM: Number of dimensions per character. nC: Number of characters.
return Model(

View File

@ -4,6 +4,7 @@ from ..attrs import LOWER
from ..util import registry
@registry.layers("spacy.extract_ngrams.v1")
def extract_ngrams(ngram_size: int, attr: int = LOWER) -> Model:
model: Model = Model("extract_ngrams", forward)
model.attrs["ngram_size"] = ngram_size

View File

@ -6,6 +6,7 @@ from thinc.types import Ints1d, Ragged
from ..util import registry
@registry.layers("spacy.extract_spans.v1")
def extract_spans() -> Model[Tuple[Ragged, Ragged], Ragged]:
"""Extract spans from a sequence of source arrays, as specified by an array
of (start, end) indices. The output is a ragged array of the

View File

@ -6,9 +6,8 @@ from thinc.types import Ints2d
from ..tokens import Doc
def FeatureExtractor(
columns: Union[List[str], List[int], List[Union[int, str]]]
) -> Model[List[Doc], List[Ints2d]]:
@registry.layers("spacy.FeatureExtractor.v1")
def FeatureExtractor(columns: List[Union[int, str]]) -> Model[List[Doc], List[Ints2d]]:
return Model("extract_features", forward, attrs={"columns": columns})

View File

@ -1,5 +1,5 @@
from pathlib import Path
from typing import Callable, Iterable, List, Optional, Tuple
from typing import Callable, Iterable, Iterator, List, Optional, Tuple
from thinc.api import (
Linear,
@ -15,19 +15,17 @@ from thinc.api import (
from thinc.types import Floats2d
from ...errors import Errors
from ...kb import (
Candidate,
InMemoryLookupKB,
KnowledgeBase,
get_candidates,
get_candidates_batch,
)
from ...tokens import Doc, Span
from ...kb import Candidate, InMemoryLookupKB, KnowledgeBase
from ...tokens import Doc, Span, SpanGroup
from ...util import registry
from ...vocab import Vocab
from ..extract_spans import extract_spans
CandidatesForMentionT = Iterable[Candidate]
CandidatesForDocT = Iterable[CandidatesForMentionT]
@registry.architectures("spacy.EntityLinker.v2")
def build_nel_encoder(
tok2vec: Model, nO: Optional[int] = None
) -> Model[List[Doc], Floats2d]:
@ -91,6 +89,7 @@ def span_maker_forward(model, docs: List[Doc], is_train) -> Tuple[Ragged, Callab
return out, lambda x: []
@registry.misc("spacy.KBFromFile.v1")
def load_kb(
kb_path: Path,
) -> Callable[[Vocab], KnowledgeBase]:
@ -102,6 +101,7 @@ def load_kb(
return kb_from_file
@registry.misc("spacy.EmptyKB.v2")
def empty_kb_for_config() -> Callable[[Vocab, int], KnowledgeBase]:
def empty_kb_factory(vocab: Vocab, entity_vector_length: int):
return InMemoryLookupKB(vocab=vocab, entity_vector_length=entity_vector_length)
@ -109,6 +109,7 @@ def empty_kb_for_config() -> Callable[[Vocab, int], KnowledgeBase]:
return empty_kb_factory
@registry.misc("spacy.EmptyKB.v1")
def empty_kb(
entity_vector_length: int,
) -> Callable[[Vocab], KnowledgeBase]:
@ -118,11 +119,39 @@ def empty_kb(
return empty_kb_factory
def create_candidates() -> Callable[[KnowledgeBase, Span], Iterable[Candidate]]:
@registry.misc("spacy.CandidateGenerator.v1")
def create_get_candidates() -> Callable[[KnowledgeBase, Span], Iterable[Candidate]]:
return get_candidates
def create_candidates_batch() -> Callable[
[KnowledgeBase, Iterable[Span]], Iterable[Iterable[Candidate]]
@registry.misc("spacy.CandidateGenerator.v2")
def create_get_candidates_v2() -> Callable[
[KnowledgeBase, Iterator[SpanGroup]], Iterator[CandidatesForDocT]
]:
return get_candidates_batch
return get_candidates_v2
def get_candidates(kb: KnowledgeBase, mention: Span) -> Iterable[Candidate]:
"""
Return candidate entities for the given mention from the KB.
kb (KnowledgeBase): Knowledge base to query.
mention (Span): Entity mention.
RETURNS (Iterable[Candidate]): Identified candidates for specified mention.
"""
cands_per_doc = next(
get_candidates_v2(kb, iter([SpanGroup(mention.doc, spans=[mention])]))
)
assert isinstance(cands_per_doc, list)
return next(cands_per_doc[0])
def get_candidates_v2(
kb: KnowledgeBase, mentions: Iterator[SpanGroup]
) -> Iterator[Iterable[Iterable[Candidate]]]:
"""
Return candidate entities for the given mentions from the KB.
kb (KnowledgeBase): Knowledge base to query.
mentions (Iterator[SpanGroup]): Mentions per doc.
RETURNS (Iterator[Iterable[Iterable[Candidate]]]): Identified candidates per mentions in document/SpanGroup.
"""
return kb.get_candidates(mentions)

View File

@ -30,6 +30,7 @@ if TYPE_CHECKING:
from ...vocab import Vocab # noqa: F401
@registry.architectures("spacy.PretrainVectors.v1")
def create_pretrain_vectors(
maxout_pieces: int, hidden_size: int, loss: str
) -> Callable[["Vocab", Model], Model]:
@ -56,6 +57,7 @@ def create_pretrain_vectors(
return create_vectors_objective
@registry.architectures("spacy.PretrainCharacters.v1")
def create_pretrain_characters(
maxout_pieces: int, hidden_size: int, n_characters: int
) -> Callable[["Vocab", Model], Model]:

View File

@ -1,9 +1,8 @@
from typing import List, Optional, cast
from typing import List, Literal, Optional
from thinc.api import Linear, Model, chain, list2array, use_ops, zero_init
from thinc.types import Floats2d
from ...compat import Literal
from ...errors import Errors
from ...tokens import Doc
from ...util import registry
@ -11,6 +10,7 @@ from .._precomputable_affine import PrecomputableAffine
from ..tb_framework import TransitionModel
@registry.architectures("spacy.TransitionBasedParser.v2")
def build_tb_parser_model(
tok2vec: Model[List[Doc], List[Floats2d]],
state_type: Literal["parser", "ner"],

View File

@ -10,6 +10,7 @@ InT = List[Doc]
OutT = Floats2d
@registry.architectures("spacy.SpanFinder.v1")
def build_finder_model(
tok2vec: Model[InT, List[Floats2d]], scorer: Model[OutT, OutT]
) -> Model[InT, OutT]:

Some files were not shown because too many files have changed in this diff Show More