Compare commits

...

188 Commits

Author SHA1 Message Date
Jeff Adolphe
41e07772dc
Added Haitian Creole (ht) Language Support to spaCy (#13807)
This PR adds official support for Haitian Creole (ht) to spaCy's spacy/lang module.
It includes:

    Added all core language data files for spacy/lang/ht:
        tokenizer_exceptions.py
        punctuation.py
        lex_attrs.py
        syntax_iterators.py
        lemmatizer.py
        stop_words.py
        tag_map.py

    Unit tests for the tokenizer and noun chunking (test_tokenizer.py, test_noun_chunking.py, etc.). All 58 pytest tests I created under spacy/tests/lang/ht pass.

    Basic tokenizer rules adapted for Haitian Creole orthography and informal contractions.

    Custom like_num attribute supporting Haitian number formats (e.g., "3yèm").

    Support for common informal apostrophe usage (e.g., "m'ap", "n'ap", "di'm").

    Ensured no breakages in other language modules.

    Followed spaCy coding style (PEP8, Black).

This provides a foundation for Haitian Creole NLP development using spaCy.
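A minimal usage sketch of the new language data (assuming this PR is installed; the sample sentence, and exactly how the contractions split, are illustrative rather than taken from the PR's tests):

```python
import spacy

# "ht" is registered like any other spacy.lang module, so a blank pipeline
# picks up the tokenizer rules and lexical attributes added in this PR.
nlp = spacy.blank("ht")

doc = nlp("M'ap li yon liv. Li nan 3yèm plas la.")
print([token.text for token in doc])       # contractions handled via tokenizer_exceptions.py
print([token.like_num for token in doc])   # custom like_num flags forms such as "3yèm"
```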
2025-05-28 17:23:38 +02:00
Martin Schorfmann
e8f40e2169
Correct API docs for Span.lemma_, Vocab.to_bytes and Vectors.__init__ (#13436)
* Correct code example for Span.lemma_ in API Docs (#13405)

* Correct documented return type of Vocab.to_bytes in API docs

* Correct wording for Vectors.__init__ in API docs
2025-05-28 17:22:50 +02:00
BLKSerene
7b1d6e58ff
Remove dependency on langcodes (#13760)
This PR removes the dependency on langcodes introduced in #9342.

While the introduction of langcodes allows a significantly wider range of language codes, there are some unexpected side effects:

    zh-Hant (Traditional Chinese) should be mapped to zh instead of None, as spaCy's Chinese model is based on pkuseg, which supports tokenization of both Simplified and Traditional Chinese.
    Since it is possible that spaCy may have a model for Norwegian Nynorsk in the future, mapping no (macrolanguage Norwegian) to nb (Norwegian Bokmål) might be misleading. In that case, the user should be asked to specify nb or nn (Norwegian Nynorsk) explicitly, or to consult the docs.
    Same as above for regional variants of languages such as en_gb and en_us.

Overall, IMHO, introducing an extra dependency just for the conversion of language codes is overkill. Most users probably just need conversion between 2- and 3-letter ISO codes, and a simple dictionary lookup should suffice.

With this PR, ISO 639-1 and ISO 639-3 codes are supported. ISO 639-2/B (bibliographic codes which are not favored and used in ISO 639-3) and deprecated ISO 639-1/2 codes are also supported to maximize backward compatibility.
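The dictionary-lookup idea reads roughly like the sketch below (a toy illustration, not the actual mapping table or function names used in this PR):

```python
# Toy sample of a 3-letter -> 2-letter ISO code table; the real table covers
# ISO 639-1, ISO 639-3, ISO 639-2/B and deprecated codes as described above.
ISO_CODE_MAP = {
    "eng": "en",
    "fra": "fr",
    "nob": "nb",  # Norwegian Bokmål
    "nno": "nn",  # Norwegian Nynorsk
}

def to_iso639_1(code: str) -> str:
    """Map a 3-letter ISO 639-3 code to ISO 639-1 where a mapping exists."""
    return ISO_CODE_MAP.get(code.lower(), code)

print(to_iso639_1("eng"))  # "en"
print(to_iso639_1("nno"))  # "nn"
print(to_iso639_1("xx"))   # unknown codes pass through unchanged
```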
2025-05-28 17:21:46 +02:00
Matthew Honnibal
864c2f3b51 Format 2025-05-28 17:06:11 +02:00
Matthew Honnibal
75a9d9b9ad Test and fix issue13769 2025-05-28 17:04:23 +02:00
Ilie
bec546cec0
Add TeNs plugin (#13800)
Co-authored-by: Ilie Cristian Dorobat <idorobat@cisco.com>
2025-05-27 01:21:07 +02:00
d0ngw
46613e27cf
fix: match hyphenated words to lemmas in index_table (e.g. "co-authored" -> "co-author") (#13816) 2025-05-27 01:20:26 +02:00
omahs
b205ff65e6
fix typos (#13813) 2025-05-26 16:05:29 +02:00
BLKSerene
92f1b8cdb4
Switch to typer-slim (#13759) 2025-05-26 16:03:49 +02:00
Matthew Honnibal
4b65aa79ee Add release script 2025-05-22 14:00:48 +02:00
Matthew Honnibal
d08f4e3b10 Increment version 2025-05-22 13:58:00 +02:00
Matthew Honnibal
6036f344d3 Remove print statements 2025-05-22 13:56:31 +02:00
Matthew Honnibal
5bebbf7550
Python 3.13 support (#13823)
In order to support Python 3.13, we had to migrate to Cython 3.0. This caused some tricky interactions with our Pydantic usage, because Cython 3 uses the from __future__ import annotations semantics, which causes type annotations to be saved as strings.

The end result is that we can't have Language.factory decorated functions in Cython modules anymore, as the Language.factory decorator expects to inspect the signature of the functions and build a Pydantic model. If the function is implemented in Cython, an error is raised because the type is not resolved.

To address this I've moved the factory functions into a new module, spacy.pipeline.factories. I've added __getattr__ importlib hooks to the previous locations, in case anyone was importing these functions directly. The change should have no backwards compatibility implications.

Along the way I've also refactored the registration of functions for the config. Previously these ran as import-time side-effects, using the registry decorator. Instead, I've created a new module spacy.registrations. When the registry is accessed, it calls a function ensure_populated(), which causes the registrations to occur.

I've made a similar change to the Language.factory registrations in the new spacy.pipeline.factories module.

I want to remove these import-time side-effects so that we can speed up the loading time of the library, which can be especially painful on the CLI. I also find that I'm often working to track down the implementations of functions referenced by strings in the config. Having the registrations all happen in one place will make this easier.

With these changes I've fortunately avoided the need to migrate to Pydantic v2 properly --- we're still using the v1 compatibility shim. We might not be able to hold out forever though: Pydantic (reasonably) aren't actively supporting the v1 shims. I put a lot of work into v2 migration when investigating the 3.13 support, and it's definitely challenging. In any case, it's a relief that we don't have to do the v2 migration at the same time as the Cython 3.0/Python 3.13 support.
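The backwards-compatibility hook mentioned above follows the module-level `__getattr__` pattern (PEP 562); a generic sketch with illustrative names, not the actual spaCy module contents:

```python
# In the old module location: forward attribute lookups to the new home so
# `from old_module import make_my_factory` keeps working. Names are illustrative.
import importlib

_MOVED_TO = "spacy.pipeline.factories"
_MOVED_NAMES = {"make_my_factory"}

def __getattr__(name):
    if name in _MOVED_NAMES:
        return getattr(importlib.import_module(_MOVED_TO), name)
    raise AttributeError(f"module {__name__!r} has no attribute {name!r}")
```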
2025-05-22 13:47:21 +02:00
Matthew Honnibal
911539e9a4 Update version 2025-05-18 12:18:38 +02:00
Matthew Honnibal
22c1bc785b Replace lte with lt for clarity 2025-05-18 12:18:17 +02:00
Matthew Honnibal
cb5e760e91 Fix python version supported 2025-05-18 12:17:23 +02:00
Gunther Cox
87ec2b72a5
Update spaCy Universe entry for ChatterBot to use correct name casing (#13784) 2025-05-12 07:47:50 +02:00
翟持江
aa8de0ed37
Update embeddings-transformers.mdx, update trf_data examples info in <Runtime usage> (#13811) 2025-05-12 07:47:12 +02:00
Adrien Carpentier
98a19df91a
docs: fix README.md for compatible Python versions (#13749) 2025-04-11 20:56:52 +02:00
Matthew Honnibal
92bd042502 Allow Python 3.13 2025-04-03 23:15:12 +02:00
Matthew Honnibal
d0c705cbc9 Increment version 2025-04-01 09:40:59 +02:00
Matthew Honnibal
b3c46c315e Add support for linux-arm 2025-02-03 18:32:23 +01:00
Ines Montani
d194f06437 Add live stream to site [ci skip] 2025-02-03 09:42:52 +01:00
Ines Montani
055e07d9cc Update README.md [ci skip] 2025-02-03 09:38:32 +01:00
Ines Montani
8e1c14e977 Add live stream to README [ci skip] 2025-02-03 09:37:48 +01:00
Christine P. Chai
4278182dd0
Change Twitter to X (#13740) [ci skip]
Co-authored-by: Ines Montani <ines@ines.io>
2025-02-03 09:30:21 +01:00
Matthew Honnibal
85cc763006 Fix python version requirement 2025-01-13 18:17:36 +01:00
Matthew Honnibal
ba7468e32e
Update requirements, fixing windows crashes (#13727)
* Re-enable pretraining test

* Require thinc 8.3.4

* Reformat

* Re-enable test
2025-01-13 16:39:46 +01:00
Matthew Honnibal
311f7cc9fb Set version to v3.8.4 2024-12-11 14:14:08 +01:00
Matthew Honnibal
682140496a Align requirements better 2024-12-11 14:13:51 +01:00
Matthew Honnibal
343f4f21d7 Enable Python 3.13 2024-12-11 14:13:28 +01:00
Matthew Honnibal
be0fa812c2 Update cibuildwheel 2024-12-11 13:08:40 +01:00
Matthew Honnibal
a6317b3836
Fix allocation of non-transient strings in StringStore (#13713)
* Fix bug in memory-zone code when adding non-transient strings. The error could result in segmentation faults or other memory errors within memory zones if new labels were added to the model.
* Fix handling of new morphological labels within memory zones. Addresses the second issue reported in "Memory leak of MorphAnalysis object" (#13684).
2024-12-11 13:06:53 +01:00
Ines Montani
3e30b5bef6 Add spacy-layout [ci skip] 2024-11-19 10:43:40 +01:00
Matthew Honnibal
3ecec1324c
Usage page on memory management, explaining memory zones and doc_cleaner (#13643) [ci skip]
Co-authored-by: Ines Montani <ines@ines.io>
2024-10-23 12:42:54 +02:00
Ikko Eltociear Ashimine
15fbf5ef36
docs: update rule-based-matching.mdx (#13665) [ci skip] 2024-10-23 12:07:01 +02:00
Sergei Pashakhin
1ee9a19059
Fix typo (#13657) [ci skip] 2024-10-23 12:06:36 +02:00
thjbdvlt
0d7e57fc3e
universe-pipeline-solipCysme-french (#13627) [ci skip] 2024-10-11 11:26:15 +02:00
Ines Montani
ae5c3e078d Fix universe.json [ci skip] 2024-10-11 11:24:42 +02:00
Andrei (Andrey) Khropov
8d2902b0e7
Fix misspelling (#13631) [ci skip] 2024-10-11 11:23:12 +02:00
aravind-mc
44d1906453
Update universe.json to add my spaCy online course (#13632) [ci skip] 2024-10-11 11:21:57 +02:00
sam rxh
52a4cb0d14
Fix 'issue template' link in CONTRIBUTING.md (#13587) [ci skip] 2024-10-11 11:20:34 +02:00
Ines Montani
10a6f508ab Fix landing banner links [ci skip] 2024-10-11 11:19:10 +02:00
Matthew Honnibal
bda4bb0184
Try disabling pretraining tests to probe windows ci failure (#13646) 2024-10-02 01:01:40 +02:00
Matthew Honnibal
628c973db5 Note minimum python requirement in setup.cfg 2024-10-02 00:49:09 +02:00
Matthew Honnibal
e0782c5e4c Merge branch 'master' into v3.8.x 2024-10-01 23:57:48 +02:00
Matthew Honnibal
5230754986 Fix thinc dependency 2024-10-01 23:49:17 +02:00
Matthew Honnibal
411b70f5f3 Upd requirements 2024-10-01 23:46:54 +02:00
Matthew Honnibal
08705f5a8c Upd tests 2024-10-01 22:57:25 +02:00
Matthew Honnibal
77177d0216 Upd tests workflow 2024-10-01 22:54:12 +02:00
Matthew Honnibal
5196366af5 Upd tests workflow 2024-10-01 22:53:11 +02:00
Matthew Honnibal
29232ad3b5 Upd tests workflow 2024-10-01 22:51:09 +02:00
Matthew Honnibal
dd47fbb45f Remove 'apple' extra 2024-10-01 22:24:25 +02:00
Matthew Honnibal
63f1b53c1a Check test failure 2024-10-01 16:49:49 +02:00
Matthew Honnibal
0cdcfe56cb Set version to v3.8.2 2024-10-01 16:47:24 +02:00
Matthew Honnibal
924cbc9703 Fix environment variable for test 2024-10-01 16:08:06 +02:00
Matthew Honnibal
e1d050517d Fix requirements.txt 2024-10-01 15:56:18 +02:00
Matthew Honnibal
6c038aaae0 Don't disable tests on workflow changes 2024-10-01 15:32:01 +02:00
Matthew Honnibal
f0084b9143 Fix matrix in tests 2024-10-01 15:28:22 +02:00
Matthew Honnibal
ff81bfb8db Update tests 2024-10-01 13:21:10 +02:00
Matthew Honnibal
9c5b61bdff isort 2024-10-01 12:38:51 +02:00
Matthew Honnibal
725ccbac39 Format 2024-10-01 12:38:02 +02:00
Matthew Honnibal
a8837beab7 Set version to v3.8.1 2024-10-01 12:37:11 +02:00
Matthew Honnibal
3a0aadcf86 Update spacy[apple] thinc-apple-ops pin for numpy v2 compatibility 2024-10-01 10:16:35 +02:00
DomHudson
a61a1d43cf
[Documentation] Replace broken URL in _serialization.mdx (#13641) 2024-09-30 17:45:50 +02:00
Matthew Honnibal
114b4894fb Fix --require-parent default 2024-09-29 15:50:31 +02:00
Matthew Honnibal
dec13b4258 Fix inverted cli arg 2024-09-29 15:50:05 +02:00
Matthew Honnibal
c03f060527 Allow positive option --require-parent 2024-09-29 14:30:14 +02:00
Matthew Honnibal
6255cb985f Include version constraint in parent package requirement 2024-09-29 14:22:21 +02:00
Matthew Honnibal
3b165a8716 Simplify setting to require parent package 2024-09-29 14:19:10 +02:00
Matthew Honnibal
969832f5d6 Fix package 2024-09-29 14:00:11 +02:00
Matthew Honnibal
8ce53a6bbe Syntax 2024-09-29 13:51:44 +02:00
Matthew Honnibal
6fa0d709d5 Support option to not depend on parent package in spacy package 2024-09-29 13:51:04 +02:00
Matthew Honnibal
5010fcbd3a Fix numpy constant 2024-09-14 13:13:11 +02:00
Matthew Honnibal
de4f19f3a3 Fix version 2024-09-14 13:12:44 +02:00
Matthew Honnibal
3d03565498 Replace numpy floats in evaluate and update 2024-09-14 12:55:53 +02:00
Matthew Honnibal
0576a1ff56 Fix numpy floats in meta.json 2024-09-14 12:54:08 +02:00
Matthew Honnibal
2f1e7ed09a Lint 2024-09-14 11:36:27 +02:00
Matthew Honnibal
e2dc9b79e1 Format 2024-09-14 11:29:40 +02:00
Matthew Honnibal
3c3d75015b Set version to v3.7.7 2024-09-14 11:27:32 +02:00
Matthew Honnibal
50aa3b5cbe Merge branch 'master' of https://github.com/explosion/spaCy 2024-09-14 11:09:44 +02:00
Matthew Honnibal
8266031454 Merge numpy version update 2024-09-14 11:08:35 +02:00
Matthew Honnibal
8dcc4b8daf Skip running tests on PRs 2024-09-14 11:07:23 +02:00
Matthew Honnibal
3a635d2c94 Try skipping 686 2024-09-14 00:12:49 +02:00
Matthew Honnibal
a0ce61f55a Fix thinc pin 2024-09-13 14:21:03 +02:00
Matthew Honnibal
83b4015b36 Remove aarch 2024-09-13 12:35:50 +02:00
Matthew Honnibal
419bfaf6e7 Update cibuildwheel 2024-09-13 10:44:48 +02:00
Matthew Honnibal
69ecb85fad Set version to v3.8.1 2024-09-13 10:43:40 +02:00
Matthew Honnibal
b427597fc8 Set version to v3.8.0 2024-09-11 21:32:26 +02:00
Matthew Honnibal
1869a197c9 Try enabling macos-14 for arm builds 2024-09-11 16:06:57 +02:00
Matthew Honnibal
c068e1de1b Fix dependencies 2024-09-11 15:57:52 +02:00
Matthew Honnibal
184e508d9c Update numpy pin 2024-09-11 15:57:17 +02:00
William Mattingly
30f1f33e78
Added Date spaCy to universe (#13415) [ci skip]
Co-authored-by: Ines Montani <ines@ines.io>
2024-09-10 14:29:03 +02:00
William Mattingly
f1a5ff9dba
added spacy whisper to universe (#13418) [ci skip]
Co-authored-by: Ines Montani <ines@ines.io>
2024-09-10 14:28:00 +02:00
William Mattingly
c80dacd046
added spacy annoy to universe (#13416) [ci skip]
Co-authored-by: Ines Montani <ines@ines.io>
2024-09-10 14:26:21 +02:00
William Mattingly
7fbbb2002a
updated universe for number spacy (#13424) [ci skip]
Co-authored-by: Ines Montani <ines@ines.io>
2024-09-10 14:25:23 +02:00
William Mattingly
89c1774d43
added bagpipes-spacy to universe (#13425) [ci skip]
Co-authored-by: Ines Montani <ines@ines.io>
2024-09-10 14:24:06 +02:00
thjbdvlt
081e4e385d
universe-project-presque (#13515) [ci skip]
Co-authored-by: Ines Montani <ines@ines.io>
2024-09-10 14:21:41 +02:00
thjbdvlt
0190e669c5
universe-package-quelquhui (#13514) [ci skip]
Co-authored-by: Ines Montani <ines@ines.io>
2024-09-10 14:17:33 +02:00
Oren Halvani
54dc4ee8fb
Added: Constituent-Treelib to: universe.json (#13432) [ci skip]
Co-authored-by: Halvani <>
2024-09-10 14:13:36 +02:00
William Mattingly
5a7ad5572c
added gliner-spacy to universe (#13417) [ci skip]
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
Co-authored-by: Ines Montani <ines@ines.io>
2024-09-10 14:12:52 +02:00
marinelay
b18cc94451
Delete unnecessary method (#13441)
Co-authored-by: marinelay <marinelay@gmail.com>
2024-09-09 20:57:13 +02:00
Matthew Honnibal
4cc3ebe74e Format 2024-09-09 20:56:01 +02:00
Matthew Honnibal
a019315534 Fix memory zones 2024-09-09 13:49:41 +02:00
Matthew Honnibal
59ac7e6bdb Format 2024-09-09 11:22:52 +02:00
Matthew Honnibal
b65491b641 Set version to v3.8.0.dev0 2024-09-09 11:20:23 +02:00
Matthew Honnibal
1b8d560d0e
Support 'memory zones' for user memory management (#13621)
Add a context manager nlp.memory_zone(), which will begin
memory_zone() blocks on the vocab, string store, and potentially
other components.

Example usage:

```
with nlp.memory_zone():
    for doc in nlp.pipe(texts):
        do_something(doc)
# do_something(doc) <-- Invalid
```

Once the memory_zone() block expires, spaCy will free any shared
resources that were allocated for the text-processing that occurred
within the memory_zone. If you create Doc objects within a memory
zone, it's invalid to access them once the memory zone is expired.

The purpose of this is that spaCy creates and stores Lexeme objects
in the Vocab that can be shared between multiple Doc objects. It also
interns strings. Normally, spaCy can't know when all Doc objects using
a Lexeme are out-of-scope, so new Lexemes accumulate in the vocab,
causing memory pressure.

Memory zones solve this problem by telling spaCy "okay none of the
documents allocated within this block will be accessed again". This
lets spaCy free all new Lexeme objects and other data that were
created during the block.

The mechanism is general, so memory_zone() context managers can be
added to other components that could benefit from them, e.g. pipeline
components.

I experimented with adding memory zone support to the tokenizer as well,
for its cache. However, this seems unnecessarily complicated. It makes
more sense to just stick a limit on the cache size. This lets spaCy
benefit from the efficiency advantage of the cache better, because
we can maintain a (bounded) cache even if only small batches of
documents are being processed.
2024-09-09 11:19:39 +02:00
ykyogoku
608f65ce40
add Tibetan (#13510) 2024-09-09 11:18:03 +02:00
Muzaffer Cikay
acbf2a428f
Add Kurdish Kurmanji language (#13561)
* Add Kurdish Kurmanji language

* Add lex_attrs
2024-09-09 11:15:40 +02:00
Mark Liberko
55db9c2e87
Added gd language folder (#13570)
Implemented a foundational Scottish Gaelic (gd) language option with tokenizer_exceptions and stop_words files.
2024-09-09 11:14:09 +02:00
Matthew Honnibal
319e02545c Set version to 3.7.6 2024-08-20 12:16:08 +02:00
Matthew Honnibal
a8accc3396
Use cibuildwheel to build wheels (#13603)
* Add workflow files for cibuildwheel

* Add config for cibuildwheel

* Set version for experimental prerelease

* Try updating cython

* Skip 32-bit windows builds

* Revert "Try updating cython"

This reverts commit c1b794ab5c.

* Try to import cibuildwheel settings from previous setup
2024-08-20 12:15:05 +02:00
Ines Montani
8cda27aefa Add case study [ci skip] 2024-06-26 09:41:23 +02:00
Matthew Honnibal
f78e5ce732 Disable extra CI 2024-06-21 14:32:00 +02:00
Sofie Van Landeghem
a6d0fc3602
Remove typing-extensions from requirements (#13516) 2024-05-31 19:20:46 +02:00
Sofie Van Landeghem
82fc2ecfa5
Bump version to 3.7.5 (#13493) 2024-05-15 12:11:33 +02:00
Sofie Van Landeghem
c195ca4f9c
fix docs for MorphAnalysis.__contains__ (#13433) 2024-05-02 16:46:41 +02:00
Sofie Van Landeghem
d3a232f773
Update LICENSE to include 2024 (#13472) 2024-04-30 09:17:59 +02:00
Sofie Van Landeghem
ecd85d2618
Update Typer pin and GH actions (#13471)
* update gh actions

* pin typer upperbound to 1.0.0
2024-04-29 13:28:46 +02:00
Alex Strick van Linschoten
045cd43c3f
Fix typos in docs (#13466)
* fix typos

* prettier formatting

---------

Co-authored-by: svlandeg <svlandeg@github.com>
2024-04-29 11:10:17 +02:00
Sofie Van Landeghem
74836524e3
Bump to v5 (#13470) 2024-04-29 10:36:31 +02:00
Sofie Van Landeghem
6d6c10ab9c
Fix CI (#13469)
* Remove hardcoded architecture setting

* update classifiers to include Python 3.12
2024-04-29 10:18:07 +02:00
Sofie Van Landeghem
2e2334632b
Fix use_gold_ents behaviour for EntityLinker (#13400)
* fix type annotation in docs

* only restore entities after loss calculation

* restore entities of sample in initialization

* rename overfitting function

* fix EL scorer

* Relax test

* fix formatting

* Update spacy/pipeline/entity_linker.py

Co-authored-by: Raphael Mitsch <r.mitsch@outlook.com>

* rename to _ensure_ents

* further rename

* allow for scorer to be None

---------

Co-authored-by: Raphael Mitsch <r.mitsch@outlook.com>
2024-04-16 12:00:22 +02:00
Joe Schiff
2e96797696
Convert properties to decorator syntax (#13390) 2024-04-16 11:51:14 +02:00
Sofie Van Landeghem
f5e85fa05a
allow weasel 0.4.x (#13409) 2024-04-04 12:55:08 +02:00
Yaseen
21aea59001
Update code.module.sass to make code title sticky (#13379) 2024-03-26 12:15:25 +01:00
Sofie Van Landeghem
4dc5fe5469
Renamed main branch back to v4 for now (#13395)
* Update gputests.yml

* Update slowtests.yml
2024-03-26 09:53:07 +01:00
Ines Montani
1252370f69 Move DocSearch key to env var [ci skip] 2024-03-25 10:17:57 +01:00
Sofie Van Landeghem
d410d95b52
remove smart_open requirement as it's taken care of via Weasel (#13391) 2024-03-22 18:21:20 +01:00
Matthew Honnibal
0518c36f04
Sanitize direct download (#13313)
The 'direct' option in 'spacy download' is supposed to only download from our model releases repository. However, users were able to pass in a relative path, allowing download from arbitrary repositories. This meant that a service that sourced strings from user input and which used the direct option would allow users to install arbitrary packages.
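The kind of check that closes this hole can be sketched as below (an illustration of the idea only, not spaCy's actual validation code or regex):

```python
import re

# Accept only "name-version" strings so a value such as "../../other/repo"
# cannot steer the download towards an arbitrary package. Illustrative pattern.
DIRECT_RE = re.compile(r"^[a-z][a-z0-9_]*-\d+\.\d+\.\d+$")

def is_valid_direct_target(name: str) -> bool:
    return DIRECT_RE.fullmatch(name) is not None

print(is_valid_direct_target("en_core_web_sm-3.8.0"))   # True
print(is_valid_direct_target("../evil/pkg-1.0.0"))      # False
```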
2024-02-20 13:17:51 +01:00
Daniël de Kok
bff8725f4b
Set version to 3.7.4 (#13327) 2024-02-14 14:46:28 +01:00
Daniël de Kok
fdfdbcd9f4
Make Language.pipe workers exit cleanly (#13321)
Also warn when any worker exited with a non-zero exit code and modify
test to ensure that workers exit cleanly by default.
2024-02-12 14:39:38 +01:00
Daniël de Kok
14bd9d89a3
Update example that shows model in requirements (#13302)
See #13293.
2024-02-11 19:46:43 +01:00
Daniël de Kok
e1249d3722
Test if closing explicitly solves recursive lock issues (#13304) 2024-02-05 10:07:03 +01:00
Daniël de Kok
40422ff904
Set version to 3.7.3 (#13301) 2024-02-02 13:51:26 +01:00
Daniël de Kok
2dbb332cea
TextCatParametricAttention.v1: set key transform dimensions (#13249)
* TextCatParametricAttention.v1: set key transform dimensions

This is necessary for tok2vec implementations that initialize
lazily (e.g. curated transformers).

* Add lazily-initialized tok2vec to simulate transformers

Add a lazily-initialized tok2vec to the tests and test the current
textcat models with it.

Fix some additional issues found using this test.

* isort

* Add `test.` prefix to `LazyInitTok2Vec.v1`
2024-02-02 13:01:59 +01:00
Daniël de Kok
d84068e460
Run slow tests: v4 -> main (#13290)
* Run slow tests: v4 -> main

* Also update the branch in GPU tests
2024-01-30 13:58:28 +01:00
Sofie Van Landeghem
89a43f39b7
update universe description (#13291) 2024-01-30 13:49:49 +01:00
Daniël de Kok
68d7841df5
Extension serialization attr tests: add teardown (#13284)
The doc/token extension serialization tests add extensions that are not
serializable with pickle. This didn't cause issues before due to the
implicit run order of tests. However, test ordering has changed with
pytest 8.0.0, leading to failed tests in test_language.

Update the fixtures in the extension serialization tests to do proper
teardown and remove the extensions.
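The teardown pattern looks roughly like this (a sketch; the extension name and fixture are illustrative, not the ones in the actual test suite):

```python
import pytest
from spacy.tokens import Doc
from spacy.vocab import Vocab

@pytest.fixture
def doc_with_ext():
    # Register a (possibly unpicklable) extension for the test...
    Doc.set_extension("test_attr", default=None)
    yield Doc(Vocab(), words=["hello", "world"])
    # ...and remove it again so it can't leak into later tests,
    # regardless of the order pytest runs them in.
    Doc.remove_extension("test_attr")
```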
2024-01-29 13:51:56 +01:00
Eliana Vornov
00e938a7c3
add custom code support to CLI speed benchmark (#13247)
* add custom code support to CLI speed benchmark

* sort imports

* better copying for warmup docs
2024-01-26 13:29:22 +01:00
Sofie Van Landeghem
68b85ea950
Clarify data_path loading for apply CLI command (#13272)
* attempt to clarify additional annotations on .spacy file

* suggestion by Daniël

* pipeline instead of pipe
2024-01-26 12:10:05 +01:00
Sofie Van Landeghem
7496e03a2c
Clarify vocab docs (#13273)
* add line to ensure that apple is in fact in the vocab

* add that the vocab may be empty
2024-01-26 10:58:48 +01:00
Sofie Van Landeghem
a493981163
fix typo (#13254) 2024-01-24 09:29:57 +01:00
Daniël de Kok
a8894a8946
Merge pull request #13240 from mauricesvp/patch-1
Fix typo in method name
2024-01-23 20:49:21 +01:00
Daniël de Kok
afac7fb650
test_find_available_port: use port 5001 (#13255)
macOS now uses port 5000 for the AirPlay receiver functionality, so this
test will always fail on a macOS desktop (unless AirPlay receiver
functionality is disabled like in CI).
2024-01-23 20:11:16 +01:00
Daniël de Kok
5a2ad4af4b Merge remote-tracking branch 'upstream/master' into patch-1 2024-01-23 19:53:20 +01:00
Daniël de Kok
128197a5fc
Properly clean up pipe multiprocessing workers (#13259)
Before this change, the workers of a `pipe` call with n_process != 1 were
stopped by calling `terminate` on the processes. However, terminating a
process can leave queues, pipes, and other concurrent data structures in
an invalid state.

With this change, we stop using terminate and take the following approach
instead:

* When all documents are processed, the parent process puts a
  sentinel in the queue of each worker.
* The parent process then calls `join` on each worker process to
  let them finish up gracefully.
* Worker processes break from the queue processing loop when the
  sentinel is encountered, so that they exit.

We need special handling when one of the workers encounters an error and
the error handler is set to raise an exception. In this case, we cannot
rely on the sentinel to finish all workers -- the queue is a FIFO queue
and there may be other work queued up before the sentinel. We use the
following approach to handle error scenarios:

* The parent puts the end-of-work sentinel in the queue of each worker.
* The parent closes the reading-end of the channel of each worker.
* Then:
  - If the worker was waiting for work, it will encounter the sentinel
    and break from the processing loop.
  - If the worker was processing a batch, it will attempt to write
    results to the channel. This will fail because the channel was
    closed by the parent and the worker will break from the processing
    loop.
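A standalone sketch of the sentinel-and-join pattern described above (generic multiprocessing code, not spaCy's actual worker implementation):

```python
import multiprocessing as mp

SENTINEL = None  # end-of-work marker put in each worker's queue

def worker(queue):
    while True:
        batch = queue.get()
        if batch is SENTINEL:
            break  # exit the processing loop instead of being terminated
        # ... process the batch and send results back over a channel ...

if __name__ == "__main__":
    queue = mp.Queue()
    workers = [mp.Process(target=worker, args=(queue,)) for _ in range(2)]
    for proc in workers:
        proc.start()
    for batch in (["doc one"], ["doc two"]):
        queue.put(batch)
    for _ in workers:
        queue.put(SENTINEL)  # one sentinel per worker
    for proc in workers:
        proc.join()  # let each worker finish up gracefully
```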
2024-01-23 18:33:04 +01:00
Raphael Mitsch
3b3b5cdc63
Merge pull request #13253 from explosion/chore/sync-master-with-llm_main
Sync `master` with `docs/llm_main`
2024-01-19 16:50:43 +01:00
Raphael Mitsch
575c405ae3 Fix LLM docs on task factories. 2024-01-19 16:48:54 +01:00
Raphael Mitsch
256468c414 Merge branch 'docs/llm_main' into chore/sync-master-with-llm_main
# Conflicts:
#	website/docs/api/large-language-models.mdx
2024-01-19 16:34:35 +01:00
Raphael Mitsch
91c24c0285
Merge pull request #13251 from explosion/docs/llm_develop
Sync `docs/llm_main` with `docs/llm_develop`
2024-01-19 12:56:38 +01:00
maurice
c608baeecc
Fix typo in method name 2024-01-16 21:54:54 +01:00
Raphael Mitsch
0062c22c35
Updated docs w.r.t. infinite doc length changes (#13214)
* Updated docs w.r.t. infinite doc length.

* Fix typo.

* fix typo's

* Fix table formatting.

* Update formatting.

---------

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
2024-01-05 14:20:58 +01:00
Daniël de Kok
e2a3952de5
Add spacy.TextCatParametricAttention.v1 (#13201)
* Add spacy.TextCatParametricAttention.v1

This layer is a simplification of the ensemble classifier that
only uses parametric attention. We have found empirically that with a
sufficient amount of training data, using the ensemble classifier with
BoW does not provide significant improvement in classifier accuracy.
However, plugging in a BoW classifier does reduce GPU training and
inference performance substantially, since it uses a GPU-only kernel.

* Fix merge fallout
2024-01-02 10:03:06 +01:00
Daniël de Kok
7ebba86402
Add TextCatReduce.v1 (#13181)
* Add TextCatReduce.v1

This is a textcat classifier that pools the vectors generated by a
tok2vec implementation and then applies a classifier to the pooled
representation. Three reductions are supported for pooling: first, max,
and mean. When multiple reductions are enabled, the reductions are
concatenated before providing them to the classification layer.

This model is a generalization of the TextCatCNN model, which only
supports mean reductions and is a bit of a misnomer, because it can also
be used with transformers. This change also reimplements TextCatCNN.v2
using the new TextCatReduce.v1 layer.

* Doc fixes

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Fully specify `TextCatCNN` <-> `TextCatReduce` equivalence

* Move TextCatCNN docs to legacy, in prep for moving to spacy-legacy

* Add back a test for TextCatCNN.v2

* Replace TextCatCNN in pipe configurations and templates

* Add an infobox to the `TextCatReduce` section with an `TextCatCNN` anchor

* Add last reduction (`use_reduce_last`)

* Remove non-working TextCatCNN Netlify redirect

* Revert layer changes for the quickstart

* Revert one more quickstart change

* Remove unused import

* Fix docstring

* Fix setting name in error message

---------

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
2023-12-21 11:00:06 +01:00
Steven Crowther
764be103bc
update README to include links to GPU processing, LLMs, and the spaCy blog. (#13197)
* Update README.md to include links for GPU processing, LLM, and spaCy's blog.

* Create ojo4f3.md

* corrected README to most current version with links to GPU processing, LLM's, and the spaCy blog.

* Delete .github/contributors/ojo4f3.md

* changed LLM icon

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

* Apply suggestions from code review

---------

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
2023-12-18 09:49:07 +01:00
Sofie Van Landeghem
56fc3bc0f3
Type documentation fixes for Doc (#13187)
* correct char_span output type - can be None

* unify type of exclude parameter

* black

* further fixes to from_dict and to_dict

* formatting
2023-12-18 09:00:47 +01:00
Ines Montani
7df328fbfe
Update README.md [ci skip] 2023-12-12 10:19:57 +01:00
Raphael Mitsch
d56ee65ddf
Document spacy-llm's TranslationTask (#13183)
* Describe translation task.

* Fix references to examples and template.

* Format.
2023-12-11 17:41:04 +01:00
Raphael Mitsch
e79a9c5acd
Document spacy-llm's RawTask (#13180)
* Add section on RawTask.

* Fix API docs.

* Update website/docs/api/large-language-models.mdx

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

---------

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
2023-12-11 17:14:12 +01:00
Ines Montani
8cfccdd2f8
Update links [ci skip] 2023-12-11 15:51:43 +01:00
Ines Montani
f78b91c03b
Update links [ci skip] 2023-12-11 15:51:01 +01:00
Raphael Mitsch
9fcd2bfa08
Add info on endpoint arg. (#13169) 2023-12-05 12:46:29 +01:00
Raphael Mitsch
a25a3b996b
Merge pull request #13173 from explosion/docs/llm_main
Sync `llm_develop` with `llm_main`
2023-12-04 16:46:21 +01:00
Raphael Mitsch
55ed2b4e82
Add documentation for EL task (#12988)
* Add documentation for EL task.

* Fix EL factory name.

* Add llm_entity_linker_mentio.

* Apply suggestions from code review

Co-authored-by: Madeesh Kannan <shadeMe@users.noreply.github.com>

* Update EL task docs.

* Update EL task docs.

* Update EL task docs.

* Update EL task docs.

* Update EL task docs.

* Update EL task docs.

* Update EL task docs.

* Update EL task docs.

* Update EL task docs.

* Apply suggestions from code review

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Incorporate feedback.

* Format.

* Fix link to KB data.

---------

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
Co-authored-by: Madeesh Kannan <shadeMe@users.noreply.github.com>
2023-12-04 15:23:28 +01:00
Adriane Boyd
e467573550
Docs: update trf_data examples and pipeline design info (#13164) 2023-12-04 15:15:54 +01:00
Raphael Mitsch
0e43fca036
Add Claude-2.1 mention. (#13167) 2023-12-01 16:48:35 +01:00
Daniël de Kok
da7ad97519
Update TextCatBOW to use the fixed SparseLinear layer (#13149)
* Update `TextCatBOW` to use the fixed `SparseLinear` layer

A while ago, we fixed the `SparseLinear` layer to use all available
parameters: https://github.com/explosion/thinc/pull/754

This change updates `TextCatBOW` to `v3` which uses the new
`SparseLinear_v2` layer. This results in a sizeable improvement on a
text categorization task that was tested.

While at it, this `spacy.TextCatBOW.v3` also adds the `length_exponent`
option to make it possible to change the hidden size. Ideally, we'd just
have an option called `length`. But the way that `TextCatBOW` uses
hashes results in a non-uniform distribution of parameters when the
length is not a power of two.

* Replace TextCatBOW `length_exponent` parameter by `length`

We now round up the length to the next power of two if it isn't
already a power of two (see the sketch after this list).

* Remove some tests for TextCatBOW.v2

* Fix missing import
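A small illustration of the rounding rule mentioned above (the helper name is made up; it only shows the arithmetic, not the actual TextCatBOW code):

```python
def next_power_of_two(length: int) -> int:
    power = 1
    while power < length:
        power *= 2
    return power

print(next_power_of_two(300))  # 512 - a requested length of 300 is rounded up
print(next_power_of_two(256))  # 256 - powers of two are kept as-is
```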
2023-11-29 09:11:54 +01:00
Ines Montani
bf7c2ea99a
Add merch link [ci skip] 2023-11-22 12:55:00 +01:00
Ines Montani
8f69e56a5a Add swag [ci skip] 2023-11-20 14:42:01 +01:00
Lise
b6e022381d
Feature/nn and fo language extensions (#13116)
* add language extensions for norwegian nynorsk and faroese

* update docstring for nn/examples.py

* use relative imports

* add fo and nn tokenizers to pytest fixtures

* add unittests for fo and nn and fix bug in nn

* remove module docstring from fo/__init__.py

* add comments about example sentences' origin

* add license information to faroese data credit

* format unittests using black

* add __init__ files to test/lang/nn and tests/lang/fo

* fix import order and use relative imports in fo/__init__.py and nn/__init__.py

* Make the tests a bit more compact

* Add fo and nn to website languages

* Add note about jul.

* Add "jul." as exception

---------

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
2023-11-20 07:49:59 +01:00
ajbond
9f2ce6bb00
Add Redfield NLP Nodes to the Spacy Universe (#13133) 2023-11-17 09:48:02 +01:00
Madeesh Kannan
bd2c17e206
Warn about reloading dependencies after downloading models (#13081)
* Update the "Missing factory" error message

This accounts for model installations that took place during the current Python session.

* Add a note about Jupyter notebooks

* Move error to `spacy.cli.download`
Add extra message for Jupyter sessions

* Add additional note for interactive sessions

* Remove note about `spacy-transformers` from error message

* `isort`

* Improve checks for colab (also helps displacy)

* Update warning messages

* Improve flow for multiple checks

---------

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
2023-11-10 08:05:07 +01:00
Raphael Mitsch
b2e831d966
LLM docs: OpenAI model update (#13119)
* Update supported OpenAI models.

* Update with new GPT-3.5 and GPT-4 versions.

* Add links to OpenAI model docs.
2023-11-08 17:55:16 +01:00
Adriane Boyd
513bbd5fa3
Add preferred use of build for package CLI (#13109)
Build with `build` if available. Warn and fall back to previous
`setup.py`-based builds if the `build`-based build fails.
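The fallback behaviour can be pictured roughly as below (an illustrative sketch, not the code in `spacy.cli.package`):

```python
import subprocess
import sys

def build_package(package_dir: str) -> None:
    try:
        # Preferred: PEP 517 build via the `build` package.
        subprocess.run([sys.executable, "-m", "build", package_dir], check=True)
    except (subprocess.CalledProcessError, FileNotFoundError):
        print("`build` failed or is unavailable, falling back to setup.py", file=sys.stderr)
        subprocess.run(
            [sys.executable, "setup.py", "sdist", "bdist_wheel"],
            cwd=package_dir,
            check=True,
        )
```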
2023-11-08 17:35:24 +01:00
Ridge Kimani
2b8da84717
feat: add extra lexical attributes (#13106)
Co-authored-by: Ridge Kimani <ridgekimani@gmail.com>
2023-11-08 17:29:11 +01:00
Adriane Boyd
0c25725359
Update Tokenizer.explain for special cases with whitespace (#13086)
* Update Tokenizer.explain for special cases with whitespace

Update `Tokenizer.explain` to skip special case matches if the exact
text has not been matched due to intervening whitespace (a usage sketch
follows this list).

Enable fuzzy `Tokenizer.explain` tests with additional whitespace
normalization.

* Add unit test for special cases with whitespace, xfail fuzzy tests again
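For reference, `Tokenizer.explain` pairs each produced token with the rule or special case that generated it; a quick usage sketch (the example text is illustrative):

```python
import spacy

nlp = spacy.blank("en")
# Each entry is (rule name, token text), e.g. ("SPECIAL-1", "Let"), ("SUFFIX", "!").
for rule, token_text in nlp.tokenizer.explain("Let's go to N.Y.!"):
    print(rule, repr(token_text))
```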
2023-11-06 17:29:59 +01:00
Adriane Boyd
ff9ddb6a07
Unskip python 3.12 remote tests (#13110) 2023-11-06 11:59:45 +01:00
Adriane Boyd
c096c5c0c9
Update for numpy 2.0 deprecations (#13103)
- Replace `np.trapz` with vendored `trapezoid` from scipy
- Replace `np.float_` with `np.float64`
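The replacements amount to the following (shown with SciPy's `trapezoid` for illustration; spaCy vendors its own copy in `scorer.py`):

```python
import numpy as np
from scipy.integrate import trapezoid  # replacement for the deprecated np.trapz

x = np.array([0.0, 0.5, 1.0])
y = np.array([0.0, 0.8, 1.0])

area = trapezoid(y, x)     # was: np.trapz(y, x)
score = np.float64(0.75)   # was: np.float_(0.75)
print(area, score)
```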
2023-11-06 08:47:53 +01:00
Adriane Boyd
92f1d0a195
CI: Switch to stable python 3.12 and limit 3.11 runs (#13104) 2023-11-03 15:46:03 +01:00
Raphael Mitsch
c4e2daf6ef
Fix displacy span stacking (#13068)
* Fix displacy span stacking.

* Format. Remove counter.

* Remove test files.

* Add unit test. Refactor to allow for unit test.

* Fix off-by-one error in tests.
2023-11-02 12:02:18 +01:00
Sofie Van Landeghem
a804b83a4b
Update llm docs to clarify task-specific factories (#13082)
* fix typo

* add examples to specify custom model for task-specific factory
2023-10-31 22:07:07 +01:00
Sofie Van Landeghem
48248c62b6
Clarify EL example in docs (#13071)
* add comment that pipeline is a custom one

* add link to NEL tutorial

* prettier

* revert prettier reformat

* revert prettier reformat (2)

* fix typo

Co-authored-by: Raphael Mitsch <r.mitsch@outlook.com>

---------

Co-authored-by: Raphael Mitsch <r.mitsch@outlook.com>
2023-10-31 21:58:29 +01:00
Raphael Mitsch
0c15876502
Fix spancat typo. (#13095) 2023-10-31 13:45:10 +01:00
Raphael Mitsch
9deaac9786
Add note in docs on score_weight config if using a non-default spans_key for SpanCat (#13093)
* Add note on score_weight if using a non-default span_key for SpanCat.

* Fix formatting.

* Fix formatting.

* Fix typo.

* Use warning infobox.

* Fix infobox formatting.
2023-10-30 17:02:08 +01:00
Sofie Van Landeghem
d717123819
Update LICENSE (#13078) 2023-10-23 11:59:18 +02:00
Raphael Mitsch
df07c4734b
Merge pull request #13046 from explosion/docs/llm_main
Sync `docs/llm_develop` with `docs/llm_main`
2023-10-05 16:31:20 +02:00
Raphael Mitsch
030d63ad73
Merge pull request #13045 from explosion/master
Sync `docs/llm_main` with `master`
2023-10-05 16:28:19 +02:00
225 changed files with 11708 additions and 2822 deletions

.github/FUNDING.yml (new file, +1)

@@ -0,0 +1 @@
custom: [https://explosion.ai/merch, https://explosion.ai/tailored-solutions]

.github/workflows/cibuildwheel.yml (new file, +99)

@@ -0,0 +1,99 @@
name: Build
on:
push:
tags:
# ytf did they invent their own syntax that's almost regex?
# ** matches 'zero or more of any character'
- 'release-v[0-9]+.[0-9]+.[0-9]+**'
- 'prerelease-v[0-9]+.[0-9]+.[0-9]+**'
jobs:
build_wheels:
name: Build wheels on ${{ matrix.os }}
runs-on: ${{ matrix.os }}
strategy:
matrix:
# macos-13 is an intel runner, macos-14 is apple silicon
os: [ubuntu-latest, windows-latest, macos-13, macos-14, ubuntu-24.04-arm]
steps:
- uses: actions/checkout@v4
# aarch64 (arm) is built via qemu emulation
# QEMU is sadly too slow. We need to wait for public ARM support
#- name: Set up QEMU
# if: runner.os == 'Linux'
# uses: docker/setup-qemu-action@v3
# with:
# platforms: all
- name: Build wheels
uses: pypa/cibuildwheel@v2.21.3
env:
CIBW_ARCHS_LINUX: auto
with:
package-dir: .
output-dir: wheelhouse
config-file: "{package}/pyproject.toml"
- uses: actions/upload-artifact@v4
with:
name: cibw-wheels-${{ matrix.os }}-${{ strategy.job-index }}
path: ./wheelhouse/*.whl
build_sdist:
name: Build source distribution
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v4
- name: Build sdist
run: pipx run build --sdist
- uses: actions/upload-artifact@v4
with:
name: cibw-sdist
path: dist/*.tar.gz
create_release:
needs: [build_wheels, build_sdist]
runs-on: ubuntu-latest
permissions:
contents: write
checks: write
actions: read
issues: read
packages: write
pull-requests: read
repository-projects: read
statuses: read
steps:
- name: Get the tag name and determine if it's a prerelease
id: get_tag_info
run: |
FULL_TAG=${GITHUB_REF#refs/tags/}
if [[ $FULL_TAG == release-* ]]; then
TAG_NAME=${FULL_TAG#release-}
IS_PRERELEASE=false
elif [[ $FULL_TAG == prerelease-* ]]; then
TAG_NAME=${FULL_TAG#prerelease-}
IS_PRERELEASE=true
else
echo "Tag does not match expected patterns" >&2
exit 1
fi
echo "FULL_TAG=$TAG_NAME" >> $GITHUB_ENV
echo "TAG_NAME=$TAG_NAME" >> $GITHUB_ENV
echo "IS_PRERELEASE=$IS_PRERELEASE" >> $GITHUB_ENV
- uses: actions/download-artifact@v4
with:
# unpacks all CIBW artifacts into dist/
pattern: cibw-*
path: dist
merge-multiple: true
- name: Create Draft Release
id: create_release
uses: softprops/action-gh-release@v2
if: startsWith(github.ref, 'refs/tags/')
env:
GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
with:
name: ${{ env.TAG_NAME }}
draft: true
prerelease: ${{ env.IS_PRERELEASE }}
files: "./dist/*"

View File

@@ -15,7 +15,7 @@ jobs:
  env:
  GITHUB_CONTEXT: ${{ toJson(github) }}
  run: echo "$GITHUB_CONTEXT"
- - uses: actions/checkout@v3
+ - uses: actions/checkout@v4
  - uses: actions/setup-python@v4
  - name: Install and run explosion-bot
  run: |

View File

@@ -16,7 +16,7 @@ jobs:
  if: github.repository_owner == 'explosion'
  runs-on: ubuntu-latest
  steps:
- - uses: dessant/lock-threads@v4
+ - uses: dessant/lock-threads@v5
  with:
  process-only: 'issues'
  issue-inactive-days: '30'

.github/workflows/publish_pypi.yml (new file, +29)

@@ -0,0 +1,29 @@
# The cibuildwheel action triggers on creation of a release, this
# triggers on publication.
# The expected workflow is to create a draft release and let the wheels
# upload, and then hit 'publish', which uploads to PyPi.
on:
release:
types:
- published
jobs:
upload_pypi:
runs-on: ubuntu-latest
environment:
name: pypi
url: https://pypi.org/p/spacy
permissions:
id-token: write
contents: read
if: github.event_name == 'release' && github.event.action == 'published'
# or, alternatively, upload to PyPI on every tag starting with 'v' (remove on: release above to use this)
# if: github.event_name == 'push' && startsWith(github.ref, 'refs/tags/v')
steps:
- uses: robinraju/release-downloader@v1
with:
tag: ${{ github.event.release.tag_name }}
fileName: '*'
out-file-path: 'dist'
- uses: pypa/gh-action-pypi-publish@release/v1

View File

@@ -14,7 +14,7 @@ jobs:
  runs-on: ubuntu-latest
  steps:
  - name: Checkout
- uses: actions/checkout@v3
+ uses: actions/checkout@v4
  with:
  ref: ${{ matrix.branch }}
  - name: Get commits from past 24 hours

View File

@@ -18,7 +18,7 @@ jobs:
  run: |
  echo "$GITHUB_CONTEXT"
- - uses: actions/checkout@v3
+ - uses: actions/checkout@v4
  - uses: actions/setup-python@v4
  with:
  python-version: '3.10'

View File

@@ -2,6 +2,8 @@ name: tests
  on:
  push:
+ tags-ignore:
+ - '**'
  branches-ignore:
  - "spacy.io"
  - "nightly.spacy.io"
@@ -10,7 +12,6 @@ on:
  - "*.md"
  - "*.mdx"
  - "website/**"
- - ".github/workflows/**"
  pull_request:
  types: [opened, synchronize, reopened, edited]
  paths-ignore:
@@ -25,13 +26,12 @@ jobs:
  runs-on: ubuntu-latest
  steps:
  - name: Check out repo
- uses: actions/checkout@v3
+ uses: actions/checkout@v4
  - name: Configure Python version
  uses: actions/setup-python@v4
  with:
- python-version: "3.7"
+ python-version: "3.10"
- architecture: x64
  - name: black
  run: |
@@ -45,11 +45,12 @@ jobs:
  run: |
  python -m pip install flake8==5.0.4
  python -m flake8 spacy --count --select=E901,E999,F821,F822,F823,W605 --show-source --statistics
- - name: cython-lint
- run: |
- python -m pip install cython-lint -c requirements.txt
- # E501: line too log, W291: trailing whitespace, E266: too many leading '#' for block comment
- cython-lint spacy --ignore E501,W291,E266
+ # Unfortunately cython-lint isn't working after the shift to Cython 3.
+ #- name: cython-lint
+ # run: |
+ # python -m pip install cython-lint -c requirements.txt
+ # # E501: line too log, W291: trailing whitespace, E266: too many leading '#' for block comment
+ # cython-lint spacy --ignore E501,W291,E266
  tests:
  name: Test
@@ -58,28 +59,18 @@ jobs:
  fail-fast: true
  matrix:
  os: [ubuntu-latest, windows-latest, macos-latest]
- python_version: ["3.11", "3.12.0-rc.2"]
+ python_version: ["3.9", "3.12", "3.13"]
- include:
- - os: windows-latest
- python_version: "3.7"
- - os: macos-latest
- python_version: "3.8"
- - os: ubuntu-latest
- python_version: "3.9"
- - os: windows-latest
- python_version: "3.10"
  runs-on: ${{ matrix.os }}
  steps:
  - name: Check out repo
- uses: actions/checkout@v3
+ uses: actions/checkout@v4
  - name: Configure Python version
  uses: actions/setup-python@v4
  with:
  python-version: ${{ matrix.python_version }}
- architecture: x64
  - name: Install dependencies
  run: |
@@ -157,7 +148,9 @@ jobs:
  - name: "Test assemble CLI"
  run: |
  python -c "import spacy; config = spacy.util.load_config('ner.cfg'); config['components']['ner'] = {'source': 'ca_core_news_sm'}; config.to_disk('ner_source_sm.cfg')"
- PYTHONWARNINGS="error,ignore::DeprecationWarning" python -m spacy assemble ner_source_sm.cfg output_dir
+ python -m spacy assemble ner_source_sm.cfg output_dir
+ env:
+ PYTHONWARNINGS: "error,ignore::DeprecationWarning"
  if: matrix.python_version == '3.9'
  - name: "Test assemble CLI vectors warning"

View File

@@ -20,13 +20,12 @@ jobs:
  runs-on: ubuntu-latest
  steps:
  - name: Check out repo
- uses: actions/checkout@v3
+ uses: actions/checkout@v4
  - name: Configure Python version
  uses: actions/setup-python@v4
  with:
  python-version: "3.7"
- architecture: x64
  - name: Validate website/meta/universe.json
  run: |

View File

@@ -35,7 +35,7 @@ so that more people can benefit from it.
  When opening an issue, use a **descriptive title** and include your
  **environment** (operating system, Python version, spaCy version). Our
- [issue template](https://github.com/explosion/spaCy/issues/new) helps you
+ [issue templates](https://github.com/explosion/spaCy/issues/new/choose) help you
  remember the most important details to include. If you've discovered a bug, you
  can also submit a [regression test](#fixing-bugs) straight away. When you're
  opening an issue to report the bug, simply refer to your pull request in the
@@ -449,13 +449,12 @@ and plugins in spaCy v3.0, and we can't wait to see what you build with it!
  [`spacy`](https://github.com/topics/spacy?o=desc&s=stars) and
  [`spacy-extensions`](https://github.com/topics/spacy-extension?o=desc&s=stars)
  to make it easier to find. Those are also the topics we're linking to from the
- spaCy website. If you're sharing your project on Twitter, feel free to tag
- [@spacy_io](https://twitter.com/spacy_io) so we can check it out.
- - Once your extension is published, you can open an issue on the
- [issue tracker](https://github.com/explosion/spacy/issues) to suggest it for the
- [resources directory](https://spacy.io/usage/resources#extensions) on the
- website.
+ spaCy website. If you're sharing your project on X, feel free to tag
+ [@spacy_io](https://x.com/spacy_io) so we can check it out.
+ - Once your extension is published, you can open a
+ [PR](https://github.com/explosion/spaCy/pulls) to suggest it for the
+ [Universe](https://spacy.io/universe) page.
  📖 **For more tips and best practices, see the [checklist for developing spaCy extensions](https://spacy.io/usage/processing-pipelines#extensions).**

View File

@@ -1,6 +1,6 @@
  The MIT License (MIT)
- Copyright (C) 2016-2022 ExplosionAI GmbH, 2016 spaCy GmbH, 2015 Matthew Honnibal
+ Copyright (C) 2016-2024 ExplosionAI GmbH, 2016 spaCy GmbH, 2015 Matthew Honnibal
  Permission is hereby granted, free of charge, to any person obtaining a copy
  of this software and associated documentation files (the "Software"), to deal

View File

@@ -4,5 +4,6 @@ include README.md
  include pyproject.toml
  include spacy/py.typed
  recursive-include spacy/cli *.yml
+ recursive-include spacy/tests *.json
  recursive-include licenses *
  recursive-exclude spacy *.cpp

View File

@@ -16,7 +16,7 @@ model packaging, deployment and workflow management. spaCy is commercial
  open-source software, released under the
  [MIT license](https://github.com/explosion/spaCy/blob/master/LICENSE).
- 💫 **Version 3.7 out now!**
+ 💫 **Version 3.8 out now!**
  [Check out the release notes here.](https://github.com/explosion/spaCy/releases)
  [![tests](https://github.com/explosion/spaCy/actions/workflows/tests.yml/badge.svg)](https://github.com/explosion/spaCy/actions/workflows/tests.yml)
@@ -28,7 +28,6 @@ open-source software, released under the
  <br />
  [![PyPi downloads](https://static.pepy.tech/personalized-badge/spacy?period=total&units=international_system&left_color=grey&right_color=orange&left_text=pip%20downloads)](https://pypi.org/project/spacy/)
  [![Conda downloads](https://img.shields.io/conda/dn/conda-forge/spacy?label=conda%20downloads)](https://anaconda.org/conda-forge/spacy)
- [![spaCy on Twitter](https://img.shields.io/twitter/follow/spacy_io.svg?style=social&label=Follow)](https://twitter.com/spacy_io)
  ## 📖 Documentation
@@ -39,28 +38,37 @@ open-source software, released under the
  | 🚀 **[New in v3.0]** | New features, backwards incompatibilities and migration guide. |
  | 🪐 **[Project Templates]** | End-to-end workflows you can clone, modify and run. |
  | 🎛 **[API Reference]** | The detailed reference for spaCy's API. |
+ | ⏩ **[GPU Processing]** | Use spaCy with CUDA-compatible GPU processing. |
  | 📦 **[Models]** | Download trained pipelines for spaCy. |
+ | 🦙 **[Large Language Models]** | Integrate LLMs into spaCy pipelines. |
  | 🌌 **[Universe]** | Plugins, extensions, demos and books from the spaCy ecosystem. |
  | ⚙️ **[spaCy VS Code Extension]** | Additional tooling and features for working with spaCy's config files. |
  | 👩‍🏫 **[Online Course]** | Learn spaCy in this free and interactive online course. |
+ | 📰 **[Blog]** | Read about current spaCy and Prodigy development, releases, talks and more from Explosion. |
  | 📺 **[Videos]** | Our YouTube channel with video tutorials, talks and more. |
+ | 🔴 **[Live Stream]** | Join Matt as he works on spaCy and chat about NLP, live every week. |
  | 🛠 **[Changelog]** | Changes and version history. |
  | 💝 **[Contribute]** | How to contribute to the spaCy project and code base. |
- | <a href="https://explosion.ai/spacy-tailored-pipelines"><img src="https://user-images.githubusercontent.com/13643239/152853098-1c761611-ccb0-4ec6-9066-b234552831fe.png" width="125" alt="spaCy Tailored Pipelines"/></a> | Get a custom spaCy pipeline, tailor-made for your NLP problem by spaCy's core developers. Streamlined, production-ready, predictable and maintainable. Start by completing our 5-minute questionnaire to tell us what you need and we'll be in touch! **[Learn more &rarr;](https://explosion.ai/spacy-tailored-pipelines)** |
- | <a href="https://explosion.ai/spacy-tailored-analysis"><img src="https://user-images.githubusercontent.com/1019791/206151300-b00cd189-e503-4797-aa1e-1bb6344062c5.png" width="125" alt="spaCy Tailored Pipelines"/></a> | Bespoke advice for problem solving, strategy and analysis for applied NLP projects. Services include data strategy, code reviews, pipeline design and annotation coaching. Curious? Fill in our 5-minute questionnaire to tell us what you need and we'll be in touch! **[Learn more &rarr;](https://explosion.ai/spacy-tailored-analysis)** |
+ | 👕 **[Swag]** | Support us and our work with unique, custom-designed swag! |
+ | <a href="https://explosion.ai/tailored-solutions"><img src="https://github.com/explosion/spaCy/assets/13643239/36d2a42e-98c0-4599-90e1-788ef75181be" width="150" alt="Tailored Solutions"/></a> | Custom NLP consulting, implementation and strategic advice by spaCy's core development team. Streamlined, production-ready, predictable and maintainable. Send us an email or take our 5-minute questionnaire, and we'll be in touch! **[Learn more &rarr;](https://explosion.ai/tailored-solutions)** |
  [spacy 101]: https://spacy.io/usage/spacy-101
  [new in v3.0]: https://spacy.io/usage/v3
  [usage guides]: https://spacy.io/usage/
  [api reference]: https://spacy.io/api/
+ [gpu processing]: https://spacy.io/usage#gpu
  [models]: https://spacy.io/models
+ [large language models]: https://spacy.io/usage/large-language-models
  [universe]: https://spacy.io/universe
  [spacy vs code extension]: https://github.com/explosion/spacy-vscode
  [videos]: https://www.youtube.com/c/ExplosionAI
+ [live stream]: https://www.youtube.com/playlist?list=PLBmcuObd5An5_iAxNYLJa_xWmNzsYce8c
  [online course]: https://course.spacy.io
+ [blog]: https://explosion.ai
  [project templates]: https://github.com/explosion/projects
  [changelog]: https://spacy.io/usage#changelog
  [contribute]: https://github.com/explosion/spaCy/blob/master/CONTRIBUTING.md
+ [swag]: https://explosion.ai/merch
@@ -72,13 +80,14 @@ more people can benefit from it.
  | Type | Platforms |
  | ------------------------------- | --------------------------------------- |
  | 🚨 **Bug Reports** | [GitHub Issue Tracker] |
- | 🎁 **Feature Requests & Ideas** | [GitHub Discussions] |
+ | 🎁 **Feature Requests & Ideas** | [GitHub Discussions] · [Live Stream] |
  | 👩‍💻 **Usage Questions** | [GitHub Discussions] · [Stack Overflow] |
- | 🗯 **General Discussion** | [GitHub Discussions] |
+ | 🗯 **General Discussion** | [GitHub Discussions] · [Live Stream] |
  [github issue tracker]: https://github.com/explosion/spaCy/issues
  [github discussions]: https://github.com/explosion/spaCy/discussions
  [stack overflow]: https://stackoverflow.com/questions/tagged/spacy
+ [live stream]: https://www.youtube.com/playlist?list=PLBmcuObd5An5_iAxNYLJa_xWmNzsYce8c
  ## Features
@@ -108,7 +117,7 @@ For detailed installation instructions, see the
  - **Operating system**: macOS / OS X · Linux · Windows (Cygwin, MinGW, Visual
  Studio)
- - **Python version**: Python 3.7+ (only 64 bit)
+ - **Python version**: Python >=3.7, <3.13 (only 64 bit)
  - **Package managers**: [pip] · [conda] (via `conda-forge`)
  [pip]: https://pypi.org/project/spacy/

20
bin/release.sh Executable file
View File

@ -0,0 +1,20 @@
#!/usr/bin/env bash
set -e
# Insist repository is clean
git diff-index --quiet HEAD
version=$(grep "__version__ = " spacy/about.py)
version=${version/__version__ = }
version=${version/\'/}
version=${version/\'/}
version=${version/\"/}
version=${version/\"/}
echo "Pushing release-v"$version
git tag -d release-v$version || true
git push origin :release-v$version || true
git tag release-v$version
git push origin release-v$version
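For readers less used to bash parameter expansion: the four ${version/.../} substitutions simply strip the assignment text and the quote characters from the line read out of spacy/about.py, leaving the bare version string that is used to build the tag name. A rough Python equivalent, for illustration only (not part of the diff):

def read_version(about_path: str = "spacy/about.py") -> str:
    # Mirrors the ${version/.../} substitutions above: take the value of the
    # __version__ assignment and strip the surrounding quote characters.
    with open(about_path, encoding="utf8") as f:
        for line in f:
            if line.startswith("__version__"):
                return line.split("=", 1)[1].strip().strip("'\"")
    raise ValueError("__version__ not found in " + about_path)

print("release-v" + read_version())  # e.g. release-v3.8.7, per the about.py change below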

View File

@ -1,6 +1,2 @@
# build version constraints for use with wheelwright # build version constraints for use with wheelwright
numpy==1.15.0; python_version=='3.7' and platform_machine!='aarch64' numpy>=2.0.0,<3.0.0
numpy==1.19.2; python_version=='3.7' and platform_machine=='aarch64'
numpy==1.17.3; python_version=='3.8' and platform_machine!='aarch64'
numpy==1.19.2; python_version=='3.8' and platform_machine=='aarch64'
numpy>=1.25.0; python_version>='3.9'

View File

@ -158,3 +158,45 @@ AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE. SOFTWARE.
SciPy
-----
* Files: scorer.py
The implementation of trapezoid() is adapted from SciPy, which is distributed
under the following license:
New BSD License
Copyright (c) 2001-2002 Enthought, Inc. 2003-2023, SciPy Developers.
All rights reserved.
Redistribution and use in source and binary forms, with or without
modification, are permitted provided that the following conditions
are met:
1. Redistributions of source code must retain the above copyright
notice, this list of conditions and the following disclaimer.
2. Redistributions in binary form must reproduce the above
copyright notice, this list of conditions and the following
disclaimer in the documentation and/or other materials provided
with the distribution.
3. Neither the name of the copyright holder nor the names of its
contributors may be used to endorse or promote products derived
from this software without specific prior written permission.
THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
"AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
(INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.

View File

@ -1,15 +1,67 @@
[build-system] [build-system]
requires = [ requires = [
"setuptools", "setuptools",
"cython>=0.25,<3.0", "cython>=3.0,<4.0",
"cymem>=2.0.2,<2.1.0", "cymem>=2.0.2,<2.1.0",
"preshed>=3.0.2,<3.1.0", "preshed>=3.0.2,<3.1.0",
"murmurhash>=0.28.0,<1.1.0", "murmurhash>=0.28.0,<1.1.0",
"thinc>=8.1.8,<8.3.0", "thinc>=8.3.4,<8.4.0",
"numpy>=1.15.0; python_version < '3.9'", "numpy>=2.0.0,<3.0.0"
"numpy>=1.25.0; python_version >= '3.9'",
] ]
build-backend = "setuptools.build_meta" build-backend = "setuptools.build_meta"
[tool.cibuildwheel]
build = "*"
skip = "pp* cp36* cp37* cp38* *-win32 *i686*"
test-skip = ""
free-threaded-support = false
archs = ["native"]
build-frontend = "default"
config-settings = {}
dependency-versions = "pinned"
environment = { PIP_CONSTRAINT = "build-constraints.txt" }
environment-pass = []
build-verbosity = 0
before-all = "curl https://sh.rustup.rs -sSf | sh -s -- -y --profile minimal --default-toolchain stable"
before-build = "pip install -r requirements.txt && python setup.py clean"
repair-wheel-command = ""
test-command = ""
before-test = ""
test-requires = []
test-extras = []
container-engine = "docker"
manylinux-x86_64-image = "manylinux2014"
manylinux-i686-image = "manylinux2014"
manylinux-aarch64-image = "manylinux2014"
manylinux-ppc64le-image = "manylinux2014"
manylinux-s390x-image = "manylinux2014"
manylinux-pypy_x86_64-image = "manylinux2014"
manylinux-pypy_i686-image = "manylinux2014"
manylinux-pypy_aarch64-image = "manylinux2014"
musllinux-x86_64-image = "musllinux_1_2"
musllinux-i686-image = "musllinux_1_2"
musllinux-aarch64-image = "musllinux_1_2"
musllinux-ppc64le-image = "musllinux_1_2"
musllinux-s390x-image = "musllinux_1_2"
[tool.cibuildwheel.linux]
repair-wheel-command = "auditwheel repair -w {dest_dir} {wheel}"
[tool.cibuildwheel.macos]
repair-wheel-command = "delocate-wheel --require-archs {delocate_archs} -w {dest_dir} -v {wheel}"
[tool.cibuildwheel.windows]
[tool.cibuildwheel.pyodide]
[tool.isort] [tool.isort]
profile = "black" profile = "black"

View File

@ -3,30 +3,26 @@ spacy-legacy>=3.0.11,<3.1.0
spacy-loggers>=1.0.0,<2.0.0 spacy-loggers>=1.0.0,<2.0.0
cymem>=2.0.2,<2.1.0 cymem>=2.0.2,<2.1.0
preshed>=3.0.2,<3.1.0 preshed>=3.0.2,<3.1.0
thinc>=8.1.8,<8.3.0 thinc>=8.3.4,<8.4.0
ml_datasets>=0.2.0,<0.3.0 ml_datasets>=0.2.0,<0.3.0
murmurhash>=0.28.0,<1.1.0 murmurhash>=0.28.0,<1.1.0
wasabi>=0.9.1,<1.2.0 wasabi>=0.9.1,<1.2.0
srsly>=2.4.3,<3.0.0 srsly>=2.4.3,<3.0.0
catalogue>=2.0.6,<2.1.0 catalogue>=2.0.6,<2.1.0
typer>=0.3.0,<0.10.0 typer-slim>=0.3.0,<1.0.0
smart-open>=5.2.1,<7.0.0 weasel>=0.1.0,<0.5.0
weasel>=0.1.0,<0.4.0
# Third party dependencies # Third party dependencies
numpy>=1.15.0; python_version < "3.9" numpy>=2.0.0,<3.0.0
numpy>=1.19.0; python_version >= "3.9"
requests>=2.13.0,<3.0.0 requests>=2.13.0,<3.0.0
tqdm>=4.38.0,<5.0.0 tqdm>=4.38.0,<5.0.0
pydantic>=1.7.4,!=1.8,!=1.8.1,<3.0.0 pydantic>=1.7.4,!=1.8,!=1.8.1,<3.0.0
jinja2 jinja2
langcodes>=3.2.0,<4.0.0
# Official Python utilities # Official Python utilities
setuptools setuptools
packaging>=20.0 packaging>=20.0
typing_extensions>=3.7.4.1,<4.5.0; python_version < "3.8"
# Development dependencies # Development dependencies
pre-commit>=2.13.0 pre-commit>=2.13.0
cython>=0.25,<3.0 cython>=3.0,<4.0
pytest>=5.2.0,!=7.1.0 pytest>=5.2.0,!=7.1.0
pytest-timeout>=1.3.0,<2.0.0 pytest-timeout>=1.3.0,<2.0.0
mock>=2.0.0,<3.0.0 mock>=2.0.0,<3.0.0

View File

@ -17,11 +17,11 @@ classifiers =
Operating System :: Microsoft :: Windows Operating System :: Microsoft :: Windows
Programming Language :: Cython Programming Language :: Cython
Programming Language :: Python :: 3 Programming Language :: Python :: 3
Programming Language :: Python :: 3.7
Programming Language :: Python :: 3.8
Programming Language :: Python :: 3.9 Programming Language :: Python :: 3.9
Programming Language :: Python :: 3.10 Programming Language :: Python :: 3.10
Programming Language :: Python :: 3.11 Programming Language :: Python :: 3.11
Programming Language :: Python :: 3.12
Programming Language :: Python :: 3.13
Topic :: Scientific/Engineering Topic :: Scientific/Engineering
project_urls = project_urls =
Release notes = https://github.com/explosion/spaCy/releases Release notes = https://github.com/explosion/spaCy/releases
@ -30,18 +30,18 @@ project_urls =
[options] [options]
zip_safe = false zip_safe = false
include_package_data = true include_package_data = true
python_requires = >=3.7 python_requires = >=3.9,<3.14
# NOTE: This section is superseded by pyproject.toml and will be removed in # NOTE: This section is superseded by pyproject.toml and will be removed in
# spaCy v4 # spaCy v4
setup_requires = setup_requires =
cython>=0.25,<3.0 cython>=3.0,<4.0
numpy>=1.15.0; python_version < "3.9" numpy>=2.0.0,<3.0.0; python_version < "3.9"
numpy>=1.19.0; python_version >= "3.9" numpy>=2.0.0,<3.0.0; python_version >= "3.9"
# We also need our Cython packages here to compile against # We also need our Cython packages here to compile against
cymem>=2.0.2,<2.1.0 cymem>=2.0.2,<2.1.0
preshed>=3.0.2,<3.1.0 preshed>=3.0.2,<3.1.0
murmurhash>=0.28.0,<1.1.0 murmurhash>=0.28.0,<1.1.0
thinc>=8.1.8,<8.3.0 thinc>=8.3.4,<8.4.0
install_requires = install_requires =
# Our libraries # Our libraries
spacy-legacy>=3.0.11,<3.1.0 spacy-legacy>=3.0.11,<3.1.0
@ -49,14 +49,13 @@ install_requires =
murmurhash>=0.28.0,<1.1.0 murmurhash>=0.28.0,<1.1.0
cymem>=2.0.2,<2.1.0 cymem>=2.0.2,<2.1.0
preshed>=3.0.2,<3.1.0 preshed>=3.0.2,<3.1.0
thinc>=8.1.8,<8.3.0 thinc>=8.3.4,<8.4.0
wasabi>=0.9.1,<1.2.0 wasabi>=0.9.1,<1.2.0
srsly>=2.4.3,<3.0.0 srsly>=2.4.3,<3.0.0
catalogue>=2.0.6,<2.1.0 catalogue>=2.0.6,<2.1.0
weasel>=0.1.0,<0.4.0 weasel>=0.1.0,<0.5.0
# Third-party dependencies # Third-party dependencies
typer>=0.3.0,<0.10.0 typer-slim>=0.3.0,<1.0.0
smart-open>=5.2.1,<7.0.0
tqdm>=4.38.0,<5.0.0 tqdm>=4.38.0,<5.0.0
numpy>=1.15.0; python_version < "3.9" numpy>=1.15.0; python_version < "3.9"
numpy>=1.19.0; python_version >= "3.9" numpy>=1.19.0; python_version >= "3.9"
@ -66,8 +65,6 @@ install_requires =
# Official Python utilities # Official Python utilities
setuptools setuptools
packaging>=20.0 packaging>=20.0
typing_extensions>=3.7.4.1,<4.5.0; python_version < "3.8"
langcodes>=3.2.0,<4.0.0
[options.entry_points] [options.entry_points]
console_scripts = console_scripts =
@ -117,7 +114,7 @@ cuda12x =
cuda-autodetect = cuda-autodetect =
cupy-wheel>=11.0.0,<13.0.0 cupy-wheel>=11.0.0,<13.0.0
apple = apple =
thinc-apple-ops>=0.1.0.dev0,<1.0.0 thinc-apple-ops>=1.0.0,<2.0.0
# Language tokenizers with external dependencies # Language tokenizers with external dependencies
ja = ja =
sudachipy>=0.5.2,!=0.6.1 sudachipy>=0.5.2,!=0.6.1

View File

@ -17,6 +17,7 @@ from .cli.info import info # noqa: F401
from .errors import Errors from .errors import Errors
from .glossary import explain # noqa: F401 from .glossary import explain # noqa: F401
from .language import Language from .language import Language
from .registrations import REGISTRY_POPULATED, populate_registry
from .util import logger, registry # noqa: F401 from .util import logger, registry # noqa: F401
from .vocab import Vocab from .vocab import Vocab

View File

@ -1,5 +1,5 @@
# fmt: off # fmt: off
__title__ = "spacy" __title__ = "spacy"
__version__ = "3.7.2" __version__ = "3.8.7"
__download_url__ = "https://github.com/explosion/spacy-models/releases/download" __download_url__ = "https://github.com/explosion/spacy-models/releases/download"
__compatibility__ = "https://raw.githubusercontent.com/explosion/spacy-models/master/compatibility.json" __compatibility__ = "https://raw.githubusercontent.com/explosion/spacy-models/master/compatibility.json"

View File

@ -1,5 +1,7 @@
from wasabi import msg from wasabi import msg
# Needed for testing
from . import download as download_module # noqa: F401
from ._util import app, setup_cli # noqa: F401 from ._util import app, setup_cli # noqa: F401
from .apply import apply # noqa: F401 from .apply import apply # noqa: F401
from .assemble import assemble_cli # noqa: F401 from .assemble import assemble_cli # noqa: F401

View File

@ -13,7 +13,7 @@ from .. import util
from ..language import Language from ..language import Language
from ..tokens import Doc from ..tokens import Doc
from ..training import Corpus from ..training import Corpus
from ._util import Arg, Opt, benchmark_cli, setup_gpu from ._util import Arg, Opt, benchmark_cli, import_code, setup_gpu
@benchmark_cli.command( @benchmark_cli.command(
@ -30,12 +30,14 @@ def benchmark_speed_cli(
use_gpu: int = Opt(-1, "--gpu-id", "-g", help="GPU ID or -1 for CPU"), use_gpu: int = Opt(-1, "--gpu-id", "-g", help="GPU ID or -1 for CPU"),
n_batches: int = Opt(50, "--batches", help="Minimum number of batches to benchmark", min=30,), n_batches: int = Opt(50, "--batches", help="Minimum number of batches to benchmark", min=30,),
warmup_epochs: int = Opt(3, "--warmup", "-w", min=0, help="Number of iterations over the data for warmup"), warmup_epochs: int = Opt(3, "--warmup", "-w", min=0, help="Number of iterations over the data for warmup"),
code_path: Optional[Path] = Opt(None, "--code", "-c", help="Path to Python file with additional code (registered functions) to be imported"),
# fmt: on # fmt: on
): ):
""" """
Benchmark a pipeline. Expects a loadable spaCy pipeline and benchmark Benchmark a pipeline. Expects a loadable spaCy pipeline and benchmark
data in the binary .spacy format. data in the binary .spacy format.
""" """
import_code(code_path)
setup_gpu(use_gpu=use_gpu, silent=False) setup_gpu(use_gpu=use_gpu, silent=False)
nlp = util.load_model(model) nlp = util.load_model(model)
@ -171,5 +173,5 @@ def print_outliers(sample: numpy.ndarray):
def warmup( def warmup(
nlp: Language, docs: List[Doc], warmup_epochs: int, batch_size: Optional[int] nlp: Language, docs: List[Doc], warmup_epochs: int, batch_size: Optional[int]
) -> numpy.ndarray: ) -> numpy.ndarray:
docs = warmup_epochs * docs docs = [doc.copy() for doc in docs * warmup_epochs]
return annotate(nlp, docs, batch_size) return annotate(nlp, docs, batch_size)

View File

@ -170,7 +170,7 @@ def debug_model(
msg.divider(f"STEP 3 - prediction") msg.divider(f"STEP 3 - prediction")
msg.info(str(prediction)) msg.info(str(prediction))
msg.good(f"Succesfully ended analysis - model looks good.") msg.good(f"Successfully ended analysis - model looks good.")
def _sentences(): def _sentences():

View File

@ -1,5 +1,6 @@
import sys import sys
from typing import Optional, Sequence from typing import Optional, Sequence
from urllib.parse import urljoin
import requests import requests
import typer import typer
@ -7,7 +8,14 @@ from wasabi import msg
from .. import about from .. import about
from ..errors import OLD_MODEL_SHORTCUTS from ..errors import OLD_MODEL_SHORTCUTS
from ..util import get_minor_version, is_package, is_prerelease_version, run_command from ..util import (
get_minor_version,
is_in_interactive,
is_in_jupyter,
is_package,
is_prerelease_version,
run_command,
)
from ._util import SDIST_SUFFIX, WHEEL_SUFFIX, Arg, Opt, app from ._util import SDIST_SUFFIX, WHEEL_SUFFIX, Arg, Opt, app
@ -56,6 +64,13 @@ def download(
) )
pip_args = pip_args + ("--no-deps",) pip_args = pip_args + ("--no-deps",)
if direct: if direct:
# Reject model names with '/', in order to prevent shenanigans.
if "/" in model:
msg.fail(
title="Model download rejected",
text=f"Cannot download model '{model}'. Models are expected to be file names, not URLs or fragments",
exits=True,
)
components = model.split("-") components = model.split("-")
model_name = "".join(components[:-1]) model_name = "".join(components[:-1])
version = components[-1] version = components[-1]
@ -77,6 +92,27 @@ def download(
"Download and installation successful", "Download and installation successful",
f"You can now load the package via spacy.load('{model_name}')", f"You can now load the package via spacy.load('{model_name}')",
) )
if is_in_jupyter():
reload_deps_msg = (
"If you are in a Jupyter or Colab notebook, you may need to "
"restart Python in order to load all the package's dependencies. "
"You can do this by selecting the 'Restart kernel' or 'Restart "
"runtime' option."
)
msg.warn(
"Restart to reload dependencies",
reload_deps_msg,
)
elif is_in_interactive():
reload_deps_msg = (
"If you are in an interactive Python session, you may need to "
"exit and restart Python to load all the package's dependencies. "
"You can exit with Ctrl-D (or Ctrl-Z and Enter on Windows)."
)
msg.warn(
"Restart to reload dependencies",
reload_deps_msg,
)
def get_model_filename(model_name: str, version: str, sdist: bool = False) -> str: def get_model_filename(model_name: str, version: str, sdist: bool = False) -> str:
@ -125,7 +161,16 @@ def get_latest_version(model: str) -> str:
def download_model( def download_model(
filename: str, user_pip_args: Optional[Sequence[str]] = None filename: str, user_pip_args: Optional[Sequence[str]] = None
) -> None: ) -> None:
download_url = about.__download_url__ + "/" + filename # Construct the download URL carefully. We need to make sure we don't
# allow relative paths or other shenanigans to trick us into downloading
# from outside our own repo.
base_url = about.__download_url__
# urljoin requires that the path ends with /, or the last path part will be dropped
if not base_url.endswith("/"):
base_url = about.__download_url__ + "/"
download_url = urljoin(base_url, filename)
if not download_url.startswith(about.__download_url__):
raise ValueError(f"Download from {filename} rejected. Was it a relative path?")
pip_args = list(user_pip_args) if user_pip_args is not None else [] pip_args = list(user_pip_args) if user_pip_args is not None else []
cmd = [sys.executable, "-m", "pip", "install"] + pip_args + [download_url] cmd = [sys.executable, "-m", "pip", "install"] + pip_args + [download_url]
run_command(cmd) run_command(cmd)
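A minimal illustration of the new guard; the filenames below are made up, only the base URL comes from about.__download_url__:

from urllib.parse import urljoin

base = "https://github.com/explosion/spacy-models/releases/download/"
# A plain filename resolves inside the release area and passes the check:
print(urljoin(base, "en_core_web_sm-3.8.0-py3-none-any.whl"))
# https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.8.0-py3-none-any.whl
# A relative path escapes it and fails the startswith() check, raising ValueError:
print(urljoin(base, "../../not-a-model"))
# https://github.com/explosion/spacy-models/not-a-model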

View File

@ -39,7 +39,7 @@ def find_threshold_cli(
# fmt: on # fmt: on
): ):
""" """
Runs prediction trials for a trained model with varying tresholds to maximize Runs prediction trials for a trained model with varying thresholds to maximize
the specified metric. The search space for the threshold is traversed linearly the specified metric. The search space for the threshold is traversed linearly
from 0 to 1 in `n_trials` steps. Results are displayed in a table on `stdout` from 0 to 1 in `n_trials` steps. Results are displayed in a table on `stdout`
(the corresponding API call to `spacy.cli.find_threshold.find_threshold()` (the corresponding API call to `spacy.cli.find_threshold.find_threshold()`
@ -81,7 +81,7 @@ def find_threshold(
silent: bool = True, silent: bool = True,
) -> Tuple[float, float, Dict[float, float]]: ) -> Tuple[float, float, Dict[float, float]]:
""" """
Runs prediction trials for models with varying tresholds to maximize the specified metric. Runs prediction trials for models with varying thresholds to maximize the specified metric.
model (Union[str, Path]): Pipeline to evaluate. Can be a package or a path to a data directory. model (Union[str, Path]): Pipeline to evaluate. Can be a package or a path to a data directory.
data_path (Path): Path to file with DocBin with docs to use for threshold search. data_path (Path): Path to file with DocBin with docs to use for threshold search.
pipe_name (str): Name of pipe to examine thresholds for. pipe_name (str): Name of pipe to examine thresholds for.

View File

@ -1,5 +1,7 @@
import os
import re import re
import shutil import shutil
import subprocess
import sys import sys
from collections import defaultdict from collections import defaultdict
from pathlib import Path from pathlib import Path
@ -11,6 +13,7 @@ from thinc.api import Config
from wasabi import MarkdownRenderer, Printer, get_raw_input from wasabi import MarkdownRenderer, Printer, get_raw_input
from .. import about, util from .. import about, util
from ..compat import importlib_metadata
from ..schemas import ModelMetaSchema, validate from ..schemas import ModelMetaSchema, validate
from ._util import SDIST_SUFFIX, WHEEL_SUFFIX, Arg, Opt, app, string_to_list from ._util import SDIST_SUFFIX, WHEEL_SUFFIX, Arg, Opt, app, string_to_list
@ -27,6 +30,7 @@ def package_cli(
version: Optional[str] = Opt(None, "--version", "-v", help="Package version to override meta"), version: Optional[str] = Opt(None, "--version", "-v", help="Package version to override meta"),
build: str = Opt("sdist", "--build", "-b", help="Comma-separated formats to build: sdist and/or wheel, or none."), build: str = Opt("sdist", "--build", "-b", help="Comma-separated formats to build: sdist and/or wheel, or none."),
force: bool = Opt(False, "--force", "-f", "-F", help="Force overwriting existing data in output directory"), force: bool = Opt(False, "--force", "-f", "-F", help="Force overwriting existing data in output directory"),
require_parent: bool = Opt(True, "--require-parent/--no-require-parent", "-R", "-R", help="Include the parent package (e.g. spacy) in the requirements"),
# fmt: on # fmt: on
): ):
""" """
@ -35,7 +39,7 @@ def package_cli(
specified output directory, and the data will be copied over. If specified output directory, and the data will be copied over. If
--create-meta is set and a meta.json already exists in the output directory, --create-meta is set and a meta.json already exists in the output directory,
the existing values will be used as the defaults in the command-line prompt. the existing values will be used as the defaults in the command-line prompt.
After packaging, "python setup.py sdist" is run in the package directory, After packaging, "python -m build --sdist" is run in the package directory,
which will create a .tar.gz archive that can be installed via "pip install". which will create a .tar.gz archive that can be installed via "pip install".
If additional code files are provided (e.g. Python files containing custom If additional code files are provided (e.g. Python files containing custom
@ -57,6 +61,7 @@ def package_cli(
create_sdist=create_sdist, create_sdist=create_sdist,
create_wheel=create_wheel, create_wheel=create_wheel,
force=force, force=force,
require_parent=require_parent,
silent=False, silent=False,
) )
@ -71,6 +76,7 @@ def package(
create_meta: bool = False, create_meta: bool = False,
create_sdist: bool = True, create_sdist: bool = True,
create_wheel: bool = False, create_wheel: bool = False,
require_parent: bool = False,
force: bool = False, force: bool = False,
silent: bool = True, silent: bool = True,
) -> None: ) -> None:
@ -78,9 +84,17 @@ def package(
input_path = util.ensure_path(input_dir) input_path = util.ensure_path(input_dir)
output_path = util.ensure_path(output_dir) output_path = util.ensure_path(output_dir)
meta_path = util.ensure_path(meta_path) meta_path = util.ensure_path(meta_path)
if create_wheel and not has_wheel(): if create_wheel and not has_wheel() and not has_build():
err = "Generating a binary .whl file requires wheel to be installed" err = (
msg.fail(err, "pip install wheel", exits=1) "Generating wheels requires 'build' or 'wheel' (deprecated) to be installed"
)
msg.fail(err, "pip install build", exits=1)
if not has_build():
msg.warn(
"Generating packages without the 'build' package is deprecated and "
"will not be supported in the future. To install 'build': pip "
"install build"
)
if not input_path or not input_path.exists(): if not input_path or not input_path.exists():
msg.fail("Can't locate pipeline data", input_path, exits=1) msg.fail("Can't locate pipeline data", input_path, exits=1)
if not output_path or not output_path.exists(): if not output_path or not output_path.exists():
@ -102,7 +116,7 @@ def package(
if not meta_path.exists() or not meta_path.is_file(): if not meta_path.exists() or not meta_path.is_file():
msg.fail("Can't load pipeline meta.json", meta_path, exits=1) msg.fail("Can't load pipeline meta.json", meta_path, exits=1)
meta = srsly.read_json(meta_path) meta = srsly.read_json(meta_path)
meta = get_meta(input_dir, meta) meta = get_meta(input_dir, meta, require_parent=require_parent)
if meta["requirements"]: if meta["requirements"]:
msg.good( msg.good(
f"Including {len(meta['requirements'])} package requirement(s) from " f"Including {len(meta['requirements'])} package requirement(s) from "
@ -175,6 +189,7 @@ def package(
imports.append(code_path.stem) imports.append(code_path.stem)
shutil.copy(str(code_path), str(package_path)) shutil.copy(str(code_path), str(package_path))
create_file(main_path / "meta.json", srsly.json_dumps(meta, indent=2)) create_file(main_path / "meta.json", srsly.json_dumps(meta, indent=2))
create_file(main_path / "setup.py", TEMPLATE_SETUP) create_file(main_path / "setup.py", TEMPLATE_SETUP)
create_file(main_path / "MANIFEST.in", TEMPLATE_MANIFEST) create_file(main_path / "MANIFEST.in", TEMPLATE_MANIFEST)
init_py = TEMPLATE_INIT.format( init_py = TEMPLATE_INIT.format(
@ -184,12 +199,37 @@ def package(
msg.good(f"Successfully created package directory '{model_name_v}'", main_path) msg.good(f"Successfully created package directory '{model_name_v}'", main_path)
if create_sdist: if create_sdist:
with util.working_dir(main_path): with util.working_dir(main_path):
util.run_command([sys.executable, "setup.py", "sdist"], capture=False) # run directly, since util.run_command is not designed to continue
# after a command fails
ret = subprocess.run(
[sys.executable, "-m", "build", ".", "--sdist"],
env=os.environ.copy(),
)
if ret.returncode != 0:
msg.warn(
"Creating sdist with 'python -m build' failed. Falling "
"back to deprecated use of 'python setup.py sdist'"
)
util.run_command([sys.executable, "setup.py", "sdist"], capture=False)
zip_file = main_path / "dist" / f"{model_name_v}{SDIST_SUFFIX}" zip_file = main_path / "dist" / f"{model_name_v}{SDIST_SUFFIX}"
msg.good(f"Successfully created zipped Python package", zip_file) msg.good(f"Successfully created zipped Python package", zip_file)
if create_wheel: if create_wheel:
with util.working_dir(main_path): with util.working_dir(main_path):
util.run_command([sys.executable, "setup.py", "bdist_wheel"], capture=False) # run directly, since util.run_command is not designed to continue
# after a command fails
ret = subprocess.run(
[sys.executable, "-m", "build", ".", "--wheel"],
env=os.environ.copy(),
)
if ret.returncode != 0:
msg.warn(
"Creating wheel with 'python -m build' failed. Falling "
"back to deprecated use of 'wheel' with "
"'python setup.py bdist_wheel'"
)
util.run_command(
[sys.executable, "setup.py", "bdist_wheel"], capture=False
)
wheel_name_squashed = re.sub("_+", "_", model_name_v) wheel_name_squashed = re.sub("_+", "_", model_name_v)
wheel = main_path / "dist" / f"{wheel_name_squashed}{WHEEL_SUFFIX}" wheel = main_path / "dist" / f"{wheel_name_squashed}{WHEEL_SUFFIX}"
msg.good(f"Successfully created binary wheel", wheel) msg.good(f"Successfully created binary wheel", wheel)
@ -209,6 +249,17 @@ def has_wheel() -> bool:
return False return False
def has_build() -> bool:
# it's very likely that there is a local directory named build/ (especially
# in an editable install), so an import check is not sufficient; instead
# check that there is a package version
try:
importlib_metadata.version("build")
return True
except importlib_metadata.PackageNotFoundError: # type: ignore[attr-defined]
return False
def get_third_party_dependencies( def get_third_party_dependencies(
config: Config, exclude: List[str] = util.SimpleFrozenList() config: Config, exclude: List[str] = util.SimpleFrozenList()
) -> List[str]: ) -> List[str]:
@ -255,6 +306,8 @@ def get_third_party_dependencies(
modules.add(func_info["module"].split(".")[0]) # type: ignore[union-attr] modules.add(func_info["module"].split(".")[0]) # type: ignore[union-attr]
dependencies = [] dependencies = []
for module_name in modules: for module_name in modules:
if module_name == about.__title__:
continue
if module_name in distributions: if module_name in distributions:
dist = distributions.get(module_name) dist = distributions.get(module_name)
if dist: if dist:
@ -285,7 +338,9 @@ def create_file(file_path: Path, contents: str) -> None:
def get_meta( def get_meta(
model_path: Union[str, Path], existing_meta: Dict[str, Any] model_path: Union[str, Path],
existing_meta: Dict[str, Any],
require_parent: bool = False,
) -> Dict[str, Any]: ) -> Dict[str, Any]:
meta: Dict[str, Any] = { meta: Dict[str, Any] = {
"lang": "en", "lang": "en",
@ -314,6 +369,8 @@ def get_meta(
existing_reqs = [util.split_requirement(req)[0] for req in meta["requirements"]] existing_reqs = [util.split_requirement(req)[0] for req in meta["requirements"]]
reqs = get_third_party_dependencies(nlp.config, exclude=existing_reqs) reqs = get_third_party_dependencies(nlp.config, exclude=existing_reqs)
meta["requirements"].extend(reqs) meta["requirements"].extend(reqs)
if require_parent and about.__title__ not in meta["requirements"]:
meta["requirements"].append(about.__title__ + meta["spacy_version"])
return meta return meta
@ -488,8 +545,11 @@ def list_files(data_dir):
def list_requirements(meta): def list_requirements(meta):
parent_package = meta.get('parent_package', 'spacy') # Up to version 3.7, we included the parent package
requirements = [parent_package + meta['spacy_version']] # in requirements by default. This behaviour is removed
# in 3.8, with a setting to include the parent package in
# the requirements list in the meta if desired.
requirements = []
if 'setup_requires' in meta: if 'setup_requires' in meta:
requirements += meta['setup_requires'] requirements += meta['setup_requires']
if 'requirements' in meta: if 'requirements' in meta:
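A small sketch of the behaviour change with made-up meta values; the package name "spacy" and the version-range format come from about.py and meta['spacy_version'] as used in the diff above:

meta = {
    "spacy_version": ">=3.8.0,<3.9.0",                      # illustrative value
    "requirements": ["spacy-transformers>=1.3.0,<1.4.0"],   # illustrative value
}
# New 3.8 default: list_requirements() no longer injects the parent package,
# so only the pipeline's own requirements are emitted:
print(meta["requirements"])        # ['spacy-transformers>=1.3.0,<1.4.0']
# With --require-parent, get_meta() appends it beforehand (see the diff above):
meta["requirements"].append("spacy" + meta["spacy_version"])
print(meta["requirements"])        # [..., 'spacy>=3.8.0,<3.9.0']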

View File

@ -271,8 +271,9 @@ grad_factor = 1.0
@layers = "reduce_mean.v1" @layers = "reduce_mean.v1"
[components.textcat.model.linear_model] [components.textcat.model.linear_model]
@architectures = "spacy.TextCatBOW.v2" @architectures = "spacy.TextCatBOW.v3"
exclusive_classes = true exclusive_classes = true
length = 262144
ngram_size = 1 ngram_size = 1
no_output_layer = false no_output_layer = false
@ -308,8 +309,9 @@ grad_factor = 1.0
@layers = "reduce_mean.v1" @layers = "reduce_mean.v1"
[components.textcat_multilabel.model.linear_model] [components.textcat_multilabel.model.linear_model]
@architectures = "spacy.TextCatBOW.v2" @architectures = "spacy.TextCatBOW.v3"
exclusive_classes = false exclusive_classes = false
length = 262144
ngram_size = 1 ngram_size = 1
no_output_layer = false no_output_layer = false
@ -542,14 +544,15 @@ nO = null
width = ${components.tok2vec.model.encode.width} width = ${components.tok2vec.model.encode.width}
[components.textcat.model.linear_model] [components.textcat.model.linear_model]
@architectures = "spacy.TextCatBOW.v2" @architectures = "spacy.TextCatBOW.v3"
exclusive_classes = true exclusive_classes = true
length = 262144
ngram_size = 1 ngram_size = 1
no_output_layer = false no_output_layer = false
{% else -%} {% else -%}
[components.textcat.model] [components.textcat.model]
@architectures = "spacy.TextCatBOW.v2" @architectures = "spacy.TextCatBOW.v3"
exclusive_classes = true exclusive_classes = true
ngram_size = 1 ngram_size = 1
no_output_layer = false no_output_layer = false
@ -570,15 +573,17 @@ nO = null
width = ${components.tok2vec.model.encode.width} width = ${components.tok2vec.model.encode.width}
[components.textcat_multilabel.model.linear_model] [components.textcat_multilabel.model.linear_model]
@architectures = "spacy.TextCatBOW.v2" @architectures = "spacy.TextCatBOW.v3"
exclusive_classes = false exclusive_classes = false
length = 262144
ngram_size = 1 ngram_size = 1
no_output_layer = false no_output_layer = false
{% else -%} {% else -%}
[components.textcat_multilabel.model] [components.textcat_multilabel.model]
@architectures = "spacy.TextCatBOW.v2" @architectures = "spacy.TextCatBOW.v3"
exclusive_classes = false exclusive_classes = false
length = 262144
ngram_size = 1 ngram_size = 1
no_output_layer = false no_output_layer = false
{%- endif %} {%- endif %}

View File

@ -142,7 +142,25 @@ class SpanRenderer:
spans (list): Individual entity spans and their start, end, label, kb_id and kb_url. spans (list): Individual entity spans and their start, end, label, kb_id and kb_url.
title (str / None): Document title set in Doc.user_data['title']. title (str / None): Document title set in Doc.user_data['title'].
""" """
per_token_info = [] per_token_info = self._assemble_per_token_info(tokens, spans)
markup = self._render_markup(per_token_info)
markup = TPL_SPANS.format(content=markup, dir=self.direction)
if title:
markup = TPL_TITLE.format(title=title) + markup
return markup
@staticmethod
def _assemble_per_token_info(
tokens: List[str], spans: List[Dict[str, Any]]
) -> List[Dict[str, List[Dict[str, Any]]]]:
"""Assembles token info used to generate markup in render_spans().
tokens (List[str]): Tokens in text.
spans (List[Dict[str, Any]]): Spans in text.
RETURNS (List[Dict[str, List[Dict[str, Any]]]]): Per token info needed to render HTML markup for given tokens
and spans.
"""
per_token_info: List[Dict[str, List[Dict[str, Any]]]] = []
# we must sort so that we can correctly describe when spans need to "stack" # we must sort so that we can correctly describe when spans need to "stack"
# which is determined by their start token, then span length (longer spans on top), # which is determined by their start token, then span length (longer spans on top),
# then break any remaining ties with the span label # then break any remaining ties with the span label
@ -154,21 +172,22 @@ class SpanRenderer:
s["label"], s["label"],
), ),
) )
for s in spans: for s in spans:
# this is the vertical 'slot' that the span will be rendered in # this is the vertical 'slot' that the span will be rendered in
# vertical_position = span_label_offset + (offset_step * (slot - 1)) # vertical_position = span_label_offset + (offset_step * (slot - 1))
s["render_slot"] = 0 s["render_slot"] = 0
for idx, token in enumerate(tokens): for idx, token in enumerate(tokens):
# Identify if a token belongs to a Span (and which) and if it's a # Identify if a token belongs to a Span (and which) and if it's a
# start token of said Span. We'll use this for the final HTML render # start token of said Span. We'll use this for the final HTML render
token_markup: Dict[str, Any] = {} token_markup: Dict[str, Any] = {}
token_markup["text"] = token token_markup["text"] = token
concurrent_spans = 0 intersecting_spans: List[Dict[str, Any]] = []
entities = [] entities = []
for span in spans: for span in spans:
ent = {} ent = {}
if span["start_token"] <= idx < span["end_token"]: if span["start_token"] <= idx < span["end_token"]:
concurrent_spans += 1
span_start = idx == span["start_token"] span_start = idx == span["start_token"]
ent["label"] = span["label"] ent["label"] = span["label"]
ent["is_start"] = span_start ent["is_start"] = span_start
@ -176,7 +195,12 @@ class SpanRenderer:
# When the span starts, we need to know how many other # When the span starts, we need to know how many other
# spans are on the 'span stack' and will be rendered. # spans are on the 'span stack' and will be rendered.
# This value becomes the vertical render slot for this entire span # This value becomes the vertical render slot for this entire span
span["render_slot"] = concurrent_spans span["render_slot"] = (
intersecting_spans[-1]["render_slot"]
if len(intersecting_spans)
else 0
) + 1
intersecting_spans.append(span)
ent["render_slot"] = span["render_slot"] ent["render_slot"] = span["render_slot"]
kb_id = span.get("kb_id", "") kb_id = span.get("kb_id", "")
kb_url = span.get("kb_url", "#") kb_url = span.get("kb_url", "#")
@ -193,11 +217,8 @@ class SpanRenderer:
span["render_slot"] = 0 span["render_slot"] = 0
token_markup["entities"] = entities token_markup["entities"] = entities
per_token_info.append(token_markup) per_token_info.append(token_markup)
markup = self._render_markup(per_token_info)
markup = TPL_SPANS.format(content=markup, dir=self.direction) return per_token_info
if title:
markup = TPL_TITLE.format(title=title) + markup
return markup
def _render_markup(self, per_token_info: List[Dict[str, Any]]) -> str: def _render_markup(self, per_token_info: List[Dict[str, Any]]) -> str:
"""Render the markup from per-token information""" """Render the markup from per-token information"""

View File

@ -220,6 +220,7 @@ class Warnings(metaclass=ErrorsWithCodes):
"key attribute for vectors, configure it through Vectors(attr=) or " "key attribute for vectors, configure it through Vectors(attr=) or "
"'spacy init vectors --attr'") "'spacy init vectors --attr'")
W126 = ("These keys are unsupported: {unsupported}") W126 = ("These keys are unsupported: {unsupported}")
W127 = ("Not all `Language.pipe` worker processes completed successfully")
class Errors(metaclass=ErrorsWithCodes): class Errors(metaclass=ErrorsWithCodes):
@ -227,7 +228,6 @@ class Errors(metaclass=ErrorsWithCodes):
E002 = ("Can't find factory for '{name}' for language {lang} ({lang_code}). " E002 = ("Can't find factory for '{name}' for language {lang} ({lang_code}). "
"This usually happens when spaCy calls `nlp.{method}` with a custom " "This usually happens when spaCy calls `nlp.{method}` with a custom "
"component name that's not registered on the current language class. " "component name that's not registered on the current language class. "
"If you're using a Transformer, make sure to install 'spacy-transformers'. "
"If you're using a custom component, make sure you've added the " "If you're using a custom component, make sure you've added the "
"decorator `@Language.component` (for function components) or " "decorator `@Language.component` (for function components) or "
"`@Language.factory` (for class components).\n\nAvailable " "`@Language.factory` (for class components).\n\nAvailable "
@ -984,6 +984,10 @@ class Errors(metaclass=ErrorsWithCodes):
"predicted docs when training {component}.") "predicted docs when training {component}.")
E1055 = ("The 'replace_listener' callback expects {num_params} parameters, " E1055 = ("The 'replace_listener' callback expects {num_params} parameters, "
"but only callbacks with one or three parameters are supported") "but only callbacks with one or three parameters are supported")
E1056 = ("The `TextCatBOW` architecture expects a length of at least 1, was {length}.")
E1057 = ("The `TextCatReduce` architecture must be used with at least one "
"reduction. Please enable one of `use_reduce_first`, "
"`use_reduce_last`, `use_reduce_max` or `use_reduce_mean`.")
# Deprecated model shortcuts, only used in errors and warnings # Deprecated model shortcuts, only used in errors and warnings

16
spacy/lang/bo/__init__.py Normal file
View File

@ -0,0 +1,16 @@
from ...language import BaseDefaults, Language
from .lex_attrs import LEX_ATTRS
from .stop_words import STOP_WORDS
class TibetanDefaults(BaseDefaults):
lex_attr_getters = LEX_ATTRS
stop_words = STOP_WORDS
class Tibetan(Language):
lang = "bo"
Defaults = TibetanDefaults
__all__ = ["Tibetan"]

16
spacy/lang/bo/examples.py Normal file
View File

@ -0,0 +1,16 @@
"""
Example sentences to test spaCy and its language models.
>>> from spacy.lang.bo.examples import sentences
>>> docs = nlp.pipe(sentences)
"""
sentences = [
"དོན་དུ་རྒྱ་མཚོ་བླ་མ་ཞེས་བྱ་ཞིང༌།",
"ཏཱ་ལའི་ཞེས་པ་ནི་སོག་སྐད་ཡིན་པ་དེ་བོད་སྐད་དུ་རྒྱ་མཚོའི་དོན་དུ་འཇུག",
"སོག་པོ་ཨལ་ཐན་རྒྱལ་པོས་རྒྱལ་དབང་བསོད་ནམས་རྒྱ་མཚོར་ཆེ་བསྟོད་ཀྱི་མཚན་གསོལ་བ་ཞིག་ཡིན་ཞིང༌།",
"རྗེས་སུ་རྒྱལ་བ་དགེ་འདུན་གྲུབ་དང༌། དགེ་འདུན་རྒྱ་མཚོ་སོ་སོར་ཡང་ཏཱ་ལའི་བླ་མའི་སྐུ་ཕྲེང་དང་པོ་དང༌།",
"གཉིས་པའི་མཚན་དེ་གསོལ་ཞིང༌།༸རྒྱལ་དབང་སྐུ་ཕྲེང་ལྔ་པས་དགའ་ལྡན་ཕོ་བྲང་གི་སྲིད་དབང་བཙུགས་པ་ནས་ཏཱ་ལའི་བླ་མ་ནི་བོད་ཀྱི་ཆོས་སྲིད་གཉིས་ཀྱི་དབུ་ཁྲིད་དུ་གྱུར་ཞིང་།",
"ད་ལྟའི་བར་ཏཱ་ལའི་བླ་མ་སྐུ་ཕྲེང་བཅུ་བཞི་བྱོན་ཡོད།",
]

View File

@ -0,0 +1,65 @@
from ...attrs import LIKE_NUM
# reference 1: https://en.wikipedia.org/wiki/Tibetan_numerals
_num_words = [
"ཀླད་ཀོར་",
"གཅིག་",
"གཉིས་",
"གསུམ་",
"བཞི་",
"ལྔ་",
"དྲུག་",
"བདུན་",
"བརྒྱད་",
"དགུ་",
"བཅུ་",
"བཅུ་གཅིག་",
"བཅུ་གཉིས་",
"བཅུ་གསུམ་",
"བཅུ་བཞི་",
"བཅུ་ལྔ་",
"བཅུ་དྲུག་",
"བཅུ་བདུན་",
"བཅུ་པརྒྱད",
"བཅུ་དགུ་",
"ཉི་ཤུ་",
"སུམ་ཅུ",
"བཞི་བཅུ",
"ལྔ་བཅུ",
"དྲུག་ཅུ",
"བདུན་ཅུ",
"བརྒྱད་ཅུ",
"དགུ་བཅུ",
"བརྒྱ་",
"སྟོང་",
"ཁྲི་",
"ས་ཡ་",
" བྱེ་བ་",
"དུང་ཕྱུར་",
"ཐེར་འབུམ་",
"ཐེར་འབུམ་ཆེན་པོ་",
"ཁྲག་ཁྲིག་",
"ཁྲག་ཁྲིག་ཆེན་པོ་",
]
def like_num(text):
"""
Check if text resembles a number
"""
if text.startswith(("+", "-", "±", "~")):
text = text[1:]
text = text.replace(",", "").replace(".", "")
if text.isdigit():
return True
if text.count("/") == 1:
num, denom = text.split("/")
if num.isdigit() and denom.isdigit():
return True
if text in _num_words:
return True
return False
LEX_ATTRS = {LIKE_NUM: like_num}

198
spacy/lang/bo/stop_words.py Normal file
View File

@ -0,0 +1,198 @@
# Source: https://zenodo.org/records/10148636
STOP_WORDS = set(
"""
གས
མས
འད
པས
གཞན
དང
གས
བཅས
ངས
ལས
ཙམ
ཡང
མཐའདག
འད
རང
ངམ
དག
འང
ལགས
ཚང
ཐམསཅད
དམ
འམ
བས
ལགས
གས
མས
བམ
ནམ
ནམ
ངམ
འགའ
ཤས
གམ
ལགས
ཅང
འགའ
སམ
འང
ལས
འཕ
བར
དང
འག
སམ
ཟད
འམ
མམ
དམ
དག
ལམ
ནང
ཙམ
རམ
ཨང
གས
ལགས
པས
རབ
རམ
བས
གཞན
འབའ
གམ
བམ
ཙམ
མམ
ཏམ
ཏམ
ཤས
""".split()
)

View File

@ -6,7 +6,8 @@ _num_words = [
"nine", "ten", "eleven", "twelve", "thirteen", "fourteen", "fifteen", "nine", "ten", "eleven", "twelve", "thirteen", "fourteen", "fifteen",
"sixteen", "seventeen", "eighteen", "nineteen", "twenty", "thirty", "forty", "sixteen", "seventeen", "eighteen", "nineteen", "twenty", "thirty", "forty",
"fifty", "sixty", "seventy", "eighty", "ninety", "hundred", "thousand", "fifty", "sixty", "seventy", "eighty", "ninety", "hundred", "thousand",
"million", "billion", "trillion", "quadrillion", "gajillion", "bazillion" "million", "billion", "trillion", "quadrillion", "quintillion", "sextillion",
"septillion", "octillion", "nonillion", "decillion", "gajillion", "bazillion"
] ]
_ordinal_words = [ _ordinal_words = [
"first", "second", "third", "fourth", "fifth", "sixth", "seventh", "eighth", "first", "second", "third", "fourth", "fifth", "sixth", "seventh", "eighth",
@ -14,7 +15,8 @@ _ordinal_words = [
"fifteenth", "sixteenth", "seventeenth", "eighteenth", "nineteenth", "fifteenth", "sixteenth", "seventeenth", "eighteenth", "nineteenth",
"twentieth", "thirtieth", "fortieth", "fiftieth", "sixtieth", "seventieth", "twentieth", "thirtieth", "fortieth", "fiftieth", "sixtieth", "seventieth",
"eightieth", "ninetieth", "hundredth", "thousandth", "millionth", "billionth", "eightieth", "ninetieth", "hundredth", "thousandth", "millionth", "billionth",
"trillionth", "quadrillionth", "gajillionth", "bazillionth" "trillionth", "quadrillionth", "quintillionth", "sextillionth", "septillionth",
"octillionth", "nonillionth", "decillionth", "gajillionth", "bazillionth"
] ]
# fmt: on # fmt: on

18
spacy/lang/fo/__init__.py Normal file
View File

@ -0,0 +1,18 @@
from ...language import BaseDefaults, Language
from ..punctuation import TOKENIZER_INFIXES, TOKENIZER_PREFIXES, TOKENIZER_SUFFIXES
from .tokenizer_exceptions import TOKENIZER_EXCEPTIONS
class FaroeseDefaults(BaseDefaults):
tokenizer_exceptions = TOKENIZER_EXCEPTIONS
infixes = TOKENIZER_INFIXES
suffixes = TOKENIZER_SUFFIXES
prefixes = TOKENIZER_PREFIXES
class Faroese(Language):
lang = "fo"
Defaults = FaroeseDefaults
__all__ = ["Faroese"]

View File

@ -0,0 +1,90 @@
from ...symbols import ORTH
from ...util import update_exc
from ..tokenizer_exceptions import BASE_EXCEPTIONS
_exc = {}
for orth in [
"apr.",
"aug.",
"avgr.",
"árg.",
"ávís.",
"beinl.",
"blkv.",
"blaðkv.",
"blm.",
"blaðm.",
"bls.",
"blstj.",
"blaðstj.",
"des.",
"eint.",
"febr.",
"fyrrv.",
"góðk.",
"h.m.",
"innt.",
"jan.",
"kl.",
"m.a.",
"mðr.",
"mió.",
"nr.",
"nto.",
"nov.",
"nút.",
"o.a.",
"o.a.m.",
"o.a.tíl.",
"o.fl.",
"ff.",
"o.m.a.",
"o.o.",
"o.s.fr.",
"o.tíl.",
"o.ø.",
"okt.",
"omf.",
"pst.",
"ritstj.",
"sbr.",
"sms.",
"smst.",
"smb.",
"sb.",
"sbrt.",
"sp.",
"sept.",
"spf.",
"spsk.",
"t.e.",
"t.s.",
"t.s.s.",
"tlf.",
"tel.",
"tsk.",
"t.o.v.",
"t.d.",
"uml.",
"ums.",
"uppl.",
"upprfr.",
"uppr.",
"útg.",
"útl.",
"útr.",
"vanl.",
"v.",
"v.h.",
"v.ø.o.",
"viðm.",
"viðv.",
"vm.",
"v.m.",
]:
_exc[orth] = [{ORTH: orth}]
capitalized = orth.capitalize()
_exc[capitalized] = [{ORTH: capitalized}]
TOKENIZER_EXCEPTIONS = update_exc(BASE_EXCEPTIONS, _exc)

18
spacy/lang/gd/__init__.py Normal file
View File

@ -0,0 +1,18 @@
from typing import Optional
from ...language import BaseDefaults, Language
from .stop_words import STOP_WORDS
from .tokenizer_exceptions import TOKENIZER_EXCEPTIONS
class ScottishDefaults(BaseDefaults):
tokenizer_exceptions = TOKENIZER_EXCEPTIONS
stop_words = STOP_WORDS
class Scottish(Language):
lang = "gd"
Defaults = ScottishDefaults
__all__ = ["Scottish"]

388
spacy/lang/gd/stop_words.py Normal file
View File

@ -0,0 +1,388 @@
STOP_WORDS = set(
"""
'ad
'ar
'd # iad
'g # ag
'ga
'gam
'gan
'gar
'gur
'm # am
'n # an
'n seo
'na
'nad
'nam
'nan
'nar
'nuair
'nur
's
'sa
'san
'sann
'se
'sna
a
a'
a'd # agad
a'm # agam
a-chèile
a-seo
a-sin
a-siud
a chionn
a chionn 's
a chèile
a chéile
a dh'
a h-uile
a seo
ac' # aca
aca
aca-san
acasan
ach
ag
agad
agad-sa
agads'
agadsa
agaibh
agaibhse
againn
againne
agam
agam-sa
agams'
agamsa
agus
aice
aice-se
aicese
aig
aig' # aige
aige
aige-san
aigesan
air
air-san
air neo
airsan
am
an
an seo
an sin
an siud
an uair
ann
ann a
ann a'
ann a shin
ann am
ann an
annad
annam
annam-s'
annamsa
anns
anns an
annta
aon
ar
as
asad
asda
asta
b'
bho
bhon
bhuaidhe # bhuaithe
bhuainn
bhuaipe
bhuaithe
bhuapa
bhur
brì
bu
c'à
car son
carson
cha
chan
chionn
choir
chon
chun
chèile
chéile
chòir
cia mheud
ciamar
co-dhiubh
cuide
cuin
cuin'
cuine
'
càil
càit
càit'
càite
mheud
d'
da
de
dh'
dha
dhaibh
dhaibh-san
dhaibhsan
dhan
dhasan
dhe
dhen
dheth
dhi
dhiom
dhiot
dhith
dhiubh
dhomh
dhomh-s'
dhomhsa
dhu'sa # dhut-sa
dhuibh
dhuibhse
dhuinn
dhuinne
dhuit
dhut
dhutsa
dhut-sa
dhà
dhà-san
dhàsan
dhòmhsa
diubh
do
docha
don
mar
mar
dòch'
dòcha
e
eadar
eatarra
eatorra
eile
esan
fa
far
feud
fhad
fheudar
fhearr
fhein
fheudar
fheàrr
fhèin
fhéin
fhìn
fo
fodha
fodhainn
foipe
fon
fèin
ga
gach
gam
gan
ge brith
ged
gu
gu
gu ruige
gun
gur
gus
i
iad
iadsan
innte
is
ise
le
leam
leam-sa
leamsa
leat
leat-sa
leatha
leatsa
leibh
leis
leis-san
leoth'
leotha
leotha-san
linn
m'
m'a
ma
mac
man
mar
mas
mathaid
mi
mis'
mise
mo
mu
mu 'n
mun
mur
mura
mus
na
na b'
na bu
na iad
nach
nad
nam
nan
nar
nas
neo
no
nuair
o
o'n
oir
oirbh
oirbh-se
oirnn
oirnne
oirre
on
orm
orm-sa
ormsa
orra
orra-san
orrasan
ort
os
r'
ri
ribh
rinn
ris
rithe
rithe-se
rium
rium-sa
riums'
riumsa
riut
riuth'
riutha
riuthasan
ro
ro'n
roimh
roimhe
romhainn
romham
romhpa
ron
ruibh
ruinn
ruinne
sa
san
sann
se
seach
seo
seothach
shin
sibh
sibh-se
sibhse
sin
sineach
sinn
sinne
siod
siodach
siud
siudach
sna # ann an
t'
tarsaing
tarsainn
tarsuinn
thar
thoigh
thro
thu
thuc'
thuca
thugad
thugaibh
thugainn
thugam
thugamsa
thuice
thuige
thus'
thusa
timcheall
toigh
toil
tro
tro' # troimh
troimh
troimhe
tron
tu
tusa
uair
ud
ugaibh
ugam-s'
ugam-sa
uice
uige
uige-san
umad
unnta # ann an
ur
urrainn
à
às
àsan
á
ás
è
ì
ò
ó
""".split(
"\n"
)
)

File diff suppressed because it is too large

View File

@ -1,5 +1,5 @@
The list of Croatian lemmas was extracted from the reldi-tagger repository (https://github.com/clarinsi/reldi-tagger). The list of Croatian lemmas was extracted from the reldi-tagger repository (https://github.com/clarinsi/reldi-tagger).
Reldi-tagger is licesned under the Apache 2.0 licence. Reldi-tagger is licensed under the Apache 2.0 licence.
@InProceedings{ljubesic16-new, @InProceedings{ljubesic16-new,
author = {Nikola Ljubešić and Filip Klubička and Željko Agić and Ivo-Pavao Jazbec}, author = {Nikola Ljubešić and Filip Klubička and Željko Agić and Ivo-Pavao Jazbec},

52
spacy/lang/ht/__init__.py Normal file
View File

@ -0,0 +1,52 @@
from typing import Callable, Optional
from thinc.api import Model
from ...language import BaseDefaults, Language
from .lemmatizer import HaitianCreoleLemmatizer
from .lex_attrs import LEX_ATTRS
from .punctuation import TOKENIZER_PREFIXES, TOKENIZER_INFIXES, TOKENIZER_SUFFIXES
from .stop_words import STOP_WORDS
from .syntax_iterators import SYNTAX_ITERATORS
from .tokenizer_exceptions import TOKENIZER_EXCEPTIONS
from .tag_map import TAG_MAP
class HaitianCreoleDefaults(BaseDefaults):
tokenizer_exceptions = TOKENIZER_EXCEPTIONS
prefixes = TOKENIZER_PREFIXES
infixes = TOKENIZER_INFIXES
suffixes = TOKENIZER_SUFFIXES
lex_attr_getters = LEX_ATTRS
syntax_iterators = SYNTAX_ITERATORS
stop_words = STOP_WORDS
tag_map = TAG_MAP
class HaitianCreole(Language):
lang = "ht"
Defaults = HaitianCreoleDefaults
@HaitianCreole.factory(
"lemmatizer",
assigns=["token.lemma"],
default_config={
"model": None,
"mode": "rule",
"overwrite": False,
"scorer": {"@scorers": "spacy.lemmatizer_scorer.v1"},
},
default_score_weights={"lemma_acc": 1.0},
)
def make_lemmatizer(
nlp: Language,
model: Optional[Model],
name: str,
mode: str,
overwrite: bool,
scorer: Optional[Callable],
):
return HaitianCreoleLemmatizer(
nlp.vocab, model, name, mode=mode, overwrite=overwrite, scorer=scorer
)
__all__ = ["HaitianCreole"]

18
spacy/lang/ht/examples.py Normal file
View File

@ -0,0 +1,18 @@
"""
Example sentences to test spaCy and its language models.
>>> from spacy.lang.ht.examples import sentences
>>> docs = nlp.pipe(sentences)
"""
sentences = [
"Apple ap panse achte yon demaraj nan Wayòm Ini pou $1 milya dola",
"Machin otonòm fè responsablite asirans lan ale sou men fabrikan yo",
"San Francisco ap konsidere entèdi robo ki livre sou twotwa yo",
"Lond se yon gwo vil nan Wayòm Ini",
"Kote ou ye?",
"Kilès ki prezidan Lafrans?",
"Ki kapital Etazini?",
"Kile Barack Obama te fèt?",
]

View File

@ -0,0 +1,51 @@
from typing import List, Tuple
from ...pipeline import Lemmatizer
from ...tokens import Token
from ...lookups import Lookups
class HaitianCreoleLemmatizer(Lemmatizer):
"""
Minimal Haitian Creole lemmatizer.
Returns a word's base form based on rules and lookup,
or defaults to the original form.
"""
def is_base_form(self, token: Token) -> bool:
morph = token.morph.to_dict()
upos = token.pos_.lower()
# Consider unmarked forms to be base
if upos in {"noun", "verb", "adj", "adv"}:
if not morph:
return True
if upos == "noun" and morph.get("Number") == "Sing":
return True
if upos == "verb" and morph.get("VerbForm") == "Inf":
return True
if upos == "adj" and morph.get("Degree") == "Pos":
return True
return False
def rule_lemmatize(self, token: Token) -> List[str]:
string = token.text.lower()
pos = token.pos_.lower()
cache_key = (token.orth, token.pos)
if cache_key in self.cache:
return self.cache[cache_key]
forms = []
# fallback rule: just return lowercased form
forms.append(string)
self.cache[cache_key] = forms
return forms
@classmethod
def get_lookups_config(cls, mode: str) -> Tuple[List[str], List[str]]:
if mode == "rule":
required = ["lemma_lookup", "lemma_rules", "lemma_exc", "lemma_index"]
return (required, [])
return super().get_lookups_config(mode)

View File

@ -0,0 +1,78 @@
from ...attrs import LIKE_NUM, NORM
# Cardinal numbers in Creole
_num_words = set(
"""
zewo youn en de twa kat senk sis sèt uit nèf dis
onz douz trèz katoz kenz sèz disèt dizwit diznèf
vent trant karant sinkant swasant swasann-dis
san mil milyon milya
""".split()
)
# Ordinal numbers in Creole (some are French-influenced, some simplified)
_ordinal_words = set(
"""
premye dezyèm twazyèm katryèm senkyèm sizyèm sètvyèm uitvyèm nèvyèm dizyèm
onzèm douzyèm trèzyèm katozyèm kenzèm sèzyèm disetyèm dizwityèm diznèvyèm
ventyèm trantyèm karantyèm sinkantyèm swasantyèm
swasann-disyèm santyèm milyèm milyonnyèm milyadyèm
""".split()
)
NORM_MAP = {
"'m": "mwen",
"'w": "ou",
"'l": "li",
"'n": "nou",
"'y": "yo",
"m": "mwen",
"w": "ou",
"l": "li",
"n": "nou",
"y": "yo",
"m": "mwen",
"n": "nou",
"l": "li",
"y": "yo",
"w": "ou",
"t": "te",
"k": "ki",
"p": "pa",
"M": "Mwen",
"N": "Nou",
"L": "Li",
"Y": "Yo",
"W": "Ou",
"T": "Te",
"K": "Ki",
"P": "Pa",
}
def like_num(text):
text = text.strip().lower()
if text.startswith(("+", "-", "±", "~")):
text = text[1:]
text = text.replace(",", "").replace(".", "")
if text.isdigit():
return True
if text.count("/") == 1:
num, denom = text.split("/")
if num.isdigit() and denom.isdigit():
return True
if text in _num_words:
return True
if text in _ordinal_words:
return True
# Handle things like "3yèm", "10yèm", "25yèm", etc.
if text.endswith("yèm") and text[:-3].isdigit():
return True
return False
def norm_custom(text):
return NORM_MAP.get(text, text.lower())
LEX_ATTRS = {
LIKE_NUM: like_num,
NORM: norm_custom,
}
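A quick sketch of what these getters accept, based only on the code above (the example words are illustrative):

assert like_num("3yèm")          # digits + ordinal suffix, e.g. "3rd"
assert like_num("swasann-dis")   # listed in _num_words ("seventy")
assert not like_num("liv")       # an ordinary word
assert norm_custom("m") == "mwen"
assert norm_custom("Sa") == "sa"  # not in NORM_MAP, falls back to lowercasing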

View File

@ -0,0 +1,43 @@
from ..char_classes import (
ALPHA,
ALPHA_LOWER,
ALPHA_UPPER,
CONCAT_QUOTES,
HYPHENS,
LIST_PUNCT,
LIST_QUOTES,
LIST_ELLIPSES,
LIST_ICONS,
merge_chars,
)
ELISION = "' ’".replace(" ", "")
_prefixes_elision = "m n l y t k w"
_prefixes_elision += " " + _prefixes_elision.upper()
TOKENIZER_PREFIXES = LIST_PUNCT + LIST_QUOTES + [
r"(?:({pe})[{el}])(?=[{a}])".format(
a=ALPHA, el=ELISION, pe=merge_chars(_prefixes_elision)
)
]
TOKENIZER_SUFFIXES = LIST_PUNCT + LIST_QUOTES + LIST_ELLIPSES + [
r"(?<=[0-9])%", # numbers like 10%
r"(?<=[0-9])(?:{h})".format(h=HYPHENS), # hyphens after numbers
r"(?<=[{a}])[']".format(a=ALPHA), # apostrophes after letters
r"(?<=[{a}])['][mwlnytk](?=\s|$)".format(a=ALPHA), # contractions
r"(?<=[{a}0-9])\)", # right parenthesis after letter/number
r"(?<=[{a}])\.(?=\s|$)".format(a=ALPHA), # period after letter if space or end of string
r"(?<=\))[\.\?!]", # punctuation immediately after right parenthesis
]
TOKENIZER_INFIXES = LIST_ELLIPSES + LIST_ICONS + [
r"(?<=[0-9])[+\-\*^](?=[0-9-])",
r"(?<=[{al}{q}])\.(?=[{au}{q}])".format(
al=ALPHA_LOWER, au=ALPHA_UPPER, q=CONCAT_QUOTES
),
r"(?<=[{a}]),(?=[{a}])".format(a=ALPHA),
r"(?<=[{a}0-9])(?:{h})(?=[{a}])".format(a=ALPHA, h=HYPHENS),
r"(?<=[{a}][{el}])(?=[{a}])".format(a=ALPHA, el=ELISION),
]

View File

@ -0,0 +1,50 @@
STOP_WORDS = set(
"""
a ak an ankò ant apre ap atò avan avanlè
byen byenke
chak
de depi deja deja
e en epi èske
fòk
gen genyen
ki kisa kilès kote koukou konsa konbyen konn konnen kounye kouman
la l laa le li lye
m m' mwen
nan nap nou n'
ou oumenm
pa paske pami pandan pito pou pral preske pwiske
se selman si sou sòt
ta tap tankou te toujou tou tan tout toutotan twòp tèl
w w' wi wè
y y' yo yon yonn
non o oh eh
sa san si swa si
men mèsi oswa osinon
"""
.split()
)
# Add common contractions, with and without apostrophe variants
contractions = ["m'", "n'", "w'", "y'", "l'", "t'", "k'"]
for apostrophe in ["'", "’", ""]:
for word in contractions:
STOP_WORDS.add(word.replace("'", apostrophe))

View File

@ -0,0 +1,74 @@
from typing import Iterator, Tuple, Union
from ...errors import Errors
from ...symbols import NOUN, PRON, PROPN
from ...tokens import Doc, Span
def noun_chunks(doclike: Union[Doc, Span]) -> Iterator[Tuple[int, int, int]]:
"""
Detect base noun phrases from a dependency parse for Haitian Creole.
Works on both Doc and Span objects.
"""
# Core nominal dependencies common in Haitian Creole
labels = [
"nsubj",
"obj",
"obl",
"nmod",
"appos",
"ROOT",
]
# Modifiers to optionally include in chunk (to the right)
post_modifiers = ["compound", "flat", "flat:name", "fixed"]
doc = doclike.doc
if not doc.has_annotation("DEP"):
raise ValueError(Errors.E029)
np_deps = {doc.vocab.strings.add(label) for label in labels}
np_mods = {doc.vocab.strings.add(mod) for mod in post_modifiers}
conj_label = doc.vocab.strings.add("conj")
np_label = doc.vocab.strings.add("NP")
adp_pos = doc.vocab.strings.add("ADP")
cc_pos = doc.vocab.strings.add("CCONJ")
prev_end = -1
for i, word in enumerate(doclike):
if word.pos not in (NOUN, PROPN, PRON):
continue
if word.left_edge.i <= prev_end:
continue
if word.dep in np_deps:
right_end = word
# expand to include known modifiers to the right
for child in word.rights:
if child.dep in np_mods:
right_end = child.right_edge
elif child.pos == NOUN:
right_end = child.right_edge
left_index = word.left_edge.i
# Skip prepositions at the start
if word.left_edge.pos == adp_pos:
left_index += 1
prev_end = right_end.i
yield left_index, right_end.i + 1, np_label
elif word.dep == conj_label:
head = word.head
while head.dep == conj_label and head.head.i < head.i:
head = head.head
if head.dep in np_deps:
left_index = word.left_edge.i
if word.left_edge.pos == cc_pos:
left_index += 1
prev_end = word.i
yield left_index, word.i + 1, np_label
SYNTAX_ITERATORS = {"noun_chunks": noun_chunks}

21
spacy/lang/ht/tag_map.py Normal file
View File

@ -0,0 +1,21 @@
from spacy.symbols import NOUN, VERB, AUX, ADJ, ADV, PRON, DET, ADP, SCONJ, CCONJ, PART, INTJ, NUM, PROPN, PUNCT, SYM, X
TAG_MAP = {
"NOUN": {"pos": NOUN},
"VERB": {"pos": VERB},
"AUX": {"pos": AUX},
"ADJ": {"pos": ADJ},
"ADV": {"pos": ADV},
"PRON": {"pos": PRON},
"DET": {"pos": DET},
"ADP": {"pos": ADP},
"SCONJ": {"pos": SCONJ},
"CCONJ": {"pos": CCONJ},
"PART": {"pos": PART},
"INTJ": {"pos": INTJ},
"NUM": {"pos": NUM},
"PROPN": {"pos": PROPN},
"PUNCT": {"pos": PUNCT},
"SYM": {"pos": SYM},
"X": {"pos": X},
}

View File

@ -0,0 +1,121 @@
from spacy.symbols import ORTH, NORM
def make_variants(base, first_norm, second_orth, second_norm):
return {
base: [
{ORTH: base.split("'")[0] + "'", NORM: first_norm},
{ORTH: second_orth, NORM: second_norm},
],
base.capitalize(): [
{ORTH: base.split("'")[0].capitalize() + "'", NORM: first_norm.capitalize()},
{ORTH: second_orth, NORM: second_norm},
]
}
TOKENIZER_EXCEPTIONS = {
"Dr.": [{ORTH: "Dr."}]
}
# Apostrophe forms
TOKENIZER_EXCEPTIONS.update(make_variants("m'ap", "mwen", "ap", "ap"))
TOKENIZER_EXCEPTIONS.update(make_variants("n'ap", "nou", "ap", "ap"))
TOKENIZER_EXCEPTIONS.update(make_variants("l'ap", "li", "ap", "ap"))
TOKENIZER_EXCEPTIONS.update(make_variants("y'ap", "yo", "ap", "ap"))
TOKENIZER_EXCEPTIONS.update(make_variants("m'te", "mwen", "te", "te"))
TOKENIZER_EXCEPTIONS.update(make_variants("m'pral", "mwen", "pral", "pral"))
TOKENIZER_EXCEPTIONS.update(make_variants("w'ap", "ou", "ap", "ap"))
TOKENIZER_EXCEPTIONS.update(make_variants("k'ap", "ki", "ap", "ap"))
TOKENIZER_EXCEPTIONS.update(make_variants("p'ap", "pa", "ap", "ap"))
TOKENIZER_EXCEPTIONS.update(make_variants("t'ap", "te", "ap", "ap"))
# Non-apostrophe contractions (with capitalized variants)
TOKENIZER_EXCEPTIONS.update({
"map": [
{ORTH: "m", NORM: "mwen"},
{ORTH: "ap", NORM: "ap"},
],
"Map": [
{ORTH: "M", NORM: "Mwen"},
{ORTH: "ap", NORM: "ap"},
],
"lem": [
{ORTH: "le", NORM: "le"},
{ORTH: "m", NORM: "mwen"},
],
"Lem": [
{ORTH: "Le", NORM: "Le"},
{ORTH: "m", NORM: "mwen"},
],
"lew": [
{ORTH: "le", NORM: "le"},
{ORTH: "w", NORM: "ou"},
],
"Lew": [
{ORTH: "Le", NORM: "Le"},
{ORTH: "w", NORM: "ou"},
],
"nap": [
{ORTH: "n", NORM: "nou"},
{ORTH: "ap", NORM: "ap"},
],
"Nap": [
{ORTH: "N", NORM: "Nou"},
{ORTH: "ap", NORM: "ap"},
],
"lap": [
{ORTH: "l", NORM: "li"},
{ORTH: "ap", NORM: "ap"},
],
"Lap": [
{ORTH: "L", NORM: "Li"},
{ORTH: "ap", NORM: "ap"},
],
"yap": [
{ORTH: "y", NORM: "yo"},
{ORTH: "ap", NORM: "ap"},
],
"Yap": [
{ORTH: "Y", NORM: "Yo"},
{ORTH: "ap", NORM: "ap"},
],
"mte": [
{ORTH: "m", NORM: "mwen"},
{ORTH: "te", NORM: "te"},
],
"Mte": [
{ORTH: "M", NORM: "Mwen"},
{ORTH: "te", NORM: "te"},
],
"mpral": [
{ORTH: "m", NORM: "mwen"},
{ORTH: "pral", NORM: "pral"},
],
"Mpral": [
{ORTH: "M", NORM: "Mwen"},
{ORTH: "pral", NORM: "pral"},
],
"wap": [
{ORTH: "w", NORM: "ou"},
{ORTH: "ap", NORM: "ap"},
],
"Wap": [
{ORTH: "W", NORM: "Ou"},
{ORTH: "ap", NORM: "ap"},
],
"kap": [
{ORTH: "k", NORM: "ki"},
{ORTH: "ap", NORM: "ap"},
],
"Kap": [
{ORTH: "K", NORM: "Ki"},
{ORTH: "ap", NORM: "ap"},
],
"tap": [
{ORTH: "t", NORM: "te"},
{ORTH: "ap", NORM: "ap"},
],
"Tap": [
{ORTH: "T", NORM: "Te"},
{ORTH: "ap", NORM: "ap"},
],
})
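As a sanity check, one entry produced by make_variants above expands like this (a sketch evaluated in this module's scope):

expected = {
    "m'ap": [{ORTH: "m'", NORM: "mwen"}, {ORTH: "ap", NORM: "ap"}],
    "M'ap": [{ORTH: "M'", NORM: "Mwen"}, {ORTH: "ap", NORM: "ap"}],
}
assert make_variants("m'ap", "mwen", "ap", "ap") == expected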

View File

@ -32,7 +32,6 @@ split_mode = null
"""

-@registry.tokenizers("spacy.ja.JapaneseTokenizer")
def create_tokenizer(split_mode: Optional[str] = None):
    def japanese_tokenizer_factory(nlp):
        return JapaneseTokenizer(nlp.vocab, split_mode=split_mode)

View File

@ -0,0 +1,16 @@
from ...language import BaseDefaults, Language
from .lex_attrs import LEX_ATTRS
from .stop_words import STOP_WORDS
class KurmanjiDefaults(BaseDefaults):
stop_words = STOP_WORDS
lex_attr_getters = LEX_ATTRS
class Kurmanji(Language):
lang = "kmr"
Defaults = KurmanjiDefaults
__all__ = ["Kurmanji"]

View File

@ -0,0 +1,17 @@
"""
Example sentences to test spaCy and its language models.
>>> from spacy.lang.kmr.examples import sentences
>>> docs = nlp.pipe(sentences)
"""
sentences = [
"Berê mirovan her tim li geşedana pêşerojê ye", # People's gaze is always on the development of the future
"Kawa Nemir di 14 salan de Ulysses wergerand Kurmancî.", # Kawa Nemir translated Ulysses into Kurmanji in 14 years.
"Mem Ararat hunermendekî Kurd yê bi nav û deng e.", # Mem Ararat is a famous Kurdish artist
"Firat Cewerî 40 sal e pirtûkên Kurdî dinivîsîne.", # Firat Ceweri has been writing Kurdish books for 40 years
"Rojnamegerê ciwan nûçeyeke balkêş li ser rewşa aborî nivîsand", # The young journalist wrote an interesting news article about the economic situation
"Sektora çandiniyê beşeke giring a belavkirina gaza serayê li seranserê cîhanê pêk tîne", # The agricultural sector constitutes an important part of greenhouse gas emissions worldwide
"Xwendekarên jêhatî di pêşbaziya matematîkê de serkeftî bûn", # Talented students succeeded in the mathematics competition
"Ji ber ji tunebûnê bavê min xwişkeke min nedan xwendin ew ji min re bû derd û kulek.", # Because of poverty, my father didn't send my sister to school, which became a pain and sorrow for me
]

spacy/lang/kmr/lex_attrs.py (new file, 138 lines)
View File

@ -0,0 +1,138 @@
from ...attrs import LIKE_NUM
_num_words = [
"sifir",
"yek",
"du",
"",
"çar",
"pênc",
"şeş",
"heft",
"heşt",
"neh",
"deh",
"yazde",
"dazde",
"sêzde",
"çarde",
"pazde",
"şazde",
"hevde",
"hejde",
"nozde",
"bîst",
"",
"çil",
"pêncî",
"şêst",
"heftê",
"heştê",
"nod",
"sed",
"hezar",
"milyon",
"milyar",
]
_ordinal_words = [
"yekem",
"yekemîn",
"duyem",
"duyemîn",
"sêyem",
"sêyemîn",
"çarem",
"çaremîn",
"pêncem",
"pêncemîn",
"şeşem",
"şeşemîn",
"heftem",
"heftemîn",
"heştem",
"heştemîn",
"nehem",
"nehemîn",
"dehem",
"dehemîn",
"yazdehem",
"yazdehemîn",
"dazdehem",
"dazdehemîn",
"sêzdehem",
"sêzdehemîn",
"çardehem",
"çardehemîn",
"pazdehem",
"pazdehemîn",
"şanzdehem",
"şanzdehemîn",
"hevdehem",
"hevdehemîn",
"hejdehem",
"hejdehemîn",
"nozdehem",
"nozdehemîn",
"bîstem",
"bîstemîn",
"sîyem",
"sîyemîn",
"çilem",
"çilemîn",
"pêncîyem",
"pênciyemîn",
"şêstem",
"şêstemîn",
"heftêyem",
"heftêyemîn",
"heştêyem",
"heştêyemîn",
"notem",
"notemîn",
"sedem",
"sedemîn",
"hezarem",
"hezaremîn",
"milyonem",
"milyonemîn",
"milyarem",
"milyaremîn",
]
def like_num(text):
if text.startswith(("+", "-", "±", "~")):
text = text[1:]
text = text.replace(",", "").replace(".", "")
if text.isdigit():
return True
if text.count("/") == 1:
num, denom = text.split("/")
if num.isdigit() and denom.isdigit():
return True
text_lower = text.lower()
if text_lower in _num_words:
return True
# Check ordinal number
if text_lower in _ordinal_words:
return True
if is_digit(text_lower):
return True
return False
def is_digit(text):
endings = ("em", "yem", "emîn", "yemîn")
for ending in endings:
to = len(ending)
if text.endswith(ending) and text[:-to].isdigit():
return True
return False
LEX_ATTRS = {LIKE_NUM: like_num}
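A few illustrative checks of the predicate defined above (plain Python, no pipeline required):

assert like_num("19")          # digits
assert like_num("3/4")         # simple fractions
assert like_num("pêncî")       # cardinal number word
assert like_num("sêzdehemîn")  # ordinal number word
assert like_num("14em")        # digits plus an ordinal ending, via is_digit()
assert not like_num("pirtûk")  # ordinary noun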

View File

@ -0,0 +1,44 @@
STOP_WORDS = set(
"""
û
li
bi
di
da
de
ji
ku
ew
ez
tu
em
hûn
ew
ev
min
te
me
we
wan
va
çi
çawa
çima
kengî
li ku
çend
çiqas
her
hin
gelek
hemû
kes
tişt
""".split()
)

View File

@ -20,7 +20,6 @@ DEFAULT_CONFIG = """
"""

-@registry.tokenizers("spacy.ko.KoreanTokenizer")
def create_tokenizer():
    def korean_tokenizer_factory(nlp):
        return KoreanTokenizer(nlp.vocab)

View File

@ -24,12 +24,6 @@ class MacedonianDefaults(BaseDefaults):
    tokenizer_exceptions = update_exc(BASE_EXCEPTIONS, TOKENIZER_EXCEPTIONS)
    stop_words = STOP_WORDS

-    @classmethod
-    def create_lemmatizer(cls, nlp=None, lookups=None):
-        if lookups is None:
-            lookups = Lookups()
-        return MacedonianLemmatizer(lookups)

class Macedonian(Language):
    lang = "mk"

spacy/lang/nn/__init__.py (new file, 20 lines)
View File

@ -0,0 +1,20 @@
from ...language import BaseDefaults, Language
from ..nb import SYNTAX_ITERATORS
from .punctuation import TOKENIZER_INFIXES, TOKENIZER_PREFIXES, TOKENIZER_SUFFIXES
from .tokenizer_exceptions import TOKENIZER_EXCEPTIONS
class NorwegianNynorskDefaults(BaseDefaults):
tokenizer_exceptions = TOKENIZER_EXCEPTIONS
prefixes = TOKENIZER_PREFIXES
infixes = TOKENIZER_INFIXES
suffixes = TOKENIZER_SUFFIXES
syntax_iterators = SYNTAX_ITERATORS
class NorwegianNynorsk(Language):
lang = "nn"
Defaults = NorwegianNynorskDefaults
__all__ = ["NorwegianNynorsk"]

spacy/lang/nn/examples.py (new file, 15 lines)
View File

@ -0,0 +1,15 @@
"""
Example sentences to test spaCy and its language models.
>>> from spacy.lang.nn.examples import sentences
>>> docs = nlp.pipe(sentences)
"""
# sentences taken from Omsetjingsminne frå Nynorsk pressekontor 2022 (https://www.nb.no/sprakbanken/en/resource-catalogue/oai-nb-no-sbr-80/)
sentences = [
"Konseptet går ut på at alle tre omgangar tel, alle hopparar må stille i kvalifiseringa og poengsummen skal telje.",
"Det er ein meir enn i same periode i fjor.",
"Det har lava ned enorme snømengder i store delar av Europa den siste tida.",
"Akhtar Chaudhry er ikkje innstilt på Oslo-lista til SV, men utfordrar Heikki Holmås om førsteplassen.",
]

View File

@ -0,0 +1,74 @@
from ..char_classes import (
ALPHA,
ALPHA_LOWER,
ALPHA_UPPER,
CONCAT_QUOTES,
CURRENCY,
LIST_CURRENCY,
LIST_ELLIPSES,
LIST_ICONS,
LIST_PUNCT,
LIST_QUOTES,
PUNCT,
UNITS,
)
from ..punctuation import TOKENIZER_SUFFIXES
_quotes = CONCAT_QUOTES.replace("'", "")
_list_punct = [x for x in LIST_PUNCT if x != "#"]
_list_icons = [x for x in LIST_ICONS if x != "°"]
_list_icons = [x.replace("\\u00B0", "") for x in _list_icons]
_list_quotes = [x for x in LIST_QUOTES if x != "\\'"]
_prefixes = (
["§", "%", "=", "", "", r"\+(?![0-9])"]
+ _list_punct
+ LIST_ELLIPSES
+ LIST_QUOTES
+ LIST_CURRENCY
+ LIST_ICONS
)
_infixes = (
LIST_ELLIPSES
+ _list_icons
+ [
r"(?<=[{al}])\.(?=[{au}])".format(al=ALPHA_LOWER, au=ALPHA_UPPER),
r"(?<=[{a}])[,!?](?=[{a}])".format(a=ALPHA),
r"(?<=[{a}])[:<>=/](?=[{a}])".format(a=ALPHA),
r"(?<=[{a}]),(?=[{a}])".format(a=ALPHA),
r"(?<=[{a}])([{q}\)\]\(\[])(?=[{a}])".format(a=ALPHA, q=_quotes),
r"(?<=[{a}])--(?=[{a}])".format(a=ALPHA),
]
)
_suffixes = (
LIST_PUNCT
+ LIST_ELLIPSES
+ _list_quotes
+ _list_icons
+ ["", ""]
+ [
r"(?<=[0-9])\+",
r"(?<=°[FfCcKk])\.",
r"(?<=[0-9])(?:{c})".format(c=CURRENCY),
r"(?<=[0-9])(?:{u})".format(u=UNITS),
r"(?<=[{al}{e}{p}(?:{q})])\.".format(
al=ALPHA_LOWER, e=r"%²\-\+", q=_quotes, p=PUNCT
),
r"(?<=[{au}][{au}])\.".format(au=ALPHA_UPPER),
]
+ [r"(?<=[^sSxXzZ])'"]
)
_suffixes += [
suffix
for suffix in TOKENIZER_SUFFIXES
if suffix not in ["'s", "'S", "s", "S", r"\'"]
]
TOKENIZER_PREFIXES = _prefixes
TOKENIZER_INFIXES = _infixes
TOKENIZER_SUFFIXES = _suffixes
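These lists are compiled into the tokenizer's regexes via spacy.util; a rough sketch of how that wiring usually looks (the NorwegianNynorskDefaults above take care of this automatically):

from spacy.util import compile_infix_regex, compile_prefix_regex, compile_suffix_regex

prefix_re = compile_prefix_regex(TOKENIZER_PREFIXES)
suffix_re = compile_suffix_regex(TOKENIZER_SUFFIXES)
infix_re = compile_infix_regex(TOKENIZER_INFIXES)
# e.g. the comma-between-letters infix decides that "ja,nei" splits into three tokens:
print([m.group() for m in infix_re.finditer("ja,nei")])   # [","]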

View File

@ -0,0 +1,228 @@
from ...symbols import NORM, ORTH
from ...util import update_exc
from ..tokenizer_exceptions import BASE_EXCEPTIONS
_exc = {}
for exc_data in [
{ORTH: "jan.", NORM: "januar"},
{ORTH: "feb.", NORM: "februar"},
{ORTH: "mar.", NORM: "mars"},
{ORTH: "apr.", NORM: "april"},
{ORTH: "jun.", NORM: "juni"},
# note: "jul." is in the simple list below without a NORM exception
{ORTH: "aug.", NORM: "august"},
{ORTH: "sep.", NORM: "september"},
{ORTH: "okt.", NORM: "oktober"},
{ORTH: "nov.", NORM: "november"},
{ORTH: "des.", NORM: "desember"},
]:
_exc[exc_data[ORTH]] = [exc_data]
for orth in [
"Ap.",
"Aq.",
"Ca.",
"Chr.",
"Co.",
"Dr.",
"F.eks.",
"Fr.p.",
"Frp.",
"Grl.",
"Kr.",
"Kr.F.",
"Kr.F.s",
"Mr.",
"Mrs.",
"Pb.",
"Pr.",
"Sp.",
"St.",
"a.m.",
"ad.",
"adm.dir.",
"adr.",
"b.c.",
"bl.a.",
"bla.",
"bm.",
"bnr.",
"bto.",
"c.c.",
"ca.",
"cand.mag.",
"co.",
"d.d.",
"d.m.",
"d.y.",
"dept.",
"dr.",
"dr.med.",
"dr.philos.",
"dr.psychol.",
"dss.",
"dvs.",
"e.Kr.",
"e.l.",
"eg.",
"eig.",
"ekskl.",
"el.",
"et.",
"etc.",
"etg.",
"ev.",
"evt.",
"f.",
"f.Kr.",
"f.eks.",
"f.o.m.",
"fhv.",
"fk.",
"foreg.",
"fork.",
"fv.",
"fvt.",
"g.",
"gl.",
"gno.",
"gnr.",
"grl.",
"gt.",
"h.r.adv.",
"hhv.",
"hoh.",
"hr.",
"ifb.",
"ifm.",
"iht.",
"inkl.",
"istf.",
"jf.",
"jr.",
"jul.",
"juris.",
"kfr.",
"kgl.",
"kgl.res.",
"kl.",
"komm.",
"kr.",
"kst.",
"lat.",
"lø.",
"m.a.",
"m.a.o.",
"m.fl.",
"m.m.",
"m.v.",
"ma.",
"mag.art.",
"md.",
"mfl.",
"mht.",
"mill.",
"min.",
"mnd.",
"moh.",
"mrd.",
"muh.",
"mv.",
"mva.",
"n.å.",
"ndf.",
"nr.",
"nto.",
"nyno.",
"o.a.",
"o.l.",
"obl.",
"off.",
"ofl.",
"on.",
"op.",
"org.",
"osv.",
"ovf.",
"p.",
"p.a.",
"p.g.a.",
"p.m.",
"p.t.",
"pga.",
"ph.d.",
"pkt.",
"pr.",
"pst.",
"pt.",
"red.anm.",
"ref.",
"res.",
"res.kap.",
"resp.",
"rv.",
"s.",
"s.d.",
"s.k.",
"s.u.",
"s.å.",
"sen.",
"sep.",
"siviling.",
"sms.",
"snr.",
"spm.",
"sr.",
"sst.",
"st.",
"st.meld.",
"st.prp.",
"stip.",
"stk.",
"stud.",
"sv.",
"såk.",
"sø.",
"t.d.",
"t.h.",
"t.o.m.",
"t.v.",
"temp.",
"ti.",
"tils.",
"tilsv.",
"tl;dr",
"tlf.",
"to.",
"ult.",
"utg.",
"v.",
"vedk.",
"vedr.",
"vg.",
"vgs.",
"vha.",
"vit.ass.",
"vn.",
"vol.",
"vs.",
"vsa.",
"§§",
"©NTB",
"årg.",
"årh.",
]:
_exc[orth] = [{ORTH: orth}]
# Dates
for h in range(1, 31 + 1):
for period in ["."]:
_exc[f"{h}{period}"] = [{ORTH: f"{h}."}]
_custom_base_exc = {"i.": [{ORTH: "i", NORM: "i"}, {ORTH: "."}]}
_exc.update(_custom_base_exc)
TOKENIZER_EXCEPTIONS = update_exc(BASE_EXCEPTIONS, _exc)
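A small usage sketch of the exceptions above (assumes the nn language registered by this PR; spacy.blank builds a tokenizer-only pipeline):

import spacy

nlp = spacy.blank("nn")
doc = nlp("Dei kjem 5. jan. i år.")
print([t.text for t in doc])    # "5." and "jan." each stay a single token
print([t.norm_ for t in doc])   # the NORM of "jan." is "januar"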

View File

@ -13,7 +13,6 @@ DEFAULT_CONFIG = """
"""

-@registry.tokenizers("spacy.th.ThaiTokenizer")
def create_thai_tokenizer():
    def thai_tokenizer_factory(nlp):
        return ThaiTokenizer(nlp.vocab)

View File

@ -22,7 +22,6 @@ use_pyvi = true
"""

-@registry.tokenizers("spacy.vi.VietnameseTokenizer")
def create_vietnamese_tokenizer(use_pyvi: bool = True):
    def vietnamese_tokenizer_factory(nlp):
        return VietnameseTokenizer(nlp.vocab, use_pyvi=use_pyvi)

View File

@ -46,7 +46,6 @@ class Segmenter(str, Enum):
        return list(cls.__members__.keys())

-@registry.tokenizers("spacy.zh.ChineseTokenizer")
def create_chinese_tokenizer(segmenter: Segmenter = Segmenter.char):
    def chinese_tokenizer_factory(nlp):
        return ChineseTokenizer(nlp.vocab, segmenter=segmenter)

View File

@ -5,7 +5,7 @@ import multiprocessing as mp
import random import random
import traceback import traceback
import warnings import warnings
from contextlib import contextmanager from contextlib import ExitStack, contextmanager
from copy import deepcopy from copy import deepcopy
from dataclasses import dataclass from dataclasses import dataclass
from itertools import chain, cycle from itertools import chain, cycle
@ -30,8 +30,11 @@ from typing import (
overload, overload,
) )
import numpy
import srsly import srsly
from cymem.cymem import Pool
from thinc.api import Config, CupyOps, Optimizer, get_current_ops from thinc.api import Config, CupyOps, Optimizer, get_current_ops
from thinc.util import convert_recursive
from . import about, ty, util from . import about, ty, util
from .compat import Literal from .compat import Literal
@ -101,7 +104,6 @@ class BaseDefaults:
writing_system = {"direction": "ltr", "has_case": True, "has_letters": True} writing_system = {"direction": "ltr", "has_case": True, "has_letters": True}
@registry.tokenizers("spacy.Tokenizer.v1")
def create_tokenizer() -> Callable[["Language"], Tokenizer]: def create_tokenizer() -> Callable[["Language"], Tokenizer]:
"""Registered function to create a tokenizer. Returns a factory that takes """Registered function to create a tokenizer. Returns a factory that takes
the nlp object and returns a Tokenizer instance using the language detaults. the nlp object and returns a Tokenizer instance using the language detaults.
@ -127,7 +129,6 @@ def create_tokenizer() -> Callable[["Language"], Tokenizer]:
return tokenizer_factory return tokenizer_factory
@registry.misc("spacy.LookupsDataLoader.v1")
def load_lookups_data(lang, tables): def load_lookups_data(lang, tables):
util.logger.debug("Loading lookups from spacy-lookups-data: %s", tables) util.logger.debug("Loading lookups from spacy-lookups-data: %s", tables)
lookups = load_lookups(lang=lang, tables=tables) lookups = load_lookups(lang=lang, tables=tables)
@ -140,7 +141,7 @@ class Language:
Defaults (class): Settings, data and factory methods for creating the `nlp` Defaults (class): Settings, data and factory methods for creating the `nlp`
object and processing pipeline. object and processing pipeline.
lang (str): IETF language code, such as 'en'. lang (str): Two-letter ISO 639-1 or three-letter ISO 639-3 language codes, such as 'en' and 'eng'.
DOCS: https://spacy.io/api/language DOCS: https://spacy.io/api/language
""" """
@ -182,6 +183,9 @@ class Language:
DOCS: https://spacy.io/api/language#init DOCS: https://spacy.io/api/language#init
""" """
from .pipeline.factories import register_factories
register_factories()
# We're only calling this to import all factories provided via entry # We're only calling this to import all factories provided via entry
# points. The factory decorator applied to these functions takes care # points. The factory decorator applied to these functions takes care
# of the rest. # of the rest.
@ -1211,7 +1215,7 @@ class Language:
examples, examples,
): ):
eg.predicted = doc eg.predicted = doc
return losses return _replace_numpy_floats(losses)
def rehearse( def rehearse(
self, self,
@ -1462,7 +1466,7 @@ class Language:
results = scorer.score(examples, per_component=per_component) results = scorer.score(examples, per_component=per_component)
n_words = sum(len(eg.predicted) for eg in examples) n_words = sum(len(eg.predicted) for eg in examples)
results["speed"] = n_words / (end_time - start_time) results["speed"] = n_words / (end_time - start_time)
return results return _replace_numpy_floats(results)
def create_optimizer(self): def create_optimizer(self):
"""Create an optimizer, usually using the [training.optimizer] config.""" """Create an optimizer, usually using the [training.optimizer] config."""
@ -1683,6 +1687,12 @@ class Language:
for proc in procs: for proc in procs:
proc.start() proc.start()
# Close writing-end of channels. This is needed to avoid that reading
# from the channel blocks indefinitely when the worker closes the
# channel.
for tx in bytedocs_send_ch:
tx.close()
# Cycle channels not to break the order of docs. # Cycle channels not to break the order of docs.
# The received object is a batch of byte-encoded docs, so flatten them with chain.from_iterable. # The received object is a batch of byte-encoded docs, so flatten them with chain.from_iterable.
byte_tuples = chain.from_iterable( byte_tuples = chain.from_iterable(
@ -1705,8 +1715,27 @@ class Language:
# tell `sender` that one batch was consumed. # tell `sender` that one batch was consumed.
sender.step() sender.step()
finally: finally:
# If we are stopping in an orderly fashion, the workers' queues
# are empty. Put the sentinel in their queues to signal that work
# is done, so that they can exit gracefully.
for q in texts_q:
q.put(_WORK_DONE_SENTINEL)
q.close()
# Otherwise, we are stopping because the error handler raised an
# exception. The sentinel will be last to go out of the queue.
# To avoid doing unnecessary work or hanging on platforms that
# block on sending (Windows), we'll close our end of the channel.
# This signals to the worker that it can exit the next time it
# attempts to send data down the channel.
for r in bytedocs_recv_ch:
r.close()
for proc in procs: for proc in procs:
proc.terminate() proc.join()
if not all(proc.exitcode == 0 for proc in procs):
warnings.warn(Warnings.W127)
def _link_components(self) -> None: def _link_components(self) -> None:
"""Register 'listeners' within pipeline components, to allow them to """Register 'listeners' within pipeline components, to allow them to
@ -2066,6 +2095,38 @@ class Language:
util.replace_model_node(pipe.model, listener, new_model) # type: ignore[attr-defined] util.replace_model_node(pipe.model, listener, new_model) # type: ignore[attr-defined]
tok2vec.remove_listener(listener, pipe_name) tok2vec.remove_listener(listener, pipe_name)
@contextmanager
def memory_zone(self, mem: Optional[Pool] = None) -> Iterator[Pool]:
"""Begin a block where all resources allocated during the block will
be freed at the end of it. If a resource was created within the
memory zone block, accessing it outside the block is invalid.
Behaviour of this invalid access is undefined. Memory zones should
not be nested.
The memory zone is helpful for services that need to process large
volumes of text with a defined memory budget.
Example
-------
>>> with nlp.memory_zone():
... for doc in nlp.pipe(texts):
... process_my_doc(doc)
>>> # use_doc(doc) <-- Invalid: doc was allocated in the memory zone
"""
if mem is None:
mem = Pool()
# The ExitStack allows programmatic nested context managers.
# We don't know how many we need, so it would be awkward to have
# them as nested blocks.
with ExitStack() as stack:
contexts = [stack.enter_context(self.vocab.memory_zone(mem))]
if hasattr(self.tokenizer, "memory_zone"):
contexts.append(stack.enter_context(self.tokenizer.memory_zone(mem)))
for _, pipe in self.pipeline:
if hasattr(pipe, "memory_zone"):
contexts.append(stack.enter_context(pipe.memory_zone(mem)))
yield mem
def to_disk( def to_disk(
self, path: Union[str, Path], *, exclude: Iterable[str] = SimpleFrozenList() self, path: Union[str, Path], *, exclude: Iterable[str] = SimpleFrozenList()
) -> None: ) -> None:
@ -2083,7 +2144,9 @@ class Language:
serializers["tokenizer"] = lambda p: self.tokenizer.to_disk( # type: ignore[union-attr] serializers["tokenizer"] = lambda p: self.tokenizer.to_disk( # type: ignore[union-attr]
p, exclude=["vocab"] p, exclude=["vocab"]
) )
serializers["meta.json"] = lambda p: srsly.write_json(p, self.meta) serializers["meta.json"] = lambda p: srsly.write_json(
p, _replace_numpy_floats(self.meta)
)
serializers["config.cfg"] = lambda p: self.config.to_disk(p) serializers["config.cfg"] = lambda p: self.config.to_disk(p)
for name, proc in self._components: for name, proc in self._components:
if name in exclude: if name in exclude:
@ -2197,7 +2260,9 @@ class Language:
serializers: Dict[str, Callable[[], bytes]] = {} serializers: Dict[str, Callable[[], bytes]] = {}
serializers["vocab"] = lambda: self.vocab.to_bytes(exclude=exclude) serializers["vocab"] = lambda: self.vocab.to_bytes(exclude=exclude)
serializers["tokenizer"] = lambda: self.tokenizer.to_bytes(exclude=["vocab"]) # type: ignore[union-attr] serializers["tokenizer"] = lambda: self.tokenizer.to_bytes(exclude=["vocab"]) # type: ignore[union-attr]
serializers["meta.json"] = lambda: srsly.json_dumps(self.meta) serializers["meta.json"] = lambda: srsly.json_dumps(
_replace_numpy_floats(self.meta)
)
serializers["config.cfg"] = lambda: self.config.to_bytes() serializers["config.cfg"] = lambda: self.config.to_bytes()
for name, proc in self._components: for name, proc in self._components:
if name in exclude: if name in exclude:
@ -2248,6 +2313,12 @@ class Language:
return self return self
def _replace_numpy_floats(meta_dict: dict) -> dict:
return convert_recursive(
lambda v: isinstance(v, numpy.floating), lambda v: float(v), dict(meta_dict)
)
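A sketch of what the helper above does: numpy scalar floats nested in meta/score dicts become plain Python floats so they serialise cleanly with srsly.

import numpy
from thinc.util import convert_recursive

meta = {"performance": {"token_acc": numpy.float32(0.97)}, "speed": 15000.0}
clean = convert_recursive(lambda v: isinstance(v, numpy.floating), float, dict(meta))
print(type(clean["performance"]["token_acc"]))   # <class 'float'>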
@dataclass @dataclass
class FactoryMeta: class FactoryMeta:
"""Dataclass containing information about a component and its defaults """Dataclass containing information about a component and its defaults
@ -2323,6 +2394,13 @@ def _apply_pipes(
while True: while True:
try: try:
texts_with_ctx = receiver.get() texts_with_ctx = receiver.get()
# Stop working if we encounter the end-of-work sentinel.
if isinstance(texts_with_ctx, _WorkDoneSentinel):
sender.close()
receiver.close()
return
docs = ( docs = (
ensure_doc(doc_like, context) for doc_like, context in texts_with_ctx ensure_doc(doc_like, context) for doc_like, context in texts_with_ctx
) )
@ -2331,11 +2409,23 @@ def _apply_pipes(
# Connection does not accept unpickable objects, so send list. # Connection does not accept unpickable objects, so send list.
byte_docs = [(doc.to_bytes(), doc._context, None) for doc in docs] byte_docs = [(doc.to_bytes(), doc._context, None) for doc in docs]
padding = [(None, None, None)] * (len(texts_with_ctx) - len(byte_docs)) padding = [(None, None, None)] * (len(texts_with_ctx) - len(byte_docs))
sender.send(byte_docs + padding) # type: ignore[operator] data: Sequence[Tuple[Optional[bytes], Optional[Any], Optional[bytes]]] = (
byte_docs + padding # type: ignore[operator]
)
except Exception: except Exception:
error_msg = [(None, None, srsly.msgpack_dumps(traceback.format_exc()))] error_msg = [(None, None, srsly.msgpack_dumps(traceback.format_exc()))]
padding = [(None, None, None)] * (len(texts_with_ctx) - 1) padding = [(None, None, None)] * (len(texts_with_ctx) - 1)
sender.send(error_msg + padding) data = error_msg + padding
try:
sender.send(data)
except BrokenPipeError:
# Parent has closed the pipe prematurely. This happens when a
# worker encounters an error and the error handler is set to
# stop processing.
sender.close()
receiver.close()
return
class _Sender: class _Sender:
@ -2365,3 +2455,10 @@ class _Sender:
if self.count >= self.chunk_size: if self.count >= self.chunk_size:
self.count = 0 self.count = 0
self.send() self.send()
class _WorkDoneSentinel:
pass
_WORK_DONE_SENTINEL = _WorkDoneSentinel()
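A generic sketch of the sentinel pattern used here, independent of spaCy's actual worker plumbing: the parent enqueues one sentinel per worker so each worker drains its queue and exits cleanly instead of being terminated.

import multiprocessing as mp

class _Done:
    pass

DONE = _Done()

def worker(q):
    while True:
        item = q.get()
        if isinstance(item, _Done):
            return                    # orderly shutdown, exit code 0
        print("processing", item)

if __name__ == "__main__":
    q = mp.Queue()
    proc = mp.Process(target=worker, args=(q,))
    proc.start()
    for item in ["a", "b", "c"]:
        q.put(item)
    q.put(DONE)                       # signal end of work
    proc.join()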

View File

@ -35,7 +35,7 @@ cdef class Lexeme:
        return self

    @staticmethod
-    cdef inline void set_struct_attr(LexemeC* lex, attr_id_t name, attr_t value) nogil:
+    cdef inline void set_struct_attr(LexemeC* lex, attr_id_t name, attr_t value) noexcept nogil:
        if name < (sizeof(flags_t) * 8):
            Lexeme.c_set_flag(lex, name, value)
        elif name == ID:
@ -54,7 +54,7 @@ cdef class Lexeme:
            lex.lang = value

    @staticmethod
-    cdef inline attr_t get_struct_attr(const LexemeC* lex, attr_id_t feat_name) nogil:
+    cdef inline attr_t get_struct_attr(const LexemeC* lex, attr_id_t feat_name) noexcept nogil:
        if feat_name < (sizeof(flags_t) * 8):
            if Lexeme.c_check_flag(lex, feat_name):
                return 1
@ -82,7 +82,7 @@ cdef class Lexeme:
        return 0

    @staticmethod
-    cdef inline bint c_check_flag(const LexemeC* lexeme, attr_id_t flag_id) nogil:
+    cdef inline bint c_check_flag(const LexemeC* lexeme, attr_id_t flag_id) noexcept nogil:
        cdef flags_t one = 1
        if lexeme.flags & (one << flag_id):
            return True
@ -90,7 +90,7 @@ cdef class Lexeme:
        return False

    @staticmethod
-    cdef inline bint c_set_flag(LexemeC* lex, attr_id_t flag_id, bint value) nogil:
+    cdef inline bint c_set_flag(LexemeC* lex, attr_id_t flag_id, bint value) noexcept nogil:
        cdef flags_t one = 1
        if value:
            lex.flags |= one << flag_id

View File

@ -70,7 +70,7 @@ cdef class Lexeme:
        if isinstance(other, Lexeme):
            a = self.orth
            b = other.orth
-        elif isinstance(other, long):
+        elif isinstance(other, int):
            a = self.orth
            b = other
        elif isinstance(other, str):
@ -104,7 +104,7 @@ cdef class Lexeme:
            # skip PROB, e.g. from lexemes.jsonl
            if isinstance(value, float):
                continue
-            elif isinstance(value, (int, long)):
+            elif isinstance(value, int):
                Lexeme.set_struct_attr(self.c, attr, value)
            else:
                Lexeme.set_struct_attr(self.c, attr, self.vocab.strings.add(value))
@ -164,45 +164,48 @@ cdef class Lexeme:
vector = self.vector vector = self.vector
return numpy.sqrt((vector**2).sum()) return numpy.sqrt((vector**2).sum())
property vector: @property
def vector(self):
"""A real-valued meaning representation. """A real-valued meaning representation.
RETURNS (numpy.ndarray[ndim=1, dtype='float32']): A 1D numpy array RETURNS (numpy.ndarray[ndim=1, dtype='float32']): A 1D numpy array
representing the lexeme's semantics. representing the lexeme's semantics.
""" """
def __get__(self): cdef int length = self.vocab.vectors_length
cdef int length = self.vocab.vectors_length if length == 0:
if length == 0: raise ValueError(Errors.E010)
raise ValueError(Errors.E010) return self.vocab.get_vector(self.c.orth)
return self.vocab.get_vector(self.c.orth)
def __set__(self, vector): @vector.setter
if len(vector) != self.vocab.vectors_length: def vector(self, vector):
raise ValueError(Errors.E073.format(new_length=len(vector), if len(vector) != self.vocab.vectors_length:
length=self.vocab.vectors_length)) raise ValueError(Errors.E073.format(new_length=len(vector),
self.vocab.set_vector(self.c.orth, vector) length=self.vocab.vectors_length))
self.vocab.set_vector(self.c.orth, vector)
property rank: @property
def rank(self):
"""RETURNS (str): Sequential ID of the lexeme's lexical type, used """RETURNS (str): Sequential ID of the lexeme's lexical type, used
to index into tables, e.g. for word vectors.""" to index into tables, e.g. for word vectors."""
def __get__(self): return self.c.id
return self.c.id
def __set__(self, value): @rank.setter
self.c.id = value def rank(self, value):
self.c.id = value
property sentiment: @property
def sentiment(self):
"""RETURNS (float): A scalar value indicating the positivity or """RETURNS (float): A scalar value indicating the positivity or
negativity of the lexeme.""" negativity of the lexeme."""
def __get__(self): sentiment_table = self.vocab.lookups.get_table("lexeme_sentiment", {})
sentiment_table = self.vocab.lookups.get_table("lexeme_sentiment", {}) return sentiment_table.get(self.c.orth, 0.0)
return sentiment_table.get(self.c.orth, 0.0)
def __set__(self, float x): @sentiment.setter
if "lexeme_sentiment" not in self.vocab.lookups: def sentiment(self, float x):
self.vocab.lookups.add_table("lexeme_sentiment") if "lexeme_sentiment" not in self.vocab.lookups:
sentiment_table = self.vocab.lookups.get_table("lexeme_sentiment") self.vocab.lookups.add_table("lexeme_sentiment")
sentiment_table[self.c.orth] = x sentiment_table = self.vocab.lookups.get_table("lexeme_sentiment")
sentiment_table[self.c.orth] = x
@property @property
def orth_(self): def orth_(self):
@ -216,306 +219,338 @@ cdef class Lexeme:
"""RETURNS (str): The original verbatim text of the lexeme.""" """RETURNS (str): The original verbatim text of the lexeme."""
return self.orth_ return self.orth_
property lower: @property
def lower(self):
"""RETURNS (uint64): Lowercase form of the lexeme.""" """RETURNS (uint64): Lowercase form of the lexeme."""
def __get__(self): return self.c.lower
return self.c.lower
def __set__(self, attr_t x): @lower.setter
self.c.lower = x def lower(self, attr_t x):
self.c.lower = x
property norm: @property
def norm(self):
"""RETURNS (uint64): The lexeme's norm, i.e. a normalised form of the """RETURNS (uint64): The lexeme's norm, i.e. a normalised form of the
lexeme text. lexeme text.
""" """
def __get__(self): return self.c.norm
return self.c.norm
def __set__(self, attr_t x): @norm.setter
if "lexeme_norm" not in self.vocab.lookups: def norm(self, attr_t x):
self.vocab.lookups.add_table("lexeme_norm") if "lexeme_norm" not in self.vocab.lookups:
norm_table = self.vocab.lookups.get_table("lexeme_norm") self.vocab.lookups.add_table("lexeme_norm")
norm_table[self.c.orth] = self.vocab.strings[x] norm_table = self.vocab.lookups.get_table("lexeme_norm")
self.c.norm = x norm_table[self.c.orth] = self.vocab.strings[x]
self.c.norm = x
property shape: @property
def shape(self):
"""RETURNS (uint64): Transform of the word's string, to show """RETURNS (uint64): Transform of the word's string, to show
orthographic features. orthographic features.
""" """
def __get__(self): return self.c.shape
return self.c.shape
def __set__(self, attr_t x): @shape.setter
self.c.shape = x def shape(self, attr_t x):
self.c.shape = x
property prefix: @property
def prefix(self):
"""RETURNS (uint64): Length-N substring from the start of the word. """RETURNS (uint64): Length-N substring from the start of the word.
Defaults to `N=1`. Defaults to `N=1`.
""" """
def __get__(self): return self.c.prefix
return self.c.prefix
def __set__(self, attr_t x): @prefix.setter
self.c.prefix = x def prefix(self, attr_t x):
self.c.prefix = x
property suffix: @property
def suffix(self):
"""RETURNS (uint64): Length-N substring from the end of the word. """RETURNS (uint64): Length-N substring from the end of the word.
Defaults to `N=3`. Defaults to `N=3`.
""" """
def __get__(self): return self.c.suffix
return self.c.suffix
def __set__(self, attr_t x): @suffix.setter
self.c.suffix = x def suffix(self, attr_t x):
self.c.suffix = x
property cluster: @property
def cluster(self):
"""RETURNS (int): Brown cluster ID.""" """RETURNS (int): Brown cluster ID."""
def __get__(self): cluster_table = self.vocab.lookups.get_table("lexeme_cluster", {})
cluster_table = self.vocab.lookups.get_table("lexeme_cluster", {}) return cluster_table.get(self.c.orth, 0)
return cluster_table.get(self.c.orth, 0)
def __set__(self, int x): @cluster.setter
cluster_table = self.vocab.lookups.get_table("lexeme_cluster", {}) def cluster(self, int x):
cluster_table[self.c.orth] = x cluster_table = self.vocab.lookups.get_table("lexeme_cluster", {})
cluster_table[self.c.orth] = x
property lang: @property
def lang(self):
"""RETURNS (uint64): Language of the parent vocabulary.""" """RETURNS (uint64): Language of the parent vocabulary."""
def __get__(self): return self.c.lang
return self.c.lang
def __set__(self, attr_t x): @lang.setter
self.c.lang = x def lang(self, attr_t x):
self.c.lang = x
property prob: @property
def prob(self):
"""RETURNS (float): Smoothed log probability estimate of the lexeme's """RETURNS (float): Smoothed log probability estimate of the lexeme's
type.""" type."""
def __get__(self): prob_table = self.vocab.lookups.get_table("lexeme_prob", {})
prob_table = self.vocab.lookups.get_table("lexeme_prob", {}) settings_table = self.vocab.lookups.get_table("lexeme_settings", {})
settings_table = self.vocab.lookups.get_table("lexeme_settings", {}) default_oov_prob = settings_table.get("oov_prob", -20.0)
default_oov_prob = settings_table.get("oov_prob", -20.0) return prob_table.get(self.c.orth, default_oov_prob)
return prob_table.get(self.c.orth, default_oov_prob)
def __set__(self, float x): @prob.setter
prob_table = self.vocab.lookups.get_table("lexeme_prob", {}) def prob(self, float x):
prob_table[self.c.orth] = x prob_table = self.vocab.lookups.get_table("lexeme_prob", {})
prob_table[self.c.orth] = x
property lower_: @property
def lower_(self):
"""RETURNS (str): Lowercase form of the word.""" """RETURNS (str): Lowercase form of the word."""
def __get__(self): return self.vocab.strings[self.c.lower]
return self.vocab.strings[self.c.lower]
def __set__(self, str x): @lower_.setter
self.c.lower = self.vocab.strings.add(x) def lower_(self, str x):
self.c.lower = self.vocab.strings.add(x)
property norm_: @property
def norm_(self):
"""RETURNS (str): The lexeme's norm, i.e. a normalised form of the """RETURNS (str): The lexeme's norm, i.e. a normalised form of the
lexeme text. lexeme text.
""" """
def __get__(self): return self.vocab.strings[self.c.norm]
return self.vocab.strings[self.c.norm]
def __set__(self, str x): @norm_.setter
self.norm = self.vocab.strings.add(x) def norm_(self, str x):
self.norm = self.vocab.strings.add(x)
property shape_: @property
def shape_(self):
"""RETURNS (str): Transform of the word's string, to show """RETURNS (str): Transform of the word's string, to show
orthographic features. orthographic features.
""" """
def __get__(self): return self.vocab.strings[self.c.shape]
return self.vocab.strings[self.c.shape]
def __set__(self, str x): @shape_.setter
self.c.shape = self.vocab.strings.add(x) def shape_(self, str x):
self.c.shape = self.vocab.strings.add(x)
property prefix_: @property
def prefix_(self):
"""RETURNS (str): Length-N substring from the start of the word. """RETURNS (str): Length-N substring from the start of the word.
Defaults to `N=1`. Defaults to `N=1`.
""" """
def __get__(self): return self.vocab.strings[self.c.prefix]
return self.vocab.strings[self.c.prefix]
def __set__(self, str x): @prefix_.setter
self.c.prefix = self.vocab.strings.add(x) def prefix_(self, str x):
self.c.prefix = self.vocab.strings.add(x)
property suffix_: @property
def suffix_(self):
"""RETURNS (str): Length-N substring from the end of the word. """RETURNS (str): Length-N substring from the end of the word.
Defaults to `N=3`. Defaults to `N=3`.
""" """
def __get__(self): return self.vocab.strings[self.c.suffix]
return self.vocab.strings[self.c.suffix]
def __set__(self, str x): @suffix_.setter
self.c.suffix = self.vocab.strings.add(x) def suffix_(self, str x):
self.c.suffix = self.vocab.strings.add(x)
property lang_: @property
def lang_(self):
"""RETURNS (str): Language of the parent vocabulary.""" """RETURNS (str): Language of the parent vocabulary."""
def __get__(self): return self.vocab.strings[self.c.lang]
return self.vocab.strings[self.c.lang]
def __set__(self, str x): @lang_.setter
self.c.lang = self.vocab.strings.add(x) def lang_(self, str x):
self.c.lang = self.vocab.strings.add(x)
property flags: @property
def flags(self):
"""RETURNS (uint64): Container of the lexeme's binary flags.""" """RETURNS (uint64): Container of the lexeme's binary flags."""
def __get__(self): return self.c.flags
return self.c.flags
def __set__(self, flags_t x): @flags.setter
self.c.flags = x def flags(self, flags_t x):
self.c.flags = x
@property @property
def is_oov(self): def is_oov(self):
"""RETURNS (bool): Whether the lexeme is out-of-vocabulary.""" """RETURNS (bool): Whether the lexeme is out-of-vocabulary."""
return self.orth not in self.vocab.vectors return self.orth not in self.vocab.vectors
property is_stop: @property
def is_stop(self):
"""RETURNS (bool): Whether the lexeme is a stop word.""" """RETURNS (bool): Whether the lexeme is a stop word."""
def __get__(self): return Lexeme.c_check_flag(self.c, IS_STOP)
return Lexeme.c_check_flag(self.c, IS_STOP)
def __set__(self, bint x): @is_stop.setter
Lexeme.c_set_flag(self.c, IS_STOP, x) def is_stop(self, bint x):
Lexeme.c_set_flag(self.c, IS_STOP, x)
property is_alpha: @property
def is_alpha(self):
"""RETURNS (bool): Whether the lexeme consists of alphabetic """RETURNS (bool): Whether the lexeme consists of alphabetic
characters. Equivalent to `lexeme.text.isalpha()`. characters. Equivalent to `lexeme.text.isalpha()`.
""" """
def __get__(self): return Lexeme.c_check_flag(self.c, IS_ALPHA)
return Lexeme.c_check_flag(self.c, IS_ALPHA)
def __set__(self, bint x): @is_alpha.setter
Lexeme.c_set_flag(self.c, IS_ALPHA, x) def is_alpha(self, bint x):
Lexeme.c_set_flag(self.c, IS_ALPHA, x)
property is_ascii: @property
def is_ascii(self):
"""RETURNS (bool): Whether the lexeme consists of ASCII characters. """RETURNS (bool): Whether the lexeme consists of ASCII characters.
Equivalent to `[any(ord(c) >= 128 for c in lexeme.text)]`. Equivalent to `[any(ord(c) >= 128 for c in lexeme.text)]`.
""" """
def __get__(self): return Lexeme.c_check_flag(self.c, IS_ASCII)
return Lexeme.c_check_flag(self.c, IS_ASCII)
def __set__(self, bint x): @is_ascii.setter
Lexeme.c_set_flag(self.c, IS_ASCII, x) def is_ascii(self, bint x):
Lexeme.c_set_flag(self.c, IS_ASCII, x)
property is_digit: @property
def is_digit(self):
"""RETURNS (bool): Whether the lexeme consists of digits. Equivalent """RETURNS (bool): Whether the lexeme consists of digits. Equivalent
to `lexeme.text.isdigit()`. to `lexeme.text.isdigit()`.
""" """
def __get__(self): return Lexeme.c_check_flag(self.c, IS_DIGIT)
return Lexeme.c_check_flag(self.c, IS_DIGIT)
def __set__(self, bint x): @is_digit.setter
Lexeme.c_set_flag(self.c, IS_DIGIT, x) def is_digit(self, bint x):
Lexeme.c_set_flag(self.c, IS_DIGIT, x)
property is_lower: @property
def is_lower(self):
"""RETURNS (bool): Whether the lexeme is in lowercase. Equivalent to """RETURNS (bool): Whether the lexeme is in lowercase. Equivalent to
`lexeme.text.islower()`. `lexeme.text.islower()`.
""" """
def __get__(self): return Lexeme.c_check_flag(self.c, IS_LOWER)
return Lexeme.c_check_flag(self.c, IS_LOWER)
def __set__(self, bint x): @is_lower.setter
Lexeme.c_set_flag(self.c, IS_LOWER, x) def is_lower(self, bint x):
Lexeme.c_set_flag(self.c, IS_LOWER, x)
property is_upper: @property
def is_upper(self):
"""RETURNS (bool): Whether the lexeme is in uppercase. Equivalent to """RETURNS (bool): Whether the lexeme is in uppercase. Equivalent to
`lexeme.text.isupper()`. `lexeme.text.isupper()`.
""" """
def __get__(self): return Lexeme.c_check_flag(self.c, IS_UPPER)
return Lexeme.c_check_flag(self.c, IS_UPPER)
def __set__(self, bint x): @is_upper.setter
Lexeme.c_set_flag(self.c, IS_UPPER, x) def is_upper(self, bint x):
Lexeme.c_set_flag(self.c, IS_UPPER, x)
property is_title: @property
def is_title(self):
"""RETURNS (bool): Whether the lexeme is in titlecase. Equivalent to """RETURNS (bool): Whether the lexeme is in titlecase. Equivalent to
`lexeme.text.istitle()`. `lexeme.text.istitle()`.
""" """
def __get__(self): return Lexeme.c_check_flag(self.c, IS_TITLE)
return Lexeme.c_check_flag(self.c, IS_TITLE)
def __set__(self, bint x): @is_title.setter
Lexeme.c_set_flag(self.c, IS_TITLE, x) def is_title(self, bint x):
Lexeme.c_set_flag(self.c, IS_TITLE, x)
property is_punct: @property
def is_punct(self):
"""RETURNS (bool): Whether the lexeme is punctuation.""" """RETURNS (bool): Whether the lexeme is punctuation."""
def __get__(self): return Lexeme.c_check_flag(self.c, IS_PUNCT)
return Lexeme.c_check_flag(self.c, IS_PUNCT)
def __set__(self, bint x): @is_punct.setter
Lexeme.c_set_flag(self.c, IS_PUNCT, x) def is_punct(self, bint x):
Lexeme.c_set_flag(self.c, IS_PUNCT, x)
property is_space: @property
def is_space(self):
"""RETURNS (bool): Whether the lexeme consist of whitespace characters. """RETURNS (bool): Whether the lexeme consist of whitespace characters.
Equivalent to `lexeme.text.isspace()`. Equivalent to `lexeme.text.isspace()`.
""" """
def __get__(self): return Lexeme.c_check_flag(self.c, IS_SPACE)
return Lexeme.c_check_flag(self.c, IS_SPACE)
def __set__(self, bint x): @is_space.setter
Lexeme.c_set_flag(self.c, IS_SPACE, x) def is_space(self, bint x):
Lexeme.c_set_flag(self.c, IS_SPACE, x)
property is_bracket: @property
def is_bracket(self):
"""RETURNS (bool): Whether the lexeme is a bracket.""" """RETURNS (bool): Whether the lexeme is a bracket."""
def __get__(self): return Lexeme.c_check_flag(self.c, IS_BRACKET)
return Lexeme.c_check_flag(self.c, IS_BRACKET)
def __set__(self, bint x): @is_bracket.setter
Lexeme.c_set_flag(self.c, IS_BRACKET, x) def is_bracket(self, bint x):
Lexeme.c_set_flag(self.c, IS_BRACKET, x)
property is_quote: @property
def is_quote(self):
"""RETURNS (bool): Whether the lexeme is a quotation mark.""" """RETURNS (bool): Whether the lexeme is a quotation mark."""
def __get__(self): return Lexeme.c_check_flag(self.c, IS_QUOTE)
return Lexeme.c_check_flag(self.c, IS_QUOTE)
def __set__(self, bint x): @is_quote.setter
Lexeme.c_set_flag(self.c, IS_QUOTE, x) def is_quote(self, bint x):
Lexeme.c_set_flag(self.c, IS_QUOTE, x)
property is_left_punct: @property
def is_left_punct(self):
"""RETURNS (bool): Whether the lexeme is left punctuation, e.g. (.""" """RETURNS (bool): Whether the lexeme is left punctuation, e.g. (."""
def __get__(self): return Lexeme.c_check_flag(self.c, IS_LEFT_PUNCT)
return Lexeme.c_check_flag(self.c, IS_LEFT_PUNCT)
def __set__(self, bint x): @is_left_punct.setter
Lexeme.c_set_flag(self.c, IS_LEFT_PUNCT, x) def is_left_punct(self, bint x):
Lexeme.c_set_flag(self.c, IS_LEFT_PUNCT, x)
property is_right_punct: @property
def is_right_punct(self):
"""RETURNS (bool): Whether the lexeme is right punctuation, e.g. ).""" """RETURNS (bool): Whether the lexeme is right punctuation, e.g. )."""
def __get__(self): return Lexeme.c_check_flag(self.c, IS_RIGHT_PUNCT)
return Lexeme.c_check_flag(self.c, IS_RIGHT_PUNCT)
def __set__(self, bint x): @is_right_punct.setter
Lexeme.c_set_flag(self.c, IS_RIGHT_PUNCT, x) def is_right_punct(self, bint x):
Lexeme.c_set_flag(self.c, IS_RIGHT_PUNCT, x)
property is_currency: @property
def is_currency(self):
"""RETURNS (bool): Whether the lexeme is a currency symbol, e.g. $, €.""" """RETURNS (bool): Whether the lexeme is a currency symbol, e.g. $, €."""
def __get__(self): return Lexeme.c_check_flag(self.c, IS_CURRENCY)
return Lexeme.c_check_flag(self.c, IS_CURRENCY)
def __set__(self, bint x): @is_currency.setter
Lexeme.c_set_flag(self.c, IS_CURRENCY, x) def is_currency(self, bint x):
Lexeme.c_set_flag(self.c, IS_CURRENCY, x)
property like_url: @property
def like_url(self):
"""RETURNS (bool): Whether the lexeme resembles a URL.""" """RETURNS (bool): Whether the lexeme resembles a URL."""
def __get__(self): return Lexeme.c_check_flag(self.c, LIKE_URL)
return Lexeme.c_check_flag(self.c, LIKE_URL)
def __set__(self, bint x): @like_url.setter
Lexeme.c_set_flag(self.c, LIKE_URL, x) def like_url(self, bint x):
Lexeme.c_set_flag(self.c, LIKE_URL, x)
property like_num: @property
def like_num(self):
"""RETURNS (bool): Whether the lexeme represents a number, e.g. "10.9", """RETURNS (bool): Whether the lexeme represents a number, e.g. "10.9",
"10", "ten", etc. "10", "ten", etc.
""" """
def __get__(self): return Lexeme.c_check_flag(self.c, LIKE_NUM)
return Lexeme.c_check_flag(self.c, LIKE_NUM)
def __set__(self, bint x): @like_num.setter
Lexeme.c_set_flag(self.c, LIKE_NUM, x) def like_num(self, bint x):
Lexeme.c_set_flag(self.c, LIKE_NUM, x)
property like_email: @property
def like_email(self):
"""RETURNS (bool): Whether the lexeme resembles an email address.""" """RETURNS (bool): Whether the lexeme resembles an email address."""
def __get__(self): return Lexeme.c_check_flag(self.c, LIKE_EMAIL)
return Lexeme.c_check_flag(self.c, LIKE_EMAIL)
def __set__(self, bint x): @like_email.setter
Lexeme.c_set_flag(self.c, LIKE_EMAIL, x) def like_email(self, bint x):
Lexeme.c_set_flag(self.c, LIKE_EMAIL, x)
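The pattern applied throughout the file above, shown in plain-Python form: Cython's legacy `property name:` blocks with nested __get__/__set__ are rewritten as @property/@setter pairs, which Cython 3 compiles directly (illustrative class only, not spaCy code):

class Point:
    def __init__(self, x=0.0):
        self._x = x

    @property
    def x(self):
        """Getter, replacing the old `def __get__(self)` block."""
        return self._x

    @x.setter
    def x(self, value):
        """Setter, replacing the old `def __set__(self, value)` block."""
        self._x = value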

View File

@ -1,4 +1,4 @@
-# cython: binding=True, infer_types=True
+# cython: binding=True, infer_types=True, language_level=3
from cpython.object cimport PyObject
from libc.stdint cimport int64_t
@ -27,6 +27,5 @@ cpdef bint levenshtein_compare(input_text: str, pattern_text: str, fuzzy: int =
    return levenshtein(input_text, pattern_text, max_edits) <= max_edits

-@registry.misc("spacy.levenshtein_compare.v1")
def make_levenshtein_compare():
    return levenshtein_compare

View File

@ -625,7 +625,7 @@ cdef action_t get_action(
    const TokenC * token,
    const attr_t * extra_attrs,
    const int8_t * predicate_matches
-) nogil:
+) noexcept nogil:
    """We need to consider:
    a) Does the token match the specification? [Yes, No]
    b) What's the quantifier? [1, 0+, ?]
@ -740,7 +740,7 @@ cdef int8_t get_is_match(
    const TokenC* token,
    const attr_t* extra_attrs,
    const int8_t* predicate_matches
-) nogil:
+) noexcept nogil:
    for i in range(state.pattern.nr_py):
        if predicate_matches[state.pattern.py_predicates[i]] == -1:
            return 0
@ -755,14 +755,14 @@ cdef int8_t get_is_match(
    return True

-cdef inline int8_t get_is_final(PatternStateC state) nogil:
+cdef inline int8_t get_is_final(PatternStateC state) noexcept nogil:
    if state.pattern[1].quantifier == FINAL_ID:
        return 1
    else:
        return 0

-cdef inline int8_t get_quantifier(PatternStateC state) nogil:
+cdef inline int8_t get_quantifier(PatternStateC state) noexcept nogil:
    return state.pattern.quantifier
@ -805,7 +805,7 @@ cdef TokenPatternC* init_pattern(Pool mem, attr_t entity_id, object token_specs)
    return pattern

-cdef attr_t get_ent_id(const TokenPatternC* pattern) nogil:
+cdef attr_t get_ent_id(const TokenPatternC* pattern) noexcept nogil:
    while pattern.quantifier != FINAL_ID:
        pattern += 1
    id_attr = pattern[0].attrs[0]

View File

@ -47,7 +47,7 @@ cdef class PhraseMatcher:
        self._terminal_hash = 826361138722620965
        map_init(self.mem, self.c_map, 8)

-        if isinstance(attr, (int, long)):
+        if isinstance(attr, int):
            self.attr = attr
        else:
            if attr is None:

View File

@ -7,7 +7,6 @@ from ..tokens import Doc
from ..util import registry

-@registry.layers("spacy.CharEmbed.v1")
def CharacterEmbed(nM: int, nC: int) -> Model[List[Doc], List[Floats2d]]:
    # nM: Number of dimensions per character. nC: Number of characters.
    return Model(

View File

@ -3,7 +3,6 @@ from thinc.api import Model, normal_init
from ..util import registry

-@registry.layers("spacy.PrecomputableAffine.v1")
def PrecomputableAffine(nO, nI, nF, nP, dropout=0.1):
    model = Model(
        "precomputable_affine",

View File

@ -50,7 +50,6 @@ def models_with_nvtx_range(nlp, forward_color: int, backprop_color: int):
    return nlp

-@registry.callbacks("spacy.models_with_nvtx_range.v1")
def create_models_with_nvtx_range(
    forward_color: int = -1, backprop_color: int = -1
) -> Callable[["Language"], "Language"]:
@ -110,7 +109,6 @@ def pipes_with_nvtx_range(
    return nlp

-@registry.callbacks("spacy.models_and_pipes_with_nvtx_range.v1")
def create_models_and_pipes_with_nvtx_range(
    forward_color: int = -1,
    backprop_color: int = -1,

View File

@ -4,7 +4,6 @@ from ..attrs import LOWER
from ..util import registry

-@registry.layers("spacy.extract_ngrams.v1")
def extract_ngrams(ngram_size: int, attr: int = LOWER) -> Model:
    model: Model = Model("extract_ngrams", forward)
    model.attrs["ngram_size"] = ngram_size

View File

@ -6,7 +6,6 @@ from thinc.types import Ints1d, Ragged
from ..util import registry

-@registry.layers("spacy.extract_spans.v1")
def extract_spans() -> Model[Tuple[Ragged, Ragged], Ragged]:
    """Extract spans from a sequence of source arrays, as specified by an array
    of (start, end) indices. The output is a ragged array of the

View File

@ -6,8 +6,9 @@ from thinc.types import Ints2d
from ..tokens import Doc

-@registry.layers("spacy.FeatureExtractor.v1")
-def FeatureExtractor(columns: List[Union[int, str]]) -> Model[List[Doc], List[Ints2d]]:
+def FeatureExtractor(
+    columns: Union[List[str], List[int], List[Union[int, str]]]
+) -> Model[List[Doc], List[Ints2d]]:
    return Model("extract_features", forward, attrs={"columns": columns})
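A rough usage sketch for the layer above (the import path and blank pipeline are assumptions for illustration):

import spacy
from spacy.attrs import LOWER, SHAPE
from spacy.ml.featureextractor import FeatureExtractor  # assumed import path

nlp = spacy.blank("en")
extractor = FeatureExtractor([LOWER, SHAPE])
features = extractor.predict([nlp("Hello world")])
print(features[0].shape)   # (2, 2): one row per token, one column per attribute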

View File

@ -28,7 +28,6 @@ from ...vocab import Vocab
from ..extract_spans import extract_spans

-@registry.architectures("spacy.EntityLinker.v2")
def build_nel_encoder(
    tok2vec: Model, nO: Optional[int] = None
) -> Model[List[Doc], Floats2d]:
@ -92,7 +91,6 @@ def span_maker_forward(model, docs: List[Doc], is_train) -> Tuple[Ragged, Callab
    return out, lambda x: []

-@registry.misc("spacy.KBFromFile.v1")
def load_kb(
    kb_path: Path,
) -> Callable[[Vocab], KnowledgeBase]:
@ -104,7 +102,6 @@ def load_kb(
    return kb_from_file

-@registry.misc("spacy.EmptyKB.v2")
def empty_kb_for_config() -> Callable[[Vocab, int], KnowledgeBase]:
    def empty_kb_factory(vocab: Vocab, entity_vector_length: int):
        return InMemoryLookupKB(vocab=vocab, entity_vector_length=entity_vector_length)
@ -112,7 +109,6 @@ def empty_kb_for_config() -> Callable[[Vocab, int], KnowledgeBase]:
    return empty_kb_factory

-@registry.misc("spacy.EmptyKB.v1")
def empty_kb(
    entity_vector_length: int,
) -> Callable[[Vocab], KnowledgeBase]:
@ -122,12 +118,10 @@ def empty_kb(
    return empty_kb_factory

-@registry.misc("spacy.CandidateGenerator.v1")
def create_candidates() -> Callable[[KnowledgeBase, Span], Iterable[Candidate]]:
    return get_candidates

-@registry.misc("spacy.CandidateBatchGenerator.v1")
def create_candidates_batch() -> Callable[
    [KnowledgeBase, Iterable[Span]], Iterable[Iterable[Candidate]]
]:

View File

@ -30,7 +30,6 @@ if TYPE_CHECKING:
    from ...vocab import Vocab  # noqa: F401

-@registry.architectures("spacy.PretrainVectors.v1")
def create_pretrain_vectors(
    maxout_pieces: int, hidden_size: int, loss: str
) -> Callable[["Vocab", Model], Model]:
@ -57,7 +56,6 @@ def create_pretrain_vectors(
    return create_vectors_objective

-@registry.architectures("spacy.PretrainCharacters.v1")
def create_pretrain_characters(
    maxout_pieces: int, hidden_size: int, n_characters: int
) -> Callable[["Vocab", Model], Model]:

View File

@ -11,7 +11,6 @@ from .._precomputable_affine import PrecomputableAffine
from ..tb_framework import TransitionModel

-@registry.architectures("spacy.TransitionBasedParser.v2")
def build_tb_parser_model(
    tok2vec: Model[List[Doc], List[Floats2d]],
    state_type: Literal["parser", "ner"],

View File

@ -10,7 +10,6 @@ InT = List[Doc]
OutT = Floats2d

-@registry.architectures("spacy.SpanFinder.v1")
def build_finder_model(
    tok2vec: Model[InT, List[Floats2d]], scorer: Model[OutT, OutT]
) -> Model[InT, OutT]:

View File

@ -22,7 +22,6 @@ from ...util import registry
from ..extract_spans import extract_spans

-@registry.layers("spacy.LinearLogistic.v1")
def build_linear_logistic(nO=None, nI=None) -> Model[Floats2d, Floats2d]:
    """An output layer for multi-label classification. It uses a linear layer
    followed by a logistic activation.
@ -30,7 +29,6 @@ def build_linear_logistic(nO=None, nI=None) -> Model[Floats2d, Floats2d]:
    return chain(Linear(nO=nO, nI=nI, init_W=glorot_uniform_init), Logistic())

-@registry.layers("spacy.mean_max_reducer.v1")
def build_mean_max_reducer(hidden_size: int) -> Model[Ragged, Floats2d]:
    """Reduce sequences by concatenating their mean and max pooled vectors,
    and then combine the concatenated vectors with a hidden layer.
@ -46,7 +44,6 @@ def build_mean_max_reducer(hidden_size: int) -> Model[Ragged, Floats2d]:
    )

-@registry.architectures("spacy.SpanCategorizer.v1")
def build_spancat_model(
    tok2vec: Model[List[Doc], List[Floats2d]],
    reducer: Model[Ragged, Floats2d],

View File

@ -7,7 +7,6 @@ from ...tokens import Doc
from ...util import registry

-@registry.architectures("spacy.Tagger.v2")
def build_tagger_model(
    tok2vec: Model[List[Doc], List[Floats2d]], nO: Optional[int] = None, normalize=False
) -> Model[List[Doc], List[Floats2d]]:


@@ -1,21 +1,27 @@
 from functools import partial
-from typing import List, Optional, cast
+from typing import List, Optional, Tuple, cast

 from thinc.api import (
     Dropout,
+    Gelu,
     LayerNorm,
     Linear,
     Logistic,
     Maxout,
     Model,
     ParametricAttention,
+    ParametricAttention_v2,
     Relu,
     Softmax,
     SparseLinear,
+    SparseLinear_v2,
     chain,
     clone,
     concatenate,
     list2ragged,
+    reduce_first,
+    reduce_last,
+    reduce_max,
     reduce_mean,
     reduce_sum,
     residual,
@@ -25,9 +31,10 @@ from thinc.api import (
 )
 from thinc.layers.chain import init as init_chain
 from thinc.layers.resizable import resize_linear_weighted, resize_model
-from thinc.types import Floats2d
+from thinc.types import ArrayXd, Floats2d

 from ...attrs import ORTH
+from ...errors import Errors
 from ...tokens import Doc
 from ...util import registry
 from ..extract_ngrams import extract_ngrams
@@ -37,7 +44,6 @@ from .tok2vec import get_tok2vec_width

 NEG_VALUE = -5000


-@registry.architectures("spacy.TextCatCNN.v2")
 def build_simple_cnn_text_classifier(
     tok2vec: Model, exclusive_classes: bool, nO: Optional[int] = None
 ) -> Model[List[Doc], Floats2d]:
@@ -47,39 +53,15 @@ def build_simple_cnn_text_classifier(
     outputs sum to 1. If exclusive_classes=False, a logistic non-linearity
     is applied instead, so that outputs are in the range [0, 1].
     """
-    fill_defaults = {"b": 0, "W": 0}
-    with Model.define_operators({">>": chain}):
-        cnn = tok2vec >> list2ragged() >> reduce_mean()
-        nI = tok2vec.maybe_get_dim("nO")
-        if exclusive_classes:
-            output_layer = Softmax(nO=nO, nI=nI)
-            fill_defaults["b"] = NEG_VALUE
-            resizable_layer: Model = resizable(
-                output_layer,
-                resize_layer=partial(
-                    resize_linear_weighted, fill_defaults=fill_defaults
-                ),
-            )
-            model = cnn >> resizable_layer
-        else:
-            output_layer = Linear(nO=nO, nI=nI)
-            resizable_layer = resizable(
-                output_layer,
-                resize_layer=partial(
-                    resize_linear_weighted, fill_defaults=fill_defaults
-                ),
-            )
-            model = cnn >> resizable_layer >> Logistic()
-        model.set_ref("output_layer", output_layer)
-        model.attrs["resize_output"] = partial(
-            resize_and_set_ref,
-            resizable_layer=resizable_layer,
-        )
-    model.set_ref("tok2vec", tok2vec)
-    if nO is not None:
-        model.set_dim("nO", cast(int, nO))
-    model.attrs["multi_label"] = not exclusive_classes
-    return model
+    return build_reduce_text_classifier(
+        tok2vec=tok2vec,
+        exclusive_classes=exclusive_classes,
+        use_reduce_first=False,
+        use_reduce_last=False,
+        use_reduce_max=False,
+        use_reduce_mean=True,
+        nO=nO,
+    )


 def resize_and_set_ref(model, new_nO, resizable_layer):
@@ -89,16 +71,52 @@ def resize_and_set_ref(model, new_nO, resizable_layer):
     return model


-@registry.architectures("spacy.TextCatBOW.v2")
 def build_bow_text_classifier(
     exclusive_classes: bool,
     ngram_size: int,
     no_output_layer: bool,
     nO: Optional[int] = None,
+) -> Model[List[Doc], Floats2d]:
+    return _build_bow_text_classifier(
+        exclusive_classes=exclusive_classes,
+        ngram_size=ngram_size,
+        no_output_layer=no_output_layer,
+        nO=nO,
+        sparse_linear=SparseLinear(nO=nO),
+    )
+
+
+def build_bow_text_classifier_v3(
+    exclusive_classes: bool,
+    ngram_size: int,
+    no_output_layer: bool,
+    length: int = 262144,
+    nO: Optional[int] = None,
+) -> Model[List[Doc], Floats2d]:
+    if length < 1:
+        raise ValueError(Errors.E1056.format(length=length))
+
+    # Find k such that 2**(k-1) < length <= 2**k.
+    length = 2 ** (length - 1).bit_length()
+
+    return _build_bow_text_classifier(
+        exclusive_classes=exclusive_classes,
+        ngram_size=ngram_size,
+        no_output_layer=no_output_layer,
+        nO=nO,
+        sparse_linear=SparseLinear_v2(nO=nO, length=length),
+    )
+
+
+def _build_bow_text_classifier(
+    exclusive_classes: bool,
+    ngram_size: int,
+    no_output_layer: bool,
+    sparse_linear: Model[Tuple[ArrayXd, ArrayXd, ArrayXd], ArrayXd],
+    nO: Optional[int] = None,
 ) -> Model[List[Doc], Floats2d]:
     fill_defaults = {"b": 0, "W": 0}
     with Model.define_operators({">>": chain}):
-        sparse_linear = SparseLinear(nO=nO)
         output_layer = None
         if not no_output_layer:
             fill_defaults["b"] = NEG_VALUE
@@ -121,12 +139,14 @@ def build_bow_text_classifier(
     return model
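A quick standalone illustration (not part of the diff; the helper name is ours) of the rounding that `build_bow_text_classifier_v3` applies to the requested table length:

```python
# Round a requested hash-table length up to the next power of two,
# mirroring the `2 ** (length - 1).bit_length()` expression above.
def round_up_to_power_of_two(length: int) -> int:
    if length < 1:
        raise ValueError("length must be >= 1")
    return 2 ** (length - 1).bit_length()

print(round_up_to_power_of_two(262144))  # 262144 (already a power of two)
print(round_up_to_power_of_two(100000))  # 131072
print(round_up_to_power_of_two(1))       # 1
```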
@registry.architectures("spacy.TextCatEnsemble.v2")
def build_text_classifier_v2( def build_text_classifier_v2(
tok2vec: Model[List[Doc], List[Floats2d]], tok2vec: Model[List[Doc], List[Floats2d]],
linear_model: Model[List[Doc], Floats2d], linear_model: Model[List[Doc], Floats2d],
nO: Optional[int] = None, nO: Optional[int] = None,
) -> Model[List[Doc], Floats2d]: ) -> Model[List[Doc], Floats2d]:
# TODO: build the model with _build_parametric_attention_with_residual_nonlinear
# in spaCy v4. We don't do this in spaCy v3 to preserve model
# compatibility.
exclusive_classes = not linear_model.attrs["multi_label"] exclusive_classes = not linear_model.attrs["multi_label"]
with Model.define_operators({">>": chain, "|": concatenate}): with Model.define_operators({">>": chain, "|": concatenate}):
width = tok2vec.maybe_get_dim("nO") width = tok2vec.maybe_get_dim("nO")
@ -161,6 +181,11 @@ def build_text_classifier_v2(
def init_ensemble_textcat(model, X, Y) -> Model: def init_ensemble_textcat(model, X, Y) -> Model:
# When tok2vec is lazily initialized, we need to initialize it before
# the rest of the chain to ensure that we can get its width.
tok2vec = model.get_ref("tok2vec")
tok2vec.initialize(X)
tok2vec_width = get_tok2vec_width(model) tok2vec_width = get_tok2vec_width(model)
model.get_ref("attention_layer").set_dim("nO", tok2vec_width) model.get_ref("attention_layer").set_dim("nO", tok2vec_width)
model.get_ref("maxout_layer").set_dim("nO", tok2vec_width) model.get_ref("maxout_layer").set_dim("nO", tok2vec_width)
@ -171,7 +196,6 @@ def init_ensemble_textcat(model, X, Y) -> Model:
return model return model
@registry.architectures("spacy.TextCatLowData.v1")
def build_text_classifier_lowdata( def build_text_classifier_lowdata(
width: int, dropout: Optional[float], nO: Optional[int] = None width: int, dropout: Optional[float], nO: Optional[int] = None
) -> Model[List[Doc], Floats2d]: ) -> Model[List[Doc], Floats2d]:
@ -190,3 +214,151 @@ def build_text_classifier_lowdata(
model = model >> Dropout(dropout) model = model >> Dropout(dropout)
model = model >> Logistic() model = model >> Logistic()
return model return model
def build_textcat_parametric_attention_v1(
tok2vec: Model[List[Doc], List[Floats2d]],
exclusive_classes: bool,
nO: Optional[int] = None,
) -> Model[List[Doc], Floats2d]:
width = tok2vec.maybe_get_dim("nO")
parametric_attention = _build_parametric_attention_with_residual_nonlinear(
tok2vec=tok2vec,
nonlinear_layer=Maxout(nI=width, nO=width),
key_transform=Gelu(nI=width, nO=width),
)
with Model.define_operators({">>": chain}):
if exclusive_classes:
output_layer = Softmax(nO=nO)
else:
output_layer = Linear(nO=nO) >> Logistic()
model = parametric_attention >> output_layer
if model.has_dim("nO") is not False and nO is not None:
model.set_dim("nO", cast(int, nO))
model.set_ref("output_layer", output_layer)
model.attrs["multi_label"] = not exclusive_classes
return model
def _build_parametric_attention_with_residual_nonlinear(
*,
tok2vec: Model[List[Doc], List[Floats2d]],
nonlinear_layer: Model[Floats2d, Floats2d],
key_transform: Optional[Model[Floats2d, Floats2d]] = None,
) -> Model[List[Doc], Floats2d]:
with Model.define_operators({">>": chain, "|": concatenate}):
width = tok2vec.maybe_get_dim("nO")
attention_layer = ParametricAttention_v2(nO=width, key_transform=key_transform)
norm_layer = LayerNorm(nI=width)
parametric_attention = (
tok2vec
>> list2ragged()
>> attention_layer
>> reduce_sum()
>> residual(nonlinear_layer >> norm_layer >> Dropout(0.0))
)
parametric_attention.init = _init_parametric_attention_with_residual_nonlinear
parametric_attention.set_ref("tok2vec", tok2vec)
parametric_attention.set_ref("attention_layer", attention_layer)
parametric_attention.set_ref("key_transform", key_transform)
parametric_attention.set_ref("nonlinear_layer", nonlinear_layer)
parametric_attention.set_ref("norm_layer", norm_layer)
return parametric_attention
def _init_parametric_attention_with_residual_nonlinear(model, X, Y) -> Model:
# When tok2vec is lazily initialized, we need to initialize it before
# the rest of the chain to ensure that we can get its width.
tok2vec = model.get_ref("tok2vec")
tok2vec.initialize(X)
tok2vec_width = get_tok2vec_width(model)
model.get_ref("attention_layer").set_dim("nO", tok2vec_width)
model.get_ref("key_transform").set_dim("nI", tok2vec_width)
model.get_ref("key_transform").set_dim("nO", tok2vec_width)
model.get_ref("nonlinear_layer").set_dim("nI", tok2vec_width)
model.get_ref("nonlinear_layer").set_dim("nO", tok2vec_width)
model.get_ref("norm_layer").set_dim("nI", tok2vec_width)
model.get_ref("norm_layer").set_dim("nO", tok2vec_width)
init_chain(model, X, Y)
return model
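The custom init hooks above initialize the `tok2vec` reference on sample data before the rest of the chain, so its output width is known when the downstream dimensions are set. A minimal Thinc sketch of the same idea, using hypothetical layer names purely for illustration:

```python
import numpy
from thinc.api import Linear, chain

# Hypothetical two-layer chain; "encoder" stands in for tok2vec.
encoder = Linear(nO=8)
head = Linear(nO=2)
model = chain(encoder, head)
model.set_ref("encoder", encoder)

# Initializing the upstream layer first makes its output width available
# before the full chain is initialized, as the init hooks above do.
X = numpy.zeros((4, 5), dtype="f")
encoder.initialize(X)
print(encoder.get_dim("nO"))  # 8
model.initialize(X=X)
```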
def build_reduce_text_classifier(
tok2vec: Model,
exclusive_classes: bool,
use_reduce_first: bool,
use_reduce_last: bool,
use_reduce_max: bool,
use_reduce_mean: bool,
nO: Optional[int] = None,
) -> Model[List[Doc], Floats2d]:
"""Build a model that classifies pooled `Doc` representations.
Pooling is performed using reductions. Reductions are concatenated when
multiple reductions are used.
tok2vec (Model): the tok2vec layer to pool over.
exclusive_classes (bool): Whether or not classes are mutually exclusive.
use_reduce_first (bool): Pool by using the hidden representation of the
first token of a `Doc`.
use_reduce_last (bool): Pool by using the hidden representation of the
last token of a `Doc`.
use_reduce_max (bool): Pool by taking the maximum values of the hidden
representations of a `Doc`.
use_reduce_mean (bool): Pool by taking the mean of all hidden
representations of a `Doc`.
nO (Optional[int]): Number of classes.
"""
fill_defaults = {"b": 0, "W": 0}
reductions = []
if use_reduce_first:
reductions.append(reduce_first())
if use_reduce_last:
reductions.append(reduce_last())
if use_reduce_max:
reductions.append(reduce_max())
if use_reduce_mean:
reductions.append(reduce_mean())
if not len(reductions):
raise ValueError(Errors.E1057)
with Model.define_operators({">>": chain}):
cnn = tok2vec >> list2ragged() >> concatenate(*reductions)
nO_tok2vec = tok2vec.maybe_get_dim("nO")
nI = nO_tok2vec * len(reductions) if nO_tok2vec is not None else None
if exclusive_classes:
output_layer = Softmax(nO=nO, nI=nI)
fill_defaults["b"] = NEG_VALUE
resizable_layer: Model = resizable(
output_layer,
resize_layer=partial(
resize_linear_weighted, fill_defaults=fill_defaults
),
)
model = cnn >> resizable_layer
else:
output_layer = Linear(nO=nO, nI=nI)
resizable_layer = resizable(
output_layer,
resize_layer=partial(
resize_linear_weighted, fill_defaults=fill_defaults
),
)
model = cnn >> resizable_layer >> Logistic()
model.set_ref("output_layer", output_layer)
model.attrs["resize_output"] = partial(
resize_and_set_ref,
resizable_layer=resizable_layer,
)
model.set_ref("tok2vec", tok2vec)
if nO is not None:
model.set_dim("nO", cast(int, nO))
model.attrs["multi_label"] = not exclusive_classes
return model
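A sketch of how the reduction flags compose in practice (imports assume the module layout in this branch; parameter values are arbitrary). With max and mean pooling enabled, the concatenated reducer feeds an output layer whose `nI` is the tok2vec width times the number of reductions:

```python
from spacy.ml.models.tok2vec import build_hash_embed_cnn_tok2vec
from spacy.ml.models.textcat import build_reduce_text_classifier

# Small CNN tok2vec with width 64.
tok2vec = build_hash_embed_cnn_tok2vec(
    width=64,
    depth=2,
    embed_size=500,
    window_size=1,
    maxout_pieces=3,
    subword_features=True,
    pretrained_vectors=False,
)

# Two reductions (max + mean) -> the Softmax output layer gets nI = 64 * 2.
model = build_reduce_text_classifier(
    tok2vec=tok2vec,
    exclusive_classes=True,
    use_reduce_first=False,
    use_reduce_last=False,
    use_reduce_max=True,
    use_reduce_mean=True,
    nO=3,
)
```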


@@ -29,7 +29,6 @@ from ..featureextractor import FeatureExtractor
 from ..staticvectors import StaticVectors


-@registry.architectures("spacy.Tok2VecListener.v1")
 def tok2vec_listener_v1(width: int, upstream: str = "*"):
     tok2vec = Tok2VecListener(upstream_name=upstream, width=width)
     return tok2vec
@@ -46,7 +45,6 @@ def get_tok2vec_width(model: Model):
     return nO


-@registry.architectures("spacy.HashEmbedCNN.v2")
 def build_hash_embed_cnn_tok2vec(
     *,
     width: int,
@@ -102,7 +100,6 @@ def build_hash_embed_cnn_tok2vec(
     )


-@registry.architectures("spacy.Tok2Vec.v2")
 def build_Tok2Vec_model(
     embed: Model[List[Doc], List[Floats2d]],
     encode: Model[List[Floats2d], List[Floats2d]],
@@ -123,10 +120,9 @@ def build_Tok2Vec_model(
     return tok2vec


-@registry.architectures("spacy.MultiHashEmbed.v2")
 def MultiHashEmbed(
     width: int,
-    attrs: List[Union[str, int]],
+    attrs: Union[List[str], List[int], List[Union[str, int]]],
     rows: List[int],
     include_static_vectors: bool,
 ) -> Model[List[Doc], List[Floats2d]]:
@@ -192,7 +188,7 @@ def MultiHashEmbed(
         )
     else:
         model = chain(
-            FeatureExtractor(list(attrs)),
+            FeatureExtractor(attrs),
            cast(Model[List[Ints2d], Ragged], list2ragged()),
            with_array(concatenate(*embeddings)),
            max_out,
@@ -201,7 +197,6 @@ def MultiHashEmbed(
     return model


-@registry.architectures("spacy.CharacterEmbed.v2")
 def CharacterEmbed(
     width: int,
     rows: int,
@@ -278,7 +273,6 @@ def CharacterEmbed(
     return model


-@registry.architectures("spacy.MaxoutWindowEncoder.v2")
 def MaxoutWindowEncoder(
     width: int, window_size: int, maxout_pieces: int, depth: int
 ) -> Model[List[Floats2d], List[Floats2d]]:
@@ -310,7 +304,6 @@ def MaxoutWindowEncoder(
     return with_array(model, pad=receptive_field)


-@registry.architectures("spacy.MishWindowEncoder.v2")
 def MishWindowEncoder(
     width: int, window_size: int, depth: int
 ) -> Model[List[Floats2d], List[Floats2d]]:
@@ -333,7 +326,6 @@ def MishWindowEncoder(
     return with_array(model)


-@registry.architectures("spacy.TorchBiLSTMEncoder.v1")
 def BiLSTMEncoder(
     width: int, depth: int, dropout: float
 ) -> Model[List[Floats2d], List[Floats2d]]:
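The `MultiHashEmbed` hunk relaxes the accepted `attrs` type so a plain list of strings or of attribute IDs is passed straight to `FeatureExtractor`. A minimal construction sketch (values are arbitrary, not recommended settings):

```python
from spacy.ml.models.tok2vec import MultiHashEmbed

# A list of string attribute names now type-checks directly.
embed = MultiHashEmbed(
    width=64,
    attrs=["NORM", "PREFIX", "SUFFIX", "SHAPE"],
    rows=[5000, 1000, 1000, 1000],
    include_static_vectors=False,
)
```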


@ -52,14 +52,14 @@ cdef SizesC get_c_sizes(model, int batch_size) except *:
return output return output
cdef ActivationsC alloc_activations(SizesC n) nogil: cdef ActivationsC alloc_activations(SizesC n) noexcept nogil:
cdef ActivationsC A cdef ActivationsC A
memset(&A, 0, sizeof(A)) memset(&A, 0, sizeof(A))
resize_activations(&A, n) resize_activations(&A, n)
return A return A
cdef void free_activations(const ActivationsC* A) nogil: cdef void free_activations(const ActivationsC* A) noexcept nogil:
free(A.token_ids) free(A.token_ids)
free(A.scores) free(A.scores)
free(A.unmaxed) free(A.unmaxed)
@ -67,7 +67,7 @@ cdef void free_activations(const ActivationsC* A) nogil:
free(A.is_valid) free(A.is_valid)
cdef void resize_activations(ActivationsC* A, SizesC n) nogil: cdef void resize_activations(ActivationsC* A, SizesC n) noexcept nogil:
if n.states <= A._max_size: if n.states <= A._max_size:
A._curr_size = n.states A._curr_size = n.states
return return
@ -100,7 +100,7 @@ cdef void resize_activations(ActivationsC* A, SizesC n) nogil:
cdef void predict_states( cdef void predict_states(
CBlas cblas, ActivationsC* A, StateC** states, const WeightsC* W, SizesC n CBlas cblas, ActivationsC* A, StateC** states, const WeightsC* W, SizesC n
) nogil: ) noexcept nogil:
resize_activations(A, n) resize_activations(A, n)
for i in range(n.states): for i in range(n.states):
states[i].set_context_tokens(&A.token_ids[i*n.feats], n.feats) states[i].set_context_tokens(&A.token_ids[i*n.feats], n.feats)
@ -159,7 +159,7 @@ cdef void sum_state_features(
int B, int B,
int F, int F,
int O int O
) nogil: ) noexcept nogil:
cdef int idx, b, f cdef int idx, b, f
cdef const float* feature cdef const float* feature
padding = cached padding = cached
@ -183,7 +183,7 @@ cdef void cpu_log_loss(
const int* is_valid, const int* is_valid,
const float* scores, const float* scores,
int O int O
) nogil: ) noexcept nogil:
"""Do multi-label log loss""" """Do multi-label log loss"""
cdef double max_, gmax, Z, gZ cdef double max_, gmax, Z, gZ
best = arg_max_if_gold(scores, costs, is_valid, O) best = arg_max_if_gold(scores, costs, is_valid, O)
@ -209,7 +209,7 @@ cdef void cpu_log_loss(
cdef int arg_max_if_gold( cdef int arg_max_if_gold(
const weight_t* scores, const weight_t* costs, const int* is_valid, int n const weight_t* scores, const weight_t* costs, const int* is_valid, int n
) nogil: ) noexcept nogil:
# Find minimum cost # Find minimum cost
cdef float cost = 1 cdef float cost = 1
for i in range(n): for i in range(n):
@ -224,7 +224,7 @@ cdef int arg_max_if_gold(
return best return best
cdef int arg_max_if_valid(const weight_t* scores, const int* is_valid, int n) nogil: cdef int arg_max_if_valid(const weight_t* scores, const int* is_valid, int n) noexcept nogil:
cdef int best = -1 cdef int best = -1
for i in range(n): for i in range(n):
if is_valid[i] >= 1: if is_valid[i] >= 1:


@@ -13,7 +13,6 @@ from ..vectors import Mode, Vectors
 from ..vocab import Vocab


-@registry.layers("spacy.StaticVectors.v2")
 def StaticVectors(
     nO: Optional[int] = None,
     nM: Optional[int] = None,


@@ -4,7 +4,6 @@ from ..util import registry
 from .parser_model import ParserStepModel


-@registry.layers("spacy.TransitionModel.v1")
 def TransitionModel(
     tok2vec, lower, upper, resize_output, dropout=0.2, unseen_classes=set()
 ):


@@ -57,16 +57,20 @@ cdef class Morphology:
         field_feature_pairs = []
         for field in sorted(string_features):
             values = string_features[field]
+            self.strings.add(field, allow_transient=False),
+            field_id = self.strings[field]
             for value in values.split(self.VALUE_SEP):
+                field_sep_value = field + self.FIELD_SEP + value
+                self.strings.add(field_sep_value, allow_transient=False),
                 field_feature_pairs.append((
-                    self.strings.add(field),
-                    self.strings.add(field + self.FIELD_SEP + value),
+                    field_id,
+                    self.strings[field_sep_value]
                 ))
         cdef MorphAnalysisC tag = self.create_morph_tag(field_feature_pairs)
         # the hash key for the tag is either the hash of the normalized UFEATS
         # string or the hash of an empty placeholder
         norm_feats_string = self.normalize_features(features)
-        tag.key = self.strings.add(norm_feats_string)
+        tag.key = self.strings.add(norm_feats_string, allow_transient=False)
         self.insert(tag)
         return tag.key
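The hunk above interns the feature strings non-transiently so their hashes stay resolvable outside a memory zone. A small sketch of the `StringStore` round trip, assuming the `allow_transient` keyword available in this branch:

```python
from spacy.strings import StringStore

strings = StringStore()
# Intern a feature string permanently and resolve it in both directions.
key = strings.add("Case=Nom", allow_transient=False)
assert strings[key] == "Case=Nom"
assert strings["Case=Nom"] == key
```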


@@ -25,3 +25,8 @@ IDS = {
 NAMES = {value: key for key, value in IDS.items()}

+# As of Cython 3.1, the global Python namespace no longer has the enum
+# contents by default.
+globals().update(IDS)
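A pure-Python illustration of the `globals().update(IDS)` pattern used above: the entries of a dict become ordinary module-level constants (the values below are illustrative, not spaCy's real attribute IDs):

```python
# Expose dict entries as module attributes so `from module import ORTH` works.
IDS = {"ORTH": 65, "NORM": 67, "SHAPE": 70}  # example values only
globals().update(IDS)

print(ORTH, NORM, SHAPE)  # the keys are now regular module-level names
```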


@ -17,7 +17,7 @@ from ...typedefs cimport attr_t
from ...vocab cimport EMPTY_LEXEME from ...vocab cimport EMPTY_LEXEME
cdef inline bint is_space_token(const TokenC* token) nogil: cdef inline bint is_space_token(const TokenC* token) noexcept nogil:
return Lexeme.c_check_flag(token.lex, IS_SPACE) return Lexeme.c_check_flag(token.lex, IS_SPACE)
cdef struct ArcC: cdef struct ArcC:
@ -41,7 +41,7 @@ cdef cppclass StateC:
int offset int offset
int _b_i int _b_i
__init__(const TokenC* sent, int length) nogil: inline __init__(const TokenC* sent, int length) noexcept nogil:
this._sent = sent this._sent = sent
this._heads = <int*>calloc(length, sizeof(int)) this._heads = <int*>calloc(length, sizeof(int))
if not (this._sent and this._heads): if not (this._sent and this._heads):
@ -57,10 +57,10 @@ cdef cppclass StateC:
memset(&this._empty_token, 0, sizeof(TokenC)) memset(&this._empty_token, 0, sizeof(TokenC))
this._empty_token.lex = &EMPTY_LEXEME this._empty_token.lex = &EMPTY_LEXEME
__dealloc__(): inline __dealloc__():
free(this._heads) free(this._heads)
void set_context_tokens(int* ids, int n) nogil: inline void set_context_tokens(int* ids, int n) noexcept nogil:
cdef int i, j cdef int i, j
if n == 1: if n == 1:
if this.B(0) >= 0: if this.B(0) >= 0:
@ -131,14 +131,14 @@ cdef cppclass StateC:
else: else:
ids[i] = -1 ids[i] = -1
int S(int i) nogil const: inline int S(int i) noexcept nogil const:
if i >= this._stack.size(): if i >= this._stack.size():
return -1 return -1
elif i < 0: elif i < 0:
return -1 return -1
return this._stack.at(this._stack.size() - (i+1)) return this._stack.at(this._stack.size() - (i+1))
int B(int i) nogil const: inline int B(int i) noexcept nogil const:
if i < 0: if i < 0:
return -1 return -1
elif i < this._rebuffer.size(): elif i < this._rebuffer.size():
@ -150,19 +150,19 @@ cdef cppclass StateC:
else: else:
return b_i return b_i
const TokenC* B_(int i) nogil const: inline const TokenC* B_(int i) noexcept nogil const:
return this.safe_get(this.B(i)) return this.safe_get(this.B(i))
const TokenC* E_(int i) nogil const: inline const TokenC* E_(int i) noexcept nogil const:
return this.safe_get(this.E(i)) return this.safe_get(this.E(i))
const TokenC* safe_get(int i) nogil const: inline const TokenC* safe_get(int i) noexcept nogil const:
if i < 0 or i >= this.length: if i < 0 or i >= this.length:
return &this._empty_token return &this._empty_token
else: else:
return &this._sent[i] return &this._sent[i]
void map_get_arcs(const unordered_map[int, vector[ArcC]] &heads_arcs, vector[ArcC]* out) nogil const: inline void map_get_arcs(const unordered_map[int, vector[ArcC]] &heads_arcs, vector[ArcC]* out) noexcept nogil const:
cdef const vector[ArcC]* arcs cdef const vector[ArcC]* arcs
head_arcs_it = heads_arcs.const_begin() head_arcs_it = heads_arcs.const_begin()
while head_arcs_it != heads_arcs.const_end(): while head_arcs_it != heads_arcs.const_end():
@ -175,23 +175,23 @@ cdef cppclass StateC:
incr(arcs_it) incr(arcs_it)
incr(head_arcs_it) incr(head_arcs_it)
void get_arcs(vector[ArcC]* out) nogil const: inline void get_arcs(vector[ArcC]* out) noexcept nogil const:
this.map_get_arcs(this._left_arcs, out) this.map_get_arcs(this._left_arcs, out)
this.map_get_arcs(this._right_arcs, out) this.map_get_arcs(this._right_arcs, out)
int H(int child) nogil const: inline int H(int child) noexcept nogil const:
if child >= this.length or child < 0: if child >= this.length or child < 0:
return -1 return -1
else: else:
return this._heads[child] return this._heads[child]
int E(int i) nogil const: inline int E(int i) noexcept nogil const:
if this._ents.size() == 0: if this._ents.size() == 0:
return -1 return -1
else: else:
return this._ents.back().start return this._ents.back().start
int nth_child(const unordered_map[int, vector[ArcC]]& heads_arcs, int head, int idx) nogil const: inline int nth_child(const unordered_map[int, vector[ArcC]]& heads_arcs, int head, int idx) noexcept nogil const:
if idx < 1: if idx < 1:
return -1 return -1
@ -215,22 +215,22 @@ cdef cppclass StateC:
return -1 return -1
int L(int head, int idx) nogil const: inline int L(int head, int idx) noexcept nogil const:
return this.nth_child(this._left_arcs, head, idx) return this.nth_child(this._left_arcs, head, idx)
int R(int head, int idx) nogil const: inline int R(int head, int idx) noexcept nogil const:
return this.nth_child(this._right_arcs, head, idx) return this.nth_child(this._right_arcs, head, idx)
bint empty() nogil const: inline bint empty() noexcept nogil const:
return this._stack.size() == 0 return this._stack.size() == 0
bint eol() nogil const: inline bint eol() noexcept nogil const:
return this.buffer_length() == 0 return this.buffer_length() == 0
bint is_final() nogil const: inline bint is_final() noexcept nogil const:
return this.stack_depth() <= 0 and this.eol() return this.stack_depth() <= 0 and this.eol()
int cannot_sent_start(int word) nogil const: inline int cannot_sent_start(int word) noexcept nogil const:
if word < 0 or word >= this.length: if word < 0 or word >= this.length:
return 0 return 0
elif this._sent[word].sent_start == -1: elif this._sent[word].sent_start == -1:
@ -238,7 +238,7 @@ cdef cppclass StateC:
else: else:
return 0 return 0
int is_sent_start(int word) nogil const: inline int is_sent_start(int word) noexcept nogil const:
if word < 0 or word >= this.length: if word < 0 or word >= this.length:
return 0 return 0
elif this._sent[word].sent_start == 1: elif this._sent[word].sent_start == 1:
@ -248,20 +248,20 @@ cdef cppclass StateC:
else: else:
return 0 return 0
void set_sent_start(int word, int value) nogil: inline void set_sent_start(int word, int value) noexcept nogil:
if value >= 1: if value >= 1:
this._sent_starts.insert(word) this._sent_starts.insert(word)
bint has_head(int child) nogil const: inline bint has_head(int child) noexcept nogil const:
return this._heads[child] >= 0 return this._heads[child] >= 0
int l_edge(int word) nogil const: inline int l_edge(int word) noexcept nogil const:
return word return word
int r_edge(int word) nogil const: inline int r_edge(int word) noexcept nogil const:
return word return word
int n_arcs(const unordered_map[int, vector[ArcC]] &heads_arcs, int head) nogil const: inline int n_arcs(const unordered_map[int, vector[ArcC]] &heads_arcs, int head) noexcept nogil const:
cdef int n = 0 cdef int n = 0
head_arcs_it = heads_arcs.const_find(head) head_arcs_it = heads_arcs.const_find(head)
if head_arcs_it == heads_arcs.const_end(): if head_arcs_it == heads_arcs.const_end():
@ -277,28 +277,28 @@ cdef cppclass StateC:
return n return n
int n_L(int head) nogil const: inline int n_L(int head) noexcept nogil const:
return n_arcs(this._left_arcs, head) return n_arcs(this._left_arcs, head)
int n_R(int head) nogil const: inline int n_R(int head) noexcept nogil const:
return n_arcs(this._right_arcs, head) return n_arcs(this._right_arcs, head)
bint stack_is_connected() nogil const: inline bint stack_is_connected() noexcept nogil const:
return False return False
bint entity_is_open() nogil const: inline bint entity_is_open() noexcept nogil const:
if this._ents.size() == 0: if this._ents.size() == 0:
return False return False
else: else:
return this._ents.back().end == -1 return this._ents.back().end == -1
int stack_depth() nogil const: inline int stack_depth() noexcept nogil const:
return this._stack.size() return this._stack.size()
int buffer_length() nogil const: inline int buffer_length() noexcept nogil const:
return (this.length - this._b_i) + this._rebuffer.size() return (this.length - this._b_i) + this._rebuffer.size()
void push() nogil: inline void push() noexcept nogil:
b0 = this.B(0) b0 = this.B(0)
if this._rebuffer.size(): if this._rebuffer.size():
b0 = this._rebuffer.back() b0 = this._rebuffer.back()
@ -308,32 +308,32 @@ cdef cppclass StateC:
this._b_i += 1 this._b_i += 1
this._stack.push_back(b0) this._stack.push_back(b0)
void pop() nogil: inline void pop() noexcept nogil:
this._stack.pop_back() this._stack.pop_back()
void force_final() nogil: inline void force_final() noexcept nogil:
# This should only be used in desperate situations, as it may leave # This should only be used in desperate situations, as it may leave
# the analysis in an unexpected state. # the analysis in an unexpected state.
this._stack.clear() this._stack.clear()
this._b_i = this.length this._b_i = this.length
void unshift() nogil: inline void unshift() noexcept nogil:
s0 = this._stack.back() s0 = this._stack.back()
this._unshiftable[s0] = 1 this._unshiftable[s0] = 1
this._rebuffer.push_back(s0) this._rebuffer.push_back(s0)
this._stack.pop_back() this._stack.pop_back()
int is_unshiftable(int item) nogil const: inline int is_unshiftable(int item) noexcept nogil const:
if item >= this._unshiftable.size(): if item >= this._unshiftable.size():
return 0 return 0
else: else:
return this._unshiftable.at(item) return this._unshiftable.at(item)
void set_reshiftable(int item) nogil: inline void set_reshiftable(int item) noexcept nogil:
if item < this._unshiftable.size(): if item < this._unshiftable.size():
this._unshiftable[item] = 0 this._unshiftable[item] = 0
void add_arc(int head, int child, attr_t label) nogil: inline void add_arc(int head, int child, attr_t label) noexcept nogil:
if this.has_head(child): if this.has_head(child):
this.del_arc(this.H(child), child) this.del_arc(this.H(child), child)
cdef ArcC arc cdef ArcC arc
@ -346,7 +346,7 @@ cdef cppclass StateC:
this._right_arcs[arc.head].push_back(arc) this._right_arcs[arc.head].push_back(arc)
this._heads[child] = head this._heads[child] = head
void map_del_arc(unordered_map[int, vector[ArcC]]* heads_arcs, int h_i, int c_i) nogil: inline void map_del_arc(unordered_map[int, vector[ArcC]]* heads_arcs, int h_i, int c_i) noexcept nogil:
arcs_it = heads_arcs.find(h_i) arcs_it = heads_arcs.find(h_i)
if arcs_it == heads_arcs.end(): if arcs_it == heads_arcs.end():
return return
@ -367,13 +367,13 @@ cdef cppclass StateC:
arc.label = 0 arc.label = 0
break break
void del_arc(int h_i, int c_i) nogil: inline void del_arc(int h_i, int c_i) noexcept nogil:
if h_i > c_i: if h_i > c_i:
this.map_del_arc(&this._left_arcs, h_i, c_i) this.map_del_arc(&this._left_arcs, h_i, c_i)
else: else:
this.map_del_arc(&this._right_arcs, h_i, c_i) this.map_del_arc(&this._right_arcs, h_i, c_i)
SpanC get_ent() nogil const: inline SpanC get_ent() noexcept nogil const:
cdef SpanC ent cdef SpanC ent
if this._ents.size() == 0: if this._ents.size() == 0:
ent.start = 0 ent.start = 0
@ -383,17 +383,17 @@ cdef cppclass StateC:
else: else:
return this._ents.back() return this._ents.back()
void open_ent(attr_t label) nogil: inline void open_ent(attr_t label) noexcept nogil:
cdef SpanC ent cdef SpanC ent
ent.start = this.B(0) ent.start = this.B(0)
ent.label = label ent.label = label
ent.end = -1 ent.end = -1
this._ents.push_back(ent) this._ents.push_back(ent)
void close_ent() nogil: inline void close_ent() noexcept nogil:
this._ents.back().end = this.B(0)+1 this._ents.back().end = this.B(0)+1
void clone(const StateC* src) nogil: inline void clone(const StateC* src) noexcept nogil:
this.length = src.length this.length = src.length
this._sent = src._sent this._sent = src._sent
this._stack = src._stack this._stack = src._stack


@@ -155,7 +155,7 @@ cdef GoldParseStateC create_gold_state(
     return gs


-cdef void update_gold_state(GoldParseStateC* gs, const StateC* s) nogil:
+cdef void update_gold_state(GoldParseStateC* gs, const StateC* s) noexcept nogil:
     for i in range(gs.length):
         gs.state_bits[i] = set_state_flag(
             gs.state_bits[i],
@@ -203,7 +203,7 @@ cdef class ArcEagerGold:
     def __init__(self, ArcEager moves, StateClass stcls, Example example):
         self.mem = Pool()
         heads, labels = example.get_aligned_parse(projectivize=True)
-        labels = [example.x.vocab.strings.add(label) if label is not None else MISSING_DEP for label in labels]
+        labels = [example.x.vocab.strings.add(label, allow_transient=False) if label is not None else MISSING_DEP for label in labels]
         sent_starts = _get_aligned_sent_starts(example)
         assert len(heads) == len(labels) == len(sent_starts), (len(heads), len(labels), len(sent_starts))
         self.c = create_gold_state(self.mem, stcls.c, heads, labels, sent_starts)
@ -239,12 +239,12 @@ def _get_aligned_sent_starts(example):
return [None] * len(example.x) return [None] * len(example.x)
cdef int check_state_gold(char state_bits, char flag) nogil: cdef int check_state_gold(char state_bits, char flag) noexcept nogil:
cdef char one = 1 cdef char one = 1
return 1 if (state_bits & (one << flag)) else 0 return 1 if (state_bits & (one << flag)) else 0
cdef int set_state_flag(char state_bits, char flag, int value) nogil: cdef int set_state_flag(char state_bits, char flag, int value) noexcept nogil:
cdef char one = 1 cdef char one = 1
if value: if value:
return state_bits | (one << flag) return state_bits | (one << flag)
@ -252,27 +252,27 @@ cdef int set_state_flag(char state_bits, char flag, int value) nogil:
return state_bits & ~(one << flag) return state_bits & ~(one << flag)
cdef int is_head_in_stack(const GoldParseStateC* gold, int i) nogil: cdef int is_head_in_stack(const GoldParseStateC* gold, int i) noexcept nogil:
return check_state_gold(gold.state_bits[i], HEAD_IN_STACK) return check_state_gold(gold.state_bits[i], HEAD_IN_STACK)
cdef int is_head_in_buffer(const GoldParseStateC* gold, int i) nogil: cdef int is_head_in_buffer(const GoldParseStateC* gold, int i) noexcept nogil:
return check_state_gold(gold.state_bits[i], HEAD_IN_BUFFER) return check_state_gold(gold.state_bits[i], HEAD_IN_BUFFER)
cdef int is_head_unknown(const GoldParseStateC* gold, int i) nogil: cdef int is_head_unknown(const GoldParseStateC* gold, int i) noexcept nogil:
return check_state_gold(gold.state_bits[i], HEAD_UNKNOWN) return check_state_gold(gold.state_bits[i], HEAD_UNKNOWN)
cdef int is_sent_start(const GoldParseStateC* gold, int i) nogil: cdef int is_sent_start(const GoldParseStateC* gold, int i) noexcept nogil:
return check_state_gold(gold.state_bits[i], IS_SENT_START) return check_state_gold(gold.state_bits[i], IS_SENT_START)
cdef int is_sent_start_unknown(const GoldParseStateC* gold, int i) nogil: cdef int is_sent_start_unknown(const GoldParseStateC* gold, int i) noexcept nogil:
return check_state_gold(gold.state_bits[i], SENT_START_UNKNOWN) return check_state_gold(gold.state_bits[i], SENT_START_UNKNOWN)
# Helper functions for the arc-eager oracle # Helper functions for the arc-eager oracle
cdef weight_t push_cost(const StateC* state, const GoldParseStateC* gold) nogil: cdef weight_t push_cost(const StateC* state, const GoldParseStateC* gold) noexcept nogil:
cdef weight_t cost = 0 cdef weight_t cost = 0
b0 = state.B(0) b0 = state.B(0)
if b0 < 0: if b0 < 0:
@ -285,7 +285,7 @@ cdef weight_t push_cost(const StateC* state, const GoldParseStateC* gold) nogil:
return cost return cost
cdef weight_t pop_cost(const StateC* state, const GoldParseStateC* gold) nogil: cdef weight_t pop_cost(const StateC* state, const GoldParseStateC* gold) noexcept nogil:
cdef weight_t cost = 0 cdef weight_t cost = 0
s0 = state.S(0) s0 = state.S(0)
if s0 < 0: if s0 < 0:
@ -296,7 +296,7 @@ cdef weight_t pop_cost(const StateC* state, const GoldParseStateC* gold) nogil:
return cost return cost
cdef bint arc_is_gold(const GoldParseStateC* gold, int head, int child) nogil: cdef bint arc_is_gold(const GoldParseStateC* gold, int head, int child) noexcept nogil:
if is_head_unknown(gold, child): if is_head_unknown(gold, child):
return True return True
elif gold.heads[child] == head: elif gold.heads[child] == head:
@ -305,7 +305,7 @@ cdef bint arc_is_gold(const GoldParseStateC* gold, int head, int child) nogil:
return False return False
cdef bint label_is_gold(const GoldParseStateC* gold, int child, attr_t label) nogil: cdef bint label_is_gold(const GoldParseStateC* gold, int child, attr_t label) noexcept nogil:
if is_head_unknown(gold, child): if is_head_unknown(gold, child):
return True return True
elif label == 0: elif label == 0:
@ -316,7 +316,7 @@ cdef bint label_is_gold(const GoldParseStateC* gold, int child, attr_t label) no
return False return False
cdef bint _is_gold_root(const GoldParseStateC* gold, int word) nogil: cdef bint _is_gold_root(const GoldParseStateC* gold, int word) noexcept nogil:
return gold.heads[word] == word or is_head_unknown(gold, word) return gold.heads[word] == word or is_head_unknown(gold, word)
@ -336,7 +336,7 @@ cdef class Shift:
* Advance buffer * Advance buffer
""" """
@staticmethod @staticmethod
cdef bint is_valid(const StateC* st, attr_t label) nogil: cdef bint is_valid(const StateC* st, attr_t label) noexcept nogil:
if st.stack_depth() == 0: if st.stack_depth() == 0:
return 1 return 1
elif st.buffer_length() < 2: elif st.buffer_length() < 2:
@ -349,11 +349,11 @@ cdef class Shift:
return 1 return 1
@staticmethod @staticmethod
cdef int transition(StateC* st, attr_t label) nogil: cdef int transition(StateC* st, attr_t label) noexcept nogil:
st.push() st.push()
@staticmethod @staticmethod
cdef weight_t cost(const StateC* state, const void* _gold, attr_t label) nogil: cdef weight_t cost(const StateC* state, const void* _gold, attr_t label) noexcept nogil:
gold = <const GoldParseStateC*>_gold gold = <const GoldParseStateC*>_gold
return gold.push_cost return gold.push_cost
@ -375,7 +375,7 @@ cdef class Reduce:
cost by those arcs. cost by those arcs.
""" """
@staticmethod @staticmethod
cdef bint is_valid(const StateC* st, attr_t label) nogil: cdef bint is_valid(const StateC* st, attr_t label) noexcept nogil:
if st.stack_depth() == 0: if st.stack_depth() == 0:
return False return False
elif st.buffer_length() == 0: elif st.buffer_length() == 0:
@ -386,14 +386,14 @@ cdef class Reduce:
return True return True
@staticmethod @staticmethod
cdef int transition(StateC* st, attr_t label) nogil: cdef int transition(StateC* st, attr_t label) noexcept nogil:
if st.has_head(st.S(0)) or st.stack_depth() == 1: if st.has_head(st.S(0)) or st.stack_depth() == 1:
st.pop() st.pop()
else: else:
st.unshift() st.unshift()
@staticmethod @staticmethod
cdef weight_t cost(const StateC* state, const void* _gold, attr_t label) nogil: cdef weight_t cost(const StateC* state, const void* _gold, attr_t label) noexcept nogil:
gold = <const GoldParseStateC*>_gold gold = <const GoldParseStateC*>_gold
if state.is_sent_start(state.B(0)): if state.is_sent_start(state.B(0)):
return 0 return 0
@ -421,7 +421,7 @@ cdef class LeftArc:
pop_cost - Arc(B[0], S[0], label) + (Arc(S[1], S[0]) if H(S[0]) else Arcs(S, S[0])) pop_cost - Arc(B[0], S[0], label) + (Arc(S[1], S[0]) if H(S[0]) else Arcs(S, S[0]))
""" """
@staticmethod @staticmethod
cdef bint is_valid(const StateC* st, attr_t label) nogil: cdef bint is_valid(const StateC* st, attr_t label) noexcept nogil:
if st.stack_depth() == 0: if st.stack_depth() == 0:
return 0 return 0
elif st.buffer_length() == 0: elif st.buffer_length() == 0:
@ -434,7 +434,7 @@ cdef class LeftArc:
return 1 return 1
@staticmethod @staticmethod
cdef int transition(StateC* st, attr_t label) nogil: cdef int transition(StateC* st, attr_t label) noexcept nogil:
st.add_arc(st.B(0), st.S(0), label) st.add_arc(st.B(0), st.S(0), label)
# If we change the stack, it's okay to remove the shifted mark, as # If we change the stack, it's okay to remove the shifted mark, as
# we can't get in an infinite loop this way. # we can't get in an infinite loop this way.
@ -442,7 +442,7 @@ cdef class LeftArc:
st.pop() st.pop()
@staticmethod @staticmethod
cdef inline weight_t cost(const StateC* state, const void* _gold, attr_t label) nogil: cdef inline weight_t cost(const StateC* state, const void* _gold, attr_t label) noexcept nogil:
gold = <const GoldParseStateC*>_gold gold = <const GoldParseStateC*>_gold
cdef weight_t cost = gold.pop_cost cdef weight_t cost = gold.pop_cost
s0 = state.S(0) s0 = state.S(0)
@ -474,7 +474,7 @@ cdef class RightArc:
push_cost + (not shifted[b0] and Arc(B[1:], B[0])) - Arc(S[0], B[0], label) push_cost + (not shifted[b0] and Arc(B[1:], B[0])) - Arc(S[0], B[0], label)
""" """
@staticmethod @staticmethod
cdef bint is_valid(const StateC* st, attr_t label) nogil: cdef bint is_valid(const StateC* st, attr_t label) noexcept nogil:
if st.stack_depth() == 0: if st.stack_depth() == 0:
return 0 return 0
elif st.buffer_length() == 0: elif st.buffer_length() == 0:
@ -488,12 +488,12 @@ cdef class RightArc:
return 1 return 1
@staticmethod @staticmethod
cdef int transition(StateC* st, attr_t label) nogil: cdef int transition(StateC* st, attr_t label) noexcept nogil:
st.add_arc(st.S(0), st.B(0), label) st.add_arc(st.S(0), st.B(0), label)
st.push() st.push()
@staticmethod @staticmethod
cdef inline weight_t cost(const StateC* state, const void* _gold, attr_t label) nogil: cdef inline weight_t cost(const StateC* state, const void* _gold, attr_t label) noexcept nogil:
gold = <const GoldParseStateC*>_gold gold = <const GoldParseStateC*>_gold
cost = gold.push_cost cost = gold.push_cost
s0 = state.S(0) s0 = state.S(0)
@ -525,7 +525,7 @@ cdef class Break:
* Arcs between S and B[1] * Arcs between S and B[1]
""" """
@staticmethod @staticmethod
cdef bint is_valid(const StateC* st, attr_t label) nogil: cdef bint is_valid(const StateC* st, attr_t label) noexcept nogil:
if st.buffer_length() < 2: if st.buffer_length() < 2:
return False return False
elif st.B(1) != st.B(0) + 1: elif st.B(1) != st.B(0) + 1:
@ -538,11 +538,11 @@ cdef class Break:
return True return True
@staticmethod @staticmethod
cdef int transition(StateC* st, attr_t label) nogil: cdef int transition(StateC* st, attr_t label) noexcept nogil:
st.set_sent_start(st.B(1), 1) st.set_sent_start(st.B(1), 1)
@staticmethod @staticmethod
cdef weight_t cost(const StateC* state, const void* _gold, attr_t label) nogil: cdef weight_t cost(const StateC* state, const void* _gold, attr_t label) noexcept nogil:
gold = <const GoldParseStateC*>_gold gold = <const GoldParseStateC*>_gold
cdef int b0 = state.B(0) cdef int b0 = state.B(0)
cdef int cost = 0 cdef int cost = 0
@ -785,7 +785,7 @@ cdef class ArcEager(TransitionSystem):
else: else:
return False return False
cdef int set_valid(self, int* output, const StateC* st) nogil: cdef int set_valid(self, int* output, const StateC* st) noexcept nogil:
cdef int[N_MOVES] is_valid cdef int[N_MOVES] is_valid
is_valid[SHIFT] = Shift.is_valid(st, 0) is_valid[SHIFT] = Shift.is_valid(st, 0)
is_valid[REDUCE] = Reduce.is_valid(st, 0) is_valid[REDUCE] = Reduce.is_valid(st, 0)


@@ -110,7 +110,7 @@ cdef void update_gold_state(GoldNERStateC* gs, const StateC* state) except *:
 cdef do_func_t[N_MOVES] do_funcs


-cdef bint _entity_is_sunk(const StateC* state, Transition* golds) nogil:
+cdef bint _entity_is_sunk(const StateC* state, Transition* golds) noexcept nogil:
     if not state.entity_is_open():
         return False
@@ -238,7 +238,7 @@ cdef class BiluoPushDown(TransitionSystem):
     def add_action(self, int action, label_name, freq=None):
         cdef attr_t label_id
-        if not isinstance(label_name, (int, long)):
+        if not isinstance(label_name, int):
            label_id = self.strings.add(label_name)
         else:
             label_id = label_name
@ -347,21 +347,21 @@ cdef class BiluoPushDown(TransitionSystem):
cdef class Missing: cdef class Missing:
@staticmethod @staticmethod
cdef bint is_valid(const StateC* st, attr_t label) nogil: cdef bint is_valid(const StateC* st, attr_t label) noexcept nogil:
return False return False
@staticmethod @staticmethod
cdef int transition(StateC* s, attr_t label) nogil: cdef int transition(StateC* s, attr_t label) noexcept nogil:
pass pass
@staticmethod @staticmethod
cdef weight_t cost(const StateC* s, const void* _gold, attr_t label) nogil: cdef weight_t cost(const StateC* s, const void* _gold, attr_t label) noexcept nogil:
return 9000 return 9000
cdef class Begin: cdef class Begin:
@staticmethod @staticmethod
cdef bint is_valid(const StateC* st, attr_t label) nogil: cdef bint is_valid(const StateC* st, attr_t label) noexcept nogil:
cdef int preset_ent_iob = st.B_(0).ent_iob cdef int preset_ent_iob = st.B_(0).ent_iob
cdef attr_t preset_ent_label = st.B_(0).ent_type cdef attr_t preset_ent_label = st.B_(0).ent_type
if st.entity_is_open(): if st.entity_is_open():
@ -400,13 +400,13 @@ cdef class Begin:
return True return True
@staticmethod @staticmethod
cdef int transition(StateC* st, attr_t label) nogil: cdef int transition(StateC* st, attr_t label) noexcept nogil:
st.open_ent(label) st.open_ent(label)
st.push() st.push()
st.pop() st.pop()
@staticmethod @staticmethod
cdef weight_t cost(const StateC* s, const void* _gold, attr_t label) nogil: cdef weight_t cost(const StateC* s, const void* _gold, attr_t label) noexcept nogil:
gold = <GoldNERStateC*>_gold gold = <GoldNERStateC*>_gold
b0 = s.B(0) b0 = s.B(0)
cdef int cost = 0 cdef int cost = 0
@ -439,7 +439,7 @@ cdef class Begin:
cdef class In: cdef class In:
@staticmethod @staticmethod
cdef bint is_valid(const StateC* st, attr_t label) nogil: cdef bint is_valid(const StateC* st, attr_t label) noexcept nogil:
if not st.entity_is_open(): if not st.entity_is_open():
return False return False
if st.buffer_length() < 2: if st.buffer_length() < 2:
@ -475,12 +475,12 @@ cdef class In:
return True return True
@staticmethod @staticmethod
cdef int transition(StateC* st, attr_t label) nogil: cdef int transition(StateC* st, attr_t label) noexcept nogil:
st.push() st.push()
st.pop() st.pop()
@staticmethod @staticmethod
cdef weight_t cost(const StateC* s, const void* _gold, attr_t label) nogil: cdef weight_t cost(const StateC* s, const void* _gold, attr_t label) noexcept nogil:
gold = <GoldNERStateC*>_gold gold = <GoldNERStateC*>_gold
cdef int next_act = gold.ner[s.B(1)].move if s.B(1) >= 0 else OUT cdef int next_act = gold.ner[s.B(1)].move if s.B(1) >= 0 else OUT
cdef int g_act = gold.ner[s.B(0)].move cdef int g_act = gold.ner[s.B(0)].move
@ -510,7 +510,7 @@ cdef class In:
cdef class Last: cdef class Last:
@staticmethod @staticmethod
cdef bint is_valid(const StateC* st, attr_t label) nogil: cdef bint is_valid(const StateC* st, attr_t label) noexcept nogil:
cdef int preset_ent_iob = st.B_(0).ent_iob cdef int preset_ent_iob = st.B_(0).ent_iob
cdef attr_t preset_ent_label = st.B_(0).ent_type cdef attr_t preset_ent_label = st.B_(0).ent_type
if label == 0: if label == 0:
@ -535,13 +535,13 @@ cdef class Last:
return True return True
@staticmethod @staticmethod
cdef int transition(StateC* st, attr_t label) nogil: cdef int transition(StateC* st, attr_t label) noexcept nogil:
st.close_ent() st.close_ent()
st.push() st.push()
st.pop() st.pop()
@staticmethod @staticmethod
cdef weight_t cost(const StateC* s, const void* _gold, attr_t label) nogil: cdef weight_t cost(const StateC* s, const void* _gold, attr_t label) noexcept nogil:
gold = <GoldNERStateC*>_gold gold = <GoldNERStateC*>_gold
b0 = s.B(0) b0 = s.B(0)
ent_start = s.E(0) ent_start = s.E(0)
@ -581,7 +581,7 @@ cdef class Last:
cdef class Unit: cdef class Unit:
@staticmethod @staticmethod
cdef bint is_valid(const StateC* st, attr_t label) nogil: cdef bint is_valid(const StateC* st, attr_t label) noexcept nogil:
cdef int preset_ent_iob = st.B_(0).ent_iob cdef int preset_ent_iob = st.B_(0).ent_iob
cdef attr_t preset_ent_label = st.B_(0).ent_type cdef attr_t preset_ent_label = st.B_(0).ent_type
if label == 0: if label == 0:
@ -609,14 +609,14 @@ cdef class Unit:
return True return True
@staticmethod @staticmethod
cdef int transition(StateC* st, attr_t label) nogil: cdef int transition(StateC* st, attr_t label) noexcept nogil:
st.open_ent(label) st.open_ent(label)
st.close_ent() st.close_ent()
st.push() st.push()
st.pop() st.pop()
@staticmethod @staticmethod
cdef weight_t cost(const StateC* s, const void* _gold, attr_t label) nogil: cdef weight_t cost(const StateC* s, const void* _gold, attr_t label) noexcept nogil:
gold = <GoldNERStateC*>_gold gold = <GoldNERStateC*>_gold
cdef int g_act = gold.ner[s.B(0)].move cdef int g_act = gold.ner[s.B(0)].move
cdef attr_t g_tag = gold.ner[s.B(0)].label cdef attr_t g_tag = gold.ner[s.B(0)].label
@ -646,7 +646,7 @@ cdef class Unit:
cdef class Out: cdef class Out:
@staticmethod @staticmethod
cdef bint is_valid(const StateC* st, attr_t label) nogil: cdef bint is_valid(const StateC* st, attr_t label) noexcept nogil:
cdef int preset_ent_iob = st.B_(0).ent_iob cdef int preset_ent_iob = st.B_(0).ent_iob
if st.entity_is_open(): if st.entity_is_open():
return False return False
@ -658,12 +658,12 @@ cdef class Out:
return True return True
@staticmethod @staticmethod
cdef int transition(StateC* st, attr_t label) nogil: cdef int transition(StateC* st, attr_t label) noexcept nogil:
st.push() st.push()
st.pop() st.pop()
@staticmethod @staticmethod
cdef weight_t cost(const StateC* s, const void* _gold, attr_t label) nogil: cdef weight_t cost(const StateC* s, const void* _gold, attr_t label) noexcept nogil:
gold = <GoldNERStateC*>_gold gold = <GoldNERStateC*>_gold
cdef int g_act = gold.ner[s.B(0)].move cdef int g_act = gold.ner[s.B(0)].move
cdef weight_t cost = 0 cdef weight_t cost = 0


@@ -94,7 +94,7 @@ cdef bool _has_head_as_ancestor(int tokenid, int head, const vector[int]& heads)
     return False


-cdef string heads_to_string(const vector[int]& heads) nogil:
+cdef string heads_to_string(const vector[int]& heads) noexcept nogil:
     cdef vector[int].const_iterator citer
     cdef string cycle_str
@@ -183,7 +183,7 @@ cpdef deprojectivize(Doc doc):
             new_label, head_label = label.split(DELIMITER)
             new_head = _find_new_head(doc[i], head_label)
             doc.c[i].head = new_head.i - i
-            doc.c[i].dep = doc.vocab.strings.add(new_label)
+            doc.c[i].dep = doc.vocab.strings.add(new_label, allow_transient=False)
     set_children_from_heads(doc.c, 0, doc.length)
     return doc


@@ -29,7 +29,7 @@ cdef class StateClass:
         return [self.B(i) for i in range(self.c.buffer_length())]

     @property
-    def token_vector_lenth(self):
+    def token_vector_length(self):
         return self.doc.tensor.shape[1]

     @property


@@ -15,22 +15,22 @@ cdef struct Transition:
     weight_t score

-    bint (*is_valid)(const StateC* state, attr_t label) nogil
-    weight_t (*get_cost)(const StateC* state, const void* gold, attr_t label) nogil
-    int (*do)(StateC* state, attr_t label) nogil
+    bint (*is_valid)(const StateC* state, attr_t label) noexcept nogil
+    weight_t (*get_cost)(const StateC* state, const void* gold, attr_t label) noexcept nogil
+    int (*do)(StateC* state, attr_t label) noexcept nogil

 ctypedef weight_t (*get_cost_func_t)(
     const StateC* state, const void* gold, attr_t label
-) nogil
+) noexcept nogil

 ctypedef weight_t (*move_cost_func_t)(
     const StateC* state, const void* gold
-) nogil
+) noexcept nogil

 ctypedef weight_t (*label_cost_func_t)(
     const StateC* state, const void* gold, attr_t label
-) nogil
+) noexcept nogil

-ctypedef int (*do_func_t)(StateC* state, attr_t label) nogil
+ctypedef int (*do_func_t)(StateC* state, attr_t label) noexcept nogil

 ctypedef void* (*init_state_t)(Pool mem, int length, void* tokens) except NULL
@@ -53,7 +53,7 @@ cdef class TransitionSystem:
     cdef Transition init_transition(self, int clas, int move, attr_t label) except *

-    cdef int set_valid(self, int* output, const StateC* st) nogil
+    cdef int set_valid(self, int* output, const StateC* st) noexcept nogil
     cdef int set_costs(self, int* is_valid, weight_t* costs,
                        const StateC* state, gold) except -1


@@ -149,7 +149,7 @@ cdef class TransitionSystem:
         action = self.lookup_transition(move_name)
         return action.is_valid(stcls.c, action.label)

-    cdef int set_valid(self, int* is_valid, const StateC* st) nogil:
+    cdef int set_valid(self, int* is_valid, const StateC* st) noexcept nogil:
         cdef int i
         for i in range(self.n_moves):
             is_valid[i] = self.c[i].is_valid(st, self.c[i].label)
@@ -191,8 +191,7 @@ cdef class TransitionSystem:
     def add_action(self, int action, label_name):
         cdef attr_t label_id
-        if not isinstance(label_name, int) and \
-                not isinstance(label_name, long):
+        if not isinstance(label_name, int):
            label_id = self.strings.add(label_name)
         else:
             label_id = label_name


@@ -1,3 +1,5 @@
+import importlib
+import sys
 from pathlib import Path
 from typing import Any, Callable, Dict, Iterable, List, Optional, Tuple, Union
@@ -22,19 +24,6 @@ TagMapType = Dict[str, Dict[Union[int, str], Union[int, str]]]
 MorphRulesType = Dict[str, Dict[str, Dict[Union[int, str], Union[int, str]]]]

-@Language.factory(
-    "attribute_ruler",
-    default_config={
-        "validate": False,
-        "scorer": {"@scorers": "spacy.attribute_ruler_scorer.v1"},
-    },
-)
-def make_attribute_ruler(
-    nlp: Language, name: str, validate: bool, scorer: Optional[Callable]
-):
-    return AttributeRuler(nlp.vocab, name, validate=validate, scorer=scorer)

 def attribute_ruler_score(examples: Iterable[Example], **kwargs) -> Dict[str, Any]:
     def morph_key_getter(token, attr):
         return getattr(token, attr).key
@@ -54,7 +43,6 @@ def attribute_ruler_score(examples: Iterable[Example], **kwargs) -> Dict[str, Any]:
     return results

-@registry.scorers("spacy.attribute_ruler_scorer.v1")
 def make_attribute_ruler_scorer():
     return attribute_ruler_score
@@ -355,3 +343,11 @@ def _split_morph_attrs(attrs: dict) -> Tuple[dict, dict]:
         else:
             morph_attrs[k] = v
     return other_attrs, morph_attrs
# Setup backwards compatibility hook for factories
def __getattr__(name):
if name == "make_attribute_ruler":
module = importlib.import_module("spacy.pipeline.factories")
return module.make_attribute_ruler
raise AttributeError(f"module {__name__} has no attribute {name}")
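A generic sketch of the PEP 562 pattern used by the hook above: a module-level `__getattr__` lazily forwards an old attribute name to its new home, so `from old_module import name` keeps working after the definition moves (`_MOVED` is an illustrative name, not part of the diff):

```python
import importlib

_MOVED = {"make_attribute_ruler": "spacy.pipeline.factories"}


def __getattr__(name):
    # Called only when normal attribute lookup on the module fails.
    if name in _MOVED:
        module = importlib.import_module(_MOVED[name])
        return getattr(module, name)
    raise AttributeError(f"module {__name__} has no attribute {name}")
```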


@@ -1,4 +1,6 @@
 # cython: infer_types=True, binding=True
+import importlib
+import sys
 from collections import defaultdict
 from typing import Callable, Optional
@@ -39,188 +41,6 @@ subword_features = true

 DEFAULT_PARSER_MODEL = Config().from_str(default_model_config)["model"]
@Language.factory(
"parser",
assigns=["token.dep", "token.head", "token.is_sent_start", "doc.sents"],
default_config={
"moves": None,
"update_with_oracle_cut_size": 100,
"learn_tokens": False,
"min_action_freq": 30,
"model": DEFAULT_PARSER_MODEL,
"scorer": {"@scorers": "spacy.parser_scorer.v1"},
},
default_score_weights={
"dep_uas": 0.5,
"dep_las": 0.5,
"dep_las_per_type": None,
"sents_p": None,
"sents_r": None,
"sents_f": 0.0,
},
)
def make_parser(
nlp: Language,
name: str,
model: Model,
moves: Optional[TransitionSystem],
update_with_oracle_cut_size: int,
learn_tokens: bool,
min_action_freq: int,
scorer: Optional[Callable],
):
"""Create a transition-based DependencyParser component. The dependency parser
jointly learns sentence segmentation and labelled dependency parsing, and can
optionally learn to merge tokens that had been over-segmented by the tokenizer.
The parser uses a variant of the non-monotonic arc-eager transition-system
described by Honnibal and Johnson (2014), with the addition of a "break"
transition to perform the sentence segmentation. Nivre's pseudo-projective
dependency transformation is used to allow the parser to predict
non-projective parses.
The parser is trained using an imitation learning objective. The parser follows
the actions predicted by the current weights, and at each state, determines
which actions are compatible with the optimal parse that could be reached
from the current state. The weights are updated such that the scores assigned
to the set of optimal actions are increased, while scores assigned to other
actions are decreased. Note that more than one action may be optimal for
a given state.
model (Model): The model for the transition-based parser. The model needs
to have a specific substructure of named components --- see the
spacy.ml.tb_framework.TransitionModel for details.
moves (Optional[TransitionSystem]): This defines how the parse-state is created,
updated and evaluated. If 'moves' is None, a new instance is
created with `self.TransitionSystem()`. Defaults to `None`.
update_with_oracle_cut_size (int): During training, cut long sequences into
shorter segments by creating intermediate states based on the gold-standard
history. The model is not very sensitive to this parameter, so you usually
won't need to change it. 100 is a good default.
learn_tokens (bool): Whether to learn to merge subtokens that are split
relative to the gold standard. Experimental.
min_action_freq (int): The minimum frequency of labelled actions to retain.
Rarer labelled actions have their label backed-off to "dep". While this
primarily affects the label accuracy, it can also affect the attachment
structure, as the labels are used to represent the pseudo-projectivity
transformation.
scorer (Optional[Callable]): The scoring method.
"""
return DependencyParser(
nlp.vocab,
model,
name,
moves=moves,
update_with_oracle_cut_size=update_with_oracle_cut_size,
multitasks=[],
learn_tokens=learn_tokens,
min_action_freq=min_action_freq,
beam_width=1,
beam_density=0.0,
beam_update_prob=0.0,
# At some point in the future we can try to implement support for
# partial annotations, perhaps only in the beam objective.
incorrect_spans_key=None,
scorer=scorer,
)
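
Note: the factory removed above is what a "parser" entry in the pipeline config resolves to; moving it to spacy.pipeline.factories does not change how the component is added. A hedged usage sketch, with the settings copied from the default_config shown above (illustrative, not part of the diff):

# Illustrative only -- not part of the diff.
import spacy

# Build a blank English pipeline and add the transition-based parser by its
# registered factory name; the model is filled in from DEFAULT_PARSER_MODEL.
nlp = spacy.blank("en")
parser = nlp.add_pipe(
    "parser",
    config={
        "learn_tokens": False,          # don't learn to merge over-segmented tokens
        "min_action_freq": 30,          # back rare labels off to "dep"
        "update_with_oracle_cut_size": 100,
    },
)
# The component still has to be trained (or loaded as part of a trained
# pipeline) before it predicts dependency arcs.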
@Language.factory(
"beam_parser",
assigns=["token.dep", "token.head", "token.is_sent_start", "doc.sents"],
default_config={
"beam_width": 8,
"beam_density": 0.01,
"beam_update_prob": 0.5,
"moves": None,
"update_with_oracle_cut_size": 100,
"learn_tokens": False,
"min_action_freq": 30,
"model": DEFAULT_PARSER_MODEL,
"scorer": {"@scorers": "spacy.parser_scorer.v1"},
},
default_score_weights={
"dep_uas": 0.5,
"dep_las": 0.5,
"dep_las_per_type": None,
"sents_p": None,
"sents_r": None,
"sents_f": 0.0,
},
)
def make_beam_parser(
nlp: Language,
name: str,
model: Model,
moves: Optional[TransitionSystem],
update_with_oracle_cut_size: int,
learn_tokens: bool,
min_action_freq: int,
beam_width: int,
beam_density: float,
beam_update_prob: float,
scorer: Optional[Callable],
):
"""Create a transition-based DependencyParser component that uses beam-search.
The dependency parser jointly learns sentence segmentation and labelled
dependency parsing, and can optionally learn to merge tokens that had been
over-segmented by the tokenizer.
The parser uses a variant of the non-monotonic arc-eager transition-system
described by Honnibal and Johnson (2014), with the addition of a "break"
transition to perform the sentence segmentation. Nivre's pseudo-projective
dependency transformation is used to allow the parser to predict
non-projective parses.
The parser is trained using a global objective. That is, it learns to assign
probabilities to whole parses.
model (Model): The model for the transition-based parser. The model needs
to have a specific substructure of named components --- see the
spacy.ml.tb_framework.TransitionModel for details.
moves (Optional[TransitionSystem]): This defines how the parse-state is created,
updated and evaluated. If 'moves' is None, a new instance is
created with `self.TransitionSystem()`. Defaults to `None`.
update_with_oracle_cut_size (int): During training, cut long sequences into
shorter segments by creating intermediate states based on the gold-standard
history. The model is not very sensitive to this parameter, so you usually
won't need to change it. 100 is a good default.
beam_width (int): The number of candidate analyses to maintain.
beam_density (float): The minimum ratio between the scores of the first and
last candidates in the beam. This allows the parser to avoid exploring
candidates that are too far behind. This is mostly intended to improve
efficiency, but it can also improve accuracy as deeper search is not
always better.
beam_update_prob (float): The chance of making a beam update, instead of a
greedy update. Greedy updates are an approximation for the beam updates,
and are faster to compute.
learn_tokens (bool): Whether to learn to merge subtokens that are split
relative to the gold standard. Experimental.
min_action_freq (int): The minimum frequency of labelled actions to retain.
Rarer labelled actions have their label backed-off to "dep". While this
primarily affects the label accuracy, it can also affect the attachment
structure, as the labels are used to represent the pseudo-projectivity
transformation.
"""
return DependencyParser(
nlp.vocab,
model,
name,
moves=moves,
update_with_oracle_cut_size=update_with_oracle_cut_size,
beam_width=beam_width,
beam_density=beam_density,
beam_update_prob=beam_update_prob,
multitasks=[],
learn_tokens=learn_tokens,
min_action_freq=min_action_freq,
# At some point in the future we can try to implement support for
# partial annotations, perhaps only in the beam objective.
incorrect_spans_key=None,
scorer=scorer,
)
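
Note: the beam variant is added the same way, only with the beam-specific settings from its default_config. A short sketch (illustrative, not part of the diff):

# Illustrative only -- not part of the diff.
import spacy

nlp = spacy.blank("en")
beam_parser = nlp.add_pipe(
    "beam_parser",
    config={
        "beam_width": 8,          # candidate analyses kept per step
        "beam_density": 0.01,     # prune candidates far behind the best one
        "beam_update_prob": 0.5,  # mix beam updates with cheaper greedy updates
    },
)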
def parser_score(examples, **kwargs):
    """Score a batch of examples.
@@ -246,7 +66,6 @@ def parser_score(examples, **kwargs):
    return results
@registry.scorers("spacy.parser_scorer.v1")
def make_parser_scorer():
    return parser_score
@@ -346,3 +165,14 @@ cdef class DependencyParser(Parser):
        # because we instead have a label frequency cut-off and back off rare
        # labels to 'dep'.
        pass
# Setup backwards compatibility hook for factories
def __getattr__(name):
if name == "make_parser":
module = importlib.import_module("spacy.pipeline.factories")
return module.make_parser
elif name == "make_beam_parser":
module = importlib.import_module("spacy.pipeline.factories")
return module.make_beam_parser
raise AttributeError(f"module {__name__} has no attribute {name}")


@@ -1,3 +1,5 @@
import importlib
import sys
from collections import Counter
from itertools import islice
from typing import Any, Callable, Dict, Iterable, List, Optional, Tuple, cast
@@ -39,43 +41,6 @@ subword_features = true
DEFAULT_EDIT_TREE_LEMMATIZER_MODEL = Config().from_str(default_model_config)["model"]
@Language.factory(
"trainable_lemmatizer",
assigns=["token.lemma"],
requires=[],
default_config={
"model": DEFAULT_EDIT_TREE_LEMMATIZER_MODEL,
"backoff": "orth",
"min_tree_freq": 3,
"overwrite": False,
"top_k": 1,
"scorer": {"@scorers": "spacy.lemmatizer_scorer.v1"},
},
default_score_weights={"lemma_acc": 1.0},
)
def make_edit_tree_lemmatizer(
nlp: Language,
name: str,
model: Model,
backoff: Optional[str],
min_tree_freq: int,
overwrite: bool,
top_k: int,
scorer: Optional[Callable],
):
"""Construct an EditTreeLemmatizer component."""
return EditTreeLemmatizer(
nlp.vocab,
model,
name,
backoff=backoff,
min_tree_freq=min_tree_freq,
overwrite=overwrite,
top_k=top_k,
scorer=scorer,
)
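
Note: as with the parser factories, only the definition site of this factory moves; the component is still added under its registered name. A sketch using the defaults listed above (illustrative, not part of the diff):

# Illustrative only -- not part of the diff.
import spacy

nlp = spacy.blank("en")
lemmatizer = nlp.add_pipe(
    "trainable_lemmatizer",
    config={
        "backoff": "orth",   # fall back to the surface form for unseen words
        "min_tree_freq": 3,  # discard edit trees seen fewer than 3 times
        "top_k": 1,          # only consider the highest-scoring edit tree
    },
)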
class EditTreeLemmatizer(TrainablePipe):
    """
    Lemmatizer that lemmatizes each word using a predicted edit tree.
@@ -421,3 +386,11 @@ class EditTreeLemmatizer(TrainablePipe):
        self.tree2label[tree_id] = len(self.cfg["labels"])
        self.cfg["labels"].append(tree_id)
        return self.tree2label[tree_id]
# Setup backwards compatibility hook for factories
def __getattr__(name):
if name == "make_edit_tree_lemmatizer":
module = importlib.import_module("spacy.pipeline.factories")
return module.make_edit_tree_lemmatizer
raise AttributeError(f"module {__name__} has no attribute {name}")

Some files were not shown because too many files have changed in this diff.