Commit Graph

87 Commits

Author SHA1 Message Date
Matthew Honnibal
5bebbf7550
Python 3.13 support (#13823)
In order to support Python 3.13, we had to migrate to Cython 3.0. This caused some tricky interaction with our Pydantic usage, because Cython 3 uses the from __future__ import annotations semantics, which causes type annotations to be saved as strings.

The end result is that we can't have Language.factory decorated functions in Cython modules anymore, as the Language.factory decorator expects to inspect the signature of the functions and build a Pydantic model. If the function is implemented in Cython, an error is raised because the type is not resolved.
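
As a rough illustration of the underlying issue, the plain-Python sketch below shows what a signature looks like when annotations are stored as strings. The function `make_component` is hypothetical, and the quoted annotations only mimic the `from __future__ import annotations` behaviour. This is not the Cython failure itself, just the reason a signature-inspecting decorator needs the types to be resolvable.

```python
import inspect
import typing


# Hypothetical factory function. The quoted annotations mimic how
# `from __future__ import annotations` stores every annotation as a string.
def make_component(name: "str", threshold: "float" = 0.5) -> "dict":
    return {"name": name, "threshold": threshold}


# inspect.signature() returns the annotation exactly as stored: a string.
raw = inspect.signature(make_component).parameters["threshold"].annotation

# typing.get_type_hints() evaluates the strings against the function's
# globals. This is the resolution step that has to succeed before a
# model can be built from the signature.
resolved = typing.get_type_hints(make_component)["threshold"]

print(repr(raw), resolved)
```

Here `raw` is the string `'float'` while `resolved` is the actual `float` type; code that builds a model from the signature needs the latter.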

To address this I've moved the factory functions into a new module, spacy.pipeline.factories. I've added __getattr__ importlib hooks to the previous locations, in case anyone was importing these functions directly. The change should have no backwards compatibility implications.
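
A minimal sketch of the forwarding idea, assuming a module-level `__getattr__` hook (PEP 562) that imports the moved name from its new home on demand. The module names `demo_factories` and `demo_old_location` and the function `make_tagger` are made up for the example, and the modules are built in memory so the snippet runs on its own; the real hooks in spaCy may look different.

```python
import importlib
import sys
import types

# Hypothetical "new" module that now owns the factory function.
new_mod = types.ModuleType("demo_factories")


def make_tagger(name):
    return f"tagger:{name}"


new_mod.make_tagger = make_tagger
sys.modules["demo_factories"] = new_mod

# Hypothetical "old" module that no longer defines the function itself.
old_mod = types.ModuleType("demo_old_location")
_MOVED = {"make_tagger": "demo_factories"}


def _forward(name):
    # Called only when normal attribute lookup on the module fails.
    if name in _MOVED:
        module = importlib.import_module(_MOVED[name])
        return getattr(module, name)
    raise AttributeError(f"module 'demo_old_location' has no attribute {name!r}")


# A function stored as `__getattr__` in the module namespace is the PEP 562 hook.
old_mod.__getattr__ = _forward
sys.modules["demo_old_location"] = old_mod

# Old-style imports keep working and return the relocated function.
from demo_old_location import make_tagger as legacy_make_tagger

print(legacy_make_tagger("en"))
```

In a real package the hook would simply be a `def __getattr__(name)` at the top level of the old module file.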

Along the way I've also refactored the registration of functions for the config. Previously these ran as import-time side-effects, using the registry decorator. I've instead created a new module, spacy.registrations. When the registry is accessed it calls a function ensure_populated(), which causes the registrations to occur.

I've made a similar change to the Language.factory registrations in the new spacy.pipeline.factories module.

I want to remove these import-time side-effects so that we can speed up the loading time of the library, which can be especially painful on the CLI. I also find that I'm often working to track down the implementations of functions referenced by strings in the config. Having the registrations all happen in one place will make this easier.

With these changes I've fortunately avoided the need to migrate to Pydantic v2 properly --- we're still using the v1 compatibility shim. We might not be able to hold out forever though: Pydantic (reasonably) aren't actively supporting the v1 shims. I put a lot of work into v2 migration when investigating the 3.13 support, and it's definitely challenging. In any case, it's a relief that we don't have to do the v2 migration at the same time as the Cython 3.0/Python 3.13 support.
2025-05-22 13:47:21 +02:00
Daniël de Kok
e2b70df012
Configure isort to use the Black profile, recursively isort the spacy module (#12721)
* Use isort with Black profile

* isort all the things

* Fix import cycles as a result of import sorting

* Add DOCBIN_ALL_ATTRS type definition

* Add isort to requirements

* Remove isort from build dependencies check

* Typo
2023-06-14 17:48:41 +02:00
Ines Montani
933a7cf8d1 Fix Lexeme.from_ptr 2020-08-10 16:43:37 +02:00
Ines Montani
24f72c669c Merge branch 'develop' into master-tmp 2020-05-21 18:39:06 +02:00
adrianeboyd
a5cd203284
Reduce stored lexemes data, move feats to lookups (#5238)
* Reduce stored lexemes data, move feats to lookups

* Move non-derivable lexemes features (`norm / cluster / prob`) to
`spacy-lookups-data` as lookups
  * Get/set `norm` in both lookups and `LexemeC`, serialize in lookups
  * Remove `cluster` and `prob` from `LexemeC`, get/set/serialize in
    lookups only
* Remove serialization of lexemes data as `vocab/lexemes.bin`
  * Remove `SerializedLexemeC`
  * Remove `Lexeme.to_bytes/from_bytes`
* Modify normalization exception loading:
  * Always create `Vocab.lookups` table `lexeme_norm` for
    normalization exceptions
  * Load base exceptions from `lang.norm_exceptions`, but load
    language-specific exceptions from lookups
  * Set `lex_attr_getter[NORM]` including new lookups table in
    `BaseDefaults.create_vocab()` and when deserializing `Vocab`
* Remove all cached lexemes when deserializing vocab to override
  existing normalizations with the new normalizations (as a replacement
  for the previous step that replaced all lexemes data with the
  deserialized data)

* Skip English normalization test

Skip English normalization test because the data is now in
`spacy-lookups-data`.

* Remove norm exceptions

Moved to spacy-lookups-data.

* Move norm exceptions test to spacy-lookups-data

* Load extra lookups from spacy-lookups-data lazily

Load extra lookups (currently for cluster and prob) lazily from the
entry point `lg_extra` as `Vocab.lookups_extra`.

* Skip creating lexeme cache on load

To improve model loading times, do not create the full lexeme cache when
loading. The lexemes will be created on demand when processing.

* Identify numeric values in Lexeme.set_attrs()

With the removal of a special case for `PROB`, also identify `float` to
avoid trying to convert it with the `StringStore`.

* Skip lexeme cache init in from_bytes

* Unskip and update lookups tests for python3.6+

* Update vocab pickle to include lookups_extra

* Update vocab serialization tests

Check strings rather than lexemes since lexemes aren't initialized
automatically, account for addition of "_SP".

* Re-skip lookups test because of python3.5

* Skip PROB/float values in Lexeme.set_attrs

* Convert is_oov from lexeme flag to lex in vectors

Instead of storing `is_oov` as a lexeme flag, `is_oov` reports whether
the lexeme has a vector.
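
A small sketch of deriving `is_oov` from the vectors table instead of a stored flag. It assumes that a word counts as out-of-vocabulary exactly when no vector is stored for it. `TinyVocab` and its methods are hypothetical and much simpler than spaCy's `Vocab`.

```python
from typing import Dict, List


class TinyVocab:
    """Hypothetical stand-in for a vocab that owns a vectors table."""

    def __init__(self, vectors: Dict[str, List[float]]) -> None:
        self.vectors = dict(vectors)

    def has_vector(self, word: str) -> bool:
        return word in self.vectors

    def is_oov(self, word: str) -> bool:
        # No per-lexeme flag is stored; the answer is computed from the
        # vectors table (assumption: OOV means "no vector for this word").
        return not self.has_vector(word)


vocab = TinyVocab({"cat": [0.1, 0.2], "dog": [0.3, 0.4]})
print(vocab.is_oov("cat"), vocab.is_oov("zyzzyx"))
```

Because nothing is stored per lexeme, the result always matches whichever vectors are currently loaded.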

Co-authored-by: Matthew Honnibal <honnibal+gh@gmail.com>
2020-05-19 15:59:14 +02:00
adrianeboyd
98c59027ed
Use max(uint64) for OOV lexeme rank (#5303)
* Use max(uint64) for OOV lexeme rank

* Add test for default OOV rank

* Revert back to thinc==7.4.0

Requiring the updated version of thinc was unnecessary.

* Define OOV_RANK in one place

Define OOV_RANK in one place in `util`.

* Fix formatting [ci skip]

* Switch to external definitions of max(uint64)

Switch to external definitions of max(uint64) and confirm that they are
equal.
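
As a hedged illustration of the idea: the sketch below defines the rank constant once and checks it against an independently obtained maximum for an unsigned 64-bit integer. `ctypes` is used here only because it ships with Python; the external definitions the commit refers to are not shown in this log. The `ranks` table and the words are made up.

```python
import ctypes

# Single definition of the rank used for out-of-vocabulary lexemes.
OOV_RANK = 2**64 - 1

# Independent source for max(uint64): -1 wraps around to the largest
# value an unsigned 64-bit integer can hold.
EXTERNAL_MAX = ctypes.c_uint64(-1).value
assert OOV_RANK == EXTERNAL_MAX

# Hypothetical rank table: known words have small ranks, unknown words
# fall back to OOV_RANK and therefore sort last.
ranks = {"the": 0, "cat": 57}
words = ["zyzzyx", "cat", "the"]
ordered = sorted(words, key=lambda w: ranks.get(w, OOV_RANK))
print(ordered)
```

Using the largest representable value means an unknown word can never be ranked ahead of a known one.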
2020-04-15 13:49:47 +02:00
Ines Montani
648f61d077
Tidy up compiler flags and imports (#5071) 2020-03-02 11:48:10 +01:00
Ines Montani
62b558ab72 💫 Support lexical attributes in retokenizer attrs (closes #2390) (#3325)
* Fix formatting and whitespace

* Add support for lexical attributes (closes #2390)

* Document lexical attribute setting during retokenization

* Assign variable outside of nested loop
2019-02-24 21:13:51 +01:00
Matthew Honnibal
84e66ca6d4 WIP on stringstore change. 27 failures 2017-05-28 14:06:40 +02:00
Matthew Honnibal
f51e6a6c16 Adjust lexeme sizing for attr_t being 64 bit 2017-05-28 12:51:09 +02:00
Matthew Honnibal
793430aa7a Get spaCy train command working with neural network
* Integrate models into pipeline
* Add basic serialization (maybe incorrect)
* Fix pickle on vocab
2017-05-17 12:04:50 +02:00
Wolfgang Seeker
03fb498dbe introduce lang field for LexemeC to hold language id
put noun_chunk logic into iterators.py for each language separately
2016-03-10 13:01:34 +01:00
Matthew Honnibal
193f127f81 * Fix ugly py_check_flag and py_set_flag functions in Lexeme 2015-09-15 13:06:18 +10:00
Matthew Honnibal
e7e529edf4 * Fix Lexeme.check_flag 2015-09-10 14:45:43 +02:00
Matthew Honnibal
4f8e38271d * Fix merge errors in lexeme.pxd 2015-09-06 20:19:08 +02:00
Matthew Honnibal
86c888667f * Merge in changes from de branch 2015-09-06 19:49:28 +02:00
Matthew Honnibal
d2fc104a26 * Begin merge of Gazetteer and DE branches 2015-09-06 19:45:15 +02:00
Matthew Honnibal
e35bb36be7 * Ensure Lexeme.check_flag returns a boolean value 2015-09-06 17:52:32 +02:00
Matthew Honnibal
6f1743692a * Work on language-independent refactoring 2015-08-23 20:49:18 +02:00
Matthew Honnibal
cad0cca4e3 * Tmp 2015-08-22 22:04:34 +02:00
Matthew Honnibal
c263577424 * Fix lower attribute in lexeme.pxd 2015-08-06 16:07:41 +02:00
Matthew Honnibal
6bb96c122d * Host IS_ flags in attrs.pxd, and add properties for them on Token and Lexeme objects 2015-07-26 16:37:16 +02:00
Matthew Honnibal
4dddc8a69b * Fix type declarations for attr_t. Remove unused id_t. 2015-07-18 22:39:57 +02:00
Matthew Honnibal
a6d040bd11 * Import Lexeme attrs from spacy.attrs, not spacy.typedefs 2015-07-16 11:20:08 +02:00
Matthew Honnibal
65251e7625 * Remove redundant attr_id_t from typedefs.pxd 2015-07-16 00:58:51 +02:00
Matthew Honnibal
78db7e32f7 * Remove has_sense method from Lexeme declaration 2015-07-08 19:41:20 +02:00
Matthew Honnibal
b64c843861 * Remove senses attr 2015-07-08 19:26:24 +02:00
Matthew Honnibal
2b8459d9a8 * Add senses flag to Lexeme 2015-07-01 20:10:41 +02:00
Matthew Honnibal
c04e6ebca6 * Allow user to load different sized vectors. 2015-06-05 16:26:39 +02:00
Jordan Suchow
3a8d9b37a6 Remove trailing whitespace 2015-04-19 13:01:38 -07:00
Matthew Honnibal
321b402739 * Store the l2 norm of the word's vector 2015-02-07 08:42:16 -05:00
Matthew Honnibal
fda94271af * Rename NORM1 and NORM2 attrs to lower and norm 2015-01-24 06:17:03 +11:00
Matthew Honnibal
5ed8b2b98f * Rename sic to orth 2015-01-23 02:08:25 +11:00
Matthew Honnibal
5e63c606ad * Rename vec to repvec 2015-01-22 02:03:54 +11:00
Matthew Honnibal
6c7e44140b * Work on word vectors, and other stuff 2015-01-17 16:21:17 +11:00
Matthew Honnibal
7d3c40de7d * Tests passing after refactor. API has obvious warts, particularly in Token and Lexeme 2015-01-15 00:33:16 +11:00
Matthew Honnibal
0930892fc1 * Tmp. Working on refactor. Compiles, must hook up lexical feats. 2015-01-14 00:03:48 +11:00
Matthew Honnibal
46da3d74d2 * Tmp. Refactoring, introducing a Lexeme PyObject. 2015-01-12 11:23:44 +11:00
Matthew Honnibal
ce2edd6312 * Tmp commit. Refactoring to create a Python Lexeme class. 2015-01-12 10:26:22 +11:00
Matthew Honnibal
4c4aa2c5c9 * Work on train 2014-12-22 07:25:43 +11:00
Matthew Honnibal
f6556d8e5d * Refactor, move Lexeme struct to structs.pxd 2014-12-20 06:51:03 +11:00
Matthew Honnibal
9959a64f7b * Working morphology and lemmatisation. POS tagging quite fast. 2014-12-10 08:09:32 +11:00
Matthew Honnibal
ef4398b204 * Rearrange POS stuff, so that language-specific stuff can live in language-specific modules 2014-12-07 23:52:41 +11:00
Matthew Honnibal
49f3780ff5 * Fiddle with lexeme attrs 2014-12-04 21:22:38 +11:00
Matthew Honnibal
e1b1f45cc9 * Add STEM attribute to lexeme 2014-12-04 20:46:20 +11:00
Matthew Honnibal
d70d31aa45 * Introduce first attempt at const-ness 2014-12-03 15:44:25 +11:00
Matthew Honnibal
b463a7eb86 * Make flag-setting a language-specific thing 2014-12-03 11:04:32 +11:00
Matthew Honnibal
50309e6e49 * Fix context vector, importing all features 2014-11-05 22:11:39 +11:00
Matthew Honnibal
70ea862703 * Remove vocab10k field, and add flags for gazetteers 2014-11-03 00:13:51 +11:00
Matthew Honnibal
8335706321 * Add LIKE_URL and LIKE_NUMBER flag features 2014-11-02 13:19:23 +11:00