Commit Graph

24 Commits

Author SHA1 Message Date
Matthew Honnibal
5bebbf7550
Python 3.13 support (#13823)
In order to support Python 3.13, we had to migrate to Cython 3.0. This caused some tricky interaction with our Pydantic usage, because Cython 3 uses the from __future__ import annotations semantics, which causes type annotations to be saved as strings.

The end result is that we can't have Language.factory decorated functions in Cython modules anymore, as the Language.factory decorator expects to inspect the signature of the functions and build a Pydantic model. If the function is implemented in Cython, an error is raised because the type is not resolved.

To address this I've moved the factory functions into a new module, spacy.pipeline.factories. I've added __getattr__ importlib hooks to the previous locations, in case anyone was importing these functions directly. The change should have no backwards compatibility implications.

Along the way I've also refactored the registration of functions for the config. Previously these ran as import-time side-effects, using the registry decorator. I've created instead a new module spacy.registrations. When the registry is accessed it calls a function ensure_populated(), which cases the registrations to occur.

I've made a similar change to the Language.factory registrations in the new spacy.pipeline.factories module.

I want to remove these import-time side-effects so that we can speed up the loading time of the library, which can be especially painful on the CLI. I also find that I'm often working to track down the implementations of functions referenced by strings in the config. Having the registrations all happen in one place will make this easier.

With these changes I've fortunately avoided the need to migrate to Pydantic v2 properly --- we're still using the v1 compatibility shim. We might not be able to hold out forever though: Pydantic (reasonably) aren't actively supporting the v1 shims. I put a lot of work into v2 migration when investigating the 3.13 support, and it's definitely challenging. In any case, it's a relief that we don't have to do the v2 migration at the same time as the Cython 3.0/Python 3.13 support.
2025-05-22 13:47:21 +02:00
Basile Dura
b0228d8ea6
ci: add cython linter (#12694)
* chore: add cython-linter dev dependency

* fix: lexeme.pyx

* fix: morphology.pxd

* fix: tokenizer.pxd

* fix: vocab.pxd

* fix: morphology.pxd (line length)

* ci: add cython-lint

* ci: fix cython-lint call

* Fix kb/candidate.pyx.

* Fix kb/kb.pyx.

* Fix kb/kb_in_memory.pyx.

* Fix kb.

* Fix training/ partially.

* Fix training/. Ignore trailing whitespaces and too long lines.

* Fix ml/.

* Fix matcher/.

* Fix pipeline/.

* Fix tokens/.

* Fix build errors. Fix vocab.pyx.

* Fix cython-lint install and run.

* Fix lexeme.pyx, parts_of_speech.pxd, vectors.pyx. Temporarily disable cython-lint execution.

* Fix attrs.pyx, lexeme.pyx, symbols.pxd, isort issues.

* Make cython-lint install conditional. Fix tokenizer.pyx.

* Fix remaining files. Reenable cython-lint check.

* Readded parentheses.

* Fix test_build_dependencies().

* Add explanatory comment to cython-lint execution.

---------

Co-authored-by: Raphael Mitsch <r.mitsch@outlook.com>
2023-07-19 12:03:31 +02:00
Daniël de Kok
e2b70df012
Configure isort to use the Black profile, recursively isort the spacy module (#12721)
* Use isort with Black profile

* isort all the things

* Fix import cycles as a result of import sorting

* Add DOCBIN_ALL_ATTRS type definition

* Add isort to requirements

* Remove isort from build dependencies check

* Typo
2023-06-14 17:48:41 +02:00
svlandeg
ed53bb979d cleanup 2021-01-13 14:20:05 +01:00
svlandeg
86a4e316b8 fix sent_starts 2021-01-13 13:47:25 +01:00
Ines Montani
648f61d077
Tidy up compiler flags and imports (#5071) 2020-03-02 11:48:10 +01:00
Ines Montani
de11ea753a Merge branch 'master' into develop 2020-02-18 14:47:23 +01:00
adrianeboyd
5ee9d8c9b8
Add MORPH attr, add support in retokenizer (#4947)
* Add MORPH attr / symbol for token attrs

* Update retokenizer for MORPH
2020-01-29 17:45:46 +01:00
Sofie Van Landeghem
a1b22e90cd serialize ENT_ID (#4852)
* expand serialization test for custom token attribute

* add failing test for issue 4849

* define ENT_ID as attr and use in doc serialization

* fix few typos
2020-01-06 14:57:34 +01:00
svlandeg
8608685543 ensure Span.as_doc keeps the entity links + unit test 2019-06-25 15:28:51 +02:00
Matthew Honnibal
8aa7882762
Make NORM a token attribute (#3029)
See #3028. The solution in this patch is pretty debateable.

What we do is give the TokenC struct a .norm field, by repurposing the previously idle .sense attribute. It's nice to repurpose a previous field because it means the TokenC doesn't change size, so even if someone's using the internals very deeply, nothing will break.

The weird thing here is that the TokenC and the LexemeC both have an attribute named NORM. This arguably assists in backwards compatibility. On the other hand, maybe it's really bad! We're changing the semantics of the attribute subtly, so maybe it's better if someone calling lex.norm gets a breakage, and instead is told to write lex.default_norm?

Overall I believe this patch makes the NORM feature work the way we sort of expected it to work. Certainly it's much more like how the docs describe it, and more in line with how we've been directing people to use the norm attribute. We'll also be able to use token.norm to do stuff like spelling correction, which is pretty cool.
2018-12-08 10:49:10 +01:00
Ines Montani
3141e04822
💫 New system for error messages and warnings (#2163)
* Add spacy.errors module

* Update deprecation and user warnings

* Replace errors and asserts with new error message system

* Remove redundant asserts

* Fix whitespace

* Add messages for print/util.prints statements

* Fix typo

* Fix typos

* Move CLI messages to spacy.cli._messages

* Add decorator to display error code with message

An implementation like this is nice because it only modifies the string when it's retrieved from the containing class – so we don't have to worry about manipulating tracebacks etc.

* Remove unused link in spacy.about

* Update errors for invalid pipeline components

* Improve error for unknown factories

* Add displaCy warnings

* Update formatting consistency

* Move error message to spacy.errors

* Update errors and check if doc returned by component is None
2018-04-03 15:50:31 +02:00
Thomas Opsomer
515e25910e fix sent_start in serialization 2018-01-28 19:50:42 +01:00
Matthew Honnibal
a002264fec Remove caching of Token in Doc, as caused cycle. 2017-10-16 19:34:21 +02:00
Matthew Honnibal
87613edf8f Add set_struct_attr staticmethod to token 2016-11-25 12:41:47 +01:00
Matthew Honnibal
9a03a3f85e Add get_struct_attr staticmethod to Token, to match Lexeme.get_struct_attr. 2016-11-25 11:35:17 +01:00
Matthew Honnibal
09ab447a18 Remove tensor property from token. 2016-10-17 02:45:09 +02:00
Matthew Honnibal
f3be9d0a9a Add tensor field to Lexeme, Token, Doc and Span, so that users have a place to hang neural network outputs 2016-10-14 03:24:13 +02:00
Matthew Honnibal
58e83fe34b Initial, limited support for quantified patterns in Matcher, and tracking of ent_id attribute in Token and Span. The quantifiers need a lot more testing, and there are some known problems. The main known problem is that the zero-plus and one-plus quantifiers won't work if a token can match both the quantified pattern expression AND the tail of the match. 2016-09-21 14:54:55 +02:00
Wolfgang Seeker
46e3f979f1 add function for setting head and label to token
change PseudoProjectivity.deprojectivize to use these functions
2016-03-11 17:31:06 +01:00
Matthew Honnibal
30be4f15da * Import attrs from spacy.attrs, not spacy.typedefs 2015-07-16 11:23:25 +02:00
Matthew Honnibal
81aa4e6dcc * Go back to having token reference doc, instead of complicated gymnastics. Rename the attr 'doc', to expose it in the API 2015-07-14 00:10:11 +02:00
Matthew Honnibal
8214b74eec * Restore _py_tokens cache, to handle orphan tokens. 2015-07-13 22:28:10 +02:00
Matthew Honnibal
dba6b47d4e * Refactor monster tokens.pyx file, into a tokens/ subpackage. Try to break the cycle between Doc and Token, and remove the need to pass around a unicode string reference 2015-07-13 19:20:48 +02:00