Commit Graph

9443 Commits

Author SHA1 Message Date
Matthew Honnibal
c015dd1fa6 isort 2025-05-28 17:27:59 +02:00
Matthew Honnibal
80aa445f34 Format 2025-05-28 17:27:36 +02:00
Matthew Honnibal
79f9d3ea2a Merge branch 'master' into fix/enum-python-types 2025-05-28 17:26:47 +02:00
Jeff Adolphe
41e07772dc
Added Haitian Creole (ht) Language Support to spaCy (#13807)
This PR adds official support for Haitian Creole (ht) to spaCy's spacy/lang module.
It includes:

    Added all core language data files for spacy/lang/ht:
        tokenizer_exceptions.py
        punctuation.py
        lex_attrs.py
        syntax_iterators.py
        lemmatizer.py
        stop_words.py
        tag_map.py

    Unit tests for tokenizer and noun chunking (test_tokenizer.py, test_noun_chunking.py, etc.). Passed all 58 pytest spacy/tests/lang/ht tests that I've created.

    Basic tokenizer rules adapted for Haitian Creole orthography and informal contractions.

    Custom like_num attribute supporting Haitian number formats (e.g., "3yèm").

    Support for common informal apostrophe usage (e.g., "m'ap", "n'ap", "di'm").

    Ensured no breakages in other language modules.

    Followed spaCy coding style (PEP8, Black).

This provides a foundation for Haitian Creole NLP development using spaCy.
2025-05-28 17:23:38 +02:00
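The like_num behaviour described in the Haitian Creole commit above could look roughly like the sketch below. This is not the code from spacy/lang/ht/lex_attrs.py: the word list, the ordinal pattern, and the helper names are assumptions chosen only to illustrate how a suffix such as "yèm" might be accepted alongside plain digits.

```python
import re

# Hypothetical, deliberately tiny word list; the real module may differ.
_NUM_WORDS = {"en", "de", "twa", "kat", "senk", "sis", "sèt", "uit", "nèf", "dis"}

# Assumed ordinal format: digits followed by the suffix "yèm", as in "3yèm".
_ORDINAL_RE = re.compile(r"^\d+yèm$")


def like_num(text: str) -> bool:
    """Return True if the token text looks like a number."""
    text = text.lower()
    # Drop a single leading sign or approximation marker.
    if text.startswith(("+", "-", "±", "~")):
        text = text[1:]
    # Ignore thousands separators and decimal points.
    text = text.replace(",", "").replace(".", "")
    if text.isdigit():
        return True
    # Simple fractions such as "1/2".
    if text.count("/") == 1:
        num, denom = text.split("/")
        if num.isdigit() and denom.isdigit():
            return True
    if text in _NUM_WORDS:
        return True
    return bool(_ORDINAL_RE.match(text))
```

In spaCy, a function like this is typically registered under the LIKE_NUM key of a language's LEX_ATTRS dictionary; whether the ht module follows that layout exactly is not stated in the commit message.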
BLKSerene
7b1d6e58ff
Remove dependency on langcodes (#13760)
This PR removes the dependency on langcodes introduced in #9342.

While the introduction of langcodes allows a significantly wider range of language codes, there are some unexpected side effects:

    zh-Hant (Traditional Chinese) should be mapped to zh instead of None, as spaCy's Chinese model is based on pkuseg which supports tokenization of both Simplified and Traditional Chinese.
    Since it is possible that spaCy may have a model for Norwegian Nynorsk in the future, mapping no (macrolanguage Norwegian) to nb (Norwegian Bokmål) might be misleading. In that case, the user should be asked to specify nb or nn (Norwegian Nynorsk) specifically or consult the doc.
    Same as above for regional variants of languages such as en_gb and en_us.

Overall, IMHO, introducing an extra dependency just for the conversion of language codes is overkill. It is possible that most users just need the conversion between 2/3-letter ISO codes, and a simple dictionary lookup should suffice.

With this PR, ISO 639-1 and ISO 639-3 codes are supported. ISO 639-2/B codes (bibliographic codes, which are neither favored nor used in ISO 639-3) and deprecated ISO 639-1/2 codes are also supported to maximize backward compatibility.
2025-05-28 17:21:46 +02:00
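The "simple dictionary lookup" argued for in the langcodes commit above can be sketched as follows. The tables are a small illustrative subset, and normalize_lang_code is a hypothetical name, not the function added by the PR. The sketch only handles bare 2/3-letter codes and leaves tags with subtags (zh-Hant, en_gb) out of scope.

```python
from typing import Optional

# Illustrative subsets only; a real table would cover every supported language.
_SUPPORTED = {"en", "de", "fr", "zh", "nb", "nn", "he", "id", "yi"}
_ISO_639_3_TO_1 = {
    "eng": "en", "deu": "de", "fra": "fr",
    "zho": "zh", "nob": "nb", "nno": "nn",
}
# ISO 639-2/B (bibliographic) codes that differ from their 639-3 forms.
_ISO_639_2B_TO_1 = {"ger": "de", "fre": "fr", "chi": "zh"}
# Deprecated ISO 639-1 codes kept for backward compatibility.
_DEPRECATED_TO_1 = {"iw": "he", "in": "id", "ji": "yi"}


def normalize_lang_code(code: str) -> Optional[str]:
    """Map a 2/3-letter ISO code to a supported 2-letter code, else None."""
    code = code.strip().lower()
    if code in _SUPPORTED:
        return code
    for table in (_ISO_639_3_TO_1, _ISO_639_2B_TO_1, _DEPRECATED_TO_1):
        if code in table:
            return table[code]
    # Unknown or ambiguous codes (e.g. the macrolanguage "no") are not guessed.
    return None
```

Returning None for "no" mirrors the commit's point that the caller should choose between nb and nn explicitly instead of being mapped silently.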
Matthew Honnibal
864c2f3b51 Format 2025-05-28 17:06:11 +02:00
Matthew Honnibal
75a9d9b9ad Test and fix issue13769 2025-05-28 17:04:23 +02:00
Matthew Honnibal
2567266bf7 Merge branch 'master' into fix/enum-python-types 2025-05-27 11:16:12 +02:00
Matthew Honnibal
5e1ee975c9 Fix quirk of enum values in Python
After the Cython 3 change, the types of enum members such as
spacy.parts_of_speech.NOUN became 'flag', rather than simple 'int'.
This change mostly doesn't matter because the flag type does duck-type
like an int -- it compares, adds, prints etc. the same. However,
it doesn't repr the same and if you do an isinstance check it will fail.
It's therefore better to just make them ints like they were before.
2025-05-27 11:09:37 +02:00
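The enum quirk in the commit above can be illustrated by analogy with the standard library. The sketch uses enum.IntEnum as a stand-in; it is not the Cython-generated 'flag' type, whose exact behaviour (including the isinstance failure the commit mentions) is not reproduced here. The class name and values are illustrative assumptions.

```python
from enum import IntEnum


class POS(IntEnum):
    # Stand-in for an enum-like type; the values are illustrative only.
    NOUN = 92
    VERB = 100


member = POS.NOUN

# Duck-types like an int: comparison and arithmetic behave as expected.
assert member == 92
assert member + 1 == 93

# But it is not a plain int, and its repr differs from that of 92.
assert type(member) is not int
assert repr(member) != repr(92)

# Converting back to a plain int restores the old behaviour.
plain = int(member)
assert type(plain) is int
assert repr(plain) == "92"
```

Code that checks exact types or compares reprs is the kind that breaks under such a change, which is why exposing plain ints again is the more conservative choice.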
d0ngw
46613e27cf
fix: match hyphenated words to lemmas in index_table (e.g. "co-authored" -> "co-author") (#13816) 2025-05-27 01:20:26 +02:00
omahs
b205ff65e6
fix typos (#13813) 2025-05-26 16:05:29 +02:00
Matthew Honnibal
d08f4e3b10 Increment version 2025-05-22 13:58:00 +02:00
Matthew Honnibal
6036f344d3 Remove print statements 2025-05-22 13:56:31 +02:00
Matthew Honnibal
5bebbf7550
Python 3.13 support (#13823)
In order to support Python 3.13, we had to migrate to Cython 3.0. This caused some tricky interaction with our Pydantic usage, because Cython 3 uses the from __future__ import annotations semantics, which causes type annotations to be saved as strings.

The end result is that we can't have Language.factory decorated functions in Cython modules anymore, as the Language.factory decorator expects to inspect the signature of the functions and build a Pydantic model. If the function is implemented in Cython, an error is raised because the type is not resolved.
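The effect of string annotations can be reproduced in plain Python. The sketch below writes the annotations as string literals by hand, which is what the from __future__ import annotations semantics amount to. The function names are hypothetical, and this is not spaCy's or Pydantic's actual inspection code.

```python
import inspect
from typing import get_type_hints


def make_component(name: "str", threshold: "float" = 0.5):
    # Annotations are stored as the strings "str" and "float".
    return (name, threshold)


# Raw signature inspection sees the unresolved string.
raw = inspect.signature(make_component).parameters["threshold"].annotation

# get_type_hints evaluates the strings in the function's module namespace.
resolved = get_type_hints(make_component)


def make_other(model: "NotImportedHere"):
    # The annotation names a type that cannot be found at resolution time.
    return model


try:
    get_type_hints(make_other)
    unresolved_error = None
except NameError as err:
    # Resolution fails because the name is not defined in this module.
    unresolved_error = err
```

A tool that builds a validation model from a signature needs the resolved types, so it fails in the same way whenever the namespace needed for resolution is unavailable.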

To address this I've moved the factory functions into a new module, spacy.pipeline.factories. I've added __getattr__ importlib hooks to the previous locations, in case anyone was importing these functions directly. The change should have no backwards compatibility implications.
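A module-level __getattr__ hook of the kind described here (PEP 562) can be sketched as follows. The module names mypkg_factories and mypkg_tagger and the function make_tagger are hypothetical, and the modules are built in memory so the example is self-contained. spaCy's actual hooks may be written differently.

```python
import importlib
import sys
import types

# New home of the function (hypothetical module name).
new_mod = types.ModuleType("mypkg_factories")


def make_tagger(name):
    return f"tagger:{name}"


new_mod.make_tagger = make_tagger
sys.modules["mypkg_factories"] = new_mod

# Old location: no longer defines make_tagger, only forwards lookups.
old_mod = types.ModuleType("mypkg_tagger")
_MOVED = {"make_tagger": "mypkg_factories"}


def _module_getattr(name):
    # Called only when normal attribute lookup on the module fails.
    if name in _MOVED:
        target = importlib.import_module(_MOVED[name])
        return getattr(target, name)
    raise AttributeError(f"module 'mypkg_tagger' has no attribute {name!r}")


old_mod.__getattr__ = _module_getattr
sys.modules["mypkg_tagger"] = old_mod

# Existing imports from the old location keep working.
from mypkg_tagger import make_tagger as legacy_import
```

In a real package the hook would simply be a function named __getattr__ defined at the top level of the old module's source file.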

Along the way I've also refactored the registration of functions for the config. Previously these ran as import-time side-effects, using the registry decorator. Instead, I've created a new module, spacy.registrations. When the registry is accessed it calls a function ensure_populated(), which causes the registrations to occur.

I've made a similar change to the Language.factory registrations in the new spacy.pipeline.factories module.
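The lazy registration pattern can be sketched in a few lines. Only the name ensure_populated comes from the commit message. LazyRegistry, populate, and the registered function are hypothetical, and this is not the code in spacy.registrations.

```python
from typing import Callable, Dict


class LazyRegistry:
    """Registry whose entries are filled in on first access, not at import."""

    def __init__(self, populate: Callable[["LazyRegistry"], None]) -> None:
        self._funcs: Dict[str, Callable] = {}
        self._populate = populate
        self._populated = False

    def register(self, name: str, func: Callable) -> None:
        self._funcs[name] = func

    def ensure_populated(self) -> None:
        if not self._populated:
            # Set the flag first so re-entrant calls do not populate twice.
            self._populated = True
            self._populate(self)

    def get(self, name: str) -> Callable:
        self.ensure_populated()
        return self._funcs[name]


populate_calls = []


def populate(registry: LazyRegistry) -> None:
    # All registrations happen in one place, only when first needed.
    populate_calls.append(1)
    registry.register("double.v1", lambda x: x * 2)


registry = LazyRegistry(populate)
```

Creating the registry costs nothing at import time; the registrations run once, on the first call to get.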

I want to remove these import-time side-effects so that we can speed up the loading time of the library, which can be especially painful on the CLI. I also find that I'm often working to track down the implementations of functions referenced by strings in the config. Having the registrations all happen in one place will make this easier.

With these changes I've fortunately avoided the need to migrate to Pydantic v2 properly --- we're still using the v1 compatibility shim. We might not be able to hold out forever though: Pydantic (reasonably) aren't actively supporting the v1 shims. I put a lot of work into v2 migration when investigating the 3.13 support, and it's definitely challenging. In any case, it's a relief that we don't have to do the v2 migration at the same time as the Cython 3.0/Python 3.13 support.
2025-05-22 13:47:21 +02:00
Matthew Honnibal
911539e9a4 Update version 2025-05-18 12:18:38 +02:00
Matthew Honnibal
d0c705cbc9 Increment version 2025-04-01 09:40:59 +02:00
Matthew Honnibal
ba7468e32e
Update requirements, fixing windows crashes (#13727)
* Re-enable pretraining test

* Require thinc 8.3.4

* Reformat

* Re-enable test
2025-01-13 16:39:46 +01:00
Matthew Honnibal
311f7cc9fb Set version to v3.8.4 2024-12-11 14:14:08 +01:00
Matthew Honnibal
a6317b3836
Fix allocation of non-transient strings in StringStore (#13713)
* Fix bug in memory-zone code when adding non-transient strings. The error could result in segmentation faults or other memory errors during memory zones if new labels were added to the model.
* Fix handling of new morphological labels within memory zones. Addresses second issue reported in Memory leak of MorphAnalysis object. #13684
2024-12-11 13:06:53 +01:00
Andrei (Andrey) Khropov
8d2902b0e7
Fix misspelling (#13631) [ci skip] 2024-10-11 11:23:12 +02:00
Matthew Honnibal
bda4bb0184
Try disabling pretraining tests to probe windows ci failure (#13646) 2024-10-02 01:01:40 +02:00
Matthew Honnibal
0cdcfe56cb Set version to v3.8.2 2024-10-01 16:47:24 +02:00
Matthew Honnibal
9c5b61bdff isort 2024-10-01 12:38:51 +02:00
Matthew Honnibal
725ccbac39 Format 2024-10-01 12:38:02 +02:00
Matthew Honnibal
a8837beab7 Set version to v3.8.1 2024-10-01 12:37:11 +02:00
Matthew Honnibal
114b4894fb Fix --require-parent default 2024-09-29 15:50:31 +02:00
Matthew Honnibal
dec13b4258 Fix inverted cli arg 2024-09-29 15:50:05 +02:00
Matthew Honnibal
c03f060527 Allow positive option --require-parent 2024-09-29 14:30:14 +02:00
Matthew Honnibal
6255cb985f Include version constraint in parent package requirement 2024-09-29 14:22:21 +02:00
Matthew Honnibal
3b165a8716 Simplify setting to require parent package 2024-09-29 14:19:10 +02:00
Matthew Honnibal
969832f5d6 Fix package 2024-09-29 14:00:11 +02:00
Matthew Honnibal
8ce53a6bbe Syntax 2024-09-29 13:51:44 +02:00
Matthew Honnibal
6fa0d709d5 Support option to not depend on parent package in spacy package 2024-09-29 13:51:04 +02:00
Matthew Honnibal
5010fcbd3a Fix numpy constant 2024-09-14 13:13:11 +02:00
Matthew Honnibal
de4f19f3a3 Fix version 2024-09-14 13:12:44 +02:00
Matthew Honnibal
3d03565498 Replace numpy floats in evaluate and update 2024-09-14 12:55:53 +02:00
Matthew Honnibal
0576a1ff56 Fix numpy floats in meta.json 2024-09-14 12:54:08 +02:00
Matthew Honnibal
2f1e7ed09a Lint 2024-09-14 11:36:27 +02:00
Matthew Honnibal
e2dc9b79e1 Format 2024-09-14 11:29:40 +02:00
Matthew Honnibal
3c3d75015b Set version to v3.7.7 2024-09-14 11:27:32 +02:00
Matthew Honnibal
50aa3b5cbe Merge branch 'master' of https://github.com/explosion/spaCy 2024-09-14 11:09:44 +02:00
Matthew Honnibal
8266031454 Merge numpy version update 2024-09-14 11:08:35 +02:00
Matthew Honnibal
69ecb85fad Set version to v3.8.1 2024-09-13 10:43:40 +02:00
Matthew Honnibal
b427597fc8 Set version to v3.8.0 2024-09-11 21:32:26 +02:00
Matthew Honnibal
c068e1de1b Fix dependencies 2024-09-11 15:57:52 +02:00
marinelay
b18cc94451
Delete unnecessary method (#13441)
Co-authored-by: marinelay <marinelay@gmail.com>
2024-09-09 20:57:13 +02:00
Matthew Honnibal
4cc3ebe74e Format 2024-09-09 20:56:01 +02:00
Matthew Honnibal
a019315534 Fix memory zones 2024-09-09 13:49:41 +02:00
Matthew Honnibal
59ac7e6bdb Format 2024-09-09 11:22:52 +02:00
Matthew Honnibal
b65491b641 Set version to v3.8.0.dev0 2024-09-09 11:20:23 +02:00