16248 Commits

Author SHA1 Message Date
Christopher Degawa
276c2a2347
Merge 04703d0d06 into 41e07772dc 2025-05-28 15:15:09 -05:00
Jeff Adolphe
41e07772dc
Added Haitian Creole (ht) Language Support to spaCy (#13807)
This PR adds official support for Haitian Creole (ht) to spaCy's spacy/lang module.
It includes:

    Added all core language data files for spacy/lang/ht:
        tokenizer_exceptions.py
        punctuation.py
        lex_attrs.py
        syntax_iterators.py
        lemmatizer.py
        stop_words.py
        tag_map.py

    Unit tests for tokenizer and noun chunking (test_tokenizer.py, test_noun_chunking.py, etc.). All 58 tests that I created under spacy/tests/lang/ht pass with pytest.

    Basic tokenizer rules adapted for Haitian Creole orthography and informal contractions.

    Custom like_num attribute supporting Haitian number formats (e.g., "3yèm").

    Support for common informal apostrophe usage (e.g., "m'ap", "n'ap", "di'm").

    Ensured no breakages in other language modules.

    Followed spaCy coding style (PEP8, Black).

This provides a foundation for Haitian Creole NLP development using spaCy.
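
As a rough illustration of the like_num item above, a lexical attribute of this kind could look like the sketch below. The number-word set and the ordinal pattern are assumptions made for this example; they are not copied from spacy/lang/ht/lex_attrs.py.

```python
import re

# Illustrative subset of Haitian Creole number words (assumption: the real
# list in the language data is longer and may differ).
_NUM_WORDS = {"de", "twa", "kat"}

# Assumed ordinal pattern: digits followed by "yèm", as in "3yèm".
_ORDINAL_RE = re.compile(r"^\d+yèm$")


def like_num(text: str) -> bool:
    """Return True if the token text looks like a number."""
    text = text.lower()
    # Plain digits, allowing thousands/decimal separators such as "1,000".
    if text.replace(",", "").replace(".", "").isdigit():
        return True
    # Simple fractions such as "1/2".
    if text.count("/") == 1:
        num, denom = text.split("/")
        if num.isdigit() and denom.isdigit():
            return True
    # Ordinals written as digits plus suffix.
    if _ORDINAL_RE.match(text):
        return True
    return text in _NUM_WORDS
```

A function like this is typically wired in through the language's lexical attribute getters, so that token.like_num reflects it.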
2025-05-28 17:23:38 +02:00
Martin Schorfmann
e8f40e2169
Correct API docs for Span.lemma_, Vocab.to_bytes and Vectors.__init__ (#13436)
* Correct code example for Span.lemma_ in API Docs (#13405)

* Correct documented return type of Vocab.to_bytes in API docs

* Correct wording for Vectors.__init__ in API docs
2025-05-28 17:22:50 +02:00
BLKSerene
7b1d6e58ff
Remove dependency on langcodes (#13760)
This PR removes the dependency on langcodes introduced in #9342.

While the introduction of langcodes allows a significantly wider range of language codes, there are some unexpected side effects:

    zh-Hant (Traditional Chinese) should be mapped to zh instead of None, as spaCy's Chinese model is based on pkuseg which supports tokenization of both Simplified and Traditional Chinese.
    Since it is possible that spaCy may have a model for Norwegian Nynorsk in the future, mapping no (macrolanguage Norwegian) to nb (Norwegian Bokmål) might be misleading. In that case, the user should be asked to specify nb or nn (Norwegian Nynorsk) specifically or consult the doc.
    Same as above for regional variants of languages such as en_gb and en_us.

Overall, IMHO, introducing an extra dependency just for the conversion of language codes is overkill. Most users probably just need conversion between 2- and 3-letter ISO codes, for which a simple dictionary lookup should suffice.

With this PR, ISO 639-1 and ISO 639-3 codes are supported. ISO 639-2/B codes (bibliographic codes, which are not favored and are not used in ISO 639-3) and deprecated ISO 639-1/2 codes are also supported to maximize backward compatibility.
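
The dictionary-lookup approach described above might be sketched as follows. The function name find_lang_code and the table names are hypothetical, and the tables hold only a handful of entries for illustration; this is not the code from the PR.

```python
from typing import Optional

# Illustrative subsets only; complete tables would cover every supported code.
ISO_639_3_TO_1 = {
    "eng": "en", "deu": "de", "fra": "fr",
    "zho": "zh", "nob": "nb", "nno": "nn",
}
# ISO 639-2/B (bibliographic) codes that differ from the ISO 639-3 ones.
ISO_639_2B_TO_1 = {"ger": "de", "fre": "fr", "chi": "zh"}
# Deprecated ISO 639-1 codes and their current replacements.
DEPRECATED_TO_CURRENT = {"iw": "he", "in": "id", "ji": "yi"}

KNOWN_639_1 = (
    set(ISO_639_3_TO_1.values())
    | set(ISO_639_2B_TO_1.values())
    | set(DEPRECATED_TO_CURRENT.values())
)


def find_lang_code(code: str) -> Optional[str]:
    """Map a 2- or 3-letter ISO code to a 2-letter code, or None if unknown."""
    code = code.strip().lower()
    if code in KNOWN_639_1:
        return code
    for table in (ISO_639_3_TO_1, ISO_639_2B_TO_1, DEPRECATED_TO_CURRENT):
        if code in table:
            return table[code]
    # No guessing for macrolanguages or regional variants: the caller is
    # expected to pass a specific code.
    return None
```

Returning None for anything outside the tables keeps the behavior predictable, at the cost of accepting fewer spellings than a full language-tag parser.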
2025-05-28 17:21:46 +02:00
Matthew Honnibal
864c2f3b51 Format 2025-05-28 17:06:11 +02:00
Matthew Honnibal
75a9d9b9ad Test and fix issue13769 2025-05-28 17:04:23 +02:00
Ilie
bec546cec0
Add TeNs plugin (#13800)
Co-authored-by: Ilie Cristian Dorobat <idorobat@cisco.com>
2025-05-27 01:21:07 +02:00
d0ngw
46613e27cf
fix: match hyphenated words to lemmas in index_table (e.g. "co-authored" -> "co-author") (#13816) 2025-05-27 01:20:26 +02:00
omahs
b205ff65e6
fix typos (#13813) 2025-05-26 16:05:29 +02:00
BLKSerene
92f1b8cdb4
Switch to typer-slim (#13759) 2025-05-26 16:03:49 +02:00
Matthew Honnibal
4b65aa79ee Add release script 2025-05-22 14:00:48 +02:00
Matthew Honnibal
d08f4e3b10 Increment version 2025-05-22 13:58:00 +02:00
Matthew Honnibal
6036f344d3 Remove print statements 2025-05-22 13:56:31 +02:00
Matthew Honnibal
5bebbf7550
Python 3.13 support (#13823)
In order to support Python 3.13, we had to migrate to Cython 3.0. This caused some tricky interaction with our Pydantic usage, because Cython 3 uses the from __future__ import annotations semantics, which causes type annotations to be saved as strings.

The end result is that we can't have Language.factory decorated functions in Cython modules anymore, as the Language.factory decorator expects to inspect the signature of the functions and build a Pydantic model. If the function is implemented in Cython, an error is raised because the type is not resolved.
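
To make the "annotations saved as strings" point concrete, here is a small pure-Python sketch. It writes the annotations as string literals by hand, which is the form that from __future__ import annotations produces; make_component is a hypothetical function, and the sketch does not reproduce the Cython failure itself, where the strings cannot be resolved.

```python
import inspect
import typing


# Hypothetical factory-style function with string annotations, mimicking what
# `from __future__ import annotations` stores on the function object.
def make_component(name: "str", threshold: "float" = 0.5) -> "dict":
    return {"name": name, "threshold": threshold}


signature = inspect.signature(make_component)

# What signature inspection sees: plain strings, not types.
raw = {p.name: p.annotation for p in signature.parameters.values()}

# In pure Python the strings can be resolved against the module globals.
# A tool that builds a validation model from the signature needs this step
# to succeed.
resolved = typing.get_type_hints(make_component)

print(raw)
print(resolved)
```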

To address this I've moved the factory functions into a new module, spacy.pipeline.factories. I've added __getattr__ importlib hooks to the previous locations, in case anyone was importing these functions directly. The change should have no backwards compatibility implications.
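
A module-level __getattr__ hook of the kind mentioned above can be sketched like this. The module names demo_factories and demo_old_location and the function make_tagger are made up for the example, and the modules are built in memory so the snippet runs as a single file; the actual spaCy hooks may be structured differently.

```python
import importlib
import sys
import types

# Hypothetical "new" module that now holds the factory function.
new_mod = types.ModuleType("demo_factories")


def make_tagger(name: str) -> str:
    return f"tagger:{name}"


new_mod.make_tagger = make_tagger
sys.modules["demo_factories"] = new_mod

# Hypothetical "old" module that no longer defines make_tagger itself.
old_mod = types.ModuleType("demo_old_location")

# Names that moved, mapped to the module they moved to.
_MOVED = {"make_tagger": "demo_factories"}


def _module_getattr(name: str):
    # Called only when normal attribute lookup on the module fails (PEP 562).
    if name in _MOVED:
        module = importlib.import_module(_MOVED[name])
        return getattr(module, name)
    raise AttributeError(f"module 'demo_old_location' has no attribute {name!r}")


old_mod.__getattr__ = _module_getattr
sys.modules["demo_old_location"] = old_mod

# Existing code that reaches for the old location keeps working.
legacy = old_mod.make_tagger
print(legacy("ht"))
```

In a real package the hook would simply be a function named __getattr__ at the top level of the old module's source file.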

Along the way I've also refactored the registration of functions for the config. Previously these ran as import-time side-effects, using the registry decorator. I've instead created a new module, spacy.registrations. When the registry is accessed it calls a function ensure_populated(), which causes the registrations to occur.
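
The populate-on-first-access pattern described here could look roughly like the sketch below. LazyRegistry, populate and the registered name are hypothetical; only the ensure_populated() name comes from the description above, and the real spacy.registrations module is not reproduced.

```python
from typing import Callable, Dict


class LazyRegistry:
    """Registry whose entries are registered on first access, not at import."""

    def __init__(self, populate: Callable[["LazyRegistry"], None]) -> None:
        self._populate = populate
        self._functions: Dict[str, Callable] = {}
        self._populated = False

    def register(self, name: str, func: Callable) -> None:
        self._functions[name] = func

    def ensure_populated(self) -> None:
        if self._populated:
            return
        # Set the flag first so lookups made during population do not recurse.
        self._populated = True
        self._populate(self)

    def get(self, name: str) -> Callable:
        self.ensure_populated()
        return self._functions[name]


populate_calls = []


def populate(registry: LazyRegistry) -> None:
    # All registrations live in one place and run only when first needed.
    populate_calls.append(1)
    registry.register("demo.whitespace_tokenizer", lambda text: text.split())


registry = LazyRegistry(populate)
# Nothing has been registered yet; the first lookup triggers population.
tokenize = registry.get("demo.whitespace_tokenizer")
print(tokenize("bonjou tout moun"))
```

Deferring the work this way keeps module import cheap, and keeping every registration in one function gives a single place to search when tracing a string name from a config back to its implementation.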

I've made a similar change to the Language.factory registrations in the new spacy.pipeline.factories module.

I want to remove these import-time side-effects so that we can speed up the loading time of the library, which can be especially painful on the CLI. I also find that I'm often working to track down the implementations of functions referenced by strings in the config. Having the registrations all happen in one place will make this easier.

With these changes I've fortunately avoided the need to migrate to Pydantic v2 properly --- we're still using the v1 compatibility shim. We might not be able to hold out forever though: Pydantic (reasonably) aren't actively supporting the v1 shims. I put a lot of work into v2 migration when investigating the 3.13 support, and it's definitely challenging. In any case, it's a relief that we don't have to do the v2 migration at the same time as the Cython 3.0/Python 3.13 support.
2025-05-22 13:47:21 +02:00
Matthew Honnibal
911539e9a4 Update version 2025-05-18 12:18:38 +02:00
Matthew Honnibal
22c1bc785b Replace lte with lt for clarity 2025-05-18 12:18:17 +02:00
Matthew Honnibal
cb5e760e91 Fix python version supported 2025-05-18 12:17:23 +02:00
Gunther Cox
87ec2b72a5
Update spaCy Universe entry for ChatterBot to use correct name casing (#13784) 2025-05-12 07:47:50 +02:00
翟持江
aa8de0ed37
Update embeddings-transformers.mdx, update trf_data examples info in <Runtime usage> (#13811) 2025-05-12 07:47:12 +02:00
Adrien Carpentier
98a19df91a
docs: fix README.md for compatible Python versions (#13749) 2025-04-11 20:56:52 +02:00
Matthew Honnibal
92bd042502 Allow Python 3.13 2025-04-03 23:15:12 +02:00
Matthew Honnibal
d0c705cbc9 Increment version 2025-04-01 09:40:59 +02:00
Christopher Degawa
04703d0d06
displacy: fix import path for ipython 9.0.1
Signed-off-by: Christopher Degawa <ccom@randomderp.com>
2025-03-04 00:32:12 -06:00
Matthew Honnibal
b3c46c315e Add support for linux-arm 2025-02-03 18:32:23 +01:00
Ines Montani
d194f06437 Add live stream to site [ci skip] 2025-02-03 09:42:52 +01:00
Ines Montani
055e07d9cc Update README.md [ci skip] 2025-02-03 09:38:32 +01:00
Ines Montani
8e1c14e977 Add live stream to README [ci skip] 2025-02-03 09:37:48 +01:00
Christine P. Chai
4278182dd0
Change Twitter to X (#13740) [ci skip]
Co-authored-by: Ines Montani <ines@ines.io>
2025-02-03 09:30:21 +01:00
Matthew Honnibal
85cc763006 Fix python version requirement 2025-01-13 18:17:36 +01:00
Matthew Honnibal
ba7468e32e
Update requirements, fixing windows crashes (#13727)
* Re-enable pretraining test

* Require thinc 8.3.4

* Reformat

* Re-enable test
2025-01-13 16:39:46 +01:00
Matthew Honnibal
311f7cc9fb Set version to v3.8.4 2024-12-11 14:14:08 +01:00
Matthew Honnibal
682140496a Align requirements better 2024-12-11 14:13:51 +01:00
Matthew Honnibal
343f4f21d7 Enable Python 3.13 2024-12-11 14:13:28 +01:00
Matthew Honnibal
be0fa812c2 Update cibuildwheel 2024-12-11 13:08:40 +01:00
Matthew Honnibal
a6317b3836
Fix allocation of non-transient strings in StringStore (#13713)
* Fix bug in memory-zone code when adding non-transient strings. The error could result in segmentation faults or other memory errors during memory zones if new labels were added to the model.
* Fix handling of new morphological labels within memory zones. Addresses the second issue reported in "Memory leak of MorphAnalysis object" (#13684).
2024-12-11 13:06:53 +01:00
Ines Montani
3e30b5bef6 Add spacy-layout [ci skip] 2024-11-19 10:43:40 +01:00
Matthew Honnibal
3ecec1324c
Usage page on memory management, explaining memory zones and doc_cleaner (#13643) [ci skip]
Co-authored-by: Ines Montani <ines@ines.io>
2024-10-23 12:42:54 +02:00
Ikko Eltociear Ashimine
15fbf5ef36
docs: update rule-based-matching.mdx (#13665) [ci skip] 2024-10-23 12:07:01 +02:00
Sergei Pashakhin
1ee9a19059
Fix typo (#13657) [ci skip] 2024-10-23 12:06:36 +02:00
thjbdvlt
0d7e57fc3e
universe-pipeline-solipCysme-french (#13627) [ci skip] 2024-10-11 11:26:15 +02:00
Ines Montani
ae5c3e078d Fix universe.json [ci skip] 2024-10-11 11:24:42 +02:00
Andrei (Andrey) Khropov
8d2902b0e7
Fix misspelling (#13631) [ci skip] 2024-10-11 11:23:12 +02:00
aravind-mc
44d1906453
Update universe.json to add my spaCy online course (#13632) [ci skip] 2024-10-11 11:21:57 +02:00
sam rxh
52a4cb0d14
Fix 'issue template' link in CONTRIBUTING.md (#13587) [ci skip] 2024-10-11 11:20:34 +02:00
Ines Montani
10a6f508ab Fix landing banner links [ci skip] 2024-10-11 11:19:10 +02:00
Matthew Honnibal
bda4bb0184
Try disabling pretraining tests to probe windows ci failure (#13646) 2024-10-02 01:01:40 +02:00
Matthew Honnibal
628c973db5 Note minimum python requirement in setup.cfg 2024-10-02 00:49:09 +02:00
Matthew Honnibal
e0782c5e4c Merge branch 'master' into v3.8.x 2024-10-01 23:57:48 +02:00
Matthew Honnibal
5230754986 Fix thinc dependency 2024-10-01 23:49:17 +02:00
Matthew Honnibal
411b70f5f3 Update requirements 2024-10-01 23:46:54 +02:00