spaCy

mirror of https://github.com/explosion/spaCy.git synced 2026-01-07 09:11:12 +03:00

Author	SHA1	Message	Date
Matthew Honnibal	79f9d3ea2a	Merge branch 'master' into fix/enum-python-types	2025-05-28 17:26:47 +02:00
Jeff Adolphe	41e07772dc	Added Haitian Creole (ht) Language Support to spaCy (#13807 ) This PR adds official support for Haitian Creole (ht) to spaCy's spacy/lang module. It includes: Added all core language data files for spacy/lang/ht: tokenizer_exceptions.py punctuation.py lex_attrs.py syntax_iterators.py lemmatizer.py stop_words.py tag_map.py Unit tests for tokenizer and noun chunking (test_tokenizer.py, test_noun_chunking.py, etc.). Passed all 58 pytest spacy/tests/lang/ht tests that I've created. Basic tokenizer rules adapted for Haitian Creole orthography and informal contractions. Custom like_num attribute supporting Haitian number formats (e.g., "3yèm"). Support for common informal apostrophe usage (e.g., "m'ap", "n'ap", "di'm"). Ensured no breakages in other language modules. Followed spaCy coding style (PEP8, Black). This provides a foundation for Haitian Creole NLP development using spaCy.	2025-05-28 17:23:38 +02:00
Martin Schorfmann	e8f40e2169	Correct API docs for Span.lemma_, Vocab.to_bytes and Vectors.__init__ (#13436 ) * Correct code example for Span.lemma_ in API Docs (#13405) * Correct documented return type of Vocab.to_bytes in API docs * Correct wording for Vectors.__init__ in API docs	2025-05-28 17:22:50 +02:00
BLKSerene	7b1d6e58ff	Remove dependency on langcodes (#13760 ) This PR removes the dependency on langcodes introduced in #9342. While the introduction of langcodes allows a significantly wider range of language codes, there are some unexpected side effects: zh-Hant (Traditional Chinese) should be mapped to zh instead of None, as spaCy's Chinese model is based on pkuseg which supports tokenization of both Simplified and Traditional Chinese. Since it is possible that spaCy may have a model for Norwegian Nynorsk in the future, mapping no (macrolanguage Norwegian) to nb (Norwegian Bokmål) might be misleading. In that case, the user should be asked to specify nb or nn (Norwegian Nynorsk) specifically or consult the doc. Same as above for regional variants of languages such as en_gb and en_us. Overall, IMHO, introducing an extra dependency just for the conversion of language codes is overkill. It is possible that most users just need the conversion between 2/3-letter ISO codes and a simple dictionary lookup should suffice. With this PR, ISO 639-1 and ISO 639-3 codes are supported. ISO 639-2/B (bibliographic codes which are not favored and not used in ISO 639-3) and deprecated ISO 639-1/2 codes are also supported to maximize backward compatibility.	2025-05-28 17:21:46 +02:00
Matthew Honnibal	864c2f3b51	Format	2025-05-28 17:06:11 +02:00
Matthew Honnibal	75a9d9b9ad	Test and fix issue13769	2025-05-28 17:04:23 +02:00
Matthew Honnibal	2567266bf7	Merge branch 'master' into fix/enum-python-types	2025-05-27 11:16:12 +02:00
Matthew Honnibal	5e1ee975c9	Fix quirk of enum values in Python After the Cython 3 change, the types of enum members such as spacy.parts_of_speech.NOUN became 'flag', rather than simple 'int'. This change mostly doesn't matter because the flag type does duck-type like an int -- it compares, adds, prints, etc. the same. However, it doesn't repr the same and if you do an isinstance check it will fail. It's therefore better to just make them ints like they were before.	2025-05-27 11:09:37 +02:00
Ilie	bec546cec0	Add TeNs plugin (#13800 ) Co-authored-by: Ilie Cristian Dorobat <idorobat@cisco.com>	2025-05-27 01:21:07 +02:00
d0ngw	46613e27cf	fix: match hyphenated words to lemmas in index_table (e.g. "co-authored" -> "co-author") (#13816 )	2025-05-27 01:20:26 +02:00
omahs	b205ff65e6	fix typos (#13813 )	2025-05-26 16:05:29 +02:00
BLKSerene	92f1b8cdb4	Switch to typer-slim (#13759 )	2025-05-26 16:03:49 +02:00
Matthew Honnibal	4b65aa79ee	Add release script	2025-05-22 14:00:48 +02:00
Matthew Honnibal	d08f4e3b10	Increment version	2025-05-22 13:58:00 +02:00
Matthew Honnibal	6036f344d3	Remove print statements	2025-05-22 13:56:31 +02:00
Matthew Honnibal	5bebbf7550	Python 3.13 support (#13823 ) In order to support Python 3.13, we had to migrate to Cython 3.0. This caused some tricky interaction with our Pydantic usage, because Cython 3 uses the from __future__ import annotations semantics, which causes type annotations to be saved as strings. The end result is that we can't have Language.factory decorated functions in Cython modules anymore, as the Language.factory decorator expects to inspect the signature of the functions and build a Pydantic model. If the function is implemented in Cython, an error is raised because the type is not resolved. To address this I've moved the factory functions into a new module, spacy.pipeline.factories. I've added __getattr__ importlib hooks to the previous locations, in case anyone was importing these functions directly. The change should have no backwards compatibility implications. Along the way I've also refactored the registration of functions for the config. Previously these ran as import-time side-effects, using the registry decorator. I've created instead a new module spacy.registrations. When the registry is accessed it calls a function ensure_populated(), which causes the registrations to occur. I've made a similar change to the Language.factory registrations in the new spacy.pipeline.factories module. I want to remove these import-time side-effects so that we can speed up the loading time of the library, which can be especially painful on the CLI. I also find that I'm often working to track down the implementations of functions referenced by strings in the config. Having the registrations all happen in one place will make this easier. With these changes I've fortunately avoided the need to migrate to Pydantic v2 properly --- we're still using the v1 compatibility shim. We might not be able to hold out forever though: Pydantic (reasonably) aren't actively supporting the v1 shims. I put a lot of work into v2 migration when investigating the 3.13 support, and it's definitely challenging. In any case, it's a relief that we don't have to do the v2 migration at the same time as the Cython 3.0/Python 3.13 support.	2025-05-22 13:47:21 +02:00
Matthew Honnibal	911539e9a4	Update version	2025-05-18 12:18:38 +02:00
Matthew Honnibal	22c1bc785b	Replace lte with lt for clarity	2025-05-18 12:18:17 +02:00
Matthew Honnibal	cb5e760e91	Fix python version supported	2025-05-18 12:17:23 +02:00
Gunther Cox	87ec2b72a5	Update spaCy Universe entry for ChatterBot to use correct name casing (#13784 )	2025-05-12 07:47:50 +02:00
翟持江	aa8de0ed37	Update embeddings-transformers.mdx, update trf_data examples info in <Runtime usage> (#13811 )	2025-05-12 07:47:12 +02:00
Adrien Carpentier	98a19df91a	docs: fix README.md for compatible Python versions (#13749 )	2025-04-11 20:56:52 +02:00
Matthew Honnibal	92bd042502	Allow Python 3.13	2025-04-03 23:15:12 +02:00
Matthew Honnibal	d0c705cbc9	Increment version	2025-04-01 09:40:59 +02:00
Matthew Honnibal	b3c46c315e	Add support for linux-arm	2025-02-03 18:32:23 +01:00
Ines Montani	d194f06437	Add live stream to site [ci skip]	2025-02-03 09:42:52 +01:00
Ines Montani	055e07d9cc	Update README.md [ci skip]	2025-02-03 09:38:32 +01:00
Ines Montani	8e1c14e977	Add live stream to README [ci skip]	2025-02-03 09:37:48 +01:00
Christine P. Chai	4278182dd0	Change Twitter to X (#13740 ) [ci skip] Co-authored-by: Ines Montani <ines@ines.io>	2025-02-03 09:30:21 +01:00
Matthew Honnibal	85cc763006	Fix python version requirement	2025-01-13 18:17:36 +01:00
Matthew Honnibal	ba7468e32e	Update requirements, fixing windows crashes (#13727 ) * Re-enable pretraining test * Require thinc 8.3.4 * Reformat * Re-enable test	2025-01-13 16:39:46 +01:00
Matthew Honnibal	311f7cc9fb	Set version to v3.8.4	2024-12-11 14:14:08 +01:00
Matthew Honnibal	682140496a	Align requirements better	2024-12-11 14:13:51 +01:00
Matthew Honnibal	343f4f21d7	Enable Python 3.13	2024-12-11 14:13:28 +01:00
Matthew Honnibal	be0fa812c2	Update cibuildwheel	2024-12-11 13:08:40 +01:00
Matthew Honnibal	a6317b3836	Fix allocation of non-transient strings in StringStore (#13713 ) * Fix bug in memory-zone code when adding non-transient strings. The error could result in segmentation faults or other memory errors during memory zones if new labels were added to the model. * Fix handling of new morphological labels within memory zones. Addresses second issue reported in Memory leak of MorphAnalysis object. #13684	2024-12-11 13:06:53 +01:00
Ines Montani	3e30b5bef6	Add spacy-layout [ci skip]	2024-11-19 10:43:40 +01:00
Matthew Honnibal	3ecec1324c	Usage page on memory management, explaining memory zones and doc_cleaner (#13643 ) [ci skip] Co-authored-by: Ines Montani <ines@ines.io>	2024-10-23 12:42:54 +02:00
Ikko Eltociear Ashimine	15fbf5ef36	docs: update rule-based-matching.mdx (#13665 ) [ci skip]	2024-10-23 12:07:01 +02:00
Sergei Pashakhin	1ee9a19059	Fix typo (#13657 ) [ci skip]	2024-10-23 12:06:36 +02:00
thjbdvlt	0d7e57fc3e	universe-pipeline-solipCysme-french (#13627 ) [ci skip]	2024-10-11 11:26:15 +02:00
Ines Montani	ae5c3e078d	Fix universe.json [ci skip]	2024-10-11 11:24:42 +02:00
Andrei (Andrey) Khropov	8d2902b0e7	Fix misspelling (#13631 ) [ci skip]	2024-10-11 11:23:12 +02:00
aravind-mc	44d1906453	Update universe.json to add my spaCy online course (#13632 ) [ci skip]	2024-10-11 11:21:57 +02:00
sam rxh	52a4cb0d14	Fix 'issue template' link in CONTRIBUTING.md (#13587 ) [ci skip]	2024-10-11 11:20:34 +02:00
Ines Montani	10a6f508ab	Fix landing banner links [ci skip]	2024-10-11 11:19:10 +02:00
Matthew Honnibal	bda4bb0184	Try disabling pretraining tests to probe windows ci failure (#13646 )	2024-10-02 01:01:40 +02:00
Matthew Honnibal	628c973db5	Note minimum python requirement in setup.cfg	2024-10-02 00:49:09 +02:00
Matthew Honnibal	e0782c5e4c	Merge branch 'master' into v3.8.x	2024-10-01 23:57:48 +02:00
Matthew Honnibal	5230754986	Fix thinc dependency	2024-10-01 23:49:17 +02:00
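
Commit 5bebbf7550 in the log above describes two techniques: a module-level __getattr__ hook that keeps old import paths working after functions move, and a registry that populates itself on first access instead of at import time. The following is a minimal sketch of both ideas in plain Python, not spaCy's actual code. The module names (demo_factories, demo_old_location), the make_tagger function, the LazyRegistry class and the "demo.tagger.v1" key are all hypothetical; only the name ensure_populated is taken from the commit message.

```python
import importlib
import sys
import types

# --- New home of the function (hypothetical module "demo_factories") ---
new_mod = types.ModuleType("demo_factories")


def make_tagger(name):
    """Stand-in for a factory function that was moved to a new module."""
    return f"tagger:{name}"


new_mod.make_tagger = make_tagger
sys.modules["demo_factories"] = new_mod

# --- Old location keeps working via a module-level __getattr__ (PEP 562) ---
old_mod = types.ModuleType("demo_old_location")
_MOVED = {"make_tagger": "demo_factories"}  # attribute name -> new module


def _module_getattr(name):
    # Called only when normal attribute lookup on the old module fails.
    if name in _MOVED:
        target = importlib.import_module(_MOVED[name])
        return getattr(target, name)
    raise AttributeError(f"module 'demo_old_location' has no attribute {name!r}")


old_mod.__getattr__ = _module_getattr
sys.modules["demo_old_location"] = old_mod


# --- Registry that fills itself on first access, not at import time ---
class LazyRegistry:
    def __init__(self):
        self._funcs = {}
        self._populated = False

    def ensure_populated(self):
        # Idempotent: all registrations happen here, in one place, once.
        if self._populated:
            return
        self._populated = True
        self._funcs["demo.tagger.v1"] = make_tagger

    def get(self, key):
        self.ensure_populated()
        return self._funcs[key]


registry = LazyRegistry()
```

With this in place, `from demo_old_location import make_tagger` resolves to the function defined in the new module, and `registry.get("demo.tagger.v1")` triggers registration only at the moment of the first lookup.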


16249 Commits
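
Commit 7b1d6e58ff above argues that a simple dictionary lookup is enough to convert between 2- and 3-letter ISO language codes. The sketch below shows roughly what such a lookup could look like. It is an illustration only: the table names, the normalize_lang_code function and the small set of entries are assumptions, not spaCy's actual tables. The macrolanguage code "no" is deliberately left out so that the caller has to pick nb or nn, mirroring the concern raised in the commit message; how spaCy itself treats "no" is not stated in the log.

```python
from typing import Optional

# Illustrative subset only; a real table would cover far more languages.
ISO_639_1 = {"en", "de", "fr", "he", "id", "nb", "nn", "zh"}

# ISO 639-3 (3-letter) -> ISO 639-1 (2-letter)
ISO_639_3_TO_1 = {
    "eng": "en", "deu": "de", "fra": "fr", "heb": "he",
    "ind": "id", "nob": "nb", "nno": "nn", "zho": "zh",
}

# ISO 639-2/B bibliographic codes -> ISO 639-1
ISO_639_2B_TO_1 = {"ger": "de", "fre": "fr", "chi": "zh"}

# Deprecated ISO 639-1 codes -> current ISO 639-1
DEPRECATED_TO_1 = {"iw": "he", "in": "id"}


def normalize_lang_code(code: str) -> Optional[str]:
    """Return a 2-letter code for a known input, or None if unknown."""
    code = code.strip().lower()
    if code in ISO_639_1:
        return code
    for table in (ISO_639_3_TO_1, ISO_639_2B_TO_1, DEPRECATED_TO_1):
        if code in table:
            return table[code]
    return None
```

For example, `normalize_lang_code("ger")` and `normalize_lang_code("deu")` both give "de", while an unlisted code such as "no" gives None in this sketch.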