spaCy

mirror of https://github.com/explosion/spaCy.git synced 2025-07-25 15:39:46 +03:00

Author	SHA1	Message	Date
Jeff Adolphe	41e07772dc	Added Haitian Creole (ht) Language Support to spaCy (#13807 ) This PR adds official support for Haitian Creole (ht) to spaCy's spacy/lang module. It includes: Added all core language data files for spacy/lang/ht: tokenizer_exceptions.py punctuation.py lex_attrs.py syntax_iterators.py lemmatizer.py stop_words.py tag_map.py Unit tests for tokenizer and noun chunking (test_tokenizer.py, test_noun_chunking.py, etc.). Passed all 58 pytest spacy/tests/lang/ht tests that I've created. Basic tokenizer rules adapted for Haitian Creole orthography and informal contractions. Custom like_num attribute supporting Haitian number formats (e.g., "3yèm"). Support for common informal apostrophe usage (e.g., "m'ap", "n'ap", "di'm"). Ensured no breakages in other language modules. Followed spaCy coding style (PEP8, Black). This provides a foundation for Haitian Creole NLP development using spaCy.	2025-05-28 17:23:38 +02:00
BLKSerene	7b1d6e58ff	Remove dependency on langcodes (#13760 ) This PR removes the dependency on langcodes introduced in #9342. While the introduction of langcodes allows a significantly wider range of language codes, there are some unexpected side effects: zh-Hant (Traditional Chinese) should be mapped to zh instead of None, as spaCy's Chinese model is based on pkuseg which supports tokenization of both Simplified and Traditional Chinese. Since it is possible that spaCy may have a model for Norwegian Nynorsk in the future, mapping no (macrolanguage Norwegian) to nb (Norwegian Bokmål) might be misleading. In that case, the user should be asked to specify nb or nn (Norwegian Nynorsk) specifically or consult the doc. Same as above for regional variants of languages such as en_gb and en_us. Overall, IMHO, introducing an extra dependency just for the conversion of language codes is overkill. It is possible that most users just need the conversion between 2/3-letter ISO codes and a simple dictionary lookup should suffice. With this PR, ISO 639-1 and ISO 639-3 codes are supported. ISO 639-2/B (bibliographic codes which are not favored and used in ISO 639-3) and deprecated ISO 639-1/2 codes are also supported to maximize backward compatibility.	2025-05-28 17:21:46 +02:00
Matthew Honnibal	864c2f3b51	Format	2025-05-28 17:06:11 +02:00
Matthew Honnibal	75a9d9b9ad	Test and fix issue13769	2025-05-28 17:04:23 +02:00
d0ngw	46613e27cf	fix: match hyphenated words to lemmas in index_table (e.g. "co-authored" -> "co-author") (#13816 )	2025-05-27 01:20:26 +02:00
omahs	b205ff65e6	fix typos (#13813 )	2025-05-26 16:05:29 +02:00
Matthew Honnibal	d08f4e3b10	Increment version	2025-05-22 13:58:00 +02:00
Matthew Honnibal	6036f344d3	Remove print statements	2025-05-22 13:56:31 +02:00
Matthew Honnibal	5bebbf7550	Python 3.13 support (#13823 ) In order to support Python 3.13, we had to migrate to Cython 3.0. This caused some tricky interaction with our Pydantic usage, because Cython 3 uses the from __future__ import annotations semantics, which causes type annotations to be saved as strings. The end result is that we can't have Language.factory decorated functions in Cython modules anymore, as the Language.factory decorator expects to inspect the signature of the functions and build a Pydantic model. If the function is implemented in Cython, an error is raised because the type is not resolved. To address this I've moved the factory functions into a new module, spacy.pipeline.factories. I've added __getattr__ importlib hooks to the previous locations, in case anyone was importing these functions directly. The change should have no backwards compatibility implications. Along the way I've also refactored the registration of functions for the config. Previously these ran as import-time side-effects, using the registry decorator. I've created instead a new module spacy.registrations. When the registry is accessed it calls a function ensure_populated(), which causes the registrations to occur. I've made a similar change to the Language.factory registrations in the new spacy.pipeline.factories module. I want to remove these import-time side-effects so that we can speed up the loading time of the library, which can be especially painful on the CLI. I also find that I'm often working to track down the implementations of functions referenced by strings in the config. Having the registrations all happen in one place will make this easier. With these changes I've fortunately avoided the need to migrate to Pydantic v2 properly --- we're still using the v1 compatibility shim. We might not be able to hold out forever though: Pydantic (reasonably) aren't actively supporting the v1 shims. I put a lot of work into v2 migration when investigating the 3.13 support, and it's definitely challenging. In any case, it's a relief that we don't have to do the v2 migration at the same time as the Cython 3.0/Python 3.13 support.	2025-05-22 13:47:21 +02:00
Matthew Honnibal	911539e9a4	Update version	2025-05-18 12:18:38 +02:00
Matthew Honnibal	d0c705cbc9	Increment version	2025-04-01 09:40:59 +02:00
Matthew Honnibal	ba7468e32e	Update requirements, fixing windows crashes (#13727 ) * Re-enable pretraining test * Require thinc 8.3.4 * Reformat * Re-enable test	2025-01-13 16:39:46 +01:00
Matthew Honnibal	311f7cc9fb	Set version to v3.8.4	2024-12-11 14:14:08 +01:00
Matthew Honnibal	a6317b3836	Fix allocation of non-transient strings in StringStore (#13713 ) * Fix bug in memory-zone code when adding non-transient strings. The error could result in segmentation faults or other memory errors during memory zones if new labels were added to the model. * Fix handling of new morphological labels within memory zones. Addresses second issue reported in Memory leak of MorphAnalysis object. #13684	2024-12-11 13:06:53 +01:00
Andrei (Andrey) Khropov	8d2902b0e7	Fix misspelling (#13631 ) [ci skip]	2024-10-11 11:23:12 +02:00
Matthew Honnibal	bda4bb0184	Try disabling pretraining tests to probe windows ci failure (#13646 )	2024-10-02 01:01:40 +02:00
Matthew Honnibal	0cdcfe56cb	Set version to v3.8.2	2024-10-01 16:47:24 +02:00
Matthew Honnibal	9c5b61bdff	isort	2024-10-01 12:38:51 +02:00
Matthew Honnibal	725ccbac39	Format	2024-10-01 12:38:02 +02:00
Matthew Honnibal	a8837beab7	Set version to v3.8.1	2024-10-01 12:37:11 +02:00
Matthew Honnibal	114b4894fb	Fix --require-parent default	2024-09-29 15:50:31 +02:00
Matthew Honnibal	dec13b4258	Fix inverted cli arg	2024-09-29 15:50:05 +02:00
Matthew Honnibal	c03f060527	Allow positive option --require-parent	2024-09-29 14:30:14 +02:00
Matthew Honnibal	6255cb985f	Include version constraint in parent package requirement	2024-09-29 14:22:21 +02:00
Matthew Honnibal	3b165a8716	Simplify setting to require parent package	2024-09-29 14:19:10 +02:00
Matthew Honnibal	969832f5d6	Fix package	2024-09-29 14:00:11 +02:00
Matthew Honnibal	8ce53a6bbe	Syntax	2024-09-29 13:51:44 +02:00
Matthew Honnibal	6fa0d709d5	Support option to not depend on parent package in spacy package	2024-09-29 13:51:04 +02:00
Matthew Honnibal	5010fcbd3a	Fix numpy constant	2024-09-14 13:13:11 +02:00
Matthew Honnibal	de4f19f3a3	Fix version	2024-09-14 13:12:44 +02:00
Matthew Honnibal	3d03565498	Replace numpy floats in evaluate and update	2024-09-14 12:55:53 +02:00
Matthew Honnibal	0576a1ff56	Fix numpy floats in meta.json	2024-09-14 12:54:08 +02:00
Matthew Honnibal	2f1e7ed09a	Lint	2024-09-14 11:36:27 +02:00
Matthew Honnibal	e2dc9b79e1	Format	2024-09-14 11:29:40 +02:00
Matthew Honnibal	3c3d75015b	Set version to v3.7.7	2024-09-14 11:27:32 +02:00
Matthew Honnibal	50aa3b5cbe	Merge branch 'master' of https://github.com/explosion/spaCy	2024-09-14 11:09:44 +02:00
Matthew Honnibal	8266031454	Merge numpy version update	2024-09-14 11:08:35 +02:00
Matthew Honnibal	69ecb85fad	Set version to v3.8.1	2024-09-13 10:43:40 +02:00
Matthew Honnibal	b427597fc8	Set version to v3.8.0	2024-09-11 21:32:26 +02:00
Matthew Honnibal	c068e1de1b	Fix dependencies	2024-09-11 15:57:52 +02:00
marinelay	b18cc94451	Delete unnecessary method (#13441 ) Co-authored-by: marinelay <marinelay@gmail.com>	2024-09-09 20:57:13 +02:00
Matthew Honnibal	4cc3ebe74e	Format	2024-09-09 20:56:01 +02:00
Matthew Honnibal	a019315534	Fix memory zones	2024-09-09 13:49:41 +02:00
Matthew Honnibal	59ac7e6bdb	Format	2024-09-09 11:22:52 +02:00
Matthew Honnibal	b65491b641	Set version to v3.8.0.dev0	2024-09-09 11:20:23 +02:00
Matthew Honnibal	1b8d560d0e	Support 'memory zones' for user memory management (#13621 ) Add a context manager nlp.memory_zone(), which will begin memory_zone() blocks on the vocab, string store, and potentially other components. Example usage: ``` with nlp.memory_zone(): for doc in nlp.pipe(texts): do_something(doc) # do_something(doc) <-- Invalid ``` Once the memory_zone() block expires, spaCy will free any shared resources that were allocated for the text-processing that occurred within the memory_zone. If you create Doc objects within a memory zone, it's invalid to access them once the memory zone is expired. The purpose of this is that spaCy creates and stores Lexeme objects in the Vocab that can be shared between multiple Doc objects. It also interns strings. Normally, spaCy can't know when all Doc objects using a Lexeme are out-of-scope, so new Lexemes accumulate in the vocab, causing memory pressure. Memory zones solve this problem by telling spaCy "okay none of the documents allocated within this block will be accessed again". This lets spaCy free all new Lexeme objects and other data that were created during the block. The mechanism is general, so memory_zone() context managers can be added to other components that could benefit from them, e.g. pipeline components. I experimented with adding memory zone support to the tokenizer as well, for its cache. However, this seems unnecessarily complicated. It makes more sense to just stick a limit on the cache size. This lets spaCy benefit from the efficiency advantage of the cache better, because we can maintain a (bounded) cache even if only small batches of documents are being processed.	2024-09-09 11:19:39 +02:00
ykyogoku	608f65ce40	add Tibetan (#13510 )	2024-09-09 11:18:03 +02:00
Muzaffer Cikay	acbf2a428f	Add Kurdish Kurmanji language (#13561 ) * Add Kurdish Kurmanji language * Add lex_attrs	2024-09-09 11:15:40 +02:00
Mark Liberko	55db9c2e87	Added gd language folder (#13570 ) Implemented a foundational Scottish Gaelic (gd) language option with tokenizer_exceptions and stop_words files.	2024-09-09 11:14:09 +02:00
Matthew Honnibal	319e02545c	Set version to 3.7.6	2024-08-20 12:16:08 +02:00
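
Commit 7b1d6e58ff above argues that "a simple dictionary lookup should suffice" for converting between 2- and 3-letter ISO codes. The sketch below illustrates that idea only. The table contents, the SUPPORTED set and the function name normalize_lang_code are hypothetical and are not taken from spaCy's code.

```python
from typing import Optional

# Hypothetical, deliberately tiny tables for illustration; not spaCy's data.
ISO_639_3_TO_1 = {
    "eng": "en",
    "deu": "de",
    "fra": "fr",
    "zho": "zh",
    "nob": "nb",
    "nno": "nn",
}
# ISO 639-2/B (bibliographic) codes that differ from their ISO 639-3 form.
ISO_639_2B_TO_1 = {"ger": "de", "fre": "fr", "chi": "zh"}
# Two-letter codes this sketch treats as already normalized.
SUPPORTED = {"en", "de", "fr", "zh", "nb", "nn"}


def normalize_lang_code(code: str) -> Optional[str]:
    """Return a two-letter code for `code`, or None if it is not recognized."""
    code = code.strip().lower()
    if code in SUPPORTED:
        return code
    # Try the 3-letter tables in turn; .get() returns None on a miss.
    return ISO_639_3_TO_1.get(code) or ISO_639_2B_TO_1.get(code)
```

In this sketch the macrolanguage code "no" is intentionally absent from every table, so normalize_lang_code("no") returns None rather than silently choosing nb, in line with the concern raised in the commit message.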

9438 Commits
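
Commit 5bebbf7550 above describes replacing import-time registration side effects with a registry that calls ensure_populated() when it is first accessed. A minimal sketch of that pattern follows, using only the standard library. The class LazyRegistry, the helper _register_all and the entry names are hypothetical and do not reproduce spaCy's spacy.registrations module.

```python
from typing import Callable, Dict


class LazyRegistry:
    """Registry whose entries are registered on first access, not at import."""

    def __init__(self, populate: Callable[["LazyRegistry"], None]) -> None:
        self._populate = populate
        self._functions: Dict[str, Callable] = {}
        self._populated = False

    def register(self, name: str, func: Callable) -> None:
        self._functions[name] = func

    def ensure_populated(self) -> None:
        # Run the registrations once; later calls are no-ops.
        if not self._populated:
            self._populated = True
            self._populate(self)

    def get(self, name: str) -> Callable:
        self.ensure_populated()
        return self._functions[name]


def _register_all(registry: LazyRegistry) -> None:
    # All registrations live in one place, so a string name used in a
    # config can be traced back to its implementation here.
    registry.register("lowercase.v1", str.lower)
    registry.register("uppercase.v1", str.upper)


# Creating the registry does no registration work yet.
registry = LazyRegistry(_register_all)
```

Calling registry.get("lowercase.v1") triggers _register_all on first use, so merely importing the module stays cheap, which is the loading-time benefit the commit message is after.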