spaCy

mirror of https://github.com/explosion/spaCy.git synced 2025-12-03 08:14:20 +03:00

Author	SHA1	Message	Date
Li Zhe	a69eb445dc	fix the wrong hash url in adding-languages.md file (#5810 ) * fix the wrong hash url in adding-languages.md file change the #101 url hash path to #language-data * filled in the spaCy Contributor Agreement filled in the spaCy Contributor Agreement	2020-07-25 13:13:38 +02:00
Adriane Boyd	19dc42776a	Remove hard-coded GPU ID from pretrain (#5808 )	2020-07-24 09:26:26 +02:00
Joshua Olson	6d4d5c074c	Mark Japanese documents as tagged. (#5803 ) Mark the document as tagged before returning it to the user from the JapaneseTokenizer. Fixes #5802	2020-07-23 08:57:01 +02:00
Adriane Boyd	038ff1a811	Improve warnings around normalization tables (#5794 ) Provide more customized normalization table warnings when training a new model. Only suggest installing `spacy-lookups-data` if it's not already installed and it includes a table for this language (currently checked in a hard-coded list).	2020-07-22 16:04:58 +02:00
Adriane Boyd	bf24f7f672	Update invalid tag maps (#5796 ) * Remove copy of (old?) PTB tag map for: bn, eu * Remove unsupported features from: hy, pl, ro, ru	2020-07-22 16:02:51 +02:00
Alec Chapman	a8978ca285	Add VA COVID-19 NLP project to spaCy Universe (#5777 ) * Update universe.json Add cov-bsv to "resources" * Update universe.json * add contributor agreement	2020-07-19 13:35:31 +02:00
Adriane Boyd	597bcc629e	Improve tag map initialization and updating (#5768 ) * Improve tag map initialization and updating Generalize tag map initialization and updating so that a provided tag map can be loaded correctly in the CLI. * normalize provided tag map as necessary * use the same method for initializing and overwriting the tag map * Reinitialize cache after loading new tag map Reinitialize the cache with the right size after loading a new tag map.	2020-07-19 11:13:39 +02:00
Adriane Boyd	7e14272096	Lower upper pin for cupy to 8.0.0 (#5773 )	2020-07-19 11:10:11 +02:00
Adriane Boyd	cd5af72c9a	Update pkuseg version (#5774 ) * Update pkuseg version in Chinese tokenizer warnings * Update pkuseg version in `Makefile` * Remove warning about python3.8 wheels in docs	2020-07-19 11:09:49 +02:00
Ines Montani	6f4e4aceb3	Add Plausible [ci skip]	2020-07-18 23:50:29 +02:00
Adriane Boyd	5228920e2f	Clarify warning W030 for misaligned BILUO tags (#5761 )	2020-07-14 14:09:48 +02:00
Adriane Boyd	7ea2cc7650	Set version to 2.3.2 (#5756 )	2020-07-13 14:55:56 +02:00
Mark Neumann	27a1cd3c63	fix meta serialization in train (#5751 ) Co-authored-by: Mark Neumann <markng@allenai.org>	2020-07-12 22:06:46 +02:00
Adriane Boyd	0a62098c5f	Fix lemmatizer is_base_form for python2.7 (#5734 ) * Fix lemmatizer init args for python2.7 * Move English is_base_form to a class method * Skip test pickling PhraseMatcher for python2	2020-07-09 22:11:24 +02:00
Adriane Boyd	923affd091	Remove is_base_form from French lemmatizer (#5733 ) Remove English-specific is_base_form from French lemmatizer.	2020-07-09 22:11:13 +02:00
Ines Montani	3d83721551	Merge pull request #5723 from gandersen101/fix-spaczz-universe-typo	2020-07-08 11:35:40 +02:00
gandersen101	893133873d	Fix quote issue in spaczz universe.json	2020-07-07 19:16:28 -05:00
Ines Montani	109849bd31	Fix and update universe.json [ci skip]	2020-07-07 21:12:28 +02:00
gandersen101	9097549227	Adding spaczz package to universe.json (#5717 ) * Adding spaczz package to universe.json * Adding contributor agreement.	2020-07-07 20:55:24 +02:00
Jonathan Besomi	546f3d10d4	Add texthero to universe.json (#5716 ) * Add texthero to universe.json * Add spaCy contributor Agreement	2020-07-07 20:54:22 +02:00
Mike Izbicki	7a2ca00794	fix bug in Korean language, resulting in 100x speedup by reducing overhead of mecab (#5701 ) * speed up Korean nlp 100x by stopping mecab from reloading on each doc * add contributor agreement * rename variables to improve code readability	2020-07-06 17:03:33 +02:00
graue70	9860b8399e	Fix typo in test function docstring (#5696 )	2020-07-05 15:49:06 +02:00
Matthew Honnibal	3e78e82a83	Experimental character-based pretraining (#5700 ) * Use cosine loss in Cloze multitask * Fix char_embed for gpu * Call resume_training for base model in train CLI * Fix bilstm_depth default in pretrain command * Implement character-based pretraining objective * Use chars loss in ClozeMultitask * Add method to decode predicted characters * Fix number characters * Rescale gradients for mlm * Fix char embed+vectors in ml * Fix pipes * Fix pretrain args * Move get_characters_loss * Fix import * Fix import * Mention characters loss option in pretrain * Remove broken 'self attention' option in pretrain * Revert "Remove broken 'self attention' option in pretrain" This reverts commit `56b820f6af`. * Document 'characters' objective of pretrain	2020-07-05 15:48:39 +02:00
Adriane Boyd	86d13a9fb8	Set version to 2.3.1 (#5705 )	2020-07-03 13:38:41 +02:00
Matthias Hertel	2fb9bd795d	Fixed vocabulary in the entity linker training example (#5676 ) * entity linker training example: model loading changed according to issue 5668 (https://github.com/explosion/spaCy/issues/5668) + vocab_path is a required argument * contributor agreement	2020-07-03 10:24:02 +02:00
Adriane Boyd	a77c4c3465	Add strings and ENT_KB_ID to Doc serialization (#5691 ) * Add strings for all writeable Token attributes to `Doc.to/from_bytes()`. * Add ENT_KB_ID to default attributes.	2020-07-02 17:11:57 +02:00
Adriane Boyd	971826a96d	Include git commit in package and model meta (#5694 ) * Include git commit in package and model meta * Rewrite to read file in setup * Fix file handle	2020-07-02 17:10:27 +02:00
Adriane Boyd	2bd78c39e3	Fix multiple context manages in examples (#5690 )	2020-07-02 10:36:07 +02:00
Ines Montani	6bc643d2e2	Update netlify.toml [ci skip]	2020-07-01 21:34:17 +02:00
Ines Montani	f2a932a60c	Update netlify.toml [ci skip]	2020-07-01 13:34:35 +02:00
Álvaro Abella Bascarán	ff0dbe5c64	Fix in docs: pipe(docs) instead of pipe(texts) (#5680 ) Very minor fix in docs, specifically in this part: ``` matcher = PhraseMatcher(nlp.vocab) > for doc in matcher.pipe(texts, batch_size=50): > pass ``` `texts` suggests the input is an iterable of strings. I replaced it for `docs`.	2020-06-30 20:00:50 +02:00
Matthias Hertel	8b0f749606	Website: fixed the token span in the text about the rule-based matching example (#5669 ) * fixed token span in pattern matcher example * contributor agreement	2020-06-30 19:58:23 +02:00
Matthew Honnibal	2d715451a2	Revert "Convert custom user_data to token extension format for Japanese tokenizer (#5652 )" (#5665 ) This reverts commit `1dd38191ec`.	2020-06-29 14:34:15 +02:00
Adriane Boyd	1dd38191ec	Convert custom user_data to token extension format for Japanese tokenizer (#5652 ) * Convert custom user_data to token extension format Convert the user_data values so that they can be loaded as custom token extensions for `inflection`, `reading_form`, `sub_tokens`, and `lemma`. * Reset Underscore state in ja tokenizer tests	2020-06-29 14:20:26 +02:00
Adriane Boyd	167df42cb6	Move lemmatizer is_base_form to language settings (#5663 ) Move `Lemmatizer.is_base_form` to the language settings so that each language can provide a language-specific method as `LanguageDefaults.is_base_form`. The existing English-specific `Lemmatizer.is_base_form` is moved to `EnglishDefaults`.	2020-06-29 14:16:57 +02:00
Adriane Boyd	c4d0209472	Extend v2.3 migration guide (#5653 ) * Extend preloaded vocab section * Add section on tag maps	2020-06-26 14:12:29 +02:00
PluieElectrique	90c7eb0e2f	Reduce memory usage of Lookup's BloomFilter (#5606 ) * Reduce memory usage of Lookup's BloomFilter * Remove extra Table update	2020-06-26 14:09:10 +02:00
Adriane Boyd	b7107ac89f	Disregard special tag _SP in check for new tag map (#5641 ) * Skip special tag _SP in check for new tag map In `Tagger.begin_training()` check for new tags aside from `_SP` in the new tag map initialized from the provided gold tuples when determining whether to reinitialize the morphology with the new tag map. * Simplify _SP check	2020-06-26 09:23:21 +02:00
Adriane Boyd	fd4287c178	Fix backslashes in warnings config diff (#5640 ) Fix backslashes in warnings config diff in v2.3 migration section.	2020-06-24 10:26:12 +02:00
Adriane Boyd	6fe6e761de	Skip vocab in component config overrides (#5624 )	2020-06-23 23:21:11 +02:00
Adriane Boyd	7ce451c211	Extend what's new in v2.3 with vocab / is_oov (#5635 )	2020-06-23 16:48:59 +02:00
Adriane Boyd	d94e961f14	Fix polarity of Token.is_oov and Lexeme.is_oov (#5634 ) Fix `Token.is_oov` and `Lexeme.is_oov` so they return `True` when the lexeme does not have a vector.	2020-06-23 13:29:51 +02:00
Richard Liaw	0ef78bad93	contribute (#5632 )	2020-06-23 08:53:58 +02:00
Adriane Boyd	bc1cb30b21	Add warnings example in v2.3 migration guide (#5627 )	2020-06-22 14:37:24 +02:00
Hiroshi Matsuda	150a39ccca	Japanese model: add user_dict entries and small refactor (#5573 ) * user_dict fields: adding inflections, reading_forms, sub_tokens deleting: unidic_tags improve code readability around the token alignment procedure * add test cases, replace fugashi with sudachipy in conftest * move bunsetu.py to spaCy Universe as a pipeline component BunsetuRecognizer * tag is space -> both surface and tag are spaces * consider len(text)==0	2020-06-22 14:32:25 +02:00
Rameshh	c34420794a	Add Nepali Language (#5622 ) * added support for nepali lang * added examples and test files * added spacy contributor agreement	2020-06-22 10:25:46 +02:00
Karen Hambardzumyan	66a4834e56	Some changes for Armenian (#5616 ) * Fixing numericals * We need a Armenian question sign to make the sentence a question	2020-06-22 08:50:34 +02:00
Karen Hambardzumyan	ff6a084e9c	Create mahnerak.md (#5615 )	2020-06-20 11:14:26 +02:00
Marat M. Yavrumyan	8120b641cc	Update lex_attrs.py (#5608 )	2020-06-19 20:00:34 +02:00
Marat M. Yavrumyan	ccd7edf04b	Create myavrum.md (#5612 )	2020-06-19 18:34:27 +02:00

11575 Commits