spaCy

mirror of https://github.com/explosion/spaCy.git synced 2025-04-24 02:51:58 +03:00

Author	SHA1	Message	Date
Matthew Honnibal	7137a16df8	Mention characters loss option in pretrain	2020-07-02 22:31:55 +02:00
Matthew Honnibal	f85e41ca3e	Fix import	2020-07-02 22:29:55 +02:00
Matthew Honnibal	114df464fa	Fix import	2020-07-02 21:55:00 +02:00
Matthew Honnibal	cd2fa89d93	Move get_characters_loss	2020-07-02 21:45:59 +02:00
Matthew Honnibal	892f2552e0	Fix pretrain args	2020-07-02 20:42:02 +02:00
Matthew Honnibal	2f91722525	Fix pipes	2020-07-02 20:42:02 +02:00
Matthw Honnibal	07a3b9baed	Fix char embed+vectors in ml	2020-07-02 20:42:02 +02:00
Matthw Honnibal	fa8ad11158	Rescale gradients for mlm	2020-07-02 20:42:02 +02:00
Matthw Honnibal	f597992411	Fix number characters	2020-07-02 20:42:01 +02:00
Matthw Honnibal	29f8413095	Add method to decode predicted characters	2020-07-02 20:42:01 +02:00
Matthw Honnibal	cff1f5e48a	Use chars loss in ClozeMultitask	2020-07-02 20:42:01 +02:00
Matthw Honnibal	52c21159b0	Implement character-based pretraining objective	2020-07-02 20:42:01 +02:00
Matthw Honnibal	0104789b72	Fix bilstm_depth default in pretrain command	2020-07-02 20:42:01 +02:00
Matthw Honnibal	4fb31adde5	Call resume_training for base model in train CLI	2020-07-02 20:42:01 +02:00
Matthw Honnibal	3edb297649	Fix char_embed for gpu	2020-07-02 20:42:01 +02:00
Matthw Honnibal	6aef2edfb1	Use cosine loss in Cloze multitask	2020-07-02 20:42:00 +02:00
Adriane Boyd	a77c4c3465	Add strings and ENT_KB_ID to Doc serialization (#5691) * Add strings for all writeable Token attributes to `Doc.to/from_bytes()`. * Add ENT_KB_ID to default attributes.	2020-07-02 17:11:57 +02:00
Adriane Boyd	971826a96d	Include git commit in package and model meta (#5694) * Include git commit in package and model meta * Rewrite to read file in setup * Fix file handle	2020-07-02 17:10:27 +02:00
Matthew Honnibal	2d715451a2	Revert "Convert custom user_data to token extension format for Japanese tokenizer (#5652)" (#5665) This reverts commit `1dd38191ec`.	2020-06-29 14:34:15 +02:00
Adriane Boyd	1dd38191ec	Convert custom user_data to token extension format for Japanese tokenizer (#5652) * Convert custom user_data to token extension format Convert the user_data values so that they can be loaded as custom token extensions for `inflection`, `reading_form`, `sub_tokens`, and `lemma`. * Reset Underscore state in ja tokenizer tests	2020-06-29 14:20:26 +02:00
Adriane Boyd	167df42cb6	Move lemmatizer is_base_form to language settings (#5663) Move `Lemmatizer.is_base_form` to the language settings so that each language can provide a language-specific method as `LanguageDefaults.is_base_form`. The existing English-specific `Lemmatizer.is_base_form` is moved to `EnglishDefaults`.	2020-06-29 14:16:57 +02:00
PluieElectrique	90c7eb0e2f	Reduce memory usage of Lookup's BloomFilter (#5606) * Reduce memory usage of Lookup's BloomFilter * Remove extra Table update	2020-06-26 14:09:10 +02:00
Adriane Boyd	b7107ac89f	Disregard special tag _SP in check for new tag map (#5641) * Skip special tag _SP in check for new tag map In `Tagger.begin_training()` check for new tags aside from `_SP` in the new tag map initialized from the provided gold tuples when determining whether to reinitialize the morphology with the new tag map. * Simplify _SP check	2020-06-26 09:23:21 +02:00
Adriane Boyd	6fe6e761de	Skip vocab in component config overrides (#5624)	2020-06-23 23:21:11 +02:00
Adriane Boyd	d94e961f14	Fix polarity of Token.is_oov and Lexeme.is_oov (#5634) Fix `Token.is_oov` and `Lexeme.is_oov` so they return `True` when the lexeme does not have a vector.	2020-06-23 13:29:51 +02:00
Hiroshi Matsuda	150a39ccca	Japanese model: add user_dict entries and small refactor (#5573) * user_dict fields: adding inflections, reading_forms, sub_tokens deleting: unidic_tags improve code readability around the token alignment procedure * add test cases, replace fugashi with sudachipy in conftest * move bunsetu.py to spaCy Universe as a pipeline component BunsetuRecognizer * tag is space -> both surface and tag are spaces * consider len(text)==0	2020-06-22 14:32:25 +02:00
Rameshh	c34420794a	Add Nepali Language (#5622) * added support for nepali lang * added examples and test files * added spacy contributor agreement	2020-06-22 10:25:46 +02:00
Karen Hambardzumyan	66a4834e56	Some changes for Armenian (#5616) * Fixing numericals * We need a Armenian question sign to make the sentence a question	2020-06-22 08:50:34 +02:00
Marat M. Yavrumyan	8120b641cc	Update lex_attrs.py (#5608)	2020-06-19 20:00:34 +02:00
Ines Montani	e9d3e177f0	Merge branch 'master' into v2.3.x	2020-06-16 16:31:38 +02:00
Matthew Honnibal	7ff447c5a0	Set version to v2.3.0	2020-06-15 18:22:25 +02:00
Adriane Boyd	0d8405aafa	Updates to docstrings (#5589)	2020-06-15 14:58:36 +02:00
Adriane Boyd	e867e9fa8f	Fix and add warnings related to spacy-lookups-data (#5588) * Fix warning message for lemmatization tables * Add a warning when the `lexeme_norm` table is empty. (Given the relatively lang-specific loading for `Lookups`, it seemed like too much overhead to dynamically extract the list of languages, so for now it's hard-coded.)	2020-06-15 14:58:29 +02:00
Arvind Srinivasan	f698007907	Added Tamil Example Sentences (#5583) * Added Examples for Tamil Sentences #### Description This PR add example sentences for the Tamil language which were missing as per issue #1107 #### Type of Change This is an enhancement. * Accepting spaCy Contributor Agreement * Signed on my behalf as an individual	2020-06-15 14:58:21 +02:00
Adriane Boyd	c94f7d0e75	Updates to docstrings (#5589)	2020-06-15 14:56:51 +02:00
Adriane Boyd	c482f20778	Fix and add warnings related to spacy-lookups-data (#5588) * Fix warning message for lemmatization tables * Add a warning when the `lexeme_norm` table is empty. (Given the relatively lang-specific loading for `Lookups`, it seemed like too much overhead to dynamically extract the list of languages, so for now it's hard-coded.)	2020-06-15 14:56:04 +02:00
Arvind Srinivasan	aa5b40fa64	Added Tamil Example Sentences (#5583) * Added Examples for Tamil Sentences #### Description This PR add example sentences for the Tamil language which were missing as per issue #1107 #### Type of Change This is an enhancement. * Accepting spaCy Contributor Agreement * Signed on my behalf as an individual	2020-06-13 15:56:26 +02:00
theudas	3f5e2f9d99	Added Parameter to NEL to take n sentences into account (#5548) * added setting for neighbour sentence in NEL * added spaCy contributor agreement * added multi sentence also for training * made the try-except block smaller	2020-06-12 15:15:03 +02:00
adrianeboyd	4724fa4cf4	Expand Japanese requirements warning (#5572) Include explicit install instructions in Japanese requirements warning.	2020-06-12 15:14:55 +02:00
adrianeboyd	44967a3f9c	Update pytest conf for sudachipy with Japanese (#5574)	2020-06-12 15:14:47 +02:00
theudas	fa46e0bef2	Added Parameter to NEL to take n sentences into account (#5548) * added setting for neighbour sentence in NEL * added spaCy contributor agreement * added multi sentence also for training * made the try-except block smaller	2020-06-12 02:03:23 +02:00
adrianeboyd	556895177e	Expand Japanese requirements warning (#5572) Include explicit install instructions in Japanese requirements warning.	2020-06-11 13:47:37 +02:00
adrianeboyd	fe167fcf7d	Update pytest conf for sudachipy with Japanese (#5574)	2020-06-11 10:23:50 +02:00
Jones Martins	bab30e4ad2	Add "c'mon" token exception (#5570) * Add "c'mon" exception * Fix typo in "C'mon" exception	2020-06-10 21:54:06 +02:00
Jones Martins	28db7dd5d9	Add missing pronoums/determiners (#5569) * Add missing pronoums/determiners * Add test for missing pronoums * Add contributor file	2020-06-10 18:47:04 +02:00
adrianeboyd	0a70bd6281	Bump version to 2.3.0.dev1 (#5567)	2020-06-09 15:47:31 +02:00
adrianeboyd	b7e6e1b9a7	Disable sentence segmentation in ja tokenizer (#5566)	2020-06-09 12:00:59 +02:00
adrianeboyd	f162815f45	Handle empty and whitespace-only docs for Japanese (#5564) Handle empty and whitespace-only docs in the custom alignment method used by the Japanese tokenizer.	2020-06-08 21:09:23 +02:00
adrianeboyd	3bf111585d	Update Japanese tokenizer config and add serialization (#5562) * Use `config` dict for tokenizer settings * Add serialization of split mode setting * Add tests for tokenizer split modes and serialization of split mode setting Based on #5561	2020-06-08 16:29:05 +02:00
Hiroshi Matsuda	456bf47f51	fix a bug causing mis-alignments (#5560)	2020-06-08 15:49:34 +02:00

6872 Commits