* Use cosine loss in Cloze multitask
* Fix char_embed for gpu
* Call resume_training for base model in train CLI
* Fix bilstm_depth default in pretrain command
* Implement character-based pretraining objective
* Use chars loss in ClozeMultitask
* Add method to decode predicted characters
* Fix number characters
* Rescale gradients for mlm
* Fix char embed+vectors in ml
* Fix pipes
* Fix pretrain args
* Move get_characters_loss
* Fix import
* Fix import
* Mention characters loss option in pretrain
* Remove broken 'self attention' option in pretrain
* Revert "Remove broken 'self attention' option in pretrain"
This reverts commit 56b820f6af.
* Document 'characters' objective of pretrain
* entity linker training example: model loading changed according to issue 5668 (https://github.com/explosion/spaCy/issues/5668) + vocab_path is a required argument
* contributor agreement
Very minor fix in docs, specifically in this part:
```
matcher = PhraseMatcher(nlp.vocab)
for doc in matcher.pipe(texts, batch_size=50):
    pass
```
`texts` suggests the input is an iterable of strings, so I replaced it with `docs`.
* Convert custom user_data to token extension format
Convert the user_data values so that they can be loaded as custom token
extensions for `inflection`, `reading_form`, `sub_tokens`, and `lemma`.
* Reset Underscore state in ja tokenizer tests
Move `Lemmatizer.is_base_form` to the language settings so that each
language can provide a language-specific method as
`LanguageDefaults.is_base_form`.
The existing English-specific `Lemmatizer.is_base_form` is moved to
`EnglishDefaults`.
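A minimal sketch of the idea, assuming simplified stand-in classes rather than the real spaCy ones: the lemmatizer receives the base-form check from the language defaults instead of hard-coding the English logic. The rule inside `EnglishDefaults.is_base_form` and the `skip_rules` method are illustrative assumptions only.

```python
# Hypothetical, simplified stand-ins; not spaCy's actual classes or rules.
class LanguageDefaults:
    @classmethod
    def is_base_form(cls, univ_pos, morphology=None):
        # Default: never assume the token is already in its base form.
        return False


class EnglishDefaults(LanguageDefaults):
    @classmethod
    def is_base_form(cls, univ_pos, morphology=None):
        morphology = morphology or {}
        # Illustrative rule only: treat a singular noun as its own lemma.
        return univ_pos == "noun" and morphology.get("Number") == "sing"


class Lemmatizer:
    def __init__(self, is_base_form):
        # The language-specific check is injected by the caller.
        self.is_base_form = is_base_form

    def skip_rules(self, univ_pos, morphology=None):
        # Hypothetical method: True means no lemmatization rules are needed.
        return self.is_base_form(univ_pos, morphology)


english = Lemmatizer(EnglishDefaults.is_base_form)
generic = Lemmatizer(LanguageDefaults.is_base_form)
```

With this layout, another language could override `is_base_form` on its own defaults class without touching the lemmatizer.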
* Fix typos and auto-format [ci skip]
* Add pkuseg warnings and auto-format [ci skip]
* Update Binder URL [ci skip]
* Update Binder version [ci skip]
* Update alignment example for new gold.align
* Update POS in tagging example
* Fix numpy.zeros() dtype for Doc.from_array
* Change example title to Dr.
Change example title to Dr. so the current model does exclude the title
in the initial example.
* Fix spacy convert argument
* Warning for sudachipy 0.4.5 (#5611)
* Create myavrum.md (#5612)
* Update lex_attrs.py (#5608)
* Create mahnerak.md (#5615)
* Some changes for Armenian (#5616)
* Fixing numericals
* We need an Armenian question mark to make the sentence a question
* Add Nepali Language (#5622)
* added support for nepali lang
* added examples and test files
* added spacy contributor agreement
* Japanese model: add user_dict entries and small refactor (#5573)
* user_dict fields: adding inflections, reading_forms, sub_tokens
deleting: unidic_tags
improve code readability around the token alignment procedure
* add test cases, replace fugashi with sudachipy in conftest
* move bunsetu.py to spaCy Universe as a pipeline component BunsetuRecognizer
* tag is space -> both surface and tag are spaces
* consider len(text)==0
* Add warnings example in v2.3 migration guide (#5627)
* contribute (#5632)
* Fix polarity of Token.is_oov and Lexeme.is_oov (#5634)
Fix `Token.is_oov` and `Lexeme.is_oov` so they return `True` when the
lexeme does **not** have a vector.
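The corrected polarity can be illustrated with a small sketch. It uses a plain dictionary as a hypothetical stand-in for the vector table and a hypothetical free function `is_oov`; it is not spaCy's implementation.

```python
# Hypothetical stand-in for a vector table; not spaCy's actual data structure.
vectors = {"apple": [0.1, 0.2], "banana": [0.3, 0.4]}


def is_oov(word, vector_table):
    """Return True when the word does NOT have a vector."""
    return word not in vector_table


print(is_oov("apple", vectors))  # False: a vector exists
print(is_oov("qwzx", vectors))   # True: no vector, so out of vocabulary
```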
* Extend what's new in v2.3 with vocab / is_oov (#5635)
* Skip vocab in component config overrides (#5624)
* Fix backslashes in warnings config diff (#5640)
Fix backslashes in warnings config diff in v2.3 migration section.
* Disregard special tag _SP in check for new tag map (#5641)
* Skip special tag _SP in check for new tag map
In `Tagger.begin_training()` check for new tags aside from `_SP` in the
new tag map initialized from the provided gold tuples when determining
whether to reinitialize the morphology with the new tag map.
* Simplify _SP check
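The check described above could look roughly like the following sketch. The helper name `needs_morphology_reinit` is hypothetical and the tag map is modeled as a plain dictionary; the actual logic lives inside `Tagger.begin_training()` and may differ.

```python
def needs_morphology_reinit(new_tag_map):
    # Hypothetical helper: only tags other than the special "_SP" tag
    # count as a reason to reinitialize the morphology.
    return bool(set(new_tag_map) - {"_SP"})


print(needs_morphology_reinit({"_SP": {}}))            # False
print(needs_morphology_reinit({"_SP": {}, "NN": {}}))  # True
```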
Co-authored-by: Ines Montani <ines@ines.io>
Co-authored-by: Marat M. Yavrumyan <myavrum@ysu.am>
Co-authored-by: Karen Hambardzumyan <mahnerak@gmail.com>
Co-authored-by: Rameshh <30867740+rameshhpathak@users.noreply.github.com>
Co-authored-by: Hiroshi Matsuda <40782025+hiroshi-matsuda-rit@users.noreply.github.com>
Co-authored-by: Richard Liaw <rliaw@berkeley.edu>