Move `Lemmatizer.is_base_form` to the language settings so that each
language can provide a language-specific method as
`LanguageDefaults.is_base_form`.
The existing English-specific `Lemmatizer.is_base_form` is moved to
`EnglishDefaults`.
* Skip special tag _SP in check for new tag map
In `Tagger.begin_training()` check for new tags aside from `_SP` in the
new tag map initialized from the provided gold tuples when determining
whether to reinitialize the morphology with the new tag map.
* Simplify _SP check
* user_dict fields: adding inflections, reading_forms, sub_tokens
deleting: unidic_tags
improve code readability around the token alignment procedure
* add test cases, replace fugashi with sudachipy in conftest
* move bunsetu.py to spaCy Universe as a pipeline component BunsetuRecognizer
* tag is space -> both surface and tag are spaces
* consider len(text)==0
* Update website models for v2.3.0
* Add docs for Chinese word segmentation
* Tighten up Chinese docs section
* Merge branch 'master' into docs/v2.3.0 [ci skip]
* Merge branch 'master' into docs/v2.3.0 [ci skip]
* Auto-format and update version
* Update matcher.md
* Update languages and sorting
* Typo in landing page
* Infobox about token_match behavior
* Add meta and basic docs for Japanese
* POS -> TAG in models table
* Add info about lookups for normalization
* Updates to API docs for v2.3
* Update adding norm exceptions for adding languages
* Add --omit-extra-lookups to CLI API docs
* Add initial draft of "What's New in v2.3"
* Add new in v2.3 tags to Chinese and Japanese sections
* Add tokenizer to migration section
* Add new in v2.3 flags to init-model
* Typo
* More what's new in v2.3
Co-authored-by: Ines Montani <ines@ines.io>
* Fix warning message for lemmatization tables
* Add a warning when the `lexeme_norm` table is empty. (Given the
relatively lang-specific loading for `Lookups`, it seemed like too much
overhead to dynamically extract the list of languages, so for now it's
hard-coded.)
* Added Examples for Tamil Sentences
#### Description
This PR add example sentences for the Tamil language which were missing as per issue #1107
#### Type of Change
This is an enhancement.
* Accepting spaCy Contributor Agreement
* Signed on my behalf as an individual
* Fix warning message for lemmatization tables
* Add a warning when the `lexeme_norm` table is empty. (Given the
relatively lang-specific loading for `Lookups`, it seemed like too much
overhead to dynamically extract the list of languages, so for now it's
hard-coded.)
* Added Examples for Tamil Sentences
#### Description
This PR add example sentences for the Tamil language which were missing as per issue #1107
#### Type of Change
This is an enhancement.
* Accepting spaCy Contributor Agreement
* Signed on my behalf as an individual
* added setting for neighbour sentence in NEL
* added spaCy contributor agreement
* added multi sentence also for training
* made the try-except block smaller
* added setting for neighbour sentence in NEL
* added spaCy contributor agreement
* added multi sentence also for training
* made the try-except block smaller