* Added Italian POS-aware lemmatizer.
Also added the code used to build the lookup tables by POS.
* Create gtoffoli.md
* Add imports and format
* Remove helper script
* Use lemma_lookup instead of lemma_lookup_legacy
Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
* Update Catalan language data
Update Catalan language data based on contributions from the Text Mining
Unit at the Barcelona Supercomputing Center:
https://github.com/TeMU-BSC/spacy4release/tree/main/lang_data
* Update tokenizer settings for UD Catalan AnCora
Update for UD Catalan AnCora v2.7 with merged multi-word tokens.
* Update test
* Move prefix patternt to more generic infix pattern
* Clean up
For the Russian and Ukrainian lemmatizers, restrict the `pymorphy2`
requirement to the mode `pymorphy2` so that lookup or other lemmatizer
modes can be loaded without installing `pymorphy2`.
* "y" etc.
Many changes described in pull request
* Update spacy/lang/fr/stop_words.py
* Update spacy/lang/fr/stop_words.py
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
* Add all symbols in Unicode Currency Symbols block
In #8102 it came up that the rupee symbol was treated different from
dollar / euro / yen symbols. This adds many symbols not already
included.
* Fix test
* Fix training test
* Adapt tokenization methods from `pyvi` to preserve text encoding and
whitespace
* Add serialization support similar to Chinese and Japanese
Note: as for Chinese and Japanese, some settings are duplicated in
`config.cfg` and `tokenizer/cfg`.
* To allow default lookup lemmatization with a blank Russian model,
rename pymorphy2 lookup mode to `pymorphy2_lookup`
* Bug fix: update pymorphy2 lookup lemmatize to return list rather than
string
Fix class variable and init for `UkrainianLemmatizer` so that it loads
the `uk` dictionaries rather than having the parent `RussianLemmatizer`
override with the `ru` settings.
* Initial Spanish lemmatizer
* Handle merged verb+pron(s) multi-word tokens
* Use VERB for AUX rule lookup
* Add morph to lemma cache key
* Fix aux lookups, minor refactoring
* Improve verb+pron handling
* Move verb+pron handling into its own method
* Check for exceptions (primarily for se)
* Collect pronouns in the same (not reversed) order
* Only add modified possible lemmas
* raise NotImplementedError when noun_chunks iterator is not implemented
* bring back, fix and document span.noun_chunks
* formatting
Co-authored-by: Matthew Honnibal <honnibal+gh@gmail.com>