* Fix incorrect pickling of Japanese and Korean pipelines, which led to
the entire pipeline being reset if pickled
* Enable pickling of Vietnamese tokenizer
* Update tokenizer APIs for Chinese, Japanese, Korean, Thai, and
Vietnamese so that only the `Vocab` is required for initialization
* Add scorer option to components
Add an optional `scorer` parameter to all pipeline components. If a
scoring function is provided, it overrides the default scoring method
for that component.
* Add registered scorers for all components
* Add `scorers` registry
* Move all scoring methods outside of components as independent
functions and register
* Use the registered scoring methods as defaults in configs and inits
Additional:
* The scoring methods no longer have access to the full component, so
use settings from `cfg` as default scorer options to handle settings
such as `labels`, `threshold`, and `positive_label`
* The `attribute_ruler` scoring method no longer has access to the
patterns, so all scoring methods are called
* Bug fix: `spancat` scoring method is updated to set `allow_overlap` to
score overlapping spans correctly
* Update Russian lemmatizer to use direct score method
* Check type of cfg in Pipe.score
* Fix check
* Update spacy/pipeline/sentencizer.pyx
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
* Remove validate_examples from scoring functions
* Use Pipe.labels instead of Pipe.cfg["labels"]
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
* Add missing punctuation for Tigrinya and Amharic
* Fix numeral and ordinal numbers for Tigrinya
- Amharic was used in many cases
- Also fixed some typos
* Update Tigrinya stop-words
* Contributor agreement for fgaim
* Fix typo in "ti" lang test
* Remove multi-word entries from numbers and ordinals
* Add more stop words and Improve the readability
* Add and categorize the tokenizer exceptions for `bg` lang
* Create syrull.md
* Add references for the additional stop words and tokenizer exc abbrs
* Add ancient Greek language support
Initial commit
* Contributor Agreement
* grc tokenizer test added and files formatted with black, unnecessary import removed
Co-Authored-By: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
* Commas in lists fixed. __init__py added to test
* Update lex_attrs.py
* Update stop_words.py
* Update stop_words.py
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
* ✨ implement noun_chunks for dutch language
* copy/paste FR and SV syntax iterators to accomodate UD tags
* added tests with dutch text
* signed contributor agreement
* 🐛 fix noun chunks generator
* built from scratch
* define noun chunk as a single Noun-Phrase
* includes some corner cases debugging (incorrect POS tagging)
* test with provided annotated sample (POS, DEP)
* ✅ fix failing test
* CI pipeline did not like the added sample file
* add the sample as a pytest fixture
* Update spacy/lang/nl/syntax_iterators.py
* Update spacy/lang/nl/syntax_iterators.py
Code readability
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
* Update spacy/tests/lang/nl/test_noun_chunks.py
correct comment
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
* finalize code
* change "if next_word" into "if next_word is not None"
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
* Added Italian POS-aware lemmatizer.
Also added the code used to build the lookup tables by POS.
* Create gtoffoli.md
* Add imports and format
* Remove helper script
* Use lemma_lookup instead of lemma_lookup_legacy
Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
* Update Catalan language data
Update Catalan language data based on contributions from the Text Mining
Unit at the Barcelona Supercomputing Center:
https://github.com/TeMU-BSC/spacy4release/tree/main/lang_data
* Update tokenizer settings for UD Catalan AnCora
Update for UD Catalan AnCora v2.7 with merged multi-word tokens.
* Update test
* Move prefix patternt to more generic infix pattern
* Clean up
For the Russian and Ukrainian lemmatizers, restrict the `pymorphy2`
requirement to the mode `pymorphy2` so that lookup or other lemmatizer
modes can be loaded without installing `pymorphy2`.
* "y" etc.
Many changes described in pull request
* Update spacy/lang/fr/stop_words.py
* Update spacy/lang/fr/stop_words.py
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
* Add all symbols in Unicode Currency Symbols block
In #8102 it came up that the rupee symbol was treated different from
dollar / euro / yen symbols. This adds many symbols not already
included.
* Fix test
* Fix training test
* Adapt tokenization methods from `pyvi` to preserve text encoding and
whitespace
* Add serialization support similar to Chinese and Japanese
Note: as for Chinese and Japanese, some settings are duplicated in
`config.cfg` and `tokenizer/cfg`.
* To allow default lookup lemmatization with a blank Russian model,
rename pymorphy2 lookup mode to `pymorphy2_lookup`
* Bug fix: update pymorphy2 lookup lemmatize to return list rather than
string
Fix class variable and init for `UkrainianLemmatizer` so that it loads
the `uk` dictionaries rather than having the parent `RussianLemmatizer`
override with the `ru` settings.
* Initial Spanish lemmatizer
* Handle merged verb+pron(s) multi-word tokens
* Use VERB for AUX rule lookup
* Add morph to lemma cache key
* Fix aux lookups, minor refactoring
* Improve verb+pron handling
* Move verb+pron handling into its own method
* Check for exceptions (primarily for se)
* Collect pronouns in the same (not reversed) order
* Only add modified possible lemmas
* raise NotImplementedError when noun_chunks iterator is not implemented
* bring back, fix and document span.noun_chunks
* formatting
Co-authored-by: Matthew Honnibal <honnibal+gh@gmail.com>
* Update stop_words.py
Added three aditional stopwords: "a" and "o" that means "the", and "e" that means "and"
* Create cristianasp.md
* zero edit to push CI
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
* add syntax iterators for danish
* add test noun chunks for danish syntax iterators
* add contributor agreement
* update da syntax iterators to remove nested chunks
* add tests for da noun chunks
* Fix test
* add missing import
* fix example
* Prevent overlapping noun chunks
Prevent overlapping noun chunks by tracking the end index of the
previous noun chunk span.
Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
* Add Amharic to space
* clean up
* Add some PRON_LEMMA
* add Tigrinya support
* remove text_noun_chunks
* Tigrinya Support
* added some more details for ti
* fix unit test
* add amharic char range
* changes from review
* amharic and tigrinya share same unicode block
* get rid of _amharic/_tigrinya in char_classes
Co-authored-by: Josiah Solomon <jsolomon@meteorcomm.com>
* added single and paired orth variants
* added token match
* added long text tokenization test
* inverted init
* normalized lemmas to lowercase
* more abbrevs
* tests for ordinals and abbrevs
* separated period abbvrevs to another list
* fiex typo
* added ordinal and abbrev tests
* added number tests for dates
* minor refinement
* added inflected abbrevs regex
* added percentage and inflection
* cosmetics
* added token match
* added url inflection tests
* excluded url tokens from custom pattern
* removed url match import
* Add `cuda110` to setup.cfg and quickstart dropdown
* Switch to `pip` for pip-only packages in conda quickstart instructions
* Update zh pkuseg install message with version range and conda
* Remove `zh` from `extras_require` because the default doesn't require
additional packages
* Include Macedonian language
* Fix indentation at char_classes.py
* Fix indentation at char_classes.py
* Add Macedonian tests, update lex_attrs and char_classes
* Import unicode literals for python 2
* added tr_vocab to config
* basic test
* added syntax iterator to Turkish lang class
* first version for Turkish syntax iter, without flat
* added simple tests with nmod, amod, det
* more tests to amod and nmod
* separated noun chunks and parser test
* rearrangement after nchunk parser separation
* added recursive NPs
* tests with complicated recursive NPs
* tests with conjed NPs
* additional tests for conj NP
* small modification for shaving off conj from NP
* added tests with flat
* more tests with flat
* added examples with flats conjed
* added inner func for flat trick
* corrected parse
Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
* added tr_vocab to config
* basic test
* added syntax iterator to Turkish lang class
* first version for Turkish syntax iter, without flat
* added simple tests with nmod, amod, det
* more tests to amod and nmod
* separated noun chunks and parser test
* rearrangement after nchunk parser separation
* added recursive NPs
* tests with complicated recursive NPs
* tests with conjed NPs
* additional tests for conj NP
* small modification for shaving off conj from NP
* added tests with flat
* more tests with flat
* added examples with flats conjed
* added inner func for flat trick
* corrected parse
Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
* feat: added turkish tag map
* feat: morph rules cconj and sconj
* feat: more conjuncts
* feat: added popular postpositions
* feat: added adverbs
* feat: added personal pronouns
* feat: added reflexive pronouns
* minor: corrected case capital
* minor: fixed comma typo
* feat: added indef pronouns
* feat: added dict iter
* fixed comma typo
* updated language class with tag map and morph
* use default tag map instead
* removed tag map
* Hindi: Adds tests for lexical attributes (norm and like_num)
* Signs and sdds the contributor agreement
* Add ordinal numbers to be tagged as like_num
* Adds alternate pronunciation for 31 and 39