* Override language defaults for null token and URL match
When the serialized `token_match` or `url_match` is `None`, override the
language defaults to preserve `None` on deserialization.
* Fix fixtures in tests
* add syntax iterators for danish
* add test noun chunks for danish syntax iterators
* add contributor agreement
* update da syntax iterators to remove nested chunks
* add tests for da noun chunks
* Fix test
* add missing import
* fix example
* Prevent overlapping noun chunks
Prevent overlapping noun chunks by tracking the end index of the
previous noun chunk span.
Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
* Add Amharic to space
* clean up
* Add some PRON_LEMMA
* add Tigrinya support
* remove text_noun_chunks
* Tigrinya Support
* added some more details for ti
* fix unit test
* add amharic char range
* changes from review
* amharic and tigrinya share same unicode block
* get rid of _amharic/_tigrinya in char_classes
Co-authored-by: Josiah Solomon <jsolomon@meteorcomm.com>
* Only set NORM on Token in retokenizer
Instead of setting `NORM` on both the token and lexeme, set `NORM` only
on the token.
The retokenizer tries to set all possible attributes with
`Token/Lexeme.set_struct_attr` so that it doesn't have to enumerate
which attributes are available for each. `NORM` is the only attribute
that's stored on both and for most cases it doesn't make sense to set
the global norms based on a individual retokenization. For lexeme-only
attributes like `IS_STOP` there's no way to avoid the global side
effects, but I think that `NORM` would be better only on the token.
* Fix test
For the `DependencyMatcher`:
* Fix on_match callback so that it is called once per matched pattern
* Fix results so that patterns with empty match lists are not returned
Modify the internal pattern representation in `Matcher` patterns to
identify the final ID state using a unique quantifier rather than a
combination of other attributes.
It was insufficient to identify the final ID node based on an
uninitialized `quantifier` (coincidentally being the same as the `ZERO`)
with `nr_attr` as 0. (In addition, it was potentially bug-prone that
`nr_attr` was set to 0 even though attrs were allocated.)
In the case of `{"OP": "!"}` (a valid, if pointless, pattern), `nr_attr`
is 0 and the quantifier is ZERO, so the previous methods for
incrementing to the ID node at the end of the pattern weren't able to
distinguish the final ID node from the `{"OP": "!"}` pattern.
* added single and paired orth variants
* added token match
* added long text tokenization test
* inverted init
* normalized lemmas to lowercase
* more abbrevs
* tests for ordinals and abbrevs
* separated period abbvrevs to another list
* fiex typo
* added ordinal and abbrev tests
* added number tests for dates
* minor refinement
* added inflected abbrevs regex
* added percentage and inflection
* cosmetics
* added token match
* added url inflection tests
* excluded url tokens from custom pattern
* removed url match import
* Include Macedonian language
* Fix indentation at char_classes.py
* Fix indentation at char_classes.py
* Add Macedonian tests, update lex_attrs and char_classes
* Import unicode literals for python 2
* added tr_vocab to config
* basic test
* added syntax iterator to Turkish lang class
* first version for Turkish syntax iter, without flat
* added simple tests with nmod, amod, det
* more tests to amod and nmod
* separated noun chunks and parser test
* rearrangement after nchunk parser separation
* added recursive NPs
* tests with complicated recursive NPs
* tests with conjed NPs
* additional tests for conj NP
* small modification for shaving off conj from NP
* added tests with flat
* more tests with flat
* added examples with flats conjed
* added inner func for flat trick
* corrected parse
Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
* Hindi: Adds tests for lexical attributes (norm and like_num)
* Signs and sdds the contributor agreement
* Add ordinal numbers to be tagged as like_num
* Adds alternate pronunciation for 31 and 39
* Regression test for issue 6207
* Fix issue 6207
* Sign contributor agreement
* Minor adjustments to test
Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
* Update stop_words.py
Hebrew STOP WORDS
* Update stop_words.py
* contributor
* contributor
* add some common domain extentions
support human number 1K/1M....
* support human number 1K/1M....
* hebrew number tokenize
1K/1M implement in EN
* test human tokenize fix
* test
* heb like num
revert human number change
* heb like num
* Create lex_attrs.py
Hello,
I am missing a CZECH language in SpaCy. So I would like to help to push it a little. This file is base on others lex_attrs.py files just with translation to Czech.
* Update __init__.py
Updated for use with new Czech Lex_attrs file
* Update stop_words.py
* Create test_text.py
* add like_num testing for czech
Co-authored-by: holubvl3 <47881982+holubvl3@users.noreply.github.com>
Co-authored-by: holubvl3 <vilemrousi@gmail.com>
Co-authored-by: Vladimír Holubec <vholubec@arcdata.cz>
* Add a warning when a subpattern is not processed and discarded
* Normalize subpattern attribute/operator keys to upper case like
top-level attributes
* Allow Doc.char_span to snap to token boundaries
Add a `mode` option to allow `Doc.char_span` to snap to token
boundaries. The `mode` options:
* `strict`: character offsets must match token boundaries (default, same as
before)
* `inside`: all tokens completely within the character span
* `outside`: all tokens at least partially covered by the character span
Add a new helper function `token_by_char` that returns the token
corresponding to a character position in the text. Update
`token_by_start` and `token_by_end` to use `token_by_char` for more
efficient searching.
* Remove unused import
* Rename mode to alignment_mode
Rename `mode` to `alignment_mode` with the options
`strict`/`contract`/`expand`. Any unrecognized modes are silently
converted to `strict`.
* Convert custom user_data to token extension format
Convert the user_data values so that they can be loaded as custom token
extensions for `inflection`, `reading_form`, `sub_tokens`, and `lemma`.
* Reset Underscore state in ja tokenizer tests
Move `Lemmatizer.is_base_form` to the language settings so that each
language can provide a language-specific method as
`LanguageDefaults.is_base_form`.
The existing English-specific `Lemmatizer.is_base_form` is moved to
`EnglishDefaults`.
* Skip special tag _SP in check for new tag map
In `Tagger.begin_training()` check for new tags aside from `_SP` in the
new tag map initialized from the provided gold tuples when determining
whether to reinitialize the morphology with the new tag map.
* Simplify _SP check
* user_dict fields: adding inflections, reading_forms, sub_tokens
deleting: unidic_tags
improve code readability around the token alignment procedure
* add test cases, replace fugashi with sudachipy in conftest
* move bunsetu.py to spaCy Universe as a pipeline component BunsetuRecognizer
* tag is space -> both surface and tag are spaces
* consider len(text)==0
* Fix warning message for lemmatization tables
* Add a warning when the `lexeme_norm` table is empty. (Given the
relatively lang-specific loading for `Lookups`, it seemed like too much
overhead to dynamically extract the list of languages, so for now it's
hard-coded.)