Remove the non-working `--use-chars` option from the train CLI. The
implementation of the option across component types and the CLI settings
could be fixed, but the `CharacterEmbed` model does not work on GPU in
v2 so it's better to remove it.
* Only set NORM on Token in retokenizer
Instead of setting `NORM` on both the token and lexeme, set `NORM` only
on the token.
The retokenizer tries to set all possible attributes with
`Token/Lexeme.set_struct_attr` so that it doesn't have to enumerate
which attributes are available for each. `NORM` is the only attribute
that's stored on both and for most cases it doesn't make sense to set
the global norms based on an individual retokenization. For lexeme-only
attributes like `IS_STOP` there's no way to avoid the global side
effects, but I think that `NORM` would be better only on the token.
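The distinction between token-level and lexeme-level attributes can be pictured with a toy model. This is only a sketch in plain Python, not spaCy's Cython implementation: the `Lexeme` and `Token` classes, the `norm_override` field, and the `set_attrs_token_only_norm` helper are hypothetical names chosen for illustration.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class Lexeme:
    """Vocabulary-wide entry shared by every token with the same text."""
    orth: str
    norm: str
    is_stop: bool = False


@dataclass
class Token:
    """One occurrence in a document; may override the lexeme's norm."""
    lex: Lexeme
    norm_override: Optional[str] = None

    @property
    def norm(self) -> str:
        # Fall back to the shared lexeme norm when no override is set.
        if self.norm_override is not None:
            return self.norm_override
        return self.lex.norm


def set_attrs_token_only_norm(token: Token, attrs: Dict[str, object]) -> None:
    """Apply retokenization attrs; NORM goes to the token, not the lexeme."""
    for name, value in attrs.items():
        if name == "NORM":
            # Stored on this token only, so other documents are unaffected.
            token.norm_override = str(value)
        elif name == "IS_STOP":
            # Lexeme-only attribute: the global side effect is unavoidable.
            token.lex.is_stop = bool(value)


vocab = {"NY": Lexeme(orth="NY", norm="ny")}
first = Token(vocab["NY"])
second = Token(vocab["NY"])
set_attrs_token_only_norm(first, {"NORM": "new york", "IS_STOP": True})
```

In this model, `first.norm` changes while `second.norm` and the shared lexeme keep their original norm, whereas the `IS_STOP` change is visible through both tokens.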
* Fix test
For the `DependencyMatcher`:
* Fix on_match callback so that it is called once per matched pattern
* Fix results so that patterns with empty match lists are not returned
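The intended behavior of the two fixes above can be sketched in plain Python. This is a simplified illustration, not the `DependencyMatcher` code: the `collect_matches` helper, the `(key, token_ids)` tuple shape, and the two-argument callback signature are assumptions made for the example.

```python
from typing import Callable, Dict, List, Tuple

Match = Tuple[str, List[int]]  # (pattern key, matched token indices)
Callback = Callable[[int, List[Match]], None]


def collect_matches(
    candidates: List[Match], callbacks: Dict[str, Callback]
) -> List[Match]:
    """Drop empty matches, then call each callback once per remaining match."""
    # Patterns whose token list is empty did not match and are not returned.
    matches = [(key, token_ids) for key, token_ids in candidates if token_ids]
    for i, (key, _) in enumerate(matches):
        on_match = callbacks.get(key)
        if on_match is not None:
            # Exactly one call per match, with the index of that match.
            on_match(i, matches)
    return matches


calls: List[int] = []
candidates = [("founded", [1, 0, 2]), ("empty", []), ("founded", [4, 3, 5])]
results = collect_matches(
    candidates,
    {"founded": lambda i, matches: calls.append(i), "empty": lambda i, matches: calls.append(-1)},
)
```

Here the `"empty"` candidate is filtered out before any callback runs, so its callback is never invoked.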
Modify the internal pattern representation in `Matcher` patterns to
identify the final ID state using a unique quantifier rather than a
combination of other attributes.
It was insufficient to identify the final ID node based on an
uninitialized `quantifier` (which coincidentally has the same value as
`ZERO`) combined with `nr_attr` being 0. (In addition, it was potentially
bug-prone that `nr_attr` was set to 0 even though attrs were allocated.)
In the case of `{"OP": "!"}` (a valid, if pointless, pattern), `nr_attr`
is 0 and the quantifier is ZERO, so the previous methods for
incrementing to the ID node at the end of the pattern weren't able to
distinguish the final ID node from the `{"OP": "!"}` pattern.
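The ambiguity can be reproduced with a small model of the pattern array. This is a sketch in plain Python rather than the Cython structs used by `Matcher`; `PatternNode`, `final_index_old`, `final_index_new`, and the `FINAL_ID` member name are assumptions for illustration.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List


class Quantifier(Enum):
    ZERO = 0
    ZERO_ONE = 1
    ZERO_PLUS = 2
    ONE = 3
    ONE_PLUS = 4
    FINAL_ID = 5  # assumed name for the dedicated end-of-pattern marker


@dataclass
class PatternNode:
    quantifier: Quantifier
    nr_attr: int


def final_index_old(nodes: List[PatternNode]) -> int:
    """Old heuristic: stop at the first ZERO node that has no attrs."""
    i = 0
    while not (nodes[i].quantifier is Quantifier.ZERO and nodes[i].nr_attr == 0):
        i += 1
    return i


def final_index_new(nodes: List[PatternNode]) -> int:
    """New check: stop only at the node carrying the unique quantifier."""
    i = 0
    while nodes[i].quantifier is not Quantifier.FINAL_ID:
        i += 1
    return i


# Models [{"ORTH": "a"}, {"OP": "!"}, {"ORTH": "b"}] followed by the ID node.
old_nodes = [
    PatternNode(Quantifier.ONE, 1),
    PatternNode(Quantifier.ZERO, 0),  # {"OP": "!"}
    PatternNode(Quantifier.ONE, 1),
    PatternNode(Quantifier.ZERO, 0),  # final ID node, old representation
]
new_nodes = old_nodes[:3] + [PatternNode(Quantifier.FINAL_ID, 0)]

old_result = final_index_old(old_nodes)
new_result = final_index_new(new_nodes)
```

With the old representation the scan stops at the `{"OP": "!"}` node (index 1) instead of the real final node (index 3); the unique quantifier removes that ambiguity.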
* added single and paired orth variants
* added token match
* added long text tokenization test
* inverted init
* normalized lemmas to lowercase
* more abbrevs
* tests for ordinals and abbrevs
* separated period abbrevs into another list
* fixed typo
* added ordinal and abbrev tests
* added number tests for dates
* minor refinement
* added inflected abbrevs regex
* added percentage and inflection
* cosmetics
* added token match
* added url inflection tests
* excluded url tokens from custom pattern
* removed url match import
* Include Macedonian language
* Fix indentation at char_classes.py
* Fix indentation at char_classes.py
* Add Macedonian tests, update lex_attrs and char_classes
* Import unicode literals for python 2
* added tr_vocab to config
* basic test
* added syntax iterator to Turkish lang class
* first version for Turkish syntax iter, without flat
* added simple tests with nmod, amod, det
* more tests to amod and nmod
* separated noun chunks and parser test
* rearrangement after nchunk parser separation
* added recursive NPs
* tests with complicated recursive NPs
* tests with conjed NPs
* additional tests for conj NP
* small modification for shaving off conj from NP
* added tests with flat
* more tests with flat
* added examples with flats conjed
* added inner func for flat trick
* corrected parse
Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
* feat: added turkish tag map
* feat: morph rules cconj and sconj
* feat: more conjuncts
* feat: added popular postpositions
* feat: added adverbs
* feat: added personal pronouns
* feat: added reflexive pronouns
* minor: corrected case capital
* minor: fixed comma typo
* feat: added indef pronouns
* feat: added dict iter
* fixed comma typo
* updated language class with tag map and morph
* use default tag map instead
* removed tag map
* Hindi: Adds tests for lexical attributes (norm and like_num)
* Signs and adds the contributor agreement
* Add ordinal numbers to be tagged as like_num
* Adds alternate pronunciation for 31 and 39
* Regression test for issue 6207
* Fix issue 6207
* Sign contributor agreement
* Minor adjustments to test
Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
* reorder so tagmap is replaced only if a custom file is provided.
* Remove unneeded variable initialization
Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
* create contributor agreement
* Update Indonesian example. (see #1107)
Update the Indonesian examples with more appropriate phrases. The current phrases contain sensitive and violent words.
* Update stop_words.py
Hebrew STOP WORDS
* Update stop_words.py
* contributor
* contributor
* add some common domain extensions
support human number 1K/1M....
* support human number 1K/1M....
* hebrew number tokenize
1K/1M implement in EN
* test human tokenize fix
* test
* heb like num
revert human number change
* heb like num
* Create lex_attrs.py
Hello,
Czech is missing from spaCy, so I would like to help push it along a little. This file is based on the other `lex_attrs.py` files, translated to Czech.
* Update __init__.py
Updated for use with the new Czech `lex_attrs.py` file
* Update stop_words.py
* Create test_text.py
* add like_num testing for Czech
Co-authored-by: holubvl3 <47881982+holubvl3@users.noreply.github.com>
Co-authored-by: holubvl3 <vilemrousi@gmail.com>
Co-authored-by: Vladimír Holubec <vholubec@arcdata.cz>
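
For reference, a `like_num` function in a `lex_attrs.py` file typically has roughly the following shape. This is a minimal sketch, not the contents of the Czech file added here: the `_num_words` list below is a small illustrative subset chosen for the example.

```python
# Illustrative subset of Czech number words (assumption, not the real list).
_num_words = [
    "nula", "jedna", "dva", "tři", "čtyři", "pět",
    "šest", "sedm", "osm", "devět", "deset",
]


def like_num(text: str) -> bool:
    """Return True if the text looks like a number."""
    # Ignore a leading sign or approximation marker.
    if text.startswith(("+", "-", "±", "~")):
        text = text[1:]
    # Ignore thousands and decimal separators.
    text = text.replace(",", "").replace(".", "")
    if text.isdigit():
        return True
    # Simple fractions such as 1/2.
    if text.count("/") == 1:
        num, denom = text.split("/")
        if num.isdigit() and denom.isdigit():
            return True
    # Spelled-out numbers.
    return text.lower() in _num_words
```

A test such as the one added in this change would then check that digits, fractions, and number words return `True` while ordinary words return `False`.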