spaCy/spacy/tests/tokenizer
Adriane Boyd 5861308910 Generalize handling of tokenizer special cases
Handle tokenizer special cases more generally by using the Matcher
internally to match special cases after the affix/token_match
tokenization is complete.

Instead of only matching special cases while processing balanced or
nearly balanced prefixes and suffixes, this recognizes special cases in
a wider range of contexts:

* Allows arbitrary numbers of prefixes/affixes around special cases
* Allows special cases separated by infixes

Existing tests/settings that couldn't be preserved as before:

* The emoticon '")' is no longer a supported special case
* The emoticon ':)' in "example:)" is a false positive again

When merged with #4258 (or the relevant cache bugfix), the affix and
token_match properties should be modified to flush and reload all
special cases to use the updated internal tokenization with the Matcher.
2019-09-08 20:35:16 +02:00
..
__init__.py add __init__.py to empty package dirs 2016-03-14 11:28:03 +01:00
sun.txt Move text file back to tokenizer tests directory 2017-01-12 02:10:23 +01:00
test_exceptions.py Generalize handling of tokenizer special cases 2019-09-08 20:35:16 +02:00
test_naughty_strings.py Fix regex deprecation warnings 2019-02-21 11:56:47 +01:00
test_tokenizer.py Generalize handling of tokenizer special cases 2019-09-08 20:35:16 +02:00
test_urls.py Replacing regex library with re to increase tokenization speed (#3218) 2019-02-01 18:05:22 +11:00
test_whitespace.py 💫 Tidy up and auto-format tests (#2967) 2018-11-27 01:09:36 +01:00