spaCy

mirror of https://github.com/explosion/spaCy.git synced 2026-02-15 11:40:37 +03:00

History

Adriane Boyd 5861308910 Generalize handling of tokenizer special cases Handle tokenizer special cases more generally by using the Matcher internally to match special cases after the affix/token_match tokenization is complete. Instead of only matching special cases while processing balanced or nearly balanced prefixes and suffixes, this recognizes special cases in a wider range of contexts: * Allows arbitrary numbers of prefixes/affixes around special cases * Allows special cases separated by infixes Existing tests/settings that couldn't be preserved as before: * The emoticon '")' is no longer a supported special case * The emoticon ':)' in "example:)" is a false positive again When merged with #4258 (or the relevant cache bugfix), the affix and token_match properties should be modified to flush and reload all special cases to use the updated internal tokenization with the Matcher.		2019-09-08 20:35:16 +02:00
..
__init__.py	add __init__.py to empty package dirs	2016-03-14 11:28:03 +01:00
sun.txt	Move text file back to tokenizer tests directory	2017-01-12 02:10:23 +01:00
test_exceptions.py	Generalize handling of tokenizer special cases	2019-09-08 20:35:16 +02:00
test_naughty_strings.py	Fix regex deprecation warnings	2019-02-21 11:56:47 +01:00
test_tokenizer.py	Generalize handling of tokenizer special cases	2019-09-08 20:35:16 +02:00
test_urls.py	Replacing regex library with re to increase tokenization speed (#3218 )	2019-02-01 18:05:22 +11:00
test_whitespace.py	💫 Tidy up and auto-format tests (#2967 )	2018-11-27 01:09:36 +01:00