Commit Graph

8 Commits

Author SHA1 Message Date
Adriane Boyd
5861308910 Generalize handling of tokenizer special cases
Handle tokenizer special cases more generally by using the Matcher
internally to match special cases after the affix/token_match
tokenization is complete.

Instead of only matching special cases while processing balanced or
nearly balanced prefixes and suffixes, this recognizes special cases in
a wider range of contexts:

* Allows arbitrary numbers of prefixes/suffixes around special cases
* Allows special cases separated by infixes

Existing tests/settings that couldn't be preserved as before:

* The emoticon '")' is no longer a supported special case
* The emoticon ':)' in "example:)" is a false positive again

When merged with #4258 (or the relevant cache bugfix), the affix and
token_match properties should be modified to flush and reload all
special cases to use the updated internal tokenization with the Matcher.
2019-09-08 20:35:16 +02:00
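
The two-pass idea described in the commit above can be sketched in plain Python. This is only an illustrative toy under stated assumptions, not spaCy's implementation: the names `PUNCT_CHARS`, `SPECIAL_CASES`, `split_affixes`, `apply_special_cases` and `tokenize` are hypothetical, the affix pass simply splits off every punctuation character, and a longest-match scan over adjacent tokens stands in for the Matcher.

```python
# Toy sketch: affix tokenization first, special-case matching afterwards.
# All names and rules here are hypothetical, not spaCy's actual API or data.

PUNCT_CHARS = set("()[]\"':;,.!?-")

# Maps the surface text of a special case to the tokens it should produce.
SPECIAL_CASES = {
    ":)": [":)"],
    ":(": [":("],
    "don't": ["do", "n't"],
}


def split_affixes(chunk):
    """Pass 1: split every punctuation character off as its own token."""
    tokens = []
    word = ""
    for ch in chunk:
        if ch in PUNCT_CHARS:
            if word:
                tokens.append(word)
                word = ""
            tokens.append(ch)
        else:
            word += ch
    if word:
        tokens.append(word)
    return tokens


def apply_special_cases(tokens, special_cases):
    """Pass 2: replace runs of adjacent tokens whose joined text is a special case."""
    out = []
    i = 0
    while i < len(tokens):
        matched_end = None
        matched_text = None
        # Try the longest candidate run first so longer special cases win.
        for j in range(len(tokens), i, -1):
            candidate = "".join(tokens[i:j])
            if candidate in special_cases:
                matched_end = j
                matched_text = candidate
                break
        if matched_end is None:
            out.append(tokens[i])
            i += 1
        else:
            out.extend(special_cases[matched_text])
            i = matched_end
    return out


def tokenize(text):
    """Tokenize each whitespace-delimited chunk with both passes."""
    result = []
    for chunk in text.split():
        result.extend(apply_special_cases(split_affixes(chunk), SPECIAL_CASES))
    return result


if __name__ == "__main__":
    print(tokenize("((:))"))      # any number of affixes around a special case
    print(tokenize(":)-:("))      # special cases separated by an infix
    print(tokenize("example:)"))  # the false positive noted in the commit
```

Because matching happens only after the affix pass, the special case is found no matter how many affixes surround it. The same property produces the `:)` false positive in "example:)" that the commit message mentions.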
Ines Montani
b6e991440c 💫 Tidy up and auto-format tests (#2967)
* Auto-format tests with black

* Add flake8 config

* Tidy up and remove unused imports

* Fix redefinitions of test functions

* Replace orths_and_spaces with words and spaces

* Fix compatibility with pytest 4.0

* xfail test for now

The test was previously overwritten by the following test due to a naming conflict, so its failure wasn't reported.

* Unfail passing test

* Only use fixture via arguments

Fixes pytest 4.0 compatibility
2018-11-27 01:09:36 +01:00
Matthew Honnibal
8c945310fb Excuse emoji failure on narrow unicode builds 2017-09-16 16:21:13 +02:00
Gyorgy Orosz
8c0b4b850e Fixed emoji handling for Hungarian 2017-05-30 21:34:46 +02:00
ines
a8e58e04ef Add symbols class to punctuation rules to handle emoji (see #1088)
Currently doesn't work for Hungarian, because of conflicts with the
custom punctuation rules. Also doesn't take multi-character emoji like
👩🏽‍💻 into account.
2017-05-27 17:57:10 +02:00
Ines Montani
bbe7cab3a1 Move non-English-specific tests back to general tokenizer tests 2017-01-05 18:09:29 +01:00
Ines Montani
c5f2dc15de Move English tokenizer tests to directory /en 2017-01-05 16:25:04 +01:00
Ines Montani
0e65dca9a5 Modernize and merge tokenizer tests for exception and emoticons 2017-01-05 13:11:31 +01:00