spaCy/spacy/tests/tokenizer
Adriane Boyd 2ea9b58006
Ignore prefix in suffix matches (#9155)
* Ignore prefix in suffix matches

Ignore the currently matched prefix when looking for suffix matches in
the tokenizer. Otherwise a lookbehind in the suffix pattern may match
incorrectly due to the presence of the prefix in the token string.

* Move °[cfkCFK]. to a tokenizer exception

* Adjust exceptions for same tokenization as v3.1

* Also update test accordingly

* Continue to split . after °CFK if ° is not a prefix

* Exclude new ° exceptions for pl

* Switch back to default tokenization of "° C ."

* Revert "Exclude new ° exceptions for pl"

This reverts commit 952013a5b4.

* Add exceptions for °C for hu
2021-10-27 13:02:25 +02:00
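
The lookbehind issue described in the commit message can be illustrated with a minimal sketch that uses only Python's `re` module. This is not spaCy's tokenizer code: `PREFIX_RE`, `SUFFIX_RE`, and `find_suffix` are hypothetical names, and the two patterns are simplified assumptions chosen to mirror the °C case from the commit message.

```python
import re

# Hypothetical, simplified patterns (assumptions, not spaCy's real rules).
PREFIX_RE = re.compile(r"^°")                 # "°" treated as a prefix
SUFFIX_RE = re.compile(r"(?<=°[CFKcfk])\.$")  # "." is a suffix only after °C, °F, °K


def find_suffix(token, ignore_prefix):
    """Return the matched suffix text, or None if no suffix matches.

    With ignore_prefix=True the suffix pattern is searched only in the
    part of the token that remains after the matched prefix is removed,
    so the lookbehind cannot see the prefix characters.
    """
    prefix = PREFIX_RE.search(token)
    pre_len = prefix.end() if prefix else 0
    target = token[pre_len:] if ignore_prefix else token
    match = SUFFIX_RE.search(target)
    return match.group() if match else None


# Searching the whole string: the lookbehind sees the "°" prefix and matches.
with_prefix = find_suffix("°C.", ignore_prefix=False)
# Searching only the text after the prefix: the lookbehind no longer matches.
without_prefix = find_suffix("°C.", ignore_prefix=True)
print(with_prefix, without_prefix)
```

In this sketch the two calls disagree on the same token, which is the kind of inconsistency the commit addresses by excluding the matched prefix from the suffix search.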
__init__.py Revert #4334 2019-09-29 17:32:12 +02:00
sun.txt Revert #4334 2019-09-29 17:32:12 +02:00
test_exceptions.py Ignore prefix in suffix matches (#9155) 2021-10-27 13:02:25 +02:00
test_explain.py Update Tokenizer.explain with special matches (#7749) 2021-04-19 19:08:20 +10:00
test_naughty_strings.py Merge branch 'develop' into master-tmp 2020-09-04 13:15:36 +02:00
test_tokenizer.py Ignore prefix in suffix matches (#9155) 2021-10-27 13:02:25 +02:00
test_urls.py Merge branch 'develop' into master-tmp 2020-06-20 15:52:00 +02:00
test_whitespace.py Merge branch 'develop' into master-tmp 2020-09-04 13:15:36 +02:00