spaCy/spacy/tests/regression
Matthew Honnibal e65b5bb9a0 Fix tokenizer on Python2.7 (#3460)
spaCy v2.1 switched to the built-in `re` module, whereas v2.0 used the
third-party `regex` library. When the tokenizer was deserialized on
Python 2.7, `re.compile()` was called with expressions containing
escaped unicode codepoints that are not in Python 2.7's unicode
database.

Problems occurred when we had a range between two of these unknown
codepoints, like this:

```
    '[\\uAA77-\\uAA79]'
```

On Python 2.7, the unknown codepoints are not unescaped correctly, so
the expression ends up matching arbitrary characters outside the
intended range.

This problem does not occur if the range is between two unicode
literals rather than two escape sequences. To fix the bug, we
therefore add a new compat function that unescapes unicode sequences
using the `ast.literal_eval()` function. Care is taken to ensure we
do not also unescape non-unicode sequences.
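
The idea can be sketched roughly as follows. This is a minimal
illustration written for Python 3, not the compat function added in
this commit: the names `_ESCAPES` and `unescape_unicode_sketch` are
hypothetical, and the sketch assumes that only `\uXXXX` and
`\UXXXXXXXX` sequences should be unescaped, while every other
backslash sequence (such as `\d`) is left untouched.

```python
import ast
import re

# Hypothetical pattern and helper names, for illustration only.
# Alternatives are tried left to right: an escaped backslash pair is
# consumed first, so a "u" that follows an escaped backslash is never
# mistaken for the start of a unicode escape.
_ESCAPES = re.compile(r"\\\\|\\u[0-9a-fA-F]{4}|\\U[0-9a-fA-F]{8}")


def unescape_unicode_sketch(pattern):
    """Replace unicode escape sequences with literal characters."""

    def _to_literal(match):
        text = match.group(0)
        if text == "\\\\":
            # An escaped backslash: keep it exactly as written.
            return text
        # Evaluate only the escape sequence as a string literal.
        return ast.literal_eval('"' + text + '"')

    return _ESCAPES.sub(_to_literal, pattern)


escaped = "[\\uAA77-\\uAA79]\\d"
unescaped = unescape_unicode_sketch(escaped)
```

Here `unescaped` contains a range between two literal characters, and
the trailing `\d` is still a regex escape, so the result can be passed
to `re.compile()` directly.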

Closes #3356.

- [x] I have submitted the spaCy Contributor Agreement.
- [x] I ran the tests, and all new and existing tests passed.
- [x] My changes don't require a change to the documentation, or if they do, I've added all required information.
2019-03-22 13:42:47 +01:00
__init__.py Add __init__.py file for regression tests 2016-11-01 13:45:06 +01:00
test_issue1-1000.py Clean up of char classes, few tokenizer fixes and faster default French tokenizer (#3293) 2019-02-20 22:10:13 +01:00
test_issue1001-1500.py Clean up of char classes, few tokenizer fixes and faster default French tokenizer (#3293) 2019-02-20 22:10:13 +01:00
test_issue1501-2000.py Merge regression tests 2019-02-24 20:31:38 +01:00
test_issue2001-2500.py Tidy up 2019-03-11 13:34:14 +01:00
test_issue2501-3000.py Merge regression tests 2019-02-24 21:03:39 +01:00
test_issue3002.py Clean up of char classes, few tokenizer fixes and faster default French tokenizer (#3293) 2019-02-20 22:10:13 +01:00
test_issue3009.py Tidy up and fix small bugs and typos 2019-02-08 14:14:49 +01:00
test_issue3012.py Tidy up and fix small bugs and typos 2019-02-08 14:14:49 +01:00
test_issue3199.py Only run noun chunks iterator in Span if available (closes #3199) 2019-02-08 18:33:16 +01:00
test_issue3209.py Un-x-fail passing test 2019-02-24 20:24:15 +01:00
test_issue3248.py Tidy up and auto-format 2019-02-13 15:29:08 +01:00
test_issue3277.py 💫 Add en/em dash to prefixes and suffixes (#3281) 2019-02-15 10:29:59 +01:00
test_issue3288.py Tidy up tests 2019-02-24 14:11:23 +01:00
test_issue3289.py Tidy up tests 2019-02-24 14:11:23 +01:00
test_issue3328.py Auto-format 2019-02-27 11:56:45 +01:00
test_issue3331.py Add xfailing test for #3331 2019-02-25 22:33:30 +01:00
test_issue3345.py Un-xfail passing tests and tidy up 2019-03-10 18:42:16 +01:00
test_issue3356.py Fix tokenizer on Python2.7 (#3460) 2019-03-22 13:42:47 +01:00
test_issue3410.py Add actual deprecation warning for n_threads (resolves #3410) 2019-03-15 16:38:44 +01:00