mirror of
https://github.com/explosion/spaCy.git
synced 2025-10-25 13:11:03 +03:00
spaCy v2.1 switched to the built-in `re` module, where v2.0 had been using
the third-party `regex` library. When the tokenizer was deserialized on
Python 2.7, `re.compile()` was called with expressions containing escaped
unicode codepoints that are not in Python 2.7's unicode database.
Problems occurred when we had a range between two of these unknown
codepoints, like this:
```
'[\\uAA77-\\uAA79]'
```
On Python 2.7, the unknown codepoints are not unescaped correctly, so
the expression ends up matching arbitrary characters outside the
intended range.
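For comparison, here is a small check of what the range is meant to match. This is only a sketch and assumes Python 3, where the `re` module resolves `\uXXXX` escapes itself; the variable names are illustrative and not taken from spaCy.

```python
import re

# The pattern text holds backslash-u escapes, not the characters themselves.
escaped_range = "[\\uAA77-\\uAA79]"
compiled = re.compile(escaped_range)

# U+AA78 lies inside the range, so it should be the only match here.
matches_in_range = compiled.findall("\uaa78")

# Plain ASCII text lies outside the range, so nothing should match.
matches_ascii = compiled.findall("hello")
```

On an affected Python 2.7 build, the second call is where the bug showed up: characters outside the range were matched.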
This problem does not occur if we instead have a range between two
unicode literals, rather than the escape sequences. To fix the bug, we
therefore add a new compat function that unescapes unicode sequences
using the `ast.literal_eval()` function. Care is taken to ensure we
do not also unescape non-unicode sequences.
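The approach described above can be sketched roughly as follows. This is a minimal illustration and not necessarily the exact compat function added by this change: the name `unescape_unicode_sketch` is hypothetical, and the sketch assumes the pattern contains no triple quotes and does not end in a quote character or a backslash.

```python
import ast


def unescape_unicode_sketch(pattern):
    """Replace unicode escape sequences with literal characters.

    Hypothetical helper for illustration; other backslash sequences
    (character classes, newlines, and so on) are left as they are.
    """
    if pattern is None:
        return pattern
    # Double every backslash so literal_eval leaves them alone.
    protected = pattern.replace("\\", "\\\\")
    # Remove that protection again, but only for unicode escapes.
    protected = protected.replace("\\\\u", "\\u").replace("\\\\U", "\\U")
    # literal_eval only parses the string literal; it cannot run code.
    return ast.literal_eval("'''" + protected + "'''")


literal_range = unescape_unicode_sketch("[\\uAA77-\\uAA79]")
untouched = unescape_unicode_sketch("\\d+\\n")
```

After this step, `literal_range` holds the range written with the actual characters, which is the form that compiles correctly on Python 2.7 according to the description above, while `untouched` still carries its original backslash sequences.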
Closes #3356.
- [x] I have submitted the spaCy Contributor Agreement.
- [x] I ran the tests, and all new and existing tests passed.
- [x] My changes don't require a change to the documentation, or if they do, I've added all required information.
| File | | |
|---|---|---|
| __init__.py | ||
| test_issue1-1000.py | ||
| test_issue1001-1500.py | ||
| test_issue1501-2000.py | ||
| test_issue2001-2500.py | ||
| test_issue2501-3000.py | ||
| test_issue3002.py | ||
| test_issue3009.py | ||
| test_issue3012.py | ||
| test_issue3199.py | ||
| test_issue3209.py | ||
| test_issue3248.py | ||
| test_issue3277.py | ||
| test_issue3288.py | ||
| test_issue3289.py | ||
| test_issue3328.py | ||
| test_issue3331.py | ||
| test_issue3345.py | ||
| test_issue3356.py | ||
| test_issue3410.py | ||