mirror of
https://github.com/explosion/spaCy.git
synced 2024-12-26 18:06:29 +03:00
e65b5bb9a0
spaCy v2.1 switched to the built-in `re` module, where v2.0 had been using the third-party `regex` library. When the tokenizer was deserialized on Python 2.7, the `re.compile()` function was called with expressions containing escaped Unicode codepoints that were not in Python 2.7's Unicode database. Problems occurred when there was a range between two of these unknown codepoints, such as `'[\\uAA77-\\uAA79]'`. On Python 2.7, the unknown codepoints are not unescaped correctly, so the expression matches arbitrary out-of-range characters. The problem does not occur if the range is between two Unicode literals rather than escape sequences. To fix the bug, we therefore add a new compat function that unescapes Unicode sequences using the `ast.literal_eval()` function. Care is taken to ensure we do not also unescape non-Unicode escape sequences. Closes #3356. - [x] I have submitted the spaCy Contributor Agreement. - [x] I ran the tests, and all new and existing tests passed. - [x] My changes don't require a change to the documentation, or if they do, I've added all required information. |
||
---|---|---|
__init__.py | ||
test_issue1-1000.py | ||
test_issue1001-1500.py | ||
test_issue1501-2000.py | ||
test_issue2001-2500.py | ||
test_issue2501-3000.py | ||
test_issue3002.py | ||
test_issue3009.py | ||
test_issue3012.py | ||
test_issue3199.py | ||
test_issue3209.py | ||
test_issue3248.py | ||
test_issue3277.py | ||
test_issue3288.py | ||
test_issue3289.py | ||
test_issue3328.py | ||
test_issue3331.py | ||
test_issue3345.py | ||
test_issue3356.py | ||
test_issue3410.py |
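
The fix described in the commit message above can be sketched roughly as follows. This is a minimal illustration, not the code from the commit: the function name `unescape_unicode` and its exact steps are assumptions, and it is written for Python 3, where the original bug does not reproduce. It shows only the idea of turning `\uXXXX` escapes into literal characters with `ast.literal_eval()` while leaving other backslash sequences (such as `\d`) untouched.

```python
import ast
import re


def unescape_unicode(pattern):
    """Replace \\uXXXX / \\UXXXXXXXX escapes in `pattern` with literal characters.

    Hypothetical helper for illustration only. It assumes the pattern does not
    contain triple quotes and does not end with a single quote, since the text
    is wrapped in a triple-quoted literal before evaluation.
    """
    if pattern is None:
        return pattern
    # Double every backslash so that sequences like \d or \n survive evaluation.
    protected = pattern.replace("\\", "\\\\")
    # Undo the doubling for Unicode escapes only, so they are the one kind of
    # escape sequence that literal_eval will interpret.
    protected = protected.replace("\\\\u", "\\u").replace("\\\\U", "\\U")
    # literal_eval parses a string literal without executing any code.
    return ast.literal_eval("u'''" + protected + "'''")


escaped = "[\\uAA77-\\uAA79]"          # range written with escape sequences
literal = unescape_unicode(escaped)    # same range written with literal characters
matches = re.compile(literal).findall("hello")
```

With the range expressed as literal characters, `re.compile()` no longer has to interpret the escapes itself, which is the step that went wrong on Python 2.7 according to the commit message.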