spaCy

mirror of https://github.com/explosion/spaCy.git synced 2025-11-30 06:46:14 +03:00

Matthew Honnibal e65b5bb9a0 Fix tokenizer on Python2.7 (#3460)	2019-03-22 13:42:47 +01:00

spaCy v2.1 switched to the built-in re module, where v2.0 had been using the third-party regex library. When the tokenizer was deserialized on Python2.7, the `re.compile()` function was called with expressions that featured escaped unicode codepoints that were not in Python2.7's unicode database.

Problems occurred when we had a range between two of these unknown codepoints, like this:

```
'[\\uAA77-\\uAA79]'
```

On Python2.7, the unknown codepoints are not unescaped correctly, resulting in arbitrary out-of-range characters being matched by the expression.

This problem does not occur if we instead have a range between two unicode literals, rather than the escape sequences. To fix the bug, we therefore add a new compat function that unescapes unicode sequences using the `ast.literal_eval()` function. Care is taken to ensure we do not also unescape non-unicode sequences.

Closes #3356.

- [x] I have submitted the spaCy Contributor Agreement.
- [x] I ran the tests, and all new and existing tests passed.
- [x] My changes don't require a change to the documentation, or if they do, I've added all required information.
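
The unescaping step described in the commit message can be sketched roughly as follows. This is an illustration only, not the code from the commit: the function name `unescape_unicode_sketch` is hypothetical, and the sketch assumes the pattern contains no `'''` and does not end in a quote character. It protects every backslash, lifts the protection for `\u` and `\U` escapes only, and then lets `ast.literal_eval()` turn those escapes into literal characters.

```python
import ast
import re


def unescape_unicode_sketch(pattern):
    # Hypothetical helper: converts escaped codepoints into literal
    # characters and leaves other backslash sequences (\d, \s, ...) alone.
    # Protect all backslashes so literal_eval does not interpret \d, \n, etc.
    protected = pattern.replace("\\", "\\\\")
    # Lift the protection again, but only for unicode escapes.
    protected = protected.replace("\\\\u", "\\u").replace("\\\\U", "\\U")
    # literal_eval only parses a literal; it cannot execute code.
    # Assumption: the pattern has no ''' and does not end in a quote.
    return ast.literal_eval("u'''" + protected + "'''")


# A range between two escaped codepoints, followed by a regex escape.
escaped = "[\\uAA77-\\uAA79]\\d"
unescaped = unescape_unicode_sketch(escaped)
compiled = re.compile(unescaped)
```

After this step the character class holds the two literal characters rather than their escape sequences, while `\d` is still passed to `re.compile()` as a regex escape.
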
__init__.py	Add __init__.py file for regression tests	2016-11-01 13:45:06 +01:00
test_issue1-1000.py	Clean up of char classes, few tokenizer fixes and faster default French tokenizer (#3293)	2019-02-20 22:10:13 +01:00
test_issue1001-1500.py	Clean up of char classes, few tokenizer fixes and faster default French tokenizer (#3293)	2019-02-20 22:10:13 +01:00
test_issue1501-2000.py	Merge regression tests	2019-02-24 20:31:38 +01:00
test_issue2001-2500.py	Tidy up	2019-03-11 13:34:14 +01:00
test_issue2501-3000.py	Merge regression tests	2019-02-24 21:03:39 +01:00
test_issue3002.py	Clean up of char classes, few tokenizer fixes and faster default French tokenizer (#3293)	2019-02-20 22:10:13 +01:00
test_issue3009.py	Tidy up and fix small bugs and typos	2019-02-08 14:14:49 +01:00
test_issue3012.py	Tidy up and fix small bugs and typos	2019-02-08 14:14:49 +01:00
test_issue3199.py	Only run noun chunks iterator in Span if available (closes #3199)	2019-02-08 18:33:16 +01:00
test_issue3209.py	Un-x-fail passing test	2019-02-24 20:24:15 +01:00
test_issue3248.py	Tidy up and auto-format	2019-02-13 15:29:08 +01:00
test_issue3277.py	💫 Add en/em dash to prefixes and suffixes (#3281)	2019-02-15 10:29:59 +01:00
test_issue3288.py	Tidy up tests	2019-02-24 14:11:23 +01:00
test_issue3289.py	Tidy up tests	2019-02-24 14:11:23 +01:00
test_issue3328.py	Auto-format	2019-02-27 11:56:45 +01:00
test_issue3331.py	Add xfailing test for #3331	2019-02-25 22:33:30 +01:00
test_issue3345.py	Un-xfail passing tests and tidy up	2019-03-10 18:42:16 +01:00
test_issue3356.py	Fix tokenizer on Python2.7 (#3460)	2019-03-22 13:42:47 +01:00
test_issue3410.py	Add actual deprecation warning for n_threads (resolves #3410)	2019-03-15 16:38:44 +01:00