spaCy

mirror of https://github.com/explosion/spaCy.git synced 2026-02-19 13:41:00 +03:00

Adriane Boyd 4448680750 Fix alignment for 1-to-1 tokens and lowercasing (#6476)	2020-12-08 14:25:16 +08:00

* When checking for token alignments, check not only that the tokens are identical but that the character positions are both at the start of a token. It's possible for the tokens to be identical even though the two tokens aren't aligned one-to-one in a case like `["a'", "''"]` vs. `["a", "''", "'"]`, where the middle tokens are identical but should not be aligned on the token level at character position 2 since it's the start of one token but the middle of another.
* Use the lowercased version of the token texts to create the character-to-token alignment because lowercasing can change the string length (e.g., for `İ`, see the not-a-bug bug report: https://bugs.python.org/issue34723)
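The two points in this commit message can be illustrated with a small standalone sketch. It is not spaCy's implementation: `char_to_token` and `one_to_one` are hypothetical helper names, and the sketch assumes both tokenizations cover the same text with whitespace ignored.

```python
def char_to_token(tokens):
    """Map every character of the lowercased tokens to (token index, is_token_start).

    The lowercased text is used because lowercasing can change the length
    of a string, e.g. "İ" lowercases to two code points in Python.
    """
    mapping = []
    for index, token in enumerate(tokens):
        lowered = token.lower()
        for offset in range(len(lowered)):
            mapping.append((index, offset == 0))
    return mapping


def one_to_one(tokens_a, tokens_b, require_start=True):
    """Return (i, j) pairs of tokens that align one-to-one.

    With require_start=True, a pair counts only if the texts match AND the
    shared character position is the start of both tokens. With
    require_start=False, only the texts are compared (the naive check).
    """
    map_a = char_to_token(tokens_a)
    map_b = char_to_token(tokens_b)
    if len(map_a) != len(map_b):
        raise ValueError("the two tokenizations do not cover the same text")
    pairs = []
    for (i, start_a), (j, start_b) in zip(map_a, map_b):
        if require_start and not (start_a and start_b):
            continue
        if tokens_a[i].lower() != tokens_b[j].lower():
            continue
        # Skip repeats produced by later characters of the same token pair.
        if not pairs or pairs[-1] != (i, j):
            pairs.append((i, j))
    return pairs


tokens_a = ["a'", "''"]
tokens_b = ["a", "''", "'"]
naive_pairs = one_to_one(tokens_a, tokens_b, require_start=False)
strict_pairs = one_to_one(tokens_a, tokens_b)
```

For the example from the commit message, the naive check pairs the two `''` tokens at character position 2, while the start-of-token check returns no one-to-one pairs.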
__init__.py	move tests to correct subdir	2020-09-15 21:40:38 +02:00
test_augmenters.py	Update data augmenters (#6196)	2020-10-04 17:46:29 +02:00
test_new_example.py	Refactor Token morph setting (#6175)	2020-10-01 22:21:46 +02:00
test_readers.py	Fixes in test suite (#6457)	2020-12-02 12:57:08 +01:00
test_training.py	Fix alignment for 1-to-1 tokens and lowercasing (#6476)	2020-12-08 14:25:16 +08:00