1
1
mirror of https://github.com/explosion/spaCy.git synced 2025-01-24 08:14:15 +03:00
spaCy/spacy/tests/lang/bg/test_tokenizer.py
Richard Hudson a9559e7435
Handle Cyrillic combining diacritics ()
* Handle Russian, Ukrainian and Bulgarian

* Corrections

* Correction

* Correction to comment

* Changes based on review

* Correction

* Reverted irrelevant change in punctuation.py

* Remove unnecessary group

* Reverted accidental change
2022-06-28 15:35:32 +02:00

9 lines
252 B
Python

import pytest
def test_bg_tokenizer_handles_final_diacritics(bg_tokenizer):
text = "Ня̀маше яйца̀. Ня̀маше яйца̀."
tokens = bg_tokenizer(text)
assert tokens[1].text == "яйца̀"
assert tokens[2].text == "."