spaCy/spacy/tests/lang/grc/test_tokenizer.py
Jacobo Myerston 3e8bc1272f
add punctuation to grc (#11426)
* add punctuation to grc

Add support for special editorial punctuation that is common in ancient Greek texts.  Ancient Greek texts, as found in digital and print form, have been largely edited by scholars. Restorations and improvements are normally marked with special characters that need to be handled properly by the tokenizer.

* add unit tests

* simplify regex

* move generic quotes to char classes

* rename unit test

* fix regex

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

Co-authored-by: svlandeg <svlandeg@github.com>
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
2022-09-27 11:38:56 +02:00


import pytest


# fmt: off
GRC_TOKEN_EXCEPTION_TESTS = [
    ("τὸ 〈τῆς〉 φιλοσοφίας ἔργον ἔνιοί φασιν ἀπὸ ⟦βαρβάρων⟧ ἄρξαι.", ["τὸ", "〈", "τῆς", "〉", "φιλοσοφίας", "ἔργον", "ἔνιοί", "φασιν", "ἀπὸ", "⟦", "βαρβάρων", "⟧", "ἄρξαι", "."]),
    ("τὴν δὲ τῶν Αἰγυπτίων φιλοσοφίαν εἶναι τοιαύτην περί τε †θεῶν† καὶ ὑπὲρ δικαιοσύνης.", ["τὴν", "δὲ", "τῶν", "Αἰγυπτίων", "φιλοσοφίαν", "εἶναι", "τοιαύτην", "περί", "τε", "†", "θεῶν", "†", "καὶ", "ὑπὲρ", "δικαιοσύνης", "."]),
    ("⸏πόσις δ' Ἐρεχθεύς ἐστί μοι σεσωσμένος⸏", ["⸏", "πόσις", "δ'", "Ἐρεχθεύς", "ἐστί", "μοι", "σεσωσμένος", "⸏"]),
    ("⸏ὔπνον ἴδωμεν⸎", ["⸏", "ὔπνον", "ἴδωμεν", "⸎"]),
]
# fmt: on


@pytest.mark.parametrize("text,expected_tokens", GRC_TOKEN_EXCEPTION_TESTS)
def test_grc_tokenizer(grc_tokenizer, text, expected_tokens):
    tokens = grc_tokenizer(text)
    # Skip whitespace tokens so that only words and punctuation are compared.
    token_list = [token.text for token in tokens if not token.is_space]
    assert expected_tokens == token_list
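
The behaviour these tests expect can be illustrated without spaCy. The following is only a minimal sketch using the standard library: `EDITORIAL_MARKS` and `split_editorial_marks` are hypothetical names introduced here, the set of marks is taken from the test cases above, and the regex is a stand-in, not the punctuation rules spaCy actually defines for `grc`.

```python
import re

# Hypothetical set of editorial marks, copied from the test cases above.
# spaCy's real grc punctuation rules are defined elsewhere and may differ.
EDITORIAL_MARKS = "〈〉⟦⟧†⸏⸎"

# A token is either a single editorial mark or period, or a run of
# characters that are neither whitespace nor one of those marks.
_TOKEN_RE = re.compile(
    "[{marks}.]|[^\\s{marks}.]+".format(marks=re.escape(EDITORIAL_MARKS))
)


def split_editorial_marks(text):
    """Split text on whitespace and detach editorial marks as own tokens."""
    return _TOKEN_RE.findall(text)


if __name__ == "__main__":
    print(split_editorial_marks("ἀπὸ ⟦βαρβάρων⟧ ἄρξαι."))
```

Each mark becomes its own token while an elided form such as `δ'` stays intact, because the apostrophe is not in the assumed set of marks.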