spaCy/spacy/tests/lang
Jacobo Myerston 3e8bc1272f
add punctuation to grc (#11426)
* add punctuation to grc

Add support for special editorial punctuation that is common in Ancient Greek texts. Ancient Greek texts, as found in digital and print form, have largely been edited by scholars. Restorations and improvements are normally marked with special characters that the tokenizer needs to handle properly.

* add unit tests

* simplify regex

* move generic quotes to char classes

* rename unit test

* fix regex

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

Co-authored-by: svlandeg <svlandeg@github.com>
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
2022-09-27 11:38:56 +02:00
..
af New tests for a number of alpha languages (#9703) 2021-11-28 21:59:23 +01:00
am Tidy up and auto-format 2021-01-15 11:57:36 +11:00
ar Remove POS, TAG and LEMMA from tokenizer exceptions 2020-07-22 23:09:01 +02:00
bg Handle Cyrillic combining diacritics (#10837) 2022-06-28 15:35:32 +02:00
bn Drop Python 2.7 and 3.5 (#4828) 2019-12-22 01:53:56 +01:00
ca Update Catalan tokenizer (#9297) 2021-09-27 14:42:30 +02:00
cs Remove unicode declarations and update language data 2020-09-04 13:19:16 +02:00
da Merge remote-tracking branch 'upstream/master' into chore/update-develop-from-master-rc3 2021-01-14 11:49:58 +01:00
de Tidy up and auto-format 2020-09-29 21:39:28 +02:00
dsb Add Lower Sorbian support. (#10431) 2022-03-07 16:57:14 +01:00
el Tidy up and auto-format 2020-09-29 21:39:28 +02:00
en Remove English exceptions with mismatched features (#10873) 2022-06-03 09:44:04 +02:00
es Migrate regression tests into the main test suite (#9655) 2021-12-04 20:34:48 +01:00
et New tests for a number of alpha languages (#9703) 2021-11-28 21:59:23 +01:00
eu Merge branch 'develop' into master-tmp 2020-05-21 18:39:06 +02:00
fa Tidy up and auto-format 2020-09-29 21:39:28 +02:00
fi Auto-format code with black (#10333) 2022-02-21 09:15:42 +01:00
fr Revert "Bump sudachipy version (#9917)" (#10071) 2022-01-17 10:38:37 +01:00
ga Drop Python 2.7 and 3.5 (#4828) 2019-12-22 01:53:56 +01:00
grc add punctuation to grc (#11426) 2022-09-27 11:38:56 +02:00
gu Remove unicode declarations and tidy up 2020-06-21 22:34:10 +02:00
he Merge branch 'develop' into master-tmp 2020-09-04 13:15:36 +02:00
hi Migrate regression tests into the main test suite (#9655) 2021-12-04 20:34:48 +01:00
hr New tests for a number of alpha languages (#9703) 2021-11-28 21:59:23 +01:00
hsb Add Upper Sorbian support. (#10432) 2022-03-07 16:20:39 +01:00
hu 🏷 Add Mypy check to CI and ignore all existing Mypy errors (#9167) 2021-10-14 15:21:40 +02:00
hy Remove unicode declarations and tidy up 2020-06-21 22:34:10 +02:00
id Tidy up and auto-format 2020-09-29 21:39:28 +02:00
is New tests for a number of alpha languages (#9703) 2021-11-28 21:59:23 +01:00
it Revert "Bump sudachipy version (#9917)" (#10071) 2022-01-17 10:38:37 +01:00
ja Migrate regression tests into the main test suite (#9655) 2021-12-04 20:34:48 +01:00
ko Handle unknown tags in KoreanTokenizer tag map (#10536) 2022-03-24 11:25:36 +01:00
ky Update Cython string types (#9143) 2021-09-13 17:02:17 +02:00
la Auto-format code with black (#11427) 2022-09-02 11:43:20 +02:00
lb Remove POS, TAG and LEMMA from tokenizer exceptions 2020-07-22 23:09:01 +02:00
lg luganda language extension (#10847) 2022-08-23 13:09:36 +02:00
lt Merge branch 'master' into tmp/sync 2020-03-26 13:38:14 +01:00
lv New tests for a number of alpha languages (#9703) 2021-11-28 21:59:23 +01:00
mk Tidy up and auto-format 2021-01-05 13:41:53 +11:00
ml Remove unicode declarations and tidy up 2020-06-21 22:34:10 +02:00
nb Tidy up and auto-format 2020-09-29 21:39:28 +02:00
ne Tidy up and auto-format 2020-09-29 21:39:28 +02:00
nl Fix Dutch noun chunks to skip overlapping spans (#11275) 2022-08-10 09:49:08 +02:00
pl Merge branch 'develop' into master-tmp 2020-05-21 18:39:06 +02:00
pt Portuguese noun chunks review (#9559) 2021-11-04 23:55:49 +01:00
ro Drop Python 2.7 and 3.5 (#4828) 2019-12-22 01:53:56 +01:00
ru Clean up warnings in the test suite (#11331) 2022-08-22 12:04:30 +02:00
sa Tidy up and auto-format 2020-09-29 21:39:28 +02:00
sk New tests for a number of alpha languages (#9703) 2021-11-28 21:59:23 +01:00
sl Updates to Slovenian language (#11162) 2022-08-05 10:10:18 +02:00
sq New tests for a number of alpha languages (#9703) 2021-11-28 21:59:23 +01:00
sr Un-xfail passing tests 2019-12-25 18:02:20 +01:00
sv Migrate regression tests into the main test suite (#9655) 2021-12-04 20:34:48 +01:00
ta Basic tests for the Tamil language (#10629) 2022-04-07 14:47:37 +02:00
th Update custom tokenizer APIs and pickling (#8972) 2021-08-19 14:37:47 +02:00
ti Update Tigrinya ትግርኛ language support (#8900) 2021-08-10 13:55:08 +02:00
tl Add initial Tagalog (tl) tests (#9582) 2021-11-02 08:35:49 +01:00
tr removing print statements from the test suite (#10712) 2022-04-27 09:14:25 +02:00
tt Merge branch 'master' into develop 2020-02-18 14:47:23 +01:00
uk Clean up warnings in the test suite (#11331) 2022-08-22 12:04:30 +02:00
ur Drop Python 2.7 and 3.5 (#4828) 2019-12-22 01:53:56 +01:00
vi Update custom tokenizer APIs and pickling (#8972) 2021-08-19 14:37:47 +02:00
xx New tests for a number of alpha languages (#9703) 2021-11-28 21:59:23 +01:00
yo Drop Python 2.7 and 3.5 (#4828) 2019-12-22 01:53:56 +01:00
zh Tidy up and auto-format 2020-10-03 17:20:18 +02:00
__init__.py Revert #4334 2019-09-29 17:32:12 +02:00
test_attrs.py Intify IOB (#9738) 2022-01-20 13:19:38 +01:00
test_initialize.py Fix Azerbaijani init, extend lang init tests (#8656) 2021-07-09 15:36:35 +02:00
test_lemmatizers.py Update Catalan language data (#8308) 2021-06-11 10:21:22 +02:00