spaCy/spacy/lang
Jacobo Myerston 3e8bc1272f
add punctuation to grc (#11426)
* add punctuation to grc

Add support for the special editorial punctuation that is common in Ancient Greek texts. These texts, in both digital and print form, have largely been edited by scholars, who normally mark restorations and improvements with special characters that the tokenizer needs to handle properly.

* add unit tests

* simplify regex

* move generic quotes to char classes

* rename unit test

* fix regex

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

Co-authored-by: svlandeg <svlandeg@github.com>
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
2022-09-27 11:38:56 +02:00
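The commit message for #11426 says editorial marks in Ancient Greek texts need to be handled properly by the tokenizer. As a rough illustration of what that means, here is a minimal standalone sketch that separates such marks from the words they wrap. It does not reproduce the actual rules in `grc` or `char_classes.py`: the `EDITORIAL_MARKS` set (a guess based on common Leiden-style conventions) and the helper `split_editorial_marks` are hypothetical names introduced only for this example, and it uses only Python's standard `re` module rather than spaCy's prefix/suffix/infix machinery.

```python
import re
from typing import List

# Hypothetical set of editorial marks (an assumption for illustration only,
# not the character list that #11426 added to spaCy).
EDITORIAL_MARKS = "[]⟨⟩<>{}⟦⟧†"

# A token is either one editorial mark, or a run of characters that are
# neither whitespace nor editorial marks.
_TOKEN_RE = re.compile(
    "[{marks}]|[^\\s{marks}]+".format(marks=re.escape(EDITORIAL_MARKS))
)


def split_editorial_marks(text: str) -> List[str]:
    """Split text on whitespace and emit every editorial mark as its own token."""
    return _TOKEN_RE.findall(text)


if __name__ == "__main__":
    # A restored word in square brackets, then a word flagged with daggers.
    print(split_editorial_marks("μῆνιν [ἄειδε] θεὰ"))
    print(split_editorial_marks("†λόγος†"))
```

Unlike a rule-based tokenizer with separate prefix, suffix, and infix patterns, this sketch splits a mark wherever it occurs, including inside a word, which may or may not be the desired behavior for a given corpus.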
af 🏷 Add Mypy check to CI and ignore all existing Mypy errors (#9167) 2021-10-14 15:21:40 +02:00
am Merge remote-tracking branch 'upstream/master' into chore/update-develop-from-master-v3.2-1 2021-10-26 11:53:50 +02:00
ar 🏷 Add Mypy check to CI and ignore all existing Mypy errors (#9167) 2021-10-14 15:21:40 +02:00
az 🏷 Add Mypy check to CI and ignore all existing Mypy errors (#9167) 2021-10-14 15:21:40 +02:00
bg Handle Cyrillic combining diacritics (#10837) 2022-06-28 15:35:32 +02:00
bn Merge remote-tracking branch 'upstream/master' into chore/update-develop-from-master-v3.2-1 2021-10-26 11:53:50 +02:00
ca Fix lookup usage in French/Catalan (fix #11347) (#11382) 2022-08-29 10:32:38 +02:00
cs 🏷 Add Mypy check to CI and ignore all existing Mypy errors (#9167) 2021-10-14 15:21:40 +02:00
da 🏷 Add Mypy check to CI and ignore all existing Mypy errors (#9167) 2021-10-14 15:21:40 +02:00
de 🏷 Add Mypy check to CI and ignore all existing Mypy errors (#9167) 2021-10-14 15:21:40 +02:00
dsb Auto-format code with black (#10479) 2022-03-11 12:20:23 +01:00
el Merge remote-tracking branch 'upstream/master' into chore/update-develop-from-master-v3.2-1 2021-10-26 11:53:50 +02:00
en Remove English exceptions with mismatched features (#10873) 2022-06-03 09:44:04 +02:00
es Fix some issues in Spanish examples 2022-04-18 22:12:57 +02:00
et 🏷 Add Mypy check to CI and ignore all existing Mypy errors (#9167) 2021-10-14 15:21:40 +02:00
eu 🏷 Add Mypy check to CI and ignore all existing Mypy errors (#9167) 2021-10-14 15:21:40 +02:00
fa Merge remote-tracking branch 'upstream/master' into chore/update-develop-from-master-v3.2-1 2021-10-26 11:53:50 +02:00
fi Add a noun chunker for Finnish (#10214) 2022-02-08 08:44:11 +01:00
fr Fix lookup usage in French/Catalan (fix #11347) (#11382) 2022-08-29 10:32:38 +02:00
ga Merge remote-tracking branch 'upstream/master' into chore/update-develop-from-master-v3.2-1 2021-10-26 11:53:50 +02:00
grc add punctuation to grc (#11426) 2022-09-27 11:38:56 +02:00
gu 🏷 Add Mypy check to CI and ignore all existing Mypy errors (#9167) 2021-10-14 15:21:40 +02:00
he 🏷 Add Mypy check to CI and ignore all existing Mypy errors (#9167) 2021-10-14 15:21:40 +02:00
hi fix: Add missing comma to _eleven_to_beyond (#10166) 2022-01-30 16:45:06 +09:00
hr 🏷 Add Mypy check to CI and ignore all existing Mypy errors (#9167) 2021-10-14 15:21:40 +02:00
hsb Auto-format code with black (#10479) 2022-03-11 12:20:23 +01:00
hu Ignore prefix in suffix matches (#9155) 2021-10-27 13:02:25 +02:00
hy 🏷 Add Mypy check to CI and ignore all existing Mypy errors (#9167) 2021-10-14 15:21:40 +02:00
id 🏷 Add Mypy check to CI and ignore all existing Mypy errors (#9167) 2021-10-14 15:21:40 +02:00
is 🏷 Add Mypy check to CI and ignore all existing Mypy errors (#9167) 2021-10-14 15:21:40 +02:00
it Revert "Bump sudachipy version (#9917)" (#10071) 2022-01-17 10:38:37 +01:00
ja Format (#9630) 2021-11-05 09:56:26 +01:00
kn 🏷 Add Mypy check to CI and ignore all existing Mypy errors (#9167) 2021-10-14 15:21:40 +02:00
ko Fix regex invalid escape sequences (#11276) 2022-08-09 10:59:36 +02:00
ky 🏷 Add Mypy check to CI and ignore all existing Mypy errors (#9167) 2021-10-14 15:21:40 +02:00
la Auto-format code with black (#11427) 2022-09-02 11:43:20 +02:00
lb 🏷 Add Mypy check to CI and ignore all existing Mypy errors (#9167) 2021-10-14 15:21:40 +02:00
lg luganda language extension (#10847) 2022-08-23 13:09:36 +02:00
lij 🏷 Add Mypy check to CI and ignore all existing Mypy errors (#9167) 2021-10-14 15:21:40 +02:00
lt 🏷 Add Mypy check to CI and ignore all existing Mypy errors (#9167) 2021-10-14 15:21:40 +02:00
lv 🏷 Add Mypy check to CI and ignore all existing Mypy errors (#9167) 2021-10-14 15:21:40 +02:00
mk Merge remote-tracking branch 'upstream/master' into chore/update-develop-from-master-v3.2-1 2021-10-26 11:53:50 +02:00
ml 🏷 Add Mypy check to CI and ignore all existing Mypy errors (#9167) 2021-10-14 15:21:40 +02:00
mr 🏷 Add Mypy check to CI and ignore all existing Mypy errors (#9167) 2021-10-14 15:21:40 +02:00
nb Revert "Bump sudachipy version (#9917)" (#10071) 2022-01-17 10:38:37 +01:00
ne 🏷 Add Mypy check to CI and ignore all existing Mypy errors (#9167) 2021-10-14 15:21:40 +02:00
nl Fix Dutch noun chunks to skip overlapping spans (#11275) 2022-08-10 09:49:08 +02:00
pl Merge remote-tracking branch 'upstream/master' into chore/update-develop-from-master-v3.2-1 2021-10-26 11:53:50 +02:00
pt Portuguese noun chunks review (#9559) 2021-11-04 23:55:49 +01:00
ro 🏷 Add Mypy check to CI and ignore all existing Mypy errors (#9167) 2021-10-14 15:21:40 +02:00
ru Switch ru and uk lemmatizers to pymorphy3 (#11345) 2022-08-22 11:27:14 +02:00
sa 🏷 Add Mypy check to CI and ignore all existing Mypy errors (#9167) 2021-10-14 15:21:40 +02:00
si Merge remote-tracking branch 'upstream/master' into chore/update-develop-from-master-v3.2-1 2021-10-26 11:53:50 +02:00
sk 🏷 Add Mypy check to CI and ignore all existing Mypy errors (#9167) 2021-10-14 15:21:40 +02:00
sl Updates to Slovenian language (#11162) 2022-08-05 10:10:18 +02:00
sq 🏷 Add Mypy check to CI and ignore all existing Mypy errors (#9167) 2021-10-14 15:21:40 +02:00
sr 🏷 Add Mypy check to CI and ignore all existing Mypy errors (#9167) 2021-10-14 15:21:40 +02:00
sv Merge remote-tracking branch 'upstream/master' into chore/update-develop-from-master-v3.2-1 2021-10-26 11:53:50 +02:00
ta 🏷 Add Mypy check to CI and ignore all existing Mypy errors (#9167) 2021-10-14 15:21:40 +02:00
te 🏷 Add Mypy check to CI and ignore all existing Mypy errors (#9167) 2021-10-14 15:21:40 +02:00
th Merge remote-tracking branch 'upstream/master' into chore/update-develop-from-master-v3.2-1 2021-10-26 11:53:50 +02:00
ti Format (#9630) 2021-11-05 09:56:26 +01:00
tl 🏷 Add Mypy check to CI and ignore all existing Mypy errors (#9167) 2021-10-14 15:21:40 +02:00
tn 🏷 Add Mypy check to CI and ignore all existing Mypy errors (#9167) 2021-10-14 15:21:40 +02:00
tr Fixed typo in Turkish lang. (#10582) 2022-03-30 13:16:08 +02:00
tt 🏷 Add Mypy check to CI and ignore all existing Mypy errors (#9167) 2021-10-14 15:21:40 +02:00
uk Switch ru and uk lemmatizers to pymorphy3 (#11345) 2022-08-22 11:27:14 +02:00
ur 🏷 Add Mypy check to CI and ignore all existing Mypy errors (#9167) 2021-10-14 15:21:40 +02:00
vi Auto-format code with black (#10687) 2022-04-22 11:24:53 +02:00
xx fix: Add missing comma to examples.py (#10167) 2022-01-30 16:43:29 +09:00
yo 🏷 Add Mypy check to CI and ignore all existing Mypy errors (#9167) 2021-10-14 15:21:40 +02:00
zh Merge remote-tracking branch 'upstream/master' into chore/update-develop-from-master-v3.2-1 2021-10-26 11:53:50 +02:00
__init__.py Remove imports in /lang/__init__.py 2017-05-08 23:58:07 +02:00
char_classes.py add punctuation to grc (#11426) 2022-09-27 11:38:56 +02:00
lex_attrs.py Use tokenizer URL_MATCH pattern in LIKE_URL (#8765) 2021-07-27 12:07:01 +02:00
norm_exceptions.py Tidy up and auto-format 2020-02-18 15:38:18 +01:00
punctuation.py Handle Cyrillic combining diacritics (#10837) 2022-06-28 15:35:32 +02:00
tokenizer_exceptions.py Ignore prefix in suffix matches (#9155) 2021-10-27 13:02:25 +02:00