spaCy/spacy/lang
Sofie 9745b0d523 Improve Italian & Urdu tokenization accuracy (#3228)
## Description

1. Added the infix rule already used for French (`d'une`, `j'ai`) to Italian (`c'è`, `l'ha`), raising the tokenization F-score on `it_isdt-ud-train.txt` from 96% to 99%. Added a unit test to check this behaviour.

2. Added an Urdu-specific punctuation character as a suffix, raising the tokenization F-score on `ur_udtb-ud-train.txt` from 94% to 100%. Added a unit test to check this behaviour.
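
The two rules above can be illustrated with a minimal sketch in plain Python, independent of spaCy. It is not the code from this PR. The letter class `LETTERS`, the helper names, and the choice of U+06D4 (the Arabic-script full stop used in Urdu) as the suffix character are all assumptions made for illustration, since the description does not name the character that was added.

```python
import re

# Assumption: a simplified letter class; a real tokenizer would use a much
# broader alphabetic class.
LETTERS = "a-zA-ZàèéìòùÀÈÉÌÒÙ"
ELISION = "'’"

# Zero-width boundary: directly after "letter + apostrophe", before a letter.
ELISION_BOUNDARY = re.compile(
    "(?<=[{a}][{el}])(?=[{a}])".format(a=LETTERS, el=ELISION)
)

# Assumption: U+06D4 stands in for the Urdu punctuation character.
URDU_FULL_STOP = "\u06d4"


def split_elisions(word):
    """Split one whitespace-delimited word at each elision boundary."""
    pieces = []
    start = 0
    for match in ELISION_BOUNDARY.finditer(word):
        pieces.append(word[start:match.start()])
        start = match.start()
    pieces.append(word[start:])
    return pieces


def split_suffixes(word, suffix_chars):
    """Peel trailing suffix characters off a word, one token per character."""
    tail = []
    while len(word) > 1 and word[-1] in suffix_chars:
        tail.insert(0, word[-1])
        word = word[:-1]
    return [word] + tail


def tokenize_italian(text):
    """Whitespace tokenization plus the elision infix rule."""
    tokens = []
    for word in text.split():
        tokens.extend(split_elisions(word))
    return tokens


def tokenize_urdu(text):
    """Whitespace tokenization plus the sentence-final suffix rule."""
    tokens = []
    for word in text.split():
        tokens.extend(split_suffixes(word, URDU_FULL_STOP))
    return tokens


print(tokenize_italian("c'è un libro che l'ha detto"))
# → ["c'", 'è', 'un', 'libro', 'che', "l'", 'ha', 'detto']
print(tokenize_urdu("یہ کتاب ہے" + URDU_FULL_STOP))
```

The infix pattern is zero-width, so the apostrophe stays attached to the elided word (`c'`, `l'`) instead of becoming a token of its own. The suffix rule only affects characters at the end of a word, so the same character inside a word is left alone.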

### Types of change
Enhancement of Italian & Urdu tokenization

## Checklist
- [x] I have submitted the spaCy Contributor Agreement.
- [x] I ran the tests, and all new and existing tests passed.
- [x] My changes don't require a change to the documentation, or if they do, I've added all required information.
2019-02-04 22:39:25 +01:00
ar Replacing regex library with re to increase tokenization speed (#3218) 2019-02-01 18:05:22 +11:00
bn Replacing regex library with re to increase tokenization speed (#3218) 2019-02-01 18:05:22 +11:00
ca Improve Catalan tokenization accuracy (#3225) 2019-02-04 20:37:19 +11:00
da Replacing regex library with re to increase tokenization speed (#3218) 2019-02-01 18:05:22 +11:00
de Replacing regex library with re to increase tokenization speed (#3218) 2019-02-01 18:05:22 +11:00
el Replacing regex library with re to increase tokenization speed (#3218) 2019-02-01 18:05:22 +11:00
en Remove pre-set lemma for "cause" (resolves #2165) 2018-12-14 12:51:18 +01:00
es 💫 Tidy up and auto-format .py files (#2983) 2018-11-30 17:03:03 +01:00
fa Replacing regex library with re to increase tokenization speed (#3218) 2019-02-01 18:05:22 +11:00
fi 💫 Tidy up and auto-format .py files (#2983) 2018-11-30 17:03:03 +01:00
fr Improve Catalan tokenization accuracy (#3225) 2019-02-04 20:37:19 +11:00
ga 💫 Tidy up and auto-format .py files (#2983) 2018-11-30 17:03:03 +01:00
he 💫 Tidy up and auto-format .py files (#2983) 2018-11-30 17:03:03 +01:00
hi 💫 Tidy up and auto-format .py files (#2983) 2018-11-30 17:03:03 +01:00
hr 💫 Tidy up and auto-format .py files (#2983) 2018-11-30 17:03:03 +01:00
hu Replacing regex library with re to increase tokenization speed (#3218) 2019-02-01 18:05:22 +11:00
id Replacing regex library with re to increase tokenization speed (#3218) 2019-02-01 18:05:22 +11:00
it Improve Italian & Urdu tokenization accuracy (#3228) 2019-02-04 22:39:25 +01:00
ja Auto-format 2018-12-18 15:02:11 +01:00
nb Replacing regex library with re to increase tokenization speed (#3218) 2019-02-01 18:05:22 +11:00
nl 💫 Tidy up and auto-format .py files (#2983) 2018-11-30 17:03:03 +01:00
pl 💫 Tidy up and auto-format .py files (#2983) 2018-11-30 17:03:03 +01:00
pt 💫 Tidy up and auto-format .py files (#2983) 2018-11-30 17:03:03 +01:00
ro 💫 Tidy up and auto-format .py files (#2983) 2018-11-30 17:03:03 +01:00
ru Enabling tests/lang/ru/test_lemmatizer.py, fixing a unicode issue (#3084) 2018-12-30 12:10:26 +01:00
si 💫 Tidy up and auto-format .py files (#2983) 2018-11-30 17:03:03 +01:00
sv 💫 Tidy up and auto-format .py files (#2983) 2018-11-30 17:03:03 +01:00
te 💫 Tidy up and auto-format .py files (#2983) 2018-11-30 17:03:03 +01:00
th 💫 Tidy up and auto-format .py files (#2983) 2018-11-30 17:03:03 +01:00
tl Added alpha support for Tagalog language (#3062) 2018-12-18 13:08:38 +01:00
tr 💫 Tidy up and auto-format .py files (#2983) 2018-11-30 17:03:03 +01:00
tt Replacing regex library with re to increase tokenization speed (#3218) 2019-02-01 18:05:22 +11:00
ur Improve Italian & Urdu tokenization accuracy (#3228) 2019-02-04 22:39:25 +01:00
vi 💫 Tidy up and auto-format .py files (#2983) 2018-11-30 17:03:03 +01:00
xx 💫 Tidy up and auto-format .py files (#2983) 2018-11-30 17:03:03 +01:00
zh 💫 Tidy up and auto-format .py files (#2983) 2018-11-30 17:03:03 +01:00
__init__.py Remove imports in /lang/__init__.py 2017-05-08 23:58:07 +02:00
char_classes.py Replacing regex library with re to increase tokenization speed (#3218) 2019-02-01 18:05:22 +11:00
lex_attrs.py Replacing regex library with re to increase tokenization speed (#3218) 2019-02-01 18:05:22 +11:00
norm_exceptions.py 💫 Tidy up and auto-format .py files (#2983) 2018-11-30 17:03:03 +01:00
punctuation.py Replacing regex library with re to increase tokenization speed (#3218) 2019-02-01 18:05:22 +11:00
tag_map.py 💫 Tidy up and auto-format .py files (#2983) 2018-11-30 17:03:03 +01:00
tokenizer_exceptions.py Replacing regex library with re to increase tokenization speed (#3218) 2019-02-01 18:05:22 +11:00