spaCy/spacy/tests/lang
Sofie 9745b0d523 Improve Italian & Urdu tokenization accuracy (#3228)
## Description

1. Added the infix rule already used for French (`d'une`, `j'ai`) to Italian (`c'è`, `l'ha`), raising the tokenization F-score on `it_isdt-ud-train.txt` from 96% to 99%. Added a unit test to check this behaviour.

2. Added an Urdu-specific punctuation character as a suffix, raising the tokenization F-score on `ur_udtb-ud-train.txt` from 94% to 100%. Added a unit test to check this behaviour.
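
The two rules above can be illustrated with a minimal, standalone sketch. It uses only Python's `re` module and is not the actual spaCy implementation: the pattern, the function names `split_elisions` and `split_suffix`, and the choice of U+06D4 (Arabic full stop) as the Urdu character are all assumptions made for illustration, since the description does not name the character that was added.

```python
import re

# Hypothetical infix pattern: a zero-width split point right after
# "letter + apostrophe" when another letter follows (c'è -> c' + è).
# [^\W\d_] matches any Unicode letter.
ELISION_INFIX = re.compile(r"(?<=[^\W\d_]['’])(?=[^\W\d_])")

# Assumption: U+06D4 (ARABIC FULL STOP) stands in for the Urdu
# punctuation character; the PR description does not say which one it is.
ASSUMED_URDU_SUFFIX = "\u06d4"


def split_elisions(text):
    """Split whitespace-separated chunks at elision boundaries."""
    tokens = []
    for chunk in text.split():
        start = 0
        for match in ELISION_INFIX.finditer(chunk):
            tokens.append(chunk[start:match.start()])
            start = match.start()
        tokens.append(chunk[start:])
    return tokens


def split_suffix(chunk, suffix=ASSUMED_URDU_SUFFIX):
    """Detach a trailing punctuation character from a chunk."""
    if len(chunk) > len(suffix) and chunk.endswith(suffix):
        return [chunk[: -len(suffix)], suffix]
    return [chunk]


print(split_elisions("c'è"))   # ["c'", 'è']
print(split_elisions("l'ha"))  # ["l'", 'ha']
print(split_suffix("\u06c1\u06d2\u06d4"))  # word and full stop as two items
```

In spaCy itself such rules live in the per-language punctuation definitions rather than in free functions like these; the sketch only shows where the token boundaries end up.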

### Types of change
Enhancement of Italian & Urdu tokenization

## Checklist
- [x] I have submitted the spaCy Contributor Agreement.
- [x] I ran the tests, and all new and existing tests passed.
- [x] My changes don't require a change to the documentation, or if they do, I've added all required information.
2019-02-04 22:39:25 +01:00
| Name | Last commit | Date |
|---|---|---|
| ar | Tidy up and format remaining files | 2018-11-30 17:43:08 +01:00 |
| bn | 💫 Port master changes over to develop (#2979) | 2018-11-29 16:30:29 +01:00 |
| ca | Improve Italian & Urdu tokenization accuracy (#3228) | 2019-02-04 22:39:25 +01:00 |
| da | 💫 Tidy up and auto-format tests (#2967) | 2018-11-27 01:09:36 +01:00 |
| de | 💫 Port master changes over to develop (#2979) | 2018-11-29 16:30:29 +01:00 |
| el | 💫 Tidy up and auto-format tests (#2967) | 2018-11-27 01:09:36 +01:00 |
| en | Replacing regex library with re to increase tokenization speed (#3218) | 2019-02-01 18:05:22 +11:00 |
| es | 💫 Tidy up and auto-format tests (#2967) | 2018-11-27 01:09:36 +01:00 |
| fi | 💫 Tidy up and auto-format tests (#2967) | 2018-11-27 01:09:36 +01:00 |
| fr | Replacing regex library with re to increase tokenization speed (#3218) | 2019-02-01 18:05:22 +11:00 |
| ga | 💫 Tidy up and auto-format tests (#2967) | 2018-11-27 01:09:36 +01:00 |
| he | 💫 Tidy up and auto-format tests (#2967) | 2018-11-27 01:09:36 +01:00 |
| hu | Tidy up and format remaining files | 2018-11-30 17:43:08 +01:00 |
| id | 💫 Tidy up and auto-format tests (#2967) | 2018-11-27 01:09:36 +01:00 |
| it | Improve Italian & Urdu tokenization accuracy (#3228) | 2019-02-04 22:39:25 +01:00 |
| ja | 💫 Tidy up and auto-format tests (#2967) | 2018-11-27 01:09:36 +01:00 |
| nb | 💫 Tidy up and auto-format tests (#2967) | 2018-11-27 01:09:36 +01:00 |
| nl | 💫 Tidy up and auto-format tests (#2967) | 2018-11-27 01:09:36 +01:00 |
| pt | 💫 Tidy up and auto-format tests (#2967) | 2018-11-27 01:09:36 +01:00 |
| ro | 💫 Tidy up and auto-format tests (#2967) | 2018-11-27 01:09:36 +01:00 |
| ru | Replacing regex library with re to increase tokenization speed (#3218) | 2019-02-01 18:05:22 +11:00 |
| sv | 💫 Tidy up and auto-format tests (#2967) | 2018-11-27 01:09:36 +01:00 |
| th | 💫 Tidy up and auto-format tests (#2967) | 2018-11-27 01:09:36 +01:00 |
| tr | 💫 Tidy up and auto-format tests (#2967) | 2018-11-27 01:09:36 +01:00 |
| tt | 💫 Tidy up and auto-format tests (#2967) | 2018-11-27 01:09:36 +01:00 |
| ur | Improve Italian & Urdu tokenization accuracy (#3228) | 2019-02-04 22:39:25 +01:00 |
| __init__.py | Remove imports in /lang/__init__.py | 2017-05-08 23:58:07 +02:00 |
| test_attrs.py | 💫 Tidy up and auto-format tests (#2967) | 2018-11-27 01:09:36 +01:00 |