spaCy/punctuation.py at 6cdb7b2e0416da26d87b0aa57db700817e1ad763 - spaCy - Gitea

explosion/spaCy

mirror of https://github.com/explosion/spaCy.git synced 2024-12-26 18:06:29 +03:00

Sofie 9745b0d523 Improve Italian & Urdu tokenization accuracy (#3228 )

## Description

1. Added the same infix rule as in French (`d'une`, `j'ai`) for Italian (`c'è`, `l'ha`), bringing F-score on `it_isdt-ud-train.txt` from 96% to 99%. Added unit test to check this behaviour.

2. Added specific Urdu punctuation character as suffix, improving F-score on `ur_udtb-ud-train.txt` from 94% to 100%. Added unit test to check this behaviour.

### Types of change
Enhancement of Italian & Urdu tokenization

## Checklist
- [x] I have submitted the spaCy Contributor Agreement.
- [x] I ran the tests, and all new and existing tests passed.
- [x] My changes don't require a change to the documentation, or if they do, I've added all required information.

2019-02-04 22:39:25 +01:00

16 lines

326 B

Python

Raw Blame History

This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.

 # coding: utf8
 from __future__ import unicode_literals
 from ..punctuation import TOKENIZER_INFIXES
 from ..char_classes import ALPHA
 ELISION = " ' ’ ".strip().replace(" ", "").replace("\n", "")
 _infixes = TOKENIZER_INFIXES + [
     r"(?<=[{a}][{el}])(?=[{a}])".format(a=ALPHA, el=ELISION)
 ]
 TOKENIZER_INFIXES = _infixes