spaCy/tokenizer_exceptions.py at ae7c728c5f76d09f77981132f93702ecdfbeab1f - spaCy - Gitea

mirror of https://github.com/explosion/spaCy.git synced 2025-07-12 09:12:21 +03:00

Sofie 9a478b6db8 Clean up of char classes, few tokenizer fixes and faster default French tokenizer (#3293)

* splitting up latin unicode interval

* removing hyphen as infix for French

* adding failing test for issue 1235

* test for issue #3002 which now works

* partial fix for issue #2070

* keep the hyphen as infix for French (as it was)

* restore french expressions with hyphen as infix (as it was)

* added succeeding unit test for Issue #2656

* Fix issue #2822 with custom Italian exception

* Fix issue #2926 by allowing numbers right before infix /

* remove duplicate

* remove xfail for Issue #2179 fixed by Matt

* adjust documentation and remove reference to regex lib

2019-02-20 22:10:13 +01:00

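# coding: utf8
from __future__ import unicode_literals

from ...symbols import ORTH, LEMMA

# Special cases keyed by surface string: the Italian form "po'" stays
# a single token (ORTH) and is lemmatized as "poco" (LEMMA).
_exc = {
    "po'": [{ORTH: "po'", LEMMA: "poco"}]
}

# Exported under the name used for tokenizer exception tables.
TOKENIZER_EXCEPTIONS = _exc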
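
The effect of an exception table like this one can be illustrated without spaCy. The sketch below is a toy whitespace tokenizer, not spaCy's actual tokenizer: `ORTH` and `LEMMA` are plain-string stand-ins for the symbols imported above, and `tokenize` with its naive apostrophe rule is a hypothetical helper. It only shows the general idea that an exact match in the exception table overrides the default splitting.

```python
# Hypothetical stand-ins for spaCy's ORTH and LEMMA symbols (assumption:
# plain strings are enough for illustration).
ORTH = "orth"
LEMMA = "lemma"

TOKENIZER_EXCEPTIONS = {
    "po'": [{ORTH: "po'", LEMMA: "poco"}],
}


def tokenize(text, exceptions):
    """Toy tokenizer: split on whitespace, let exceptions override defaults."""
    tokens = []
    for chunk in text.split():
        if chunk in exceptions:
            # Exact match: emit the predefined token attributes as-is.
            tokens.extend(dict(attrs) for attrs in exceptions[chunk])
        elif chunk.endswith("'") and len(chunk) > 1:
            # Naive default rule: split a trailing apostrophe into its own token.
            tokens.append({ORTH: chunk[:-1], LEMMA: chunk[:-1]})
            tokens.append({ORTH: "'", LEMMA: "'"})
        else:
            tokens.append({ORTH: chunk, LEMMA: chunk})
    return tokens


with_exc = tokenize("un po' di pane", TOKENIZER_EXCEPTIONS)
without_exc = tokenize("un po' di pane", {})

print([t[ORTH] for t in with_exc])      # ['un', "po'", 'di', 'pane']
print([t[LEMMA] for t in with_exc])     # ['un', 'poco', 'di', 'pane']
print([t[ORTH] for t in without_exc])   # ['un', 'po', "'", 'di', 'pane']
```

Without the exception, the toy default rule splits the apostrophe off and loses the link to the lemma "poco"; with it, the form survives as one token carrying the intended lemma.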