explosion/spaCy

mirror of https://github.com/explosion/spaCy.git synced 2024-09-22 20:09:18 +03:00

16 lines

326 B

Python

Improve Catalan tokenization accuracy (#3225)

* small hyphen clean up for French

* catalan infix similar to french

2019-02-04 12:37:19 +03:00

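# coding: utf8
from __future__ import unicode_literals

from ..punctuation import TOKENIZER_INFIXES
from ..char_classes import ALPHA

# Elision characters: the straight and the typographic apostrophe.
ELISION = " ' ’ ".strip().replace(" ", "").replace("\n", "")
# Add a zero-width infix pattern that matches right after a letter plus an
# elision character when another letter follows (between "l'" and "home").
_infixes = TOKENIZER_INFIXES + [
    r"(?<=[{a}][{el}])(?=[{a}])".format(a=ALPHA, el=ELISION)
]
TOKENIZER_INFIXES = _infixes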
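
The elision pattern above can be tried in isolation. The following is a minimal sketch, not spaCy's tokenizer: it assumes a simplified, hypothetical `ALPHA` range in place of the much larger class in `char_classes`, and the helpers `split_points` and `split_on_elision` are illustrative names that do not exist in spaCy. It only shows where the zero-width pattern matches and what cutting the string at those offsets would give.

```python
import re

# Hypothetical, simplified stand-in for spaCy's ALPHA character class
# (the real one covers far more scripts and characters).
ALPHA = "a-zA-ZÀ-ÖØ-öø-ÿ"
# Same value the module computes: straight and typographic apostrophe.
ELISION = " ' ’ ".strip().replace(" ", "").replace("\n", "")

# Zero-width pattern: preceded by letter + elision, followed by a letter.
ELISION_INFIX = re.compile(
    r"(?<=[{a}][{el}])(?=[{a}])".format(a=ALPHA, el=ELISION)
)


def split_points(text):
    """Return the offsets where the zero-width elision pattern matches."""
    return [m.start() for m in ELISION_INFIX.finditer(text)]


def split_on_elision(text):
    """Cut text at each matched offset (illustration only)."""
    pieces, last = [], 0
    for pos in split_points(text):
        pieces.append(text[last:pos])
        last = pos
    pieces.append(text[last:])
    return pieces


if __name__ == "__main__":
    for word in ["l'home", "d’aigua", "casa", "'hola"]:
        print(word, split_on_elision(word))
```

Because the pattern consumes no characters, the apostrophe stays attached to the preceding letter; a leading apostrophe with no letter before it produces no split point.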