spaCy

mirror of https://github.com/explosion/spaCy.git synced 2025-03-13 07:55:49 +03:00

adrianeboyd 6942a6a69b Extend default punct for sentencizer (#4290)	2019-09-14 15:25:48 +02:00

Most of these characters are for languages / writing systems that aren't supported by spacy, but I don't think it causes problems to include them. In the UD evals, Hindi and Urdu improve a lot as expected (from 0-10% to 70-80%) and Persian improves a little (90% to 96%). Tamil improves in combination with #4288.

The punctuation list is converted to a set internally because of its increased length.

Sentence final punctuation generated with:

```
unichars -gas '[\p{Sentence_Break=STerm}\p{Sentence_Break=ATerm}]' '\p{Terminal_Punctuation}'
```

See: https://stackoverflow.com/a/9508766/461847

Fixes #4269.
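The commit message above mentions that the punctuation list is converted to a set internally. A minimal sketch of why that helps, assuming a simplified rule-based splitter: this is not spaCy's actual `Sentencizer` code, and `SAMPLE_PUNCT` and `sentence_starts` are hypothetical names holding only a small hand-picked sample of sentence-final characters.

```python
# Hypothetical sketch, not spaCy's implementation.
# A small hand-picked sample; the real default list is much longer.
SAMPLE_PUNCT = ["!", ".", "?", "।", "۔", "؟"]


def sentence_starts(tokens, punct_chars=SAMPLE_PUNCT):
    """Return one boolean per token: True if the token starts a sentence."""
    # Convert once to a set so each membership test is O(1) on average,
    # regardless of how long the punctuation list grows.
    punct = set(punct_chars)
    starts = []
    seen_punct = False
    for i, token in enumerate(tokens):
        is_punct = token in punct
        if i == 0:
            # The first token always opens a sentence.
            starts.append(True)
        elif seen_punct and not is_punct:
            # First non-punctuation token after sentence-final punctuation.
            starts.append(True)
            seen_punct = False
        else:
            starts.append(False)
        if is_punct:
            seen_punct = True
    return starts


tokens = ["यह", "वाक्य", "है", "।", "Hello", "world", "!", "?", "Bye"]
flags = sentence_starts(tokens)
```

With a plain list, every token would be compared against each punctuation character in turn, so the cost per token grows with the list. With a set, the lookup cost stays roughly constant.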
__init__.py	Merge branch 'master' into feature/el-framework	2019-03-26 11:00:02 +01:00
entityruler.py	Fix typo in docstrings [ci skip]	2019-08-22 16:24:15 +02:00
functions.py	Tidy up and improve docs and docstrings (#3370)	2019-03-08 11:42:26 +01:00
hooks.py	💫 Add better and serializable sentencizer (#3471)	2019-03-23 15:45:02 +01:00
pipes.pyx	Extend default punct for sentencizer (#4290)	2019-09-14 15:25:48 +02:00