spaCy

mirror of https://github.com/explosion/spaCy.git synced 2025-11-23 11:16:01 +03:00

adrianeboyd 6942a6a69b Extend default punct for sentencizer (#4290)	2019-09-14 15:25:48 +02:00

Most of these characters are for languages / writing systems that aren't supported by spacy, but I don't think it causes problems to include them. In the UD evals, Hindi and Urdu improve a lot as expected (from 0-10% to 70-80%) and Persian improves a little (90% to 96%). Tamil improves in combination with #4288.

The punctuation list is converted to a set internally because of its increased length.

Sentence final punctuation generated with:

```
unichars -gas '[\p{Sentence_Break=STerm}\p{Sentence_Break=ATerm}]' '\p{Terminal_Punctuation}'
```

See: https://stackoverflow.com/a/9508766/461847

Fixes #4269.
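
The commit message notes that the punctuation list is converted to a set because of its increased length. The following is a minimal, library-free sketch of why that helps: a rule-based splitter tests every token for membership, which is constant-time on a set. It is not spaCy's implementation; `PUNCT_CHARS` and `split_sentences` are hypothetical names, and the character set is a tiny illustrative subset of what the commit adds.

```python
# Hypothetical stand-in for the sentencizer's default punctuation.
# Only a few characters are listed here for illustration.
PUNCT_CHARS = frozenset({
    ".", "!", "?",
    "\u0964",  # DEVANAGARI DANDA (used in Hindi)
    "\u06d4",  # ARABIC FULL STOP (used in Urdu)
    "\u061f",  # ARABIC QUESTION MARK
})


def split_sentences(tokens, punct_chars=PUNCT_CHARS):
    """Group tokens into sentences.

    A sentence ends after a run of sentence-final punctuation tokens,
    so trailing sequences such as "?!" stay with the sentence they close.
    """
    sentences = []
    current = []
    seen_punct = False
    for token in tokens:
        is_punct = token in punct_chars  # O(1) lookup on a frozenset
        if seen_punct and not is_punct:
            # First non-punctuation token after punctuation starts a new sentence.
            sentences.append(current)
            current = []
            seen_punct = False
        current.append(token)
        if is_punct:
            seen_punct = True
    if current:
        sentences.append(current)
    return sentences
```

Calling `split_sentences(["Hello", "world", ".", "How", "are", "you", "?", "!"])` groups the tokens into two sentences, with both `"?"` and `"!"` attached to the second. Passing a different `punct_chars` set changes which tokens close a sentence.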
__init__.py	Start adding tests for new pipeline management	2017-10-07 01:48:23 +02:00
test_entity_linker.py	CLI scripts for entity linking (wikipedia & generic) (#4091)	2019-08-13 15:38:59 +02:00
test_entity_ruler.py	Improve token pattern checking without validation (#4105)	2019-08-21 14:00:37 +02:00
test_factories.py	💫 Tidy up and auto-format tests (#2967)	2018-11-27 01:09:36 +01:00
test_pipe_methods.py	💫 Add Language.pipe_labels (#4276)	2019-09-12 10:56:28 +02:00
test_sentencizer.py	Extend default punct for sentencizer (#4290)	2019-09-14 15:25:48 +02:00
test_textcat.py	💫 Tidy up and auto-format tests (#2967)	2018-11-27 01:09:36 +01:00