mirror of
https://github.com/explosion/spaCy.git
synced 2025-10-24 20:51:30 +03:00
Most of these characters are for languages / writing systems that aren't supported by spacy, but I don't think it causes problems to include them. In the UD evals, Hindi and Urdu improve a lot as expected (from 0-10% to 70-80%) and Persian improves a little (90% to 96%). Tamil improves in combination with #4288.

The punctuation list is converted to a set internally because of its increased length.

Sentence final punctuation generated with:

```
unichars -gas '[\p{Sentence_Break=STerm}\p{Sentence_Break=ATerm}]' '\p{Terminal_Punctuation}'
```

See: https://stackoverflow.com/a/9508766/461847

Fixes #4269.

| Name |
|---|
| .. |
| __init__.py |
| test_entity_linker.py |
| test_entity_ruler.py |
| test_factories.py |
| test_pipe_methods.py |
| test_sentencizer.py |
| test_textcat.py |
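
The commit message above mentions that the punctuation list is converted to a set internally because of its length. A minimal sketch of why that matters, assuming a simplified rule-based splitter: this is not spaCy's actual `Sentencizer` code. The names `SENT_FINAL_PUNCT` and `split_sentences` are hypothetical, and the character set is a small hand-picked subset, not the generated list from the commit.

```python
# Hypothetical subset of sentence-final punctuation. The real list in the
# commit is far longer, which is why a set (O(1) membership) is preferable
# to scanning a list for every character.
SENT_FINAL_PUNCT = frozenset([
    ".", "!", "?",
    "\u0964",  # DEVANAGARI DANDA (used in Hindi)
    "\u06D4",  # ARABIC FULL STOP (used in Urdu)
    "\u061F",  # ARABIC QUESTION MARK
    "\u3002",  # IDEOGRAPHIC FULL STOP
])


def split_sentences(text, punct_chars=SENT_FINAL_PUNCT):
    """Split text after runs of sentence-final punctuation.

    A new sentence starts at the first non-space character that follows
    one or more punctuation characters. This is a naive rule: it will
    also split inside things like "3.14".
    """
    sentences = []
    start = 0
    seen_punct = False
    for i, ch in enumerate(text):
        if ch in punct_chars:
            seen_punct = True
        elif seen_punct and not ch.isspace():
            # First real character after the punctuation run: close the
            # previous sentence here and start a new one.
            sentences.append(text[start:i].strip())
            start = i
            seen_punct = False
    tail = text[start:].strip()
    if tail:
        sentences.append(tail)
    return sentences


hindi = "यह एक वाक्य है। यह दूसरा वाक्य है।"
for sent in split_sentences(hindi):
    print(sent)
```

With only ASCII punctuation in the set, the Hindi example would come back as a single sentence, which is consistent with the low baseline scores the commit message reports for Hindi and Urdu.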