spaCy/spacy/pipeline
adrianeboyd 6942a6a69b Extend default punct for sentencizer (#4290)
Most of these characters are for languages / writing systems that aren't
supported by spaCy, but I don't think it causes problems to include
them. In the UD evals, Hindi and Urdu improve a lot as expected (from
0-10% to 70-80%) and Persian improves a little (90% to 96%). Tamil
improves in combination with #4288.

The punctuation list is converted to a set internally because of its
increased length.
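
To illustrate why a set helps here, the following is a minimal sketch of a rule-based sentence splitter. It is not the code from `pipes.pyx`: the class name `SimpleSentencizer`, the `split` method, and the abbreviated `DEFAULT_PUNCT_CHARS` list are assumptions made for illustration only, and the real default list added in this commit is much longer.

```python
# Hypothetical, abbreviated punctuation list for illustration only.
# Includes the Devanagari danda and the Urdu full stop alongside ASCII marks.
DEFAULT_PUNCT_CHARS = ["!", ".", "?", "։", "؟", "۔", "।", "॥", "。", "！", "？"]


class SimpleSentencizer:
    """Toy rule-based sentence splitter (not spaCy's implementation)."""

    def __init__(self, punct_chars=None):
        chars = DEFAULT_PUNCT_CHARS if punct_chars is None else punct_chars
        # Stored as a set so that membership tests stay O(1) on average,
        # however long the punctuation list grows.
        self.punct_chars = set(chars)

    def split(self, tokens):
        """Group a list of token strings into a list of sentences."""
        sentences = []
        current = []
        seen_punct = False
        for token in tokens:
            is_punct = token in self.punct_chars
            # A new sentence starts at the first non-punctuation token
            # that follows one or more sentence-final punctuation tokens.
            if seen_punct and not is_punct:
                sentences.append(current)
                current = []
                seen_punct = False
            current.append(token)
            if is_punct:
                seen_punct = True
        if current:
            sentences.append(current)
        return sentences


if __name__ == "__main__":
    splitter = SimpleSentencizer()
    for sentence in splitter.split(["यह", "वाक्य", "है", "।", "Really", "?", "!", "Yes", "."]):
        print(sentence)
```

With a list, every `token in punct_chars` check would scan the entries one by one; with a set, the cost of the check does not grow with the number of punctuation characters.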

Sentence-final punctuation was generated with:

```
unichars -gas '[\p{Sentence_Break=STerm}\p{Sentence_Break=ATerm}]' '\p{Terminal_Punctuation}'
```

See: https://stackoverflow.com/a/9508766/461847

Fixes #4269.
2019-09-14 15:25:48 +02:00
| File | Last commit | Date |
| --- | --- | --- |
| __init__.py | Merge branch 'master' into feature/el-framework | 2019-03-26 11:00:02 +01:00 |
| entityruler.py | Fix typo in docstrings [ci skip] | 2019-08-22 16:24:15 +02:00 |
| functions.py | Tidy up and improve docs and docstrings (#3370) | 2019-03-08 11:42:26 +01:00 |
| hooks.py | 💫 Add better and serializable sentencizer (#3471) | 2019-03-23 15:45:02 +01:00 |
| pipes.pyx | Extend default punct for sentencizer (#4290) | 2019-09-14 15:25:48 +02:00 |