spaCy/spacy/tests/pipeline
adrianeboyd 6942a6a69b Extend default punct for sentencizer (#4290)
Most of these characters are for languages / writing systems that aren't
supported by spaCy, but I don't think it causes problems to include
them. In the UD evals, Hindi and Urdu improve a lot as expected (from
0-10% to 70-80%) and Persian improves a little (90% to 96%). Tamil
improves in combination with #4288.

The punctuation list is converted to a set internally because of its
increased length.
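
A minimal sketch of why the set conversion helps, assuming a simplified character-level splitter. The name `split_sentences` and the short `PUNCT_CHARS` sample are hypothetical and are not spaCy's actual implementation or its full default list:

```python
# Hypothetical illustration only; this is not spaCy's Sentencizer code.
# A small sample of sentence-final characters, not the full default list.
PUNCT_CHARS = ["!", ".", "?", "।", "۔", "؟"]


def split_sentences(text, punct_chars=PUNCT_CHARS):
    """Split text at the first non-space character after sentence-final punctuation."""
    # With a long punctuation list, a set keeps each membership check O(1)
    # instead of scanning the whole list for every character.
    punct = set(punct_chars)
    sentences = []
    start = 0
    seen_punct = False
    for i, char in enumerate(text):
        if char in punct:
            seen_punct = True
        elif seen_punct and not char.isspace():
            # A new sentence begins here; close the previous one.
            sentences.append(text[start:i].strip())
            start = i
            seen_punct = False
    tail = text[start:].strip()
    if tail:
        sentences.append(tail)
    return sentences
```

With the danda (`।`) in the set, Hindi text such as `"यह एक वाक्य है। यह दूसरा है।"` splits into two sentences; with only `.`, `!` and `?` it would come back as one.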

Sentence-final punctuation was generated with:

```
unichars -gas '[\p{Sentence_Break=STerm}\p{Sentence_Break=ATerm}]' '\p{Terminal_Punctuation}'
```

See: https://stackoverflow.com/a/9508766/461847

Fixes #4269.
2019-09-14 15:25:48 +02:00
__init__.py Start adding tests for new pipeline management 2017-10-07 01:48:23 +02:00
test_entity_linker.py CLI scripts for entity linking (wikipedia & generic) (#4091) 2019-08-13 15:38:59 +02:00
test_entity_ruler.py Improve token pattern checking without validation (#4105) 2019-08-21 14:00:37 +02:00
test_factories.py 💫 Tidy up and auto-format tests (#2967) 2018-11-27 01:09:36 +01:00
test_pipe_methods.py 💫 Add Language.pipe_labels (#4276) 2019-09-12 10:56:28 +02:00
test_sentencizer.py Extend default punct for sentencizer (#4290) 2019-09-14 15:25:48 +02:00
test_textcat.py 💫 Tidy up and auto-format tests (#2967) 2018-11-27 01:09:36 +01:00