mirror of https://github.com/explosion/spaCy.git synced 2025-02-24 15:47:33 +03:00
Commit Graph

8 Commits

Author SHA1 Message Date
adrianeboyd
a6e521cd79
Add is_sent_end token property ()
Reconstruction of the original PR  by @MiniLau.

Removes unused `SENT_END` symbol and `IS_SENT_END` from `Matcher` schema
because the Matcher is only going to be able to support `IS_SENT_START`.
2020-04-29 12:53:16 +02:00
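
As an illustration of how a sentence-end flag relates to the sentence-start flags already stored on tokens, here is a minimal pure-Python sketch. It assumes a token ends a sentence when it is the last token or when the following token starts a sentence, and that an unknown start flag (`None`) makes the end flag unknown too. The function name `sent_end_flags` is hypothetical and this is not spaCy's implementation.

```python
from typing import List, Optional


def sent_end_flags(sent_starts: List[Optional[bool]]) -> List[Optional[bool]]:
    """Derive per-token sentence-end flags from sentence-start flags.

    Assumption for this sketch: token i ends a sentence if it is the last
    token, or if token i + 1 starts a sentence. None means "unknown".
    """
    n = len(sent_starts)
    flags: List[Optional[bool]] = []
    for i in range(n):
        if i == n - 1:
            # The final token always closes the last sentence.
            flags.append(True)
        else:
            nxt = sent_starts[i + 1]
            flags.append(None if nxt is None else bool(nxt))
    return flags


starts = [True, False, False, True, False]
print(sent_end_flags(starts))  # [False, False, True, False, True]
```

Deriving the end flag from the start flag means only one boundary annotation has to be stored, which fits the note above that the Matcher only needs `IS_SENT_START`.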
adrianeboyd
a938566b62 Fix Sentencizer.pipe() for empty doc () 2020-01-28 11:36:49 +01:00
adrianeboyd
48ea2e8d0f Restructure Sentencizer to follow Pipe API ()
* Restructure Sentencizer to follow Pipe API

Restructure Sentencizer to follow Pipe API so that it can be scored with
`nlp.evaluate()`.

* Add Sentencizer pipe() test
2019-11-27 16:33:34 +01:00
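
The restructuring described above separates prediction from annotation. The following is a rough sketch of that split, not spaCy's code: `ToyDoc` and `ToySentencizer` are hypothetical stand-ins, tokens are plain strings, and the boundary rule (the first non-punctuation token after sentence-final punctuation starts a sentence) is an assumption made for illustration.

```python
from dataclasses import dataclass, field
from typing import Iterable, Iterator, List


@dataclass
class ToyDoc:
    """Hypothetical stand-in for a document: tokens plus sentence-start flags."""
    words: List[str]
    sent_starts: List[bool] = field(default_factory=list)


class ToySentencizer:
    """Illustrative component with separate predict / set_annotations steps."""

    def __init__(self, punct_chars: Iterable[str] = (".", "!", "?")) -> None:
        self.punct_chars = set(punct_chars)

    def predict(self, docs: List[ToyDoc]) -> List[List[bool]]:
        # Compute guesses without modifying the docs, so they can be scored.
        guesses = []
        for doc in docs:
            doc_guesses = [False] * len(doc.words)
            if doc.words:  # an empty doc simply yields an empty list
                doc_guesses[0] = True
            seen_punct = False
            for i, word in enumerate(doc.words):
                is_punct = word in self.punct_chars
                if seen_punct and not is_punct:
                    doc_guesses[i] = True
                    seen_punct = False
                elif is_punct:
                    seen_punct = True
            guesses.append(doc_guesses)
        return guesses

    def set_annotations(self, docs: List[ToyDoc], guesses: List[List[bool]]) -> None:
        # Write the predicted boundaries back onto the docs.
        for doc, doc_guesses in zip(docs, guesses):
            doc.sent_starts = list(doc_guesses)

    def pipe(self, stream: Iterable[ToyDoc], batch_size: int = 2) -> Iterator[ToyDoc]:
        # Process the stream in batches: predict first, then annotate.
        batch: List[ToyDoc] = []
        for doc in stream:
            batch.append(doc)
            if len(batch) == batch_size:
                self.set_annotations(batch, self.predict(batch))
                yield from batch
                batch = []
        if batch:
            self.set_annotations(batch, self.predict(batch))
            yield from batch


docs = [ToyDoc(["Hello", "world", ".", "Bye", "!"]), ToyDoc([])]
for doc in ToySentencizer().pipe(docs):
    print(doc.sent_starts)
# [True, False, False, True, False]
# []
```

Keeping `predict()` free of side effects is what lets an evaluation routine compare guesses against gold boundaries before anything is written to the documents.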
adrianeboyd
a3509f67d4 Extend unicode character block for Sinhala ()
* Extend unicode character block for Sinhala

* Add sentencizer tests for more languages
2019-10-07 13:17:03 +02:00
Ines Montani
3d8fd4b461 Revert 2019-09-29 17:32:12 +02:00
Ines Montani
c9cd516d96 Move tests out of package ()
* Move tests out of package

* Fix typo
2019-09-28 18:05:00 +02:00
adrianeboyd
6942a6a69b Extend default punct for sentencizer ()
Most of these characters are for languages / writing systems that aren't
supported by spaCy, but I don't think it causes problems to include
them. In the UD evals, Hindi and Urdu improve a lot as expected (from
0-10% to 70-80%) and Persian improves a little (90% to 96%). Tamil
improves in combination with .

The punctuation list is converted to a set internally because of its
increased length.

Sentence final punctuation generated with:

```
unichars -gas '[\p{Sentence_Break=STerm}\p{Sentence_Break=ATerm}]' '\p{Terminal_Punctuation}'
```

See: https://stackoverflow.com/a/9508766/461847

Fixes .
2019-09-14 15:25:48 +02:00
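
To illustrate the set conversion mentioned in the commit above (6942a6a69b), here is a small character-level sketch. It assumes the punctuation arrives as a list and is converted once to a `frozenset` so each membership check is a hash lookup rather than a scan of a long list. The function name `split_on_punct` and the five-character list are hypothetical; the real default list is much longer.

```python
from typing import Iterable, List


def split_on_punct(text: str, punct_chars: Iterable[str]) -> List[str]:
    """Split text after any sentence-final punctuation character."""
    # Convert once: membership tests on a set do not slow down as the
    # punctuation inventory grows.
    punct = frozenset(punct_chars)
    sentences: List[str] = []
    start = 0
    for i, ch in enumerate(text):
        if ch in punct:
            sentence = text[start:i + 1].strip()
            if sentence:
                sentences.append(sentence)
            start = i + 1
    # Keep any trailing text that has no closing punctuation.
    tail = text[start:].strip()
    if tail:
        sentences.append(tail)
    return sentences


# Illustrative subset only: ASCII marks plus the Devanagari danda (U+0964)
# and the Arabic full stop (U+06D4) used in Hindi and Urdu text.
PUNCT = [".", "!", "?", "\u0964", "\u06D4"]
print(split_on_punct("A\u0964 B\u06D4 C", PUNCT))  # ['A।', 'B۔', 'C']
```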
Ines Montani
06bf130890 💫 Add better and serializable sentencizer ()
* Add better serializable sentencizer component

* Replace default factory

* Add tests

* Tidy up

* Pass test

* Update docs
2019-03-23 15:45:02 +01:00