spaCy/spacy/tokens
Adriane Boyd c62fd878a3
Allow Doc.char_span to snap to token boundaries (#5849)
* Allow Doc.char_span to snap to token boundaries

Add a `mode` option to allow `Doc.char_span` to snap to token
boundaries. The `mode` options:

* `strict`: character offsets must match token boundaries (default, same as
before)
* `inside`: all tokens completely within the character span
* `outside`: all tokens at least partially covered by the character span

Add a new helper function `token_by_char` that returns the token
corresponding to a character position in the text. Update
`token_by_start` and `token_by_end` to use `token_by_char` for more
efficient searching.
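
The character-to-token lookup described above can be sketched in pure Python as a binary search over sorted token offsets. This is only an illustration, not the Cython helper from the commit. The name `token_by_char_sketch`, the `starts`/`ends` lists, and the choice to treat whitespace between tokens as belonging to no token are all assumptions made for the example.

```python
from bisect import bisect_right


def token_by_char_sketch(starts, ends, char_idx):
    """Return the index of the token covering char_idx, or -1 if none does.

    starts/ends are hypothetical parallel lists of token character offsets
    (end exclusive), sorted in document order.
    """
    # Find the last token whose start offset is <= char_idx.
    i = bisect_right(starts, char_idx) - 1
    # The position only counts as a hit if it falls before that token's end.
    if i >= 0 and char_idx < ends[i]:
        return i
    return -1


# Offsets for the text "I like New York" (tokens split on whitespace).
starts = [0, 2, 7, 11]
ends = [1, 6, 10, 15]

inside_new = token_by_char_sketch(starts, ends, 8)    # inside "New"
in_whitespace = token_by_char_sketch(starts, ends, 6)  # the space after "like"
```

Because the offsets are sorted, each lookup takes logarithmic rather than linear time in the number of tokens, which is the kind of saving the commit message refers to.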

* Remove unused import

* Rename mode to alignment_mode

Rename `mode` to `alignment_mode` with the options
`strict`/`contract`/`expand` (replacing `strict`/`inside`/`outside`).
Any unrecognized modes are silently converted to `strict`.
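
To make the three alignment modes concrete, here is a minimal pure-Python sketch of the snapping behavior described in this message. It is not the `Doc.char_span` implementation: the function name `align_char_span`, the `(start, end)` offset tuples, and the returned token-index pair are hypothetical, and a linear scan is used for readability.

```python
import re


def align_char_span(offsets, start_char, end_char, alignment_mode="strict"):
    """Map a character span to a (start_token, end_token) pair, end exclusive.

    offsets is a hypothetical list of (start, end) character offsets per token.
    Returns None when no token span satisfies the chosen mode.
    """
    # Per the commit message, unrecognized modes fall back to "strict".
    if alignment_mode not in ("strict", "contract", "expand"):
        alignment_mode = "strict"

    if alignment_mode == "strict":
        # Both character offsets must coincide with token boundaries.
        first = [i for i, (s, _) in enumerate(offsets) if s == start_char]
        last = [i for i, (_, e) in enumerate(offsets) if e == end_char]
        if not first or not last or last[0] < first[0]:
            return None
        return (first[0], last[0] + 1)

    if alignment_mode == "contract":
        # Keep only tokens lying completely within the character span.
        hits = [i for i, (s, e) in enumerate(offsets)
                if s >= start_char and e <= end_char]
    else:
        # "expand": keep every token at least partially covered by the span.
        hits = [i for i, (s, e) in enumerate(offsets)
                if s < end_char and e > start_char]

    if not hits:
        return None
    return (hits[0], hits[-1] + 1)


text = "I like New York"
# Whitespace-delimited token offsets: (0, 1), (2, 6), (7, 10), (11, 15)
offsets = [(m.start(), m.end()) for m in re.finditer(r"\S+", text)]

# Characters 7..13 cover "New Yo", which ends in the middle of "York".
strict_result = align_char_span(offsets, 7, 13, "strict")
contract_result = align_char_span(offsets, 7, 13, "contract")
expand_result = align_char_span(offsets, 7, 13, "expand")
```

With these offsets, `strict` rejects the misaligned span, `contract` shrinks it to the single token "New", and `expand` grows it to cover "New York".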
2020-08-04 13:36:32 +02:00
..
__init__.pxd * Break up tokens.pyx into tokens/doc.pyx, tokens/token.pyx, tokens/spans.pyx 2015-07-13 20:20:58 +02:00
__init__.py DocPallet -> DocBin 2019-09-18 15:15:37 +02:00
_retokenize.pyx Disallow merging 0-length spans 2020-05-22 10:14:34 +02:00
_serialize.py Include Doc.cats in serialization of Doc and DocBin (#4774) 2019-12-06 14:07:39 +01:00
doc.pxd Normalize TokenC.sent_start values for Matcher (#5346) 2020-04-29 12:57:30 +02:00
doc.pyx Allow Doc.char_span to snap to token boundaries (#5849) 2020-08-04 13:36:32 +02:00
morphanalysis.pxd Add header for morphanalysis 2019-03-07 17:24:57 +01:00
morphanalysis.pyx Remove MorphAnalysis __str__ and __repr__ 2020-05-29 14:33:47 +02:00
span.pxd annotate kb_id through ents in doc 2019-03-22 11:36:44 +01:00
span.pyx Add Span index boundary checks (#5861) 2020-08-04 13:35:25 +02:00
token.pxd serialize ENT_ID (#4852) 2020-01-06 14:57:34 +01:00
token.pyx Fix polarity of Token.is_oov and Lexeme.is_oov (#5634) 2020-06-23 13:29:51 +02:00
underscore.py load Underscore state when multiprocessing 2020-02-12 11:50:42 +01:00