spaCy/spacy/tests/doc
Adriane Boyd c62fd878a3
Allow Doc.char_span to snap to token boundaries (#5849)
* Allow Doc.char_span to snap to token boundaries

Add a `mode` option to allow `Doc.char_span` to snap to token
boundaries. The `mode` options:

* `strict`: character offsets must match token boundaries (default, same as
before)
* `inside`: all tokens completely within the character span
* `outside`: all tokens at least partially covered by the character span

Add a new helper function `token_by_char` that returns the token
corresponding to a character position in the text. Update
`token_by_start` and `token_by_end` to use `token_by_char` for more
efficient searching.
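
The lookup described above can be illustrated with a minimal, standalone sketch. It is not spaCy's actual implementation (which lives in Cython); it assumes tokens are given as a sorted list of hypothetical `(start, end)` character offsets and uses a binary search, which is one plausible reading of "more efficient searching".

```python
from bisect import bisect_right
from typing import List, Tuple


def token_by_char(offsets: List[Tuple[int, int]], char_idx: int) -> int:
    """Return the index of the token whose [start, end) range covers
    char_idx, or -1 if the position falls between tokens or out of range.

    `offsets` is a hypothetical stand-in for a Doc: one (start, end)
    pair per token, sorted by start offset.
    """
    starts = [start for start, _ in offsets]
    # Rightmost token whose start offset is <= char_idx.
    i = bisect_right(starts, char_idx) - 1
    if i >= 0 and offsets[i][0] <= char_idx < offsets[i][1]:
        return i
    return -1


# "Hello world": "Hello" spans 0-5, "world" spans 6-11.
example_offsets = [(0, 5), (6, 11)]
first = token_by_char(example_offsets, 4)   # inside "Hello"
gap = token_by_char(example_offsets, 5)     # the space between tokens
second = token_by_char(example_offsets, 6)  # first character of "world"
```

With a helper like this, finding the token that starts or ends at a given offset reduces to one lookup plus a boundary check, rather than a scan over all tokens.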

* Remove unused import

* Rename mode to alignment_mode

Rename `mode` to `alignment_mode` with the options
`strict`/`contract`/`expand`. Any unrecognized modes are silently
converted to `strict`.
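
To make the three alignment modes concrete, here is a hedged, pure-Python sketch of the snapping rules as the commit message describes them. The function name `snap_char_span` and the `(start, end)` offset list are assumptions made for illustration; the real `Doc.char_span` operates on a `Doc` and returns a `Span` or `None`.

```python
from typing import List, Optional, Tuple


def snap_char_span(
    offsets: List[Tuple[int, int]],
    start: int,
    end: int,
    alignment_mode: str = "strict",
) -> Optional[Tuple[int, int]]:
    """Map a character span to a (start_token, end_token) pair, end exclusive.

    Hypothetical helper, not spaCy's code. `offsets` holds one
    (start, end) character range per token, sorted by start.
    """
    if alignment_mode not in ("strict", "contract", "expand"):
        # Per the commit message, unrecognized modes fall back to strict.
        alignment_mode = "strict"

    if alignment_mode == "expand":
        # Tokens at least partially covered by the character span.
        covered = [i for i, (s, e) in enumerate(offsets) if s < end and e > start]
    else:
        # Tokens completely within the character span.
        covered = [i for i, (s, e) in enumerate(offsets) if s >= start and e <= end]

    if not covered:
        return None
    if alignment_mode == "strict":
        # Character offsets must coincide with token boundaries.
        if offsets[covered[0]][0] != start or offsets[covered[-1]][1] != end:
            return None
    return covered[0], covered[-1] + 1


# "I like New York": I (0-1), like (2-6), New (7-10), York (11-15).
toks = [(0, 1), (2, 6), (7, 10), (11, 15)]
strict_ok = snap_char_span(toks, 7, 15)               # aligned: "New York"
strict_bad = snap_char_span(toks, 8, 15)              # starts mid-token
contracted = snap_char_span(toks, 8, 15, "contract")  # only "York" fits
expanded = snap_char_span(toks, 8, 13, "expand")      # grows to "New York"
```

The linear scan keeps the sketch short; a real implementation would locate the boundary tokens directly, for example with a character-to-token lookup.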
2020-08-04 13:36:32 +02:00
__init__.py Revert #4334 2019-09-29 17:32:12 +02:00
test_add_entities.py Fix test imports 2019-09-29 17:34:56 +02:00
test_array.py Tidy up and auto-format 2020-03-25 12:28:12 +01:00
test_creation.py Tidy up and auto-format 2020-05-21 14:14:01 +02:00
test_doc_api.py Add strings and ENT_KB_ID to Doc serialization (#5691) 2020-07-02 17:11:57 +02:00
test_morphanalysis.py Revert #4334 2019-09-29 17:32:12 +02:00
test_pickle_doc.py Revert #4334 2019-09-29 17:32:12 +02:00
test_retokenize_merge.py Disallow merging 0-length spans 2020-05-22 10:14:34 +02:00
test_retokenize_split.py Fix realloc in retokenizer.split() (#4606) 2019-11-11 16:26:46 +01:00
test_span.py Allow Doc.char_span to snap to token boundaries (#5849) 2020-08-04 13:36:32 +02:00
test_to_json.py Revert #4334 2019-09-29 17:32:12 +02:00
test_token_api.py Tidy up and auto-format 2020-05-21 14:14:01 +02:00
test_underscore.py use clean_underscore fixture 2020-02-23 15:49:20 +01:00