spaCy/spacy/tests/pipeline
Adriane Boyd bf0cdae8d4
Add token_splitter component (#6726)
* Add long_token_splitter component

Add a `long_token_splitter` component for use with transformer
pipelines. This component splits up long tokens like URLs into smaller
tokens. This is particularly relevant for pretrained pipelines with
`strided_spans`, since the user can't change the length of the span
`window` and may not wish to preprocess the input texts.

The `long_token_splitter` splits tokens that are at least
`long_token_length` characters long into smaller tokens of
`split_length` characters each.
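
The splitting rule described above can be sketched in plain Python. This is
only an illustration of the chunking logic, not the component's actual code:
the function name `split_long_token` and the default values are assumptions
made for the example, and the real component operates on `Doc` tokens rather
than on strings.

```python
from typing import List


def split_long_token(
    text: str,
    long_token_length: int = 25,  # assumed default, for illustration only
    split_length: int = 10,       # assumed default, for illustration only
) -> List[str]:
    """Hypothetical helper: split one token's text into fixed-size pieces.

    Texts shorter than `long_token_length` characters are returned unchanged.
    Longer texts are cut into consecutive pieces of `split_length` characters;
    the final piece may be shorter.
    """
    if split_length < 1:
        raise ValueError("split_length must be at least 1")
    if len(text) < long_token_length:
        return [text]
    return [text[i : i + split_length] for i in range(0, len(text), split_length)]


if __name__ == "__main__":
    # A 10-character token with a threshold of 5 and pieces of 3 characters.
    print(split_long_token("abcdefghij", long_token_length=5, split_length=3))
    # → ['abc', 'def', 'ghi', 'j']
```

Joining the pieces always reproduces the original text, so only the token
boundaries change.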

Notes:

* Since this is intended for use as the first component in a pipeline,
the token splitter does not try to preserve any token annotation.
* API docs to come when the API is stable.

* Adjust API, add test

* Fix name in factory
2021-01-17 19:54:41 +08:00
__init__.py Revert #4334 2019-09-29 17:32:12 +02:00
test_analysis.py Simplify pipe analysis 2020-08-01 13:40:06 +02:00
test_attributeruler.py Tidy up and auto-format 2021-01-05 13:41:53 +11:00
test_entity_linker.py adding tests for trained models to ensure predict reproducibility 2020-10-13 21:07:13 +02:00
test_entity_ruler.py Tidy up and auto-format 2021-01-05 13:41:53 +11:00
test_functions.py Add token_splitter component (#6726) 2021-01-17 19:54:41 +08:00
test_initialize.py Test with default value 2020-09-29 17:00:40 +02:00
test_lemmatizer.py Use logger.warning instead of logger.warn (#6596) 2020-12-21 08:25:10 +08:00
test_models.py call NumpyOps instead of get_current_ops() 2020-10-14 16:55:00 +02:00
test_morphologizer.py Handle unset token.morph in Morphologizer (#6704) 2021-01-15 17:20:10 +01:00
test_pipe_factories.py Tidy up and auto-format 2021-01-05 13:41:53 +11:00
test_pipe_methods.py Fix typo in test 2020-10-09 18:00:21 +02:00
test_sentencizer.py Refactor Docs.is_ flags (#6044) 2020-09-17 00:14:01 +02:00
test_senter.py adding tests for trained models to ensure predict reproducibility 2020-10-13 21:07:13 +02:00
test_tagger.py Sync missing and misaligned values in Tagger loss (#6689) 2021-01-10 11:30:37 +11:00
test_textcat.py Fix test 2021-01-15 12:51:02 +11:00
test_tok2vec.py Fix types of Tok2Vec encoding architectures (#6442) 2021-01-07 16:39:27 +11:00