spaCy/spacy/pipeline
Adriane Boyd bf0cdae8d4
Add token_splitter component (#6726)
* Add long_token_splitter component

Add a `long_token_splitter` component for use with transformer
pipelines. This component splits up long tokens like URLs into smaller
tokens. This is particularly relevant for pretrained pipelines with
`strided_spans`, since the user can't change the length of the span
`window` and may not wish to preprocess the input texts.

The `long_token_splitter` splits tokens that are at least
`long_token_length` characters long into smaller tokens of at most
`split_length` characters each.
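
The splitting rule described above can be sketched in plain Python. This is a minimal illustration, not the component's actual implementation: the helper name `split_long_token` is hypothetical, the sketch assumes token length is measured in characters, and it works on strings rather than on a spaCy `Doc`. The parameter names `long_token_length` and `split_length` are taken from the description above; the values used below are arbitrary.

```python
from typing import List


def split_long_token(
    text: str, long_token_length: int, split_length: int
) -> List[str]:
    """Hypothetical helper: chunk one token's text by character count.

    Tokens shorter than `long_token_length` are returned unchanged.
    Longer tokens are cut into consecutive pieces of `split_length`
    characters; the final piece may be shorter.
    """
    # Treat non-positive settings as "splitting disabled" (an assumption).
    if long_token_length <= 0 or split_length <= 0:
        return [text]
    if len(text) < long_token_length:
        return [text]
    return [
        text[start : start + split_length]
        for start in range(0, len(text), split_length)
    ]


# Example values chosen for illustration only.
tokens = ["See", "https://example.com/a/very/long/path", "now"]
pieces = [
    piece
    for token in tokens
    for piece in split_long_token(token, long_token_length=25, split_length=10)
]
print(pieces)
```

Short tokens pass through untouched, and concatenating the pieces of a split token reproduces its original text, so no characters are lost or reordered.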

Notes:

* Since this is intended for use as the first component in a pipeline,
the token splitter does not try to preserve any token annotation.
* API docs to come when the API is stable.

* Adjust API, add test

* Fix name in factory
2021-01-17 19:54:41 +08:00
| Name | Last commit | Date |
| --- | --- | --- |
| `_parser_internals` | Fix assertion in default get oracle sequence usage (#6738) | 2021-01-16 16:07:39 +01:00 |
| `__init__.py` | multi-label textcat component (#6474) | 2021-01-06 13:07:14 +11:00 |
| `attributeruler.py` | Tidy up and auto-format | 2021-01-05 13:41:53 +11:00 |
| `dep_parser.pyx` | refer to `_parser_internals.nonproj.DELIMITER` | 2021-01-07 18:58:13 +01:00 |
| `entity_linker.py` | fix embed_size in Entity Linker architecture (#6343) | 2020-11-04 22:20:13 +01:00 |
| `entityruler.py` | Merge branch 'master' into pr/6444 | 2020-12-09 11:09:40 +11:00 |
| `functions.py` | Add token_splitter component (#6726) | 2021-01-17 19:54:41 +08:00 |
| `lemmatizer.py` | Use logger.warning instead of logger.warn (#6596) | 2020-12-21 08:25:10 +08:00 |
| `morphologizer.pyx` | Handle unset token.morph in Morphologizer (#6704) | 2021-01-15 17:20:10 +01:00 |
| `multitask.pyx` | remove labels from constructor | 2020-11-11 21:34:12 +01:00 |
| `ner.pyx` | Getting scores out of beam_ner (#6575) | 2021-01-06 12:02:32 +01:00 |
| `pipe.pxd` | TrainablePipe (#6213) | 2020-10-08 21:33:49 +02:00 |
| `pipe.pyx` | TrainablePipe (#6213) | 2020-10-08 21:33:49 +02:00 |
| `sentencizer.pyx` | Handle missing reference values in scorer (#6286) | 2020-11-03 15:47:18 +01:00 |
| `senter.pyx` | Handle missing reference values in scorer (#6286) | 2020-11-03 15:47:18 +01:00 |
| `tagger.pyx` | Sync missing and misaligned values in Tagger loss (#6689) | 2021-01-10 11:30:37 +11:00 |
| `textcat_multilabel.py` | Tidy up and auto-format | 2021-01-15 11:57:36 +11:00 |
| `textcat.py` | Tidy up and auto-format | 2021-01-15 11:57:36 +11:00 |
| `tok2vec.py` | Revert added_strings change (#6236) | 2020-10-10 18:55:07 +02:00 |
| `trainable_pipe.pxd` | Revert added_strings change (#6236) | 2020-10-10 18:55:07 +02:00 |
| `trainable_pipe.pyx` | always return losses | 2020-10-14 15:00:49 +02:00 |
| `transition_parser.pxd` | TrainablePipe (#6213) | 2020-10-08 21:33:49 +02:00 |
| `transition_parser.pyx` | Add beam_parser and beam_ner components for v3 (#6369) | 2020-12-13 09:08:32 +08:00 |