spaCy

mirror of https://github.com/explosion/spaCy.git synced 2025-07-11 00:32:40 +03:00

Author	SHA1	Message	Date
Grivaz	39815513e2	Add split one token into several (resolves #2838) (#3253) * Add split one token into several (resolves #2838) * Improve error message for token splitting * Make retokenizer.split() tests use a Token object Change retokenizer.split() to use a Token object, instead of an index. * Pass Token into retokenize.split() Tweak retokenize.split() API so that we pass the `Token` object, not the index. * Fix token.idx in retokenize.split() * Test that token.idx is correct after split * Fix token.idx for split tokens * Fix retokenize.split() * Fix retokenize.split * Fix retokenize.split() test	2019-02-15 01:27:13 +11:00
Grivaz	aeba99ab0d	Introduces a bulk merge function, in order to solve issue #653 (#2696) * Fix comment * Introduce bulk merge to increase performance on many span merges * Sign contributor agreement * Implement pull request suggestions	2018-09-10 16:41:42 +02:00
ines	b59e3b157f	Don't require attrs argument in Doc.retokenize and allow both ints and unicode (resolves #2304)	2018-05-20 15:15:37 +02:00
Matthew Honnibal	abf8b16d71	Add doc.retokenize() context manager (#2172) This patch takes a step towards #1487 by introducing the doc.retokenize() context manager, to handle merging spans, and soon splitting tokens. The idea is to do merging and splitting like this: with doc.retokenize() as retokenizer: for start, end, label in matches: retokenizer.merge(doc[start : end], attrs={'ent_type': label}) The retokenizer accumulates the merge requests, and applies them together at the end of the block. This will allow retokenization to be more efficient, and much less error prone. A retokenizer.split() function will then be added, to handle splitting a single token into multiple tokens. These methods take `Span` and `Token` objects; if the user wants to go directly from offsets, they can append to the .merges and .splits lists on the retokenizer. The doc.merge() method's behaviour remains unchanged, so this patch should be 100% backwards compatible (modulo bugs). Internally, doc.merge() fixes up the arguments (to handle the various deprecated styles), opens the retokenizer, and makes the single merge. We can later start emitting deprecation warnings on direct calls to doc.merge(), to migrate people to the retokenize context manager.	2018-04-03 14:10:35 +02:00

4 Commits
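
The commit message for abf8b16d71 describes a retokenizer that accumulates merge requests and applies them together when the `with` block ends. The sketch below illustrates that accumulate-then-apply pattern on a plain Python list of strings. It is a toy model, not spaCy's implementation: `ToyRetokenizer`, the free function `retokenize`, and the `(start, end)` index arguments are all hypothetical names chosen for illustration, and overlapping merge ranges are assumed not to occur.

```python
from contextlib import contextmanager


class ToyRetokenizer:
    """Hypothetical stand-in: collects merge requests, changes nothing until apply()."""

    def __init__(self, tokens):
        self.tokens = tokens
        self.merges = []  # half-open (start, end) index ranges

    def merge(self, start, end):
        # Only record the request; the token list is left untouched for now.
        self.merges.append((start, end))

    def apply(self):
        # Apply right to left so that pending ranges to the left keep valid indices.
        # Assumption: the recorded ranges do not overlap.
        for start, end in sorted(self.merges, reverse=True):
            self.tokens[start:end] = [" ".join(self.tokens[start:end])]


@contextmanager
def retokenize(tokens):
    retokenizer = ToyRetokenizer(tokens)
    yield retokenizer
    retokenizer.apply()  # reached only when the block exits without an exception


tokens = "I live in New York City near Los Angeles Street".split()
with retokenize(tokens) as retokenizer:
    retokenizer.merge(3, 6)   # "New York City"
    retokenizer.merge(7, 10)  # "Los Angeles Street"
    snapshot = list(tokens)   # still the original ten tokens inside the block

print(tokens)
```

Because every range is expressed against the original indices and applied in one pass at the end, the caller never has to re-compute offsets after each merge, which is one plausible reading of why the commit message calls the batched approach less error prone.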