mirror of
https://github.com/explosion/spaCy.git
synced 2025-01-24 00:04:15 +03:00
abf8b16d71
This patch takes a step towards #1487 by introducing the `doc.retokenize()` context manager, which handles merging spans and will soon handle splitting tokens. The idea is to do merging and splitting like this: `with doc.retokenize() as retokenizer: for start, end, label in matches: retokenizer.merge(doc[start : end], attrs={'ent_type': label})` The retokenizer accumulates the merge requests and applies them together at the end of the block. This will allow retokenization to be more efficient and much less error-prone. A `retokenizer.split()` function will then be added to handle splitting a single token into multiple tokens. These methods take `Span` and `Token` objects; users who want to work directly from offsets can append to the `.merges` and `.splits` lists on the retokenizer. The behaviour of the `doc.merge()` method remains unchanged, so this patch should be 100% backwards compatible (modulo bugs). Internally, `doc.merge()` fixes up the arguments (to handle the various deprecated styles), opens the retokenizer, and makes the single merge. We can later start raising deprecation warnings on direct calls to `doc.merge()`, to migrate people to the retokenize context manager. |
||
---|---|---|
__init__.pxd | ||
__init__.py | ||
_retokenize.pyx | ||
doc.pxd | ||
doc.pyx | ||
printers.py | ||
span.pxd | ||
span.pyx | ||
token.pxd | ||
token.pyx | ||
underscore.py |
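
The commit message above describes a retokenizer that accumulates merge requests and applies them together when the `with` block exits. The following is a minimal pure-Python sketch of that pattern only. It is not the code in `_retokenize.pyx`: `ToyRetokenizer`, the free function `retokenize()`, and the plain list of strings standing in for a `Doc` are all hypothetical names introduced for illustration, and overlapping spans are assumed not to occur.

```python
from contextlib import contextmanager


class ToyRetokenizer:
    """Hypothetical stand-in that queues merges over a list of strings."""

    def __init__(self, tokens):
        self.tokens = tokens
        self.merges = []      # queued (start, end, attrs) requests
        self.ent_types = {}   # merged text -> label, filled in on apply

    def merge(self, start, end, attrs=None):
        # Only record the request; the token list is left untouched for now.
        self.merges.append((start, end, attrs or {}))

    def apply(self):
        # Apply right to left so the offsets of pending merges stay valid.
        # Assumes the queued spans do not overlap.
        for start, end, attrs in sorted(self.merges, key=lambda m: m[0], reverse=True):
            merged = " ".join(self.tokens[start:end])
            self.tokens[start:end] = [merged]
            if "ent_type" in attrs:
                self.ent_types[merged] = attrs["ent_type"]
        self.merges.clear()


@contextmanager
def retokenize(tokens):
    retokenizer = ToyRetokenizer(tokens)
    yield retokenizer
    # Reached only if the block finished without raising.
    retokenizer.apply()


tokens = ["I", "live", "in", "New", "York", "City", "near", "Central", "Park"]
matches = [(3, 6, "GPE"), (7, 9, "LOC")]

with retokenize(tokens) as retokenizer:
    for start, end, label in matches:
        retokenizer.merge(start, end, attrs={"ent_type": label})

print(tokens)
```

Because every offset in `matches` refers to the original token positions, deferring the edits means the caller never has to recompute indices after an earlier merge, which is the error-prone step that batching is meant to remove.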