spaCy/spacy
Matthew Honnibal 168fa080b7 Add doc.retokenize() context manager
This patch takes a step towards #1487 by introducing the
doc.retokenize() context manager, to handle merging spans, and soon
splitting tokens.

The idea is to do merging and splitting like this:

with doc.retokenize() as retokenizer:
    for start, end, label in matches:
        retokenizer.merge(doc[start : end], attrs={'ent_type': label})

The retokenizer accumulates the merge requests, and applies them
together at the end of the block. This will allow retokenization to be
more efficient, and much less error prone.

A retokenizer.split() function will then be added, to handle splitting a
single token into multiple tokens. These methods take `Span` and `Token`
objects; if the user wants to go directly from offsets, they can append
to the .merges and .splits lists on the retokenizer.

The doc.merge() method's behaviour remains unchanged, so this patch
should be 100% backwards incompatible (modulo bugs). Internally,
doc.merge() fixes up the arguments (to handle the various deprecated styles),
opens the retokenizer, and makes the single merge.

We can later start making deprecation warnings on direct calls to doc.merge(),
to migrate people to use of the retokenize context manager.
2018-03-31 15:17:13 +02:00
..
cli Revert "Merge branch 'develop' of https://github.com/explosion/spaCy into develop" 2018-03-27 19:23:02 +02:00
data Make spacy/data a package 2017-03-18 20:04:22 +01:00
displacy Don't use deprecated Doc.merge call in displaCy 2018-01-27 11:25:05 +01:00
lang Revert "Merge branch 'develop' of https://github.com/explosion/spaCy into develop" 2018-03-27 19:23:02 +02:00
syntax Add more state Python variables, to make testing easier 2018-03-30 13:26:08 +02:00
tests Add state tests, esp. for split function 2018-03-30 13:25:46 +02:00
tokens Add doc.retokenize() context manager 2018-03-31 15:17:13 +02:00
__init__.pxd * Seems to be working after refactor. Need to wire up more POS tag features, and wire up save/load of POS tags. 2014-10-24 02:23:42 +11:00
__init__.py Remove dummy variable from function calls 2018-01-05 09:37:05 +01:00
__main__.py Revert "Merge branch 'develop' of https://github.com/explosion/spaCy into develop" 2018-03-27 19:23:02 +02:00
_align.pyx Revert "Merge branch 'develop' of https://github.com/explosion/spaCy into develop" 2018-03-27 19:23:02 +02:00
_matcher2_notes.py Revert "Merge branch 'develop' of https://github.com/explosion/spaCy into develop" 2018-03-27 19:23:02 +02:00
_ml.py Revert "Merge branch 'develop' of https://github.com/explosion/spaCy into develop" 2018-03-27 19:23:02 +02:00
about.py Revert "Merge branch 'develop' of https://github.com/explosion/spaCy into develop" 2018-03-27 19:23:02 +02:00
attrs.pxd Fix LANG symbol 2018-02-17 18:10:50 +01:00
attrs.pyx Revert "Merge branch 'develop' of https://github.com/explosion/spaCy into develop" 2018-03-27 19:23:02 +02:00
compat.py Revert "Merge branch 'develop' of https://github.com/explosion/spaCy into develop" 2018-03-27 19:23:02 +02:00
glossary.py Fix typo in glossary (resolves #1964) 2018-02-10 11:58:41 +01:00
gold.pxd Add placeholder fused_tokens to GoldC 2018-03-30 13:26:21 +02:00
gold.pyx Revert "Merge branch 'develop' of https://github.com/explosion/spaCy into develop" 2018-03-27 19:23:02 +02:00
language.py Revert "Merge branch 'develop' of https://github.com/explosion/spaCy into develop" 2018-03-27 19:23:02 +02:00
lemmatizer.py Revert "Merge branch 'develop' of https://github.com/explosion/spaCy into develop" 2018-03-27 19:23:02 +02:00
lexeme.pxd WIP on stringstore change. 27 failures 2017-05-28 14:06:40 +02:00
lexeme.pyx added new lexical feat to lexeme 2018-02-11 18:51:48 +01:00
matcher.pyx Revert "Merge branch 'develop' of https://github.com/explosion/spaCy into develop" 2018-03-27 19:23:02 +02:00
morphology.pxd Revert "Merge branch 'develop' of https://github.com/explosion/spaCy into develop" 2018-03-27 19:23:02 +02:00
morphology.pyx Revert "Merge branch 'develop' of https://github.com/explosion/spaCy into develop" 2018-03-27 19:23:02 +02:00
parts_of_speech.pxd Add support for Universal Dependencies v2.0 2017-03-03 13:17:34 +01:00
parts_of_speech.pyx Tidy up rest 2017-10-27 21:07:59 +02:00
pipeline.pxd Fix names of pipeline components 2017-10-26 12:38:23 +02:00
pipeline.pyx Revert "Merge branch 'develop' of https://github.com/explosion/spaCy into develop" 2018-03-27 19:23:02 +02:00
scorer.py Revert "Merge branch 'develop' of https://github.com/explosion/spaCy into develop" 2018-03-27 19:23:02 +02:00
strings.pxd Try to fix StringStore clean up (see #1506) 2017-11-11 03:11:27 +03:00
strings.pyx Use safer method to get string without hit 2017-11-14 22:58:46 +03:00
structs.pxd Make TokenC.sent_tart an int, to allow ternary value 2017-10-08 19:58:54 +02:00
symbols.pxd Revert "Merge branch 'develop' of https://github.com/explosion/spaCy into develop" 2018-03-27 19:23:02 +02:00
symbols.pyx Revert "Merge branch 'develop' of https://github.com/explosion/spaCy into develop" 2018-03-27 19:23:02 +02:00
tokenizer.pxd Disable tokenizer cache for special-cases. Fixes #1250 2017-10-24 16:08:05 +02:00
tokenizer.pyx Merge pull request #1611 from fsonntag/master 2017-11-29 23:11:23 +01:00
typedefs.pxd Work on changing StringStore to return hashes. 2017-05-28 12:36:27 +02:00
typedefs.pyx Tidy up rest 2017-10-27 21:07:59 +02:00
util.py Revert "Merge branch 'develop' of https://github.com/explosion/spaCy into develop" 2018-03-27 19:23:02 +02:00
vectors.pyx Fix Vectors pickling 2018-03-14 16:59:37 +01:00
vocab.pxd Add Vocab.cfg attr, to hold stuff like oov probs 2017-10-30 16:08:50 +01:00
vocab.pyx Make Vocab.__contains__ work with ints. Fixes #1868 2018-01-23 23:26:47 +01:00