Add docs on filtering overlapping spans for merging (resolves #4352) [ci skip]

Ines Montani 2019-10-01 21:59:50 +02:00
parent 667f294627
commit 475e3188ce
2 changed files with 26 additions and 8 deletions

View File

@@ -324,7 +324,9 @@ class Errors(object):
E101 = ("NODE_NAME should be a new node and NBOR_NAME should already "
        "have been declared in previous edges.")
E102 = ("Can't merge non-disjoint spans. '{token}' is already part of "
-       "tokens to merge.")
+       "tokens to merge. If you want to find the longest non-overlapping "
+       "spans, you can use the util.filter_spans helper:\n"
+       "https://spacy.io/api/top-level#util.filter_spans")
E103 = ("Trying to set conflicting doc.ents: '{span1}' and '{span2}'. A "
        "token can only be part of one entity, so make sure the entities "
        "you're setting don't overlap.")

View File

@@ -1086,6 +1086,14 @@ with doc.retokenize() as retokenizer:
print("After:", [token.text for token in doc])
```
+
+> #### Tip: merging entities and noun phrases
+>
+> If you need to merge named entities or noun chunks, check out the built-in
+> [`merge_entities`](/api/pipeline-functions#merge_entities) and
+> [`merge_noun_chunks`](/api/pipeline-functions#merge_noun_chunks) pipeline
+> components. When added to your pipeline using `nlp.add_pipe`, they'll take
+> care of merging the spans automatically.
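For reference, a minimal sketch of the workflow the tip above describes (an illustrative example, not from the commit: it assumes spaCy v2.x, the `en_core_web_sm` model, and the built-in `merge_entities` pipeline factory):

```python
import spacy

nlp = spacy.load("en_core_web_sm")
nlp.add_pipe(nlp.create_pipe("merge_entities"))

doc = nlp("She works for The New York Times")
# entity spans predicted by the NER model are now single tokens
print([token.text for token in doc])
```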
If an attribute in the `attrs` is a context-dependent token attribute, it will
be applied to the underlying [`Token`](/api/token). For example, `LEMMA`, `POS`
or `DEP` only apply to a word in context, so they're token attributes. If an
@@ -1094,16 +1102,24 @@ underlying [`Lexeme`](/api/lexeme), the entry in the vocabulary. For example,
`LOWER` or `IS_STOP` apply to all words of the same spelling, regardless of the
context.
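To illustrate the distinction, a sketch (assuming a loaded `nlp` pipeline; not part of this commit): `LEMMA` below is applied to the newly merged token only, while a lexical attribute such as `IS_STOP` would instead update the vocabulary entry shared by all tokens with the same spelling:

```python
doc = nlp("I live in New York")
with doc.retokenize() as retokenizer:
    # LEMMA is context-dependent, so it's set on the merged token itself
    retokenizer.merge(doc[3:5], attrs={"LEMMA": "new york"})
```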
-<Infobox title="Tip: merging entities and noun phrases">
+<Infobox variant="warning" title="Note on merging overlapping spans">
-If you need to merge named entities or noun chunks, check out the built-in
-[`merge_entities`](/api/pipeline-functions#merge_entities) and
-[`merge_noun_chunks`](/api/pipeline-functions#merge_noun_chunks) pipeline
-components. When added to your pipeline using `nlp.add_pipe`, they'll take care
-of merging the spans automatically.
+If you're trying to merge spans that overlap, spaCy will raise an error because
+it's unclear how the result should look. Depending on the application, you may
+want to match the shortest or longest possible span, so it's up to you to filter
+them. If you're looking for the longest non-overlapping span, you can use the
+[`util.filter_spans`](/api/top-level#util.filter_spans) helper:
+```python
+from spacy.util import filter_spans
+
+doc = nlp("I live in Berlin Kreuzberg")
+spans = [doc[3:5], doc[3:4], doc[4:5]]
+filtered_spans = filter_spans(spans)  # keeps doc[3:5], the longest span
+```
</Infobox>
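Once filtered, the remaining spans are guaranteed not to overlap and can be merged safely. A short sketch continuing the snippet from the infobox above:

```python
# merge the non-overlapping spans left over after filter_spans
with doc.retokenize() as retokenizer:
    for span in filtered_spans:
        retokenizer.merge(span)
print([token.text for token in doc])  # ["I", "live", "in", "Berlin Kreuzberg"]
```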
### Splitting tokens
The [`retokenizer.split`](/api/doc#retokenizer.split) method allows splitting
one token into two or more tokens. This can be useful for cases where
tokenization rules alone aren't sufficient. For example, you might want to split
@@ -1168,7 +1184,7 @@ with doc.retokenize() as retokenizer:
<Infobox title="Important note" variant="warning">
When splitting tokens, the subtoken texts always have to match the original
-token text, or, put differently, `''.join(subtokens) == token.text` always needs
+token text, or, put differently, `"".join(subtokens) == token.text` always needs
to hold true. If this wasn't the case, splitting tokens could easily end up
producing confusing and unexpected results that would contradict spaCy's
non-destructive tokenization policy.
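To see the invariant in action, here's a sketch of a valid split (assuming a loaded `nlp` pipeline; the heads follow the `(token, subtoken_index)` convention of the `retokenizer.split` API). Since `"New" + "York" == "NewYork"`, the subtoken texts match the original token text:

```python
doc = nlp("I live in NewYork")
with doc.retokenize() as retokenizer:
    # "New" attaches to subtoken 1 ("York"); "York" attaches to "in"
    heads = [(doc[3], 1), doc[2]]
    retokenizer.split(doc[3], ["New", "York"], heads=heads)
print([token.text for token in doc])  # ["I", "live", "in", "New", "York"]
```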