Add docs on filtering overlapping spans for merging (resolves #4352) [ci skip]

Ines Montani 2019-10-01 21:59:50 +02:00
parent 667f294627
commit 475e3188ce
2 changed files with 26 additions and 8 deletions


@@ -324,7 +324,9 @@ class Errors(object):
 E101 = ("NODE_NAME should be a new node and NBOR_NAME should already have "
 "have been declared in previous edges.")
 E102 = ("Can't merge non-disjoint spans. '{token}' is already part of "
-"tokens to merge.")
+"tokens to merge. If you want to find the longest non-overlapping "
+"spans, you can use the util.filter_spans helper:\n"
+"https://spacy.io/api/top-level#util.filter_spans")
 E103 = ("Trying to set conflicting doc.ents: '{span1}' and '{span2}'. A "
 "token can only be part of one entity, so make sure the entities "
 "you're setting don't overlap.")


@@ -1086,6 +1086,14 @@ with doc.retokenize() as retokenizer:
 print("After:", [token.text for token in doc])
 ```
+> #### Tip: merging entities and noun phrases
+>
+> If you need to merge named entities or noun chunks, check out the built-in
+> [`merge_entities`](/api/pipeline-functions#merge_entities) and
+> [`merge_noun_chunks`](/api/pipeline-functions#merge_noun_chunks) pipeline
+> components. When added to your pipeline using `nlp.add_pipe`, they'll take
+> care of merging the spans automatically.
 If an attribute in the `attrs` is a context-dependent token attribute, it will
 be applied to the underlying [`Token`](/api/token). For example `LEMMA`, `POS`
 or `DEP` only apply to a word in context, so they're token attributes. If an
@@ -1094,16 +1102,24 @@ underlying [`Lexeme`](/api/lexeme), the entry in the vocabulary. For example,
 `LOWER` or `IS_STOP` apply to all words of the same spelling, regardless of the
 context.
-<Infobox title="Tip: merging entities and noun phrases">
+<Infobox variant="warning" title="Note on merging overlapping spans">
-If you need to merge named entities or noun chunks, check out the built-in
-[`merge_entities`](/api/pipeline-functions#merge_entities) and
-[`merge_noun_chunks`](/api/pipeline-functions#merge_noun_chunks) pipeline
-components. When added to your pipeline using `nlp.add_pipe`, they'll take care
-of merging the spans automatically.
+If you're trying to merge spans that overlap, spaCy will raise an error because
+it's unclear how the result should look. Depending on the application, you may
+want to match the shortest or longest possible span, so it's up to you to filter
+them. If you're looking for the longest non-overlapping span, you can use the
+[`util.filter_spans`](/api/top-level#util.filter_spans) helper:
+```python
+doc = nlp("I live in Berlin Kreuzberg")
+spans = [doc[3:5], doc[3:4], doc[4:5]]
+filtered_spans = filter_spans(spans)
+```
 </Infobox>
+### Splitting tokens
 The [`retokenizer.split`](/api/doc#retokenizer.split) method allows splitting
 one token into two or more tokens. This can be useful for cases where
 tokenization rules alone aren't sufficient. For example, you might want to split
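
Review note on the hunk above: the new Infobox describes keeping the longest non-overlapping spans. The idea can be sketched in plain Python on `(start, end)` token offsets. This is only an illustration under stated assumptions, not spaCy's implementation of `util.filter_spans`; the function name `filter_longest_spans` and the tuple representation are hypothetical.

```python
def filter_longest_spans(spans):
    """Keep the longest spans that share no token index (hypothetical helper).

    `spans` is a list of (start, end) token offsets with an exclusive end.
    """
    # Prefer longer spans; on equal length, prefer the one that starts earlier.
    ordered = sorted(spans, key=lambda s: (s[1] - s[0], -s[0]), reverse=True)
    kept = []
    seen = set()  # token indices already covered by a kept span
    for start, end in ordered:
        if not any(i in seen for i in range(start, end)):
            kept.append((start, end))
            seen.update(range(start, end))
    # Return the survivors in document order.
    return sorted(kept)


# Offsets mirroring doc[3:5], doc[3:4], doc[4:5] from the Infobox example.
print(filter_longest_spans([(3, 5), (3, 4), (4, 5)]))  # [(3, 5)]
```

With the offsets from the docs example, only the two-token span survives, which is the result the Infobox text describes for "Berlin Kreuzberg".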
@@ -1168,7 +1184,7 @@ with doc.retokenize() as retokenizer:
 <Infobox title="Important note" variant="warning">
 When splitting tokens, the subtoken texts always have to match the original
-token text – or, put differently `''.join(subtokens) == token.text` always needs
+token text – or, put differently `"".join(subtokens) == token.text` always needs
 to hold true. If this wasn't the case, splitting tokens could easily end up
 producing confusing and unexpected results that would contradict spaCy's
 non-destructive tokenization policy.
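
Review note on the hunk above: the invariant in this Infobox is easy to check before calling the retokenizer. The following is a minimal sketch in plain Python, assuming a hypothetical helper named `validate_split`; it is not part of spaCy and only mirrors the `"".join(subtokens) == token.text` condition quoted in the docs.

```python
def validate_split(token_text, subtokens):
    """Raise ValueError unless the subtokens join back to the original text.

    Hypothetical pre-check; mirrors `"".join(subtokens) == token.text`.
    """
    joined = "".join(subtokens)
    if joined != token_text:
        raise ValueError(
            f"Subtokens {subtokens!r} join to {joined!r}, expected {token_text!r}"
        )
    return subtokens


# Passes: the pieces reassemble into the original token text.
validate_split("NewYork", ["New", "York"])

# Fails: inserting a space would alter the original text.
try:
    validate_split("NewYork", ["New", " York"])
except ValueError as err:
    print(err)
```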