Add docs on filtering overlapping spans for merging (resolves #4352) [ci skip]

Ines Montani 2019-10-01 21:59:50 +02:00
parent 667f294627
commit 475e3188ce
2 changed files with 26 additions and 8 deletions

View File

@@ -324,7 +324,9 @@ class Errors(object):
E101 = ("NODE_NAME should be a new node and NBOR_NAME should already "
        "have been declared in previous edges.")
E102 = ("Can't merge non-disjoint spans. '{token}' is already part of "
-       "tokens to merge.")
+       "tokens to merge. If you want to find the longest non-overlapping "
+       "spans, you can use the util.filter_spans helper:\n"
+       "https://spacy.io/api/top-level#util.filter_spans")
E103 = ("Trying to set conflicting doc.ents: '{span1}' and '{span2}'. A "
        "token can only be part of one entity, so make sure the entities "
        "you're setting don't overlap.")

View File

@@ -1086,6 +1086,14 @@ with doc.retokenize() as retokenizer:
print("After:", [token.text for token in doc])
```
+
+> #### Tip: merging entities and noun phrases
+>
+> If you need to merge named entities or noun chunks, check out the built-in
+> [`merge_entities`](/api/pipeline-functions#merge_entities) and
+> [`merge_noun_chunks`](/api/pipeline-functions#merge_noun_chunks) pipeline
+> components. When added to your pipeline using `nlp.add_pipe`, they'll take
+> care of merging the spans automatically.
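For reference, a minimal sketch of the workflow the tip above describes (an illustrative example, not from the commit: it assumes spaCy v2.x, the `en_core_web_sm` model, and the built-in `merge_entities` pipeline factory):

```python
import spacy

nlp = spacy.load("en_core_web_sm")
nlp.add_pipe(nlp.create_pipe("merge_entities"))

doc = nlp("She works for The New York Times")
# entity spans predicted by the NER model are now single tokens
print([token.text for token in doc])
```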
If an attribute in the `attrs` is a context-dependent token attribute, it will
be applied to the underlying [`Token`](/api/token). For example, `LEMMA`, `POS`
or `DEP` only apply to a word in context, so they're token attributes. If an
@@ -1094,16 +1102,24 @@ underlying [`Lexeme`](/api/lexeme), the entry in the vocabulary. For example,
`LOWER` or `IS_STOP` apply to all words of the same spelling, regardless of the
context.
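To illustrate the distinction, a sketch (assuming a loaded `nlp` pipeline; not part of this commit): `LEMMA` below is applied to the newly merged token only, while a lexical attribute such as `IS_STOP` would instead update the vocabulary entry shared by all tokens with the same spelling:

```python
doc = nlp("I live in New York")
with doc.retokenize() as retokenizer:
    # LEMMA is context-dependent, so it's set on the merged token itself
    retokenizer.merge(doc[3:5], attrs={"LEMMA": "new york"})
```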
-<Infobox title="Tip: merging entities and noun phrases">
+<Infobox variant="warning" title="Note on merging overlapping spans">
-If you need to merge named entities or noun chunks, check out the built-in
-[`merge_entities`](/api/pipeline-functions#merge_entities) and
-[`merge_noun_chunks`](/api/pipeline-functions#merge_noun_chunks) pipeline
-components. When added to your pipeline using `nlp.add_pipe`, they'll take care
-of merging the spans automatically.
+If you're trying to merge spans that overlap, spaCy will raise an error because
+it's unclear how the result should look. Depending on the application, you may
+want to match the shortest or longest possible span, so it's up to you to filter
+them. If you're looking for the longest non-overlapping span, you can use the
+[`util.filter_spans`](/api/top-level#util.filter_spans) helper:
+```python
+from spacy.util import filter_spans
+
+doc = nlp("I live in Berlin Kreuzberg")
+spans = [doc[3:5], doc[3:4], doc[4:5]]
+filtered_spans = filter_spans(spans)  # keeps doc[3:5], the longest span
+```
</Infobox>
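Once filtered, the remaining spans are guaranteed not to overlap and can be merged safely. A short sketch continuing the snippet from the infobox above:

```python
# merge the non-overlapping spans left over after filter_spans
with doc.retokenize() as retokenizer:
    for span in filtered_spans:
        retokenizer.merge(span)
print([token.text for token in doc])  # ["I", "live", "in", "Berlin Kreuzberg"]
```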
### Splitting tokens
The [`retokenizer.split`](/api/doc#retokenizer.split) method allows splitting
one token into two or more tokens. This can be useful for cases where
tokenization rules alone aren't sufficient. For example, you might want to split
@@ -1168,7 +1184,7 @@ with doc.retokenize() as retokenizer:
<Infobox title="Important note" variant="warning">
When splitting tokens, the subtoken texts always have to match the original
-token text, or, put differently, `''.join(subtokens) == token.text` always needs
+token text, or, put differently, `"".join(subtokens) == token.text` always needs
to hold true. If this wasn't the case, splitting tokens could easily end up
producing confusing and unexpected results that would contradict spaCy's
non-destructive tokenization policy.
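To see the invariant in action, here's a sketch of a valid split (assuming a loaded `nlp` pipeline; the heads follow the `(token, subtoken_index)` convention of the `retokenizer.split` API). Since `"New" + "York" == "NewYork"`, the subtoken texts match the original token text:

```python
doc = nlp("I live in NewYork")
with doc.retokenize() as retokenizer:
    # "New" attaches to subtoken 1 ("York"); "York" attaches to "in"
    heads = [(doc[3], 1), doc[2]]
    retokenizer.split(doc[3], ["New", "York"], heads=heads)
print([token.text for token in doc])  # ["I", "live", "in", "New", "York"]
```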