mirror of https://github.com/explosion/spaCy.git, synced 2025-01-26 17:24:41 +03:00
Add docs on filtering overlapping spans for merging (resolves #4352) [ci skip]
This commit is contained in:
parent 667f294627
commit 475e3188ce
@@ -324,7 +324,9 @@ class Errors(object):
     E101 = ("NODE_NAME should be a new node and NBOR_NAME should already have "
             "have been declared in previous edges.")
     E102 = ("Can't merge non-disjoint spans. '{token}' is already part of "
-            "tokens to merge.")
+            "tokens to merge. If you want to find the longest non-overlapping "
+            "spans, you can use the util.filter_spans helper:\n"
+            "https://spacy.io/api/top-level#util.filter_spans")
     E103 = ("Trying to set conflicting doc.ents: '{span1}' and '{span2}'. A "
             "token can only be part of one entity, so make sure the entities "
             "you're setting don't overlap.")
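For illustration, a minimal sketch of the situation the E102 message describes: merging two spans that share a token inside `doc.retokenize()`. The sentence, the span indices and the blank English pipeline are assumptions made for this example.

```python
import spacy

nlp = spacy.blank("en")
doc = nlp("I live in Berlin Kreuzberg")

with doc.retokenize() as retokenizer:
    retokenizer.merge(doc[3:5])  # "Berlin Kreuzberg"
    retokenizer.merge(doc[4:5])  # "Kreuzberg" overlaps the span above
# When the with-block exits and the merges are applied, spaCy raises E102,
# because "Kreuzberg" is already part of the tokens to merge.
```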
@@ -1086,6 +1086,14 @@ with doc.retokenize() as retokenizer:
     print("After:", [token.text for token in doc])
 ```
 
+> #### Tip: merging entities and noun phrases
+>
+> If you need to merge named entities or noun chunks, check out the built-in
+> [`merge_entities`](/api/pipeline-functions#merge_entities) and
+> [`merge_noun_chunks`](/api/pipeline-functions#merge_noun_chunks) pipeline
+> components. When added to your pipeline using `nlp.add_pipe`, they'll take
+> care of merging the spans automatically.
+
 If an attribute in the `attrs` is a context-dependent token attribute, it will
 be applied to the underlying [`Token`](/api/token). For example `LEMMA`, `POS`
 or `DEP` only apply to a word in context, so they're token attributes. If an
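As a rough usage sketch of the tip above: the component names are the ones registered for spaCy v2's built-in pipeline functions, and the `en_core_web_sm` model (along with the expected output) is an assumption for the example.

```python
import spacy

nlp = spacy.load("en_core_web_sm")
nlp.add_pipe(nlp.create_pipe("merge_entities"))
# merge_noun_chunks can be added the same way:
# nlp.add_pipe(nlp.create_pipe("merge_noun_chunks"))

doc = nlp("I like David Bowie")
print([token.text for token in doc])
# Multi-word entities now come back as single tokens,
# e.g. ['I', 'like', 'David Bowie']
```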
@@ -1094,16 +1102,24 @@ underlying [`Lexeme`](/api/lexeme), the entry in the vocabulary. For example,
 `LOWER` or `IS_STOP` apply to all words of the same spelling, regardless of the
 context.
 
-<Infobox title="Tip: merging entities and noun phrases">
+<Infobox variant="warning" title="Note on merging overlapping spans">
 
-If you need to merge named entities or noun chunks, check out the built-in
-[`merge_entities`](/api/pipeline-functions#merge_entities) and
-[`merge_noun_chunks`](/api/pipeline-functions#merge_noun_chunks) pipeline
-components. When added to your pipeline using `nlp.add_pipe`, they'll take care
-of merging the spans automatically.
+If you're trying to merge spans that overlap, spaCy will raise an error because
+it's unclear how the result should look. Depending on the application, you may
+want to match the shortest or longest possible span, so it's up to you to filter
+them. If you're looking for the longest non-overlapping span, you can use the
+[`util.filter_spans`](/api/top-level#util.filter_spans) helper:
+
+```python
+doc = nlp("I live in Berlin Kreuzberg")
+spans = [doc[3:5], doc[3:4], doc[4:5]]
+filtered_spans = filter_spans(spans)
+```
 
 </Infobox>
 
 ### Splitting tokens
 
 The [`retokenizer.split`](/api/doc#retokenizer.split) method allows splitting
 one token into two or more tokens. This can be useful for cases where
 tokenization rules alone aren't sufficient. For example, you might want to split
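A slightly fuller sketch of the snippet added in this infobox, showing where `filter_spans` is imported from and how the filtered spans could then be merged; the blank English pipeline and the merge step are assumptions for the example.

```python
import spacy
from spacy.util import filter_spans

nlp = spacy.blank("en")
doc = nlp("I live in Berlin Kreuzberg")

spans = [doc[3:5], doc[3:4], doc[4:5]]  # overlapping candidate spans
filtered_spans = filter_spans(spans)    # keeps the longest span: [doc[3:5]]

with doc.retokenize() as retokenizer:
    for span in filtered_spans:
        retokenizer.merge(span)

print([token.text for token in doc])
# ['I', 'live', 'in', 'Berlin Kreuzberg']
```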
@@ -1168,7 +1184,7 @@ with doc.retokenize() as retokenizer:
 <Infobox title="Important note" variant="warning">
 
 When splitting tokens, the subtoken texts always have to match the original
-token text – or, put differently `''.join(subtokens) == token.text` always needs
+token text – or, put differently `"".join(subtokens) == token.text` always needs
 to hold true. If this wasn't the case, splitting tokens could easily end up
 producing confusing and unexpected results that would contradict spaCy's
 non-destructive tokenization policy.
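To make the invariant concrete, a minimal sketch of a split that satisfies it; the sentence, the heads layout and the `en_core_web_sm` model are assumptions along the lines of the existing splitting example.

```python
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("I live in NewYork")
token = doc[3]

subtokens = ["New", "York"]
assert "".join(subtokens) == token.text  # the invariant described above

with doc.retokenize() as retokenizer:
    # "New" attaches to the second subtoken ("York"), "York" attaches to "in"
    retokenizer.split(token, subtokens, heads=[(token, 1), doc[2]])

print([t.text for t in doc])
# ['I', 'live', 'in', 'New', 'York']
```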