span.sent only returns first sentence (#7084)

* return first sentence when span contains sentence boundary

* docs fix

* small fixes

* cleanup
Sofie Van Landeghem 2021-02-19 13:02:38 +01:00 committed by GitHub
parent 30e1a89aeb
commit 709c9e75af
7 changed files with 63 additions and 14 deletions


@@ -61,7 +61,6 @@ def test_issue7029():
     losses = {}
     nlp.update(train_examples, sgd=optimizer, losses=losses)
     texts = ["first", "second", "third", "fourth", "and", "then", "some", ""]
-    nlp.select_pipes(enable=["tok2vec", "tagger"])
     docs1 = list(nlp.pipe(texts, batch_size=1))
     docs2 = list(nlp.pipe(texts, batch_size=4))
     assert [doc[0].tag_ for doc in docs1[:-1]] == [doc[0].tag_ for doc in docs2[:-1]]


@@ -1,5 +1,3 @@
-import pytest
 from spacy.tokens.doc import Doc
 from spacy.vocab import Vocab
 from spacy.pipeline._parser_internals.arc_eager import ArcEager


@@ -0,0 +1,18 @@
+from spacy.lang.en import English
+
+
+def test_issue7065():
+    text = "Kathleen Battle sang in Mahler 's Symphony No. 8 at the Cincinnati Symphony Orchestra 's May Festival."
+    nlp = English()
+    nlp.add_pipe("sentencizer")
+    ruler = nlp.add_pipe("entity_ruler")
+    patterns = [{"label": "THING", "pattern": [{"LOWER": "symphony"}, {"LOWER": "no"}, {"LOWER": "."}, {"LOWER": "8"}]}]
+    ruler.add_patterns(patterns)
+    doc = nlp(text)
+    sentences = [s for s in doc.sents]
+    assert len(sentences) == 2
+    sent0 = sentences[0]
+    ent = doc.ents[0]
+    assert ent.start < sent0.end < ent.end
+    assert sentences.index(ent.sent) == 0


@@ -357,7 +357,12 @@ cdef class Span:
     @property
     def sent(self):
-        """RETURNS (Span): The sentence span that the span is a part of."""
+        """Obtain the sentence that contains this span. If the given span
+        crosses sentence boundaries, return only the first sentence
+        to which it belongs.
+
+        RETURNS (Span): The sentence span that the span is a part of.
+        """
         if "sent" in self.doc.user_span_hooks:
             return self.doc.user_span_hooks["sent"](self)
         # Use `sent_start` token attribute to find sentence boundaries
@@ -367,8 +372,8 @@ cdef class Span:
             start = self.start
             while self.doc.c[start].sent_start != 1 and start > 0:
                 start += -1
-            # Find end of the sentence
-            end = self.end
+            # Find end of the sentence - can be within the entity
+            end = self.start + 1
             while end < self.doc.length and self.doc.c[end].sent_start != 1:
                 end += 1
                 n += 1
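
For illustration only (not part of this commit): a minimal sketch of the behavior the change above produces, using `sent_starts` passed to the `Doc` constructor so no trained pipeline is needed. The last two lines mirror the workaround added to the `Span.sent` docs further down.

```python
from spacy.tokens import Doc
from spacy.vocab import Vocab

words = ["Hello", "world", ".", "Bye", "now", "."]
# Tokens 0 and 3 start a sentence; the rest continue the current one.
doc = Doc(Vocab(), words=words, sent_starts=[True, False, False, True, False, False])

span = doc[2:5]                  # ". Bye now" crosses the sentence boundary
sent = span.sent                 # with this fix: only the first sentence
assert [t.text for t in sent] == ["Hello", "world", "."]

# If the sentence must cover the full span, extend it manually:
full = doc[sent.start : max(sent.end, span.end)]
assert full.end >= span.end
```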


@@ -219,7 +219,7 @@ alignment mode `"strict"`.
 | `alignment_mode` | How character indices snap to token boundaries. Options: `"strict"` (no snapping), `"contract"` (span of all tokens completely within the character span), `"expand"` (span of all tokens at least partially covered by the character span). Defaults to `"strict"`. ~~str~~ |
 | **RETURNS** | The newly constructed object or `None`. ~~Optional[Span]~~ |
 
-## Doc.set_ents {#ents tag="method" new="3"}
+## Doc.set_ents {#set_ents tag="method" new="3"}
 
 Set the named entities in the document.
@@ -633,12 +633,14 @@ not been implemeted for the given language, a `NotImplementedError` is raised.
 | ---------- | ------------------------------------- |
 | **YIELDS** | Noun chunks in the document. ~~Span~~ |
 
-## Doc.sents {#sents tag="property" model="parser"}
+## Doc.sents {#sents tag="property" model="sentences"}
 
-Iterate over the sentences in the document. Sentence spans have no label. To
-improve accuracy on informal texts, spaCy calculates sentence boundaries from
-the syntactic dependency parse. If the parser is disabled, the `sents` iterator
-will be unavailable.
+Iterate over the sentences in the document. Sentence spans have no label.
+
+This property is only available when
+[sentence boundaries](/usage/linguistic-features#sbd) have been set on the
+document by the `parser`, `senter`, `sentencizer` or some custom function. It
+will raise an error otherwise.
 
 > #### Example
 >
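
A quick sketch (not part of the commit) of what the reworded `Doc.sents` entry describes: the iterator is only available once some component has set sentence boundaries, for example the `sentencizer`.

```python
from spacy.lang.en import English

nlp = English()                      # no parser, senter or sentencizer
doc = nlp("This is one sentence. This is another.")
assert not doc.has_annotation("SENT_START")
# Iterating doc.sents here would raise an error, since no boundaries are set.

nlp.add_pipe("sentencizer")          # any boundary-setting component works
doc = nlp("This is one sentence. This is another.")
assert len(list(doc.sents)) == 2
```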


@@ -483,13 +483,40 @@ The L2 norm of the span's vector representation.
 | ----------- | --------------------------------------------------- |
 | **RETURNS** | The L2 norm of the vector representation. ~~float~~ |
 
+## Span.sent {#sent tag="property" model="sentences"}
+
+The sentence span that this span is a part of. This property is only available
+when [sentence boundaries](/usage/linguistic-features#sbd) have been set on the
+document by the `parser`, `senter`, `sentencizer` or some custom function. It
+will raise an error otherwise.
+
+If the span happens to cross sentence boundaries, only the first sentence will
+be returned. If it is required that the sentence always includes the
+full span, the result can be adjusted as such:
+
+```python
+sent = span.sent
+sent = doc[sent.start : max(sent.end, span.end)]
+```
+
+> #### Example
+>
+> ```python
+> doc = nlp("Give it back! He pleaded.")
+> span = doc[1:3]
+> assert span.sent.text == "Give it back!"
+> ```
+
+| Name        | Description                                              |
+| ----------- | -------------------------------------------------------- |
+| **RETURNS** | The sentence span that this span is a part of. ~~Span~~  |
+
 ## Attributes {#attributes}
 
 | Name                                    | Description                                                       |
 | --------------------------------------- | ----------------------------------------------------------------- |
 | `doc`                                   | The parent document. ~~Doc~~                                      |
 | `tensor` <Tag variant="new">2.1.7</Tag> | The span's slice of the parent `Doc`'s tensor. ~~numpy.ndarray~~  |
+| `sent`                                  | The sentence span that this span is a part of. ~~Span~~           |
 | `start`                                 | The token offset for the start of the span. ~~int~~               |
 | `end`                                   | The token offset for the end of the span. ~~int~~                 |
 | `start_char`                            | The character offset for the start of the span. ~~int~~           |


@@ -585,7 +585,7 @@ print(ent_francisco) # ['Francisco', 'I', 'GPE']
 To ensure that the sequence of token annotations remains consistent, you have to
 set entity annotations **at the document level**. However, you can't write
 directly to the `token.ent_iob` or `token.ent_type` attributes, so the easiest
-way to set entities is to assign to the [`doc.ents`](/api/doc#ents) attribute
+way to set entities is to use the [`doc.set_ents`](/api/doc#set_ents) function
 and create the new entity as a [`Span`](/api/span).
 
 ```python
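
For completeness, a small sketch of the `doc.set_ents` call the updated sentence points to; the example text and label are illustrative, not taken from this commit.

```python
from spacy.lang.en import English
from spacy.tokens import Span

nlp = English()
doc = nlp("fb is hiring a new vice president of global policy")
# Create the new entity as a Span over tokens 0..1 and set it at the document
# level, leaving the annotation on all other tokens unmodified.
fb_ent = Span(doc, 0, 1, label="ORG")
doc.set_ents([fb_ent], default="unmodified")
assert [(ent.text, ent.label_) for ent in doc.ents] == [("fb", "ORG")]
```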