spaCy/website/docs/api/span.md
Sofie Van Landeghem 709c9e75af
span.ent only returns first sentence (#7084)
* return first sentence when span contains sentence boundary

* docs fix

* small fixes

* cleanup
2021-02-19 23:02:38 +11:00


title: Span
tag: class
source: spacy/tokens/span.pyx

A slice from a Doc object.

Span.__init__

Create a Span object from the slice doc[start : end].

Example

doc = nlp("Give it back! He pleaded.")
span = doc[1:4]
assert [t.text for t in span] == ["it", "back", "!"]
Name Description
doc The parent document. Doc
start The index of the first token of the span. int
end The index of the first token after the span. int
label A label to attach to the span, e.g. for named entities. Union[str, int]
kb_id A knowledge base ID to attach to the span, e.g. for named entities. Union[str, int]
vector A meaning representation of the span. numpy.ndarray[ndim=1, dtype=float32]
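
Besides slicing a Doc, a Span can also be built through the constructor using the arguments listed above. The following is a minimal sketch; it assumes a tokenizer-only pipeline from `spacy.blank("en")`, and the label "DEMO" is a made-up value used only for illustration.

```python
import spacy
from spacy.tokens import Span

nlp = spacy.blank("en")  # tokenizer-only pipeline, no trained model needed
doc = nlp("Give it back! He pleaded.")

# "DEMO" is a hypothetical label, not one produced by any trained pipeline.
span = Span(doc, 1, 4, label="DEMO")

# start is inclusive and end is exclusive, as with doc[1:4].
tokens = [t.text for t in span]
print(tokens, span.label_)
```

The string label is stored as a hash in `span.label` and is available as text through `span.label_`.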

Span.__getitem__

Get a Token object.

Example

doc = nlp("Give it back! He pleaded.")
span = doc[1:4]
assert span[1].text == "back"
Name Description
i The index of the token within the span. int
RETURNS The token at span[i]. Token

Get a Span object.

Example

doc = nlp("Give it back! He pleaded.")
span = doc[1:4]
assert span[1:3].text == "back!"
Name Description
start_end The slice of the span to get. Tuple[int, int]
RETURNS The span at span[start : end]. Span

Span.__iter__

Iterate over Token objects.

Example

doc = nlp("Give it back! He pleaded.")
span = doc[1:4]
assert [t.text for t in span] == ["it", "back", "!"]
Name Description
YIELDS A Token object. Token

Span.__len__

Get the number of tokens in the span.

Example

doc = nlp("Give it back! He pleaded.")
span = doc[1:4]
assert len(span) == 3
Name Description
RETURNS The number of tokens in the span. int

Span.set_extension

Define a custom attribute on the Span which becomes available via Span._. For details, see the documentation on custom attributes.

Example

from spacy.tokens import Span
city_getter = lambda span: any(city in span.text for city in ("New York", "Paris", "Berlin"))
Span.set_extension("has_city", getter=city_getter)
doc = nlp("I like New York in Autumn")
assert doc[1:4]._.has_city
Name Description
name Name of the attribute to set by the extension. For example, "my_attr" will be available as span._.my_attr. str
default Optional default value of the attribute if no getter or method is defined. Optional[Any]
method Set a custom method on the object, for example span._.compare(other_span). Optional[Callable[[Span, ...], Any]]
getter Getter function that takes the object and returns an attribute value. Is called when the user accesses the ._ attribute. Optional[Callable[[Span], Any]]
setter Setter function that takes the Span and a value, and modifies the object. Is called when the user writes to the Span._ attribute. Optional[Callable[[Span, Any], None]]
force Force overwriting existing attribute. bool
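
The `default`, `getter` and `method` arguments lead to different behavior when the attribute is accessed. The sketch below registers one extension of each kind. All three attribute names are hypothetical, and `force=True` is used only so the snippet can be re-run without hitting an already-registered attribute.

```python
import spacy
from spacy.tokens import Span

nlp = spacy.blank("en")  # tokenizer only; no trained model required

# Hypothetical attribute names, picked for this sketch.
Span.set_extension("reviewed", default=False, force=True)
Span.set_extension("n_chars", getter=lambda span: len(span.text), force=True)
Span.set_extension(
    "starts_with",
    method=lambda span, prefix: span.text.startswith(prefix),
    force=True,
)

doc = nlp("I like New York in Autumn")
span = doc[2:4]  # "New York"

default_value = span._.reviewed     # nothing written yet, so the default is returned
span._.reviewed = True              # no setter defined, so the value is stored as-is
n_chars = span._.n_chars            # computed by the getter on every access
is_new = span._.starts_with("New")  # the method receives the span as its first argument
```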

Span.get_extension

Look up a previously registered extension by name. Returns a 4-tuple (default, method, getter, setter) if the extension is registered. Raises a KeyError otherwise.

Example

from spacy.tokens import Span
Span.set_extension("is_city", default=False)
extension = Span.get_extension("is_city")
assert extension == (False, None, None, None)
Name Description
name Name of the extension. str
RETURNS A (default, method, getter, setter) tuple of the extension. Tuple[Optional[Any], Optional[Callable], Optional[Callable], Optional[Callable]]

Span.has_extension

Check whether an extension has been registered on the Span class.

Example

from spacy.tokens import Span
Span.set_extension("is_city", default=False)
assert Span.has_extension("is_city")
Name Description
name Name of the extension to check. str
RETURNS Whether the extension has been registered. bool

Span.remove_extension

Remove a previously registered extension.

Example

from spacy.tokens import Span
Span.set_extension("is_city", default=False)
removed = Span.remove_extension("is_city")
assert not Span.has_extension("is_city")
Name Description
name Name of the extension. str
RETURNS A (default, method, getter, setter) tuple of the removed extension. Tuple[Optional[Any], Optional[Callable], Optional[Callable], Optional[Callable]]

Span.char_span

Create a Span object from the slice span.text[start:end]. Returns None if the character indices don't map to a valid span.

Example

doc = nlp("I like New York")
span = doc[1:4].char_span(5, 13, label="GPE")
assert span.text == "New York"
Name Description
start The index of the first character of the span. int
end The index of the first character after the span. int
label A label to attach to the span, e.g. for named entities. Union[int, str]
kb_id (new in v2.2) An ID from a knowledge base to capture the meaning of a named entity. Union[int, str]
vector A meaning representation of the span. numpy.ndarray[ndim=1, dtype=float32]
RETURNS The newly constructed object or None. Optional[Span]
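
The character offsets are relative to the text of the span that `char_span` is called on, not to the parent Doc. A short sketch of both outcomes, assuming a tokenizer-only `spacy.blank("en")` pipeline and the default strict alignment:

```python
import spacy

nlp = spacy.blank("en")  # tokenizer only
doc = nlp("I like New York")
parent = doc[1:4]        # "like New York"

# Offsets 5..13 index into parent.text and line up with token boundaries.
good = parent.char_span(5, 13, label="GPE")

# Offset 12 falls inside the token "York", so no valid span exists.
bad = parent.char_span(5, 12)

print(good.text, bad)
```

Checking the result for `None` before use avoids an `AttributeError` when the offsets come from an external annotation source.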

Span.similarity

Make a semantic similarity estimate. The default estimate is cosine similarity using an average of word vectors.

Example

doc = nlp("green apples and red oranges")
green_apples = doc[:2]
red_oranges = doc[3:]
apples_oranges = green_apples.similarity(red_oranges)
oranges_apples = red_oranges.similarity(green_apples)
assert apples_oranges == oranges_apples
Name Description
other The object to compare with. By default, accepts Doc, Span, Token and Lexeme objects. Union[Doc, Span, Token, Lexeme]
RETURNS A scalar similarity score. Higher is more similar. float
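
To make the default estimate concrete, the following pure-Python sketch averages toy word vectors and computes the cosine between the two averages. It is an illustration of the arithmetic only, not spaCy's implementation; the three-dimensional vectors and the helper names `mean_vector` and `cosine` are invented for this example.

```python
import math

def mean_vector(vectors):
    """Average a list of equal-length vectors component by component."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

def cosine(a, b):
    """Cosine of the angle between a and b; 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)

# Toy vectors, made up for this sketch.
toy = {
    "green": [1.0, 0.0, 0.0],
    "apples": [0.0, 1.0, 0.0],
    "red": [0.8, 0.0, 0.6],
    "oranges": [0.0, 0.9, 0.1],
}

green_apples = mean_vector([toy["green"], toy["apples"]])
red_oranges = mean_vector([toy["red"], toy["oranges"]])

apples_oranges = cosine(green_apples, red_oranges)
oranges_apples = cosine(red_oranges, green_apples)
```

Because both the dot product and the product of norms are symmetric, the two scores are identical, which is what the assertion in the example above relies on.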

Span.get_lca_matrix

Calculates the lowest common ancestor (LCA) matrix for a given Span. Each entry of the returned matrix holds the integer index of the ancestor, or -1 if no common ancestor is found, e.g. if the span excludes a necessary ancestor.

Example

doc = nlp("I like New York in Autumn")
span = doc[1:4]
matrix = span.get_lca_matrix()
# array([[0, 0, 0], [0, 1, 2], [0, 2, 2]], dtype=int32)
Name Description
RETURNS The lowest common ancestor matrix of the Span. numpy.ndarray[ndim=2, dtype=int32]
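
The -1 case can be reproduced without a trained parser by constructing a Doc with hand-written heads. The parse below is an assumption made for this sketch (in particular, "in" is attached to "like"), not the output of any model.

```python
from spacy.tokens import Doc
from spacy.vocab import Vocab

words = ["I", "like", "New", "York", "in", "Autumn"]
# Hand-written parse: heads are absolute token indices, the root points to itself.
heads = [1, 1, 3, 1, 1, 4]
deps = ["nsubj", "ROOT", "compound", "dobj", "prep", "pobj"]
doc = Doc(Vocab(), words=words, heads=heads, deps=deps)

# "like New York": every pair has its common ancestor inside the span.
inside = doc[1:4].get_lca_matrix().tolist()

# "York in": the common ancestor is "like", which lies outside the span.
outside = doc[3:5].get_lca_matrix().tolist()

print(inside)
print(outside)
```

The indices in the matrix are relative to the start of the span, so 0 in the first matrix refers to "like".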

Span.to_array

Given a list of M attribute IDs, export the tokens to a numpy ndarray of shape (N, M), where N is the number of tokens in the span. The values will be unsigned 64-bit integers, matching the uint64 dtype listed under RETURNS.

Example

from spacy.attrs import LOWER, POS, ENT_TYPE, IS_ALPHA
doc = nlp("I like New York in Autumn.")
span = doc[2:3]
# All strings mapped to integers, for easy export to numpy
np_array = span.to_array([LOWER, POS, ENT_TYPE, IS_ALPHA])
Name Description
attr_ids A list of attributes (int IDs or string names) or a single attribute (int ID or string name). Union[int, str, List[Union[int, str]]]
RETURNS The exported attributes as a numpy array. Union[numpy.ndarray[ndim=2, dtype=uint64], numpy.ndarray[ndim=1, dtype=uint64]]
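
A small sketch of the resulting shape and values, assuming a tokenizer-only pipeline and restricting itself to lexical attributes that need no trained components:

```python
import spacy
from spacy.attrs import LOWER, IS_ALPHA

nlp = spacy.blank("en")  # tokenizer only
doc = nlp("I like New York in Autumn.")
span = doc[2:4]          # "New York"

arr = span.to_array([LOWER, IS_ALPHA])

# One row per token in the span, one column per requested attribute.
print(arr.shape, arr.dtype)
```

String-valued attributes such as LOWER are exported as their hash, so `arr[0, 0]` equals `doc[2].lower`, while boolean flags such as IS_ALPHA are exported as 0 or 1.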

Span.ents

The named entities in the span. Returns a tuple of named entity Span objects, if the entity recognizer has been applied.

Example

doc = nlp("Mr. Best flew to New York on Saturday morning.")
span = doc[0:6]
ents = list(span.ents)
assert ents[0].label == 346
assert ents[0].label_ == "PERSON"
assert ents[0].text == "Mr. Best"
Name Description
RETURNS Entities in the span, one Span per entity. Tuple[Span, ...]
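
The sketch below sets entities by hand instead of running an entity recognizer, to show which entities a span reports. It assumes that only entities lying entirely inside the span are returned, which matches the behavior the author of this sketch is aware of; the entity annotations themselves are written manually.

```python
from spacy.tokens import Doc, Span
from spacy.vocab import Vocab

words = ["Mr.", "Best", "flew", "to", "New", "York", "on", "Saturday", "morning", "."]
doc = Doc(Vocab(), words=words)

# Manually assigned entities, standing in for the output of an entity recognizer.
doc.ents = [Span(doc, 0, 2, label="PERSON"), Span(doc, 4, 6, label="GPE")]

# Both entities lie fully inside doc[0:6].
inside = [ent.text for ent in doc[0:6].ents]

# doc[0:5] ends in the middle of "New York", so that entity is not included.
cut = [ent.text for ent in doc[0:5].ents]

print(inside, cut)
```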

Span.noun_chunks

Iterate over the base noun phrases in the span. Yields base noun-phrase Span objects, if the document has been syntactically parsed. A base noun phrase, or "NP chunk", is a noun phrase that does not permit other NPs to be nested within it, so it contains no NP-level coordination, no prepositional phrases, and no relative clauses.

If the noun_chunk syntax iterator has not been implemented for the given language, a NotImplementedError is raised.

Example

doc = nlp("A phrase with another phrase occurs.")
span = doc[3:5]
chunks = list(span.noun_chunks)
assert len(chunks) == 1
assert chunks[0].text == "another phrase"
Name Description
YIELDS Noun chunks in the span. Span

Span.as_doc

Create a new Doc object corresponding to the Span, with a copy of the data.

Example

doc = nlp("I like New York in Autumn.")
span = doc[2:4]
doc2 = span.as_doc()
assert doc2.text == "New York"
Name Description
copy_user_data Whether or not to copy the original doc's user data. bool
RETURNS A Doc object of the Span's content. Doc
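
The effect of `copy_user_data` can be seen with a plain key in `doc.user_data`. This is a sketch under the assumption of a tokenizer-only pipeline; the key "source" and its value are hypothetical.

```python
import spacy

nlp = spacy.blank("en")  # tokenizer only
doc = nlp("I like New York in Autumn.")
doc.user_data["source"] = "demo"  # hypothetical key, used only for this sketch

span = doc[2:4]
plain = span.as_doc()                          # user data is not copied by default
with_data = span.as_doc(copy_user_data=True)   # user data is copied over

plain_tokens = [t.text for t in plain]
print(plain_tokens, plain.user_data, with_data.user_data)
```

The new Doc is independent of the original: its tokens are indexed from 0, and changes to it do not affect the parent document.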

Span.root

The token with the shortest path to the root of the sentence (or the root itself). If multiple tokens are equally high in the tree, the first token is taken.

Example

doc = nlp("I like New York in Autumn.")
i, like, new, york, in_, autumn, dot = range(len(doc))
assert doc[new].head.text == "York"
assert doc[york].head.text == "like"
new_york = doc[new:york+1]
assert new_york.root.text == "York"
Name Description
RETURNS The root token. Token

Span.conjuncts

A tuple of tokens coordinated to span.root.

Example

doc = nlp("I like apples and oranges")
apples_conjuncts = doc[2:3].conjuncts
assert [t.text for t in apples_conjuncts] == ["oranges"]
Name Description
RETURNS The coordinated tokens. Tuple[Token, ...]

Span.lefts

Tokens that are to the left of the span, whose heads are within the span.

Example

doc = nlp("I like New York in Autumn.")
lefts = [t.text for t in doc[3:7].lefts]
assert lefts == ["New"]
Name Description
YIELDS A left-child of a token of the span. Token

Span.rights

Tokens that are to the right of the span, whose heads are within the span.

Example

doc = nlp("I like New York in Autumn.")
rights = [t.text for t in doc[2:4].rights]
assert rights == ["in"]
Name Description
YIELDS A right-child of a token of the span. Token

Span.n_lefts

The number of tokens that are to the left of the span, whose heads are within the span.

Example

doc = nlp("I like New York in Autumn.")
assert doc[3:7].n_lefts == 1
Name Description
RETURNS The number of left-child tokens. int

Span.n_rights

The number of tokens that are to the right of the span, whose heads are within the span.

Example

doc = nlp("I like New York in Autumn.")
assert doc[2:4].n_rights == 1
Name Description
RETURNS The number of right-child tokens. int
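
Span.lefts, Span.rights, Span.n_lefts and Span.n_rights can be explored together on a Doc with a hand-written dependency parse. The heads and labels below are assumptions chosen to match the examples above, not the output of a trained parser.

```python
from spacy.tokens import Doc
from spacy.vocab import Vocab

words = ["I", "like", "New", "York", "in", "Autumn", "."]
# Hand-written parse: heads are absolute token indices, the root points to itself.
heads = [1, 1, 3, 1, 3, 4, 1]
deps = ["nsubj", "ROOT", "compound", "dobj", "prep", "pobj", "punct"]
doc = Doc(Vocab(), words=words, heads=heads, deps=deps)

new_york = doc[2:4]
like = doc[1:2]

# "New" is a child of "York" but sits inside the span, so it is not a left.
ny_lefts = [t.text for t in new_york.lefts]
ny_rights = [t.text for t in new_york.rights]

# The root "like" has one child before it and two after it.
like_lefts = [t.text for t in like.lefts]
like_rights = [t.text for t in like.rights]

print(ny_lefts, ny_rights, like_lefts, like_rights)
```

Only children located outside the span are reported; children that fall inside the span boundaries are skipped.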

Span.subtree

Tokens within the span and tokens which descend from them.

Example

doc = nlp("Give it back! He pleaded.")
subtree = [t.text for t in doc[:3].subtree]
assert subtree == ["Give", "it", "back", "!"]
Name Description
YIELDS A token within the span, or a descendant from it. Token
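
The following sketch shows how the subtree can extend beyond the span on either side. It uses a hand-written parse of the example sentence, so the heads and dependency labels are assumptions rather than parser output.

```python
from spacy.tokens import Doc
from spacy.vocab import Vocab

words = ["Give", "it", "back", "!", "He", "pleaded", "."]
# Two sentences, each with its own root ("Give" and "pleaded").
heads = [0, 0, 0, 0, 5, 5, 5]
deps = ["ROOT", "dobj", "advmod", "punct", "nsubj", "ROOT", "punct"]
doc = Doc(Vocab(), words=words, heads=heads, deps=deps)

# "!" is attached to "Give", so it is pulled in from the right of the span.
first = [t.text for t in doc[:3].subtree]

# "it back" has no children outside the span, so nothing is added.
leaf = [t.text for t in doc[1:3].subtree]

# "pleaded" gains "He" on the left and "." on the right.
second = [t.text for t in doc[5:6].subtree]

print(first, leaf, second)
```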

Span.has_vector

A boolean value indicating whether a word vector is associated with the object.

Example

doc = nlp("I like apples")
assert doc[1:].has_vector
Name Description
RETURNS Whether the span has vector data attached. bool

Span.vector

A real-valued meaning representation. Defaults to an average of the token vectors.

Example

doc = nlp("I like apples")
assert doc[1:].vector.dtype == "float32"
assert doc[1:].vector.shape == (300,)
Name Description
RETURNS A 1-dimensional array representing the span's vector. numpy.ndarray[ndim=1, dtype=float32]

Span.vector_norm

The L2 norm of the span's vector representation.

Example

doc = nlp("I like apples")
doc[1:].vector_norm # 4.800883928527915
doc[2:].vector_norm # 6.895897646384268
assert doc[1:].vector_norm != doc[2:].vector_norm
Name Description
RETURNS The L2 norm of the vector representation. float
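
As a worked illustration of the arithmetic, the sketch below averages two toy token vectors and takes the L2 norm of the result in pure Python. The vectors are invented for the example, and the code mirrors the definition rather than spaCy's implementation.

```python
import math

# Toy 2-dimensional token vectors, made up for this sketch.
token_vectors = [[3.0, 0.0], [0.0, 4.0]]

# The span vector defaults to the average of the token vectors: [1.5, 2.0].
span_vector = [sum(column) / len(token_vectors) for column in zip(*token_vectors)]

# L2 norm: square root of the sum of squared components, sqrt(2.25 + 4.0).
vector_norm = math.sqrt(sum(x * x for x in span_vector))

print(span_vector, vector_norm)
```

Note that the norm of the average (2.5 here) is generally not the average of the token norms (3.5 here).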

Span.sent

The sentence span that this span is a part of. This property is only available when sentence boundaries have been set on the document by the parser, senter, sentencizer or some custom function. It will raise an error otherwise.

If the span happens to cross sentence boundaries, only the first sentence will be returned. If it is required that the sentence always includes the full span, the result can be adjusted as such:

sent = span.sent
sent = doc[sent.start : max(sent.end, span.end)]

Example

doc = nlp("Give it back! He pleaded.")
span = doc[1:3]
assert span.sent.text == "Give it back!"
Name Description
RETURNS The sentence span that this span is a part of. Span
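
The adjustment for spans that cross a sentence boundary can be tried out on a Doc with hand-set sentence starts. This sketch assumes a spaCy version in which Span.sent returns only the first sentence, as described above; the sentence boundaries are supplied manually rather than by a parser or sentencizer.

```python
from spacy.tokens import Doc
from spacy.vocab import Vocab

words = ["Give", "it", "back", "!", "He", "pleaded", "."]
spaces = [True, True, False, True, True, False, False]
# Manually mark "Give" and "He" as sentence-initial tokens.
sent_starts = [True, False, False, False, True, False, False]
doc = Doc(Vocab(), words=words, spaces=spaces, sent_starts=sent_starts)

span = doc[2:6]  # "back! He pleaded" crosses the sentence boundary

sent = span.sent  # only the first sentence
# Extend the end so the result always covers the full span.
widened = doc[sent.start : max(sent.end, span.end)]

print(sent.text)
print(widened.text)
```

For a span that stays within one sentence, `max(sent.end, span.end)` equals `sent.end`, so the adjustment leaves the result unchanged.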

Attributes

Name Description
doc The parent document. Doc
tensor (new in v2.1.7) The span's slice of the parent Doc's tensor. numpy.ndarray
start The token offset for the start of the span. int
end The token offset for the end of the span. int
start_char The character offset for the start of the span. int
end_char The character offset for the end of the span. int
text A string representation of the span text. str
text_with_ws The text content of the span with a trailing whitespace character if the last token has one. str
orth ID of the verbatim text content. int
orth_ Verbatim text content (identical to Span.text). Exists mostly for consistency with the other attributes. str
label The hash value of the span's label. int
label_ The span's label. str
lemma_ The span's lemma, built by joining the lemmas of the span's tokens. str
kb_id The hash value of the knowledge base ID referred to by the span. int
kb_id_ The knowledge base ID referred to by the span. str
ent_id The hash value of the named entity the span is an instance of. int
ent_id_ The string ID of the named entity the span is an instance of. str
sentiment A scalar value indicating the positivity or negativity of the span. float
_ User space for adding custom attribute extensions. Underscore