spaCy/website/docs/api/span.md
Ines Montani 06d8c3a20f Revert "Merge branch 'master' into spacy.io"
This reverts commit c8bb08b545, reversing
changes made to b6a509a8d1.
2019-09-30 13:14:48 +02:00

20 KiB

title tag source
Span class spacy/tokens/span.pyx

A slice from a Doc object.

Span.__init__

Create a Span object from the slice doc[start : end].

Example

doc = nlp(u"Give it back! He pleaded.")
span = doc[1:4]
assert [t.text for t in span] ==  [u"it", u"back", u"!"]
Name Type Description
doc Doc The parent document.
start int The index of the first token of the span.
end int The index of the first token after the span.
label int / unicode A label to attach to the span, e.g. for named entities. As of v2.1, the label can also be a unicode string.
vector numpy.ndarray[ndim=1, dtype='float32'] A meaning representation of the span.
RETURNS Span The newly constructed object.

Span.__getitem__

Get a Token object.

Example

doc = nlp(u"Give it back! He pleaded.")
span = doc[1:4]
assert span[1].text == "back"
Name Type Description
i int The index of the token within the span.
RETURNS Token The token at span[i].

Get a Span object.

Example

doc = nlp(u"Give it back! He pleaded.")
span = doc[1:4]
assert span[1:3].text == u"back!"
Name Type Description
start_end tuple The slice of the span to get.
RETURNS Span The span at span[start : end].

Span.__iter__

Iterate over Token objects.

Example

doc = nlp(u"Give it back! He pleaded.")
span = doc[1:4]
assert [t.text for t in span] == [u"it", u"back", u"!"]
Name Type Description
YIELDS Token A Token object.

Span.__len__

Get the number of tokens in the span.

Example

doc = nlp(u"Give it back! He pleaded.")
span = doc[1:4]
assert len(span) == 3
Name Type Description
RETURNS int The number of tokens in the span.

Span.set_extension

Define a custom attribute on the Span which becomes available via Span._. For details, see the documentation on custom attributes.

Example

from spacy.tokens import Span
city_getter = lambda span: any(city in span.text for city in (u"New York", u"Paris", u"Berlin"))
Span.set_extension("has_city", getter=city_getter)
doc = nlp(u"I like New York in Autumn")
assert doc[1:4]._.has_city
Name Type Description
name unicode Name of the attribute to set by the extension. For example, 'my_attr' will be available as span._.my_attr.
default - Optional default value of the attribute if no getter or method is defined.
method callable Set a custom method on the object, for example span._.compare(other_span).
getter callable Getter function that takes the object and returns an attribute value. Is called when the user accesses the ._ attribute.
setter callable Setter function that takes the Span and a value, and modifies the object. Is called when the user writes to the Span._ attribute.
force bool Force overwriting existing attribute.

Span.get_extension

Look up a previously registered extension by name. Returns a 4-tuple (default, method, getter, setter) if the extension is registered. Raises a KeyError otherwise.

Example

from spacy.tokens import Span
Span.set_extension("is_city", default=False)
extension = Span.get_extension("is_city")
assert extension == (False, None, None, None)
Name Type Description
name unicode Name of the extension.
RETURNS tuple A (default, method, getter, setter) tuple of the extension.

Span.has_extension

Check whether an extension has been registered on the Span class.

Example

from spacy.tokens import Span
Span.set_extension("is_city", default=False)
assert Span.has_extension("is_city")
Name Type Description
name unicode Name of the extension to check.
RETURNS bool Whether the extension has been registered.

Span.remove_extension

Remove a previously registered extension.

Example

from spacy.tokens import Span
Span.set_extension("is_city", default=False)
removed = Span.remove_extension("is_city")
assert not Span.has_extension("is_city")
Name Type Description
name unicode Name of the extension.
RETURNS tuple A (default, method, getter, setter) tuple of the removed extension.

Span.similarity

Make a semantic similarity estimate. The default estimate is cosine similarity using an average of word vectors.

Example

doc = nlp(u"green apples and red oranges")
green_apples = doc[:2]
red_oranges = doc[3:]
apples_oranges = green_apples.similarity(red_oranges)
oranges_apples = red_oranges.similarity(green_apples)
assert apples_oranges == oranges_apples
Name Type Description
other - The object to compare with. By default, accepts Doc, Span, Token and Lexeme objects.
RETURNS float A scalar similarity score. Higher is more similar.

Span.get_lca_matrix

Calculates the lowest common ancestor matrix for a given Span. Returns LCA matrix containing the integer index of the ancestor, or -1 if no common ancestor is found, e.g. if span excludes a necessary ancestor.

Example

doc = nlp(u"I like New York in Autumn")
span = doc[1:4]
matrix = span.get_lca_matrix()
# array([[0, 0, 0], [0, 1, 2], [0, 2, 2]], dtype=int32)
Name Type Description
RETURNS numpy.ndarray[ndim=2, dtype='int32'] The lowest common ancestor matrix of the Span.

Span.to_array

Given a list of M attribute IDs, export the tokens to a numpy ndarray of shape (N, M), where N is the length of the document. The values will be 32-bit integers.

Example

from spacy.attrs import LOWER, POS, ENT_TYPE, IS_ALPHA
doc = nlp(u"I like New York in Autumn.")
span = doc[2:3]
# All strings mapped to integers, for easy export to numpy
np_array = span.to_array([LOWER, POS, ENT_TYPE, IS_ALPHA])
Name Type Description
attr_ids list A list of attribute ID ints.
RETURNS numpy.ndarray[long, ndim=2] A feature matrix, with one row per word, and one column per attribute indicated in the input attr_ids.

Span.merge

As of v2.1.0, Span.merge still works but is considered deprecated. You should use the new and less error-prone Doc.retokenize instead.

Retokenize the document, such that the span is merged into a single token.

Example

doc = nlp(u"I like New York in Autumn.")
span = doc[2:4]
span.merge()
assert len(doc) == 6
assert doc[2].text == u"New York"
Name Type Description
**attributes - Attributes to assign to the merged token. By default, attributes are inherited from the syntactic root token of the span.
RETURNS Token The newly merged token.

Span.ents

The named entities in the span. Returns a tuple of named entity Span objects, if the entity recognizer has been applied.

Example

doc = nlp(u"Mr. Best flew to New York on Saturday morning.")
span = doc[0:6]
ents = list(span.ents)
assert ents[0].label == 346
assert ents[0].label_ == "PERSON"
assert ents[0].text == u"Mr. Best"
Name Type Description
RETURNS tuple Entities in the span, one Span per entity.

Span.as_doc

Create a new Doc object corresponding to the Span, with a copy of the data.

Example

doc = nlp(u"I like New York in Autumn.")
span = doc[2:4]
doc2 = span.as_doc()
assert doc2.text == u"New York"
Name Type Description
copy_user_data bool Whether or not to copy the original doc's user data.
RETURNS Doc A Doc object of the Span's content.

Span.root

The token with the shortest path to the root of the sentence (or the root itself). If multiple tokens are equally high in the tree, the first token is taken.

Example

doc = nlp(u"I like New York in Autumn.")
i, like, new, york, in_, autumn, dot = range(len(doc))
assert doc[new].head.text == u"York"
assert doc[york].head.text == u"like"
new_york = doc[new:york+1]
assert new_york.root.text == u"York"
Name Type Description
RETURNS Token The root token.

Span.conjuncts

A tuple of tokens coordinated to span.root.

Example

doc = nlp(u"I like apples and oranges")
apples_conjuncts = doc[2:3].conjuncts
assert [t.text for t in apples_conjuncts] == [u"oranges"]
Name Type Description
RETURNS tuple The coordinated tokens.

Span.lefts

Tokens that are to the left of the span, whose heads are within the span.

Example

doc = nlp(u"I like New York in Autumn.")
lefts = [t.text for t in doc[3:7].lefts]
assert lefts == [u"New"]
Name Type Description
YIELDS Token A left-child of a token of the span.

Span.rights

Tokens that are to the right of the span, whose heads are within the span.

Example

doc = nlp(u"I like New York in Autumn.")
rights = [t.text for t in doc[2:4].rights]
assert rights == [u"in"]
Name Type Description
YIELDS Token A right-child of a token of the span.

Span.n_lefts

The number of tokens that are to the left of the span, whose heads are within the span.

Example

doc = nlp(u"I like New York in Autumn.")
assert doc[3:7].n_lefts == 1
Name Type Description
RETURNS int The number of left-child tokens.

Span.n_rights

The number of tokens that are to the right of the span, whose heads are within the span.

Example

doc = nlp(u"I like New York in Autumn.")
assert doc[2:4].n_rights == 1
Name Type Description
RETURNS int The number of right-child tokens.

Span.subtree

Tokens within the span and tokens which descend from them.

Example

doc = nlp(u"Give it back! He pleaded.")
subtree = [t.text for t in doc[:3].subtree]
assert subtree == [u"Give", u"it", u"back", u"!"]
Name Type Description
YIELDS Token A token within the span, or a descendant from it.

Span.has_vector

A boolean value indicating whether a word vector is associated with the object.

Example

doc = nlp(u"I like apples")
assert doc[1:].has_vector
Name Type Description
RETURNS bool Whether the span has a vector data attached.

Span.vector

A real-valued meaning representation. Defaults to an average of the token vectors.

Example

doc = nlp(u"I like apples")
assert doc[1:].vector.dtype == "float32"
assert doc[1:].vector.shape == (300,)
Name Type Description
RETURNS numpy.ndarray[ndim=1, dtype='float32'] A 1D numpy array representing the span's semantics.

Span.vector_norm

The L2 norm of the span's vector representation.

Example

doc = nlp(u"I like apples")
doc[1:].vector_norm # 4.800883928527915
doc[2:].vector_norm # 6.895897646384268
assert doc[1:].vector_norm != doc[2:].vector_norm
Name Type Description
RETURNS float The L2 norm of the vector representation.

Attributes

Name Type Description
doc Doc The parent document.
tensor 2.1.7 ndarray The span's slice of the parent Doc's tensor.
sent Span The sentence span that this span is a part of.
start int The token offset for the start of the span.
end int The token offset for the end of the span.
start_char int The character offset for the start of the span.
end_char int The character offset for the end of the span.
text unicode A unicode representation of the span text.
text_with_ws unicode The text content of the span with a trailing whitespace character if the last token has one.
orth int ID of the verbatim text content.
orth_ unicode Verbatim text content (identical to Span.text). Exists mostly for consistency with the other attributes.
label int The span's label.
label_ unicode The span's label.
lemma_ unicode The span's lemma.
ent_id int The hash value of the named entity the token is an instance of.
ent_id_ unicode The string ID of the named entity the token is an instance of.
sentiment float A scalar value indicating the positivity or negativity of the span.
_ Underscore User space for adding custom attribute extensions.