* document token ent_kb_id * document span kb_id * update pipeline documentation * prior and context weights as bool's instead * entitylinker api documentation * drop for both models * finish entitylinker documentation * small fixes * documentation for KB * candidate documentation * links to api pages in code * small fix * frequency examples as counts for consistency * consistent documentation about tensors returned by predict * add entity linking to usage 101 * add entity linking infobox and KB section to 101 * entity-linking in linguistic features * small typo corrections * training example and docs for entity_linker * predefined nlp and kb * revert back to similarity encodings for simplicity (for now) * set prior probabilities to 0 when excluded * code clean up * bugfix: deleting kb ID from tokens when entities were removed * refactor train el example to use either model or vocab * pretrain_kb example for example kb generation * add to training docs for KB + EL example scripts * small fixes * error numbering * ensure the language of vocab and nlp stay consistent across serialization * equality with = * avoid conflict in errors file * add error 151 * final adjustements to the train scripts - consistency * update of goldparse documentation * small corrections * push commit * typo fix * add candidate API to kb documentation * update API sidebar with EntityLinker and KnowledgeBase * remove EL from 101 docs * remove entity linker from 101 pipelines / rephrase * custom el model instead of existing model * set version to 2.2 for EL functionality * update documentation for 2 CLI scripts
21 KiB
title | tag | source |
---|---|---|
Span | class | spacy/tokens/span.pyx |
A slice from a Doc
object.
Span.__init__
Create a Span object from the slice doc[start : end]
.
Example
doc = nlp(u"Give it back! He pleaded.") span = doc[1:4] assert [t.text for t in span] == [u"it", u"back", u"!"]
Name | Type | Description |
---|---|---|
doc |
Doc |
The parent document. |
start |
int | The index of the first token of the span. |
end |
int | The index of the first token after the span. |
label |
int / unicode | A label to attach to the span, e.g. for named entities. As of v2.1, the label can also be a unicode string. |
kb_id |
int / unicode | A knowledge base ID to attach to the span, e.g. for named entities. The ID can be an integer or a unicode string. |
vector |
numpy.ndarray[ndim=1, dtype='float32'] |
A meaning representation of the span. |
RETURNS | Span |
The newly constructed object. |
Span.__getitem__
Get a Token
object.
Example
doc = nlp(u"Give it back! He pleaded.") span = doc[1:4] assert span[1].text == "back"
Name | Type | Description |
---|---|---|
i |
int | The index of the token within the span. |
RETURNS | Token |
The token at span[i] . |
Get a Span
object.
Example
doc = nlp(u"Give it back! He pleaded.") span = doc[1:4] assert span[1:3].text == u"back!"
Name | Type | Description |
---|---|---|
start_end |
tuple | The slice of the span to get. |
RETURNS | Span |
The span at span[start : end] . |
Span.__iter__
Iterate over Token
objects.
Example
doc = nlp(u"Give it back! He pleaded.") span = doc[1:4] assert [t.text for t in span] == [u"it", u"back", u"!"]
Name | Type | Description |
---|---|---|
YIELDS | Token |
A Token object. |
Span.__len__
Get the number of tokens in the span.
Example
doc = nlp(u"Give it back! He pleaded.") span = doc[1:4] assert len(span) == 3
Name | Type | Description |
---|---|---|
RETURNS | int | The number of tokens in the span. |
Span.set_extension
Define a custom attribute on the Span
which becomes available via Span._
.
For details, see the documentation on
custom attributes.
Example
from spacy.tokens import Span city_getter = lambda span: any(city in span.text for city in (u"New York", u"Paris", u"Berlin")) Span.set_extension("has_city", getter=city_getter) doc = nlp(u"I like New York in Autumn") assert doc[1:4]._.has_city
Name | Type | Description |
---|---|---|
name |
unicode | Name of the attribute to set by the extension. For example, 'my_attr' will be available as span._.my_attr . |
default |
- | Optional default value of the attribute if no getter or method is defined. |
method |
callable | Set a custom method on the object, for example span._.compare(other_span) . |
getter |
callable | Getter function that takes the object and returns an attribute value. Is called when the user accesses the ._ attribute. |
setter |
callable | Setter function that takes the Span and a value, and modifies the object. Is called when the user writes to the Span._ attribute. |
force |
bool | Force overwriting existing attribute. |
Span.get_extension
Look up a previously registered extension by name. Returns a 4-tuple
(default, method, getter, setter)
if the extension is registered. Raises a
KeyError
otherwise.
Example
from spacy.tokens import Span Span.set_extension("is_city", default=False) extension = Span.get_extension("is_city") assert extension == (False, None, None, None)
Name | Type | Description |
---|---|---|
name |
unicode | Name of the extension. |
RETURNS | tuple | A (default, method, getter, setter) tuple of the extension. |
Span.has_extension
Check whether an extension has been registered on the Span
class.
Example
from spacy.tokens import Span Span.set_extension("is_city", default=False) assert Span.has_extension("is_city")
Name | Type | Description |
---|---|---|
name |
unicode | Name of the extension to check. |
RETURNS | bool | Whether the extension has been registered. |
Span.remove_extension
Remove a previously registered extension.
Example
from spacy.tokens import Span Span.set_extension("is_city", default=False) removed = Span.remove_extension("is_city") assert not Span.has_extension("is_city")
Name | Type | Description |
---|---|---|
name |
unicode | Name of the extension. |
RETURNS | tuple | A (default, method, getter, setter) tuple of the removed extension. |
Span.similarity
Make a semantic similarity estimate. The default estimate is cosine similarity using an average of word vectors.
Example
doc = nlp(u"green apples and red oranges") green_apples = doc[:2] red_oranges = doc[3:] apples_oranges = green_apples.similarity(red_oranges) oranges_apples = red_oranges.similarity(green_apples) assert apples_oranges == oranges_apples
Name | Type | Description |
---|---|---|
other |
- | The object to compare with. By default, accepts Doc , Span , Token and Lexeme objects. |
RETURNS | float | A scalar similarity score. Higher is more similar. |
Span.get_lca_matrix
Calculates the lowest common ancestor matrix for a given Span
. Returns LCA
matrix containing the integer index of the ancestor, or -1
if no common
ancestor is found, e.g. if span excludes a necessary ancestor.
Example
doc = nlp(u"I like New York in Autumn") span = doc[1:4] matrix = span.get_lca_matrix() # array([[0, 0, 0], [0, 1, 2], [0, 2, 2]], dtype=int32)
Name | Type | Description |
---|---|---|
RETURNS | numpy.ndarray[ndim=2, dtype='int32'] |
The lowest common ancestor matrix of the Span . |
Span.to_array
Given a list of M
attribute IDs, export the tokens to a numpy ndarray
of
shape (N, M)
, where N
is the length of the document. The values will be
32-bit integers.
Example
from spacy.attrs import LOWER, POS, ENT_TYPE, IS_ALPHA doc = nlp(u"I like New York in Autumn.") span = doc[2:3] # All strings mapped to integers, for easy export to numpy np_array = span.to_array([LOWER, POS, ENT_TYPE, IS_ALPHA])
Name | Type | Description |
---|---|---|
attr_ids |
list | A list of attribute ID ints. |
RETURNS | numpy.ndarray[long, ndim=2] |
A feature matrix, with one row per word, and one column per attribute indicated in the input attr_ids . |
Span.merge
As of v2.1.0, Span.merge
still works but is considered deprecated. You should
use the new and less error-prone Doc.retokenize
instead.
Retokenize the document, such that the span is merged into a single token.
Example
doc = nlp(u"I like New York in Autumn.") span = doc[2:4] span.merge() assert len(doc) == 6 assert doc[2].text == u"New York"
Name | Type | Description |
---|---|---|
**attributes |
- | Attributes to assign to the merged token. By default, attributes are inherited from the syntactic root token of the span. |
RETURNS | Token |
The newly merged token. |
Span.ents
The named entities in the span. Returns a tuple of named entity Span
objects,
if the entity recognizer has been applied.
Example
doc = nlp(u"Mr. Best flew to New York on Saturday morning.") span = doc[0:6] ents = list(span.ents) assert ents[0].label == 346 assert ents[0].label_ == "PERSON" assert ents[0].text == u"Mr. Best"
Name | Type | Description |
---|---|---|
RETURNS | tuple | Entities in the span, one Span per entity. |
Span.as_doc
Create a new Doc
object corresponding to the Span
, with a copy of the data.
Example
doc = nlp(u"I like New York in Autumn.") span = doc[2:4] doc2 = span.as_doc() assert doc2.text == u"New York"
Name | Type | Description |
---|---|---|
RETURNS | Doc |
A Doc object of the Span 's content. |
Span.root
The token with the shortest path to the root of the sentence (or the root itself). If multiple tokens are equally high in the tree, the first token is taken.
Example
doc = nlp(u"I like New York in Autumn.") i, like, new, york, in_, autumn, dot = range(len(doc)) assert doc[new].head.text == u"York" assert doc[york].head.text == u"like" new_york = doc[new:york+1] assert new_york.root.text == u"York"
Name | Type | Description |
---|---|---|
RETURNS | Token |
The root token. |
Span.conjuncts
A tuple of tokens coordinated to span.root
.
Example
doc = nlp(u"I like apples and oranges") apples_conjuncts = doc[2:3].conjuncts assert [t.text for t in apples_conjuncts] == [u"oranges"]
Name | Type | Description |
---|---|---|
RETURNS | tuple |
The coordinated tokens. |
Span.lefts
Tokens that are to the left of the span, whose heads are within the span.
Example
doc = nlp(u"I like New York in Autumn.") lefts = [t.text for t in doc[3:7].lefts] assert lefts == [u"New"]
Name | Type | Description |
---|---|---|
YIELDS | Token |
A left-child of a token of the span. |
Span.rights
Tokens that are to the right of the span, whose heads are within the span.
Example
doc = nlp(u"I like New York in Autumn.") rights = [t.text for t in doc[2:4].rights] assert rights == [u"in"]
Name | Type | Description |
---|---|---|
YIELDS | Token |
A right-child of a token of the span. |
Span.n_lefts
The number of tokens that are to the left of the span, whose heads are within the span.
Example
doc = nlp(u"I like New York in Autumn.") assert doc[3:7].n_lefts == 1
Name | Type | Description |
---|---|---|
RETURNS | int | The number of left-child tokens. |
Span.n_rights
The number of tokens that are to the right of the span, whose heads are within the span.
Example
doc = nlp(u"I like New York in Autumn.") assert doc[2:4].n_rights == 1
Name | Type | Description |
---|---|---|
RETURNS | int | The number of right-child tokens. |
Span.subtree
Tokens within the span and tokens which descend from them.
Example
doc = nlp(u"Give it back! He pleaded.") subtree = [t.text for t in doc[:3].subtree] assert subtree == [u"Give", u"it", u"back", u"!"]
Name | Type | Description |
---|---|---|
YIELDS | Token |
A token within the span, or a descendant from it. |
Span.has_vector
A boolean value indicating whether a word vector is associated with the object.
Example
doc = nlp(u"I like apples") assert doc[1:].has_vector
Name | Type | Description |
---|---|---|
RETURNS | bool | Whether the span has a vector data attached. |
Span.vector
A real-valued meaning representation. Defaults to an average of the token vectors.
Example
doc = nlp(u"I like apples") assert doc[1:].vector.dtype == "float32" assert doc[1:].vector.shape == (300,)
Name | Type | Description |
---|---|---|
RETURNS | numpy.ndarray[ndim=1, dtype='float32'] |
A 1D numpy array representing the span's semantics. |
Span.vector_norm
The L2 norm of the span's vector representation.
Example
doc = nlp(u"I like apples") doc[1:].vector_norm # 4.800883928527915 doc[2:].vector_norm # 6.895897646384268 assert doc[1:].vector_norm != doc[2:].vector_norm
Name | Type | Description |
---|---|---|
RETURNS | float | The L2 norm of the vector representation. |
Attributes
Name | Type | Description |
---|---|---|
doc |
Doc |
The parent document. |
tensor 2.1.7 |
ndarray |
The span's slice of the parent Doc 's tensor. |
sent |
Span |
The sentence span that this span is a part of. |
start |
int | The token offset for the start of the span. |
end |
int | The token offset for the end of the span. |
start_char |
int | The character offset for the start of the span. |
end_char |
int | The character offset for the end of the span. |
text |
unicode | A unicode representation of the span text. |
text_with_ws |
unicode | The text content of the span with a trailing whitespace character if the last token has one. |
orth |
int | ID of the verbatim text content. |
orth_ |
unicode | Verbatim text content (identical to Span.text ). Exists mostly for consistency with the other attributes. |
label |
int | The hash value of the span's label. |
label_ |
unicode | The span's label. |
lemma_ |
unicode | The span's lemma. |
kb_id |
int | The hash value of the knowledge base ID referred to by the span. |
kb_id_ |
unicode | The knowledge base ID referred to by the span. |
ent_id |
int | The hash value of the named entity the token is an instance of. |
ent_id_ |
unicode | The string ID of the named entity the token is an instance of. |
sentiment |
float | A scalar value indicating the positivity or negativity of the span. |
_ |
Underscore |
User space for adding custom attribute extensions. |