spaCy/span.md at cb41a33d14fc3974dbfd5a60d9a92ed970d25792

mirror of https://github.com/explosion/spaCy.git synced 2025-07-11 08:42:28 +03:00

Sofie Van Landeghem 0b4b4f1819 Documentation for Entity Linking (#4065)

* document token ent_kb_id

* document span kb_id

* update pipeline documentation

* prior and context weights as bools instead

* entitylinker api documentation

* drop for both models

* finish entitylinker documentation

* small fixes

* documentation for KB

* candidate documentation

* links to api pages in code

* small fix

* frequency examples as counts for consistency

* consistent documentation about tensors returned by predict

* add entity linking to usage 101

* add entity linking infobox and KB section to 101

* entity-linking in linguistic features

* small typo corrections

* training example and docs for entity_linker

* predefined nlp and kb

* revert back to similarity encodings for simplicity (for now)

* set prior probabilities to 0 when excluded

* code clean up

* bugfix: deleting kb ID from tokens when entities were removed

* refactor train el example to use either model or vocab

* pretrain_kb example for example kb generation

* add to training docs for KB + EL example scripts

* small fixes

* error numbering

* ensure the language of vocab and nlp stay consistent across serialization

* equality with =

* avoid conflict in errors file

* add error 151

* final adjustments to the train scripts - consistency

* update of goldparse documentation

* small corrections

* push commit

* typo fix

* add candidate API to kb documentation

* update API sidebar with EntityLinker and KnowledgeBase

* remove EL from 101 docs

* remove entity linker from 101 pipelines / rephrase

* custom el model instead of existing model

* set version to 2.2 for EL functionality

* update documentation for 2 CLI scripts

2019-09-12 11:38:34 +02:00


title	tag	source
Span	class	spacy/tokens/span.pyx

A slice from a Doc object.

Span.__init__

Create a Span object from the slice doc[start : end].

Example

doc = nlp(u"Give it back! He pleaded.")
span = doc[1:4]
assert [t.text for t in span] == [u"it", u"back", u"!"]

Name	Type	Description
`doc`	`Doc`	The parent document.
`start`	int	The index of the first token of the span.
`end`	int	The index of the first token after the span.
`label`	int / unicode	A label to attach to the span, e.g. for named entities. As of v2.1, the label can also be a unicode string.
`kb_id`	int / unicode	A knowledge base ID to attach to the span, e.g. for named entities. The ID can be an integer or a unicode string.
`vector`	`numpy.ndarray[ndim=1, dtype='float32']`	A meaning representation of the span.
RETURNS	`Span`	The newly constructed object.

Span.__getitem__

Get a Token object.

Example

doc = nlp(u"Give it back! He pleaded.")
span = doc[1:4]
assert span[1].text == "back"

Name	Type	Description
`i`	int	The index of the token within the span.
RETURNS	`Token`	The token at `span[i]`.

Get a Span object.

Example

doc = nlp(u"Give it back! He pleaded.")
span = doc[1:4]
assert span[1:3].text == u"back!"

Name	Type	Description
`start_end`	tuple	The slice of the span to get.
RETURNS	`Span`	The span at `span[start : end]`.
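
The indexing rules above can be illustrated without spaCy. The sketch below is a plain-Python stand-in, not spaCy's implementation: the `SimpleSpan` class and its fields are hypothetical. It assumes that integer indices are relative to the span and that a slice yields another span over the same parent token list.

```python
class SimpleSpan:
    """Hypothetical stand-in for a span: a half-open window [start, end) over a token list."""

    def __init__(self, tokens, start, end):
        self.tokens = tokens  # parent "document" (a list of strings)
        self.start = start    # index of the first token of the span
        self.end = end        # index of the first token after the span

    def __len__(self):
        return self.end - self.start

    def __iter__(self):
        return iter(self.tokens[self.start:self.end])

    def __getitem__(self, i):
        if isinstance(i, slice):
            # Resolve the slice relative to the span, then map back to parent offsets.
            lo, hi, step = i.indices(len(self))
            if step != 1:
                raise ValueError("stepped slices are not supported in this sketch")
            return SimpleSpan(self.tokens, self.start + lo, self.start + max(lo, hi))
        if i < 0:
            i += len(self)
        if not 0 <= i < len(self):
            raise IndexError("span index out of range")
        return self.tokens[self.start + i]


tokens = ["Give", "it", "back", "!", "He", "pleaded", "."]
span = SimpleSpan(tokens, 1, 4)
sub = span[1:3]
print(len(span), span[1], list(sub))
```

Storing only offsets into the parent list keeps the sub-span a view rather than a copy, which is the design this sketch assumes.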

Span.__iter__

Iterate over Token objects.

Example

doc = nlp(u"Give it back! He pleaded.")
span = doc[1:4]
assert [t.text for t in span] == [u"it", u"back", u"!"]

Name	Type	Description
YIELDS	`Token`	A `Token` object.

Span.__len__

Get the number of tokens in the span.

Example

doc = nlp(u"Give it back! He pleaded.")
span = doc[1:4]
assert len(span) == 3

Name	Type	Description
RETURNS	int	The number of tokens in the span.

Span.set_extension

Define a custom attribute on the Span which becomes available via Span._. For details, see the documentation on custom attributes.

Example

from spacy.tokens import Span
city_getter = lambda span: any(city in span.text for city in (u"New York", u"Paris", u"Berlin"))
Span.set_extension("has_city", getter=city_getter)
doc = nlp(u"I like New York in Autumn")
assert doc[1:4]._.has_city

Name	Type	Description
`name`	unicode	Name of the attribute to set by the extension. For example, `'my_attr'` will be available as `span._.my_attr`.
`default`	-	Optional default value of the attribute if no getter or method is defined.
`method`	callable	Set a custom method on the object, for example `span._.compare(other_span)`.
`getter`	callable	Getter function that takes the object and returns an attribute value. Is called when the user accesses the `._` attribute.
`setter`	callable	Setter function that takes the `Span` and a value, and modifies the object. Is called when the user writes to the `Span._` attribute.
`force`	bool	Force overwriting existing attribute.

Span.get_extension

Look up a previously registered extension by name. Returns a 4-tuple (default, method, getter, setter) if the extension is registered. Raises a KeyError otherwise.

Example

from spacy.tokens import Span
Span.set_extension("is_city", default=False)
extension = Span.get_extension("is_city")
assert extension == (False, None, None, None)

Name	Type	Description
`name`	unicode	Name of the extension.
RETURNS	tuple	A `(default, method, getter, setter)` tuple of the extension.

Span.has_extension

Check whether an extension has been registered on the Span class.

Example

from spacy.tokens import Span
Span.set_extension("is_city", default=False)
assert Span.has_extension("is_city")

Name	Type	Description
`name`	unicode	Name of the extension to check.
RETURNS	bool	Whether the extension has been registered.

Span.remove_extension

Remove a previously registered extension.

Example

from spacy.tokens import Span
Span.set_extension("is_city", default=False)
removed = Span.remove_extension("is_city")
assert not Span.has_extension("is_city")

Name	Type	Description
`name`	unicode	Name of the extension.
RETURNS	tuple	A `(default, method, getter, setter)` tuple of the removed extension.
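
The contract shared by the four extension methods can be sketched with a small registry. This is a hypothetical plain-Python model, not spaCy's code: the `ExtensionRegistry` class is invented for illustration, and the error raised for duplicate names or for removing an unknown name is an assumption of the sketch.

```python
class ExtensionRegistry:
    """Hypothetical registry mimicking the set/get/has/remove contract described above."""

    def __init__(self):
        self._extensions = {}

    def set_extension(self, name, default=None, method=None, getter=None,
                      setter=None, force=False):
        # Assumption: re-registering a name without force=True is an error.
        if name in self._extensions and not force:
            raise ValueError("extension %r already exists; pass force=True" % name)
        self._extensions[name] = (default, method, getter, setter)

    def get_extension(self, name):
        # A plain dict lookup raises KeyError for unknown names.
        return self._extensions[name]

    def has_extension(self, name):
        return name in self._extensions

    def remove_extension(self, name):
        # Returns the removed 4-tuple; in this sketch an unknown name raises KeyError.
        return self._extensions.pop(name)


registry = ExtensionRegistry()
registry.set_extension("is_city", default=False)
looked_up = registry.get_extension("is_city")
removed = registry.remove_extension("is_city")
```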

Span.similarity

Make a semantic similarity estimate. The default estimate is cosine similarity using an average of word vectors.

Example

doc = nlp(u"green apples and red oranges")
green_apples = doc[:2]
red_oranges = doc[3:]
apples_oranges = green_apples.similarity(red_oranges)
oranges_apples = red_oranges.similarity(green_apples)
assert apples_oranges == oranges_apples

Name	Type	Description
`other`	-	The object to compare with. By default, accepts `Doc`, `Span`, `Token` and `Lexeme` objects.
RETURNS	float	A scalar similarity score. Higher is more similar.
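
As a rough illustration of the default estimate, the sketch below averages toy word vectors for each span and takes the cosine of the two averages. The vectors are made up and the helper names are hypothetical. It only shows why the score is symmetric and does not reproduce spaCy's numerics.

```python
import math


def mean_vector(vectors):
    """Component-wise average of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def cosine(a, b):
    """Cosine similarity of two vectors; 0.0 if either has zero length."""
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return sum(x * y for x, y in zip(a, b)) / (norm_a * norm_b)


# Toy 3-dimensional "word vectors"; the numbers are invented for illustration.
toy_vectors = {
    "green": [0.0, 1.0, 0.0],
    "apples": [1.0, 1.0, 0.0],
    "red": [0.0, 0.0, 1.0],
    "oranges": [1.0, 0.0, 1.0],
}
green_apples = mean_vector([toy_vectors["green"], toy_vectors["apples"]])
red_oranges = mean_vector([toy_vectors["red"], toy_vectors["oranges"]])
apples_oranges = cosine(green_apples, red_oranges)
oranges_apples = cosine(red_oranges, green_apples)
```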

Span.get_lca_matrix

Calculate the lowest common ancestor (LCA) matrix for the span. Each cell of the returned matrix contains the integer index of the lowest common ancestor of a pair of tokens, or -1 if no common ancestor is found, e.g. if the span excludes a necessary ancestor.

Example

doc = nlp(u"I like New York in Autumn")
span = doc[1:4]
matrix = span.get_lca_matrix()
# array([[0, 0, 0], [0, 1, 2], [0, 2, 2]], dtype=int32)

Name	Type	Description
RETURNS	`numpy.ndarray[ndim=2, dtype='int32']`	The lowest common ancestor matrix of the `Span`.
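
To make the matrix above easier to read, here is a minimal pure-Python sketch that derives it from head indices. The `heads` list is an assumed parse of the example sentence in which a root points to itself, and the function names are hypothetical. It illustrates the definition and is not spaCy's algorithm.

```python
def ancestors_inclusive(i, heads):
    """Return [i, head(i), head(head(i)), ...] up to the root (a token that is its own head)."""
    chain = [i]
    while heads[i] != i:
        i = heads[i]
        chain.append(i)
    return chain


def lca_matrix(heads, start, end):
    """LCA matrix for tokens in [start, end), with indices relative to `start`."""
    n = end - start
    matrix = [[-1] * n for _ in range(n)]
    for a in range(start, end):
        chain_a = ancestors_inclusive(a, heads)
        for b in range(start, end):
            chain_b = set(ancestors_inclusive(b, heads))
            # The first node on a's path to the root that is also on b's path is the LCA.
            lca = next((node for node in chain_a if node in chain_b), None)
            if lca is not None and start <= lca < end:
                matrix[a - start][b - start] = lca - start
    return matrix


# Assumed head indices for "I like New York in Autumn".
heads = [1, 1, 3, 1, 3, 4]
matrix = lca_matrix(heads, 1, 4)

# Two tokens whose only common ancestor (token 2) lies outside the span [0, 2).
outside = lca_matrix([2, 2, 2], 0, 2)
```

Under these assumptions `matrix` matches the array shown in the example, and `outside` contains -1 for the pair whose ancestor is excluded from the span.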

Span.to_array

Given a list of M attribute IDs, export the tokens to a numpy ndarray of shape (N, M), where N is the length of the span. The values will be 32-bit integers.

Example

from spacy.attrs import LOWER, POS, ENT_TYPE, IS_ALPHA
doc = nlp(u"I like New York in Autumn.")
span = doc[2:3]
# All strings mapped to integers, for easy export to numpy
np_array = span.to_array([LOWER, POS, ENT_TYPE, IS_ALPHA])

Name	Type	Description
`attr_ids`	list	A list of attribute ID ints.
RETURNS	`numpy.ndarray[long, ndim=2]`	A feature matrix, with one row per word, and one column per attribute indicated in the input `attr_ids`.
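
The (N, M) layout can be pictured with plain lists. The following is a hypothetical sketch: the token records, attribute names and the `string_ids` table are invented, and real attribute values come from spaCy's own string store. It shows one row per token, one column per attribute, and strings replaced by integer IDs.

```python
def to_array(tokens, attr_names, string_ids):
    """Build an N x M table: one row per token, one column per requested attribute."""
    rows = []
    for token in tokens:
        row = []
        for name in attr_names:
            value = token[name]
            if isinstance(value, str):
                # Strings are replaced by integer IDs so every cell is an int.
                value = string_ids.setdefault(value, len(string_ids) + 1)
            row.append(int(value))
        rows.append(row)
    return rows


# Hypothetical token records; attribute names and values are illustrative only.
span_tokens = [
    {"lower": "new", "pos": "PROPN", "is_alpha": True},
    {"lower": "york", "pos": "PROPN", "is_alpha": True},
]
string_ids = {}
table = to_array(span_tokens, ["lower", "pos", "is_alpha"], string_ids)
```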

Span.merge

As of v2.1.0, Span.merge still works but is considered deprecated. You should use the new and less error-prone Doc.retokenize instead.

Retokenize the document, such that the span is merged into a single token.

Example

doc = nlp(u"I like New York in Autumn.")
span = doc[2:4]
span.merge()
assert len(doc) == 6
assert doc[2].text == u"New York"

Name	Type	Description
`**attributes`	-	Attributes to assign to the merged token. By default, attributes are inherited from the syntactic root token of the span.
RETURNS	`Token`	The newly merged token.

Span.ents

The named entities in the span. Returns a tuple of named entity Span objects, if the entity recognizer has been applied.

Example

doc = nlp(u"Mr. Best flew to New York on Saturday morning.")
span = doc[0:6]
ents = list(span.ents)
assert ents[0].label == 346
assert ents[0].label_ == "PERSON"
assert ents[0].text == u"Mr. Best"

Name	Type	Description
RETURNS	tuple	Entities in the span, one `Span` per entity.
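
One way to think about this property is as a filter over the document's entities. The sketch below assumes that an entity belongs to the span only if it lies entirely inside it. That rule, the `(start, end, label)` triples and the labels themselves are assumptions for illustration, not a statement of how spaCy resolves partial overlaps.

```python
def ents_in_span(doc_ents, start, end):
    """Keep entities (start, end, label) that lie entirely inside [start, end)."""
    return tuple(ent for ent in doc_ents if ent[0] >= start and ent[1] <= end)


# Assumed entity annotations for "Mr. Best flew to New York on Saturday morning."
# given as (start_token, end_token, label) triples.
doc_ents = [(0, 2, "PERSON"), (4, 6, "GPE"), (7, 8, "DATE"), (8, 9, "TIME")]
span_ents = ents_in_span(doc_ents, 0, 6)
```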

Span.as_doc

Create a new Doc object corresponding to the Span, with a copy of the data.

Example

doc = nlp(u"I like New York in Autumn.")
span = doc[2:4]
doc2 = span.as_doc()
assert doc2.text == u"New York"

Name	Type	Description
RETURNS	`Doc`	A `Doc` object of the `Span`'s content.

Span.root

The token with the shortest path to the root of the sentence (or the root itself). If multiple tokens are equally high in the tree, the first token is taken.

Example

doc = nlp(u"I like New York in Autumn.")
i, like, new, york, in_, autumn, dot = range(len(doc))
assert doc[new].head.text == u"York"
assert doc[york].head.text == u"like"
new_york = doc[new:york+1]
assert new_york.root.text == u"York"

Name	Type	Description
RETURNS	`Token`	The root token.
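
The selection rule can be sketched from head indices alone. The `heads` list below is an assumed parse of the example sentence with "like" as the root, and the helper names are hypothetical. The sketch picks the span token with the fewest arcs to the sentence root and breaks ties in favour of the first token.

```python
def depth(i, heads):
    """Number of arcs from token i up to the sentence root (a token that is its own head)."""
    steps = 0
    while heads[i] != i:
        i = heads[i]
        steps += 1
    return steps


def span_root(heads, start, end):
    """Index of the token in [start, end) closest to the root; ties go to the first token."""
    # min() returns the first minimal element, which gives the tie-breaking rule.
    return min(range(start, end), key=lambda i: depth(i, heads))


words = ["I", "like", "New", "York", "in", "Autumn", "."]
heads = [1, 1, 3, 1, 3, 4, 1]  # assumed parse; "like" is the root
root_index = span_root(heads, 2, 4)
```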

Span.conjuncts

A tuple of tokens coordinated to span.root.

Example

doc = nlp(u"I like apples and oranges")
apples_conjuncts = doc[2:3].conjuncts
assert [t.text for t in apples_conjuncts] == [u"oranges"]

Name	Type	Description
RETURNS	`tuple`	The coordinated tokens.

Span.lefts

Tokens that are to the left of the span, whose heads are within the span.

Example

doc = nlp(u"I like New York in Autumn.")
lefts = [t.text for t in doc[3:7].lefts]
assert lefts == [u"New"]

Name	Type	Description
YIELDS	`Token`	A left-child of a token of the span.

Span.rights

Tokens that are to the right of the span, whose heads are within the span.

Example

doc = nlp(u"I like New York in Autumn.")
rights = [t.text for t in doc[2:4].rights]
assert rights == [u"in"]

Name	Type	Description
YIELDS	`Token`	A right-child of a token of the span.

Span.n_lefts

The number of tokens that are to the left of the span, whose heads are within the span.

Example

doc = nlp(u"I like New York in Autumn.")
assert doc[3:7].n_lefts == 1

Name	Type	Description
RETURNS	int	The number of left-child tokens.

Span.n_rights

The number of tokens that are to the right of the span, whose heads are within the span.

Example

doc = nlp(u"I like New York in Autumn.")
assert doc[2:4].n_rights == 1

Name	Type	Description
RETURNS	int	The number of right-child tokens.
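
The four properties `lefts`, `rights`, `n_lefts` and `n_rights` follow the same pattern, which the sketch below expresses over a list of head indices. The parse in `heads` is assumed, the functions are hypothetical, and the order in which spaCy yields the tokens may differ from the document order used here.

```python
def lefts(heads, start, end):
    """Indices of tokens before the span whose head lies inside [start, end)."""
    return [i for i in range(0, start) if start <= heads[i] < end]


def rights(heads, start, end):
    """Indices of tokens after the span whose head lies inside [start, end)."""
    return [i for i in range(end, len(heads)) if start <= heads[i] < end]


words = ["I", "like", "New", "York", "in", "Autumn", "."]
heads = [1, 1, 3, 1, 3, 4, 1]  # assumed parse; a root points to itself
left_words = [words[i] for i in lefts(heads, 3, 7)]
right_words = [words[i] for i in rights(heads, 2, 4)]
n_lefts = len(lefts(heads, 3, 7))
n_rights = len(rights(heads, 2, 4))
```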

Span.subtree

Tokens within the span and tokens which descend from them.

Example

doc = nlp(u"Give it back! He pleaded.")
subtree = [t.text for t in doc[:3].subtree]
assert subtree == [u"Give", u"it", u"back", u"!"]

Name	Type	Description
YIELDS	`Token`	A token within the span, or a descendant from it.
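
A minimal sketch of the same idea, assuming head indices in which each sentence root points to itself: a token belongs to the subtree if it is in the span or if following its heads upward eventually reaches a span token. The `heads` values and the function name are hypothetical.

```python
def subtree(heads, start, end):
    """Indices of span tokens plus every token that descends from one, in document order."""
    def reaches_span(i):
        # Walk up the head chain; stop at the root (a token that is its own head).
        while True:
            if start <= i < end:
                return True
            if heads[i] == i:
                return False
            i = heads[i]

    return [i for i in range(len(heads)) if reaches_span(i)]


words = ["Give", "it", "back", "!", "He", "pleaded", "."]
heads = [0, 0, 0, 0, 5, 5, 5]  # assumed parse with two sentence roots
subtree_words = [words[i] for i in subtree(heads, 0, 3)]
```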

Span.has_vector

A boolean value indicating whether a word vector is associated with the object.

Example

doc = nlp(u"I like apples")
assert doc[1:].has_vector

Name	Type	Description
RETURNS	bool	Whether the span has vector data attached.

Span.vector

A real-valued meaning representation. Defaults to an average of the token vectors.

Example

doc = nlp(u"I like apples")
assert doc[1:].vector.dtype == "float32"
assert doc[1:].vector.shape == (300,)

Name	Type	Description
RETURNS	`numpy.ndarray[ndim=1, dtype='float32']`	A 1D numpy array representing the span's semantics.

Span.vector_norm

The L2 norm of the span's vector representation.

Example

doc = nlp(u"I like apples")
doc[1:].vector_norm # 4.800883928527915
doc[2:].vector_norm # 6.895897646384268
assert doc[1:].vector_norm != doc[2:].vector_norm

Name	Type	Description
RETURNS	float	The L2 norm of the vector representation.
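
For intuition, the default span vector and its norm can be worked through by hand. The sketch below uses made-up 2-dimensional token vectors and hypothetical helper names. It averages the token vectors component-wise and then takes the square root of the sum of squares.

```python
import math


def span_vector(token_vectors):
    """Average the token vectors component-wise."""
    n = len(token_vectors)
    return [sum(column) / n for column in zip(*token_vectors)]


def l2_norm(vector):
    """Square root of the sum of squared components."""
    return math.sqrt(sum(x * x for x in vector))


# Made-up 2-dimensional vectors standing in for the tokens "like" and "apples".
token_vectors = [[3.0, 0.0], [3.0, 8.0]]
vector = span_vector(token_vectors)
norm = l2_norm(vector)
```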

Attributes

Name	Type	Description
`doc`	`Doc`	The parent document.
`tensor` (v2.1.7)	`ndarray`	The span's slice of the parent `Doc`'s tensor.
`sent`	`Span`	The sentence span that this span is a part of.
`start`	int	The token offset for the start of the span.
`end`	int	The token offset for the end of the span.
`start_char`	int	The character offset for the start of the span.
`end_char`	int	The character offset for the end of the span.
`text`	unicode	A unicode representation of the span text.
`text_with_ws`	unicode	The text content of the span with a trailing whitespace character if the last token has one.
`orth`	int	ID of the verbatim text content.
`orth_`	unicode	Verbatim text content (identical to `Span.text`). Exists mostly for consistency with the other attributes.
`label`	int	The hash value of the span's label.
`label_`	unicode	The span's label.
`lemma_`	unicode	The span's lemma.
`kb_id`	int	The hash value of the knowledge base ID referred to by the span.
`kb_id_`	unicode	The knowledge base ID referred to by the span.
`ent_id`	int	The hash value of the named entity the token is an instance of.
`ent_id_`	unicode	The string ID of the named entity the token is an instance of.
`sentiment`	float	A scalar value indicating the positivity or negativity of the span.
`_`	`Underscore`	User space for adding custom attribute extensions.
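
The relationship between `start_char`, `end_char`, `text` and `text_with_ws` can be sketched with tokens represented as `(text, trailing_whitespace)` pairs. That representation and the `span_text_info` helper are assumptions made for illustration only.

```python
def span_text_info(tokens, start, end):
    """Derive text attributes for tokens[start:end]; each token is (text, trailing_whitespace)."""
    # Character offset of the span start: total length of everything before it.
    start_char = sum(len(text) + len(ws) for text, ws in tokens[:start])
    text_with_ws = "".join(text + ws for text, ws in tokens[start:end])
    # The plain text drops the whitespace that trails the last token.
    last_ws = tokens[end - 1][1]
    text = text_with_ws[: len(text_with_ws) - len(last_ws)]
    return {
        "start_char": start_char,
        "end_char": start_char + len(text),
        "text": text,
        "text_with_ws": text_with_ws,
    }


# "I like New York in Autumn." split into (text, trailing whitespace) pairs.
tokens = [("I", " "), ("like", " "), ("New", " "), ("York", " "),
          ("in", " "), ("Autumn", ""), (".", "")]
info = span_text_info(tokens, 2, 4)
```

With these inputs the character offsets slice the original string back to the span text, which is the invariant the attributes are meant to satisfy.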
