spaCy/span.md at d25f09468c4eca20eb464d78d35e439474ed2dbc

mirror of https://github.com/explosion/spaCy.git synced 2024-09-21 11:29:13 +03:00

* Add SpanRuler component

Add a `SpanRuler` component similar to `EntityRuler` that saves a list
of matched spans to `Doc.spans[spans_key]`. The matches from the token
and phrase matchers are deduplicated and sorted before assignment but
are not otherwise filtered.

* Update spacy/pipeline/span_ruler.py

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Fix cast

* Add self.key property

* Use number of patterns as length

* Remove patterns kwarg from init

* Update spacy/tests/pipeline/test_span_ruler.py

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Add options for spans filter and setting to ents

* Add `spans_filter` option as a registered function'
* Make `spans_key` optional and if `None`, set to `doc.ents` instead of
`doc.spans[spans_key]`.

* Update and generalize tests

* Add test for setting doc.ents, fix key property type

* Fix typing

* Allow independent doc.spans and doc.ents

* If `spans_key` is set, set `doc.spans` with `spans_filter`.
* If `annotate_ents` is set, set `doc.ents` with `ents_fitler`.
  * Use `util.filter_spans` by default as `ents_filter`.
  * Use a custom warning if the filter does not work for `doc.ents`.

* Enable use of SpanC.id in Span

* Support id in SpanRuler as Span.id

* Update types

* `id` can only be provided as string (already by `PatternType`
definition)

* Update all uses of Span.id/ent_id in Doc

* Rename Span id kwarg to span_id

* Update types and docs

* Add ents filter to mimic EntityRuler overwrite_ents

* Refactor `ents_filter` to take `entities, spans` args for more
  filtering options
* Give registered filters more descriptive names
* Allow registered `filter_spans` filter
  (`spacy.first_longest_spans_filter.v1`) to take any number of
  `Iterable[Span]` objects as args so it can be used for spans filter
  or ents filter

* Implement future entity ruler as span ruler

Implement a compatible `entity_ruler` as `future_entity_ruler` using
`SpanRuler` as the underlying component:
* Add `sort_key` and `sort_reverse` to allow the sorting behavior to be
  customized. (Necessary for the same sorting/filtering as in
  `EntityRuler`.)
* Implement `overwrite_overlapping_ents_filter` and
  `preserve_existing_ents_filter` to support
  `EntityRuler.overwrite_ents` settings.
* Add `remove_by_id` to support `EntityRuler.remove` functionality.
* Refactor `entity_ruler` tests to parametrize all tests to test both
  `entity_ruler` and `future_entity_ruler`
* Implement `SpanRuler.token_patterns` and `SpanRuler.phrase_patterns`
  properties.

Additional changes:

* Move all config settings to top-level attributes to avoid duplicating
  settings in the config vs. `span_ruler/cfg`. (Also avoids a lot of
  casting.)

* Format

* Fix filter make method name

* Refactor to use same error for removing by label or ID

* Also provide existing spans to spans filter

* Support ids property

* Remove token_patterns and phrase_patterns

* Update docstrings

* Add span ruler docs

* Fix types

* Apply suggestions from code review

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Move sorting into filters

* Check for all tokens in seen tokens in entity ruler filters

* Remove registered sort key

* Set Token.ent_id in a backwards-compatible way in Doc.set_ents

* Remove sort options from API docs

* Update docstrings

* Rename entity ruler filters

* Fix and parameterize scoring

* Add id to Span API docs

* Fix typo in API docs

* Include explicit labeled=True for scorer

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

2022-06-02 13:12:53 +02:00

25 KiB

Raw Blame History

title	tag	source
Span	class	spacy/tokens/span.pyx

A slice from a Doc object.

Span.init

Create a Span object from the slice doc[start : end].

Example

doc = nlp("Give it back! He pleaded.")
span = doc[1:4]
assert [t.text for t in span] ==  ["it", "back", "!"]

Name	Description
`doc`	The parent document. ~~Doc~~
`start`	The index of the first token of the span. ~~int~~
`end`	The index of the first token after the span. ~~int~~
`label`	A label to attach to the span, e.g. for named entities. ~~Union[str, int]~~
`vector`	A meaning representation of the span. ~~numpy.ndarray[ndim=1, dtype=float32]~~
`vector_norm`	The L2 norm of the document's vector representation. ~~float~~
`kb_id`	A knowledge base ID to attach to the span, e.g. for named entities. ~~Union[str, int]~~
`span_id`	An ID to associate with the span. ~~Union[str, int]~~

Span.getitem

Get a Token object.

Example

doc = nlp("Give it back! He pleaded.")
span = doc[1:4]
assert span[1].text == "back"

Name	Description
`i`	The index of the token within the span. ~~int~~
RETURNS	The token at `span[i]`. ~~Token~~

Get a Span object.

Example

doc = nlp("Give it back! He pleaded.")
span = doc[1:4]
assert span[1:3].text == "back!"

Name	Description
`start_end`	The slice of the span to get. ~~Tuple[int, int]~~
RETURNS	The span at `span[start : end]`. ~~Span~~

Span.iter

Iterate over Token objects.

Example

doc = nlp("Give it back! He pleaded.")
span = doc[1:4]
assert [t.text for t in span] == ["it", "back", "!"]

Name	Description
YIELDS	A `Token` object. ~~Token~~

Span.len

Get the number of tokens in the span.

Example

doc = nlp("Give it back! He pleaded.")
span = doc[1:4]
assert len(span) == 3

Name	Description
RETURNS	The number of tokens in the span. ~~int~~

Span.set_extension

Define a custom attribute on the Span which becomes available via Span._. For details, see the documentation on custom attributes.

Example

from spacy.tokens import Span
city_getter = lambda span: any(city in span.text for city in ("New York", "Paris", "Berlin"))
Span.set_extension("has_city", getter=city_getter)
doc = nlp("I like New York in Autumn")
assert doc[1:4]._.has_city

Name	Description
`name`	Name of the attribute to set by the extension. For example, `"my_attr"` will be available as `span._.my_attr`. ~~str~~
`default`	Optional default value of the attribute if no getter or method is defined. ~~Optional[Any]~~
`method`	Set a custom method on the object, for example `span._.compare(other_span)`. ~~Optional[CallableSpan, ...], Any~~
`getter`	Getter function that takes the object and returns an attribute value. Is called when the user accesses the `._` attribute. ~~Optional[CallableSpan], Any~~
`setter`	Setter function that takes the `Span` and a value, and modifies the object. Is called when the user writes to the `Span._` attribute. ~~Optional[CallableSpan, Any], None~~
`force`	Force overwriting existing attribute. ~~bool~~

Span.get_extension

Look up a previously registered extension by name. Returns a 4-tuple (default, method, getter, setter) if the extension is registered. Raises a KeyError otherwise.

Example

from spacy.tokens import Span
Span.set_extension("is_city", default=False)
extension = Span.get_extension("is_city")
assert extension == (False, None, None, None)

Name	Description
`name`	Name of the extension. ~~str~~
RETURNS	A `(default, method, getter, setter)` tuple of the extension. ~~Tuple[Optional[Any], Optional[Callable], Optional[Callable], Optional[Callable]]~~

Span.has_extension

Check whether an extension has been registered on the Span class.

Example

from spacy.tokens import Span
Span.set_extension("is_city", default=False)
assert Span.has_extension("is_city")

Name	Description
`name`	Name of the extension to check. ~~str~~
RETURNS	Whether the extension has been registered. ~~bool~~

Span.remove_extension

Remove a previously registered extension.

Example

from spacy.tokens import Span
Span.set_extension("is_city", default=False)
removed = Span.remove_extension("is_city")
assert not Span.has_extension("is_city")

Name	Description
`name`	Name of the extension. ~~str~~
RETURNS	A `(default, method, getter, setter)` tuple of the removed extension. ~~Tuple[Optional[Any], Optional[Callable], Optional[Callable], Optional[Callable]]~~

Span.char_span

Create a Span object from the slice span.text[start:end]. Returns None if the character indices don't map to a valid span.

Example

doc = nlp("I like New York")
span = doc[1:4].char_span(5, 13, label="GPE")
assert span.text == "New York"

Name	Description
`start`	The index of the first character of the span. ~~int~~
`end`	The index of the last character after the span. ~~int~~
`label`	A label to attach to the span, e.g. for named entities. ~~Union[int, str]~~
`kb_id` 2.2	An ID from a knowledge base to capture the meaning of a named entity. ~~Union[int, str]~~
`vector`	A meaning representation of the span. ~~numpy.ndarray[ndim=1, dtype=float32]~~
RETURNS	The newly constructed object or `None`. ~~Optional[Span]~~

Span.similarity

Make a semantic similarity estimate. The default estimate is cosine similarity using an average of word vectors.

Example

doc = nlp("green apples and red oranges")
green_apples = doc[:2]
red_oranges = doc[3:]
apples_oranges = green_apples.similarity(red_oranges)
oranges_apples = red_oranges.similarity(green_apples)
assert apples_oranges == oranges_apples

Name	Description
`other`	The object to compare with. By default, accepts `Doc`, `Span`, `Token` and `Lexeme` objects. ~~Union[Doc, Span, Token, Lexeme]~~
RETURNS	A scalar similarity score. Higher is more similar. ~~float~~

Span.get_lca_matrix

Calculates the lowest common ancestor matrix for a given Span. Returns LCA matrix containing the integer index of the ancestor, or -1 if no common ancestor is found, e.g. if span excludes a necessary ancestor.

Example

doc = nlp("I like New York in Autumn")
span = doc[1:4]
matrix = span.get_lca_matrix()
# array([[0, 0, 0], [0, 1, 2], [0, 2, 2]], dtype=int32)

Name	Description
RETURNS	The lowest common ancestor matrix of the `Span`. ~~numpy.ndarray[ndim=2, dtype=int32]~~

Span.to_array

Given a list of M attribute IDs, export the tokens to a numpy ndarray of shape (N, M), where N is the length of the document. The values will be 32-bit integers.

Example

from spacy.attrs import LOWER, POS, ENT_TYPE, IS_ALPHA
doc = nlp("I like New York in Autumn.")
span = doc[2:3]
# All strings mapped to integers, for easy export to numpy
np_array = span.to_array([LOWER, POS, ENT_TYPE, IS_ALPHA])

Name	Description
`attr_ids`	A list of attributes (int IDs or string names) or a single attribute (int ID or string name). ~~Union[int, str, List[Union[int, str]]]~~
RETURNS	The exported attributes as a numpy array. ~~Union[numpy.ndarray[ndim=2, dtype=uint64], numpy.ndarray[ndim=1, dtype=uint64]]~~

Span.ents

The named entities that fall completely within the span. Returns a tuple of Span objects.

Example

doc = nlp("Mr. Best flew to New York on Saturday morning.")
span = doc[0:6]
ents = list(span.ents)
assert ents[0].label == 346
assert ents[0].label_ == "PERSON"
assert ents[0].text == "Mr. Best"

Name	Description
RETURNS	Entities in the span, one `Span` per entity. ~~Tuple[Span, ...]~~

Span.noun_chunks

Iterate over the base noun phrases in the span. Yields base noun-phrase Span objects, if the document has been syntactically parsed. A base noun phrase, or "NP chunk", is a noun phrase that does not permit other NPs to be nested within it – so no NP-level coordination, no prepositional phrases, and no relative clauses.

If the noun_chunk syntax iterator has not been implemeted for the given language, a NotImplementedError is raised.

Example

doc = nlp("A phrase with another phrase occurs.")
span = doc[3:5]
chunks = list(span.noun_chunks)
assert len(chunks) == 1
assert chunks[0].text == "another phrase"

Name	Description
YIELDS	Noun chunks in the span. ~~Span~~

Span.as_doc

Create a new Doc object corresponding to the Span, with a copy of the data.

When calling this on many spans from the same doc, passing in a precomputed array representation of the doc using the array_head and array args can save time.

Example

doc = nlp("I like New York in Autumn.")
span = doc[2:4]
doc2 = span.as_doc()
assert doc2.text == "New York"

Name	Description
`copy_user_data`	Whether or not to copy the original doc's user data. ~~bool~~
`array_head`	Precomputed array attributes (headers) of the original doc, as generated by `Doc._get_array_attrs()`. ~~Tuple~~
`array`	Precomputed array version of the original doc as generated by `Doc.to_array`. ~~numpy.ndarray~~
RETURNS	A `Doc` object of the `Span`'s content. ~~Doc~~

Span.root

The token with the shortest path to the root of the sentence (or the root itself). If multiple tokens are equally high in the tree, the first token is taken.

Example

doc = nlp("I like New York in Autumn.")
i, like, new, york, in_, autumn, dot = range(len(doc))
assert doc[new].head.text == "York"
assert doc[york].head.text == "like"
new_york = doc[new:york+1]
assert new_york.root.text == "York"

Name	Description
RETURNS	The root token. ~~Token~~

Span.conjuncts

A tuple of tokens coordinated to span.root.

Example

doc = nlp("I like apples and oranges")
apples_conjuncts = doc[2:3].conjuncts
assert [t.text for t in apples_conjuncts] == ["oranges"]

Name	Description
RETURNS	The coordinated tokens. ~~Tuple[Token, ...]~~

Span.lefts

Tokens that are to the left of the span, whose heads are within the span.

Example

doc = nlp("I like New York in Autumn.")
lefts = [t.text for t in doc[3:7].lefts]
assert lefts == ["New"]

Name	Description
YIELDS	A left-child of a token of the span. ~~Token~~

Span.rights

Tokens that are to the right of the span, whose heads are within the span.

Example

doc = nlp("I like New York in Autumn.")
rights = [t.text for t in doc[2:4].rights]
assert rights == ["in"]

Name	Description
YIELDS	A right-child of a token of the span. ~~Token~~

Span.n_lefts

The number of tokens that are to the left of the span, whose heads are within the span.

Example

doc = nlp("I like New York in Autumn.")
assert doc[3:7].n_lefts == 1

Name	Description
RETURNS	The number of left-child tokens. ~~int~~

Span.n_rights

The number of tokens that are to the right of the span, whose heads are within the span.

Example

doc = nlp("I like New York in Autumn.")
assert doc[2:4].n_rights == 1

Name	Description
RETURNS	The number of right-child tokens. ~~int~~

Span.subtree

Tokens within the span and tokens which descend from them.

Example

doc = nlp("Give it back! He pleaded.")
subtree = [t.text for t in doc[:3].subtree]
assert subtree == ["Give", "it", "back", "!"]

Name	Description
YIELDS	A token within the span, or a descendant from it. ~~Token~~

Span.has_vector

A boolean value indicating whether a word vector is associated with the object.

Example

doc = nlp("I like apples")
assert doc[1:].has_vector

Name	Description
RETURNS	Whether the span has a vector data attached. ~~bool~~

Span.vector

A real-valued meaning representation. Defaults to an average of the token vectors.

Example

doc = nlp("I like apples")
assert doc[1:].vector.dtype == "float32"
assert doc[1:].vector.shape == (300,)

Name	Description
RETURNS	A 1-dimensional array representing the span's vector. ~~`numpy.ndarray[ndim=1, dtype=float32]~~

Span.vector_norm

The L2 norm of the span's vector representation.

Example

doc = nlp("I like apples")
doc[1:].vector_norm # 4.800883928527915
doc[2:].vector_norm # 6.895897646384268
assert doc[1:].vector_norm != doc[2:].vector_norm

Name	Description
RETURNS	The L2 norm of the vector representation. ~~float~~

Span.sent

The sentence span that this span is a part of. This property is only available when sentence boundaries have been set on the document by the parser, senter, sentencizer or some custom function. It will raise an error otherwise.

If the span happens to cross sentence boundaries, only the first sentence will be returned. If it is required that the sentence always includes the full span, the result can be adjusted as such:

sent = span.sent
sent = doc[sent.start : max(sent.end, span.end)]

Example

doc = nlp("Give it back! He pleaded.")
span = doc[1:3]
assert span.sent.text == "Give it back!"

Name	Description
RETURNS	The sentence span that this span is a part of. ~~Span~~

Span.sents

Returns a generator over the sentences the span belongs to. This property is only available when sentence boundaries have been set on the document by the parser, senter, sentencizer or some custom function. It will raise an error otherwise.

If the span happens to cross sentence boundaries, all sentences the span overlaps with will be returned.

Example

doc = nlp("Give it back! He pleaded.")
span = doc[2:4]
assert len(span.sents) == 2

Name	Description
RETURNS	A generator yielding sentences this `Span` is a part of ~~Iterable[Span]~~

Attributes

Name	Description
`doc`	The parent document. ~~Doc~~
`tensor` 2.1.7	The span's slice of the parent `Doc`'s tensor. ~~numpy.ndarray~~
`start`	The token offset for the start of the span. ~~int~~
`end`	The token offset for the end of the span. ~~int~~
`start_char`	The character offset for the start of the span. ~~int~~
`end_char`	The character offset for the end of the span. ~~int~~
`text`	A string representation of the span text. ~~str~~
`text_with_ws`	The text content of the span with a trailing whitespace character if the last token has one. ~~str~~
`orth`	ID of the verbatim text content. ~~int~~
`orth_`	Verbatim text content (identical to `Span.text`). Exists mostly for consistency with the other attributes. ~~str~~
`label`	The hash value of the span's label. ~~int~~
`label_`	The span's label. ~~str~~
`lemma_`	The span's lemma. Equivalent to `"".join(token.text_with_ws for token in span)`. ~~str~~
`kb_id`	The hash value of the knowledge base ID referred to by the span. ~~int~~
`kb_id_`	The knowledge base ID referred to by the span. ~~str~~
`ent_id`	The hash value of the named entity the root token is an instance of. ~~int~~
`ent_id_`	The string ID of the named entity the root token is an instance of. ~~str~~
`id`	The hash value of the span's ID. ~~int~~
`id_`	The span's ID. ~~str~~
`sentiment`	A scalar value indicating the positivity or negativity of the span. ~~float~~
`_`	User space for adding custom attribute extensions. ~~Underscore~~

25 KiB Raw Blame History Unescape Escape

Span.__init__

Example

Span.__getitem__

Example

Example

Span.__iter__

Example

Span.__len__

Example

Span.set_extension

Example

Span.get_extension

Example

Span.has_extension

Example

Span.remove_extension

Example

Span.char_span

Example

Span.similarity

Example

Span.get_lca_matrix

Example

Span.to_array

Example

Span.ents

Example

Span.noun_chunks

Example

Span.as_doc

Example

Span.root

Example

Span.conjuncts

Example

Span.lefts

Example

Span.rights

Example

Span.n_lefts

Example

Span.n_rights

Example

Span.subtree

Example

Span.has_vector

Example

Span.vector

Example

Span.vector_norm

Example

Span.sent

Example

Span.sents

Example

Attributes

25 KiB

Raw Blame History

Span.init

Span.getitem

Span.iter

Span.len