spaCy/doc.md at 9c064e6ad95ac2733fc27a874644b6ad8caecedf

mirror of https://github.com/explosion/spaCy.git synced 2024-09-21 19:39:13 +03:00

Ines Montani ebcf2bb1c3 Add Doc.lang and Doc.lang_

2019-03-11 14:21:40 +01:00

title	tag	teaser	source
Doc	class	A container for accessing linguistic annotations.	spacy/tokens/doc.pyx

A Doc is a sequence of Token objects. Access sentences and named entities, export annotations to numpy arrays, losslessly serialize to compressed binary strings. The Doc object holds an array of TokenC structs. The Python-level Token and Span objects are views of this array, i.e. they don't own the data themselves.

Example

# Construction 1
doc = nlp(u"Some text")

# Construction 2
from spacy.tokens import Doc
words = [u"hello", u"world", u"!"]
spaces = [True, False, False]
doc = Doc(nlp.vocab, words=words, spaces=spaces)

Doc.__init__

Construct a Doc object. The most common way to get a Doc object is via the nlp object.

Name	Type	Description
`vocab`	`Vocab`	A storage container for lexical types.
`words`	iterable	A list of strings to add to the container.
`spaces`	iterable	A list of boolean values indicating whether each word has a subsequent space. Must have the same length as `words`, if specified. Defaults to a sequence of `True`.
RETURNS	`Doc`	The newly constructed object.
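
The relationship between `words` and `spaces` can be sketched in plain Python. This is only an illustration of how the two lists determine the document text; `join_words` is a hypothetical helper, not part of spaCy.

```python
def join_words(words, spaces=None):
    """Rebuild text from words and trailing-space flags (illustrative only)."""
    if spaces is None:
        # Mirrors the documented default: every word is followed by a space.
        spaces = [True] * len(words)
    if len(spaces) != len(words):
        raise ValueError("words and spaces must have the same length")
    return "".join(word + (" " if space else "") for word, space in zip(words, spaces))


text = join_words(["hello", "world", "!"], [True, False, False])
print(text)  # hello world!
```

Note that with the default of all `True`, the reconstructed text ends in a trailing space, so passing `spaces` explicitly is usually what you want when the original text matters.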

Doc.__getitem__

Get a Token object at position i, where i is an integer. Negative indexing is supported, and follows the usual Python semantics, i.e. doc[-2] is doc[len(doc) - 2].

Example

doc = nlp(u"Give it back! He pleaded.")
assert doc[0].text == "Give"
assert doc[-1].text == "."
span = doc[1:3]
assert span.text == "it back"

Name	Type	Description
`i`	int	The index of the token.
RETURNS	`Token`	The token at `doc[i]`.

Get a Span object, starting at position start (token index) and ending at position end (token index). For instance, doc[2:5] produces a span consisting of tokens 2, 3 and 4. Stepped slices (e.g. doc[start : end : step]) are not supported, as Span objects must be contiguous (cannot have gaps). You can use negative indices and open-ended ranges, which have their normal Python semantics.

Name	Type	Description
`start_end`	tuple	The slice of the document to get.
RETURNS	`Span`	The span at `doc[start:end]`.
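
The slicing rules described above can be sketched with Python's built-in `slice.indices`. This is an assumption-laden illustration, not spaCy's implementation; `resolve_slice` is a hypothetical helper.

```python
def resolve_slice(key, length):
    """Turn a slice into (start, end) token indices, rejecting stepped slices."""
    if key.step not in (None, 1):
        raise ValueError("Stepped slices are not supported; spans must be contiguous")
    # slice.indices() applies the usual rules for negative and open-ended bounds.
    start, end, _ = key.indices(length)
    return start, max(start, end)


n_tokens = 7  # e.g. the token count of "Give it back! He pleaded."
print(resolve_slice(slice(2, 5), n_tokens))      # (2, 5)
print(resolve_slice(slice(-2, None), n_tokens))  # (5, 7)
```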

Doc.__iter__

Iterate over Token objects, from which the annotations can be easily accessed.

Example

doc = nlp(u'Give it back')
assert [t.text for t in doc] == [u'Give', u'it', u'back']

This is the main way of accessing Token objects, which are the main way annotations are accessed from Python. If faster-than-Python speeds are required, you can instead access the annotations as a numpy array, or access the underlying C data directly from Cython.

Name	Type	Description
YIELDS	`Token`	A `Token` object.

Doc.__len__

Get the number of tokens in the document.

Example

doc = nlp(u"Give it back! He pleaded.")
assert len(doc) == 7

Name	Type	Description
RETURNS	int	The number of tokens in the document.

Doc.set_extension

Define a custom attribute on the Doc which becomes available via Doc._. For details, see the documentation on custom attributes.

Example

from spacy.tokens import Doc
city_getter = lambda doc: any(city in doc.text for city in ('New York', 'Paris', 'Berlin'))
Doc.set_extension('has_city', getter=city_getter)
doc = nlp(u'I like New York')
assert doc._.has_city

Name	Type	Description
`name`	unicode	Name of the attribute to set by the extension. For example, `'my_attr'` will be available as `doc._.my_attr`.
`default`	-	Optional default value of the attribute if no getter or method is defined.
`method`	callable	Set a custom method on the object, for example `doc._.compare(other_doc)`.
`getter`	callable	Getter function that takes the object and returns an attribute value. Is called when the user accesses the `._` attribute.
`setter`	callable	Setter function that takes the `Doc` and a value, and modifies the object. Is called when the user writes to the `Doc._` attribute.
`force`	bool	Force overwriting existing attribute.

Doc.get_extension

Look up a previously registered extension by name. Returns a 4-tuple (default, method, getter, setter) if the extension is registered. Raises a KeyError otherwise.

Example

from spacy.tokens import Doc
Doc.set_extension('has_city', default=False)
extension = Doc.get_extension('has_city')
assert extension == (False, None, None, None)

Name	Type	Description
`name`	unicode	Name of the extension.
RETURNS	tuple	A `(default, method, getter, setter)` tuple of the extension.
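
To make the `(default, method, getter, setter)` tuple concrete, here is a toy registry in plain Python. It is a sketch only: `ExtensionRegistry` is a hypothetical class, and raising `ValueError` on a duplicate name without `force` is an assumption about how overwriting is guarded.

```python
class ExtensionRegistry:
    """Toy stand-in for a class-level extension store (illustrative only)."""

    def __init__(self):
        self._extensions = {}

    def set_extension(self, name, default=None, method=None,
                      getter=None, setter=None, force=False):
        # Assumption: re-registering a name requires force=True.
        if name in self._extensions and not force:
            raise ValueError("Extension '%s' already exists" % name)
        self._extensions[name] = (default, method, getter, setter)

    def get_extension(self, name):
        # A plain dict lookup raises KeyError for unknown names.
        return self._extensions[name]

    def has_extension(self, name):
        return name in self._extensions

    def remove_extension(self, name):
        # Returns the removed 4-tuple, or raises KeyError if it was never set.
        return self._extensions.pop(name)


registry = ExtensionRegistry()
registry.set_extension("has_city", default=False)
print(registry.get_extension("has_city"))  # (False, None, None, None)
```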

Doc.has_extension

Check whether an extension has been registered on the Doc class.

Example

from spacy.tokens import Doc
Doc.set_extension('has_city', default=False)
assert Doc.has_extension('has_city')

Name	Type	Description
`name`	unicode	Name of the extension to check.
RETURNS	bool	Whether the extension has been registered.

Doc.remove_extension

Remove a previously registered extension.

Example

from spacy.tokens import Doc
Doc.set_extension('has_city', default=False)
removed = Doc.remove_extension('has_city')
assert not Doc.has_extension('has_city')

Name	Type	Description
`name`	unicode	Name of the extension.
RETURNS	tuple	A `(default, method, getter, setter)` tuple of the removed extension.

Doc.char_span

Create a Span object from the slice doc.text[start:end]. Returns None if the character indices don't map to a valid span.

Example

doc = nlp(u"I like New York")
span = doc.char_span(7, 15, label=u"GPE")
assert span.text == "New York"

Name	Type	Description
`start`	int	The index of the first character of the span.
`end`	int	The index of the first character after the span.
`label`	uint64 / unicode	A label to attach to the Span, e.g. for named entities.
`vector`	`numpy.ndarray[ndim=1, dtype='float32']`	A meaning representation of the span.
RETURNS	`Span`	The newly constructed object or `None`.
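
Why some character offsets yield `None` can be illustrated with a small pure-Python sketch. It assumes each token is described by its `(start_char, end_char)` offsets; `char_span_to_token_range` is a hypothetical helper, not spaCy's implementation.

```python
def char_span_to_token_range(token_offsets, start, end):
    """Map character offsets to a (token_start, token_end) range, or None."""
    starts = {tok_start: i for i, (tok_start, _) in enumerate(token_offsets)}
    ends = {tok_end: i for i, (_, tok_end) in enumerate(token_offsets)}
    # Both offsets must fall exactly on token boundaries.
    if start not in starts or end not in ends:
        return None
    token_start, token_end = starts[start], ends[end] + 1
    if token_end <= token_start:
        return None
    return token_start, token_end


# Offsets of "I", "like", "New", "York" in "I like New York"
offsets = [(0, 1), (2, 6), (7, 10), (11, 15)]
print(char_span_to_token_range(offsets, 7, 15))  # (2, 4)
print(char_span_to_token_range(offsets, 8, 15))  # None
```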

Doc.similarity

Make a semantic similarity estimate. The default estimate is cosine similarity using an average of word vectors.

Example

apples = nlp(u"I like apples")
oranges = nlp(u"I like oranges")
apples_oranges = apples.similarity(oranges)
oranges_apples = oranges.similarity(apples)
assert apples_oranges == oranges_apples

Name	Type	Description
`other`	-	The object to compare with. By default, accepts `Doc`, `Span`, `Token` and `Lexeme` objects.
RETURNS	float	A scalar similarity score. Higher is more similar.
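
The default estimate (cosine similarity over averaged word vectors) can be sketched without any dependencies. The vectors below are made up for illustration, and `average` and `cosine` are hypothetical helpers rather than spaCy functions.

```python
import math


def average(vectors):
    """Element-wise mean of equally sized vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Made-up 3-dimensional "word vectors" for two three-token documents.
apples = average([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.5, 0.5, 1.0]])
oranges = average([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.4, 0.6, 0.9]])
score = cosine(apples, oranges)
```

Because both the dot product and the product of norms are symmetric in their arguments, the score is the same in either direction, which is what the example's final assertion checks.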

Doc.count_by

Count the frequencies of a given attribute. Produces a dict of {attr (int): count (ints)} frequencies, keyed by the values of the given attribute ID.

Example

from spacy.attrs import ORTH
doc = nlp(u"apple apple orange banana")
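counts = doc.count_by(ORTH)
# Keys are the integer ORTH IDs from the vocab's string store; values are token counts
assert counts[doc.vocab.strings[u"apple"]] == 2
assert sorted(counts.values()) == [1, 1, 2]
doc.to_array([ORTH])
# e.g. array([[11880], [11880], [7561], [12800]]); the actual IDs depend on the vocab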

Name	Type	Description
`attr_id`	int	The attribute ID
RETURNS	dict	A dictionary mapping attributes to integer counts.
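
Conceptually, the result is a frequency table over one integer attribute value per token. A minimal sketch using `collections.Counter`, with made-up IDs standing in for real attribute values (`count_by` here is a hypothetical free function, not the `Doc` method):

```python
from collections import Counter


def count_by(attr_values):
    """Count how often each integer attribute value occurs."""
    return dict(Counter(attr_values))


# Made-up integer IDs standing in for the ORTH values of "apple apple orange banana"
orth_ids = [11880, 11880, 7561, 12800]
print(count_by(orth_ids))  # {11880: 2, 7561: 1, 12800: 1}
```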

Doc.get_lca_matrix

Calculates the lowest common ancestor matrix for a given Doc. Returns LCA matrix containing the integer index of the ancestor, or -1 if no common ancestor is found, e.g. if span excludes a necessary ancestor.

Example

doc = nlp(u"This is a test")
matrix = doc.get_lca_matrix()
# array([[0, 1, 1, 1], [1, 1, 1, 1], [1, 1, 2, 3], [1, 1, 3, 3]], dtype=int32)

Name	Type	Description
RETURNS	`numpy.ndarray[ndim=2, dtype='int32']`	The lowest common ancestor matrix of the `Doc`.
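
The matrix in the example above can be reproduced from a list of head indices. The sketch below assumes a parse in which "is" is the root (heads `[1, 1, 3, 1]`) and that a root token is its own head; `lca_matrix` is a hypothetical pure-Python helper, not spaCy's implementation.

```python
def lca_matrix(heads):
    """Compute pairwise lowest common ancestors from per-token head indices."""
    n = len(heads)

    def path_to_root(i):
        # Follow head links until reaching a token that is its own head.
        path = [i]
        while heads[path[-1]] != path[-1]:
            path.append(heads[path[-1]])
        return path

    paths = [path_to_root(i) for i in range(n)]
    matrix = [[-1] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            ancestors_of_j = set(paths[j])
            # The first node on i's path that also lies on j's path is the LCA.
            for node in paths[i]:
                if node in ancestors_of_j:
                    matrix[i][j] = node
                    break
    return matrix


# Assumed heads for "This is a test": This -> is, is -> is (root), a -> test, test -> is
print(lca_matrix([1, 1, 3, 1]))
# [[0, 1, 1, 1], [1, 1, 1, 1], [1, 1, 2, 3], [1, 1, 3, 3]]
```

Tokens that belong to different trees (for example, two separate roots) share no ancestor, so their entries stay at -1.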

Doc.to_json

Convert a Doc to JSON. The format it produces will be the new format for the spacy train command (not implemented yet). If custom underscore attributes are specified, their values need to be JSON-serializable. They'll be added to an "_" key in the data, e.g. "_": {"foo": "bar"}.

Example

doc = nlp(u"Hello")
json_doc = doc.to_json()

Result

{
  "text": "Hello",
  "ents": [],
  "sents": [{"start": 0, "end": 5}],
  "tokens": [
    {"id": 0, "start": 0, "end": 5, "pos": "INTJ", "tag": "UH", "dep": "ROOT", "head": 0}
  ]
}

Name	Type	Description
`underscore`	list	Optional list of string names of custom JSON-serializable `doc._.` attributes.
RETURNS	dict	The JSON-formatted data.

spaCy previously implemented a Doc.print_tree method that returned a similar JSON-formatted representation of a Doc. As of v2.1, this method is deprecated in favor of Doc.to_json. If you need more complex nested representations, you might want to write your own function to extract the data.

Doc.to_array

Export given token attributes to a numpy ndarray. If attr_ids is a sequence of M attributes, the output array will be of shape (N, M), where N is the length of the Doc (in tokens). If attr_ids is a single attribute, the output shape will be (N,). You can specify attributes by integer ID (e.g. spacy.attrs.LEMMA) or string name (e.g. 'LEMMA' or 'lemma'). The values will be 64-bit integers.

Returns a 2D array with one row per token and one column per attribute (when attr_ids is a list), or a 1D numpy array with one item per token (when attr_ids is a single value).

Example

from spacy.attrs import LOWER, POS, ENT_TYPE, IS_ALPHA
doc = nlp(u"Hello world!")
# All strings mapped to integers, for easy export to numpy
np_array = doc.to_array([LOWER, POS, ENT_TYPE, IS_ALPHA])
np_array = doc.to_array("POS")

Name	Type	Description
`attr_ids`	list or int or string	A list of attributes (int IDs or string names) or a single attribute (int ID or string name)
RETURNS	`numpy.ndarray[ndim=2, dtype='uint64']` or `numpy.ndarray[ndim=1, dtype='uint64']`	The exported attributes as a numpy array.
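
The two output shapes can be illustrated with nested lists instead of numpy arrays. This is a sketch under simplifying assumptions: tokens are plain dicts, the integer values are made up, and `to_rows` is a hypothetical helper.

```python
def to_rows(token_attrs, attr_ids):
    """token_attrs holds one {attribute name: integer value} dict per token."""
    if isinstance(attr_ids, (list, tuple)):
        # A sequence of M attributes gives N rows of M values, i.e. shape (N, M).
        return [[tok[attr] for attr in attr_ids] for tok in token_attrs]
    # A single attribute gives a flat sequence of N values, i.e. shape (N,).
    return [tok[attr_ids] for tok in token_attrs]


# Made-up attribute values for a three-token document.
tokens = [
    {"POS": 91, "IS_ALPHA": 1},
    {"POS": 92, "IS_ALPHA": 1},
    {"POS": 97, "IS_ALPHA": 0},
]
print(to_rows(tokens, ["POS", "IS_ALPHA"]))  # [[91, 1], [92, 1], [97, 0]]
print(to_rows(tokens, "POS"))                # [91, 92, 97]
```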

Doc.from_array

Load attributes from a numpy array. Writes to the Doc object from an (N, M) array of attributes, with one row per token and one column per attribute, as produced by Doc.to_array.

Example

from spacy.attrs import LOWER, POS, ENT_TYPE, IS_ALPHA
from spacy.tokens import Doc
doc = nlp(u"Hello world!")
np_array = doc.to_array([LOWER, POS, ENT_TYPE, IS_ALPHA])
doc2 = Doc(doc.vocab, words=[t.text for t in doc])
doc2.from_array([LOWER, POS, ENT_TYPE, IS_ALPHA], np_array)
assert doc[0].pos_ == doc2[0].pos_

Name	Type	Description
`attrs`	list	A list of attribute ID ints.
`array`	`numpy.ndarray[ndim=2, dtype='int32']`	The attribute values to load.
`exclude`	list	String names of serialization fields to exclude.
RETURNS	`Doc`	Itself.

Doc.to_disk

Save the current state to a directory.

Example

doc.to_disk("/path/to/doc")

Name	Type	Description
`path`	unicode / `Path`	A path to a directory, which will be created if it doesn't exist. Paths may be either strings or `Path`-like objects.
`exclude`	list	String names of serialization fields to exclude.

Doc.from_disk

Loads state from a directory. Modifies the object in place and returns it.

Example

from spacy.tokens import Doc
from spacy.vocab import Vocab
doc = Doc(Vocab()).from_disk("/path/to/doc")

Name	Type	Description
`path`	unicode / `Path`	A path to a directory. Paths may be either strings or `Path`-like objects.
`exclude`	list	String names of serialization fields to exclude.
RETURNS	`Doc`	The modified `Doc` object.

Doc.to_bytes

Serialize, i.e. export the document contents to a binary string.

Example

doc = nlp(u"Give it back! He pleaded.")
doc_bytes = doc.to_bytes()

Name	Type	Description
`exclude`	list	String names of serialization fields to exclude.
RETURNS	bytes	A losslessly serialized copy of the `Doc`, including all annotations.

Doc.from_bytes

Deserialize, i.e. import the document contents from a binary string.

Example

from spacy.tokens import Doc
text = u"Give it back! He pleaded."
doc = nlp(text)
bytes = doc.to_bytes()
doc2 = Doc(doc.vocab).from_bytes(bytes)
assert doc.text == doc2.text

Name	Type	Description
`data`	bytes	The string to load from.
`exclude`	list	String names of serialization fields to exclude.
RETURNS	`Doc`	The `Doc` object.

Doc.retokenize

Context manager to handle retokenization of the Doc. Modifications to the Doc's tokenization are stored, and then made all at once when the context manager exits. This is much more efficient and less error-prone than applying each change as it is made. All views of the Doc (Span and Token) created before the retokenization are invalidated, although they may accidentally continue to work.

Example

doc = nlp("Hello world!")
with doc.retokenize() as retokenizer:
    retokenizer.merge(doc[0:2])

Name	Type	Description
RETURNS	`Retokenizer`	The retokenizer.
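
The idea of queuing changes and applying them when the block exits can be sketched with a toy context manager over a plain list of words. This is an illustration of the pattern only: `ToyRetokenizer` and this `retokenize` function are hypothetical, and joining words with a single space is a simplification.

```python
from contextlib import contextmanager


class ToyRetokenizer:
    """Collects merge requests instead of applying them immediately."""

    def __init__(self):
        self.merges = []

    def merge(self, start, end):
        self.merges.append((start, end))


@contextmanager
def retokenize(words):
    retokenizer = ToyRetokenizer()
    yield retokenizer
    # Apply the queued merges from right to left so that the indices
    # recorded for earlier spans are still valid when they are applied.
    for start, end in sorted(retokenizer.merges, reverse=True):
        words[start:end] = [" ".join(words[start:end])]


words = ["I", "like", "New", "York", "and", "Los", "Angeles"]
with retokenize(words) as retokenizer:
    retokenizer.merge(2, 4)
    retokenizer.merge(5, 7)
print(words)  # ['I', 'like', 'New York', 'and', 'Los Angeles']
```

Inside the block, all indices refer to the original tokenization, so the caller never has to adjust them after an earlier merge.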

Retokenizer.merge

Mark a span for merging. The attrs will be applied to the resulting token (if they're context-dependent token attributes like LEMMA or DEP) or to the underlying lexeme (if they're context-independent lexical attributes like LOWER or IS_STOP). Writable custom extension attributes can be provided as a dictionary mapping attribute names to values as the "_" key.

Example

doc = nlp(u"I like David Bowie")
with doc.retokenize() as retokenizer:
    attrs = {"LEMMA": u"David Bowie"}
    retokenizer.merge(doc[2:4], attrs=attrs)

Name	Type	Description
`span`	`Span`	The span to merge.
`attrs`	dict	Attributes to set on the merged token.

Retokenizer.split

Mark a token for splitting into the specified orths. The heads are required to specify how the new subtokens should be integrated into the dependency tree. Each entry in the list of per-token heads can be either a token in the original document, e.g. doc[2], or a tuple consisting of the token in the original document and its subtoken index. For example, (doc[3], 1) will attach the subtoken to the second subtoken of doc[3].

This mechanism allows attaching subtokens to other newly created subtokens, without having to keep track of the changing token indices. If the specified head token will be split within the retokenizer block and no subtoken index is specified, it will default to 0. Attributes to set on subtokens can be provided as a list of values. They'll be applied to the resulting token (if they're context-dependent token attributes like LEMMA or DEP) or to the underlying lexeme (if they're context-independent lexical attributes like LOWER or IS_STOP).

Example

doc = nlp(u"I live in NewYork")
with doc.retokenize() as retokenizer:
    heads = [(doc[3], 1), doc[2]]
    attrs = {"POS": ["PROPN", "PROPN"],
             "DEP": ["pobj", "compound"]}
    retokenizer.split(doc[3], ["New", "York"], heads=heads, attrs=attrs)

Name	Type	Description
`token`	`Token`	The token to split.
`orths`	list	The verbatim text of the split tokens. Needs to match the text of the original token.
`heads`	list	List of `token` or `(token, subtoken)` tuples specifying the tokens to attach the newly split subtokens to.
`attrs`	dict	Attributes to set on all split tokens. Attribute names mapped to list of per-token attribute values.

Doc.merge

As of v2.1.0, Doc.merge still works but is considered deprecated. You should use the new and less error-prone Doc.retokenize instead.

Retokenize the document, such that the span at doc.text[start_idx : end_idx] is merged into a single token. If start_idx and end_idx do not mark start and end token boundaries, the document remains unchanged.

Example

doc = nlp(u"Los Angeles start.")
doc.merge(0, len("Los Angeles"), "NNP", "Los Angeles", "GPE")
assert [t.text for t in doc] == [u"Los Angeles", u"start", u"."]

Name	Type	Description
`start_idx`	int	The character index of the start of the slice to merge.
`end_idx`	int	The character index after the end of the slice to merge.
`**attributes`	-	Attributes to assign to the merged token. By default, attributes are inherited from the syntactic root token of the span.
RETURNS	`Token`	The newly merged token, or `None` if the start and end indices did not fall at token boundaries

Doc.ents

The named entities in the document. Returns a tuple of named entity Span objects, if the entity recognizer has been applied.

Example

doc = nlp(u"Mr. Best flew to New York on Saturday morning.")
ents = list(doc.ents)
assert ents[0].label == doc.vocab.strings[u"PERSON"]
assert ents[0].label_ == u"PERSON"
assert ents[0].text == u"Mr. Best"

Name	Type	Description
RETURNS	tuple	Entities in the document, one `Span` per entity.

Doc.noun_chunks

Iterate over the base noun phrases in the document. Yields base noun-phrase Span objects, if the document has been syntactically parsed. A base noun phrase, or "NP chunk", is a noun phrase that does not permit other NPs to be nested within it – so no NP-level coordination, no prepositional phrases, and no relative clauses.

Example

doc = nlp(u"A phrase with another phrase occurs.")
chunks = list(doc.noun_chunks)
assert chunks[0].text == u"A phrase"
assert chunks[1].text == u"another phrase"

Name	Type	Description
YIELDS	`Span`	Noun chunks in the document.

Doc.sents

Iterate over the sentences in the document. Sentence spans have no label. To improve accuracy on informal texts, spaCy calculates sentence boundaries from the syntactic dependency parse. If the parser is disabled, the sents iterator will be unavailable.

Example

doc = nlp(u"This is a sentence. Here's another...")
sents = list(doc.sents)
assert len(sents) == 2
assert [s.root.text for s in sents] == [u"is", u"'s"]

Name	Type	Description
YIELDS	`Span`	Sentences in the document.

Doc.has_vector

A boolean value indicating whether a word vector is associated with the object.

Example

doc = nlp(u"I like apples")
assert doc.has_vector

Name	Type	Description
RETURNS	bool	Whether the document has vector data attached.

Doc.vector

A real-valued meaning representation. Defaults to an average of the token vectors.

Example

doc = nlp(u"I like apples")
assert doc.vector.dtype == 'float32'
assert doc.vector.shape == (300,)

Name	Type	Description
RETURNS	`numpy.ndarray[ndim=1, dtype='float32']`	A 1D numpy array representing the document's semantics.

Doc.vector_norm

The L2 norm of the document's vector representation.

Example

doc1 = nlp(u"I like apples")
doc2 = nlp(u"I like oranges")
doc1.vector_norm  # 4.54232424414368
doc2.vector_norm  # 3.304373298575751
assert doc1.vector_norm != doc2.vector_norm

Name	Type	Description
RETURNS	float	The L2 norm of the vector representation.
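
For reference, the L2 norm is the square root of the sum of squared components. A minimal pure-Python sketch (`l2_norm` is a hypothetical helper, shown on a made-up two-dimensional vector):

```python
import math


def l2_norm(vector):
    """Euclidean (L2) length of a vector."""
    return math.sqrt(sum(x * x for x in vector))


print(l2_norm([3.0, 4.0]))  # 5.0
```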

Attributes

Name	Type	Description
`text`	unicode	A unicode representation of the document text.
`text_with_ws`	unicode	An alias of `Doc.text`, provided for duck-type compatibility with `Span` and `Token`.
`mem`	`Pool`	The document's local memory heap, for all C data it owns.
`vocab`	`Vocab`	The store of lexical types.
`tensor` 2	object	Container for dense vector representations.
`cats` 2	dictionary	Maps either a label to a score for categories applied to whole document, or `(start_char, end_char, label)` to score for categories applied to spans. `start_char` and `end_char` should be character offsets, label can be either a string or an integer ID, and score should be a float.
`user_data`	-	A generic storage area, for user custom data.
`lang` 2.1	int	Language of the document's vocabulary.
`lang_` 2.1	unicode	Language of the document's vocabulary.
`is_tagged`	bool	A flag indicating that the document has been part-of-speech tagged.
`is_parsed`	bool	A flag indicating that the document has been syntactically parsed.
`is_sentenced`	bool	A flag indicating that sentence boundaries have been applied to the document.
`is_nered` 2.1	bool	A flag indicating that named entities have been set. Will return `True` if any of the tokens has an entity tag set, even if the others are unknown.
`sentiment`	float	The document's positivity/negativity score, if available.
`user_hooks`	dict	A dictionary that allows customization of the `Doc`'s properties.
`user_token_hooks`	dict	A dictionary that allows customization of properties of `Token` children.
`user_span_hooks`	dict	A dictionary that allows customization of properties of `Span` children.
`_`	`Underscore`	User space for adding custom attribute extensions.

Serialization fields

During serialization, spaCy will export several data fields used to restore different aspects of the object. If needed, you can exclude them from serialization by passing in the string names via the exclude argument.

Example

data = doc.to_bytes(exclude=["text", "tensor"])
doc.from_disk("./doc.bin", exclude=["user_data"])

Name	Description
`text`	The value of the `Doc.text` attribute.
`sentiment`	The value of the `Doc.sentiment` attribute.
`tensor`	The value of the `Doc.tensor` attribute.
`user_data`	The value of the `Doc.user_data` dictionary.
`user_data_keys`	The keys of the `Doc.user_data` dictionary.
`user_data_values`	The values of the `Doc.user_data` dictionary.
