	| title | tag | teaser | source | 
|---|---|---|---|
| Doc | class | A container for accessing linguistic annotations. | spacy/tokens/doc.pyx | 
A Doc is a sequence of Token objects. Access sentences and
named entities, export annotations to numpy arrays, losslessly serialize to
compressed binary strings. The Doc object holds an array of TokenC structs.
The Python-level Token and Span objects are views of this
array, i.e. they don't own the data themselves.
Example
    # Construction 1
    doc = nlp(u"Some text")

    # Construction 2
    from spacy.tokens import Doc
    words = [u"hello", u"world", u"!"]
    spaces = [True, False, False]
    doc = Doc(nlp.vocab, words=words, spaces=spaces)
Doc.__init__
Construct a Doc object. The most common way to get a Doc object is via the
nlp object.
| Name | Type | Description | 
|---|---|---|
| vocab | Vocab | A storage container for lexical types. | 
| words | iterable | A list of strings to add to the container. | 
| spaces | iterable | A list of boolean values indicating whether each word has a subsequent space. Must have the same length as words, if specified. Defaults to a sequence of True. |
| RETURNS | Doc | The newly constructed object. | 
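
The relationship between words, spaces and the resulting text can be sketched in plain Python. This is only an illustration of the documented semantics, not spaCy's implementation; `join_words` is a hypothetical helper and does not exist in the library.

```python
def join_words(words, spaces=None):
    """Rebuild text from words and per-word trailing-space flags."""
    if spaces is None:
        # Mirrors the documented default: a sequence of True.
        spaces = [True] * len(words)
    if len(spaces) != len(words):
        raise ValueError("words and spaces must have the same length")
    return "".join(w + (" " if s else "") for w, s in zip(words, spaces))


text = join_words(["hello", "world", "!"], [True, False, False])
print(text)  # hello world!
```

Keeping the whitespace flag per word is what allows the original text to be reconstructed exactly from the tokens.
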
Doc.__getitem__
Get a Token object at position i, where i is an integer.
Negative indexing is supported, and follows the usual Python semantics, i.e.
doc[-2] is doc[len(doc) - 2].
Example
    doc = nlp(u"Give it back! He pleaded.")
    assert doc[0].text == "Give"
    assert doc[-1].text == "."
    span = doc[1:3]
    assert span.text == "it back"
| Name | Type | Description | 
|---|---|---|
| i | int | The index of the token. | 
| RETURNS | Token | The token at doc[i]. | 
Get a Span object, starting at position start (token index) and
ending at position end (token index). For instance, doc[2:5] produces a span
consisting of tokens 2, 3 and 4. Stepped slices (e.g. doc[start : end : step])
are not supported, as Span objects must be contiguous (cannot have gaps). You
can use negative indices and open-ended ranges, which have their normal Python
semantics.
| Name | Type | Description | 
|---|---|---|
| start_end | tuple | The slice of the document to get. | 
| RETURNS | Span | The span at doc[start:end]. | 
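
The slicing rules described above (negative indices, open-ended ranges, no stepped slices) can be approximated with Python's built-in `slice.indices`. A minimal sketch: `resolve_slice` is a hypothetical helper, and accepting an explicit step of 1 is an assumption made here for illustration.

```python
def resolve_slice(key, length):
    """Return (start, end) token offsets for a contiguous slice."""
    if key.step not in (None, 1):
        # Spans must be contiguous, so stepped slices are rejected.
        raise ValueError("stepped slices are not supported")
    start, end, _ = key.indices(length)
    # A reversed range collapses to an empty span.
    return start, max(start, end)


n_tokens = 7
print(resolve_slice(slice(2, 5), n_tokens))      # (2, 5)
print(resolve_slice(slice(-3, None), n_tokens))  # (4, 7)
print(resolve_slice(slice(None, 2), n_tokens))   # (0, 2)
```
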
Doc.__iter__
Iterate over Token objects, from which the annotations can be easily accessed.
Example
    doc = nlp(u'Give it back')
    assert [t.text for t in doc] == [u'Give', u'it', u'back']
This is the main way of accessing Token objects, which are the
main way annotations are accessed from Python. If faster-than-Python speeds are
required, you can instead access the annotations as a numpy array, or access the
underlying C data directly from Cython.
| Name | Type | Description | 
|---|---|---|
| YIELDS | Token | A Token object. |
Doc.__len__
Get the number of tokens in the document.
Example
    doc = nlp(u"Give it back! He pleaded.")
    assert len(doc) == 7
| Name | Type | Description | 
|---|---|---|
| RETURNS | int | The number of tokens in the document. | 
Doc.set_extension
Define a custom attribute on the Doc which becomes available via Doc._. For
details, see the documentation on
custom attributes.
Example
    from spacy.tokens import Doc
    city_getter = lambda doc: any(city in doc.text for city in ('New York', 'Paris', 'Berlin'))
    Doc.set_extension('has_city', getter=city_getter)
    doc = nlp(u'I like New York')
    assert doc._.has_city
| Name | Type | Description | 
|---|---|---|
| name | unicode | Name of the attribute to set by the extension. For example, 'my_attr' will be available as doc._.my_attr. |
| default | - | Optional default value of the attribute if no getter or method is defined. | 
| method | callable | Set a custom method on the object, for example doc._.compare(other_doc). | 
| getter | callable | Getter function that takes the object and returns an attribute value. Is called when the user accesses the ._ attribute. |
| setter | callable | Setter function that takes the Doc and a value, and modifies the object. Is called when the user writes to the Doc._ attribute. |
| force | bool | Force overwriting existing attribute. | 
Doc.get_extension
Look up a previously registered extension by name. Returns a 4-tuple
(default, method, getter, setter) if the extension is registered. Raises a
KeyError otherwise.
Example
    from spacy.tokens import Doc
    Doc.set_extension('has_city', default=False)
    extension = Doc.get_extension('has_city')
    assert extension == (False, None, None, None)
| Name | Type | Description | 
|---|---|---|
| name | unicode | Name of the extension. | 
| RETURNS | tuple | A (default, method, getter, setter) tuple of the extension. |
Doc.has_extension
Check whether an extension has been registered on the Doc class.
Example
    from spacy.tokens import Doc
    Doc.set_extension('has_city', default=False)
    assert Doc.has_extension('has_city')
| Name | Type | Description | 
|---|---|---|
| name | unicode | Name of the extension to check. | 
| RETURNS | bool | Whether the extension has been registered. | 
Doc.remove_extension
Remove a previously registered extension.
Example
    from spacy.tokens import Doc
    Doc.set_extension('has_city', default=False)
    removed = Doc.remove_extension('has_city')
    assert not Doc.has_extension('has_city')
| Name | Type | Description | 
|---|---|---|
| name | unicode | Name of the extension. | 
| RETURNS | tuple | A (default, method, getter, setter) tuple of the removed extension. |
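
The four extension methods behave like a small class-level registry of `(default, method, getter, setter)` tuples. The sketch below imitates that behaviour in plain Python. `ExtensionRegistry` is hypothetical, and raising `ValueError` on a duplicate name without `force` is an assumption, not something the text above specifies.

```python
class ExtensionRegistry:
    """Toy registry of (default, method, getter, setter) tuples."""

    def __init__(self):
        self._extensions = {}

    def set_extension(self, name, default=None, method=None,
                      getter=None, setter=None, force=False):
        # Assumption: overwriting requires force=True.
        if name in self._extensions and not force:
            raise ValueError("extension %r already exists" % name)
        self._extensions[name] = (default, method, getter, setter)

    def get_extension(self, name):
        return self._extensions[name]  # raises KeyError if missing

    def has_extension(self, name):
        return name in self._extensions

    def remove_extension(self, name):
        return self._extensions.pop(name)


registry = ExtensionRegistry()
registry.set_extension("has_city", default=False)
print(registry.get_extension("has_city"))  # (False, None, None, None)
removed = registry.remove_extension("has_city")
print(registry.has_extension("has_city"))  # False
```
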
Doc.char_span
Create a Span object from the slice doc.text[start:end]. Returns None if
the character indices don't map to a valid span.
Example
    doc = nlp(u"I like New York")
    span = doc.char_span(7, 15, label=u"GPE")
    assert span.text == "New York"
| Name | Type | Description | 
|---|---|---|
| start | int | The index of the first character of the span. | 
| end | int | The index of the first character after the span. |
| label | uint64 / unicode | A label to attach to the Span, e.g. for named entities. | 
| vector | numpy.ndarray[ndim=1, dtype='float32'] | A meaning representation of the span. | 
| RETURNS | Span | The newly constructed object or None. | 
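
Why some character offsets yield None can be illustrated with a small alignment check: both offsets have to coincide with token boundaries. This is a simplified sketch over `(text, offset)` pairs. The `char_span` function below is a hypothetical stand-in, not the library method.

```python
def char_span(tokens, start, end):
    """Map character offsets to (start_token, end_token), or None.

    `tokens` is a list of (text, char_offset) pairs.
    """
    starts = {offset: i for i, (_, offset) in enumerate(tokens)}
    ends = {offset + len(text): i + 1
            for i, (text, offset) in enumerate(tokens)}
    if start not in starts or end not in ends:
        return None  # an offset falls inside a token or on whitespace
    first, last = starts[start], ends[end]
    return (first, last) if first < last else None


# "I like New York" with the character offset of each token
tokens = [("I", 0), ("like", 2), ("New", 7), ("York", 11)]
print(char_span(tokens, 7, 15))  # (2, 4)
print(char_span(tokens, 8, 15))  # None
```
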
Doc.similarity
Make a semantic similarity estimate. The default estimate is cosine similarity using an average of word vectors.
Example
    apples = nlp(u"I like apples")
    oranges = nlp(u"I like oranges")
    apples_oranges = apples.similarity(oranges)
    oranges_apples = oranges.similarity(apples)
    assert apples_oranges == oranges_apples
| Name | Type | Description | 
|---|---|---|
| other | - | The object to compare with. By default, accepts Doc, Span, Token and Lexeme objects. |
| RETURNS | float | A scalar similarity score. Higher is more similar. | 
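
The default estimate, cosine similarity over averaged word vectors, can be written out with plain Python lists. The vectors below are made-up two-dimensional values chosen for illustration, and both helper names are hypothetical.

```python
import math


def average(vectors):
    """Element-wise mean of equally sized vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def cosine(a, b):
    """Cosine similarity, returning 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Invented per-token vectors for two three-token documents
apples = average([[1.0, 0.0], [0.5, 0.5], [0.9, 0.1]])
oranges = average([[1.0, 0.0], [0.5, 0.5], [0.1, 0.9]])
assert cosine(apples, oranges) == cosine(oranges, apples)
```

The measure is symmetric, which is why the order of the two documents in the example above does not matter.
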
Doc.count_by
Count the frequencies of a given attribute. Produces a dict of
{attr (int): count (int)} frequencies, keyed by the values of the given
attribute ID.
Example
    from spacy.attrs import ORTH
    doc = nlp(u"apple apple orange banana")
    assert doc.count_by(ORTH) == {7024: 1, 119552: 1, 2087: 2}
    doc.to_array([ORTH])
    # array([[11880], [11880], [7561], [12800]])
| Name | Type | Description | 
|---|---|---|
| attr_id | int | The attribute ID. |
| RETURNS | dict | A dictionary mapping attributes to integer counts. | 
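
Conceptually, the result is a frequency table over one integer attribute value per token. A rough equivalent using the standard library is shown below. The standalone `count_by` function is hypothetical, and the IDs are simply reused from the array comment in the example above.

```python
from collections import Counter


def count_by(attr_values):
    """Count how often each integer attribute value occurs."""
    return dict(Counter(attr_values))


# One ORTH-style ID per token of "apple apple orange banana"
orth_ids = [11880, 11880, 7561, 12800]
print(count_by(orth_ids))  # {11880: 2, 7561: 1, 12800: 1}
```
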
Doc.get_lca_matrix
Calculates the lowest common ancestor matrix for a given Doc. Returns LCA
matrix containing the integer index of the ancestor, or -1 if no common
ancestor is found, e.g. if span excludes a necessary ancestor.
Example
    doc = nlp(u"This is a test")
    matrix = doc.get_lca_matrix()
    # array([[0, 1, 1, 1], [1, 1, 1, 1], [1, 1, 2, 3], [1, 1, 3, 3]], dtype=int32)
| Name | Type | Description | 
|---|---|---|
| RETURNS | numpy.ndarray[ndim=2, dtype='int32'] | The lowest common ancestor matrix of the Doc. | 
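
To make the matrix easier to read, here is a straightforward (and not optimized) way to compute it from a list of head indices. It assumes a root token points to itself, and it assumes a parse for "This is a test" in which "is" is the root, "This" and "test" attach to "is", and "a" attaches to "test". Both helper functions are hypothetical.

```python
def ancestors(i, heads):
    """Return token i followed by its chain of heads up to the root."""
    chain = [i]
    # Assumption: a root is its own head. The length guard stops
    # malformed (cyclic) input from looping forever.
    while heads[chain[-1]] != chain[-1] and len(chain) <= len(heads):
        chain.append(heads[chain[-1]])
    return chain


def lca_matrix(heads):
    n = len(heads)
    matrix = [[-1] * n for _ in range(n)]
    for i in range(n):
        chain_i = ancestors(i, heads)
        for j in range(n):
            chain_j = set(ancestors(j, heads))
            # Nearest ancestor of i (including i) that also dominates j.
            matrix[i][j] = next((a for a in chain_i if a in chain_j), -1)
    return matrix


heads = [1, 1, 3, 1]  # assumed parse of "This is a test"
for row in lca_matrix(heads):
    print(row)
# [0, 1, 1, 1]
# [1, 1, 1, 1]
# [1, 1, 2, 3]
# [1, 1, 3, 3]
```

Under these assumptions the output matches the array shown in the example. Tokens that belong to different trees share no ancestor and get -1.
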
Doc.to_json
Convert a Doc to JSON. The format it produces will be the new format for the
spacy train command (not implemented yet). If custom
underscore attributes are specified, their values need to be JSON-serializable.
They'll be added to an "_" key in the data, e.g. "_": {"foo": "bar"}.
Example
    doc = nlp(u"Hello")
    json_doc = doc.to_json()

Result

    {
      "text": "Hello",
      "ents": [],
      "sents": [{"start": 0, "end": 5}],
      "tokens": [
        {"id": 0, "start": 0, "end": 5, "pos": "INTJ", "tag": "UH", "dep": "ROOT", "head": 0}
      ]
    }
| Name | Type | Description | 
|---|---|---|
| underscore | list | Optional list of string names of custom JSON-serializable doc._ attributes. |
| RETURNS | dict | The JSON-formatted data. | 
spaCy previously implemented a Doc.print_tree method that returned a similar
JSON-formatted representation of a Doc. As of v2.1, this method is deprecated
in favor of Doc.to_json. If you need more complex nested representations, you
might want to write your own function to extract the data.
Doc.to_array
Export given token attributes to a numpy ndarray. If attr_ids is a sequence
of M attributes, the output array will be of shape (N, M), where N is the
length of the Doc (in tokens). If attr_ids is a single attribute, the output
shape will be (N,). You can specify attributes by integer ID (e.g.
spacy.attrs.LEMMA) or string name (e.g. 'LEMMA' or 'lemma'). The values will
be 64-bit integers.
Returns a 2D array with one row per token and one column per attribute (when
attr_ids is a list), or a 1D array with one item per token (when attr_ids is
a single value).
Example
    from spacy.attrs import LOWER, POS, ENT_TYPE, IS_ALPHA
    doc = nlp(text)
    # All strings mapped to integers, for easy export to numpy
    np_array = doc.to_array([LOWER, POS, ENT_TYPE, IS_ALPHA])
    np_array = doc.to_array("POS")
| Name | Type | Description | 
|---|---|---|
| attr_ids | list or int or string | A list of attributes (int IDs or string names) or a single attribute (int ID or string name) | 
| RETURNS | numpy.ndarray[ndim=2, dtype='uint64'] or numpy.ndarray[ndim=1, dtype='uint64'] | The exported attributes as a numpy array. |
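
The difference between the two output shapes can be shown with nested lists standing in for numpy arrays. This is an illustration only: the standalone `to_array` function is hypothetical and the integer values are invented.

```python
def to_array(token_attrs, attr_ids):
    """token_attrs: one dict per token, mapping attribute name to int."""
    if isinstance(attr_ids, str):
        # Single attribute: one value per token, shape (N,)
        return [tok[attr_ids] for tok in token_attrs]
    # List of attributes: one row per token, shape (N, M)
    return [[tok[a] for a in attr_ids] for tok in token_attrs]


# Invented integer values for a two-token document
tokens = [{"POS": 90, "IS_ALPHA": 1}, {"POS": 96, "IS_ALPHA": 0}]
print(to_array(tokens, ["POS", "IS_ALPHA"]))  # [[90, 1], [96, 0]]
print(to_array(tokens, "POS"))                # [90, 96]
```
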
Doc.from_array
Load attributes from a numpy array and write them to the Doc object. The array
is expected to have one row per token and one column per attribute, i.e. the
layout produced by Doc.to_array.
Example
    from spacy.attrs import LOWER, POS, ENT_TYPE, IS_ALPHA
    from spacy.tokens import Doc
    doc = nlp(u"Hello world!")
    np_array = doc.to_array([LOWER, POS, ENT_TYPE, IS_ALPHA])
    doc2 = Doc(doc.vocab, words=[t.text for t in doc])
    doc2.from_array([LOWER, POS, ENT_TYPE, IS_ALPHA], np_array)
    assert doc[0].pos_ == doc2[0].pos_
| Name | Type | Description | 
|---|---|---|
| attrs | list | A list of attribute ID ints. | 
| array | numpy.ndarray[ndim=2, dtype='int32'] | The attribute values to load. | 
| exclude | list | String names of serialization fields to exclude. | 
| RETURNS | Doc | Itself. | 
Doc.to_disk
Save the current state to a directory.
Example
doc.to_disk("/path/to/doc")
| Name | Type | Description | 
|---|---|---|
| path | unicode / Path | A path to a directory, which will be created if it doesn't exist. Paths may be either strings or Path-like objects. | 
| exclude | list | String names of serialization fields to exclude. | 
Doc.from_disk
Loads state from a directory. Modifies the object in place and returns it.
Example
    from spacy.tokens import Doc
    from spacy.vocab import Vocab
    doc = Doc(Vocab()).from_disk("/path/to/doc")
| Name | Type | Description | 
|---|---|---|
| path | unicode / Path | A path to a directory. Paths may be either strings or Path-like objects. | 
| exclude | list | String names of serialization fields to exclude. | 
| RETURNS | Doc | The modified Doc object. |
Doc.to_bytes
Serialize, i.e. export the document contents to a binary string.
Example
    doc = nlp(u"Give it back! He pleaded.")
    doc_bytes = doc.to_bytes()
| Name | Type | Description | 
|---|---|---|
| exclude | list | String names of serialization fields to exclude. | 
| RETURNS | bytes | A losslessly serialized copy of the Doc, including all annotations. | 
Doc.from_bytes
Deserialize, i.e. import the document contents from a binary string.
Example
    from spacy.tokens import Doc
    text = u"Give it back! He pleaded."
    doc = nlp(text)
    # Named doc_bytes to avoid shadowing the built-in bytes type
    doc_bytes = doc.to_bytes()
    doc2 = Doc(doc.vocab).from_bytes(doc_bytes)
    assert doc.text == doc2.text
| Name | Type | Description | 
|---|---|---|
| data | bytes | The string to load from. | 
| exclude | list | String names of serialization fields to exclude. | 
| RETURNS | Doc | The Doc object. |
Doc.retokenize
Context manager to handle retokenization of the Doc. Modifications to the
Doc's tokenization are stored, and then made all at once when the context
manager exits. This is much more efficient, and less error-prone. All views of
the Doc (Span and Token) created before the retokenization are
invalidated, although they may accidentally continue to work.
Example
    doc = nlp("Hello world!")
    with doc.retokenize() as retokenizer:
        retokenizer.merge(doc[0:2])
| Name | Type | Description | 
|---|---|---|
| RETURNS | Retokenizer | The retokenizer. | 
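
The "collect changes, apply them on exit" pattern described above can be sketched with a toy context manager. `ToyDoc` and `ToyRetokenizer` are hypothetical classes that operate on a plain list of words. Joining merged words with a space and requiring non-overlapping merges are simplifying assumptions.

```python
class ToyDoc:
    def __init__(self, words):
        self.words = list(words)

    def retokenize(self):
        return ToyRetokenizer(self)


class ToyRetokenizer:
    def __init__(self, doc):
        self.doc = doc
        self.merges = []

    def merge(self, start, end):
        # Only record the request; nothing changes yet.
        self.merges.append((start, end))

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc, tb):
        if exc_type is None:
            # Apply right to left so earlier offsets stay valid.
            for start, end in sorted(self.merges, reverse=True):
                merged = " ".join(self.doc.words[start:end])
                self.doc.words[start:end] = [merged]
        return False


doc = ToyDoc(["I", "like", "New", "York", "City", "Hall"])
with doc.retokenize() as retokenizer:
    retokenizer.merge(2, 4)
    retokenizer.merge(4, 6)
    assert len(doc.words) == 6  # nothing applied inside the block
print(doc.words)  # ['I', 'like', 'New York', 'City Hall']
```

Because all requests refer to the original token indices, the caller never has to account for indices shifting after each individual merge.
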
Retokenizer.merge
Mark a span for merging. The attrs will be applied to the resulting token (if
they're context-dependent token attributes like LEMMA or DEP) or to the
underlying lexeme (if they're context-independent lexical attributes like
LOWER or IS_STOP). Writable custom extension attributes can be provided as a
dictionary mapping attribute names to values as the "_" key.
Example
    doc = nlp(u"I like David Bowie")
    with doc.retokenize() as retokenizer:
        attrs = {"LEMMA": u"David Bowie"}
        retokenizer.merge(doc[2:4], attrs=attrs)
| Name | Type | Description | 
|---|---|---|
| span | Span | The span to merge. | 
| attrs | dict | Attributes to set on the merged token. | 
Retokenizer.split
Mark a token for splitting, into the specified orths. The heads are required
to specify how the new subtokens should be integrated into the dependency tree.
The list of per-token heads can either be a token in the original document, e.g.
doc[2], or a tuple consisting of the token in the original document and its
subtoken index. For example, (doc[3], 1) will attach the subtoken to the
second subtoken of doc[3].
This mechanism allows attaching subtokens to other newly created subtokens,
without having to keep track of the changing token indices. If the specified
head token will be split within the retokenizer block and no subtoken index is
specified, it will default to 0. Attributes to set on subtokens can be
provided as a list of values. They'll be applied to the resulting token (if
they're context-dependent token attributes like LEMMA or DEP) or to the
underlying lexeme (if they're context-independent lexical attributes like
LOWER or IS_STOP).
Example
    doc = nlp(u"I live in NewYork")
    with doc.retokenize() as retokenizer:
        heads = [(doc[3], 1), doc[2]]
        attrs = {"POS": ["PROPN", "PROPN"], "DEP": ["pobj", "compound"]}
        retokenizer.split(doc[3], ["New", "York"], heads=heads, attrs=attrs)
| Name | Type | Description | 
|---|---|---|
| token | Token | The token to split. | 
| orths | list | The verbatim text of the split tokens. Needs to match the text of the original token. | 
| heads | list | List of token or (token, subtoken) tuples specifying the tokens to attach the newly split subtokens to. |
| attrs | dict | Attributes to set on all split tokens. Attribute names mapped to list of per-token attribute values. | 
Doc.merge
As of v2.1.0, Doc.merge still works but is considered deprecated. You should
use the new and less error-prone Doc.retokenize
instead.
Retokenize the document, such that the span at doc.text[start_idx : end_idx]
is merged into a single token. If start_idx and end_idx do not mark start
and end token boundaries, the document remains unchanged.
Example
    doc = nlp(u"Los Angeles start.")
    doc.merge(0, len("Los Angeles"), "NNP", "Los Angeles", "GPE")
    assert [t.text for t in doc] == [u"Los Angeles", u"start", u"."]
| Name | Type | Description | 
|---|---|---|
| start_idx | int | The character index of the start of the slice to merge. | 
| end_idx | int | The character index after the end of the slice to merge. | 
| **attributes | - | Attributes to assign to the merged token. By default, attributes are inherited from the syntactic root token of the span. | 
| RETURNS | Token | The newly merged token, or None if the start and end indices did not fall at token boundaries. |
Doc.ents
The named entities in the document. Returns a tuple of named entity Span
objects, if the entity recognizer has been applied.
Example
    doc = nlp(u"Mr. Best flew to New York on Saturday morning.")
    ents = list(doc.ents)
    assert ents[0].label == 346
    assert ents[0].label_ == u"PERSON"
    assert ents[0].text == u"Mr. Best"
| Name | Type | Description | 
|---|---|---|
| RETURNS | tuple | Entities in the document, one Span per entity. |
Doc.noun_chunks
Iterate over the base noun phrases in the document. Yields base noun-phrase
Span objects, if the document has been syntactically parsed. A base noun
phrase, or "NP chunk", is a noun phrase that does not permit other NPs to be
nested within it – so no NP-level coordination, no prepositional phrases, and no
relative clauses.
Example
    doc = nlp(u"A phrase with another phrase occurs.")
    chunks = list(doc.noun_chunks)
    assert chunks[0].text == u"A phrase"
    assert chunks[1].text == u"another phrase"
| Name | Type | Description | 
|---|---|---|
| YIELDS | Span | Noun chunks in the document. | 
Doc.sents
Iterate over the sentences in the document. Sentence spans have no label. To
improve accuracy on informal texts, spaCy calculates sentence boundaries from
the syntactic dependency parse. If the parser is disabled, the sents iterator
will be unavailable.
Example
    doc = nlp(u"This is a sentence. Here's another...")
    sents = list(doc.sents)
    assert len(sents) == 2
    assert [s.root.text for s in sents] == [u"is", u"'s"]
| Name | Type | Description | 
|---|---|---|
| YIELDS | Span | Sentences in the document. | 
Doc.has_vector
A boolean value indicating whether a word vector is associated with the object.
Example
    doc = nlp(u"I like apples")
    assert doc.has_vector
| Name | Type | Description | 
|---|---|---|
| RETURNS | bool | Whether the document has vector data attached. |
Doc.vector
A real-valued meaning representation. Defaults to an average of the token vectors.
Example
    doc = nlp(u"I like apples")
    assert doc.vector.dtype == 'float32'
    assert doc.vector.shape == (300,)
| Name | Type | Description | 
|---|---|---|
| RETURNS | numpy.ndarray[ndim=1, dtype='float32'] | A 1D numpy array representing the document's semantics. | 
Doc.vector_norm
The L2 norm of the document's vector representation.
Example
    doc1 = nlp(u"I like apples")
    doc2 = nlp(u"I like oranges")
    doc1.vector_norm  # 4.54232424414368
    doc2.vector_norm  # 3.304373298575751
    assert doc1.vector_norm != doc2.vector_norm
| Name | Type | Description | 
|---|---|---|
| RETURNS | float | The L2 norm of the vector representation. | 
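
For reference, the default document vector (an average of the token vectors) and its L2 norm amount to the following arithmetic. The two-dimensional token vectors are invented so the result is easy to check by hand, and the helper names are hypothetical.

```python
import math


def mean_vector(token_vectors):
    """Element-wise average of the token vectors."""
    n = len(token_vectors)
    return [sum(column) / n for column in zip(*token_vectors)]


def l2_norm(vector):
    """Square root of the sum of squared components."""
    return math.sqrt(sum(x * x for x in vector))


token_vectors = [[3.0, 0.0], [3.0, 8.0]]  # made-up values
doc_vector = mean_vector(token_vectors)
print(doc_vector)           # [3.0, 4.0]
print(l2_norm(doc_vector))  # 5.0
```
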
Attributes
| Name | Type | Description | 
|---|---|---|
| text | unicode | A unicode representation of the document text. | 
| text_with_ws | unicode | An alias of Doc.text, provided for duck-type compatibility with Span and Token. |
| mem | Pool | The document's local memory heap, for all C data it owns. | 
| vocab | Vocab | The store of lexical types. | 
| tensor (new in v2) | object | Container for dense vector representations. |
| cats (new in v2) | dictionary | Maps either a label to a score for categories applied to whole document, or (start_char, end_char, label) to score for categories applied to spans. start_char and end_char should be character offsets, label can be either a string or an integer ID, and score should be a float. |
| user_data | - | A generic storage area, for user custom data. | 
| is_tagged | bool | A flag indicating that the document has been part-of-speech tagged. | 
| is_parsed | bool | A flag indicating that the document has been syntactically parsed. | 
| is_sentenced | bool | A flag indicating that sentence boundaries have been applied to the document. | 
| is_nered (new in v2.1) | bool | A flag indicating that named entities have been set. Will return True if any of the tokens has an entity tag set, even if the others are unknown. |
| sentiment | float | The document's positivity/negativity score, if available. | 
| user_hooks | dict | A dictionary that allows customization of the Doc's properties. | 
| user_token_hooks | dict | A dictionary that allows customization of properties of Token children. |
| user_span_hooks | dict | A dictionary that allows customization of properties of Span children. |
| _ | Underscore | User space for adding custom attribute extensions. | 
Serialization fields
During serialization, spaCy will export several data fields used to restore
different aspects of the object. If needed, you can exclude them from
serialization by passing in the string names via the exclude argument.
Example
    data = doc.to_bytes(exclude=["text", "tensor"])
    doc.from_disk("./doc.bin", exclude=["user_data"])
| Name | Description | 
|---|---|
| text | The value of the Doc.text attribute. |
| sentiment | The value of the Doc.sentiment attribute. |
| tensor | The value of the Doc.tensor attribute. |
| user_data | The value of the Doc.user_data dictionary. |
| user_data_keys | The keys of the Doc.user_data dictionary. |
| user_data_values | The values of the Doc.user_data dictionary. |
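
The effect of exclude can be pictured as filtering a mapping of named fields before writing and after reading. The sketch below uses JSON purely for illustration: the real serialization format is binary and is not shown here, and the standalone `to_bytes` and `from_bytes` functions are hypothetical.

```python
import json


def to_bytes(fields, exclude=()):
    """Serialize all fields except those named in `exclude`."""
    kept = {k: v for k, v in fields.items() if k not in exclude}
    return json.dumps(kept, sort_keys=True).encode("utf8")


def from_bytes(data, exclude=()):
    """Load fields, skipping those named in `exclude`."""
    loaded = json.loads(data.decode("utf8"))
    return {k: v for k, v in loaded.items() if k not in exclude}


fields = {"text": "Hello", "sentiment": 0.0, "user_data": {"source": "demo"}}
data = to_bytes(fields, exclude=["user_data"])
print(from_bytes(data))  # {'sentiment': 0.0, 'text': 'Hello'}
```
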