| title | tag | new | teaser | source |
|---|---|---|---|---|
| DocBin | class | 2.2 | Pack Doc objects for binary serialization | spacy/tokens/_serialize.py | 
The DocBin class lets you efficiently serialize the information from a
collection of Doc objects. You can control which information is serialized by
passing a list of attribute IDs, and optionally also specify whether the user
data is serialized. The DocBin is faster and produces smaller data sizes than
pickle, and allows you to deserialize without executing arbitrary Python code. A
notable downside to this format is that you can't easily extract just one
document from the DocBin. The serialization format is gzipped msgpack, where
the msgpack object has the following structure:
### msgpack object structure
{
    "attrs": List[uint64],    # e.g. [TAG, HEAD, ENT_IOB, ENT_TYPE]
    "tokens": bytes,          # Serialized numpy uint64 array with the token data
    "spaces": bytes,          # Serialized numpy boolean array with spaces data
    "lengths": bytes,         # Serialized numpy int32 array with the doc lengths
    "strings": List[unicode]  # List of unique strings in the token data
}
Strings for the words, tags, labels, etc. are represented by 64-bit hashes in the
token data, and every string that occurs at least once is passed via the strings
object. This means the storage is more efficient if you pack more documents
together, because you have less duplication in the strings. For usage examples,
see the docs on serializing Doc objects.
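The effect of the shared strings table can be illustrated without spaCy. The following is a minimal pure-Python sketch, not spaCy's implementation: it assumes a stand-in 64-bit hash (the hypothetical helper `hash64`, built on BLAKE2) and gzipped JSON in place of msgpack, and the names `pack` and `unpack` are invented for illustration.

```python
import gzip
import hashlib
import json


def hash64(text):
    """Stand-in 64-bit string hash (assumption: spaCy uses its own hash function)."""
    digest = hashlib.blake2b(text.encode("utf8"), digest_size=8).digest()
    return int.from_bytes(digest, "little")


def pack(docs):
    """Pack pre-tokenized docs into one payload with a shared strings table."""
    payload = {
        # One hash per token, all docs concatenated
        "tokens": [hash64(word) for doc in docs for word in doc],
        # Doc lengths are needed to split the token stream again
        "lengths": [len(doc) for doc in docs],
        # Each unique string is stored once, however often it occurs
        "strings": sorted({word for doc in docs for word in doc}),
    }
    return gzip.compress(json.dumps(payload).encode("utf8"))


def unpack(data):
    """Rebuild the docs by mapping hashes back to strings."""
    payload = json.loads(gzip.decompress(data).decode("utf8"))
    lookup = {hash64(s): s for s in payload["strings"]}
    words = [lookup[h] for h in payload["tokens"]]
    docs, start = [], 0
    for length in payload["lengths"]:
        docs.append(words[start:start + length])
        start += length
    return docs


docs = [["the", "cat", "sat"], ["the", "dog", "sat"]]
assert unpack(pack(docs)) == docs
```

Because "the" and "sat" occur in both documents, the strings table in this sketch holds four entries rather than six. Recovering the second document also requires walking the lengths from the start, which mirrors why a single document is not cheap to extract.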
DocBin.__init__
Create a DocBin object to hold serialized annotations.
Example
from spacy.tokens import DocBin
doc_bin = DocBin(attrs=["ENT_IOB", "ENT_TYPE"])
| Argument | Type | Description | 
|---|---|---|
| attrs | list | List of attributes to serialize. orth (hash of token text) and spacy (whether the token is followed by whitespace) are always serialized, so they're not required. Defaults to None. |
| store_user_data | bool | Whether to include the Doc.user_data and the values of custom extension attributes. Defaults to False. |
| RETURNS | DocBin | The newly constructed object. | 
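As a rough illustration of the `store_user_data` argument, the sketch below round-trips a value stored in `Doc.user_data`. It assumes a blank English pipeline created with `spacy.blank("en")` (tokenizer only, no model download), and the `"source"` key is a hypothetical example, not something spaCy defines.

```python
import spacy
from spacy.tokens import DocBin

nlp = spacy.blank("en")  # assumption: a blank pipeline is enough for this demo
doc = nlp("This is a document to serialize.")
doc.user_data["source"] = "example"  # hypothetical key, for illustration only

# Keep user data alongside the token attributes
doc_bin = DocBin(attrs=["LEMMA"], store_user_data=True)
doc_bin.add(doc)

# Rebuild the doc and check that text and user data survived
restored = list(doc_bin.get_docs(nlp.vocab))[0]
assert restored.text == doc.text
assert restored.user_data["source"] == "example"
```

With the default `store_user_data=False`, the user data would not be carried over to the restored doc.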
DocBin.__len__
Get the number of Doc objects that were added to the DocBin.
Example
doc_bin = DocBin(attrs=["LEMMA"])
doc = nlp("This is a document to serialize.")
doc_bin.add(doc)
assert len(doc_bin) == 1
| Argument | Type | Description | 
|---|---|---|
| RETURNS | int | The number of Docs added to the DocBin. |
DocBin.add
Add a Doc's annotations to the DocBin for serialization.
Example
doc_bin = DocBin(attrs=["LEMMA"])
doc = nlp("This is a document to serialize.")
doc_bin.add(doc)
| Argument | Type | Description | 
|---|---|---|
| doc | Doc | The Doc object to add. |
DocBin.get_docs
Recover Doc objects from the annotations, using the given vocab.
Example
docs = list(doc_bin.get_docs(nlp.vocab))
| Argument | Type | Description | 
|---|---|---|
| vocab | Vocab | The shared vocab. | 
| YIELDS | Doc | The Doc objects. |
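Since `get_docs` yields the docs one at a time, picking out a single document means iterating past the ones before it. The following sketch, assuming a blank English pipeline and made-up example texts, uses `itertools.islice` to fetch the second document.

```python
from itertools import islice

import spacy
from spacy.tokens import DocBin

nlp = spacy.blank("en")  # assumption: tokenizer-only pipeline
doc_bin = DocBin()
for text in ["First document.", "Second document.", "Third document."]:
    doc_bin.add(nlp(text))

# Skip one doc and take the next; the skipped doc is still reconstructed
second = next(islice(doc_bin.get_docs(nlp.vocab), 1, None))
assert second.text == "Second document."
```

There is no index-based access in this sketch, which reflects the downside noted in the introduction: the format is designed for reading a whole collection, not for random access.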
DocBin.merge
Extend the annotations of this DocBin with the annotations from another. Will
raise an error if the pre-defined attrs of the two DocBins don't match.
Example
doc_bin1 = DocBin(attrs=["LEMMA", "POS"])
doc_bin1.add(nlp("Hello world"))
doc_bin2 = DocBin(attrs=["LEMMA", "POS"])
doc_bin2.add(nlp("This is a sentence"))
doc_bin1.merge(doc_bin2)  # extends doc_bin1 in place
assert len(doc_bin1) == 2
| Argument | Type | Description | 
|---|---|---|
| other | DocBin | The DocBin to merge into the current bin. |
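The attribute check described above can be exercised with two empty bins. This is a small sketch under one assumption: that the mismatch surfaces as a `ValueError`. The text above only says that an error is raised, so treat the exception type as unverified.

```python
from spacy.tokens import DocBin

lemma_bin = DocBin(attrs=["LEMMA"])
lemma_pos_bin = DocBin(attrs=["LEMMA", "POS"])

try:
    lemma_bin.merge(lemma_pos_bin)  # attrs differ, so this should fail
    merged = True
except ValueError:  # assumption: the mismatch is reported as a ValueError
    merged = False

assert not merged
```

In practice, constructing every bin with the same `attrs` list avoids the error.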
DocBin.to_bytes
Serialize the DocBin's annotations to a bytestring.
Example
doc_bin = DocBin(attrs=["DEP", "HEAD"])
doc_bin_bytes = doc_bin.to_bytes()
| Argument | Type | Description | 
|---|---|---|
| RETURNS | bytes | The serialized DocBin. | 
DocBin.from_bytes
Deserialize the DocBin's annotations from a bytestring.
Example
doc_bin_bytes = doc_bin.to_bytes()
new_doc_bin = DocBin().from_bytes(doc_bin_bytes)
| Argument | Type | Description | 
|---|---|---|
| bytes_data | bytes | The data to load from. | 
| RETURNS | DocBin | The loaded DocBin. |
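Putting `to_bytes` and `from_bytes` together, a full round trip through a file might look like the sketch below. It assumes a blank English pipeline, and the file name `docs.bin` is a hypothetical choice. Only methods documented on this page are used, with the bytes written and read via `pathlib`.

```python
import tempfile
from pathlib import Path

import spacy
from spacy.tokens import DocBin

nlp = spacy.blank("en")  # assumption: tokenizer-only pipeline
texts = ["Hello world", "This is a sentence"]

# Pack all docs into a single DocBin
doc_bin = DocBin()
for doc in nlp.pipe(texts):
    doc_bin.add(doc)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = Path(tmp_dir) / "docs.bin"  # hypothetical file name
    path.write_bytes(doc_bin.to_bytes())
    loaded = DocBin().from_bytes(path.read_bytes())

# Recover the docs using the same vocab
docs = list(loaded.get_docs(nlp.vocab))
assert [d.text for d in docs] == texts
```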