mirror of
https://github.com/explosion/spaCy.git
synced 2025-10-24 20:51:30 +03:00
* add informative warning when messing up store_user_data DocBin flags * add informative warning when messing up store_user_data DocBin flags * cleanup test * rename to patterns_path
185 lines
7.2 KiB
Markdown
185 lines
7.2 KiB
Markdown
---
|
|
title: DocBin
|
|
tag: class
|
|
new: 2.2
|
|
teaser: Pack Doc objects for binary serialization
|
|
source: spacy/tokens/_serialize.py
|
|
---
|
|
|
|
The `DocBin` class lets you efficiently serialize the information from a
|
|
collection of `Doc` objects. You can control which information is serialized by
|
|
passing a list of attribute IDs, and optionally also specify whether the user
|
|
data is serialized. The `DocBin` is faster and produces smaller data sizes than
|
|
pickle, and allows you to deserialize without executing arbitrary Python code. A
|
|
notable downside to this format is that you can't easily extract just one
|
|
document from the `DocBin`. The serialization format is gzipped msgpack, where
|
|
the msgpack object has the following structure:
|
|
|
|
```python
|
|
### msgpack object structrue
|
|
{
|
|
"version": str, # DocBin version number
|
|
"attrs": List[uint64], # e.g. [TAG, HEAD, ENT_IOB, ENT_TYPE]
|
|
"tokens": bytes, # Serialized numpy uint64 array with the token data
|
|
"spaces": bytes, # Serialized numpy boolean array with spaces data
|
|
"lengths": bytes, # Serialized numpy int32 array with the doc lengths
|
|
"strings": List[str] # List of unique strings in the token data
|
|
}
|
|
```
|
|
|
|
Strings for the words, tags, labels etc are represented by 64-bit hashes in the
|
|
token data, and every string that occurs at least once is passed via the strings
|
|
object. This means the storage is more efficient if you pack more documents
|
|
together, because you have less duplication in the strings. For usage examples,
|
|
see the docs on [serializing `Doc` objects](/usage/saving-loading#docs).
|
|
|
|
## DocBin.\_\_init\_\_ {#init tag="method"}
|
|
|
|
Create a `DocBin` object to hold serialized annotations.
|
|
|
|
> #### Example
|
|
>
|
|
> ```python
|
|
> from spacy.tokens import DocBin
|
|
> doc_bin = DocBin(attrs=["ENT_IOB", "ENT_TYPE"])
|
|
> ```
|
|
|
|
| Argument | Description |
|
|
| ----------------- | --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
|
|
| `attrs` | List of attributes to serialize. `ORTH` (hash of token text) and `SPACY` (whether the token is followed by whitespace) are always serialized, so they're not required. Defaults to `("ORTH", "TAG", "HEAD", "DEP", "ENT_IOB", "ENT_TYPE", "ENT_KB_ID", "LEMMA", "MORPH", "POS")`. ~~Iterable[str]~~ |
|
|
| `store_user_data` | Whether to write the `Doc.user_data` and the values of custom extension attributes to file/bytes. Defaults to `False`. ~~bool~~ |
|
|
| `docs` | `Doc` objects to add on initialization. ~~Iterable[Doc]~~ |
|
|
|
|
## DocBin.\_\len\_\_ {#len tag="method"}
|
|
|
|
Get the number of `Doc` objects that were added to the `DocBin`.
|
|
|
|
> #### Example
|
|
>
|
|
> ```python
|
|
> doc_bin = DocBin(attrs=["LEMMA"])
|
|
> doc = nlp("This is a document to serialize.")
|
|
> doc_bin.add(doc)
|
|
> assert len(doc_bin) == 1
|
|
> ```
|
|
|
|
| Argument | Description |
|
|
| ----------- | --------------------------------------------------- |
|
|
| **RETURNS** | The number of `Doc`s added to the `DocBin`. ~~int~~ |
|
|
|
|
## DocBin.add {#add tag="method"}
|
|
|
|
Add a `Doc`'s annotations to the `DocBin` for serialization.
|
|
|
|
> #### Example
|
|
>
|
|
> ```python
|
|
> doc_bin = DocBin(attrs=["LEMMA"])
|
|
> doc = nlp("This is a document to serialize.")
|
|
> doc_bin.add(doc)
|
|
> ```
|
|
|
|
| Argument | Description |
|
|
| -------- | -------------------------------- |
|
|
| `doc` | The `Doc` object to add. ~~Doc~~ |
|
|
|
|
## DocBin.get_docs {#get_docs tag="method"}
|
|
|
|
Recover `Doc` objects from the annotations, using the given vocab.
|
|
|
|
> #### Example
|
|
>
|
|
> ```python
|
|
> docs = list(doc_bin.get_docs(nlp.vocab))
|
|
> ```
|
|
|
|
| Argument | Description |
|
|
| ---------- | --------------------------- |
|
|
| `vocab` | The shared vocab. ~~Vocab~~ |
|
|
| **YIELDS** | The `Doc` objects. ~~Doc~~ |
|
|
|
|
## DocBin.merge {#merge tag="method"}
|
|
|
|
Extend the annotations of this `DocBin` with the annotations from another. Will
|
|
raise an error if the pre-defined `attrs` of the two `DocBin`s don't match.
|
|
|
|
> #### Example
|
|
>
|
|
> ```python
|
|
> doc_bin1 = DocBin(attrs=["LEMMA", "POS"])
|
|
> doc_bin1.add(nlp("Hello world"))
|
|
> doc_bin2 = DocBin(attrs=["LEMMA", "POS"])
|
|
> doc_bin2.add(nlp("This is a sentence"))
|
|
> doc_bin1.merge(doc_bin2)
|
|
> assert len(doc_bin1) == 2
|
|
> ```
|
|
|
|
| Argument | Description |
|
|
| -------- | ------------------------------------------------------ |
|
|
| `other` | The `DocBin` to merge into the current bin. ~~DocBin~~ |
|
|
|
|
## DocBin.to_bytes {#to_bytes tag="method"}
|
|
|
|
Serialize the `DocBin`'s annotations to a bytestring.
|
|
|
|
> #### Example
|
|
>
|
|
> ```python
|
|
> docs = [nlp("Hello world!")]
|
|
> doc_bin = DocBin(docs=docs)
|
|
> doc_bin_bytes = doc_bin.to_bytes()
|
|
> ```
|
|
|
|
| Argument | Description |
|
|
| ----------- | ---------------------------------- |
|
|
| **RETURNS** | The serialized `DocBin`. ~~bytes~~ |
|
|
|
|
## DocBin.from_bytes {#from_bytes tag="method"}
|
|
|
|
Deserialize the `DocBin`'s annotations from a bytestring.
|
|
|
|
> #### Example
|
|
>
|
|
> ```python
|
|
> doc_bin_bytes = doc_bin.to_bytes()
|
|
> new_doc_bin = DocBin().from_bytes(doc_bin_bytes)
|
|
> ```
|
|
|
|
| Argument | Description |
|
|
| ------------ | -------------------------------- |
|
|
| `bytes_data` | The data to load from. ~~bytes~~ |
|
|
| **RETURNS** | The loaded `DocBin`. ~~DocBin~~ |
|
|
|
|
## DocBin.to_disk {#to_disk tag="method" new="3"}
|
|
|
|
Save the serialized `DocBin` to a file. Typically uses the `.spacy` extension
|
|
and the result can be used as the input data for
|
|
[`spacy train`](/api/cli#train).
|
|
|
|
> #### Example
|
|
>
|
|
> ```python
|
|
> docs = [nlp("Hello world!")]
|
|
> doc_bin = DocBin(docs=docs)
|
|
> doc_bin.to_disk("./data.spacy")
|
|
> ```
|
|
|
|
| Argument | Description |
|
|
| -------- | -------------------------------------------------------------------------- |
|
|
| `path` | The file path, typically with the `.spacy` extension. ~~Union[str, Path]~~ |
|
|
|
|
## DocBin.from_disk {#from_disk tag="method" new="3"}
|
|
|
|
Load a serialized `DocBin` from a file. Typically uses the `.spacy` extension.
|
|
|
|
> #### Example
|
|
>
|
|
> ```python
|
|
> doc_bin = DocBin().from_disk("./data.spacy")
|
|
> ```
|
|
|
|
| Argument | Description |
|
|
| ----------- | -------------------------------------------------------------------------- |
|
|
| `path` | The file path, typically with the `.spacy` extension. ~~Union[str, Path]~~ |
|
|
| **RETURNS** | The loaded `DocBin`. ~~DocBin~~ |
|