Clarify serialization of extension attributes (closes #4377) [ci skip]

Ines Montani 2019-10-05 11:58:00 +02:00
parent fec9433044
commit e65dffd80b
2 changed files with 20 additions and 1 deletions

@ -46,7 +46,7 @@ Create a `DocBin` object to hold serialized annotations.
| Argument | Type | Description |
| ----------------- | -------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ |
| `attrs` | list | List of attributes to serialize. `orth` (hash of token text) and `spacy` (whether the token is followed by whitespace) are always serialized, so they're not required. Defaults to `None`. |
| `store_user_data` | bool | Whether to include the `Doc.user_data`. Defaults to `False`. |
| `store_user_data` | bool | Whether to include the `Doc.user_data` and the values of custom extension attributes. Defaults to `False`. |
| **RETURNS** | `DocBin` | The newly constructed object. |
## DocBin.\_\_len\_\_ {#len tag="method"}

@ -92,6 +92,25 @@ doc_bin = DocBin().from_bytes(bytes_data)
docs = list(doc_bin.get_docs(nlp.vocab))
```
If `store_user_data` is set to `True`, the `Doc.user_data` will be serialized as
well, which includes the values of
[extension attributes](/processing-pipelines#custom-components-attributes) (if
they're serializable with msgpack).
<Infobox title="Important note on serializing extension attributes" variant="warning">
Including the `Doc.user_data` and extension attributes will only serialize the
**values** of the attributes. To restore the values and access them via the
`doc._.` property, you need to register the global attribute on the `Doc` again.
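```python
from spacy.tokens import Doc
docs = list(doc_bin.get_docs(nlp.vocab))
# Register the extension again so the stored values are reachable via doc._
Doc.set_extension("my_custom_attr", default=None)
print([doc._.my_custom_attr for doc in docs])
```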
</Infobox>
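The round trip described above can be sketched end to end as follows. This is a minimal illustration, not part of the commit, and it rests on a few assumptions: a blank English pipeline from `spacy.blank("en")` stands in for `nlp`, the value `"some value"` is an arbitrary msgpack-serializable placeholder, and `Doc.remove_extension` is used only to simulate loading the data in a fresh process where the extension has not been registered yet. The loading `DocBin` is also created with `store_user_data=True`, which appears to be the safest choice across spaCy versions.

```python
import spacy
from spacy.tokens import Doc, DocBin

# Assumption: a blank pipeline is enough here, no trained model is needed.
nlp = spacy.blank("en")

# Register the extension and set a value that msgpack can serialize.
# force=True only makes the sketch safe to re-run in the same session.
Doc.set_extension("my_custom_attr", default=None, force=True)
doc = nlp("Hello world")
doc._.my_custom_attr = "some value"

# Serialize the doc together with Doc.user_data (and so the attribute value).
doc_bin = DocBin(store_user_data=True)
doc_bin.add(doc)
bytes_data = doc_bin.to_bytes()

# Simulate a fresh process: the global registration is gone,
# but the value is still stored inside bytes_data.
Doc.remove_extension("my_custom_attr")

# Load the docs again, also asking for the user data to be restored.
restored_bin = DocBin(store_user_data=True).from_bytes(bytes_data)
docs = list(restored_bin.get_docs(nlp.vocab))

# Register the extension again before reading the value via doc._
Doc.set_extension("my_custom_attr", default=None)
values = [d._.my_custom_attr for d in docs]
print(values)
```

Accessing `d._.my_custom_attr` before the second `Doc.set_extension` call would raise an `AttributeError`, because only the value is stored, not the registration.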
### Using Pickle {#pickle}
> #### Example