Update v2-2 docs

Matthew Honnibal 2019-09-18 14:07:55 +02:00
parent fa9a283128
commit f537cbeacc


@@ -98,9 +98,10 @@ on disk**.
> #### Example
>
-> ```python
-> scorer = nlp.evaluate(dev_data)
-> print(scorer.textcat_scores, scorer.textcats_per_cat)
+> ```bash
+> spacy train en /path/to/output /path/to/train /path/to/dev \
+> --pipeline textcat \
+> --textcat-arch simple_cnn --textcat-multilabel
> ```

When training your models using the `spacy train` command, you can now also
@@ -117,6 +118,34 @@ classification.
</Infobox>

### New DocPallet class to efficiently serialize Doc collections

> #### Example
>
> ```python
> import spacy
> from spacy.tokens import DocPallet
>
> texts = ["Some text", "Lots of texts...", "..."]
> nlp = spacy.load("en_core_web_sm")
> pallet = DocPallet(attrs=["LEMMA", "ENT_IOB", "ENT_TYPE"], store_user_data=False)
> for doc in nlp.pipe(texts):
>     pallet.add(doc)
> byte_data = pallet.to_bytes()
>
> # Deserialize later, e.g. in a new process
> nlp = spacy.blank("en")
> pallet = DocPallet().from_bytes(byte_data)
> docs = list(pallet.get_docs(nlp.vocab))
> ```

If you're working with lots of data, you'll probably need to pass analyses
between machines, either to use something like Dask or Spark, or even just to
save out work to disk. Often it's sufficient to use the `doc.to_array()`
functionality for this and just serialize the numpy arrays, but other times
you want a more general way to save and restore `Doc` objects.

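For reference, here is a minimal sketch of that array-based round trip. The
model name, file name and attribute list are just placeholders for this
example, and anything not stored in the array (such as the token texts) has to
be carried along separately.

```python
import numpy
import spacy
from spacy.attrs import ENT_IOB, ENT_TYPE, LEMMA
from spacy.tokens import Doc

nlp = spacy.load("en_core_web_sm")  # placeholder model
attrs = [LEMMA, ENT_IOB, ENT_TYPE]

doc = nlp("Apple is looking at buying U.K. startup.")
array = doc.to_array(attrs)            # shape (n_tokens, len(attrs)), dtype uint64
words = [token.text for token in doc]  # token texts aren't stored in the array

# Save and load the array however you like, e.g. with numpy
numpy.save("doc_attrs.npy", array)
loaded = numpy.load("doc_attrs.npy")

# Rebuild a Doc with the same words, then restore the attribute values
new_doc = Doc(nlp.vocab, words=words).from_array(attrs, loaded)
assert [t.lemma_ for t in new_doc] == [t.lemma_ for t in doc]
```
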
The new `DocPallet` class makes it easy to serialize and deserialize
a collection of `Doc` objects together, and is much more efficient than
calling `doc.to_bytes()` on each individual `Doc` object. You can also control
what data gets saved, and you can merge pallets together for easy
map/reduce-style processing.
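
The map/reduce-style workflow could look roughly like the sketch below. It is
only a sketch under assumptions: the prose above implies pallets can be merged,
but the `merge` method name used here is a guess, and the pipeline and batch
contents are placeholders.

```python
import spacy
from spacy.tokens import DocPallet

def process_batch(nlp, texts):
    # "Map" step: each worker annotates a batch and ships it back as bytes
    pallet = DocPallet(attrs=["LEMMA", "ENT_IOB", "ENT_TYPE"])
    for doc in nlp.pipe(texts):
        pallet.add(doc)
    return pallet.to_bytes()

def combine(byte_chunks):
    # "Reduce" step: deserialize each chunk and fold it into one pallet
    pallets = [DocPallet().from_bytes(chunk) for chunk in byte_chunks]
    combined = pallets[0]
    for other in pallets[1:]:
        combined.merge(other)  # assumed method name, not confirmed by the docs above
    return combined

nlp = spacy.load("en_core_web_sm")  # placeholder pipeline
batches = [["First worker's text."], ["Second worker's text."]]
chunks = [process_batch(nlp, batch) for batch in batches]
docs = list(combine(chunks).get_docs(nlp.vocab))
```
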
### CLI command to debug and validate training data {#debug-data}
> #### Example