diff --git a/website/docs/usage/v2-2.md b/website/docs/usage/v2-2.md
index 2109ae812..8f339cb9b 100644
--- a/website/docs/usage/v2-2.md
+++ b/website/docs/usage/v2-2.md
@@ -98,9 +98,10 @@ on disk**.
 
 > #### Example
 >
-> ```python
-> scorer = nlp.evaluate(dev_data)
-> print(scorer.textcat_scores, scorer.textcats_per_cat)
+> ```bash
+> spacy train en /path/to/output /path/to/train /path/to/dev \
+>   --pipeline textcat \
+>   --textcat-arch simple_cnn --textcat-multilabel
 > ```
 
 When training your models using the `spacy train` command, you can now also
@@ -117,6 +118,34 @@ classification.
 
+### New DocPallet class to efficiently serialize Doc collections
+
+> #### Example
+>
+> ```python
+> from spacy.tokens import DocPallet
+> pallet = DocPallet(attrs=["LEMMA", "ENT_IOB", "ENT_TYPE"], store_user_data=False)
+> for doc in nlp.pipe(texts):
+>     pallet.add(doc)
+> byte_data = pallet.to_bytes()
+> # Deserialize later, e.g. in a new process
+> nlp = spacy.blank("en")
+> pallet = DocPallet().from_bytes(byte_data)
+> docs = list(pallet.get_docs(nlp.vocab))
+> ```
+
+If you're working with lots of data, you'll probably need to pass analyses
+between machines, either to use something like Dask or Spark, or even just to
+save out work to disk. Often it's sufficient to use the `doc.to_array()`
+functionality for this, and just serialize the numpy arrays --- but other times
+you want a more general way to save and restore `Doc` objects.
+
+The new `DocPallet` class makes it easy to serialize and deserialize
+a collection of `Doc` objects together, and is much more efficient than
+calling `doc.to_bytes()` on each individual `Doc` object. You can also control
+what data gets saved, and you can merge pallets together for easy
+map/reduce-style processing.
+
 ### CLI command to debug and validate training data {#debug-data}
 
 > #### Example
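As a rough sketch of the map/reduce-style workflow the new section describes, using only the `DocPallet` calls shown in the example above (`add`, `to_bytes`, `from_bytes`, `get_docs`) — the `worker` and `reduce_results` helpers and the `en_core_web_sm` model name are illustrative assumptions, not part of the diff:

```python
# Sketch only: "map" batches of texts to serialized pallets, then "reduce"
# them back to Doc objects in a single process. Helper names are hypothetical.
import spacy
from spacy.tokens import DocPallet

def worker(texts):
    # Map step, e.g. run on each Dask/Spark worker: parse a batch and
    # return compact bytes instead of pickling full Doc objects.
    nlp = spacy.load("en_core_web_sm")
    pallet = DocPallet(attrs=["LEMMA", "ENT_IOB", "ENT_TYPE"])
    for doc in nlp.pipe(texts):
        pallet.add(doc)
    return pallet.to_bytes()

def reduce_results(byte_chunks):
    # Reduce step: restore all Doc objects against a shared vocab.
    nlp = spacy.blank("en")
    docs = []
    for byte_data in byte_chunks:
        pallet = DocPallet().from_bytes(byte_data)
        docs.extend(pallet.get_docs(nlp.vocab))
    return docs
```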