### Using predicted annotations during training {#predicted-annotations-training}

By default, components are updated in isolation during training, which means
that they don't see the predictions of any earlier components in the pipeline.
The new
[`[training.annotating_components]`](/usage/training#annotating-components)
config setting lets you specify pipeline component names that should set
annotations on the predicted docs during training. This makes it easy to use the
predictions of a previous component in the pipeline as features for a subsequent
component, e.g. the dependency labels in the tagger:

```ini
### config.cfg (excerpt) {highlight="7,12"}
[nlp]
pipeline = ["parser", "tagger"]

[components.tagger.model.tok2vec.embed]
@architectures = "spacy.MultiHashEmbed.v1"
width = ${components.tagger.model.tok2vec.encode.width}
attrs = ["NORM","DEP"]
rows = [5000,2500]
include_static_vectors = false

[training]
annotating_components = ["parser"]
```

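The same behavior is available at runtime via the `annotates` argument of
[`Language.update`](/api/language#update). A minimal sketch, assuming the
trained pipeline `en_core_web_sm` is installed (the example text and tags are
illustrative only):

```python
import spacy
from spacy.training import Example

nlp = spacy.load("en_core_web_sm")  # assumes this pipeline is installed
doc = nlp.make_doc("I like trees.")
example = Example.from_dict(doc, {"tags": ["PRP", "VBP", "NNS", "."]})
# The parser sets its predictions on the doc first, so the tagger can use
# the dependency labels as features during this update
losses = nlp.update([example], annotates=["parser"])
```
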
<Project id="pipelines/tagger_parser_predicted_annotations">

<Infobox title="Tip: Create data with Prodigy's new span annotation UI">

[![Prodigy: example of the new manual spans UI](../images/prodigy_spans-manual.jpg)](https://support.prodi.gy/t/3861)

The upcoming version of our annotation tool [Prodigy](https://prodi.gy)
(currently available as a [pre-release](https://support.prodi.gy/t/3861) for all

The [`EntityRecognizer`](/api/entityrecognizer) can now be updated with known
incorrect annotations, which lets you take advantage of partial and sparse data.
For example, you'll be able to use the information that certain spans of text
are definitely **not** `PERSON` entities, without having to provide the complete
gold-standard annotations for the given example. The incorrect span annotations
can be added via the [`Doc.spans`](/api/doc#spans) in the training data under
the key defined as [`incorrect_spans_key`](/api/entityrecognizer#init) in the
component config.

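In the component config, that key might be declared like this (a sketch:
`"incorrect_spans"` is an arbitrary example key, not a built-in default):

```ini
[components.ner]
factory = "ner"
incorrect_spans_key = "incorrect_spans"
```
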
```python
from spacy.tokens import Span

train_doc = nlp.make_doc("Barack Obama was born in Hawaii.")
# A sketch of the idea: mark "Hawaii" as definitely NOT a PERSON entity,
# stored in the training data under the key set as incorrect_spans_key
train_doc.spans["incorrect_spans"] = [Span(train_doc, 5, 6, label="PERSON")]
```

### Resizable text classification architectures {#resizable-textcat}

Previously, trained [`TextCategorizer`](/api/textcategorizer) architectures
could not be resized, meaning that you couldn't add new labels to an already
trained text classifier. In spaCy v3.1, the
[TextCatCNN](/api/architectures#TextCatCNN) and
[TextCatBOW](/api/architectures#TextCatBOW) architectures are now resizable,
while ensuring that the predictions for the old labels remain the same.

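In practice, you resize a text classifier by adding labels via
[`TextCategorizer.add_label`](/api/textcategorizer#add_label). A minimal
sketch, assuming a pipeline at the hypothetical path `./my_textcat_model` that
was trained with one of the resizable architectures:

```python
import spacy

nlp = spacy.load("./my_textcat_model")  # hypothetical trained textcat pipeline
textcat = nlp.get_pipe("textcat")
# Resizes the output layer in place; predictions for the old labels
# are preserved
textcat.add_label("NEW_LABEL")
```
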
### CLI command to assemble pipeline from config {#assemble}

```cli
$ python -m spacy assemble config.cfg ./output
```

### Support for streaming large or infinite corpora {#streaming-corpora}

> #### config.cfg (excerpt)
>
> ```ini
> [training]
> max_epochs = -1
> ```

The training process now supports streaming large or infinite corpora out of
the box, which can be controlled via the
[`[training.max_epochs]`](/api/data-formats#training) config setting. Setting it
to `-1` means that the train corpus should be streamed rather than loaded into
memory, with no shuffling within the training loop. For details on how to
implement a custom corpus loader, e.g. to stream in data from remote storage,
see the usage guide on
[custom data reading](/usage/training#custom-code-readers-batchers).

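A minimal sketch of such a custom reader, registered under the hypothetical
name `"stream_jsonl.v1"` and reading a newline-delimited JSON file with `text`
and `cats` fields (the name, path and fields are assumptions for illustration):

```python
from typing import Callable, Iterator

import spacy
import srsly
from spacy.language import Language
from spacy.training import Example

@spacy.registry.readers("stream_jsonl.v1")  # hypothetical reader name
def stream_jsonl(path: str) -> Callable[[Language], Iterator[Example]]:
    def generate_stream(nlp: Language) -> Iterator[Example]:
        # Yield examples one at a time instead of loading everything into memory
        for line in srsly.read_jsonl(path):
            doc = nlp.make_doc(line["text"])
            yield Example.from_dict(doc, {"cats": line["cats"]})
    return generate_stream
```

The reader can then be referenced from the `[corpora.train]` block of your
config via `@readers = "stream_jsonl.v1"`.
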
When streaming a corpus, only the first 100 examples will be used for
[initialization](/usage/training#config-lifecycle). This is no problem if you're
training a component like the text classifier with data that specifies all
available labels in every example. If necessary, you can use the
[`init labels`](/api/cli#init-labels) command to pre-generate the labels for
your components using a representative sample so the model can be initialized
correctly before training.

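For example (a sketch: the output path is arbitrary):

```cli
$ python -m spacy init labels config.cfg ./labels
```

The generated label data can then be referenced via the
`[initialize.components]` settings in your config.
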
### New lemmatizers for Catalan and Italian {#pos-lemmatizers}

The trained pipelines for [Catalan](/models/ca) and [Italian](/models/it) now
include lemmatizers that use the predicted part-of-speech tags as part of the
lookup lemmatization, for higher accuracy. If you're training your own
pipelines for these languages and you want to include a lemmatizer, make sure
you have the
[`spacy-lookups-data`](https://github.com/explosion/spacy-lookups-data) package
installed, which provides the relevant tables.

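For example, via pip:

```bash
$ pip install spacy-lookups-data
```
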
## Notes about upgrading from v3.0 {#upgrading}