mirror of https://github.com/explosion/spaCy.git
synced 2025-11-04 09:57:26 +03:00

Update details [ci skip]

This commit is contained in:
parent e9b68d4f4c
commit ca0d904faa

BIN  website/docs/images/prodigy_spans-manual.jpg  (Normal file)
Binary file not shown. After: Size: 304 KiB

@@ -12,7 +12,30 @@ menu:

### Using predicted annotations during training {#predicted-annotations-training}

By default, components are updated in isolation during training, which means
that they don't see the predictions of any earlier components in the pipeline.
The new
[`[training.annotating_components]`](/usage/training#annotating-components)
config setting lets you specify pipeline component names that should set
annotations on the predicted docs during training. This makes it easy to use the
predictions of a previous component in the pipeline as features for a subsequent
component, e.g. the dependency labels in the tagger:

```ini
### config.cfg (excerpt) {highlight="7,12"}
[nlp]
pipeline = ["parser", "tagger"]

[components.tagger.model.tok2vec.embed]
@architectures = "spacy.MultiHashEmbed.v1"
width = ${components.tagger.model.tok2vec.encode.width}
attrs = ["NORM","DEP"]
rows = [5000,2500]
include_static_vectors = false

[training]
annotating_components = ["parser"]
```
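
The same mechanism is also exposed on [`nlp.update`](/api/language#update) via
its `annotates` argument. A minimal sketch, assuming spaCy v3.1+ and an
installed small English pipeline; the example text and tags are only for
illustration:

```python
import spacy
from spacy.training import Example

nlp = spacy.load("en_core_web_sm")
doc = nlp.make_doc("I like eggs.")
example = Example.from_dict(doc, {"tags": ["PRP", "VBP", "NNS", "."]})

optimizer = nlp.resume_training()
# The parser predicts on the docs first, so its dependency labels are
# available as features while the tagger is being updated
nlp.update([example], sgd=optimizer, annotates=["parser"])
```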

<Project id="pipelines/tagger_parser_predicted_annotations">

@@ -41,7 +64,7 @@ available via the [`Doc.spans`](/api/doc#spans) container.

<Infobox title="Tip: Create data with Prodigy's new span annotation UI">

[![Prodigy's new manual spans UI](../images/prodigy_spans-manual.jpg)](https://support.prodi.gy/t/3861)

The upcoming version of our annotation tool [Prodigy](https://prodi.gy)
(currently available as a [pre-release](https://support.prodi.gy/t/3861) for all

@@ -66,11 +89,11 @@ for spaCy's `SpanCategorizer` component.

The [`EntityRecognizer`](/api/entityrecognizer) can now be updated with known
incorrect annotations, which lets you take advantage of partial and sparse data.
For example, you'll be able to use the information that certain spans of text
are definitely **not** `PERSON` entities, without having to provide the complete
gold-standard annotations for the given example. The incorrect span annotations
can be added via the [`Doc.spans`](/api/doc#spans) container in the training
data, under the key defined as
[`incorrect_spans_key`](/api/entityrecognizer#init) in the component config.

```python
train_doc = nlp.make_doc("Barack Obama was born in Hawaii.")
```
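
Continuing the snippet above, a minimal sketch of how such negative annotations
could be attached, assuming the component config sets
`incorrect_spans_key = "incorrect_spans"`:

```python
from spacy.tokens import Span

# Spans known to be wrong for the given label; the key must match
# the component's incorrect_spans_key setting
train_doc.spans["incorrect_spans"] = [
    Span(train_doc, 5, 6, label="PERSON"),  # "Hawaii" is not a PERSON
]
```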

@@ -104,7 +127,12 @@ your own.

### Resizable text classification architectures {#resizable-textcat}

Previously, trained [`TextCategorizer`](/api/textcategorizer) architectures
could not be resized, meaning that you couldn't add new labels to an already
trained text classifier. In spaCy v3.1, the
[TextCatCNN](/api/architectures#TextCatCNN) and
[TextCatBOW](/api/architectures#TextCatBOW) architectures are now resizable,
while ensuring that the predictions for the old labels remain the same.
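
In practice, resizing amounts to adding labels to the trained component. A
minimal sketch; the pipeline path and label name are hypothetical:

```python
import spacy

# A trained pipeline whose textcat uses a resizable architecture
# such as TextCatBOW or TextCatCNN (path is hypothetical)
nlp = spacy.load("./my_textcat_pipeline")
textcat = nlp.get_pipe("textcat")

# The output layer is resized in place; predictions for the
# existing labels are unchanged
textcat.add_label("FEEDBACK")
```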

### CLI command to assemble pipeline from config {#assemble}

@@ -119,11 +147,39 @@ $ python -m spacy assemble config.cfg ./output

### Support for streaming large or infinite corpora {#streaming-corpora}

> #### config.cfg (excerpt)
>
> ```ini
> [training]
> max_epochs = -1
> ```

The training process now supports streaming large or infinite corpora
out-of-the-box, controlled via the
[`[training.max_epochs]`](/api/data-formats#training) config setting. Setting
it to `-1` means that the train corpus is streamed rather than loaded into
memory, with no shuffling within the training loop. For details on how to
implement a custom corpus loader, e.g. to stream in data from remote storage,
see the usage guide on
[custom data reading](/usage/training#custom-code-readers-batchers).
					When streaming a corpus, only the first 100 examples will be used for
 | 
				
			||||||
 | 
					[initialization](/usage/training#config-lifecycle). This is no problem if you're
 | 
				
			||||||
 | 
					training a component like the text classifier with data that specifies all
 | 
				
			||||||
 | 
					available labels in every example. If necessary, you can use the
 | 
				
			||||||
 | 
					[`init labels`](/api/cli#init-labels) command to pre-generate the labels for
 | 
				
			||||||
 | 
					your components using a representative sample so the model can be initialized
 | 
				
			||||||
 | 
					correctly before training.
 | 
				
			||||||
 | 
					
 | 
				
			||||||
### New lemmatizers for Catalan and Italian {#pos-lemmatizers}

The trained pipelines for [Catalan](/models/ca) and [Italian](/models/it) now
include lemmatizers that use the predicted part-of-speech tags as part of the
lookup lemmatization for higher lemmatization accuracy. If you're training your
own pipelines for these languages and you want to include a lemmatizer, make
sure you have the
[`spacy-lookups-data`](https://github.com/explosion/spacy-lookups-data) package
installed, which provides the relevant tables.
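
These lemmatizers correspond to the lemmatizer's `pos_lookup` mode. A minimal
sketch for a blank Italian pipeline, assuming `spacy-lookups-data` is
installed:

```python
import spacy

# In a full pipeline, a tagger or morphologizer would run first so
# that predicted POS tags are available to the lemmatizer
nlp = spacy.blank("it")
lemmatizer = nlp.add_pipe("lemmatizer", config={"mode": "pos_lookup"})
nlp.initialize()  # loads the lookup tables from spacy-lookups-data
```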

## Notes about upgrading from v3.0 {#upgrading}