Update details [ci skip]

Ines Montani 2021-06-23 13:05:56 +10:00
parent e9b68d4f4c
commit ca0d904faa
2 changed files with 66 additions and 10 deletions

Binary file not shown (image, 304 KiB).

@@ -12,7 +12,30 @@ menu:
### Using predicted annotations during training {#predicted-annotations-training}
By default, components are updated in isolation during training, which means
that they don't see the predictions of any earlier components in the pipeline.
The new
[`[training.annotating_components]`](/usage/training#annotating-components)
config setting lets you specify pipeline component names that should set
annotations on the predicted docs during training. This makes it easy to use the
predictions of a previous component in the pipeline as features for a subsequent
component, e.g. using the parser's dependency labels as features for the
tagger:
```ini
### config.cfg (excerpt) {highlight="7,12"}
[nlp]
pipeline = ["parser", "tagger"]

[components.tagger.model.tok2vec.embed]
@architectures = "spacy.MultiHashEmbed.v1"
width = ${components.tagger.model.tok2vec.encode.width}
attrs = ["NORM","DEP"]
rows = [5000,2500]
include_static_vectors = false

[training]
annotating_components = ["parser"]
```
<Project id="pipelines/tagger_parser_predicted_annotations">
@@ -41,7 +64,7 @@ available via the [`Doc.spans`](/api/doc#spans) container.
<Infobox title="Tip: Create data with Prodigy's new span annotation UI">
[![Prodigy: example of the new manual spans UI](../images/prodigy_spans-manual.jpg)](https://support.prodi.gy/t/3861)
The upcoming version of our annotation tool [Prodigy](https://prodi.gy)
(currently available as a [pre-release](https://support.prodi.gy/t/3861) for all
@@ -66,11 +89,11 @@ for spaCy's `SpanCategorizer` component.
The [`EntityRecognizer`](/api/entityrecognizer) can now be updated with known
incorrect annotations, which lets you take advantage of partial and sparse data.
For example, you'll be able to use the information that certain spans of text
are definitely **not** `PERSON` entities, without having to provide the complete
gold-standard annotations for the given example. The incorrect span annotations
can be added via the [`Doc.spans`](/api/doc#spans) in the training data under
the key defined as [`incorrect_spans_key`](/api/entityrecognizer#init) in the
component config.
```python
from spacy.tokens import Span

train_doc = nlp.make_doc("Barack Obama was born in Hawaii.")
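# A sketch of marking a known-incorrect annotation; this assumes the
# component config sets incorrect_spans_key = "incorrect_spans"
train_doc.spans["incorrect_spans"] = [
    # "Hawaii" is definitely not a PERSON entity
    Span(train_doc, 5, 6, label="PERSON"),
]
```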
@@ -104,7 +127,12 @@ your own.
### Resizable text classification architectures {#resizable-textcat}
Previously, trained [`TextCategorizer`](/api/textcategorizer) architectures
could not be resized, meaning that you couldn't add new labels to an already
trained text classifier. In spaCy v3.1, the
[TextCatCNN](/api/architectures#TextCatCNN) and
[TextCatBOW](/api/architectures#TextCatBOW) architectures are now resizable,
while ensuring that the predictions for the old labels remain the same.
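As a minimal sketch (the pipeline path and component name here are placeholder
assumptions), adding a label to an already trained pipeline could look like
this:
```python
import spacy

# Hypothetical pipeline trained with a resizable architecture like TextCatBOW
nlp = spacy.load("./my_textcat_pipeline")
textcat = nlp.get_pipe("textcat")
# Add a new label; predictions for the existing labels remain unchanged
textcat.add_label("NEW_LABEL")
```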
### CLI command to assemble pipeline from config {#assemble}
@@ -119,11 +147,39 @@ $ python -m spacy assemble config.cfg ./output
### Support for streaming large or infinite corpora {#streaming-corpora}
> #### config.cfg (excerpt)
>
> ```ini
> [training]
> max_epochs = -1
> ```
The training process now supports streaming large or infinite corpora
out-of-the-box, which can be controlled via the
[`[training.max_epochs]`](/api/data-formats#training) config setting. Setting it
to `-1` means that the train corpus will be streamed rather than loaded into
memory, with no shuffling within the training loop. For details on how to
implement a custom corpus loader, e.g. to stream in data from remote storage,
see the usage guide on
[custom data reading](/usage/training#custom-code-readers-batchers).
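As a rough sketch, a custom streaming reader could be registered like this
(the reader name, file handling and empty annotations are illustrative
assumptions, not a built-in reader):
```python
from typing import Callable, Iterator
import spacy
from spacy.language import Language
from spacy.training import Example

@spacy.registry.readers("stream_data.v1")  # hypothetical reader name
def stream_data(path: str) -> Callable[[Language], Iterator[Example]]:
    def generate_stream(nlp: Language) -> Iterator[Example]:
        # Yield examples one by one, e.g. from a file or remote storage,
        # instead of loading the whole corpus into memory
        with open(path, encoding="utf8") as file_:
            for line in file_:
                doc = nlp.make_doc(line.strip())
                yield Example.from_dict(doc, {})  # gold annotations go here
    return generate_stream
```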
When streaming a corpus, only the first 100 examples will be used for
[initialization](/usage/training#config-lifecycle). This is no problem if you're
training a component like the text classifier with data that specifies all
available labels in every example. If necessary, you can use the
[`init labels`](/api/cli#init-labels) command to pre-generate the labels for
your components using a representative sample so the model can be initialized
correctly before training.
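For example, a hypothetical invocation (paths are placeholders):
```cli
$ python -m spacy init labels config.cfg ./labels
```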
### New lemmatizers for Catalan and Italian {#pos-lemmatizers}
<!-- TODO: write -->
The trained pipelines for [Catalan](/models/ca) and [Italian](/models/it) now
include lemmatizers that use the predicted part-of-speech tags as part of the
lookup lemmatization for higher accuracy. If you're training your
own pipelines for these languages and you want to include a lemmatizer, make
sure you have the
[`spacy-lookups-data`](https://github.com/explosion/spacy-lookups-data) package
installed, which provides the relevant tables.
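The package can be installed via pip:
```cli
$ pip install spacy-lookups-data
```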
## Notes about upgrading from v3.0 {#upgrading}