### Using predicted annotations during training {#predicted-annotations-training}
By default, components are updated in isolation during training, which means
that they don't see the predictions of any earlier components in the pipeline.
The new
[`[training.annotating_components]`](/usage/training#annotating-components)
config setting lets you specify pipeline component names that should set
annotations on the predicted docs during training. This makes it easy to use the
predictions of a previous component in the pipeline as features for a subsequent
component, e.g. the dependency labels in the tagger:

```ini
### config.cfg (excerpt) {highlight="7,12"}
[nlp]
pipeline = ["parser", "tagger"]

[components.tagger.model.tok2vec.embed]
@architectures = "spacy.MultiHashEmbed.v1"
width = ${components.tagger.model.tok2vec.encode.width}
attrs = ["NORM","DEP"]
rows = [5000,2500]
include_static_vectors = false

[training]
annotating_components = ["parser"]
```
<Project id="pipelines/tagger_parser_predicted_annotations" />
<Infobox title="Tip: Create data with Prodigy's new span annotation UI">

[![Prodigy: example of the new manual spans UI](../images/prodigy_spans-manual.jpg)](https://support.prodi.gy/t/3861)

The upcoming version of our annotation tool [Prodigy](https://prodi.gy)
(currently available as a [pre-release](https://support.prodi.gy/t/3861) for all
The [`EntityRecognizer`](/api/entityrecognizer) can now be updated with known
incorrect annotations, which lets you take advantage of partial and sparse data.
For example, you'll be able to use the information that certain spans of text
are definitely **not** `PERSON` entities, without having to provide the complete
gold-standard annotations for the given example. The incorrect span annotations
can be added via the [`Doc.spans`](/api/doc#spans) in the training data under
the key defined as [`incorrect_spans_key`](/api/entityrecognizer#init) in the
component config.
```python
train_doc = nlp.make_doc("Barack Obama was born in Hawaii.")
```
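
For instance, spans that are known to be wrong can be attached to the training
doc under the configured key. A minimal sketch continuing the snippet above,
assuming `incorrect_spans_key` is set to `"incorrect_spans"` and using purely
illustrative span boundaries and labels:

```python
from spacy.tokens import Span

# Assumes `incorrect_spans_key = "incorrect_spans"` in the component config.
# "Barack Obama" (tokens 0-2) is definitely not an ORG and "Hawaii"
# (tokens 5-6) is definitely not a PERSON - the labels are illustrative only.
train_doc.spans["incorrect_spans"] = [
    Span(train_doc, 0, 2, label="ORG"),
    Span(train_doc, 5, 6, label="PERSON"),
]
```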
### Resizable text classification architectures {#resizable-textcat}
Previously, trained [`TextCategorizer`](/api/textcategorizer) architectures
could not be resized, meaning that you couldn't add new labels to an already
trained text classifier. In spaCy v3.1, the
[TextCatCNN](/api/architectures#TextCatCNN) and
[TextCatBOW](/api/architectures#TextCatBOW) architectures are now resizable,
while ensuring that the predictions for the old labels remain the same.
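
Because the architectures are resizable, new labels can now be added to an
already trained pipeline. A minimal sketch, assuming a trained pipeline saved at
`./my_textcat_model` whose `textcat` component uses one of these architectures:

```python
import spacy

# Load a previously trained pipeline (the path is illustrative)
nlp = spacy.load("./my_textcat_model")
textcat = nlp.get_pipe("textcat")

# Resizes the output layer; predictions for the existing labels are unchanged
textcat.add_label("NEW_LABEL")
```

The component can then be updated further with examples for the new label.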
### CLI command to assemble pipeline from config {#assemble}
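
The new [`spacy assemble`](/api/cli#assemble) command lets you assemble a
pipeline directly from a config file, without running any training: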
```cli
$ python -m spacy assemble config.cfg ./output
```
### Support for streaming large or infinite corpora {#streaming-corpora}
> #### config.cfg (excerpt)
>
> ```ini
> [training]
> max_epochs = -1
> ```

The training process now supports streaming large or infinite corpora
out-of-the-box, which can be controlled via the
[`[training.max_epochs]`](/api/data-formats#training) config setting. Setting it
to `-1` means that the train corpus should be streamed rather than loaded into
memory, with no shuffling within the training loop. For details on how to
implement a custom corpus loader, e.g. to stream in data from remote storage,
see the usage guide on
[custom data reading](/usage/training#custom-code-readers-batchers).
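
As a rough sketch of what such a loader can look like, a custom reader is a
registered function that returns a callable taking the `nlp` object and
yielding [`Example`](/api/example) objects. The registry name
`"stream_data.v1"` and the line-per-text file used as the data source below are
illustrative assumptions:

```python
from typing import Callable, Iterable, Iterator

import spacy
from spacy.language import Language
from spacy.training import Example

@spacy.registry.readers("stream_data.v1")
def stream_data(source: str) -> Callable[[Language], Iterable[Example]]:
    # Created once when the config is resolved, then called with the nlp
    # object on each pass over the corpus
    def generate_stream(nlp: Language) -> Iterator[Example]:
        # `source` could just as well be a database cursor or a remote bucket
        with open(source, encoding="utf8") as file_:
            for line in file_:
                doc = nlp.make_doc(line.strip())
                # Unannotated example for illustration; real data would set
                # the reference annotations here
                yield Example(doc, doc.copy())
    return generate_stream
```

In the config, the `[corpora.train]` block would then reference this reader via
`@readers = "stream_data.v1"` together with its `source` argument.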
When streaming a corpus, only the first 100 examples will be used for
[initialization](/usage/training#config-lifecycle). This is no problem if you're
training a component like the text classifier with data that specifies all
available labels in every example. If necessary, you can use the
[`init labels`](/api/cli#init-labels) command to pre-generate the labels for
your components using a representative sample so the model can be initialized
correctly before training.
### New lemmatizers for Catalan and Italian {#pos-lemmatizers}
The trained pipelines for [Catalan](/models/ca) and [Italian](/models/it) now
include lemmatizers that use the predicted part-of-speech tags as part of the
lookup lemmatization for higher lemmatization accuracy. If you're training your
own pipelines for these languages and you want to include a lemmatizer, make
sure you have the
[`spacy-lookups-data`](https://github.com/explosion/spacy-lookups-data) package
installed, which provides the relevant tables.
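
If you're assembling such a pipeline in code rather than from a config, the
POS-aware lookup lemmatization corresponds to the lemmatizer's `"pos_lookup"`
mode. A minimal sketch, assuming the
[`spacy-lookups-data`](https://github.com/explosion/spacy-lookups-data) package
is installed so the tables can be loaded:

```python
import spacy

nlp = spacy.blank("it")
# "pos_lookup" uses the predicted part-of-speech tags during lookup, so a
# tagger or morphologizer needs to run before the lemmatizer in the pipeline
nlp.add_pipe("lemmatizer", config={"mode": "pos_lookup"})
nlp.initialize()  # loads the lookup tables from spacy-lookups-data
```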
## Notes about upgrading from v3.0 {#upgrading}