mirror of
https://github.com/explosion/spaCy.git
synced 2025-01-14 03:26:24 +03:00
Update details [ci skip]
parent
e9b68d4f4c
commit
ca0d904faa
BIN
website/docs/images/prodigy_spans-manual.jpg
Normal file
Binary file not shown.
After Width: | Height: | Size: 304 KiB |
@@ -12,7 +12,30 @@ menu:
 
 ### Using predicted annotations during training {#predicted-annotations-training}
 
-<!-- TODO: write -->
+By default, components are updated in isolation during training, which means
+that they don't see the predictions of any earlier components in the pipeline.
+The new
+[`[training.annotating_components]`](/usage/training#annotating-components)
+config setting lets you specify pipeline component names that should set
+annotations on the predicted docs during training. This makes it easy to use the
+predictions of a previous component in the pipeline as features for a subsequent
+component, e.g. the dependency labels in the tagger:
+
+```ini
+### config.cfg (excerpt) {highlight="7,12"}
+[nlp]
+pipeline = ["parser", "tagger"]
+
+[components.tagger.model.tok2vec.embed]
+@architectures = "spacy.MultiHashEmbed.v1"
+width = ${components.tagger.model.tok2vec.encode.width}
+attrs = ["NORM","DEP"]
+rows = [5000,2500]
+include_static_vectors = false
+
+[training]
+annotating_components = ["parser"]
+```
 
 <Project id="pipelines/tagger_parser_predicted_annotations">
 
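As a quick sanity check on the `config.cfg` excerpt in the hunk above, the following stdlib-only sketch confirms that every name in `annotating_components` is also listed in `[nlp] pipeline` and runs before the component that is meant to consume its predictions. It is an illustration only, not spaCy's own validation: spaCy reads configs with its own config system, and the helper name `check_annotating_components` and its `consumer` argument are assumptions made for this example.

```python
import configparser
import json

# Trimmed copy of the excerpt from the diff; only the two relevant blocks.
CONFIG_EXCERPT = """
[nlp]
pipeline = ["parser", "tagger"]

[training]
annotating_components = ["parser"]
"""


def check_annotating_components(config_text, consumer):
    """Return (name, reason) pairs for annotating components that are missing
    from the pipeline or run after `consumer`.

    Hypothetical helper for illustration; this is not a spaCy API.
    """
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(config_text)
    # The values in the excerpt are JSON-style lists, so json.loads can read them.
    pipeline = json.loads(parser["nlp"]["pipeline"])
    annotating = json.loads(parser["training"]["annotating_components"])
    consumer_index = pipeline.index(consumer)
    problems = []
    for name in annotating:
        if name not in pipeline:
            problems.append((name, "not in pipeline"))
        elif pipeline.index(name) > consumer_index:
            problems.append((name, f"runs after {consumer}"))
    return problems


problems = check_annotating_components(CONFIG_EXCERPT, consumer="tagger")
print(problems or "annotating components look consistent")
```

The ordering check reflects the point made in the added prose: the tagger can only use `DEP` as a feature if the parser has already set its annotations on the predicted doc.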
@@ -41,7 +64,7 @@ available via the [`Doc.spans`](/api/doc#spans) container.
 
 <Infobox title="Tip: Create data with Prodigy's new span annotation UI">
 
-<!-- TODO: screenshot -->
+[![Prodigy: example of the new manual spans UI](../images/prodigy_spans-manual.jpg)](https://support.prodi.gy/t/3861)
 
 The upcoming version of our annotation tool [Prodigy](https://prodi.gy)
 (currently available as a [pre-release](https://support.prodi.gy/t/3861) for all
@@ -66,11 +89,11 @@ for spaCy's `SpanCategorizer` component.
 The [`EntityRecognizer`](/api/entityrecognizer) can now be updated with known
 incorrect annotations, which lets you take advantage of partial and sparse data.
 For example, you'll be able to use the information that certain spans of text
-are definitely **not** `PERSON` entities, without having to provide the
-complete-gold standard annotations for the given example. The incorrect span
-annotations can be added via the [`Doc.spans`](/api/doc#spans) in the training
-data under the key defined as
-[`incorrect_spans_key`](/api/entityrecognizer#init) in the component config.
+are definitely **not** `PERSON` entities, without having to provide the complete
+gold-standard annotations for the given example. The incorrect span annotations
+can be added via the [`Doc.spans`](/api/doc#spans) in the training data under
+the key defined as [`incorrect_spans_key`](/api/entityrecognizer#init) in the
+component config.
 
 ```python
 train_doc = nlp.make_doc("Barack Obama was born in Hawaii.")
@@ -104,7 +127,12 @@ your own.
 
 ### Resizable text classification architectures {#resizable-textcat}
 
-<!-- TODO: write -->
+Previously, trained [`TextCategorizer`](/api/textcategorizer) architectures
+could not be resized, meaning that you couldn't add new labels to an already
+trained text classifier. In spaCy v3.1, the
+[TextCatCNN](/api/architectures#TextCatCNN) and
+[TextCatBOW](/api/architectures#TextCatBOW) architectures are now resizable,
+while ensuring that the predictions for the old labels remain the same.
 
 ### CLI command to assemble pipeline from config {#assemble}
 
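To make the "predictions for the old labels remain the same" property in the hunk above concrete, here is a conceptual sketch in plain Python. It is not how `TextCatCNN` or `TextCatBOW` are implemented; it only assumes a single linear output layer producing one raw score per label, and the function names `linear_scores` and `resize_output` are made up for this example. New labels get zero-initialized rows, so the existing rows, and therefore the old labels' raw scores, are untouched.

```python
def linear_scores(weights, biases, features):
    """Compute one raw score per label: dot(row, features) + bias."""
    return [
        sum(w * x for w, x in zip(row, features)) + b
        for row, b in zip(weights, biases)
    ]


def resize_output(weights, biases, n_new):
    """Return copies of the parameters with `n_new` zero-initialized labels
    appended. Existing rows are copied unchanged (illustrative helper)."""
    n_inputs = len(weights[0])
    new_weights = [list(row) for row in weights]
    new_weights += [[0.0] * n_inputs for _ in range(n_new)]
    new_biases = list(biases) + [0.0] * n_new
    return new_weights, new_biases


# Toy "trained" layer with two labels and two input features.
weights = [[0.5, -1.0], [2.0, 0.25]]
biases = [0.1, -0.3]
features = [1.0, 2.0]

old_scores = linear_scores(weights, biases, features)
big_weights, big_biases = resize_output(weights, biases, n_new=1)
new_scores = linear_scores(big_weights, big_biases, features)

# The first two scores are identical; the third belongs to the new label.
print(old_scores, new_scores)
```

Note that this holds for raw per-label scores; if the scores were passed through a softmax over all labels, adding a label would change the normalized probabilities.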
@@ -119,11 +147,39 @@ $ python -m spacy assemble config.cfg ./output
 
 ### Support for streaming large or infinite corpora {#streaming-corpora}
 
-<!-- TODO: write -->
+> #### config.cfg (excerpt)
+>
+> ```ini
+> [training]
+> max_epochs = -1
+> ```
+
+The training process now supports streaming large or infinite corpora
+out of the box, which can be controlled via the
+[`[training.max_epochs]`](/api/data-formats#training) config setting. Setting it
+to `-1` means that the train corpus should be streamed rather than loaded into
+memory, with no shuffling within the training loop. For details on how to
+implement a custom corpus loader, e.g. to stream in data from remote storage,
+see the usage guide on
+[custom data reading](/usage/training#custom-code-readers-batchers).
+
+When streaming a corpus, only the first 100 examples will be used for
+[initialization](/usage/training#config-lifecycle). This is no problem if you're
+training a component like the text classifier with data that specifies all
+available labels in every example. If necessary, you can use the
+[`init labels`](/api/cli#init-labels) command to pre-generate the labels for
+your components using a representative sample so the model can be initialized
+correctly before training.
 
 ### New lemmatizers for Catalan and Italian {#pos-lemmatizers}
 
-<!-- TODO: write -->
+The trained pipelines for [Catalan](/models/ca) and [Italian](/models/it) now
+include lemmatizers that use the predicted part-of-speech tags as part of the
+lookup lemmatization for higher lemmatization accuracy. If you're training your
+own pipelines for these languages and you want to include a lemmatizer, make
+sure you have the
+[`spacy-lookups-data`](https://github.com/explosion/spacy-lookups-data) package
+installed, which provides the relevant tables.
 
 ## Notes about upgrading from v3.0 {#upgrading}
 
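The streaming behaviour described in the hunk above can be pictured with a small generator-based sketch. This is plain Python rather than spaCy's corpus reader API: `endless_source` is a hypothetical stand-in for a remote or otherwise unbounded data source, and `stream_examples` is a name invented for this example. The point is that examples are produced one at a time in source order (no shuffling, nothing held in memory), and that only a fixed-size prefix is available for initialization.

```python
from itertools import islice


def endless_source():
    """Hypothetical unbounded data source, e.g. records read from remote storage."""
    n = 0
    while True:
        yield f"text {n}\n"
        n += 1


def stream_examples(source):
    """Yield cleaned examples lazily instead of building a list in memory."""
    for raw in source:
        yield raw.strip()


# Only a prefix of the stream is inspected for initialization; the diff text
# above gives 100 as the number of examples used.
init_sample = list(islice(stream_examples(endless_source()), 100))

# Training then consumes the stream in order, batch by batch, without shuffling.
training_stream = stream_examples(endless_source())
first_batch = list(islice(training_stream, 4))
second_batch = list(islice(training_stream, 4))

print(len(init_sample), first_batch, second_batch)
```

If a label never occurs within that initial prefix, a component would not learn about it at initialization time, which is the situation the `init labels` command mentioned above is meant to address.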