Merge pull request #8938 from explosion/docs/prodigy-v1-11-project [ci skip]

Update Prodigy project template for v1.11
Ines Montani 2021-08-12 21:16:49 +10:00
parent c581848cbb
commit 647abe186c
2 changed files with 70 additions and 39 deletions

Binary file not shown (image, 200 KiB).


@@ -758,16 +758,6 @@ workflows, but only one can be tracked by DVC.
 ### Prodigy {#prodigy} <IntegrationLogo name="prodigy" width={100} height="auto" align="right" />
-<Infobox title="This section is still under construction" emoji="🚧" variant="warning">
-The Prodigy integration will require a nightly version of Prodigy that supports
-spaCy v3+. You can already use annotations created with Prodigy in spaCy v3 by
-exporting your data with
-[`data-to-spacy`](https://prodi.gy/docs/recipes#data-to-spacy) and running
-[`spacy convert`](/api/cli#convert) to convert it to the binary format.
-</Infobox>
 [Prodigy](https://prodi.gy) is a modern annotation tool for creating training
 data for machine learning models, developed by us. It integrates with spaCy
 out-of-the-box and provides many different
@@ -776,17 +766,23 @@ with and without a model in the loop. If Prodigy is installed in your project,
 you can start the annotation server from your `project.yml` for a tight feedback
 loop between data development and training.
-The following example command starts the Prodigy app using the
-[`ner.correct`](https://prodi.gy/docs/recipes#ner-correct) recipe and streams in
-suggestions for the given entity labels produced by a pretrained model. You can
-then correct the suggestions manually in the UI. After you save and exit the
-server, the full dataset is exported in spaCy's format and split into a training
-and evaluation set.
+<Infobox variant="warning">
+This integration requires [Prodigy v1.11](https://prodi.gy/docs/changelog#v1.11)
+or higher. If you're using an older version of Prodigy, you can still use your
+annotations in spaCy v3 by exporting your data with
+[`data-to-spacy`](https://prodi.gy/docs/recipes#data-to-spacy) and running
+[`spacy convert`](/api/cli#convert) to convert it to the binary format.
+</Infobox>
+The following example shows a workflow for merging and exporting NER annotations
+collected with Prodigy and training a spaCy pipeline:
 > #### Example usage
 >
 > ```cli
-> $ python -m spacy project run annotate
+> $ python -m spacy project run all
 > ```
 <!-- prettier-ignore -->
@@ -794,36 +790,71 @@ and evaluation set.
 ### project.yml
 vars:
   prodigy:
-    dataset: 'ner_articles'
-    labels: 'PERSON,ORG,PRODUCT'
-    model: 'en_core_web_md'
+    train_dataset: "fashion_brands_training"
+    eval_dataset: "fashion_brands_eval"
+workflows:
+  all:
+    - data-to-spacy
+    - train_spacy
 commands:
-  - name: annotate
-  - script:
-      - 'python -m prodigy ner.correct ${vars.prodigy.dataset} ${vars.prodigy.model} ./assets/raw_data.jsonl --labels ${vars.prodigy.labels}'
-      - 'python -m prodigy data-to-spacy ./corpus/train.json ./corpus/eval.json --ner ${vars.prodigy.dataset}'
-      - 'python -m spacy convert ./corpus/train.json ./corpus/train.spacy'
-      - 'python -m spacy convert ./corpus/eval.json ./corpus/eval.spacy'
-  - deps:
-      - 'assets/raw_data.jsonl'
-  - outputs:
-      - 'corpus/train.spacy'
-      - 'corpus/eval.spacy'
+  - name: "data-to-spacy"
+    help: "Merge your annotations and create data in spaCy's binary format"
+    script:
+      - "python -m prodigy data-to-spacy corpus/ --ner ${vars.prodigy.train_dataset},eval:${vars.prodigy.eval_dataset}"
+    outputs:
+      - "corpus/train.spacy"
+      - "corpus/dev.spacy"
+  - name: "train_spacy"
+    help: "Train a named entity recognition model with spaCy"
+    script:
+      - "python -m spacy train configs/config.cfg --output training/ --paths.train corpus/train.spacy --paths.dev corpus/dev.spacy"
+    deps:
+      - "corpus/train.spacy"
+      - "corpus/dev.spacy"
+    outputs:
+      - "training/model-best"
 ```
-You can use the same approach for other types of projects and annotation
+> #### Example train curve output
+>
+> [![Screenshot of train curve terminal output](../images/prodigy_train_curve.jpg)](https://prodi.gy/docs/recipes#train-curve)
+The [`train-curve`](https://prodi.gy/docs/recipes#train-curve) recipe is another
+cool workflow you can include in your project. It will run the training with
+different portions of the data, e.g. 25%, 50%, 75% and 100%. As a rule of thumb,
+if accuracy increases in the last segment, this could indicate that collecting
+more annotations of the same type might improve the model further.
+<!-- prettier-ignore -->
+```yaml
+### project.yml (excerpt)
+  - name: "train_curve"
+    help: "Train the model with Prodigy by using different portions of training examples to evaluate if more annotations can potentially improve the performance"
+    script:
+      - "python -m prodigy train-curve --ner ${vars.prodigy.train_dataset},eval:${vars.prodigy.eval_dataset} --config configs/${vars.config} --show-plot"
+```
+You can use the same approach for various types of projects and annotation
 workflows, including
-[text classification](https://prodi.gy/docs/recipes#textcat),
-[dependency parsing](https://prodi.gy/docs/recipes#dep),
+[named entity recognition](https://prodi.gy/docs/named-entity-recognition),
+[span categorization](https://prodi.gy/docs/span-categorization),
+[text classification](https://prodi.gy/docs/text-classification),
+[dependency parsing](https://prodi.gy/docs/dependencies-relations),
 [part-of-speech tagging](https://prodi.gy/docs/recipes#pos) or fully
-[custom recipes](https://prodi.gy/docs/custom-recipes) for instance, an A/B
-evaluation workflow that lets you compare two different models and their
-results.
+[custom recipes](https://prodi.gy/docs/custom-recipes). You can also use spaCy
+project templates to quickly start the annotation server to collect more
+annotations and add them to your Prodigy dataset.
-<!-- TODO: <Project id="integrations/prodigy">
-</Project> -->
+<Project id="integrations/prodigy">
+Get started with spaCy and Prodigy using our project template. It includes
+commands to create a merged training corpus from your Prodigy annotations,
+training and packaging a spaCy pipeline and analyzing if more annotations may
+improve performance.
+</Project>
 ---
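
Reviewer note: the rule of thumb in the added `train-curve` paragraph (accuracy still rising in the last segment suggests more annotations may help) can be sketched roughly as below. This is only an illustration and not part of the commit or of Prodigy's API: the function name `last_segment_improved`, the `min_gain` threshold and the score lists are hypothetical, and the `train-curve` recipe reports its own results in the terminal.

```python
from typing import Sequence


def last_segment_improved(scores: Sequence[float], min_gain: float = 0.0) -> bool:
    """Return True if the score rose by more than min_gain in the last segment.

    `scores` holds one accuracy value per portion of the training data,
    ordered from the smallest portion to the full dataset.
    """
    if len(scores) < 2:
        raise ValueError("need scores for at least two portions of the data")
    # Compare the full-data score with the score for the previous portion.
    return (scores[-1] - scores[-2]) > min_gain


# Hypothetical accuracy values for 25%, 50%, 75% and 100% of the data.
still_improving = [0.61, 0.70, 0.76, 0.80]
plateaued = [0.61, 0.74, 0.79, 0.79]

print(last_segment_improved(still_improving))
print(last_segment_improved(plateaued))
```

A non-zero `min_gain` is one way to avoid reading small fluctuations between runs as real improvement; what threshold is sensible depends on the dataset and is left as an assumption here.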