mirror of
https://github.com/explosion/spaCy.git
synced 2024-12-25 17:36:30 +03:00
Merge pull request #8938 from explosion/docs/prodigy-v1-11-project [ci skip]
Update Prodigy project template for v1.11
commit
6260f044cc
BIN
website/docs/images/prodigy_train_curve.jpg
Normal file
Binary file not shown.
Size: 200 KiB
@@ -758,16 +758,6 @@ workflows, but only one can be tracked by DVC.
 
 ### Prodigy {#prodigy} <IntegrationLogo name="prodigy" width={100} height="auto" align="right" />
 
-<Infobox title="This section is still under construction" emoji="🚧" variant="warning">
-
-The Prodigy integration will require a nightly version of Prodigy that supports
-spaCy v3+. You can already use annotations created with Prodigy in spaCy v3 by
-exporting your data with
-[`data-to-spacy`](https://prodi.gy/docs/recipes#data-to-spacy) and running
-[`spacy convert`](/api/cli#convert) to convert it to the binary format.
-
-</Infobox>
-
 [Prodigy](https://prodi.gy) is a modern annotation tool for creating training
 data for machine learning models, developed by us. It integrates with spaCy
 out-of-the-box and provides many different
@@ -776,17 +766,23 @@ with and without a model in the loop. If Prodigy is installed in your project,
 you can start the annotation server from your `project.yml` for a tight feedback
 loop between data development and training.
 
-The following example command starts the Prodigy app using the
-[`ner.correct`](https://prodi.gy/docs/recipes#ner-correct) recipe and streams in
-suggestions for the given entity labels produced by a pretrained model. You can
-then correct the suggestions manually in the UI. After you save and exit the
-server, the full dataset is exported in spaCy's format and split into a training
-and evaluation set.
+<Infobox variant="warning">
+
+This integration requires [Prodigy v1.11](https://prodi.gy/docs/changelog#v1.11)
+or higher. If you're using an older version of Prodigy, you can still use your
+annotations in spaCy v3 by exporting your data with
+[`data-to-spacy`](https://prodi.gy/docs/recipes#data-to-spacy) and running
+[`spacy convert`](/api/cli#convert) to convert it to the binary format.
+
+</Infobox>
+
+The following example shows a workflow for merging and exporting NER annotations
+collected with Prodigy and training a spaCy pipeline:
 
 > #### Example usage
 >
 > ```cli
-> $ python -m spacy project run annotate
+> $ python -m spacy project run all
 > ```
 
 <!-- prettier-ignore -->
@@ -794,36 +790,71 @@ and evaluation set.
 ### project.yml
 vars:
   prodigy:
-    dataset: 'ner_articles'
-    labels: 'PERSON,ORG,PRODUCT'
-    model: 'en_core_web_md'
+    train_dataset: "fashion_brands_training"
+    eval_dataset: "fashion_brands_eval"
+
+workflows:
+  all:
+    - data-to-spacy
+    - train_spacy
 
 commands:
-  - name: annotate
-  - script:
-      - 'python -m prodigy ner.correct ${vars.prodigy.dataset} ${vars.prodigy.model} ./assets/raw_data.jsonl --labels ${vars.prodigy.labels}'
-      - 'python -m prodigy data-to-spacy ./corpus/train.json ./corpus/eval.json --ner ${vars.prodigy.dataset}'
-      - 'python -m spacy convert ./corpus/train.json ./corpus/train.spacy'
-      - 'python -m spacy convert ./corpus/eval.json ./corpus/eval.spacy'
-  - deps:
-      - 'assets/raw_data.jsonl'
-  - outputs:
-      - 'corpus/train.spacy'
-      - 'corpus/eval.spacy'
+  - name: "data-to-spacy"
+    help: "Merge your annotations and create data in spaCy's binary format"
+    script:
+      - "python -m prodigy data-to-spacy corpus/ --ner ${vars.prodigy.train_dataset},eval:${vars.prodigy.eval_dataset}"
+    outputs:
+      - "corpus/train.spacy"
+      - "corpus/dev.spacy"
+  - name: "train_spacy"
+    help: "Train a named entity recognition model with spaCy"
+    script:
+      - "python -m spacy train configs/config.cfg --output training/ --paths.train corpus/train.spacy --paths.dev corpus/dev.spacy"
+    deps:
+      - "corpus/train.spacy"
+      - "corpus/dev.spacy"
+    outputs:
+      - "training/model-best"
 ```
 
-You can use the same approach for other types of projects and annotation
+> #### Example train curve output
+>
+> [![Screenshot of train curve terminal output](../images/prodigy_train_curve.jpg)](https://prodi.gy/docs/recipes#train-curve)
+
+The [`train-curve`](https://prodi.gy/docs/recipes#train-curve) recipe is another
+cool workflow you can include in your project. It will run the training with
+different portions of the data, e.g. 25%, 50%, 75% and 100%. As a rule of thumb,
+if accuracy increases in the last segment, this could indicate that collecting
+more annotations of the same type might improve the model further.
+
+<!-- prettier-ignore -->
+```yaml
+### project.yml (excerpt)
+  - name: "train_curve"
+    help: "Train the model with Prodigy by using different portions of training examples to evaluate if more annotations can potentially improve the performance"
+    script:
+      - "python -m prodigy train-curve --ner ${vars.prodigy.train_dataset},eval:${vars.prodigy.eval_dataset} --config configs/${vars.config} --show-plot"
+```
+
+You can use the same approach for various types of projects and annotation
 workflows, including
-[text classification](https://prodi.gy/docs/recipes#textcat),
-[dependency parsing](https://prodi.gy/docs/recipes#dep),
+[named entity recognition](https://prodi.gy/docs/named-entity-recognition),
+[span categorization](https://prodi.gy/docs/span-categorization),
+[text classification](https://prodi.gy/docs/text-classification),
+[dependency parsing](https://prodi.gy/docs/dependencies-relations),
 [part-of-speech tagging](https://prodi.gy/docs/recipes#pos) or fully
-[custom recipes](https://prodi.gy/docs/custom-recipes) – for instance, an A/B
-evaluation workflow that lets you compare two different models and their
-results.
+[custom recipes](https://prodi.gy/docs/custom-recipes). You can also use spaCy
+project templates to quickly start the annotation server to collect more
+annotations and add them to your Prodigy dataset.
 
-<!-- TODO: <Project id="integrations/prodigy">
+<Project id="integrations/prodigy">
 
-</Project> -->
+Get started with spaCy and Prodigy using our project template. It includes
+commands to create a merged training corpus from your Prodigy annotations,
+training and packaging a spaCy pipeline and analyzing if more annotations may
+improve performance.
+
+</Project>
 
 ---
 
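
The `train-curve` rule of thumb added in this change (if accuracy still increases in the last segment, more annotations might help) can be illustrated with a small check. This is a minimal sketch in Python and is not part of the commit or of Prodigy: the function name `last_segment_improved`, the `min_gain` threshold and the example scores are all hypothetical, and the scores are assumed to be listed in order of increasing data portion (e.g. 25%, 50%, 75%, 100%).

```python
from typing import Sequence


def last_segment_improved(scores: Sequence[float], min_gain: float = 0.0) -> bool:
    """Return True if the score rose by more than `min_gain` in the final segment.

    `scores` holds one evaluation score per data portion, in increasing order
    of portion size. This helper is illustrative only; it is not Prodigy's API.
    """
    if len(scores) < 2:
        raise ValueError("need at least two scores to compare segments")
    # Compare the score on the full data with the score on the previous portion.
    return (scores[-1] - scores[-2]) > min_gain


# Hypothetical scores for 25%, 50%, 75% and 100% of the training data.
example_scores = [0.52, 0.61, 0.66, 0.70]

if last_segment_improved(example_scores):
    print("Accuracy still rising: more annotations of the same type might help.")
else:
    print("Accuracy has flattened: more of the same data may not help much.")
```

Setting `min_gain` above zero (an assumed knob, not something the recipe exposes) would ignore small fluctuations between the last two portions.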