mirror of https://github.com/explosion/spaCy.git (synced 2025-01-26 09:14:32 +03:00)
Update example and training docs
parent 2b60e894cb
commit 2298e129e6
@@ -23,6 +23,7 @@ both documents.
> ```python
> from spacy.tokens import Doc
> from spacy.gold import Example
>
> words = ["hello", "world", "!"]
> spaces = [True, False, False]
> predicted = Doc(nlp.vocab, words=words, spaces=spaces)
@@ -50,6 +51,7 @@ annotations provided as a dictionary.
> ```python
> from spacy.tokens import Doc
> from spacy.gold import Example
>
> predicted = Doc(vocab, words=["Apply", "some", "sunscreen"])
> token_ref = ["Apply", "some", "sun", "screen"]
> tags_ref = ["VERB", "DET", "NOUN", "NOUN"]

@@ -26,7 +26,7 @@
<path fill="none" stroke="#f33" stroke-width="2" stroke-miterlimit="10" d="M521 195l-35.2-49.3"/>
<path fill="#f33" stroke="#f33" stroke-width="2" stroke-miterlimit="10" d="M482.3 140.8l8 4.2-4.5.7-2 4z"/>
<path fill="#fff2cc" stroke="#d6b656" stroke-width="2" d="M491 195h120v67H491z"/>
<text class="svg__trainloop__text" dy="1em" transform="translate(513.5 218.5)" width="73" height="18">GoldParse</text>
<text class="svg__trainloop__text" dy="1em" transform="translate(513.5 218.5)" width="73" height="18">Example</text>
<path fill="none" stroke="#f33" stroke-width="2" stroke-miterlimit="10" d="M466 59V21h-40.8"/>
<path fill="#f33" stroke="#f33" stroke-width="2" stroke-miterlimit="10" d="M419.2 21l8-4-2 4 2 4z"/>
<path fill="#f99" stroke="#f33" stroke-width="2" stroke-miterlimit="10" d="M436 59h60l30 40-30 40h-60l-30-40z"/>

@@ -375,6 +375,18 @@ mattis pretium.

## Internal training API {#api}

<Infobox variant="warning">

spaCy gives you full control over the training loop. However, for most use
cases, it's recommended to train your models via the
[`spacy train`](/api/cli#train) command with a [`config.cfg`](#config) to keep
track of your settings and hyperparameters, instead of writing your own training
scripts from scratch.

</Infobox>

<!-- TODO: maybe add something about why the Example class is great and its benefits, and how it's passed around, holds the alignment etc -->

The [`Example`](/api/example) object contains annotated training data, also
called the **gold standard**. It's initialized with a [`Doc`](/api/doc) object
that will hold the predictions, and another `Doc` object that holds the
@@ -393,42 +405,52 @@ example = Example(predicted, reference)

Alternatively, the `reference` `Doc` with the gold-standard annotations can be
created from a dictionary with keyword arguments specifying the annotations,
like `tags` or `entities`:
like `tags` or `entities`. Using the `Example` object and its gold-standard
annotations, the model can be updated to learn a sentence of three words with
their assigned part-of-speech tags.

> #### About the tag map
>
> The tag map is part of the vocabulary and defines the annotation scheme. If
> you're training a new language model, this will let you map the tags present
> in the treebank you train on to spaCy's tag scheme:
>
> ```python
> tag_map = {"N": {"pos": "NOUN"}, "V": {"pos": "VERB"}}
> vocab = Vocab(tag_map=tag_map)
> ```

```python
words = ["I", "like", "stuff"]
tags = ["NOUN", "VERB", "NOUN"]
predicted = Doc(en_vocab, words=words)
predicted = Doc(nlp.vocab, words=words)
example = Example.from_dict(predicted, {"tags": tags})
```

Using the `Example` object and its gold-standard annotations, the model can be
updated to learn a sentence of three words with their assigned part-of-speech
tags.

<!-- TODO: is this the best place for the tag_map explanation ? -->

The [tag map](/usage/adding-languages#tag-map) is part of the vocabulary and
defines the annotation scheme. If you're training a new language model, this
will let you map the tags present in the treebank you train on to spaCy's tag
scheme:

```python
vocab = Vocab(tag_map={"N": {"pos": "NOUN"}, "V": {"pos": "VERB"}})
```
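
To make the mapping concrete, here is a minimal sketch in plain Python that applies a tag map shaped like the one above to a list of treebank tags. The helper `to_coarse_pos` and the fallback value `"X"` are assumptions made for illustration only; they are not part of spaCy's API, and this is not a description of how spaCy applies the tag map internally.

```python
# Tag map in the same shape as the example above: treebank tag -> attributes.
tag_map = {"N": {"pos": "NOUN"}, "V": {"pos": "VERB"}}


def to_coarse_pos(treebank_tags, tag_map, default="X"):
    """Look up the coarse-grained "pos" value for each treebank tag.

    Hypothetical helper: tags missing from the map fall back to `default`.
    """
    return [tag_map.get(tag, {}).get("pos", default) for tag in treebank_tags]


coarse = to_coarse_pos(["N", "V", "N"], tag_map)
print(coarse)
```

Treebank tags that have no entry in the map fall back to the default here, which makes gaps in the map easy to spot before training.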

Another example shows how to define gold-standard named entities:

```python
doc = Doc(vocab, words=["Facebook", "released", "React", "in", "2014"])
example = Example.from_dict(doc, {"entities": ["U-ORG", "O", "U-TECHNOLOGY", "O", "U-DATE"]})
```

Here's another example that shows how to define gold-standard named entities.
The letters added before the labels refer to the tags of the
[BILUO scheme](/usage/linguistic-features#updating-biluo) – `O` is a token
outside an entity, `U` a single entity unit, `B` the beginning of an entity,
`I` a token inside an entity and `L` the last token of an entity.

```python
doc = Doc(nlp.vocab, words=["Facebook", "released", "React", "in", "2014"])
example = Example.from_dict(doc, {"entities": ["U-ORG", "O", "U-TECHNOLOGY", "O", "U-DATE"]})
```
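
The BILUO prefixes described above can be derived mechanically from token-level entity spans. The following is a minimal sketch in plain Python; the function `spans_to_biluo` and its `(start, end, label)` span format (token indices, end exclusive) are assumptions for illustration and are not part of spaCy's API.

```python
def spans_to_biluo(n_tokens, spans):
    """Convert token-level entity spans to a list of BILUO tags.

    Hypothetical helper. `spans` is a list of (start, end, label) tuples
    using token indices, with `end` exclusive. Spans must not overlap.
    """
    tags = ["O"] * n_tokens  # every token starts outside any entity
    for start, end, label in spans:
        if end - start == 1:
            # Single-token entity: one "unit" tag.
            tags[start] = f"U-{label}"
        else:
            # Multi-token entity: begin, zero or more inside, then last.
            tags[start] = f"B-{label}"
            for i in range(start + 1, end - 1):
                tags[i] = f"I-{label}"
            tags[end - 1] = f"L-{label}"
    return tags


words = ["Facebook", "released", "React", "in", "2014"]
single = spans_to_biluo(
    len(words), [(0, 1, "ORG"), (2, 3, "TECHNOLOGY"), (4, 5, "DATE")]
)
print(single)  # ['U-ORG', 'O', 'U-TECHNOLOGY', 'O', 'U-DATE']

multi_words = ["I", "visited", "New", "York", "City"]
multi = spans_to_biluo(len(multi_words), [(2, 5, "GPE")])
print(multi)  # ['O', 'O', 'B-GPE', 'I-GPE', 'L-GPE']
```

The first call reproduces the tag list used in the example above, where every entity is a single token. The second shows how a three-token entity is split into `B`, `I` and `L` tags.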

<Infobox title="Migrating from v2.x" variant="warning">

As of v3.0, the [`Example`](/api/example) object replaces the `GoldParse` class.
It can be constructed in a very similar way, from a `Doc` and a dictionary of
annotations:

```diff
- gold = GoldParse(doc, entities=entities)
+ example = Example.from_dict(doc, {"entities": entities})
```

</Infobox>

> - **Training data**: The training examples.
> - **Text and label**: The current example.
> - **Doc**: A `Doc` object created from the example text.
@@ -479,9 +501,21 @@ The [`nlp.update`](/api/language#update) method takes the following arguments:
| `drop` | Dropout rate. Makes it harder for the model to just memorize the data. |
| `sgd` | An [`Optimizer`](https://thinc.ai/docs/api-optimizers) object, which updates the model's weights. If not set, spaCy will create a new one and save it for further use. |

<!-- TODO: DocBin format ? -->
<Infobox title="Migrating from v2.x" variant="warning">

Instead of writing your own training loop, you can also use the built-in
[`train`](/api/cli#train) command, which expects data in spaCy's
[JSON format](/api/data-formats#json-input). On each epoch, a model will be
saved out to the directory.
As of v3.0, the [`Example`](/api/example) object replaces the `GoldParse` class
and the "simple training style" of calling `nlp.update` with a text and a
dictionary of annotations. Updating your code to use the `Example` object should
be very straightforward: you can call
[`Example.from_dict`](/api/example#from_dict) with a [`Doc`](/api/doc) and the
dictionary of annotations:

```diff
text = "Facebook released React in 2014"
annotations = {"entities": ["U-ORG", "O", "U-TECHNOLOGY", "O", "U-DATE"]}
+ example = Example.from_dict(nlp.make_doc(text), annotations)
- nlp.update([text], [annotations])
+ nlp.update([example])
```

</Infobox>