Update training docs [ci skip]
parent b544dcb3c5, commit a31e9e1cd5
@@ -6,6 +6,7 @@ menu:
  - ['NER', 'ner']
  - ['Tagger & Parser', 'tagger-parser']
  - ['Text Classification', 'textcat']
  - ['Entity Linking', 'entity-linker']
  - ['Tips and Advice', 'tips']
---
@@ -415,76 +416,6 @@ referred to as the "catastrophic forgetting" problem.
4. **Save** the trained model using [`nlp.to_disk`](/api/language#to_disk).
5. **Test** the model to make sure the new entity is recognized correctly.

## Training the tagger and parser {#tagger-parser}

### Updating the Dependency Parser {#example-train-parser}

@@ -665,6 +596,76 @@ https://github.com/explosion/spaCy/tree/master/examples/training/train_textcat.py
7. **Save** the trained model using [`nlp.to_disk`](/api/language#to_disk).
8. **Test** the model to make sure the text classifier works as expected.
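
For steps 7 and 8, a minimal save-and-reload check could look like the sketch
below. The output path and example text are placeholders, and `nlp` is assumed
to be the pipeline you just trained:

```python
import spacy

# Save the trained pipeline, then load it back as a fresh object
nlp.to_disk("/path/to/textcat_model")
nlp2 = spacy.load("/path/to/textcat_model")

# The text classifier writes its predictions to doc.cats
doc = nlp2("This movie was a complete waste of time.")
print(doc.cats)  # e.g. {"POSITIVE": 0.01, "NEGATIVE": 0.99}
```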
## Entity linking {#entity-linker}

To train an entity linking model, you first need to define a knowledge base
(KB).

### Creating a knowledge base {#kb}

A KB consists of a list of entities with unique identifiers. Each entity has an
entity vector that is used to measure the similarity between the entity and the
context in which it is mentioned. These vectors are pretrained and stored in
the KB before the entity linking model is trained.
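
As a minimal sketch of this data model, the snippet below creates a small KB by
hand. The entity IDs, frequencies, vectors and prior probabilities are made-up
placeholder values; see the [`kb`](/api/kb) API docs for details:

```python
import spacy
from spacy.kb import KnowledgeBase

nlp = spacy.load("en_core_web_md")  # any model whose vocab you want to reuse

# Every entity gets a unique ID, a frequency and a fixed-length vector
kb = KnowledgeBase(vocab=nlp.vocab, entity_vector_length=3)
kb.add_entity(entity="Q42", freq=342, entity_vector=[1.0, 9.0, -3.0])
kb.add_entity(entity="Q1004791", freq=6, entity_vector=[0.0, 3.0, 5.0])

# An alias maps a mention string to candidate entities, with the prior
# probability that the mention refers to each of them
kb.add_alias(alias="Douglas", entities=["Q42", "Q1004791"],
             probabilities=[0.6, 0.1])
```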

The following example shows how to build a knowledge base from scratch, given a
list of entities and potential aliases. The script further demonstrates how to
pretrain and store the entity vectors. To run this example, the script needs
access to a `vocab` instance or an `nlp` model with pretrained word embeddings.

```python
https://github.com/explosion/spaCy/tree/master/examples/training/pretrain_kb.py
```
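
If you just want to experiment before running the full encoder-decoder
pretraining, one crude stand-in (an assumption on our part, not what the script
does) is to use the averaged word vectors of each entity description as its
entity vector. The description text is a placeholder:

```python
import spacy

nlp = spacy.load("en_core_web_md")  # provides pretrained word vectors

# Crude stand-in for the encoder-decoder: average the word vectors of the
# description. The real script instead trains a network that compresses the
# description into a smaller, dedicated entity vector.
description = "Douglas Adams, British author of The Hitchhiker's Guide"
entity_vector = nlp(description).vector
print(entity_vector.shape)  # (300,) for en_core_web_md
```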

#### Step by step guide {#step-by-step-kb}

1. **Load the model** you want to start with, or create an **empty model** using
   [`spacy.blank`](/api/top-level#spacy.blank) with the ID of your language and
   a pre-defined [`vocab`](/api/vocab) object.
2. **Pretrain the entity embeddings** by running the descriptions of the
   entities through a simple encoder-decoder network. The current implementation
   requires the `nlp` model to have access to pretrained word embeddings, but a
   custom implementation of this encoding step can also be used.
3. **Construct the KB** by defining all entities with their pretrained vectors,
   and all aliases with their prior probabilities.
4. **Save** the KB using [`kb.dump`](/api/kb#dump) (see the sketch after this
   list).
5. **Test** the KB to make sure the entities were added correctly.
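
To make steps 4 and 5 concrete, a dump-and-reload round trip might look like
this sketch. `my_kb` and `my_vocab` are placeholder paths, and the entities are
the toy examples from above:

```python
import spacy
from spacy.kb import KnowledgeBase

nlp = spacy.load("en_core_web_md")
kb = KnowledgeBase(vocab=nlp.vocab, entity_vector_length=3)
kb.add_entity(entity="Q42", freq=342, entity_vector=[1.0, 9.0, -3.0])
kb.add_alias(alias="Douglas", entities=["Q42"], probabilities=[0.9])

# Step 4: save the KB, and keep the vocab it was created with
kb.dump("my_kb")
nlp.vocab.to_disk("my_vocab")

# Step 5: reload and check that aliases resolve to the expected candidates
kb2 = KnowledgeBase(vocab=nlp.vocab, entity_vector_length=3)
kb2.load_bulk("my_kb")
print([c.entity_ for c in kb2.get_candidates("Douglas")])  # ["Q42"]
```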

### Training an entity linking model {#entity-linker-model}

This example shows how to create an entity linker pipe using a previously
created knowledge base. The entity linker pipe is then trained with your own
examples. To do so, you'll need to provide **example texts**, and the
**character offsets** and **knowledge base identifiers** of each entity
contained in the texts.

```python
https://github.com/explosion/spaCy/tree/master/examples/training/train_entity_linker.py
```
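
The example texts with character offsets and KB identifiers described above are
passed in via a `"links"` annotation, roughly as sketched here. The texts and
IDs are made-up placeholders:

```python
# Character offsets of each mention, mapped to candidate KB identifiers and
# the gold-standard probability that each identifier is the correct one
TRAIN_DATA = [
    ("Douglas flew to London.",
     {"links": {(0, 7): {"Q42": 0.9, "Q1004791": 0.1}}}),
    ("Douglas Adams wrote novels.",
     {"links": {(0, 13): {"Q42": 1.0}}}),
]
```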

#### Step by step guide {#step-by-step-entity-linker}

1. **Load the KB** you want to start with, and specify the path to the `Vocab`
   object that was used to create this KB. Then, create an **empty model** using
   [`spacy.blank`](/api/top-level#spacy.blank) with the ID of your language.
   Don't forget to add the KB to the entity linker, and to add the entity linker
   to the pipeline (see the sketch after this list). In practical applications,
   you will want a more advanced pipeline that also includes a component for
   [named entity recognition](/usage/training#ner). If you're using a model with
   additional components, make sure to disable all other pipeline components
   during training using [`nlp.disable_pipes`](/api/language#disable_pipes).
   This way, you'll only be training the entity linker.
2. **Shuffle and loop over** the examples. For each example, **update the
   model** by calling [`nlp.update`](/api/language#update), which steps through
   the annotated examples of the input. For each combination of a mention in the
   text and a potential KB identifier, the model **predicts** whether or not
   this is the correct match. It then consults the annotations to see whether it
   was right. If it was wrong, it adjusts its weights so that the correct
   combination will score higher next time.
3. **Save** the trained model using [`nlp.to_disk`](/api/language#to_disk).
4. **Test** the model to make sure the entities in the training data are
   recognized correctly.
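
Putting the steps together, a bare-bones version of this loop (referenced from
steps 1 and 2 above) might look as follows. The paths, iteration count and
dropout are placeholders, and note that the full example script also makes sure
each training doc has its entities and sentence boundaries set, which this
sketch glosses over:

```python
import random
import spacy
from spacy.kb import KnowledgeBase

TRAIN_DATA = [  # "links"-formatted examples, as sketched above
    ("Douglas Adams wrote novels.", {"links": {(0, 13): {"Q42": 1.0}}}),
]

# Step 1: blank model plus an entity linker wired up to an existing KB
nlp = spacy.blank("en")
nlp.vocab.from_disk("my_vocab")  # the vocab used to create the KB
kb = KnowledgeBase(vocab=nlp.vocab, entity_vector_length=3)
kb.load_bulk("my_kb")
entity_linker = nlp.create_pipe("entity_linker")
entity_linker.set_kb(kb)
nlp.add_pipe(entity_linker, last=True)

# Step 2: shuffle and loop over the examples, updating the model
other_pipes = [pipe for pipe in nlp.pipe_names if pipe != "entity_linker"]
with nlp.disable_pipes(*other_pipes):  # only train the entity linker
    optimizer = nlp.begin_training()
    for itn in range(50):
        random.shuffle(TRAIN_DATA)
        losses = {}
        for text, annotations in TRAIN_DATA:
            nlp.update([text], [annotations], sgd=optimizer, drop=0.2,
                       losses=losses)
        print(itn, losses)

# Step 3: save the trained model; reload it to test (step 4)
nlp.to_disk("/path/to/el_model")
```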

## Optimization tips and advice {#tips}

There are lots of conflicting "recipes" for training deep neural networks at the