mirror of
https://github.com/explosion/spaCy.git
synced 2024-12-26 18:06:29 +03:00
Merge branch 'develop' of https://github.com/explosion/spaCy into develop
This commit is contained in:
commit
8685229891
|
@ -11,9 +11,17 @@ menu:
|
||||||
- ['Entity Linking', 'entitylinker']
|
- ['Entity Linking', 'entitylinker']
|
||||||
---
|
---
|
||||||
|
|
||||||
TODO: intro and how architectures work, link to
|
A **model architecture** is a function that wires up a
|
||||||
[`registry`](/api/top-level#registry),
|
[`Model`](https://thinc.ai/docs/api-model) instance, which you can then use in a
|
||||||
[custom functions](/usage/training#custom-functions) usage etc.
|
pipeline component or as a layer of a larger network. This page documents
|
||||||
|
spaCy's built-in architectures that are used for different NLP tasks. All
|
||||||
|
trainable [built-in components](/api#architecture-pipeline) expect a `model`
|
||||||
|
argument defined in the config and document their default architecture.
|
||||||
|
Custom architectures can be registered using the
|
||||||
|
[`@spacy.registry.architectures`](/api/top-level#registry) decorator and used as
|
||||||
|
part of the [training config](/usage/training#custom-functions). Also see the
|
||||||
|
usage documentation on
|
||||||
|
[layers and model architectures](/usage/layers-architectures).
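
As a minimal sketch of the registration step described above, the snippet below registers a small feed-forward network and looks it up again by name. The registry name `my_example.FeedForward.v1`, the function `build_feed_forward` and its `width` and `n_out` parameters are made up for illustration and are not part of spaCy's built-in architectures.

```python
import spacy
from thinc.api import Model, Linear, Relu, chain


# Hypothetical architecture name, chosen only for this example.
@spacy.registry.architectures("my_example.FeedForward.v1")
def build_feed_forward(width: int, n_out: int) -> Model:
    # Wire up a two-layer network: a ReLU hidden layer followed by a
    # linear output layer. Input sizes are left unset so Thinc can infer
    # them when the model is initialized with sample data.
    return chain(Relu(nO=width), Linear(nO=n_out))


# The registered function can be looked up by the same string name.
make_model = spacy.registry.architectures.get("my_example.FeedForward.v1")
model = make_model(width=64, n_out=3)
```

In a training config, a component's `model` block would refer to the function through the same string name, with `width` and `n_out` supplied as settings in that block.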
|
||||||
|
|
||||||
## Tok2Vec architectures {#tok2vec-arch source="spacy/ml/models/tok2vec.py"}
|
## Tok2Vec architectures {#tok2vec-arch source="spacy/ml/models/tok2vec.py"}
|
||||||
|
|
||||||
|
@ -284,8 +292,18 @@ on [static vectors](/usage/embeddings-transformers#static-vectors) for details.
|
||||||
|
|
||||||
The following architectures are provided by the package
|
The following architectures are provided by the package
|
||||||
[`spacy-transformers`](https://github.com/explosion/spacy-transformers). See the
|
[`spacy-transformers`](https://github.com/explosion/spacy-transformers). See the
|
||||||
[usage documentation](/usage/embeddings-transformers) for how to integrate the
|
[usage documentation](/usage/embeddings-transformers#transformers) for how to
|
||||||
architectures into your training config.
|
integrate the architectures into your training config.
|
||||||
|
|
||||||
|
<Infobox variant="warning">
|
||||||
|
|
||||||
|
Note that in order to use these architectures in your config, you need to
|
||||||
|
install the
|
||||||
|
[`spacy-transformers`](https://github.com/explosion/spacy-transformers) package. See the
|
||||||
|
[installation docs](/usage/embeddings-transformers#transformers-installation)
|
||||||
|
for details and system requirements.
|
||||||
|
|
||||||
|
</Infobox>
|
||||||
|
|
||||||
### spacy-transformers.TransformerModel.v1 {#TransformerModel}
|
### spacy-transformers.TransformerModel.v1 {#TransformerModel}
|
||||||
|
|
||||||
|
|
|
@ -9,7 +9,7 @@ menu:
|
||||||
next: /usage/projects
|
next: /usage/projects
|
||||||
---
|
---
|
||||||
|
|
||||||
A **model architecture** is a function that wires up a
|
A **model architecture** is a function that wires up a
|
||||||
[Thinc `Model`](https://thinc.ai/docs/api-model) instance, which you can then
|
[Thinc `Model`](https://thinc.ai/docs/api-model) instance, which you can then
|
||||||
use in a component or as a layer of a larger network. You can use Thinc as a
|
use in a component or as a layer of a larger network. You can use Thinc as a
|
||||||
thin wrapper around frameworks such as PyTorch, TensorFlow or MXNet, or you can
|
thin wrapper around frameworks such as PyTorch, TensorFlow or MXNet, or you can
|
||||||
|
|
|
@ -6,8 +6,7 @@ menu:
|
||||||
- ['Quickstart', 'quickstart']
|
- ['Quickstart', 'quickstart']
|
||||||
- ['Config System', 'config']
|
- ['Config System', 'config']
|
||||||
- ['Custom Functions', 'custom-functions']
|
- ['Custom Functions', 'custom-functions']
|
||||||
- ['Transfer Learning', 'transfer-learning']
|
# - ['Parallel Training', 'parallel-training']
|
||||||
- ['Parallel Training', 'parallel-training']
|
|
||||||
- ['Internal API', 'api']
|
- ['Internal API', 'api']
|
||||||
---
|
---
|
||||||
|
|
||||||
|
@ -92,16 +91,6 @@ spaCy's binary `.spacy` format. You can either include the data paths in the
|
||||||
$ python -m spacy train config.cfg --output ./output --paths.train ./train.spacy --paths.dev ./dev.spacy
|
$ python -m spacy train config.cfg --output ./output --paths.train ./train.spacy --paths.dev ./dev.spacy
|
||||||
```
|
```
|
||||||
|
|
||||||
<!-- TODO:
|
|
||||||
<Project id="some_example_project">
|
|
||||||
|
|
||||||
The easiest way to get started with an end-to-end training process is to clone a
|
|
||||||
[project](/usage/projects) template. Projects let you manage multi-step
|
|
||||||
workflows, from data preprocessing to training and packaging your model.
|
|
||||||
|
|
||||||
</Project>
|
|
||||||
-->
|
|
||||||
|
|
||||||
## Training config {#config}
|
## Training config {#config}
|
||||||
|
|
||||||
Training config files include all **settings and hyperparameters** for training
|
Training config files include all **settings and hyperparameters** for training
|
||||||
|
@ -400,13 +389,11 @@ recipe once the dish has already been prepared. You have to make a new one.
|
||||||
spaCy includes a variety of built-in [architectures](/api/architectures) for
|
spaCy includes a variety of built-in [architectures](/api/architectures) for
|
||||||
different tasks. For example:
|
different tasks. For example:
|
||||||
|
|
||||||
<!-- TODO: model return types -->
|
|
||||||
|
|
||||||
| Architecture | Description |
|
| Architecture | Description |
|
||||||
| ----------------------------------------------------------------- | --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
|
| ----------------------------------------------------------------- | --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
|
||||||
| [HashEmbedCNN](/api/architectures#HashEmbedCNN) | Build spaCy’s "standard" embedding layer, which uses hash embedding with subword features and a CNN with layer-normalized maxout. ~~Model[List[Doc], List[Floats2d]]~~ |
|
| [HashEmbedCNN](/api/architectures#HashEmbedCNN) | Build spaCy’s "standard" embedding layer, which uses hash embedding with subword features and a CNN with layer-normalized maxout. ~~Model[List[Doc], List[Floats2d]]~~ |
|
||||||
| [TransitionBasedParser](/api/architectures#TransitionBasedParser) | Build a [transition-based parser](https://explosion.ai/blog/parsing-english-in-python) model used in the default [`EntityRecognizer`](/api/entityrecognizer) and [`DependencyParser`](/api/dependencyparser). ~~Model[List[Doc], List[List[Floats2d]]]~~  |
|
| [TransitionBasedParser](/api/architectures#TransitionBasedParser) | Build a [transition-based parser](https://explosion.ai/blog/parsing-english-in-python) model used in the default [`EntityRecognizer`](/api/entityrecognizer) and [`DependencyParser`](/api/dependencyparser). ~~Model[List[Doc], List[List[Floats2d]]]~~  |
|
||||||
| [TextCatEnsemble](/api/architectures#TextCatEnsemble) | Stacked ensemble of a bag-of-words model and a neural network model with an internal CNN embedding layer. Used in the default [`TextCategorizer`](/api/textcategorizer). ~~Model~~ |
|
| [TextCatEnsemble](/api/architectures#TextCatEnsemble) | Stacked ensemble of a bag-of-words model and a neural network model with an internal CNN embedding layer. Used in the default [`TextCategorizer`](/api/textcategorizer). ~~Model[List[Doc], Floats2d]~~ |
|
||||||
|
|
||||||
<!-- TODO: link to not yet existing usage page on custom architectures etc. -->
|
<!-- TODO: link to not yet existing usage page on custom architectures etc. -->
|
||||||
|
|
||||||
|
@ -755,71 +742,10 @@ def filter_batch(size: int) -> Callable[[Iterable[Example]], Iterator[List[Examp
|
||||||
return create_filtered_batches
|
return create_filtered_batches
|
||||||
```
|
```
|
||||||
|
|
||||||
<!-- TODO:
|
|
||||||
|
|
||||||
<Project id="example_pytorch_model">
|
|
||||||
|
|
||||||
Lorem ipsum dolor sit amet, consectetur adipiscing elit. Phasellus interdum
|
|
||||||
sodales lectus, ut sodales orci ullamcorper id. Sed condimentum neque ut erat
|
|
||||||
mattis pretium.
|
|
||||||
|
|
||||||
</Project>
|
|
||||||
|
|
||||||
-->
|
|
||||||
|
|
||||||
### Defining custom architectures {#custom-architectures}
|
### Defining custom architectures {#custom-architectures}
|
||||||
|
|
||||||
<!-- TODO: this should probably move to new section on models -->
|
<!-- TODO: this should probably move to new section on models -->
|
||||||
|
|
||||||
## Transfer learning {#transfer-learning}
|
|
||||||
|
|
||||||
<!-- TODO: write something, link to embeddings and transformers page – should probably wait until transformers/embeddings/transfer learning docs are done -->
|
|
||||||
|
|
||||||
### Using transformer models like BERT {#transformers}
|
|
||||||
|
|
||||||
spaCy v3.0 lets you use almost any statistical model to power your pipeline. You
|
|
||||||
can use models implemented in a variety of frameworks. A transformer model is
|
|
||||||
just a statistical model, so the
|
|
||||||
[`spacy-transformers`](https://github.com/explosion/spacy-transformers) package
|
|
||||||
actually has very little work to do: it just has to provide a few functions that
|
|
||||||
do the required plumbing. It also provides a pipeline component,
|
|
||||||
[`Transformer`](/api/transformer), that lets you do multi-task learning and lets
|
|
||||||
you save the transformer outputs for later use.
|
|
||||||
|
|
||||||
<!-- TODO:
|
|
||||||
|
|
||||||
<Project id="en_core_trf_lg">
|
|
||||||
|
|
||||||
Try out a BERT-based model pipeline using this project template: swap in your
|
|
||||||
data, edit the settings and hyperparameters and train, evaluate, package and
|
|
||||||
visualize your model.
|
|
||||||
|
|
||||||
</Project>
|
|
||||||
-->
|
|
||||||
|
|
||||||
For more details on how to integrate transformer models into your training
|
|
||||||
config and customize the implementations, see the usage guide on
|
|
||||||
[training transformers](/usage/embeddings-transformers#transformers-training).
|
|
||||||
|
|
||||||
### Pretraining with spaCy {#pretraining}
|
|
||||||
|
|
||||||
<!-- TODO: document spacy pretrain, objectives etc. – should probably wait until transformers/embeddings/transfer learning docs are done -->
|
|
||||||
|
|
||||||
## Parallel Training with Ray {#parallel-training}
|
|
||||||
|
|
||||||
<!-- TODO:
|
|
||||||
|
|
||||||
|
|
||||||
<Project id="some_example_project">
|
|
||||||
|
|
||||||
Lorem ipsum dolor sit amet, consectetur adipiscing elit. Phasellus interdum
|
|
||||||
sodales lectus, ut sodales orci ullamcorper id. Sed condimentum neque ut erat
|
|
||||||
mattis pretium.
|
|
||||||
|
|
||||||
</Project>
|
|
||||||
|
|
||||||
-->
|
|
||||||
|
|
||||||
## Internal training API {#api}
|
## Internal training API {#api}
|
||||||
|
|
||||||
<Infobox variant="warning">
|
<Infobox variant="warning">
|
||||||
|
@ -880,8 +806,8 @@ example = Example.from_dict(predicted, {"tags": tags})
|
||||||
Here's another example that shows how to define gold-standard named entities.
|
Here's another example that shows how to define gold-standard named entities.
|
||||||
The letters added before the labels refer to the tags of the
|
The letters added before the labels refer to the tags of the
|
||||||
[BILUO scheme](/usage/linguistic-features#updating-biluo) – `O` is a token
|
[BILUO scheme](/usage/linguistic-features#updating-biluo) – `O` is a token
|
||||||
outside an entity, `U` a single entity unit, `B` the beginning of an entity,
|
outside an entity, `U` a single entity unit, `B` the beginning of an entity, `I`
|
||||||
`I` a token inside an entity and `L` the last token of an entity.
|
a token inside an entity and `L` the last token of an entity.
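
To make the tag letters concrete, here is a small pure-Python sketch that derives BILUO tags from token-level entity spans. The helper `spans_to_biluo` and the labels `ORG`, `DATE` and `GPE` are assumptions for illustration only. The helper is not a spaCy function, and it assumes the spans do not overlap.

```python
from typing import List, Tuple


def spans_to_biluo(n_tokens: int, spans: List[Tuple[int, int, str]]) -> List[str]:
    """Convert (start, end, label) token spans (end exclusive) to BILUO tags."""
    tags = ["O"] * n_tokens  # every token starts outside an entity
    for start, end, label in spans:
        if end - start == 1:
            tags[start] = f"U-{label}"  # single-token entity
        else:
            tags[start] = f"B-{label}"  # first token of the entity
            for i in range(start + 1, end - 1):
                tags[i] = f"I-{label}"  # tokens inside the entity
            tags[end - 1] = f"L-{label}"  # last token of the entity
    return tags


words = ["Facebook", "released", "React", "in", "2014"]
tags = spans_to_biluo(len(words), [(0, 1, "ORG"), (4, 5, "DATE")])
print(tags)  # ['U-ORG', 'O', 'O', 'O', 'U-DATE']

multi = spans_to_biluo(5, [(0, 3, "GPE")])
print(multi)  # ['B-GPE', 'I-GPE', 'L-GPE', 'O', 'O']
```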
|
||||||
|
|
||||||
```python
|
```python
|
||||||
doc = Doc(nlp.vocab, words=["Facebook", "released", "React", "in", "2014"])
|
doc = Doc(nlp.vocab, words=["Facebook", "released", "React", "in", "2014"])
|
||||||
|
|
|
@ -363,7 +363,7 @@ body [id]:target
|
||||||
color: var(--color-red-medium)
|
color: var(--color-red-medium)
|
||||||
background: var(--color-red-transparent)
|
background: var(--color-red-transparent)
|
||||||
|
|
||||||
&.italic, &.comment
|
&.italic
|
||||||
font-style: italic
|
font-style: italic
|
||||||
|
|
||||||
|
|
||||||
|
@ -384,11 +384,9 @@ body [id]:target
|
||||||
// Settings for ini syntax (config files)
|
// Settings for ini syntax (config files)
|
||||||
[class*="language-ini"]
|
[class*="language-ini"]
|
||||||
color: var(--syntax-comment)
|
color: var(--syntax-comment)
|
||||||
font-style: italic !important
|
|
||||||
|
|
||||||
.token
|
.token
|
||||||
color: var(--color-subtle)
|
color: var(--color-subtle)
|
||||||
font-style: normal !important
|
|
||||||
|
|
||||||
|
|
||||||
.gatsby-highlight-code-line
|
.gatsby-highlight-code-line
|
||||||
|
@ -426,7 +424,6 @@ body [id]:target
|
||||||
|
|
||||||
.cm-comment
|
.cm-comment
|
||||||
color: var(--syntax-comment)
|
color: var(--syntax-comment)
|
||||||
font-style: italic
|
|
||||||
|
|
||||||
.cm-keyword
|
.cm-keyword
|
||||||
color: var(--syntax-keyword)
|
color: var(--syntax-keyword)
|
||||||
|
|