---
title: Models
teaser: Downloadable pretrained models for spaCy
menu:
  - ['Quickstart', 'quickstart']
  - ['Model Architecture', 'architecture']
  - ['Conventions', 'conventions']
---

The models directory includes two types of pretrained models:

1. **Core models:** General-purpose pretrained models to predict named entities,
   part-of-speech tags and syntactic dependencies. They can be used out of the
   box and fine-tuned on more specific data.
2. **Starter models:** Transfer learning starter packs with pretrained weights
   you can initialize your models with to achieve better accuracy. They can
   include word vectors (which will be used as features during training) or
   other pretrained representations like BERT. These models don't include
   components for specific tasks like NER or text classification and are
   intended to be used as base models when training your own models.

### Quickstart {hidden="true"}

import QuickstartModels from 'widgets/quickstart-models.js'

<QuickstartModels title="Quickstart" id="quickstart" description="Install a default model, get the code to load it from within spaCy and test it." />

<Infobox title="📖 Installation and usage">

For more details on how to use models with spaCy, see the
[usage guide on models](/usage/models).

</Infobox>

## Model architecture {#architecture}

spaCy v2.0 features new neural models for **tagging**, **parsing** and **entity
recognition**. The models have been designed and implemented from scratch
specifically for spaCy, to give you an unmatched balance of speed, size and
accuracy. A novel bloom embedding strategy with subword features is used to
support huge vocabularies in tiny tables. Convolutional layers with residual
connections, layer normalization and maxout non-linearity are used, giving much
better efficiency than the standard BiLSTM solution.

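The bloom embedding idea can be illustrated with a small pure-Python sketch. This is not spaCy's or Thinc's implementation: the class name `BloomEmbedding`, the use of `zlib.crc32` as the hash function, the number of hashes and the way the table is filled are all assumptions made for illustration. The point it shows is that every key, seen or unseen, maps to a few rows of a small table, and the rows are summed to form its vector.

```python
import zlib


class BloomEmbedding:
    """Toy hashed embedding (hypothetical helper, not a Thinc class)."""

    def __init__(self, n_rows, width, n_hashes=4):
        self.n_rows = n_rows
        self.width = width
        self.n_hashes = n_hashes
        # Deterministic placeholder values; a real table would be learned.
        self.table = [
            [((r * 31 + c * 17) % 7 - 3) / 10 for c in range(width)]
            for r in range(n_rows)
        ]

    def rows_for(self, key):
        # One row index per seed. Prefixing the seed byte gives a different
        # hash of the same key for each seed.
        data = key.encode("utf8")
        return [
            zlib.crc32(bytes([seed]) + data) % self.n_rows
            for seed in range(self.n_hashes)
        ]

    def __call__(self, key):
        # The key's vector is the sum of the rows it hashes to.
        vector = [0.0] * self.width
        for row in self.rows_for(key):
            for i, value in enumerate(self.table[row]):
                vector[i] += value
        return vector


embed = BloomEmbedding(n_rows=50, width=4)
vector = embed("apple")  # works for any string, including unseen ones
```

Two keys may collide on one row, but they are unlikely to collide on all of them, which is why a table much smaller than the vocabulary can still give most keys a distinct vector.
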
The parser and NER use an imitation learning objective to deliver **accuracy
in line with the latest research systems**, even when evaluated from raw text.
With these innovations, spaCy v2.0's models are **10× smaller**, **20% more
accurate**, and **even cheaper to run** than the previous generation. The
current architecture hasn't been published yet, but in the meantime we prepared
a video that explains how the models work, with particular focus on NER.

<YouTube id="sqDHBH9IjRU" />

The parsing model is a blend of recent results. The two recent inspirations have
been the work of Eliyahu Kiperwasser and Yoav Goldberg at Bar Ilan[^1], and the
SyntaxNet team from Google. The foundation of the parser is still based on the
work of Joakim Nivre[^2], who introduced the transition-based framework[^3], the
arc-eager transition system, and the imitation learning objective. The model is
implemented using [Thinc](https://github.com/explosion/thinc), spaCy's machine
learning library. We first predict context-sensitive vectors for each word in
the input:

```python
# Thinc operator notation: `|` concatenates, `>>` chains layers, `**` repeats a layer.
(embed_lower | embed_prefix | embed_suffix | embed_shape)
    >> Maxout(token_width)
    >> convolution ** 4
```

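To make the maxout and residual pieces of this pipeline concrete, here is a minimal pure-Python sketch. The function names `maxout` and `residual`, the nested-list weight layout and the toy numbers are assumptions for illustration only; Thinc's own layers operate on arrays and also include layer normalization, which is omitted here.

```python
def maxout(x, weights, biases):
    """Maxout layer: each output unit is the max over several affine "pieces".

    x: list of n_in floats
    weights: nested list with shape [n_out][n_pieces][n_in]
    biases: nested list with shape [n_out][n_pieces]
    """
    output = []
    for unit_weights, unit_biases in zip(weights, biases):
        pieces = [
            sum(w_i * x_i for w_i, x_i in zip(piece_w, x)) + piece_b
            for piece_w, piece_b in zip(unit_weights, unit_biases)
        ]
        output.append(max(pieces))
    return output


def residual(layer, x):
    """Residual connection: add the layer's output back onto its input."""
    return [x_i + y_i for x_i, y_i in zip(x, layer(x))]


# Toy layer: 2 inputs, 2 output units, 2 pieces per unit.
weights = [
    [[1.0, 0.0], [0.0, 1.0]],
    [[-1.0, 0.0], [0.0, -1.0]],
]
biases = [[0.0, 0.0], [0.0, 0.0]]
x = [1.0, -2.0]

hidden = maxout(x, weights, biases)
with_skip = residual(lambda v: maxout(v, weights, biases), x)
```

Stacking four such residual blocks would correspond to the `convolution ** 4` step, with each block additionally looking at a window of neighbouring tokens.
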
The convolutional layers are shared between the tagger, parser and NER, and will
also be shared by the future neural lemmatizer. Because the parser shares these
layers with the tagger, the parser does not require tag features. I got this
trick from David Weiss's "Stack-propagation" paper[^4].

To boost the representation, the tagger actually predicts a "supertag" with
POS, morphology and dependency label[^5]. The tagger predicts these supertags by
adding a softmax layer onto the convolutional layer, so we're teaching the
convolutional layer to give us a representation that's one affine transform from
this rich lexical information. This is obviously good for the parser
(which backprops to the convolutions, too).

The parser model makes a state vector by concatenating the vector
representations for its context tokens. The current context tokens are:

| Context tokens                                                                     | Description                                                                 |
| ---------------------------------------------------------------------------------- | --------------------------------------------------------------------------- |
| `S0`, `S1`, `S2`                                                                   | Top three words on the stack.                                               |
| `B0`, `B1`                                                                         | First two words of the buffer.                                              |
| `S0L1`, `S1L1`, `S2L1`, `B0L1`, `B1L1`<br />`S0L2`, `S1L2`, `S2L2`, `B0L2`, `B1L2` | Leftmost and second leftmost children of `S0`, `S1`, `S2`, `B0` and `B1`.   |
| `S0R1`, `S1R1`, `S2R1`, `B0R1`, `B1R1`<br />`S0R2`, `S1R2`, `S2R2`, `B0R2`, `B1R2` | Rightmost and second rightmost children of `S0`, `S1`, `S2`, `B0` and `B1`. |

This makes the state vector quite long: `13*T`, where `T` is the token vector
width (128 is working well). Fortunately, there's a way to structure the
computation to save some expense (and make it more GPU-friendly).

The parser typically visits `2*N` states for a sentence of length `N` (although
it may visit more, if it back-tracks with a non-monotonic transition[^6]). A
naive implementation would require `2*N` `(B, 13*T) @ (13*T, H)` matrix
multiplications for a batch of size `B`. We can instead perform one
`(B*N, T) @ (T, 13*H)` multiplication, to pre-compute the hidden weights for
each positional feature with respect to the words in the batch. (Note that our
token vectors come from the CNN, so we can't play this trick over the
vocabulary. That's how Stanford's NN parser[^7] works, and why its model is so
big.)

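The equivalence behind this pre-computation can be checked with a small pure-Python sketch. Everything here is an assumption for illustration: the helper names, the toy sizes, the integer-valued weights (used so the comparison is exact) and the omission of padding for missing context tokens. It only shows that multiplying the concatenated state vector by the full weight matrix gives the same result as summing per-token, per-slot products computed once up front.

```python
import random


def matvec(vector, matrix):
    """Multiply a row vector of length n by a matrix stored as n rows."""
    n_cols = len(matrix[0])
    return [sum(v * row[j] for v, row in zip(vector, matrix)) for j in range(n_cols)]


def naive_hidden(tokvecs, W, context):
    """Concatenate the context token vectors, then do one (F*T) @ (F*T, H) product."""
    state = [value for i in context for value in tokvecs[i]]
    return matvec(state, W)


def precompute(tokvecs, W, n_features):
    """For every token i and feature slot f, compute tokvecs[i] @ W_f once."""
    T = len(tokvecs[0])
    return [
        [matvec(vec, W[f * T:(f + 1) * T]) for f in range(n_features)]
        for vec in tokvecs
    ]


def cached_hidden(cache, context):
    """Sum the precomputed rows selected by the context token indices."""
    H = len(cache[0][0])
    hidden = [0] * H
    for f, i in enumerate(context):
        for j in range(H):
            hidden[j] += cache[i][f][j]
    return hidden


rng = random.Random(0)
N, T, F, H = 5, 3, 4, 2  # toy sizes; the text uses F = 13 slots and T = 128
tokvecs = [[rng.randint(-3, 3) for _ in range(T)] for _ in range(N)]
W = [[rng.randint(-3, 3) for _ in range(H)] for _ in range(F * T)]

cache = precompute(tokvecs, W, F)
context = [0, 2, 2, 4]  # token index filling each feature slot in some state
assert naive_hidden(tokvecs, W, context) == cached_hidden(cache, context)
```

After `precompute`, each parser state costs only `F` lookups and additions of length-`H` vectors, no matter how many states are visited.
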
This pre-computation strategy allows a nice compromise between GPU-friendliness
and implementation simplicity. The CNN and the wide lower layer are computed on
the GPU, and then the precomputed hidden weights are moved to the CPU, before we
start the transition-based parsing process. This makes a lot of things much
easier. We don't have to worry about variable-length batch sizes, and we don't
have to implement the dynamic oracle in CUDA to train.

Currently the parser's loss function is multi-label log loss[^8], as the dynamic
oracle allows multiple states to be 0 cost. The gradient for a gold class is
defined as follows, where `Z` is the sum of the exponentiated scores over all
classes and `gZ` is the same sum restricted to the gold classes:

```python
(exp(score) / Z) - (exp(score) / gZ)
```

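As a worked illustration of this expression, the sketch below computes the gradient for every class in pure Python. It rests on assumptions not spelled out in the text: `Z` is taken to be the sum of `exp(score)` over all classes, `gZ` the same sum over gold (zero-cost) classes only, non-gold classes receive just the `exp(score) / Z` term, and at least one class is gold. The function name `multilabel_log_loss_grad` is hypothetical.

```python
import math


def multilabel_log_loss_grad(scores, is_gold):
    """Gradient of the multi-label log loss with respect to each class score."""
    top = max(scores)  # subtract the max before exp() for numerical stability
    exps = [math.exp(s - top) for s in scores]
    Z = sum(exps)  # normalizer over all classes
    gZ = sum(e for e, gold in zip(exps, is_gold) if gold)  # gold classes only
    return [
        e / Z - (e / gZ if gold else 0.0)
        for e, gold in zip(exps, is_gold)
    ]


scores = [2.0, 1.0, 0.0]
is_gold = [True, True, False]  # two zero-cost actions in this state
grad = multilabel_log_loss_grad(scores, is_gold)
```

Gold classes get a non-positive gradient and non-gold classes a positive one, so a gradient descent step shifts probability mass toward the set of gold classes without forcing a choice between them.
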
<Infobox title="Bibliography">

1. [Simple and Accurate Dependency Parsing Using Bidirectional LSTM Feature Representations {#fn-1}](https://www.semanticscholar.org/paper/Simple-and-Accurate-Dependency-Parsing-Using-Bidir-Kiperwasser-Goldberg/3cf31ecb2724b5088783d7c96a5fc0d5604cbf41).
   Eliyahu Kiperwasser, Yoav Goldberg (2016)
2. [A Dynamic Oracle for Arc-Eager Dependency Parsing {#fn-2}](https://www.semanticscholar.org/paper/A-Dynamic-Oracle-for-Arc-Eager-Dependency-Parsing-Goldberg-Nivre/22697256ec19ecc3e14fcfc63624a44cf9c22df4).
   Yoav Goldberg, Joakim Nivre (2012)
3. [Parsing English in 500 Lines of Python {#fn-3}](https://explosion.ai/blog/parsing-english-in-python).
   Matthew Honnibal (2013)
4. [Stack-propagation: Improved Representation Learning for Syntax {#fn-4}](https://www.semanticscholar.org/paper/Stack-propagation-Improved-Representation-Learning-Zhang-Weiss/0c133f79b23e8c680891d2e49a66f0e3d37f1466).
   Yuan Zhang, David Weiss (2016)
5. [Deep multi-task learning with low level tasks supervised at lower layers {#fn-5}](https://www.semanticscholar.org/paper/Deep-multi-task-learning-with-low-level-tasks-supe-S%C3%B8gaard-Goldberg/03ad06583c9721855ccd82c3d969a01360218d86).
   Anders Søgaard, Yoav Goldberg (2016)
6. [An Improved Non-monotonic Transition System for Dependency Parsing {#fn-6}](https://www.semanticscholar.org/paper/An-Improved-Non-monotonic-Transition-System-for-De-Honnibal-Johnson/4094cee47ade13b77b5ab4d2e6cb9dd2b8a2917c).
   Matthew Honnibal, Mark Johnson (2015)
7. [A Fast and Accurate Dependency Parser using Neural Networks {#fn-7}](http://cs.stanford.edu/people/danqi/papers/emnlp2014.pdf).
   Danqi Chen, Christopher D. Manning (2014)
8. [Parsing the Wall Street Journal using a Lexical-Functional Grammar and Discriminative Estimation Techniques {#fn-8}](https://www.semanticscholar.org/paper/Parsing-the-Wall-Street-Journal-using-a-Lexical-Fu-Riezler-King/0ad07862a91cd59b7eb5de38267e47725a62b8b2).
   Stefan Riezler et al. (2002)

</Infobox>

## Model naming conventions {#conventions}

In general, spaCy expects all model packages to follow the naming convention of
`[lang]_[name]`. For spaCy's models, we also chose to divide the name into
three components:

1. **Type:** Model capabilities (e.g. `core` for general-purpose model with
   vocabulary, syntax, entities and word vectors, or `depent` for only vocab,
   syntax and entities).
2. **Genre:** Type of text the model is trained on, e.g. `web` or `news`.
3. **Size:** Model size indicator: `sm`, `md` or `lg`.

For example, `en_core_web_sm` is a small English model trained on written web
text (blogs, news, comments) that includes vocabulary, vectors, syntax and
entities.

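The convention above can be expressed as a small parsing helper. This is a sketch only: `ModelName` and `parse_model_name` are hypothetical names, not part of spaCy's API, and the code assumes the name has exactly four underscore-separated parts (language, type, genre, size), as spaCy's own model names do under this convention.

```python
from typing import NamedTuple


class ModelName(NamedTuple):
    lang: str
    type: str
    genre: str
    size: str


def parse_model_name(name: str) -> ModelName:
    """Split a name like 'en_core_web_sm' into language, type, genre and size."""
    parts = name.split("_")
    if len(parts) != 4:
        raise ValueError(f"Expected [lang]_[type]_[genre]_[size], got {name!r}")
    return ModelName(*parts)


parsed = parse_model_name("en_core_web_sm")
print(parsed.lang, parsed.type, parsed.genre, parsed.size)
```

Third-party packages only need to follow `[lang]_[name]`, so a helper like this should not be applied to arbitrary model names.
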
### Model versioning {#model-versioning}

Additionally, the model versioning reflects both the compatibility with spaCy
and the major and minor model version. A model version `a.b.c` translates to:

- `a`: **spaCy major version**. For example, `2` for spaCy v2.x.
- `b`: **Model major version**. Models with a different major version can't be
  loaded by the same code. For example, changing the width of the model, adding
  hidden layers or changing the activation changes the model major version.
- `c`: **Model minor version**. Same model structure, but different parameter
  values, e.g. from being trained on different data, for different numbers of
  iterations, etc.

For a detailed compatibility overview, see the
[`compatibility.json`](https://github.com/explosion/spacy-models/tree/master/compatibility.json)
in the models repository. This is also the source of spaCy's internal
compatibility check, performed when you run the [`download`](/api/cli#download)
command.
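
The `a.b.c` scheme described in this section can be sketched as follows. The helpers `parse_model_version` and `same_architecture` are hypothetical and are not spaCy's actual compatibility check, which relies on `compatibility.json`. The sketch assumes a plain three-part numeric version and does not handle any pre-release suffix.

```python
def parse_model_version(version: str) -> dict:
    """Split 'a.b.c' into spaCy major, model major and model minor versions."""
    spacy_major, model_major, model_minor = (int(p) for p in version.split("."))
    return {
        "spacy_major": spacy_major,
        "model_major": model_major,
        "model_minor": model_minor,
    }


def same_architecture(version_a: str, version_b: str) -> bool:
    """True if both versions share the spaCy major and model major version.

    Under the scheme above, such models differ only in parameter values.
    """
    a = parse_model_version(version_a)
    b = parse_model_version(version_b)
    return (a["spacy_major"], a["model_major"]) == (b["spacy_major"], b["model_major"])


print(parse_model_version("2.0.0"))
print(same_architecture("2.1.0", "2.1.3"))
```
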