---
title: Models
teaser: Downloadable pretrained models for spaCy
menu:
  - ['Quickstart', 'quickstart']
  - ['Model Architecture', 'architecture']
  - ['Conventions', 'conventions']
---

The models directory includes two types of pretrained models:

1. **Core models:** General-purpose pretrained models to predict named entities,
   part-of-speech tags and syntactic dependencies. They can be used out of the
   box and fine-tuned on more specific data.
2. **Starter models:** Transfer learning starter packs with pretrained weights
   you can initialize your models with to achieve better accuracy. They can
   include word vectors (which will be used as features during training) or
   other pretrained representations like BERT. These models don't include
   components for specific tasks like NER or text classification and are
   intended to be used as base models when training your own models.

### Quickstart {hidden="true"}

import QuickstartModels from 'widgets/quickstart-models.js'

<QuickstartModels title="Quickstart" id="quickstart" description="Install a default model, get the code to load it from within spaCy and test it." />

<Infobox title="📖 Installation and usage">

For more details on how to use models with spaCy, see the
[usage guide on models](/usage/models).

</Infobox>

## Model architecture {#architecture}

spaCy v2.0 features new neural models for **tagging**, **parsing** and **entity
recognition**. The models have been designed and implemented from scratch
specifically for spaCy, to give you an unmatched balance of speed, size and
accuracy. A novel bloom embedding strategy with subword features is used to
support huge vocabularies in tiny tables. Convolutional layers with residual
connections, layer normalization and maxout non-linearity are used, giving much
better efficiency than the standard BiLSTM solution.
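
As a rough illustration of the bloom embedding idea, the sketch below hashes
each key several times into one small table and sums the selected rows, so an
unbounded vocabulary shares a fixed number of parameters. The class name
`BloomEmbedding`, the table sizes and the choice of BLAKE2 as the hash are all
assumptions made for this example. It is not Thinc's implementation.

```python
import hashlib
import random


class BloomEmbedding:
    """Toy hash embedding (hypothetical): sum of several hashed table rows."""

    def __init__(self, n_rows=1000, width=8, n_hashes=4, seed=0):
        rng = random.Random(seed)
        self.n_rows = n_rows
        self.n_hashes = n_hashes
        # The table is small and fixed, no matter how many distinct keys we see.
        self.table = [
            [rng.uniform(-0.1, 0.1) for _ in range(width)] for _ in range(n_rows)
        ]

    def rows_for(self, key):
        # One row index per hash function; each salt yields a different hash.
        data = key.encode("utf8")
        rows = []
        for i in range(self.n_hashes):
            digest = hashlib.blake2b(data, digest_size=8, salt=bytes([i])).digest()
            rows.append(int.from_bytes(digest, "little") % self.n_rows)
        return rows

    def __call__(self, key):
        # Two keys may share one row, but rarely share all of them.
        selected = [self.table[row] for row in self.rows_for(key)]
        return [sum(column) for column in zip(*selected)]


emb = BloomEmbedding()
vec = emb("spaCy")  # works for any string, including unseen words
```

Because every key resolves to some combination of rows, there is no
out-of-vocabulary case to handle, at the cost of occasional partial collisions.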

The parser and NER use an imitation learning objective to deliver **accuracy
in line with the latest research systems**, even when evaluated from raw text.
With these innovations, spaCy v2.0's models are **10× smaller**, **20% more
accurate**, and **even cheaper to run** than the previous generation. The
current architecture hasn't been published yet, but in the meantime we've
prepared a video that explains how the models work, with particular focus on
NER.

<YouTube id="sqDHBH9IjRU" />

The parsing model is a blend of recent results. The two main inspirations have
been the work of Eliyahu Kiperwasser and Yoav Goldberg at Bar-Ilan
University[^1], and the SyntaxNet team from Google. The foundation of the parser
is still based on the work of Joakim Nivre[^2], who introduced the
transition-based framework[^3], the arc-eager transition system, and the
imitation learning objective. The model is implemented using
[Thinc](https://github.com/explosion/thinc), spaCy's machine learning library.
We first predict context-sensitive vectors for each word in the input:

```python
# Thinc operators: `|` concatenates layer outputs, `>>` chains layers and
# `** 4` stacks four copies of the convolution.
(embed_lower | embed_prefix | embed_suffix | embed_shape)
    >> Maxout(token_width)
    >> convolution ** 4
```

These convolutional layers are shared between the tagger, parser and NER, and
will also be shared by the future neural lemmatizer. Because the parser shares
these layers with the tagger, the parser does not require tag features. This
trick comes from Yuan Zhang and David Weiss's "Stack-propagation" paper[^4].

To boost the representation, the tagger actually predicts a "super tag" with
POS, morphology and dependency label[^5]. The tagger predicts these supertags by
adding a softmax layer onto the convolutional layer, so we're teaching the
convolutional layer to give us a representation that's one affine transform away
from this informative lexical information. This is obviously good for the parser
(which backprops to the convolutions, too).

The parser model makes a state vector by concatenating the vector
representations for its context tokens. The current context tokens are:

| Context tokens                                                                     | Description                                                                 |
| ---------------------------------------------------------------------------------- | --------------------------------------------------------------------------- |
| `S0`, `S1`, `S2`                                                                   | Top three words on the stack.                                               |
| `B0`, `B1`                                                                         | First two words of the buffer.                                              |
| `S0L1`, `S1L1`, `S2L1`, `B0L1`, `B1L1`<br />`S0L2`, `S1L2`, `S2L2`, `B0L2`, `B1L2` | Leftmost and second leftmost children of `S0`, `S1`, `S2`, `B0` and `B1`.   |
| `S0R1`, `S1R1`, `S2R1`, `B0R1`, `B1R1`<br />`S0R2`, `S1R2`, `S2R2`, `B0R2`, `B1R2` | Rightmost and second rightmost children of `S0`, `S1`, `S2`, `B0` and `B1`. |

This makes the state vector quite long: `13*T`, where `T` is the token vector
width (128 is working well). Fortunately, there's a way to structure the
computation to save some expense (and make it more GPU-friendly).

The parser typically visits `2*N` states for a sentence of length `N` (although
it may visit more, if it back-tracks with a non-monotonic transition[^6]). A
naive implementation would require `2*N` matrix multiplications of shape
`(B, 13*T) @ (13*T, H)` for a batch of size `B`. We can instead perform one
`(B*N, T) @ (T, 13*H)` multiplication, to pre-compute the hidden weights for
each positional feature with respect to the words in the batch. (Note that our
token vectors come from the CNN, so we can't play this trick over the
vocabulary. That's how Stanford's NN parser[^7] works, and why its model is so
big.)
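
The equivalence behind this trick can be checked numerically. The sketch below
uses NumPy with toy sizes. The variable names, the random weights and the
example state are assumptions for illustration, and it ignores details such as
missing context tokens.

```python
import numpy as np

rng = np.random.default_rng(0)
n_words, n_slots, T, H = 6, 13, 4, 5  # toy sizes; n_words plays the role of B*N

tokvecs = rng.normal(size=(n_words, T))  # CNN output, one row per word
W = rng.normal(size=(n_slots, T, H))     # one (T, H) weight block per context slot


def hidden_naive(slot_word_ids):
    """Concatenate the 13 context vectors, then multiply by a (13*T, H) matrix."""
    state_vector = np.concatenate([tokvecs[i] for i in slot_word_ids])
    return state_vector @ W.reshape(n_slots * T, H)


# Precompute once per batch: (n_words, T) @ (T, 13*H).
W_wide = W.transpose(1, 0, 2).reshape(T, n_slots * H)
cache = (tokvecs @ W_wide).reshape(n_words, n_slots, H)


def hidden_precomputed(slot_word_ids):
    """Per state, only gather and sum 13 cached rows of width H."""
    return sum(cache[word, slot] for slot, word in enumerate(slot_word_ids))


# A made-up parser state: which word currently fills each of the 13 slots.
state = [0, 1, 2, 3, 4, 5, 0, 1, 2, 3, 4, 5, 0]
same = np.allclose(hidden_naive(state), hidden_precomputed(state))
```

Both functions return the same hidden-layer input, but the second one replaces
a matrix multiplication per state with a handful of lookups and additions.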

This pre-computation strategy allows a nice compromise between GPU-friendliness
and implementation simplicity. The CNN and the wide lower layer are computed on
the GPU, and then the precomputed hidden weights are moved to the CPU, before we
start the transition-based parsing process. This makes a lot of things much
easier. We don't have to worry about variable-length batch sizes, and we don't
have to implement the dynamic oracle in CUDA to train.

Currently the parser's loss function is multi-label log loss[^8], as the dynamic
oracle allows multiple actions to have zero cost. The gradient for the score of
a gold class is defined as follows, where `Z` is the sum of the exponentiated
scores over all classes and `gZ` is the sum of the exponentiated scores assigned
to gold classes:

```python
(exp(score) / Z) - (exp(score) / gZ)
```
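
A small worked version of this loss may help. The sketch below assumes the loss
is the negative log of the probability mass assigned to the gold classes, which
yields the expression above for gold classes and `exp(score) / Z` for all other
classes. The function name and argument layout are hypothetical.

```python
import math


def multilabel_log_loss(scores, is_gold):
    """Return (loss, d_scores) for one parser state.

    scores:  one float per action.
    is_gold: True where the action has zero cost (at least one must be True).
    """
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(score - top) for score in scores]
    Z = sum(exps)                                          # all classes
    gZ = sum(e for e, gold in zip(exps, is_gold) if gold)  # gold classes only
    loss = -math.log(gZ / Z)
    d_scores = [
        e / Z - (e / gZ if gold else 0.0) for e, gold in zip(exps, is_gold)
    ]
    return loss, d_scores


loss, d_scores = multilabel_log_loss([2.0, 1.0, 0.0], [True, True, False])
```

With exactly one gold class this reduces to the usual softmax cross-entropy
gradient. With several, the model is free to split its probability among them.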

<Infobox title="Bibliography">

1. [Simple and Accurate Dependency Parsing Using Bidirectional LSTM Feature Representations {#fn-1}](https://www.semanticscholar.org/paper/Simple-and-Accurate-Dependency-Parsing-Using-Bidir-Kiperwasser-Goldberg/3cf31ecb2724b5088783d7c96a5fc0d5604cbf41).
   Eliyahu Kiperwasser, Yoav Goldberg (2016)
2. [A Dynamic Oracle for Arc-Eager Dependency Parsing {#fn-2}](https://www.semanticscholar.org/paper/A-Dynamic-Oracle-for-Arc-Eager-Dependency-Parsing-Goldberg-Nivre/22697256ec19ecc3e14fcfc63624a44cf9c22df4).
   Yoav Goldberg, Joakim Nivre (2012)
3. [Parsing English in 500 Lines of Python {#fn-3}](https://explosion.ai/blog/parsing-english-in-python).
   Matthew Honnibal (2013)
4. [Stack-propagation: Improved Representation Learning for Syntax {#fn-4}](https://www.semanticscholar.org/paper/Stack-propagation-Improved-Representation-Learning-Zhang-Weiss/0c133f79b23e8c680891d2e49a66f0e3d37f1466).
   Yuan Zhang, David Weiss (2016)
5. [Deep multi-task learning with low level tasks supervised at lower layers {#fn-5}](https://www.semanticscholar.org/paper/Deep-multi-task-learning-with-low-level-tasks-supe-S%C3%B8gaard-Goldberg/03ad06583c9721855ccd82c3d969a01360218d86).
   Anders Søgaard, Yoav Goldberg (2016)
6. [An Improved Non-monotonic Transition System for Dependency Parsing {#fn-6}](https://www.semanticscholar.org/paper/An-Improved-Non-monotonic-Transition-System-for-De-Honnibal-Johnson/4094cee47ade13b77b5ab4d2e6cb9dd2b8a2917c).
   Matthew Honnibal, Mark Johnson (2015)
7. [A Fast and Accurate Dependency Parser using Neural Networks {#fn-7}](http://cs.stanford.edu/people/danqi/papers/emnlp2014.pdf).
   Danqi Chen, Christopher D. Manning (2014)
8. [Parsing the Wall Street Journal using a Lexical-Functional Grammar and Discriminative Estimation Techniques {#fn-8}](https://www.semanticscholar.org/paper/Parsing-the-Wall-Street-Journal-using-a-Lexical-Fu-Riezler-King/0ad07862a91cd59b7eb5de38267e47725a62b8b2).
   Stefan Riezler et al. (2002)

</Infobox>

## Model naming conventions {#conventions}

In general, spaCy expects all model packages to follow the naming convention of
`[lang]_[name]`. For spaCy's models, we also chose to divide the name into
three components:

1. **Type:** Model capabilities (e.g. `core` for a general-purpose model with
   vocabulary, syntax, entities and word vectors, or `depent` for only vocab,
   syntax and entities).
2. **Genre:** Type of text the model is trained on, e.g. `web` or `news`.
3. **Size:** Model size indicator, `sm`, `md` or `lg`.

For example, `en_core_web_sm` is a small English model trained on written web
text (blogs, news, comments) that includes vocabulary, vectors, syntax and
entities.
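
The convention can be expressed as a few lines of string handling. The helper
below is a hypothetical sketch, not a spaCy API, and it assumes a name that
uses all three components. Third-party packages only need to follow
`[lang]_[name]` and may not split this way.

```python
from typing import NamedTuple


class ModelName(NamedTuple):
    lang: str   # language code, e.g. "en"
    type: str   # capabilities, e.g. "core"
    genre: str  # training text, e.g. "web"
    size: str   # "sm", "md" or "lg"


def parse_model_name(name: str) -> ModelName:
    """Split a name of the form [lang]_[type]_[genre]_[size]."""
    parts = name.split("_")
    if len(parts) != 4:
        raise ValueError(f"Expected [lang]_[type]_[genre]_[size], got {name!r}")
    return ModelName(*parts)


parsed = parse_model_name("en_core_web_sm")
```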

### Model versioning {#model-versioning}

Additionally, the model versioning reflects both the compatibility with spaCy
and the major and minor model version. A model version `a.b.c` translates to:

- `a`: **spaCy major version**. For example, `2` for spaCy v2.x.
- `b`: **Model major version**. Models with a different major version can't be
  loaded by the same code. For example, changing the width of the model, adding
  hidden layers or changing the activation changes the model major version.
- `c`: **Model minor version**. Same model structure, but different parameter
  values, e.g. from being trained on different data, for different numbers of
  iterations, etc.
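
To make the scheme concrete, the sketch below splits a version string into its
three parts and compares two versions. The function names are hypothetical, the
code assumes plain integer components, and it is not how spaCy itself checks
compatibility.

```python
from typing import NamedTuple


class ModelVersion(NamedTuple):
    spacy_major: int  # a
    model_major: int  # b
    model_minor: int  # c


def parse_model_version(version: str) -> ModelVersion:
    """Parse a model version of the form a.b.c."""
    a, b, c = (int(part) for part in version.split("."))
    return ModelVersion(a, b, c)


def loadable_by_same_code(v1: str, v2: str) -> bool:
    """True if both versions share the spaCy major and model major version."""
    first, second = parse_model_version(v1), parse_model_version(v2)
    return first[:2] == second[:2]


version = parse_model_version("2.0.0")
```

Under this reading, `2.1.0` and `2.1.3` differ only in their trained weights,
while `2.0.0` and `2.1.0` differ in model structure.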

For a detailed compatibility overview, see the
[`compatibility.json`](https://github.com/explosion/spacy-models/tree/master/compatibility.json)
in the models repository. This is also the source of spaCy's internal
compatibility check, performed when you run the [`download`](/api/cli#download)
command.