spaCy/website/api/_architecture/_nn-model.jade

//- 💫 DOCS > API > ARCHITECTURE > NN MODEL ARCHITECTURE

p
    |  spaCy's statistical models have been custom-designed to give a
    |  high-performance mix of speed and accuracy. The current architecture
    |  hasn't been published yet, but in the meantime we prepared a video that
    |  explains how the models work, with particular focus on NER.

+youtube("sqDHBH9IjRU")

p
    |  The parsing model is a blend of recent results. The two recent
    |  inspirations have been the work of Eli Klipperwasser and Yoav Goldberg at
    |  Bar Ilan#[+fn(1)], and the SyntaxNet team from Google. The foundation of
    |  the parser is still based on the work of Joakim Nivre#[+fn(2)], who
    |  introduced the transition-based framework#[+fn(3)], the arc-eager
    |  transition system, and the imitation learning objective. The model is
    |  implemented using #[+a(gh("thinc")) Thinc], spaCy's machine learning
    |  library. We first predict context-sensitive vectors for each word in the
    |  input:

+code.
    (embed_lower | embed_prefix | embed_suffix | embed_shape)
        &gt;&gt; Maxout(token_width)
        &gt;&gt; convolution ** 4

p
    |  This convolutional layer is shared between the tagger, parser and NER,
    |  and will also be shared by the future neural lemmatizer. Because the
    |  parser shares these layers with the tagger, the parser does not require
    |  tag features. I got this trick from David Weiss's "Stack Combination"
    |  paper#[+fn(4)].

p
    |  To boost the representation, the tagger actually predicts a "super tag"
    |  with POS, morphology and dependency label#[+fn(5)]. The tagger predicts
    |  these supertags by adding a softmax layer onto the convolutional layer –
    |  so, we're teaching the convolutional layer to give us a representation
    |  that's one affine transform from this informative lexical information.
    |  This is obviously good for the parser (which backprops to the
    |  convolutions too). The parser model makes a state vector by concatenating
    |  the vector representations for its context tokens.  The current context
    |  tokens:

+table
    +row
        +cell #[code S0], #[code S1], #[code S2]
        +cell Top three words on the stack.

    +row
        +cell #[code B0], #[code B1]
        +cell First two words of the buffer.

    +row
        +cell
            |  #[code S0L1], #[code S1L1], #[code S2L1], #[code B0L1],
            |  #[code B1L1]#[br]
            |  #[code S0L2], #[code S1L2], #[code S2L2], #[code B0L2],
            |  #[code B1L2]
        +cell
            |  Leftmost and second leftmost children of #[code S0], #[code S1],
            |  #[code S2], #[code B0] and #[code B1].

    +row
        +cell
            |  #[code S0R1], #[code S1R1], #[code S2R1], #[code B0R1],
            |  #[code B1R1]#[br]
            |  #[code S0R2], #[code S1R2], #[code S2R2], #[code B0R2],
            |  #[code B1R2]
        +cell
            |  Rightmost and second rightmost children of #[code S0], #[code S1],
            |  #[code S2], #[code B0] and #[code B1].

p
    |  This makes the state vector quite long: #[code 13*T], where #[code T] is
    |  the token vector width (128 is working well). Fortunately, there's a way
    |  to structure the computation to save some expense (and make it more
    |  GPU-friendly).

p
    |  The parser typically visits #[code 2*N] states for a sentence of length
    |  #[code N] (although it may visit more, if it back-tracks with a
    |  non-monotonic transition#[+fn(4)]). A naive implementation would require
    |  #[code 2*N (B, 13*T) @ (13*T, H)] matrix multiplications for a batch of
    |  size #[code B]. We can instead perform one #[code (B*N, T) @ (T, 13*H)]
    |  multiplication, to pre-compute the hidden weights for each positional
    |  feature with respect to the words in the batch. (Note that our token
    |  vectors come from the CNN — so we can't play this trick over the
    |  vocabulary. That's how Stanford's NN parser#[+fn(3)] works — and why its
    |  model is so big.)

p
    |  This pre-computation strategy allows a nice compromise between
    |  GPU-friendliness and implementation simplicity. The CNN and the wide
    |  lower layer are computed on the GPU, and then the precomputed hidden
    |  weights are moved to the CPU, before we start the transition-based
    |  parsing process. This makes a lot of things much easier. We don't have to
    |  worry about variable-length batch sizes, and we don't have to implement
    |  the dynamic oracle in CUDA to train.

p
    |  Currently the parser's loss function is multilabel log loss#[+fn(6)], as
    |  the dynamic oracle allows multiple states to be 0 cost. This is defined
    |  as follows, where #[code gZ] is the sum of the scores assigned to gold
    |  classes:

+code.
    (exp(score) / Z) - (exp(score) / gZ)

+bibliography
    +item
        |  #[+a("https://www.semanticscholar.org/paper/Simple-and-Accurate-Dependency-Parsing-Using-Bidir-Kiperwasser-Goldberg/3cf31ecb2724b5088783d7c96a5fc0d5604cbf41") Simple and Accurate Dependency Parsing Using Bidirectional LSTM Feature Representations]
        br
        |  Eliyahu Kiperwasser, Yoav Goldberg. (2016)

    +item
        |  #[+a("https://www.semanticscholar.org/paper/A-Dynamic-Oracle-for-Arc-Eager-Dependency-Parsing-Goldberg-Nivre/22697256ec19ecc3e14fcfc63624a44cf9c22df4") A Dynamic Oracle for Arc-Eager Dependency Parsing]
        br
        |  Yoav Goldberg, Joakim Nivre (2012)

    +item
        |  #[+a("https://explosion.ai/blog/parsing-english-in-python") Parsing English in 500 Lines of Python]
        br
        |  Matthew Honnibal (2013)

    +item
        |  #[+a("https://www.semanticscholar.org/paper/Stack-propagation-Improved-Representation-Learning-Zhang-Weiss/0c133f79b23e8c680891d2e49a66f0e3d37f1466") Stack-propagation: Improved Representation Learning for Syntax]
        br
        |  Yuan Zhang, David Weiss (2016)

    +item
        |  #[+a("https://www.semanticscholar.org/paper/Deep-multi-task-learning-with-low-level-tasks-supe-S%C3%B8gaard-Goldberg/03ad06583c9721855ccd82c3d969a01360218d86") Deep multi-task learning with low level tasks supervised at lower layers]
        br
        |  Anders Søgaard, Yoav Goldberg (2016)

    +item
        |  #[+a("https://www.semanticscholar.org/paper/An-Improved-Non-monotonic-Transition-System-for-De-Honnibal-Johnson/4094cee47ade13b77b5ab4d2e6cb9dd2b8a2917c") An Improved Non-monotonic Transition System for Dependency Parsing]
        br
        |  Matthew Honnibal, Mark Johnson (2015)

    +item
        |  #[+a("http://cs.stanford.edu/people/danqi/papers/emnlp2014.pdf") A Fast and Accurate Dependency Parser using Neural Networks]
        br
        |  Danqi Cheng, Christopher D. Manning (2014)

    +item
        |  #[+a("https://www.semanticscholar.org/paper/Parsing-the-Wall-Street-Journal-using-a-Lexical-Fu-Riezler-King/0ad07862a91cd59b7eb5de38267e47725a62b8b2") Parsing the Wall Street Journal using a Lexical-Functional Grammar and Discriminative Estimation Techniques]
        br
        |  Stefan Riezler et al. (2002)
-												Update API documentation

											
										
										
											2017-10-03 15:27:22 +03:00
+								//- 💫 DOCS > API > ARCHITECTURE > NN MODEL ARCHITECTURE
-												💫 Interactive code examples, spaCy Universe and various docs improvements (#2274)

* Integrate Python kernel via Binder

* Add live model test for languages with examples

* Update docs and code examples

* Adjust margin (if not bootstrapped)

* Add binder version to global config

* Update terminal and executable code mixins

* Pass attributes through infobox and section

* Hide v-cloak

* Fix example

* Take out model comparison for now

* Add meta text for compat

* Remove chart.js dependency

* Tidy up and simplify JS and port big components over to Vue

* Remove chartjs example

* Add Twitter icon

* Add purple stylesheet option

* Add utility for hand cursor (special cases only)

* Add transition classes

* Add small option for section

* Add thumb object for small round thumbnail images

* Allow unset code block language via "none" value

(workaround to still allow unset language to default to DEFAULT_SYNTAX)

* Pass through attributes

* Add syntax highlighting definitions for Julia, R and Docker

* Add website icon

* Remove user survey from navigation

* Don't hide GitHub icon on small screens

* Make top navigation scrollable on small screens

* Remove old resources page and references to it

* Add Universe

* Add helper functions for better page URL and title

* Update site description

* Increment versions

* Update preview images

* Update mentions of resources

* Fix image

* Fix social images

* Fix problem with cover sizing and floats

* Add divider and move badges into heading

* Add docstrings

* Reference converting section

* Add section on converting word vectors

* Move converting section to custom section and fix formatting

* Remove old fastText example

* Move extensions content to own section

Keep weird ID to not break permalinks for now (we don't want to rewrite URLs if not absolutely necessary)

* Use better component example and add factories section

* Add note on larger model

* Use better example for non-vector

* Remove similarity in context section

Only works via small models with tensors so has always been kind of confusing

* Add note on init-model command

* Fix lightning tour examples and make excutable if possible

* Add spacy train CLI section to train

* Fix formatting and add video

* Fix formatting

* Fix textcat example description (resolves #2246)

* Add dummy file to try resolve conflict

* Delete dummy file

* Tidy up [ci skip]

* Ensure sufficient height of loading container

* Add loading animation to universe

* Update Thebelab build and use better startup message

* Fix asset versioning

* Fix typo [ci skip]

* Add note on project idea label

											
										
										
											2018-04-29 03:06:46 +03:00
+								p
 								    |  spaCy's statistical models have been custom-designed to give a
 								    |  high-performance mix of speed and accuracy. The current architecture
 								    |  hasn't been published yet, but in the meantime we prepared a video that
 								    |  explains how the models work, with particular focus on NER.
 								+youtube("sqDHBH9IjRU")
-												Update API documentation

											
										
										
											2017-10-03 15:27:22 +03:00
+								p
 								    |  The parsing model is a blend of recent results. The two recent
 								    |  inspirations have been the work of Eli Klipperwasser and Yoav Goldberg at
 								    |  Bar Ilan#[+fn(1)], and the SyntaxNet team from Google. The foundation of
 								    |  the parser is still based on the work of Joakim Nivre#[+fn(2)], who
 								    |  introduced the transition-based framework#[+fn(3)], the arc-eager
 								    |  transition system, and the imitation learning objective. The model is
 								    |  implemented using #[+a(gh("thinc")) Thinc], spaCy's machine learning
 								    |  library. We first predict context-sensitive vectors for each word in the
 								    |  input:
 								+code.
 								    (embed_lower | embed_prefix | embed_suffix | embed_shape)
 								        &gt;&gt; Maxout(token_width)
 								        &gt;&gt; convolution ** 4
 								p
 								    |  This convolutional layer is shared between the tagger, parser and NER,
 								    |  and will also be shared by the future neural lemmatizer. Because the
 								    |  parser shares these layers with the tagger, the parser does not require
 								    |  tag features. I got this trick from David Weiss's "Stack Combination"
 								    |  paper#[+fn(4)].
 								p
 								    |  To boost the representation, the tagger actually predicts a "super tag"
 								    |  with POS, morphology and dependency label#[+fn(5)]. The tagger predicts
 								    |  these supertags by adding a softmax layer onto the convolutional layer –
 								    |  so, we're teaching the convolutional layer to give us a representation
 								    |  that's one affine transform from this informative lexical information.
 								    |  This is obviously good for the parser (which backprops to the
 								    |  convolutions too). The parser model makes a state vector by concatenating
 								    |  the vector representations for its context tokens.  The current context
 								    |  tokens:
 								+table
 								    +row
 								        +cell #[code S0], #[code S1], #[code S2]
 								        +cell Top three words on the stack.
 								    +row
 								        +cell #[code B0], #[code B1]
 								        +cell First two words of the buffer.
 								    +row
-												💫 Interactive code examples, spaCy Universe and various docs improvements (#2274)

* Integrate Python kernel via Binder

* Add live model test for languages with examples

* Update docs and code examples

* Adjust margin (if not bootstrapped)

* Add binder version to global config

* Update terminal and executable code mixins

* Pass attributes through infobox and section

* Hide v-cloak

* Fix example

* Take out model comparison for now

* Add meta text for compat

* Remove chart.js dependency

* Tidy up and simplify JS and port big components over to Vue

* Remove chartjs example

* Add Twitter icon

* Add purple stylesheet option

* Add utility for hand cursor (special cases only)

* Add transition classes

* Add small option for section

* Add thumb object for small round thumbnail images

* Allow unset code block language via "none" value

(workaround to still allow unset language to default to DEFAULT_SYNTAX)

* Pass through attributes

* Add syntax highlighting definitions for Julia, R and Docker

* Add website icon

* Remove user survey from navigation

* Don't hide GitHub icon on small screens

* Make top navigation scrollable on small screens

* Remove old resources page and references to it

* Add Universe

* Add helper functions for better page URL and title

* Update site description

* Increment versions

* Update preview images

* Update mentions of resources

* Fix image

* Fix social images

* Fix problem with cover sizing and floats

* Add divider and move badges into heading

* Add docstrings

* Reference converting section

* Add section on converting word vectors

* Move converting section to custom section and fix formatting

* Remove old fastText example

* Move extensions content to own section

Keep weird ID to not break permalinks for now (we don't want to rewrite URLs if not absolutely necessary)

* Use better component example and add factories section

* Add note on larger model

* Use better example for non-vector

* Remove similarity in context section

Only works via small models with tensors so has always been kind of confusing

* Add note on init-model command

* Fix lightning tour examples and make excutable if possible

* Add spacy train CLI section to train

* Fix formatting and add video

* Fix formatting

* Fix textcat example description (resolves #2246)

* Add dummy file to try resolve conflict

* Delete dummy file

* Tidy up [ci skip]

* Ensure sufficient height of loading container

* Add loading animation to universe

* Update Thebelab build and use better startup message

* Fix asset versioning

* Fix typo [ci skip]

* Add note on project idea label

											
										
										
											2018-04-29 03:06:46 +03:00
+								        +cell
-												Update API documentation

											
										
										
											2017-10-03 15:27:22 +03:00
+								            |  #[code S0L1], #[code S1L1], #[code S2L1], #[code B0L1],
 								            |  #[code B1L1]#[br]
 								            |  #[code S0L2], #[code S1L2], #[code S2L2], #[code B0L2],
 								            |  #[code B1L2]
 								        +cell
 								            |  Leftmost and second leftmost children of #[code S0], #[code S1],
 								            |  #[code S2], #[code B0] and #[code B1].
 								    +row
-												💫 Interactive code examples, spaCy Universe and various docs improvements (#2274)

* Integrate Python kernel via Binder

* Add live model test for languages with examples

* Update docs and code examples

* Adjust margin (if not bootstrapped)

* Add binder version to global config

* Update terminal and executable code mixins

* Pass attributes through infobox and section

* Hide v-cloak

* Fix example

* Take out model comparison for now

* Add meta text for compat

* Remove chart.js dependency

* Tidy up and simplify JS and port big components over to Vue

* Remove chartjs example

* Add Twitter icon

* Add purple stylesheet option

* Add utility for hand cursor (special cases only)

* Add transition classes

* Add small option for section

* Add thumb object for small round thumbnail images

* Allow unset code block language via "none" value

(workaround to still allow unset language to default to DEFAULT_SYNTAX)

* Pass through attributes

* Add syntax highlighting definitions for Julia, R and Docker

* Add website icon

* Remove user survey from navigation

* Don't hide GitHub icon on small screens

* Make top navigation scrollable on small screens

* Remove old resources page and references to it

* Add Universe

* Add helper functions for better page URL and title

* Update site description

* Increment versions

* Update preview images

* Update mentions of resources

* Fix image

* Fix social images

* Fix problem with cover sizing and floats

* Add divider and move badges into heading

* Add docstrings

* Reference converting section

* Add section on converting word vectors

* Move converting section to custom section and fix formatting

* Remove old fastText example

* Move extensions content to own section

Keep weird ID to not break permalinks for now (we don't want to rewrite URLs if not absolutely necessary)

* Use better component example and add factories section

* Add note on larger model

* Use better example for non-vector

* Remove similarity in context section

Only works via small models with tensors so has always been kind of confusing

* Add note on init-model command

* Fix lightning tour examples and make excutable if possible

* Add spacy train CLI section to train

* Fix formatting and add video

* Fix formatting

* Fix textcat example description (resolves #2246)

* Add dummy file to try resolve conflict

* Delete dummy file

* Tidy up [ci skip]

* Ensure sufficient height of loading container

* Add loading animation to universe

* Update Thebelab build and use better startup message

* Fix asset versioning

* Fix typo [ci skip]

* Add note on project idea label

											
										
										
											2018-04-29 03:06:46 +03:00
+								        +cell
-												Update API documentation

											
										
										
											2017-10-03 15:27:22 +03:00
+								            |  #[code S0R1], #[code S1R1], #[code S2R1], #[code B0R1],
 								            |  #[code B1R1]#[br]
 								            |  #[code S0R2], #[code S1R2], #[code S2R2], #[code B0R2],
 								            |  #[code B1R2]
 								        +cell
 								            |  Rightmost and second rightmost children of #[code S0], #[code S1],
 								            |  #[code S2], #[code B0] and #[code B1].
 								p
 								    |  This makes the state vector quite long: #[code 13*T], where #[code T] is
 								    |  the token vector width (128 is working well). Fortunately, there's a way
 								    |  to structure the computation to save some expense (and make it more
 								    |  GPU-friendly).
 								p
 								    |  The parser typically visits #[code 2*N] states for a sentence of length
 								    |  #[code N] (although it may visit more, if it back-tracks with a
 								    |  non-monotonic transition#[+fn(4)]). A naive implementation would require
 								    |  #[code 2*N (B, 13*T) @ (13*T, H)] matrix multiplications for a batch of
 								    |  size #[code B]. We can instead perform one #[code (B*N, T) @ (T, 13*H)]
 								    |  multiplication, to pre-compute the hidden weights for each positional
 								    |  feature with respect to the words in the batch. (Note that our token
 								    |  vectors come from the CNN — so we can't play this trick over the
 								    |  vocabulary. That's how Stanford's NN parser#[+fn(3)] works — and why its
 								    |  model is so big.)
 								p
 								    |  This pre-computation strategy allows a nice compromise between
 								    |  GPU-friendliness and implementation simplicity. The CNN and the wide
 								    |  lower layer are computed on the GPU, and then the precomputed hidden
 								    |  weights are moved to the CPU, before we start the transition-based
 								    |  parsing process. This makes a lot of things much easier. We don't have to
 								    |  worry about variable-length batch sizes, and we don't have to implement
 								    |  the dynamic oracle in CUDA to train.
 								p
 								    |  Currently the parser's loss function is multilabel log loss#[+fn(6)], as
 								    |  the dynamic oracle allows multiple states to be 0 cost. This is defined
 								    |  as follows, where #[code gZ] is the sum of the scores assigned to gold
 								    |  classes:
 								+code.
 								    (exp(score) / Z) - (exp(score) / gZ)
 								+bibliography
 								    +item
 								        |  #[+a("https://www.semanticscholar.org/paper/Simple-and-Accurate-Dependency-Parsing-Using-Bidir-Kiperwasser-Goldberg/3cf31ecb2724b5088783d7c96a5fc0d5604cbf41") Simple and Accurate Dependency Parsing Using Bidirectional LSTM Feature Representations]
 								        br
 								        |  Eliyahu Kiperwasser, Yoav Goldberg. (2016)
 								    +item
 								        |  #[+a("https://www.semanticscholar.org/paper/A-Dynamic-Oracle-for-Arc-Eager-Dependency-Parsing-Goldberg-Nivre/22697256ec19ecc3e14fcfc63624a44cf9c22df4") A Dynamic Oracle for Arc-Eager Dependency Parsing]
 								        br
 								        |  Yoav Goldberg, Joakim Nivre (2012)
 								    +item
 								        |  #[+a("https://explosion.ai/blog/parsing-english-in-python") Parsing English in 500 Lines of Python]
 								        br
 								        |  Matthew Honnibal (2013)
 								    +item
 								        |  #[+a("https://www.semanticscholar.org/paper/Stack-propagation-Improved-Representation-Learning-Zhang-Weiss/0c133f79b23e8c680891d2e49a66f0e3d37f1466") Stack-propagation: Improved Representation Learning for Syntax]
 								        br
 								        |  Yuan Zhang, David Weiss (2016)
 								    +item
 								        |  #[+a("https://www.semanticscholar.org/paper/Deep-multi-task-learning-with-low-level-tasks-supe-S%C3%B8gaard-Goldberg/03ad06583c9721855ccd82c3d969a01360218d86") Deep multi-task learning with low level tasks supervised at lower layers]
 								        br
 								        |  Anders Søgaard, Yoav Goldberg (2016)
 								    +item
 								        |  #[+a("https://www.semanticscholar.org/paper/An-Improved-Non-monotonic-Transition-System-for-De-Honnibal-Johnson/4094cee47ade13b77b5ab4d2e6cb9dd2b8a2917c") An Improved Non-monotonic Transition System for Dependency Parsing]
 								        br
 								        |  Matthew Honnibal, Mark Johnson (2015)
 								    +item
 								        |  #[+a("http://cs.stanford.edu/people/danqi/papers/emnlp2014.pdf") A Fast and Accurate Dependency Parser using Neural Networks]
 								        br
 								        |  Danqi Cheng, Christopher D. Manning (2014)
 								    +item
 								        |  #[+a("https://www.semanticscholar.org/paper/Parsing-the-Wall-Street-Journal-using-a-Lexical-Fu-Riezler-King/0ad07862a91cd59b7eb5de38267e47725a62b8b2") Parsing the Wall Street Journal using a Lexical-Functional Grammar and Discriminative Estimation Techniques]
 								        br
 								        |  Stefan Riezler et al. (2002)