spaCy/spacy/syntax/stateclass.pyx

# coding: utf-8
# cython: infer_types=True
from __future__ import unicode_literals

from libc.string cimport memcpy, memset
from libc.stdint cimport uint32_t, uint64_t

from ..vocab cimport EMPTY_LEXEME
from ..structs cimport Entity
from ..lexeme cimport Lexeme
from ..symbols cimport punct
from ..attrs cimport IS_SPACE
from ..attrs cimport attr_id_t
from ..tokens.token cimport Token
from ..tokens.doc cimport Doc


cdef class StateClass:
    def __init__(self, Doc doc=None, int offset=0):
        cdef Pool mem = Pool()
        self.mem = mem
        if doc is not None:
            self.c = new StateC(doc.c, doc.length)
            self.c.offset = offset

    def __dealloc__(self):
        del self.c

    @property
    def stack(self):
        return {self.S(i) for i in range(self.c._s_i)}

    @property
    def queue(self):
        return {self.B(i) for i in range(self.c.buffer_length())}

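    @property
    def token_vector_lenth(self):
        # NOTE: "lenth" is a misspelling of "length"; the name is kept as-is
        # because callers may reference it.
        return self.doc.tensor.shape[1]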

    def is_final(self):
        return self.c.is_final()

    def copy(self):
        cdef StateClass new_state = StateClass.init(self.c._sent, self.c.length)
        new_state.c.clone(self.c)
        return new_state

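    def print_state(self, words):
        # Debug helper: show the top three stack words (each suffixed with
        # its `head` value), then '|', then the first two buffer words.
        # The appended '_' is what an index of -1 displays as.
        words = list(words) + ['_']
        top = words[self.S(0)] + '_%d' % self.S_(0).head
        second = words[self.S(1)] + '_%d' % self.S_(1).head
        third = words[self.S(2)] + '_%d' % self.S_(2).head
        n0 = words[self.B(0)]
        n1 = words[self.B(1)]
        return ' '.join((third, second, top, '|', n0, n1))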

    @classmethod
    def nr_context_tokens(cls):
        return 13

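    def set_context_tokens(self, int[::1] output):
        # Fill `output` with the token indices used as parser features.
        # B(i) = i-th buffer word, S(i) = i-th stack word (0 = top).
        # L(t, n) / R(t, n) = n-th leftmost / rightmost child of token t.
        output[0] = self.B(0)  # B0
        output[1] = self.B(1)  # B1
        output[2] = self.S(0)  # S0
        output[3] = self.S(1)  # S1
        output[4] = self.S(2)  # S2
        output[5] = self.L(self.S(0), 1)  # S0L1
        # NOTE: index 6 is assigned twice, so S0L2 is overwritten by S0R1.
        output[6] = self.L(self.S(0), 2)  # S0L2 (overwritten below)
        output[6] = self.R(self.S(0), 1)  # S0R1
        output[7] = self.L(self.B(0), 1)  # B0L1
        output[8] = self.R(self.S(0), 2)  # S0R2
        output[9] = self.L(self.S(1), 1)  # S1L1
        output[10] = self.L(self.S(1), 2)  # S1L2
        output[11] = self.R(self.S(1), 1)  # S1R1
        output[12] = self.R(self.S(1), 2)  # S1R2

        # Shift every valid index by the state's offset; -1 entries are
        # left untouched.
        for i in range(13):
            if output[i] != -1:
                output[i] += self.c.offset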