spaCy/spacy/lexeme.pyx

# cython: embedsignature=True
from cpython.ref cimport Py_INCREF
from cymem.cymem cimport Pool
from murmurhash.mrmr cimport hash64

from libc.string cimport memset

from .orth cimport word_shape
from .typedefs cimport attr_t
import numpy


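# Zero-initialise the module-level sentinel struct. EMPTY_LEXEME and LexemeC
# are declared outside this file (presumably in the accompanying .pxd).
memset(&EMPTY_LEXEME, 0, sizeof(LexemeC))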


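cdef int set_lex_struct_props(LexemeC* lex, dict props, StringStore string_store,
                              const float* empty_vec) except -1:
    # Copy the values in `props` into the C struct. A missing key raises
    # KeyError, which reaches the caller through the `except -1` clause.
    lex.length = props['length']
    # String-valued attributes are stored as the values returned by the
    # StringStore lookup, not as Python strings.
    lex.orth = string_store[props['orth']]
    lex.lower = string_store[props['lower']]
    lex.norm = string_store[props['norm']]
    lex.shape = string_store[props['shape']]
    lex.prefix = string_store[props['prefix']]
    lex.suffix = string_store[props['suffix']]

    # Numeric attributes are copied directly.
    lex.cluster = props['cluster']
    lex.prob = props['prob']
    lex.sentiment = props['sentiment']

    lex.flags = props['flags']
    # Point the vector at the caller-supplied placeholder.
    lex.repvec = empty_vec
    return 0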


cdef class Lexeme:
    """An entry in the vocabulary. A Lexeme has no string context: it is a
    word type, as opposed to a word token. It therefore has no part-of-speech
    tag, dependency parse, or lemma (lemmatization depends on the
    part-of-speech tag).
    """
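    def __cinit__(self, int vec_size):
        # Allocate an uninitialised float32 buffer of length vec_size.
        self.repvec = numpy.ndarray(shape=(vec_size,), dtype=numpy.float32)

    @property
    def has_repvec(self):
        """Whether the vector is non-zero, judged by its L2 norm."""
        return self.l2_norm != 0

    cpdef bint check(self, attr_id_t flag_id) except -1:
        """Return True if the boolean flag `flag_id` is set on this lexeme."""
        # NOTE: the literal 1 has plain C integer width, so this shift is only
        # reliable while flag_id stays below that width.
        return self.flags & (1 << flag_id)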
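

# Illustrative sketch only, not part of the original module: a pure-Python
# model of the bit test that Lexeme.check() performs on the packed flags field.
# The helper names below are hypothetical, and the 64-bit width of the flags
# field is an assumption, not something this file states.
_SKETCH_FLAG_BITS = 64  # assumed width of the packed flags field


def _sketch_set_flag(flags, flag_id):
    """Return `flags` with bit `flag_id` switched on."""
    if not 0 <= flag_id < _SKETCH_FLAG_BITS:
        raise ValueError("flag_id out of range: %r" % (flag_id,))
    return flags | (1 << flag_id)


def _sketch_check_flag(flags, flag_id):
    """Return True if bit `flag_id` is set in `flags`."""
    if not 0 <= flag_id < _SKETCH_FLAG_BITS:
        raise ValueError("flag_id out of range: %r" % (flag_id,))
    return bool(flags & (1 << flag_id))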