spaCy/spacy/structs.pxd
Matthew Honnibal 8aa7882762
Make NORM a token attribute (#3029)
See #3028. The solution in this patch is pretty debatable.

What we do is give the TokenC struct a .norm field, by repurposing the previously idle .sense attribute. It's nice to repurpose a previous field because it means the TokenC doesn't change size, so even if someone's using the internals very deeply, nothing will break.
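
The size argument above can be checked with a small layout sketch. This is only an illustration using Python's `ctypes`, not the actual Cython struct: `TokenBefore` and `TokenAfter` are hypothetical, trimmed-down stand-ins for `TokenC`, and it assumes `attr_t` is a 64-bit unsigned integer and that the old `.sense` field had the same type as the new `.norm` field.

```python
import ctypes

# Assumption: attr_t is 8 bytes wide, as the 8-byte terms in the
# SerializedLexemeC size expression (8 + 8*10 + 4 + 4) suggest.
attr_t = ctypes.c_uint64


class TokenBefore(ctypes.Structure):
    # Hypothetical, trimmed-down stand-in for TokenC before the patch.
    _fields_ = [
        ("tag", attr_t),
        ("idx", ctypes.c_int),
        ("lemma", attr_t),
        ("sense", attr_t),  # the previously idle field
        ("head", ctypes.c_int),
    ]


class TokenAfter(ctypes.Structure):
    # Same stand-in after the patch: only the field name changes.
    _fields_ = [
        ("tag", attr_t),
        ("idx", ctypes.c_int),
        ("lemma", attr_t),
        ("norm", attr_t),  # same type, same position
        ("head", ctypes.c_int),
    ]


same_size = ctypes.sizeof(TokenBefore) == ctypes.sizeof(TokenAfter)
same_offset = TokenBefore.sense.offset == TokenAfter.norm.offset
print(same_size, same_offset)  # True True
```

Because the replaced field has the same type and sits in the same slot, every other field keeps its offset, which is why code that reads the struct directly would keep working.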

The weird thing here is that the TokenC and the LexemeC both have an attribute named NORM. This arguably assists in backwards compatibility. On the other hand, maybe it's really bad! We're changing the semantics of the attribute subtly, so maybe it's better if someone calling lex.norm gets a breakage, and instead is told to write lex.default_norm?

Overall I believe this patch makes the NORM feature work the way we sort of expected it to work. Certainly it's much more like how the docs describe it, and more in line with how we've been directing people to use the norm attribute. We'll also be able to use token.norm to do stuff like spelling correction, which is pretty cool.
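
A rough sketch of the idea, in plain Python rather than Cython: a per-token norm that can be overridden, with the lexeme's norm as the default. The `Lexeme` and `Token` classes and the `effective_norm` method are hypothetical names for illustration, and the fallback rule (an unset token norm defers to the lexeme) is an assumption, not something this commit message spells out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Lexeme:
    # Hypothetical stand-in for LexemeC: shared by every token of this type.
    orth: str
    norm: str


@dataclass
class Token:
    # Hypothetical stand-in for TokenC: one per occurrence in a document.
    lex: Lexeme
    norm: str = ""  # empty means "not set for this token"

    def effective_norm(self) -> str:
        # Assumed lookup rule: prefer the per-token value,
        # otherwise fall back to the lexeme's default.
        return self.norm or self.lex.norm


lexeme = Lexeme(orth="teh", norm="teh")
first = Token(lex=lexeme)
second = Token(lex=lexeme, norm="the")  # e.g. set by a spelling corrector

print(first.effective_norm())   # teh
print(second.effective_norm())  # the
```

Both tokens share one lexeme, so a correction applied to `second` does not change what `first` or the vocabulary entry reports.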
2018-12-08 10:49:10 +01:00

74 lines
1.4 KiB
Cython

from libc.stdint cimport uint8_t, uint32_t, int32_t, uint64_t
from .typedefs cimport flags_t, attr_t, hash_t
from .parts_of_speech cimport univ_pos_t


cdef struct LexemeC:
    flags_t flags
    attr_t lang
    attr_t id
    attr_t length
    attr_t orth
    attr_t lower
    attr_t norm
    attr_t shape
    attr_t prefix
    attr_t suffix
    attr_t cluster
    float prob
    float sentiment


cdef struct SerializedLexemeC:
    unsigned char[8 + 8*10 + 4 + 4] data
    #    sizeof(flags_t)  # flags
    #    + sizeof(attr_t) # lang
    #    + sizeof(attr_t) # id
    #    + sizeof(attr_t) # length
    #    + sizeof(attr_t) # orth
    #    + sizeof(attr_t) # lower
    #    + sizeof(attr_t) # norm
    #    + sizeof(attr_t) # shape
    #    + sizeof(attr_t) # prefix
    #    + sizeof(attr_t) # suffix
    #    + sizeof(attr_t) # cluster
    #    + sizeof(float)  # prob
    #    + sizeof(float)  # sentiment


cdef struct Entity:
    hash_t id
    int start
    int end
    attr_t label


cdef struct TokenC:
    const LexemeC* lex
    uint64_t morph
    univ_pos_t pos
    bint spacy
    attr_t tag
    int idx
    attr_t lemma
    attr_t norm
    int head
    attr_t dep
    uint32_t l_kids
    uint32_t r_kids
    uint32_t l_edge
    uint32_t r_edge
    int sent_start
    int ent_iob
    attr_t ent_type  # TODO: Is there a better way to do this? Multiple sources of truth..
    hash_t ent_id
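

The `SerializedLexemeC` buffer size, `8 + 8*10 + 4 + 4`, can be sanity-checked with a short sketch using Python's `struct` module. This is only an illustration of the arithmetic, not the serialization code used in the library: `pack_lexeme`, `unpack_lexeme`, `LEXEME_FORMAT` and `ATTR_FIELDS` are hypothetical names, the field order follows the `LexemeC` declaration, and the little-endian, unpadded layout is an assumption.

```python
import struct

# Assumption: flags_t and attr_t are 8 bytes and float is 4 bytes, matching
# the terms in `8 + 8*10 + 4 + 4`. "<" means little-endian with no padding;
# the real code may use native byte order instead.
LEXEME_FORMAT = "<Q10Qff"  # flags, ten attr_t fields, prob, sentiment
ATTR_FIELDS = ("lang", "id", "length", "orth", "lower",
               "norm", "shape", "prefix", "suffix", "cluster")


def pack_lexeme(flags, attrs, prob, sentiment):
    # Lay the fields out in declaration order.
    values = [attrs[name] for name in ATTR_FIELDS]
    return struct.pack(LEXEME_FORMAT, flags, *values, prob, sentiment)


def unpack_lexeme(data):
    # Reverse of pack_lexeme: split the flat tuple back into named fields.
    flags, *rest = struct.unpack(LEXEME_FORMAT, data)
    attrs = dict(zip(ATTR_FIELDS, rest[:10]))
    prob, sentiment = rest[10:]
    return flags, attrs, prob, sentiment


attrs = {name: i for i, name in enumerate(ATTR_FIELDS, start=1)}
data = pack_lexeme(3, attrs, -1.5, 0.25)
print(len(data))  # 96
```

The packed length matches the declared array size: one 8-byte flags field, ten 8-byte attributes, and the two 4-byte floats of `LexemeC`.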