spaCy/spacy/tokens
Adriane Boyd 53c0fb7431
Only set NORM on Token in retokenizer (#6464)
* Only set NORM on Token in retokenizer

Instead of setting `NORM` on both the token and lexeme, set `NORM` only
on the token.

The retokenizer tries to set all possible attributes with
`Token/Lexeme.set_struct_attr` so that it doesn't have to enumerate
which attributes are available for each. `NORM` is the only attribute
that's stored on both, and in most cases it doesn't make sense to set
the global norms based on an individual retokenization. For lexeme-only
attributes like `IS_STOP` there's no way to avoid the global side
effects, but I think that `NORM` would be better only on the token.

* Fix test
2020-11-30 09:35:42 +08:00
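
The distinction the commit message draws between token-level and lexeme-level attributes can be illustrated with a small toy model. This is only a sketch: the `Lexeme`, `Vocab`, `Token`, and `set_attrs` names below are hypothetical stand-ins written in plain Python, not spaCy's Cython structs or its `set_struct_attr` functions. It assumes that a token-level norm, when set, takes precedence over the norm of the shared lexeme.

```python
# Toy model only: these classes are hypothetical and are not spaCy's implementation.
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class Lexeme:
    """Vocabulary entry shared by every token with the same text."""
    text: str
    norm: str
    is_stop: bool = False


@dataclass
class Vocab:
    lexemes: Dict[str, Lexeme] = field(default_factory=dict)

    def get(self, text: str) -> Lexeme:
        # Create the shared entry on first use; the default norm is the lowercased text.
        if text not in self.lexemes:
            self.lexemes[text] = Lexeme(text=text, norm=text.lower())
        return self.lexemes[text]


@dataclass
class Token:
    lex: Lexeme
    norm_override: Optional[str] = None  # token-level NORM, unset by default

    @property
    def norm(self) -> str:
        # A token-level norm wins; otherwise fall back to the shared lexeme.
        return self.norm_override if self.norm_override is not None else self.lex.norm


def set_attrs(token: Token, attrs: Dict[str, object]) -> None:
    """Apply retokenization attributes along the lines the commit message describes."""
    for name, value in attrs.items():
        if name == "NORM":
            token.norm_override = str(value)  # token only, no global effect
        elif name == "IS_STOP":
            token.lex.is_stop = bool(value)   # lexeme-only, so the change is global


vocab = Vocab()
merged = Token(vocab.get("New York"))
other = Token(vocab.get("New York"))  # a token elsewhere that shares the same lexeme

set_attrs(merged, {"NORM": "nyc", "IS_STOP": True})

print(merged.norm)        # nyc
print(other.norm)         # new york
print(other.lex.is_stop)  # True
```

In this model, setting `NORM` changes only the retokenized token, while `IS_STOP` still leaks to every token that shares the lexeme, which is the side effect the commit message says cannot be avoided for lexeme-only attributes.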

| File | Last commit | Date |
| --- | --- | --- |
| `__init__.pxd` | * Break up tokens.pyx into tokens/doc.pyx, tokens/token.pyx, tokens/spans.pyx | 2015-07-13 20:20:58 +02:00 |
| `__init__.py` | DocPallet -> DocBin | 2019-09-18 15:15:37 +02:00 |
| `_retokenize.pyx` | Only set NORM on Token in retokenizer (#6464) | 2020-11-30 09:35:42 +08:00 |
| `_serialize.py` | Include Doc.cats in serialization of Doc and DocBin (#4774) | 2019-12-06 14:07:39 +01:00 |
| `doc.pxd` | Normalize TokenC.sent_start values for Matcher (#5346) | 2020-04-29 12:57:30 +02:00 |
| `doc.pyx` | Add ent_id_ to strings serialized with Doc (#6353) | 2020-11-10 20:16:07 +08:00 |
| `morphanalysis.pxd` | Add header for morphanalysis | 2019-03-07 17:24:57 +01:00 |
| `morphanalysis.pyx` | Remove MorphAnalysis `__str__` and `__repr__` | 2020-05-29 14:33:47 +02:00 |
| `span.pxd` | annotate kb_id through ents in doc | 2019-03-22 11:36:44 +01:00 |
| `span.pyx` | Fix/span.sent (#6083) | 2020-10-01 14:01:52 +02:00 |
| `token.pxd` | serialize ENT_ID (#4852) | 2020-01-06 14:57:34 +01:00 |
| `token.pyx` | Fix polarity of Token.is_oov and Lexeme.is_oov (#5634) | 2020-06-23 13:29:51 +02:00 |
| `underscore.py` | load Underscore state when multiprocessing | 2020-02-12 11:50:42 +01:00 |