spaCy/test_issue2754.py at 5d0b60999d7502e88e811f72a7201719b0ed1f2b

explosion/spaCy

mirror of https://github.com/explosion/spaCy.git synced 2025-07-05 04:13:05 +03:00

Matthew Honnibal 8aa7882762

Make NORM a token attribute (#3029)

See #3028. The solution in this patch is pretty debatable.

What we do is give the TokenC struct a .norm field, by repurposing the previously idle .sense attribute. It's nice to repurpose an existing field because it means the TokenC doesn't change size, so even if someone's using the internals very deeply, nothing will break.

The weird thing here is that the TokenC and the LexemeC both have an attribute named NORM. This arguably assists in backwards compatibility. On the other hand, maybe it's really bad! We're changing the semantics of the attribute subtly, so maybe it's better if someone calling lex.norm gets a breakage, and instead is told to write lex.default_norm?

Overall I believe this patch makes the NORM feature work the way we sort of expected it to work. Certainly it's much more like how the docs describe it, and more in line with how we've been directing people to use the norm attribute. We'll also be able to use token.norm to do stuff like spelling correction, which is pretty cool.
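
To make the token-level versus lexeme-level distinction concrete, here is a minimal pure-Python sketch. It is not spaCy's Cython code: the Lexeme and Token classes, the norm_override field and the fallback rule are all hypothetical stand-ins. It assumes that a token with no norm of its own falls back to its lexeme's default, which the message above implies but does not spell out.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Lexeme:
    """Hypothetical stand-in for a vocabulary entry shared by all occurrences."""
    text: str
    default_norm: str


@dataclass
class Token:
    """Hypothetical stand-in for one occurrence of a lexeme in a document."""
    lex: Lexeme
    # None means "no token-level norm set" (an assumption of this sketch).
    norm_override: Optional[str] = None

    @property
    def norm(self) -> str:
        # Prefer the per-token value; otherwise use the lexeme's default.
        if self.norm_override is not None:
            return self.norm_override
        return self.lex.default_norm


teh = Lexeme(text="teh", default_norm="teh")
first = Token(teh)
second = Token(teh)

# A spelling corrector could fix one occurrence without touching the
# shared lexeme, so other occurrences keep the default.
second.norm_override = "the"
```

In this sketch first.norm still returns the lexeme default while second.norm returns the corrected value, which is the kind of per-occurrence behaviour a context-dependent norm needs.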

2018-12-08 10:49:10 +01:00


# coding: utf8
from __future__ import unicode_literals

import pytest
from spacy.lang.en import English


def test_issue2754():
    """Test that words like 'a' and 'a.m.' don't get exceptional norm values."""
    nlp = English()
    # A bare 'a' should keep its own text as its norm.
    a = nlp('a')
    assert a[0].norm_ == 'a'
    # Likewise, a standalone 'am' should not pick up a norm from an exception.
    am = nlp('am')
    assert am[0].norm_ == 'am'