spaCy/spacy
Paul O'Leary McCann 1ee6541ab0
Moving Japanese tokenizer extra info to Token.morph (#8977)
* Use morph for extra Japanese tokenizer info

Previously Japanese tokenizer info that didn't correspond to Token
fields was put in user data. Since spaCy core should avoid touching user
data, this moves most information to the Token.morph attribute. It also
adds the normalized form, which wasn't exposed before.

The subtokens, which are a list of full tokens, are still added to user
data, except with the default tokenizer granularity. With the default
tokenizer settings the subtokens are all None, so in this case the user
data is simply not set.

* Update tests

Also adds a new test for norm data.

* Update docs

* Add Japanese morphologizer factory

Set the default to `extend=True` so that the morphologizer does not
clobber the values set by the tokenizer.
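
A minimal sketch of the intent behind `extend=True`, written in plain Python rather than against spaCy itself. The helper names (`parse_feats`, `format_feats`, `merge_morph`) and the example feature values are hypothetical, and the rule that a predicted value wins when both sides set the same key is an assumption of this sketch, not a statement about spaCy's implementation.

```python
from typing import Dict


def parse_feats(feats: str) -> Dict[str, str]:
    """Parse a 'Key=Value|Key=Value' feature string into a dict."""
    if not feats:
        return {}
    return dict(pair.split("=", 1) for pair in feats.split("|"))


def format_feats(feats: Dict[str, str]) -> str:
    """Serialize a feature dict back to a string, sorted by key."""
    return "|".join(f"{key}={value}" for key, value in sorted(feats.items()))


def merge_morph(existing: str, predicted: str, extend: bool = True) -> str:
    """Combine tokenizer-set features with predicted ones (hypothetical helper).

    With extend=True, features already on the token are kept and the
    predicted ones are added on top. With extend=False, the predicted
    features replace everything that was there before.
    """
    old = parse_feats(existing)
    new = parse_feats(predicted)
    # Assumption of this sketch: on a key conflict the predicted value wins.
    merged = {**old, **new} if extend else new
    return format_feats(merged)


# Illustrative values only: a reading set by the tokenizer, a case feature
# predicted by a later component.
tokenizer_feats = "Reading=ニホン"
predicted_feats = "Case=Nom"

print(merge_morph(tokenizer_feats, predicted_feats, extend=True))
print(merge_morph(tokenizer_feats, predicted_feats, extend=False))
```

With `extend=False` the tokenizer's reading is lost, which is the clobbering this default is meant to avoid.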

* Use the norm_ field for normalized forms

Before this commit, normalized forms were put in the "norm" field in the
morph attributes. I am not sure why I did that instead of using the
token's norm_ field; I think I just forgot about it.

* Skip test if sudachipy is not installed

* Fix import

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
2021-10-01 19:19:26 +02:00
..
cli avoid crash when unicode in title (#9254) 2021-09-22 21:01:34 +02:00
displacy Adjust kb_id visualizer templating and docs 2021-09-23 11:59:02 +02:00
lang Moving Japanese tokenizer extra info to Token.morph (#8977) 2021-10-01 19:19:26 +02:00
matcher Merge remote-tracking branch 'upstream/master' into develop 2021-09-27 09:10:45 +02:00
ml Correct parser.py use_upper param info (#9180) 2021-09-10 16:19:58 +02:00
pipeline Add overwrite settings for more components (#9050) 2021-09-30 15:35:55 +02:00
tests Moving Japanese tokenizer extra info to Token.morph (#8977) 2021-10-01 19:19:26 +02:00
tokens Don't serialize user data in DocBin if not saving it (fix #9190) (#9226) 2021-10-01 12:37:39 +02:00
training Move WandB loggers into spacy-loggers (#9223) 2021-09-29 11:12:50 +02:00
__init__.pxd * Seems to be working after refactor. Need to wire up more POS tag features, and wire up save/load of POS tags. 2014-10-24 02:23:42 +11:00
__init__.py Tidy up and auto-format 2021-07-18 15:44:56 +10:00
__main__.py Tidy up 2020-06-22 00:45:40 +02:00
about.py Prepare for v3.1.3 (#9200) 2021-09-14 11:03:51 +02:00
attrs.pxd Merge branch 'develop' into master-tmp 2020-05-21 18:39:06 +02:00
attrs.pyx Update Cython string types (#9143) 2021-09-13 17:02:17 +02:00
compat.py Auto-detect package dependencies in spacy package (#8948) 2021-08-17 14:05:13 +02:00
default_config_pretraining.cfg Add new parameter for saving every n epoch in pretraining (#8912) 2021-08-12 11:14:48 +02:00
default_config.cfg Add training option to set annotations on update (#7767) 2021-04-26 16:53:53 +02:00
errors.py Merge remote-tracking branch 'upstream/master' into develop 2021-09-27 09:10:45 +02:00
glossary.py Add glossary entry for _SP (#8983) 2021-08-20 12:04:02 +02:00
kb.pxd Replace cpdef variables with cdef (#7834) 2021-04-26 16:54:02 +02:00
kb.pyx Update Cython string types (#9143) 2021-09-13 17:02:17 +02:00
language.py Merge remote-tracking branch 'upstream/master' into develop 2021-09-27 09:10:45 +02:00
lexeme.pxd Fix Lexeme.from_ptr 2020-08-10 16:43:37 +02:00
lexeme.pyi Add stub files for main cython classes (#8427) 2021-08-07 12:30:03 +02:00
lexeme.pyx Update Cython string types (#9143) 2021-09-13 17:02:17 +02:00
lookups.py Tidy up code 2021-06-28 12:08:15 +02:00
morphology.pxd Clean up Morphology imports and definitions (#7441) 2021-04-26 16:54:23 +02:00
morphology.pyx Clean up Morphology imports and definitions (#7441) 2021-04-26 16:54:23 +02:00
parts_of_speech.pxd Add support for Universal Dependencies v2.0 2017-03-03 13:17:34 +01:00
parts_of_speech.pyx Drop Python 2.7 and 3.5 (#4828) 2019-12-22 01:53:56 +01:00
pipe_analysis.py Tidy up and auto-format 2020-09-29 21:39:28 +02:00
py.typed Add py.typed 2021-03-16 09:48:31 +01:00
schemas.py Add new parameter for saving every n epoch in pretraining (#8912) 2021-08-12 11:14:48 +02:00
scorer.py Refactor scoring methods to use registered functions (#8766) 2021-08-10 15:13:39 +02:00
strings.pxd Update Cython string types (#9143) 2021-09-13 17:02:17 +02:00
strings.pyi Add stub files for main cython classes (#8427) 2021-08-07 12:30:03 +02:00
strings.pyx Update Cython string types (#9143) 2021-09-13 17:02:17 +02:00
structs.pxd Add SpanGroup and Graph container types to represent arbitrary annotations (#6696) 2021-01-14 17:30:41 +11:00
symbols.pxd introduce token.has_head and refer to MISSING_DEP_ (WIP) 2021-01-12 17:17:06 +01:00
symbols.pyx introduce token.has_head and refer to MISSING_DEP_ (WIP) 2021-01-12 17:17:06 +01:00
tokenizer.pxd Remove two attributes marked for removal in 3.1 (#9150) 2021-09-15 23:07:21 +02:00
tokenizer.pyx Update Cython string types (#9143) 2021-09-13 17:02:17 +02:00
typedefs.pxd Merge remote-tracking branch 'upstream/master' into chore/update-develop-from-master 2020-11-25 11:49:34 +01:00
typedefs.pyx Tidy up rest 2017-10-27 21:07:59 +02:00
util.py Merge remote-tracking branch 'upstream/master' into develop 2021-09-27 09:10:45 +02:00
vectors.pyx Fix vectors data on GPU (#7626) 2021-04-19 18:30:03 +10:00
vocab.pxd Remove two attributes marked for removal in 3.1 (#9150) 2021-09-15 23:07:21 +02:00
vocab.pyi Remove two attributes marked for removal in 3.1 (#9150) 2021-09-15 23:07:21 +02:00
vocab.pyx Remove two attributes marked for removal in 3.1 (#9150) 2021-09-15 23:07:21 +02:00