Mirror of https://github.com/explosion/spaCy.git (synced 2024-12-26 18:06:29 +03:00)
Commit bd72fbf09c
* Port Japanese mecab tokenizer from v1

  This brings the Mecab-based Japanese tokenization introduced in #1246 to spaCy v2. There isn't a JapaneseTagger implementation yet, but POS tag information from Mecab is stored in a token extension. A tag map is also included.

  As a reminder, Mecab is required because Universal Dependencies are based on Unidic tags, and Janome doesn't support Unidic.

  Things to check:

  1. Is this the right way to use a token extension?
  2. What's the right way to implement a JapaneseTagger? The approach in #1246 relied on `tag_from_strings`, which is just gone now. I guess the best thing is to just try training spaCy's default Tagger?

  -POLM

* Add tagging/make_doc and tests
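The commit describes segmenting text with Mecab and keeping Mecab's POS output on each token via a custom extension attribute rather than in `Token.tag_`. Below is a minimal sketch of that idea, not the actual patch: it assumes the `mecab-python3` bindings and spaCy's public `Token.set_extension` / `Doc(words=..., spaces=...)` APIs, and the names `MecabTokenizer` and `mecab_tag` are illustrative, not the identifiers used in the commit.

```python
# Sketch only: store MeCab's coarse POS on tokens via an extension attribute.
# Assumes mecab-python3 and a MeCab dictionary are installed.
import MeCab
from spacy.tokens import Doc, Token
from spacy.vocab import Vocab

# Hypothetical extension slot for the raw MeCab POS string.
Token.set_extension("mecab_tag", default=None)


class MecabTokenizer(object):
    """Tokenizer that asks MeCab for word boundaries and POS tags."""

    def __init__(self, vocab):
        self.vocab = vocab
        self.tagger = MeCab.Tagger()
        self.tagger.parse("")  # common workaround for a mecab-python3 buffer issue

    def __call__(self, text):
        words = []
        tags = []
        node = self.tagger.parseToNode(text)
        while node:
            # Skip BOS/EOS nodes, which have empty surface forms.
            if node.surface:
                words.append(node.surface)
                tags.append(node.feature.split(",")[0])
            node = node.next
        doc = Doc(self.vocab, words=words, spaces=[False] * len(words))
        for token, tag in zip(doc, tags):
            token._.mecab_tag = tag
        return doc


if __name__ == "__main__":
    tokenizer = MecabTokenizer(Vocab())
    doc = tokenizer("すもももももももものうち")
    print([(t.text, t._.mecab_tag) for t in doc])
```

The commit also mentions adding `make_doc`; presumably the real patch wires a tokenizer like this into the Japanese `Language` subclass so that `nlp(text)` produces the Doc, but that integration is omitted here. Keeping the Mecab tag in an extension leaves `Token.tag_` free for a future JapaneseTagger trained against the included tag map.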
Directory listing: .., bn, da, de, en, es, fi, fr, ga, he, hu, id, ja, nb, ru, sv, th, tr, __init__.py, test_attrs.py