* Port Japanese mecab tokenizer from v1

  This brings the Mecab-based Japanese tokenization introduced in #1246 to spaCy v2. There isn't a JapaneseTagger implementation yet, but POS tag information from Mecab is stored in a token extension. A tag map is also included.

  As a reminder, Mecab is required because Universal Dependencies are based on Unidic tags, and Janome doesn't support Unidic.

  Things to check:

  1. Is this the right way to use a token extension?
  2. What's the right way to implement a JapaneseTagger? The approach in #1246 relied on `tag_from_strings`, which is just gone now. I guess the best thing is to just try training spaCy's default Tagger?

  -POLM

* Add tagging/make_doc and tests
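The commit keeps MeCab's POS output on each token via spaCy v2's extension-attribute API instead of a full JapaneseTagger. Below is a minimal sketch of that idea, not the PR's actual code: it assumes the `mecab-python3` bindings and MeCab's default tab/comma output format, and the attribute name `mecab_tag` and helper `tokenize_with_mecab` are illustrative.

```python
import MeCab
from spacy.tokens import Doc, Token
from spacy.vocab import Vocab

# Register a custom attribute; readable later as token._.mecab_tag
# (name is illustrative, not necessarily what the PR uses).
Token.set_extension("mecab_tag", default=None)

def tokenize_with_mecab(text, vocab):
    """Build a Doc from MeCab output, keeping each token's POS feature."""
    tagger = MeCab.Tagger()
    words, tags = [], []
    for line in tagger.parse(text).split("\n"):
        # Skip the end-of-sentence marker and blank lines.
        if line == "EOS" or not line.strip():
            continue
        surface, features = line.split("\t")
        words.append(surface)
        tags.append(features.split(",")[0])  # first field is the coarse POS
    doc = Doc(vocab, words=words, spaces=[False] * len(words))
    for token, tag in zip(doc, tags):
        token._.mecab_tag = tag
    return doc

doc = tokenize_with_mecab("これはテストです。", Vocab())
print([(t.text, t._.mecab_tag) for t in doc])
```

Stashing the raw POS string this way keeps tokenization usable on its own until the tagging question is settled, e.g. by training spaCy's default Tagger against the included tag map.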
Directory listing:

- bn
- da
- de
- en
- es
- fi
- fr
- ga
- he
- hu
- id
- ja
- nb
- ru
- sv
- th
- tr
- __init__.py
- test_attrs.py