spaCy/spacy/tests/lang
Paul O'Leary McCann bd72fbf09c Port Japanese mecab tokenizer from v1 (#2036)
* Port Japanese mecab tokenizer from v1

This brings the Mecab-based Japanese tokenization introduced in #1246 to
spaCy v2. There isn't a JapaneseTagger implementation yet, but POS tag
information from Mecab is stored in a token extension. A tag map is also
included.

As a reminder, Mecab is required because Universal Dependencies are
based on Unidic tags, and Janome doesn't support Unidic.

Things to check:

1. Is this the right way to use a token extension?

2. What's the right way to implement a JapaneseTagger? The approach in
#1246 relied on `tag_from_strings` which is just gone now. I guess the
best thing is to just try training spaCy's default Tagger?

-POLM

* Add tagging/make_doc and tests
2018-05-03 18:38:26 +02:00
bn Move language-specific tests to tests/lang 2017-05-09 00:02:37 +02:00
da Add Danish lemmatizer (#2184) 2018-04-07 19:07:28 +02:00
de Add German lemmatizer tests 2017-10-11 13:27:26 +02:00
en Drop six and related hacks as a dependency 2018-03-28 10:45:25 +02:00
es Move language-specific tests to tests/lang 2017-05-09 00:02:37 +02:00
fi Move language-specific tests to tests/lang 2017-05-09 00:02:37 +02:00
fr Fix French test (see #1617) 2017-11-20 13:59:59 +01:00
ga merge 2017-10-31 22:55:59 +00:00
he Move language-specific tests to tests/lang 2017-05-09 00:02:37 +02:00
hu Update tests 2017-06-05 02:09:27 +02:00
id added {pre,suf,in}fix tests 2017-08-20 13:43:00 +07:00
ja Port Japanese mecab tokenizer from v1 (#2036) 2018-05-03 18:38:26 +02:00
nb Move language-specific tests to tests/lang 2017-05-09 00:02:37 +02:00
ru Added tag map, fixed tests fails, added more exceptions 2017-11-26 20:54:48 +03:00
sv Move language-specific tests to tests/lang 2017-05-09 00:02:37 +02:00
th add thai in spacy2 2017-09-26 21:36:27 +07:00
tr Adds Turkish Lemmatization 2017-12-01 17:04:32 +03:00
__init__.py Remove imports in /lang/__init__.py 2017-05-08 23:58:07 +02:00
test_attrs.py added lex test for is_currency 2018-02-11 18:50:50 +01:00