Commit Graph

11 Commits

Author SHA1 Message Date
Paul O'Leary McCann
71ae8013ec [ja] Use user_details instead of a wrapper class
Instead of using a JapaneseDoc wrapper class to store Mecab output,
stash it in `user_data`. -POLM
2017-10-16 00:24:34 +09:00
Paul O'Leary McCann
43eedf73f2 [ja] Stash tokenizer output for speed
Before this commit, the Mecab tokenizer had to be called twice when
creating a Doc- once during tokenization and once during tagging. This
creates a JapaneseDoc wrapper class for Doc that stashes the parsed
tokenizer output to remove redundant processing. -POLM
2017-10-15 23:33:25 +09:00
Paul O'Leary McCann
8b3e1f7b5b Handle out-of-vocab words
Wasn't handling words out of the tokenizer dictionary vocabulary
properly. This adds a fix and test for that. -POLM
2017-08-29 23:58:42 +09:00
Paul O'Leary McCann
adfd987316 Update the TAG_MAP 2017-08-22 00:02:55 +09:00
Paul O'Leary McCann
53e17296e9 Fix pronoun handling
Missed this case earlier.

連体詞 have three classes for UD purposes:

- その -> DET
- それ -> PRON
- 同じ -> ADJ

-POLM
2017-08-22 00:01:49 +09:00
Paul O'Leary McCann
c435f748d7 Put Mecab import in utility function 2017-08-22 00:01:28 +09:00
Paul O'Leary McCann
6e9e686568 Sample implementation of Japanese Tagger (ref #1214)
This is far from complete but it should be enough to check some things.

1. Mecab transition. Janome doesn't support Unidic, only IPAdic, but UD
tag mappings are based on Unidic. This switches out Mecab for Janome to
get around that.

2. Raw tag extension. A simple tag map can't meet the specifications for
UD tag mappings, so this adds an extra field to ambiguous cases. For
this demo it just deals with the simplest case, which only needs to look
at the literal token. (In reality it may be necessary to look at the
whole sentence, but that's another issue.)

3. General code structure. Seems nobody else has implemented a custom
Tagger yet, so still not sure this is the correct way to pass the
vocabulary around, for example.

Any feedback would be greatly appreciated. -POLM
2017-08-08 01:27:15 +09:00
Paul O'Leary McCann
84041a2bb5 Make create_tokenizer work with Japanese 2017-06-28 01:18:05 +09:00
Ines Montani
3ea23a3f4d Fix formatting 2017-05-03 09:44:38 +02:00
Ines Montani
d730eb0c0d Raise custom ImportError if importing janome fails 2017-05-03 09:43:29 +02:00
Yasuaki Uechi
c8f83aeb87 Add basic japanese support 2017-05-03 13:56:21 +09:00