Use a pipe for separating Japanese inflections

Joined with a pipe, inflection values look like this:

    五段-ラ行|連用形-促音便

The individual fields already contain hyphens, so joining them with a
hyphen erases the boundaries between the original fields.
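
The ambiguity can be sketched in a few lines. This is an illustration only: the `fields` list is a hypothetical stand-in for what `token.part_of_speech()[4:]` might return, taken from the example value above, and `join_inflection` is a made-up helper name, not part of the tokenizer.

```python
# Hypothetical stand-in for token.part_of_speech()[4:] (assumed values,
# taken from the example in the commit message).
fields = ["五段-ラ行", "連用形-促音便"]


def join_inflection(fields, sep):
    """Join inflection fields with `sep`, skipping the "*" placeholder."""
    return sep.join(x for x in fields if x != "*")


hyphen_joined = join_inflection(fields, "-")
pipe_joined = join_inflection(fields, "|")

# Splitting the hyphen-joined string cannot recover the two fields,
# because each field contains a hyphen of its own.
print(hyphen_joined.split("-"))  # ['五段', 'ラ行', '連用形', '促音便']

# Splitting the pipe-joined string gives back the original fields.
print(pipe_joined.split("|"))  # ['五段-ラ行', '連用形-促音便']
```

The round trip only works as long as no field contains a pipe itself, which is assumed here rather than checked.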
Paul O'Leary McCann 2021-10-07 17:14:05 +09:00
parent f975690cc9
commit 227f98081b

@@ -94,7 +94,7 @@ class JapaneseTokenizer(DummyTokenizer):
 DetailedToken(
 token.surface(), # orth
 "-".join([xx for xx in token.part_of_speech()[:4] if xx != "*"]), # tag
-"-".join([xx for xx in token.part_of_speech()[4:] if xx != "*"]), # inf
+"|".join([xx for xx in token.part_of_speech()[4:] if xx != "*"]), # inf
 token.dictionary_form(), # lemma
 token.normalized_form(),
 token.reading_form(),