spaCy/spacy/lang/ko
Adriane Boyd 2a558a7cdc
Switch to mecab-ko as default Korean tokenizer (#11294)
* Switch to mecab-ko as default Korean tokenizer

Switch to the (confusingly named) mecab-ko Python module for default Korean
tokenization.

Maintain the previous `natto-py` tokenizer as
`spacy.KoreanNattoTokenizer.v1`.

* Temporarily run tests with mecab-ko tokenizer

* Fix types

* Fix duplicate test names

* Update requirements test

* Revert "Temporarily run tests with mecab-ko tokenizer"

This reverts commit d2083e7044.

* Add mecab_args setting, fix pickle for KoreanNattoTokenizer

* Fix length check

* Update docs

* Formatting

* Update natto-py error message

Co-authored-by: Paul O'Leary McCann <polm@dampfkraft.com>

2022-08-26 10:11:18 +02:00
__init__.py     Switch to mecab-ko as default Korean tokenizer (#11294)  2022-08-26 10:11:18 +02:00
examples.py     Merge branch 'develop' into master-tmp                   2020-05-21 18:39:06 +02:00
lex_attrs.py    Drop Python 2.7 and 3.5 (#4828)                          2019-12-22 01:53:56 +01:00
punctuation.py  Fix regex invalid escape sequences (#11276)              2022-08-09 10:59:36 +02:00
stop_words.py   Drop Python 2.7 and 3.5 (#4828)                          2019-12-22 01:53:56 +01:00
tag_map.py      Drop Python 2.7 and 3.5 (#4828)                          2019-12-22 01:53:56 +01:00
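
The commit message above says the previous `natto-py` tokenizer remains available as `spacy.KoreanNattoTokenizer.v1` and that a `mecab_args` setting was added. As a rough sketch of how one might choose between the two, the helper below builds a tokenizer config override. The helper name `korean_tokenizer_config` is hypothetical, and the registered name `spacy.ko.KoreanTokenizer` for the default tokenizer is an assumption not stated in this listing; check `__init__.py` before relying on it.

```python
from typing import Any, Dict


def korean_tokenizer_config(use_natto: bool = False, mecab_args: str = "") -> Dict[str, Any]:
    """Build an `nlp.tokenizer` override for a blank Korean pipeline.

    Hypothetical helper for illustration; not part of spaCy.
    """
    if use_natto:
        # Name taken from the commit message: the previous natto-py tokenizer.
        tokenizer: Dict[str, Any] = {"@tokenizers": "spacy.KoreanNattoTokenizer.v1"}
    else:
        # Assumed registered name of the default (mecab-ko based) tokenizer.
        # `mecab_args` is the setting named in the commit message.
        tokenizer = {
            "@tokenizers": "spacy.ko.KoreanTokenizer",
            "mecab_args": mecab_args,
        }
    return {"nlp": {"tokenizer": tokenizer}}


# Print both variants so the difference is visible.
print(korean_tokenizer_config())
print(korean_tokenizer_config(use_natto=True))
```

The resulting dict could then be passed as `spacy.blank("ko", config=...)`, assuming a spaCy version that includes this commit and the matching backend (`mecab-ko` or `natto-py`) is installed.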