spaCy/spacy/tests/lang/zh
Adriane Boyd 39ebcd9ec9
Refactor Chinese tokenizer configuration (#5736)
* Refactor Chinese tokenizer configuration

Refactor `ChineseTokenizer` configuration so that it uses a single
`segmenter` setting to choose between character segmentation, jieba, and
pkuseg.

* replace `use_jieba`, `use_pkuseg`, `require_pkuseg` with a single
`segmenter` setting that accepts the values `char`, `jieba`, `pkuseg`
* make plain character segmentation (`char`) the default segmenter (no
additional libraries required)
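
As a rough illustration of the single-setting design described above, the
following sketch dispatches on one `segmenter` value and defaults to `char`.
The names `SUPPORTED_SEGMENTERS`, `char_segment` and `get_segmenter` are
hypothetical and are not spaCy's API; the `jieba` and `pkuseg` branches assume
those optional packages are installed and are only imported when selected.

```python
# Illustrative sketch only: these names are hypothetical, not spaCy's API.
SUPPORTED_SEGMENTERS = ("char", "jieba", "pkuseg")


def char_segment(text):
    """Split text into single characters; needs no external library."""
    return list(text)


def get_segmenter(segmenter="char"):
    """Return a callable that segments text for the given setting."""
    if segmenter not in SUPPORTED_SEGMENTERS:
        raise ValueError(
            f"Unsupported segmenter {segmenter!r}; "
            f"expected one of {SUPPORTED_SEGMENTERS}"
        )
    if segmenter == "char":
        return char_segment
    if segmenter == "jieba":
        # Optional dependency, imported lazily (assumed to be installed).
        import jieba

        return lambda text: list(jieba.cut(text, cut_all=False))
    # Remaining case is "pkuseg"; also an optional dependency.
    import pkuseg

    return pkuseg.pkuseg().cut


segment = get_segmenter()  # defaults to "char"
print(segment("中文分词"))
```

Keeping the default on `char` means the sketch runs without any extra
packages, which mirrors the motivation given for the new default.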

* Fix Chinese serialization test to use char default

* Warn if attempting to customize another segmenter

Add a warning if `Chinese.pkuseg_update_user_dict` is called when
another segmenter is selected.
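
The warning behaviour can be pictured with a minimal sketch like the one
below. `SketchChineseTokenizer` and its `user_dict` attribute are hypothetical
stand-ins, not the real `ChineseTokenizer`; in particular, skipping the update
after warning is an assumption of this sketch, not something the commit
message states.

```python
import warnings


class SketchChineseTokenizer:
    """Hypothetical stand-in used only to illustrate the warning."""

    def __init__(self, segmenter="char"):
        self.segmenter = segmenter
        self.user_dict = set()

    def pkuseg_update_user_dict(self, words):
        # Only the pkuseg segmenter has a user dictionary to update.
        if self.segmenter != "pkuseg":
            warnings.warn(
                f"Cannot update the pkuseg user dict: segmenter is "
                f"{self.segmenter!r}, not 'pkuseg'.",
                UserWarning,
            )
            return  # assumption of this sketch: skip the update
        self.user_dict.update(words)


tokenizer = SketchChineseTokenizer(segmenter="jieba")
tokenizer.pkuseg_update_user_dict(["中国"])  # emits a UserWarning
```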
2020-07-19 13:34:37 +02:00
__init__.py Rework Chinese language initialization and tokenization (#4619) 2019-11-11 14:23:21 +01:00
test_serialize.py Refactor Chinese tokenizer configuration (#5736) 2020-07-19 13:34:37 +02:00
test_text.py Merge branch 'develop' into master-tmp 2020-05-21 18:39:06 +02:00
test_tokenizer.py Refactor Chinese tokenizer configuration (#5736) 2020-07-19 13:34:37 +02:00