spaCy/spacy/lang/zh
Adriane Boyd 39ebcd9ec9
Refactor Chinese tokenizer configuration (#5736)
* Refactor Chinese tokenizer configuration

Refactor `ChineseTokenizer` configuration so that it uses a single
`segmenter` setting to choose between character segmentation, jieba, and
pkuseg.

* replace `use_jieba`, `use_pkuseg`, `require_pkuseg` with the setting
`segmenter` with the supported values: `char`, `jieba`, `pkuseg`
* make the default segmenter plain character segmentation `char` (no
additional libraries required)

* Fix Chinese serialization test to use char default

* Warn if attempting to customize other segmenter

Add a warning if `Chinese.pkuseg_update_user_dict` is called when
another segmenter is selected.
2020-07-19 13:34:37 +02:00
..
__init__.py Refactor Chinese tokenizer configuration (#5736) 2020-07-19 13:34:37 +02:00
examples.py Tidy up and auto-format 2020-02-18 15:38:18 +01:00
lex_attrs.py Drop Python 2.7 and 3.5 (#4828) 2019-12-22 01:53:56 +01:00
stop_words.py Drop Python 2.7 and 3.5 (#4828) 2019-12-22 01:53:56 +01:00
tag_map.py Merge branch 'develop' into master-tmp 2020-06-20 15:52:00 +02:00