spaCy/spacy/lang
Adriane Boyd 39ebcd9ec9
Refactor Chinese tokenizer configuration (#5736)
* Refactor Chinese tokenizer configuration

Refactor `ChineseTokenizer` configuration so that it uses a single
`segmenter` setting to choose between character segmentation, jieba, and
pkuseg.

* replace `use_jieba`, `use_pkuseg`, `require_pkuseg` with a single
`segmenter` setting that accepts `char`, `jieba`, or `pkuseg`
* make plain character segmentation (`char`) the default segmenter (no
additional libraries required)
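
As a rough illustration of the single-setting design described above, and not
the actual `ChineseTokenizer` code, a dispatch on `segmenter` might look like
the sketch below. The helper name `segment_text` and its `backends` argument
are hypothetical; the jieba and pkuseg branches are stood in for by
caller-supplied callables so the sketch runs without either library.

```python
from typing import Callable, Dict, List, Optional

# The three values named in the commit message.
SUPPORTED_SEGMENTERS = ("char", "jieba", "pkuseg")


def segment_text(
    text: str,
    segmenter: str = "char",
    backends: Optional[Dict[str, Callable[[str], List[str]]]] = None,
) -> List[str]:
    """Hypothetical helper: split `text` according to one `segmenter` setting."""
    if segmenter not in SUPPORTED_SEGMENTERS:
        raise ValueError(
            f"Unknown segmenter {segmenter!r}; expected one of {SUPPORTED_SEGMENTERS}"
        )
    if segmenter == "char":
        # Plain character segmentation: no additional libraries required.
        return list(text)
    # Assumption: external segmenters are passed in as plain callables.
    backend = (backends or {}).get(segmenter)
    if backend is None:
        raise RuntimeError(f"No backend available for segmenter {segmenter!r}")
    return backend(text)
```

With the default setting, `segment_text("中文")` yields one token per
character; selecting `jieba` or `pkuseg` without a backend fails loudly
instead of silently falling back.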

* Fix Chinese serialization test to use char default

* Warn if attempting to customize other segmenter

Add a warning if `Chinese.pkuseg_update_user_dict` is called when
another segmenter is selected.
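
The warning behaviour can be sketched in the same spirit. This is a minimal
stand-in under stated assumptions, not the body of
`Chinese.pkuseg_update_user_dict`: the function name `update_user_dict`, its
arguments, and the warning text are all hypothetical, and the user dictionary
is modelled as a plain `set`.

```python
import warnings
from typing import Iterable, Set


def update_user_dict(segmenter: str, words: Iterable[str], user_dict: Set[str]) -> bool:
    """Hypothetical helper: add `words` only when pkuseg is the selected segmenter."""
    if segmenter != "pkuseg":
        # Another segmenter is selected, so customizing pkuseg would have no effect.
        warnings.warn(
            f"Cannot update the pkuseg user dictionary: segmenter is {segmenter!r}",
            UserWarning,
        )
        return False
    user_dict.update(words)
    return True
```

Returning `False` alongside the warning lets a caller detect that nothing was
changed without having to inspect the warning itself.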
2020-07-19 13:34:37 +02:00
af Tidy up and auto-format 2020-02-18 15:38:18 +01:00
ar Drop Python 2.7 and 3.5 (#4828) 2019-12-22 01:53:56 +01:00
bg Tidy up and auto-format 2020-02-18 15:38:18 +01:00
bn Tidy up and auto-format 2020-02-18 15:38:18 +01:00
ca Tidy up and auto-format 2020-02-18 15:38:18 +01:00
cs Tidy up and auto-format 2020-02-18 15:38:18 +01:00
da Merge branch 'develop' into master-tmp 2020-05-21 18:39:06 +02:00
de Merge branch 'develop' into master-tmp 2020-05-21 18:39:06 +02:00
el Merge branch 'develop' into master-tmp 2020-06-20 15:52:00 +02:00
en Tidy up and auto-format 2020-06-21 22:38:04 +02:00
es Remove unicode declarations and tidy up 2020-06-21 22:34:10 +02:00
et Tidy up and auto-format 2020-02-18 15:38:18 +01:00
eu Remove unicode declarations 2020-03-26 15:18:32 +01:00
fa Merge branch 'develop' into master-tmp 2020-06-20 15:52:00 +02:00
fi Merge branch 'master' into tmp/sync 2020-03-26 13:38:14 +01:00
fr Merge branch 'develop' into master-tmp 2020-06-20 15:52:00 +02:00
ga Tidy up and auto-format 2020-02-18 15:38:18 +01:00
gu Remove unicode declarations and tidy up 2020-06-21 22:34:10 +02:00
he Tidy up and auto-format 2020-02-18 15:38:18 +01:00
hi Tidy up and auto-format 2020-02-18 15:38:18 +01:00
hr Drop Python 2.7 and 3.5 (#4828) 2019-12-22 01:53:56 +01:00
hu Merge branch 'develop' into master-tmp 2020-06-20 15:52:00 +02:00
hy Remove unicode declarations and tidy up 2020-06-21 22:34:10 +02:00
id Merge branch 'develop' into master-tmp 2020-06-20 15:52:00 +02:00
is Tidy up and auto-format 2020-02-18 15:38:18 +01:00
it Merge branch 'master' into tmp/sync 2020-03-26 13:38:14 +01:00
ja Tidy up and auto-format 2020-06-21 22:38:04 +02:00
kn Remove unicode declarations and tidy up 2020-06-21 22:34:10 +02:00
ko Merge branch 'develop' into master-tmp 2020-05-21 18:39:06 +02:00
lb Merge branch 'develop' into master-tmp 2020-05-21 18:39:06 +02:00
lij Remove unicode declarations 2020-03-26 15:18:32 +01:00
lt Remove unicode declarations 2020-03-26 15:18:32 +01:00
lv Tidy up and auto-format 2020-02-18 15:38:18 +01:00
ml Remove unicode declarations and tidy up 2020-06-21 22:34:10 +02:00
mr Tidy up and auto-format 2020-02-18 15:38:18 +01:00
nb Merge branch 'develop' into master-tmp 2020-06-20 15:52:00 +02:00
nl Merge branch 'develop' into master-tmp 2020-05-21 18:39:06 +02:00
pl Remove unicode declarations and tidy up 2020-06-21 22:34:10 +02:00
pt Merge branch 'develop' into master-tmp 2020-05-21 18:39:06 +02:00
ro Remove unicode declarations 2020-03-26 15:18:32 +01:00
ru Merge branch 'develop' into master-tmp 2020-05-21 18:39:06 +02:00
si Tidy up and auto-format 2020-02-18 15:38:18 +01:00
sk Tidy up and auto-format 2020-03-25 12:28:12 +01:00
sl Tidy up and auto-format 2020-02-18 15:38:18 +01:00
sq Tidy up and auto-format 2020-02-18 15:38:18 +01:00
sr Merge branch 'develop' into master-tmp 2020-05-21 18:39:06 +02:00
sv Remove unicode declarations and tidy up 2020-06-21 22:34:10 +02:00
ta Tidy up and auto-format 2020-06-21 22:38:04 +02:00
te Tidy up and auto-format 2020-02-18 15:38:18 +01:00
th Merge branch 'develop' into master-tmp 2020-05-21 18:39:06 +02:00
tl Drop Python 2.7 and 3.5 (#4828) 2019-12-22 01:53:56 +01:00
tr Tidy up and auto-format 2020-02-18 15:38:18 +01:00
tt Drop Python 2.7 and 3.5 (#4828) 2019-12-22 01:53:56 +01:00
uk Tidy up and auto-format 2020-02-18 15:38:18 +01:00
ur Merge branch 'develop' into master-tmp 2020-05-21 18:39:06 +02:00
vi Modify morphology to support arbitrary features (#4932) 2020-01-23 22:01:54 +01:00
xx Tidy up and auto-format 2020-02-18 15:38:18 +01:00
yo Tidy up and auto-format 2020-02-18 15:38:18 +01:00
zh Refactor Chinese tokenizer configuration (#5736) 2020-07-19 13:34:37 +02:00
__init__.py Remove imports in /lang/__init__.py 2017-05-08 23:58:07 +02:00
char_classes.py Merge branch 'master' into develop 2020-02-18 14:47:23 +01:00
lex_attrs.py Merge branch 'develop' into master-tmp 2020-05-21 18:39:06 +02:00
norm_exceptions.py Tidy up and auto-format 2020-02-18 15:38:18 +01:00
punctuation.py Drop Python 2.7 and 3.5 (#4828) 2019-12-22 01:53:56 +01:00
tag_map.py Drop Python 2.7 and 3.5 (#4828) 2019-12-22 01:53:56 +01:00
tokenizer_exceptions.py Tidy up and auto-format 2020-06-21 22:38:04 +02:00