* Refactor Chinese tokenizer configuration
Refactor `ChineseTokenizer` configuration so that it uses a single
`segmenter` setting to choose between character segmentation, jieba, and
pkuseg.
* replace `use_jieba`, `use_pkuseg`, `require_pkuseg` with the setting
`segmenter` with the supported values: `char`, `jieba`, `pkuseg`
* make the default segmenter plain character segmentation `char` (no
additional libraries required)
* Fix Chinese serialization test to use char default
* Warn if attempting to customize other segmenter
Add a warning if `Chinese.pkuseg_update_user_dict` is called when
another segmenter is selected.
Remove corpus-specific tag maps from the language data for languages
without custom tokenizers. For languages with custom word segmenters
that also provide tags (Japanese and Korean), the tag maps for the
custom tokenizers are kept as the default.
The default tag maps for languages without custom tokenizers are now the
default tag map from `lang/tag_map/py`, UPOS -> UPOS.
Modify jieba install message to instruct the user to use
`ChineseDefaults.use_jieba = False` so that it's possible to load
pkuseg-only models without jieba installed.
* Add pkuseg and serialization support for Chinese
Add support for pkuseg alongside jieba
* Specify model through `Language` meta:
* split on characters (if no word segmentation packages are installed)
```
Chinese(meta={"tokenizer": {"config": {"use_jieba": False, "use_pkuseg": False}}})
```
* jieba (remains the default tokenizer if installed)
```
Chinese()
Chinese(meta={"tokenizer": {"config": {"use_jieba": True}}}) # explicit
```
* pkuseg
```
Chinese(meta={"tokenizer": {"config": {"pkuseg_model": "default", "use_jieba": False, "use_pkuseg": True}}})
```
* The new tokenizer setting `require_pkuseg` is used to override
`use_jieba` default, which is intended for models that provide a pkuseg
model:
```
nlp_pkuseg = Chinese(meta={"tokenizer": {"config": {"pkuseg_model": "default", "require_pkuseg": True}}})
nlp = Chinese() # has `use_jieba` as `True` by default
nlp.from_bytes(nlp_pkuseg.to_bytes()) # `require_pkuseg` overrides `use_jieba` when calling the tokenizer
```
Add support for serialization of tokenizer settings and pkuseg model, if
loaded
* Add sorting for `Language.to_bytes()` serialization of `Language.meta`
so that the (emptied, but still present) tokenizer metadata is in a
consistent position in the serialized data
Extend tests to cover all three tokenizer configurations and
serialization
* Fix from_disk and tests without jieba or pkuseg
* Load cfg first and only show error if `use_pkuseg`
* Fix blank/default initialization in serialization tests
* Explicitly initialize jieba's cache on init
* Add serialization for pkuseg pre/postprocessors
* Reformat pkuseg install message
* Rework Chinese language initialization
* Create a `ChineseTokenizer` class
* Modify jieba post-processing to handle whitespace correctly
* Modify non-jieba character tokenization to handle whitespace correctly
* Add a `create_tokenizer()` method to `ChineseDefaults`
* Load lexical attributes
* Update Chinese tag_map for UD v2
* Add very basic Chinese tests
* Test tokenization with and without jieba
* Test `like_num` attribute
* Fix try_jieba_import()
* Fix zh code formatting
* Add xfail test for vocab.writing_system
* Add vocab.writing_system property
* Set Language.Defaults.writing_system
* Set default writing system
* Remove xfail on test_vocab_writing_system
<!--- Provide a general summary of your changes in the title. -->
## Description
- [x] Use [`black`](https://github.com/ambv/black) to auto-format all `.py` files.
- [x] Update flake8 config to exclude very large files (lemmatization tables etc.)
- [x] Update code to be compatible with flake8 rules
- [x] Fix various small bugs, inconsistencies and messy stuff in the language data
- [x] Update docs to explain new code style (`black`, `flake8`, when to use `# fmt: off` and `# fmt: on` and what `# noqa` means)
Once #2932 is merged, which auto-formats and tidies up the CLI, we'll be able to run `flake8 spacy` actually get meaningful results.
At the moment, the code style and linting isn't applied automatically, but I'm hoping that the new [GitHub Actions](https://github.com/features/actions) will let us auto-format pull requests and post comments with relevant linting information.
### Types of change
enhancement, code style
## Checklist
<!--- Before you submit the PR, go over this checklist and make sure you can
tick off all the boxes. [] -> [x] -->
- [x] I have submitted the spaCy Contributor Agreement.
- [x] I ran the tests, and all new and existing tests passed.
- [x] My changes don't require a change to the documentation, or if they do, I've added all required information.