Mirror of https://github.com/explosion/spaCy.git (synced 2025-11-14 14:56:02 +03:00)
* Add pkuseg and serialization support for Chinese
Add support for pkuseg alongside jieba
* Specify model through `Language` meta:
* split on characters (if no word segmentation packages are installed)
```
Chinese(meta={"tokenizer": {"config": {"use_jieba": False, "use_pkuseg": False}}})
```
* jieba (remains the default tokenizer if installed)
```
Chinese()
Chinese(meta={"tokenizer": {"config": {"use_jieba": True}}}) # explicit
```
* pkuseg
```
Chinese(meta={"tokenizer": {"config": {"pkuseg_model": "default", "use_jieba": False, "use_pkuseg": True}}})
```
* The new tokenizer setting `require_pkuseg` overrides the `use_jieba`
default. It is intended for models that provide a pkuseg model:
```
nlp_pkuseg = Chinese(meta={"tokenizer": {"config": {"pkuseg_model": "default", "require_pkuseg": True}}})
nlp = Chinese() # has `use_jieba` as `True` by default
nlp.from_bytes(nlp_pkuseg.to_bytes()) # `require_pkuseg` overrides `use_jieba` when calling the tokenizer
```
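The precedence between these settings can be sketched in plain Python. This is only an illustration of the behaviour described above, not spaCy's actual tokenizer code: `select_segmenter`, `split_on_characters` and the `*_installed` flags are hypothetical names, and checking `use_jieba` before `use_pkuseg` when both are set is an assumption.

```python
def select_segmenter(config, jieba_installed=True, pkuseg_installed=True):
    """Return "jieba", "pkuseg" or "char" for a tokenizer config dict.

    Hypothetical helper: it mirrors the settings named in the commit
    message but is not part of spaCy.
    """
    use_jieba = config.get("use_jieba", True)      # jieba is the default
    use_pkuseg = config.get("use_pkuseg", False)

    # `require_pkuseg` overrides the `use_jieba` default.
    if config.get("require_pkuseg", False):
        use_jieba = False
        use_pkuseg = True

    # Assumption: jieba is checked first when both flags are set.
    if use_jieba and jieba_installed:
        return "jieba"
    if use_pkuseg:
        if not pkuseg_installed:
            raise ImportError("pkuseg is requested but not installed")
        return "pkuseg"
    # Fallback: no word segmenter is enabled or available.
    return "char"


def split_on_characters(text):
    """Character-level fallback: every character becomes one token."""
    return list(text)
```

With this sketch, an empty config selects jieba, a config with both flags set to `False` falls back to character splitting, and `require_pkuseg` selects pkuseg even though `use_jieba` is left at its default.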
Add support for serialization of tokenizer settings and pkuseg model, if
loaded
* Add sorting for `Language.to_bytes()` serialization of `Language.meta`
so that the (emptied, but still present) tokenizer metadata is in a
consistent position in the serialized data
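Why sorting matters can be shown with a small sketch. It uses the standard library `json` module as a stand-in for spaCy's own serialization helpers, and `serialize_meta` is a hypothetical name, so treat it as an illustration of the idea rather than the code in `Language.to_bytes()`.

```python
import json


def serialize_meta(meta):
    """Serialize a meta dict with sorted keys (hypothetical helper).

    Sorting makes the byte output independent of insertion order.
    """
    return json.dumps(meta, sort_keys=True).encode("utf8")


# Same content, different insertion order; the tokenizer entry is
# emptied but still present, as described above.
meta_a = {"lang": "zh", "name": "model", "tokenizer": {}}
meta_b = {"tokenizer": {}, "name": "model", "lang": "zh"}

# Sorted output is identical, so the tokenizer entry always ends up
# at the same position in the serialized data.
assert serialize_meta(meta_a) == serialize_meta(meta_b)
```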
Extend tests to cover all three tokenizer configurations and
serialization
* Fix from_disk and tests without jieba or pkuseg
* Load cfg first and only show error if `use_pkuseg`
* Fix blank/default initialization in serialization tests
* Explicitly initialize jieba's cache on init
* Add serialization for pkuseg pre/postprocessors
* Reformat pkuseg install message
| Name |
|---|
| ar |
| bn |
| ca |
| da |
| de |
| el |
| en |
| es |
| eu |
| fi |
| fr |
| ga |
| he |
| hu |
| hy |
| id |
| it |
| ja |
| ko |
| lb |
| lt |
| nb |
| nl |
| pl |
| pt |
| ro |
| ru |
| sr |
| sv |
| th |
| tr |
| tt |
| uk |
| ur |
| yo |
| zh |
| __init__.py |
| test_attrs.py |
| test_initialize.py |