Mirror of https://github.com/explosion/spaCy.git (synced 2025-01-12 02:06:31 +03:00)

Commit f7471abd82
* Add pkuseg and serialization support for Chinese

  Add support for pkuseg alongside jieba.

  * Specify the model through the `Language` meta:

    * split on characters (if no word segmentation packages are installed)

      ```
      Chinese(meta={"tokenizer": {"config": {"use_jieba": False, "use_pkuseg": False}}})
      ```

    * jieba (remains the default tokenizer if installed)

      ```
      Chinese()
      Chinese(meta={"tokenizer": {"config": {"use_jieba": True}}})  # explicit
      ```

    * pkuseg

      ```
      Chinese(meta={"tokenizer": {"config": {"pkuseg_model": "default", "use_jieba": False, "use_pkuseg": True}}})
      ```

  * The new tokenizer setting `require_pkuseg` is used to override the `use_jieba` default; it is intended for models that provide a pkuseg model:

    ```
    nlp_pkuseg = Chinese(meta={"tokenizer": {"config": {"pkuseg_model": "default", "require_pkuseg": True}}})
    nlp = Chinese()  # has `use_jieba` as `True` by default
    nlp.from_bytes(nlp_pkuseg.to_bytes())  # `require_pkuseg` overrides `use_jieba` when calling the tokenizer
    ```

  * Add support for serialization of the tokenizer settings and the pkuseg model, if loaded (see the sketch below)
  * Add sorting to `Language.to_bytes()` serialization of `Language.meta` so that the (emptied, but still present) tokenizer metadata is in a consistent position in the serialized data
  * Extend tests to cover all three tokenizer configurations and serialization

* Fix from_disk and tests without jieba or pkuseg
* Load cfg first and only show an error if `use_pkuseg` is set
* Fix blank/default initialization in serialization tests
* Explicitly initialize jieba's cache on init
* Add serialization for pkuseg pre/postprocessors
* Reformat pkuseg install message
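A minimal sketch of the serialization round-trip described in the commit message, assuming spaCy at this commit with the pkuseg package and its "default" model installed. The disk path and sample sentence are illustrative, and the to_disk/from_disk round-trip is assumed to behave like the to_bytes/from_bytes example above.

```python
from spacy.lang.zh import Chinese

# Build a pipeline whose tokenizer uses pkuseg (assumes the pkuseg package
# and its "default" model are available). `require_pkuseg` makes this setting
# win over the jieba default when the config is loaded into another pipeline.
nlp_pkuseg = Chinese(meta={"tokenizer": {"config": {
    "pkuseg_model": "default",
    "require_pkuseg": True,
}}})

# The tokenizer settings (and the pkuseg model, if one is loaded) travel with
# the serialized pipeline, so a plain Chinese() picks them up on load.
nlp_pkuseg.to_disk("/tmp/zh_pkuseg")   # illustrative path
nlp = Chinese()                        # would default to jieba
nlp.from_disk("/tmp/zh_pkuseg")        # pkuseg settings restored

print([t.text for t in nlp("我喜欢自然语言处理")])
```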
Directory listing at this commit:

cli
data
displacy
lang
matcher
ml
pipeline
syntax
tests
tokens
__init__.pxd
__init__.py
__main__.py
_align.pyx
_ml.py
about.py
analysis.py
attrs.pxd
attrs.pyx
compat.py
errors.py
glossary.py
gold.pxd
gold.pyx
kb.pxd
kb.pyx
language.py
lemmatizer.py
lexeme.pxd
lexeme.pyx
lookups.py
morphology.pxd
morphology.pyx
parts_of_speech.pxd
parts_of_speech.pyx
scorer.py
strings.pxd
strings.pyx
structs.pxd
symbols.pxd
symbols.pyx
tokenizer.pxd
tokenizer.pyx
typedefs.pxd
typedefs.pyx
util.py
vectors.pyx
vocab.pxd
vocab.pyx