mirror of
https://github.com/explosion/spaCy.git
synced 2025-01-15 03:56:23 +03:00
f7471abd82
* Add pkuseg and serialization support for Chinese Add support for pkuseg alongside jieba * Specify model through `Language` meta: * split on characters (if no word segmentation packages are installed) ``` Chinese(meta={"tokenizer": {"config": {"use_jieba": False, "use_pkuseg": False}}}) ``` * jieba (remains the default tokenizer if installed) ``` Chinese() Chinese(meta={"tokenizer": {"config": {"use_jieba": True}}}) # explicit ``` * pkuseg ``` Chinese(meta={"tokenizer": {"config": {"pkuseg_model": "default", "use_jieba": False, "use_pkuseg": True}}}) ``` * The new tokenizer setting `require_pkuseg` is used to override `use_jieba` default, which is intended for models that provide a pkuseg model: ``` nlp_pkuseg = Chinese(meta={"tokenizer": {"config": {"pkuseg_model": "default", "require_pkuseg": True}}}) nlp = Chinese() # has `use_jieba` as `True` by default nlp.from_bytes(nlp_pkuseg.to_bytes()) # `require_pkuseg` overrides `use_jieba` when calling the tokenizer ``` Add support for serialization of tokenizer settings and pkuseg model, if loaded * Add sorting for `Language.to_bytes()` serialization of `Language.meta` so that the (emptied, but still present) tokenizer metadata is in a consistent position in the serialized data Extend tests to cover all three tokenizer configurations and serialization * Fix from_disk and tests without jieba or pkuseg * Load cfg first and only show error if `use_pkuseg` * Fix blank/default initialization in serialization tests * Explicitly initialize jieba's cache on init * Add serialization for pkuseg pre/postprocessors * Reformat pkuseg install message
26 lines
512 B
Python
26 lines
512 B
Python
# coding: utf-8
|
||
from __future__ import unicode_literals
|
||
|
||
|
||
import pytest
|
||
|
||
|
||
@pytest.mark.parametrize(
|
||
"text,match",
|
||
[
|
||
("10", True),
|
||
("1", True),
|
||
("999.0", True),
|
||
("一", True),
|
||
("二", True),
|
||
("〇", True),
|
||
("十一", True),
|
||
("狗", False),
|
||
(",", False),
|
||
],
|
||
)
|
||
def test_lex_attrs_like_number(zh_tokenizer_jieba, text, match):
|
||
tokens = zh_tokenizer_jieba(text)
|
||
assert len(tokens) == 1
|
||
assert tokens[0].like_num == match
|