spaCy/spacy/tests/lang/zh/test_text.py

# coding: utf-8
from __future__ import unicode_literals


import pytest


@pytest.mark.parametrize(
    "text,match",
    [
        ("10", True),
        ("1", True),
        ("999.0", True),
        ("一", True),
        ("二", True),
        ("〇", True),
        ("十一", True),
        ("狗", False),
        (",", False),
    ],
)
def test_lex_attrs_like_number(zh_tokenizer_jieba, text, match):
    tokens = zh_tokenizer_jieba(text)
    assert len(tokens) == 1
    assert tokens[0].like_num == match
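
The parametrized cases above pin down what `like_num` should return for single-token inputs. As a rough, standalone illustration of a check that satisfies those nine cases, here is a minimal sketch. The helper name `like_num_sketch` and the numeral set are assumptions made for this example; this is not spaCy's actual `like_num` implementation, which is not shown in this file.

```python
# Hypothetical helper for illustration only; NOT spaCy's implementation.
# Assumed set of Chinese numeral characters (not taken from the source).
CHINESE_NUMERALS = set("〇零一二三四五六七八九十百千万亿")


def like_num_sketch(text):
    """Return True if `text` looks like a number under the assumptions above."""
    # Drop thousands separators and decimal points before checking digits.
    stripped = text.replace(",", "").replace(".", "")
    if not stripped:
        # Pure punctuation such as "," is not a number.
        return False
    if stripped.isdigit():
        return True
    # Otherwise accept strings made up entirely of Chinese numeral characters.
    return all(ch in CHINESE_NUMERALS for ch in stripped)


cases = [
    ("10", True), ("1", True), ("999.0", True),
    ("一", True), ("二", True), ("〇", True), ("十一", True),
    ("狗", False), (",", False),
]
results = [(text, like_num_sketch(text)) for text, _ in cases]
```

The sketch deliberately ignores signs, fractions, and full-width punctuation; it only mirrors the cases listed in the test.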

Rework Chinese language initialization and tokenization (#4619)

* Rework Chinese language initialization

* Create a `ChineseTokenizer` class
  * Modify jieba post-processing to handle whitespace correctly
  * Modify non-jieba character tokenization to handle whitespace correctly

* Add a `create_tokenizer()` method to `ChineseDefaults`

* Load lexical attributes

* Update Chinese tag_map for UD v2

* Add very basic Chinese tests

* Test tokenization with and without jieba

* Test `like_num` attribute

* Fix try_jieba_import()

* Fix zh code formatting

2019-11-11 16:23:21 +03:00

Add pkuseg and serialization support for Chinese (#5308)

* Add pkuseg and serialization support for Chinese

Add support for pkuseg alongside jieba

* Specify model through `Language` meta:

  * split on characters (if no word segmentation packages are installed)

```
Chinese(meta={"tokenizer": {"config": {"use_jieba": False, "use_pkuseg": False}}})
```

  * jieba (remains the default tokenizer if installed)

```
Chinese()
Chinese(meta={"tokenizer": {"config": {"use_jieba": True}}}) # explicit
```

  * pkuseg

```
Chinese(meta={"tokenizer": {"config": {"pkuseg_model": "default", "use_jieba": False, "use_pkuseg": True}}})
```

* The new tokenizer setting `require_pkuseg` overrides the `use_jieba`
default. It is intended for models that provide a pkuseg model:

```
nlp_pkuseg = Chinese(meta={"tokenizer": {"config": {"pkuseg_model": "default", "require_pkuseg": True}}})
nlp = Chinese() # has `use_jieba` as `True` by default
nlp.from_bytes(nlp_pkuseg.to_bytes()) # `require_pkuseg` overrides `use_jieba` when calling the tokenizer
```
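
The precedence described above can be sketched as a small pure function. The function name `resolve_segmenter` and the returned labels are hypothetical, and the defaults (`use_jieba` on, `use_pkuseg` off, `require_pkuseg` off) are assumptions read off the examples in this commit message rather than confirmed library behavior.

```python
# Hypothetical sketch of the segmenter choice; not spaCy's actual code.
def resolve_segmenter(config):
    """Pick a segmenter label from a tokenizer config dict."""
    # Assumed defaults, inferred from the commit message examples.
    use_jieba = config.get("use_jieba", True)
    use_pkuseg = config.get("use_pkuseg", False)

    # `require_pkuseg` overrides the `use_jieba` default.
    if config.get("require_pkuseg", False):
        use_jieba = False
        use_pkuseg = True

    if use_jieba:
        return "jieba"
    if use_pkuseg:
        return "pkuseg"
    # Fall back to splitting on characters.
    return "char"


default_choice = resolve_segmenter({})
required_choice = resolve_segmenter(
    {"pkuseg_model": "default", "require_pkuseg": True}
)
```

Resolving the choice when the tokenizer is called, rather than when the settings are loaded, would explain why a deserialized `require_pkuseg` can win over a pre-existing `use_jieba` value.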

Add support for serialization of tokenizer settings and pkuseg model, if
loaded

* Add sorting for `Language.to_bytes()` serialization of `Language.meta`
so that the (emptied, but still present) tokenizer metadata is in a
consistent position in the serialized data
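
To see why sorting matters for reproducible output, consider this minimal sketch. It uses the standard library `json` module as a stand-in; the serializer spaCy actually uses is not shown in this commit message, and `serialize_meta` is a hypothetical helper.

```python
import json


# Hypothetical helper: serialize a meta dict with keys in sorted order.
def serialize_meta(meta):
    return json.dumps(meta, sort_keys=True).encode("utf-8")


# Same content, different insertion order, including an emptied
# but still present "tokenizer" entry.
meta_a = {"lang": "zh", "tokenizer": {}, "name": "model"}
meta_b = {"tokenizer": {}, "name": "model", "lang": "zh"}

bytes_a = serialize_meta(meta_a)
bytes_b = serialize_meta(meta_b)
```

With sorted keys, both dicts produce identical bytes, so the position of the `tokenizer` entry no longer depends on when it was added to the dict.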

Extend tests to cover all three tokenizer configurations and
serialization

* Fix from_disk and tests without jieba or pkuseg

* Load cfg first and only show error if `use_pkuseg`
* Fix blank/default initialization in serialization tests

* Explicitly initialize jieba's cache on init

* Add serialization for pkuseg pre/postprocessors

* Reformat pkuseg install message

2020-04-18 18:01:53 +03:00