Commit Graph

7 Commits

Author SHA1 Message Date
Adriane Boyd
39ebcd9ec9
Refactor Chinese tokenizer configuration (#5736)
* Refactor Chinese tokenizer configuration

Refactor `ChineseTokenizer` configuration so that it uses a single
`segmenter` setting to choose between character segmentation, jieba, and
pkuseg.

* replace `use_jieba`, `use_pkuseg`, `require_pkuseg` with the setting
`segmenter` with the supported values: `char`, `jieba`, `pkuseg`
* make the default segmenter plain character segmentation `char` (no
additional libraries required)

* Fix Chinese serialization test to use char default

* Warn if attempting to customize other segmenter

Add a warning if `Chinese.pkuseg_update_user_dict` is called when
another segmenter is selected.
2020-07-19 13:34:37 +02:00
Ines Montani
24f72c669c Merge branch 'develop' into master-tmp 2020-05-21 18:39:06 +02:00
Ines Montani
d8f3190c0a Tidy up and auto-format 2020-05-21 14:14:01 +02:00
adrianeboyd
c963e269ba
Add method to update / reset pkuseg user dict (#5404) 2020-05-08 11:21:46 +02:00
adrianeboyd
f7471abd82
Add pkuseg and serialization support for Chinese (#5308)
* Add pkuseg and serialization support for Chinese

Add support for pkuseg alongside jieba

* Specify model through `Language` meta:

  * split on characters (if no word segmentation packages are installed)

```
Chinese(meta={"tokenizer": {"config": {"use_jieba": False, "use_pkuseg": False}}})
```

  * jieba (remains the default tokenizer if installed)

```
Chinese()
Chinese(meta={"tokenizer": {"config": {"use_jieba": True}}}) # explicit
```

  * pkuseg

```
Chinese(meta={"tokenizer": {"config": {"pkuseg_model": "default", "use_jieba": False, "use_pkuseg": True}}})
```

* The new tokenizer setting `require_pkuseg` is used to override
`use_jieba` default, which is intended for models that provide a pkuseg
model:

```
nlp_pkuseg = Chinese(meta={"tokenizer": {"config": {"pkuseg_model": "default", "require_pkuseg": True}}})
nlp = Chinese() # has `use_jieba` as `True` by default
nlp.from_bytes(nlp_pkuseg.to_bytes()) # `require_pkuseg` overrides `use_jieba` when calling the tokenizer
```

Add support for serialization of tokenizer settings and pkuseg model, if
loaded

* Add sorting for `Language.to_bytes()` serialization of `Language.meta`
so that the (emptied, but still present) tokenizer metadata is in a
consistent position in the serialized data

Extend tests to cover all three tokenizer configurations and
serialization

* Fix from_disk and tests without jieba or pkuseg

* Load cfg first and only show error if `use_pkuseg`
* Fix blank/default initialization in serialization tests

* Explicitly initialize jieba's cache on init

* Add serialization for pkuseg pre/postprocessors

* Reformat pkuseg install message
2020-04-18 17:01:53 +02:00
Ines Montani
db55577c45
Drop Python 2.7 and 3.5 (#4828)
* Remove unicode declarations

* Remove Python 3.5 and 2.7 from CI

* Don't require pathlib

* Replace compat helpers

* Remove OrderedDict

* Use f-strings

* Set Cython compiler language level

* Fix typo

* Re-add OrderedDict for Table

* Update setup.cfg

* Revert CONTRIBUTING.md

* Revert lookups.md

* Revert top-level.md

* Small adjustments and docs [ci skip]
2019-12-22 01:53:56 +01:00
adrianeboyd
0b9a5f4074 Rework Chinese language initialization and tokenization (#4619)
* Rework Chinese language initialization

* Create a `ChineseTokenizer` class
  * Modify jieba post-processing to handle whitespace correctly
  * Modify non-jieba character tokenization to handle whitespace correctly

* Add a `create_tokenizer()` method to `ChineseDefaults`

* Load lexical attributes

* Update Chinese tag_map for UD v2

* Add very basic Chinese tests

* Test tokenization with and without jieba

* Test `like_num` attribute

* Fix try_jieba_import()

* Fix zh code formatting
2019-11-11 14:23:21 +01:00