Update Chinese usage docs

Adriane Boyd 2020-10-02 10:09:03 +02:00
parent 3908fff899
commit 7670df04dd


@@ -85,7 +85,8 @@ import the `MultiLanguage` class directly, or call
 ### Chinese language support {#chinese new=2.3}
 
-The Chinese language class supports three word segmentation options:
+The Chinese language class supports three word segmentation options, `char`,
+`jieba` and `pkuseg`:
 
 > ```python
 > from spacy.lang.zh import Chinese
@@ -95,11 +96,12 @@ The Chinese language class supports three word segmentation options:
 >
 > # Jieba
 > cfg = {"segmenter": "jieba"}
-> nlp = Chinese(meta={"tokenizer": {"config": cfg}})
+> nlp = Chinese.from_config({"nlp": {"tokenizer": cfg}})
 >
 > # PKUSeg with "default" model provided by pkuseg
-> cfg = {"segmenter": "pkuseg", "pkuseg_model": "default"}
-> nlp = Chinese(meta={"tokenizer": {"config": cfg}})
+> cfg = {"segmenter": "pkuseg"}
+> nlp = Chinese.from_config({"nlp": {"tokenizer": cfg}})
+> nlp.tokenizer.initialize(pkuseg_model="default")
 > ```
 
 1. **Character segmentation:** Character segmentation is the default
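
For reference, a minimal sketch that puts the three segmenter options from the new v3 API side by side. It assumes the `jieba` and `pkuseg` packages are installed; the sample sentence is illustrative only:

```python
from spacy.lang.zh import Chinese

# Character segmentation: the v3.0 default, no extra dependencies
nlp_char = Chinese()

# Jieba word segmentation
cfg = {"segmenter": "jieba"}
nlp_jieba = Chinese.from_config({"nlp": {"tokenizer": cfg}})

# PKUSeg word segmentation, initialized with the "default" model
cfg = {"segmenter": "pkuseg"}
nlp_pkuseg = Chinese.from_config({"nlp": {"tokenizer": cfg}})
nlp_pkuseg.tokenizer.initialize(pkuseg_model="default")

# Compare the three segmentations of the same sentence
for nlp in (nlp_char, nlp_jieba, nlp_pkuseg):
    print([t.text for t in nlp("这是一个中文句子")])
```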
@@ -116,41 +118,34 @@ The Chinese language class supports three word segmentation options:
 <Infobox variant="warning">
 
 In spaCy v3.0, the default Chinese word segmenter has switched from Jieba to
-character segmentation. Also note that
-[`pkuseg`](https://github.com/lancopku/pkuseg-python) doesn't yet ship with
-pre-compiled wheels for Python 3.8. If you're running Python 3.8, you can
-install it from our fork and compile it locally:
-
-```bash
-$ pip install https://github.com/honnibal/pkuseg-python/archive/master.zip
-```
+character segmentation.
 
 </Infobox>
 
 <Accordion title="Details on spaCy's Chinese API">
 
-The `meta` argument of the `Chinese` language class supports the following
-following tokenizer config settings:
+The `initialize` method for the Chinese tokenizer class supports the following
+config settings for loading pkuseg models:
 
 | Name               | Description                                                                                                                            |
 | ------------------ | -------------------------------------------------------------------------------------------------------------------------------------- |
-| `segmenter`        | Word segmenter: `char`, `jieba` or `pkuseg`. Defaults to `char`. ~~str~~                                                                |
-| `pkuseg_model`     | **Required for `pkuseg`:** Name of a model provided by `pkuseg` or the path to a local model directory. ~~str~~                        |
-| `pkuseg_user_dict` | Optional path to a file with one word per line which overrides the default `pkuseg` user dictionary. ~~str~~                           |
+| `pkuseg_model`     | Name of a model provided by `pkuseg` or the path to a local model directory. ~~str~~                                                   |
+| `pkuseg_user_dict` | Optional path to a file with one word per line which overrides the default `pkuseg` user dictionary. Defaults to `"default"`. ~~str~~  |
 
 ```python
 ### Examples
+# Initialize the pkuseg tokenizer
+cfg = {"segmenter": "pkuseg"}
+nlp = Chinese.from_config({"nlp": {"tokenizer": cfg}})
+
 # Load "default" model
-cfg = {"segmenter": "pkuseg", "pkuseg_model": "default"}
-nlp = Chinese(config={"tokenizer": {"config": cfg}})
+nlp.tokenizer.initialize(pkuseg_model="default")
 
 # Load local model
-cfg = {"segmenter": "pkuseg", "pkuseg_model": "/path/to/pkuseg_model"}
-nlp = Chinese(config={"tokenizer": {"config": cfg}})
+nlp.tokenizer.initialize(pkuseg_model="/path/to/pkuseg_model")
 
 # Override the user directory
-cfg = {"segmenter": "pkuseg", "pkuseg_model": "default", "pkuseg_user_dict": "/path"}
-nlp = Chinese(config={"tokenizer": {"config": cfg}})
+nlp.tokenizer.initialize(pkuseg_model="default", pkuseg_user_dict="/path/to/user_dict")
 ```
 
 You can also modify the user dictionary on-the-fly:
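
The hunk cuts off before showing that on-the-fly example. A minimal sketch, assuming the tokenizer exposes the `pkuseg_update_user_dict` method described in the surrounding docs and that `nlp` uses the `pkuseg` segmenter:

```python
# Append words to the pkuseg user dictionary at runtime
nlp.tokenizer.pkuseg_update_user_dict(["中国"])

# Pass reset=True to clear the user dictionary before adding words
nlp.tokenizer.pkuseg_update_user_dict([], reset=True)
```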
@@ -185,8 +180,11 @@ from spacy.lang.zh import Chinese
 
 # Train pkuseg model
 pkuseg.train("train.utf8", "test.utf8", "/path/to/pkuseg_model")
 
 # Load pkuseg model in spaCy Chinese tokenizer
-nlp = Chinese(meta={"tokenizer": {"config": {"pkuseg_model": "/path/to/pkuseg_model", "require_pkuseg": True}}})
+cfg = {"segmenter": "pkuseg"}
+nlp = Chinese.from_config({"nlp": {"tokenizer": cfg}})
+nlp.tokenizer.initialize(pkuseg_model="/path/to/pkuseg_model")
 ```
 
 </Accordion>
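
Putting the last hunk together, a self-contained sketch of the train-then-load workflow under the new API. It assumes `pkuseg` is installed, that `train.utf8` and `test.utf8` are whitespace-segmented training files, and the sample sentence is illustrative only:

```python
import pkuseg
from spacy.lang.zh import Chinese

# Train a custom pkuseg model on pre-segmented text files
pkuseg.train("train.utf8", "test.utf8", "/path/to/pkuseg_model")

# Load the trained model into the spaCy Chinese tokenizer
cfg = {"segmenter": "pkuseg"}
nlp = Chinese.from_config({"nlp": {"tokenizer": cfg}})
nlp.tokenizer.initialize(pkuseg_model="/path/to/pkuseg_model")

# Check that the custom model drives segmentation
print([t.text for t in nlp("这是一个中文句子")])
```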