Update Chinese usage docs

2025-11-09 12:27:54 +03:00 · 2020-10-02 10:09:03 +02:00 · 7670df04dd
commit 7670df04dd
parent 3908fff899
1 changed file with 24 additions and 26 deletions
--- a/website/docs/usage/models.md
+++ b/website/docs/usage/models.md
@@ -85,7 +85,8 @@ import the `MultiLanguage` class directly, or call
 ### Chinese language support {#chinese new=2.3}
-The Chinese language class supports three word segmentation options:
+The Chinese language class supports three word segmentation options, `char`,
+`jieba` and `pkuseg`:
 > ```python
 > from spacy.lang.zh import Chinese
@@ -95,11 +96,12 @@ The Chinese language class supports three word segmentation options:
 >
 > # Jieba
 > cfg = {"segmenter": "jieba"}
-> nlp = Chinese(meta={"tokenizer": {"config": cfg}})
+> nlp = Chinese.from_config({"nlp": {"tokenizer": cfg}})
 >
 > # PKUSeg with "default" model provided by pkuseg
-> cfg = {"segmenter": "pkuseg", "pkuseg_model": "default"}
+> cfg = {"segmenter": "pkuseg"}
-> nlp = Chinese(meta={"tokenizer": {"config": cfg}})
+> nlp = Chinese.from_config({"nlp": {"tokenizer": cfg}})
+> nlp.tokenizer.initialize(pkuseg_model="default")
 > ```
 1. **Character segmentation:** Character segmentation is the default
@@ -116,41 +118,34 @@ The Chinese language class supports three word segmentation options:
 <Infobox variant="warning">
 In spaCy v3.0, the default Chinese word segmenter has switched from Jieba to
-character segmentation. Also note that
+character segmentation.
 [`pkuseg`](https://github.com/lancopku/pkuseg-python) doesn't yet ship with
 pre-compiled wheels for Python 3.8. If you're running Python 3.8, you can
 install it from our fork and compile it locally:
 ```bash
 $ pip install https://github.com/honnibal/pkuseg-python/archive/master.zip
 ```
 </Infobox>
 <Accordion title="Details on spaCy's Chinese API">
-The `meta` argument of the `Chinese` language class supports the following
+The `initialize` method for the Chinese tokenizer class supports the following
-following tokenizer config settings:
+config settings for loading pkuseg models:
-| Name               | Description                                                                                                     |
+| Name               | Description                                                                                                                           |
-| ------------------ | --------------------------------------------------------------------------------------------------------------- |
+| ------------------ | ------------------------------------------------------------------------------------------------------------------------------------- |
-| `segmenter`        | Word segmenter: `char`, `jieba` or `pkuseg`. Defaults to `char`. ~~str~~                                        |
+| `pkuseg_model`     | Name of a model provided by `pkuseg` or the path to a local model directory. ~~str~~                                                  |
-| `pkuseg_model`     | **Required for `pkuseg`:** Name of a model provided by `pkuseg` or the path to a local model directory. ~~str~~ |
+| `pkuseg_user_dict` | Optional path to a file with one word per line which overrides the default `pkuseg` user dictionary. Defaults to `"default"`. ~~str~~ |
-| `pkuseg_user_dict` | Optional path to a file with one word per line which overrides the default `pkuseg` user dictionary. ~~str~~    |
 ```python
 ### Examples
+# Initialize the pkuseg tokenizer
+cfg = {"segmenter": "pkuseg"}
+nlp = Chinese.from_config({"nlp": {"tokenizer": cfg}})
 # Load "default" model
-cfg = {"segmenter": "pkuseg", "pkuseg_model": "default"}
+nlp.tokenizer.initialize(pkuseg_model="default")
-nlp = Chinese(config={"tokenizer": {"config": cfg}})
 # Load local model
-cfg = {"segmenter": "pkuseg", "pkuseg_model": "/path/to/pkuseg_model"}
+nlp.tokenizer.initialize(pkuseg_model="/path/to/pkuseg_model")
-nlp = Chinese(config={"tokenizer": {"config": cfg}})
 # Override the user directory
-cfg = {"segmenter": "pkuseg", "pkuseg_model": "default", "pkuseg_user_dict": "/path"}
+nlp.tokenizer.initialize(pkuseg_model="default", pkuseg_user_dict="/path/to/user_dict")
-nlp = Chinese(config={"tokenizer": {"config": cfg}})
 ```
 You can also modify the user dictionary on-the-fly:
@@ -185,8 +180,11 @@ from spacy.lang.zh import Chinese
 # Train pkuseg model
 pkuseg.train("train.utf8", "test.utf8", "/path/to/pkuseg_model")
 # Load pkuseg model in spaCy Chinese tokenizer
-nlp = Chinese(meta={"tokenizer": {"config": {"pkuseg_model": "/path/to/pkuseg_model", "require_pkuseg": True}}})
+cfg = {"segmenter": "pkuseg"}
+nlp = Chinese.from_config({"nlp": {"tokenizer": cfg}})
+nlp.tokenizer.initialize(pkuseg_model="/path/to/pkuseg_model")
 ```
 </Accordion>
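
Migration note: the diff above moves `pkuseg_model` and `pkuseg_user_dict` out of the tokenizer config and into `nlp.tokenizer.initialize(...)`. A minimal sketch of that split, for callers that still hold an old-style `cfg` dict, might look like the following. The helper name `split_old_tokenizer_cfg` and the `INIT_KEYS` constant are hypothetical and not part of spaCy. The sketch only rearranges dictionaries, handles only the two keys named in this diff, and does not import spaCy.

```python
from typing import Any, Dict, Tuple

# Keys that this commit moves from the tokenizer config to
# `nlp.tokenizer.initialize(...)` (taken from the diff above).
INIT_KEYS = ("pkuseg_model", "pkuseg_user_dict")


def split_old_tokenizer_cfg(
    old_cfg: Dict[str, Any]
) -> Tuple[Dict[str, Any], Dict[str, Any]]:
    """Hypothetical helper, not a spaCy API.

    Splits an old-style tokenizer cfg into:
      1. a config dict shaped like the argument to `Chinese.from_config`
         in the updated docs, and
      2. keyword arguments for `nlp.tokenizer.initialize`.
    """
    init_kwargs = {k: old_cfg[k] for k in INIT_KEYS if k in old_cfg}
    tokenizer_cfg = {k: v for k, v in old_cfg.items() if k not in INIT_KEYS}
    return {"nlp": {"tokenizer": tokenizer_cfg}}, init_kwargs


# Old-style cfg as it appears on the removed lines of the diff.
old = {
    "segmenter": "pkuseg",
    "pkuseg_model": "default",
    "pkuseg_user_dict": "/path",
}
config, init_kwargs = split_old_tokenizer_cfg(old)
print(config)
print(init_kwargs)
```

Following the updated docs, the two results would then be used as `nlp = Chinese.from_config(config)` followed by `nlp.tokenizer.initialize(**init_kwargs)`. For the `char` and `jieba` segmenters the second dict is empty.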