diff --git a/website/docs/usage/models.md b/website/docs/usage/models.md
index 3b79c4d0d..f82da44d9 100644
--- a/website/docs/usage/models.md
+++ b/website/docs/usage/models.md
@@ -259,6 +259,45 @@ used for training the current [Japanese pipelines](/models/ja).
+### Korean language support {#korean}
+
+> #### mecab-ko tokenizer
+>
+> ```python
+> nlp = spacy.blank("ko")
+> ```
+
+The default MeCab-based Korean tokenizer requires:
+
+- [mecab-ko](https://bitbucket.org/eunjeon/mecab-ko/src/master/README.md)
+- [mecab-ko-dic](https://bitbucket.org/eunjeon/mecab-ko-dic)
+- [natto-py](https://github.com/buruzaemon/natto-py)
+
+For some Korean datasets and tasks, the
+[rule-based tokenizer](/usage/linguistic-features#tokenization) is better-suited
+than MeCab. To configure a Korean pipeline with the rule-based tokenizer:
+
+> #### Rule-based tokenizer
+>
+> ```python
+> config = {"nlp": {"tokenizer": {"@tokenizers": "spacy.Tokenizer.v1"}}}
+> nlp = spacy.blank("ko", config=config)
+> ```
+
+```ini
+### config.cfg
+[nlp]
+lang = "ko"
+tokenizer = {"@tokenizers" = "spacy.Tokenizer.v1"}
+```
+
+
+
+The [Korean trained pipelines](/models/ko) use the rule-based tokenizer, so no
+additional dependencies are required.
+
+
+
 ## Installing and using trained pipelines {#download}
 
 The easiest way to download a trained pipeline is via spaCy's
@@ -417,10 +456,10 @@ doc = nlp("This is a sentence.")
 You can use the [`info`](/api/cli#info) command or
-[`spacy.info()`](/api/top-level#spacy.info) method to print a pipeline
-package's meta data before loading it. Each `Language` object with a loaded
-pipeline also exposes the pipeline's meta data as the attribute `meta`. For
-example, `nlp.meta['version']` will return the package version.
+[`spacy.info()`](/api/top-level#spacy.info) method to print a pipeline package's
+meta data before loading it. Each `Language` object with a loaded pipeline also
+exposes the pipeline's meta data as the attribute `meta`. For example,
+`nlp.meta['version']` will return the package version.
diff --git a/website/meta/languages.json b/website/meta/languages.json
index a7dda6482..1c4379b6d 100644
--- a/website/meta/languages.json
+++ b/website/meta/languages.json
@@ -114,7 +114,12 @@
         {
             "code": "fi",
             "name": "Finnish",
-            "has_examples": true
+            "has_examples": true,
+            "models": [
+                "fi_core_news_sm",
+                "fi_core_news_md",
+                "fi_core_news_lg"
+            ]
         },
         {
             "code": "fr",
@@ -227,7 +232,12 @@
                 }
             ],
             "example": "이것은 문장입니다.",
-            "has_examples": true
+            "has_examples": true,
+            "models": [
+                "ko_core_news_sm",
+                "ko_core_news_md",
+                "ko_core_news_lg"
+            ]
         },
         {
             "code": "ky",
@@ -388,7 +398,12 @@
         {
             "code": "sv",
             "name": "Swedish",
-            "has_examples": true
+            "has_examples": true,
+            "models": [
+                "sv_core_news_sm",
+                "sv_core_news_md",
+                "sv_core_news_lg"
+            ]
         },
         {
             "code": "ta",