Add Finnish, Korean, and Swedish models and Korean support notes (#10355)

* Add Finnish, Korean, and Swedish models to website
* Add Korean language support notes
2025-08-09 22:54:53 +03:00 · 2022-03-07 17:03:45 +01:00 · 2022-03-07 17:03:45 +01:00 · b2bbefd0b5
commit b2bbefd0b5
parent 5ca0dbae76
2 changed files with 61 additions and 7 deletions
--- a/website/docs/usage/models.md
+++ b/website/docs/usage/models.md
@@ -259,6 +259,45 @@ used for training the current [Japanese pipelines](/models/ja).

 </Infobox>

+### Korean language support {#korean}
+
+> #### mecab-ko tokenizer
+>
+> ```python
+> nlp = spacy.blank("ko")
+> ```
+
+The default MeCab-based Korean tokenizer requires:
+
+- [mecab-ko](https://bitbucket.org/eunjeon/mecab-ko/src/master/README.md)
+- [mecab-ko-dic](https://bitbucket.org/eunjeon/mecab-ko-dic)
+- [natto-py](https://github.com/buruzaemon/natto-py)
+
+For some Korean datasets and tasks, the
+[rule-based tokenizer](/usage/linguistic-features#tokenization) is better-suited
+than MeCab. To configure a Korean pipeline with the rule-based tokenizer:
+
+> #### Rule-based tokenizer
+>
+> ```python
+> config = {"nlp": {"tokenizer": {"@tokenizers": "spacy.Tokenizer.v1"}}}
+> nlp = spacy.blank("ko", config=config)
+> ```
+
+```ini
+### config.cfg
+[nlp]
+lang = "ko"
+tokenizer = {"@tokenizers" = "spacy.Tokenizer.v1"}
+```
+
+<Infobox>
+
+The [Korean trained pipelines](/models/ko) use the rule-based tokenizer, so no
+additional dependencies are required.
+
+</Infobox>
+
 ## Installing and using trained pipelines {#download}

 The easiest way to download a trained pipeline is via spaCy's
@@ -417,10 +456,10 @@ doc = nlp("This is a sentence.")
 <Infobox title="Tip: Preview model info" emoji="💡">

 You can use the [`info`](/api/cli#info) command or
-[`spacy.info()`](/api/top-level#spacy.info) method to print a pipeline
-package's meta data before loading it. Each `Language` object with a loaded
-pipeline also exposes the pipeline's meta data as the attribute `meta`. For
-example, `nlp.meta['version']` will return the package version.
+[`spacy.info()`](/api/top-level#spacy.info) method to print a pipeline package's
+meta data before loading it. Each `Language` object with a loaded pipeline also
+exposes the pipeline's meta data as the attribute `meta`. For example,
+`nlp.meta['version']` will return the package version.

 </Infobox>

--- a/website/meta/languages.json
+++ b/website/meta/languages.json
@@ -114,7 +114,12 @@
        {
            "code": "fi",
            "name": "Finnish",
-            "has_examples": true
+            "has_examples": true,
+            "models": [
+                "fi_core_news_sm",
+                "fi_core_news_md",
+                "fi_core_news_lg"
+            ]
        },
        {
            "code": "fr",
@@ -227,7 +232,12 @@
                }
            ],
            "example": "이것은 문장입니다.",
-            "has_examples": true
+            "has_examples": true,
+            "models": [
+                "ko_core_news_sm",
+                "ko_core_news_md",
+                "ko_core_news_lg"
+            ]
        },
        {
            "code": "ky",
@@ -388,7 +398,12 @@
        {
            "code": "sv",
            "name": "Swedish",
-            "has_examples": true
+            "has_examples": true,
+            "models": [
+                "sv_core_news_sm",
+                "sv_core_news_md",
+                "sv_core_news_lg"
+            ]
        },
        {
            "code": "ta",
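
Reviewer note: the three `models` lists added to `website/meta/languages.json` follow the same `<code>_core_news_<size>` pattern. A minimal sketch of a consistency check for these entries follows. It is not part of this commit. The helper `check_models`, the regular expression, and the inline JSON excerpt are hypothetical and cover only the fields touched here. The `"code": "ko"` field is inferred from the model names, since the diff context does not show it. The pattern is an assumption that holds for these three languages only, not for every spaCy package name.

```python
import json
import re

# Excerpt mirroring the entries changed in this commit. Other fields are
# omitted; "code": "ko" is inferred from the model names, not shown in the diff.
LANGUAGES_JSON = """
[
  {"code": "fi", "name": "Finnish", "has_examples": true,
   "models": ["fi_core_news_sm", "fi_core_news_md", "fi_core_news_lg"]},
  {"code": "ko", "has_examples": true,
   "models": ["ko_core_news_sm", "ko_core_news_md", "ko_core_news_lg"]},
  {"code": "sv", "name": "Swedish", "has_examples": true,
   "models": ["sv_core_news_sm", "sv_core_news_md", "sv_core_news_lg"]}
]
"""

# Assumed naming scheme for the packages added here: <lang>_core_news_<size>.
MODEL_NAME = re.compile(r"^(?P<lang>[a-z]{2,3})_core_news_(?P<size>sm|md|lg)$")


def check_models(languages):
    """Return a list of problems; an empty list means all names look consistent."""
    problems = []
    for entry in languages:
        for name in entry.get("models", []):
            match = MODEL_NAME.match(name)
            if match is None:
                problems.append(f"{name}: unexpected name format")
            elif match.group("lang") != entry["code"]:
                problems.append(
                    f"{name}: prefix does not match code {entry['code']!r}"
                )
    return problems


problems = check_models(json.loads(LANGUAGES_JSON))
print(problems or "all model names are consistent")
```

A check like this would catch a copy-paste slip such as listing `sv_core_news_sm` under the Finnish entry.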