mirror of https://github.com/explosion/spaCy.git (synced 2025-01-12 10:16:27 +03:00)
Update adding languages docs
parent 7f331eafcd
commit 915b50c736
|
@ -12,14 +12,11 @@ p
|
||||||
| need to:
|
| need to:
|
||||||
|
|
||||||
+list("numbers")
|
+list("numbers")
|
||||||
+item
|
+item Create a #[strong #[code Language] subclass].
|
||||||
| Create a #[strong #[code Language] subclass] and
|
|
||||||
| #[a(href="#language-subclass") implement it].
|
|
||||||
|
|
||||||
+item
|
+item
|
||||||
| Define custom #[strong language data], like a
|
| Define custom #[strong language data], like a
|
||||||
| #[a(href="#stop-words") stop list], #[a(href="#tag-map") tag map]
|
| #[a(href="#stop-words") stop list] and
|
||||||
| and #[a(href="#tokenizer-exceptions") tokenizer exceptions].
|
| #[a(href="#tokenizer-exceptions") tokenizer exceptions].
|
||||||
|
|
||||||
+item
|
+item
|
||||||
| #[strong Build the vocabulary] including
|
| #[strong Build the vocabulary] including
|
||||||
|
@ -28,7 +25,8 @@ p
|
||||||
| #[a(href="#word-vectors") word vectors].
|
| #[a(href="#word-vectors") word vectors].
|
||||||
|
|
||||||
+item
|
+item
|
||||||
| #[strong Set up] a #[a(href="#model-directory") model direcory] and #[strong train] the #[a(href="#train-tagger-parser") tagger and parser].
|
| #[strong Set up] a #[a(href="#model-directory") model direcory] and
|
||||||
|
| #[strong train] the #[a(href="#train-tagger-parser") tagger and parser].
|
||||||
|
|
||||||
p
|
p
|
||||||
| For some languages, you may also want to develop a solution for
|
| For some languages, you may also want to develop a solution for
|
||||||
|
@ -100,21 +98,13 @@ p
|
||||||
| so that Python functions can be used to help you generalise and combine
|
| so that Python functions can be used to help you generalise and combine
|
||||||
| the data as you require.
|
| the data as you require.
|
||||||
|
|
||||||
+infobox("For languages with non-latin characters")
|
|
||||||
| In order for the tokenizer to split suffixes, prefixes and infixes, spaCy
|
|
||||||
| needs to know the language's character set. If the language you're adding
|
|
||||||
| uses non-latin characters, you might need to add the required character
|
|
||||||
| classes to the global
|
|
||||||
| #[+src(gh("spacy", "spacy/lang/punctuation.py")) punctuation.py].
|
|
||||||
| spaCy uses the #[+a("https://pypi.python.org/pypi/regex/") #[code regex] library]
|
|
||||||
| to keep this simple and readable. If the language requires very specific
|
|
||||||
| punctuation rules, you should consider overwriting the default regular
|
|
||||||
| expressions with your own in the language's #[code Defaults].
|
|
||||||
|
|
||||||
p
|
p
|
||||||
| Here's an overview of the individual components that can be included
|
| Here's an overview of the individual components that can be included
|
||||||
| in the language data. For more details on them, see the sections below.
|
| in the language data. For more details on them, see the sections below.
|
||||||
|
|
||||||
|
+image
|
||||||
|
include ../../assets/img/docs/language_data.svg
|
||||||
|
|
||||||
+table(["File name", "Variables", "Description"])
|
+table(["File name", "Variables", "Description"])
|
||||||
+row
|
+row
|
||||||
+cell #[+src(gh()) stop_words.py]
|
+cell #[+src(gh()) stop_words.py]
|
||||||
|
@ -169,6 +159,17 @@ p
|
||||||
+cell #[code LEMMA_RULES], #[code LEMMA_INDEX], #[code LEMMA_EXC] (dicts)
|
+cell #[code LEMMA_RULES], #[code LEMMA_INDEX], #[code LEMMA_EXC] (dicts)
|
||||||
+cell Lemmatization rules, keyed by part of speech.
|
+cell Lemmatization rules, keyed by part of speech.
|
||||||
|
|
||||||
|
+infobox("For languages with non-latin characters")
|
||||||
|
| In order for the tokenizer to split suffixes, prefixes and infixes, spaCy
|
||||||
|
| needs to know the language's character set. If the language you're adding
|
||||||
|
| uses non-latin characters, you might need to add the required character
|
||||||
|
| classes to the global
|
||||||
|
| #[+src(gh("spacy", "spacy/lang/punctuation.py")) punctuation.py].
|
||||||
|
| spaCy uses the #[+a("https://pypi.python.org/pypi/regex/") #[code regex] library]
|
||||||
|
| to keep this simple and readable. If the language requires very specific
|
||||||
|
| punctuation rules, you should consider overwriting the default regular
|
||||||
|
| expressions with your own in the language's #[code Defaults].
|
||||||
|
|
||||||
+h(3, "stop-words") Stop words
|
+h(3, "stop-words") Stop words
|
||||||
|
|
||||||
p
|
p
|
||||||
|
|