Update adding languages docs

ines 2017-05-13 03:10:50 +02:00
parent 7f331eafcd
commit 915b50c736

@@ -12,14 +12,11 @@ p
     | need to:
 
 +list("numbers")
-    +item
-        | Create a #[strong #[code Language] subclass] and
-        | #[a(href="#language-subclass") implement it].
+    +item Create a #[strong #[code Language] subclass].
     +item
         | Define custom #[strong language data], like a
-        | #[a(href="#stop-words") stop list], #[a(href="#tag-map") tag map]
-        | and #[a(href="#tokenizer-exceptions") tokenizer exceptions].
+        | #[a(href="#stop-words") stop list] and
+        | #[a(href="#tokenizer-exceptions") tokenizer exceptions].
     +item
         | #[strong Build the vocabulary] including
@@ -28,7 +25,8 @@ p
         | #[a(href="#word-vectors") word vectors].
     +item
-        | #[strong Set up] a #[a(href="#model-directory") model directory] and #[strong train] the #[a(href="#train-tagger-parser") tagger and parser].
+        | #[strong Set up] a #[a(href="#model-directory") model directory] and
+        | #[strong train] the #[a(href="#train-tagger-parser") tagger and parser].
 
 p
     | For some languages, you may also want to develop a solution for
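
To ground the first item in the list above: a minimal sketch of what such a Language subclass can look like. The class name Xxxxx and the code 'xx' are placeholders, and the snippet assumes the spacy.language and spacy.attrs APIs these docs describe, not a real language in the library:

# Minimal sketch of a Language subclass; "Xxxxx" and "xx" are
# placeholder names, not a real spaCy language.
from spacy.language import Language
from spacy.attrs import LANG

class XxxxxDefaults(Language.Defaults):
    # copy the shared lexical attribute getters and set the language ID
    lex_attr_getters = dict(Language.Defaults.lex_attr_getters)
    lex_attr_getters[LANG] = lambda text: 'xx'

class Xxxxx(Language):
    lang = 'xx'                # ISO language code
    Defaults = XxxxxDefaults   # override the shared defaults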
@@ -100,21 +98,13 @@ p
     | so that Python functions can be used to help you generalise and combine
     | the data as you require.
 
-+infobox("For languages with non-latin characters")
-    | In order for the tokenizer to split suffixes, prefixes and infixes, spaCy
-    | needs to know the language's character set. If the language you're adding
-    | uses non-latin characters, you might need to add the required character
-    | classes to the global
-    | #[+src(gh("spacy", "spacy/lang/punctuation.py")) punctuation.py].
-    | spaCy uses the #[+a("https://pypi.python.org/pypi/regex/") #[code regex] library]
-    | to keep this simple and readable. If the language requires very specific
-    | punctuation rules, you should consider overwriting the default regular
-    | expressions with your own in the language's #[code Defaults].
-
 p
     | Here's an overview of the individual components that can be included
     | in the language data. For more details on them, see the sections below.
 
++image
+    include ../../assets/img/docs/language_data.svg
+
 +table(["File name", "Variables", "Description"])
     +row
         +cell #[+src(gh()) stop_words.py]
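
As a rough illustration of the simplest entries in this table: a stop list is a plain set of strings, and tokenizer exceptions map a string to a list of token attribute dicts. The contents below are toy examples, not a real language's data:

# stop_words.py — a toy stop list (real lists are much longer)
STOP_WORDS = set("""
a an the and or but if
""".split())

# tokenizer_exceptions.py — a toy special case that splits "don't"
from spacy.symbols import ORTH, LEMMA

TOKENIZER_EXCEPTIONS = {
    "don't": [
        {ORTH: "do"},
        {ORTH: "n't", LEMMA: "not"}]
}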
@@ -169,6 +159,17 @@ p
         +cell #[code LEMMA_RULES], #[code LEMMA_INDEX], #[code LEMMA_EXC] (dicts)
         +cell Lemmatization rules, keyed by part of speech.
 
++infobox("For languages with non-latin characters")
+    | In order for the tokenizer to split suffixes, prefixes and infixes, spaCy
+    | needs to know the language's character set. If the language you're adding
+    | uses non-latin characters, you might need to add the required character
+    | classes to the global
+    | #[+src(gh("spacy", "spacy/lang/punctuation.py")) punctuation.py].
+    | spaCy uses the #[+a("https://pypi.python.org/pypi/regex/") #[code regex] library]
+    | to keep this simple and readable. If the language requires very specific
+    | punctuation rules, you should consider overwriting the default regular
+    | expressions with your own in the language's #[code Defaults].
+
 +h(3, "stop-words") Stop words
 
 p
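
As a concrete illustration of the LEMMA_RULES row above: a toy sketch of what such a dict can look like. The suffix rewrites are placeholders, not a real language's rules:

# lemma_rules.py — toy lemmatization rules keyed by part of speech;
# each entry is an [old_suffix, new_suffix] rewrite
LEMMA_RULES = {
    "noun": [
        ["s", ""],       # "cats" -> "cat"
        ["ses", "s"]     # "buses" -> "bus"
    ],
    "verb": [
        ["ing", ""]      # "walking" -> "walk"
    ]
}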
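And to make the relocated infobox concrete: a hedged sketch of overriding the default punctuation expressions in a language's Defaults. It assumes spacy/lang/punctuation.py exposes a TOKENIZER_SUFFIXES list of suffix patterns and that Defaults accepts a suffixes sequence; the added pattern is purely illustrative:

# Sketch only: assumes spacy.lang.punctuation exports TOKENIZER_SUFFIXES
# and that a language's Defaults can override the "suffixes" sequence.
from spacy.language import Language
from spacy.lang.punctuation import TOKENIZER_SUFFIXES

# Hypothetical extra rule: split a trailing "%" off numbers
_suffixes = list(TOKENIZER_SUFFIXES) + [r'(?<=[0-9])%']

class CustomDefaults(Language.Defaults):
    suffixes = tuple(_suffixes)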