Mirror of https://github.com/explosion/spaCy.git
Update adding languages docs
parent 8c2a0c026d
commit 3665acc0de
@@ -159,12 +159,22 @@ p
         +cell #[code LEMMA_RULES], #[code LEMMA_INDEX], #[code LEMMA_EXC] (dicts)
         +cell Lemmatization rules, keyed by part of speech.
 
++aside("Should I ever update the global data?")
+    | Reusable language data is collected as atomic pieces in the root of the
+    | #[+src(gh("spaCy", "lang")) spacy.lang] package. Often, when a new
+    | language is added, you'll find a pattern or symbol that's missing. Even
+    | if it isn't common in other languages, it might be best to add it to the
+    | shared language data, unless it has some conflicting interpretation. For
+    | instance, we don't expect to see guillemet quotation symbols
+    | (#[code »] and #[code «]) in English text. But if we do see
+    | them, we'd probably prefer the tokenizer to split them off.
+
 +infobox("For languages with non-latin characters")
     | In order for the tokenizer to split suffixes, prefixes and infixes, spaCy
     | needs to know the language's character set. If the language you're adding
     | uses non-latin characters, you might need to add the required character
     | classes to the global
-    | #[+src(gh("spacy", "spacy/lang/punctuation.py")) punctuation.py].
+    | #[+src(gh("spacy", "spacy/lang/char_classes.py")) char_classes.py].
     | spaCy uses the #[+a("https://pypi.python.org/pypi/regex/") #[code regex] library]
     | to keep this simple and readable. If the language requires very specific
     | punctuation rules, you should consider overwriting the default regular
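The infobox in the hunk above points to the shared character classes and the regex library. As a rough sketch of what such Unicode-aware classes can look like, assuming only the third-party regex package (the names ALPHA, QUOTES and _suffix_re below are illustrative placeholders, not the actual contents of char_classes.py or punctuation.py):

    # Illustrative sketch of Unicode-aware character classes, in the spirit of
    # spacy/lang/char_classes.py. Placeholder names, not spaCy's definitions.
    import regex as re

    ALPHA = r"\p{L}"            # any letter, in any script
    QUOTES = "«»„“”‚‘’"         # quote characters the tokenizer should split off

    # A suffix pattern in the spirit of the punctuation rules: trailing quotes
    # or sentence-final punctuation to be separated from the word.
    _suffix_re = re.compile("(?:[{q}]|[.,!?])+$".format(q=QUOTES))

    print(re.fullmatch(ALPHA + "+", "спасибо"))  # \p{L} also covers non-Latin scripts
    print(_suffix_re.search("word»").group())    # '»' would be split off as a suffix

Writing the classes with Unicode properties such as \p{L} is what keeps the definitions short and readable for non-Latin alphabets.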
@@ -210,7 +220,7 @@ p
 p
     | Tokenizer exceptions can be added in the following format:
 
-+code("language_data.py").
++code("tokenizer_exceptions.py (excerpt)").
     TOKENIZER_EXCEPTIONS = {
        "don't": [
            {ORTH: "do", LEMMA: "do"},
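The hunk above renames the example to tokenizer_exceptions.py and shows the exception format. A self-contained sketch of that format and of registering an exception follows; it assumes the v1/v2-era API this guide describes (where exceptions may set LEMMA), and the "n't" continuation and the "en" model name are illustrative:

    # Sketch of a tokenizer exception table in the format shown above.
    # Assumes a v1/v2-era spaCy, where exceptions may set LEMMA; in current
    # releases only ORTH and NORM are allowed in special cases.
    from spacy.symbols import ORTH, LEMMA

    TOKENIZER_EXCEPTIONS = {
        "don't": [
            {ORTH: "do", LEMMA: "do"},
            {ORTH: "n't", LEMMA: "not"},
        ],
    }

    # Registering the exception on an existing tokenizer (illustrative usage):
    # import spacy
    # nlp = spacy.load("en")
    # for string, substrings in TOKENIZER_EXCEPTIONS.items():
    #     nlp.tokenizer.add_special_case(string, substrings)
    # [t.text for t in nlp("I don't")]  # -> ['I', 'do', "n't"]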
@@ -280,23 +290,6 @@ p
     | novel symbol, #[code.u-nowrap -PRON-], which is used as the lemma for
     | all personal pronouns.
 
-+h(3, "shared-data") Shared language data
-
-p
-    | Because languages can vary in quite arbitrary ways, spaCy avoids
-    | organising the language data into an explicit inheritance hierarchy.
-    | Instead, reuseable functions and data are collected as atomic pieces in
-    | the root of the #[+src(gh("spaCy", "lang")) spacy.lang] package.
-
-p
-    | Often, when a new language is added, you'll find a pattern or symbol
-    | that's missing. Even if this pattern or symbol isn't common in other
-    | languages, it might be best to add it to the base expressions, unless it
-    | has some conflicting interpretation. For instance, we don't expect to
-    | see guillemot quotation symbols (#[code »] and #[code «]) in
-    | English text. But if we do see them, we'd probably prefer the tokenizer
-    | to split it off.
-
 +h(3, "lex-attrs") Lexical attributes
 
 p
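The context lines at the top of the final hunk mention the -PRON- placeholder lemma. A quick way to observe it, assuming a v2-era English model such as en_core_web_sm is installed (later spaCy versions dropped -PRON-):

    # Quick check of the -PRON- lemma described above. Assumes a v1/v2-era
    # English model; newer spaCy releases no longer use this placeholder.
    import spacy

    nlp = spacy.load("en_core_web_sm")
    doc = nlp("She told him about it.")
    for token in doc:
        print(token.text, token.lemma_)
    # In v2, personal pronouns such as "She", "him" and "it" print "-PRON-".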