Update adding languages docs

This commit is contained in:
ines 2017-05-13 14:54:58 +02:00
parent 9d85cda8e4
commit 2f54fefb5d


@@ -206,6 +206,14 @@ p
being below beside besides between beyond both bottom but by
""").split())
+infobox("Important note")
| When adding stop words from an online source, always #[strong include the link]
| in a comment. Make sure to #[strong proofread] and double-check the words
| carefully. A lot of the lists available online have been passed around
| for years and often contain mistakes, like unicode errors or random
| words that were once added for a specific use case but don't actually
| qualify as stop words.
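The convention above can be sketched as a minimal `stop_words.py` module: the list is kept as a multi-line string (easy to proofread and diff), the source link lives in a comment next to the data, and the set is built with `.split()`. The words shown are taken from the snippet above; the layout is illustrative, not spaCy's exact file.

```python
# Sketch of a stop_words.py module. Keep the link to wherever the
# list came from in a comment, so it can be proofread against the
# original later.

# Source: (link to the original word list goes here)
STOP_WORDS = set("""
a about above across after afterwards again against all almost alone
being below beside besides between beyond both bottom but by
""".split())

# str.split() with no arguments splits on any whitespace, including
# newlines, and never produces empty strings.
assert "between" in STOP_WORDS
assert "" not in STOP_WORDS
```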
+h(3, "tokenizer-exceptions") Tokenizer exceptions
p
@@ -263,6 +271,15 @@ p
# only declare this at the bottom
TOKENIZER_EXCEPTIONS = dict(_exc)
+aside("Generating tokenizer exceptions")
| Keep in mind that generating exceptions only makes sense if there's a
| clearly defined and #[strong finite number] of them, like common
| contractions in English. This is not always the case: in Spanish, for
| instance, infinitive or imperative reflexive verbs plus their pronouns
| form one token (e.g. "vestirme"). In cases like this, spaCy shouldn't
| generate exceptions for #[em all verbs]. Instead, this will be handled
| at a later stage during lemmatization.
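For a finite set like English contractions, generating the exceptions in a loop keeps the module short. The sketch below is illustrative: it uses plain `"ORTH"`/`"LEMMA"` string keys in place of spaCy's attribute IDs, and only covers one contraction pattern, but it follows the "build `_exc`, then declare `TOKENIZER_EXCEPTIONS` at the bottom" shape shown above.

```python
# Hypothetical sketch: generate exceptions for the "'ve" contraction
# for a small, finite set of pronouns. Each exception maps the surface
# string to a list of token dicts.
_exc = {}

for pron in ["i", "you", "we", "they"]:
    _exc[pron + "'ve"] = [
        {"ORTH": pron, "LEMMA": pron},
        {"ORTH": "'ve", "LEMMA": "have"},
    ]

# only declare this at the bottom
TOKENIZER_EXCEPTIONS = dict(_exc)
```

Because the set of pronouns is closed, the generated dict stays small and predictable, unlike open-ended patterns such as Spanish reflexive verbs.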
p
| When adding the tokenizer exceptions to the #[code Defaults], you can use
| the #[code update_exc()] helper function to merge them with the global
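Conceptually, the merge works like the sketch below: copy the base exceptions, then let each more specific dict override entries on conflict. This is a minimal stand-in for the behavior described above, not spaCy's actual `update_exc()` implementation (which also validates the exceptions).

```python
# Minimal sketch of an update_exc()-style merge: later, more specific
# dicts win over the base exceptions.
def update_exc(base_exceptions, *addition_dicts):
    exc = dict(base_exceptions)  # don't mutate the shared base dict
    for additions in addition_dicts:
        exc.update(additions)
    return exc

base = {"e.g.": [{"ORTH": "e.g."}]}
extra = {"don't": [{"ORTH": "do"}, {"ORTH": "n't"}]}
merged = update_exc(base, extra)
```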
@@ -380,6 +397,8 @@ p
+h(3, "morph-rules") Morph rules
+h(2, "testing") Testing the new language tokenizer
+h(2, "vocabulary") Building the vocabulary
p