From 2f54fefb5da4aaa6ed09e5c96762ca9e3d3e9141 Mon Sep 17 00:00:00 2001
From: ines
Date: Sat, 13 May 2017 14:54:58 +0200
Subject: [PATCH] Update adding languages docs

---
 website/docs/usage/adding-languages.jade | 19 +++++++++++++++++++
 1 file changed, 19 insertions(+)

diff --git a/website/docs/usage/adding-languages.jade b/website/docs/usage/adding-languages.jade
index 3779480fd..31946ee54 100644
--- a/website/docs/usage/adding-languages.jade
+++ b/website/docs/usage/adding-languages.jade
@@ -206,6 +206,14 @@ p
         being below beside besides between beyond both bottom but by
     """).split())
 
++infobox("Important note")
+    | When adding stop words from an online source, always #[strong include the link]
+    | in a comment. Make sure to #[strong proofread] and double-check the words
+    | carefully. A lot of the lists available online have been passed around
+    | for years and often contain mistakes, like Unicode errors or random words
+    | that were once added for a specific use case, but don't actually
+    | qualify.
+
 +h(3, "tokenizer-exceptions") Tokenizer exceptions
 
 p
@@ -263,6 +271,15 @@ p
     # only declare this at the bottom
     TOKENIZER_EXCEPTIONS = dict(_exc)
 
++aside("Generating tokenizer exceptions")
+    | Keep in mind that generating exceptions only makes sense if there's a
+    | clearly defined and #[strong finite number] of them, like common
+    | contractions in English. This is not always the case – in Spanish, for
+    | instance, infinitive or imperative reflexive verbs and pronouns are one
+    | token (e.g. "vestirme"). In cases like this, spaCy shouldn't be
+    | generating exceptions for #[em all verbs]. Instead, this will be handled
+    | at a later stage during lemmatization.
+
 p
     | When adding the tokenizer exceptions to the #[code Defaults], you can use
     | the #[code update_exc()] helper function to merge them with the global
@@ -380,6 +397,8 @@ p
 
 +h(3, "morph-rules") Morph rules
 
++h(2, "testing") Testing the new language tokenizer
+
 +h(2, "vocabulary") Building the vocabulary
 
 p
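
Reviewer note: the "Generating tokenizer exceptions" aside added in this patch could be illustrated with something like the sketch below. It is only a rough illustration, not the library's implementation. The string constants `ORTH` and `LEMMA` are stand-ins for the symbols the real language data would import, and the pronoun list and the helper `generate_will_contractions` are hypothetical names chosen for this example. The point is that the set of inputs is small and closed, which is what makes generating the exceptions reasonable.

```python
# Stand-ins for the ORTH and LEMMA symbols (assumption: real language
# data would import these rather than define them as plain strings).
ORTH, LEMMA = "ORTH", "LEMMA"

# A closed, finite set of inputs: English subject pronouns.
PRONOUNS = ["i", "you", "he", "she", "it", "we", "they"]


def generate_will_contractions(pronouns):
    """Build exceptions that split e.g. "she'll" into "she" + "'ll".

    Hypothetical helper for illustration only.
    """
    exc = {}
    for pron in pronouns:
        # Cover the lowercase and the capitalised spelling.
        for form in (pron, pron.title()):
            exc[form + "'ll"] = [
                {ORTH: form},
                {ORTH: "'ll", LEMMA: "will"},
            ]
    return exc


_exc = generate_will_contractions(PRONOUNS)

# only declare this at the bottom
TOKENIZER_EXCEPTIONS = dict(_exc)
```

Each entry's `ORTH` values concatenate back to the original string, so no characters are lost by the split. The same approach would not fit the Spanish case from the aside, since the set of verbs that can take an attached pronoun is open-ended.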