From c0a4cab17887d14655659b381ea4ae5e062a5108 Mon Sep 17 00:00:00 2001
From: Ines Montani
Date: Thu, 12 Sep 2019 14:53:06 +0200
Subject: [PATCH] Update "Adding languages" docs [ci skip]

---
 website/docs/usage/adding-languages.md | 131 ++++++++++---------
 1 file changed, 50 insertions(+), 81 deletions(-)

diff --git a/website/docs/usage/adding-languages.md b/website/docs/usage/adding-languages.md
index 374d948b2..6f8955326 100644
--- a/website/docs/usage/adding-languages.md
+++ b/website/docs/usage/adding-languages.md
@@ -71,21 +71,19 @@
 from the global rules. Others, like the tokenizer and norm exceptions, are very
 specific and will make a big difference to spaCy's performance on the
 particular language and training a language model.

-| Variable                                  | Type  | Description                                                                                                 |
-| ----------------------------------------- | ----- | ------------------------------------------------------------------------------------------------------------ |
-| `STOP_WORDS`                              | set   | Individual words. |
-| `TOKENIZER_EXCEPTIONS`                    | dict  | Keyed by strings mapped to list of one dict per token with token attributes. |
-| `TOKEN_MATCH`                             | regex | Regexes to match complex tokens, e.g. URLs. |
-| `NORM_EXCEPTIONS`                         | dict  | Keyed by strings, mapped to their norms. |
-| `TOKENIZER_PREFIXES`                      | list  | Strings or regexes, usually not customized. |
-| `TOKENIZER_SUFFIXES`                      | list  | Strings or regexes, usually not customized. |
-| `TOKENIZER_INFIXES`                       | list  | Strings or regexes, usually not customized. |
-| `LEX_ATTRS`                               | dict  | Attribute ID mapped to function. |
-| `SYNTAX_ITERATORS`                        | dict  | Iterator ID mapped to function. Currently only supports `'noun_chunks'`. |
-| `LOOKUP`                                  | dict  | Keyed by strings mapping to their lemma. |
-| `LEMMA_RULES`, `LEMMA_INDEX`, `LEMMA_EXC` | dict  | Lemmatization rules, keyed by part of speech. |
-| `TAG_MAP`                                 | dict  | Keyed by strings mapped to [Universal Dependencies](http://universaldependencies.org/u/pos/all.html) tags. |
-| `MORPH_RULES`                             | dict  | Keyed by strings mapped to a dict of their morphological features. |
+| Variable               | Type  | Description                                                                                                 |
+| ---------------------- | ----- | ------------------------------------------------------------------------------------------------------------ |
+| `STOP_WORDS`           | set   | Individual words. |
+| `TOKENIZER_EXCEPTIONS` | dict  | Keyed by strings mapped to a list of one dict per token with token attributes. |
+| `TOKEN_MATCH`          | regex | Regexes to match complex tokens, e.g. URLs. |
+| `NORM_EXCEPTIONS`      | dict  | Keyed by strings, mapped to their norms. |
+| `TOKENIZER_PREFIXES`   | list  | Strings or regexes, usually not customized. |
+| `TOKENIZER_SUFFIXES`   | list  | Strings or regexes, usually not customized. |
+| `TOKENIZER_INFIXES`    | list  | Strings or regexes, usually not customized. |
+| `LEX_ATTRS`            | dict  | Attribute ID mapped to function. |
+| `SYNTAX_ITERATORS`     | dict  | Iterator ID mapped to function. Currently only supports `'noun_chunks'`. |
+| `TAG_MAP`              | dict  | Keyed by strings mapped to [Universal Dependencies](http://universaldependencies.org/u/pos/all.html) tags. |
+| `MORPH_RULES`          | dict  | Keyed by strings mapped to a dict of their morphological features. |

 > #### Should I ever update the global data?
 >
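(Not part of the patch: as a rough illustration of two variables from the table above, here is a minimal sketch of what `STOP_WORDS` and `LEX_ATTRS` might look like for a hypothetical language, loosely following the layout of spaCy's `lang` modules. The word lists are invented.)

```python
# Hypothetical stop words and lexical attributes for a made-up language (sketch)
from spacy.attrs import LIKE_NUM

# STOP_WORDS: a plain set of individual words
STOP_WORDS = set("a an the of to in and or but".split())

# LEX_ATTRS: attribute ID mapped to a function that computes it from the text
_num_words = ["zero", "one", "two", "three", "four", "five", "ten", "hundred"]


def like_num(text):
    # treat plain digit strings and spelled-out numbers as number-like
    text = text.replace(",", "").replace(".", "")
    if text.isdigit():
        return True
    return text.lower() in _num_words


LEX_ATTRS = {LIKE_NUM: like_num}
```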
@@ -213,9 +211,7 @@
 spaCy's [tokenization algorithm](/usage/linguistic-features#how-tokenizer-works)
 lets you deal with whitespace-delimited chunks separately. This makes it easy
 to define special-case rules, without worrying about how they interact with the
 rest of the tokenizer. Whenever the key string is matched, the special-case rule
-is applied, giving the defined sequence of tokens. You can also attach
-attributes to the subtokens, covered by your special case, such as the subtokens
-`LEMMA` or `TAG`.
+is applied, giving the defined sequence of tokens.

 Tokenizer exceptions can be added in the following format:
@@ -223,8 +219,8 @@
 ### tokenizer_exceptions.py (excerpt)
 TOKENIZER_EXCEPTIONS = {
     "don't": [
-        {ORTH: "do", LEMMA: "do"},
-        {ORTH: "n't", LEMMA: "not", NORM: "not", TAG: "RB"}]
+        {ORTH: "do"},
+        {ORTH: "n't", NORM: "not"}]
 }
 ```
@@ -233,41 +229,12 @@ TOKENIZER_EXCEPTIONS = {
 If an exception consists of more than one token, the `ORTH` values combined
 always need to **match the original string**. The way the original string is
 split up can be pretty arbitrary sometimes – for example `"gonna"` is split into
-`"gon"` (lemma "go") and `"na"` (lemma "to"). Because of how the tokenizer
+`"gon"` (norm "going") and `"na"` (norm "to"). Because of how the tokenizer
 works, it's currently not possible to split single-letter strings into multiple
 tokens.

-Unambiguous abbreviations, like month names or locations in English, should be
-added to exceptions with a lemma assigned, for example
-`{ORTH: "Jan.", LEMMA: "January"}`. Since the exceptions are added in Python,
-you can use custom logic to generate them more efficiently and make your data
-less verbose. How you do this ultimately depends on the language. Here's an
-example of how exceptions for time formats like "1a.m." and "1am" are generated
-in the English
-[`tokenizer_exceptions.py`](https://github.com/explosion/spaCy/tree/master/spacy/en/lang/tokenizer_exceptions.py):
-
-```python
-### tokenizer_exceptions.py (excerpt)
-# use short, internal variable for readability
-_exc = {}
-
-for h in range(1, 12 + 1):
-    for period in ["a.m.", "am"]:
-        # always keep an eye on string interpolation!
-        _exc["%d%s" % (h, period)] = [
-            {ORTH: "%d" % h},
-            {ORTH: period, LEMMA: "a.m."}]
-    for period in ["p.m.", "pm"]:
-        _exc["%d%s" % (h, period)] = [
-            {ORTH: "%d" % h},
-            {ORTH: period, LEMMA: "p.m."}]
-
-# only declare this at the bottom
-TOKENIZER_EXCEPTIONS = _exc
-```
-
 > #### Generating tokenizer exceptions
 >
 > Keep in mind that generating exceptions only makes sense if there's a clearly
@@ -275,7 +242,8 @@ TOKENIZER_EXCEPTIONS = _exc
 > This is not always the case – in Spanish for instance, infinitive or
 > imperative reflexive verbs and pronouns are one token (e.g. "vestirme"). In
 > cases like this, spaCy shouldn't be generating exceptions for _all verbs_.
-> Instead, this will be handled at a later stage during lemmatization.
+> Instead, this will be handled at a later stage after part-of-speech tagging
+> and lemmatization.

 When adding the tokenizer exceptions to the `Defaults`, you can use the
 [`update_exc`](/api/top-level#util.update_exc) helper function to merge them
@@ -292,28 +260,18 @@ custom one.
 from ...util import update_exc

 BASE_EXCEPTIONS = {"a.": [{ORTH: "a."}], ":)": [{ORTH: ":)"}]}
-TOKENIZER_EXCEPTIONS = {"a.": [{ORTH: "a.", LEMMA: "all"}]}
+TOKENIZER_EXCEPTIONS = {"a.": [{ORTH: "a.", NORM: "all"}]}
 tokenizer_exceptions = update_exc(BASE_EXCEPTIONS, TOKENIZER_EXCEPTIONS)
-# {"a.": [{ORTH: "a.", LEMMA: "all"}], ":)": [{ORTH: ":)"}]}
+# {"a.": [{ORTH: "a.", NORM: "all"}], ":)": [{ORTH: ":)"}]}
 ```
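(Not part of the patch: the "Generating tokenizer exceptions" note above assumes exceptions are built programmatically rather than listed one by one. Here is a minimal sketch of what that can look like in the updated style, using only `ORTH` and `NORM` as in the `"don't"` example; the verb list is invented.)

```python
# Hypothetical generated tokenizer exceptions (sketch)
from spacy.symbols import ORTH, NORM

_exc = {}

# expand a handful of "-n't" contractions; the ORTH values of each entry
# must re-join to exactly the key string
for verb in ["do", "does", "did", "could"]:
    _exc[verb + "n't"] = [
        {ORTH: verb},
        {ORTH: "n't", NORM: "not"},
    ]

TOKENIZER_EXCEPTIONS = _exc
```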
-
-
-Unlike verbs and common nouns, there's no clear base form of a personal pronoun.
-Should the lemma of "me" be "I", or should we normalize person as well, giving
-"it" — or maybe "he"? spaCy's solution is to introduce a novel symbol, `-PRON-`,
-which is used as the lemma for all personal pronouns.
-
-

 ### Norm exceptions {#norm-exceptions new="2"}

-In addition to `ORTH` or `LEMMA`, tokenizer exceptions can also set a `NORM`
-attribute. This is useful to specify a normalized version of the token – for
-example, the norm of "n't" is "not". By default, a token's norm equals its
-lowercase text. If the lowercase spelling of a word exists, norms should always
-be in lowercase.
+In addition to `ORTH`, tokenizer exceptions can also set a `NORM` attribute.
+This is useful to specify a normalized version of the token – for example, the
+norm of "n't" is "not". By default, a token's norm equals its lowercase text. If
+the lowercase spelling of a word exists, norms should always be in lowercase.

 > #### Norms vs. lemmas
 >
@@ -458,25 +416,36 @@
 the quickest and easiest way to get started. The data is stored in a dictionary
 mapping a string to its lemma. To determine a token's lemma, spaCy simply looks
 it up in the table. Here's an example from the Spanish language data:

-```python
-### lang/es/lemmatizer.py (excerpt)
-LOOKUP = {
-    "aba": "abar",
-    "ababa": "abar",
-    "ababais": "abar",
-    "ababan": "abar",
-    "ababanes": "ababán",
-    "ababas": "abar",
-    "ababoles": "ababol",
-    "ababábites": "ababábite"
+```json
+### lang/es/lemma_lookup.json (excerpt)
+{
+    "aba": "abar",
+    "ababa": "abar",
+    "ababais": "abar",
+    "ababan": "abar",
+    "ababanes": "ababán",
+    "ababas": "abar",
+    "ababoles": "ababol",
+    "ababábites": "ababábite"
 }
 ```

-To provide a lookup lemmatizer for your language, import the lookup table and
-add it to the `Language` class as `lemma_lookup`:
+#### Adding JSON resources {#lemmatizer-resources new="2.2"}
+
+As of v2.2, resources for the lemmatizer are stored as JSON and loaded via the
+new [`Lookups`](/api/lookups) class. This allows easier access to the data,
+serialization with the models and file compression on disk (so your spaCy
+installation is smaller). Resource files can be provided via the `resources`
+attribute on the custom language subclass. All paths are relative to the
+language data directory, i.e. the directory the language's `__init__.py` is in.

 ```python
-lemma_lookup = LOOKUP
+resources = {
+    "lemma_lookup": "lemmatizer/lemma_lookup.json",
+    "lemma_rules": "lemmatizer/lemma_rules.json",
+    "lemma_index": "lemmatizer/lemma_index.json",
+    "lemma_exc": "lemmatizer/lemma_exc.json",
+}
 ```

 ### Tag map {#tag-map}
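(Not part of the patch: to make the lemma lookup table shown in the hunk above concrete, here is a minimal standalone sketch of what lookup-based lemmatization boils down to. The file path simply mirrors the `resources` example and is illustrative only.)

```python
import json

# illustrative path, mirroring the "lemma_lookup" entry in the resources example
with open("lemmatizer/lemma_lookup.json", encoding="utf8") as f:
    lemma_lookup = json.load(f)


def lookup_lemma(word):
    # a lookup lemmatizer maps the string to its lemma, falling back to the
    # word itself when the table has no entry
    return lemma_lookup.get(word, word)


print(lookup_lemma("ababan"))  # "abar"
print(lookup_lemma("gatos"))   # no entry in the excerpt above, returned unchanged
```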