Update "Adding languages" docs [ci skip]

Ines Montani 2019-09-12 14:53:06 +02:00
parent 10257f3131
commit c0a4cab178


@@ -72,7 +72,7 @@ specific and will make a big difference to spaCy's performance on the particular

language and training a language model.

| Variable | Type | Description |
| --- | --- | --- |
| `STOP_WORDS` | set | Individual words. |
| `TOKENIZER_EXCEPTIONS` | dict | Keyed by strings mapped to list of one dict per token with token attributes. |
| `TOKEN_MATCH` | regex | Regexes to match complex tokens, e.g. URLs. |
@@ -82,8 +82,6 @@ language and training a language model.

| `TOKENIZER_INFIXES` | list | Strings or regexes, usually not customized. |
| `LEX_ATTRS` | dict | Attribute ID mapped to function. |
| `SYNTAX_ITERATORS` | dict | Iterator ID mapped to function. Currently only supports `'noun_chunks'`. |
-| `LOOKUP` | dict | Keyed by strings mapping to their lemma. |
-| `LEMMA_RULES`, `LEMMA_INDEX`, `LEMMA_EXC` | dict | Lemmatization rules, keyed by part of speech. |
| `TAG_MAP` | dict | Keyed by strings mapped to [Universal Dependencies](http://universaldependencies.org/u/pos/all.html) tags. |
| `MORPH_RULES` | dict | Keyed by strings mapped to a dict of their morphological features. |
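
As a companion sketch of where these variables end up: in spaCy v2 they are typically plugged into the language's `__init__.py` via a `Language.Defaults` subclass, roughly like the following. This is a minimal, self-contained sketch with placeholder data and names (the `"xx"` code, `ExampleDefaults`, `ExampleLanguage`); in a real language module the data lives in separate files such as `stop_words.py` and `tokenizer_exceptions.py`, and the exact set of attributes varies by language.

```python
### __init__.py-style sketch with placeholder data (normally split across modules)
from spacy.attrs import LANG, LIKE_NUM
from spacy.lang.tokenizer_exceptions import BASE_EXCEPTIONS
from spacy.language import Language
from spacy.symbols import NORM, ORTH
from spacy.util import update_exc

STOP_WORDS = {"a", "an", "the"}                      # stop_words.py
TOKENIZER_EXCEPTIONS = {                             # tokenizer_exceptions.py
    "don't": [{ORTH: "do"}, {ORTH: "n't", NORM: "not"}],
}
LEX_ATTRS = {LIKE_NUM: lambda text: text.isdigit()}  # lex_attrs.py


class ExampleDefaults(Language.Defaults):
    lex_attr_getters = dict(Language.Defaults.lex_attr_getters)
    lex_attr_getters.update(LEX_ATTRS)
    lex_attr_getters[LANG] = lambda text: "xx"  # placeholder language ISO code
    tokenizer_exceptions = update_exc(BASE_EXCEPTIONS, TOKENIZER_EXCEPTIONS)
    stop_words = STOP_WORDS


class ExampleLanguage(Language):
    lang = "xx"
    Defaults = ExampleDefaults
```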
@@ -213,9 +211,7 @@ spaCy's [tokenization algorithm](/usage/linguistic-features#how-tokenizer-works)

lets you deal with whitespace-delimited chunks separately. This makes it easy to
define special-case rules, without worrying about how they interact with the
rest of the tokenizer. Whenever the key string is matched, the special-case rule
-is applied, giving the defined sequence of tokens. You can also attach
-attributes to the subtokens, covered by your special case, such as the subtokens
-`LEMMA` or `TAG`.
+is applied, giving the defined sequence of tokens.

Tokenizer exceptions can be added in the following format:
@@ -223,8 +219,8 @@ Tokenizer exceptions can be added in the following format:

### tokenizer_exceptions.py (excerpt)
TOKENIZER_EXCEPTIONS = {
    "don't": [
-        {ORTH: "do", LEMMA: "do"},
-        {ORTH: "n't", LEMMA: "not", NORM: "not", TAG: "RB"}]
+        {ORTH: "do"},
+        {ORTH: "n't", NORM: "not"}]
}
```
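
A quick way to see the updated exception in action is to run the blank English tokenizer over the string. A minimal sketch, assuming the English defaults in your installed spaCy version contain the `"don't"` entry shown above (norms are discussed further down):

```python
from spacy.lang.en import English

nlp = English()  # tokenizer and language defaults only, no statistical model
doc = nlp("don't")
print([t.text for t in doc])   # expected: ['do', "n't"]
print([t.norm_ for t in doc])  # expected: ['do', 'not']
```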
@@ -233,41 +229,12 @@ TOKENIZER_EXCEPTIONS = {

If an exception consists of more than one token, the `ORTH` values combined
always need to **match the original string**. The way the original string is
split up can be pretty arbitrary sometimes – for example, `"gonna"` is split into
-`"gon"` (lemma "go") and `"na"` (lemma "to"). Because of how the tokenizer
+`"gon"` (norm "going") and `"na"` (norm "to"). Because of how the tokenizer
works, it's currently not possible to split single-letter strings into multiple
tokens.

</Infobox>

-Unambiguous abbreviations, like month names or locations in English, should be
-added to exceptions with a lemma assigned, for example
-`{ORTH: "Jan.", LEMMA: "January"}`. Since the exceptions are added in Python,
-you can use custom logic to generate them more efficiently and make your data
-less verbose. How you do this ultimately depends on the language. Here's an
-example of how exceptions for time formats like "1a.m." and "1am" are generated
-in the English
-[`tokenizer_exceptions.py`](https://github.com/explosion/spaCy/tree/master/spacy/en/lang/tokenizer_exceptions.py):
-
-```python
-### tokenizer_exceptions.py (excerpt)
-# use short, internal variable for readability
-_exc = {}
-
-for h in range(1, 12 + 1):
-    for period in ["a.m.", "am"]:
-        # always keep an eye on string interpolation!
-        _exc["%d%s" % (h, period)] = [
-            {ORTH: "%d" % h},
-            {ORTH: period, LEMMA: "a.m."}]
-    for period in ["p.m.", "pm"]:
-        _exc["%d%s" % (h, period)] = [
-            {ORTH: "%d" % h},
-            {ORTH: period, LEMMA: "p.m."}]
-
-# only declare this at the bottom
-TOKENIZER_EXCEPTIONS = _exc
-```

> #### Generating tokenizer exceptions
>
> Keep in mind that generating exceptions only makes sense if there's a clearly
@@ -275,7 +242,8 @@ TOKENIZER_EXCEPTIONS = _exc

> This is not always the case – in Spanish for instance, infinitive or
> imperative reflexive verbs and pronouns are one token (e.g. "vestirme"). In
> cases like this, spaCy shouldn't be generating exceptions for _all verbs_.
-> Instead, this will be handled at a later stage during lemmatization.
+> Instead, this will be handled at a later stage after part-of-speech tagging
+> and lemmatization.

When adding the tokenizer exceptions to the `Defaults`, you can use the
[`update_exc`](/api/top-level#util.update_exc) helper function to merge them
@@ -292,28 +260,18 @@ custom one.

from ...util import update_exc

BASE_EXCEPTIONS = {"a.": [{ORTH: "a."}], ":)": [{ORTH: ":)"}]}
-TOKENIZER_EXCEPTIONS = {"a.": [{ORTH: "a.", LEMMA: "all"}]}
+TOKENIZER_EXCEPTIONS = {"a.": [{ORTH: "a.", NORM: "all"}]}
tokenizer_exceptions = update_exc(BASE_EXCEPTIONS, TOKENIZER_EXCEPTIONS)
-# {"a.": [{ORTH: "a.", LEMMA: "all"}], ":)": [{ORTH: ":)"}]}
+# {"a.": [{ORTH: "a.", NORM: "all"}], ":)": [{ORTH: ":)"}]}
```

-<Infobox title="About spaCy's custom pronoun lemma" variant="warning">
-
-Unlike verbs and common nouns, there's no clear base form of a personal pronoun.
-Should the lemma of "me" be "I", or should we normalize person as well, giving
-"it" — or maybe "he"? spaCy's solution is to introduce a novel symbol, `-PRON-`,
-which is used as the lemma for all personal pronouns.
-
-</Infobox>

### Norm exceptions {#norm-exceptions new="2"}

-In addition to `ORTH` or `LEMMA`, tokenizer exceptions can also set a `NORM`
-attribute. This is useful to specify a normalized version of the token – for
-example, the norm of "n't" is "not". By default, a token's norm equals its
-lowercase text. If the lowercase spelling of a word exists, norms should always
-be in lowercase.
+In addition to `ORTH`, tokenizer exceptions can also set a `NORM` attribute.
+This is useful to specify a normalized version of the token – for example, the
+norm of "n't" is "not". By default, a token's norm equals its lowercase text. If
+the lowercase spelling of a word exists, norms should always be in lowercase.

> #### Norms vs. lemmas
>
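
Pulling the two threads above together (combined `ORTH` values must match the original string, and exceptions may set `NORM`), here is a rough sketch of what a `"gonna"`-style entry could look like in the new lemma-free format. The exact entry in spaCy's English data may differ; the `NORM` values simply follow the "gon"/"na" example quoted earlier.

```python
### tokenizer_exceptions.py (sketch)
TOKENIZER_EXCEPTIONS = {
    "gonna": [
        {ORTH: "gon", NORM: "going"},  # "gon" + "na" == "gonna"
        {ORTH: "na", NORM: "to"}],
}
```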
@@ -458,9 +416,9 @@ the quickest and easiest way to get started. The data is stored in a dictionary

mapping a string to its lemma. To determine a token's lemma, spaCy simply looks
it up in the table. Here's an example from the Spanish language data:

-```python
-### lang/es/lemmatizer.py (excerpt)
-LOOKUP = {
+```json
+### lang/es/lemma_lookup.json (excerpt)
+{
    "aba": "abar",
    "ababa": "abar",
    "ababais": "abar",
@@ -472,11 +430,22 @@ LOOKUP = {

}
```

-To provide a lookup lemmatizer for your language, import the lookup table and
-add it to the `Language` class as `lemma_lookup`:
+#### Adding JSON resources {#lemmatizer-resources new="2.2"}
+
+As of v2.2, resources for the lemmatizer are stored as JSON and loaded via the
+new [`Lookups`](/api/lookups) class. This allows easier access to the data,
+serialization with the models and file compression on disk (so your spaCy
+installation is smaller). Resource files can be provided via the `resources`
+attribute on the custom language subclass. All paths are relative to the
+language data directory, i.e. the directory the language's `__init__.py` is in.

```python
-lemma_lookup = LOOKUP
+resources = {
+    "lemma_lookup": "lemmatizer/lemma_lookup.json",
+    "lemma_rules": "lemmatizer/lemma_rules.json",
+    "lemma_index": "lemmatizer/lemma_index.json",
+    "lemma_exc": "lemmatizer/lemma_exc.json",
+}
```

### Tag map {#tag-map}
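
To give a feel for the [`Lookups`](/api/lookups) container that the new JSON resources are loaded into, here is a rough sketch of building a small table in memory. The method names follow the v2.2 API the text links to, but treat the details as an approximation rather than a definitive reference.

```python
from spacy.lookups import Lookups

# Mirror a tiny slice of lang/es/lemma_lookup.json in memory
lookups = Lookups()
lookups.add_table("lemma_lookup", {"aba": "abar", "ababa": "abar"})
table = lookups.get_table("lemma_lookup")
print(lookups.has_table("lemma_lookup"))  # True
print(len(table))                         # 2
```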