From c0a4cab17887d14655659b381ea4ae5e062a5108 Mon Sep 17 00:00:00 2001
From: Ines Montani
Date: Thu, 12 Sep 2019 14:53:06 +0200
Subject: [PATCH] Update "Adding languages" docs [ci skip]

---
 website/docs/usage/adding-languages.md | 131 ++++++++++---------
 1 file changed, 50 insertions(+), 81 deletions(-)

diff --git a/website/docs/usage/adding-languages.md b/website/docs/usage/adding-languages.md
index 374d948b2..6f8955326 100644
--- a/website/docs/usage/adding-languages.md
+++ b/website/docs/usage/adding-languages.md
@@ -71,21 +71,19 @@
 from the global rules. Others, like the tokenizer and norm exceptions, are very
 specific and will make a big difference to spaCy's performance on the
 particular language and training a language model.

-| Variable                                  | Type  | Description                                                                                                 |
-| ----------------------------------------- | ----- | ------------------------------------------------------------------------------------------------------------ |
-| `STOP_WORDS`                              | set   | Individual words. |
-| `TOKENIZER_EXCEPTIONS`                    | dict  | Keyed by strings mapped to list of one dict per token with token attributes. |
-| `TOKEN_MATCH`                             | regex | Regexes to match complex tokens, e.g. URLs. |
-| `NORM_EXCEPTIONS`                         | dict  | Keyed by strings, mapped to their norms. |
-| `TOKENIZER_PREFIXES`                      | list  | Strings or regexes, usually not customized. |
-| `TOKENIZER_SUFFIXES`                      | list  | Strings or regexes, usually not customized. |
-| `TOKENIZER_INFIXES`                       | list  | Strings or regexes, usually not customized. |
-| `LEX_ATTRS`                               | dict  | Attribute ID mapped to function. |
-| `SYNTAX_ITERATORS`                        | dict  | Iterator ID mapped to function. Currently only supports `'noun_chunks'`. |
-| `LOOKUP`                                  | dict  | Keyed by strings mapping to their lemma. |
-| `LEMMA_RULES`, `LEMMA_INDEX`, `LEMMA_EXC` | dict  | Lemmatization rules, keyed by part of speech. |
-| `TAG_MAP`                                 | dict  | Keyed by strings mapped to [Universal Dependencies](http://universaldependencies.org/u/pos/all.html) tags. |
-| `MORPH_RULES`                             | dict  | Keyed by strings mapped to a dict of their morphological features. |
+| Variable               | Type  | Description                                                                                                 |
+| ---------------------- | ----- | ------------------------------------------------------------------------------------------------------------ |
+| `STOP_WORDS`           | set   | Individual words. |
+| `TOKENIZER_EXCEPTIONS` | dict  | Keyed by strings mapped to a list of one dict per token with token attributes. |
+| `TOKEN_MATCH`          | regex | Regexes to match complex tokens, e.g. URLs. |
+| `NORM_EXCEPTIONS`      | dict  | Keyed by strings, mapped to their norms. |
+| `TOKENIZER_PREFIXES`   | list  | Strings or regexes, usually not customized. |
+| `TOKENIZER_SUFFIXES`   | list  | Strings or regexes, usually not customized. |
+| `TOKENIZER_INFIXES`    | list  | Strings or regexes, usually not customized. |
+| `LEX_ATTRS`            | dict  | Attribute ID mapped to function. |
+| `SYNTAX_ITERATORS`     | dict  | Iterator ID mapped to function. Currently only supports `'noun_chunks'`. |
+| `TAG_MAP`              | dict  | Keyed by strings mapped to [Universal Dependencies](http://universaldependencies.org/u/pos/all.html) tags. |
+| `MORPH_RULES`          | dict  | Keyed by strings mapped to a dict of their morphological features. |

 > #### Should I ever update the global data?
 >
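(Not part of the patch: as a rough illustration of two variables from the table above, here is a minimal sketch of what `STOP_WORDS` and `LEX_ATTRS` might look like for a hypothetical language, loosely following the layout of spaCy's `lang` modules. The word lists are invented.)

```python
# Hypothetical stop words and lexical attributes for a made-up language (sketch)
from spacy.attrs import LIKE_NUM

# STOP_WORDS: a plain set of individual words
STOP_WORDS = set("a an the of to in and or but".split())

# LEX_ATTRS: attribute ID mapped to a function that computes it from the text
_num_words = ["zero", "one", "two", "three", "four", "five", "ten", "hundred"]


def like_num(text):
    # treat plain digit strings and spelled-out numbers as number-like
    text = text.replace(",", "").replace(".", "")
    if text.isdigit():
        return True
    return text.lower() in _num_words


LEX_ATTRS = {LIKE_NUM: like_num}
```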
@@ -213,9 +211,7 @@
 spaCy's [tokenization algorithm](/usage/linguistic-features#how-tokenizer-works)
 lets you deal with whitespace-delimited chunks separately. This makes it easy
 to define special-case rules, without worrying about how they interact with the
 rest of the tokenizer. Whenever the key string is matched, the special-case rule
-is applied, giving the defined sequence of tokens. You can also attach
-attributes to the subtokens, covered by your special case, such as the subtokens
-`LEMMA` or `TAG`.
+is applied, giving the defined sequence of tokens.

 Tokenizer exceptions can be added in the following format:
@@ -223,8 +219,8 @@
 ### tokenizer_exceptions.py (excerpt)
 TOKENIZER_EXCEPTIONS = {
     "don't": [
-        {ORTH: "do", LEMMA: "do"},
-        {ORTH: "n't", LEMMA: "not", NORM: "not", TAG: "RB"}]
+        {ORTH: "do"},
+        {ORTH: "n't", NORM: "not"}]
 }
 ```
@@ -233,41 +229,12 @@ TOKENIZER_EXCEPTIONS = {
 If an exception consists of more than one token, the `ORTH` values combined
 always need to **match the original string**. The way the original string is
 split up can be pretty arbitrary sometimes – for example `"gonna"` is split into
-`"gon"` (lemma "go") and `"na"` (lemma "to"). Because of how the tokenizer
+`"gon"` (norm "going") and `"na"` (norm "to"). Because of how the tokenizer
 works, it's currently not possible to split single-letter strings into multiple
 tokens.

-Unambiguous abbreviations, like month names or locations in English, should be
-added to exceptions with a lemma assigned, for example
-`{ORTH: "Jan.", LEMMA: "January"}`. Since the exceptions are added in Python,
-you can use custom logic to generate them more efficiently and make your data
-less verbose. How you do this ultimately depends on the language. Here's an
-example of how exceptions for time formats like "1a.m." and "1am" are generated
-in the English
-[`tokenizer_exceptions.py`](https://github.com/explosion/spaCy/tree/master/spacy/en/lang/tokenizer_exceptions.py):
-
-```python
-### tokenizer_exceptions.py (excerpt)
-# use short, internal variable for readability
-_exc = {}
-
-for h in range(1, 12 + 1):
-    for period in ["a.m.", "am"]:
-        # always keep an eye on string interpolation!
-        _exc["%d%s" % (h, period)] = [
-            {ORTH: "%d" % h},
-            {ORTH: period, LEMMA: "a.m."}]
-    for period in ["p.m.", "pm"]:
-        _exc["%d%s" % (h, period)] = [
-            {ORTH: "%d" % h},
-            {ORTH: period, LEMMA: "p.m."}]
-
-# only declare this at the bottom
-TOKENIZER_EXCEPTIONS = _exc
-```
-
 > #### Generating tokenizer exceptions
 >
 > Keep in mind that generating exceptions only makes sense if there's a clearly
@@ -275,7 +242,8 @@ TOKENIZER_EXCEPTIONS = _exc
 > This is not always the case – in Spanish for instance, infinitive or
 > imperative reflexive verbs and pronouns are one token (e.g. "vestirme"). In
 > cases like this, spaCy shouldn't be generating exceptions for _all verbs_.
-> Instead, this will be handled at a later stage during lemmatization.
+> Instead, this will be handled at a later stage after part-of-speech tagging
+> and lemmatization.

 When adding the tokenizer exceptions to the `Defaults`, you can use the
 [`update_exc`](/api/top-level#util.update_exc) helper function to merge them
@@ -292,28 +260,18 @@ custom one.
 from ...util import update_exc

 BASE_EXCEPTIONS = {"a.": [{ORTH: "a."}], ":)": [{ORTH: ":)"}]}
-TOKENIZER_EXCEPTIONS = {"a.": [{ORTH: "a.", LEMMA: "all"}]}
+TOKENIZER_EXCEPTIONS = {"a.": [{ORTH: "a.", NORM: "all"}]}
 tokenizer_exceptions = update_exc(BASE_EXCEPTIONS, TOKENIZER_EXCEPTIONS)
-# {"a.": [{ORTH: "a.", LEMMA: "all"}], ":)": [{ORTH: ":)"}]}
+# {"a.": [{ORTH: "a.", NORM: "all"}], ":)": [{ORTH: ":)"}]}
 ```
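(Not part of the patch: the "Generating tokenizer exceptions" note above assumes exceptions are built programmatically rather than listed one by one. Here is a minimal sketch of what that can look like in the updated style, using only `ORTH` and `NORM` as in the `"don't"` example; the verb list is invented.)

```python
# Hypothetical generated tokenizer exceptions (sketch)
from spacy.symbols import ORTH, NORM

_exc = {}

# expand a handful of "-n't" contractions; the ORTH values of each entry
# must re-join to exactly the key string
for verb in ["do", "does", "did", "could"]:
    _exc[verb + "n't"] = [
        {ORTH: verb},
        {ORTH: "n't", NORM: "not"},
    ]

TOKENIZER_EXCEPTIONS = _exc
```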
-
-
-Unlike verbs and common nouns, there's no clear base form of a personal pronoun.
-Should the lemma of "me" be "I", or should we normalize person as well, giving
-"it" — or maybe "he"? spaCy's solution is to introduce a novel symbol, `-PRON-`,
-which is used as the lemma for all personal pronouns.
-
-

 ### Norm exceptions {#norm-exceptions new="2"}

-In addition to `ORTH` or `LEMMA`, tokenizer exceptions can also set a `NORM`
-attribute. This is useful to specify a normalized version of the token – for
-example, the norm of "n't" is "not". By default, a token's norm equals its
-lowercase text. If the lowercase spelling of a word exists, norms should always
-be in lowercase.
+In addition to `ORTH`, tokenizer exceptions can also set a `NORM` attribute.
+This is useful to specify a normalized version of the token – for example, the
+norm of "n't" is "not". By default, a token's norm equals its lowercase text. If
+the lowercase spelling of a word exists, norms should always be in lowercase.

 > #### Norms vs. lemmas
 >
@@ -458,25 +416,36 @@
 the quickest and easiest way to get started. The data is stored in a dictionary
 mapping a string to its lemma. To determine a token's lemma, spaCy simply looks
 it up in the table. Here's an example from the Spanish language data:

-```python
-### lang/es/lemmatizer.py (excerpt)
-LOOKUP = {
-    "aba": "abar",
-    "ababa": "abar",
-    "ababais": "abar",
-    "ababan": "abar",
-    "ababanes": "ababán",
-    "ababas": "abar",
-    "ababoles": "ababol",
-    "ababábites": "ababábite"
+```json
+### lang/es/lemma_lookup.json (excerpt)
+{
+    "aba": "abar",
+    "ababa": "abar",
+    "ababais": "abar",
+    "ababan": "abar",
+    "ababanes": "ababán",
+    "ababas": "abar",
+    "ababoles": "ababol",
+    "ababábites": "ababábite"
 }
 ```

-To provide a lookup lemmatizer for your language, import the lookup table and
-add it to the `Language` class as `lemma_lookup`:
+#### Adding JSON resources {#lemmatizer-resources new="2.2"}
+
+As of v2.2, resources for the lemmatizer are stored as JSON and loaded via the
+new [`Lookups`](/api/lookups) class. This allows easier access to the data,
+serialization with the models and file compression on disk (so your spaCy
+installation is smaller). Resource files can be provided via the `resources`
+attribute on the custom language subclass. All paths are relative to the
+language data directory, i.e. the directory the language's `__init__.py` is in.

 ```python
-lemma_lookup = LOOKUP
+resources = {
+    "lemma_lookup": "lemmatizer/lemma_lookup.json",
+    "lemma_rules": "lemmatizer/lemma_rules.json",
+    "lemma_index": "lemmatizer/lemma_index.json",
+    "lemma_exc": "lemmatizer/lemma_exc.json",
+}
 ```

 ### Tag map {#tag-map}
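(Not part of the patch: to make the lemma lookup table shown in the hunk above concrete, here is a minimal standalone sketch of what lookup-based lemmatization boils down to. The file path simply mirrors the `resources` example and is illustrative only.)

```python
import json

# illustrative path, mirroring the "lemma_lookup" entry in the resources example
with open("lemmatizer/lemma_lookup.json", encoding="utf8") as f:
    lemma_lookup = json.load(f)


def lookup_lemma(word):
    # a lookup lemmatizer maps the string to its lemma, falling back to the
    # word itself when the table has no entry
    return lemma_lookup.get(word, word)


print(lookup_lemma("ababan"))  # "abar"
print(lookup_lemma("gatos"))   # no entry in the excerpt above, returned unchanged
```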