Update "Adding languages" docs [ci skip]

Ines Montani 2019-09-12 14:53:06 +02:00
parent 10257f3131
commit c0a4cab178


@@ -72,7 +72,7 @@ specific and will make a big difference to spaCy's performance on the particular

language and training a language model.

| Variable | Type | Description |
| --- | --- | --- |
| `STOP_WORDS` | set | Individual words. |
| `TOKENIZER_EXCEPTIONS` | dict | Keyed by strings mapped to list of one dict per token with token attributes. |
| `TOKEN_MATCH` | regex | Regexes to match complex tokens, e.g. URLs. |
@@ -82,8 +82,6 @@ language and training a language model.

| `TOKENIZER_INFIXES` | list | Strings or regexes, usually not customized. |
| `LEX_ATTRS` | dict | Attribute ID mapped to function. |
| `SYNTAX_ITERATORS` | dict | Iterator ID mapped to function. Currently only supports `'noun_chunks'`. |
-| `LOOKUP` | dict | Keyed by strings mapping to their lemma. |
-| `LEMMA_RULES`, `LEMMA_INDEX`, `LEMMA_EXC` | dict | Lemmatization rules, keyed by part of speech. |
| `TAG_MAP` | dict | Keyed by strings mapped to [Universal Dependencies](http://universaldependencies.org/u/pos/all.html) tags. |
| `MORPH_RULES` | dict | Keyed by strings mapped to a dict of their morphological features. |
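
As a companion sketch of where these variables end up: in spaCy v2 they are typically plugged into the language's `__init__.py` via a `Language.Defaults` subclass, roughly like the following. This is a minimal, self-contained sketch with placeholder data and names (the `"xx"` code, `ExampleDefaults`, `ExampleLanguage`); in a real language module the data lives in separate files such as `stop_words.py` and `tokenizer_exceptions.py`, and the exact set of attributes varies by language.

```python
### __init__.py-style sketch with placeholder data (normally split across modules)
from spacy.attrs import LANG, LIKE_NUM
from spacy.lang.tokenizer_exceptions import BASE_EXCEPTIONS
from spacy.language import Language
from spacy.symbols import NORM, ORTH
from spacy.util import update_exc

STOP_WORDS = {"a", "an", "the"}                      # stop_words.py
TOKENIZER_EXCEPTIONS = {                             # tokenizer_exceptions.py
    "don't": [{ORTH: "do"}, {ORTH: "n't", NORM: "not"}],
}
LEX_ATTRS = {LIKE_NUM: lambda text: text.isdigit()}  # lex_attrs.py


class ExampleDefaults(Language.Defaults):
    lex_attr_getters = dict(Language.Defaults.lex_attr_getters)
    lex_attr_getters.update(LEX_ATTRS)
    lex_attr_getters[LANG] = lambda text: "xx"  # placeholder language ISO code
    tokenizer_exceptions = update_exc(BASE_EXCEPTIONS, TOKENIZER_EXCEPTIONS)
    stop_words = STOP_WORDS


class ExampleLanguage(Language):
    lang = "xx"
    Defaults = ExampleDefaults
```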
@@ -213,9 +211,7 @@ spaCy's [tokenization algorithm](/usage/linguistic-features#how-tokenizer-works)

lets you deal with whitespace-delimited chunks separately. This makes it easy to
define special-case rules, without worrying about how they interact with the
rest of the tokenizer. Whenever the key string is matched, the special-case rule
-is applied, giving the defined sequence of tokens. You can also attach
-attributes to the subtokens, covered by your special case, such as the subtokens
-`LEMMA` or `TAG`.
+is applied, giving the defined sequence of tokens.

Tokenizer exceptions can be added in the following format:
@@ -223,8 +219,8 @@ Tokenizer exceptions can be added in the following format:

### tokenizer_exceptions.py (excerpt)
TOKENIZER_EXCEPTIONS = {
    "don't": [
-        {ORTH: "do", LEMMA: "do"},
-        {ORTH: "n't", LEMMA: "not", NORM: "not", TAG: "RB"}]
+        {ORTH: "do"},
+        {ORTH: "n't", NORM: "not"}]
}
```
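
A quick way to see the updated exception in action is to run the blank English tokenizer over the string. A minimal sketch, assuming the English defaults in your installed spaCy version contain the `"don't"` entry shown above (norms are discussed further down):

```python
from spacy.lang.en import English

nlp = English()  # tokenizer and language defaults only, no statistical model
doc = nlp("don't")
print([t.text for t in doc])   # expected: ['do', "n't"]
print([t.norm_ for t in doc])  # expected: ['do', 'not']
```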
@@ -233,41 +229,12 @@ TOKENIZER_EXCEPTIONS = {

If an exception consists of more than one token, the `ORTH` values combined
always need to **match the original string**. The way the original string is
split up can be pretty arbitrary sometimes – for example, `"gonna"` is split into
-`"gon"` (lemma "go") and `"na"` (lemma "to"). Because of how the tokenizer
+`"gon"` (norm "going") and `"na"` (norm "to"). Because of how the tokenizer
works, it's currently not possible to split single-letter strings into multiple
tokens.

</Infobox>

-Unambiguous abbreviations, like month names or locations in English, should be
-added to exceptions with a lemma assigned, for example
-`{ORTH: "Jan.", LEMMA: "January"}`. Since the exceptions are added in Python,
-you can use custom logic to generate them more efficiently and make your data
-less verbose. How you do this ultimately depends on the language. Here's an
-example of how exceptions for time formats like "1a.m." and "1am" are generated
-in the English
-[`tokenizer_exceptions.py`](https://github.com/explosion/spaCy/tree/master/spacy/en/lang/tokenizer_exceptions.py):
-
-```python
-### tokenizer_exceptions.py (excerpt)
-# use short, internal variable for readability
-_exc = {}
-
-for h in range(1, 12 + 1):
-    for period in ["a.m.", "am"]:
-        # always keep an eye on string interpolation!
-        _exc["%d%s" % (h, period)] = [
-            {ORTH: "%d" % h},
-            {ORTH: period, LEMMA: "a.m."}]
-    for period in ["p.m.", "pm"]:
-        _exc["%d%s" % (h, period)] = [
-            {ORTH: "%d" % h},
-            {ORTH: period, LEMMA: "p.m."}]
-
-# only declare this at the bottom
-TOKENIZER_EXCEPTIONS = _exc
-```

> #### Generating tokenizer exceptions
>
> Keep in mind that generating exceptions only makes sense if there's a clearly
@@ -275,7 +242,8 @@ TOKENIZER_EXCEPTIONS = _exc

> This is not always the case – in Spanish for instance, infinitive or
> imperative reflexive verbs and pronouns are one token (e.g. "vestirme"). In
> cases like this, spaCy shouldn't be generating exceptions for _all verbs_.
-> Instead, this will be handled at a later stage during lemmatization.
+> Instead, this will be handled at a later stage after part-of-speech tagging
+> and lemmatization.

When adding the tokenizer exceptions to the `Defaults`, you can use the
[`update_exc`](/api/top-level#util.update_exc) helper function to merge them
@@ -292,28 +260,18 @@ custom one.

from ...util import update_exc

BASE_EXCEPTIONS = {"a.": [{ORTH: "a."}], ":)": [{ORTH: ":)"}]}
-TOKENIZER_EXCEPTIONS = {"a.": [{ORTH: "a.", LEMMA: "all"}]}
+TOKENIZER_EXCEPTIONS = {"a.": [{ORTH: "a.", NORM: "all"}]}
tokenizer_exceptions = update_exc(BASE_EXCEPTIONS, TOKENIZER_EXCEPTIONS)
-# {"a.": [{ORTH: "a.", LEMMA: "all"}], ":)": [{ORTH: ":)"}]}
+# {"a.": [{ORTH: "a.", NORM: "all"}], ":)": [{ORTH: ":)"}]}
```

-<Infobox title="About spaCy's custom pronoun lemma" variant="warning">
-
-Unlike verbs and common nouns, there's no clear base form of a personal pronoun.
-Should the lemma of "me" be "I", or should we normalize person as well, giving
-"it" — or maybe "he"? spaCy's solution is to introduce a novel symbol, `-PRON-`,
-which is used as the lemma for all personal pronouns.
-
-</Infobox>

### Norm exceptions {#norm-exceptions new="2"}

-In addition to `ORTH` or `LEMMA`, tokenizer exceptions can also set a `NORM`
-attribute. This is useful to specify a normalized version of the token – for
-example, the norm of "n't" is "not". By default, a token's norm equals its
-lowercase text. If the lowercase spelling of a word exists, norms should always
-be in lowercase.
+In addition to `ORTH`, tokenizer exceptions can also set a `NORM` attribute.
+This is useful to specify a normalized version of the token – for example, the
+norm of "n't" is "not". By default, a token's norm equals its lowercase text. If
+the lowercase spelling of a word exists, norms should always be in lowercase.

> #### Norms vs. lemmas
>
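
Pulling the two threads above together (combined `ORTH` values must match the original string, and exceptions may set `NORM`), here is a rough sketch of what a `"gonna"`-style entry could look like in the new lemma-free format. The exact entry in spaCy's English data may differ; the `NORM` values simply follow the "gon"/"na" example quoted earlier.

```python
### tokenizer_exceptions.py (sketch)
TOKENIZER_EXCEPTIONS = {
    "gonna": [
        {ORTH: "gon", NORM: "going"},  # "gon" + "na" == "gonna"
        {ORTH: "na", NORM: "to"}],
}
```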
@@ -458,9 +416,9 @@ the quickest and easiest way to get started. The data is stored in a dictionary

mapping a string to its lemma. To determine a token's lemma, spaCy simply looks
it up in the table. Here's an example from the Spanish language data:

-```python
-### lang/es/lemmatizer.py (excerpt)
-LOOKUP = {
+```json
+### lang/es/lemma_lookup.json (excerpt)
+{
    "aba": "abar",
    "ababa": "abar",
    "ababais": "abar",
@@ -472,11 +430,22 @@ LOOKUP = {

}
```

-To provide a lookup lemmatizer for your language, import the lookup table and
-add it to the `Language` class as `lemma_lookup`:
+#### Adding JSON resources {#lemmatizer-resources new="2.2"}
+
+As of v2.2, resources for the lemmatizer are stored as JSON and loaded via the
+new [`Lookups`](/api/lookups) class. This allows easier access to the data,
+serialization with the models and file compression on disk (so your spaCy
+installation is smaller). Resource files can be provided via the `resources`
+attribute on the custom language subclass. All paths are relative to the
+language data directory, i.e. the directory the language's `__init__.py` is in.

```python
-lemma_lookup = LOOKUP
+resources = {
+    "lemma_lookup": "lemmatizer/lemma_lookup.json",
+    "lemma_rules": "lemmatizer/lemma_rules.json",
+    "lemma_index": "lemmatizer/lemma_index.json",
+    "lemma_exc": "lemmatizer/lemma_exc.json",
+}
```

### Tag map {#tag-map}
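
To give a feel for the [`Lookups`](/api/lookups) container that the new JSON resources are loaded into, here is a rough sketch of building a small table in memory. The method names follow the v2.2 API the text links to, but treat the details as an approximation rather than a definitive reference.

```python
from spacy.lookups import Lookups

# Mirror a tiny slice of lang/es/lemma_lookup.json in memory
lookups = Lookups()
lookups.add_table("lemma_lookup", {"aba": "abar", "ababa": "abar"})
table = lookups.get_table("lemma_lookup")
print(lookups.has_table("lemma_lookup"))  # True
print(len(table))                         # 2
```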