Update "Adding languages" docs [ci skip]

Ines Montani 2019-09-12 14:53:06 +02:00
parent 10257f3131
commit c0a4cab178


@ -71,21 +71,19 @@ from the global rules. Others, like the tokenizer and norm exceptions, are very
specific and will make a big difference to spaCy's performance on the particular
language and training a language model.
| Variable | Type | Description |
| ----------------------------------------- | ----- | ---------------------------------------------------------------------------------------------------------- |
| `STOP_WORDS` | set | Individual words. |
| `TOKENIZER_EXCEPTIONS` | dict | Keyed by strings mapped to list of one dict per token with token attributes. |
| `TOKEN_MATCH` | regex | Regexes to match complex tokens, e.g. URLs. |
| `NORM_EXCEPTIONS` | dict | Keyed by strings, mapped to their norms. |
| `TOKENIZER_PREFIXES` | list | Strings or regexes, usually not customized. |
| `TOKENIZER_SUFFIXES` | list | Strings or regexes, usually not customized. |
| `TOKENIZER_INFIXES` | list | Strings or regexes, usually not customized. |
| `LEX_ATTRS` | dict | Attribute ID mapped to function. |
| `SYNTAX_ITERATORS` | dict | Iterator ID mapped to function. Currently only supports `'noun_chunks'`. |
| `LOOKUP` | dict | Keyed by strings mapping to their lemma. |
| `LEMMA_RULES`, `LEMMA_INDEX`, `LEMMA_EXC` | dict | Lemmatization rules, keyed by part of speech. |
| `TAG_MAP` | dict | Keyed by strings mapped to [Universal Dependencies](http://universaldependencies.org/u/pos/all.html) tags. |
| `MORPH_RULES` | dict | Keyed by strings mapped to a dict of their morphological features. |
| Variable | Type | Description |
| ---------------------- | ----- | ---------------------------------------------------------------------------------------------------------- |
| `STOP_WORDS` | set | Individual words. |
| `TOKENIZER_EXCEPTIONS` | dict | Keyed by strings mapped to list of one dict per token with token attributes. |
| `TOKEN_MATCH` | regex | Regexes to match complex tokens, e.g. URLs. |
| `NORM_EXCEPTIONS` | dict | Keyed by strings, mapped to their norms. |
| `TOKENIZER_PREFIXES` | list | Strings or regexes, usually not customized. |
| `TOKENIZER_SUFFIXES` | list | Strings or regexes, usually not customized. |
| `TOKENIZER_INFIXES` | list | Strings or regexes, usually not customized. |
| `LEX_ATTRS` | dict | Attribute ID mapped to function. |
| `SYNTAX_ITERATORS` | dict | Iterator ID mapped to function. Currently only supports `'noun_chunks'`. |
| `TAG_MAP` | dict | Keyed by strings mapped to [Universal Dependencies](http://universaldependencies.org/u/pos/all.html) tags. |
| `MORPH_RULES` | dict | Keyed by strings mapped to a dict of their morphological features. |
> #### Should I ever update the global data?
>
@ -213,9 +211,7 @@ spaCy's [tokenization algorithm](/usage/linguistic-features#how-tokenizer-works)
lets you deal with whitespace-delimited chunks separately. This makes it easy to
define special-case rules, without worrying about how they interact with the
rest of the tokenizer. Whenever the key string is matched, the special-case rule
is applied, giving the defined sequence of tokens. You can also attach
attributes to the subtokens, covered by your special case, such as the subtokens
`LEMMA` or `TAG`.
is applied, giving the defined sequence of tokens.
Tokenizer exceptions can be added in the following format:
@ -223,8 +219,8 @@ Tokenizer exceptions can be added in the following format:
### tokenizer_exceptions.py (excerpt)
TOKENIZER_EXCEPTIONS = {
    "don't": [
        {ORTH: "do", LEMMA: "do"},
        {ORTH: "n't", LEMMA: "not", NORM: "not", TAG: "RB"}]
        {ORTH: "do"},
        {ORTH: "n't", NORM: "not"}]
}
```
@ -233,41 +229,12 @@ TOKENIZER_EXCEPTIONS = {
If an exception consists of more than one token, the `ORTH` values combined
always need to **match the original string**. The way the original string is
split up can be pretty arbitrary sometimes; for example, `"gonna"` is split into
`"gon"` (lemma "go") and `"na"` (lemma "to"). Because of how the tokenizer
`"gon"` (norm "going") and `"na"` (norm "to"). Because of how the tokenizer
works, it's currently not possible to split single-letter strings into multiple
tokens.
</Infobox>
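The rule that the `ORTH` values must join back to the original string can be checked mechanically. The following is a minimal sketch, not part of spaCy: `find_orth_mismatches` is a hypothetical helper, and `ORTH` and `NORM` are plain-string stand-ins for spaCy's attribute IDs so the snippet runs without spaCy installed.

```python
# Plain-string stand-ins for spaCy's attribute IDs (assumption for this sketch).
ORTH = "ORTH"
NORM = "NORM"


def find_orth_mismatches(exceptions):
    """Return the keys whose token ORTH values do not join back to the key."""
    mismatches = []
    for string, tokens in exceptions.items():
        joined = "".join(token[ORTH] for token in tokens)
        if joined != string:
            mismatches.append(string)
    return mismatches


exceptions = {
    "don't": [{ORTH: "do"}, {ORTH: "n't", NORM: "not"}],
    "gonna": [{ORTH: "gon", NORM: "going"}, {ORTH: "na", NORM: "to"}],
    # Deliberately wrong: the ORTH values join to "wantto", not "wanna".
    "wanna": [{ORTH: "want"}, {ORTH: "to"}],
}

bad_keys = find_orth_mismatches(exceptions)
```

Running a check like this over generated exceptions is a cheap way to catch string interpolation mistakes before the data reaches the tokenizer.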
Unambiguous abbreviations, like month names or locations in English, should be
added to exceptions with a lemma assigned, for example
`{ORTH: "Jan.", LEMMA: "January"}`. Since the exceptions are added in Python,
you can use custom logic to generate them more efficiently and make your data
less verbose. How you do this ultimately depends on the language. Here's an
example of how exceptions for time formats like "1a.m." and "1am" are generated
in the English
[`tokenizer_exceptions.py`](https://github.com/explosion/spaCy/tree/master/spacy/lang/en/tokenizer_exceptions.py):
```python
### tokenizer_exceptions.py (excerpt)
# use short, internal variable for readability
_exc = {}
for h in range(1, 12 + 1):
    for period in ["a.m.", "am"]:
        # always keep an eye on string interpolation!
        _exc["%d%s" % (h, period)] = [
            {ORTH: "%d" % h},
            {ORTH: period, LEMMA: "a.m."}]
    for period in ["p.m.", "pm"]:
        _exc["%d%s" % (h, period)] = [
            {ORTH: "%d" % h},
            {ORTH: period, LEMMA: "p.m."}]
# only declare this at the bottom
TOKENIZER_EXCEPTIONS = _exc
```
> #### Generating tokenizer exceptions
>
> Keep in mind that generating exceptions only makes sense if there's a clearly
@ -275,7 +242,8 @@ TOKENIZER_EXCEPTIONS = _exc
> This is not always the case. In Spanish, for instance, infinitive or
> imperative reflexive verbs and pronouns are one token (e.g. "vestirme"). In
> cases like this, spaCy shouldn't be generating exceptions for _all verbs_.
> Instead, this will be handled at a later stage during lemmatization.
> Instead, this will be handled at a later stage after part-of-speech tagging
> and lemmatization.
When adding the tokenizer exceptions to the `Defaults`, you can use the
[`update_exc`](/api/top-level#util.update_exc) helper function to merge them
@ -292,28 +260,18 @@ custom one.
from ...util import update_exc
BASE_EXCEPTIONS = {"a.": [{ORTH: "a."}], ":)": [{ORTH: ":)"}]}
TOKENIZER_EXCEPTIONS = {"a.": [{ORTH: "a.", LEMMA: "all"}]}
TOKENIZER_EXCEPTIONS = {"a.": [{ORTH: "a.", NORM: "all"}]}
tokenizer_exceptions = update_exc(BASE_EXCEPTIONS, TOKENIZER_EXCEPTIONS)
# {"a.": [{ORTH: "a.", LEMMA: "all"}], ":)": [{ORTH: ":)"}]}
# {"a.": [{ORTH: "a.", NORM: "all"}], ":)": [{ORTH: ":)"}]}
```
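To make the merge behavior concrete, here is a simplified stand-in for what the example above relies on: entries from the language-specific dict overwrite base entries with the same key, and all other base entries are kept. `merge_exceptions` is a hypothetical helper written for illustration. It is not spaCy's `update_exc`, which may perform additional processing, and `ORTH` and `NORM` are plain-string stand-ins for spaCy's attribute IDs.

```python
# Plain-string stand-ins for spaCy's attribute IDs (assumption for this sketch).
ORTH = "ORTH"
NORM = "NORM"


def merge_exceptions(base_exceptions, *addition_dicts):
    """Merge exception dicts; later dicts overwrite earlier ones key by key."""
    merged = dict(base_exceptions)  # shallow copy, so the base dict is not mutated
    for additions in addition_dicts:
        for string, tokens in additions.items():
            joined = "".join(token[ORTH] for token in tokens)
            if joined != string:
                raise ValueError(
                    "ORTH values %r do not match key %r" % (joined, string)
                )
        merged.update(additions)
    return merged


BASE_EXCEPTIONS = {"a.": [{ORTH: "a."}], ":)": [{ORTH: ":)"}]}
TOKENIZER_EXCEPTIONS = {"a.": [{ORTH: "a.", NORM: "all"}]}

tokenizer_exceptions = merge_exceptions(BASE_EXCEPTIONS, TOKENIZER_EXCEPTIONS)
```

Validating the `ORTH` values during the merge is a design choice of this sketch: it surfaces a mismatched entry at import time rather than during tokenization.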
<Infobox title="About spaCy's custom pronoun lemma" variant="warning">
Unlike verbs and common nouns, there's no clear base form of a personal pronoun.
Should the lemma of "me" be "I", or should we normalize person as well, giving
"it" — or maybe "he"? spaCy's solution is to introduce a novel symbol, `-PRON-`,
which is used as the lemma for all personal pronouns.
</Infobox>
### Norm exceptions {#norm-exceptions new="2"}
In addition to `ORTH` or `LEMMA`, tokenizer exceptions can also set a `NORM`
attribute. This is useful to specify a normalized version of the token; for
example, the norm of "n't" is "not". By default, a token's norm equals its
lowercase text. If the lowercase spelling of a word exists, norms should always
be in lowercase.
In addition to `ORTH`, tokenizer exceptions can also set a `NORM` attribute.
This is useful to specify a normalized version of the token; for example, the
norm of "n't" is "not". By default, a token's norm equals its lowercase text. If
the lowercase spelling of a word exists, norms should always be in lowercase.
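The default described above can be illustrated with a small sketch. `resolve_norm` is a hypothetical helper, not a spaCy function, and the table is a plain dict keyed by strings; it only shows the fallback from an explicit norm to the lowercase text.

```python
def resolve_norm(text, norm_exceptions):
    """Return the explicit norm if one is defined, else the lowercase text."""
    return norm_exceptions.get(text, text.lower())


# Illustrative table; the entry mirrors the "n't" example from the text.
norm_exceptions = {"n't": "not"}

norm_of_contraction = resolve_norm("n't", norm_exceptions)
norm_of_plain_word = resolve_norm("Hello", norm_exceptions)
```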
> #### Norms vs. lemmas
>
@ -458,25 +416,36 @@ the quickest and easiest way to get started. The data is stored in a dictionary
mapping a string to its lemma. To determine a token's lemma, spaCy simply looks
it up in the table. Here's an example from the Spanish language data:
```python
### lang/es/lemmatizer.py (excerpt)
LOOKUP = {
    "aba": "abar",
    "ababa": "abar",
    "ababais": "abar",
    "ababan": "abar",
    "ababanes": "ababán",
    "ababas": "abar",
    "ababoles": "ababol",
    "ababábites": "ababábite"
```json
### lang/es/lemma_lookup.json (excerpt)
{
  "aba": "abar",
  "ababa": "abar",
  "ababais": "abar",
  "ababan": "abar",
  "ababanes": "ababán",
  "ababas": "abar",
  "ababoles": "ababol",
  "ababábites": "ababábite"
}
```
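Lookup lemmatization amounts to a dictionary access. The sketch below parses a few of the entries shown above from a JSON string and looks strings up in the resulting table. `lookup_lemma` is a hypothetical helper, and returning the input unchanged when no entry exists is an assumption of this sketch, not a statement about spaCy's internals.

```python
import json

# A few entries from the excerpt above, parsed from JSON text.
lemma_lookup = json.loads(
    '{"aba": "abar", "ababanes": "ababán", "ababoles": "ababol"}'
)


def lookup_lemma(string, table):
    """Return the lemma stored for `string`, or `string` itself if missing."""
    # Assumption: unknown strings fall back to themselves.
    return table.get(string, string)


known = lookup_lemma("ababanes", lemma_lookup)
unknown = lookup_lemma("gato", lemma_lookup)
```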
To provide a lookup lemmatizer for your language, import the lookup table and
add it to the `Language` class as `lemma_lookup`:
#### Adding JSON resources {#lemmatizer-resources new="2.2"}
As of v2.2, resources for the lemmatizer are stored as JSON and loaded via the
new [`Lookups`](/api/lookups) class. This allows easier access to the data,
serialization with the models and file compression on disk (so your spaCy
installation is smaller). Resource files can be provided via the `resources`
attribute on the custom language subclass. All paths are relative to the
language data directory, i.e. the directory the language's `__init__.py` is in.
```python
lemma_lookup = LOOKUP
resources = {
    "lemma_lookup": "lemmatizer/lemma_lookup.json",
    "lemma_rules": "lemmatizer/lemma_rules.json",
    "lemma_index": "lemmatizer/lemma_index.json",
    "lemma_exc": "lemmatizer/lemma_exc.json",
}
```
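To illustrate what "relative to the language data directory" means, the sketch below resolves each entry of a `resources`-style dict against a base directory and loads the JSON it points to. `load_resources` is a hypothetical helper for illustration only; it does not use or reproduce the `Lookups` class, and a temporary directory stands in for the directory containing the language's `__init__.py`.

```python
import json
import tempfile
from pathlib import Path


def load_resources(lang_dir, resources):
    """Load each JSON resource, resolving its path relative to lang_dir."""
    tables = {}
    for name, relative_path in resources.items():
        path = Path(lang_dir) / relative_path
        with path.open(encoding="utf8") as file_:
            tables[name] = json.load(file_)
    return tables


with tempfile.TemporaryDirectory() as tmp:
    # Stand-in for the language data directory (assumption for this sketch).
    lang_dir = Path(tmp)
    (lang_dir / "lemmatizer").mkdir()
    (lang_dir / "lemmatizer" / "lemma_lookup.json").write_text(
        json.dumps({"aba": "abar"}), encoding="utf8"
    )
    resources = {"lemma_lookup": "lemmatizer/lemma_lookup.json"}
    tables = load_resources(lang_dir, resources)
```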
### Tag map {#tag-map}