Mirror of https://github.com/explosion/spaCy.git
Synced 2025-01-26 09:14:32 +03:00
Update "Adding languages" docs [ci skip]

This commit is contained in:
parent 10257f3131
commit c0a4cab178

@@ -71,21 +71,19 @@ from the global rules. Others, like the tokenizer and norm exceptions, are very
 specific and will make a big difference to spaCy's performance on the particular
 language and training a language model.
 
-| Variable                                  | Type  | Description                                                                                                 |
-| ----------------------------------------- | ----- | ----------------------------------------------------------------------------------------------------------- |
-| `STOP_WORDS`                              | set   | Individual words.                                                                                           |
-| `TOKENIZER_EXCEPTIONS`                    | dict  | Keyed by strings mapped to list of one dict per token with token attributes.                                |
-| `TOKEN_MATCH`                             | regex | Regexes to match complex tokens, e.g. URLs.                                                                 |
-| `NORM_EXCEPTIONS`                         | dict  | Keyed by strings, mapped to their norms.                                                                    |
-| `TOKENIZER_PREFIXES`                      | list  | Strings or regexes, usually not customized.                                                                 |
-| `TOKENIZER_SUFFIXES`                      | list  | Strings or regexes, usually not customized.                                                                 |
-| `TOKENIZER_INFIXES`                       | list  | Strings or regexes, usually not customized.                                                                 |
-| `LEX_ATTRS`                               | dict  | Attribute ID mapped to function.                                                                            |
-| `SYNTAX_ITERATORS`                        | dict  | Iterator ID mapped to function. Currently only supports `'noun_chunks'`.                                    |
-| `LOOKUP`                                  | dict  | Keyed by strings mapping to their lemma.                                                                    |
-| `LEMMA_RULES`, `LEMMA_INDEX`, `LEMMA_EXC` | dict  | Lemmatization rules, keyed by part of speech.                                                               |
-| `TAG_MAP`                                 | dict  | Keyed by strings mapped to [Universal Dependencies](http://universaldependencies.org/u/pos/all.html) tags.  |
-| `MORPH_RULES`                             | dict  | Keyed by strings mapped to a dict of their morphological features.                                          |
+| Variable               | Type  | Description                                                                                                 |
+| ---------------------- | ----- | ----------------------------------------------------------------------------------------------------------- |
+| `STOP_WORDS`           | set   | Individual words.                                                                                           |
+| `TOKENIZER_EXCEPTIONS` | dict  | Keyed by strings mapped to list of one dict per token with token attributes.                                |
+| `TOKEN_MATCH`          | regex | Regexes to match complex tokens, e.g. URLs.                                                                 |
+| `NORM_EXCEPTIONS`      | dict  | Keyed by strings, mapped to their norms.                                                                    |
+| `TOKENIZER_PREFIXES`   | list  | Strings or regexes, usually not customized.                                                                 |
+| `TOKENIZER_SUFFIXES`   | list  | Strings or regexes, usually not customized.                                                                 |
+| `TOKENIZER_INFIXES`    | list  | Strings or regexes, usually not customized.                                                                 |
+| `LEX_ATTRS`            | dict  | Attribute ID mapped to function.                                                                            |
+| `SYNTAX_ITERATORS`     | dict  | Iterator ID mapped to function. Currently only supports `'noun_chunks'`.                                    |
+| `TAG_MAP`              | dict  | Keyed by strings mapped to [Universal Dependencies](http://universaldependencies.org/u/pos/all.html) tags.  |
+| `MORPH_RULES`          | dict  | Keyed by strings mapped to a dict of their morphological features.                                          |
 
 > #### Should I ever update the global data?
 >
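For orientation beyond the diff itself: the variables in the table above are module-level values that a language package imports and plugs into its `Language` subclass's `Defaults`. Below is a minimal sketch of that wiring, following the spaCy v2.x pattern this guide describes elsewhere; the language code "xx", the class names and the sample data are placeholders for illustration, not part of this commit.

```python
# Minimal sketch (spaCy v2.x, illustrative only): wiring table variables into
# a Language subclass. In a real language package these tables would be
# imported from stop_words.py, tokenizer_exceptions.py etc.
from spacy.attrs import LANG
from spacy.language import Language
from spacy.lang.tokenizer_exceptions import BASE_EXCEPTIONS
from spacy.symbols import ORTH, NORM
from spacy.util import update_exc

STOP_WORDS = {"a", "the"}  # placeholder stop word list
TOKENIZER_EXCEPTIONS = {"e.g.": [{ORTH: "e.g.", NORM: "for example"}]}  # placeholder


class DemoDefaults(Language.Defaults):
    lex_attr_getters = dict(Language.Defaults.lex_attr_getters)
    lex_attr_getters[LANG] = lambda text: "xx"  # hypothetical language code
    tokenizer_exceptions = update_exc(BASE_EXCEPTIONS, TOKENIZER_EXCEPTIONS)
    stop_words = STOP_WORDS


class Demo(Language):
    lang = "xx"
    Defaults = DemoDefaults


nlp = Demo()
doc = nlp("e.g. the cake")
print([t.norm_ for t in doc])  # the exception keeps "e.g." as one token with its custom norm
```

In a real language package these tables live in their own modules next to the language's `__init__.py`, which is also why the `resources` paths introduced later in this commit are relative to that directory.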
@@ -213,9 +211,7 @@ spaCy's [tokenization algorithm](/usage/linguistic-features#how-tokenizer-works)
 lets you deal with whitespace-delimited chunks separately. This makes it easy to
 define special-case rules, without worrying about how they interact with the
 rest of the tokenizer. Whenever the key string is matched, the special-case rule
-is applied, giving the defined sequence of tokens. You can also attach
-attributes to the subtokens, covered by your special case, such as the subtokens
-`LEMMA` or `TAG`.
+is applied, giving the defined sequence of tokens.
 
 Tokenizer exceptions can be added in the following format:
 
@@ -223,8 +219,8 @@ Tokenizer exceptions can be added in the following format:
 ### tokenizer_exceptions.py (excerpt)
 TOKENIZER_EXCEPTIONS = {
     "don't": [
-        {ORTH: "do", LEMMA: "do"},
-        {ORTH: "n't", LEMMA: "not", NORM: "not", TAG: "RB"}]
+        {ORTH: "do"},
+        {ORTH: "n't", NORM: "not"}]
 }
 ```
 
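As a quick way to see the exception format above in action, the shipped English data can be checked on a blank pipeline. This is a small sketch assuming spaCy v2.x is installed; the printed values come from the English tokenizer exceptions, not from anything added in this commit.

```python
# Check how the "don't" exception splits and normalizes tokens.
# spacy.blank("en") builds a Language object with no pipeline components,
# but its tokenizer already includes the English tokenizer exceptions.
import spacy

nlp = spacy.blank("en")
doc = nlp("I don't care")
print([t.text for t in doc])   # ['I', 'do', "n't", 'care']
print([t.norm_ for t in doc])  # the "n't" token is normalized to "not"
```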
@@ -233,41 +229,12 @@ TOKENIZER_EXCEPTIONS = {
 If an exception consists of more than one token, the `ORTH` values combined
 always need to **match the original string**. The way the original string is
 split up can be pretty arbitrary sometimes – for example `"gonna"` is split into
-`"gon"` (lemma "go") and `"na"` (lemma "to"). Because of how the tokenizer
+`"gon"` (norm "going") and `"na"` (norm "to"). Because of how the tokenizer
 works, it's currently not possible to split single-letter strings into multiple
 tokens.
 
 </Infobox>
 
-Unambiguous abbreviations, like month names or locations in English, should be
-added to exceptions with a lemma assigned, for example
-`{ORTH: "Jan.", LEMMA: "January"}`. Since the exceptions are added in Python,
-you can use custom logic to generate them more efficiently and make your data
-less verbose. How you do this ultimately depends on the language. Here's an
-example of how exceptions for time formats like "1a.m." and "1am" are generated
-in the English
-[`tokenizer_exceptions.py`](https://github.com/explosion/spaCy/tree/master/spacy/en/lang/tokenizer_exceptions.py):
-
-```python
-### tokenizer_exceptions.py (excerpt)
-# use short, internal variable for readability
-_exc = {}
-
-for h in range(1, 12 + 1):
-    for period in ["a.m.", "am"]:
-        # always keep an eye on string interpolation!
-        _exc["%d%s" % (h, period)] = [
-            {ORTH: "%d" % h},
-            {ORTH: period, LEMMA: "a.m."}]
-    for period in ["p.m.", "pm"]:
-        _exc["%d%s" % (h, period)] = [
-            {ORTH: "%d" % h},
-            {ORTH: period, LEMMA: "p.m."}]
-
-# only declare this at the bottom
-TOKENIZER_EXCEPTIONS = _exc
-```
-
 > #### Generating tokenizer exceptions
 >
 > Keep in mind that generating exceptions only makes sense if there's a clearly
@@ -275,7 +242,8 @@ TOKENIZER_EXCEPTIONS = _exc
 > This is not always the case – in Spanish for instance, infinitive or
 > imperative reflexive verbs and pronouns are one token (e.g. "vestirme"). In
 > cases like this, spaCy shouldn't be generating exceptions for _all verbs_.
-> Instead, this will be handled at a later stage during lemmatization.
+> Instead, this will be handled at a later stage after part-of-speech tagging
+> and lemmatization.
 
 When adding the tokenizer exceptions to the `Defaults`, you can use the
 [`update_exc`](/api/top-level#util.update_exc) helper function to merge them
@@ -292,28 +260,18 @@ custom one.
 from ...util import update_exc
 
 BASE_EXCEPTIONS = {"a.": [{ORTH: "a."}], ":)": [{ORTH: ":)"}]}
-TOKENIZER_EXCEPTIONS = {"a.": [{ORTH: "a.", LEMMA: "all"}]}
+TOKENIZER_EXCEPTIONS = {"a.": [{ORTH: "a.", NORM: "all"}]}
 
 tokenizer_exceptions = update_exc(BASE_EXCEPTIONS, TOKENIZER_EXCEPTIONS)
-# {"a.": [{ORTH: "a.", LEMMA: "all"}], ":)": [{ORTH: ":)"}]}
+# {"a.": [{ORTH: "a.", NORM: "all"}], ":)": [{ORTH: ":)"}]}
 ```
 
-<Infobox title="About spaCy's custom pronoun lemma" variant="warning">
-
-Unlike verbs and common nouns, there's no clear base form of a personal pronoun.
-Should the lemma of "me" be "I", or should we normalize person as well, giving
-"it" — or maybe "he"? spaCy's solution is to introduce a novel symbol, `-PRON-`,
-which is used as the lemma for all personal pronouns.
-
-</Infobox>
-
 ### Norm exceptions {#norm-exceptions new="2"}
 
-In addition to `ORTH` or `LEMMA`, tokenizer exceptions can also set a `NORM`
-attribute. This is useful to specify a normalized version of the token – for
-example, the norm of "n't" is "not". By default, a token's norm equals its
-lowercase text. If the lowercase spelling of a word exists, norms should always
-be in lowercase.
+In addition to `ORTH`, tokenizer exceptions can also set a `NORM` attribute.
+This is useful to specify a normalized version of the token – for example, the
+norm of "n't" is "not". By default, a token's norm equals its lowercase text. If
+the lowercase spelling of a word exists, norms should always be in lowercase.
 
 > #### Norms vs. lemmas
 >
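To make the norm exceptions section above more concrete, here is a brief sketch of how a `NORM_EXCEPTIONS` table is typically merged into the `NORM` lexical attribute in spaCy v2.x. The example entries are illustrative, and a real language module would define them in its own `norm_exceptions.py`.

```python
# Sketch of the v2.x pattern for wiring norm exceptions into the NORM
# lexical attribute; the entries below are made up for illustration.
from spacy.attrs import NORM
from spacy.language import Language
from spacy.lang.norm_exceptions import BASE_NORMS
from spacy.util import add_lookups

NORM_EXCEPTIONS = {
    "cos": "because",   # illustrative entries, not from the commit
    "bout": "about",
}

lex_attr_getters = dict(Language.Defaults.lex_attr_getters)
lex_attr_getters[NORM] = add_lookups(
    Language.Defaults.lex_attr_getters[NORM], BASE_NORMS, NORM_EXCEPTIONS
)
```

The merged getter falls back to the shared `BASE_NORMS` and ultimately to the default lowercase text, so the language-specific table only needs to list the genuine exceptions.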
@@ -458,25 +416,36 @@ the quickest and easiest way to get started. The data is stored in a dictionary
 mapping a string to its lemma. To determine a token's lemma, spaCy simply looks
 it up in the table. Here's an example from the Spanish language data:
 
-```python
-### lang/es/lemmatizer.py (excerpt)
-LOOKUP = {
-    "aba": "abar",
-    "ababa": "abar",
-    "ababais": "abar",
-    "ababan": "abar",
-    "ababanes": "ababán",
-    "ababas": "abar",
-    "ababoles": "ababol",
-    "ababábites": "ababábite"
+```json
+### lang/es/lemma_lookup.json (excerpt)
+{
+    "aba": "abar",
+    "ababa": "abar",
+    "ababais": "abar",
+    "ababan": "abar",
+    "ababanes": "ababán",
+    "ababas": "abar",
+    "ababoles": "ababol",
+    "ababábites": "ababábite"
 }
 ```
 
-To provide a lookup lemmatizer for your language, import the lookup table and
-add it to the `Language` class as `lemma_lookup`:
+#### Adding JSON resources {#lemmatizer-resources new="2.2"}
+
+As of v2.2, resources for the lemmatizer are stored as JSON and loaded via the
+new [`Lookups`](/api/lookups) class. This allows easier access to the data,
+serialization with the models and file compression on disk (so your spaCy
+installation is smaller). Resource files can be provided via the `resources`
+attribute on the custom language subclass. All paths are relative to the
+language data directory, i.e. the directory the language's `__init__.py` is in.
 
 ```python
-lemma_lookup = LOOKUP
+resources = {
+    "lemma_lookup": "lemmatizer/lemma_lookup.json",
+    "lemma_rules": "lemmatizer/lemma_rules.json",
+    "lemma_index": "lemmatizer/lemma_index.json",
+    "lemma_exc": "lemmatizer/lemma_exc.json",
+}
 ```
 
 ### Tag map {#tag-map}
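For readers following the v2.2 changes above, here is a small, self-contained sketch of the [`Lookups`](/api/lookups) container that the JSON resources are loaded into. The table name mirrors the `resources` keys above, but the entries are illustrative, copied from the Spanish excerpt in the diff.

```python
# Small sketch of the Lookups API referenced above (spaCy v2.2+); the table
# contents are illustrative and would normally come from lemma_lookup.json.
from spacy.lookups import Lookups

lookups = Lookups()
lookups.add_table("lemma_lookup", {"ababa": "abar", "ababoles": "ababol"})

table = lookups.get_table("lemma_lookup")
print(table.get("ababa"))                 # 'abar'
print(lookups.has_table("lemma_lookup"))  # True
```

When the `resources` paths are declared on the language class, spaCy v2.2 loads the JSON files into a container like this and serializes them with the model, as described in the added paragraph above.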