Update "Adding languages" docs [ci skip]

Repository: https://github.com/explosion/spaCy.git
Commit: c0a4cab178
Parent: 10257f3131
````diff
@@ -72,7 +72,7 @@ specific and will make a big difference to spaCy's performance on the particular
 language and training a language model.
 
 | Variable               | Type  | Description |
-| ----------------------------------------- | ----- | ---------------------------------------------------------------------------------------------------------- |
+| ---------------------- | ----- | ---------------------------------------------------------------------------------------------------------- |
 | `STOP_WORDS`           | set   | Individual words. |
 | `TOKENIZER_EXCEPTIONS` | dict  | Keyed by strings mapped to list of one dict per token with token attributes. |
 | `TOKEN_MATCH`          | regex | Regexes to match complex tokens, e.g. URLs. |
@@ -82,8 +82,6 @@ language and training a language model.
 | `TOKENIZER_INFIXES`    | list  | Strings or regexes, usually not customized. |
 | `LEX_ATTRS`            | dict  | Attribute ID mapped to function. |
 | `SYNTAX_ITERATORS`     | dict  | Iterator ID mapped to function. Currently only supports `'noun_chunks'`. |
-| `LOOKUP`               | dict  | Keyed by strings mapping to their lemma. |
-| `LEMMA_RULES`, `LEMMA_INDEX`, `LEMMA_EXC` | dict | Lemmatization rules, keyed by part of speech. |
 | `TAG_MAP`              | dict  | Keyed by strings mapped to [Universal Dependencies](http://universaldependencies.org/u/pos/all.html) tags. |
 | `MORPH_RULES`          | dict  | Keyed by strings mapped to a dict of their morphological features. |
 
````
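For orientation, here is a minimal sketch (not part of this commit) of how a few of these variables are typically wired into a `Language` subclass, following the spaCy v2.x language-package pattern; `Xxxxx` and `"xx"` are hypothetical placeholders for a new language:

```python
# Sketch only, not part of this commit: wiring a few of the variables above
# into a Language subclass, following the spaCy v2.x language-package pattern.
# "Xxxxx" and "xx" are hypothetical placeholders for a new language.
from spacy.attrs import LANG, LIKE_NUM, NORM, ORTH
from spacy.lang.tokenizer_exceptions import BASE_EXCEPTIONS
from spacy.language import Language
from spacy.util import update_exc

STOP_WORDS = {"a", "an", "the"}                        # set: individual words
TOKENIZER_EXCEPTIONS = {"e.g.": [{ORTH: "e.g.", NORM: "for example"}]}
LEX_ATTRS = {LIKE_NUM: lambda text: text.isdigit()}    # attribute ID -> function


class XxxxxDefaults(Language.Defaults):
    lex_attr_getters = dict(Language.Defaults.lex_attr_getters)
    lex_attr_getters.update(LEX_ATTRS)
    lex_attr_getters[LANG] = lambda text: "xx"
    tokenizer_exceptions = update_exc(BASE_EXCEPTIONS, TOKENIZER_EXCEPTIONS)
    stop_words = STOP_WORDS


class Xxxxx(Language):
    lang = "xx"
    Defaults = XxxxxDefaults


nlp = Xxxxx()
print([token.text for token in nlp("e.g. 42 cats")])  # expected: ['e.g.', '42', 'cats']
```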
````diff
@@ -213,9 +211,7 @@ spaCy's [tokenization algorithm](/usage/linguistic-features#how-tokenizer-works)
 lets you deal with whitespace-delimited chunks separately. This makes it easy to
 define special-case rules, without worrying about how they interact with the
 rest of the tokenizer. Whenever the key string is matched, the special-case rule
-is applied, giving the defined sequence of tokens. You can also attach
-attributes to the subtokens, covered by your special case, such as the subtokens
-`LEMMA` or `TAG`.
+is applied, giving the defined sequence of tokens.
 
 Tokenizer exceptions can be added in the following format:
 
````
````diff
@@ -223,8 +219,8 @@ Tokenizer exceptions can be added in the following format:
 ### tokenizer_exceptions.py (excerpt)
 TOKENIZER_EXCEPTIONS = {
     "don't": [
-        {ORTH: "do", LEMMA: "do"},
-        {ORTH: "n't", LEMMA: "not", NORM: "not", TAG: "RB"}]
+        {ORTH: "do"},
+        {ORTH: "n't", NORM: "not"}]
 }
 ```
 
````
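As a quick sanity check (a sketch, not part of this commit), the shipped English data contains an equivalent rule, so the effect of such an exception is easy to inspect:

```python
# Sketch, not part of this commit: the shipped English data (spaCy v2.x)
# contains an equivalent "don't" rule, so the special case can be inspected.
from spacy.lang.en import English

nlp = English()  # tokenizer only, no statistical model required
print([token.text for token in nlp("We don't know")])
# expected: ['We', 'do', "n't", 'know']
```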
````diff
@@ -233,41 +229,12 @@ TOKENIZER_EXCEPTIONS = {
 If an exception consists of more than one token, the `ORTH` values combined
 always need to **match the original string**. The way the original string is
 split up can be pretty arbitrary sometimes – for example `"gonna"` is split into
-`"gon"` (lemma "go") and `"na"` (lemma "to"). Because of how the tokenizer
+`"gon"` (norm "going") and `"na"` (norm "to"). Because of how the tokenizer
 works, it's currently not possible to split single-letter strings into multiple
 tokens.
 
 </Infobox>
 
-Unambiguous abbreviations, like month names or locations in English, should be
-added to exceptions with a lemma assigned, for example
-`{ORTH: "Jan.", LEMMA: "January"}`. Since the exceptions are added in Python,
-you can use custom logic to generate them more efficiently and make your data
-less verbose. How you do this ultimately depends on the language. Here's an
-example of how exceptions for time formats like "1a.m." and "1am" are generated
-in the English
-[`tokenizer_exceptions.py`](https://github.com/explosion/spaCy/tree/master/spacy/en/lang/tokenizer_exceptions.py):
-
-```python
-### tokenizer_exceptions.py (excerpt)
-# use short, internal variable for readability
-_exc = {}
-
-for h in range(1, 12 + 1):
-    for period in ["a.m.", "am"]:
-        # always keep an eye on string interpolation!
-        _exc["%d%s" % (h, period)] = [
-            {ORTH: "%d" % h},
-            {ORTH: period, LEMMA: "a.m."}]
-    for period in ["p.m.", "pm"]:
-        _exc["%d%s" % (h, period)] = [
-            {ORTH: "%d" % h},
-            {ORTH: period, LEMMA: "p.m."}]
-
-# only declare this at the bottom
-TOKENIZER_EXCEPTIONS = _exc
-```
-
 > #### Generating tokenizer exceptions
 >
 > Keep in mind that generating exceptions only makes sense if there's a clearly
````
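A minimal illustration of this constraint (not part of this commit): the `ORTH` values of a multi-token exception have to concatenate back to the key string, as in the "gonna" example above:

```python
# Sketch, not part of this commit: the ORTH values of a multi-token exception
# must concatenate back to the original string ("gon" + "na" == "gonna").
from spacy.symbols import NORM, ORTH

TOKENIZER_EXCEPTIONS = {
    "gonna": [
        {ORTH: "gon", NORM: "going"},
        {ORTH: "na", NORM: "to"},
    ]
}

key = "gonna"
assert "".join(t[ORTH] for t in TOKENIZER_EXCEPTIONS[key]) == key
```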
````diff
@@ -275,7 +242,8 @@ TOKENIZER_EXCEPTIONS = _exc
 > This is not always the case – in Spanish for instance, infinitive or
 > imperative reflexive verbs and pronouns are one token (e.g. "vestirme"). In
 > cases like this, spaCy shouldn't be generating exceptions for _all verbs_.
-> Instead, this will be handled at a later stage during lemmatization.
+> Instead, this will be handled at a later stage after part-of-speech tagging
+> and lemmatization.
 
 When adding the tokenizer exceptions to the `Defaults`, you can use the
 [`update_exc`](/api/top-level#util.update_exc) helper function to merge them
````
````diff
@@ -292,28 +260,18 @@ custom one.
 from ...util import update_exc
 
 BASE_EXCEPTIONS = {"a.": [{ORTH: "a."}], ":)": [{ORTH: ":)"}]}
-TOKENIZER_EXCEPTIONS = {"a.": [{ORTH: "a.", LEMMA: "all"}]}
+TOKENIZER_EXCEPTIONS = {"a.": [{ORTH: "a.", NORM: "all"}]}
 
 tokenizer_exceptions = update_exc(BASE_EXCEPTIONS, TOKENIZER_EXCEPTIONS)
-# {"a.": [{ORTH: "a.", LEMMA: "all"}], ":)": [{ORTH: ":)"}]}
+# {"a.": [{ORTH: "a.", NORM: "all"}], ":)": [{ORTH: ":)"}]}
 ```
 
-<Infobox title="About spaCy's custom pronoun lemma" variant="warning">
-
-Unlike verbs and common nouns, there's no clear base form of a personal pronoun.
-Should the lemma of "me" be "I", or should we normalize person as well, giving
-"it" — or maybe "he"? spaCy's solution is to introduce a novel symbol, `-PRON-`,
-which is used as the lemma for all personal pronouns.
-
-</Infobox>
-
 ### Norm exceptions {#norm-exceptions new="2"}
 
-In addition to `ORTH` or `LEMMA`, tokenizer exceptions can also set a `NORM`
-attribute. This is useful to specify a normalized version of the token – for
-example, the norm of "n't" is "not". By default, a token's norm equals its
-lowercase text. If the lowercase spelling of a word exists, norms should always
-be in lowercase.
+In addition to `ORTH`, tokenizer exceptions can also set a `NORM` attribute.
+This is useful to specify a normalized version of the token – for example, the
+norm of "n't" is "not". By default, a token's norm equals its lowercase text. If
+the lowercase spelling of a word exists, norms should always be in lowercase.
 
 > #### Norms vs. lemmas
 >
````
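To see where these norms end up (a sketch, not part of this commit), the values set in tokenizer exceptions surface on `Token.norm_`, for example with the shipped English data:

```python
# Sketch, not part of this commit: norms defined in tokenizer exceptions show
# up on Token.norm_; checked here against the shipped English data (v2.x).
from spacy.lang.en import English

nlp = English()
print([(token.text, token.norm_) for token in nlp("I can't")])
# expected: [('I', 'i'), ('ca', 'can'), ("n't", 'not')]
```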
````diff
@@ -458,9 +416,9 @@ the quickest and easiest way to get started. The data is stored in a dictionary
 mapping a string to its lemma. To determine a token's lemma, spaCy simply looks
 it up in the table. Here's an example from the Spanish language data:
 
-```python
-### lang/es/lemmatizer.py (excerpt)
-LOOKUP = {
+```json
+### lang/es/lemma_lookup.json (excerpt)
+{
     "aba": "abar",
     "ababa": "abar",
     "ababais": "abar",
````
````diff
@@ -472,11 +430,22 @@ LOOKUP = {
 }
 ```
 
-To provide a lookup lemmatizer for your language, import the lookup table and
-add it to the `Language` class as `lemma_lookup`:
+#### Adding JSON resources {#lemmatizer-resources new="2.2"}
+
+As of v2.2, resources for the lemmatizer are stored as JSON and loaded via the
+new [`Lookups`](/api/lookups) class. This allows easier access to the data,
+serialization with the models and file compression on disk (so your spaCy
+installation is smaller). Resource files can be provided via the `resources`
+attribute on the custom language subclass. All paths are relative to the
+language data directory, i.e. the directory the language's `__init__.py` is in.
 
 ```python
-lemma_lookup = LOOKUP
+resources = {
+    "lemma_lookup": "lemmatizer/lemma_lookup.json",
+    "lemma_rules": "lemmatizer/lemma_rules.json",
+    "lemma_index": "lemmatizer/lemma_index.json",
+    "lemma_exc": "lemmatizer/lemma_exc.json",
+}
 ```
 
 ### Tag map {#tag-map}
 
````
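For reference, a sketch (not part of this commit) of how the [`Lookups`](/api/lookups) container mentioned in the added text can be used in spaCy v2.2, assuming the table data has already been loaded from JSON:

```python
# Sketch, not part of this commit: the Lookups container added in spaCy v2.2,
# which backs the JSON lemmatizer resources described above.
from spacy.lookups import Lookups

lookups = Lookups()
lookups.add_table("lemma_lookup", {"aba": "abar", "ababa": "abar"})
table = lookups.get_table("lemma_lookup")
print(table["ababa"])  # "abar"
```

With the `resources` attribute in place, this loading is handled by spaCy itself; the manual calls above only illustrate the underlying API.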