Update lemma data documentation [ci skip]

This commit is contained in:
Ines Montani 2019-10-01 13:22:13 +02:00
parent 932ad9cb91
commit a8a1800f2a
9 changed files with 172 additions and 75 deletions

View File

@@ -42,18 +42,20 @@ processing.
> - **Nouns**: dogs, children → dog, child
> - **Verbs**: writes, writing, wrote, written → write
A lemma is the uninflected form of a word. The English lemmatization data is
taken from [WordNet](https://wordnet.princeton.edu). Lookup tables are taken
from [Lexiconista](http://www.lexiconista.com/datasets/lemmatization/).
As of v2.2, lemmatization data is stored in a separate package,
[`spacy-lookups-data`](https://github.com/explosion/spacy-lookups-data), which can
be installed if needed via `pip install spacy[lookups]`. Some languages provide
full lemmatization rules and exceptions, while other languages currently only
rely on simple lookup tables.
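For example, once the lookups package is installed, a blank pipeline can assign lookup-based lemmas. The snippet below is a minimal sketch, assuming spaCy v2.2+ and `spacy-lookups-data` are installed and the language ships a lookup table:
```python
# Minimal sketch: lookup-based lemmatization with a blank Spanish pipeline.
# Assumes spaCy v2.2+ and spacy-lookups-data are installed.
import spacy

nlp = spacy.blank("es")
doc = nlp("las casas blancas")
print([token.lemma_ for token in doc])  # lemmas come from the lookup table
```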
<Infobox title="About spaCy's custom pronoun lemma" variant="warning">
spaCy adds a **special case for pronouns**: all pronouns are lemmatized to the
special token `-PRON-`. Unlike verbs and common nouns, there's no clear base
form of a personal pronoun. Should the lemma of "me" be "I", or should we
normalize person as well, giving "it" — or maybe "he"? spaCy's solution is to
introduce a novel symbol, `-PRON-`, which is used as the lemma for all personal
pronouns.
</Infobox>
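A quick way to see this behavior is to inspect the lemmas of a short sentence. This is only a sketch, assuming an English model such as `en_core_web_sm` is installed:
```python
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("She told me about it")
# all personal pronouns share the same lemma
print([(token.text, token.lemma_) for token in doc])
# e.g. [('She', '-PRON-'), ('told', 'tell'), ('me', '-PRON-'), ('about', 'about'), ('it', '-PRON-')]
```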

View File

@@ -34,9 +34,9 @@ together all components and creating the `Language` subclass for example,
| **Character classes**<br />[`char_classes.py`][char_classes.py] | Character classes to be used in regular expressions, for example, latin characters, quotes, hyphens or icons. |
| **Lexical attributes**<br />[`lex_attrs.py`][lex_attrs.py] | Custom functions for setting lexical attributes on tokens, e.g. `like_num`, which includes language-specific words like "ten" or "hundred". |
| **Syntax iterators**<br />[`syntax_iterators.py`][syntax_iterators.py] | Functions that compute views of a `Doc` object based on its syntax. At the moment, only used for [noun chunks](/usage/linguistic-features#noun-chunks). |
| **Tag map**<br />[`tag_map.py`][tag_map.py] | Dictionary mapping strings in your tag set to [Universal Dependencies](http://universaldependencies.org/u/pos/all.html) tags. |
| **Morph rules**<br />[`morph_rules.py`][morph_rules.py] | Exception rules for morphological analysis of irregular words like personal pronouns. |
| **Lemmatizer**<br />[`spacy-lookups-data`][spacy-lookups-data] | Lemmatization rules or a lookup-based lemmatization table to assign base forms, for example "be" for "was". |
[stop_words.py]:
https://github.com/explosion/spaCy/tree/master/spacy/lang/en/stop_words.py
@@ -52,9 +52,8 @@ together all components and creating the `Language` subclass for example,
https://github.com/explosion/spaCy/tree/master/spacy/lang/en/lex_attrs.py
[syntax_iterators.py]:
https://github.com/explosion/spaCy/tree/master/spacy/lang/en/syntax_iterators.py
[tag_map.py]:
https://github.com/explosion/spaCy/tree/master/spacy/lang/en/tag_map.py
[morph_rules.py]:
https://github.com/explosion/spaCy/tree/master/spacy/lang/en/morph_rules.py
[spacy-lookups-data]: https://github.com/explosion/spacy-lookups-data

View File

@@ -417,7 +417,7 @@ mapping a string to its lemma. To determine a token's lemma, spaCy simply looks
it up in the table. Here's an example from the Spanish language data:
```json
### es_lemma_lookup.json (excerpt)
{
"aba": "abar",
"ababa": "abar",
@@ -432,33 +432,18 @@ it up in the table. Here's an example from the Spanish language data:
#### Adding JSON resources {#lemmatizer-resources new="2.2"}
As of v2.2, resources for the lemmatizer are stored as JSON and have been moved
to a separate repository and package,
[`spacy-lookups-data`](https://github.com/explosion/spacy-lookups-data). The
package exposes the data files via language-specific
[entry points](/usage/saving-loading#entry-points) that spaCy reads when
constructing the `Vocab` and [`Lookups`](/api/lookups). This allows easier
access to the data, serialization with the models and file compression on disk
(so your spaCy installation is smaller). If you want to use the lookup tables
without a pre-trained model, you have to explicitly install spaCy with lookups
via `pip install spacy[lookups]` or by installing
[`spacy-lookups-data`](https://github.com/explosion/spacy-lookups-data) in the
same environment as spaCy.
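For instance, once the package is installed you can inspect which tables were registered on the vocab. This is a rough sketch; table names like `lemma_lookup` are the ones shipped by `spacy-lookups-data`, and the looked-up value is only illustrative:
```python
import spacy

nlp = spacy.blank("en")
lookups = nlp.vocab.lookups
print(lookups.tables)  # names of the tables loaded via entry points
if lookups.has_table("lemma_lookup"):
    table = lookups.get_table("lemma_lookup")
    print(table.get("was"))  # look up a base form, e.g. "be"
```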
### Tag map {#tag-map}

View File

@@ -49,6 +49,16 @@ $ pip install -U spacy
> >>> nlp = spacy.load("en_core_web_sm")
> ```
<Infobox variant="warning">
To install additional data tables for lemmatization in **spaCy v2.2+** (to
create blank models or lemmatize in languages that don't yet come with
pre-trained models), you can run `pip install spacy[lookups]` or install
[`spacy-lookups-data`](https://github.com/explosion/spacy-lookups-data)
separately.
</Infobox>
When using pip it is generally recommended to install packages in a virtual
environment to avoid modifying system state:

View File

@@ -48,6 +48,15 @@ contribute to model development.
> nlp = Finnish() # use directly
> nlp = spacy.blank("fi") # blank instance
> ```
>
> If lemmatization rules are available for your language, make sure to install
> spaCy with the `lookups` option, or install
> [`spacy-lookups-data`](https://github.com/explosion/spacy-lookups-data)
> separately in the same environment:
>
> ```bash
> $ pip install spacy[lookups]
> ```
import Languages from 'widgets/languages.js'

View File

@@ -285,6 +285,7 @@ installed in the same environment that's it.
| ------------------------------------------------------------------------------ | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| [`spacy_factories`](#entry-points-components) | Group of entry points for pipeline component factories to add to [`Language.factories`](/usage/processing-pipelines#custom-components-factories), keyed by component name. |
| [`spacy_languages`](#entry-points-languages) | Group of entry points for custom [`Language` subclasses](/usage/adding-languages), keyed by language shortcut. |
| `spacy_lookups` <Tag variant="new">2.2</Tag> | Group of entry points for custom [`Lookups`](/api/lookups), including lemmatizer data. Used by spaCy's [`spacy-lookups-data`](https://github.com/explosion/spacy-lookups-data) package. |
| [`spacy_displacy_colors`](#entry-points-displacy) <Tag variant="new">2.2</Tag> | Group of entry points of custom label colors for the [displaCy visualizer](/usage/visualizers#ent). The key name doesn't matter, but it should point to a dict of labels and color values. Useful for custom models that predict different entity types. |
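The snippet below sketches how a third-party package could register itself under the `spacy_lookups` group with setuptools. Only the group name comes from the table above; the package and object names are hypothetical, and the exact object the entry point must resolve to is defined by [`spacy-lookups-data`](https://github.com/explosion/spacy-lookups-data), so treat this purely as an illustration of the wiring:
```python
# Hypothetical setup.py for a package exposing extra lookup tables to spaCy.
# Only the entry point group name "spacy_lookups" is taken from the table above;
# "my_lookups" and "en_extra_tables" are made-up names for illustration.
from setuptools import setup

setup(
    name="my-lookups-package",
    version="0.0.1",
    packages=["my_lookups"],
    entry_points={
        "spacy_lookups": ["en_extra = my_lookups:en_extra_tables"],
    },
)
```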
### Custom components via entry points {#entry-points-components}

View File

@@ -145,6 +145,7 @@ the following components:
entity recognizer to predict those annotations in context.
- **Lexical entries** in the vocabulary, i.e. words and their
context-independent attributes like the shape or spelling.
- **Data files** like lemmatization rules and lookup tables.
- **Word vectors**, i.e. multi-dimensional meaning representations of words that
let you determine how similar they are to each other.
- **Configuration** options, like the language and processing pipeline settings,

View File

@@ -4,13 +4,14 @@ teaser: New features, backwards incompatibilities and migration guide
menu:
- ['New Features', 'features']
- ['Backwards Incompatibilities', 'incompat']
- ['Migrating from v2.1', 'migrating']
---
## New Features {#features hidden="true"}
spaCy v2.2 features improved statistical models, new pretrained models for
Norwegian and Lithuanian, better Dutch NER, as well as a new mechanism for
storing language data that makes the installation about **7&times; smaller** on
disk. We've also added a new class to efficiently **serialize annotations**, an
improved and **10&times; faster** phrase matching engine, built-in scoring and
**CLI training for text classification**, a new command to analyze and **debug
@@ -45,35 +46,6 @@ overall. We've also added new core models for [Norwegian](/models/nb) (MIT) and
</Infobox>
### Text classification scores and CLI training {#train-textcat-cli}
> #### Example
@@ -134,6 +106,40 @@ processing.
</Infobox>
### Serializable lookup tables and smaller installation {#lookups}
> #### Example
>
> ```python
> data = {"foo": "bar"}
> nlp.vocab.lookups.add_table("my_dict", data)
>
> def custom_component(doc):
> table = doc.vocab.lookups.get_table("my_dict")
> print(table.get("foo")) # look something up
> return doc
> ```
The new `Lookups` API lets you add large dictionaries and lookup tables to the
`Vocab` and access them from the tokenizer or custom components and extension
attributes. Internally, the tables use Bloom filters for efficient lookup
checks. They're also fully serializable out-of-the-box. All large data resources
like lemmatization tables have been moved to a separate package,
[`spacy-lookups-data`](https://github.com/explosion/spacy-lookups-data), which can
be installed alongside the core library. This allowed us to make the spaCy
installation roughly **7&times; smaller on disk**. [Pretrained models](/models)
now include their data files, so you only need to install the lookups if you
want to build blank models or use lemmatization with languages that don't yet
ship with pretrained models.
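As a rough illustration of the serialization, here's a minimal round-trip with the new class (a sketch, assuming spaCy v2.2+):
```python
from spacy.lookups import Lookups

lookups = Lookups()
lookups.add_table("my_dict", {"foo": "bar"})
lookups_bytes = lookups.to_bytes()    # serialize all tables to a bytestring

restored = Lookups()
restored.from_bytes(lookups_bytes)    # load them back
print(restored.get_table("my_dict").get("foo"))  # "bar"
```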
<Infobox>
**API:** [`Lookups`](/api/lookups),
[`spacy-lookups-data`](https://github.com/explosion/spacy-lookups-data)
**Usage:** [Adding languages: Lemmatizer](/usage/adding-languages#lemmatizer)
</Infobox>
### CLI command to debug and validate training data {#debug-data}
> #### Example
@@ -306,6 +312,28 @@ check if all of your models are up to date, you can run the
</Infobox>
> #### Install with lookups data
>
> ```bash
> $ pip install spacy[lookups]
> ```
>
> You can also install
> [`spacy-lookups-data`](https://github.com/explosion/spacy-lookups-data)
> directly.
- The lemmatization tables have been moved to their own package,
[`spacy-lookups-data`](https://github.com/explosion/spacy-lookups-data), which
is not installed by default. If you're using pre-trained models, **nothing
changes**, because the tables are now included in the model packages. If you
want to use the lemmatizer for other languages that don't yet have pre-trained
models (e.g. Turkish or Croatian) or start off with a blank model that
contains lookup data (e.g. `spacy.blank("en")`), you'll need to **explicitly
install spaCy plus data** via `pip install spacy[lookups]`.
- Lemmatization tables (rules, exceptions, index and lookups) are now part of
the `Vocab` and serialized with it. This means that serialized objects (`nlp`,
pipeline components, vocab) will now include additional data, and models
written to disk will include additional files.
- The [Dutch model](/models/nl) has been trained on a new NER corpus (custom
  labelled UD instead of WikiNER), so its predictions may be very different
compared to the previous version. The results should be significantly better
@@ -331,7 +359,7 @@ check if all of your models are up to date, you can run the
- The default punctuation in the [`Sentencizer`](/api/sentencizer) has been
extended and now includes more characters common in various languages. This
also means that the results it produces may change, depending on your text. If
you want the previous behavior with limited characters, set
`punct_chars=[".", "!", "?"]` on initialization (see the sketch after this list).
- The [`PhraseMatcher`](/api/phrasematcher) algorithm was rewritten from scratch
and it's now 10&times; faster. The rewrite also resolved a few subtle bugs
@@ -339,13 +367,62 @@ check if all of your models are up to date, you can run the
may see slightly different results; however, the results should now be fully
correct. See [this PR](https://github.com/explosion/spaCy/pull/4309) for more
details.
- The `Serbian` language class (introduced in v2.1.8) incorrectly used the
language code `rs` instead of `sr`. This has now been fixed, so `Serbian` is
now available via `spacy.lang.sr`.
- The `"sources"` in the `meta.json` have changed from a list of strings to a
list of dicts. This is mostly internals, but if your code used
`nlp.meta["sources"]`, you might have to update it.
### Migrating from spaCy 2.1 {#migrating}
#### Lemmatization data and lookup tables
If your application needs lemmatization for [languages](/usage/models#languages)
with only tokenizers, you now need to install that data explicitly via
`pip install spacy[lookups]` or `pip install spacy-lookups-data`. No additional
setup is required; the package just needs to be installed in the same
environment as spaCy.
```python
### {highlight="3-4"}
nlp = Turkish()
doc = nlp("Bu bir cümledir.")
# 🚨 This now requires the lookups data to be installed explicitly
print([token.lemma_ for token in doc])
```
The same applies to blank models that you want to update and train. For
instance, you might use [`spacy.blank`](/api/top-level#spacy.blank) to create a
blank English model and then train your own part-of-speech tagger on top. If you
don't explicitly install the lookups data, that `nlp` object won't have any
lemmatization rules available. spaCy will now show you a warning when you train
a new part-of-speech tagger and the vocab has no lookups available.
#### Converting entity offsets to BILUO tags
If you've been using the
[`biluo_tags_from_offsets`](/api/goldparse#biluo_tags_from_offsets) helper to
convert character offsets into token-based BILUO tags, you may now see an error
if the offsets contain overlapping tokens and make it impossible to create a
valid BILUO sequence. This is helpful, because it lets you spot potential
problems in your data that can lead to inconsistent results later on. But it
also means that you need to adjust and clean up the offsets before converting
them:
```diff
doc = nlp("I live in Berlin Kreuzberg")
- entities = [(10, 26, "LOC"), (10, 16, "GPE"), (17, 26, "LOC")]
+ entities = [(10, 16, "GPE"), (17, 26, "LOC")]
tags = biluo_tags_from_offsets(doc, entities)
```
#### Serbian language data
If you've been working with `Serbian` (introduced in v2.1.8), you'll need to
change the language code from `rs` to the correct `sr`:
```diff
- from spacy.lang.rs import Serbian
+ from spacy.lang.sr import Serbian
```

View File

@@ -40,6 +40,18 @@ const DATA = [
            },
        ],
    },
    {
        id: 'data',
        title: 'Additional data',
        multiple: true,
        options: [
            {
                id: 'lookups',
                title: 'Lemmatization',
                help: 'Install additional lookup tables and rules for lemmatization',
            },
        ],
    },
]
const QuickstartInstall = ({ id, title }) => (
@@ -87,6 +99,7 @@ const QuickstartInstall = ({ id, title }) => (
set PYTHONPATH=/path/to/spaCy
</QS>
<QS package="source">pip install -r requirements.txt</QS>
<QS data="lookups">pip install -U spacy-lookups-data</QS>
<QS package="source">python setup.py build_ext --inplace</QS>
{models.map(({ code, models: modelOptions }) => (
<QS models={code} key={code}>