Mirror of https://github.com/explosion/spaCy.git, synced 2025-01-13 02:36:32 +03:00

Update lemma data documentation [ci skip]

commit a8a1800f2a (parent 932ad9cb91)
@@ -42,18 +42,20 @@ processing.
 > - **Nouns**: dogs, children → dog, child
 > - **Verbs**: writes, writing, wrote, written → write

-A lemma is the uninflected form of a word. The English lemmatization data is
-taken from [WordNet](https://wordnet.princeton.edu). Lookup tables are taken
-from [Lexiconista](http://www.lexiconista.com/datasets/lemmatization/). spaCy
-also adds a **special case for pronouns**: all pronouns are lemmatized to the
-special token `-PRON-`.
+As of v2.2, lemmatization data is stored in a separate package,
+[`spacy-lookups-data`](https://github.com/explosion/spacy-lookups-data) that can
+be installed if needed via `pip install spacy[lookups]`. Some languages provide
+full lemmatization rules and exceptions, while other languages currently only
+rely on simple lookup tables.

 <Infobox title="About spaCy's custom pronoun lemma" variant="warning">

-Unlike verbs and common nouns, there's no clear base form of a personal pronoun.
-Should the lemma of "me" be "I", or should we normalize person as well, giving
-"it" — or maybe "he"? spaCy's solution is to introduce a novel symbol, `-PRON-`,
-which is used as the lemma for all personal pronouns.
+spaCy adds a **special case for pronouns**: all pronouns are lemmatized to the
+special token `-PRON-`. Unlike verbs and common nouns, there's no clear base
+form of a personal pronoun. Should the lemma of "me" be "I", or should we
+normalize person as well, giving "it" — or maybe "he"? spaCy's solution is to
+introduce a novel symbol, `-PRON-`, which is used as the lemma for all personal
+pronouns.

 </Infobox>

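As a quick illustration of the pronoun behavior described above, here is a minimal sketch of inspecting lemmas in spaCy v2.x. It assumes the `en_core_web_sm` model is installed; the exact lemmas depend on the model and language data available.

```python
# Minimal sketch: inspecting lemmas in spaCy v2.x (assumes en_core_web_sm is installed)
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("She was reading the papers")
for token in doc:
    # In v2.x, personal pronouns are lemmatized to the special token "-PRON-"
    print(token.text, token.lemma_)
```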
@@ -34,9 +34,9 @@ together all components and creating the `Language` subclass – for example,
 | **Character classes**<br />[`char_classes.py`][char_classes.py] | Character classes to be used in regular expressions, for example, latin characters, quotes, hyphens or icons. |
 | **Lexical attributes**<br />[`lex_attrs.py`][lex_attrs.py] | Custom functions for setting lexical attributes on tokens, e.g. `like_num`, which includes language-specific words like "ten" or "hundred". |
 | **Syntax iterators**<br />[`syntax_iterators.py`][syntax_iterators.py] | Functions that compute views of a `Doc` object based on its syntax. At the moment, only used for [noun chunks](/usage/linguistic-features#noun-chunks). |
-| **Lemmatizer**<br />[`lemmatizer.py`][lemmatizer.py] | Lemmatization rules or a lookup-based lemmatization table to assign base forms, for example "be" for "was". |
 | **Tag map**<br />[`tag_map.py`][tag_map.py] | Dictionary mapping strings in your tag set to [Universal Dependencies](http://universaldependencies.org/u/pos/all.html) tags. |
 | **Morph rules**<br />[`morph_rules.py`][morph_rules.py] | Exception rules for morphological analysis of irregular words like personal pronouns. |
+| **Lemmatizer**<br />[`spacy-lookups-data`][spacy-lookups-data] | Lemmatization rules or a lookup-based lemmatization table to assign base forms, for example "be" for "was". |

 [stop_words.py]:
   https://github.com/explosion/spaCy/tree/master/spacy/lang/en/stop_words.py
@@ -52,9 +52,8 @@ together all components and creating the `Language` subclass – for example,
   https://github.com/explosion/spaCy/tree/master/spacy/lang/en/lex_attrs.py
 [syntax_iterators.py]:
   https://github.com/explosion/spaCy/tree/master/spacy/lang/en/syntax_iterators.py
-[lemmatizer.py]:
-  https://github.com/explosion/spaCy/tree/master/spacy/lang/de/lemmatizer.py
 [tag_map.py]:
   https://github.com/explosion/spaCy/tree/master/spacy/lang/en/tag_map.py
 [morph_rules.py]:
   https://github.com/explosion/spaCy/tree/master/spacy/lang/en/morph_rules.py
+[spacy-lookups-data]: https://github.com/explosion/spacy-lookups-data
@@ -417,7 +417,7 @@ mapping a string to its lemma. To determine a token's lemma, spaCy simply looks
 it up in the table. Here's an example from the Spanish language data:

 ```json
-### lang/es/lemma_lookup.json (excerpt)
+### es_lemma_lookup.json (excerpt)
 {
     "aba": "abar",
     "ababa": "abar",
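Conceptually, lookup-based lemmatization is just a dictionary lookup with the original string as the fallback. A small sketch of that idea (this mirrors the behavior described here, not spaCy's internal code):

```python
# Sketch of what a lookup-based lemmatizer boils down to: a plain dict lookup
# with the original string as the fallback.
lemma_lookup = {
    "aba": "abar",
    "ababa": "abar",
}

def lookup_lemma(word, table):
    # Unknown words simply fall back to themselves
    return table.get(word, word)

print(lookup_lemma("ababa", lemma_lookup))  # abar
print(lookup_lemma("gato", lemma_lookup))   # gato (no entry, returned as-is)
```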
@@ -432,33 +432,18 @@ it up in the table. Here's an example from the Spanish language data:

 #### Adding JSON resources {#lemmatizer-resources new="2.2"}

-As of v2.2, resources for the lemmatizer are stored as JSON and loaded via the
-new [`Lookups`](/api/lookups) class. This allows easier access to the data,
-serialization with the models and file compression on disk (so your spaCy
-installation is smaller). Resource files can be provided via the `resources`
-attribute on the custom language subclass. All paths are relative to the
-language data directory, i.e. the directory the language's `__init__.py` is in.
-
-```python
-resources = {
-    "lemma_lookup": "lemmatizer/lemma_lookup.json",
-    "lemma_rules": "lemmatizer/lemma_rules.json",
-    "lemma_index": "lemmatizer/lemma_index.json",
-    "lemma_exc": "lemmatizer/lemma_exc.json",
-}
-```
-
-> #### Lookups example
->
-> ```python
-> table = nlp.vocab.lookups.get_table("my_table")
-> value = table.get("some_key")
-> ```
-
-If your language needs other large dictionaries and resources, you can also add
-those files here. The data will become available via a [`Lookups`](/api/lookups)
-table in `nlp.vocab.lookups`, and you'll be able to access it from the tokenizer
-or a custom pipeline component (via `doc.vocab.lookups`).
+As of v2.2, resources for the lemmatizer are stored as JSON and have been moved
+to a separate repository and package,
+[`spacy-lookups-data`](https://github.com/explosion/spacy-lookups-data). The
+package exposes the data files via language-specific
+[entry points](/usage/saving-loading#entry-points) that spaCy reads when
+constructing the `Vocab` and [`Lookups`](/api/lookups). This allows easier
+access to the data, serialization with the models and file compression on disk
+(so your spaCy installation is smaller). If you want to use the lookup tables
+without a pre-trained model, you have to explicitly install spaCy with lookups
+via `pip install spacy[lookups]` or by installing
+[`spacy-lookups-data`](https://github.com/explosion/spacy-lookups-data) in the
+same environment as spaCy.

 ### Tag map {#tag-map}

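A hedged sketch of what the entry-point mechanism above means in practice: with `spacy-lookups-data` installed, a freshly created blank pipeline should pick up the lemma tables when the `Vocab` is built. The table name `lemma_lookup` follows the resource names used elsewhere in these docs; the exact set of tables depends on the language.

```python
# Sketch: after `pip install spacy[lookups]`, a blank pipeline should expose
# the lemma tables via nlp.vocab.lookups.
import spacy

nlp = spacy.blank("es")
print(nlp.vocab.lookups.tables)          # e.g. ["lemma_lookup", ...]
if nlp.vocab.lookups.has_table("lemma_lookup"):
    table = nlp.vocab.lookups.get_table("lemma_lookup")
    print(table.get("ababa", "ababa"))   # "abar"
```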
@@ -49,6 +49,16 @@ $ pip install -U spacy
 > >>> nlp = spacy.load("en_core_web_sm")
 > ```

+<Infobox variant="warning">
+
+To install additional data tables for lemmatization in **spaCy v2.2+** (to
+create blank models or lemmatize in languages that don't yet come with
+pre-trained models), you can run `pip install spacy[lookups]` or install
+[`spacy-lookups-data`](https://github.com/explosion/spacy-lookups-data)
+separately.
+
+</Infobox>
+
 When using pip it is generally recommended to install packages in a virtual
 environment to avoid modifying system state:

@@ -48,6 +48,15 @@ contribute to model development.
 > nlp = Finnish() # use directly
 > nlp = spacy.blank("fi") # blank instance
 > ```
+>
+> If lemmatization rules are available for your language, make sure to install
+> spaCy with the `lookups` option, or install
+> [`spacy-lookups-data`](https://github.com/explosion/spacy-lookups-data)
+> separately in the same environment:
+>
+> ```bash
+> $ pip install spacy[lookups]
+> ```

 import Languages from 'widgets/languages.js'

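A quick way to see whether that data was actually picked up for a blank pipeline is to look at the registered lookup tables. A minimal sketch, assuming spaCy v2.2:

```python
# Sketch: check whether lemmatization tables were picked up for a blank model.
# If the list is empty, install the data via `pip install spacy[lookups]`.
import spacy

nlp = spacy.blank("fi")
if not nlp.vocab.lookups.tables:
    print("No lookup tables found: install spacy-lookups-data for lemmatization")
else:
    print("Available tables:", nlp.vocab.lookups.tables)
```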
@@ -285,6 +285,7 @@ installed in the same environment – that's it.
 | ------------------------------------------------------------------------------ | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
 | [`spacy_factories`](#entry-points-components) | Group of entry points for pipeline component factories to add to [`Language.factories`](/usage/processing-pipelines#custom-components-factories), keyed by component name. |
 | [`spacy_languages`](#entry-points-languages) | Group of entry points for custom [`Language` subclasses](/usage/adding-languages), keyed by language shortcut. |
+| `spacy_lookups` <Tag variant="new">2.2</Tag> | Group of entry points for custom [`Lookups`](/api/lookups), including lemmatizer data. Used by spaCy's [`spacy-lookups-data`](https://github.com/explosion/spacy-lookups-data) package. |
 | [`spacy_displacy_colors`](#entry-points-displacy) <Tag variant="new">2.2</Tag> | Group of entry points of custom label colors for the [displaCy visualizer](/usage/visualizers#ent). The key name doesn't matter, but it should point to a dict of labels and color values. Useful for custom models that predict different entity types. |

 ### Custom components via entry points {#entry-points-components}
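For orientation, a hypothetical `setup.py` showing how a third-party package could register data under the `spacy_lookups` entry-point group. The package and module names here are invented, and the exact object the entry point should expose is defined by `spacy-lookups-data`, so treat this as a shape of the mechanism rather than a spec:

```python
# Hypothetical setup.py exposing extra lookup tables via the "spacy_lookups"
# entry-point group. Names are invented; see spacy-lookups-data for the
# authoritative layout of the exposed data.
from setuptools import setup

setup(
    name="my-lookups-package",
    packages=["my_lookups_package"],
    entry_points={
        "spacy_lookups": [
            "xx = my_lookups_package:xx_tables",
        ]
    },
)
```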
@@ -145,6 +145,7 @@ the following components:
   entity recognizer to predict those annotations in context.
 - **Lexical entries** in the vocabulary, i.e. words and their
   context-independent attributes like the shape or spelling.
+- **Data files** like lemmatization rules and lookup tables.
 - **Word vectors**, i.e. multi-dimensional meaning representations of words that
   let you determine how similar they are to each other.
 - **Configuration** options, like the language and processing pipeline settings,
@@ -4,13 +4,14 @@ teaser: New features, backwards incompatibilities and migration guide
 menu:
   - ['New Features', 'features']
   - ['Backwards Incompatibilities', 'incompat']
+  - ['Migrating from v2.1', 'migrating']
 ---

 ## New Features {#features hidden="true"}

 spaCy v2.2 features improved statistical models, new pretrained models for
 Norwegian and Lithuanian, better Dutch NER, as well as a new mechanism for
-storing language data that makes the installation about **15× smaller** on
+storing language data that makes the installation about **7× smaller** on
 disk. We've also added a new class to efficiently **serialize annotations**, an
 improved and **10× faster** phrase matching engine, built-in scoring and
 **CLI training for text classification**, a new command to analyze and **debug
@@ -45,35 +46,6 @@ overall. We've also added new core models for [Norwegian](/models/nb) (MIT) and

 </Infobox>

-### Serializable lookup table and dictionary API {#lookups}
-
-> #### Example
->
-> ```python
-> data = {"foo": "bar"}
-> nlp.vocab.lookups.add_table("my_dict", data)
->
-> def custom_component(doc):
->     table = doc.vocab.lookups.get_table("my_dict")
->     print(table.get("foo"))  # look something up
->     return doc
-> ```
-
-The new `Lookups` API lets you add large dictionaries and lookup tables to the
-`Vocab` and access them from the tokenizer or custom components and extension
-attributes. Internally, the tables use Bloom filters for efficient lookup
-checks. They're also fully serializable out-of-the-box. All large data resources
-included with spaCy now use this API and are additionally compressed at build
-time. This allowed us to make the installed library roughly **15 times smaller
-on disk**.
-
-<Infobox>
-
-**API:** [`Lookups`](/api/lookups) **Usage: **
-[Adding languages: Lemmatizer](/usage/adding-languages#lemmatizer)
-
-</Infobox>
-
 ### Text classification scores and CLI training {#train-textcat-cli}

 > #### Example
@@ -134,6 +106,40 @@ processing.

 </Infobox>

+### Serializable lookup tables and smaller installation {#lookups}
+
+> #### Example
+>
+> ```python
+> data = {"foo": "bar"}
+> nlp.vocab.lookups.add_table("my_dict", data)
+>
+> def custom_component(doc):
+>     table = doc.vocab.lookups.get_table("my_dict")
+>     print(table.get("foo"))  # look something up
+>     return doc
+> ```
+
+The new `Lookups` API lets you add large dictionaries and lookup tables to the
+`Vocab` and access them from the tokenizer or custom components and extension
+attributes. Internally, the tables use Bloom filters for efficient lookup
+checks. They're also fully serializable out-of-the-box. All large data resources
+like lemmatization tables have been moved to a separate package,
+[`spacy-lookups-data`](https://github.com/explosion/spacy-lookups-data) that can
+be installed alongside the core library. This allowed us to make the spaCy
+installation roughly **7× smaller on disk**. [Pretrained models](/models)
+now include their data files, so you only need to install the lookups if you
+want to build blank models or use lemmatization with languages that don't yet
+ship with pretrained models.
+
+<Infobox>
+
+**API:** [`Lookups`](/api/lookups),
+[`spacy-lookups-data`](https://github.com/explosion/spacy-lookups-data) **Usage:
+** [Adding languages: Lemmatizer](/usage/adding-languages#lemmatizer)
+
+</Infobox>
+
 ### CLI command to debug and validate training data {#debug-data}

 > #### Example
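The "fully serializable out-of-the-box" part can be illustrated with a round-trip through bytes. A minimal sketch, with method names as documented in the v2.2 `Lookups` API:

```python
# Minimal sketch of Lookups serialization: add a table, serialize to bytes,
# restore into a new Lookups object.
from spacy.lookups import Lookups

lookups = Lookups()
lookups.add_table("my_dict", {"foo": "bar"})
data = lookups.to_bytes()

restored = Lookups()
restored.from_bytes(data)
print(restored.get_table("my_dict").get("foo"))  # bar
```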
@@ -306,6 +312,28 @@ check if all of your models are up to date, you can run the

 </Infobox>

+> #### Install with lookups data
+>
+> ```bash
+> $ pip install spacy[lookups]
+> ```
+>
+> You can also install
+> [`spacy-lookups-data`](https://github.com/explosion/spacy-lookups-data)
+> directly.

+- The lemmatization tables have been moved to their own package,
+  [`spacy-lookups-data`](https://github.com/explosion/spacy-lookups-data), which
+  is not installed by default. If you're using pre-trained models, **nothing
+  changes**, because the tables are now included in the model packages. If you
+  want to use the lemmatizer for other languages that don't yet have pre-trained
+  models (e.g. Turkish or Croatian) or start off with a blank model that
+  contains lookup data (e.g. `spacy.blank("en")`), you'll need to **explicitly
+  install spaCy plus data** via `pip install spacy[lookups]`.
+- Lemmatization tables (rules, exceptions, index and lookups) are now part of
+  the `Vocab` and serialized with it. This means that serialized objects (`nlp`,
+  pipeline components, vocab) will now include additional data, and models
+  written to disk will include additional files.
 - The [Dutch model](/models/nl) has been trained on a new NER corpus (custom
   labelled UD instead of WikiNER), so their predictions may be very different
   compared to the previous version. The results should be significantly better
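The practical consequence of the tables living on the `Vocab` is that they travel with the pipeline when it is saved and loaded. A small sketch, assuming spaCy v2.2 with the lookups extra installed; the path is just a placeholder:

```python
# Sketch: lookup tables are serialized with the Vocab when a pipeline is
# written to disk and restored on load.
import spacy

nlp = spacy.blank("en")               # with spacy[lookups], tables are present
print(nlp.vocab.lookups.tables)

nlp.to_disk("/tmp/blank_en")
reloaded = spacy.load("/tmp/blank_en")
print(reloaded.vocab.lookups.tables)  # same tables, restored from disk
```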
@@ -331,7 +359,7 @@ check if all of your models are up to date, you can run the
 - The default punctuation in the [`Sentencizer`](/api/sentencizer) has been
   extended and now includes more characters common in various languages. This
   also means that the results it produces may change, depending on your text. If
-  you want the previous behaviour with limited characters, set
+  you want the previous behavior with limited characters, set
   `punct_chars=[".", "!", "?"]` on initialization.
 - The [`PhraseMatcher`](/api/phrasematcher) algorithm was rewritten from scratch
   and it's now 10× faster. The rewrite also resolved a few subtle bugs
@@ -339,13 +367,62 @@ check if all of your models are up to date, you can run the
   may see slightly different results – however, the results should now be fully
   correct. See [this PR](https://github.com/explosion/spaCy/pull/4309) for more
   details.
-- Lemmatization tables (rules, exceptions, index and lookups) are now part of
-  the `Vocab` and serialized with it. This means that serialized objects (`nlp`,
-  pipeline components, vocab) will now include additional data, and models
-  written to disk will include additional files.
 - The `Serbian` language class (introduced in v2.1.8) incorrectly used the
   language code `rs` instead of `sr`. This has now been fixed, so `Serbian` is
   now available via `spacy.lang.sr`.
 - The `"sources"` in the `meta.json` have changed from a list of strings to a
   list of dicts. This is mostly internals, but if your code used
   `nlp.meta["sources"]`, you might have to update it.
+
+### Migrating from spaCy 2.1 {#migrating}
+
+#### Lemmatization data and lookup tables
+
+If your application needs lemmatization for [languages](/usage/models#languages)
+with only tokenizers, you now need to install that data explicitly via
+`pip install spacy[lookups]` or `pip install spacy-lookups-data`. No additional
+setup is required – the package just needs to be installed in the same
+environment as spaCy.
+
+```python
+### {highlight="3-4"}
+nlp = Turkish()
+doc = nlp("Bu bir cümledir.")
+# 🚨 This now requires the lookups data to be installed explicitly
+print([token.lemma_ for token in doc])
+```
+
+The same applies to blank models that you want to update and train – for
+instance, you might use [`spacy.blank`](/api/top-level#spacy.blank) to create a
+blank English model and then train your own part-of-speech tagger on top. If you
+don't explicitly install the lookups data, that `nlp` object won't have any
+lemmatization rules available. spaCy will now show you a warning when you train
+a new part-of-speech tagger and the vocab has no lookups available.
+
+#### Converting entity offsets to BILUO tags
+
+If you've been using the
+[`biluo_tags_from_offsets`](/api/goldparse#biluo_tags_from_offsets) helper to
+convert character offsets into token-based BILUO tags, you may now see an error
+if the offsets contain overlapping tokens and make it impossible to create a
+valid BILUO sequence. This is helpful, because it lets you spot potential
+problems in your data that can lead to inconsistent results later on. But it
+also means that you need to adjust and clean up the offsets before converting
+them:
+
+```diff
+doc = nlp("I live in Berlin Kreuzberg")
+- entities = [(10, 26, "LOC"), (10, 16, "GPE"), (17, 26, "LOC")]
++ entities = [(10, 16, "GPE"), (17, 26, "LOC")]
+tags = biluo_tags_from_offsets(doc, entities)
+```
+
+#### Serbian language data
+
+If you've been working with `Serbian` (introduced in v2.1.8), you'll need to
+change the language code from `rs` to the correct `sr`:
+
+```diff
+- from spacy.lang.rs import Serbian
++ from spacy.lang.sr import Serbian
+```
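For reference, a fuller version of the BILUO conversion shown in the migration section, with the import spelled out (in spaCy v2.x the helper lives in `spacy.gold`):

```python
# BILUO conversion with the import included (spaCy v2.x).
import spacy
from spacy.gold import biluo_tags_from_offsets

nlp = spacy.blank("en")
doc = nlp("I live in Berlin Kreuzberg")
entities = [(10, 16, "GPE"), (17, 26, "LOC")]
tags = biluo_tags_from_offsets(doc, entities)
print(tags)  # e.g. ['O', 'O', 'O', 'U-GPE', 'U-LOC']
```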
@@ -40,6 +40,18 @@ const DATA = [
             },
         ],
     },
+    {
+        id: 'data',
+        title: 'Additional data',
+        multiple: true,
+        options: [
+            {
+                id: 'lookups',
+                title: 'Lemmatization',
+                help: 'Install additional lookup tables and rules for lemmatization',
+            },
+        ],
+    },
 ]

 const QuickstartInstall = ({ id, title }) => (
@@ -87,6 +99,7 @@ const QuickstartInstall = ({ id, title }) => (
                 set PYTHONPATH=/path/to/spaCy
             </QS>
             <QS package="source">pip install -r requirements.txt</QS>
+            <QS data="lookups">pip install -U spacy-lookups-data</QS>
             <QS package="source">python setup.py build_ext --inplace</QS>
             {models.map(({ code, models: modelOptions }) => (
                 <QS models={code} key={code}>