diff --git a/website/docs/api/annotation.md b/website/docs/api/annotation.md
index 34065de91..fac7e79b6 100644
--- a/website/docs/api/annotation.md
+++ b/website/docs/api/annotation.md
@@ -42,18 +42,20 @@ processing.
> - **Nouns**: dogs, children → dog, child
> - **Verbs**: writes, writing, wrote, written → write
-A lemma is the uninflected form of a word. The English lemmatization data is
-taken from [WordNet](https://wordnet.princeton.edu). Lookup tables are taken
-from [Lexiconista](http://www.lexiconista.com/datasets/lemmatization/). spaCy
-also adds a **special case for pronouns**: all pronouns are lemmatized to the
-special token `-PRON-`.
+As of v2.2, lemmatization data is stored in a separate package,
+[`spacy-lookups-data`](https://github.com/explosion/spacy-lookups-data), which
+can be installed if needed via `pip install spacy[lookups]`. Some languages
+provide full lemmatization rules and exceptions, while other languages currently
+only rely on simple lookup tables.
-Unlike verbs and common nouns, there's no clear base form of a personal pronoun.
-Should the lemma of "me" be "I", or should we normalize person as well, giving
-"it" — or maybe "he"? spaCy's solution is to introduce a novel symbol, `-PRON-`,
-which is used as the lemma for all personal pronouns.
+spaCy adds a **special case for pronouns**: all pronouns are lemmatized to the
+special token `-PRON-`. Unlike verbs and common nouns, there's no clear base
+form of a personal pronoun. Should the lemma of "me" be "I", or should we
+normalize person as well, giving "it" — or maybe "he"? spaCy's solution is to
+introduce a novel symbol, `-PRON-`, which is used as the lemma for all personal
+pronouns.
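+
+For example, here's a quick sketch of the behavior (assuming a model such as
+[`en_core_web_sm`](/models/en) is installed):
+
+```python
+import spacy
+
+nlp = spacy.load("en_core_web_sm")
+doc = nlp("She was reading the papers")
+print([token.lemma_ for token in doc])
+# ['-PRON-', 'be', 'read', 'the', 'paper']
+```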
diff --git a/website/docs/usage/101/_language-data.md b/website/docs/usage/101/_language-data.md
index 6834f884f..31bfe53ab 100644
--- a/website/docs/usage/101/_language-data.md
+++ b/website/docs/usage/101/_language-data.md
@@ -34,9 +34,9 @@ together all components and creating the `Language` subclass – for example,
| **Character classes** [`char_classes.py`][char_classes.py] | Character classes to be used in regular expressions, for example, latin characters, quotes, hyphens or icons. |
| **Lexical attributes** [`lex_attrs.py`][lex_attrs.py] | Custom functions for setting lexical attributes on tokens, e.g. `like_num`, which includes language-specific words like "ten" or "hundred". |
| **Syntax iterators** [`syntax_iterators.py`][syntax_iterators.py] | Functions that compute views of a `Doc` object based on its syntax. At the moment, only used for [noun chunks](/usage/linguistic-features#noun-chunks). |
-| **Lemmatizer** [`lemmatizer.py`][lemmatizer.py] | Lemmatization rules or a lookup-based lemmatization table to assign base forms, for example "be" for "was". |
| **Tag map** [`tag_map.py`][tag_map.py] | Dictionary mapping strings in your tag set to [Universal Dependencies](http://universaldependencies.org/u/pos/all.html) tags. |
| **Morph rules** [`morph_rules.py`][morph_rules.py] | Exception rules for morphological analysis of irregular words like personal pronouns. |
+| **Lemmatizer** [`spacy-lookups-data`][spacy-lookups-data] | Lemmatization rules or a lookup-based lemmatization table to assign base forms, for example "be" for "was". |
[stop_words.py]:
https://github.com/explosion/spaCy/tree/master/spacy/lang/en/stop_words.py
@@ -52,9 +52,8 @@ together all components and creating the `Language` subclass – for example,
https://github.com/explosion/spaCy/tree/master/spacy/lang/en/lex_attrs.py
[syntax_iterators.py]:
https://github.com/explosion/spaCy/tree/master/spacy/lang/en/syntax_iterators.py
-[lemmatizer.py]:
- https://github.com/explosion/spaCy/tree/master/spacy/lang/de/lemmatizer.py
[tag_map.py]:
https://github.com/explosion/spaCy/tree/master/spacy/lang/en/tag_map.py
[morph_rules.py]:
https://github.com/explosion/spaCy/tree/master/spacy/lang/en/morph_rules.py
+[spacy-lookups-data]: https://github.com/explosion/spacy-lookups-data
diff --git a/website/docs/usage/adding-languages.md b/website/docs/usage/adding-languages.md
index 94d75ea31..157f543e6 100644
--- a/website/docs/usage/adding-languages.md
+++ b/website/docs/usage/adding-languages.md
@@ -417,7 +417,7 @@ mapping a string to its lemma. To determine a token's lemma, spaCy simply looks
it up in the table. Here's an example from the Spanish language data:
```json
-### lang/es/lemma_lookup.json (excerpt)
+### es_lemma_lookup.json (excerpt)
{
"aba": "abar",
"ababa": "abar",
@@ -432,33 +432,18 @@ it up in the table. Here's an example from the Spanish language data:
#### Adding JSON resources {#lemmatizer-resources new="2.2"}
-As of v2.2, resources for the lemmatizer are stored as JSON and loaded via the
-new [`Lookups`](/api/lookups) class. This allows easier access to the data,
-serialization with the models and file compression on disk (so your spaCy
-installation is smaller). Resource files can be provided via the `resources`
-attribute on the custom language subclass. All paths are relative to the
-language data directory, i.e. the directory the language's `__init__.py` is in.
-
-```python
-resources = {
- "lemma_lookup": "lemmatizer/lemma_lookup.json",
- "lemma_rules": "lemmatizer/lemma_rules.json",
- "lemma_index": "lemmatizer/lemma_index.json",
- "lemma_exc": "lemmatizer/lemma_exc.json",
-}
-```
-
-> #### Lookups example
->
-> ```python
-> table = nlp.vocab.lookups.get_table("my_table")
-> value = table.get("some_key")
-> ```
-
-If your language needs other large dictionaries and resources, you can also add
-those files here. The data will become available via a [`Lookups`](/api/lookups)
-table in `nlp.vocab.lookups`, and you'll be able to access it from the tokenizer
-or a custom pipeline component (via `doc.vocab.lookups`).
+As of v2.2, resources for the lemmatizer are stored as JSON and have been moved
+to a separate repository and package,
+[`spacy-lookups-data`](https://github.com/explosion/spacy-lookups-data). The
+package exposes the data files via language-specific
+[entry points](/usage/saving-loading#entry-points) that spaCy reads when
+constructing the `Vocab` and [`Lookups`](/api/lookups). This allows easier
+access to the data, serialization with the models and file compression on disk
+(so your spaCy installation is smaller). If you want to use the lookup tables
+without a pre-trained model, you have to install the data explicitly, either
+via `pip install spacy[lookups]` or by installing
+[`spacy-lookups-data`](https://github.com/explosion/spacy-lookups-data) in the
+same environment as spaCy.
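+
+Once installed, the tables are loaded into the vocab and exposed via the
+[`Lookups`](/api/lookups) API on `nlp.vocab.lookups`. A minimal sketch – the
+`"lemma_lookup"` table name comes from the lemmatizer data, and the entry shown
+matches the Spanish excerpt above:
+
+```python
+from spacy.lang.es import Spanish
+
+# Requires spacy-lookups-data to be installed in the same environment
+nlp = Spanish()
+table = nlp.vocab.lookups.get_table("lemma_lookup")
+print(table.get("ababa"))  # "abar"
+```
+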
### Tag map {#tag-map}
diff --git a/website/docs/usage/index.md b/website/docs/usage/index.md
index 1d6c0574c..43d602f6c 100644
--- a/website/docs/usage/index.md
+++ b/website/docs/usage/index.md
@@ -49,6 +49,16 @@ $ pip install -U spacy
> >>> nlp = spacy.load("en_core_web_sm")
> ```
+
+
+To install additional data tables for lemmatization in **spaCy v2.2+** (to
+create blank models or lemmatize in languages that don't yet come with
+pre-trained models), you can run `pip install spacy[lookups]` or install
+[`spacy-lookups-data`](https://github.com/explosion/spacy-lookups-data)
+separately.
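+
+For example, either of these commands will pull in the data tables:
+
+```bash
+$ pip install -U spacy[lookups]
+# or install the data package directly
+$ pip install -U spacy-lookups-data
+```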
+
+
+
When using pip it is generally recommended to install packages in a virtual
environment to avoid modifying system state:
diff --git a/website/docs/usage/models.md b/website/docs/usage/models.md
index c9b22279d..5fd92f8f3 100644
--- a/website/docs/usage/models.md
+++ b/website/docs/usage/models.md
@@ -48,6 +48,15 @@ contribute to model development.
> nlp = Finnish() # use directly
> nlp = spacy.blank("fi") # blank instance
> ```
+>
+> If lemmatization rules are available for your language, make sure to install
+> spaCy with the `lookups` option, or install
+> [`spacy-lookups-data`](https://github.com/explosion/spacy-lookups-data)
+> separately in the same environment:
+>
+> ```bash
+> $ pip install spacy[lookups]
+> ```
import Languages from 'widgets/languages.js'
diff --git a/website/docs/usage/saving-loading.md b/website/docs/usage/saving-loading.md
index 3d904f01a..fe2f4868f 100644
--- a/website/docs/usage/saving-loading.md
+++ b/website/docs/usage/saving-loading.md
@@ -285,6 +285,7 @@ installed in the same environment – that's it.
| ------------------------------------------------------------------------------ | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| [`spacy_factories`](#entry-points-components) | Group of entry points for pipeline component factories to add to [`Language.factories`](/usage/processing-pipelines#custom-components-factories), keyed by component name. |
| [`spacy_languages`](#entry-points-languages) | Group of entry points for custom [`Language` subclasses](/usage/adding-languages), keyed by language shortcut. |
+| `spacy_lookups` 2.2 | Group of entry points for custom [`Lookups`](/api/lookups), including lemmatizer data. Used by spaCy's [`spacy-lookups-data`](https://github.com/explosion/spacy-lookups-data) package. |
| [`spacy_displacy_colors`](#entry-points-displacy) 2.2 | Group of entry points of custom label colors for the [displaCy visualizer](/usage/visualizers#ent). The key name doesn't matter, but it should point to a dict of labels and color values. Useful for custom models that predict different entity types. |
### Custom components via entry points {#entry-points-components}
diff --git a/website/docs/usage/spacy-101.md b/website/docs/usage/spacy-101.md
index 306186870..379535cf4 100644
--- a/website/docs/usage/spacy-101.md
+++ b/website/docs/usage/spacy-101.md
@@ -145,6 +145,7 @@ the following components:
entity recognizer to predict those annotations in context.
- **Lexical entries** in the vocabulary, i.e. words and their
context-independent attributes like the shape or spelling.
+- **Data files** like lemmatization rules and lookup tables.
- **Word vectors**, i.e. multi-dimensional meaning representations of words that
let you determine how similar they are to each other.
- **Configuration** options, like the language and processing pipeline settings,
diff --git a/website/docs/usage/v2-2.md b/website/docs/usage/v2-2.md
index d256037ac..31d2552a3 100644
--- a/website/docs/usage/v2-2.md
+++ b/website/docs/usage/v2-2.md
@@ -4,13 +4,14 @@ teaser: New features, backwards incompatibilities and migration guide
menu:
- ['New Features', 'features']
- ['Backwards Incompatibilities', 'incompat']
+ - ['Migrating from v2.1', 'migrating']
---
## New Features {#features hidden="true"}
spaCy v2.2 features improved statistical models, new pretrained models for
Norwegian and Lithuanian, better Dutch NER, as well as a new mechanism for
-storing language data that makes the installation about **15× smaller** on
+storing language data that makes the installation about **7× smaller** on
disk. We've also added a new class to efficiently **serialize annotations**, an
improved and **10× faster** phrase matching engine, built-in scoring and
**CLI training for text classification**, a new command to analyze and **debug
@@ -45,35 +46,6 @@ overall. We've also added new core models for [Norwegian](/models/nb) (MIT) and
-### Serializable lookup table and dictionary API {#lookups}
-
-> #### Example
->
-> ```python
-> data = {"foo": "bar"}
-> nlp.vocab.lookups.add_table("my_dict", data)
->
-> def custom_component(doc):
-> table = doc.vocab.lookups.get_table("my_dict")
-> print(table.get("foo")) # look something up
-> return doc
-> ```
-
-The new `Lookups` API lets you add large dictionaries and lookup tables to the
-`Vocab` and access them from the tokenizer or custom components and extension
-attributes. Internally, the tables use Bloom filters for efficient lookup
-checks. They're also fully serializable out-of-the-box. All large data resources
-included with spaCy now use this API and are additionally compressed at build
-time. This allowed us to make the installed library roughly **15 times smaller
-on disk**.
-
-
-
-**API:** [`Lookups`](/api/lookups) **Usage: **
-[Adding languages: Lemmatizer](/usage/adding-languages#lemmatizer)
-
-
-
### Text classification scores and CLI training {#train-textcat-cli}
> #### Example
@@ -134,6 +106,40 @@ processing.
+### Serializable lookup tables and smaller installation {#lookups}
+
+> #### Example
+>
+> ```python
+> data = {"foo": "bar"}
+> nlp.vocab.lookups.add_table("my_dict", data)
+>
+> def custom_component(doc):
+> table = doc.vocab.lookups.get_table("my_dict")
+> print(table.get("foo")) # look something up
+> return doc
+> ```
+
+The new `Lookups` API lets you add large dictionaries and lookup tables to the
+`Vocab` and access them from the tokenizer or custom components and extension
+attributes. Internally, the tables use Bloom filters for efficient lookup
+checks. They're also fully serializable out-of-the-box. All large data resources
+like lemmatization tables have been moved to a separate package,
+[`spacy-lookups-data`](https://github.com/explosion/spacy-lookups-data), which
+can be installed alongside the core library. This allowed us to make the spaCy
+installation roughly **7× smaller on disk**. [Pretrained models](/models)
+now include their data files, so you only need to install the lookups if you
+want to build blank models or use lemmatization with languages that don't yet
+ship with pretrained models.
+
+
+
+**API:** [`Lookups`](/api/lookups),
+[`spacy-lookups-data`](https://github.com/explosion/spacy-lookups-data)
+**Usage:** [Adding languages: Lemmatizer](/usage/adding-languages#lemmatizer)
+
+
+
### CLI command to debug and validate training data {#debug-data}
> #### Example
@@ -306,6 +312,28 @@ check if all of your models are up to date, you can run the
+> #### Install with lookups data
+>
+> ```bash
+> $ pip install spacy[lookups]
+> ```
+>
+> You can also install
+> [`spacy-lookups-data`](https://github.com/explosion/spacy-lookups-data)
+> directly.
+
+- The lemmatization tables have been moved to their own package,
+ [`spacy-lookups-data`](https://github.com/explosion/spacy-lookups-data), which
+ is not installed by default. If you're using pre-trained models, **nothing
+ changes**, because the tables are now included in the model packages. If you
+ want to use the lemmatizer for other languages that don't yet have pre-trained
+ models (e.g. Turkish or Croatian) or start off with a blank model that
+ contains lookup data (e.g. `spacy.blank("en")`), you'll need to **explicitly
+ install spaCy plus data** via `pip install spacy[lookups]`.
+- Lemmatization tables (rules, exceptions, index and lookups) are now part of
+ the `Vocab` and serialized with it. This means that serialized objects (`nlp`,
+ pipeline components, vocab) will now include additional data, and models
+ written to disk will include additional files.
- The [Dutch model](/models/nl) has been trained on a new NER corpus (custom
labelled UD instead of WikiNER), so their predictions may be very different
compared to the previous version. The results should be significantly better
@@ -331,7 +359,7 @@ check if all of your models are up to date, you can run the
- The default punctuation in the [`Sentencizer`](/api/sentencizer) has been
extended and now includes more characters common in various languages. This
also means that the results it produces may change, depending on your text. If
- you want the previous behaviour with limited characters, set
+ you want the previous behavior with limited characters, set
`punct_chars=[".", "!", "?"]` on initialization.
- The [`PhraseMatcher`](/api/phrasematcher) algorithm was rewritten from scratch
and it's now 10× faster. The rewrite also resolved a few subtle bugs
@@ -339,13 +367,62 @@ check if all of your models are up to date, you can run the
may see slightly different results – however, the results should now be fully
correct. See [this PR](https://github.com/explosion/spaCy/pull/4309) for more
details.
-- Lemmatization tables (rules, exceptions, index and lookups) are now part of
- the `Vocab` and serialized with it. This means that serialized objects (`nlp`,
- pipeline components, vocab) will now include additional data, and models
- written to disk will include additional files.
- The `Serbian` language class (introduced in v2.1.8) incorrectly used the
language code `rs` instead of `sr`. This has now been fixed, so `Serbian` is
now available via `spacy.lang.sr`.
- The `"sources"` in the `meta.json` have changed from a list of strings to a
list of dicts. This is mostly internals, but if your code used
`nlp.meta["sources"]`, you might have to update it.
+
+### Migrating from spaCy 2.1 {#migrating}
+
+#### Lemmatization data and lookup tables
+
+If your application needs lemmatization for [languages](/usage/models#languages)
+with only tokenizers, you now need to install that data explicitly via
+`pip install spacy[lookups]` or `pip install spacy-lookups-data`. No additional
+setup is required – the package just needs to be installed in the same
+environment as spaCy.
+
+```python
+### {highlight="5-6"}
+from spacy.lang.tr import Turkish
+
+nlp = Turkish()
+doc = nlp("Bu bir cümledir.")
+# 🚨 This now requires the lookups data to be installed explicitly
+print([token.lemma_ for token in doc])
+```
+
+The same applies to blank models that you want to update and train – for
+instance, you might use [`spacy.blank`](/api/top-level#spacy.blank) to create a
+blank English model and then train your own part-of-speech tagger on top. If you
+don't explicitly install the lookups data, that `nlp` object won't have any
+lemmatization rules available. spaCy will now show you a warning when you train
+a new part-of-speech tagger and the vocab has no lookups available.
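+
+A minimal sketch of how to check – `has_table` is part of the
+[`Lookups`](/api/lookups) API, and `"lemma_lookup"` is the table name used by
+the lemmatization data:
+
+```python
+import spacy
+
+nlp = spacy.blank("en")
+# Only True if spacy[lookups] / spacy-lookups-data is installed
+print(nlp.vocab.lookups.has_table("lemma_lookup"))
+```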
+
+#### Converting entity offsets to BILUO tags
+
+If you've been using the
+[`biluo_tags_from_offsets`](/api/goldparse#biluo_tags_from_offsets) helper to
+convert character offsets into token-based BILUO tags, you may now see an error
+if the offsets describe overlapping entities, making it impossible to create a
+valid BILUO sequence. This is helpful, because it lets you spot potential
+problems in your data that can lead to inconsistent results later on. But it
+also means that you need to adjust and clean up the offsets before converting
+them:
+
+```diff
+doc = nlp("I live in Berlin Kreuzberg")
+- entities = [(10, 26, "LOC"), (10, 16, "GPE"), (17, 26, "LOC")]
++ entities = [(10, 16, "GPE"), (17, 26, "LOC")]
+tags = biluo_tags_from_offsets(doc, entities)
+```
+
+#### Serbian language data
+
+If you've been working with `Serbian` (introduced in v2.1.8), you'll need to
+change the language code from `rs` to the correct `sr`:
+
+```diff
+- from spacy.lang.rs import Serbian
++ from spacy.lang.sr import Serbian
+```
diff --git a/website/src/widgets/quickstart-install.js b/website/src/widgets/quickstart-install.js
index d267766f6..402d09c3c 100644
--- a/website/src/widgets/quickstart-install.js
+++ b/website/src/widgets/quickstart-install.js
@@ -40,6 +40,18 @@ const DATA = [
},
],
},
+ {
+ id: 'data',
+ title: 'Additional data',
+ multiple: true,
+ options: [
+ {
+ id: 'lookups',
+ title: 'Lemmatization',
+ help: 'Install additional lookup tables and rules for lemmatization',
+ },
+ ],
+ },
]
const QuickstartInstall = ({ id, title }) => (
@@ -87,6 +99,7 @@ const QuickstartInstall = ({ id, title }) => (
set PYTHONPATH=/path/to/spaCy
pip install -r requirements.txt
+ pip install -U spacy-lookups-data
python setup.py build_ext --inplace
{models.map(({ code, models: modelOptions }) => (