diff --git a/website/docs/api/annotation.md b/website/docs/api/annotation.md
index 34065de91..fac7e79b6 100644
--- a/website/docs/api/annotation.md
+++ b/website/docs/api/annotation.md
@@ -42,18 +42,20 @@ processing.
 > - **Nouns**: dogs, children → dog, child
 > - **Verbs**: writes, writing, wrote, written → write

-A lemma is the uninflected form of a word. The English lemmatization data is
-taken from [WordNet](https://wordnet.princeton.edu). Lookup tables are taken
-from [Lexiconista](http://www.lexiconista.com/datasets/lemmatization/). spaCy
-also adds a **special case for pronouns**: all pronouns are lemmatized to the
-special token `-PRON-`.
+As of v2.2, lemmatization data is stored in a separate package,
+[`spacy-lookups-data`](https://github.com/explosion/spacy-lookups-data), which
+can be installed if needed via `pip install spacy[lookups]`. Some languages
+provide full lemmatization rules and exceptions, while other languages
+currently only rely on simple lookup tables.

-Unlike verbs and common nouns, there's no clear base form of a personal pronoun.
-Should the lemma of "me" be "I", or should we normalize person as well, giving
-"it" — or maybe "he"? spaCy's solution is to introduce a novel symbol, `-PRON-`,
-which is used as the lemma for all personal pronouns.
+spaCy adds a **special case for pronouns**: all pronouns are lemmatized to the
+special token `-PRON-`. Unlike verbs and common nouns, there's no clear base
+form of a personal pronoun. Should the lemma of "me" be "I", or should we
+normalize person as well, giving "it" — or maybe "he"? spaCy's solution is to
+introduce a novel symbol, `-PRON-`, which is used as the lemma for all personal
+pronouns.
diff --git a/website/docs/usage/101/_language-data.md b/website/docs/usage/101/_language-data.md
index 6834f884f..31bfe53ab 100644
--- a/website/docs/usage/101/_language-data.md
+++ b/website/docs/usage/101/_language-data.md
@@ -34,9 +34,9 @@ together all components and creating the `Language` subclass – for example,
 | **Character classes**<br />[`char_classes.py`][char_classes.py] | Character classes to be used in regular expressions, for example, latin characters, quotes, hyphens or icons. |
 | **Lexical attributes**<br />[`lex_attrs.py`][lex_attrs.py] | Custom functions for setting lexical attributes on tokens, e.g. `like_num`, which includes language-specific words like "ten" or "hundred". |
 | **Syntax iterators**<br />[`syntax_iterators.py`][syntax_iterators.py] | Functions that compute views of a `Doc` object based on its syntax. At the moment, only used for [noun chunks](/usage/linguistic-features#noun-chunks). |
-| **Lemmatizer**<br />[`lemmatizer.py`][lemmatizer.py] | Lemmatization rules or a lookup-based lemmatization table to assign base forms, for example "be" for "was". |
 | **Tag map**<br />[`tag_map.py`][tag_map.py] | Dictionary mapping strings in your tag set to [Universal Dependencies](http://universaldependencies.org/u/pos/all.html) tags. |
 | **Morph rules**<br />[`morph_rules.py`][morph_rules.py] | Exception rules for morphological analysis of irregular words like personal pronouns. |
+| **Lemmatizer**<br />[`spacy-lookups-data`][spacy-lookups-data] | Lemmatization rules or a lookup-based lemmatization table to assign base forms, for example "be" for "was". |

 [stop_words.py]:
   https://github.com/explosion/spaCy/tree/master/spacy/lang/en/stop_words.py
@@ -52,9 +52,8 @@ together all components and creating the `Language` subclass – for example,
   https://github.com/explosion/spaCy/tree/master/spacy/lang/en/lex_attrs.py
 [syntax_iterators.py]:
   https://github.com/explosion/spaCy/tree/master/spacy/lang/en/syntax_iterators.py
-[lemmatizer.py]:
-  https://github.com/explosion/spaCy/tree/master/spacy/lang/de/lemmatizer.py
 [tag_map.py]:
   https://github.com/explosion/spaCy/tree/master/spacy/lang/en/tag_map.py
 [morph_rules.py]:
   https://github.com/explosion/spaCy/tree/master/spacy/lang/en/morph_rules.py
+[spacy-lookups-data]: https://github.com/explosion/spacy-lookups-data
diff --git a/website/docs/usage/adding-languages.md b/website/docs/usage/adding-languages.md
index 94d75ea31..157f543e6 100644
--- a/website/docs/usage/adding-languages.md
+++ b/website/docs/usage/adding-languages.md
@@ -417,7 +417,7 @@ mapping a string to its lemma. To determine a token's lemma, spaCy simply looks
 it up in the table. Here's an example from the Spanish language data:

 ```json
-### lang/es/lemma_lookup.json (excerpt)
+### es_lemma_lookup.json (excerpt)
 {
   "aba": "abar",
   "ababa": "abar",
@@ -432,33 +432,18 @@ it up in the table. Here's an example from the Spanish language data:

 #### Adding JSON resources {#lemmatizer-resources new="2.2"}

-As of v2.2, resources for the lemmatizer are stored as JSON and loaded via the
-new [`Lookups`](/api/lookups) class. This allows easier access to the data,
-serialization with the models and file compression on disk (so your spaCy
-installation is smaller). Resource files can be provided via the `resources`
-attribute on the custom language subclass. All paths are relative to the
-language data directory, i.e. the directory the language's `__init__.py` is in.
-
-```python
-resources = {
-    "lemma_lookup": "lemmatizer/lemma_lookup.json",
-    "lemma_rules": "lemmatizer/lemma_rules.json",
-    "lemma_index": "lemmatizer/lemma_index.json",
-    "lemma_exc": "lemmatizer/lemma_exc.json",
-}
-```
-
-> #### Lookups example
->
-> ```python
-> table = nlp.vocab.lookups.get_table("my_table")
-> value = table.get("some_key")
-> ```
-
-If your language needs other large dictionaries and resources, you can also add
-those files here. The data will become available via a [`Lookups`](/api/lookups)
-table in `nlp.vocab.lookups`, and you'll be able to access it from the tokenizer
-or a custom pipeline component (via `doc.vocab.lookups`).
+As of v2.2, resources for the lemmatizer are stored as JSON and have been moved
+to a separate repository and package,
+[`spacy-lookups-data`](https://github.com/explosion/spacy-lookups-data). The
+package exposes the data files via language-specific
+[entry points](/usage/saving-loading#entry-points) that spaCy reads when
+constructing the `Vocab` and [`Lookups`](/api/lookups). This allows easier
+access to the data, serialization with the models and file compression on disk
+(so your spaCy installation is smaller). If you want to use the lookup tables
+without a pre-trained model, you have to explicitly install spaCy with lookups
+via `pip install spacy[lookups]` or by installing
+[`spacy-lookups-data`](https://github.com/explosion/spacy-lookups-data) in the
+same environment as spaCy.
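+
+Once the package is installed, the tables are merged into the vocab's
+[`Lookups`](/api/lookups) when the language class is initialized. As a minimal
+sketch, assuming the Spanish data shown above is installed (the exact table
+names can differ per language), you can inspect the result like this:
+
+```python
+import spacy
+
+nlp = spacy.blank("es")
+lookups = nlp.vocab.lookups
+# Only present if spacy-lookups-data provides a lookup table for this language
+if lookups.has_table("lemma_lookup"):
+    lemma_lookup = lookups.get_table("lemma_lookup")
+    print(lemma_lookup.get("ababa"))  # "abar"
+```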

 ### Tag map {#tag-map}
diff --git a/website/docs/usage/index.md b/website/docs/usage/index.md
index 1d6c0574c..43d602f6c 100644
--- a/website/docs/usage/index.md
+++ b/website/docs/usage/index.md
@@ -49,6 +49,16 @@ $ pip install -U spacy
 > >>> nlp = spacy.load("en_core_web_sm")
 > ```

+
+To install additional data tables for lemmatization in **spaCy v2.2+** (to
+create blank models or lemmatize in languages that don't yet come with
+pre-trained models), you can run `pip install spacy[lookups]` or install
+[`spacy-lookups-data`](https://github.com/explosion/spacy-lookups-data)
+separately.
+
+
 When using pip it is generally recommended to install packages in a virtual
 environment to avoid modifying system state:
diff --git a/website/docs/usage/models.md b/website/docs/usage/models.md
index c9b22279d..5fd92f8f3 100644
--- a/website/docs/usage/models.md
+++ b/website/docs/usage/models.md
@@ -48,6 +48,15 @@ contribute to model development.
 > nlp = Finnish() # use directly
 > nlp = spacy.blank("fi") # blank instance
 > ```
+>
+> If lemmatization rules are available for your language, make sure to install
+> spaCy with the `lookups` option, or install
+> [`spacy-lookups-data`](https://github.com/explosion/spacy-lookups-data)
+> separately in the same environment:
+>
+> ```bash
+> $ pip install spacy[lookups]
+> ```

 import Languages from 'widgets/languages.js'
diff --git a/website/docs/usage/saving-loading.md b/website/docs/usage/saving-loading.md
index 3d904f01a..fe2f4868f 100644
--- a/website/docs/usage/saving-loading.md
+++ b/website/docs/usage/saving-loading.md
@@ -285,6 +285,7 @@ installed in the same environment – that's it.
 | -------------------------------------------------------- | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
 | [`spacy_factories`](#entry-points-components) | Group of entry points for pipeline component factories to add to [`Language.factories`](/usage/processing-pipelines#custom-components-factories), keyed by component name. |
 | [`spacy_languages`](#entry-points-languages) | Group of entry points for custom [`Language` subclasses](/usage/adding-languages), keyed by language shortcut. |
+| `spacy_lookups` 2.2 | Group of entry points for custom [`Lookups`](/api/lookups), including lemmatizer data. Used by spaCy's [`spacy-lookups-data`](https://github.com/explosion/spacy-lookups-data) package. |
 | [`spacy_displacy_colors`](#entry-points-displacy) 2.2 | Group of entry points of custom label colors for the [displaCy visualizer](/usage/visualizers#ent). The key name doesn't matter, but it should point to a dict of labels and color values. Useful for custom models that predict different entity types. |

 ### Custom components via entry points {#entry-points-components}
diff --git a/website/docs/usage/spacy-101.md b/website/docs/usage/spacy-101.md
index 306186870..379535cf4 100644
--- a/website/docs/usage/spacy-101.md
+++ b/website/docs/usage/spacy-101.md
@@ -145,6 +145,7 @@ the following components:
   entity recognizer to predict those annotations in context.
 - **Lexical entries** in the vocabulary, i.e. words and their
   context-independent attributes like the shape or spelling.
+- **Data files** like lemmatization rules and lookup tables.
 - **Word vectors**, i.e. multi-dimensional meaning representations of words
   that let you determine how similar they are to each other.
 - **Configuration** options, like the language and processing pipeline settings,
diff --git a/website/docs/usage/v2-2.md b/website/docs/usage/v2-2.md
index d256037ac..31d2552a3 100644
--- a/website/docs/usage/v2-2.md
+++ b/website/docs/usage/v2-2.md
@@ -4,13 +4,14 @@ teaser: New features, backwards incompatibilities and migration guide
 menu:
   - ['New Features', 'features']
   - ['Backwards Incompatibilities', 'incompat']
+  - ['Migrating from v2.1', 'migrating']
 ---

 ## New Features {#features hidden="true"}

 spaCy v2.2 features improved statistical models, new pretrained models for
 Norwegian and Lithuanian, better Dutch NER, as well as a new mechanism for
-storing language data that makes the installation about **15× smaller** on
+storing language data that makes the installation about **7× smaller** on
 disk. We've also added a new class to efficiently **serialize annotations**, an
 improved and **10× faster** phrase matching engine, built-in scoring and
 **CLI training for text classification**, a new command to analyze and **debug
@@ -45,35 +46,6 @@ overall. We've also added new core models for [Norwegian](/models/nb) (MIT) and

-### Serializable lookup table and dictionary API {#lookups}
-
-> #### Example
->
-> ```python
-> data = {"foo": "bar"}
-> nlp.vocab.lookups.add_table("my_dict", data)
->
-> def custom_component(doc):
->     table = doc.vocab.lookups.get_table("my_dict")
->     print(table.get("foo"))  # look something up
->     return doc
-> ```
-
-The new `Lookups` API lets you add large dictionaries and lookup tables to the
-`Vocab` and access them from the tokenizer or custom components and extension
-attributes. Internally, the tables use Bloom filters for efficient lookup
-checks. They're also fully serializable out-of-the-box. All large data resources
-included with spaCy now use this API and are additionally compressed at build
-time. This allowed us to make the installed library roughly **15 times smaller
-on disk**.
-
-
-**API:** [`Lookups`](/api/lookups) **Usage: **
-[Adding languages: Lemmatizer](/usage/adding-languages#lemmatizer)
-
-
 ### Text classification scores and CLI training {#train-textcat-cli}

 > #### Example
 >
@@ -134,6 +106,40 @@ processing.
+
+### Serializable lookup tables and smaller installation {#lookups}
+
+> #### Example
+>
+> ```python
+> data = {"foo": "bar"}
+> nlp.vocab.lookups.add_table("my_dict", data)
+>
+> def custom_component(doc):
+>     table = doc.vocab.lookups.get_table("my_dict")
+>     print(table.get("foo"))  # look something up
+>     return doc
+> ```
+
+The new `Lookups` API lets you add large dictionaries and lookup tables to the
+`Vocab` and access them from the tokenizer or custom components and extension
+attributes. Internally, the tables use Bloom filters for efficient lookup
+checks. They're also fully serializable out of the box. All large data
+resources like lemmatization tables have been moved to a separate package,
+[`spacy-lookups-data`](https://github.com/explosion/spacy-lookups-data), which
+can be installed alongside the core library. This allowed us to make the spaCy
+installation roughly **7× smaller on disk**. [Pretrained models](/models) now
+include their data files, so you only need to install the lookups if you want
+to build blank models or use lemmatization with languages that don't yet ship
+with pretrained models.
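+
+For example, with the lookups data installed, even a blank model should
+receive the lemma tables through its vocab. The following is only a minimal
+sketch, and the exact table names and lemmas depend on the language data:
+
+```python
+import spacy
+
+# Assumes spacy-lookups-data is installed in the same environment
+nlp = spacy.blank("en")
+print(nlp.vocab.lookups.tables)  # e.g. ['lemma_lookup', 'lemma_rules', ...]
+print(nlp("was")[0].lemma_)      # should come out as "be"
+```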
+
+
+**API:** [`Lookups`](/api/lookups),
+[`spacy-lookups-data`](https://github.com/explosion/spacy-lookups-data) **Usage:
+** [Adding languages: Lemmatizer](/usage/adding-languages#lemmatizer)
+
+
 ### CLI command to debug and validate training data {#debug-data}

 > #### Example
 >
@@ -306,6 +312,28 @@ check if all of your models are up to date, you can run the

+> #### Install with lookups data
+>
+> ```bash
+> $ pip install spacy[lookups]
+> ```
+>
+> You can also install
+> [`spacy-lookups-data`](https://github.com/explosion/spacy-lookups-data)
+> directly.
+
+- The lemmatization tables have been moved to their own package,
+  [`spacy-lookups-data`](https://github.com/explosion/spacy-lookups-data), which
+  is not installed by default. If you're using pre-trained models, **nothing
+  changes**, because the tables are now included in the model packages. If you
+  want to use the lemmatizer for other languages that don't yet have pre-trained
+  models (e.g. Turkish or Croatian) or start off with a blank model that
+  contains lookup data (e.g. `spacy.blank("en")`), you'll need to **explicitly
+  install spaCy plus data** via `pip install spacy[lookups]`.
+- Lemmatization tables (rules, exceptions, index and lookups) are now part of
+  the `Vocab` and serialized with it. This means that serialized objects (`nlp`,
+  pipeline components, vocab) will now include additional data, and models
+  written to disk will include additional files.
 - The [Dutch model](/models/nl) has been trained on a new NER corpus (custom
   labelled UD instead of WikiNER), so their predictions may be very different
   compared to the previous version. The results should be significantly better
@@ -331,7 +359,7 @@ check if all of your models are up to date, you can run the
 - The default punctuation in the [`Sentencizer`](/api/sentencizer) has been
   extended and now includes more characters common in various languages. This
   also means that the results it produces may change, depending on your text. If
-  you want the previous behaviour with limited characters, set
+  you want the previous behavior with limited characters, set
   `punct_chars=[".", "!", "?"]` on initialization.
 - The [`PhraseMatcher`](/api/phrasematcher) algorithm was rewritten from
   scratch and it's now 10× faster. The rewrite also resolved a few subtle bugs
@@ -339,13 +367,62 @@
   may see slightly different results – however, the results should now be fully
   correct. See [this PR](https://github.com/explosion/spaCy/pull/4309) for more
   details.
-- Lemmatization tables (rules, exceptions, index and lookups) are now part of
-  the `Vocab` and serialized with it. This means that serialized objects (`nlp`,
-  pipeline components, vocab) will now include additional data, and models
-  written to disk will include additional files.
 - The `Serbian` language class (introduced in v2.1.8) incorrectly used the
   language code `rs` instead of `sr`. This has now been fixed, so `Serbian` is
   now available via `spacy.lang.sr`.
 - The `"sources"` in the `meta.json` have changed from a list of strings to a
   list of dicts. This is mostly internals, but if your code used
   `nlp.meta["sources"]`, you might have to update it.
+
+### Migrating from spaCy 2.1 {#migrating}
+
+#### Lemmatization data and lookup tables
+
+If your application needs lemmatization for
+[languages](/usage/models#languages) that currently only come with tokenizers,
+you now need to install that data explicitly via `pip install spacy[lookups]`
+or `pip install spacy-lookups-data`.
+No additional setup is required – the package just needs to be installed in
+the same environment as spaCy.
+
+```python
+### {highlight="4-5"}
+from spacy.lang.tr import Turkish
+nlp = Turkish()
+doc = nlp("Bu bir cümledir.")
+# 🚨 This now requires the lookups data to be installed explicitly
+print([token.lemma_ for token in doc])
+```
+
+The same applies to blank models that you want to update and train – for
+instance, you might use [`spacy.blank`](/api/top-level#spacy.blank) to create a
+blank English model and then train your own part-of-speech tagger on top. If you
+don't explicitly install the lookups data, that `nlp` object won't have any
+lemmatization rules available. spaCy will now show you a warning when you train
+a new part-of-speech tagger and the vocab has no lookups available.
+
+#### Converting entity offsets to BILUO tags
+
+If you've been using the
+[`biluo_tags_from_offsets`](/api/goldparse#biluo_tags_from_offsets) helper to
+convert character offsets into token-based BILUO tags, you may now see an error
+if the offsets describe overlapping entities and make it impossible to create a
+valid BILUO sequence. This is helpful, because it lets you spot potential
+problems in your data that can lead to inconsistent results later on. But it
+also means that you need to adjust and clean up the offsets before converting
+them:
+
+```diff
+doc = nlp("I live in Berlin Kreuzberg")
+- entities = [(10, 26, "LOC"), (10, 16, "GPE"), (17, 26, "LOC")]
++ entities = [(10, 16, "GPE"), (17, 26, "LOC")]
+tags = biluo_tags_from_offsets(doc, entities)
+```
+
+#### Serbian language data
+
+If you've been working with `Serbian` (introduced in v2.1.8), you'll need to
+change the language code from `rs` to the correct `sr`:
+
+```diff
+- from spacy.lang.rs import Serbian
++ from spacy.lang.sr import Serbian
+```
diff --git a/website/src/widgets/quickstart-install.js b/website/src/widgets/quickstart-install.js
index d267766f6..402d09c3c 100644
--- a/website/src/widgets/quickstart-install.js
+++ b/website/src/widgets/quickstart-install.js
@@ -40,6 +40,18 @@ const DATA = [
             },
         ],
     },
+    {
+        id: 'data',
+        title: 'Additional data',
+        multiple: true,
+        options: [
+            {
+                id: 'lookups',
+                title: 'Lemmatization',
+                help: 'Install additional lookup tables and rules for lemmatization',
+            },
+        ],
+    },
 ]

 const QuickstartInstall = ({ id, title }) => (
@@ -87,6 +99,7 @@ const QuickstartInstall = ({ id, title }) => (
                 set PYTHONPATH=/path/to/spaCy
                 pip install -r requirements.txt
+                pip install -U spacy-lookups-data
                 python setup.py build_ext --inplace
                 {models.map(({ code, models: modelOptions }) => (