diff --git a/website/docs/api/annotation.md b/website/docs/api/annotation.md
index 34065de91..fac7e79b6 100644
--- a/website/docs/api/annotation.md
+++ b/website/docs/api/annotation.md
@@ -42,18 +42,20 @@ processing.
 > - **Nouns**: dogs, children → dog, child
 > - **Verbs**: writes, writing, wrote, written → write

-A lemma is the uninflected form of a word. The English lemmatization data is
-taken from [WordNet](https://wordnet.princeton.edu). Lookup tables are taken
-from [Lexiconista](http://www.lexiconista.com/datasets/lemmatization/). spaCy
-also adds a **special case for pronouns**: all pronouns are lemmatized to the
-special token `-PRON-`.
+As of v2.2, lemmatization data is stored in a separate package,
+[`spacy-lookups-data`](https://github.com/explosion/spacy-lookups-data), which
+can be installed if needed via `pip install spacy[lookups]`. Some languages
+provide full lemmatization rules and exceptions, while other languages
+currently only rely on simple lookup tables.

-Unlike verbs and common nouns, there's no clear base form of a personal pronoun.
-Should the lemma of "me" be "I", or should we normalize person as well, giving
-"it" — or maybe "he"? spaCy's solution is to introduce a novel symbol, `-PRON-`,
-which is used as the lemma for all personal pronouns.
+spaCy adds a **special case for pronouns**: all pronouns are lemmatized to the
+special token `-PRON-`. Unlike verbs and common nouns, there's no clear base
+form of a personal pronoun. Should the lemma of "me" be "I", or should we
+normalize person as well, giving "it" — or maybe "he"? spaCy's solution is to
+introduce a novel symbol, `-PRON-`, which is used as the lemma for all personal
+pronouns.
diff --git a/website/docs/usage/101/_language-data.md b/website/docs/usage/101/_language-data.md
index 6834f884f..31bfe53ab 100644
--- a/website/docs/usage/101/_language-data.md
+++ b/website/docs/usage/101/_language-data.md
@@ -34,9 +34,9 @@ together all components and creating the `Language` subclass – for example,
 | **Character classes**<br />[`char_classes.py`][char_classes.py] | Character classes to be used in regular expressions, for example, latin characters, quotes, hyphens or icons. |
 | **Lexical attributes**<br />[`lex_attrs.py`][lex_attrs.py] | Custom functions for setting lexical attributes on tokens, e.g. `like_num`, which includes language-specific words like "ten" or "hundred". |
 | **Syntax iterators**<br />[`syntax_iterators.py`][syntax_iterators.py] | Functions that compute views of a `Doc` object based on its syntax. At the moment, only used for [noun chunks](/usage/linguistic-features#noun-chunks). |
-| **Lemmatizer**<br />[`lemmatizer.py`][lemmatizer.py] | Lemmatization rules or a lookup-based lemmatization table to assign base forms, for example "be" for "was". |
 | **Tag map**<br />[`tag_map.py`][tag_map.py] | Dictionary mapping strings in your tag set to [Universal Dependencies](http://universaldependencies.org/u/pos/all.html) tags. |
 | **Morph rules**<br />[`morph_rules.py`][morph_rules.py] | Exception rules for morphological analysis of irregular words like personal pronouns. |
+| **Lemmatizer**<br />[`spacy-lookups-data`][spacy-lookups-data] | Lemmatization rules or a lookup-based lemmatization table to assign base forms, for example "be" for "was". |

 [stop_words.py]:
   https://github.com/explosion/spaCy/tree/master/spacy/lang/en/stop_words.py
@@ -52,9 +52,8 @@ together all components and creating the `Language` subclass – for example,
   https://github.com/explosion/spaCy/tree/master/spacy/lang/en/lex_attrs.py
 [syntax_iterators.py]:
   https://github.com/explosion/spaCy/tree/master/spacy/lang/en/syntax_iterators.py
-[lemmatizer.py]:
-  https://github.com/explosion/spaCy/tree/master/spacy/lang/de/lemmatizer.py
 [tag_map.py]:
   https://github.com/explosion/spaCy/tree/master/spacy/lang/en/tag_map.py
 [morph_rules.py]:
   https://github.com/explosion/spaCy/tree/master/spacy/lang/en/morph_rules.py
+[spacy-lookups-data]: https://github.com/explosion/spacy-lookups-data
diff --git a/website/docs/usage/adding-languages.md b/website/docs/usage/adding-languages.md
index 94d75ea31..157f543e6 100644
--- a/website/docs/usage/adding-languages.md
+++ b/website/docs/usage/adding-languages.md
@@ -417,7 +417,7 @@ mapping a string to its lemma. To determine a token's lemma, spaCy simply looks
 it up in the table. Here's an example from the Spanish language data:

 ```json
-### lang/es/lemma_lookup.json (excerpt)
+### es_lemma_lookup.json (excerpt)
 {
   "aba": "abar",
   "ababa": "abar",
@@ -432,33 +432,18 @@ it up in the table. Here's an example from the Spanish language data:

 #### Adding JSON resources {#lemmatizer-resources new="2.2"}

-As of v2.2, resources for the lemmatizer are stored as JSON and loaded via the
-new [`Lookups`](/api/lookups) class. This allows easier access to the data,
-serialization with the models and file compression on disk (so your spaCy
-installation is smaller). Resource files can be provided via the `resources`
-attribute on the custom language subclass. All paths are relative to the
-language data directory, i.e. the directory the language's `__init__.py` is in.
-
-```python
-resources = {
-    "lemma_lookup": "lemmatizer/lemma_lookup.json",
-    "lemma_rules": "lemmatizer/lemma_rules.json",
-    "lemma_index": "lemmatizer/lemma_index.json",
-    "lemma_exc": "lemmatizer/lemma_exc.json",
-}
-```
-
-> #### Lookups example
->
-> ```python
-> table = nlp.vocab.lookups.get_table("my_table")
-> value = table.get("some_key")
-> ```
-
-If your language needs other large dictionaries and resources, you can also add
-those files here. The data will become available via a [`Lookups`](/api/lookups)
-table in `nlp.vocab.lookups`, and you'll be able to access it from the tokenizer
-or a custom pipeline component (via `doc.vocab.lookups`).
+As of v2.2, resources for the lemmatizer are stored as JSON and have been moved
+to a separate repository and package,
+[`spacy-lookups-data`](https://github.com/explosion/spacy-lookups-data). The
+package exposes the data files via language-specific
+[entry points](/usage/saving-loading#entry-points) that spaCy reads when
+constructing the `Vocab` and [`Lookups`](/api/lookups). This allows easier
+access to the data, serialization with the models and file compression on disk
+(so your spaCy installation is smaller). If you want to use the lookup tables
+without a pre-trained model, you have to explicitly install spaCy with lookups
+via `pip install spacy[lookups]` or by installing
+[`spacy-lookups-data`](https://github.com/explosion/spacy-lookups-data) in the
+same environment as spaCy.
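+
+Once the package is installed, the tables are merged into the vocab's
+[`Lookups`](/api/lookups) when the language class is initialized. As a minimal
+sketch, assuming the Spanish data shown above is installed (the exact table
+names can differ per language), you can inspect the result like this:
+
+```python
+import spacy
+
+nlp = spacy.blank("es")
+lookups = nlp.vocab.lookups
+# Only present if spacy-lookups-data provides a lookup table for this language
+if lookups.has_table("lemma_lookup"):
+    lemma_lookup = lookups.get_table("lemma_lookup")
+    print(lemma_lookup.get("ababa"))  # "abar"
+```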

 ### Tag map {#tag-map}
diff --git a/website/docs/usage/index.md b/website/docs/usage/index.md
index 1d6c0574c..43d602f6c 100644
--- a/website/docs/usage/index.md
+++ b/website/docs/usage/index.md
@@ -49,6 +49,16 @@ $ pip install -U spacy
 > >>> nlp = spacy.load("en_core_web_sm")
 > ```

+
+To install additional data tables for lemmatization in **spaCy v2.2+** (to
+create blank models or lemmatize in languages that don't yet come with
+pre-trained models), you can run `pip install spacy[lookups]` or install
+[`spacy-lookups-data`](https://github.com/explosion/spacy-lookups-data)
+separately.
+
+
 When using pip it is generally recommended to install packages in a virtual
 environment to avoid modifying system state:
diff --git a/website/docs/usage/models.md b/website/docs/usage/models.md
index c9b22279d..5fd92f8f3 100644
--- a/website/docs/usage/models.md
+++ b/website/docs/usage/models.md
@@ -48,6 +48,15 @@ contribute to model development.
 > nlp = Finnish() # use directly
 > nlp = spacy.blank("fi") # blank instance
 > ```
+>
+> If lemmatization rules are available for your language, make sure to install
+> spaCy with the `lookups` option, or install
+> [`spacy-lookups-data`](https://github.com/explosion/spacy-lookups-data)
+> separately in the same environment:
+>
+> ```bash
+> $ pip install spacy[lookups]
+> ```

 import Languages from 'widgets/languages.js'
diff --git a/website/docs/usage/saving-loading.md b/website/docs/usage/saving-loading.md
index 3d904f01a..fe2f4868f 100644
--- a/website/docs/usage/saving-loading.md
+++ b/website/docs/usage/saving-loading.md
@@ -285,6 +285,7 @@ installed in the same environment – that's it.
 | -------------------------------------------------------- | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
 | [`spacy_factories`](#entry-points-components) | Group of entry points for pipeline component factories to add to [`Language.factories`](/usage/processing-pipelines#custom-components-factories), keyed by component name. |
 | [`spacy_languages`](#entry-points-languages) | Group of entry points for custom [`Language` subclasses](/usage/adding-languages), keyed by language shortcut. |
+| `spacy_lookups` 2.2 | Group of entry points for custom [`Lookups`](/api/lookups), including lemmatizer data. Used by spaCy's [`spacy-lookups-data`](https://github.com/explosion/spacy-lookups-data) package. |
 | [`spacy_displacy_colors`](#entry-points-displacy) 2.2 | Group of entry points of custom label colors for the [displaCy visualizer](/usage/visualizers#ent). The key name doesn't matter, but it should point to a dict of labels and color values. Useful for custom models that predict different entity types. |

 ### Custom components via entry points {#entry-points-components}
diff --git a/website/docs/usage/spacy-101.md b/website/docs/usage/spacy-101.md
index 306186870..379535cf4 100644
--- a/website/docs/usage/spacy-101.md
+++ b/website/docs/usage/spacy-101.md
@@ -145,6 +145,7 @@ the following components:
   entity recognizer to predict those annotations in context.
 - **Lexical entries** in the vocabulary, i.e. words and their
   context-independent attributes like the shape or spelling.
+- **Data files** like lemmatization rules and lookup tables.
 - **Word vectors**, i.e. multi-dimensional meaning representations of words
   that let you determine how similar they are to each other.
 - **Configuration** options, like the language and processing pipeline settings,
diff --git a/website/docs/usage/v2-2.md b/website/docs/usage/v2-2.md
index d256037ac..31d2552a3 100644
--- a/website/docs/usage/v2-2.md
+++ b/website/docs/usage/v2-2.md
@@ -4,13 +4,14 @@ teaser: New features, backwards incompatibilities and migration guide
 menu:
   - ['New Features', 'features']
   - ['Backwards Incompatibilities', 'incompat']
+  - ['Migrating from v2.1', 'migrating']
 ---

 ## New Features {#features hidden="true"}

 spaCy v2.2 features improved statistical models, new pretrained models for
 Norwegian and Lithuanian, better Dutch NER, as well as a new mechanism for
-storing language data that makes the installation about **15× smaller** on
+storing language data that makes the installation about **7× smaller** on
 disk. We've also added a new class to efficiently **serialize annotations**, an
 improved and **10× faster** phrase matching engine, built-in scoring and
 **CLI training for text classification**, a new command to analyze and **debug
@@ -45,35 +46,6 @@ overall. We've also added new core models for [Norwegian](/models/nb) (MIT) and

-### Serializable lookup table and dictionary API {#lookups}
-
-> #### Example
->
-> ```python
-> data = {"foo": "bar"}
-> nlp.vocab.lookups.add_table("my_dict", data)
->
-> def custom_component(doc):
->     table = doc.vocab.lookups.get_table("my_dict")
->     print(table.get("foo"))  # look something up
->     return doc
-> ```
-
-The new `Lookups` API lets you add large dictionaries and lookup tables to the
-`Vocab` and access them from the tokenizer or custom components and extension
-attributes. Internally, the tables use Bloom filters for efficient lookup
-checks. They're also fully serializable out-of-the-box. All large data resources
-included with spaCy now use this API and are additionally compressed at build
-time. This allowed us to make the installed library roughly **15 times smaller
-on disk**.
-
-
-**API:** [`Lookups`](/api/lookups) **Usage: **
-[Adding languages: Lemmatizer](/usage/adding-languages#lemmatizer)
-
-
 ### Text classification scores and CLI training {#train-textcat-cli}

 > #### Example
 >
@@ -134,6 +106,40 @@ processing.
+
+### Serializable lookup tables and smaller installation {#lookups}
+
+> #### Example
+>
+> ```python
+> data = {"foo": "bar"}
+> nlp.vocab.lookups.add_table("my_dict", data)
+>
+> def custom_component(doc):
+>     table = doc.vocab.lookups.get_table("my_dict")
+>     print(table.get("foo"))  # look something up
+>     return doc
+> ```
+
+The new `Lookups` API lets you add large dictionaries and lookup tables to the
+`Vocab` and access them from the tokenizer or custom components and extension
+attributes. Internally, the tables use Bloom filters for efficient lookup
+checks. They're also fully serializable out of the box. All large data
+resources like lemmatization tables have been moved to a separate package,
+[`spacy-lookups-data`](https://github.com/explosion/spacy-lookups-data), which
+can be installed alongside the core library. This allowed us to make the spaCy
+installation roughly **7× smaller on disk**. [Pretrained models](/models) now
+include their data files, so you only need to install the lookups if you want
+to build blank models or use lemmatization with languages that don't yet ship
+with pretrained models.
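+
+For example, with the lookups data installed, even a blank model should
+receive the lemma tables through its vocab. The following is only a minimal
+sketch, and the exact table names and lemmas depend on the language data:
+
+```python
+import spacy
+
+# Assumes spacy-lookups-data is installed in the same environment
+nlp = spacy.blank("en")
+print(nlp.vocab.lookups.tables)  # e.g. ['lemma_lookup', 'lemma_rules', ...]
+print(nlp("was")[0].lemma_)      # should come out as "be"
+```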
+
+
+**API:** [`Lookups`](/api/lookups),
+[`spacy-lookups-data`](https://github.com/explosion/spacy-lookups-data) **Usage:
+** [Adding languages: Lemmatizer](/usage/adding-languages#lemmatizer)
+
+
 ### CLI command to debug and validate training data {#debug-data}

 > #### Example
 >
@@ -306,6 +312,28 @@ check if all of your models are up to date, you can run the

+> #### Install with lookups data
+>
+> ```bash
+> $ pip install spacy[lookups]
+> ```
+>
+> You can also install
+> [`spacy-lookups-data`](https://github.com/explosion/spacy-lookups-data)
+> directly.
+
+- The lemmatization tables have been moved to their own package,
+  [`spacy-lookups-data`](https://github.com/explosion/spacy-lookups-data), which
+  is not installed by default. If you're using pre-trained models, **nothing
+  changes**, because the tables are now included in the model packages. If you
+  want to use the lemmatizer for other languages that don't yet have pre-trained
+  models (e.g. Turkish or Croatian) or start off with a blank model that
+  contains lookup data (e.g. `spacy.blank("en")`), you'll need to **explicitly
+  install spaCy plus data** via `pip install spacy[lookups]`.
+- Lemmatization tables (rules, exceptions, index and lookups) are now part of
+  the `Vocab` and serialized with it. This means that serialized objects (`nlp`,
+  pipeline components, vocab) will now include additional data, and models
+  written to disk will include additional files.
 - The [Dutch model](/models/nl) has been trained on a new NER corpus (custom
   labelled UD instead of WikiNER), so their predictions may be very different
   compared to the previous version. The results should be significantly better
@@ -331,7 +359,7 @@ check if all of your models are up to date, you can run the
 - The default punctuation in the [`Sentencizer`](/api/sentencizer) has been
   extended and now includes more characters common in various languages. This
   also means that the results it produces may change, depending on your text. If
-  you want the previous behaviour with limited characters, set
+  you want the previous behavior with limited characters, set
   `punct_chars=[".", "!", "?"]` on initialization.
 - The [`PhraseMatcher`](/api/phrasematcher) algorithm was rewritten from
   scratch and it's now 10× faster. The rewrite also resolved a few subtle bugs
@@ -339,13 +367,62 @@
   may see slightly different results – however, the results should now be fully
   correct. See [this PR](https://github.com/explosion/spaCy/pull/4309) for more
   details.
-- Lemmatization tables (rules, exceptions, index and lookups) are now part of
-  the `Vocab` and serialized with it. This means that serialized objects (`nlp`,
-  pipeline components, vocab) will now include additional data, and models
-  written to disk will include additional files.
 - The `Serbian` language class (introduced in v2.1.8) incorrectly used the
   language code `rs` instead of `sr`. This has now been fixed, so `Serbian` is
   now available via `spacy.lang.sr`.
 - The `"sources"` in the `meta.json` have changed from a list of strings to a
   list of dicts. This is mostly internals, but if your code used
   `nlp.meta["sources"]`, you might have to update it.
+
+### Migrating from spaCy 2.1 {#migrating}
+
+#### Lemmatization data and lookup tables
+
+If your application needs lemmatization for
+[languages](/usage/models#languages) that currently only come with tokenizers,
+you now need to install that data explicitly via `pip install spacy[lookups]`
+or `pip install spacy-lookups-data`.
+No additional setup is required – the package just needs to be installed in
+the same environment as spaCy.
+
+```python
+### {highlight="4-5"}
+from spacy.lang.tr import Turkish
+nlp = Turkish()
+doc = nlp("Bu bir cümledir.")
+# 🚨 This now requires the lookups data to be installed explicitly
+print([token.lemma_ for token in doc])
+```
+
+The same applies to blank models that you want to update and train – for
+instance, you might use [`spacy.blank`](/api/top-level#spacy.blank) to create a
+blank English model and then train your own part-of-speech tagger on top. If you
+don't explicitly install the lookups data, that `nlp` object won't have any
+lemmatization rules available. spaCy will now show you a warning when you train
+a new part-of-speech tagger and the vocab has no lookups available.
+
+#### Converting entity offsets to BILUO tags
+
+If you've been using the
+[`biluo_tags_from_offsets`](/api/goldparse#biluo_tags_from_offsets) helper to
+convert character offsets into token-based BILUO tags, you may now see an error
+if the offsets describe overlapping entities and make it impossible to create a
+valid BILUO sequence. This is helpful, because it lets you spot potential
+problems in your data that can lead to inconsistent results later on. But it
+also means that you need to adjust and clean up the offsets before converting
+them:
+
+```diff
+doc = nlp("I live in Berlin Kreuzberg")
+- entities = [(10, 26, "LOC"), (10, 16, "GPE"), (17, 26, "LOC")]
++ entities = [(10, 16, "GPE"), (17, 26, "LOC")]
+tags = biluo_tags_from_offsets(doc, entities)
+```
+
+#### Serbian language data
+
+If you've been working with `Serbian` (introduced in v2.1.8), you'll need to
+change the language code from `rs` to the correct `sr`:
+
+```diff
+- from spacy.lang.rs import Serbian
++ from spacy.lang.sr import Serbian
+```
diff --git a/website/src/widgets/quickstart-install.js b/website/src/widgets/quickstart-install.js
index d267766f6..402d09c3c 100644
--- a/website/src/widgets/quickstart-install.js
+++ b/website/src/widgets/quickstart-install.js
@@ -40,6 +40,18 @@ const DATA = [
             },
         ],
     },
+    {
+        id: 'data',
+        title: 'Additional data',
+        multiple: true,
+        options: [
+            {
+                id: 'lookups',
+                title: 'Lemmatization',
+                help: 'Install additional lookup tables and rules for lemmatization',
+            },
+        ],
+    },
 ]

 const QuickstartInstall = ({ id, title }) => (
@@ -87,6 +99,7 @@ const QuickstartInstall = ({ id, title }) => (
                 set PYTHONPATH=/path/to/spaCy
                 pip install -r requirements.txt
+                pip install -U spacy-lookups-data
                 python setup.py build_ext --inplace
                 {models.map(({ code, models: modelOptions }) => (