Mirror of https://github.com/explosion/spaCy.git
Update lemma data documentation [ci skip]
parent 932ad9cb91 · commit a8a1800f2a
@@ -42,18 +42,20 @@ processing.

> - **Nouns**: dogs, children → dog, child
> - **Verbs**: writes, writing, wrote, written → write

A lemma is the uninflected form of a word. The English lemmatization data is
taken from [WordNet](https://wordnet.princeton.edu). Lookup tables are taken
from [Lexiconista](http://www.lexiconista.com/datasets/lemmatization/). spaCy
also adds a **special case for pronouns**: all pronouns are lemmatized to the
special token `-PRON-`.

As of v2.2, lemmatization data is stored in a separate package,
[`spacy-lookups-data`](https://github.com/explosion/spacy-lookups-data) that can
be installed if needed via `pip install spacy[lookups]`. Some languages provide
full lemmatization rules and exceptions, while other languages currently only
rely on simple lookup tables.

<Infobox title="About spaCy's custom pronoun lemma" variant="warning">

Unlike verbs and common nouns, there's no clear base form of a personal pronoun.
Should the lemma of "me" be "I", or should we normalize person as well, giving
"it" — or maybe "he"? spaCy's solution is to introduce a novel symbol, `-PRON-`,
which is used as the lemma for all personal pronouns.

spaCy adds a **special case for pronouns**: all pronouns are lemmatized to the
special token `-PRON-`. Unlike verbs and common nouns, there's no clear base
form of a personal pronoun. Should the lemma of "me" be "I", or should we
normalize person as well, giving "it" — or maybe "he"? spaCy's solution is to
introduce a novel symbol, `-PRON-`, which is used as the lemma for all personal
pronouns.

</Infobox>
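To make the pronoun special case concrete, here is a small illustrative sketch (it assumes a pre-trained English model such as `en_core_web_sm` is installed; any English model with a tagger behaves the same way):

```python
# Illustrative sketch: personal pronouns all receive the lemma "-PRON-",
# while regular nouns and verbs get their usual base forms.
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("She was reading the papers")
print([(token.text, token.lemma_) for token in doc])
# e.g. [('She', '-PRON-'), ('was', 'be'), ('reading', 'read'),
#       ('the', 'the'), ('papers', 'paper')]
```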
@@ -34,9 +34,9 @@ together all components and creating the `Language` subclass – for example,

| **Character classes**<br />[`char_classes.py`][char_classes.py] | Character classes to be used in regular expressions, for example, latin characters, quotes, hyphens or icons. |
| **Lexical attributes**<br />[`lex_attrs.py`][lex_attrs.py] | Custom functions for setting lexical attributes on tokens, e.g. `like_num`, which includes language-specific words like "ten" or "hundred". |
| **Syntax iterators**<br />[`syntax_iterators.py`][syntax_iterators.py] | Functions that compute views of a `Doc` object based on its syntax. At the moment, only used for [noun chunks](/usage/linguistic-features#noun-chunks). |
| **Lemmatizer**<br />[`lemmatizer.py`][lemmatizer.py] | Lemmatization rules or a lookup-based lemmatization table to assign base forms, for example "be" for "was". |
| **Tag map**<br />[`tag_map.py`][tag_map.py] | Dictionary mapping strings in your tag set to [Universal Dependencies](http://universaldependencies.org/u/pos/all.html) tags. |
| **Morph rules**<br />[`morph_rules.py`][morph_rules.py] | Exception rules for morphological analysis of irregular words like personal pronouns. |
| **Lemmatizer**<br />[`spacy-lookups-data`][spacy-lookups-data] | Lemmatization rules or a lookup-based lemmatization table to assign base forms, for example "be" for "was". |

[stop_words.py]:
https://github.com/explosion/spaCy/tree/master/spacy/lang/en/stop_words.py
@@ -52,9 +52,8 @@ together all components and creating the `Language` subclass – for example,

https://github.com/explosion/spaCy/tree/master/spacy/lang/en/lex_attrs.py
[syntax_iterators.py]:
https://github.com/explosion/spaCy/tree/master/spacy/lang/en/syntax_iterators.py
[lemmatizer.py]:
https://github.com/explosion/spaCy/tree/master/spacy/lang/de/lemmatizer.py
[tag_map.py]:
https://github.com/explosion/spaCy/tree/master/spacy/lang/en/tag_map.py
[morph_rules.py]:
https://github.com/explosion/spaCy/tree/master/spacy/lang/en/morph_rules.py
[spacy-lookups-data]: https://github.com/explosion/spacy-lookups-data
@@ -417,7 +417,7 @@ mapping a string to its lemma. To determine a token's lemma, spaCy simply looks

it up in the table. Here's an example from the Spanish language data:

```json
### lang/es/lemma_lookup.json (excerpt)
### es_lemma_lookup.json (excerpt)
{
    "aba": "abar",
    "ababa": "abar",
@@ -432,33 +432,18 @@ it up in the table. Here's an example from the Spanish language data:

#### Adding JSON resources {#lemmatizer-resources new="2.2"}

As of v2.2, resources for the lemmatizer are stored as JSON and loaded via the
new [`Lookups`](/api/lookups) class. This allows easier access to the data,
serialization with the models and file compression on disk (so your spaCy
installation is smaller). Resource files can be provided via the `resources`
attribute on the custom language subclass. All paths are relative to the
language data directory, i.e. the directory the language's `__init__.py` is in.

```python
resources = {
    "lemma_lookup": "lemmatizer/lemma_lookup.json",
    "lemma_rules": "lemmatizer/lemma_rules.json",
    "lemma_index": "lemmatizer/lemma_index.json",
    "lemma_exc": "lemmatizer/lemma_exc.json",
}
```

> #### Lookups example
>
> ```python
> table = nlp.vocab.lookups.get_table("my_table")
> value = table.get("some_key")
> ```

If your language needs other large dictionaries and resources, you can also add
those files here. The data will become available via a [`Lookups`](/api/lookups)
table in `nlp.vocab.lookups`, and you'll be able to access it from the tokenizer
or a custom pipeline component (via `doc.vocab.lookups`).

As of v2.2, resources for the lemmatizer are stored as JSON and have been moved
to a separate repository and package,
[`spacy-lookups-data`](https://github.com/explosion/spacy-lookups-data). The
package exposes the data files via language-specific
[entry points](/usage/saving-loading#entry-points) that spaCy reads when
constructing the `Vocab` and [`Lookups`](/api/lookups). This allows easier
access to the data, serialization with the models and file compression on disk
(so your spaCy installation is smaller). If you want to use the lookup tables
without a pre-trained model, you have to explicitly install spaCy with lookups
via `pip install spacy[lookups]` or by installing
[`spacy-lookups-data`](https://github.com/explosion/spacy-lookups-data) in the
same environment as spaCy.
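As a quick sanity check, a sketch along these lines (assuming spaCy v2.2+ and the table names shipped by `spacy-lookups-data`) shows whether the tables were picked up:

```python
# Sketch: with spacy-lookups-data installed, even a blank model's vocab
# exposes the lemmatization tables through the Lookups API.
import spacy

nlp = spacy.blank("en")
print(nlp.vocab.lookups.tables)                     # e.g. includes "lemma_lookup"
print(nlp.vocab.lookups.has_table("lemma_lookup"))  # True if the data is installed
```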

### Tag map {#tag-map}
@@ -49,6 +49,16 @@ $ pip install -U spacy

> >>> nlp = spacy.load("en_core_web_sm")
> ```

<Infobox variant="warning">

To install additional data tables for lemmatization in **spaCy v2.2+** (to
create blank models or lemmatize in languages that don't yet come with
pre-trained models), you can run `pip install spacy[lookups]` or install
[`spacy-lookups-data`](https://github.com/explosion/spacy-lookups-data)
separately.

</Infobox>

When using pip it is generally recommended to install packages in a virtual
environment to avoid modifying system state:
@@ -48,6 +48,15 @@ contribute to model development.

> nlp = Finnish() # use directly
> nlp = spacy.blank("fi") # blank instance
> ```
>
> If lemmatization rules are available for your language, make sure to install
> spaCy with the `lookups` option, or install
> [`spacy-lookups-data`](https://github.com/explosion/spacy-lookups-data)
> separately in the same environment:
>
> ```bash
> $ pip install spacy[lookups]
> ```

import Languages from 'widgets/languages.js'
@@ -285,6 +285,7 @@ installed in the same environment – that's it.

| ------------------------------------------------------------------------------ | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| [`spacy_factories`](#entry-points-components) | Group of entry points for pipeline component factories to add to [`Language.factories`](/usage/processing-pipelines#custom-components-factories), keyed by component name. |
| [`spacy_languages`](#entry-points-languages) | Group of entry points for custom [`Language` subclasses](/usage/adding-languages), keyed by language shortcut. |
| `spacy_lookups` <Tag variant="new">2.2</Tag> | Group of entry points for custom [`Lookups`](/api/lookups), including lemmatizer data. Used by spaCy's [`spacy-lookups-data`](https://github.com/explosion/spacy-lookups-data) package. |
| [`spacy_displacy_colors`](#entry-points-displacy) <Tag variant="new">2.2</Tag> | Group of entry points of custom label colors for the [displaCy visualizer](/usage/visualizers#ent). The key name doesn't matter, but it should point to a dict of labels and color values. Useful for custom models that predict different entity types. |
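For context, the `spacy_lookups` group uses the standard setuptools entry-point mechanism. The following is a hypothetical sketch of how a package could advertise lookup data under that group; the package name, module and the object the entry point resolves to are illustrative, and the exact contract is defined by [`spacy-lookups-data`](https://github.com/explosion/spacy-lookups-data):

```python
# Hypothetical setup.py sketch (all names illustrative): exposing lookup data
# through the "spacy_lookups" entry point group.
from setuptools import setup

setup(
    name="my-lookups-package",
    packages=["my_lookups"],
    entry_points={
        # "en" is the language code; "my_lookups:en" must resolve to the
        # lookup data spaCy expects for that language.
        "spacy_lookups": ["en = my_lookups:en"],
    },
)
```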

### Custom components via entry points {#entry-points-components}
@@ -145,6 +145,7 @@ the following components:

  entity recognizer to predict those annotations in context.
- **Lexical entries** in the vocabulary, i.e. words and their
  context-independent attributes like the shape or spelling.
- **Data files** like lemmatization rules and lookup tables.
- **Word vectors**, i.e. multi-dimensional meaning representations of words that
  let you determine how similar they are to each other.
- **Configuration** options, like the language and processing pipeline settings,
@@ -4,13 +4,14 @@ teaser: New features, backwards incompatibilities and migration guide

menu:
  - ['New Features', 'features']
  - ['Backwards Incompatibilities', 'incompat']
  - ['Migrating from v2.1', 'migrating']
---

## New Features {#features hidden="true"}

spaCy v2.2 features improved statistical models, new pretrained models for
Norwegian and Lithuanian, better Dutch NER, as well as a new mechanism for
storing language data that makes the installation about **15× smaller** on
storing language data that makes the installation about **7× smaller** on
disk. We've also added a new class to efficiently **serialize annotations**, an
improved and **10× faster** phrase matching engine, built-in scoring and
**CLI training for text classification**, a new command to analyze and **debug
@@ -45,35 +46,6 @@ overall. We've also added new core models for [Norwegian](/models/nb) (MIT) and

</Infobox>

### Serializable lookup table and dictionary API {#lookups}

> #### Example
>
> ```python
> data = {"foo": "bar"}
> nlp.vocab.lookups.add_table("my_dict", data)
>
> def custom_component(doc):
>     table = doc.vocab.lookups.get_table("my_dict")
>     print(table.get("foo"))  # look something up
>     return doc
> ```

The new `Lookups` API lets you add large dictionaries and lookup tables to the
`Vocab` and access them from the tokenizer or custom components and extension
attributes. Internally, the tables use Bloom filters for efficient lookup
checks. They're also fully serializable out-of-the-box. All large data resources
included with spaCy now use this API and are additionally compressed at build
time. This allowed us to make the installed library roughly **15 times smaller
on disk**.

<Infobox>

**API:** [`Lookups`](/api/lookups) **Usage:**
[Adding languages: Lemmatizer](/usage/adding-languages#lemmatizer)

</Infobox>

### Text classification scores and CLI training {#train-textcat-cli}

> #### Example
@@ -134,6 +106,40 @@ processing.

</Infobox>

### Serializable lookup tables and smaller installation {#lookups}

> #### Example
>
> ```python
> data = {"foo": "bar"}
> nlp.vocab.lookups.add_table("my_dict", data)
>
> def custom_component(doc):
>     table = doc.vocab.lookups.get_table("my_dict")
>     print(table.get("foo"))  # look something up
>     return doc
> ```

The new `Lookups` API lets you add large dictionaries and lookup tables to the
`Vocab` and access them from the tokenizer or custom components and extension
attributes. Internally, the tables use Bloom filters for efficient lookup
checks. They're also fully serializable out-of-the-box. All large data resources
like lemmatization tables have been moved to a separate package,
[`spacy-lookups-data`](https://github.com/explosion/spacy-lookups-data) that can
be installed alongside the core library. This allowed us to make the spaCy
installation roughly **7× smaller on disk**. [Pretrained models](/models)
now include their data files, so you only need to install the lookups if you
want to build blank models or use lemmatization with languages that don't yet
ship with pretrained models.

<Infobox>

**API:** [`Lookups`](/api/lookups),
[`spacy-lookups-data`](https://github.com/explosion/spacy-lookups-data)
**Usage:** [Adding languages: Lemmatizer](/usage/adding-languages#lemmatizer)

</Infobox>
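The serialization behavior is easy to check with a short sketch (the table name and output path below are only examples):

```python
# Sketch: tables added to nlp.vocab.lookups are written to disk and loaded
# back together with the rest of the pipeline.
import spacy

nlp = spacy.blank("en")
nlp.vocab.lookups.add_table("my_dict", {"foo": "bar"})
nlp.to_disk("/tmp/example_model")

nlp2 = spacy.load("/tmp/example_model")
print(nlp2.vocab.lookups.get_table("my_dict").get("foo"))  # "bar"
```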

### CLI command to debug and validate training data {#debug-data}

> #### Example
@@ -306,6 +312,28 @@ check if all of your models are up to date, you can run the

</Infobox>

> #### Install with lookups data
>
> ```bash
> $ pip install spacy[lookups]
> ```
>
> You can also install
> [`spacy-lookups-data`](https://github.com/explosion/spacy-lookups-data)
> directly.

- The lemmatization tables have been moved to their own package,
  [`spacy-lookups-data`](https://github.com/explosion/spacy-lookups-data), which
  is not installed by default. If you're using pre-trained models, **nothing
  changes**, because the tables are now included in the model packages. If you
  want to use the lemmatizer for other languages that don't yet have pre-trained
  models (e.g. Turkish or Croatian) or start off with a blank model that
  contains lookup data (e.g. `spacy.blank("en")`), you'll need to **explicitly
  install spaCy plus data** via `pip install spacy[lookups]`.
- Lemmatization tables (rules, exceptions, index and lookups) are now part of
  the `Vocab` and serialized with it. This means that serialized objects (`nlp`,
  pipeline components, vocab) will now include additional data, and models
  written to disk will include additional files.
- The [Dutch model](/models/nl) has been trained on a new NER corpus (custom
  labelled UD instead of WikiNER), so its predictions may be very different
  compared to the previous version. The results should be significantly better
@@ -331,7 +359,7 @@ check if all of your models are up to date, you can run the

- The default punctuation in the [`Sentencizer`](/api/sentencizer) has been
  extended and now includes more characters common in various languages. This
  also means that the results it produces may change, depending on your text. If
  you want the previous behaviour with limited characters, set
  you want the previous behavior with limited characters, set
  `punct_chars=[".", "!", "?"]` on initialization.
- The [`PhraseMatcher`](/api/phrasematcher) algorithm was rewritten from scratch
  and it's now 10× faster. The rewrite also resolved a few subtle bugs
@@ -339,13 +367,62 @@ check if all of your models are up to date, you can run the

  may see slightly different results – however, the results should now be fully
  correct. See [this PR](https://github.com/explosion/spaCy/pull/4309) for more
  details.
- Lemmatization tables (rules, exceptions, index and lookups) are now part of
  the `Vocab` and serialized with it. This means that serialized objects (`nlp`,
  pipeline components, vocab) will now include additional data, and models
  written to disk will include additional files.
- The `Serbian` language class (introduced in v2.1.8) incorrectly used the
  language code `rs` instead of `sr`. This has now been fixed, so `Serbian` is
  now available via `spacy.lang.sr`.
- The `"sources"` in the `meta.json` have changed from a list of strings to a
  list of dicts. This is mostly internals, but if your code used
  `nlp.meta["sources"]`, you might have to update it.

### Migrating from spaCy 2.1 {#migrating}

#### Lemmatization data and lookup tables

If your application needs lemmatization for [languages](/usage/models#languages)
with only tokenizers, you now need to install that data explicitly via
`pip install spacy[lookups]` or `pip install spacy-lookups-data`. No additional
setup is required – the package just needs to be installed in the same
environment as spaCy.

```python
### {highlight="3-4"}
nlp = Turkish()
doc = nlp("Bu bir cümledir.")
# 🚨 This now requires the lookups data to be installed explicitly
print([token.lemma_ for token in doc])
```

The same applies to blank models that you want to update and train – for
instance, you might use [`spacy.blank`](/api/top-level#spacy.blank) to create a
blank English model and then train your own part-of-speech tagger on top. If you
don't explicitly install the lookups data, that `nlp` object won't have any
lemmatization rules available. spaCy will now show you a warning when you train
a new part-of-speech tagger and the vocab has no lookups available.

#### Converting entity offsets to BILUO tags

If you've been using the
[`biluo_tags_from_offsets`](/api/goldparse#biluo_tags_from_offsets) helper to
convert character offsets into token-based BILUO tags, you may now see an error
if the offsets contain overlapping tokens and make it impossible to create a
valid BILUO sequence. This is helpful, because it lets you spot potential
problems in your data that can lead to inconsistent results later on. But it
also means that you need to adjust and clean up the offsets before converting
them:

```diff
doc = nlp("I live in Berlin Kreuzberg")
- entities = [(10, 26, "LOC"), (10, 16, "GPE"), (17, 26, "LOC")]
+ entities = [(10, 16, "GPE"), (17, 26, "LOC")]
tags = biluo_tags_from_offsets(doc, entities)
```
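A complete, runnable version of the corrected call could look like this (using a blank English pipeline for tokenization; the printed tags are shown for illustration):

```python
# Sketch: converting character offsets to BILUO tags with non-overlapping entities.
import spacy
from spacy.gold import biluo_tags_from_offsets

nlp = spacy.blank("en")
doc = nlp("I live in Berlin Kreuzberg")
entities = [(10, 16, "GPE"), (17, 26, "LOC")]
print(biluo_tags_from_offsets(doc, entities))
# ['O', 'O', 'O', 'U-GPE', 'U-LOC']
```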

#### Serbian language data

If you've been working with `Serbian` (introduced in v2.1.8), you'll need to
change the language code from `rs` to the correct `sr`:

```diff
- from spacy.lang.rs import Serbian
+ from spacy.lang.sr import Serbian
```
@@ -40,6 +40,18 @@ const DATA = [
            },
        ],
    },
    {
        id: 'data',
        title: 'Additional data',
        multiple: true,
        options: [
            {
                id: 'lookups',
                title: 'Lemmatization',
                help: 'Install additional lookup tables and rules for lemmatization',
            },
        ],
    },
]

const QuickstartInstall = ({ id, title }) => (
@@ -87,6 +99,7 @@ const QuickstartInstall = ({ id, title }) => (
    set PYTHONPATH=/path/to/spaCy
</QS>
<QS package="source">pip install -r requirements.txt</QS>
<QS data="lookups">pip install -U spacy-lookups-data</QS>
<QS package="source">python setup.py build_ext --inplace</QS>
{models.map(({ code, models: modelOptions }) => (
    <QS models={code} key={code}>