Mirror of https://github.com/explosion/spaCy.git, synced 2025-01-13 02:36:32 +03:00

Update lemma data documentation [ci skip]

commit a8a1800f2a (parent 932ad9cb91)
@@ -42,18 +42,20 @@ processing.
 > - **Nouns**: dogs, children → dog, child
 > - **Verbs**: writes, writing, wrote, written → write

-A lemma is the uninflected form of a word. The English lemmatization data is
-taken from [WordNet](https://wordnet.princeton.edu). Lookup tables are taken
-from [Lexiconista](http://www.lexiconista.com/datasets/lemmatization/). spaCy
-also adds a **special case for pronouns**: all pronouns are lemmatized to the
-special token `-PRON-`.
+As of v2.2, lemmatization data is stored in a separate package,
+[`spacy-lookups-data`](https://github.com/explosion/spacy-lookups-data) that can
+be installed if needed via `pip install spacy[lookups]`. Some languages provide
+full lemmatization rules and exceptions, while other languages currently only
+rely on simple lookup tables.

 <Infobox title="About spaCy's custom pronoun lemma" variant="warning">

-Unlike verbs and common nouns, there's no clear base form of a personal pronoun.
-Should the lemma of "me" be "I", or should we normalize person as well, giving
-"it" — or maybe "he"? spaCy's solution is to introduce a novel symbol, `-PRON-`,
-which is used as the lemma for all personal pronouns.
+spaCy adds a **special case for pronouns**: all pronouns are lemmatized to the
+special token `-PRON-`. Unlike verbs and common nouns, there's no clear base
+form of a personal pronoun. Should the lemma of "me" be "I", or should we
+normalize person as well, giving "it" — or maybe "he"? spaCy's solution is to
+introduce a novel symbol, `-PRON-`, which is used as the lemma for all personal
+pronouns.

 </Infobox>

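As a quick illustration of the pronoun behavior described above, here is a minimal sketch of inspecting lemmas in spaCy v2.x. It assumes the `en_core_web_sm` model is installed; the exact lemmas depend on the model and language data available.

```python
# Minimal sketch: inspecting lemmas in spaCy v2.x (assumes en_core_web_sm is installed)
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("She was reading the papers")
for token in doc:
    # In v2.x, personal pronouns are lemmatized to the special token "-PRON-"
    print(token.text, token.lemma_)
```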
@@ -34,9 +34,9 @@ together all components and creating the `Language` subclass – for example,
 | **Character classes**<br />[`char_classes.py`][char_classes.py] | Character classes to be used in regular expressions, for example, latin characters, quotes, hyphens or icons. |
 | **Lexical attributes**<br />[`lex_attrs.py`][lex_attrs.py] | Custom functions for setting lexical attributes on tokens, e.g. `like_num`, which includes language-specific words like "ten" or "hundred". |
 | **Syntax iterators**<br />[`syntax_iterators.py`][syntax_iterators.py] | Functions that compute views of a `Doc` object based on its syntax. At the moment, only used for [noun chunks](/usage/linguistic-features#noun-chunks). |
-| **Lemmatizer**<br />[`lemmatizer.py`][lemmatizer.py] | Lemmatization rules or a lookup-based lemmatization table to assign base forms, for example "be" for "was". |
 | **Tag map**<br />[`tag_map.py`][tag_map.py] | Dictionary mapping strings in your tag set to [Universal Dependencies](http://universaldependencies.org/u/pos/all.html) tags. |
 | **Morph rules**<br />[`morph_rules.py`][morph_rules.py] | Exception rules for morphological analysis of irregular words like personal pronouns. |
+| **Lemmatizer**<br />[`spacy-lookups-data`][spacy-lookups-data] | Lemmatization rules or a lookup-based lemmatization table to assign base forms, for example "be" for "was". |

 [stop_words.py]:
   https://github.com/explosion/spaCy/tree/master/spacy/lang/en/stop_words.py
@@ -52,9 +52,8 @@ together all components and creating the `Language` subclass – for example,
   https://github.com/explosion/spaCy/tree/master/spacy/lang/en/lex_attrs.py
 [syntax_iterators.py]:
   https://github.com/explosion/spaCy/tree/master/spacy/lang/en/syntax_iterators.py
-[lemmatizer.py]:
-  https://github.com/explosion/spaCy/tree/master/spacy/lang/de/lemmatizer.py
 [tag_map.py]:
   https://github.com/explosion/spaCy/tree/master/spacy/lang/en/tag_map.py
 [morph_rules.py]:
   https://github.com/explosion/spaCy/tree/master/spacy/lang/en/morph_rules.py
+[spacy-lookups-data]: https://github.com/explosion/spacy-lookups-data
@@ -417,7 +417,7 @@ mapping a string to its lemma. To determine a token's lemma, spaCy simply looks
 it up in the table. Here's an example from the Spanish language data:

 ```json
-### lang/es/lemma_lookup.json (excerpt)
+### es_lemma_lookup.json (excerpt)
 {
     "aba": "abar",
     "ababa": "abar",
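Conceptually, lookup-based lemmatization is just a dictionary lookup with the original string as the fallback. A small sketch of that idea (this mirrors the behavior described here, not spaCy's internal code):

```python
# Sketch of what a lookup-based lemmatizer boils down to: a plain dict lookup
# with the original string as the fallback.
lemma_lookup = {
    "aba": "abar",
    "ababa": "abar",
}

def lookup_lemma(word, table):
    # Unknown words simply fall back to themselves
    return table.get(word, word)

print(lookup_lemma("ababa", lemma_lookup))  # abar
print(lookup_lemma("gato", lemma_lookup))   # gato (no entry, returned as-is)
```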
@@ -432,33 +432,18 @@ it up in the table. Here's an example from the Spanish language data:

 #### Adding JSON resources {#lemmatizer-resources new="2.2"}

-As of v2.2, resources for the lemmatizer are stored as JSON and loaded via the
-new [`Lookups`](/api/lookups) class. This allows easier access to the data,
-serialization with the models and file compression on disk (so your spaCy
-installation is smaller). Resource files can be provided via the `resources`
-attribute on the custom language subclass. All paths are relative to the
-language data directory, i.e. the directory the language's `__init__.py` is in.
-
-```python
-resources = {
-    "lemma_lookup": "lemmatizer/lemma_lookup.json",
-    "lemma_rules": "lemmatizer/lemma_rules.json",
-    "lemma_index": "lemmatizer/lemma_index.json",
-    "lemma_exc": "lemmatizer/lemma_exc.json",
-}
-```
-
-> #### Lookups example
->
-> ```python
-> table = nlp.vocab.lookups.get_table("my_table")
-> value = table.get("some_key")
-> ```
-
-If your language needs other large dictionaries and resources, you can also add
-those files here. The data will become available via a [`Lookups`](/api/lookups)
-table in `nlp.vocab.lookups`, and you'll be able to access it from the tokenizer
-or a custom pipeline component (via `doc.vocab.lookups`).
+As of v2.2, resources for the lemmatizer are stored as JSON and have been moved
+to a separate repository and package,
+[`spacy-lookups-data`](https://github.com/explosion/spacy-lookups-data). The
+package exposes the data files via language-specific
+[entry points](/usage/saving-loading#entry-points) that spaCy reads when
+constructing the `Vocab` and [`Lookups`](/api/lookups). This allows easier
+access to the data, serialization with the models and file compression on disk
+(so your spaCy installation is smaller). If you want to use the lookup tables
+without a pre-trained model, you have to explicitly install spaCy with lookups
+via `pip install spacy[lookups]` or by installing
+[`spacy-lookups-data`](https://github.com/explosion/spacy-lookups-data) in the
+same environment as spaCy.

 ### Tag map {#tag-map}

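A hedged sketch of what the entry-point mechanism above means in practice: with `spacy-lookups-data` installed, a freshly created blank pipeline should pick up the lemma tables when the `Vocab` is built. The table name `lemma_lookup` follows the resource names used elsewhere in these docs; the exact set of tables depends on the language.

```python
# Sketch: after `pip install spacy[lookups]`, a blank pipeline should expose
# the lemma tables via nlp.vocab.lookups.
import spacy

nlp = spacy.blank("es")
print(nlp.vocab.lookups.tables)          # e.g. ["lemma_lookup", ...]
if nlp.vocab.lookups.has_table("lemma_lookup"):
    table = nlp.vocab.lookups.get_table("lemma_lookup")
    print(table.get("ababa", "ababa"))   # "abar"
```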
@@ -49,6 +49,16 @@ $ pip install -U spacy
 > >>> nlp = spacy.load("en_core_web_sm")
 > ```

+<Infobox variant="warning">
+
+To install additional data tables for lemmatization in **spaCy v2.2+** (to
+create blank models or lemmatize in languages that don't yet come with
+pre-trained models), you can run `pip install spacy[lookups]` or install
+[`spacy-lookups-data`](https://github.com/explosion/spacy-lookups-data)
+separately.
+
+</Infobox>
+
 When using pip it is generally recommended to install packages in a virtual
 environment to avoid modifying system state:

@@ -48,6 +48,15 @@ contribute to model development.
 > nlp = Finnish() # use directly
 > nlp = spacy.blank("fi") # blank instance
 > ```
+>
+> If lemmatization rules are available for your language, make sure to install
+> spaCy with the `lookups` option, or install
+> [`spacy-lookups-data`](https://github.com/explosion/spacy-lookups-data)
+> separately in the same environment:
+>
+> ```bash
+> $ pip install spacy[lookups]
+> ```

 import Languages from 'widgets/languages.js'

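A quick way to see whether that data was actually picked up for a blank pipeline is to look at the registered lookup tables. A minimal sketch, assuming spaCy v2.2:

```python
# Sketch: check whether lemmatization tables were picked up for a blank model.
# If the list is empty, install the data via `pip install spacy[lookups]`.
import spacy

nlp = spacy.blank("fi")
if not nlp.vocab.lookups.tables:
    print("No lookup tables found: install spacy-lookups-data for lemmatization")
else:
    print("Available tables:", nlp.vocab.lookups.tables)
```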
@@ -285,6 +285,7 @@ installed in the same environment – that's it.
 | ------------------------------------------------------------------------------ | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
 | [`spacy_factories`](#entry-points-components) | Group of entry points for pipeline component factories to add to [`Language.factories`](/usage/processing-pipelines#custom-components-factories), keyed by component name. |
 | [`spacy_languages`](#entry-points-languages) | Group of entry points for custom [`Language` subclasses](/usage/adding-languages), keyed by language shortcut. |
+| `spacy_lookups` <Tag variant="new">2.2</Tag> | Group of entry points for custom [`Lookups`](/api/lookups), including lemmatizer data. Used by spaCy's [`spacy-lookups-data`](https://github.com/explosion/spacy-lookups-data) package. |
 | [`spacy_displacy_colors`](#entry-points-displacy) <Tag variant="new">2.2</Tag> | Group of entry points of custom label colors for the [displaCy visualizer](/usage/visualizers#ent). The key name doesn't matter, but it should point to a dict of labels and color values. Useful for custom models that predict different entity types. |

 ### Custom components via entry points {#entry-points-components}
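For orientation, a hypothetical `setup.py` showing how a third-party package could register data under the `spacy_lookups` entry-point group. The package and module names here are invented, and the exact object the entry point should expose is defined by `spacy-lookups-data`, so treat this as a shape of the mechanism rather than a spec:

```python
# Hypothetical setup.py exposing extra lookup tables via the "spacy_lookups"
# entry-point group. Names are invented; see spacy-lookups-data for the
# authoritative layout of the exposed data.
from setuptools import setup

setup(
    name="my-lookups-package",
    packages=["my_lookups_package"],
    entry_points={
        "spacy_lookups": [
            "xx = my_lookups_package:xx_tables",
        ]
    },
)
```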
@@ -145,6 +145,7 @@ the following components:
   entity recognizer to predict those annotations in context.
 - **Lexical entries** in the vocabulary, i.e. words and their
   context-independent attributes like the shape or spelling.
+- **Data files** like lemmatization rules and lookup tables.
 - **Word vectors**, i.e. multi-dimensional meaning representations of words that
   let you determine how similar they are to each other.
 - **Configuration** options, like the language and processing pipeline settings,
@@ -4,13 +4,14 @@ teaser: New features, backwards incompatibilities and migration guide
 menu:
   - ['New Features', 'features']
   - ['Backwards Incompatibilities', 'incompat']
+  - ['Migrating from v2.1', 'migrating']
 ---

 ## New Features {#features hidden="true"}

 spaCy v2.2 features improved statistical models, new pretrained models for
 Norwegian and Lithuanian, better Dutch NER, as well as a new mechanism for
-storing language data that makes the installation about **15× smaller** on
+storing language data that makes the installation about **7× smaller** on
 disk. We've also added a new class to efficiently **serialize annotations**, an
 improved and **10× faster** phrase matching engine, built-in scoring and
 **CLI training for text classification**, a new command to analyze and **debug
@@ -45,35 +46,6 @@ overall. We've also added new core models for [Norwegian](/models/nb) (MIT) and

 </Infobox>

-### Serializable lookup table and dictionary API {#lookups}
-
-> #### Example
->
-> ```python
-> data = {"foo": "bar"}
-> nlp.vocab.lookups.add_table("my_dict", data)
->
-> def custom_component(doc):
->     table = doc.vocab.lookups.get_table("my_dict")
->     print(table.get("foo"))  # look something up
->     return doc
-> ```
-
-The new `Lookups` API lets you add large dictionaries and lookup tables to the
-`Vocab` and access them from the tokenizer or custom components and extension
-attributes. Internally, the tables use Bloom filters for efficient lookup
-checks. They're also fully serializable out-of-the-box. All large data resources
-included with spaCy now use this API and are additionally compressed at build
-time. This allowed us to make the installed library roughly **15 times smaller
-on disk**.
-
-<Infobox>
-
-**API:** [`Lookups`](/api/lookups) **Usage: **
-[Adding languages: Lemmatizer](/usage/adding-languages#lemmatizer)
-
-</Infobox>
-
 ### Text classification scores and CLI training {#train-textcat-cli}

 > #### Example
@@ -134,6 +106,40 @@ processing.

 </Infobox>

+### Serializable lookup tables and smaller installation {#lookups}
+
+> #### Example
+>
+> ```python
+> data = {"foo": "bar"}
+> nlp.vocab.lookups.add_table("my_dict", data)
+>
+> def custom_component(doc):
+>     table = doc.vocab.lookups.get_table("my_dict")
+>     print(table.get("foo"))  # look something up
+>     return doc
+> ```
+
+The new `Lookups` API lets you add large dictionaries and lookup tables to the
+`Vocab` and access them from the tokenizer or custom components and extension
+attributes. Internally, the tables use Bloom filters for efficient lookup
+checks. They're also fully serializable out-of-the-box. All large data resources
+like lemmatization tables have been moved to a separate package,
+[`spacy-lookups-data`](https://github.com/explosion/spacy-lookups-data) that can
+be installed alongside the core library. This allowed us to make the spaCy
+installation roughly **7× smaller on disk**. [Pretrained models](/models)
+now include their data files, so you only need to install the lookups if you
+want to build blank models or use lemmatization with languages that don't yet
+ship with pretrained models.
+
+<Infobox>
+
+**API:** [`Lookups`](/api/lookups),
+[`spacy-lookups-data`](https://github.com/explosion/spacy-lookups-data) **Usage:
+** [Adding languages: Lemmatizer](/usage/adding-languages#lemmatizer)
+
+</Infobox>
+
 ### CLI command to debug and validate training data {#debug-data}

 > #### Example
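The "fully serializable out-of-the-box" part can be illustrated with a round-trip through bytes. A minimal sketch, with method names as documented in the v2.2 `Lookups` API:

```python
# Minimal sketch of Lookups serialization: add a table, serialize to bytes,
# restore into a new Lookups object.
from spacy.lookups import Lookups

lookups = Lookups()
lookups.add_table("my_dict", {"foo": "bar"})
data = lookups.to_bytes()

restored = Lookups()
restored.from_bytes(data)
print(restored.get_table("my_dict").get("foo"))  # bar
```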
@@ -306,6 +312,28 @@ check if all of your models are up to date, you can run the

 </Infobox>

+> #### Install with lookups data
+>
+> ```bash
+> $ pip install spacy[lookups]
+> ```
+>
+> You can also install
+> [`spacy-lookups-data`](https://github.com/explosion/spacy-lookups-data)
+> directly.

+- The lemmatization tables have been moved to their own package,
+  [`spacy-lookups-data`](https://github.com/explosion/spacy-lookups-data), which
+  is not installed by default. If you're using pre-trained models, **nothing
+  changes**, because the tables are now included in the model packages. If you
+  want to use the lemmatizer for other languages that don't yet have pre-trained
+  models (e.g. Turkish or Croatian) or start off with a blank model that
+  contains lookup data (e.g. `spacy.blank("en")`), you'll need to **explicitly
+  install spaCy plus data** via `pip install spacy[lookups]`.
+- Lemmatization tables (rules, exceptions, index and lookups) are now part of
+  the `Vocab` and serialized with it. This means that serialized objects (`nlp`,
+  pipeline components, vocab) will now include additional data, and models
+  written to disk will include additional files.
 - The [Dutch model](/models/nl) has been trained on a new NER corpus (custom
   labelled UD instead of WikiNER), so their predictions may be very different
   compared to the previous version. The results should be significantly better
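The practical consequence of the tables living on the `Vocab` is that they travel with the pipeline when it is saved and loaded. A small sketch, assuming spaCy v2.2 with the lookups extra installed; the path is just a placeholder:

```python
# Sketch: lookup tables are serialized with the Vocab when a pipeline is
# written to disk and restored on load.
import spacy

nlp = spacy.blank("en")               # with spacy[lookups], tables are present
print(nlp.vocab.lookups.tables)

nlp.to_disk("/tmp/blank_en")
reloaded = spacy.load("/tmp/blank_en")
print(reloaded.vocab.lookups.tables)  # same tables, restored from disk
```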
@@ -331,7 +359,7 @@ check if all of your models are up to date, you can run the
 - The default punctuation in the [`Sentencizer`](/api/sentencizer) has been
   extended and now includes more characters common in various languages. This
   also means that the results it produces may change, depending on your text. If
-  you want the previous behaviour with limited characters, set
+  you want the previous behavior with limited characters, set
   `punct_chars=[".", "!", "?"]` on initialization.
 - The [`PhraseMatcher`](/api/phrasematcher) algorithm was rewritten from scratch
   and it's now 10× faster. The rewrite also resolved a few subtle bugs
@@ -339,13 +367,62 @@ check if all of your models are up to date, you can run the
   may see slightly different results – however, the results should now be fully
   correct. See [this PR](https://github.com/explosion/spaCy/pull/4309) for more
   details.
-- Lemmatization tables (rules, exceptions, index and lookups) are now part of
-  the `Vocab` and serialized with it. This means that serialized objects (`nlp`,
-  pipeline components, vocab) will now include additional data, and models
-  written to disk will include additional files.
 - The `Serbian` language class (introduced in v2.1.8) incorrectly used the
   language code `rs` instead of `sr`. This has now been fixed, so `Serbian` is
   now available via `spacy.lang.sr`.
 - The `"sources"` in the `meta.json` have changed from a list of strings to a
   list of dicts. This is mostly internals, but if your code used
   `nlp.meta["sources"]`, you might have to update it.
+
+### Migrating from spaCy 2.1 {#migrating}
+
+#### Lemmatization data and lookup tables
+
+If your application needs lemmatization for [languages](/usage/models#languages)
+with only tokenizers, you now need to install that data explicitly via
+`pip install spacy[lookups]` or `pip install spacy-lookups-data`. No additional
+setup is required – the package just needs to be installed in the same
+environment as spaCy.
+
+```python
+### {highlight="3-4"}
+nlp = Turkish()
+doc = nlp("Bu bir cümledir.")
+# 🚨 This now requires the lookups data to be installed explicitly
+print([token.lemma_ for token in doc])
+```
+
+The same applies to blank models that you want to update and train – for
+instance, you might use [`spacy.blank`](/api/top-level#spacy.blank) to create a
+blank English model and then train your own part-of-speech tagger on top. If you
+don't explicitly install the lookups data, that `nlp` object won't have any
+lemmatization rules available. spaCy will now show you a warning when you train
+a new part-of-speech tagger and the vocab has no lookups available.
+
+#### Converting entity offsets to BILUO tags
+
+If you've been using the
+[`biluo_tags_from_offsets`](/api/goldparse#biluo_tags_from_offsets) helper to
+convert character offsets into token-based BILUO tags, you may now see an error
+if the offsets contain overlapping tokens and make it impossible to create a
+valid BILUO sequence. This is helpful, because it lets you spot potential
+problems in your data that can lead to inconsistent results later on. But it
+also means that you need to adjust and clean up the offsets before converting
+them:
+
+```diff
+doc = nlp("I live in Berlin Kreuzberg")
+- entities = [(10, 26, "LOC"), (10, 16, "GPE"), (17, 26, "LOC")]
++ entities = [(10, 16, "GPE"), (17, 26, "LOC")]
+tags = biluo_tags_from_offsets(doc, entities)
+```
+
+#### Serbian language data
+
+If you've been working with `Serbian` (introduced in v2.1.8), you'll need to
+change the language code from `rs` to the correct `sr`:
+
+```diff
+- from spacy.lang.rs import Serbian
++ from spacy.lang.sr import Serbian
+```
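For reference, a fuller version of the BILUO conversion shown in the migration section, with the import spelled out (in spaCy v2.x the helper lives in `spacy.gold`):

```python
# BILUO conversion with the import included (spaCy v2.x).
import spacy
from spacy.gold import biluo_tags_from_offsets

nlp = spacy.blank("en")
doc = nlp("I live in Berlin Kreuzberg")
entities = [(10, 16, "GPE"), (17, 26, "LOC")]
tags = biluo_tags_from_offsets(doc, entities)
print(tags)  # e.g. ['O', 'O', 'O', 'U-GPE', 'U-LOC']
```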
@@ -40,6 +40,18 @@ const DATA = [
             },
         ],
     },
+    {
+        id: 'data',
+        title: 'Additional data',
+        multiple: true,
+        options: [
+            {
+                id: 'lookups',
+                title: 'Lemmatization',
+                help: 'Install additional lookup tables and rules for lemmatization',
+            },
+        ],
+    },
 ]

 const QuickstartInstall = ({ id, title }) => (
@@ -87,6 +99,7 @@ const QuickstartInstall = ({ id, title }) => (
                 set PYTHONPATH=/path/to/spaCy
             </QS>
             <QS package="source">pip install -r requirements.txt</QS>
+            <QS data="lookups">pip install -U spacy-lookups-data</QS>
             <QS package="source">python setup.py build_ext --inplace</QS>
             {models.map(({ code, models: modelOptions }) => (
                 <QS models={code} key={code}>