---
title: What's New in v2.1
teaser: New features, backwards incompatibilities and migration guide
menu:
  - ['New Features', 'features']
  - ['Backwards Incompatibilities', 'incompat']
---

## New Features {#features hidden="true"}
spaCy v2.1 has focused primarily on stability and performance, solidifying the
design changes introduced in [v2.0](/usage/v2). As well as smaller models,
faster runtime, and many bug fixes, v2.1 also introduces experimental support
for some exciting new NLP innovations. For the full changelog, see the
[release notes on GitHub](https://github.com/explosion/spaCy/releases/tag/v2.1.0).

### BERT/ULMFit/Elmo-style pre-training

> #### Example
>
> ```bash
> $ python -m spacy pretrain ./raw_text.jsonl
> en_vectors_web_lg ./pretrained-model
> ```

spaCy v2.1 introduces a new CLI command, `spacy pretrain`, that can make your
models much more accurate. It's especially useful when you have **limited
training data**. The `spacy pretrain` command lets you use transfer learning to
initialize your models with information from raw text, using a language model
objective similar to the one used in Google's BERT system. We've taken
particular care to ensure that pretraining works well even with spaCy's small
default architecture sizes, so you don't have to compromise on efficiency to
use it.

<Infobox>

**API:** [`spacy pretrain`](/api/cli#pretrain) **Usage:**
[Improving accuracy with transfer learning](/usage/training#transfer-learning)

</Infobox>
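
The raw text is read in as newline-delimited JSON, one object per line with a
`"text"` key. As a minimal sketch (the texts and file name here are only
placeholders), a suitable `raw_text.jsonl` could be produced like this:

```python
import json

# Placeholder texts – in practice this would be a large raw-text corpus
texts = [
    "This is a sentence of raw text to learn from.",
    "The more raw text you provide, the better the pretrained weights.",
]
with open("raw_text.jsonl", "w", encoding="utf8") as f:
    for text in texts:
        # One JSON object per line, each with a "text" key
        f.write(json.dumps({"text": text}) + "\n")
```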

### Extended match pattern API

> #### Example
>
> ```python
> # Matches "love cats" or "likes flowers"
> pattern1 = [{"LEMMA": {"IN": ["like", "love"]}}, {"POS": "NOUN"}]
> # Matches tokens of length >= 10
> pattern2 = [{"LENGTH": {">=": 10}}]
> # Matches custom attribute values with regex
> pattern3 = [{"_": {"country": {"REGEX": "^[Uu](nited|\\.?) ?[Ss](tates|\\.?)$"}}}]
> ```

Instead of mapping to a single value, token patterns can now also map to a
**dictionary of properties**. For example, you can specify that the value of a
lemma should be part of a list of values, or set a minimum character length.
The pattern syntax supports a `REGEX` property, set membership via `IN` and
`NOT_IN`, custom extension attributes via `_` and rich comparison for numeric
values.

<Infobox>

**API:** [`Matcher`](/api/matcher) **Usage:**
[Extended pattern syntax and attributes](/usage/rule-based-matching#adding-patterns-attributes-extended),
[Regular expressions](/usage/rule-based-matching#regex)

</Infobox>
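
To see the extended syntax in action, here's a minimal, runnable sketch using
the `IN` operator (it assumes the `en_core_web_sm` model is installed):

```python
import spacy
from spacy.matcher import Matcher

nlp = spacy.load("en_core_web_sm")  # assumes this model is installed
matcher = Matcher(nlp.vocab)
# A "like"/"love" lemma followed by a noun, using the new IN operator
pattern = [{"LEMMA": {"IN": ["like", "love"]}}, {"POS": "NOUN"}]
matcher.add("LIKES", None, pattern)

doc = nlp(u"I love cats and she likes flowers")
for match_id, start, end in matcher(doc):
    print(doc[start:end].text)  # e.g. "love cats", "likes flowers"
```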

### Easy rule-based entity recognition

> #### Example
>
> ```python
> from spacy.pipeline import EntityRuler
>
> ruler = EntityRuler(nlp)
> ruler.add_patterns([{"label": "ORG", "pattern": "Apple"}])
> nlp.add_pipe(ruler, before="ner")
> ```

The `EntityRuler` is an exciting new component that lets you add named entities
based on pattern dictionaries, and makes it easy to combine rule-based and
statistical named entity recognition for even more powerful models. Entity
rules can be phrase patterns for exact string matches, or token patterns for
full flexibility.

<Infobox>

**API:** [`EntityRuler`](/api/entityruler) **Usage:**
[Rule-based entity recognition](/usage/rule-based-matching#entityruler)

</Infobox>
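
For instance, both pattern types can live in the same ruler. A minimal sketch
on a blank pipeline (so the ruler is added without `before="ner"`, since there
is no statistical NER component to order it against):

```python
import spacy
from spacy.pipeline import EntityRuler

nlp = spacy.blank("en")
ruler = EntityRuler(nlp)
# A phrase pattern (exact string) and a token pattern (one dict per token)
ruler.add_patterns([
    {"label": "ORG", "pattern": "Apple"},
    {"label": "GPE", "pattern": [{"LOWER": "san"}, {"LOWER": "francisco"}]},
])
nlp.add_pipe(ruler)

doc = nlp(u"Apple is opening its first big office in San Francisco.")
print([(ent.text, ent.label_) for ent in doc.ents])
```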

### Phrase matching with other attributes

> #### Example
>
> ```python
> matcher = PhraseMatcher(nlp.vocab, attr="POS")
> matcher.add("PATTERN", None, nlp(u"I love cats"))
> doc = nlp(u"You like dogs")
> matches = matcher(doc)
> ```

By default, the `PhraseMatcher` will match on the verbatim token text, e.g.
`Token.text`. By setting the `attr` argument on initialization, you can change
**which token attribute the matcher should use** when comparing the phrase
pattern to the matched `Doc`. For example, `LOWER` for case-insensitive matches
or `POS` for finding sequences of the same part-of-speech tags.

<Infobox>

**API:** [`PhraseMatcher`](/api/phrasematcher) **Usage:**
[Matching on other token attributes](/usage/rule-based-matching#phrasematcher-attrs)

</Infobox>
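
Here's a small runnable sketch of the case-insensitive variant, matching on
`LOWER` with a blank English pipeline (the names are just placeholder phrases):

```python
import spacy
from spacy.matcher import PhraseMatcher

nlp = spacy.blank("en")
# attr="LOWER" makes the phrase match case-insensitive
matcher = PhraseMatcher(nlp.vocab, attr="LOWER")
matcher.add("NAMES", None, nlp(u"Angela Merkel"), nlp(u"Barack Obama"))

doc = nlp(u"German Chancellor ANGELA MERKEL met BARACK OBAMA.")
for match_id, start, end in matcher(doc):
    print(doc[start:end].text)  # ANGELA MERKEL, BARACK OBAMA
```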

### Components and languages via entry points

> #### Example
>
> ```python
> from setuptools import setup
>
> setup(
>     name="custom_extension_package",
>     entry_points={
>         "spacy_factories": ["your_component = component:ComponentFactory"],
>         "spacy_languages": ["xyz = language:XYZLanguage"]
>     }
> )
> ```

Using entry points, model packages and extension packages can now define their
own `"spacy_factories"` and `"spacy_languages"`, which will be added to the
built-in factories and languages. If a package in the same environment exposes
spaCy entry points, all of this happens automatically and no further user
action is required.

<Infobox>

**Usage:** [Using entry points](/usage/saving-loading#entry-points)

</Infobox>
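
As a rough sketch of what the `component:ComponentFactory` referenced in the
example above might contain (all names here are hypothetical): a factory is a
callable that receives the shared `nlp` object plus optional config settings
and returns the component to add to the pipeline.

```python
# component.py – hypothetical module exposed via the "spacy_factories" entry point

def ComponentFactory(nlp, **cfg):
    # The factory receives the nlp object and optional config settings
    def your_component(doc):
        # A pipeline component receives a Doc and returns it, possibly
        # modified, e.g. after setting custom extension attributes
        return doc
    return your_component
```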

### Retokenizer for merging and splitting

> #### Example
>
> ```python
> doc = nlp(u"I like David Bowie")
> with doc.retokenize() as retokenizer:
>     attrs = {"LEMMA": u"David Bowie"}
>     retokenizer.merge(doc[2:4], attrs=attrs)
> ```

The new `Doc.retokenize` context manager allows merging spans of multiple
tokens into a single token, and splitting single tokens into multiple tokens.
Modifications to the `Doc`'s tokenization are stored, and then made all at once
when the context manager exits. This is much more efficient and less
error-prone. `Doc.merge` and `Span.merge` still work, but they're considered
deprecated.

<Infobox>

**API:** [`Doc.retokenize`](/api/doc#retokenize),
[`Retokenizer.merge`](/api/doc#retokenizer.merge),
[`Retokenizer.split`](/api/doc#retokenizer.split)<br />
**Usage:** [Merging and splitting](/usage/linguistic-features#retokenization)

</Infobox>
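
And a rough sketch of the splitting counterpart: `retokenizer.split` takes the
token to split, the new orths (which must join up to the original text) and a
head for each new subtoken, given as a token or a `(token, subtoken_index)`
tuple. The example sentence is contrived to produce a single "NewYork" token:

```python
import spacy

nlp = spacy.blank("en")
doc = nlp(u"I live in NewYork")
with doc.retokenize() as retokenizer:
    heads = [(doc[3], 1), doc[2]]  # attach "New" to "York", "York" to "in"
    retokenizer.split(doc[3], [u"New", u"York"], heads=heads)
print([t.text for t in doc])  # ['I', 'live', 'in', 'New', 'York']
```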

### Improved documentation

Although it looks pretty much the same, we've rebuilt the entire documentation
using [Gatsby](https://www.gatsbyjs.org/) and [MDX](https://mdxjs.com/). It's
now an even faster progressive web app and allows us to write all content
entirely **in Markdown**, without having to compromise on easy-to-use custom UI
components. We're hoping that the Markdown source will make it even easier to
contribute to the documentation. For more details, check out the
[styleguide](/styleguide) and
[source](https://github.com/explosion/spaCy/tree/master/website). While
converting the pages to Markdown, we've also fixed a bunch of typos, improved
the existing pages and added some new content:

- **Usage Guide:** [Rule-based Matching](/usage/rule-based-matching)<br />How
  to use the `Matcher`, `PhraseMatcher` and the new `EntityRuler`, and write
  powerful components to combine statistical models and rules.
- **Usage Guide:** [Saving and Loading](/usage/saving-loading)<br />Everything
  you need to know about serialization, and how to save and load pipeline
  components, package your spaCy models as Python modules and use entry points.
- **Usage Guide:**
  [Merging and Splitting](/usage/linguistic-features#retokenization)<br />How
  to retokenize a `Doc` using the new `retokenize` context manager, merge spans
  into single tokens and split single tokens into multiple.
- **Universe:** [Videos](/universe/category/videos) and
  [Podcasts](/universe/category/podcasts)
- **API:** [`EntityRuler`](/api/entityruler)
- **API:** [`SentenceSegmenter`](/api/sentencesegmenter)
- **API:** [Pipeline functions](/api/pipeline-functions)

## Backwards incompatibilities {#incompat}

<Infobox title="Important note on models" variant="warning">

If you've been training **your own models**, you'll need to **retrain** them
with the new version. Also don't forget to upgrade all models to the latest
versions. Models for v2.0.x aren't compatible with models for v2.1.x. To check
if all of your models are up to date, you can run the
[`spacy validate`](/api/cli#validate) command.

</Infobox>

- While the [`Matcher`](/api/matcher) API is fully backwards compatible, its
  algorithm has changed to fix a number of bugs and performance issues. This
  means that the `Matcher` in v2.1.x may produce different results compared to
  the `Matcher` in v2.0.x.

- For better compatibility with the Universal Dependencies data, the lemmatizer
  now preserves capitalization, e.g. for proper nouns. See
  [this issue](https://github.com/explosion/spaCy/issues/3256) for details.

- The built-in rule-based sentence boundary detector is now only called
  `"sentencizer"` – the name `"sbd"` is deprecated.

  ```diff
  - sentence_splitter = nlp.create_pipe("sbd")
  + sentence_splitter = nlp.create_pipe("sentencizer")
  ```

- The `Doc.print_tree` method is now deprecated. If you need a custom nested
  JSON representation of a `Doc` object, you might want to write your own
  helper function. For a simple and consistent JSON representation of the `Doc`
  object and its annotations, you can now use the
  [`Doc.to_json`](/api/doc#to_json) method. Going forward, this method will
  output the same format as the JSON training data expected by
  [`spacy train`](/api/cli#train).
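
  A minimal sketch of the new method (the model name is an assumption):

  ```python
  import spacy

  nlp = spacy.load("en_core_web_sm")  # assumes this model is installed
  doc = nlp(u"This is a sentence about Apple.")
  data = doc.to_json()
  print(sorted(data.keys()))  # includes "ents", "sents", "text" and "tokens"
  ```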

- The [`spacy train`](/api/cli#train) command now lets you specify a
  comma-separated list of pipeline component names, instead of separate flags
  like `--no-parser` to disable components. This is more flexible and also
  handles custom components out-of-the-box.

  ```diff
  - $ spacy train en /output train_data.json dev_data.json --no-parser
  + $ spacy train en /output train_data.json dev_data.json --pipeline tagger,ner
  ```

- The [`spacy init-model`](/api/cli#init-model) command now uses a
  `--jsonl-loc` argument to pass in a newline-delimited JSON (JSONL) file
  containing one lexical entry per line, instead of separate `--freqs-loc` and
  `--clusters-loc` arguments.

  ```diff
  - $ spacy init-model en ./model --freqs-loc ./freqs.txt --clusters-loc ./clusters.txt
  + $ spacy init-model en ./model --jsonl-loc ./vocab.jsonl
  ```

- Also note that some of the model licenses have changed:
  [`it_core_news_sm`](/models/it#it_core_news_sm) is now correctly licensed
  under CC BY-NC-SA 3.0, and all [English](/models/en) and [German](/models/de)
  models are now published under the MIT license.