mirror of
https://github.com/explosion/spaCy.git
synced 2024-11-11 20:28:20 +03:00
554df9ef20
* Rename all MDX file to `.mdx`
* Lock current node version (#11885)
* Apply Prettier (#11996)
* Minor website fixes (#11974) [ci skip]
* fix table
* Migrate to Next WEB-17 (#12005)
* Initial commit
* Run `npx create-next-app@13 next-blog`
* Install MDX packages
Following: 77b5f79a4d/packages/next-mdx/readme.md
* Add MDX to Next
* Allow Next to handle `.md` and `.mdx` files.
* Add VSCode extension recommendation
* Disabled TypeScript strict mode for now
* Add prettier
* Apply Prettier to all files
* Make sure to use correct Node version
* Add basic implementation for `MDXRemote`
* Add experimental Rust MDX parser
* Add `/public`
* Add SASS support
* Remove default pages and styling
* Convert to module
This allows to use `import/export` syntax
* Add import for custom components
* Add ability to load plugins
* Extract function
This will make the next commit easier to read
* Allow to handle directories for page creation
* Refactoring
* Allow to parse subfolders for pages
* Extract logic
* Redirect `index.mdx` to parent directory
* Disabled ESLint during builds
* Disabled typescript during build
* Remove Gatsby from `README.md`
* Rephrase Docker part of `README.md`
* Update project structure in `README.md`
* Move and rename plugins
* Update plugin for wrapping sections
* Add dependencies for plugin
* Use plugin
* Rename wrapper type
* Simplify unnessary adding of id to sections
The slugified section ids are useless, because they can not be referenced anywhere anyway. The navigation only works if the section has the same id as the heading.
* Add plugin for custom attributes on Markdown elements
* Add plugin to readd support for tables
* Add plugin to fix problem with wrapped images
For more details see this issue: https://github.com/mdx-js/mdx/issues/1798
* Add necessary meta data to pages
* Install necessary dependencies
* Remove outdated MDX handling
* Remove reliance on `InlineList`
* Use existing Remark components
* Remove unallowed heading
Before `h1` components where not overwritten and would never have worked and they aren't used anywhere either.
* Add missing components to MDX
* Add correct styling
* Fix broken list
* Fix broken CSS classes
* Implement layout
* Fix links
* Fix broken images
* Fix pattern image
* Fix heading attributes
* Rename heading attribute
`new` was causing some weird issue, so renaming it to `version`
* Update comment syntax in MDX
* Merge imports
* Fix markdown rendering inside components
* Add model pages
* Simplify anchors
* Fix default value for theme
* Add Universe index page
* Add Universe categories
* Add Universe projects
* Fix Next problem with copy
Next complains when the server renders something different then the client, therfor we move the differing logic to `useEffect`
* Fix improper component nesting
Next doesn't allow block elements inside a `<p>`
* Replace landing page MDX with page component
* Remove inlined iframe content
* Remove ability to inline HTML content in iFrames
* Remove MDX imports
* Fix problem with image inside link in MDX
* Escape character for MDX
* Fix unescaped characters in MDX
* Fix headings with logo
* Allow to export static HTML pages
* Add prebuild script
This command is automatically run by Next
* Replace `svg-loader` with `react-inlinesvg`
`svg-loader` is no longer maintained
* Fix ESLint `react-hooks/exhaustive-deps`
* Fix dropdowns
* Change code language from `cli` to `bash`
* Remove unnessary language `none`
* Fix invalid code language
`markdown_` with an underscore was used to basically turn of syntax highlighting, but using unknown languages know throws an error.
* Enable code blocks plugin
* Readd `InlineCode` component
MDX2 removed the `inlineCode` component
> The special component name `inlineCode` was removed, we recommend to use `pre` for the block version of code, and code for both the block and inline versions
Source: https://mdxjs.com/migrating/v2/#update-mdx-content
* Remove unused code
* Extract function to own file
* Fix code syntax highlighting
* Update syntax for code block meta data
* Remove unused prop
* Fix internal link recognition
There is a problem with regex between Node and browser, and since Next runs the component on both, this create an error.
`Prop `rel` did not match. Server: "null" Client: "noopener nofollow noreferrer"`
This simplifies the implementation and fixes the above error.
* Replace `react-helmet` with `next/head`
* Fix `className` problem for JSX component
* Fix broken bold markdown
* Convert file to `.mjs` to be used by Node process
* Add plugin to replace strings
* Fix custom table row styling
* Fix problem with `span` inside inline `code`
React doesn't allow a `span` inside an inline `code` element and throws an error in dev mode.
* Add `_document` to be able to customize `<html>` and `<body>`
* Add `lang="en"`
* Store Netlify settings in file
This way we don't need to update via Netlify UI, which can be tricky if changing build settings.
* Add sitemap
* Add Smartypants
* Add PWA support
* Add `manifest.webmanifest`
* Fix bug with anchor links after reloading
There was no need for the previous implementation, since the browser handles this nativly. Additional the manual scrolling into view was actually broken, because the heading would disappear behind the menu bar.
* Rename custom event
I was googeling for ages to find out what kind of event `inview` is, only to figure out it was a custom event with a name that sounds pretty much like a native one. 🫠
* Fix missing comment syntax highlighting
* Refactor Quickstart component
The previous implementation was hidding the irrelevant lines via data-props and dynamically generated CSS. This created problems with Next and was also hard to follow. CSS was used to do what React is supposed to handle.
The new implementation simplfy filters the list of children (React elements) via their props.
* Fix syntax highlighting for Training Quickstart
* Unify code rendering
* Improve error logging in Juniper
* Fix Juniper component
* Automatically generate "Read Next" link
* Add Plausible
* Use recent DocSearch component and adjust styling
* Fix images
* Turn of image optimization
> Image Optimization using Next.js' default loader is not compatible with `next export`.
We currently deploy to Netlify via `next export`
* Dont build pages starting with `_`
* Remove unused files
* Add Next plugin to Netlify
* Fix button layout
MDX automatically adds `p` tags around text on a new line and Prettier wants to put the text on a new line. Hacking with JSX string.
* Add 404 page
* Apply Prettier
* Update Prettier for `package.json`
Next sometimes wants to patch `package-lock.json`. The old Prettier setting indended with 4 spaces, but Next always indends with 2 spaces. Since `npm install` automatically uses the indendation from `package.json` for `package-lock.json` and to avoid the format switching back and forth, both files are now set to 2 spaces.
* Apply Next patch to `package-lock.json`
When starting the dev server Next would warn `warn - Found lockfile missing swc dependencies, patching...` and update the `package-lock.json`. These are the patched changes.
* fix link
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
* small backslash fixes
* adjust to new style
Co-authored-by: Marcus Blättermann <marcus@essenmitsosse.de>
1185 lines
62 KiB
Plaintext
1185 lines
62 KiB
Plaintext
---
|
||
title: What's New in v3.0
|
||
teaser: New features, backwards incompatibilities and migration guide
|
||
menu:
|
||
- ['Summary', 'summary']
|
||
- ['New Features', 'features']
|
||
- ['Backwards Incompatibilities', 'incompat']
|
||
- ['Migrating from v2.x', 'migrating']
|
||
---
|
||
|
||
## Summary {id="summary",hidden="true"}
|
||
|
||
> #### 📖 Looking for the old docs?
|
||
>
|
||
> To help you make the transition from v2.x to v3.0, we've uploaded the old
|
||
> website to [**v2.spacy.io**](https://v2.spacy.io/docs).
|
||
|
||
<Grid cols={2} gutterBottom={false}>
|
||
|
||
<div>
|
||
|
||
spaCy v3.0 features all new **transformer-based pipelines** that bring spaCy's
|
||
accuracy right up to the current **state-of-the-art**. You can use any
|
||
pretrained transformer to train your own pipelines, and even share one
|
||
transformer between multiple components with **multi-task learning**. Training
|
||
is now fully configurable and extensible, and you can define your own custom
|
||
models using **PyTorch**, **TensorFlow** and other frameworks. The new spaCy
|
||
projects system lets you describe whole **end-to-end workflows** in a single
|
||
file, giving you an easy path from prototype to production, and making it easy
|
||
to clone and adapt best-practice projects for your own use cases.
|
||
|
||
</div>
|
||
|
||
<Infobox title="Table of Contents" id="toc">
|
||
|
||
- [Summary](#summary)
|
||
- [New features](#features)
|
||
- [Transformer-based pipelines](#features-transformers)
|
||
- [Training & config system](#features-training)
|
||
- [Custom models](#features-custom-models)
|
||
- [End-to-end project workflows](#features-projects)
|
||
- [Parallel training with Ray](#features-parallel-training)
|
||
- [New built-in components](#features-pipeline-components)
|
||
- [New custom component API](#features-components)
|
||
- [Dependency matching](#features-dep-matcher)
|
||
- [Python type hints](#features-types)
|
||
- [New methods & attributes](#new-methods)
|
||
- [New & updated documentation](#new-docs)
|
||
- [Backwards incompatibilities](#incompat)
|
||
- [Migrating from spaCy v2.x](#migrating)
|
||
|
||
</Infobox>
|
||
|
||
</Grid>
|
||
|
||
## New Features {id="features"}
|
||
|
||
This section contains an overview of the most important **new features and
|
||
improvements**. The [API docs](/api) include additional deprecation notes. New
|
||
methods and functions that were introduced in this version are marked with the
|
||
tag <Tag variant="new">3</Tag>.
|
||
|
||
<YouTube id="9k_EfV7Cns0"></YouTube>
|
||
|
||
<Grid cols={2} gutterBottom={false} narrow>
|
||
|
||
<YouTube id="BWhh3r6W-qE"></YouTube>
|
||
|
||
<YouTube id="8HL-Ap5_Axo"></YouTube>
|
||
|
||
</Grid>
|
||
|
||
### Transformer-based pipelines {id="features-transformers"}
|
||
|
||
> #### Example
|
||
>
|
||
> ```bash
|
||
> $ python -m spacy download en_core_web_trf
|
||
> ```
|
||
|
||
spaCy v3.0 features all new transformer-based pipelines that bring spaCy's
|
||
accuracy right up to the current **state-of-the-art**. You can use any
|
||
pretrained transformer to train your own pipelines, and even share one
|
||
transformer between multiple components with **multi-task learning**. spaCy's
|
||
transformer support interoperates with [PyTorch](https://pytorch.org) and the
|
||
[HuggingFace `transformers`](https://huggingface.co/transformers/) library,
|
||
giving you access to thousands of pretrained models for your pipelines.
|
||
|
||
![Pipeline components listening to shared embedding component](/images/tok2vec-listener.svg)
|
||
|
||
<Benchmarks />
|
||
|
||
#### New trained transformer-based pipelines {id="features-transformers-pipelines"}
|
||
|
||
> #### Notes on model capabilities
|
||
>
|
||
> The models are each trained with a **single transformer** shared across the
|
||
> pipeline, which requires it to be trained on a single corpus. For
|
||
> [English](/models/en) and [Chinese](/models/zh), we used the OntoNotes 5
|
||
> corpus, which has annotations across several tasks. For [French](/models/fr),
|
||
> [Spanish](/models/es) and [German](/models/de), we didn't have a suitable
|
||
> corpus that had both syntactic and entity annotations, so the transformer
|
||
> models for those languages do not include NER.
|
||
|
||
| Package | Language | Transformer | Tagger | Parser | NER |
|
||
| ------------------------------------------------ | -------- | --------------------------------------------------------------------------------------------- | -----: | -----: | ---: |
|
||
| [`en_core_web_trf`](/models/en#en_core_web_trf) | English | [`roberta-base`](https://huggingface.co/roberta-base) | 97.8 | 95.2 | 89.9 |
|
||
| [`de_dep_news_trf`](/models/de#de_dep_news_trf) | German | [`bert-base-german-cased`](https://huggingface.co/bert-base-german-cased) | 99.0 | 95.8 | - |
|
||
| [`es_dep_news_trf`](/models/es#es_dep_news_trf) | Spanish | [`bert-base-spanish-wwm-cased`](https://huggingface.co/dccuchile/bert-base-spanish-wwm-cased) | 98.2 | 94.6 | - |
|
||
| [`fr_dep_news_trf`](/models/fr#fr_dep_news_trf) | French | [`camembert-base`](https://huggingface.co/camembert-base) | 95.7 | 94.4 | - |
|
||
| [`zh_core_web_trf`](/models/zh#zh_core_news_trf) | Chinese | [`bert-base-chinese`](https://huggingface.co/bert-base-chinese) | 92.5 | 76.6 | 75.4 |
|
||
|
||
<Infobox title="Details & Documentation" emoji="📖" list>
|
||
|
||
- **Usage:** [Embeddings & Transformers](/usage/embeddings-transformers),
|
||
[Training pipelines and models](/usage/training),
|
||
[Benchmarks](/usage/facts-figures#benchmarks)
|
||
- **API:** [`Transformer`](/api/transformer),
|
||
[`TransformerData`](/api/transformer#transformerdata),
|
||
[`FullTransformerBatch`](/api/transformer#fulltransformerbatch)
|
||
- **Architectures: ** [TransformerModel](/api/architectures#TransformerModel),
|
||
[TransformerListener](/api/architectures#TransformerListener),
|
||
[Tok2VecTransformer](/api/architectures#Tok2VecTransformer)
|
||
- **Implementation:**
|
||
[`spacy-transformers`](https://github.com/explosion/spacy-transformers)
|
||
|
||
</Infobox>
|
||
|
||
### New training workflow and config system {id="features-training"}
|
||
|
||
> #### Example
|
||
>
|
||
> ```ini
|
||
> [training]
|
||
> accumulate_gradient = 3
|
||
>
|
||
> [training.optimizer]
|
||
> @optimizers = "Adam.v1"
|
||
>
|
||
> [training.optimizer.learn_rate]
|
||
> @schedules = "warmup_linear.v1"
|
||
> warmup_steps = 250
|
||
> total_steps = 20000
|
||
> initial_rate = 0.01
|
||
> ```
|
||
|
||
spaCy v3.0 introduces a comprehensive and extensible system for **configuring
|
||
your training runs**. A single configuration file describes every detail of your
|
||
training run, with no hidden defaults, making it easy to rerun your experiments
|
||
and track changes. You can use the
|
||
[quickstart widget](/usage/training#quickstart) or the `init config` command to
|
||
get started. Instead of providing lots of arguments on the command line, you
|
||
only need to pass your `config.cfg` file to [`spacy train`](/api/cli#train).
|
||
Training config files include all **settings and hyperparameters** for training
|
||
your pipeline. Some settings can also be registered **functions** that you can
|
||
swap out and customize, making it easy to implement your own custom models and
|
||
architectures.
|
||
|
||
![Illustration of pipeline lifecycle](/images/lifecycle.svg)
|
||
|
||
<Infobox title="Details & Documentation" emoji="📖" list>
|
||
|
||
- **Usage:** [Training pipelines and models](/usage/training)
|
||
- **Thinc:** [Thinc's config system](https://thinc.ai/docs/usage-config),
|
||
[`Config`](https://thinc.ai/docs/api-config#config)
|
||
- **CLI:** [`init config`](/api/cli#init-config),
|
||
[`init fill-config`](/api/cli#init-fill-config), [`train`](/api/cli#train),
|
||
[`pretrain`](/api/cli#pretrain), [`evaluate`](/api/cli#evaluate)
|
||
- **API:** [Config format](/api/data-formats#config),
|
||
[`registry`](/api/top-level#registry)
|
||
|
||
</Infobox>
|
||
|
||
### Custom models using any framework {id="features-custom-models"}
|
||
|
||
> #### Example
|
||
>
|
||
> ```python
|
||
> from torch import nn
|
||
> from thinc.api import PyTorchWrapper
|
||
>
|
||
> torch_model = nn.Sequential(
|
||
> nn.Linear(32, 32),
|
||
> nn.ReLU(),
|
||
> nn.Softmax(dim=1)
|
||
> )
|
||
> model = PyTorchWrapper(torch_model)
|
||
> ```
|
||
|
||
spaCy's new configuration system makes it easy to customize the neural network
|
||
models used by the different pipeline components. You can also implement your
|
||
own architectures via spaCy's machine learning library [Thinc](https://thinc.ai)
|
||
that provides various layers and utilities, as well as thin wrappers around
|
||
frameworks like **PyTorch**, **TensorFlow** and **MXNet**. Component models all
|
||
follow the same unified [`Model`](https://thinc.ai/docs/api-model) API and each
|
||
`Model` can also be used as a sublayer of a larger network, allowing you to
|
||
freely combine implementations from different frameworks into a single model.
|
||
|
||
![Illustration of Pipe methods](/images/trainable_component.svg)
|
||
|
||
<Infobox title="Details & Documentation" emoji="📖" list>
|
||
|
||
- **Usage: ** [Layers and architectures](/usage/layers-architectures),
|
||
[Trainable component API](/usage/processing-pipelines#trainable-components),
|
||
[Trainable components and models](/usage/layers-architectures#components)
|
||
- **Thinc: **
|
||
[Wrapping PyTorch, TensorFlow & MXNet](https://thinc.ai/docs/usage-frameworks),
|
||
[`Model` API](https://thinc.ai/docs/api-model)
|
||
- **API:** [Model architectures](/api/architectures),
|
||
[`TrainablePipe`](/api/pipe)
|
||
|
||
</Infobox>
|
||
|
||
### Manage end-to-end workflows with projects {id="features-projects"}
|
||
|
||
> #### Example
|
||
>
|
||
> ```bash
|
||
> # Clone a project template
|
||
> $ python -m spacy project clone pipelines/tagger_parser_ud
|
||
> $ cd tagger_parser_ud
|
||
> # Download data assets
|
||
> $ python -m spacy project assets
|
||
> # Run a workflow
|
||
> $ python -m spacy project run all
|
||
> ```
|
||
|
||
spaCy projects let you manage and share **end-to-end spaCy workflows** for
|
||
different **use cases and domains**, and orchestrate training, packaging and
|
||
serving your custom pipelines. You can start off by cloning a pre-defined
|
||
project template, adjust it to fit your needs, load in your data, train a
|
||
pipeline, export it as a Python package, upload your outputs to a remote storage
|
||
and share your results with your team.
|
||
|
||
![Illustration of project workflow and commands](/images/projects.svg)
|
||
|
||
spaCy projects also make it easy to **integrate with other tools** in the data
|
||
science and machine learning ecosystem, including [DVC](/usage/projects#dvc) for
|
||
data version control, [Prodigy](/usage/projects#prodigy) for creating labelled
|
||
data, [Streamlit](/usage/projects#streamlit) for building interactive apps,
|
||
[FastAPI](/usage/projects#fastapi) for serving models in production,
|
||
[Ray](/usage/projects#ray) for parallel training,
|
||
[Weights & Biases](/usage/projects#wandb) for experiment tracking, and more!
|
||
|
||
<Infobox title="Details & Documentation" emoji="📖" list>
|
||
|
||
- **Usage:** [spaCy projects](/usage/projects),
|
||
[Training pipelines and models](/usage/training)
|
||
- **CLI:** [`project`](/api/cli#project), [`train`](/api/cli#train)
|
||
- **Templates:** [`projects`](https://github.com/explosion/projects)
|
||
|
||
</Infobox>
|
||
|
||
<Project id="pipelines/tagger_parser_ud">
|
||
|
||
The easiest way to get started is to clone a [project template](/usage/projects)
|
||
and run it – for example, this end-to-end template that lets you train a
|
||
**part-of-speech tagger** and **dependency parser** on a Universal Dependencies
|
||
treebank.
|
||
|
||
</Project>
|
||
|
||
### Parallel and distributed training with Ray {id="features-parallel-training"}
|
||
|
||
> #### Example
|
||
>
|
||
> ```bash
|
||
> $ pip install -U %%SPACY_PKG_NAME[ray]%%SPACY_PKG_FLAGS
|
||
> # Check that the CLI is registered
|
||
> $ python -m spacy ray --help
|
||
> # Train a pipeline
|
||
> $ python -m spacy ray train config.cfg --n-workers 2
|
||
> ```
|
||
|
||
[Ray](https://ray.io/) is a fast and simple framework for building and running
|
||
**distributed applications**. You can use Ray to train spaCy on one or more
|
||
remote machines, potentially speeding up your training process. The Ray
|
||
integration is powered by a lightweight extension package,
|
||
[`spacy-ray`](https://github.com/explosion/spacy-ray), that automatically adds
|
||
the [`ray`](/api/cli#ray) command to your spaCy CLI if it's installed in the
|
||
same environment. You can then run [`spacy ray train`](/api/cli#ray-train) for
|
||
parallel training.
|
||
|
||
![Illustration of setup](/images/spacy-ray.svg)
|
||
|
||
<Infobox title="Details & Documentation" emoji="📖" list>
|
||
|
||
- **Usage: **
|
||
[Parallel and distributed training](/usage/training#parallel-training),
|
||
[spaCy Projects integration](/usage/projects#ray)
|
||
- **CLI:** [`ray`](/api/cli#ray), [`ray train`](/api/cli#ray-train)
|
||
- **Implementation:** [`spacy-ray`](https://github.com/explosion/spacy-ray)
|
||
|
||
</Infobox>
|
||
|
||
### New built-in pipeline components {id="features-pipeline-components"}
|
||
|
||
spaCy v3.0 includes several new trainable and rule-based components that you can
|
||
add to your pipeline and customize for your use case:
|
||
|
||
> #### Example
|
||
>
|
||
> ```python
|
||
> # pip install -U %%SPACY_PKG_NAME[lookups]%%SPACY_PKG_FLAGS
|
||
> nlp = spacy.blank("en")
|
||
> nlp.add_pipe("lemmatizer")
|
||
> ```
|
||
|
||
| Name | Description |
|
||
| ----------------------------------------------------- | ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
|
||
| [`SentenceRecognizer`](/api/sentencerecognizer) | Trainable component for sentence segmentation. |
|
||
| [`Morphologizer`](/api/morphologizer) | Trainable component to predict morphological features. |
|
||
| [`Lemmatizer`](/api/lemmatizer) | Standalone component for rule-based and lookup lemmatization. |
|
||
| [`AttributeRuler`](/api/attributeruler) | Component for setting token attributes using match patterns. |
|
||
| [`Transformer`](/api/transformer) | Component for using [transformer models](/usage/embeddings-transformers) in your pipeline, accessing outputs and aligning tokens. Provided via [`spacy-transformers`](https://github.com/explosion/spacy-transformers). |
|
||
| [`TrainablePipe`](/api/pipe) | Base class for trainable pipeline components. |
|
||
| [`Multi-label TextCategorizer`](/api/textcategorizer) | Trainable component for multi-label text classification. |
|
||
|
||
<Infobox title="Details & Documentation" emoji="📖" list>
|
||
|
||
- **Usage:** [Processing pipelines](/usage/processing-pipelines)
|
||
- **API:** [Built-in pipeline components](/api#architecture-pipeline)
|
||
- **Implementation:** [`spacy/pipeline`](%%GITHUB_SPACY/spacy/pipeline)
|
||
|
||
</Infobox>
|
||
|
||
### New and improved pipeline component APIs {id="features-components"}
|
||
|
||
> #### Example
|
||
>
|
||
> ```python
|
||
> @Language.component("my_component")
|
||
> def my_component(doc):
|
||
> return doc
|
||
>
|
||
> nlp.add_pipe("my_component")
|
||
> nlp.add_pipe("ner", source=other_nlp)
|
||
> nlp.analyze_pipes(pretty=True)
|
||
> ```
|
||
|
||
Defining, configuring, reusing, training and analyzing pipeline components is
|
||
now easier and more convenient. The `@Language.component` and
|
||
`@Language.factory` decorators let you register your component, define its
|
||
default configuration and meta data, like the attribute values it assigns and
|
||
requires. Any custom component can be included during training, and sourcing
|
||
components from existing trained pipelines lets you **mix and match custom
|
||
pipelines**. The `nlp.analyze_pipes` method outputs structured information about
|
||
the current pipeline and its components, including the attributes they assign,
|
||
the scores they compute during training and whether any required attributes
|
||
aren't set.
|
||
|
||
<Infobox title="Details & Documentation" emoji="📖" list>
|
||
|
||
- **Usage:** [Custom components](/usage/processing-pipelines#custom_components),
|
||
[Defining components for training](/usage/training#config-components)
|
||
- **API:** [`@Language.component`](/api/language#component),
|
||
[`@Language.factory`](/api/language#factory),
|
||
[`Language.add_pipe`](/api/language#add_pipe),
|
||
[`Language.analyze_pipes`](/api/language#analyze_pipes)
|
||
- **Implementation:** [`spacy/language.py`](%%GITHUB_SPACY/spacy/language.py)
|
||
|
||
</Infobox>
|
||
|
||
### Dependency matching {id="features-dep-matcher"}
|
||
|
||
> #### Example
|
||
>
|
||
> ```python
|
||
> from spacy.matcher import DependencyMatcher
|
||
>
|
||
> matcher = DependencyMatcher(nlp.vocab)
|
||
> pattern = [
|
||
> {"RIGHT_ID": "anchor_founded", "RIGHT_ATTRS": {"ORTH": "founded"}},
|
||
> {"LEFT_ID": "anchor_founded", "REL_OP": ">", "RIGHT_ID": "subject", "RIGHT_ATTRS": {"DEP": "nsubj"}}
|
||
> ]
|
||
> matcher.add("FOUNDED", [pattern])
|
||
> ```
|
||
|
||
The new [`DependencyMatcher`](/api/dependencymatcher) lets you match patterns
|
||
within the dependency parse using
|
||
[Semgrex](https://nlp.stanford.edu/nlp/javadoc/javanlp/edu/stanford/nlp/semgraph/semgrex/SemgrexPattern.html)
|
||
operators. It follows the same API as the token-based [`Matcher`](/api/matcher).
|
||
A pattern added to the dependency matcher consists of a **list of
|
||
dictionaries**, with each dictionary describing a **token to match** and its
|
||
**relation to an existing token** in the pattern.
|
||
|
||
![Dependency matcher pattern](/images/dep-match-diagram.svg)
|
||
|
||
<Infobox title="Details & Documentation" emoji="📖" list>
|
||
|
||
- **Usage:**
|
||
[Dependency matching](/usage/rule-based-matching#dependencymatcher),
|
||
- **API:** [`DependencyMatcher`](/api/dependencymatcher),
|
||
- **Implementation:**
|
||
[`spacy/matcher/dependencymatcher.pyx`](%%GITHUB_SPACY/spacy/matcher/dependencymatcher.pyx)
|
||
|
||
</Infobox>
|
||
|
||
### Type hints and type-based data validation {id="features-types"}
|
||
|
||
> #### Example
|
||
>
|
||
> ```python
|
||
> from spacy.language import Language
|
||
> from pydantic import StrictBool
|
||
>
|
||
> @Language.factory("my_component")
|
||
> def create_my_component(
|
||
> nlp: Language,
|
||
> name: str,
|
||
> custom: StrictBool
|
||
> ):
|
||
> ...
|
||
> ```
|
||
|
||
spaCy v3.0 officially drops support for Python 2 and now requires **Python
|
||
3.6+**. This also means that the code base can take full advantage of
|
||
[type hints](https://docs.python.org/3/library/typing.html). spaCy's user-facing
|
||
API that's implemented in pure Python (as opposed to Cython) now comes with type
|
||
hints. The new version of spaCy's machine learning library
|
||
[Thinc](https://thinc.ai) also features extensive
|
||
[type support](https://thinc.ai/docs/usage-type-checking/), including custom
|
||
types for models and arrays, and a custom `mypy` plugin that can be used to
|
||
type-check model definitions.
|
||
|
||
For data validation, spaCy v3.0 adopts
|
||
[`pydantic`](https://github.com/samuelcolvin/pydantic). It also powers the data
|
||
validation of Thinc's [config system](https://thinc.ai/docs/usage-config), which
|
||
lets you register **custom functions with typed arguments**, reference them in
|
||
your config and see validation errors if the argument values don't match.
|
||
|
||
<Infobox title="Details & Documentation" emoji="📖" list>
|
||
|
||
- **Usage: **
|
||
[Component type hints and validation](/usage/processing-pipelines#type-hints),
|
||
[Training with custom code](/usage/training#custom-code)
|
||
- **Thinc: **
|
||
[Type checking in Thinc](https://thinc.ai/docs/usage-type-checking),
|
||
[Thinc's config system](https://thinc.ai/docs/usage-config)
|
||
|
||
</Infobox>
|
||
|
||
### New methods, attributes and commands {id="new-methods"}
|
||
|
||
The following methods, attributes and commands are new in spaCy v3.0.
|
||
|
||
| Name | Description |
|
||
| ------------------------------------------------------------------------------------------------------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ |
|
||
| [`Token.lex`](/api/token#attributes) | Access a token's [`Lexeme`](/api/lexeme). |
|
||
| [`Token.morph`](/api/token#attributes) | Access a token's morphological analysis. |
|
||
| [`Doc.spans`](/api/doc#spans) | Named span groups to store and access collections of potentially overlapping spans. Uses the new [`SpanGroup`](/api/spangroup) data structure. |
|
||
| [`Doc.has_annotation`](/api/doc#has_annotation) | Check whether a doc has annotation on a token attribute. |
|
||
| [`Language.select_pipes`](/api/language#select_pipes) | Context manager for enabling or disabling specific pipeline components for a block. |
|
||
| [`Language.disable_pipe`](/api/language#disable_pipe), [`Language.enable_pipe`](/api/language#enable_pipe) | Disable or enable a loaded pipeline component (but don't remove it). |
|
||
| [`Language.analyze_pipes`](/api/language#analyze_pipes) | [Analyze](/usage/processing-pipelines#analysis) components and their interdependencies. |
|
||
| [`Language.resume_training`](/api/language#resume_training) | Experimental: continue training a trained pipeline and initialize "rehearsal" for components that implement a `rehearse` method to prevent catastrophic forgetting. |
|
||
| [`@Language.factory`](/api/language#factory), [`@Language.component`](/api/language#component) | Decorators for [registering](/usage/processing-pipelines#custom-components) pipeline component factories and simple stateless component functions. |
|
||
| [`Language.has_factory`](/api/language#has_factory) | Check whether a component factory is registered on a language class. |
|
||
| [`Language.get_factory_meta`](/api/language#get_factory_meta), [`Language.get_pipe_meta`](/api/language#get_factory_meta) | Get the [`FactoryMeta`](/api/language#factorymeta) with component metadata for a factory or instance name. |
|
||
| [`Language.config`](/api/language#config) | The [config](/usage/training#config) used to create the current `nlp` object. An instance of [`Config`](https://thinc.ai/docs/api-config#config) and can be saved to disk and used for training. |
|
||
| [`Language.components`](/api/language#attributes), [`Language.component_names`](/api/language#attributes) | All available components and component names, including disabled components that are not run as part of the pipeline. |
|
||
| [`Language.disabled`](/api/language#attributes) | Names of disabled components that are not run as part of the pipeline. |
|
||
| [`TrainablePipe.score`](/api/pipe#score) | Method on pipeline components that returns a dictionary of evaluation scores. |
|
||
| [`registry`](/api/top-level#registry) | Function registry to map functions to string names that can be referenced in [configs](/usage/training#config). |
|
||
| [`util.load_meta`](/api/top-level#util.load_meta), [`util.load_config`](/api/top-level#util.load_config) | Updated helpers for loading a pipeline's [`meta.json`](/api/data-formats#meta) and [`config.cfg`](/api/data-formats#config). |
|
||
| [`util.get_installed_models`](/api/top-level#util.get_installed_models) | Names of all pipeline packages installed in the environment. |
|
||
| [`init config`](/api/cli#init-config), [`init fill-config`](/api/cli#init-fill-config), [`debug config`](/api/cli#debug-config) | CLI commands for initializing, auto-filling and debugging [training configs](/usage/training). |
|
||
| [`init vectors`](/api/cli#init-vectors) | Convert word vectors for use with spaCy. |
|
||
| [`init labels`](/api/cli#init-labels) | Generate JSON files for the labels in the data to speed up training. |
|
||
| [`project`](/api/cli#project) | Suite of CLI commands for cloning, running and managing [spaCy projects](/usage/projects). |
|
||
| [`ray`](/api/cli#ray) | Suite of CLI commands for parallel training with [Ray](https://ray.io/), provided by the [`spacy-ray`](https://github.com/explosion/spacy-ray) extension package. |
|
||
|
||
### New and updated documentation {id="new-docs"}
|
||
|
||
<Grid cols={2} gutterBottom={false}>
|
||
|
||
<div>
|
||
|
||
To help you get started with spaCy v3.0 and the new features, we've added
|
||
several new or rewritten documentation pages, including a new usage guide on
|
||
[embeddings, transformers and transfer learning](/usage/embeddings-transformers),
|
||
a guide on [training pipelines and models](/usage/training) rewritten from
|
||
scratch, a page explaining the new [spaCy projects](/usage/projects) and updated
|
||
usage documentation on
|
||
[custom pipeline components](/usage/processing-pipelines#custom-components).
|
||
We've also added a bunch of new illustrations and new API reference pages
|
||
documenting spaCy's machine learning [model architectures](/api/architectures)
|
||
and the expected [data formats](/api/data-formats). API pages about
|
||
[pipeline components](/api/#architecture-pipeline) now include more information,
|
||
like the default config and implementation, and we've adopted a more detailed
|
||
format for documenting argument and return types.
|
||
|
||
</div>
|
||
|
||
<Image src="/images/architecture.svg" href="/api" alt="Library architecture" />
|
||
|
||
</Grid>
|
||
|
||
<Infobox title="New or reworked documentation" emoji="📖" list>
|
||
|
||
- **Usage: ** [Embeddings & Transformers](/usage/embeddings-transformers),
|
||
[Training models](/usage/training),
|
||
[Layers & Architectures](/usage/layers-architectures),
|
||
[Projects](/usage/projects),
|
||
[Custom pipeline components](/usage/processing-pipelines#custom-components),
|
||
[Custom tokenizers](/usage/linguistic-features#custom-tokenizer),
|
||
[Morphology](/usage/linguistic-features#morphology),
|
||
[Lemmatization](/usage/linguistic-features#lemmatization),
|
||
[Mapping & Exceptions](/usage/linguistic-features#mappings-exceptions),
|
||
[Dependency matching](/usage/rule-based-matching#dependencymatcher)
|
||
- **API Reference: ** [Library architecture](/api),
|
||
[Model architectures](/api/architectures), [Data formats](/api/data-formats)
|
||
- **New Classes: ** [`Example`](/api/example), [`Tok2Vec`](/api/tok2vec),
|
||
[`Transformer`](/api/transformer), [`Lemmatizer`](/api/lemmatizer),
|
||
[`Morphologizer`](/api/morphologizer),
|
||
[`AttributeRuler`](/api/attributeruler),
|
||
[`SentenceRecognizer`](/api/sentencerecognizer),
|
||
[`DependencyMatcher`](/api/dependencymatcher), [`TrainablePipe`](/api/pipe),
|
||
[`Corpus`](/api/corpus), [`SpanGroup`](/api/spangroup),
|
||
|
||
</Infobox>
|
||
|
||
## Backwards Incompatibilities {id="incompat"}
|
||
|
||
As always, we've tried to keep the breaking changes to a minimum and focus on
|
||
changes that were necessary to support the new features, fix problems or improve
|
||
usability. The following section lists the relevant changes to the user-facing
|
||
API. For specific examples of how to rewrite your code, check out the
|
||
[migration guide](#migrating).
|
||
|
||
<Infobox variant="warning">
|
||
|
||
Note that spaCy v3.0 now requires **Python 3.6+**.
|
||
|
||
</Infobox>
|
||
|
||
### API changes {id="incompat-api"}
|
||
|
||
- Pipeline package symlinks, the `link` command and shortcut names are now
|
||
deprecated. There can be many [different trained pipelines](/models) and not
|
||
just one "English model", so you should always use the full package name like
|
||
`en_core_web_sm` explicitly.
|
||
- A pipeline's `meta.json` is now only used to provide meta information like the
|
||
package name, author, license and labels. It's **not** used to construct the
|
||
processing pipeline anymore. This is all defined in the
|
||
[`config.cfg`](/api/data-formats#config), which also includes all settings
|
||
used to train the pipeline.
|
||
- The `train`, `pretrain` and `debug data` commands now only take a
|
||
`config.cfg`.
|
||
- [`Language.add_pipe`](/api/language#add_pipe) now takes the **string name** of
|
||
the component factory instead of the component function.
|
||
- **Custom pipeline components** now need to be decorated with the
|
||
[`@Language.component`](/api/language#component) or
|
||
[`@Language.factory`](/api/language#factory) decorator.
|
||
- The [`Language.update`](/api/language#update),
|
||
[`Language.evaluate`](/api/language#evaluate) and
|
||
[`TrainablePipe.update`](/api/pipe#update) methods now all take batches of
|
||
[`Example`](/api/example) objects instead of `Doc` and `GoldParse` objects, or
|
||
raw text and a dictionary of annotations.
|
||
- The `begin_training` methods have been renamed to `initialize` and now take a
|
||
function that returns a sequence of `Example` objects to initialize the model
|
||
instead of a list of tuples.
|
||
- [`Matcher.add`](/api/matcher#add) and
|
||
[`PhraseMatcher.add`](/api/phrasematcher#add) now only accept a list of
|
||
patterns as the second argument (instead of a variable number of arguments).
|
||
The `on_match` callback becomes an optional keyword argument.
|
||
- The `Doc` flags like `Doc.is_parsed` or `Doc.is_tagged` have been replaced by
|
||
[`Doc.has_annotation`](/api/doc#has_annotation).
|
||
- The `spacy.gold` module has been renamed to
|
||
[`spacy.training`](%%GITHUB_SPACY/spacy/training).
|
||
- The `PRON_LEMMA` symbol and `-PRON-` as an indicator for pronoun lemmas has
|
||
been removed.
|
||
- The `TAG_MAP` and `MORPH_RULES` in the language data have been replaced by the
|
||
more flexible [`AttributeRuler`](/api/attributeruler).
|
||
- The [`Lemmatizer`](/api/lemmatizer) is now a standalone pipeline component and
|
||
doesn't provide lemmas by default or switch automatically between lookup and
|
||
rule-based lemmas. You can now add it to your pipeline explicitly and set its
|
||
mode on initialization.
|
||
- Various keyword arguments across functions and methods are now explicitly
|
||
declared as **keyword-only** arguments. Those arguments are documented
|
||
accordingly across the API reference using the <Tag>keyword-only</Tag> tag.
|
||
- The `textcat` pipeline component is now only applicable for classification of
|
||
mutually exclusives classes - i.e. one predicted class per input sentence or
|
||
document. To perform multi-label classification, use the new
|
||
`textcat_multilabel` component instead.
|
||
|
||
### Removed or renamed API {id="incompat-removed"}
|
||
|
||
| Removed | Replacement |
|
||
| -------------------------------------------------------------------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ |
|
||
| `Language.disable_pipes` | [`Language.select_pipes`](/api/language#select_pipes), [`Language.disable_pipe`](/api/language#disable_pipe), [`Language.enable_pipe`](/api/language#enable_pipe) |
|
||
| `Language.begin_training`, `Pipe.begin_training`, ... | [`Language.initialize`](/api/language#initialize), [`Pipe.initialize`](/api/pipe#initialize), ... |
|
||
| `Doc.is_tagged`, `Doc.is_parsed`, ... | [`Doc.has_annotation`](/api/doc#has_annotation) |
|
||
| `GoldParse` | [`Example`](/api/example) |
|
||
| `GoldCorpus` | [`Corpus`](/api/corpus) |
|
||
| `KnowledgeBase.load_bulk`, `KnowledgeBase.dump` | [`KnowledgeBase.from_disk`](/api/kb#from_disk), [`KnowledgeBase.to_disk`](/api/kb#to_disk) |
|
||
| `KnowledgeBase.get_candidates` | [`KnowledgeBase.get_alias_candidates`](/api/kb#get_alias_candidates) |
|
||
| `Matcher.pipe`, `PhraseMatcher.pipe` | not needed |
|
||
| `gold.offsets_from_biluo_tags`, `gold.spans_from_biluo_tags`, `gold.biluo_tags_from_offsets` | [`training.biluo_tags_to_offsets`](/api/top-level#biluo_tags_to_offsets), [`training.biluo_tags_to_spans`](/api/top-level#biluo_tags_to_spans), [`training.offsets_to_biluo_tags`](/api/top-level#offsets_to_biluo_tags) |
|
||
| `spacy init-model` | [`spacy init vectors`](/api/cli#init-vectors) |
|
||
| `spacy debug-data` | [`spacy debug data`](/api/cli#debug-data) |
|
||
| `spacy profile` | [`spacy debug profile`](/api/cli#debug-profile) |
|
||
| `spacy link`, `util.set_data_path`, `util.get_data_path` | not needed, symlinks are deprecated |
|
||
|
||
The following methods, attributes and arguments were removed in v3.0. Most of
|
||
them have been **deprecated for a while** and many would previously raise
|
||
errors. Many of them were also mostly internals. If you've been working with
|
||
more recent versions of spaCy v2.x, it's **unlikely** that your code relied on
|
||
them.
|
||
|
||
| Removed | Replacement |
|
||
| ----------------------------------------------------------------------------------------------------------------------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------- |
|
||
| `Doc.tokens_from_list` | [`Doc.__init__`](/api/doc#init) |
|
||
| `Doc.merge`, `Span.merge` | [`Doc.retokenize`](/api/doc#retokenize) |
|
||
| `Token.string`, `Span.string`, `Span.upper`, `Span.lower` | [`Span.text`](/api/span#attributes), [`Token.text`](/api/token#attributes) |
|
||
| `Language.tagger`, `Language.parser`, `Language.entity` | [`Language.get_pipe`](/api/language#get_pipe) |
|
||
| keyword-arguments like `vocab=False` on `to_disk`, `from_disk`, `to_bytes`, `from_bytes` | `exclude=["vocab"]` |
|
||
| `n_threads` argument on [`Tokenizer`](/api/tokenizer), [`Matcher`](/api/matcher), [`PhraseMatcher`](/api/phrasematcher) | `n_process` |
|
||
| `verbose` argument on [`Language.evaluate`](/api/language#evaluate) | logging (`DEBUG`) |
|
||
| `SentenceSegmenter` hook, `SimilarityHook` | [user hooks](/usage/processing-pipelines#custom-components-user-hooks), [`Sentencizer`](/api/sentencizer), [`SentenceRecognizer`](/api/sentencerecognizer) |
|
||
|
||
## Migrating from v2.x {id="migrating"}
|
||
|
||
### Downloading and loading trained pipelines {id="migrating-downloading-models"}
|
||
|
||
Symlinks and shortcuts like `en` have been deprecated for a while, and are now
|
||
not supported anymore. There are [many different trained pipelines](/models)
|
||
with different capabilities and not just one "English model". In order to
|
||
download and load a package, you should always use its full name – for instance,
|
||
[`en_core_web_sm`](/models/en#en_core_web_sm).
|
||
|
||
```diff
|
||
- python -m spacy download en
|
||
+ python -m spacy download en_core_web_sm
|
||
```
|
||
|
||
```diff
|
||
- nlp = spacy.load("en")
|
||
+ nlp = spacy.load("en_core_web_sm")
|
||
```
|
||
|
||
### Custom pipeline components and factories {id="migrating-pipeline-components"}
|
||
|
||
Custom pipeline components now have to be registered explicitly using the
|
||
[`@Language.component`](/api/language#component) or
|
||
[`@Language.factory`](/api/language#factory) decorator. For simple functions
|
||
that take a `Doc` and return it, all you have to do is add the
|
||
`@Language.component` decorator to it and assign it a name:
|
||
|
||
```diff {title="Stateless function components"}
|
||
+ from spacy.language import Language
|
||
|
||
+ @Language.component("my_component")
|
||
def my_component(doc):
|
||
return doc
|
||
```
|
||
|
||
For class components that are initialized with settings and/or the shared `nlp`
|
||
object, you can use the `@Language.factory` decorator. Also make sure that that
|
||
the method used to initialize the factory has **two named arguments**: `nlp`
|
||
(the current `nlp` object) and `name` (the string name of the component
|
||
instance).
|
||
|
||
```diff {title="Stateful class components"}
|
||
+ from spacy.language import Language
|
||
|
||
+ @Language.factory("my_component")
|
||
class MyComponent:
|
||
- def __init__(self, nlp):
|
||
+ def __init__(self, nlp, name):
|
||
self.nlp = nlp
|
||
|
||
def __call__(self, doc):
|
||
return doc
|
||
```
|
||
|
||
Instead of decorating your class, you could also add a factory function that
|
||
takes the arguments `nlp` and `name` and returns an instance of your component:
|
||
|
||
```diff {title="Stateful class components with factory function"}
|
||
+ from spacy.language import Language
|
||
|
||
+ @Language.factory("my_component")
|
||
+ def create_my_component(nlp, name):
|
||
+ return MyComponent(nlp)
|
||
|
||
class MyComponent:
|
||
def __init__(self, nlp):
|
||
self.nlp = nlp
|
||
|
||
def __call__(self, doc):
|
||
return doc
|
||
```
|
||
|
||
The `@Language.component` and `@Language.factory` decorators now take care of
|
||
adding an entry to the component factories, so spaCy knows how to load a
|
||
component back in from its string name. You won't have to write to
|
||
`Language.factories` manually anymore.
|
||
|
||
```diff
|
||
- Language.factories["my_component"] = lambda nlp, **cfg: MyComponent(nlp)
|
||
```
|
||
|
||
#### Adding components to the pipeline {id="migrating-add-pipe"}
|
||
|
||
The [`nlp.add_pipe`](/api/language#add_pipe) method now takes the **string
|
||
name** of the component factory instead of a callable component. This allows
|
||
spaCy to track and serialize components that have been added and their settings.
|
||
|
||
```diff
|
||
+ @Language.component("my_component")
|
||
def my_component(doc):
|
||
return doc
|
||
|
||
- nlp.add_pipe(my_component)
|
||
+ nlp.add_pipe("my_component")
|
||
```
|
||
|
||
[`nlp.add_pipe`](/api/language#add_pipe) now also returns the pipeline component
|
||
itself, so you can access its attributes. The
|
||
[`nlp.create_pipe`](/api/language#create_pipe) method is now mostly internals
|
||
and you typically shouldn't have to use it in your code.
|
||
|
||
```diff
|
||
- parser = nlp.create_pipe("parser")
|
||
- nlp.add_pipe(parser)
|
||
+ parser = nlp.add_pipe("parser")
|
||
```
|
||
|
||
If you need to add a component from an existing trained pipeline, you can now
|
||
use the `source` argument on [`nlp.add_pipe`](/api/language#add_pipe). This will
|
||
check that the component is compatible, and take care of porting over all
|
||
config. During training, you can also reference existing trained components in
|
||
your [config](/usage/training#config-components) and decide whether or not they
|
||
should be updated with more data.
|
||
|
||
> #### config.cfg (excerpt)
|
||
>
|
||
> ```ini
|
||
> [components.ner]
|
||
> source = "en_core_web_sm"
|
||
> component = "ner"
|
||
> ```
|
||
|
||
```diff
|
||
source_nlp = spacy.load("en_core_web_sm")
|
||
nlp = spacy.blank("en")
|
||
- ner = source_nlp.get_pipe("ner")
|
||
- nlp.add_pipe(ner)
|
||
+ nlp.add_pipe("ner", source=source_nlp)
|
||
```
|
||
|
||
#### Configuring pipeline components with settings {id="migrating-configure-pipe"}
|
||
|
||
Because pipeline components are now added using their string names, you won't
|
||
have to instantiate the [component classes](/api/#architecture-pipeline)
|
||
directly anymore. To configure the component, you can now use the `config`
|
||
argument on [`nlp.add_pipe`](/api/language#add_pipe).
|
||
|
||
> #### config.cfg (excerpt)
|
||
>
|
||
> ```ini
|
||
> [components.sentencizer]
|
||
> factory = "sentencizer"
|
||
> punct_chars = ["!", ".", "?"]
|
||
> ```
|
||
|
||
```diff
|
||
punct_chars = ["!", ".", "?"]
|
||
- sentencizer = Sentencizer(punct_chars=punct_chars)
|
||
+ sentencizer = nlp.add_pipe("sentencizer", config={"punct_chars": punct_chars})
|
||
```
|
||
|
||
The `config` corresponds to the component settings in the
|
||
[`config.cfg`](/usage/training#config-components) and will overwrite the default
|
||
config defined by the components.
|
||
|
||
<Infobox variant="warning" title="Important note on config values">
|
||
|
||
Config values you pass to components **need to be JSON-serializable** and can't
|
||
be arbitrary Python objects. Otherwise, the settings you provide can't be
|
||
represented in the `config.cfg` and spaCy has no way of knowing how to re-create
|
||
your component with the same settings when you load the pipeline back in. If you
|
||
need to pass arbitrary objects to a component, use a
|
||
[registered function](/usage/processing-pipelines#example-stateful-components):
|
||
|
||
```diff
|
||
- config = {"model": MyTaggerModel()}
|
||
+ config= {"model": {"@architectures": "MyTaggerModel"}}
|
||
tagger = nlp.add_pipe("tagger", config=config)
|
||
```
|
||
|
||
</Infobox>
|
||
|
||
### Adding match patterns {id="migrating-matcher"}
|
||
|
||
The [`Matcher.add`](/api/matcher#add),
|
||
[`PhraseMatcher.add`](/api/phrasematcher#add) and
|
||
[`DependencyMatcher.add`](/api/dependencymatcher#add) methods now only accept a
|
||
**list of patterns** as the second argument (instead of a variable number of
|
||
arguments). The `on_match` callback becomes an optional keyword argument.
|
||
|
||
```diff
|
||
matcher = Matcher(nlp.vocab)
|
||
patterns = [[{"TEXT": "Google"}, {"TEXT": "Now"}], [{"TEXT": "GoogleNow"}]]
|
||
- matcher.add("GoogleNow", on_match, *patterns)
|
||
+ matcher.add("GoogleNow", patterns, on_match=on_match)
|
||
```
|
||
|
||
```diff
|
||
matcher = PhraseMatcher(nlp.vocab)
|
||
patterns = [nlp("health care reform"), nlp("healthcare reform")]
|
||
- matcher.add("HEALTH", on_match, *patterns)
|
||
+ matcher.add("HEALTH", patterns, on_match=on_match)
|
||
```
|
||
|
||
### Migrating attributes in tokenizer exceptions {id="migrating-tokenizer-exceptions"}
|
||
|
||
Tokenizer exceptions are now only allowed to set `ORTH` and `NORM` values as
|
||
part of the token attributes. Exceptions for other attributes such as `TAG` and
|
||
`LEMMA` should be moved to an [`AttributeRuler`](/api/attributeruler) component:
|
||
|
||
```diff
|
||
nlp = spacy.blank("en")
|
||
- nlp.tokenizer.add_special_case("don't", [{"ORTH": "do"}, {"ORTH": "n't", "LEMMA": "not"}])
|
||
+ nlp.tokenizer.add_special_case("don't", [{"ORTH": "do"}, {"ORTH": "n't"}])
|
||
+ ruler = nlp.add_pipe("attribute_ruler")
|
||
+ ruler.add(patterns=[[{"ORTH": "n't"}]], attrs={"LEMMA": "not"})
|
||
```
|
||
|
||
### Migrating tag maps and morph rules {id="migrating-training-mappings-exceptions"}
|
||
|
||
Instead of defining a `tag_map` and `morph_rules` in the language data, spaCy
|
||
v3.0 now manages mappings and exceptions with a separate and more flexible
|
||
pipeline component, the [`AttributeRuler`](/api/attributeruler). See the
|
||
[usage guide](/usage/linguistic-features#mappings-exceptions) for examples. If
|
||
you have tag maps and morph rules in the v2.x format, you can load them into the
|
||
attribute ruler before training using the `[initialize]` block of your config.
|
||
|
||
### Using Lexeme Tables
|
||
|
||
To use tables like `lexeme_prob` when training a model from scratch, you need to
|
||
add an entry to the `initialize` block in your config. Here's what that looks
|
||
like for the existing trained pipelines:
|
||
|
||
```ini
|
||
[initialize.lookups]
|
||
@misc = "spacy.LookupsDataLoader.v1"
|
||
lang = ${nlp.lang}
|
||
tables = ["lexeme_norm"]
|
||
```
|
||
|
||
> #### What does the initialization do?
|
||
>
|
||
> The `[initialize]` block is used when
|
||
> [`nlp.initialize`](/api/language#initialize) is called (usually right before
|
||
> training). It lets you define data resources for initializing the pipeline in
|
||
> your `config.cfg`. After training, the rules are saved to disk with the
|
||
> exported pipeline, so your runtime model doesn't depend on local data. For
|
||
> details see the [config lifecycle](/usage/training/#config-lifecycle) and
|
||
> [initialization](/usage/training/#initialization) docs.
|
||
|
||
```ini {title="config.cfg (excerpt)"}
|
||
[initialize.components.attribute_ruler]
|
||
|
||
[initialize.components.attribute_ruler.tag_map]
|
||
@readers = "srsly.read_json.v1"
|
||
path = "./corpus/tag_map.json"
|
||
```
|
||
|
||
The `AttributeRuler` also provides two handy helper methods
|
||
[`load_from_tag_map`](/api/attributeruler#load_from_tag_map) and
|
||
[`load_from_morph_rules`](/api/attributeruler#load_from_morph_rules) that let
|
||
you load in your existing tag map or morph rules:
|
||
|
||
```diff
|
||
nlp = spacy.blank("en")
|
||
- nlp.vocab.morphology.load_tag_map(YOUR_TAG_MAP)
|
||
+ ruler = nlp.add_pipe("attribute_ruler")
|
||
+ ruler.load_from_tag_map(YOUR_TAG_MAP)
|
||
```
|
||
|
||
### Migrating Doc flags {id="migrating-doc-flags"}
|
||
|
||
The [`Doc`](/api/doc) flags `Doc.is_tagged`, `Doc.is_parsed`, `Doc.is_nered` and
|
||
`Doc.is_sentenced` are deprecated in v3.0 and replaced by
|
||
[`Doc.has_annotation`](/api/doc#has_annotation) method, which refers to the
|
||
token attribute symbols (the same symbols used in [`Matcher`](/api/matcher)
|
||
patterns):
|
||
|
||
```diff
|
||
doc = nlp(text)
|
||
- doc.is_parsed
|
||
+ doc.has_annotation("DEP")
|
||
- doc.is_tagged
|
||
+ doc.has_annotation("TAG")
|
||
- doc.is_sentenced
|
||
+ doc.has_annotation("SENT_START")
|
||
- doc.is_nered
|
||
+ doc.has_annotation("ENT_IOB")
|
||
```
|
||
|
||
### Training pipelines and models {id="migrating-training"}
|
||
|
||
To train your pipelines, you should now pretty much always use the
|
||
[`spacy train`](/api/cli#train) CLI. You shouldn't have to put together your own
|
||
training scripts anymore, unless you _really_ want to. The training commands now
|
||
use a [flexible config file](/usage/training#config) that describes all training
|
||
settings and hyperparameters, as well as your pipeline, components and
|
||
architectures to use. The `--code` argument lets you pass in code containing
|
||
[custom registered functions](/usage/training#custom-code) that you can
|
||
reference in your config. To get started, check out the
|
||
[quickstart widget](/usage/training#quickstart).
|
||
|
||
#### Binary .spacy training data format {id="migrating-training-format"}
|
||
|
||
spaCy v3.0 uses a new
|
||
[binary training data format](/api/data-formats#binary-training) created by
|
||
serializing a [`DocBin`](/api/docbin), which represents a collection of `Doc`
|
||
objects. This means that you can train spaCy pipelines using the same format it
|
||
outputs: annotated `Doc` objects. The binary format is extremely **efficient in
|
||
storage**, especially when packing multiple documents together. You can convert
|
||
your existing JSON-formatted data using the [`spacy convert`](/api/cli#convert)
|
||
command, which outputs `.spacy` files:
|
||
|
||
```bash
|
||
$ python -m spacy convert ./training.json ./output
|
||
```
|
||
|
||
#### Training config {id="migrating-training-config"}
|
||
|
||
The easiest way to get started with a training config is to use the
|
||
[`init config`](/api/cli#init-config) command or the
|
||
[quickstart widget](/usage/training#quickstart). You can define your
|
||
requirements, and it will auto-generate a starter config with the best-matching
|
||
default settings.
|
||
|
||
```bash
|
||
$ python -m spacy init config ./config.cfg --lang en --pipeline tagger,parser
|
||
```
|
||
|
||
If you've exported a starter config from our
|
||
[quickstart widget](/usage/training#quickstart), you can use the
|
||
[`init fill-config`](/api/cli#init-fill-config) to fill it with all default
|
||
values. You can then use the auto-generated `config.cfg` for training:
|
||
|
||
```diff
|
||
- python -m spacy train en ./output ./train.json ./dev.json
|
||
--pipeline tagger,parser --cnn-window 1 --bilstm-depth 0
|
||
+ python -m spacy train ./config.cfg --output ./output
|
||
```
|
||
|
||
<Project id="pipelines/tagger_parser_ud">
|
||
|
||
The easiest way to get started is to clone a [project template](/usage/projects)
|
||
and run it – for example, this end-to-end template that lets you train a
|
||
**part-of-speech tagger** and **dependency parser** on a Universal Dependencies
|
||
treebank.
|
||
|
||
</Project>
|
||
|
||
#### Modifying tokenizer settings
|
||
|
||
If you were using a base model with `spacy train` to customize the tokenizer
|
||
settings in v2, your modifications can be provided in the
|
||
`[initialize.before_init]` callback.
|
||
|
||
Write a registered callback that modifies the tokenizer settings and specify
|
||
this callback in your config:
|
||
|
||
> #### config.cfg
|
||
>
|
||
> ```ini
|
||
> [initialize]
|
||
>
|
||
> [initialize.before_init]
|
||
> @callbacks = "customize_tokenizer"
|
||
> ```
|
||
|
||
```python {title="functions.py"}
|
||
from spacy.util import registry, compile_suffix_regex
|
||
|
||
@registry.callbacks("customize_tokenizer")
|
||
def make_customize_tokenizer():
|
||
def customize_tokenizer(nlp):
|
||
# remove a suffix
|
||
suffixes = list(nlp.Defaults.suffixes)
|
||
suffixes.remove("\\[")
|
||
suffix_regex = compile_suffix_regex(suffixes)
|
||
nlp.tokenizer.suffix_search = suffix_regex.search
|
||
|
||
# add a special case
|
||
nlp.tokenizer.add_special_case("_SPECIAL_", [{"ORTH": "_SPECIAL_"}])
|
||
return customize_tokenizer
|
||
```
|
||
|
||
When training, provide the function above with the `--code` option:
|
||
|
||
```bash
|
||
$ python -m spacy train config.cfg --code ./functions.py
|
||
```
|
||
|
||
The train step requires the `--code` option with your registered functions from
|
||
the `[initialize]` block, but since those callbacks are only required during the
|
||
initialization step, you don't need to provide them with the final pipeline
|
||
package. However, to make it easier for others to replicate your training setup,
|
||
you can choose to package the initialization callbacks with the pipeline package
|
||
or to publish them separately.
|
||
|
||
#### Training via the Python API {id="migrating-training-python"}
|
||
|
||
For most use cases, you **shouldn't** have to write your own training scripts
|
||
anymore. Instead, you can use [`spacy train`](/api/cli#train) with a
|
||
[config file](/usage/training#config) and custom
|
||
[registered functions](/usage/training#custom-code) if needed. You can even
|
||
register callbacks that can modify the `nlp` object at different stages of its
|
||
lifecycle to fully customize it before training.
|
||
|
||
If you do decide to use the [internal training API](/usage/training#api) from
|
||
Python, you should only need a few small modifications to convert your scripts
|
||
from spaCy v2.x to v3.x. The [`Example.from_dict`](/api/example#from_dict)
|
||
classmethod takes a reference `Doc` and a
|
||
[dictionary of annotations](/api/data-formats#dict-input), similar to the
|
||
"simple training style" in spaCy v2.x:
|
||
|
||
```diff {title="Migrating Doc and GoldParse"}
|
||
doc = nlp.make_doc("Mark Zuckerberg is the CEO of Facebook")
|
||
entities = [(0, 15, "PERSON"), (30, 38, "ORG")]
|
||
- gold = GoldParse(doc, entities=entities)
|
||
+ example = Example.from_dict(doc, {"entities": entities})
|
||
```
|
||
|
||
```diff {title="Migrating simple training style"}
|
||
text = "Mark Zuckerberg is the CEO of Facebook"
|
||
annotations = {"entities": [(0, 15, "PERSON"), (30, 38, "ORG")]}
|
||
+ doc = nlp.make_doc(text)
|
||
+ example = Example.from_dict(doc, annotations)
|
||
```
|
||
|
||
The [`Language.update`](/api/language#update),
|
||
[`Language.evaluate`](/api/language#evaluate) and
|
||
[`TrainablePipe.update`](/api/pipe#update) methods now all take batches of
|
||
[`Example`](/api/example) objects instead of `Doc` and `GoldParse` objects, or
|
||
raw text and a dictionary of annotations.
|
||
|
||
```python {title="Training loop",highlight="5-8,12"}
|
||
TRAIN_DATA = [
|
||
("Who is Shaka Khan?", {"entities": [(7, 17, "PERSON")]}),
|
||
("I like London.", {"entities": [(7, 13, "LOC")]}),
|
||
]
|
||
examples = []
|
||
for text, annots in TRAIN_DATA:
|
||
examples.append(Example.from_dict(nlp.make_doc(text), annots))
|
||
nlp.initialize(lambda: examples)
|
||
for i in range(20):
|
||
random.shuffle(examples)
|
||
for batch in minibatch(examples, size=8):
|
||
nlp.update(batch)
|
||
```
|
||
|
||
`Language.begin_training` and `TrainablePipe.begin_training` have been renamed
|
||
to [`Language.initialize`](/api/language#initialize) and
|
||
[`TrainablePipe.initialize`](/api/pipe#initialize), and the methods now take a
|
||
function that returns a sequence of `Example` objects to initialize the model
|
||
instead of a list of tuples. The data examples are used to **initialize the
|
||
models** of trainable pipeline components, which includes validating the
|
||
network,
|
||
[inferring missing shapes](https://thinc.ai/docs/usage-models#validation) and
|
||
setting up the label scheme.
|
||
|
||
```diff
|
||
- nlp.begin_training()
|
||
+ nlp.initialize(lambda: examples)
|
||
```
|
||
|
||
#### Packaging trained pipelines {id="migrating-training-packaging"}
|
||
|
||
The [`spacy package`](/api/cli#package) command now automatically builds the
|
||
installable `.tar.gz` sdist of the Python package, so you don't have to run this
|
||
step manually anymore. To disable the behavior, you can set `--build none`. You
|
||
can also choose to build a binary wheel (which installs more efficiently) by
|
||
setting `--build wheel`, or to build both the sdist and wheel by setting
|
||
`--build sdist,wheel`.
|
||
|
||
```diff
|
||
python -m spacy package ./output ./packages
|
||
- cd /output/en_pipeline-0.0.0
|
||
- python setup.py sdist
|
||
```
|
||
|
||
#### Data utilities and gold module {id="migrating-gold"}
|
||
|
||
The `spacy.gold` module has been renamed to `spacy.training` and the conversion
|
||
utilities now follow the naming format of `x_to_y`. This mostly affects
|
||
internals, but if you've been using the span offset conversion utilities
|
||
[`offsets_to_biluo_tags`](/api/top-level#offsets_to_biluo_tags),
|
||
[`biluo_tags_to_offsets`](/api/top-level#biluo_tags_to_offsets) or
|
||
[`biluo_tags_to_spans`](/api/top-level#biluo_tags_to_spans), you'll have to
|
||
change your names and imports:
|
||
|
||
```diff
|
||
- from spacy.gold import biluo_tags_from_offsets, offsets_from_biluo_tags, spans_from_biluo_tags
|
||
+ from spacy.training import offsets_to_biluo_tags, biluo_tags_to_offsets, biluo_tags_to_spans
|
||
```
|
||
|
||
#### Migration notes for plugin maintainers {id="migrating-plugins"}
|
||
|
||
Thanks to everyone who's been contributing to the spaCy ecosystem by developing
|
||
and maintaining one of the many awesome [plugins and extensions](/universe).
|
||
We've tried to make it as easy as possible for you to upgrade your packages for
|
||
spaCy v3.0. The most common use case for plugins is providing pipeline
|
||
components and extension attributes. When migrating your plugin, double-check
|
||
the following:
|
||
|
||
- Use the [`@Language.factory`](/api/language#factory) decorator to register
|
||
your component and assign it a name. This allows users to refer to your
|
||
components by name and serialize pipelines referencing them. Remove all manual
|
||
entries to the `Language.factories`.
|
||
- Make sure your component factories take at least two **named arguments**:
|
||
`nlp` (the current `nlp` object) and `name` (the instance name of the added
|
||
component so you can identify multiple instances of the same component).
|
||
- Update all references to [`nlp.add_pipe`](/api/language#add_pipe) in your docs
|
||
to use **string names** instead of the component functions.
|
||
|
||
```python {highlight="1-5"}
|
||
from spacy.language import Language
|
||
|
||
@Language.factory("my_component", default_config={"some_setting": False})
|
||
def create_component(nlp: Language, name: str, some_setting: bool):
|
||
return MyCoolComponent(some_setting=some_setting)
|
||
|
||
|
||
class MyCoolComponent:
|
||
def __init__(self, some_setting):
|
||
self.some_setting = some_setting
|
||
|
||
def __call__(self, doc):
|
||
# Do something to the doc
|
||
return doc
|
||
```
|
||
|
||
> #### Result in config.cfg
|
||
>
|
||
> ```ini
|
||
> [components.my_component]
|
||
> factory = "my_component"
|
||
> some_setting = true
|
||
> ```
|
||
|
||
```diff
|
||
import spacy
|
||
from your_plugin import MyCoolComponent
|
||
|
||
nlp = spacy.load("en_core_web_sm")
|
||
- component = MyCoolComponent(some_setting=True)
|
||
- nlp.add_pipe(component)
|
||
+ nlp.add_pipe("my_component", config={"some_setting": True})
|
||
```
|
||
|
||
<Infobox title="Important note on registering factories" variant="warning">
|
||
|
||
The [`@Language.factory`](/api/language#factory) decorator takes care of letting
|
||
spaCy know that a component of that name is available. This means that your
|
||
users can add it to the pipeline using its **string name**. However, this
|
||
requires the decorator to be executed – so users will still have to **import
|
||
your plugin**. Alternatively, your plugin could expose an
|
||
[entry point](/usage/saving-loading#entry-points), which spaCy can read from.
|
||
This means that spaCy knows how to initialize `my_component`, even if your
|
||
package isn't imported.
|
||
|
||
</Infobox>
|
||
|
||
#### Using GPUs in Jupyter notebooks {id="jupyter-notebook-gpu"}
|
||
|
||
In Jupyter notebooks, run [`prefer_gpu`](/api/top-level#spacy.prefer_gpu),
|
||
[`require_gpu`](/api/top-level#spacy.require_gpu) or
|
||
[`require_cpu`](/api/top-level#spacy.require_cpu) in the same cell as
|
||
[`spacy.load`](/api/top-level#spacy.load) to ensure that the model is loaded on
|
||
the correct device.
|
||
|
||
Due to a bug related to `contextvars` (see the
|
||
[bug report](https://github.com/ipython/ipython/issues/11565)), the GPU settings
|
||
may not be preserved correctly across cells, resulting in models being loaded on
|
||
the wrong device or only partially on GPU.
|