---
title: What's New in v3.0
teaser: New features, backwards incompatibilities and migration guide
menu:
  - ['Summary', 'summary']
  - ['New Features', 'features']
  - ['Backwards Incompatibilities', 'incompat']
  - ['Migrating from v2.x', 'migrating']
---

## Summary {#summary}

## New Features {#features}

### New training workflow and config system {#features-training}

- **Usage:** [Training models](/usage/training)
- **Thinc:** [Thinc's config system](https://thinc.ai/docs/usage-config), [`Config`](https://thinc.ai/docs/api-config#config)
- **CLI:** [`train`](/api/cli#train), [`pretrain`](/api/cli#pretrain), [`evaluate`](/api/cli#evaluate)
- **API:** [Config format](/api/data-formats#config), [`registry`](/api/top-level#registry)

### Transformer-based pipelines {#features-transformers}

- **Usage:** [Embeddings & Transformers](/usage/embeddings-transformers), [Training models](/usage/training)
- **API:** [`Transformer`](/api/transformer), [`TransformerData`](/api/transformer#transformerdata), [`FullTransformerBatch`](/api/transformer#fulltransformerbatch)
- **Architectures:** [TransformerModel](/api/architectures#TransformerModel), [Tok2VecListener](/api/architectures#transformers-Tok2VecListener), [Tok2VecTransformer](/api/architectures#Tok2VecTransformer)
- **Models:** [`en_core_bert_sm`](/models/en)
- **Implementation:** [`spacy-transformers`](https://github.com/explosion/spacy-transformers)

### Custom models using any framework {#features-custom-models}

### Manage end-to-end workflows with projects {#features-projects}

- **Usage:** [spaCy projects](/usage/projects), [Training models](/usage/training)
- **CLI:** [`project`](/api/cli#project), [`train`](/api/cli#train)
- **Templates:** [`projects`](https://github.com/explosion/projects)

### New built-in pipeline components {#features-pipeline-components}

| Name                                            | Description                                                                                                                                                                                                              |
| ----------------------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ |
| [`SentenceRecognizer`](/api/sentencerecognizer) | Trainable component for sentence segmentation.                                                                                                                                                                           |
| [`Morphologizer`](/api/morphologizer)           | Trainable component to predict morphological features.                                                                                                                                                                   |
| [`Lemmatizer`](/api/lemmatizer)                 | Standalone component for rule-based and lookup lemmatization.                                                                                                                                                            |
| [`AttributeRuler`](/api/attributeruler)         | Component for setting token attributes using match patterns.                                                                                                                                                             |
| [`Transformer`](/api/transformer)               | Component for using [transformer models](/usage/embeddings-transformers) in your pipeline, accessing outputs and aligning tokens. Provided via [`spacy-transformers`](https://github.com/explosion/spacy-transformers). |

- **Usage:** [Processing pipelines](/usage/processing-pipelines)
- **API:** [Built-in pipeline components](/api#architecture-pipeline)
- **Implementation:** [`spacy/pipeline`](https://github.com/explosion/spaCy/tree/develop/spacy/pipeline)
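As a minimal sketch of how these components slot into a pipeline (the factory names follow the component pages linked above, and the pattern shown is just an illustration):

```python
import spacy

nlp = spacy.blank("en")
# New built-in components are added by the string name of their factory
ruler = nlp.add_pipe("attribute_ruler")
# Example rule: assign the tag "WP" to any token whose lowercase form is "who"
ruler.add(patterns=[[{"LOWER": "who"}]], attrs={"TAG": "WP"})
doc = nlp("Who is Shaka Khan?")
```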
### New and improved pipeline component APIs {#features-components}

- `Language.factory`, `Language.component`
- `Language.analyze_pipes`
- Adding components from other models

- **Usage:** [Custom components](/usage/processing-pipelines#custom_components), [Defining components during training](/usage/training#config-components)
- **API:** [`Language`](/api/language)
- **Implementation:** [`spacy/language.py`](https://github.com/explosion/spaCy/tree/develop/spacy/language.py)

### Type hints and type-based data validation {#features-types}

> #### Example
>
> ```python
> from spacy.language import Language
> from pydantic import StrictBool
>
> @Language.factory("my_component")
> def create_my_component(
>     nlp: Language,
>     name: str,
>     custom: StrictBool
> ):
>     ...
> ```

spaCy v3.0 officially drops support for Python 2 and now requires **Python 3.6+**. This also means that the code base can take full advantage of [type hints](https://docs.python.org/3/library/typing.html). spaCy's user-facing API that's implemented in pure Python (as opposed to Cython) now comes with type hints. The new version of spaCy's machine learning library [Thinc](https://thinc.ai) also features extensive [type support](https://thinc.ai/docs/usage-type-checking/), including custom types for models and arrays, and a custom `mypy` plugin that can be used to type-check model definitions.

For data validation, spaCy v3.0 adopts [`pydantic`](https://github.com/samuelcolvin/pydantic). It also powers the data validation of Thinc's [config system](https://thinc.ai/docs/usage-config), which lets you register **custom functions with typed arguments**, reference them in your config and see validation errors if the argument values don't match.

- **Usage:** [Component type hints and validation](/usage/processing-pipelines#type-hints), [Training with custom code](/usage/training#custom-code)
- **Thinc:** [Type checking in Thinc](https://thinc.ai/docs/usage-type-checking), [Thinc's config system](https://thinc.ai/docs/usage-config)
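To sketch what this looks like in practice (the function and setting names here are made up for illustration), you could register a typed function and reference it from your config:

```python
from typing import Iterator
import spacy

@spacy.registry.schedules("my_schedule.v1")
def my_schedule(start: float, factor: float) -> Iterator[float]:
    # Yield an endless sequence of values, scaled by factor at each step
    value = start
    while True:
        yield value
        value *= factor
```

```ini
[training.batch_size]
@schedules = "my_schedule.v1"
start = 100
factor = 1.005
```

If the config supplied, say, a string where `start` expects a float, the pydantic-powered validation would raise an error pointing at the offending value.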
### New methods, attributes and commands {#new-methods}

The following methods, attributes and commands are new in spaCy v3.0.

| Name                                                                                                                            | Description                                                                                                                                                                                       |
| ------------------------------------------------------------------------------------------------------------------------------- | --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| [`Token.lex`](/api/token#attributes)                                                                                            | Access a token's [`Lexeme`](/api/lexeme).                                                                                                                                                         |
| [`Language.select_pipes`](/api/language#select_pipes)                                                                           | Context manager for enabling or disabling specific pipeline components for a block.                                                                                                              |
| [`Language.analyze_pipes`](/api/language#analyze_pipes)                                                                         | [Analyze](/usage/processing-pipelines#analysis) components and their interdependencies.                                                                                                           |
| [`Language.resume_training`](/api/language#resume_training)                                                                     | Experimental: continue training a pretrained model and initialize "rehearsal" for components that implement a `rehearse` method to prevent catastrophic forgetting.                               |
| [`@Language.factory`](/api/language#factory) [`@Language.component`](/api/language#component)                                   | Decorators for [registering](/usage/processing-pipelines#custom-components) pipeline component factories and simple stateless component functions.                                               |
| [`Language.has_factory`](/api/language#has_factory)                                                                             | Check whether a component factory is registered on a language class.                                                                                                                              |
| [`Language.get_factory_meta`](/api/language#get_factory_meta) [`Language.get_pipe_meta`](/api/language#get_pipe_meta)           | Get the [`FactoryMeta`](/api/language#factorymeta) with component metadata for a factory or instance name.                                                                                        |
| [`Language.config`](/api/language#config)                                                                                       | The [config](/usage/training#config) used to create the current `nlp` object. An instance of [`Config`](https://thinc.ai/docs/api-config#config) that can be saved to disk and used for training. |
| [`Pipe.score`](/api/pipe#score)                                                                                                 | Method on trainable pipeline components that returns a dictionary of evaluation scores.                                                                                                           |
| [`registry`](/api/top-level#registry)                                                                                           | Function registry to map functions to string names that can be referenced in [configs](/usage/training#config).                                                                                  |
| [`util.load_meta`](/api/top-level#util.load_meta) [`util.load_config`](/api/top-level#util.load_config)                         | Updated helpers for loading a model's [`meta.json`](/api/data-formats#meta) and [`config.cfg`](/api/data-formats#config).                                                                        |
| [`init config`](/api/cli#init-config) [`init fill-config`](/api/cli#init-fill-config) [`debug config`](/api/cli#debug-config)   | CLI commands for initializing, auto-filling and debugging [training configs](/usage/training).                                                                                                   |
| [`project`](/api/cli#project)                                                                                                   | Suite of CLI commands for cloning, running and managing [spaCy projects](/usage/projects).                                                                                                        |
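A short sketch of two of the new `Language` methods in action, assuming an installed `en_core_web_sm` package:

```python
import spacy

nlp = spacy.load("en_core_web_sm")

# Disable the parser and NER for this block only; the previous state
# is restored when the block exits
with nlp.select_pipes(disable=["parser", "ner"]):
    doc = nlp("This is a sentence.")

# Print an overview of the components and the annotations they
# assign and require
nlp.analyze_pipes(pretty=True)
```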
## Backwards Incompatibilities {#incompat}

As always, we've tried to keep the breaking changes to a minimum and focus on changes that were necessary to support the new features, fix problems or improve usability. The following section lists the relevant changes to the user-facing API. For specific examples of how to rewrite your code, check out the [migration guide](#migrating). Note that spaCy v3.0 now requires **Python 3.6+**.

### API changes {#incompat-api}

- Model symlinks, the `link` command and shortcut names are now deprecated. There can be many [different models](/models) and not just one "English model", so you should always use the full model name like [`en_core_web_sm`](/models/en) explicitly.
- A model's [`meta.json`](/api/data-formats#meta) is now only used to provide meta information like the model name, author, license and labels. It's **not** used to construct the processing pipeline anymore. This is all defined in the [`config.cfg`](/api/data-formats#config), which also includes all settings used to train the model.
- The [`train`](/api/cli#train) and [`pretrain`](/api/cli#pretrain) commands now only take a `config.cfg` file containing the full [training config](/usage/training#config).
- [`Language.add_pipe`](/api/language#add_pipe) now takes the **string name** of the component factory instead of the component function.
- **Custom pipeline components** now need to be decorated with the [`@Language.component`](/api/language#component) or [`@Language.factory`](/api/language#factory) decorator.
- [`Language.update`](/api/language#update) now takes a batch of [`Example`](/api/example) objects instead of raw texts and annotations, or `Doc` and `GoldParse` objects.
- The `Language.disable_pipes` context manager has been replaced by [`Language.select_pipes`](/api/language#select_pipes), which can explicitly disable or enable components.
- The [`Language.update`](/api/language#update), [`Language.evaluate`](/api/language#evaluate) and [`Pipe.update`](/api/pipe#update) methods now all take batches of [`Example`](/api/example) objects instead of `Doc` and `GoldParse` objects, or raw text and a dictionary of annotations. [`Language.begin_training`](/api/language#begin_training) and [`Pipe.begin_training`](/api/pipe#begin_training) now take a function that returns a sequence of `Example` objects to initialize the model instead of a list of tuples.
- [`Matcher.add`](/api/matcher#add), [`PhraseMatcher.add`](/api/phrasematcher#add) and [`DependencyMatcher.add`](/api/dependencymatcher#add) now only accept a list of patterns as the second argument (instead of a variable number of arguments). The `on_match` callback becomes an optional keyword argument.

### Removed or renamed API {#incompat-removed}

| Removed                                                 | Replacement                                                                               |
| ------------------------------------------------------- | ------------------------------------------------------------------------------------------ |
| `Language.disable_pipes`                                | [`Language.select_pipes`](/api/language#select_pipes)                                     |
| `GoldParse`                                             | [`Example`](/api/example)                                                                 |
| `GoldCorpus`                                            | [`Corpus`](/api/corpus)                                                                   |
| `KnowledgeBase.load_bulk`, `KnowledgeBase.dump`         | [`KnowledgeBase.from_disk`](/api/kb#from_disk), [`KnowledgeBase.to_disk`](/api/kb#to_disk) |
| `spacy debug-data`                                      | [`spacy debug data`](/api/cli#debug-data)                                                 |
| `spacy profile`                                         | [`spacy debug profile`](/api/cli#debug-profile)                                           |
| `spacy link`, `util.set_data_path`, `util.get_data_path` | not needed, model symlinks are deprecated                                                 |

The following deprecated methods, attributes and arguments were removed in v3.0. Most of them have been **deprecated for a while** and many would previously raise errors. Many of them were also mostly internals. If you've been working with more recent versions of spaCy v2.x, it's **unlikely** that your code relied on them.

| Removed                                                                                                                   | Replacement                                                                                                                                                  |
| --------------------------------------------------------------------------------------------------------------------------- | -------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `Doc.tokens_from_list`                                                                                                    | [`Doc.__init__`](/api/doc#init)                                                                                                                              |
| `Doc.merge`, `Span.merge`                                                                                                 | [`Doc.retokenize`](/api/doc#retokenize)                                                                                                                      |
| `Token.string`, `Span.string`, `Span.upper`, `Span.lower`                                                                 | [`Span.text`](/api/span#attributes), [`Token.text`](/api/token#attributes)                                                                                   |
| `Language.tagger`, `Language.parser`, `Language.entity`                                                                   | [`Language.get_pipe`](/api/language#get_pipe)                                                                                                                |
| keyword arguments like `vocab=False` on `to_disk`, `from_disk`, `to_bytes`, `from_bytes`                                  | `exclude=["vocab"]`                                                                                                                                          |
| `n_threads` argument on [`Tokenizer`](/api/tokenizer), [`Matcher`](/api/matcher), [`PhraseMatcher`](/api/phrasematcher)   | `n_process`                                                                                                                                                  |
| `verbose` argument on [`Language.evaluate`](/api/language#evaluate)                                                       | logging                                                                                                                                                      |
| `SentenceSegmenter` hook, `SimilarityHook`                                                                                | [user hooks](/usage/processing-pipelines#custom-components-user-hooks), [`Sentencizer`](/api/sentencizer), [`SentenceRecognizer`](/api/sentencerecognizer)   |
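For example, the removed serialization keyword arguments map directly onto the new `exclude` argument (the path shown is a placeholder):

```diff
- nlp.to_disk("/path/to/model", vocab=False)
+ nlp.to_disk("/path/to/model", exclude=["vocab"])
```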
## Migrating from v2.x {#migrating}

### Downloading and loading models {#migrating-downloading-models}

Model symlinks and shortcuts like `en` are now officially deprecated. There are [many different models](/models) with different capabilities and not just one "English model". In order to download and load a model, you should always use its full name – for instance, [`en_core_web_sm`](/models/en#en_core_web_sm).

```diff
- python -m spacy download en
+ python -m spacy download en_core_web_sm
```

```diff
- nlp = spacy.load("en")
+ nlp = spacy.load("en_core_web_sm")
```

### Custom pipeline components and factories {#migrating-pipeline-components}

Custom pipeline components now have to be registered explicitly using the [`@Language.component`](/api/language#component) or [`@Language.factory`](/api/language#factory) decorator. For simple functions that take a `Doc` and return it, all you have to do is add the `@Language.component` decorator to it and assign it a name:

```diff
### Stateless function components
+ from spacy.language import Language

+ @Language.component("my_component")
def my_component(doc):
    return doc
```

For class components that are initialized with settings and/or the shared `nlp` object, you can use the `@Language.factory` decorator. Also make sure that the method used to initialize the factory has **two named arguments**: `nlp` (the current `nlp` object) and `name` (the string name of the component instance).

```diff
### Stateful class components
+ from spacy.language import Language

+ @Language.factory("my_component")
class MyComponent:
-   def __init__(self, nlp):
+   def __init__(self, nlp, name):
        self.nlp = nlp

    def __call__(self, doc):
        return doc
```

Instead of decorating your class, you could also add a factory function that takes the arguments `nlp` and `name` and returns an instance of your component:

```diff
### Stateful class components with factory function
+ from spacy.language import Language

+ @Language.factory("my_component")
+ def create_my_component(nlp, name):
+     return MyComponent(nlp)

class MyComponent:
    def __init__(self, nlp):
        self.nlp = nlp

    def __call__(self, doc):
        return doc
```

The `@Language.component` and `@Language.factory` decorators now take care of adding an entry to the component factories, so spaCy knows how to load a component back in from its string name. You won't have to write to `Language.factories` manually anymore.

```diff
- Language.factories["my_component"] = lambda nlp, **cfg: MyComponent(nlp)
```

#### Adding components to the pipeline {#migrating-add-pipe}

The [`nlp.add_pipe`](/api/language#add_pipe) method now takes the **string name** of the component factory instead of a callable component. This allows spaCy to track and serialize components that have been added and their settings.

```diff
+ @Language.component("my_component")
def my_component(doc):
    return doc

- nlp.add_pipe(my_component)
+ nlp.add_pipe("my_component")
```

[`nlp.add_pipe`](/api/language#add_pipe) now also returns the pipeline component itself, so you can access its attributes. The [`nlp.create_pipe`](/api/language#create_pipe) method is now mostly internals and you typically shouldn't have to use it in your code.

```diff
- parser = nlp.create_pipe("parser")
- nlp.add_pipe(parser)
+ parser = nlp.add_pipe("parser")
```
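Putting the pieces together, here's a minimal end-to-end sketch (the component name `debug_lengths` is made up for illustration):

```python
import spacy
from spacy.language import Language

@Language.component("debug_lengths")
def debug_lengths(doc):
    # A stateless component: report the token count and pass the Doc along
    print("Tokens:", len(doc))
    return doc

nlp = spacy.blank("en")
nlp.add_pipe("debug_lengths")
doc = nlp("This pipeline now includes a custom component.")
```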
If you need to add a component from an existing pretrained model, you can now use the `source` argument on [`nlp.add_pipe`](/api/language#add_pipe). This will check that the component is compatible, and take care of porting over all config. During training, you can also reference existing pretrained components in your [config](/usage/training#config-components) and decide whether or not they should be updated with more data.

> #### config.cfg (excerpt)
>
> ```ini
> [components.ner]
> source = "en_core_web_sm"
> component = "ner"
> ```

```diff
source_nlp = spacy.load("en_core_web_sm")
nlp = spacy.blank("en")
- ner = source_nlp.get_pipe("ner")
- nlp.add_pipe(ner)
+ nlp.add_pipe("ner", source=source_nlp)
```

### Adding match patterns {#migrating-matcher}

The [`Matcher.add`](/api/matcher#add), [`PhraseMatcher.add`](/api/phrasematcher#add) and [`DependencyMatcher.add`](/api/dependencymatcher#add) methods now only accept a **list of patterns** as the second argument (instead of a variable number of arguments). The `on_match` callback becomes an optional keyword argument.

```diff
matcher = Matcher(nlp.vocab)
patterns = [[{"TEXT": "Google"}, {"TEXT": "Now"}], [{"TEXT": "GoogleNow"}]]
- matcher.add("GoogleNow", on_match, *patterns)
+ matcher.add("GoogleNow", patterns, on_match=on_match)
```

```diff
matcher = PhraseMatcher(nlp.vocab)
patterns = [nlp("health care reform"), nlp("healthcare reform")]
- matcher.add("HEALTH", on_match, *patterns)
+ matcher.add("HEALTH", patterns, on_match=on_match)
```

### Training models {#migrating-training}

To train your models, you should now pretty much always use the [`spacy train`](/api/cli#train) CLI. You shouldn't have to put together your own training scripts anymore, unless you _really_ want to. The training commands now use a [flexible config file](/usage/training#config) that describes all training settings and hyperparameters, as well as your pipeline, model components and architectures to use. The `--code` argument lets you pass in code containing [custom registered functions](/usage/training#custom-code) that you can reference in your config. To get started, check out the [quickstart widget](/usage/training#quickstart).

#### Binary .spacy training data format {#migrating-training-format}

spaCy v3.0 uses a new [binary training data format](/api/data-formats#binary-training) created by serializing a [`DocBin`](/api/docbin), which represents a collection of `Doc` objects. This means that you can train spaCy models using the same format it outputs: annotated `Doc` objects. The binary format is extremely **efficient in storage**, especially when packing multiple documents together. You can convert your existing JSON-formatted data using the [`spacy convert`](/api/cli#convert) command, which outputs `.spacy` files:

```cli
$ python -m spacy convert ./training.json ./output
```

#### Training config {#migrating-training-config}

The easiest way to get started with a training config is to use the [`init config`](/api/cli#init-config) command or the [quickstart widget](/usage/training#quickstart). You can define your requirements, and it will auto-generate a starter config with the best-matching default settings.

```cli
$ python -m spacy init config ./config.cfg --lang en --pipeline tagger,parser
```

If you've exported a starter config from our [quickstart widget](/usage/training#quickstart), you can use the [`init fill-config`](/api/cli#init-fill-config) command to fill it with all default values.
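For example, assuming the exported starter config is saved as `base_config.cfg`, the following call would write a fully filled-out `config.cfg`:

```cli
$ python -m spacy init fill-config ./base_config.cfg ./config.cfg
```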
You can then use the auto-generated `config.cfg` for training:

```diff
### {wrap="true"}
- python -m spacy train en ./output ./train.json ./dev.json --pipeline tagger,parser --cnn-window 1 --bilstm-depth 0
+ python -m spacy train ./config.cfg --output ./output
```

The easiest way to get started with an end-to-end training process is to clone a [project](/usage/projects) template. Projects let you manage multi-step workflows, from data preprocessing to training and packaging your model.

#### Training via the Python API {#migrating-training-python}

For most use cases, you **shouldn't** have to write your own training scripts anymore. Instead, you can use [`spacy train`](/api/cli#train) with a [config file](/usage/training#config) and custom [registered functions](/usage/training#custom-code) if needed. You can even register callbacks that can modify the `nlp` object at different stages of its lifecycle to fully customize it before training.

If you do decide to use the [internal training API](/usage/training#api) from Python, you should only need a few small modifications to convert your scripts from spaCy v2.x to v3.x. The [`Example.from_dict`](/api/example#from_dict) classmethod takes a reference `Doc` and a [dictionary of annotations](/api/data-formats#dict-input), similar to the "simple training style" in spaCy v2.x:

```diff
### Migrating Doc and GoldParse
doc = nlp.make_doc("Mark Zuckerberg is the CEO of Facebook")
entities = [(0, 15, "PERSON"), (30, 38, "ORG")]
- gold = GoldParse(doc, entities=entities)
+ example = Example.from_dict(doc, {"entities": entities})
```

```diff
### Migrating simple training style
text = "Mark Zuckerberg is the CEO of Facebook"
annotations = {"entities": [(0, 15, "PERSON"), (30, 38, "ORG")]}
+ doc = nlp.make_doc(text)
+ example = Example.from_dict(doc, annotations)
```

The [`Language.update`](/api/language#update), [`Language.evaluate`](/api/language#evaluate) and [`Pipe.update`](/api/pipe#update) methods now all take batches of [`Example`](/api/example) objects instead of `Doc` and `GoldParse` objects, or raw text and a dictionary of annotations.

```python
### Training loop {highlight="16"}
import random
from spacy.training import Example
from spacy.util import minibatch

TRAIN_DATA = [
    ("Who is Shaka Khan?", {"entities": [(7, 17, "PERSON")]}),
    ("I like London.", {"entities": [(7, 13, "LOC")]}),
]
# Assumes an existing nlp object with a trainable component like "ner"
nlp.begin_training()
for i in range(20):
    random.shuffle(TRAIN_DATA)
    for batch in minibatch(TRAIN_DATA):
        examples = []
        for text, annots in batch:
            examples.append(Example.from_dict(nlp.make_doc(text), annots))
        nlp.update(examples)
```

[`Language.begin_training`](/api/language#begin_training) and [`Pipe.begin_training`](/api/pipe#begin_training) now take a function that returns a sequence of `Example` objects to initialize the model instead of a list of tuples. The data examples are used to **initialize the models** of trainable pipeline components, which includes validating the network, [inferring missing shapes](https://thinc.ai/docs/usage-models#validation) and setting up the label scheme.

```diff
- nlp.begin_training(examples)
+ nlp.begin_training(lambda: examples)
```
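Evaluation follows the same pattern. A minimal sketch, reusing the `TRAIN_DATA` from the loop above as stand-in development data:

```python
# Build Example objects from annotated data, just like for nlp.update
examples = [
    Example.from_dict(nlp.make_doc(text), annots)
    for text, annots in TRAIN_DATA
]
scores = nlp.evaluate(examples)
print(scores)  # a dictionary of evaluation scores
```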
#### Packaging models {#migrating-training-packaging}

The [`spacy package`](/api/cli#package) command now automatically builds the installable `.tar.gz` sdist of the Python package, so you don't have to run this step manually anymore. You can disable the behavior by setting the `--no-sdist` flag.

```diff
python -m spacy package ./model ./packages
- cd /output/en_model-0.0.0
- python setup.py sdist
```

#### Migration notes for plugin maintainers {#migrating-plugins}

Thanks to everyone who's been contributing to the spaCy ecosystem by developing and maintaining one of the many awesome [plugins and extensions](/universe). We've tried to make it as easy as possible for you to upgrade your packages for spaCy v3. The most common use case for plugins is providing pipeline components and extension attributes. When migrating your plugin, double-check the following:

- Use the [`@Language.factory`](/api/language#factory) decorator to register your component and assign it a name. This allows users to refer to your components by name and serialize pipelines referencing them. Remove any manual entries added to `Language.factories`.
- Make sure your component factories take at least two **named arguments**: `nlp` (the current `nlp` object) and `name` (the instance name of the added component so you can identify multiple instances of the same component).
- Update all references to [`nlp.add_pipe`](/api/language#add_pipe) in your docs to use **string names** instead of the component functions.

```python
### {highlight="1-5"}
from spacy.language import Language

@Language.factory("my_component", default_config={"some_setting": False})
def create_component(nlp: Language, name: str, some_setting: bool):
    return MyCoolComponent(some_setting=some_setting)

class MyCoolComponent:
    def __init__(self, some_setting):
        self.some_setting = some_setting

    def __call__(self, doc):
        # Do something to the doc
        return doc
```

> #### Result in config.cfg
>
> ```ini
> [components.my_component]
> factory = "my_component"
> some_setting = true
> ```

```diff
import spacy
from your_plugin import MyCoolComponent

nlp = spacy.load("en_core_web_sm")
- component = MyCoolComponent(some_setting=True)
- nlp.add_pipe(component)
+ nlp.add_pipe("my_component", config={"some_setting": True})
```

The [`@Language.factory`](/api/language#factory) decorator takes care of letting spaCy know that a component of that name is available. This means that your users can add it to the pipeline using its **string name**. However, this requires the decorator to be executed – so users will still have to **import your plugin**. Alternatively, your plugin could expose an [entry point](/usage/saving-loading#entry-points), which spaCy can read from. This means that spaCy knows how to initialize `my_component`, even if your package isn't imported.
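To sketch the entry-point route (package, module and function names below are placeholders): your plugin's `setup.py` would advertise the factory under the `spacy_factories` entry point group, so spaCy can discover it from the installed package:

```python
from setuptools import setup

setup(
    name="your-plugin",
    entry_points={
        # spaCy scans the "spacy_factories" group to discover component
        # factories provided by installed packages
        "spacy_factories": ["my_component = your_plugin:create_component"]
    },
)
```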