---
title: What's New in v3.0
teaser: New features, backwards incompatibilities and migration guide
menu:
  - ['Summary', 'summary']
  - ['New Features', 'features']
  - ['Backwards Incompatibilities', 'incompat']
  - ['Migrating from v2.x', 'migrating']
  - ['Migrating plugins', 'plugins']
---

## Summary {#summary}

## New Features {#features}

## Backwards Incompatibilities {#incompat}

### Removed or renamed objects, methods, attributes and arguments {#incompat-removed}

| Removed                                                  | Replacement                               |
| -------------------------------------------------------- | ----------------------------------------- |
| `GoldParse`                                              | [`Example`](/api/example)                 |
| `GoldCorpus`                                             | [`Corpus`](/api/corpus)                   |
| `spacy debug-data`                                       | [`spacy debug data`](/api/cli#debug-data) |
| `spacy link`, `util.set_data_path`, `util.get_data_path` | not needed, model symlinks are deprecated |

### Removed deprecated methods, attributes and arguments {#incompat-removed-deprecated}

The following deprecated methods, attributes and arguments were removed in v3.0.
Most of them have been **deprecated for a while** and many would previously
raise errors. Many of them were also mostly internals. If you've been working
with more recent versions of spaCy v2.x, it's **unlikely** that your code relied
on them.

| Removed | Replacement |
| ------- | ----------- |
| `Doc.tokens_from_list` | [`Doc.__init__`](/api/doc#init) |
| `Doc.merge`, `Span.merge` | [`Doc.retokenize`](/api/doc#retokenize) |
| `Token.string`, `Span.string`, `Span.upper`, `Span.lower` | [`Span.text`](/api/span#attributes), [`Token.text`](/api/token#attributes) |
| `Language.tagger`, `Language.parser`, `Language.entity` | [`Language.get_pipe`](/api/language#get_pipe) |
| keyword arguments like `vocab=False` on `to_disk`, `from_disk`, `to_bytes`, `from_bytes` | `exclude=["vocab"]` |
| `n_threads` argument on [`Tokenizer`](/api/tokenizer), [`Matcher`](/api/matcher), [`PhraseMatcher`](/api/phrasematcher) | `n_process` |
| `SentenceSegmenter` hook, `SimilarityHook` | [user hooks](/usage/processing-pipelines#custom-components-user-hooks), [`Sentencizer`](/api/sentencizer), [`SentenceRecognizer`](/api/sentencerecognizer) |

## Migrating from v2.x {#migrating}

### Downloading and loading models {#migrating-downloading-models}

Model symlinks and shortcuts like `en` are now officially deprecated. There are
[many different models](/models) with different capabilities and not just one
"English model". To download and load a model, you should always use its full
name – for instance, `en_core_web_sm`.

```diff
- python -m spacy download en
+ python -m spacy download en_core_web_sm
```

```diff
- nlp = spacy.load("en")
+ nlp = spacy.load("en_core_web_sm")
```

### Custom pipeline components and factories {#migrating-pipeline-components}

Custom pipeline components now have to be registered explicitly using the
[`@Language.component`](/api/language#component) or
[`@Language.factory`](/api/language#factory) decorator.
For simple functions that take a `Doc` and return it, all you have to do is add
the `@Language.component` decorator to the function and assign it a name:

```diff
### Stateless function components
+ from spacy.language import Language

+ @Language.component("my_component")
def my_component(doc):
    return doc
```

For class components that are initialized with settings and/or the shared `nlp`
object, you can use the `@Language.factory` decorator. Also make sure that the
method used to initialize the factory has **two named arguments**: `nlp` (the
current `nlp` object) and `name` (the string name of the component instance).

```diff
### Stateful class components
+ from spacy.language import Language

+ @Language.factory("my_component")
class MyComponent:
-   def __init__(self, nlp):
+   def __init__(self, nlp, name):
        self.nlp = nlp

    def __call__(self, doc):
        return doc
```

Instead of decorating your class, you could also add a factory function that
takes the arguments `nlp` and `name` and returns an instance of your component:

```diff
### Stateful class components with factory function
+ from spacy.language import Language

+ @Language.factory("my_component")
+ def create_my_component(nlp, name):
+     return MyComponent(nlp)

class MyComponent:
    def __init__(self, nlp):
        self.nlp = nlp

    def __call__(self, doc):
        return doc
```

The `@Language.component` and `@Language.factory` decorators now take care of
adding an entry to the component factories, so spaCy knows how to load a
component back in from its string name. You no longer have to write to
`Language.factories` manually.

```diff
- Language.factories["my_component"] = lambda nlp, **cfg: MyComponent(nlp)
```

#### Adding components to the pipeline {#migrating-add-pipe}

The [`nlp.add_pipe`](/api/language#add_pipe) method now takes the **string
name** of the component factory instead of a callable component. This allows
spaCy to track and serialize components that have been added and their settings.

```diff
+ @Language.component("my_component")
def my_component(doc):
    return doc

- nlp.add_pipe(my_component)
+ nlp.add_pipe("my_component")
```

[`nlp.add_pipe`](/api/language#add_pipe) now also returns the pipeline component
itself, so you can access its attributes. The
[`nlp.create_pipe`](/api/language#create_pipe) method is now mostly internal and
you typically shouldn't have to use it in your code.

```diff
- parser = nlp.create_pipe("parser")
- nlp.add_pipe(parser)
+ parser = nlp.add_pipe("parser")
```

### Training models {#migrating-training}

To train your models, you should now almost always use the
[`spacy train`](/api/cli#train) CLI. You shouldn't have to put together your own
training scripts anymore, unless you _really_ want to. The training command now
uses a [flexible config file](/usage/training#config) that describes all
training settings and hyperparameters, as well as your pipeline, model
components and the architectures to use. The `--code` argument lets you pass in
code containing [custom registered functions](/usage/training#custom-code) that
you can reference in your config.

#### Binary .spacy training data format {#migrating-training-format}

spaCy now uses a new
[binary training data format](/api/data-formats#binary-training), which is much
smaller and consists of `Doc` objects, serialized via the
[`DocBin`](/api/docbin). You can convert your existing JSON-formatted data using
the [`spacy convert`](/api/cli#convert) command, which outputs `.spacy` files:

```bash
$ python -m spacy convert ./training.json ./output
```

#### Training config {#migrating-training-config}

```diff
### {wrap="true"}
- python -m spacy train en ./output ./train.json ./dev.json --pipeline tagger,parser --cnn-window 1 --bilstm-depth 0
+ python -m spacy train ./train.spacy ./dev.spacy ./config.cfg --output ./output
```

The easiest way to get started with an end-to-end training process is to clone a
[project](/usage/projects) template.
Projects let you manage multi-step workflows, from data preprocessing to
training and packaging your model.

#### Migrating training scripts to CLI command and config {#migrating-training-scripts}

#### Training via the Python API {#migrating-training-python}

#### Packaging models {#migrating-training-packaging}

The [`spacy package`](/api/cli#package) command now automatically builds the
installable `.tar.gz` sdist of the Python package, so you don't have to run this
step manually anymore. You can disable this behavior by setting the `--no-sdist`
flag.

```diff
python -m spacy package ./model ./packages
- cd /output/en_model-0.0.0
- python setup.py sdist
```

## Migration notes for plugin maintainers {#plugins}

Thanks to everyone who's been contributing to the spaCy ecosystem by developing
and maintaining one of the many awesome [plugins and extensions](/universe).
We've tried to keep breaking changes to a minimum and make it as easy as
possible for you to upgrade your packages for spaCy v3.

### Custom pipeline components

The most common use case for plugins is providing pipeline components and
extension attributes.

- Use the [`@Language.factory`](/api/language#factory) decorator to register
  your component and assign it a name. This allows users to refer to your
  components by name and serialize pipelines referencing them. Remove all manual
  entries in `Language.factories`.
- Make sure your component factories take at least two **named arguments**:
  `nlp` (the current `nlp` object) and `name` (the instance name of the added
  component, so you can identify multiple instances of the same component).
- Update all references to [`nlp.add_pipe`](/api/language#add_pipe) in your docs
  to use **string names** instead of the component functions.

```python
### {highlight="1-5"}
from spacy.language import Language

@Language.factory("my_component", default_config={"some_setting": False})
def create_component(nlp: Language, name: str, some_setting: bool):
    return MyCoolComponent(some_setting=some_setting)


class MyCoolComponent:
    def __init__(self, some_setting):
        self.some_setting = some_setting

    def __call__(self, doc):
        # Do something to the doc
        return doc
```

> #### Result in config.cfg
>
> ```ini
> [components.my_component]
> factory = "my_component"
> some_setting = true
> ```

```diff
import spacy
from your_plugin import MyCoolComponent

nlp = spacy.load("en_core_web_sm")
- component = MyCoolComponent(some_setting=True)
- nlp.add_pipe(component)
+ nlp.add_pipe("my_component", config={"some_setting": True})
```

The [`@Language.factory`](/api/language#factory) decorator takes care of letting
spaCy know that a component of that name is available. This means that your
users can add it to the pipeline using its **string name**. However, this
requires the decorator to be executed – so users will still have to **import
your plugin**. Alternatively, your plugin could expose an
[entry point](/usage/saving-loading#entry-points), which spaCy can read from.
This means that spaCy knows how to initialize `my_component`, even if your
package isn't imported.
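
If you go the entry point route, it can be useful to check that your installed package actually advertises its factory. The following is a minimal sketch using only the standard library, not part of spaCy's API. It assumes your plugin declares its factory under the `spacy_factories` entry point group, and the helper name `list_factory_entry_points` and the declaration shown in the comment are hypothetical.

```python
from importlib.metadata import entry_points

# Assumption: the plugin's packaging metadata declares something like
#   my_component = your_plugin:create_component
# under the "spacy_factories" entry point group.


def list_factory_entry_points(group="spacy_factories"):
    """Return a dict mapping entry point names to their targets for a group."""
    eps = entry_points()
    if hasattr(eps, "select"):
        # Newer Python versions expose a selectable collection
        selected = eps.select(group=group)
    else:
        # Older Python versions return a plain dict keyed by group name
        selected = eps.get(group, [])
    return {ep.name: ep.value for ep in selected}


if __name__ == "__main__":
    # Prints whatever factories the installed packages advertise, if any
    print(list_factory_entry_points())
```

If `my_component` shows up in the returned dict after installing your package, the entry point metadata is in place. An empty dict usually means the package was not reinstalled after the entry point was added.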