---
title: What's New in v3.0
teaser: New features, backwards incompatibilities and migration guide
menu:
  - ['Summary', 'summary']
  - ['New Features', 'features']
  - ['Backwards Incompatibilities', 'incompat']
  - ['Migrating from v2.x', 'migrating']
  - ['Migrating plugins', 'plugins']
---

## Summary {#summary}

## New Features {#features}

## Backwards Incompatibilities {#incompat}

### Removed or renamed objects, methods, attributes and arguments {#incompat-removed}

| Removed                                                  | Replacement                               |
| -------------------------------------------------------- | ----------------------------------------- |
| `GoldParse`                                              | [`Example`](/api/example)                 |
| `GoldCorpus`                                             | [`Corpus`](/api/corpus)                   |
| `spacy debug-data`                                       | [`spacy debug data`](/api/cli#debug-data) |
| `spacy link`, `util.set_data_path`, `util.get_data_path` | not needed, model symlinks are deprecated |

### Removed deprecated methods, attributes and arguments {#incompat-removed-deprecated}

The following deprecated methods, attributes and arguments were removed in v3.0.
Most of them have been **deprecated for a while**, and many would already raise
errors. Many of them were also mostly internal. If you've been working with more
recent versions of spaCy v2.x, it's **unlikely** that your code relied on them.

| Removed                                                                                                                 | Replacement                                                                                                                                                |
| ----------------------------------------------------------------------------------------------------------------------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `Doc.tokens_from_list`                                                                                                  | [`Doc.__init__`](/api/doc#init)                                                                                                                            |
| `Doc.merge`, `Span.merge`                                                                                               | [`Doc.retokenize`](/api/doc#retokenize)                                                                                                                    |
| `Token.string`, `Span.string`, `Span.upper`, `Span.lower`                                                               | [`Span.text`](/api/span#attributes), [`Token.text`](/api/token#attributes)                                                                                 |
| `Language.tagger`, `Language.parser`, `Language.entity`                                                                 | [`Language.get_pipe`](/api/language#get_pipe)                                                                                                              |
| keyword-arguments like `vocab=False` on `to_disk`, `from_disk`, `to_bytes`, `from_bytes`                                | `exclude=["vocab"]`                                                                                                                                        |
| `n_threads` argument on [`Tokenizer`](/api/tokenizer), [`Matcher`](/api/matcher), [`PhraseMatcher`](/api/phrasematcher) | `n_process`                                                                                                                                                |
| `SentenceSegmenter` hook, `SimilarityHook`                                                                              | [user hooks](/usage/processing-pipelines#custom-components-user-hooks), [`Sentencizer`](/api/sentencizer), [`SentenceRecognizer`](/api/sentencerecognizer) |

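As a rough sketch of how a few of the replacements in the table above look in practice, the snippet below builds a `Doc` from a list of words, merges a span with the retokenizer, reads `Token.text` and passes `exclude` when serializing. It assumes a blank English pipeline created with `spacy.blank("en")`, and the example words are made up for illustration.

```python
import spacy
from spacy.tokens import Doc

# Assumption: a blank English pipeline, so no trained model is required.
nlp = spacy.blank("en")

# Doc.tokens_from_list -> pass the words to Doc.__init__
doc = Doc(nlp.vocab, words=["New", "York", "is", "big"])

# Doc.merge / Span.merge -> Doc.retokenize
with doc.retokenize() as retokenizer:
    retokenizer.merge(doc[0:2])

# Token.string / Span.string -> Token.text / Span.text (per the table above)
token_texts = [token.text for token in doc]

# vocab=False -> exclude=["vocab"]
nlp_bytes = nlp.to_bytes(exclude=["vocab"])
```
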
## Migrating from v2.x {#migrating}

### Downloading and loading models {#migrating-downloading-models}

Model symlinks and shortcuts like `en` are now officially deprecated. There are
[many different models](/models) with different capabilities and not just one
"English model". In order to download and load a model, you should always use
its full name – for instance, `en_core_web_sm`.

```diff
- python -m spacy download en
+ python -m spacy download en_core_web_sm
```

```diff
- nlp = spacy.load("en")
+ nlp = spacy.load("en_core_web_sm")
```

### Custom pipeline components and factories {#migrating-pipeline-components}

Custom pipeline components now have to be registered explicitly using the
[`@Language.component`](/api/language#component) or
[`@Language.factory`](/api/language#factory) decorator. For simple functions
that take a `Doc` and return it, all you have to do is add the
`@Language.component` decorator to the function and assign it a name:

```diff
### Stateless function components
+ from spacy.language import Language

+ @Language.component("my_component")
def my_component(doc):
    return doc
```

For class components that are initialized with settings and/or the shared `nlp`
object, you can use the `@Language.factory` decorator. Also make sure that the
method used to initialize the factory has **two named arguments**: `nlp` (the
current `nlp` object) and `name` (the string name of the component instance).

```diff
### Stateful class components
+ from spacy.language import Language

+ @Language.factory("my_component")
class MyComponent:
-   def __init__(self, nlp):
+   def __init__(self, nlp, name):
        self.nlp = nlp

    def __call__(self, doc):
        return doc
```

Instead of decorating your class, you could also add a factory function that
takes the arguments `nlp` and `name` and returns an instance of your component:

```diff
### Stateful class components with factory function
+ from spacy.language import Language

+ @Language.factory("my_component")
+ def create_my_component(nlp, name):
+     return MyComponent(nlp)

class MyComponent:
    def __init__(self, nlp):
        self.nlp = nlp

    def __call__(self, doc):
        return doc
```

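To illustrate why the factory receives a `name` argument, here is a minimal sketch that adds the same factory to a pipeline twice under different instance names. The factory name `token_counter`, the `TokenCounter` class and the `n_tokens` attribute are hypothetical and only serve the example, and a blank English pipeline is assumed.

```python
import spacy
from spacy.language import Language


# "token_counter" is a hypothetical factory name used for illustration.
@Language.factory("token_counter")
class TokenCounter:
    def __init__(self, nlp, name):
        self.nlp = nlp
        self.name = name  # instance name in the pipeline, e.g. "counter_a"
        self.n_tokens = 0

    def __call__(self, doc):
        # Keep a running total of the tokens this instance has seen.
        self.n_tokens += len(doc)
        return doc


nlp = spacy.blank("en")
# The same factory can back several instances, told apart by their names.
first = nlp.add_pipe("token_counter", name="counter_a")
second = nlp.add_pipe("token_counter", name="counter_b")
doc = nlp("This is a sentence")
```

Because each instance knows its own name, it can be told apart from other instances of the same factory in the pipeline.
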
The `@Language.component` and `@Language.factory` decorators now take care of
adding an entry to the component factories, so spaCy knows how to load a
component back in from its string name. You won't have to write to
`Language.factories` manually anymore.

```diff
- Language.factories["my_component"] = lambda nlp, **cfg: MyComponent(nlp)
```

#### Adding components to the pipeline {#migrating-add-pipe}

The [`nlp.add_pipe`](/api/language#add_pipe) method now takes the **string
name** of the component factory instead of a callable component. This allows
spaCy to track and serialize components that have been added and their settings.

```diff
+ @Language.component("my_component")
def my_component(doc):
    return doc

- nlp.add_pipe(my_component)
+ nlp.add_pipe("my_component")
```

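Put together, registering a function component and adding it by its string name could look like the following sketch. The component name `lowercase_flagger` and the `is_lowercase` key in `doc.user_data` are hypothetical, and a blank English pipeline is assumed so that no trained model is needed.

```python
import spacy
from spacy.language import Language


# "lowercase_flagger" and the user_data key are hypothetical names.
@Language.component("lowercase_flagger")
def lowercase_flagger(doc):
    # Record a simple fact about the text so the effect is visible.
    doc.user_data["is_lowercase"] = doc.text.islower()
    return doc


nlp = spacy.blank("en")
# Pass the registered string name, not the function itself.
nlp.add_pipe("lowercase_flagger")
doc = nlp("hello world")
```
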
[`nlp.add_pipe`](/api/language#add_pipe) now also returns the pipeline component
itself, so you can access its attributes. The
[`nlp.create_pipe`](/api/language#create_pipe) method is now mostly used
internally, and you typically shouldn't have to call it in your code.

```diff
- parser = nlp.create_pipe("parser")
- nlp.add_pipe(parser)
+ parser = nlp.add_pipe("parser")
```

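One place where the returned component is handy is configuring it right after adding it. The sketch below uses the built-in `entity_ruler` factory on a blank English pipeline; the pattern and the example text are made up for illustration.

```python
import spacy

nlp = spacy.blank("en")

# add_pipe returns the component instance, so it can be configured directly.
ruler = nlp.add_pipe("entity_ruler")
ruler.add_patterns([{"label": "ORG", "pattern": "Explosion"}])

doc = nlp("Explosion makes spaCy")
entities = [(ent.text, ent.label_) for ent in doc.ents]
```
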
### Training models {#migrating-training}

To train your models, you should now pretty much always use the
[`spacy train`](/api/cli#train) CLI. You shouldn't have to put together your own
training scripts anymore, unless you _really_ want to. The training commands now
use a [flexible config file](/usage/training#config) that describes all training
settings and hyperparameters, as well as your pipeline, model components and
architectures to use. The `--code` argument lets you pass in code containing
[custom registered functions](/usage/training#custom-code) that you can
reference in your config.

#### Binary .spacy training data format {#migrating-training-format}

spaCy now uses a new
[binary training data format](/api/data-formats#binary-training), which is much
smaller and consists of `Doc` objects, serialized via the
[`DocBin`](/api/docbin). You can convert your existing JSON-formatted data using
the [`spacy convert`](/api/cli#convert) command, which outputs `.spacy` files:

```bash
$ python -m spacy convert ./training.json ./output
```

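If your annotations live in Python rather than in JSON files, a `.spacy` file can also be written directly with `DocBin`. The following is a minimal sketch, assuming a blank English pipeline, a made-up example text with one `ORG` entity and a temporary output path; it is not meant as the recommended conversion workflow.

```python
import tempfile
from pathlib import Path

import spacy
from spacy.tokens import DocBin

nlp = spacy.blank("en")

# Hypothetical annotated example: one text with a single ORG entity.
doc = nlp.make_doc("Explosion makes spaCy")
doc.ents = [doc.char_span(0, 9, label="ORG")]

# Collect the annotated Doc objects and serialize them to a .spacy file.
doc_bin = DocBin(docs=[doc])

with tempfile.TemporaryDirectory() as tmp_dir:
    path = Path(tmp_dir) / "train.spacy"
    doc_bin.to_disk(path)
    # Read the file back to check that the annotations survived the round trip.
    loaded = list(DocBin().from_disk(path).get_docs(nlp.vocab))
```
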
#### Training config {#migrating-training-config}

<!-- TODO: update once we have recommended "getting started with a new config" workflow -->

```diff
### {wrap="true"}
- python -m spacy train en ./output ./train.json ./dev.json --pipeline tagger,parser --cnn-window 1 --bilstm-depth 0
+ python -m spacy train ./config.cfg --output ./output
```

<Project id="some_example_project">

The easiest way to get started with an end-to-end training process is to clone a
[project](/usage/projects) template. Projects let you manage multi-step
workflows, from data preprocessing to training and packaging your model.

</Project>

#### Migrating training scripts to CLI command and config {#migrating-training-scripts}

<!-- TODO: write -->

#### Training via the Python API {#migrating-training-python}

<!-- TODO: this should explain the GoldParse -> Example stuff -->

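As a tentative illustration of the `GoldParse` to `Example` change listed under the removed objects, the sketch below creates an `Example` from a predicted `Doc` and a dictionary of gold-standard annotations. It assumes the `spacy.training` module of the released v3 API and a blank English pipeline; the text and the entity offsets are made up.

```python
import spacy
from spacy.training import Example

nlp = spacy.blank("en")

# The predicted Doc is just the tokenized text, without any annotations.
predicted = nlp.make_doc("Explosion makes spaCy")

# Gold-standard annotations, here entities as (start_char, end_char, label).
annotations = {"entities": [(0, 9, "ORG")]}

# An Example holds the predicted Doc and a reference Doc with the annotations.
example = Example.from_dict(predicted, annotations)
reference_ents = [(ent.text, ent.label_) for ent in example.reference.ents]
```
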
#### Packaging models {#migrating-training-packaging}

The [`spacy package`](/api/cli#package) command now automatically builds the
installable `.tar.gz` sdist of the Python package, so you don't have to run this
step manually anymore. You can disable the behavior by setting the `--no-sdist`
flag.

```diff
python -m spacy package ./model ./packages
- cd /output/en_model-0.0.0
- python setup.py sdist
```

## Migration notes for plugin maintainers {#plugins}

Thanks to everyone who's been contributing to the spaCy ecosystem by developing
and maintaining one of the many awesome [plugins and extensions](/universe).
We've tried to keep breaking changes to a minimum and make it as easy as
possible for you to upgrade your packages for spaCy v3.

### Custom pipeline components

The most common use case for plugins is providing pipeline components and
extension attributes.

- Use the [`@Language.factory`](/api/language#factory) decorator to register
  your component and assign it a name. This allows users to refer to your
  components by name and serialize pipelines referencing them. Remove all manual
  entries in `Language.factories`.
- Make sure your component factories take at least two **named arguments**:
  `nlp` (the current `nlp` object) and `name` (the instance name of the added
  component so you can identify multiple instances of the same component).
- Update all references to [`nlp.add_pipe`](/api/language#add_pipe) in your docs
  to use **string names** instead of the component functions.

```python
### {highlight="1-5"}
from spacy.language import Language

@Language.factory("my_component", default_config={"some_setting": False})
def create_component(nlp: Language, name: str, some_setting: bool):
    return MyCoolComponent(some_setting=some_setting)


class MyCoolComponent:
    def __init__(self, some_setting):
        self.some_setting = some_setting

    def __call__(self, doc):
        # Do something to the doc
        return doc
```

> #### Result in config.cfg
>
> ```ini
> [components.my_component]
> factory = "my_component"
> some_setting = true
> ```

```diff
import spacy
from your_plugin import MyCoolComponent

nlp = spacy.load("en_core_web_sm")
- component = MyCoolComponent(some_setting=True)
- nlp.add_pipe(component)
+ nlp.add_pipe("my_component", config={"some_setting": True})
```

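To see how `default_config` and the `config` argument interact, here is a small self-contained sketch. The names `setting_component`, `SettingComponent` and `create_setting_component` are hypothetical stand-ins for a plugin's own factory, and a blank English pipeline is assumed instead of a trained model.

```python
import spacy
from spacy.language import Language


class SettingComponent:
    def __init__(self, some_setting):
        self.some_setting = some_setting

    def __call__(self, doc):
        return doc


# Hypothetical factory name; default_config supplies the fallback setting.
@Language.factory("setting_component", default_config={"some_setting": False})
def create_setting_component(nlp: Language, name: str, some_setting: bool):
    return SettingComponent(some_setting=some_setting)


nlp = spacy.blank("en")
# Without a config, the value from default_config is used.
default = nlp.add_pipe("setting_component", name="with_default")
# Values passed via config override the defaults for this instance only.
custom = nlp.add_pipe(
    "setting_component", name="with_override", config={"some_setting": True}
)
```
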
<Infobox title="Important note on registering factories" variant="warning">

The [`@Language.factory`](/api/language#factory) decorator takes care of letting
spaCy know that a component of that name is available. This means that your
users can add it to the pipeline using its **string name**. However, this
requires the decorator to be executed – so users will still have to **import
your plugin**. Alternatively, your plugin could expose an
[entry point](/usage/saving-loading#entry-points), which spaCy can read from.
This means that spaCy knows how to initialize `my_component`, even if your
package isn't imported.

</Infobox>

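For the entry point route, a plugin's `setup.py` might contain something along these lines. This is only a sketch of a packaging fragment: it assumes the package is called `your_plugin`, that it exposes the `create_component` factory function from the example above, and that the factory is advertised under the `spacy_factories` entry point group.

```python
from setuptools import setup

setup(
    name="your_plugin",  # assumed package name, matching the import above
    entry_points={
        # "<factory name> = <module>:<factory function>"
        "spacy_factories": ["my_component = your_plugin:create_component"],
    },
)
```

Once the package is installed, the entry point gives spaCy a way to find the factory for `my_component` without an explicit `import your_plugin`.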