spaCy/v3.md at 8038b87f04e3def6b8d23cd398b89ceb2446649f

mirror of https://github.com/explosion/spaCy.git synced 2024-12-28 19:06:33 +03:00

Matthew Honnibal e559867605

2020-08-23 18:32:09 +02:00


title: What's New in v3.0
teaser: New features, backwards incompatibilities and migration guide

New Features

Manage end-to-end workflows with projects

Example

# Clone a project template
$ python -m spacy project clone example
$ cd example
# Download data assets
$ python -m spacy project assets
# Run a workflow
$ python -m spacy project run train

spaCy projects let you manage and share end-to-end spaCy workflows for different use cases and domains, and orchestrate training, packaging and serving your custom models. You can start off by cloning a pre-defined project template, adjust it to fit your needs, load in your data, train a model, export it as a Python package, upload your outputs to remote storage and share your results with your team.

spaCy projects also make it easy to integrate with other tools in the data science and machine learning ecosystem, including DVC for data version control, Prodigy for creating labelled data, Streamlit for building interactive apps, FastAPI for serving models in production, Ray for parallel training, Weights & Biases for experiment tracking, and more!

Usage: spaCy projects, Training models
CLI: project, train
Templates: projects

New built-in pipeline components

spaCy v3.0 includes several new trainable and rule-based components that you can add to your pipeline and customize for your use case:

Example

import spacy

nlp = spacy.blank("en")
nlp.add_pipe("lemmatizer")

Name	Description
`SentenceRecognizer`	Trainable component for sentence segmentation.
`Morphologizer`	Trainable component to predict morphological features.
`Lemmatizer`	Standalone component for rule-based and lookup lemmatization.
`AttributeRuler`	Component for setting token attributes using match patterns.
`Transformer`	Component for using transformer models in your pipeline, accessing outputs and aligning tokens. Provided via `spacy-transformers`.

Usage: Processing pipelines
API: Built-in pipeline components
Implementation: spacy/pipeline

New and improved pipeline component APIs

Example

@Language.component("my_component")
def my_component(doc):
    return doc

nlp.add_pipe("my_component")
nlp.add_pipe("ner", source=other_nlp)
nlp.analyze_pipes(pretty=True)

Defining, configuring, reusing, training and analyzing pipeline components is now easier and more convenient. The @Language.component and @Language.factory decorators let you register your component and define its default configuration and metadata, like the attribute values it assigns and requires. Any custom component can be included during training, and sourcing components from existing pretrained models lets you mix and match custom pipelines. The nlp.analyze_pipes method outputs structured information about the current pipeline and its components, including the attributes they assign, the scores they compute during training and whether any required attributes aren't set.

Usage: Custom components, Defining components for training
API: @Language.component, @Language.factory, Language.add_pipe, Language.analyze_pipes
Implementation: spacy/language.py

Type hints and type-based data validation

Example

from spacy.language import Language
from pydantic import StrictBool

@Language.factory("my_component")
def create_my_component(
    nlp: Language,
    name: str,
    custom: StrictBool
):
    ...

spaCy v3.0 officially drops support for Python 2 and now requires Python 3.6+. This also means that the code base can take full advantage of type hints. spaCy's user-facing API that's implemented in pure Python (as opposed to Cython) now comes with type hints. The new version of spaCy's machine learning library Thinc also features extensive type support, including custom types for models and arrays, and a custom mypy plugin that can be used to type-check model definitions.

For data validation, spaCy v3.0 adopts pydantic. It also powers the data validation of Thinc's config system, which lets you register custom functions with typed arguments, reference them in your config and see validation errors if the argument values don't match.

Usage: Component type hints and validation, Training with custom code
Thinc: Type checking in Thinc, Thinc's config system
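
To illustrate the idea of type-based validation, here is a minimal standard-library sketch. It is not pydantic or Thinc code: the `validate_args` helper and the `create_my_component` function are hypothetical, and the check only handles plain classes such as `str` and `bool`, not generic types. The helper compares config values against a function's type hints and reports mismatches, roughly the behavior described above.

```python
import inspect
from typing import Any, Dict, List, get_type_hints


def validate_args(func, config: Dict[str, Any]) -> List[str]:
    """Return readable errors for config values that don't match func's type hints."""
    hints = get_type_hints(func)
    hints.pop("return", None)
    params = inspect.signature(func).parameters
    errors = []
    for name, param in params.items():
        if name not in config:
            # Only arguments without a default are required.
            if param.default is inspect.Parameter.empty:
                errors.append(f"{name}: missing required value")
            continue
        expected = hints.get(name)
        value = config[name]
        # Strict comparison: bool is a subclass of int, so compare exact types.
        if expected is not None and type(value) is not expected:
            errors.append(
                f"{name}: expected {expected.__name__}, got {type(value).__name__}"
            )
    for name in config:
        if name not in params:
            errors.append(f"{name}: unexpected argument")
    return errors


# Hypothetical factory function with typed arguments.
def create_my_component(name: str, custom: bool):
    ...


print(validate_args(create_my_component, {"name": "my_component", "custom": True}))
print(validate_args(create_my_component, {"name": "my_component", "custom": 1}))
```

The second call reports a mismatch because `1` is an `int`, not a `bool`, which is the kind of strictness that `StrictBool` expresses in the example above.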

New methods, attributes and commands

The following methods, attributes and commands are new in spaCy v3.0.

Name	Description
`Token.lex`	Access a token's `Lexeme`.
`Token.morph` `Token.morph_`	Access a token's morphological analysis.
`Language.select_pipes`	Context manager for enabling or disabling specific pipeline components for a block.
`Language.analyze_pipes`	Analyze components and their interdependencies.
`Language.resume_training`	Experimental: continue training a pretrained model and initialize "rehearsal" for components that implement a `rehearse` method to prevent catastrophic forgetting.
`@Language.factory` `@Language.component`	Decorators for registering pipeline component factories and simple stateless component functions.
`Language.has_factory`	Check whether a component factory is registered on a language class.
`Language.get_factory_meta` `Language.get_pipe_meta`	Get the `FactoryMeta` with component metadata for a factory or instance name.
`Language.config`	The config used to create the current `nlp` object. It is an instance of `Config` and can be saved to disk and used for training.
`Pipe.score`	Method on trainable pipeline components that returns a dictionary of evaluation scores.
`registry`	Function registry to map functions to string names that can be referenced in configs.
`util.load_meta` `util.load_config`	Updated helpers for loading a model's `meta.json` and `config.cfg`.
`util.get_installed_models`	Names of all models installed in the environment.
`init config` `init fill-config` `debug config`	CLI commands for initializing, auto-filling and debugging training configs.
`project`	Suite of CLI commands for cloning, running and managing spaCy projects.

New and updated documentation

To help you get started with spaCy v3.0 and the new features, we've added several new or rewritten documentation pages, including a new usage guide on embeddings, transformers and transfer learning, a guide on training models rewritten from scratch, a page explaining the new spaCy projects and updated usage documentation on custom pipeline components. We've also added a bunch of new illustrations and new API reference pages documenting spaCy's machine learning model architectures and the expected data formats. API pages about pipeline components now include more information, like the default config and implementation, and we've adopted a more detailed format for documenting argument and return types.

Usage: Embeddings & Transformers, Training models, Layers & Architectures, Projects, Custom pipeline components, Custom tokenizers
API Reference: Library architecture, Model architectures, Data formats
New Classes: Example, Tok2Vec, Transformer, Lemmatizer, Morphologizer, AttributeRuler, SentenceRecognizer, Pipe, Corpus

Backwards Incompatibilities

As always, we've tried to keep the breaking changes to a minimum and focus on changes that were necessary to support the new features, fix problems or improve usability. The following section lists the relevant changes to the user-facing API. For specific examples of how to rewrite your code, check out the migration guide.

Note that spaCy v3.0 now requires Python 3.6+.

API changes

Model symlinks, the link command and shortcut names are now deprecated. There can be many different models and not just one "English model", so you should always use the full model name like en_core_web_sm explicitly.
A model's meta.json is now only used to provide meta information like the model name, author, license and labels. It's not used to construct the processing pipeline anymore. This is all defined in the config.cfg, which also includes all settings used to train the model.
The train and pretrain commands now only take a config.cfg file containing the full training config.
Language.add_pipe now takes the string name of the component factory instead of the component function.
Custom pipeline components now need to be decorated with the @Language.component or @Language.factory decorator.
Language.update now takes a batch of Example objects instead of raw texts and annotations, or Doc and GoldParse objects.
The Language.disable_pipes context manager has been replaced by Language.select_pipes, which can explicitly disable or enable components.
The Language.update, Language.evaluate and Pipe.update methods now all take batches of Example objects instead of Doc and GoldParse objects, or raw text and a dictionary of annotations. Language.begin_training and Pipe.begin_training now take a function that returns a sequence of Example objects to initialize the model instead of a list of tuples.
Matcher.add, PhraseMatcher.add and DependencyMatcher.add now only accept a list of patterns as the second argument (instead of a variable number of arguments). The on_match callback becomes an optional keyword argument.

Removed or renamed API

Removed	Replacement
`Language.disable_pipes`	`Language.select_pipes`
`GoldParse`	`Example`
`GoldCorpus`	`Corpus`
`KnowledgeBase.load_bulk`, `KnowledgeBase.dump`	`KnowledgeBase.from_disk`, `KnowledgeBase.to_disk`
`spacy init-model`	`spacy init model`
`spacy debug-data`	`spacy debug data`
`spacy profile`	`spacy debug profile`
`spacy link`, `util.set_data_path`, `util.get_data_path`	not needed, model symlinks are deprecated

The following deprecated methods, attributes and arguments were removed in v3.0. Most of them have been deprecated for a while and many would previously raise errors. Many were also mostly internal. If you've been working with more recent versions of spaCy v2.x, it's unlikely that your code relied on them.

Removed	Replacement
`Doc.tokens_from_list`	`Doc.__init__`
`Doc.merge`, `Span.merge`	`Doc.retokenize`
`Token.string`, `Span.string`, `Span.upper`, `Span.lower`	`Span.text`, `Token.text`
`Language.tagger`, `Language.parser`, `Language.entity`	`Language.get_pipe`
keyword-arguments like `vocab=False` on `to_disk`, `from_disk`, `to_bytes`, `from_bytes`	`exclude=["vocab"]`
`n_threads` argument on `Tokenizer`, `Matcher`, `PhraseMatcher`	`n_process`
`verbose` argument on `Language.evaluate`	logging (`DEBUG`)
`SentenceSegmenter` hook, `SimilarityHook`	user hooks, `Sentencizer`, `SentenceRecognizer`

Migrating from v2.x

Downloading and loading models

Model symlinks and shortcuts like en are now officially deprecated. There are many different models with different capabilities and not just one "English model". In order to download and load a model, you should always use its full name – for instance, en_core_web_sm.

- python -m spacy download en
+ python -m spacy download en_core_web_sm

- nlp = spacy.load("en")
+ nlp = spacy.load("en_core_web_sm")

Custom pipeline components and factories

Custom pipeline components now have to be registered explicitly using the @Language.component or @Language.factory decorator. For simple functions that take a Doc and return it, all you have to do is add the @Language.component decorator to it and assign it a name:

### Stateless function components
+ from spacy.language import Language

+ @Language.component("my_component")
def my_component(doc):
    return doc

For class components that are initialized with settings and/or the shared nlp object, you can use the @Language.factory decorator. Also make sure that the method used to initialize the factory has two named arguments: nlp (the current nlp object) and name (the string name of the component instance).

### Stateful class components
+ from spacy.language import Language

+ @Language.factory("my_component")
class MyComponent:
-   def __init__(self, nlp):
+   def __init__(self, nlp, name):
        self.nlp = nlp

    def __call__(self, doc):
        return doc

Instead of decorating your class, you could also add a factory function that takes the arguments nlp and name and returns an instance of your component:

### Stateful class components with factory function
+ from spacy.language import Language

+ @Language.factory("my_component")
+ def create_my_component(nlp, name):
+     return MyComponent(nlp)

class MyComponent:
    def __init__(self, nlp):
        self.nlp = nlp

    def __call__(self, doc):
        return doc

The @Language.component and @Language.factory decorators now take care of adding an entry to the component factories, so spaCy knows how to load a component back in from its string name. You won't have to write to Language.factories manually anymore.

- Language.factories["my_component"] = lambda nlp, **cfg: MyComponent(nlp)
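
To make the registration mechanism more concrete, here is a toy standard-library sketch of a decorator that records factories under string names. It is not spaCy's implementation: `MiniLanguage` and its methods are hypothetical stand-ins that only mimic the behavior described above, where a factory receives `nlp` and `name` and can later be looked up by its string name.

```python
from typing import Callable, Dict, Optional


class MiniLanguage:
    """Toy stand-in for a pipeline object; not spaCy's Language class."""

    factories: Dict[str, Callable] = {}

    def __init__(self):
        self.pipeline = []  # list of (name, component) tuples

    @classmethod
    def factory(cls, name: str) -> Callable:
        """Decorator that records a factory function under a string name."""
        def register(func: Callable) -> Callable:
            cls.factories[name] = func
            return func
        return register

    def add_pipe(self, factory_name: str, name: Optional[str] = None):
        """Look up the factory by string name, build the component and return it."""
        if factory_name not in self.factories:
            raise ValueError(f"No factory registered for '{factory_name}'")
        name = name or factory_name
        component = self.factories[factory_name](self, name)
        self.pipeline.append((name, component))
        return component


class MyComponent:
    def __init__(self, nlp, name):
        self.nlp = nlp
        self.name = name

    def __call__(self, doc):
        return doc


@MiniLanguage.factory("my_component")
def create_my_component(nlp, name):
    return MyComponent(nlp, name)


nlp = MiniLanguage()
component = nlp.add_pipe("my_component")
print([name for name, _ in nlp.pipeline])
```

Because the pipeline only stores names and the registry maps names back to factories, a pipeline built this way can be described by a list of strings, which is what makes serializing and reloading it straightforward.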

Adding components to the pipeline

The nlp.add_pipe method now takes the string name of the component factory instead of a callable component. This allows spaCy to track and serialize components that have been added and their settings.

+ @Language.component("my_component")
def my_component(doc):
    return doc

- nlp.add_pipe(my_component)
+ nlp.add_pipe("my_component")

nlp.add_pipe now also returns the pipeline component itself, so you can access its attributes. The nlp.create_pipe method is now mostly internal and you typically shouldn't have to use it in your code.

- parser = nlp.create_pipe("parser")
- nlp.add_pipe(parser)
+ parser = nlp.add_pipe("parser")

If you need to add a component from an existing pretrained model, you can now use the source argument on nlp.add_pipe. This will check that the component is compatible, and take care of porting over all config. During training, you can also reference existing pretrained components in your config and decide whether or not they should be updated with more data.

config.cfg (excerpt)

[components.ner]
source = "en_core_web_sm"
component = "ner"

source_nlp = spacy.load("en_core_web_sm")
nlp = spacy.blank("en")
- ner = source_nlp.get_pipe("ner")
- nlp.add_pipe(ner)
+ nlp.add_pipe("ner", source=source_nlp)
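
The config.cfg excerpt above uses dotted section names and quoted values. As a rough illustration of how such a file maps to nested settings, the sketch below reads it with the standard library. This is an approximation under my own assumptions (INI-style sections, values parsed as JSON) and not the Thinc config loader: it skips variable interpolation and validation, and `parse_config` is a hypothetical helper. The second section is taken from the plugin example later in this guide.

```python
import json
from configparser import ConfigParser

CONFIG_TEXT = """
[components.ner]
source = "en_core_web_sm"
component = "ner"

[components.my_component]
factory = "my_component"
some_setting = true
"""


def parse_config(text: str) -> dict:
    """Parse INI-style text into nested dicts, splitting section names on dots."""
    parser = ConfigParser(interpolation=None)
    parser.optionxform = str  # keep keys case-sensitive
    parser.read_string(text)
    result: dict = {}
    for section in parser.sections():
        node = result
        for part in section.split("."):
            node = node.setdefault(part, {})
        for key, raw in parser.items(section):
            # Assumption: each value is valid JSON ("str", true, 1, ...).
            node[key] = json.loads(raw)
    return result


config = parse_config(CONFIG_TEXT)
print(config["components"]["ner"]["source"])
print(config["components"]["my_component"]["some_setting"])
```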

Adding match patterns

The Matcher.add, PhraseMatcher.add and DependencyMatcher.add methods now only accept a list of patterns as the second argument (instead of a variable number of arguments). The on_match callback becomes an optional keyword argument.

matcher = Matcher(nlp.vocab)
patterns = [[{"TEXT": "Google"}, {"TEXT": "Now"}], [{"TEXT": "GoogleNow"}]]
- matcher.add("GoogleNow", on_match, *patterns)
+ matcher.add("GoogleNow", patterns, on_match=on_match)

matcher = PhraseMatcher(nlp.vocab)
patterns = [nlp("health care reform"), nlp("healthcare reform")]
- matcher.add("HEALTH", on_match, *patterns)
+ matcher.add("HEALTH", patterns, on_match=on_match)
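
If you have many v2-style calls to update, the argument reshuffling shown above can be sketched as a small pure-Python helper. `upgrade_add_args` is a hypothetical name, not part of spaCy; it only rearranges arguments and never touches a matcher.

```python
from typing import Any, Callable, Dict, List, Optional, Tuple


def upgrade_add_args(
    key: str, on_match: Optional[Callable], *patterns: Any
) -> Tuple[Tuple[str, List[Any]], Dict[str, Any]]:
    """Turn v2-style add(key, on_match, *patterns) arguments into
    v3-style positional and keyword arguments."""
    if not patterns:
        raise ValueError("At least one pattern is required")
    args = (key, list(patterns))
    # on_match is optional in v3, so only pass it along if it was set.
    kwargs = {"on_match": on_match} if on_match is not None else {}
    return args, kwargs


old_patterns = [[{"TEXT": "Google"}, {"TEXT": "Now"}], [{"TEXT": "GoogleNow"}]]
args, kwargs = upgrade_add_args("GoogleNow", None, *old_patterns)
print(args)
print(kwargs)
# With a real matcher, the v3 call would then be: matcher.add(*args, **kwargs)
```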

Training models

To train your models, you should now pretty much always use the spacy train CLI. You shouldn't have to put together your own training scripts anymore, unless you really want to. The training commands now use a flexible config file that describes all training settings and hyperparameters, as well as your pipeline, model components and architectures to use. The --code argument lets you pass in code containing custom registered functions that you can reference in your config. To get started, check out the quickstart widget.

Binary .spacy training data format

spaCy v3.0 uses a new binary training data format created by serializing a DocBin, which represents a collection of Doc objects. This means that you can train spaCy models using the same format it outputs: annotated Doc objects. The binary format is extremely efficient in storage, especially when packing multiple documents together. You can convert your existing JSON-formatted data using the spacy convert command, which outputs .spacy files:

$ python -m spacy convert ./training.json ./output

Training config

The easiest way to get started with a training config is to use the init config command or the quickstart widget. You can define your requirements, and it will auto-generate a starter config with the best-matching default settings.

$ python -m spacy init config ./config.cfg --lang en --pipeline tagger,parser

If you've exported a starter config from our quickstart widget, you can use the init fill-config command to fill it with all default values. You can then use the auto-generated config.cfg for training:

- python -m spacy train en ./output ./train.json ./dev.json --pipeline tagger,parser --cnn-window 1 --bilstm-depth 0
+ python -m spacy train ./config.cfg --output ./output

Training via the Python API

For most use cases, you shouldn't have to write your own training scripts anymore. Instead, you can use spacy train with a config file and custom registered functions if needed. You can even register callbacks that can modify the nlp object at different stages of its lifecycle to fully customize it before training.

If you do decide to use the internal training API from Python, you should only need a few small modifications to convert your scripts from spaCy v2.x to v3.x. The Example.from_dict classmethod takes a reference Doc and a dictionary of annotations, similar to the "simple training style" in spaCy v2.x:

### Migrating Doc and GoldParse
doc = nlp.make_doc("Mark Zuckerberg is the CEO of Facebook")
entities = [(0, 15, "PERSON"), (30, 38, "ORG")]
- gold = GoldParse(doc, entities=entities)
+ example = Example.from_dict(doc, {"entities": entities})

### Migrating simple training style
text = "Mark Zuckerberg is the CEO of Facebook"
annotations = {"entities": [(0, 15, "PERSON"), (30, 38, "ORG")]}
+ doc = nlp.make_doc(text)
+ example = Example.from_dict(doc, annotations)
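
The entity annotations in these snippets are (start, end, label) tuples of character offsets into the text. Since off-by-one offsets are an easy mistake when migrating data, a quick standard-library sanity check can help before building Example objects. `entity_texts` is a hypothetical helper, not a spaCy function.

```python
from typing import Dict, List, Tuple


def entity_texts(
    text: str, annotations: Dict[str, List[Tuple[int, int, str]]]
) -> List[Tuple[str, str]]:
    """Return (substring, label) for each entity, checking the offsets are in range."""
    result = []
    for start, end, label in annotations.get("entities", []):
        if not 0 <= start < end <= len(text):
            raise ValueError(f"Offsets ({start}, {end}) are outside the text")
        result.append((text[start:end], label))
    return result


text = "Mark Zuckerberg is the CEO of Facebook"
annotations = {"entities": [(0, 15, "PERSON"), (30, 38, "ORG")]}
print(entity_texts(text, annotations))
# [('Mark Zuckerberg', 'PERSON'), ('Facebook', 'ORG')]
```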

The Language.update, Language.evaluate and Pipe.update methods now all take batches of Example objects instead of Doc and GoldParse objects, or raw text and a dictionary of annotations.

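### Training loop
import random
from spacy.training import Example  # spacy.gold in early v3 alphas
from spacy.util import minibatch

# Assumes `nlp` is an existing pipeline with the components you want to train
TRAIN_DATA = [
    ("Who is Shaka Khan?", {"entities": [(7, 17, "PERSON")]}),
    ("I like London.", {"entities": [(7, 13, "LOC")]}),
]
nlp.begin_training()
for i in range(20):
    random.shuffle(TRAIN_DATA)
    # Pass the batch size explicitly
    for batch in minibatch(TRAIN_DATA, size=2):
        examples = []
        for text, annots in batch:
            # Convert each (text, annotations) pair into an Example
            examples.append(Example.from_dict(nlp.make_doc(text), annots))
        nlp.update(examples)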
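
For readers unfamiliar with the batching step in the loop above, here is a standard-library approximation of what a minibatch helper does. It is a sketch with a fixed batch size and not spaCy's `minibatch` implementation; `simple_minibatch` is a hypothetical name.

```python
import random
from itertools import islice
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def simple_minibatch(items: Iterable[T], size: int) -> Iterator[List[T]]:
    """Yield consecutive lists of up to `size` items until the input is exhausted."""
    if size < 1:
        raise ValueError("size must be at least 1")
    iterator = iter(items)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch


TRAIN_DATA = [
    ("Who is Shaka Khan?", {"entities": [(7, 17, "PERSON")]}),
    ("I like London.", {"entities": [(7, 13, "LOC")]}),
]
random.shuffle(TRAIN_DATA)
for batch in simple_minibatch(TRAIN_DATA, size=1):
    # Each batch is a list of (text, annotations) tuples.
    print([text for text, _ in batch])
```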

Language.begin_training and Pipe.begin_training now take a function that returns a sequence of Example objects to initialize the model instead of a list of tuples. The data examples are used to initialize the models of trainable pipeline components, which includes validating the network, inferring missing shapes and setting up the label scheme.

- nlp.begin_training(examples)
+ nlp.begin_training(lambda: examples)

Packaging models

The spacy package command now automatically builds the installable .tar.gz sdist of the Python package, so you don't have to run this step manually anymore. You can disable the behavior by setting the --no-sdist flag.

python -m spacy package ./model ./packages
- cd ./packages/en_model-0.0.0
- python setup.py sdist

Migration notes for plugin maintainers

Thanks to everyone who's been contributing to the spaCy ecosystem by developing and maintaining one of the many awesome plugins and extensions. We've tried to make it as easy as possible for you to upgrade your packages for spaCy v3. The most common use case for plugins is providing pipeline components and extension attributes. When migrating your plugin, double-check the following:

Use the @Language.factory decorator to register your component and assign it a name. This allows users to refer to your components by name and serialize pipelines referencing them. Remove all manual entries in Language.factories.
Make sure your component factories take at least two named arguments: nlp (the current nlp object) and name (the instance name of the added component so you can identify multiple instances of the same component).
Update all references to nlp.add_pipe in your docs to use string names instead of the component functions.

from spacy.language import Language

@Language.factory("my_component", default_config={"some_setting": False})
def create_component(nlp: Language, name: str, some_setting: bool):
    return MyCoolComponent(some_setting=some_setting)


class MyCoolComponent:
    def __init__(self, some_setting):
        self.some_setting = some_setting

    def __call__(self, doc):
        # Do something to the doc
        return doc

Result in config.cfg

[components.my_component]
factory = "my_component"
some_setting = true

import spacy
from your_plugin import MyCoolComponent

nlp = spacy.load("en_core_web_sm")
- component = MyCoolComponent(some_setting=True)
- nlp.add_pipe(component)
+ nlp.add_pipe("my_component", config={"some_setting": True})

The @Language.factory decorator takes care of letting spaCy know that a component of that name is available. This means that your users can add it to the pipeline using its string name. However, this requires the decorator to be executed – so users will still have to import your plugin. Alternatively, your plugin could expose an entry point, which spaCy can read from. This means that spaCy knows how to initialize my_component, even if your package isn't imported.
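
Entry points advertised by installed packages can be inspected with the standard library. The sketch below lists whatever is registered under a given entry point group. The group name `spacy_factories` is my assumption about which group spaCy reads component factories from, and `list_entry_points` is a hypothetical helper; check the spaCy documentation for the exact group names before relying on it.

```python
from importlib import metadata
from typing import Dict


def list_entry_points(group: str = "spacy_factories") -> Dict[str, str]:
    """Map entry point names to their 'module:attribute' targets for one group."""
    eps = metadata.entry_points()
    if hasattr(eps, "select"):
        # Python 3.10+: entry points support selection by group.
        selected = eps.select(group=group)
    else:
        # Python 3.8 / 3.9: a mapping of group name to entry points.
        selected = eps.get(group, [])
    return {ep.name: ep.value for ep in selected}


# Prints an empty dict if no installed package advertises this group.
print(list_entry_points())
```

A plugin would declare such an entry point in its packaging metadata, pointing the component name at the function or class that creates it.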
