mirror of
https://github.com/explosion/spaCy.git
synced 2025-01-26 17:24:41 +03:00
554df9ef20
---
title: Projects
version: 3
menu:
  - ['Intro & Workflow', 'intro']
  - ['Directory & Assets', 'directory']
  - ['Custom Projects', 'custom']
  - ['Remote Storage', 'remote']
  - ['Integrations', 'integrations']
---

## Introduction and workflow {id="intro",hidden="true"}

> #### 🪐 Project templates
>
> Our [`projects`](https://github.com/explosion/projects) repo includes various
> project templates for different NLP tasks, models, workflows and integrations
> that you can clone and run. The easiest way to get started is to pick a
> template, clone it and start modifying it!

spaCy projects let you manage and share **end-to-end spaCy workflows** for
different **use cases and domains**, and orchestrate training, packaging and
serving your custom pipelines. You can start off by cloning a pre-defined
project template, adjust it to fit your needs, load in your data, train a
pipeline, export it as a Python package, upload your outputs to a remote storage
and share your results with your team. spaCy projects can be used via the new
[`spacy project`](/api/cli#project) command and we provide templates in our
[`projects`](https://github.com/explosion/projects) repo.

![Illustration of project workflow and commands](/images/projects.svg)

<Project id="pipelines/tagger_parser_ud">

The easiest way to get started is to clone a project template and run it – for
example, this end-to-end template that lets you train a **part-of-speech
tagger** and **dependency parser** on a Universal Dependencies treebank.

</Project>

spaCy projects make it easy to integrate with many other **awesome tools** in
the data science and machine learning ecosystem to track and manage your data
and experiments, iterate on demos and prototypes and ship your models into
production.

<Grid narrow cols={3}>
  <Integration title="DVC" logo="dvc" url="#dvc">
    Manage and version your data
  </Integration>
  <Integration title="Prodigy" logo="prodigy" url="#prodigy">
    Create labelled training data
  </Integration>
  <Integration title="Streamlit" logo="streamlit" url="#streamlit">
    Visualize and demo your pipelines
  </Integration>
  <Integration title="FastAPI" logo="fastapi" url="#fastapi">
    Serve your models and host APIs
  </Integration>
  <Integration title="Ray" logo="ray" url="#ray">
    Distributed and parallel training
  </Integration>
  <Integration title="Weights & Biases" logo="wandb" url="#wandb">
    Track your experiments and results
  </Integration>
  <Integration
    title="Hugging Face Hub"
    logo="huggingface_hub"
    url="#huggingface_hub"
  >
    Upload your pipelines to the Hugging Face Hub
  </Integration>
</Grid>

### 1. Clone a project template {id="clone"}

> #### Cloning under the hood
>
> To clone a project, spaCy calls into `git` and uses the "sparse checkout"
> feature to only clone the relevant directory or directories.

The [`spacy project clone`](/api/cli#project-clone) command clones an existing
project template and copies the files to a local directory. You can then run the
project, e.g. to train a pipeline and edit the commands and scripts to build
fully custom workflows.

```bash
$ python -m spacy project clone pipelines/tagger_parser_ud
```

By default, the project will be cloned into the current working directory. You
can specify an optional second argument to define the output directory. The
`--repo` option lets you define a custom repo to clone from if you don't want to
use the spaCy [`projects`](https://github.com/explosion/projects) repo. You can
also use any private repo you have access to with Git.

### 2. Fetch the project assets {id="assets"}

> #### project.yml
>
> ```yaml
> assets:
>   - dest: 'assets/training.spacy'
>     url: 'https://example.com/data.spacy'
>     checksum: '63373dd656daa1fd3043ce166a59474c'
>   - dest: 'assets/development.spacy'
>     git:
>       repo: 'https://github.com/example/repo'
>       branch: 'master'
>       path: 'path/development.spacy'
>     checksum: '5113dc04e03f079525edd8df3f4f39e3'
> ```

Assets are data files your project needs – for example, the training and
evaluation data or pretrained vectors and embeddings to initialize your model
with. Each project template comes with a `project.yml` that defines the assets
to download and where to put them. The
[`spacy project assets`](/api/cli#project-assets) command will fetch the project
assets for you:

```bash
$ cd some_example_project
$ python -m spacy project assets
```

Asset URLs can use a number of different protocols: HTTP, HTTPS, FTP, SSH, and
even cloud storage such as GCS and S3. You can also fetch assets using Git, by
replacing the `url` string with a `git` block. spaCy will use Git's "sparse
checkout" feature to avoid downloading the whole repository.

Sometimes your project configuration may include large assets that you don't
necessarily want to download when you run `spacy project assets`. That's why
assets can be marked as [`extra`](#data-assets-url) – by default, these assets
are not downloaded. If they should be, run `spacy project assets --extra`.

### 3. Run a command {id="run"}

> #### project.yml
>
> ```yaml
> commands:
>   - name: preprocess
>     help: "Convert the input data to spaCy's format"
>     script:
>       - 'python -m spacy convert assets/train.conllu corpus/'
>       - 'python -m spacy convert assets/eval.conllu corpus/'
>     deps:
>       - 'assets/train.conllu'
>       - 'assets/eval.conllu'
>     outputs:
>       - 'corpus/train.spacy'
>       - 'corpus/eval.spacy'
> ```

Commands consist of one or more steps and can be run with
[`spacy project run`](/api/cli#project-run). The following will run the command
`preprocess` defined in the `project.yml`:

```bash
$ python -m spacy project run preprocess
```

Commands can define their expected [dependencies and outputs](#deps-outputs)
using the `deps` (files the commands require) and `outputs` (files the commands
create) keys. This allows your project to track changes and determine whether a
command needs to be re-run. For instance, if your input data changes, you want
to re-run the `preprocess` command. But if nothing changed, this step can be
skipped. You can also set `--force` to force re-running a command, or `--dry` to
perform a "dry run" and see what would happen (without actually running the
script).

Since spaCy v3.4.2, `spacy project run` checks your installed dependencies to
verify that your environment is properly set up and aligns with the project's
`requirements.txt`, if there is one. If missing or conflicting dependencies are
detected, a corresponding warning is displayed. If you'd like to disable the
dependency check, set `check_requirements: false` in your project's
`project.yml`.

### 4. Run a workflow {id="run-workfow"}

> #### project.yml
>
> ```yaml
> workflows:
>   all:
>     - preprocess
>     - train
>     - package
> ```

Workflows are a series of commands that are run in order and often depend on
each other. For instance, to generate a pipeline package, you might start by
converting your data, then run [`spacy train`](/api/cli#train) to train your
pipeline on the converted data and if that's successful, run
[`spacy package`](/api/cli#package) to turn the best trained artifact into an
installable Python package. The following command runs the workflow named `all`
defined in the `project.yml`, and executes the commands it specifies, in order:

```bash
$ python -m spacy project run all
```

Using the expected [dependencies and outputs](#deps-outputs) defined in the
commands, spaCy can determine whether to re-run a command (if its inputs or
outputs have changed) or whether to skip it. If you're looking to implement more
advanced data pipelines and track your changes in Git, check out the
[Data Version Control (DVC) integration](#dvc). The
[`spacy project dvc`](/api/cli#project-dvc) command generates a DVC config file
from a workflow defined in your `project.yml` so you can manage your spaCy
project as a DVC repo.
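
As a rough illustration of what running a workflow involves, the sketch below
executes each command's script lines in order and stops at the first failure.
It is not spaCy's implementation: the `run_workflow` helper and the in-memory
`commands` and `workflows` dictionaries are hypothetical stand-ins for a parsed
`project.yml`, and the change tracking described above is left out.

```python
import shlex
import subprocess
import sys


def run_workflow(workflows, commands, name):
    """Run every command of the workflow `name` in order (hypothetical helper)."""
    executed = []
    for command_name in workflows[name]:
        for line in commands[command_name]["script"]:
            # Split the line into arguments instead of handing it to a shell.
            args = shlex.split(line)
            # check=True raises CalledProcessError on a non-zero exit code,
            # so later commands are not run after a failure.
            subprocess.run(args, check=True)
        executed.append(command_name)
    return executed


# Stand-in for a parsed project.yml; the scripts only print a message.
# The interpreter path is quoted so that it survives shlex.split.
python = shlex.quote(sys.executable)
commands = {
    "preprocess": {"script": [f"{python} -c \"print('preprocess')\""]},
    "train": {"script": [f"{python} -c \"print('train')\""]},
}
workflows = {"all": ["preprocess", "train"]}

if __name__ == "__main__":
    print(run_workflow(workflows, commands, "all"))
```

Stopping at the first failing step mirrors the "if that's successful" wording
above: a later command such as `package` only makes sense once the earlier
steps have produced their outputs.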

### 5. Optional: Push to remote storage {id="push"}

> #### project.yml
>
> ```yaml
> remotes:
>   default: 's3://my-spacy-bucket'
>   local: '/mnt/scratch/cache'
> ```

After training a pipeline, you can optionally use the
[`spacy project push`](/api/cli#project-push) command to upload your outputs to
a remote storage, using protocols like [S3](https://aws.amazon.com/s3/),
[Google Cloud Storage](https://cloud.google.com/storage) or SSH. This can help
you **export** your pipeline packages, **share** work with your team, or **cache
results** to avoid repeating work.

```bash
$ python -m spacy project push
```

The `remotes` section in your `project.yml` lets you assign names to the
different storages. To download state from a remote storage, you can use the
[`spacy project pull`](/api/cli#project-pull) command. For more details, see the
docs on [remote storage](#remote).

## Project directory and assets {id="directory"}

### project.yml {id="project-yml"}

The `project.yml` defines the assets a project depends on, like datasets and
pretrained weights, as well as a series of commands that can be run separately
or as a workflow – for instance, to preprocess the data, convert it to spaCy's
format, train a pipeline, evaluate it and export metrics, package it and spin up
a quick web demo. It looks pretty similar to a config file used to define CI
pipelines.

> #### Tip: Multi-line YAML syntax for long values
>
> YAML has [multi-line syntax](https://yaml-multiline.info/) that can be helpful
> for readability with longer values such as project descriptions or commands
> that take several arguments.

```yaml
%%GITHUB_PROJECTS/pipelines/tagger_parser_ud/project.yml
```

> #### Tip: Overriding variables on the CLI
>
> If you want to override one or more variables on the CLI and are not already
> specifying a project directory, you need to add `.` as a placeholder:
>
> ```bash
> python -m spacy project run test . --vars.foo bar
> ```

> #### Tip: Environment Variables
>
> Commands in a project file are not executed in a shell, so they don't have
> direct access to environment variables. But you can insert environment
> variables using the `env` dictionary to make values available for
> interpolation, just like values in `vars`. Here's an example `env` dict that
> makes `$PATH` available as `ENV_PATH`:
>
> ```yaml
> env:
>   ENV_PATH: PATH
> ```
>
> This can be used in a project command like so:
>
> ```yaml
> - name: 'echo-path'
>   script:
>     - 'echo ${env.ENV_PATH}'
> ```

| Section | Description |
| --- | --- |
| `title` | An optional project title used in `--help` message and [auto-generated docs](#custom-docs). |
| `description` | An optional project description used in [auto-generated docs](#custom-docs). |
| `vars` | A dictionary of variables that can be referenced in paths, URLs and scripts and overridden on the CLI, just like [`config.cfg` variables](/usage/training#config-interpolation). For example, `${vars.name}` will use the value of the variable `name`. Variables need to be defined in the section `vars`, but can be a nested dict, so you're able to reference `${vars.model.name}`. |
| `env` | A dictionary of variables, mapped to the names of environment variables that will be read in when running the project. For example, `${env.name}` will use the value of the environment variable defined as `name`. |
| `directories` | An optional list of [directories](#project-files) that should be created in the project for assets, training outputs, metrics etc. spaCy will make sure that these directories always exist. |
| `assets` | A list of assets that can be fetched with the [`project assets`](/api/cli#project-assets) command. `url` defines a URL or local path, `dest` is the destination file relative to the project directory, and an optional `checksum` ensures that an error is raised if the file's checksum doesn't match. Instead of `url`, you can also provide a `git` block with the keys `repo`, `branch` and `path`, to download from a Git repo. |
| `workflows` | A dictionary of workflow names, mapped to a list of command names, to execute in order. Workflows can be run with the [`project run`](/api/cli#project-run) command. |
| `commands` | A list of named commands. A command can define an optional help message (shown in the CLI when the user adds `--help`) and the `script`, a list of commands to run. The `deps` and `outputs` let you define the files the command depends on and produces, respectively. This lets spaCy determine whether a command needs to be re-run because its dependencies or outputs changed. Commands can be run as part of a workflow, or separately with the [`project run`](/api/cli#project-run) command. |
| `spacy_version` | Optional spaCy version range like `>=3.0.0,<3.1.0` that the project is compatible with. An error is raised if the project is loaded with an incompatible version. |
| `check_requirements` <Tag variant="new">3.4.2</Tag> | A flag determining whether to verify that the installed dependencies align with the project's `requirements.txt`. Defaults to `true`. |
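
To make the `vars` and `env` rows more concrete, here is a minimal sketch of how
`${vars...}` and `${env...}` references could be resolved. It is a simplified
illustration and not spaCy's actual interpolation code: the `interpolate`
helper is hypothetical, and treating an unset environment variable as an empty
string is an assumption of this sketch.

```python
import os
import re


def interpolate(text, vars_, env_map, environ=os.environ):
    """Replace ${vars.x.y} and ${env.NAME} references in `text` (simplified sketch)."""
    # `env` maps a project-level name to the name of an environment variable.
    # Assumption: an unset environment variable resolves to an empty string.
    env_values = {key: environ.get(var_name, "") for key, var_name in env_map.items()}
    scopes = {"vars": vars_, "env": env_values}

    def lookup(match):
        value = scopes
        # Walk the dotted path, e.g. "vars.model.name" becomes
        # scopes["vars"]["model"]["name"].
        for part in match.group(1).split("."):
            value = value[part]
        return str(value)

    return re.sub(r"\$\{([\w.]+)\}", lookup, text)


# Hypothetical project settings and a fake environment for the demo.
project_vars = {"name": "ner_demo", "model": {"name": "en_core_web_sm"}}
project_env = {"ENV_PATH": "PATH"}
fake_environ = {"PATH": "/usr/bin"}

path = interpolate("training/${vars.name}/${vars.model.name}", project_vars, project_env, fake_environ)
script = interpolate("echo ${env.ENV_PATH}", project_vars, project_env, fake_environ)
print(path)    # training/ner_demo/en_core_web_sm
print(script)  # echo /usr/bin
```

Because `vars` can be nested, the lookup walks the dotted path one key at a
time, which is what makes a reference like `${vars.model.name}` possible.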

### Data assets {id="data-assets"}

Assets are any files that your project might need, like training and development
corpora or pretrained weights for initializing your model. Assets are defined in
the `assets` block of your `project.yml` and can be downloaded using the
[`project assets`](/api/cli#project-assets) command. Defining checksums lets you
verify that someone else running your project will use the same files you used.
Asset URLs can use a number of different **protocols**: HTTP, HTTPS, FTP, SSH,
and even **cloud storage** such as GCS and S3. You can also download assets from
a **Git repo** instead.
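
The example checksums in this section are 32 hexadecimal characters long, which
is the length of an MD5 digest. Assuming MD5 is what your spaCy version expects
(worth verifying before relying on it), a value for the `checksum` field could
be computed with a small helper like the hypothetical `file_md5` below.

```python
import hashlib
import tempfile
from pathlib import Path


def file_md5(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with Path(path).open("rb") as file_:
        # Read fixed-size chunks so large assets don't have to fit in memory.
        for chunk in iter(lambda: file_.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


# Demo on a throwaway file; in a project you would point this at the asset.
with tempfile.TemporaryDirectory() as tmp_dir:
    asset = Path(tmp_dir) / "training.spacy"
    asset.write_bytes(b"hello world")
    checksum = file_md5(asset)

print(checksum)
```

Reading in chunks is a design choice for large corpora and vector files; for
small files, hashing the whole content at once gives the same digest.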

#### Downloading from a URL or cloud storage {id="data-assets-url"}

Under the hood, spaCy uses the
[`smart-open`](https://github.com/RaRe-Technologies/smart_open) library so you
can use any protocol it supports. Note that you may need to install extra
dependencies to use certain protocols.

> #### project.yml
>
> ```yaml
> assets:
>   # Download from public HTTPS URL
>   - dest: 'assets/training.spacy'
>     url: 'https://example.com/data.spacy'
>     checksum: '63373dd656daa1fd3043ce166a59474c'
>   # Optional download from Google Cloud Storage bucket
>   - dest: 'assets/development.spacy'
>     extra: True
>     url: 'gs://your-bucket/corpora'
>     checksum: '5113dc04e03f079525edd8df3f4f39e3'
> ```

| Name | Description |
| --- | --- |
| `dest` | The destination path to save the downloaded asset to (relative to the project directory), including the file name. |
| `extra` | Optional flag determining whether this asset is downloaded only if `spacy project assets` is run with `--extra`. `False` by default. |
| `url` | The URL to download from, using the respective protocol. |
| `checksum` | Optional checksum of the file. If provided, it will be used to verify that the file matches and downloads will be skipped if a local file with the same checksum already exists. |
| `description` | Optional asset description, used in [auto-generated docs](#custom-docs). |

#### Downloading from a Git repo {id="data-assets-git"}

If a `git` block is provided, the asset is downloaded from the given Git
repository. You can download from any repo that you have access to. Under the
hood, this uses Git's "sparse checkout" feature, so you're only downloading the
files you need and not the whole repo.

> #### project.yml
>
> ```yaml
> assets:
>   - dest: 'assets/training.spacy'
>     git:
>       repo: 'https://github.com/example/repo'
>       branch: 'master'
>       path: 'path/training.spacy'
>     checksum: '63373dd656daa1fd3043ce166a59474c'
>     description: 'The training data (5000 examples)'
> ```

| Name | Description |
| --- | --- |
| `dest` | The destination path to save the downloaded asset to (relative to the project directory), including the file name. |
| `git` | `repo`: The URL of the repo to download from.<br />`path`: Path of the file or directory to download, relative to the repo root. "" specifies the root directory.<br />`branch`: The branch to download from. Defaults to `"master"`. |
| `checksum` | Optional checksum of the file. If provided, it will be used to verify that the file matches and downloads will be skipped if a local file with the same checksum already exists. |
| `description` | Optional asset description, used in [auto-generated docs](#custom-docs). |

#### Working with private assets {id="data-asets-private"}

> #### project.yml
>
> ```yaml
> assets:
>   - dest: 'assets/private_training_data.json'
>     checksum: '63373dd656daa1fd3043ce166a59474c'
>   - dest: 'assets/private_vectors.bin'
>     checksum: '5113dc04e03f079525edd8df3f4f39e3'
> ```

For many projects, the datasets and weights you're working with might be
company-internal and not available over the internet. In that case, you can
specify the destination paths and a checksum, and leave out the URL. When your
teammates clone and run your project, they can place the files in the respective
directory themselves. The [`project assets`](/api/cli#project-assets) command
will alert you about missing files and mismatched checksums, so you can ensure
that others are running your project with the same data.

### Dependencies and outputs {id="deps-outputs"}

Each command defined in the `project.yml` can optionally define a list of
dependencies and outputs. These are the files the command requires and creates.
For example, a command for training a pipeline may depend on a
[`config.cfg`](/usage/training#config) and the training and evaluation data, and
it will export a directory `model-best`, which you can then re-use in other
commands.

{/* prettier-ignore */}
```yaml {title="project.yml"}
commands:
  - name: train
    help: 'Train a spaCy pipeline using the specified corpus and config'
    script:
      - 'python -m spacy train ./configs/config.cfg -o training/ --paths.train ./corpus/training.spacy --paths.dev ./corpus/evaluation.spacy'
    deps:
      - 'configs/config.cfg'
      - 'corpus/training.spacy'
      - 'corpus/evaluation.spacy'
    outputs:
      - 'training/model-best'
```

> #### Re-running vs. skipping
>
> Under the hood, spaCy uses a `project.lock` lockfile that stores the details
> for each command, as well as its dependencies and outputs and their checksums.
> It's updated on each run. If any of this information changes, the command will
> be re-run. Otherwise, it will be skipped.

If you're running a command and it depends on files that are missing, spaCy will
show you an error. If a command defines dependencies and outputs that haven't
changed since the last run, the command will be skipped. This means that you're
only re-running commands if they need to be re-run. Commands can also set
`no_skip: true` if they should never be skipped – for example commands that run
tests. Commands without outputs are also never skipped. To force re-running a
command or workflow, even if nothing changed, you can set the `--force` flag.
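
The skipping rules above can be sketched in a few lines. This is a simplified
model and not spaCy's implementation: the helpers `checksum`, `command_state`
and `needs_rerun` are hypothetical, the in-memory `lock` dictionary only
loosely mirrors what `project.lock` stores, and missing files are hashed as
`None` instead of raising the error spaCy would show for a missing dependency.

```python
import hashlib
import tempfile
from pathlib import Path


def checksum(path):
    """MD5 hex digest of a file, or None if it is missing (files only in this sketch)."""
    path = Path(path)
    if not path.is_file():
        return None
    return hashlib.md5(path.read_bytes()).hexdigest()


def command_state(command, root):
    """Collect what a lock entry records: the script plus dep and output checksums."""
    return {
        "script": command["script"],
        "deps": {dep: checksum(root / dep) for dep in command.get("deps", [])},
        "outputs": {out: checksum(root / out) for out in command.get("outputs", [])},
    }


def needs_rerun(command, root, lock, force=False):
    """Decide whether `command` has to run again, following the rules above."""
    if force or command.get("no_skip", False):
        return True
    if not command.get("outputs"):
        return True  # commands without outputs are never skipped
    if command["name"] not in lock:
        return True  # no record of a previous run
    return command_state(command, root) != lock[command["name"]]


with tempfile.TemporaryDirectory() as tmp_dir:
    root = Path(tmp_dir)
    (root / "train.conllu").write_text("raw data v1")
    (root / "train.spacy").write_text("converted data")
    command = {
        "name": "preprocess",
        "script": ["python -m spacy convert train.conllu ."],
        "deps": ["train.conllu"],
        "outputs": ["train.spacy"],
    }
    # Record the state as if the command had just been run.
    lock = {"preprocess": command_state(command, root)}
    unchanged = needs_rerun(command, root, lock)
    # Changing a dependency invalidates the recorded state.
    (root / "train.conllu").write_text("raw data v2")
    changed = needs_rerun(command, root, lock)
    forced = needs_rerun(command, root, lock, force=True)

print(unchanged, changed, forced)
```

Comparing the whole recorded state, including the script itself, means that
editing a command in `project.yml` also triggers a re-run in this model.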

Note that [`spacy project`](/api/cli#project) doesn't compile any dependency
graphs based on the dependencies and outputs, and won't re-run previous steps
automatically. For instance, if you only run the command `train` that depends on
data created by `preprocess` and those files are missing, spaCy will show an
error – it won't just re-run `preprocess`. If you're looking for more advanced
data management, check out the [Data Version Control (DVC) integration](#dvc).
If you're planning on integrating your spaCy project with DVC, you can also use
`outputs_no_cache` instead of `outputs` to define outputs that won't be cached
or tracked.

### Files and directory structure {id="project-files"}

The `project.yml` can define a list of `directories` that should be created
within a project – for instance, `assets`, `training`, `corpus` and so on. spaCy
will make sure that these directories are always available, so your commands can
write to and read from them. Project directories will also include all files and
directories copied from the project template with
[`spacy project clone`](/api/cli#project-clone). Here's an example of a project
directory:

> #### project.yml
>
> {/* prettier-ignore */}
> ```yaml
> directories: ['assets', 'configs', 'corpus', 'metas', 'metrics', 'notebooks', 'packages', 'scripts', 'training']
> ```

```yaml {title="Example project directory"}
├── project.yml          # the project settings
├── project.lock         # lockfile that tracks inputs/outputs
├── assets/              # downloaded data assets
├── configs/             # pipeline config.cfg files used for training
├── corpus/              # output directory for training corpus
├── metas/               # pipeline meta.json templates used for packaging
├── metrics/             # output directory for evaluation metrics
├── notebooks/           # directory for Jupyter notebooks
├── packages/            # output directory for pipeline Python packages
├── scripts/             # directory for scripts, e.g. referenced in commands
├── training/            # output directory for trained pipelines
└── ...                  # any other files, like a requirements.txt etc.
```

If you don't want a project to create a directory, you can delete it and remove
its entry from the `project.yml` – just make sure it's not required by any of
the commands. [Custom templates](#custom) can use any directories they need –
the only file that's required for a project is the `project.yml`.

---

## Custom scripts and projects {id="custom"}
|
||
|
||
The `project.yml` lets you define any custom commands and run them as part of
|
||
your training, evaluation or deployment workflows. The `script` section defines
|
||
a list of commands that are called in a subprocess, in order. This lets you
|
||
execute other Python scripts or command-line tools. Let's say you've written a
|
||
few integration tests that load the best model produced by the training command
|
||
and check that it works correctly. You can now define a `test` command that
|
||
calls into [`pytest`](https://docs.pytest.org/en/latest/), runs your tests and
|
||
uses [`pytest-html`](https://github.com/pytest-dev/pytest-html) to export a test
|
||
report:
|
||
|
||
```yaml {title="project.yml"}
|
||
commands:
|
||
- name: test
|
||
help: 'Test the trained pipeline'
|
||
script:
|
||
- 'pip install pytest pytest-html'
|
||
- 'python -m pytest ./scripts/tests --html=metrics/test-report.html'
|
||
deps:
|
||
- 'training/model-best'
|
||
outputs:
|
||
- 'metrics/test-report.html'
|
||
no_skip: true
|
||
```
|
||
|
||
Adding `training/model-best` to the command's `deps` lets you ensure that the
|
||
file is available. If not, spaCy will show an error and the command won't run.
|
||
Setting `no_skip: true` means that the command will always run, even if the
|
||
dependencies (the trained pipeline) haven't changed. This makes sense here,
|
||
because you typically don't want to skip your tests.
|
||
|
||
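
To make the interplay of `deps` and `no_skip` more concrete, here is a minimal
sketch of the decision described above. It is a simplification under stated
assumptions, not spaCy's actual code: the names `file_hash` and `should_run` are
hypothetical, only plain files are hashed (a directory such as
`training/model-best` would need extra handling), and the previous hashes stand
in for what the lockfile records.

```python
import hashlib
from pathlib import Path
from typing import Dict, List, Optional


def file_hash(path: Path) -> str:
    """MD5 of a file's bytes (directories are not handled in this sketch)."""
    return hashlib.md5(path.read_bytes()).hexdigest()


def should_run(
    project_dir: Path,
    deps: List[str],
    previous_hashes: Optional[Dict[str, str]],
    no_skip: bool = False,
) -> bool:
    """Decide whether a command needs to run again.

    Raises FileNotFoundError if a dependency is missing, similar to the error
    described in the text.
    """
    current = {}
    for dep in deps:
        dep_path = project_dir / dep
        if not dep_path.exists():
            raise FileNotFoundError(f"Missing dependency: {dep}")
        current[dep] = file_hash(dep_path)
    if no_skip or previous_hashes is None:
        # Always run when skipping is disabled or there is no earlier run
        return True
    # Otherwise re-run only if at least one dependency changed
    return current != previous_hashes
```

With `no_skip=True`, the function returns `True` even when the hashes are
identical, which matches the reasoning for the `test` command.
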
### Writing custom scripts {id="custom-scripts"}

Your project commands can include any custom scripts – essentially, anything you
can run from the command line. Here's an example of a custom script that uses
[`typer`](https://typer.tiangolo.com/) for quick and easy command-line arguments
that you can define via your `project.yml`:

> #### About Typer
>
> [`typer`](https://typer.tiangolo.com/) is a modern library for building Python
> CLIs using type hints. It's a dependency of spaCy, so it will already be
> pre-installed in your environment. Function arguments automatically become
> positional CLI arguments and using Python type hints, you can define the value
> types. For instance, `batch_size: int` means that the value provided via the
> command line is converted to an integer.

```python {title="scripts/custom_evaluation.py"}
import typer

def custom_evaluation(batch_size: int, model_path: str, data_path: str):
    # The arguments are now available as positional CLI arguments
    print(batch_size, model_path, data_path)

if __name__ == "__main__":
    typer.run(custom_evaluation)
```

In your `project.yml`, you can then run the script by calling
`python scripts/custom_evaluation.py` with the function arguments. You can also
use the `vars` section to define reusable variables that will be substituted in
commands, paths and URLs. In this example, the batch size is defined as a
variable and will be added in place of `${vars.batch_size}` in the script. Just
like in the [training config](/usage/training#config-overrides), you can also
override settings on the command line – for example using `--vars.batch_size`.

> #### Calling into Python
>
> If any of your command scripts call into `python`, spaCy will take care of
> replacing that with your `sys.executable`, to make sure you're executing
> everything with the same Python (not some other Python installed on your
> system). It also normalizes references to `python3`, `pip3` and `pip`.

{/* prettier-ignore */}
```yaml {title="project.yml"}
vars:
  batch_size: 128

commands:
  - name: evaluate
    script:
      - 'python scripts/custom_evaluation.py ${vars.batch_size} ./training/model-best ./corpus/eval.json'
    deps:
      - 'training/model-best'
      - 'corpus/eval.json'
```

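
As a rough illustration of what the variable substitution amounts to, the sketch
below replaces `${vars.…}` placeholders in a command string and lets
command-line style overrides take precedence. It is an approximation for
explanation only: spaCy resolves variables through its own config system, and
the names `VAR_PATTERN`, `lookup` and `substitute_vars` are hypothetical.

```python
import re
from typing import Any, Dict, Optional

# Matches placeholders such as ${vars.batch_size} or ${vars.prodigy.train_dataset}
VAR_PATTERN = re.compile(r"\$\{vars\.([A-Za-z0-9_.]+)\}")


def lookup(variables: Dict[str, Any], dotted_key: str) -> Any:
    """Resolve a dotted key like 'prodigy.train_dataset' in nested dicts."""
    value: Any = variables
    for part in dotted_key.split("."):
        value = value[part]
    return value


def substitute_vars(
    command: str,
    variables: Dict[str, Any],
    overrides: Optional[Dict[str, Any]] = None,
) -> str:
    """Fill in placeholders, preferring overrides (keyed by dotted name)."""
    overrides = overrides or {}

    def replace(match: "re.Match[str]") -> str:
        key = match.group(1)
        if key in overrides:
            return str(overrides[key])
        return str(lookup(variables, key))

    return VAR_PATTERN.sub(replace, command)
```

Calling `substitute_vars` with `{"batch_size": 128}` fills in the value from the
`vars` section, while passing `overrides={"batch_size": 64}` plays the role of
`--vars.batch_size 64` on the command line.
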
You can also use the `env` section to reference **environment variables** and
make their values available to the commands. This can be useful for overriding
settings on the command line and passing through system-level settings.

> #### Usage example
>
> ```bash
> export GPU_ID=1
> BATCH_SIZE=128 python -m spacy project run evaluate
> ```

```yaml {title="project.yml"}
env:
  batch_size: BATCH_SIZE
  gpu_id: GPU_ID

commands:
  - name: evaluate
    script:
      - 'python scripts/custom_evaluation.py ${env.batch_size}'
```

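
The `env` section is essentially a mapping from project-level names to the names
of environment variables. The following sketch shows that lookup under a few
assumptions: the helper names `resolve_env` and `substitute_env` are
hypothetical, and falling back to an empty string for unset variables is a
choice made for this example rather than documented behavior.

```python
import os
from typing import Dict, Mapping


def resolve_env(
    env_section: Mapping[str, str],
    environ: Mapping[str, str] = os.environ,
) -> Dict[str, str]:
    """Map names from the env section to the current environment values."""
    # Unset variables become empty strings (an assumption of this sketch)
    return {name: environ.get(var, "") for name, var in env_section.items()}


def substitute_env(command: str, env_values: Mapping[str, str]) -> str:
    """Replace ${env.name} placeholders in a command string."""
    for name, value in env_values.items():
        command = command.replace("${env." + name + "}", value)
    return command
```

Passing an explicit `environ` mapping instead of relying on `os.environ` keeps
the function easy to test.
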
### Documenting your project {id="custom-docs"}

> #### Readme Example
>
> For more examples, see the [`projects`](https://github.com/explosion/projects)
> repo.
>
> ![Screenshot of auto-generated Markdown Readme](/images/project_document.jpg)

When your custom project is ready and you want to share it with others, you can
use the [`spacy project document`](/api/cli#project-document) command to
**auto-generate** a pretty, Markdown-formatted `README` file based on your
project's `project.yml`. It will list all commands, workflows and assets defined
in the project and include details on how to run the project, as well as links
to the relevant spaCy documentation to make it easy for others to get started
using your project.

```bash
$ python -m spacy project document --output README.md
```

Under the hood, hidden markers are added to identify where the auto-generated
content starts and ends. This means that you can add your own custom content
before or after it and re-running the `project document` command will **only
update the auto-generated part**. This makes it easy to keep your documentation
up to date.

<Infobox variant="warning">

Note that the contents of an existing file will be **replaced** if no existing
auto-generated docs are found. If you want spaCy to ignore a file and not update
it, you can add the comment marker `<!-- SPACY PROJECT: IGNORE -->` anywhere in
your markup.

</Infobox>

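
The marker-based update can be sketched as follows. This is an illustration of
the three cases described above (update between markers, replace when no markers
exist, leave the file alone when it opts out). The marker strings and the
function name `update_readme` are placeholders chosen for this example, so treat
them as assumptions rather than the exact strings spaCy writes.

```python
# Placeholder marker strings for this sketch; the real ones may differ
MARKER_START = "<!-- AUTO-GENERATED DOCS START -->"
MARKER_END = "<!-- AUTO-GENERATED DOCS END -->"
MARKER_IGNORE = "<!-- SPACY PROJECT: IGNORE -->"


def update_readme(existing: str, generated: str) -> str:
    """Return the new file content for a README given freshly generated docs."""
    block = f"{MARKER_START}\n{generated}\n{MARKER_END}"
    if MARKER_IGNORE in existing:
        # The file opted out, so it is returned unchanged
        return existing
    start = existing.find(MARKER_START)
    end = existing.find(MARKER_END)
    if start != -1 and end != -1 and start < end:
        # Keep custom content before and after the generated section
        before = existing[:start]
        after = existing[end + len(MARKER_END):]
        return f"{before}{block}{after}"
    # No generated section found: the whole file is replaced
    return block
```
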
### Cloning from your own repo {id="custom-repo"}

The [`spacy project clone`](/api/cli#project-clone) command lets you customize
the repo to clone from using the `--repo` option. It calls into `git`, so you'll
be able to clone from any repo that you have access to, including private repos.

```bash
python -m spacy project clone your_project --repo https://github.com/you/repo
```

At a minimum, a valid project template needs to contain a
[`project.yml`](#project-yml). It can also include
[other files](/usage/projects#project-files), like custom scripts, a
`requirements.txt` listing additional dependencies,
[training configs](/usage/training#config) and model meta templates, or Jupyter
notebooks with usage examples.

<Infobox title="Important note about assets" variant="warning">

It's typically not a good idea to check large data assets, trained pipelines or
other artifacts into a Git repo and you should exclude them from your project
template by adding a `.gitignore`. If you want to version your data and models,
check out [Data Version Control](#dvc) (DVC), which integrates with spaCy
projects.

</Infobox>

## Remote Storage {id="remote"}

You can persist your project outputs to remote storage using the
[`project push`](/api/cli#project-push) command. This can help you **export**
your pipeline packages, **share** work with your team, or **cache results** to
avoid repeating work. The [`project pull`](/api/cli#project-pull) command will
download any outputs that are in the remote storage and aren't available
locally.

You can list one or more remotes in the `remotes` section of your
[`project.yml`](#project-yml) by mapping a string name to the URL of the
storage. Under the hood, spaCy uses
[`Pathy`](https://github.com/justindujardin/pathy) to communicate with the
remote storages, so you can use any protocol that `Pathy` supports, including
[S3](https://aws.amazon.com/s3/),
[Google Cloud Storage](https://cloud.google.com/storage), and the local
filesystem, although you may need to install extra dependencies to use certain
protocols.

> #### Example
>
> ```bash
> $ python -m spacy project pull local
> ```

```yaml {title="project.yml"}
remotes:
  default: 's3://my-spacy-bucket'
  local: '/mnt/scratch/cache'
```

<Infobox title="How it works" emoji="💡">

Inside the remote storage, spaCy uses a clever **directory structure** to avoid
overwriting files. The top level of the directory structure is a URL-encoded
version of the output's path. Within this directory are subdirectories named
according to a hash of the command string and the command's dependencies.
Finally, within those directories are files, named according to an MD5 hash of
their contents.

{/* TODO: update with actual real example? */}

{/* prettier-ignore */}
```yaml
└── urlencoded_file_path            # Path of original file
    ├── some_command_hash           # Hash of command you ran
    │   ├── some_content_hash       # Hash of file content
    │   └── another_content_hash
    └── another_command_hash
        └── third_content_hash
```

</Infobox>

For instance, let's say you had the following command in your `project.yml`:

```yaml {title="project.yml"}
- name: train
  help: 'Train a spaCy pipeline using the specified corpus and config'
  script:
    - 'spacy train ./config.cfg --output training/'
  deps:
    - 'corpus/train'
    - 'corpus/dev'
    - 'config.cfg'
  outputs:
    - 'training/model-best'
```

> #### Example
>
> ```
> └── s3://my-spacy-bucket/training%2Fmodel-best
>     └── 1d8cb33a06cc345ad3761c6050934a1b
>         └── d8e20c3537a084c5c10d95899fe0b1ff
> ```

After you finish training, you run [`project push`](/api/cli#project-push) to
make sure the `training/model-best` output is saved to remote storage. spaCy
will then construct a hash from your command script and the listed dependencies,
`corpus/train`, `corpus/dev` and `config.cfg`, in order to identify the
execution context of your output. It will then compute an MD5 hash of the
`training/model-best` directory, and use those three pieces of information (the
output path, the command hash and the content hash) to construct the storage
URL.

```bash
$ python -m spacy project run train
$ python -m spacy project push
```

If you change the command or one of its dependencies (for instance, by editing
the [`config.cfg`](/usage/training#config) file to tune the hyperparameters), a
different creation hash will be calculated, so when you use
[`project push`](/api/cli#project-push) you won't be overwriting your previous
file. The system even supports multiple outputs for the same file and the same
context, which can happen if your training process is not deterministic, or if
you have dependencies that aren't represented in the command.

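
The storage layout can be approximated in a few lines of Python. The sketch
below shows how a URL-encoded output path, a command hash and a content hash
combine into one storage URL, and why a changed dependency leads to a different
location. How exactly spaCy combines the command strings and dependency hashes
is not specified here, so the joining scheme in `command_hash` is an assumption,
and all function names are hypothetical.

```python
import hashlib
from typing import List
from urllib.parse import quote_plus


def md5_hex(data: bytes) -> str:
    """Hex digest of the MD5 hash of some bytes."""
    return hashlib.md5(data).hexdigest()


def command_hash(script: List[str], dep_hashes: List[str]) -> str:
    """Hash the command strings together with the hashes of the dependencies.

    The way the pieces are joined is an assumption made for this sketch.
    """
    joined = "\n".join(script + sorted(dep_hashes))
    return md5_hex(joined.encode("utf8"))


def remote_url(remote: str, output_path: str, cmd_hash: str, content_hash: str) -> str:
    """Build <remote>/<url-encoded output path>/<command hash>/<content hash>."""
    encoded = quote_plus(output_path)  # "training/model-best" -> "training%2Fmodel-best"
    return f"{remote.rstrip('/')}/{encoded}/{cmd_hash}/{content_hash}"
```

Because the command hash depends on the dependency hashes, editing `config.cfg`
changes the second path segment, so an earlier upload stays untouched.
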
In summary, the [`spacy project`](/api/cli#project) remote storages are designed
to make a particular set of trade-offs. Priority is placed on **convenience**,
**correctness** and **avoiding data loss**. You can use
[`project push`](/api/cli#project-push) freely, as you'll never overwrite remote
state, and you don't have to come up with names or version numbers. However,
it's up to you to manage the size of your remote storage, and to remove files
that are no longer relevant to you.

## Integrations {id="integrations"}

{<H3 id="dvc"> Data Version Control (DVC)

<IntegrationLogo name="dvc" title="DVC" width={70} height="auto" align="right"/>
</H3>}

Data assets like training corpora or pretrained weights are at the core of any
NLP project, but they're often difficult to manage: you can't just check them
into your Git repo to version and keep track of them. And if you have multiple
steps that depend on each other, like a preprocessing step that generates your
training data, you need to make sure the data is always up-to-date, and re-run
all steps of your process every time, just to be safe.

[Data Version Control](https://dvc.org) (DVC) is a standalone open-source tool
that integrates into your workflow like Git, builds a dependency graph for your
data pipelines and tracks and caches your data files. If you're downloading data
from an external source, like a storage bucket, DVC can tell whether the
resource has changed. It can also determine whether to re-run a step, depending
on whether its inputs have changed or not. All metadata can be checked into a
Git repo, so you'll always be able to reproduce your experiments.

To set up DVC, install the package and initialize your spaCy project as a Git
and DVC repo. You can also
[customize your DVC installation](https://dvc.org/doc/install/macos#install-with-pip)
to include support for remote storage like Google Cloud Storage, S3, Azure, SSH
and more.

```bash
$ pip install dvc   # Install DVC
$ git init          # Initialize a Git repo
$ dvc init          # Initialize a DVC project
```

<Infobox title="Important note on privacy" variant="warning">

DVC enables usage analytics by default, so if you're working in a
privacy-sensitive environment, make sure to
[**opt out manually**](https://dvc.org/doc/user-guide/analytics#opting-out).

</Infobox>

The [`spacy project dvc`](/api/cli#project-dvc) command creates a `dvc.yaml`
config file based on a workflow defined in your `project.yml`. Whenever you
update your project, you can re-run the command to update your DVC config. You
can then manage your spaCy project like any other DVC project, run
[`dvc add`](https://dvc.org/doc/command-reference/add) to add and track assets
and [`dvc repro`](https://dvc.org/doc/command-reference/repro) to reproduce the
workflow or individual commands.

```bash
$ python -m spacy project dvc [project_dir] [workflow_name]
```

<Infobox title="Important note for multiple workflows" variant="warning">

DVC currently expects a single workflow per project, so when creating the config
with [`spacy project dvc`](/api/cli#project-dvc), you need to specify the name
of a workflow defined in your `project.yml`. You can still use multiple
workflows, but only one can be tracked by DVC.

</Infobox>

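
To give a feel for how one workflow maps onto a DVC config, here is a rough
sketch that turns a parsed `project.yml` (as a plain dict) into the `stages`
layout used by `dvc.yaml`. It is not what `spacy project dvc` runs internally,
which calls into DVC itself. The function name `workflow_to_dvc_stages` is
hypothetical, and letting every stage delegate to `spacy project run` is an
assumption of this example.

```python
from typing import Any, Dict


def workflow_to_dvc_stages(project: Dict[str, Any], workflow: str) -> Dict[str, Any]:
    """Build a dvc.yaml-like dict for exactly one workflow of a project."""
    if workflow not in project.get("workflows", {}):
        raise KeyError(f"Workflow not defined in project.yml: {workflow}")
    commands = {cmd["name"]: cmd for cmd in project.get("commands", [])}
    stages: Dict[str, Any] = {}
    for name in project["workflows"][workflow]:
        cmd = commands[name]
        # Assumption: each stage simply re-runs the project command by name
        stage: Dict[str, Any] = {"cmd": f"python -m spacy project run {name}"}
        if cmd.get("deps"):
            stage["deps"] = list(cmd["deps"])
        if cmd.get("outputs"):
            stage["outs"] = list(cmd["outputs"])
        stages[name] = stage
    return {"stages": stages}
```

Only the commands of the selected workflow end up as stages, which reflects the
single-workflow restriction mentioned above.
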
{/* { TODO: <Project id="integrations/dvc"></Project>} */}

---

{<H3 id="prodigy">Prodigy

<IntegrationLogo name="prodigy" width={100} height="auto" align="right" />
</H3>}

[Prodigy](https://prodi.gy) is a modern annotation tool for creating training
data for machine learning models, developed by us. It integrates with spaCy
out-of-the-box and provides many different
[annotation recipes](https://prodi.gy/docs/recipes) for a variety of NLP tasks,
with and without a model in the loop. If Prodigy is installed in your project,
you can start the annotation server from your `project.yml` for a tight feedback
loop between data development and training.

<Infobox variant="warning">

This integration requires [Prodigy v1.11](https://prodi.gy/docs/changelog#v1.11)
or higher. If you're using an older version of Prodigy, you can still use your
annotations in spaCy v3 by exporting your data with
[`data-to-spacy`](https://prodi.gy/docs/recipes#data-to-spacy) and running
[`spacy convert`](/api/cli#convert) to convert it to the binary format.

</Infobox>

The following example shows a workflow for merging and exporting NER annotations
collected with Prodigy and training a spaCy pipeline:

> #### Example usage
>
> ```bash
> $ python -m spacy project run all
> ```

{/* prettier-ignore */}
```yaml {title="project.yml"}
vars:
  prodigy:
    train_dataset: "fashion_brands_training"
    eval_dataset: "fashion_brands_eval"

workflows:
  all:
    - data-to-spacy
    - train_spacy

commands:
  - name: "data-to-spacy"
    help: "Merge your annotations and create data in spaCy's binary format"
    script:
      - "python -m prodigy data-to-spacy corpus/ --ner ${vars.prodigy.train_dataset},eval:${vars.prodigy.eval_dataset}"
    outputs:
      - "corpus/train.spacy"
      - "corpus/dev.spacy"
  - name: "train_spacy"
    help: "Train a named entity recognition model with spaCy"
    script:
      - "python -m spacy train configs/config.cfg --output training/ --paths.train corpus/train.spacy --paths.dev corpus/dev.spacy"
    deps:
      - "corpus/train.spacy"
      - "corpus/dev.spacy"
    outputs:
      - "training/model-best"
```

> #### Example train curve output
>
> <Image
>   src="/images/prodigy_train_curve.jpg"
>   href="https://prodi.gy/docs/recipes#train-curve"
>   alt="Screenshot of train curve terminal output"
> />

The [`train-curve`](https://prodi.gy/docs/recipes#train-curve) recipe is another
cool workflow you can include in your project. It will run the training with
different portions of the data, e.g. 25%, 50%, 75% and 100%. As a rule of thumb,
if accuracy increases in the last segment, this could indicate that collecting
more annotations of the same type might improve the model further.

{/* prettier-ignore */}
```yaml {title="project.yml (excerpt)"}
- name: "train_curve"
  help: "Train the model with Prodigy by using different portions of training examples to evaluate if more annotations can potentially improve the performance"
  script:
    - "python -m prodigy train-curve --ner ${vars.prodigy.train_dataset},eval:${vars.prodigy.eval_dataset} --config configs/${vars.config} --show-plot"
```

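
The rule of thumb for reading a train curve can be written down as a tiny check.
The sketch below is only an aid for interpreting scores you have collected
yourself, for example at 25%, 50%, 75% and 100% of the data. The function name
`more_data_may_help` and the `min_gain` threshold are hypothetical and not part
of Prodigy.

```python
from typing import Sequence


def more_data_may_help(scores: Sequence[float], min_gain: float = 0.0) -> bool:
    """Check whether accuracy still improved in the last segment.

    scores[i] is the accuracy after training on the i-th, growing portion of
    the data. A gain above min_gain in the final step hints that collecting
    more annotations of the same type could pay off.
    """
    if len(scores) < 2:
        raise ValueError("Need at least two scores to compare")
    return (scores[-1] - scores[-2]) > min_gain
```

A call like `more_data_may_help([0.62, 0.71, 0.76, 0.79])` compares only the
last two values, so a curve that has flattened out returns `False`.
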
You can use the same approach for various types of projects and annotation
workflows, including
[named entity recognition](https://prodi.gy/docs/named-entity-recognition),
[span categorization](https://prodi.gy/docs/span-categorization),
[text classification](https://prodi.gy/docs/text-classification),
[dependency parsing](https://prodi.gy/docs/dependencies-relations),
[part-of-speech tagging](https://prodi.gy/docs/recipes#pos) or fully
[custom recipes](https://prodi.gy/docs/custom-recipes). You can also use spaCy
project templates to quickly start the annotation server to collect more
annotations and add them to your Prodigy dataset.

<Project id="integrations/prodigy">

Get started with spaCy and Prodigy using our project template. It includes
commands to create a merged training corpus from your Prodigy annotations, train
and package a spaCy pipeline, and analyze whether more annotations may improve
performance.

</Project>

---

{<H3 id="streamlit">Streamlit

<IntegrationLogo name="streamlit" width={150} height="auto" align="right" />
</H3>}

[Streamlit](https://streamlit.io) is a Python framework for building interactive
data apps. The [`spacy-streamlit`](https://github.com/explosion/spacy-streamlit)
package helps you integrate spaCy visualizations into your Streamlit apps and
quickly spin up demos to explore your pipelines interactively. It includes a
full embedded visualizer, as well as individual components.

{/* TODO: update once version is stable */}

> #### Installation
>
> ```bash
> $ pip install spacy-streamlit --pre
> ```

![](/images/spacy-streamlit.png)

Using [`spacy-streamlit`](https://github.com/explosion/spacy-streamlit), your
projects can easily define their own scripts that spin up an interactive
visualizer, using the latest pipeline you trained, or a selection of pipelines
so you can compare their results.

<Project id="integrations/streamlit">

Get started with spaCy and Streamlit using our project template. It includes a
script to spin up a custom visualizer and commands you can adjust to showcase
and explore your own custom trained pipelines.

</Project>

> #### Example usage
>
> ```bash
> $ python -m spacy project run visualize
> ```

{/* prettier-ignore */}
```yaml {title="project.yml"}
commands:
  - name: visualize
    help: "Visualize the pipeline's output interactively using Streamlit"
    script:
      - 'streamlit run ./scripts/visualize.py ./training/model-best "I like Adidas shoes."'
    deps:
      - "training/model-best"
```

The following script is called from the `project.yml` and takes two positional
command-line arguments: a comma-separated list of paths or packages to load the
pipelines from and an example text to use as the default text.

```python
https://github.com/explosion/projects/blob/v3/integrations/streamlit/scripts/visualize.py
```

---

{<H3 id="fastapi">FastAPI

<IntegrationLogo name="fastapi" width={100} height="auto" align="right" />
</H3>}

[FastAPI](https://fastapi.tiangolo.com/) is a modern high-performance framework
for building REST APIs with Python, based on Python
[type hints](https://fastapi.tiangolo.com/python-types/). It's become a popular
library for serving machine learning models and you can use it in your spaCy
projects to quickly serve up a trained pipeline and make it available behind a
REST API.

<Project id="integrations/fastapi">

Get started with spaCy and FastAPI using our project template. It includes a
simple REST API for processing batches of text, and usage examples for how to
query your API from Python and JavaScript (Vanilla JS and React).

</Project>

> #### Example usage
>
> ```bash
> $ python -m spacy project run serve
> ```

{/* prettier-ignore */}
```yaml {title="project.yml"}
- name: "serve"
  help: "Serve the models via a FastAPI REST API using the given host and port"
  script:
    - "uvicorn scripts.main:app --reload --host 127.0.0.1 --port 5000"
  deps:
    - "scripts/main.py"
  no_skip: true
```

The script included in the template shows a simple REST API with a `POST`
endpoint that accepts batches of texts and returns batches of predictions, e.g.
named entities found in the documents. Type hints and
[`pydantic`](https://github.com/samuelcolvin/pydantic) are used to define the
expected data types.

```python
https://github.com/explosion/projects/blob/v3/integrations/fastapi/scripts/main.py
```

---

{<H3 id="wandb">Weights & Biases

<IntegrationLogo name="wandb" width={175} height="auto" align="right" />
</H3>}

[Weights & Biases](https://www.wandb.com/) is a popular platform for experiment
tracking. spaCy integrates with it out-of-the-box via the
[`WandbLogger`](https://github.com/explosion/spacy-loggers#wandblogger), which
you can add as the `[training.logger]` block of your training
[config](/usage/training#config). The results of each step are then logged in
your project, together with the full **training config**. This means that
_every_ hyperparameter, registered function name and argument will be tracked
and you'll be able to see the impact it has on your results.

> #### Example config
>
> ```ini
> [training.logger]
> @loggers = "spacy.WandbLogger.v3"
> project_name = "monitor_spacy_training"
> remove_config_values = ["paths.train", "paths.dev", "corpora.train.path", "corpora.dev.path"]
> log_dataset_dir = "corpus"
> model_log_interval = 1000
> ```

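
The `remove_config_values` setting in the example config lists dotted keys that
should be left out of the config that gets logged. The sketch below shows what
removing such keys from a nested config means in practice. It is a standalone
illustration using plain dicts, not the logger's own code, and the function name
is reused here purely for readability.

```python
import copy
from typing import Any, Dict, Iterable


def remove_config_values(config: Dict[str, Any], dotted_keys: Iterable[str]) -> Dict[str, Any]:
    """Return a copy of a nested config without the given dotted keys."""
    result = copy.deepcopy(config)
    for dotted in dotted_keys:
        *parents, leaf = dotted.split(".")
        node: Any = result
        for part in parents:
            node = node.get(part)
            if not isinstance(node, dict):
                # The path does not exist, so there is nothing to remove
                break
        else:
            node.pop(leaf, None)
    return result
```

Paths to local data files are a typical candidate for removal, since they say
little about the experiment itself.
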
![Screenshot: Visualized training results](/images/wandb1.jpg)

![Screenshot: Parameter importance using config values](/images/wandb2.jpg 'Parameter importance using config values')

<Project id="integrations/wandb">

Get started with tracking your spaCy training runs in Weights & Biases using our
project template. It trains on the IMDB Movie Review Dataset and includes a
simple config with the built-in `WandbLogger`, as well as a custom example of
creating variants of the config for a simple hyperparameter grid search and
logging the results.

</Project>

---

{<H3 id="huggingface_hub">Hugging Face Hub

<IntegrationLogo name="huggingface_hub" width={175} height="auto" align="right" />
</H3>}

The [Hugging Face Hub](https://huggingface.co/) lets you upload models and share
them with others. It hosts models as Git-based repositories, which are storage
spaces that can contain all your files. It supports versioning, branches and
custom metadata out-of-the-box, and provides browser-based visualizers for
exploring your models interactively, as well as an API for production use. The
[`spacy-huggingface-hub`](https://github.com/explosion/spacy-huggingface-hub)
package automatically adds the `huggingface-hub` command to your `spacy` CLI if
it's installed.

> #### Installation
>
> ```bash
> $ pip install spacy-huggingface-hub
> # Check that the CLI is registered
> $ python -m spacy huggingface-hub --help
> ```

You can then upload any pipeline packaged with
[`spacy package`](/api/cli#package). Make sure to set `--build wheel` to output
a binary `.whl` file. The uploader will read all metadata from the pipeline
package, including the auto-generated pretty `README.md` and the model details
available in the `meta.json`. For examples, check out the
[spaCy pipelines](https://huggingface.co/spacy) we've uploaded.

```bash
$ huggingface-cli login
$ python -m spacy package ./en_ner_fashion ./output --build wheel
$ cd ./output/en_ner_fashion-0.0.0/dist
$ python -m spacy huggingface-hub push en_ner_fashion-0.0.0-py3-none-any.whl
```

After uploading, you will see the live URL of your pipeline packages, as well as
the direct URL to the model wheel you can install via `pip install`. You'll also
be able to test your pipeline interactively from your browser:

![Screenshot: interactive NER visualizer](/images/huggingface_hub.jpg)

In your `project.yml`, you can add a command that uploads your trained and
packaged pipeline to the hub. You can either run this as a manual step, or
automatically as part of a workflow. Make sure to set `--build wheel` when
running `spacy package` to build a wheel file for your pipeline package.

{/* prettier-ignore */}
```yaml {title="project.yml"}
- name: "push_to_hub"
  help: "Upload the trained model to the Hugging Face Hub"
  script:
    - "python -m spacy huggingface-hub push packages/en_${vars.name}-${vars.version}/dist/en_${vars.name}-${vars.version}-py3-none-any.whl"
  deps:
    - "packages/en_${vars.name}-${vars.version}/dist/en_${vars.name}-${vars.version}-py3-none-any.whl"
```

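
The wheel path in the command above follows a fixed pattern built from the
project variables. As a small sketch of that pattern, the helper below assembles
the same path from a name and version. The function `wheel_path` and its
`lang` and `packages_dir` parameters are hypothetical and only mirror the
placeholders used in the `project.yml` excerpt.

```python
def wheel_path(name: str, version: str, lang: str = "en", packages_dir: str = "packages") -> str:
    """Build the path of the wheel that the push command expects."""
    # Mirrors: packages/en_<name>-<version>/dist/en_<name>-<version>-py3-none-any.whl
    package = f"{lang}_{name}-{version}"
    return f"{packages_dir}/{package}/dist/{package}-py3-none-any.whl"
```

Listing the same path under `deps` means the upload command is only considered
runnable once the wheel has actually been built.
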
<Project id="integrations/huggingface_hub">

Get started with uploading your models to the Hugging Face Hub using our project
template. It trains a simple pipeline, packages it and uploads it if the
packaged model has changed. This makes it easy to deploy your models end-to-end.

</Project>