Document new features

Ines Montani 2020-07-09 21:10:36 +02:00
parent 797ca6f3dd
commit 7bcf9f7cfb


@ -186,12 +186,13 @@ pipelines.
https://github.com/explosion/spacy-boilerplates/blob/master/ner_fashion/project.yml https://github.com/explosion/spacy-boilerplates/blob/master/ner_fashion/project.yml
``` ```
| Section | Description | | Section | Description |
| ----------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ | | ------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ |
| `variables` | A dictionary of variables that can be referenced in paths, URLs and scripts. For example, `{NAME}` will use the value of the variable `NAME`. | | `variables` | A dictionary of variables that can be referenced in paths, URLs and scripts. For example, `{NAME}` will use the value of the variable `NAME`. |
| `assets` | A list of assets that can be fetched with the [`project assets`](/api/cli#project-assets) command. `url` defines a URL or local path, `dest` is the destination file relative to the project directory, and an optional `checksum` ensures that an error is raised if the file's checksum doesn't match. | | `directories` | An optional list of [directories](#project-files) that should be created in the project for assets, training outputs, metrics etc. spaCy will make sure that these directories always exist. |
| `workflows` | A dictionary of workflow names, mapped to a list of command names, to execute in order. Workflows can be run with the [`project run`](/api/cli#project-run) command. | | `assets` | A list of assets that can be fetched with the [`project assets`](/api/cli#project-assets) command. `url` defines a URL or local path, `dest` is the destination file relative to the project directory, and an optional `checksum` ensures that an error is raised if the file's checksum doesn't match. |
| `commands` | A list of named commands. A command can define an optional help message (shown in the CLI when the user adds `--help`) and the `script`, a list of commands to run. The `deps` and `outputs` let you define the created file the command depends on and produces, respectively. This lets spaCy determine whether a command needs to be re-run because its dependencies or outputs changed. Commands can be run as part of a workflow, or separately with the [`project run`](/api/cli#project-run) command. | | `workflows` | A dictionary of workflow names, mapped to a list of command names, to execute in order. Workflows can be run with the [`project run`](/api/cli#project-run) command. |
| `commands` | A list of named commands. A command can define an optional help message (shown in the CLI when the user adds `--help`) and the `script`, a list of commands to run. The `deps` and `outputs` let you define the files the command depends on and produces, respectively. This lets spaCy determine whether a command needs to be re-run because its dependencies or outputs changed. Commands can be run as part of a workflow, or separately with the [`project run`](/api/cli#project-run) command. |
### Dependencies and outputs {#deps-outputs} ### Dependencies and outputs {#deps-outputs}
@ -228,7 +229,9 @@ commands:
If you're running a command and it depends on files that are missing, spaCy will If you're running a command and it depends on files that are missing, spaCy will
show you an error. If a command defines dependencies and outputs that haven't show you an error. If a command defines dependencies and outputs that haven't
changed since the last run, the command will be skipped. This means that you're changed since the last run, the command will be skipped. This means that you're
only re-running commands if they need to be re-run. To force re-running a only re-running commands if they need to be re-run. Commands can also set
`no_skip: true` if they should never be skipped, for example commands that run
tests. Commands without outputs are also never skipped. To force re-running a
command or workflow, even if nothing changed, you can set the `--force` flag. command or workflow, even if nothing changed, you can set the `--force` flag.
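The skipping rules described above can be sketched in a few lines of Python. This is a minimal illustration and not spaCy's actual implementation: the names `file_hash` and `should_skip`, the use of MD5, and the plain dictionary standing in for the `project.lock` contents are all assumptions made for the example. It also assumes every dependency and output is a single file.

```python
import hashlib
from pathlib import Path


def file_hash(path: Path) -> str:
    """Return the MD5 hex digest of a file's contents (hypothetical helper)."""
    return hashlib.md5(path.read_bytes()).hexdigest()


def should_skip(command: dict, lock_entry: dict, root: Path, force: bool = False) -> bool:
    """Decide whether a command can be skipped (illustrative logic only).

    `lock_entry` maps relative file paths to the hashes recorded on the
    last run. It stands in for whatever the real lockfile stores.
    """
    # --force and no_skip: true both mean "always run".
    if force or command.get("no_skip", False):
        return False
    outputs = command.get("outputs", [])
    if not outputs:
        # Commands without outputs are never skipped.
        return False
    for rel_path in command.get("deps", []) + outputs:
        path = root / rel_path
        # A missing or changed file means the command has to run again.
        if not path.is_file() or lock_entry.get(rel_path) != file_hash(path):
            return False
    return True
```

In this sketch a command is only skipped when every dependency and output still matches its recorded hash. Reporting an error for missing dependencies, which spaCy does, is left out.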
Note that [`spacy project`](/api/cli#project) doesn't compile any dependency Note that [`spacy project`](/api/cli#project) doesn't compile any dependency
@ -243,28 +246,42 @@ won't be cached or tracked.
### Files and directory structure {#project-files} ### Files and directory structure {#project-files}
A project directory created by [`spacy project clone`](/api/cli#project-clone) The `project.yml` can define a list of `directories` that should be created
includes the following files and directories. They can optionally be within a project, for instance `assets`, `training`, `corpus` and so on. spaCy
pre-populated by a project template (most commonly used for metas, configs or will make sure that these directories are always available, so your commands can
scripts). write to and read from them. Project directories will also include all files and
directories copied from the project template with
[`spacy project clone`](/api/cli#project-clone). Here's an example of a project
directory:
> #### project.yml
>
> <!-- prettier-ignore -->
> ```yaml
> directories: ['assets', 'configs', 'corpus', 'metas', 'metrics', 'notebooks', 'packages', 'scripts', 'training']
> ```
```yaml ```yaml
### Project directory ### Example project directory
├── project.yml # the project settings ├── project.yml # the project settings
├── project.lock # lockfile that tracks inputs/outputs ├── project.lock # lockfile that tracks inputs/outputs
├── assets/ # downloaded data assets ├── assets/ # downloaded data assets
├── metrics/ # output directory for evaluation metrics ├── configs/ # model config.cfg files used for training
├── training/ # output directory for trained models
├── corpus/ # output directory for training corpus ├── corpus/ # output directory for training corpus
├── packages/ # output directory for model Python packages ├── metas/ # model meta.json templates used for packaging
├── metrics/ # output directory for evaluation metrics ├── metrics/ # output directory for evaluation metrics
├── notebooks/ # directory for Jupyter notebooks ├── notebooks/ # directory for Jupyter notebooks
├── packages/ # output directory for model Python packages
├── scripts/ # directory for scripts, e.g. referenced in commands ├── scripts/ # directory for scripts, e.g. referenced in commands
├── metas/ # model meta.json templates used for packaging ├── training/ # output directory for trained models
├── configs/ # model config.cfg files used for training
└── ... # any other files, like a requirements.txt etc. └── ... # any other files, like a requirements.txt etc.
``` ```
If you don't want a project to create a directory, you can delete it and remove
its entry from the `project.yml`. Just make sure it's not required by any of
the commands. [Custom templates](#custom) can use any directories they need;
the only file that's required for a project is the `project.yml`.
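To make the "always available" guarantee concrete, here is a minimal Python sketch of how a list of `directories` could be created idempotently. The function name `ensure_directories` is hypothetical and this is not spaCy's own code. It only shows the general technique of creating each directory if it doesn't exist yet.

```python
from pathlib import Path
from typing import List


def ensure_directories(project_dir: Path, directories: List[str]) -> List[Path]:
    """Create each listed directory under project_dir if it is missing."""
    created = []
    for name in directories:
        path = project_dir / name
        # exist_ok=True makes repeated calls safe: existing directories
        # are left untouched instead of raising an error.
        path.mkdir(parents=True, exist_ok=True)
        created.append(path)
    return created
```

Calling `ensure_directories(Path("."), ["assets", "training"])` twice in a row has the same effect as calling it once, which is the property commands rely on when they read from and write to these directories.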
--- ---
## Custom scripts and projects {#custom} ## Custom scripts and projects {#custom}
@ -275,7 +292,9 @@ a list of commands that are called in a subprocess, in order. This lets you
execute other Python scripts or command-line tools. Let's say you've written a execute other Python scripts or command-line tools. Let's say you've written a
few integration tests that load the best model produced by the training command few integration tests that load the best model produced by the training command
and check that it works correctly. You can now define a `test` command that and check that it works correctly. You can now define a `test` command that
calls into [`pytest`](https://docs.pytest.org/en/latest/) and runs your tests: calls into [`pytest`](https://docs.pytest.org/en/latest/), runs your tests and
uses [`pytest-html`](https://github.com/pytest-dev/pytest-html) to export a test
report:
> #### Calling into Python > #### Calling into Python
> >
@ -290,15 +309,20 @@ commands:
- name: test - name: test
help: 'Test the trained model' help: 'Test the trained model'
script: script:
- 'python -m pytest ./scripts/tests' - 'pip install pytest pytest-html'
- 'python -m pytest ./scripts/tests --html=metrics/test-report.html'
deps: deps:
- 'training/model-best' - 'training/model-best'
outputs:
- 'metrics/test-report.html'
no_skip: true
``` ```
Adding `training/model-best` to the command's `deps` lets you ensure that the Adding `training/model-best` to the command's `deps` lets you ensure that the
file is available. If not, spaCy will show an error and the command won't run. file is available. If not, spaCy will show an error and the command won't run.
Setting `no_skip: true` means that the command will always run, even if the
<!-- TODO: add another example --> dependencies (the trained model) haven't changed. This makes sense here, because
you typically don't want to skip your tests.
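As noted at the start of this section, a command's `script` is a list of commands that are called in a subprocess, in order. The sketch below illustrates that idea in plain Python. It is an assumption-laden example, not how spaCy runs scripts internally: `run_script` is a hypothetical helper, and replacing a leading `python` with the current interpreter is a convenience added for this example.

```python
import shlex
import subprocess
import sys


def run_script(script, cwd="."):
    """Run each command string in order, stopping at the first failure."""
    for step in script:
        args = shlex.split(step)
        if args[0] == "python":
            # Assumption for this sketch: reuse the interpreter that is
            # running this code instead of whatever "python" is on PATH.
            args[0] = sys.executable
        # check=True raises CalledProcessError on a non-zero exit code,
        # so later steps do not run after a failed one.
        subprocess.run(args, cwd=cwd, check=True)
```

With the `test` command above, the two `script` entries would run one after the other, and a failing `pip install` would prevent the `pytest` step from starting.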
### Cloning from your own repo {#custom-repo} ### Cloning from your own repo {#custom-repo}