Update WIP

This commit is contained in:
Ines Montani 2020-07-06 22:22:37 +02:00
parent 44da24ddd0
commit bb3ee38cf9
16 changed files with 325 additions and 53 deletions

View File

@ -122,7 +122,7 @@ where the rescuers keep passing out from low oxygen, causing another rescuer to
follow — only to succumb themselves. In short, just say no to optimizing your
Python. If it's not fast enough the first time, just switch to Cython.
<Infobox title="📖 Resources">
<Infobox title="Resources" emoji="📖">
- [Official Cython documentation](http://docs.cython.org/en/latest/)
(cython.org)

View File

@ -85,7 +85,7 @@ hood. For details on how to use training configs, see the
<Infobox variant="warning">
The `@` notation lets you refer to function names registered in the
The `@` syntax lets you refer to function names registered in the
[function registry](/api/top-level#registry). For example,
`@architectures = "spacy.HashEmbedCNN.v1"` refers to a registered function of
the name `"spacy.HashEmbedCNN.v1"` and all other values defined in its block
@ -96,6 +96,7 @@ API details.
</Infobox>
<!-- TODO: we need to come up with a good way to present the sections and their expected values visually? -->
<!-- TODO: once we know how we want to implement "starter config" workflow or outputting a full default config for the user, update this section with the command -->
## Lexical data for vocabulary {#vocab-jsonl new="2"}

View File

@ -27,7 +27,7 @@ import QuickstartModels from 'widgets/quickstart-models.js'
<QuickstartModels title="Quickstart" id="quickstart" description="Install a default model, get the code to load it from within spaCy and test it." />
<Infobox title="📖 Installation and usage">
<Infobox title="Installation and usage" emoji="📖">
For more details on how to use models with spaCy, see the
[usage guide on models](/usage/models).

View File

@ -28,7 +28,7 @@ import PosDeps101 from 'usage/101/\_pos-deps.md'
<PosDeps101 />
<Infobox title="📖 Part-of-speech tag scheme">
<Infobox title="Part-of-speech tag scheme" emoji="📖">
For a list of the fine-grained and coarse-grained part-of-speech tags assigned
by spaCy's models across different languages, see the label schemes documented
@ -287,7 +287,7 @@ for token in doc:
| their | `ADJ` | `poss` | requests |
| requests | `NOUN` | `dobj` | submit |
<Infobox title="📖 Dependency label scheme">
<Infobox title="Dependency label scheme" emoji="📖">
For a list of the syntactic dependency labels assigned by spaCy's models across
different languages, see the label schemes documented in the
@ -615,7 +615,7 @@ tokens containing periods intact (abbreviations like "U.S.").
![Language data architecture](../images/language_data.svg)
<Infobox title="📖 Language data">
<Infobox title="Language data" emoji="📖">
For more details on the language-specific data, see the usage guide on
[adding languages](/usage/adding-languages).

View File

@ -338,7 +338,7 @@ nlp = spacy.load("/path/to/en_core_web_sm") # load package from a directory
doc = nlp("This is a sentence.")
```
<Infobox title="Tip: Preview model info">
<Infobox title="Tip: Preview model info" emoji="💡">
You can use the [`info`](/api/cli#info) command or
[`spacy.info()`](/api/top-level#spacy.info) method to print a model's meta data

View File

@ -34,7 +34,7 @@ texts = ["This is a text", "These are lots of texts", "..."]
+ docs = list(nlp.pipe(texts))
```
<Infobox title="Tips for efficient processing">
<Infobox title="Tips for efficient processing" emoji="💡">
- Process the texts **as a stream** using [`nlp.pipe`](/api/language#pipe) and
buffer them in batches, instead of one-by-one. This is usually much more
@ -912,7 +912,7 @@ new_heads = [head - i - 1 if head != 0 else 0 for i, head in enumerate(heads)]
</Infobox>
<Infobox title="📖 Advanced usage, serialization and entry points">
<Infobox title="Advanced usage, serialization and entry points" emoji="📖">
For more details on how to write and package custom components, make them
available to spaCy via entry points and implement your own serialization

View File

@ -1,5 +1,158 @@
---
title: Projects
new: 3
menu:
- ['Intro & Workflow', 'intro']
- ['Directory & Assets', 'directory']
- ['Custom Projects', 'custom']
---
TODO: write
> #### Project templates
>
> Our [`projects`](https://github.com/explosion/projects) repo includes various
> project templates for different tasks and models that you can clone and run.
<!-- TODO: write more about templates in aside -->
spaCy projects let you manage and share **end-to-end spaCy workflows** for
training, packaging and serving your custom models. You can start off by cloning
a pre-defined project template, adjust it to fit your needs, load in your data,
train a model, export it as a Python package and share the project templates
with your team. Under the hood, projects use
[Data Version Control](https://dvc.org) (DVC) to track and version inputs and
outputs, and make sure you're only re-running what's needed. spaCy projects can
be used via the new [`spacy project`](/api/cli#project) command. For an overview
of the available project templates, check out the
[`projects`](https://github.com/explosion/projects) repo.
## Introduction and workflow {#intro}
<!-- TODO: decide how to introduce concept -->
<Project id="some_example_project">
Lorem ipsum dolor sit amet, consectetur adipiscing elit. Phasellus interdum
sodales lectus, ut sodales orci ullamcorper id. Sed condimentum neque ut erat
mattis pretium.
</Project>
### 1. Clone a project template {#clone}
The [`spacy project clone`](/api/cli#project-clone) command clones an existing
project template and copies the files to a local directory. You can then run the
project, e.g. to train a model and edit the commands and scripts to build fully
custom workflows.
> #### Cloning under the hood
>
> To clone a project, spaCy calls into `git` and uses the "sparse checkout"
> feature to only clone the relevant directory or directories.
```bash
$ python -m spacy project clone some_example_project
```
By default, the project will be cloned into the current working directory. You
can specify an optional second argument to define the output directory. The
`--repo` option lets you define a custom repo to clone from, if you don't want
to use the spaCy [`projects`](https://github.com/explosion/projects) repo. You
can also use any private repo you have access to with Git.
If you plan on making the project a Git repo, you can set the `--git` flag to
set it up automatically _before_ initializing DVC, so DVC can integrate with
Git. This means that it will automatically add asset files to a `.gitignore` (so
you never check assets into the repo, only the asset meta files).
### 2. Fetch the project assets {#assets}
Assets are data files your project needs, for example the training and
evaluation data or pretrained vectors and embeddings to initialize your model
with. <!-- TODO: ... -->
```bash
$ cd some_example_project
$ python -m spacy project assets
```
### 3. Run the steps {#run-all}
```bash
$ python -m spacy project run-all
```
### 4. Run single commands {#run}
```bash
$ python -m spacy project run visualize
```
## Project directory and assets {#directory}
### project.yml {#project-yml}
The project config, `project.yml`, defines the assets a project depends on, like
datasets and pretrained weights, as well as a series of commands that can be run
separately or as a pipeline, for instance to preprocess the data, convert it
to spaCy's format, train a model, evaluate it and export metrics, package it and
spin up a quick web demo. It looks pretty similar to a config file used to
define CI pipelines.
<!-- TODO: include example etc. -->
### Files and directory structure {#project-files}
A project directory created by [`spacy project clone`](/api/cli#project-clone)
includes the following files and directories. They can optionally be
pre-populated by a project template (most commonly used for metas, configs or
scripts).
```yaml
### Project directory
├── project.yml # the project configuration
├── dvc.yaml # auto-generated Data Version Control config
├── dvc.lock # auto-generated Data Version control lock file
├── assets/ # downloaded data assets and DVC meta files
├── metrics/ # output directory for evaluation metrics
├── training/ # output directory for trained models
├── corpus/ # output directory for training corpus
├── packages/ # output directory for model Python packages
├── notebooks/ # directory for Jupyter notebooks
├── scripts/ # directory for scripts, e.g. referenced in commands
├── metas/ # model meta.json templates used for packaging
├── configs/ # model config.cfg files used for training
└── ... # any other files, like a requirements.txt etc.
```
When the project is initialized, spaCy will auto-generate a `dvc.yaml` based on
the project config. The file is updated whenever the project config has changed
and includes all commands defined in the `run` section of the project config.
This allows DVC to track the inputs and outputs and know which steps need to be
re-run.
#### Why Data Version Control (DVC)?
Data assets like training corpora or pretrained weights are at the core of any
NLP project, but they're often difficult to manage: you can't just check them
into your Git repo to version and keep track of them. And if you have multiple
steps that depend on each other, like a preprocessing step that generates your
training data, you need to make sure the data is always up-to-date, and re-run
all steps of your process every time, just to be safe.
[Data Version Control (DVC)](https://dvc.org) is a standalone open-source tool
that integrates into your workflow like Git, builds a dependency graph for your
data pipelines and tracks and caches your data files. If you're downloading data
from an external source, like a storage bucket, DVC can tell whether the
resource has changed. It can also determine whether to re-run a step, depending
on whether its inputs have changed or not. All metadata can be checked into a Git
repo, so you'll always be able to reproduce your experiments. `spacy project`
uses DVC under the hood and you typically don't have to think about it if you
don't want to. But if you do want to integrate with DVC more deeply, you can.
Each spaCy project is also a regular DVC project.
#### Checking projects into Git
---
## Custom projects and scripts {#custom}

View File

@ -552,7 +552,7 @@ component with different patterns, depending on your application:
html_merger = BadHTMLMerger(nlp, path="/path/to/patterns.json")
```
<Infobox title="📖 Processing pipelines">
<Infobox title="Processing pipelines" emoji="📖">
For more details and examples of how to **create custom pipeline components**
and **extension attributes**, see the

View File

@ -198,7 +198,7 @@ import Tokenization101 from 'usage/101/\_tokenization.md'
<Tokenization101 />
<Infobox title="📖 Tokenization rules">
<Infobox title="Tokenization rules" emoji="📖">
To learn more about how spaCy's tokenization rules work in detail, how to
**customize and replace** the default tokenizer and how to **add
@ -214,7 +214,7 @@ import PosDeps101 from 'usage/101/\_pos-deps.md'
<PosDeps101 />
<Infobox title="📖 Part-of-speech tagging and morphology">
<Infobox title="Part-of-speech tagging and morphology" emoji="📖">
To learn more about **part-of-speech tagging** and rule-based morphology, and
how to **navigate and use the parse tree** effectively, see the usage guides on
@ -229,7 +229,7 @@ import NER101 from 'usage/101/\_named-entities.md'
<NER101 />
<Infobox title="📖 Named Entity Recognition">
<Infobox title="Named Entity Recognition" emoji="📖">
To learn more about entity recognition in spaCy, how to **add your own
entities** to a document and how to **train and update** the entity predictions
@ -245,7 +245,7 @@ import Vectors101 from 'usage/101/\_vectors-similarity.md'
<Vectors101 />
<Infobox title="📖 Word vectors">
<Infobox title="Word vectors" emoji="📖">
To learn more about word vectors, how to **customize them** and how to load
**your own vectors** into spaCy, see the usage guide on
@ -259,7 +259,7 @@ import Pipelines101 from 'usage/101/\_pipelines.md'
<Pipelines101 />
<Infobox title="📖 Processing pipelines">
<Infobox title="Processing pipelines" emoji="📖">
To learn more about **how processing pipelines work** in detail, how to enable
and disable their components, and how to **create your own**, see the usage
@ -458,7 +458,7 @@ import Serialization101 from 'usage/101/\_serialization.md'
<Serialization101 />
<Infobox title="📖 Saving and loading">
<Infobox title="Saving and loading" emoji="📖">
To learn more about how to **save and load your own models**, see the usage
guide on [saving and loading](/usage/saving-loading#models).
@ -471,7 +471,7 @@ import Training101 from 'usage/101/\_training.md'
<Training101 />
<Infobox title="📖 Training statistical models">
<Infobox title="Training statistical models" emoji="📖">
To learn more about **training and updating** models, how to create training
data and how to improve spaCy's named entity recognition models, see the usage
@ -485,14 +485,6 @@ import LanguageData101 from 'usage/101/\_language-data.md'
<LanguageData101 />
<Infobox title="📖 Language data">
To learn more about the individual components of the language data and how to
**add a new language** to spaCy in preparation for training a language model,
see the usage guide on [adding languages](/usage/adding-languages).
</Infobox>
## Lightning tour {#lightning-tour}
The following examples and code snippets give you an overview of spaCy's

View File

@ -4,8 +4,8 @@ next: /usage/projects
menu:
- ['Introduction', 'basics']
- ['CLI & Config', 'cli-config']
- ['Custom Models', 'custom-models']
- ['Transfer Learning', 'transfer-learning']
- ['Custom Models', 'custom-models']
- ['Parallel Training', 'parallel-training']
- ['Internal API', 'api']
---
@ -195,7 +195,7 @@ dropout = null
<!-- TODO: explain settings and @ notation, refer to function registry docs -->
<Infobox title="📖 Config format and settings">
<Infobox title="Config format and settings" emoji="📖">
For a full overview of spaCy's config format and settings, see the
[training format documentation](/api/data-formats#config). The settings
@ -206,26 +206,47 @@ available for the different architectures are documented with the
</Infobox>
#### Using registered functions {#config-functions}
The training configuration defined in the config file doesn't have to consist
of static values only. Some settings can also be **functions**. For instance,
the `batch_size` can be a number that doesn't change, or a schedule, like a
sequence of compounding values, which has been shown to be an effective trick
(see [Smith et al., 2017](https://arxiv.org/abs/1711.00489)).
```ini
### With static value
[training]
batch_size = 128
```
To refer to a function instead, you can make `[training.batch_size]` its own
section and use the `@` syntax to specify the function and its arguments, in this
case [`compounding.v1`](https://thinc.ai/docs/api-schedules#compounding) defined
in the [function registry](/api/top-level#registry). All other values defined in
the block are passed to the function as keyword arguments when it's initialized.
You can also use this mechanism to register
[custom implementations and architectures](#custom-models) and reference them
from your configs.
> #### TODO
>
> TODO: something about how the tree is built bottom-up?
```ini
### With registered function
[training.batch_size]
@schedules = "compounding.v1"
start = 100
stop = 1000
compound = 1.001
```
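To get an intuition for what a compounding schedule produces, here's a rough
plain-Python sketch (for illustration only, not Thinc's actual implementation):
the value starts at `start`, is multiplied by `compound` after each step and is
capped at `stop`.

```python
### Compounding schedule, illustrated
def compounding_schedule(start: float, stop: float, compound: float):
    # Yield start, start * compound, start * compound**2, ... capped at stop
    value = start
    while True:
        yield min(value, stop)
        value *= compound

batch_sizes = compounding_schedule(start=100, stop=1000, compound=1.001)
print([round(next(batch_sizes), 1) for _ in range(4)])
# [100, 100.1, 100.2, 100.3] -- the batch size grows gradually toward 1000
```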
### Model architectures {#model-architectures}
<!-- TODO: refer to architectures API: /api/architectures. This should document the architectures in spacy/ml/models -->
## Custom model implementations and architectures {#custom-models}
<!-- TODO: document some basic examples for custom models, refer to Thinc, refer to example config/project -->
<Project id="some_example_project">
Lorem ipsum dolor sit amet, consectetur adipiscing elit. Phasellus interdum
sodales lectus, ut sodales orci ullamcorper id. Sed condimentum neque ut erat
mattis pretium.
</Project>
### Training with custom code
<!-- TODO: document usage of spacy train with --code -->
<!-- TODO: link to type annotations and maybe show example: https://thinc.ai/docs/usage-config#advanced-types -->
<!-- TODO: how do we document the default configs? -->
## Transfer learning {#transfer-learning}
@ -245,6 +266,101 @@ visualize your model.
<!-- TODO: document spacy pretrain -->
## Custom model implementations and architectures {#custom-models}
<!-- TODO: intro, should summarise what spaCy v3 can do and that you can now use fully custom implementations, models defined in PyTorch and TF, etc. etc. -->
### Training with custom code {#custom-code}
The [`spacy train`](/api/cli#train) command lets you specify an optional argument
`--code` that points to a Python file. The file is imported before training and
allows you to add custom functions and architectures to the function registry
that can then be referenced from your `config.cfg`. This lets you train spaCy
models with custom components, without having to re-implement the whole training
workflow.
For example, let's say you've implemented your own batch size schedule to use
during training. The `@spacy.registry.schedules` decorator lets you register
that function in the `schedules` [registry](/api/top-level#registry) and assign
it a string name:
> #### Why the version in the name?
>
> A big benefit of the config system is that it makes your experiments
> reproducible. We recommend versioning the functions you register, especially
> if you expect them to change (like a new model architecture). This way, you
> know that a config referencing `v1` means a different function than a config
> referencing `v2`.
```python
### functions.py
import spacy
@spacy.registry.schedules("my_custom_schedule.v1")
def my_custom_schedule(start: int = 1, factor: float = 1.001):
    # Yield an endless sequence of compounding batch sizes
    while True:
        yield start
        start = start * factor
```
In your config, you can now reference the schedule in the
`[training.batch_size]` block via `@schedules`. If a block contains a key
starting with an `@`, it's interpreted as a reference to a function. All other
settings in the block will be passed to the function as keyword arguments. Keep
in mind that the config shouldn't have any hidden defaults and all arguments on
the functions need to be represented in the config.
<!-- TODO: this needs to be updated once we've decided on a workflow for "fill config" -->
```ini
### config.cfg (excerpt)
[training.batch_size]
@schedules = "my_custom_schedule.v1"
start = 2
factor = 1.005
```
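If you want to double-check what values this schedule will feed into training
with the settings above, one quick, purely illustrative option is to import and
call the function directly:

```python
### Quick sanity check (illustrative)
from functions import my_custom_schedule  # the module shown above

schedule = my_custom_schedule(start=2, factor=1.005)
print([round(next(schedule), 2) for _ in range(3)])
# [2, 2.01, 2.02] -- the values that would be used as batch sizes
```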
You can now run [`spacy train`](/api/cli#train) with the `config.cfg` and your
custom `functions.py` as the argument `--code`. Before loading the config, spaCy
will import the `functions.py` module and your custom functions will be
registered.
```bash
### Training with custom code {wrap="true"}
python -m spacy train train.spacy dev.spacy config.cfg --output ./output --code ./functions.py
```
<Infobox title="Tip: Use Python type hints" emoji="💡">
spaCy's configs are powered by our machine learning library Thinc's
[configuration system](https://thinc.ai/docs/usage-config), which supports
[type hints](https://docs.python.org/3/library/typing.html) and even
[advanced type annotations](https://thinc.ai/docs/usage-config#advanced-types)
using [`pydantic`](https://github.com/samuelcolvin/pydantic). If your registered
function provides type hints, the values passed in from the config are checked
against the expected types. For example, `start: int` in the example above will
ensure that the value received as the argument `start` is an integer. If the
value can't be cast to an integer, spaCy will raise an error.
`start: pydantic.StrictInt` will force the value to be an integer and raise an
error if it's not, for instance if your config defines a float.
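As a hypothetical illustration (the name `my_strict_schedule.v1` is made up for
this example), a stricter variant of the schedule above could annotate its
argument with `pydantic.StrictInt`:

```python
### functions.py (hypothetical strict variant)
import pydantic
import spacy

@spacy.registry.schedules("my_strict_schedule.v1")
def my_strict_schedule(start: pydantic.StrictInt = 2, factor: float = 1.005):
    # A config value like start = 2.5 now fails validation instead of
    # being cast to an integer.
    while True:
        yield start
        start = start * factor
```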
</Infobox>
### Defining custom architectures {#custom-architectures}
<!-- TODO: this could maybe be a more general example of using Thinc to compose some layers? We don't want to go too deep here and probably want to focus on a simple architecture example to show how it works -->
### Wrapping PyTorch and TensorFlow {#custom-frameworks}
<!-- TODO: -->
<Project id="example_pytorch_model">
Lorem ipsum dolor sit amet, consectetur adipiscing elit. Phasellus interdum
sodales lectus, ut sodales orci ullamcorper id. Sed condimentum neque ut erat
mattis pretium.
</Project>
## Parallel Training with Ray {#parallel-training}
<!-- TODO: document Ray integration -->

View File

@ -186,7 +186,7 @@ underlying [`Lexeme`](/api/lexeme), while [`Doc.vector`](/api/doc#vector) and
tokens. You can customize these behaviors by modifying the `doc.user_hooks`,
`doc.user_span_hooks` and `doc.user_token_hooks` dictionaries.
<Infobox title="📖 Custom user hooks">
<Infobox title="Custom user hooks" emoji="📖">
For more details on **adding hooks** and **overwriting** the built-in `Doc`,
`Span` and `Token` methods, see the usage guide on

View File

@ -5,7 +5,7 @@ import classNames from 'classnames'
import Icon from './icon'
import classes from '../styles/infobox.module.sass'
const Infobox = ({ title, id, variant, className, children }) => {
const Infobox = ({ title, emoji, id, variant, className, children }) => {
const infoboxClassNames = classNames(classes.root, className, {
[classes.warning]: variant === 'warning',
[classes.danger]: variant === 'danger',
@ -17,7 +17,14 @@ const Infobox = ({ title, id, variant, className, children }) => {
{variant !== 'default' && (
<Icon width={18} name={variant} inline className={classes.icon} />
)}
<span className={classes.titleText}>{title}</span>
<span className={classes.titleText}>
{emoji && (
<span className={classes.emoji} aria-hidden="true">
{emoji}
</span>
)}
{title}
</span>
</h4>
)}
{children}

View File

@ -27,9 +27,9 @@ function getCellContent(children) {
}
function isDividerRow(children) {
if (children.length && children[0].props.name == 'td') {
if (children.length && children[0].props && children[0].props.name == 'td') {
const tdChildren = children[0].props.children
if (!Array.isArray(tdChildren)) {
if (!Array.isArray(tdChildren) && tdChildren.props) {
return tdChildren.props.name === 'em'
}
}

View File

@ -31,6 +31,9 @@
position: relative
bottom: -2px
.emoji
margin-right: 0.65em
.warning
--color-theme: var(--color-yellow-dark)
--color-theme-dark: var(--color-yellow-dark)

View File

@ -25,7 +25,7 @@
--line-height-sm: 1.375
--line-height-md: 1.5
--line-height-lg: 1.9
--line-height-code: 1.8
--line-height-code: 1.7
// Spacing
--spacing-xs: 1rem
@ -271,7 +271,7 @@ body
color: var(--color-front)
p
margin-bottom: var(--spacing-md)
margin-bottom: var(--spacing-sm)
font-family: var(--font-primary)
font-size: var(--font-size-md)
line-height: var(--line-height-md)

View File

@ -15,14 +15,14 @@ const Project = ({ id, repo, children }) => {
const url = `${repo || DEFAULT_REPO}/${id}`
const title = (
<>
🪐 Get started with a project template:{' '}
Get started with a project template:{' '}
<Link to={url}>
<InlineCode>{id}</InlineCode>
</Link>
</>
)
return (
<Infobox title={title}>
<Infobox title={title} emoji="🪐">
{children}
<CopyInput text={text} prefix="$" />
</Infobox>