diff --git a/website/docs/api/cli.md b/website/docs/api/cli.md
index 5dfe567b3..10ab2083e 100644
--- a/website/docs/api/cli.md
+++ b/website/docs/api/cli.md
@@ -16,6 +16,7 @@ menu:
- ['package', 'package']
- ['project', 'project']
- ['ray', 'ray']
+ - ['huggingface-hub', 'huggingface-hub']
---
spaCy's CLI provides a range of helpful commands for downloading and training
@@ -1276,3 +1277,49 @@ $ python -m spacy ray train [config_path] [--code] [--output] [--n-workers] [--a
| `--verbose`, `-V` | Display more information for debugging purposes. ~~bool (flag)~~ |
| `--help`, `-h` | Show help message and available arguments. ~~bool (flag)~~ |
| overrides | Config parameters to override. Should be options starting with `--` that correspond to the config section and value to override, e.g. `--paths.train ./train.spacy`. ~~Any (option/flag)~~ |
+
+## huggingface-hub {#huggingface-hub new="3.1"}
+
+The `spacy huggingface-hub` CLI includes commands for uploading your trained
+spaCy pipelines to the [Hugging Face Hub](https://huggingface.co/).
+
+> #### Installation
+>
+> ```cli
+> $ pip install spacy-huggingface-hub
+> $ huggingface-cli login
+> ```
+
+To use this command, you need the
+[`spacy-huggingface-hub`](https://github.com/explosion/spacy-huggingface-hub)
+package installed. Installing the package will automatically add the
+`huggingface-hub` command to the spaCy CLI.
+
+
+
+### huggingface-hub push {#huggingface-hub-push tag="command"}
+
+Push a spaCy pipeline to the Hugging Face Hub. Expects a `.whl` file packaged
+with [`spacy package`](/api/cli#package) and `--build wheel`. For more details,
+see the spaCy project [integration](/usage/projects#huggingface_hub).
+
+```cli
+$ python -m spacy huggingface-hub push [whl_path] [--org] [--msg] [--local-repo] [--verbose]
+```
+
+> #### Example
+>
+> ```cli
+> $ python -m spacy huggingface-hub push en_ner_fashion-0.0.0-py3-none-any.whl
+> ```
+
+| Name | Description |
+| -------------------- | ----------------------------------------------------------------------------------------------------------------------------------------------- |
+| `whl_path`           | The path to the `.whl` file packaged with [`spacy package`](/api/cli#package). ~~Path (positional)~~ |
+| `--org`, `-o` | Optional name of organization to which the pipeline should be uploaded. ~~str (option)~~ |
+| `--msg`, `-m` | Commit message to use for update. Defaults to `"Update spaCy pipeline"`. ~~str (option)~~ |
+| `--local-repo`, `-l` | Local path to the model repository (will be created if it doesn't exist). Defaults to `hub` in the current working directory. ~~Path (option)~~ |
+| `--verbose`, `-V` | Output additional info for debugging, e.g. the full generated hub metadata. ~~bool (flag)~~ |
+| **UPLOADS** | The pipeline to the hub. |
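+
+If you prefer to run the upload from Python rather than from the CLI, the
+package also exposes a `push` helper. A minimal sketch, assuming the
+`spacy_huggingface_hub.push` entry point and a returned dict that includes the
+live URL of the uploaded pipeline:
+
+```python
+from spacy_huggingface_hub import push
+
+# Push a wheel built with `spacy package --build wheel` (sketch)
+result = push("./en_ner_fashion-0.0.0-py3-none-any.whl")
+# Inspect where the pipeline now lives on the hub
+print(result["url"])
+```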
diff --git a/website/docs/api/entityrecognizer.md b/website/docs/api/entityrecognizer.md
index b237729be..601b644c1 100644
--- a/website/docs/api/entityrecognizer.md
+++ b/website/docs/api/entityrecognizer.md
@@ -82,7 +82,7 @@ shortcut for this and instantiate the component using its string name and
| `moves` | A list of transition names. Inferred from the data if set to `None`, which is the default. ~~Optional[List[str]]~~ |
| _keyword-only_ | |
| `update_with_oracle_cut_size` | During training, cut long sequences into shorter segments by creating intermediate states based on the gold-standard history. The model is not very sensitive to this parameter, so you usually won't need to change it. Defaults to `100`. ~~int~~ |
-| `incorrect_spans_key` | Identifies spans that are known to be incorrect entity annotations. The incorrect entity annotations can be stored in the span group, under this key. Defaults to `None`. ~~Optional[str]~~ |
+| `incorrect_spans_key` | Identifies spans that are known to be incorrect entity annotations. The incorrect entity annotations can be stored in the span group in [`Doc.spans`](/api/doc#spans), under this key. Defaults to `None`. ~~Optional[str]~~ |
## EntityRecognizer.\_\_call\_\_ {#call tag="method"}
diff --git a/website/docs/images/huggingface_hub.jpg b/website/docs/images/huggingface_hub.jpg
new file mode 100644
index 000000000..5618df020
Binary files /dev/null and b/website/docs/images/huggingface_hub.jpg differ
diff --git a/website/docs/images/prodigy_spans-manual.jpg b/website/docs/images/prodigy_spans-manual.jpg
new file mode 100644
index 000000000..d67f347e0
Binary files /dev/null and b/website/docs/images/prodigy_spans-manual.jpg differ
diff --git a/website/docs/usage/projects.md b/website/docs/usage/projects.md
index d30a50302..cb71f361b 100644
--- a/website/docs/usage/projects.md
+++ b/website/docs/usage/projects.md
@@ -49,6 +49,7 @@ production.
Serve your models and host APIs
Distributed and parallel training
Track your experiments and results
+Upload your pipelines to the Hugging Face Hub
### 1. Clone a project template {#clone}
@@ -1013,3 +1014,68 @@ creating variants of the config for a simple hyperparameter grid search and
logging the results.
+
+---
+
+### Hugging Face Hub {#huggingface_hub}
+
+The [Hugging Face Hub](https://huggingface.co/) lets you upload models and share
+them with others. It hosts models as Git-based repositories, which are storage
+spaces that can contain all your files. It supports versioning, branches and
+custom metadata out-of-the-box, and provides browser-based visualizers for
+exploring your models interactively, as well as an API for production use. The
+[`spacy-huggingface-hub`](https://github.com/explosion/spacy-huggingface-hub)
+package automatically adds the `huggingface-hub` command to your `spacy` CLI if
+it's installed.
+
+> #### Installation
+>
+> ```cli
+> $ pip install spacy-huggingface-hub
+> # Check that the CLI is registered
+> $ python -m spacy huggingface-hub --help
+> ```
+
+You can then upload any pipeline packaged with
+[`spacy package`](/api/cli#package). Make sure to set `--build wheel` to output
+a binary `.whl` file. The uploader will read all metadata from the pipeline
+package, including the auto-generated pretty `README.md` and the model details
+available in the `meta.json`. For examples, check out the
+[spaCy pipelines](https://huggingface.co/spacy) we've uploaded.
+
+```cli
+$ huggingface-cli login
+$ python -m spacy package ./en_ner_fashion ./output --build wheel
+$ cd ./output/en_ner_fashion-0.0.0/dist
+$ python -m spacy huggingface-hub push en_ner_fashion-0.0.0-py3-none-any.whl
+```
+
+After uploading, you will see the live URL of your pipeline package, as well as
+the direct URL to the model wheel that you can install via `pip install`. You'll also
+be able to test your pipeline interactively from your browser:
+
+![The interactive visualizer for an uploaded pipeline on the Hugging Face Hub](../images/huggingface_hub.jpg)
+
+In your `project.yml`, you can add a command that uploads your trained and
+packaged pipeline to the hub. You can either run this as a manual step, or
+automatically as part of a workflow. Make sure to set `--build wheel` when
+running `spacy package` to build a wheel file for your pipeline package.
+
+
+```yaml
+### project.yml
+- name: "push_to_hub"
+ help: "Upload the trained model to the Hugging Face Hub"
+ script:
+ - "python -m spacy huggingface-hub push packages/en_${vars.name}-${vars.version}/dist/en_${vars.name}-${vars.version}-py3-none-any.whl"
+ deps:
+ - "packages/en_${vars.name}-${vars.version}/dist/en_${vars.name}-${vars.version}-py3-none-any.whl"
+```
+
+Get started with uploading your models to the Hugging Face Hub using our project
+template. It trains a simple pipeline, packages it and uploads it if the
+packaged model has changed. This makes it easy to deploy your models end-to-end.
+
+
diff --git a/website/docs/usage/v3-1.md b/website/docs/usage/v3-1.md
new file mode 100644
index 000000000..da6fa6070
--- /dev/null
+++ b/website/docs/usage/v3-1.md
@@ -0,0 +1,309 @@
+---
+title: What's New in v3.1
+teaser: New features and how to upgrade
+menu:
+ - ['New Features', 'features']
+ - ['Upgrading Notes', 'upgrading']
+---
+
+## New Features {#features hidden="true"}
+
+It's been great to see the adoption of the new spaCy v3, which introduced
+[transformer-based](/usage/embeddings-transformers) pipelines, a new
+[config and training system](/usage/training) for reproducible experiments,
+[projects](/usage/projects) for end-to-end workflows, and many
+[other features](/usage/v3). Version 3.1 adds more on top of it, including the
+ability to use predicted annotations during training, a new `SpanCategorizer`
+component for predicting arbitrary and potentially overlapping spans, support
+for partial incorrect annotations in the entity recognizer, new trained
+pipelines for Catalan and Danish, as well as many bug fixes and improvements.
+
+### Using predicted annotations during training {#predicted-annotations-training}
+
+By default, components are updated in isolation during training, which means
+that they don't see the predictions of any earlier components in the pipeline.
+The new
+[`[training.annotating_components]`](/usage/training#annotating-components)
+config setting lets you specify pipeline components that should set annotations
+on the predicted docs during training. This makes it easy to use the predictions
+of a previous component in the pipeline as features for a subsequent component,
+e.g. the dependency labels in the tagger:
+
+```ini
+### config.cfg (excerpt) {highlight="7,12"}
+[nlp]
+pipeline = ["parser", "tagger"]
+
+[components.tagger.model.tok2vec.embed]
+@architectures = "spacy.MultiHashEmbed.v1"
+width = ${components.tagger.model.tok2vec.encode.width}
+attrs = ["NORM","DEP"]
+rows = [5000,2500]
+include_static_vectors = false
+
+[training]
+annotating_components = ["parser"]
+```
+
+This project shows how to use the `token.dep` attribute predicted by the parser
+as a feature for a subsequent tagger component in the pipeline.
+
+
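+The same behavior is available when updating a pipeline directly in Python. A
+minimal sketch, assuming v3.1's `annotates` argument to
+[`Language.update`](/api/language#update) and purely illustrative
+gold-standard tags:
+
+```python
+import spacy
+from spacy.training import Example
+
+nlp = spacy.load("en_core_web_sm")
+text = "I like eating pizza."
+# The gold-standard tags here are illustrative only
+example = Example.from_dict(
+    nlp.make_doc(text), {"tags": ["PRP", "VBP", "VBG", "NN", "."]}
+)
+# Let the parser set annotations on the predicted doc before the tagger
+# is updated, mirroring [training.annotating_components]
+losses = nlp.update([example], annotates=["parser"])
+```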
+
+### SpanCategorizer for predicting arbitrary and overlapping spans {#spancategorizer tag="experimental"}
+
+A common task in applied NLP is extracting spans of text from documents,
+including longer phrases or nested expressions. Named entity recognition isn't
+the right tool for this problem, since an entity recognizer typically predicts
+single token-based tags that are very sensitive to boundaries. This is effective
+for proper nouns and self-contained expressions, but less useful for other types
+of phrases or overlapping spans. The new
+[`SpanCategorizer`](/api/spancategorizer) component and
+[SpanCategorizer](/api/architectures#spancategorizer) architecture let you label
+arbitrary and potentially overlapping spans of text. A span categorizer
+consists of two parts: a [suggester function](/api/spancategorizer#suggesters)
+that proposes candidate spans, which may or may not overlap, and a labeler model
+that predicts zero or more labels for each candidate. The predicted spans are
+available via the [`Doc.spans`](/api/doc#spans) container.
+
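+For illustration, here is a minimal sketch of how the component fits into a
+pipeline. The label and example text are placeholders, and a real component
+would be trained on annotated spans first:
+
+```python
+import spacy
+
+nlp = spacy.blank("en")
+# "spancat" stores its predictions under the spans key "sc" by default
+spancat = nlp.add_pipe("spancat")
+spancat.add_label("FASHION_BRAND")  # placeholder label
+nlp.initialize()  # a real pipeline would be trained on annotated spans
+
+# Predicted spans, which may overlap, end up in the doc.spans container
+doc = nlp("light wash jeans by Acme Denim")
+for span in doc.spans["sc"]:
+    print(span.text, span.label_)
+```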
+
+[![Prodigy: the new manual spans annotation UI](../images/prodigy_spans-manual.jpg)](https://support.prodi.gy/t/3861)
+
+The upcoming version of our annotation tool [Prodigy](https://prodi.gy)
+(currently available as a [pre-release](https://support.prodi.gy/t/3861) for all
+users) features a [new workflow and UI](https://support.prodi.gy/t/3861) for
+annotating overlapping and nested spans. You can use it to create training data
+for spaCy's `SpanCategorizer` component.
+
+
+
+### Update the entity recognizer with partial incorrect annotations {#negative-samples}
+
+> #### config.cfg (excerpt)
+>
+> ```ini
+> [components.ner]
+> factory = "ner"
+> incorrect_spans_key = "incorrect_spans"
+> moves = null
+> update_with_oracle_cut_size = 100
+> ```
+
+The [`EntityRecognizer`](/api/entityrecognizer) can now be updated with known
+incorrect annotations, which lets you take advantage of partial and sparse data.
+For example, you'll be able to use the information that certain spans of text
+are definitely **not** `PERSON` entities, without having to provide the complete
+gold-standard annotations for the given example. The incorrect span annotations
+can be added to [`Doc.spans`](/api/doc#spans) in the training data under
+the key defined as [`incorrect_spans_key`](/api/entityrecognizer#init) in the
+component config.
+
+```python
+from spacy.tokens import Span
+
+train_doc = nlp.make_doc("Barack Obama was born in Hawaii.")
+# The doc.spans key can be defined in the config
+train_doc.spans["incorrect_spans"] = [
+    Span(train_doc, 0, 2, label="ORG"),
+    Span(train_doc, 5, 6, label="PRODUCT")
+]
+```
+
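+To update the entity recognizer with this information, pair the annotated doc
+with a fresh prediction doc as usual. A short sketch, assuming the `nlp` object
+and `train_doc` from above, with the `ner` component configured with the
+matching `incorrect_spans_key`:
+
+```python
+from spacy.training import Example
+
+# Example(predicted, reference): the reference doc carries the
+# known-incorrect spans under the configured key
+example = Example(nlp.make_doc(train_doc.text), train_doc)
+losses = nlp.update([example])
+```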
+
+### New pipeline packages for Catalan and Danish {#pipeline-packages}
+
+spaCy v3.1 adds 5 new pipeline packages, including a new core family for Catalan
+and a new transformer-based pipeline for Danish using the
+[`danish-bert-botxo`](http://huggingface.co/Maltehb/danish-bert-botxo) weights.
+See the [models directory](/models) for an overview of all available trained
+pipelines and the [training guide](/usage/training) for details on how to train
+your own.
+
+> Thanks to Carlos Rodríguez Penagos and the
+> [Barcelona Supercomputing Center](https://temu.bsc.es/) for their
+> contributions for Catalan and to Kenneth Enevoldsen for Danish. For additional
+> Danish pipelines, check out [DaCy](https://github.com/KennethEnevoldsen/DaCy).
+
+| Package | Language | UPOS | Parser LAS | NER F |
+| ------------------------------------------------- | -------- | ---: | ---------: | -----: |
+| [`ca_core_news_sm`](/models/ca#ca_core_news_sm) | Catalan | 98.2 | 87.4 | 79.8 |
+| [`ca_core_news_md`](/models/ca#ca_core_news_md) | Catalan | 98.3 | 88.2 | 84.0 |
+| [`ca_core_news_lg`](/models/ca#ca_core_news_lg) | Catalan | 98.5 | 88.4 | 84.2 |
+| [`ca_core_news_trf`](/models/ca#ca_core_news_trf) | Catalan | 98.9 | 93.0 | 91.2 |
+| [`da_core_news_trf`](/models/da#da_core_news_trf) | Danish | 98.0 | 85.0 | 82.9 |
+
+### Resizable text classification architectures {#resizable-textcat}
+
+Previously, the [`TextCategorizer`](/api/textcategorizer) architectures could
+not be resized, meaning that you couldn't add new labels to an already trained
+model. In spaCy v3.1, the [TextCatCNN](/api/architectures#TextCatCNN) and
+[TextCatBOW](/api/architectures#TextCatBOW) architectures are now resizable,
+while ensuring that the predictions for the old labels remain the same.
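+
+In practice, this means you can add a label to a trained text classifier and
+keep training it. A minimal sketch, where the pipeline path is hypothetical:
+
+```python
+import spacy
+
+# Hypothetical path to a pipeline trained with a resizable architecture,
+# e.g. TextCatBOW
+nlp = spacy.load("./my_textcat_pipeline")
+textcat = nlp.get_pipe("textcat")
+# Resizing preserves the predictions for the existing labels
+textcat.add_label("NEW_LABEL")
+```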
+
+### CLI command to assemble pipeline from config {#assemble}
+
+The [`spacy assemble`](/api/cli#assemble) command lets you assemble a pipeline
+from a config file without additional training. It can be especially useful for
+creating a blank pipeline with a custom tokenizer, rule-based components or word
+vectors.
+
+```cli
+$ python -m spacy assemble config.cfg ./output
+```
+
+### Pretty pipeline package READMEs {#package-readme}
+
+The [`spacy package`](/api/cli#package) command now auto-generates a pretty
+`README.md` based on the pipeline information defined in the `meta.json`. This
+includes a table with a general overview, as well as the label scheme and
+accuracy figures, if available. For an example, see the
+[model releases](https://github.com/explosion/spacy-models/releases).
+
+### Support for streaming large or infinite corpora {#streaming-corpora}
+
+> #### config.cfg (excerpt)
+>
+> ```ini
+> [training]
+> max_epochs = -1
+> ```
+
+The training process now supports streaming large or infinite corpora
+out-of-the-box, which can be controlled via the
+[`[training.max_epochs]`](/api/data-formats#training) config setting. Setting it
+to `-1` means that the train corpus should be streamed rather than loaded into
+memory, with no shuffling within the training loop. For details on how to
+implement a custom corpus loader, e.g. to stream in data from a remote storage,
+see the usage guide on
+[custom data reading](/usage/training#custom-code-readers-batchers).
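+
+As a sketch of what such a custom loader can look like, here's a registered
+reader that yields examples on the fly. The reader name and the `fetch_texts`
+helper are hypothetical:
+
+```python
+from typing import Callable, Iterable
+import spacy
+from spacy.language import Language
+from spacy.training import Example
+
+@spacy.registry.readers("stream_data.v1")  # hypothetical reader name
+def stream_data(source: str) -> Callable[[Language], Iterable[Example]]:
+    def generate_stream(nlp: Language) -> Iterable[Example]:
+        # fetch_texts is a hypothetical helper that streams raw examples
+        # in from remote storage
+        for text, cats in fetch_texts(source):
+            doc = nlp.make_doc(text)
+            yield Example.from_dict(doc, {"cats": cats})
+    return generate_stream
+```
+
+The reader can then be referenced in the config, e.g. via
+`@readers = "stream_data.v1"` in the `[corpora.train]` block.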
+
+When streaming a corpus, only the first 100 examples will be used for
+[initialization](/usage/training#config-lifecycle). This is no problem if you're
+training a component like the text classifier with data that specifies all
+available labels in every example. If necessary, you can use the
+[`init labels`](/api/cli#init-labels) command to pre-generate the labels for
+your components using a representative sample so the model can be initialized
+correctly before training.
+
+### New lemmatizers for Catalan and Italian {#pos-lemmatizers}
+
+The trained pipelines for [Catalan](/models/ca) and [Italian](/models/it) now
+include lemmatizers that use the predicted part-of-speech tags as part of the
+lookup lemmatization for higher lemmatization accuracy. If you're training your
+own pipelines for these languages and you want to include a lemmatizer, make
+sure you have the
+[`spacy-lookups-data`](https://github.com/explosion/spacy-lookups-data) package
+installed, which provides the relevant tables.
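+
+Once a pipeline with the new lemmatizer is installed, no extra setup is needed.
+A sketch, where the exact lemmas depend on the trained model and tables:
+
+```python
+import spacy
+
+# Assumes the pipeline was installed, e.g. via
+# python -m spacy download it_core_news_sm
+nlp = spacy.load("it_core_news_sm")
+doc = nlp("Le bambine giocano nel parco.")
+# The lemmatizer looks up lemmas based on the predicted POS tags
+print([(token.text, token.pos_, token.lemma_) for token in doc])
+```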
+
+### Upload your pipelines to the Hugging Face Hub {#huggingface-hub}
+
+The [Hugging Face Hub](https://huggingface.co/) lets you upload models and share
+them with others, and it now supports spaCy pipelines out-of-the-box. The new
+[`spacy-huggingface-hub`](https://github.com/explosion/spacy-huggingface-hub)
+package automatically adds the `huggingface-hub` command to your `spacy` CLI. It
+lets you upload any pipelines packaged with [`spacy package`](/api/cli#package)
+and `--build wheel` and takes care of auto-generating all required meta
+information.
+
+After uploading, you'll get a live URL for your model page that includes all
+details, files and interactive visualizers, as well as a direct URL to the wheel
+file that you can install via `pip install`. For examples, check out the
+[spaCy pipelines](https://huggingface.co/spacy) we've uploaded.
+
+```cli
+$ pip install spacy-huggingface-hub
+$ huggingface-cli login
+$ python -m spacy package ./en_ner_fashion ./output --build wheel
+$ cd ./output/en_ner_fashion-0.0.0/dist
+$ python -m spacy huggingface-hub push en_ner_fashion-0.0.0-py3-none-any.whl
+```
+
+You can also integrate the upload command into your
+[project template](/usage/projects#huggingface_hub) to automatically upload your
+packaged pipelines after training.
+
+Get started with uploading your models to the Hugging Face Hub using our project
+template. It trains a simple pipeline, packages it and uploads it if the
+packaged model has changed. This makes it easy to deploy your models end-to-end.
+
+## Notes about upgrading from v3.0 {#upgrading}
+
+### Pipeline package version compatibility {#version-compat}
+
+> #### Using legacy implementations
+>
+> In spaCy v3, you'll still be able to load and reference legacy implementations
+> via [`spacy-legacy`](https://github.com/explosion/spacy-legacy), even if the
+> components or architectures change and newer versions are available in the
+> core library.
+
+When you're loading a pipeline package trained with spaCy v3.0, you will see a
+warning telling you that the pipeline may be incompatible. This doesn't
+necessarily mean that it is, but we recommend running your pipelines against
+your test suite or evaluation data to make sure there are no unexpected results.
+If you're using one of the [trained pipelines](/models) we provide, you should
+run [`spacy download`](/api/cli#download) to update to the latest version. To
+see an overview of all installed packages and their compatibility, you can run
+[`spacy validate`](/api/cli#validate).
+
+If you've trained your own custom pipeline and you've confirmed that it's still
+working as expected, you can update the spaCy version requirements in the
+[`meta.json`](/api/data-formats#meta):
+
+```diff
+- "spacy_version": ">=3.0.0,<3.1.0",
++ "spacy_version": ">=3.0.0,<3.2.0",
+```
+
+### Updating v3.0 configs
+
+To update a config from spaCy v3.0 with the new v3.1 settings, run
+[`init fill-config`](/api/cli#init-fill-config):
+
+```cli
+$ python -m spacy init fill-config config-v3.0.cfg config-v3.1.cfg
+```
+
+In many cases (`spacy train`, `spacy.load()`), the new defaults will be filled
+in automatically, but you'll need to fill in the new settings to run
+[`debug config`](/api/cli#debug) and [`debug data`](/api/cli#debug-data).
+
+### Sourcing pipeline components with vectors {#source-vectors}
+
+If you're sourcing a pipeline component that requires static vectors (for
+example, a tagger or parser from an `md` or `lg` pretrained pipeline), be sure
+to include the source model's vectors in the setting `[initialize.vectors]`. In
+spaCy v3.0, a bug allowed vectors to be loaded implicitly through `source`. In
+v3.1, this setting must be provided explicitly as `[initialize.vectors]`:
+
+```ini
+### config.cfg (excerpt)
+[components.ner]
+source = "en_core_web_md"
+
+[initialize]
+vectors = "en_core_web_md"
+```
+
+Each pipeline can only store one set of static vectors, so it's not possible to
+assemble a pipeline with components that were trained on different static
+vectors.
+
+[`spacy train`](/api/cli#train) and [`spacy assemble`](/api/cli#assemble) will
+provide warnings if the source and target pipelines don't contain the same
+vectors. If you are sourcing a rule-based component like an entity ruler or
+lemmatizer that does not use the vectors as a model feature, then this warning
+can be safely ignored.
diff --git a/website/meta/sidebars.json b/website/meta/sidebars.json
index 6b2850187..6fe09f052 100644
--- a/website/meta/sidebars.json
+++ b/website/meta/sidebars.json
@@ -9,7 +9,8 @@
{ "text": "Models & Languages", "url": "/usage/models" },
{ "text": "Facts & Figures", "url": "/usage/facts-figures" },
{ "text": "spaCy 101", "url": "/usage/spacy-101" },
- { "text": "New in v3.0", "url": "/usage/v3" }
+ { "text": "New in v3.0", "url": "/usage/v3" },
+ { "text": "New in v3.1", "url": "/usage/v3-1" }
]
},
{
@@ -136,9 +137,7 @@
},
{
"label": "Legacy",
- "items": [
- { "text": "Legacy functions", "url": "/api/legacy" }
- ]
+ "items": [{ "text": "Legacy functions", "url": "/api/legacy" }]
}
]
}
diff --git a/website/src/components/code.js b/website/src/components/code.js
index 4dd7a8eb8..6e9f0c22e 100644
--- a/website/src/components/code.js
+++ b/website/src/components/code.js
@@ -14,7 +14,7 @@ import GitHubCode from './github'
import classes from '../styles/code.module.sass'
const WRAP_THRESHOLD = 30
-const CLI_GROUPS = ['init', 'debug', 'project', 'ray']
+const CLI_GROUPS = ['init', 'debug', 'project', 'ray', 'huggingface-hub']
export default props => (