diff --git a/website/docs/api/cli.md b/website/docs/api/cli.md
index 5dfe567b3..10ab2083e 100644
--- a/website/docs/api/cli.md
+++ b/website/docs/api/cli.md
@@ -16,6 +16,7 @@ menu:
   - ['package', 'package']
   - ['project', 'project']
   - ['ray', 'ray']
+  - ['huggingface-hub', 'huggingface-hub']
 ---

 spaCy's CLI provides a range of helpful commands for downloading and training
@@ -1276,3 +1277,49 @@ $ python -m spacy ray train [config_path] [--code] [--output] [--n-workers] [--a
 | `--verbose`, `-V` | Display more information for debugging purposes. ~~bool (flag)~~ |
 | `--help`, `-h` | Show help message and available arguments. ~~bool (flag)~~ |
 | overrides | Config parameters to override. Should be options starting with `--` that correspond to the config section and value to override, e.g. `--paths.train ./train.spacy`. ~~Any (option/flag)~~ |
+
+## huggingface-hub {#huggingface-hub new="3.1"}
+
+The `spacy huggingface-hub` CLI includes commands for uploading your trained
+spaCy pipelines to the [Hugging Face Hub](https://huggingface.co/).
+
+> #### Installation
+>
+> ```cli
+> $ pip install spacy-huggingface-hub
+> $ huggingface-cli login
+> ```
+
+To use this command, you need the
+[`spacy-huggingface-hub`](https://github.com/explosion/spacy-huggingface-hub)
+package installed. Installing the package will automatically add the
+`huggingface-hub` command to the spaCy CLI.
+
+### huggingface-hub push {#huggingface-hub-push tag="command"}
+
+Push a spaCy pipeline to the Hugging Face Hub. Expects a `.whl` file packaged
+with [`spacy package`](/api/cli#package) and `--build wheel`. For more details,
+see the spaCy project [integration](/usage/projects#huggingface_hub).
+
+```cli
+$ python -m spacy huggingface-hub push [whl_path] [--org] [--msg] [--local-repo] [--verbose]
+```
+
+> #### Example
+>
+> ```cli
+> $ python -m spacy huggingface-hub push en_ner_fashion-0.0.0-py3-none-any.whl
+> ```
+
+| Name                 | Description                                                                                                                                     |
+| -------------------- | ----------------------------------------------------------------------------------------------------------------------------------------------- |
+| `whl_path`           | The path to the `.whl` file packaged with [`spacy package`](https://spacy.io/api/cli#package). ~~Path (positional)~~                              |
+| `--org`, `-o`        | Optional name of the organization to which the pipeline should be uploaded. ~~str (option)~~                                                      |
+| `--msg`, `-m`        | Commit message to use for the update. Defaults to `"Update spaCy pipeline"`. ~~str (option)~~                                                     |
+| `--local-repo`, `-l` | Local path to the model repository (will be created if it doesn't exist). Defaults to `hub` in the current working directory. ~~Path (option)~~   |
+| `--verbose`, `-V`    | Output additional info for debugging, e.g. the full generated hub metadata. ~~bool (flag)~~                                                       |
+| **UPLOADS**          | The pipeline to the hub.                                                                                                                          |
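+
+If you'd rather upload from Python, the package also exposes a `push` helper.
+A minimal sketch, assuming the wheel from the example above has been built:
+
+```python
+from spacy_huggingface_hub import push
+
+# Push a wheel built with `spacy package --build wheel`
+result = push("en_ner_fashion-0.0.0-py3-none-any.whl")
+# The returned details include the URL of the hosted model page
+print(result["url"])
+```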
diff --git a/website/docs/api/entityrecognizer.md b/website/docs/api/entityrecognizer.md
index b237729be..601b644c1 100644
--- a/website/docs/api/entityrecognizer.md
+++ b/website/docs/api/entityrecognizer.md
@@ -82,7 +82,7 @@ shortcut for this and instantiate the component using its string name and
 | `moves` | A list of transition names. Inferred from the data if set to `None`, which is the default. ~~Optional[List[str]]~~ |
 | _keyword-only_ | |
 | `update_with_oracle_cut_size` | During training, cut long sequences into shorter segments by creating intermediate states based on the gold-standard history. The model is not very sensitive to this parameter, so you usually won't need to change it. Defaults to `100`. ~~int~~ |
-| `incorrect_spans_key` | Identifies spans that are known to be incorrect entity annotations. The incorrect entity annotations can be stored in the span group, under this key. Defaults to `None`. ~~Optional[str]~~ |
+| `incorrect_spans_key` | Identifies spans that are known to be incorrect entity annotations. The incorrect entity annotations can be stored in the span group in [`Doc.spans`](/api/doc#spans), under this key. Defaults to `None`. ~~Optional[str]~~ |

 ## EntityRecognizer.\_\_call\_\_ {#call tag="method"}
diff --git a/website/docs/images/huggingface_hub.jpg b/website/docs/images/huggingface_hub.jpg
new file mode 100644
index 000000000..5618df020
Binary files /dev/null and b/website/docs/images/huggingface_hub.jpg differ
diff --git a/website/docs/images/prodigy_spans-manual.jpg b/website/docs/images/prodigy_spans-manual.jpg
new file mode 100644
index 000000000..d67f347e0
Binary files /dev/null and b/website/docs/images/prodigy_spans-manual.jpg differ
diff --git a/website/docs/usage/projects.md b/website/docs/usage/projects.md
index d30a50302..cb71f361b 100644
--- a/website/docs/usage/projects.md
+++ b/website/docs/usage/projects.md
@@ -49,6 +49,7 @@ production.
 Serve your models and host APIs
 Distributed and parallel training
 Track your experiments and results
+Upload your pipelines to the Hugging Face Hub

 ### 1. Clone a project template {#clone}
@@ -1013,3 +1014,68 @@ creating variants of the config for a simple hyperparameter grid search and
 logging the results.
+
+---
+
+### Hugging Face Hub {#huggingface_hub}
+
+The [Hugging Face Hub](https://huggingface.co/) lets you upload models and
+share them with others. It hosts models as Git-based repositories, which are
+storage spaces that can contain all your files. It supports versioning,
+branches and custom metadata out-of-the-box, and provides browser-based
+visualizers for exploring your models interactively, as well as an API for
+production use. The
+[`spacy-huggingface-hub`](https://github.com/explosion/spacy-huggingface-hub)
+package automatically adds the `huggingface-hub` command to your `spacy` CLI if
+it's installed.
+
+> #### Installation
+>
+> ```cli
+> $ pip install spacy-huggingface-hub
+> # Check that the CLI is registered
+> $ python -m spacy huggingface-hub --help
+> ```
+
+You can then upload any pipeline packaged with
+[`spacy package`](/api/cli#package). Make sure to set `--build wheel` to output
+a binary `.whl` file. The uploader will read all metadata from the pipeline
+package, including the auto-generated pretty `README.md` and the model details
+available in the `meta.json`. For examples, check out the
+[spaCy pipelines](https://huggingface.co/spacy) we've uploaded.
+
+```cli
+$ huggingface-cli login
+$ python -m spacy package ./en_ner_fashion ./output --build wheel
+$ cd ./output/en_ner_fashion-0.0.0/dist
+$ python -m spacy huggingface-hub push en_ner_fashion-0.0.0-py3-none-any.whl
+```
+
+After uploading, you will see the live URL of your pipeline packages, as well
+as the direct URL to the model wheel you can install via `pip install`. You'll
+also be able to test your pipeline interactively from your browser:
+
+![Screenshot: interactive NER visualizer](../images/huggingface_hub.jpg)
+
+In your `project.yml`, you can add a command that uploads your trained and
+packaged pipeline to the hub. You can either run this as a manual step, or
+automatically as part of a workflow. Make sure to set `--build wheel` when
+running `spacy package` to build a wheel file for your pipeline package.
+
+```yaml
+### project.yml
+- name: "push_to_hub"
+  help: "Upload the trained model to the Hugging Face Hub"
+  script:
+    - "python -m spacy huggingface-hub push packages/en_${vars.name}-${vars.version}/dist/en_${vars.name}-${vars.version}-py3-none-any.whl"
+  deps:
+    - "packages/en_${vars.name}-${vars.version}/dist/en_${vars.name}-${vars.version}-py3-none-any.whl"
+```
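+
+With the command defined, you can run the upload like any other project step
+(a sketch, assuming the `project.yml` entry above and a built wheel):
+
+```cli
+$ python -m spacy project run push_to_hub
+```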
+
+Get started with uploading your models to the Hugging Face Hub using our
+project template. It trains a simple pipeline, packages it and uploads it if
+the packaged model has changed. This makes it easy to deploy your models
+end-to-end.
diff --git a/website/docs/usage/v3-1.md b/website/docs/usage/v3-1.md
new file mode 100644
index 000000000..da6fa6070
--- /dev/null
+++ b/website/docs/usage/v3-1.md
@@ -0,0 +1,309 @@
+---
+title: What's New in v3.1
+teaser: New features and how to upgrade
+menu:
+  - ['New Features', 'features']
+  - ['Upgrading Notes', 'upgrading']
+---
+
+## New Features {#features hidden="true"}
+
+It's been great to see the adoption of the new spaCy v3, which introduced
+[transformer-based](/usage/embeddings-transformers) pipelines, a new
+[config and training system](/usage/training) for reproducible experiments,
+[projects](/usage/projects) for end-to-end workflows, and many
+[other features](/usage/v3). Version 3.1 adds more on top of it, including the
+ability to use predicted annotations during training, a new `SpanCategorizer`
+component for predicting arbitrary and potentially overlapping spans, support
+for partial incorrect annotations in the entity recognizer, new trained
+pipelines for Catalan and Danish, as well as many bug fixes and improvements.
+
+### Using predicted annotations during training {#predicted-annotations-training}
+
+By default, components are updated in isolation during training, which means
+that they don't see the predictions of any earlier components in the pipeline.
+The new
+[`[training.annotating_components]`](/usage/training#annotating-components)
+config setting lets you specify pipeline components that should set annotations
+on the predicted docs during training. This makes it easy to use the
+predictions of a previous component in the pipeline as features for a
+subsequent component, e.g. the dependency labels in the tagger:
+
+```ini
+### config.cfg (excerpt) {highlight="7,12"}
+[nlp]
+pipeline = ["parser", "tagger"]
+
+[components.tagger.model.tok2vec.embed]
+@architectures = "spacy.MultiHashEmbed.v1"
+width = ${components.tagger.model.tok2vec.encode.width}
+attrs = ["NORM","DEP"]
+rows = [5000,2500]
+include_static_vectors = false
+
+[training]
+annotating_components = ["parser"]
+```
+
+This project shows how to use the `token.dep` attribute predicted by the
+parser as a feature for a subsequent tagger component in the pipeline.
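+
+If you're writing your own training loop instead of using `spacy train`, the
+same behavior is available per update. A minimal sketch, assuming `nlp` is a
+pipeline with a trained parser, `examples` is a list of
+[`Example`](/api/example) objects, and you're on v3.1+, where
+[`Language.update`](/api/language#update) accepts an `annotates` argument:
+
+```python
+# Run the parser over the docs first, so the components updated after it
+# (here, the tagger) see its predicted annotations as features
+optimizer = nlp.resume_training()
+losses = nlp.update(examples, sgd=optimizer, annotates=["parser"])
+print(losses)
+```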
+
+### SpanCategorizer for predicting arbitrary and overlapping spans {#spancategorizer tag="experimental"}
+
+A common task in applied NLP is extracting spans of text from documents,
+including longer phrases or nested expressions. Named entity recognition isn't
+the right tool for this problem, since an entity recognizer typically predicts
+single token-based tags that are very sensitive to boundaries. This is
+effective for proper nouns and self-contained expressions, but less useful for
+other types of phrases or overlapping spans. The new
+[`SpanCategorizer`](/api/spancategorizer) component and
+[SpanCategorizer](/api/architectures#spancategorizer) architecture let you
+label arbitrary and potentially overlapping spans of text. A span categorizer
+consists of two parts: a [suggester function](/api/spancategorizer#suggesters)
+that proposes candidate spans, which may or may not overlap, and a labeler
+model that predicts zero or more labels for each candidate. The predicted
+spans are available via the [`Doc.spans`](/api/doc#spans) container.
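+
+Once you've trained a pipeline with a span categorizer, the predictions can be
+read straight off the doc. A sketch — `en_spancat_demo` is a hypothetical
+pipeline name, and `"sc"` is the component's default spans key:
+
+```python
+import spacy
+
+nlp = spacy.load("en_spancat_demo")  # hypothetical trained pipeline
+doc = nlp("Our summer collection features linen blazers in sand and stone.")
+# Predicted spans are grouped in doc.spans under the configured spans_key
+for span in doc.spans["sc"]:
+    print(span.text, span.label_)
+```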
+
+[![Prodigy: example of the new manual spans UI](../images/prodigy_spans-manual.jpg)](https://support.prodi.gy/t/3861)
+
+The upcoming version of our annotation tool [Prodigy](https://prodi.gy)
+(currently available as a [pre-release](https://support.prodi.gy/t/3861) for
+all users) features a [new workflow and UI](https://support.prodi.gy/t/3861)
+for annotating overlapping and nested spans. You can use it to create training
+data for spaCy's `SpanCategorizer` component.
+
+### Update the entity recognizer with partial incorrect annotations {#negative-samples}
+
+> #### config.cfg (excerpt)
+>
+> ```ini
+> [components.ner]
+> factory = "ner"
+> incorrect_spans_key = "incorrect_spans"
+> moves = null
+> update_with_oracle_cut_size = 100
+> ```
+
+The [`EntityRecognizer`](/api/entityrecognizer) can now be updated with known
+incorrect annotations, which lets you take advantage of partial and sparse
+data. For example, you'll be able to use the information that certain spans of
+text are definitely **not** `PERSON` entities, without having to provide the
+complete gold-standard annotations for the given example. The incorrect span
+annotations can be added via the [`Doc.spans`](/api/doc#spans) in the training
+data under the key defined as [`incorrect_spans_key`](/api/entityrecognizer#init)
+in the component config.
+
+```python
+import spacy
+from spacy.tokens import Span
+
+nlp = spacy.blank("en")
+train_doc = nlp.make_doc("Barack Obama was born in Hawaii.")
+# The doc.spans key can be defined in the config
+train_doc.spans["incorrect_spans"] = [
+    Span(train_doc, 0, 2, label="ORG"),
+    Span(train_doc, 5, 6, label="PRODUCT")
+]
+```
+
+### New pipeline packages for Catalan and Danish {#pipeline-packages}
+
+spaCy v3.1 adds 5 new pipeline packages, including a new core family for
+Catalan and a new transformer-based pipeline for Danish using the
+[`danish-bert-botxo`](http://huggingface.co/Maltehb/danish-bert-botxo) weights.
+See the [models directory](/models) for an overview of all available trained
+pipelines and the [training guide](/usage/training) for details on how to
+train your own.
+
+> Thanks to Carlos Rodríguez Penagos and the
+> [Barcelona Supercomputing Center](https://temu.bsc.es/) for their
+> contributions for Catalan and to Kenneth Enevoldsen for Danish. For
+> additional Danish pipelines, check out
+> [DaCy](https://github.com/KennethEnevoldsen/DaCy).
+
+| Package                                           | Language | UPOS | Parser LAS | NER F |
+| ------------------------------------------------- | -------- | ---: | ---------: | ----: |
+| [`ca_core_news_sm`](/models/ca#ca_core_news_sm)   | Catalan  | 98.2 |       87.4 |  79.8 |
+| [`ca_core_news_md`](/models/ca#ca_core_news_md)   | Catalan  | 98.3 |       88.2 |  84.0 |
+| [`ca_core_news_lg`](/models/ca#ca_core_news_lg)   | Catalan  | 98.5 |       88.4 |  84.2 |
+| [`ca_core_news_trf`](/models/ca#ca_core_news_trf) | Catalan  | 98.9 |       93.0 |  91.2 |
+| [`da_core_news_trf`](/models/da#da_core_news_trf) | Danish   | 98.0 |       85.0 |  82.9 |
+
+### Resizable text classification architectures {#resizable-textcat}
+
+Previously, the [`TextCategorizer`](/api/textcategorizer) architectures could
+not be resized, meaning that you couldn't add new labels to an already trained
+model. In spaCy v3.1, the [TextCatCNN](/api/architectures#TextCatCNN) and
+[TextCatBOW](/api/architectures#TextCatBOW) architectures are now resizable,
+while ensuring that the predictions for the old labels remain the same.
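+
+In practice, this means you can now add a label to a trained text categorizer
+without invalidating its existing predictions. A sketch, assuming `nlp` is a
+loaded pipeline with a `textcat` component using one of these architectures:
+
+```python
+textcat = nlp.get_pipe("textcat")
+# Resizes the output layer; scores for the existing labels are preserved
+textcat.add_label("FEEDBACK")
+```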
+
+### CLI command to assemble pipeline from config {#assemble}
+
+The [`spacy assemble`](/api/cli#assemble) command lets you assemble a pipeline
+from a config file without additional training. It can be especially useful
+for creating a blank pipeline with a custom tokenizer, rule-based components
+or word vectors.
+
+```cli
+$ python -m spacy assemble config.cfg ./output
+```
+
+### Pretty pipeline package READMEs {#package-readme}
+
+The [`spacy package`](/api/cli#package) command now auto-generates a pretty
+`README.md` based on the pipeline information defined in the `meta.json`. This
+includes a table with a general overview, as well as the label scheme and
+accuracy figures, if available. For an example, see the
+[model releases](https://github.com/explosion/spacy-models/releases).
+
+### Support for streaming large or infinite corpora {#streaming-corpora}
+
+> #### config.cfg (excerpt)
+>
+> ```ini
+> [training]
+> max_epochs = -1
+> ```
+
+The training process now supports streaming large or infinite corpora
+out-of-the-box, which can be controlled via the
+[`[training.max_epochs]`](/api/data-formats#training) config setting. Setting
+it to `-1` means that the train corpus should be streamed rather than loaded
+into memory, and that no shuffling is performed within the training loop. For
+details on how to implement a custom corpus loader, e.g. to stream in data
+from a remote storage, see the usage guide on
+[custom data reading](/usage/training#custom-code-readers-batchers).
+
+When streaming a corpus, only the first 100 examples will be used for
+[initialization](/usage/training#config-lifecycle). This is no problem if
+you're training a component like the text classifier with data that specifies
+all available labels in every example. If necessary, you can use the
+[`init labels`](/api/cli#init-labels) command to pre-generate the labels for
+your components using a representative sample so the model can be initialized
+correctly before training.
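+
+A custom streaming reader is just a registered function that returns a
+callable yielding [`Example`](/api/example) objects. A minimal sketch, where
+`fetch_texts()` is a hypothetical helper that streams raw texts from remote
+storage:
+
+```python
+import spacy
+from spacy.training import Example
+
+@spacy.registry.readers("stream_data.v1")
+def stream_data():
+    def generate_stream(nlp):
+        for text in fetch_texts():  # hypothetical remote data source
+            doc = nlp.make_doc(text)
+            # Real training data would carry reference annotations
+            yield Example(doc, doc)
+    return generate_stream
+```
+
+The reader can then be referenced from the config, e.g. in `[corpora.train]`
+via `@readers = "stream_data.v1"`.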
+
+### New lemmatizers for Catalan and Italian {#pos-lemmatizers}
+
+The trained pipelines for [Catalan](/models/ca) and [Italian](/models/it) now
+include lemmatizers that use the predicted part-of-speech tags as part of the
+lookup lemmatization for higher lemmatization accuracy. If you're training
+your own pipelines for these languages and you want to include a lemmatizer,
+make sure you have the
+[`spacy-lookups-data`](https://github.com/explosion/spacy-lookups-data)
+package installed, which provides the relevant tables.
+
+### Upload your pipelines to the Hugging Face Hub {#huggingface-hub}
+
+The [Hugging Face Hub](https://huggingface.co/) lets you upload models and
+share them with others, and it now supports spaCy pipelines out-of-the-box.
+The new
+[`spacy-huggingface-hub`](https://github.com/explosion/spacy-huggingface-hub)
+package automatically adds the `huggingface-hub` command to your `spacy` CLI.
+It lets you upload any pipelines packaged with
+[`spacy package`](/api/cli#package) and `--build wheel` and takes care of
+auto-generating all required meta information.
+
+After uploading, you'll get a live URL for your model page that includes all
+details, files and interactive visualizers, as well as a direct URL to the
+wheel file that you can install via `pip install`. For examples, check out the
+[spaCy pipelines](https://huggingface.co/spacy) we've uploaded.
+
+```cli
+$ pip install spacy-huggingface-hub
+$ huggingface-cli login
+$ python -m spacy package ./en_ner_fashion ./output --build wheel
+$ cd ./output/en_ner_fashion-0.0.0/dist
+$ python -m spacy huggingface-hub push en_ner_fashion-0.0.0-py3-none-any.whl
+```
+
+You can also integrate the upload command into your
+[project template](/usage/projects#huggingface_hub) to automatically upload
+your packaged pipelines after training.
+
+Get started with uploading your models to the Hugging Face Hub using our
+project template. It trains a simple pipeline, packages it and uploads it if
+the packaged model has changed. This makes it easy to deploy your models
+end-to-end.
+
+## Notes about upgrading from v3.0 {#upgrading}
+
+### Pipeline package version compatibility {#version-compat}
+
+> #### Using legacy implementations
+>
+> In spaCy v3, you'll still be able to load and reference legacy
+> implementations via
+> [`spacy-legacy`](https://github.com/explosion/spacy-legacy), even if the
+> components or architectures change and newer versions are available in the
+> core library.
+
+When you're loading a pipeline package trained with spaCy v3.0, you will see a
+warning telling you that the pipeline may be incompatible. This doesn't
+necessarily have to be true, but we recommend running your pipelines against
+your test suite or evaluation data to make sure there are no unexpected
+results. If you're using one of the [trained pipelines](/models) we provide,
+you should run [`spacy download`](/api/cli#download) to update to the latest
+version. To see an overview of all installed packages and their compatibility,
+you can run [`spacy validate`](/api/cli#validate).
+
+If you've trained your own custom pipeline and you've confirmed that it's
+still working as expected, you can update the spaCy version requirements in
+the [`meta.json`](/api/data-formats#meta):
+
+```diff
+- "spacy_version": ">=3.0.0,<3.1.0",
++ "spacy_version": ">=3.0.0,<3.2.0",
+```
+
+### Updating v3.0 configs
+
+To update a config from spaCy v3.0 with the new v3.1 settings, run
+[`init fill-config`](/api/cli#init-fill-config):
+
+```bash
+python -m spacy init fill-config config-v3.0.cfg config-v3.1.cfg
+```
+
+In many cases (`spacy train`, `spacy.load()`), the new defaults will be filled
+in automatically, but you'll need to fill in the new settings to run
+[`debug config`](/api/cli#debug) and [`debug data`](/api/cli#debug-data).
+
+### Sourcing pipeline components with vectors {#source-vectors}
+
+If you're sourcing a pipeline component that requires static vectors (for
+example, a tagger or parser from an `md` or `lg` pretrained pipeline), be sure
+to include the source model's vectors in the setting `[initialize.vectors]`.
+In spaCy v3.0, a bug allowed vectors to be loaded implicitly through `source`;
+in v3.1, this setting must be provided explicitly as `[initialize.vectors]`:
+
+```ini
+### config.cfg (excerpt)
+[components.ner]
+source = "en_core_web_md"
+
+[initialize]
+vectors = "en_core_web_md"
+```
+
+Each pipeline can only store one set of static vectors, so it's not possible
+to assemble a pipeline with components that were trained on different static
+vectors.
+
+[`spacy train`](/api/cli#train) and [`spacy assemble`](/api/cli#assemble) will
+provide warnings if the source and target pipelines don't contain the same
+vectors. If you are sourcing a rule-based component like an entity ruler or
+lemmatizer that does not use the vectors as a model feature, then this warning
+can be safely ignored.
diff --git a/website/meta/sidebars.json b/website/meta/sidebars.json
index 6b2850187..6fe09f052 100644
--- a/website/meta/sidebars.json
+++ b/website/meta/sidebars.json
@@ -9,7 +9,8 @@
         { "text": "Models & Languages", "url": "/usage/models" },
         { "text": "Facts & Figures", "url": "/usage/facts-figures" },
         { "text": "spaCy 101", "url": "/usage/spacy-101" },
-        { "text": "New in v3.0", "url": "/usage/v3" }
+        { "text": "New in v3.0", "url": "/usage/v3" },
+        { "text": "New in v3.1", "url": "/usage/v3-1" }
       ]
     },
     {
@@ -136,9 +137,7 @@
       },
       {
         "label": "Legacy",
-        "items": [
-          { "text": "Legacy functions", "url": "/api/legacy" }
-        ]
+        "items": [{ "text": "Legacy functions", "url": "/api/legacy" }]
       }
     ]
   }
diff --git a/website/src/components/code.js b/website/src/components/code.js
index 4dd7a8eb8..6e9f0c22e 100644
--- a/website/src/components/code.js
+++ b/website/src/components/code.js
@@ -14,7 +14,7 @@ import GitHubCode from './github'
 import classes from '../styles/code.module.sass'

 const WRAP_THRESHOLD = 30
-const CLI_GROUPS = ['init', 'debug', 'project', 'ray']
+const CLI_GROUPS = ['init', 'debug', 'project', 'ray', 'huggingface-hub']

 export default props => (
diff --git a/website/src/images/logos/huggingface_hub.svg b/website/src/images/logos/huggingface_hub.svg
new file mode 100644
index 000000000..582e89e0d
--- /dev/null
+++ b/website/src/images/logos/huggingface_hub.svg
diff --git a/website/src/templates/index.js b/website/src/templates/index.js
index a5adc6e50..2c68ff056 100644
--- a/website/src/templates/index.js
+++ b/website/src/templates/index.js
@@ -119,8 +119,8 @@ const AlertSpace = ({ nightly, legacy }) => {
 }
 
 const navAlert = (
-    
-        💥 Out now: spaCy v3.0
+    
+        💥 Out now: spaCy v3.1
     
 )
 
diff --git a/website/src/widgets/integration.js b/website/src/widgets/integration.js
index 0de078e0b..cf2320052 100644
--- a/website/src/widgets/integration.js
+++ b/website/src/widgets/integration.js
@@ -8,6 +8,7 @@ import StreamlitLogo from '-!svg-react-loader!../images/logos/streamlit.svg'
 import FastAPILogo from '-!svg-react-loader!../images/logos/fastapi.svg'
 import WandBLogo from '-!svg-react-loader!../images/logos/wandb.svg'
 import RayLogo from '-!svg-react-loader!../images/logos/ray.svg'
+import HuggingFaceHubLogo from '-!svg-react-loader!../images/logos/huggingface_hub.svg'
 
 const LOGOS = {
     dvc: DVCLogo,
@@ -16,6 +17,7 @@ const LOGOS = {
     fastapi: FastAPILogo,
     wandb: WandBLogo,
     ray: RayLogo,
+    huggingface_hub: HuggingFaceHubLogo,
 }
 
 export const IntegrationLogo = ({ name, title, width, height, maxWidth, align, ...props }) => {