Remove more ray docs
This commit is contained in:
parent 42c02ae8e0
commit 82d34828dd
@@ -75,7 +75,6 @@ spaCy's [`setup.cfg`](%%GITHUB_SPACY/setup.cfg) for details on what's included.
| ---------------- | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `lookups` | Install [`spacy-lookups-data`](https://github.com/explosion/spacy-lookups-data) for data tables for lemmatization and lexeme normalization. The data is serialized with trained pipelines, so you only need this package if you want to train your own models. |
| `transformers` | Install [`spacy-transformers`](https://github.com/explosion/spacy-transformers). The package will be installed automatically when you install a transformer-based pipeline. |
| `ray` | Install [`spacy-ray`](https://github.com/explosion/spacy-ray) to add CLI commands for [parallel training](/usage/training#parallel-training). |
| `cuda`, ... | Install spaCy with GPU support provided by [CuPy](https://cupy.chainer.org) for your given CUDA version. See the GPU [installation instructions](#gpu) for details and options. |
| `apple` | Install [`thinc-apple-ops`](https://github.com/explosion/thinc-apple-ops) to improve performance on an Apple M1. |
| `ja`, `ko`, `th` | Install additional dependencies required for tokenization for the [languages](/usage/models#languages). |
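
Several extras from this table can be combined in one install command. As a rough illustration (using the same placeholder package name as the install commands elsewhere on this page), pulling in both the lookups and transformers extras looks like this:

```cli
$ pip install -U %%SPACY_PKG_NAME[lookups,transformers]%%SPACY_PKG_FLAGS
```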
@@ -1014,54 +1014,6 @@ https://github.com/explosion/projects/blob/v3/integrations/fastapi/scripts/main.
---
### Ray {#ray} <IntegrationLogo name="ray" width={100} height="auto" align="right" />

> #### Installation
>
> ```cli
> $ pip install -U %%SPACY_PKG_NAME[ray]%%SPACY_PKG_FLAGS
> # Check that the CLI is registered
> $ python -m spacy ray --help
> ```

[Ray](https://ray.io/) is a fast and simple framework for building and running
**distributed applications**. You can use Ray for parallel and distributed
training with spaCy via our lightweight
[`spacy-ray`](https://github.com/explosion/spacy-ray) extension package. If the
package is installed in the same environment as spaCy, it will automatically add
[`spacy ray`](/api/cli#ray) commands to your spaCy CLI. See the usage guide on
[parallel training](/usage/training#parallel-training) for more details on how
it works under the hood.

<Project id="integrations/ray">

Get started with parallel training using our project template. It trains a
simple model on a Universal Dependencies Treebank and lets you parallelize the
training with Ray.

</Project>

You can integrate [`spacy ray train`](/api/cli#ray-train) into your
`project.yml` just like the regular training command and pass it the config, an
optional output directory or remote storage URL and config overrides if needed.

<!-- prettier-ignore -->
```yaml
### project.yml
commands:
  - name: "ray"
    help: "Train a model via parallel training with Ray"
    script:
      - "python -m spacy ray train configs/config.cfg -o training/ --paths.train corpus/train.spacy --paths.dev corpus/dev.spacy"
    deps:
      - "corpus/train.spacy"
      - "corpus/dev.spacy"
    outputs:
      - "training/model-best"
```

---

### Weights & Biases {#wandb} <IntegrationLogo name="wandb" width={175} height="auto" align="right" />

[Weights & Biases](https://www.wandb.com/) is a popular platform for experiment

@@ -1572,77 +1572,6 @@ token-based annotations like the dependency parse or entity labels, you'll need
to take care to adjust the `Example` object so its annotations match and remain
valid.

## Parallel & distributed training with Ray {#parallel-training}

> #### Installation
>
> ```cli
> $ pip install -U %%SPACY_PKG_NAME[ray]%%SPACY_PKG_FLAGS
> # Check that the CLI is registered
> $ python -m spacy ray --help
> ```

[Ray](https://ray.io/) is a fast and simple framework for building and running
**distributed applications**. You can use Ray to train spaCy on one or more
remote machines, potentially speeding up your training process. Parallel
training won't always be faster though – it depends on your batch size, models,
and hardware.

<Infobox variant="warning">

To use Ray with spaCy, you need the
[`spacy-ray`](https://github.com/explosion/spacy-ray) package installed.
Installing the package will automatically add the `ray` command to the spaCy
CLI.

</Infobox>

The [`spacy ray train`](/api/cli#ray-train) command follows the same API as
[`spacy train`](/api/cli#train), with a few extra options to configure the Ray
setup. You can optionally set the `--address` option to point to your Ray
cluster. If it's not set, Ray will run locally.

```cli
python -m spacy ray train config.cfg --n-workers 2
```
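
To target a remote Ray cluster instead of running locally, the same command can be given the `--address` option mentioned above. This is only a sketch; the address value is a placeholder for your own cluster:

```cli
python -m spacy ray train config.cfg --n-workers 2 --address <your-cluster-address>
```
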
<Project id="integrations/ray">

Get started with parallel training using our project template. It trains a
simple model on a Universal Dependencies Treebank and lets you parallelize the
training with Ray.

</Project>

### How parallel training works {#parallel-training-details}

Each worker receives a shard of the **data** and builds a copy of the **model
and optimizer** from the [`config.cfg`](#config). It also has a communication
channel to **pass gradients and parameters** to the other workers. Additionally,
each worker is given ownership of a subset of the parameter arrays. Every
parameter array is owned by exactly one worker, and the workers are given a
mapping so they know which worker owns which parameter.



As training proceeds, every worker will be computing gradients for **all** of
the model parameters. When they compute gradients for parameters they don't own,
they'll **send them to the worker** that does own that parameter, along with a
version identifier so that the owner can decide whether to discard the gradient.
Workers use the gradients they receive and the ones they compute locally to
update the parameters they own, and then broadcast the updated array and a new
version ID to the other workers.

This training procedure is **asynchronous** and **non-blocking**. Workers always
push their gradient increments and parameter updates; they do not have to pull
them and block on the result, so the transfers can happen in the background,
overlapped with the actual training work. The workers also do not have to stop
and wait for each other ("synchronize") at the start of each batch. This is very
useful for spaCy, because spaCy is often trained on long documents, which means
**batches can vary in size** significantly. Uneven workloads make synchronous
gradient descent inefficient, because if one batch is slow, all of the other
workers are stuck waiting for it to complete before they can continue.
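
To make the scheme above concrete, here is a rough, self-contained Python sketch of the ownership and routing logic. It is not the spacy-ray implementation (the `Worker` class, the queue-based transport and all names here are invented for illustration), but it follows the description above: each parameter array has exactly one owner, non-owners push gradients tagged with a version ID to the owner, stale gradients can be discarded, and the owner applies updates and broadcasts the new array and version without any synchronization barrier.

```python
import queue
from dataclasses import dataclass

import numpy as np

# Illustrative only: the names below (Worker, Update, drain_inbox, ...) are
# invented for this sketch and are not part of spacy-ray's API.


@dataclass
class Update:
    param_id: str
    version: int
    array: np.ndarray
    is_gradient: bool  # True: gradient sent to the owner, False: fresh parameters


class Worker:
    def __init__(self, rank: int, n_workers: int, params: dict):
        self.rank = rank
        # Full copy of all parameters and their version IDs.
        self.params = {name: array.copy() for name, array in params.items()}
        self.versions = {name: 0 for name in params}
        # Every parameter array is owned by exactly one worker (round-robin here).
        self.owner = {name: i % n_workers for i, name in enumerate(sorted(params))}
        self.inbox = queue.Queue()  # incoming gradients and parameter broadcasts
        self.peers = []  # filled in once all workers exist

    def handle_local_gradients(self, grads: dict, lr: float = 0.001) -> None:
        """Apply gradients for owned parameters, push the rest to their owners."""
        for name, grad in grads.items():
            if self.owner[name] == self.rank:
                self._update_and_broadcast(name, grad, lr)
            else:
                # Non-blocking push: training continues without waiting for a reply.
                self.peers[self.owner[name]].inbox.put(
                    Update(name, self.versions[name], grad, is_gradient=True)
                )

    def drain_inbox(self, lr: float = 0.001) -> None:
        """Process queued messages in the background, with no synchronization barrier."""
        while not self.inbox.empty():
            msg = self.inbox.get()
            if msg.is_gradient:
                # The owner may discard gradients computed against stale parameters.
                if msg.version == self.versions[msg.param_id]:
                    self._update_and_broadcast(msg.param_id, msg.array, lr)
            else:
                # A broadcast of updated parameters from their owner.
                self.params[msg.param_id] = msg.array
                self.versions[msg.param_id] = msg.version

    def _update_and_broadcast(self, name: str, grad: np.ndarray, lr: float) -> None:
        self.params[name] -= lr * grad
        self.versions[name] += 1
        for peer in self.peers:
            if peer.rank != self.rank:
                peer.inbox.put(
                    Update(name, self.versions[name], self.params[name], is_gradient=False)
                )
```

In the actual spacy-ray setup the transport runs over Ray rather than in-process queues; the sketch only illustrates the flow of gradients, version checks and parameter broadcasts described above.
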
## Internal training API {#api}
<Infobox variant="danger">