mirror of
https://github.com/explosion/spaCy.git
synced 2025-02-10 16:40:34 +03:00
Update docs
parent
ef43152af4
commit
8ac5ef1284
|
@ -660,8 +660,10 @@ for more info.
|
||||||
As of spaCy v3.0, the `pretrain` command takes the same
|
As of spaCy v3.0, the `pretrain` command takes the same
|
||||||
[config file](/usage/training#config) as the `train` command. This ensures that
|
[config file](/usage/training#config) as the `train` command. This ensures that
|
||||||
settings are consistent between pretraining and training. Settings for
|
settings are consistent between pretraining and training. Settings for
|
||||||
pretraining can be defined in the `[pretraining]` block of the config file. See
|
pretraining can be defined in the `[pretraining]` block of the config file and
|
||||||
the [data format](/api/data-formats#config) for details.
|
auto-generated by setting `--pretraining` on
|
||||||
|
[`init fill-config`](/api/cli#init-fill-config). Also see the
|
||||||
|
[data format](/api/data-formats#config) for details.
|
||||||
|
|
||||||
</Infobox>
|
</Infobox>
|
||||||
|
|
||||||
|
|
|
@ -375,7 +375,8 @@ The [`spacy pretrain`](/api/cli#pretrain) command lets you pretrain the
|
||||||
"token-to-vector" embedding layer of pipeline components from raw text. Raw text
|
"token-to-vector" embedding layer of pipeline components from raw text. Raw text
|
||||||
can be provided as a `.jsonl` (newline-delimited JSON) file containing one input
|
can be provided as a `.jsonl` (newline-delimited JSON) file containing one input
|
||||||
text per line (roughly paragraph length is good). Optionally, custom
|
text per line (roughly paragraph length is good). Optionally, custom
|
||||||
tokenization can be provided.
|
tokenization can be provided. The JSONL format means that the texts can be read
|
||||||
|
in line-by-line, while still making it easy to represent newlines in the data.
|
||||||
|
|
||||||
> #### Tip: Writing JSONL
|
> #### Tip: Writing JSONL
|
||||||
>
|
>
|
||||||
|
|
|
@ -43,6 +43,8 @@ recognizer doesn't use any features set by the tagger and parser, and so on.
|
||||||
This means that you can swap them, or remove single components from the pipeline
|
This means that you can swap them, or remove single components from the pipeline
|
||||||
without affecting the others. However, components may share a "token-to-vector"
|
without affecting the others. However, components may share a "token-to-vector"
|
||||||
component like [`Tok2Vec`](/api/tok2vec) or [`Transformer`](/api/transformer).
|
component like [`Tok2Vec`](/api/tok2vec) or [`Transformer`](/api/transformer).
|
||||||
|
You can read more about this in the docs on
|
||||||
|
[embedding layers](/usage/embeddings-transformers#embedding-layers).
|
||||||
|
|
||||||
Custom components may also depend on annotations set by other components. For
|
Custom components may also depend on annotations set by other components. For
|
||||||
example, a custom lemmatizer may need the part-of-speech tags assigned, so it'll
|
example, a custom lemmatizer may need the part-of-speech tags assigned, so it'll
|
||||||
|
|
|
@ -107,7 +107,62 @@ transformer outputs to the
|
||||||
[`Doc._.trf_data`](/api/transformer#custom_attributes) extension attribute,
|
[`Doc._.trf_data`](/api/transformer#custom_attributes) extension attribute,
|
||||||
giving you access to them after the pipeline has finished running.
|
giving you access to them after the pipeline has finished running.
|
||||||
|
|
||||||
<!-- TODO: show example of implementation via config, side by side -->
|
### Example: Shared vs. independent config {#embedding-layers-config}
|
||||||
|
|
||||||
|
The [config system](/usage/training#config) lets you express model configuration
|
||||||
|
for both shared and independent embedding layers. The shared setup uses a single
|
||||||
|
[`Tok2Vec`](/api/tok2vec) component with the
|
||||||
|
[Tok2Vec](/api/architectures#Tok2Vec) architecture. All other components, like
|
||||||
|
the entity recognizer, use a
|
||||||
|
[Tok2VecListener](/api/architectures#Tok2VecListener) layer as their model's
|
||||||
|
`tok2vec` argument, which connects to the `tok2vec` component model.
|
||||||
|
|
||||||
|
```ini
|
||||||
|
### Shared {highlight="1-2,4-5,19-20"}
|
||||||
|
[components.tok2vec]
|
||||||
|
factory = "tok2vec"
|
||||||
|
|
||||||
|
[components.tok2vec.model]
|
||||||
|
@architectures = "spacy.Tok2Vec.v1"
|
||||||
|
|
||||||
|
[components.tok2vec.model.embed]
|
||||||
|
@architectures = "spacy.MultiHashEmbed.v1"
|
||||||
|
|
||||||
|
[components.tok2vec.model.encode]
|
||||||
|
@architectures = "spacy.MaxoutWindowEncoder.v1"
|
||||||
|
|
||||||
|
[components.ner]
|
||||||
|
factory = "ner"
|
||||||
|
|
||||||
|
[components.ner.model]
|
||||||
|
@architectures = "spacy.TransitionBasedParser.v1"
|
||||||
|
|
||||||
|
[components.ner.model.tok2vec]
|
||||||
|
@architectures = "spacy.Tok2VecListener.v1"
|
||||||
|
```
|
||||||
|
|
||||||
|
In the independent setup, the entity recognizer component defines its own
|
||||||
|
[Tok2Vec](/api/architectures#Tok2Vec) instance. Other components will do the
|
||||||
|
same. This makes them fully independent and doesn't require an upstream
|
||||||
|
[`Tok2Vec`](/api/tok2vec) component to be present in the pipeline.
|
||||||
|
|
||||||
|
```ini
|
||||||
|
### Independent {highlight="7-8"}
|
||||||
|
[components.ner]
|
||||||
|
factory = "ner"
|
||||||
|
|
||||||
|
[components.ner.model]
|
||||||
|
@architectures = "spacy.TransitionBasedParser.v1"
|
||||||
|
|
||||||
|
[components.ner.model.tok2vec]
|
||||||
|
@architectures = "spacy.Tok2Vec.v1"
|
||||||
|
|
||||||
|
[components.ner.model.tok2vec.embed]
|
||||||
|
@architectures = "spacy.MultiHashEmbed.v1"
|
||||||
|
|
||||||
|
[components.ner.model.tok2vec.encode]
|
||||||
|
@architectures = "spacy.MaxoutWindowEncoder.v1"
|
||||||
|
```
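The difference between the two setups can also be checked programmatically. The following is a minimal sketch that uses Python's standard-library `configparser` as a stand-in for spaCy's own config loader, which is an assumption made to keep the example self-contained. The helper name `listener_components` is hypothetical. It reports which components connect to a shared `tok2vec` component through a listener layer.

```python
import configparser
import json

# Trimmed copy of the "shared" config: ner listens to the tok2vec component.
SHARED_CFG = """
[components.tok2vec]
factory = "tok2vec"

[components.tok2vec.model]
@architectures = "spacy.Tok2Vec.v1"

[components.ner]
factory = "ner"

[components.ner.model]
@architectures = "spacy.TransitionBasedParser.v1"

[components.ner.model.tok2vec]
@architectures = "spacy.Tok2VecListener.v1"
"""


def listener_components(cfg_text):
    """Return names of components whose model.tok2vec block is a listener.

    Hypothetical helper for illustration; not part of spaCy.
    """
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(cfg_text)
    names = []
    for section in parser.sections():
        parts = section.split(".")
        is_tok2vec_block = (
            len(parts) == 4
            and parts[0] == "components"
            and parts[2:] == ["model", "tok2vec"]
        )
        if not is_tok2vec_block:
            continue
        # Config values are JSON-style literals, so strip the quotes via json.
        arch = json.loads(parser[section].get("@architectures", '""'))
        if arch.startswith("spacy.Tok2VecListener"):
            names.append(parts[1])
    return names


print(listener_components(SHARED_CFG))
```

With the independent config, the same check would report no listeners, because the entity recognizer's `tok2vec` block uses `spacy.Tok2Vec.v1` directly.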
|
||||||
|
|
||||||
<!-- TODO: Once rehearsal is tested, mention it here. -->
|
<!-- TODO: Once rehearsal is tested, mention it here. -->
|
||||||
|
|
||||||
|
@ -503,3 +558,22 @@ def MyCustomVectors(
|
||||||
## Pretraining {#pretraining}
|
## Pretraining {#pretraining}
|
||||||
|
|
||||||
<!-- TODO: write -->
|
<!-- TODO: write -->
|
||||||
|
|
||||||
|
> #### Raw text format
|
||||||
|
>
|
||||||
|
> The raw text can be provided as JSONL (newline-delimited JSON) with a key
|
||||||
|
> `"text"` per entry. This allows the data to be read in line by line, while
|
||||||
|
> also allowing you to include newlines in the texts.
|
||||||
|
>
|
||||||
|
> ```json
|
||||||
|
> {"text": "Can I ask where you work now and what you do, and if you enjoy it?"}
|
||||||
|
> {"text": "They may just pull out of the Seattle market completely, at least until they have autonomous vehicles."}
|
||||||
|
> ```
|
||||||
|
|
||||||
|
```cli
|
||||||
|
$ python -m spacy init fill-config config.cfg config_pretrain.cfg --pretraining
|
||||||
|
```
|
||||||
|
|
||||||
|
```cli
|
||||||
|
$ python -m spacy pretrain raw_text.jsonl /output config_pretrain.cfg
|
||||||
|
```
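A file like `raw_text.jsonl` can be produced with a few lines of Python. This is a minimal sketch using only the standard library. The helper names `write_jsonl` and `read_jsonl` are hypothetical and are not spaCy functions. It shows why the format works well here: newlines inside a text are escaped by the JSON encoder, so every record still occupies exactly one line of the file.

```python
import json
import os
import tempfile


def write_jsonl(path, texts):
    """Write one {"text": ...} JSON object per line."""
    with open(path, "w", encoding="utf8") as f:
        for text in texts:
            # json.dumps escapes embedded newlines as \n, keeping each
            # record on a single physical line.
            f.write(json.dumps({"text": text}) + "\n")


def read_jsonl(path):
    """Yield the texts back, reading the file line by line."""
    with open(path, "r", encoding="utf8") as f:
        for line in f:
            if line.strip():
                yield json.loads(line)["text"]


texts = [
    "Can I ask where you work now and what you do, and if you enjoy it?",
    "First paragraph.\nSecond paragraph after a newline.",
]
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "raw_text.jsonl")
    write_jsonl(path, texts)
    # The round trip preserves the embedded newline.
    assert list(read_jsonl(path)) == texts
```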
|
||||||
|
|
|
@ -88,6 +88,12 @@ can also use any private repo you have access to with Git.
|
||||||
> - dest: 'assets/training.spacy'
|
> - dest: 'assets/training.spacy'
|
||||||
> url: 'https://example.com/data.spacy'
|
> url: 'https://example.com/data.spacy'
|
||||||
> checksum: '63373dd656daa1fd3043ce166a59474c'
|
> checksum: '63373dd656daa1fd3043ce166a59474c'
|
||||||
|
> - dest: 'assets/development.spacy'
|
||||||
|
> git:
|
||||||
|
> repo: 'https://github.com/example/repo'
|
||||||
|
> branch: 'master'
|
||||||
|
> path: 'path/developments.spacy'
|
||||||
|
> checksum: '5113dc04e03f079525edd8df3f4f39e3'
|
||||||
> ```
|
> ```
|
||||||
|
|
||||||
Assets are data files your project needs – for example, the training and
|
Assets are data files your project needs – for example, the training and
|
||||||
|
@ -104,22 +110,8 @@ $ python -m spacy project assets
|
||||||
|
|
||||||
Asset URLs can be a number of different protocols: HTTP, HTTPS, FTP, SSH, and
|
Asset URLs can be a number of different protocols: HTTP, HTTPS, FTP, SSH, and
|
||||||
even cloud storage such as GCS and S3. You can also fetch assets using git, by
|
even cloud storage such as GCS and S3. You can also fetch assets using git, by
|
||||||
replacing the `url` string with a `git` block, like this:
|
replacing the `url` string with a `git` block. spaCy will use Git's "sparse
|
||||||
|
checkout" feature to avoid downloading the whole repository.
|
||||||
> #### project.yml
|
|
||||||
>
|
|
||||||
> ```yaml
|
|
||||||
> assets:
|
|
||||||
> - dest: 'assets/training.spacy'
|
|
||||||
> git:
|
|
||||||
> repo: "https://github.com/example/repo"
|
|
||||||
> branch: "master"
|
|
||||||
> path: "some/path"
|
|
||||||
> checksum: '63373dd656daa1fd3043ce166a59474c'
|
|
||||||
> ```
|
|
||||||
|
|
||||||
spaCy will use Git's "sparse checkout" feature to avoid downloading the whole
|
|
||||||
repository.
|
|
||||||
|
|
||||||
### 3. Run a command {#run}
|
### 3. Run a command {#run}
|
||||||
|
|
||||||
|
@ -236,10 +228,93 @@ https://github.com/explosion/spacy-boilerplates/blob/master/ner_fashion/project.
|
||||||
| ------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ |
|
| ------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ |
|
||||||
| `vars` | A dictionary of variables that can be referenced in paths, URLs and scripts, just like [`config.cfg` variables](/usage/training#config-interpolation). For example, `${vars.name}` will use the value of the variable `name`. Variables need to be defined in the section `vars`, but can be a nested dict, so you're able to reference `${vars.model.name}`. |
|
| `vars` | A dictionary of variables that can be referenced in paths, URLs and scripts, just like [`config.cfg` variables](/usage/training#config-interpolation). For example, `${vars.name}` will use the value of the variable `name`. Variables need to be defined in the section `vars`, but can be a nested dict, so you're able to reference `${vars.model.name}`. |
|
||||||
| `directories` | An optional list of [directories](#project-files) that should be created in the project for assets, training outputs, metrics etc. spaCy will make sure that these directories always exist. |
|
| `directories` | An optional list of [directories](#project-files) that should be created in the project for assets, training outputs, metrics etc. spaCy will make sure that these directories always exist. |
|
||||||
| `assets` | A list of assets that can be fetched with the [`project assets`](/api/cli#project-assets) command. `url` defines a URL or local path, `dest` is the destination file relative to the project directory, and an optional `checksum` ensures that an error is raised if the file's checksum doesn't match. |
|
| `assets` | A list of assets that can be fetched with the [`project assets`](/api/cli#project-assets) command. `url` defines a URL or local path, `dest` is the destination file relative to the project directory, and an optional `checksum` ensures that an error is raised if the file's checksum doesn't match. Instead of `url`, you can also provide a `git` block with the keys `repo`, `branch` and `path`, to download from a Git repo. |
|
||||||
| `workflows` | A dictionary of workflow names, mapped to a list of command names, to execute in order. Workflows can be run with the [`project run`](/api/cli#project-run) command. |
|
| `workflows` | A dictionary of workflow names, mapped to a list of command names, to execute in order. Workflows can be run with the [`project run`](/api/cli#project-run) command. |
|
||||||
| `commands` | A list of named commands. A command can define an optional help message (shown in the CLI when the user adds `--help`) and the `script`, a list of commands to run. The `deps` and `outputs` let you define the created file the command depends on and produces, respectively. This lets spaCy determine whether a command needs to be re-run because its dependencies or outputs changed. Commands can be run as part of a workflow, or separately with the [`project run`](/api/cli#project-run) command. |
|
| `commands` | A list of named commands. A command can define an optional help message (shown in the CLI when the user adds `--help`) and the `script`, a list of commands to run. The `deps` and `outputs` let you define the created file the command depends on and produces, respectively. This lets spaCy determine whether a command needs to be re-run because its dependencies or outputs changed. Commands can be run as part of a workflow, or separately with the [`project run`](/api/cli#project-run) command. |
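To make the `vars` row concrete, here is a small sketch of how a reference such as `${vars.model.name}` could be resolved against a nested dictionary. It is an illustration only and does not reproduce spaCy's implementation. The function name `interpolate` and the sample values are assumptions.

```python
import re

# Matches ${vars.some.nested.key} and captures the dotted key path.
VAR_PATTERN = re.compile(r"\$\{vars\.([A-Za-z0-9_.]+)\}")


def interpolate(text, variables):
    """Replace ${vars.a.b} references with values from a nested dict.

    Hypothetical helper for illustration; raises KeyError for unknown keys.
    """

    def lookup(match):
        value = variables
        for key in match.group(1).split("."):
            value = value[key]  # walk one level deeper per dotted part
        return str(value)

    return VAR_PATTERN.sub(lookup, text)


project_vars = {"name": "ner_fashion", "model": {"name": "en_fashion"}}
print(interpolate("training/${vars.model.name}/config.cfg", project_vars))
```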
|
||||||
|
|
||||||
|
### Data assets {#data-assets}
|
||||||
|
|
||||||
|
Assets are any files that your project might need, like training and development
|
||||||
|
corpora or pretrained weights for initializing your model. Assets are defined in
|
||||||
|
the `assets` block of your `project.yml` and can be downloaded using the
|
||||||
|
[`project assets`](/api/cli#project-assets) command. Defining checksums lets you
|
||||||
|
verify that someone else running your project will use the same files you used.
|
||||||
|
Asset URLs can use a number of different **protocols**: HTTP, HTTPS, FTP, SSH,
|
||||||
|
and even **cloud storage** such as GCS and S3. You can also download assets from
|
||||||
|
a **Git repo** instead.
|
||||||
|
|
||||||
|
#### Downloading from a URL or cloud storage {#data-assets-url}
|
||||||
|
|
||||||
|
Under the hood, spaCy uses the
|
||||||
|
[`smart-open`](https://github.com/RaRe-Technologies/smart_open) library so you
|
||||||
|
can use any protocol it supports. Note that you may need to install extra
|
||||||
|
dependencies to use certain protocols.
|
||||||
|
|
||||||
|
> #### project.yml
|
||||||
|
>
|
||||||
|
> ```yaml
|
||||||
|
> assets:
|
||||||
|
> # Download from public HTTPS URL
|
||||||
|
> - dest: 'assets/training.spacy'
|
||||||
|
> url: 'https://example.com/data.spacy'
|
||||||
|
> checksum: '63373dd656daa1fd3043ce166a59474c'
|
||||||
|
> # Download from Google Cloud Storage bucket
|
||||||
|
> - dest: 'assets/development.spacy'
|
||||||
|
> url: 'gs://your-bucket/corpora'
|
||||||
|
> checksum: '5113dc04e03f079525edd8df3f4f39e3'
|
||||||
|
> ```
|
||||||
|
|
||||||
|
| Name | Description |
|
||||||
|
| ---------- | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
|
||||||
|
| `dest` | The destination path to save the downloaded asset to (relative to the project directory), including the file name. |
|
||||||
|
| `url` | The URL to download from, using the respective protocol. |
|
||||||
|
| `checksum` | Optional checksum of the file. If provided, it will be used to verify that the file matches and downloads will be skipped if a local file with the same checksum already exists. |
|
||||||
|
|
||||||
|
#### Downloading from a Git repo {#data-assets-git}
|
||||||
|
|
||||||
|
If a `git` block is provided, the asset is downloaded from the given Git
|
||||||
|
repository. You can download from any repo that you have access to. Under the
|
||||||
|
hood, this uses Git's "sparse checkout" feature, so you're only downloading the
|
||||||
|
files you need and not the whole repo.
|
||||||
|
|
||||||
|
> #### project.yml
|
||||||
|
>
|
||||||
|
> ```yaml
|
||||||
|
> assets:
|
||||||
|
> - dest: 'assets/training.spacy'
|
||||||
|
> git:
|
||||||
|
> repo: 'https://github.com/example/repo'
|
||||||
|
> branch: 'master'
|
||||||
|
> path: 'path/training.spacy'
|
||||||
|
> checksum: '63373dd656daa1fd3043ce166a59474c'
|
||||||
|
> ```
|
||||||
|
|
||||||
|
| Name | Description |
|
||||||
|
| ---------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
|
||||||
|
| `dest` | The destination path to save the downloaded asset to (relative to the project directory), including the file name. |
|
||||||
|
| `git` | `repo`: The URL of the repo to download from.<br />`path`: Path of the file or directory to download, relative to the repo root.<br />`branch`: The branch to download from. Defaults to `"master"`. |
|
||||||
|
| `checksum` | Optional checksum of the file. If provided, it will be used to verify that the file matches and downloads will be skipped if a local file with the same checksum already exists. |
|
||||||
|
|
||||||
|
#### Working with private assets {#data-asets-private}
|
||||||
|
|
||||||
|
> #### project.yml
|
||||||
|
>
|
||||||
|
> ```yaml
|
||||||
|
> assets:
|
||||||
|
> - dest: 'assets/private_training_data.json'
|
||||||
|
> checksum: '63373dd656daa1fd3043ce166a59474c'
|
||||||
|
> - dest: 'assets/private_vectors.bin'
|
||||||
|
> checksum: '5113dc04e03f079525edd8df3f4f39e3'
|
||||||
|
> ```
|
||||||
|
|
||||||
|
For many projects, the datasets and weights you're working with might be
|
||||||
|
company-internal and not available over the internet. In that case, you can
|
||||||
|
specify the destination paths and a checksum, and leave out the URL. When your
|
||||||
|
teammates clone and run your project, they can place the files in the respective
|
||||||
|
directory themselves. The [`project assets`](/api/cli#project-assets) command
|
||||||
|
will alert you to missing files and mismatched checksums, so you can ensure that
|
||||||
|
others are running your project with the same data.
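The kind of check described above can be sketched in a few lines of Python. This sketch assumes the checksums are MD5 hex digests, which matches the 32-character example values but is not stated explicitly here. The helper names `file_md5` and `check_assets` are hypothetical and do not mirror spaCy's internals.

```python
import hashlib
from pathlib import Path


def file_md5(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with Path(path).open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def check_assets(project_dir, assets):
    """Return (dest, status) pairs: 'ok', 'missing' or 'mismatch'."""
    report = []
    for asset in assets:
        path = Path(project_dir) / asset["dest"]
        if not path.exists():
            status = "missing"
        elif "checksum" in asset and file_md5(path) != asset["checksum"]:
            status = "mismatch"
        else:
            status = "ok"
        report.append((asset["dest"], status))
    return report


# Entries mirror the project.yml example: a destination and a checksum, no URL.
assets = [
    {"dest": "assets/private_training_data.json",
     "checksum": "63373dd656daa1fd3043ce166a59474c"},
    {"dest": "assets/private_vectors.bin",
     "checksum": "5113dc04e03f079525edd8df3f4f39e3"},
]
for dest, status in check_assets(".", assets):
    print(f"{dest}: {status}")
```

A teammate who has not yet copied the files into `assets/` would see both entries reported as `missing`.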
|
||||||
|
|
||||||
### Dependencies and outputs {#deps-outputs}
|
### Dependencies and outputs {#deps-outputs}
|
||||||
|
|
||||||
Each command defined in the `project.yml` can optionally define a list of
|
Each command defined in the `project.yml` can optionally define a list of
|
||||||
|
@ -446,25 +521,6 @@ projects.
|
||||||
|
|
||||||
</Infobox>
|
</Infobox>
|
||||||
|
|
||||||
### Working with private assets {#private-assets}
|
|
||||||
|
|
||||||
For many projects, the datasets and weights you're working with might be
|
|
||||||
company-internal and not available via a public URL. In that case, you can
|
|
||||||
specify the destination paths and a checksum, and leave out the URL. When your
|
|
||||||
teammates clone and run your project, they can place the files in the respective
|
|
||||||
directory themselves. The [`spacy project assets`](/api/cli#project-assets)
|
|
||||||
command will alert about missing files and mismatched checksums, so you can
|
|
||||||
ensure that others are running your project with the same data.
|
|
||||||
|
|
||||||
```yaml
|
|
||||||
### project.yml
|
|
||||||
assets:
|
|
||||||
- dest: 'assets/private_training_data.json'
|
|
||||||
checksum: '63373dd656daa1fd3043ce166a59474c'
|
|
||||||
- dest: 'assets/private_vectors.bin'
|
|
||||||
checksum: '5113dc04e03f079525edd8df3f4f39e3'
|
|
||||||
```
|
|
||||||
|
|
||||||
## Remote Storage {#remote}
|
## Remote Storage {#remote}
|
||||||
|
|
||||||
You can persist your project outputs to a remote storage using the
|
You can persist your project outputs to a remote storage using the
|
||||||
|
|
|
@ -365,6 +365,8 @@ Note that spaCy v3.0 now requires **Python 3.6+**.
|
||||||
[`DependencyMatcher.add`](/api/dependencymatcher#add) now only accept a list
|
[`DependencyMatcher.add`](/api/dependencymatcher#add) now only accept a list
|
||||||
of patterns as the second argument (instead of a variable number of
|
of patterns as the second argument (instead of a variable number of
|
||||||
arguments). The `on_match` callback becomes an optional keyword argument.
|
arguments). The `on_match` callback becomes an optional keyword argument.
|
||||||
|
- The `PRON_LEMMA` symbol and `-PRON-` as an indicator for pronoun lemmas have
|
||||||
|
been removed.
|
||||||
|
|
||||||
### Removed or renamed API {#incompat-removed}
|
### Removed or renamed API {#incompat-removed}
|
||||||
|
|
||||||
|
|