mirror of
https://github.com/explosion/spaCy.git
synced 2025-01-26 17:24:41 +03:00
Update docs
parent
ef43152af4
commit
8ac5ef1284
|
@ -660,8 +660,10 @@ for more info.
|
|||
As of spaCy v3.0, the `pretrain` command takes the same
|
||||
[config file](/usage/training#config) as the `train` command. This ensures that
|
||||
settings are consistent between pretraining and training. Settings for
|
||||
pretraining can be defined in the `[pretraining]` block of the config file. See
|
||||
the [data format](/api/data-formats#config) for details.
|
||||
pretraining can be defined in the `[pretraining]` block of the config file and
|
||||
auto-generated by setting `--pretraining` on
|
||||
[`init fill-config`](/api/cli#init-fill-config). Also see the
|
||||
[data format](/api/data-formats#config) for details.
|
||||
|
||||
</Infobox>
|
||||
|
||||
|
|
|
@ -375,7 +375,8 @@ The [`spacy pretrain`](/api/cli#pretrain) command lets you pretrain the
|
|||
"token-to-vector" embedding layer of pipeline components from raw text. Raw text
|
||||
can be provided as a `.jsonl` (newline-delimited JSON) file containing one input
|
||||
text per line (roughly paragraph length is good). Optionally, custom
|
||||
tokenization can be provided.
|
||||
tokenization can be provided. The JSONL format means that the texts can be read
|
||||
in line-by-line, while still making it easy to represent newlines in the data.
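As a rough illustration of why JSONL works for this, the sketch below uses only Python's standard `json` module. The helper names `texts_to_jsonl` and `jsonl_to_texts` are hypothetical and not part of spaCy. A newline inside a text is escaped by the JSON encoder, so each record still occupies exactly one physical line.

```python
import json


def texts_to_jsonl(texts):
    """Serialize each text as one JSON object per line, under the "text" key."""
    return "\n".join(json.dumps({"text": text}) for text in texts)


def jsonl_to_texts(data):
    """Read the records back line by line, skipping blank lines."""
    return [json.loads(line)["text"] for line in data.splitlines() if line.strip()]


# The first text contains a real newline character.
texts = ["First line.\nSecond line of the same text.", "Another text."]
jsonl = texts_to_jsonl(texts)
restored = jsonl_to_texts(jsonl)
```

Here `jsonl` contains two lines, one per text, and `restored` is equal to the original list, including the embedded newline.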
|
||||
|
||||
> #### Tip: Writing JSONL
|
||||
>
|
||||
|
|
|
@ -43,6 +43,8 @@ recognizer doesn't use any features set by the tagger and parser, and so on.
|
|||
This means that you can swap them, or remove single components from the pipeline
|
||||
without affecting the others. However, components may share a "token-to-vector"
|
||||
component like [`Tok2Vec`](/api/tok2vec) or [`Transformer`](/api/transformer).
|
||||
You can read more about this in the docs on
|
||||
[embedding layers](/usage/embeddings-transformers#embedding-layers).
|
||||
|
||||
Custom components may also depend on annotations set by other components. For
|
||||
example, a custom lemmatizer may need the part-of-speech tags assigned, so it'll
|
||||
|
|
|
@ -107,7 +107,62 @@ transformer outputs to the
|
|||
[`Doc._.trf_data`](/api/transformer#custom_attributes) extension attribute,
|
||||
giving you access to them after the pipeline has finished running.
|
||||
|
||||
<!-- TODO: show example of implementation via config, side by side -->
|
||||
### Example: Shared vs. independent config {#embedding-layers-config}
|
||||
|
||||
The [config system](/usage/training#config) lets you express model configuration
|
||||
for both shared and independent embedding layers. The shared setup uses a single
|
||||
[`Tok2Vec`](/api/tok2vec) component with the
|
||||
[Tok2Vec](/api/architectures#Tok2Vec) architecture. All other components, like
|
||||
the entity recognizer, use a
|
||||
[Tok2VecListener](/api/architectures#Tok2VecListener) layer as their model's
|
||||
`tok2vec` argument, which connects to the `tok2vec` component model.
|
||||
|
||||
```ini
|
||||
### Shared {highlight="1-2,4-5,19-20"}
|
||||
[components.tok2vec]
|
||||
factory = "tok2vec"
|
||||
|
||||
[components.tok2vec.model]
|
||||
@architectures = "spacy.Tok2Vec.v1"
|
||||
|
||||
[components.tok2vec.model.embed]
|
||||
@architectures = "spacy.MultiHashEmbed.v1"
|
||||
|
||||
[components.tok2vec.model.encode]
|
||||
@architectures = "spacy.MaxoutWindowEncoder.v1"
|
||||
|
||||
[components.ner]
|
||||
factory = "ner"
|
||||
|
||||
[components.ner.model]
|
||||
@architectures = "spacy.TransitionBasedParser.v1"
|
||||
|
||||
[components.ner.model.tok2vec]
|
||||
@architectures = "spacy.Tok2VecListener.v1"
|
||||
```
|
||||
|
||||
In the independent setup, the entity recognizer component defines its own
|
||||
[Tok2Vec](/api/architectures#Tok2Vec) instance. Other components will do the
|
||||
same. This makes them fully independent and doesn't require an upstream
|
||||
[`Tok2Vec`](/api/tok2vec) component to be present in the pipeline.
|
||||
|
||||
```ini
|
||||
### Independent {highlight="7-8"}
|
||||
[components.ner]
|
||||
factory = "ner"
|
||||
|
||||
[components.ner.model]
|
||||
@architectures = "spacy.TransitionBasedParser.v1"
|
||||
|
||||
[components.ner.model.tok2vec]
|
||||
@architectures = "spacy.Tok2Vec.v1"
|
||||
|
||||
[components.ner.model.tok2vec.embed]
|
||||
@architectures = "spacy.MultiHashEmbed.v1"
|
||||
|
||||
[components.ner.model.tok2vec.encode]
|
||||
@architectures = "spacy.MaxoutWindowEncoder.v1"
|
||||
```
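One way to tell the two setups apart is to look at which components declare a listener as their `tok2vec` sub-model. The sketch below is only an illustration: it parses abbreviated versions of the configs above with Python's standard `configparser` as a stand-in for spaCy's own config system, and `listening_components` is a hypothetical helper, not a spaCy function.

```python
import configparser

# Abbreviated versions of the two configs shown above.
SHARED = """
[components.tok2vec]
factory = "tok2vec"

[components.tok2vec.model]
@architectures = "spacy.Tok2Vec.v1"

[components.ner]
factory = "ner"

[components.ner.model.tok2vec]
@architectures = "spacy.Tok2VecListener.v1"
"""

INDEPENDENT = """
[components.ner]
factory = "ner"

[components.ner.model.tok2vec]
@architectures = "spacy.Tok2Vec.v1"
"""


def listening_components(config_text):
    """Return the names of components whose tok2vec sub-model is a listener."""
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(config_text)
    names = []
    for section in parser.sections():
        parts = section.split(".")
        # Only look at sections of the form components.<name>.model.tok2vec
        if len(parts) == 4 and parts[0] == "components" and parts[2:] == ["model", "tok2vec"]:
            arch = parser[section].get("@architectures", "").strip('"')
            if arch.startswith("spacy.Tok2VecListener"):
                names.append(parts[1])
    return names


shared_listeners = listening_components(SHARED)
independent_listeners = listening_components(INDEPENDENT)
```

With these inputs, the shared config reports `ner` as a listening component, while the independent config reports none.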
|
||||
|
||||
<!-- TODO: Once rehearsal is tested, mention it here. -->
|
||||
|
||||
|
@ -503,3 +558,22 @@ def MyCustomVectors(
|
|||
## Pretraining {#pretraining}
|
||||
|
||||
<!-- TODO: write -->
|
||||
|
||||
> #### Raw text format
|
||||
>
|
||||
> The raw text can be provided as JSONL (newline-delimited JSON) with a key
|
||||
> `"text"` per entry. This allows the data to be read in line by line, while
|
||||
> also allowing you to include newlines in the texts.
|
||||
>
|
||||
> ```json
|
||||
> {"text": "Can I ask where you work now and what you do, and if you enjoy it?"}
|
||||
> {"text": "They may just pull out of the Seattle market completely, at least until they have autonomous vehicles."}
|
||||
> ```
|
||||
|
||||
```cli
|
||||
$ python -m spacy init fill-config config.cfg config_pretrain.cfg --pretraining
|
||||
```
|
||||
|
||||
```cli
|
||||
$ python -m spacy pretrain raw_text.jsonl /output config_pretrain.cfg
|
||||
```
|
||||
|
|
|
@ -88,6 +88,12 @@ can also use any private repo you have access to with Git.
|
|||
> - dest: 'assets/training.spacy'
|
||||
> url: 'https://example.com/data.spacy'
|
||||
> checksum: '63373dd656daa1fd3043ce166a59474c'
|
||||
> - dest: 'assets/development.spacy'
|
||||
> git:
|
||||
> repo: 'https://github.com/example/repo'
|
||||
> branch: 'master'
|
||||
> path: 'path/developments.spacy'
|
||||
> checksum: '5113dc04e03f079525edd8df3f4f39e3'
|
||||
> ```
|
||||
|
||||
Assets are data files your project needs – for example, the training and
|
||||
|
@ -104,22 +110,8 @@ $ python -m spacy project assets
|
|||
|
||||
Asset URLs can use a number of different protocols: HTTP, HTTPS, FTP, SSH, and
|
||||
even cloud storage such as GCS and S3. You can also fetch assets using git, by
|
||||
replacing the `url` string with a `git` block, like this:
|
||||
|
||||
> #### project.yml
|
||||
>
|
||||
> ```yaml
|
||||
> assets:
|
||||
> - dest: 'assets/training.spacy'
|
||||
> git:
|
||||
> repo: "https://github.com/example/repo"
|
||||
> branch: "master"
|
||||
> path: "some/path"
|
||||
> checksum: '63373dd656daa1fd3043ce166a59474c'
|
||||
> ```
|
||||
|
||||
spaCy will use Git's "sparse checkout" feature to avoid downloading the whole
|
||||
repository.
|
||||
replacing the `url` string with a `git` block. spaCy will use Git's "sparse
|
||||
checkout" feature to avoid downloading the whole repository.
|
||||
|
||||
### 3. Run a command {#run}
|
||||
|
||||
|
@ -236,10 +228,93 @@ https://github.com/explosion/spacy-boilerplates/blob/master/ner_fashion/project.
|
|||
| ------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ |
|
||||
| `vars` | A dictionary of variables that can be referenced in paths, URLs and scripts, just like [`config.cfg` variables](/usage/training#config-interpolation). For example, `${vars.name}` will use the value of the variable `name`. Variables need to be defined in the section `vars`, but can be a nested dict, so you're able to reference `${vars.model.name}`. |
|
||||
| `directories` | An optional list of [directories](#project-files) that should be created in the project for assets, training outputs, metrics etc. spaCy will make sure that these directories always exist. |
|
||||
| `assets` | A list of assets that can be fetched with the [`project assets`](/api/cli#project-assets) command. `url` defines a URL or local path, `dest` is the destination file relative to the project directory, and an optional `checksum` ensures that an error is raised if the file's checksum doesn't match. |
|
||||
| `assets` | A list of assets that can be fetched with the [`project assets`](/api/cli#project-assets) command. `url` defines a URL or local path, `dest` is the destination file relative to the project directory, and an optional `checksum` ensures that an error is raised if the file's checksum doesn't match. Instead of `url`, you can also provide a `git` block with the keys `repo`, `branch` and `path`, to download from a Git repo. |
|
||||
| `workflows` | A dictionary of workflow names, mapped to a list of command names, to execute in order. Workflows can be run with the [`project run`](/api/cli#project-run) command. |
|
||||
| `commands` | A list of named commands. A command can define an optional help message (shown in the CLI when the user adds `--help`) and the `script`, a list of commands to run. The `deps` and `outputs` let you define the files the command depends on and produces, respectively. This lets spaCy determine whether a command needs to be re-run because its dependencies or outputs changed. Commands can be run as part of a workflow, or separately with the [`project run`](/api/cli#project-run) command. |
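To illustrate how a reference like `${vars.model.name}` can resolve against a nested `vars` dictionary, here is a minimal sketch in plain Python. It is not spaCy's implementation; the `interpolate` helper and the example values are assumptions made for illustration only.

```python
import re


def interpolate(text, variables):
    """Replace each ${vars.a.b} reference with the value found by walking the dict."""

    def lookup(match):
        value = variables
        # "model.name" is resolved one key at a time: variables["model"]["name"]
        for key in match.group(1).split("."):
            value = value[key]
        return str(value)

    return re.sub(r"\$\{vars\.([\w.]+)\}", lookup, text)


# Hypothetical variables, as they might appear in the `vars` section.
project_vars = {"name": "ner_fashion", "model": {"name": "en_fashion_ner"}}
flat = interpolate("training/${vars.name}", project_vars)
nested = interpolate("packages/${vars.model.name}", project_vars)
```

Text without a `${vars...}` reference is returned unchanged, and a reference to an undefined variable raises a `KeyError` in this sketch.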
|
||||
|
||||
### Data assets {#data-assets}
|
||||
|
||||
Assets are any files that your project might need, like training and development
|
||||
corpora or pretrained weights for initializing your model. Assets are defined in
|
||||
the `assets` block of your `project.yml` and can be downloaded using the
|
||||
[`project assets`](/api/cli#project-assets) command. Defining checksums lets you
|
||||
verify that someone else running your project will use the same files you used.
|
||||
Asset URLs can use a number of different **protocols**: HTTP, HTTPS, FTP, SSH,
|
||||
and even **cloud storage** such as GCS and S3. You can also download assets from
|
||||
a **Git repo** instead.
|
||||
|
||||
#### Downloading from a URL or cloud storage {#data-assets-url}
|
||||
|
||||
Under the hood, spaCy uses the
|
||||
[`smart-open`](https://github.com/RaRe-Technologies/smart_open) library so you
|
||||
can use any protocol it supports. Note that you may need to install extra
|
||||
dependencies to use certain protocols.
|
||||
|
||||
> #### project.yml
|
||||
>
|
||||
> ```yaml
|
||||
> assets:
|
||||
> # Download from public HTTPS URL
|
||||
> - dest: 'assets/training.spacy'
|
||||
> url: 'https://example.com/data.spacy'
|
||||
> checksum: '63373dd656daa1fd3043ce166a59474c'
|
||||
> # Download from Google Cloud Storage bucket
|
||||
> - dest: 'assets/development.spacy'
|
||||
> url: 'gs://your-bucket/corpora'
|
||||
> checksum: '5113dc04e03f079525edd8df3f4f39e3'
|
||||
> ```
|
||||
|
||||
| Name | Description |
|
||||
| ---------- | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
|
||||
| `dest` | The destination path to save the downloaded asset to (relative to the project directory), including the file name. |
|
||||
| `url` | The URL to download from, using the respective protocol. |
|
||||
| `checksum` | Optional checksum of the file. If provided, it will be used to verify that the file matches and downloads will be skipped if a local file with the same checksum already exists. |
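The skip-if-unchanged behavior described for `checksum` can be sketched as follows. This assumes the checksum is an MD5 hex digest of the file contents, which the 32-character example values suggest but which is an assumption here. The helper names `md5_checksum` and `needs_download` are hypothetical and not part of spaCy's API.

```python
import hashlib
from pathlib import Path


def md5_checksum(path):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with Path(path).open("rb") as file_:
        for chunk in iter(lambda: file_.read(8192), b""):
            digest.update(chunk)
    return digest.hexdigest()


def needs_download(dest, checksum):
    """Return False only if a local file exists and matches the expected checksum."""
    dest = Path(dest)
    if not dest.exists():
        return True
    return md5_checksum(dest) != checksum
```

A caller would fetch the asset only when `needs_download` returns `True`, and could compare the checksum again after downloading to detect a corrupted or changed file.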
|
||||
|
||||
#### Downloading from a Git repo {#data-assets-git}
|
||||
|
||||
If a `git` block is provided, the asset is downloaded from the given Git
|
||||
repository. You can download from any repo that you have access to. Under the
|
||||
hood, this uses Git's "sparse checkout" feature, so you're only downloading the
|
||||
files you need and not the whole repo.
|
||||
|
||||
> #### project.yml
|
||||
>
|
||||
> ```yaml
|
||||
> assets:
|
||||
> - dest: 'assets/training.spacy'
|
||||
> git:
|
||||
> repo: 'https://github.com/example/repo'
|
||||
> branch: 'master'
|
||||
> path: 'path/training.spacy'
|
||||
> checksum: '63373dd656daa1fd3043ce166a59474c'
|
||||
> ```
|
||||
|
||||
| Name | Description |
|
||||
| ---------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
|
||||
| `dest` | The destination path to save the downloaded asset to (relative to the project directory), including the file name. |
|
||||
| `git` | `repo`: The URL of the repo to download from.<br />`path`: Path of the file or directory to download, relative to the repo root.<br />`branch`: The branch to download from. Defaults to `"master"`. |
|
||||
| `checksum` | Optional checksum of the file. If provided, it will be used to verify that the file matches and downloads will be skipped if a local file with the same checksum already exists. |
|
||||
|
||||
#### Working with private assets {#data-asets-private}
|
||||
|
||||
> #### project.yml
|
||||
>
|
||||
> ```yaml
|
||||
> assets:
|
||||
> - dest: 'assets/private_training_data.json'
|
||||
> checksum: '63373dd656daa1fd3043ce166a59474c'
|
||||
> - dest: 'assets/private_vectors.bin'
|
||||
> checksum: '5113dc04e03f079525edd8df3f4f39e3'
|
||||
> ```
|
||||
|
||||
For many projects, the datasets and weights you're working with might be
|
||||
company-internal and not available over the internet. In that case, you can
|
||||
specify the destination paths and a checksum, and leave out the URL. When your
|
||||
teammates clone and run your project, they can place the files in the respective
|
||||
directory themselves. The [`project assets`](/api/cli#project-assets) command
|
||||
will alert you about missing files and mismatched checksums, so you can ensure that
|
||||
others are running your project with the same data.
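The kind of check described above can be approximated in a few lines of Python. This is a sketch under assumptions, not the logic of the `project assets` command: it assumes MD5 hex digests, and `check_assets` and its status strings are hypothetical names chosen for illustration.

```python
import hashlib
from pathlib import Path


def check_assets(project_dir, assets):
    """Map each asset's dest to "missing", "mismatch" or "ok"."""
    report = {}
    for asset in assets:
        path = Path(project_dir) / asset["dest"]
        if not path.exists():
            report[asset["dest"]] = "missing"
            continue
        # Assumption: the checksum is the MD5 hex digest of the file contents.
        actual = hashlib.md5(path.read_bytes()).hexdigest()
        report[asset["dest"]] = "ok" if actual == asset["checksum"] else "mismatch"
    return report
```

The `assets` argument mirrors the entries in the `project.yml` above: a list of dictionaries with a `dest` and a `checksum` key and no `url`.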
|
||||
|
||||
### Dependencies and outputs {#deps-outputs}
|
||||
|
||||
Each command defined in the `project.yml` can optionally define a list of
|
||||
|
@ -446,25 +521,6 @@ projects.
|
|||
|
||||
</Infobox>
|
||||
|
||||
### Working with private assets {#private-assets}
|
||||
|
||||
For many projects, the datasets and weights you're working with might be
|
||||
company-internal and not available via a public URL. In that case, you can
|
||||
specify the destination paths and a checksum, and leave out the URL. When your
|
||||
teammates clone and run your project, they can place the files in the respective
|
||||
directory themselves. The [`spacy project assets`](/api/cli#project-assets)
|
||||
command will alert about missing files and mismatched checksums, so you can
|
||||
ensure that others are running your project with the same data.
|
||||
|
||||
```yaml
|
||||
### project.yml
|
||||
assets:
|
||||
- dest: 'assets/private_training_data.json'
|
||||
checksum: '63373dd656daa1fd3043ce166a59474c'
|
||||
- dest: 'assets/private_vectors.bin'
|
||||
checksum: '5113dc04e03f079525edd8df3f4f39e3'
|
||||
```
|
||||
|
||||
## Remote Storage {#remote}
|
||||
|
||||
You can persist your project outputs to a remote storage using the
|
||||
|
|
|
@ -365,6 +365,8 @@ Note that spaCy v3.0 now requires **Python 3.6+**.
|
|||
[`DependencyMatcher.add`](/api/dependencymatcher#add) now only accepts a list
|
||||
of patterns as the second argument (instead of a variable number of
|
||||
arguments). The `on_match` callback becomes an optional keyword argument.
|
||||
- The `PRON_LEMMA` symbol and `-PRON-` as an indicator for pronoun lemmas have
|
||||
been removed.
|
||||
|
||||
### Removed or renamed API {#incompat-removed}
|
||||
|
||||
|
|