Update docs

2025-07-07 21:33:13 +03:00 · 2020-08-25 11:54:37 +02:00 · 2020-08-25 11:54:37 +02:00 · 8ac5ef1284
commit 8ac5ef1284
parent ef43152af4
6 changed files with 177 additions and 40 deletions
--- a/website/docs/api/cli.md
+++ b/website/docs/api/cli.md
@@ -660,8 +660,10 @@ for more info.
 As of spaCy v3.0, the `pretrain` command takes the same
 [config file](/usage/training#config) as the `train` command. This ensures that
 settings are consistent between pretraining and training. Settings for
-pretraining can be defined in the `[pretraining]` block of the config file. See
-the [data format](/api/data-formats#config) for details.
+pretraining can be defined in the `[pretraining]` block of the config file and
+auto-generated by setting `--pretraining` on
+[`init fill-config`](/api/cli#init-fill-config). Also see the
+[data format](/api/data-formats#config) for details.

 </Infobox>

--- a/website/docs/api/data-formats.md
+++ b/website/docs/api/data-formats.md
@@ -375,7 +375,8 @@ The [`spacy pretrain`](/api/cli#pretrain) command lets you pretrain the
 "token-to-vector" embedding layer of pipeline components from raw text. Raw text
 can be provided as a `.jsonl` (newline-delimited JSON) file containing one input
 text per line (roughly paragraph length is good). Optionally, custom
-tokenization can be provided.
+tokenization can be provided. The JSONL format means that the texts can be read
+in line-by-line, while still making it easy to represent newlines in the data.

 > #### Tip: Writing JSONL
 >
--- a/website/docs/usage/101/_pipelines.md
+++ b/website/docs/usage/101/_pipelines.md
@@ -43,6 +43,8 @@ recognizer doesn't use any features set by the tagger and parser, and so on.
 This means that you can swap them, or remove single components from the pipeline
 without affecting the others. However, components may share a "token-to-vector"
 component like [`Tok2Vec`](/api/tok2vec) or [`Transformer`](/api/transformer).
+You can read more about this in the docs on
+[embedding layers](/usage/embeddings-transformers#embedding-layers).

 Custom components may also depend on annotations set by other components. For
 example, a custom lemmatizer may need the part-of-speech tags assigned, so it'll
--- a/website/docs/usage/embeddings-transformers.md
+++ b/website/docs/usage/embeddings-transformers.md
@@ -107,7 +107,62 @@ transformer outputs to the
 [`Doc._.trf_data`](/api/transformer#custom_attributes) extension attribute,
 giving you access to them after the pipeline has finished running.

-<!-- TODO: show example of implementation via config, side by side -->
+### Example: Shared vs. independent config {#embedding-layers-config}
+
+The [config system](/usage/training#config) lets you express model configuration
+for both shared and independent embedding layers. The shared setup uses a single
+[`Tok2Vec`](/api/tok2vec) component with the
+[Tok2Vec](/api/architectures#Tok2Vec) architecture. All other components, like
+the entity recognizer, use a
+[Tok2VecListener](/api/architectures#Tok2VecListener) layer as their model's
+`tok2vec` argument, which connects to the `tok2vec` component model.
+
+```ini
+### Shared {highlight="1-2,4-5,19-20"}
+[components.tok2vec]
+factory = "tok2vec"
+
+[components.tok2vec.model]
+@architectures = "spacy.Tok2Vec.v1"
+
+[components.tok2vec.model.embed]
+@architectures = "spacy.MultiHashEmbed.v1"
+
+[components.tok2vec.model.encode]
+@architectures = "spacy.MaxoutWindowEncoder.v1"
+
+[components.ner]
+factory = "ner"
+
+[components.ner.model]
+@architectures = "spacy.TransitionBasedParser.v1"
+
+[components.ner.model.tok2vec]
+@architectures = "spacy.Tok2VecListener.v1"
+```
+
+In the independent setup, the entity recognizer component defines its own
+[Tok2Vec](/api/architectures#Tok2Vec) instance. Other components will do the
+same. This makes them fully independent and doesn't require an upstream
+[`Tok2Vec`](/api/tok2vec) component to be present in the pipeline.
+
+```ini
+### Independent {highlight="7-8"}
+[components.ner]
+factory = "ner"
+
+[components.ner.model]
+@architectures = "spacy.TransitionBasedParser.v1"
+
+[components.ner.model.tok2vec]
+@architectures = "spacy.Tok2Vec.v1"
+
+[components.ner.model.tok2vec.embed]
+@architectures = "spacy.MultiHashEmbed.v1"
+
+[components.ner.model.tok2vec.encode]
+@architectures = "spacy.MaxoutWindowEncoder.v1"
+```

 <!-- TODO: Once rehearsal is tested, mention it here. -->
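
The difference between the two configs above comes down to which architecture is registered under a component's `model.tok2vec` block. As a rough illustration only, the sketch below uses Python's standard `configparser` (not spaCy's own config loader) to report whether each component listens to a shared layer or owns its embedding layer. The helper name `embedding_setup` and the `SHARED_CFG` string are hypothetical and exist only for this example.

```python
import configparser

# Trimmed copy of the "Shared" config from the docs change above.
SHARED_CFG = """
[components.tok2vec]
factory = "tok2vec"

[components.tok2vec.model]
@architectures = "spacy.Tok2Vec.v1"

[components.ner]
factory = "ner"

[components.ner.model]
@architectures = "spacy.TransitionBasedParser.v1"

[components.ner.model.tok2vec]
@architectures = "spacy.Tok2VecListener.v1"
"""


def embedding_setup(cfg_text):
    """Map each component name to 'shared' or 'independent'.

    Hypothetical helper: it only inspects sections shaped like
    components.<name>.model.tok2vec and checks the architecture name.
    """
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(cfg_text)
    result = {}
    for section in parser.sections():
        parts = section.split(".")
        if len(parts) == 4 and parts[0] == "components" and parts[2:] == ["model", "tok2vec"]:
            # Values keep their quotes in plain configparser, so strip them.
            arch = parser[section].get("@architectures", "").strip('"')
            result[parts[1]] = "shared" if "Tok2VecListener" in arch else "independent"
    return result


print(embedding_setup(SHARED_CFG))
```

Running the same helper on the "Independent" config would report `ner` as `independent`, because its `tok2vec` block names `spacy.Tok2Vec.v1` directly.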

@@ -503,3 +558,22 @@ def MyCustomVectors(
 ## Pretraining {#pretraining}

 <!-- TODO: write -->
+
+> #### Raw text format
+>
+> The raw text can be provided as JSONL (newline-delimited JSON) with a key
+> `"text"` per entry. This allows the data to be read in line by line, while
+> also allowing you to include newlines in the texts.
+>
+> ```json
+> {"text": "Can I ask where you work now and what you do, and if you enjoy it?"}
+> {"text": "They may just pull out of the Seattle market completely, at least until they have autonomous vehicles."}
+> ```
+
+```cli
+$ python -m spacy init fill-config config.cfg config_pretrain.cfg --pretraining
+```
+
+```cli
+$ python -m spacy pretrain raw_text.jsonl /output config_pretrain.cfg
+```
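
The raw text format described above can be produced with the standard library alone. The following is a minimal sketch, assuming one `{"text": ...}` object per line as in the example; the helper names `write_jsonl` and `read_jsonl` and the file name are illustrative and not part of spaCy. It shows why a text containing a newline still occupies exactly one line in the file: `json.dumps` escapes it as `\n`.

```python
import json
import os
import tempfile


def write_jsonl(path, texts):
    """Write one JSON object with a "text" key per line."""
    with open(path, "w", encoding="utf8") as f:
        for text in texts:
            # json.dumps escapes embedded newlines, so each record stays on one line.
            f.write(json.dumps({"text": text}) + "\n")


def read_jsonl(path):
    """Yield the texts back, reading the file line by line."""
    with open(path, encoding="utf8") as f:
        for line in f:
            line = line.strip()
            if line:
                yield json.loads(line)["text"]


texts = ["First line.\nSecond line of the same text.", "Another text."]
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "raw_text.jsonl")
    write_jsonl(path, texts)
    with open(path, encoding="utf8") as f:
        n_lines = sum(1 for _ in f)  # physical lines in the file
    restored = list(read_jsonl(path))

print(n_lines, restored == texts)
```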
--- a/website/docs/usage/projects.md
+++ b/website/docs/usage/projects.md
@@ -88,6 +88,12 @@ can also use any private repo you have access to with Git.
 >   - dest: 'assets/training.spacy'
 >     url: 'https://example.com/data.spacy'
 >     checksum: '63373dd656daa1fd3043ce166a59474c'
+>   - dest: 'assets/development.spacy'
+>     git:
+>       repo: 'https://github.com/example/repo'
+>       branch: 'master'
+>       path: 'path/developments.spacy'
+>     checksum: '5113dc04e03f079525edd8df3f4f39e3'
 > ```

 Assets are data files your project needs – for example, the training and
@@ -104,22 +110,8 @@ $ python -m spacy project assets

 Asset URLs can be a number of different protocols: HTTP, HTTPS, FTP, SSH, and
 even cloud storage such as GCS and S3. You can also fetch assets using git, by
-replacing the `url` string with a `git` block, like this:
-
-> #### project.yml
->
-> ```yaml
-> assets:
->   - dest: 'assets/training.spacy'
->     git: 
->       repo: "https://github.com/example/repo"
->       branch: "master"
->       path: "some/path"
->     checksum: '63373dd656daa1fd3043ce166a59474c'
-> ```
-
-spaCy will use Git's "sparse checkout" feature, to avoid download the whole
-repository.
+replacing the `url` string with a `git` block. spaCy will use Git's "sparse
+checkout" feature to avoid downloading the whole repository.

 ### 3. Run a command {#run}

@@ -236,10 +228,93 @@ https://github.com/explosion/spacy-boilerplates/blob/master/ner_fashion/project.
 | ------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ |
 | `vars`        | A dictionary of variables that can be referenced in paths, URLs and scripts, just like [`config.cfg` variables](/usage/training#config-interpolation). For example, `${vars.name}` will use the value of the variable `name`. Variables need to be defined in the section `vars`, but can be a nested dict, so you're able to reference `${vars.model.name}`.                                                                                                                                                |
 | `directories` | An optional list of [directories](#project-files) that should be created in the project for assets, training outputs, metrics etc. spaCy will make sure that these directories always exist.                                                                                                                                                                                                                                                                                                                 |
-| `assets`      | A list of assets that can be fetched with the [`project assets`](/api/cli#project-assets) command. `url` defines a URL or local path, `dest` is the destination file relative to the project directory, and an optional `checksum` ensures that an error is raised if the file's checksum doesn't match.                                                                                                                                                                                                     |
+| `assets`      | A list of assets that can be fetched with the [`project assets`](/api/cli#project-assets) command. `url` defines a URL or local path, `dest` is the destination file relative to the project directory, and an optional `checksum` ensures that an error is raised if the file's checksum doesn't match. Instead of `url`, you can also provide a `git` block with the keys `repo`, `branch` and `path`, to download from a Git repo.                                                                        |
 | `workflows`   | A dictionary of workflow names, mapped to a list of command names, to execute in order. Workflows can be run with the [`project run`](/api/cli#project-run) command.                                                                                                                                                                                                                                                                                                                                         |
 | `commands`    | A list of named commands. A command can define an optional help message (shown in the CLI when the user adds `--help`) and the `script`, a list of commands to run. The `deps` and `outputs` let you define the created file the command depends on and produces, respectively. This lets spaCy determine whether a command needs to be re-run because its dependencies or outputs changed. Commands can be run as part of a workflow, or separately with the [`project run`](/api/cli#project-run) command. |

+### Data assets {#data-assets}
+
+Assets are any files that your project might need, like training and development
+corpora or pretrained weights for initializing your model. Assets are defined in
+the `assets` block of your `project.yml` and can be downloaded using the
+[`project assets`](/api/cli#project-assets) command. Defining checksums lets you
+verify that someone else running your project will use the same files you used.
+Asset URLs can be a number of different **protocols**: HTTP, HTTPS, FTP, SSH,
+and even **cloud storage** such as GCS and S3. You can also download assets from
+a **Git repo** instead.
+
+#### Downloading from a URL or cloud storage {#data-assets-url}
+
+Under the hood, spaCy uses the
+[`smart-open`](https://github.com/RaRe-Technologies/smart_open) library so you
+can use any protocol it supports. Note that you may need to install extra
+dependencies to use certain protocols.
+
+> #### project.yml
+>
+> ```yaml
+> assets:
+>   # Download from public HTTPS URL
+>   - dest: 'assets/training.spacy'
+>     url: 'https://example.com/data.spacy'
+>     checksum: '63373dd656daa1fd3043ce166a59474c'
+>   # Download from Google Cloud Storage bucket
+>   - dest: 'assets/development.spacy'
+>     url: 'gs://your-bucket/corpora'
+>     checksum: '5113dc04e03f079525edd8df3f4f39e3'
+> ```
+
+| Name       | Description                                                                                                                                                                      |
+| ---------- | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
+| `dest`     | The destination path to save the downloaded asset to (relative to the project directory), including the file name.                                                               |
+| `url`      | The URL to download from, using the respective protocol.                                                                                                                         |
+| `checksum` | Optional checksum of the file. If provided, it will be used to verify that the file matches and downloads will be skipped if a local file with the same checksum already exists. |
+
+#### Downloading from a Git repo {#data-assets-git}
+
+If a `git` block is provided, the asset is downloaded from the given Git
+repository. You can download from any repo that you have access to. Under the
+hood, this uses Git's "sparse checkout" feature, so you're only downloading the
+files you need and not the whole repo.
+
+> #### project.yml
+>
+> ```yaml
+> assets:
+>   - dest: 'assets/training.spacy'
+>     git:
+>       repo: 'https://github.com/example/repo'
+>       branch: 'master'
+>       path: 'path/training.spacy'
+>     checksum: '63373dd656daa1fd3043ce166a59474c'
+> ```
+
+| Name       | Description                                                                                                                                                                                          |
+| ---------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
+| `dest`     | The destination path to save the downloaded asset to (relative to the project directory), including the file name.                                                                                   |
+| `git`      | `repo`: The URL of the repo to download from.<br />`path`: Path of the file or directory to download, relative to the repo root.<br />`branch`: The branch to download from. Defaults to `"master"`. |
+| `checksum` | Optional checksum of the file. If provided, it will be used to verify that the file matches and downloads will be skipped if a local file with the same checksum already exists.                     |
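
The two tables above, together with the private-asset case, imply three kinds of asset entries: one with a `url`, one with a `git` block, and one with neither. The sketch below is a hypothetical helper (`describe_asset` is not a spaCy function) that classifies an entry given as a plain dict, as you would get after loading `project.yml` with a YAML parser, and applies the documented `"master"` default for `branch`.

```python
def describe_asset(asset):
    """Return (kind, details) for one entry of the `assets` list.

    Hypothetical helper for illustration; the key names follow the tables above.
    """
    if "git" in asset:
        git = asset["git"]
        details = {
            "repo": git["repo"],
            "path": git["path"],
            # The docs state that `branch` defaults to "master".
            "branch": git.get("branch", "master"),
        }
        return "git", details
    if "url" in asset:
        return "url", {"url": asset["url"]}
    # Neither `url` nor `git`: the file is expected to be placed manually.
    return "local", {}


git_asset = {
    "dest": "assets/training.spacy",
    "git": {"repo": "https://github.com/example/repo", "path": "path/training.spacy"},
}
print(describe_asset(git_asset))
```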
+
+#### Working with private assets {#data-asets-private}
+
+> #### project.yml
+>
+> ```yaml
+> assets:
+>   - dest: 'assets/private_training_data.json'
+>     checksum: '63373dd656daa1fd3043ce166a59474c'
+>   - dest: 'assets/private_vectors.bin'
+>     checksum: '5113dc04e03f079525edd8df3f4f39e3'
+> ```
+
+For many projects, the datasets and weights you're working with might be
+company-internal and not available over the internet. In that case, you can
+specify the destination paths and a checksum, and leave out the URL. When your
+teammates clone and run your project, they can place the files in the respective
+directory themselves. The [`project assets`](/api/cli#project-assets) command
+will alert you about missing files and mismatched checksums, so you can ensure that
+others are running your project with the same data.
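
To make the missing-file and mismatched-checksum cases concrete, here is a minimal sketch of such a check. It assumes the checksums are MD5 hex digests, which the 32-character values in the examples suggest but the text does not state. The functions `md5_of` and `check_asset` are hypothetical and are not the implementation behind `project assets`.

```python
import hashlib
from pathlib import Path


def md5_of(path, chunk_size=65536):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def check_asset(project_dir, asset):
    """Return 'missing', 'mismatch' or 'ok' for one asset entry (a plain dict)."""
    path = Path(project_dir) / asset["dest"]
    if not path.exists():
        return "missing"
    expected = asset.get("checksum")  # the checksum is optional
    if expected is not None and md5_of(path) != expected:
        return "mismatch"
    return "ok"
```

A teammate who copies a private file into `assets/` could run `check_asset(".", asset)` for each entry to confirm they have the same data before training.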
+
 ### Dependencies and outputs {#deps-outputs}

 Each command defined in the `project.yml` can optionally define a list of
@@ -446,25 +521,6 @@ projects.

 </Infobox>

-### Working with private assets {#private-assets}
-
-For many projects, the datasets and weights you're working with might be
-company-internal and not available via a public URL. In that case, you can
-specify the destination paths and a checksum, and leave out the URL. When your
-teammates clone and run your project, they can place the files in the respective
-directory themselves. The [`spacy project assets`](/api/cli#project-assets)
-command will alert about missing files and mismatched checksums, so you can
-ensure that others are running your project with the same data.
-
-```yaml
-### project.yml
-assets:
-  - dest: 'assets/private_training_data.json'
-    checksum: '63373dd656daa1fd3043ce166a59474c'
-  - dest: 'assets/private_vectors.bin'
-    checksum: '5113dc04e03f079525edd8df3f4f39e3'
-```
-
 ## Remote Storage {#remote}

 You can persist your project outputs to a remote storage using the
--- a/website/docs/usage/v3.md
+++ b/website/docs/usage/v3.md
@@ -365,6 +365,8 @@ Note that spaCy v3.0 now requires **Python 3.6+**.
  [`DependencyMatcher.add`](/api/dependencymatcher#add) now only accept a list
  of patterns as the second argument (instead of a variable number of
  arguments). The `on_match` callback becomes an optional keyword argument.
+- The `PRON_LEMMA` symbol and `-PRON-` as an indicator for pronoun lemmas have
+  been removed.

 ### Removed or renamed API {#incompat-removed}