Merge pull request #7483 from adrianeboyd/docs/various-v3-4 [ci skip]

2025-12-23 01:53:17 +03:00 · 2021-03-22 12:37:06 +01:00 · 2021-03-22 12:37:06 +01:00 · 3ee2fcfba0
commit 3ee2fcfba0
parent 88e5a0dc16 0d2b723e8d
5 changed files with 118 additions and 55 deletions
--- a/website/docs/api/cli.md
+++ b/website/docs/api/cli.md
@ -77,7 +77,7 @@ $ python -m spacy info [model] [--markdown] [--silent] [--exclude]
 | Name                                             | Description                                                                                   |
 | ------------------------------------------------ | --------------------------------------------------------------------------------------------- |
-| `model`                                          | A trained pipeline, i.e. package name or path (optional). ~~Optional[str] \(positional)~~     |
+| `model`                                          | A trained pipeline, i.e. package name or path (optional). ~~Optional[str] \(option)~~         |
 | `--markdown`, `-md`                              | Print information as Markdown. ~~bool (flag)~~                                                |
 | `--silent`, `-s` <Tag variant="new">2.0.12</Tag> | Don't print anything, just return the values. ~~bool (flag)~~                                 |
 | `--exclude`, `-e`                                | Comma-separated keys to exclude from the print-out. Defaults to `"labels"`. ~~Optional[str]~~ |
@ -259,7 +259,7 @@ $ python -m spacy convert [input_file] [output_dir] [--converter] [--file-type]
 | Name                                             | Description                                                                                                                               |
 | ------------------------------------------------ | ----------------------------------------------------------------------------------------------------------------------------------------- |
 | `input_file`                                     | Input file. ~~Path (positional)~~                                                                                                         |
-| `output_dir`                                     | Output directory for converted file. Defaults to `"-"`, meaning data will be written to `stdout`. ~~Optional[Path] \(positional)~~        |
+| `output_dir`                                     | Output directory for converted file. Defaults to `"-"`, meaning data will be written to `stdout`. ~~Optional[Path] \(option)~~            |
 | `--converter`, `-c` <Tag variant="new">2</Tag>   | Name of converter to use (see below). ~~str (option)~~                                                                                    |
 | `--file-type`, `-t` <Tag variant="new">2.1</Tag> | Type of file to create. Either `spacy` (default) for binary [`DocBin`](/api/docbin) data or `json` for v2.x JSON format. ~~str (option)~~ |
 | `--n-sents`, `-n`                                | Number of sentences per document. Supported for: `conll`, `conllu`, `iob`, `ner` ~~int (option)~~                                         |
@ -642,7 +642,7 @@ $ python -m spacy debug profile [model] [inputs] [--n-texts]
 | Name              | Description                                                                        |
 | ----------------- | ---------------------------------------------------------------------------------- |
 | `model`           | A loadable spaCy pipeline (package name or path). ~~str (positional)~~             |
-| `inputs`          | Optional path to input file, or `-` for standard input. ~~Path (positional)~~      |
+| `inputs`          | Path to input file, or `-` for standard input. ~~Path (positional)~~               |
 | `--n-texts`, `-n` | Maximum number of texts to use if available. Defaults to `10000`. ~~int (option)~~ |
 | `--help`, `-h`    | Show help message and available arguments. ~~bool (flag)~~                         |
 | **PRINTS**        | Profiling information for the pipeline.                                            |
@ -1192,9 +1192,9 @@ $ python -m spacy project dvc [project_dir] [workflow] [--force] [--verbose]
 > ```
 | Name              | Description                                                                                                   |
-| ----------------- | ----------------------------------------------------------------------------------------------------------------- |
+| ----------------- | ------------------------------------------------------------------------------------------------------------- |
 | `project_dir`     | Path to project directory. Defaults to current working directory. ~~Path (positional)~~                       |
-| `workflow`        | Name of workflow defined in `project.yml`. Defaults to first workflow if not set. ~~Optional[str] \(positional)~~ |
+| `workflow`        | Name of workflow defined in `project.yml`. Defaults to first workflow if not set. ~~Optional[str] \(option)~~ |
 | `--force`, `-F`   | Force-updating config file. ~~bool (flag)~~                                                                   |
 | `--verbose`, `-V` | Print more output generated by DVC. ~~bool (flag)~~                                                           |
 | `--help`, `-h`    | Show help message and available arguments. ~~bool (flag)~~                                                    |
@ -1236,7 +1236,7 @@ $ python -m spacy ray train [config_path] [--code] [--output] [--n-workers] [--a
 | ------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ |
 | `config_path`       | Path to [training config](/api/data-formats#config) file containing all settings and hyperparameters. ~~Path (positional)~~                                                                |
 | `--code`, `-c`      | Path to Python file with additional code to be imported. Allows [registering custom functions](/usage/training#custom-functions) for new architectures. ~~Optional[Path] \(option)~~       |
-| `--output`, `-o`    | Directory or remote storage URL for saving trained pipeline. The directory will be created if it doesn't exist. ~~Optional[Path] \(positional)~~                                           |
+| `--output`, `-o`    | Directory or remote storage URL for saving trained pipeline. The directory will be created if it doesn't exist. ~~Optional[Path] \(option)~~                                               |
 | `--n-workers`, `-n` | The number of workers. Defaults to `1`. ~~int (option)~~                                                                                                                                   |
 | `--address`, `-a`   | Optional address of the Ray cluster. If not set (default), Ray will run locally. ~~Optional[str] \(option)~~                                                                               |
 | `--gpu-id`, `-g`    | GPU ID or `-1` for CPU. Defaults to `-1`. ~~int (option)~~                                                                                                                                 |
--- a/website/docs/api/language.md
+++ b/website/docs/api/language.md
@ -198,7 +198,6 @@ more efficient than processing texts one-by-one.
 | `as_tuples`                                | If set to `True`, inputs should be a sequence of `(text, context)` tuples. Output will then be a sequence of `(doc, context)` tuples. Defaults to `False`. ~~bool~~ |
 | `batch_size`                               | The number of texts to buffer. ~~Optional[int]~~                                                                                                                    |
 | `disable`                                  | Names of pipeline components to [disable](/usage/processing-pipelines#disabling). ~~List[str]~~                                                                     |
-| `cleanup`                                  | If `True`, unneeded strings are freed to control memory use. Experimental. ~~bool~~                                                                                 |
 | `component_cfg`                            | Optional dictionary of keyword arguments for components, keyed by component names. Defaults to `None`. ~~Optional[Dict[str, Dict[str, Any]]]~~                      |
 | `n_process` <Tag variant="new">2.2.2</Tag> | Number of processors to use. Defaults to `1`. ~~int~~                                                                                                               |
 | **YIELDS**                                 | Documents in the order of the original text. ~~Doc~~                                                                                                                |
@ -873,7 +872,7 @@ when loading a config with
 > ```
 | Name           | Description                                                                                                                                                                                                                                                                                                                                                                                                                            |
-| -------------- | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
+| -------------- | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
 | `tok2vec_name` | Name of the token-to-vector component, typically `"tok2vec"` or `"transformer"`.~~str~~                                                                                                                                                                                                                                                                                                                                                |
 | `pipe_name`    | Name of pipeline component to replace listeners for. ~~str~~                                                                                                                                                                                                                                                                                                                                                                           |
 | `listeners`    | The paths to the listeners, relative to the component config, e.g. `["model.tok2vec"]`. Typically, implementations will only connect to one tok2vec component, `model.tok2vec`, but in theory, custom models can use multiple listeners. The value here can either be an empty list to not replace any listeners, or a _complete_ list of the paths to all listener layers used by the model that should be replaced. ~~Iterable[str]~~ |
--- a/website/docs/usage/linguistic-features.md
+++ b/website/docs/usage/linguistic-features.md
@ -599,18 +599,27 @@ ents = [(e.text, e.start_char, e.end_char, e.label_) for e in doc.ents]
 print('Before', ents)
 # The model didn't recognize "fb" as an entity :(
-fb_ent = Span(doc, 0, 1, label="ORG") # create a Span for the new entity
+# Create a span for the new entity
 fb_ent = Span(doc, 0, 1, label="ORG")
 # Option 1: Modify the provided entity spans, leaving the rest unmodified
 doc.set_ents([fb_ent], default="unmodified")
 # Option 2: Assign a complete list of ents to doc.ents
 doc.ents = list(doc.ents) + [fb_ent]
-ents = [(e.text, e.start_char, e.end_char, e.label_) for e in doc.ents]
+ents = [(e.text, e.start, e.end, e.label_) for e in doc.ents]
 print('After', ents)
-# [('fb', 0, 2, 'ORG')] 🎉
+# [('fb', 0, 1, 'ORG')] 🎉
 ```
-Keep in mind that you need to create a `Span` with the start and end index of
+Keep in mind that `Span` is initialized with the start and end **token**
-the **token**, not the start and end index of the entity in the document. In
+indices, not the character offsets. To create a span from character offsets, use
-this case, "fb" is token `(0, 1)` – but at the document level, the entity will
+[`Doc.char_span`](/api/doc#char_span):
-have the start and end indices `(0, 2)`.
+
 ```python
 fb_ent = doc.char_span(0, 2, label="ORG")
 ```
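The difference between token indices and character offsets can be illustrated without spaCy. The following is a minimal sketch that assumes plain whitespace tokenization; `token_offsets` and `char_to_token_span` are hypothetical helpers and not part of spaCy's API. Like `Doc.char_span` with its default settings, the lookup returns `None` when the character offsets don't line up with token boundaries.

```python
from typing import List, Optional, Tuple


def token_offsets(text: str) -> List[Tuple[int, int]]:
    """Return (start_char, end_char) for each whitespace-separated token."""
    offsets = []
    pos = 0
    for token in text.split():
        start = text.index(token, pos)
        end = start + len(token)
        offsets.append((start, end))
        pos = end
    return offsets


def char_to_token_span(
    offsets: List[Tuple[int, int]], start_char: int, end_char: int
) -> Optional[Tuple[int, int]]:
    """Map character offsets to (start, end) token indices, end exclusive."""
    starts = {s: i for i, (s, _) in enumerate(offsets)}
    ends = {e: i + 1 for i, (_, e) in enumerate(offsets)}
    if start_char in starts and end_char in ends:
        start, end = starts[start_char], ends[end_char]
        if start < end:
            return (start, end)
    # Offsets that fall inside a token don't map to a valid span
    return None


# Illustrative sentence in which "fb" is the first token
text = "fb is hiring a new vice president of global policy"
offsets = token_offsets(text)
span = char_to_token_span(offsets, 0, 2)        # characters 0-2 cover token 0
misaligned = char_to_token_span(offsets, 0, 1)  # ends in the middle of "fb"
```

Here the character range `(0, 2)` corresponds to the token range `(0, 1)`, which is the pair of values that `Span` expects.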
 #### Setting entity annotations from array {#setting-from-array}
@ -645,9 +654,10 @@ write efficient native code.
 ```python
 # cython: infer_types=True
 from spacy.typedefs cimport attr_t
 from spacy.tokens.doc cimport Doc
-cpdef set_entity(Doc doc, int start, int end, int ent_type):
+cpdef set_entity(Doc doc, int start, int end, attr_t ent_type):
    for i in range(start, end):
        doc.c[i].ent_type = ent_type
    doc.c[start].ent_iob = 3
--- a/website/docs/usage/processing-pipelines.md
+++ b/website/docs/usage/processing-pipelines.md
@ -54,9 +54,8 @@ texts = ["This is a text", "These are lots of texts", "..."]
 In this example, we're using [`nlp.pipe`](/api/language#pipe) to process a
 (potentially very large) iterable of texts as a stream. Because we're only
 accessing the named entities in `doc.ents` (set by the `ner` component), we'll
-disable all other statistical components (the `tagger` and `parser`) during
+disable all other components during processing. `nlp.pipe` yields `Doc` objects,
-processing. `nlp.pipe` yields `Doc` objects, so we can iterate over them and
+so we can iterate over them and access the named entity predictions:
-access the named entity predictions:
 > #### ✏️ Things to try
 >
@ -73,7 +72,7 @@ texts = [
 ]
 nlp = spacy.load("en_core_web_sm")
-for doc in nlp.pipe(texts, disable=["tagger", "parser"]):
+for doc in nlp.pipe(texts, disable=["tok2vec", "tagger", "parser", "attribute_ruler", "lemmatizer"]):
    # Do something with the doc here
    print([(ent.text, ent.label_) for ent in doc.ents])
 ```
@ -92,6 +91,54 @@ have to call `list()` on it first:
 </Infobox>
 ### Multiprocessing {#multiprocessing}
 spaCy includes built-in support for multiprocessing with
 [`nlp.pipe`](/api/language#pipe) using the `n_process` option:
 ```python
 # Multiprocessing with 4 processes
 docs = nlp.pipe(texts, n_process=4)
 # With as many processes as CPUs (use with caution!)
 docs = nlp.pipe(texts, n_process=-1)
 ```
 Depending on your platform, starting many processes with multiprocessing can add
 a lot of overhead. In particular, the default start method `spawn` used in
 macOS/OS X (as of Python 3.8) and in Windows can be slow for larger models
 because the model data is copied in memory for each new process. See the
 [Python docs on multiprocessing](https://docs.python.org/3/library/multiprocessing.html#contexts-and-start-methods)
 for further details.
 For shorter tasks and in particular with `spawn`, it can be faster to use a
 smaller number of processes with a larger batch size. The optimal `batch_size`
 setting will depend on the pipeline components, the length of your documents,
 the number of processes and how much memory is available.
 ```python
 # Default batch size is `nlp.batch_size` (typically 1000)
 docs = nlp.pipe(texts, n_process=2, batch_size=2000)
 ```
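To get a feeling for how the batch size changes the number of work units that can be distributed across processes, here is a small standard-library sketch. The `minibatch` helper below is written for illustration only and is not spaCy's own implementation (spaCy ships a separate `spacy.util.minibatch`); the number of texts is an arbitrary assumption.

```python
from itertools import islice
from typing import Iterable, Iterator, List


def minibatch(items: Iterable[str], size: int) -> Iterator[List[str]]:
    """Yield consecutive batches of at most `size` items."""
    it = iter(items)
    while True:
        batch = list(islice(it, size))
        if not batch:
            return
        yield batch


texts = [f"text {i}" for i in range(5000)]
# Count how many batches each setting produces for the same input
n_batches_1000 = sum(1 for _ in minibatch(texts, 1000))
n_batches_2000 = sum(1 for _ in minibatch(texts, 2000))
```

With fewer, larger batches there are fewer hand-offs between processes, which is one reason a smaller `n_process` combined with a larger `batch_size` can pay off for short tasks.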
 <Infobox title="Multiprocessing on GPU" variant="warning">
 Multiprocessing is not generally recommended on GPU because RAM is too limited.
 If you want to try it out, be aware that it is only possible using `spawn` due
 to limitations in CUDA.
 </Infobox>
 <Infobox title="Multiprocessing with transformer models" variant="warning">
 In Linux, transformer models may hang or deadlock with multiprocessing due to an
 [issue in PyTorch](https://github.com/pytorch/pytorch/issues/17199). One
 suggested workaround is to use `spawn` instead of `fork` and another is to limit
 the number of threads before loading any models using
 `torch.set_num_threads(1)`.
 </Infobox>
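The `spawn` workaround can be sketched with Python's standard `multiprocessing` module. This assumes that the worker processes are created through Python's default multiprocessing context, so the start method has to be selected before `nlp.pipe` is called with `n_process`. The helper name `use_spawn` is hypothetical.

```python
import multiprocessing


def use_spawn() -> str:
    """Select the "spawn" start method and return the active method name."""
    # force=True allows replacing a start method that was already set
    multiprocessing.set_start_method("spawn", force=True)
    return multiprocessing.get_start_method()


if __name__ == "__main__":
    print(use_spawn())
    # The thread-limiting workaround would go here, before loading any models:
    # torch.set_num_threads(1)
```

Setting the start method inside the `if __name__ == "__main__":` guard matters with `spawn`, because each child process re-imports the main module.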
 ## Pipelines and built-in components {#pipelines}
 spaCy makes it very easy to create your own pipelines consisting of reusable
@ -144,10 +191,12 @@ nlp = spacy.load("en_core_web_sm")
 ```
 ... the pipeline's `config.cfg` tells spaCy to use the language `"en"` and the
-pipeline `["tok2vec", "tagger", "parser", "ner"]`. spaCy will then initialize
+pipeline
-`spacy.lang.en.English`, and create each pipeline component and add it to the
+`["tok2vec", "tagger", "parser", "ner", "attribute_ruler", "lemmatizer"]`. spaCy
-processing pipeline. It'll then load in the model data from the data directory
+will then initialize `spacy.lang.en.English`, and create each pipeline component
-and return the modified `Language` class for you to use as the `nlp` object.
+and add it to the processing pipeline. It'll then load in the model data from
 the data directory and return the modified `Language` class for you to use as
 the `nlp` object.
 <Infobox title="Changed in v3.0" variant="warning">
@ -171,7 +220,7 @@ the binary data:
 ```python
 ### spacy.load under the hood
 lang = "en"
-pipeline = ["tok2vec", "tagger", "parser", "ner"]
+pipeline = ["tok2vec", "tagger", "parser", "ner", "attribute_ruler", "lemmatizer"]
 data_path = "path/to/en_core_web_sm/en_core_web_sm-3.0.0"
 cls = spacy.util.get_lang_class(lang)  # 1. Get Language class, e.g. English
@ -186,7 +235,7 @@ component** on the `Doc`, in order. Since the model data is loaded, the
 components can access it to assign annotations to the `Doc` object, and
 subsequently to the `Token` and `Span` which are only views of the `Doc`, and
 don't own any data themselves. All components return the modified document,
-which is then processed by the component next in the pipeline.
+which is then processed by the next component in the pipeline.
 ```python
 ### The pipeline under the hood
@ -201,9 +250,9 @@ list of human-readable component names.
 ```python
 print(nlp.pipeline)
-# [('tok2vec', <spacy.pipeline.Tok2Vec>), ('tagger', <spacy.pipeline.Tagger>), ('parser', <spacy.pipeline.DependencyParser>), ('ner', <spacy.pipeline.EntityRecognizer>)]
+# [('tok2vec', <spacy.pipeline.Tok2Vec>), ('tagger', <spacy.pipeline.Tagger>), ('parser', <spacy.pipeline.DependencyParser>), ('ner', <spacy.pipeline.EntityRecognizer>), ('attribute_ruler', <spacy.pipeline.AttributeRuler>), ('lemmatizer', <spacy.lang.en.lemmatizer.EnglishLemmatizer>)]
 print(nlp.pipe_names)
-# ['tok2vec', 'tagger', 'parser', 'ner']
+# ['tok2vec', 'tagger', 'parser', 'ner', 'attribute_ruler', 'lemmatizer']
 ```
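The relationship between the `(name, component)` tuples, the list of names and the effect of `disable` can be sketched in plain Python. Everything in this block is a simplified stand-in: the dictionary-based document, the two toy components and the `run` function are assumptions made for illustration, not spaCy internals.

```python
from typing import Callable, Dict, Iterable, List, Tuple

ToyDoc = Dict[str, List[str]]
Component = Callable[[ToyDoc], ToyDoc]


def tagger(doc: ToyDoc) -> ToyDoc:
    # Toy component: assign a placeholder tag to every word
    doc["tags"] = ["X" for _ in doc["words"]]
    return doc


def lemmatizer(doc: ToyDoc) -> ToyDoc:
    # Toy component: use the lowercased word as its "lemma"
    doc["lemmas"] = [word.lower() for word in doc["words"]]
    return doc


pipeline: List[Tuple[str, Component]] = [("tagger", tagger), ("lemmatizer", lemmatizer)]
pipe_names = [name for name, _ in pipeline]


def run(text: str, pipeline: List[Tuple[str, Component]], disable: Iterable[str] = ()) -> ToyDoc:
    """Call each enabled component on the document, in order."""
    doc: ToyDoc = {"words": text.split()}
    for name, proc in pipeline:
        if name in disable:
            continue
        doc = proc(doc)
    return doc


full = run("Hello World", pipeline)
partial = run("Hello World", pipeline, disable=["lemmatizer"])
```

A disabled component is skipped entirely, so the annotations it would have set are simply absent from the result.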
 ### Built-in pipeline components {#built-in}
@ -300,7 +349,7 @@ blocks.
 ```python
 ### Disable for block
 # 1. Use as a context manager
-with nlp.select_pipes(disable=["tagger", "parser"]):
+with nlp.select_pipes(disable=["tagger", "parser", "lemmatizer"]):
    doc = nlp("I won't be tagged and parsed")
 doc = nlp("I will be tagged and parsed")
@ -324,7 +373,7 @@ The [`nlp.pipe`](/api/language#pipe) method also supports a `disable` keyword
 argument if you only want to disable components during processing:
 ```python
-for doc in nlp.pipe(texts, disable=["tagger", "parser"]):
+for doc in nlp.pipe(texts, disable=["tagger", "parser", "lemmatizer"]):
    # Do something with the doc here
 ```
@ -1497,24 +1546,33 @@ to `Doc.user_span_hooks` and `Doc.user_token_hooks`.
 | Name               | Customizes                                                                                                                                                                                                              |
 | ------------------ | ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
-| `user_hooks`       | [`Doc.vector`](/api/doc#vector), [`Doc.has_vector`](/api/doc#has_vector), [`Doc.vector_norm`](/api/doc#vector_norm), [`Doc.sents`](/api/doc#sents)                                                                      |
+| `user_hooks`       | [`Doc.similarity`](/api/doc#similarity), [`Doc.vector`](/api/doc#vector), [`Doc.has_vector`](/api/doc#has_vector), [`Doc.vector_norm`](/api/doc#vector_norm), [`Doc.sents`](/api/doc#sents)                             |
 | `user_token_hooks` | [`Token.similarity`](/api/token#similarity), [`Token.vector`](/api/token#vector), [`Token.has_vector`](/api/token#has_vector), [`Token.vector_norm`](/api/token#vector_norm), [`Token.conjuncts`](/api/token#conjuncts) |
 | `user_span_hooks`  | [`Span.similarity`](/api/span#similarity), [`Span.vector`](/api/span#vector), [`Span.has_vector`](/api/span#has_vector), [`Span.vector_norm`](/api/span#vector_norm), [`Span.root`](/api/span#root)                     |
 ```python
 ### Add custom similarity hooks
 from spacy.language import Language
 class SimilarityModel:
-    def __init__(self, model):
+    def __init__(self, name: str, index: int):
-        self._model = model
+        self.name = name
        self.index = index
    def __call__(self, doc):
        doc.user_hooks["similarity"] = self.similarity
        doc.user_span_hooks["similarity"] = self.similarity
        doc.user_token_hooks["similarity"] = self.similarity
        return doc
    def similarity(self, obj1, obj2):
-        y = self._model([obj1.vector, obj2.vector])
+        return obj1.vector[self.index] + obj2.vector[self.index]
-        return float(y[0])
+
@Language.factory("similarity_component", default_config={"index": 0})
 def create_similarity_component(nlp, name, index: int):
    return SimilarityModel(name, index)
 ```
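How a hook takes over a built-in method can be sketched without spaCy. The `ToyDoc` class below is a simplified stand-in and not spaCy's `Doc`: it assumes that `similarity` first checks `user_hooks` and only falls back to a default (here cosine similarity) if no hook is registered. The function `index_similarity` mirrors the toy scoring logic of the example above.

```python
import math
from typing import Callable, Dict, List


class ToyDoc:
    """Simplified stand-in for an object with user hooks."""

    def __init__(self, vector: List[float]):
        self.vector = vector
        self.user_hooks: Dict[str, Callable] = {}

    def similarity(self, other: "ToyDoc") -> float:
        # A registered hook replaces the default implementation
        if "similarity" in self.user_hooks:
            return self.user_hooks["similarity"](self, other)
        dot = sum(a * b for a, b in zip(self.vector, other.vector))
        norm = math.sqrt(sum(a * a for a in self.vector)) * math.sqrt(
            sum(b * b for b in other.vector)
        )
        return dot / norm if norm else 0.0


def index_similarity(obj1: ToyDoc, obj2: ToyDoc, index: int = 0) -> float:
    # Toy score: sum of one vector dimension of both objects
    return obj1.vector[index] + obj2.vector[index]


doc1 = ToyDoc([1.0, 0.0])
doc2 = ToyDoc([1.0, 0.0])
default_score = doc1.similarity(doc2)
doc1.user_hooks["similarity"] = index_similarity
hooked_score = doc1.similarity(doc2)
```

The hook receives both objects, which is why the `similarity` method of the component takes `obj1` and `obj2`.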
 ## Developing plugins and wrappers {#plugins}
--- a/website/docs/usage/saving-loading.md
+++ b/website/docs/usage/saving-loading.md
@ -19,9 +19,8 @@ import Serialization101 from 'usage/101/\_serialization.md'
 When serializing the pipeline, keep in mind that this will only save out the
 **binary data for the individual components** to allow spaCy to restore them –
 not the entire objects. This is a good thing, because it makes serialization
-safe. But it also means that you have to take care of storing the language name
+safe. But it also means that you have to take care of storing the config, which
-and pipeline component names as well, and restoring them separately before you
+contains the pipeline configuration and all the relevant settings.
-can load in the data.
 > #### Saving the meta and config
 >
@ -33,24 +32,21 @@ can load in the data.
 ```python
 ### Serialize
 config = nlp.config
 bytes_data = nlp.to_bytes()
-lang = nlp.config["nlp"]["lang"]  # "en"
-pipeline = nlp.config["nlp"]["pipeline"]  # ["tagger", "parser", "ner"]
 ```
 ```python
 ### Deserialize
-nlp = spacy.blank(lang)
+lang_cls = spacy.util.get_lang_class(config["nlp"]["lang"])
-for pipe_name in pipeline:
+nlp = lang_cls.from_config(config)
-    nlp.add_pipe(pipe_name)
 nlp.from_bytes(bytes_data)
 ```
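Why the config has to be stored alongside the binary data can be shown with a toy model of the same idea. `ToyPipeline` is a hypothetical class written for this sketch, with JSON standing in for the binary format. It assumes, as described above, that the serialized bytes only contain component data and that components must already exist before data can be loaded into them.

```python
import json
from typing import Dict, List


class ToyPipeline:
    """Simplified stand-in for a pipeline; not a spaCy class."""

    def __init__(self, config: Dict[str, List[str]]):
        self.config = config
        # The config determines which components exist
        self.weights: Dict[str, List[float]] = {name: [] for name in config["pipeline"]}

    @classmethod
    def from_config(cls, config: Dict[str, List[str]]) -> "ToyPipeline":
        return cls(config)

    def to_bytes(self) -> bytes:
        # Only the component data is serialized, not the config
        return json.dumps(self.weights).encode("utf8")

    def from_bytes(self, data: bytes) -> "ToyPipeline":
        loaded = json.loads(data.decode("utf8"))
        # Data is only restored for components that were already created
        for name in self.weights:
            self.weights[name] = loaded[name]
        return self


original = ToyPipeline.from_config({"pipeline": ["tagger", "ner"]})
original.weights["ner"] = [0.1, 0.2]
config, bytes_data = original.config, original.to_bytes()

# With the saved config, the components exist and receive their data
restored = ToyPipeline.from_config(config).from_bytes(bytes_data)
# Without it, there is nothing to load the data into
empty = ToyPipeline.from_config({"pipeline": []}).from_bytes(bytes_data)
```

In this sketch the pipeline created from an empty config silently ignores the data, which illustrates why the bytes alone are not enough to restore a pipeline.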
 This is also how spaCy does it under the hood when loading a pipeline: it loads
 the `config.cfg` containing the language and pipeline information, initializes
-the language class, creates and adds the pipeline components based on the
+the language class, creates and adds the pipeline components based on the config
-defined [factories](/usage/processing-pipeline#custom-components-factories) and
+and _then_ loads in the binary data. You can read more about this process
-_then_ loads in the binary data. You can read more about this process
 [here](/usage/processing-pipelines#pipelines).
 ## Serializing Doc objects efficiently {#docs new="2.2"}