	Document Assigned Attributes of Pipeline Components (#9041)
* Add textcat docs
* Add NER docs
* Add Entity Linker docs
* Add assigned fields docs for the tagger

  This also adds a preamble, since there wasn't one.

* Add morphologizer docs
* Add dependency parser docs
* Update entityrecognizer docs

  This is a little weird because `Doc.ents` is the only thing assigned to,
  but it's actually a bidirectional property.

* Add token fields for entityrecognizer
* Fix section name
* Add entity ruler docs
* Add lemmatizer docs
* Add sentencizer/recognizer docs
* Update website/docs/api/entityrecognizer.md

  Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

* Update website/docs/api/entityruler.md

  Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

* Update website/docs/api/tagger.md

  Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

* Update website/docs/api/entityruler.md

  Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

* Update type for Doc.ents

  This was `Tuple[Span, ...]` everywhere but `Tuple[Span]` seems to be correct.

* Run prettier
* Apply suggestions from code review

  Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Run prettier
* Add transformers section

  This basically just moves and renames the "custom attributes" section from
  the bottom of the page to be consistent with "assigned attributes" on other
  pages. I looked at moving the paragraph just above the section into the
  section, but it includes the unrelated registry additions, so it seemed
  better to leave it unchanged.

* Make table header consistent

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
Parent: f803a84571
Commit: ba6a37d358
@@ -555,8 +555,8 @@ consists of either two or three subnetworks:
 
 <Accordion title="spacy.TransitionBasedParser.v1 definition" spaced>
 
-[TransitionBasedParser.v1](/api/legacy#TransitionBasedParser_v1) had the exact same signature,
-but the `use_upper` argument was `True` by default.
+[TransitionBasedParser.v1](/api/legacy#TransitionBasedParser_v1) had the exact
+same signature, but the `use_upper` argument was `True` by default.
 
 </Accordion>
 
@@ -25,6 +25,20 @@ current state. The weights are updated such that the scores assigned to the set
 of optimal actions is increased, while scores assigned to other actions are
 decreased. Note that more than one action may be optimal for a given state.
 
+## Assigned Attributes {#assigned-attributes}
+
+Dependency predictions are assigned to the `Token.dep` and `Token.head` fields.
+Besides the dependencies themselves, the parser decides sentence boundaries,
+which are saved in `Token.is_sent_start` and accessible via `Doc.sents`.
+
+| Location              | Value                                                                                                                                          |
+| --------------------- | ---------------------------------------------------------------------------------------------------------------------------------------------- |
+| `Token.dep`           | The type of dependency relation (hash). ~~int~~                                                                                                |
+| `Token.dep_`          | The type of dependency relation. ~~str~~                                                                                                       |
+| `Token.head`          | The syntactic parent, or "governor", of this token. ~~Token~~                                                                                  |
+| `Token.is_sent_start` | A boolean value indicating whether the token starts a sentence. After the parser runs, this will be `True` or `False` for all tokens. ~~bool~~ |
+| `Doc.sents`           | An iterator over sentences in the `Doc`, determined by `Token.is_sent_start` values. ~~Iterator[Span]~~                                        |
+
 ## Config and implementation {#config}
 
 The default config is defined by the pipeline component factory and describes
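A minimal sketch of reading the fields documented above, assuming the
`en_core_web_sm` pipeline is installed:

```python
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Autonomous cars shift insurance liability toward manufacturers.")

# Token.dep_ holds the relation label, Token.head its syntactic governor.
for token in doc:
    print(token.text, token.dep_, token.head.text)

# The parser also sets sentence boundaries, exposed via Doc.sents.
print([sent.text for sent in doc.sents])
```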
@@ -571,9 +571,9 @@ objects, if the entity recognizer has been applied.
 > assert ents[0].text == "Mr. Best"
 > ```
 
-| Name        | Description                                                            |
-| ----------- | ---------------------------------------------------------------------- |
-| **RETURNS** | Entities in the document, one `Span` per entity. ~~Tuple[Span, ...]~~  |
+| Name        | Description                                                      |
+| ----------- | ---------------------------------------------------------------- |
+| **RETURNS** | Entities in the document, one `Span` per entity. ~~Tuple[Span]~~ |
 
 ## Doc.spans {#spans tag="property"}
 
@@ -16,6 +16,16 @@ plausible candidates from that `KnowledgeBase` given a certain textual mention,
 and a machine learning model to pick the right candidate, given the local
 context of the mention.
 
+## Assigned Attributes {#assigned-attributes}
+
+Predictions, in the form of knowledge base IDs, will be assigned to
+`Token.ent_kb_id_`.
+
+| Location           | Value                             |
+| ------------------ | --------------------------------- |
+| `Token.ent_kb_id`  | Knowledge base ID (hash). ~~int~~ |
+| `Token.ent_kb_id_` | Knowledge base ID. ~~str~~        |
+
 ## Config and implementation {#config}
 
 The default config is defined by the pipeline component factory and describes
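A hedged sketch of reading the assigned IDs; `./my_nel_pipeline` is a
hypothetical path to a pipeline trained with an `entity_linker` component, not
a shipped model:

```python
import spacy

# "./my_nel_pipeline" is a placeholder for your own trained pipeline.
nlp = spacy.load("./my_nel_pipeline")
doc = nlp("Ada Lovelace was born in London.")

# Tokens inside linked entities carry the predicted knowledge base ID.
for token in doc:
    print(token.text, token.ent_kb_id_)
```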
@@ -20,6 +20,24 @@ your entities will be close to their initial tokens. If your entities are long
 and characterized by tokens in their middle, the component will likely not be a
 good fit for your task.
 
+## Assigned Attributes {#assigned-attributes}
+
+Predictions will be saved to `Doc.ents` as a tuple. Each label will also be
+reflected in each underlying token, where it is saved in the `Token.ent_type`
+and `Token.ent_iob` fields. Note that by definition each token can only have
+one label.
+
+When setting `Doc.ents` to create training data, all the spans must be valid
+and non-overlapping, or an error will be thrown.
+
+| Location          | Value                                                             |
+| ----------------- | ----------------------------------------------------------------- |
+| `Doc.ents`        | The annotated spans. ~~Tuple[Span]~~                              |
+| `Token.ent_iob`   | An enum encoding of the IOB part of the named entity tag. ~~int~~ |
+| `Token.ent_iob_`  | The IOB part of the named entity tag. ~~str~~                     |
+| `Token.ent_type`  | The label part of the named entity tag (hash). ~~int~~            |
+| `Token.ent_type_` | The label part of the named entity tag. ~~str~~                   |
+
 ## Config and implementation {#config}
 
 The default config is defined by the pipeline component factory and describes
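A minimal sketch of both directions, assuming `en_core_web_sm` is installed:

```python
import spacy
from spacy.tokens import Span

nlp = spacy.load("en_core_web_sm")
doc = nlp("Apple is opening its first big office in San Francisco.")

# Reading: Doc.ents holds the spans, and each token mirrors its label.
print([(ent.text, ent.label_) for ent in doc.ents])
print([(token.text, token.ent_iob_, token.ent_type_) for token in doc])

# Writing, e.g. to create training data: the spans must be valid and
# non-overlapping, or an error is raised.
doc = nlp.make_doc("I like London.")
doc.ents = [Span(doc, 2, 3, label="GPE")]
```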
@@ -15,6 +15,27 @@ used on its own to implement a purely rule-based entity recognition system. For
 usage examples, see the docs on
 [rule-based entity recognition](/usage/rule-based-matching#entityruler).
 
+## Assigned Attributes {#assigned-attributes}
+
+This component assigns predictions basically the same way as the
+[`EntityRecognizer`](/api/entityrecognizer).
+
+Predictions can be accessed under `Doc.ents` as a tuple. Each label will also
+be reflected in each underlying token, where it is saved in the
+`Token.ent_type` and `Token.ent_iob` fields. Note that by definition each
+token can only have one label.
+
+When setting `Doc.ents` to create training data, all the spans must be valid
+and non-overlapping, or an error will be thrown.
+
+| Location          | Value                                                             |
+| ----------------- | ----------------------------------------------------------------- |
+| `Doc.ents`        | The annotated spans. ~~Tuple[Span]~~                              |
+| `Token.ent_iob`   | An enum encoding of the IOB part of the named entity tag. ~~int~~ |
+| `Token.ent_iob_`  | The IOB part of the named entity tag. ~~str~~                     |
+| `Token.ent_type`  | The label part of the named entity tag (hash). ~~int~~            |
+| `Token.ent_type_` | The label part of the named entity tag. ~~str~~                   |
+
 ## Config and implementation {#config}
 
 The default config is defined by the pipeline component factory and describes
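A minimal sketch with a blank pipeline and a single pattern:

```python
import spacy

nlp = spacy.blank("en")
ruler = nlp.add_pipe("entity_ruler")
ruler.add_patterns([{"label": "ORG", "pattern": "Apple"}])

doc = nlp("Apple is hiring.")
print([(ent.text, ent.label_) for ent in doc.ents])
print([(token.text, token.ent_iob_, token.ent_type_) for token in doc])
```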
@@ -105,7 +105,8 @@ and residual connections.
 
 ### spacy.TransitionBasedParser.v1 {#TransitionBasedParser_v1}
 
-Identical to [`spacy.TransitionBasedParser.v2`](/api/architectures#TransitionBasedParser)
+Identical to
+[`spacy.TransitionBasedParser.v2`](/api/architectures#TransitionBasedParser)
 except the `use_upper` was set to `True` by default.
 
 ### spacy.TextCatEnsemble.v1 {#TextCatEnsemble_v1}
@@ -31,6 +31,15 @@ available in the pipeline and runs _before_ the lemmatizer.
 
 </Infobox>
 
+## Assigned Attributes {#assigned-attributes}
+
+Lemmas, whether generated by rules or predicted, will be saved to `Token.lemma`.
+
+| Location       | Value                     |
+| -------------- | ------------------------- |
+| `Token.lemma`  | The lemma (hash). ~~int~~ |
+| `Token.lemma_` | The lemma. ~~str~~        |
+
 ## Config and implementation
 
 The default config is defined by the pipeline component factory and describes
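A minimal sketch, assuming `en_core_web_sm` is installed:

```python
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("The striped bats were hanging on their feet.")

# Token.lemma_ is the string lemma; Token.lemma is its hash.
print([(token.text, token.lemma_) for token in doc])
```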
@@ -15,6 +15,16 @@ coarse-grained POS tags following the Universal Dependencies
 [FEATS](https://universaldependencies.org/format.html#morphological-annotation)
 annotation guidelines.
 
+## Assigned Attributes {#assigned-attributes}
+
+Predictions are saved to `Token.morph` and `Token.pos`.
+
+| Location      | Value                                     |
+| ------------- | ----------------------------------------- |
+| `Token.pos`   | The UPOS part of speech (hash). ~~int~~   |
+| `Token.pos_`  | The UPOS part of speech. ~~str~~          |
+| `Token.morph` | Morphological features. ~~MorphAnalysis~~ |
+
 ## Config and implementation {#config}
 
 The default config is defined by the pipeline component factory and describes
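A minimal sketch, assuming a pipeline that includes a morphologizer, such as
`de_core_news_sm`:

```python
import spacy

nlp = spacy.load("de_core_news_sm")
doc = nlp("Wo bist du?")

# Token.pos_ is the UPOS tag; Token.morph is a MorphAnalysis of FEATS features.
for token in doc:
    print(token.text, token.pos_, token.morph)
```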
@@ -105,11 +105,11 @@ representation.
 
 ## Attributes {#attributes}
 
-| Name          | Description                                                                                                                  |
-| ------------- | ---------------------------------------------------------------------------------------------------------------------------- | ---------- |
-| `FEATURE_SEP` | The [FEATS](https://universaldependencies.org/format.html#morphological-annotation) feature separator. Default is `          | `. ~~str~~ |
-| `FIELD_SEP`   | The [FEATS](https://universaldependencies.org/format.html#morphological-annotation) field separator. Default is `=`. ~~str~~ |
-| `VALUE_SEP`   | The [FEATS](https://universaldependencies.org/format.html#morphological-annotation) value separator. Default is `,`. ~~str~~ |
+| Name          | Description                                                                                                                    |
+| ------------- | ------------------------------------------------------------------------------------------------------------------------------ |
+| `FEATURE_SEP` | The [FEATS](https://universaldependencies.org/format.html#morphological-annotation) feature separator. Default is `|`. ~~str~~ |
+| `FIELD_SEP`   | The [FEATS](https://universaldependencies.org/format.html#morphological-annotation) field separator. Default is `=`. ~~str~~   |
+| `VALUE_SEP`   | The [FEATS](https://universaldependencies.org/format.html#morphological-annotation) value separator. Default is `,`. ~~str~~   |
 
 ## MorphAnalysis {#morphanalysis tag="class" source="spacy/tokens/morphanalysis.pyx"}
 
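These separators compose FEATS strings such as `Case=Nom|Number=Sing`, as a
quick check of the class attributes shows:

```python
from spacy.morphology import Morphology

print(Morphology.FEATURE_SEP)  # "|"
print(Morphology.FIELD_SEP)    # "="
print(Morphology.VALUE_SEP)    # ","
```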
@@ -149,8 +149,8 @@ patterns = [nlp("health care reform"), nlp("healthcare reform")]
 </Infobox>
 
 | Name           | Description                                                                                                                                                |
-| -------------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------- | --- |
-| `match_id`     | An ID for the thing you're matching. ~~str~~                                                                                                               |     |
+| -------------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------- |
+| `match_id`     | An ID for the thing you're matching. ~~str~~                                                                                                               |
 | `docs`         | `Doc` objects of the phrases to match. ~~List[Doc]~~                                                                                                       |
 | _keyword-only_ |                                                                                                                                                            |
 | `on_match`     | Callback function to act on matches. Takes the arguments `matcher`, `doc`, `i` and `matches`. ~~Optional[Callable[[Matcher, Doc, int, List[tuple], Any]]~~ |
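A minimal sketch of `PhraseMatcher.add` with the arguments above:

```python
import spacy
from spacy.matcher import PhraseMatcher

nlp = spacy.blank("en")
matcher = PhraseMatcher(nlp.vocab)

# nlp.make_doc creates the pattern Docs without running the whole pipeline.
patterns = [nlp.make_doc(text) for text in ["health care reform", "healthcare reform"]]
matcher.add("HEALTH", patterns)

doc = nlp("He discussed healthcare reform at length.")
for match_id, start, end in matcher(doc):
    print(nlp.vocab.strings[match_id], doc[start:end].text)
```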
@@ -80,7 +80,7 @@ Docs with `has_unknown_spaces` are skipped during scoring.
 > ```
 
 | Name        | Description                                                                                                          |
-| ----------- | --------------------------------------------------------------------------------------------------------------------- | ------------------------------------------------------------------------------------------------------ |
+| ----------- | --------------------------------------------------------------------------------------------------------------------- |
 | `examples`  | The `Example` objects holding both the predictions and the correct gold-standard annotations. ~~Iterable[Example]~~ |
 | **RETURNS** | `Dict`                                                                                                               | A dictionary containing the scores `token_acc`, `token_p`, `token_r`, `token_f`. ~~Dict[str, float]]~~ |
 
@@ -12,6 +12,16 @@ api_trainable: true
 A trainable pipeline component for sentence segmentation. For a simpler,
 rule-based strategy, see the [`Sentencizer`](/api/sentencizer).
 
+## Assigned Attributes {#assigned-attributes}
+
+Predicted values will be assigned to `Token.is_sent_start`. The resulting
+sentences can be accessed using `Doc.sents`.
+
+| Location              | Value                                                                                                                           |
+| --------------------- | ------------------------------------------------------------------------------------------------------------------------------- |
+| `Token.is_sent_start` | A boolean value indicating whether the token starts a sentence. This will be either `True` or `False` for all tokens. ~~bool~~ |
+| `Doc.sents`           | An iterator over sentences in the `Doc`, determined by `Token.is_sent_start` values. ~~Iterator[Span]~~                         |
+
 ## Config and implementation {#config}
 
 The default config is defined by the pipeline component factory and describes
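A minimal sketch, assuming `en_core_web_sm`, where the trained `senter`
component ships disabled by default:

```python
import spacy

# Load without the parser and enable the senter instead.
nlp = spacy.load("en_core_web_sm", exclude=["parser"])
nlp.enable_pipe("senter")

doc = nlp("This is a sentence. This is another one.")
print([sent.text for sent in doc.sents])
```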
@@ -13,6 +13,16 @@ performed by the [`DependencyParser`](/api/dependencyparser), so the
 `Sentencizer` lets you implement a simpler, rule-based strategy that doesn't
 require a statistical model to be loaded.
 
+## Assigned Attributes {#assigned-attributes}
+
+Calculated values will be assigned to `Token.is_sent_start`. The resulting
+sentences can be accessed using `Doc.sents`.
+
+| Location              | Value                                                                                                                           |
+| --------------------- | ------------------------------------------------------------------------------------------------------------------------------- |
+| `Token.is_sent_start` | A boolean value indicating whether the token starts a sentence. This will be either `True` or `False` for all tokens. ~~bool~~ |
+| `Doc.sents`           | An iterator over sentences in the `Doc`, determined by `Token.is_sent_start` values. ~~Iterator[Span]~~                         |
+
 ## Config and implementation {#config}
 
 The default config is defined by the pipeline component factory and describes
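A minimal sketch using a blank pipeline:

```python
import spacy

nlp = spacy.blank("en")
nlp.add_pipe("sentencizer")

doc = nlp("This is a sentence. This is another one.")
# Every token receives an explicit True/False Token.is_sent_start value.
print([(token.text, token.is_sent_start) for token in doc])
print([sent.text for sent in doc.sents])
```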
@@ -28,7 +38,7 @@ how the component should be configured. You can override its settings via the
 > ```
 
 | Setting       | Description                                                                                                                                             |
-| ------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------- | ------ |
+| ------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------- |
 | `punct_chars` | Optional custom list of punctuation characters that mark sentence ends. See below for defaults if not set. Defaults to `None`. ~~Optional[List[str]]~~ | `None` |
 
 ```python
@@ -8,6 +8,21 @@ api_string_name: tagger
 api_trainable: true
 ---
 
+A trainable pipeline component to predict part-of-speech tags for any
+part-of-speech tag set.
+
+In the pre-trained pipelines, the tag schemas vary by language; see the
+[individual model pages](/models) for details.
+
+## Assigned Attributes {#assigned-attributes}
+
+Predictions are assigned to `Token.tag`.
+
+| Location     | Value                              |
+| ------------ | ---------------------------------- |
+| `Token.tag`  | The part of speech (hash). ~~int~~ |
+| `Token.tag_` | The part of speech. ~~str~~        |
+
 ## Config and implementation {#config}
 
 The default config is defined by the pipeline component factory and describes
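A minimal sketch, assuming `en_core_web_sm` is installed:

```python
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("I saw the book on the table.")

# Token.tag_ is the fine-grained tag string; Token.tag is its hash.
print([(token.text, token.tag_) for token in doc])
```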
@@ -29,6 +29,22 @@ only.
 
 </Infobox>
 
+## Assigned Attributes {#assigned-attributes}
+
+Predictions will be saved to `doc.cats` as a dictionary, where the key is the
+name of the category and the value is a score between 0 and 1 (inclusive). For
+`textcat` (exclusive categories), the scores will sum to 1, while for
+`textcat_multilabel` there is no particular guarantee about their sum.
+
+Note that when assigning values to create training data, the score of each
+category must be 0 or 1. Using other values, for example to create a document
+that is a little bit in category A and a little bit in category B, is not
+supported.
+
+| Location   | Value                                 |
+| ---------- | ------------------------------------- |
+| `Doc.cats` | Category scores. ~~Dict[str, float]~~ |
+
 ## Config and implementation {#config}
 
 The default config is defined by the pipeline component factory and describes
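A hedged sketch of reading the scores; `./my_textcat_pipeline` is a
hypothetical path to a pipeline trained with a `textcat` component:

```python
import spacy

# "./my_textcat_pipeline" is a placeholder for your own trained pipeline.
nlp = spacy.load("./my_textcat_pipeline")
doc = nlp("This was a great movie!")

# For exclusive categories the scores sum to 1.
print(doc.cats)  # e.g. {"POSITIVE": 0.95, "NEGATIVE": 0.05}
```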
@@ -38,12 +38,21 @@ attributes. We also calculate an alignment between the word-piece tokens and the
 spaCy tokenization, so that we can use the last hidden states to set the
 `Doc.tensor` attribute. When multiple word-piece tokens align to the same spaCy
 token, the spaCy token receives the sum of their values. To access the values,
-you can use the custom [`Doc._.trf_data`](#custom-attributes) attribute. The
+you can use the custom [`Doc._.trf_data`](#assigned-attributes) attribute. The
 package also adds the function registries [`@span_getters`](#span_getters) and
 [`@annotation_setters`](#annotation_setters) with several built-in registered
 functions. For more details, see the
 [usage documentation](/usage/embeddings-transformers).
 
+## Assigned Attributes {#assigned-attributes}
+
+The component sets the following
+[custom extension attribute](/usage/processing-pipeline#custom-components-attributes):
+
+| Location         | Value                                                                    |
+| ---------------- | ------------------------------------------------------------------------ |
+| `Doc._.trf_data` | Transformer tokens and outputs for the `Doc` object. ~~TransformerData~~ |
+
 ## Config and implementation {#config}
 
 The default config is defined by the pipeline component factory and describes
@@ -98,7 +107,7 @@ https://github.com/explosion/spacy-transformers/blob/master/spacy_transformers/p
 Construct a `Transformer` component. One or more subsequent spaCy components can
 use the transformer outputs as features in its model, with gradients
 backpropagated to the single shared weights. The activations from the
-transformer are saved in the [`Doc._.trf_data`](#custom-attributes) extension
+transformer are saved in the [`Doc._.trf_data`](#assigned-attributes) extension
 attribute. You can also provide a callback to set additional annotations. In
 your application, you would normally use a shortcut for this and instantiate the
 component using its string name and [`nlp.add_pipe`](/api/language#create_pipe).
@@ -205,7 +214,7 @@ modifying them.
 
 Assign the extracted features to the `Doc` objects. By default, the
 [`TransformerData`](/api/transformer#transformerdata) object is written to the
-[`Doc._.trf_data`](#custom-attributes) attribute. Your `set_extra_annotations`
+[`Doc._.trf_data`](#assigned-attributes) attribute. Your `set_extra_annotations`
 callback is then called, if provided.
 
 > #### Example
@@ -383,7 +392,7 @@ are wrapped into the
 [FullTransformerBatch](/api/transformer#fulltransformerbatch) object. The
 `FullTransformerBatch` then splits out the per-document data, which is handled
 by this class. Instances of this class are typically assigned to the
-[`Doc._.trf_data`](/api/transformer#custom-attributes) extension attribute.
+[`Doc._.trf_data`](/api/transformer#assigned-attributes) extension attribute.
 
 | Name      | Description                                                                                                                                               |
 | --------- | --------------------------------------------------------------------------------------------------------------------------------------------------------- |
@@ -549,12 +558,3 @@ The following built-in functions are available:
 | Name                                           | Description                           |
 | ---------------------------------------------- | ------------------------------------- |
 | `spacy-transformers.null_annotation_setter.v1` | Don't set any additional annotations. |
-
-## Custom attributes {#custom-attributes}
-
-The component sets the following
-[custom extension attributes](/usage/processing-pipeline#custom-components-attributes):
-
-| Name             | Description                                                              |
-| ---------------- | ------------------------------------------------------------------------ |
-| `Doc._.trf_data` | Transformer tokens and outputs for the `Doc` object. ~~TransformerData~~ |
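A minimal sketch, assuming `spacy-transformers` and the `en_core_web_trf`
pipeline are installed:

```python
import spacy

nlp = spacy.load("en_core_web_trf")
doc = nlp("Transformer features are stored on the doc.")

trf_data = doc._.trf_data          # TransformerData for this Doc
print(trf_data.tensors[0].shape)   # word-piece level output tensor
```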
@@ -321,7 +321,7 @@ performed in chunks to avoid consuming too much memory. You can set the
 > ```
 
 | Name           | Description                                                                 |
-| -------------- | --------------------------------------------------------------------------- | ----------------------------------------------------------------------------------------------------------------------- |
+| -------------- | --------------------------------------------------------------------------- |
 | `queries`      | An array with one or more vectors. ~~numpy.ndarray~~                        |
 | _keyword-only_ |                                                                             |
 | `batch_size`   | The batch size to use. Default to `1024`. ~~int~~                           |
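A minimal sketch of `most_similar` on a toy table:

```python
import numpy
from spacy.vectors import Vectors

# Three 2d vectors keyed by strings.
data = numpy.asarray([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]], dtype="f")
vectors = Vectors(data=data, keys=["cat", "kitten", "car"])

queries = numpy.asarray([[1.0, 0.05]], dtype="f")
keys, best_rows, scores = vectors.most_similar(queries, n=2, batch_size=1024)
print(keys, scores)
```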
@@ -21,14 +21,14 @@ Create the vocabulary.
 > vocab = Vocab(strings=["hello", "world"])
 > ```
 
-| Name                                        | Description                                                                                                                                            |
-| ------------------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------ |
-| `lex_attr_getters`                          | A dictionary mapping attribute IDs to functions to compute them. Defaults to `None`. ~~Optional[Dict[str, Callable[[str], Any]]]~~                     |
-| `strings`                                   | A [`StringStore`](/api/stringstore) that maps strings to hash values, and vice versa, or a list of strings. ~~Union[List[str], StringStore]~~          |
-| `lookups`                                   | A [`Lookups`](/api/lookups) that stores the `lexeme_norm` and other large lookup tables. Defaults to `None`. ~~Optional[Lookups]~~                     |
-| `oov_prob`                                  | The default OOV probability. Defaults to `-20.0`. ~~float~~                                                                                            |
-| `vectors_name` <Tag variant="new">2.2</Tag> | A name to identify the vectors table. ~~str~~                                                                                                          |
-| `writing_system`                            | A dictionary describing the language's writing system. Typically provided by [`Language.Defaults`](/api/language#defaults). ~~Dict[str, Any]~~         |
+| Name                                        | Description                                                                                                                                             |
+| ------------------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------- |
+| `lex_attr_getters`                          | A dictionary mapping attribute IDs to functions to compute them. Defaults to `None`. ~~Optional[Dict[str, Callable[[str], Any]]]~~                      |
+| `strings`                                   | A [`StringStore`](/api/stringstore) that maps strings to hash values, and vice versa, or a list of strings. ~~Union[List[str], StringStore]~~           |
+| `lookups`                                   | A [`Lookups`](/api/lookups) that stores the `lexeme_norm` and other large lookup tables. Defaults to `None`. ~~Optional[Lookups]~~                      |
+| `oov_prob`                                  | The default OOV probability. Defaults to `-20.0`. ~~float~~                                                                                             |
+| `vectors_name` <Tag variant="new">2.2</Tag> | A name to identify the vectors table. ~~str~~                                                                                                           |
+| `writing_system`                            | A dictionary describing the language's writing system. Typically provided by [`Language.Defaults`](/api/language#defaults). ~~Dict[str, Any]~~          |
 | `get_noun_chunks`                           | A function that yields base noun phrases used for [`Doc.noun_chunks`](/api/doc#noun_chunks). ~~Optional[Callable[[Union[Doc, Span], Iterator[Span]]]]~~ |
 
 ## Vocab.\_\_len\_\_ {#len tag="method"}