Document Assigned Attributes of Pipeline Components (#9041)

* Add textcat docs

* Add NER docs

* Add Entity Linker docs

* Add assigned fields docs for the tagger

This also adds a preamble, since there wasn't one.

* Add morphologizer docs

* Add dependency parser docs

* Update entityrecognizer docs

This is a little weird because `Doc.ents` is the only thing assigned to,
but it's actually a bidirectional property.

* Add token fields for entityrecognizer

* Fix section name

* Add entity ruler docs

* Add lemmatizer docs

* Add sentencizer/recognizer docs

* Update website/docs/api/entityrecognizer.md

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

* Update website/docs/api/entityruler.md

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

* Update website/docs/api/tagger.md

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

* Update website/docs/api/entityruler.md

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

* Update type for Doc.ents

This was `Tuple[Span, ...]` everywhere but `Tuple[Span]` seems to be
correct.

* Run prettier

* Apply suggestions from code review

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Run prettier

* Add transformers section

This basically just moves and renames the "custom attributes" section
from the bottom of the page to be consistent with "assigned attributes"
on other pages.

I looked at moving the paragraph just above the section into the
section, but it includes the unrelated registry additions, so it seemed
better to leave it unchanged.

* Make table header consistent

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
Paul O'Leary McCann 2021-09-01 19:09:39 +09:00 committed by GitHub
parent f803a84571
commit ba6a37d358
19 changed files with 171 additions and 37 deletions

View File

@ -555,8 +555,8 @@ consists of either two or three subnetworks:
<Accordion title="spacy.TransitionBasedParser.v1 definition" spaced>
[TransitionBasedParser.v1](/api/legacy#TransitionBasedParser_v1) had the exact same signature,
but the `use_upper` argument was `True` by default.
[TransitionBasedParser.v1](/api/legacy#TransitionBasedParser_v1) had the exact
same signature, but the `use_upper` argument was `True` by default.
</Accordion>

View File

@ -25,6 +25,20 @@ current state. The weights are updated such that the scores assigned to the set
of optimal actions are increased, while scores assigned to other actions are
decreased. Note that more than one action may be optimal for a given state.
## Assigned Attributes {#assigned-attributes}
Dependency predictions are assigned to the `Token.dep` and `Token.head` fields.
Besides the dependencies themselves, the parser also determines sentence
boundaries, which are saved in `Token.is_sent_start` and accessible via
`Doc.sents`.
| Location | Value |
| --------------------- | --------------------------------------------------------------------------------------------------------------------------------------------- |
| `Token.dep` | The type of dependency relation (hash). ~~int~~ |
| `Token.dep_` | The type of dependency relation. ~~str~~ |
| `Token.head` | The syntactic parent, or "governor", of this token. ~~Token~~ |
| `Token.is_sent_start` | A boolean value indicating whether the token starts a sentence. After the parser runs this will be `True` or `False` for all tokens. ~~bool~~ |
| `Doc.sents` | An iterator over sentences in the `Doc`, determined by `Token.is_sent_start` values. ~~Iterator[Span]~~ |
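As a minimal sketch of reading these attributes after parsing (assuming the
`en_core_web_sm` trained pipeline is installed; the example sentence is
illustrative):
```python
import spacy

nlp = spacy.load("en_core_web_sm")  # assumes this trained pipeline is installed
doc = nlp("Autonomous cars shift insurance liability toward manufacturers.")
# Dependency label and syntactic head assigned to each token
for token in doc:
    print(token.text, token.dep_, token.head.text)
# Sentence boundaries set by the parser, exposed via Doc.sents
print([sent.text for sent in doc.sents])
```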
## Config and implementation {#config}
The default config is defined by the pipeline component factory and describes

View File

@ -572,8 +572,8 @@ objects, if the entity recognizer has been applied.
> ```
| Name | Description |
| ----------- | --------------------------------------------------------------------- |
| **RETURNS** | Entities in the document, one `Span` per entity. ~~Tuple[Span, ...]~~ |
| ----------- | ---------------------------------------------------------------- |
| **RETURNS** | Entities in the document, one `Span` per entity. ~~Tuple[Span]~~ |
## Doc.spans {#spans tag="property"}

View File

@ -16,6 +16,16 @@ plausible candidates from that `KnowledgeBase` given a certain textual mention,
and a machine learning model to pick the right candidate, given the local
context of the mention.
## Assigned Attributes {#assigned-attributes}
Predictions, in the form of knowledge base IDs, will be assigned to
`Token.ent_kb_id_`.
| Location | Value |
| ------------------ | --------------------------------- |
| `Token.ent_kb_id` | Knowledge base ID (hash). ~~int~~ |
| `Token.ent_kb_id_` | Knowledge base ID. ~~str~~ |
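As a short sketch of reading the IDs back, assuming a custom pipeline with a
trained `entity_linker` component (the pipeline name below is hypothetical):
```python
import spacy

# Hypothetical pipeline that includes a trained entity_linker component
nlp = spacy.load("my_nel_pipeline")
doc = nlp("Ada Lovelace was born in London.")
# Every token of a linked entity carries the predicted knowledge base ID
for ent in doc.ents:
    for token in ent:
        print(token.text, token.ent_kb_id_)
```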
## Config and implementation {#config}
The default config is defined by the pipeline component factory and describes

View File

@ -20,6 +20,24 @@ your entities will be close to their initial tokens. If your entities are long
and characterized by tokens in their middle, the component will likely not be a
good fit for your task.
## Assigned Attributes {#assigned-attributes}
Predictions will be saved to `Doc.ents` as a tuple. Each label will also be
reflected in each underlying token, where it is saved in the `Token.ent_type`
and `Token.ent_iob` fields. Note that by definition each token can only have one
label.
When setting `Doc.ents` to create training data, all the spans must be valid and
non-overlapping, or an error will be raised.
| Location | Value |
| ----------------- | ----------------------------------------------------------------- |
| `Doc.ents` | The annotated spans. ~~Tuple[Span]~~ |
| `Token.ent_iob` | An enum encoding of the IOB part of the named entity tag. ~~int~~ |
| `Token.ent_iob_` | The IOB part of the named entity tag. ~~str~~ |
| `Token.ent_type` | The label part of the named entity tag (hash). ~~int~~ |
| `Token.ent_type_` | The label part of the named entity tag. ~~str~~ |
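A sketch of reading the predictions and of setting `Doc.ents` by hand, assuming
the `en_core_web_sm` trained pipeline is installed:
```python
import spacy
from spacy.tokens import Span

nlp = spacy.load("en_core_web_sm")  # assumes this trained pipeline is installed
doc = nlp("Apple is opening its first big office in San Francisco.")
# Doc-level spans and the corresponding token-level IOB annotations
print([(ent.text, ent.label_) for ent in doc.ents])
print([(t.text, t.ent_iob_, t.ent_type_) for t in doc])
# Setting Doc.ents manually, e.g. for training data: spans must be valid
# and non-overlapping, otherwise an error is raised
doc2 = nlp.make_doc("Apple is opening its first big office in San Francisco.")
doc2.ents = [Span(doc2, 0, 1, label="ORG")]
```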
## Config and implementation {#config}
The default config is defined by the pipeline component factory and describes

View File

@ -15,6 +15,27 @@ used on its own to implement a purely rule-based entity recognition system. For
usage examples, see the docs on
[rule-based entity recognition](/usage/rule-based-matching#entityruler).
## Assigned Attributes {#assigned-attributes}
This component assigns predictions in the same way as the
[`EntityRecognizer`](/api/entityrecognizer).
Predictions can be accessed under `Doc.ents` as a tuple. Each label will also be
reflected in each underlying token, where it is saved in the `Token.ent_type`
and `Token.ent_iob` fields. Note that by definition each token can only have one
label.
When setting `Doc.ents` to create training data, all the spans must be valid and
non-overlapping, or an error will be raised.
| Location | Value |
| ----------------- | ----------------------------------------------------------------- |
| `Doc.ents` | The annotated spans. ~~Tuple[Span]~~ |
| `Token.ent_iob` | An enum encoding of the IOB part of the named entity tag. ~~int~~ |
| `Token.ent_iob_` | The IOB part of the named entity tag. ~~str~~ |
| `Token.ent_type` | The label part of the named entity tag (hash). ~~int~~ |
| `Token.ent_type_` | The label part of the named entity tag. ~~str~~ |
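For example, a minimal rule-based setup (the label and pattern are
illustrative):
```python
import spacy

nlp = spacy.blank("en")
ruler = nlp.add_pipe("entity_ruler")
ruler.add_patterns([{"label": "ORG", "pattern": "Apple"}])
doc = nlp("Apple is opening its first big office in San Francisco.")
# Matched spans end up in Doc.ents and on the underlying tokens
print([(ent.text, ent.label_) for ent in doc.ents])
print([(t.text, t.ent_iob_, t.ent_type_) for t in doc])
```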
## Config and implementation {#config}
The default config is defined by the pipeline component factory and describes

View File

@ -105,7 +105,8 @@ and residual connections.
### spacy.TransitionBasedParser.v1 {#TransitionBasedParser_v1}
Identical to [`spacy.TransitionBasedParser.v2`](/api/architectures#TransitionBasedParser)
Identical to
[`spacy.TransitionBasedParser.v2`](/api/architectures#TransitionBasedParser)
except the `use_upper` was set to `True` by default.
### spacy.TextCatEnsemble.v1 {#TextCatEnsemble_v1}

View File

@ -31,6 +31,15 @@ available in the pipeline and runs _before_ the lemmatizer.
</Infobox>
## Assigned Attributes {#assigned-attributes}
Lemmas, whether generated by rules or predicted by a model, will be saved to
`Token.lemma`.
| Location | Value |
| -------------- | ------------------------- |
| `Token.lemma` | The lemma (hash). ~~int~~ |
| `Token.lemma_` | The lemma. ~~str~~ |
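For example, assuming the `en_core_web_sm` trained pipeline (which includes a
rule-based lemmatizer) is installed:
```python
import spacy

nlp = spacy.load("en_core_web_sm")  # assumes this trained pipeline is installed
doc = nlp("I was reading the papers.")
# e.g. ('was', 'be'), ('reading', 'read'), ('papers', 'paper')
print([(token.text, token.lemma_) for token in doc])
```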
## Config and implementation
The default config is defined by the pipeline component factory and describes

View File

@ -15,6 +15,16 @@ coarse-grained POS tags following the Universal Dependencies
[FEATS](https://universaldependencies.org/format.html#morphological-annotation)
annotation guidelines.
## Assigned Attributes {#assigned-attributes}
Predictions are saved to `Token.morph` and `Token.pos`.
| Location | Value |
| ------------- | ----------------------------------------- |
| `Token.pos` | The UPOS part of speech (hash). ~~int~~ |
| `Token.pos_` | The UPOS part of speech. ~~str~~ |
| `Token.morph` | Morphological features. ~~MorphAnalysis~~ |
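A minimal sketch, assuming a pipeline with a trained morphologizer such as
`de_core_news_sm` is installed:
```python
import spacy

# Assumes a pipeline with a trained morphologizer, e.g. de_core_news_sm
nlp = spacy.load("de_core_news_sm")
doc = nlp("Wo bist du?")
for token in doc:
    # UPOS tag and the MorphAnalysis holding the UD FEATS annotation
    print(token.text, token.pos_, token.morph)
```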
## Config and implementation {#config}
The default config is defined by the pipeline component factory and describes

View File

@ -106,7 +106,7 @@ representation.
## Attributes {#attributes}
| Name | Description |
| ------------- | ---------------------------------------------------------------------------------------------------------------------------- | ---------- |
| ------------- | ------------------------------------------------------------------------------------------------------------------------------ |
| `FEATURE_SEP` | The [FEATS](https://universaldependencies.org/format.html#morphological-annotation) feature separator. Default is `|`. ~~str~~ |
| `FIELD_SEP` | The [FEATS](https://universaldependencies.org/format.html#morphological-annotation) field separator. Default is `=`. ~~str~~ |
| `VALUE_SEP` | The [FEATS](https://universaldependencies.org/format.html#morphological-annotation) value separator. Default is `,`. ~~str~~ |

View File

@ -149,7 +149,7 @@ patterns = [nlp("health care reform"), nlp("healthcare reform")]
</Infobox>
| Name | Description |
| -------------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------- | --- |
| -------------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `match_id` | An ID for the thing you're matching. ~~str~~ | |
| `docs` | `Doc` objects of the phrases to match. ~~List[Doc]~~ |
| _keyword-only_ | |

View File

@ -80,7 +80,7 @@ Docs with `has_unknown_spaces` are skipped during scoring.
> ```
| Name | Description |
| ----------- | ------------------------------------------------------------------------------------------------------------------- | ------------------------------------------------------------------------------------------------------ |
| ----------- | ------------------------------------------------------------------------------------------------------------------- |
| `examples` | The `Example` objects holding both the predictions and the correct gold-standard annotations. ~~Iterable[Example]~~ |
| **RETURNS** | `Dict` | A dictionary containing the scores `token_acc`, `token_p`, `token_r`, `token_f`. ~~Dict[str, float]]~~ |

View File

@ -12,6 +12,16 @@ api_trainable: true
A trainable pipeline component for sentence segmentation. For a simpler,
rule-based strategy, see the [`Sentencizer`](/api/sentencizer).
## Assigned Attributes {#assigned-attributes}
Predicted values will be assigned to `Token.is_sent_start`. The resulting
sentences can be accessed using `Doc.sents`.
| Location | Value |
| --------------------- | ------------------------------------------------------------------------------------------------------------------------------ |
| `Token.is_sent_start` | A boolean value indicating whether the token starts a sentence. This will be either `True` or `False` for all tokens. ~~bool~~ |
| `Doc.sents` | An iterator over sentences in the `Doc`, determined by `Token.is_sent_start` values. ~~Iterator[Span]~~ |
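A sketch of using the trained `senter` instead of the parser, assuming the
`en_core_web_sm` pipeline is installed (its `senter` ships disabled by default):
```python
import spacy

# Exclude the parser and enable the (otherwise disabled) senter instead
nlp = spacy.load("en_core_web_sm", exclude=["parser"])
nlp.enable_pipe("senter")
doc = nlp("This is a sentence. This is another sentence.")
print([token.is_sent_start for token in doc])
print([sent.text for sent in doc.sents])
```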
## Config and implementation {#config}
The default config is defined by the pipeline component factory and describes

View File

@ -13,6 +13,16 @@ performed by the [`DependencyParser`](/api/dependencyparser), so the
`Sentencizer` lets you implement a simpler, rule-based strategy that doesn't
require a statistical model to be loaded.
## Assigned Attributes {#assigned-attributes}
Calculated values will be assigned to `Token.is_sent_start`. The resulting
sentences can be accessed using `Doc.sents`.
| Location | Value |
| --------------------- | ------------------------------------------------------------------------------------------------------------------------------ |
| `Token.is_sent_start` | A boolean value indicating whether the token starts a sentence. This will be either `True` or `False` for all tokens. ~~bool~~ |
| `Doc.sents` | An iterator over sentences in the `Doc`, determined by `Token.is_sent_start` values. ~~Iterator[Span]~~ |
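For example, as a quick sketch:
```python
import spacy

nlp = spacy.blank("en")
nlp.add_pipe("sentencizer")
doc = nlp("This is a sentence. This is another sentence.")
# Token.is_sent_start is set for every token; Doc.sents is derived from it
print([token.is_sent_start for token in doc])
print([sent.text for sent in doc.sents])
```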
## Config and implementation {#config}
The default config is defined by the pipeline component factory and describes
@ -28,7 +38,7 @@ how the component should be configured. You can override its settings via the
> ```
| Setting | Description |
| ------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------ | ------ |
| ------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------ |
| `punct_chars` | Optional custom list of punctuation characters that mark sentence ends. See below for defaults if not set. Defaults to `None`. ~~Optional[List[str]]~~ | `None` |
```python

View File

@ -8,6 +8,21 @@ api_string_name: tagger
api_trainable: true
---
A trainable pipeline component to predict part-of-speech tags for any
part-of-speech tag set.
In the pre-trained pipelines, the tag schemas vary by language; see the
[individual model pages](/models) for details.
## Assigned Attributes {#assigned-attributes}
Predictions are assigned to `Token.tag`.
| Location | Value |
| ------------ | ---------------------------------- |
| `Token.tag` | The part of speech (hash). ~~int~~ |
| `Token.tag_` | The part of speech. ~~str~~ |
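For example, assuming the `en_core_web_sm` trained pipeline is installed:
```python
import spacy

nlp = spacy.load("en_core_web_sm")  # assumes this trained pipeline is installed
doc = nlp("I like green eggs and ham.")
# Fine-grained tags as defined by this pipeline's tag set
print([(token.text, token.tag_) for token in doc])
```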
## Config and implementation {#config}
The default config is defined by the pipeline component factory and describes

View File

@ -29,6 +29,22 @@ only.
</Infobox>
## Assigned Attributes {#assigned-attributes}
Predictions will be saved to `doc.cats` as a dictionary, where the key is the
name of the category and the value is a score between 0 and 1 (inclusive). For
`textcat` (exclusive categories), the scores will sum to 1, while for
`textcat_multilabel` there is no particular guarantee about their sum.
Note that when assigning values to create training data, the score of each
category must be 0 or 1. Using other values, for example to create a document
that is a little bit in category A and a little bit in category B, is not
supported.
| Location | Value |
| ---------- | ------------------------------------- |
| `Doc.cats` | Category scores. ~~Dict[str, float]~~ |
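A sketch of reading and setting the scores; the pipeline and category names
below are hypothetical:
```python
import spacy

# Hypothetical pipeline that includes a trained textcat component
nlp = spacy.load("my_textcat_pipeline")
doc = nlp("This movie was a complete waste of time.")
print(doc.cats)  # e.g. {"POSITIVE": 0.02, "NEGATIVE": 0.98}

# When creating training data, each category score must be 0 or 1
train_doc = nlp.make_doc("A touching, beautifully shot film.")
train_doc.cats = {"POSITIVE": 1.0, "NEGATIVE": 0.0}
```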
## Config and implementation {#config}
The default config is defined by the pipeline component factory and describes

View File

@ -38,12 +38,21 @@ attributes. We also calculate an alignment between the word-piece tokens and the
spaCy tokenization, so that we can use the last hidden states to set the
`Doc.tensor` attribute. When multiple word-piece tokens align to the same spaCy
token, the spaCy token receives the sum of their values. To access the values,
you can use the custom [`Doc._.trf_data`](#custom-attributes) attribute. The
you can use the custom [`Doc._.trf_data`](#assigned-attributes) attribute. The
package also adds the function registries [`@span_getters`](#span_getters) and
[`@annotation_setters`](#annotation_setters) with several built-in registered
functions. For more details, see the
[usage documentation](/usage/embeddings-transformers).
## Assigned Attributes {#assigned-attributes}
The component sets the following
[custom extension attribute](/usage/processing-pipeline#custom-components-attributes):
| Location | Value |
| ---------------- | ------------------------------------------------------------------------ |
| `Doc._.trf_data` | Transformer tokens and outputs for the `Doc` object. ~~TransformerData~~ |
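A sketch of accessing the extension attribute, assuming a transformer-based
pipeline such as `en_core_web_trf` (which requires `spacy-transformers`) is
installed:
```python
import spacy

# Assumes en_core_web_trf and spacy-transformers are installed
nlp = spacy.load("en_core_web_trf")
doc = nlp("The transformer output is stored on the Doc.")
# TransformerData with the word-piece tokens, model outputs and alignment
print(doc._.trf_data)
```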
## Config and implementation {#config}
The default config is defined by the pipeline component factory and describes
@ -98,7 +107,7 @@ https://github.com/explosion/spacy-transformers/blob/master/spacy_transformers/p
Construct a `Transformer` component. One or more subsequent spaCy components can
use the transformer outputs as features in its model, with gradients
backpropagated to the single shared weights. The activations from the
transformer are saved in the [`Doc._.trf_data`](#custom-attributes) extension
transformer are saved in the [`Doc._.trf_data`](#assigned-attributes) extension
attribute. You can also provide a callback to set additional annotations. In
your application, you would normally use a shortcut for this and instantiate the
component using its string name and [`nlp.add_pipe`](/api/language#create_pipe).
@ -205,7 +214,7 @@ modifying them.
Assign the extracted features to the `Doc` objects. By default, the
[`TransformerData`](/api/transformer#transformerdata) object is written to the
[`Doc._.trf_data`](#custom-attributes) attribute. Your `set_extra_annotations`
[`Doc._.trf_data`](#assigned-attributes) attribute. Your `set_extra_annotations`
callback is then called, if provided.
> #### Example
@ -383,7 +392,7 @@ are wrapped into the
[FullTransformerBatch](/api/transformer#fulltransformerbatch) object. The
`FullTransformerBatch` then splits out the per-document data, which is handled
by this class. Instances of this class are typically assigned to the
[`Doc._.trf_data`](/api/transformer#custom-attributes) extension attribute.
[`Doc._.trf_data`](/api/transformer#assigned-attributes) extension attribute.
| Name | Description |
| --------- | --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
@ -549,12 +558,3 @@ The following built-in functions are available:
| Name | Description |
| ---------------------------------------------- | ------------------------------------- |
| `spacy-transformers.null_annotation_setter.v1` | Don't set any additional annotations. |
## Custom attributes {#custom-attributes}
The component sets the following
[custom extension attributes](/usage/processing-pipeline#custom-components-attributes):
| Name | Description |
| ---------------- | ------------------------------------------------------------------------ |
| `Doc._.trf_data` | Transformer tokens and outputs for the `Doc` object. ~~TransformerData~~ |

View File

@ -321,7 +321,7 @@ performed in chunks to avoid consuming too much memory. You can set the
> ```
| Name | Description |
| -------------- | --------------------------------------------------------------------------- | ----------------------------------------------------------------------------------------------------------------------- |
| -------------- | --------------------------------------------------------------------------- |
| `queries` | An array with one or more vectors. ~~numpy.ndarray~~ |
| _keyword-only_ | |
| `batch_size` | The batch size to use. Default to `1024`. ~~int~~ |

View File

@ -22,7 +22,7 @@ Create the vocabulary.
> ```
| Name | Description |
| ------------------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------ |
| ------------------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `lex_attr_getters` | A dictionary mapping attribute IDs to functions to compute them. Defaults to `None`. ~~Optional[Dict[str, Callable[[str], Any]]]~~ |
| `strings` | A [`StringStore`](/api/stringstore) that maps strings to hash values, and vice versa, or a list of strings. ~~Union[List[str], StringStore]~~ |
| `lookups` | A [`Lookups`](/api/lookups) that stores the `lexeme_norm` and other large lookup tables. Defaults to `None`. ~~Optional[Lookups]~~ |