diff --git a/website/docs/api/architectures.md b/website/docs/api/architectures.md
index f1a11bbc4..ceeb388ab 100644
--- a/website/docs/api/architectures.md
+++ b/website/docs/api/architectures.md
@@ -555,8 +555,8 @@ consists of either two or three subnetworks:
-[TransitionBasedParser.v1](/api/legacy#TransitionBasedParser_v1) had the exact same signature,
-but the `use_upper` argument was `True` by default.
+[TransitionBasedParser.v1](/api/legacy#TransitionBasedParser_v1) had the exact
+same signature, but the `use_upper` argument was `True` by default.
diff --git a/website/docs/api/dependencyparser.md b/website/docs/api/dependencyparser.md
index fa02a6f99..c48172a22 100644
--- a/website/docs/api/dependencyparser.md
+++ b/website/docs/api/dependencyparser.md
@@ -25,6 +25,20 @@ current state. The weights are updated such that the scores assigned to the set of optimal actions is increased, while scores assigned to other actions are decreased. Note that more than one action may be optimal for a given state.
+## Assigned Attributes {#assigned-attributes}
+
+Dependency predictions are assigned to the `Token.dep` and `Token.head` fields.
+Beside the dependencies themselves, the parser decides sentence boundaries,
+which are saved in `Token.is_sent_start` and accessible via `Doc.sents`.
+
+| Location | Value |
+| --------------------- | --------------------------------------------------------------------------------------------------------------------------------------------- |
+| `Token.dep` | The type of dependency relation (hash). ~~int~~ |
+| `Token.dep_` | The type of dependency relation. ~~str~~ |
+| `Token.head` | The syntactic parent, or "governor", of this token. ~~Token~~ |
+| `Token.is_sent_start` | A boolean value indicating whether the token starts a sentence. After the parser runs this will be `True` or `False` for all tokens. ~~bool~~ |
+| `Doc.sents` | An iterator over sentences in the `Doc`, determined by `Token.is_sent_start` values. ~~Iterator[Span]~~ |
+
 ## Config and implementation {#config}
 The default config is defined by the pipeline component factory and describes
diff --git a/website/docs/api/doc.md b/website/docs/api/doc.md
index 0b5ef56c0..e1f18963b 100644
--- a/website/docs/api/doc.md
+++ b/website/docs/api/doc.md
@@ -571,9 +571,9 @@ objects, if the entity recognizer has been applied. > assert ents[0].text == "Mr. Best" > ```
-| Name | Description |
-| ----------- | --------------------------------------------------------------------- |
-| **RETURNS** | Entities in the document, one `Span` per entity. ~~Tuple[Span, ...]~~ |
+| Name | Description |
+| ----------- | ---------------------------------------------------------------- |
+| **RETURNS** | Entities in the document, one `Span` per entity. ~~Tuple[Span]~~ |
 ## Doc.spans {#spans tag="property"}
diff --git a/website/docs/api/entitylinker.md b/website/docs/api/entitylinker.md
index 2994d934b..bbc8f3942 100644
--- a/website/docs/api/entitylinker.md
+++ b/website/docs/api/entitylinker.md
@@ -16,6 +16,16 @@ plausible candidates from that `KnowledgeBase` given a certain textual mention, and a machine learning model to pick the right candidate, given the local context of the mention.
+## Assigned Attributes {#assigned-attributes}
+
+Predictions, in the form of knowledge base IDs, will be assigned to
+`Token.ent_kb_id_`.
+
+| Location | Value |
+| ------------------ | --------------------------------- |
+| `Token.ent_kb_id` | Knowledge base ID (hash). ~~int~~ |
+| `Token.ent_kb_id_` | Knowledge base ID. ~~str~~ |
+
 ## Config and implementation {#config}
 The default config is defined by the pipeline component factory and describes
diff --git a/website/docs/api/entityrecognizer.md b/website/docs/api/entityrecognizer.md
index 601b644c1..ba7022c14 100644
--- a/website/docs/api/entityrecognizer.md
+++ b/website/docs/api/entityrecognizer.md
@@ -20,6 +20,24 @@ your entities will be close to their initial tokens.
If your entities are long and characterized by tokens in their middle, the component will likely not be a good fit for your task.
+## Assigned Attributes {#assigned-attributes}
+
+Predictions will be saved to `Doc.ents` as a tuple. Each label will also be
+reflected to each underlying token, where it is saved in the `Token.ent_type`
+and `Token.ent_iob` fields. Note that by definition each token can only have one
+label.
+
+When setting `Doc.ents` to create training data, all the spans must be valid and
+non-overlapping, or an error will be thrown.
+
+| Location | Value |
+| ----------------- | ----------------------------------------------------------------- |
+| `Doc.ents` | The annotated spans. ~~Tuple[Span]~~ |
+| `Token.ent_iob` | An enum encoding of the IOB part of the named entity tag. ~~int~~ |
+| `Token.ent_iob_` | The IOB part of the named entity tag. ~~str~~ |
+| `Token.ent_type` | The label part of the named entity tag (hash). ~~int~~ |
+| `Token.ent_type_` | The label part of the named entity tag. ~~str~~ |
+
 ## Config and implementation {#config}
 The default config is defined by the pipeline component factory and describes
diff --git a/website/docs/api/entityruler.md b/website/docs/api/entityruler.md
index 93b5da45a..48c279914 100644
--- a/website/docs/api/entityruler.md
+++ b/website/docs/api/entityruler.md
@@ -15,6 +15,27 @@ used on its own to implement a purely rule-based entity recognition system. For usage examples, see the docs on [rule-based entity recognition](/usage/rule-based-matching#entityruler).
+## Assigned Attributes {#assigned-attributes}
+
+This component assigns predictions basically the same way as the
+[`EntityRecognizer`](/api/entityrecognizer).
+
+Predictions can be accessed under `Doc.ents` as a tuple. Each label will also be
+reflected in each underlying token, where it is saved in the `Token.ent_type`
+and `Token.ent_iob` fields. Note that by definition each token can only have one
+label.
+
+When setting `Doc.ents` to create training data, all the spans must be valid and
+non-overlapping, or an error will be thrown.
+
+| Location | Value |
+| ----------------- | ----------------------------------------------------------------- |
+| `Doc.ents` | The annotated spans. ~~Tuple[Span]~~ |
+| `Token.ent_iob` | An enum encoding of the IOB part of the named entity tag. ~~int~~ |
+| `Token.ent_iob_` | The IOB part of the named entity tag. ~~str~~ |
+| `Token.ent_type` | The label part of the named entity tag (hash). ~~int~~ |
+| `Token.ent_type_` | The label part of the named entity tag. ~~str~~ |
+
 ## Config and implementation {#config}
 The default config is defined by the pipeline component factory and describes
diff --git a/website/docs/api/legacy.md b/website/docs/api/legacy.md
index 02b376780..916a5bf7f 100644
--- a/website/docs/api/legacy.md
+++ b/website/docs/api/legacy.md
@@ -105,7 +105,8 @@ and residual connections.
 ### spacy.TransitionBasedParser.v1 {#TransitionBasedParser_v1}
-Identical to [`spacy.TransitionBasedParser.v2`](/api/architectures#TransitionBasedParser)
+Identical to
+[`spacy.TransitionBasedParser.v2`](/api/architectures#TransitionBasedParser)
 except the `use_upper` was set to `True` by default.
 ### spacy.TextCatEnsemble.v1 {#TextCatEnsemble_v1}
diff --git a/website/docs/api/lemmatizer.md b/website/docs/api/lemmatizer.md
index 279821e71..8cb869f64 100644
--- a/website/docs/api/lemmatizer.md
+++ b/website/docs/api/lemmatizer.md
@@ -31,6 +31,15 @@ available in the pipeline and runs _before_ the lemmatizer.
+## Assigned Attributes {#assigned-attributes}
+
+Lemmas generated by rules or predicted will be saved to `Token.lemma`.
+
+| Location | Value |
+| -------------- | ------------------------- |
+| `Token.lemma` | The lemma (hash). ~~int~~ |
+| `Token.lemma_` | The lemma. ~~str~~ |
+
 ## Config and implementation
 The default config is defined by the pipeline component factory and describes
diff --git a/website/docs/api/morphologizer.md b/website/docs/api/morphologizer.md
index d2dd28ac2..00af83e6f 100644
--- a/website/docs/api/morphologizer.md
+++ b/website/docs/api/morphologizer.md
@@ -15,6 +15,16 @@ coarse-grained POS tags following the Universal Dependencies [FEATS](https://universaldependencies.org/format.html#morphological-annotation) annotation guidelines.
+## Assigned Attributes {#assigned-attributes}
+
+Predictions are saved to `Token.morph` and `Token.pos`.
+
+| Location | Value |
+| ------------- | ----------------------------------------- |
+| `Token.pos` | The UPOS part of speech (hash). ~~int~~ |
+| `Token.pos_` | The UPOS part of speech. ~~str~~ |
+| `Token.morph` | Morphological features. ~~MorphAnalysis~~ |
+
 ## Config and implementation {#config}
 The default config is defined by the pipeline component factory and describes
diff --git a/website/docs/api/morphology.md b/website/docs/api/morphology.md
index 565e520b5..20fcd1a40 100644
--- a/website/docs/api/morphology.md
+++ b/website/docs/api/morphology.md
@@ -105,11 +105,11 @@ representation.
 ## Attributes {#attributes}
-| Name | Description |
-| ------------- | ---------------------------------------------------------------------------------------------------------------------------- | ---------- |
-| `FEATURE_SEP` | The [FEATS](https://universaldependencies.org/format.html#morphological-annotation) feature separator. Default is ` | `. ~~str~~ |
-| `FIELD_SEP` | The [FEATS](https://universaldependencies.org/format.html#morphological-annotation) field separator. Default is `=`. ~~str~~ |
-| `VALUE_SEP` | The [FEATS](https://universaldependencies.org/format.html#morphological-annotation) value separator. Default is `,`. ~~str~~ |
+| Name | Description |
+| ------------- | ------------------------------------------------------------------------------------------------------------------------------ |
+| `FEATURE_SEP` | The [FEATS](https://universaldependencies.org/format.html#morphological-annotation) feature separator. Default is `|`. ~~str~~ |
+| `FIELD_SEP` | The [FEATS](https://universaldependencies.org/format.html#morphological-annotation) field separator. Default is `=`. ~~str~~ |
+| `VALUE_SEP` | The [FEATS](https://universaldependencies.org/format.html#morphological-annotation) value separator. Default is `,`. ~~str~~ |
 ## MorphAnalysis {#morphanalysis tag="class" source="spacy/tokens/morphanalysis.pyx"}
diff --git a/website/docs/api/phrasematcher.md b/website/docs/api/phrasematcher.md
index 4a5fb6042..71ee4b7d1 100644
--- a/website/docs/api/phrasematcher.md
+++ b/website/docs/api/phrasematcher.md
@@ -149,8 +149,8 @@ patterns = [nlp("health care reform"), nlp("healthcare reform")]
 | Name | Description |
-| -------------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------- | --- |
-| `match_id` | An ID for the thing you're matching. ~~str~~ | |
+| -------------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------- |
+| `match_id` | An ID for the thing you're matching. ~~str~~ | |
 | `docs` | `Doc` objects of the phrases to match. ~~List[Doc]~~ |
 | _keyword-only_ | |
 | `on_match` | Callback function to act on matches. Takes the arguments `matcher`, `doc`, `i` and `matches`. ~~Optional[Callable[[Matcher, Doc, int, List[tuple], Any]]~~ |
diff --git a/website/docs/api/scorer.md b/website/docs/api/scorer.md
index ad908f204..c8163091f 100644
--- a/website/docs/api/scorer.md
+++ b/website/docs/api/scorer.md
@@ -80,7 +80,7 @@ Docs with `has_unknown_spaces` are skipped during scoring. > ```
 | Name | Description |
-| ----------- | ------------------------------------------------------------------------------------------------------------------- | ------------------------------------------------------------------------------------------------------ |
+| ----------- | ------------------------------------------------------------------------------------------------------------------- |
 | `examples` | The `Example` objects holding both the predictions and the correct gold-standard annotations. ~~Iterable[Example]~~ |
 | **RETURNS** | `Dict` | A dictionary containing the scores `token_acc`, `token_p`, `token_r`, `token_f`. ~~Dict[str, float]]~~ |
diff --git a/website/docs/api/sentencerecognizer.md b/website/docs/api/sentencerecognizer.md
index e82a4bef6..8d8e57319 100644
--- a/website/docs/api/sentencerecognizer.md
+++ b/website/docs/api/sentencerecognizer.md
@@ -12,6 +12,16 @@ api_trainable: true A trainable pipeline component for sentence segmentation. For a simpler, rule-based strategy, see the [`Sentencizer`](/api/sentencizer).
+## Assigned Attributes {#assigned-attributes}
+
+Predicted values will be assigned to `Token.is_sent_start`. The resulting
+sentences can be accessed using `Doc.sents`.
+
+| Location | Value |
+| --------------------- | ------------------------------------------------------------------------------------------------------------------------------ |
+| `Token.is_sent_start` | A boolean value indicating whether the token starts a sentence. This will be either `True` or `False` for all tokens. ~~bool~~ |
+| `Doc.sents` | An iterator over sentences in the `Doc`, determined by `Token.is_sent_start` values. ~~Iterator[Span]~~ |
+
 ## Config and implementation {#config}
 The default config is defined by the pipeline component factory and describes
diff --git a/website/docs/api/sentencizer.md b/website/docs/api/sentencizer.md
index 75a253fc0..ef2465c27 100644
--- a/website/docs/api/sentencizer.md
+++ b/website/docs/api/sentencizer.md
@@ -13,6 +13,16 @@ performed by the [`DependencyParser`](/api/dependencyparser), so the `Sentencizer` lets you implement a simpler, rule-based strategy that doesn't require a statistical model to be loaded.
+## Assigned Attributes {#assigned-attributes}
+
+Calculated values will be assigned to `Token.is_sent_start`. The resulting
+sentences can be accessed using `Doc.sents`.
+
+| Location | Value |
+| --------------------- | ------------------------------------------------------------------------------------------------------------------------------ |
+| `Token.is_sent_start` | A boolean value indicating whether the token starts a sentence. This will be either `True` or `False` for all tokens. ~~bool~~ |
+| `Doc.sents` | An iterator over sentences in the `Doc`, determined by `Token.is_sent_start` values. ~~Iterator[Span]~~ |
+
 ## Config and implementation {#config}
 The default config is defined by the pipeline component factory and describes
@@ -28,7 +38,7 @@ how the component should be configured. You can override its settings via the > ```
 | Setting | Description |
-| ------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------ | ------ |
+| ------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------ |
 | `punct_chars` | Optional custom list of punctuation characters that mark sentence ends. See below for defaults if not set. Defaults to `None`. ~~Optional[List[str]]~~ | `None` | ```python
diff --git a/website/docs/api/tagger.md b/website/docs/api/tagger.md
index 3002aff7b..f34456b0c 100644
--- a/website/docs/api/tagger.md
+++ b/website/docs/api/tagger.md
@@ -8,6 +8,21 @@ api_string_name: tagger api_trainable: true ---
+A trainable pipeline component to predict part-of-speech tags for any
+part-of-speech tag set.
+
+In the pre-trained pipelines, the tag schemas vary by language; see the
+[individual model pages](/models) for details.
+
+## Assigned Attributes {#assigned-attributes}
+
+Predictions are assigned to `Token.tag`.
+
+| Location | Value |
+| ------------ | ---------------------------------- |
+| `Token.tag` | The part of speech (hash). ~~int~~ |
+| `Token.tag_` | The part of speech. ~~str~~ |
+
 ## Config and implementation {#config}
 The default config is defined by the pipeline component factory and describes
diff --git a/website/docs/api/textcategorizer.md b/website/docs/api/textcategorizer.md
index 923da0048..62a921d02 100644
--- a/website/docs/api/textcategorizer.md
+++ b/website/docs/api/textcategorizer.md
@@ -29,6 +29,22 @@ only.
+## Assigned Attributes {#assigned-attributes}
+
+Predictions will be saved to `doc.cats` as a dictionary, where the key is the
+name of the category and the value is a score between 0 and 1 (inclusive). For
+`textcat` (exclusive categories), the scores will sum to 1, while for
+`textcat_multilabel` there is no particular guarantee about their sum.
+
+Note that when assigning values to create training data, the score of each
+category must be 0 or 1. Using other values, for example to create a document
+that is a little bit in category A and a little bit in category B, is not
+supported.
+
+| Location | Value |
+| ---------- | ------------------------------------- |
+| `Doc.cats` | Category scores. ~~Dict[str, float]~~ |
+
 ## Config and implementation {#config}
 The default config is defined by the pipeline component factory and describes
diff --git a/website/docs/api/transformer.md b/website/docs/api/transformer.md
index 569fcfbd4..6e68ac599 100644
--- a/website/docs/api/transformer.md
+++ b/website/docs/api/transformer.md
@@ -38,12 +38,21 @@ attributes. We also calculate an alignment between the word-piece tokens and the spaCy tokenization, so that we can use the last hidden states to set the `Doc.tensor` attribute. When multiple word-piece tokens align to the same spaCy token, the spaCy token receives the sum of their values. To access the values,
-you can use the custom [`Doc._.trf_data`](#custom-attributes) attribute. The
+you can use the custom [`Doc._.trf_data`](#assigned-attributes) attribute. The
 package also adds the function registries [`@span_getters`](#span_getters) and [`@annotation_setters`](#annotation_setters) with several built-in registered functions. For more details, see the [usage documentation](/usage/embeddings-transformers).
+## Assigned Attributes {#assigned-attributes}
+
+The component sets the following
+[custom extension attribute](/usage/processing-pipeline#custom-components-attributes):
+
+| Location | Value |
+| ---------------- | ------------------------------------------------------------------------ |
+| `Doc._.trf_data` | Transformer tokens and outputs for the `Doc` object. ~~TransformerData~~ |
+
 ## Config and implementation {#config}
 The default config is defined by the pipeline component factory and describes
@@ -98,7 +107,7 @@ https://github.com/explosion/spacy-transformers/blob/master/spacy_transformers/p Construct a `Transformer` component. One or more subsequent spaCy components can use the transformer outputs as features in its model, with gradients backpropagated to the single shared weights.
The activations from the -transformer are saved in the [`Doc._.trf_data`](#custom-attributes) extension +transformer are saved in the [`Doc._.trf_data`](#assigned-attributes) extension attribute. You can also provide a callback to set additional annotations. In your application, you would normally use a shortcut for this and instantiate the component using its string name and [`nlp.add_pipe`](/api/language#create_pipe). @@ -205,7 +214,7 @@ modifying them. Assign the extracted features to the `Doc` objects. By default, the [`TransformerData`](/api/transformer#transformerdata) object is written to the -[`Doc._.trf_data`](#custom-attributes) attribute. Your `set_extra_annotations` +[`Doc._.trf_data`](#assigned-attributes) attribute. Your `set_extra_annotations` callback is then called, if provided. > #### Example @@ -383,7 +392,7 @@ are wrapped into the [FullTransformerBatch](/api/transformer#fulltransformerbatch) object. The `FullTransformerBatch` then splits out the per-document data, which is handled by this class. Instances of this class are typically assigned to the -[`Doc._.trf_data`](/api/transformer#custom-attributes) extension attribute. +[`Doc._.trf_data`](/api/transformer#assigned-attributes) extension attribute. | Name | Description | | --------- | --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | @@ -549,12 +558,3 @@ The following built-in functions are available: | Name | Description | | ---------------------------------------------- | ------------------------------------- | | `spacy-transformers.null_annotation_setter.v1` | Don't set any additional annotations. 
| - -## Custom attributes {#custom-attributes} - -The component sets the following -[custom extension attributes](/usage/processing-pipeline#custom-components-attributes): - -| Name | Description | -| ---------------- | ------------------------------------------------------------------------ | -| `Doc._.trf_data` | Transformer tokens and outputs for the `Doc` object. ~~TransformerData~~ | diff --git a/website/docs/api/vectors.md b/website/docs/api/vectors.md index 598abe681..1a7f7a3f5 100644 --- a/website/docs/api/vectors.md +++ b/website/docs/api/vectors.md @@ -321,7 +321,7 @@ performed in chunks to avoid consuming too much memory. You can set the > ``` | Name | Description | -| -------------- | --------------------------------------------------------------------------- | ----------------------------------------------------------------------------------------------------------------------- | +| -------------- | --------------------------------------------------------------------------- | | `queries` | An array with one or more vectors. ~~numpy.ndarray~~ | | _keyword-only_ | | | `batch_size` | The batch size to use. Default to `1024`. ~~int~~ | diff --git a/website/docs/api/vocab.md b/website/docs/api/vocab.md index 320ad5605..40a3c3b22 100644 --- a/website/docs/api/vocab.md +++ b/website/docs/api/vocab.md @@ -21,14 +21,14 @@ Create the vocabulary. > vocab = Vocab(strings=["hello", "world"]) > ``` -| Name | Description | -| ------------------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------ | -| `lex_attr_getters` | A dictionary mapping attribute IDs to functions to compute them. Defaults to `None`. ~~Optional[Dict[str, Callable[[str], Any]]]~~ | -| `strings` | A [`StringStore`](/api/stringstore) that maps strings to hash values, and vice versa, or a list of strings. 
~~Union[List[str], StringStore]~~ | -| `lookups` | A [`Lookups`](/api/lookups) that stores the `lexeme_norm` and other large lookup tables. Defaults to `None`. ~~Optional[Lookups]~~ | -| `oov_prob` | The default OOV probability. Defaults to `-20.0`. ~~float~~ | -| `vectors_name` 2.2 | A name to identify the vectors table. ~~str~~ | -| `writing_system` | A dictionary describing the language's writing system. Typically provided by [`Language.Defaults`](/api/language#defaults). ~~Dict[str, Any]~~ | +| Name | Description | +| ------------------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------- | +| `lex_attr_getters` | A dictionary mapping attribute IDs to functions to compute them. Defaults to `None`. ~~Optional[Dict[str, Callable[[str], Any]]]~~ | +| `strings` | A [`StringStore`](/api/stringstore) that maps strings to hash values, and vice versa, or a list of strings. ~~Union[List[str], StringStore]~~ | +| `lookups` | A [`Lookups`](/api/lookups) that stores the `lexeme_norm` and other large lookup tables. Defaults to `None`. ~~Optional[Lookups]~~ | +| `oov_prob` | The default OOV probability. Defaults to `-20.0`. ~~float~~ | +| `vectors_name` 2.2 | A name to identify the vectors table. ~~str~~ | +| `writing_system` | A dictionary describing the language's writing system. Typically provided by [`Language.Defaults`](/api/language#defaults). ~~Dict[str, Any]~~ | | `get_noun_chunks` | A function that yields base noun phrases used for [`Doc.noun_chunks`](/api/doc#noun_chunks). ~~Optional[Callable[[Union[Doc, Span], Iterator[Span]]]]~~ | ## Vocab.\_\_len\_\_ {#len tag="method"}