Proofreading

Another round of proofreading. All the API docs have been read through and I've grazed the Usage docs.
walterhenry 2020-09-28 16:50:15 +02:00
parent 3dd5f409ec
commit 3360825e00
12 changed files with 47 additions and 48 deletions

View File

@@ -444,8 +444,7 @@ invalidated, although they may accidentally continue to work.
 Mark a span for merging. The `attrs` will be applied to the resulting token (if
 they're context-dependent token attributes like `LEMMA` or `DEP`) or to the
 underlying lexeme (if they're context-independent lexical attributes like
-`LOWER` or `IS_STOP`). Writable custom extension attributes can be provided as a
-dictionary mapping attribute name to values as the `"_"` key.
+`LOWER` or `IS_STOP`). Writable custom extension attributes can be provided using the `"_"` key and specifying a dictionary that maps attribute name to values.
 > #### Example
 >
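To make the documented behaviour concrete, here is a minimal sketch of `retokenizer.merge` with a built-in attribute plus a custom extension passed under the `"_"` key (the `is_city` extension is a hypothetical name used only for illustration):

```python
import spacy
from spacy.tokens import Token

# Hypothetical custom extension, registered before retokenizing
Token.set_extension("is_city", default=False)

nlp = spacy.blank("en")
doc = nlp("I like New York")
with doc.retokenize() as retokenizer:
    # Merge "New York" into a single token and set attributes on the result
    retokenizer.merge(doc[2:4], attrs={"LEMMA": "new york", "_": {"is_city": True}})
print([(t.text, t.lemma_, t._.is_city) for t in doc])
```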

View File

@@ -26,7 +26,7 @@ Merge noun chunks into a single token. Also available via the string name
 <Infobox variant="warning">
-Since noun chunks require part-of-speech tags and the dependency parser, make
+Since noun chunks require part-of-speech tags and the dependency parse, make
 sure to add this component _after_ the `"tagger"` and `"parser"` components. By
 default, `nlp.add_pipe` will add components to the end of the pipeline and after
 all other components.
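For context, a minimal sketch of the ordering point made in this hunk, assuming a trained pipeline such as `en_core_web_sm` is installed:

```python
import spacy

nlp = spacy.load("en_core_web_sm")
# Added last by default, i.e. after the tagger and parser it depends on
nlp.add_pipe("merge_noun_chunks")
doc = nlp("The quick brown fox jumps over the lazy dog.")
print([token.text for token in doc])  # noun chunks now appear as single tokens
```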

View File

@@ -187,7 +187,7 @@ the character indices don't map to a valid span.
 | Name | Description |
 | ------------------------------------ | ----------------------------------------------------------------------------------------- |
 | `start` | The index of the first character of the span. ~~int~~ |
-| `end` | The index of the last character after the span. ~int~~ |
+| `end` | The index of the last character after the span. ~~int~~ |
 | `label` | A label to attach to the span, e.g. for named entities. ~~Union[int, str]~~ |
 | `kb_id` <Tag variant="new">2.2</Tag> | An ID from a knowledge base to capture the meaning of a named entity. ~~Union[int, str]~~ |
 | `vector` | A meaning representation of the span. ~~numpy.ndarray[ndim=1, dtype=float32]~~ |
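A small usage sketch of `Doc.char_span` with the parameters listed above (the character offsets are just an illustration):

```python
import spacy

nlp = spacy.blank("en")
doc = nlp("I like New York")
# Characters 7-15 cover "New York"; label it as a geopolitical entity
span = doc.char_span(7, 15, label="GPE")
assert span is not None and span.text == "New York"
```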

View File

@@ -153,7 +153,7 @@ setting up the label scheme based on the data.
 ## TextCategorizer.predict {#predict tag="method"}
-Apply the component's model to a batch of [`Doc`](/api/doc) objects, without
+Apply the component's model to a batch of [`Doc`](/api/doc) objects without
 modifying them.
 > #### Example
@@ -170,7 +170,7 @@ modifying them.
 ## TextCategorizer.set_annotations {#set_annotations tag="method"}
-Modify a batch of [`Doc`](/api/doc) objects, using pre-computed scores.
+Modify a batch of [`Doc`](/api/doc) objects using pre-computed scores.
 > #### Example
 >
@@ -213,7 +213,7 @@ Delegates to [`predict`](/api/textcategorizer#predict) and
 ## TextCategorizer.rehearse {#rehearse tag="method,experimental" new="3"}
 Perform a "rehearsal" update from a batch of data. Rehearsal updates teach the
-current model to make predictions similar to an initial model, to try to address
+current model to make predictions similar to an initial model to try to address
 the "catastrophic forgetting" problem. This feature is experimental.
 > #### Example
@@ -286,7 +286,7 @@ Create an optimizer for the pipeline component.
 ## TextCategorizer.use_params {#use_params tag="method, contextmanager"}
-Modify the pipe's model, to use the given parameter values.
+Modify the pipe's model to use the given parameter values.
 > #### Example
 >
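A minimal sketch of how `predict` and `set_annotations` fit together for this component, assuming a pipeline whose `textcat` component has already been trained and initialized:

```python
# Assumes nlp is a pipeline with a trained "textcat" component
textcat = nlp.get_pipe("textcat")
docs = [nlp.make_doc("This is a text to classify.")]
scores = textcat.predict(docs)         # compute scores without modifying the docs
textcat.set_annotations(docs, scores)  # write the scores back, e.g. to doc.cats
print(docs[0].cats)
```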

View File

@@ -151,7 +151,7 @@ setting up the label scheme based on the data.
 ## Tok2Vec.predict {#predict tag="method"}
-Apply the component's model to a batch of [`Doc`](/api/doc) objects, without
+Apply the component's model to a batch of [`Doc`](/api/doc) objects without
 modifying them.
 > #### Example
@@ -224,7 +224,7 @@ Create an optimizer for the pipeline component.
 ## Tok2Vec.use_params {#use_params tag="method, contextmanager"}
-Modify the pipe's model, to use the given parameter values. At the end of the
+Modify the pipe's model to use the given parameter values. At the end of the
 context, the original parameters are restored.
 > #### Example
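A short sketch of the `use_params` context manager mentioned here; the `optimizer` and the output path are assumptions for illustration, and the original parameters are restored when the block exits:

```python
# Assumes an nlp pipeline with a "tok2vec" component and an optimizer,
# e.g. optimizer = nlp.initialize()
tok2vec = nlp.get_pipe("tok2vec")
with tok2vec.use_params(optimizer.averages):
    # Inside the block the model uses the averaged parameters,
    # e.g. to save a model snapshot to disk
    tok2vec.to_disk("/tmp/tok2vec_averaged")
# Outside the block, the original parameters are back in place
```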

View File

@@ -243,7 +243,7 @@ A sequence of the token's immediate syntactic children.
 ## Token.lefts {#lefts tag="property" model="parser"}
-The leftward immediate children of the word, in the syntactic dependency parse.
+The leftward immediate children of the word in the syntactic dependency parse.
 > #### Example
 >
@@ -259,7 +259,7 @@ The leftward immediate children of the word, in the syntactic dependency parse.
 ## Token.rights {#rights tag="property" model="parser"}
-The rightward immediate children of the word, in the syntactic dependency parse.
+The rightward immediate children of the word in the syntactic dependency parse.
 > #### Example
 >
@@ -275,7 +275,7 @@ The rightward immediate children of the word, in the syntactic dependency parse.
 ## Token.n_lefts {#n_lefts tag="property" model="parser"}
-The number of leftward immediate children of the word, in the syntactic
+The number of leftward immediate children of the word in the syntactic
 dependency parse.
 > #### Example
@@ -291,7 +291,7 @@ dependency parse.
 ## Token.n_rights {#n_rights tag="property" model="parser"}
-The number of rightward immediate children of the word, in the syntactic
+The number of rightward immediate children of the word in the syntactic
 dependency parse.
 > #### Example
@@ -422,8 +422,8 @@ The L2 norm of the token's vector representation.
 | `norm_` | The token's norm, i.e. a normalized form of the token text. Can be set in the language's [tokenizer exceptions](/usage/adding-languages#tokenizer-exceptions). ~~str~~ |
 | `lower` | Lowercase form of the token. ~~int~~ |
 | `lower_` | Lowercase form of the token text. Equivalent to `Token.text.lower()`. ~~str~~ |
-| `shape` | Transform of the tokens's string, to show orthographic features. Alphabetic characters are replaced by `x` or `X`, and numeric characters are replaced by `d`, and sequences of the same character are truncated after length 4. For example,`"Xxxx"`or`"dd"`. ~~int~~ |
-| `shape_` | Transform of the tokens's string, to show orthographic features. Alphabetic characters are replaced by `x` or `X`, and numeric characters are replaced by `d`, and sequences of the same character are truncated after length 4. For example,`"Xxxx"`or`"dd"`. ~~str~~ |
+| `shape` | Transform of the tokens's string to show orthographic features. Alphabetic characters are replaced by `x` or `X`, and numeric characters are replaced by `d`, and sequences of the same character are truncated after length 4. For example,`"Xxxx"`or`"dd"`. ~~int~~ |
+| `shape_` | Transform of the tokens's string to show orthographic features. Alphabetic characters are replaced by `x` or `X`, and numeric characters are replaced by `d`, and sequences of the same character are truncated after length 4. For example,`"Xxxx"`or`"dd"`. ~~str~~ |
 | `prefix` | Hash value of a length-N substring from the start of the token. Defaults to `N=1`. ~~int~~ |
 | `prefix_` | A length-N substring from the start of the token. Defaults to `N=1`. ~~str~~ |
 | `suffix` | Hash value of a length-N substring from the end of the token. Defaults to `N=3`. ~~int~~ |
@@ -451,7 +451,7 @@ The L2 norm of the token's vector representation.
 | `tag` | Fine-grained part-of-speech. ~~int~~ |
 | `tag_` | Fine-grained part-of-speech. ~~str~~ |
 | `morph` <Tag variant="new">3</Tag> | Morphological analysis. ~~MorphAnalysis~~ |
-| `morph_` <Tag variant="new">3</Tag> | Morphological analysis in the Universal Dependencies [FEATS]https://universaldependencies.org/format.html#morphological-annotation format. ~~str~~ |
+| `morph_` <Tag variant="new">3</Tag> | Morphological analysis in the Universal Dependencies [FEATS](https://universaldependencies.org/format.html#morphological-annotation) format. ~~str~~ |
 | `dep` | Syntactic dependency relation. ~~int~~ |
 | `dep_` | Syntactic dependency relation. ~~str~~ |
 | `lang` | Language of the parent document's vocabulary. ~~int~~ |
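To ground the properties and attributes touched in these hunks, a small sketch assuming a trained English pipeline with a parser:

```python
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("I like New York in Autumn.")
york = doc[3]
print([t.text for t in york.lefts], york.n_lefts)  # leftward children of "York"
print(york.shape_, york.tag_, york.dep_)           # e.g. "Xxxx" plus tag and dep labels
```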

View File

@@ -1,6 +1,6 @@
 ---
 title: Tokenizer
-teaser: Segment text into words, punctuations marks etc.
+teaser: Segment text into words, punctuations marks, etc.
 tag: class
 source: spacy/tokenizer.pyx
 ---
@@ -15,14 +15,14 @@ source: spacy/tokenizer.pyx
 Segment text, and create `Doc` objects with the discovered segment boundaries.
 For a deeper understanding, see the docs on
 [how spaCy's tokenizer works](/usage/linguistic-features#how-tokenizer-works).
-The tokenizer is typically created automatically when the a
+The tokenizer is typically created automatically when a
 [`Language`](/api/language) subclass is initialized and it reads its settings
 like punctuation and special case rules from the
 [`Language.Defaults`](/api/language#defaults) provided by the language subclass.
 ## Tokenizer.\_\_init\_\_ {#init tag="method"}
-Create a `Tokenizer`, to create `Doc` objects given unicode text. For examples
+Create a `Tokenizer` to create `Doc` objects given unicode text. For examples
 of how to construct a custom tokenizer with different tokenization rules, see
 the
 [usage documentation](https://spacy.io/usage/linguistic-features#native-tokenizers).
@@ -87,7 +87,7 @@ Tokenize a stream of texts.
 | ------------ | ------------------------------------------------------------------------------------ |
 | `texts` | A sequence of unicode texts. ~~Iterable[str]~~ |
 | `batch_size` | The number of texts to accumulate in an internal buffer. Defaults to `1000`. ~~int~~ |
-| **YIELDS** | The tokenized Doc objects, in order. ~~Doc~~ |
+| **YIELDS** | The tokenized `Doc` objects, in order. ~~Doc~~ |
 ## Tokenizer.find_infix {#find_infix tag="method"}
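For the `Tokenizer.pipe` hunk above, a small usage sketch:

```python
import spacy

nlp = spacy.blank("en")
texts = ["First document.", "Second document.", "And many more."]
# Tokenize a stream of texts; yields Doc objects in order
for doc in nlp.tokenizer.pipe(texts, batch_size=50):
    print([token.text for token in doc])
```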

View File

@@ -196,7 +196,7 @@ browser. Will run a simple web server.
 | `page` | Render markup as full HTML page. Defaults to `True`. ~~bool~~ |
 | `minify` | Minify HTML markup. Defaults to `False`. ~~bool~~ |
 | `options` | [Visualizer-specific options](#displacy_options), e.g. colors. ~~Dict[str, Any]~~ |
-| `manual` | Don't parse `Doc` and instead, expect a dict or list of dicts. [See here](/usage/visualizers#manual-usage) for formats and examples. Defaults to `False`. ~~bool~~ |
+| `manual` | Don't parse `Doc` and instead expect a dict or list of dicts. [See here](/usage/visualizers#manual-usage) for formats and examples. Defaults to `False`. ~~bool~~ |
 | `port` | Port to serve visualization. Defaults to `5000`. ~~int~~ |
 | `host` | Host to serve visualization. Defaults to `"0.0.0.0"`. ~~str~~ |
@@ -221,7 +221,7 @@ Render a dependency parse tree or named entity visualization.
 | `page` | Render markup as full HTML page. Defaults to `True`. ~~bool~~ |
 | `minify` | Minify HTML markup. Defaults to `False`. ~~bool~~ |
 | `options` | [Visualizer-specific options](#displacy_options), e.g. colors. ~~Dict[str, Any]~~ |
-| `manual` | Don't parse `Doc` and instead, expect a dict or list of dicts. [See here](/usage/visualizers#manual-usage) for formats and examples. Defaults to `False`. ~~bool~~ |
+| `manual` | Don't parse `Doc` and instead expect a dict or list of dicts. [See here](/usage/visualizers#manual-usage) for formats and examples. Defaults to `False`. ~~bool~~ |
 | `jupyter` | Explicitly enable or disable "[Jupyter](http://jupyter.org/) mode" to return markup ready to be rendered in a notebook. Detected automatically if `None` (default). ~~Optional[bool]~~ |
 | **RETURNS** | The rendered HTML markup. ~~str~~ |
@@ -242,7 +242,7 @@ If a setting is not present in the options, the default value will be used.
 | Name | Description |
 | ------------------------------------------ | -------------------------------------------------------------------------------------------------------------------------------------------- |
 | `fine_grained` | Use fine-grained part-of-speech tags (`Token.tag_`) instead of coarse-grained tags (`Token.pos_`). Defaults to `False`. ~~bool~~ |
-| `add_lemma` <Tag variant="new">2.2.4</Tag> | Print the lemma's in a separate row below the token texts. Defaults to `False`. ~~bool~~ |
+| `add_lemma` <Tag variant="new">2.2.4</Tag> | Print the lemmas in a separate row below the token texts. Defaults to `False`. ~~bool~~ |
 | `collapse_punct` | Attach punctuation to tokens. Can make the parse more readable, as it prevents long arcs to attach punctuation. Defaults to `True`. ~~bool~~ |
 | `collapse_phrases` | Merge noun phrases into one token. Defaults to `False`. ~~bool~~ |
 | `compact` | "Compact mode" with square arrows that takes up less space. Defaults to `False`. ~~bool~~ |
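A sketch of how these visualizer options are passed, assuming a trained pipeline; `displacy.render` returns the markup instead of serving it:

```python
import spacy
from spacy import displacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Autonomous cars shift insurance liability toward manufacturers.")
options = {"fine_grained": True, "add_lemma": True, "compact": True}
html = displacy.render(doc, style="dep", options=options, page=True)
```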
@@ -611,7 +611,7 @@ sequences in the batch.
 Encode labelled spans into per-token tags, using the
 [BILUO scheme](/usage/linguistic-features#accessing-ner) (Begin, In, Last, Unit,
-Out). Returns a list of strings, describing the tags. Each tag string will be of
+Out). Returns a list of strings, describing the tags. Each tag string will be in
 the form of either `""`, `"O"` or `"{action}-{label}"`, where action is one of
 `"B"`, `"I"`, `"L"`, `"U"`. The string `"-"` is used where the entity offsets
 don't align with the tokenization in the `Doc` object. The training algorithm
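As an illustration of the BILUO encoding described here, a sketch using the helper that produces these tags (assumed to be exposed as `offsets_to_biluo_tags` in spaCy v3; in v2 the equivalent function was `biluo_tags_from_offsets`):

```python
import spacy
from spacy.training import offsets_to_biluo_tags

nlp = spacy.blank("en")
doc = nlp("I like London.")
entities = [(7, 13, "LOC")]                  # character offsets covering "London"
tags = offsets_to_biluo_tags(doc, entities)
assert tags == ["O", "O", "U-LOC", "O"]
```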
@@ -716,7 +716,7 @@ decorator.
 ### util.lang_class_is_loaded {#util.lang_class_is_loaded tag="function" new="2.1"}
 Check whether a `Language` subclass is already loaded. `Language` subclasses are
-loaded lazily, to avoid expensive setup code associated with the language data.
+loaded lazily to avoid expensive setup code associated with the language data.
 > #### Example
 >
@@ -904,7 +904,7 @@ Compile a sequence of prefix rules into a regex object.
 | Name | Description |
 | ----------- | ------------------------------------------------------------------------------------------------------------------------------------------- |
 | `entries` | The prefix rules, e.g. [`lang.punctuation.TOKENIZER_PREFIXES`](%%GITHUB_SPACY/spacy/lang/punctuation.py). ~~Iterable[Union[str, Pattern]]~~ |
-| **RETURNS** | The regex object. to be used for [`Tokenizer.prefix_search`](/api/tokenizer#attributes). ~~Pattern~~ |
+| **RETURNS** | The regex object to be used for [`Tokenizer.prefix_search`](/api/tokenizer#attributes). ~~Pattern~~ |
 ### util.compile_suffix_regex {#util.compile_suffix_regex tag="function"}
@@ -921,7 +921,7 @@ Compile a sequence of suffix rules into a regex object.
 | Name | Description |
 | ----------- | ------------------------------------------------------------------------------------------------------------------------------------------- |
 | `entries` | The suffix rules, e.g. [`lang.punctuation.TOKENIZER_SUFFIXES`](%%GITHUB_SPACY/spacy/lang/punctuation.py). ~~Iterable[Union[str, Pattern]]~~ |
-| **RETURNS** | The regex object. to be used for [`Tokenizer.suffix_search`](/api/tokenizer#attributes). ~~Pattern~~ |
+| **RETURNS** | The regex object to be used for [`Tokenizer.suffix_search`](/api/tokenizer#attributes). ~~Pattern~~ |
 ### util.compile_infix_regex {#util.compile_infix_regex tag="function"}
@@ -938,7 +938,7 @@ Compile a sequence of infix rules into a regex object.
 | Name | Description |
 | ----------- | ----------------------------------------------------------------------------------------------------------------------------------------- |
 | `entries` | The infix rules, e.g. [`lang.punctuation.TOKENIZER_INFIXES`](%%GITHUB_SPACY/spacy/lang/punctuation.py). ~~Iterable[Union[str, Pattern]]~~ |
-| **RETURNS** | The regex object. to be used for [`Tokenizer.infix_finditer`](/api/tokenizer#attributes). ~~Pattern~~ |
+| **RETURNS** | The regex object to be used for [`Tokenizer.infix_finditer`](/api/tokenizer#attributes). ~~Pattern~~ |
 ### util.minibatch {#util.minibatch tag="function" new="2"}
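A sketch of how the three `compile_*_regex` helpers above are typically wired back into the tokenizer; this mirrors the customization pattern in the usage docs, with the rule sets taken from the language defaults:

```python
import spacy
from spacy.util import compile_prefix_regex, compile_suffix_regex, compile_infix_regex

nlp = spacy.blank("en")
prefix_regex = compile_prefix_regex(nlp.Defaults.prefixes)
suffix_regex = compile_suffix_regex(nlp.Defaults.suffixes)
infix_regex = compile_infix_regex(nlp.Defaults.infixes)
# The compiled regex objects plug straight into the tokenizer attributes
nlp.tokenizer.prefix_search = prefix_regex.search
nlp.tokenizer.suffix_search = suffix_regex.search
nlp.tokenizer.infix_finditer = infix_regex.finditer
```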

View File

@@ -186,7 +186,7 @@ setting up the label scheme based on the data.
 ## Transformer.predict {#predict tag="method"}
-Apply the component's model to a batch of [`Doc`](/api/doc) objects, without
+Apply the component's model to a batch of [`Doc`](/api/doc) objects without
 modifying them.
 > #### Example
@@ -203,7 +203,7 @@ modifying them.
 ## Transformer.set_annotations {#set_annotations tag="method"}
-Assign the extracted features to the Doc objects. By default, the
+Assign the extracted features to the `Doc` objects. By default, the
 [`TransformerData`](/api/transformer#transformerdata) object is written to the
 [`Doc._.trf_data`](#custom-attributes) attribute. Your `set_extra_annotations`
 callback is then called, if provided.
@@ -272,7 +272,7 @@ Create an optimizer for the pipeline component.
 ## Transformer.use_params {#use_params tag="method, contextmanager"}
-Modify the pipe's model, to use the given parameter values. At the end of the
+Modify the pipe's model to use the given parameter values. At the end of the
 context, the original parameters are restored.
 > #### Example
@@ -388,8 +388,8 @@ by this class. Instances of this class are typically assigned to the
 | Name | Description |
 | --------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
-| `tokens` | A slice of the tokens data produced by the tokenizer. This may have several fields, including the token IDs, the texts, and the attention mask. See the [`transformers.BatchEncoding`](https://huggingface.co/transformers/main_classes/tokenizer.html#transformers.BatchEncoding) object for details. ~~dict~~ |
-| `tensors` | The activations for the Doc from the transformer. Usually the last tensor that is 3-dimensional will be the most important, as that will provide the final hidden state. Generally activations that are 2-dimensional will be attention weights. Details of this variable will differ depending on the underlying transformer model. ~~List[FloatsXd]~~ |
+| `tokens` | A slice of the tokens data produced by the tokenizer. This may have several fields, including the token IDs, the texts and the attention mask. See the [`transformers.BatchEncoding`](https://huggingface.co/transformers/main_classes/tokenizer.html#transformers.BatchEncoding) object for details. ~~dict~~ |
+| `tensors` | The activations for the `Doc` from the transformer. Usually the last tensor that is 3-dimensional will be the most important, as that will provide the final hidden state. Generally activations that are 2-dimensional will be attention weights. Details of this variable will differ depending on the underlying transformer model. ~~List[FloatsXd]~~ |
 | `align` | Alignment from the `Doc`'s tokenization to the wordpieces. This is a ragged array, where `align.lengths[i]` indicates the number of wordpiece tokens that token `i` aligns against. The actual indices are provided at `align[i].dataXd`. ~~Ragged~~ |
 | `width` | The width of the last hidden layer. ~~int~~ |
@@ -409,7 +409,7 @@ objects to associate the outputs to each [`Doc`](/api/doc) in the batch.
 | Name | Description |
 | ---------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ |
-| `spans` | The batch of input spans. The outer list refers to the Doc objects in the batch, and the inner list are the spans for that `Doc`. Note that spans are allowed to overlap or exclude tokens, but each Span can only refer to one `Doc` (by definition). This means that within a `Doc`, the regions of the output tensors that correspond to each Span may overlap or have gaps, but for each `Doc`, there is a non-overlapping contiguous slice of the outputs. ~~List[List[Span]]~~ |
+| `spans` | The batch of input spans. The outer list refers to the Doc objects in the batch, and the inner list are the spans for that `Doc`. Note that spans are allowed to overlap or exclude tokens, but each `Span` can only refer to one `Doc` (by definition). This means that within a `Doc`, the regions of the output tensors that correspond to each `Span` may overlap or have gaps, but for each `Doc`, there is a non-overlapping contiguous slice of the outputs. ~~List[List[Span]]~~ |
 | `tokens` | The output of the tokenizer. ~~transformers.BatchEncoding~~ |
 | `tensors` | The output of the transformer model. ~~List[torch.Tensor]~~ |
 | `align` | Alignment from the spaCy tokenization to the wordpieces. This is a ragged array, where `align.lengths[i]` indicates the number of wordpiece tokens that token `i` aligns against. The actual indices are provided at `align[i].dataXd`. ~~Ragged~~ |
@@ -439,10 +439,10 @@ Split a `TransformerData` object that represents a batch into a list with one
 ## Span getters {#span_getters source="github.com/explosion/spacy-transformers/blob/master/spacy_transformers/span_getters.py"}
 Span getters are functions that take a batch of [`Doc`](/api/doc) objects and
-return a lists of [`Span`](/api/span) objects for each doc, to be processed by
-the transformer. This is used to manage long documents, by cutting them into
+return a lists of [`Span`](/api/span) objects for each doc to be processed by
+the transformer. This is used to manage long documents by cutting them into
 smaller sequences before running the transformer. The spans are allowed to
-overlap, and you can also omit sections of the Doc if they are not relevant.
+overlap, and you can also omit sections of the `Doc` if they are not relevant.
 Span getters can be referenced in the `[components.transformer.model.get_spans]`
 block of the config to customize the sequences processed by the transformer. You
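As a sketch of the span-getter mechanism described here, a custom getter that feeds one span per sentence to the transformer. The `spacy.registry.span_getters` decorator and the registry name are assumptions based on how such functions are registered; the registered name can then be referenced from the `[components.transformer.model.get_spans]` config block:

```python
import spacy

@spacy.registry.span_getters("custom_sent_spans")
def configure_custom_sent_spans():
    def get_sent_spans(docs):
        # One span per sentence, so long documents are cut into smaller sequences
        return [list(doc.sents) for doc in docs]
    return get_sent_spans
```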

View File

@@ -290,7 +290,7 @@ If a table is full, it can be resized using
 ## Vectors.n_keys {#n_keys tag="property"}
 Get the number of keys in the table. Note that this is the number of _all_ keys,
-not just unique vectors. If several keys are mapped are mapped to the same
+not just unique vectors. If several keys are mapped to the same
 vectors, they will be counted individually.
 > #### Example
@@ -307,10 +307,10 @@ vectors, they will be counted individually.
 ## Vectors.most_similar {#most_similar tag="method"}
-For each of the given vectors, find the `n` most similar entries to it, by
+For each of the given vectors, find the `n` most similar entries to it by
 cosine. Queries are by vector. Results are returned as a
 `(keys, best_rows, scores)` tuple. If `queries` is large, the calculations are
-performed in chunks, to avoid consuming too much memory. You can set the
+performed in chunks to avoid consuming too much memory. You can set the
 `batch_size` to control the size/space trade-off during the calculations.
 > #### Example
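A small sketch of the `most_similar` call and its `(keys, best_rows, scores)` return value, assuming a pipeline with loaded vectors such as `en_core_web_md`:

```python
import numpy
import spacy

nlp = spacy.load("en_core_web_md")
# Queries are given by vector, one row per query
queries = numpy.asarray([nlp.vocab["dog"].vector], dtype="f")
keys, best_rows, scores = nlp.vocab.vectors.most_similar(queries, n=10, batch_size=1024)
```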

View File

@@ -29,7 +29,7 @@ Create the vocabulary.
 | `oov_prob` | The default OOV probability. Defaults to `-20.0`. ~~float~~ |
 | `vectors_name` <Tag variant="new">2.2</Tag> | A name to identify the vectors table. ~~str~~ |
 | `writing_system` | A dictionary describing the language's writing system. Typically provided by [`Language.Defaults`](/api/language#defaults). ~~Dict[str, Any]~~ |
-| `get_noun_chunks` | A function that yields base noun phrases, used for [`Doc.noun_chunks`](/ap/doc#noun_chunks). ~~Optional[Callable[[Union[Doc, Span], Iterator[Span]]]]~~ |
+| `get_noun_chunks` | A function that yields base noun phrases used for [`Doc.noun_chunks`](/ap/doc#noun_chunks). ~~Optional[Callable[[Union[Doc, Span], Iterator[Span]]]]~~ |
 ## Vocab.\_\_len\_\_ {#len tag="method"}
@@ -150,7 +150,7 @@ rows, we would discard the vectors for "feline" and "reclined". These words
 would then be remapped to the closest remaining vector so "feline" would have
 the same vector as "cat", and "reclined" would have the same vector as "sat".
 The similarities are judged by cosine. The original vectors may be large, so the
-cosines are calculated in minibatches, to reduce memory usage.
+cosines are calculated in minibatches to reduce memory usage.
 > #### Example
 >
@@ -170,7 +170,7 @@ cosines are calculated in minibatches, to reduce memory usage.
 Retrieve a vector for a word in the vocabulary. Words can be looked up by string
 or hash value. If no vectors data is loaded, a `ValueError` is raised. If `minn`
 is defined, then the resulting vector uses [FastText](https://fasttext.cc/)'s
-subword features by average over ngrams of `orth` (introduced in spaCy `v2.1`).
+subword features by average over n-grams of `orth` (introduced in spaCy `v2.1`).
 > #### Example
 >
@@ -182,13 +182,13 @@ subword features by average over ngrams of `orth` (introduced in spaCy `v2.1`).
 | Name | Description |
 | ----------------------------------- | ---------------------------------------------------------------------------------------------------------------------- |
 | `orth` | The hash value of a word, or its unicode string. ~~Union[int, str]~~ |
-| `minn` <Tag variant="new">2.1</Tag> | Minimum n-gram length used for FastText's ngram computation. Defaults to the length of `orth`. ~~int~~ |
-| `maxn` <Tag variant="new">2.1</Tag> | Maximum n-gram length used for FastText's ngram computation. Defaults to the length of `orth`. ~~int~~ |
+| `minn` <Tag variant="new">2.1</Tag> | Minimum n-gram length used for FastText's n-gram computation. Defaults to the length of `orth`. ~~int~~ |
+| `maxn` <Tag variant="new">2.1</Tag> | Maximum n-gram length used for FastText's n-gram computation. Defaults to the length of `orth`. ~~int~~ |
 | **RETURNS** | A word vector. Size and shape are determined by the `Vocab.vectors` instance. ~~numpy.ndarray[ndim=1, dtype=float32]~~ |
 ## Vocab.set_vector {#set_vector tag="method" new="2"}
-Set a vector for a word in the vocabulary. Words can be referenced by by string
+Set a vector for a word in the vocabulary. Words can be referenced by string
 or hash value.
 > #### Example
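A sketch of setting and retrieving a vector by string key, as described in these hunks; the pipeline and the 300-dimensional width are assumptions for illustration:

```python
import numpy
import spacy

# Assumes a pipeline with a 300-dimensional vectors table, e.g. en_core_web_md
nlp = spacy.load("en_core_web_md")
vector = numpy.random.uniform(-1, 1, (300,)).astype("float32")
nlp.vocab.set_vector("avocado", vector)
assert nlp.vocab.get_vector("avocado").shape == (300,)
```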

View File

@@ -36,7 +36,7 @@ models such as [transformers](#transformers) is that word vectors model
 context around them, a transformer model like BERT can't really help you. BERT
 is designed to understand language **in context**, which isn't what you have. A
 word vectors table will be a much better fit for your task. However, if you do
-have words in context — whole sentences or paragraphs of running text — word
+have words in context whole sentences or paragraphs of running text word
 vectors will only provide a very rough approximation of what the text is about.
 Word vectors are also very computationally efficient, as they map a word to a