Mirror of https://github.com/explosion/spaCy.git
Proofreading

Another round of proofreading. All the API docs have been read through and I've grazed the Usage docs.

parent 3dd5f409ec
commit 3360825e00
@@ -444,8 +444,7 @@ invalidated, although they may accidentally continue to work.

Mark a span for merging. The `attrs` will be applied to the resulting token (if
they're context-dependent token attributes like `LEMMA` or `DEP`) or to the
underlying lexeme (if they're context-independent lexical attributes like
- `LOWER` or `IS_STOP`). Writable custom extension attributes can be provided as a
- dictionary mapping attribute name to values as the `"_"` key.
+ `LOWER` or `IS_STOP`). Writable custom extension attributes can be provided using the `"_"` key and specifying a dictionary that maps attribute name to values.

> #### Example
>
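As a quick illustration of the merge behavior described above, a minimal sketch assuming an installed `en_core_web_sm` pipeline and a hypothetical `is_city` extension registered only for this example:

```python
import spacy
from spacy.tokens import Token

# Hypothetical writable custom extension, registered just for this sketch.
Token.set_extension("is_city", default=False)

nlp = spacy.load("en_core_web_sm")
doc = nlp("I live in New York City")
with doc.retokenize() as retokenizer:
    # Token attributes go in `attrs`; custom extensions go under the "_" key.
    retokenizer.merge(doc[3:6], attrs={"LEMMA": "new york city", "_": {"is_city": True}})
print([(token.text, token._.is_city) for token in doc])
```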
@@ -26,7 +26,7 @@ Merge noun chunks into a single token. Also available via the string name

<Infobox variant="warning">

- Since noun chunks require part-of-speech tags and the dependency parser, make
+ Since noun chunks require part-of-speech tags and the dependency parse, make
sure to add this component _after_ the `"tagger"` and `"parser"` components. By
default, `nlp.add_pipe` will add components to the end of the pipeline and after
all other components.
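For context, a minimal sketch of adding the component, assuming an installed `en_core_web_sm` pipeline that provides the tagger and parser:

```python
import spacy

nlp = spacy.load("en_core_web_sm")
# add_pipe appends to the end of the pipeline, i.e. after "tagger" and "parser".
nlp.add_pipe("merge_noun_chunks")
doc = nlp("The quick brown fox jumps over the lazy dog.")
print([token.text for token in doc])  # noun chunks appear as single tokens
```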
@@ -187,7 +187,7 @@ the character indices don't map to a valid span.

| Name | Description |
| ------------------------------------ | ----------------------------------------------------------------------------------------- |
| `start` | The index of the first character of the span. ~~int~~ |
- | `end` | The index of the last character after the span. ~int~~ |
+ | `end` | The index of the last character after the span. ~~int~~ |
| `label` | A label to attach to the span, e.g. for named entities. ~~Union[int, str]~~ |
| `kb_id` <Tag variant="new">2.2</Tag> | An ID from a knowledge base to capture the meaning of a named entity. ~~Union[int, str]~~ |
| `vector` | A meaning representation of the span. ~~numpy.ndarray[ndim=1, dtype=float32]~~ |
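A small sketch of the behavior the table describes; a blank English pipeline is enough, since character offsets need no trained model:

```python
import spacy

nlp = spacy.blank("en")
doc = nlp("I like New York in Autumn.")
span = doc.char_span(7, 15, label="GPE")
# Returns None when the character indices don't map to a valid span.
print(span.text if span is not None else "no valid span")
```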
@@ -153,7 +153,7 @@ setting up the label scheme based on the data.

## TextCategorizer.predict {#predict tag="method"}

- Apply the component's model to a batch of [`Doc`](/api/doc) objects, without
+ Apply the component's model to a batch of [`Doc`](/api/doc) objects without
modifying them.

> #### Example
@@ -170,7 +170,7 @@ modifying them.

## TextCategorizer.set_annotations {#set_annotations tag="method"}

- Modify a batch of [`Doc`](/api/doc) objects, using pre-computed scores.
+ Modify a batch of [`Doc`](/api/doc) objects using pre-computed scores.

> #### Example
>
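A sketch of how `predict` and `set_annotations` fit together, assuming a freshly initialized (untrained) text categorizer just to demonstrate the calls:

```python
import spacy

nlp = spacy.blank("en")
textcat = nlp.add_pipe("textcat")
textcat.add_label("POSITIVE")
textcat.add_label("NEGATIVE")
nlp.initialize()  # random weights are enough to show the API

docs = [nlp.make_doc("This is great."), nlp.make_doc("This is terrible.")]
scores = textcat.predict(docs)         # the docs themselves are not modified
textcat.set_annotations(docs, scores)  # writes the pre-computed scores to doc.cats
print(docs[0].cats)
```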
@@ -213,7 +213,7 @@ Delegates to [`predict`](/api/textcategorizer#predict) and

## TextCategorizer.rehearse {#rehearse tag="method,experimental" new="3"}

Perform a "rehearsal" update from a batch of data. Rehearsal updates teach the
- current model to make predictions similar to an initial model, to try to address
+ current model to make predictions similar to an initial model to try to address
the "catastrophic forgetting" problem. This feature is experimental.

> #### Example
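A hedged sketch of a rehearsal step; it assumes `nlp` is a pipeline with a trained `"textcat"` component and `examples` is a list of `spacy.training.Example` objects prepared elsewhere:

```python
# Assumes: `nlp` with a trained "textcat" component, `examples` = List[Example].
textcat = nlp.get_pipe("textcat")
optimizer = nlp.resume_training()
losses = textcat.rehearse(examples, sgd=optimizer)
print(losses)
```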
@@ -286,7 +286,7 @@ Create an optimizer for the pipeline component.

## TextCategorizer.use_params {#use_params tag="method, contextmanager"}

- Modify the pipe's model, to use the given parameter values.
+ Modify the pipe's model to use the given parameter values.

> #### Example
>
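A short sketch of the context manager, assuming `nlp` has a `"textcat"` component and `optimizer` comes from a training loop so its `averages` are available:

```python
# Assumes: `nlp` with a "textcat" component and `optimizer` from training.
textcat = nlp.get_pipe("textcat")
with textcat.use_params(optimizer.averages):
    # Inside the block the model uses the averaged parameters...
    textcat.to_disk("/path/to/best_textcat")
# ...and the original parameters are restored afterwards.
```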
@@ -151,7 +151,7 @@ setting up the label scheme based on the data.

## Tok2Vec.predict {#predict tag="method"}

- Apply the component's model to a batch of [`Doc`](/api/doc) objects, without
+ Apply the component's model to a batch of [`Doc`](/api/doc) objects without
modifying them.

> #### Example
@@ -224,7 +224,7 @@ Create an optimizer for the pipeline component.

## Tok2Vec.use_params {#use_params tag="method, contextmanager"}

- Modify the pipe's model, to use the given parameter values. At the end of the
+ Modify the pipe's model to use the given parameter values. At the end of the
context, the original parameters are restored.

> #### Example
@@ -243,7 +243,7 @@ A sequence of the token's immediate syntactic children.

## Token.lefts {#lefts tag="property" model="parser"}

- The leftward immediate children of the word, in the syntactic dependency parse.
+ The leftward immediate children of the word in the syntactic dependency parse.

> #### Example
>
@@ -259,7 +259,7 @@ The leftward immediate children of the word, in the syntactic dependency parse.

## Token.rights {#rights tag="property" model="parser"}

- The rightward immediate children of the word, in the syntactic dependency parse.
+ The rightward immediate children of the word in the syntactic dependency parse.

> #### Example
>
@@ -275,7 +275,7 @@ The rightward immediate children of the word, in the syntactic dependency parse.

## Token.n_lefts {#n_lefts tag="property" model="parser"}

- The number of leftward immediate children of the word, in the syntactic
+ The number of leftward immediate children of the word in the syntactic
dependency parse.

> #### Example
@@ -291,7 +291,7 @@ dependency parse.

## Token.n_rights {#n_rights tag="property" model="parser"}

- The number of rightward immediate children of the word, in the syntactic
+ The number of rightward immediate children of the word in the syntactic
dependency parse.

> #### Example
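One example covering the four properties above, assuming an installed `en_core_web_sm` pipeline (the exact output depends on the parse the model produces):

```python
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("I like New York in Autumn.")
york = doc[3]
print([t.text for t in york.lefts], york.n_lefts)    # e.g. ['New'] 1
print([t.text for t in york.rights], york.n_rights)  # e.g. ['in'] 1
```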
@@ -422,8 +422,8 @@ The L2 norm of the token's vector representation.

| `norm_` | The token's norm, i.e. a normalized form of the token text. Can be set in the language's [tokenizer exceptions](/usage/adding-languages#tokenizer-exceptions). ~~str~~ |
| `lower` | Lowercase form of the token. ~~int~~ |
| `lower_` | Lowercase form of the token text. Equivalent to `Token.text.lower()`. ~~str~~ |
- | `shape` | Transform of the tokens's string, to show orthographic features. Alphabetic characters are replaced by `x` or `X`, and numeric characters are replaced by `d`, and sequences of the same character are truncated after length 4. For example,`"Xxxx"`or`"dd"`. ~~int~~ |
- | `shape_` | Transform of the tokens's string, to show orthographic features. Alphabetic characters are replaced by `x` or `X`, and numeric characters are replaced by `d`, and sequences of the same character are truncated after length 4. For example,`"Xxxx"`or`"dd"`. ~~str~~ |
+ | `shape` | Transform of the tokens's string to show orthographic features. Alphabetic characters are replaced by `x` or `X`, and numeric characters are replaced by `d`, and sequences of the same character are truncated after length 4. For example,`"Xxxx"`or`"dd"`. ~~int~~ |
+ | `shape_` | Transform of the tokens's string to show orthographic features. Alphabetic characters are replaced by `x` or `X`, and numeric characters are replaced by `d`, and sequences of the same character are truncated after length 4. For example,`"Xxxx"`or`"dd"`. ~~str~~ |
| `prefix` | Hash value of a length-N substring from the start of the token. Defaults to `N=1`. ~~int~~ |
| `prefix_` | A length-N substring from the start of the token. Defaults to `N=1`. ~~str~~ |
| `suffix` | Hash value of a length-N substring from the end of the token. Defaults to `N=3`. ~~int~~ |
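A quick look at these lexical attributes; a blank pipeline is enough, since they don't require a trained model:

```python
import spacy

nlp = spacy.blank("en")
doc = nlp("Gimme 500 dollars!")
for token in doc:
    # e.g. "Gimme" -> lower_: "gimme", shape_: "Xxxxx", prefix_: "G", suffix_: "mme"
    print(token.text, token.lower_, token.shape_, token.prefix_, token.suffix_)
```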
@@ -451,7 +451,7 @@ The L2 norm of the token's vector representation.

| `tag` | Fine-grained part-of-speech. ~~int~~ |
| `tag_` | Fine-grained part-of-speech. ~~str~~ |
| `morph` <Tag variant="new">3</Tag> | Morphological analysis. ~~MorphAnalysis~~ |
- | `morph_` <Tag variant="new">3</Tag> | Morphological analysis in the Universal Dependencies [FEATS]https://universaldependencies.org/format.html#morphological-annotation format. ~~str~~ |
+ | `morph_` <Tag variant="new">3</Tag> | Morphological analysis in the Universal Dependencies [FEATS](https://universaldependencies.org/format.html#morphological-annotation) format. ~~str~~ |
| `dep` | Syntactic dependency relation. ~~int~~ |
| `dep_` | Syntactic dependency relation. ~~str~~ |
| `lang` | Language of the parent document's vocabulary. ~~int~~ |
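For the FEATS string mentioned in the `morph_` row, a small sketch assuming an installed `en_core_web_sm` pipeline:

```python
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("She was reading the paper.")
token = doc[2]  # "reading"
print(token.tag_, token.dep_)
# Printing the MorphAnalysis gives the FEATS-style string,
# e.g. "Aspect=Prog|Tense=Pres|VerbForm=Part".
print(token.morph)
```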
@@ -1,6 +1,6 @@

---
title: Tokenizer
- teaser: Segment text into words, punctuations marks etc.
+ teaser: Segment text into words, punctuations marks, etc.
tag: class
source: spacy/tokenizer.pyx
---
@@ -15,14 +15,14 @@ source: spacy/tokenizer.pyx

Segment text, and create `Doc` objects with the discovered segment boundaries.
For a deeper understanding, see the docs on
[how spaCy's tokenizer works](/usage/linguistic-features#how-tokenizer-works).
- The tokenizer is typically created automatically when the a
+ The tokenizer is typically created automatically when a
[`Language`](/api/language) subclass is initialized and it reads its settings
like punctuation and special case rules from the
[`Language.Defaults`](/api/language#defaults) provided by the language subclass.

## Tokenizer.\_\_init\_\_ {#init tag="method"}

- Create a `Tokenizer`, to create `Doc` objects given unicode text. For examples
+ Create a `Tokenizer` to create `Doc` objects given unicode text. For examples
of how to construct a custom tokenizer with different tokenization rules, see
the
[usage documentation](https://spacy.io/usage/linguistic-features#native-tokenizers).
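A minimal construction sketch; a bare tokenizer only needs the shared vocab, with all rule arguments optional:

```python
from spacy.lang.en import English
from spacy.tokenizer import Tokenizer

nlp = English()
tokenizer = Tokenizer(nlp.vocab)  # no special-case rules, prefixes, suffixes or infixes
doc = tokenizer("Let's tokenize this text.")
print([token.text for token in doc])
```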
@@ -87,7 +87,7 @@ Tokenize a stream of texts.

| ------------ | ------------------------------------------------------------------------------------ |
| `texts` | A sequence of unicode texts. ~~Iterable[str]~~ |
| `batch_size` | The number of texts to accumulate in an internal buffer. Defaults to `1000`. ~~int~~ |
- | **YIELDS** | The tokenized Doc objects, in order. ~~Doc~~ |
+ | **YIELDS** | The tokenized `Doc` objects, in order. ~~Doc~~ |

## Tokenizer.find_infix {#find_infix tag="method"}
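A small sketch of streaming texts through the tokenizer, using a blank pipeline's tokenizer:

```python
import spacy

nlp = spacy.blank("en")
texts = ["One document.", "Another document.", "Lots of documents..."]
for doc in nlp.tokenizer.pipe(texts, batch_size=50):
    print([token.text for token in doc])
```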
@@ -196,7 +196,7 @@ browser. Will run a simple web server.

| `page` | Render markup as full HTML page. Defaults to `True`. ~~bool~~ |
| `minify` | Minify HTML markup. Defaults to `False`. ~~bool~~ |
| `options` | [Visualizer-specific options](#displacy_options), e.g. colors. ~~Dict[str, Any]~~ |
- | `manual` | Don't parse `Doc` and instead, expect a dict or list of dicts. [See here](/usage/visualizers#manual-usage) for formats and examples. Defaults to `False`. ~~bool~~ |
+ | `manual` | Don't parse `Doc` and instead expect a dict or list of dicts. [See here](/usage/visualizers#manual-usage) for formats and examples. Defaults to `False`. ~~bool~~ |
| `port` | Port to serve visualization. Defaults to `5000`. ~~int~~ |
| `host` | Host to serve visualization. Defaults to `"0.0.0.0"`. ~~str~~ |
@@ -221,7 +221,7 @@ Render a dependency parse tree or named entity visualization.

| `page` | Render markup as full HTML page. Defaults to `True`. ~~bool~~ |
| `minify` | Minify HTML markup. Defaults to `False`. ~~bool~~ |
| `options` | [Visualizer-specific options](#displacy_options), e.g. colors. ~~Dict[str, Any]~~ |
- | `manual` | Don't parse `Doc` and instead, expect a dict or list of dicts. [See here](/usage/visualizers#manual-usage) for formats and examples. Defaults to `False`. ~~bool~~ |
+ | `manual` | Don't parse `Doc` and instead expect a dict or list of dicts. [See here](/usage/visualizers#manual-usage) for formats and examples. Defaults to `False`. ~~bool~~ |
| `jupyter` | Explicitly enable or disable "[Jupyter](http://jupyter.org/) mode" to return markup ready to be rendered in a notebook. Detected automatically if `None` (default). ~~Optional[bool]~~ |
| **RETURNS** | The rendered HTML markup. ~~str~~ |
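A sketch of the `manual` option described in both tables, rendering pre-computed entities without running a pipeline:

```python
from spacy import displacy

# Pre-computed annotations in the "manual" dict format.
ex = [{
    "text": "But Google is starting from behind.",
    "ents": [{"start": 4, "end": 10, "label": "ORG"}],
    "title": None,
}]
html = displacy.render(ex, style="ent", manual=True, page=True, minify=True)
```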
@@ -242,7 +242,7 @@ If a setting is not present in the options, the default value will be used.

| Name | Description |
| ------------------------------------------ | -------------------------------------------------------------------------------------------------------------------------------------------- |
| `fine_grained` | Use fine-grained part-of-speech tags (`Token.tag_`) instead of coarse-grained tags (`Token.pos_`). Defaults to `False`. ~~bool~~ |
- | `add_lemma` <Tag variant="new">2.2.4</Tag> | Print the lemma's in a separate row below the token texts. Defaults to `False`. ~~bool~~ |
+ | `add_lemma` <Tag variant="new">2.2.4</Tag> | Print the lemmas in a separate row below the token texts. Defaults to `False`. ~~bool~~ |
| `collapse_punct` | Attach punctuation to tokens. Can make the parse more readable, as it prevents long arcs to attach punctuation. Defaults to `True`. ~~bool~~ |
| `collapse_phrases` | Merge noun phrases into one token. Defaults to `False`. ~~bool~~ |
| `compact` | "Compact mode" with square arrows that takes up less space. Defaults to `False`. ~~bool~~ |
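How these options are typically passed to the dependency visualizer, assuming an installed `en_core_web_sm` pipeline:

```python
import spacy
from spacy import displacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Autonomous cars shift insurance liability toward manufacturers.")
options = {"compact": True, "add_lemma": True, "fine_grained": False}
html = displacy.render(doc, style="dep", options=options)
```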
@@ -611,7 +611,7 @@ sequences in the batch.

Encode labelled spans into per-token tags, using the
[BILUO scheme](/usage/linguistic-features#accessing-ner) (Begin, In, Last, Unit,
- Out). Returns a list of strings, describing the tags. Each tag string will be of
+ Out). Returns a list of strings, describing the tags. Each tag string will be in
the form of either `""`, `"O"` or `"{action}-{label}"`, where action is one of
`"B"`, `"I"`, `"L"`, `"U"`. The string `"-"` is used where the entity offsets
don't align with the tokenization in the `Doc` object. The training algorithm
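A sketch of the encoding, using the v3 helper `spacy.training.offsets_to_biluo_tags` (the name of this function at the time of writing; character offsets refer to the raw text):

```python
import spacy
from spacy.training import offsets_to_biluo_tags

nlp = spacy.blank("en")
doc = nlp("I flew to Silicon Valley via London.")
entities = [(10, 24, "LOC"), (29, 35, "GPE")]
tags = offsets_to_biluo_tags(doc, entities)
print(tags)  # ['O', 'O', 'O', 'B-LOC', 'L-LOC', 'O', 'U-GPE', 'O']
```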
@@ -716,7 +716,7 @@ decorator.

### util.lang_class_is_loaded {#util.lang_class_is_loaded tag="function" new="2.1"}

Check whether a `Language` subclass is already loaded. `Language` subclasses are
- loaded lazily, to avoid expensive setup code associated with the language data.
+ loaded lazily to avoid expensive setup code associated with the language data.

> #### Example
>
@@ -904,7 +904,7 @@ Compile a sequence of prefix rules into a regex object.

| Name | Description |
| ----------- | ------------------------------------------------------------------------------------------------------------------------------------------- |
| `entries` | The prefix rules, e.g. [`lang.punctuation.TOKENIZER_PREFIXES`](%%GITHUB_SPACY/spacy/lang/punctuation.py). ~~Iterable[Union[str, Pattern]]~~ |
- | **RETURNS** | The regex object. to be used for [`Tokenizer.prefix_search`](/api/tokenizer#attributes). ~~Pattern~~ |
+ | **RETURNS** | The regex object to be used for [`Tokenizer.prefix_search`](/api/tokenizer#attributes). ~~Pattern~~ |

### util.compile_suffix_regex {#util.compile_suffix_regex tag="function"}
@@ -921,7 +921,7 @@ Compile a sequence of suffix rules into a regex object.

| Name | Description |
| ----------- | ------------------------------------------------------------------------------------------------------------------------------------------- |
| `entries` | The suffix rules, e.g. [`lang.punctuation.TOKENIZER_SUFFIXES`](%%GITHUB_SPACY/spacy/lang/punctuation.py). ~~Iterable[Union[str, Pattern]]~~ |
- | **RETURNS** | The regex object. to be used for [`Tokenizer.suffix_search`](/api/tokenizer#attributes). ~~Pattern~~ |
+ | **RETURNS** | The regex object to be used for [`Tokenizer.suffix_search`](/api/tokenizer#attributes). ~~Pattern~~ |

### util.compile_infix_regex {#util.compile_infix_regex tag="function"}
@@ -938,7 +938,7 @@ Compile a sequence of infix rules into a regex object.

| Name | Description |
| ----------- | ----------------------------------------------------------------------------------------------------------------------------------------- |
| `entries` | The infix rules, e.g. [`lang.punctuation.TOKENIZER_INFIXES`](%%GITHUB_SPACY/spacy/lang/punctuation.py). ~~Iterable[Union[str, Pattern]]~~ |
- | **RETURNS** | The regex object. to be used for [`Tokenizer.infix_finditer`](/api/tokenizer#attributes). ~~Pattern~~ |
+ | **RETURNS** | The regex object to be used for [`Tokenizer.infix_finditer`](/api/tokenizer#attributes). ~~Pattern~~ |

### util.minibatch {#util.minibatch tag="function" new="2"}
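These compile helpers are usually combined with the language defaults when customizing tokenization; a sketch with a hypothetical extra suffix rule:

```python
import spacy
from spacy.util import compile_suffix_regex

nlp = spacy.blank("en")
# Hypothetical extra rule: split off trailing hyphens as their own suffix.
suffixes = list(nlp.Defaults.suffixes) + [r"-+$"]
suffix_regex = compile_suffix_regex(suffixes)
nlp.tokenizer.suffix_search = suffix_regex.search
print([token.text for token in nlp("Well-")])
```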
@@ -186,7 +186,7 @@ setting up the label scheme based on the data.

## Transformer.predict {#predict tag="method"}

- Apply the component's model to a batch of [`Doc`](/api/doc) objects, without
+ Apply the component's model to a batch of [`Doc`](/api/doc) objects without
modifying them.

> #### Example
@@ -203,7 +203,7 @@ modifying them.

## Transformer.set_annotations {#set_annotations tag="method"}

- Assign the extracted features to the Doc objects. By default, the
+ Assign the extracted features to the `Doc` objects. By default, the
[`TransformerData`](/api/transformer#transformerdata) object is written to the
[`Doc._.trf_data`](#custom-attributes) attribute. Your `set_extra_annotations`
callback is then called, if provided.
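A sketch of reading the annotation back, assuming the `spacy-transformers` package and an installed `en_core_web_trf` pipeline:

```python
import spacy

nlp = spacy.load("en_core_web_trf")
doc = nlp("Apple shares rose on the news.")
trf_data = doc._.trf_data          # TransformerData set by Transformer.set_annotations
print(trf_data.tensors[-1].shape)  # last tensor: the final hidden state
```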
@@ -272,7 +272,7 @@ Create an optimizer for the pipeline component.

## Transformer.use_params {#use_params tag="method, contextmanager"}

- Modify the pipe's model, to use the given parameter values. At the end of the
+ Modify the pipe's model to use the given parameter values. At the end of the
context, the original parameters are restored.

> #### Example
@@ -388,8 +388,8 @@ by this class. Instances of this class are typically assigned to the

| Name | Description |
| --------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
- | `tokens` | A slice of the tokens data produced by the tokenizer. This may have several fields, including the token IDs, the texts, and the attention mask. See the [`transformers.BatchEncoding`](https://huggingface.co/transformers/main_classes/tokenizer.html#transformers.BatchEncoding) object for details. ~~dict~~ |
- | `tensors` | The activations for the Doc from the transformer. Usually the last tensor that is 3-dimensional will be the most important, as that will provide the final hidden state. Generally activations that are 2-dimensional will be attention weights. Details of this variable will differ depending on the underlying transformer model. ~~List[FloatsXd]~~ |
+ | `tokens` | A slice of the tokens data produced by the tokenizer. This may have several fields, including the token IDs, the texts and the attention mask. See the [`transformers.BatchEncoding`](https://huggingface.co/transformers/main_classes/tokenizer.html#transformers.BatchEncoding) object for details. ~~dict~~ |
+ | `tensors` | The activations for the `Doc` from the transformer. Usually the last tensor that is 3-dimensional will be the most important, as that will provide the final hidden state. Generally activations that are 2-dimensional will be attention weights. Details of this variable will differ depending on the underlying transformer model. ~~List[FloatsXd]~~ |
| `align` | Alignment from the `Doc`'s tokenization to the wordpieces. This is a ragged array, where `align.lengths[i]` indicates the number of wordpiece tokens that token `i` aligns against. The actual indices are provided at `align[i].dataXd`. ~~Ragged~~ |
| `width` | The width of the last hidden layer. ~~int~~ |
@@ -409,7 +409,7 @@ objects to associate the outputs to each [`Doc`](/api/doc) in the batch.

| Name | Description |
| ---------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ |
- | `spans` | The batch of input spans. The outer list refers to the Doc objects in the batch, and the inner list are the spans for that `Doc`. Note that spans are allowed to overlap or exclude tokens, but each Span can only refer to one `Doc` (by definition). This means that within a `Doc`, the regions of the output tensors that correspond to each Span may overlap or have gaps, but for each `Doc`, there is a non-overlapping contiguous slice of the outputs. ~~List[List[Span]]~~ |
+ | `spans` | The batch of input spans. The outer list refers to the Doc objects in the batch, and the inner list are the spans for that `Doc`. Note that spans are allowed to overlap or exclude tokens, but each `Span` can only refer to one `Doc` (by definition). This means that within a `Doc`, the regions of the output tensors that correspond to each `Span` may overlap or have gaps, but for each `Doc`, there is a non-overlapping contiguous slice of the outputs. ~~List[List[Span]]~~ |
| `tokens` | The output of the tokenizer. ~~transformers.BatchEncoding~~ |
| `tensors` | The output of the transformer model. ~~List[torch.Tensor]~~ |
| `align` | Alignment from the spaCy tokenization to the wordpieces. This is a ragged array, where `align.lengths[i]` indicates the number of wordpiece tokens that token `i` aligns against. The actual indices are provided at `align[i].dataXd`. ~~Ragged~~ |
@@ -439,10 +439,10 @@ Split a `TransformerData` object that represents a batch into a list with one

## Span getters {#span_getters source="github.com/explosion/spacy-transformers/blob/master/spacy_transformers/span_getters.py"}

Span getters are functions that take a batch of [`Doc`](/api/doc) objects and
- return a lists of [`Span`](/api/span) objects for each doc, to be processed by
- the transformer. This is used to manage long documents, by cutting them into
+ return a lists of [`Span`](/api/span) objects for each doc to be processed by
+ the transformer. This is used to manage long documents by cutting them into
smaller sequences before running the transformer. The spans are allowed to
- overlap, and you can also omit sections of the Doc if they are not relevant.
+ overlap, and you can also omit sections of the `Doc` if they are not relevant.

Span getters can be referenced in the `[components.transformer.model.get_spans]`
block of the config to customize the sequences processed by the transformer. You
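A sketch of a custom sentence-based span getter, registered so it can be referenced from that config block (assumes `spacy-transformers` is installed and sentence boundaries are set on the docs):

```python
import spacy

@spacy.registry.span_getters("custom_sent_spans")
def configure_custom_sent_spans():
    def get_sent_spans(docs):
        # One list of spans per Doc, here simply its sentences.
        return [list(doc.sents) for doc in docs]

    return get_sent_spans
```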
@@ -290,7 +290,7 @@ If a table is full, it can be resized using

## Vectors.n_keys {#n_keys tag="property"}

Get the number of keys in the table. Note that this is the number of _all_ keys,
- not just unique vectors. If several keys are mapped are mapped to the same
+ not just unique vectors. If several keys are mapped to the same
vectors, they will be counted individually.

> #### Example
@@ -307,10 +307,10 @@ vectors, they will be counted individually.

## Vectors.most_similar {#most_similar tag="method"}

- For each of the given vectors, find the `n` most similar entries to it, by
+ For each of the given vectors, find the `n` most similar entries to it by
cosine. Queries are by vector. Results are returned as a
`(keys, best_rows, scores)` tuple. If `queries` is large, the calculations are
- performed in chunks, to avoid consuming too much memory. You can set the
+ performed in chunks to avoid consuming too much memory. You can set the
`batch_size` to control the size/space trade-off during the calculations.

> #### Example
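A sketch of a query, assuming a pipeline with a vectors table such as `en_core_web_md`:

```python
import numpy
import spacy

nlp = spacy.load("en_core_web_md")
queries = numpy.asarray([nlp.vocab["dog"].vector], dtype="f")
keys, best_rows, scores = nlp.vocab.vectors.most_similar(queries, n=5, batch_size=1024)
print([nlp.vocab.strings[int(key)] for key in keys[0]])
```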
@@ -29,7 +29,7 @@ Create the vocabulary.

| `oov_prob` | The default OOV probability. Defaults to `-20.0`. ~~float~~ |
| `vectors_name` <Tag variant="new">2.2</Tag> | A name to identify the vectors table. ~~str~~ |
| `writing_system` | A dictionary describing the language's writing system. Typically provided by [`Language.Defaults`](/api/language#defaults). ~~Dict[str, Any]~~ |
- | `get_noun_chunks` | A function that yields base noun phrases, used for [`Doc.noun_chunks`](/ap/doc#noun_chunks). ~~Optional[Callable[[Union[Doc, Span], Iterator[Span]]]]~~ |
+ | `get_noun_chunks` | A function that yields base noun phrases used for [`Doc.noun_chunks`](/ap/doc#noun_chunks). ~~Optional[Callable[[Union[Doc, Span], Iterator[Span]]]]~~ |

## Vocab.\_\_len\_\_ {#len tag="method"}
@@ -150,7 +150,7 @@ rows, we would discard the vectors for "feline" and "reclined". These words

would then be remapped to the closest remaining vector – so "feline" would have
the same vector as "cat", and "reclined" would have the same vector as "sat".
The similarities are judged by cosine. The original vectors may be large, so the
- cosines are calculated in minibatches, to reduce memory usage.
+ cosines are calculated in minibatches to reduce memory usage.

> #### Example
>
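A sketch of pruning, again assuming a pipeline with a vectors table such as `en_core_web_md`:

```python
import spacy

nlp = spacy.load("en_core_web_md")
remapped = nlp.vocab.prune_vectors(10000, batch_size=1024)
print(len(nlp.vocab.vectors))      # at most 10000 rows remain
# Each removed word maps to the kept word it now shares a vector with.
print(list(remapped.items())[:3])
```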
@@ -170,7 +170,7 @@ cosines are calculated in minibatches, to reduce memory usage.

Retrieve a vector for a word in the vocabulary. Words can be looked up by string
or hash value. If no vectors data is loaded, a `ValueError` is raised. If `minn`
is defined, then the resulting vector uses [FastText](https://fasttext.cc/)'s
- subword features by average over ngrams of `orth` (introduced in spaCy `v2.1`).
+ subword features by average over n-grams of `orth` (introduced in spaCy `v2.1`).

> #### Example
>
@@ -182,13 +182,13 @@ subword features by average over ngrams of `orth` (introduced in spaCy `v2.1`).

| Name | Description |
| ----------------------------------- | ---------------------------------------------------------------------------------------------------------------------- |
| `orth` | The hash value of a word, or its unicode string. ~~Union[int, str]~~ |
- | `minn` <Tag variant="new">2.1</Tag> | Minimum n-gram length used for FastText's ngram computation. Defaults to the length of `orth`. ~~int~~ |
- | `maxn` <Tag variant="new">2.1</Tag> | Maximum n-gram length used for FastText's ngram computation. Defaults to the length of `orth`. ~~int~~ |
+ | `minn` <Tag variant="new">2.1</Tag> | Minimum n-gram length used for FastText's n-gram computation. Defaults to the length of `orth`. ~~int~~ |
+ | `maxn` <Tag variant="new">2.1</Tag> | Maximum n-gram length used for FastText's n-gram computation. Defaults to the length of `orth`. ~~int~~ |
| **RETURNS** | A word vector. Size and shape are determined by the `Vocab.vectors` instance. ~~numpy.ndarray[ndim=1, dtype=float32]~~ |

## Vocab.set_vector {#set_vector tag="method" new="2"}

- Set a vector for a word in the vocabulary. Words can be referenced by by string
+ Set a vector for a word in the vocabulary. Words can be referenced by string
or hash value.

> #### Example
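A minimal sketch of setting and retrieving a vector on a blank vocab:

```python
import numpy
import spacy

nlp = spacy.blank("en")
vector = numpy.random.uniform(-1, 1, (300,)).astype("float32")
nlp.vocab.set_vector("cherry", vector)
retrieved = nlp.vocab.get_vector("cherry")
print(retrieved.shape)  # (300,)
```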
@@ -36,7 +36,7 @@ models such as [transformers](#transformers) is that word vectors model

context around them, a transformer model like BERT can't really help you. BERT
is designed to understand language **in context**, which isn't what you have. A
word vectors table will be a much better fit for your task. However, if you do
- have words in context — whole sentences or paragraphs of running text — word
+ have words in context – whole sentences or paragraphs of running text – word
vectors will only provide a very rough approximation of what the text is about.

Word vectors are also very computationally efficient, as they map a word to a