From 3360825e0042a535e0da08d045f6147425edb00a Mon Sep 17 00:00:00 2001
From: walterhenry <55140654+walterhenry@users.noreply.github.com>
Date: Mon, 28 Sep 2020 16:50:15 +0200
Subject: [PATCH] Proofreading

Another round of proofreading. All the API docs have been read through and
I've skimmed the Usage docs.
---
 website/docs/api/doc.md                       |  3 +--
 website/docs/api/pipeline-functions.md        |  2 +-
 website/docs/api/span.md                      |  2 +-
 website/docs/api/textcategorizer.md           |  8 ++++----
 website/docs/api/tok2vec.md                   |  4 ++--
 website/docs/api/token.md                     | 14 +++++++-------
 website/docs/api/tokenizer.md                 |  8 ++++----
 website/docs/api/top-level.md                 | 16 ++++++++--------
 website/docs/api/transformer.md               | 18 +++++++++---------
 website/docs/api/vectors.md                   |  6 +++---
 website/docs/api/vocab.md                     | 12 ++++++------
 website/docs/usage/embeddings-transformers.md |  2 +-
 12 files changed, 47 insertions(+), 48 deletions(-)

diff --git a/website/docs/api/doc.md b/website/docs/api/doc.md
index b4097ddb7..151b00a0a 100644
--- a/website/docs/api/doc.md
+++ b/website/docs/api/doc.md
@@ -444,8 +444,7 @@ invalidated, although they may accidentally continue to work.
Mark a span for merging. The `attrs` will be applied to the resulting token (if
they're context-dependent token attributes like `LEMMA` or `DEP`) or to the
underlying lexeme (if they're context-independent lexical attributes like
-`LOWER` or `IS_STOP`). Writable custom extension attributes can be provided as a
-dictionary mapping attribute name to values as the `"_"` key.
+`LOWER` or `IS_STOP`). Writable custom extension attributes can be provided using the `"_"` key and specifying a dictionary that maps attribute names to values.

> #### Example
>
diff --git a/website/docs/api/pipeline-functions.md b/website/docs/api/pipeline-functions.md
index 8bb52d0f9..0dc03a16a 100644
--- a/website/docs/api/pipeline-functions.md
+++ b/website/docs/api/pipeline-functions.md
@@ -26,7 +26,7 @@ Merge noun chunks into a single token. Also available via the string name

-Since noun chunks require part-of-speech tags and the dependency parser, make
+Since noun chunks require part-of-speech tags and the dependency parse, make
sure to add this component _after_ the `"tagger"` and `"parser"` components. By
default, `nlp.add_pipe` will add components to the end of the pipeline and after
all other components.
diff --git a/website/docs/api/span.md b/website/docs/api/span.md
index 242ceaed0..7fa1aaa38 100644
--- a/website/docs/api/span.md
+++ b/website/docs/api/span.md
@@ -187,7 +187,7 @@ the character indices don't map to a valid span.

| Name | Description |
| ------------------------------------ | ----------------------------------------------------------------------------------------- |
| `start` | The index of the first character of the span. ~~int~~ |
-| `end` | The index of the last character after the span. ~int~~ |
+| `end` | The index of the last character after the span. ~~int~~ |
| `label` | A label to attach to the span, e.g. for named entities. ~~Union[int, str]~~ |
| `kb_id` 2.2 | An ID from a knowledge base to capture the meaning of a named entity. ~~Union[int, str]~~ |
| `vector` | A meaning representation of the span. ~~numpy.ndarray[ndim=1, dtype=float32]~~ |
diff --git a/website/docs/api/textcategorizer.md b/website/docs/api/textcategorizer.md
index b68039094..be4052f46 100644
--- a/website/docs/api/textcategorizer.md
+++ b/website/docs/api/textcategorizer.md
@@ -153,7 +153,7 @@ setting up the label scheme based on the data.
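For context, the hunk above sits at the end of the `initialize` docs, which describe setting up the label scheme from the data. A minimal sketch of what that setup looks like when done explicitly, assuming a blank English pipeline (the labels are purely illustrative):

```python
import spacy

nlp = spacy.blank("en")
textcat = nlp.add_pipe("textcat")
# An illustrative binary label scheme; in practice the labels come from your data
for label in ("POSITIVE", "NEGATIVE"):
    textcat.add_label(label)
```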
## TextCategorizer.predict {#predict tag="method"} -Apply the component's model to a batch of [`Doc`](/api/doc) objects, without +Apply the component's model to a batch of [`Doc`](/api/doc) objects without modifying them. > #### Example @@ -170,7 +170,7 @@ modifying them. ## TextCategorizer.set_annotations {#set_annotations tag="method"} -Modify a batch of [`Doc`](/api/doc) objects, using pre-computed scores. +Modify a batch of [`Doc`](/api/doc) objects using pre-computed scores. > #### Example > @@ -213,7 +213,7 @@ Delegates to [`predict`](/api/textcategorizer#predict) and ## TextCategorizer.rehearse {#rehearse tag="method,experimental" new="3"} Perform a "rehearsal" update from a batch of data. Rehearsal updates teach the -current model to make predictions similar to an initial model, to try to address +current model to make predictions similar to an initial model to try to address the "catastrophic forgetting" problem. This feature is experimental. > #### Example @@ -286,7 +286,7 @@ Create an optimizer for the pipeline component. ## TextCategorizer.use_params {#use_params tag="method, contextmanager"} -Modify the pipe's model, to use the given parameter values. +Modify the pipe's model to use the given parameter values. > #### Example > diff --git a/website/docs/api/tok2vec.md b/website/docs/api/tok2vec.md index 5c7214edc..2633a7a1a 100644 --- a/website/docs/api/tok2vec.md +++ b/website/docs/api/tok2vec.md @@ -151,7 +151,7 @@ setting up the label scheme based on the data. ## Tok2Vec.predict {#predict tag="method"} -Apply the component's model to a batch of [`Doc`](/api/doc) objects, without +Apply the component's model to a batch of [`Doc`](/api/doc) objects without modifying them. > #### Example @@ -224,7 +224,7 @@ Create an optimizer for the pipeline component. ## Tok2Vec.use_params {#use_params tag="method, contextmanager"} -Modify the pipe's model, to use the given parameter values. At the end of the +Modify the pipe's model to use the given parameter values. At the end of the context, the original parameters are restored. > #### Example diff --git a/website/docs/api/token.md b/website/docs/api/token.md index 0860797aa..068a1d2d2 100644 --- a/website/docs/api/token.md +++ b/website/docs/api/token.md @@ -243,7 +243,7 @@ A sequence of the token's immediate syntactic children. ## Token.lefts {#lefts tag="property" model="parser"} -The leftward immediate children of the word, in the syntactic dependency parse. +The leftward immediate children of the word in the syntactic dependency parse. > #### Example > @@ -259,7 +259,7 @@ The leftward immediate children of the word, in the syntactic dependency parse. ## Token.rights {#rights tag="property" model="parser"} -The rightward immediate children of the word, in the syntactic dependency parse. +The rightward immediate children of the word in the syntactic dependency parse. > #### Example > @@ -275,7 +275,7 @@ The rightward immediate children of the word, in the syntactic dependency parse. ## Token.n_lefts {#n_lefts tag="property" model="parser"} -The number of leftward immediate children of the word, in the syntactic +The number of leftward immediate children of the word in the syntactic dependency parse. > #### Example @@ -291,7 +291,7 @@ dependency parse. ## Token.n_rights {#n_rights tag="property" model="parser"} -The number of rightward immediate children of the word, in the syntactic +The number of rightward immediate children of the word in the syntactic dependency parse. 
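A quick sketch of these counts, assuming an English pipeline such as `en_core_web_sm` is installed (the exact numbers depend on the parser's analysis):

```python
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("I like New York in Autumn.")
york = doc[3]
# "New" is typically attached as a leftward child of "York"
print(york.n_lefts, york.n_rights)
```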
> #### Example
@@ -422,8 +422,8 @@ The L2 norm of the token's vector representation.
| `norm_` | The token's norm, i.e. a normalized form of the token text. Can be set in the language's [tokenizer exceptions](/usage/adding-languages#tokenizer-exceptions). ~~str~~ |
| `lower` | Lowercase form of the token. ~~int~~ |
| `lower_` | Lowercase form of the token text. Equivalent to `Token.text.lower()`. ~~str~~ |
-| `shape` | Transform of the tokens's string, to show orthographic features. Alphabetic characters are replaced by `x` or `X`, and numeric characters are replaced by `d`, and sequences of the same character are truncated after length 4. For example,`"Xxxx"`or`"dd"`. ~~int~~ |
-| `shape_` | Transform of the tokens's string, to show orthographic features. Alphabetic characters are replaced by `x` or `X`, and numeric characters are replaced by `d`, and sequences of the same character are truncated after length 4. For example,`"Xxxx"`or`"dd"`. ~~str~~ |
+| `shape` | Transform of the token's string to show orthographic features. Alphabetic characters are replaced by `x` or `X`, numeric characters are replaced by `d`, and sequences of the same character are truncated after length 4. For example, `"Xxxx"` or `"dd"`. ~~int~~ |
+| `shape_` | Transform of the token's string to show orthographic features. Alphabetic characters are replaced by `x` or `X`, numeric characters are replaced by `d`, and sequences of the same character are truncated after length 4. For example, `"Xxxx"` or `"dd"`. ~~str~~ |
| `prefix` | Hash value of a length-N substring from the start of the token. Defaults to `N=1`. ~~int~~ |
| `prefix_` | A length-N substring from the start of the token. Defaults to `N=1`. ~~str~~ |
| `suffix` | Hash value of a length-N substring from the end of the token. Defaults to `N=3`. ~~int~~ |
@@ -451,7 +451,7 @@ The L2 norm of the token's vector representation.
| `tag` | Fine-grained part-of-speech. ~~int~~ |
| `tag_` | Fine-grained part-of-speech. ~~str~~ |
| `morph` 3 | Morphological analysis. ~~MorphAnalysis~~ |
-| `morph_` 3 | Morphological analysis in the Universal Dependencies [FEATS]https://universaldependencies.org/format.html#morphological-annotation format. ~~str~~ |
+| `morph_` 3 | Morphological analysis in the Universal Dependencies [FEATS](https://universaldependencies.org/format.html#morphological-annotation) format. ~~str~~ |
| `dep` | Syntactic dependency relation. ~~int~~ |
| `dep_` | Syntactic dependency relation. ~~str~~ |
| `lang` | Language of the parent document's vocabulary. ~~int~~ |
diff --git a/website/docs/api/tokenizer.md b/website/docs/api/tokenizer.md
index 0158c5589..8ea5a1f65 100644
--- a/website/docs/api/tokenizer.md
+++ b/website/docs/api/tokenizer.md
@@ -1,6 +1,6 @@
---
title: Tokenizer
-teaser: Segment text into words, punctuations marks etc.
+teaser: Segment text into words, punctuation marks, etc.
tag: class
source: spacy/tokenizer.pyx
---
@@ -15,14 +15,14 @@ source: spacy/tokenizer.pyx
Segment text, and create `Doc` objects with the discovered segment boundaries.
For a deeper understanding, see the docs on
[how spaCy's tokenizer works](/usage/linguistic-features#how-tokenizer-works).
-The tokenizer is typically created automatically when the a
+The tokenizer is typically created automatically when a
[`Language`](/api/language) subclass is initialized and it reads its settings
like punctuation and special case rules from the
[`Language.Defaults`](/api/language#defaults) provided by the language subclass.
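A minimal sketch of how such a special case rule comes into play, assuming a blank English pipeline (the `"gimme"` rule is purely illustrative):

```python
import spacy
from spacy.symbols import ORTH

nlp = spacy.blank("en")
# Add a special case rule: always split "gimme" into two tokens
nlp.tokenizer.add_special_case("gimme", [{ORTH: "gim"}, {ORTH: "me"}])
doc = nlp("gimme that")
assert [t.text for t in doc] == ["gim", "me", "that"]
```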
## Tokenizer.\_\_init\_\_ {#init tag="method"} -Create a `Tokenizer`, to create `Doc` objects given unicode text. For examples +Create a `Tokenizer` to create `Doc` objects given unicode text. For examples of how to construct a custom tokenizer with different tokenization rules, see the [usage documentation](https://spacy.io/usage/linguistic-features#native-tokenizers). @@ -87,7 +87,7 @@ Tokenize a stream of texts. | ------------ | ------------------------------------------------------------------------------------ | | `texts` | A sequence of unicode texts. ~~Iterable[str]~~ | | `batch_size` | The number of texts to accumulate in an internal buffer. Defaults to `1000`. ~~int~~ | -| **YIELDS** | The tokenized Doc objects, in order. ~~Doc~~ | +| **YIELDS** | The tokenized `Doc` objects, in order. ~~Doc~~ | ## Tokenizer.find_infix {#find_infix tag="method"} diff --git a/website/docs/api/top-level.md b/website/docs/api/top-level.md index f52c63f18..94260cacb 100644 --- a/website/docs/api/top-level.md +++ b/website/docs/api/top-level.md @@ -196,7 +196,7 @@ browser. Will run a simple web server. | `page` | Render markup as full HTML page. Defaults to `True`. ~~bool~~ | | `minify` | Minify HTML markup. Defaults to `False`. ~~bool~~ | | `options` | [Visualizer-specific options](#displacy_options), e.g. colors. ~~Dict[str, Any]~~ | -| `manual` | Don't parse `Doc` and instead, expect a dict or list of dicts. [See here](/usage/visualizers#manual-usage) for formats and examples. Defaults to `False`. ~~bool~~ | +| `manual` | Don't parse `Doc` and instead expect a dict or list of dicts. [See here](/usage/visualizers#manual-usage) for formats and examples. Defaults to `False`. ~~bool~~ | | `port` | Port to serve visualization. Defaults to `5000`. ~~int~~ | | `host` | Host to serve visualization. Defaults to `"0.0.0.0"`. ~~str~~ | @@ -221,7 +221,7 @@ Render a dependency parse tree or named entity visualization. | `page` | Render markup as full HTML page. Defaults to `True`. ~~bool~~ | | `minify` | Minify HTML markup. Defaults to `False`. ~~bool~~ | | `options` | [Visualizer-specific options](#displacy_options), e.g. colors. ~~Dict[str, Any]~~ | -| `manual` | Don't parse `Doc` and instead, expect a dict or list of dicts. [See here](/usage/visualizers#manual-usage) for formats and examples. Defaults to `False`. ~~bool~~ | +| `manual` | Don't parse `Doc` and instead expect a dict or list of dicts. [See here](/usage/visualizers#manual-usage) for formats and examples. Defaults to `False`. ~~bool~~ | | `jupyter` | Explicitly enable or disable "[Jupyter](http://jupyter.org/) mode" to return markup ready to be rendered in a notebook. Detected automatically if `None` (default). ~~Optional[bool]~~ | | **RETURNS** | The rendered HTML markup. ~~str~~ | @@ -242,7 +242,7 @@ If a setting is not present in the options, the default value will be used. | Name | Description | | ------------------------------------------ | -------------------------------------------------------------------------------------------------------------------------------------------- | | `fine_grained` | Use fine-grained part-of-speech tags (`Token.tag_`) instead of coarse-grained tags (`Token.pos_`). Defaults to `False`. ~~bool~~ | -| `add_lemma` 2.2.4 | Print the lemma's in a separate row below the token texts. Defaults to `False`. ~~bool~~ | +| `add_lemma` 2.2.4 | Print the lemmas in a separate row below the token texts. Defaults to `False`. ~~bool~~ | | `collapse_punct` | Attach punctuation to tokens. 
Can make the parse more readable, as it prevents long arcs to attach punctuation. Defaults to `True`. ~~bool~~ | | `collapse_phrases` | Merge noun phrases into one token. Defaults to `False`. ~~bool~~ | | `compact` | "Compact mode" with square arrows that takes up less space. Defaults to `False`. ~~bool~~ | @@ -611,7 +611,7 @@ sequences in the batch. Encode labelled spans into per-token tags, using the [BILUO scheme](/usage/linguistic-features#accessing-ner) (Begin, In, Last, Unit, -Out). Returns a list of strings, describing the tags. Each tag string will be of +Out). Returns a list of strings, describing the tags. Each tag string will be in the form of either `""`, `"O"` or `"{action}-{label}"`, where action is one of `"B"`, `"I"`, `"L"`, `"U"`. The string `"-"` is used where the entity offsets don't align with the tokenization in the `Doc` object. The training algorithm @@ -716,7 +716,7 @@ decorator. ### util.lang_class_is_loaded {#util.lang_class_is_loaded tag="function" new="2.1"} Check whether a `Language` subclass is already loaded. `Language` subclasses are -loaded lazily, to avoid expensive setup code associated with the language data. +loaded lazily to avoid expensive setup code associated with the language data. > #### Example > @@ -904,7 +904,7 @@ Compile a sequence of prefix rules into a regex object. | Name | Description | | ----------- | ------------------------------------------------------------------------------------------------------------------------------------------- | | `entries` | The prefix rules, e.g. [`lang.punctuation.TOKENIZER_PREFIXES`](%%GITHUB_SPACY/spacy/lang/punctuation.py). ~~Iterable[Union[str, Pattern]]~~ | -| **RETURNS** | The regex object. to be used for [`Tokenizer.prefix_search`](/api/tokenizer#attributes). ~~Pattern~~ | +| **RETURNS** | The regex object to be used for [`Tokenizer.prefix_search`](/api/tokenizer#attributes). ~~Pattern~~ | ### util.compile_suffix_regex {#util.compile_suffix_regex tag="function"} @@ -921,7 +921,7 @@ Compile a sequence of suffix rules into a regex object. | Name | Description | | ----------- | ------------------------------------------------------------------------------------------------------------------------------------------- | | `entries` | The suffix rules, e.g. [`lang.punctuation.TOKENIZER_SUFFIXES`](%%GITHUB_SPACY/spacy/lang/punctuation.py). ~~Iterable[Union[str, Pattern]]~~ | -| **RETURNS** | The regex object. to be used for [`Tokenizer.suffix_search`](/api/tokenizer#attributes). ~~Pattern~~ | +| **RETURNS** | The regex object to be used for [`Tokenizer.suffix_search`](/api/tokenizer#attributes). ~~Pattern~~ | ### util.compile_infix_regex {#util.compile_infix_regex tag="function"} @@ -938,7 +938,7 @@ Compile a sequence of infix rules into a regex object. | Name | Description | | ----------- | ----------------------------------------------------------------------------------------------------------------------------------------- | | `entries` | The infix rules, e.g. [`lang.punctuation.TOKENIZER_INFIXES`](%%GITHUB_SPACY/spacy/lang/punctuation.py). ~~Iterable[Union[str, Pattern]]~~ | -| **RETURNS** | The regex object. to be used for [`Tokenizer.infix_finditer`](/api/tokenizer#attributes). ~~Pattern~~ | +| **RETURNS** | The regex object to be used for [`Tokenizer.infix_finditer`](/api/tokenizer#attributes). 
~~Pattern~~ | ### util.minibatch {#util.minibatch tag="function" new="2"} diff --git a/website/docs/api/transformer.md b/website/docs/api/transformer.md index d5bcef229..957ce69a4 100644 --- a/website/docs/api/transformer.md +++ b/website/docs/api/transformer.md @@ -186,7 +186,7 @@ setting up the label scheme based on the data. ## Transformer.predict {#predict tag="method"} -Apply the component's model to a batch of [`Doc`](/api/doc) objects, without +Apply the component's model to a batch of [`Doc`](/api/doc) objects without modifying them. > #### Example @@ -203,7 +203,7 @@ modifying them. ## Transformer.set_annotations {#set_annotations tag="method"} -Assign the extracted features to the Doc objects. By default, the +Assign the extracted features to the `Doc` objects. By default, the [`TransformerData`](/api/transformer#transformerdata) object is written to the [`Doc._.trf_data`](#custom-attributes) attribute. Your `set_extra_annotations` callback is then called, if provided. @@ -272,7 +272,7 @@ Create an optimizer for the pipeline component. ## Transformer.use_params {#use_params tag="method, contextmanager"} -Modify the pipe's model, to use the given parameter values. At the end of the +Modify the pipe's model to use the given parameter values. At the end of the context, the original parameters are restored. > #### Example @@ -388,8 +388,8 @@ by this class. Instances of this class are typically assigned to the | Name | Description | | --------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | -| `tokens` | A slice of the tokens data produced by the tokenizer. This may have several fields, including the token IDs, the texts, and the attention mask. See the [`transformers.BatchEncoding`](https://huggingface.co/transformers/main_classes/tokenizer.html#transformers.BatchEncoding) object for details. ~~dict~~ | -| `tensors` | The activations for the Doc from the transformer. Usually the last tensor that is 3-dimensional will be the most important, as that will provide the final hidden state. Generally activations that are 2-dimensional will be attention weights. Details of this variable will differ depending on the underlying transformer model. ~~List[FloatsXd]~~ | +| `tokens` | A slice of the tokens data produced by the tokenizer. This may have several fields, including the token IDs, the texts and the attention mask. See the [`transformers.BatchEncoding`](https://huggingface.co/transformers/main_classes/tokenizer.html#transformers.BatchEncoding) object for details. ~~dict~~ | +| `tensors` | The activations for the `Doc` from the transformer. Usually the last tensor that is 3-dimensional will be the most important, as that will provide the final hidden state. Generally activations that are 2-dimensional will be attention weights. Details of this variable will differ depending on the underlying transformer model. ~~List[FloatsXd]~~ | | `align` | Alignment from the `Doc`'s tokenization to the wordpieces. This is a ragged array, where `align.lengths[i]` indicates the number of wordpiece tokens that token `i` aligns against. The actual indices are provided at `align[i].dataXd`. ~~Ragged~~ | | `width` | The width of the last hidden layer. 
~~int~~ |
@@ -409,7 +409,7 @@ objects to associate the outputs to each [`Doc`](/api/doc) in the batch.

| Name | Description |
| ---------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ |
-| `spans` | The batch of input spans. The outer list refers to the Doc objects in the batch, and the inner list are the spans for that `Doc`. Note that spans are allowed to overlap or exclude tokens, but each Span can only refer to one `Doc` (by definition). This means that within a `Doc`, the regions of the output tensors that correspond to each Span may overlap or have gaps, but for each `Doc`, there is a non-overlapping contiguous slice of the outputs. ~~List[List[Span]]~~ |
+| `spans` | The batch of input spans. The outer list refers to the `Doc` objects in the batch, and the inner list contains the spans for that `Doc`. Note that spans are allowed to overlap or exclude tokens, but each `Span` can only refer to one `Doc` (by definition). This means that within a `Doc`, the regions of the output tensors that correspond to each `Span` may overlap or have gaps, but for each `Doc`, there is a non-overlapping contiguous slice of the outputs. ~~List[List[Span]]~~ |
| `tokens` | The output of the tokenizer. ~~transformers.BatchEncoding~~ |
| `tensors` | The output of the transformer model. ~~List[torch.Tensor]~~ |
| `align` | Alignment from the spaCy tokenization to the wordpieces. This is a ragged array, where `align.lengths[i]` indicates the number of wordpiece tokens that token `i` aligns against. The actual indices are provided at `align[i].dataXd`. ~~Ragged~~ |
@@ -439,10 +439,10 @@ Split a `TransformerData` object that represents a batch into a list with one

## Span getters {#span_getters source="github.com/explosion/spacy-transformers/blob/master/spacy_transformers/span_getters.py"}

Span getters are functions that take a batch of [`Doc`](/api/doc) objects and
-return a lists of [`Span`](/api/span) objects for each doc, to be processed by
-the transformer. This is used to manage long documents, by cutting them into
+return a list of [`Span`](/api/span) objects for each doc to be processed by
+the transformer. This is used to manage long documents by cutting them into
smaller sequences before running the transformer. The spans are allowed to
-overlap, and you can also omit sections of the Doc if they are not relevant.
+overlap, and you can also omit sections of the `Doc` if they are not relevant.

Span getters can be referenced in the
`[components.transformer.model.get_spans]` block of the config to customize the
sequences processed by the transformer. You
diff --git a/website/docs/api/vectors.md b/website/docs/api/vectors.md
index 7e97b4ca3..ba2d5ab42 100644
--- a/website/docs/api/vectors.md
+++ b/website/docs/api/vectors.md
@@ -290,7 +290,7 @@ If a table is full, it can be resized using

## Vectors.n_keys {#n_keys tag="property"}

Get the number of keys in the table. Note that this is the number of _all_ keys,
-not just unique vectors. If several keys are mapped are mapped to the same
+not just unique vectors. If several keys are mapped to the same
vectors, they will be counted individually.
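To make that distinction concrete, a small standalone sketch (the keys and vector values here are arbitrary):

```python
import numpy
from spacy.vectors import Vectors

vectors = Vectors(shape=(2, 4))
vectors.add("cat", vector=numpy.ones((4,), dtype="float32"))
# Map a second key to the same row, so two keys share one vector
vectors.add("feline", row=0)
assert vectors.n_keys == 2  # counted per key, not per unique vector
```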
> #### Example
>

## Vectors.most_similar {#most_similar tag="method"}

-For each of the given vectors, find the `n` most similar entries to it, by
+For each of the given vectors, find the `n` most similar entries to it by
cosine. Queries are by vector. Results are returned as a
`(keys, best_rows, scores)` tuple. If `queries` is large, the calculations are
-performed in chunks, to avoid consuming too much memory. You can set the
+performed in chunks to avoid consuming too much memory. You can set the
`batch_size` to control the size/space trade-off during the calculations.

> #### Example
diff --git a/website/docs/api/vocab.md b/website/docs/api/vocab.md
index 71a678cb3..a2ca63002 100644
--- a/website/docs/api/vocab.md
+++ b/website/docs/api/vocab.md
@@ -29,7 +29,7 @@ Create the vocabulary.

| `oov_prob` | The default OOV probability. Defaults to `-20.0`. ~~float~~ |
| `vectors_name` 2.2 | A name to identify the vectors table. ~~str~~ |
| `writing_system` | A dictionary describing the language's writing system. Typically provided by [`Language.Defaults`](/api/language#defaults). ~~Dict[str, Any]~~ |
-| `get_noun_chunks` | A function that yields base noun phrases, used for [`Doc.noun_chunks`](/ap/doc#noun_chunks). ~~Optional[Callable[[Union[Doc, Span], Iterator[Span]]]]~~ |
+| `get_noun_chunks` | A function that yields base noun phrases used for [`Doc.noun_chunks`](/api/doc#noun_chunks). ~~Optional[Callable[[Union[Doc, Span], Iterator[Span]]]]~~ |

## Vocab.\_\_len\_\_ {#len tag="method"}

@@ -150,7 +150,7 @@ rows, we would discard the vectors for "feline" and "reclined". These words
would then be remapped to the closest remaining vector – so "feline" would have
the same vector as "cat", and "reclined" would have the same vector as "sat".
The similarities are judged by cosine. The original vectors may be large, so the
-cosines are calculated in minibatches, to reduce memory usage.
+cosines are calculated in minibatches to reduce memory usage.

> #### Example
>
@@ -170,7 +170,7 @@ Retrieve a vector for a word in the vocabulary. Words can be looked up by string
or hash value. If no vectors data is loaded, a `ValueError` is raised. If `minn`
is defined, then the resulting vector uses [FastText](https://fasttext.cc/)'s
-subword features by average over ngrams of `orth` (introduced in spaCy `v2.1`).
+subword features by averaging over n-grams of `orth` (introduced in spaCy `v2.1`).

> #### Example
>
@@ -182,13 +182,13 @@ subword features by average over n-grams of `orth` (introduced in spaCy `v2.1`).

| Name | Description |
| ----------------------------------- | ---------------------------------------------------------------------------------------------------------------------- |
| `orth` | The hash value of a word, or its unicode string. ~~Union[int, str]~~ |
-| `minn` 2.1 | Minimum n-gram length used for FastText's ngram computation. Defaults to the length of `orth`. ~~int~~ |
-| `maxn` 2.1 | Maximum n-gram length used for FastText's ngram computation. Defaults to the length of `orth`. ~~int~~ |
+| `minn` 2.1 | Minimum n-gram length used for FastText's n-gram computation. Defaults to the length of `orth`. ~~int~~ |
+| `maxn` 2.1 | Maximum n-gram length used for FastText's n-gram computation. Defaults to the length of `orth`. ~~int~~ |
| **RETURNS** | A word vector. Size and shape are determined by the `Vocab.vectors` instance.
~~numpy.ndarray[ndim=1, dtype=float32]~~ | ## Vocab.set_vector {#set_vector tag="method" new="2"} -Set a vector for a word in the vocabulary. Words can be referenced by by string +Set a vector for a word in the vocabulary. Words can be referenced by string or hash value. > #### Example diff --git a/website/docs/usage/embeddings-transformers.md b/website/docs/usage/embeddings-transformers.md index 8dd104ead..c61d7e144 100644 --- a/website/docs/usage/embeddings-transformers.md +++ b/website/docs/usage/embeddings-transformers.md @@ -36,7 +36,7 @@ models such as [transformers](#transformers) is that word vectors model context around them, a transformer model like BERT can't really help you. BERT is designed to understand language **in context**, which isn't what you have. A word vectors table will be a much better fit for your task. However, if you do -have words in context — whole sentences or paragraphs of running text — word +have words in context – whole sentences or paragraphs of running text – word vectors will only provide a very rough approximation of what the text is about. Word vectors are also very computationally efficient, as they map a word to a