Proofreading

Another round of proofreading. All the API docs have been read through, and I've skimmed the Usage docs.
walterhenry 2020-09-28 16:50:15 +02:00
parent 3dd5f409ec
commit 3360825e00
12 changed files with 47 additions and 48 deletions


@ -444,8 +444,7 @@ invalidated, although they may accidentally continue to work.
Mark a span for merging. The `attrs` will be applied to the resulting token (if
they're context-dependent token attributes like `LEMMA` or `DEP`) or to the
underlying lexeme (if they're context-independent lexical attributes like
`LOWER` or `IS_STOP`). Writable custom extension attributes can be provided as a
dictionary mapping attribute name to values as the `"_"` key.
`LOWER` or `IS_STOP`). Writable custom extension attributes can be provided using the `"_"` key and specifying a dictionary that maps attribute name to values.
> #### Example
>
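
For reference, the merge call described in this hunk looks roughly like the following sketch (not part of the commit; it assumes the `en_core_web_sm` package is installed and registers a hypothetical `is_city` extension):

```python
import spacy
from spacy.tokens import Token

# Hypothetical custom extension, registered so it can be set via the "_" key
Token.set_extension("is_city", default=False)

nlp = spacy.load("en_core_web_sm")
doc = nlp("I live in New York")
with doc.retokenize() as retokenizer:
    # Token attributes go at the top level, custom extensions under "_"
    retokenizer.merge(doc[3:5], attrs={"LEMMA": "New York", "_": {"is_city": True}})

print([(t.text, t._.is_city) for t in doc])  # [('I', False), ..., ('New York', True)]
```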


@ -26,7 +26,7 @@ Merge noun chunks into a single token. Also available via the string name
<Infobox variant="warning">
Since noun chunks require part-of-speech tags and the dependency parser, make
Since noun chunks require part-of-speech tags and the dependency parse, make
sure to add this component _after_ the `"tagger"` and `"parser"` components. By
default, `nlp.add_pipe` will add components to the end of the pipeline and after
all other components.
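
A quick sketch of the ordering point made here (assuming a pretrained English pipeline such as `en_core_web_sm`, which already contains the tagger and parser):

```python
import spacy

nlp = spacy.load("en_core_web_sm")
nlp.add_pipe("merge_noun_chunks")  # appended to the end, so it runs after tagger and parser
doc = nlp("Autonomous cars shift insurance liability toward manufacturers")
print([t.text for t in doc])
# ['Autonomous cars', 'shift', 'insurance liability', 'toward', 'manufacturers']
```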


@ -187,7 +187,7 @@ the character indices don't map to a valid span.
| Name | Description |
| ------------------------------------ | ----------------------------------------------------------------------------------------- |
| `start` | The index of the first character of the span. ~~int~~ |
| `end` | The index of the last character after the span. ~int~~ |
| `end` | The index of the last character after the span. ~~int~~ |
| `label` | A label to attach to the span, e.g. for named entities. ~~Union[int, str]~~ |
| `kb_id` <Tag variant="new">2.2</Tag> | An ID from a knowledge base to capture the meaning of a named entity. ~~Union[int, str]~~ |
| `vector` | A meaning representation of the span. ~~numpy.ndarray[ndim=1, dtype=float32]~~ |
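
These parameters belong to character-based span creation (e.g. `Doc.char_span`); a minimal sketch, assuming `en_core_web_sm` is installed:

```python
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("I like New York")
span = doc.char_span(7, 15, label="GPE")  # returns None if the offsets don't map to token boundaries
if span is not None:
    print(span.text, span.label_)         # New York GPE
```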


@ -153,7 +153,7 @@ setting up the label scheme based on the data.
## TextCategorizer.predict {#predict tag="method"}
Apply the component's model to a batch of [`Doc`](/api/doc) objects, without
Apply the component's model to a batch of [`Doc`](/api/doc) objects without
modifying them.
> #### Example
@ -170,7 +170,7 @@ modifying them.
## TextCategorizer.set_annotations {#set_annotations tag="method"}
Modify a batch of [`Doc`](/api/doc) objects, using pre-computed scores.
Modify a batch of [`Doc`](/api/doc) objects using pre-computed scores.
> #### Example
>
@ -213,7 +213,7 @@ Delegates to [`predict`](/api/textcategorizer#predict) and
## TextCategorizer.rehearse {#rehearse tag="method,experimental" new="3"}
Perform a "rehearsal" update from a batch of data. Rehearsal updates teach the
current model to make predictions similar to an initial model, to try to address
current model to make predictions similar to an initial model to try to address
the "catastrophic forgetting" problem. This feature is experimental.
> #### Example
@ -286,7 +286,7 @@ Create an optimizer for the pipeline component.
## TextCategorizer.use_params {#use_params tag="method, contextmanager"}
Modify the pipe's model, to use the given parameter values.
Modify the pipe's model to use the given parameter values.
> #### Example
>
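
Taken together, the `predict`/`set_annotations` split described in these hunks works roughly like this (a sketch, assuming `nlp` is a pipeline with a trained `"textcat"` component):

```python
textcat = nlp.get_pipe("textcat")
docs = [nlp.make_doc("This is terrific"), nlp.make_doc("This is terrible")]
scores = textcat.predict(docs)         # scores only, the docs are not modified
textcat.set_annotations(docs, scores)  # writes the scores to doc.cats
print(docs[0].cats)
```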


@ -151,7 +151,7 @@ setting up the label scheme based on the data.
## Tok2Vec.predict {#predict tag="method"}
Apply the component's model to a batch of [`Doc`](/api/doc) objects, without
Apply the component's model to a batch of [`Doc`](/api/doc) objects without
modifying them.
> #### Example
@ -224,7 +224,7 @@ Create an optimizer for the pipeline component.
## Tok2Vec.use_params {#use_params tag="method, contextmanager"}
Modify the pipe's model, to use the given parameter values. At the end of the
Modify the pipe's model to use the given parameter values. At the end of the
context, the original parameters are restored.
> #### Example
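
The context-manager behavior reads roughly as follows (a sketch; it assumes a pipeline with a `"tok2vec"` component and an `optimizer` whose averaged parameters should be applied temporarily):

```python
tok2vec = nlp.get_pipe("tok2vec")
with tok2vec.use_params(optimizer.averages):
    tok2vec.to_disk("/tmp/tok2vec_averaged")  # saved with the averaged parameters
# the original parameters are restored once the context exits
```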


@ -243,7 +243,7 @@ A sequence of the token's immediate syntactic children.
## Token.lefts {#lefts tag="property" model="parser"}
The leftward immediate children of the word, in the syntactic dependency parse.
The leftward immediate children of the word in the syntactic dependency parse.
> #### Example
>
@ -259,7 +259,7 @@ The leftward immediate children of the word, in the syntactic dependency parse.
## Token.rights {#rights tag="property" model="parser"}
The rightward immediate children of the word, in the syntactic dependency parse.
The rightward immediate children of the word in the syntactic dependency parse.
> #### Example
>
@ -275,7 +275,7 @@ The rightward immediate children of the word, in the syntactic dependency parse.
## Token.n_lefts {#n_lefts tag="property" model="parser"}
The number of leftward immediate children of the word, in the syntactic
The number of leftward immediate children of the word in the syntactic
dependency parse.
> #### Example
@ -291,7 +291,7 @@ dependency parse.
## Token.n_rights {#n_rights tag="property" model="parser"}
The number of rightward immediate children of the word, in the syntactic
The number of rightward immediate children of the word in the syntactic
dependency parse.
> #### Example
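
For reference, the four properties touched in this file look like this in use (sketch; the exact children depend on the parser, so `en_core_web_sm` is assumed):

```python
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("I like New York in Autumn.")
york = doc[3]
print([t.text for t in york.lefts], york.n_lefts)    # ['New'] 1
print([t.text for t in york.rights], york.n_rights)  # ['in'] 1
```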
@ -422,8 +422,8 @@ The L2 norm of the token's vector representation.
| `norm_` | The token's norm, i.e. a normalized form of the token text. Can be set in the language's [tokenizer exceptions](/usage/adding-languages#tokenizer-exceptions). ~~str~~ |
| `lower` | Lowercase form of the token. ~~int~~ |
| `lower_` | Lowercase form of the token text. Equivalent to `Token.text.lower()`. ~~str~~ |
| `shape` | Transform of the tokens's string, to show orthographic features. Alphabetic characters are replaced by `x` or `X`, and numeric characters are replaced by `d`, and sequences of the same character are truncated after length 4. For example,`"Xxxx"`or`"dd"`. ~~int~~ |
| `shape_` | Transform of the tokens's string, to show orthographic features. Alphabetic characters are replaced by `x` or `X`, and numeric characters are replaced by `d`, and sequences of the same character are truncated after length 4. For example,`"Xxxx"`or`"dd"`. ~~str~~ |
| `shape` | Transform of the tokens's string to show orthographic features. Alphabetic characters are replaced by `x` or `X`, and numeric characters are replaced by `d`, and sequences of the same character are truncated after length 4. For example,`"Xxxx"`or`"dd"`. ~~int~~ |
| `shape_` | Transform of the tokens's string to show orthographic features. Alphabetic characters are replaced by `x` or `X`, and numeric characters are replaced by `d`, and sequences of the same character are truncated after length 4. For example,`"Xxxx"`or`"dd"`. ~~str~~ |
| `prefix` | Hash value of a length-N substring from the start of the token. Defaults to `N=1`. ~~int~~ |
| `prefix_` | A length-N substring from the start of the token. Defaults to `N=1`. ~~str~~ |
| `suffix` | Hash value of a length-N substring from the end of the token. Defaults to `N=3`. ~~int~~ |
@ -451,7 +451,7 @@ The L2 norm of the token's vector representation.
| `tag` | Fine-grained part-of-speech. ~~int~~ |
| `tag_` | Fine-grained part-of-speech. ~~str~~ |
| `morph` <Tag variant="new">3</Tag> | Morphological analysis. ~~MorphAnalysis~~ |
| `morph_` <Tag variant="new">3</Tag> | Morphological analysis in the Universal Dependencies [FEATS]https://universaldependencies.org/format.html#morphological-annotation format. ~~str~~ |
| `morph_` <Tag variant="new">3</Tag> | Morphological analysis in the Universal Dependencies [FEATS](https://universaldependencies.org/format.html#morphological-annotation) format. ~~str~~ |
| `dep` | Syntactic dependency relation. ~~int~~ |
| `dep_` | Syntactic dependency relation. ~~str~~ |
| `lang` | Language of the parent document's vocabulary. ~~int~~ |
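
A quick look at a few of the attributes listed above (sketch; exact values depend on the loaded pipeline, assumed here to be `en_core_web_sm`):

```python
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Apple is looking at buying U.K. startup")
apple = doc[0]
print(apple.shape_, apple.lower_, apple.dep_, apple.morph)  # e.g. Xxxxx apple nsubj Number=Sing
```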


@ -1,6 +1,6 @@
---
title: Tokenizer
teaser: Segment text into words, punctuations marks etc.
teaser: Segment text into words, punctuations marks, etc.
tag: class
source: spacy/tokenizer.pyx
---
@ -15,14 +15,14 @@ source: spacy/tokenizer.pyx
Segment text, and create `Doc` objects with the discovered segment boundaries.
For a deeper understanding, see the docs on
[how spaCy's tokenizer works](/usage/linguistic-features#how-tokenizer-works).
The tokenizer is typically created automatically when the a
The tokenizer is typically created automatically when a
[`Language`](/api/language) subclass is initialized and it reads its settings
like punctuation and special case rules from the
[`Language.Defaults`](/api/language#defaults) provided by the language subclass.
## Tokenizer.\_\_init\_\_ {#init tag="method"}
Create a `Tokenizer`, to create `Doc` objects given unicode text. For examples
Create a `Tokenizer` to create `Doc` objects given unicode text. For examples
of how to construct a custom tokenizer with different tokenization rules, see
the
[usage documentation](https://spacy.io/usage/linguistic-features#native-tokenizers).
@ -87,7 +87,7 @@ Tokenize a stream of texts.
| ------------ | ------------------------------------------------------------------------------------ |
| `texts` | A sequence of unicode texts. ~~Iterable[str]~~ |
| `batch_size` | The number of texts to accumulate in an internal buffer. Defaults to `1000`. ~~int~~ |
| **YIELDS** | The tokenized Doc objects, in order. ~~Doc~~ |
| **YIELDS** | The tokenized `Doc` objects, in order. ~~Doc~~ |
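
The streaming behavior summarized in this table, as a sketch (a blank English pipeline is enough, since only the tokenizer is used):

```python
import spacy

nlp = spacy.blank("en")
texts = ["One document.", "...", "Lots of documents"]
for doc in nlp.tokenizer.pipe(texts, batch_size=50):
    print(len(doc))
```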
## Tokenizer.find_infix {#find_infix tag="method"}


@ -196,7 +196,7 @@ browser. Will run a simple web server.
| `page` | Render markup as full HTML page. Defaults to `True`. ~~bool~~ |
| `minify` | Minify HTML markup. Defaults to `False`. ~~bool~~ |
| `options` | [Visualizer-specific options](#displacy_options), e.g. colors. ~~Dict[str, Any]~~ |
| `manual` | Don't parse `Doc` and instead, expect a dict or list of dicts. [See here](/usage/visualizers#manual-usage) for formats and examples. Defaults to `False`. ~~bool~~ |
| `manual` | Don't parse `Doc` and instead expect a dict or list of dicts. [See here](/usage/visualizers#manual-usage) for formats and examples. Defaults to `False`. ~~bool~~ |
| `port` | Port to serve visualization. Defaults to `5000`. ~~int~~ |
| `host` | Host to serve visualization. Defaults to `"0.0.0.0"`. ~~str~~ |
@ -221,7 +221,7 @@ Render a dependency parse tree or named entity visualization.
| `page` | Render markup as full HTML page. Defaults to `True`. ~~bool~~ |
| `minify` | Minify HTML markup. Defaults to `False`. ~~bool~~ |
| `options` | [Visualizer-specific options](#displacy_options), e.g. colors. ~~Dict[str, Any]~~ |
| `manual` | Don't parse `Doc` and instead, expect a dict or list of dicts. [See here](/usage/visualizers#manual-usage) for formats and examples. Defaults to `False`. ~~bool~~ |
| `manual` | Don't parse `Doc` and instead expect a dict or list of dicts. [See here](/usage/visualizers#manual-usage) for formats and examples. Defaults to `False`. ~~bool~~ |
| `jupyter` | Explicitly enable or disable "[Jupyter](http://jupyter.org/) mode" to return markup ready to be rendered in a notebook. Detected automatically if `None` (default). ~~Optional[bool]~~ |
| **RETURNS** | The rendered HTML markup. ~~str~~ |
@ -242,7 +242,7 @@ If a setting is not present in the options, the default value will be used.
| Name | Description |
| ------------------------------------------ | -------------------------------------------------------------------------------------------------------------------------------------------- |
| `fine_grained` | Use fine-grained part-of-speech tags (`Token.tag_`) instead of coarse-grained tags (`Token.pos_`). Defaults to `False`. ~~bool~~ |
| `add_lemma` <Tag variant="new">2.2.4</Tag> | Print the lemma's in a separate row below the token texts. Defaults to `False`. ~~bool~~ |
| `add_lemma` <Tag variant="new">2.2.4</Tag> | Print the lemmas in a separate row below the token texts. Defaults to `False`. ~~bool~~ |
| `collapse_punct` | Attach punctuation to tokens. Can make the parse more readable, as it prevents long arcs to attach punctuation. Defaults to `True`. ~~bool~~ |
| `collapse_phrases` | Merge noun phrases into one token. Defaults to `False`. ~~bool~~ |
| `compact` | "Compact mode" with square arrows that takes up less space. Defaults to `False`. ~~bool~~ |
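
For reference, these settings are passed through the `options` dict (sketch; assumes `en_core_web_sm` for the parse):

```python
import spacy
from spacy import displacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("This is a sentence.")
html = displacy.render(doc, style="dep", page=True,
                       options={"compact": True, "add_lemma": True})
```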
@ -611,7 +611,7 @@ sequences in the batch.
Encode labelled spans into per-token tags, using the
[BILUO scheme](/usage/linguistic-features#accessing-ner) (Begin, In, Last, Unit,
Out). Returns a list of strings, describing the tags. Each tag string will be of
Out). Returns a list of strings, describing the tags. Each tag string will be in
the form of either `""`, `"O"` or `"{action}-{label}"`, where action is one of
`"B"`, `"I"`, `"L"`, `"U"`. The string `"-"` is used where the entity offsets
don't align with the tokenization in the `Doc` object. The training algorithm
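
The encoding described above, as a sketch; note that the helper's import path depends on the spaCy version (`spacy.training.offsets_to_biluo_tags` in v3, previously `spacy.gold.biluo_tags_from_offsets`):

```python
import spacy
from spacy.training import offsets_to_biluo_tags

nlp = spacy.blank("en")
doc = nlp("I like London.")
print(offsets_to_biluo_tags(doc, [(7, 13, "LOC")]))  # ['O', 'O', 'U-LOC', 'O']
```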
@ -716,7 +716,7 @@ decorator.
### util.lang_class_is_loaded {#util.lang_class_is_loaded tag="function" new="2.1"}
Check whether a `Language` subclass is already loaded. `Language` subclasses are
loaded lazily, to avoid expensive setup code associated with the language data.
loaded lazily to avoid expensive setup code associated with the language data.
> #### Example
>
@ -904,7 +904,7 @@ Compile a sequence of prefix rules into a regex object.
| Name | Description |
| ----------- | ------------------------------------------------------------------------------------------------------------------------------------------- |
| `entries` | The prefix rules, e.g. [`lang.punctuation.TOKENIZER_PREFIXES`](%%GITHUB_SPACY/spacy/lang/punctuation.py). ~~Iterable[Union[str, Pattern]]~~ |
| **RETURNS** | The regex object. to be used for [`Tokenizer.prefix_search`](/api/tokenizer#attributes). ~~Pattern~~ |
| **RETURNS** | The regex object to be used for [`Tokenizer.prefix_search`](/api/tokenizer#attributes). ~~Pattern~~ |
### util.compile_suffix_regex {#util.compile_suffix_regex tag="function"}
@ -921,7 +921,7 @@ Compile a sequence of suffix rules into a regex object.
| Name | Description |
| ----------- | ------------------------------------------------------------------------------------------------------------------------------------------- |
| `entries` | The suffix rules, e.g. [`lang.punctuation.TOKENIZER_SUFFIXES`](%%GITHUB_SPACY/spacy/lang/punctuation.py). ~~Iterable[Union[str, Pattern]]~~ |
| **RETURNS** | The regex object. to be used for [`Tokenizer.suffix_search`](/api/tokenizer#attributes). ~~Pattern~~ |
| **RETURNS** | The regex object to be used for [`Tokenizer.suffix_search`](/api/tokenizer#attributes). ~~Pattern~~ |
### util.compile_infix_regex {#util.compile_infix_regex tag="function"}
@ -938,7 +938,7 @@ Compile a sequence of infix rules into a regex object.
| Name | Description |
| ----------- | ----------------------------------------------------------------------------------------------------------------------------------------- |
| `entries` | The infix rules, e.g. [`lang.punctuation.TOKENIZER_INFIXES`](%%GITHUB_SPACY/spacy/lang/punctuation.py). ~~Iterable[Union[str, Pattern]]~~ |
| **RETURNS** | The regex object. to be used for [`Tokenizer.infix_finditer`](/api/tokenizer#attributes). ~~Pattern~~ |
| **RETURNS** | The regex object to be used for [`Tokenizer.infix_finditer`](/api/tokenizer#attributes). ~~Pattern~~ |
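
The usual recipe for these three helpers, sketched for the suffix case (assumes a blank English pipeline; the prefix and infix variants work the same way):

```python
import spacy
from spacy.util import compile_suffix_regex

nlp = spacy.blank("en")
suffixes = list(nlp.Defaults.suffixes) + [r"-+$"]  # add a custom suffix rule
suffix_regex = compile_suffix_regex(suffixes)
nlp.tokenizer.suffix_search = suffix_regex.search
```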
### util.minibatch {#util.minibatch tag="function" new="2"}


@ -186,7 +186,7 @@ setting up the label scheme based on the data.
## Transformer.predict {#predict tag="method"}
Apply the component's model to a batch of [`Doc`](/api/doc) objects, without
Apply the component's model to a batch of [`Doc`](/api/doc) objects without
modifying them.
> #### Example
@ -203,7 +203,7 @@ modifying them.
## Transformer.set_annotations {#set_annotations tag="method"}
Assign the extracted features to the Doc objects. By default, the
Assign the extracted features to the `Doc` objects. By default, the
[`TransformerData`](/api/transformer#transformerdata) object is written to the
[`Doc._.trf_data`](#custom-attributes) attribute. Your `set_extra_annotations`
callback is then called, if provided.
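
In practice that means the data is available on the custom attribute once the pipeline has run (sketch; assumes `spacy-transformers` and a transformer-based pipeline such as `en_core_web_trf` are installed):

```python
import spacy

nlp = spacy.load("en_core_web_trf")
doc = nlp("Transformer output lands on a custom attribute.")
print(doc._.trf_data.tensors[-1].shape)  # final hidden state from the last tensor
```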
@ -272,7 +272,7 @@ Create an optimizer for the pipeline component.
## Transformer.use_params {#use_params tag="method, contextmanager"}
Modify the pipe's model, to use the given parameter values. At the end of the
Modify the pipe's model to use the given parameter values. At the end of the
context, the original parameters are restored.
> #### Example
@ -388,8 +388,8 @@ by this class. Instances of this class are typically assigned to the
| Name | Description |
| --------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `tokens` | A slice of the tokens data produced by the tokenizer. This may have several fields, including the token IDs, the texts, and the attention mask. See the [`transformers.BatchEncoding`](https://huggingface.co/transformers/main_classes/tokenizer.html#transformers.BatchEncoding) object for details. ~~dict~~ |
| `tensors` | The activations for the Doc from the transformer. Usually the last tensor that is 3-dimensional will be the most important, as that will provide the final hidden state. Generally activations that are 2-dimensional will be attention weights. Details of this variable will differ depending on the underlying transformer model. ~~List[FloatsXd]~~ |
| `tokens` | A slice of the tokens data produced by the tokenizer. This may have several fields, including the token IDs, the texts and the attention mask. See the [`transformers.BatchEncoding`](https://huggingface.co/transformers/main_classes/tokenizer.html#transformers.BatchEncoding) object for details. ~~dict~~ |
| `tensors` | The activations for the `Doc` from the transformer. Usually the last tensor that is 3-dimensional will be the most important, as that will provide the final hidden state. Generally activations that are 2-dimensional will be attention weights. Details of this variable will differ depending on the underlying transformer model. ~~List[FloatsXd]~~ |
| `align` | Alignment from the `Doc`'s tokenization to the wordpieces. This is a ragged array, where `align.lengths[i]` indicates the number of wordpiece tokens that token `i` aligns against. The actual indices are provided at `align[i].dataXd`. ~~Ragged~~ |
| `width` | The width of the last hidden layer. ~~int~~ |
@ -409,7 +409,7 @@ objects to associate the outputs to each [`Doc`](/api/doc) in the batch.
| Name | Description |
| ---------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ |
| `spans` | The batch of input spans. The outer list refers to the Doc objects in the batch, and the inner list are the spans for that `Doc`. Note that spans are allowed to overlap or exclude tokens, but each Span can only refer to one `Doc` (by definition). This means that within a `Doc`, the regions of the output tensors that correspond to each Span may overlap or have gaps, but for each `Doc`, there is a non-overlapping contiguous slice of the outputs. ~~List[List[Span]]~~ |
| `spans` | The batch of input spans. The outer list refers to the Doc objects in the batch, and the inner list are the spans for that `Doc`. Note that spans are allowed to overlap or exclude tokens, but each `Span` can only refer to one `Doc` (by definition). This means that within a `Doc`, the regions of the output tensors that correspond to each `Span` may overlap or have gaps, but for each `Doc`, there is a non-overlapping contiguous slice of the outputs. ~~List[List[Span]]~~ |
| `tokens` | The output of the tokenizer. ~~transformers.BatchEncoding~~ |
| `tensors` | The output of the transformer model. ~~List[torch.Tensor]~~ |
| `align` | Alignment from the spaCy tokenization to the wordpieces. This is a ragged array, where `align.lengths[i]` indicates the number of wordpiece tokens that token `i` aligns against. The actual indices are provided at `align[i].dataXd`. ~~Ragged~~ |
@ -439,10 +439,10 @@ Split a `TransformerData` object that represents a batch into a list with one
## Span getters {#span_getters source="github.com/explosion/spacy-transformers/blob/master/spacy_transformers/span_getters.py"}
Span getters are functions that take a batch of [`Doc`](/api/doc) objects and
return a lists of [`Span`](/api/span) objects for each doc, to be processed by
the transformer. This is used to manage long documents, by cutting them into
return a lists of [`Span`](/api/span) objects for each doc to be processed by
the transformer. This is used to manage long documents by cutting them into
smaller sequences before running the transformer. The spans are allowed to
overlap, and you can also omit sections of the Doc if they are not relevant.
overlap, and you can also omit sections of the `Doc` if they are not relevant.
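
The span-getter contract described here, as a minimal sketch (registration of the getter under a config name is left out; this only shows the callable's shape):

```python
from typing import List
from spacy.tokens import Doc, Span

def get_sent_spans(docs: List[Doc]) -> List[List[Span]]:
    # One list of spans per Doc; here, each sentence becomes its own span
    return [list(doc.sents) for doc in docs]
```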
Span getters can be referenced in the `[components.transformer.model.get_spans]`
block of the config to customize the sequences processed by the transformer. You


@ -290,7 +290,7 @@ If a table is full, it can be resized using
## Vectors.n_keys {#n_keys tag="property"}
Get the number of keys in the table. Note that this is the number of _all_ keys,
not just unique vectors. If several keys are mapped are mapped to the same
not just unique vectors. If several keys are mapped to the same
vectors, they will be counted individually.
> #### Example
@ -307,10 +307,10 @@ vectors, they will be counted individually.
## Vectors.most_similar {#most_similar tag="method"}
For each of the given vectors, find the `n` most similar entries to it, by
For each of the given vectors, find the `n` most similar entries to it by
cosine. Queries are by vector. Results are returned as a
`(keys, best_rows, scores)` tuple. If `queries` is large, the calculations are
performed in chunks, to avoid consuming too much memory. You can set the
performed in chunks to avoid consuming too much memory. You can set the
`batch_size` to control the size/space trade-off during the calculations.
> #### Example
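
A sketch of the query pattern described above (assumes a pipeline that ships word vectors, e.g. `en_core_web_md`):

```python
import numpy
import spacy

nlp = spacy.load("en_core_web_md")
queries = numpy.asarray([nlp.vocab["cat"].vector])
keys, best_rows, scores = nlp.vocab.vectors.most_similar(queries, n=5, batch_size=1024)
print([nlp.vocab.strings[int(key)] for key in keys[0]])
```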


@ -29,7 +29,7 @@ Create the vocabulary.
| `oov_prob` | The default OOV probability. Defaults to `-20.0`. ~~float~~ |
| `vectors_name` <Tag variant="new">2.2</Tag> | A name to identify the vectors table. ~~str~~ |
| `writing_system` | A dictionary describing the language's writing system. Typically provided by [`Language.Defaults`](/api/language#defaults). ~~Dict[str, Any]~~ |
| `get_noun_chunks` | A function that yields base noun phrases, used for [`Doc.noun_chunks`](/ap/doc#noun_chunks). ~~Optional[Callable[[Union[Doc, Span], Iterator[Span]]]]~~ |
| `get_noun_chunks` | A function that yields base noun phrases used for [`Doc.noun_chunks`](/ap/doc#noun_chunks). ~~Optional[Callable[[Union[Doc, Span], Iterator[Span]]]]~~ |
## Vocab.\_\_len\_\_ {#len tag="method"}
@ -150,7 +150,7 @@ rows, we would discard the vectors for "feline" and "reclined". These words
would then be remapped to the closest remaining vector so "feline" would have
the same vector as "cat", and "reclined" would have the same vector as "sat".
The similarities are judged by cosine. The original vectors may be large, so the
cosines are calculated in minibatches, to reduce memory usage.
cosines are calculated in minibatches to reduce memory usage.
> #### Example
>
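
Roughly, the pruning described above (sketch; assumes a pipeline with a large vectors table, e.g. `en_core_web_lg`):

```python
import spacy

nlp = spacy.load("en_core_web_lg")
remapped = nlp.vocab.prune_vectors(10000, batch_size=1024)
# maps each removed word to (kept word, similarity), e.g. "reclined" -> ("sat", ...)
```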
@ -170,7 +170,7 @@ cosines are calculated in minibatches, to reduce memory usage.
Retrieve a vector for a word in the vocabulary. Words can be looked up by string
or hash value. If no vectors data is loaded, a `ValueError` is raised. If `minn`
is defined, then the resulting vector uses [FastText](https://fasttext.cc/)'s
subword features by average over ngrams of `orth` (introduced in spaCy `v2.1`).
subword features by average over n-grams of `orth` (introduced in spaCy `v2.1`).
> #### Example
>
@ -182,13 +182,13 @@ subword features by average over ngrams of `orth` (introduced in spaCy `v2.1`).
| Name | Description |
| ----------------------------------- | ---------------------------------------------------------------------------------------------------------------------- |
| `orth` | The hash value of a word, or its unicode string. ~~Union[int, str]~~ |
| `minn` <Tag variant="new">2.1</Tag> | Minimum n-gram length used for FastText's ngram computation. Defaults to the length of `orth`. ~~int~~ |
| `maxn` <Tag variant="new">2.1</Tag> | Maximum n-gram length used for FastText's ngram computation. Defaults to the length of `orth`. ~~int~~ |
| `minn` <Tag variant="new">2.1</Tag> | Minimum n-gram length used for FastText's n-gram computation. Defaults to the length of `orth`. ~~int~~ |
| `maxn` <Tag variant="new">2.1</Tag> | Maximum n-gram length used for FastText's n-gram computation. Defaults to the length of `orth`. ~~int~~ |
| **RETURNS** | A word vector. Size and shape are determined by the `Vocab.vectors` instance. ~~numpy.ndarray[ndim=1, dtype=float32]~~ |
## Vocab.set_vector {#set_vector tag="method" new="2"}
Set a vector for a word in the vocabulary. Words can be referenced by by string
Set a vector for a word in the vocabulary. Words can be referenced by string
or hash value.
> #### Example
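
The two methods touched here, in one sketch (assumes a pipeline whose vectors table is 300 wide, e.g. `en_core_web_md`; `"fnord"` is just an arbitrary new word):

```python
import numpy
import spacy

nlp = spacy.load("en_core_web_md")
nlp.vocab.set_vector("fnord", numpy.random.uniform(-1, 1, (300,)))
print(nlp.vocab.get_vector("fnord").shape)  # (300,)
```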


@ -36,7 +36,7 @@ models such as [transformers](#transformers) is that word vectors model
context around them, a transformer model like BERT can't really help you. BERT
is designed to understand language **in context**, which isn't what you have. A
word vectors table will be a much better fit for your task. However, if you do
have words in context — whole sentences or paragraphs of running text — word
have words in context whole sentences or paragraphs of running text word
vectors will only provide a very rough approximation of what the text is about.
Word vectors are also very computationally efficient, as they map a word to a