spaCy/website/docs/api/basevectors.mdx
Adriane Boyd 0fe43f40f1
Support registered vectors (#12492)
2023-08-01 15:46:08 +02:00

---
title: BaseVectors
teaser: Abstract class for word vectors
tag: class
source: spacy/vectors.pyx
version: 3.7
---
`BaseVectors` is an abstract class to support the development of custom vector
implementations.
For use in training with [`StaticVectors`](/api/architectures#staticvectors),
`get_batch` must be implemented. For improved performance, use efficient
batching in `get_batch` and implement `to_ops` to copy the vector data to the
current device. See an example custom implementation for
[BPEmb subword embeddings](/usage/embeddings-transformers#custom-vectors).
## BaseVectors.\_\_init\_\_ {id="init",tag="method"}
Create a new vector store.
| Name | Description |
| -------------- | --------------------------------------------------------------------------------------------------------------------- |
| _keyword-only_ | |
| `strings` | The string store. A new string store is created if one is not provided. Defaults to `None`. ~~Optional[StringStore]~~ |
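
The constructor contract above can be sketched in plain Python. This is a standalone illustration that does not import spaCy: `ToyStringStore` and `ToyVectors` are hypothetical stand-ins, and a real implementation would subclass `BaseVectors` and receive a spaCy `StringStore`.

```python
from typing import Dict, Optional


class ToyStringStore:
    """Hypothetical stand-in for spaCy's StringStore."""

    def __init__(self) -> None:
        self.ids: Dict[str, int] = {}


class ToyVectors:
    """Hypothetical stand-in for a BaseVectors subclass (no spaCy import)."""

    def __init__(self, *, strings: Optional[ToyStringStore] = None) -> None:
        # The bare `*` makes `strings` keyword-only.
        # A new store is created when none is provided.
        self.strings = strings if strings is not None else ToyStringStore()


shared = ToyStringStore()
with_store = ToyVectors(strings=shared)  # reuses the given store
without_store = ToyVectors()  # creates its own store
```

Passing the store positionally, as in `ToyVectors(shared)`, raises a `TypeError` in this sketch because the argument is keyword-only.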
## BaseVectors.\_\_getitem\_\_ {id="getitem",tag="method"}
Get a vector by key. If the key is not found in the table, a `KeyError` should
be raised.
| Name | Description |
| ----------- | ---------------------------------------------------------------- |
| `key` | The key to get the vector for. ~~Union[int, str]~~ |
| **RETURNS** | The vector for the key. ~~numpy.ndarray[ndim=1, dtype=float32]~~ |
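
A minimal sketch of the lookup contract, assuming a plain dict as storage. `ToyVectors` is a hypothetical stand-in rather than a spaCy class; a real implementation would subclass `BaseVectors` and return a one-dimensional `float32` NumPy array instead of a list.

```python
from typing import Dict, List, Union


class ToyVectors:
    """Hypothetical stand-in for a BaseVectors subclass (no spaCy import)."""

    def __init__(self) -> None:
        # Assumption: keys are stored exactly as given. A real
        # implementation may map strings to hashes via the string store.
        self._data: Dict[Union[int, str], List[float]] = {"cat": [0.1, 0.2]}

    def __getitem__(self, key: Union[int, str]) -> List[float]:
        # The dict lookup raises KeyError for unknown keys, which is
        # the behavior the contract asks for.
        return self._data[key]


vectors = ToyVectors()
print(vectors["cat"])
try:
    vectors["dog"]
except KeyError:
    print("no vector for 'dog'")
```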
## BaseVectors.\_\_len\_\_ {id="len",tag="method"}
Return the number of vectors in the table.
| Name | Description |
| ----------- | ------------------------------------------- |
| **RETURNS** | The number of vectors in the table. ~~int~~ |
## BaseVectors.\_\_contains\_\_ {id="contains",tag="method"}
Check whether there is a vector entry for the given key.
| Name | Description |
| ----------- | -------------------------------------------- |
| `key` | The key to check. ~~int~~ |
| **RETURNS** | Whether the key has a vector entry. ~~bool~~ |
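
As a rough illustration, `__len__` and `__contains__` are what Python calls for `len(vectors)` and `key in vectors`. The sketch below is standalone and does not import spaCy; `ToyVectors` is hypothetical, and the integer keys are assumed to stand in for hash values.

```python
from typing import Dict, List


class ToyVectors:
    """Hypothetical stand-in for a BaseVectors subclass (no spaCy import)."""

    def __init__(self, data: Dict[int, List[float]]) -> None:
        self._data = data

    def __len__(self) -> int:
        # Number of vectors in the table.
        return len(self._data)

    def __contains__(self, key: int) -> bool:
        # Whether the key has a vector entry.
        return key in self._data


vectors = ToyVectors({101: [0.1, 0.2], 202: [0.3, 0.4]})
print(len(vectors))  # dispatches to __len__
print(101 in vectors)  # dispatches to __contains__
```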
## BaseVectors.add {id="add",tag="method"}
Add a key to the table, if possible. If no keys can be added, for example
because the implementation does not support adding keys, return `-1`.
| Name | Description |
| ----------- | ----------------------------------------------------------------------------------- |
| `key` | The key to add. ~~Union[str, int]~~ |
| **RETURNS** | The row the vector was added to, or `-1` if the operation is not supported. ~~int~~ |
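
One way to read the `add` contract is sketched below for a table with a fixed capacity. This is a standalone illustration that does not import spaCy: `FixedSizeVectors` and its `capacity` parameter are hypothetical, and returning the existing row for a known key is an assumption of the sketch.

```python
from typing import Dict, Union


class FixedSizeVectors:
    """Hypothetical stand-in for a BaseVectors subclass (no spaCy import)."""

    def __init__(self, capacity: int) -> None:
        self.capacity = capacity
        self.key2row: Dict[Union[str, int], int] = {}

    @property
    def is_full(self) -> bool:
        # No slots are left once every row has a key.
        return len(self.key2row) >= self.capacity

    def add(self, key: Union[str, int]) -> int:
        if key in self.key2row:
            # Assumption: a known key maps to its existing row.
            return self.key2row[key]
        if self.is_full:
            # The key cannot be added, so signal that with -1.
            return -1
        row = len(self.key2row)
        self.key2row[key] = row
        return row


table = FixedSizeVectors(capacity=2)
rows = [table.add(key) for key in ("apple", "pear", "plum")]
print(rows)
```

An implementation that never accepts new keys, such as one that computes vectors on the fly, could return `-1` unconditionally.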
## BaseVectors.shape {id="shape",tag="property"}
Get a `(rows, dims)` tuple giving the number of rows and the number of
dimensions in the vector table.
| Name | Description |
| ----------- | ------------------------------------------ |
| **RETURNS** | A `(rows, dims)` pair. ~~Tuple[int, int]~~ |
## BaseVectors.size {id="size",tag="property"}
The vector size, i.e. `rows * dims`.
| Name | Description |
| ----------- | ------------------------ |
| **RETURNS** | The vector size. ~~int~~ |
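
The relationship between `shape` and `size` can be sketched as follows. The example is standalone and does not import spaCy; `ToyVectors` is hypothetical and stores the table as a list of equally long rows, where a real implementation would typically hold a two-dimensional array.

```python
from typing import List, Tuple


class ToyVectors:
    """Hypothetical stand-in for a BaseVectors subclass (no spaCy import)."""

    def __init__(self, table: List[List[float]]) -> None:
        self._table = table

    @property
    def shape(self) -> Tuple[int, int]:
        rows = len(self._table)
        dims = len(self._table[0]) if rows else 0
        return (rows, dims)

    @property
    def size(self) -> int:
        # The vector size is rows * dims.
        rows, dims = self.shape
        return rows * dims


# Three rows with four dimensions each.
vectors = ToyVectors([[0.0] * 4 for _ in range(3)])
print(vectors.shape, vectors.size)
```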
## BaseVectors.is_full {id="is_full",tag="property"}
Whether the vectors table is full and no slots are available for new keys.
| Name | Description |
| ----------- | ------------------------------------------- |
| **RETURNS** | Whether the vectors table is full. ~~bool~~ |
## BaseVectors.get_batch {id="get_batch",tag="method",version="3.2"}
Get the vectors for the provided keys efficiently as a batch. Required to use
the vectors with [`StaticVectors`](/api/architectures#StaticVectors) for
training.
| Name | Description |
| ------ | --------------------------------------- |
| `keys` | The keys. ~~Iterable[Union[int, str]]~~ |
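
A rough sketch of a batched lookup, written without spaCy or NumPy so that it runs on its own. `ToyVectors` is hypothetical, and mapping unknown keys to a zero vector is an assumption of this sketch rather than documented behavior. A real implementation would usually resolve all keys to row indices first and then index a two-dimensional array in a single operation.

```python
from typing import Dict, Iterable, List, Union

Key = Union[int, str]


class ToyVectors:
    """Hypothetical stand-in for a BaseVectors subclass (no spaCy import)."""

    def __init__(self, key2row: Dict[Key, int], table: List[List[float]]) -> None:
        self._key2row = key2row
        self._table = table

    def get_batch(self, keys: Iterable[Key]) -> List[List[float]]:
        dims = len(self._table[0])
        batch = []
        for key in keys:
            row = self._key2row.get(key, -1)
            # Assumption of this sketch: unknown keys get a zero vector.
            batch.append(list(self._table[row]) if row >= 0 else [0.0] * dims)
        return batch


vectors = ToyVectors({"cat": 0, "dog": 1}, [[1.0, 0.0], [0.0, 1.0]])
batch = vectors.get_batch(["cat", "unknown", "dog"])
print(batch)
```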
## BaseVectors.to_ops {id="to_ops",tag="method"}
Dummy method. Implement this to switch the embedding matrix to different Thinc
ops, for example to copy the vector data to the current device.
| Name | Description |
| ----- | -------------------------------------------------------- |
| `ops` | The Thinc ops to switch the embedding matrix to. ~~Ops~~ |
## BaseVectors.to_disk {id="to_disk",tag="method"}
Dummy method to allow serialization. Implement to save vector data with the
pipeline.
| Name | Description |
| ------ | ------------------------------------------------------------------------------------------------------------------------------------------ |
| `path` | A path to a directory, which will be created if it doesn't exist. Paths may be either strings or `Path`-like objects. ~~Union[str, Path]~~ |
## BaseVectors.from_disk {id="from_disk",tag="method"}
Dummy method to allow serialization. Implement to load vector data from a saved
pipeline.
| Name | Description |
| ----------- | ----------------------------------------------------------------------------------------------- |
| `path` | A path to a directory. Paths may be either strings or `Path`-like objects. ~~Union[str, Path]~~ |
| **RETURNS** | The modified vectors object. ~~BaseVectors~~ |
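
The `to_disk` and `from_disk` pair can be sketched as a round trip through a directory. This standalone example does not import spaCy: `ToyVectors`, the JSON format, and the file name `toy_vectors.json` are all assumptions made for illustration.

```python
import json
import tempfile
from pathlib import Path
from typing import Dict, List, Union


class ToyVectors:
    """Hypothetical stand-in for a BaseVectors subclass (no spaCy import)."""

    def __init__(self) -> None:
        self.data: Dict[str, List[float]] = {}

    def to_disk(self, path: Union[str, Path]) -> None:
        path = Path(path)
        # Create the directory if it doesn't exist.
        path.mkdir(parents=True, exist_ok=True)
        # Hypothetical file name and format.
        with (path / "toy_vectors.json").open("w", encoding="utf8") as file_:
            json.dump(self.data, file_)

    def from_disk(self, path: Union[str, Path]) -> "ToyVectors":
        path = Path(path)
        with (path / "toy_vectors.json").open(encoding="utf8") as file_:
            self.data = json.load(file_)
        # Return the modified object, as the table above describes.
        return self


original = ToyVectors()
original.data["cat"] = [0.5, 0.25]
with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "vectors"  # does not exist yet
    original.to_disk(target)
    restored = ToyVectors().from_disk(target)
print(restored.data)
```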
## BaseVectors.to_bytes {id="to_bytes",tag="method"}
Dummy method to allow serialization. Implement to serialize vector data to a
binary string.
| Name | Description |
| ----------- | ---------------------------------------------------- |
| **RETURNS** | The serialized form of the vectors object. ~~bytes~~ |
## BaseVectors.from_bytes {id="from_bytes",tag="method"}
Dummy method to allow serialization. Implement to load vector data from a binary
string.
| Name | Description |
| ----------- | ----------------------------------- |
| `data` | The data to load from. ~~bytes~~ |
| **RETURNS** | The vectors object. ~~BaseVectors~~ |
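
The byte-level round trip can be sketched in the same spirit. This standalone example does not import spaCy: `ToyVectors` is hypothetical, and JSON is an assumption chosen to stay within the standard library. An implementation is free to pick any binary format, as long as `from_bytes` can read what `to_bytes` writes.

```python
import json
from typing import Dict, List


class ToyVectors:
    """Hypothetical stand-in for a BaseVectors subclass (no spaCy import)."""

    def __init__(self) -> None:
        self.data: Dict[str, List[float]] = {}

    def to_bytes(self) -> bytes:
        # Assumption: JSON encoded as UTF-8 is the serialized form.
        return json.dumps(self.data).encode("utf8")

    def from_bytes(self, data: bytes) -> "ToyVectors":
        self.data = json.loads(data.decode("utf8"))
        # Return the vectors object so that calls can be chained.
        return self


original = ToyVectors()
original.data["cat"] = [0.5, 0.25]
payload = original.to_bytes()
restored = ToyVectors().from_bytes(payload)
print(restored.data)
```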