Add initial draft of docs

Adriane Boyd 2023-08-01 11:19:04 +02:00
parent ae9dfb48e8
commit 294f89e1ca
4 changed files with 310 additions and 5 deletions

@@ -0,0 +1,149 @@
---
title: BaseVectors
teaser: Abstract class for word vectors
tag: class
source: spacy/vectors.pyx
version: 3.7
---
`BaseVectors` is an abstract class to support the development of custom vectors
implementations.
For use in training with [`StaticVectors`](/api/architectures#staticvectors),
`get_batch` must be implemented. For improved performance, use efficient
batching in `get_batch` and implement `to_ops` to copy the vector data to the
current device. For a complete example, see the custom implementation for
[BPEmb subword embeddings](/usage/embeddings-transformers#custom-vectors).
## BaseVectors.\_\_init\_\_ {id="init",tag="method"}
Create a new vector store.
| Name | Description |
| -------------- | --------------------------------------------------------------------------------------------------------------------- |
| _keyword-only_ | |
| `strings` | The string store. A new string store is created if one is not provided. Defaults to `None`. ~~Optional[StringStore]~~ |
## BaseVectors.\_\_getitem\_\_ {id="getitem",tag="method"}
Get a vector by key. If the key is not found in the table, a `KeyError` should
be raised.
| Name | Description |
| ----------- | ---------------------------------------------------------------- |
| `key` | The key to get the vector for. ~~Union[int, str]~~ |
| **RETURNS** | The vector for the key. ~~numpy.ndarray[ndim=1, dtype=float32]~~ |
## BaseVectors.\_\_len\_\_ {id="len",tag="method"}
Return the number of vectors in the table.
| Name | Description |
| ----------- | ------------------------------------------- |
| **RETURNS** | The number of vectors in the table. ~~int~~ |
## BaseVectors.\_\_contains\_\_ {id="contains",tag="method"}
Check whether there is a vector entry for the given key.
| Name | Description |
| ----------- | -------------------------------------------- |
| `key` | The key to check. ~~int~~ |
| **RETURNS** | Whether the key has a vector entry. ~~bool~~ |
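
The `__getitem__`, `__len__` and `__contains__` contracts above can be sketched
without spaCy. The following is a minimal plain-Python stand-in, not a
`BaseVectors` subclass: the class name `TinyVectors`, the string keys and the
list-based vectors are assumptions made purely for illustration.

```python
from typing import Dict, List


class TinyVectors:
    """Hypothetical stand-in for a vector table (not spaCy's BaseVectors)."""

    def __init__(self, table: Dict[str, List[float]]) -> None:
        # Copy so later changes to the caller's dict don't affect the table.
        self._table = dict(table)

    def __len__(self) -> int:
        return len(self._table)

    def __contains__(self, key: str) -> bool:
        return key in self._table

    def __getitem__(self, key: str) -> List[float]:
        # Missing keys raise KeyError, mirroring the documented contract.
        if key not in self._table:
            raise KeyError(key)
        return self._table[key]


vectors = TinyVectors({"apple": [0.1, 0.2], "pear": [0.3, 0.4]})
```

A real implementation would typically accept hash or string keys and return
arrays rather than lists; the lookup behavior is the part being illustrated.
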
## BaseVectors.add {id="add",tag="method"}
Add a key to the table, if possible. If no keys can be added, return `-1`.
| Name | Description |
| ----------- | ----------------------------------------------------------------------------------- |
| `key` | The key to add. ~~Union[str, int]~~ |
| **RETURNS** | The row the vector was added to, or `-1` if the operation is not supported. ~~int~~ |
## BaseVectors.shape {id="shape",tag="property"}
Get a `(rows, dims)` tuple with the number of rows and the number of dimensions
in the vector table.
| Name | Description |
| ----------- | ------------------------------------------ |
| **RETURNS** | A `(rows, dims)` pair. ~~Tuple[int, int]~~ |
## BaseVectors.size {id="size",tag="property"}
The vector size, i.e. `rows * dims`.
| Name | Description |
| ----------- | ------------------------ |
| **RETURNS** | The vector size. ~~int~~ |
## BaseVectors.is_full {id="is_full",tag="property"}
Whether the vectors table is full and no slots are available for new keys.
| Name | Description |
| ----------- | ------------------------------------------- |
| **RETURNS** | Whether the vectors table is full. ~~bool~~ |
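
How `add`, `shape`, `size` and `is_full` relate can be illustrated with a small
fixed-capacity table. This is a plain-Python sketch under stated assumptions:
`FixedTable` is a hypothetical stand-in that does not subclass `BaseVectors`,
and returning `-1` once every row is taken is just one way to honor the `add`
contract described above.

```python
from typing import Dict, List, Tuple


class FixedTable:
    """Hypothetical fixed-capacity table (not spaCy's BaseVectors)."""

    def __init__(self, rows: int, dims: int) -> None:
        self.data: List[List[float]] = [[0.0] * dims for _ in range(rows)]
        self.dims = dims
        self.key2row: Dict[int, int] = {}

    @property
    def shape(self) -> Tuple[int, int]:
        return (len(self.data), self.dims)

    @property
    def size(self) -> int:
        # The size is the total number of values: rows * dims.
        rows, dims = self.shape
        return rows * dims

    @property
    def is_full(self) -> bool:
        # Full once every row has been assigned to a key.
        return len(self.key2row) >= self.shape[0]

    def add(self, key: int) -> int:
        if key in self.key2row:
            return self.key2row[key]
        if self.is_full:
            # No free slot left, so signal that the key was not added.
            return -1
        row = len(self.key2row)
        self.key2row[key] = row
        return row


table = FixedTable(rows=2, dims=3)
first = table.add(10)
second = table.add(20)
third = table.add(30)  # the table is already full at this point
```

An implementation whose table can never be modified could simply return `True`
from `is_full` and `-1` from `add` unconditionally.
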
## BaseVectors.get_batch {id="get_batch",tag="method",version="3.2"}
Get the vectors for the provided keys efficiently as a batch. Required to use
the vectors with [`StaticVectors`](/api/architectures#staticvectors) for
training.
| Name | Description |
| ------ | --------------------------------------- |
| `keys` | The keys. ~~Iterable[Union[int, str]]~~ |
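
For subword-based vectors, `get_batch` usually has to combine several subword
rows into a single vector per key. The sketch below shows that pooling step in
plain Python, assuming mean pooling over consecutive groups of rows. The
function name `mean_pool` and the list-based data are illustrative assumptions;
a real implementation would more likely use array operations on the current
device.

```python
from typing import List, Sequence


def mean_pool(
    rows: Sequence[Sequence[float]], lengths: Sequence[int]
) -> List[List[float]]:
    """Average consecutive groups of rows; group i spans lengths[i] rows."""
    if sum(lengths) != len(rows):
        raise ValueError("lengths must add up to the number of rows")
    pooled: List[List[float]] = []
    start = 0
    for length in lengths:
        if length <= 0:
            # Sketch-only choice: reject keys that produced no subword rows.
            raise ValueError("each key needs at least one subword row")
        group = rows[start:start + length]
        # zip(*group) iterates over columns, so each mean is per dimension.
        pooled.append([sum(column) / length for column in zip(*group)])
        start += length
    return pooled


# Two keys: the first is split into two subwords, the second into one.
subword_rows = [[1.0, 3.0], [3.0, 5.0], [10.0, 20.0]]
batch = mean_pool(subword_rows, lengths=[2, 1])
```

The result has one row per key, which is the shape a batched lookup is expected
to produce.
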
## BaseVectors.to_ops {id="to_ops",tag="method"}
Dummy method. Implement this to change the embedding matrix to use different
Thinc ops.
| Name | Description |
| ----- | -------------------------------------------------------- |
| `ops` | The Thinc ops to switch the embedding matrix to. ~~Ops~~ |
## BaseVectors.to_disk {id="to_disk",tag="method"}
Dummy method to allow serialization. Implement to save vector data with the
pipeline.
| Name | Description |
| ------ | ------------------------------------------------------------------------------------------------------------------------------------------ |
| `path` | A path to a directory, which will be created if it doesn't exist. Paths may be either strings or `Path`-like objects. ~~Union[str, Path]~~ |
## BaseVectors.from_disk {id="from_disk",tag="method"}
Dummy method to allow serialization. Implement to load vector data from a saved
pipeline.
| Name | Description |
| ----------- | ----------------------------------------------------------------------------------------------- |
| `path` | A path to a directory. Paths may be either strings or `Path`-like objects. ~~Union[str, Path]~~ |
| **RETURNS** | The modified vectors object. ~~BaseVectors~~ |
## BaseVectors.to_bytes {id="to_bytes",tag="method"}
Dummy method to allow serialization. Implement to serialize vector data to a
binary string.
> #### Example
>
> ```python
> vectors_bytes = vectors.to_bytes()
> ```
| Name | Description |
| ----------- | ---------------------------------------------------- |
| **RETURNS** | The serialized form of the vectors object. ~~bytes~~ |
## BaseVectors.from_bytes {id="from_bytes",tag="method"}
Dummy method to allow serialization. Implement to load vector data from a binary
string.
| Name | Description |
| ----------- | ----------------------------------- |
| `data` | The data to load from. ~~bytes~~ |
| **RETURNS** | The vectors object. ~~BaseVectors~~ |
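
A `to_bytes`/`from_bytes` pair is expected to round-trip the vector data. The
following is a minimal sketch of one possible binary layout, using only the
standard library. `TinyTable` and its header format are hypothetical and are
not the format used by spaCy's own `Vectors` class.

```python
import struct
from typing import List, Optional


class TinyTable:
    """Hypothetical stand-in for a vectors object (not spaCy's BaseVectors)."""

    def __init__(self, rows: Optional[List[List[float]]] = None) -> None:
        self.rows: List[List[float]] = rows if rows is not None else []

    def to_bytes(self) -> bytes:
        n_rows = len(self.rows)
        dims = len(self.rows[0]) if self.rows else 0
        flat = [value for row in self.rows for value in row]
        # Header: row count and dimension count as little-endian uint32.
        header = struct.pack("<II", n_rows, dims)
        # Body: all values as little-endian float32, row by row.
        return header + struct.pack(f"<{len(flat)}f", *flat)

    def from_bytes(self, data: bytes) -> "TinyTable":
        n_rows, dims = struct.unpack_from("<II", data, 0)
        offset = struct.calcsize("<II")
        flat = struct.unpack_from(f"<{n_rows * dims}f", data, offset)
        self.rows = [list(flat[i * dims:(i + 1) * dims]) for i in range(n_rows)]
        # Return the object itself so calls can be chained.
        return self


original = TinyTable([[0.5, 1.0], [-2.0, 4.25]])
data = original.to_bytes()
restored = TinyTable().from_bytes(data)
```

The example values are exactly representable as float32, so the restored rows
compare equal to the originals; arbitrary floats would be rounded to float32
precision.
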

@@ -297,10 +297,9 @@ The vector size, i.e. `rows * dims`.
 ## Vectors.is_full {id="is_full",tag="property"}
-Whether the vectors table is full and has no slots are available for new keys.
-If a table is full, it can be resized using
-[`Vectors.resize`](/api/vectors#resize). In `floret` mode, the table is always
-full and cannot be resized.
+Whether the vectors table is full and no slots are available for new keys. If a
+table is full, it can be resized using [`Vectors.resize`](/api/vectors#resize).
+In `floret` mode, the table is always full and cannot be resized.
 > #### Example
 >
@@ -441,7 +440,7 @@ Load state from a binary string.
 > #### Example
 >
 > ```python
-> fron spacy.vectors import Vectors
+> from spacy.vectors import Vectors
 > vectors_bytes = vectors.to_bytes()
 > new_vectors = Vectors(StringStore())
 > new_vectors.from_bytes(vectors_bytes)

@@ -632,6 +632,162 @@ def MyCustomVectors(
)
```
#### Creating a custom vectors implementation {id="custom-vectors",version="3.7"}
You can specify a custom registered vectors class under `[nlp.vectors]` in order
to use static vectors in formats other than the ones supported by
[`Vectors`](/api/vectors). Extend the abstract [`BaseVectors`](/api/basevectors)
class to implement your custom vectors.
As an example, the following `BPEmbVectors` class implements support for
[BPEmb subword embeddings](https://bpemb.h-its.org/):
```python
# requires: pip install bpemb
from typing import cast, Callable, Optional
from pathlib import Path
import warnings

from bpemb import BPEmb
from spacy.strings import StringStore
from spacy.util import registry
from spacy.vectors import BaseVectors
from spacy.vocab import Vocab
from thinc.api import Ops, get_current_ops
from thinc.backends import get_array_ops
from thinc.types import Floats2d


class BPEmbVectors(BaseVectors):
    def __init__(
        self,
        *,
        strings: Optional[StringStore] = None,
        lang: Optional[str] = None,
        vs: Optional[int] = None,
        dim: Optional[int] = None,
        cache_dir: Optional[Path] = None,
        encode_extra_options: Optional[str] = None,
        model_file: Optional[Path] = None,
        emb_file: Optional[Path] = None,
    ):
        # Only pass on the options that were set so that BPEmb's own
        # defaults apply to everything else.
        kwargs = {}
        if lang is not None:
            kwargs["lang"] = lang
        if vs is not None:
            kwargs["vs"] = vs
        if dim is not None:
            kwargs["dim"] = dim
        if cache_dir is not None:
            kwargs["cache_dir"] = cache_dir
        if encode_extra_options is not None:
            kwargs["encode_extra_options"] = encode_extra_options
        if model_file is not None:
            kwargs["model_file"] = model_file
        if emb_file is not None:
            kwargs["emb_file"] = emb_file
        self.bpemb = BPEmb(**kwargs)
        self.strings = strings
        self.name = repr(self.bpemb)
        self.n_keys = -1
        self.mode = "BPEmb"
        # Copy the embedding matrix to the current device.
        self.to_ops(get_current_ops())

    def __contains__(self, key):
        # Any string can be split into byte pieces, so every key has a vector.
        return True

    def is_full(self):
        return True

    def add(self, key, *, vector=None, row=None):
        warnings.warn(
            (
                "Skipping BPEmbVectors.add: the bpemb vector table cannot be "
                "modified. Vectors are calculated from bytepieces."
            )
        )
        return -1

    def __getitem__(self, key):
        return self.get_batch([key])[0]

    def get_batch(self, keys):
        # Resolve hash keys to strings, then encode each string as byte-piece IDs.
        keys = [self.strings.as_string(key) for key in keys]
        bp_ids = self.bpemb.encode_ids(keys)
        ops = get_array_ops(self.bpemb.emb.vectors)
        # Flatten the IDs and average the byte-piece vectors for each key.
        indices = ops.asarray(ops.xp.hstack(bp_ids), dtype="int32")
        lengths = ops.asarray([len(x) for x in bp_ids], dtype="int32")
        vecs = ops.reduce_mean(cast(Floats2d, self.bpemb.emb.vectors[indices]), lengths)
        return vecs

    @property
    def shape(self):
        return self.bpemb.vectors.shape

    def __len__(self):
        return self.shape[0]

    @property
    def vectors_length(self):
        return self.shape[1]

    @property
    def size(self):
        return self.bpemb.vectors.size

    def to_ops(self, ops: Ops):
        self.bpemb.emb.vectors = ops.asarray(self.bpemb.emb.vectors)


@registry.vectors("BPEmbVectors.v1")
def create_bpemb_vectors(
    lang: Optional[str] = "multi",
    vs: Optional[int] = None,
    dim: Optional[int] = None,
    cache_dir: Optional[Path] = None,
    encode_extra_options: Optional[str] = None,
    model_file: Optional[Path] = None,
    emb_file: Optional[Path] = None,
) -> Callable[[Vocab], BPEmbVectors]:
    def bpemb_vectors_factory(vocab: Vocab) -> BPEmbVectors:
        return BPEmbVectors(
            strings=vocab.strings,
            lang=lang,
            vs=vs,
            dim=dim,
            cache_dir=cache_dir,
            encode_extra_options=encode_extra_options,
            model_file=model_file,
            emb_file=emb_file,
        )

    return bpemb_vectors_factory
```
<Infobox variant="warning">
Note that the serialization methods are not implemented, so the embeddings are
loaded from your local cache or downloaded by `BPEmb` each time the pipeline is
loaded.
</Infobox>
To use this in your pipeline, specify this registered function under
`[nlp.vectors]` in your config:
```ini
[nlp.vectors]
@vectors = "BPEmbVectors.v1"
lang = "en"
```
Or specify it when creating a blank pipeline:
```python
import spacy

nlp = spacy.blank("en", config={"nlp.vectors": {"@vectors": "BPEmbVectors.v1", "lang": "en"}})
```
Remember to include this code with `--code` when using
[`spacy train`](/api/cli#train) and [`spacy package`](/api/cli#package).
## Pretraining {id="pretraining"}
The [`spacy pretrain`](/api/cli#pretrain) command lets you initialize your

@@ -131,6 +131,7 @@
"label": "Other",
"items": [
{ "text": "Attributes", "url": "/api/attributes" },
{ "text": "BaseVectors", "url": "/api/basevectors" },
{ "text": "Corpus", "url": "/api/corpus" },
{ "text": "InMemoryLookupKB", "url": "/api/inmemorylookupkb" },
{ "text": "KnowledgeBase", "url": "/api/kb" },