Merge remote-tracking branch 'upstream/master' into spacy.io

svlandeg 2021-09-20 15:54:00 +02:00
commit ec621e6853
26 changed files with 744 additions and 63 deletions

View File

@ -0,0 +1,220 @@
# Listeners
1. [Overview](#1-overview)
2. [Initialization](#2-initialization)
- [A. Linking listeners to the embedding component](#2a-linking-listeners-to-the-embedding-component)
- [B. Shape inference](#2b-shape-inference)
3. [Internal communication](#3-internal-communication)
- [A. During prediction](#3a-during-prediction)
- [B. During training](#3b-during-training)
- [C. Frozen components](#3c-frozen-components)
4. [Replacing listener with standalone](#4-replacing-listener-with-standalone)
## 1. Overview
Trainable spaCy components typically use some sort of `tok2vec` layer as part of the `model` definition.
This `tok2vec` layer produces embeddings and is either a standard `Tok2Vec` layer or a Transformer-based one.
Both versions can be used either inline/standalone, meaning they are defined and used
by only one specific component (e.g. NER), or
[shared](https://spacy.io/usage/embeddings-transformers#embedding-layers),
in which case the embedding functionality becomes a separate component that can
feed embeddings to multiple downstream components, using a listener pattern.
| Type | Usage | Model Architecture |
| ------------- | ---------- | -------------------------------------------------------------------------------------------------- |
| `Tok2Vec` | standalone | [`spacy.Tok2Vec`](https://spacy.io/api/architectures#Tok2Vec) |
| `Tok2Vec` | listener | [`spacy.Tok2VecListener`](https://spacy.io/api/architectures#Tok2VecListener) |
| `Transformer` | standalone | [`spacy-transformers.Tok2VecTransformer`](https://spacy.io/api/architectures#Tok2VecTransformer) |
| `Transformer` | listener | [`spacy-transformers.TransformerListener`](https://spacy.io/api/architectures#TransformerListener) |
Here we discuss the listener pattern and its implementation in code in more detail.
## 2. Initialization
### 2A. Linking listeners to the embedding component
To allow sharing a `tok2vec` layer, a separate `tok2vec` component needs to be defined in the config:
```
[components.tok2vec]
factory = "tok2vec"
[components.tok2vec.model]
@architectures = "spacy.Tok2Vec.v2"
```
A listener can then be set up by making sure the correct `upstream` name is defined, referring to the
name of the `tok2vec` component (which equals the factory name by default), or `*` as a wildcard:
```
[components.ner.model.tok2vec]
@architectures = "spacy.Tok2VecListener.v1"
upstream = "tok2vec"
```
When an [`nlp`](https://github.com/explosion/spaCy/blob/master/extra/DEVELOPER_DOCS/Language.md) object is
initialized or deserialized, it will make sure to link each `tok2vec` component to its listeners. This is
implemented in the method `nlp._link_components()`, which loops over each
component in the pipeline and calls `find_listeners()` on every component that defines it.
The [`tok2vec` component](https://github.com/explosion/spaCy/blob/master/spacy/pipeline/tok2vec.py)'s implementation
of this `find_listeners()` method will specifically identify sublayers of a model definition that are of type
`Tok2VecListener` with a matching upstream name and will then add that listener to the internal `self.listener_map`.
In a Transformer-based pipeline, the
[`transformer` component](https://github.com/explosion/spacy-transformers/blob/master/spacy_transformers/pipeline_component.py)
has a similar implementation, but its `find_listeners()` method will specifically look for `TransformerListener`
sublayers of downstream components.
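As a rough illustration, the discovery step can be thought of as walking a downstream component's Thinc model tree and collecting the matching listener sublayers. The hypothetical `find_listeners_sketch()` below is a simplified sketch of that idea, not the actual implementation:

```
from thinc.api import Model
from spacy.pipeline.tok2vec import Tok2VecListener

def find_listeners_sketch(tok2vec_component, downstream_component):
    # Collect Tok2VecListener sublayers whose upstream name matches this
    # tok2vec component's name, or the "*" wildcard.
    names = ("*", tok2vec_component.name)
    model = getattr(downstream_component, "model", None)
    if not isinstance(model, Model):
        return []
    return [
        node
        for node in model.walk()
        if isinstance(node, Tok2VecListener) and node.upstream_name in names
    ]
```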
### 2B. Shape inference
Typically, the output dimension `nO` of a listener's model equals the `nO` (or `width`) of the upstream embedding layer.
For a standard `Tok2Vec`-based component, this is usually known up-front and defined as such in the config:
```
[components.ner.model.tok2vec]
@architectures = "spacy.Tok2VecListener.v1"
width = ${components.tok2vec.model.encode.width}
```
A `transformer` component, however, only knows its `nO` dimension after the HuggingFace transformer
is set with the function `model.attrs["set_transformer"]`,
[implemented](https://github.com/explosion/spacy-transformers/blob/master/spacy_transformers/layers/transformer_model.py)
by `set_pytorch_transformer`.
This is why, upon linking of the transformer listeners, the `transformer` component also makes sure to set
the listener's output dimension correctly.
This shape inference mechanism also needs to happen with resumed/frozen components, which means that for some CLI
commands (`assemble` and `train`), we need to call `nlp._link_components` even before initializing the `nlp`
object. To cover all use-cases and avoid negative side effects, the code base ensures that performing the
linking twice is not harmful.
## 3. Internal communication
The internal communication between a listener and its downstream components is organized by sending and
receiving information across the components - either directly or implicitly.
The details differ depending on whether the pipeline is currently training or predicting.
Either way, the `tok2vec` or `transformer` component always needs to run before the listener.
### 3A. During prediction
When the `Tok2Vec` pipeline component is called, its `predict()` method is executed to produce the results,
which are then stored by `set_annotations()` in the `doc.tensor` field of the document(s).
Similarly, the `Transformer` component stores the produced embeddings
in `doc._.trf_data`. Next, the `forward` pass of a
[`Tok2VecListener`](https://github.com/explosion/spaCy/blob/master/spacy/pipeline/tok2vec.py)
or a
[`TransformerListener`](https://github.com/explosion/spacy-transformers/blob/master/spacy_transformers/layers/listener.py)
accesses these fields on the `Doc` directly. Both listener implementations have a fallback mechanism for when these
properties were not set on the `Doc`: in that case an all-zero tensor is produced and returned.
We need this fallback mechanism to enable shape inference methods in Thinc, but the code
is slightly risky and at times might hide another bug - so it's worth being aware of.
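A minimal sketch of this prediction-time path, assuming the listener already knows its output `width` and ignoring the real Thinc plumbing (the `listener_predict_sketch()` helper is hypothetical):

```
import numpy

def listener_predict_sketch(docs, width):
    # Read the embeddings the upstream component stored on each Doc, falling
    # back to an all-zero tensor of the expected width when nothing was set
    # (e.g. during Thinc's shape inference).
    outputs = []
    for doc in docs:
        if doc.tensor.size == 0:
            outputs.append(numpy.zeros((len(doc), width), dtype="float32"))
        else:
            outputs.append(doc.tensor)
    return outputs
```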
### 3B. During training
During training, the `update()` methods of the `Tok2Vec` & `Transformer` components don't necessarily set the
annotations on the `Doc` (though since 3.1 they can if they are part of the `annotating_components` list in the config).
Instead, we rely on a caching mechanism between the original embedding component and its listener.
Specifically, the produced embeddings are sent to the listeners by calling `listener.receive()` and uniquely
identifying the batch of documents with a `batch_id`. This `receive()` call also sends the appropriate `backprop`
call to ensure that gradients from the downstream component flow back to the trainable `Tok2Vec` or `Transformer`
network.
We rely on the `nlp` object properly batching the data and sending each batch through the pipeline in sequence,
which means that only one such batch needs to be kept in memory for each listener.
When the downstream component runs and the listener should produce embeddings, it accesses the batch in memory,
runs the backpropagation, and returns the results and the gradients.
There are two ways in which this mechanism can fail, both of which are detected by `verify_inputs()`:
- `E953` if a different batch is in memory than the requested one - signaling some kind of out-of-sync state of the
training pipeline.
- `E954` if no batch is in memory at all - signaling that the pipeline is probably not set up correctly.
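Conceptually, the caching contract between the embedding component and a listener looks roughly like the sketch below. `ListenerCacheSketch` is a hypothetical stand-in, not the real listener class, and it raises plain `ValueError`s where the real code uses the spaCy error codes mentioned above:

```
class ListenerCacheSketch:
    def __init__(self):
        self._batch_id = None
        self._outputs = None
        self._backprop = None

    def receive(self, batch_id, outputs, backprop):
        # Called by the upstream Tok2Vec/Transformer during its update():
        # cache exactly one batch, plus the backprop callback into the
        # shared embedding network.
        self._batch_id = batch_id
        self._outputs = outputs
        self._backprop = backprop

    def verify_inputs(self, batch_id):
        # Called when the downstream component asks for embeddings.
        if self._batch_id is None or self._backprop is None:
            raise ValueError("no batch in memory (cf. E954)")
        if self._batch_id != batch_id:
            raise ValueError("out-of-sync batch in the training pipeline (cf. E953)")
        return self._outputs, self._backprop
```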
#### Training with multiple listeners
One `Tok2Vec` or `Transformer` component may be listened to by several downstream components, e.g.
a tagger and a parser could be sharing the same embeddings. In this case, we need to be careful about how we do
the backpropagation. When the `Tok2Vec` or `Transformer` sends out data to the listeners with `receive()`, it will
send an `accumulate_gradient` function to all listeners except the last one. This function will keep track
of the gradients received so far. Only the final listener in the pipeline will get an actual `backprop` call that
will initiate the backpropagation of the `tok2vec` or `transformer` model with the accumulated gradients.
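In simplified form, the callbacks handed out to the listeners behave roughly like this hypothetical sketch (the real logic lives in the `update()` methods of the embedding components):

```
def make_listener_callbacks_sketch(n_listeners, backprop_embedding):
    # All listeners except the last only accumulate their gradient; the last
    # one triggers the actual backprop of the shared embedding network with
    # the accumulated total.
    accumulated = []

    def accumulate_gradient(d_output):
        accumulated.append(d_output)

    def backprop(d_output):
        accumulated.append(d_output)
        total = accumulated[0]
        for grad in accumulated[1:]:
            total = total + grad  # element-wise sum of per-listener gradients
        return backprop_embedding(total)

    return [accumulate_gradient] * (n_listeners - 1) + [backprop]
```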
### 3C. Frozen components
The listener pattern can get particularly tricky in combination with frozen components. To detect components
with listeners that are not frozen consistently, `init_nlp()` (which is called by `spacy train`) goes through
the listeners and their upstream components and warns in two scenarios.
#### The Tok2Vec or Transformer is frozen
If the `Tok2Vec` or `Transformer` was already trained,
e.g. by [pretraining](https://spacy.io/usage/embeddings-transformers#pretraining),
it could be a valid use-case to freeze the embedding architecture and only train downstream components such
as a tagger or a parser. This was impossible before spaCy 3.1, but is supported since then by putting the
embedding component in the [`annotating_components`](https://spacy.io/usage/training#annotating-components)
list of the config. This works like any other "annotating component" because it relies on the `Doc` attributes.
However, if the `Tok2Vec` or `Transformer` is frozen, and not present in `annotating_components`, and a related
listener isn't frozen, then a `W086` warning is shown and further training of the pipeline will likely end with `E954`.
#### The upstream component is frozen
If an upstream component is frozen but the underlying `Tok2Vec` or `Transformer` isn't, the performance of
the upstream component will be degraded after training. In this case, a `W087` warning is shown, explaining
how to use the `replace_listeners` functionality to prevent this problem.
## 4. Replacing listener with standalone
The [`replace_listeners`](https://spacy.io/api/language#replace_listeners) functionality changes the architecture
of a downstream component from using a listener pattern to a standalone `tok2vec` or `transformer` layer,
effectively making the downstream component independent of any other components in the pipeline.
It is implemented by `nlp.replace_listeners()` and typically executed by `nlp.from_config()`.
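From the user's side, a call typically looks like this, assuming a pipeline such as `en_core_web_sm` where the `ner` component listens to a shared `tok2vec`:

```
import spacy

nlp = spacy.load("en_core_web_sm")
# Give the NER component its own internal copy of the shared embedding layer,
# replacing the listener found at the config path "model.tok2vec".
nlp.replace_listeners("tok2vec", "ner", ["model.tok2vec"])
```

The same replacement can also be requested in the config when sourcing a component from an existing pipeline.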
Internally, it first fetches the `Model` of the original component that creates the embeddings:
```
tok2vec = self.get_pipe(tok2vec_name)
tok2vec_model = tok2vec.model
```
This is either a [`Tok2Vec` model](https://github.com/explosion/spaCy/blob/master/spacy/ml/models/tok2vec.py) or a
[`TransformerModel`](https://github.com/explosion/spacy-transformers/blob/master/spacy_transformers/layers/transformer_model.py).
In the case of the `tok2vec`, this model can be copied as-is into the configuration and architecture of the
downstream component. However, for the `transformer`, this doesn't work.
The reason is that the `TransformerListener` architecture chains the listener with
[`trfs2arrays`](https://github.com/explosion/spacy-transformers/blob/master/spacy_transformers/layers/trfs2arrays.py):
```
model = chain(
    TransformerListener(upstream_name=upstream),
    trfs2arrays(pooling, grad_factor),
)
```
but the standalone `Tok2VecTransformer` has an additional `split_trf_batch` chained in between the model
and `trfs2arrays`:
```
model = chain(
    TransformerModel(name, get_spans, tokenizer_config),
    split_trf_batch(),
    trfs2arrays(pooling, grad_factor),
)
```
So you can't just take the model from the listener and drop it into the component internally. You need to
adjust both the model and the config. To facilitate this, `nlp.replace_listeners()` will check whether additional
[functions](https://github.com/explosion/spacy-transformers/blob/master/spacy_transformers/layers/_util.py) are
[defined](https://github.com/explosion/spacy-transformers/blob/master/spacy_transformers/layers/transformer_model.py)
in `model.attrs`, and if so, it will essentially call these to make the appropriate changes:
```
replace_func = tok2vec_model.attrs["replace_listener_cfg"]
new_config = replace_func(tok2vec_cfg["model"], pipe_cfg["model"]["tok2vec"])
...
new_model = tok2vec_model.attrs["replace_listener"](new_model)
```
The new config and model are then properly stored on the `nlp` object.
Note that this functionality (running the replacement for a transformer listener) was broken prior to
`spacy-transformers` 1.0.5.

View File

@ -0,0 +1,216 @@
# StringStore & Vocab
> Reference: `spacy/strings.pyx`
> Reference: `spacy/vocab.pyx`
## Overview
spaCy represents most strings internally using a `uint64` in Cython, which
corresponds to a hash. The magic required to make this largely transparent is
handled by the `StringStore`, and is integrated into the pipelines using the
`Vocab`, which also connects it to some other information.
These are mostly internal details that average library users should never have
to think about. On the other hand, when developing a component it's normal to
interact with the Vocab for lexeme data or word vectors, and it's not unusual
to add labels to the `StringStore`.
## StringStore
### Overview
The `StringStore` is a `cdef class` that looks a bit like a two-way dictionary,
though it is not a subclass of anything in particular.
The main functionality of the `StringStore` is that `__getitem__` converts
hashes into strings or strings into hashes.
The full details of the conversion are complicated. Normally you shouldn't have
to worry about them, but the first applicable case below determines the
return value:
1. 0 and the empty string are special cased to each other
2. internal symbols use a lookup table (`SYMBOLS_BY_STR`)
3. normal strings or bytes are hashed
4. internal symbol IDs in `SYMBOLS_BY_INT` are handled
5. anything not yet handled is used as a hash to lookup a string
For the symbol enums, see [`symbols.pxd`](https://github.com/explosion/spaCy/blob/master/spacy/symbols.pxd).
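A small illustration of the special cases above:

```
from spacy.strings import StringStore
from spacy.symbols import NOUN

ss = StringStore()
# Case 1: 0 and the empty string are special-cased to each other
assert ss[""] == 0
assert ss[0] == ""
# Cases 2 and 4: internal symbols resolve through lookup tables, not hashing
assert ss["NOUN"] == NOUN
assert ss[NOUN] == "NOUN"
```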
Almost all strings in spaCy are stored in the `StringStore`. This naturally
includes tokens, but also includes things like labels (not just NER/POS/dep,
but also categories etc.), lemmas, lowercase forms, word shapes, and so on. One
of the main results of this is that tokens can be represented by a compact C
struct ([`LexemeC`](https://spacy.io/api/cython-structs#lexemec)/[`TokenC`](https://github.com/explosion/spaCy/issues/4854)) that mostly consists of string hashes. This also means that converting
input for the models is straightforward, and there's no token-mapping step
like in many machine learning frameworks. Additionally, because the token IDs
in spaCy are based on hashes, they are consistent across environments or
models.
One pattern you'll see a lot in spaCy APIs is that `something.value` returns an
`int` and `something.value_` returns a string. That's implemented using the
`StringStore`. Typically the `int` is stored in a C struct and the string is
generated via a property that calls into the `StringStore` with the `int`.
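For example, with a blank pipeline:

```
import spacy

nlp = spacy.blank("en")
token = nlp("ginger")[0]
# `token.orth` is the uint64 hash stored in the underlying C struct;
# `token.orth_` is the string, resolved on access via the StringStore.
assert isinstance(token.orth, int)
assert token.orth_ == "ginger"
assert nlp.vocab.strings[token.orth] == "ginger"
```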
Besides `__getitem__`, the `StringStore` has functions to return specifically a
string or specifically a hash, regardless of whether the input was a string or
hash to begin with, though these are only used occasionally.
### Implementation Details: Hashes and Allocations
Hashes are 64-bit and are computed using [murmurhash][] on UTF-8 bytes. There is no
mechanism for detecting and avoiding collisions. To date there has never been a
reproducible collision or user report about any related issues.
[murmurhash]: https://github.com/explosion/murmurhash
The empty string is not hashed; it's just converted to/from 0.
A small number of strings use indices into a lookup table (so low integers)
rather than hashes. This is mostly Universal Dependencies labels or other
strings considered "core" in spaCy. This was critical in v1, which hadn't
introduced hashing yet. Since v2 it's important for items in `spacy.attrs`,
especially lexeme flags, but is otherwise only maintained for backwards
compatibility.
You can call `strings["mystring"]` with a string the `StringStore` has never seen
before and it will return a hash. But in order to do the reverse operation, you
need to call `strings.add("mystring")` first. Without a call to `add` the
string will not be interned.
Example:
```
from spacy.strings import StringStore
ss = StringStore()
hashval = ss["spacy"] # 10639093010105930009
try:
    # this won't work
    ss[hashval]
except KeyError:
    print(f"key {hashval} unknown in the StringStore.")

ss.add("spacy")
assert ss[hashval] == "spacy"  # it works now

# There is no `.keys` property, but you can iterate over keys
# The empty string will never be in the list of keys
for key in ss:
    print(key)
```
In normal use nothing is ever removed from the `StringStore`. In theory this
means that if you do something like iterate through all hex values of a certain
length you can have explosive memory usage. In practice this has never been an
issue. (Note that this is also different from using `sys.intern` to intern
Python strings, which does not guarantee they won't be garbage collected later.)
Strings are stored in the `StringStore` in a peculiar way: each string uses a
union that is either an eight-byte `char[]` or a `char*`. Short strings are
stored directly in the `char[]`, while longer strings are stored in allocated
memory and prefixed with their length. This is a strategy to reduce indirection
and memory fragmentation. See `decode_Utf8Str` and `_allocate` in
`strings.pyx` for the implementation.
### When to Use the StringStore?
While you can ignore the `StringStore` in many cases, there are situations where
you should make use of it to avoid errors.
Any time you introduce a string that may be set on a `Doc` field that has a hash,
you should add the string to the `StringStore`. This mainly happens when adding
labels in components, but there are some other cases:
- syntax iterators, mainly `get_noun_chunks`
- external data used in components, like the `KnowledgeBase` in the `entity_linker`
- labels used in tests
## Vocab
The `Vocab` is a core component of a `Language` pipeline. Its main function is
to manage `Lexeme`s, which are structs that contain information about a token
that depends only on its surface form, without context. `Lexeme`s store much of
the data associated with `Token`s. As a side effect of this the `Vocab` also
manages the `StringStore` for a pipeline and a grab-bag of other data.
These are things stored in the vocab:
- `Lexeme`s
- `StringStore`
- `Morphology`: manages info used in `MorphAnalysis` objects
- `vectors`: basically a dict for word vectors
- `lookups`: language specific data like lemmas
- `writing_system`: language specific metadata
- `get_noun_chunks`: a syntax iterator
- lex attribute getters: functions like `is_punct`, set in language defaults
- `cfg`: **not** the pipeline config, this is mostly unused
- `_unused_object`: Formerly an unused object, kept around until v4 for compatibility
Some of these, like the Morphology and Vectors, are complex enough that they
need their own explanations. Here we'll just look at Vocab-specific items.
### Lexemes
A `Lexeme` is a type that mainly wraps a `LexemeC`, a struct consisting of ints
that identify various context-free token attributes. Lexemes are the core data
of the `Vocab`, and can be accessed using `__getitem__` on the `Vocab`. The memory
for storing `LexemeC` objects is managed by a pool that belongs to the `Vocab`.
Note that `__getitem__` on the `Vocab` works much like the `StringStore`, in
that it accepts a hash or id, with one important difference: if you do a lookup
using a string, that value is added to the `StringStore` automatically.
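A small illustration of that difference:

```
from spacy.vocab import Vocab

vocab = Vocab()
lex = vocab["ginger"]                        # string lookup: interned automatically
assert vocab.strings[lex.orth] == "ginger"   # ...so the hash resolves back to the text
assert vocab[lex.orth].orth_ == "ginger"     # hash lookup returns the same lexeme data
```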
The attributes stored in a `LexemeC` are:
- orth (the raw text)
- lower
- norm
- shape
- prefix
- suffix
Most of these are straightforward. All of them can be customized, and (except
`orth`) probably should be since the defaults are based on English, but in
practice this is rarely done at present.
### Lookups
This is basically a dict of dicts that stores lemmas and other language-specific
lookup data, implemented using a `Table` for each sub-dict.
A `Table` is a subclass of `OrderedDict` used for string-to-string data. It uses
Bloom filters to speed up misses and has some extra serialization features.
Tables are not used outside of the lookups.
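For example:

```
from spacy.lookups import Lookups

lookups = Lookups()
lookups.add_table("lemma_lookup", {"going": "go", "went": "go"})
table = lookups.get_table("lemma_lookup")
assert table["went"] == "go"                          # plain dict-style access
assert table.get("missing", "missing") == "missing"   # misses are cheap thanks to the Bloom filter
```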
### Lex Attribute Getters
Lexical Attribute Getters like `is_punct` are defined on a per-language basis,
much like lookups, but take the form of functions rather than string-to-string
dicts, so they're stored separately.
### Writing System
This is a dict with three attributes:
- `direction`: ltr or rtl (default ltr)
- `has_case`: bool (default `True`)
- `has_letters`: bool (default `True`, `False` only for CJK for now)
Currently these are not used much - the main use is that `direction` is used in
visualizers, though `rtl` doesn't quite work (see
[#4854](https://github.com/explosion/spaCy/issues/4854)). In the future they
could be used when choosing hyperparameters for subwords, controlling word
shape generation, and similar tasks.
### Other Vocab Members
The Vocab is kind of the default place to store things from `Language.defaults`
that don't belong to the Tokenizer. The following properties are in the Vocab
just because they don't have anywhere else to go.
- `get_noun_chunks`
- `cfg`: This is a dict that just stores `oov_prob` (hardcoded to `-20`)
- `_unused_object`: Leftover C member, should be removed in next major version

View File

@ -5,7 +5,7 @@ requires = [
"cymem>=2.0.2,<2.1.0",
"preshed>=3.0.2,<3.1.0",
"murmurhash>=0.28.0,<1.1.0",
"thinc>=8.0.8,<8.1.0",
"thinc>=8.0.10,<8.1.0",
"blis>=0.4.0,<0.8.0",
"pathy",
"numpy>=1.15.0",

View File

@ -1,15 +1,15 @@
# Our libraries
spacy-legacy>=3.0.7,<3.1.0
spacy-legacy>=3.0.8,<3.1.0
cymem>=2.0.2,<2.1.0
preshed>=3.0.2,<3.1.0
thinc>=8.0.8,<8.1.0
thinc>=8.0.10,<8.1.0
blis>=0.4.0,<0.8.0
ml_datasets>=0.2.0,<0.3.0
murmurhash>=0.28.0,<1.1.0
wasabi>=0.8.1,<1.1.0
srsly>=2.4.1,<3.0.0
catalogue>=2.0.4,<2.1.0
typer>=0.3.0,<0.4.0
catalogue>=2.0.6,<2.1.0
typer>=0.3.0,<0.5.0
pathy>=0.3.5
# Third party dependencies
numpy>=1.15.0

View File

@ -37,19 +37,19 @@ setup_requires =
cymem>=2.0.2,<2.1.0
preshed>=3.0.2,<3.1.0
murmurhash>=0.28.0,<1.1.0
thinc>=8.0.8,<8.1.0
thinc>=8.0.10,<8.1.0
install_requires =
# Our libraries
spacy-legacy>=3.0.7,<3.1.0
spacy-legacy>=3.0.8,<3.1.0
murmurhash>=0.28.0,<1.1.0
cymem>=2.0.2,<2.1.0
preshed>=3.0.2,<3.1.0
thinc>=8.0.8,<8.1.0
thinc>=8.0.9,<8.1.0
blis>=0.4.0,<0.8.0
wasabi>=0.8.1,<1.1.0
srsly>=2.4.1,<3.0.0
catalogue>=2.0.4,<2.1.0
typer>=0.3.0,<0.4.0
catalogue>=2.0.6,<2.1.0
typer>=0.3.0,<0.5.0
pathy>=0.3.5
# Third-party dependencies
tqdm>=4.38.0,<5.0.0

View File

@ -1,6 +1,6 @@
# fmt: off
__title__ = "spacy"
__version__ = "3.1.2"
__version__ = "3.1.3"
__download_url__ = "https://github.com/explosion/spacy-models/releases/download"
__compatibility__ = "https://raw.githubusercontent.com/explosion/spacy-models/master/compatibility.json"
__projects__ = "https://github.com/explosion/projects"

View File

@ -397,7 +397,11 @@ def git_checkout(
run_command(cmd, capture=True)
# We need Path(name) to make sure we also support subdirectories
try:
shutil.copytree(str(tmp_dir / Path(subpath)), str(dest))
source_path = tmp_dir / Path(subpath)
if not is_subpath_of(tmp_dir, source_path):
err = f"'{subpath}' is a path outside of the cloned repository."
msg.fail(err, repo, exits=1)
shutil.copytree(str(source_path), str(dest))
except FileNotFoundError:
err = f"Can't clone {subpath}. Make sure the directory exists in the repo (branch '{branch}')"
msg.fail(err, repo, exits=1)
@ -445,8 +449,14 @@ def git_sparse_checkout(repo, subpath, dest, branch):
# And finally, we can checkout our subpath
cmd = f"git -C {tmp_dir} checkout {branch} {subpath}"
run_command(cmd, capture=True)
# We need Path(name) to make sure we also support subdirectories
shutil.move(str(tmp_dir / Path(subpath)), str(dest))
# Get a subdirectory of the cloned path, if appropriate
source_path = tmp_dir / Path(subpath)
if not is_subpath_of(tmp_dir, source_path):
err = f"'{subpath}' is a path outside of the cloned repository."
msg.fail(err, repo, exits=1)
shutil.move(str(source_path), str(dest))
def get_git_version(
@ -477,6 +487,19 @@ def _http_to_git(repo: str) -> str:
return repo
def is_subpath_of(parent, child):
"""
Check whether `child` is a path contained within `parent`.
"""
# Based on https://stackoverflow.com/a/37095733 .
# In Python 3.9, the `Path.is_relative_to()` method will supplant this, so
# we can stop using crusty old os.path functions.
parent_realpath = os.path.realpath(parent)
child_realpath = os.path.realpath(child)
return os.path.commonpath([parent_realpath, child_realpath]) == parent_realpath
def string_to_list(value: str, intify: bool = False) -> Union[List[str], List[int]]:
"""Parse a comma-separated string to a list and account for various
formatting options. Mostly used to handle CLI arguments that take a list of

View File

@ -200,17 +200,21 @@ def get_third_party_dependencies(
exclude (list): List of packages to exclude (e.g. that already exist in meta).
RETURNS (list): The versioned requirements.
"""
own_packages = ("spacy", "spacy-nightly", "thinc", "srsly")
own_packages = ("spacy", "spacy-legacy", "spacy-nightly", "thinc", "srsly")
distributions = util.packages_distributions()
funcs = defaultdict(set)
for path, value in util.walk_dict(config):
if path[-1].startswith("@"): # collect all function references by registry
funcs[path[-1][1:]].add(value)
# We only want to look at runtime-relevant sections, not [training] or [initialize]
for section in ("nlp", "components"):
for path, value in util.walk_dict(config[section]):
if path[-1].startswith("@"): # collect all function references by registry
funcs[path[-1][1:]].add(value)
for component in config.get("components", {}).values():
if "factory" in component:
funcs["factories"].add(component["factory"])
modules = set()
for reg_name, func_names in funcs.items():
sub_registry = getattr(util.registry, reg_name)
for func_name in func_names:
func_info = sub_registry.find(func_name)
func_info = util.registry.find(reg_name, func_name)
module_name = func_info.get("module")
if module_name: # the code is part of a module, not a --code file
modules.add(func_info["module"].split(".")[0])

View File

@ -59,6 +59,15 @@ def project_assets(project_dir: Path, *, sparse_checkout: bool = False) -> None:
shutil.rmtree(dest)
else:
dest.unlink()
if "repo" not in asset["git"] or asset["git"]["repo"] is None:
msg.fail(
"A git asset must include 'repo', the repository address.", exits=1
)
if "path" not in asset["git"] or asset["git"]["path"] is None:
msg.fail(
"A git asset must include 'path' - use \"\" to get the entire repository.",
exits=1,
)
git_checkout(
asset["git"]["repo"],
asset["git"]["path"],

View File

@ -57,6 +57,7 @@ def project_run(
project_dir (Path): Path to project directory.
subcommand (str): Name of command to run.
overrides (Dict[str, Any]): Optional config overrides.
force (bool): Force re-running, even if nothing changed.
dry (bool): Perform a dry run and don't execute commands.
capture (bool): Whether to capture the output and errors of individual commands.
@ -72,7 +73,14 @@ def project_run(
if subcommand in workflows:
msg.info(f"Running workflow '{subcommand}'")
for cmd in workflows[subcommand]:
project_run(project_dir, cmd, force=force, dry=dry, capture=capture)
project_run(
project_dir,
cmd,
overrides=overrides,
force=force,
dry=dry,
capture=capture,
)
else:
cmd = commands[subcommand]
for dep in cmd.get("deps", []):

View File

@ -869,6 +869,10 @@ class Errors:
E1019 = ("`noun_chunks` requires the pos tagging, which requires a "
"statistical model to be installed and loaded. For more info, see "
"the documentation:\nhttps://spacy.io/usage/models")
E1020 = ("No `epoch_resume` value specified and could not infer one from "
"filename. Specify an epoch to resume from.")
E1021 = ("`pos` value \"{pp}\" is not a valid Universal Dependencies tag. "
"Non-UD tags should use the `tag` property.")
# Deprecated model shortcuts, only used in errors and warnings

View File

@ -82,7 +82,8 @@ for orth in [
for verb in [
"a",
"est" "semble",
"est",
"semble",
"indique",
"moque",
"passe",

View File

@ -281,28 +281,19 @@ cdef class Matcher:
final_matches.append((key, *match))
# Mark tokens that have matched
memset(&matched[start], 1, span_len * sizeof(matched[0]))
if with_alignments:
final_matches_with_alignments = final_matches
final_matches = [(key, start, end) for key, start, end, alignments in final_matches]
# perform the callbacks on the filtered set of results
for i, (key, start, end) in enumerate(final_matches):
on_match = self._callbacks.get(key, None)
if on_match is not None:
on_match(self, doc, i, final_matches)
if as_spans:
spans = []
for key, start, end in final_matches:
final_results = []
for key, start, end, *_ in final_matches:
if isinstance(doclike, Span):
start += doclike.start
end += doclike.start
spans.append(Span(doc, start, end, label=key))
return spans
final_results.append(Span(doc, start, end, label=key))
elif with_alignments:
# convert alignments List[Dict[str, int]] --> List[int]
final_matches = []
# when multiple alignment (belongs to the same length) is found,
# keeps the alignment that has largest token_idx
for key, start, end, alignments in final_matches_with_alignments:
final_results = []
for key, start, end, alignments in final_matches:
sorted_alignments = sorted(alignments, key=lambda x: (x['length'], x['token_idx']), reverse=False)
alignments = [0] * (end-start)
for align in sorted_alignments:
@ -311,10 +302,16 @@ cdef class Matcher:
# Since alignments are sorted in order of (length, token_idx)
# this overwrites smaller token_idx when they have same length.
alignments[align['length']] = align['token_idx']
final_matches.append((key, start, end, alignments))
return final_matches
final_results.append((key, start, end, alignments))
final_matches = final_results # for callbacks
else:
return final_matches
final_results = final_matches
# perform the callbacks on the filtered set of results
for i, (key, *_) in enumerate(final_matches):
on_match = self._callbacks.get(key, None)
if on_match is not None:
on_match(self, doc, i, final_matches)
return final_results
def _normalize_key(self, key):
if isinstance(key, basestring):

View File

@ -398,7 +398,9 @@ class SpanCategorizer(TrainablePipe):
pass
def _get_aligned_spans(self, eg: Example):
return eg.get_aligned_spans_y2x(eg.reference.spans.get(self.key, []), allow_overlap=True)
return eg.get_aligned_spans_y2x(
eg.reference.spans.get(self.key, []), allow_overlap=True
)
def _make_span_group(
self, doc: Doc, indices: Ints2d, scores: Floats2d, labels: List[str]

View File

@ -70,3 +70,10 @@ def test_create_with_heads_and_no_deps(vocab):
heads = list(range(len(words)))
with pytest.raises(ValueError):
Doc(vocab, words=words, heads=heads)
def test_create_invalid_pos(vocab):
words = "I like ginger".split()
pos = "QQ ZZ XX".split()
with pytest.raises(ValueError):
Doc(vocab, words=words, pos=pos)

View File

@ -203,6 +203,12 @@ def test_set_pos():
assert doc[1].pos_ == "VERB"
def test_set_invalid_pos():
doc = Doc(Vocab(), words=["hello", "world"])
with pytest.raises(ValueError):
doc[0].pos_ = "blah"
def test_tokens_sent(doc):
"""Test token.sent property"""
assert len(list(doc.sents)) == 3

View File

@ -576,6 +576,16 @@ def test_matcher_callback(en_vocab):
mock.assert_called_once_with(matcher, doc, 0, matches)
def test_matcher_callback_with_alignments(en_vocab):
mock = Mock()
matcher = Matcher(en_vocab)
pattern = [{"ORTH": "test"}]
matcher.add("Rule", [pattern], on_match=mock)
doc = Doc(en_vocab, words=["This", "is", "a", "test", "."])
matches = matcher(doc, with_alignments=True)
mock.assert_called_once_with(matcher, doc, 0, matches)
def test_matcher_span(matcher):
text = "JavaScript is good but Java is better"
doc = Doc(matcher.vocab, words=text.split())

View File

@ -85,7 +85,12 @@ def test_doc_gc():
spancat = nlp.add_pipe("spancat", config={"spans_key": SPAN_KEY})
spancat.add_label("PERSON")
nlp.initialize()
texts = ["Just a sentence.", "I like London and Berlin", "I like Berlin", "I eat ham."]
texts = [
"Just a sentence.",
"I like London and Berlin",
"I like Berlin",
"I eat ham.",
]
all_spans = [doc.spans for doc in nlp.pipe(texts)]
for text, spangroups in zip(texts, all_spans):
assert isinstance(spangroups, SpanGroups)
@ -338,7 +343,11 @@ def test_overfitting_IO_overlapping():
assert len(spans) == 3
assert len(spans.attrs["scores"]) == 3
assert min(spans.attrs["scores"]) > 0.9
assert set([span.text for span in spans]) == {"London", "Berlin", "London and Berlin"}
assert set([span.text for span in spans]) == {
"London",
"Berlin",
"London and Berlin",
}
assert set([span.label_ for span in spans]) == {"LOC", "DOUBLE_LOC"}
# Also test the results are still the same after IO
@ -350,5 +359,9 @@ def test_overfitting_IO_overlapping():
assert len(spans2) == 3
assert len(spans2.attrs["scores"]) == 3
assert min(spans2.attrs["scores"]) > 0.9
assert set([span.text for span in spans2]) == {"London", "Berlin", "London and Berlin"}
assert set([span.text for span in spans2]) == {
"London",
"Berlin",
"London and Berlin",
}
assert set([span.label_ for span in spans2]) == {"LOC", "DOUBLE_LOC"}

View File

@ -9,6 +9,7 @@ from spacy.cli import info
from spacy.cli.init_config import init_config, RECOMMENDATIONS
from spacy.cli._util import validate_project_commands, parse_config_overrides
from spacy.cli._util import load_project_config, substitute_project_variables
from spacy.cli._util import is_subpath_of
from spacy.cli._util import string_to_list
from spacy import about
from spacy.util import get_minor_version
@ -535,8 +536,41 @@ def test_init_labels(component_name):
assert len(nlp2.get_pipe(component_name).labels) == 4
def test_get_third_party_dependencies_runs():
def test_get_third_party_dependencies():
# We can't easily test the detection of third-party packages here, but we
# can at least make sure that the function and its importlib magic runs.
nlp = Dutch()
# Test with component factory based on Cython module
nlp.add_pipe("tagger")
assert get_third_party_dependencies(nlp.config) == []
# Test with legacy function
nlp = Dutch()
nlp.add_pipe(
"textcat",
config={
"model": {
# Do not update from legacy architecture spacy.TextCatBOW.v1
"@architectures": "spacy.TextCatBOW.v1",
"exclusive_classes": True,
"ngram_size": 1,
"no_output_layer": False,
}
},
)
get_third_party_dependencies(nlp.config) == []
@pytest.mark.parametrize(
"parent,child,expected",
[
("/tmp", "/tmp", True),
("/tmp", "/", False),
("/tmp", "/tmp/subdir", True),
("/tmp", "/tmpdir", False),
("/tmp", "/tmp/subdir/..", True),
("/tmp", "/tmp/..", False),
],
)
def test_is_subpath_of(parent, child, expected):
assert is_subpath_of(parent, child) == expected

View File

@ -30,6 +30,7 @@ from ..compat import copy_reg, pickle
from ..errors import Errors, Warnings
from ..morphology import Morphology
from .. import util
from .. import parts_of_speech
from .underscore import Underscore, get_ext_args
from ._retokenize import Retokenizer
from ._serialize import ALL_ATTRS as DOCBIN_ALL_ATTRS
@ -285,6 +286,10 @@ cdef class Doc:
sent_starts[i] = -1
elif sent_starts[i] is None or sent_starts[i] not in [-1, 0, 1]:
sent_starts[i] = 0
if pos is not None:
for pp in set(pos):
if pp not in parts_of_speech.IDS:
raise ValueError(Errors.E1021.format(pp=pp))
ent_iobs = None
ent_types = None
if ents is not None:

View File

@ -867,6 +867,8 @@ cdef class Token:
return parts_of_speech.NAMES[self.c.pos]
def __set__(self, pos_name):
if pos_name not in parts_of_speech.IDS:
raise ValueError(Errors.E1021.format(pp=pos_name))
self.c.pos = parts_of_speech.IDS[pos_name]
property tag_:

View File

@ -177,3 +177,89 @@ def wandb_logger(
return log_step, finalize
return setup_logger
@registry.loggers("spacy.WandbLogger.v3")
def wandb_logger(
project_name: str,
remove_config_values: List[str] = [],
model_log_interval: Optional[int] = None,
log_dataset_dir: Optional[str] = None,
entity: Optional[str] = None,
run_name: Optional[str] = None,
):
try:
import wandb
# test that these are available
from wandb import init, log, join # noqa: F401
except ImportError:
raise ImportError(Errors.E880)
console = console_logger(progress_bar=False)
def setup_logger(
nlp: "Language", stdout: IO = sys.stdout, stderr: IO = sys.stderr
) -> Tuple[Callable[[Dict[str, Any]], None], Callable[[], None]]:
config = nlp.config.interpolate()
config_dot = util.dict_to_dot(config)
for field in remove_config_values:
del config_dot[field]
config = util.dot_to_dict(config_dot)
run = wandb.init(
project=project_name, config=config, entity=entity, reinit=True
)
if run_name:
wandb.run.name = run_name
console_log_step, console_finalize = console(nlp, stdout, stderr)
def log_dir_artifact(
path: str,
name: str,
type: str,
metadata: Optional[Dict[str, Any]] = {},
aliases: Optional[List[str]] = [],
):
dataset_artifact = wandb.Artifact(name, type=type, metadata=metadata)
dataset_artifact.add_dir(path, name=name)
wandb.log_artifact(dataset_artifact, aliases=aliases)
if log_dataset_dir:
log_dir_artifact(path=log_dataset_dir, name="dataset", type="dataset")
def log_step(info: Optional[Dict[str, Any]]):
console_log_step(info)
if info is not None:
score = info["score"]
other_scores = info["other_scores"]
losses = info["losses"]
wandb.log({"score": score})
if losses:
wandb.log({f"loss_{k}": v for k, v in losses.items()})
if isinstance(other_scores, dict):
wandb.log(other_scores)
if model_log_interval and info.get("output_path"):
if info["step"] % model_log_interval == 0 and info["step"] != 0:
log_dir_artifact(
path=info["output_path"],
name="pipeline_" + run.id,
type="checkpoint",
metadata=info,
aliases=[
f"epoch {info['epoch']} step {info['step']}",
"latest",
"best"
if info["score"] == max(info["checkpoints"])[0]
else "",
],
)
def finalize() -> None:
console_finalize()
wandb.join()
return log_step, finalize
return setup_logger

View File

@ -41,10 +41,11 @@ def pretrain(
optimizer = P["optimizer"]
# Load in pretrained weights to resume from
if resume_path is not None:
_resume_model(model, resume_path, epoch_resume, silent=silent)
epoch_resume = _resume_model(model, resume_path, epoch_resume, silent=silent)
else:
# Without '--resume-path' the '--epoch-resume' argument is ignored
epoch_resume = 0
objective = model.attrs["loss"]
# TODO: move this to logger function?
tracker = ProgressTracker(frequency=10000)
@ -93,20 +94,25 @@ def ensure_docs(examples_or_docs: Iterable[Union[Doc, Example]]) -> List[Doc]:
def _resume_model(
model: Model, resume_path: Path, epoch_resume: int, silent: bool = True
) -> None:
) -> int:
msg = Printer(no_print=silent)
msg.info(f"Resume training tok2vec from: {resume_path}")
with resume_path.open("rb") as file_:
weights_data = file_.read()
model.get_ref("tok2vec").from_bytes(weights_data)
# Parse the epoch number from the given weight file
model_name = re.search(r"model\d+\.bin", str(resume_path))
if model_name:
# Default weight file name so read epoch_start from it by cutting off 'model' and '.bin'
epoch_resume = int(model_name.group(0)[5:][:-4]) + 1
msg.info(f"Resuming from epoch: {epoch_resume}")
else:
msg.info(f"Resuming from epoch: {epoch_resume}")
if epoch_resume is None:
# Parse the epoch number from the given weight file
model_name = re.search(r"model\d+\.bin", str(resume_path))
if model_name:
# Default weight file name so read epoch_start from it by cutting off 'model' and '.bin'
epoch_resume = int(model_name.group(0)[5:][:-4]) + 1
else:
# No epoch given and couldn't infer it
raise ValueError(Errors.E1020)
msg.info(f"Resuming from epoch: {epoch_resume}")
return epoch_resume
def make_update(

View File

@ -140,6 +140,32 @@ class registry(thinc.registry):
) from None
return func
@classmethod
def find(cls, registry_name: str, func_name: str) -> Callable:
"""Get info about a registered function from the registry."""
# We're overwriting this classmethod so we're able to provide more
# specific error messages and implement a fallback to spacy-legacy.
if not hasattr(cls, registry_name):
names = ", ".join(cls.get_registry_names()) or "none"
raise RegistryError(Errors.E892.format(name=registry_name, available=names))
reg = getattr(cls, registry_name)
try:
func_info = reg.find(func_name)
except RegistryError:
if func_name.startswith("spacy."):
legacy_name = func_name.replace("spacy.", "spacy-legacy.")
try:
return reg.find(legacy_name)
except catalogue.RegistryError:
pass
available = ", ".join(sorted(reg.get_all().keys())) or "none"
raise RegistryError(
Errors.E893.format(
name=func_name, reg_name=registry_name, available=available
)
) from None
return func_info
@classmethod
def has(cls, registry_name: str, func_name: str) -> bool:
"""Check whether a function is available in a registry."""

View File

@ -462,7 +462,7 @@ start decreasing across epochs.
</Accordion>
#### spacy.WandbLogger.v2 {#WandbLogger tag="registered function"}
#### spacy.WandbLogger.v3 {#WandbLogger tag="registered function"}
> #### Installation
>
@ -494,19 +494,21 @@ remain in the config file stored on your local system.
>
> ```ini
> [training.logger]
> @loggers = "spacy.WandbLogger.v2"
> @loggers = "spacy.WandbLogger.v3"
> project_name = "monitor_spacy_training"
> remove_config_values = ["paths.train", "paths.dev", "corpora.train.path", "corpora.dev.path"]
> log_dataset_dir = "corpus"
> model_log_interval = 1000
> ```
| Name | Description |
| ---------------------- | ------------------------------------------------------------------------------------------------------------------------------------- |
| `project_name` | The name of the project in the Weights & Biases interface. The project will be created automatically if it doesn't exist yet. ~~str~~ |
| `remove_config_values` | A list of values to include from the config before it is uploaded to W&B (default: empty). ~~List[str]~~ |
| `model_log_interval` | Steps to wait between logging model checkpoints to W&B dasboard (default: None). ~~Optional[int]~~ |
| `log_dataset_dir` | Directory containing dataset to be logged and versioned as W&B artifact (default: None). ~~Optional[str]~~ |
| Name | Description |
| ---------------------- | --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `project_name` | The name of the project in the Weights & Biases interface. The project will be created automatically if it doesn't exist yet. ~~str~~ |
| `remove_config_values` | A list of values to include from the config before it is uploaded to W&B (default: empty). ~~List[str]~~ |
| `model_log_interval` | Steps to wait between logging model checkpoints to W&B dasboard (default: None). ~~Optional[int]~~ |
| `log_dataset_dir` | Directory containing dataset to be logged and versioned as W&B artifact (default: None). ~~Optional[str]~~ |
| `run_name` | The name of the run. If you don't specify a run_name, the name will be created by wandb library. (default: None ). ~~Optional[str]~~ |
| `entity` | An entity is a username or team name where you're sending runs. If you don't specify an entity, the run will be sent to your default entity, which is usually your username. (default: None). ~~Optional[str]~~ |
<Project id="integrations/wandb">

View File

@ -291,7 +291,7 @@ files you need and not the whole repo.
| Name | Description |
| ------------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `dest` | The destination path to save the downloaded asset to (relative to the project directory), including the file name. |
| `git` | `repo`: The URL of the repo to download from.<br />`path`: Path of the file or directory to download, relative to the repo root.<br />`branch`: The branch to download from. Defaults to `"master"`. |
| `git` | `repo`: The URL of the repo to download from.<br />`path`: Path of the file or directory to download, relative to the repo root. "" specifies the root directory.<br />`branch`: The branch to download from. Defaults to `"master"`. |
| `checksum` | Optional checksum of the file. If provided, it will be used to verify that the file matches and downloads will be skipped if a local file with the same checksum already exists. |
| `description` | Optional asset description, used in [auto-generated docs](#custom-docs). |