diff --git a/extra/DEVELOPER_DOCS/Listeners.md b/extra/DEVELOPER_DOCS/Listeners.md new file mode 100644 index 000000000..3a71082e0 --- /dev/null +++ b/extra/DEVELOPER_DOCS/Listeners.md @@ -0,0 +1,220 @@ +# Listeners + +1. [Overview](#1-overview) +2. [Initialization](#2-initialization) + - [A. Linking listeners to the embedding component](#2a-linking-listeners-to-the-embedding-component) + - [B. Shape inference](#2b-shape-inference) +3. [Internal communication](#3-internal-communication) + - [A. During prediction](#3a-during-prediction) + - [B. During training](#3b-during-training) + - [C. Frozen components](#3c-frozen-components) +4. [Replacing listener with standalone](#4-replacing-listener-with-standalone) + +## 1. Overview + +Trainable spaCy components typically use some sort of `tok2vec` layer as part of the `model` definition. +This `tok2vec` layer produces embeddings and is either a standard `Tok2Vec` layer, or a Transformer-based one. +Both versions can be used either inline/standalone, which means that they are defined and used +by only one specific component (e.g. NER), or +[shared](https://spacy.io/usage/embeddings-transformers#embedding-layers), +in which case the embedding functionality becomes a separate component that can +feed embeddings to multiple components downstream, using a listener-pattern. + +| Type | Usage | Model Architecture | +| ------------- | ---------- | -------------------------------------------------------------------------------------------------- | +| `Tok2Vec` | standalone | [`spacy.Tok2Vec`](https://spacy.io/api/architectures#Tok2Vec) | +| `Tok2Vec` | listener | [`spacy.Tok2VecListener`](https://spacy.io/api/architectures#Tok2VecListener) | +| `Transformer` | standalone | [`spacy-transformers.Tok2VecTransformer`](https://spacy.io/api/architectures#Tok2VecTransformer) | +| `Transformer` | listener | [`spacy-transformers.TransformerListener`](https://spacy.io/api/architectures#TransformerListener) | + +Here we discuss the listener pattern and its implementation in code in more detail. + +## 2. Initialization + +### 2A. Linking listeners to the embedding component + +To allow sharing a `tok2vec` layer, a separate `tok2vec` component needs to be defined in the config: + +``` +[components.tok2vec] +factory = "tok2vec" + +[components.tok2vec.model] +@architectures = "spacy.Tok2Vec.v2" +``` + +A listener can then be set up by making sure the correct `upstream` name is defined, referring to the +name of the `tok2vec` component (which equals the factory name by default), or `*` as a wildcard: + +``` +[components.ner.model.tok2vec] +@architectures = "spacy.Tok2VecListener.v1" +upstream = "tok2vec" +``` + +When an [`nlp`](https://github.com/explosion/spaCy/blob/master/extra/DEVELOPER_DOCS/Language.md) object is +initialized or deserialized, it will make sure to link each `tok2vec` component to its listeners. This is +implemented in the method `nlp._link_components()` which loops over each +component in the pipeline and calls `find_listeners()` on a component if it's defined. +The [`tok2vec` component](https://github.com/explosion/spaCy/blob/master/spacy/pipeline/tok2vec.py)'s implementation +of this `find_listener()` method will specifically identify sublayers of a model definition that are of type +`Tok2VecListener` with a matching upstream name and will then add that listener to the internal `self.listener_map`. 
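A quick way to inspect the result of this linking step is to look at that map on a loaded pipeline. The snippet below assumes a pipeline that defines a shared `tok2vec` component (e.g. `en_core_web_sm`) and is only meant as an illustration:

```
import spacy

nlp = spacy.load("en_core_web_sm")  # any pipeline with a shared tok2vec component
tok2vec = nlp.get_pipe("tok2vec")

# After `nlp._link_components()` has run as part of loading, the listener map
# contains one entry per downstream component with a matching listener sublayer.
for pipe_name, listeners in tok2vec.listener_map.items():
    print(pipe_name, listeners)
```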
+ +If it's a Transformer-based pipeline, a +[`transformer` component](https://github.com/explosion/spacy-transformers/blob/master/spacy_transformers/pipeline_component.py) +has a similar implementation but its `find_listener()` function will specifically look for `TransformerListener` +sublayers of downstream components. + +### 2B. Shape inference + +Typically, the output dimension `nO` of a listener's model equals the `nO` (or `width`) of the upstream embedding layer. +For a standard `Tok2Vec`-based component, this is typically known up-front and defined as such in the config: + +``` +[components.ner.model.tok2vec] +@architectures = "spacy.Tok2VecListener.v1" +width = ${components.tok2vec.model.encode.width} +``` + +A `transformer` component however only knows its `nO` dimension after the HuggingFace transformer +is set with the function `model.attrs["set_transformer"]`, +[implemented](https://github.com/explosion/spacy-transformers/blob/master/spacy_transformers/layers/transformer_model.py) +by `set_pytorch_transformer`. +This is why, upon linking of the transformer listeners, the `transformer` component also makes sure to set +the listener's output dimension correctly. + +This shape inference mechanism also needs to happen with resumed/frozen components, which means that for some CLI +commands (`assemble` and `train`), we need to call `nlp._link_components` even before initializing the `nlp` +object. To cover all use-cases and avoid negative side effects, the code base ensures that performing the +linking twice is not harmful. + +## 3. Internal communication + +The internal communication between a listener and its downstream components is organized by sending and +receiving information across the components - either directly or implicitly. +The details are different depending on whether the pipeline is currently training, or predicting. +Either way, the `tok2vec` or `transformer` component always needs to run before the listener. + +### 3A. During prediction + +When the `Tok2Vec` pipeline component is called, its `predict()` method is executed to produce the results, +which are then stored by `set_annotations()` in the `doc.tensor` field of the document(s). +Similarly, the `Transformer` component stores the produced embeddings +in `doc._.trf_data`. Next, the `forward` pass of a +[`Tok2VecListener`](https://github.com/explosion/spaCy/blob/master/spacy/pipeline/tok2vec.py) +or a +[`TransformerListener`](https://github.com/explosion/spacy-transformers/blob/master/spacy_transformers/layers/listener.py) +accesses these fields on the `Doc` directly. Both listener implementations have a fallback mechanism for when these +properties were not set on the `Doc`: in that case an all-zero tensor is produced and returned. +We need this fallback mechanism to enable shape inference methods in Thinc, but the code +is slightly risky and at times might hide another bug - so it's a good spot to be aware of. + +### 3B. During training + +During training, the `update()` methods of the `Tok2Vec` & `Transformer` components don't necessarily set the +annotations on the `Doc` (though since 3.1 they can if they are part of the `annotating_components` list in the config). +Instead, we rely on a caching mechanism between the original embedding component and its listener. +Specifically, the produced embeddings are sent to the listeners by calling `listener.receive()` and uniquely +identifying the batch of documents with a `batch_id`. 
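As a rough sketch of this handshake (simplified, not the actual implementation; the real logic lives in the `update()` method of the embedding component and in the listener's forward pass):

```
def tok2vec_update(tok2vec_model, listener, docs):
    # Simplified sketch for a single listener: compute the embeddings once per
    # batch and cache them on the listener, keyed by a batch identifier.
    batch_id = sum(id(doc) for doc in docs)
    tokvecs, backprop = tok2vec_model.begin_update(docs)
    listener.receive(batch_id, tokvecs, backprop)
```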
This `receive()` call also sends the appropriate `backprop` +call to ensure that gradients from the downstream component flow back to the trainable `Tok2Vec` or `Transformer` +network. + +We rely on the `nlp` object properly batching the data and sending each batch through the pipeline in sequence, +which means that only one such batch needs to be kept in memory for each listener. +When the downstream component runs and the listener should produce embeddings, it accesses the batch in memory, +runs the backpropagation, and returns the results and the gradients. + +There are two ways in which this mechanism can fail, both are detected by `verify_inputs()`: + +- `E953` if a different batch is in memory than the requested one - signaling some kind of out-of-sync state of the + training pipeline. +- `E954` if no batch is in memory at all - signaling that the pipeline is probably not set up correctly. + +#### Training with multiple listeners + +One `Tok2Vec` or `Transformer` component may be listened to by several downstream components, e.g. +a tagger and a parser could be sharing the same embeddings. In this case, we need to be careful about how we do +the backpropagation. When the `Tok2Vec` or `Transformer` sends out data to the listener with `receive()`, they will +send an `accumulate_gradient` function call to all listeners, except the last one. This function will keep track +of the gradients received so far. Only the final listener in the pipeline will get an actual `backprop` call that +will initiate the backpropagation of the `tok2vec` or `transformer` model with the accumulated gradients. + +### 3C. Frozen components + +The listener pattern can get particularly tricky in combination with frozen components. To detect components +with listeners that are not frozen consistently, `init_nlp()` (which is called by `spacy train`) goes through +the listeners and their upstream components and warns in two scenarios. + +#### The Tok2Vec or Transformer is frozen + +If the `Tok2Vec` or `Transformer` was already trained, +e.g. by [pretraining](https://spacy.io/usage/embeddings-transformers#pretraining), +it could be a valid use-case to freeze the embedding architecture and only train downstream components such +as a tagger or a parser. This used to be impossible before 3.1, but has become supported since then by putting the +embedding component in the [`annotating_components`](https://spacy.io/usage/training#annotating-components) +list of the config. This works like any other "annotating component" because it relies on the `Doc` attributes. + +However, if the `Tok2Vec` or `Transformer` is frozen, and not present in `annotating_components`, and a related +listener isn't frozen, then a `W086` warning is shown and further training of the pipeline will likely end with `E954`. + +#### The upstream component is frozen + +If an upstream component is frozen but the underlying `Tok2Vec` or `Transformer` isn't, the performance of +the upstream component will be degraded after training. In this case, a `W087` warning is shown, explaining +how to use the `replace_listeners` functionality to prevent this problem. + +## 4. Replacing listener with standalone + +The [`replace_listeners`](https://spacy.io/api/language#replace_listeners) functionality changes the architecture +of a downstream component from using a listener pattern to a standalone `tok2vec` or `transformer` layer, +effectively making the downstream component independent of any other components in the pipeline. 
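In user code, the most direct way to trigger this is the [`Language.replace_listeners`](https://spacy.io/api/language#replace_listeners) method; the pipeline and component names below are just an example:

```
import spacy

nlp = spacy.load("en_core_web_sm")  # any pipeline where "ner" listens to a shared "tok2vec"
# Copy the shared embedding layer into the ner component so that it no longer
# listens to the separate tok2vec component.
nlp.replace_listeners("tok2vec", "ner", ["model.tok2vec"])
```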
+It is implemented by `nlp.replace_listeners()` and typically executed by `nlp.from_config()`. +First, it fetches the original `Model` of the original component that creates the embeddings: + +``` +tok2vec = self.get_pipe(tok2vec_name) +tok2vec_model = tok2vec.model +``` + +Which is either a [`Tok2Vec` model](https://github.com/explosion/spaCy/blob/master/spacy/ml/models/tok2vec.py) or a +[`TransformerModel`](https://github.com/explosion/spacy-transformers/blob/master/spacy_transformers/layers/transformer_model.py). + +In the case of the `tok2vec`, this model can be copied as-is into the configuration and architecture of the +downstream component. However, for the `transformer`, this doesn't work. +The reason is that the `TransformerListener` architecture chains the listener with +[`trfs2arrays`](https://github.com/explosion/spacy-transformers/blob/master/spacy_transformers/layers/trfs2arrays.py): + +``` +model = chain( + TransformerListener(upstream_name=upstream) + trfs2arrays(pooling, grad_factor), +) +``` + +but the standalone `Tok2VecTransformer` has an additional `split_trf_batch` chained inbetween the model +and `trfs2arrays`: + +``` +model = chain( + TransformerModel(name, get_spans, tokenizer_config), + split_trf_batch(), + trfs2arrays(pooling, grad_factor), +) +``` + +So you can't just take the model from the listener, and drop that into the component internally. You need to +adjust the model and the config. To facilitate this, `nlp.replace_listeners()` will check whether additional +[functions](https://github.com/explosion/spacy-transformers/blob/master/spacy_transformers/layers/_util.py) are +[defined](https://github.com/explosion/spacy-transformers/blob/master/spacy_transformers/layers/transformer_model.py) +in `model.attrs`, and if so, it will essentially call these to make the appropriate changes: + +``` +replace_func = tok2vec_model.attrs["replace_listener_cfg"] +new_config = replace_func(tok2vec_cfg["model"], pipe_cfg["model"]["tok2vec"]) +... +new_model = tok2vec_model.attrs["replace_listener"](new_model) +``` + +The new config and model are then properly stored on the `nlp` object. +Note that this functionality (running the replacement for a transformer listener) was broken prior to +`spacy-transformers` 1.0.5. diff --git a/extra/DEVELOPER_DOCS/StringStore-Vocab.md b/extra/DEVELOPER_DOCS/StringStore-Vocab.md new file mode 100644 index 000000000..866ba2aae --- /dev/null +++ b/extra/DEVELOPER_DOCS/StringStore-Vocab.md @@ -0,0 +1,216 @@ +# StringStore & Vocab + +> Reference: `spacy/strings.pyx` +> Reference: `spacy/vocab.pyx` + +## Overview + +spaCy represents mosts strings internally using a `uint64` in Cython which +corresponds to a hash. The magic required to make this largely transparent is +handled by the `StringStore`, and is integrated into the pipelines using the +`Vocab`, which also connects it to some other information. + +These are mostly internal details that average library users should never have +to think about. On the other hand, when developing a component it's normal to +interact with the Vocab for lexeme data or word vectors, and it's not unusual +to add labels to the `StringStore`. + +## StringStore + +### Overview + +The `StringStore` is a `cdef class` that looks a bit like a two-way dictionary, +though it is not a subclass of anything in particular. + +The main functionality of the `StringStore` is that `__getitem__` converts +hashes into strings or strings into hashes. + +The full details of the conversion are complicated. 
Normally you shouldn't have +to worry about them, but the first applicable case here is used to get the +return value: + +1. 0 and the empty string are special cased to each other +2. internal symbols use a lookup table (`SYMBOLS_BY_STR`) +3. normal strings or bytes are hashed +4. internal symbol IDs in `SYMBOLS_BY_INT` are handled +5. anything not yet handled is used as a hash to lookup a string + +For the symbol enums, see [`symbols.pxd`](https://github.com/explosion/spaCy/blob/master/spacy/symbols.pxd). + +Almost all strings in spaCy are stored in the `StringStore`. This naturally +includes tokens, but also includes things like labels (not just NER/POS/dep, +but also categories etc.), lemmas, lowercase forms, word shapes, and so on. One +of the main results of this is that tokens can be represented by a compact C +struct ([`LexemeC`](https://spacy.io/api/cython-structs#lexemec)/[`TokenC`](https://github.com/explosion/spaCy/issues/4854)) that mostly consists of string hashes. This also means that converting +input for the models is straightforward, and there's not a token mapping step +like in many machine learning frameworks. Additionally, because the token IDs +in spaCy are based on hashes, they are consistent across environments or +models. + +One pattern you'll see a lot in spaCy APIs is that `something.value` returns an +`int` and `something.value_` returns a string. That's implemented using the +`StringStore`. Typically the `int` is stored in a C struct and the string is +generated via a property that calls into the `StringStore` with the `int`. + +Besides `__getitem__`, the `StringStore` has functions to return specifically a +string or specifically a hash, regardless of whether the input was a string or +hash to begin with, though these are only used occasionally. + +### Implementation Details: Hashes and Allocations + +Hashes are 64-bit and are computed using [murmurhash][] on UTF-8 bytes. There is no +mechanism for detecting and avoiding collisions. To date there has never been a +reproducible collision or user report about any related issues. + +[murmurhash]: https://github.com/explosion/murmurhash + +The empty string is not hashed, it's just converted to/from 0. + +A small number of strings use indices into a lookup table (so low integers) +rather than hashes. This is mostly Universal Dependencies labels or other +strings considered "core" in spaCy. This was critical in v1, which hadn't +introduced hashing yet. Since v2 it's important for items in `spacy.attrs`, +especially lexeme flags, but is otherwise only maintained for backwards +compatibility. + +You can call `strings["mystring"]` with a string the `StringStore` has never seen +before and it will return a hash. But in order to do the reverse operation, you +need to call `strings.add("mystring")` first. Without a call to `add` the +string will not be interned. + +Example: + +``` +from spacy.strings import StringStore + +ss = StringStore() +hashval = ss["spacy"] # 10639093010105930009 +try: + # this won't work + ss[hashval] +except KeyError: + print(f"key {hashval} unknown in the StringStore.") + +ss.add("spacy") +assert ss[hashval] == "spacy" # it works now + +# There is no `.keys` property, but you can iterate over keys +# The empty string will never be in the list of keys +for key in ss: + print(key) +``` + +In normal use nothing is ever removed from the `StringStore`. In theory this +means that if you do something like iterate through all hex values of a certain +length you can have explosive memory usage. 
In practice this has never been an +issue. (Note that this is also different from using `sys.intern` to intern +Python strings, which does not guarantee they won't be garbage collected later.) + +Strings are stored in the `StringStore` in a peculiar way: each string uses a +union that is either an eight-byte `char[]` or a `char*`. Short strings are +stored directly in the `char[]`, while longer strings are stored in allocated +memory and prefixed with their length. This is a strategy to reduce indirection +and memory fragmentation. See `decode_Utf8Str` and `_allocate` in +`strings.pyx` for the implementation. + +### When to Use the StringStore? + +While you can ignore the `StringStore` in many cases, there are situations where +you should make use of it to avoid errors. + +Any time you introduce a string that may be set on a `Doc` field that has a hash, +you should add the string to the `StringStore`. This mainly happens when adding +labels in components, but there are some other cases: + +- syntax iterators, mainly `get_noun_chunks` +- external data used in components, like the `KnowledgeBase` in the `entity_linker` +- labels used in tests + +## Vocab + +The `Vocab` is a core component of a `Language` pipeline. Its main function is +to manage `Lexeme`s, which are structs that contain information about a token +that depends only on its surface form, without context. `Lexeme`s store much of +the data associated with `Token`s. As a side effect of this the `Vocab` also +manages the `StringStore` for a pipeline and a grab-bag of other data. + +These are things stored in the vocab: + +- `Lexeme`s +- `StringStore` +- `Morphology`: manages info used in `MorphAnalysis` objects +- `vectors`: basically a dict for word vectors +- `lookups`: language specific data like lemmas +- `writing_system`: language specific metadata +- `get_noun_chunks`: a syntax iterator +- lex attribute getters: functions like `is_punct`, set in language defaults +- `cfg`: **not** the pipeline config, this is mostly unused +- `_unused_object`: Formerly an unused object, kept around until v4 for compatability + +Some of these, like the Morphology and Vectors, are complex enough that they +need their own explanations. Here we'll just look at Vocab-specific items. + +### Lexemes + +A `Lexeme` is a type that mainly wraps a `LexemeC`, a struct consisting of ints +that identify various context-free token attributes. Lexemes are the core data +of the `Vocab`, and can be accessed using `__getitem__` on the `Vocab`. The memory +for storing `LexemeC` objects is managed by a pool that belongs to the `Vocab`. + +Note that `__getitem__` on the `Vocab` works much like the `StringStore`, in +that it accepts a hash or id, with one important difference: if you do a lookup +using a string, that value is added to the `StringStore` automatically. + +The attributes stored in a `LexemeC` are: + +- orth (the raw text) +- lower +- norm +- shape +- prefix +- suffix + +Most of these are straightforward. All of them can be customized, and (except +`orth`) probably should be since the defaults are based on English, but in +practice this is rarely done at present. + +### Lookups + +This is basically a dict of dicts, implemented using a `Table` for each +sub-dict, that stores lemmas and other language-specific lookup data. + +A `Table` is a subclass of `OrderedDict` used for string-to-string data. It uses +Bloom filters to speed up misses and has some extra serialization features. +Tables are not used outside of the lookups. 
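A minimal example of the `Lookups`/`Table` API (the table name and data below are made-up illustrative values):

```
from spacy.lookups import Lookups

lookups = Lookups()
# "lemma_lookup" and its contents are only illustrative values
table = lookups.add_table("lemma_lookup", {"dogs": "dog", "ran": "run"})
assert table.get("dogs") == "dog"
# a miss just falls back to the provided default
assert table.get("mice", "mice") == "mice"
```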
+ +### Lex Attribute Getters + +Lexical Attribute Getters like `is_punct` are defined on a per-language basis, +much like lookups, but take the form of functions rather than string-to-string +dicts, so they're stored separately. + +### Writing System + +This is a dict with three attributes: + +- `direction`: ltr or rtl (default ltr) +- `has_case`: bool (default `True`) +- `has_letters`: bool (default `True`, `False` only for CJK for now) + +Currently these are not used much - the main use is that `direction` is used in +visualizers, though `rtl` doesn't quite work (see +[#4854](https://github.com/explosion/spaCy/issues/4854)). In the future they +could be used when choosing hyperparameters for subwords, controlling word +shape generation, and similar tasks. + +### Other Vocab Members + +The Vocab is kind of the default place to store things from `Language.defaults` +that don't belong to the Tokenizer. The following properties are in the Vocab +just because they don't have anywhere else to go. + +- `get_noun_chunks` +- `cfg`: This is a dict that just stores `oov_prob` (hardcoded to `-20`) +- `_unused_object`: Leftover C member, should be removed in next major version + + diff --git a/pyproject.toml b/pyproject.toml index 07091123a..7328cd6c2 100644 --- a/pyproject.toml +++ b/pyproject.toml @@ -5,7 +5,7 @@ requires = [ "cymem>=2.0.2,<2.1.0", "preshed>=3.0.2,<3.1.0", "murmurhash>=0.28.0,<1.1.0", - "thinc>=8.0.8,<8.1.0", + "thinc>=8.0.10,<8.1.0", "blis>=0.4.0,<0.8.0", "pathy", "numpy>=1.15.0", diff --git a/requirements.txt b/requirements.txt index ad8c70318..12fdf650f 100644 --- a/requirements.txt +++ b/requirements.txt @@ -1,15 +1,15 @@ # Our libraries -spacy-legacy>=3.0.7,<3.1.0 +spacy-legacy>=3.0.8,<3.1.0 cymem>=2.0.2,<2.1.0 preshed>=3.0.2,<3.1.0 -thinc>=8.0.8,<8.1.0 +thinc>=8.0.10,<8.1.0 blis>=0.4.0,<0.8.0 ml_datasets>=0.2.0,<0.3.0 murmurhash>=0.28.0,<1.1.0 wasabi>=0.8.1,<1.1.0 srsly>=2.4.1,<3.0.0 -catalogue>=2.0.4,<2.1.0 -typer>=0.3.0,<0.4.0 +catalogue>=2.0.6,<2.1.0 +typer>=0.3.0,<0.5.0 pathy>=0.3.5 # Third party dependencies numpy>=1.15.0 diff --git a/setup.cfg b/setup.cfg index 1fa5b828d..ff12d511a 100644 --- a/setup.cfg +++ b/setup.cfg @@ -37,19 +37,19 @@ setup_requires = cymem>=2.0.2,<2.1.0 preshed>=3.0.2,<3.1.0 murmurhash>=0.28.0,<1.1.0 - thinc>=8.0.8,<8.1.0 + thinc>=8.0.10,<8.1.0 install_requires = # Our libraries - spacy-legacy>=3.0.7,<3.1.0 + spacy-legacy>=3.0.8,<3.1.0 murmurhash>=0.28.0,<1.1.0 cymem>=2.0.2,<2.1.0 preshed>=3.0.2,<3.1.0 - thinc>=8.0.8,<8.1.0 + thinc>=8.0.9,<8.1.0 blis>=0.4.0,<0.8.0 wasabi>=0.8.1,<1.1.0 srsly>=2.4.1,<3.0.0 - catalogue>=2.0.4,<2.1.0 - typer>=0.3.0,<0.4.0 + catalogue>=2.0.6,<2.1.0 + typer>=0.3.0,<0.5.0 pathy>=0.3.5 # Third-party dependencies tqdm>=4.38.0,<5.0.0 diff --git a/spacy/about.py b/spacy/about.py index 85b579f95..3137be806 100644 --- a/spacy/about.py +++ b/spacy/about.py @@ -1,6 +1,6 @@ # fmt: off __title__ = "spacy" -__version__ = "3.1.2" +__version__ = "3.1.3" __download_url__ = "https://github.com/explosion/spacy-models/releases/download" __compatibility__ = "https://raw.githubusercontent.com/explosion/spacy-models/master/compatibility.json" __projects__ = "https://github.com/explosion/projects" diff --git a/spacy/cli/_util.py b/spacy/cli/_util.py index ed1e840a5..127bba55a 100644 --- a/spacy/cli/_util.py +++ b/spacy/cli/_util.py @@ -397,7 +397,11 @@ def git_checkout( run_command(cmd, capture=True) # We need Path(name) to make sure we also support subdirectories try: - shutil.copytree(str(tmp_dir / Path(subpath)), str(dest)) + source_path 
= tmp_dir / Path(subpath) + if not is_subpath_of(tmp_dir, source_path): + err = f"'{subpath}' is a path outside of the cloned repository." + msg.fail(err, repo, exits=1) + shutil.copytree(str(source_path), str(dest)) except FileNotFoundError: err = f"Can't clone {subpath}. Make sure the directory exists in the repo (branch '{branch}')" msg.fail(err, repo, exits=1) @@ -445,8 +449,14 @@ def git_sparse_checkout(repo, subpath, dest, branch): # And finally, we can checkout our subpath cmd = f"git -C {tmp_dir} checkout {branch} {subpath}" run_command(cmd, capture=True) - # We need Path(name) to make sure we also support subdirectories - shutil.move(str(tmp_dir / Path(subpath)), str(dest)) + + # Get a subdirectory of the cloned path, if appropriate + source_path = tmp_dir / Path(subpath) + if not is_subpath_of(tmp_dir, source_path): + err = f"'{subpath}' is a path outside of the cloned repository." + msg.fail(err, repo, exits=1) + + shutil.move(str(source_path), str(dest)) def get_git_version( @@ -477,6 +487,19 @@ def _http_to_git(repo: str) -> str: return repo +def is_subpath_of(parent, child): + """ + Check whether `child` is a path contained within `parent`. + """ + # Based on https://stackoverflow.com/a/37095733 . + + # In Python 3.9, the `Path.is_relative_to()` method will supplant this, so + # we can stop using crusty old os.path functions. + parent_realpath = os.path.realpath(parent) + child_realpath = os.path.realpath(child) + return os.path.commonpath([parent_realpath, child_realpath]) == parent_realpath + + def string_to_list(value: str, intify: bool = False) -> Union[List[str], List[int]]: """Parse a comma-separated string to a list and account for various formatting options. Mostly used to handle CLI arguments that take a list of diff --git a/spacy/cli/package.py b/spacy/cli/package.py index b6b993267..332a51bc7 100644 --- a/spacy/cli/package.py +++ b/spacy/cli/package.py @@ -200,17 +200,21 @@ def get_third_party_dependencies( exclude (list): List of packages to exclude (e.g. that already exist in meta). RETURNS (list): The versioned requirements. 
""" - own_packages = ("spacy", "spacy-nightly", "thinc", "srsly") + own_packages = ("spacy", "spacy-legacy", "spacy-nightly", "thinc", "srsly") distributions = util.packages_distributions() funcs = defaultdict(set) - for path, value in util.walk_dict(config): - if path[-1].startswith("@"): # collect all function references by registry - funcs[path[-1][1:]].add(value) + # We only want to look at runtime-relevant sections, not [training] or [initialize] + for section in ("nlp", "components"): + for path, value in util.walk_dict(config[section]): + if path[-1].startswith("@"): # collect all function references by registry + funcs[path[-1][1:]].add(value) + for component in config.get("components", {}).values(): + if "factory" in component: + funcs["factories"].add(component["factory"]) modules = set() for reg_name, func_names in funcs.items(): - sub_registry = getattr(util.registry, reg_name) for func_name in func_names: - func_info = sub_registry.find(func_name) + func_info = util.registry.find(reg_name, func_name) module_name = func_info.get("module") if module_name: # the code is part of a module, not a --code file modules.add(func_info["module"].split(".")[0]) diff --git a/spacy/cli/project/assets.py b/spacy/cli/project/assets.py index b49e18608..70fcd0ecf 100644 --- a/spacy/cli/project/assets.py +++ b/spacy/cli/project/assets.py @@ -59,6 +59,15 @@ def project_assets(project_dir: Path, *, sparse_checkout: bool = False) -> None: shutil.rmtree(dest) else: dest.unlink() + if "repo" not in asset["git"] or asset["git"]["repo"] is None: + msg.fail( + "A git asset must include 'repo', the repository address.", exits=1 + ) + if "path" not in asset["git"] or asset["git"]["path"] is None: + msg.fail( + "A git asset must include 'path' - use \"\" to get the entire repository.", + exits=1, + ) git_checkout( asset["git"]["repo"], asset["git"]["path"], diff --git a/spacy/cli/project/run.py b/spacy/cli/project/run.py index ececc2507..3736a6e1c 100644 --- a/spacy/cli/project/run.py +++ b/spacy/cli/project/run.py @@ -57,6 +57,7 @@ def project_run( project_dir (Path): Path to project directory. subcommand (str): Name of command to run. + overrides (Dict[str, Any]): Optional config overrides. force (bool): Force re-running, even if nothing changed. dry (bool): Perform a dry run and don't execute commands. capture (bool): Whether to capture the output and errors of individual commands. @@ -72,7 +73,14 @@ def project_run( if subcommand in workflows: msg.info(f"Running workflow '{subcommand}'") for cmd in workflows[subcommand]: - project_run(project_dir, cmd, force=force, dry=dry, capture=capture) + project_run( + project_dir, + cmd, + overrides=overrides, + force=force, + dry=dry, + capture=capture, + ) else: cmd = commands[subcommand] for dep in cmd.get("deps", []): diff --git a/spacy/errors.py b/spacy/errors.py index a206826ff..135aacf92 100644 --- a/spacy/errors.py +++ b/spacy/errors.py @@ -869,6 +869,10 @@ class Errors: E1019 = ("`noun_chunks` requires the pos tagging, which requires a " "statistical model to be installed and loaded. For more info, see " "the documentation:\nhttps://spacy.io/usage/models") + E1020 = ("No `epoch_resume` value specified and could not infer one from " + "filename. Specify an epoch to resume from.") + E1021 = ("`pos` value \"{pp}\" is not a valid Universal Dependencies tag. 
" + "Non-UD tags should use the `tag` property.") # Deprecated model shortcuts, only used in errors and warnings diff --git a/spacy/lang/fr/tokenizer_exceptions.py b/spacy/lang/fr/tokenizer_exceptions.py index 6f429eecc..060f81879 100644 --- a/spacy/lang/fr/tokenizer_exceptions.py +++ b/spacy/lang/fr/tokenizer_exceptions.py @@ -82,7 +82,8 @@ for orth in [ for verb in [ "a", - "est" "semble", + "est", + "semble", "indique", "moque", "passe", diff --git a/spacy/matcher/matcher.pyx b/spacy/matcher/matcher.pyx index be45dcaad..05c55c9a7 100644 --- a/spacy/matcher/matcher.pyx +++ b/spacy/matcher/matcher.pyx @@ -281,28 +281,19 @@ cdef class Matcher: final_matches.append((key, *match)) # Mark tokens that have matched memset(&matched[start], 1, span_len * sizeof(matched[0])) - if with_alignments: - final_matches_with_alignments = final_matches - final_matches = [(key, start, end) for key, start, end, alignments in final_matches] - # perform the callbacks on the filtered set of results - for i, (key, start, end) in enumerate(final_matches): - on_match = self._callbacks.get(key, None) - if on_match is not None: - on_match(self, doc, i, final_matches) if as_spans: - spans = [] - for key, start, end in final_matches: + final_results = [] + for key, start, end, *_ in final_matches: if isinstance(doclike, Span): start += doclike.start end += doclike.start - spans.append(Span(doc, start, end, label=key)) - return spans + final_results.append(Span(doc, start, end, label=key)) elif with_alignments: # convert alignments List[Dict[str, int]] --> List[int] - final_matches = [] # when multiple alignment (belongs to the same length) is found, # keeps the alignment that has largest token_idx - for key, start, end, alignments in final_matches_with_alignments: + final_results = [] + for key, start, end, alignments in final_matches: sorted_alignments = sorted(alignments, key=lambda x: (x['length'], x['token_idx']), reverse=False) alignments = [0] * (end-start) for align in sorted_alignments: @@ -311,10 +302,16 @@ cdef class Matcher: # Since alignments are sorted in order of (length, token_idx) # this overwrites smaller token_idx when they have same length. 
alignments[align['length']] = align['token_idx'] - final_matches.append((key, start, end, alignments)) - return final_matches + final_results.append((key, start, end, alignments)) + final_matches = final_results # for callbacks else: - return final_matches + final_results = final_matches + # perform the callbacks on the filtered set of results + for i, (key, *_) in enumerate(final_matches): + on_match = self._callbacks.get(key, None) + if on_match is not None: + on_match(self, doc, i, final_matches) + return final_results def _normalize_key(self, key): if isinstance(key, basestring): diff --git a/spacy/pipeline/spancat.py b/spacy/pipeline/spancat.py index 4cdaf3d83..052bd2874 100644 --- a/spacy/pipeline/spancat.py +++ b/spacy/pipeline/spancat.py @@ -398,7 +398,9 @@ class SpanCategorizer(TrainablePipe): pass def _get_aligned_spans(self, eg: Example): - return eg.get_aligned_spans_y2x(eg.reference.spans.get(self.key, []), allow_overlap=True) + return eg.get_aligned_spans_y2x( + eg.reference.spans.get(self.key, []), allow_overlap=True + ) def _make_span_group( self, doc: Doc, indices: Ints2d, scores: Floats2d, labels: List[str] diff --git a/spacy/tests/doc/test_creation.py b/spacy/tests/doc/test_creation.py index 6989b965f..302a9b6ea 100644 --- a/spacy/tests/doc/test_creation.py +++ b/spacy/tests/doc/test_creation.py @@ -70,3 +70,10 @@ def test_create_with_heads_and_no_deps(vocab): heads = list(range(len(words))) with pytest.raises(ValueError): Doc(vocab, words=words, heads=heads) + + +def test_create_invalid_pos(vocab): + words = "I like ginger".split() + pos = "QQ ZZ XX".split() + with pytest.raises(ValueError): + Doc(vocab, words=words, pos=pos) diff --git a/spacy/tests/doc/test_token_api.py b/spacy/tests/doc/test_token_api.py index 5ea0bcff0..e715c5e85 100644 --- a/spacy/tests/doc/test_token_api.py +++ b/spacy/tests/doc/test_token_api.py @@ -203,6 +203,12 @@ def test_set_pos(): assert doc[1].pos_ == "VERB" +def test_set_invalid_pos(): + doc = Doc(Vocab(), words=["hello", "world"]) + with pytest.raises(ValueError): + doc[0].pos_ = "blah" + + def test_tokens_sent(doc): """Test token.sent property""" assert len(list(doc.sents)) == 3 diff --git a/spacy/tests/matcher/test_matcher_api.py b/spacy/tests/matcher/test_matcher_api.py index a42735eae..c02d65cdf 100644 --- a/spacy/tests/matcher/test_matcher_api.py +++ b/spacy/tests/matcher/test_matcher_api.py @@ -576,6 +576,16 @@ def test_matcher_callback(en_vocab): mock.assert_called_once_with(matcher, doc, 0, matches) +def test_matcher_callback_with_alignments(en_vocab): + mock = Mock() + matcher = Matcher(en_vocab) + pattern = [{"ORTH": "test"}] + matcher.add("Rule", [pattern], on_match=mock) + doc = Doc(en_vocab, words=["This", "is", "a", "test", "."]) + matches = matcher(doc, with_alignments=True) + mock.assert_called_once_with(matcher, doc, 0, matches) + + def test_matcher_span(matcher): text = "JavaScript is good but Java is better" doc = Doc(matcher.vocab, words=text.split()) diff --git a/spacy/tests/pipeline/test_spancat.py b/spacy/tests/pipeline/test_spancat.py index 3da5816ab..7b759f8f6 100644 --- a/spacy/tests/pipeline/test_spancat.py +++ b/spacy/tests/pipeline/test_spancat.py @@ -85,7 +85,12 @@ def test_doc_gc(): spancat = nlp.add_pipe("spancat", config={"spans_key": SPAN_KEY}) spancat.add_label("PERSON") nlp.initialize() - texts = ["Just a sentence.", "I like London and Berlin", "I like Berlin", "I eat ham."] + texts = [ + "Just a sentence.", + "I like London and Berlin", + "I like Berlin", + "I eat ham.", + ] all_spans = [doc.spans for 
doc in nlp.pipe(texts)] for text, spangroups in zip(texts, all_spans): assert isinstance(spangroups, SpanGroups) @@ -338,7 +343,11 @@ def test_overfitting_IO_overlapping(): assert len(spans) == 3 assert len(spans.attrs["scores"]) == 3 assert min(spans.attrs["scores"]) > 0.9 - assert set([span.text for span in spans]) == {"London", "Berlin", "London and Berlin"} + assert set([span.text for span in spans]) == { + "London", + "Berlin", + "London and Berlin", + } assert set([span.label_ for span in spans]) == {"LOC", "DOUBLE_LOC"} # Also test the results are still the same after IO @@ -350,5 +359,9 @@ def test_overfitting_IO_overlapping(): assert len(spans2) == 3 assert len(spans2.attrs["scores"]) == 3 assert min(spans2.attrs["scores"]) > 0.9 - assert set([span.text for span in spans2]) == {"London", "Berlin", "London and Berlin"} + assert set([span.text for span in spans2]) == { + "London", + "Berlin", + "London and Berlin", + } assert set([span.label_ for span in spans2]) == {"LOC", "DOUBLE_LOC"} diff --git a/spacy/tests/test_cli.py b/spacy/tests/test_cli.py index 1841de317..72bbe04e5 100644 --- a/spacy/tests/test_cli.py +++ b/spacy/tests/test_cli.py @@ -9,6 +9,7 @@ from spacy.cli import info from spacy.cli.init_config import init_config, RECOMMENDATIONS from spacy.cli._util import validate_project_commands, parse_config_overrides from spacy.cli._util import load_project_config, substitute_project_variables +from spacy.cli._util import is_subpath_of from spacy.cli._util import string_to_list from spacy import about from spacy.util import get_minor_version @@ -535,8 +536,41 @@ def test_init_labels(component_name): assert len(nlp2.get_pipe(component_name).labels) == 4 -def test_get_third_party_dependencies_runs(): +def test_get_third_party_dependencies(): # We can't easily test the detection of third-party packages here, but we # can at least make sure that the function and its importlib magic runs. nlp = Dutch() + # Test with component factory based on Cython module + nlp.add_pipe("tagger") assert get_third_party_dependencies(nlp.config) == [] + + # Test with legacy function + nlp = Dutch() + nlp.add_pipe( + "textcat", + config={ + "model": { + # Do not update from legacy architecture spacy.TextCatBOW.v1 + "@architectures": "spacy.TextCatBOW.v1", + "exclusive_classes": True, + "ngram_size": 1, + "no_output_layer": False, + } + }, + ) + get_third_party_dependencies(nlp.config) == [] + + +@pytest.mark.parametrize( + "parent,child,expected", + [ + ("/tmp", "/tmp", True), + ("/tmp", "/", False), + ("/tmp", "/tmp/subdir", True), + ("/tmp", "/tmpdir", False), + ("/tmp", "/tmp/subdir/..", True), + ("/tmp", "/tmp/..", False), + ], +) +def test_is_subpath_of(parent, child, expected): + assert is_subpath_of(parent, child) == expected diff --git a/spacy/tokens/doc.pyx b/spacy/tokens/doc.pyx index cd2bd6f6c..b3eda26e1 100644 --- a/spacy/tokens/doc.pyx +++ b/spacy/tokens/doc.pyx @@ -30,6 +30,7 @@ from ..compat import copy_reg, pickle from ..errors import Errors, Warnings from ..morphology import Morphology from .. import util +from .. 
import parts_of_speech from .underscore import Underscore, get_ext_args from ._retokenize import Retokenizer from ._serialize import ALL_ATTRS as DOCBIN_ALL_ATTRS @@ -285,6 +286,10 @@ cdef class Doc: sent_starts[i] = -1 elif sent_starts[i] is None or sent_starts[i] not in [-1, 0, 1]: sent_starts[i] = 0 + if pos is not None: + for pp in set(pos): + if pp not in parts_of_speech.IDS: + raise ValueError(Errors.E1021.format(pp=pp)) ent_iobs = None ent_types = None if ents is not None: diff --git a/spacy/tokens/token.pyx b/spacy/tokens/token.pyx index 3fcfda691..9277eb6fa 100644 --- a/spacy/tokens/token.pyx +++ b/spacy/tokens/token.pyx @@ -867,6 +867,8 @@ cdef class Token: return parts_of_speech.NAMES[self.c.pos] def __set__(self, pos_name): + if pos_name not in parts_of_speech.IDS: + raise ValueError(Errors.E1021.format(pp=pos_name)) self.c.pos = parts_of_speech.IDS[pos_name] property tag_: diff --git a/spacy/training/loggers.py b/spacy/training/loggers.py index 5cf2db6b3..137e89e56 100644 --- a/spacy/training/loggers.py +++ b/spacy/training/loggers.py @@ -177,3 +177,89 @@ def wandb_logger( return log_step, finalize return setup_logger + + +@registry.loggers("spacy.WandbLogger.v3") +def wandb_logger( + project_name: str, + remove_config_values: List[str] = [], + model_log_interval: Optional[int] = None, + log_dataset_dir: Optional[str] = None, + entity: Optional[str] = None, + run_name: Optional[str] = None, +): + try: + import wandb + + # test that these are available + from wandb import init, log, join # noqa: F401 + except ImportError: + raise ImportError(Errors.E880) + + console = console_logger(progress_bar=False) + + def setup_logger( + nlp: "Language", stdout: IO = sys.stdout, stderr: IO = sys.stderr + ) -> Tuple[Callable[[Dict[str, Any]], None], Callable[[], None]]: + config = nlp.config.interpolate() + config_dot = util.dict_to_dot(config) + for field in remove_config_values: + del config_dot[field] + config = util.dot_to_dict(config_dot) + run = wandb.init( + project=project_name, config=config, entity=entity, reinit=True + ) + + if run_name: + wandb.run.name = run_name + + console_log_step, console_finalize = console(nlp, stdout, stderr) + + def log_dir_artifact( + path: str, + name: str, + type: str, + metadata: Optional[Dict[str, Any]] = {}, + aliases: Optional[List[str]] = [], + ): + dataset_artifact = wandb.Artifact(name, type=type, metadata=metadata) + dataset_artifact.add_dir(path, name=name) + wandb.log_artifact(dataset_artifact, aliases=aliases) + + if log_dataset_dir: + log_dir_artifact(path=log_dataset_dir, name="dataset", type="dataset") + + def log_step(info: Optional[Dict[str, Any]]): + console_log_step(info) + if info is not None: + score = info["score"] + other_scores = info["other_scores"] + losses = info["losses"] + wandb.log({"score": score}) + if losses: + wandb.log({f"loss_{k}": v for k, v in losses.items()}) + if isinstance(other_scores, dict): + wandb.log(other_scores) + if model_log_interval and info.get("output_path"): + if info["step"] % model_log_interval == 0 and info["step"] != 0: + log_dir_artifact( + path=info["output_path"], + name="pipeline_" + run.id, + type="checkpoint", + metadata=info, + aliases=[ + f"epoch {info['epoch']} step {info['step']}", + "latest", + "best" + if info["score"] == max(info["checkpoints"])[0] + else "", + ], + ) + + def finalize() -> None: + console_finalize() + wandb.join() + + return log_step, finalize + + return setup_logger diff --git a/spacy/training/pretrain.py b/spacy/training/pretrain.py index 6d7850212..0228f2947 
100644 --- a/spacy/training/pretrain.py +++ b/spacy/training/pretrain.py @@ -41,10 +41,11 @@ def pretrain( optimizer = P["optimizer"] # Load in pretrained weights to resume from if resume_path is not None: - _resume_model(model, resume_path, epoch_resume, silent=silent) + epoch_resume = _resume_model(model, resume_path, epoch_resume, silent=silent) else: # Without '--resume-path' the '--epoch-resume' argument is ignored epoch_resume = 0 + objective = model.attrs["loss"] # TODO: move this to logger function? tracker = ProgressTracker(frequency=10000) @@ -93,20 +94,25 @@ def ensure_docs(examples_or_docs: Iterable[Union[Doc, Example]]) -> List[Doc]: def _resume_model( model: Model, resume_path: Path, epoch_resume: int, silent: bool = True -) -> None: +) -> int: msg = Printer(no_print=silent) msg.info(f"Resume training tok2vec from: {resume_path}") with resume_path.open("rb") as file_: weights_data = file_.read() model.get_ref("tok2vec").from_bytes(weights_data) - # Parse the epoch number from the given weight file - model_name = re.search(r"model\d+\.bin", str(resume_path)) - if model_name: - # Default weight file name so read epoch_start from it by cutting off 'model' and '.bin' - epoch_resume = int(model_name.group(0)[5:][:-4]) + 1 - msg.info(f"Resuming from epoch: {epoch_resume}") - else: - msg.info(f"Resuming from epoch: {epoch_resume}") + + if epoch_resume is None: + # Parse the epoch number from the given weight file + model_name = re.search(r"model\d+\.bin", str(resume_path)) + if model_name: + # Default weight file name so read epoch_start from it by cutting off 'model' and '.bin' + epoch_resume = int(model_name.group(0)[5:][:-4]) + 1 + else: + # No epoch given and couldn't infer it + raise ValueError(Errors.E1020) + + msg.info(f"Resuming from epoch: {epoch_resume}") + return epoch_resume def make_update( diff --git a/spacy/util.py b/spacy/util.py index 6638e94ce..b49bd096f 100644 --- a/spacy/util.py +++ b/spacy/util.py @@ -140,6 +140,32 @@ class registry(thinc.registry): ) from None return func + @classmethod + def find(cls, registry_name: str, func_name: str) -> Callable: + """Get info about a registered function from the registry.""" + # We're overwriting this classmethod so we're able to provide more + # specific error messages and implement a fallback to spacy-legacy. + if not hasattr(cls, registry_name): + names = ", ".join(cls.get_registry_names()) or "none" + raise RegistryError(Errors.E892.format(name=registry_name, available=names)) + reg = getattr(cls, registry_name) + try: + func_info = reg.find(func_name) + except RegistryError: + if func_name.startswith("spacy."): + legacy_name = func_name.replace("spacy.", "spacy-legacy.") + try: + return reg.find(legacy_name) + except catalogue.RegistryError: + pass + available = ", ".join(sorted(reg.get_all().keys())) or "none" + raise RegistryError( + Errors.E893.format( + name=func_name, reg_name=registry_name, available=available + ) + ) from None + return func_info + @classmethod def has(cls, registry_name: str, func_name: str) -> bool: """Check whether a function is available in a registry.""" diff --git a/website/docs/api/top-level.md b/website/docs/api/top-level.md index 8190d9f78..3cf81ae93 100644 --- a/website/docs/api/top-level.md +++ b/website/docs/api/top-level.md @@ -462,7 +462,7 @@ start decreasing across epochs. 
-#### spacy.WandbLogger.v2 {#WandbLogger tag="registered function"} +#### spacy.WandbLogger.v3 {#WandbLogger tag="registered function"} > #### Installation > @@ -494,19 +494,21 @@ remain in the config file stored on your local system. > > ```ini > [training.logger] -> @loggers = "spacy.WandbLogger.v2" +> @loggers = "spacy.WandbLogger.v3" > project_name = "monitor_spacy_training" > remove_config_values = ["paths.train", "paths.dev", "corpora.train.path", "corpora.dev.path"] > log_dataset_dir = "corpus" > model_log_interval = 1000 > ``` -| Name | Description | -| ---------------------- | ------------------------------------------------------------------------------------------------------------------------------------- | -| `project_name` | The name of the project in the Weights & Biases interface. The project will be created automatically if it doesn't exist yet. ~~str~~ | -| `remove_config_values` | A list of values to include from the config before it is uploaded to W&B (default: empty). ~~List[str]~~ | -| `model_log_interval` | Steps to wait between logging model checkpoints to W&B dasboard (default: None). ~~Optional[int]~~ | -| `log_dataset_dir` | Directory containing dataset to be logged and versioned as W&B artifact (default: None). ~~Optional[str]~~ | +| Name | Description | +| ---------------------- | --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | +| `project_name` | The name of the project in the Weights & Biases interface. The project will be created automatically if it doesn't exist yet. ~~str~~ | +| `remove_config_values` | A list of values to include from the config before it is uploaded to W&B (default: empty). ~~List[str]~~ | +| `model_log_interval` | Steps to wait between logging model checkpoints to W&B dasboard (default: None). ~~Optional[int]~~ | +| `log_dataset_dir` | Directory containing dataset to be logged and versioned as W&B artifact (default: None). ~~Optional[str]~~ | +| `run_name` | The name of the run. If you don't specify a run_name, the name will be created by wandb library. (default: None ). ~~Optional[str]~~ | +| `entity` | An entity is a username or team name where you're sending runs. If you don't specify an entity, the run will be sent to your default entity, which is usually your username. (default: None). ~~Optional[str]~~ | diff --git a/website/docs/usage/projects.md b/website/docs/usage/projects.md index a646989a5..6f6cef7c8 100644 --- a/website/docs/usage/projects.md +++ b/website/docs/usage/projects.md @@ -291,7 +291,7 @@ files you need and not the whole repo. | Name | Description | | ------------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | | `dest` | The destination path to save the downloaded asset to (relative to the project directory), including the file name. | -| `git` | `repo`: The URL of the repo to download from.
<br />`path`: Path of the file or directory to download, relative to the repo root.<br />`branch`: The branch to download from. Defaults to `"master"`. |
+| `git` | `repo`: The URL of the repo to download from.<br />`path`: Path of the file or directory to download, relative to the repo root. "" specifies the root directory.<br />`branch`: The branch to download from. Defaults to `"master"`. |
| `checksum` | Optional checksum of the file. If provided, it will be used to verify that the file matches and downloads will be skipped if a local file with the same checksum already exists. |
| `description` | Optional asset description, used in [auto-generated docs](#custom-docs). |