Mirror of https://github.com/explosion/spaCy.git

Commit ec621e6853: Merge remote-tracking branch 'upstream/master' into spacy.io

New file: extra/DEVELOPER_DOCS/Listeners.md (220 lines)

# Listeners

1. [Overview](#1-overview)
2. [Initialization](#2-initialization)
   - [A. Linking listeners to the embedding component](#2a-linking-listeners-to-the-embedding-component)
   - [B. Shape inference](#2b-shape-inference)
3. [Internal communication](#3-internal-communication)
   - [A. During prediction](#3a-during-prediction)
   - [B. During training](#3b-during-training)
   - [C. Frozen components](#3c-frozen-components)
4. [Replacing listener with standalone](#4-replacing-listener-with-standalone)

## 1. Overview

Trainable spaCy components typically use some sort of `tok2vec` layer as part of the `model` definition.
This `tok2vec` layer produces embeddings and is either a standard `Tok2Vec` layer, or a Transformer-based one.
Both versions can be used either inline/standalone, which means that they are defined and used
by only one specific component (e.g. NER), or
[shared](https://spacy.io/usage/embeddings-transformers#embedding-layers),
in which case the embedding functionality becomes a separate component that can
feed embeddings to multiple components downstream, using a listener pattern.

| Type          | Usage      | Model Architecture                                                                                 |
| ------------- | ---------- | -------------------------------------------------------------------------------------------------- |
| `Tok2Vec`     | standalone | [`spacy.Tok2Vec`](https://spacy.io/api/architectures#Tok2Vec)                                        |
| `Tok2Vec`     | listener   | [`spacy.Tok2VecListener`](https://spacy.io/api/architectures#Tok2VecListener)                        |
| `Transformer` | standalone | [`spacy-transformers.Tok2VecTransformer`](https://spacy.io/api/architectures#Tok2VecTransformer)     |
| `Transformer` | listener   | [`spacy-transformers.TransformerListener`](https://spacy.io/api/architectures#TransformerListener)   |

Here we discuss the listener pattern and its implementation in code in more detail.

## 2. Initialization

### 2A. Linking listeners to the embedding component

To allow sharing a `tok2vec` layer, a separate `tok2vec` component needs to be defined in the config:

```
[components.tok2vec]
factory = "tok2vec"

[components.tok2vec.model]
@architectures = "spacy.Tok2Vec.v2"
```

A listener can then be set up by making sure the correct `upstream` name is defined, referring to the
name of the `tok2vec` component (which equals the factory name by default), or `*` as a wildcard:

```
[components.ner.model.tok2vec]
@architectures = "spacy.Tok2VecListener.v1"
upstream = "tok2vec"
```

When an [`nlp`](https://github.com/explosion/spaCy/blob/master/extra/DEVELOPER_DOCS/Language.md) object is
initialized or deserialized, it will make sure to link each `tok2vec` component to its listeners. This is
implemented in the method `nlp._link_components()`, which loops over each
component in the pipeline and calls `find_listeners()` on a component if it's defined.
The [`tok2vec` component](https://github.com/explosion/spaCy/blob/master/spacy/pipeline/tok2vec.py)'s implementation
of this `find_listeners()` method will specifically identify sublayers of a model definition that are of type
`Tok2VecListener` with a matching upstream name, and will then add that listener to the internal `self.listener_map`.

If it's a Transformer-based pipeline, the
[`transformer` component](https://github.com/explosion/spacy-transformers/blob/master/spacy_transformers/pipeline_component.py)
has a similar implementation, but its `find_listeners()` function will specifically look for `TransformerListener`
sublayers of downstream components.

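As a quick way to see this linking in practice, the sketch below loads a trained pipeline that ships with a shared `tok2vec` and inspects which components were registered as its listeners. It assumes `en_core_web_sm` is installed and that the `listening_components` and `listeners` properties behave as in current spaCy versions.

```
import spacy

nlp = spacy.load("en_core_web_sm")
tok2vec = nlp.get_pipe("tok2vec")

# nlp._link_components() ran during deserialization, so the shared component
# already knows which downstream components listen to it ...
print(tok2vec.listening_components)  # e.g. ['tagger', 'parser', 'ner']
# ... and holds one Tok2VecListener layer per listening model
print(len(tok2vec.listeners))
```
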
### 2B. Shape inference

Typically, the output dimension `nO` of a listener's model equals the `nO` (or `width`) of the upstream embedding layer.
For a standard `Tok2Vec`-based component, this is typically known up-front and defined as such in the config:

```
[components.ner.model.tok2vec]
@architectures = "spacy.Tok2VecListener.v1"
width = ${components.tok2vec.model.encode.width}
```

A `transformer` component, however, only knows its `nO` dimension after the HuggingFace transformer
is set with the function `model.attrs["set_transformer"]`,
[implemented](https://github.com/explosion/spacy-transformers/blob/master/spacy_transformers/layers/transformer_model.py)
by `set_pytorch_transformer`.
This is why, upon linking of the transformer listeners, the `transformer` component also makes sure to set
the listener's output dimension correctly.

This shape inference mechanism also needs to happen with resumed/frozen components, which means that for some CLI
commands (`assemble` and `train`), we need to call `nlp._link_components` even before initializing the `nlp`
object. To cover all use-cases and avoid negative side effects, the code base ensures that performing the
linking twice is not harmful.

## 3. Internal communication

The internal communication between a listener and its downstream components is organized by sending and
receiving information across the components - either directly or implicitly.
The details are different depending on whether the pipeline is currently training, or predicting.
Either way, the `tok2vec` or `transformer` component always needs to run before the listener.

### 3A. During prediction

When the `Tok2Vec` pipeline component is called, its `predict()` method is executed to produce the results,
which are then stored by `set_annotations()` in the `doc.tensor` field of the document(s).
Similarly, the `Transformer` component stores the produced embeddings
in `doc._.trf_data`. Next, the `forward` pass of a
[`Tok2VecListener`](https://github.com/explosion/spaCy/blob/master/spacy/pipeline/tok2vec.py)
or a
[`TransformerListener`](https://github.com/explosion/spacy-transformers/blob/master/spacy_transformers/layers/listener.py)
accesses these fields on the `Doc` directly. Both listener implementations have a fallback mechanism for when these
properties were not set on the `Doc`: in that case an all-zero tensor is produced and returned.
We need this fallback mechanism to enable shape inference methods in Thinc, but the code
is slightly risky and at times might hide another bug - so it's a good spot to be aware of.

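To make the prediction-time flow concrete, here is a small sketch that runs a pipeline with a shared `tok2vec` and checks the `doc.tensor` field that the listeners read from. It assumes `en_core_web_sm` is installed; the exact width (96 below) is specific to that model.

```
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Listeners read their embeddings from the Doc at prediction time.")

# The shared tok2vec component ran first and stored one embedding row per token
# via set_annotations(); the downstream listeners then looked these rows up.
print(doc.tensor.shape)  # (number of tokens, embedding width), e.g. (11, 96)
```
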
### 3B. During training

During training, the `update()` methods of the `Tok2Vec` & `Transformer` components don't necessarily set the
annotations on the `Doc` (though since 3.1 they can, if they are part of the `annotating_components` list in the config).
Instead, we rely on a caching mechanism between the original embedding component and its listener.
Specifically, the produced embeddings are sent to the listeners by calling `listener.receive()` and uniquely
identifying the batch of documents with a `batch_id`. This `receive()` call also sends the appropriate `backprop`
call to ensure that gradients from the downstream component flow back to the trainable `Tok2Vec` or `Transformer`
network.

We rely on the `nlp` object properly batching the data and sending each batch through the pipeline in sequence,
which means that only one such batch needs to be kept in memory for each listener.
When the downstream component runs and the listener should produce embeddings, it accesses the batch in memory,
runs the backpropagation, and returns the results and the gradients.

There are two ways in which this mechanism can fail, both of which are detected by `verify_inputs()`:

- `E953` if a different batch is in memory than the requested one - signaling some kind of out-of-sync state of the
  training pipeline.
- `E954` if no batch is in memory at all - signaling that the pipeline is probably not set up correctly.

#### Training with multiple listeners

One `Tok2Vec` or `Transformer` component may be listened to by several downstream components, e.g.
a tagger and a parser could be sharing the same embeddings. In this case, we need to be careful about how we do
the backpropagation. When the `Tok2Vec` or `Transformer` sends out data to the listeners with `receive()`, it will
send an `accumulate_gradient` function call to all listeners, except the last one. This function will keep track
of the gradients received so far. Only the final listener in the pipeline will get an actual `backprop` call that
will initiate the backpropagation of the `tok2vec` or `transformer` model with the accumulated gradients.

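The following stand-alone sketch illustrates the accumulate-then-backprop idea in plain Python. It is not spaCy's actual implementation - just a minimal model of the control flow, where every listener but the last only accumulates its gradient and the last one triggers a single backward pass with the sum.

```
import numpy as np

def make_gradient_router(n_listeners, backprop):
    """Return a callback that accumulates gradients and backprops once."""
    received = []

    def handle_gradient(d_output):
        received.append(d_output)
        if len(received) == n_listeners:
            # Only the final gradient triggers the shared backward pass,
            # using the sum of everything received so far.
            backprop(np.sum(received, axis=0))

    return handle_gradient

def shared_backprop(total_gradient):
    print("backprop on shared embedding layer with:", total_gradient)

route = make_gradient_router(2, shared_backprop)
route(np.array([0.1, -0.2]))  # e.g. the tagger's gradient: accumulated only
route(np.array([0.3, 0.4]))   # e.g. the parser's gradient: triggers backprop
```
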
### 3C. Frozen components

The listener pattern can get particularly tricky in combination with frozen components. To detect components
with listeners that are not frozen consistently, `init_nlp()` (which is called by `spacy train`) goes through
the listeners and their upstream components and warns in two scenarios.

#### The Tok2Vec or Transformer is frozen

If the `Tok2Vec` or `Transformer` was already trained,
e.g. by [pretraining](https://spacy.io/usage/embeddings-transformers#pretraining),
it could be a valid use-case to freeze the embedding architecture and only train downstream components such
as a tagger or a parser. This used to be impossible before 3.1, but has become supported since then by putting the
embedding component in the [`annotating_components`](https://spacy.io/usage/training#annotating-components)
list of the config. This works like any other "annotating component" because it relies on the `Doc` attributes.

However, if the `Tok2Vec` or `Transformer` is frozen, and not present in `annotating_components`, and a related
listener isn't frozen, then a `W086` warning is shown and further training of the pipeline will likely end with `E954`.

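As a hedged config sketch, the supported setup described above would look roughly like this in the training section - the shared embedding component is both frozen and listed as an annotating component, so its predictions are still written to the `Doc` for the listeners to read:

```
[training]
frozen_components = ["tok2vec"]
annotating_components = ["tok2vec"]
```
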
#### The upstream component is frozen

If an upstream component is frozen but the underlying `Tok2Vec` or `Transformer` isn't, the performance of
the upstream component will be degraded after training. In this case, a `W087` warning is shown, explaining
how to use the `replace_listeners` functionality to prevent this problem.

## 4. Replacing listener with standalone

The [`replace_listeners`](https://spacy.io/api/language#replace_listeners) functionality changes the architecture
of a downstream component from using a listener pattern to a standalone `tok2vec` or `transformer` layer,
effectively making the downstream component independent of any other components in the pipeline.
It is implemented by `nlp.replace_listeners()` and typically executed by `nlp.from_config()`.
First, it fetches the original `Model` of the original component that creates the embeddings:

```
tok2vec = self.get_pipe(tok2vec_name)
tok2vec_model = tok2vec.model
```

This is either a [`Tok2Vec` model](https://github.com/explosion/spaCy/blob/master/spacy/ml/models/tok2vec.py) or a
[`TransformerModel`](https://github.com/explosion/spacy-transformers/blob/master/spacy_transformers/layers/transformer_model.py).

In the case of the `tok2vec`, this model can be copied as-is into the configuration and architecture of the
downstream component. However, for the `transformer`, this doesn't work.
The reason is that the `TransformerListener` architecture chains the listener with
[`trfs2arrays`](https://github.com/explosion/spacy-transformers/blob/master/spacy_transformers/layers/trfs2arrays.py):

```
model = chain(
    TransformerListener(upstream_name=upstream),
    trfs2arrays(pooling, grad_factor),
)
```

but the standalone `Tok2VecTransformer` has an additional `split_trf_batch` chained in between the model
and `trfs2arrays`:

```
model = chain(
    TransformerModel(name, get_spans, tokenizer_config),
    split_trf_batch(),
    trfs2arrays(pooling, grad_factor),
)
```

So you can't just take the model from the listener and drop that into the component internally. You need to
adjust the model and the config. To facilitate this, `nlp.replace_listeners()` will check whether additional
[functions](https://github.com/explosion/spacy-transformers/blob/master/spacy_transformers/layers/_util.py) are
[defined](https://github.com/explosion/spacy-transformers/blob/master/spacy_transformers/layers/transformer_model.py)
in `model.attrs`, and if so, it will essentially call these to make the appropriate changes:

```
replace_func = tok2vec_model.attrs["replace_listener_cfg"]
new_config = replace_func(tok2vec_cfg["model"], pipe_cfg["model"]["tok2vec"])
...
new_model = tok2vec_model.attrs["replace_listener"](new_model)
```

The new config and model are then properly stored on the `nlp` object.
Note that this functionality (running the replacement for a transformer listener) was broken prior to
`spacy-transformers` 1.0.5.

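For reference, a minimal usage sketch of this API on a trained pipeline might look as follows, assuming `en_core_web_sm` is installed and uses a shared `tok2vec` with an `ner` listener at `model.tok2vec`:

```
import spacy

nlp = spacy.load("en_core_web_sm")
# Give the ner component its own private copy of the embedding layer,
# so it no longer depends on the shared tok2vec component.
nlp.replace_listeners("tok2vec", "ner", ["model.tok2vec"])
```
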
New file: extra/DEVELOPER_DOCS/StringStore-Vocab.md (216 lines)

# StringStore & Vocab

> Reference: `spacy/strings.pyx`
> Reference: `spacy/vocab.pyx`

## Overview

spaCy represents most strings internally using a `uint64` in Cython which
corresponds to a hash. The magic required to make this largely transparent is
handled by the `StringStore`, and is integrated into the pipelines using the
`Vocab`, which also connects it to some other information.

These are mostly internal details that average library users should never have
to think about. On the other hand, when developing a component it's normal to
interact with the Vocab for lexeme data or word vectors, and it's not unusual
to add labels to the `StringStore`.

## StringStore

### Overview

The `StringStore` is a `cdef class` that looks a bit like a two-way dictionary,
though it is not a subclass of anything in particular.

The main functionality of the `StringStore` is that `__getitem__` converts
hashes into strings or strings into hashes.

The full details of the conversion are complicated. Normally you shouldn't have
to worry about them, but the first applicable case here is used to get the
return value:

1. 0 and the empty string are special cased to each other
2. internal symbols use a lookup table (`SYMBOLS_BY_STR`)
3. normal strings or bytes are hashed
4. internal symbol IDs in `SYMBOLS_BY_INT` are handled
5. anything not yet handled is used as a hash to lookup a string

For the symbol enums, see [`symbols.pxd`](https://github.com/explosion/spaCy/blob/master/spacy/symbols.pxd).

Almost all strings in spaCy are stored in the `StringStore`. This naturally
includes tokens, but also includes things like labels (not just NER/POS/dep,
but also categories etc.), lemmas, lowercase forms, word shapes, and so on. One
of the main results of this is that tokens can be represented by a compact C
struct ([`LexemeC`](https://spacy.io/api/cython-structs#lexemec)/[`TokenC`](https://github.com/explosion/spaCy/issues/4854)) that mostly consists of string hashes. This also means that converting
input for the models is straightforward, and there's not a token mapping step
like in many machine learning frameworks. Additionally, because the token IDs
in spaCy are based on hashes, they are consistent across environments or
models.

One pattern you'll see a lot in spaCy APIs is that `something.value` returns an
`int` and `something.value_` returns a string. That's implemented using the
`StringStore`. Typically the `int` is stored in a C struct and the string is
generated via a property that calls into the `StringStore` with the `int`.

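A hedged two-line illustration of that `.value` / `.value_` pairing, using the token attributes `lower` and `lower_`:

```
import spacy

nlp = spacy.blank("en")
doc = nlp("hello world")
token = doc[0]
print(token.lower, token.lower_)  # 64-bit hash and the string "hello"
# The int form is just the StringStore hash of the string form:
assert token.lower == nlp.vocab.strings[token.lower_]
```
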
Besides `__getitem__`, the `StringStore` has functions to return specifically a
string or specifically a hash, regardless of whether the input was a string or
hash to begin with, though these are only used occasionally.

### Implementation Details: Hashes and Allocations

Hashes are 64-bit and are computed using [murmurhash][] on UTF-8 bytes. There is no
mechanism for detecting and avoiding collisions. To date there has never been a
reproducible collision or user report about any related issues.

[murmurhash]: https://github.com/explosion/murmurhash

The empty string is not hashed, it's just converted to/from 0.

A small number of strings use indices into a lookup table (so low integers)
rather than hashes. This is mostly Universal Dependencies labels or other
strings considered "core" in spaCy. This was critical in v1, which hadn't
introduced hashing yet. Since v2 it's important for items in `spacy.attrs`,
especially lexeme flags, but is otherwise only maintained for backwards
compatibility.

You can call `strings["mystring"]` with a string the `StringStore` has never seen
before and it will return a hash. But in order to do the reverse operation, you
need to call `strings.add("mystring")` first. Without a call to `add` the
string will not be interned.

Example:

```
from spacy.strings import StringStore

ss = StringStore()
hashval = ss["spacy"]  # 10639093010105930009

try:
    # this won't work
    ss[hashval]
except KeyError:
    print(f"key {hashval} unknown in the StringStore.")

ss.add("spacy")
assert ss[hashval] == "spacy"  # it works now

# There is no `.keys` property, but you can iterate over keys
# The empty string will never be in the list of keys
for key in ss:
    print(key)
```

In normal use nothing is ever removed from the `StringStore`. In theory this
means that if you do something like iterate through all hex values of a certain
length you can have explosive memory usage. In practice this has never been an
issue. (Note that this is also different from using `sys.intern` to intern
Python strings, which does not guarantee they won't be garbage collected later.)

Strings are stored in the `StringStore` in a peculiar way: each string uses a
union that is either an eight-byte `char[]` or a `char*`. Short strings are
stored directly in the `char[]`, while longer strings are stored in allocated
memory and prefixed with their length. This is a strategy to reduce indirection
and memory fragmentation. See `decode_Utf8Str` and `_allocate` in
`strings.pyx` for the implementation.

### When to Use the StringStore?

While you can ignore the `StringStore` in many cases, there are situations where
you should make use of it to avoid errors.

Any time you introduce a string that may be set on a `Doc` field that has a hash,
you should add the string to the `StringStore`. This mainly happens when adding
labels in components, but there are some other cases (a short sketch follows after
this list):

- syntax iterators, mainly `get_noun_chunks`
- external data used in components, like the `KnowledgeBase` in the `entity_linker`
- labels used in tests

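A hedged sketch of the label case: interning a label string before writing a hash-backed `Doc` field. (Recent spaCy versions also add span labels for you, so the explicit `add` below is shown purely to illustrate the rule.)

```
import spacy
from spacy.tokens import Span

nlp = spacy.blank("en")
doc = nlp("Apple is looking at startups")

# Make sure the label's hash can be resolved back to a string later
nlp.vocab.strings.add("ORG")
doc.ents = [Span(doc, 0, 1, label="ORG")]
print(doc.ents[0].label_)  # "ORG"
```
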
## Vocab

The `Vocab` is a core component of a `Language` pipeline. Its main function is
to manage `Lexeme`s, which are structs that contain information about a token
that depends only on its surface form, without context. `Lexeme`s store much of
the data associated with `Token`s. As a side effect of this the `Vocab` also
manages the `StringStore` for a pipeline and a grab-bag of other data.

These are things stored in the vocab:

- `Lexeme`s
- `StringStore`
- `Morphology`: manages info used in `MorphAnalysis` objects
- `vectors`: basically a dict for word vectors
- `lookups`: language specific data like lemmas
- `writing_system`: language specific metadata
- `get_noun_chunks`: a syntax iterator
- lex attribute getters: functions like `is_punct`, set in language defaults
- `cfg`: **not** the pipeline config, this is mostly unused
- `_unused_object`: Formerly an unused object, kept around until v4 for compatibility

Some of these, like the Morphology and Vectors, are complex enough that they
need their own explanations. Here we'll just look at Vocab-specific items.

### Lexemes

A `Lexeme` is a type that mainly wraps a `LexemeC`, a struct consisting of ints
that identify various context-free token attributes. Lexemes are the core data
of the `Vocab`, and can be accessed using `__getitem__` on the `Vocab`. The memory
for storing `LexemeC` objects is managed by a pool that belongs to the `Vocab`.

Note that `__getitem__` on the `Vocab` works much like the `StringStore`, in
that it accepts a hash or id, with one important difference: if you do a lookup
using a string, that value is added to the `StringStore` automatically.

The attributes stored in a `LexemeC` are:

- orth (the raw text)
- lower
- norm
- shape
- prefix
- suffix

Most of these are straightforward. All of them can be customized, and (except
`orth`) probably should be since the defaults are based on English, but in
practice this is rarely done at present.

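A hedged sketch of that lookup behaviour and a few of the context-free attributes:

```
import spacy

nlp = spacy.blank("en")
# String lookup returns a Lexeme and interns "apple" in the StringStore
lex = nlp.vocab["apple"]
print(lex.orth, lex.orth_)     # hash and raw text
print(lex.lower_, lex.shape_)  # e.g. "apple" and "xxxx"
# The same Lexeme is returned when looking it up by hash
assert nlp.vocab[lex.orth].orth_ == "apple"
```
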
### Lookups

This is basically a dict of dicts, implemented using a `Table` for each
sub-dict, that stores lemmas and other language-specific lookup data.

A `Table` is a subclass of `OrderedDict` used for string-to-string data. It uses
Bloom filters to speed up misses and has some extra serialization features.
Tables are not used outside of the lookups.

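A hedged sketch of the `Lookups`/`Table` API described above:

```
from spacy.lookups import Lookups

lookups = Lookups()
table = lookups.add_table("lemma_lookup", {"went": "go", "going": "go"})
print(table["went"])                      # "go"
print(table.get("missing", "?"))          # misses are cheap thanks to the Bloom filter
print(lookups.has_table("lemma_lookup"))  # True
```
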
### Lex Attribute Getters

Lexical Attribute Getters like `is_punct` are defined on a per-language basis,
much like lookups, but take the form of functions rather than string-to-string
dicts, so they're stored separately.

### Writing System

This is a dict with three attributes:

- `direction`: ltr or rtl (default ltr)
- `has_case`: bool (default `True`)
- `has_letters`: bool (default `True`, `False` only for CJK for now)

Currently these are not used much - the main use is that `direction` is used in
visualizers, though `rtl` doesn't quite work (see
[#4854](https://github.com/explosion/spaCy/issues/4854)). In the future they
could be used when choosing hyperparameters for subwords, controlling word
shape generation, and similar tasks.

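A hedged way to peek at this dict on a pipeline (the English values below are the documented defaults; other languages may differ):

```
import spacy

print(spacy.blank("en").vocab.writing_system)
# e.g. {'direction': 'ltr', 'has_case': True, 'has_letters': True}
print(spacy.blank("he").vocab.writing_system["direction"])  # 'rtl' for Hebrew
```
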
### Other Vocab Members

The Vocab is kind of the default place to store things from `Language.defaults`
that don't belong to the Tokenizer. The following properties are in the Vocab
just because they don't have anywhere else to go.

- `get_noun_chunks`
- `cfg`: This is a dict that just stores `oov_prob` (hardcoded to `-20`)
- `_unused_object`: Leftover C member, should be removed in next major version

Other files changed in this commit:

```
@@ -5,7 +5,7 @@ requires = [
     "cymem>=2.0.2,<2.1.0",
     "preshed>=3.0.2,<3.1.0",
     "murmurhash>=0.28.0,<1.1.0",
-    "thinc>=8.0.8,<8.1.0",
+    "thinc>=8.0.10,<8.1.0",
     "blis>=0.4.0,<0.8.0",
     "pathy",
     "numpy>=1.15.0",
```

```
@@ -1,15 +1,15 @@
 # Our libraries
-spacy-legacy>=3.0.7,<3.1.0
+spacy-legacy>=3.0.8,<3.1.0
 cymem>=2.0.2,<2.1.0
 preshed>=3.0.2,<3.1.0
-thinc>=8.0.8,<8.1.0
+thinc>=8.0.10,<8.1.0
 blis>=0.4.0,<0.8.0
 ml_datasets>=0.2.0,<0.3.0
 murmurhash>=0.28.0,<1.1.0
 wasabi>=0.8.1,<1.1.0
 srsly>=2.4.1,<3.0.0
-catalogue>=2.0.4,<2.1.0
+catalogue>=2.0.6,<2.1.0
-typer>=0.3.0,<0.4.0
+typer>=0.3.0,<0.5.0
 pathy>=0.3.5
 # Third party dependencies
 numpy>=1.15.0
```

setup.cfg (10 lines changed)

```
@@ -37,19 +37,19 @@ setup_requires =
     cymem>=2.0.2,<2.1.0
     preshed>=3.0.2,<3.1.0
     murmurhash>=0.28.0,<1.1.0
-    thinc>=8.0.8,<8.1.0
+    thinc>=8.0.10,<8.1.0
 install_requires =
     # Our libraries
-    spacy-legacy>=3.0.7,<3.1.0
+    spacy-legacy>=3.0.8,<3.1.0
     murmurhash>=0.28.0,<1.1.0
     cymem>=2.0.2,<2.1.0
     preshed>=3.0.2,<3.1.0
-    thinc>=8.0.8,<8.1.0
+    thinc>=8.0.9,<8.1.0
     blis>=0.4.0,<0.8.0
     wasabi>=0.8.1,<1.1.0
     srsly>=2.4.1,<3.0.0
-    catalogue>=2.0.4,<2.1.0
+    catalogue>=2.0.6,<2.1.0
-    typer>=0.3.0,<0.4.0
+    typer>=0.3.0,<0.5.0
     pathy>=0.3.5
     # Third-party dependencies
     tqdm>=4.38.0,<5.0.0
```

```
@@ -1,6 +1,6 @@
 # fmt: off
 __title__ = "spacy"
-__version__ = "3.1.2"
+__version__ = "3.1.3"
 __download_url__ = "https://github.com/explosion/spacy-models/releases/download"
 __compatibility__ = "https://raw.githubusercontent.com/explosion/spacy-models/master/compatibility.json"
 __projects__ = "https://github.com/explosion/projects"
```

```
@@ -397,7 +397,11 @@ def git_checkout(
     run_command(cmd, capture=True)
     # We need Path(name) to make sure we also support subdirectories
     try:
-        shutil.copytree(str(tmp_dir / Path(subpath)), str(dest))
+        source_path = tmp_dir / Path(subpath)
+        if not is_subpath_of(tmp_dir, source_path):
+            err = f"'{subpath}' is a path outside of the cloned repository."
+            msg.fail(err, repo, exits=1)
+        shutil.copytree(str(source_path), str(dest))
     except FileNotFoundError:
         err = f"Can't clone {subpath}. Make sure the directory exists in the repo (branch '{branch}')"
         msg.fail(err, repo, exits=1)
@@ -445,8 +449,14 @@ def git_sparse_checkout(repo, subpath, dest, branch):
     # And finally, we can checkout our subpath
     cmd = f"git -C {tmp_dir} checkout {branch} {subpath}"
     run_command(cmd, capture=True)
-    # We need Path(name) to make sure we also support subdirectories
-    shutil.move(str(tmp_dir / Path(subpath)), str(dest))
+    # Get a subdirectory of the cloned path, if appropriate
+    source_path = tmp_dir / Path(subpath)
+    if not is_subpath_of(tmp_dir, source_path):
+        err = f"'{subpath}' is a path outside of the cloned repository."
+        msg.fail(err, repo, exits=1)
+    shutil.move(str(source_path), str(dest))


 def get_git_version(
@@ -477,6 +487,19 @@ def _http_to_git(repo: str) -> str:
     return repo


+def is_subpath_of(parent, child):
+    """
+    Check whether `child` is a path contained within `parent`.
+    """
+    # Based on https://stackoverflow.com/a/37095733 .
+
+    # In Python 3.9, the `Path.is_relative_to()` method will supplant this, so
+    # we can stop using crusty old os.path functions.
+    parent_realpath = os.path.realpath(parent)
+    child_realpath = os.path.realpath(child)
+    return os.path.commonpath([parent_realpath, child_realpath]) == parent_realpath
+
+
 def string_to_list(value: str, intify: bool = False) -> Union[List[str], List[int]]:
     """Parse a comma-separated string to a list and account for various
     formatting options. Mostly used to handle CLI arguments that take a list of
```

```
@@ -200,17 +200,21 @@ def get_third_party_dependencies(
     exclude (list): List of packages to exclude (e.g. that already exist in meta).
     RETURNS (list): The versioned requirements.
     """
-    own_packages = ("spacy", "spacy-nightly", "thinc", "srsly")
+    own_packages = ("spacy", "spacy-legacy", "spacy-nightly", "thinc", "srsly")
     distributions = util.packages_distributions()
     funcs = defaultdict(set)
-    for path, value in util.walk_dict(config):
-        if path[-1].startswith("@"):  # collect all function references by registry
-            funcs[path[-1][1:]].add(value)
+    # We only want to look at runtime-relevant sections, not [training] or [initialize]
+    for section in ("nlp", "components"):
+        for path, value in util.walk_dict(config[section]):
+            if path[-1].startswith("@"):  # collect all function references by registry
+                funcs[path[-1][1:]].add(value)
+    for component in config.get("components", {}).values():
+        if "factory" in component:
+            funcs["factories"].add(component["factory"])
     modules = set()
     for reg_name, func_names in funcs.items():
-        sub_registry = getattr(util.registry, reg_name)
         for func_name in func_names:
-            func_info = sub_registry.find(func_name)
+            func_info = util.registry.find(reg_name, func_name)
             module_name = func_info.get("module")
             if module_name:  # the code is part of a module, not a --code file
                 modules.add(func_info["module"].split(".")[0])
```

```
@@ -59,6 +59,15 @@ def project_assets(project_dir: Path, *, sparse_checkout: bool = False) -> None:
                 shutil.rmtree(dest)
             else:
                 dest.unlink()
+        if "repo" not in asset["git"] or asset["git"]["repo"] is None:
+            msg.fail(
+                "A git asset must include 'repo', the repository address.", exits=1
+            )
+        if "path" not in asset["git"] or asset["git"]["path"] is None:
+            msg.fail(
+                "A git asset must include 'path' - use \"\" to get the entire repository.",
+                exits=1,
+            )
         git_checkout(
             asset["git"]["repo"],
             asset["git"]["path"],
```

```
@@ -57,6 +57,7 @@ def project_run(

     project_dir (Path): Path to project directory.
     subcommand (str): Name of command to run.
+    overrides (Dict[str, Any]): Optional config overrides.
     force (bool): Force re-running, even if nothing changed.
     dry (bool): Perform a dry run and don't execute commands.
     capture (bool): Whether to capture the output and errors of individual commands.
@@ -72,7 +73,14 @@ def project_run(
     if subcommand in workflows:
         msg.info(f"Running workflow '{subcommand}'")
         for cmd in workflows[subcommand]:
-            project_run(project_dir, cmd, force=force, dry=dry, capture=capture)
+            project_run(
+                project_dir,
+                cmd,
+                overrides=overrides,
+                force=force,
+                dry=dry,
+                capture=capture,
+            )
     else:
         cmd = commands[subcommand]
         for dep in cmd.get("deps", []):
```

```
@@ -869,6 +869,10 @@ class Errors:
     E1019 = ("`noun_chunks` requires the pos tagging, which requires a "
             "statistical model to be installed and loaded. For more info, see "
             "the documentation:\nhttps://spacy.io/usage/models")
+    E1020 = ("No `epoch_resume` value specified and could not infer one from "
+             "filename. Specify an epoch to resume from.")
+    E1021 = ("`pos` value \"{pp}\" is not a valid Universal Dependencies tag. "
+             "Non-UD tags should use the `tag` property.")


 # Deprecated model shortcuts, only used in errors and warnings
```

```
@@ -82,7 +82,8 @@ for orth in [

 for verb in [
     "a",
-    "est" "semble",
+    "est",
+    "semble",
     "indique",
     "moque",
     "passe",
```

```
@@ -281,28 +281,19 @@ cdef class Matcher:
             final_matches.append((key, *match))
             # Mark tokens that have matched
             memset(&matched[start], 1, span_len * sizeof(matched[0]))
-        if with_alignments:
-            final_matches_with_alignments = final_matches
-            final_matches = [(key, start, end) for key, start, end, alignments in final_matches]
-        # perform the callbacks on the filtered set of results
-        for i, (key, start, end) in enumerate(final_matches):
-            on_match = self._callbacks.get(key, None)
-            if on_match is not None:
-                on_match(self, doc, i, final_matches)
         if as_spans:
-            spans = []
-            for key, start, end in final_matches:
+            final_results = []
+            for key, start, end, *_ in final_matches:
                 if isinstance(doclike, Span):
                     start += doclike.start
                     end += doclike.start
-                spans.append(Span(doc, start, end, label=key))
-            return spans
+                final_results.append(Span(doc, start, end, label=key))
         elif with_alignments:
             # convert alignments List[Dict[str, int]] --> List[int]
-            final_matches = []
             # when multiple alignment (belongs to the same length) is found,
             # keeps the alignment that has largest token_idx
-            for key, start, end, alignments in final_matches_with_alignments:
+            final_results = []
+            for key, start, end, alignments in final_matches:
                 sorted_alignments = sorted(alignments, key=lambda x: (x['length'], x['token_idx']), reverse=False)
                 alignments = [0] * (end-start)
                 for align in sorted_alignments:
@@ -311,10 +302,16 @@ cdef class Matcher:
                     # Since alignments are sorted in order of (length, token_idx)
                     # this overwrites smaller token_idx when they have same length.
                     alignments[align['length']] = align['token_idx']
-                final_matches.append((key, start, end, alignments))
-            return final_matches
+                final_results.append((key, start, end, alignments))
+            final_matches = final_results  # for callbacks
         else:
-            return final_matches
+            final_results = final_matches
+        # perform the callbacks on the filtered set of results
+        for i, (key, *_) in enumerate(final_matches):
+            on_match = self._callbacks.get(key, None)
+            if on_match is not None:
+                on_match(self, doc, i, final_matches)
+        return final_results

     def _normalize_key(self, key):
         if isinstance(key, basestring):
```

```
@@ -398,7 +398,9 @@ class SpanCategorizer(TrainablePipe):
         pass

     def _get_aligned_spans(self, eg: Example):
-        return eg.get_aligned_spans_y2x(eg.reference.spans.get(self.key, []), allow_overlap=True)
+        return eg.get_aligned_spans_y2x(
+            eg.reference.spans.get(self.key, []), allow_overlap=True
+        )

     def _make_span_group(
         self, doc: Doc, indices: Ints2d, scores: Floats2d, labels: List[str]
```

```
@@ -70,3 +70,10 @@ def test_create_with_heads_and_no_deps(vocab):
     heads = list(range(len(words)))
     with pytest.raises(ValueError):
         Doc(vocab, words=words, heads=heads)
+
+
+def test_create_invalid_pos(vocab):
+    words = "I like ginger".split()
+    pos = "QQ ZZ XX".split()
+    with pytest.raises(ValueError):
+        Doc(vocab, words=words, pos=pos)
```

```
@@ -203,6 +203,12 @@ def test_set_pos():
     assert doc[1].pos_ == "VERB"


+def test_set_invalid_pos():
+    doc = Doc(Vocab(), words=["hello", "world"])
+    with pytest.raises(ValueError):
+        doc[0].pos_ = "blah"
+
+
 def test_tokens_sent(doc):
     """Test token.sent property"""
     assert len(list(doc.sents)) == 3
```

```
@@ -576,6 +576,16 @@ def test_matcher_callback(en_vocab):
     mock.assert_called_once_with(matcher, doc, 0, matches)


+def test_matcher_callback_with_alignments(en_vocab):
+    mock = Mock()
+    matcher = Matcher(en_vocab)
+    pattern = [{"ORTH": "test"}]
+    matcher.add("Rule", [pattern], on_match=mock)
+    doc = Doc(en_vocab, words=["This", "is", "a", "test", "."])
+    matches = matcher(doc, with_alignments=True)
+    mock.assert_called_once_with(matcher, doc, 0, matches)
+
+
 def test_matcher_span(matcher):
     text = "JavaScript is good but Java is better"
     doc = Doc(matcher.vocab, words=text.split())
```

```
@@ -85,7 +85,12 @@ def test_doc_gc():
     spancat = nlp.add_pipe("spancat", config={"spans_key": SPAN_KEY})
     spancat.add_label("PERSON")
     nlp.initialize()
-    texts = ["Just a sentence.", "I like London and Berlin", "I like Berlin", "I eat ham."]
+    texts = [
+        "Just a sentence.",
+        "I like London and Berlin",
+        "I like Berlin",
+        "I eat ham.",
+    ]
     all_spans = [doc.spans for doc in nlp.pipe(texts)]
     for text, spangroups in zip(texts, all_spans):
         assert isinstance(spangroups, SpanGroups)
@@ -338,7 +343,11 @@ def test_overfitting_IO_overlapping():
     assert len(spans) == 3
     assert len(spans.attrs["scores"]) == 3
     assert min(spans.attrs["scores"]) > 0.9
-    assert set([span.text for span in spans]) == {"London", "Berlin", "London and Berlin"}
+    assert set([span.text for span in spans]) == {
+        "London",
+        "Berlin",
+        "London and Berlin",
+    }
     assert set([span.label_ for span in spans]) == {"LOC", "DOUBLE_LOC"}

     # Also test the results are still the same after IO
@@ -350,5 +359,9 @@ def test_overfitting_IO_overlapping():
     assert len(spans2) == 3
     assert len(spans2.attrs["scores"]) == 3
     assert min(spans2.attrs["scores"]) > 0.9
-    assert set([span.text for span in spans2]) == {"London", "Berlin", "London and Berlin"}
+    assert set([span.text for span in spans2]) == {
+        "London",
+        "Berlin",
+        "London and Berlin",
+    }
     assert set([span.label_ for span in spans2]) == {"LOC", "DOUBLE_LOC"}
```

```
@@ -9,6 +9,7 @@ from spacy.cli import info
 from spacy.cli.init_config import init_config, RECOMMENDATIONS
 from spacy.cli._util import validate_project_commands, parse_config_overrides
 from spacy.cli._util import load_project_config, substitute_project_variables
+from spacy.cli._util import is_subpath_of
 from spacy.cli._util import string_to_list
 from spacy import about
 from spacy.util import get_minor_version
@@ -535,8 +536,41 @@ def test_init_labels(component_name):
     assert len(nlp2.get_pipe(component_name).labels) == 4


-def test_get_third_party_dependencies_runs():
+def test_get_third_party_dependencies():
     # We can't easily test the detection of third-party packages here, but we
     # can at least make sure that the function and its importlib magic runs.
     nlp = Dutch()
+    # Test with component factory based on Cython module
+    nlp.add_pipe("tagger")
     assert get_third_party_dependencies(nlp.config) == []
+
+    # Test with legacy function
+    nlp = Dutch()
+    nlp.add_pipe(
+        "textcat",
+        config={
+            "model": {
+                # Do not update from legacy architecture spacy.TextCatBOW.v1
+                "@architectures": "spacy.TextCatBOW.v1",
+                "exclusive_classes": True,
+                "ngram_size": 1,
+                "no_output_layer": False,
+            }
+        },
+    )
+    get_third_party_dependencies(nlp.config) == []
+
+
+@pytest.mark.parametrize(
+    "parent,child,expected",
+    [
+        ("/tmp", "/tmp", True),
+        ("/tmp", "/", False),
+        ("/tmp", "/tmp/subdir", True),
+        ("/tmp", "/tmpdir", False),
+        ("/tmp", "/tmp/subdir/..", True),
+        ("/tmp", "/tmp/..", False),
+    ],
+)
+def test_is_subpath_of(parent, child, expected):
+    assert is_subpath_of(parent, child) == expected
```

```
@@ -30,6 +30,7 @@ from ..compat import copy_reg, pickle
 from ..errors import Errors, Warnings
 from ..morphology import Morphology
 from .. import util
+from .. import parts_of_speech
 from .underscore import Underscore, get_ext_args
 from ._retokenize import Retokenizer
 from ._serialize import ALL_ATTRS as DOCBIN_ALL_ATTRS
@@ -285,6 +286,10 @@ cdef class Doc:
                 sent_starts[i] = -1
             elif sent_starts[i] is None or sent_starts[i] not in [-1, 0, 1]:
                 sent_starts[i] = 0
+        if pos is not None:
+            for pp in set(pos):
+                if pp not in parts_of_speech.IDS:
+                    raise ValueError(Errors.E1021.format(pp=pp))
         ent_iobs = None
         ent_types = None
         if ents is not None:
```

```
@@ -867,6 +867,8 @@ cdef class Token:
             return parts_of_speech.NAMES[self.c.pos]

         def __set__(self, pos_name):
+            if pos_name not in parts_of_speech.IDS:
+                raise ValueError(Errors.E1021.format(pp=pos_name))
             self.c.pos = parts_of_speech.IDS[pos_name]

     property tag_:
```

```
@@ -177,3 +177,89 @@ def wandb_logger(
             return log_step, finalize

         return setup_logger
+
+
+@registry.loggers("spacy.WandbLogger.v3")
+def wandb_logger(
+    project_name: str,
+    remove_config_values: List[str] = [],
+    model_log_interval: Optional[int] = None,
+    log_dataset_dir: Optional[str] = None,
+    entity: Optional[str] = None,
+    run_name: Optional[str] = None,
+):
+    try:
+        import wandb
+
+        # test that these are available
+        from wandb import init, log, join  # noqa: F401
+    except ImportError:
+        raise ImportError(Errors.E880)
+
+    console = console_logger(progress_bar=False)
+
+    def setup_logger(
+        nlp: "Language", stdout: IO = sys.stdout, stderr: IO = sys.stderr
+    ) -> Tuple[Callable[[Dict[str, Any]], None], Callable[[], None]]:
+        config = nlp.config.interpolate()
+        config_dot = util.dict_to_dot(config)
+        for field in remove_config_values:
+            del config_dot[field]
+        config = util.dot_to_dict(config_dot)
+        run = wandb.init(
+            project=project_name, config=config, entity=entity, reinit=True
+        )
+
+        if run_name:
+            wandb.run.name = run_name
+
+        console_log_step, console_finalize = console(nlp, stdout, stderr)
+
+        def log_dir_artifact(
+            path: str,
+            name: str,
+            type: str,
+            metadata: Optional[Dict[str, Any]] = {},
+            aliases: Optional[List[str]] = [],
+        ):
+            dataset_artifact = wandb.Artifact(name, type=type, metadata=metadata)
+            dataset_artifact.add_dir(path, name=name)
+            wandb.log_artifact(dataset_artifact, aliases=aliases)
+
+        if log_dataset_dir:
+            log_dir_artifact(path=log_dataset_dir, name="dataset", type="dataset")
+
+        def log_step(info: Optional[Dict[str, Any]]):
+            console_log_step(info)
+            if info is not None:
+                score = info["score"]
+                other_scores = info["other_scores"]
+                losses = info["losses"]
+                wandb.log({"score": score})
+                if losses:
+                    wandb.log({f"loss_{k}": v for k, v in losses.items()})
+                if isinstance(other_scores, dict):
+                    wandb.log(other_scores)
+                if model_log_interval and info.get("output_path"):
+                    if info["step"] % model_log_interval == 0 and info["step"] != 0:
+                        log_dir_artifact(
+                            path=info["output_path"],
+                            name="pipeline_" + run.id,
+                            type="checkpoint",
+                            metadata=info,
+                            aliases=[
+                                f"epoch {info['epoch']} step {info['step']}",
+                                "latest",
+                                "best"
+                                if info["score"] == max(info["checkpoints"])[0]
+                                else "",
+                            ],
+                        )
+
+        def finalize() -> None:
+            console_finalize()
+            wandb.join()
+
+        return log_step, finalize
+
+    return setup_logger
```

```
@@ -41,10 +41,11 @@ def pretrain(
     optimizer = P["optimizer"]
     # Load in pretrained weights to resume from
     if resume_path is not None:
-        _resume_model(model, resume_path, epoch_resume, silent=silent)
+        epoch_resume = _resume_model(model, resume_path, epoch_resume, silent=silent)
     else:
         # Without '--resume-path' the '--epoch-resume' argument is ignored
         epoch_resume = 0

     objective = model.attrs["loss"]
     # TODO: move this to logger function?
     tracker = ProgressTracker(frequency=10000)
@@ -93,20 +94,25 @@ def ensure_docs(examples_or_docs: Iterable[Union[Doc, Example]]) -> List[Doc]:

 def _resume_model(
     model: Model, resume_path: Path, epoch_resume: int, silent: bool = True
-) -> None:
+) -> int:
     msg = Printer(no_print=silent)
     msg.info(f"Resume training tok2vec from: {resume_path}")
     with resume_path.open("rb") as file_:
         weights_data = file_.read()
         model.get_ref("tok2vec").from_bytes(weights_data)
-    # Parse the epoch number from the given weight file
-    model_name = re.search(r"model\d+\.bin", str(resume_path))
-    if model_name:
-        # Default weight file name so read epoch_start from it by cutting off 'model' and '.bin'
-        epoch_resume = int(model_name.group(0)[5:][:-4]) + 1
-        msg.info(f"Resuming from epoch: {epoch_resume}")
-    else:
-        msg.info(f"Resuming from epoch: {epoch_resume}")
+
+    if epoch_resume is None:
+        # Parse the epoch number from the given weight file
+        model_name = re.search(r"model\d+\.bin", str(resume_path))
+        if model_name:
+            # Default weight file name so read epoch_start from it by cutting off 'model' and '.bin'
+            epoch_resume = int(model_name.group(0)[5:][:-4]) + 1
+        else:
+            # No epoch given and couldn't infer it
+            raise ValueError(Errors.E1020)
+
+    msg.info(f"Resuming from epoch: {epoch_resume}")
+    return epoch_resume


 def make_update(
```

```
@@ -140,6 +140,32 @@ class registry(thinc.registry):
             ) from None
         return func

+    @classmethod
+    def find(cls, registry_name: str, func_name: str) -> Callable:
+        """Get info about a registered function from the registry."""
+        # We're overwriting this classmethod so we're able to provide more
+        # specific error messages and implement a fallback to spacy-legacy.
+        if not hasattr(cls, registry_name):
+            names = ", ".join(cls.get_registry_names()) or "none"
+            raise RegistryError(Errors.E892.format(name=registry_name, available=names))
+        reg = getattr(cls, registry_name)
+        try:
+            func_info = reg.find(func_name)
+        except RegistryError:
+            if func_name.startswith("spacy."):
+                legacy_name = func_name.replace("spacy.", "spacy-legacy.")
+                try:
+                    return reg.find(legacy_name)
+                except catalogue.RegistryError:
+                    pass
+            available = ", ".join(sorted(reg.get_all().keys())) or "none"
+            raise RegistryError(
+                Errors.E893.format(
+                    name=func_name, reg_name=registry_name, available=available
+                )
+            ) from None
+        return func_info
+
     @classmethod
     def has(cls, registry_name: str, func_name: str) -> bool:
         """Check whether a function is available in a registry."""
```

````
@@ -462,7 +462,7 @@ start decreasing across epochs.

 </Accordion>

-#### spacy.WandbLogger.v2 {#WandbLogger tag="registered function"}
+#### spacy.WandbLogger.v3 {#WandbLogger tag="registered function"}

 > #### Installation
 >
@@ -494,19 +494,21 @@ remain in the config file stored on your local system.
 >
 > ```ini
 > [training.logger]
-> @loggers = "spacy.WandbLogger.v2"
+> @loggers = "spacy.WandbLogger.v3"
 > project_name = "monitor_spacy_training"
 > remove_config_values = ["paths.train", "paths.dev", "corpora.train.path", "corpora.dev.path"]
 > log_dataset_dir = "corpus"
 > model_log_interval = 1000
 > ```

 | Name                   | Description |
-| ---------------------- | ----------- |
+| ---------------------- | ----------- |
 | `project_name`         | The name of the project in the Weights & Biases interface. The project will be created automatically if it doesn't exist yet. ~~str~~ |
 | `remove_config_values` | A list of values to include from the config before it is uploaded to W&B (default: empty). ~~List[str]~~ |
 | `model_log_interval`   | Steps to wait between logging model checkpoints to W&B dasboard (default: None). ~~Optional[int]~~ |
 | `log_dataset_dir`      | Directory containing dataset to be logged and versioned as W&B artifact (default: None). ~~Optional[str]~~ |
+| `run_name`             | The name of the run. If you don't specify a run_name, the name will be created by wandb library. (default: None). ~~Optional[str]~~ |
+| `entity`               | An entity is a username or team name where you're sending runs. If you don't specify an entity, the run will be sent to your default entity, which is usually your username. (default: None). ~~Optional[str]~~ |

 <Project id="integrations/wandb">
````

````
@@ -291,7 +291,7 @@ files you need and not the whole repo.

 | Name          | Description |
 | ------------- | ----------- |
 | `dest`        | The destination path to save the downloaded asset to (relative to the project directory), including the file name. |
-| `git`         | `repo`: The URL of the repo to download from.<br />`path`: Path of the file or directory to download, relative to the repo root.<br />`branch`: The branch to download from. Defaults to `"master"`. |
+| `git`         | `repo`: The URL of the repo to download from.<br />`path`: Path of the file or directory to download, relative to the repo root. "" specifies the root directory.<br />`branch`: The branch to download from. Defaults to `"master"`. |
 | `checksum`    | Optional checksum of the file. If provided, it will be used to verify that the file matches and downloads will be skipped if a local file with the same checksum already exists. |
 | `description` | Optional asset description, used in [auto-generated docs](#custom-docs). |
````