Mirror of https://github.com/explosion/spaCy.git

Commit ec621e6853: Merge remote-tracking branch 'upstream/master' into spacy.io

New file: extra/DEVELOPER_DOCS/Listeners.md (220 lines)

# Listeners

1. [Overview](#1-overview)
2. [Initialization](#2-initialization)
   - [A. Linking listeners to the embedding component](#2a-linking-listeners-to-the-embedding-component)
   - [B. Shape inference](#2b-shape-inference)
3. [Internal communication](#3-internal-communication)
   - [A. During prediction](#3a-during-prediction)
   - [B. During training](#3b-during-training)
   - [C. Frozen components](#3c-frozen-components)
4. [Replacing listener with standalone](#4-replacing-listener-with-standalone)

## 1. Overview

Trainable spaCy components typically use some sort of `tok2vec` layer as part of the `model` definition.
This `tok2vec` layer produces embeddings and is either a standard `Tok2Vec` layer, or a Transformer-based one.
Both versions can be used either inline/standalone, which means that they are defined and used
by only one specific component (e.g. NER), or
[shared](https://spacy.io/usage/embeddings-transformers#embedding-layers),
in which case the embedding functionality becomes a separate component that can
feed embeddings to multiple components downstream, using a listener pattern.

| Type          | Usage      | Model Architecture                                                                                 |
| ------------- | ---------- | -------------------------------------------------------------------------------------------------- |
| `Tok2Vec`     | standalone | [`spacy.Tok2Vec`](https://spacy.io/api/architectures#Tok2Vec)                                        |
| `Tok2Vec`     | listener   | [`spacy.Tok2VecListener`](https://spacy.io/api/architectures#Tok2VecListener)                        |
| `Transformer` | standalone | [`spacy-transformers.Tok2VecTransformer`](https://spacy.io/api/architectures#Tok2VecTransformer)     |
| `Transformer` | listener   | [`spacy-transformers.TransformerListener`](https://spacy.io/api/architectures#TransformerListener)   |

Here we discuss the listener pattern and its implementation in code in more detail.

## 2. Initialization

### 2A. Linking listeners to the embedding component

To allow sharing a `tok2vec` layer, a separate `tok2vec` component needs to be defined in the config:

```
[components.tok2vec]
factory = "tok2vec"

[components.tok2vec.model]
@architectures = "spacy.Tok2Vec.v2"
```

A listener can then be set up by making sure the correct `upstream` name is defined, referring to the
name of the `tok2vec` component (which equals the factory name by default), or `*` as a wildcard:

```
[components.ner.model.tok2vec]
@architectures = "spacy.Tok2VecListener.v1"
upstream = "tok2vec"
```

When an [`nlp`](https://github.com/explosion/spaCy/blob/master/extra/DEVELOPER_DOCS/Language.md) object is
initialized or deserialized, it will make sure to link each `tok2vec` component to its listeners. This is
implemented in the method `nlp._link_components()`, which loops over each
component in the pipeline and calls `find_listeners()` on a component if it's defined.
The [`tok2vec` component](https://github.com/explosion/spaCy/blob/master/spacy/pipeline/tok2vec.py)'s implementation
of this `find_listeners()` method will specifically identify sublayers of a model definition that are of type
`Tok2VecListener` with a matching upstream name, and will then add that listener to the internal `self.listener_map`.

If it's a Transformer-based pipeline, the
[`transformer` component](https://github.com/explosion/spacy-transformers/blob/master/spacy_transformers/pipeline_component.py)
has a similar implementation, but its `find_listeners()` function will specifically look for `TransformerListener`
sublayers of downstream components.

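As a quick way to see this linking in practice, the sketch below loads a trained pipeline that ships with a shared `tok2vec` and inspects which components were registered as its listeners. It assumes `en_core_web_sm` is installed and that the `listening_components` and `listeners` properties behave as in current spaCy versions.

```
import spacy

nlp = spacy.load("en_core_web_sm")
tok2vec = nlp.get_pipe("tok2vec")

# nlp._link_components() ran during deserialization, so the shared component
# already knows which downstream components listen to it ...
print(tok2vec.listening_components)  # e.g. ['tagger', 'parser', 'ner']
# ... and holds one Tok2VecListener layer per listening model
print(len(tok2vec.listeners))
```
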
### 2B. Shape inference

Typically, the output dimension `nO` of a listener's model equals the `nO` (or `width`) of the upstream embedding layer.
For a standard `Tok2Vec`-based component, this is typically known up-front and defined as such in the config:

```
[components.ner.model.tok2vec]
@architectures = "spacy.Tok2VecListener.v1"
width = ${components.tok2vec.model.encode.width}
```

A `transformer` component, however, only knows its `nO` dimension after the HuggingFace transformer
is set with the function `model.attrs["set_transformer"]`,
[implemented](https://github.com/explosion/spacy-transformers/blob/master/spacy_transformers/layers/transformer_model.py)
by `set_pytorch_transformer`.
This is why, upon linking of the transformer listeners, the `transformer` component also makes sure to set
the listener's output dimension correctly.

This shape inference mechanism also needs to happen with resumed/frozen components, which means that for some CLI
commands (`assemble` and `train`), we need to call `nlp._link_components` even before initializing the `nlp`
object. To cover all use-cases and avoid negative side effects, the code base ensures that performing the
linking twice is not harmful.

## 3. Internal communication

The internal communication between a listener and its downstream components is organized by sending and
receiving information across the components - either directly or implicitly.
The details are different depending on whether the pipeline is currently training, or predicting.
Either way, the `tok2vec` or `transformer` component always needs to run before the listener.

### 3A. During prediction

When the `Tok2Vec` pipeline component is called, its `predict()` method is executed to produce the results,
which are then stored by `set_annotations()` in the `doc.tensor` field of the document(s).
Similarly, the `Transformer` component stores the produced embeddings
in `doc._.trf_data`. Next, the `forward` pass of a
[`Tok2VecListener`](https://github.com/explosion/spaCy/blob/master/spacy/pipeline/tok2vec.py)
or a
[`TransformerListener`](https://github.com/explosion/spacy-transformers/blob/master/spacy_transformers/layers/listener.py)
accesses these fields on the `Doc` directly. Both listener implementations have a fallback mechanism for when these
properties were not set on the `Doc`: in that case an all-zero tensor is produced and returned.
We need this fallback mechanism to enable shape inference methods in Thinc, but the code
is slightly risky and at times might hide another bug - so it's a good spot to be aware of.

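To make the prediction-time flow concrete, here is a small sketch that runs a pipeline with a shared `tok2vec` and checks the `doc.tensor` field that the listeners read from. It assumes `en_core_web_sm` is installed; the exact width (96 below) is specific to that model.

```
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Listeners read their embeddings from the Doc at prediction time.")

# The shared tok2vec component ran first and stored one embedding row per token
# via set_annotations(); the downstream listeners then looked these rows up.
print(doc.tensor.shape)  # (number of tokens, embedding width), e.g. (11, 96)
```
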
### 3B. During training

During training, the `update()` methods of the `Tok2Vec` & `Transformer` components don't necessarily set the
annotations on the `Doc` (though since 3.1 they can, if they are part of the `annotating_components` list in the config).
Instead, we rely on a caching mechanism between the original embedding component and its listener.
Specifically, the produced embeddings are sent to the listeners by calling `listener.receive()` and uniquely
identifying the batch of documents with a `batch_id`. This `receive()` call also sends the appropriate `backprop`
call to ensure that gradients from the downstream component flow back to the trainable `Tok2Vec` or `Transformer`
network.

We rely on the `nlp` object properly batching the data and sending each batch through the pipeline in sequence,
which means that only one such batch needs to be kept in memory for each listener.
When the downstream component runs and the listener should produce embeddings, it accesses the batch in memory,
runs the backpropagation, and returns the results and the gradients.

There are two ways in which this mechanism can fail, both of which are detected by `verify_inputs()`:

- `E953` if a different batch is in memory than the requested one - signaling some kind of out-of-sync state of the
  training pipeline.
- `E954` if no batch is in memory at all - signaling that the pipeline is probably not set up correctly.

#### Training with multiple listeners

One `Tok2Vec` or `Transformer` component may be listened to by several downstream components, e.g.
a tagger and a parser could be sharing the same embeddings. In this case, we need to be careful about how we do
the backpropagation. When the `Tok2Vec` or `Transformer` sends out data to the listeners with `receive()`, it will
send an `accumulate_gradient` function call to all listeners, except the last one. This function will keep track
of the gradients received so far. Only the final listener in the pipeline will get an actual `backprop` call that
will initiate the backpropagation of the `tok2vec` or `transformer` model with the accumulated gradients.

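The following stand-alone sketch illustrates the accumulate-then-backprop idea in plain Python. It is not spaCy's actual implementation - just a minimal model of the control flow, where every listener but the last only accumulates its gradient and the last one triggers a single backward pass with the sum.

```
import numpy as np

def make_gradient_router(n_listeners, backprop):
    """Return a callback that accumulates gradients and backprops once."""
    received = []

    def handle_gradient(d_output):
        received.append(d_output)
        if len(received) == n_listeners:
            # Only the final gradient triggers the shared backward pass,
            # using the sum of everything received so far.
            backprop(np.sum(received, axis=0))

    return handle_gradient

def shared_backprop(total_gradient):
    print("backprop on shared embedding layer with:", total_gradient)

route = make_gradient_router(2, shared_backprop)
route(np.array([0.1, -0.2]))  # e.g. the tagger's gradient: accumulated only
route(np.array([0.3, 0.4]))   # e.g. the parser's gradient: triggers backprop
```
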
### 3C. Frozen components

The listener pattern can get particularly tricky in combination with frozen components. To detect components
with listeners that are not frozen consistently, `init_nlp()` (which is called by `spacy train`) goes through
the listeners and their upstream components and warns in two scenarios.

#### The Tok2Vec or Transformer is frozen

If the `Tok2Vec` or `Transformer` was already trained,
e.g. by [pretraining](https://spacy.io/usage/embeddings-transformers#pretraining),
it could be a valid use-case to freeze the embedding architecture and only train downstream components such
as a tagger or a parser. This used to be impossible before 3.1, but has become supported since then by putting the
embedding component in the [`annotating_components`](https://spacy.io/usage/training#annotating-components)
list of the config. This works like any other "annotating component" because it relies on the `Doc` attributes.

However, if the `Tok2Vec` or `Transformer` is frozen, and not present in `annotating_components`, and a related
listener isn't frozen, then a `W086` warning is shown and further training of the pipeline will likely end with `E954`.

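As a hedged config sketch, the supported setup described above would look roughly like this in the training section - the shared embedding component is both frozen and listed as an annotating component, so its predictions are still written to the `Doc` for the listeners to read:

```
[training]
frozen_components = ["tok2vec"]
annotating_components = ["tok2vec"]
```
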
#### The upstream component is frozen

If an upstream component is frozen but the underlying `Tok2Vec` or `Transformer` isn't, the performance of
the upstream component will be degraded after training. In this case, a `W087` warning is shown, explaining
how to use the `replace_listeners` functionality to prevent this problem.

## 4. Replacing listener with standalone

The [`replace_listeners`](https://spacy.io/api/language#replace_listeners) functionality changes the architecture
of a downstream component from using a listener pattern to a standalone `tok2vec` or `transformer` layer,
effectively making the downstream component independent of any other components in the pipeline.
It is implemented by `nlp.replace_listeners()` and typically executed by `nlp.from_config()`.
First, it fetches the original `Model` of the original component that creates the embeddings:

```
tok2vec = self.get_pipe(tok2vec_name)
tok2vec_model = tok2vec.model
```

This is either a [`Tok2Vec` model](https://github.com/explosion/spaCy/blob/master/spacy/ml/models/tok2vec.py) or a
[`TransformerModel`](https://github.com/explosion/spacy-transformers/blob/master/spacy_transformers/layers/transformer_model.py).

In the case of the `tok2vec`, this model can be copied as-is into the configuration and architecture of the
downstream component. However, for the `transformer`, this doesn't work.
The reason is that the `TransformerListener` architecture chains the listener with
[`trfs2arrays`](https://github.com/explosion/spacy-transformers/blob/master/spacy_transformers/layers/trfs2arrays.py):

```
model = chain(
    TransformerListener(upstream_name=upstream),
    trfs2arrays(pooling, grad_factor),
)
```

but the standalone `Tok2VecTransformer` has an additional `split_trf_batch` chained in between the model
and `trfs2arrays`:

```
model = chain(
    TransformerModel(name, get_spans, tokenizer_config),
    split_trf_batch(),
    trfs2arrays(pooling, grad_factor),
)
```

So you can't just take the model from the listener and drop that into the component internally. You need to
adjust the model and the config. To facilitate this, `nlp.replace_listeners()` will check whether additional
[functions](https://github.com/explosion/spacy-transformers/blob/master/spacy_transformers/layers/_util.py) are
[defined](https://github.com/explosion/spacy-transformers/blob/master/spacy_transformers/layers/transformer_model.py)
in `model.attrs`, and if so, it will essentially call these to make the appropriate changes:

```
replace_func = tok2vec_model.attrs["replace_listener_cfg"]
new_config = replace_func(tok2vec_cfg["model"], pipe_cfg["model"]["tok2vec"])
...
new_model = tok2vec_model.attrs["replace_listener"](new_model)
```

The new config and model are then properly stored on the `nlp` object.
Note that this functionality (running the replacement for a transformer listener) was broken prior to
`spacy-transformers` 1.0.5.

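For reference, a minimal usage sketch of this API on a trained pipeline might look as follows, assuming `en_core_web_sm` is installed and uses a shared `tok2vec` with an `ner` listener at `model.tok2vec`:

```
import spacy

nlp = spacy.load("en_core_web_sm")
# Give the ner component its own private copy of the embedding layer,
# so it no longer depends on the shared tok2vec component.
nlp.replace_listeners("tok2vec", "ner", ["model.tok2vec"])
```
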
New file: extra/DEVELOPER_DOCS/StringStore-Vocab.md (216 lines)

# StringStore & Vocab

> Reference: `spacy/strings.pyx`
> Reference: `spacy/vocab.pyx`

## Overview

spaCy represents most strings internally using a `uint64` in Cython which
corresponds to a hash. The magic required to make this largely transparent is
handled by the `StringStore`, and is integrated into the pipelines using the
`Vocab`, which also connects it to some other information.

These are mostly internal details that average library users should never have
to think about. On the other hand, when developing a component it's normal to
interact with the Vocab for lexeme data or word vectors, and it's not unusual
to add labels to the `StringStore`.

## StringStore

### Overview

The `StringStore` is a `cdef class` that looks a bit like a two-way dictionary,
though it is not a subclass of anything in particular.

The main functionality of the `StringStore` is that `__getitem__` converts
hashes into strings or strings into hashes.

The full details of the conversion are complicated. Normally you shouldn't have
to worry about them, but the first applicable case here is used to get the
return value:

1. 0 and the empty string are special cased to each other
2. internal symbols use a lookup table (`SYMBOLS_BY_STR`)
3. normal strings or bytes are hashed
4. internal symbol IDs in `SYMBOLS_BY_INT` are handled
5. anything not yet handled is used as a hash to lookup a string

For the symbol enums, see [`symbols.pxd`](https://github.com/explosion/spaCy/blob/master/spacy/symbols.pxd).

Almost all strings in spaCy are stored in the `StringStore`. This naturally
includes tokens, but also includes things like labels (not just NER/POS/dep,
but also categories etc.), lemmas, lowercase forms, word shapes, and so on. One
of the main results of this is that tokens can be represented by a compact C
struct ([`LexemeC`](https://spacy.io/api/cython-structs#lexemec)/[`TokenC`](https://github.com/explosion/spaCy/issues/4854)) that mostly consists of string hashes. This also means that converting
input for the models is straightforward, and there's not a token mapping step
like in many machine learning frameworks. Additionally, because the token IDs
in spaCy are based on hashes, they are consistent across environments or
models.

One pattern you'll see a lot in spaCy APIs is that `something.value` returns an
`int` and `something.value_` returns a string. That's implemented using the
`StringStore`. Typically the `int` is stored in a C struct and the string is
generated via a property that calls into the `StringStore` with the `int`.

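A hedged two-line illustration of that `.value` / `.value_` pairing, using the token attributes `lower` and `lower_`:

```
import spacy

nlp = spacy.blank("en")
doc = nlp("hello world")
token = doc[0]
print(token.lower, token.lower_)  # 64-bit hash and the string "hello"
# The int form is just the StringStore hash of the string form:
assert token.lower == nlp.vocab.strings[token.lower_]
```
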
Besides `__getitem__`, the `StringStore` has functions to return specifically a
string or specifically a hash, regardless of whether the input was a string or
hash to begin with, though these are only used occasionally.

### Implementation Details: Hashes and Allocations

Hashes are 64-bit and are computed using [murmurhash][] on UTF-8 bytes. There is no
mechanism for detecting and avoiding collisions. To date there has never been a
reproducible collision or user report about any related issues.

[murmurhash]: https://github.com/explosion/murmurhash

The empty string is not hashed, it's just converted to/from 0.

A small number of strings use indices into a lookup table (so low integers)
rather than hashes. This is mostly Universal Dependencies labels or other
strings considered "core" in spaCy. This was critical in v1, which hadn't
introduced hashing yet. Since v2 it's important for items in `spacy.attrs`,
especially lexeme flags, but is otherwise only maintained for backwards
compatibility.

You can call `strings["mystring"]` with a string the `StringStore` has never seen
before and it will return a hash. But in order to do the reverse operation, you
need to call `strings.add("mystring")` first. Without a call to `add` the
string will not be interned.

Example:

```
from spacy.strings import StringStore

ss = StringStore()
hashval = ss["spacy"]  # 10639093010105930009

try:
    # this won't work
    ss[hashval]
except KeyError:
    print(f"key {hashval} unknown in the StringStore.")

ss.add("spacy")
assert ss[hashval] == "spacy"  # it works now

# There is no `.keys` property, but you can iterate over keys
# The empty string will never be in the list of keys
for key in ss:
    print(key)
```

In normal use nothing is ever removed from the `StringStore`. In theory this
means that if you do something like iterate through all hex values of a certain
length you can have explosive memory usage. In practice this has never been an
issue. (Note that this is also different from using `sys.intern` to intern
Python strings, which does not guarantee they won't be garbage collected later.)

Strings are stored in the `StringStore` in a peculiar way: each string uses a
union that is either an eight-byte `char[]` or a `char*`. Short strings are
stored directly in the `char[]`, while longer strings are stored in allocated
memory and prefixed with their length. This is a strategy to reduce indirection
and memory fragmentation. See `decode_Utf8Str` and `_allocate` in
`strings.pyx` for the implementation.

### When to Use the StringStore?

While you can ignore the `StringStore` in many cases, there are situations where
you should make use of it to avoid errors.

Any time you introduce a string that may be set on a `Doc` field that has a hash,
you should add the string to the `StringStore`. This mainly happens when adding
labels in components, but there are some other cases (a short sketch follows after
this list):

- syntax iterators, mainly `get_noun_chunks`
- external data used in components, like the `KnowledgeBase` in the `entity_linker`
- labels used in tests

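A hedged sketch of the label case: interning a label string before writing a hash-backed `Doc` field. (Recent spaCy versions also add span labels for you, so the explicit `add` below is shown purely to illustrate the rule.)

```
import spacy
from spacy.tokens import Span

nlp = spacy.blank("en")
doc = nlp("Apple is looking at startups")

# Make sure the label's hash can be resolved back to a string later
nlp.vocab.strings.add("ORG")
doc.ents = [Span(doc, 0, 1, label="ORG")]
print(doc.ents[0].label_)  # "ORG"
```
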
## Vocab

The `Vocab` is a core component of a `Language` pipeline. Its main function is
to manage `Lexeme`s, which are structs that contain information about a token
that depends only on its surface form, without context. `Lexeme`s store much of
the data associated with `Token`s. As a side effect of this the `Vocab` also
manages the `StringStore` for a pipeline and a grab-bag of other data.

These are things stored in the vocab:

- `Lexeme`s
- `StringStore`
- `Morphology`: manages info used in `MorphAnalysis` objects
- `vectors`: basically a dict for word vectors
- `lookups`: language specific data like lemmas
- `writing_system`: language specific metadata
- `get_noun_chunks`: a syntax iterator
- lex attribute getters: functions like `is_punct`, set in language defaults
- `cfg`: **not** the pipeline config, this is mostly unused
- `_unused_object`: Formerly an unused object, kept around until v4 for compatibility

Some of these, like the Morphology and Vectors, are complex enough that they
need their own explanations. Here we'll just look at Vocab-specific items.

### Lexemes

A `Lexeme` is a type that mainly wraps a `LexemeC`, a struct consisting of ints
that identify various context-free token attributes. Lexemes are the core data
of the `Vocab`, and can be accessed using `__getitem__` on the `Vocab`. The memory
for storing `LexemeC` objects is managed by a pool that belongs to the `Vocab`.

Note that `__getitem__` on the `Vocab` works much like the `StringStore`, in
that it accepts a hash or id, with one important difference: if you do a lookup
using a string, that value is added to the `StringStore` automatically.

The attributes stored in a `LexemeC` are:

- orth (the raw text)
- lower
- norm
- shape
- prefix
- suffix

Most of these are straightforward. All of them can be customized, and (except
`orth`) probably should be since the defaults are based on English, but in
practice this is rarely done at present.

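A hedged sketch of that lookup behaviour and a few of the context-free attributes:

```
import spacy

nlp = spacy.blank("en")
# String lookup returns a Lexeme and interns "apple" in the StringStore
lex = nlp.vocab["apple"]
print(lex.orth, lex.orth_)     # hash and raw text
print(lex.lower_, lex.shape_)  # e.g. "apple" and "xxxx"
# The same Lexeme is returned when looking it up by hash
assert nlp.vocab[lex.orth].orth_ == "apple"
```
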
### Lookups

This is basically a dict of dicts, implemented using a `Table` for each
sub-dict, that stores lemmas and other language-specific lookup data.

A `Table` is a subclass of `OrderedDict` used for string-to-string data. It uses
Bloom filters to speed up misses and has some extra serialization features.
Tables are not used outside of the lookups.

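A hedged sketch of the `Lookups`/`Table` API described above:

```
from spacy.lookups import Lookups

lookups = Lookups()
table = lookups.add_table("lemma_lookup", {"went": "go", "going": "go"})
print(table["went"])                      # "go"
print(table.get("missing", "?"))          # misses are cheap thanks to the Bloom filter
print(lookups.has_table("lemma_lookup"))  # True
```
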
### Lex Attribute Getters

Lexical Attribute Getters like `is_punct` are defined on a per-language basis,
much like lookups, but take the form of functions rather than string-to-string
dicts, so they're stored separately.

### Writing System

This is a dict with three attributes:

- `direction`: ltr or rtl (default ltr)
- `has_case`: bool (default `True`)
- `has_letters`: bool (default `True`, `False` only for CJK for now)

Currently these are not used much - the main use is that `direction` is used in
visualizers, though `rtl` doesn't quite work (see
[#4854](https://github.com/explosion/spaCy/issues/4854)). In the future they
could be used when choosing hyperparameters for subwords, controlling word
shape generation, and similar tasks.

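A hedged way to peek at this dict on a pipeline (the English values below are the documented defaults; other languages may differ):

```
import spacy

print(spacy.blank("en").vocab.writing_system)
# e.g. {'direction': 'ltr', 'has_case': True, 'has_letters': True}
print(spacy.blank("he").vocab.writing_system["direction"])  # 'rtl' for Hebrew
```
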
### Other Vocab Members

The Vocab is kind of the default place to store things from `Language.defaults`
that don't belong to the Tokenizer. The following properties are in the Vocab
just because they don't have anywhere else to go.

- `get_noun_chunks`
- `cfg`: This is a dict that just stores `oov_prob` (hardcoded to `-20`)
- `_unused_object`: Leftover C member, should be removed in next major version

Other files changed in this commit:

```
@@ -5,7 +5,7 @@ requires = [
     "cymem>=2.0.2,<2.1.0",
     "preshed>=3.0.2,<3.1.0",
     "murmurhash>=0.28.0,<1.1.0",
-    "thinc>=8.0.8,<8.1.0",
+    "thinc>=8.0.10,<8.1.0",
     "blis>=0.4.0,<0.8.0",
     "pathy",
     "numpy>=1.15.0",
```

```
@@ -1,15 +1,15 @@
 # Our libraries
-spacy-legacy>=3.0.7,<3.1.0
+spacy-legacy>=3.0.8,<3.1.0
 cymem>=2.0.2,<2.1.0
 preshed>=3.0.2,<3.1.0
-thinc>=8.0.8,<8.1.0
+thinc>=8.0.10,<8.1.0
 blis>=0.4.0,<0.8.0
 ml_datasets>=0.2.0,<0.3.0
 murmurhash>=0.28.0,<1.1.0
 wasabi>=0.8.1,<1.1.0
 srsly>=2.4.1,<3.0.0
-catalogue>=2.0.4,<2.1.0
+catalogue>=2.0.6,<2.1.0
-typer>=0.3.0,<0.4.0
+typer>=0.3.0,<0.5.0
 pathy>=0.3.5
 # Third party dependencies
 numpy>=1.15.0
```

setup.cfg (10 lines changed)

```
@@ -37,19 +37,19 @@ setup_requires =
     cymem>=2.0.2,<2.1.0
     preshed>=3.0.2,<3.1.0
     murmurhash>=0.28.0,<1.1.0
-    thinc>=8.0.8,<8.1.0
+    thinc>=8.0.10,<8.1.0
 install_requires =
     # Our libraries
-    spacy-legacy>=3.0.7,<3.1.0
+    spacy-legacy>=3.0.8,<3.1.0
     murmurhash>=0.28.0,<1.1.0
     cymem>=2.0.2,<2.1.0
     preshed>=3.0.2,<3.1.0
-    thinc>=8.0.8,<8.1.0
+    thinc>=8.0.9,<8.1.0
     blis>=0.4.0,<0.8.0
     wasabi>=0.8.1,<1.1.0
     srsly>=2.4.1,<3.0.0
-    catalogue>=2.0.4,<2.1.0
+    catalogue>=2.0.6,<2.1.0
-    typer>=0.3.0,<0.4.0
+    typer>=0.3.0,<0.5.0
     pathy>=0.3.5
     # Third-party dependencies
     tqdm>=4.38.0,<5.0.0
```

```
@@ -1,6 +1,6 @@
 # fmt: off
 __title__ = "spacy"
-__version__ = "3.1.2"
+__version__ = "3.1.3"
 __download_url__ = "https://github.com/explosion/spacy-models/releases/download"
 __compatibility__ = "https://raw.githubusercontent.com/explosion/spacy-models/master/compatibility.json"
 __projects__ = "https://github.com/explosion/projects"
```

```
@@ -397,7 +397,11 @@ def git_checkout(
     run_command(cmd, capture=True)
     # We need Path(name) to make sure we also support subdirectories
     try:
-        shutil.copytree(str(tmp_dir / Path(subpath)), str(dest))
+        source_path = tmp_dir / Path(subpath)
+        if not is_subpath_of(tmp_dir, source_path):
+            err = f"'{subpath}' is a path outside of the cloned repository."
+            msg.fail(err, repo, exits=1)
+        shutil.copytree(str(source_path), str(dest))
     except FileNotFoundError:
         err = f"Can't clone {subpath}. Make sure the directory exists in the repo (branch '{branch}')"
         msg.fail(err, repo, exits=1)
@@ -445,8 +449,14 @@ def git_sparse_checkout(repo, subpath, dest, branch):
     # And finally, we can checkout our subpath
     cmd = f"git -C {tmp_dir} checkout {branch} {subpath}"
     run_command(cmd, capture=True)
-    # We need Path(name) to make sure we also support subdirectories
-    shutil.move(str(tmp_dir / Path(subpath)), str(dest))
+    # Get a subdirectory of the cloned path, if appropriate
+    source_path = tmp_dir / Path(subpath)
+    if not is_subpath_of(tmp_dir, source_path):
+        err = f"'{subpath}' is a path outside of the cloned repository."
+        msg.fail(err, repo, exits=1)
+    shutil.move(str(source_path), str(dest))


 def get_git_version(
@@ -477,6 +487,19 @@ def _http_to_git(repo: str) -> str:
     return repo


+def is_subpath_of(parent, child):
+    """
+    Check whether `child` is a path contained within `parent`.
+    """
+    # Based on https://stackoverflow.com/a/37095733 .
+
+    # In Python 3.9, the `Path.is_relative_to()` method will supplant this, so
+    # we can stop using crusty old os.path functions.
+    parent_realpath = os.path.realpath(parent)
+    child_realpath = os.path.realpath(child)
+    return os.path.commonpath([parent_realpath, child_realpath]) == parent_realpath
+
+
 def string_to_list(value: str, intify: bool = False) -> Union[List[str], List[int]]:
     """Parse a comma-separated string to a list and account for various
     formatting options. Mostly used to handle CLI arguments that take a list of
```

```
@@ -200,17 +200,21 @@ def get_third_party_dependencies(
     exclude (list): List of packages to exclude (e.g. that already exist in meta).
     RETURNS (list): The versioned requirements.
     """
-    own_packages = ("spacy", "spacy-nightly", "thinc", "srsly")
+    own_packages = ("spacy", "spacy-legacy", "spacy-nightly", "thinc", "srsly")
     distributions = util.packages_distributions()
     funcs = defaultdict(set)
-    for path, value in util.walk_dict(config):
-        if path[-1].startswith("@"):  # collect all function references by registry
-            funcs[path[-1][1:]].add(value)
+    # We only want to look at runtime-relevant sections, not [training] or [initialize]
+    for section in ("nlp", "components"):
+        for path, value in util.walk_dict(config[section]):
+            if path[-1].startswith("@"):  # collect all function references by registry
+                funcs[path[-1][1:]].add(value)
+    for component in config.get("components", {}).values():
+        if "factory" in component:
+            funcs["factories"].add(component["factory"])
     modules = set()
     for reg_name, func_names in funcs.items():
-        sub_registry = getattr(util.registry, reg_name)
         for func_name in func_names:
-            func_info = sub_registry.find(func_name)
+            func_info = util.registry.find(reg_name, func_name)
             module_name = func_info.get("module")
             if module_name:  # the code is part of a module, not a --code file
                 modules.add(func_info["module"].split(".")[0])
```

```
@@ -59,6 +59,15 @@ def project_assets(project_dir: Path, *, sparse_checkout: bool = False) -> None:
                 shutil.rmtree(dest)
             else:
                 dest.unlink()
+        if "repo" not in asset["git"] or asset["git"]["repo"] is None:
+            msg.fail(
+                "A git asset must include 'repo', the repository address.", exits=1
+            )
+        if "path" not in asset["git"] or asset["git"]["path"] is None:
+            msg.fail(
+                "A git asset must include 'path' - use \"\" to get the entire repository.",
+                exits=1,
+            )
         git_checkout(
             asset["git"]["repo"],
             asset["git"]["path"],
```

```
@@ -57,6 +57,7 @@ def project_run(

     project_dir (Path): Path to project directory.
     subcommand (str): Name of command to run.
+    overrides (Dict[str, Any]): Optional config overrides.
     force (bool): Force re-running, even if nothing changed.
     dry (bool): Perform a dry run and don't execute commands.
     capture (bool): Whether to capture the output and errors of individual commands.
@@ -72,7 +73,14 @@ def project_run(
     if subcommand in workflows:
         msg.info(f"Running workflow '{subcommand}'")
         for cmd in workflows[subcommand]:
-            project_run(project_dir, cmd, force=force, dry=dry, capture=capture)
+            project_run(
+                project_dir,
+                cmd,
+                overrides=overrides,
+                force=force,
+                dry=dry,
+                capture=capture,
+            )
     else:
         cmd = commands[subcommand]
         for dep in cmd.get("deps", []):
```

```
@@ -869,6 +869,10 @@ class Errors:
     E1019 = ("`noun_chunks` requires the pos tagging, which requires a "
             "statistical model to be installed and loaded. For more info, see "
             "the documentation:\nhttps://spacy.io/usage/models")
+    E1020 = ("No `epoch_resume` value specified and could not infer one from "
+             "filename. Specify an epoch to resume from.")
+    E1021 = ("`pos` value \"{pp}\" is not a valid Universal Dependencies tag. "
+             "Non-UD tags should use the `tag` property.")


 # Deprecated model shortcuts, only used in errors and warnings
```

```
@@ -82,7 +82,8 @@ for orth in [

 for verb in [
     "a",
-    "est" "semble",
+    "est",
+    "semble",
     "indique",
     "moque",
     "passe",
```

```
@@ -281,28 +281,19 @@ cdef class Matcher:
             final_matches.append((key, *match))
             # Mark tokens that have matched
             memset(&matched[start], 1, span_len * sizeof(matched[0]))
-        if with_alignments:
-            final_matches_with_alignments = final_matches
-            final_matches = [(key, start, end) for key, start, end, alignments in final_matches]
-        # perform the callbacks on the filtered set of results
-        for i, (key, start, end) in enumerate(final_matches):
-            on_match = self._callbacks.get(key, None)
-            if on_match is not None:
-                on_match(self, doc, i, final_matches)
         if as_spans:
-            spans = []
-            for key, start, end in final_matches:
+            final_results = []
+            for key, start, end, *_ in final_matches:
                 if isinstance(doclike, Span):
                     start += doclike.start
                     end += doclike.start
-                spans.append(Span(doc, start, end, label=key))
-            return spans
+                final_results.append(Span(doc, start, end, label=key))
         elif with_alignments:
             # convert alignments List[Dict[str, int]] --> List[int]
-            final_matches = []
             # when multiple alignment (belongs to the same length) is found,
             # keeps the alignment that has largest token_idx
-            for key, start, end, alignments in final_matches_with_alignments:
+            final_results = []
+            for key, start, end, alignments in final_matches:
                 sorted_alignments = sorted(alignments, key=lambda x: (x['length'], x['token_idx']), reverse=False)
                 alignments = [0] * (end-start)
                 for align in sorted_alignments:
@@ -311,10 +302,16 @@ cdef class Matcher:
                     # Since alignments are sorted in order of (length, token_idx)
                     # this overwrites smaller token_idx when they have same length.
                     alignments[align['length']] = align['token_idx']
-                final_matches.append((key, start, end, alignments))
-            return final_matches
+                final_results.append((key, start, end, alignments))
+            final_matches = final_results  # for callbacks
         else:
-            return final_matches
+            final_results = final_matches
+        # perform the callbacks on the filtered set of results
+        for i, (key, *_) in enumerate(final_matches):
+            on_match = self._callbacks.get(key, None)
+            if on_match is not None:
+                on_match(self, doc, i, final_matches)
+        return final_results

     def _normalize_key(self, key):
         if isinstance(key, basestring):
```

```
@@ -398,7 +398,9 @@ class SpanCategorizer(TrainablePipe):
         pass

     def _get_aligned_spans(self, eg: Example):
-        return eg.get_aligned_spans_y2x(eg.reference.spans.get(self.key, []), allow_overlap=True)
+        return eg.get_aligned_spans_y2x(
+            eg.reference.spans.get(self.key, []), allow_overlap=True
+        )

     def _make_span_group(
         self, doc: Doc, indices: Ints2d, scores: Floats2d, labels: List[str]
```

```
@@ -70,3 +70,10 @@ def test_create_with_heads_and_no_deps(vocab):
     heads = list(range(len(words)))
     with pytest.raises(ValueError):
         Doc(vocab, words=words, heads=heads)
+
+
+def test_create_invalid_pos(vocab):
+    words = "I like ginger".split()
+    pos = "QQ ZZ XX".split()
+    with pytest.raises(ValueError):
+        Doc(vocab, words=words, pos=pos)
```

```
@@ -203,6 +203,12 @@ def test_set_pos():
     assert doc[1].pos_ == "VERB"


+def test_set_invalid_pos():
+    doc = Doc(Vocab(), words=["hello", "world"])
+    with pytest.raises(ValueError):
+        doc[0].pos_ = "blah"
+
+
 def test_tokens_sent(doc):
     """Test token.sent property"""
     assert len(list(doc.sents)) == 3
```

```
@@ -576,6 +576,16 @@ def test_matcher_callback(en_vocab):
     mock.assert_called_once_with(matcher, doc, 0, matches)


+def test_matcher_callback_with_alignments(en_vocab):
+    mock = Mock()
+    matcher = Matcher(en_vocab)
+    pattern = [{"ORTH": "test"}]
+    matcher.add("Rule", [pattern], on_match=mock)
+    doc = Doc(en_vocab, words=["This", "is", "a", "test", "."])
+    matches = matcher(doc, with_alignments=True)
+    mock.assert_called_once_with(matcher, doc, 0, matches)
+
+
 def test_matcher_span(matcher):
     text = "JavaScript is good but Java is better"
     doc = Doc(matcher.vocab, words=text.split())
```

```
@@ -85,7 +85,12 @@ def test_doc_gc():
     spancat = nlp.add_pipe("spancat", config={"spans_key": SPAN_KEY})
     spancat.add_label("PERSON")
     nlp.initialize()
-    texts = ["Just a sentence.", "I like London and Berlin", "I like Berlin", "I eat ham."]
+    texts = [
+        "Just a sentence.",
+        "I like London and Berlin",
+        "I like Berlin",
+        "I eat ham.",
+    ]
     all_spans = [doc.spans for doc in nlp.pipe(texts)]
     for text, spangroups in zip(texts, all_spans):
         assert isinstance(spangroups, SpanGroups)
@@ -338,7 +343,11 @@ def test_overfitting_IO_overlapping():
     assert len(spans) == 3
     assert len(spans.attrs["scores"]) == 3
     assert min(spans.attrs["scores"]) > 0.9
-    assert set([span.text for span in spans]) == {"London", "Berlin", "London and Berlin"}
+    assert set([span.text for span in spans]) == {
+        "London",
+        "Berlin",
+        "London and Berlin",
+    }
     assert set([span.label_ for span in spans]) == {"LOC", "DOUBLE_LOC"}

     # Also test the results are still the same after IO
@@ -350,5 +359,9 @@ def test_overfitting_IO_overlapping():
     assert len(spans2) == 3
     assert len(spans2.attrs["scores"]) == 3
     assert min(spans2.attrs["scores"]) > 0.9
-    assert set([span.text for span in spans2]) == {"London", "Berlin", "London and Berlin"}
+    assert set([span.text for span in spans2]) == {
+        "London",
+        "Berlin",
+        "London and Berlin",
+    }
     assert set([span.label_ for span in spans2]) == {"LOC", "DOUBLE_LOC"}
```

```
@@ -9,6 +9,7 @@ from spacy.cli import info
 from spacy.cli.init_config import init_config, RECOMMENDATIONS
 from spacy.cli._util import validate_project_commands, parse_config_overrides
 from spacy.cli._util import load_project_config, substitute_project_variables
+from spacy.cli._util import is_subpath_of
 from spacy.cli._util import string_to_list
 from spacy import about
 from spacy.util import get_minor_version
@@ -535,8 +536,41 @@ def test_init_labels(component_name):
     assert len(nlp2.get_pipe(component_name).labels) == 4


-def test_get_third_party_dependencies_runs():
+def test_get_third_party_dependencies():
     # We can't easily test the detection of third-party packages here, but we
     # can at least make sure that the function and its importlib magic runs.
     nlp = Dutch()
+    # Test with component factory based on Cython module
+    nlp.add_pipe("tagger")
     assert get_third_party_dependencies(nlp.config) == []
+
+    # Test with legacy function
+    nlp = Dutch()
+    nlp.add_pipe(
+        "textcat",
+        config={
+            "model": {
+                # Do not update from legacy architecture spacy.TextCatBOW.v1
+                "@architectures": "spacy.TextCatBOW.v1",
+                "exclusive_classes": True,
+                "ngram_size": 1,
+                "no_output_layer": False,
+            }
+        },
+    )
+    get_third_party_dependencies(nlp.config) == []
+
+
+@pytest.mark.parametrize(
+    "parent,child,expected",
+    [
+        ("/tmp", "/tmp", True),
+        ("/tmp", "/", False),
+        ("/tmp", "/tmp/subdir", True),
+        ("/tmp", "/tmpdir", False),
+        ("/tmp", "/tmp/subdir/..", True),
+        ("/tmp", "/tmp/..", False),
+    ],
+)
+def test_is_subpath_of(parent, child, expected):
+    assert is_subpath_of(parent, child) == expected
```

```
@@ -30,6 +30,7 @@ from ..compat import copy_reg, pickle
 from ..errors import Errors, Warnings
 from ..morphology import Morphology
 from .. import util
+from .. import parts_of_speech
 from .underscore import Underscore, get_ext_args
 from ._retokenize import Retokenizer
 from ._serialize import ALL_ATTRS as DOCBIN_ALL_ATTRS
@@ -285,6 +286,10 @@ cdef class Doc:
                 sent_starts[i] = -1
             elif sent_starts[i] is None or sent_starts[i] not in [-1, 0, 1]:
                 sent_starts[i] = 0
+        if pos is not None:
+            for pp in set(pos):
+                if pp not in parts_of_speech.IDS:
+                    raise ValueError(Errors.E1021.format(pp=pp))
         ent_iobs = None
         ent_types = None
         if ents is not None:
```

```
@@ -867,6 +867,8 @@ cdef class Token:
             return parts_of_speech.NAMES[self.c.pos]

         def __set__(self, pos_name):
+            if pos_name not in parts_of_speech.IDS:
+                raise ValueError(Errors.E1021.format(pp=pos_name))
             self.c.pos = parts_of_speech.IDS[pos_name]

     property tag_:
```

```
@@ -177,3 +177,89 @@ def wandb_logger(
             return log_step, finalize

         return setup_logger
+
+
+@registry.loggers("spacy.WandbLogger.v3")
+def wandb_logger(
+    project_name: str,
+    remove_config_values: List[str] = [],
+    model_log_interval: Optional[int] = None,
+    log_dataset_dir: Optional[str] = None,
+    entity: Optional[str] = None,
+    run_name: Optional[str] = None,
+):
+    try:
+        import wandb
+
+        # test that these are available
+        from wandb import init, log, join  # noqa: F401
+    except ImportError:
+        raise ImportError(Errors.E880)
+
+    console = console_logger(progress_bar=False)
+
+    def setup_logger(
+        nlp: "Language", stdout: IO = sys.stdout, stderr: IO = sys.stderr
+    ) -> Tuple[Callable[[Dict[str, Any]], None], Callable[[], None]]:
+        config = nlp.config.interpolate()
+        config_dot = util.dict_to_dot(config)
+        for field in remove_config_values:
+            del config_dot[field]
+        config = util.dot_to_dict(config_dot)
+        run = wandb.init(
+            project=project_name, config=config, entity=entity, reinit=True
+        )
+
+        if run_name:
+            wandb.run.name = run_name
+
+        console_log_step, console_finalize = console(nlp, stdout, stderr)
+
+        def log_dir_artifact(
+            path: str,
+            name: str,
+            type: str,
+            metadata: Optional[Dict[str, Any]] = {},
+            aliases: Optional[List[str]] = [],
+        ):
+            dataset_artifact = wandb.Artifact(name, type=type, metadata=metadata)
+            dataset_artifact.add_dir(path, name=name)
+            wandb.log_artifact(dataset_artifact, aliases=aliases)
+
+        if log_dataset_dir:
+            log_dir_artifact(path=log_dataset_dir, name="dataset", type="dataset")
+
+        def log_step(info: Optional[Dict[str, Any]]):
+            console_log_step(info)
+            if info is not None:
+                score = info["score"]
+                other_scores = info["other_scores"]
+                losses = info["losses"]
+                wandb.log({"score": score})
+                if losses:
+                    wandb.log({f"loss_{k}": v for k, v in losses.items()})
+                if isinstance(other_scores, dict):
+                    wandb.log(other_scores)
+                if model_log_interval and info.get("output_path"):
+                    if info["step"] % model_log_interval == 0 and info["step"] != 0:
+                        log_dir_artifact(
+                            path=info["output_path"],
+                            name="pipeline_" + run.id,
+                            type="checkpoint",
+                            metadata=info,
+                            aliases=[
+                                f"epoch {info['epoch']} step {info['step']}",
+                                "latest",
+                                "best"
+                                if info["score"] == max(info["checkpoints"])[0]
+                                else "",
+                            ],
+                        )
+
+        def finalize() -> None:
+            console_finalize()
+            wandb.join()
+
+        return log_step, finalize
+
+    return setup_logger
```

```
@@ -41,10 +41,11 @@ def pretrain(
     optimizer = P["optimizer"]
     # Load in pretrained weights to resume from
     if resume_path is not None:
-        _resume_model(model, resume_path, epoch_resume, silent=silent)
+        epoch_resume = _resume_model(model, resume_path, epoch_resume, silent=silent)
     else:
         # Without '--resume-path' the '--epoch-resume' argument is ignored
         epoch_resume = 0

     objective = model.attrs["loss"]
     # TODO: move this to logger function?
     tracker = ProgressTracker(frequency=10000)
@@ -93,20 +94,25 @@ def ensure_docs(examples_or_docs: Iterable[Union[Doc, Example]]) -> List[Doc]:

 def _resume_model(
     model: Model, resume_path: Path, epoch_resume: int, silent: bool = True
-) -> None:
+) -> int:
     msg = Printer(no_print=silent)
     msg.info(f"Resume training tok2vec from: {resume_path}")
     with resume_path.open("rb") as file_:
         weights_data = file_.read()
         model.get_ref("tok2vec").from_bytes(weights_data)
-    # Parse the epoch number from the given weight file
-    model_name = re.search(r"model\d+\.bin", str(resume_path))
-    if model_name:
-        # Default weight file name so read epoch_start from it by cutting off 'model' and '.bin'
-        epoch_resume = int(model_name.group(0)[5:][:-4]) + 1
-        msg.info(f"Resuming from epoch: {epoch_resume}")
-    else:
-        msg.info(f"Resuming from epoch: {epoch_resume}")
+
+    if epoch_resume is None:
+        # Parse the epoch number from the given weight file
+        model_name = re.search(r"model\d+\.bin", str(resume_path))
+        if model_name:
+            # Default weight file name so read epoch_start from it by cutting off 'model' and '.bin'
+            epoch_resume = int(model_name.group(0)[5:][:-4]) + 1
+        else:
+            # No epoch given and couldn't infer it
+            raise ValueError(Errors.E1020)
+
+    msg.info(f"Resuming from epoch: {epoch_resume}")
+    return epoch_resume


 def make_update(
```

```
@@ -140,6 +140,32 @@ class registry(thinc.registry):
             ) from None
         return func

+    @classmethod
+    def find(cls, registry_name: str, func_name: str) -> Callable:
+        """Get info about a registered function from the registry."""
+        # We're overwriting this classmethod so we're able to provide more
+        # specific error messages and implement a fallback to spacy-legacy.
+        if not hasattr(cls, registry_name):
+            names = ", ".join(cls.get_registry_names()) or "none"
+            raise RegistryError(Errors.E892.format(name=registry_name, available=names))
+        reg = getattr(cls, registry_name)
+        try:
+            func_info = reg.find(func_name)
+        except RegistryError:
+            if func_name.startswith("spacy."):
+                legacy_name = func_name.replace("spacy.", "spacy-legacy.")
+                try:
+                    return reg.find(legacy_name)
+                except catalogue.RegistryError:
+                    pass
+            available = ", ".join(sorted(reg.get_all().keys())) or "none"
+            raise RegistryError(
+                Errors.E893.format(
+                    name=func_name, reg_name=registry_name, available=available
+                )
+            ) from None
+        return func_info
+
     @classmethod
     def has(cls, registry_name: str, func_name: str) -> bool:
         """Check whether a function is available in a registry."""
```

````
@@ -462,7 +462,7 @@ start decreasing across epochs.

 </Accordion>

-#### spacy.WandbLogger.v2 {#WandbLogger tag="registered function"}
+#### spacy.WandbLogger.v3 {#WandbLogger tag="registered function"}

 > #### Installation
 >
@@ -494,19 +494,21 @@ remain in the config file stored on your local system.
 >
 > ```ini
 > [training.logger]
-> @loggers = "spacy.WandbLogger.v2"
+> @loggers = "spacy.WandbLogger.v3"
 > project_name = "monitor_spacy_training"
 > remove_config_values = ["paths.train", "paths.dev", "corpora.train.path", "corpora.dev.path"]
 > log_dataset_dir = "corpus"
 > model_log_interval = 1000
 > ```

 | Name                   | Description |
-| ---------------------- | ----------- |
+| ---------------------- | ----------- |
 | `project_name`         | The name of the project in the Weights & Biases interface. The project will be created automatically if it doesn't exist yet. ~~str~~ |
 | `remove_config_values` | A list of values to include from the config before it is uploaded to W&B (default: empty). ~~List[str]~~ |
 | `model_log_interval`   | Steps to wait between logging model checkpoints to W&B dasboard (default: None). ~~Optional[int]~~ |
 | `log_dataset_dir`      | Directory containing dataset to be logged and versioned as W&B artifact (default: None). ~~Optional[str]~~ |
+| `run_name`             | The name of the run. If you don't specify a run_name, the name will be created by wandb library. (default: None). ~~Optional[str]~~ |
+| `entity`               | An entity is a username or team name where you're sending runs. If you don't specify an entity, the run will be sent to your default entity, which is usually your username. (default: None). ~~Optional[str]~~ |

 <Project id="integrations/wandb">
````

````
@@ -291,7 +291,7 @@ files you need and not the whole repo.

 | Name          | Description |
 | ------------- | ----------- |
 | `dest`        | The destination path to save the downloaded asset to (relative to the project directory), including the file name. |
-| `git`         | `repo`: The URL of the repo to download from.<br />`path`: Path of the file or directory to download, relative to the repo root.<br />`branch`: The branch to download from. Defaults to `"master"`. |
+| `git`         | `repo`: The URL of the repo to download from.<br />`path`: Path of the file or directory to download, relative to the repo root. "" specifies the root directory.<br />`branch`: The branch to download from. Defaults to `"master"`. |
 | `checksum`    | Optional checksum of the file. If provided, it will be used to verify that the file matches and downloads will be skipped if a local file with the same checksum already exists. |
 | `description` | Optional asset description, used in [auto-generated docs](#custom-docs). |
````