* Alignment: use a simplified ragged type for performance
This introduces the AlignmentArray type, which is a simplified version
of Ragged that performs better on the simple(r) indexing performed for
alignment.
* AlignmentArray: raise an error when using unsupported index
* AlignmentArray: move error messages to Errors
* AlignmentArray: remove simlified ... with simplifications
* AlignmentArray: fix typo that broke a[n:n] indexing
* Add edit tree lemmatizer
Co-authored-by: Daniël de Kok <me@danieldk.eu>
* Hide edit tree lemmatizer labels
* Use relative imports
* Switch to single quotes in error message
* Type annotation fixes
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
* Reformat edit_tree_lemmatizer with black
* EditTreeLemmatizer.predict: take Iterable
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
* Validate edit trees during deserialization
This change also changes the serialized representation. Rather than
mirroring the deep C structure, we use a simple flat union of the match
and substitution node types.
* Move edit_trees to _edit_tree_internals
* Fix invalid edit tree format error message
* edit_tree_lemmatizer: remove outdated TODO comment
* Rename factory name to trainable_lemmatizer
* Ignore type instead of casting truths to List[Union[Ints1d, Floats2d, List[int], List[str]]] for thinc v8.0.14
* Switch to Tagger.v2
* Add documentation for EditTreeLemmatizer
* docs: Fix 3.2 -> 3.3 somewhere
* trainable_lemmatizer documentation fixes
* docs: EditTreeLemmatizer is in edit_tree_lemmatizer.py
Co-authored-by: Daniël de Kok <me@danieldk.eu>
Co-authored-by: Daniël de Kok <me@github.danieldk.eu>
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
* Tagger: use unnormalized probabilities for inference
Using unnormalized softmax avoids use of the relatively expensive exp function,
which can significantly speed up non-transformer models (e.g. I got a speedup
of 27% on a German tagging + parsing pipeline).
* Add spacy.Tagger.v2 with configurable normalization
Normalization of probabilities is disabled by default to improve
performance.
* Update documentation, models, and tests to spacy.Tagger.v2
* Move Tagger.v1 to spacy-legacy
* docs/architectures: run prettier
* Unnormalized softmax is now a Softmax_v2 option
* Require thinc 8.0.14 and spacy-legacy 3.0.9
* Add save_candidates attribute
* Change spancat api
* Add unit test
* reimplement method to produce a list of doc
* Add method to docs
* Add new version tag
* Add intended use to docstring
* prettier formatting
* Partial fix of entity linker batching
* Add import
* Better name
* Add `use_gold_ents` option, docs
* Change to v2, create stub v1, update docs etc.
* Fix error type
Honestly no idea what the right type to use here is.
ConfigValidationError seems wrong. Maybe a NotImplementedError?
* Make mypy happy
* Add hacky fix for init issue
* Add legacy pipeline entity linker
* Fix references to class name
* Add __init__.py for legacy
* Attempted fix for loss issue
* Remove placeholder V1
* formatting
* slightly more interesting train data
* Handle batches with no usable examples
This adds a test for batches that have docs but not entities, and a
check in the component that detects such cases and skips the update step
as thought the batch were empty.
* Remove todo about data verification
Check for empty data was moved further up so this should be OK now - the
case in question shouldn't be possible.
* Fix gradient calculation
The model doesn't know which entities are not in the kb, so it generates
embeddings for the context of all of them.
However, the loss does know which entities aren't in the kb, and it
ignores them, as there's no sensible gradient.
This has the issue that the gradient will not be calculated for some of
the input embeddings, which causes a dimension mismatch in backprop.
That should have caused a clear error, but with numpyops it was causing
nans to happen, which is another problem that should be addressed
separately.
This commit changes the loss to give a zero gradient for entities not in
the kb.
* add failing test for v1 EL legacy architecture
* Add nasty but simple working check for legacy arch
* Clarify why init hack works the way it does
* Clarify use_gold_ents use case
* Fix use gold ents related handling
* Add tests for no gold ents and fix other tests
* Use aligned ents function (not working)
This doesn't actually work because the "aligned" ents are gold-only. But
if I have a different function that returns the intersection, *then*
this will work as desired.
* Use proper matching ent check
This changes the process when gold ents are not used so that the
intersection of ents in the pred and gold is used.
* Move get_matching_ents to Example
* Use model attribute to check for legacy arch
* Rename flag
* bump spacy-legacy to lower 3.0.9
Co-authored-by: svlandeg <svlandeg@github.com>
* fixing argument order for rehearse
* rehearse test for ner and tagger
* rehearse bugfix
* added test for parser
* test for multilabel textcat
* rehearse fix
* remove debug line
* Update spacy/tests/training/test_rehearse.py
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
* Update spacy/tests/training/test_rehearse.py
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
Co-authored-by: Kádár Ákos <akos@onyx.uvt.nl>
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
* Make core projectivization methods cdef nogil
While profiling the parser, I noticed that relatively a lot of time is
spent in projectivization. This change rewrites the functions in the
core loops as cdef nogil for efficiency.
In C++-land, we use vector in place of Python lists and absent heads
are represented as -1 in place of None.
* _heads_to_c: add assertion
Validation should be performed by the caller, but this assertion ensures that
we are not reading/writing out of bounds with incorrect input.
* Auto-format code with black
* add black requirement to dev dependencies and pin to 22.x
* ignore black dependency for comparison with setup.cfg
Co-authored-by: explosion-bot <explosion-bot@users.noreply.github.com>
Co-authored-by: svlandeg <svlandeg@github.com>
Instead of the running the actual suggester, which may require
annotation from annotating components that is not necessarily present in
the reference docs, use the built-in 1-gram suggester.
This change changes the type of left/right-arc collections from
vector[ArcC] to unordered_map[int, vector[Arc]], so that the arcs are
keyed by the head. This allows us to find all the left/right arcs for a
particular head in constant time in StateC::{L,R}.
Benchmarks with long docs (N is the number of text repetitions):
Before (using #10019):
N Time (s)
400 3.2
800 5.0
1600 9.5
3200 23.2
6400 66.8
12800 220.0
After (this commit):
N Time (s)
400 3.1
800 4.3
1600 6.7
3200 12.0
6400 22.0
12800 42.0
Related to #9858 and #10019.
* Speed up the StateC::L feature function
This function gets the n-th most-recent left-arc with a particular head.
Before this change, StateC::L would construct a vector of all left-arcs
with the given head and then pick the n-th most recent from that vector.
Since the number of left-arcs strongly correlates with the doc length
and the feature is constructed for every transition, this can make
transition-parsing quadratic.
With this change StateC::L:
- Searches left-arcs backwards.
- Stops early when the n-th matching transition is found.
- Does not construct a vector (reducing memory pressure).
This change doesn't avoid the linear search when the transition that is
queried does not occur in the left-arcs. Regardless, performance is
improved quite a bit with very long docs:
Before:
N Time
400 3.3
800 5.4
1600 11.6
3200 30.7
After:
N Time
400 3.2
800 5.0
1600 9.5
3200 23.2
We can probably do better with more tailored data structures, but I
first wanted to make a low-impact PR.
Found while investigating #9858.
* StateC::L: simplify loop
* Speed up the StateC::L feature function
This function gets the n-th most-recent left-arc with a particular head.
Before this change, StateC::L would construct a vector of all left-arcs
with the given head and then pick the n-th most recent from that vector.
Since the number of left-arcs strongly correlates with the doc length
and the feature is constructed for every transition, this can make
transition-parsing quadratic.
With this change StateC::L:
- Searches left-arcs backwards.
- Stops early when the n-th matching transition is found.
- Does not construct a vector (reducing memory pressure).
This change doesn't avoid the linear search when the transition that is
queried does not occur in the left-arcs. Regardless, performance is
improved quite a bit with very long docs:
Before:
N Time
400 3.3
800 5.4
1600 11.6
3200 30.7
After:
N Time
400 3.2
800 5.0
1600 9.5
3200 23.2
We can probably do better with more tailored data structures, but I
first wanted to make a low-impact PR.
Found while investigating #9858.
* StateC::L: simplify loop
* Span/SpanGroup: wrap SpanC in shared_ptr
When a Span that was retrieved from a SpanGroup was modified, these
changes were not reflected in the SpanGroup because the underlying
SpanC struct was copied.
This change applies the solution proposed by @nrodnova, to wrap SpanC in
a shared_ptr. This makes a SpanGroup and Spans derived from it share the
same SpanC. So, changes made through a Span are visible in the SpanGroup
as well.
Fixes#9556
* Test that a SpanGroup is modified through its Spans
* SpanGroup.push_back: remove nogil
Modifying std::vector is not thread-safe.
* C++ >= 11 does not allow const T in vector<T>
* Add Span.span_c as a shorthand for Span.c.get
Since this method is cdef'ed, it is only visible from Cython, so we
avoid using raw pointers in Python
Replace existing uses of span.c.get() to use this new method.
* Fix formatting
* Style fix: pointer types
* SpanGroup.to_bytes: reduce number of shared_ptr::get calls
* Mark SpanGroup modification test with issue
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
* Fix Scorer.score_cats for missing labels
* Add test case for Scorer.score_cats missing labels
* semantic nitpick
* black formatting
* adjust test to give different results depending on multi_label setting
* fix loss function according to whether or not missing values are supported
* add note to docs
* small fixes
* make mypy happy
* Update spacy/pipeline/textcat.py
Co-authored-by: Florian Cäsar <florian.caesar@pm.me>
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
Co-authored-by: svlandeg <svlandeg@github.com>
* added ruler coe
* added error for none existing pattern
* changed error to warning
* changed error to warning
* added basic tests
* fixed place
* added test files
* went back to error
* went back to pattern error
* minor change to docs
* changed style
* changed doc
* changed error slightly
* added remove to phrasem api
* error key already existed
* phrase matcher match code to api
* blacked tests
* moved comments before expr
* corrected error no
* Update website/docs/api/entityruler.md
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
* Update website/docs/api/entityruler.md
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
* morphologizer: avoid recreating label tuple for each token
The `labels` property converts the dictionary key set to a tuple. This
property was used for every annotated token, recreating the tuple over
and over again.
Construct the tuple once in the set_annotations function and reuse it.
On a Finnish pipeline that I was experimenting with, this results in a
speedup of ~15% (~13000 -> ~15000 WPS).
* tagger: avoid recreating label tuple for each token
* added error string
* added serialization test
* added more to if statements
* wrote file to tempdir
* added tempdir
* changed parameter a bit
* Update spacy/tests/pipeline/test_entity_ruler.py
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
* add custom protocols in spacy.ty
* add a test for the new types in spacy.ty
* import Example when type checking
* some type fixes
* put Protocol in compat
* revert update check back to hasattr
* runtime_checkable in compat as well
* 🚨 Ignore all existing Mypy errors
* 🏗 Add Mypy check to CI
* Add types-mock and types-requests as dev requirements
* Add additional type ignore directives
* Add types packages to dev-only list in reqs test
* Add types-dataclasses for python 3.6
* Add ignore to pretrain
* 🏷 Improve type annotation on `run_command` helper
The `run_command` helper previously declared that it returned an
`Optional[subprocess.CompletedProcess]`, but it isn't actually possible
for the function to return `None`. These changes modify the type
annotation of the `run_command` helper and remove all now-unnecessary
`# type: ignore` directives.
* 🔧 Allow variable type redefinition in limited contexts
These changes modify how Mypy is configured to allow variables to have
their type automatically redefined under certain conditions. The Mypy
documentation contains the following example:
```python
def process(items: List[str]) -> None:
# 'items' has type List[str]
items = [item.split() for item in items]
# 'items' now has type List[List[str]]
...
```
This configuration change is especially helpful in reducing the number
of `# type: ignore` directives needed to handle the common pattern of:
* Accepting a filepath as a string
* Overwriting the variable using `filepath = ensure_path(filepath)`
These changes enable redefinition and remove all `# type: ignore`
directives rendered redundant by this change.
* 🏷 Add type annotation to converters mapping
* 🚨 Fix Mypy error in convert CLI argument verification
* 🏷 Improve type annotation on `resolve_dot_names` helper
* 🏷 Add type annotations for `Vocab` attributes `strings` and `vectors`
* 🏷 Add type annotations for more `Vocab` attributes
* 🏷 Add loose type annotation for gold data compilation
* 🏷 Improve `_format_labels` type annotation
* 🏷 Fix `get_lang_class` type annotation
* 🏷 Loosen return type of `Language.evaluate`
* 🏷 Don't accept `Scorer` in `handle_scores_per_type`
* 🏷 Add `string_to_list` overloads
* 🏷 Fix non-Optional command-line options
* 🙈 Ignore redefinition of `wandb_logger` in `loggers.py`
* ➕ Install `typing_extensions` in Python 3.8+
The `typing_extensions` package states that it should be used when
"writing code that must be compatible with multiple Python versions".
Since SpaCy needs to support multiple Python versions, it should be used
when newer `typing` module members are required. One example of this is
`Literal`, which is available starting with Python 3.8.
Previously SpaCy tried to import `Literal` from `typing`, falling back
to `typing_extensions` if the import failed. However, Mypy doesn't seem
to be able to understand what `Literal` means when the initial import
means. Therefore, these changes modify how `compat` imports `Literal` by
always importing it from `typing_extensions`.
These changes also modify how `typing_extensions` is installed, so that
it is a requirement for all Python versions, including those greater
than or equal to 3.8.
* 🏷 Improve type annotation for `Language.pipe`
These changes add a missing overload variant to the type signature of
`Language.pipe`. Additionally, the type signature is enhanced to allow
type checkers to differentiate between the two overload variants based
on the `as_tuple` parameter.
Fixes#8772
* ➖ Don't install `typing-extensions` in Python 3.8+
After more detailed analysis of how to implement Python version-specific
type annotations using SpaCy, it has been determined that by branching
on a comparison against `sys.version_info` can be statically analyzed by
Mypy well enough to enable us to conditionally use
`typing_extensions.Literal`. This means that we no longer need to
install `typing_extensions` for Python versions greater than or equal to
3.8! 🎉
These changes revert previous changes installing `typing-extensions`
regardless of Python version and modify how we import the `Literal` type
to ensure that Mypy treats it properly.
* resolve mypy errors for Strict pydantic types
* refactor code to avoid missing return statement
* fix types of convert CLI command
* avoid list-set confustion in debug_data
* fix typo and formatting
* small fixes to avoid type ignores
* fix types in profile CLI command and make it more efficient
* type fixes in projects CLI
* put one ignore back
* type fixes for render
* fix render types - the sequel
* fix BaseDefault in language definitions
* fix type of noun_chunks iterator - yields tuple instead of span
* fix types in language-specific modules
* 🏷 Expand accepted inputs of `get_string_id`
`get_string_id` accepts either a string (in which case it returns its
ID) or an ID (in which case it immediately returns the ID). These
changes extend the type annotation of `get_string_id` to indicate that
it can accept either strings or IDs.
* 🏷 Handle override types in `combine_score_weights`
The `combine_score_weights` function allows users to pass an `overrides`
mapping to override data extracted from the `weights` argument. Since it
allows `Optional` dictionary values, the return value may also include
`Optional` dictionary values.
These changes update the type annotations for `combine_score_weights` to
reflect this fact.
* 🏷 Fix tokenizer serialization method signatures in `DummyTokenizer`
* 🏷 Fix redefinition of `wandb_logger`
These changes fix the redefinition of `wandb_logger` by giving a
separate name to each `WandbLogger` version. For
backwards-compatibility, `spacy.train` still exports `wandb_logger_v3`
as `wandb_logger` for now.
* more fixes for typing in language
* type fixes in model definitions
* 🏷 Annotate `_RandomWords.probs` as `NDArray`
* 🏷 Annotate `tok2vec` layers to help Mypy
* 🐛 Fix `_RandomWords.probs` type annotations for Python 3.6
Also remove an import that I forgot to move to the top of the module 😅
* more fixes for matchers and other pipeline components
* quick fix for entity linker
* fixing types for spancat, textcat, etc
* bugfix for tok2vec
* type annotations for scorer
* add runtime_checkable for Protocol
* type and import fixes in tests
* mypy fixes for training utilities
* few fixes in util
* fix import
* 🐵 Remove unused `# type: ignore` directives
* 🏷 Annotate `Language._components`
* 🏷 Annotate `spacy.pipeline.Pipe`
* add doc as property to span.pyi
* small fixes and cleanup
* explicit type annotations instead of via comment
Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
Co-authored-by: svlandeg <sofie.vanlandeghem@gmail.com>
Co-authored-by: svlandeg <svlandeg@github.com>
* remove text argument from W108 to enable 'once' filtering
* include the option of partial POS annotation
* fix typo
* Update spacy/errors.py
Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
* Add overwrite settings for more components
For pipeline components where it's relevant and not already implemented,
add an explicit `overwrite` setting that controls whether
`set_annotations` overwrites existing annotation.
For the `morphologizer`, add an additional setting `extend`, which
controls whether the existing features are preserved.
* +overwrite, +extend: overwrite values of existing features, add any new
features
* +overwrite, -extend: overwrite completely, removing any existing
features
* -overwrite, +extend: keep values of existing features, add any new
features
* -overwrite, -extend: do not modify the existing value if set
In all cases an unset value will be set by `set_annotations`.
Preserve current overwrite defaults:
* True: morphologizer, entity linker
* False: tagger, sentencizer, senter
* Add backwards compat overwrite settings
* Put empty line back
Removed by accident in last commit
* Set backwards-compatible defaults in __init__
Because the `TrainablePipe` serialization methods update `cfg`, there's
no straightforward way to detect whether models serialized with a
previous version are missing the overwrite settings.
It would be possible in the sentencizer due to its separate
serialization methods, however to keep the changes parallel, this also
sets the default in `__init__`.
* Remove traces
Co-authored-by: Paul O'Leary McCann <polm@dampfkraft.com>
* Replace all basestring references with unicode
`basestring` was a compatability type introduced by Cython to make
dealing with utf-8 strings in Python2 easier. In Python3 it is
equivalent to the unicode (or str) type.
I replaced all references to basestring with unicode, since that was
used elsewhere, but we could also just replace them with str, which
shoudl also be equivalent.
All tests pass locally.
* Replace all references to unicode type with str
Since we only support python3 this is simpler.
* Remove all references to unicode type
This removes all references to the unicode type across the codebase and
replaces them with `str`, which makes it more drastic than the prior
commits. In order to make this work importing `unicode_literals` had to
be removed, and one explicit unicode literal also had to be removed (it
is unclear why this is necessary in Cython with language level 3, but
without doing it there were errors about implicit conversion).
When `unicode` is used as a type in comments it was also edited to be
`str`.
Additionally `coding: utf8` headers were removed from a few files.
* overfitting test on non-overlapping entities
* add failing overfitting test for overlapping entities
* failing test for list comprehension
* remove test that was put in separate PR
* bugfix
* cleanup
* Add scorer option to components
Add an optional `scorer` parameter to all pipeline components. If a
scoring function is provided, it overrides the default scoring method
for that component.
* Add registered scorers for all components
* Add `scorers` registry
* Move all scoring methods outside of components as independent
functions and register
* Use the registered scoring methods as defaults in configs and inits
Additional:
* The scoring methods no longer have access to the full component, so
use settings from `cfg` as default scorer options to handle settings
such as `labels`, `threshold`, and `positive_label`
* The `attribute_ruler` scoring method no longer has access to the
patterns, so all scoring methods are called
* Bug fix: `spancat` scoring method is updated to set `allow_overlap` to
score overlapping spans correctly
* Update Russian lemmatizer to use direct score method
* Check type of cfg in Pipe.score
* Fix check
* Update spacy/pipeline/sentencizer.pyx
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
* Remove validate_examples from scoring functions
* Use Pipe.labels instead of Pipe.cfg["labels"]
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
* Add scores to output in spancat
This exposes the scores as an attribute on the SpanGroup. Includes a
basic test.
* Add basic doc note
* Vectorize score calcs
* Add "annotation format" section
* Update website/docs/api/spancategorizer.md
Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
* Clean up doc section
* Ran prettier on docs
* Get arrays off the gpu before iterating over them
* Remove int() calls
Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
* Pass excludes when serializing vocab
Additional minor bug fix:
* Deserialize vocab in `EntityLinker.from_disk`
* Add test for excluding strings on load
* Fix formatting