* Change GPU efficient textcat to use CNN, not BOW
If you generate a config with a textcat component using GPU
(transformers), the default option (efficiency) uses a BOW architecture,
which does not use tok2vec features. While that can make sense as part
of a larger pipeline, in the case of just a transformer and a textcat,
that means the transformer is doing a lot of work for no purpose.
This changes it so that the CNN architecture is used instead. It could
also be changed to be the same as the accuracy config, which uses the
ensemble architecture.
* Add the transformer when using a textcat with GPU
* Switch ubuntu-latest to ubuntu-20.04 in main tests (#11928)
* Switch ubuntu-latest to ubuntu-20.04 in main tests
* Only use 20.04 for 3.6
* Require thinc v8.1.7
* Require thinc v8.1.8
* Break up longer expression
---------
Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
* Handle deprecation of pkg_resources
* Replace `pkg_resources` with `importlib_metadata` for `spacy info --url`
* Remove requirements check from `spacy project` given the lack of
alternatives
* Fix installed model URL method and CI test
* Fix types/handling, simplify catch-all return
* Move imports instead of disabling requirements check
* Format
* Reenable test with ignored deprecation warning
* Fix except
* Fix return
* Make empty_kb() configurable.
* Format.
* Update docs.
* Be more specific in KB serialization test.
* Update KB serialization tests. Update docs.
* Remove doc update for batched candidate generation.
* Fix serialization of subclassed KB in tests.
* Format.
* Update docstring.
* Update docstring.
* Switch from pickle to json for custom field serialization.
* Add immediate left/right child/parent dependency relations
* Add tests for new REL_OPs: `>+`, `>-`, `<+`, and `<-`.
---------
Co-authored-by: Tan Long <tanloong@foxmail.com>
* add unittest for explosion#12311
* create punctuation.py for swedish
* removed : from infixes in swedish punctuation.py
* allow : as infix if succeeding char is uppercase
* standardize predicate key format
* single key function
* Make optional args in key function keyword-only
---------
Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
* Improve the correctness of _parse_patch
* If there are no more actions, do not attempt to make further
transitions, even if not all states are final.
* Assert that the number of actions for a step is the same as
the number of states.
* Reimplement distillation with oracle cut size
The code for distillation with an oracle cut size was not reimplemented
after the parser refactor. We did not notice, because we did not have
tests for this functionality. This change brings back the functionality
and adds this to the parser tests.
* Rename states2actions to _states_to_actions for consistency
* Test distillation max cuts in NER
* Mark parser/NER tests as slow
* Typo
* Fix invariant in _states_diff_to_actions
* Rename _init_batch -> _init_batch_from_teacher
* Ninja edit the ninja edit
* Check that we raise an exception when we pass the incorrect number of actions
* Remove unnecessary get
Co-authored-by: Madeesh Kannan <shadeMe@users.noreply.github.com>
* Write out condition more explicitly
---------
Co-authored-by: Madeesh Kannan <shadeMe@users.noreply.github.com>
* `Language.update`: ensure that tok2vec gets updated
The components in a pipeline can be updated independently. However,
tok2vec implementations are an exception to this, since they depend on
listeners for their gradients. The update method of a tok2vec
implementation computes the tok2vec forward and passes this along with a
backprop function to the listeners. This backprop function accumulates
gradients for all the listeners. There are two ways in which the
accumulated gradients can be used to update the tok2vec weights:
1. Call the `finish_update` method of tok2vec *after* the `update`
method is called on all of the pipes that use a tok2vec listener.
2. Pass an optimizer to the `update` method of tok2vec. In this
case, tok2vec will give the last listener a special backprop
function that calls `finish_update` on the tok2vec.
Unfortunately, `Language.update` did neither of these. Instead, it
immediately called `finish_update` on every pipe after `update`. As a
result, the tok2vec weights are updated when no gradients have been
accumulated from listeners yet. And the gradients of the listeners are
only used in the next call to `Language.update` (when `finish_update` is
called on tok2vec again).
This change fixes this issue by passing the optimizer to the `update`
method of trainable pipes, leading to use of the second strategy
outlined above.
The main updating loop in `Language.update` is also simplified by using
the `TrainableComponent` protocol consistently.
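The ordering problem described above can be illustrated with a toy model. This is a minimal sketch, not spaCy code: `ToySharedLayer`, `update_eager` and `update_deferred` are hypothetical names, and the "gradients" are plain floats standing in for what the listeners would backpropagate.

```python
class ToySharedLayer:
    """Hypothetical stand-in for a shared tok2vec-like component."""

    def __init__(self) -> None:
        self.weight = 1.0
        self.grad = 0.0

    def accumulate(self, d_weight: float) -> None:
        # Called once per listener during its backprop.
        self.grad += d_weight

    def finish_update(self, learning_rate: float = 0.1) -> None:
        # Apply whatever has been accumulated so far, then reset.
        self.weight -= learning_rate * self.grad
        self.grad = 0.0


def update_eager(shared: ToySharedLayer, listener_grads: list) -> float:
    # Problematic ordering: the shared layer is finished before any
    # listener has contributed, so this step applies stale gradients.
    shared.finish_update()
    for d_weight in listener_grads:
        shared.accumulate(d_weight)
    return shared.weight


def update_deferred(shared: ToySharedLayer, listener_grads: list) -> float:
    # Fixed ordering: finish only after all listeners have backpropagated.
    for d_weight in listener_grads:
        shared.accumulate(d_weight)
    shared.finish_update()
    return shared.weight
```

With the eager ordering, the first call leaves the weight unchanged and the gradients only take effect on the next call. With the deferred ordering they are applied in the same step.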
* Train loop: `sgd` is `Optional[Optimizer]`, do not pass false
* Language.update: call pipe finish_update after all pipe updates
This does correct and fast updates if multiple components update the
same parameters.
* Add comment why we moved `finish_update` to a separate loop
* Remove backwards-compatible overwrite from Entity Linker
This also adds a docstring about overwrite, since it wasn't present.
* Fix docstring
* Remove backward compat settings in Morphologizer
This also needed a docstring added.
For this component it's less clear what the right overwrite settings
are.
* Remove backward compat from sentencizer
This was simple
* Remove backward compat from senter
Another simple one
* Remove backward compat setting from tagger
* Add docstrings
* Update spacy/pipeline/morphologizer.pyx
Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
* Update docs
---------
Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
* change logging call for spacy.LookupsDataLoader.v1
* substitutions in language and _util
* various more substitutions
* add string formatting guidelines to contribution guidelines
* Move Entity Linker v1 component to spacy-legacy
This is a follow up to #11889 that moves the component instead of
removing it.
In general, we never import from spacy-legacy in spaCy proper. However,
to use this component, that kind of import will be necessary. I was able
to test this without issues, but is this current import strategy
acceptable? Or should we put the component in a registry?
* Use spacy-legacy pr for CI
This will need to be reverted before merging.
* Add temporary step to log installed spacy-legacy version
* Modify requirements.txt to trigger tests
* Add comment to Python to trigger tests
* TODO REVERT This is a commit with logic changes to trigger tests
* Remove pipe from YAML
Works locally, but possibly this is causing a quoting error or
something.
* Revert "TODO REVERT This is a commit with logic changes to trigger tests"
This reverts commit 689fae71f3.
* Revert "Add comment to Python to trigger tests"
This reverts commit 11840fc598.
* Add more logging
* Try installing directly in workflow
* Try explicitly uninstalling spacy-legacy first
* Cat requirements.txt to confirm contents
In the branch, the thinc version spec is `thinc>=8.1.0,<8.2.0`. But in
the logs, it's clear that a development release of 9.0 is being
installed. It's not clear why that would happen.
* Log requirements at start of build
* TODO REVERT Change thinc spec
Want to see what happens to the installed thinc spec with this change.
* Update thinc requirements
This makes it the same as it was before the merge, >=8.1.0,<8.2.0.
* Use same thinc version as v4 branch
* TODO REVERT Mark dependency check as xfail
spacy-legacy is specified as a git checkout in requirements.txt while
this PR is in progress, which makes the consistency check here fail.
* Remove debugging output / install step
* Revert "Remove debugging output / install step"
This reverts commit 923ea7448b.
* Clean up debugging output
The manual install step with the URL fragment seems to have caused
issues on Windows due to the = in the URL being misinterpreted. On the
other hand, removing it seems to mean the git version of spacy-legacy
isn't actually installed.
This PR removes the URL fragment but keeps the direct command-line
install. Additionally, since it looks like this job is configured to use
the default shell (and not bash), it removes a comment that upsets the
Windows cmd shell.
* Revert "TODO REVERT Mark dependency check as xfail"
This reverts commit d4863ec156.
* Fix requirements.txt, increasing spacy-legacy version
* Raise spacy legacy version in setup.cfg
* Remove azure build workarounds
* make spacy-legacy version explicit in error message
* Remove debugging line
* Suggestions from code review
* Init
* fix tests
* Update spacy/errors.py
Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
* Fix test_blank_languages
* Rename xx to mul in docs
* Format _util with black
* prettier formatting
---------
Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
* Language.distill: copy both reference and predicted
In distillation we also modify the teacher docs (e.g. in tok2vec
components), so we need to copy both the reference and predicted doc.
Problem caught by @shadeMe
* Make new `_copy_examples` args kwonly
* Add the configuration schema for distillation
This also adds the default configuration and some tests. The schema will
be used by the training loop and `distill` subcommand.
* Format
* Change distillation shortopt to -d
* Fix description of max_epochs
* Rename distillation flag to -dt
* Rename `pipe_map` to `student_to_teacher`
* Don't re-download installed models
When downloading a model, this checks if the same version of the same
model is already installed. If it is then the download is skipped.
This is necessary because pip uses the final download URL for its
caching feature, but because of the way models are hosted on Github,
their URLs change every few minutes.
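A rough sketch of the installed-version check described above, using only the standard library. The helper names `installed_version` and `should_download` are hypothetical and are not the functions used in spaCy's download command.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of a package, or None if it is absent."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


def should_download(package: str, wanted_version: str) -> bool:
    """Skip the download only if exactly the wanted version is installed."""
    return installed_version(package) != wanted_version
```

Comparing version strings for equality is the simplest possible policy. A real implementation may want to normalize versions before comparing them.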
* Use importlib instead of meta.json
* Use get_package_version
* Add untested, disabled test
---------
Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
* Add `Language.distill`
This method is the distillation counterpart of `Language.update`. It
takes a teacher `Language` instance and distills the student pipes on
the teacher pipes.
* Apply suggestions from code review
Co-authored-by: Madeesh Kannan <shadeMe@users.noreply.github.com>
* Clarify how Example is used in distillation
* Update transition parser distill docstring for examples argument
* Pass optimizer to `TrainablePipe.distill`
* Annotate pipe before update
As discussed internally, we want to let a pipe annotate before doing an
update with gold/silver data. Otherwise, the output may be (too)
informed by the gold/silver data.
* Rename `component_map` to `student_to_teacher`
* Better synopsis in `Language.distill` docstring
* `name` -> `student_name`
* Fix labels type in docstring
* Mark distill test as slow
* Fix `student_to_teacher` type in docs
---------
Co-authored-by: Madeesh Kannan <shadeMe@users.noreply.github.com>
* Normalize whitespace in evaluate CLI output test
Depending on terminal settings, lines may be padded to the screen width
so the comparison is too strict with only the command string replacement.
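One way to make such a comparison robust to padding is to collapse whitespace runs before comparing. This is only a sketch; `normalize_whitespace` is a hypothetical name, not necessarily the test utility added here.

```python
import re


def normalize_whitespace(text: str) -> str:
    """Collapse runs of whitespace (including newlines) into single spaces."""
    return re.sub(r"\s+", " ", text).strip()
```

Applying the same normalization to both the expected and the actual output makes terminal-width padding irrelevant to the assertion.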
* Move to test util method
* Change to normalization method
* Add span_id to Span.char_span, update Doc/Span.char_span docs
`Span.char_span(id=)` should be removed in the future.
* Also use Union[int, str] in Doc docstring
* WIP
* rm ipython embeds
* rm total
* WIP
* cleanup
* cleanup + reword
* rm component function
* remove migration support form
* fix reference dataset for dev data
* additional fixes
- set approach to identifying unique trees
- adjust line length on messages
- add logic for detecting docs without annotations
* use 0 instead of none for no annotation
* partial annotation support
* initial tests for _compile_gold lemma attributes
Using the example data from the edit tree lemmatizer tests for:
- lemmatizer_trees
- partial_lemma_annotations
- n_low_cardinality_lemmas
- no_lemma_annotations
* adds output test for cli app
* switch msg level
* rm unclear uniqueness check
* Revert "rm unclear uniqueness check"
This reverts commit 6ea2b3524b.
* remove good message on uniqueness
* formatting
* use en_vocab fixture
* clarify data set source in messages
* remove unnecessary import
Co-authored-by: svlandeg <svlandeg@github.com>
* Add `spacy.PlainTextCorpusReader.v1`
This is a corpus reader that reads plain text corpora with the following
format:
- UTF-8 encoding
- One line per document.
- Blank lines are ignored.
It is useful for applications where we deal with very large corpora,
such as distillation, and don't want to deal with the space overhead of
serialized formats. Additionally, many large corpora already use such
a text format, keeping the necessary preprocessing to a minimum.
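The format above can be read with a few lines of Python. This is a minimal sketch that yields raw strings; the function name `read_plain_text` is hypothetical, and it assumes that only newline and carriage return characters are stripped, so a line consisting of spaces still counts as a document.

```python
from typing import Iterator


def read_plain_text(path: str) -> Iterator[str]:
    """Yield one document text per non-empty line of a UTF-8 file."""
    with open(path, encoding="utf-8") as file_:
        for line in file_:
            # Strip only line terminators, keep other whitespace intact.
            text = line.rstrip("\r\n")
            if text:
                yield text
```

Because the file is streamed line by line, memory use stays constant regardless of corpus size.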
* Update spacy/training/corpus.py
Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
* docs: add version to `PlainTextCorpus`
* Add docstring to registry function
* Add plain text corpus tests
* Only strip newline/carriage return
* Add return type _string_to_tmp_file helper
* Use a temporary directory in place of file name
Different OS auto delete/sharing semantics are just wonky.
* This will be new in 3.5.1 (rather than 4)
* Test improvements from code review
Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
* Don't pass mem pool to new lexeme function
* Remove unused mem from function args
Two methods calling _new_lexeme, get and get_by_orth, took mem arguments
just to call the internal method. That's no longer necessary, so this
cleans it up.
* prettier formatting
* Remove more unused mem args
* Refactor _scores2guesses
* Handle arrays on GPU
* Convert argmax result to raw integer
Co-authored-by: Madeesh Kannan <shadeMe@users.noreply.github.com>
* Use NumpyOps() to copy data to CPU
Co-authored-by: Madeesh Kannan <shadeMe@users.noreply.github.com>
* Changes based on review comments
* Use different _scores2guesses depending on tree_k
* Add tests for corner cases
* Add empty line for consistency
* Improve naming
Co-authored-by: Daniël de Kok <me@github.danieldk.eu>
* Improve naming
Co-authored-by: Daniël de Kok <me@github.danieldk.eu>
Co-authored-by: Madeesh Kannan <shadeMe@users.noreply.github.com>
Co-authored-by: Daniël de Kok <me@github.danieldk.eu>
* API docs: Rename kb_in_memory to inmemorylookupkb, add to sidebar
* adjust to mdx
* linkout to InMemoryLookupKB at first occurrence in kb.mdx
* fix links to docs
* revert Azure trigger setting (I'll make a separate PR)
Co-authored-by: svlandeg <svlandeg@github.com>
* Fix batching regression
Some time ago, the spaCy v4 branch switched to the new Thinc v9
schedule. However, this introduced an error in how batching is handled.
In the PR, the batchers were changed to keep track of their step,
so that the step can be passed to the schedule. However, the issue
is that the training loop repeatedly calls the batching functions
(rather than using an infinite generator/iterator). So, the step and
therefore the schedule would be reset each epoch. Before the schedule
switch we didn't have this issue, because the old schedules were
stateful.
This PR fixes this issue by reverting the batching functions to use
a (stateful) generator. Their registry functions do accept a `Schedule`
and we convert `Schedule`s to generators.
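The idea of converting a stateless schedule into a stateful generator can be sketched as follows. This is an illustration under assumptions: a schedule is modelled as a plain function from step to batch size, and `schedule_to_generator` and `minibatch` are simplified stand-ins, not the Thinc or spaCy implementations.

```python
from itertools import count, islice
from typing import Callable, Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def schedule_to_generator(schedule: Callable[[int], int]) -> Iterator[int]:
    """Wrap a step -> size function in an infinite, stateful generator."""
    for step in count():
        yield schedule(step)


def minibatch(items: Iterable[T], sizes: Iterator[int]) -> Iterator[List[T]]:
    """Batch items, drawing each batch size from a shared size generator."""
    iterator = iter(items)
    while True:
        # `sizes` is assumed to be infinite, as produced above.
        batch = list(islice(iterator, next(sizes)))
        if not batch:
            return
        yield batch
```

Because the size generator lives outside `minibatch`, calling `minibatch` again for the next epoch continues the schedule instead of restarting it at step 0.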
* Update batcher docs
* Docstring fixes
* Make minibatch take iterables again as well
* Bump thinc requirement to 9.0.0.dev2
* Use type declaration
* Convert another comment into a proper type declaration
* Try to fix doc.copy
* Set dev version
* Make vocab always own lexemes
* Change version
* Add SpanGroups.copy method
* Fix set_annotations during Parser.update
* Fix dict proxy copy
* Upd version
* Fix copying SpanGroups
* Fix set_annotations in parser.update
* Fix parser set_annotations during update
* Revert "Fix parser set_annotations during update"
This reverts commit eb138c89ed.
* Revert "Fix set_annotations in parser.update"
This reverts commit c6df0eafd0.
* Fix set_annotations during parser update
* Inc version
* Handle final states in get_oracle_sequence
* Inc version
* Try to fix parser training
* Inc version
* Fix
* Inc version
* Fix parser oracle
* Inc version
* Inc version
* Fix transition has_gold
* Inc version
* Try to use real histories, not oracle
* Inc version
* Upd parser
* Inc version
* WIP on rewrite parser
* WIP refactor parser
* New progress on parser model refactor
* Prepare to remove parser_model.pyx
* Convert parser from cdef class
* Delete spacy.ml.parser_model
* Delete _precomputable_affine module
* Wire up tb_framework to new parser model
* Wire up parser model
* Uncython ner.pyx and dep_parser.pyx
* Uncython
* Work on parser model
* Support unseen_classes in parser model
* Support unseen classes in parser
* Cleaner handling of unseen classes
* Work through tests
* Keep working through errors
* Keep working through errors
* Work on parser. 15 tests failing
* Xfail beam stuff. 9 failures
* More xfail. 7 failures
* Xfail. 6 failures
* cleanup
* formatting
* fixes
* pass nO through
* Fix empty doc in update
* Hackishly fix resizing. 3 failures
* Fix redundant test. 2 failures
* Add reference version
* black formatting
* Get tests passing with reference implementation
* Fix missing prints
* Add missing file
* Improve indexing on reference implementation
* Get non-reference forward func working
* Start rigging beam back up
* removing redundant tests, cf #8106
* black formatting
* temporarily xfailing issue 4314
* make flake8 happy again
* mypy fixes
* ensure labels are added upon predict
* cleanup remnants from merge conflicts
* Improve unseen label masking
Two changes to speed up masking by ~10%:
- Use a bool array rather than an array of float32.
- Let the mask indicate whether a label was seen, rather than
unseen. The mask is most frequently used to index scores for
seen labels. However, since the mask marked unseen labels,
this required computing an intermediate flipped mask.
* Write moves costs directly into numpy array (#10163)
This avoids elementwise indexing and the allocation of an additional
array.
Gives a ~15% speed improvement when using batch_by_sequence with size
32.
* Temporarily disable ner and rehearse tests
Until rehearse is implemented again in the refactored parser.
* Fix loss serialization issue (#10600)
* Fix loss serialization issue
Serialization of a model fails with:
TypeError: array(738.3855, dtype=float32) is not JSON serializable
Fix this using float conversion.
* Disable CI steps that require spacy.TransitionBasedParser.v2
After finishing the refactor, TransitionBasedParser.v2 should be
provided for backwards compat.
* Add back support for beam parsing to the refactored parser (#10633)
* Add back support for beam parsing
Beam parsing was already implemented as part of the `BeamBatch` class.
This change makes its counterpart `GreedyBatch`. Both classes are hooked
up in `TransitionModel`, selecting `GreedyBatch` when the beam size is
one, or `BeamBatch` otherwise.
* Use kwarg for beam width
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
* Avoid implicit default for beam_width and beam_density
* Parser.{beam,greedy}_parse: ensure labels are added
* Remove 'deprecated' comments
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
* Parser `StateC` optimizations (#10746)
* `StateC`: Optimizations
Avoid GIL acquisition in `__init__`
Increase default buffer capacities on init
Reduce C++ exception overhead
* Fix typo
* Replace `set::count` with `set::find`
* Add exception attribute to c'tor
* Remove unused import
* Use a power-of-two value for initial capacity
Use default-insert to init `_heads` and `_unshiftable`
* Merge `cdef` variable declarations and assignments
* Vectorize `example.get_aligned_parses` (#10789)
* `example`: Vectorize `get_aligned_parse`
Rename `numpy` import
* Convert aligned array to lists before returning
* Revert import renaming
* Elide slice arguments when selecting the entire range
* Tagger/morphologizer alignment performance optimizations (#10798)
* `example`: Unwrap `numpy` scalar arrays before passing them to `StringStore.__getitem__`
* `AlignmentArray`: Use native list as staging buffer for offset calculation
* `example`: Vectorize `get_aligned`
* Hoist inner functions out of `get_aligned`
* Replace inline `if..else` clause in assignment statement
* `AlignmentArray`: Use raw indexing into offset and data `numpy` arrays
* `example`: Replace array unique value check with `groupby`
* `example`: Correctly exclude tokens with no alignment in `_get_aligned_vectorized`
Simplify `_get_aligned_non_vectorized`
* `util`: Update `all_equal` docstring
* Explicitly use `int32_t*`
* Restore C CPU inference in the refactored parser (#10747)
* Bring back the C parsing model
The C parsing model is used for CPU inference and is still faster for
CPU inference than the forward pass of the Thinc model.
* Use C sgemm provided by the Ops implementation
* Make tb_framework module Cython, merge in C forward implementation
* TransitionModel: raise in backprop returned from forward_cpu
* Re-enable greedy parse test
* Return transition scores when forward_cpu is used
* Apply suggestions from code review
Import `Model` from `thinc.api`
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
* Use relative imports in tb_framework
* Don't assume a default for beam_width
* We don't have a direct dependency on BLIS anymore
* Rename forwards to _forward_{fallback,greedy_cpu}
* Require thinc >=8.1.0,<8.2.0
* tb_framework: clean up imports
* Fix return type of _get_seen_mask
* Move up _forward_greedy_cpu
* Style fixes.
* Lower thinc lowerbound to 8.1.0.dev0
* Formatting fix
Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
* Reimplement parser rehearsal function (#10878)
* Reimplement parser rehearsal function
Before the parser refactor, rehearsal was driven by a loop in the
`rehearse` method itself. For each parsing step, the loops would:
1. Get the predictions of the teacher.
2. Get the predictions and backprop function of the student.
3. Compute the loss and backprop into the student.
4. Move the teacher and student forward with the predictions of
the student.
In the refactored parser, we cannot perform such stepwise rehearsal
anymore, since the model now predicts all parsing steps at once.
Therefore, rehearsal is performed in the following steps:
1. Get the predictions of all parsing steps from the student, along
with its backprop function.
2. Get the predictions from the teacher, but use the predictions of
the student to advance the parser while doing so.
3. Compute the loss and backprop into the student.
To support the second step a new method, `advance_with_actions` is
added to `GreedyBatch`, which performs the provided parsing steps.
* tb_framework: wrap upper_W and upper_b in Linear
Thinc's Optimizer cannot handle resizing of existing parameters. Until
it does, we work around this by wrapping the weights/biases of the upper
layer of the parser model in Linear. When the upper layer is resized, we
copy over the existing parameters into a new Linear instance. This does
not trigger an error in Optimizer, because it sees the resized layer as
a new set of parameters.
* Add test for TransitionSystem.apply_actions
* Better FIXME marker
Co-authored-by: Madeesh Kannan <shadeMe@users.noreply.github.com>
* Fixes from Madeesh
* Apply suggestions from Sofie
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
* Remove useless assignment
Co-authored-by: Madeesh Kannan <shadeMe@users.noreply.github.com>
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
* Rename some identifiers in the parser refactor (#10935)
* Rename _parseC to _parse_batch
* tb_framework: prefix many auxiliary functions with underscore
To clearly state the intent that they are private.
* Rename `lower` to `hidden`, `upper` to `output`
* Parser slow test fixup
We don't have TransitionBasedParser.{v1,v2} until we bring it back as a
legacy option.
* Remove last vestiges of PrecomputableAffine
This does not exist anymore as a separate layer.
* ner: re-enable sentence boundary checks
* Re-enable test that works now.
* test_ner: make loss test more strict again
* Remove commented line
* Re-enable some more beam parser tests
* Remove unused _forward_reference function
* Update for CBlas changes in Thinc 8.1.0.dev2
Bump thinc dependency to 8.1.0.dev3.
* Remove references to spacy.TransitionBasedParser.{v1,v2}
Since they will not be offered starting with spaCy v4.
* `tb_framework`: Replace references to `thinc.backends.linalg` with `CBlas`
* dont use get_array_module (#11056) (#11293)
Co-authored-by: kadarakos <kadar.akos@gmail.com>
* Move `thinc.extra.search` to `spacy.pipeline._parser_internals` (#11317)
* `search`: Move from `thinc.extra.search`
Fix NPE in `Beam.__dealloc__`
* `pytest`: Add support for executing Cython tests
Move `search` tests from thinc and patch them to run with `pytest`
* `mypy` fix
* Update comment
* `conftest`: Expose `register_cython_tests`
* Remove unused import
* Move `argmax` impls to new `_parser_utils` Cython module (#11410)
* Parser does not have to be a cdef class anymore
This also fixes validation of the initialization schema.
* Add back spacy.TransitionBasedParser.v2
* Fix a rename that was missed in #10878.
So that rehearsal tests pass.
* Remove module from setup.py that got added during the merge
* Bring back support for `update_with_oracle_cut_size` (#12086)
* Bring back support for `update_with_oracle_cut_size`
This option was available in the pre-refactor parser, but was never
implemented in the refactored parser. This option cuts transition
sequences that are longer than `update_with_oracle_cut_size` into
separate sequences that have at most `update_with_oracle_cut_size`
transitions. The oracle (gold standard) transition sequence is used to
determine the cuts and the initial states for the additional sequences.
Applying this cut makes the batches more homogeneous in the transition
sequence lengths, making forward passes (and as a consequence training)
much faster.
Training time 1000 steps on de_core_news_lg:
- Before this change: 149s
- After this change: 68s
- Pre-refactor parser: 81s
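The cutting step can be sketched on a plain list of actions. This is a simplification: it only splits the oracle action sequence and does not compute the initial parser states for the additional sequences. The name `cut_sequence` is hypothetical, and treating a non-positive cut size as "no cutting" is an assumption.

```python
from typing import List, Sequence, TypeVar

A = TypeVar("A")


def cut_sequence(actions: Sequence[A], cut_size: int) -> List[List[A]]:
    """Split an action sequence into pieces of at most `cut_size` actions."""
    if cut_size <= 0:
        # Assumption: a non-positive cut size disables cutting.
        return [list(actions)]
    return [
        list(actions[start : start + cut_size])
        for start in range(0, len(actions), cut_size)
    ]
```

For example, seven actions with a cut size of 3 give pieces of length 3, 3 and 1, so no piece in the batch is longer than the cut size.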
* Apply suggestions from @shadeMe
* Use chained conditional
* Test with update_with_oracle_cut_size={0, 1, 5, 100}
And fix a bug that occurs with a cut size of 1.
* Fix up some merge fall out
* Update parser distillation for the refactor
In the old parser, we'd iterate over the transitions in the distill
function and compute the loss/gradients on the go. In the refactored
parser, we first let the student model parse the inputs. Then we'll let
the teacher compute the transition probabilities of the states in the
student's transition sequence. We can then compute the gradients of the
student given the teacher.
* Add back spacy.TransitionBasedParser.v1 references
- Accordion in the architecture docs.
- Test in test_parse, but disabled until we have a spacy-legacy release.
Co-authored-by: Matthew Honnibal <honnibal+gh@gmail.com>
Co-authored-by: svlandeg <svlandeg@github.com>
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
Co-authored-by: Madeesh Kannan <shadeMe@users.noreply.github.com>
Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
Co-authored-by: kadarakos <kadar.akos@gmail.com>
* Add `TrainablePipe.{distill,get_teacher_student_loss}`
This change adds two methods:
- `TrainablePipe::distill` which performs a training step of a
student pipe on a teacher pipe, given a batch of `Doc`s.
- `TrainablePipe::get_teacher_student_loss` computes the loss
of a student relative to the teacher.
The `distill` or `get_teacher_student_loss` methods are also implemented
in the tagger, edit tree lemmatizer, and parser pipes, to enable
distillation in those pipes and as an example for other pipes.
* Fix stray `Beam` import
* Fix incorrect import
* Apply suggestions from code review
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
* Apply suggestions from code review
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
* TrainablePipe.distill: use `Iterable[Example]`
* Add Pipe.is_distillable method
* Add `validate_distillation_examples`
This first calls `validate_examples` and then checks that the
student/teacher tokens are the same.
* Update distill documentation
* Add distill documentation for all pipes that support distillation
* Fix incorrect identifier
* Apply suggestions from code review
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
* Add comment to explain `is_distillable`
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
* Add a `spacy evaluate speed` subcommand
This subcommand reports the mean batch performance of a model on a data set with
a 95% confidence interval. For reliability, it first performs some warmup
rounds. Then it will measure performance on batches with randomly shuffled
documents.
To avoid having too many spaCy commands, `speed` is a subcommand of `evaluate`
and accuracy evaluation is moved to its own `evaluate accuracy` subcommand.
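A mean with a 95% confidence interval can be computed in several ways. The sketch below uses a normal approximation over per-batch measurements, which is an assumption for illustration and not necessarily the method used by the subcommand. `mean_with_ci95` is a hypothetical helper.

```python
import math
import statistics
from typing import Sequence, Tuple


def mean_with_ci95(samples: Sequence[float]) -> Tuple[float, float, float]:
    """Return (mean, lower, upper) using a normal-approximation 95% CI."""
    if len(samples) < 2:
        raise ValueError("need at least two samples")
    mean = statistics.fmean(samples)
    # Standard error of the mean from the sample standard deviation.
    standard_error = statistics.stdev(samples) / math.sqrt(len(samples))
    half_width = 1.96 * standard_error
    return mean, mean - half_width, mean + half_width
```

Here each sample would be the words per second measured for one batch after the warmup rounds.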
* Fix import cycle
* Restore `spacy evaluate`, make `spacy benchmark speed` an alias
* Add documentation for `spacy benchmark`
* CREATES -> PRINTS
* WPS -> words/s
* Disable formatting of benchmark speed arguments
* Fail with an error message when trying to speed bench empty corpus
* Make it clearer that `benchmark accuracy` is a replacement for `evaluate`
* Fix docstring webpage reference
* tests: check `evaluate` output against `benchmark accuracy`
* Clean up displacy port-related error messages, docs
There were some issues in the error messages and docs in #11948.
1. the error messages didn't specify the port argument to displacy.serve correctly
2. the docs didn't mark the auto select argument as new
This addresses those issues.
* Update website/docs/api/top-level.md
Co-authored-by: Raphael Mitsch <r.mitsch@outlook.com>
* Apply prettier
Co-authored-by: Raphael Mitsch <r.mitsch@outlook.com>
In the v3 scorer refactoring, `token_acc` was implemented incorrectly.
It should use `precision` instead of `fscore` for the measure of
correctly aligned tokens / number of predicted tokens.
Fix the docs to reflect that the measure uses the number of predicted
tokens rather than the number of gold tokens.
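The difference between the two measures is easy to see with small numbers. A minimal sketch with hypothetical helper names, using the standard definitions of precision, recall and F-score:

```python
def precision(n_correct: int, n_predicted: int) -> float:
    """Correctly aligned tokens divided by the number of predicted tokens."""
    return n_correct / n_predicted if n_predicted else 0.0


def recall(n_correct: int, n_gold: int) -> float:
    """Correctly aligned tokens divided by the number of gold tokens."""
    return n_correct / n_gold if n_gold else 0.0


def fscore(n_correct: int, n_predicted: int, n_gold: int) -> float:
    """Harmonic mean of precision and recall."""
    p = precision(n_correct, n_predicted)
    r = recall(n_correct, n_gold)
    return 2 * p * r / (p + r) if p + r else 0.0
```

With 8 correctly aligned tokens, 10 predicted tokens and 16 gold tokens, precision is 0.8 while the F-score is about 0.615, so reporting the F-score as `token_acc` understates the intended measure in this case.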
* enable fuzzy matching
* add fuzzy param to EntityMatcher
* include rapidfuzz_capi
not yet used
* fix type
* add FUZZY predicate
* add fuzzy attribute list
* fix type properly
* tidying
* remove unnecessary dependency
* handle fuzzy sets
* simplify fuzzy sets
* case fix
* switch to FUZZYn predicates
use Levenshtein distance.
remove fuzzy param.
remove rapidfuzz_capi.
* revert changes added for fuzzy param
* switch to polyleven
(Python package)
* fuzzy match only on oov tokens
* remove polyleven
* exclude whitespace tokens
* don't allow more edits than characters
* fix min distance
* reinstate FUZZY operator
with length-based distance function
* handle sets inside regex operator
* remove is_oov check
* attempt build fix
no mypy failure locally
* re-attempt build fix
* don't overwrite fuzzy param value
* move fuzzy_match
to its own Python module to allow patching
* move fuzzy_match back inside Matcher
simplify logic and add tests
* Format tests
* Parametrize fuzzyn tests
* Parametrize and merge fuzzy+set tests
* Format
* Move fuzzy_match to a standalone method
* Change regex kwarg type to bool
* Add types for fuzzy_match
- Refactor variable names
- Add test for symmetrical behavior
* Parametrize fuzzyn+set tests
* Minor refactoring for fuzz/fuzzy
* Make fuzzy_match a Matcher kwarg
* Update type for _default_fuzzy_match
* don't overwrite function param
* Rename to fuzzy_compare
* Update fuzzy_compare default argument declarations
* allow fuzzy_compare override from EntityRuler
* define new Matcher keyword arg
* fix type definition
* Implement fuzzy_compare config option for EntityRuler and SpanRuler
* Rename _default_fuzzy_compare to fuzzy_compare, remove from reexported objects
* Use simpler fuzzy_compare algorithm
* Update types
* Increase minimum to 2 in fuzzy_compare to allow one transposition
* Fix predicate keys and matching for SetPredicate with FUZZY and REGEX
* Add FUZZY6..9
* Add initial docs
* Increase default fuzzy to rounded 30% of pattern length
* Update docs for fuzzy_compare in components
* Update EntityRuler and SpanRuler API docs
* Rename EntityRuler and SpanRuler setting to matcher_fuzzy_compare
To have naming similar to `phrase_matcher_attr`, rename the
`fuzzy_compare` setting for `EntityRuler` and `SpanRuler` to
`matcher_fuzzy_compare`. Organize next to `phrase_matcher_attr` in docs.
* Fix schema aliases
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
* Fix typo
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
* Add FUZZY6-9 operators and update tests
* Parameterize test over greedy
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
* Fix type for fuzzy_compare to remove Optional
* Rename to spacy.levenshtein_compare.v1, move to spacy.matcher.levenshtein
* Update docs following levenshtein_compare renaming
Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
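The default edit budget described in the commits above (at least 2 edits, otherwise a rounded 30% of the pattern length) can be sketched roughly as follows. This is a self-contained approximation and not the spaCy implementation: the helper names `levenshtein` and `fuzzy_compare_sketch` are hypothetical, the distance function is a plain dynamic-programming version, and it is an assumption that a non-negative `fuzzy` value acts as an explicit edit limit in the way the FUZZYn predicates suggest.

```python
def levenshtein(a: str, b: str) -> int:
    """Plain dynamic-programming Levenshtein distance (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def fuzzy_compare_sketch(input_text: str, pattern_text: str, fuzzy: int = -1) -> bool:
    """Return True if input_text is within the allowed edit distance of pattern_text."""
    if fuzzy >= 0:
        # Assumption: an explicit limit, as with FUZZY1..FUZZY9.
        max_edits = fuzzy
    else:
        # At least 2 edits (enough for one transposition), otherwise a
        # rounded 30% of the pattern length.
        max_edits = max(2, round(0.3 * len(pattern_text)))
    return levenshtein(input_text, pattern_text) <= max_edits
```

For example, `fuzzy_compare_sketch("Gogle", "Google")` accepts the misspelling under the default budget, while passing `fuzzy=0` requires an exact match.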
* check port in use and add itself
* Auto switch to nearest available port.
* Use bind to check port instead of connect_ex.
* Reformat.
* Add auto_select_port argument.
* update docs for displacy.serve
* Update spacy/errors.py
Co-authored-by: Paul O'Leary McCann <polm@dampfkraft.com>
* Update website/docs/api/top-level.md
Co-authored-by: Paul O'Leary McCann <polm@dampfkraft.com>
* Update spacy/errors.py
Co-authored-by: Paul O'Leary McCann <polm@dampfkraft.com>
* Add test using multiprocessing
* fix argument name
* Increase sleep times
Want to rule this out as a cause of test failure
* Don't terminate a process that isn't alive
* Refactor port finding logic
This moves all the port logic into its own util function, which can be
tested without having to background a server directly.
* Use with for the server
This ensures the server is closed correctly.
* Pass in the host when checking port availability
* Shorten argument name
* Update error codes following merge
* Add types for arguments, specify docstrings.
* Add typing for arguments with default value.
* Update docstring to match spaCy format.
* Fix docs
Arg name changed from `auto_select_port` to just `auto_select`.
* Revert "Fix docs"
This reverts commit 356966fe84.
Co-authored-by: zhiiw <1302593554@qq.com>
Co-authored-by: Paul O'Leary McCann <polm@dampfkraft.com>
Co-authored-by: Raphael Mitsch <r.mitsch@outlook.com>
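The port handling described in the commits above (checking availability with `bind` rather than `connect_ex`, passing in the host, and optionally switching to the nearest free port) could look roughly like the sketch below. The function names, the error messages, and the upward-only search are assumptions made for illustration and may differ from the util function that was actually added.

```python
import socket


def is_port_in_use(port: int, host: str = "localhost") -> bool:
    """Check availability by trying to bind, so no listening server is needed."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        try:
            s.bind((host, port))
        except OSError:
            return True
    return False


def find_available_port(start: int, host: str = "localhost", auto_select: bool = False) -> int:
    """Return `start` if it is free, or the next free port when auto_select is set."""
    if not is_port_in_use(start, host):
        return start
    if not auto_select:
        raise ValueError(f"Port {start} is already in use on {host}.")
    # Assumption: search upwards only, one port at a time.
    for port in range(start + 1, 65536):
        if not is_port_in_use(port, host):
            return port
    raise ValueError(f"No available port found at or above {start} on {host}.")
```

Keeping this logic in a plain function means it can be tested by binding a socket in the test itself, without backgrounding a server.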
* add test for running evaluate on an nlp pipeline with two distinct textcat components
* cleanup
* merge dicts instead of overwrite
* don't add more labels to the given set
* Revert "merge dicts instead of overwrite"
This reverts commit 89bee0ed77.
* Switch tests to separate scorer keys rather than merged dicts
* Revert unrelated edits
* Switch textcat scorers to v2
* formatting
Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
* fix processing of "auto" in walk_directory
* add check for None
* move AUTO check to convert and fix verification of args
* add specific CLI test with CliRunner
* cleanup
* more cleanup
* update docstring
* Fix inconsistency in displaCy docs about page option
The `page` option, which wraps the output SVG in HTML, is true by
default for `serve` but not for `render`. The `render` docs were wrong
though, so this updates them.
* Update the same statement in more docs
A few renderers used the same language
* Add `ConsoleLogger.v3`
This addition expands the progress bar feature to count up the training/distillation steps to either the next evaluation pass or the maximum number of steps.
* Rename progress bar types
* Add defaults to docs
Minor fixes
* Move comment
* Minor punctuation fixes
* Explicitly check for `None` when validating progress bar type
Co-authored-by: Paul O'Leary McCann <polm@dampfkraft.com>
* Init
* Fix return type for mypy
* adjust types and improve setting new attributes
* Add underscore changes to json conversion
* Add test and underscore changes to from_docs
* add underscore changes and test to span.to_doc
* update return values
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
* Add types to function
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
* adjust formatting
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
* shorten return type
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
* add helper function to improve readability
* Improve code and add comments
* rerun azure tests
* Fix tests for json conversion
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
* Convert all individual values explicitly to uint64 for array-based doc representations
* Temporarily test with latest numpy v1.24.0rc
* Remove unnecessary conversion from attr_t
* Reduce number of individual casts
* Convert specifically from int32 to uint64
* Revert "Temporarily test with latest numpy v1.24.0rc"
This reverts commit eb0e3c5006.
* Also use int32 in tests
* Remove experimental multi-task components
These are incomplete implementations and are not usable in their current state.
* Remove orphaned error message
* Switch ubuntu-latest to ubuntu-20.04 in main tests (#11928)
* Switch ubuntu-latest to ubuntu-20.04 in main tests
* Only use 20.04 for 3.6
* Revert "Switch ubuntu-latest to ubuntu-20.04 in main tests (#11928)"
This reverts commit 77c0fd7b17.
Co-authored-by: Paul O'Leary McCann <polm@dampfkraft.com>
* Remove old model shortcuts
* Remove error, docs warnings about shortcuts
* Fix import in util
Accidentally deleted the whole import and not just the old part...
* Change universe example to v3 style
* Switch ubuntu-latest to ubuntu-20.04 in main tests (#11928)
* Switch ubuntu-latest to ubuntu-20.04 in main tests
* Only use 20.04 for 3.6
* Update some model loading in Universe
* Add v2 tag to neuralcoref
* Use the spacy-version feature instead of a v2 tag
Co-authored-by: svlandeg <svlandeg@github.com>
Strings in replacement nodes were not added to the `StringStore`
when `EditTreeLemmatizer` was initialized from a set of labels. The
corresponding test did not capture this because it added the strings
through the examples that were passed to the initialization.
This change fixes both the bug in the initialization and the 'shadowing'
of the bug in the test.
If you don't have spacy-transformers installed, but try to use `init
config` with the GPU flag, you'll get an error. The issue is that the
`use_transformers` flag in the config is conflated with the GPU flag,
and then there's an attempt to access transformers config info that may
not exist.
There may be a better way to do this, but this stops the error.