* feat: add example stubs
* fix: add required annotations
* fix: mypy issues
* fix: use Py36-compatible Portocol
* Minor reformatting
* adding further type specifications and removing internal methods
* black formatting
* widen type to iterable
* add private methods that are being used by the built-in convertors
* revert changes to corpus.py
* fixes
* fixes
* fix typing of PlainTextCorpus
---------
Co-authored-by: Basile Dura <basile@bdura.me>
Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
* Support custom token/lexeme attribute for vectors
* Fix imports
* Back off to ORTH without Vectors.attr
* Fallback if vectors.attr doesn't exist
* Update docs
When sourcing a component, the object from the original pipeline is added to the new pipeline as the same object. This creates a situation where there are several attributes that cannot be in sync between the original pipeline and the new pipeline at the same time for this one object:
* component.name
* component.listener_map / component.listening_components for tok2vec and transformer
When running replace_listeners on a component, the config is not updated correctly if the state of the component is incorrect for the current pipeline (in particular changes that should be applied from model.attrs["replace_listener_cfg"] as used in spacy-transformers) due to the fact that:
* find_listeners relies on component.name to set the name in the listener_map
* replace_listeners relies on listener_map to determine how to modify the configs
In addition, there are several places where pipeline components are modified and the listener map and/or internal component names aren't currently updated.
In cases where there is a component shared by two pipelines that cannot be in sync, this PR chooses to prioritize the most recently modified or initialized pipeline. There is no actual solution with the current source behavior that will make both pipelines usable, so the current pipeline is updated whenever components are added/renamed/removed or the pipeline is initialized for training.
* Use isort with Black profile
* isort all the things
* Fix import cycles as a result of import sorting
* Add DOCBIN_ALL_ATTRS type definition
* Add isort to requirements
* Remove isort from build dependencies check
* Typo
* Add distillation initialization and loop
* Fix up configuration keys
* Add docstring
* Type annotations
* init_nlp_distill -> init_nlp_student
* Do not resolve dot name distill corpus in initialization
(Since we don't use it.)
* student: do not request use of optimizer in student pipe
We apply finish up the updates once in the training loop instead.
Also add the necessary logic to `Language.distill` to mirror
`Language.update`.
* Correctly determine sort key in subdivide_batch
* Fix _distill_loop docstring wrt. stopping condition
* _distill_loop: fix distill_data docstring
Make similar changes in train_while_improving, since it also had
incorrect types and missing type annotations.
* Move `set_{gpu_allocator,seed}_from_config` to spacy.util
* Update Language.update docs for the sgd argument
* Type annotation
Co-authored-by: Madeesh Kannan <shadeMe@users.noreply.github.com>
---------
Co-authored-by: Madeesh Kannan <shadeMe@users.noreply.github.com>
* Avoid `TrainablePipe.finish_update` getting called twice during training
PR #12136 fixed an issue where the tok2vec pipe was updated before
gradient were accumulated. However, it introduced a new bug that cause
`finish_update` to be called twice when using the training loop. This
causes a fairly large slowdown.
The `Language.update` method accepts the `sgd` argument for passing an
optimizer. This argument has three possible values:
- `Optimizer`: use the given optimizer to finish pipe updates.
- `None`: use a default optimizer to finish pipe updates.
- `False`: do not finish pipe updates.
However, the latter option was not documented and not valid with the
existing type of `sgd`. I assumed that this was a remnant of earlier
spaCy versions and removed handling of `False`.
However, with that change, we are passing `None` to `Language.update`.
As a result, we were calling `finish_update` in both `Language.update`
and in the training loop after all subbatches are processed.
This change restores proper handling/use of `False`. Moreover, the role
of `False` is now documented and added to the type to avoid future
accidents.
* Fix typo
* Document defaults for `Language.update`
* `Language.update`: ensure that tok2vec gets updated
The components in a pipeline can be updated independently. However,
tok2vec implementations are an exception to this, since they depend on
listeners for their gradients. The update method of a tok2vec
implementation computes the tok2vec forward and passes this along with a
backprop function to the listeners. This backprop function accumulates
gradients for all the listeners. There are two ways in which the
accumulated gradients can be used to update the tok2vec weights:
1. Call the `finish_update` method of tok2vec *after* the `update`
method is called on all of the pipes that use a tok2vec listener.
2. Pass an optimizer to the `update` method of tok2vec. In this
case, tok2vec will give the last listener a special backprop
function that calls `finish_update` on the tok2vec.
Unfortunately, `Language.update` did neither of these. Instead, it
immediately called `finish_update` on every pipe after `update`. As a
result, the tok2vec weights are updated when no gradients have been
accumulated from listeners yet. And the gradients of the listeners are
only used in the next call to `Language.update` (when `finish_update` is
called on tok2vec again).
This change fixes this issue by passing the optimizer to the `update`
method of trainable pipes, leading to use of the second strategy
outlined above.
The main updating loop in `Language.update` is also simplified by using
the `TrainableComponent` protocol consistently.
* Train loop: `sgd` is `Optional[Optimizer]`, do not pass false
* Language.update: call pipe finish_update after all pipe updates
This does correct and fast updates if multiple components update the
same parameters.
* Add comment why we moved `finish_update` to a separate loop
* change logging call for spacy.LookupsDataLoader.v1
* substitutions in language and _util
* various more substitutions
* add string formatting guidelines to contribution guidelines
* Init
* fix tests
* Update spacy/errors.py
Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
* Fix test_blank_languages
* Rename xx to mul in docs
* Format _util with black
* prettier formatting
---------
Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
* Add `spacy.PlainTextCorpusReader.v1`
This is a corpus reader that reads plain text corpora with the following
format:
- UTF-8 encoding
- One line per document.
- Blank lines are ignored.
It is useful for applications where we deal with very large corpora,
such as distillation, and don't want to deal with the space overhead of
serialized formats. Additionally, many large corpora already use such
a text format, keeping the necessary preprocessing to a minimum.
* Update spacy/training/corpus.py
Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
* docs: add version to `PlainTextCorpus`
* Add docstring to registry function
* Add plain text corpus tests
* Only strip newline/carriage return
* Add return type _string_to_tmp_file helper
* Use a temporary directory in place of file name
Different OS auto delete/sharing semantics are just wonky.
* This will be new in 3.5.1 (rather than 4)
* Test improvements from code review
Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
* Fix batching regression
Some time ago, the spaCy v4 branch switched to the new Thinc v9
schedule. However, this introduced an error in how batching is handed.
In the PR, the batchers were changed to keep track of their step,
so that the step can be passed to the schedule. However, the issue
is that the training loop repeatedly calls the batching functions
(rather than using an infinite generator/iterator). So, the step and
therefore the schedule would be reset each epoch. Before the schedule
switch we didn't have this issue, because the old schedules were
stateful.
This PR fixes this issue by reverting the batching functions to use
a (stateful) generator. Their registry functions do accept a `Schedule`
and we convert `Schedule`s to generators.
* Update batcher docs
* Docstring fixes
* Make minibatch take iterables again as well
* Bump thinc requirement to 9.0.0.dev2
* Use type declaration
* Convert another comment into a proper type declaration
* Try to fix doc.copy
* Set dev version
* Make vocab always own lexemes
* Change version
* Add SpanGroups.copy method
* Fix set_annotations during Parser.update
* Fix dict proxy copy
* Upd version
* Fix copying SpanGroups
* Fix set_annotations in parser.update
* Fix parser set_annotations during update
* Revert "Fix parser set_annotations during update"
This reverts commit eb138c89ed.
* Revert "Fix set_annotations in parser.update"
This reverts commit c6df0eafd0.
* Fix set_annotations during parser update
* Inc version
* Handle final states in get_oracle_sequence
* Inc version
* Try to fix parser training
* Inc version
* Fix
* Inc version
* Fix parser oracle
* Inc version
* Inc version
* Fix transition has_gold
* Inc version
* Try to use real histories, not oracle
* Inc version
* Upd parser
* Inc version
* WIP on rewrite parser
* WIP refactor parser
* New progress on parser model refactor
* Prepare to remove parser_model.pyx
* Convert parser from cdef class
* Delete spacy.ml.parser_model
* Delete _precomputable_affine module
* Wire up tb_framework to new parser model
* Wire up parser model
* Uncython ner.pyx and dep_parser.pyx
* Uncython
* Work on parser model
* Support unseen_classes in parser model
* Support unseen classes in parser
* Cleaner handling of unseen classes
* Work through tests
* Keep working through errors
* Keep working through errors
* Work on parser. 15 tests failing
* Xfail beam stuff. 9 failures
* More xfail. 7 failures
* Xfail. 6 failures
* cleanup
* formatting
* fixes
* pass nO through
* Fix empty doc in update
* Hackishly fix resizing. 3 failures
* Fix redundant test. 2 failures
* Add reference version
* black formatting
* Get tests passing with reference implementation
* Fix missing prints
* Add missing file
* Improve indexing on reference implementation
* Get non-reference forward func working
* Start rigging beam back up
* removing redundant tests, cf #8106
* black formatting
* temporarily xfailing issue 4314
* make flake8 happy again
* mypy fixes
* ensure labels are added upon predict
* cleanup remnants from merge conflicts
* Improve unseen label masking
Two changes to speed up masking by ~10%:
- Use a bool array rather than an array of float32.
- Let the mask indicate whether a label was seen, rather than
unseen. The mask is most frequently used to index scores for
seen labels. However, since the mask marked unseen labels,
this required computing an intermittent flipped mask.
* Write moves costs directly into numpy array (#10163)
This avoids elementwise indexing and the allocation of an additional
array.
Gives a ~15% speed improvement when using batch_by_sequence with size
32.
* Temporarily disable ner and rehearse tests
Until rehearse is implemented again in the refactored parser.
* Fix loss serialization issue (#10600)
* Fix loss serialization issue
Serialization of a model fails with:
TypeError: array(738.3855, dtype=float32) is not JSON serializable
Fix this using float conversion.
* Disable CI steps that require spacy.TransitionBasedParser.v2
After finishing the refactor, TransitionBasedParser.v2 should be
provided for backwards compat.
* Add back support for beam parsing to the refactored parser (#10633)
* Add back support for beam parsing
Beam parsing was already implemented as part of the `BeamBatch` class.
This change makes its counterpart `GreedyBatch`. Both classes are hooked
up in `TransitionModel`, selecting `GreedyBatch` when the beam size is
one, or `BeamBatch` otherwise.
* Use kwarg for beam width
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
* Avoid implicit default for beam_width and beam_density
* Parser.{beam,greedy}_parse: ensure labels are added
* Remove 'deprecated' comments
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
* Parser `StateC` optimizations (#10746)
* `StateC`: Optimizations
Avoid GIL acquisition in `__init__`
Increase default buffer capacities on init
Reduce C++ exception overhead
* Fix typo
* Replace `set::count` with `set::find`
* Add exception attribute to c'tor
* Remove unused import
* Use a power-of-two value for initial capacity
Use default-insert to init `_heads` and `_unshiftable`
* Merge `cdef` variable declarations and assignments
* Vectorize `example.get_aligned_parses` (#10789)
* `example`: Vectorize `get_aligned_parse`
Rename `numpy` import
* Convert aligned array to lists before returning
* Revert import renaming
* Elide slice arguments when selecting the entire range
* Tagger/morphologizer alignment performance optimizations (#10798)
* `example`: Unwrap `numpy` scalar arrays before passing them to `StringStore.__getitem__`
* `AlignmentArray`: Use native list as staging buffer for offset calculation
* `example`: Vectorize `get_aligned`
* Hoist inner functions out of `get_aligned`
* Replace inline `if..else` clause in assignment statement
* `AlignmentArray`: Use raw indexing into offset and data `numpy` arrays
* `example`: Replace array unique value check with `groupby`
* `example`: Correctly exclude tokens with no alignment in `_get_aligned_vectorized`
Simplify `_get_aligned_non_vectorized`
* `util`: Update `all_equal` docstring
* Explicitly use `int32_t*`
* Restore C CPU inference in the refactored parser (#10747)
* Bring back the C parsing model
The C parsing model is used for CPU inference and is still faster for
CPU inference than the forward pass of the Thinc model.
* Use C sgemm provided by the Ops implementation
* Make tb_framework module Cython, merge in C forward implementation
* TransitionModel: raise in backprop returned from forward_cpu
* Re-enable greedy parse test
* Return transition scores when forward_cpu is used
* Apply suggestions from code review
Import `Model` from `thinc.api`
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
* Use relative imports in tb_framework
* Don't assume a default for beam_width
* We don't have a direct dependency on BLIS anymore
* Rename forwards to _forward_{fallback,greedy_cpu}
* Require thinc >=8.1.0,<8.2.0
* tb_framework: clean up imports
* Fix return type of _get_seen_mask
* Move up _forward_greedy_cpu
* Style fixes.
* Lower thinc lowerbound to 8.1.0.dev0
* Formatting fix
Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
* Reimplement parser rehearsal function (#10878)
* Reimplement parser rehearsal function
Before the parser refactor, rehearsal was driven by a loop in the
`rehearse` method itself. For each parsing step, the loops would:
1. Get the predictions of the teacher.
2. Get the predictions and backprop function of the student.
3. Compute the loss and backprop into the student.
4. Move the teacher and student forward with the predictions of
the student.
In the refactored parser, we cannot perform search stepwise rehearsal
anymore, since the model now predicts all parsing steps at once.
Therefore, rehearsal is performed in the following steps:
1. Get the predictions of all parsing steps from the student, along
with its backprop function.
2. Get the predictions from the teacher, but use the predictions of
the student to advance the parser while doing so.
3. Compute the loss and backprop into the student.
To support the second step a new method, `advance_with_actions` is
added to `GreedyBatch`, which performs the provided parsing steps.
* tb_framework: wrap upper_W and upper_b in Linear
Thinc's Optimizer cannot handle resizing of existing parameters. Until
it does, we work around this by wrapping the weights/biases of the upper
layer of the parser model in Linear. When the upper layer is resized, we
copy over the existing parameters into a new Linear instance. This does
not trigger an error in Optimizer, because it sees the resized layer as
a new set of parameters.
* Add test for TransitionSystem.apply_actions
* Better FIXME marker
Co-authored-by: Madeesh Kannan <shadeMe@users.noreply.github.com>
* Fixes from Madeesh
* Apply suggestions from Sofie
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
* Remove useless assignment
Co-authored-by: Madeesh Kannan <shadeMe@users.noreply.github.com>
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
* Rename some identifiers in the parser refactor (#10935)
* Rename _parseC to _parse_batch
* tb_framework: prefix many auxiliary functions with underscore
To clearly state the intent that they are private.
* Rename `lower` to `hidden`, `upper` to `output`
* Parser slow test fixup
We don't have TransitionBasedParser.{v1,v2} until we bring it back as a
legacy option.
* Remove last vestiges of PrecomputableAffine
This does not exist anymore as a separate layer.
* ner: re-enable sentence boundary checks
* Re-enable test that works now.
* test_ner: make loss test more strict again
* Remove commented line
* Re-enable some more beam parser tests
* Remove unused _forward_reference function
* Update for CBlas changes in Thinc 8.1.0.dev2
Bump thinc dependency to 8.1.0.dev3.
* Remove references to spacy.TransitionBasedParser.{v1,v2}
Since they will not be offered starting with spaCy v4.
* `tb_framework`: Replace references to `thinc.backends.linalg` with `CBlas`
* dont use get_array_module (#11056) (#11293)
Co-authored-by: kadarakos <kadar.akos@gmail.com>
* Move `thinc.extra.search` to `spacy.pipeline._parser_internals` (#11317)
* `search`: Move from `thinc.extra.search`
Fix NPE in `Beam.__dealloc__`
* `pytest`: Add support for executing Cython tests
Move `search` tests from thinc and patch them to run with `pytest`
* `mypy` fix
* Update comment
* `conftest`: Expose `register_cython_tests`
* Remove unused import
* Move `argmax` impls to new `_parser_utils` Cython module (#11410)
* Parser does not have to be a cdef class anymore
This also fixes validation of the initialization schema.
* Add back spacy.TransitionBasedParser.v2
* Fix a rename that was missed in #10878.
So that rehearsal tests pass.
* Remove module from setup.py that got added during the merge
* Bring back support for `update_with_oracle_cut_size` (#12086)
* Bring back support for `update_with_oracle_cut_size`
This option was available in the pre-refactor parser, but was never
implemented in the refactored parser. This option cuts transition
sequences that are longer than `update_with_oracle_cut` size into
separate sequences that have at most `update_with_oracle_cut`
transitions. The oracle (gold standard) transition sequence is used to
determine the cuts and the initial states for the additional sequences.
Applying this cut makes the batches more homogeneous in the transition
sequence lengths, making forward passes (and as a consequence training)
much faster.
Training time 1000 steps on de_core_news_lg:
- Before this change: 149s
- After this change: 68s
- Pre-refactor parser: 81s
* Fix a rename that was missed in #10878.
So that rehearsal tests pass.
* Apply suggestions from @shadeMe
* Use chained conditional
* Test with update_with_oracle_cut_size={0, 1, 5, 100}
And fix a git that occurs with a cut size of 1.
* Fix up some merge fall out
* Update parser distillation for the refactor
In the old parser, we'd iterate over the transitions in the distill
function and compute the loss/gradients on the go. In the refactored
parser, we first let the student model parse the inputs. Then we'll let
the teacher compute the transition probabilities of the states in the
student's transition sequence. We can then compute the gradients of the
student given the teacher.
* Add back spacy.TransitionBasedParser.v1 references
- Accordion in the architecture docs.
- Test in test_parse, but disabled until we have a spacy-legacy release.
Co-authored-by: Matthew Honnibal <honnibal+gh@gmail.com>
Co-authored-by: svlandeg <svlandeg@github.com>
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
Co-authored-by: Madeesh Kannan <shadeMe@users.noreply.github.com>
Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
Co-authored-by: kadarakos <kadar.akos@gmail.com>
* Add `TrainablePipe.{distill,get_teacher_student_loss}`
This change adds two methods:
- `TrainablePipe::distill` which performs a training step of a
student pipe on a teacher pipe, giving a batch of `Doc`s.
- `TrainablePipe::get_teacher_student_loss` computes the loss
of a student relative to the teacher.
The `distill` or `get_teacher_student_loss` methods are also implemented
in the tagger, edit tree lemmatizer, and parser pipes, to enable
distillation in those pipes and as an example for other pipes.
* Fix stray `Beam` import
* Fix incorrect import
* Apply suggestions from code review
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
* Apply suggestions from code review
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
* TrainablePipe.distill: use `Iterable[Example]`
* Add Pipe.is_distillable method
* Add `validate_distillation_examples`
This first calls `validate_examples` and then checks that the
student/teacher tokens are the same.
* Update distill documentation
* Add distill documentation for all pipes that support distillation
* Fix incorrect identifier
* Apply suggestions from code review
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
* Add comment to explain `is_distillable`
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
* Add `ConsoleLogger.v3`
This addition expands the progress bar feature to count up the training/distillation steps to either the next evaluation pass or the maximum number of steps.
* Rename progress bar types
* Add defaults to docs
Minor fixes
* Move comment
* Minor punctuation fixes
* Explicitly check for `None` when validating progress bar type
Co-authored-by: Paul O'Leary McCann <polm@dampfkraft.com>
* Convert all individual values explicitly to uint64 for array-based doc representations
* Temporarily test with latest numpy v1.24.0rc
* Remove unnecessary conversion from attr_t
* Reduce number of individual casts
* Convert specifically from int32 to uint64
* Revert "Temporarily test with latest numpy v1.24.0rc"
This reverts commit eb0e3c5006.
* Also use int32 in tests
* Add `training.before_update` callback
This callback can be used to implement training paradigms like gradual (un)freezing of components (e.g: the Transformer) after a certain number of training steps to mitigate catastrophic forgetting during fine-tuning.
* Fix type annotation, default config value
* Generalize arguments passed to the callback
* Update schema
* Pass `epoch` to callback, rename `current_step` to `step`
* Add test
* Simplify test
* Replace config string with `spacy.blank`
* Apply suggestions from code review
Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
* Cleanup imports
Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
Preserve both `-` and `O` annotation in augmenters rather than relying
on `Example.to_dict`'s default support for one option outside of labeled
entity spans.
This is intended as a temporary workaround for augmenters for v3.4.x.
The behavior of `Example` and related IOB utils could be improved in the
general case for v3.5.
* adding spans to doc_annotation in Example.to_dict
* to_dict compatible with from_dict: tuples instead of spans
* use strings for label and kb_id
* Simplify test
* Update data formats docs
Co-authored-by: Stefanie Wolf <stefanie.wolf@vitecsoftware.com>
Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
* account for NER labels with a hyphen in the name
* cleanup
* fix docstring
* add return type to helper method
* shorter method and few more occurrences
* user helper method across repo
* fix circular import
* partial revert to avoid circular import
* Alignment: use a simplified ragged type for performance
This introduces the AlignmentArray type, which is a simplified version
of Ragged that performs better on the simple(r) indexing performed for
alignment.
* AlignmentArray: raise an error when using unsupported index
* AlignmentArray: move error messages to Errors
* AlignmentArray: remove simlified ... with simplifications
* AlignmentArray: fix typo that broke a[n:n] indexing
* Add vector deduplication
* Add `Vocab.deduplicate_vectors()`
* Always run deduplication in `spacy init vectors`
* Clean up a few vector-related error messages and docs examples
* Always unique with numpy
* Fix types
* Fix get_matching_ents
Not sure what happened here - the code prior to this commit simply does
not work. It's already covered by entity linker tests, which were
succeeding in the NEL PR, but couldn't possibly succeed on master.
* Fix test
Test was indented inside another test and so doesn't seem to have been
running properly.