* Add the configuration schema for distillation
This also adds the default configuration and some tests. The schema will
be used by the training loop and `distill` subcommand.
* Format
* Change distillation shortopt to -d
* Fix description of max_epochs
* Rename distillation flag to -dt
* Rename `pipe_map` to `student_to_teacher`
* Try to fix doc.copy
* Set dev version
* Make vocab always own lexemes
* Change version
* Add SpanGroups.copy method
* Fix set_annotations during Parser.update
* Fix dict proxy copy
* Upd version
* Fix copying SpanGroups
* Fix set_annotations in parser.update
* Fix parser set_annotations during update
* Revert "Fix parser set_annotations during update"
This reverts commit eb138c89ed.
* Revert "Fix set_annotations in parser.update"
This reverts commit c6df0eafd0.
* Fix set_annotations during parser update
* Inc version
* Handle final states in get_oracle_sequence
* Inc version
* Try to fix parser training
* Inc version
* Fix
* Inc version
* Fix parser oracle
* Inc version
* Inc version
* Fix transition has_gold
* Inc version
* Try to use real histories, not oracle
* Inc version
* Upd parser
* Inc version
* WIP on rewrite parser
* WIP refactor parser
* New progress on parser model refactor
* Prepare to remove parser_model.pyx
* Convert parser from cdef class
* Delete spacy.ml.parser_model
* Delete _precomputable_affine module
* Wire up tb_framework to new parser model
* Wire up parser model
* Uncython ner.pyx and dep_parser.pyx
* Uncython
* Work on parser model
* Support unseen_classes in parser model
* Support unseen classes in parser
* Cleaner handling of unseen classes
* Work through tests
* Keep working through errors
* Keep working through errors
* Work on parser. 15 tests failing
* Xfail beam stuff. 9 failures
* More xfail. 7 failures
* Xfail. 6 failures
* cleanup
* formatting
* fixes
* pass nO through
* Fix empty doc in update
* Hackishly fix resizing. 3 failures
* Fix redundant test. 2 failures
* Add reference version
* black formatting
* Get tests passing with reference implementation
* Fix missing prints
* Add missing file
* Improve indexing on reference implementation
* Get non-reference forward func working
* Start rigging beam back up
* removing redundant tests, cf #8106
* black formatting
* temporarily xfailing issue 4314
* make flake8 happy again
* mypy fixes
* ensure labels are added upon predict
* cleanup remnants from merge conflicts
* Improve unseen label masking
Two changes to speed up masking by ~10%:
- Use a bool array rather than an array of float32.
- Let the mask indicate whether a label was seen, rather than
unseen. The mask is most frequently used to index scores for
seen labels. However, since the mask marked unseen labels,
this required computing an intermediate flipped mask.
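A minimal NumPy sketch of the two mask layouts described above. The names `scores`, `unseen_mask` and `seen_mask` are illustrative assumptions and are not taken from the parser code.

```python
import numpy as np

# Hypothetical scores for 2 states and 4 labels; label 2 was never seen.
scores = np.array(
    [[0.2, 1.5, -0.3, 0.9],
     [1.1, 0.4, 0.7, -0.2]],
    dtype="float32",
)
unseen_labels = [2]

# Old layout: a float32 mask marking *unseen* labels. Selecting the seen
# labels requires building a temporary flipped boolean mask first.
unseen_mask = np.zeros(scores.shape[1], dtype="float32")
unseen_mask[unseen_labels] = 1.0
flipped = unseen_mask == 0.0
old_seen_scores = scores[:, flipped]

# New layout: a bool mask marking *seen* labels, usable as an index directly.
seen_mask = np.ones(scores.shape[1], dtype=bool)
seen_mask[unseen_labels] = False
new_seen_scores = scores[:, seen_mask]
```

Both layouts select the same columns; the second one skips the temporary array.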
* Write moves costs directly into numpy array (#10163)
This avoids elementwise indexing and the allocation of an additional
array.
Gives a ~15% speed improvement when using batch_by_sequence with size
32.
* Temporarily disable ner and rehearse tests
Until rehearse is implemented again in the refactored parser.
* Fix loss serialization issue (#10600)
* Fix loss serialization issue
Serialization of a model fails with:
TypeError: array(738.3855, dtype=float32) is not JSON serializable
Fix this using float conversion.
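A small sketch of this failure mode and the fix, using the standard-library `json` module as a stand-in for the serializer that produced the error above (an assumption; the exact error text differs between serializers).

```python
import json

import numpy as np

# A 0-d float32 array, like the loss value in the error message.
loss = np.asarray(738.3855, dtype="float32")

try:
    json.dumps({"loss": loss})
    failed = False
except TypeError:
    # NumPy arrays are not JSON serializable.
    failed = True

# Converting to a built-in float first makes the value serializable.
serialized = json.dumps({"loss": float(loss)})
```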
* Disable CI steps that require spacy.TransitionBasedParser.v2
After finishing the refactor, TransitionBasedParser.v2 should be
provided for backwards compat.
* Add back support for beam parsing to the refactored parser (#10633)
* Add back support for beam parsing
Beam parsing was already implemented as part of the `BeamBatch` class.
This change makes its counterpart `GreedyBatch`. Both classes are hooked
up in `TransitionModel`, selecting `GreedyBatch` when the beam size is
one, or `BeamBatch` otherwise.
* Use kwarg for beam width
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
* Avoid implicit default for beam_width and beam_density
* Parser.{beam,greedy}_parse: ensure labels are added
* Remove 'deprecated' comments
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
* Parser `StateC` optimizations (#10746)
* `StateC`: Optimizations
Avoid GIL acquisition in `__init__`
Increase default buffer capacities on init
Reduce C++ exception overhead
* Fix typo
* Replace `set::count` with `set::find`
* Add exception attribute to c'tor
* Remove unused import
* Use a power-of-two value for initial capacity
Use default-insert to init `_heads` and `_unshiftable`
* Merge `cdef` variable declarations and assignments
* Vectorize `example.get_aligned_parses` (#10789)
* `example`: Vectorize `get_aligned_parse`
Rename `numpy` import
* Convert aligned array to lists before returning
* Revert import renaming
* Elide slice arguments when selecting the entire range
* Tagger/morphologizer alignment performance optimizations (#10798)
* `example`: Unwrap `numpy` scalar arrays before passing them to `StringStore.__getitem__`
* `AlignmentArray`: Use native list as staging buffer for offset calculation
* `example`: Vectorize `get_aligned`
* Hoist inner functions out of `get_aligned`
* Replace inline `if..else` clause in assignment statement
* `AlignmentArray`: Use raw indexing into offset and data `numpy` arrays
* `example`: Replace array unique value check with `groupby`
* `example`: Correctly exclude tokens with no alignment in `_get_aligned_vectorized`
Simplify `_get_aligned_non_vectorized`
* `util`: Update `all_equal` docstring
* Explicitly use `int32_t*`
* Restore C CPU inference in the refactored parser (#10747)
* Bring back the C parsing model
The C parsing model is used for CPU inference and is still faster for
CPU inference than the forward pass of the Thinc model.
* Use C sgemm provided by the Ops implementation
* Make tb_framework module Cython, merge in C forward implementation
* TransitionModel: raise in backprop returned from forward_cpu
* Re-enable greedy parse test
* Return transition scores when forward_cpu is used
* Apply suggestions from code review
Import `Model` from `thinc.api`
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
* Use relative imports in tb_framework
* Don't assume a default for beam_width
* We don't have a direct dependency on BLIS anymore
* Rename forwards to _forward_{fallback,greedy_cpu}
* Require thinc >=8.1.0,<8.2.0
* tb_framework: clean up imports
* Fix return type of _get_seen_mask
* Move up _forward_greedy_cpu
* Style fixes.
* Lower thinc lowerbound to 8.1.0.dev0
* Formatting fix
Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
* Reimplement parser rehearsal function (#10878)
* Reimplement parser rehearsal function
Before the parser refactor, rehearsal was driven by a loop in the
`rehearse` method itself. For each parsing step, the loop would:
1. Get the predictions of the teacher.
2. Get the predictions and backprop function of the student.
3. Compute the loss and backprop into the student.
4. Move the teacher and student forward with the predictions of
the student.
In the refactored parser, we cannot perform such stepwise rehearsal
anymore, since the model now predicts all parsing steps at once.
Therefore, rehearsal is performed in the following steps:
1. Get the predictions of all parsing steps from the student, along
with its backprop function.
2. Get the predictions from the teacher, but use the predictions of
the student to advance the parser while doing so.
3. Compute the loss and backprop into the student.
To support the second step, a new method, `advance_with_actions`, is
added to `GreedyBatch`, which performs the provided parsing steps.
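The three steps of the refactored rehearsal can be sketched with toy models as follows. Everything here is hypothetical: states are plain tuples of applied actions, `teacher_scores_for_actions` only plays a role similar to `advance_with_actions`, and the squared-error loss is chosen for illustration.

```python
from typing import Callable, List, Tuple

State = Tuple[int, ...]  # toy state: the actions applied so far
Scorer = Callable[[State], List[float]]


def toy_student(state: State) -> List[float]:
    return [0.1 * len(state), 0.5, 0.2]


def toy_teacher(state: State) -> List[float]:
    return [0.0, 1.0, 0.0]


def student_parse(student: Scorer, n_steps: int) -> Tuple[List[List[float]], List[int]]:
    """Step 1: score all parsing steps with the student, following its own argmax."""
    state: State = ()
    scores, actions = [], []
    for _ in range(n_steps):
        step_scores = student(state)
        action = max(range(len(step_scores)), key=step_scores.__getitem__)
        scores.append(step_scores)
        actions.append(action)
        state = state + (action,)
    return scores, actions


def teacher_scores_for_actions(teacher: Scorer, actions: List[int]) -> List[List[float]]:
    """Step 2: score each state with the teacher, but advance with the student's actions."""
    state: State = ()
    scores = []
    for action in actions:
        scores.append(teacher(state))
        state = state + (action,)
    return scores


def squared_error(student_scores, teacher_scores):
    """Step 3: loss and gradient with respect to the student scores."""
    grads = [
        [s - t for s, t in zip(s_row, t_row)]
        for s_row, t_row in zip(student_scores, teacher_scores)
    ]
    loss = sum(g * g for row in grads for g in row)
    return loss, grads


student_scores, actions = student_parse(toy_student, n_steps=3)
teacher_scores = teacher_scores_for_actions(toy_teacher, actions)
loss, grads = squared_error(student_scores, teacher_scores)
```

In a real model, `grads` would be passed to the student's backprop function; the teacher is never updated.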
* tb_framework: wrap upper_W and upper_b in Linear
Thinc's Optimizer cannot handle resizing of existing parameters. Until
it does, we work around this by wrapping the weights/biases of the upper
layer of the parser model in Linear. When the upper layer is resized, we
copy over the existing parameters into a new Linear instance. This does
not trigger an error in Optimizer, because it sees the resized layer as
a new set of parameters.
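A toy sketch of the copy-on-resize workaround, assuming a hypothetical `ToyLinear` container rather than Thinc's `Linear`, and assuming new rows are zero-initialized (the commit does not say how they are initialized).

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ToyLinear:
    W: List[List[float]]  # shape (nO, nI)
    b: List[float]        # shape (nO,)


def resize_output(layer: ToyLinear, new_nO: int) -> ToyLinear:
    """Return a *new* layer with new_nO outputs, copying the existing parameters."""
    old_nO = len(layer.W)
    nI = len(layer.W[0])
    if new_nO < old_nO:
        raise ValueError("shrinking the output layer is not supported in this sketch")
    extra = new_nO - old_nO
    W = [row[:] for row in layer.W] + [[0.0] * nI for _ in range(extra)]
    b = layer.b[:] + [0.0] * extra
    # A fresh instance: an optimizer that tracks parameters per layer
    # would see this as a new set of parameters.
    return ToyLinear(W=W, b=b)


layer = ToyLinear(W=[[1.0, 2.0], [3.0, 4.0]], b=[0.5, -0.5])
resized = resize_output(layer, 4)
```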
* Add test for TransitionSystem.apply_actions
* Better FIXME marker
Co-authored-by: Madeesh Kannan <shadeMe@users.noreply.github.com>
* Fixes from Madeesh
* Apply suggestions from Sofie
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
* Remove useless assignment
Co-authored-by: Madeesh Kannan <shadeMe@users.noreply.github.com>
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
* Rename some identifiers in the parser refactor (#10935)
* Rename _parseC to _parse_batch
* tb_framework: prefix many auxiliary functions with underscore
To clearly state the intent that they are private.
* Rename `lower` to `hidden`, `upper` to `output`
* Parser slow test fixup
We don't have TransitionBasedParser.{v1,v2} until we bring it back as a
legacy option.
* Remove last vestiges of PrecomputableAffine
This does not exist anymore as a separate layer.
* ner: re-enable sentence boundary checks
* Re-enable test that works now.
* test_ner: make loss test more strict again
* Remove commented line
* Re-enable some more beam parser tests
* Remove unused _forward_reference function
* Update for CBlas changes in Thinc 8.1.0.dev2
Bump thinc dependency to 8.1.0.dev3.
* Remove references to spacy.TransitionBasedParser.{v1,v2}
Since they will not be offered starting with spaCy v4.
* `tb_framework`: Replace references to `thinc.backends.linalg` with `CBlas`
* dont use get_array_module (#11056) (#11293)
Co-authored-by: kadarakos <kadar.akos@gmail.com>
* Move `thinc.extra.search` to `spacy.pipeline._parser_internals` (#11317)
* `search`: Move from `thinc.extra.search`
Fix NPE in `Beam.__dealloc__`
* `pytest`: Add support for executing Cython tests
Move `search` tests from thinc and patch them to run with `pytest`
* `mypy` fix
* Update comment
* `conftest`: Expose `register_cython_tests`
* Remove unused import
* Move `argmax` impls to new `_parser_utils` Cython module (#11410)
* Parser does not have to be a cdef class anymore
This also fixes validation of the initialization schema.
* Add back spacy.TransitionBasedParser.v2
* Fix a rename that was missed in #10878.
So that rehearsal tests pass.
* Remove module from setup.py that got added during the merge
* Bring back support for `update_with_oracle_cut_size` (#12086)
* Bring back support for `update_with_oracle_cut_size`
This option was available in the pre-refactor parser, but was never
implemented in the refactored parser. This option cuts transition
sequences that are longer than `update_with_oracle_cut_size` into
separate sequences that have at most `update_with_oracle_cut_size`
transitions. The oracle (gold standard) transition sequence is used to
determine the cuts and the initial states for the additional sequences.
Applying this cut makes the batches more homogeneous in the transition
sequence lengths, making forward passes (and as a consequence training)
much faster.
Training time for 1000 steps on de_core_news_lg:
- Before this change: 149s
- After this change: 68s
- Pre-refactor parser: 81s
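A minimal sketch of the cutting idea, under the assumptions that an oracle sequence is a plain list of action names and that a cut size of 0 disables cutting. `cut_oracle_sequence` is a hypothetical helper, not the parser's implementation.

```python
from typing import List, Sequence, Tuple


def cut_oracle_sequence(
    actions: Sequence[str], max_length: int
) -> List[Tuple[int, List[str]]]:
    """Split an oracle sequence into pieces of at most max_length actions.

    Each piece is returned with its start offset: the number of oracle
    actions that must be applied to reach the piece's initial state.
    """
    if max_length <= 0 or len(actions) <= max_length:
        return [(0, list(actions))]
    return [
        (start, list(actions[start:start + max_length]))
        for start in range(0, len(actions), max_length)
    ]


oracle = ["S", "S", "L", "S", "R", "R", "D"]
pieces = cut_oracle_sequence(oracle, 3)
# pieces == [(0, ["S", "S", "L"]), (3, ["S", "R", "R"]), (6, ["D"])]
```

Because no piece is longer than the cut size, sequences in a batch have similar lengths and less work is wasted on padding or idle states.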
* Fix a rename that was missed in #10878.
So that rehearsal tests pass.
* Apply suggestions from @shadeMe
* Use chained conditional
* Test with update_with_oracle_cut_size={0, 1, 5, 100}
And fix a bug that occurs with a cut size of 1.
* Fix up some merge fall out
* Update parser distillation for the refactor
In the old parser, we'd iterate over the transitions in the distill
function and compute the loss/gradients on the go. In the refactored
parser, we first let the student model parse the inputs. Then we'll let
the teacher compute the transition probabilities of the states in the
student's transition sequence. We can then compute the gradients of the
student given the teacher.
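As an illustration of the last step, the sketch below computes the gradient of a cross-entropy loss between teacher and student distributions for each state in the student's transition sequence. The cross-entropy loss and all names are assumptions; the commit does not state which loss the parser uses.

```python
import math
from typing import List


def softmax(scores: List[float]) -> List[float]:
    m = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_gradient(
    student_scores: List[List[float]], teacher_scores: List[List[float]]
) -> List[List[float]]:
    """Gradient of cross-entropy(teacher, student) w.r.t. the student scores.

    For softmax outputs this is student_probs - teacher_probs per state.
    """
    grads = []
    for s_row, t_row in zip(student_scores, teacher_scores):
        s_probs = softmax(s_row)
        t_probs = softmax(t_row)
        grads.append([s - t for s, t in zip(s_probs, t_probs)])
    return grads


# One row per state visited by the student; the teacher scores the same states.
student_scores = [[2.0, 0.5, -1.0], [0.0, 0.0, 0.0]]
teacher_scores = [[0.5, 2.0, -1.0], [0.0, 0.0, 0.0]]
grads = distillation_gradient(student_scores, teacher_scores)
```

Where student and teacher agree (the second state), the gradient is zero.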
* Add back spacy.TransitionBasedParser.v1 references
- Accordion in the architecture docs.
- Test in test_parse, but disabled until we have a spacy-legacy release.
Co-authored-by: Matthew Honnibal <honnibal+gh@gmail.com>
Co-authored-by: svlandeg <svlandeg@github.com>
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
Co-authored-by: Madeesh Kannan <shadeMe@users.noreply.github.com>
Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
Co-authored-by: kadarakos <kadar.akos@gmail.com>
* Tagger: use unnormalized probabilities for inference
Using unnormalized softmax avoids use of the relatively expensive exp function,
which can significantly speed up non-transformer models (e.g. I got a speedup
of 27% on a German tagging + parsing pipeline).
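A short sketch of why this works at inference time: softmax is monotonic, so the highest raw score and the highest probability always belong to the same tag. The helper names are illustrative.

```python
import math
from typing import List


def softmax(scores: List[float]) -> List[float]:
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def argmax(values: List[float]) -> int:
    return max(range(len(values)), key=values.__getitem__)


# Hypothetical raw tag scores for two tokens.
logits = [[1.2, -0.3, 3.1], [0.0, 2.2, 2.1]]

normalized_tags = [argmax(softmax(row)) for row in logits]
unnormalized_tags = [argmax(row) for row in logits]  # no exp needed
```

The trade-off is that the unnormalized scores are not probabilities, so code that reads them as such needs normalization enabled.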
* Add spacy.Tagger.v2 with configurable normalization
Normalization of probabilities is disabled by default to improve
performance.
* Update documentation, models, and tests to spacy.Tagger.v2
* Move Tagger.v1 to spacy-legacy
* docs/architectures: run prettier
* Unnormalized softmax is now a Softmax_v2 option
* Require thinc 8.0.14 and spacy-legacy 3.0.9
* Migrate regressions 1-1000
* Move serialize test to correct file
* Remove tests that won't work in v3
* Migrate regressions 1000-1500
Removed regression test 1250 because v3 doesn't support the old LEX
scheme anymore.
* Add missing imports in serializer tests
* Migrate tests 1500-2000
* Migrate regressions from 2000-2500
* Migrate regressions from 2501-3000
* Migrate regressions from 3000-3501
* Migrate regressions from 3501-4000
* Migrate regressions from 4001-4500
* Migrate regressions from 4501-5000
* Migrate regressions from 5001-5501
* Migrate regressions from 5501 to 7000
* Migrate regressions from 7001 to 8000
* Migrate remaining regression tests
* Fixing missing imports
* Update docs with new system [ci skip]
* Update CONTRIBUTING.md
- Fix formatting
- Update wording
* Remove lemmatizer tests in el lang
* Move a few tests into the general tokenizer
* Separate Doc and DocBin tests
* initialize NLP with train corpus
* add more pretraining tests
* more tests
* function to fetch tok2vec layer for pretraining
* clarify parameter name
* test different objectives
* formatting
* fix check for static vectors when using vectors objective
* clarify docs
* logger statement
* fix init_tok2vec and proc.initialize order
* test training after pretraining
* add init_config tests for pretraining
* pop pretraining block to avoid config validation errors
* custom errors
* define new architectures for the pretraining objective
* add loss function as attr of the model
* cleanup
* cleanup
* shorten name
* fix typo
* remove unused error
* Prevent Tagger model init with 0 labels
Raise an error before trying to initialize a tagger model with 0 labels.
* Add dummy tagger label for test
* Remove tagless tagger model initialization
* Fix error number after merge
* Add dummy tagger label to test
* Fix formatting
Co-authored-by: Matthew Honnibal <honnibal+gh@gmail.com>
* Update with WIP
* Update with WIP
* Update with pipeline serialization
* Update types and pipe factories
* Add deep merge, tidy up and add tests
* Fix pipe creation from config
* Don't validate default configs on load
* Update spacy/language.py
Co-authored-by: Ines Montani <ines@ines.io>
* Adjust factory/component meta error
* Clean up factory args and remove defaults
* Add test for failing empty dict defaults
* Update pipeline handling and methods
* provide KB as registry function instead of as object
* small change in test to make functionality more clear
* update example script for EL configuration
* Fix typo
* Simplify test
* Simplify test
* splitting pipes.pyx into separate files
* moving default configs to each component file
* fix batch_size type
* removing default values from component constructors where possible (TODO: test 4725)
* skip instead of xfail
* Add test for config -> nlp with multiple instances
* pipeline.pipes -> pipeline.pipe
* Tidy up, document, remove kwargs
* small cleanup/generalization for Tok2VecListener
* use DEFAULT_UPSTREAM field
* revert to avoid circular imports
* Fix tests
* Replace deprecated arg
* Make model dirs require config
* fix pickling of keyword-only arguments in constructor
* WIP: clean up and integrate full config
* Add helper to handle function args more reliably
Now also includes keyword-only args
* Fix config composition and serialization
* Improve config debugging and add visual diff
* Remove unused defaults and fix type
* Remove pipeline and factories from meta
* Update spacy/default_config.cfg
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
* Update spacy/default_config.cfg
* small UX edits
* avoid printing stack trace for debug CLI commands
* Add support for language-specific factories
* specify the section of the config which holds the model to debug
* WIP: add Language.from_config
* Update with language data refactor WIP
* Auto-format
* Add backwards-compat handling for Language.factories
* Update morphologizer.pyx
* Fix morphologizer
* Update and simplify lemmatizers
* Fix Japanese tests
* Port over tagger changes
* Fix Chinese and tests
* Update to latest Thinc
* WIP: xfail first Russian lemmatizer test
* Fix component-specific overrides
* fix nO for output layers in debug_model
* Fix default value
* Fix tests and don't pass objects in config
* Fix deep merging
* Fix lemma lookup data registry
Only load the lookups if an entry is available in the registry (and if spacy-lookups-data is installed).
* Add types
* Add Vocab.from_config
* Fix typo
* Fix tests
* Make config copying more elegant
* Fix pipe analysis
* Fix lemmatizers and is_base_form
* WIP: move language defaults to config
* Fix morphology type
* Fix vocab
* Remove comment
* Update to latest Thinc
* Add morph rules to config
* Tidy up
* Remove set_morphology option from tagger factory
* Hack use_gpu
* Move [pipeline] to top-level block and make [nlp.pipeline] list
Allows separating component blocks from component order. Otherwise, reordering the config would change the component order, which is bad. This also allows the initial config to define more components than it uses.
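A rough sketch of the resulting layout, parsed with the standard-library `configparser` rather than the project's own config system. The section and key names follow this commit (`[pipeline.*]` blocks, a `pipeline` list under `[nlp]`) and are assumptions about the exact format.

```python
import configparser
import json

CONFIG_TEXT = """
[nlp]
pipeline = ["tagger", "parser"]

[pipeline.tagger]
factory = "tagger"

[pipeline.parser]
factory = "parser"

[pipeline.ner]
factory = "ner"
"""

parser = configparser.ConfigParser()
parser.read_string(CONFIG_TEXT)

# Component order comes from the list, not from the order of the blocks.
order = json.loads(parser["nlp"]["pipeline"])

# Blocks can be defined without being part of the pipeline.
defined = [
    name.split(".", 1)[1]
    for name in parser.sections()
    if name.startswith("pipeline.")
]
unused = [name for name in defined if name not in order]
```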
* Fix use_gpu and resume in CLI
* Auto-format
* Remove resume from config
* Fix formatting and error
* [pipeline] -> [components]
* Fix types
* Fix tagger test: requires set_morphology?
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
Co-authored-by: svlandeg <sofie.vanlandeghem@gmail.com>
Co-authored-by: Matthew Honnibal <honnibal+gh@gmail.com>
* Draft layer for BILUO actions
* Fixes to biluo layer
* WIP on BILUO layer
* Add tests for BILUO layer
* Format
* Fix transitions
* Update test
* Link in the simple_ner
* Update BILUO tagger
* Update __init__
* Import simple_ner
* Update test
* Import
* Add files
* Add config
* Fix label passing for BILUO and tagger
* Fix label handling for simple_ner component
* Update simple NER test
* Update config
* Hack train script
* Update BILUO layer
* Fix SimpleNER component
* Update train_from_config
* Add biluo_to_iob helper
* Add IOB layer
* Add IOBTagger model
* Update biluo layer
* Update SimpleNER tagger
* Update BILUO
* Read random seed in train-from-config
* Update use of normal_init
* Fix normalization of gradient in SimpleNER
* Update IOBTagger
* Remove print
* Tweak masking in BILUO
* Add dropout in SimpleNER
* Update thinc
* Tidy up simple_ner
* Fix biluo model
* Unhack train-from-config
* Update setup.cfg and requirements
* Add tb_framework.py for parser model
* Try to avoid memory leak in BILUO
* Move ParserModel into spacy.ml, avoid need for subclass.
* Use updated parser model
* Remove incorrect call to model.initialize in PrecomputableAffine
* Update parser model
* Avoid divide by zero in tagger
* Add extra dropout layer in tagger
* Refine minibatch_by_words function to avoid oom
* Fix parser model after refactor
* Try to avoid div-by-zero in SimpleNER
* Fix infinite loop in minibatch_by_words
* Use SequenceCategoricalCrossentropy in Tagger
* Fix parser model when hidden layer
* Remove extra dropout from tagger
* Add extra nan check in tagger
* Fix thinc version
* Update tests and imports
* Fix test
* Update test
* Update tests
* Fix tests
* Fix test
Co-authored-by: Ines Montani <ines@ines.io>
* fix grad_clip naming
* cleaning up pretrained_vectors out of cfg
* further refactoring Model init's
* move Model building out of pipes
* further refactor to require a model config when creating a pipe
* small fixes
* making cfg in nn_parser more consistent
* fixing nr_class for parser
* fixing nn_parser's nO
* fix printing of loss
* architectures in own file per type, consistent naming
* convenience methods default_tagger_config and default_tok2vec_config
* let create_pipe access default config if available for that component
* default_parser_config
* move defaults to separate folder
* allow reading nlp from package or dir with argument 'name'
* architecture spacy.VocabVectors.v1 to read static vectors from file
* cleanup
* default configs for nel, textcat, morphologizer, tensorizer
* fix imports
* fixing unit tests
* fixes and clean up
* fixing defaults, nO, fix unit tests
* restore parser IO
* fix IO
* 'fix' serialization test
* add *.cfg to manifest
* fix example configs with additional arguments
* replace Morphologizer with Tagger
* add IO bit when testing overfitting of tagger (currently failing)
* fix IO - don't initialize when reading from disk
* expand overfitting tests to also check IO goes OK
* remove dropout from HashEmbed to fix Tagger performance
* add defaults for sentrec
* update thinc
* always pass a Model instance to a Pipe
* fix piped_added statement
* remove obsolete W029
* remove obsolete errors
* restore byte checking tests (work again)
* clean up test
* further test cleanup
* convert from config to Model in create_pipe
* bring back error when component is not initialized
* cleanup
* remove calls for nlp2.begin_training
* use thinc.api in imports
* allow setting charembed's nM and nC
* fix for hardcoded nM/nC + unit test
* formatting fixes
* trigger build