Torch is required for the coref/spanpred models but shouldn't be
required for spaCy in general.
The one tricky part of this is that one function in coref_util relied on
torch, but that file was imported in several places. Since the function
was only used in one place I moved it there.
This absolutely does not work. First step here is getting over most of
the code in roughly the files we want it in. After the code has been
pulled over it can be restructured to match spaCy and cleaned up.
* Draft spancat model
* Add spancat model
* Add test for extract_spans
* Add extract_spans layer
* Upd extract_spans
* Add spancat model
* Add test for spancat model
* Upd spancat model
* Update spancat component
* Upd spancat
* Update spancat model
* Add quick spancat test
* Import SpanCategorizer
* Fix SpanCategorizer component
* Import SpanGroup
* Fix span extraction
* Fix import
* Fix import
* Upd model
* Update spancat models
* Add scoring, update defaults
* Update and add docs
* Fix type
* Update spacy/ml/extract_spans.py
* Auto-format and fix import
* Fix comment
* Fix type
* Fix type
* Update website/docs/api/spancategorizer.md
* Fix comment
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
* Better defense
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
* Fix labels list
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
* Update spacy/ml/extract_spans.py
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
* Update spacy/pipeline/spancat.py
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
* Set annotations during update
* Set annotations in spancat
* fix imports in test
* Update spacy/pipeline/spancat.py
* replace MaxoutLogistic with LinearLogistic
* fix config
* various small fixes
* remove set_annotations parameter in update
* use our beloved tupley format with recent support for doc.spans
* bugfix to allow renaming the default span_key (scores weren't showing up)
* use different key in docs example
* change defaults to better-working parameters from project (WIP)
* register spacy.extract_spans.v1 for legacy purposes
* Upd dev version so can build wheel
* layers instead of architectures for smaller building blocks
* Update website/docs/api/spancategorizer.md
Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
* Update website/docs/api/spancategorizer.md
Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
* Include additional scores from overrides in combined score weights
* Parameterize spans key in scoring
Parameterize the `SpanCategorizer` `spans_key` for scoring purposes so
that it's possible to evaluate multiple `spancat` components in the same
pipeline.
* Use the (intentionally very short) default spans key `sc` in the
`SpanCategorizer`
* Adjust the default score weights to include the default key
* Adjust the scorer to use `spans_{spans_key}` as the prefix for the
returned score
* Revert addition of `attr_name` argument to `score_spans` and adjust
the key in the `getter` instead.
Note that for `spancat` components with a custom `span_key`, the score
weights currently need to be modified manually in
`[training.score_weights]` for them to be available during training. To
suppress the default score weights `spans_sc_p/r/f` during training, set
them to `null` in `[training.score_weights]`.
* Update website/docs/api/scorer.md
* Fix scorer for spans key containing underscore
* Increment version
* Add Spans to Evaluate CLI (#8439)
* Add Spans to Evaluate CLI
* Change to spans_key
* Add spans per_type output
Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
* Fix spancat GPU issues (#8455)
* Fix GPU issues
* Require thinc >=8.0.6
* Switch to glorot_uniform_init
* Fix and test ngram suggester
* Include final ngram in doc for all sizes
* Fix ngrams for docs of the same length as ngram size
* Handle batches of docs that result in no ngrams
* Add tests
Co-authored-by: Ines Montani <ines@ines.io>
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
Co-authored-by: svlandeg <sofie.vanlandeghem@gmail.com>
Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
Co-authored-by: Nirant <NirantK@users.noreply.github.com>
* initial coref_er pipe
* matcher more flexible
* base coref component without actual model
* initial setup of coref_er.score
* rename to include_label
* preliminary score_clusters method
* apply scoring in coref component
* IO fix
* return None loss for now
* rename to CoreferenceResolver
* some preliminary unit tests
* use registry as callable
* define new architectures for the pretraining objective
* add loss function as attr of the omdel
* cleanup
* cleanup
* shorten name
* fix typo
* remove unused error
* Draft layer for BILUO actions
* Fixes to biluo layer
* WIP on BILUO layer
* Add tests for BILUO layer
* Format
* Fix transitions
* Update test
* Link in the simple_ner
* Update BILUO tagger
* Update __init__
* Import simple_ner
* Update test
* Import
* Add files
* Add config
* Fix label passing for BILUO and tagger
* Fix label handling for simple_ner component
* Update simple NER test
* Update config
* Hack train script
* Update BILUO layer
* Fix SimpleNER component
* Update train_from_config
* Add biluo_to_iob helper
* Add IOB layer
* Add IOBTagger model
* Update biluo layer
* Update SimpleNER tagger
* Update BILUO
* Read random seed in train-from-config
* Update use of normal_init
* Fix normalization of gradient in SimpleNER
* Update IOBTagger
* Remove print
* Tweak masking in BILUO
* Add dropout in SimpleNER
* Update thinc
* Tidy up simple_ner
* Fix biluo model
* Unhack train-from-config
* Update setup.cfg and requirements
* Add tb_framework.py for parser model
* Try to avoid memory leak in BILUO
* Move ParserModel into spacy.ml, avoid need for subclass.
* Use updated parser model
* Remove incorrect call to model.initializre in PrecomputableAffine
* Update parser model
* Avoid divide by zero in tagger
* Add extra dropout layer in tagger
* Refine minibatch_by_words function to avoid oom
* Fix parser model after refactor
* Try to avoid div-by-zero in SimpleNER
* Fix infinite loop in minibatch_by_words
* Use SequenceCategoricalCrossentropy in Tagger
* Fix parser model when hidden layer
* Remove extra dropout from tagger
* Add extra nan check in tagger
* Fix thinc version
* Update tests and imports
* Fix test
* Update test
* Update tests
* Fix tests
* Fix test
Co-authored-by: Ines Montani <ines@ines.io>
* fix grad_clip naming
* cleaning up pretrained_vectors out of cfg
* further refactoring Model init's
* move Model building out of pipes
* further refactor to require a model config when creating a pipe
* small fixes
* making cfg in nn_parser more consistent
* fixing nr_class for parser
* fixing nn_parser's nO
* fix printing of loss
* architectures in own file per type, consistent naming
* convenience methods default_tagger_config and default_tok2vec_config
* let create_pipe access default config if available for that component
* default_parser_config
* move defaults to separate folder
* allow reading nlp from package or dir with argument 'name'
* architecture spacy.VocabVectors.v1 to read static vectors from file
* cleanup
* default configs for nel, textcat, morphologizer, tensorizer
* fix imports
* fixing unit tests
* fixes and clean up
* fixing defaults, nO, fix unit tests
* restore parser IO
* fix IO
* 'fix' serialization test
* add *.cfg to manifest
* fix example configs with additional arguments
* replace Morpohologizer with Tagger
* add IO bit when testing overfitting of tagger (currently failing)
* fix IO - don't initialize when reading from disk
* expand overfitting tests to also check IO goes OK
* remove dropout from HashEmbed to fix Tagger performance
* add defaults for sentrec
* update thinc
* always pass a Model instance to a Pipe
* fix piped_added statement
* remove obsolete W029
* remove obsolete errors
* restore byte checking tests (work again)
* clean up test
* further test cleanup
* convert from config to Model in create_pipe
* bring back error when component is not initialized
* cleanup
* remove calls for nlp2.begin_training
* use thinc.api in imports
* allow setting charembed's nM and nC
* fix for hardcoded nM/nC + unit test
* formatting fixes
* trigger build