Not necessary for convergence, but in coref-hoi this seems to add a few
F1 points.
Note that there are two width-related features in coref-hoi. This one is a
"prior" that is added to mention scores. The other width-related feature
is appended to the span embedding representation for other layers to
reference.
This rewrites the loss to not use the Thinc crossentropy code at all.
The main difference here is that the negative predictions are being
masked out (= marginalized over), but negative gradient is still being
reflected.
I'm still not sure this is exactly right but models seem to train
reliably now.
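As a rough illustration of the idea above, here is a minimal NumPy sketch of a loss that marginalizes over the gold candidates while still passing gradient to the negatives. The function name `marginal_log_loss` and the `gold_mask` input are hypothetical, and this is one common formulation for coreference, not necessarily the exact calculation used in this commit. It assumes at least one candidate is marked as gold.

```python
import numpy as np


def marginal_log_loss(scores, gold_mask):
    """Return (loss, d_scores) for the candidate scores of one mention.

    Hypothetical helper: `gold_mask` marks the correct candidates and is
    assumed to contain at least one True entry.
    """
    scores = np.asarray(scores, dtype="float32")
    gold_mask = np.asarray(gold_mask, dtype=bool)
    # Numerically stable softmax over all candidates.
    shifted = scores - scores.max()
    probs = np.exp(shifted)
    probs /= probs.sum()
    # Keep only the probability mass on gold candidates and sum it.
    gold_probs = np.where(gold_mask, probs, 0.0)
    marginal = gold_probs.sum()
    loss = -np.log(marginal)
    # Gradient of the loss with respect to the scores: negatives receive
    # their softmax probability, gold candidates are pushed up.
    d_scores = probs - gold_probs / marginal
    return float(loss), d_scores


loss, d_scores = marginal_log_loss([0.0, 0.0, 0.0, 0.0], [True, True, False, False])
```

With four equal scores and two gold candidates, the gold entries get a negative gradient and the negatives a positive one, and the gradient sums to zero.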
The calculation of this in the coref-hoi code is hard to follow. Based
on comments and variable names it sounds like it's using the doc length,
but it might actually be the number of mentions? Number of mentions
should be much larger and seems more correct, but might want to revisit
this.
I think this was technically incorrect but harmless. The reason the code
here is different than the reference in coref-hoi is that the indices
there are such that they get +1 at the end of processing, while the code
here handles indices directly.
* Draft spancat model
* Add spancat model
* Add test for extract_spans
* Add extract_spans layer
* Upd extract_spans
* Add spancat model
* Add test for spancat model
* Upd spancat model
* Update spancat component
* Upd spancat
* Update spancat model
* Add quick spancat test
* Import SpanCategorizer
* Fix SpanCategorizer component
* Import SpanGroup
* Fix span extraction
* Fix import
* Fix import
* Upd model
* Update spancat models
* Add scoring, update defaults
* Update and add docs
* Fix type
* Update spacy/ml/extract_spans.py
* Auto-format and fix import
* Fix comment
* Fix type
* Fix type
* Update website/docs/api/spancategorizer.md
* Fix comment
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
* Better defense
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
* Fix labels list
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
* Update spacy/ml/extract_spans.py
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
* Update spacy/pipeline/spancat.py
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
* Set annotations during update
* Set annotations in spancat
* fix imports in test
* Update spacy/pipeline/spancat.py
* replace MaxoutLogistic with LinearLogistic
* fix config
* various small fixes
* remove set_annotations parameter in update
* use our beloved tupley format with recent support for doc.spans
* bugfix to allow renaming the default spans_key (scores weren't showing up)
* use different key in docs example
* change defaults to better-working parameters from project (WIP)
* register spacy.extract_spans.v1 for legacy purposes
* Upd dev version so can build wheel
* layers instead of architectures for smaller building blocks
* Update website/docs/api/spancategorizer.md
Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
* Update website/docs/api/spancategorizer.md
Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
* Include additional scores from overrides in combined score weights
* Parameterize spans key in scoring
Parameterize the `SpanCategorizer` `spans_key` for scoring purposes so
that it's possible to evaluate multiple `spancat` components in the same
pipeline.
* Use the (intentionally very short) default spans key `sc` in the
`SpanCategorizer`
* Adjust the default score weights to include the default key
* Adjust the scorer to use `spans_{spans_key}` as the prefix for the
returned score
* Revert addition of `attr_name` argument to `score_spans` and adjust
the key in the `getter` instead.
Note that for `spancat` components with a custom `spans_key`, the score
weights currently need to be modified manually in
`[training.score_weights]` for them to be available during training. To
suppress the default score weights `spans_sc_p/r/f` during training, set
them to `null` in `[training.score_weights]`.
* Update website/docs/api/scorer.md
* Fix scorer for spans key containing underscore
* Increment version
* Add Spans to Evaluate CLI (#8439)
* Add Spans to Evaluate CLI
* Change to spans_key
* Add spans per_type output
Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
* Fix spancat GPU issues (#8455)
* Fix GPU issues
* Require thinc >=8.0.6
* Switch to glorot_uniform_init
* Fix and test ngram suggester
* Include final ngram in doc for all sizes
* Fix ngrams for docs of the same length as ngram size
* Handle batches of docs that result in no ngrams
* Add tests
Co-authored-by: Ines Montani <ines@ines.io>
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
Co-authored-by: svlandeg <sofie.vanlandeghem@gmail.com>
Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
Co-authored-by: Nirant <NirantK@users.noreply.github.com>
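The n-gram suggester fixes listed above (final n-gram included, docs exactly as long as the n-gram size, batches that yield no n-grams) can be sketched in plain Python. The function `ngram_spans` and its inputs are hypothetical and only work on token counts; the real suggester operates on `Doc` objects and returns a ragged array.

```python
def ngram_spans(doc_lengths, sizes):
    """Return (spans, counts) for a batch of docs given as token counts.

    Hypothetical helper: `spans` holds (start, end) token offsets for all
    docs in order, and `counts` holds the number of spans per doc.
    """
    spans = []
    counts = []
    for n_tokens in doc_lengths:
        count = 0
        for size in sizes:
            # The "+ 1" keeps the final n-gram, which ends at the last token.
            # If the doc is shorter than `size`, the range is empty.
            for start in range(0, n_tokens - size + 1):
                spans.append((start, start + size))
                count += 1
        counts.append(count)
    return spans, counts


spans, counts = ngram_spans([3, 2], sizes=[3])
```

Here the first doc is exactly as long as the n-gram size and yields one span, while the second is too short and yields none, so its count is 0 rather than an error.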
The call here was creating a float64 array, which was turning many
downstream scores into float64s. Later on these values were assigned to
a float32 array in backprop, and numerical underflow caused things to go
to zero.
That's almost certainly not the only reason things go to zero, but it is
incorrect.
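The dtype problem described above can be reproduced with NumPy alone. This is a small sketch with made-up variable names, not the code from the commit: an array created without an explicit dtype is float64, it promotes float32 inputs, and very small float64 values become zero when written into a float32 array.

```python
import numpy as np

scores = np.ones(3, dtype="float32")

# Without an explicit dtype, NumPy allocates float64.
offsets_default = np.zeros(3)
offsets_explicit = np.zeros(3, dtype="float32")

# Mixing float32 with float64 promotes the result to float64.
promoted = scores + offsets_default
kept = scores + offsets_explicit

# A value that float64 can represent but float32 cannot.
tiny = np.array([1e-50])
grad = np.zeros(1, dtype="float32")
# Assigning into a float32 array casts the value, which underflows to 0.0.
grad[:] = tiny
```

Passing the dtype explicitly at creation time keeps the downstream arithmetic in float32.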
* implement textcat resizing for TextCatCNN
* resizing textcat in-place
* simplify code
* ensure predictions for old textcat labels remain the same after resizing (WIP)
* fix for softmax
* store softmax as attr
* fix ensemble weight copy and cleanup
* restructure slightly
* adjust documentation, update tests and quickstart templates to use latest versions
* extend unit test slightly
* revert unnecessary edits
* fix typo
* ensemble architecture won't be resizable for now
* use resizable layer (WIP)
* revert using resizable layer
* resizable container while avoiding shape inference trouble
* cleanup
* ensure model continues training after resizing
* use fill_b parameter
* use fill_defaults
* resize_layer callback
* format
* bump thinc to 8.0.4
* bump spacy-legacy to 3.0.6
At a few points in the code it's normal to get a "2d" array where each
row is a single entry. Calling squeeze will make that a proper 1d
array... unless it's just one entry, in which case it turns into a 0d
scalar. That's not what we want; flatten() provides the desired
behavior.
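The difference between the two calls is easy to check with NumPy directly. The array names below are illustrative only:

```python
import numpy as np

many = np.zeros((3, 1), dtype="float32")    # three single-entry rows
single = np.zeros((1, 1), dtype="float32")  # one single-entry row

# squeeze() drops every axis of length 1, so the result depends on the input.
many_squeezed = many.squeeze()      # shape (3,)
single_squeezed = single.squeeze()  # shape (), a 0d array

# flatten() always returns a 1d copy.
many_flat = many.flatten()      # shape (3,)
single_flat = single.flatten()  # shape (1,)
```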
`make_clean_doc` is not needed and was removed.
`logsumexp` may be needed if I misunderstood the loss calculation, so I
left it in for now with a note.
When sentences are not available, just treat the whole doc as one
sentence. This is a reasonable general fallback, and it matters in
particular for the init call, where upstream components aren't run.
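A minimal sketch of that fallback in plain Python, assuming sentence boundaries arrive as an optional list of per-token flags. The function `sentence_spans` and its arguments are hypothetical and stand in for whatever the component reads from the doc:

```python
def sentence_spans(n_tokens, sent_starts=None):
    """Return (start, end) token offsets for each sentence.

    Hypothetical helper: `sent_starts` is a list of per-token flags, or
    None when no sentence boundaries have been set.
    """
    if n_tokens == 0:
        return []
    if sent_starts is None:
        # Fallback: no boundaries available, so the whole doc is one sentence.
        return [(0, n_tokens)]
    # The first token always opens a sentence, whatever its flag says.
    starts = [i for i, flag in enumerate(sent_starts) if flag or i == 0]
    ends = starts[1:] + [n_tokens]
    return list(zip(starts, ends))


whole_doc = sentence_spans(5)
split_doc = sentence_spans(5, [True, False, True, False, False])
```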
This includes the coref code that was being tested separately, modified
to work in spaCy. It hasn't been tested yet and presumably still needs
fixes.
In particular, the evaluation code is currently omitted. It's unclear at
the moment whether we want to use a complex scorer similar to the
official one, or a simpler scorer using more modern evaluation methods.
* Replace negative rows with 0 in StaticVectors
Replace negative row indices with 0-vectors in `StaticVectors`.
* Increase versions related to StaticVectors
* Increase versions of all architectures and layers related to
`StaticVectors`
* Improve efficiency of 0-vector operations
Parallel `spacy-legacy` PR: https://github.com/explosion/spacy-legacy/pull/5
* Update config defaults to new versions
* Update docs
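The `StaticVectors` change above (negative row indices map to 0-vectors) can be illustrated with NumPy. This is a sketch under assumptions, with a hypothetical `lookup_rows` helper, and not the layer's actual implementation. The explicit masking is needed because a plain lookup with a negative index would wrap around to the end of the table.

```python
import numpy as np


def lookup_rows(table, rows):
    """Return one vector per row index, using 0-vectors for negative rows.

    Hypothetical helper, not the StaticVectors code itself.
    """
    rows = np.asarray(rows, dtype="int64")
    missing = rows < 0
    # Point missing entries at row 0 so the lookup itself stays valid.
    safe_rows = np.where(missing, 0, rows)
    # Indexing with an integer array returns a copy, so the table is untouched.
    out = table[safe_rows]
    # Overwrite the placeholder lookups with zeros.
    out[missing] = 0
    return out


table = np.arange(6, dtype="float32").reshape(3, 2)
vectors = lookup_rows(table, [2, -1, 0])
```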
* initialize NLP with train corpus
* add more pretraining tests
* more tests
* function to fetch tok2vec layer for pretraining
* clarify parameter name
* test different objectives
* formatting
* fix check for static vectors when using vectors objective
* clarify docs
* logger statement
* fix init_tok2vec and proc.initialize order
* test training after pretraining
* add init_config tests for pretraining
* pop pretraining block to avoid config validation errors
* custom errors
* initial coref_er pipe
* matcher more flexible
* base coref component without actual model
* initial setup of coref_er.score
* rename to include_label
* preliminary score_clusters method
* apply scoring in coref component
* IO fix
* return None loss for now
* rename to CoreferenceResolver
* some preliminary unit tests
* use registry as callable
* fix TorchBiLSTMEncoder documentation
* ensure the types of the encoding Tok2vec layers are correct
* update references from v1 to v2 for the new architectures
* add convenience method to determine tok2vec width in a model
* fix transformer tok2vec dimensions in TextCatEnsemble architecture
* init function should not be nested to avoid pickle issues
* define new architectures for the pretraining objective
* add loss function as attr of the model
* cleanup
* cleanup
* shorten name
* fix typo
* remove unused error
* small fix in example imports
* throw error when train_corpus or dev_corpus is not a string
* small fix in custom logger example
* limit macro_auc to labels with 2 annotations
* fix typo
* also create parents of output_dir if need be
* update documentation of textcat scores
* refactor TextCatEnsemble
* fix tests for new AUC definition
* bump to 3.0.0a42
* update docs
* rename to spacy.TextCatEnsemble.v2
* spacy.TextCatEnsemble.v1 in legacy
* cleanup
* small fix
* update to 3.0.0rc2
* fix import that got lost in merge
* cursed IDE
* fix two typos
Update arguments to MultiHashEmbed layer so that the attributes can be
controlled. A kind of tricky scheme is used to allow optional
specification of the rows. I think it's an okay balance between
flexibility and convenience.
* ensure Language passes on valid examples for initialization
* fix tagger model initialization
* check for valid get_examples across components
* assume labels were added before begin_training
* fix senter initialization
* fix morphologizer initialization
* use methods to check arguments
* test textcat init, requires thinc>=8.0.0a31
* fix tok2vec init
* fix entity linker init
* use islice
* fix simple NER
* cleanup debug model
* fix assert statements
* fix tests
* throw error when adding a label if the output layer can't be resized anymore
* fix test
* add failing test for simple_ner
* UX improvements
* morphologizer UX
* assume begin_training gets a representative set and processes the labels
* remove assumptions for output of untrained NER model
* restore test for original purpose
* candidate generator as separate part of EL config
* update comment
* ent instead of str as input for candidate generation
* Span instead of str: correct type indication
* fix types
* unit test to create new candidate generator
* fix replace_pipe argument passing
* move error message, general cleanup
* add vocab back to KB constructor
* provide KB as callable from Vocab arg
* rename to kb_loader, fix KB serialization as part of the EL pipe
* fix typo
* reformatting
* cleanup
* fix comment
* fix wrongly duplicated code from merge conflict
* rename dump to to_disk
* from_disk instead of load_bulk
* update test after recent removal of set_morphology in tagger
* remove old doc
* Update with WIP
* Update with WIP
* Update with pipeline serialization
* Update types and pipe factories
* Add deep merge, tidy up and add tests
* Fix pipe creation from config
* Don't validate default configs on load
* Update spacy/language.py
Co-authored-by: Ines Montani <ines@ines.io>
* Adjust factory/component meta error
* Clean up factory args and remove defaults
* Add test for failing empty dict defaults
* Update pipeline handling and methods
* provide KB as registry function instead of as object
* small change in test to make functionality more clear
* update example script for EL configuration
* Fix typo
* Simplify test
* Simplify test
* splitting pipes.pyx into separate files
* moving default configs to each component file
* fix batch_size type
* removing default values from component constructors where possible (TODO: test 4725)
* skip instead of xfail
* Add test for config -> nlp with multiple instances
* pipeline.pipes -> pipeline.pipe
* Tidy up, document, remove kwargs
* small cleanup/generalization for Tok2VecListener
* use DEFAULT_UPSTREAM field
* revert to avoid circular imports
* Fix tests
* Replace deprecated arg
* Make model dirs require config
* fix pickling of keyword-only arguments in constructor
* WIP: clean up and integrate full config
* Add helper to handle function args more reliably
Now also includes keyword-only args
* Fix config composition and serialization
* Improve config debugging and add visual diff
* Remove unused defaults and fix type
* Remove pipeline and factories from meta
* Update spacy/default_config.cfg
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
* Update spacy/default_config.cfg
* small UX edits
* avoid printing stack trace for debug CLI commands
* Add support for language-specific factories
* specify the section of the config which holds the model to debug
* WIP: add Language.from_config
* Update with language data refactor WIP
* Auto-format
* Add backwards-compat handling for Language.factories
* Update morphologizer.pyx
* Fix morphologizer
* Update and simplify lemmatizers
* Fix Japanese tests
* Port over tagger changes
* Fix Chinese and tests
* Update to latest Thinc
* WIP: xfail first Russian lemmatizer test
* Fix component-specific overrides
* fix nO for output layers in debug_model
* Fix default value
* Fix tests and don't pass objects in config
* Fix deep merging
* Fix lemma lookup data registry
Only load the lookups if an entry is available in the registry (and if spacy-lookups-data is installed)
* Add types
* Add Vocab.from_config
* Fix typo
* Fix tests
* Make config copying more elegant
* Fix pipe analysis
* Fix lemmatizers and is_base_form
* WIP: move language defaults to config
* Fix morphology type
* Fix vocab
* Remove comment
* Update to latest Thinc
* Add morph rules to config
* Tidy up
* Remove set_morphology option from tagger factory
* Hack use_gpu
* Move [pipeline] to top-level block and make [nlp.pipeline] list
Allows separating component blocks from component order – otherwise, ordering the config would mean a changed component order, which is bad. Also allows initial config to define more components and not use all of them
* Fix use_gpu and resume in CLI
* Auto-format
* Remove resume from config
* Fix formatting and error
* [pipeline] -> [components]
* Fix types
* Fix tagger test: requires set_morphology?
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
Co-authored-by: svlandeg <sofie.vanlandeghem@gmail.com>
Co-authored-by: Matthew Honnibal <honnibal+gh@gmail.com>
* Add initial reproducibility tests
* failing test for default_text_classifier (WIP)
* track trouble to underlying tok2vec layer
* add regression test for Issue 5551
* tests go green with https://github.com/explosion/thinc/pull/359
* update test
* adding fixed seeds to HashEmbed layers, seems to fix the reproducibility issue
Co-authored-by: Matthew Honnibal <honnibal+gh@gmail.com>
* Update errors
* Remove beam for now (maybe)
Remove beam_utils
Update setup.py
Remove beam
* Remove GoldParse
WIP on removing goldparse
Get ArcEager compiling after GoldParse excise
Update setup.py
Get spacy.syntax compiling after removing GoldParse
Rename NewExample -> Example and clean up
Clean html files
Start updating tests
Update Morphologizer
* fix error numbers
* fix merge conflict
* informative error when calling to_array with wrong field
* fix error catching
* fixing language and scoring tests
* start testing get_aligned
* additional tests for new get_aligned function
* Draft create_gold_state for arc_eager oracle
* Fix import
* Fix import
* Remove TokenAnnotation code from nonproj
* fixing NER one-to-many alignment
* Fix many-to-one IOB codes
* fix test for misaligned
* attempt to fix cases with weird spaces
* fix spaces
* test_gold_biluo_different_tokenization works
* allow None as BILUO annotation
* fixed some tests + WIP roundtrip unit test
* add spaces to json output format
* minibatch utility can deal with strings, docs or examples
* fix augment (needs further testing)
* various fixes in scripts - needs to be further tested
* fix test_cli
* cleanup
* correct silly typo
* add support for MORPH in to/from_array, fix morphologizer overfitting test
* fix tagger
* fix entity linker
* ensure test keeps working with non-linked entities
* pipe() takes docs, not examples
* small bug fix
* textcat bugfix
* throw informative error when running the components with the wrong type of objects
* fix parser tests to work with example (most still failing)
* fix BiluoPushDown parsing entities
* small fixes
* bugfix tok2vec
* fix renames and simple_ner labels
* various small fixes
* prevent writing dummy values like deps because that could interfere with sent_start values
* fix the fix
* implement split_sent with aligned SENT_START attribute
* test for split sentences with various alignment issues, works
* Return ArcEagerGoldParse from ArcEager
* Update parser and NER gold stuff
* Draft new GoldCorpus class
* add links to to_dict
* clean up
* fix test checking for variants
* Fix oracles
* Start updating converters
* Move converters under spacy.gold
* Move things around
* Fix naming
* Fix name
* Update converter to produce DocBin
* Update converters
* Allow DocBin to take list of Doc objects.
* Make spacy convert output docbin
* Fix import
* Fix docbin
* Fix compile in ArcEager
* Fix import
* Serialize all attrs by default
* Update converter
* Remove jsonl converter
* Add json2docs converter
* Draft Corpus class for DocBin
* Work on train script
* Update Corpus
* Update DocBin
* Allocate Doc before starting to add words
* Make doc.from_array several times faster
* Update train.py
* Fix Corpus
* Fix parser model
* Start debugging arc_eager oracle
* Update header
* Fix parser declaration
* Xfail some tests
* Skip tests that cause crashes
* Skip test causing segfault
* Remove GoldCorpus
* Update imports
* Update after removing GoldCorpus
* Fix module name of corpus
* Fix import
* Work on parser oracle
* Update arc_eager oracle
* Restore ArcEager.get_cost function
* Update transition system
* Update test_arc_eager_oracle
* Remove beam test
* Update test
* Unskip
* Unskip tests
* add links to to_dict
* clean up
* fix test checking for variants
* Allow DocBin to take list of Doc objects.
* Fix compile in ArcEager
* Serialize all attrs by default
Move converters under spacy.gold
Move things around
Fix naming
Fix name
Update converter to produce DocBin
Update converters
Make spacy convert output docbin
Fix import
Fix docbin
Fix import
Update converter
Remove jsonl converter
Add json2docs converter
* Allocate Doc before starting to add words
* Make doc.from_array several times faster
* Start updating converters
* Work on train script
* Draft Corpus class for DocBin
Update Corpus
Fix Corpus
* Update DocBin
Add missing strings when serializing
* Update train.py
* Fix parser model
* Start debugging arc_eager oracle
* Update header
* Fix parser declaration
* Xfail some tests
Skip tests that cause crashes
Skip test causing segfault
* Remove GoldCorpus
Update imports
Update after removing GoldCorpus
Fix module name of corpus
Fix import
* Work on parser oracle
Update arc_eager oracle
Restore ArcEager.get_cost function
Update transition system
* Update tests
Remove beam test
Update test
Unskip
Unskip tests
* Add get_aligned_parse method in Example
Fix Example.get_aligned_parse
* Add kwargs to Corpus.dev_dataset to match train_dataset
* Update nonproj
* Use get_aligned_parse in ArcEager
* Add another arc-eager oracle test
* Remove Example.doc property
Remove Example.doc
* Update ArcEager oracle
Fix Break oracle
* Debugging
* Fix Corpus
* Fix eg.doc
* Format
* small fixes
* limit arg for Corpus
* fix test_roundtrip_docs_to_docbin
* fix test_make_orth_variants
* fix add_label test
* Update tests
* avoid writing temp dir in json2docs, fixing 4402 test
* Update test
* Add missing costs to NER oracle
* Update test
* Work on Example.get_aligned_ner method
* Clean up debugging
* Xfail tests
* Remove prints
* Remove print
* Xfail some tests
* Replace unseen labels for parser
* Update test
* Update test
* Xfail test
* Fix Corpus
* fix imports
* fix docs_to_json
* various small fixes
* cleanup
* Support gold_preproc in Corpus
* Support gold_preproc
* Pass gold_preproc setting into corpus
* Remove debugging
* Fix gold_preproc
* Fix json2docs converter
* Fix convert command
* Fix flake8
* Fix import
* fix output_dir (converted to Path by typer)
* fix var
* bugfix: update states after creating golds to avoid out of bounds indexing
* Improve efficiency of ArcEager oracle
* pull merge_sent into iob2docs to avoid Doc creation for each line
* fix asserts
* bugfix excl Span.end in iob2docs
* Support max_length in Corpus
* Fix arc_eager oracle
* Filter out unannotated sentences in NER
* Remove debugging in parser
* Simplify NER alignment
* Fix conversion of NER data
* Fix NER init_gold_batch
* Tweak efficiency of precomputable affine
* Update onto-json default
* Update gold test for NER
* Fix parser test
* Update test
* Add NER data test
* Fix convert for single file
* Fix test
* Hack scorer to avoid evaluating non-nered data
* Fix handling of NER data in Example
* Output unlabelled spans from O biluo tags in iob_utils
* Fix unset variable
* Return kept examples from init_gold_batch
* Return examples from init_gold_batch
* Don't return Example from init_gold_batch
* Set spaces on gold doc after conversion
* Add test
* Fix spaces reading
* Improve NER alignment
* Improve handling of missing values in NER
* Restore the 'cutting' in parser training
* Add assertion
* Print epochs
* Restore random cuts in parser/ner training
* Implement Doc.copy
* Implement Example.copy
* Copy examples at the start of Language.update
* Don't unset example docs
* Tweak parser model slightly
* attempt to fix _guess_spaces
* _add_entities_to_doc first, so that links don't get overwritten
* fixing get_aligned_ner for one-to-many
* fix indexing into x_text
* small fix biluo_tags_from_offsets
* Add onto-ner config
* Simplify NER alignment
* Fix NER scoring for partially annotated documents
* fix indexing into x_text
* fix test_cli failing tests by ignoring spans in doc.ents with empty label
* Fix limit
* Improve NER alignment
* Fix count_train
* Remove print statement
* fix tests, we're not having nothing but None
* fix clumsy fingers
* Fix tests
* Fix doc.ents
* Remove empty docs in Corpus and improve limit
* Update config
Co-authored-by: svlandeg <sofie.vanlandeghem@gmail.com>
* verbose and tag_map options
* adding init_tok2vec option and only changing the tok2vec that is specified
* adding omit_extra_lookups and verifying textcat config
* wip
* pretrain bugfix
* add replace and resume options
* train_textcat fix
* raw text functionality
* improve UX when KeyError or when input data can't be parsed
* avoid unnecessary access to goldparse in TextCat pipe
* save performance information in nlp.meta
* add noise_level to config
* move nn_parser's defaults to config file
* multitask in config - doesn't work yet
* scorer offering both F and AUC options, need to be specified in config
* add textcat verification code from old train script
* small fixes to config files
* clean up
* set default config for ner/parser to allow create_pipe to work as before
* two more test fixes
* small fixes
* cleanup
* fix NER pickling + additional unit test
* create_pipe as before
* setting KB in the EL constructor, similar to how the model is passed on
* removing wikipedia example files - moved to projects
* throw an error when nlp.update is called with 2 positional arguments
* rewriting the config logic in create pipe to accommodate other objects (e.g. KB) in the config
* update config files with new parameters
* avoid training pipeline components that don't have a model (like sentencizer)
* various small fixes + UX improvements
* small fixes
* set thinc to 8.0.0a9 everywhere
* remove outdated comment