* Tidy up train-from-config a bit
* Fix accidentally quadratic perf in TokenAnnotation.brackets
When we're reading in the gold data, we had a nested loop where
we looped over the brackets for each token, looking for brackets
that start on that word. This is accidentally quadratic, because
we have one bracket per word (for the POS tags). So we had
an O(N**2) behaviour here that ended up being pretty slow.
To solve this I'm indexing the brackets by their starting word
on the TokenAnnotations object, and having a property to provide
the previous view.
* Fixes
* setting KB in the EL constructor, similar to how the model is passed on
* removing wikipedia example files - moved to projects
* throw an error when nlp.update is called with 2 positional arguments
* rewriting the config logic in create pipe to accomodate for other objects (e.g. KB) in the config
* update config files with new parameters
* avoid training pipeline components that don't have a model (like sentencizer)
* various small fixes + UX improvements
* small fixes
* set thinc to 8.0.0a9 everywhere
* remove outdated comment
* make disable_pipes deprecated in favour of the new toggle_pipes
* rewrite disable_pipes statements
* update documentation
* remove bin/wiki_entity_linking folder
* one more fix
* remove deprecated link to documentation
* few more doc fixes
* add note about name change to the docs
* restore original disable_pipes
* small fixes
* fix typo
* fix error number to W096
* rename to select_pipes
* also make changes to the documentation
Co-authored-by: Matthew Honnibal <honnibal+gh@gmail.com>
* Draft layer for BILUO actions
* Fixes to biluo layer
* WIP on BILUO layer
* Add tests for BILUO layer
* Format
* Fix transitions
* Update test
* Link in the simple_ner
* Update BILUO tagger
* Update __init__
* Import simple_ner
* Update test
* Import
* Add files
* Add config
* Fix label passing for BILUO and tagger
* Fix label handling for simple_ner component
* Update simple NER test
* Update config
* Hack train script
* Update BILUO layer
* Fix SimpleNER component
* Update train_from_config
* Add biluo_to_iob helper
* Add IOB layer
* Add IOBTagger model
* Update biluo layer
* Update SimpleNER tagger
* Update BILUO
* Read random seed in train-from-config
* Update use of normal_init
* Fix normalization of gradient in SimpleNER
* Update IOBTagger
* Remove print
* Tweak masking in BILUO
* Add dropout in SimpleNER
* Update thinc
* Tidy up simple_ner
* Fix biluo model
* Unhack train-from-config
* Update setup.cfg and requirements
* Add tb_framework.py for parser model
* Try to avoid memory leak in BILUO
* Move ParserModel into spacy.ml, avoid need for subclass.
* Use updated parser model
* Remove incorrect call to model.initializre in PrecomputableAffine
* Update parser model
* Avoid divide by zero in tagger
* Add extra dropout layer in tagger
* Refine minibatch_by_words function to avoid oom
* Fix parser model after refactor
* Try to avoid div-by-zero in SimpleNER
* Fix infinite loop in minibatch_by_words
* Use SequenceCategoricalCrossentropy in Tagger
* Fix parser model when hidden layer
* Remove extra dropout from tagger
* Add extra nan check in tagger
* Fix thinc version
* Update tests and imports
* Fix test
* Update test
* Update tests
* Fix tests
* Fix test
Co-authored-by: Ines Montani <ines@ines.io>
* Add pos and morph scoring to Scorer
Add pos, morph, and morph_per_type to `Scorer`. Report pos and morph
accuracy in `spacy evaluate`.
* Update morphologizer for v3
* switch to tagger-based morphologizer
* use `spacy.HashCharEmbedCNN` for morphologizer defaults
* add `Doc.is_morphed` flag
* Add morphologizer to train CLI
* Add basic morphologizer pipeline tests
* Add simple morphologizer training example
* Remove subword_features from CharEmbed models
Remove `subword_features` argument from `spacy.HashCharEmbedCNN.v1` and
`spacy.HashCharEmbedBiLSTM.v1` since in these cases `subword_features`
is always `False`.
* Rename setting in morphologizer example
Use `with_pos_tags` instead of `without_pos_tags`.
* Fix kwargs for spacy.HashCharEmbedBiLSTM.v1
* Remove defaults for spacy.HashCharEmbedBiLSTM.v1
Remove default `nM/nC` for `spacy.HashCharEmbedBiLSTM.v1`.
* Set random seed for textcat overfitting test
* Omit per_type scores from model-best calculations
The addition of per_type scores to the included metrics (#4911) causes
errors when they're compared while determining the best model, so omit
them for this `max()` comparison.
* Add default speed data for interrupted train CLI
Add better speed meta defaults so that an interrupted iteration still
produces a best model.
Co-authored-by: Ines Montani <ines@ines.io>
* Update sentence recognizer
* rename `sentrec` to `senter`
* use `spacy.HashEmbedCNN.v1` by default
* update to follow `Tagger` modifications
* remove component methods that can be inherited from `Tagger`
* add simple initialization and overfitting pipeline tests
* Update serialization test for senter
* Fix model-final/model-best meta
* include speed and accuracy from final iteration
* combine with speeds from base model if necessary
* Include token_acc metric for all components
* fix grad_clip naming
* cleaning up pretrained_vectors out of cfg
* further refactoring Model init's
* move Model building out of pipes
* further refactor to require a model config when creating a pipe
* small fixes
* making cfg in nn_parser more consistent
* fixing nr_class for parser
* fixing nn_parser's nO
* fix printing of loss
* architectures in own file per type, consistent naming
* convenience methods default_tagger_config and default_tok2vec_config
* let create_pipe access default config if available for that component
* default_parser_config
* move defaults to separate folder
* allow reading nlp from package or dir with argument 'name'
* architecture spacy.VocabVectors.v1 to read static vectors from file
* cleanup
* default configs for nel, textcat, morphologizer, tensorizer
* fix imports
* fixing unit tests
* fixes and clean up
* fixing defaults, nO, fix unit tests
* restore parser IO
* fix IO
* 'fix' serialization test
* add *.cfg to manifest
* fix example configs with additional arguments
* replace Morpohologizer with Tagger
* add IO bit when testing overfitting of tagger (currently failing)
* fix IO - don't initialize when reading from disk
* expand overfitting tests to also check IO goes OK
* remove dropout from HashEmbed to fix Tagger performance
* add defaults for sentrec
* update thinc
* always pass a Model instance to a Pipe
* fix piped_added statement
* remove obsolete W029
* remove obsolete errors
* restore byte checking tests (work again)
* clean up test
* further test cleanup
* convert from config to Model in create_pipe
* bring back error when component is not initialized
* cleanup
* remove calls for nlp2.begin_training
* use thinc.api in imports
* allow setting charembed's nM and nC
* fix for hardcoded nM/nC + unit test
* formatting fixes
* trigger build
* Add convert CLI option to merge CoNLL-U subtokens
Add `-T` option to convert CLI that merges CoNLL-U subtokens into one
token in the converted data. Each CoNLL-U sentence is read into a `Doc`
and the `Retokenizer` is used to merge subtokens with features as
follows:
* `orth` is the merged token orth (should correspond to raw text and `#
text`)
* `tag` is all subtoken tags concatenated with `_`, e.g. `ADP_DET`
* `pos` is the POS of the syntactic root of the span (as determined by
the Retokenizer)
* `morph` is all morphological features merged
* `lemma` is all subtoken lemmas concatenated with ` `, e.g. `de o`
* with `-m` all morphological features are combined with the tag using
the separator `__`, e.g.
`ADP_DET__Definite=Def|Gender=Masc|Number=Sing|PronType=Art`
* `dep` is the dependency relation for the syntactic root of the span
(as determined by the Retokenizer)
Concatenated tags will be mapped to the UD POS of the syntactic root
(e.g., `ADP`) and the morphological features will be the combined
features.
In many cases, the original UD subtokens can be reconstructed from the
available features given a language-specific lookup table, e.g.,
Portuguese `do / ADP_DET /
Definite=Def|Gender=Masc|Number=Sing|PronType=Art` is `de / ADP`, `o /
DET / Definite=Def|Gender=Masc|Number=Sing|PronType=Art` or lookup rules
for forms containing open class words like Spanish `hablarlo / VERB_PRON
/
Case=Acc|Gender=Masc|Number=Sing|Person=3|PrepCase=Npr|PronType=Prs|VerbForm=Inf`.
* Clean up imports