* Fix warning message for lemmatization tables
* Add a warning when the `lexeme_norm` table is empty. (Given the
relatively lang-specific loading for `Lookups`, it seemed like too much
overhead to dynamically extract the list of languages, so for now it's
hard-coded.)
* added setting for neighbour sentence in NEL
* added spaCy contributor agreement
* added multi sentence also for training
* made the try-except block smaller
* verbose and tag_map options
* adding init_tok2vec option and only changing the tok2vec that is specified
* adding omit_extra_lookups and verifying textcat config
* wip
* pretrain bugfix
* add replace and resume options
* train_textcat fix
* raw text functionality
* improve UX when KeyError or when input data can't be parsed
* avoid unnecessary access to goldparse in TextCat pipe
* save performance information in nlp.meta
* add noise_level to config
* move nn_parser's defaults to config file
* multitask in config - doesn't work yet
* scorer offering both F and AUC options, need to be specified in config
* add textcat verification code from old train script
* small fixes to config files
* clean up
* set default config for ner/parser to allow create_pipe to work as before
* two more test fixes
* small fixes
* cleanup
* fix NER pickling + additional unit test
* create_pipe as before
* setting KB in the EL constructor, similar to how the model is passed on
* removing wikipedia example files - moved to projects
* throw an error when nlp.update is called with 2 positional arguments
* rewriting the config logic in create pipe to accomodate for other objects (e.g. KB) in the config
* update config files with new parameters
* avoid training pipeline components that don't have a model (like sentencizer)
* various small fixes + UX improvements
* small fixes
* set thinc to 8.0.0a9 everywhere
* remove outdated comment
* make disable_pipes deprecated in favour of the new toggle_pipes
* rewrite disable_pipes statements
* update documentation
* remove bin/wiki_entity_linking folder
* one more fix
* remove deprecated link to documentation
* few more doc fixes
* add note about name change to the docs
* restore original disable_pipes
* small fixes
* fix typo
* fix error number to W096
* rename to select_pipes
* also make changes to the documentation
Co-authored-by: Matthew Honnibal <honnibal+gh@gmail.com>
* Draft layer for BILUO actions
* Fixes to biluo layer
* WIP on BILUO layer
* Add tests for BILUO layer
* Format
* Fix transitions
* Update test
* Link in the simple_ner
* Update BILUO tagger
* Update __init__
* Import simple_ner
* Update test
* Import
* Add files
* Add config
* Fix label passing for BILUO and tagger
* Fix label handling for simple_ner component
* Update simple NER test
* Update config
* Hack train script
* Update BILUO layer
* Fix SimpleNER component
* Update train_from_config
* Add biluo_to_iob helper
* Add IOB layer
* Add IOBTagger model
* Update biluo layer
* Update SimpleNER tagger
* Update BILUO
* Read random seed in train-from-config
* Update use of normal_init
* Fix normalization of gradient in SimpleNER
* Update IOBTagger
* Remove print
* Tweak masking in BILUO
* Add dropout in SimpleNER
* Update thinc
* Tidy up simple_ner
* Fix biluo model
* Unhack train-from-config
* Update setup.cfg and requirements
* Add tb_framework.py for parser model
* Try to avoid memory leak in BILUO
* Move ParserModel into spacy.ml, avoid need for subclass.
* Use updated parser model
* Remove incorrect call to model.initializre in PrecomputableAffine
* Update parser model
* Avoid divide by zero in tagger
* Add extra dropout layer in tagger
* Refine minibatch_by_words function to avoid oom
* Fix parser model after refactor
* Try to avoid div-by-zero in SimpleNER
* Fix infinite loop in minibatch_by_words
* Use SequenceCategoricalCrossentropy in Tagger
* Fix parser model when hidden layer
* Remove extra dropout from tagger
* Add extra nan check in tagger
* Fix thinc version
* Update tests and imports
* Fix test
* Update test
* Update tests
* Fix tests
* Fix test
Co-authored-by: Ines Montani <ines@ines.io>
Previously, pipelines with shared tok2vec weights would call the
tok2vec backprop callback multiple times, once for each pipeline
component. This caused errors for PyTorch, and was inefficient.
Instead, accumulate the gradient for all but one component, and just
call the callback once.
* Add pos and morph scoring to Scorer
Add pos, morph, and morph_per_type to `Scorer`. Report pos and morph
accuracy in `spacy evaluate`.
* Update morphologizer for v3
* switch to tagger-based morphologizer
* use `spacy.HashCharEmbedCNN` for morphologizer defaults
* add `Doc.is_morphed` flag
* Add morphologizer to train CLI
* Add basic morphologizer pipeline tests
* Add simple morphologizer training example
* Remove subword_features from CharEmbed models
Remove `subword_features` argument from `spacy.HashCharEmbedCNN.v1` and
`spacy.HashCharEmbedBiLSTM.v1` since in these cases `subword_features`
is always `False`.
* Rename setting in morphologizer example
Use `with_pos_tags` instead of `without_pos_tags`.
* Fix kwargs for spacy.HashCharEmbedBiLSTM.v1
* Remove defaults for spacy.HashCharEmbedBiLSTM.v1
Remove default `nM/nC` for `spacy.HashCharEmbedBiLSTM.v1`.
* Set random seed for textcat overfitting test
* bring back default build_text_classifier method
* remove _set_dims_ hack in favor of proper dim inference
* add tok2vec initialize to unit test
* small fixes
* add unit test for various textcat config settings
* logistic output layer does not have nO
* fix window_size setting
* proper fix
* fix W initialization
* Update textcat training example
* Use ml_datasets
* Convert training data to `Example` format
* Use `n_texts` to set proportionate dev size
* fix _init renaming on latest thinc
* avoid setting a non-existing dim
* update to thinc==8.0.0a2
* add BOW and CNN defaults for easy testing
* various experiments with train_textcat script, fix softmax activation in textcat bow
* allow textcat train script to work on other datasets as well
* have dataset as a parameter
* train textcat from config, with example config
* add config for training textcat
* formatting
* fix exclusive_classes
* fixing BOW for GPU
* bump thinc to 8.0.0a3 (not published yet so CI will fail)
* add in link_vectors_to_models which got deleted
Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
* avoid changing original config
* fix elif structure, batch with just int crashes otherwise
* tok2vec example with doc2feats, encode and embed architectures
* further clean up MultiHashEmbed
* further generalize Tok2Vec to work with extract-embed-encode parts
* avoid initializing the charembed layer with Docs (for now ?)
* small fixes for bilstm config (still does not run)
* rename to core layer
* move new configs
* walk model to set nI instead of using core ref
* fix senter overfitting test to be more similar to the training data (avoid flakey behaviour)
* Update sentence recognizer
* rename `sentrec` to `senter`
* use `spacy.HashEmbedCNN.v1` by default
* update to follow `Tagger` modifications
* remove component methods that can be inherited from `Tagger`
* add simple initialization and overfitting pipeline tests
* Update serialization test for senter
* fix grad_clip naming
* cleaning up pretrained_vectors out of cfg
* further refactoring Model init's
* move Model building out of pipes
* further refactor to require a model config when creating a pipe
* small fixes
* making cfg in nn_parser more consistent
* fixing nr_class for parser
* fixing nn_parser's nO
* fix printing of loss
* architectures in own file per type, consistent naming
* convenience methods default_tagger_config and default_tok2vec_config
* let create_pipe access default config if available for that component
* default_parser_config
* move defaults to separate folder
* allow reading nlp from package or dir with argument 'name'
* architecture spacy.VocabVectors.v1 to read static vectors from file
* cleanup
* default configs for nel, textcat, morphologizer, tensorizer
* fix imports
* fixing unit tests
* fixes and clean up
* fixing defaults, nO, fix unit tests
* restore parser IO
* fix IO
* 'fix' serialization test
* add *.cfg to manifest
* fix example configs with additional arguments
* replace Morpohologizer with Tagger
* add IO bit when testing overfitting of tagger (currently failing)
* fix IO - don't initialize when reading from disk
* expand overfitting tests to also check IO goes OK
* remove dropout from HashEmbed to fix Tagger performance
* add defaults for sentrec
* update thinc
* always pass a Model instance to a Pipe
* fix piped_added statement
* remove obsolete W029
* remove obsolete errors
* restore byte checking tests (work again)
* clean up test
* further test cleanup
* convert from config to Model in create_pipe
* bring back error when component is not initialized
* cleanup
* remove calls for nlp2.begin_training
* use thinc.api in imports
* allow setting charembed's nM and nC
* fix for hardcoded nM/nC + unit test
* formatting fixes
* trigger build
* Fix ent_ids and labels properties when id attribute used in patterns
* use set for labels
* sort end_ids for comparison in entity_ruler tests
* fixing entity_ruler ent_ids test
* add to set
* Run make_doc optimistically if using phrase matcher patterns.
* remove unused coveragerc I was testing with
* format
* Refactor EntityRuler.add_patterns to use nlp.pipe for phrase patterns. Improves speed substantially.
* Removing old add_patterns function
* Fixing spacing
* Make sure token_patterns loaded as well, before generator was being emptied in from_disk
* label in span not writable anymore
* Revert "label in span not writable anymore"
This reverts commit ab442338c8.
* fixing yield - remove redundant list
* Add load_from_config function
* Add train_from_config script
* Merge configs and expose via spacy.config
* Fix script
* Suggest create_evaluation_callback
* Hard-code for NER
* Fix errors
* Register command
* Add TODO
* Update train-from-config todos
* Fix imports
* Allow delayed setting of parser model nr_class
* Get train-from-config working
* Tidy up and fix scores and printing
* Hide traceback if cancelled
* Fix weighted score formatting
* Fix score formatting
* Make output_path optional
* Add Tok2Vec component
* Tidy up and add tok2vec_tensors
* Add option to copy docs in nlp.update
* Copy docs in nlp.update
* Adjust nlp.update() for set_annotations
* Don't shuffle pipes in nlp.update, decruft
* Support set_annotations arg in component update
* Support set_annotations in parser update
* Add get_gradients method
* Add get_gradients to parser
* Update errors.py
* Fix problems caused by merge
* Add _link_components method in nlp
* Add concept of 'listeners' and ControlledModel
* Support optional attributes arg in ControlledModel
* Try having tok2vec component in pipeline
* Fix tok2vec component
* Fix config
* Fix tok2vec
* Update for Example
* Update for Example
* Update config
* Add eg2doc util
* Update and add schemas/types
* Update schemas
* Fix nlp.update
* Fix tagger
* Remove hacks from train-from-config
* Remove hard-coded config str
* Calculate loss in tok2vec component
* Tidy up and use function signatures instead of models
* Support union types for registry models
* Minor cleaning in Language.update
* Make ControlledModel specifically Tok2VecListener
* Fix train_from_config
* Fix tok2vec
* Tidy up
* Add function for bilstm tok2vec
* Fix type
* Fix syntax
* Fix pytorch optimizer
* Add example configs
* Update for thinc describe changes
* Update for Thinc changes
* Update for dropout/sgd changes
* Update for dropout/sgd changes
* Unhack gradient update
* Work on refactoring _ml
* Remove _ml.py module
* WIP upgrade cli scripts for thinc
* Move some _ml stuff to util
* Import link_vectors from util
* Update train_from_config
* Import from util
* Import from util
* Temporarily add ml.component_models module
* Move ml methods
* Move typedefs
* Update load vectors
* Update gitignore
* Move imports
* Add PrecomputableAffine
* Fix imports
* Fix imports
* Fix imports
* Fix missing imports
* Update CLI scripts
* Update spacy.language
* Add stubs for building the models
* Update model definition
* Update create_default_optimizer
* Fix import
* Fix comment
* Update imports in tests
* Update imports in spacy.cli
* Fix import
* fix obsolete thinc imports
* update srsly pin
* from thinc to ml_datasets for example data such as imdb
* update ml_datasets pin
* using STATE.vectors
* small fix
* fix Sentencizer.pipe
* black formatting
* rename Affine to Linear as in thinc
* set validate explicitely to True
* rename with_square_sequences to with_list2padded
* rename with_flatten to with_list2array
* chaining layernorm
* small fixes
* revert Optimizer import
* build_nel_encoder with new thinc style
* fixes using model's get and set methods
* Tok2Vec in component models, various fixes
* fix up legacy tok2vec code
* add model initialize calls
* add in build_tagger_model
* small fixes
* setting model dims
* fixes for ParserModel
* various small fixes
* initialize thinc Models
* fixes
* consistent naming of window_size
* fixes, removing set_dropout
* work around Iterable issue
* remove legacy tok2vec
* util fix
* fix forward function of tok2vec listener
* more fixes
* trying to fix PrecomputableAffine (not succesful yet)
* alloc instead of allocate
* add morphologizer
* rename residual
* rename fixes
* Fix predict function
* Update parser and parser model
* fixing few more tests
* Fix precomputable affine
* Update component model
* Update parser model
* Move backprop padding to own function, for test
* Update test
* Fix p. affine
* Update NEL
* build_bow_text_classifier and extract_ngrams
* Fix parser init
* Fix test add label
* add build_simple_cnn_text_classifier
* Fix parser init
* Set gpu off by default in example
* Fix tok2vec listener
* Fix parser model
* Small fixes
* small fix for PyTorchLSTM parameters
* revert my_compounding hack (iterable fixed now)
* fix biLSTM
* Fix uniqued
* PyTorchRNNWrapper fix
* small fixes
* use helper function to calculate cosine loss
* small fixes for build_simple_cnn_text_classifier
* putting dropout default at 0.0 to ensure the layer gets built
* using thinc util's set_dropout_rate
* moving layer normalization inside of maxout definition to optimize dropout
* temp debugging in NEL
* fixed NEL model by using init defaults !
* fixing after set_dropout_rate refactor
* proper fix
* fix test_update_doc after refactoring optimizers in thinc
* Add CharacterEmbed layer
* Construct tagger Model
* Add missing import
* Remove unused stuff
* Work on textcat
* fix test (again :)) after optimizer refactor
* fixes to allow reading Tagger from_disk without overwriting dimensions
* don't build the tok2vec prematuraly
* fix CharachterEmbed init
* CharacterEmbed fixes
* Fix CharacterEmbed architecture
* fix imports
* renames from latest thinc update
* one more rename
* add initialize calls where appropriate
* fix parser initialization
* Update Thinc version
* Fix errors, auto-format and tidy up imports
* Fix validation
* fix if bias is cupy array
* revert for now
* ensure it's a numpy array before running bp in ParserStepModel
* no reason to call require_gpu twice
* use CupyOps.to_numpy instead of cupy directly
* fix initialize of ParserModel
* remove unnecessary import
* fixes for CosineDistance
* fix device renaming
* use refactored loss functions (Thinc PR 251)
* overfitting test for tagger
* experimental settings for the tagger: avoid zero-init and subword normalization
* clean up tagger overfitting test
* use previous default value for nP
* remove toy config
* bringing layernorm back (had a bug - fixed in thinc)
* revert setting nP explicitly
* remove setting default in constructor
* restore values as they used to be
* add overfitting test for NER
* add overfitting test for dep parser
* add overfitting test for textcat
* fixing init for linear (previously affine)
* larger eps window for textcat
* ensure doc is not None
* Require newer thinc
* Make float check vaguer
* Slop the textcat overfit test more
* Fix textcat test
* Fix exclusive classes for textcat
* fix after renaming of alloc methods
* fixing renames and mandatory arguments (staticvectors WIP)
* upgrade to thinc==8.0.0.dev3
* refer to vocab.vectors directly instead of its name
* rename alpha to learn_rate
* adding hashembed and staticvectors dropout
* upgrade to thinc 8.0.0.dev4
* add name back to avoid warning W020
* thinc dev4
* update srsly
* using thinc 8.0.0a0 !
Co-authored-by: Matthew Honnibal <honnibal+gh@gmail.com>
Co-authored-by: Ines Montani <ines@ines.io>