* Improve token head verification
Improve the verification for valid token heads when heads are set:
* in `Token.head`: heads come from the same document
* in `Doc.from_array()`: head indices are within the bounds of the
document
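
The bounds check described above could look roughly like the following. This is a minimal sketch in plain Python, not spaCy's implementation: the helper name `validate_head_offsets` and the choice to represent heads as offsets relative to each token are assumptions made for illustration.

```python
from typing import Sequence


def validate_head_offsets(head_offsets: Sequence[int]) -> None:
    """Raise ValueError if any head points outside the document.

    Hypothetical helper: each entry is the head's position relative to the
    token itself (0 means the token is its own head).
    """
    n_tokens = len(head_offsets)
    for i, offset in enumerate(head_offsets):
        head_index = i + offset
        if not 0 <= head_index < n_tokens:
            raise ValueError(
                f"Token {i} has head index {head_index}, "
                f"outside the document bounds [0, {n_tokens})"
            )


# Token 0 attaches to token 1, token 1 is the root, token 2 attaches to token 1.
validate_head_offsets([1, 0, -1])
```
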
* Improve error message
Modify flag settings so that `DEP` is not sufficient to set `is_parsed`
and only run `set_children_from_heads()` if `HEAD` is provided.
With this change, the combination `[SENT_START, DEP]` sets the deps without
clobbering the sentence starts with a lot of one-word sentences.
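
The flag logic above can be summarised as a small decision function. This is only a sketch: `plan_from_array` is a hypothetical name, attributes are plain strings rather than spaCy attribute IDs, and treating `HEAD` as the attribute that sets `is_parsed` is an assumption (the message only states that `DEP` alone is not enough).

```python
from typing import Dict, Sequence


def plan_from_array(attrs: Sequence[str]) -> Dict[str, bool]:
    """Decide which side effects loading the given attributes should have."""
    has_head = "HEAD" in attrs
    return {
        # Dependency labels are always stored when DEP is present.
        "set_deps": "DEP" in attrs,
        # Assumption: only HEAD marks the doc as parsed; DEP alone does not.
        "is_parsed": has_head,
        # Children (and derived sentence starts) are recomputed only from heads.
        "set_children_from_heads": has_head,
    }


plan = plan_from_array(["SENT_START", "DEP"])
```

Here `plan["set_children_from_heads"]` is `False`, so explicitly provided sentence starts would be left alone.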
* Add load_from_config function
* Add train_from_config script
* Merge configs and expose via spacy.config
* Fix script
* Suggest create_evaluation_callback
* Hard-code for NER
* Fix errors
* Register command
* Add TODO
* Update train-from-config todos
* Fix imports
* Allow delayed setting of parser model nr_class
* Get train-from-config working
* Tidy up and fix scores and printing
* Hide traceback if cancelled
* Fix weighted score formatting
* Fix score formatting
* Make output_path optional
* Add Tok2Vec component
* Tidy up and add tok2vec_tensors
* Add option to copy docs in nlp.update
* Copy docs in nlp.update
* Adjust nlp.update() for set_annotations
* Don't shuffle pipes in nlp.update, decruft
* Support set_annotations arg in component update
* Support set_annotations in parser update
* Add get_gradients method
* Add get_gradients to parser
* Update errors.py
* Fix problems caused by merge
* Add _link_components method in nlp
* Add concept of 'listeners' and ControlledModel
* Support optional attributes arg in ControlledModel
* Try having tok2vec component in pipeline
* Fix tok2vec component
* Fix config
* Fix tok2vec
* Update for Example
* Update for Example
* Update config
* Add eg2doc util
* Update and add schemas/types
* Update schemas
* Fix nlp.update
* Fix tagger
* Remove hacks from train-from-config
* Remove hard-coded config str
* Calculate loss in tok2vec component
* Tidy up and use function signatures instead of models
* Support union types for registry models
* Minor cleaning in Language.update
* Make ControlledModel specifically Tok2VecListener
* Fix train_from_config
* Fix tok2vec
* Tidy up
* Add function for bilstm tok2vec
* Fix type
* Fix syntax
* Fix pytorch optimizer
* Add example configs
* Update for thinc describe changes
* Update for Thinc changes
* Update for dropout/sgd changes
* Update for dropout/sgd changes
* Unhack gradient update
* Work on refactoring _ml
* Remove _ml.py module
* WIP upgrade cli scripts for thinc
* Move some _ml stuff to util
* Import link_vectors from util
* Update train_from_config
* Import from util
* Import from util
* Temporarily add ml.component_models module
* Move ml methods
* Move typedefs
* Update load vectors
* Update gitignore
* Move imports
* Add PrecomputableAffine
* Fix imports
* Fix imports
* Fix imports
* Fix missing imports
* Update CLI scripts
* Update spacy.language
* Add stubs for building the models
* Update model definition
* Update create_default_optimizer
* Fix import
* Fix comment
* Update imports in tests
* Update imports in spacy.cli
* Fix import
* fix obsolete thinc imports
* update srsly pin
* from thinc to ml_datasets for example data such as imdb
* update ml_datasets pin
* using STATE.vectors
* small fix
* fix Sentencizer.pipe
* black formatting
* rename Affine to Linear as in thinc
* set validate explicitly to True
* rename with_square_sequences to with_list2padded
* rename with_flatten to with_list2array
* chaining layernorm
* small fixes
* revert Optimizer import
* build_nel_encoder with new thinc style
* fixes using model's get and set methods
* Tok2Vec in component models, various fixes
* fix up legacy tok2vec code
* add model initialize calls
* add in build_tagger_model
* small fixes
* setting model dims
* fixes for ParserModel
* various small fixes
* initialize thinc Models
* fixes
* consistent naming of window_size
* fixes, removing set_dropout
* work around Iterable issue
* remove legacy tok2vec
* util fix
* fix forward function of tok2vec listener
* more fixes
* trying to fix PrecomputableAffine (not successful yet)
* alloc instead of allocate
* add morphologizer
* rename residual
* rename fixes
* Fix predict function
* Update parser and parser model
* fixing a few more tests
* Fix precomputable affine
* Update component model
* Update parser model
* Move backprop padding to own function, for test
* Update test
* Fix p. affine
* Update NEL
* build_bow_text_classifier and extract_ngrams
* Fix parser init
* Fix test add label
* add build_simple_cnn_text_classifier
* Fix parser init
* Set gpu off by default in example
* Fix tok2vec listener
* Fix parser model
* Small fixes
* small fix for PyTorchLSTM parameters
* revert my_compounding hack (iterable fixed now)
* fix biLSTM
* Fix uniqued
* PyTorchRNNWrapper fix
* small fixes
* use helper function to calculate cosine loss
* small fixes for build_simple_cnn_text_classifier
* putting dropout default at 0.0 to ensure the layer gets built
* using thinc util's set_dropout_rate
* moving layer normalization inside of maxout definition to optimize dropout
* temp debugging in NEL
* fixed NEL model by using init defaults!
* fixing after set_dropout_rate refactor
* proper fix
* fix test_update_doc after refactoring optimizers in thinc
* Add CharacterEmbed layer
* Construct tagger Model
* Add missing import
* Remove unused stuff
* Work on textcat
* fix test (again :)) after optimizer refactor
* fixes to allow reading Tagger from_disk without overwriting dimensions
* don't build the tok2vec prematurely
* fix CharacterEmbed init
* CharacterEmbed fixes
* Fix CharacterEmbed architecture
* fix imports
* renames from latest thinc update
* one more rename
* add initialize calls where appropriate
* fix parser initialization
* Update Thinc version
* Fix errors, auto-format and tidy up imports
* Fix validation
* fix if bias is cupy array
* revert for now
* ensure it's a numpy array before running bp in ParserStepModel
* no reason to call require_gpu twice
* use CupyOps.to_numpy instead of cupy directly
* fix initialize of ParserModel
* remove unnecessary import
* fixes for CosineDistance
* fix device renaming
* use refactored loss functions (Thinc PR 251)
* overfitting test for tagger
* experimental settings for the tagger: avoid zero-init and subword normalization
* clean up tagger overfitting test
* use previous default value for nP
* remove toy config
* bringing layernorm back (had a bug - fixed in thinc)
* revert setting nP explicitly
* remove setting default in constructor
* restore values as they used to be
* add overfitting test for NER
* add overfitting test for dep parser
* add overfitting test for textcat
* fixing init for linear (previously affine)
* larger eps window for textcat
* ensure doc is not None
* Require newer thinc
* Make float check vaguer
* Slop the textcat overfit test more
* Fix textcat test
* Fix exclusive classes for textcat
* fix after renaming of alloc methods
* fixing renames and mandatory arguments (staticvectors WIP)
* upgrade to thinc==8.0.0.dev3
* refer to vocab.vectors directly instead of its name
* rename alpha to learn_rate
* adding hashembed and staticvectors dropout
* upgrade to thinc 8.0.0.dev4
* add name back to avoid warning W020
* thinc dev4
* update srsly
* using thinc 8.0.0a0!
Co-authored-by: Matthew Honnibal <honnibal+gh@gmail.com>
Co-authored-by: Ines Montani <ines@ines.io>
* expand serialization test for custom token attribute
* add failing test for issue 4849
* define ENT_ID as attr and use in doc serialization
* fix a few typos
* Include Doc.cats in to_bytes()
* Include Doc.cats in DocBin serialization
* Add tests for serialization of cats
Test serialization of cats for Doc and DocBin.
Iterate over lr_edges until all heads are within the current sentence.
Instead of iterating over them for a fixed number of iterations, check
whether the sentence boundaries are correct for the heads and stop when
all are correct. Stop after a maximum of 10 iterations, providing a
warning in this case since the sentence boundaries may not be correct.
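
The stopping rule described above (check consistency, stop early, cap at 10 iterations, warn if still inconsistent) can be sketched as follows. This is a toy illustration in plain Python, not the actual `lr_edges` code: heads are absolute token indices here, and `merge_crossing_sentences` is a hypothetical stand-in for the real update step.

```python
import warnings
from typing import Callable, List

MAX_ITERATIONS = 10  # limit stated in the commit message


def heads_within_sentences(heads: List[int], sent_starts: List[bool]) -> bool:
    """Return True if every token's head lies in the token's own sentence."""
    sent_ids = []
    current = -1
    for is_start in sent_starts:
        if is_start:
            current += 1
        sent_ids.append(current)
    return all(sent_ids[i] == sent_ids[h] for i, h in enumerate(heads))


def merge_crossing_sentences(heads: List[int], sent_starts: List[bool]) -> List[bool]:
    """Hypothetical update step: drop boundaries between a token and its head."""
    new_starts = list(sent_starts)
    for i, h in enumerate(heads):
        for j in range(min(i, h) + 1, max(i, h) + 1):
            new_starts[j] = False
    return new_starts


def update_until_consistent(
    heads: List[int],
    update_step: Callable[[List[int], List[bool]], List[bool]],
    sent_starts: List[bool],
) -> List[bool]:
    """Apply update_step until heads and boundaries agree, at most 10 times."""
    for _ in range(MAX_ITERATIONS):
        if heads_within_sentences(heads, sent_starts):
            return sent_starts
        sent_starts = update_step(heads, sent_starts)
    if not heads_within_sentences(heads, sent_starts):
        warnings.warn(
            f"Sentence boundaries may be incorrect after {MAX_ITERATIONS} iterations"
        )
    return sent_starts


# Token 2 starts a new sentence but attaches to token 1 in the previous one.
fixed = update_until_consistent(
    [1, 1, 1, 2], merge_crossing_sentences, [True, False, True, False]
)
```
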
* raise specific error when removing a matcher rule that doesn't exist
* rephrasing
* goldparse init: allocate fields only if doc is not empty
* avoid zero length alloc in saving tokenizer cache
* avoid allocating zero length mem in matcher
* asserts to avoid allocating zero length mem
* fix zero-length allocation in matcher
* bump cymem version
* revert cymem version bump
* remove duplicate unit test
* unit test (currently failing) for issue 4267
* bugfix: ensure doc.ents preserves kb_id annotations
* fix in setting doc.ents with empty label
* rename
* test for presetting an entity to a certain type
* allow overwriting Outside + blocking presets
* fix actions when previous label needs to be kept
* fix default ent_iob in set entities
* cleaner solution with U- action
* remove debugging print statements
* unit tests with explicit transitions and is_valid testing
* remove U- from move_names explicitly
* remove unit tests with pre-trained models that don't work
* remove (working) unit tests with pre-trained models
* clean up unit tests
* move unit tests
* small fixes
* remove two TODOs from doc.ents comments
* document token ent_kb_id
* document span kb_id
* update pipeline documentation
* prior and context weights as bools instead
* entitylinker api documentation
* drop for both models
* finish entitylinker documentation
* small fixes
* documentation for KB
* candidate documentation
* links to api pages in code
* small fix
* frequency examples as counts for consistency
* consistent documentation about tensors returned by predict
* add entity linking to usage 101
* add entity linking infobox and KB section to 101
* entity-linking in linguistic features
* small typo corrections
* training example and docs for entity_linker
* predefined nlp and kb
* revert back to similarity encodings for simplicity (for now)
* set prior probabilities to 0 when excluded
* code clean up
* bugfix: deleting kb ID from tokens when entities were removed
* refactor train el example to use either model or vocab
* pretrain_kb example for example kb generation
* add to training docs for KB + EL example scripts
* small fixes
* error numbering
* ensure the language of vocab and nlp stay consistent across serialization
* equality with =
* avoid conflict in errors file
* add error 151
* final adjustments to the train scripts - consistency
* update of goldparse documentation
* small corrections
* push commit
* turn kb_creator into CLI script (wip)
* proper parameters for training entity vectors
* wikidata pipeline split up into two executable scripts
* remove context_width
* move wikidata scripts in bin directory, remove old dummy script
* refine KB script with logs and preprocessing options
* small edits
* small improvements to logging of EL CLI script
* failing unit test for issue 3962
* attempt to fix Issue #3962
* create artificial unit test example
* using length instead of self.length
* sp
* reformat with black
* find better ancestor within span and use generic 'dep'
* attach to span.root if there is no appropriate ancestor
* comment span text
* clean up ancestor code
* reconstruct dep tree to keep same number of sentences
Closes #2203. Closes #3268.
Lemmas set from outside the `Morphology` class were being overwritten. The result was especially confusing when deserialising, as it meant some lemmas could change when storing and retrieving a `Doc` object.
This PR applies two fixes:
1) When we go to set the lemma in the `Morphology` class, first check whether a lemma is already set. If so, don't overwrite.
2) When we load with `doc.from_array()`, take care to apply the `TAG` field first. This allows other fields to overwrite the `TAG` implied properties, if they're provided explicitly (e.g. the `LEMMA`).
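
The two fixes can be illustrated with a small sketch. It uses plain dictionaries instead of spaCy's `Morphology` and `Doc` classes, and the helper names and the `tag_lemmas` lookup table are hypothetical, introduced only to show the ordering and the "don't overwrite" rule.

```python
from typing import Dict, List


def set_lemma_from_tag(token: Dict[str, str], implied_lemma: str) -> None:
    """Fix 1: keep a lemma that is already set."""
    if not token.get("LEMMA"):
        token["LEMMA"] = implied_lemma


def apply_attrs(
    token: Dict[str, str],
    attrs: List[str],
    values: List[str],
    tag_lemmas: Dict[str, str],
) -> None:
    """Fix 2: apply TAG first so explicit fields can overwrite what it implies."""
    # Stable sort: TAG (key False) comes first, other attrs keep their order.
    order = sorted(range(len(attrs)), key=lambda i: attrs[i] != "TAG")
    for i in order:
        name, value = attrs[i], values[i]
        token[name] = value
        if name == "TAG" and value in tag_lemmas:
            set_lemma_from_tag(token, tag_lemmas[value])


# Hypothetical table of lemmas implied by a tag.
tag_lemmas = {"VBD": "tag-implied"}

token: Dict[str, str] = {}
apply_attrs(token, ["LEMMA", "TAG"], ["explicit", "VBD"], tag_lemmas)
```

After this call, `token["LEMMA"]` is `"explicit"` even though `LEMMA` was listed before `TAG`.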
## Checklist
- [x] I have submitted the spaCy Contributor Agreement.
- [x] I ran the tests, and all new and existing tests passed.
- [x] My changes don't require a change to the documentation, or if they do, I've added all required information.
* Make serialization methods consistent
Use a single `exclude` keyword argument instead of arbitrarily named keyword arguments, with deprecation handling.
* Update docs and add section on serialization fields
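
As an illustration of the `exclude` convention mentioned above, here is a minimal sketch of a serialization method that takes a list of field names to skip. The class `SerializableRecord` and its fields are hypothetical and only loosely modelled on the idea; they are not spaCy's actual serialization code.

```python
from typing import Any, Dict, Iterable


class SerializableRecord:
    """Toy object with a single, uniform way to skip serialization fields."""

    def __init__(self, text: str, tensor: list, user_data: dict) -> None:
        self.text = text
        self.tensor = tensor
        self.user_data = user_data

    def to_dict(self, exclude: Iterable[str] = ()) -> Dict[str, Any]:
        fields = {
            "text": self.text,
            "tensor": self.tensor,
            "user_data": self.user_data,
        }
        excluded = set(exclude)
        # Every field is skipped the same way: by name, via `exclude`.
        return {name: value for name, value in fields.items() if name not in excluded}


record = SerializableRecord("hello", [0.1, 0.2], {"source": "example"})
without_tensor = record.to_dict(exclude=["tensor"])
```
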