When sourcing a component, the object from the original pipeline is added to the new pipeline as the very same object, which means several attributes cannot be in sync for this one object in both the original and the new pipeline at the same time:
* component.name
* component.listener_map / component.listening_components for tok2vec and transformer
When replace_listeners is run on a component, the config is not updated correctly if the component's state is incorrect for the current pipeline (in particular, changes that should be applied from model.attrs["replace_listener_cfg"], as used in spacy-transformers), because:
* find_listeners relies on component.name to set the name in the listener_map
* replace_listeners relies on listener_map to determine how to modify the configs
In addition, there are several places where pipeline components are modified without the listener map and/or internal component names being updated.
In cases where a component shared by two pipelines cannot be in sync with both, this PR chooses to prioritize the most recently modified or initialized pipeline. With the current source behavior there is no solution that keeps both pipelines usable, so the current pipeline is updated whenever components are added, renamed, or removed, or when the pipeline is initialized for training.
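
As a minimal sketch of the shared-object behavior described above (blank pipelines are used purely for illustration):

```python
import spacy

# Build a source pipeline with a tagger, then source that tagger into a
# second pipeline under a different name.
source_nlp = spacy.blank("en")
source_nlp.add_pipe("tagger")

nlp = spacy.blank("en")
nlp.add_pipe("tagger", source=source_nlp, name="my_tagger")

# With the source behavior described above, both pipelines hold the very
# same component object, so attributes like component.name can only be
# correct for one of the two pipelines at a time.
print(nlp.get_pipe("my_tagger") is source_nlp.get_pipe("tagger"))
```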
* Use isort with Black profile
* isort all the things
* Fix import cycles as a result of import sorting
* Add DOCBIN_ALL_ATTRS type definition
* Add isort to requirements
* Remove isort from build dependencies check
* Typo
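
For reference, the Black-compatible behavior comes from isort's built-in "black" profile; a minimal, illustrative configuration (the exact file and location used in the repo may differ):

```ini
# Illustrative only: enables isort's Black-compatible import sorting.
[isort]
profile = black
```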
* replicate bug with tok2vec in annotating components
* add overfitting test with a frozen tok2vec
* remove broadcast from predict and check doc.tensor instead
* remove broadcast
* proper error
* slight rephrase of documentation
* add v1 and v2 tests for tok2vec architectures
* textcat architectures are not "layers"
* test older textcat architectures
* test older parser architecture
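
For context, the frozen-but-annotating tok2vec setup that the tests above exercise looks roughly like the following training config excerpt (illustrative, not the exact test fixture): a frozen tok2vec listed under annotating_components still runs during training and fills doc.tensor, which is what the updated predict logic checks instead of broadcasting.

```ini
[nlp]
pipeline = ["tok2vec", "tagger"]

[training]
# tok2vec is not updated during training, but still annotates the docs so
# that components listening to it can read doc.tensor.
frozen_components = ["tok2vec"]
annotating_components = ["tok2vec"]
```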
* Tagger: use unnormalized probabilities for inference
Using an unnormalized softmax avoids the relatively expensive exp function, which can significantly speed up non-transformer models (e.g. a 27% speedup on a German tagging + parsing pipeline).
* Add spacy.Tagger.v2 with configurable normalization
Normalization of probabilities is disabled by default to improve
performance.
* Update documentation, models, and tests to spacy.Tagger.v2
* Move Tagger.v1 to spacy-legacy
* docs/architectures: run prettier
* Unnormalized softmax is now a Softmax_v2 option
* Require thinc 8.0.14 and spacy-legacy 3.0.9
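
For reference, spacy.Tagger.v2 exposes the normalization switch as a model setting; an illustrative config excerpt (normalize defaults to false and is shown here explicitly):

```ini
[components.tagger.model]
@architectures = "spacy.Tagger.v2"
nO = null
# Set to true to get normalized probabilities at the cost of some speed.
normalize = false

[components.tagger.model.tok2vec]
@architectures = "spacy.Tok2VecListener.v1"
width = ${components.tok2vec.model.encode.width}
```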
* Set up CI for tests with GPU agent
* Update tests for enabled GPU
* Fix steps filename
* Add parallel build jobs as a setting
* Fix test requirements
* Fix install test requirements condition
* Fix pipeline models test
* Reset current ops in prefer/require testing
* Fix more tests
* Remove separate test_models test
* Fix regression 5551
* fix StaticVectors for GPU use
* fix vocab tests
* Fix regression test 5082
* Move azure steps to .github and reenable default pool jobs
* Consolidate/rename azure steps
Co-authored-by: svlandeg <sofie.vanlandeghem@gmail.com>
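
The "reset current ops" fix above is about keeping GPU-enabled tests from leaking state into later tests; a rough sketch of the pattern (a hypothetical test helper, not the actual test code):

```python
import spacy
from thinc.api import get_current_ops, use_ops

# Remember which ops backend was active before the test.
saved_ops = get_current_ops()

# Switch to GPU ops if available; prefer_gpu returns False on CPU-only
# machines, in which case the current ops are left unchanged.
spacy.prefer_gpu()

# ... GPU-dependent assertions would run here ...

# Restore the previous backend so subsequent tests see the original ops.
with use_ops(saved_ops.name):
    pass  # subsequent test code would run inside this block
```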
* fix TorchBiLSTMEncoder documentation
* ensure the types of the encoding Tok2vec layers are correct
* update references from v1 to v2 for the new architectures
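
For reference, an illustrative tok2vec config using the v2 embedding and encoding architectures referenced above (the widths, rows, and depth are example values only, not recommendations):

```ini
[components.tok2vec.model]
@architectures = "spacy.Tok2Vec.v2"

[components.tok2vec.model.embed]
@architectures = "spacy.MultiHashEmbed.v2"
width = 96
attrs = ["NORM", "PREFIX", "SUFFIX", "SHAPE"]
rows = [5000, 1000, 2500, 2500]
include_static_vectors = false

[components.tok2vec.model.encode]
@architectures = "spacy.MaxoutWindowEncoder.v2"
width = 96
depth = 4
window_size = 1
maxout_pieces = 3
```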