Commit Graph

19 Commits

Author SHA1 Message Date
Daniël de Kok
c8a12c5d85 Extend tagger and senter tests
So that they, like the other tests, also check that we get no
activations if no activations were requested.
2022-06-24 10:18:00 +02:00
Daniël de Kok
c3da32b46b Change type of store_activations to Union[bool, List[str]]
When the value is:

- A bool: all activations are stored when set to `True`.
- A List[str]: the activations named in the list are stored
2022-06-22 14:28:03 +02:00
Daniël de Kok
b71c6043bc Store activations in Doc when store_activations is enabled
This change adds the new `activations` attribute to `Doc`. This
attribute can be used by trainable pipes to store their activations,
probabilities, and guesses for downstream users.

As an example, this change modifies the `tagger` and `senter` pipes to
add an `store_activations` option. When this option is enabled, the
probabilities and guesses are stored in `set_annotations`.
2022-06-22 10:03:26 +02:00
Adriane Boyd
0668a449ba
Add Pipe.hide_labels to omit labels from pipeline meta (#10175) 2022-02-05 17:59:24 +01:00
svlandeg
e94a21638e adding tests for trained models to ensure predict reproducibility 2020-10-13 21:07:13 +02:00
Ines Montani
bfa3931c9d
Revert added_strings change (#6236) 2020-10-10 18:55:07 +02:00
Sofie Van Landeghem
d093d6343b
TrainablePipe (#6213)
* rename Pipe to TrainablePipe

* split functionality between Pipe and TrainablePipe

* remove unnecessary methods from certain components

* cleanup

* hasattr(component, "pipe") should be sufficient again

* remove serialization and vocab/cfg from Pipe

* unify _ensure_examples and validate_examples

* small fixes

* hasattr checks for self.cfg and self.vocab

* make is_resizable and is_trainable properties

* serialize strings.json instead of vocab

* fix KB IO + tests

* fix typos

* more typos

* _added_strings as a set

* few more tests specifically for _added_strings field

* bump to 3.0.0a36
2020-10-08 21:33:49 +02:00
Ines Montani
7851020653 Update tests 2020-09-29 18:14:15 +02:00
Ines Montani
ff9a63bfbd begin_training -> initialize 2020-09-28 21:35:09 +02:00
Ines Montani
febb99916d Tidy up and auto-format [ci skip] 2020-09-13 10:55:36 +02:00
Sofie Van Landeghem
8e7557656f
Renaming gold & annotation_setter (#6042)
* version bump to 3.0.0a16

* rename "gold" folder to "training"

* rename 'annotation_setter' to 'set_extra_annotations'

* formatting
2020-09-09 10:31:03 +02:00
Sofie Van Landeghem
60f22e1800
Pipe API (#6034)
* ensure Language passes on valid examples for initialization

* fix tagger model initialization

* check for valid get_examples across components

* assume labels were added before begin_training

* fix senter initialization

* fix morphologizer initialization

* use methods to check arguments

* test textcat init, requires thinc>=8.0.0a31

* fix tok2vec init

* fix entity linker init

* use islice

* fix simple NER

* cleanup debug model

* fix assert statements

* fix tests

* throw error when adding a label if the output layer can't be resized anymore

* fix test

* add failing test for simple_ner

* UX improvements

* morphologizer UX

* assume begin_training gets a representative set and processes the labels

* remove assumptions for output of untrained NER model

* restore test for original purpose
2020-09-08 22:44:25 +02:00
Ines Montani
43b960c01b
Refactor pipeline components, config and language data (#5759)
* Update with WIP

* Update with WIP

* Update with pipeline serialization

* Update types and pipe factories

* Add deep merge, tidy up and add tests

* Fix pipe creation from config

* Don't validate default configs on load

* Update spacy/language.py

Co-authored-by: Ines Montani <ines@ines.io>

* Adjust factory/component meta error

* Clean up factory args and remove defaults

* Add test for failing empty dict defaults

* Update pipeline handling and methods

* provide KB as registry function instead of as object

* small change in test to make functionality more clear

* update example script for EL configuration

* Fix typo

* Simplify test

* Simplify test

* splitting pipes.pyx into separate files

* moving default configs to each component file

* fix batch_size type

* removing default values from component constructors where possible (TODO: test 4725)

* skip instead of xfail

* Add test for config -> nlp with multiple instances

* pipeline.pipes -> pipeline.pipe

* Tidy up, document, remove kwargs

* small cleanup/generalization for Tok2VecListener

* use DEFAULT_UPSTREAM field

* revert to avoid circular imports

* Fix tests

* Replace deprecated arg

* Make model dirs require config

* fix pickling of keyword-only arguments in constructor

* WIP: clean up and integrate full config

* Add helper to handle function args more reliably

Now also includes keyword-only args

* Fix config composition and serialization

* Improve config debugging and add visual diff

* Remove unused defaults and fix type

* Remove pipeline and factories from meta

* Update spacy/default_config.cfg

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Update spacy/default_config.cfg

* small UX edits

* avoid printing stack trace for debug CLI commands

* Add support for language-specific factories

* specify the section of the config which holds the model to debug

* WIP: add Language.from_config

* Update with language data refactor WIP

* Auto-format

* Add backwards-compat handling for Language.factories

* Update morphologizer.pyx

* Fix morphologizer

* Update and simplify lemmatizers

* Fix Japanese tests

* Port over tagger changes

* Fix Chinese and tests

* Update to latest Thinc

* WIP: xfail first Russian lemmatizer test

* Fix component-specific overrides

* fix nO for output layers in debug_model

* Fix default value

* Fix tests and don't pass objects in config

* Fix deep merging

* Fix lemma lookup data registry

Only load the lookups if an entry is available in the registry (and if spacy-lookups-data is installed)

* Add types

* Add Vocab.from_config

* Fix typo

* Fix tests

* Make config copying more elegant

* Fix pipe analysis

* Fix lemmatizers and is_base_form

* WIP: move language defaults to config

* Fix morphology type

* Fix vocab

* Remove comment

* Update to latest Thinc

* Add morph rules to config

* Tidy up

* Remove set_morphology option from tagger factory

* Hack use_gpu

* Move [pipeline] to top-level block and make [nlp.pipeline] list

Allows separating component blocks from component order – otherwise, ordering the config would mean a changed component order, which is bad. Also allows initial config to define more components and not use all of them

* Fix use_gpu and resume in CLI

* Auto-format

* Remove resume from config

* Fix formatting and error

* [pipeline] -> [components]

* Fix types

* Fix tagger test: requires set_morphology?

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
Co-authored-by: svlandeg <sofie.vanlandeghem@gmail.com>
Co-authored-by: Matthew Honnibal <honnibal+gh@gmail.com>
2020-07-22 13:42:59 +02:00
Adriane Boyd
ad15499b3b
Fix get_loss for values outside of labels in senter (#5730)
* Fix get_loss for None alignments in senter

When converting the `sent_start` values back to `SentenceRecognizer`
labels, handle `None` alignments.

* Handle SENT_START as -1

Handle SENT_START as -1 (or -1 converted to uint64) by treating any
values other than 1 the same as 0 in `SentenceRecognizer.get_loss`.
2020-07-09 01:41:58 +02:00
Sofie Van Landeghem
fcbf899b08
Feature/example only (#5707)
* remove _convert_examples

* fix test_gold, raise TypeError if tuples are used instead of Example's

* throwing proper errors when the wrong type of objects are passed

* fix deprectated format in tests

* fix deprectated format in parser tests

* fix tests for NEL, morph, senter, tagger, textcat

* update regression tests with new Example format

* use make_doc

* more fixes to nlp.update calls

* few more small fixes for rehearse and evaluate

* only import ml_datasets if really necessary
2020-07-06 13:02:36 +02:00
Ines Montani
581bda9f98 Update senter test and auto-format 2020-05-21 20:17:14 +02:00
Sofie Van Landeghem
b2e93be867
Optimizer defaults (#5244)
* set optimizer defaults to mimic thinc 7 + bump to dev6

* larger error range for senter overfitting test
2020-04-03 13:02:46 +02:00
Sofie Van Landeghem
5847be6022
Tok2Vec: extract-embed-encode (#5102)
* avoid changing original config

* fix elif structure, batch with just int crashes otherwise

* tok2vec example with doc2feats, encode and embed architectures

* further clean up MultiHashEmbed

* further generalize Tok2Vec to work with extract-embed-encode parts

* avoid initializing the charembed layer with Docs (for now ?)

* small fixes for bilstm config (still does not run)

* rename to core layer

* move new configs

* walk model to set nI instead of using core ref

* fix senter overfitting test to be more similar to the training data (avoid flakey behaviour)
2020-03-08 13:23:18 +01:00
adrianeboyd
c95ce96c44
Update sentence recognizer (#5109)
* Update sentence recognizer

* rename `sentrec` to `senter`
* use `spacy.HashEmbedCNN.v1` by default
* update to follow `Tagger` modifications
* remove component methods that can be inherited from `Tagger`
* add simple initialization and overfitting pipeline tests

* Update serialization test for senter
2020-03-06 14:45:02 +01:00