* Tidy up train-from-config a bit
* Fix accidentally quadratic perf in TokenAnnotation.brackets
When we're reading in the gold data, we had a nested loop where
we looped over the brackets for each token, looking for brackets
that start on that word. This is accidentally quadratic, because
we have one bracket per word (for the POS tags). So we had
an O(N**2) behaviour here that ended up being pretty slow.
To solve this I'm indexing the brackets by their starting word
on the TokenAnnotations object, and having a property to provide
the previous view.
* Fixes
* setting KB in the EL constructor, similar to how the model is passed on
* removing wikipedia example files - moved to projects
* throw an error when nlp.update is called with 2 positional arguments
* rewriting the config logic in create pipe to accomodate for other objects (e.g. KB) in the config
* update config files with new parameters
* avoid training pipeline components that don't have a model (like sentencizer)
* various small fixes + UX improvements
* small fixes
* set thinc to 8.0.0a9 everywhere
* remove outdated comment
* Draft layer for BILUO actions
* Fixes to biluo layer
* WIP on BILUO layer
* Add tests for BILUO layer
* Format
* Fix transitions
* Update test
* Link in the simple_ner
* Update BILUO tagger
* Update __init__
* Import simple_ner
* Update test
* Import
* Add files
* Add config
* Fix label passing for BILUO and tagger
* Fix label handling for simple_ner component
* Update simple NER test
* Update config
* Hack train script
* Update BILUO layer
* Fix SimpleNER component
* Update train_from_config
* Add biluo_to_iob helper
* Add IOB layer
* Add IOBTagger model
* Update biluo layer
* Update SimpleNER tagger
* Update BILUO
* Read random seed in train-from-config
* Update use of normal_init
* Fix normalization of gradient in SimpleNER
* Update IOBTagger
* Remove print
* Tweak masking in BILUO
* Add dropout in SimpleNER
* Update thinc
* Tidy up simple_ner
* Fix biluo model
* Unhack train-from-config
* Update setup.cfg and requirements
* Add tb_framework.py for parser model
* Try to avoid memory leak in BILUO
* Move ParserModel into spacy.ml, avoid need for subclass.
* Use updated parser model
* Remove incorrect call to model.initializre in PrecomputableAffine
* Update parser model
* Avoid divide by zero in tagger
* Add extra dropout layer in tagger
* Refine minibatch_by_words function to avoid oom
* Fix parser model after refactor
* Try to avoid div-by-zero in SimpleNER
* Fix infinite loop in minibatch_by_words
* Use SequenceCategoricalCrossentropy in Tagger
* Fix parser model when hidden layer
* Remove extra dropout from tagger
* Add extra nan check in tagger
* Fix thinc version
* Update tests and imports
* Fix test
* Update test
* Update tests
* Fix tests
* Fix test
Co-authored-by: Ines Montani <ines@ines.io>
* Check whether doc is instantiated
When creating docs to pair with gold parses, modify test to check
whether a doc is unset rather than whether it contains tokens.
* Restore test of evaluate on an empty doc
* Set a minimal gold.orig for the scorer
Without a minimal gold.orig the scorer can't evaluate empty docs. This
is the v3 equivalent of #4925.
* label in span not writable anymore
* Revert "label in span not writable anymore"
This reverts commit ab442338c8.
* provide more friendly error msg for parsing file
* Add sent_starts to GoldParse
* Add SentTagger pipeline component
Add `SentTagger` pipeline component as a subclass of `Tagger`.
* Model reduces default parameters from `Tagger` to be small and fast
* Hard-coded set of two labels:
* S (1): token at beginning of sentence
* I (0): all other sentence positions
* Sets `token.sent_start` values
* Add sentence segmentation to Scorer
Report `sent_p/r/f` for sentence boundaries, which may be provided by
various pipeline components.
* Add sentence segmentation to CLI evaluate
* Add senttagger metrics/scoring to train CLI
* Rename SentTagger to SentenceRecognizer
* Add SentenceRecognizer to spacy.pipes imports
* Add SentenceRecognizer serialization test
* Shorten component name to sentrec
* Remove duplicates from train CLI output metrics
Replace old gold alignment that allowed for some noise in the alignment between raw and orth with the new simpler alignment that requires that the raw and orth strings are identical except for whitespace and capitalization.
* Replace old alignment with new alignment, removing `_align.pyx` and
its tests
* Remove all quote normalizations
* Enable test for new align
* Modify test case for quote normalization
* Switch to train_dataset() function in train CLI
* Fixes for pipe() methods in pipeline components
* Don't clobber `examples` variable with `as_example` in pipe() methods
* Remove unnecessary traversals of `examples`
* Update Parser.pipe() for Examples
* Add `as_examples` kwarg to `pipe()` with implementation to return
`Example`s
* Accept `Doc` or `Example` in `pipe()` with `_get_doc()` (copied from
`Pipe`)
* Fixes to Example implementation in spacy.gold
* Move `make_projective` from an attribute of Example to an argument of
`Example.get_gold_parses()`
* Head of 0 are not treated as unset
* Unset heads are set to self rather than `None` (which causes problems
while projectivizing)
* Check for `Doc` (not just not `None`) when creating GoldParses for
pre-merged example
* Don't clobber `examples` variable in `iter_gold_docs()`
* Add/modify gold tests for handling projectivity
* In JSON roundtrip compare results from `dev_dataset` rather than
`train_dataset` to avoid projectivization (and other potential
modifications)
* Add test for projective train vs. nonprojective dev versions of the
same `Doc`
* Handle ignore_misaligned as arg rather than attr
Move `ignore_misaligned` from an attribute of `Example` to an argument
to `Example.get_gold_parses()`, which makes it parallel to
`make_projective`.
Add test with old and new align that checks whether `ignore_misaligned`
errors are raised as expected (only for new align).
* Remove unused attrs from gold.pxd
Remove `ignore_misaligned` and `make_projective` from `gold.pxd`
* Restructure Example with merged sents as default
An `Example` now includes a single `TokenAnnotation` that includes all
the information from one `Doc` (=JSON `paragraph`). If required, the
individual sentences can be returned as a list of examples with
`Example.split_sents()` with no raw text available.
* Input/output a single `Example.token_annotation`
* Add `sent_starts` to `TokenAnnotation` to handle sentence boundaries
* Replace `Example.merge_sents()` with `Example.split_sents()`
* Modify components to use a single `Example.token_annotation`
* Pipeline components
* conllu2json converter
* Rework/rename `add_token_annotation()` and `add_doc_annotation()` to
`set_token_annotation()` and `set_doc_annotation()`, functions that set
rather then appending/extending.
* Rename `morphology` to `morphs` in `TokenAnnotation` and `GoldParse`
* Add getters to `TokenAnnotation` to supply default values when a given
attribute is not available
* `Example.get_gold_parses()` in `spacy.gold._make_golds()` is only
applied on single examples, so the `GoldParse` is returned saved in the
provided `Example` rather than creating a new `Example` with no other
internal annotation
* Update tests for API changes and `merge_sents()` vs. `split_sents()`
* Refer to Example.goldparse in iter_gold_docs()
Use `Example.goldparse` in `iter_gold_docs()` instead of `Example.gold`
because a `None` `GoldParse` is generated with ignore_misaligned and
generating it on-the-fly can raise an unwanted AlignmentError
* Fix make_orth_variants()
Fix bug in make_orth_variants() related to conversion from multiple to
one TokenAnnotation per Example.
* Add basic test for make_orth_variants()
* Replace try/except with conditionals
* Replace default morph value with set
* Switch to train_dataset() function in train CLI
* Fixes for pipe() methods in pipeline components
* Don't clobber `examples` variable with `as_example` in pipe() methods
* Remove unnecessary traversals of `examples`
* Update Parser.pipe() for Examples
* Add `as_examples` kwarg to `pipe()` with implementation to return
`Example`s
* Accept `Doc` or `Example` in `pipe()` with `_get_doc()` (copied from
`Pipe`)
* Fixes to Example implementation in spacy.gold
* Move `make_projective` from an attribute of Example to an argument of
`Example.get_gold_parses()`
* Head of 0 are not treated as unset
* Unset heads are set to self rather than `None` (which causes problems
while projectivizing)
* Check for `Doc` (not just not `None`) when creating GoldParses for
pre-merged example
* Don't clobber `examples` variable in `iter_gold_docs()`
* Add/modify gold tests for handling projectivity
* In JSON roundtrip compare results from `dev_dataset` rather than
`train_dataset` to avoid projectivization (and other potential
modifications)
* Add test for projective train vs. nonprojective dev versions of the
same `Doc`
* Handle ignore_misaligned as arg rather than attr
Move `ignore_misaligned` from an attribute of `Example` to an argument
to `Example.get_gold_parses()`, which makes it parallel to
`make_projective`.
Add test with old and new align that checks whether `ignore_misaligned`
errors are raised as expected (only for new align).
* Remove unused attrs from gold.pxd
Remove `ignore_misaligned` and `make_projective` from `gold.pxd`
* Refer to Example.goldparse in iter_gold_docs()
Use `Example.goldparse` in `iter_gold_docs()` instead of `Example.gold`
because a `None` `GoldParse` is generated with ignore_misaligned and
generating it on-the-fly can raise an unwanted AlignmentError
* Update test for ignore_misaligned
* OrigAnnot class instead of gold.orig_annot list of zipped tuples
* from_orig to replace from_annot_tuples
* rename to RawAnnot
* some unit tests for GoldParse creation and internal format
* removing orig_annot and switching to lists instead of tuple
* rewriting tuples to use RawAnnot (+ debug statements, WIP)
* fix pop() changing the data
* small fixes
* pop-append fixes
* return RawAnnot for existing GoldParse to have uniform interface
* clean up imports
* fix merge_sents
* add unit test for 4402 with new structure (not working yet)
* introduce DocAnnot
* typo fixes
* add unit test for merge_sents
* rename from_orig to from_raw
* fixing unit tests
* fix nn parser
* read_annots to produce text, doc_annot pairs
* _make_golds fix
* rename golds_to_gold_annots
* small fixes
* fix encoding
* have golds_to_gold_annots use DocAnnot
* missed a spot
* merge_sents as function in DocAnnot
* allow specifying only part of the token-level annotations
* refactor with Example class + underlying dicts
* pipeline components to work with Example objects (wip)
* input checking
* fix yielding
* fix calls to update
* small fixes
* fix scorer unit test with new format
* fix kwargs order
* fixes for ud and conllu scripts
* fix reading data for conllu script
* add in proper errors (not fixed numbering yet to avoid merge conflicts)
* fixing few more small bugs
* fix EL script
* Xfail new tokenization test
* Put new alignment behind feature flag
* Move USE_ALIGN to top of the file [ci skip]
Co-authored-by: Ines Montani <ines@ines.io>
* Flag to ignore examples with mismatched raw/gold text
After #4525, we're seeing some alignment failures on our OntoNotes data. I think we actually have fixes for most of these cases.
In general it's better to fix the data, but it seems good to allow the GoldCorpus class to just skip cases where the raw text doesn't
match up to the gold words. I think previously we were silently ignoring these cases.
* Try to fix test on Python 2.7
* Fix get labels for textcat
* Fix char_embed for gpu
* Revert "Fix char_embed for gpu"
This reverts commit 055b9a9e85.
* Fix passing of cats in gold.pyx
* Revert "Match pop with append for training format (#4516)"
This reverts commit 8e7414dace.
* Fix popping gold parses
* Fix handling of cats in gold tuples
* Fix name
* Fix ner_multitask_objective script
* Add test for 4402
* trying to fix script - not succesful yet
* match pop() with extend() to avoid changing the data
* few more pop-extend fixes
* reinsert deleted print statement
* fix print statement
* add last tested version
* append instead of extend
* add in few comments
* quick fix for 4402 + unit test
* fixing number of docs (not counting cats)
* more fixes
* fix len
* print tmp file instead of using data from examples dir
* print tmp file instead of using data from examples dir (2)
* Support train dict format as JSONL
* Add (overly simple) check for dict vs. tuple to read JSONL lines as
either train dicts or train tuples
* Extend JSON/JSONL roundtrip conversion tests using `docs_to_json()`
and `GoldCorpus.train_tuples`
* Revert docs to default JSON output with convert
* raise specific error when removing a matcher rule that doesn't exist
* rephrasing
* goldparse init: allocate fields only if doc is not empty
* avoid zero length alloc in saving tokenizer cache
* avoid allocating zero length mem in matcher
* asserts to avoid allocating zero length mem
* fix zero-length allocation in matcher
* bump cymem version
* revert cymem version bump
* Error for ill-formed input to iob_to_biluo()
Check for empty label in iob_to_biluo(), which can result from
ill-formed input.
* Check for empty NER label in debug-data
* Add doc.cats to spacy.gold at the paragraph level
Support `doc.cats` as `"cats": [{"label": string, "value": number}]` in
the spacy JSON training format at the paragraph level.
* `spacy.gold.docs_to_json()` writes `docs.cats`
* `GoldCorpus` reads in cats in each `GoldParse`
* Update instances of gold_tuples to handle cats
Update iteration over gold_tuples / gold_parses to handle addition of
cats at the paragraph level.
* Add textcat to train CLI
* Add textcat options to train CLI
* Add textcat labels in `TextCategorizer.begin_training()`
* Add textcat evaluation to `Scorer`:
* For binary exclusive classes with provided label: F1 for label
* For 2+ exclusive classes: F1 macro average
* For multilabel (not exclusive): ROC AUC macro average (currently
relying on sklearn)
* Provide user info on textcat evaluation settings, potential
incompatibilities
* Provide pipeline to Scorer in `Language.evaluate` for textcat config
* Customize train CLI output to include only metrics relevant to current
pipeline
* Add textcat evaluation to evaluate CLI
* Fix handling of unset arguments and config params
Fix handling of unset arguments and model confiug parameters in Scorer
initialization.
* Temporarily add sklearn requirement
* Remove sklearn version number
* Improve Scorer handling of models without textcats
* Fixing Scorer handling of models without textcats
* Update Scorer output for python 2.7
* Modify inf in Scorer for python 2.7
* Auto-format
Also make small adjustments to make auto-formatting with black easier and produce nicer results
* Move error message to Errors
* Update documentation
* Add cats to annotation JSON format [ci skip]
* Fix tpl flag and docs [ci skip]
* Switch to internal roc_auc_score
Switch to internal `roc_auc_score()` adapted from scikit-learn.
* Add AUCROCScore tests and improve errors/warnings
* Add tests for AUCROCScore and roc_auc_score
* Add missing error for only positive/negative values
* Remove unnecessary warnings and errors
* Make reduced roc_auc_score functions private
Because most of the checks and warnings have been stripped for the
internal functions and access is only intended through `ROCAUCScore`,
make the functions for roc_auc_score adapted from scikit-learn private.
* Check that data corresponds with multilabel flag
Check that the training instances correspond with the multilabel flag,
adding the multilabel flag if required.
* Add textcat score to early stopping check
* Add more checks to debug-data for textcat
* Add example training data for textcat
* Add more checks to textcat train CLI
* Check configuration when extending base model
* Fix typos
* Update textcat example data
* Provide licensing details and licenses for data
* Remove two labels with no positive instances from jigsaw-toxic-comment
data.
Co-authored-by: Ines Montani <ines@ines.io>
Filtering by orth and tag, create variants of training docs with
alternate orth variants, e.g., unicode quotes, dashes, and ellipses.
The variants can be single tokens (dashes) or paired tokens (quotes)
with left and right versions.
Currently restricted to only add variants to training documents without
raw text provided, where only gold.words needs to be modified.
* Extending debug-data with dependency checks, etc.
* Modify debug-data to load with GoldCorpus to iterate over .json/.jsonl
files within directories
* Add GoldCorpus iterator train_docs_without_preprocessing to load
original train docs without shuffling and projectivizing
* Report number of misaligned tokens
* Add more dependency checks and messages
* Update spacy/cli/debug_data.py
Co-Authored-By: Ines Montani <ines@ines.io>
* Fixed conflict
* Move counts to _compile_gold()
* Move all dependency nonproj/sent/head/cycle counting to
_compile_gold()
* Unclobber previous merges
* Update variable names
* Update more variable names, fix misspelling
* Don't clobber loading error messages
* Only warn about misaligned tokens if present
* Check whether two entities overlap
- biluo_gold_biluo_overlap now throw exception when entities passed in have overlaps
- added unit test
* SCA agreement
Provide the tokens in the cycle and the first 50 tokens from document in
the error message so it's easier to track down the location of the cycle
in the data.
Addresses feature request in #3698.