* verbose and tag_map options
* adding init_tok2vec option and only changing the tok2vec that is specified
* adding omit_extra_lookups and verifying textcat config
* wip
* pretrain bugfix
* add replace and resume options
* train_textcat fix
* raw text functionality
* improve UX when KeyError or when input data can't be parsed
* avoid unnecessary access to goldparse in TextCat pipe
* save performance information in nlp.meta
* add noise_level to config
* move nn_parser's defaults to config file
* multitask in config - doesn't work yet
* scorer offering both F and AUC options, need to be specified in config
* add textcat verification code from old train script
* small fixes to config files
* clean up
* set default config for ner/parser to allow create_pipe to work as before
* two more test fixes
* small fixes
* cleanup
* fix NER pickling + additional unit test
* create_pipe as before
Port relevant changes from #5361:
* Initialize lower flag explicitly
* Handle whitespace words from GoldParse correctly when creating raw
text with orth variants
* Tidy up train-from-config a bit
* Fix accidentally quadratic perf in TokenAnnotation.brackets
When we're reading in the gold data, we had a nested loop where
we looped over the brackets for each token, looking for brackets
that start on that word. This is accidentally quadratic, because
we have one bracket per word (for the POS tags). So we had
an O(N**2) behaviour here that ended up being pretty slow.
To solve this I'm indexing the brackets by their starting word
on the TokenAnnotations object, and having a property to provide
the previous view.
* Fixes
* setting KB in the EL constructor, similar to how the model is passed on
* removing wikipedia example files - moved to projects
* throw an error when nlp.update is called with 2 positional arguments
* rewriting the config logic in create pipe to accomodate for other objects (e.g. KB) in the config
* update config files with new parameters
* avoid training pipeline components that don't have a model (like sentencizer)
* various small fixes + UX improvements
* small fixes
* set thinc to 8.0.0a9 everywhere
* remove outdated comment
* Add warning for misaligned character offset spans
* Resolve conflict
* Filter warnings in example scripts
Filter warnings in example scripts to show warnings once, in particular
warnings about misaligned entities.
Co-authored-by: Ines Montani <ines@ines.io>
* Draft layer for BILUO actions
* Fixes to biluo layer
* WIP on BILUO layer
* Add tests for BILUO layer
* Format
* Fix transitions
* Update test
* Link in the simple_ner
* Update BILUO tagger
* Update __init__
* Import simple_ner
* Update test
* Import
* Add files
* Add config
* Fix label passing for BILUO and tagger
* Fix label handling for simple_ner component
* Update simple NER test
* Update config
* Hack train script
* Update BILUO layer
* Fix SimpleNER component
* Update train_from_config
* Add biluo_to_iob helper
* Add IOB layer
* Add IOBTagger model
* Update biluo layer
* Update SimpleNER tagger
* Update BILUO
* Read random seed in train-from-config
* Update use of normal_init
* Fix normalization of gradient in SimpleNER
* Update IOBTagger
* Remove print
* Tweak masking in BILUO
* Add dropout in SimpleNER
* Update thinc
* Tidy up simple_ner
* Fix biluo model
* Unhack train-from-config
* Update setup.cfg and requirements
* Add tb_framework.py for parser model
* Try to avoid memory leak in BILUO
* Move ParserModel into spacy.ml, avoid need for subclass.
* Use updated parser model
* Remove incorrect call to model.initializre in PrecomputableAffine
* Update parser model
* Avoid divide by zero in tagger
* Add extra dropout layer in tagger
* Refine minibatch_by_words function to avoid oom
* Fix parser model after refactor
* Try to avoid div-by-zero in SimpleNER
* Fix infinite loop in minibatch_by_words
* Use SequenceCategoricalCrossentropy in Tagger
* Fix parser model when hidden layer
* Remove extra dropout from tagger
* Add extra nan check in tagger
* Fix thinc version
* Update tests and imports
* Fix test
* Update test
* Update tests
* Fix tests
* Fix test
Co-authored-by: Ines Montani <ines@ines.io>
* Initialize lower flag explicitly
* Handle whitespace words from GoldParse correctly when creating raw
text with orth variants
* Return the text with original casing if anything goes wrong
Improve GoldParse NER alignment by including all cases where the start
and end of the NER span can be aligned, regardless of internal
tokenization differences.
To do this, convert BILUO tags to character offsets, check start/end
alignment with `doc.char_span()`, and assign the BILUO tags for the
aligned spans. Alignment for `O/-` tags is handled through the
one-to-one and multi alignments.
* Check whether doc is instantiated
When creating docs to pair with gold parses, modify test to check
whether a doc is unset rather than whether it contains tokens.
* Restore test of evaluate on an empty doc
* Set a minimal gold.orig for the scorer
Without a minimal gold.orig the scorer can't evaluate empty docs. This
is the v3 equivalent of #4925.
* label in span not writable anymore
* Revert "label in span not writable anymore"
This reverts commit ab442338c8.
* provide more friendly error msg for parsing file
* Add sent_starts to GoldParse
* Add SentTagger pipeline component
Add `SentTagger` pipeline component as a subclass of `Tagger`.
* Model reduces default parameters from `Tagger` to be small and fast
* Hard-coded set of two labels:
* S (1): token at beginning of sentence
* I (0): all other sentence positions
* Sets `token.sent_start` values
* Add sentence segmentation to Scorer
Report `sent_p/r/f` for sentence boundaries, which may be provided by
various pipeline components.
* Add sentence segmentation to CLI evaluate
* Add senttagger metrics/scoring to train CLI
* Rename SentTagger to SentenceRecognizer
* Add SentenceRecognizer to spacy.pipes imports
* Add SentenceRecognizer serialization test
* Shorten component name to sentrec
* Remove duplicates from train CLI output metrics
Replace old gold alignment that allowed for some noise in the alignment between raw and orth with the new simpler alignment that requires that the raw and orth strings are identical except for whitespace and capitalization.
* Replace old alignment with new alignment, removing `_align.pyx` and
its tests
* Remove all quote normalizations
* Enable test for new align
* Modify test case for quote normalization
* Switch to train_dataset() function in train CLI
* Fixes for pipe() methods in pipeline components
* Don't clobber `examples` variable with `as_example` in pipe() methods
* Remove unnecessary traversals of `examples`
* Update Parser.pipe() for Examples
* Add `as_examples` kwarg to `pipe()` with implementation to return
`Example`s
* Accept `Doc` or `Example` in `pipe()` with `_get_doc()` (copied from
`Pipe`)
* Fixes to Example implementation in spacy.gold
* Move `make_projective` from an attribute of Example to an argument of
`Example.get_gold_parses()`
* Head of 0 are not treated as unset
* Unset heads are set to self rather than `None` (which causes problems
while projectivizing)
* Check for `Doc` (not just not `None`) when creating GoldParses for
pre-merged example
* Don't clobber `examples` variable in `iter_gold_docs()`
* Add/modify gold tests for handling projectivity
* In JSON roundtrip compare results from `dev_dataset` rather than
`train_dataset` to avoid projectivization (and other potential
modifications)
* Add test for projective train vs. nonprojective dev versions of the
same `Doc`
* Handle ignore_misaligned as arg rather than attr
Move `ignore_misaligned` from an attribute of `Example` to an argument
to `Example.get_gold_parses()`, which makes it parallel to
`make_projective`.
Add test with old and new align that checks whether `ignore_misaligned`
errors are raised as expected (only for new align).
* Remove unused attrs from gold.pxd
Remove `ignore_misaligned` and `make_projective` from `gold.pxd`
* Restructure Example with merged sents as default
An `Example` now includes a single `TokenAnnotation` that includes all
the information from one `Doc` (=JSON `paragraph`). If required, the
individual sentences can be returned as a list of examples with
`Example.split_sents()` with no raw text available.
* Input/output a single `Example.token_annotation`
* Add `sent_starts` to `TokenAnnotation` to handle sentence boundaries
* Replace `Example.merge_sents()` with `Example.split_sents()`
* Modify components to use a single `Example.token_annotation`
* Pipeline components
* conllu2json converter
* Rework/rename `add_token_annotation()` and `add_doc_annotation()` to
`set_token_annotation()` and `set_doc_annotation()`, functions that set
rather then appending/extending.
* Rename `morphology` to `morphs` in `TokenAnnotation` and `GoldParse`
* Add getters to `TokenAnnotation` to supply default values when a given
attribute is not available
* `Example.get_gold_parses()` in `spacy.gold._make_golds()` is only
applied on single examples, so the `GoldParse` is returned saved in the
provided `Example` rather than creating a new `Example` with no other
internal annotation
* Update tests for API changes and `merge_sents()` vs. `split_sents()`
* Refer to Example.goldparse in iter_gold_docs()
Use `Example.goldparse` in `iter_gold_docs()` instead of `Example.gold`
because a `None` `GoldParse` is generated with ignore_misaligned and
generating it on-the-fly can raise an unwanted AlignmentError
* Fix make_orth_variants()
Fix bug in make_orth_variants() related to conversion from multiple to
one TokenAnnotation per Example.
* Add basic test for make_orth_variants()
* Replace try/except with conditionals
* Replace default morph value with set
* Switch to train_dataset() function in train CLI
* Fixes for pipe() methods in pipeline components
* Don't clobber `examples` variable with `as_example` in pipe() methods
* Remove unnecessary traversals of `examples`
* Update Parser.pipe() for Examples
* Add `as_examples` kwarg to `pipe()` with implementation to return
`Example`s
* Accept `Doc` or `Example` in `pipe()` with `_get_doc()` (copied from
`Pipe`)
* Fixes to Example implementation in spacy.gold
* Move `make_projective` from an attribute of Example to an argument of
`Example.get_gold_parses()`
* Head of 0 are not treated as unset
* Unset heads are set to self rather than `None` (which causes problems
while projectivizing)
* Check for `Doc` (not just not `None`) when creating GoldParses for
pre-merged example
* Don't clobber `examples` variable in `iter_gold_docs()`
* Add/modify gold tests for handling projectivity
* In JSON roundtrip compare results from `dev_dataset` rather than
`train_dataset` to avoid projectivization (and other potential
modifications)
* Add test for projective train vs. nonprojective dev versions of the
same `Doc`
* Handle ignore_misaligned as arg rather than attr
Move `ignore_misaligned` from an attribute of `Example` to an argument
to `Example.get_gold_parses()`, which makes it parallel to
`make_projective`.
Add test with old and new align that checks whether `ignore_misaligned`
errors are raised as expected (only for new align).
* Remove unused attrs from gold.pxd
Remove `ignore_misaligned` and `make_projective` from `gold.pxd`
* Refer to Example.goldparse in iter_gold_docs()
Use `Example.goldparse` in `iter_gold_docs()` instead of `Example.gold`
because a `None` `GoldParse` is generated with ignore_misaligned and
generating it on-the-fly can raise an unwanted AlignmentError
* Update test for ignore_misaligned
* OrigAnnot class instead of gold.orig_annot list of zipped tuples
* from_orig to replace from_annot_tuples
* rename to RawAnnot
* some unit tests for GoldParse creation and internal format
* removing orig_annot and switching to lists instead of tuple
* rewriting tuples to use RawAnnot (+ debug statements, WIP)
* fix pop() changing the data
* small fixes
* pop-append fixes
* return RawAnnot for existing GoldParse to have uniform interface
* clean up imports
* fix merge_sents
* add unit test for 4402 with new structure (not working yet)
* introduce DocAnnot
* typo fixes
* add unit test for merge_sents
* rename from_orig to from_raw
* fixing unit tests
* fix nn parser
* read_annots to produce text, doc_annot pairs
* _make_golds fix
* rename golds_to_gold_annots
* small fixes
* fix encoding
* have golds_to_gold_annots use DocAnnot
* missed a spot
* merge_sents as function in DocAnnot
* allow specifying only part of the token-level annotations
* refactor with Example class + underlying dicts
* pipeline components to work with Example objects (wip)
* input checking
* fix yielding
* fix calls to update
* small fixes
* fix scorer unit test with new format
* fix kwargs order
* fixes for ud and conllu scripts
* fix reading data for conllu script
* add in proper errors (not fixed numbering yet to avoid merge conflicts)
* fixing few more small bugs
* fix EL script
* Xfail new tokenization test
* Put new alignment behind feature flag
* Move USE_ALIGN to top of the file [ci skip]
Co-authored-by: Ines Montani <ines@ines.io>
* Flag to ignore examples with mismatched raw/gold text
After #4525, we're seeing some alignment failures on our OntoNotes data. I think we actually have fixes for most of these cases.
In general it's better to fix the data, but it seems good to allow the GoldCorpus class to just skip cases where the raw text doesn't
match up to the gold words. I think previously we were silently ignoring these cases.
* Try to fix test on Python 2.7
* Fix get labels for textcat
* Fix char_embed for gpu
* Revert "Fix char_embed for gpu"
This reverts commit 055b9a9e85.
* Fix passing of cats in gold.pyx
* Revert "Match pop with append for training format (#4516)"
This reverts commit 8e7414dace.
* Fix popping gold parses
* Fix handling of cats in gold tuples
* Fix name
* Fix ner_multitask_objective script
* Add test for 4402
* trying to fix script - not succesful yet
* match pop() with extend() to avoid changing the data
* few more pop-extend fixes
* reinsert deleted print statement
* fix print statement
* add last tested version
* append instead of extend
* add in few comments
* quick fix for 4402 + unit test
* fixing number of docs (not counting cats)
* more fixes
* fix len
* print tmp file instead of using data from examples dir
* print tmp file instead of using data from examples dir (2)
* Support train dict format as JSONL
* Add (overly simple) check for dict vs. tuple to read JSONL lines as
either train dicts or train tuples
* Extend JSON/JSONL roundtrip conversion tests using `docs_to_json()`
and `GoldCorpus.train_tuples`
* Revert docs to default JSON output with convert
* raise specific error when removing a matcher rule that doesn't exist
* rephrasing
* goldparse init: allocate fields only if doc is not empty
* avoid zero length alloc in saving tokenizer cache
* avoid allocating zero length mem in matcher
* asserts to avoid allocating zero length mem
* fix zero-length allocation in matcher
* bump cymem version
* revert cymem version bump