Validate both `[initialize]` and `[training]` in `debug data` and
`nlp.initialize()` with separate config validation error blocks that
indicate which block of the config is being validated.
Add `initialize.before_init` and `initialize.after_init` callbacks to
the config. The `initialize.before_init` callback is a place to
implement one-time tokenizer customizations that are then saved with the
model.
* Update stop_words.py
Added three aditional stopwords: "a" and "o" that means "the", and "e" that means "and"
* Create cristianasp.md
* zero edit to push CI
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
* fix TorchBiLSTMEncoder documentation
* ensure the types of the encoding Tok2vec layers are correct
* update references from v1 to v2 for the new architectures
* add syntax iterators for danish
* add test noun chunks for danish syntax iterators
* add contributor agreement
* update da syntax iterators to remove nested chunks
* add tests for da noun chunks
* Fix test
* add missing import
* fix example
* Prevent overlapping noun chunks
Prevent overlapping noun chunks by tracking the end index of the
previous noun chunk span.
Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
* clean up of ner tests
* beam_parser tests
* implement get_beam_parses and scored_parses for the dep parser
* we don't have to add the parse if there are no arcs
* add convenience method to determine tok2vec width in a model
* fix transformer tok2vec dimensions in TextCatEnsemble architecture
* init function should not be nested to avoid pickle issues
* small fixes and formatting
* bring test_issue4313 up-to-date, currently fails
* formatting
* add get_beam_parses method back
* add scored_ents function
* delete tag map
Instead of unsetting lemmas on retokenized tokens, set the default
lemmas to:
* merge: concatenate any existing lemmas with `SPACY` preserved
* split: use the new `ORTH` values if lemmas were previously set,
otherwise leave unset
* multi-label textcat component
* formatting
* fix comment
* cleanup
* fix from #6481
* random edit to push the tests
* add explicit error when textcat is called with multi-label gold data
* fix error nr
* small fix
* Fix memory issues in Language.evaluate
Reset annotation in predicted docs before evaluating and store all data
in `examples`.
* Minor refactor to docs generator init
* Fix generator expression
* Fix final generator check
* Refactor pipeline loop
* Handle examples generator in Language.evaluate
* Add test with generator
* Use make_doc
* Add Amharic to space
* clean up
* Add some PRON_LEMMA
* add Tigrinya support
* remove text_noun_chunks
* Tigrinya Support
* added some more details for ti
* fix unit test
* add amharic char range
* changes from review
* amharic and tigrinya share same unicode block
* get rid of _amharic/_tigrinya in char_classes
Co-authored-by: Josiah Solomon <jsolomon@meteorcomm.com>