* Add callback to copy vocab/tokenizer from model
Add callback `spacy.copy_from_base_model.v1` to copy the tokenizer
settings and/or vocab (including vectors) from a base model.
* Move spacy.copy_from_base_model.v1 to spacy.training.callbacks
* Add documentation
* Modify to specify the base model via separate tokenizer and vocab params
* Update sent_starts in Example.from_dict
Update `sent_starts` for `Example.from_dict` so that `Optional[bool]`
values have the same meaning as for `Token.is_sent_start`.
Use `Optional[bool]` as the type for sent start values in the docs.
* Use helper function for conversion to ternary ints
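
The `sent_starts` conversion described above can be sketched roughly as follows. This is a minimal illustration, not the code from the commit: the helper name `ternary_from_optional_bool` is hypothetical, and the mapping assumes the `Token.is_sent_start` convention (`True` for a sentence start, `False` for a known non-start, `None` for unknown) with 1, -1 and 0 as the corresponding ints.

```python
from typing import List, Optional


def ternary_from_optional_bool(value: Optional[bool]) -> int:
    """Map an Optional[bool] sentence-start flag to a ternary int.

    Hypothetical helper for illustration only:
    True -> 1 (sentence start), False -> -1 (not a start), None -> 0 (unknown).
    """
    if value is None:
        return 0
    return 1 if value else -1


# Example input in the style of a `sent_starts` annotation list.
sent_starts: List[Optional[bool]] = [True, False, None, False]
ternary = [ternary_from_optional_bool(v) for v in sent_starts]
```

Keeping `None` distinct from `False` is the point of the change: a missing annotation should not be read as "this token does not start a sentence".
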
* Replace negative rows with 0 in StaticVectors
Replace negative row indices with 0-vectors in `StaticVectors`.
* Increase versions related to StaticVectors
* Increase versions of all architectures and layers related to
`StaticVectors`
* Improve efficiency of 0-vector operations
Parallel `spacy-legacy` PR: https://github.com/explosion/spacy-legacy/pull/5
* Update config defaults to new versions
* Update docs
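
As a rough illustration of the `StaticVectors` change above, the sketch below returns a 0-vector for any negative row index instead of indexing into the table. It uses plain Python lists rather than the array operations the real layer relies on, and the function name `lookup_rows` is hypothetical.

```python
from typing import List, Sequence


def lookup_rows(table: Sequence[Sequence[float]], rows: Sequence[int]) -> List[List[float]]:
    """Look up vectors by row index, using a 0-vector for negative rows.

    Hypothetical sketch: without the explicit check, a negative index
    such as -1 would wrap around and silently return the last row.
    """
    width = len(table[0]) if table else 0
    output = []
    for row in rows:
        if row < 0:
            # Negative row: emit a fresh 0-vector of the same width.
            output.append([0.0] * width)
        else:
            output.append(list(table[row]))
    return output


vectors = [[1.0, 2.0], [3.0, 4.0]]
result = lookup_rows(vectors, [1, -1, 0])
```
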
* extend span scorer with consider_label and allow_overlap
* unit test for spans y2x overlap
* add score_spans unit test
* docs for new fields in scorer.score_spans
* rename to include_label
* spell out if-else for clarity
* rename to 'labeled'
Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
* Support match alignments
* rename match_alignments to with_alignments, add conditional flow for when with_alignments is given, validate with_alignments, and add a related test case
* remove added errors, utilize bint type, cleanup whitespace
* add missing newline at end of file
* Minor formatting
* Skip alignments processing if as_spans is set
* Add with_alignments to Matcher API docs
* Update website/docs/api/matcher.md
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
* add multi-label textcat to menu
* add infobox on textcat API
* add info to v3 migration guide
* small edits
* further fixes in doc strings
* add infobox to textcat architectures
* add textcat_multilabel to overview of built-in components
* spelling
* fix unrelated warn msg
* Add textcat_multilabel to quickstart [ci skip]
* remove separate documentation page for multilabel_textcategorizer
* small edits
* positive label clarification
* avoid duplicating information in self.cfg and fix textcat.score
* fix multilabel textcat too
* revert threshold to storage in cfg
* revert threshold stuff for multi-textcat
Co-authored-by: Ines Montani <ines@ines.io>
* initialize NLP with train corpus
* add more pretraining tests
* more tests
* function to fetch tok2vec layer for pretraining
* clarify parameter name
* test different objectives
* formatting
* fix check for static vectors when using vectors objective
* clarify docs
* logger statement
* fix init_tok2vec and proc.initialize order
* test training after pretraining
* add init_config tests for pretraining
* pop pretraining block to avoid config validation errors
* custom errors
* Add regression test
* Run PhraseMatcher on Spans
* Add test for PhraseMatcher on Spans and Docs
* Add SCA
* Add test with 3 matches in Doc, 1 match in Span
* Update docs
* Use doc.length for find_matches in tokenizer
Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>