* Add `spacy.PlainTextCorpusReader.v1`
This is a corpus reader that reads plain text corpora with the following
format:
- UTF-8 encoding
- One line per document.
- Blank lines are ignored.
It is useful for applications where we deal with very large corpora,
such as distillation, and don't want to deal with the space overhead of
serialized formats. Additionally, many large corpora already use such
a text format, keeping the necessary preprocessing to a minimum.
* Update spacy/training/corpus.py
Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
* docs: add version to `PlainTextCorpus`
* Add docstring to registry function
* Add plain text corpus tests
* Only strip newline/carriage return
* Add return type _string_to_tmp_file helper
* Use a temporary directory in place of file name
Different OS auto delete/sharing semantics are just wonky.
* This will be new in 3.5.1 (rather than 4)
* Test improvements from code review
Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
* Add `TrainablePipe.{distill,get_teacher_student_loss}`
This change adds two methods:
- `TrainablePipe::distill` which performs a training step of a
student pipe on a teacher pipe, giving a batch of `Doc`s.
- `TrainablePipe::get_teacher_student_loss` computes the loss
of a student relative to the teacher.
The `distill` or `get_teacher_student_loss` methods are also implemented
in the tagger, edit tree lemmatizer, and parser pipes, to enable
distillation in those pipes and as an example for other pipes.
* Fix stray `Beam` import
* Fix incorrect import
* Apply suggestions from code review
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
* Apply suggestions from code review
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
* TrainablePipe.distill: use `Iterable[Example]`
* Add Pipe.is_distillable method
* Add `validate_distillation_examples`
This first calls `validate_examples` and then checks that the
student/teacher tokens are the same.
* Update distill documentation
* Add distill documentation for all pipes that support distillation
* Fix incorrect identifier
* Apply suggestions from code review
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
* Add comment to explain `is_distillable`
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
* account for NER labels with a hyphen in the name
* cleanup
* fix docstring
* add return type to helper method
* shorter method and few more occurrences
* user helper method across repo
* fix circular import
* partial revert to avoid circular import
* factor out the WandB logger into spacy-loggers
Signed-off-by: Elia Robyn Speer <gh@arborelia.net>
* depend on spacy-loggers so they are available
Signed-off-by: Elia Robyn Speer <gh@arborelia.net>
* remove docs of spacy.WandbLogger.v2 (moved to spacy-loggers)
Signed-off-by: Elia Robyn Speer <elia@explosion.ai>
* Version number suggestions from code review
Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
* update references to WandbLogger
Signed-off-by: Elia Robyn Speer <elia@explosion.ai>
* make order of deps more consistent
Signed-off-by: Elia Robyn Speer <elia@explosion.ai>
Co-authored-by: Elia Robyn Speer <elia@explosion.ai>
Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
* Add callback to copy vocab/tokenizer from model
Add callback `spacy.copy_from_base_model.v1` to copy the tokenizer
settings and/or vocab (including vectors) from a base model.
* Move spacy.copy_from_base_model.v1 to spacy.training.callbacks
* Add documentation
* Modify to specify model as tokenizer and vocab params
* Replace pytokenizations with internal alignment
Replace pytokenizations with internal alignment algorithm that is
restricted to only allow differences in whitespace and capitalization.
* Rename `spacy.training.align` to `spacy.training.alignment` to contain
the `Alignment` dataclass
* Implement `get_alignments` in `spacy.training.align`
* Refactor trailing whitespace handling
* Remove unnecessary exception for empty docs
Allow a non-empty whitespace-only doc to be aligned with an empty doc
* Remove empty docs exceptions completely
* rename Pipe to TrainablePipe
* split functionality between Pipe and TrainablePipe
* remove unnecessary methods from certain components
* cleanup
* hasattr(component, "pipe") should be sufficient again
* remove serialization and vocab/cfg from Pipe
* unify _ensure_examples and validate_examples
* small fixes
* hasattr checks for self.cfg and self.vocab
* make is_resizable and is_trainable properties
* serialize strings.json instead of vocab
* fix KB IO + tests
* fix typos
* more typos
* _added_strings as a set
* few more tests specifically for _added_strings field
* bump to 3.0.0a36
* Support data augmentation in Corpus
* Note initial docs for data augmentation
* Add augmenter to quickstart
* Fix flake8
* Format
* Fix test
* Update spacy/tests/training/test_training.py
* Improve data augmentation arguments
* Update templates
* Move randomization out into caller
* Refactor
* Update spacy/training/augment.py
* Update spacy/tests/training/test_training.py
* Fix augment
* Fix test