* Avoid `TrainablePipe.finish_update` getting called twice during training
PR #12136 fixed an issue where the tok2vec pipe was updated before
gradients were accumulated. However, it introduced a new bug that caused
`finish_update` to be called twice when using the training loop. This
results in a fairly large slowdown.
The `Language.update` method accepts the `sgd` argument for passing an
optimizer. This argument has three possible values:
- `Optimizer`: use the given optimizer to finish pipe updates.
- `None`: use a default optimizer to finish pipe updates.
- `False`: do not finish pipe updates.
However, the last option was not documented and not valid with the
existing type of `sgd`. I assumed that this was a remnant of earlier
spaCy versions and removed handling of `False`.
However, with that change, we were passing `None` to `Language.update`.
As a result, we were calling `finish_update` in both `Language.update`
and in the training loop after all subbatches are processed.
This change restores proper handling/use of `False`. Moreover, the role
of `False` is now documented and added to the type to avoid future
accidents.
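
The three-valued `sgd` argument can be illustrated with a small toy model.
This is only a sketch of the behavior described above, not spaCy's code:
`CountingOptimizer`, `DEFAULT_OPTIMIZER`, `update` and `train_step` are
hypothetical names, and "finishing" an update is reduced to incrementing a
counter.

```python
from typing import List, Literal, Union


class CountingOptimizer:
    """Hypothetical optimizer stand-in that records how often it is applied."""

    def __init__(self) -> None:
        self.finish_calls = 0


# Hypothetical default optimizer, used when `sgd` is None.
DEFAULT_OPTIMIZER = CountingOptimizer()


def update(
    batch: List[str],
    sgd: Union[CountingOptimizer, None, Literal[False]] = None,
) -> None:
    """Toy update step with a three-valued `sgd` argument."""
    # Gradients for `batch` would be accumulated here.
    if sgd is False:
        # Do not finish the update; the caller does so later.
        return
    optimizer = DEFAULT_OPTIMIZER if sgd is None else sgd
    optimizer.finish_calls += 1


def train_step(subbatches: List[List[str]], optimizer: CountingOptimizer) -> None:
    """Accumulate over all subbatches, then finish the update exactly once."""
    for subbatch in subbatches:
        update(subbatch, sgd=False)
    optimizer.finish_calls += 1


optimizer = CountingOptimizer()
train_step([["a", "b"], ["c"]], optimizer)
```

In this model, passing `None` instead of `False` inside `train_step` would
finish the update once per subbatch and once more at the end, which is the
kind of duplicate work this change avoids.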
* Fix typo
* Document defaults for `Language.update`
* Convert Candidate from Cython to Python class.
* Format.
* Fix .entity_ typo in _add_activations() usage.
* Change the type of the mentions for which entity candidates are looked up from Iterable[Span] to SpanGroup.
* Update docs.
* Update spacy/kb/candidate.py
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
* Update doc string of BaseCandidate.__init__().
* Update spacy/kb/candidate.py
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
* Rename Candidate to InMemoryCandidate, BaseCandidate to Candidate.
* Adjust Candidate to support and mandate numerical entity IDs.
* Format.
* Fix docstring and docs.
* Update website/docs/api/kb.mdx
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
* Rename alias -> mention.
* Refactor Candidate attribute names. Update docs and tests accordingly.
* Refactor Candidate attributes and their usage.
* Format.
* Fix mypy error.
* Update error code in line with v4 convention.
* Reverse erroneous changes during merge.
* Update return type in EL tests.
* Re-add Candidate to setup.py.
* Format updated docs.
---------
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
* Clean up Vocab constructor
* Change effective type of `strings` from `Iterable[str]` to `Optional[StringStore]`
* Don't automatically add strings to vocab
* Change default values to `None`
* Remove `**deprecated_kwargs`
* Format
* Make empty_kb() configurable.
* Format.
* Update docs.
* Be more specific in KB serialization test.
* Update KB serialization tests. Update docs.
* Remove doc update for batched candidate generation.
* Fix serialization of subclassed KB in tests.
* Format.
* Update docstring.
* Update docstring.
* Switch from pickle to json for custom field serialization.
* Add immediate left/right child/parent dependency relations
* Add tests for new REL_OPs: `>+`, `>-`, `<+`, and `<-`.
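
The four relations can be sketched on a toy parse in plain Python. This is
an illustration only, not the matcher's implementation: the helper names and
the `heads` list are hypothetical, and the mapping of each operator symbol to
a direction is an assumption based on the "immediate left/right child/parent"
wording above.

```python
from typing import List

# Hypothetical toy parse: heads[i] is the index of token i's head
# (the root points at itself).
words = ["She", "quickly", "ran", "home"]
heads = [2, 2, 2, 2]


def immediate_right_child(heads: List[int], a: int, b: int) -> bool:
    """Assumed meaning of `A >+ B`: B is a child of A, directly to A's right."""
    return heads[b] == a and a == b - 1


def immediate_left_child(heads: List[int], a: int, b: int) -> bool:
    """Assumed meaning of `A >- B`: B is a child of A, directly to A's left."""
    return heads[b] == a and a == b + 1


def immediate_right_parent(heads: List[int], a: int, b: int) -> bool:
    """Assumed meaning of `A <+ B`: B is the head of A, directly to A's right."""
    return heads[a] == b and a == b - 1


def immediate_left_parent(heads: List[int], a: int, b: int) -> bool:
    """Assumed meaning of `A <- B`: B is the head of A, directly to A's left."""
    return heads[a] == b and a == b + 1
```

With this parse, "home" is an immediate right child of "ran", while "She" is
a child of "ran" but not an immediate one, because "quickly" sits between
them.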
---------
Co-authored-by: Tan Long <tanloong@foxmail.com>
* Fix FUZZY operator definition
The default edit distance of the FUZZY operator is 2, not 3.
* adjust edit distance in matcher usage docs too
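
For illustration, an edit-distance threshold of 2 can be sketched as follows.
This assumes the distance is a plain Levenshtein distance compared against a
fixed limit; the matcher's actual rule may differ (for example, it may also
depend on the pattern length). `levenshtein` and `fuzzy_match` are
hypothetical helpers, not spaCy functions.

```python
def levenshtein(a: str, b: str) -> int:
    """Levenshtein distance computed row by row with dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(
                min(
                    prev[j] + 1,  # deletion
                    curr[j - 1] + 1,  # insertion
                    prev[j - 1] + cost,  # substitution or match
                )
            )
        prev = curr
    return prev[-1]


def fuzzy_match(pattern: str, text: str, max_dist: int = 2) -> bool:
    """Accept `text` if it is within `max_dist` edits of `pattern` (assumed rule)."""
    return levenshtein(pattern, text) <= max_dist
```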
---------
Co-authored-by: svlandeg <svlandeg@github.com>
* Remove backwards-compatible overwrite from Entity Linker
This also adds a docstring about overwrite, since it wasn't present.
* Fix docstring
* Remove backward compat settings in Morphologizer
This also needed a docstring added.
For this component it's less clear what the right overwrite settings
are.
* Remove backward compat from sentencizer
This was simple
* Remove backward compat from senter
Another simple one
* Remove backward compat setting from tagger
* Add docstrings
* Update spacy/pipeline/morphologizer.pyx
Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
* Update docs
---------
Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
* Init
* fix tests
* Update spacy/errors.py
Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
* Fix test_blank_languages
* Rename xx to mul in docs
* Format _util with black
* prettier formatting
---------
Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
* Add `Language.distill`
This method is the distillation counterpart of `Language.update`. It
takes a teacher `Language` instance and distills the student pipes on
the teacher pipes.
* Apply suggestions from code review
Co-authored-by: Madeesh Kannan <shadeMe@users.noreply.github.com>
* Clarify how Example is used in distillation
* Update transition parser distill docstring for examples argument
* Pass optimizer to `TrainablePipe.distill`
* Annotate pipe before update
As discussed internally, we want to let a pipe annotate before doing an
update with gold/silver data. Otherwise, the output may be (too)
informed by the gold/silver data.
* Rename `component_map` to `student_to_teacher`
* Better synopsis in `Language.distill` docstring
* `name` -> `student_name`
* Fix labels type in docstring
* Mark distill test as slow
* Fix `student_to_teacher` type in docs
---------
Co-authored-by: Madeesh Kannan <shadeMe@users.noreply.github.com>
* Add span_id to Span.char_span, update Doc/Span.char_span docs
`Span.char_span(id=)` should be removed in the future.
* Also use Union[int, str] in Doc docstring
* Add `spacy.PlainTextCorpusReader.v1`
This is a corpus reader that reads plain text corpora with the following
format:
- UTF-8 encoding
- One line per document.
- Blank lines are ignored.
It is useful for applications where we deal with very large corpora,
such as distillation, and don't want to deal with the space overhead of
serialized formats. Additionally, many large corpora already use such
a text format, keeping the necessary preprocessing to a minimum.
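
The format above can be read with a few lines of plain Python. This is a
minimal sketch of the format only, not the registered reader:
`read_plain_text` is a hypothetical helper that yields raw strings, and it
strips only newline and carriage return characters, so a line is treated as
blank only if nothing else remains.

```python
import tempfile
from pathlib import Path
from typing import Iterator


def read_plain_text(path: Path) -> Iterator[str]:
    """Yield one document per non-blank line of a UTF-8 text file."""
    with path.open("r", encoding="utf8") as f:
        for line in f:
            # Strip only line terminators; other whitespace is kept.
            text = line.rstrip("\r\n")
            if text:
                yield text


# Use a temporary directory rather than a named temporary file.
with tempfile.TemporaryDirectory() as tmp_dir:
    corpus_path = Path(tmp_dir) / "corpus.txt"
    corpus_path.write_text(
        "First document.\n\nSecond document. \n", encoding="utf8"
    )
    docs = list(read_plain_text(corpus_path))
```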
* Update spacy/training/corpus.py
Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
* docs: add version to `PlainTextCorpus`
* Add docstring to registry function
* Add plain text corpus tests
* Only strip newline/carriage return
* Add return type _string_to_tmp_file helper
* Use a temporary directory in place of file name
Different OS auto delete/sharing semantics are just wonky.
* This will be new in 3.5.1 (rather than 4)
* Test improvements from code review
Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
* Don't pass mem pool to new lexeme function
* Remove unused mem from function args
Two methods calling _new_lexeme, get and get_by_orth, took mem arguments
just to call the internal method. That's no longer necessary, so this
cleans it up.
* prettier formatting
* Remove more unused mem args
* Rename CSS class to make use more clear
* Rename component prop to improve code readability
* Fix `aria-hidden` directly on a link element
This link wouldn't have been clickable by screenreaders
* Refactor component
This removes an unnecessary `div` and a duplicate link
Co-authored-by: Ines Montani <ines@ines.io>
Originally introduced in 62b9c9c6d7
Original error: Warning: Invalid DOM property `class`. Did you mean `className`?
React doesn't have `class`, it uses `className`.