* Don't add duplicate patterns (fix#8216)
* Refactor EntityRuler init
This simplifies the EntityRuler init code. This is helpful as prep for
allowing the EntityRuler to reset itself.
* Make EntityRuler.clear reset matchers
Includes a new test for this.
* Tidy PhraseMatcher instantiation
Since the attr can be None safely now, the guard if is no longer
required here.
Also renamed the `_validate` attr. Maybe it's not needed?
* Fix NER test
* Add test to make sure patterns aren't increasing
* Move test to regression tests
The behavior of `spacy.Corpus.v1` is unexpected enough for `max_length
!= 0` that `0` is a better default for users creating a new config with
the quickstart.
If not, documents are skipped, sometimes the entire corpus is skipped,
and sometimes documents are (quite unexpectedly for your average user)
split into sentences.
* Adapt tokenization methods from `pyvi` to preserve text encoding and
whitespace
* Add serialization support similar to Chinese and Japanese
Note: as for Chinese and Japanese, some settings are duplicated in
`config.cfg` and `tokenizer/cfg`.
* Add training option to set annotations on update
Add a `[training]` option called `set_annotations_on_update` to specify
a list of components for which the predicted annotations should be set
on `example.predicted` immediately after that component has been
updated. The predicted annotations can be accessed by later components
in the pipeline during the processing of the batch in the same `update`
call.
* Rename to annotates / annotating_components
* Add test for `annotating_components` when training from config
* Add documentation
* Add empty lines at the end of Python files
* Only prepend the lang code if it's not there already
* Update spacy/cli/package.py
* fix whitespace stripping
* Set up CI for tests with GPU agent
* Update tests for enabled GPU
* Fix steps filename
* Add parallel build jobs as a setting
* Fix test requirements
* Fix install test requirements condition
* Fix pipeline models test
* Reset current ops in prefer/require testing
* Fix more tests
* Remove separate test_models test
* Fix regression 5551
* fix StaticVectors for GPU use
* fix vocab tests
* Fix regression test 5082
* Move azure steps to .github and reenable default pool jobs
* Consolidate/rename azure steps
Co-authored-by: svlandeg <sofie.vanlandeghem@gmail.com>
* Add callback to copy vocab/tokenizer from model
Add callback `spacy.copy_from_base_model.v1` to copy the tokenizer
settings and/or vocab (including vectors) from a base model.
* Move spacy.copy_from_base_model.v1 to spacy.training.callbacks
* Add documentation
* Modify to specify model as tokenizer and vocab params
* Update sent_starts in Example.from_dict
Update `sent_starts` for `Example.from_dict` so that `Optional[bool]`
values have the same meaning as for `Token.is_sent_start`.
Use `Optional[bool]` as the type for sent start values in the docs.
* Use helper function for conversion to ternary ints
* Fix tokenizer cache flushing
Fix/simplify tokenizer init detection in order to fix cache flushing
when properties are modified.
* Remove init reloading logic
* Remove logic disabling `_reload_special_cases` on init
* Setting `rules` last in `__init__` (as before) means that setting
other properties doesn't reload any special cases
* Reset `rules` first in `from_bytes` so that setting other properties
during deserialization doesn't reload any special cases
unnecessarily
* Reset all properties in `Tokenizer.from_bytes` to allow any settings
to be `None`
* Also reset special matcher when special cache is flushed
* Remove duplicate special case validation
* Add test for special cases flushing
* Extend test for tokenizer deserialization of None values
* Replace negative rows with 0 in StaticVectors
Replace negative row indices with 0-vectors in `StaticVectors`.
* Increase versions related to StaticVectors
* Increase versions of all architctures and layers related to
`StaticVectors`
* Improve efficiency of 0-vector operations
Parallel `spacy-legacy` PR: https://github.com/explosion/spacy-legacy/pull/5
* Update config defaults to new versions
* Update docs
* Update processing-pipelines.md
Under "things to try," inform users they can save metadata when using nlp.pipe(foobar, as_tuples=True)
Link to a new example on the attributes page detailing the following:
> ```
> data = [
> ("Some text to process", {"meta": "foo"}),
> ("And more text...", {"meta": "bar"})
> ]
>
> for doc, context in nlp.pipe(data, as_tuples=True):
> # Let's assume you have a "meta" extension registered on the Doc
> doc._.meta = context["meta"]
> ```
from https://stackoverflow.com/questions/57058798/make-spacy-nlp-pipe-process-tuples-of-text-and-additional-information-to-add-as
* Updating the attributes section
Update the attributes section with example of how extensions can be used to store metadata.
* Update processing-pipelines.md
* Update processing-pipelines.md
Made as_tuples example executable and relocated to the end of the "Processing Text" section.
* Update processing-pipelines.md
* Update processing-pipelines.md
Removed extra line
* Reformat and rephrase
Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
* Update Tokenizer.explain with special matches
Update `Tokenizer.explain` and the pseudo-code in the docs to include
the processing of special cases that contain affixes or whitespace.
* Handle optional settings in explain
* Add test for special matches in explain
Add test for `Tokenizer.explain` for special cases containing affixes.
* Set catalogue lower pin to v2.0.2
* Update importlib-metadata pins to match
* Require catalogue v2.0.3
Switch to vendored `importlib-metadata` v3.2.0 provided by `catalogue`.
* ensure vectors data is stored on right device
* ensure the added vector is on the right device
* move vector to numpy before iterating
* move best_rows to numpy before iterating
* Terminology: deprecated vs obsolete
Typically, deprecated is used for functionality that is bound to become unavailable but that can still be used. Obsolete is used for features that have been removed. In E941, I think what is meant is "obsolete" since loading a model by a shortcut simply does not work anymore (and throws an error). This is different from downloading a model with a shortcut, which is deprecated but still works.
In light of this, perhaps all other error codes should be checked as well.
* clarify that the link command is removed and not just deprecated
Co-authored-by: svlandeg <sofie.vanlandeghem@gmail.com>
* Update debug data further for v3
* Remove new/existing label distinction (new labels are not immediately
distinguishable because the pipeline is already initialized)
* Warn on missing labels in training data for all components except parser
* Separate textcat and textcat_multilabel sections
* Add section for morphologizer
* Reword missing label warnings
* Make vocab update in get_docs deterministic
The attribute `DocBin.strings` is a set. In `DocBin.get_docs`
a given vocab is updated by iterating over this set.
Iteration over a python set produces an arbitrary ordering,
therefore vocab is updated non-deterministically.
When training (fine-tuning) a spacy model, the base model's
vocabulary will be updated with the new vocabulary in the
training data in exactly the way described above. After
serialization, the file `model/vocab/strings.json` will
be sorted in an arbitrary way. This prevents reproducible
model training.
* Revert "Make vocab update in get_docs deterministic"
This reverts commit d6b87a2f55.
* Sort strings in StringStore serialization
Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>