* Replace EntityRuler with SpanRuler implementation
Remove `EntityRuler` and rename the `SpanRuler`-based
`future_entity_ruler` to `entity_ruler`.
Main changes:
* It is no longer possible to load patterns on init as with
`EntityRuler(patterns=)`.
* The older serialization formats (`patterns.jsonl`) are no longer
supported and the related tests are removed.
* The config settings are only stored in the config, not in the
serialized component (in particular the `phrase_matcher_attr` and
overwrite settings).
* Add migration guide to EntityRuler API docs
* docs update
* Minor edit
Co-authored-by: svlandeg <svlandeg@github.com>
* Fix flag handling in dvc
Prior to this commit, if a flag (--verbose or --quiet) was passed to
DVC, it would be added to the end of the generated dvc command line.
This would result in the command being interpreted as part of the actual
command to run, rather than an argument to dvc. This would result in
command lines like:
spacy project run preprocess --verbose
That would fail with an error that there's no such directory as
`--verbose`.
This change puts the flags at the front of the dvc command so that they
are interpreted correctly. It removes the `run_dvc_commands` function,
which had been reduced to just a for loop and wasn't used elsewhere.
A separate problem is that there's no way to specify the quiet behaviour
to dvc from the command line, though it's unclear if that's a bug.
* Add dvc quiet flag to docs
* Handle case in DVC where no commands are appropriate
If only have commands with no deps or outputs (admittedly unlikely), you
get a weird error about the dvc file not existing. This gives explicit
output instead.
* Add support for quiet flag
* Fix command execution
Commands are strings now because they're joined further up.
* `strings`: Remove unused `hash32_utf8` function
* `strings`: Make `hash_utf8` and `decode_Utf8Str` private
* `strings`: Reorganize private functions
* 'strings': Raise error when non-string/-int types are passed to functions that don't accept them
* `strings`: Add `items()` method, add type hints, remove unused methods, restrict inputs to specific types, reorganize methods
* `Morphology`: Use `StringStore.items()` to enumerate features when pickling
* `test_stringstore`: Update pre-Python 3 tests
* Update `StringStore` docs
* Fix `get_string_id` imports
* Replace redundant test with tests for type checking
* Rename `_retrieve_interned_str`, remove `.get` default arg
* Add `get_string_id` to `strings.pyi`
Remove `mypy` ignore directives from imports of the above
* `strings.pyi`: Replace functions that consume `Union`-typed params with overloads
* `strings.pyi`: Revert some function signatures
* Update `SYMBOLS_BY_INT` lookups and error codes post-merge
* Revert clobbered change introduced in a previous merge
* Remove unnecessary type hint
* Invert tuple order in `StringStore.items()`
* Add test for `StringStore.items()`
* Revert "`Morphology`: Use `StringStore.items()` to enumerate features when pickling"
This reverts commit 1af9510ceb.
* Rename `keys` and `key_map`
* Add `keys()` and `values()`
* Add comment about the inverted key-value semantics in the API
* Fix type hints
* Implement `keys()`, `values()`, `items()` without generators
* Fix type hints, remove unnecessary boxing
* Update docs
* Simplify `keys/values/items()` impl
* `mypy` fix
* Fix error message, doc fixes
* Add experimental coref docs
* Docs cleanup
* Apply suggestions from code review
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
* Apply changes from code review
* Fix prettier formatting
It seems a period after a number made this think it was a list?
* Update docs on examples for initialize
* Add docs for coref scorers
* Remove 3.4 notes from coref
There won't be a "new" tag until it's in core.
* Add docs for span cleaner
* Fix docs
* Fix docs to match spacy-experimental
These weren't properly updated when the code was moved out of spacy
core.
* More doc fixes
* Formatting
* Update architectures
* Fix links
* Fix another link
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
Co-authored-by: svlandeg <svlandeg@github.com>
* Remove side effects from Doc.__init__()
* Changes based on review comment
* Readd test
* Change interface of Doc.__init__()
* Simplify test
Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
* Update doc.md
Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
* Store activations in Doc when `store_activations` is enabled
This change adds the new `activations` attribute to `Doc`. This
attribute can be used by trainable pipes to store their activations,
probabilities, and guesses for downstream users.
As an example, this change modifies the `tagger` and `senter` pipes to
add an `store_activations` option. When this option is enabled, the
probabilities and guesses are stored in `set_annotations`.
* Change type of `store_activations` to `Union[bool, List[str]]`
When the value is:
- A bool: all activations are stored when set to `True`.
- A List[str]: the activations named in the list are stored
* Formatting fixes in Tagger
* Support store_activations in spancat and morphologizer
* Make Doc.activations type visible to MyPy
* textcat/textcat_multilabel: add store_activations option
* trainable_lemmatizer/entity_linker: add store_activations option
* parser/ner: do not currently support returning activations
* Extend tagger and senter tests
So that they, like the other tests, also check that we get no
activations if no activations were requested.
* Document `Doc.activations` and `store_activations` in the relevant pipes
* Start errors/warnings at higher numbers to avoid merge conflicts
Between the master and v4 branches.
* Add `store_activations` to docstrings.
* Replace store_activations setter by set_store_activations method
Setters that take a different type than what the getter returns are still
problematic for MyPy. Replace the setter by a method, so that type inference
works everywhere.
* Use dict comprehension suggested by @svlandeg
* Revert "Use dict comprehension suggested by @svlandeg"
This reverts commit 6e7b958f70.
* EntityLinker: add type annotations to _add_activations
* _store_activations: make kwarg-only, remove doc_scores_lens arg
* set_annotations: add type annotations
* Apply suggestions from code review
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
* TextCat.predict: return dict
* Make the `TrainablePipe.store_activations` property a bool
This means that we can also bring back `store_activations` setter.
* Remove `TrainablePipe.activations`
We do not need to enumerate the activations anymore since `store_activations` is
`bool`.
* Add type annotations for activations in predict/set_annotations
* Rename `TrainablePipe.store_activations` to `save_activations`
* Error E1400 is not used anymore
This error was used when activations were still `Union[bool, List[str]]`.
* Change wording in API docs after store -> save change
* docs: tag (save_)activations as new in spaCy 4.0
* Fix copied line in morphologizer activations test
* Don't train in any test_save_activations test
* Rename activations
- "probs" -> "probabilities"
- "guesses" -> "label_ids", except in the edit tree lemmatizer, where
"guesses" -> "tree_ids".
* Remove unused W400 warning.
This warning was used when we still allowed the user to specify
which activations to save.
* Formatting fixes
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
* Replace "kb_ids" by a constant
* spancat: replace a cast by an assertion
* Fix EOF spacing
* Fix comments in test_save_activations tests
* Do not set RNG seed in activation saving tests
* Revert "spancat: replace a cast by an assertion"
This reverts commit 0bd5730d16.
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
* replicate bug with tok2vec in annotating components
* add overfitting test with a frozen tok2vec
* remove broadcast from predict and check doc.tensor instead
* remove broadcast
* proper error
* slight rephrase of documentation
* Add a dry run flag to download
* Remove --dry-run, add --url option to `spacy info` instead
* Make mypy happy
* Print only the URL, so it's easier to use in scripts
* Don't add the egg hash unless downloading an sdist
* Update spacy/cli/info.py
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
* Add two implementations of requirements
* Clean up requirements sample slightly
This should make mypy happy
* Update URL help string
* Remove requirements option
* Add url option to docs
* Add URL to spacy info model output, when available
* Add types-setuptools to testing reqs
* Add types-setuptools to requirements
* Add "compatible", expand docstring
* Update spacy/cli/info.py
Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
* Run prettier on CLI docs
* Update docs
Add a sidebar about finding download URLs, with some examples of the new
command.
* Add download URLs to table on model page
* Apply suggestions from code review
Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
* Updates from review
* download url -> download link
* Update docs
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
* adding unit test for spacy.load with disable/exclude string arg
* allow pure strings in from_config
* update docs
* upstream type adjustements
* docs update
* make docstring more consistent
* Update spacy/language.py
Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
* two more cleanups
* fix type in internal method
Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
* Clean up old Matcher call style related stuff
In v2 Matcher.add was called with (key, on_match, *patterns). In v3 this
was changed to (key, patterns, *, on_match=None), but there were various
points where the old call syntax was documented or handled specially.
This removes all those.
The Matcher itself didn't need any code changes, as it just gives a
generic type error. However the PhraseMatcher required some changes
because it would automatically "fix" the old call style.
Surprisingly, the tokenizer was still using the old call style in one
place.
After these changes tests failed in two places:
1. one test for the "new" call style, including the "old" call style. I
removed this test.
2. deserializing the PhraseMatcher fails because the input docs are a
set.
I am not sure why 2 is happening - I guess it's a quirk of the
serialization format? - so for now I just convert the set to a list when
deserializing. The check that the input Docs are a List in the
PhraseMatcher is a new check, but makes it parallel with the other
Matchers, which seemed like the right thing to do.
* Add notes related to input docs / deserialization type
* Remove Typing import
* Remove old note about call style change
* Apply suggestions from code review
Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
* Use separate method for setting internal doc representations
In addition to the title change, this changes the internal dict to be a
defaultdict, instead of a dict with frequent use of setdefault.
* Add _add_from_arrays for unpickling
* Cleanup around adding from arrays
This moves adding to internal structures into the private batch method,
and removes the single-add method.
This has one behavioral change for `add`, in that if something is wrong
with the list of input Docs (such as one of the items not being a Doc),
valid items before the invalid one will not be added. Also the callback
will not be updated if anything is invalid. This change should not be
significant.
This also adds a test to check failure when given a non-Doc.
* Update spacy/matcher/phrasematcher.pyx
Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
* Add lang folder for la (Latin)
* Add Latin lang classes
* Add minimal tokenizer exceptions
* Add minimal stopwords
* Add minimal lex_attrs
* Update stopwords, tokenizer exceptions
* Add la tests; register la_tokenizer in conftest.py
* Update spacy/lang/la/lex_attrs.py
Remove duplicate form in Latin lex_attrs
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
* Update natto-py version spec (#11222)
* Update natto-py version spec
* Update setup.cfg
Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
* Add scorer to textcat API docs config settings (#11263)
* Update docs for pipeline initialize() methods (#11221)
* Update documentation for dependency parser
* Update documentation for trainable_lemmatizer
* Update documentation for entity_linker
* Update documentation for ner
* Update documentation for morphologizer
* Update documentation for senter
* Update documentation for spancat
* Update documentation for tagger
* Update documentation for textcat
* Update documentation for tok2vec
* Run prettier on edited files
* Apply similar changes in transformer docs
* Remove need to say annotated example explicitly
I removed the need to say "Must contain at least one annotated Example"
because it's often a given that Examples will contain some gold-standard
annotation.
* Run prettier on transformer docs
* chore: add 'concepCy' to spacy universe (#11255)
* chore: add 'concepCy' to spacy universe
* docs: add 'slogan' to concepCy
* Support full prerelease versions in the compat table (#11228)
* Support full prerelease versions in the compat table
* Fix types
* adding spans to doc_annotation in Example.to_dict (#11261)
* adding spans to doc_annotation in Example.to_dict
* to_dict compatible with from_dict: tuples instead of spans
* use strings for label and kb_id
* Simplify test
* Update data formats docs
Co-authored-by: Stefanie Wolf <stefanie.wolf@vitecsoftware.com>
Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
* Fix regex invalid escape sequences (#11276)
* Add W605 to the errors raised by flake8 in the CI (#11283)
* Clean up automated label-based issue handling (#11284)
* Clean up automated label-based issue handline
1. upgrade tiangolo/issue-manager to latest
2. move needs-more-info to tiangolo
3. change needs-more-info close time to 7 days
4. delete old needs-more-info config
* Use old, longer message
* Fix label name
* Fix Dutch noun chunks to skip overlapping spans (#11275)
* Add test for overlapping noun chunks
* Skip overlapping noun chunks
* Update spacy/tests/lang/nl/test_noun_chunks.py
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
* Docs: displaCy documentation - data types, `parse_{deps,ents,spans}`, spans example (#10950)
* add in spans example and parse references
* rm autoformatter
* rm extra ents copy
* TypedDict draft
* type fixes
* restore non-documentation files
* docs update
* fix spans example
* fix hyperlinks
* add parse example
* example fix + argument fix
* fix api arg in docs
* fix bad variable replacement
* fix spacing in style
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
* fix spacing on table
* fix spacing on table
* rm temp files
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
* include span_ruler for default warning filter (#11333)
* Add uk pipelines to website (#11332)
* Check for . in factory names (#11336)
* Make fixes for PR #11349
* Fix roman numeral coverage in #11349
Co-authored-by: Patrick J. Burns <patricks@diyclassics.org>
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
Co-authored-by: Paul O'Leary McCann <polm@dampfkraft.com>
Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
Co-authored-by: Lj Miranda <12949683+ljvmiranda921@users.noreply.github.com>
Co-authored-by: Jules Belveze <32683010+JulesBelveze@users.noreply.github.com>
Co-authored-by: stefawolf <wlf.ste@gmail.com>
Co-authored-by: Stefanie Wolf <stefanie.wolf@vitecsoftware.com>
Co-authored-by: Peter Baumgartner <5107405+pmbaumgartner@users.noreply.github.com>
* Switch to mecab-ko as default Korean tokenizer
Switch to the (confusingly-named) mecab-ko python module for default Korean
tokenization.
Maintain the previous `natto-py` tokenizer as
`spacy.KoreanNattoTokenizer.v1`.
* Temporarily run tests with mecab-ko tokenizer
* Fix types
* Fix duplicate test names
* Update requirements test
* Revert "Temporarily run tests with mecab-ko tokenizer"
This reverts commit d2083e7044.
* Add mecab_args setting, fix pickle for KoreanNattoTokenizer
* Fix length check
* Update docs
* Formatting
* Update natto-py error message
Co-authored-by: Paul O'Leary McCann <polm@dampfkraft.com>
Co-authored-by: Paul O'Leary McCann <polm@dampfkraft.com>
* Map `Span.id` to `Token.ent_id` in all cases when setting `Doc.ents`
* Reset `Token.ent_id` and `Token.ent_kb_id` when setting `Doc.ents`
* Make `Span.ent_id` an alias of `Span.id` rather than a read-only view
of the root token's `ent_id` annotation
* adding spans to doc_annotation in Example.to_dict
* to_dict compatible with from_dict: tuples instead of spans
* use strings for label and kb_id
* Simplify test
* Update data formats docs
Co-authored-by: Stefanie Wolf <stefanie.wolf@vitecsoftware.com>
Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
* Update documentation for dependency parser
* Update documentation for trainable_lemmatizer
* Update documentation for entity_linker
* Update documentation for ner
* Update documentation for morphologizer
* Update documentation for senter
* Update documentation for spancat
* Update documentation for tagger
* Update documentation for textcat
* Update documentation for tok2vec
* Run prettier on edited files
* Apply similar changes in transformer docs
* Remove need to say annotated example explicitly
I removed the need to say "Must contain at least one annotated Example"
because it's often a given that Examples will contain some gold-standard
annotation.
* Run prettier on transformer docs
* add additional REL_OP
* change to condition and new rel_op symbols
* add operators to docs
* add the anchor while we're in here
* add tests
Co-authored-by: Peter Baumgartner <5107405+pmbaumgartner@users.noreply.github.com>