* Update textcat scorer threshold behavior
For `textcat` (with exclusive classes) the scorer should always use a
threshold of 0.0 because there should be one predicted label per doc and
the numeric score for that particular label should not matter.
* Rename to test_textcat_multilabel_threshold
* Remove all uses of threshold for multi_label=False
* Update Scorer.score_cats API docs
* Add tests for score_cats with thresholds
* Update textcat API docs
* Fix types
* Convert threshold back to float
* Fix threshold type in docstring
* Improve formatting in Scorer API docs
* Handle docs with no entities
If a whole batch contains no entities it won't make it to the model, but
it's possible for individual Docs to have no entities. Before this
commit, those Docs would cause an error when attempting to concatenate
arrays because the dimensions didn't match.
It turns out the process of preparing the Ragged at the end of the span
maker forward was a little different from list2ragged, which just uses
the flatten function directly. Letting list2ragged do the conversion
avoids the dimension issue.
This did not come up before because in NEL demo projects it's typical
for data with no entities to be discarded before it reaches the NEL
component.
This includes a simple direct test that shows the issue and checks it's
resolved. It doesn't check if there are any downstream changes, so a
more complete test could be added. A full run was tested by adding an
example with no entities to the Emerson sample project.
* Add a blank instance to default training data in tests
Rather than adding a specific test, since not failing on instances with
no entities is basic functionality, it makes sense to add it to the
default set.
* Fix without modifying architecture
If the architecture is modified this would have to be a new version, but
this change isn't big enough to merit that.
* Replace EntityRuler with SpanRuler implementation
Remove `EntityRuler` and rename the `SpanRuler`-based
`future_entity_ruler` to `entity_ruler`.
Main changes:
* It is no longer possible to load patterns on init as with
`EntityRuler(patterns=)`.
* The older serialization formats (`patterns.jsonl`) are no longer
supported and the related tests are removed.
* The config settings are only stored in the config, not in the
serialized component (in particular the `phrase_matcher_attr` and
overwrite settings).
* Add migration guide to EntityRuler API docs
* docs update
* Minor edit
Co-authored-by: svlandeg <svlandeg@github.com>
* Remove imports marked as v2 leftovers
There are a few functions that were in `spacy.util` in v2, but were
moved to Thinc. In v3 these were imported in `spacy.util` so that code
could be used unchanged, but the comment over them indicates they should
always be imported from Thinc. This commit removes those imports.
It doesn't look like any DeprecationWarning was ever thrown for using
these, but it is probably fine to remove them anyway with a major
version. It is not clear that they were widely used.
* Import fix_random_seed correctly
This seems to be the only place in spaCy that was using the old import.
* Fix multiple extensions and character offset
* Rename token_start/end to start/end
* Refactor Doc.from_json based on review
* Iterate over user_data items
* Only add non-empty underscore entries
Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
* `strings`: Remove unused `hash32_utf8` function
* `strings`: Make `hash_utf8` and `decode_Utf8Str` private
* `strings`: Reorganize private functions
* 'strings': Raise error when non-string/-int types are passed to functions that don't accept them
* `strings`: Add `items()` method, add type hints, remove unused methods, restrict inputs to specific types, reorganize methods
* `Morphology`: Use `StringStore.items()` to enumerate features when pickling
* `test_stringstore`: Update pre-Python 3 tests
* Update `StringStore` docs
* Fix `get_string_id` imports
* Replace redundant test with tests for type checking
* Rename `_retrieve_interned_str`, remove `.get` default arg
* Add `get_string_id` to `strings.pyi`
Remove `mypy` ignore directives from imports of the above
* `strings.pyi`: Replace functions that consume `Union`-typed params with overloads
* `strings.pyi`: Revert some function signatures
* Update `SYMBOLS_BY_INT` lookups and error codes post-merge
* Revert clobbered change introduced in a previous merge
* Remove unnecessary type hint
* Invert tuple order in `StringStore.items()`
* Add test for `StringStore.items()`
* Revert "`Morphology`: Use `StringStore.items()` to enumerate features when pickling"
This reverts commit 1af9510ceb.
* Rename `keys` and `key_map`
* Add `keys()` and `values()`
* Add comment about the inverted key-value semantics in the API
* Fix type hints
* Implement `keys()`, `values()`, `items()` without generators
* Fix type hints, remove unnecessary boxing
* Update docs
* Simplify `keys/values/items()` impl
* `mypy` fix
* Fix error message, doc fixes
* Change enable/disable behavior so that arguments take precedence over config options. Extend error message on conflict. Add warning message in case of overwriting config option with arguments.
* Fix tests in test_serialize_pipeline.py to reflect changes to handling of enable/disable.
* Fix type issue.
* Move comment.
* Move comment.
* Issue UserWarning instead of printing wasabi message. Adjust test.
* Added pytest.warns(UserWarning) for expected warning to fix tests.
* Update warning message.
* Move type handling out of fetch_pipes_status().
* Add global variable for default value. Use id() to determine whether used values are default value.
* Fix default value for disable.
* Rename DEFAULT_PIPE_STATUS to _DEFAULT_EMPTY_PIPES.
* add punctuation to grc
Add support for special editorial punctuation that is common in ancient Greek texts. Ancient Greek texts, as found in digital and print form, have been largely edited by scholars. Restorations and improvements are normally marked with special characters that need to be handled properly by the tokenizer.
* add unit tests
* simplify regex
* move generic quotes to char classes
* rename unit test
* fix regex
Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
Co-authored-by: svlandeg <svlandeg@github.com>
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
Preserve both `-` and `O` annotation in augmenters rather than relying
on `Example.to_dict`'s default support for one option outside of labeled
entity spans.
This is intended as a temporary workaround for augmenters for v3.4.x.
The behavior of `Example` and related IOB utils could be improved in the
general case for v3.5.
* Remove side effects from Doc.__init__()
* Changes based on review comment
* Readd test
* Change interface of Doc.__init__()
* Simplify test
Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
* Update doc.md
Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
* Store activations in Doc when `store_activations` is enabled
This change adds the new `activations` attribute to `Doc`. This
attribute can be used by trainable pipes to store their activations,
probabilities, and guesses for downstream users.
As an example, this change modifies the `tagger` and `senter` pipes to
add an `store_activations` option. When this option is enabled, the
probabilities and guesses are stored in `set_annotations`.
* Change type of `store_activations` to `Union[bool, List[str]]`
When the value is:
- A bool: all activations are stored when set to `True`.
- A List[str]: the activations named in the list are stored
* Formatting fixes in Tagger
* Support store_activations in spancat and morphologizer
* Make Doc.activations type visible to MyPy
* textcat/textcat_multilabel: add store_activations option
* trainable_lemmatizer/entity_linker: add store_activations option
* parser/ner: do not currently support returning activations
* Extend tagger and senter tests
So that they, like the other tests, also check that we get no
activations if no activations were requested.
* Document `Doc.activations` and `store_activations` in the relevant pipes
* Start errors/warnings at higher numbers to avoid merge conflicts
Between the master and v4 branches.
* Add `store_activations` to docstrings.
* Replace store_activations setter by set_store_activations method
Setters that take a different type than what the getter returns are still
problematic for MyPy. Replace the setter by a method, so that type inference
works everywhere.
* Use dict comprehension suggested by @svlandeg
* Revert "Use dict comprehension suggested by @svlandeg"
This reverts commit 6e7b958f70.
* EntityLinker: add type annotations to _add_activations
* _store_activations: make kwarg-only, remove doc_scores_lens arg
* set_annotations: add type annotations
* Apply suggestions from code review
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
* TextCat.predict: return dict
* Make the `TrainablePipe.store_activations` property a bool
This means that we can also bring back `store_activations` setter.
* Remove `TrainablePipe.activations`
We do not need to enumerate the activations anymore since `store_activations` is
`bool`.
* Add type annotations for activations in predict/set_annotations
* Rename `TrainablePipe.store_activations` to `save_activations`
* Error E1400 is not used anymore
This error was used when activations were still `Union[bool, List[str]]`.
* Change wording in API docs after store -> save change
* docs: tag (save_)activations as new in spaCy 4.0
* Fix copied line in morphologizer activations test
* Don't train in any test_save_activations test
* Rename activations
- "probs" -> "probabilities"
- "guesses" -> "label_ids", except in the edit tree lemmatizer, where
"guesses" -> "tree_ids".
* Remove unused W400 warning.
This warning was used when we still allowed the user to specify
which activations to save.
* Formatting fixes
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
* Replace "kb_ids" by a constant
* spancat: replace a cast by an assertion
* Fix EOF spacing
* Fix comments in test_save_activations tests
* Do not set RNG seed in activation saving tests
* Revert "spancat: replace a cast by an assertion"
This reverts commit 0bd5730d16.
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
* replicate bug with tok2vec in annotating components
* add overfitting test with a frozen tok2vec
* remove broadcast from predict and check doc.tensor instead
* remove broadcast
* proper error
* slight rephrase of documentation
* Add a dry run flag to download
* Remove --dry-run, add --url option to `spacy info` instead
* Make mypy happy
* Print only the URL, so it's easier to use in scripts
* Don't add the egg hash unless downloading an sdist
* Update spacy/cli/info.py
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
* Add two implementations of requirements
* Clean up requirements sample slightly
This should make mypy happy
* Update URL help string
* Remove requirements option
* Add url option to docs
* Add URL to spacy info model output, when available
* Add types-setuptools to testing reqs
* Add types-setuptools to requirements
* Add "compatible", expand docstring
* Update spacy/cli/info.py
Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
* Run prettier on CLI docs
* Update docs
Add a sidebar about finding download URLs, with some examples of the new
command.
* Add download URLs to table on model page
* Apply suggestions from code review
Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
* Updates from review
* download url -> download link
* Update docs
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
* `Matcher`: Better type checking of values in `SetPredicate`
`SetPredicate`: Emit warning and return `False` on unexpected value types
* Rename `value_type_mismatch` variable
* Inline warning
* Remove unexpected type warning from `_SetPredicate`
* Ensure that `str` values are not interpreted as sequences
Check elements of sequence values for convertibility to `str` or `int`
* Add more `INTERSECT` and `IN` test cases
* Test for inputs with multiple characters
* Return `False` early instead of using a boolean flag
* Remove superfluous `int` check, parentheses
* Apply suggestions from code review
Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
* Appy suggestions from code review
* Clarify test comment
Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
* Consolidate and freeze symbols
Instead of having symbol values defined in three potentially conflicting
places (`spacy.attrs`, `spacy.parts_of_speech`, `spacy.symbols`), define
all symbols in `spacy.symbols` and reference those values in
`spacy.attrs` and `spacy.parts_of_speech`.
Remove deprecated and placeholder symbols from `spacy.attrs.IDS`.
Make `spacy.attrs.NAMES` and `spacy.symbols.NAMES` reverse dicts rather
than lists in order to support future use of hash values in `attr_id_t`.
Minor changes:
* Use `uint64_t` for attrs in `Doc.to_array` to support future use of
hash values
* Remove unneeded attrs filter for error message in `Doc.to_array`
* Remove unused attr `SENT_END`
* Handle dynamic size of attr_id_t in Doc.to_array
* Undo added warnings
* Refactor to make Doc.to_array more similar to Doc.from_array
* Improve refactoring
* adding unit test for spacy.load with disable/exclude string arg
* allow pure strings in from_config
* update docs
* upstream type adjustements
* docs update
* make docstring more consistent
* Update spacy/language.py
Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
* two more cleanups
* fix type in internal method
Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
* Clean up old Matcher call style related stuff
In v2 Matcher.add was called with (key, on_match, *patterns). In v3 this
was changed to (key, patterns, *, on_match=None), but there were various
points where the old call syntax was documented or handled specially.
This removes all those.
The Matcher itself didn't need any code changes, as it just gives a
generic type error. However the PhraseMatcher required some changes
because it would automatically "fix" the old call style.
Surprisingly, the tokenizer was still using the old call style in one
place.
After these changes tests failed in two places:
1. one test for the "new" call style, including the "old" call style. I
removed this test.
2. deserializing the PhraseMatcher fails because the input docs are a
set.
I am not sure why 2 is happening - I guess it's a quirk of the
serialization format? - so for now I just convert the set to a list when
deserializing. The check that the input Docs are a List in the
PhraseMatcher is a new check, but makes it parallel with the other
Matchers, which seemed like the right thing to do.
* Add notes related to input docs / deserialization type
* Remove Typing import
* Remove old note about call style change
* Apply suggestions from code review
Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
* Use separate method for setting internal doc representations
In addition to the title change, this changes the internal dict to be a
defaultdict, instead of a dict with frequent use of setdefault.
* Add _add_from_arrays for unpickling
* Cleanup around adding from arrays
This moves adding to internal structures into the private batch method,
and removes the single-add method.
This has one behavioral change for `add`, in that if something is wrong
with the list of input Docs (such as one of the items not being a Doc),
valid items before the invalid one will not be added. Also the callback
will not be updated if anything is invalid. This change should not be
significant.
This also adds a test to check failure when given a non-Doc.
* Update spacy/matcher/phrasematcher.pyx
Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
* Fix `test_{prefer,require}_gpu`
These tests assumed that GPUs are only supported with CuPy, but since Thinc 8.1
we also support Metal Performance Shaders.
* test_misc: arrange thinc imports to be together
* Add lang folder for la (Latin)
* Add Latin lang classes
* Add minimal tokenizer exceptions
* Add minimal stopwords
* Add minimal lex_attrs
* Update stopwords, tokenizer exceptions
* Add la tests; register la_tokenizer in conftest.py
* Update spacy/lang/la/lex_attrs.py
Remove duplicate form in Latin lex_attrs
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
* Update natto-py version spec (#11222)
* Update natto-py version spec
* Update setup.cfg
Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
* Add scorer to textcat API docs config settings (#11263)
* Update docs for pipeline initialize() methods (#11221)
* Update documentation for dependency parser
* Update documentation for trainable_lemmatizer
* Update documentation for entity_linker
* Update documentation for ner
* Update documentation for morphologizer
* Update documentation for senter
* Update documentation for spancat
* Update documentation for tagger
* Update documentation for textcat
* Update documentation for tok2vec
* Run prettier on edited files
* Apply similar changes in transformer docs
* Remove need to say annotated example explicitly
I removed the need to say "Must contain at least one annotated Example"
because it's often a given that Examples will contain some gold-standard
annotation.
* Run prettier on transformer docs
* chore: add 'concepCy' to spacy universe (#11255)
* chore: add 'concepCy' to spacy universe
* docs: add 'slogan' to concepCy
* Support full prerelease versions in the compat table (#11228)
* Support full prerelease versions in the compat table
* Fix types
* adding spans to doc_annotation in Example.to_dict (#11261)
* adding spans to doc_annotation in Example.to_dict
* to_dict compatible with from_dict: tuples instead of spans
* use strings for label and kb_id
* Simplify test
* Update data formats docs
Co-authored-by: Stefanie Wolf <stefanie.wolf@vitecsoftware.com>
Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
* Fix regex invalid escape sequences (#11276)
* Add W605 to the errors raised by flake8 in the CI (#11283)
* Clean up automated label-based issue handling (#11284)
* Clean up automated label-based issue handline
1. upgrade tiangolo/issue-manager to latest
2. move needs-more-info to tiangolo
3. change needs-more-info close time to 7 days
4. delete old needs-more-info config
* Use old, longer message
* Fix label name
* Fix Dutch noun chunks to skip overlapping spans (#11275)
* Add test for overlapping noun chunks
* Skip overlapping noun chunks
* Update spacy/tests/lang/nl/test_noun_chunks.py
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
* Docs: displaCy documentation - data types, `parse_{deps,ents,spans}`, spans example (#10950)
* add in spans example and parse references
* rm autoformatter
* rm extra ents copy
* TypedDict draft
* type fixes
* restore non-documentation files
* docs update
* fix spans example
* fix hyperlinks
* add parse example
* example fix + argument fix
* fix api arg in docs
* fix bad variable replacement
* fix spacing in style
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
* fix spacing on table
* fix spacing on table
* rm temp files
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
* include span_ruler for default warning filter (#11333)
* Add uk pipelines to website (#11332)
* Check for . in factory names (#11336)
* Make fixes for PR #11349
* Fix roman numeral coverage in #11349
Co-authored-by: Patrick J. Burns <patricks@diyclassics.org>
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
Co-authored-by: Paul O'Leary McCann <polm@dampfkraft.com>
Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
Co-authored-by: Lj Miranda <12949683+ljvmiranda921@users.noreply.github.com>
Co-authored-by: Jules Belveze <32683010+JulesBelveze@users.noreply.github.com>
Co-authored-by: stefawolf <wlf.ste@gmail.com>
Co-authored-by: Stefanie Wolf <stefanie.wolf@vitecsoftware.com>
Co-authored-by: Peter Baumgartner <5107405+pmbaumgartner@users.noreply.github.com>
* Switch to mecab-ko as default Korean tokenizer
Switch to the (confusingly-named) mecab-ko python module for default Korean
tokenization.
Maintain the previous `natto-py` tokenizer as
`spacy.KoreanNattoTokenizer.v1`.
* Temporarily run tests with mecab-ko tokenizer
* Fix types
* Fix duplicate test names
* Update requirements test
* Revert "Temporarily run tests with mecab-ko tokenizer"
This reverts commit d2083e7044.
* Add mecab_args setting, fix pickle for KoreanNattoTokenizer
* Fix length check
* Update docs
* Formatting
* Update natto-py error message
Co-authored-by: Paul O'Leary McCann <polm@dampfkraft.com>
Co-authored-by: Paul O'Leary McCann <polm@dampfkraft.com>
* Add token and span custom attributes to to_json()
* Change logic for to_json
* Add functionality to from_json
* Small adjustments
* Move token/span attributes to new dict key
* Fix test
* Fix the same test but much better
* Add backwards compatibility tests and adjust logic
* Add test to check if attributes not set in underscore are not saved in the json
* Add tests for json compatibility
* Adjust test names
* Fix tests and clean up code
* Fix assert json tests
* small adjustment
* adjust naming and code readability
* Adjust naming, added more tests and changed logic
* Fix typo
* Adjust errors, naming, and small test optimization
* Fix byte tests
* Fix bytes tests
* Change naming and json structure
* update schema
* Update spacy/schemas.py
Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
* Update spacy/tokens/doc.pyx
Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
* Update spacy/tokens/doc.pyx
Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
* Update spacy/schemas.py
Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
* Update schema for underscore attributes
* Adjust underscore schema
* adjust schema tests
Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
* Map `Span.id` to `Token.ent_id` in all cases when setting `Doc.ents`
* Reset `Token.ent_id` and `Token.ent_kb_id` when setting `Doc.ents`
* Make `Span.ent_id` an alias of `Span.id` rather than a read-only view
of the root token's `ent_id` annotation