* `strings`: Remove unused `hash32_utf8` function
* `strings`: Make `hash_utf8` and `decode_Utf8Str` private
* `strings`: Reorganize private functions
* 'strings': Raise error when non-string/-int types are passed to functions that don't accept them
* `strings`: Add `items()` method, add type hints, remove unused methods, restrict inputs to specific types, reorganize methods
* `Morphology`: Use `StringStore.items()` to enumerate features when pickling
* `test_stringstore`: Update pre-Python 3 tests
* Update `StringStore` docs
* Fix `get_string_id` imports
* Replace redundant test with tests for type checking
* Rename `_retrieve_interned_str`, remove `.get` default arg
* Add `get_string_id` to `strings.pyi`
Remove `mypy` ignore directives from imports of the above
* `strings.pyi`: Replace functions that consume `Union`-typed params with overloads
* `strings.pyi`: Revert some function signatures
* Update `SYMBOLS_BY_INT` lookups and error codes post-merge
* Revert clobbered change introduced in a previous merge
* Remove unnecessary type hint
* Invert tuple order in `StringStore.items()`
* Add test for `StringStore.items()`
* Revert "`Morphology`: Use `StringStore.items()` to enumerate features when pickling"
This reverts commit 1af9510ceb.
* Rename `keys` and `key_map`
* Add `keys()` and `values()`
* Add comment about the inverted key-value semantics in the API
* Fix type hints
* Implement `keys()`, `values()`, `items()` without generators
* Fix type hints, remove unnecessary boxing
* Update docs
* Simplify `keys/values/items()` impl
* `mypy` fix
* Fix error message, doc fixes
* `Matcher`: Better type checking of values in `SetPredicate`
`SetPredicate`: Emit warning and return `False` on unexpected value types
* Rename `value_type_mismatch` variable
* Inline warning
* Remove unexpected type warning from `_SetPredicate`
* Ensure that `str` values are not interpreted as sequences
Check elements of sequence values for convertibility to `str` or `int`
* Add more `INTERSECT` and `IN` test cases
* Test for inputs with multiple characters
* Return `False` early instead of using a boolean flag
* Remove superfluous `int` check, parentheses
* Apply suggestions from code review
Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
* Appy suggestions from code review
* Clarify test comment
Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
* Clean up old Matcher call style related stuff
In v2 Matcher.add was called with (key, on_match, *patterns). In v3 this
was changed to (key, patterns, *, on_match=None), but there were various
points where the old call syntax was documented or handled specially.
This removes all those.
The Matcher itself didn't need any code changes, as it just gives a
generic type error. However the PhraseMatcher required some changes
because it would automatically "fix" the old call style.
Surprisingly, the tokenizer was still using the old call style in one
place.
After these changes tests failed in two places:
1. one test for the "new" call style, including the "old" call style. I
removed this test.
2. deserializing the PhraseMatcher fails because the input docs are a
set.
I am not sure why 2 is happening - I guess it's a quirk of the
serialization format? - so for now I just convert the set to a list when
deserializing. The check that the input Docs are a List in the
PhraseMatcher is a new check, but makes it parallel with the other
Matchers, which seemed like the right thing to do.
* Add notes related to input docs / deserialization type
* Remove Typing import
* Remove old note about call style change
* Apply suggestions from code review
Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
* Use separate method for setting internal doc representations
In addition to the title change, this changes the internal dict to be a
defaultdict, instead of a dict with frequent use of setdefault.
* Add _add_from_arrays for unpickling
* Cleanup around adding from arrays
This moves adding to internal structures into the private batch method,
and removes the single-add method.
This has one behavioral change for `add`, in that if something is wrong
with the list of input Docs (such as one of the items not being a Doc),
valid items before the invalid one will not be added. Also the callback
will not be updated if anything is invalid. This change should not be
significant.
This also adds a test to check failure when given a non-Doc.
* Update spacy/matcher/phrasematcher.pyx
Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
* add additional REL_OP
* change to condition and new rel_op symbols
* add operators to docs
* add the anchor while we're in here
* add tests
Co-authored-by: Peter Baumgartner <5107405+pmbaumgartner@users.noreply.github.com>
* Min_max_operators
1. Modified API and Usage for spaCy website to include min_max operator
2. Modified matcher.pyx to include min_max function {n,m} and its variants
3. Modified schemas.py to include min_max validation error
4. Added test cases to test_matcher_api.py, test_matcher_logic.py and test_pattern_validation.py
* attempt to fix mypy/pydantic compat issue
* formatting
* Update spacy/tests/matcher/test_pattern_validation.py
Co-authored-by: Source-Shen <82353723+Source-Shen@users.noreply.github.com>
Co-authored-by: svlandeg <svlandeg@github.com>
Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
* `Matcher`: Remove superfluous GIL-acquiring check in `get_is_final`
This check incurred a significant performance penalty due to implict interactions between the GIL and Cython ref-counting code.
* `Matcher`: Inline `PatternStateC` accessors
The returned match offsets were only adjusted for `as_spans`, not
generally. Because the `on_match` callbacks are always applied to the
doc, the `Matcher` matches on spans should consistently use the doc
offsets.
* Improve typing hints for Matcher.__call__
* Add typing hints for DependencyMatcher
* Add typing hints to underscore extensions
* Update Doc.tensor type (requires numpy 1.21)
* Fix typing hints for Language.component decorator
* Use generic np.ndarray type in Doc to avoid numpy version update
* Fix mypy errors
* Fix cyclic import caused by Underscore typing hints
* Use Literal type from spacy.compat
* Update matcher.pyi import format
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
* added new field
* added exception for IOb strings
* minor refinement to schema
* removed field
* fixed typo
* imported numeriacla val
* changed the code bit
* cosmetics
* added test for matcher
* set ents of moc docs
* added invalid pattern
* minor update to documentation
* blacked matcher
* added pattern validation
* add IOB vals to schema
* changed into test
* mypy compat
* cleaned left over
* added compat import
* changed type
* added compat import
* changed literal a bit
* went back to old
* made explicit type
* Update spacy/schemas.py
Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
* Update spacy/schemas.py
Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
* Update spacy/schemas.py
Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
* added ruler coe
* added error for none existing pattern
* changed error to warning
* changed error to warning
* added basic tests
* fixed place
* added test files
* went back to error
* went back to pattern error
* minor change to docs
* changed style
* changed doc
* changed error slightly
* added remove to phrasem api
* error key already existed
* phrase matcher match code to api
* blacked tests
* moved comments before expr
* corrected error no
* Update website/docs/api/entityruler.md
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
* Update website/docs/api/entityruler.md
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
* Remove some old version refs in the docs
* Remove warning
* Update spacy/matcher/matcher.pyx
* Remove all references to the punctuation warning
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
* 🚨 Ignore all existing Mypy errors
* 🏗 Add Mypy check to CI
* Add types-mock and types-requests as dev requirements
* Add additional type ignore directives
* Add types packages to dev-only list in reqs test
* Add types-dataclasses for python 3.6
* Add ignore to pretrain
* 🏷 Improve type annotation on `run_command` helper
The `run_command` helper previously declared that it returned an
`Optional[subprocess.CompletedProcess]`, but it isn't actually possible
for the function to return `None`. These changes modify the type
annotation of the `run_command` helper and remove all now-unnecessary
`# type: ignore` directives.
* 🔧 Allow variable type redefinition in limited contexts
These changes modify how Mypy is configured to allow variables to have
their type automatically redefined under certain conditions. The Mypy
documentation contains the following example:
```python
def process(items: List[str]) -> None:
# 'items' has type List[str]
items = [item.split() for item in items]
# 'items' now has type List[List[str]]
...
```
This configuration change is especially helpful in reducing the number
of `# type: ignore` directives needed to handle the common pattern of:
* Accepting a filepath as a string
* Overwriting the variable using `filepath = ensure_path(filepath)`
These changes enable redefinition and remove all `# type: ignore`
directives rendered redundant by this change.
* 🏷 Add type annotation to converters mapping
* 🚨 Fix Mypy error in convert CLI argument verification
* 🏷 Improve type annotation on `resolve_dot_names` helper
* 🏷 Add type annotations for `Vocab` attributes `strings` and `vectors`
* 🏷 Add type annotations for more `Vocab` attributes
* 🏷 Add loose type annotation for gold data compilation
* 🏷 Improve `_format_labels` type annotation
* 🏷 Fix `get_lang_class` type annotation
* 🏷 Loosen return type of `Language.evaluate`
* 🏷 Don't accept `Scorer` in `handle_scores_per_type`
* 🏷 Add `string_to_list` overloads
* 🏷 Fix non-Optional command-line options
* 🙈 Ignore redefinition of `wandb_logger` in `loggers.py`
* ➕ Install `typing_extensions` in Python 3.8+
The `typing_extensions` package states that it should be used when
"writing code that must be compatible with multiple Python versions".
Since SpaCy needs to support multiple Python versions, it should be used
when newer `typing` module members are required. One example of this is
`Literal`, which is available starting with Python 3.8.
Previously SpaCy tried to import `Literal` from `typing`, falling back
to `typing_extensions` if the import failed. However, Mypy doesn't seem
to be able to understand what `Literal` means when the initial import
means. Therefore, these changes modify how `compat` imports `Literal` by
always importing it from `typing_extensions`.
These changes also modify how `typing_extensions` is installed, so that
it is a requirement for all Python versions, including those greater
than or equal to 3.8.
* 🏷 Improve type annotation for `Language.pipe`
These changes add a missing overload variant to the type signature of
`Language.pipe`. Additionally, the type signature is enhanced to allow
type checkers to differentiate between the two overload variants based
on the `as_tuple` parameter.
Fixes#8772
* ➖ Don't install `typing-extensions` in Python 3.8+
After more detailed analysis of how to implement Python version-specific
type annotations using SpaCy, it has been determined that by branching
on a comparison against `sys.version_info` can be statically analyzed by
Mypy well enough to enable us to conditionally use
`typing_extensions.Literal`. This means that we no longer need to
install `typing_extensions` for Python versions greater than or equal to
3.8! 🎉
These changes revert previous changes installing `typing-extensions`
regardless of Python version and modify how we import the `Literal` type
to ensure that Mypy treats it properly.
* resolve mypy errors for Strict pydantic types
* refactor code to avoid missing return statement
* fix types of convert CLI command
* avoid list-set confustion in debug_data
* fix typo and formatting
* small fixes to avoid type ignores
* fix types in profile CLI command and make it more efficient
* type fixes in projects CLI
* put one ignore back
* type fixes for render
* fix render types - the sequel
* fix BaseDefault in language definitions
* fix type of noun_chunks iterator - yields tuple instead of span
* fix types in language-specific modules
* 🏷 Expand accepted inputs of `get_string_id`
`get_string_id` accepts either a string (in which case it returns its
ID) or an ID (in which case it immediately returns the ID). These
changes extend the type annotation of `get_string_id` to indicate that
it can accept either strings or IDs.
* 🏷 Handle override types in `combine_score_weights`
The `combine_score_weights` function allows users to pass an `overrides`
mapping to override data extracted from the `weights` argument. Since it
allows `Optional` dictionary values, the return value may also include
`Optional` dictionary values.
These changes update the type annotations for `combine_score_weights` to
reflect this fact.
* 🏷 Fix tokenizer serialization method signatures in `DummyTokenizer`
* 🏷 Fix redefinition of `wandb_logger`
These changes fix the redefinition of `wandb_logger` by giving a
separate name to each `WandbLogger` version. For
backwards-compatibility, `spacy.train` still exports `wandb_logger_v3`
as `wandb_logger` for now.
* more fixes for typing in language
* type fixes in model definitions
* 🏷 Annotate `_RandomWords.probs` as `NDArray`
* 🏷 Annotate `tok2vec` layers to help Mypy
* 🐛 Fix `_RandomWords.probs` type annotations for Python 3.6
Also remove an import that I forgot to move to the top of the module 😅
* more fixes for matchers and other pipeline components
* quick fix for entity linker
* fixing types for spancat, textcat, etc
* bugfix for tok2vec
* type annotations for scorer
* add runtime_checkable for Protocol
* type and import fixes in tests
* mypy fixes for training utilities
* few fixes in util
* fix import
* 🐵 Remove unused `# type: ignore` directives
* 🏷 Annotate `Language._components`
* 🏷 Annotate `spacy.pipeline.Pipe`
* add doc as property to span.pyi
* small fixes and cleanup
* explicit type annotations instead of via comment
Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
Co-authored-by: svlandeg <sofie.vanlandeghem@gmail.com>
Co-authored-by: svlandeg <svlandeg@github.com>
* Fix inconsistency
This makes the failing test pass, so that behavior is consistent whether
patterns are added in one call or two.
The issue is that the hash for patterns depended on the index of the
pattern in the list of current patterns, not the list of total patterns,
so a second call would get identical match ids.
* Add illustrative test case
* Add failing test for remove case
Patterns are not removed from the internal matcher on calls to remove,
which causes spurious weird matches (or misses).
* Fix removal issue
Remove patterns from the internal matcher.
* Check that the single add call also gets no matches
* Replace all basestring references with unicode
`basestring` was a compatability type introduced by Cython to make
dealing with utf-8 strings in Python2 easier. In Python3 it is
equivalent to the unicode (or str) type.
I replaced all references to basestring with unicode, since that was
used elsewhere, but we could also just replace them with str, which
shoudl also be equivalent.
All tests pass locally.
* Replace all references to unicode type with str
Since we only support python3 this is simpler.
* Remove all references to unicode type
This removes all references to the unicode type across the codebase and
replaces them with `str`, which makes it more drastic than the prior
commits. In order to make this work importing `unicode_literals` had to
be removed, and one explicit unicode literal also had to be removed (it
is unclear why this is necessary in Cython with language level 3, but
without doing it there were errors about implicit conversion).
When `unicode` is used as a type in comments it was also edited to be
`str`.
Additionally `coding: utf8` headers were removed from a few files.
* pass alignments to callbacks
* refactor for single callback loop
* Update spacy/matcher/matcher.pyx
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
* Refactor to use list comps and enumerate.
Replace loops that append to a list with a list comprehensions where this does not change the behavior; replace range(len(...)) loops with enumerate. Correct one typo in a comment. Replace a call to set() with a set literal.
* Undo double assignment.
Expand `tokens_to_key[j] = k = self._get_matcher_key(key, i, j)` to two statements.
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
* Sign contributors agreement
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
* Add stub files for main API classes
* Add contributor agreement for ezorita
* Update types for ndarray and hash()
* Fix __getitem__ and __iter__
* Add attributes of Doc and Token classes
* Overload type hints for Span.__getitem__
* Fix type hint overload for Span.__getitem__
Co-authored-by: Luca Dorigo <dorigoluca@gmail.com>
* Fix check for RIGHT_ATTRs in dep matcher
If a non-anchor node does not have RIGHT_ATTRS, the dep matcher throws
an E100, which says that non-anchor nodes must have LEFT_ID, REL_OP, and
RIGHT_ID. It specifically does not say RIGHT_ATTRS is required.
A blank RIGHT_ATTRS is also valid, and patterns with one will be
excepted. While not normal, sometimes a REL_OP is enough to specify a
non-anchor node - maybe you just want the head of another node
unconditionally, for example.
This change just sets RIGHT_ATTRS to {} if not present. Alternatively
changing E100 to state RIGHT_ATTRS is required could also be reasonable.
* Fix test
This test was written on the assumption that if `RIGHT_ATTRS` isn't
present an error will be raised. Since the proposed changes make it so
an error won't be raised this is no longer necessary.
* Revert test, update error message
Error message now lists missing keys, and RIGHT_ATTRS is required.
* Use list of required keys in error message
Also removes unused key param arg.
* Support list values and IS_INTERSECT in Matcher
* Support list values as token attributes for set operators, not just as
pattern values.
* Add `IS_INTERSECT` operator.
* Fix incorrect `ISSUBSET` and `ISSUPERSET` in schema and docs.
* Rename IS_INTERSECT to INTERSECTS
* Show warning if entity_ruler runs without patterns
* Show warning if matcher runs without patterns
* fix wording
* unit test for warning once (WIP)
* warn W036 only once
* cleanup
* create filter_warning helper
* Don't add duplicate patterns (fix#8216)
* Refactor EntityRuler init
This simplifies the EntityRuler init code. This is helpful as prep for
allowing the EntityRuler to reset itself.
* Make EntityRuler.clear reset matchers
Includes a new test for this.
* Tidy PhraseMatcher instantiation
Since the attr can be None safely now, the guard if is no longer
required here.
Also renamed the `_validate` attr. Maybe it's not needed?
* Fix NER test
* Add test to make sure patterns aren't increasing
* Move test to regression tests
* Show warning if entity_ruler runs without patterns
* Show warning if matcher runs without patterns
* fix wording
* unit test for warning once (WIP)
* warn W036 only once
* cleanup
* create filter_warning helper
* Support match alignments
* change naming from match_alignments to with_alignments, add conditional flow if with_alignments is given, validate with_alignments, add related test case
* remove added errors, utilize bint type, cleanup whitespace
* fix no new line in end of file
* Minor formatting
* Skip alignments processing if as_spans is set
* Add with_alignments to Matcher API docs
* Update website/docs/api/matcher.md
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
* Add regression test
* Run PhraseMatcher on Spans
* Add test for PhraseMatcher on Spans and Docs
* Add SCA
* Add test with 3 matches in Doc, 1 match in Span
* Update docs
* Use doc.length for find_matches in tokenizer
Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
Instead of silently using only the first token in each matched span:
* Forbid `OP: ?/*/+` through `DependencyMatcher` validation
* As a fail-safe, add warning if a token match that's not exactly one
token long is found by a token pattern.
* Fix `spacy.util.minibatch` when the size iterator is finished (#6745)
* Skip 0-length matches (#6759)
Add hack to prevent matcher from returning 0-length matches.
* support IS_SENT_START in PhraseMatcher (#6771)
* support IS_SENT_START in PhraseMatcher
* add unit test and friendlier error
* use IDS.get instead
* ensure span.text works for an empty span (#6772)
* Remove unicode_literals
Co-authored-by: Santiago Castro <bryant@montevideo.com.uy>
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
* Adding contributor agreement for user werew
* [DependencyMatcher] Comment and clean code
* [DependencyMatcher] Use defaultdicts
* [DependencyMatcher] Simplify _retrieve_tree method
* [DependencyMatcher] Remove prepended underscores
* [DependencyMatcher] Address TODO and move grouping of token's positions out of the loop
* [DependencyMatcher] Remove _nodes attribute
* [DependencyMatcher] Use enumerate in _retrieve_tree method
* [DependencyMatcher] Clean unused vars and use camel_case naming
* [DependencyMatcher] Memoize node+operator map
* Add root property to Token
* [DependencyMatcher] Groups matches by root
* [DependencyMatcher] Remove unused _keys_to_token attribute
* [DependencyMatcher] Use a list to map tokens to matcher's keys
* [DependencyMatcher] Remove recursion
* [DependencyMatcher] Use a generator to retrieve matches
* [DependencyMatcher] Remove unused memory pool
* [DependencyMatcher] Hide private methods and attributes
* [DependencyMatcher] Improvements to the matches validation
* Apply suggestions from code review
Co-authored-by: Matthew Honnibal <honnibal+gh@gmail.com>
* [DependencyMatcher] Fix keys_to_position_maps
* Remove Token.root property
* [DependencyMatcher] Remove functools' lru_cache
Co-authored-by: Matthew Honnibal <honnibal+gh@gmail.com>
For the `DependencyMatcher`:
* Fix on_match callback so that it is called once per matched pattern
* Fix results so that patterns with empty match lists are not returned