Add official support for the `DependencyMatcher`. Redesign the pattern
specification. Fix and extend operator implementations. Update API docs
and add usage docs.
Patterns
--------
Refactor pattern structure to:
```
{
"LEFT_ID": str,
"REL_OP": str,
"RIGHT_ID": str,
"RIGHT_ATTRS": dict,
}
```
The first node contains only `RIGHT_ID` and `RIGHT_ATTRS` and all
subsequent nodes contain all four keys.
New operators
-------------
Because of the way patterns are constructed from left to right, it's
helpful to have `follows` operators along with `precedes` operators. Add
operators for simple precedes / follows alongside immediate precedes /
follows.
* `.*`: precedes
* `;`: immediately follows
* `;*`: follows
Operator fixes
--------------
* `<` and `<<` do not include the node itself
* Fix reversed order for all operators involving linear precedence (`.`,
all sibling operators)
* Linear precedence operators do not match nodes outside the same parse
Additional fixes
----------------
* Use v3 Matcher API
* Support `get` and `remove`
* Support pickling
* make disable_pipes deprecated in favour of the new toggle_pipes
* rewrite disable_pipes statements
* update documentation
* remove bin/wiki_entity_linking folder
* one more fix
* remove deprecated link to documentation
* few more doc fixes
* add note about name change to the docs
* restore original disable_pipes
* small fixes
* fix typo
* fix error number to W096
* rename to select_pipes
* also make changes to the documentation
Co-authored-by: Matthew Honnibal <honnibal+gh@gmail.com>
* Fix ent_ids and labels properties when id attribute used in patterns
* use set for labels
* sort end_ids for comparison in entity_ruler tests
* fixing entity_ruler ent_ids test
* add to set
* Run make_doc optimistically if using phrase matcher patterns.
* remove unused coveragerc I was testing with
* format
* Refactor EntityRuler.add_patterns to use nlp.pipe for phrase patterns. Improves speed substantially.
* Removing old add_patterns function
* Fixing spacing
* Make sure token_patterns loaded as well, before generator was being emptied in from_disk
* Fix typo in rule-based matching docs
* Improve token pattern checking without validation
Add more detailed token pattern checks without full JSON pattern validation and
provide more detailed error messages.
Addresses #4070 (also related: #4063, #4100).
* Check whether top-level attributes in patterns and attr for PhraseMatcher are
in token pattern schema
* Check whether attribute value types are supported in general (as opposed to
per attribute with full validation)
* Report various internal error types (OverflowError, AttributeError, KeyError)
as ValueError with standard error messages
* Check for tagger/parser in PhraseMatcher pipeline for attributes TAG, POS,
LEMMA, and DEP
* Add error messages with relevant details on how to use validate=True or nlp()
instead of nlp.make_doc()
* Support attr=TEXT for PhraseMatcher
* Add NORM to schema
* Expand tests for pattern validation, Matcher, PhraseMatcher, and EntityRuler
* Remove unnecessary .keys()
* Rephrase error messages
* Add another type check to Matcher
Add another type check to Matcher for more understandable error messages
in some rare cases.
* Support phrase_matcher_attr=TEXT for EntityRuler
* Don't use spacy.errors in examples and bin scripts
* Fix error code
* Auto-format
Also try get Azure pipelines to finally start a build :(
* Update errors.py
Co-authored-by: Ines Montani <ines@ines.io>
Co-authored-by: Matthew Honnibal <honnibal+gh@gmail.com>