* Clean up old Matcher call style related stuff
In v2 Matcher.add was called with (key, on_match, *patterns). In v3 this
was changed to (key, patterns, *, on_match=None), but there were various
points where the old call syntax was documented or handled specially.
This removes all those.
The Matcher itself didn't need any code changes, as it just gives a
generic type error. However the PhraseMatcher required some changes
because it would automatically "fix" the old call style.
Surprisingly, the tokenizer was still using the old call style in one
place.
After these changes tests failed in two places:
1. one test for the "new" call style, including the "old" call style. I
removed this test.
2. deserializing the PhraseMatcher fails because the input docs are a
set.
I am not sure why 2 is happening - I guess it's a quirk of the
serialization format? - so for now I just convert the set to a list when
deserializing. The check that the input Docs are a List in the
PhraseMatcher is a new check, but makes it parallel with the other
Matchers, which seemed like the right thing to do.
* Add notes related to input docs / deserialization type
* Remove Typing import
* Remove old note about call style change
* Apply suggestions from code review
Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
* Use separate method for setting internal doc representations
In addition to the title change, this changes the internal dict to be a
defaultdict, instead of a dict with frequent use of setdefault.
* Add _add_from_arrays for unpickling
* Cleanup around adding from arrays
This moves adding to internal structures into the private batch method,
and removes the single-add method.
This has one behavioral change for `add`, in that if something is wrong
with the list of input Docs (such as one of the items not being a Doc),
valid items before the invalid one will not be added. Also the callback
will not be updated if anything is invalid. This change should not be
significant.
This also adds a test to check failure when given a non-Doc.
* Update spacy/matcher/phrasematcher.pyx
Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
* Min_max_operators
1. Modified API and Usage for spaCy website to include min_max operator
2. Modified matcher.pyx to include min_max function {n,m} and its variants
3. Modified schemas.py to include min_max validation error
4. Added test cases to test_matcher_api.py, test_matcher_logic.py and test_pattern_validation.py
* attempt to fix mypy/pydantic compat issue
* formatting
* Update spacy/tests/matcher/test_pattern_validation.py
Co-authored-by: Source-Shen <82353723+Source-Shen@users.noreply.github.com>
Co-authored-by: svlandeg <svlandeg@github.com>
Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
* added new field
* added exception for IOb strings
* minor refinement to schema
* removed field
* fixed typo
* imported numeriacla val
* changed the code bit
* cosmetics
* added test for matcher
* set ents of moc docs
* added invalid pattern
* minor update to documentation
* blacked matcher
* added pattern validation
* add IOB vals to schema
* changed into test
* mypy compat
* cleaned left over
* added compat import
* changed type
* added compat import
* changed literal a bit
* went back to old
* made explicit type
* Update spacy/schemas.py
Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
* Update spacy/schemas.py
Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
* Update spacy/schemas.py
Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
* Added ENT_ID and ENT_KB_ID into the list of the attributes that Matcher matches on
* Added ENT_ID and ENT_KB_ID to TEST_PATTERNS in test_pattern_validation.py. Disabled tests that I added before
* Update website/docs/api/matcher.md
* Format
* Remove skipped tests
Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
* Support list values and IS_INTERSECT in Matcher
* Support list values as token attributes for set operators, not just as
pattern values.
* Add `IS_INTERSECT` operator.
* Fix incorrect `ISSUBSET` and `ISSUPERSET` in schema and docs.
* Rename IS_INTERSECT to INTERSECTS
* Support match alignments
* change naming from match_alignments to with_alignments, add conditional flow if with_alignments is given, validate with_alignments, add related test case
* remove added errors, utilize bint type, cleanup whitespace
* fix no new line in end of file
* Minor formatting
* Skip alignments processing if as_spans is set
* Add with_alignments to Matcher API docs
* Update website/docs/api/matcher.md
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
* Add MORPH handling to Matcher
* Add `MORPH` to `Matcher` schema
* Rename `_SetMemberPredicate` to `_SetPredicate`
* Add `ISSUBSET` and `ISSUPERSET` operators to `_SetPredicate`
* Add special handling for normalization and conversion of morph
values into sets
* For other attrs, `ISSUBSET` acts like `IN` and `ISSUPERSET` only
matches for 0 or 1 values
* Update test
* Rename to IS_SUBSET and IS_SUPERSET
* Update website models for v2.3.0
* Add docs for Chinese word segmentation
* Tighten up Chinese docs section
* Merge branch 'master' into docs/v2.3.0 [ci skip]
* Merge branch 'master' into docs/v2.3.0 [ci skip]
* Auto-format and update version
* Update matcher.md
* Update languages and sorting
* Typo in landing page
* Infobox about token_match behavior
* Add meta and basic docs for Japanese
* POS -> TAG in models table
* Add info about lookups for normalization
* Updates to API docs for v2.3
* Update adding norm exceptions for adding languages
* Add --omit-extra-lookups to CLI API docs
* Add initial draft of "What's New in v2.3"
* Add new in v2.3 tags to Chinese and Japanese sections
* Add tokenizer to migration section
* Add new in v2.3 flags to init-model
* Typo
* More what's new in v2.3
Co-authored-by: Ines Montani <ines@ines.io>
* Implement new API for {Phrase}Matcher.add (backwards-compatible)
* Update docs
* Also update DependencyMatcher.add
* Update internals
* Rewrite tests to use new API
* Add basic check for common mistake
Raise error with suggestion if user likely passed in a pattern instead of a list of patterns
* Fix typo [ci skip]
<!--- Provide a general summary of your changes in the title. -->
## Description
The new website is implemented using [Gatsby](https://www.gatsbyjs.org) with [Remark](https://github.com/remarkjs/remark) and [MDX](https://mdxjs.com/). This allows authoring content in **straightforward Markdown** without the usual limitations. Standard elements can be overwritten with powerful [React](http://reactjs.org/) components and wherever Markdown syntax isn't enough, JSX components can be used. Hopefully, this update will also make it much easier to contribute to the docs. Once this PR is merged, I'll implement auto-deployment via [Netlify](https://netlify.com) on a specific branch (to avoid building the website on every PR). There's a bunch of other cool stuff that the new setup will allow us to do – including writing front-end tests, service workers, offline support, implementing a search and so on.
This PR also includes various new docs pages and content.
Resolves#3270. Resolves#3222. Resolves#2947. Resolves#2837.
### Types of change
enhancement
## Checklist
<!--- Before you submit the PR, go over this checklist and make sure you can
tick off all the boxes. [] -> [x] -->
- [x] I have submitted the spaCy Contributor Agreement.
- [x] I ran the tests, and all new and existing tests passed.
- [x] My changes don't require a change to the documentation, or if they do, I've added all required information.