spaCy

mirror of https://github.com/explosion/spaCy.git synced 2024-11-11 12:18:04 +03:00

Author	SHA1	Message	Date
Adriane Boyd	86d01e9229	Tidy up with flake8: imports, comparisons, etc.	2021-06-28 12:08:15 +02:00
Adriane Boyd	5eeb25f043	Tidy up code	2021-06-28 12:08:15 +02:00
Adriane Boyd	ec71a6b572	Filter W036 for entity ruler, etc. (#8424)	2021-06-21 09:34:29 +02:00
Sofie Van Landeghem	ff91e6dac7	Show warning if entity_ruler runs without patterns (#7807) * Show warning if entity_ruler runs without patterns * Show warning if matcher runs without patterns * fix wording * unit test for warning once (WIP) * warn W036 only once * cleanup * create filter_warning helper	2021-05-31 18:20:27 +10:00
Adriane Boyd	0a22fed634	Fix span offsets for Matcher(as_spans) on spans (#7992) Fix returned span offsets for `Matcher(as_spans=True)(span)`.	2021-05-06 18:42:44 +10:00
broaddeep	ee159b8543	Support match alignments (#7321) * Support match alignments * change naming from match_alignments to with_alignments, add conditional flow if with_alignments is given, validate with_alignments, add related test case * remove added errors, utilize bint type, cleanup whitespace * fix no new line in end of file * Minor formatting * Skip alignments processing if as_spans is set * Add with_alignments to Matcher API docs * Update website/docs/api/matcher.md Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com> Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>	2021-04-08 18:10:14 +10:00
Adriane Boyd	348d1829c7	Preserve user data for DependencyMatcher on spans (#7528) * Preserve user data for DependencyMatcher on spans * Clean underscore in test * Modify test to use extensions stored in user data	2021-03-30 12:26:22 +02:00
Ines Montani	9ba715ed16	Tidy up and auto-format	2021-02-13 12:55:56 +11:00
Peter Baumann	61b04a70d5	Run PhraseMatcher on Spans (#6918) * Add regression test * Run PhraseMatcher on Spans * Add test for PhraseMatcher on Spans and Docs * Add SCA * Add test with 3 matches in Doc, 1 match in Span * Update docs * Use doc.length for find_matches in tokenizer Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>	2021-02-10 23:43:32 +11:00
Ines Montani	e6accb3a9e	Tidy up and auto-format	2021-01-30 12:52:33 +11:00
Adriane Boyd	fcce3600ed	Forbid OP matching 2+ tokens in DependencyMatcher (#6824) Instead of silently using only the first token in each matched span: * Forbid `OP: ?/*/+` through `DependencyMatcher` validation As a fail-safe, add warning if a token match that's not exactly one token long is found by a token pattern.	2021-01-29 08:52:01 +08:00
Adriane Boyd	2263bc7b28	Update develop from master for v3.0.0rc5 (#6811) * Fix `spacy.util.minibatch` when the size iterator is finished (#6745) * Skip 0-length matches (#6759) Add hack to prevent matcher from returning 0-length matches. * support IS_SENT_START in PhraseMatcher (#6771) * support IS_SENT_START in PhraseMatcher * add unit test and friendlier error * use IDS.get instead * ensure span.text works for an empty span (#6772) * Remove unicode_literals Co-authored-by: Santiago Castro <bryant@montevideo.com.uy> Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>	2021-01-26 14:52:45 +11:00
Ines Montani	1980203229	Merge branch 'master' into pr/6444	2020-12-09 11:09:40 +11:00
Adriane Boyd	03ae77e603	Add SPACY as a Matcher attribute (#6463)	2020-11-30 09:34:50 +08:00
Adriane Boyd	31de700b0f	Fix on_match callback and remove empty patterns (#6312) For the `DependencyMatcher`: * Fix on_match callback so that it is called once per matched pattern * Fix results so that patterns with empty match lists are not returned	2020-11-05 09:16:26 +01:00
Adriane Boyd	5d2cb86c34	Fix on_match callback for DependencyMatcher (#6313) Fix `DependencyMatcher` so that the callback is called only once per match.	2020-10-31 12:20:27 +01:00
Adriane Boyd	45c9a68828	Identify final Matcher pattern node by quantifier (#6317) Modify the internal pattern representation in `Matcher` patterns to identify the final ID state using a unique quantifier rather than a combination of other attributes. It was insufficient to identify the final ID node based on an uninitialized `quantifier` (coincidentally being the same as the `ZERO`) with `nr_attr` as 0. (In addition, it was potentially bug-prone that `nr_attr` was set to 0 even though attrs were allocated.) In the case of `{"OP": "!"}` (a valid, if pointless, pattern), `nr_attr` is 0 and the quantifier is ZERO, so the previous methods for incrementing to the ID node at the end of the pattern weren't able to distinguish the final ID node from the `{"OP": "!"}` pattern.	2020-10-31 12:18:48 +01:00
Adriane Boyd	86c3ec9c2b	Refactor Token morph setting (#6175) * Refactor Token morph setting * Remove `Token.morph_` * Add `Token.set_morph()` * `0` resets `token.c.morph` to unset * Any other values are passed to `Morphology.add` * Add token.morph setter to set from MorphAnalysis	2020-10-01 22:21:46 +02:00
Adriane Boyd	3c062b3911	Add MORPH handling to Matcher (#6107) * Add MORPH handling to Matcher * Add `MORPH` to `Matcher` schema * Rename `_SetMemberPredicate` to `_SetPredicate` * Add `ISSUBSET` and `ISSUPERSET` operators to `_SetPredicate` * Add special handling for normalization and conversion of morph values into sets * For other attrs, `ISSUBSET` acts like `IN` and `ISSUPERSET` only matches for 0 or 1 values * Update test * Rename to IS_SUBSET and IS_SUPERSET	2020-09-24 16:55:09 +02:00
Adriane Boyd	59340606b7	Add option to disable Matcher errors (#6125) * Add option to disable Matcher errors * Add option to disable Matcher errors when a doc doesn't contain a particular type of annotation Minor additional change: * Update `AttributeRuler.load_from_morph_rules` to allow direct `MORPH` values * Rename suppress_errors to allow_missing Co-authored-by: Matthew Honnibal <honnibal+gh@gmail.com> * Refactor annotation checks in Matcher and PhraseMatcher Co-authored-by: Matthew Honnibal <honnibal+gh@gmail.com>	2020-09-24 16:54:39 +02:00
Ines Montani	67fbcb3da5	Tidy up tests and docs	2020-09-21 20:43:54 +02:00
Adriane Boyd	7e4cd7575c	Refactor Docs.is_ flags (#6044) * Refactor Docs.is_ flags * Add derived `Doc.has_annotation` method * `Doc.has_annotation(attr)` returns `True` for partial annotation * `Doc.has_annotation(attr, require_complete=True)` returns `True` for complete annotation * Add deprecation warnings to `is_tagged`, `is_parsed`, `is_sentenced` and `is_nered` * Add `Doc._get_array_attrs()`, which returns a full list of `Doc` attrs for use with `Doc.to_array`, `Doc.to_bytes` and `Doc.from_docs`. The list is the `DocBin` attributes list plus `SPACY` and `LENGTH`. Notes on `Doc.has_annotation`: * `HEAD` is converted to `DEP` because heads don't have an unset state * Accept `IS_SENT_START` as a synonym of `SENT_START` Additional changes: * Add `NORM`, `ENT_ID` and `SENT_START` to default attributes for `DocBin` * In `Doc.from_array()` the presence of `DEP` causes `HEAD` to override `SENT_START` * In `Doc.from_array()` using `attrs` other than `Doc._get_array_attrs()` (i.e., a user's custom list rather than our default internal list) with both `HEAD` and `SENT_START` shows a warning that `HEAD` will override `SENT_START` * `set_children_from_heads` does not require dependency labels to set sentence boundaries and sets `sent_start` for all non-sentence starts to `-1` * Fix call to set_children_form_heads Co-authored-by: Matthew Honnibal <honnibal+gh@gmail.com>	2020-09-17 00:14:01 +02:00
Ines Montani	febb99916d	Tidy up and auto-format [ci skip]	2020-09-13 10:55:36 +02:00
Ines Montani	f174c7b1f3	Merge branch 'develop' into pr/6018	2020-09-04 15:54:49 +02:00
Ines Montani	d7cc2ee72d	Fix tests	2020-09-04 14:05:55 +02:00
Ines Montani	864a697e63	Merge branch 'develop' into master-tmp	2020-09-04 13:15:36 +02:00
Adriane Boyd	8b5594df86	Remove near-duplicate test	2020-09-02 20:32:01 +02:00
Adriane Boyd	960d9cfadc	Officially support DependencyMatcher Add official support for the `DependencyMatcher`. Redesign the pattern specification. Fix and extend operator implementations. Update API docs and add usage docs. Patterns -------- Refactor pattern structure to: ``` { "LEFT_ID": str, "REL_OP": str, "RIGHT_ID": str, "RIGHT_ATTRS": dict, } ``` The first node contains only `RIGHT_ID` and `RIGHT_ATTRS` and all subsequent nodes contain all four keys. New operators ------------- Because of the way patterns are constructed from left to right, it's helpful to have `follows` operators along with `precedes` operators. Add operators for simple precedes / follows alongside immediate precedes / follows. * `.*`: precedes * `;`: immediately follows * `;*`: follows Operator fixes -------------- * `<` and `<<` do not include the node itself * Fix reversed order for all operators involving linear precedence (`.`, all sibling operators) * Linear precedence operators do not match nodes outside the same parse Additional fixes ---------------- * Use v3 Matcher API * Support `get` and `remove` * Support pickling	2020-09-02 17:45:29 +02:00
Ines Montani	add9de5487	Deprecate (Phrase)Matcher.pipe	2020-08-31 17:01:24 +02:00
Ines Montani	6340d1c63d	Add as_spans to Matcher/PhraseMatcher	2020-08-31 14:53:22 +02:00
Ines Montani	e68459296d	Tidy up and auto-format	2020-08-05 16:00:59 +02:00
Adriane Boyd	4193402c47	Add warning when Matcher subpattern is discarded (#5873) * Add a warning when a subpattern is not processed and discarded * Normalize subpattern attribute/operator keys to upper case like top-level attributes	2020-08-05 14:56:14 +02:00
Sofie Van Landeghem	40c995b1be	Option for returning only greedy matches (#5771) * add "greedy" option for match pattern * distinction between greedy FIRST or LONGEST * check for proper values, throw custom warning otherwise * unxfail one more test * add comment in docstring * add test that LONGEST also prefers first match if equal length * use c arrays for more efficient processing * rename 'greediness' to 'greedy'	2020-07-29 11:04:43 +02:00
Ines Montani	52728d8fa3	Merge branch 'develop' into master-tmp	2020-06-20 15:52:00 +02:00
Adriane Boyd	e06ca7ea24	Switch to new add API in PhraseMatcher unpickle	2020-05-25 11:22:47 +02:00
Ines Montani	24f72c669c	Merge branch 'develop' into master-tmp	2020-05-21 18:39:06 +02:00
Ines Montani	d8f3190c0a	Tidy up and auto-format	2020-05-21 14:14:01 +02:00
Paolo Arduin	1ca32d8f9c	Matcher support for Span as well as Doc (#5113) * Matcher support for Span, as well as Doc #5056 * Removes an import unused * Signed contributors agreement * Code optimization and better test * Add error message for bad Matcher call argument * Fix merging	2020-04-15 13:51:33 +02:00
Paolo Arduin	8ce408d2e1	Comparison predicate handling for `!=` (#5282) * Fix #5281 * Optim test	2020-04-14 19:14:15 +02:00
Ines Montani	46568f40a7	Merge branch 'master' into tmp/sync	2020-03-26 13:38:14 +01:00
Ines Montani	828acffc12	Tidy up and auto-format	2020-03-25 12:28:12 +01:00
adrianeboyd	697bec764d	Normalize IS_SENT_START to SENT_START for Matcher (#5080)	2020-03-03 12:22:39 +01:00
svlandeg	b49a3afd0c	use clean_underscore fixture	2020-02-23 15:49:20 +01:00
Ines Montani	33a2682d60	Add better schemas and validation using Pydantic (#4831) * Remove unicode declarations * Remove Python 3.5 and 2.7 from CI * Don't require pathlib * Replace compat helpers * Remove OrderedDict * Use f-strings * Set Cython compiler language level * Fix typo * Re-add OrderedDict for Table * Update setup.cfg * Revert CONTRIBUTING.md * Add better schemas and validation using Pydantic * Revert lookups.md * Remove unused import * Update spacy/schemas.py Co-Authored-By: Sebastián Ramírez <tiangolo@gmail.com> * Various small fixes * Fix docstring Co-authored-by: Sebastián Ramírez <tiangolo@gmail.com>	2019-12-25 12:39:49 +01:00
Ines Montani	db55577c45	Drop Python 2.7 and 3.5 (#4828) * Remove unicode declarations * Remove Python 3.5 and 2.7 from CI * Don't require pathlib * Replace compat helpers * Remove OrderedDict * Use f-strings * Set Cython compiler language level * Fix typo * Re-add OrderedDict for Table * Update setup.cfg * Revert CONTRIBUTING.md * Revert lookups.md * Revert top-level.md * Small adjustments and docs [ci skip]	2019-12-22 01:53:56 +01:00
Ines Montani	cfffdba7b1	Implement new API for {Phrase}Matcher.add (backwards-compatible) (#4522) * Implement new API for {Phrase}Matcher.add (backwards-compatible) * Update docs * Also update DependencyMatcher.add * Update internals * Rewrite tests to use new API * Add basic check for common mistake Raise error with suggestion if user likely passed in a pattern instead of a list of patterns * Fix typo [ci skip]	2019-10-25 22:21:08 +02:00
Ines Montani	181c01f629	Tidy up and auto-format	2019-10-18 11:27:38 +02:00
adrianeboyd	275c9ad872	Allow int values in token patterns (#4444) * Add missing int value option to top-level pattern validation in Matcher * Adjust existing tests accordingly * Add new test for valid pattern `{"LENGTH": int}`	2019-10-16 13:40:18 +02:00
Sofie Van Landeghem	7d1efac4eb	Fix remove pattern from matcher (#4454) * raise specific error when removing a matcher rule that doesn't exist * rephrasing * bugfix in remove matcher + extended unit test	2019-10-16 13:34:58 +02:00
adrianeboyd	98a961a60e	Fix PhraseMatcher.remove for overlapping patterns (#4437)	2019-10-14 12:19:51 +02:00

103 Commits