Commit Graph

380 Commits

Author SHA1 Message Date
Adriane Boyd
cae4589f5a
Replace EntityRuler with SpanRuler implementation (#11320)
* Replace EntityRuler with SpanRuler implementation

Remove `EntityRuler` and rename the `SpanRuler`-based
`future_entity_ruler` to `entity_ruler`.

Main changes:

* It is no longer possible to load patterns on init as with
`EntityRuler(patterns=)`.
* The older serialization formats (`patterns.jsonl`) are no longer
supported and the related tests are removed.
* The config settings are only stored in the config, not in the
serialized component (in particular the `phrase_matcher_attr` and
overwrite settings).

* Add migration guide to EntityRuler API docs

* docs update

* Minor edit

Co-authored-by: svlandeg <svlandeg@github.com>
2022-10-24 09:11:35 +02:00
Madeesh Kannan
446a3ecf34
StringStore refactoring (#11344)
* `strings`: Remove unused `hash32_utf8` function

* `strings`: Make `hash_utf8` and `decode_Utf8Str` private

* `strings`: Reorganize private functions

* 'strings': Raise error when non-string/-int types are passed to functions that don't accept them

* `strings`: Add `items()` method, add type hints, remove unused methods, restrict inputs to specific types, reorganize methods

* `Morphology`: Use `StringStore.items()` to enumerate features when pickling

* `test_stringstore`: Update pre-Python 3 tests

* Update `StringStore` docs

* Fix `get_string_id` imports

* Replace redundant test with tests for type checking

* Rename `_retrieve_interned_str`, remove `.get` default arg

* Add `get_string_id` to `strings.pyi`
Remove `mypy` ignore directives from imports of the above

* `strings.pyi`: Replace functions that consume `Union`-typed params with overloads

* `strings.pyi`: Revert some function signatures

* Update `SYMBOLS_BY_INT` lookups and error codes post-merge

* Revert clobbered change introduced in a previous merge

* Remove unnecessary type hint

* Invert tuple order in `StringStore.items()`

* Add test for `StringStore.items()`

* Revert "`Morphology`: Use `StringStore.items()` to enumerate features when pickling"

This reverts commit 1af9510ceb.

* Rename `keys` and `key_map`

* Add `keys()` and `values()`

* Add comment about the inverted key-value semantics in the API

* Fix type hints

* Implement `keys()`, `values()`, `items()` without generators

* Fix type hints, remove unnecessary boxing

* Update docs

* Simplify `keys/values/items()` impl

* `mypy` fix

* Fix error message, doc fixes
2022-10-06 10:51:06 +02:00
svlandeg
e3027c65b8 Merge branch 'copy_develop' into copy_v4 2022-10-03 14:12:16 +02:00
svlandeg
9c8cdb403e Merge branch 'master_copy' into develop_copy 2022-09-30 15:40:26 +02:00
Raphael Mitsch
aea16719be
Simplify and clarify enable/disable behavior of spacy.load() (#11459)
* Change enable/disable behavior so that arguments take precedence over config options. Extend error message on conflict. Add warning message in case of overwriting config option with arguments.

* Fix tests in test_serialize_pipeline.py to reflect changes to handling of enable/disable.

* Fix type issue.

* Move comment.

* Move comment.

* Issue UserWarning instead of printing wasabi message. Adjust test.

* Added pytest.warns(UserWarning) for expected warning to fix tests.

* Update warning message.

* Move type handling out of fetch_pipes_status().

* Add global variable for default value. Use id() to determine whether used values are default value.

* Fix default value for disable.

* Rename DEFAULT_PIPE_STATUS to _DEFAULT_EMPTY_PIPES.
2022-09-27 14:22:36 +02:00
Sofie Van Landeghem
0509f90874
add dot (#11500) 2022-09-15 17:29:42 +02:00
Sofie Van Landeghem
cc10a27c59
Prevent tok2vec to broadcast to listeners when predicting (#11385)
* replicate bug with tok2vec in annotating components

* add overfitting test with a frozen tok2vec

* remove broadcast from predict and check doc.tensor instead

* remove broadcast

* proper error

* slight rephrase of documentation
2022-09-12 15:36:48 +02:00
Raphael Mitsch
1f23c615d7
Refactor KB for easier customization (#11268)
* Add implementation of batching + backwards compatibility fixes. Tests indicate issue with batch disambiguation for custom singular entity lookups.

* Fix tests. Add distinction w.r.t. batch size.

* Remove redundant and add new comments.

* Adjust comments. Fix variable naming in EL prediction.

* Fix mypy errors.

* Remove KB entity type config option. Change return types of candidate retrieval functions to Iterable from Iterator. Fix various other issues.

* Update spacy/pipeline/entity_linker.py

Co-authored-by: Paul O'Leary McCann <polm@dampfkraft.com>

* Update spacy/pipeline/entity_linker.py

Co-authored-by: Paul O'Leary McCann <polm@dampfkraft.com>

* Update spacy/kb_base.pyx

Co-authored-by: Paul O'Leary McCann <polm@dampfkraft.com>

* Update spacy/kb_base.pyx

Co-authored-by: Paul O'Leary McCann <polm@dampfkraft.com>

* Update spacy/pipeline/entity_linker.py

Co-authored-by: Paul O'Leary McCann <polm@dampfkraft.com>

* Add error messages to NotImplementedErrors. Remove redundant comment.

* Fix imports.

* Remove redundant comments.

* Rename KnowledgeBase to InMemoryLookupKB and BaseKnowledgeBase to KnowledgeBase.

* Fix tests.

* Update spacy/errors.py

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Move KB into subdirectory.

* Adjust imports after KB move to dedicated subdirectory.

* Fix config imports.

* Move Candidate + retrieval functions to separate module. Fix other, small issues.

* Fix docstrings and error message w.r.t. class names. Fix typing for candidate retrieval functions.

* Update spacy/kb/kb_in_memory.pyx

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Update spacy/ml/models/entity_linker.py

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Fix typing.

* Change typing of mentions to be Span instead of Union[Span, str].

* Update docs.

* Update EntityLinker and _architecture docs.

* Update website/docs/api/entitylinker.md

Co-authored-by: Paul O'Leary McCann <polm@dampfkraft.com>

* Adjust message for E1046.

* Re-add section for Candidate in kb.md, add reference to dedicated page.

* Update docs and docstrings.

* Re-add section + reference for KnowledgeBase.get_alias_candidates() in docs.

* Update spacy/kb/candidate.pyx

* Update spacy/kb/kb_in_memory.pyx

* Update spacy/pipeline/legacy/entity_linker.py

* Remove canididate.md. Remove mistakenly added config snippet in entity_linker.py.

Co-authored-by: Paul O'Leary McCann <polm@dampfkraft.com>
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
2022-09-08 10:38:07 +02:00
shademe
977b847cce
Merge branch 'develop' into merge-develop-into-v4 2022-09-07 11:35:47 +02:00
Adriane Boyd
78f5503a29
Check for any non-Doc returned value for components (#11424) 2022-09-01 19:37:23 +02:00
Paul O'Leary McCann
698b8b495f
Update/remove old Matcher syntax (#11370)
* Clean up old Matcher call style related stuff

In v2 Matcher.add was called with (key, on_match, *patterns). In v3 this
was changed to (key, patterns, *, on_match=None), but there were various
points where the old call syntax was documented or handled specially.
This removes all those.

The Matcher itself didn't need any code changes, as it just gives a
generic type error. However the PhraseMatcher required some changes
because it would automatically "fix" the old call style.

Surprisingly, the tokenizer was still using the old call style in one
place.

After these changes tests failed in two places:

1. one test for the "new" call style, including the "old" call style. I
   removed this test.
2. deserializing the PhraseMatcher fails because the input docs are a
   set.

I am not sure why 2 is happening - I guess it's a quirk of the
serialization format? - so for now I just convert the set to a list when
deserializing. The check that the input Docs are a List in the
PhraseMatcher is a new check, but makes it parallel with the other
Matchers, which seemed like the right thing to do.

* Add notes related to input docs / deserialization type

* Remove Typing import

* Remove old note about call style change

* Apply suggestions from code review

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

* Use separate method for setting internal doc representations

In addition to the title change, this changes the internal dict to be a
defaultdict, instead of a dict with frequent use of setdefault.

* Add _add_from_arrays for unpickling

* Cleanup around adding from arrays

This moves adding to internal structures into the private batch method,
and removes the single-add method.

This has one behavioral change for `add`, in that if something is wrong
with the list of input Docs (such as one of the items not being a Doc),
valid items before the invalid one will not be added. Also the callback
will not be updated if anything is invalid. This change should not be
significant.

This also adds a test to check failure when given a non-Doc.

* Update spacy/matcher/phrasematcher.pyx

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
2022-08-30 15:40:31 +02:00
Edward
5afa98aabf
Support custom attributes for tokens and spans in json conversion (#11125)
* Add token and span custom attributes to to_json()

* Change logic for to_json

* Add functionality to from_json

* Small adjustments

* Move token/span attributes to new dict key

* Fix test

* Fix the same test but much better

* Add backwards compatibility tests and adjust logic

* Add test to check if attributes not set in underscore are not saved in the json

* Add tests for json compatibility

* Adjust test names

* Fix tests and clean up code

* Fix assert json tests

* small adjustment

* adjust naming and code readability

* Adjust naming, added more tests and changed logic

* Fix typo

* Adjust errors, naming, and small test optimization

* Fix byte tests

* Fix bytes tests

* Change naming and json structure

* update schema

* Update spacy/schemas.py

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

* Update spacy/tokens/doc.pyx

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

* Update spacy/tokens/doc.pyx

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

* Update spacy/schemas.py

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

* Update schema for underscore attributes

* Adjust underscore schema

* adjust schema tests

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
2022-08-23 10:05:02 +02:00
Paul O'Leary McCann
0f07defe2c
Remove reference to voting on issue (#11335)
Not clear which issue this refers to, we don't suggest this for any
other issues, and we don't use votes in general.
2022-08-22 11:29:05 +02:00
Adriane Boyd
3e4cf1bbe1
Check for . in factory names (#11336) 2022-08-19 09:52:12 +02:00
Sofie Van Landeghem
cab263791f
include span_ruler for default warning filter (#11333) 2022-08-17 19:55:54 +02:00
Raphael Mitsch
e9eb59699f
NEL confidence threshold (#11016)
* Add base for NEL abstention threshold mechanism.

* Add abstention threshold to entity linker. Add test.

* Fix entity linking tests.

* Changed abstention default threshold from 0 to None.

* Fix default values for abstention thresholds.

* Fix mypy errors.

* Replace assertion with raise of proper error code.

* Simplify threshold check. Remove thresholding from EntityLinker_v1.

* Rename test.

* Update spacy/pipeline/entity_linker.py

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Update spacy/pipeline/entity_linker.py

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Make E1043 configurable.

* Update docs.

* Rephrase description in docs. Adjusting error code message.

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
2022-07-04 17:05:21 +02:00
Madeesh Kannan
eaf66e7431
Add NVTX ranges to TrainablePipe components (#10965)
* `TrainablePipe`: Add NVTX range decorator

* Annotate `TrainablePipe` subclasses with NVTX ranges

* Export function signature to allow introspection of args in tests

* Revert "Annotate `TrainablePipe` subclasses with NVTX ranges"

This reverts commit d8684f7372.

* Revert "Export function signature to allow introspection of args in tests"

This reverts commit f4405ca3ad.

* Revert "`TrainablePipe`: Add NVTX range decorator"

This reverts commit 26536eb6b8.

* Add `spacy.pipes_with_nvtx_range` pipeline callback

* Show warnings for all missing user-defined pipe functions that need to be annotated
Fix imports, typos

* Rename `DEFAULT_ANNOTATABLE_PIPE_METHODS` to `DEFAULT_NVTX_ANNOTATABLE_PIPE_METHODS`
Reorder import

* Walk model nodes directly whilst applying NVTX ranges
Ignore pipe method wrapper when applying range
2022-06-30 11:28:12 +02:00
Raphael Mitsch
4c058eb40a
enable argument for spacy.load() (#10784)
* Enable flag on spacy.load: foundation for include, enable arguments.

* Enable flag on spacy.load: fixed tests.

* Enable flag on spacy.load: switched from pretrained model to empty model with added pipes for tests.

* Enable flag on spacy.load: switched to more consistent error on misspecification of component activity. Test refactoring. Added  to default config.

* Enable flag on spacy.load: added support for fields not in pipeline.

* Enable flag on spacy.load: removed serialization fields from supported fields.

* Enable flag on spacy.load: removed 'enable' from config again.

* Enable flag on spacy.load: relaxed checks in _resolve_component_activation_status() to allow non-standard pipes.

* Enable flag on spacy.load: fixed relaxed checks for _resolve_component_activation_status() to allow non-standard pipes. Extended tests.

* Enable flag on spacy.load: comments w.r.t. resolution workarounds.

* Enable flag on spacy.load: remove include fields. Update website docs.

* Enable flag on spacy.load: updates w.r.t. changes in master.

* Implement Doc.from_json(): update docstrings.

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

* Implement Doc.from_json(): remove newline.

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

* Implement Doc.from_json(): change error message for E1038.

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

* Enable flag on spacy.load: wrapped docstring for _resolve_component_status() at 80 chars.

* Enable flag on spacy.load: changed exmples for enable flag.

* Remove newline.

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Fix docstring for Language._resolve_component_status().

* Rename E1038 to E1042.

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
2022-06-17 20:24:13 +01:00
Madeesh Kannan
41389ffe1e
Avoid pickling Doc inputs passed to Language.pipe() (#10864)
* `Language.pipe()`: Serialize `Doc` objects to bytes when using multiprocessing to avoid pickling overhead

* `Doc.to_dict()`: Serialize `_context` attribute (keeping in line with `(un)pickle_doc()`

* Correct type annotations

* Fix typo

* `Doc`: Do not serialize `_context`

* `Language.pipe`: Send context objects to child processes, Simplify `as_tuples` handling

* Fix type annotation

* `Language.pipe`: Simplify `as_tuple` multiprocessor handling

* Cleanup code, fix typos

* MyPy fixes

* Move doc preparation function into `_multiprocessing_pipe`
Whitespace changes

* Remove superfluous comma

* Rename `prepare_doc` to `prepare_input`

* Update spacy/errors.py

* Undo renaming for error

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
2022-06-02 20:06:49 +02:00
single-fingal
6c6b8da7cc
Fix: De/Serialize SpanGroups including the SpanGroup keys (#10707)
* fix: De/Serialize `SpanGroups` including the SpanGroup keys

This prevents the loss of `SpanGroup`s that have the same .name as other `SpanGroup`s within the same `SpanGroups` object (upon de/serialization of the `SpanGroups`).

Fixes #10685

* Maintain backwards compatibility for serialized `SpanGroups`

(serialized as: a list of `SpanGroup`s, or b'')

* Add tests for `SpanGroups` deserialization backwards-compatibility

* Move a `SpanGroups` de/serialization test (test_issue10685)
  to tests/serialize/test_serialize_spangroups.py

* Output a warning if deserializing a `SpanGroups` with duplicate .name-d `SpanGroup`s

* Minor refactor

* `SpanGroups.from_bytes` handles only `list` and `dict` types with
`dict` as the expected default
* For lists, keep first rather than last value encountered
* Update error message
* Rename and update tests

* Update to preserve list serialization of SpanGroups

To avoid breaking compatibility of serialized `Doc` and `DocBin` with
earlier versions of spacy v3, revert back to a list-only serialization,
but update the names just for serialization so that the SpanGroups keys
override the SpanGroup names.

* Preserve object identity and current key overwrite

* Preserve SpanGroup object identity
* Preserve last rather than first span group from SpanGroup list
  format without SpanGroups keys

* Update inline comments

* Fix types

* Add type info for SpanGroup.copy

* Deserialize `SpanGroup`s as copies

when a single SpanGroup is the value for more than 1 `SpanGroups` key.
This is because we serialize `SpanGroups` as dicts (to maintain backward-
and forward-compatibility) and we can't assume `SpanGroup`s with the same
bytes/serialization were the same (identical) object, pre-serialization.

* Update spacy/tokens/_dict_proxies.py

* Add more SpanGroups serialization tests

Test that serialized SpanGroups maintain their Span order

* small clarification on older spaCy version

* Update spacy/tests/serialize/test_serialize_span_groups.py

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
2022-06-02 15:56:27 +02:00
Raphael Mitsch
8387ce4c01
Add Doc.from_json() (#10688)
* Implement Doc.from_json: rough draft.

* Implement Doc.from_json: first draft with tests.

* Implement Doc.from_json: added documentation on website for Doc.to_json(), Doc.from_json().

* Implement Doc.from_json: formatting changes.

* Implement Doc.to_json(): reverting unrelated formatting changes.

* Implement Doc.to_json(): fixing entity and span conversion. Moving fixture and doc <-> json conversion tests into single file.

* Implement Doc.from_json(): replaced entity/span converters with doc.char_span() calls.

* Implement Doc.from_json(): handling sentence boundaries in spans.

* Implementing Doc.from_json(): added parser-free sentence boundaries transfer.

* Implementing Doc.from_json(): added parser-free sentence boundaries transfer.

* Implementing Doc.from_json(): incorporated various PR feedback.

* Renaming fixture for document without dependencies.

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

* Implementing Doc.from_json(): using two sent_starts instead of one.

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

* Implementing Doc.from_json(): doc_without_dependency_parser() -> doc_without_deps.

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

* Implementing Doc.from_json(): incorporating various PR feedback. Rebased on latest master.

* Implementing Doc.from_json(): refactored Doc.from_json() to work with annotation IDs instead of their string representations.

* Implement Doc.from_json(): reverting unwanted formatting/rebasing changes.

* Implement Doc.from_json(): added check for char_span() calculation for entities.

* Update spacy/tokens/doc.pyx

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

* Implement Doc.from_json(): minor refactoring, additional check for token attribute consistency with corresponding test.

* Implement Doc.from_json(): removed redundancy in annotation type key naming.

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

* Implement Doc.from_json(): Simplifying setting annotation values.

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

* Implement doc.from_json(): renaming annot_types to token_attrs.

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

* Implement Doc.from_json(): adjustments for renaming of annot_types to token_attrs.

* Implement Doc.from_json(): removing default categories.

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

* Implement Doc.from_json(): simplifying lexeme initialization.

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

* Implement Doc.from_json(): simplifying lexeme initialization.

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

* Implement Doc.from_json(): refactoring to only have keys for present annotations.

* Implement Doc.from_json(): fix check for tokens' HEAD attributes.

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

* Implement Doc.from_json(): refactoring Doc.from_json().

* Implement Doc.from_json(): fixing span_group retrieval.

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

* Implement Doc.from_json(): fixing span retrieval.

* Implement Doc.from_json(): added schema for Doc JSON format. Minor refactoring in Doc.from_json().

* Implement Doc.from_json(): added comment regarding Token and Span extension support.

* Implement Doc.from_json(): renaming inconsistent_props to partial_attrs..

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

* Implement Doc.from_json(): adjusting error message.

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

* Implement Doc.from_json(): extending E1038 message.

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

* Implement Doc.from_json(): added params to E1038 raises.

* Implement Doc.from_json(): combined attribute collection with partial attributes check.

* Implement Doc.from_json(): added optional schema validation.

* Implement Doc.from_json(): fixed optional fields in schema, tests.

* Implement Doc.from_json(): removed redundant None check for DEP.

* Implement Doc.from_json(): added passing of schema validatoin message to E1037..

* Implement Doc.from_json(): removing redundant error E1040.

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

* Implement Doc.from_json(): changing message for E1037.

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

* Implement Doc.from_json(): adjusted website docs and docstring of Doc.from_json().

* Update spacy/tests/doc/test_json_doc_conversion.py

* Implement Doc.from_json(): docstring update.

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

* Implement Doc.from_json(): docstring update.

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

* Implement Doc.from_json(): website docs update.

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

* Implement Doc.from_json(): docstring formatting.

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

* Implement Doc.from_json(): docstring formatting.

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

* Implement Doc.from_json(): fixing Doc reference in website docs.

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

* Implement Doc.from_json(): reformatted website/docs/api/doc.md.

* Implement Doc.from_json(): bumped IDs of new errors to avoid merge conflicts.

* Implement Doc.from_json(): fixing bug in tests.

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

* Implement Doc.from_json(): fix setting of sentence starts for docs without DEP.

* Implement Doc.from_json(): add check for valid char spans when manually setting sentence boundaries. Refactor sentence boundary setting slightly. Move error message for lack of support for partial token annotations to errors.py.

* Implement Doc.from_json(): simplify token sentence start manipulation.

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

* Combine related error messages

* Update spacy/tests/doc/test_json_doc_conversion.py

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
2022-06-02 14:03:47 +02:00
Adriane Boyd
a322d6d5f2
Add SpanRuler component (#9880)
* Add SpanRuler component

Add a `SpanRuler` component similar to `EntityRuler` that saves a list
of matched spans to `Doc.spans[spans_key]`. The matches from the token
and phrase matchers are deduplicated and sorted before assignment but
are not otherwise filtered.

* Update spacy/pipeline/span_ruler.py

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Fix cast

* Add self.key property

* Use number of patterns as length

* Remove patterns kwarg from init

* Update spacy/tests/pipeline/test_span_ruler.py

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Add options for spans filter and setting to ents

* Add `spans_filter` option as a registered function'
* Make `spans_key` optional and if `None`, set to `doc.ents` instead of
`doc.spans[spans_key]`.

* Update and generalize tests

* Add test for setting doc.ents, fix key property type

* Fix typing

* Allow independent doc.spans and doc.ents

* If `spans_key` is set, set `doc.spans` with `spans_filter`.
* If `annotate_ents` is set, set `doc.ents` with `ents_fitler`.
  * Use `util.filter_spans` by default as `ents_filter`.
  * Use a custom warning if the filter does not work for `doc.ents`.

* Enable use of SpanC.id in Span

* Support id in SpanRuler as Span.id

* Update types

* `id` can only be provided as string (already by `PatternType`
definition)

* Update all uses of Span.id/ent_id in Doc

* Rename Span id kwarg to span_id

* Update types and docs

* Add ents filter to mimic EntityRuler overwrite_ents

* Refactor `ents_filter` to take `entities, spans` args for more
  filtering options
* Give registered filters more descriptive names
* Allow registered `filter_spans` filter
  (`spacy.first_longest_spans_filter.v1`) to take any number of
  `Iterable[Span]` objects as args so it can be used for spans filter
  or ents filter

* Implement future entity ruler as span ruler

Implement a compatible `entity_ruler` as `future_entity_ruler` using
`SpanRuler` as the underlying component:
* Add `sort_key` and `sort_reverse` to allow the sorting behavior to be
  customized. (Necessary for the same sorting/filtering as in
  `EntityRuler`.)
* Implement `overwrite_overlapping_ents_filter` and
  `preserve_existing_ents_filter` to support
  `EntityRuler.overwrite_ents` settings.
* Add `remove_by_id` to support `EntityRuler.remove` functionality.
* Refactor `entity_ruler` tests to parametrize all tests to test both
  `entity_ruler` and `future_entity_ruler`
* Implement `SpanRuler.token_patterns` and `SpanRuler.phrase_patterns`
  properties.

Additional changes:

* Move all config settings to top-level attributes to avoid duplicating
  settings in the config vs. `span_ruler/cfg`. (Also avoids a lot of
  casting.)

* Format

* Fix filter make method name

* Refactor to use same error for removing by label or ID

* Also provide existing spans to spans filter

* Support ids property

* Remove token_patterns and phrase_patterns

* Update docstrings

* Add span ruler docs

* Fix types

* Apply suggestions from code review

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Move sorting into filters

* Check for all tokens in seen tokens in entity ruler filters

* Remove registered sort key

* Set Token.ent_id in a backwards-compatible way in Doc.set_ents

* Remove sort options from API docs

* Update docstrings

* Rename entity ruler filters

* Fix and parameterize scoring

* Add id to Span API docs

* Fix typo in API docs

* Include explicit labeled=True for scorer

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
2022-06-02 13:12:53 +02:00
kadarakos
f6a4b80c0b
Better errors for has_annotation and Matcher (#10830)
* Show input argument instead of None

* catch invalid attr early

* moved error message from code to errors.py

* Update spacy/errors.py

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

* Update spacy/errors.py

* update E153 and E154

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
2022-05-25 11:12:29 +02:00
Richard Hudson
32954c3bcb
Fix issues for Mypy 0.950 and Pydantic 1.9.0 (#10786)
* Make changes to typing

* Correction

* Format with black

* Corrections based on review

* Bumped Thinc dependency version

* Bumped blis requirement

* Correction for older Python versions

* Update spacy/ml/models/textcat.py

Co-authored-by: Daniël de Kok <me@github.danieldk.eu>

* Corrections based on review feedback

* Readd deleted docstring line

Co-authored-by: Daniël de Kok <me@github.danieldk.eu>
2022-05-25 09:33:54 +02:00
Raphael Mitsch
6f9e2ca81f
Ignore overrides for pipe names in config argument (#10779)
* Pipe name override in config: added check with warning, added removal of name override from config, extended tests.

* Pipoe name override in config: added pytest UserWarning.

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
2022-05-12 11:46:08 +02:00
Raphael Mitsch
f5390e278a
Refactor error messages to remove hardcoded strings (#10729)
* Use custom error msg instead of hardcoded string: replaced remaining hardcoded error message strings.

* Use custom error msg instead of hardcoded string: fixing faulty Errors import.
2022-05-02 13:38:46 +02:00
Richard Hudson
75fbbcdc18
Display warning when spacy.explain() finds no term (#10645)
* Display warning when spacy.explain() finds no term

* Updated warning message text
2022-04-12 10:48:28 +02:00
Adriane Boyd
ca54de27bb
Support more internal methods for SpanGroup (#10476)
* Added new convenience cython functions to SpanGroup to avoid unnecessary allocation/deallocation of objects

* Replaced sorting in has_overlap with C++ for efficiency. Also, added a test for has_overlap

* Added a method to efficiently merge SpanGroups

* Added __delitem__, __add__ and __iadd__. Also, allowed to pass span lists to merge function. Replaced extend() body with call to merge

* Renamed merge to concat and added missing things to documentation

* Added operator+ and operator += in the documentation

* Added a test for Doc deallocation

* Update spacy/tokens/span_group.pyx

* Updated SpanGroup tests to use new span list comparison function rather than assert_span_list_equal, eliminating the need to have a separate assert_not_equal fnction

* Fixed typos in SpanGroup documentation

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Minor changes requested by Sofie: rearranged import statements. Added new=3.2.1 tag to SpanGroup.__setitem__ documentation

* SpanGroup: moved repetitive list index check/adjustment in a separate function

* Turn off formatting that hurts readability spacy/tests/doc/test_span_group.py

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Remove formatting that hurts readability spacy/tests/doc/test_span_group.py

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Turn off formatting that hurts readability in spacy/tests/doc/test_span_group.py

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Support more internal methods for SpanGroup

Add support for:

* `__setitem__`
* `__delitem__`
* `__iadd__`: for `SpanGroup` or `Iterable[Span]`
* `__add__`: for `SpanGroup` only

Adapted from #9698 with the scope limited to the magic methods.

* Use v3.3 as new version in docs

* Add new tag to SpanGroup.copy in API docs

* Remove duplicate import

* Apply suggestions from code review

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Remaining suggestions and formatting

Co-authored-by: nrodnova <nrodnova@hotmail.com>
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
Co-authored-by: Natalia Rodnova <4512370+nrodnova@users.noreply.github.com>
2022-04-01 09:56:26 +02:00
Daniël de Kok
c90dd6f265
Alignment: use a simplified ragged type for performance (#10319)
* Alignment: use a simplified ragged type for performance

This introduces the AlignmentArray type, which is a simplified version
of Ragged that performs better on the simple(r) indexing performed for
alignment.

* AlignmentArray: raise an error when using unsupported index

* AlignmentArray: move error messages to Errors

* AlignmentArray: remove simlified ... with simplifications

* AlignmentArray: fix typo that broke a[n:n] indexing
2022-04-01 09:02:06 +02:00
Adriane Boyd
f98b41c390
Add vector deduplication (#10551)
* Add vector deduplication

* Add `Vocab.deduplicate_vectors()`
* Always run deduplication in `spacy init vectors`
* Clean up a few vector-related error messages and docs examples

* Always unique with numpy

* Fix types
2022-03-30 08:54:23 +02:00
Adriane Boyd
85778dfcf4
Add edit tree lemmatizer (#10231)
* Add edit tree lemmatizer

Co-authored-by: Daniël de Kok <me@danieldk.eu>

* Hide edit tree lemmatizer labels

* Use relative imports

* Switch to single quotes in error message

* Type annotation fixes

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Reformat edit_tree_lemmatizer with black

* EditTreeLemmatizer.predict: take Iterable

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Validate edit trees during deserialization

This change also changes the serialized representation. Rather than
mirroring the deep C structure, we use a simple flat union of the match
and substitution node types.

* Move edit_trees to _edit_tree_internals

* Fix invalid edit tree format error message

* edit_tree_lemmatizer: remove outdated TODO comment

* Rename factory name to trainable_lemmatizer

* Ignore type instead of casting truths to List[Union[Ints1d, Floats2d, List[int], List[str]]] for thinc v8.0.14

* Switch to Tagger.v2

* Add documentation for EditTreeLemmatizer

* docs: Fix 3.2 -> 3.3 somewhere

* trainable_lemmatizer documentation fixes

* docs: EditTreeLemmatizer is in edit_tree_lemmatizer.py

Co-authored-by: Daniël de Kok <me@danieldk.eu>
Co-authored-by: Daniël de Kok <me@github.danieldk.eu>
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
2022-03-28 11:13:50 +02:00
Lj Miranda
a79cd3542b
Add displacy support for overlapping Spans (#10332)
* Fix docstring for EntityRenderer

* Add warning in displacy if doc.spans are empty

* Implement parse_spans converter

One notable change here is that the default spans_key is sc, and
it's set by the user through the options.

* Implement SpanRenderer

Here, I implemented a SpanRenderer that looks similar to the
EntityRenderer except for some templates.  The spans_key, by default, is
set to sc, but can be configured in the options (see parse_spans). The
way I rendered these spans is per-token, i.e., I first check if each
token (1) belongs to a given span type and (2) a starting token of a
given span type. Once I have this information, I render them into the
markup.

* Fix mypy issues on typing

* Add tests for displacy spans support

* Update colors from RGB to hex

Co-authored-by: Ines Montani <ines@ines.io>

* Remove unnecessary CSS properties

* Add documentation for website

* Remove unnecesasry scripts

* Update wording on the documentation

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Put typing dependency on top of file

* Put back z-index so that spans overlap properly

* Make warning more explicit for spans_key

Co-authored-by: Ines Montani <ines@ines.io>
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
2022-03-16 18:14:34 +01:00
Sofie Van Landeghem
3f68bbcfec
Clean up loggers docs (#10351)
* update docs to point to spacy-loggers docs

* remove unused error code
2022-02-25 16:29:12 +01:00
Edward
7961a0a959
Fix typo in errors (#10256) 2022-02-10 13:45:46 +01:00
Duygu Altinok
47a2916801
Intify IOB (#9738)
* added iob to int

* added tests

* added iob strings

* added error

* blacked attrs

* Update spacy/tests/lang/test_attrs.py

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

* Update spacy/attrs.pyx

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

* added iob strings as global

* minor refinement with iob

* removed iob strings from token

* changed to uppercase

* cleaned and went back to master version

* imported iob from attrs

* Update and format errors

* Support and test both str and int ENT_IOB key

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
2022-01-20 13:19:38 +01:00
Sofie Van Landeghem
56dcb39fb7
Fix references to config file in the docs & UX (#9961)
* doc fixes around config file

* fix typo

* clarify default
2022-01-04 14:31:26 +01:00
Duygu Altinok
b56b9e7f31
Entity ruler remove pattern (#9685)
* added ruler coe

* added error for none existing pattern

* changed error to warning

* changed error to warning

* added basic tests

* fixed place

* added test files

* went back to error

* went back to pattern error

* minor change to docs

* changed style

* changed doc

* changed error slightly

* added remove to phrasem api

* error key already existed

* phrase matcher match code to api

* blacked tests

* moved comments before expr

* corrected error no

* Update website/docs/api/entityruler.md

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Update website/docs/api/entityruler.md

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
2021-12-06 15:32:49 +01:00
Duygu Altinok
a7d7e80adb
EntityRuler improve disk load error message (#9658)
* added error string

* added serialization test

* added more to if statements

* wrote file to tempdir

* added tempdir

* changed parameter a bit

* Update spacy/tests/pipeline/test_entity_ruler.py

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
2021-11-23 16:26:05 +01:00
Adriane Boyd
9ac6d4991e
Add doc_cleaner component (#9659)
* Add doc_cleaner component

* Fix types

* Fix loop

* Rephrase method description
2021-11-23 15:33:33 +01:00
Adriane Boyd
07dea324f6 Merge remote-tracking branch 'upstream/develop' into chore/switch-to-master-v3.2.0 2021-11-03 15:32:18 +01:00
Bram Vanroy
cab9209c3d
use metaclass to decorate errors (#9593) 2021-11-03 15:29:32 +01:00
Adriane Boyd
2d430958e1 Merge remote-tracking branch 'upstream/master' into chore/update-develop-from-master-v3.2-3 2021-10-29 12:18:15 +02:00
Paul O'Leary McCann
006df1ae1f
Clarify error when words are of wrong type (#9541)
* Clarify error when words are of wrong type

See #9437

* Update docs

* Use try/except

* Apply suggestions from code review

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
2021-10-29 12:08:40 +02:00
Adriane Boyd
c053f158c5
Add support for floret vectors (#8909)
* Add support for fasttext-bloom hash-only vectors

Overview:

* Extend `Vectors` to have two modes: `default` and `ngram`
  * `default` is the default mode and equivalent to the current
    `Vectors`
  * `ngram` supports the hash-only ngram tables from `fasttext-bloom`
* Extend `spacy.StaticVectors.v2` to handle both modes with no changes
  for `default` vectors
* Extend `spacy init vectors` to support ngram tables

The `ngram` mode **only** supports vector tables produced by this
fork of fastText, which adds an option to represent all vectors using
only the ngram buckets table and which uses the exact same ngram
generation algorithm and hash function (`MurmurHash3_x64_128`).
`fasttext-bloom` produces an additional `.hashvec` table, which can be
loaded by `spacy init vectors --fasttext-bloom-vectors`.

https://github.com/adrianeboyd/fastText/tree/feature/bloom

Implementation details:

* `Vectors` now includes the `StringStore` as `Vectors.strings` so that
  the API can stay consistent for both `default` (which can look up from
  `str` or `int`) and `ngram` (which requires `str` to calculate the
  ngrams).

* In ngram mode `Vectors` uses a default `Vectors` object as a cache
  since the ngram vectors lookups are relatively expensive.

  * The default cache size is the same size as the provided ngram vector
    table.

  * Once the cache is full, no more entries are added. The user is
    responsible for managing the cache in cases where the initial
    documents are not representative of the texts.

  * The cache can be resized by setting `Vectors.ngram_cache_size` or
    cleared with `vectors._ngram_cache.clear()`.

* The API ends up a bit split between methods for `default` and for
  `ngram`, so functions that only make sense for `default` or `ngram`
  include warnings with custom messages suggesting alternatives where
  possible.

* `Vocab.vectors` becomes a property so that the string stores can be
  synced when assigning vectors to a vocab.

* `Vectors` serializes its own config settings as `vectors.cfg`.

* The `Vectors` serialization methods have added support for `exclude`
  so that the `Vocab` can exclude the `Vectors` strings while serializing.

Removed:

* The `minn` and `maxn` options and related code from
  `Vocab.get_vector`, which does not work in a meaningful way for default
  vector tables.

* The unused `GlobalRegistry` in `Vectors`.

* Refactor to use reduce_mean

Refactor to use reduce_mean and remove the ngram vectors cache.

* Rename to floret

* Rename to floret in error messages

* Use --vectors-mode in CLI, vector init

* Fix vectors mode in init

* Remove unused var

* Minor API and docstrings adjustments

* Rename `--vectors-mode` to `--mode` in `init vectors` CLI
* Rename `Vectors.get_floret_vectors` to `Vectors.get_batch` and support
  both modes.
* Minor updates to Vectors docstrings.

* Update API docs for Vectors and init vectors CLI

* Update types for StaticVectors
2021-10-27 14:08:31 +02:00
Adriane Boyd
a803af9dfa Merge remote-tracking branch 'upstream/master' into chore/update-develop-from-master-v3.2-1 2021-10-26 11:53:50 +02:00
Daniël de Kok
f31ac6fd4f
Print a warning when multiprocessing is used on a GPU (#9475)
* Raise an error when multiprocessing is used on a GPU

As reported in #5507, a confusing exception is thrown when
multiprocessing is used with a GPU model and the `fork` multiprocessing
start method:

cupy.cuda.runtime.CUDARuntimeError: cudaErrorInitializationError: initialization error

This change checks whether one of the models uses the GPU when
multiprocessing is used. If so, raise a friendly error message.

Even though multiprocessing can work on a GPU with the `spawn` method,
it quickly runs the GPU out-of-memory on real-world data. Also,
multiprocessing on a single GPU typically does not provide large
performance gains.

* Move GPU multiprocessing check to Language.pipe

* Warn rather than error when using multiprocessing with GPU models

* Improve GPU multiprocessing warning message.

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

* Reduce API assumptions

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

* Update spacy/language.py

* Update spacy/language.py

* Test that warning is thrown with GPU + multiprocessing

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
2021-10-21 16:14:23 +02:00
Adriane Boyd
d98d525bc8 Merge remote-tracking branch 'upstream/master' into chore/update-develop-from-master-v3.1-3 2021-10-14 09:41:46 +02:00
Sofie Van Landeghem
5e8e8525f0
fix W108 filter (#9438)
* remove text argument from W108 to enable 'once' filtering

* include the option of partial POS annotation

* fix typo

* Update spacy/errors.py

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
2021-10-12 19:56:44 +02:00
Elia Robyn Lake (Robyn Speer)
53b5f245ed
Allow IETF language codes, aliases, and close matches (#9342)
* use language-matching to allow language code aliases

Signed-off-by: Elia Robyn Speer <elia@explosion.ai>

* link to "IETF language tags" in docs

Signed-off-by: Elia Robyn Speer <elia@explosion.ai>

* Make requirements consistent

Signed-off-by: Elia Robyn Speer <elia@explosion.ai>

* change "two-letter language ID" to "IETF language tag" in language docs

Signed-off-by: Elia Robyn Speer <elia@explosion.ai>

* use langcodes 3.2 and handle language-tag errors better

Signed-off-by: Elia Robyn Speer <elia@explosion.ai>

* all unknown language codes are ImportErrors

Signed-off-by: Elia Robyn Speer <elia@explosion.ai>

Co-authored-by: Elia Robyn Speer <elia@explosion.ai>
2021-10-05 09:52:22 +02:00
Sofie Van Landeghem
a361df00cd
Raise E983 early on in docbin init (#9247)
* raise E983 early on in docbin init

* catch situation before error is raised

* add more info on the spacy debug command
2021-09-27 20:43:03 +02:00