spaCy

mirror of https://github.com/explosion/spaCy.git synced 2025-07-15 18:52:29 +03:00

Author	SHA1	Message	Date
Paul O'Leary McCann	006df1ae1f	Clarify error when words are of wrong type (#9541 ) * Clarify error when words are of wrong type See #9437 * Update docs * Use try/except * Apply suggestions from code review Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>	2021-10-29 12:08:40 +02:00
Connor Brinton	657af5f91f	🏷 Add Mypy check to CI and ignore all existing Mypy errors (#9167 ) * 🚨 Ignore all existing Mypy errors * 🏗 Add Mypy check to CI * Add types-mock and types-requests as dev requirements * Add additional type ignore directives * Add types packages to dev-only list in reqs test * Add types-dataclasses for python 3.6 * Add ignore to pretrain * 🏷 Improve type annotation on `run_command` helper The `run_command` helper previously declared that it returned an `Optional[subprocess.CompletedProcess]`, but it isn't actually possible for the function to return `None`. These changes modify the type annotation of the `run_command` helper and remove all now-unnecessary `# type: ignore` directives. * 🔧 Allow variable type redefinition in limited contexts These changes modify how Mypy is configured to allow variables to have their type automatically redefined under certain conditions. The Mypy documentation contains the following example:
```python
from typing import List

def process(items: List[str]) -> None:
    # 'items' has type List[str]
    items = [item.split() for item in items]
    # 'items' now has type List[List[str]]
    ...
```
This configuration change is especially helpful in reducing the number of `# type: ignore` directives needed to handle the common pattern of: * Accepting a filepath as a string * Overwriting the variable using `filepath = ensure_path(filepath)` These changes enable redefinition and remove all `# type: ignore` directives rendered redundant by this change. 
* 🏷 Add type annotation to converters mapping * 🚨 Fix Mypy error in convert CLI argument verification * 🏷 Improve type annotation on `resolve_dot_names` helper * 🏷 Add type annotations for `Vocab` attributes `strings` and `vectors` * 🏷 Add type annotations for more `Vocab` attributes * 🏷 Add loose type annotation for gold data compilation * 🏷 Improve `_format_labels` type annotation * 🏷 Fix `get_lang_class` type annotation * 🏷 Loosen return type of `Language.evaluate` * 🏷 Don't accept `Scorer` in `handle_scores_per_type` * 🏷 Add `string_to_list` overloads * 🏷 Fix non-Optional command-line options * 🙈 Ignore redefinition of `wandb_logger` in `loggers.py` * ➕ Install `typing_extensions` in Python 3.8+ The `typing_extensions` package states that it should be used when "writing code that must be compatible with multiple Python versions". Since spaCy needs to support multiple Python versions, it should be used when newer `typing` module members are required. One example of this is `Literal`, which is available starting with Python 3.8. Previously spaCy tried to import `Literal` from `typing`, falling back to `typing_extensions` if the import failed. However, Mypy doesn't seem to be able to understand what `Literal` means when the initial import fails. Therefore, these changes modify how `compat` imports `Literal` by always importing it from `typing_extensions`. These changes also modify how `typing_extensions` is installed, so that it is a requirement for all Python versions, including those greater than or equal to 3.8. * 🏷 Improve type annotation for `Language.pipe` These changes add a missing overload variant to the type signature of `Language.pipe`. Additionally, the type signature is enhanced to allow type checkers to differentiate between the two overload variants based on the `as_tuple` parameter. 
Fixes #8772 * ➖ Don't install `typing-extensions` in Python 3.8+ After more detailed analysis of how to implement Python version-specific type annotations using spaCy, it has been determined that branching on a comparison against `sys.version_info` can be statically analyzed by Mypy well enough to enable us to conditionally use `typing_extensions.Literal`. This means that we no longer need to install `typing_extensions` for Python versions greater than or equal to 3.8! 🎉 These changes revert previous changes installing `typing-extensions` regardless of Python version and modify how we import the `Literal` type to ensure that Mypy treats it properly. * resolve mypy errors for Strict pydantic types * refactor code to avoid missing return statement * fix types of convert CLI command * avoid list-set confusion in debug_data * fix typo and formatting * small fixes to avoid type ignores * fix types in profile CLI command and make it more efficient * type fixes in projects CLI * put one ignore back * type fixes for render * fix render types - the sequel * fix BaseDefault in language definitions * fix type of noun_chunks iterator - yields tuple instead of span * fix types in language-specific modules * 🏷 Expand accepted inputs of `get_string_id` `get_string_id` accepts either a string (in which case it returns its ID) or an ID (in which case it immediately returns the ID). These changes extend the type annotation of `get_string_id` to indicate that it can accept either strings or IDs. * 🏷 Handle override types in `combine_score_weights` The `combine_score_weights` function allows users to pass an `overrides` mapping to override data extracted from the `weights` argument. Since it allows `Optional` dictionary values, the return value may also include `Optional` dictionary values. These changes update the type annotations for `combine_score_weights` to reflect this fact. 
* 🏷 Fix tokenizer serialization method signatures in `DummyTokenizer` * 🏷 Fix redefinition of `wandb_logger` These changes fix the redefinition of `wandb_logger` by giving a separate name to each `WandbLogger` version. For backwards-compatibility, `spacy.train` still exports `wandb_logger_v3` as `wandb_logger` for now. * more fixes for typing in language * type fixes in model definitions * 🏷 Annotate `_RandomWords.probs` as `NDArray` * 🏷 Annotate `tok2vec` layers to help Mypy * 🐛 Fix `_RandomWords.probs` type annotations for Python 3.6 Also remove an import that I forgot to move to the top of the module 😅 * more fixes for matchers and other pipeline components * quick fix for entity linker * fixing types for spancat, textcat, etc * bugfix for tok2vec * type annotations for scorer * add runtime_checkable for Protocol * type and import fixes in tests * mypy fixes for training utilities * few fixes in util * fix import * 🐵 Remove unused `# type: ignore` directives * 🏷 Annotate `Language._components` * 🏷 Annotate `spacy.pipeline.Pipe` * add doc as property to span.pyi * small fixes and cleanup * explicit type annotations instead of via comment Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com> Co-authored-by: svlandeg <sofie.vanlandeghem@gmail.com> Co-authored-by: svlandeg <svlandeg@github.com>	2021-10-14 15:21:40 +02:00
Paul O'Leary McCann	c4f0800fb8	Validate pos values when creating Doc (#9148 ) * Validate pos values when creating Doc * Add clear error when setting invalid pos This also changes the error language slightly. * Fix variable name * Update spacy/tokens/doc.pyx * Test that setting invalid pos raises an error Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>	2021-09-16 13:28:05 +02:00
Adriane Boyd	393c3c70d7	Various fixes for spans in Doc.from_docs (#8487 ) * Fix spans offsets if a doc ends in a single space and no space is inserted * Also include spans key in merged doc for empty spans lists	2021-06-23 15:51:35 +02:00
Adriane Boyd	5646fcbe46	Merge remote-tracking branch 'upstream/develop' into chore/develop-into-master-v3.1	2021-06-15 15:05:17 +02:00
Paul O'Leary McCann	2c105cdbce	Raise error if deps not provided with heads (#8335 ) * Fill in deps if not provided with heads Before this change, if heads were passed without deps they would be silently ignored, which could be confusing. See #8334. * Use "dep" instead of a blank string This is the customary placeholder dep. It might be better to show an error here instead though. * Throw error on heads without deps * Add a test * Fix tests * Formatting * Fix all tests * Fix a test I missed * Revise error message * Clean up whitespace Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>	2021-06-15 13:23:32 +02:00
Adriane Boyd	9dfd3c9484	Use warnings.warn instead of logger.warning	2021-06-04 17:44:08 +02:00
Adriane Boyd	2c545c4c5b	Fix offsets in Span.get_lca_matrix (#8116 ) * Fix range in Span.get_lca_matrix Fix the adjusted token index / lca matrix index ranges for `_get_lca_matrix` for spans. * The range for `k` should correspond to the adjusted indices in `lca_matrix` with the `start` indexed at `0` * Update test for v3.x	2021-05-17 16:54:23 +02:00
Adriane Boyd	a71194362f	Fix Doc.from_docs for all empty docs (#8009 )	2021-05-05 18:44:14 +02:00
Adriane Boyd	27a48f2802	Fix/update extension copying in Span.as_doc and Doc.from_docs (#7574 ) * Adjust custom extension data when copying user data in `Span.as_doc()` * Restrict `Doc.from_docs()` to adjusting offsets for custom extension data * Update test to use extension * (Duplicate bug fix for character offset from #7497)	2021-03-30 09:49:12 +02:00
Adriane Boyd	139f655f34	Merge doc.spans in Doc.from_docs() (#7497 ) Merge data from `doc.spans` in `Doc.from_docs()`. * Fix internal character offset set when merging empty docs (only affects tokens and spans in `user_data` if an empty doc is in the list of docs)	2021-03-29 22:34:01 +11:00
Sofie Van Landeghem	dd99872bb0	Fix spans weak ref in doc copy (#7225 ) * failing unit test * ensure that doc.spans refers to the copied doc, not the old * add type info	2021-02-28 12:32:48 +11:00
Ines Montani	d0c3775712	Replace links to nightly docs [ci skip]	2021-01-30 20:09:38 +11:00
Ines Montani	30765674d0	Merge branch 'master' into develop	2021-01-30 12:20:28 +11:00
Adriane Boyd	4096a79de7	Add alignment mode error and fix Doc.char_span docs (#6820 ) * Raise an error on an unrecognized alignment mode rather than defaulting to `strict` * Fix the `Doc.char_span` API doc alignment mode details	2021-01-27 23:40:42 +11:00
Matthew Honnibal	42b117e561	Fix Doc.copy bugs (#6809 ) * Don't let the Doc own LexemeC, to fix Doc.copy * Copy doc.spans	2021-01-25 21:40:18 +08:00
Sofie Van Landeghem	fed8f48965	raise NotImplementedError when noun_chunks iterator is not implemented (#6711 ) * raise NotImplementedError when noun_chunks iterator is not implemented * bring back, fix and document span.noun_chunks * formatting Co-authored-by: Matthew Honnibal <honnibal+gh@gmail.com>	2021-01-17 19:56:05 +08:00
Matthew Honnibal	92310a5e26	Merge branch 'develop' into feature/missing-dep	2021-01-14 17:39:01 +11:00
Matthew Honnibal	f277bfdf0f	Add SpanGroup and Graph container types to represent arbitrary annotations (#6696 ) * Draft out initial Spans data structure * Initial span group commit * Basic span group support on Doc * Basic test for span group * Compile span_group.pyx * Draft addition of SpanGroup to DocBin * Add deserialization for SpanGroup * Add tests for serializing SpanGroup * Fix serialization of SpanGroup * Add EdgeC and GraphC structs * Add draft Graph data structure * Compile graph * More work on Graph * Update GraphC * Upd graph * Fix walk functions * Let Graph take nodes and edges on construction * Fix walking and getting * Add graph tests * Fix import * Add module with the SpanGroups dict thingy * Update test * Rename 'span_groups' attribute * Try to fix c++11 compilation * Fix test * Update DocBin * Try to fix compilation * Try to fix graph * Improve SpanGroup docstrings * Add doc.spans to documentation * Fix serialization * Tidy up and add docs * Update docs [ci skip] * Add SpanGroup.has_overlap * WIP updated Graph API * Start testing new Graph API * Update Graph tests * Update Graph * Add docstring Co-authored-by: Ines Montani <ines@ines.io>	2021-01-14 17:30:41 +11:00
svlandeg	ed53bb979d	cleanup	2021-01-13 14:20:05 +01:00
svlandeg	86a4e316b8	fix sent_starts	2021-01-13 13:47:25 +01:00
svlandeg	a581d82f33	introduce token.has_head and refer to MISSING_DEP_ (WIP)	2021-01-12 17:17:06 +01:00
svlandeg	dd12c6c8fd	allow missing information in deps and heads annotations	2021-01-07 19:10:32 +01:00
Yohei Tamura	411c842a71	convert tuple to list, because the type mismatches (#6625 )	2021-01-07 16:42:12 +11:00
Sofie Van Landeghem	29b59086f9	Prevent 0-length mem alloc (#6653 ) * prevent 0-length mem alloc by adding asserts * fix lexeme mem allocation	2021-01-06 12:50:17 +11:00
Adriane Boyd	5ca57d8221	Add logger warning when serializing user hooks (#6595 ) Add a warning that user hooks are lost on serialization. Add a `user_hooks` exclude to skip the warning with pickle.	2020-12-29 11:54:32 +01:00
Adriane Boyd	320a8b1481	Add ent_id_ to strings serialized with Doc (#6353 )	2020-11-10 20:16:07 +08:00
Adriane Boyd	a4b32b9552	Handle missing reference values in scorer (#6286 ) * Handle missing reference values in scorer Handle missing values in reference doc during scoring where it is possible to detect an unset state for the attribute. If no reference docs contain annotation, `None` is returned instead of a score. `spacy evaluate` displays `-` for missing scores and the missing scores are saved as `None`/`null` in the metrics. Attributes without unset states: * `token.head`: relies on `token.dep` to recognize unset values * `doc.cats`: unable to handle missing annotation Additional changes: * add optional `has_annotation` check to `score_spans` to replace `doc.sents` hack * update `score_token_attr_per_feat` to handle missing and empty morph representations * fix bug in `Doc.has_annotation` for normalization of `IS_SENT_START` vs. `SENT_START` * Fix import * Update return types	2020-11-03 15:47:18 +01:00
Ines Montani	86d648740f	Fix morph representation in Doc.to_json	2020-10-13 11:39:03 +02:00
Ines Montani	8ff73f04db	Fix morph in Doc.to_json	2020-10-08 14:44:35 +02:00
Ines Montani	f1d1f78636	Make warning debug log [ci skip]	2020-10-04 22:44:21 +02:00
Ines Montani	bcd52e5486	Tidy up errors and warnings	2020-10-04 11:16:31 +02:00
Adriane Boyd	86c3ec9c2b	Refactor Token morph setting (#6175 ) * Refactor Token morph setting * Remove `Token.morph_` * Add `Token.set_morph()` * `0` resets `token.c.morph` to unset * Any other values are passed to `Morphology.add` * Add token.morph setter to set from MorphAnalysis	2020-10-01 22:21:46 +02:00
Adriane Boyd	73538782a0	Switch Doc.__init__(ents=) to IOB tags (#6173 ) * Switch Doc.__init__(ents=) to IOB tags * Fix check for "-" * Allow "" or None as missing IOB tag	2020-10-01 16:22:18 +02:00
Adriane Boyd	5c13e0cf1b	Remove unused error	2020-09-24 13:41:55 +02:00
Adriane Boyd	8eaacaae97	Refactor Doc.ents setter to use Doc.set_ents Additional changes: * Entity spans with missing labels are ignored * Fix ent_kb_id setting in `Doc.set_ents`	2020-09-24 12:36:51 +02:00
Adriane Boyd	b1a7d6c528	Refactor seen token detection	2020-09-22 14:42:51 +02:00
Adriane Boyd	535842e483	Merge branch 'develop' into feature/doc-ents-v3-2	2020-09-22 13:45:50 +02:00
Matthew Honnibal	3abc4a5adb	Slightly tidy doc.ents.__set__	2020-09-21 22:58:03 +02:00
Ines Montani	67fbcb3da5	Tidy up tests and docs	2020-09-21 20:43:54 +02:00
Adriane Boyd	f212303729	Add sent_starts to Doc.__init__ Add sent_starts to `Doc.__init__`. Officially specify `is_sent_start` values but also convert to and accept `sent_start` internally.	2020-09-21 17:59:09 +02:00
Adriane Boyd	6aa91c7ca0	Make user_data keyword-only	2020-09-21 16:00:06 +02:00
Adriane Boyd	177df15d89	Implement Doc.set_ents	2020-09-21 15:54:05 +02:00
Adriane Boyd	13fbf6556a	Merge remote-tracking branch 'upstream/develop' into feature/doc-ents-v3-2	2020-09-21 14:42:04 +02:00
Adriane Boyd	ce455f30ca	Fix formatting	2020-09-21 13:53:29 +02:00
Adriane Boyd	bc02e86494	Extend Doc.__init__ with additional annotation Mostly copying from `spacy.tests.util.get_doc`, add additional kwargs to `Doc.__init__` to initialize the most common doc/token values.	2020-09-21 13:36:24 +02:00
Adriane Boyd	a88106e852	Remove W106: HEAD and SENT_START in doc.from_array (#6086 ) * Remove W106: HEAD and SENT_START in doc.from_array This warning was hacky and being triggered too often. * Fix test	2020-09-18 03:01:29 +02:00
Adriane Boyd	8b650f3a78	Modify setting missing and blocked entity tokens In order to make it easier to construct `Doc` objects as training data, modify how missing and blocked entity tokens are set to prioritize setting `O` and missing entity tokens for training purposes over setting blocked entity tokens. * `Doc.ents` setter sets tokens outside entity spans to `O` regardless of the current state of each token * For `Doc.ents`, setting a span with a missing label sets the `ent_iob` to missing instead of blocked * `Doc.block_ents(spans)` marks spans as hard `O` for use with the `EntityRecognizer`	2020-09-17 21:27:42 +02:00
Adriane Boyd	7e4cd7575c	Refactor Doc.is_ flags (#6044 ) * Refactor Doc.is_ flags * Add derived `Doc.has_annotation` method * `Doc.has_annotation(attr)` returns `True` for partial annotation * `Doc.has_annotation(attr, require_complete=True)` returns `True` for complete annotation * Add deprecation warnings to `is_tagged`, `is_parsed`, `is_sentenced` and `is_nered` * Add `Doc._get_array_attrs()`, which returns a full list of `Doc` attrs for use with `Doc.to_array`, `Doc.to_bytes` and `Doc.from_docs`. The list is the `DocBin` attributes list plus `SPACY` and `LENGTH`. Notes on `Doc.has_annotation`: * `HEAD` is converted to `DEP` because heads don't have an unset state * Accept `IS_SENT_START` as a synonym of `SENT_START` Additional changes: * Add `NORM`, `ENT_ID` and `SENT_START` to default attributes for `DocBin` * In `Doc.from_array()` the presence of `DEP` causes `HEAD` to override `SENT_START` * In `Doc.from_array()` using `attrs` other than `Doc._get_array_attrs()` (i.e., a user's custom list rather than our default internal list) with both `HEAD` and `SENT_START` shows a warning that `HEAD` will override `SENT_START` * `set_children_from_heads` does not require dependency labels to set sentence boundaries and sets `sent_start` for all non-sentence starts to `-1` * Fix call to set_children_from_heads Co-authored-by: Matthew Honnibal <honnibal+gh@gmail.com>	2020-09-17 00:14:01 +02:00
Adriane Boyd	a119667a36	Clean up spacy.tokens (#6046 ) * Clean up spacy.tokens * Update `set_children_from_heads`: * Don't check `dep` when setting lr_* or sentence starts * Set all non-sentence starts to `False` * Use `set_children_from_heads` in `Token.head` setter * Reduce similar/duplicate code (admittedly adds a bit of overhead) * Update sentence starts consistently * Remove unused `Doc.set_parse` * Minor changes: * Declare cython variables (to avoid cython warnings) * Clean up imports * Modify set_children_from_heads to set token range Modify `set_children_from_heads` so that it adjusts tokens within a specified range rather than the whole document. Modify the `Token.head` setter to adjust only the tokens affected by the new head assignment.	2020-09-16 20:32:38 +02:00
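
Commit 657af5f91f in the log above describes two typing techniques: importing `Literal` behind a `sys.version_info` comparison so that Mypy can analyze the branch statically, and using overloads keyed on a boolean flag so a type checker can tell two call shapes apart. The following is a minimal sketch of how those two ideas might fit together. It is not spaCy's code: `pipe_texts`, its `as_tuples` flag, and the upper-casing body are hypothetical stand-ins chosen only for illustration.

```python
import sys
from typing import Iterable, Iterator, Tuple, TypeVar, overload

# Branching on sys.version_info (rather than try/except ImportError) is the
# pattern the commit message says a static checker can follow.
if sys.version_info >= (3, 8):
    from typing import Literal
else:
    from typing_extensions import Literal  # backport, only needed before 3.8

Ctx = TypeVar("Ctx")


@overload
def pipe_texts(
    texts: Iterable[str], *, as_tuples: Literal[False] = ...
) -> Iterator[str]: ...


@overload
def pipe_texts(
    texts: Iterable[Tuple[str, Ctx]], *, as_tuples: Literal[True]
) -> Iterator[Tuple[str, Ctx]]: ...


def pipe_texts(texts, *, as_tuples=False):
    """Hypothetical stand-in: upper-cases text instead of running a pipeline."""
    if as_tuples:
        # Each item is a (text, context) pair; the context passes through.
        for text, context in texts:
            yield text.upper(), context
    else:
        for text in texts:
            yield text.upper()
```

With these overloads, a checker would be expected to infer `Iterator[str]` for `pipe_texts(["a"])` and an iterator of `(str, context)` tuples for `pipe_texts([("a", 1)], as_tuples=True)`; at runtime only the final undecorated definition is used.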


357 Commits