Mirror of https://github.com/explosion/spaCy.git (synced 2024-12-24 17:06:29 +03:00)
🏷 Add Mypy check to CI and ignore all existing Mypy errors (#9167)
* 🚨 Ignore all existing Mypy errors
* 🏗 Add Mypy check to CI
* Add types-mock and types-requests as dev requirements
* Add additional type ignore directives
* Add types packages to dev-only list in reqs test
* Add types-dataclasses for python 3.6
* Add ignore to pretrain
* 🏷 Improve type annotation on `run_command` helper

  The `run_command` helper previously declared that it returned an
  `Optional[subprocess.CompletedProcess]`, but it isn't actually possible for
  the function to return `None`. These changes modify the type annotation of
  the `run_command` helper and remove all now-unnecessary `# type: ignore`
  directives.

* 🔧 Allow variable type redefinition in limited contexts

  These changes modify how Mypy is configured to allow variables to have
  their type automatically redefined under certain conditions. The Mypy
  documentation contains the following example:

  ```python
  def process(items: List[str]) -> None:
      # 'items' has type List[str]
      items = [item.split() for item in items]
      # 'items' now has type List[List[str]]
      ...
  ```

  This configuration change is especially helpful in reducing the number of
  `# type: ignore` directives needed to handle the common pattern of:
  * accepting a filepath as a string
  * overwriting the variable using `filepath = ensure_path(filepath)`

  These changes enable redefinition and remove all `# type: ignore`
  directives rendered redundant by this change.

* 🏷 Add type annotation to converters mapping
* 🚨 Fix Mypy error in convert CLI argument verification
* 🏷 Improve type annotation on `resolve_dot_names` helper
* 🏷 Add type annotations for `Vocab` attributes `strings` and `vectors`
* 🏷 Add type annotations for more `Vocab` attributes
* 🏷 Add loose type annotation for gold data compilation
* 🏷 Improve `_format_labels` type annotation
* 🏷 Fix `get_lang_class` type annotation
* 🏷 Loosen return type of `Language.evaluate`
* 🏷 Don't accept `Scorer` in `handle_scores_per_type`
* 🏷 Add `string_to_list` overloads
* 🏷 Fix non-Optional command-line options
* 🙈 Ignore redefinition of `wandb_logger` in `loggers.py`
* ➕ Install `typing_extensions` in Python 3.8+

  The `typing_extensions` package states that it should be used when "writing
  code that must be compatible with multiple Python versions". Since spaCy
  needs to support multiple Python versions, it should be used when newer
  `typing` module members are required. One example of this is `Literal`,
  which is available starting with Python 3.8.

  Previously spaCy tried to import `Literal` from `typing`, falling back to
  `typing_extensions` if the import failed. However, Mypy doesn't seem to be
  able to understand what `Literal` means when the initial import fails.
  Therefore, these changes modify how `compat` imports `Literal` by always
  importing it from `typing_extensions`.

  These changes also modify how `typing_extensions` is installed, so that it
  is a requirement for all Python versions, including those greater than or
  equal to 3.8.

* 🏷 Improve type annotation for `Language.pipe`

  These changes add a missing overload variant to the type signature of
  `Language.pipe`. Additionally, the type signature is enhanced to allow type
  checkers to differentiate between the two overload variants based on the
  `as_tuple` parameter.

  Fixes #8772

* ➖ Don't install `typing-extensions` in Python 3.8+

  After more detailed analysis of how to implement Python version-specific
  type annotations in spaCy, it has been determined that branching on a
  comparison against `sys.version_info` can be statically analyzed by Mypy
  well enough to enable us to conditionally use `typing_extensions.Literal`.
  This means that we no longer need to install `typing_extensions` for Python
  versions greater than or equal to 3.8! 🎉

  These changes revert the previous changes installing `typing-extensions`
  regardless of Python version and modify how we import the `Literal` type to
  ensure that Mypy treats it properly.

* resolve mypy errors for Strict pydantic types
* refactor code to avoid missing return statement
* fix types of convert CLI command
* avoid list-set confusion in debug_data
* fix typo and formatting
* small fixes to avoid type ignores
* fix types in profile CLI command and make it more efficient
* type fixes in projects CLI
* put one ignore back
* type fixes for render
* fix render types - the sequel
* fix BaseDefaults in language definitions
* fix type of noun_chunks iterator - yields tuple instead of span
* fix types in language-specific modules
* 🏷 Expand accepted inputs of `get_string_id`

  `get_string_id` accepts either a string (in which case it returns its ID)
  or an ID (in which case it immediately returns the ID). These changes
  extend the type annotation of `get_string_id` to indicate that it can
  accept either strings or IDs.

* 🏷 Handle override types in `combine_score_weights`

  The `combine_score_weights` function allows users to pass an `overrides`
  mapping to override data extracted from the `weights` argument. Since it
  allows `Optional` dictionary values, the return value may also include
  `Optional` dictionary values. These changes update the type annotations for
  `combine_score_weights` to reflect this fact.

* 🏷 Fix tokenizer serialization method signatures in `DummyTokenizer`
* 🏷 Fix redefinition of `wandb_logger`

  These changes fix the redefinition of `wandb_logger` by giving a separate
  name to each `WandbLogger` version. For backwards-compatibility,
  `spacy.train` still exports `wandb_logger_v3` as `wandb_logger` for now.

* more fixes for typing in language
* type fixes in model definitions
* 🏷 Annotate `_RandomWords.probs` as `NDArray`
* 🏷 Annotate `tok2vec` layers to help Mypy
* 🐛 Fix `_RandomWords.probs` type annotations for Python 3.6

  Also remove an import that I forgot to move to the top of the module 😅

* more fixes for matchers and other pipeline components
* quick fix for entity linker
* fixing types for spancat, textcat, etc
* bugfix for tok2vec
* type annotations for scorer
* add runtime_checkable for Protocol
* type and import fixes in tests
* mypy fixes for training utilities
* few fixes in util
* fix import
* 🐵 Remove unused `# type: ignore` directives
* 🏷 Annotate `Language._components`
* 🏷 Annotate `spacy.pipeline.Pipe`
* add doc as property to span.pyi
* small fixes and cleanup
* explicit type annotations instead of via comment

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
Co-authored-by: svlandeg <sofie.vanlandeghem@gmail.com>
Co-authored-by: svlandeg <svlandeg@github.com>
This commit is contained in:
parent 631c170fa5
commit 657af5f91f
.github/azure-steps.yml (vendored, 3 changed lines)

```diff
@@ -25,6 +25,9 @@ steps:
       ${{ parameters.prefix }} python setup.py sdist --formats=gztar
     displayName: "Compile and build sdist"
 
+  - script: python -m mypy spacy
+    displayName: 'Run mypy'
+
   - task: DeleteFiles@1
     inputs:
       contents: "spacy"
```
.github/contributors/connorbrinton.md (vendored, new file, 106 lines)

```md
# spaCy contributor agreement

This spaCy Contributor Agreement (**"SCA"**) is based on the
[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf).
The SCA applies to any contribution that you make to any product or project
managed by us (the **"project"**), and sets out the intellectual property rights
you grant to us in the contributed materials. The term **"us"** shall mean
[ExplosionAI GmbH](https://explosion.ai/legal). The term
**"you"** shall mean the person or entity identified below.

If you agree to be bound by these terms, fill in the information requested
below and include the filled-in version with your first pull request, under the
folder [`.github/contributors/`](/.github/contributors/). The name of the file
should be your GitHub username, with the extension `.md`. For example, the user
example_user would create the file `.github/contributors/example_user.md`.

Read this agreement carefully before signing. These terms and conditions
constitute a binding legal agreement.

## Contributor Agreement

1. The term "contribution" or "contributed materials" means any source code,
object code, patch, tool, sample, graphic, specification, manual,
documentation, or any other material posted or submitted by you to the project.

2. With respect to any worldwide copyrights, or copyright applications and
registrations, in your contribution:

    * you hereby assign to us joint ownership, and to the extent that such
    assignment is or becomes invalid, ineffective or unenforceable, you hereby
    grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
    royalty-free, unrestricted license to exercise all rights under those
    copyrights. This includes, at our option, the right to sublicense these same
    rights to third parties through multiple levels of sublicensees or other
    licensing arrangements;

    * you agree that each of us can do all things in relation to your
    contribution as if each of us were the sole owners, and if one of us makes
    a derivative work of your contribution, the one who makes the derivative
    work (or has it made) will be the sole owner of that derivative work;

    * you agree that you will not assert any moral rights in your contribution
    against us, our licensees or transferees;

    * you agree that we may register a copyright in your contribution and
    exercise all ownership rights associated with it; and

    * you agree that neither of us has any duty to consult with, obtain the
    consent of, pay or render an accounting to the other for any use or
    distribution of your contribution.

3. With respect to any patents you own, or that you can license without payment
to any third party, you hereby grant to us a perpetual, irrevocable,
non-exclusive, worldwide, no-charge, royalty-free license to:

    * make, have made, use, sell, offer to sell, import, and otherwise transfer
    your contribution in whole or in part, alone or in combination with or
    included in any product, work or materials arising out of the project to
    which your contribution was submitted, and

    * at our option, to sublicense these same rights to third parties through
    multiple levels of sublicensees or other licensing arrangements.

4. Except as set out above, you keep all right, title, and interest in your
contribution. The rights that you grant to us under these terms are effective
on the date you first submitted a contribution to us, even if your submission
took place before the date you sign these terms.

5. You covenant, represent, warrant and agree that:

    * Each contribution that you submit is and shall be an original work of
    authorship and you can legally grant the rights set out in this SCA;

    * to the best of your knowledge, each contribution will not violate any
    third party's copyrights, trademarks, patents, or other intellectual
    property rights; and

    * each contribution shall be in compliance with U.S. export control laws and
    other applicable export and import laws. You agree to notify us if you
    become aware of any circumstance which would make any of the foregoing
    representations inaccurate in any respect. We may publicly disclose your
    participation in the project, including the fact that you have signed the SCA.

6. This SCA is governed by the laws of the State of California and applicable
U.S. Federal law. Any choice of law rules will not apply.

7. Please place an “x” on one of the applicable statements below. Please do NOT
mark both statements:

    * [x] I am signing on behalf of myself as an individual and no other person
    or entity, including my employer, has or will have rights with respect to my
    contributions.

    * [ ] I am signing on behalf of my employer or a legal entity and I have the
    actual authority to contractually bind that entity.

## Contributor Details

| Field                          | Entry                |
|------------------------------- | -------------------- |
| Name                           | Connor Brinton       |
| Company name (if applicable)   |                      |
| Title or role (if applicable)  |                      |
| Date                           | July 20th, 2021      |
| GitHub username                | connorbrinton        |
| Website (optional)             |                      |
```
```diff
@@ -29,3 +29,7 @@ pytest-timeout>=1.3.0,<2.0.0
 mock>=2.0.0,<3.0.0
 flake8>=3.8.0,<3.10.0
 hypothesis>=3.27.0,<7.0.0
+mypy>=0.910
+types-dataclasses>=0.1.3; python_version < "3.7"
+types-mock>=0.1.1
+types-requests
```
```diff
@@ -129,3 +129,4 @@ markers =
 ignore_missing_imports = True
 no_implicit_optional = True
 plugins = pydantic.mypy, thinc.mypy
+allow_redefinition = True
```
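The `allow_redefinition = True` line added above enables the pattern described in the commit message: a variable may be reassigned with a value of a new type. A minimal sketch, with a local `ensure_path`-style helper standing in for spaCy's own:

```python
from pathlib import Path
from typing import Union


def ensure_path(path: Union[str, Path]) -> Path:
    # Stand-in for the helper mentioned in the commit message: normalize str -> Path.
    return path if isinstance(path, Path) else Path(path)


def read_config(filepath: str) -> str:
    # 'filepath' starts out as str; with allow_redefinition = True, Mypy accepts
    # this reassignment redefining it as a Path instead of reporting an error.
    filepath = ensure_path(filepath)
    return filepath.read_text(encoding="utf8")
```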
```diff
@@ -1,4 +1,5 @@
-from typing import Dict, Any, Union, List, Optional, Tuple, Iterable, TYPE_CHECKING
+from typing import Dict, Any, Union, List, Optional, Tuple, Iterable
+from typing import TYPE_CHECKING, overload
 import sys
 import shutil
 from pathlib import Path
@@ -15,6 +16,7 @@ from thinc.util import has_cupy, gpu_is_available
 from configparser import InterpolationError
 import os
 
+from ..compat import Literal
 from ..schemas import ProjectConfigSchema, validate
 from ..util import import_file, run_command, make_tempdir, registry, logger
 from ..util import is_compatible_version, SimpleFrozenDict, ENV_VARS
@@ -260,15 +262,16 @@ def get_checksum(path: Union[Path, str]) -> str:
     RETURNS (str): The checksum.
     """
     path = Path(path)
+    if not (path.is_file() or path.is_dir()):
+        msg.fail(f"Can't get checksum for {path}: not a file or directory", exits=1)
     if path.is_file():
         return hashlib.md5(Path(path).read_bytes()).hexdigest()
-    if path.is_dir():
+    else:
         # TODO: this is currently pretty slow
         dir_checksum = hashlib.md5()
         for sub_file in sorted(fp for fp in path.rglob("*") if fp.is_file()):
             dir_checksum.update(sub_file.read_bytes())
         return dir_checksum.hexdigest()
-    msg.fail(f"Can't get checksum for {path}: not a file or directory", exits=1)
 
 
 @contextmanager
@@ -468,12 +471,15 @@ def get_git_version(
     RETURNS (Tuple[int, int]): The version as a (major, minor) tuple. Returns
         (0, 0) if the version couldn't be determined.
     """
-    ret = run_command("git --version", capture=True)
+    try:
+        ret = run_command("git --version", capture=True)
+    except:
+        raise RuntimeError(error)
     stdout = ret.stdout.strip()
     if not stdout or not stdout.startswith("git version"):
-        return (0, 0)
+        return 0, 0
     version = stdout[11:].strip().split(".")
-    return (int(version[0]), int(version[1]))
+    return int(version[0]), int(version[1])
 
 
 def _http_to_git(repo: str) -> str:
@@ -500,6 +506,16 @@ def is_subpath_of(parent, child):
     return os.path.commonpath([parent_realpath, child_realpath]) == parent_realpath
 
 
+@overload
+def string_to_list(value: str, intify: Literal[False] = ...) -> List[str]:
+    ...
+
+
+@overload
+def string_to_list(value: str, intify: Literal[True]) -> List[int]:
+    ...
+
+
 def string_to_list(value: str, intify: bool = False) -> Union[List[str], List[int]]:
     """Parse a comma-separated string to a list and account for various
     formatting options. Mostly used to handle CLI arguments that take a list of
@@ -510,7 +526,7 @@ def string_to_list(value: str, intify: bool = False) -> Union[List[str], List[in
     RETURNS (Union[List[str], List[int]]): A list of strings or ints.
     """
     if not value:
-        return []
+        return []  # type: ignore[return-value]
     if value.startswith("[") and value.endswith("]"):
         value = value[1:-1]
     result = []
@@ -522,7 +538,7 @@ def string_to_list(value: str, intify: bool = False) -> Union[List[str], List[in
             p = p[1:-1]
         p = p.strip()
         if intify:
-            p = int(p)
+            p = int(p)  # type: ignore[assignment]
         result.append(p)
     return result
```
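With the `Literal`-based overloads above, call sites of `string_to_list` get a precise list type picked from the value of `intify`. A small usage sketch, assuming the helper stays importable from `spacy.cli._util`:

```python
from typing import List

from spacy.cli._util import string_to_list

# Mypy selects the overload from the literal value of `intify`:
names: List[str] = string_to_list("en,de,fr")
sizes: List[int] = string_to_list("96,128,300", intify=True)
print(names, sizes)
```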
```diff
@@ -1,4 +1,4 @@
-from typing import Optional, Any, List, Union
+from typing import Callable, Iterable, Mapping, Optional, Any, List, Union
 from enum import Enum
 from pathlib import Path
 from wasabi import Printer
@@ -9,7 +9,7 @@ import itertools
 
 from ._util import app, Arg, Opt
 from ..training import docs_to_json
-from ..tokens import DocBin
+from ..tokens import Doc, DocBin
 from ..training.converters import iob_to_docs, conll_ner_to_docs, json_to_docs
 from ..training.converters import conllu_to_docs
 
@@ -19,7 +19,7 @@ from ..training.converters import conllu_to_docs
 # entry to this dict with the file extension mapped to the converter function
 # imported from /converters.
 
-CONVERTERS = {
+CONVERTERS: Mapping[str, Callable[..., Iterable[Doc]]] = {
     "conllubio": conllu_to_docs,
     "conllu": conllu_to_docs,
     "conll": conll_ner_to_docs,
@@ -66,19 +66,16 @@ def convert_cli(
 
     DOCS: https://spacy.io/api/cli#convert
     """
-    if isinstance(file_type, FileTypes):
-        # We get an instance of the FileTypes from the CLI so we need its string value
-        file_type = file_type.value
     input_path = Path(input_path)
-    output_dir = "-" if output_dir == Path("-") else output_dir
+    output_dir: Union[str, Path] = "-" if output_dir == Path("-") else output_dir
     silent = output_dir == "-"
     msg = Printer(no_print=silent)
-    verify_cli_args(msg, input_path, output_dir, file_type, converter, ner_map)
+    verify_cli_args(msg, input_path, output_dir, file_type.value, converter, ner_map)
     converter = _get_converter(msg, converter, input_path)
     convert(
         input_path,
         output_dir,
-        file_type=file_type,
+        file_type=file_type.value,
         n_sents=n_sents,
         seg_sents=seg_sents,
         model=model,
@@ -94,7 +91,7 @@ def convert_cli(
 
 
 def convert(
-    input_path: Union[str, Path],
+    input_path: Path,
     output_dir: Union[str, Path],
     *,
     file_type: str = "json",
@@ -114,7 +111,7 @@ def convert(
     msg = Printer(no_print=silent)
     ner_map = srsly.read_json(ner_map) if ner_map is not None else None
     doc_files = []
-    for input_loc in walk_directory(Path(input_path), converter):
+    for input_loc in walk_directory(input_path, converter):
         with input_loc.open("r", encoding="utf-8") as infile:
             input_data = infile.read()
         # Use converter function to convert data
@@ -141,7 +138,7 @@ def convert(
         else:
             db = DocBin(docs=docs, store_user_data=True)
             len_docs = len(db)
            data = db.to_bytes()
+            data = db.to_bytes()  # type: ignore[assignment]
             if output_dir == "-":
                 _print_docs_to_stdout(data, file_type)
             else:
@@ -220,13 +217,12 @@ def walk_directory(path: Path, converter: str) -> List[Path]:
 
 def verify_cli_args(
     msg: Printer,
-    input_path: Union[str, Path],
+    input_path: Path,
     output_dir: Union[str, Path],
-    file_type: FileTypes,
+    file_type: str,
     converter: str,
     ner_map: Optional[Path],
 ):
-    input_path = Path(input_path)
     if file_type not in FILE_TYPES_STDOUT and output_dir == "-":
         msg.fail(
             f"Can't write .{file_type} data to stdout. Please specify an output directory.",
@@ -244,13 +240,13 @@ def verify_cli_args(
         msg.fail("No input files in directory", input_path, exits=1)
     file_types = list(set([loc.suffix[1:] for loc in input_locs]))
     if converter == "auto" and len(file_types) >= 2:
-        file_types = ",".join(file_types)
-        msg.fail("All input files must be same type", file_types, exits=1)
+        file_types_str = ",".join(file_types)
+        msg.fail("All input files must be same type", file_types_str, exits=1)
     if converter != "auto" and converter not in CONVERTERS:
         msg.fail(f"Can't find converter for {converter}", exits=1)
 
 
-def _get_converter(msg, converter, input_path):
+def _get_converter(msg, converter, input_path: Path):
     if input_path.is_dir():
         input_path = walk_directory(input_path, converter)[0]
     if converter == "auto":
```
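The `convert_cli` change above keeps `file_type` as the `FileTypes` enum throughout and passes `file_type.value` where a plain string is expected, instead of overwriting the variable with its string value. A hedged sketch of the pattern with a simplified enum (not the real CLI code):

```python
from enum import Enum


class FileTypes(str, Enum):
    # Simplified stand-in for the CLI enum referenced above.
    json = "json"
    spacy = "spacy"


def convert(file_type: str) -> None:
    print(f"Converting to .{file_type}")


selected = FileTypes.spacy
# Passing .value keeps `selected` typed as FileTypes for Mypy while the
# downstream function still receives the plain string it expects.
convert(selected.value)
```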
```diff
@@ -1,4 +1,5 @@
-from typing import List, Sequence, Dict, Any, Tuple, Optional, Set
+from typing import Any, Dict, Iterable, List, Optional, Sequence, Set, Tuple, Union
+from typing import cast, overload
 from pathlib import Path
 from collections import Counter
 import sys
@@ -17,6 +18,7 @@ from ..pipeline import Morphologizer
 from ..morphology import Morphology
 from ..language import Language
 from ..util import registry, resolve_dot_names
+from ..compat import Literal
 from .. import util
 
 
@@ -378,10 +380,11 @@ def debug_data(
 
     if "tagger" in factory_names:
         msg.divider("Part-of-speech Tagging")
-        labels = [label for label in gold_train_data["tags"]]
+        label_list = [label for label in gold_train_data["tags"]]
         model_labels = _get_labels_from_model(nlp, "tagger")
-        msg.info(f"{len(labels)} label(s) in train data")
-        missing_labels = model_labels - set(labels)
+        msg.info(f"{len(label_list)} label(s) in train data")
+        labels = set(label_list)
+        missing_labels = model_labels - labels
         if missing_labels:
             msg.warn(
                 "Some model labels are not present in the train data. The "
@@ -395,10 +398,11 @@ def debug_data(
 
     if "morphologizer" in factory_names:
         msg.divider("Morphologizer (POS+Morph)")
-        labels = [label for label in gold_train_data["morphs"]]
+        label_list = [label for label in gold_train_data["morphs"]]
         model_labels = _get_labels_from_model(nlp, "morphologizer")
-        msg.info(f"{len(labels)} label(s) in train data")
-        missing_labels = model_labels - set(labels)
+        msg.info(f"{len(label_list)} label(s) in train data")
+        labels = set(label_list)
+        missing_labels = model_labels - labels
         if missing_labels:
             msg.warn(
                 "Some model labels are not present in the train data. The "
@@ -565,7 +569,7 @@ def _compile_gold(
     nlp: Language,
     make_proj: bool,
 ) -> Dict[str, Any]:
-    data = {
+    data: Dict[str, Any] = {
         "ner": Counter(),
         "cats": Counter(),
         "tags": Counter(),
@@ -670,10 +674,28 @@ def _compile_gold(
     return data
 
 
-def _format_labels(labels: List[Tuple[str, int]], counts: bool = False) -> str:
+@overload
+def _format_labels(labels: Iterable[str], counts: Literal[False] = False) -> str:
+    ...
+
+
+@overload
+def _format_labels(
+    labels: Iterable[Tuple[str, int]],
+    counts: Literal[True],
+) -> str:
+    ...
+
+
+def _format_labels(
+    labels: Union[Iterable[str], Iterable[Tuple[str, int]]],
+    counts: bool = False,
+) -> str:
     if counts:
-        return ", ".join([f"'{l}' ({c})" for l, c in labels])
-    return ", ".join([f"'{l}'" for l in labels])
+        return ", ".join(
+            [f"'{l}' ({c})" for l, c in cast(Iterable[Tuple[str, int]], labels)]
+        )
+    return ", ".join([f"'{l}'" for l in cast(Iterable[str], labels)])
 
 
 def _get_examples_without_label(data: Sequence[Example], label: str) -> int:
```
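The `_format_labels` overloads above are a compact example of using `@overload` with `Literal` to tie the accepted element type to a flag. An illustrative re-implementation (not the private spaCy helper itself; assumes Python 3.8+ for `typing.Literal`):

```python
from typing import Iterable, Literal, Tuple, overload


@overload
def format_labels(labels: Iterable[str], counts: Literal[False] = False) -> str: ...
@overload
def format_labels(labels: Iterable[Tuple[str, int]], counts: Literal[True]) -> str: ...


def format_labels(labels, counts=False):
    # Mirrors the formatting used by the helper above.
    if counts:
        return ", ".join(f"'{label}' ({count})" for label, count in labels)
    return ", ".join(f"'{label}'" for label in labels)


print(format_labels(["NOUN", "VERB"]))                         # 'NOUN', 'VERB'
print(format_labels([("NOUN", 3), ("VERB", 1)], counts=True))  # 'NOUN' (3), 'VERB' (1)
```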
```diff
@@ -136,7 +136,7 @@ def evaluate(
 
 
 def handle_scores_per_type(
-    scores: Union[Scorer, Dict[str, Any]],
+    scores: Dict[str, Any],
     data: Dict[str, Any] = {},
     *,
     spans_key: str = "sc",
```
```diff
@@ -15,7 +15,7 @@ def info_cli(
     model: Optional[str] = Arg(None, help="Optional loadable spaCy pipeline"),
     markdown: bool = Opt(False, "--markdown", "-md", help="Generate Markdown for GitHub issues"),
     silent: bool = Opt(False, "--silent", "-s", "-S", help="Don't print anything (just return)"),
-    exclude: Optional[str] = Opt("labels", "--exclude", "-e", help="Comma-separated keys to exclude from the print-out"),
+    exclude: str = Opt("labels", "--exclude", "-e", help="Comma-separated keys to exclude from the print-out"),
     # fmt: on
 ):
     """
@@ -61,7 +61,7 @@ def info(
     return raw_data
 
 
-def info_spacy() -> Dict[str, any]:
+def info_spacy() -> Dict[str, Any]:
     """Generate info about the current spaCy intallation.
 
     RETURNS (dict): The spaCy info.
```
```diff
@@ -28,8 +28,8 @@ class Optimizations(str, Enum):
 def init_config_cli(
     # fmt: off
     output_file: Path = Arg(..., help="File to save config.cfg to or - for stdout (will only output config and no additional logging info)", allow_dash=True),
-    lang: Optional[str] = Opt("en", "--lang", "-l", help="Two-letter code of the language to use"),
-    pipeline: Optional[str] = Opt("tagger,parser,ner", "--pipeline", "-p", help="Comma-separated names of trainable pipeline components to include (without 'tok2vec' or 'transformer')"),
+    lang: str = Opt("en", "--lang", "-l", help="Two-letter code of the language to use"),
+    pipeline: str = Opt("tagger,parser,ner", "--pipeline", "-p", help="Comma-separated names of trainable pipeline components to include (without 'tok2vec' or 'transformer')"),
     optimize: Optimizations = Opt(Optimizations.efficiency.value, "--optimize", "-o", help="Whether to optimize for efficiency (faster inference, smaller model, lower memory consumption) or higher accuracy (potentially larger and slower model). This will impact the choice of architecture, pretrained weights and related hyperparameters."),
     gpu: bool = Opt(False, "--gpu", "-G", help="Whether the model can run on GPU. This will impact the choice of architecture, pretrained weights and related hyperparameters."),
     pretraining: bool = Opt(False, "--pretraining", "-pt", help="Include config for pretraining (with 'spacy pretrain')"),
@@ -44,8 +44,6 @@ def init_config_cli(
 
     DOCS: https://spacy.io/api/cli#init-config
     """
-    if isinstance(optimize, Optimizations):  # instance of enum from the CLI
-        optimize = optimize.value
     pipeline = string_to_list(pipeline)
     is_stdout = str(output_file) == "-"
     if not is_stdout and output_file.exists() and not force_overwrite:
@@ -57,7 +55,7 @@ def init_config_cli(
     config = init_config(
         lang=lang,
         pipeline=pipeline,
-        optimize=optimize,
+        optimize=optimize.value,
         gpu=gpu,
         pretraining=pretraining,
         silent=is_stdout,
@@ -175,8 +173,8 @@ def init_config(
         "Pipeline": ", ".join(pipeline),
         "Optimize for": optimize,
         "Hardware": variables["hardware"].upper(),
-        "Transformer": template_vars.transformer.get("name")
-        if template_vars.use_transformer
+        "Transformer": template_vars.transformer.get("name")  # type: ignore[attr-defined]
+        if template_vars.use_transformer  # type: ignore[attr-defined]
         else None,
     }
     msg.info("Generated config template specific for your use case")
```
```diff
@@ -1,4 +1,4 @@
-from typing import Optional, Union, Any, Dict, List, Tuple
+from typing import Optional, Union, Any, Dict, List, Tuple, cast
 import shutil
 from pathlib import Path
 from wasabi import Printer, MarkdownRenderer, get_raw_input
@@ -215,9 +215,9 @@ def get_third_party_dependencies(
     for reg_name, func_names in funcs.items():
         for func_name in func_names:
             func_info = util.registry.find(reg_name, func_name)
-            module_name = func_info.get("module")
+            module_name = func_info.get("module")  # type: ignore[attr-defined]
             if module_name:  # the code is part of a module, not a --code file
-                modules.add(func_info["module"].split(".")[0])
+                modules.add(func_info["module"].split(".")[0])  # type: ignore[index]
     dependencies = []
     for module_name in modules:
         if module_name in distributions:
@@ -227,7 +227,7 @@ def get_third_party_dependencies(
             if pkg in own_packages or pkg in exclude:
                 continue
             version = util.get_package_version(pkg)
-            version_range = util.get_minor_version_range(version)
+            version_range = util.get_minor_version_range(version)  # type: ignore[arg-type]
             dependencies.append(f"{pkg}{version_range}")
     return dependencies
 
@@ -252,7 +252,7 @@ def create_file(file_path: Path, contents: str) -> None:
 def get_meta(
     model_path: Union[str, Path], existing_meta: Dict[str, Any]
 ) -> Dict[str, Any]:
-    meta = {
+    meta: Dict[str, Any] = {
         "lang": "en",
         "name": "pipeline",
         "version": "0.0.0",
@@ -324,8 +324,8 @@ def generate_readme(meta: Dict[str, Any]) -> str:
     license_name = meta.get("license")
     sources = _format_sources(meta.get("sources"))
     description = meta.get("description")
-    label_scheme = _format_label_scheme(meta.get("labels"))
-    accuracy = _format_accuracy(meta.get("performance"))
+    label_scheme = _format_label_scheme(cast(Dict[str, Any], meta.get("labels")))
+    accuracy = _format_accuracy(cast(Dict[str, Any], meta.get("performance")))
     table_data = [
         (md.bold("Name"), md.code(name)),
         (md.bold("Version"), md.code(version)),
```
```diff
@@ -32,7 +32,7 @@ def profile_cli(
 
     DOCS: https://spacy.io/api/cli#debug-profile
     """
-    if ctx.parent.command.name == NAME:  # called as top-level command
+    if ctx.parent.command.name == NAME:  # type: ignore[union-attr] # called as top-level command
         msg.warn(
             "The profile command is now available via the 'debug profile' "
             "subcommand. You can run python -m spacy debug --help for an "
@@ -42,9 +42,9 @@ def profile_cli(
 
 
 def profile(model: str, inputs: Optional[Path] = None, n_texts: int = 10000) -> None:
 
     if inputs is not None:
-        inputs = _read_inputs(inputs, msg)
+        texts = _read_inputs(inputs, msg)
+        texts = list(itertools.islice(texts, n_texts))
     if inputs is None:
         try:
             import ml_datasets
@@ -56,16 +56,13 @@ def profile(model: str, inputs: Optional[Path] = None, n_texts: int = 10000) ->
                 exits=1,
             )
 
-        n_inputs = 25000
-        with msg.loading("Loading IMDB dataset via Thinc..."):
-            imdb_train, _ = ml_datasets.imdb()
-            inputs, _ = zip(*imdb_train)
-        msg.info(f"Loaded IMDB dataset and using {n_inputs} examples")
-        inputs = inputs[:n_inputs]
+        with msg.loading("Loading IMDB dataset via ml_datasets..."):
+            imdb_train, _ = ml_datasets.imdb(train_limit=n_texts, dev_limit=0)
+            texts, _ = zip(*imdb_train)
+        msg.info(f"Loaded IMDB dataset and using {n_texts} examples")
     with msg.loading(f"Loading pipeline '{model}'..."):
         nlp = load_model(model)
     msg.good(f"Loaded pipeline '{model}'")
-    texts = list(itertools.islice(inputs, n_texts))
     cProfile.runctx("parse_texts(nlp, texts)", globals(), locals(), "Profile.prof")
     s = pstats.Stats("Profile.prof")
     msg.divider("Profile stats")
@@ -87,7 +84,7 @@ def _read_inputs(loc: Union[Path, str], msg: Printer) -> Iterator[str]:
     if not input_path.exists() or not input_path.is_file():
         msg.fail("Not a valid input data file", loc, exits=1)
     msg.info(f"Using data from {input_path.parts[-1]}")
-    file_ = input_path.open()
+    file_ = input_path.open()  # type: ignore[assignment]
     for line in file_:
         data = srsly.json_loads(line)
         text = data["text"]
```
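The profiling path in the hunks above boils down to the standard-library `cProfile`/`pstats` pattern. A self-contained sketch (using `spacy.blank` only to have something to profile):

```python
import cProfile
import pstats

import spacy

nlp = spacy.blank("en")
texts = ["This is a sentence about insurance liability."] * 1000


def parse_texts(nlp, texts):
    # Consume the pipe generator so the pipeline actually runs.
    for _ in nlp.pipe(texts):
        pass


cProfile.runctx("parse_texts(nlp, texts)", globals(), locals(), "Profile.prof")
stats = pstats.Stats("Profile.prof")
stats.strip_dirs().sort_stats("cumulative").print_stats(10)
```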
```diff
@@ -133,7 +133,6 @@
         # If there's already a file, check for checksum
         if checksum == get_checksum(dest_path):
             msg.good(f"Skipping download with matching checksum: {dest}")
-            return dest_path
     # We might as well support the user here and create parent directories in
     # case the asset dir isn't listed as a dir to create in the project.yml
     if not dest_path.parent.exists():
@@ -150,7 +149,6 @@
             msg.good(f"Copied local asset {dest}")
         else:
             msg.fail(f"Download failed: {dest}", e)
-            return
     if checksum and checksum != get_checksum(dest_path):
         msg.fail(f"Checksum doesn't match value defined in {PROJECT_FILE}: {dest}")
 
```
```diff
@@ -80,9 +80,9 @@ def check_clone(name: str, dest: Path, repo: str) -> None:
     repo (str): URL of the repo to clone from.
     """
     git_err = (
-        f"Cloning spaCy project templates requires Git and the 'git' command. ",
+        f"Cloning spaCy project templates requires Git and the 'git' command. "
         f"To clone a project without Git, copy the files from the '{name}' "
-        f"directory in the {repo} to {dest} manually.",
+        f"directory in the {repo} to {dest} manually."
     )
     get_git_version(error=git_err)
     if not dest:
```
```diff
@@ -143,8 +143,8 @@ def run_dvc_commands(
     easier to pass flags like --quiet that depend on a variable or
     command-line setting while avoiding lots of nested conditionals.
     """
-    for command in commands:
-        command = split_command(command)
+    for c in commands:
+        command = split_command(c)
         dvc_command = ["dvc", *command]
         # Add the flags if they are set to True
         for flag, is_active in flags.items():
```
```diff
@@ -41,7 +41,7 @@ class RemoteStorage:
             raise IOError(f"Cannot push {loc}: does not exist.")
         url = self.make_url(path, command_hash, content_hash)
         if url.exists():
-            return None
+            return url
         tmp: Path
         with make_tempdir() as tmp:
             tar_loc = tmp / self.encode_name(str(path))
@@ -131,8 +131,10 @@ def get_command_hash(
     currently installed packages, whatever environment variables have been marked
     as relevant, and the command.
     """
-    check_commit = check_bool_env_var(ENV_VARS.PROJECT_USE_GIT_VERSION)
-    spacy_v = GIT_VERSION if check_commit else get_minor_version(about.__version__)
+    if check_bool_env_var(ENV_VARS.PROJECT_USE_GIT_VERSION):
+        spacy_v = GIT_VERSION
+    else:
+        spacy_v = str(get_minor_version(about.__version__) or "")
     dep_checksums = [get_checksum(dep) for dep in sorted(deps)]
     hashes = [spacy_v, site_hash, env_hash] + dep_checksums
     hashes.extend(cmd)
```
```diff
@@ -70,7 +70,7 @@ def project_run(
     config = load_project_config(project_dir, overrides=overrides)
     commands = {cmd["name"]: cmd for cmd in config.get("commands", [])}
     workflows = config.get("workflows", {})
-    validate_subcommand(commands.keys(), workflows.keys(), subcommand)
+    validate_subcommand(list(commands.keys()), list(workflows.keys()), subcommand)
     if subcommand in workflows:
         msg.info(f"Running workflow '{subcommand}'")
         for cmd in workflows[subcommand]:
@@ -116,7 +116,7 @@ def print_run_help(project_dir: Path, subcommand: Optional[str] = None) -> None:
     workflows = config.get("workflows", {})
     project_loc = "" if is_cwd(project_dir) else project_dir
     if subcommand:
-        validate_subcommand(commands.keys(), workflows.keys(), subcommand)
+        validate_subcommand(list(commands.keys()), list(workflows.keys()), subcommand)
         print(f"Usage: {COMMAND} project run {subcommand} {project_loc}")
         if subcommand in commands:
             help_text = commands[subcommand].get("help")
@@ -164,8 +164,8 @@ def run_commands(
     when you want to turn over execution to the command, and capture=True
     when you want to run the command more like a function.
     """
-    for command in commands:
-        command = split_command(command)
+    for c in commands:
+        command = split_command(c)
         # Not sure if this is needed or a good idea. Motivation: users may often
         # use commands in their config that reference "python" and we want to
         # make sure that it's always executing the same Python that spaCy is
@@ -294,7 +294,7 @@ def get_lock_entry(project_dir: Path, command: Dict[str, Any]) -> Dict[str, Any]
     }
 
 
-def get_fileinfo(project_dir: Path, paths: List[str]) -> List[Dict[str, str]]:
+def get_fileinfo(project_dir: Path, paths: List[str]) -> List[Dict[str, Optional[str]]]:
     """Generate the file information for a list of paths (dependencies, outputs).
     Includes the file path and the file's checksum.
 
```
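`dict.keys()` returns a `KeysView`, not the `List[str]` that `validate_subcommand` is annotated with, which is why the call sites above now wrap it in `list(...)`. A hedged sketch with a simplified stand-in for the helper:

```python
from typing import List


def validate_subcommand(commands: List[str], workflows: List[str], subcommand: str) -> None:
    # Simplified stand-in for the project CLI helper referenced above.
    if subcommand not in commands and subcommand not in workflows:
        raise SystemExit(f"Can't find command or workflow: {subcommand}")


commands = {"download": {}, "train": {}}
workflows = {"all": ["download", "train"]}
# list(...) turns the KeysView into the List[str] the annotation asks for.
validate_subcommand(list(commands.keys()), list(workflows.keys()), "train")
```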
```diff
@@ -99,7 +99,7 @@ def get_model_pkgs(silent: bool = False) -> Tuple[dict, dict]:
                 warnings.filterwarnings("ignore", message="\\[W09[45]")
                 model_meta = get_model_meta(model_path)
         spacy_version = model_meta.get("spacy_version", "n/a")
-        is_compat = is_compatible_version(about.__version__, spacy_version)
+        is_compat = is_compatible_version(about.__version__, spacy_version)  # type: ignore[assignment]
         pkgs[pkg_name] = {
             "name": package,
             "version": version,
```
```diff
@@ -5,12 +5,12 @@ from thinc.util import copy_array
 try:
     import cPickle as pickle
 except ImportError:
-    import pickle
+    import pickle  # type: ignore[no-redef]
 
 try:
     import copy_reg
 except ImportError:
-    import copyreg as copy_reg
+    import copyreg as copy_reg  # type: ignore[no-redef]
 
 try:
     from cupy.cuda.stream import Stream as CudaStream
@@ -22,10 +22,10 @@ try:
 except ImportError:
     cupy = None
 
-try:  # Python 3.8+
+if sys.version_info[:2] >= (3, 8):  # Python 3.8+
     from typing import Literal
-except ImportError:
+else:
     from typing_extensions import Literal  # noqa: F401
 
 # Important note: The importlib_metadata "backport" includes functionality
 # that's not part of the built-in importlib.metadata. We should treat this
@@ -33,7 +33,7 @@ except ImportError:
 try:  # Python 3.8+
     import importlib.metadata as importlib_metadata
 except ImportError:
-    from catalogue import _importlib_metadata as importlib_metadata  # noqa: F401
+    from catalogue import _importlib_metadata as importlib_metadata  # type: ignore[no-redef]  # noqa: F401
 
 from thinc.api import Optimizer  # noqa: F401
 
```
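With the `sys.version_info` branch above, other modules import `Literal` from `spacy.compat` and Mypy resolves it statically on every supported Python version. A minimal usage sketch:

```python
from spacy.compat import Literal


def set_style(style: Literal["dep", "ent"]) -> str:
    # Mypy rejects any call site that passes a string other than "dep" or "ent".
    return f"style={style}"


print(set_style("ent"))
```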
```diff
@@ -18,7 +18,7 @@ RENDER_WRAPPER = None
 
 
 def render(
-    docs: Union[Iterable[Union[Doc, Span]], Doc, Span],
+    docs: Union[Iterable[Union[Doc, Span, dict]], Doc, Span, dict],
     style: str = "dep",
     page: bool = False,
     minify: bool = False,
@@ -28,7 +28,8 @@ def render(
 ) -> str:
     """Render displaCy visualisation.
 
-    docs (Union[Iterable[Doc], Doc]): Document(s) to visualise.
+    docs (Union[Iterable[Union[Doc, Span, dict]], Doc, Span, dict]]): Document(s) to visualise.
+        a 'dict' is only allowed here when 'manual' is set to True
     style (str): Visualisation style, 'dep' or 'ent'.
     page (bool): Render markup as full HTML page.
     minify (bool): Minify HTML markup.
@@ -53,8 +54,8 @@ def render(
         raise ValueError(Errors.E096)
     renderer_func, converter = factories[style]
     renderer = renderer_func(options=options)
-    parsed = [converter(doc, options) for doc in docs] if not manual else docs
-    _html["parsed"] = renderer.render(parsed, page=page, minify=minify).strip()
+    parsed = [converter(doc, options) for doc in docs] if not manual else docs  # type: ignore
+    _html["parsed"] = renderer.render(parsed, page=page, minify=minify).strip()  # type: ignore
     html = _html["parsed"]
     if RENDER_WRAPPER is not None:
         html = RENDER_WRAPPER(html)
@@ -133,7 +134,7 @@ def parse_deps(orig_doc: Doc, options: Dict[str, Any] = {}) -> Dict[str, Any]:
                 "lemma": np.root.lemma_,
                 "ent_type": np.root.ent_type_,
             }
-            retokenizer.merge(np, attrs=attrs)
+            retokenizer.merge(np, attrs=attrs)  # type: ignore[arg-type]
     if options.get("collapse_punct", True):
         spans = []
         for word in doc[:-1]:
@@ -148,7 +149,7 @@ def parse_deps(orig_doc: Doc, options: Dict[str, Any] = {}) -> Dict[str, Any]:
         with doc.retokenize() as retokenizer:
             for span, tag, lemma, ent_type in spans:
                 attrs = {"tag": tag, "lemma": lemma, "ent_type": ent_type}
-                retokenizer.merge(span, attrs=attrs)
+                retokenizer.merge(span, attrs=attrs)  # type: ignore[arg-type]
     fine_grained = options.get("fine_grained")
     add_lemma = options.get("add_lemma")
     words = [
```
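The widened `docs` annotation above matches displaCy's `manual` mode, where pre-parsed dicts are rendered directly instead of `Doc`/`Span` objects. A small usage sketch:

```python
from spacy import displacy

# With manual=True, render() accepts a pre-parsed dict, as the new annotation allows.
parsed = {
    "text": "But Google is starting from behind.",
    "ents": [{"start": 4, "end": 10, "label": "ORG"}],
    "title": None,
}
html = displacy.render(parsed, style="ent", manual=True, page=True)
print(html[:60])
```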
```diff
@@ -1,5 +1,5 @@
 # cython: infer_types=True, profile=True
-from typing import Iterator, Iterable
+from typing import Iterator, Iterable, Callable, Dict, Any
 
 import srsly
 from cymem.cymem cimport Pool
@@ -446,7 +446,7 @@ cdef class KnowledgeBase:
             raise ValueError(Errors.E929.format(loc=path))
         if not path.is_dir():
             raise ValueError(Errors.E928.format(loc=path))
-        deserialize = {}
+        deserialize: Dict[str, Callable[[Any], Any]] = {}
         deserialize["contents"] = lambda p: self.read_contents(p)
         deserialize["strings.json"] = lambda p: self.vocab.strings.from_disk(p)
         util.from_disk(path, deserialize, exclude)
```
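Annotating the empty dict up front is the usual fix for Mypy being unable to infer value types for a `{}` that is filled in afterwards. A small sketch of the same pattern outside Cython:

```python
from pathlib import Path
from typing import Any, Callable, Dict

# Without the annotation, Mypy can't infer the value type of this empty dict.
deserialize: Dict[str, Callable[[Any], Any]] = {}
deserialize["contents"] = lambda p: Path(p).read_bytes()
deserialize["strings.json"] = lambda p: Path(p).read_text(encoding="utf8")
print(sorted(deserialize))
```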
```diff
@@ -1,8 +1,8 @@
 from .stop_words import STOP_WORDS
-from ...language import Language
+from ...language import Language, BaseDefaults
 
 
-class AfrikaansDefaults(Language.Defaults):
+class AfrikaansDefaults(BaseDefaults):
     stop_words = STOP_WORDS
 
 
@@ -4,12 +4,12 @@ from .punctuation import TOKENIZER_SUFFIXES
 
 from .tokenizer_exceptions import TOKENIZER_EXCEPTIONS
 from ..tokenizer_exceptions import BASE_EXCEPTIONS
-from ...language import Language
+from ...language import Language, BaseDefaults
 from ...attrs import LANG
 from ...util import update_exc
 
 
-class AmharicDefaults(Language.Defaults):
+class AmharicDefaults(BaseDefaults):
     lex_attr_getters = dict(Language.Defaults.lex_attr_getters)
     lex_attr_getters.update(LEX_ATTRS)
     lex_attr_getters[LANG] = lambda text: "am"
@@ -2,10 +2,10 @@ from .stop_words import STOP_WORDS
 from .lex_attrs import LEX_ATTRS
 from .punctuation import TOKENIZER_SUFFIXES
 from .tokenizer_exceptions import TOKENIZER_EXCEPTIONS
-from ...language import Language
+from ...language import Language, BaseDefaults
 
 
-class ArabicDefaults(Language.Defaults):
+class ArabicDefaults(BaseDefaults):
     tokenizer_exceptions = TOKENIZER_EXCEPTIONS
     suffixes = TOKENIZER_SUFFIXES
     stop_words = STOP_WORDS
@@ -1,9 +1,9 @@
 from .stop_words import STOP_WORDS
 from .lex_attrs import LEX_ATTRS
-from ...language import Language
+from ...language import Language, BaseDefaults
 
 
-class AzerbaijaniDefaults(Language.Defaults):
+class AzerbaijaniDefaults(BaseDefaults):
     lex_attr_getters = LEX_ATTRS
     stop_words = STOP_WORDS
 
```
```diff
@@ -3,12 +3,12 @@ from .tokenizer_exceptions import TOKENIZER_EXCEPTIONS
 from .lex_attrs import LEX_ATTRS
 from ..tokenizer_exceptions import BASE_EXCEPTIONS
 
-from ...language import Language
+from ...language import Language, BaseDefaults
 from ...attrs import LANG
 from ...util import update_exc
 
 
-class BulgarianDefaults(Language.Defaults):
+class BulgarianDefaults(BaseDefaults):
     lex_attr_getters = dict(Language.Defaults.lex_attr_getters)
     lex_attr_getters[LANG] = lambda text: "bg"
 
@@ -3,11 +3,11 @@ from thinc.api import Model
 from .tokenizer_exceptions import TOKENIZER_EXCEPTIONS
 from .punctuation import TOKENIZER_PREFIXES, TOKENIZER_SUFFIXES, TOKENIZER_INFIXES
 from .stop_words import STOP_WORDS
-from ...language import Language
+from ...language import Language, BaseDefaults
 from ...pipeline import Lemmatizer
 
 
-class BengaliDefaults(Language.Defaults):
+class BengaliDefaults(BaseDefaults):
     tokenizer_exceptions = TOKENIZER_EXCEPTIONS
     prefixes = TOKENIZER_PREFIXES
     suffixes = TOKENIZER_SUFFIXES
@@ -7,11 +7,11 @@ from .punctuation import TOKENIZER_INFIXES, TOKENIZER_SUFFIXES
 from .stop_words import STOP_WORDS
 from .lex_attrs import LEX_ATTRS
 from .syntax_iterators import SYNTAX_ITERATORS
-from ...language import Language
+from ...language import Language, BaseDefaults
 from .lemmatizer import CatalanLemmatizer
 
 
-class CatalanDefaults(Language.Defaults):
+class CatalanDefaults(BaseDefaults):
     tokenizer_exceptions = TOKENIZER_EXCEPTIONS
     infixes = TOKENIZER_INFIXES
     suffixes = TOKENIZER_SUFFIXES
```
```diff
@@ -1,8 +1,10 @@
+from typing import Union, Iterator, Tuple
+from ...tokens import Doc, Span
 from ...symbols import NOUN, PROPN
 from ...errors import Errors
 
 
-def noun_chunks(doclike):
+def noun_chunks(doclike: Union[Doc, Span]) -> Iterator[Tuple[int, int, int]]:
     """Detect base noun phrases from a dependency parse. Works on Doc and Span."""
     # fmt: off
     labels = ["nsubj", "nsubj:pass", "obj", "obl", "iobj", "ROOT", "appos", "nmod", "nmod:poss"]
@@ -1,9 +1,9 @@
 from .stop_words import STOP_WORDS
 from .lex_attrs import LEX_ATTRS
-from ...language import Language
+from ...language import Language, BaseDefaults
 
 
-class CzechDefaults(Language.Defaults):
+class CzechDefaults(BaseDefaults):
     lex_attr_getters = LEX_ATTRS
     stop_words = STOP_WORDS
 
@@ -3,10 +3,10 @@ from .punctuation import TOKENIZER_INFIXES, TOKENIZER_SUFFIXES
 from .stop_words import STOP_WORDS
 from .lex_attrs import LEX_ATTRS
 from .syntax_iterators import SYNTAX_ITERATORS
-from ...language import Language
+from ...language import Language, BaseDefaults
 
 
-class DanishDefaults(Language.Defaults):
+class DanishDefaults(BaseDefaults):
     tokenizer_exceptions = TOKENIZER_EXCEPTIONS
     infixes = TOKENIZER_INFIXES
     suffixes = TOKENIZER_SUFFIXES
```
```diff
@@ -1,8 +1,10 @@
+from typing import Union, Iterator, Tuple
+from ...tokens import Doc, Span
 from ...symbols import NOUN, PROPN, PRON, VERB, AUX
 from ...errors import Errors
 
 
-def noun_chunks(doclike):
+def noun_chunks(doclike: Union[Doc, Span]) -> Iterator[Tuple[int, int, int]]:
     def is_verb_token(tok):
         return tok.pos in [VERB, AUX]
 
@@ -32,7 +34,7 @@ def noun_chunks(doclike):
     def get_bounds(doc, root):
         return get_left_bound(doc, root), get_right_bound(doc, root)
 
-    doc = doclike.doc
+    doc = doclike.doc  # Ensure works on both Doc and Span.
 
     if not doc.has_annotation("DEP"):
         raise ValueError(Errors.E029)
@@ -2,10 +2,10 @@ from .tokenizer_exceptions import TOKENIZER_EXCEPTIONS
 from .punctuation import TOKENIZER_PREFIXES, TOKENIZER_SUFFIXES, TOKENIZER_INFIXES
 from .stop_words import STOP_WORDS
 from .syntax_iterators import SYNTAX_ITERATORS
-from ...language import Language
+from ...language import Language, BaseDefaults
 
 
-class GermanDefaults(Language.Defaults):
+class GermanDefaults(BaseDefaults):
     tokenizer_exceptions = TOKENIZER_EXCEPTIONS
     prefixes = TOKENIZER_PREFIXES
     suffixes = TOKENIZER_SUFFIXES
```
```diff
@@ -1,11 +1,11 @@
-from typing import Union, Iterator
+from typing import Union, Iterator, Tuple
 
 from ...symbols import NOUN, PROPN, PRON
 from ...errors import Errors
 from ...tokens import Doc, Span
 
 
-def noun_chunks(doclike: Union[Doc, Span]) -> Iterator[Span]:
+def noun_chunks(doclike: Union[Doc, Span]) -> Iterator[Tuple[int, int, int]]:
     """Detect base noun phrases from a dependency parse. Works on Doc and Span."""
     # this iterator extracts spans headed by NOUNs starting from the left-most
     # syntactic dependent until the NOUN itself for close apposition and
@@ -7,10 +7,10 @@ from .lex_attrs import LEX_ATTRS
 from .syntax_iterators import SYNTAX_ITERATORS
 from .punctuation import TOKENIZER_PREFIXES, TOKENIZER_SUFFIXES, TOKENIZER_INFIXES
 from .lemmatizer import GreekLemmatizer
-from ...language import Language
+from ...language import Language, BaseDefaults
 
 
-class GreekDefaults(Language.Defaults):
+class GreekDefaults(BaseDefaults):
     tokenizer_exceptions = TOKENIZER_EXCEPTIONS
     prefixes = TOKENIZER_PREFIXES
     suffixes = TOKENIZER_SUFFIXES
@@ -1,11 +1,11 @@
-from typing import Union, Iterator
+from typing import Union, Iterator, Tuple
 
 from ...symbols import NOUN, PROPN, PRON
 from ...errors import Errors
 from ...tokens import Doc, Span
 
 
-def noun_chunks(doclike: Union[Doc, Span]) -> Iterator[Span]:
+def noun_chunks(doclike: Union[Doc, Span]) -> Iterator[Tuple[int, int, int]]:
     """Detect base noun phrases from a dependency parse. Works on Doc and Span."""
     # It follows the logic of the noun chunks finder of English language,
     # adjusted to some Greek language special characteristics.
```
```diff
@@ -7,10 +7,10 @@ from .lex_attrs import LEX_ATTRS
 from .syntax_iterators import SYNTAX_ITERATORS
 from .punctuation import TOKENIZER_INFIXES
 from .lemmatizer import EnglishLemmatizer
-from ...language import Language
+from ...language import Language, BaseDefaults
 
 
-class EnglishDefaults(Language.Defaults):
+class EnglishDefaults(BaseDefaults):
     tokenizer_exceptions = TOKENIZER_EXCEPTIONS
     infixes = TOKENIZER_INFIXES
     lex_attr_getters = LEX_ATTRS
@@ -19,7 +19,7 @@ _ordinal_words = [
 # fmt: on
 
 
-def like_num(text: str) -> bool:
+def like_num(text):
     if text.startswith(("+", "-", "±", "~")):
         text = text[1:]
     text = text.replace(",", "").replace(".", "")
@@ -1,11 +1,11 @@
-from typing import Union, Iterator
+from typing import Union, Iterator, Tuple
 
 from ...symbols import NOUN, PROPN, PRON
 from ...errors import Errors
 from ...tokens import Doc, Span
 
 
-def noun_chunks(doclike: Union[Doc, Span]) -> Iterator[Span]:
+def noun_chunks(doclike: Union[Doc, Span]) -> Iterator[Tuple[int, int, int]]:
     """
     Detect base noun phrases from a dependency parse. Works on both Doc and Span.
     """
```
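As the new annotations spell out, the per-language `noun_chunks` syntax iterators yield `(start, end, label)` integer tuples internally; the public `Doc.noun_chunks` API still yields `Span` objects. A usage sketch, assuming the small English pipeline is installed (`python -m spacy download en_core_web_sm`):

```python
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Autonomous cars shift insurance liability toward manufacturers.")
for chunk in doc.noun_chunks:  # still yields Span objects
    print(chunk.text, "<-", chunk.root.dep_, chunk.root.head.text)
```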
```diff
@@ -1,9 +1,10 @@
+from typing import Dict, List
 from ..tokenizer_exceptions import BASE_EXCEPTIONS
 from ...symbols import ORTH, NORM
 from ...util import update_exc
 
 
-_exc = {}
+_exc: Dict[str, List[Dict]] = {}
 _exclude = [
     "Ill",
     "ill",
@@ -294,9 +295,9 @@ for verb_data in [
     {ORTH: "has", NORM: "has"},
     {ORTH: "dare", NORM: "dare"},
 ]:
-    verb_data_tc = dict(verb_data)
+    verb_data_tc = dict(verb_data)  # type: ignore[call-overload]
     verb_data_tc[ORTH] = verb_data_tc[ORTH].title()
-    for data in [verb_data, verb_data_tc]:
+    for data in [verb_data, verb_data_tc]:  # type: ignore[assignment]
         _exc[data[ORTH] + "n't"] = [
             dict(data),
             {ORTH: "n't", NORM: "not"},
@@ -6,10 +6,10 @@ from .lex_attrs import LEX_ATTRS
 from .lemmatizer import SpanishLemmatizer
 from .syntax_iterators import SYNTAX_ITERATORS
 from .punctuation import TOKENIZER_INFIXES, TOKENIZER_SUFFIXES
-from ...language import Language
+from ...language import Language, BaseDefaults
 
 
-class SpanishDefaults(Language.Defaults):
+class SpanishDefaults(BaseDefaults):
     tokenizer_exceptions = TOKENIZER_EXCEPTIONS
     infixes = TOKENIZER_INFIXES
     suffixes = TOKENIZER_SUFFIXES
```
```diff
@@ -52,7 +52,7 @@ class SpanishLemmatizer(Lemmatizer):
             rule_pos = "verb"
         else:
             rule_pos = pos
-        rule = self.select_rule(rule_pos, features)
+        rule = self.select_rule(rule_pos, list(features))
         index = self.lookups.get_table("lemma_index").get(rule_pos, [])
         lemmas = getattr(self, "lemmatize_" + rule_pos)(
             string, features, rule, index
@@ -191,6 +191,8 @@ class SpanishLemmatizer(Lemmatizer):
                 return selected_lemmas
             else:
                 return possible_lemmas
+        else:
+            return []
 
     def lemmatize_noun(
         self, word: str, features: List[str], rule: str, index: List[str]
@@ -268,7 +270,7 @@ class SpanishLemmatizer(Lemmatizer):
             return [word]
 
     def lemmatize_pron(
-        self, word: str, features: List[str], rule: str, index: List[str]
+        self, word: str, features: List[str], rule: Optional[str], index: List[str]
     ) -> List[str]:
         """
         Lemmatize a pronoun.
@@ -319,9 +321,11 @@ class SpanishLemmatizer(Lemmatizer):
                 return selected_lemmas
             else:
                 return possible_lemmas
+        else:
+            return []
 
     def lemmatize_verb(
-        self, word: str, features: List[str], rule: str, index: List[str]
+        self, word: str, features: List[str], rule: Optional[str], index: List[str]
     ) -> List[str]:
         """
         Lemmatize a verb.
@@ -342,6 +346,7 @@ class SpanishLemmatizer(Lemmatizer):
         selected_lemmas = []
 
         # Apply lemmatization rules
+        rule = str(rule or "")
         for old, new in self.lookups.get_table("lemma_rules").get(rule, []):
             possible_lemma = re.sub(old + "$", new, word)
             if possible_lemma != word:
@@ -389,11 +394,11 @@ class SpanishLemmatizer(Lemmatizer):
             return [word]
 
     def lemmatize_verb_pron(
-        self, word: str, features: List[str], rule: str, index: List[str]
+        self, word: str, features: List[str], rule: Optional[str], index: List[str]
     ) -> List[str]:
         # Strip and collect pronouns
         pron_patt = "^(.*?)([mts]e|l[aeo]s?|n?os)$"
-        prons = []
+        prons: List[str] = []
         verb = word
         m = re.search(pron_patt, verb)
         while m is not None and len(prons) <= 3:
@@ -410,7 +415,7 @@ class SpanishLemmatizer(Lemmatizer):
         else:
             rule = self.select_rule("verb", features)
         verb_lemma = self.lemmatize_verb(
-            verb, features - {"PronType=Prs"}, rule, index
+            verb, features - {"PronType=Prs"}, rule, index  # type: ignore[operator]
         )[0]
         pron_lemmas = []
         for pron in prons:
```
```diff
@@ -1,11 +1,11 @@
-from typing import Union, Iterator
+from typing import Union, Iterator, Tuple

 from ...symbols import NOUN, PROPN, PRON, VERB, AUX
 from ...errors import Errors
 from ...tokens import Doc, Span, Token


-def noun_chunks(doclike: Union[Doc, Span]) -> Iterator[Span]:
+def noun_chunks(doclike: Union[Doc, Span]) -> Iterator[Tuple[int, int, int]]:
     """Detect base noun phrases from a dependency parse. Works on Doc and Span."""
     doc = doclike.doc
     if not doc.has_annotation("DEP"):
```
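The `noun_chunks` signature changes in this and the following syntax-iterator hunks reflect what these iterators actually yield: `(start, end, label)` integer triples that `Doc.noun_chunks` later wraps into `Span` objects, not `Span` objects themselves. A toy illustration of that contract, simplified and not spaCy internals:

```python
from typing import Iterator, List, Tuple


def toy_noun_chunks(pos_tags: List[str]) -> Iterator[Tuple[int, int, int]]:
    """Illustrative only: a syntax iterator yields (start, end, label-ID)
    integer triples; the caller is what turns them into Span objects."""
    np_label = 0  # in spaCy this would come from doc.vocab.strings.add("NP")
    for i, tag in enumerate(pos_tags):
        if tag in ("NOUN", "PROPN", "PRON"):
            yield i, i + 1, np_label


print(list(toy_noun_chunks(["DET", "NOUN", "VERB", "PROPN"])))  # [(1, 2, 0), (3, 4, 0)]
```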
```diff
@@ -1,8 +1,8 @@
 from .stop_words import STOP_WORDS
-from ...language import Language
+from ...language import Language, BaseDefaults


-class EstonianDefaults(Language.Defaults):
+class EstonianDefaults(BaseDefaults):
     stop_words = STOP_WORDS
```

```diff
@@ -1,10 +1,10 @@
 from .stop_words import STOP_WORDS
 from .lex_attrs import LEX_ATTRS
 from .punctuation import TOKENIZER_SUFFIXES
-from ...language import Language
+from ...language import Language, BaseDefaults


-class BasqueDefaults(Language.Defaults):
+class BasqueDefaults(BaseDefaults):
     suffixes = TOKENIZER_SUFFIXES
     stop_words = STOP_WORDS
     lex_attr_getters = LEX_ATTRS
```

```diff
@@ -5,11 +5,11 @@ from .lex_attrs import LEX_ATTRS
 from .tokenizer_exceptions import TOKENIZER_EXCEPTIONS
 from .punctuation import TOKENIZER_SUFFIXES
 from .syntax_iterators import SYNTAX_ITERATORS
-from ...language import Language
+from ...language import Language, BaseDefaults
 from ...pipeline import Lemmatizer


-class PersianDefaults(Language.Defaults):
+class PersianDefaults(BaseDefaults):
     tokenizer_exceptions = TOKENIZER_EXCEPTIONS
     suffixes = TOKENIZER_SUFFIXES
     lex_attr_getters = LEX_ATTRS
```
```diff
@@ -639,10 +639,12 @@ for verb_root in verb_roots:
     )

     if past.startswith("آ"):
-        conjugations = set(
-            map(
-                lambda item: item.replace("بآ", "بیا").replace("نآ", "نیا"),
-                conjugations,
+        conjugations = list(
+            set(
+                map(
+                    lambda item: item.replace("بآ", "بیا").replace("نآ", "نیا"),
+                    conjugations,
+                )
             )
         )

```
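The Persian verb-conjugation hunk wraps the deduplicating `set(...)` back into `list(...)` so the `conjugations` variable keeps a single list type instead of being rebound to a set, which Mypy flags unless redefinition is enabled. A small sketch of the same idea on dummy data; the forms below are made up:

```python
from typing import List

conjugations: List[str] = ["bi-amad", "bi-amad", "na-amad"]  # made-up forms, one duplicate

# Deduplicate but convert straight back to a list, keeping the variable's
# List[str] type stable instead of rebinding it to a Set[str].
conjugations = list(set(form.replace("bi-", "biya-") for form in conjugations))
print(sorted(conjugations))  # ['biya-amad', 'na-amad']
```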
```diff
@@ -1,8 +1,10 @@
+from typing import Union, Iterator, Tuple
+from ...tokens import Doc, Span
 from ...symbols import NOUN, PROPN, PRON
 from ...errors import Errors


-def noun_chunks(doclike):
+def noun_chunks(doclike: Union[Doc, Span]) -> Iterator[Tuple[int, int, int]]:
     """
     Detect base noun phrases from a dependency parse. Works on both Doc and Span.
     """
```
```diff
@@ -2,10 +2,10 @@ from .tokenizer_exceptions import TOKENIZER_EXCEPTIONS
 from .stop_words import STOP_WORDS
 from .lex_attrs import LEX_ATTRS
 from .punctuation import TOKENIZER_INFIXES, TOKENIZER_SUFFIXES
-from ...language import Language
+from ...language import Language, BaseDefaults


-class FinnishDefaults(Language.Defaults):
+class FinnishDefaults(BaseDefaults):
     infixes = TOKENIZER_INFIXES
     suffixes = TOKENIZER_SUFFIXES
     tokenizer_exceptions = TOKENIZER_EXCEPTIONS
```

```diff
@@ -9,10 +9,10 @@ from .stop_words import STOP_WORDS
 from .lex_attrs import LEX_ATTRS
 from .syntax_iterators import SYNTAX_ITERATORS
 from .lemmatizer import FrenchLemmatizer
-from ...language import Language
+from ...language import Language, BaseDefaults


-class FrenchDefaults(Language.Defaults):
+class FrenchDefaults(BaseDefaults):
     tokenizer_exceptions = TOKENIZER_EXCEPTIONS
     prefixes = TOKENIZER_PREFIXES
     infixes = TOKENIZER_INFIXES
```
```diff
@@ -1,11 +1,11 @@
-from typing import Union, Iterator
+from typing import Union, Iterator, Tuple

 from ...symbols import NOUN, PROPN, PRON
 from ...errors import Errors
 from ...tokens import Doc, Span


-def noun_chunks(doclike: Union[Doc, Span]) -> Iterator[Span]:
+def noun_chunks(doclike: Union[Doc, Span]) -> Iterator[Tuple[int, int, int]]:
     """Detect base noun phrases from a dependency parse. Works on Doc and Span."""
     # fmt: off
     labels = ["nsubj", "nsubj:pass", "obj", "iobj", "ROOT", "appos", "nmod", "nmod:poss"]
```
```diff
@@ -115,7 +115,7 @@ for s, verb, pronoun in [("s", "est", "il"), ("S", "EST", "IL")]:
     ]


-_infixes_exc = []
+_infixes_exc = []  # type: ignore[var-annotated]
 orig_elision = "'"
 orig_hyphen = "-"

```
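`_infixes_exc = []` triggers Mypy's `var-annotated` error because an empty literal gives it no element type to infer. The diff opts for a narrowly scoped ignore; an explicit annotation would work too. Both options sketched below; the `List[Tuple[str, str]]` element type is an assumption for illustration only:

```python
from typing import List, Tuple

# Option taken in the diff: keep the bare literal, silence only this error code.
_infixes_exc = []  # type: ignore[var-annotated]

# Alternative: state the element type explicitly (the Tuple[str, str] shape
# here is assumed for illustration, not taken from the real module).
_infixes_exc_annotated: List[Tuple[str, str]] = []
```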
```diff
@@ -1,9 +1,9 @@
 from .tokenizer_exceptions import TOKENIZER_EXCEPTIONS
 from .stop_words import STOP_WORDS
-from ...language import Language
+from ...language import Language, BaseDefaults


-class IrishDefaults(Language.Defaults):
+class IrishDefaults(BaseDefaults):
     tokenizer_exceptions = TOKENIZER_EXCEPTIONS
     stop_words = STOP_WORDS
```

```diff
@@ -1,10 +1,10 @@
 from .tokenizer_exceptions import TOKENIZER_EXCEPTIONS
 from .stop_words import STOP_WORDS
 from .lex_attrs import LEX_ATTRS
-from ...language import Language
+from ...language import Language, BaseDefaults


-class AncientGreekDefaults(Language.Defaults):
+class AncientGreekDefaults(BaseDefaults):
     tokenizer_exceptions = TOKENIZER_EXCEPTIONS
     lex_attr_getters = LEX_ATTRS
     stop_words = STOP_WORDS
```
```diff
@@ -108,8 +108,4 @@ _other_exc = {

 _exc.update(_other_exc)

-_exc_data = {}
-
-_exc.update(_exc_data)
-
 TOKENIZER_EXCEPTIONS = update_exc(BASE_EXCEPTIONS, _exc)
```
```diff
@@ -1,8 +1,8 @@
 from .stop_words import STOP_WORDS
-from ...language import Language
+from ...language import Language, BaseDefaults


-class GujaratiDefaults(Language.Defaults):
+class GujaratiDefaults(BaseDefaults):
     stop_words = STOP_WORDS
```

```diff
@@ -1,9 +1,9 @@
 from .stop_words import STOP_WORDS
 from .lex_attrs import LEX_ATTRS
-from ...language import Language
+from ...language import Language, BaseDefaults


-class HebrewDefaults(Language.Defaults):
+class HebrewDefaults(BaseDefaults):
     stop_words = STOP_WORDS
     lex_attr_getters = LEX_ATTRS
     writing_system = {"direction": "rtl", "has_case": False, "has_letters": True}
```

```diff
@@ -1,9 +1,9 @@
 from .stop_words import STOP_WORDS
 from .lex_attrs import LEX_ATTRS
-from ...language import Language
+from ...language import Language, BaseDefaults


-class HindiDefaults(Language.Defaults):
+class HindiDefaults(BaseDefaults):
     stop_words = STOP_WORDS
     lex_attr_getters = LEX_ATTRS
```

```diff
@@ -1,8 +1,8 @@
 from .stop_words import STOP_WORDS
-from ...language import Language
+from ...language import Language, BaseDefaults


-class CroatianDefaults(Language.Defaults):
+class CroatianDefaults(BaseDefaults):
     stop_words = STOP_WORDS
```

```diff
@@ -1,10 +1,10 @@
 from .tokenizer_exceptions import TOKENIZER_EXCEPTIONS, TOKEN_MATCH
 from .punctuation import TOKENIZER_PREFIXES, TOKENIZER_SUFFIXES, TOKENIZER_INFIXES
 from .stop_words import STOP_WORDS
-from ...language import Language
+from ...language import Language, BaseDefaults


-class HungarianDefaults(Language.Defaults):
+class HungarianDefaults(BaseDefaults):
     tokenizer_exceptions = TOKENIZER_EXCEPTIONS
     prefixes = TOKENIZER_PREFIXES
     suffixes = TOKENIZER_SUFFIXES
```

```diff
@@ -1,9 +1,9 @@
 from .stop_words import STOP_WORDS
 from .lex_attrs import LEX_ATTRS
-from ...language import Language
+from ...language import Language, BaseDefaults


-class ArmenianDefaults(Language.Defaults):
+class ArmenianDefaults(BaseDefaults):
     lex_attr_getters = LEX_ATTRS
     stop_words = STOP_WORDS
```
```diff
@@ -3,10 +3,10 @@ from .punctuation import TOKENIZER_SUFFIXES, TOKENIZER_PREFIXES, TOKENIZER_INFIX
 from .tokenizer_exceptions import TOKENIZER_EXCEPTIONS
 from .lex_attrs import LEX_ATTRS
 from .syntax_iterators import SYNTAX_ITERATORS
-from ...language import Language
+from ...language import Language, BaseDefaults


-class IndonesianDefaults(Language.Defaults):
+class IndonesianDefaults(BaseDefaults):
     tokenizer_exceptions = TOKENIZER_EXCEPTIONS
     prefixes = TOKENIZER_PREFIXES
     suffixes = TOKENIZER_SUFFIXES
```
```diff
@@ -1,11 +1,11 @@
-from typing import Union, Iterator
+from typing import Union, Iterator, Tuple

 from ...symbols import NOUN, PROPN, PRON
 from ...errors import Errors
 from ...tokens import Doc, Span


-def noun_chunks(doclike: Union[Doc, Span]) -> Iterator[Span]:
+def noun_chunks(doclike: Union[Doc, Span]) -> Iterator[Tuple[int, int, int]]:
     """
     Detect base noun phrases from a dependency parse. Works on both Doc and Span.
     """
```
```diff
@@ -1,8 +1,8 @@
 from .stop_words import STOP_WORDS
-from ...language import Language
+from ...language import Language, BaseDefaults


-class IcelandicDefaults(Language.Defaults):
+class IcelandicDefaults(BaseDefaults):
     stop_words = STOP_WORDS
```
```diff
@@ -4,11 +4,11 @@ from thinc.api import Model
 from .stop_words import STOP_WORDS
 from .tokenizer_exceptions import TOKENIZER_EXCEPTIONS
 from .punctuation import TOKENIZER_PREFIXES, TOKENIZER_INFIXES
-from ...language import Language
+from ...language import Language, BaseDefaults
 from .lemmatizer import ItalianLemmatizer


-class ItalianDefaults(Language.Defaults):
+class ItalianDefaults(BaseDefaults):
     tokenizer_exceptions = TOKENIZER_EXCEPTIONS
     stop_words = STOP_WORDS
     prefixes = TOKENIZER_PREFIXES
```
```diff
@@ -10,7 +10,7 @@ from .tag_orth_map import TAG_ORTH_MAP
 from .tag_bigram_map import TAG_BIGRAM_MAP
 from ...compat import copy_reg
 from ...errors import Errors
-from ...language import Language
+from ...language import Language, BaseDefaults
 from ...scorer import Scorer
 from ...symbols import POS
 from ...tokens import Doc
@@ -154,7 +154,7 @@ class JapaneseTokenizer(DummyTokenizer):
     def to_disk(self, path: Union[str, Path], **kwargs) -> None:
         path = util.ensure_path(path)
         serializers = {"cfg": lambda p: srsly.write_json(p, self._get_config())}
-        return util.to_disk(path, serializers, [])
+        util.to_disk(path, serializers, [])

     def from_disk(self, path: Union[str, Path], **kwargs) -> "JapaneseTokenizer":
         path = util.ensure_path(path)
@@ -164,7 +164,7 @@ class JapaneseTokenizer(DummyTokenizer):
         return self


-class JapaneseDefaults(Language.Defaults):
+class JapaneseDefaults(BaseDefaults):
     config = load_config_from_str(DEFAULT_CONFIG)
     stop_words = STOP_WORDS
     syntax_iterators = SYNTAX_ITERATORS
```
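The `to_disk` change above follows directly from its signature: the method is declared `-> None`, so the fix drops the `return` and calls the serialization helper purely for its side effect. A sketch of the same shape with a stand-in writer instead of spaCy's `util.to_disk`:

```python
from pathlib import Path
from typing import Union


def write_config(path: Path) -> None:
    # Stand-in for the real serialization helper.
    path.write_text("{}")


def to_disk(path: Union[str, Path]) -> None:
    path = Path(path)  # mirrors util.ensure_path(); a str is narrowed to Path
    # Call the helper and fall through, matching the declared `-> None`
    # signature instead of returning the helper's (non-)result.
    write_config(path)
```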
```diff
@@ -1,4 +1,4 @@
-from typing import Union, Iterator
+from typing import Union, Iterator, Tuple, Set

 from ...symbols import NOUN, PROPN, PRON, VERB
 from ...tokens import Doc, Span
@@ -10,13 +10,13 @@ labels = ["nsubj", "nmod", "ddoclike", "nsubjpass", "pcomp", "pdoclike", "doclik
 # fmt: on


-def noun_chunks(doclike: Union[Doc, Span]) -> Iterator[Span]:
+def noun_chunks(doclike: Union[Doc, Span]) -> Iterator[Tuple[int, int, int]]:
     """Detect base noun phrases from a dependency parse. Works on Doc and Span."""
     doc = doclike.doc  # Ensure works on both Doc and Span.
     np_deps = [doc.vocab.strings.add(label) for label in labels]
     doc.vocab.strings.add("conj")
     np_label = doc.vocab.strings.add("NP")
-    seen = set()
+    seen: Set[int] = set()
     for i, word in enumerate(doclike):
         if word.pos not in (NOUN, PROPN, PRON):
             continue
```
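Same idea as the empty list earlier: `set()` alone gives Mypy nothing to infer from, so the iterator spells out `Set[int]` for the token indices it tracks. Minimal sketch:

```python
from typing import Set

seen: Set[int] = set()  # element type stated up front; bare set() is untyped
for token_index in (3, 5, 3):
    if token_index in seen:
        continue
    seen.add(token_index)
print(sorted(seen))  # [3, 5]
```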
```diff
@@ -1,8 +1,8 @@
 from .stop_words import STOP_WORDS
-from ...language import Language
+from ...language import Language, BaseDefaults


-class KannadaDefaults(Language.Defaults):
+class KannadaDefaults(BaseDefaults):
     stop_words = STOP_WORDS
```
```diff
@@ -1,9 +1,9 @@
-from typing import Optional, Any, Dict
+from typing import Iterator, Any, Dict

 from .stop_words import STOP_WORDS
 from .tag_map import TAG_MAP
 from .lex_attrs import LEX_ATTRS
-from ...language import Language
+from ...language import Language, BaseDefaults
 from ...tokens import Doc
 from ...compat import copy_reg
 from ...scorer import Scorer
@@ -29,9 +29,9 @@ def create_tokenizer():


 class KoreanTokenizer(DummyTokenizer):
-    def __init__(self, nlp: Optional[Language] = None):
+    def __init__(self, nlp: Language):
         self.vocab = nlp.vocab
-        MeCab = try_mecab_import()
+        MeCab = try_mecab_import()  # type: ignore[func-returns-value]
         self.mecab_tokenizer = MeCab("-F%f[0],%f[7]")

     def __del__(self):
@@ -49,7 +49,7 @@ class KoreanTokenizer(DummyTokenizer):
         doc.user_data["full_tags"] = [dt["tag"] for dt in dtokens]
         return doc

-    def detailed_tokens(self, text: str) -> Dict[str, Any]:
+    def detailed_tokens(self, text: str) -> Iterator[Dict[str, Any]]:
         # 품사 태그(POS)[0], 의미 부류(semantic class)[1], 종성 유무(jongseong)[2], 읽기(reading)[3],
         # 타입(type)[4], 첫번째 품사(start pos)[5], 마지막 품사(end pos)[6], 표현(expression)[7], *
         for node in self.mecab_tokenizer.parse(text, as_nodes=True):
@@ -68,7 +68,7 @@ class KoreanTokenizer(DummyTokenizer):
         return Scorer.score_tokenization(examples)


-class KoreanDefaults(Language.Defaults):
+class KoreanDefaults(BaseDefaults):
     config = load_config_from_str(DEFAULT_CONFIG)
     lex_attr_getters = LEX_ATTRS
     stop_words = STOP_WORDS
```
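In the Korean tokenizer, `detailed_tokens` is a generator, so its return type becomes `Iterator[Dict[str, Any]]` rather than a single dict, and `__init__` now requires an actual `Language` because `nlp.vocab` is accessed unconditionally. A toy generator showing the annotation half; the MeCab-backed logic is replaced here by a plain whitespace split:

```python
from typing import Any, Dict, Iterator


def detailed_tokens(text: str) -> Iterator[Dict[str, Any]]:
    """Toy stand-in: a function containing `yield` returns an iterator, so the
    annotation is Iterator[...] rather than the yielded dict type itself."""
    for surface in text.split():
        yield {"surface": surface, "tag": "UNK"}


print([t["surface"] for t in detailed_tokens("안녕 세계")])  # ['안녕', '세계']
```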
```diff
@@ -2,10 +2,10 @@ from .lex_attrs import LEX_ATTRS
 from .punctuation import TOKENIZER_INFIXES
 from .stop_words import STOP_WORDS
 from .tokenizer_exceptions import TOKENIZER_EXCEPTIONS
-from ...language import Language
+from ...language import Language, BaseDefaults


-class KyrgyzDefaults(Language.Defaults):
+class KyrgyzDefaults(BaseDefaults):
     tokenizer_exceptions = TOKENIZER_EXCEPTIONS
     infixes = TOKENIZER_INFIXES
     lex_attr_getters = LEX_ATTRS
```

```diff
@@ -2,10 +2,10 @@ from .tokenizer_exceptions import TOKENIZER_EXCEPTIONS
 from .punctuation import TOKENIZER_INFIXES
 from .lex_attrs import LEX_ATTRS
 from .stop_words import STOP_WORDS
-from ...language import Language
+from ...language import Language, BaseDefaults


-class LuxembourgishDefaults(Language.Defaults):
+class LuxembourgishDefaults(BaseDefaults):
     tokenizer_exceptions = TOKENIZER_EXCEPTIONS
     infixes = TOKENIZER_INFIXES
     lex_attr_getters = LEX_ATTRS
```

```diff
@@ -1,10 +1,10 @@
 from .stop_words import STOP_WORDS
 from .tokenizer_exceptions import TOKENIZER_EXCEPTIONS
 from .punctuation import TOKENIZER_INFIXES
-from ...language import Language
+from ...language import Language, BaseDefaults


-class LigurianDefaults(Language.Defaults):
+class LigurianDefaults(BaseDefaults):
     tokenizer_exceptions = TOKENIZER_EXCEPTIONS
     infixes = TOKENIZER_INFIXES
     stop_words = STOP_WORDS
```
```diff
@@ -2,10 +2,10 @@ from .punctuation import TOKENIZER_INFIXES, TOKENIZER_SUFFIXES
 from .tokenizer_exceptions import TOKENIZER_EXCEPTIONS
 from .stop_words import STOP_WORDS
 from .lex_attrs import LEX_ATTRS
-from ...language import Language
+from ...language import Language, BaseDefaults


-class LithuanianDefaults(Language.Defaults):
+class LithuanianDefaults(BaseDefaults):
     infixes = TOKENIZER_INFIXES
     suffixes = TOKENIZER_SUFFIXES
     tokenizer_exceptions = TOKENIZER_EXCEPTIONS
```

```diff
@@ -1,8 +1,8 @@
 from .stop_words import STOP_WORDS
-from ...language import Language
+from ...language import Language, BaseDefaults


-class LatvianDefaults(Language.Defaults):
+class LatvianDefaults(BaseDefaults):
     stop_words = STOP_WORDS
```
```diff
@@ -6,13 +6,13 @@ from .tokenizer_exceptions import TOKENIZER_EXCEPTIONS
 from .lex_attrs import LEX_ATTRS
 from ..tokenizer_exceptions import BASE_EXCEPTIONS

-from ...language import Language
+from ...language import Language, BaseDefaults
 from ...attrs import LANG
 from ...util import update_exc
 from ...lookups import Lookups


-class MacedonianDefaults(Language.Defaults):
+class MacedonianDefaults(BaseDefaults):
     lex_attr_getters = dict(Language.Defaults.lex_attr_getters)
     lex_attr_getters[LANG] = lambda text: "mk"

```
```diff
@@ -1,9 +1,9 @@
 from .stop_words import STOP_WORDS
 from .lex_attrs import LEX_ATTRS
-from ...language import Language
+from ...language import Language, BaseDefaults


-class MalayalamDefaults(Language.Defaults):
+class MalayalamDefaults(BaseDefaults):
     lex_attr_getters = LEX_ATTRS
     stop_words = STOP_WORDS
```

```diff
@@ -1,8 +1,8 @@
 from .stop_words import STOP_WORDS
-from ...language import Language
+from ...language import Language, BaseDefaults


-class MarathiDefaults(Language.Defaults):
+class MarathiDefaults(BaseDefaults):
     stop_words = STOP_WORDS
```
```diff
@@ -5,11 +5,11 @@ from .punctuation import TOKENIZER_PREFIXES, TOKENIZER_INFIXES
 from .punctuation import TOKENIZER_SUFFIXES
 from .stop_words import STOP_WORDS
 from .syntax_iterators import SYNTAX_ITERATORS
-from ...language import Language
+from ...language import Language, BaseDefaults
 from ...pipeline import Lemmatizer


-class NorwegianDefaults(Language.Defaults):
+class NorwegianDefaults(BaseDefaults):
     tokenizer_exceptions = TOKENIZER_EXCEPTIONS
     prefixes = TOKENIZER_PREFIXES
     infixes = TOKENIZER_INFIXES
```
```diff
@@ -1,11 +1,11 @@
-from typing import Union, Iterator
+from typing import Union, Iterator, Tuple

 from ...symbols import NOUN, PROPN, PRON
 from ...errors import Errors
 from ...tokens import Doc, Span


-def noun_chunks(doclike: Union[Doc, Span]) -> Iterator[Span]:
+def noun_chunks(doclike: Union[Doc, Span]) -> Iterator[Tuple[int, int, int]]:
     """Detect base noun phrases from a dependency parse. Works on Doc and Span."""
     # fmt: off
     labels = ["nsubj", "nsubj:pass", "obj", "iobj", "ROOT", "appos", "nmod", "nmod:poss"]
```
```diff
@@ -1,9 +1,9 @@
 from .stop_words import STOP_WORDS
 from .lex_attrs import LEX_ATTRS
-from ...language import Language
+from ...language import Language, BaseDefaults


-class NepaliDefaults(Language.Defaults):
+class NepaliDefaults(BaseDefaults):
     stop_words = STOP_WORDS
     lex_attr_getters = LEX_ATTRS
```
```diff
@@ -9,10 +9,10 @@ from .punctuation import TOKENIZER_SUFFIXES
 from .stop_words import STOP_WORDS
 from .syntax_iterators import SYNTAX_ITERATORS
 from .tokenizer_exceptions import TOKENIZER_EXCEPTIONS
-from ...language import Language
+from ...language import Language, BaseDefaults


-class DutchDefaults(Language.Defaults):
+class DutchDefaults(BaseDefaults):
     tokenizer_exceptions = TOKENIZER_EXCEPTIONS
     prefixes = TOKENIZER_PREFIXES
     infixes = TOKENIZER_INFIXES
```
```diff
@@ -1,11 +1,11 @@
-from typing import Union, Iterator
+from typing import Union, Iterator, Tuple

 from ...symbols import NOUN, PRON
 from ...errors import Errors
 from ...tokens import Doc, Span


-def noun_chunks(doclike: Union[Doc, Span]) -> Iterator[Span]:
+def noun_chunks(doclike: Union[Doc, Span]) -> Iterator[Tuple[int, int, int]]:
     """
     Detect base noun phrases from a dependency parse. Works on Doc and Span.
     The definition is inspired by https://www.nltk.org/book/ch07.html
```
```diff
@@ -8,7 +8,7 @@ from .stop_words import STOP_WORDS
 from .lex_attrs import LEX_ATTRS
 from .lemmatizer import PolishLemmatizer
 from ..tokenizer_exceptions import BASE_EXCEPTIONS
-from ...language import Language
+from ...language import Language, BaseDefaults


 TOKENIZER_EXCEPTIONS = {
@@ -16,7 +16,7 @@ TOKENIZER_EXCEPTIONS = {
 }


-class PolishDefaults(Language.Defaults):
+class PolishDefaults(BaseDefaults):
     tokenizer_exceptions = TOKENIZER_EXCEPTIONS
     prefixes = TOKENIZER_PREFIXES
     infixes = TOKENIZER_INFIXES
```
```diff
@@ -2,10 +2,10 @@ from .tokenizer_exceptions import TOKENIZER_EXCEPTIONS
 from .stop_words import STOP_WORDS
 from .lex_attrs import LEX_ATTRS
 from .punctuation import TOKENIZER_INFIXES, TOKENIZER_PREFIXES
-from ...language import Language
+from ...language import Language, BaseDefaults


-class PortugueseDefaults(Language.Defaults):
+class PortugueseDefaults(BaseDefaults):
     tokenizer_exceptions = TOKENIZER_EXCEPTIONS
     infixes = TOKENIZER_INFIXES
     prefixes = TOKENIZER_PREFIXES
```
```diff
@@ -3,14 +3,14 @@ from .stop_words import STOP_WORDS
 from .punctuation import TOKENIZER_PREFIXES, TOKENIZER_INFIXES
 from .punctuation import TOKENIZER_SUFFIXES
 from .lex_attrs import LEX_ATTRS
-from ...language import Language
+from ...language import Language, BaseDefaults

 # Lemma data note:
 # Original pairs downloaded from http://www.lexiconista.com/datasets/lemmatization/
 # Replaced characters using cedillas with the correct ones (ș and ț)


-class RomanianDefaults(Language.Defaults):
+class RomanianDefaults(BaseDefaults):
     tokenizer_exceptions = TOKENIZER_EXCEPTIONS
     prefixes = TOKENIZER_PREFIXES
     suffixes = TOKENIZER_SUFFIXES
```
```diff
@@ -5,10 +5,10 @@ from .stop_words import STOP_WORDS
 from .tokenizer_exceptions import TOKENIZER_EXCEPTIONS
 from .lex_attrs import LEX_ATTRS
 from .lemmatizer import RussianLemmatizer
-from ...language import Language
+from ...language import Language, BaseDefaults


-class RussianDefaults(Language.Defaults):
+class RussianDefaults(BaseDefaults):
     tokenizer_exceptions = TOKENIZER_EXCEPTIONS
     lex_attr_getters = LEX_ATTRS
     stop_words = STOP_WORDS
```
```diff
@@ -1,9 +1,9 @@
 from .stop_words import STOP_WORDS
 from .lex_attrs import LEX_ATTRS
-from ...language import Language
+from ...language import Language, BaseDefaults


-class SanskritDefaults(Language.Defaults):
+class SanskritDefaults(BaseDefaults):
     lex_attr_getters = LEX_ATTRS
     stop_words = STOP_WORDS
```

```diff
@@ -1,9 +1,9 @@
 from .stop_words import STOP_WORDS
 from .lex_attrs import LEX_ATTRS
-from ...language import Language
+from ...language import Language, BaseDefaults


-class SinhalaDefaults(Language.Defaults):
+class SinhalaDefaults(BaseDefaults):
     lex_attr_getters = LEX_ATTRS
     stop_words = STOP_WORDS
```

```diff
@@ -1,9 +1,9 @@
 from .stop_words import STOP_WORDS
 from .lex_attrs import LEX_ATTRS
-from ...language import Language
+from ...language import Language, BaseDefaults


-class SlovakDefaults(Language.Defaults):
+class SlovakDefaults(BaseDefaults):
     lex_attr_getters = LEX_ATTRS
     stop_words = STOP_WORDS
```

```diff
@@ -1,8 +1,8 @@
 from .stop_words import STOP_WORDS
-from ...language import Language
+from ...language import Language, BaseDefaults


-class SlovenianDefaults(Language.Defaults):
+class SlovenianDefaults(BaseDefaults):
     stop_words = STOP_WORDS
```

```diff
@@ -1,8 +1,8 @@
 from .stop_words import STOP_WORDS
-from ...language import Language
+from ...language import Language, BaseDefaults


-class AlbanianDefaults(Language.Defaults):
+class AlbanianDefaults(BaseDefaults):
     stop_words = STOP_WORDS
```
```diff
@@ -1,10 +1,10 @@
 from .stop_words import STOP_WORDS
 from .tokenizer_exceptions import TOKENIZER_EXCEPTIONS
 from .lex_attrs import LEX_ATTRS
-from ...language import Language
+from ...language import Language, BaseDefaults


-class SerbianDefaults(Language.Defaults):
+class SerbianDefaults(BaseDefaults):
     tokenizer_exceptions = TOKENIZER_EXCEPTIONS
     lex_attr_getters = LEX_ATTRS
     stop_words = STOP_WORDS
```
```diff
@@ -4,7 +4,7 @@ from .tokenizer_exceptions import TOKENIZER_EXCEPTIONS
 from .stop_words import STOP_WORDS
 from .lex_attrs import LEX_ATTRS
 from .syntax_iterators import SYNTAX_ITERATORS
-from ...language import Language
+from ...language import Language, BaseDefaults
 from ...pipeline import Lemmatizer

@@ -12,7 +12,7 @@ from ...pipeline import Lemmatizer
 from ..da.punctuation import TOKENIZER_INFIXES, TOKENIZER_SUFFIXES


-class SwedishDefaults(Language.Defaults):
+class SwedishDefaults(BaseDefaults):
     tokenizer_exceptions = TOKENIZER_EXCEPTIONS
     infixes = TOKENIZER_INFIXES
     suffixes = TOKENIZER_SUFFIXES
```
```diff
@@ -1,11 +1,11 @@
-from typing import Union, Iterator
+from typing import Union, Iterator, Tuple

 from ...symbols import NOUN, PROPN, PRON
 from ...errors import Errors
 from ...tokens import Doc, Span


-def noun_chunks(doclike: Union[Doc, Span]) -> Iterator[Span]:
+def noun_chunks(doclike: Union[Doc, Span]) -> Iterator[Tuple[int, int, int]]:
     """Detect base noun phrases from a dependency parse. Works on Doc and Span."""
     # fmt: off
     labels = ["nsubj", "nsubj:pass", "dobj", "obj", "iobj", "ROOT", "appos", "nmod", "nmod:poss"]
```
```diff
@@ -1,9 +1,9 @@
 from .stop_words import STOP_WORDS
 from .lex_attrs import LEX_ATTRS
-from ...language import Language
+from ...language import Language, BaseDefaults


-class TamilDefaults(Language.Defaults):
+class TamilDefaults(BaseDefaults):
     lex_attr_getters = LEX_ATTRS
     stop_words = STOP_WORDS
```

```diff
@@ -1,9 +1,9 @@
 from .stop_words import STOP_WORDS
 from .lex_attrs import LEX_ATTRS
-from ...language import Language
+from ...language import Language, BaseDefaults


-class TeluguDefaults(Language.Defaults):
+class TeluguDefaults(BaseDefaults):
     lex_attr_getters = LEX_ATTRS
     stop_words = STOP_WORDS
```
```diff
@@ -1,6 +1,6 @@
 from .stop_words import STOP_WORDS
 from .lex_attrs import LEX_ATTRS
-from ...language import Language
+from ...language import Language, BaseDefaults
 from ...tokens import Doc
 from ...util import DummyTokenizer, registry, load_config_from_str

@@ -39,7 +39,7 @@ class ThaiTokenizer(DummyTokenizer):
         return Doc(self.vocab, words=words, spaces=spaces)


-class ThaiDefaults(Language.Defaults):
+class ThaiDefaults(BaseDefaults):
     config = load_config_from_str(DEFAULT_CONFIG)
     lex_attr_getters = LEX_ATTRS
     stop_words = STOP_WORDS
```
```diff
@@ -4,12 +4,12 @@ from .punctuation import TOKENIZER_SUFFIXES

 from .tokenizer_exceptions import TOKENIZER_EXCEPTIONS
 from ..tokenizer_exceptions import BASE_EXCEPTIONS
-from ...language import Language
+from ...language import Language, BaseDefaults
 from ...attrs import LANG
 from ...util import update_exc


-class TigrinyaDefaults(Language.Defaults):
+class TigrinyaDefaults(BaseDefaults):
     lex_attr_getters = dict(Language.Defaults.lex_attr_getters)
     lex_attr_getters.update(LEX_ATTRS)
     lex_attr_getters[LANG] = lambda text: "ti"
```
```diff
@@ -1,10 +1,10 @@
 from .tokenizer_exceptions import TOKENIZER_EXCEPTIONS
 from .stop_words import STOP_WORDS
 from .lex_attrs import LEX_ATTRS
-from ...language import Language
+from ...language import Language, BaseDefaults


-class TagalogDefaults(Language.Defaults):
+class TagalogDefaults(BaseDefaults):
     tokenizer_exceptions = TOKENIZER_EXCEPTIONS
     lex_attr_getters = LEX_ATTRS
     stop_words = STOP_WORDS
```

```diff
@@ -1,10 +1,10 @@
 from .stop_words import STOP_WORDS
 from .lex_attrs import LEX_ATTRS
 from .punctuation import TOKENIZER_INFIXES
-from ...language import Language
+from ...language import Language, BaseDefaults


-class SetswanaDefaults(Language.Defaults):
+class SetswanaDefaults(BaseDefaults):
     infixes = TOKENIZER_INFIXES
     stop_words = STOP_WORDS
     lex_attr_getters = LEX_ATTRS
```
Some files were not shown because too many files have changed in this diff.