Compare commits

...

101 Commits

Author SHA1 Message Date
Jeff Adolphe
41e07772dc
Added Haitian Creole (ht) Language Support to spaCy (#13807)
This PR adds official support for Haitian Creole (ht) to spaCy's spacy/lang module.
It includes:

    Added all core language data files for spacy/lang/ht:
        tokenizer_exceptions.py
        punctuation.py
        lex_attrs.py
        syntax_iterators.py
        lemmatizer.py
        stop_words.py
        tag_map.py

    Unit tests for tokenizer and noun chunking (test_tokenizer.py, test_noun_chunking.py, etc.); all 58 pytest tests I created under spacy/tests/lang/ht pass.

    Basic tokenizer rules adapted for Haitian Creole orthography and informal contractions.

    Custom like_num attribute supporting Haitian number formats (e.g., "3yèm").

    Support for common informal apostrophe usage (e.g., "m'ap", "n'ap", "di'm").

    Ensured no breakages in other language modules.

    Followed spaCy coding style (PEP8, Black).

This provides a foundation for Haitian Creole NLP development using spaCy.
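
A minimal usage sketch (not part of the PR text) of what this enables, assuming a spaCy build that includes the new spacy/lang/ht data; the printed values are illustrative:

```
# Hedged sketch: exercising the new Haitian Creole (ht) language defaults.
# Assumes a spaCy installation that already contains spacy/lang/ht.
import spacy

nlp = spacy.blank("ht")  # tokenizer-only pipeline built from the new language defaults
doc = nlp("M'ap achte 3yèm liv la")

print([t.text for t in doc])      # contractions such as "M'ap" split according to the new rules
print([t.like_num for t in doc])  # "3yèm" should be flagged by the custom like_num attribute
```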
2025-05-28 17:23:38 +02:00
Martin Schorfmann
e8f40e2169
Correct API docs for Span.lemma_, Vocab.to_bytes and Vectors.__init__ (#13436)
* Correct code example for Span.lemma_ in API Docs (#13405)

* Correct documented return type of Vocab.to_bytes in API docs

* Correct wording for Vectors.__init__ in API docs
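
For reference, a small hedged example of the first two APIs touched by this docs fix, assuming the en_core_web_sm model is installed; the exact lemma output is illustrative:

```
# Hedged sketch of Span.lemma_ and Vocab.to_bytes usage.
# Assumes en_core_web_sm is installed; printed values are illustrative.
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("The striped bats were hanging")

span = doc[1:3]
print(span.lemma_)           # token lemmas joined with whitespace, e.g. "striped bat"

data = nlp.vocab.to_bytes()  # serializes the vocab; the return type is bytes
print(type(data))            # <class 'bytes'>
```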
2025-05-28 17:22:50 +02:00
BLKSerene
7b1d6e58ff
Remove dependency on langcodes (#13760)
This PR removes the dependency on langcodes introduced in #9342.

While the introduction of langcodes allows a significantly wider range of language codes, there are some unexpected side effects:

    zh-Hant (Traditional Chinese) should be mapped to zh instead of None, as spaCy's Chinese model is based on pkuseg, which supports tokenization of both Simplified and Traditional Chinese.
    Since it is possible that spaCy may have a model for Norwegian Nynorsk in the future, mapping no (macrolanguage Norwegian) to nb (Norwegian Bokmål) might be misleading. In that case, the user should be asked to specify nb or nn (Norwegian Nynorsk) explicitly or consult the docs.
    Same as above for regional variants of languages such as en_gb and en_us.

Overall, IMHO, introducing an extra dependency just for the conversion of language codes is overkill. Most users likely only need conversion between 2- and 3-letter ISO codes, and a simple dictionary lookup should suffice.

With this PR, ISO 639-1 and ISO 639-3 codes are supported. ISO 639-2/B codes (bibliographic codes, which are not favored and are not used in ISO 639-3) and deprecated ISO 639-1/2 codes are also supported to maximize backward compatibility.
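
As an illustration of the argument above (not the PR's actual implementation), a plain dictionary lookup for 3-letter to 2-letter code conversion might look like this; the table excerpt and helper name are hypothetical:

```
# Illustrative sketch only: plain-dict ISO 639-3 -> ISO 639-1 conversion.
# The mapping is a tiny excerpt and to_iso639_1 is a hypothetical helper,
# not spaCy's actual API.
ISO_639_3_TO_1 = {
    "eng": "en",
    "deu": "de",
    "zho": "zh",
    "nob": "nb",
    "nno": "nn",
}

def to_iso639_1(code: str) -> str:
    """Return the 2-letter ISO 639-1 code if one exists, else the input unchanged."""
    return ISO_639_3_TO_1.get(code.lower(), code.lower())

print(to_iso639_1("eng"))  # "en"
print(to_iso639_1("nno"))  # "nn"
print(to_iso639_1("ht"))   # already 2-letter: returned unchanged
```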
2025-05-28 17:21:46 +02:00
Matthew Honnibal
864c2f3b51 Format 2025-05-28 17:06:11 +02:00
Matthew Honnibal
75a9d9b9ad Test and fix issue13769 2025-05-28 17:04:23 +02:00
Ilie
bec546cec0
Add TeNs plugin (#13800)
Co-authored-by: Ilie Cristian Dorobat <idorobat@cisco.com>
2025-05-27 01:21:07 +02:00
d0ngw
46613e27cf
fix: match hyphenated words to lemmas in index_table (e.g. "co-authored" -> "co-author") (#13816) 2025-05-27 01:20:26 +02:00
omahs
b205ff65e6
fix typos (#13813) 2025-05-26 16:05:29 +02:00
BLKSerene
92f1b8cdb4
Switch to typer-slim (#13759) 2025-05-26 16:03:49 +02:00
Matthew Honnibal
4b65aa79ee Add release script 2025-05-22 14:00:48 +02:00
Matthew Honnibal
d08f4e3b10 Increment version 2025-05-22 13:58:00 +02:00
Matthew Honnibal
6036f344d3 Remove print statements 2025-05-22 13:56:31 +02:00
Matthew Honnibal
5bebbf7550
Python 3.13 support (#13823)
In order to support Python 3.13, we had to migrate to Cython 3.0. This caused some tricky interactions with our Pydantic usage, because Cython 3 uses from __future__ import annotations semantics, which causes type annotations to be saved as strings.

The end result is that we can't have Language.factory decorated functions in Cython modules anymore, as the Language.factory decorator expects to inspect the signature of the functions and build a Pydantic model. If the function is implemented in Cython, an error is raised because the type is not resolved.

To address this I've moved the factory functions into a new module, spacy.pipeline.factories. I've added __getattr__ importlib hooks to the previous locations, in case anyone was importing these functions directly. The change should have no backwards compatibility implications.

Along the way I've also refactored the registration of functions for the config. Previously these ran as import-time side effects, using the registry decorator. Instead, I've created a new module, spacy.registrations. When the registry is accessed, it calls a function ensure_populated(), which causes the registrations to occur.

I've made a similar change to the Language.factory registrations in the new spacy.pipeline.factories module.

I want to remove these import-time side-effects so that we can speed up the loading time of the library, which can be especially painful on the CLI. I also find that I'm often working to track down the implementations of functions referenced by strings in the config. Having the registrations all happen in one place will make this easier.

With these changes I've fortunately avoided the need to migrate to Pydantic v2 properly --- we're still using the v1 compatibility shim. We might not be able to hold out forever though: Pydantic (reasonably) aren't actively supporting the v1 shims. I put a lot of work into v2 migration when investigating the 3.13 support, and it's definitely challenging. In any case, it's a relief that we don't have to do the v2 migration at the same time as the Cython 3.0/Python 3.13 support.
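
A rough sketch of the kind of module-level __getattr__ hook described above (PEP 562); the module and function names here are placeholders, not spaCy's actual layout:

```
# Hedged sketch of a backwards-compatible import hook via module __getattr__ (PEP 562).
# "make_my_component" and "mypackage.pipeline.factories" are placeholder names.
import importlib

_MOVED_ATTRS = {
    "make_my_component": "mypackage.pipeline.factories",
}

def __getattr__(name):
    # Called only when a normal attribute lookup on this module fails,
    # so old import paths keep working while moved functions load lazily.
    if name in _MOVED_ATTRS:
        module = importlib.import_module(_MOVED_ATTRS[name])
        return getattr(module, name)
    raise AttributeError(f"module {__name__!r} has no attribute {name!r}")
```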
2025-05-22 13:47:21 +02:00
Matthew Honnibal
911539e9a4 Update version 2025-05-18 12:18:38 +02:00
Matthew Honnibal
22c1bc785b Replace lte with lt for clarity 2025-05-18 12:18:17 +02:00
Matthew Honnibal
cb5e760e91 Fix python version supported 2025-05-18 12:17:23 +02:00
Gunther Cox
87ec2b72a5
Update spaCy Universe entry for ChatterBot to use correct name casing (#13784) 2025-05-12 07:47:50 +02:00
翟持江
aa8de0ed37
Update embeddings-transformers.mdx, update trf_data examples info in <Runtime usage> (#13811) 2025-05-12 07:47:12 +02:00
Adrien Carpentier
98a19df91a
docs: fix README.md for compatible Python versions (#13749) 2025-04-11 20:56:52 +02:00
Matthew Honnibal
92bd042502 Allow Python 3.13 2025-04-03 23:15:12 +02:00
Matthew Honnibal
d0c705cbc9 Increment version 2025-04-01 09:40:59 +02:00
Matthew Honnibal
b3c46c315e Add support for linux-arm 2025-02-03 18:32:23 +01:00
Ines Montani
d194f06437 Add live stream to site [ci skip] 2025-02-03 09:42:52 +01:00
Ines Montani
055e07d9cc Update README.md [ci skip] 2025-02-03 09:38:32 +01:00
Ines Montani
8e1c14e977 Add live stream to README [ci skip] 2025-02-03 09:37:48 +01:00
Christine P. Chai
4278182dd0
Change Twitter to X (#13740) [ci skip]
Co-authored-by: Ines Montani <ines@ines.io>
2025-02-03 09:30:21 +01:00
Matthew Honnibal
85cc763006 Fix python version requirement 2025-01-13 18:17:36 +01:00
Matthew Honnibal
ba7468e32e
Update requirements, fixing windows crashes (#13727)
* Re-enable pretraining test

* Require thinc 8.3.4

* Reformat

* Re-enable test
2025-01-13 16:39:46 +01:00
Matthew Honnibal
311f7cc9fb Set version to v3.8.4 2024-12-11 14:14:08 +01:00
Matthew Honnibal
682140496a Align requirements better 2024-12-11 14:13:51 +01:00
Matthew Honnibal
343f4f21d7 Enable Python 3.13 2024-12-11 14:13:28 +01:00
Matthew Honnibal
be0fa812c2 Update cibuildwheel 2024-12-11 13:08:40 +01:00
Matthew Honnibal
a6317b3836
Fix allocation of non-transient strings in StringStore (#13713)
* Fix a bug in the memory-zone code when adding non-transient strings. The error could result in segmentation faults or other memory errors inside memory zones if new labels were added to the model.
* Fix handling of new morphological labels within memory zones. Addresses the second issue reported in "Memory leak of MorphAnalysis object" (#13684).
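
A hedged sketch of the usage pattern this fix targets; nlp.memory_zone() and StringStore.add exist per this changeset, while the texts and label value are arbitrary:

```
# Hedged sketch: new strings/labels appearing while a memory zone is active.
import spacy

nlp = spacy.blank("en")

with nlp.memory_zone():
    for doc in nlp.pipe(["some text", "more text"]):
        # Strings first seen inside the zone are normally transient and are
        # freed when the zone closes; this fix addresses cases where
        # non-transient additions (e.g. new model labels) were mishandled.
        nlp.vocab.strings.add("NEW_LABEL")
```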
2024-12-11 13:06:53 +01:00
Ines Montani
3e30b5bef6 Add spacy-layout [ci skip] 2024-11-19 10:43:40 +01:00
Matthew Honnibal
3ecec1324c
Usage page on memory management, explaining memory zones and doc_cleaner (#13643) [ci skip]
Co-authored-by: Ines Montani <ines@ines.io>
2024-10-23 12:42:54 +02:00
Ikko Eltociear Ashimine
15fbf5ef36
docs: update rule-based-matching.mdx (#13665) [ci skip] 2024-10-23 12:07:01 +02:00
Sergei Pashakhin
1ee9a19059
Fix typo (#13657) [ci skip] 2024-10-23 12:06:36 +02:00
thjbdvlt
0d7e57fc3e
universe-pipeline-solipCysme-french (#13627) [ci skip] 2024-10-11 11:26:15 +02:00
Ines Montani
ae5c3e078d Fix universe.json [ci skip] 2024-10-11 11:24:42 +02:00
Andrei (Andrey) Khropov
8d2902b0e7
Fix misspelling (#13631) [ci skip] 2024-10-11 11:23:12 +02:00
aravind-mc
44d1906453
Update universe.json to add my spaCy online course (#13632) [ci skip] 2024-10-11 11:21:57 +02:00
sam rxh
52a4cb0d14
Fix 'issue template' link in CONTRIBUTING.md (#13587) [ci skip] 2024-10-11 11:20:34 +02:00
Ines Montani
10a6f508ab Fix landing banner links [ci skip] 2024-10-11 11:19:10 +02:00
Matthew Honnibal
bda4bb0184
Try disabling pretraining tests to probe windows ci failure (#13646) 2024-10-02 01:01:40 +02:00
Matthew Honnibal
628c973db5 Note minimum python requirement in setup.cfg 2024-10-02 00:49:09 +02:00
Matthew Honnibal
e0782c5e4c Merge branch 'master' into v3.8.x 2024-10-01 23:57:48 +02:00
Matthew Honnibal
5230754986 Fix thinc dependency 2024-10-01 23:49:17 +02:00
Matthew Honnibal
411b70f5f3 Upd requirements 2024-10-01 23:46:54 +02:00
Matthew Honnibal
08705f5a8c Upd tests 2024-10-01 22:57:25 +02:00
Matthew Honnibal
77177d0216 Upd tests workflow 2024-10-01 22:54:12 +02:00
Matthew Honnibal
5196366af5 Upd tests workflow 2024-10-01 22:53:11 +02:00
Matthew Honnibal
29232ad3b5 Upd tests workflow 2024-10-01 22:51:09 +02:00
Matthew Honnibal
dd47fbb45f Remove 'apple' extra 2024-10-01 22:24:25 +02:00
Matthew Honnibal
63f1b53c1a Check test failure 2024-10-01 16:49:49 +02:00
Matthew Honnibal
0cdcfe56cb Set version to v3.8.2 2024-10-01 16:47:24 +02:00
Matthew Honnibal
924cbc9703 Fix environment variable for test 2024-10-01 16:08:06 +02:00
Matthew Honnibal
e1d050517d Fix requirements.txt 2024-10-01 15:56:18 +02:00
Matthew Honnibal
6c038aaae0 Don't disable tests on workflow changes 2024-10-01 15:32:01 +02:00
Matthew Honnibal
f0084b9143 Fix matrix in tests 2024-10-01 15:28:22 +02:00
Matthew Honnibal
ff81bfb8db Update tests 2024-10-01 13:21:10 +02:00
Matthew Honnibal
9c5b61bdff isort 2024-10-01 12:38:51 +02:00
Matthew Honnibal
725ccbac39 Format 2024-10-01 12:38:02 +02:00
Matthew Honnibal
a8837beab7 Set version to v3.8.1 2024-10-01 12:37:11 +02:00
Matthew Honnibal
3a0aadcf86 Update spacy[apple] thinc-apple-ops pin for numpy v2 compatibility 2024-10-01 10:16:35 +02:00
DomHudson
a61a1d43cf
[Documentation] Replace broken URL in _serialization.mdx (#13641) 2024-09-30 17:45:50 +02:00
Matthew Honnibal
114b4894fb Fix --require-parent default 2024-09-29 15:50:31 +02:00
Matthew Honnibal
dec13b4258 Fix inverted cli arg 2024-09-29 15:50:05 +02:00
Matthew Honnibal
c03f060527 Allow positive option --require-parent 2024-09-29 14:30:14 +02:00
Matthew Honnibal
6255cb985f Include version constraint in parent package requirement 2024-09-29 14:22:21 +02:00
Matthew Honnibal
3b165a8716 Simplify setting to require parent package 2024-09-29 14:19:10 +02:00
Matthew Honnibal
969832f5d6 Fix package 2024-09-29 14:00:11 +02:00
Matthew Honnibal
8ce53a6bbe Syntax 2024-09-29 13:51:44 +02:00
Matthew Honnibal
6fa0d709d5 Support option to not depend on parent package in spacy package 2024-09-29 13:51:04 +02:00
Matthew Honnibal
5010fcbd3a Fix numpy constant 2024-09-14 13:13:11 +02:00
Matthew Honnibal
de4f19f3a3 Fix version 2024-09-14 13:12:44 +02:00
Matthew Honnibal
3d03565498 Replace numpy floats in evaluate and update 2024-09-14 12:55:53 +02:00
Matthew Honnibal
0576a1ff56 Fix numpy floats in meta.json 2024-09-14 12:54:08 +02:00
Matthew Honnibal
2f1e7ed09a Lint 2024-09-14 11:36:27 +02:00
Matthew Honnibal
e2dc9b79e1 Format 2024-09-14 11:29:40 +02:00
Matthew Honnibal
3c3d75015b Set version to v3.7.7 2024-09-14 11:27:32 +02:00
Matthew Honnibal
50aa3b5cbe Merge branch 'master' of https://github.com/explosion/spaCy 2024-09-14 11:09:44 +02:00
Matthew Honnibal
8266031454 Merge numpy version update 2024-09-14 11:08:35 +02:00
Matthew Honnibal
8dcc4b8daf Skip running tests on PRs 2024-09-14 11:07:23 +02:00
William Mattingly
30f1f33e78
Added Date spaCy to universe (#13415) [ci skip]
Co-authored-by: Ines Montani <ines@ines.io>
2024-09-10 14:29:03 +02:00
William Mattingly
f1a5ff9dba
added spacy whisper to universe (#13418) [ci skip]
Co-authored-by: Ines Montani <ines@ines.io>
2024-09-10 14:28:00 +02:00
William Mattingly
c80dacd046
added spacy annoy to universe (#13416) [ci skip]
Co-authored-by: Ines Montani <ines@ines.io>
2024-09-10 14:26:21 +02:00
William Mattingly
7fbbb2002a
updated universe for number spacy (#13424) [ci skip]
Co-authored-by: Ines Montani <ines@ines.io>
2024-09-10 14:25:23 +02:00
William Mattingly
89c1774d43
added bagpipes-spacy to universe (#13425) [ci skip]
Co-authored-by: Ines Montani <ines@ines.io>
2024-09-10 14:24:06 +02:00
thjbdvlt
081e4e385d
universe-project-presque (#13515) [ci skip]
Co-authored-by: Ines Montani <ines@ines.io>
2024-09-10 14:21:41 +02:00
thjbdvlt
0190e669c5
universe-package-quelquhui (#13514) [ci skip]
Co-authored-by: Ines Montani <ines@ines.io>
2024-09-10 14:17:33 +02:00
Oren Halvani
54dc4ee8fb
Added: Constituent-Treelib to: universe.json (#13432) [ci skip]
Co-authored-by: Halvani <>
2024-09-10 14:13:36 +02:00
William Mattingly
5a7ad5572c
added gliner-spacy to universe (#13417) [ci skip]
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
Co-authored-by: Ines Montani <ines@ines.io>
2024-09-10 14:12:52 +02:00
marinelay
b18cc94451
Delete unnecessary method (#13441)
Co-authored-by: marinelay <marinelay@gmail.com>
2024-09-09 20:57:13 +02:00
Matthew Honnibal
4cc3ebe74e Format 2024-09-09 20:56:01 +02:00
Matthew Honnibal
a019315534 Fix memory zones 2024-09-09 13:49:41 +02:00
Matthew Honnibal
59ac7e6bdb Format 2024-09-09 11:22:52 +02:00
Matthew Honnibal
b65491b641 Set version to v3.8.0.dev0 2024-09-09 11:20:23 +02:00
Matthew Honnibal
1b8d560d0e
Support 'memory zones' for user memory management (#13621)
Add a context manager nlp.memory_zone(), which will begin
memory_zone() blocks on the vocab, string store, and potentially
other components.

Example usage:

```
with nlp.memory_zone():
    for doc in nlp.pipe(texts):
        do_something(doc)
# do_something(doc) <-- Invalid
```

Once the memory_zone() block expires, spaCy will free any shared
resources that were allocated for the text-processing that occurred
within the memory_zone. If you create Doc objects within a memory
zone, it's invalid to access them once the memory zone is expired.

The purpose of this is that spaCy creates and stores Lexeme objects
in the Vocab that can be shared between multiple Doc objects. It also
interns strings. Normally, spaCy can't know when all Doc objects using
a Lexeme are out-of-scope, so new Lexemes accumulate in the vocab,
causing memory pressure.

Memory zones solve this problem by telling spaCy "okay none of the
documents allocated within this block will be accessed again". This
lets spaCy free all new Lexeme objects and other data that were
created during the block.

The mechanism is general, so memory_zone() context managers can be
added to other components that could benefit from them, e.g. pipeline
components.

I experimented with adding memory zone support to the tokenizer as well,
for its cache. However, this seems unnecessarily complicated. It makes
more sense to just put a limit on the cache size. This lets spaCy
benefit more from the cache's efficiency advantage, because
we can maintain a (bounded) cache even if only small batches of
documents are being processed.
2024-09-09 11:19:39 +02:00
ykyogoku
608f65ce40
add Tibetan (#13510) 2024-09-09 11:18:03 +02:00
Muzaffer Cikay
acbf2a428f
Add Kurdish Kurmanji language (#13561)
* Add Kurdish Kurmanji language

* Add lex_attrs
2024-09-09 11:15:40 +02:00
Mark Liberko
55db9c2e87
Added gd language folder (#13570)
Implemented a foundational Scottish Gaelic (gd) language option with tokenizer_exceptions and stop_words files.
2024-09-09 11:14:09 +02:00
162 changed files with 8484 additions and 1858 deletions

View File

@ -14,7 +14,7 @@ jobs:
strategy:
matrix:
# macos-13 is an intel runner, macos-14 is apple silicon
os: [ubuntu-latest, windows-latest, macos-13, macos-14]
os: [ubuntu-latest, windows-latest, macos-13, macos-14, ubuntu-24.04-arm]
steps:
- uses: actions/checkout@v4
@ -26,7 +26,7 @@ jobs:
# with:
# platforms: all
- name: Build wheels
uses: pypa/cibuildwheel@v2.19.1
uses: pypa/cibuildwheel@v2.21.3
env:
CIBW_ARCHS_LINUX: auto
with:

View File

@ -2,6 +2,8 @@ name: tests
on:
push:
tags-ignore:
- '**'
branches-ignore:
- "spacy.io"
- "nightly.spacy.io"
@ -10,7 +12,6 @@ on:
- "*.md"
- "*.mdx"
- "website/**"
- ".github/workflows/**"
pull_request:
types: [opened, synchronize, reopened, edited]
paths-ignore:
@ -30,7 +31,7 @@ jobs:
- name: Configure Python version
uses: actions/setup-python@v4
with:
python-version: "3.7"
python-version: "3.10"
- name: black
run: |
@ -44,11 +45,12 @@ jobs:
run: |
python -m pip install flake8==5.0.4
python -m flake8 spacy --count --select=E901,E999,F821,F822,F823,W605 --show-source --statistics
- name: cython-lint
run: |
python -m pip install cython-lint -c requirements.txt
# E501: line too log, W291: trailing whitespace, E266: too many leading '#' for block comment
cython-lint spacy --ignore E501,W291,E266
# Unfortunately cython-lint isn't working after the shift to Cython 3.
#- name: cython-lint
# run: |
# python -m pip install cython-lint -c requirements.txt
# # E501: line too log, W291: trailing whitespace, E266: too many leading '#' for block comment
# cython-lint spacy --ignore E501,W291,E266
tests:
name: Test
@ -57,18 +59,7 @@ jobs:
fail-fast: true
matrix:
os: [ubuntu-latest, windows-latest, macos-latest]
python_version: ["3.12"]
include:
- os: windows-latest
python_version: "3.7"
- os: macos-latest
python_version: "3.8"
- os: ubuntu-latest
python_version: "3.9"
- os: windows-latest
python_version: "3.10"
- os: macos-latest
python_version: "3.11"
python_version: ["3.9", "3.12", "3.13"]
runs-on: ${{ matrix.os }}
@ -157,7 +148,9 @@ jobs:
- name: "Test assemble CLI"
run: |
python -c "import spacy; config = spacy.util.load_config('ner.cfg'); config['components']['ner'] = {'source': 'ca_core_news_sm'}; config.to_disk('ner_source_sm.cfg')"
PYTHONWARNINGS="error,ignore::DeprecationWarning" python -m spacy assemble ner_source_sm.cfg output_dir
python -m spacy assemble ner_source_sm.cfg output_dir
env:
PYTHONWARNINGS: "error,ignore::DeprecationWarning"
if: matrix.python_version == '3.9'
- name: "Test assemble CLI vectors warning"

View File

@ -35,7 +35,7 @@ so that more people can benefit from it.
When opening an issue, use a **descriptive title** and include your
**environment** (operating system, Python version, spaCy version). Our
[issue template](https://github.com/explosion/spaCy/issues/new) helps you
[issue templates](https://github.com/explosion/spaCy/issues/new/choose) help you
remember the most important details to include. If you've discovered a bug, you
can also submit a [regression test](#fixing-bugs) straight away. When you're
opening an issue to report the bug, simply refer to your pull request in the
@ -449,8 +449,8 @@ and plugins in spaCy v3.0, and we can't wait to see what you build with it!
[`spacy`](https://github.com/topics/spacy?o=desc&s=stars) and
[`spacy-extensions`](https://github.com/topics/spacy-extension?o=desc&s=stars)
to make it easier to find. Those are also the topics we're linking to from the
spaCy website. If you're sharing your project on Twitter, feel free to tag
[@spacy_io](https://twitter.com/spacy_io) so we can check it out.
spaCy website. If you're sharing your project on X, feel free to tag
[@spacy_io](https://x.com/spacy_io) so we can check it out.
- Once your extension is published, you can open a
[PR](https://github.com/explosion/spaCy/pulls) to suggest it for the

View File

@ -4,5 +4,6 @@ include README.md
include pyproject.toml
include spacy/py.typed
recursive-include spacy/cli *.yml
recursive-include spacy/tests *.json
recursive-include licenses *
recursive-exclude spacy *.cpp

View File

@ -16,7 +16,7 @@ model packaging, deployment and workflow management. spaCy is commercial
open-source software, released under the
[MIT license](https://github.com/explosion/spaCy/blob/master/LICENSE).
💫 **Version 3.7 out now!**
💫 **Version 3.8 out now!**
[Check out the release notes here.](https://github.com/explosion/spaCy/releases)
[![tests](https://github.com/explosion/spaCy/actions/workflows/tests.yml/badge.svg)](https://github.com/explosion/spaCy/actions/workflows/tests.yml)
@ -28,7 +28,6 @@ open-source software, released under the
<br />
[![PyPi downloads](https://static.pepy.tech/personalized-badge/spacy?period=total&units=international_system&left_color=grey&right_color=orange&left_text=pip%20downloads)](https://pypi.org/project/spacy/)
[![Conda downloads](https://img.shields.io/conda/dn/conda-forge/spacy?label=conda%20downloads)](https://anaconda.org/conda-forge/spacy)
[![spaCy on Twitter](https://img.shields.io/twitter/follow/spacy_io.svg?style=social&label=Follow)](https://twitter.com/spacy_io)
## 📖 Documentation
@ -47,6 +46,7 @@ open-source software, released under the
| 👩‍🏫 **[Online Course]** | Learn spaCy in this free and interactive online course. |
| 📰 **[Blog]** | Read about current spaCy and Prodigy development, releases, talks and more from Explosion. |
| 📺 **[Videos]** | Our YouTube channel with video tutorials, talks and more. |
| 🔴 **[Live Stream]** | Join Matt as he works on spaCy and chat about NLP, live every week. |
| 🛠 **[Changelog]** | Changes and version history. |
| 💝 **[Contribute]** | How to contribute to the spaCy project and code base. |
| 👕 **[Swag]** | Support us and our work with unique, custom-designed swag! |
@ -62,6 +62,7 @@ open-source software, released under the
[universe]: https://spacy.io/universe
[spacy vs code extension]: https://github.com/explosion/spacy-vscode
[videos]: https://www.youtube.com/c/ExplosionAI
[live stream]: https://www.youtube.com/playlist?list=PLBmcuObd5An5_iAxNYLJa_xWmNzsYce8c
[online course]: https://course.spacy.io
[blog]: https://explosion.ai
[project templates]: https://github.com/explosion/projects
@ -79,13 +80,14 @@ more people can benefit from it.
| Type | Platforms |
| ------------------------------- | --------------------------------------- |
| 🚨 **Bug Reports** | [GitHub Issue Tracker] |
| 🎁 **Feature Requests & Ideas** | [GitHub Discussions] |
| 🎁 **Feature Requests & Ideas** | [GitHub Discussions] · [Live Stream] |
| 👩‍💻 **Usage Questions** | [GitHub Discussions] · [Stack Overflow] |
| 🗯 **General Discussion** | [GitHub Discussions] |
| 🗯 **General Discussion** | [GitHub Discussions] · [Live Stream] |
[github issue tracker]: https://github.com/explosion/spaCy/issues
[github discussions]: https://github.com/explosion/spaCy/discussions
[stack overflow]: https://stackoverflow.com/questions/tagged/spacy
[live stream]: https://www.youtube.com/playlist?list=PLBmcuObd5An5_iAxNYLJa_xWmNzsYce8c
## Features
@ -115,7 +117,7 @@ For detailed installation instructions, see the
- **Operating system**: macOS / OS X · Linux · Windows (Cygwin, MinGW, Visual
Studio)
- **Python version**: Python 3.7+ (only 64 bit)
- **Python version**: Python >=3.7, <3.13 (only 64 bit)
- **Package managers**: [pip] · [conda] (via `conda-forge`)
[pip]: https://pypi.org/project/spacy/

20
bin/release.sh Executable file
View File

@ -0,0 +1,20 @@
#!/usr/bin/env bash
set -e
# Insist repository is clean
git diff-index --quiet HEAD
version=$(grep "__version__ = " spacy/about.py)
version=${version/__version__ = }
version=${version/\'/}
version=${version/\'/}
version=${version/\"/}
version=${version/\"/}
echo "Pushing release-v"$version
git tag -d release-v$version || true
git push origin :release-v$version || true
git tag release-v$version
git push origin release-v$version

View File

@ -1,6 +1,2 @@
# build version constraints for use with wheelwright
numpy==1.15.0; python_version=='3.7' and platform_machine!='aarch64'
numpy==1.19.2; python_version=='3.7' and platform_machine=='aarch64'
numpy==1.17.3; python_version=='3.8' and platform_machine!='aarch64'
numpy==1.19.2; python_version=='3.8' and platform_machine=='aarch64'
numpy>=1.25.0; python_version>='3.9'
numpy>=2.0.0,<3.0.0

View File

@ -1,13 +1,12 @@
[build-system]
requires = [
"setuptools",
"cython>=0.25,<3.0",
"cython>=3.0,<4.0",
"cymem>=2.0.2,<2.1.0",
"preshed>=3.0.2,<3.1.0",
"murmurhash>=0.28.0,<1.1.0",
"thinc>=8.3.0,<8.4.0",
"numpy>=2.0.0,<2.1.0; python_version < '3.9'",
"numpy>=2.0.0,<2.1.0; python_version >= '3.9'",
"thinc>=8.3.4,<8.4.0",
"numpy>=2.0.0,<3.0.0"
]
build-backend = "setuptools.build_meta"

View File

@ -3,28 +3,26 @@ spacy-legacy>=3.0.11,<3.1.0
spacy-loggers>=1.0.0,<2.0.0
cymem>=2.0.2,<2.1.0
preshed>=3.0.2,<3.1.0
thinc>=8.2.2,<8.3.0
thinc>=8.3.4,<8.4.0
ml_datasets>=0.2.0,<0.3.0
murmurhash>=0.28.0,<1.1.0
wasabi>=0.9.1,<1.2.0
srsly>=2.4.3,<3.0.0
catalogue>=2.0.6,<2.1.0
typer>=0.3.0,<1.0.0
typer-slim>=0.3.0,<1.0.0
weasel>=0.1.0,<0.5.0
# Third party dependencies
numpy>=2.0.0; python_version < "3.9"
numpy>=2.0.0; python_version >= "3.9"
numpy>=2.0.0,<3.0.0
requests>=2.13.0,<3.0.0
tqdm>=4.38.0,<5.0.0
pydantic>=1.7.4,!=1.8,!=1.8.1,<3.0.0
jinja2
langcodes>=3.2.0,<4.0.0
# Official Python utilities
setuptools
packaging>=20.0
# Development dependencies
pre-commit>=2.13.0
cython>=0.25,<3.0
cython>=3.0,<4.0
pytest>=5.2.0,!=7.1.0
pytest-timeout>=1.3.0,<2.0.0
mock>=2.0.0,<3.0.0

View File

@ -17,12 +17,11 @@ classifiers =
Operating System :: Microsoft :: Windows
Programming Language :: Cython
Programming Language :: Python :: 3
Programming Language :: Python :: 3.7
Programming Language :: Python :: 3.8
Programming Language :: Python :: 3.9
Programming Language :: Python :: 3.10
Programming Language :: Python :: 3.11
Programming Language :: Python :: 3.12
Programming Language :: Python :: 3.13
Topic :: Scientific/Engineering
project_urls =
Release notes = https://github.com/explosion/spaCy/releases
@ -31,18 +30,18 @@ project_urls =
[options]
zip_safe = false
include_package_data = true
python_requires = >=3.7
python_requires = >=3.9,<3.14
# NOTE: This section is superseded by pyproject.toml and will be removed in
# spaCy v4
setup_requires =
cython>=0.25,<3.0
numpy>=2.0.0,<2.1.0; python_version < "3.9"
numpy>=2.0.0,<2.1.0; python_version >= "3.9"
cython>=3.0,<4.0
numpy>=2.0.0,<3.0.0; python_version < "3.9"
numpy>=2.0.0,<3.0.0; python_version >= "3.9"
# We also need our Cython packages here to compile against
cymem>=2.0.2,<2.1.0
preshed>=3.0.2,<3.1.0
murmurhash>=0.28.0,<1.1.0
thinc>=8.3.0,<8.4.0
thinc>=8.3.4,<8.4.0
install_requires =
# Our libraries
spacy-legacy>=3.0.11,<3.1.0
@ -50,13 +49,13 @@ install_requires =
murmurhash>=0.28.0,<1.1.0
cymem>=2.0.2,<2.1.0
preshed>=3.0.2,<3.1.0
thinc>=8.3.0,<8.4.0
thinc>=8.3.4,<8.4.0
wasabi>=0.9.1,<1.2.0
srsly>=2.4.3,<3.0.0
catalogue>=2.0.6,<2.1.0
weasel>=0.1.0,<0.5.0
# Third-party dependencies
typer>=0.3.0,<1.0.0
typer-slim>=0.3.0,<1.0.0
tqdm>=4.38.0,<5.0.0
numpy>=1.15.0; python_version < "3.9"
numpy>=1.19.0; python_version >= "3.9"
@ -66,7 +65,6 @@ install_requires =
# Official Python utilities
setuptools
packaging>=20.0
langcodes>=3.2.0,<4.0.0
[options.entry_points]
console_scripts =
@ -116,7 +114,7 @@ cuda12x =
cuda-autodetect =
cupy-wheel>=11.0.0,<13.0.0
apple =
thinc-apple-ops>=0.1.0.dev0,<1.0.0
thinc-apple-ops>=1.0.0,<2.0.0
# Language tokenizers with external dependencies
ja =
sudachipy>=0.5.2,!=0.6.1

View File

@ -17,6 +17,7 @@ from .cli.info import info # noqa: F401
from .errors import Errors
from .glossary import explain # noqa: F401
from .language import Language
from .registrations import REGISTRY_POPULATED, populate_registry
from .util import logger, registry # noqa: F401
from .vocab import Vocab

View File

@ -1,5 +1,5 @@
# fmt: off
__title__ = "spacy"
__version__ = "3.8.1"
__version__ = "3.8.7"
__download_url__ = "https://github.com/explosion/spacy-models/releases/download"
__compatibility__ = "https://raw.githubusercontent.com/explosion/spacy-models/master/compatibility.json"

View File

@ -170,7 +170,7 @@ def debug_model(
msg.divider(f"STEP 3 - prediction")
msg.info(str(prediction))
msg.good(f"Succesfully ended analysis - model looks good.")
msg.good(f"Successfully ended analysis - model looks good.")
def _sentences():

View File

@ -30,6 +30,7 @@ def package_cli(
version: Optional[str] = Opt(None, "--version", "-v", help="Package version to override meta"),
build: str = Opt("sdist", "--build", "-b", help="Comma-separated formats to build: sdist and/or wheel, or none."),
force: bool = Opt(False, "--force", "-f", "-F", help="Force overwriting existing data in output directory"),
require_parent: bool = Opt(True, "--require-parent/--no-require-parent", "-R", "-R", help="Include the parent package (e.g. spacy) in the requirements"),
# fmt: on
):
"""
@ -60,6 +61,7 @@ def package_cli(
create_sdist=create_sdist,
create_wheel=create_wheel,
force=force,
require_parent=require_parent,
silent=False,
)
@ -74,6 +76,7 @@ def package(
create_meta: bool = False,
create_sdist: bool = True,
create_wheel: bool = False,
require_parent: bool = False,
force: bool = False,
silent: bool = True,
) -> None:
@ -113,7 +116,7 @@ def package(
if not meta_path.exists() or not meta_path.is_file():
msg.fail("Can't load pipeline meta.json", meta_path, exits=1)
meta = srsly.read_json(meta_path)
meta = get_meta(input_dir, meta)
meta = get_meta(input_dir, meta, require_parent=require_parent)
if meta["requirements"]:
msg.good(
f"Including {len(meta['requirements'])} package requirement(s) from "
@ -186,6 +189,7 @@ def package(
imports.append(code_path.stem)
shutil.copy(str(code_path), str(package_path))
create_file(main_path / "meta.json", srsly.json_dumps(meta, indent=2))
create_file(main_path / "setup.py", TEMPLATE_SETUP)
create_file(main_path / "MANIFEST.in", TEMPLATE_MANIFEST)
init_py = TEMPLATE_INIT.format(
@ -302,6 +306,8 @@ def get_third_party_dependencies(
modules.add(func_info["module"].split(".")[0]) # type: ignore[union-attr]
dependencies = []
for module_name in modules:
if module_name == about.__title__:
continue
if module_name in distributions:
dist = distributions.get(module_name)
if dist:
@ -332,7 +338,9 @@ def create_file(file_path: Path, contents: str) -> None:
def get_meta(
model_path: Union[str, Path], existing_meta: Dict[str, Any]
model_path: Union[str, Path],
existing_meta: Dict[str, Any],
require_parent: bool = False,
) -> Dict[str, Any]:
meta: Dict[str, Any] = {
"lang": "en",
@ -361,6 +369,8 @@ def get_meta(
existing_reqs = [util.split_requirement(req)[0] for req in meta["requirements"]]
reqs = get_third_party_dependencies(nlp.config, exclude=existing_reqs)
meta["requirements"].extend(reqs)
if require_parent and about.__title__ not in meta["requirements"]:
meta["requirements"].append(about.__title__ + meta["spacy_version"])
return meta
@ -535,8 +545,11 @@ def list_files(data_dir):
def list_requirements(meta):
parent_package = meta.get('parent_package', 'spacy')
requirements = [parent_package + meta['spacy_version']]
# Up to version 3.7, we included the parent package
# in requirements by default. This behaviour is removed
# in 3.8, with a setting to include the parent package in
# the requirements list in the meta if desired.
requirements = []
if 'setup_requires' in meta:
requirements += meta['setup_requires']
if 'requirements' in meta:

16
spacy/lang/bo/__init__.py Normal file
View File

@ -0,0 +1,16 @@
from ...language import BaseDefaults, Language
from .lex_attrs import LEX_ATTRS
from .stop_words import STOP_WORDS
class TibetanDefaults(BaseDefaults):
lex_attr_getters = LEX_ATTRS
stop_words = STOP_WORDS
class Tibetan(Language):
lang = "bo"
Defaults = TibetanDefaults
__all__ = ["Tibetan"]

16
spacy/lang/bo/examples.py Normal file
View File

@ -0,0 +1,16 @@
"""
Example sentences to test spaCy and its language models.
>>> from spacy.lang.bo.examples import sentences
>>> docs = nlp.pipe(sentences)
"""
sentences = [
"དོན་དུ་རྒྱ་མཚོ་བླ་མ་ཞེས་བྱ་ཞིང༌།",
"ཏཱ་ལའི་ཞེས་པ་ནི་སོག་སྐད་ཡིན་པ་དེ་བོད་སྐད་དུ་རྒྱ་མཚོའི་དོན་དུ་འཇུག",
"སོག་པོ་ཨལ་ཐན་རྒྱལ་པོས་རྒྱལ་དབང་བསོད་ནམས་རྒྱ་མཚོར་ཆེ་བསྟོད་ཀྱི་མཚན་གསོལ་བ་ཞིག་ཡིན་ཞིང༌།",
"རྗེས་སུ་རྒྱལ་བ་དགེ་འདུན་གྲུབ་དང༌། དགེ་འདུན་རྒྱ་མཚོ་སོ་སོར་ཡང་ཏཱ་ལའི་བླ་མའི་སྐུ་ཕྲེང་དང་པོ་དང༌།",
"གཉིས་པའི་མཚན་དེ་གསོལ་ཞིང༌།༸རྒྱལ་དབང་སྐུ་ཕྲེང་ལྔ་པས་དགའ་ལྡན་ཕོ་བྲང་གི་སྲིད་དབང་བཙུགས་པ་ནས་ཏཱ་ལའི་བླ་མ་ནི་བོད་ཀྱི་ཆོས་སྲིད་གཉིས་ཀྱི་དབུ་ཁྲིད་དུ་གྱུར་ཞིང་།",
"ད་ལྟའི་བར་ཏཱ་ལའི་བླ་མ་སྐུ་ཕྲེང་བཅུ་བཞི་བྱོན་ཡོད།",
]

View File

@ -0,0 +1,65 @@
from ...attrs import LIKE_NUM
# reference 1: https://en.wikipedia.org/wiki/Tibetan_numerals
_num_words = [
"ཀླད་ཀོར་",
"གཅིག་",
"གཉིས་",
"གསུམ་",
"བཞི་",
"ལྔ་",
"དྲུག་",
"བདུན་",
"བརྒྱད་",
"དགུ་",
"བཅུ་",
"བཅུ་གཅིག་",
"བཅུ་གཉིས་",
"བཅུ་གསུམ་",
"བཅུ་བཞི་",
"བཅུ་ལྔ་",
"བཅུ་དྲུག་",
"བཅུ་བདུན་",
"བཅུ་པརྒྱད",
"བཅུ་དགུ་",
"ཉི་ཤུ་",
"སུམ་ཅུ",
"བཞི་བཅུ",
"ལྔ་བཅུ",
"དྲུག་ཅུ",
"བདུན་ཅུ",
"བརྒྱད་ཅུ",
"དགུ་བཅུ",
"བརྒྱ་",
"སྟོང་",
"ཁྲི་",
"ས་ཡ་",
" བྱེ་བ་",
"དུང་ཕྱུར་",
"ཐེར་འབུམ་",
"ཐེར་འབུམ་ཆེན་པོ་",
"ཁྲག་ཁྲིག་",
"ཁྲག་ཁྲིག་ཆེན་པོ་",
]
def like_num(text):
"""
Check if text resembles a number
"""
if text.startswith(("+", "-", "±", "~")):
text = text[1:]
text = text.replace(",", "").replace(".", "")
if text.isdigit():
return True
if text.count("/") == 1:
num, denom = text.split("/")
if num.isdigit() and denom.isdigit():
return True
if text in _num_words:
return True
return False
LEX_ATTRS = {LIKE_NUM: like_num}

198
spacy/lang/bo/stop_words.py Normal file
View File

@ -0,0 +1,198 @@
# Source: https://zenodo.org/records/10148636
STOP_WORDS = set(
"""
གས
མས
འད
པས
གཞན
དང
གས
བཅས
ངས
ལས
ཙམ
ཡང
མཐའདག
འད
རང
ངམ
དག
འང
ལགས
ཚང
ཐམསཅད
དམ
འམ
བས
ལགས
གས
མས
བམ
ནམ
ནམ
ངམ
འགའ
ཤས
གམ
ལགས
ཅང
འགའ
སམ
འང
ལས
འཕ
བར
དང
འག
སམ
ཟད
འམ
མམ
དམ
དག
ལམ
ནང
ཙམ
རམ
ཨང
གས
ལགས
པས
རབ
རམ
བས
གཞན
འབའ
གམ
བམ
ཙམ
མམ
ཏམ
ཏམ
ཤས
""".split()
)

18
spacy/lang/gd/__init__.py Normal file
View File

@ -0,0 +1,18 @@
from typing import Optional
from ...language import BaseDefaults, Language
from .stop_words import STOP_WORDS
from .tokenizer_exceptions import TOKENIZER_EXCEPTIONS
class ScottishDefaults(BaseDefaults):
tokenizer_exceptions = TOKENIZER_EXCEPTIONS
stop_words = STOP_WORDS
class Scottish(Language):
lang = "gd"
Defaults = ScottishDefaults
__all__ = ["Scottish"]

388
spacy/lang/gd/stop_words.py Normal file
View File

@ -0,0 +1,388 @@
STOP_WORDS = set(
"""
'ad
'ar
'd # iad
'g # ag
'ga
'gam
'gan
'gar
'gur
'm # am
'n # an
'n seo
'na
'nad
'nam
'nan
'nar
'nuair
'nur
's
'sa
'san
'sann
'se
'sna
a
a'
a'd # agad
a'm # agam
a-chèile
a-seo
a-sin
a-siud
a chionn
a chionn 's
a chèile
a chéile
a dh'
a h-uile
a seo
ac' # aca
aca
aca-san
acasan
ach
ag
agad
agad-sa
agads'
agadsa
agaibh
agaibhse
againn
againne
agam
agam-sa
agams'
agamsa
agus
aice
aice-se
aicese
aig
aig' # aige
aige
aige-san
aigesan
air
air-san
air neo
airsan
am
an
an seo
an sin
an siud
an uair
ann
ann a
ann a'
ann a shin
ann am
ann an
annad
annam
annam-s'
annamsa
anns
anns an
annta
aon
ar
as
asad
asda
asta
b'
bho
bhon
bhuaidhe # bhuaithe
bhuainn
bhuaipe
bhuaithe
bhuapa
bhur
brì
bu
c'à
car son
carson
cha
chan
chionn
choir
chon
chun
chèile
chéile
chòir
cia mheud
ciamar
co-dhiubh
cuide
cuin
cuin'
cuine
'
càil
càit
càit'
càite
mheud
d'
da
de
dh'
dha
dhaibh
dhaibh-san
dhaibhsan
dhan
dhasan
dhe
dhen
dheth
dhi
dhiom
dhiot
dhith
dhiubh
dhomh
dhomh-s'
dhomhsa
dhu'sa # dhut-sa
dhuibh
dhuibhse
dhuinn
dhuinne
dhuit
dhut
dhutsa
dhut-sa
dhà
dhà-san
dhàsan
dhòmhsa
diubh
do
docha
don
mar
mar
dòch'
dòcha
e
eadar
eatarra
eatorra
eile
esan
fa
far
feud
fhad
fheudar
fhearr
fhein
fheudar
fheàrr
fhèin
fhéin
fhìn
fo
fodha
fodhainn
foipe
fon
fèin
ga
gach
gam
gan
ge brith
ged
gu
gu
gu ruige
gun
gur
gus
i
iad
iadsan
innte
is
ise
le
leam
leam-sa
leamsa
leat
leat-sa
leatha
leatsa
leibh
leis
leis-san
leoth'
leotha
leotha-san
linn
m'
m'a
ma
mac
man
mar
mas
mathaid
mi
mis'
mise
mo
mu
mu 'n
mun
mur
mura
mus
na
na b'
na bu
na iad
nach
nad
nam
nan
nar
nas
neo
no
nuair
o
o'n
oir
oirbh
oirbh-se
oirnn
oirnne
oirre
on
orm
orm-sa
ormsa
orra
orra-san
orrasan
ort
os
r'
ri
ribh
rinn
ris
rithe
rithe-se
rium
rium-sa
riums'
riumsa
riut
riuth'
riutha
riuthasan
ro
ro'n
roimh
roimhe
romhainn
romham
romhpa
ron
ruibh
ruinn
ruinne
sa
san
sann
se
seach
seo
seothach
shin
sibh
sibh-se
sibhse
sin
sineach
sinn
sinne
siod
siodach
siud
siudach
sna # ann an
t'
tarsaing
tarsainn
tarsuinn
thar
thoigh
thro
thu
thuc'
thuca
thugad
thugaibh
thugainn
thugam
thugamsa
thuice
thuige
thus'
thusa
timcheall
toigh
toil
tro
tro' # troimh
troimh
troimhe
tron
tu
tusa
uair
ud
ugaibh
ugam-s'
ugam-sa
uice
uige
uige-san
umad
unnta # ann an
ur
urrainn
à
às
àsan
á
ás
è
ì
ò
ó
""".split(
"\n"
)
)

File diff suppressed because it is too large

View File

@ -1,5 +1,5 @@
The list of Croatian lemmas was extracted from the reldi-tagger repository (https://github.com/clarinsi/reldi-tagger).
Reldi-tagger is licesned under the Apache 2.0 licence.
Reldi-tagger is licensed under the Apache 2.0 licence.
@InProceedings{ljubesic16-new,
author = {Nikola Ljubešić and Filip Klubička and Željko Agić and Ivo-Pavao Jazbec},
@ -12,4 +12,4 @@ Reldi-tagger is licesned under the Apache 2.0 licence.
publisher = {European Language Resources Association (ELRA)},
address = {Paris, France},
isbn = {978-2-9517408-9-1}
}
}

52
spacy/lang/ht/__init__.py Normal file
View File

@ -0,0 +1,52 @@
from typing import Callable, Optional
from thinc.api import Model
from ...language import BaseDefaults, Language
from .lemmatizer import HaitianCreoleLemmatizer
from .lex_attrs import LEX_ATTRS
from .punctuation import TOKENIZER_PREFIXES, TOKENIZER_INFIXES, TOKENIZER_SUFFIXES
from .stop_words import STOP_WORDS
from .syntax_iterators import SYNTAX_ITERATORS
from .tokenizer_exceptions import TOKENIZER_EXCEPTIONS
from .tag_map import TAG_MAP
class HaitianCreoleDefaults(BaseDefaults):
tokenizer_exceptions = TOKENIZER_EXCEPTIONS
prefixes = TOKENIZER_PREFIXES
infixes = TOKENIZER_INFIXES
suffixes = TOKENIZER_SUFFIXES
lex_attr_getters = LEX_ATTRS
syntax_iterators = SYNTAX_ITERATORS
stop_words = STOP_WORDS
tag_map = TAG_MAP
class HaitianCreole(Language):
lang = "ht"
Defaults = HaitianCreoleDefaults
@HaitianCreole.factory(
"lemmatizer",
assigns=["token.lemma"],
default_config={
"model": None,
"mode": "rule",
"overwrite": False,
"scorer": {"@scorers": "spacy.lemmatizer_scorer.v1"},
},
default_score_weights={"lemma_acc": 1.0},
)
def make_lemmatizer(
nlp: Language,
model: Optional[Model],
name: str,
mode: str,
overwrite: bool,
scorer: Optional[Callable],
):
return HaitianCreoleLemmatizer(
nlp.vocab, model, name, mode=mode, overwrite=overwrite, scorer=scorer
)
__all__ = ["HaitianCreole"]

18
spacy/lang/ht/examples.py Normal file
View File

@ -0,0 +1,18 @@
"""
Example sentences to test spaCy and its language models.
>>> from spacy.lang.ht.examples import sentences
>>> docs = nlp.pipe(sentences)
"""
sentences = [
"Apple ap panse achte yon demaraj nan Wayòm Ini pou $1 milya dola",
"Machin otonòm fè responsablite asirans lan ale sou men fabrikan yo",
"San Francisco ap konsidere entèdi robo ki livre sou twotwa yo",
"Lond se yon gwo vil nan Wayòm Ini",
"Kote ou ye?",
"Kilès ki prezidan Lafrans?",
"Ki kapital Etazini?",
"Kile Barack Obama te fèt?",
]

View File

@ -0,0 +1,51 @@
from typing import List, Tuple
from ...pipeline import Lemmatizer
from ...tokens import Token
from ...lookups import Lookups
class HaitianCreoleLemmatizer(Lemmatizer):
"""
Minimal Haitian Creole lemmatizer.
Returns a word's base form based on rules and lookup,
or defaults to the original form.
"""
def is_base_form(self, token: Token) -> bool:
morph = token.morph.to_dict()
upos = token.pos_.lower()
# Consider unmarked forms to be base
if upos in {"noun", "verb", "adj", "adv"}:
if not morph:
return True
if upos == "noun" and morph.get("Number") == "Sing":
return True
if upos == "verb" and morph.get("VerbForm") == "Inf":
return True
if upos == "adj" and morph.get("Degree") == "Pos":
return True
return False
def rule_lemmatize(self, token: Token) -> List[str]:
string = token.text.lower()
pos = token.pos_.lower()
cache_key = (token.orth, token.pos)
if cache_key in self.cache:
return self.cache[cache_key]
forms = []
# fallback rule: just return lowercased form
forms.append(string)
self.cache[cache_key] = forms
return forms
@classmethod
def get_lookups_config(cls, mode: str) -> Tuple[List[str], List[str]]:
if mode == "rule":
required = ["lemma_lookup", "lemma_rules", "lemma_exc", "lemma_index"]
return (required, [])
return super().get_lookups_config(mode)

View File

@ -0,0 +1,78 @@
from ...attrs import LIKE_NUM, NORM
# Cardinal numbers in Creole
_num_words = set(
"""
zewo youn en de twa kat senk sis sèt uit nèf dis
onz douz trèz katoz kenz sèz disèt dizwit diznèf
vent trant karant sinkant swasant swasann-dis
san mil milyon milya
""".split()
)
# Ordinal numbers in Creole (some are French-influenced, some simplified)
_ordinal_words = set(
"""
premye dezyèm twazyèm katryèm senkyèm sizyèm sètvyèm uitvyèm nèvyèm dizyèm
onzèm douzyèm trèzyèm katozyèm kenzèm sèzyèm disetyèm dizwityèm diznèvyèm
ventyèm trantyèm karantyèm sinkantyèm swasantyèm
swasann-disyèm santyèm milyèm milyonnyèm milyadyèm
""".split()
)
NORM_MAP = {
"'m": "mwen",
"'w": "ou",
"'l": "li",
"'n": "nou",
"'y": "yo",
"m": "mwen",
"w": "ou",
"l": "li",
"n": "nou",
"y": "yo",
"m": "mwen",
"n": "nou",
"l": "li",
"y": "yo",
"w": "ou",
"t": "te",
"k": "ki",
"p": "pa",
"M": "Mwen",
"N": "Nou",
"L": "Li",
"Y": "Yo",
"W": "Ou",
"T": "Te",
"K": "Ki",
"P": "Pa",
}
def like_num(text):
text = text.strip().lower()
if text.startswith(("+", "-", "±", "~")):
text = text[1:]
text = text.replace(",", "").replace(".", "")
if text.isdigit():
return True
if text.count("/") == 1:
num, denom = text.split("/")
if num.isdigit() and denom.isdigit():
return True
if text in _num_words:
return True
if text in _ordinal_words:
return True
# Handle things like "3yèm", "10yèm", "25yèm", etc.
if text.endswith("yèm") and text[:-3].isdigit():
return True
return False
def norm_custom(text):
return NORM_MAP.get(text, text.lower())
LEX_ATTRS = {
LIKE_NUM: like_num,
NORM: norm_custom,
}

View File

@ -0,0 +1,43 @@
from ..char_classes import (
ALPHA,
ALPHA_LOWER,
ALPHA_UPPER,
CONCAT_QUOTES,
HYPHENS,
LIST_PUNCT,
LIST_QUOTES,
LIST_ELLIPSES,
LIST_ICONS,
merge_chars,
)
ELISION = "'".replace(" ", "")
_prefixes_elision = "m n l y t k w"
_prefixes_elision += " " + _prefixes_elision.upper()
TOKENIZER_PREFIXES = LIST_PUNCT + LIST_QUOTES + [
r"(?:({pe})[{el}])(?=[{a}])".format(
a=ALPHA, el=ELISION, pe=merge_chars(_prefixes_elision)
)
]
TOKENIZER_SUFFIXES = LIST_PUNCT + LIST_QUOTES + LIST_ELLIPSES + [
r"(?<=[0-9])%", # numbers like 10%
r"(?<=[0-9])(?:{h})".format(h=HYPHENS), # hyphens after numbers
r"(?<=[{a}])[']".format(a=ALPHA), # apostrophes after letters
r"(?<=[{a}])['][mwlnytk](?=\s|$)".format(a=ALPHA), # contractions
r"(?<=[{a}0-9])\)", # right parenthesis after letter/number
r"(?<=[{a}])\.(?=\s|$)".format(a=ALPHA), # period after letter if space or end of string
r"(?<=\))[\.\?!]", # punctuation immediately after right parenthesis
]
TOKENIZER_INFIXES = LIST_ELLIPSES + LIST_ICONS + [
r"(?<=[0-9])[+\-\*^](?=[0-9-])",
r"(?<=[{al}{q}])\.(?=[{au}{q}])".format(
al=ALPHA_LOWER, au=ALPHA_UPPER, q=CONCAT_QUOTES
),
r"(?<=[{a}]),(?=[{a}])".format(a=ALPHA),
r"(?<=[{a}0-9])(?:{h})(?=[{a}])".format(a=ALPHA, h=HYPHENS),
r"(?<=[{a}][{el}])(?=[{a}])".format(a=ALPHA, el=ELISION),
]

View File

@ -0,0 +1,50 @@
STOP_WORDS = set(
"""
a ak an ankò ant apre ap atò avan avanlè
byen byenke
chak
de depi deja deja
e en epi èske
fòk
gen genyen
ki kisa kilès kote koukou konsa konbyen konn konnen kounye kouman
la l laa le li lye
m m' mwen
nan nap nou n'
ou oumenm
pa paske pami pandan pito pou pral preske pwiske
se selman si sou sòt
ta tap tankou te toujou tou tan tout toutotan twòp tèl
w w' wi wè
y y' yo yon yonn
non o oh eh
sa san si swa si
men mèsi oswa osinon
"""
.split()
)
# Add common contractions, with and without apostrophe variants
contractions = ["m'", "n'", "w'", "y'", "l'", "t'", "k'"]
for apostrophe in ["'", "", ""]:
for word in contractions:
STOP_WORDS.add(word.replace("'", apostrophe))

View File

@ -0,0 +1,74 @@
from typing import Iterator, Tuple, Union
from ...errors import Errors
from ...symbols import NOUN, PRON, PROPN
from ...tokens import Doc, Span
def noun_chunks(doclike: Union[Doc, Span]) -> Iterator[Tuple[int, int, int]]:
"""
Detect base noun phrases from a dependency parse for Haitian Creole.
Works on both Doc and Span objects.
"""
# Core nominal dependencies common in Haitian Creole
labels = [
"nsubj",
"obj",
"obl",
"nmod",
"appos",
"ROOT",
]
# Modifiers to optionally include in chunk (to the right)
post_modifiers = ["compound", "flat", "flat:name", "fixed"]
doc = doclike.doc
if not doc.has_annotation("DEP"):
raise ValueError(Errors.E029)
np_deps = {doc.vocab.strings.add(label) for label in labels}
np_mods = {doc.vocab.strings.add(mod) for mod in post_modifiers}
conj_label = doc.vocab.strings.add("conj")
np_label = doc.vocab.strings.add("NP")
adp_pos = doc.vocab.strings.add("ADP")
cc_pos = doc.vocab.strings.add("CCONJ")
prev_end = -1
for i, word in enumerate(doclike):
if word.pos not in (NOUN, PROPN, PRON):
continue
if word.left_edge.i <= prev_end:
continue
if word.dep in np_deps:
right_end = word
# expand to include known modifiers to the right
for child in word.rights:
if child.dep in np_mods:
right_end = child.right_edge
elif child.pos == NOUN:
right_end = child.right_edge
left_index = word.left_edge.i
# Skip prepositions at the start
if word.left_edge.pos == adp_pos:
left_index += 1
prev_end = right_end.i
yield left_index, right_end.i + 1, np_label
elif word.dep == conj_label:
head = word.head
while head.dep == conj_label and head.head.i < head.i:
head = head.head
if head.dep in np_deps:
left_index = word.left_edge.i
if word.left_edge.pos == cc_pos:
left_index += 1
prev_end = word.i
yield left_index, word.i + 1, np_label
SYNTAX_ITERATORS = {"noun_chunks": noun_chunks}

21
spacy/lang/ht/tag_map.py Normal file
View File

@ -0,0 +1,21 @@
from spacy.symbols import NOUN, VERB, AUX, ADJ, ADV, PRON, DET, ADP, SCONJ, CCONJ, PART, INTJ, NUM, PROPN, PUNCT, SYM, X
TAG_MAP = {
"NOUN": {"pos": NOUN},
"VERB": {"pos": VERB},
"AUX": {"pos": AUX},
"ADJ": {"pos": ADJ},
"ADV": {"pos": ADV},
"PRON": {"pos": PRON},
"DET": {"pos": DET},
"ADP": {"pos": ADP},
"SCONJ": {"pos": SCONJ},
"CCONJ": {"pos": CCONJ},
"PART": {"pos": PART},
"INTJ": {"pos": INTJ},
"NUM": {"pos": NUM},
"PROPN": {"pos": PROPN},
"PUNCT": {"pos": PUNCT},
"SYM": {"pos": SYM},
"X": {"pos": X},
}

View File

@ -0,0 +1,121 @@
from spacy.symbols import ORTH, NORM
def make_variants(base, first_norm, second_orth, second_norm):
return {
base: [
{ORTH: base.split("'")[0] + "'", NORM: first_norm},
{ORTH: second_orth, NORM: second_norm},
],
base.capitalize(): [
{ORTH: base.split("'")[0].capitalize() + "'", NORM: first_norm.capitalize()},
{ORTH: second_orth, NORM: second_norm},
]
}
TOKENIZER_EXCEPTIONS = {
"Dr.": [{ORTH: "Dr."}]
}
# Apostrophe forms
TOKENIZER_EXCEPTIONS.update(make_variants("m'ap", "mwen", "ap", "ap"))
TOKENIZER_EXCEPTIONS.update(make_variants("n'ap", "nou", "ap", "ap"))
TOKENIZER_EXCEPTIONS.update(make_variants("l'ap", "li", "ap", "ap"))
TOKENIZER_EXCEPTIONS.update(make_variants("y'ap", "yo", "ap", "ap"))
TOKENIZER_EXCEPTIONS.update(make_variants("m'te", "mwen", "te", "te"))
TOKENIZER_EXCEPTIONS.update(make_variants("m'pral", "mwen", "pral", "pral"))
TOKENIZER_EXCEPTIONS.update(make_variants("w'ap", "ou", "ap", "ap"))
TOKENIZER_EXCEPTIONS.update(make_variants("k'ap", "ki", "ap", "ap"))
TOKENIZER_EXCEPTIONS.update(make_variants("p'ap", "pa", "ap", "ap"))
TOKENIZER_EXCEPTIONS.update(make_variants("t'ap", "te", "ap", "ap"))
# Non-apostrophe contractions (with capitalized variants)
TOKENIZER_EXCEPTIONS.update({
"map": [
{ORTH: "m", NORM: "mwen"},
{ORTH: "ap", NORM: "ap"},
],
"Map": [
{ORTH: "M", NORM: "Mwen"},
{ORTH: "ap", NORM: "ap"},
],
"lem": [
{ORTH: "le", NORM: "le"},
{ORTH: "m", NORM: "mwen"},
],
"Lem": [
{ORTH: "Le", NORM: "Le"},
{ORTH: "m", NORM: "mwen"},
],
"lew": [
{ORTH: "le", NORM: "le"},
{ORTH: "w", NORM: "ou"},
],
"Lew": [
{ORTH: "Le", NORM: "Le"},
{ORTH: "w", NORM: "ou"},
],
"nap": [
{ORTH: "n", NORM: "nou"},
{ORTH: "ap", NORM: "ap"},
],
"Nap": [
{ORTH: "N", NORM: "Nou"},
{ORTH: "ap", NORM: "ap"},
],
"lap": [
{ORTH: "l", NORM: "li"},
{ORTH: "ap", NORM: "ap"},
],
"Lap": [
{ORTH: "L", NORM: "Li"},
{ORTH: "ap", NORM: "ap"},
],
"yap": [
{ORTH: "y", NORM: "yo"},
{ORTH: "ap", NORM: "ap"},
],
"Yap": [
{ORTH: "Y", NORM: "Yo"},
{ORTH: "ap", NORM: "ap"},
],
"mte": [
{ORTH: "m", NORM: "mwen"},
{ORTH: "te", NORM: "te"},
],
"Mte": [
{ORTH: "M", NORM: "Mwen"},
{ORTH: "te", NORM: "te"},
],
"mpral": [
{ORTH: "m", NORM: "mwen"},
{ORTH: "pral", NORM: "pral"},
],
"Mpral": [
{ORTH: "M", NORM: "Mwen"},
{ORTH: "pral", NORM: "pral"},
],
"wap": [
{ORTH: "w", NORM: "ou"},
{ORTH: "ap", NORM: "ap"},
],
"Wap": [
{ORTH: "W", NORM: "Ou"},
{ORTH: "ap", NORM: "ap"},
],
"kap": [
{ORTH: "k", NORM: "ki"},
{ORTH: "ap", NORM: "ap"},
],
"Kap": [
{ORTH: "K", NORM: "Ki"},
{ORTH: "ap", NORM: "ap"},
],
"tap": [
{ORTH: "t", NORM: "te"},
{ORTH: "ap", NORM: "ap"},
],
"Tap": [
{ORTH: "T", NORM: "Te"},
{ORTH: "ap", NORM: "ap"},
],
})

View File

@ -32,7 +32,6 @@ split_mode = null
"""
@registry.tokenizers("spacy.ja.JapaneseTokenizer")
def create_tokenizer(split_mode: Optional[str] = None):
def japanese_tokenizer_factory(nlp):
return JapaneseTokenizer(nlp.vocab, split_mode=split_mode)

View File

@ -0,0 +1,16 @@
from ...language import BaseDefaults, Language
from .lex_attrs import LEX_ATTRS
from .stop_words import STOP_WORDS
class KurmanjiDefaults(BaseDefaults):
stop_words = STOP_WORDS
lex_attr_getters = LEX_ATTRS
class Kurmanji(Language):
lang = "kmr"
Defaults = KurmanjiDefaults
__all__ = ["Kurmanji"]

View File

@ -0,0 +1,17 @@
"""
Example sentences to test spaCy and its language models.
>>> from spacy.lang.kmr.examples import sentences
>>> docs = nlp.pipe(sentences)
"""
sentences = [
"Berê mirovan her tim li geşedana pêşerojê ye", # People's gaze is always on the development of the future
"Kawa Nemir di 14 salan de Ulysses wergerand Kurmancî.", # Kawa Nemir translated Ulysses into Kurmanji in 14 years.
"Mem Ararat hunermendekî Kurd yê bi nav û deng e.", # Mem Ararat is a famous Kurdish artist
"Firat Cewerî 40 sal e pirtûkên Kurdî dinivîsîne.", # Firat Ceweri has been writing Kurdish books for 40 years
"Rojnamegerê ciwan nûçeyeke balkêş li ser rewşa aborî nivîsand", # The young journalist wrote an interesting news article about the economic situation
"Sektora çandiniyê beşeke giring a belavkirina gaza serayê li seranserê cîhanê pêk tîne", # The agricultural sector constitutes an important part of greenhouse gas emissions worldwide
"Xwendekarên jêhatî di pêşbaziya matematîkê de serkeftî bûn", # Talented students succeeded in the mathematics competition
"Ji ber ji tunebûnê bavê min xwişkeke min nedan xwendin ew ji min re bû derd û kulek.", # Because of poverty, my father didn't send my sister to school, which became a pain and sorrow for me
]

138
spacy/lang/kmr/lex_attrs.py Normal file
View File

@ -0,0 +1,138 @@
from ...attrs import LIKE_NUM
_num_words = [
"sifir",
"yek",
"du",
"",
"çar",
"pênc",
"şeş",
"heft",
"heşt",
"neh",
"deh",
"yazde",
"dazde",
"sêzde",
"çarde",
"pazde",
"şazde",
"hevde",
"hejde",
"nozde",
"bîst",
"",
"çil",
"pêncî",
"şêst",
"heftê",
"heştê",
"nod",
"sed",
"hezar",
"milyon",
"milyar",
]
_ordinal_words = [
"yekem",
"yekemîn",
"duyem",
"duyemîn",
"sêyem",
"sêyemîn",
"çarem",
"çaremîn",
"pêncem",
"pêncemîn",
"şeşem",
"şeşemîn",
"heftem",
"heftemîn",
"heştem",
"heştemîn",
"nehem",
"nehemîn",
"dehem",
"dehemîn",
"yazdehem",
"yazdehemîn",
"dazdehem",
"dazdehemîn",
"sêzdehem",
"sêzdehemîn",
"çardehem",
"çardehemîn",
"pazdehem",
"pazdehemîn",
"şanzdehem",
"şanzdehemîn",
"hevdehem",
"hevdehemîn",
"hejdehem",
"hejdehemîn",
"nozdehem",
"nozdehemîn",
"bîstem",
"bîstemîn",
"sîyem",
"sîyemîn",
"çilem",
"çilemîn",
"pêncîyem",
"pênciyemîn",
"şêstem",
"şêstemîn",
"heftêyem",
"heftêyemîn",
"heştêyem",
"heştêyemîn",
"notem",
"notemîn",
"sedem",
"sedemîn",
"hezarem",
"hezaremîn",
"milyonem",
"milyonemîn",
"milyarem",
"milyaremîn",
]
def like_num(text):
if text.startswith(("+", "-", "±", "~")):
text = text[1:]
text = text.replace(",", "").replace(".", "")
if text.isdigit():
return True
if text.count("/") == 1:
num, denom = text.split("/")
if num.isdigit() and denom.isdigit():
return True
text_lower = text.lower()
if text_lower in _num_words:
return True
# Check ordinal number
if text_lower in _ordinal_words:
return True
if is_digit(text_lower):
return True
return False
def is_digit(text):
endings = ("em", "yem", "emîn", "yemîn")
for ending in endings:
to = len(ending)
if text.endswith(ending) and text[:-to].isdigit():
return True
return False
LEX_ATTRS = {LIKE_NUM: like_num}

View File

@ -0,0 +1,44 @@
STOP_WORDS = set(
"""
û
li
bi
di
da
de
ji
ku
ew
ez
tu
em
hûn
ew
ev
min
te
me
we
wan
va
çi
çawa
çima
kengî
li ku
çend
çiqas
her
hin
gelek
hemû
kes
tişt
""".split()
)

View File

@ -20,7 +20,6 @@ DEFAULT_CONFIG = """
"""
@registry.tokenizers("spacy.ko.KoreanTokenizer")
def create_tokenizer():
def korean_tokenizer_factory(nlp):
return KoreanTokenizer(nlp.vocab)

View File

@ -24,12 +24,6 @@ class MacedonianDefaults(BaseDefaults):
tokenizer_exceptions = update_exc(BASE_EXCEPTIONS, TOKENIZER_EXCEPTIONS)
stop_words = STOP_WORDS
@classmethod
def create_lemmatizer(cls, nlp=None, lookups=None):
if lookups is None:
lookups = Lookups()
return MacedonianLemmatizer(lookups)
class Macedonian(Language):
lang = "mk"

View File

@ -13,7 +13,6 @@ DEFAULT_CONFIG = """
"""
@registry.tokenizers("spacy.th.ThaiTokenizer")
def create_thai_tokenizer():
def thai_tokenizer_factory(nlp):
return ThaiTokenizer(nlp.vocab)

View File

@ -22,7 +22,6 @@ use_pyvi = true
"""
@registry.tokenizers("spacy.vi.VietnameseTokenizer")
def create_vietnamese_tokenizer(use_pyvi: bool = True):
def vietnamese_tokenizer_factory(nlp):
return VietnameseTokenizer(nlp.vocab, use_pyvi=use_pyvi)

View File

@ -46,7 +46,6 @@ class Segmenter(str, Enum):
return list(cls.__members__.keys())
@registry.tokenizers("spacy.zh.ChineseTokenizer")
def create_chinese_tokenizer(segmenter: Segmenter = Segmenter.char):
def chinese_tokenizer_factory(nlp):
return ChineseTokenizer(nlp.vocab, segmenter=segmenter)

View File

@ -5,7 +5,7 @@ import multiprocessing as mp
import random
import traceback
import warnings
from contextlib import contextmanager
from contextlib import ExitStack, contextmanager
from copy import deepcopy
from dataclasses import dataclass
from itertools import chain, cycle
@ -30,8 +30,11 @@ from typing import (
overload,
)
import numpy
import srsly
from cymem.cymem import Pool
from thinc.api import Config, CupyOps, Optimizer, get_current_ops
from thinc.util import convert_recursive
from . import about, ty, util
from .compat import Literal
@ -101,7 +104,6 @@ class BaseDefaults:
writing_system = {"direction": "ltr", "has_case": True, "has_letters": True}
@registry.tokenizers("spacy.Tokenizer.v1")
def create_tokenizer() -> Callable[["Language"], Tokenizer]:
"""Registered function to create a tokenizer. Returns a factory that takes
the nlp object and returns a Tokenizer instance using the language detaults.
@ -127,7 +129,6 @@ def create_tokenizer() -> Callable[["Language"], Tokenizer]:
return tokenizer_factory
@registry.misc("spacy.LookupsDataLoader.v1")
def load_lookups_data(lang, tables):
util.logger.debug("Loading lookups from spacy-lookups-data: %s", tables)
lookups = load_lookups(lang=lang, tables=tables)
@ -140,7 +141,7 @@ class Language:
Defaults (class): Settings, data and factory methods for creating the `nlp`
object and processing pipeline.
lang (str): IETF language code, such as 'en'.
lang (str): Two-letter ISO 639-1 or three-letter ISO 639-3 language codes, such as 'en' and 'eng'.
DOCS: https://spacy.io/api/language
"""
@ -182,6 +183,9 @@ class Language:
DOCS: https://spacy.io/api/language#init
"""
from .pipeline.factories import register_factories
register_factories()
# We're only calling this to import all factories provided via entry
# points. The factory decorator applied to these functions takes care
# of the rest.
@ -1211,7 +1215,7 @@ class Language:
examples,
):
eg.predicted = doc
return losses
return _replace_numpy_floats(losses)
def rehearse(
self,
@ -1462,7 +1466,7 @@ class Language:
results = scorer.score(examples, per_component=per_component)
n_words = sum(len(eg.predicted) for eg in examples)
results["speed"] = n_words / (end_time - start_time)
return results
return _replace_numpy_floats(results)
def create_optimizer(self):
"""Create an optimizer, usually using the [training.optimizer] config."""
@ -2091,6 +2095,38 @@ class Language:
util.replace_model_node(pipe.model, listener, new_model) # type: ignore[attr-defined]
tok2vec.remove_listener(listener, pipe_name)
@contextmanager
def memory_zone(self, mem: Optional[Pool] = None) -> Iterator[Pool]:
"""Begin a block where all resources allocated during the block will
be freed at the end of it. If a resource was created within the
memory zone block, accessing it outside the block is invalid.
The behaviour of such invalid access is undefined. Memory zones should
not be nested.
The memory zone is helpful for services that need to process large
volumes of text with a defined memory budget.
Example
-------
>>> with nlp.memory_zone():
... for doc in nlp.pipe(texts):
... process_my_doc(doc)
>>> # use_doc(doc) <-- Invalid: doc was allocated in the memory zone
"""
if mem is None:
mem = Pool()
# The ExitStack allows programmatic nested context managers.
# We don't know how many we need, so it would be awkward to have
# them as nested blocks.
with ExitStack() as stack:
contexts = [stack.enter_context(self.vocab.memory_zone(mem))]
if hasattr(self.tokenizer, "memory_zone"):
contexts.append(stack.enter_context(self.tokenizer.memory_zone(mem)))
for _, pipe in self.pipeline:
if hasattr(pipe, "memory_zone"):
contexts.append(stack.enter_context(pipe.memory_zone(mem)))
yield mem
def to_disk(
self, path: Union[str, Path], *, exclude: Iterable[str] = SimpleFrozenList()
) -> None:
@ -2108,7 +2144,9 @@ class Language:
serializers["tokenizer"] = lambda p: self.tokenizer.to_disk( # type: ignore[union-attr]
p, exclude=["vocab"]
)
serializers["meta.json"] = lambda p: srsly.write_json(p, self.meta)
serializers["meta.json"] = lambda p: srsly.write_json(
p, _replace_numpy_floats(self.meta)
)
serializers["config.cfg"] = lambda p: self.config.to_disk(p)
for name, proc in self._components:
if name in exclude:
@ -2222,7 +2260,9 @@ class Language:
serializers: Dict[str, Callable[[], bytes]] = {}
serializers["vocab"] = lambda: self.vocab.to_bytes(exclude=exclude)
serializers["tokenizer"] = lambda: self.tokenizer.to_bytes(exclude=["vocab"]) # type: ignore[union-attr]
serializers["meta.json"] = lambda: srsly.json_dumps(self.meta)
serializers["meta.json"] = lambda: srsly.json_dumps(
_replace_numpy_floats(self.meta)
)
serializers["config.cfg"] = lambda: self.config.to_bytes()
for name, proc in self._components:
if name in exclude:
@ -2273,6 +2313,12 @@ class Language:
return self
def _replace_numpy_floats(meta_dict: dict) -> dict:
return convert_recursive(
lambda v: isinstance(v, numpy.floating), lambda v: float(v), dict(meta_dict)
)
@dataclass
class FactoryMeta:
"""Dataclass containing information about a component and its defaults

View File
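
The memory_zone context manager added above bounds memory use when streaming many texts through the pipeline. A minimal usage sketch, assuming a pipeline created with spacy.blank and an iterable of strings; only plain-Python values should escape the zone:

import spacy

nlp = spacy.blank("en")
texts = ["First document.", "Second document."]

lengths = []
with nlp.memory_zone():
    for doc in nlp.pipe(texts):
        # Work with the Doc inside the zone; keep only plain-Python results.
        lengths.append(len(doc))
# The Doc objects are no longer valid here, but `lengths` is safe to use.
print(lengths)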

@ -35,7 +35,7 @@ cdef class Lexeme:
return self
@staticmethod
cdef inline void set_struct_attr(LexemeC* lex, attr_id_t name, attr_t value) nogil:
cdef inline void set_struct_attr(LexemeC* lex, attr_id_t name, attr_t value) noexcept nogil:
if name < (sizeof(flags_t) * 8):
Lexeme.c_set_flag(lex, name, value)
elif name == ID:
@ -54,7 +54,7 @@ cdef class Lexeme:
lex.lang = value
@staticmethod
cdef inline attr_t get_struct_attr(const LexemeC* lex, attr_id_t feat_name) nogil:
cdef inline attr_t get_struct_attr(const LexemeC* lex, attr_id_t feat_name) noexcept nogil:
if feat_name < (sizeof(flags_t) * 8):
if Lexeme.c_check_flag(lex, feat_name):
return 1
@ -82,7 +82,7 @@ cdef class Lexeme:
return 0
@staticmethod
cdef inline bint c_check_flag(const LexemeC* lexeme, attr_id_t flag_id) nogil:
cdef inline bint c_check_flag(const LexemeC* lexeme, attr_id_t flag_id) noexcept nogil:
cdef flags_t one = 1
if lexeme.flags & (one << flag_id):
return True
@ -90,7 +90,7 @@ cdef class Lexeme:
return False
@staticmethod
cdef inline bint c_set_flag(LexemeC* lex, attr_id_t flag_id, bint value) nogil:
cdef inline bint c_set_flag(LexemeC* lex, attr_id_t flag_id, bint value) noexcept nogil:
cdef flags_t one = 1
if value:
lex.flags |= one << flag_id

View File

@ -70,7 +70,7 @@ cdef class Lexeme:
if isinstance(other, Lexeme):
a = self.orth
b = other.orth
elif isinstance(other, long):
elif isinstance(other, int):
a = self.orth
b = other
elif isinstance(other, str):
@ -104,7 +104,7 @@ cdef class Lexeme:
# skip PROB, e.g. from lexemes.jsonl
if isinstance(value, float):
continue
elif isinstance(value, (int, long)):
elif isinstance(value, int):
Lexeme.set_struct_attr(self.c, attr, value)
else:
Lexeme.set_struct_attr(self.c, attr, self.vocab.strings.add(value))

View File

@ -1,4 +1,4 @@
# cython: binding=True, infer_types=True
# cython: binding=True, infer_types=True, language_level=3
from cpython.object cimport PyObject
from libc.stdint cimport int64_t
@ -27,6 +27,5 @@ cpdef bint levenshtein_compare(input_text: str, pattern_text: str, fuzzy: int =
return levenshtein(input_text, pattern_text, max_edits) <= max_edits
@registry.misc("spacy.levenshtein_compare.v1")
def make_levenshtein_compare():
return levenshtein_compare

View File

@ -625,7 +625,7 @@ cdef action_t get_action(
const TokenC * token,
const attr_t * extra_attrs,
const int8_t * predicate_matches
) nogil:
) noexcept nogil:
"""We need to consider:
a) Does the token match the specification? [Yes, No]
b) What's the quantifier? [1, 0+, ?]
@ -740,7 +740,7 @@ cdef int8_t get_is_match(
const TokenC* token,
const attr_t* extra_attrs,
const int8_t* predicate_matches
) nogil:
) noexcept nogil:
for i in range(state.pattern.nr_py):
if predicate_matches[state.pattern.py_predicates[i]] == -1:
return 0
@ -755,14 +755,14 @@ cdef int8_t get_is_match(
return True
cdef inline int8_t get_is_final(PatternStateC state) nogil:
cdef inline int8_t get_is_final(PatternStateC state) noexcept nogil:
if state.pattern[1].quantifier == FINAL_ID:
return 1
else:
return 0
cdef inline int8_t get_quantifier(PatternStateC state) nogil:
cdef inline int8_t get_quantifier(PatternStateC state) noexcept nogil:
return state.pattern.quantifier
@ -805,7 +805,7 @@ cdef TokenPatternC* init_pattern(Pool mem, attr_t entity_id, object token_specs)
return pattern
cdef attr_t get_ent_id(const TokenPatternC* pattern) nogil:
cdef attr_t get_ent_id(const TokenPatternC* pattern) noexcept nogil:
while pattern.quantifier != FINAL_ID:
pattern += 1
id_attr = pattern[0].attrs[0]

View File

@ -47,7 +47,7 @@ cdef class PhraseMatcher:
self._terminal_hash = 826361138722620965
map_init(self.mem, self.c_map, 8)
if isinstance(attr, (int, long)):
if isinstance(attr, int):
self.attr = attr
else:
if attr is None:

View File

@ -7,7 +7,6 @@ from ..tokens import Doc
from ..util import registry
@registry.layers("spacy.CharEmbed.v1")
def CharacterEmbed(nM: int, nC: int) -> Model[List[Doc], List[Floats2d]]:
# nM: Number of dimensions per character. nC: Number of characters.
return Model(

View File

@ -3,7 +3,6 @@ from thinc.api import Model, normal_init
from ..util import registry
@registry.layers("spacy.PrecomputableAffine.v1")
def PrecomputableAffine(nO, nI, nF, nP, dropout=0.1):
model = Model(
"precomputable_affine",

View File

@ -50,7 +50,6 @@ def models_with_nvtx_range(nlp, forward_color: int, backprop_color: int):
return nlp
@registry.callbacks("spacy.models_with_nvtx_range.v1")
def create_models_with_nvtx_range(
forward_color: int = -1, backprop_color: int = -1
) -> Callable[["Language"], "Language"]:
@ -110,7 +109,6 @@ def pipes_with_nvtx_range(
return nlp
@registry.callbacks("spacy.models_and_pipes_with_nvtx_range.v1")
def create_models_and_pipes_with_nvtx_range(
forward_color: int = -1,
backprop_color: int = -1,

View File

@ -4,7 +4,6 @@ from ..attrs import LOWER
from ..util import registry
@registry.layers("spacy.extract_ngrams.v1")
def extract_ngrams(ngram_size: int, attr: int = LOWER) -> Model:
model: Model = Model("extract_ngrams", forward)
model.attrs["ngram_size"] = ngram_size

View File

@ -6,7 +6,6 @@ from thinc.types import Ints1d, Ragged
from ..util import registry
@registry.layers("spacy.extract_spans.v1")
def extract_spans() -> Model[Tuple[Ragged, Ragged], Ragged]:
"""Extract spans from a sequence of source arrays, as specified by an array
of (start, end) indices. The output is a ragged array of the

View File

@ -6,8 +6,9 @@ from thinc.types import Ints2d
from ..tokens import Doc
@registry.layers("spacy.FeatureExtractor.v1")
def FeatureExtractor(columns: List[Union[int, str]]) -> Model[List[Doc], List[Ints2d]]:
def FeatureExtractor(
columns: Union[List[str], List[int], List[Union[int, str]]]
) -> Model[List[Doc], List[Ints2d]]:
return Model("extract_features", forward, attrs={"columns": columns})

View File

@ -28,7 +28,6 @@ from ...vocab import Vocab
from ..extract_spans import extract_spans
@registry.architectures("spacy.EntityLinker.v2")
def build_nel_encoder(
tok2vec: Model, nO: Optional[int] = None
) -> Model[List[Doc], Floats2d]:
@ -92,7 +91,6 @@ def span_maker_forward(model, docs: List[Doc], is_train) -> Tuple[Ragged, Callab
return out, lambda x: []
@registry.misc("spacy.KBFromFile.v1")
def load_kb(
kb_path: Path,
) -> Callable[[Vocab], KnowledgeBase]:
@ -104,7 +102,6 @@ def load_kb(
return kb_from_file
@registry.misc("spacy.EmptyKB.v2")
def empty_kb_for_config() -> Callable[[Vocab, int], KnowledgeBase]:
def empty_kb_factory(vocab: Vocab, entity_vector_length: int):
return InMemoryLookupKB(vocab=vocab, entity_vector_length=entity_vector_length)
@ -112,7 +109,6 @@ def empty_kb_for_config() -> Callable[[Vocab, int], KnowledgeBase]:
return empty_kb_factory
@registry.misc("spacy.EmptyKB.v1")
def empty_kb(
entity_vector_length: int,
) -> Callable[[Vocab], KnowledgeBase]:
@ -122,12 +118,10 @@ def empty_kb(
return empty_kb_factory
@registry.misc("spacy.CandidateGenerator.v1")
def create_candidates() -> Callable[[KnowledgeBase, Span], Iterable[Candidate]]:
return get_candidates
@registry.misc("spacy.CandidateBatchGenerator.v1")
def create_candidates_batch() -> Callable[
[KnowledgeBase, Iterable[Span]], Iterable[Iterable[Candidate]]
]:

View File

@ -30,7 +30,6 @@ if TYPE_CHECKING:
from ...vocab import Vocab # noqa: F401
@registry.architectures("spacy.PretrainVectors.v1")
def create_pretrain_vectors(
maxout_pieces: int, hidden_size: int, loss: str
) -> Callable[["Vocab", Model], Model]:
@ -57,7 +56,6 @@ def create_pretrain_vectors(
return create_vectors_objective
@registry.architectures("spacy.PretrainCharacters.v1")
def create_pretrain_characters(
maxout_pieces: int, hidden_size: int, n_characters: int
) -> Callable[["Vocab", Model], Model]:

View File

@ -11,7 +11,6 @@ from .._precomputable_affine import PrecomputableAffine
from ..tb_framework import TransitionModel
@registry.architectures("spacy.TransitionBasedParser.v2")
def build_tb_parser_model(
tok2vec: Model[List[Doc], List[Floats2d]],
state_type: Literal["parser", "ner"],

View File

@ -10,7 +10,6 @@ InT = List[Doc]
OutT = Floats2d
@registry.architectures("spacy.SpanFinder.v1")
def build_finder_model(
tok2vec: Model[InT, List[Floats2d]], scorer: Model[OutT, OutT]
) -> Model[InT, OutT]:

View File

@ -22,7 +22,6 @@ from ...util import registry
from ..extract_spans import extract_spans
@registry.layers("spacy.LinearLogistic.v1")
def build_linear_logistic(nO=None, nI=None) -> Model[Floats2d, Floats2d]:
"""An output layer for multi-label classification. It uses a linear layer
followed by a logistic activation.
@ -30,7 +29,6 @@ def build_linear_logistic(nO=None, nI=None) -> Model[Floats2d, Floats2d]:
return chain(Linear(nO=nO, nI=nI, init_W=glorot_uniform_init), Logistic())
@registry.layers("spacy.mean_max_reducer.v1")
def build_mean_max_reducer(hidden_size: int) -> Model[Ragged, Floats2d]:
"""Reduce sequences by concatenating their mean and max pooled vectors,
and then combine the concatenated vectors with a hidden layer.
@ -46,7 +44,6 @@ def build_mean_max_reducer(hidden_size: int) -> Model[Ragged, Floats2d]:
)
@registry.architectures("spacy.SpanCategorizer.v1")
def build_spancat_model(
tok2vec: Model[List[Doc], List[Floats2d]],
reducer: Model[Ragged, Floats2d],

View File
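
The spancat layers registered above can be resolved by their registry names; a small sketch, assuming the names shown in the diff and thinc's Model API:

from spacy import registry

# Look up the registered layer factories by name.
build_reducer = registry.layers.get("spacy.mean_max_reducer.v1")
build_scorer = registry.layers.get("spacy.LinearLogistic.v1")

reducer = build_reducer(hidden_size=128)  # Ragged -> Floats2d
scorer = build_scorer()                   # Floats2d -> Floats2d, dims inferred later
print(reducer.name, scorer.name)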

@ -7,7 +7,6 @@ from ...tokens import Doc
from ...util import registry
@registry.architectures("spacy.Tagger.v2")
def build_tagger_model(
tok2vec: Model[List[Doc], List[Floats2d]], nO: Optional[int] = None, normalize=False
) -> Model[List[Doc], List[Floats2d]]:

View File

@ -44,7 +44,6 @@ from .tok2vec import get_tok2vec_width
NEG_VALUE = -5000
@registry.architectures("spacy.TextCatCNN.v2")
def build_simple_cnn_text_classifier(
tok2vec: Model, exclusive_classes: bool, nO: Optional[int] = None
) -> Model[List[Doc], Floats2d]:
@ -72,7 +71,6 @@ def resize_and_set_ref(model, new_nO, resizable_layer):
return model
@registry.architectures("spacy.TextCatBOW.v2")
def build_bow_text_classifier(
exclusive_classes: bool,
ngram_size: int,
@ -88,7 +86,6 @@ def build_bow_text_classifier(
)
@registry.architectures("spacy.TextCatBOW.v3")
def build_bow_text_classifier_v3(
exclusive_classes: bool,
ngram_size: int,
@ -142,7 +139,6 @@ def _build_bow_text_classifier(
return model
@registry.architectures("spacy.TextCatEnsemble.v2")
def build_text_classifier_v2(
tok2vec: Model[List[Doc], List[Floats2d]],
linear_model: Model[List[Doc], Floats2d],
@ -200,7 +196,6 @@ def init_ensemble_textcat(model, X, Y) -> Model:
return model
@registry.architectures("spacy.TextCatLowData.v1")
def build_text_classifier_lowdata(
width: int, dropout: Optional[float], nO: Optional[int] = None
) -> Model[List[Doc], Floats2d]:
@ -221,7 +216,6 @@ def build_text_classifier_lowdata(
return model
@registry.architectures("spacy.TextCatParametricAttention.v1")
def build_textcat_parametric_attention_v1(
tok2vec: Model[List[Doc], List[Floats2d]],
exclusive_classes: bool,
@ -294,7 +288,6 @@ def _init_parametric_attention_with_residual_nonlinear(model, X, Y) -> Model:
return model
@registry.architectures("spacy.TextCatReduce.v1")
def build_reduce_text_classifier(
tok2vec: Model,
exclusive_classes: bool,

View File

@ -29,7 +29,6 @@ from ..featureextractor import FeatureExtractor
from ..staticvectors import StaticVectors
@registry.architectures("spacy.Tok2VecListener.v1")
def tok2vec_listener_v1(width: int, upstream: str = "*"):
tok2vec = Tok2VecListener(upstream_name=upstream, width=width)
return tok2vec
@ -46,7 +45,6 @@ def get_tok2vec_width(model: Model):
return nO
@registry.architectures("spacy.HashEmbedCNN.v2")
def build_hash_embed_cnn_tok2vec(
*,
width: int,
@ -102,7 +100,6 @@ def build_hash_embed_cnn_tok2vec(
)
@registry.architectures("spacy.Tok2Vec.v2")
def build_Tok2Vec_model(
embed: Model[List[Doc], List[Floats2d]],
encode: Model[List[Floats2d], List[Floats2d]],
@ -123,10 +120,9 @@ def build_Tok2Vec_model(
return tok2vec
@registry.architectures("spacy.MultiHashEmbed.v2")
def MultiHashEmbed(
width: int,
attrs: List[Union[str, int]],
attrs: Union[List[str], List[int], List[Union[str, int]]],
rows: List[int],
include_static_vectors: bool,
) -> Model[List[Doc], List[Floats2d]]:
@ -192,7 +188,7 @@ def MultiHashEmbed(
)
else:
model = chain(
FeatureExtractor(list(attrs)),
FeatureExtractor(attrs),
cast(Model[List[Ints2d], Ragged], list2ragged()),
with_array(concatenate(*embeddings)),
max_out,
@ -201,7 +197,6 @@ def MultiHashEmbed(
return model
@registry.architectures("spacy.CharacterEmbed.v2")
def CharacterEmbed(
width: int,
rows: int,
@ -278,7 +273,6 @@ def CharacterEmbed(
return model
@registry.architectures("spacy.MaxoutWindowEncoder.v2")
def MaxoutWindowEncoder(
width: int, window_size: int, maxout_pieces: int, depth: int
) -> Model[List[Floats2d], List[Floats2d]]:
@ -310,7 +304,6 @@ def MaxoutWindowEncoder(
return with_array(model, pad=receptive_field)
@registry.architectures("spacy.MishWindowEncoder.v2")
def MishWindowEncoder(
width: int, window_size: int, depth: int
) -> Model[List[Floats2d], List[Floats2d]]:
@ -333,7 +326,6 @@ def MishWindowEncoder(
return with_array(model)
@registry.architectures("spacy.TorchBiLSTMEncoder.v1")
def BiLSTMEncoder(
width: int, depth: int, dropout: float
) -> Model[List[Floats2d], List[Floats2d]]:

View File

@ -52,14 +52,14 @@ cdef SizesC get_c_sizes(model, int batch_size) except *:
return output
cdef ActivationsC alloc_activations(SizesC n) nogil:
cdef ActivationsC alloc_activations(SizesC n) noexcept nogil:
cdef ActivationsC A
memset(&A, 0, sizeof(A))
resize_activations(&A, n)
return A
cdef void free_activations(const ActivationsC* A) nogil:
cdef void free_activations(const ActivationsC* A) noexcept nogil:
free(A.token_ids)
free(A.scores)
free(A.unmaxed)
@ -67,7 +67,7 @@ cdef void free_activations(const ActivationsC* A) nogil:
free(A.is_valid)
cdef void resize_activations(ActivationsC* A, SizesC n) nogil:
cdef void resize_activations(ActivationsC* A, SizesC n) noexcept nogil:
if n.states <= A._max_size:
A._curr_size = n.states
return
@ -100,7 +100,7 @@ cdef void resize_activations(ActivationsC* A, SizesC n) nogil:
cdef void predict_states(
CBlas cblas, ActivationsC* A, StateC** states, const WeightsC* W, SizesC n
) nogil:
) noexcept nogil:
resize_activations(A, n)
for i in range(n.states):
states[i].set_context_tokens(&A.token_ids[i*n.feats], n.feats)
@ -159,7 +159,7 @@ cdef void sum_state_features(
int B,
int F,
int O
) nogil:
) noexcept nogil:
cdef int idx, b, f
cdef const float* feature
padding = cached
@ -183,7 +183,7 @@ cdef void cpu_log_loss(
const int* is_valid,
const float* scores,
int O
) nogil:
) noexcept nogil:
"""Do multi-label log loss"""
cdef double max_, gmax, Z, gZ
best = arg_max_if_gold(scores, costs, is_valid, O)
@ -209,7 +209,7 @@ cdef void cpu_log_loss(
cdef int arg_max_if_gold(
const weight_t* scores, const weight_t* costs, const int* is_valid, int n
) nogil:
) noexcept nogil:
# Find minimum cost
cdef float cost = 1
for i in range(n):
@ -224,7 +224,7 @@ cdef int arg_max_if_gold(
return best
cdef int arg_max_if_valid(const weight_t* scores, const int* is_valid, int n) nogil:
cdef int arg_max_if_valid(const weight_t* scores, const int* is_valid, int n) noexcept nogil:
cdef int best = -1
for i in range(n):
if is_valid[i] >= 1:

View File

@ -13,7 +13,6 @@ from ..vectors import Mode, Vectors
from ..vocab import Vocab
@registry.layers("spacy.StaticVectors.v2")
def StaticVectors(
nO: Optional[int] = None,
nM: Optional[int] = None,

View File

@ -4,7 +4,6 @@ from ..util import registry
from .parser_model import ParserStepModel
@registry.layers("spacy.TransitionModel.v1")
def TransitionModel(
tok2vec, lower, upper, resize_output, dropout=0.2, unseen_classes=set()
):

View File

@ -57,16 +57,20 @@ cdef class Morphology:
field_feature_pairs = []
for field in sorted(string_features):
values = string_features[field]
self.strings.add(field, allow_transient=False),
field_id = self.strings[field]
for value in values.split(self.VALUE_SEP):
field_sep_value = field + self.FIELD_SEP + value
self.strings.add(field_sep_value, allow_transient=False),
field_feature_pairs.append((
self.strings.add(field),
self.strings.add(field + self.FIELD_SEP + value),
field_id,
self.strings[field_sep_value]
))
cdef MorphAnalysisC tag = self.create_morph_tag(field_feature_pairs)
# the hash key for the tag is either the hash of the normalized UFEATS
# string or the hash of an empty placeholder
norm_feats_string = self.normalize_features(features)
tag.key = self.strings.add(norm_feats_string)
tag.key = self.strings.add(norm_feats_string, allow_transient=False)
self.insert(tag)
return tag.key

View File

@ -25,3 +25,8 @@ IDS = {
NAMES = {value: key for key, value in IDS.items()}
# As of Cython 3.1, the global Python namespace no longer has the enum
# contents by default.
globals().update(IDS)

View File
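
The globals().update(IDS) call above keeps the attribute constants importable from the module under Cython 3.1+, so existing code along these lines continues to work:

from spacy import attrs
from spacy.attrs import LOWER, ORTH

# The module-level names mirror the IDS mapping shown in the diff.
assert attrs.IDS["LOWER"] == LOWER
assert attrs.NAMES[ORTH] == "ORTH"
print(LOWER, ORTH)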

@ -17,7 +17,7 @@ from ...typedefs cimport attr_t
from ...vocab cimport EMPTY_LEXEME
cdef inline bint is_space_token(const TokenC* token) nogil:
cdef inline bint is_space_token(const TokenC* token) noexcept nogil:
return Lexeme.c_check_flag(token.lex, IS_SPACE)
cdef struct ArcC:
@ -41,7 +41,7 @@ cdef cppclass StateC:
int offset
int _b_i
__init__(const TokenC* sent, int length) nogil:
inline __init__(const TokenC* sent, int length) noexcept nogil:
this._sent = sent
this._heads = <int*>calloc(length, sizeof(int))
if not (this._sent and this._heads):
@ -57,10 +57,10 @@ cdef cppclass StateC:
memset(&this._empty_token, 0, sizeof(TokenC))
this._empty_token.lex = &EMPTY_LEXEME
__dealloc__():
inline __dealloc__():
free(this._heads)
void set_context_tokens(int* ids, int n) nogil:
inline void set_context_tokens(int* ids, int n) noexcept nogil:
cdef int i, j
if n == 1:
if this.B(0) >= 0:
@ -131,14 +131,14 @@ cdef cppclass StateC:
else:
ids[i] = -1
int S(int i) nogil const:
inline int S(int i) noexcept nogil const:
if i >= this._stack.size():
return -1
elif i < 0:
return -1
return this._stack.at(this._stack.size() - (i+1))
int B(int i) nogil const:
inline int B(int i) noexcept nogil const:
if i < 0:
return -1
elif i < this._rebuffer.size():
@ -150,19 +150,19 @@ cdef cppclass StateC:
else:
return b_i
const TokenC* B_(int i) nogil const:
inline const TokenC* B_(int i) noexcept nogil const:
return this.safe_get(this.B(i))
const TokenC* E_(int i) nogil const:
inline const TokenC* E_(int i) noexcept nogil const:
return this.safe_get(this.E(i))
const TokenC* safe_get(int i) nogil const:
inline const TokenC* safe_get(int i) noexcept nogil const:
if i < 0 or i >= this.length:
return &this._empty_token
else:
return &this._sent[i]
void map_get_arcs(const unordered_map[int, vector[ArcC]] &heads_arcs, vector[ArcC]* out) nogil const:
inline void map_get_arcs(const unordered_map[int, vector[ArcC]] &heads_arcs, vector[ArcC]* out) noexcept nogil const:
cdef const vector[ArcC]* arcs
head_arcs_it = heads_arcs.const_begin()
while head_arcs_it != heads_arcs.const_end():
@ -175,23 +175,23 @@ cdef cppclass StateC:
incr(arcs_it)
incr(head_arcs_it)
void get_arcs(vector[ArcC]* out) nogil const:
inline void get_arcs(vector[ArcC]* out) noexcept nogil const:
this.map_get_arcs(this._left_arcs, out)
this.map_get_arcs(this._right_arcs, out)
int H(int child) nogil const:
inline int H(int child) noexcept nogil const:
if child >= this.length or child < 0:
return -1
else:
return this._heads[child]
int E(int i) nogil const:
inline int E(int i) noexcept nogil const:
if this._ents.size() == 0:
return -1
else:
return this._ents.back().start
int nth_child(const unordered_map[int, vector[ArcC]]& heads_arcs, int head, int idx) nogil const:
inline int nth_child(const unordered_map[int, vector[ArcC]]& heads_arcs, int head, int idx) noexcept nogil const:
if idx < 1:
return -1
@ -215,22 +215,22 @@ cdef cppclass StateC:
return -1
int L(int head, int idx) nogil const:
inline int L(int head, int idx) noexcept nogil const:
return this.nth_child(this._left_arcs, head, idx)
int R(int head, int idx) nogil const:
inline int R(int head, int idx) noexcept nogil const:
return this.nth_child(this._right_arcs, head, idx)
bint empty() nogil const:
inline bint empty() noexcept nogil const:
return this._stack.size() == 0
bint eol() nogil const:
inline bint eol() noexcept nogil const:
return this.buffer_length() == 0
bint is_final() nogil const:
inline bint is_final() noexcept nogil const:
return this.stack_depth() <= 0 and this.eol()
int cannot_sent_start(int word) nogil const:
inline int cannot_sent_start(int word) noexcept nogil const:
if word < 0 or word >= this.length:
return 0
elif this._sent[word].sent_start == -1:
@ -238,7 +238,7 @@ cdef cppclass StateC:
else:
return 0
int is_sent_start(int word) nogil const:
inline int is_sent_start(int word) noexcept nogil const:
if word < 0 or word >= this.length:
return 0
elif this._sent[word].sent_start == 1:
@ -248,20 +248,20 @@ cdef cppclass StateC:
else:
return 0
void set_sent_start(int word, int value) nogil:
inline void set_sent_start(int word, int value) noexcept nogil:
if value >= 1:
this._sent_starts.insert(word)
bint has_head(int child) nogil const:
inline bint has_head(int child) noexcept nogil const:
return this._heads[child] >= 0
int l_edge(int word) nogil const:
inline int l_edge(int word) noexcept nogil const:
return word
int r_edge(int word) nogil const:
inline int r_edge(int word) noexcept nogil const:
return word
int n_arcs(const unordered_map[int, vector[ArcC]] &heads_arcs, int head) nogil const:
inline int n_arcs(const unordered_map[int, vector[ArcC]] &heads_arcs, int head) noexcept nogil const:
cdef int n = 0
head_arcs_it = heads_arcs.const_find(head)
if head_arcs_it == heads_arcs.const_end():
@ -277,28 +277,28 @@ cdef cppclass StateC:
return n
int n_L(int head) nogil const:
inline int n_L(int head) noexcept nogil const:
return n_arcs(this._left_arcs, head)
int n_R(int head) nogil const:
inline int n_R(int head) noexcept nogil const:
return n_arcs(this._right_arcs, head)
bint stack_is_connected() nogil const:
inline bint stack_is_connected() noexcept nogil const:
return False
bint entity_is_open() nogil const:
inline bint entity_is_open() noexcept nogil const:
if this._ents.size() == 0:
return False
else:
return this._ents.back().end == -1
int stack_depth() nogil const:
inline int stack_depth() noexcept nogil const:
return this._stack.size()
int buffer_length() nogil const:
inline int buffer_length() noexcept nogil const:
return (this.length - this._b_i) + this._rebuffer.size()
void push() nogil:
inline void push() noexcept nogil:
b0 = this.B(0)
if this._rebuffer.size():
b0 = this._rebuffer.back()
@ -308,32 +308,32 @@ cdef cppclass StateC:
this._b_i += 1
this._stack.push_back(b0)
void pop() nogil:
inline void pop() noexcept nogil:
this._stack.pop_back()
void force_final() nogil:
inline void force_final() noexcept nogil:
# This should only be used in desperate situations, as it may leave
# the analysis in an unexpected state.
this._stack.clear()
this._b_i = this.length
void unshift() nogil:
inline void unshift() noexcept nogil:
s0 = this._stack.back()
this._unshiftable[s0] = 1
this._rebuffer.push_back(s0)
this._stack.pop_back()
int is_unshiftable(int item) nogil const:
inline int is_unshiftable(int item) noexcept nogil const:
if item >= this._unshiftable.size():
return 0
else:
return this._unshiftable.at(item)
void set_reshiftable(int item) nogil:
inline void set_reshiftable(int item) noexcept nogil:
if item < this._unshiftable.size():
this._unshiftable[item] = 0
void add_arc(int head, int child, attr_t label) nogil:
inline void add_arc(int head, int child, attr_t label) noexcept nogil:
if this.has_head(child):
this.del_arc(this.H(child), child)
cdef ArcC arc
@ -346,7 +346,7 @@ cdef cppclass StateC:
this._right_arcs[arc.head].push_back(arc)
this._heads[child] = head
void map_del_arc(unordered_map[int, vector[ArcC]]* heads_arcs, int h_i, int c_i) nogil:
inline void map_del_arc(unordered_map[int, vector[ArcC]]* heads_arcs, int h_i, int c_i) noexcept nogil:
arcs_it = heads_arcs.find(h_i)
if arcs_it == heads_arcs.end():
return
@ -367,13 +367,13 @@ cdef cppclass StateC:
arc.label = 0
break
void del_arc(int h_i, int c_i) nogil:
inline void del_arc(int h_i, int c_i) noexcept nogil:
if h_i > c_i:
this.map_del_arc(&this._left_arcs, h_i, c_i)
else:
this.map_del_arc(&this._right_arcs, h_i, c_i)
SpanC get_ent() nogil const:
inline SpanC get_ent() noexcept nogil const:
cdef SpanC ent
if this._ents.size() == 0:
ent.start = 0
@ -383,17 +383,17 @@ cdef cppclass StateC:
else:
return this._ents.back()
void open_ent(attr_t label) nogil:
inline void open_ent(attr_t label) noexcept nogil:
cdef SpanC ent
ent.start = this.B(0)
ent.label = label
ent.end = -1
this._ents.push_back(ent)
void close_ent() nogil:
inline void close_ent() noexcept nogil:
this._ents.back().end = this.B(0)+1
void clone(const StateC* src) nogil:
inline void clone(const StateC* src) noexcept nogil:
this.length = src.length
this._sent = src._sent
this._stack = src._stack

View File

@ -155,7 +155,7 @@ cdef GoldParseStateC create_gold_state(
return gs
cdef void update_gold_state(GoldParseStateC* gs, const StateC* s) nogil:
cdef void update_gold_state(GoldParseStateC* gs, const StateC* s) noexcept nogil:
for i in range(gs.length):
gs.state_bits[i] = set_state_flag(
gs.state_bits[i],
@ -203,7 +203,7 @@ cdef class ArcEagerGold:
def __init__(self, ArcEager moves, StateClass stcls, Example example):
self.mem = Pool()
heads, labels = example.get_aligned_parse(projectivize=True)
labels = [example.x.vocab.strings.add(label) if label is not None else MISSING_DEP for label in labels]
labels = [example.x.vocab.strings.add(label, allow_transient=False) if label is not None else MISSING_DEP for label in labels]
sent_starts = _get_aligned_sent_starts(example)
assert len(heads) == len(labels) == len(sent_starts), (len(heads), len(labels), len(sent_starts))
self.c = create_gold_state(self.mem, stcls.c, heads, labels, sent_starts)
@ -239,12 +239,12 @@ def _get_aligned_sent_starts(example):
return [None] * len(example.x)
cdef int check_state_gold(char state_bits, char flag) nogil:
cdef int check_state_gold(char state_bits, char flag) noexcept nogil:
cdef char one = 1
return 1 if (state_bits & (one << flag)) else 0
cdef int set_state_flag(char state_bits, char flag, int value) nogil:
cdef int set_state_flag(char state_bits, char flag, int value) noexcept nogil:
cdef char one = 1
if value:
return state_bits | (one << flag)
@ -252,27 +252,27 @@ cdef int set_state_flag(char state_bits, char flag, int value) nogil:
return state_bits & ~(one << flag)
cdef int is_head_in_stack(const GoldParseStateC* gold, int i) nogil:
cdef int is_head_in_stack(const GoldParseStateC* gold, int i) noexcept nogil:
return check_state_gold(gold.state_bits[i], HEAD_IN_STACK)
cdef int is_head_in_buffer(const GoldParseStateC* gold, int i) nogil:
cdef int is_head_in_buffer(const GoldParseStateC* gold, int i) noexcept nogil:
return check_state_gold(gold.state_bits[i], HEAD_IN_BUFFER)
cdef int is_head_unknown(const GoldParseStateC* gold, int i) nogil:
cdef int is_head_unknown(const GoldParseStateC* gold, int i) noexcept nogil:
return check_state_gold(gold.state_bits[i], HEAD_UNKNOWN)
cdef int is_sent_start(const GoldParseStateC* gold, int i) nogil:
cdef int is_sent_start(const GoldParseStateC* gold, int i) noexcept nogil:
return check_state_gold(gold.state_bits[i], IS_SENT_START)
cdef int is_sent_start_unknown(const GoldParseStateC* gold, int i) nogil:
cdef int is_sent_start_unknown(const GoldParseStateC* gold, int i) noexcept nogil:
return check_state_gold(gold.state_bits[i], SENT_START_UNKNOWN)
# Helper functions for the arc-eager oracle
cdef weight_t push_cost(const StateC* state, const GoldParseStateC* gold) nogil:
cdef weight_t push_cost(const StateC* state, const GoldParseStateC* gold) noexcept nogil:
cdef weight_t cost = 0
b0 = state.B(0)
if b0 < 0:
@ -285,7 +285,7 @@ cdef weight_t push_cost(const StateC* state, const GoldParseStateC* gold) nogil:
return cost
cdef weight_t pop_cost(const StateC* state, const GoldParseStateC* gold) nogil:
cdef weight_t pop_cost(const StateC* state, const GoldParseStateC* gold) noexcept nogil:
cdef weight_t cost = 0
s0 = state.S(0)
if s0 < 0:
@ -296,7 +296,7 @@ cdef weight_t pop_cost(const StateC* state, const GoldParseStateC* gold) nogil:
return cost
cdef bint arc_is_gold(const GoldParseStateC* gold, int head, int child) nogil:
cdef bint arc_is_gold(const GoldParseStateC* gold, int head, int child) noexcept nogil:
if is_head_unknown(gold, child):
return True
elif gold.heads[child] == head:
@ -305,7 +305,7 @@ cdef bint arc_is_gold(const GoldParseStateC* gold, int head, int child) nogil:
return False
cdef bint label_is_gold(const GoldParseStateC* gold, int child, attr_t label) nogil:
cdef bint label_is_gold(const GoldParseStateC* gold, int child, attr_t label) noexcept nogil:
if is_head_unknown(gold, child):
return True
elif label == 0:
@ -316,7 +316,7 @@ cdef bint label_is_gold(const GoldParseStateC* gold, int child, attr_t label) no
return False
cdef bint _is_gold_root(const GoldParseStateC* gold, int word) nogil:
cdef bint _is_gold_root(const GoldParseStateC* gold, int word) noexcept nogil:
return gold.heads[word] == word or is_head_unknown(gold, word)
@ -336,7 +336,7 @@ cdef class Shift:
* Advance buffer
"""
@staticmethod
cdef bint is_valid(const StateC* st, attr_t label) nogil:
cdef bint is_valid(const StateC* st, attr_t label) noexcept nogil:
if st.stack_depth() == 0:
return 1
elif st.buffer_length() < 2:
@ -349,11 +349,11 @@ cdef class Shift:
return 1
@staticmethod
cdef int transition(StateC* st, attr_t label) nogil:
cdef int transition(StateC* st, attr_t label) noexcept nogil:
st.push()
@staticmethod
cdef weight_t cost(const StateC* state, const void* _gold, attr_t label) nogil:
cdef weight_t cost(const StateC* state, const void* _gold, attr_t label) noexcept nogil:
gold = <const GoldParseStateC*>_gold
return gold.push_cost
@ -375,7 +375,7 @@ cdef class Reduce:
cost by those arcs.
"""
@staticmethod
cdef bint is_valid(const StateC* st, attr_t label) nogil:
cdef bint is_valid(const StateC* st, attr_t label) noexcept nogil:
if st.stack_depth() == 0:
return False
elif st.buffer_length() == 0:
@ -386,14 +386,14 @@ cdef class Reduce:
return True
@staticmethod
cdef int transition(StateC* st, attr_t label) nogil:
cdef int transition(StateC* st, attr_t label) noexcept nogil:
if st.has_head(st.S(0)) or st.stack_depth() == 1:
st.pop()
else:
st.unshift()
@staticmethod
cdef weight_t cost(const StateC* state, const void* _gold, attr_t label) nogil:
cdef weight_t cost(const StateC* state, const void* _gold, attr_t label) noexcept nogil:
gold = <const GoldParseStateC*>_gold
if state.is_sent_start(state.B(0)):
return 0
@ -421,7 +421,7 @@ cdef class LeftArc:
pop_cost - Arc(B[0], S[0], label) + (Arc(S[1], S[0]) if H(S[0]) else Arcs(S, S[0]))
"""
@staticmethod
cdef bint is_valid(const StateC* st, attr_t label) nogil:
cdef bint is_valid(const StateC* st, attr_t label) noexcept nogil:
if st.stack_depth() == 0:
return 0
elif st.buffer_length() == 0:
@ -434,7 +434,7 @@ cdef class LeftArc:
return 1
@staticmethod
cdef int transition(StateC* st, attr_t label) nogil:
cdef int transition(StateC* st, attr_t label) noexcept nogil:
st.add_arc(st.B(0), st.S(0), label)
# If we change the stack, it's okay to remove the shifted mark, as
# we can't get in an infinite loop this way.
@ -442,7 +442,7 @@ cdef class LeftArc:
st.pop()
@staticmethod
cdef inline weight_t cost(const StateC* state, const void* _gold, attr_t label) nogil:
cdef inline weight_t cost(const StateC* state, const void* _gold, attr_t label) noexcept nogil:
gold = <const GoldParseStateC*>_gold
cdef weight_t cost = gold.pop_cost
s0 = state.S(0)
@ -474,7 +474,7 @@ cdef class RightArc:
push_cost + (not shifted[b0] and Arc(B[1:], B[0])) - Arc(S[0], B[0], label)
"""
@staticmethod
cdef bint is_valid(const StateC* st, attr_t label) nogil:
cdef bint is_valid(const StateC* st, attr_t label) noexcept nogil:
if st.stack_depth() == 0:
return 0
elif st.buffer_length() == 0:
@ -488,12 +488,12 @@ cdef class RightArc:
return 1
@staticmethod
cdef int transition(StateC* st, attr_t label) nogil:
cdef int transition(StateC* st, attr_t label) noexcept nogil:
st.add_arc(st.S(0), st.B(0), label)
st.push()
@staticmethod
cdef inline weight_t cost(const StateC* state, const void* _gold, attr_t label) nogil:
cdef inline weight_t cost(const StateC* state, const void* _gold, attr_t label) noexcept nogil:
gold = <const GoldParseStateC*>_gold
cost = gold.push_cost
s0 = state.S(0)
@ -525,7 +525,7 @@ cdef class Break:
* Arcs between S and B[1]
"""
@staticmethod
cdef bint is_valid(const StateC* st, attr_t label) nogil:
cdef bint is_valid(const StateC* st, attr_t label) noexcept nogil:
if st.buffer_length() < 2:
return False
elif st.B(1) != st.B(0) + 1:
@ -538,11 +538,11 @@ cdef class Break:
return True
@staticmethod
cdef int transition(StateC* st, attr_t label) nogil:
cdef int transition(StateC* st, attr_t label) noexcept nogil:
st.set_sent_start(st.B(1), 1)
@staticmethod
cdef weight_t cost(const StateC* state, const void* _gold, attr_t label) nogil:
cdef weight_t cost(const StateC* state, const void* _gold, attr_t label) noexcept nogil:
gold = <const GoldParseStateC*>_gold
cdef int b0 = state.B(0)
cdef int cost = 0
@ -785,7 +785,7 @@ cdef class ArcEager(TransitionSystem):
else:
return False
cdef int set_valid(self, int* output, const StateC* st) nogil:
cdef int set_valid(self, int* output, const StateC* st) noexcept nogil:
cdef int[N_MOVES] is_valid
is_valid[SHIFT] = Shift.is_valid(st, 0)
is_valid[REDUCE] = Reduce.is_valid(st, 0)

View File

@ -110,7 +110,7 @@ cdef void update_gold_state(GoldNERStateC* gs, const StateC* state) except *:
cdef do_func_t[N_MOVES] do_funcs
cdef bint _entity_is_sunk(const StateC* state, Transition* golds) nogil:
cdef bint _entity_is_sunk(const StateC* state, Transition* golds) noexcept nogil:
if not state.entity_is_open():
return False
@ -238,7 +238,7 @@ cdef class BiluoPushDown(TransitionSystem):
def add_action(self, int action, label_name, freq=None):
cdef attr_t label_id
if not isinstance(label_name, (int, long)):
if not isinstance(label_name, int):
label_id = self.strings.add(label_name)
else:
label_id = label_name
@ -347,21 +347,21 @@ cdef class BiluoPushDown(TransitionSystem):
cdef class Missing:
@staticmethod
cdef bint is_valid(const StateC* st, attr_t label) nogil:
cdef bint is_valid(const StateC* st, attr_t label) noexcept nogil:
return False
@staticmethod
cdef int transition(StateC* s, attr_t label) nogil:
cdef int transition(StateC* s, attr_t label) noexcept nogil:
pass
@staticmethod
cdef weight_t cost(const StateC* s, const void* _gold, attr_t label) nogil:
cdef weight_t cost(const StateC* s, const void* _gold, attr_t label) noexcept nogil:
return 9000
cdef class Begin:
@staticmethod
cdef bint is_valid(const StateC* st, attr_t label) nogil:
cdef bint is_valid(const StateC* st, attr_t label) noexcept nogil:
cdef int preset_ent_iob = st.B_(0).ent_iob
cdef attr_t preset_ent_label = st.B_(0).ent_type
if st.entity_is_open():
@ -400,13 +400,13 @@ cdef class Begin:
return True
@staticmethod
cdef int transition(StateC* st, attr_t label) nogil:
cdef int transition(StateC* st, attr_t label) noexcept nogil:
st.open_ent(label)
st.push()
st.pop()
@staticmethod
cdef weight_t cost(const StateC* s, const void* _gold, attr_t label) nogil:
cdef weight_t cost(const StateC* s, const void* _gold, attr_t label) noexcept nogil:
gold = <GoldNERStateC*>_gold
b0 = s.B(0)
cdef int cost = 0
@ -439,7 +439,7 @@ cdef class Begin:
cdef class In:
@staticmethod
cdef bint is_valid(const StateC* st, attr_t label) nogil:
cdef bint is_valid(const StateC* st, attr_t label) noexcept nogil:
if not st.entity_is_open():
return False
if st.buffer_length() < 2:
@ -475,12 +475,12 @@ cdef class In:
return True
@staticmethod
cdef int transition(StateC* st, attr_t label) nogil:
cdef int transition(StateC* st, attr_t label) noexcept nogil:
st.push()
st.pop()
@staticmethod
cdef weight_t cost(const StateC* s, const void* _gold, attr_t label) nogil:
cdef weight_t cost(const StateC* s, const void* _gold, attr_t label) noexcept nogil:
gold = <GoldNERStateC*>_gold
cdef int next_act = gold.ner[s.B(1)].move if s.B(1) >= 0 else OUT
cdef int g_act = gold.ner[s.B(0)].move
@ -510,7 +510,7 @@ cdef class In:
cdef class Last:
@staticmethod
cdef bint is_valid(const StateC* st, attr_t label) nogil:
cdef bint is_valid(const StateC* st, attr_t label) noexcept nogil:
cdef int preset_ent_iob = st.B_(0).ent_iob
cdef attr_t preset_ent_label = st.B_(0).ent_type
if label == 0:
@ -535,13 +535,13 @@ cdef class Last:
return True
@staticmethod
cdef int transition(StateC* st, attr_t label) nogil:
cdef int transition(StateC* st, attr_t label) noexcept nogil:
st.close_ent()
st.push()
st.pop()
@staticmethod
cdef weight_t cost(const StateC* s, const void* _gold, attr_t label) nogil:
cdef weight_t cost(const StateC* s, const void* _gold, attr_t label) noexcept nogil:
gold = <GoldNERStateC*>_gold
b0 = s.B(0)
ent_start = s.E(0)
@ -581,7 +581,7 @@ cdef class Last:
cdef class Unit:
@staticmethod
cdef bint is_valid(const StateC* st, attr_t label) nogil:
cdef bint is_valid(const StateC* st, attr_t label) noexcept nogil:
cdef int preset_ent_iob = st.B_(0).ent_iob
cdef attr_t preset_ent_label = st.B_(0).ent_type
if label == 0:
@ -609,14 +609,14 @@ cdef class Unit:
return True
@staticmethod
cdef int transition(StateC* st, attr_t label) nogil:
cdef int transition(StateC* st, attr_t label) noexcept nogil:
st.open_ent(label)
st.close_ent()
st.push()
st.pop()
@staticmethod
cdef weight_t cost(const StateC* s, const void* _gold, attr_t label) nogil:
cdef weight_t cost(const StateC* s, const void* _gold, attr_t label) noexcept nogil:
gold = <GoldNERStateC*>_gold
cdef int g_act = gold.ner[s.B(0)].move
cdef attr_t g_tag = gold.ner[s.B(0)].label
@ -646,7 +646,7 @@ cdef class Unit:
cdef class Out:
@staticmethod
cdef bint is_valid(const StateC* st, attr_t label) nogil:
cdef bint is_valid(const StateC* st, attr_t label) noexcept nogil:
cdef int preset_ent_iob = st.B_(0).ent_iob
if st.entity_is_open():
return False
@ -658,12 +658,12 @@ cdef class Out:
return True
@staticmethod
cdef int transition(StateC* st, attr_t label) nogil:
cdef int transition(StateC* st, attr_t label) noexcept nogil:
st.push()
st.pop()
@staticmethod
cdef weight_t cost(const StateC* s, const void* _gold, attr_t label) nogil:
cdef weight_t cost(const StateC* s, const void* _gold, attr_t label) noexcept nogil:
gold = <GoldNERStateC*>_gold
cdef int g_act = gold.ner[s.B(0)].move
cdef weight_t cost = 0

View File

@ -94,7 +94,7 @@ cdef bool _has_head_as_ancestor(int tokenid, int head, const vector[int]& heads)
return False
cdef string heads_to_string(const vector[int]& heads) nogil:
cdef string heads_to_string(const vector[int]& heads) noexcept nogil:
cdef vector[int].const_iterator citer
cdef string cycle_str
@ -183,7 +183,7 @@ cpdef deprojectivize(Doc doc):
new_label, head_label = label.split(DELIMITER)
new_head = _find_new_head(doc[i], head_label)
doc.c[i].head = new_head.i - i
doc.c[i].dep = doc.vocab.strings.add(new_label)
doc.c[i].dep = doc.vocab.strings.add(new_label, allow_transient=False)
set_children_from_heads(doc.c, 0, doc.length)
return doc

View File

@ -15,22 +15,22 @@ cdef struct Transition:
weight_t score
bint (*is_valid)(const StateC* state, attr_t label) nogil
weight_t (*get_cost)(const StateC* state, const void* gold, attr_t label) nogil
int (*do)(StateC* state, attr_t label) nogil
bint (*is_valid)(const StateC* state, attr_t label) noexcept nogil
weight_t (*get_cost)(const StateC* state, const void* gold, attr_t label) noexcept nogil
int (*do)(StateC* state, attr_t label) noexcept nogil
ctypedef weight_t (*get_cost_func_t)(
const StateC* state, const void* gold, attr_t label
) nogil
) noexcept nogil
ctypedef weight_t (*move_cost_func_t)(
const StateC* state, const void* gold
) nogil
) noexcept nogil
ctypedef weight_t (*label_cost_func_t)(
const StateC* state, const void* gold, attr_t label
) nogil
) noexcept nogil
ctypedef int (*do_func_t)(StateC* state, attr_t label) nogil
ctypedef int (*do_func_t)(StateC* state, attr_t label) noexcept nogil
ctypedef void* (*init_state_t)(Pool mem, int length, void* tokens) except NULL
@ -53,7 +53,7 @@ cdef class TransitionSystem:
cdef Transition init_transition(self, int clas, int move, attr_t label) except *
cdef int set_valid(self, int* output, const StateC* st) nogil
cdef int set_valid(self, int* output, const StateC* st) noexcept nogil
cdef int set_costs(self, int* is_valid, weight_t* costs,
const StateC* state, gold) except -1

View File

@ -149,7 +149,7 @@ cdef class TransitionSystem:
action = self.lookup_transition(move_name)
return action.is_valid(stcls.c, action.label)
cdef int set_valid(self, int* is_valid, const StateC* st) nogil:
cdef int set_valid(self, int* is_valid, const StateC* st) noexcept nogil:
cdef int i
for i in range(self.n_moves):
is_valid[i] = self.c[i].is_valid(st, self.c[i].label)
@ -191,8 +191,7 @@ cdef class TransitionSystem:
def add_action(self, int action, label_name):
cdef attr_t label_id
if not isinstance(label_name, int) and \
not isinstance(label_name, long):
if not isinstance(label_name, int):
label_id = self.strings.add(label_name)
else:
label_id = label_name

View File

@ -1,3 +1,5 @@
import importlib
import sys
from pathlib import Path
from typing import Any, Callable, Dict, Iterable, List, Optional, Tuple, Union
@ -22,19 +24,6 @@ TagMapType = Dict[str, Dict[Union[int, str], Union[int, str]]]
MorphRulesType = Dict[str, Dict[str, Dict[Union[int, str], Union[int, str]]]]
@Language.factory(
"attribute_ruler",
default_config={
"validate": False,
"scorer": {"@scorers": "spacy.attribute_ruler_scorer.v1"},
},
)
def make_attribute_ruler(
nlp: Language, name: str, validate: bool, scorer: Optional[Callable]
):
return AttributeRuler(nlp.vocab, name, validate=validate, scorer=scorer)
def attribute_ruler_score(examples: Iterable[Example], **kwargs) -> Dict[str, Any]:
def morph_key_getter(token, attr):
return getattr(token, attr).key
@ -54,7 +43,6 @@ def attribute_ruler_score(examples: Iterable[Example], **kwargs) -> Dict[str, An
return results
@registry.scorers("spacy.attribute_ruler_scorer.v1")
def make_attribute_ruler_scorer():
return attribute_ruler_score
@ -355,3 +343,11 @@ def _split_morph_attrs(attrs: dict) -> Tuple[dict, dict]:
else:
morph_attrs[k] = v
return other_attrs, morph_attrs
# Setup backwards compatibility hook for factories
def __getattr__(name):
if name == "make_attribute_ruler":
module = importlib.import_module("spacy.pipeline.factories")
return module.make_attribute_ruler
raise AttributeError(f"module {__name__} has no attribute {name}")

View File
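
The attribute_ruler factory registration now lives in spacy.pipeline.factories, and the module-level __getattr__ above keeps the old import path working. A sketch of both paths; the attributeruler module path and the example pattern are assumptions for illustration:

import spacy

nlp = spacy.blank("en")
ruler = nlp.add_pipe("attribute_ruler")  # factory resolved from the central registrations
ruler.add_patterns(
    [{"patterns": [[{"ORTH": "Nr."}]], "attrs": {"POS": "NOUN"}}]  # hypothetical pattern
)

# Legacy import path, resolved through the __getattr__ hook:
from spacy.pipeline.attributeruler import make_attribute_ruler
print(nlp.pipe_names, make_attribute_ruler)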

@ -1,4 +1,6 @@
# cython: infer_types=True, binding=True
import importlib
import sys
from collections import defaultdict
from typing import Callable, Optional
@ -39,188 +41,6 @@ subword_features = true
DEFAULT_PARSER_MODEL = Config().from_str(default_model_config)["model"]
@Language.factory(
"parser",
assigns=["token.dep", "token.head", "token.is_sent_start", "doc.sents"],
default_config={
"moves": None,
"update_with_oracle_cut_size": 100,
"learn_tokens": False,
"min_action_freq": 30,
"model": DEFAULT_PARSER_MODEL,
"scorer": {"@scorers": "spacy.parser_scorer.v1"},
},
default_score_weights={
"dep_uas": 0.5,
"dep_las": 0.5,
"dep_las_per_type": None,
"sents_p": None,
"sents_r": None,
"sents_f": 0.0,
},
)
def make_parser(
nlp: Language,
name: str,
model: Model,
moves: Optional[TransitionSystem],
update_with_oracle_cut_size: int,
learn_tokens: bool,
min_action_freq: int,
scorer: Optional[Callable],
):
"""Create a transition-based DependencyParser component. The dependency parser
jointly learns sentence segmentation and labelled dependency parsing, and can
optionally learn to merge tokens that had been over-segmented by the tokenizer.
The parser uses a variant of the non-monotonic arc-eager transition-system
described by Honnibal and Johnson (2014), with the addition of a "break"
transition to perform the sentence segmentation. Nivre's pseudo-projective
dependency transformation is used to allow the parser to predict
non-projective parses.
The parser is trained using an imitation learning objective. The parser follows
the actions predicted by the current weights, and at each state, determines
which actions are compatible with the optimal parse that could be reached
from the current state. The weights such that the scores assigned to the
set of optimal actions is increased, while scores assigned to other
actions are decreased. Note that more than one action may be optimal for
a given state.
model (Model): The model for the transition-based parser. The model needs
to have a specific substructure of named components --- see the
spacy.ml.tb_framework.TransitionModel for details.
moves (Optional[TransitionSystem]): This defines how the parse-state is created,
updated and evaluated. If 'moves' is None, a new instance is
created with `self.TransitionSystem()`. Defaults to `None`.
update_with_oracle_cut_size (int): During training, cut long sequences into
shorter segments by creating intermediate states based on the gold-standard
history. The model is not very sensitive to this parameter, so you usually
won't need to change it. 100 is a good default.
learn_tokens (bool): Whether to learn to merge subtokens that are split
relative to the gold standard. Experimental.
min_action_freq (int): The minimum frequency of labelled actions to retain.
Rarer labelled actions have their label backed-off to "dep". While this
primarily affects the label accuracy, it can also affect the attachment
structure, as the labels are used to represent the pseudo-projectivity
transformation.
scorer (Optional[Callable]): The scoring method.
"""
return DependencyParser(
nlp.vocab,
model,
name,
moves=moves,
update_with_oracle_cut_size=update_with_oracle_cut_size,
multitasks=[],
learn_tokens=learn_tokens,
min_action_freq=min_action_freq,
beam_width=1,
beam_density=0.0,
beam_update_prob=0.0,
# At some point in the future we can try to implement support for
# partial annotations, perhaps only in the beam objective.
incorrect_spans_key=None,
scorer=scorer,
)
@Language.factory(
"beam_parser",
assigns=["token.dep", "token.head", "token.is_sent_start", "doc.sents"],
default_config={
"beam_width": 8,
"beam_density": 0.01,
"beam_update_prob": 0.5,
"moves": None,
"update_with_oracle_cut_size": 100,
"learn_tokens": False,
"min_action_freq": 30,
"model": DEFAULT_PARSER_MODEL,
"scorer": {"@scorers": "spacy.parser_scorer.v1"},
},
default_score_weights={
"dep_uas": 0.5,
"dep_las": 0.5,
"dep_las_per_type": None,
"sents_p": None,
"sents_r": None,
"sents_f": 0.0,
},
)
def make_beam_parser(
nlp: Language,
name: str,
model: Model,
moves: Optional[TransitionSystem],
update_with_oracle_cut_size: int,
learn_tokens: bool,
min_action_freq: int,
beam_width: int,
beam_density: float,
beam_update_prob: float,
scorer: Optional[Callable],
):
"""Create a transition-based DependencyParser component that uses beam-search.
The dependency parser jointly learns sentence segmentation and labelled
dependency parsing, and can optionally learn to merge tokens that had been
over-segmented by the tokenizer.
The parser uses a variant of the non-monotonic arc-eager transition-system
described by Honnibal and Johnson (2014), with the addition of a "break"
transition to perform the sentence segmentation. Nivre's pseudo-projective
dependency transformation is used to allow the parser to predict
non-projective parses.
The parser is trained using a global objective. That is, it learns to assign
probabilities to whole parses.
model (Model): The model for the transition-based parser. The model needs
to have a specific substructure of named components --- see the
spacy.ml.tb_framework.TransitionModel for details.
moves (Optional[TransitionSystem]): This defines how the parse-state is created,
updated and evaluated. If 'moves' is None, a new instance is
created with `self.TransitionSystem()`. Defaults to `None`.
update_with_oracle_cut_size (int): During training, cut long sequences into
shorter segments by creating intermediate states based on the gold-standard
history. The model is not very sensitive to this parameter, so you usually
won't need to change it. 100 is a good default.
beam_width (int): The number of candidate analyses to maintain.
beam_density (float): The minimum ratio between the scores of the first and
last candidates in the beam. This allows the parser to avoid exploring
candidates that are too far behind. This is mostly intended to improve
efficiency, but it can also improve accuracy as deeper search is not
always better.
beam_update_prob (float): The chance of making a beam update, instead of a
greedy update. Greedy updates are an approximation for the beam updates,
and are faster to compute.
learn_tokens (bool): Whether to learn to merge subtokens that are split
relative to the gold standard. Experimental.
min_action_freq (int): The minimum frequency of labelled actions to retain.
Rarer labelled actions have their label backed-off to "dep". While this
primarily affects the label accuracy, it can also affect the attachment
structure, as the labels are used to represent the pseudo-projectivity
transformation.
"""
return DependencyParser(
nlp.vocab,
model,
name,
moves=moves,
update_with_oracle_cut_size=update_with_oracle_cut_size,
beam_width=beam_width,
beam_density=beam_density,
beam_update_prob=beam_update_prob,
multitasks=[],
learn_tokens=learn_tokens,
min_action_freq=min_action_freq,
# At some point in the future we can try to implement support for
# partial annotations, perhaps only in the beam objective.
incorrect_spans_key=None,
scorer=scorer,
)
def parser_score(examples, **kwargs):
"""Score a batch of examples.
@ -246,7 +66,6 @@ def parser_score(examples, **kwargs):
return results
@registry.scorers("spacy.parser_scorer.v1")
def make_parser_scorer():
return parser_score
@ -346,3 +165,14 @@ cdef class DependencyParser(Parser):
# because we instead have a label frequency cut-off and back off rare
# labels to 'dep'.
pass
# Setup backwards compatibility hook for factories
def __getattr__(name):
if name == "make_parser":
module = importlib.import_module("spacy.pipeline.factories")
return module.make_parser
elif name == "make_beam_parser":
module = importlib.import_module("spacy.pipeline.factories")
return module.make_beam_parser
raise AttributeError(f"module {__name__} has no attribute {name}")

View File
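
The parser and beam_parser factories documented above are now registered in spacy.pipeline.factories but are still added by name; a minimal sketch with a few of the documented settings overridden (illustrative values, not recommendations):

import spacy

nlp = spacy.blank("en")
parser = nlp.add_pipe(
    "parser",
    config={
        "learn_tokens": False,            # don't merge over-segmented subtokens
        "min_action_freq": 30,            # back off rarer labels to "dep"
        "update_with_oracle_cut_size": 100,
    },
)
print(nlp.pipe_names)  # ['parser']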

@ -1,3 +1,5 @@
import importlib
import sys
from collections import Counter
from itertools import islice
from typing import Any, Callable, Dict, Iterable, List, Optional, Tuple, cast
@ -39,43 +41,6 @@ subword_features = true
DEFAULT_EDIT_TREE_LEMMATIZER_MODEL = Config().from_str(default_model_config)["model"]
@Language.factory(
"trainable_lemmatizer",
assigns=["token.lemma"],
requires=[],
default_config={
"model": DEFAULT_EDIT_TREE_LEMMATIZER_MODEL,
"backoff": "orth",
"min_tree_freq": 3,
"overwrite": False,
"top_k": 1,
"scorer": {"@scorers": "spacy.lemmatizer_scorer.v1"},
},
default_score_weights={"lemma_acc": 1.0},
)
def make_edit_tree_lemmatizer(
nlp: Language,
name: str,
model: Model,
backoff: Optional[str],
min_tree_freq: int,
overwrite: bool,
top_k: int,
scorer: Optional[Callable],
):
"""Construct an EditTreeLemmatizer component."""
return EditTreeLemmatizer(
nlp.vocab,
model,
name,
backoff=backoff,
min_tree_freq=min_tree_freq,
overwrite=overwrite,
top_k=top_k,
scorer=scorer,
)
class EditTreeLemmatizer(TrainablePipe):
"""
Lemmatizer that lemmatizes each word using a predicted edit tree.
@ -421,3 +386,11 @@ class EditTreeLemmatizer(TrainablePipe):
self.tree2label[tree_id] = len(self.cfg["labels"])
self.cfg["labels"].append(tree_id)
return self.tree2label[tree_id]
# Setup backwards compatibility hook for factories
def __getattr__(name):
if name == "make_edit_tree_lemmatizer":
module = importlib.import_module("spacy.pipeline.factories")
return module.make_edit_tree_lemmatizer
raise AttributeError(f"module {__name__} has no attribute {name}")

View File
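
Likewise for the trainable_lemmatizer: it is added by name, with the settings from the removed default_config still available as config overrides. A small sketch with illustrative values:

import spacy

nlp = spacy.blank("en")
lemmatizer = nlp.add_pipe(
    "trainable_lemmatizer",
    config={"backoff": "orth", "min_tree_freq": 3, "top_k": 1},
)
print(nlp.pipe_names)  # ['trainable_lemmatizer']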

@ -1,4 +1,6 @@
import importlib
import random
import sys
from itertools import islice
from pathlib import Path
from typing import Any, Callable, Dict, Iterable, List, Optional, Union
@ -40,117 +42,10 @@ subword_features = true
DEFAULT_NEL_MODEL = Config().from_str(default_model_config)["model"]
@Language.factory(
"entity_linker",
requires=["doc.ents", "doc.sents", "token.ent_iob", "token.ent_type"],
assigns=["token.ent_kb_id"],
default_config={
"model": DEFAULT_NEL_MODEL,
"labels_discard": [],
"n_sents": 0,
"incl_prior": True,
"incl_context": True,
"entity_vector_length": 64,
"get_candidates": {"@misc": "spacy.CandidateGenerator.v1"},
"get_candidates_batch": {"@misc": "spacy.CandidateBatchGenerator.v1"},
"generate_empty_kb": {"@misc": "spacy.EmptyKB.v2"},
"overwrite": True,
"scorer": {"@scorers": "spacy.entity_linker_scorer.v1"},
"use_gold_ents": True,
"candidates_batch_size": 1,
"threshold": None,
},
default_score_weights={
"nel_micro_f": 1.0,
"nel_micro_r": None,
"nel_micro_p": None,
},
)
def make_entity_linker(
nlp: Language,
name: str,
model: Model,
*,
labels_discard: Iterable[str],
n_sents: int,
incl_prior: bool,
incl_context: bool,
entity_vector_length: int,
get_candidates: Callable[[KnowledgeBase, Span], Iterable[Candidate]],
get_candidates_batch: Callable[
[KnowledgeBase, Iterable[Span]], Iterable[Iterable[Candidate]]
],
generate_empty_kb: Callable[[Vocab, int], KnowledgeBase],
overwrite: bool,
scorer: Optional[Callable],
use_gold_ents: bool,
candidates_batch_size: int,
threshold: Optional[float] = None,
):
"""Construct an EntityLinker component.
model (Model[List[Doc], Floats2d]): A model that learns document vector
representations. Given a batch of Doc objects, it should return a single
array, with one row per item in the batch.
labels_discard (Iterable[str]): NER labels that will automatically get a "NIL" prediction.
n_sents (int): The number of neighbouring sentences to take into account.
incl_prior (bool): Whether or not to include prior probabilities from the KB in the model.
incl_context (bool): Whether or not to include the local context in the model.
entity_vector_length (int): Size of encoding vectors in the KB.
get_candidates (Callable[[KnowledgeBase, Span], Iterable[Candidate]]): Function that
produces a list of candidates, given a certain knowledge base and a textual mention.
get_candidates_batch (
Callable[[KnowledgeBase, Iterable[Span]], Iterable[Iterable[Candidate]]]
): Function that produces a list of candidates, given a certain knowledge base and several textual mentions.
generate_empty_kb (Callable[[Vocab, int], KnowledgeBase]): Callable returning empty KnowledgeBase.
scorer (Optional[Callable]): The scoring method.
use_gold_ents (bool): Whether to copy entities from gold docs during training or not. If false, another
component must provide entity annotations.
candidates_batch_size (int): Size of batches for entity candidate generation.
threshold (Optional[float]): Confidence threshold for entity predictions. If confidence is below the threshold,
prediction is discarded. If None, predictions are not filtered by any threshold.
"""
if not model.attrs.get("include_span_maker", False):
# The only difference in arguments here is that use_gold_ents and threshold aren't available.
return EntityLinker_v1(
nlp.vocab,
model,
name,
labels_discard=labels_discard,
n_sents=n_sents,
incl_prior=incl_prior,
incl_context=incl_context,
entity_vector_length=entity_vector_length,
get_candidates=get_candidates,
overwrite=overwrite,
scorer=scorer,
)
return EntityLinker(
nlp.vocab,
model,
name,
labels_discard=labels_discard,
n_sents=n_sents,
incl_prior=incl_prior,
incl_context=incl_context,
entity_vector_length=entity_vector_length,
get_candidates=get_candidates,
get_candidates_batch=get_candidates_batch,
generate_empty_kb=generate_empty_kb,
overwrite=overwrite,
scorer=scorer,
use_gold_ents=use_gold_ents,
candidates_batch_size=candidates_batch_size,
threshold=threshold,
)
def entity_linker_score(examples, **kwargs):
return Scorer.score_links(examples, negative_labels=[EntityLinker.NIL], **kwargs)
@registry.scorers("spacy.entity_linker_scorer.v1")
def make_entity_linker_scorer():
return entity_linker_score
@ -676,3 +571,11 @@ class EntityLinker(TrainablePipe):
def add_label(self, label):
raise NotImplementedError
# Setup backwards compatibility hook for factories
def __getattr__(name):
if name == "make_entity_linker":
module = importlib.import_module("spacy.pipeline.factories")
return module.make_entity_linker
raise AttributeError(f"module {__name__} has no attribute {name}")


@ -1,3 +1,5 @@
import importlib
import sys
import warnings
from collections import defaultdict
from pathlib import Path
@ -19,51 +21,10 @@ DEFAULT_ENT_ID_SEP = "||"
PatternType = Dict[str, Union[str, List[Dict[str, Any]]]]
@Language.factory(
"entity_ruler",
assigns=["doc.ents", "token.ent_type", "token.ent_iob"],
default_config={
"phrase_matcher_attr": None,
"matcher_fuzzy_compare": {"@misc": "spacy.levenshtein_compare.v1"},
"validate": False,
"overwrite_ents": False,
"ent_id_sep": DEFAULT_ENT_ID_SEP,
"scorer": {"@scorers": "spacy.entity_ruler_scorer.v1"},
},
default_score_weights={
"ents_f": 1.0,
"ents_p": 0.0,
"ents_r": 0.0,
"ents_per_type": None,
},
)
def make_entity_ruler(
nlp: Language,
name: str,
phrase_matcher_attr: Optional[Union[int, str]],
matcher_fuzzy_compare: Callable,
validate: bool,
overwrite_ents: bool,
ent_id_sep: str,
scorer: Optional[Callable],
):
return EntityRuler(
nlp,
name,
phrase_matcher_attr=phrase_matcher_attr,
matcher_fuzzy_compare=matcher_fuzzy_compare,
validate=validate,
overwrite_ents=overwrite_ents,
ent_id_sep=ent_id_sep,
scorer=scorer,
)
def entity_ruler_score(examples, **kwargs):
return get_ner_prf(examples)
@registry.scorers("spacy.entity_ruler_scorer.v1")
def make_entity_ruler_scorer():
return entity_ruler_score
@ -539,3 +500,11 @@ class EntityRuler(Pipe):
srsly.write_jsonl(path, self.patterns)
else:
to_disk(path, serializers, {})
# Setup backwards compatibility hook for factories
def __getattr__(name):
if name == "make_entity_ruler":
module = importlib.import_module("spacy.pipeline.factories")
return module.make_entity_ruler
raise AttributeError(f"module {__name__} has no attribute {name}")

spacy/pipeline/factories.py (new file, 929 lines)

@ -0,0 +1,929 @@
from typing import Any, Callable, Dict, Iterable, List, Optional, Tuple, Union
from thinc.api import Model
from thinc.types import Floats2d, Ragged
from ..kb import Candidate, KnowledgeBase
from ..language import Language
from ..pipeline._parser_internals.transition_system import TransitionSystem
from ..pipeline.attributeruler import AttributeRuler
from ..pipeline.dep_parser import DEFAULT_PARSER_MODEL, DependencyParser
from ..pipeline.edit_tree_lemmatizer import (
DEFAULT_EDIT_TREE_LEMMATIZER_MODEL,
EditTreeLemmatizer,
)
# Import factory default configurations
from ..pipeline.entity_linker import DEFAULT_NEL_MODEL, EntityLinker, EntityLinker_v1
from ..pipeline.entityruler import DEFAULT_ENT_ID_SEP, EntityRuler
from ..pipeline.functions import DocCleaner, TokenSplitter
from ..pipeline.lemmatizer import Lemmatizer
from ..pipeline.morphologizer import DEFAULT_MORPH_MODEL, Morphologizer
from ..pipeline.multitask import DEFAULT_MT_MODEL, MultitaskObjective
from ..pipeline.ner import DEFAULT_NER_MODEL, EntityRecognizer
from ..pipeline.sentencizer import Sentencizer
from ..pipeline.senter import DEFAULT_SENTER_MODEL, SentenceRecognizer
from ..pipeline.span_finder import DEFAULT_SPAN_FINDER_MODEL, SpanFinder
from ..pipeline.span_ruler import DEFAULT_SPANS_KEY as SPAN_RULER_DEFAULT_SPANS_KEY
from ..pipeline.span_ruler import (
SpanRuler,
prioritize_existing_ents_filter,
prioritize_new_ents_filter,
)
from ..pipeline.spancat import (
DEFAULT_SPANCAT_MODEL,
DEFAULT_SPANCAT_SINGLELABEL_MODEL,
DEFAULT_SPANS_KEY,
SpanCategorizer,
Suggester,
)
from ..pipeline.tagger import DEFAULT_TAGGER_MODEL, Tagger
from ..pipeline.textcat import DEFAULT_SINGLE_TEXTCAT_MODEL, TextCategorizer
from ..pipeline.textcat_multilabel import (
DEFAULT_MULTI_TEXTCAT_MODEL,
MultiLabel_TextCategorizer,
)
from ..pipeline.tok2vec import DEFAULT_TOK2VEC_MODEL, Tok2Vec
from ..tokens.doc import Doc
from ..tokens.span import Span
from ..vocab import Vocab
# Global flag to track if factories have been registered
FACTORIES_REGISTERED = False
def register_factories() -> None:
"""Register all factories with the registry.
This function registers all pipeline component factories, centralizing
the registrations that were previously done with @Language.factory decorators.
"""
global FACTORIES_REGISTERED
if FACTORIES_REGISTERED:
return
# Register factories using the same pattern as Language.factory decorator
# We use Language.factory()() pattern which exactly mimics the decorator
# attributeruler
Language.factory(
"attribute_ruler",
default_config={
"validate": False,
"scorer": {"@scorers": "spacy.attribute_ruler_scorer.v1"},
},
)(make_attribute_ruler)
# entity_linker
Language.factory(
"entity_linker",
requires=["doc.ents", "doc.sents", "token.ent_iob", "token.ent_type"],
assigns=["token.ent_kb_id"],
default_config={
"model": DEFAULT_NEL_MODEL,
"labels_discard": [],
"n_sents": 0,
"incl_prior": True,
"incl_context": True,
"entity_vector_length": 64,
"get_candidates": {"@misc": "spacy.CandidateGenerator.v1"},
"get_candidates_batch": {"@misc": "spacy.CandidateBatchGenerator.v1"},
"generate_empty_kb": {"@misc": "spacy.EmptyKB.v2"},
"overwrite": True,
"scorer": {"@scorers": "spacy.entity_linker_scorer.v1"},
"use_gold_ents": True,
"candidates_batch_size": 1,
"threshold": None,
},
default_score_weights={
"nel_micro_f": 1.0,
"nel_micro_r": None,
"nel_micro_p": None,
},
)(make_entity_linker)
# entity_ruler
Language.factory(
"entity_ruler",
assigns=["doc.ents", "token.ent_type", "token.ent_iob"],
default_config={
"phrase_matcher_attr": None,
"matcher_fuzzy_compare": {"@misc": "spacy.levenshtein_compare.v1"},
"validate": False,
"overwrite_ents": False,
"ent_id_sep": DEFAULT_ENT_ID_SEP,
"scorer": {"@scorers": "spacy.entity_ruler_scorer.v1"},
},
default_score_weights={
"ents_f": 1.0,
"ents_p": 0.0,
"ents_r": 0.0,
"ents_per_type": None,
},
)(make_entity_ruler)
# lemmatizer
Language.factory(
"lemmatizer",
assigns=["token.lemma"],
default_config={
"model": None,
"mode": "lookup",
"overwrite": False,
"scorer": {"@scorers": "spacy.lemmatizer_scorer.v1"},
},
default_score_weights={"lemma_acc": 1.0},
)(make_lemmatizer)
# textcat
Language.factory(
"textcat",
assigns=["doc.cats"],
default_config={
"threshold": 0.0,
"model": DEFAULT_SINGLE_TEXTCAT_MODEL,
"scorer": {"@scorers": "spacy.textcat_scorer.v2"},
},
default_score_weights={
"cats_score": 1.0,
"cats_score_desc": None,
"cats_micro_p": None,
"cats_micro_r": None,
"cats_micro_f": None,
"cats_macro_p": None,
"cats_macro_r": None,
"cats_macro_f": None,
"cats_macro_auc": None,
"cats_f_per_type": None,
},
)(make_textcat)
# token_splitter
Language.factory(
"token_splitter",
default_config={"min_length": 25, "split_length": 10},
retokenizes=True,
)(make_token_splitter)
# doc_cleaner
Language.factory(
"doc_cleaner",
default_config={"attrs": {"tensor": None, "_.trf_data": None}, "silent": True},
)(make_doc_cleaner)
# tok2vec
Language.factory(
"tok2vec",
assigns=["doc.tensor"],
default_config={"model": DEFAULT_TOK2VEC_MODEL},
)(make_tok2vec)
# senter
Language.factory(
"senter",
assigns=["token.is_sent_start"],
default_config={
"model": DEFAULT_SENTER_MODEL,
"overwrite": False,
"scorer": {"@scorers": "spacy.senter_scorer.v1"},
},
default_score_weights={"sents_f": 1.0, "sents_p": 0.0, "sents_r": 0.0},
)(make_senter)
# morphologizer
Language.factory(
"morphologizer",
assigns=["token.morph", "token.pos"],
default_config={
"model": DEFAULT_MORPH_MODEL,
"overwrite": True,
"extend": False,
"scorer": {"@scorers": "spacy.morphologizer_scorer.v1"},
"label_smoothing": 0.0,
},
default_score_weights={
"pos_acc": 0.5,
"morph_acc": 0.5,
"morph_per_feat": None,
},
)(make_morphologizer)
# spancat
Language.factory(
"spancat",
assigns=["doc.spans"],
default_config={
"threshold": 0.5,
"spans_key": DEFAULT_SPANS_KEY,
"max_positive": None,
"model": DEFAULT_SPANCAT_MODEL,
"suggester": {"@misc": "spacy.ngram_suggester.v1", "sizes": [1, 2, 3]},
"scorer": {"@scorers": "spacy.spancat_scorer.v1"},
},
default_score_weights={"spans_sc_f": 1.0, "spans_sc_p": 0.0, "spans_sc_r": 0.0},
)(make_spancat)
# spancat_singlelabel
Language.factory(
"spancat_singlelabel",
assigns=["doc.spans"],
default_config={
"spans_key": DEFAULT_SPANS_KEY,
"model": DEFAULT_SPANCAT_SINGLELABEL_MODEL,
"negative_weight": 1.0,
"suggester": {"@misc": "spacy.ngram_suggester.v1", "sizes": [1, 2, 3]},
"scorer": {"@scorers": "spacy.spancat_scorer.v1"},
"allow_overlap": True,
},
default_score_weights={"spans_sc_f": 1.0, "spans_sc_p": 0.0, "spans_sc_r": 0.0},
)(make_spancat_singlelabel)
# future_entity_ruler
Language.factory(
"future_entity_ruler",
assigns=["doc.ents"],
default_config={
"phrase_matcher_attr": None,
"validate": False,
"overwrite_ents": False,
"scorer": {"@scorers": "spacy.entity_ruler_scorer.v1"},
"ent_id_sep": "__unused__",
"matcher_fuzzy_compare": {"@misc": "spacy.levenshtein_compare.v1"},
},
default_score_weights={
"ents_f": 1.0,
"ents_p": 0.0,
"ents_r": 0.0,
"ents_per_type": None,
},
)(make_future_entity_ruler)
# span_ruler
Language.factory(
"span_ruler",
assigns=["doc.spans"],
default_config={
"spans_key": SPAN_RULER_DEFAULT_SPANS_KEY,
"spans_filter": None,
"annotate_ents": False,
"ents_filter": {"@misc": "spacy.first_longest_spans_filter.v1"},
"phrase_matcher_attr": None,
"matcher_fuzzy_compare": {"@misc": "spacy.levenshtein_compare.v1"},
"validate": False,
"overwrite": True,
"scorer": {
"@scorers": "spacy.overlapping_labeled_spans_scorer.v1",
"spans_key": SPAN_RULER_DEFAULT_SPANS_KEY,
},
},
default_score_weights={
f"spans_{SPAN_RULER_DEFAULT_SPANS_KEY}_f": 1.0,
f"spans_{SPAN_RULER_DEFAULT_SPANS_KEY}_p": 0.0,
f"spans_{SPAN_RULER_DEFAULT_SPANS_KEY}_r": 0.0,
f"spans_{SPAN_RULER_DEFAULT_SPANS_KEY}_per_type": None,
},
)(make_span_ruler)
# trainable_lemmatizer
Language.factory(
"trainable_lemmatizer",
assigns=["token.lemma"],
requires=[],
default_config={
"model": DEFAULT_EDIT_TREE_LEMMATIZER_MODEL,
"backoff": "orth",
"min_tree_freq": 3,
"overwrite": False,
"top_k": 1,
"scorer": {"@scorers": "spacy.lemmatizer_scorer.v1"},
},
default_score_weights={"lemma_acc": 1.0},
)(make_edit_tree_lemmatizer)
# textcat_multilabel
Language.factory(
"textcat_multilabel",
assigns=["doc.cats"],
default_config={
"threshold": 0.5,
"model": DEFAULT_MULTI_TEXTCAT_MODEL,
"scorer": {"@scorers": "spacy.textcat_multilabel_scorer.v2"},
},
default_score_weights={
"cats_score": 1.0,
"cats_score_desc": None,
"cats_micro_p": None,
"cats_micro_r": None,
"cats_micro_f": None,
"cats_macro_p": None,
"cats_macro_r": None,
"cats_macro_f": None,
"cats_macro_auc": None,
"cats_f_per_type": None,
},
)(make_multilabel_textcat)
# span_finder
Language.factory(
"span_finder",
assigns=["doc.spans"],
default_config={
"threshold": 0.5,
"model": DEFAULT_SPAN_FINDER_MODEL,
"spans_key": DEFAULT_SPANS_KEY,
"max_length": 25,
"min_length": None,
"scorer": {"@scorers": "spacy.span_finder_scorer.v1"},
},
default_score_weights={
f"spans_{DEFAULT_SPANS_KEY}_f": 1.0,
f"spans_{DEFAULT_SPANS_KEY}_p": 0.0,
f"spans_{DEFAULT_SPANS_KEY}_r": 0.0,
},
)(make_span_finder)
# ner
Language.factory(
"ner",
assigns=["doc.ents", "token.ent_iob", "token.ent_type"],
default_config={
"moves": None,
"update_with_oracle_cut_size": 100,
"model": DEFAULT_NER_MODEL,
"incorrect_spans_key": None,
"scorer": {"@scorers": "spacy.ner_scorer.v1"},
},
default_score_weights={
"ents_f": 1.0,
"ents_p": 0.0,
"ents_r": 0.0,
"ents_per_type": None,
},
)(make_ner)
# beam_ner
Language.factory(
"beam_ner",
assigns=["doc.ents", "token.ent_iob", "token.ent_type"],
default_config={
"moves": None,
"update_with_oracle_cut_size": 100,
"model": DEFAULT_NER_MODEL,
"beam_density": 0.01,
"beam_update_prob": 0.5,
"beam_width": 32,
"incorrect_spans_key": None,
"scorer": {"@scorers": "spacy.ner_scorer.v1"},
},
default_score_weights={
"ents_f": 1.0,
"ents_p": 0.0,
"ents_r": 0.0,
"ents_per_type": None,
},
)(make_beam_ner)
# parser
Language.factory(
"parser",
assigns=["token.dep", "token.head", "token.is_sent_start", "doc.sents"],
default_config={
"moves": None,
"update_with_oracle_cut_size": 100,
"learn_tokens": False,
"min_action_freq": 30,
"model": DEFAULT_PARSER_MODEL,
"scorer": {"@scorers": "spacy.parser_scorer.v1"},
},
default_score_weights={
"dep_uas": 0.5,
"dep_las": 0.5,
"dep_las_per_type": None,
"sents_p": None,
"sents_r": None,
"sents_f": 0.0,
},
)(make_parser)
# beam_parser
Language.factory(
"beam_parser",
assigns=["token.dep", "token.head", "token.is_sent_start", "doc.sents"],
default_config={
"moves": None,
"update_with_oracle_cut_size": 100,
"learn_tokens": False,
"min_action_freq": 30,
"beam_width": 8,
"beam_density": 0.0001,
"beam_update_prob": 0.5,
"model": DEFAULT_PARSER_MODEL,
"scorer": {"@scorers": "spacy.parser_scorer.v1"},
},
default_score_weights={
"dep_uas": 0.5,
"dep_las": 0.5,
"dep_las_per_type": None,
"sents_p": None,
"sents_r": None,
"sents_f": 0.0,
},
)(make_beam_parser)
# tagger
Language.factory(
"tagger",
assigns=["token.tag"],
default_config={
"model": DEFAULT_TAGGER_MODEL,
"overwrite": False,
"scorer": {"@scorers": "spacy.tagger_scorer.v1"},
"neg_prefix": "!",
"label_smoothing": 0.0,
},
default_score_weights={
"tag_acc": 1.0,
"pos_acc": 0.0,
"tag_micro_p": None,
"tag_micro_r": None,
"tag_micro_f": None,
},
)(make_tagger)
# nn_labeller
Language.factory(
"nn_labeller",
default_config={
"labels": None,
"target": "dep_tag_offset",
"model": DEFAULT_MT_MODEL,
},
)(make_nn_labeller)
# sentencizer
Language.factory(
"sentencizer",
assigns=["token.is_sent_start", "doc.sents"],
default_config={
"punct_chars": None,
"overwrite": False,
"scorer": {"@scorers": "spacy.senter_scorer.v1"},
},
default_score_weights={"sents_f": 1.0, "sents_p": 0.0, "sents_r": 0.0},
)(make_sentencizer)
# Set the flag to indicate that all factories have been registered
FACTORIES_REGISTERED = True
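Each Language.factory(...)(make_...) call above is simply the decorator form written out longhand: a decorator is an ordinary callable applied to the function, so both spellings register the same factory, and the FACTORIES_REGISTERED flag just makes the routine safe to call more than once. A short sketch with a hypothetical component (the name debug_logger and its config are illustrative, not part of spaCy):

import spacy
from spacy.language import Language

def make_debug_logger(nlp: Language, name: str, verbose: bool):
    def debug_logger(doc):
        if verbose:
            print(f"{name}: {len(doc)} tokens")
        return doc
    return debug_logger

# Decorator form:  @Language.factory("debug_logger", default_config={"verbose": False})
# Call form, as used in register_factories():
Language.factory("debug_logger", default_config={"verbose": False})(make_debug_logger)

nlp = spacy.blank("en")
nlp.add_pipe("debug_logger", config={"verbose": True})
doc = nlp("hello world")  # the component prints "debug_logger: 2 tokens"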
# We can't have function implementations for these factories in Cython, because
# we need to build a Pydantic model for them dynamically, reading their argument
# structure from the signature. In Cython 3, this doesn't work because the
# from __future__ import annotations semantics are used, which means the types
# are stored as strings.
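The restriction described in the comment block above is a consequence of PEP 563: under from __future__ import annotations (the semantics Cython 3 applies), a function's annotations are stored as strings, so anything that builds a Pydantic model by inspecting the signature sees 'int' rather than the int type and must resolve it against the right namespace. A small plain-Python illustration, independent of spaCy:

from __future__ import annotations

import inspect
import typing

def factory(nlp, name: str, top_k: int = 1):
    ...

sig = inspect.signature(factory)
print(repr(sig.parameters["top_k"].annotation))   # 'int' -- the raw annotation is a string
print(typing.get_type_hints(factory)["top_k"])    # <class 'int'> -- resolved lazily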
def make_sentencizer(
nlp: Language,
name: str,
punct_chars: Optional[List[str]],
overwrite: bool,
scorer: Optional[Callable],
):
return Sentencizer(
name, punct_chars=punct_chars, overwrite=overwrite, scorer=scorer
)
def make_attribute_ruler(
nlp: Language, name: str, validate: bool, scorer: Optional[Callable]
):
return AttributeRuler(nlp.vocab, name, validate=validate, scorer=scorer)
def make_entity_linker(
nlp: Language,
name: str,
model: Model,
*,
labels_discard: Iterable[str],
n_sents: int,
incl_prior: bool,
incl_context: bool,
entity_vector_length: int,
get_candidates: Callable[[KnowledgeBase, Span], Iterable[Candidate]],
get_candidates_batch: Callable[
[KnowledgeBase, Iterable[Span]], Iterable[Iterable[Candidate]]
],
generate_empty_kb: Callable[[Vocab, int], KnowledgeBase],
overwrite: bool,
scorer: Optional[Callable],
use_gold_ents: bool,
candidates_batch_size: int,
threshold: Optional[float] = None,
):
if not model.attrs.get("include_span_maker", False):
# The only difference in arguments here is that use_gold_ents and threshold aren't available.
return EntityLinker_v1(
nlp.vocab,
model,
name,
labels_discard=labels_discard,
n_sents=n_sents,
incl_prior=incl_prior,
incl_context=incl_context,
entity_vector_length=entity_vector_length,
get_candidates=get_candidates,
overwrite=overwrite,
scorer=scorer,
)
return EntityLinker(
nlp.vocab,
model,
name,
labels_discard=labels_discard,
n_sents=n_sents,
incl_prior=incl_prior,
incl_context=incl_context,
entity_vector_length=entity_vector_length,
get_candidates=get_candidates,
get_candidates_batch=get_candidates_batch,
generate_empty_kb=generate_empty_kb,
overwrite=overwrite,
scorer=scorer,
use_gold_ents=use_gold_ents,
candidates_batch_size=candidates_batch_size,
threshold=threshold,
)
def make_lemmatizer(
nlp: Language,
model: Optional[Model],
name: str,
mode: str,
overwrite: bool,
scorer: Optional[Callable],
):
return Lemmatizer(
nlp.vocab, model, name, mode=mode, overwrite=overwrite, scorer=scorer
)
def make_textcat(
nlp: Language,
name: str,
model: Model[List[Doc], List[Floats2d]],
threshold: float,
scorer: Optional[Callable],
) -> TextCategorizer:
return TextCategorizer(nlp.vocab, model, name, threshold=threshold, scorer=scorer)
def make_token_splitter(
nlp: Language, name: str, *, min_length: int = 0, split_length: int = 0
):
return TokenSplitter(min_length=min_length, split_length=split_length)
def make_doc_cleaner(nlp: Language, name: str, *, attrs: Dict[str, Any], silent: bool):
return DocCleaner(attrs, silent=silent)
def make_tok2vec(nlp: Language, name: str, model: Model) -> Tok2Vec:
return Tok2Vec(nlp.vocab, model, name)
def make_spancat(
nlp: Language,
name: str,
suggester: Suggester,
model: Model[Tuple[List[Doc], Ragged], Floats2d],
spans_key: str,
scorer: Optional[Callable],
threshold: float,
max_positive: Optional[int],
) -> SpanCategorizer:
return SpanCategorizer(
nlp.vocab,
model=model,
suggester=suggester,
name=name,
spans_key=spans_key,
negative_weight=None,
allow_overlap=True,
max_positive=max_positive,
threshold=threshold,
scorer=scorer,
add_negative_label=False,
)
def make_spancat_singlelabel(
nlp: Language,
name: str,
suggester: Suggester,
model: Model[Tuple[List[Doc], Ragged], Floats2d],
spans_key: str,
negative_weight: float,
allow_overlap: bool,
scorer: Optional[Callable],
) -> SpanCategorizer:
return SpanCategorizer(
nlp.vocab,
model=model,
suggester=suggester,
name=name,
spans_key=spans_key,
negative_weight=negative_weight,
allow_overlap=allow_overlap,
max_positive=1,
add_negative_label=True,
threshold=None,
scorer=scorer,
)
def make_future_entity_ruler(
nlp: Language,
name: str,
phrase_matcher_attr: Optional[Union[int, str]],
matcher_fuzzy_compare: Callable,
validate: bool,
overwrite_ents: bool,
scorer: Optional[Callable],
ent_id_sep: str,
):
if overwrite_ents:
ents_filter = prioritize_new_ents_filter
else:
ents_filter = prioritize_existing_ents_filter
return SpanRuler(
nlp,
name,
spans_key=None,
spans_filter=None,
annotate_ents=True,
ents_filter=ents_filter,
phrase_matcher_attr=phrase_matcher_attr,
matcher_fuzzy_compare=matcher_fuzzy_compare,
validate=validate,
overwrite=False,
scorer=scorer,
)
def make_entity_ruler(
nlp: Language,
name: str,
phrase_matcher_attr: Optional[Union[int, str]],
matcher_fuzzy_compare: Callable,
validate: bool,
overwrite_ents: bool,
ent_id_sep: str,
scorer: Optional[Callable],
):
return EntityRuler(
nlp,
name,
phrase_matcher_attr=phrase_matcher_attr,
matcher_fuzzy_compare=matcher_fuzzy_compare,
validate=validate,
overwrite_ents=overwrite_ents,
ent_id_sep=ent_id_sep,
scorer=scorer,
)
def make_span_ruler(
nlp: Language,
name: str,
spans_key: Optional[str],
spans_filter: Optional[Callable[[Iterable[Span], Iterable[Span]], Iterable[Span]]],
annotate_ents: bool,
ents_filter: Callable[[Iterable[Span], Iterable[Span]], Iterable[Span]],
phrase_matcher_attr: Optional[Union[int, str]],
matcher_fuzzy_compare: Callable,
validate: bool,
overwrite: bool,
scorer: Optional[Callable],
):
return SpanRuler(
nlp,
name,
spans_key=spans_key,
spans_filter=spans_filter,
annotate_ents=annotate_ents,
ents_filter=ents_filter,
phrase_matcher_attr=phrase_matcher_attr,
matcher_fuzzy_compare=matcher_fuzzy_compare,
validate=validate,
overwrite=overwrite,
scorer=scorer,
)
def make_edit_tree_lemmatizer(
nlp: Language,
name: str,
model: Model,
backoff: Optional[str],
min_tree_freq: int,
overwrite: bool,
top_k: int,
scorer: Optional[Callable],
):
return EditTreeLemmatizer(
nlp.vocab,
model,
name,
backoff=backoff,
min_tree_freq=min_tree_freq,
overwrite=overwrite,
top_k=top_k,
scorer=scorer,
)
def make_multilabel_textcat(
nlp: Language,
name: str,
model: Model[List[Doc], List[Floats2d]],
threshold: float,
scorer: Optional[Callable],
) -> MultiLabel_TextCategorizer:
return MultiLabel_TextCategorizer(
nlp.vocab, model, name, threshold=threshold, scorer=scorer
)
def make_span_finder(
nlp: Language,
name: str,
model: Model[Iterable[Doc], Floats2d],
spans_key: str,
threshold: float,
max_length: Optional[int],
min_length: Optional[int],
scorer: Optional[Callable],
) -> SpanFinder:
return SpanFinder(
nlp,
model=model,
threshold=threshold,
name=name,
scorer=scorer,
max_length=max_length,
min_length=min_length,
spans_key=spans_key,
)
def make_ner(
nlp: Language,
name: str,
model: Model,
moves: Optional[TransitionSystem],
update_with_oracle_cut_size: int,
incorrect_spans_key: Optional[str],
scorer: Optional[Callable],
):
return EntityRecognizer(
nlp.vocab,
model,
name=name,
moves=moves,
update_with_oracle_cut_size=update_with_oracle_cut_size,
incorrect_spans_key=incorrect_spans_key,
scorer=scorer,
)
def make_beam_ner(
nlp: Language,
name: str,
model: Model,
moves: Optional[TransitionSystem],
update_with_oracle_cut_size: int,
beam_width: int,
beam_density: float,
beam_update_prob: float,
incorrect_spans_key: Optional[str],
scorer: Optional[Callable],
):
return EntityRecognizer(
nlp.vocab,
model,
name=name,
moves=moves,
update_with_oracle_cut_size=update_with_oracle_cut_size,
beam_width=beam_width,
beam_density=beam_density,
beam_update_prob=beam_update_prob,
incorrect_spans_key=incorrect_spans_key,
scorer=scorer,
)
def make_parser(
nlp: Language,
name: str,
model: Model,
moves: Optional[TransitionSystem],
update_with_oracle_cut_size: int,
learn_tokens: bool,
min_action_freq: int,
scorer: Optional[Callable],
):
return DependencyParser(
nlp.vocab,
model,
name=name,
moves=moves,
update_with_oracle_cut_size=update_with_oracle_cut_size,
learn_tokens=learn_tokens,
min_action_freq=min_action_freq,
scorer=scorer,
)
def make_beam_parser(
nlp: Language,
name: str,
model: Model,
moves: Optional[TransitionSystem],
update_with_oracle_cut_size: int,
learn_tokens: bool,
min_action_freq: int,
beam_width: int,
beam_density: float,
beam_update_prob: float,
scorer: Optional[Callable],
):
return DependencyParser(
nlp.vocab,
model,
name=name,
moves=moves,
update_with_oracle_cut_size=update_with_oracle_cut_size,
learn_tokens=learn_tokens,
min_action_freq=min_action_freq,
beam_width=beam_width,
beam_density=beam_density,
beam_update_prob=beam_update_prob,
scorer=scorer,
)
def make_tagger(
nlp: Language,
name: str,
model: Model,
overwrite: bool,
scorer: Optional[Callable],
neg_prefix: str,
label_smoothing: float,
):
return Tagger(
nlp.vocab,
model,
name=name,
overwrite=overwrite,
scorer=scorer,
neg_prefix=neg_prefix,
label_smoothing=label_smoothing,
)
def make_nn_labeller(
nlp: Language, name: str, model: Model, labels: Optional[dict], target: str
):
return MultitaskObjective(nlp.vocab, model, name, target=target)
def make_morphologizer(
nlp: Language,
model: Model,
name: str,
overwrite: bool,
extend: bool,
label_smoothing: float,
scorer: Optional[Callable],
):
return Morphologizer(
nlp.vocab,
model,
name,
overwrite=overwrite,
extend=extend,
label_smoothing=label_smoothing,
scorer=scorer,
)
def make_senter(
nlp: Language, name: str, model: Model, overwrite: bool, scorer: Optional[Callable]
):
return SentenceRecognizer(
nlp.vocab, model, name, overwrite=overwrite, scorer=scorer
)


@ -1,3 +1,5 @@
import importlib
import sys
import warnings
from typing import Any, Dict
@ -73,17 +75,6 @@ def merge_subtokens(doc: Doc, label: str = "subtok") -> Doc:
return doc
@Language.factory(
"token_splitter",
default_config={"min_length": 25, "split_length": 10},
retokenizes=True,
)
def make_token_splitter(
nlp: Language, name: str, *, min_length: int = 0, split_length: int = 0
):
return TokenSplitter(min_length=min_length, split_length=split_length)
class TokenSplitter:
def __init__(self, min_length: int = 0, split_length: int = 0):
self.min_length = min_length
@ -141,14 +132,6 @@ class TokenSplitter:
util.from_disk(path, serializers, [])
@Language.factory(
"doc_cleaner",
default_config={"attrs": {"tensor": None, "_.trf_data": None}, "silent": True},
)
def make_doc_cleaner(nlp: Language, name: str, *, attrs: Dict[str, Any], silent: bool):
return DocCleaner(attrs, silent=silent)
class DocCleaner:
def __init__(self, attrs: Dict[str, Any], *, silent: bool = True):
self.cfg: Dict[str, Any] = {"attrs": dict(attrs), "silent": silent}
@ -201,3 +184,14 @@ class DocCleaner:
"cfg": lambda p: self.cfg.update(srsly.read_json(p)),
}
util.from_disk(path, serializers, [])
# Setup backwards compatibility hook for factories
def __getattr__(name):
if name == "make_doc_cleaner":
module = importlib.import_module("spacy.pipeline.factories")
return module.make_doc_cleaner
elif name == "make_token_splitter":
module = importlib.import_module("spacy.pipeline.factories")
return module.make_token_splitter
raise AttributeError(f"module {__name__} has no attribute {name}")


@ -1,3 +1,5 @@
import importlib
import sys
import warnings
from pathlib import Path
from typing import Any, Callable, Dict, Iterable, List, Optional, Tuple, Union
@ -16,35 +18,10 @@ from ..vocab import Vocab
from .pipe import Pipe
@Language.factory(
"lemmatizer",
assigns=["token.lemma"],
default_config={
"model": None,
"mode": "lookup",
"overwrite": False,
"scorer": {"@scorers": "spacy.lemmatizer_scorer.v1"},
},
default_score_weights={"lemma_acc": 1.0},
)
def make_lemmatizer(
nlp: Language,
model: Optional[Model],
name: str,
mode: str,
overwrite: bool,
scorer: Optional[Callable],
):
return Lemmatizer(
nlp.vocab, model, name, mode=mode, overwrite=overwrite, scorer=scorer
)
def lemmatizer_score(examples: Iterable[Example], **kwargs) -> Dict[str, Any]:
return Scorer.score_token_attr(examples, "lemma", **kwargs)
@registry.scorers("spacy.lemmatizer_scorer.v1")
def make_lemmatizer_scorer():
return lemmatizer_score
@ -241,7 +218,10 @@ class Lemmatizer(Pipe):
if not form:
pass
elif form in index or not form.isalpha():
forms.append(form)
if form in index:
forms.insert(0, form)
else:
forms.append(form)
else:
oov_forms.append(form)
# Remove duplicates but preserve the ordering of applied "rules"
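The change in the hunk above alters candidate ordering rather than membership: a form that exists in the lookup index is now inserted at the front of forms instead of appended, so, assuming the first surviving candidate is the one preferred downstream, a hyphenated surface form like "co-authored" can resolve to the indexed lemma "co-author" instead of being kept as-is. A toy illustration of the two behaviours with made-up data:

index = {"co-author"}                      # toy lookup index of known lemma forms
candidates = ["co-authored", "co-author"]  # candidate forms in the order the rules produced them

forms_old, forms_new = [], []
for form in candidates:
    if form in index or not form.isalpha():
        forms_old.append(form)             # old behaviour: always append, keeping rule order
        if form in index:
            forms_new.insert(0, form)      # new behaviour: in-index forms jump to the front
        else:
            forms_new.append(form)

print(forms_old[0])  # co-authored
print(forms_new[0])  # co-author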
@ -334,3 +314,11 @@ class Lemmatizer(Pipe):
util.from_bytes(bytes_data, deserialize, exclude)
self._validate_tables()
return self
# Setup backwards compatibility hook for factories
def __getattr__(name):
if name == "make_lemmatizer":
module = importlib.import_module("spacy.pipeline.factories")
return module.make_lemmatizer
raise AttributeError(f"module {__name__} has no attribute {name}")


@ -1,4 +1,6 @@
# cython: infer_types=True, binding=True
import importlib
import sys
from itertools import islice
from typing import Callable, Dict, Optional, Union
@ -47,25 +49,6 @@ maxout_pieces = 3
DEFAULT_MORPH_MODEL = Config().from_str(default_model_config)["model"]
@Language.factory(
"morphologizer",
assigns=["token.morph", "token.pos"],
default_config={"model": DEFAULT_MORPH_MODEL, "overwrite": True, "extend": False,
"scorer": {"@scorers": "spacy.morphologizer_scorer.v1"}, "label_smoothing": 0.0},
default_score_weights={"pos_acc": 0.5, "morph_acc": 0.5, "morph_per_feat": None},
)
def make_morphologizer(
nlp: Language,
model: Model,
name: str,
overwrite: bool,
extend: bool,
label_smoothing: float,
scorer: Optional[Callable],
):
return Morphologizer(nlp.vocab, model, name, overwrite=overwrite, extend=extend, label_smoothing=label_smoothing, scorer=scorer)
def morphologizer_score(examples, **kwargs):
def morph_key_getter(token, attr):
return getattr(token, attr).key
@ -81,7 +64,6 @@ def morphologizer_score(examples, **kwargs):
return results
@registry.scorers("spacy.morphologizer_scorer.v1")
def make_morphologizer_scorer():
return morphologizer_score
@ -309,3 +291,11 @@ class Morphologizer(Tagger):
if self.model.ops.xp.isnan(loss):
raise ValueError(Errors.E910.format(name=self.name))
return float(loss), d_scores
# Setup backwards compatibility hook for factories
def __getattr__(name):
if name == "make_morphologizer":
module = importlib.import_module("spacy.pipeline.factories")
return module.make_morphologizer
raise AttributeError(f"module {__name__} has no attribute {name}")


@ -1,4 +1,6 @@
# cython: infer_types=True, binding=True
import importlib
import sys
from typing import Optional
import numpy
@ -30,14 +32,6 @@ subword_features = true
DEFAULT_MT_MODEL = Config().from_str(default_model_config)["model"]
@Language.factory(
"nn_labeller",
default_config={"labels": None, "target": "dep_tag_offset", "model": DEFAULT_MT_MODEL}
)
def make_nn_labeller(nlp: Language, name: str, model: Model, labels: Optional[dict], target: str):
return MultitaskObjective(nlp.vocab, model, name)
class MultitaskObjective(Tagger):
"""Experimental: Assist training of a parser or tagger, by training a
side-objective.
@ -213,3 +207,11 @@ class ClozeMultitask(TrainablePipe):
def add_label(self, label):
raise NotImplementedError
# Setup backwards compatibility hook for factories
def __getattr__(name):
if name == "make_nn_labeller":
module = importlib.import_module("spacy.pipeline.factories")
return module.make_nn_labeller
raise AttributeError(f"module {__name__} has no attribute {name}")


@ -1,4 +1,6 @@
# cython: infer_types=True, binding=True
import importlib
import sys
from collections import defaultdict
from typing import Callable, Optional
@ -36,154 +38,10 @@ subword_features = true
DEFAULT_NER_MODEL = Config().from_str(default_model_config)["model"]
@Language.factory(
"ner",
assigns=["doc.ents", "token.ent_iob", "token.ent_type"],
default_config={
"moves": None,
"update_with_oracle_cut_size": 100,
"model": DEFAULT_NER_MODEL,
"incorrect_spans_key": None,
"scorer": {"@scorers": "spacy.ner_scorer.v1"},
},
default_score_weights={"ents_f": 1.0, "ents_p": 0.0, "ents_r": 0.0, "ents_per_type": None},
)
def make_ner(
nlp: Language,
name: str,
model: Model,
moves: Optional[TransitionSystem],
update_with_oracle_cut_size: int,
incorrect_spans_key: Optional[str],
scorer: Optional[Callable],
):
"""Create a transition-based EntityRecognizer component. The entity recognizer
identifies non-overlapping labelled spans of tokens.
The transition-based algorithm used encodes certain assumptions that are
effective for "traditional" named entity recognition tasks, but may not be
a good fit for every span identification problem. Specifically, the loss
function optimizes for whole entity accuracy, so if your inter-annotator
agreement on boundary tokens is low, the component will likely perform poorly
on your problem. The transition-based algorithm also assumes that the most
decisive information about your entities will be close to their initial tokens.
If your entities are long and characterised by tokens in their middle, the
component will likely do poorly on your task.
model (Model): The model for the transition-based parser. The model needs
to have a specific substructure of named components --- see the
spacy.ml.tb_framework.TransitionModel for details.
moves (Optional[TransitionSystem]): This defines how the parse-state is created,
updated and evaluated. If 'moves' is None, a new instance is
created with `self.TransitionSystem()`. Defaults to `None`.
update_with_oracle_cut_size (int): During training, cut long sequences into
shorter segments by creating intermediate states based on the gold-standard
history. The model is not very sensitive to this parameter, so you usually
won't need to change it. 100 is a good default.
incorrect_spans_key (Optional[str]): Identifies spans that are known
to be incorrect entity annotations. The incorrect entity annotations
can be stored in the span group, under this key.
scorer (Optional[Callable]): The scoring method.
"""
return EntityRecognizer(
nlp.vocab,
model,
name,
moves=moves,
update_with_oracle_cut_size=update_with_oracle_cut_size,
incorrect_spans_key=incorrect_spans_key,
multitasks=[],
beam_width=1,
beam_density=0.0,
beam_update_prob=0.0,
scorer=scorer,
)
@Language.factory(
"beam_ner",
assigns=["doc.ents", "token.ent_iob", "token.ent_type"],
default_config={
"moves": None,
"update_with_oracle_cut_size": 100,
"model": DEFAULT_NER_MODEL,
"beam_density": 0.01,
"beam_update_prob": 0.5,
"beam_width": 32,
"incorrect_spans_key": None,
"scorer": None,
},
default_score_weights={"ents_f": 1.0, "ents_p": 0.0, "ents_r": 0.0, "ents_per_type": None},
)
def make_beam_ner(
nlp: Language,
name: str,
model: Model,
moves: Optional[TransitionSystem],
update_with_oracle_cut_size: int,
beam_width: int,
beam_density: float,
beam_update_prob: float,
incorrect_spans_key: Optional[str],
scorer: Optional[Callable],
):
"""Create a transition-based EntityRecognizer component that uses beam-search.
The entity recognizer identifies non-overlapping labelled spans of tokens.
The transition-based algorithm used encodes certain assumptions that are
effective for "traditional" named entity recognition tasks, but may not be
a good fit for every span identification problem. Specifically, the loss
function optimizes for whole entity accuracy, so if your inter-annotator
agreement on boundary tokens is low, the component will likely perform poorly
on your problem. The transition-based algorithm also assumes that the most
decisive information about your entities will be close to their initial tokens.
If your entities are long and characterised by tokens in their middle, the
component will likely do poorly on your task.
model (Model): The model for the transition-based parser. The model needs
to have a specific substructure of named components --- see the
spacy.ml.tb_framework.TransitionModel for details.
moves (Optional[TransitionSystem]): This defines how the parse-state is created,
updated and evaluated. If 'moves' is None, a new instance is
created with `self.TransitionSystem()`. Defaults to `None`.
update_with_oracle_cut_size (int): During training, cut long sequences into
shorter segments by creating intermediate states based on the gold-standard
history. The model is not very sensitive to this parameter, so you usually
won't need to change it. 100 is a good default.
beam_width (int): The number of candidate analyses to maintain.
beam_density (float): The minimum ratio between the scores of the first and
last candidates in the beam. This allows the parser to avoid exploring
candidates that are too far behind. This is mostly intended to improve
efficiency, but it can also improve accuracy as deeper search is not
always better.
beam_update_prob (float): The chance of making a beam update, instead of a
greedy update. Greedy updates are an approximation for the beam updates,
and are faster to compute.
incorrect_spans_key (Optional[str]): Optional key into span groups of
entities known to be non-entities.
scorer (Optional[Callable]): The scoring method.
"""
return EntityRecognizer(
nlp.vocab,
model,
name,
moves=moves,
update_with_oracle_cut_size=update_with_oracle_cut_size,
multitasks=[],
beam_width=beam_width,
beam_density=beam_density,
beam_update_prob=beam_update_prob,
incorrect_spans_key=incorrect_spans_key,
scorer=scorer,
)
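Both factories above are normally instantiated through the config system rather than called directly; the docstrings list the parameters that can be overridden. A brief usage sketch (the override values are illustrative, not recommendations):

import spacy

nlp = spacy.blank("en")
ner = nlp.add_pipe("ner")  # greedy transition-based NER with the registered defaults

# Beam-search variant, overriding two of the documented parameters.
beam_nlp = spacy.blank("en")
beam_ner = beam_nlp.add_pipe(
    "beam_ner",
    config={"beam_width": 16, "beam_update_prob": 0.5},
)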
def ner_score(examples, **kwargs):
return get_ner_prf(examples, **kwargs)
@registry.scorers("spacy.ner_scorer.v1")
def make_ner_scorer():
return ner_score
@ -261,3 +119,14 @@ cdef class EntityRecognizer(Parser):
score_dict[(start, end, label)] += score
entity_scores.append(score_dict)
return entity_scores
# Setup backwards compatibility hook for factories
def __getattr__(name):
if name == "make_ner":
module = importlib.import_module("spacy.pipeline.factories")
return module.make_ner
elif name == "make_beam_ner":
module = importlib.import_module("spacy.pipeline.factories")
return module.make_beam_ner
raise AttributeError(f"module {__name__} has no attribute {name}")


@ -21,13 +21,6 @@ cdef class Pipe:
DOCS: https://spacy.io/api/pipe
"""
@classmethod
def __init_subclass__(cls, **kwargs):
"""Raise a warning if an inheriting class implements 'begin_training'
(from v2) instead of the new 'initialize' method (from v3)"""
if hasattr(cls, "begin_training"):
warnings.warn(Warnings.W088.format(name=cls.__name__))
def __call__(self, Doc doc) -> Doc:
"""Apply the pipe to one document. The document is modified in place,
and returned. This usually happens under the hood when the nlp object


@ -1,4 +1,6 @@
# cython: infer_types=True, binding=True
import importlib
import sys
from typing import Callable, List, Optional
import srsly
@ -14,22 +16,6 @@ from .senter import senter_score
BACKWARD_OVERWRITE = False
@Language.factory(
"sentencizer",
assigns=["token.is_sent_start", "doc.sents"],
default_config={"punct_chars": None, "overwrite": False, "scorer": {"@scorers": "spacy.senter_scorer.v1"}},
default_score_weights={"sents_f": 1.0, "sents_p": 0.0, "sents_r": 0.0},
)
def make_sentencizer(
nlp: Language,
name: str,
punct_chars: Optional[List[str]],
overwrite: bool,
scorer: Optional[Callable],
):
return Sentencizer(name, punct_chars=punct_chars, overwrite=overwrite, scorer=scorer)
class Sentencizer(Pipe):
"""Segment the Doc into sentences using a rule-based strategy.
@ -181,3 +167,11 @@ class Sentencizer(Pipe):
self.punct_chars = set(cfg.get("punct_chars", self.default_punct_chars))
self.overwrite = cfg.get("overwrite", self.overwrite)
return self
# Setup backwards compatibility hook for factories
def __getattr__(name):
if name == "make_sentencizer":
module = importlib.import_module("spacy.pipeline.factories")
return module.make_sentencizer
raise AttributeError(f"module {__name__} has no attribute {name}")


@ -1,4 +1,6 @@
# cython: infer_types=True, binding=True
import importlib
import sys
from itertools import islice
from typing import Callable, Optional
@ -34,16 +36,6 @@ subword_features = true
DEFAULT_SENTER_MODEL = Config().from_str(default_model_config)["model"]
@Language.factory(
"senter",
assigns=["token.is_sent_start"],
default_config={"model": DEFAULT_SENTER_MODEL, "overwrite": False, "scorer": {"@scorers": "spacy.senter_scorer.v1"}},
default_score_weights={"sents_f": 1.0, "sents_p": 0.0, "sents_r": 0.0},
)
def make_senter(nlp: Language, name: str, model: Model, overwrite: bool, scorer: Optional[Callable]):
return SentenceRecognizer(nlp.vocab, model, name, overwrite=overwrite, scorer=scorer)
def senter_score(examples, **kwargs):
def has_sents(doc):
return doc.has_annotation("SENT_START")
@ -53,7 +45,6 @@ def senter_score(examples, **kwargs):
return results
@registry.scorers("spacy.senter_scorer.v1")
def make_senter_scorer():
return senter_score
@ -185,3 +176,11 @@ class SentenceRecognizer(Tagger):
def add_label(self, label, values=None):
raise NotImplementedError
# Setup backwards compatibility hook for factories
def __getattr__(name):
if name == "make_senter":
module = importlib.import_module("spacy.pipeline.factories")
return module.make_senter
raise AttributeError(f"module {__name__} has no attribute {name}")


@ -1,3 +1,5 @@
import importlib
import sys
from typing import Any, Callable, Dict, Iterable, List, Optional, Tuple
from thinc.api import Config, Model, Optimizer, set_dropout_rate
@ -41,63 +43,6 @@ depth = 4
DEFAULT_SPAN_FINDER_MODEL = Config().from_str(span_finder_default_config)["model"]
@Language.factory(
"span_finder",
assigns=["doc.spans"],
default_config={
"threshold": 0.5,
"model": DEFAULT_SPAN_FINDER_MODEL,
"spans_key": DEFAULT_SPANS_KEY,
"max_length": 25,
"min_length": None,
"scorer": {"@scorers": "spacy.span_finder_scorer.v1"},
},
default_score_weights={
f"spans_{DEFAULT_SPANS_KEY}_f": 1.0,
f"spans_{DEFAULT_SPANS_KEY}_p": 0.0,
f"spans_{DEFAULT_SPANS_KEY}_r": 0.0,
},
)
def make_span_finder(
nlp: Language,
name: str,
model: Model[Iterable[Doc], Floats2d],
spans_key: str,
threshold: float,
max_length: Optional[int],
min_length: Optional[int],
scorer: Optional[Callable],
) -> "SpanFinder":
"""Create a SpanFinder component. The component predicts whether a token is
the start or the end of a potential span.
model (Model[List[Doc], Floats2d]): A model instance that
is given a list of documents and predicts a probability for each token.
spans_key (str): Key of the doc.spans dict to save the spans under. During
initialization and training, the component will look for spans on the
reference document under the same key.
threshold (float): Minimum probability to consider a prediction positive.
max_length (Optional[int]): Maximum length of the produced spans, defaults
to None meaning unlimited length.
min_length (Optional[int]): Minimum length of the produced spans, defaults
to None meaning shortest span length is 1.
scorer (Optional[Callable]): The scoring method. Defaults to
Scorer.score_spans for the Doc.spans[spans_key] with overlapping
spans allowed.
"""
return SpanFinder(
nlp,
model=model,
threshold=threshold,
name=name,
scorer=scorer,
max_length=max_length,
min_length=min_length,
spans_key=spans_key,
)
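As the docstring notes, the component only proposes candidate boundaries, writing them to doc.spans[spans_key]; labelling those spans is left to a downstream component such as spancat. A brief sketch with illustrative override values:

import spacy

nlp = spacy.blank("en")
span_finder = nlp.add_pipe(
    "span_finder",
    config={"spans_key": "sc", "max_length": 10, "threshold": 0.5},
)
# After training, candidate spans appear under doc.spans["sc"] for a later
# component (e.g. spancat) to label.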
@registry.scorers("spacy.span_finder_scorer.v1")
def make_span_finder_scorer():
return span_finder_score
@ -333,3 +278,11 @@ class SpanFinder(TrainablePipe):
self.model.initialize(X=docs, Y=Y)
else:
self.model.initialize()
# Setup backwards compatibility hook for factories
def __getattr__(name):
if name == "make_span_finder":
module = importlib.import_module("spacy.pipeline.factories")
return module.make_span_finder
raise AttributeError(f"module {__name__} has no attribute {name}")


@ -1,3 +1,5 @@
import importlib
import sys
import warnings
from functools import partial
from pathlib import Path
@ -32,105 +34,6 @@ PatternType = Dict[str, Union[str, List[Dict[str, Any]]]]
DEFAULT_SPANS_KEY = "ruler"
@Language.factory(
"future_entity_ruler",
assigns=["doc.ents"],
default_config={
"phrase_matcher_attr": None,
"validate": False,
"overwrite_ents": False,
"scorer": {"@scorers": "spacy.entity_ruler_scorer.v1"},
"ent_id_sep": "__unused__",
"matcher_fuzzy_compare": {"@misc": "spacy.levenshtein_compare.v1"},
},
default_score_weights={
"ents_f": 1.0,
"ents_p": 0.0,
"ents_r": 0.0,
"ents_per_type": None,
},
)
def make_entity_ruler(
nlp: Language,
name: str,
phrase_matcher_attr: Optional[Union[int, str]],
matcher_fuzzy_compare: Callable,
validate: bool,
overwrite_ents: bool,
scorer: Optional[Callable],
ent_id_sep: str,
):
if overwrite_ents:
ents_filter = prioritize_new_ents_filter
else:
ents_filter = prioritize_existing_ents_filter
return SpanRuler(
nlp,
name,
spans_key=None,
spans_filter=None,
annotate_ents=True,
ents_filter=ents_filter,
phrase_matcher_attr=phrase_matcher_attr,
matcher_fuzzy_compare=matcher_fuzzy_compare,
validate=validate,
overwrite=False,
scorer=scorer,
)
@Language.factory(
"span_ruler",
assigns=["doc.spans"],
default_config={
"spans_key": DEFAULT_SPANS_KEY,
"spans_filter": None,
"annotate_ents": False,
"ents_filter": {"@misc": "spacy.first_longest_spans_filter.v1"},
"phrase_matcher_attr": None,
"matcher_fuzzy_compare": {"@misc": "spacy.levenshtein_compare.v1"},
"validate": False,
"overwrite": True,
"scorer": {
"@scorers": "spacy.overlapping_labeled_spans_scorer.v1",
"spans_key": DEFAULT_SPANS_KEY,
},
},
default_score_weights={
f"spans_{DEFAULT_SPANS_KEY}_f": 1.0,
f"spans_{DEFAULT_SPANS_KEY}_p": 0.0,
f"spans_{DEFAULT_SPANS_KEY}_r": 0.0,
f"spans_{DEFAULT_SPANS_KEY}_per_type": None,
},
)
def make_span_ruler(
nlp: Language,
name: str,
spans_key: Optional[str],
spans_filter: Optional[Callable[[Iterable[Span], Iterable[Span]], Iterable[Span]]],
annotate_ents: bool,
ents_filter: Callable[[Iterable[Span], Iterable[Span]], Iterable[Span]],
phrase_matcher_attr: Optional[Union[int, str]],
matcher_fuzzy_compare: Callable,
validate: bool,
overwrite: bool,
scorer: Optional[Callable],
):
return SpanRuler(
nlp,
name,
spans_key=spans_key,
spans_filter=spans_filter,
annotate_ents=annotate_ents,
ents_filter=ents_filter,
phrase_matcher_attr=phrase_matcher_attr,
matcher_fuzzy_compare=matcher_fuzzy_compare,
validate=validate,
overwrite=overwrite,
scorer=scorer,
)
def prioritize_new_ents_filter(
entities: Iterable[Span], spans: Iterable[Span]
) -> List[Span]:
@ -157,7 +60,6 @@ def prioritize_new_ents_filter(
return entities + new_entities
@registry.misc("spacy.prioritize_new_ents_filter.v1")
def make_prioritize_new_ents_filter():
return prioritize_new_ents_filter
@ -188,7 +90,6 @@ def prioritize_existing_ents_filter(
return entities + new_entities
@registry.misc("spacy.prioritize_existing_ents_filter.v1")
def make_preserve_existing_ents_filter():
return prioritize_existing_ents_filter
@ -208,7 +109,6 @@ def overlapping_labeled_spans_score(
return Scorer.score_spans(examples, **kwargs)
@registry.scorers("spacy.overlapping_labeled_spans_scorer.v1")
def make_overlapping_labeled_spans_scorer(spans_key: str = DEFAULT_SPANS_KEY):
return partial(overlapping_labeled_spans_score, spans_key=spans_key)
@ -595,3 +495,14 @@ class SpanRuler(Pipe):
"patterns": lambda p: srsly.write_jsonl(p, self.patterns),
}
util.to_disk(path, serializers, {})
# Setup backwards compatibility hook for factories
def __getattr__(name):
if name == "make_span_ruler":
module = importlib.import_module("spacy.pipeline.factories")
return module.make_span_ruler
elif name == "make_entity_ruler":
module = importlib.import_module("spacy.pipeline.factories")
return module.make_future_entity_ruler
raise AttributeError(f"module {__name__} has no attribute {name}")


@ -1,3 +1,5 @@
import importlib
import sys
from dataclasses import dataclass
from functools import partial
from typing import Any, Callable, Dict, Iterable, List, Optional, Tuple, Union, cast
@ -134,7 +136,6 @@ def preset_spans_suggester(
return output
@registry.misc("spacy.ngram_suggester.v1")
def build_ngram_suggester(sizes: List[int]) -> Suggester:
"""Suggest all spans of the given lengths. Spans are returned as a ragged
array of integers. The array has two columns, indicating the start and end
@ -143,7 +144,6 @@ def build_ngram_suggester(sizes: List[int]) -> Suggester:
return partial(ngram_suggester, sizes=sizes)
@registry.misc("spacy.ngram_range_suggester.v1")
def build_ngram_range_suggester(min_size: int, max_size: int) -> Suggester:
"""Suggest all spans of the given lengths between a given min and max value - both inclusive.
Spans are returned as a ragged array of integers. The array has two columns,
@ -152,7 +152,6 @@ def build_ngram_range_suggester(min_size: int, max_size: int) -> Suggester:
return build_ngram_suggester(sizes)
@registry.misc("spacy.preset_spans_suggester.v1")
def build_preset_spans_suggester(spans_key: str) -> Suggester:
"""Suggest all spans that are already stored in doc.spans[spans_key].
This is useful when an upstream component is used to set the spans
@ -160,136 +159,6 @@ def build_preset_spans_suggester(spans_key: str) -> Suggester:
return partial(preset_spans_suggester, spans_key=spans_key)
@Language.factory(
"spancat",
assigns=["doc.spans"],
default_config={
"threshold": 0.5,
"spans_key": DEFAULT_SPANS_KEY,
"max_positive": None,
"model": DEFAULT_SPANCAT_MODEL,
"suggester": {"@misc": "spacy.ngram_suggester.v1", "sizes": [1, 2, 3]},
"scorer": {"@scorers": "spacy.spancat_scorer.v1"},
},
default_score_weights={"spans_sc_f": 1.0, "spans_sc_p": 0.0, "spans_sc_r": 0.0},
)
def make_spancat(
nlp: Language,
name: str,
suggester: Suggester,
model: Model[Tuple[List[Doc], Ragged], Floats2d],
spans_key: str,
scorer: Optional[Callable],
threshold: float,
max_positive: Optional[int],
) -> "SpanCategorizer":
"""Create a SpanCategorizer component and configure it for multi-label
classification to be able to assign multiple labels for each span.
The span categorizer consists of two
parts: a suggester function that proposes candidate spans, and a labeller
model that predicts one or more labels for each span.
name (str): The component instance name, used to add entries to the
losses during training.
suggester (Callable[[Iterable[Doc], Optional[Ops]], Ragged]): A function that suggests spans.
Spans are returned as a ragged array with two integer columns, for the
start and end positions.
model (Model[Tuple[List[Doc], Ragged], Floats2d]): A model instance that
is given a list of documents and (start, end) indices representing
candidate span offsets. The model predicts a probability for each category
for each span.
spans_key (str): Key of the doc.spans dict to save the spans under. During
initialization and training, the component will look for spans on the
reference document under the same key.
scorer (Optional[Callable]): The scoring method. Defaults to
Scorer.score_spans for the Doc.spans[spans_key] with overlapping
spans allowed.
threshold (float): Minimum probability to consider a prediction positive.
Spans with a positive prediction will be saved on the Doc. Defaults to
0.5.
max_positive (Optional[int]): Maximum number of labels to consider positive
per span. Defaults to None, indicating no limit.
"""
return SpanCategorizer(
nlp.vocab,
model=model,
suggester=suggester,
name=name,
spans_key=spans_key,
negative_weight=None,
allow_overlap=True,
max_positive=max_positive,
threshold=threshold,
scorer=scorer,
add_negative_label=False,
)
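The two-part design described in the docstring (a suggester proposing candidate spans plus a labelling model) is wired up entirely through the factory config, with the suggester resolved from the @misc registry. A short sketch using the ngram suggester registered in this module (label and sizes are illustrative):

import spacy

nlp = spacy.blank("en")
spancat = nlp.add_pipe(
    "spancat",
    config={
        "spans_key": "sc",
        "suggester": {"@misc": "spacy.ngram_suggester.v1", "sizes": [1, 2, 3]},
        "threshold": 0.5,
    },
)
spancat.add_label("PERSON")  # labels can also be inferred from the training data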
@Language.factory(
"spancat_singlelabel",
assigns=["doc.spans"],
default_config={
"spans_key": DEFAULT_SPANS_KEY,
"model": DEFAULT_SPANCAT_SINGLELABEL_MODEL,
"negative_weight": 1.0,
"suggester": {"@misc": "spacy.ngram_suggester.v1", "sizes": [1, 2, 3]},
"scorer": {"@scorers": "spacy.spancat_scorer.v1"},
"allow_overlap": True,
},
default_score_weights={"spans_sc_f": 1.0, "spans_sc_p": 0.0, "spans_sc_r": 0.0},
)
def make_spancat_singlelabel(
nlp: Language,
name: str,
suggester: Suggester,
model: Model[Tuple[List[Doc], Ragged], Floats2d],
spans_key: str,
negative_weight: float,
allow_overlap: bool,
scorer: Optional[Callable],
) -> "SpanCategorizer":
"""Create a SpanCategorizer component and configure it for multi-class
classification. With this configuration each span can get at most one
label. The span categorizer consists of two
parts: a suggester function that proposes candidate spans, and a labeller
model that predicts one or more labels for each span.
name (str): The component instance name, used to add entries to the
losses during training.
suggester (Callable[[Iterable[Doc], Optional[Ops]], Ragged]): A function that suggests spans.
Spans are returned as a ragged array with two integer columns, for the
start and end positions.
model (Model[Tuple[List[Doc], Ragged], Floats2d]): A model instance that
is given a list of documents and (start, end) indices representing
candidate span offsets. The model predicts a probability for each category
for each span.
spans_key (str): Key of the doc.spans dict to save the spans under. During
initialization and training, the component will look for spans on the
reference document under the same key.
scorer (Optional[Callable]): The scoring method. Defaults to
Scorer.score_spans for the Doc.spans[spans_key] with overlapping
spans allowed.
negative_weight (float): Multiplier for the loss terms.
Can be used to downweight the negative samples if there are too many.
allow_overlap (bool): If True the data is assumed to contain overlapping spans.
Otherwise it produces non-overlapping spans greedily prioritizing
higher assigned label scores.
"""
return SpanCategorizer(
nlp.vocab,
model=model,
suggester=suggester,
name=name,
spans_key=spans_key,
negative_weight=negative_weight,
allow_overlap=allow_overlap,
max_positive=1,
add_negative_label=True,
threshold=None,
scorer=scorer,
)
def spancat_score(examples: Iterable[Example], **kwargs) -> Dict[str, Any]:
kwargs = dict(kwargs)
attr_prefix = "spans_"
@ -303,7 +172,6 @@ def spancat_score(examples: Iterable[Example], **kwargs) -> Dict[str, Any]:
return Scorer.score_spans(examples, **kwargs)
@registry.scorers("spacy.spancat_scorer.v1")
def make_spancat_scorer():
return spancat_score
@ -785,3 +653,14 @@ class SpanCategorizer(TrainablePipe):
spans.attrs["scores"] = numpy.array(attrs_scores)
return spans
# Setup backwards compatibility hook for factories
def __getattr__(name):
if name == "make_spancat":
module = importlib.import_module("spacy.pipeline.factories")
return module.make_spancat
elif name == "make_spancat_singlelabel":
module = importlib.import_module("spacy.pipeline.factories")
return module.make_spancat_singlelabel
raise AttributeError(f"module {__name__} has no attribute {name}")


@ -1,4 +1,6 @@
# cython: infer_types=True, binding=True
import importlib
import sys
from itertools import islice
from typing import Callable, Optional
@ -35,36 +37,10 @@ subword_features = true
DEFAULT_TAGGER_MODEL = Config().from_str(default_model_config)["model"]
@Language.factory(
"tagger",
assigns=["token.tag"],
default_config={"model": DEFAULT_TAGGER_MODEL, "overwrite": False, "scorer": {"@scorers": "spacy.tagger_scorer.v1"}, "neg_prefix": "!", "label_smoothing": 0.0},
default_score_weights={"tag_acc": 1.0},
)
def make_tagger(
nlp: Language,
name: str,
model: Model,
overwrite: bool,
scorer: Optional[Callable],
neg_prefix: str,
label_smoothing: float,
):
"""Construct a part-of-speech tagger component.
model (Model[List[Doc], List[Floats2d]]): A model instance that predicts
the tag probabilities. The output vectors should match the number of tags
in size, and be normalized as probabilities (all scores between 0 and 1,
with the rows summing to 1).
"""
return Tagger(nlp.vocab, model, name, overwrite=overwrite, scorer=scorer, neg_prefix=neg_prefix, label_smoothing=label_smoothing)
def tagger_score(examples, **kwargs):
return Scorer.score_token_attr(examples, "tag", **kwargs)
@registry.scorers("spacy.tagger_scorer.v1")
def make_tagger_scorer():
return tagger_score
@ -317,3 +293,11 @@ class Tagger(TrainablePipe):
self.cfg["labels"].append(label)
self.vocab.strings.add(label)
return 1
# Setup backwards compatibility hook for factories
def __getattr__(name):
if name == "make_tagger":
module = importlib.import_module("spacy.pipeline.factories")
return module.make_tagger
raise AttributeError(f"module {__name__} has no attribute {name}")


@ -1,3 +1,5 @@
import importlib
import sys
from itertools import islice
from typing import Any, Callable, Dict, Iterable, List, Optional, Tuple
@ -74,46 +76,6 @@ subword_features = true
"""
@Language.factory(
"textcat",
assigns=["doc.cats"],
default_config={
"threshold": 0.0,
"model": DEFAULT_SINGLE_TEXTCAT_MODEL,
"scorer": {"@scorers": "spacy.textcat_scorer.v2"},
},
default_score_weights={
"cats_score": 1.0,
"cats_score_desc": None,
"cats_micro_p": None,
"cats_micro_r": None,
"cats_micro_f": None,
"cats_macro_p": None,
"cats_macro_r": None,
"cats_macro_f": None,
"cats_macro_auc": None,
"cats_f_per_type": None,
},
)
def make_textcat(
nlp: Language,
name: str,
model: Model[List[Doc], List[Floats2d]],
threshold: float,
scorer: Optional[Callable],
) -> "TextCategorizer":
"""Create a TextCategorizer component. The text categorizer predicts categories
over a whole document. It can learn one or more labels, and the labels are considered
to be mutually exclusive (i.e. one true label per doc).
model (Model[List[Doc], List[Floats2d]]): A model instance that predicts
scores for each category.
threshold (float): Cutoff to consider a prediction "positive".
scorer (Optional[Callable]): The scoring method.
"""
return TextCategorizer(nlp.vocab, model, name, threshold=threshold, scorer=scorer)
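The exclusivity assumption in the docstring is what separates textcat from textcat_multilabel further down: textcat expects exactly one true label per document, while the multilabel variant allows zero or more. A minimal sketch with illustrative labels:

import spacy

nlp = spacy.blank("en")
textcat = nlp.add_pipe("textcat")  # mutually exclusive categories
for label in ("POSITIVE", "NEGATIVE"):
    textcat.add_label(label)
# After training, doc.cats maps each label to a score, e.g. {"POSITIVE": 0.9, "NEGATIVE": 0.1}.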
def textcat_score(examples: Iterable[Example], **kwargs) -> Dict[str, Any]:
return Scorer.score_cats(
examples,
@ -123,7 +85,6 @@ def textcat_score(examples: Iterable[Example], **kwargs) -> Dict[str, Any]:
)
@registry.scorers("spacy.textcat_scorer.v2")
def make_textcat_scorer():
return textcat_score
@ -412,3 +373,11 @@ class TextCategorizer(TrainablePipe):
for val in vals:
if not (val == 1.0 or val == 0.0):
raise ValueError(Errors.E851.format(val=val))
# Setup backwards compatibility hook for factories
def __getattr__(name):
if name == "make_textcat":
module = importlib.import_module("spacy.pipeline.factories")
return module.make_textcat
raise AttributeError(f"module {__name__} has no attribute {name}")
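The factory registration itself moves to spacy.pipeline.factories, but component construction is unchanged for users. A minimal, hedged sketch of adding the component (training or nlp.initialize() follows as usual and is omitted):

# Hypothetical usage sketch (not part of the diff)
import spacy

nlp = spacy.blank("en")
textcat = nlp.add_pipe("textcat", config={"threshold": 0.0})  # same default as above
textcat.add_label("POSITIVE")
textcat.add_label("NEGATIVE")
# after training/initialization, predictions land in doc.cats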


@ -1,3 +1,5 @@
import importlib
import sys
from itertools import islice
from typing import Any, Callable, Dict, Iterable, List, Optional
@ -72,49 +74,6 @@ subword_features = true
"""
@Language.factory(
"textcat_multilabel",
assigns=["doc.cats"],
default_config={
"threshold": 0.5,
"model": DEFAULT_MULTI_TEXTCAT_MODEL,
"scorer": {"@scorers": "spacy.textcat_multilabel_scorer.v2"},
},
default_score_weights={
"cats_score": 1.0,
"cats_score_desc": None,
"cats_micro_p": None,
"cats_micro_r": None,
"cats_micro_f": None,
"cats_macro_p": None,
"cats_macro_r": None,
"cats_macro_f": None,
"cats_macro_auc": None,
"cats_f_per_type": None,
},
)
def make_multilabel_textcat(
nlp: Language,
name: str,
model: Model[List[Doc], List[Floats2d]],
threshold: float,
scorer: Optional[Callable],
) -> "MultiLabel_TextCategorizer":
"""Create a MultiLabel_TextCategorizer component. The text categorizer predicts categories
over a whole document. It can learn one or more labels, and the labels are considered
to be non-mutually exclusive, which means that there can be zero or more labels
per doc.
model (Model[List[Doc], List[Floats2d]]): A model instance that predicts
scores for each category.
threshold (float): Cutoff to consider a prediction "positive".
scorer (Optional[Callable]): The scoring method.
"""
return MultiLabel_TextCategorizer(
nlp.vocab, model, name, threshold=threshold, scorer=scorer
)
def textcat_multilabel_score(examples: Iterable[Example], **kwargs) -> Dict[str, Any]:
return Scorer.score_cats(
examples,
@ -124,7 +83,6 @@ def textcat_multilabel_score(examples: Iterable[Example], **kwargs) -> Dict[str,
)
@registry.scorers("spacy.textcat_multilabel_scorer.v2")
def make_textcat_multilabel_scorer():
return textcat_multilabel_score
@ -212,3 +170,11 @@ class MultiLabel_TextCategorizer(TextCategorizer):
for val in ex.reference.cats.values():
if not (val == 1.0 or val == 0.0):
raise ValueError(Errors.E851.format(val=val))
# Setup backwards compatibility hook for factories
def __getattr__(name):
if name == "make_multilabel_textcat":
module = importlib.import_module("spacy.pipeline.factories")
return module.make_multilabel_textcat
raise AttributeError(f"module {__name__} has no attribute {name}")


@ -1,3 +1,5 @@
import importlib
import sys
from itertools import islice
from typing import Any, Callable, Dict, Iterable, List, Optional, Sequence
@ -24,13 +26,6 @@ subword_features = true
DEFAULT_TOK2VEC_MODEL = Config().from_str(default_model_config)["model"]
@Language.factory(
"tok2vec", assigns=["doc.tensor"], default_config={"model": DEFAULT_TOK2VEC_MODEL}
)
def make_tok2vec(nlp: Language, name: str, model: Model) -> "Tok2Vec":
return Tok2Vec(nlp.vocab, model, name)
class Tok2Vec(TrainablePipe):
"""Apply a "token-to-vector" model and set its outputs in the doc.tensor
attribute. This is mostly useful to share a single subnetwork between multiple
@ -320,3 +315,11 @@ def forward(model: Tok2VecListener, inputs, is_train: bool):
def _empty_backprop(dX): # for pickling
return []
# Setup backwards compatibility hook for factories
def __getattr__(name):
if name == "make_tok2vec":
module = importlib.import_module("spacy.pipeline.factories")
return module.make_tok2vec
raise AttributeError(f"module {__name__} has no attribute {name}")


@ -19,7 +19,7 @@ cdef class Parser(TrainablePipe):
StateC** states,
WeightsC weights,
SizesC sizes
) nogil
) noexcept nogil
cdef void c_transition_batch(
self,
@ -27,4 +27,4 @@ cdef class Parser(TrainablePipe):
const float* scores,
int nr_class,
int batch_size
) nogil
) noexcept nogil


@ -316,7 +316,7 @@ cdef class Parser(TrainablePipe):
cdef void _parseC(
self, CBlas cblas, StateC** states, WeightsC weights, SizesC sizes
) nogil:
) noexcept nogil:
cdef int i
cdef vector[StateC*] unfinished
cdef ActivationsC activations = alloc_activations(sizes)
@ -359,7 +359,7 @@ cdef class Parser(TrainablePipe):
const float* scores,
int nr_class,
int batch_size
) nogil:
) noexcept nogil:
# n_moves should not be zero at this point, but make sure to avoid zero-length mem alloc
with gil:
assert self.moves.n_moves > 0, Errors.E924.format(name=self.name)

spacy/registrations.py (new file, 245 lines)

@ -0,0 +1,245 @@
"""Centralized registry population for spaCy config
This module centralizes registry decorations to prevent circular import issues
with Cython annotation changes from __future__ import annotations. Functions
remain in their original locations, but decoration is moved here.
Component definitions and registrations are in spacy/pipeline/factories.py
"""
# Global flag to track if registry has been populated
REGISTRY_POPULATED = False
def populate_registry() -> None:
"""Populate the registry with all necessary components.
This function should be called before accessing the registry, to ensure
it's populated. The function uses a global flag to prevent repopulation.
"""
global REGISTRY_POPULATED
if REGISTRY_POPULATED:
return
# Import all necessary modules
from .lang.ja import create_tokenizer as create_japanese_tokenizer
from .lang.ko import create_tokenizer as create_korean_tokenizer
from .lang.th import create_thai_tokenizer
from .lang.vi import create_vietnamese_tokenizer
from .lang.zh import create_chinese_tokenizer
from .language import load_lookups_data
from .matcher.levenshtein import make_levenshtein_compare
from .ml.models.entity_linker import (
create_candidates,
create_candidates_batch,
empty_kb,
empty_kb_for_config,
load_kb,
)
from .pipeline.attributeruler import make_attribute_ruler_scorer
from .pipeline.dep_parser import make_parser_scorer
# Import the functions we refactored by removing direct registry decorators
from .pipeline.entity_linker import make_entity_linker_scorer
from .pipeline.entityruler import (
make_entity_ruler_scorer as make_entityruler_scorer,
)
from .pipeline.lemmatizer import make_lemmatizer_scorer
from .pipeline.morphologizer import make_morphologizer_scorer
from .pipeline.ner import make_ner_scorer
from .pipeline.senter import make_senter_scorer
from .pipeline.span_finder import make_span_finder_scorer
from .pipeline.span_ruler import (
make_overlapping_labeled_spans_scorer,
make_preserve_existing_ents_filter,
make_prioritize_new_ents_filter,
)
from .pipeline.spancat import (
build_ngram_range_suggester,
build_ngram_suggester,
build_preset_spans_suggester,
make_spancat_scorer,
)
# Import all pipeline components that were using registry decorators
from .pipeline.tagger import make_tagger_scorer
from .pipeline.textcat import make_textcat_scorer
from .pipeline.textcat_multilabel import make_textcat_multilabel_scorer
from .util import make_first_longest_spans_filter, registry
# Register miscellaneous components
registry.misc("spacy.first_longest_spans_filter.v1")(
make_first_longest_spans_filter
)
registry.misc("spacy.ngram_suggester.v1")(build_ngram_suggester)
registry.misc("spacy.ngram_range_suggester.v1")(build_ngram_range_suggester)
registry.misc("spacy.preset_spans_suggester.v1")(build_preset_spans_suggester)
registry.misc("spacy.prioritize_new_ents_filter.v1")(
make_prioritize_new_ents_filter
)
registry.misc("spacy.prioritize_existing_ents_filter.v1")(
make_preserve_existing_ents_filter
)
registry.misc("spacy.levenshtein_compare.v1")(make_levenshtein_compare)
# KB-related registrations
registry.misc("spacy.KBFromFile.v1")(load_kb)
registry.misc("spacy.EmptyKB.v2")(empty_kb_for_config)
registry.misc("spacy.EmptyKB.v1")(empty_kb)
registry.misc("spacy.CandidateGenerator.v1")(create_candidates)
registry.misc("spacy.CandidateBatchGenerator.v1")(create_candidates_batch)
registry.misc("spacy.LookupsDataLoader.v1")(load_lookups_data)
# Need to get references to the existing functions in registry by importing the function that is there
# For the registry that was previously decorated
# Import ML components that use registry
from .language import create_tokenizer
from .ml._precomputable_affine import PrecomputableAffine
from .ml.callbacks import (
create_models_and_pipes_with_nvtx_range,
create_models_with_nvtx_range,
)
from .ml.extract_ngrams import extract_ngrams
from .ml.extract_spans import extract_spans
# Import decorator-removed ML components
from .ml.featureextractor import FeatureExtractor
from .ml.models.entity_linker import build_nel_encoder
from .ml.models.multi_task import (
create_pretrain_characters,
create_pretrain_vectors,
)
from .ml.models.parser import build_tb_parser_model
from .ml.models.span_finder import build_finder_model
from .ml.models.spancat import (
build_linear_logistic,
build_mean_max_reducer,
build_spancat_model,
)
from .ml.models.tagger import build_tagger_model
from .ml.models.textcat import (
build_bow_text_classifier,
build_bow_text_classifier_v3,
build_reduce_text_classifier,
build_simple_cnn_text_classifier,
build_text_classifier_lowdata,
build_text_classifier_v2,
build_textcat_parametric_attention_v1,
)
from .ml.models.tok2vec import (
BiLSTMEncoder,
CharacterEmbed,
MaxoutWindowEncoder,
MishWindowEncoder,
MultiHashEmbed,
build_hash_embed_cnn_tok2vec,
build_Tok2Vec_model,
tok2vec_listener_v1,
)
from .ml.staticvectors import StaticVectors
from .ml.tb_framework import TransitionModel
from .training.augment import (
create_combined_augmenter,
create_lower_casing_augmenter,
create_orth_variants_augmenter,
)
from .training.batchers import (
configure_minibatch,
configure_minibatch_by_padded_size,
configure_minibatch_by_words,
)
from .training.callbacks import create_copy_from_base_model
from .training.loggers import console_logger, console_logger_v3
# Register scorers
registry.scorers("spacy.tagger_scorer.v1")(make_tagger_scorer)
registry.scorers("spacy.ner_scorer.v1")(make_ner_scorer)
# span_ruler_scorer removed as it's not in span_ruler.py
registry.scorers("spacy.entity_ruler_scorer.v1")(make_entityruler_scorer)
registry.scorers("spacy.senter_scorer.v1")(make_senter_scorer)
registry.scorers("spacy.textcat_scorer.v1")(make_textcat_scorer)
registry.scorers("spacy.textcat_scorer.v2")(make_textcat_scorer)
registry.scorers("spacy.textcat_multilabel_scorer.v1")(
make_textcat_multilabel_scorer
)
registry.scorers("spacy.textcat_multilabel_scorer.v2")(
make_textcat_multilabel_scorer
)
registry.scorers("spacy.lemmatizer_scorer.v1")(make_lemmatizer_scorer)
registry.scorers("spacy.span_finder_scorer.v1")(make_span_finder_scorer)
registry.scorers("spacy.spancat_scorer.v1")(make_spancat_scorer)
registry.scorers("spacy.entity_linker_scorer.v1")(make_entity_linker_scorer)
registry.scorers("spacy.overlapping_labeled_spans_scorer.v1")(
make_overlapping_labeled_spans_scorer
)
registry.scorers("spacy.attribute_ruler_scorer.v1")(make_attribute_ruler_scorer)
registry.scorers("spacy.parser_scorer.v1")(make_parser_scorer)
registry.scorers("spacy.morphologizer_scorer.v1")(make_morphologizer_scorer)
# Register tokenizers
registry.tokenizers("spacy.Tokenizer.v1")(create_tokenizer)
registry.tokenizers("spacy.ja.JapaneseTokenizer")(create_japanese_tokenizer)
registry.tokenizers("spacy.zh.ChineseTokenizer")(create_chinese_tokenizer)
registry.tokenizers("spacy.ko.KoreanTokenizer")(create_korean_tokenizer)
registry.tokenizers("spacy.vi.VietnameseTokenizer")(create_vietnamese_tokenizer)
registry.tokenizers("spacy.th.ThaiTokenizer")(create_thai_tokenizer)
# Register tok2vec architectures we've modified
registry.architectures("spacy.Tok2VecListener.v1")(tok2vec_listener_v1)
registry.architectures("spacy.HashEmbedCNN.v2")(build_hash_embed_cnn_tok2vec)
registry.architectures("spacy.Tok2Vec.v2")(build_Tok2Vec_model)
registry.architectures("spacy.MultiHashEmbed.v2")(MultiHashEmbed)
registry.architectures("spacy.CharacterEmbed.v2")(CharacterEmbed)
registry.architectures("spacy.MaxoutWindowEncoder.v2")(MaxoutWindowEncoder)
registry.architectures("spacy.MishWindowEncoder.v2")(MishWindowEncoder)
registry.architectures("spacy.TorchBiLSTMEncoder.v1")(BiLSTMEncoder)
registry.architectures("spacy.EntityLinker.v2")(build_nel_encoder)
registry.architectures("spacy.TextCatCNN.v2")(build_simple_cnn_text_classifier)
registry.architectures("spacy.TextCatBOW.v2")(build_bow_text_classifier)
registry.architectures("spacy.TextCatBOW.v3")(build_bow_text_classifier_v3)
registry.architectures("spacy.TextCatEnsemble.v2")(build_text_classifier_v2)
registry.architectures("spacy.TextCatLowData.v1")(build_text_classifier_lowdata)
registry.architectures("spacy.TextCatParametricAttention.v1")(
build_textcat_parametric_attention_v1
)
registry.architectures("spacy.TextCatReduce.v1")(build_reduce_text_classifier)
registry.architectures("spacy.SpanCategorizer.v1")(build_spancat_model)
registry.architectures("spacy.SpanFinder.v1")(build_finder_model)
registry.architectures("spacy.TransitionBasedParser.v2")(build_tb_parser_model)
registry.architectures("spacy.PretrainVectors.v1")(create_pretrain_vectors)
registry.architectures("spacy.PretrainCharacters.v1")(create_pretrain_characters)
registry.architectures("spacy.Tagger.v2")(build_tagger_model)
# Register layers
registry.layers("spacy.FeatureExtractor.v1")(FeatureExtractor)
registry.layers("spacy.extract_spans.v1")(extract_spans)
registry.layers("spacy.extract_ngrams.v1")(extract_ngrams)
registry.layers("spacy.LinearLogistic.v1")(build_linear_logistic)
registry.layers("spacy.mean_max_reducer.v1")(build_mean_max_reducer)
registry.layers("spacy.StaticVectors.v2")(StaticVectors)
registry.layers("spacy.PrecomputableAffine.v1")(PrecomputableAffine)
registry.layers("spacy.CharEmbed.v1")(CharacterEmbed)
registry.layers("spacy.TransitionModel.v1")(TransitionModel)
# Register callbacks
registry.callbacks("spacy.copy_from_base_model.v1")(create_copy_from_base_model)
registry.callbacks("spacy.models_with_nvtx_range.v1")(create_models_with_nvtx_range)
registry.callbacks("spacy.models_and_pipes_with_nvtx_range.v1")(
create_models_and_pipes_with_nvtx_range
)
# Register loggers
registry.loggers("spacy.ConsoleLogger.v2")(console_logger)
registry.loggers("spacy.ConsoleLogger.v3")(console_logger_v3)
# Register batchers
registry.batchers("spacy.batch_by_padded.v1")(configure_minibatch_by_padded_size)
registry.batchers("spacy.batch_by_words.v1")(configure_minibatch_by_words)
registry.batchers("spacy.batch_by_sequence.v1")(configure_minibatch)
# Register augmenters
registry.augmenters("spacy.combined_augmenter.v1")(create_combined_augmenter)
registry.augmenters("spacy.lower_case.v1")(create_lower_casing_augmenter)
registry.augmenters("spacy.orth_variants.v1")(create_orth_variants_augmenter)
# Set the flag to indicate that the registry has been populated
REGISTRY_POPULATED = True
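A hedged usage sketch of the lazy-population pattern above; the registry name used here is one registered in this module, and the call is idempotent thanks to the module-level flag:

# Hypothetical usage sketch (not part of the diff): populate the registry once
# before resolving entries from it.
from spacy.registrations import populate_registry
from spacy.util import registry

populate_registry()  # safe to call repeatedly; guarded by REGISTRY_POPULATED
make_scorer = registry.scorers.get("spacy.tagger_scorer.v1")
tagger_scorer = make_scorer()  # returns the tag-accuracy scoring callable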


@ -25,5 +25,7 @@ cdef class StringStore:
cdef vector[hash_t] keys
cdef public PreshMap _map
cdef const Utf8Str* intern_unicode(self, str py_string)
cdef const Utf8Str* _intern_utf8(self, char* utf8_string, int length, hash_t* precalculated_hash)
cdef const Utf8Str* intern_unicode(self, str py_string, bint allow_transient)
cdef const Utf8Str* _intern_utf8(self, char* utf8_string, int length, hash_t* precalculated_hash, bint allow_transient)
cdef vector[hash_t] _transient_keys
cdef Pool _non_temp_mem


@ -1,9 +1,14 @@
# cython: infer_types=True
# cython: profile=False
cimport cython
from contextlib import contextmanager
from typing import List, Optional
from libc.stdint cimport uint32_t
from libc.string cimport memcpy
from murmurhash.mrmr cimport hash32, hash64
from preshed.maps cimport map_clear
import srsly
@ -31,7 +36,7 @@ def get_string_id(key):
This function optimises for convenience over performance, so shouldn't be
used in tight loops.
"""
cdef hash_t str_hash
cdef hash_t str_hash
if isinstance(key, str):
if len(key) == 0:
return 0
@ -45,8 +50,8 @@ def get_string_id(key):
elif _try_coerce_to_hash(key, &str_hash):
# Coerce the integral key to the expected primitive hash type.
# This ensures that custom/overloaded "primitive" data types
# such as those implemented by numpy are not inadvertently used
# downsteam (as these are internally implemented as custom PyObjects
# such as those implemented by numpy are not inadvertently used
# downsteam (as these are internally implemented as custom PyObjects
# whose comparison operators can incur a significant overhead).
return str_hash
else:
@ -119,10 +124,11 @@ cdef class StringStore:
strings (iterable): A sequence of unicode strings to add to the store.
"""
self.mem = Pool()
self._non_temp_mem = self.mem
self._map = PreshMap()
if strings is not None:
for string in strings:
self.add(string)
self.add(string, allow_transient=False)
def __getitem__(self, object string_or_id):
"""Retrieve a string from a given hash, or vice versa.
@ -152,14 +158,17 @@ cdef class StringStore:
return SYMBOLS_BY_INT[str_hash]
else:
utf8str = <Utf8Str*>self._map.get(str_hash)
if utf8str is NULL:
raise KeyError(Errors.E018.format(hash_value=string_or_id))
else:
return decode_Utf8Str(utf8str)
else:
# TODO: Raise an error instead
utf8str = <Utf8Str*>self._map.get(string_or_id)
if utf8str is NULL:
raise KeyError(Errors.E018.format(hash_value=string_or_id))
else:
return decode_Utf8Str(utf8str)
if utf8str is NULL:
raise KeyError(Errors.E018.format(hash_value=string_or_id))
else:
return decode_Utf8Str(utf8str)
def as_int(self, key):
"""If key is an int, return it; otherwise, get the int value."""
@ -175,12 +184,48 @@ cdef class StringStore:
else:
return self[key]
def add(self, string):
def __len__(self) -> int:
"""The number of strings in the store.
RETURNS (int): The number of strings in the store.
"""
return self.keys.size() + self._transient_keys.size()
@contextmanager
def memory_zone(self, mem: Optional[Pool] = None) -> Pool:
"""Begin a block where all resources allocated during the block will
be freed at the end of it. If a resource was created within the
memory zone block, accessing it outside the block is invalid.
Behaviour of this invalid access is undefined. Memory zones should
not be nested.
The memory zone is helpful for services that need to process large
volumes of text with a defined memory budget.
"""
if mem is None:
mem = Pool()
self.mem = mem
yield mem
for key in self._transient_keys:
map_clear(self._map.c_map, key)
self._transient_keys.clear()
self.mem = self._non_temp_mem
def add(self, string: str, allow_transient: Optional[bool] = None) -> int:
"""Add a string to the StringStore.
string (str): The string to add.
allow_transient (bool): Allow the string to be stored in the 'transient'
map, which will be flushed at the end of the memory zone. Strings
encountered during arbitrary text processing should be added
with allow_transient=True, while labels and other strings used
internally should not.
RETURNS (uint64): The string's hash value.
"""
if not string:
return 0
if allow_transient is None:
allow_transient = self.mem is not self._non_temp_mem
cdef hash_t str_hash
if isinstance(string, str):
if string in SYMBOLS_BY_STR:
@ -188,22 +233,26 @@ cdef class StringStore:
string = string.encode("utf8")
str_hash = hash_utf8(string, len(string))
self._intern_utf8(string, len(string), &str_hash)
self._intern_utf8(string, len(string), &str_hash, allow_transient)
elif isinstance(string, bytes):
if string in SYMBOLS_BY_STR:
return SYMBOLS_BY_STR[string]
str_hash = hash_utf8(string, len(string))
self._intern_utf8(string, len(string), &str_hash)
self._intern_utf8(string, len(string), &str_hash, allow_transient)
else:
raise TypeError(Errors.E017.format(value_type=type(string)))
return str_hash
def __len__(self):
"""The number of strings in the store.
if string in SYMBOLS_BY_STR:
return SYMBOLS_BY_STR[string]
else:
return self._intern_str(string, allow_transient)
RETURNS (int): The number of strings in the store.
"""
return self.keys.size()
return self.keys.size() + self._transient_keys.size()
def __contains__(self, string_or_id not None):
"""Check whether a string or ID is in the store.
@ -222,12 +271,17 @@ cdef class StringStore:
pass
else:
# TODO: Raise an error instead
return self._map.get(string_or_id) is not NULL
if self._map.get(string_or_id) is not NULL:
return True
else:
return False
if str_hash < len(SYMBOLS_BY_INT):
return True
else:
return self._map.get(str_hash) is not NULL
if self._map.get(str_hash) is not NULL:
return True
else:
return False
def __iter__(self):
"""Iterate over the strings in the store, in order.
@ -240,12 +294,29 @@ cdef class StringStore:
key = self.keys[i]
utf8str = <Utf8Str*>self._map.get(key)
yield decode_Utf8Str(utf8str)
# TODO: Iterate OOV here?
for i in range(self._transient_keys.size()):
key = self._transient_keys[i]
utf8str = <Utf8Str*>self._map.get(key)
yield decode_Utf8Str(utf8str)
def __reduce__(self):
strings = list(self)
return (StringStore, (strings,), None, None, None)
def values(self) -> List[int]:
"""Iterate over the stored strings hashes in insertion order.
RETURNS: A list of string hashes.
"""
cdef int i
hashes = [None] * self.keys.size()
for i in range(self.keys.size()):
hashes[i] = self.keys[i]
transient_hashes = [None] * self._transient_keys.size()
for i in range(self._transient_keys.size()):
transient_hashes[i] = self._transient_keys[i]
return hashes + transient_hashes
def to_disk(self, path):
"""Save the current state to a directory.
@ -269,7 +340,7 @@ cdef class StringStore:
prev = list(self)
self._reset_and_load(strings)
for word in prev:
self.add(word)
self.add(word, allow_transient=False)
return self
def to_bytes(self, **kwargs):
@ -289,30 +360,38 @@ cdef class StringStore:
prev = list(self)
self._reset_and_load(strings)
for word in prev:
self.add(word)
self.add(word, allow_transient=False)
return self
def _reset_and_load(self, strings):
self.mem = Pool()
self._non_temp_mem = self.mem
self._map = PreshMap()
self.keys.clear()
self._transient_keys.clear()
for string in strings:
self.add(string)
self.add(string, allow_transient=False)
cdef const Utf8Str* intern_unicode(self, str py_string):
cdef const Utf8Str* intern_unicode(self, str py_string, bint allow_transient):
# 0 means missing, but we don't bother offsetting the index.
cdef bytes byte_string = py_string.encode("utf8")
return self._intern_utf8(byte_string, len(byte_string), NULL)
return self._intern_utf8(byte_string, len(byte_string), NULL, allow_transient)
@cython.final
cdef const Utf8Str* _intern_utf8(self, char* utf8_string, int length, hash_t* precalculated_hash):
cdef const Utf8Str* _intern_utf8(self, char* utf8_string, int length, hash_t* precalculated_hash, bint allow_transient):
# TODO: This function's API/behaviour is an unholy mess...
# 0 means missing, but we don't bother offsetting the index.
cdef hash_t key = precalculated_hash[0] if precalculated_hash is not NULL else hash_utf8(utf8_string, length)
cdef Utf8Str* value = <Utf8Str*>self._map.get(key)
if value is not NULL:
return value
value = _allocate(self.mem, <unsigned char*>utf8_string, length)
if allow_transient:
value = _allocate(self.mem, <unsigned char*>utf8_string, length)
else:
value = _allocate(self._non_temp_mem, <unsigned char*>utf8_string, length)
self._map.set(key, value)
self.keys.push_back(key)
if allow_transient and self.mem is not self._non_temp_mem:
self._transient_keys.push_back(key)
else:
self.keys.push_back(key)
return value
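A hedged usage sketch of the memory-zone behaviour added above: strings interned inside the zone with allow_transient=True live in the zone's pool and are flushed when the block exits, while strings added outside it (or with allow_transient=False) persist.

# Hypothetical usage sketch (not part of the diff)
from spacy.strings import StringStore

store = StringStore(["my-custom-label"])  # constructor adds with allow_transient=False
with store.memory_zone():
    h = store.add("some document text", allow_transient=True)
    assert store[h] == "some document text"  # visible inside the zone
# after the zone exits the transient entry is flushed, but the label remains
assert "my-custom-label" in store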


@ -479,3 +479,4 @@ NAMES = [it[0] for it in sorted(IDS.items(), key=sort_nums)]
# (which is generating an enormous amount of C++ in Cython 0.24+)
# We keep the enum cdef, and just make sure the names are available to Python
locals().update(IDS)


@ -81,6 +81,11 @@ def bn_tokenizer():
return get_lang_class("bn")().tokenizer
@pytest.fixture(scope="session")
def bo_tokenizer():
return get_lang_class("bo")().tokenizer
@pytest.fixture(scope="session")
def ca_tokenizer():
return get_lang_class("ca")().tokenizer
@ -207,6 +212,16 @@ def hr_tokenizer():
return get_lang_class("hr")().tokenizer
@pytest.fixture(scope="session")
def ht_tokenizer():
return get_lang_class("ht")().tokenizer
@pytest.fixture(scope="session")
def ht_vocab():
return get_lang_class("ht")().vocab
@pytest.fixture
def hu_tokenizer():
return get_lang_class("hu")().tokenizer
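A hypothetical test sketch using the new session-scoped ht_tokenizer fixture; the example sentence and expected token count are illustrative only, not taken from the PR's test suite:

def test_ht_tokenizer_handles_plain_words(ht_tokenizer):
    # three whitespace-separated words, no punctuation involved
    doc = ht_tokenizer("Bonjou tout moun")
    assert len(doc) == 3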

Some files were not shown because too many files have changed in this diff Show More