Mirror of https://github.com/explosion/spaCy.git (synced 2025-04-04 17:24:16 +03:00)

Merge branch 'master' into spacy.io

commit 0c46b69022
.github/FUNDING.yml (vendored, new file, 1 addition)

@@ -0,0 +1 @@
+custom: [https://explosion.ai/merch, https://explosion.ai/tailored-solutions]
.github/workflows/gputests.yml (vendored, 2 changes)

@@ -9,7 +9,7 @@ jobs:
     strategy:
       fail-fast: false
       matrix:
-        branch: [master, v4]
+        branch: [master, main]
     if: github.repository_owner == 'explosion'
     runs-on: ubuntu-latest
     steps:
.github/workflows/slowtests.yml (vendored, 2 changes)

@@ -9,7 +9,7 @@ jobs:
     strategy:
       fail-fast: false
      matrix:
-        branch: [master, v4]
+        branch: [master, main]
     if: github.repository_owner == 'explosion'
     runs-on: ubuntu-latest
     steps:
.github/workflows/tests.yml (vendored, 4 changes)

@@ -58,7 +58,7 @@ jobs:
       fail-fast: true
       matrix:
         os: [ubuntu-latest, windows-latest, macos-latest]
-        python_version: ["3.11", "3.12.0-rc.2"]
+        python_version: ["3.12"]
         include:
           - os: windows-latest
             python_version: "3.7"
@@ -68,6 +68,8 @@ jobs:
             python_version: "3.9"
           - os: windows-latest
             python_version: "3.10"
+          - os: macos-latest
+            python_version: "3.11"

     runs-on: ${{ matrix.os }}
@@ -452,10 +452,9 @@ and plugins in spaCy v3.0, and we can't wait to see what you build with it!
   spaCy website. If you're sharing your project on Twitter, feel free to tag
   [@spacy_io](https://twitter.com/spacy_io) so we can check it out.

-- Once your extension is published, you can open an issue on the
-  [issue tracker](https://github.com/explosion/spacy/issues) to suggest it for the
-  [resources directory](https://spacy.io/usage/resources#extensions) on the
-  website.
+- Once your extension is published, you can open a
+  [PR](https://github.com/explosion/spaCy/pulls) to suggest it for the
+  [Universe](https://spacy.io/universe) page.

 📖 **For more tips and best practices, see the [checklist for developing spaCy extensions](https://spacy.io/usage/processing-pipelines#extensions).**
LICENSE (2 changes)

@@ -1,6 +1,6 @@
 The MIT License (MIT)

-Copyright (C) 2016-2022 ExplosionAI GmbH, 2016 spaCy GmbH, 2015 Matthew Honnibal
+Copyright (C) 2016-2023 ExplosionAI GmbH, 2016 spaCy GmbH, 2015 Matthew Honnibal

 Permission is hereby granted, free of charge, to any person obtaining a copy
 of this software and associated documentation files (the "Software"), to deal
@@ -39,26 +39,31 @@ open-source software, released under the
 | 🚀 **[New in v3.0]** | New features, backwards incompatibilities and migration guide. |
 | 🪐 **[Project Templates]** | End-to-end workflows you can clone, modify and run. |
 | 🎛 **[API Reference]** | The detailed reference for spaCy's API. |
 | ⏩ **[GPU Processing]** | Use spaCy with CUDA-compatible GPU processing. |
 | 📦 **[Models]** | Download trained pipelines for spaCy. |
 | 🦙 **[Large Language Models]** | Integrate LLMs into spaCy pipelines. |
 | 🌌 **[Universe]** | Plugins, extensions, demos and books from the spaCy ecosystem. |
 | ⚙️ **[spaCy VS Code Extension]** | Additional tooling and features for working with spaCy's config files. |
 | 👩‍🏫 **[Online Course]** | Learn spaCy in this free and interactive online course. |
 | 📰 **[Blog]** | Read about current spaCy and Prodigy development, releases, talks and more from Explosion. |
 | 📺 **[Videos]** | Our YouTube channel with video tutorials, talks and more. |
 | 🛠 **[Changelog]** | Changes and version history. |
 | 💝 **[Contribute]** | How to contribute to the spaCy project and code base. |
 | 👕 **[Swag]** | Support us and our work with unique, custom-designed swag! |
-| <a href="https://explosion.ai/spacy-tailored-pipelines"><img src="https://user-images.githubusercontent.com/13643239/152853098-1c761611-ccb0-4ec6-9066-b234552831fe.png" width="125" alt="spaCy Tailored Pipelines"/></a> | Get a custom spaCy pipeline, tailor-made for your NLP problem by spaCy's core developers. Streamlined, production-ready, predictable and maintainable. Start by completing our 5-minute questionnaire to tell us what you need and we'll be in touch! **[Learn more →](https://explosion.ai/spacy-tailored-pipelines)** |
-| <a href="https://explosion.ai/spacy-tailored-analysis"><img src="https://user-images.githubusercontent.com/1019791/206151300-b00cd189-e503-4797-aa1e-1bb6344062c5.png" width="125" alt="spaCy Tailored Analysis"/></a> | Bespoke advice for problem solving, strategy and analysis for applied NLP projects. Services include data strategy, code reviews, pipeline design and annotation coaching. Curious? Fill in our 5-minute questionnaire to tell us what you need and we'll be in touch! **[Learn more →](https://explosion.ai/spacy-tailored-analysis)** |
+| <a href="https://explosion.ai/tailored-solutions"><img src="https://github.com/explosion/spaCy/assets/13643239/36d2a42e-98c0-4599-90e1-788ef75181be" width="150" alt="Tailored Solutions"/></a> | Custom NLP consulting, implementation and strategic advice by spaCy's core development team. Streamlined, production-ready, predictable and maintainable. Send us an email or take our 5-minute questionnaire, and we'll be in touch! **[Learn more →](https://explosion.ai/tailored-solutions)** |

 [spacy 101]: https://spacy.io/usage/spacy-101
 [new in v3.0]: https://spacy.io/usage/v3
 [usage guides]: https://spacy.io/usage/
 [api reference]: https://spacy.io/api/
 [gpu processing]: https://spacy.io/usage#gpu
 [models]: https://spacy.io/models
 [large language models]: https://spacy.io/usage/large-language-models
 [universe]: https://spacy.io/universe
 [spacy vs code extension]: https://github.com/explosion/spacy-vscode
 [videos]: https://www.youtube.com/c/ExplosionAI
 [online course]: https://course.spacy.io
 [blog]: https://explosion.ai
 [project templates]: https://github.com/explosion/projects
 [changelog]: https://spacy.io/usage#changelog
 [contribute]: https://github.com/explosion/spaCy/blob/master/CONTRIBUTING.md
@@ -158,3 +158,45 @@ AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
 LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
 OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
 SOFTWARE.
+
+
+SciPy
+-----
+
+* Files: scorer.py
+
+The implementation of trapezoid() is adapted from SciPy, which is distributed
+under the following license:
+
+New BSD License
+
+Copyright (c) 2001-2002 Enthought, Inc. 2003-2023, SciPy Developers.
+All rights reserved.
+
+Redistribution and use in source and binary forms, with or without
+modification, are permitted provided that the following conditions
+are met:
+
+1. Redistributions of source code must retain the above copyright
+   notice, this list of conditions and the following disclaimer.
+
+2. Redistributions in binary form must reproduce the above
+   copyright notice, this list of conditions and the following
+   disclaimer in the documentation and/or other materials provided
+   with the distribution.
+
+3. Neither the name of the copyright holder nor the names of its
+   contributors may be used to endorse or promote products derived
+   from this software without specific prior written permission.
+
+THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
+"AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
+LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
+A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
+OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
+SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
+LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
+DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
+THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+(INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
@@ -5,7 +5,7 @@ requires = [
     "cymem>=2.0.2,<2.1.0",
     "preshed>=3.0.2,<3.1.0",
     "murmurhash>=0.28.0,<1.1.0",
-    "thinc>=8.1.8,<8.3.0",
+    "thinc>=8.2.2,<8.3.0",
     "numpy>=1.15.0; python_version < '3.9'",
     "numpy>=1.25.0; python_version >= '3.9'",
 ]
@@ -3,14 +3,13 @@ spacy-legacy>=3.0.11,<3.1.0
 spacy-loggers>=1.0.0,<2.0.0
 cymem>=2.0.2,<2.1.0
 preshed>=3.0.2,<3.1.0
-thinc>=8.1.8,<8.3.0
+thinc>=8.2.2,<8.3.0
 ml_datasets>=0.2.0,<0.3.0
 murmurhash>=0.28.0,<1.1.0
 wasabi>=0.9.1,<1.2.0
 srsly>=2.4.3,<3.0.0
 catalogue>=2.0.6,<2.1.0
 typer>=0.3.0,<0.10.0
 pathy>=0.10.0
 smart-open>=5.2.1,<7.0.0
 weasel>=0.1.0,<0.4.0
 # Third party dependencies
@@ -41,7 +41,7 @@ setup_requires =
     cymem>=2.0.2,<2.1.0
     preshed>=3.0.2,<3.1.0
     murmurhash>=0.28.0,<1.1.0
-    thinc>=8.1.8,<8.3.0
+    thinc>=8.2.2,<8.3.0
 install_requires =
     # Our libraries
     spacy-legacy>=3.0.11,<3.1.0
@@ -49,14 +49,13 @@ install_requires =
     murmurhash>=0.28.0,<1.1.0
     cymem>=2.0.2,<2.1.0
     preshed>=3.0.2,<3.1.0
-    thinc>=8.1.8,<8.3.0
+    thinc>=8.2.2,<8.3.0
     wasabi>=0.9.1,<1.2.0
     srsly>=2.4.3,<3.0.0
     catalogue>=2.0.6,<2.1.0
     weasel>=0.1.0,<0.4.0
     # Third-party dependencies
     typer>=0.3.0,<0.10.0
     pathy>=0.10.0
     smart-open>=5.2.1,<7.0.0
     tqdm>=4.38.0,<5.0.0
     numpy>=1.15.0; python_version < "3.9"
@@ -13,6 +13,7 @@ from thinc.api import Config, prefer_gpu, require_cpu, require_gpu  # noqa: F401
 from . import pipeline  # noqa: F401
 from . import util
 from .about import __version__  # noqa: F401
+from .cli.info import info  # noqa: F401
 from .errors import Errors
 from .glossary import explain  # noqa: F401
 from .language import Language
@@ -76,9 +77,3 @@ def blank(
     # We should accept both dot notation and nested dict here for consistency
     config = util.dot_to_dict(config)
     return LangClass.from_config(config, vocab=vocab, meta=meta)
-
-
-def info(*args, **kwargs):
-    from .cli.info import info as cli_info
-
-    return cli_info(*args, **kwargs)
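With this change, spacy.info is the CLI helper re-exported at the top level instead of a local wrapper; the call site is unchanged. A minimal usage sketch:

    import spacy

    # info() now comes straight from spacy.cli.info.
    spacy.info()                   # prints details about the spaCy installation
    spacy.info("en_core_web_sm")   # details for an installed pipeline, if present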
@@ -1,5 +1,5 @@
 # fmt: off
 __title__ = "spacy"
-__version__ = "3.7.0"
+__version__ = "3.7.3"
 __download_url__ = "https://github.com/explosion/spacy-models/releases/download"
 __compatibility__ = "https://raw.githubusercontent.com/explosion/spacy-models/master/compatibility.json"
@@ -22,8 +22,17 @@ from .init_pipeline import init_pipeline_cli  # noqa: F401
 from .package import package  # noqa: F401
 from .pretrain import pretrain  # noqa: F401
 from .profile import profile  # noqa: F401
-from .train import train_cli  # noqa: F401
-from .validate import validate  # noqa: F401
+from .project.assets import project_assets  # type: ignore[attr-defined] # noqa: F401
+from .project.clone import project_clone  # type: ignore[attr-defined] # noqa: F401
+from .project.document import (  # type: ignore[attr-defined] # noqa: F401
+    project_document,
+)
+from .project.dvc import project_update_dvc  # type: ignore[attr-defined] # noqa: F401
+from .project.pull import project_pull  # type: ignore[attr-defined] # noqa: F401
+from .project.push import project_push  # type: ignore[attr-defined] # noqa: F401
+from .project.run import project_run  # type: ignore[attr-defined] # noqa: F401
+from .train import train_cli  # type: ignore[attr-defined] # noqa: F401
+from .validate import validate  # type: ignore[attr-defined] # noqa: F401


 @app.command("link", no_args_is_help=True, deprecated=True, hidden=True)
@@ -41,10 +41,6 @@ from ..util import (
     run_command,
 )

-if TYPE_CHECKING:
-    from pathy import FluidPath  # noqa: F401
-
-
 SDIST_SUFFIX = ".tar.gz"
 WHEEL_SUFFIX = "-py3-none-any.whl"

@@ -13,7 +13,7 @@ from .. import util
 from ..language import Language
 from ..tokens import Doc
 from ..training import Corpus
-from ._util import Arg, Opt, benchmark_cli, setup_gpu
+from ._util import Arg, Opt, benchmark_cli, import_code, setup_gpu


 @benchmark_cli.command(
@@ -30,12 +30,14 @@ def benchmark_speed_cli(
     use_gpu: int = Opt(-1, "--gpu-id", "-g", help="GPU ID or -1 for CPU"),
     n_batches: int = Opt(50, "--batches", help="Minimum number of batches to benchmark", min=30,),
     warmup_epochs: int = Opt(3, "--warmup", "-w", min=0, help="Number of iterations over the data for warmup"),
+    code_path: Optional[Path] = Opt(None, "--code", "-c", help="Path to Python file with additional code (registered functions) to be imported"),
     # fmt: on
 ):
     """
     Benchmark a pipeline. Expects a loadable spaCy pipeline and benchmark
     data in the binary .spacy format.
     """
+    import_code(code_path)
     setup_gpu(use_gpu=use_gpu, silent=False)

     nlp = util.load_model(model)
@@ -171,5 +173,5 @@ def print_outliers(sample: numpy.ndarray):
 def warmup(
     nlp: Language, docs: List[Doc], warmup_epochs: int, batch_size: Optional[int]
 ) -> numpy.ndarray:
-    docs = warmup_epochs * docs
+    docs = [doc.copy() for doc in docs * warmup_epochs]
     return annotate(nlp, docs, batch_size)
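The warmup change matters because multiplying a list repeats references to the same objects, so every warmup epoch would annotate the very same Doc instances. A minimal sketch of the difference, using plain lists as stand-ins for Doc objects:

    # List multiplication repeats references, it does not copy.
    docs = [["token"], ["another"]]
    repeated = docs * 2
    assert repeated[0] is repeated[2]        # same underlying object

    copied = [d.copy() for d in docs * 2]    # mirrors the fix above
    assert copied[0] is not copied[2]        # independent copies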
@@ -7,7 +7,14 @@ from wasabi import msg

 from .. import about
 from ..errors import OLD_MODEL_SHORTCUTS
-from ..util import get_minor_version, is_package, is_prerelease_version, run_command
+from ..util import (
+    get_minor_version,
+    is_in_interactive,
+    is_in_jupyter,
+    is_package,
+    is_prerelease_version,
+    run_command,
+)
 from ._util import SDIST_SUFFIX, WHEEL_SUFFIX, Arg, Opt, app


@@ -77,6 +84,27 @@ def download(
         "Download and installation successful",
         f"You can now load the package via spacy.load('{model_name}')",
     )
+    if is_in_jupyter():
+        reload_deps_msg = (
+            "If you are in a Jupyter or Colab notebook, you may need to "
+            "restart Python in order to load all the package's dependencies. "
+            "You can do this by selecting the 'Restart kernel' or 'Restart "
+            "runtime' option."
+        )
+        msg.warn(
+            "Restart to reload dependencies",
+            reload_deps_msg,
+        )
+    elif is_in_interactive():
+        reload_deps_msg = (
+            "If you are in an interactive Python session, you may need to "
+            "exit and restart Python to load all the package's dependencies. "
+            "You can exit with Ctrl-D (or Ctrl-Z and Enter on Windows)."
+        )
+        msg.warn(
+            "Restart to reload dependencies",
+            reload_deps_msg,
+        )


 def get_model_filename(model_name: str, version: str, sdist: bool = False) -> str:
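The new is_in_interactive() helper only needs to know whether Python is running a REPL. A plausible sketch of what such a helper checks (the exact implementation in spacy.util may differ):

    import sys

    def in_interactive_session() -> bool:
        # The interactive interpreter defines sys.ps1/sys.ps2 (the >>> and
        # ... prompts); scripts run non-interactively do not.
        return hasattr(sys, "ps1") or hasattr(sys, "ps2")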
@@ -1,5 +1,7 @@
 import os
+import re
 import shutil
+import subprocess
 import sys
 from collections import defaultdict
 from pathlib import Path
@@ -11,6 +13,7 @@ from thinc.api import Config
 from wasabi import MarkdownRenderer, Printer, get_raw_input

 from .. import about, util
+from ..compat import importlib_metadata
 from ..schemas import ModelMetaSchema, validate
 from ._util import SDIST_SUFFIX, WHEEL_SUFFIX, Arg, Opt, app, string_to_list
@@ -35,7 +38,7 @@ def package_cli(
     specified output directory, and the data will be copied over. If
     --create-meta is set and a meta.json already exists in the output directory,
     the existing values will be used as the defaults in the command-line prompt.
-    After packaging, "python setup.py sdist" is run in the package directory,
+    After packaging, "python -m build --sdist" is run in the package directory,
     which will create a .tar.gz archive that can be installed via "pip install".

     If additional code files are provided (e.g. Python files containing custom
@@ -78,9 +81,17 @@ def package(
     input_path = util.ensure_path(input_dir)
     output_path = util.ensure_path(output_dir)
     meta_path = util.ensure_path(meta_path)
-    if create_wheel and not has_wheel():
-        err = "Generating a binary .whl file requires wheel to be installed"
-        msg.fail(err, "pip install wheel", exits=1)
+    if create_wheel and not has_wheel() and not has_build():
+        err = (
+            "Generating wheels requires 'build' or 'wheel' (deprecated) to be installed"
+        )
+        msg.fail(err, "pip install build", exits=1)
+    if not has_build():
+        msg.warn(
+            "Generating packages without the 'build' package is deprecated and "
+            "will not be supported in the future. To install 'build': pip "
+            "install build"
+        )
     if not input_path or not input_path.exists():
         msg.fail("Can't locate pipeline data", input_path, exits=1)
     if not output_path or not output_path.exists():
@@ -184,12 +195,37 @@ def package(
     msg.good(f"Successfully created package directory '{model_name_v}'", main_path)
     if create_sdist:
         with util.working_dir(main_path):
-            util.run_command([sys.executable, "setup.py", "sdist"], capture=False)
+            # run directly, since util.run_command is not designed to continue
+            # after a command fails
+            ret = subprocess.run(
+                [sys.executable, "-m", "build", ".", "--sdist"],
+                env=os.environ.copy(),
+            )
+            if ret.returncode != 0:
+                msg.warn(
+                    "Creating sdist with 'python -m build' failed. Falling "
+                    "back to deprecated use of 'python setup.py sdist'"
+                )
+                util.run_command([sys.executable, "setup.py", "sdist"], capture=False)
         zip_file = main_path / "dist" / f"{model_name_v}{SDIST_SUFFIX}"
         msg.good(f"Successfully created zipped Python package", zip_file)
     if create_wheel:
         with util.working_dir(main_path):
-            util.run_command([sys.executable, "setup.py", "bdist_wheel"], capture=False)
+            # run directly, since util.run_command is not designed to continue
+            # after a command fails
+            ret = subprocess.run(
+                [sys.executable, "-m", "build", ".", "--wheel"],
+                env=os.environ.copy(),
+            )
+            if ret.returncode != 0:
+                msg.warn(
+                    "Creating wheel with 'python -m build' failed. Falling "
+                    "back to deprecated use of 'wheel' with "
+                    "'python setup.py bdist_wheel'"
+                )
+                util.run_command(
+                    [sys.executable, "setup.py", "bdist_wheel"], capture=False
+                )
         wheel_name_squashed = re.sub("_+", "_", model_name_v)
         wheel = main_path / "dist" / f"{wheel_name_squashed}{WHEEL_SUFFIX}"
         msg.good(f"Successfully created binary wheel", wheel)
@@ -209,6 +245,17 @@ def has_wheel() -> bool:
         return False


+def has_build() -> bool:
+    # it's very likely that there is a local directory named build/ (especially
+    # in an editable install), so an import check is not sufficient; instead
+    # check that there is a package version
+    try:
+        importlib_metadata.version("build")
+        return True
+    except importlib_metadata.PackageNotFoundError:  # type: ignore[attr-defined]
+        return False
+
+
 def get_third_party_dependencies(
     config: Config, exclude: List[str] = util.SimpleFrozenList()
 ) -> List[str]:
spacy/cli/project/__init__.py (new file, empty)

spacy/cli/project/assets.py (new file, 1 addition)

@@ -0,0 +1 @@
+from weasel.cli.assets import *

spacy/cli/project/clone.py (new file, 1 addition)

@@ -0,0 +1 @@
+from weasel.cli.clone import *

spacy/cli/project/document.py (new file, 1 addition)

@@ -0,0 +1 @@
+from weasel.cli.document import *

spacy/cli/project/dvc.py (new file, 1 addition)

@@ -0,0 +1 @@
+from weasel.cli.dvc import *

spacy/cli/project/pull.py (new file, 1 addition)

@@ -0,0 +1 @@
+from weasel.cli.pull import *

spacy/cli/project/push.py (new file, 1 addition)

@@ -0,0 +1 @@
+from weasel.cli.push import *

spacy/cli/project/remote_storage.py (new file, 1 addition)

@@ -0,0 +1 @@
+from weasel.cli.remote_storage import *

spacy/cli/project/run.py (new file, 1 addition)

@@ -0,0 +1 @@
+from weasel.cli.run import *
@@ -271,8 +271,9 @@ grad_factor = 1.0
 @layers = "reduce_mean.v1"

 [components.textcat.model.linear_model]
-@architectures = "spacy.TextCatBOW.v2"
+@architectures = "spacy.TextCatBOW.v3"
 exclusive_classes = true
+length = 262144
 ngram_size = 1
 no_output_layer = false

@@ -308,8 +309,9 @@ grad_factor = 1.0
 @layers = "reduce_mean.v1"

 [components.textcat_multilabel.model.linear_model]
-@architectures = "spacy.TextCatBOW.v2"
+@architectures = "spacy.TextCatBOW.v3"
 exclusive_classes = false
+length = 262144
 ngram_size = 1
 no_output_layer = false

@@ -542,14 +544,15 @@ nO = null
 width = ${components.tok2vec.model.encode.width}

 [components.textcat.model.linear_model]
-@architectures = "spacy.TextCatBOW.v2"
+@architectures = "spacy.TextCatBOW.v3"
 exclusive_classes = true
+length = 262144
 ngram_size = 1
 no_output_layer = false

 {% else -%}
 [components.textcat.model]
-@architectures = "spacy.TextCatBOW.v2"
+@architectures = "spacy.TextCatBOW.v3"
 exclusive_classes = true
 ngram_size = 1
 no_output_layer = false
@@ -570,15 +573,17 @@ nO = null
 width = ${components.tok2vec.model.encode.width}

 [components.textcat_multilabel.model.linear_model]
-@architectures = "spacy.TextCatBOW.v2"
+@architectures = "spacy.TextCatBOW.v3"
 exclusive_classes = false
+length = 262144
 ngram_size = 1
 no_output_layer = false

 {% else -%}
 [components.textcat_multilabel.model]
-@architectures = "spacy.TextCatBOW.v2"
+@architectures = "spacy.TextCatBOW.v3"
 exclusive_classes = false
+length = 262144
 ngram_size = 1
 no_output_layer = false
 {%- endif %}
@@ -142,7 +142,25 @@ class SpanRenderer:
         spans (list): Individual entity spans and their start, end, label, kb_id and kb_url.
         title (str / None): Document title set in Doc.user_data['title'].
         """
-        per_token_info = []
+        per_token_info = self._assemble_per_token_info(tokens, spans)
+        markup = self._render_markup(per_token_info)
+        markup = TPL_SPANS.format(content=markup, dir=self.direction)
+        if title:
+            markup = TPL_TITLE.format(title=title) + markup
+        return markup
+
+    @staticmethod
+    def _assemble_per_token_info(
+        tokens: List[str], spans: List[Dict[str, Any]]
+    ) -> List[Dict[str, List[Dict[str, Any]]]]:
+        """Assembles token info used to generate markup in render_spans().
+        tokens (List[str]): Tokens in text.
+        spans (List[Dict[str, Any]]): Spans in text.
+        RETURNS (List[Dict[str, List[Dict, str, Any]]]): Per token info needed to render HTML markup for given tokens
+            and spans.
+        """
+        per_token_info: List[Dict[str, List[Dict[str, Any]]]] = []
+
         # we must sort so that we can correctly describe when spans need to "stack"
         # which is determined by their start token, then span length (longer spans on top),
         # then break any remaining ties with the span label
@@ -154,21 +172,22 @@ class SpanRenderer:
                 s["label"],
             ),
         )

         for s in spans:
             # this is the vertical 'slot' that the span will be rendered in
             # vertical_position = span_label_offset + (offset_step * (slot - 1))
             s["render_slot"] = 0

         for idx, token in enumerate(tokens):
             # Identify if a token belongs to a Span (and which) and if it's a
             # start token of said Span. We'll use this for the final HTML render
             token_markup: Dict[str, Any] = {}
             token_markup["text"] = token
-            concurrent_spans = 0
+            intersecting_spans: List[Dict[str, Any]] = []
             entities = []
             for span in spans:
                 ent = {}
                 if span["start_token"] <= idx < span["end_token"]:
-                    concurrent_spans += 1
                     span_start = idx == span["start_token"]
                     ent["label"] = span["label"]
                     ent["is_start"] = span_start
@@ -176,7 +195,12 @@ class SpanRenderer:
                     # When the span starts, we need to know how many other
                     # spans are on the 'span stack' and will be rendered.
                     # This value becomes the vertical render slot for this entire span
-                    span["render_slot"] = concurrent_spans
+                    span["render_slot"] = (
+                        intersecting_spans[-1]["render_slot"]
+                        if len(intersecting_spans)
+                        else 0
+                    ) + 1
+                    intersecting_spans.append(span)
                     ent["render_slot"] = span["render_slot"]
                 kb_id = span.get("kb_id", "")
                 kb_url = span.get("kb_url", "#")
@@ -193,11 +217,8 @@ class SpanRenderer:
                     span["render_slot"] = 0
             token_markup["entities"] = entities
             per_token_info.append(token_markup)
-        markup = self._render_markup(per_token_info)
-        markup = TPL_SPANS.format(content=markup, dir=self.direction)
-        if title:
-            markup = TPL_TITLE.format(title=title) + markup
-        return markup
+
+        return per_token_info

     def _render_markup(self, per_token_info: List[Dict[str, Any]]) -> str:
         """Render the markup from per-token information"""
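The stacking comment in the hunk above describes a three-part sort key. A small standalone illustration of that ordering (the dictionary shapes are simplified stand-ins for the renderer's span dicts):

    # Spans sorted by start token, then longer spans first, then label.
    spans = [
        {"start_token": 2, "end_token": 4, "label": "PERSON"},
        {"start_token": 0, "end_token": 3, "label": "ORG"},
        {"start_token": 0, "end_token": 5, "label": "GPE"},
    ]
    spans.sort(
        key=lambda s: (s["start_token"], s["start_token"] - s["end_token"], s["label"])
    )
    # Result: GPE (longest span starting at token 0), then ORG, then PERSON.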
@@ -227,7 +227,6 @@ class Errors(metaclass=ErrorsWithCodes):
     E002 = ("Can't find factory for '{name}' for language {lang} ({lang_code}). "
             "This usually happens when spaCy calls `nlp.{method}` with a custom "
             "component name that's not registered on the current language class. "
-            "If you're using a Transformer, make sure to install 'spacy-transformers'. "
             "If you're using a custom component, make sure you've added the "
             "decorator `@Language.component` (for function components) or "
             "`@Language.factory` (for class components).\n\nAvailable "
@@ -984,6 +983,10 @@ class Errors(metaclass=ErrorsWithCodes):
              "predicted docs when training {component}.")
     E1055 = ("The 'replace_listener' callback expects {num_params} parameters, "
              "but only callbacks with one or three parameters are supported")
+    E1056 = ("The `TextCatBOW` architecture expects a length of at least 1, was {length}.")
+    E1057 = ("The `TextCatReduce` architecture must be used with at least one "
+             "reduction. Please enable one of `use_reduce_first`, "
+             "`use_reduce_last`, `use_reduce_max` or `use_reduce_mean`.")


 # Deprecated model shortcuts, only used in errors and warnings
@@ -1,3 +1,11 @@
 from .candidate import Candidate, get_candidates, get_candidates_batch
 from .kb import KnowledgeBase
 from .kb_in_memory import InMemoryLookupKB
+
+__all__ = [
+    "Candidate",
+    "KnowledgeBase",
+    "InMemoryLookupKB",
+    "get_candidates",
+    "get_candidates_batch",
+]
@@ -6,7 +6,8 @@ _num_words = [
     "nine", "ten", "eleven", "twelve", "thirteen", "fourteen", "fifteen",
     "sixteen", "seventeen", "eighteen", "nineteen", "twenty", "thirty", "forty",
     "fifty", "sixty", "seventy", "eighty", "ninety", "hundred", "thousand",
-    "million", "billion", "trillion", "quadrillion", "gajillion", "bazillion"
+    "million", "billion", "trillion", "quadrillion", "quintillion", "sextillion",
+    "septillion", "octillion", "nonillion", "decillion", "gajillion", "bazillion"
 ]
 _ordinal_words = [
     "first", "second", "third", "fourth", "fifth", "sixth", "seventh", "eighth",
@@ -14,7 +15,8 @@ _ordinal_words = [
     "fifteenth", "sixteenth", "seventeenth", "eighteenth", "nineteenth",
     "twentieth", "thirtieth", "fortieth", "fiftieth", "sixtieth", "seventieth",
     "eightieth", "ninetieth", "hundredth", "thousandth", "millionth", "billionth",
-    "trillionth", "quadrillionth", "gajillionth", "bazillionth"
+    "trillionth", "quadrillionth", "quintillionth", "sextillionth", "septillionth",
+    "octillionth", "nonillionth", "decillionth", "gajillionth", "bazillionth"
 ]
 # fmt: on
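With the extended word lists, the English like_num attribute should now flag the larger number words as well. A quick check (a sketch; requires spaCy installed):

    import spacy

    nlp = spacy.blank("en")
    doc = nlp("nine quintillion")
    print([token.like_num for token in doc])   # expected: [True, True]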
spacy/lang/fo/__init__.py (new file, 18 additions)

@@ -0,0 +1,18 @@
+from ...language import BaseDefaults, Language
+from ..punctuation import TOKENIZER_INFIXES, TOKENIZER_PREFIXES, TOKENIZER_SUFFIXES
+from .tokenizer_exceptions import TOKENIZER_EXCEPTIONS
+
+
+class FaroeseDefaults(BaseDefaults):
+    tokenizer_exceptions = TOKENIZER_EXCEPTIONS
+    infixes = TOKENIZER_INFIXES
+    suffixes = TOKENIZER_SUFFIXES
+    prefixes = TOKENIZER_PREFIXES
+
+
+class Faroese(Language):
+    lang = "fo"
+    Defaults = FaroeseDefaults
+
+
+__all__ = ["Faroese"]
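Registering the language class means a blank Faroese pipeline can be created directly. A minimal usage sketch (the example sentence is illustrative):

    import spacy

    nlp = spacy.blank("fo")
    # Tokenizer exceptions such as "t.d." should survive as single tokens.
    doc = nlp("Hann kom t.d. ikki í gjár.")
    print([token.text for token in doc])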
spacy/lang/fo/tokenizer_exceptions.py (new file, 90 additions)

@@ -0,0 +1,90 @@
+from ...symbols import ORTH
+from ...util import update_exc
+from ..tokenizer_exceptions import BASE_EXCEPTIONS
+
+_exc = {}
+
+for orth in [
+    "apr.",
+    "aug.",
+    "avgr.",
+    "árg.",
+    "ávís.",
+    "beinl.",
+    "blkv.",
+    "blaðkv.",
+    "blm.",
+    "blaðm.",
+    "bls.",
+    "blstj.",
+    "blaðstj.",
+    "des.",
+    "eint.",
+    "febr.",
+    "fyrrv.",
+    "góðk.",
+    "h.m.",
+    "innt.",
+    "jan.",
+    "kl.",
+    "m.a.",
+    "mðr.",
+    "mió.",
+    "nr.",
+    "nto.",
+    "nov.",
+    "nút.",
+    "o.a.",
+    "o.a.m.",
+    "o.a.tíl.",
+    "o.fl.",
+    "ff.",
+    "o.m.a.",
+    "o.o.",
+    "o.s.fr.",
+    "o.tíl.",
+    "o.ø.",
+    "okt.",
+    "omf.",
+    "pst.",
+    "ritstj.",
+    "sbr.",
+    "sms.",
+    "smst.",
+    "smb.",
+    "sb.",
+    "sbrt.",
+    "sp.",
+    "sept.",
+    "spf.",
+    "spsk.",
+    "t.e.",
+    "t.s.",
+    "t.s.s.",
+    "tlf.",
+    "tel.",
+    "tsk.",
+    "t.o.v.",
+    "t.d.",
+    "uml.",
+    "ums.",
+    "uppl.",
+    "upprfr.",
+    "uppr.",
+    "útg.",
+    "útl.",
+    "útr.",
+    "vanl.",
+    "v.",
+    "v.h.",
+    "v.ø.o.",
+    "viðm.",
+    "viðv.",
+    "vm.",
+    "v.m.",
+]:
+    _exc[orth] = [{ORTH: orth}]
+    capitalized = orth.capitalize()
+    _exc[capitalized] = [{ORTH: capitalized}]
+
+TOKENIZER_EXCEPTIONS = update_exc(BASE_EXCEPTIONS, _exc)
spacy/lang/nn/__init__.py (new file, 20 additions)

@@ -0,0 +1,20 @@
+from ...language import BaseDefaults, Language
+from ..nb import SYNTAX_ITERATORS
+from .punctuation import TOKENIZER_INFIXES, TOKENIZER_PREFIXES, TOKENIZER_SUFFIXES
+from .tokenizer_exceptions import TOKENIZER_EXCEPTIONS
+
+
+class NorwegianNynorskDefaults(BaseDefaults):
+    tokenizer_exceptions = TOKENIZER_EXCEPTIONS
+    prefixes = TOKENIZER_PREFIXES
+    infixes = TOKENIZER_INFIXES
+    suffixes = TOKENIZER_SUFFIXES
+    syntax_iterators = SYNTAX_ITERATORS
+
+
+class NorwegianNynorsk(Language):
+    lang = "nn"
+    Defaults = NorwegianNynorskDefaults
+
+
+__all__ = ["NorwegianNynorsk"]
spacy/lang/nn/examples.py (new file, 15 additions)

@@ -0,0 +1,15 @@
+"""
+Example sentences to test spaCy and its language models.
+
+>>> from spacy.lang.nn.examples import sentences
+>>> docs = nlp.pipe(sentences)
+"""
+
+
+# sentences taken from Omsetjingsminne frå Nynorsk pressekontor 2022 (https://www.nb.no/sprakbanken/en/resource-catalogue/oai-nb-no-sbr-80/)
+sentences = [
+    "Konseptet går ut på at alle tre omgangar tel, alle hopparar må stille i kvalifiseringa og poengsummen skal telje.",
+    "Det er ein meir enn i same periode i fjor.",
+    "Det har lava ned enorme snømengder i store delar av Europa den siste tida.",
+    "Akhtar Chaudhry er ikkje innstilt på Oslo-lista til SV, men utfordrar Heikki Holmås om førsteplassen.",
+]
spacy/lang/nn/punctuation.py (new file, 74 additions)

@@ -0,0 +1,74 @@
+from ..char_classes import (
+    ALPHA,
+    ALPHA_LOWER,
+    ALPHA_UPPER,
+    CONCAT_QUOTES,
+    CURRENCY,
+    LIST_CURRENCY,
+    LIST_ELLIPSES,
+    LIST_ICONS,
+    LIST_PUNCT,
+    LIST_QUOTES,
+    PUNCT,
+    UNITS,
+)
+from ..punctuation import TOKENIZER_SUFFIXES
+
+_quotes = CONCAT_QUOTES.replace("'", "")
+_list_punct = [x for x in LIST_PUNCT if x != "#"]
+_list_icons = [x for x in LIST_ICONS if x != "°"]
+_list_icons = [x.replace("\\u00B0", "") for x in _list_icons]
+_list_quotes = [x for x in LIST_QUOTES if x != "\\'"]
+
+
+_prefixes = (
+    ["§", "%", "=", "—", "–", r"\+(?![0-9])"]
+    + _list_punct
+    + LIST_ELLIPSES
+    + LIST_QUOTES
+    + LIST_CURRENCY
+    + LIST_ICONS
+)
+
+
+_infixes = (
+    LIST_ELLIPSES
+    + _list_icons
+    + [
+        r"(?<=[{al}])\.(?=[{au}])".format(al=ALPHA_LOWER, au=ALPHA_UPPER),
+        r"(?<=[{a}])[,!?](?=[{a}])".format(a=ALPHA),
+        r"(?<=[{a}])[:<>=/](?=[{a}])".format(a=ALPHA),
+        r"(?<=[{a}]),(?=[{a}])".format(a=ALPHA),
+        r"(?<=[{a}])([{q}\)\]\(\[])(?=[{a}])".format(a=ALPHA, q=_quotes),
+        r"(?<=[{a}])--(?=[{a}])".format(a=ALPHA),
+    ]
+)
+
+_suffixes = (
+    LIST_PUNCT
+    + LIST_ELLIPSES
+    + _list_quotes
+    + _list_icons
+    + ["—", "–"]
+    + [
+        r"(?<=[0-9])\+",
+        r"(?<=°[FfCcKk])\.",
+        r"(?<=[0-9])(?:{c})".format(c=CURRENCY),
+        r"(?<=[0-9])(?:{u})".format(u=UNITS),
+        r"(?<=[{al}{e}{p}(?:{q})])\.".format(
+            al=ALPHA_LOWER, e=r"%²\-\+", q=_quotes, p=PUNCT
+        ),
+        r"(?<=[{au}][{au}])\.".format(au=ALPHA_UPPER),
+    ]
+    + [r"(?<=[^sSxXzZ])'"]
+)
+_suffixes += [
+    suffix
+    for suffix in TOKENIZER_SUFFIXES
+    if suffix not in ["'s", "'S", "’s", "’S", r"\'"]
+]
+
+
+TOKENIZER_PREFIXES = _prefixes
+TOKENIZER_INFIXES = _infixes
+TOKENIZER_SUFFIXES = _suffixes
spacy/lang/nn/tokenizer_exceptions.py (new file, 228 additions)

@@ -0,0 +1,228 @@
+from ...symbols import NORM, ORTH
+from ...util import update_exc
+from ..tokenizer_exceptions import BASE_EXCEPTIONS
+
+_exc = {}
+
+
+for exc_data in [
+    {ORTH: "jan.", NORM: "januar"},
+    {ORTH: "feb.", NORM: "februar"},
+    {ORTH: "mar.", NORM: "mars"},
+    {ORTH: "apr.", NORM: "april"},
+    {ORTH: "jun.", NORM: "juni"},
+    # note: "jul." is in the simple list below without a NORM exception
+    {ORTH: "aug.", NORM: "august"},
+    {ORTH: "sep.", NORM: "september"},
+    {ORTH: "okt.", NORM: "oktober"},
+    {ORTH: "nov.", NORM: "november"},
+    {ORTH: "des.", NORM: "desember"},
+]:
+    _exc[exc_data[ORTH]] = [exc_data]
+
+
+for orth in [
+    "Ap.",
+    "Aq.",
+    "Ca.",
+    "Chr.",
+    "Co.",
+    "Dr.",
+    "F.eks.",
+    "Fr.p.",
+    "Frp.",
+    "Grl.",
+    "Kr.",
+    "Kr.F.",
+    "Kr.F.s",
+    "Mr.",
+    "Mrs.",
+    "Pb.",
+    "Pr.",
+    "Sp.",
+    "St.",
+    "a.m.",
+    "ad.",
+    "adm.dir.",
+    "adr.",
+    "b.c.",
+    "bl.a.",
+    "bla.",
+    "bm.",
+    "bnr.",
+    "bto.",
+    "c.c.",
+    "ca.",
+    "cand.mag.",
+    "co.",
+    "d.d.",
+    "d.m.",
+    "d.y.",
+    "dept.",
+    "dr.",
+    "dr.med.",
+    "dr.philos.",
+    "dr.psychol.",
+    "dss.",
+    "dvs.",
+    "e.Kr.",
+    "e.l.",
+    "eg.",
+    "eig.",
+    "ekskl.",
+    "el.",
+    "et.",
+    "etc.",
+    "etg.",
+    "ev.",
+    "evt.",
+    "f.",
+    "f.Kr.",
+    "f.eks.",
+    "f.o.m.",
+    "fhv.",
+    "fk.",
+    "foreg.",
+    "fork.",
+    "fv.",
+    "fvt.",
+    "g.",
+    "gl.",
+    "gno.",
+    "gnr.",
+    "grl.",
+    "gt.",
+    "h.r.adv.",
+    "hhv.",
+    "hoh.",
+    "hr.",
+    "ifb.",
+    "ifm.",
+    "iht.",
+    "inkl.",
+    "istf.",
+    "jf.",
+    "jr.",
+    "jul.",
+    "juris.",
+    "kfr.",
+    "kgl.",
+    "kgl.res.",
+    "kl.",
+    "komm.",
+    "kr.",
+    "kst.",
+    "lat.",
+    "lø.",
+    "m.a.",
+    "m.a.o.",
+    "m.fl.",
+    "m.m.",
+    "m.v.",
+    "ma.",
+    "mag.art.",
+    "md.",
+    "mfl.",
+    "mht.",
+    "mill.",
+    "min.",
+    "mnd.",
+    "moh.",
+    "mrd.",
+    "muh.",
+    "mv.",
+    "mva.",
+    "n.å.",
+    "ndf.",
+    "nr.",
+    "nto.",
+    "nyno.",
+    "o.a.",
+    "o.l.",
+    "obl.",
+    "off.",
+    "ofl.",
+    "on.",
+    "op.",
+    "org.",
+    "osv.",
+    "ovf.",
+    "p.",
+    "p.a.",
+    "p.g.a.",
+    "p.m.",
+    "p.t.",
+    "pga.",
+    "ph.d.",
+    "pkt.",
+    "pr.",
+    "pst.",
+    "pt.",
+    "red.anm.",
+    "ref.",
+    "res.",
+    "res.kap.",
+    "resp.",
+    "rv.",
+    "s.",
+    "s.d.",
+    "s.k.",
+    "s.u.",
+    "s.å.",
+    "sen.",
+    "sep.",
+    "siviling.",
+    "sms.",
+    "snr.",
+    "spm.",
+    "sr.",
+    "sst.",
+    "st.",
+    "st.meld.",
+    "st.prp.",
+    "stip.",
+    "stk.",
+    "stud.",
+    "sv.",
+    "såk.",
+    "sø.",
+    "t.d.",
+    "t.h.",
+    "t.o.m.",
+    "t.v.",
+    "temp.",
+    "ti.",
+    "tils.",
+    "tilsv.",
+    "tl;dr",
+    "tlf.",
+    "to.",
+    "ult.",
+    "utg.",
+    "v.",
+    "vedk.",
+    "vedr.",
+    "vg.",
+    "vgs.",
+    "vha.",
+    "vit.ass.",
+    "vn.",
+    "vol.",
+    "vs.",
+    "vsa.",
+    "§§",
+    "©NTB",
+    "årg.",
+    "årh.",
+]:
+    _exc[orth] = [{ORTH: orth}]
+
+# Dates
+for h in range(1, 31 + 1):
+    for period in ["."]:
+        _exc[f"{h}{period}"] = [{ORTH: f"{h}."}]
+
+_custom_base_exc = {"i.": [{ORTH: "i", NORM: "i"}, {ORTH: "."}]}
+_exc.update(_custom_base_exc)
+
+TOKENIZER_EXCEPTIONS = update_exc(BASE_EXCEPTIONS, _exc)
@@ -1683,6 +1683,12 @@ class Language:
         for proc in procs:
             proc.start()

+        # Close writing-end of channels. This is needed to avoid that reading
+        # from the channel blocks indefinitely when the worker closes the
+        # channel.
+        for tx in bytedocs_send_ch:
+            tx.close()
+
         # Cycle channels not to break the order of docs.
         # The received object is a batch of byte-encoded docs, so flatten them with chain.from_iterable.
         byte_tuples = chain.from_iterable(
@@ -1705,8 +1711,24 @@ class Language:
                 # tell `sender` that one batch was consumed.
                 sender.step()
         finally:
+            # If we are stopping in an orderly fashion, the workers' queues
+            # are empty. Put the sentinel in their queues to signal that work
+            # is done, so that they can exit gracefully.
+            for q in texts_q:
+                q.put(_WORK_DONE_SENTINEL)
+                q.close()
+
+            # Otherwise, we are stopping because the error handler raised an
+            # exception. The sentinel will be last to go out of the queue.
+            # To avoid doing unnecessary work or hanging on platforms that
+            # block on sending (Windows), we'll close our end of the channel.
+            # This signals to the worker that it can exit the next time it
+            # attempts to send data down the channel.
+            for r in bytedocs_recv_ch:
+                r.close()
+
             for proc in procs:
                 proc.terminate()
                 proc.join()

     def _link_components(self) -> None:
         """Register 'listeners' within pipeline components, to allow them to
@@ -2323,6 +2345,12 @@ def _apply_pipes(
     while True:
         try:
            texts_with_ctx = receiver.get()
+
+            # Stop working if we encounter the end-of-work sentinel.
+            if isinstance(texts_with_ctx, _WorkDoneSentinel):
+                sender.close()
+                receiver.close()
+                return

             docs = (
                 ensure_doc(doc_like, context) for doc_like, context in texts_with_ctx
             )
@@ -2331,11 +2359,22 @@ def _apply_pipes(
             # Connection does not accept unpickable objects, so send list.
             byte_docs = [(doc.to_bytes(), doc._context, None) for doc in docs]
             padding = [(None, None, None)] * (len(texts_with_ctx) - len(byte_docs))
-            sender.send(byte_docs + padding)  # type: ignore[operator]
+            data: Sequence[Tuple[Optional[bytes], Optional[Any], Optional[bytes]]] = (
+                byte_docs + padding  # type: ignore[operator]
+            )
         except Exception:
             error_msg = [(None, None, srsly.msgpack_dumps(traceback.format_exc()))]
             padding = [(None, None, None)] * (len(texts_with_ctx) - 1)
-            sender.send(error_msg + padding)
+            data = error_msg + padding
+
+        try:
+            sender.send(data)
+        except BrokenPipeError:
+            # Parent has closed the pipe prematurely. This happens when a
+            # worker encounters an error and the error handler is set to
+            # stop processing.
+            sender.close()
+            receiver.close()
+            return


 class _Sender:
@@ -2365,3 +2404,10 @@ class _Sender:
         if self.count >= self.chunk_size:
             self.count = 0
             self.send()
+
+
+class _WorkDoneSentinel:
+    pass
+
+
+_WORK_DONE_SENTINEL = _WorkDoneSentinel()
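The sentinel pattern added above is what lets workers exit cleanly whether the parent stops normally or on error. A standalone sketch of the same idea (names are illustrative, not spaCy's API); note that an isinstance() check is used because object identity does not survive pickling across processes:

    import multiprocessing as mp

    class WorkDone:
        """Sentinel type: isinstance() checks survive pickling, identity does not."""

    def worker(queue):
        while True:
            item = queue.get()
            if isinstance(item, WorkDone):   # same check as _WorkDoneSentinel above
                return
            print("processing", item)

    if __name__ == "__main__":
        queue = mp.Queue()
        proc = mp.Process(target=worker, args=(queue,))
        proc.start()
        for item in ["first batch", "second batch"]:
            queue.put(item)
        queue.put(WorkDone())                # signal orderly shutdown
        proc.join()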
@@ -3,4 +3,4 @@ from .levenshtein import levenshtein
 from .matcher import Matcher
 from .phrasematcher import PhraseMatcher

-__all__ = ["Matcher", "PhraseMatcher", "DependencyMatcher", "levenshtein"]
+__all__ = ["DependencyMatcher", "Matcher", "PhraseMatcher", "levenshtein"]
@@ -1,21 +1,27 @@
 from functools import partial
-from typing import List, Optional, cast
+from typing import List, Optional, Tuple, cast

 from thinc.api import (
     Dropout,
+    Gelu,
     LayerNorm,
     Linear,
     Logistic,
     Maxout,
     Model,
     ParametricAttention,
+    ParametricAttention_v2,
     Relu,
     Softmax,
     SparseLinear,
+    SparseLinear_v2,
     chain,
     clone,
     concatenate,
     list2ragged,
+    reduce_first,
+    reduce_last,
+    reduce_max,
     reduce_mean,
     reduce_sum,
     residual,
@@ -25,9 +31,10 @@ from thinc.api import (
 )
 from thinc.layers.chain import init as init_chain
 from thinc.layers.resizable import resize_linear_weighted, resize_model
-from thinc.types import Floats2d
+from thinc.types import ArrayXd, Floats2d

 from ...attrs import ORTH
+from ...errors import Errors
 from ...tokens import Doc
 from ...util import registry
 from ..extract_ngrams import extract_ngrams
@@ -47,39 +54,15 @@ def build_simple_cnn_text_classifier(
     outputs sum to 1. If exclusive_classes=False, a logistic non-linearity
     is applied instead, so that outputs are in the range [0, 1].
     """
-    fill_defaults = {"b": 0, "W": 0}
-    with Model.define_operators({">>": chain}):
-        cnn = tok2vec >> list2ragged() >> reduce_mean()
-        nI = tok2vec.maybe_get_dim("nO")
-        if exclusive_classes:
-            output_layer = Softmax(nO=nO, nI=nI)
-            fill_defaults["b"] = NEG_VALUE
-            resizable_layer: Model = resizable(
-                output_layer,
-                resize_layer=partial(
-                    resize_linear_weighted, fill_defaults=fill_defaults
-                ),
-            )
-            model = cnn >> resizable_layer
-        else:
-            output_layer = Linear(nO=nO, nI=nI)
-            resizable_layer = resizable(
-                output_layer,
-                resize_layer=partial(
-                    resize_linear_weighted, fill_defaults=fill_defaults
-                ),
-            )
-            model = cnn >> resizable_layer >> Logistic()
-        model.set_ref("output_layer", output_layer)
-        model.attrs["resize_output"] = partial(
-            resize_and_set_ref,
-            resizable_layer=resizable_layer,
-        )
-    model.set_ref("tok2vec", tok2vec)
-    if nO is not None:
-        model.set_dim("nO", cast(int, nO))
-    model.attrs["multi_label"] = not exclusive_classes
-    return model
+    return build_reduce_text_classifier(
+        tok2vec=tok2vec,
+        exclusive_classes=exclusive_classes,
+        use_reduce_first=False,
+        use_reduce_last=False,
+        use_reduce_max=False,
+        use_reduce_mean=True,
+        nO=nO,
+    )


 def resize_and_set_ref(model, new_nO, resizable_layer):
@@ -95,10 +78,48 @@ def build_bow_text_classifier(
     ngram_size: int,
     no_output_layer: bool,
     nO: Optional[int] = None,
 ) -> Model[List[Doc], Floats2d]:
+    return _build_bow_text_classifier(
+        exclusive_classes=exclusive_classes,
+        ngram_size=ngram_size,
+        no_output_layer=no_output_layer,
+        nO=nO,
+        sparse_linear=SparseLinear(nO=nO),
+    )
+
+
+@registry.architectures("spacy.TextCatBOW.v3")
+def build_bow_text_classifier_v3(
+    exclusive_classes: bool,
+    ngram_size: int,
+    no_output_layer: bool,
+    length: int = 262144,
+    nO: Optional[int] = None,
+) -> Model[List[Doc], Floats2d]:
+    if length < 1:
+        raise ValueError(Errors.E1056.format(length=length))
+
+    # Find k such that 2**(k-1) < length <= 2**k.
+    length = 2 ** (length - 1).bit_length()
+
+    return _build_bow_text_classifier(
+        exclusive_classes=exclusive_classes,
+        ngram_size=ngram_size,
+        no_output_layer=no_output_layer,
+        nO=nO,
+        sparse_linear=SparseLinear_v2(nO=nO, length=length),
+    )
+
+
+def _build_bow_text_classifier(
+    exclusive_classes: bool,
+    ngram_size: int,
+    no_output_layer: bool,
+    sparse_linear: Model[Tuple[ArrayXd, ArrayXd, ArrayXd], ArrayXd],
+    nO: Optional[int] = None,
+) -> Model[List[Doc], Floats2d]:
     fill_defaults = {"b": 0, "W": 0}
     with Model.define_operators({">>": chain}):
-        sparse_linear = SparseLinear(nO=nO)
         output_layer = None
         if not no_output_layer:
             fill_defaults["b"] = NEG_VALUE
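The power-of-two rounding above is compact enough to miss. A worked sketch of what it computes:

    # Round a requested table length up to the next power of two:
    # find k such that 2**(k-1) < length <= 2**k, then use 2**k.
    for length in (1, 2, 1000, 262144):
        print(length, "->", 2 ** (length - 1).bit_length())
    # 1 -> 1, 2 -> 2, 1000 -> 1024, 262144 -> 262144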
@@ -127,6 +148,9 @@ def build_text_classifier_v2(
     linear_model: Model[List[Doc], Floats2d],
     nO: Optional[int] = None,
 ) -> Model[List[Doc], Floats2d]:
+    # TODO: build the model with _build_parametric_attention_with_residual_nonlinear
+    # in spaCy v4. We don't do this in spaCy v3 to preserve model
+    # compatibility.
     exclusive_classes = not linear_model.attrs["multi_label"]
     with Model.define_operators({">>": chain, "|": concatenate}):
         width = tok2vec.maybe_get_dim("nO")
@@ -161,6 +185,11 @@ def build_text_classifier_v2(


 def init_ensemble_textcat(model, X, Y) -> Model:
+    # When tok2vec is lazily initialized, we need to initialize it before
+    # the rest of the chain to ensure that we can get its width.
+    tok2vec = model.get_ref("tok2vec")
+    tok2vec.initialize(X)
+
     tok2vec_width = get_tok2vec_width(model)
     model.get_ref("attention_layer").set_dim("nO", tok2vec_width)
     model.get_ref("maxout_layer").set_dim("nO", tok2vec_width)
@@ -190,3 +219,153 @@ def build_text_classifier_lowdata(
     model = model >> Dropout(dropout)
     model = model >> Logistic()
     return model
+
+
+@registry.architectures("spacy.TextCatParametricAttention.v1")
+def build_textcat_parametric_attention_v1(
+    tok2vec: Model[List[Doc], List[Floats2d]],
+    exclusive_classes: bool,
+    nO: Optional[int] = None,
+) -> Model[List[Doc], Floats2d]:
+    width = tok2vec.maybe_get_dim("nO")
+    parametric_attention = _build_parametric_attention_with_residual_nonlinear(
+        tok2vec=tok2vec,
+        nonlinear_layer=Maxout(nI=width, nO=width),
+        key_transform=Gelu(nI=width, nO=width),
+    )
+    with Model.define_operators({">>": chain}):
+        if exclusive_classes:
+            output_layer = Softmax(nO=nO)
+        else:
+            output_layer = Linear(nO=nO) >> Logistic()
+        model = parametric_attention >> output_layer
+    if model.has_dim("nO") is not False and nO is not None:
+        model.set_dim("nO", cast(int, nO))
+    model.set_ref("output_layer", output_layer)
+    model.attrs["multi_label"] = not exclusive_classes
+
+    return model
+
+
+def _build_parametric_attention_with_residual_nonlinear(
+    *,
+    tok2vec: Model[List[Doc], List[Floats2d]],
+    nonlinear_layer: Model[Floats2d, Floats2d],
+    key_transform: Optional[Model[Floats2d, Floats2d]] = None,
+) -> Model[List[Doc], Floats2d]:
+    with Model.define_operators({">>": chain, "|": concatenate}):
+        width = tok2vec.maybe_get_dim("nO")
+        attention_layer = ParametricAttention_v2(nO=width, key_transform=key_transform)
+        norm_layer = LayerNorm(nI=width)
+        parametric_attention = (
+            tok2vec
+            >> list2ragged()
+            >> attention_layer
+            >> reduce_sum()
+            >> residual(nonlinear_layer >> norm_layer >> Dropout(0.0))
+        )
+
+        parametric_attention.init = _init_parametric_attention_with_residual_nonlinear
+
+        parametric_attention.set_ref("tok2vec", tok2vec)
+        parametric_attention.set_ref("attention_layer", attention_layer)
+        parametric_attention.set_ref("key_transform", key_transform)
+        parametric_attention.set_ref("nonlinear_layer", nonlinear_layer)
+        parametric_attention.set_ref("norm_layer", norm_layer)
+
+    return parametric_attention
+
+
+def _init_parametric_attention_with_residual_nonlinear(model, X, Y) -> Model:
+    # When tok2vec is lazily initialized, we need to initialize it before
+    # the rest of the chain to ensure that we can get its width.
+    tok2vec = model.get_ref("tok2vec")
+    tok2vec.initialize(X)
+
+    tok2vec_width = get_tok2vec_width(model)
+    model.get_ref("attention_layer").set_dim("nO", tok2vec_width)
+    model.get_ref("key_transform").set_dim("nI", tok2vec_width)
+    model.get_ref("key_transform").set_dim("nO", tok2vec_width)
+    model.get_ref("nonlinear_layer").set_dim("nI", tok2vec_width)
+    model.get_ref("nonlinear_layer").set_dim("nO", tok2vec_width)
+    model.get_ref("norm_layer").set_dim("nI", tok2vec_width)
+    model.get_ref("norm_layer").set_dim("nO", tok2vec_width)
+    init_chain(model, X, Y)
+    return model
+
+
+@registry.architectures("spacy.TextCatReduce.v1")
+def build_reduce_text_classifier(
+    tok2vec: Model,
+    exclusive_classes: bool,
+    use_reduce_first: bool,
+    use_reduce_last: bool,
+    use_reduce_max: bool,
+    use_reduce_mean: bool,
+    nO: Optional[int] = None,
+) -> Model[List[Doc], Floats2d]:
+    """Build a model that classifies pooled `Doc` representations.
+
+    Pooling is performed using reductions. Reductions are concatenated when
+    multiple reductions are used.
+
+    tok2vec (Model): the tok2vec layer to pool over.
+    exclusive_classes (bool): Whether or not classes are mutually exclusive.
+    use_reduce_first (bool): Pool by using the hidden representation of the
+        first token of a `Doc`.
+    use_reduce_last (bool): Pool by using the hidden representation of the
+        last token of a `Doc`.
+    use_reduce_max (bool): Pool by taking the maximum values of the hidden
+        representations of a `Doc`.
+    use_reduce_mean (bool): Pool by taking the mean of all hidden
+        representations of a `Doc`.
+    nO (Optional[int]): Number of classes.
+    """
+
+    fill_defaults = {"b": 0, "W": 0}
+    reductions = []
+    if use_reduce_first:
+        reductions.append(reduce_first())
+    if use_reduce_last:
+        reductions.append(reduce_last())
+    if use_reduce_max:
+        reductions.append(reduce_max())
+    if use_reduce_mean:
+        reductions.append(reduce_mean())
+
+    if not len(reductions):
+        raise ValueError(Errors.E1057)
+
+    with Model.define_operators({">>": chain}):
+        cnn = tok2vec >> list2ragged() >> concatenate(*reductions)
+        nO_tok2vec = tok2vec.maybe_get_dim("nO")
+        nI = nO_tok2vec * len(reductions) if nO_tok2vec is not None else None
+        if exclusive_classes:
+            output_layer = Softmax(nO=nO, nI=nI)
+            fill_defaults["b"] = NEG_VALUE
+            resizable_layer: Model = resizable(
+                output_layer,
+                resize_layer=partial(
+                    resize_linear_weighted, fill_defaults=fill_defaults
+                ),
+            )
+            model = cnn >> resizable_layer
+        else:
+            output_layer = Linear(nO=nO, nI=nI)
+            resizable_layer = resizable(
+                output_layer,
+                resize_layer=partial(
+                    resize_linear_weighted, fill_defaults=fill_defaults
+                ),
+            )
+            model = cnn >> resizable_layer >> Logistic()
+        model.set_ref("output_layer", output_layer)
+        model.attrs["resize_output"] = partial(
+            resize_and_set_ref,
+            resizable_layer=resizable_layer,
+        )
+    model.set_ref("tok2vec", tok2vec)
+    if nO is not None:
+        model.set_dim("nO", cast(int, nO))
+    model.attrs["multi_label"] = not exclusive_classes
+    return model
@@ -22,6 +22,7 @@ from .trainable_pipe import TrainablePipe
 __all__ = [
     "AttributeRuler",
+    "DependencyParser",
     "EditTreeLemmatizer",
     "EntityLinker",
     "EntityRecognizer",
     "EntityRuler",
@@ -29,7 +29,7 @@ cdef class StateClass:
         return [self.B(i) for i in range(self.c.buffer_length())]

     @property
-    def token_vector_lenth(self):
+    def token_vector_length(self):
         return self.doc.tensor.shape[1]

     @property
@@ -36,8 +36,9 @@ maxout_pieces = 3
 depth = 2

 [model.linear_model]
-@architectures = "spacy.TextCatBOW.v2"
+@architectures = "spacy.TextCatBOW.v3"
 exclusive_classes = true
+length = 262144
 ngram_size = 1
 no_output_layer = false
 """
@@ -45,16 +46,21 @@ DEFAULT_SINGLE_TEXTCAT_MODEL = Config().from_str(single_label_default_config)["model"]

 single_label_bow_config = """
 [model]
-@architectures = "spacy.TextCatBOW.v2"
+@architectures = "spacy.TextCatBOW.v3"
 exclusive_classes = true
+length = 262144
 ngram_size = 1
 no_output_layer = false
 """

 single_label_cnn_config = """
 [model]
-@architectures = "spacy.TextCatCNN.v2"
+@architectures = "spacy.TextCatReduce.v1"
 exclusive_classes = true
+use_reduce_first = false
+use_reduce_last = false
+use_reduce_max = false
+use_reduce_mean = true

 [model.tok2vec]
 @architectures = "spacy.HashEmbedCNN.v2"
@@ -35,8 +35,9 @@ maxout_pieces = 3
depth = 2

[model.linear_model]
-@architectures = "spacy.TextCatBOW.v2"
+@architectures = "spacy.TextCatBOW.v3"
exclusive_classes = false
+length = 262144
ngram_size = 1
no_output_layer = false
"""
@@ -44,7 +45,7 @@ DEFAULT_MULTI_TEXTCAT_MODEL = Config().from_str(multi_label_default_config)["mod

multi_label_bow_config = """
[model]
-@architectures = "spacy.TextCatBOW.v2"
+@architectures = "spacy.TextCatBOW.v3"
exclusive_classes = false
ngram_size = 1
no_output_layer = false

@@ -52,8 +53,12 @@ no_output_layer = false

multi_label_cnn_config = """
[model]
-@architectures = "spacy.TextCatCNN.v2"
+@architectures = "spacy.TextCatReduce.v1"
exclusive_classes = false
+use_reduce_first = false
+use_reduce_last = false
+use_reduce_max = false
+use_reduce_mean = true

[model.tok2vec]
@architectures = "spacy.HashEmbedCNN.v2"
138	spacy/scorer.py
@@ -802,6 +802,140 @@ def get_ner_prf(examples: Iterable[Example], **kwargs) -> Dict[str, Any]:
    }


# The following implementation of trapezoid() is adapted from SciPy,
# which is distributed under the New BSD License.
# Copyright (c) 2001-2002 Enthought, Inc. 2003-2023, SciPy Developers.
# See licenses/3rd_party_licenses.txt
def trapezoid(y, x=None, dx=1.0, axis=-1):
    r"""
    Integrate along the given axis using the composite trapezoidal rule.

    If `x` is provided, the integration happens in sequence along its
    elements - they are not sorted.

    Integrate `y` (`x`) along each 1d slice on the given axis, compute
    :math:`\int y(x) dx`.
    When `x` is specified, this integrates along the parametric curve,
    computing :math:`\int_t y(t) dt =
    \int_t y(t) \left.\frac{dx}{dt}\right|_{x=x(t)} dt`.

    Parameters
    ----------
    y : array_like
        Input array to integrate.
    x : array_like, optional
        The sample points corresponding to the `y` values. If `x` is None,
        the sample points are assumed to be evenly spaced `dx` apart. The
        default is None.
    dx : scalar, optional
        The spacing between sample points when `x` is None. The default is 1.
    axis : int, optional
        The axis along which to integrate.

    Returns
    -------
    trapezoid : float or ndarray
        Definite integral of `y` = n-dimensional array as approximated along
        a single axis by the trapezoidal rule. If `y` is a 1-dimensional array,
        then the result is a float. If `n` is greater than 1, then the result
        is an `n`-1 dimensional array.

    See Also
    --------
    cumulative_trapezoid, simpson, romb

    Notes
    -----
    Image [2]_ illustrates trapezoidal rule -- y-axis locations of points
    will be taken from `y` array, by default x-axis distances between
    points will be 1.0, alternatively they can be provided with `x` array
    or with `dx` scalar. Return value will be equal to combined area under
    the red lines.

    References
    ----------
    .. [1] Wikipedia page: https://en.wikipedia.org/wiki/Trapezoidal_rule

    .. [2] Illustration image:
           https://en.wikipedia.org/wiki/File:Composite_trapezoidal_rule_illustration.png

    Examples
    --------
    Use the trapezoidal rule on evenly spaced points:

    >>> import numpy as np
    >>> from scipy import integrate
    >>> integrate.trapezoid([1, 2, 3])
    4.0

    The spacing between sample points can be selected by either the
    ``x`` or ``dx`` arguments:

    >>> integrate.trapezoid([1, 2, 3], x=[4, 6, 8])
    8.0
    >>> integrate.trapezoid([1, 2, 3], dx=2)
    8.0

    Using a decreasing ``x`` corresponds to integrating in reverse:

    >>> integrate.trapezoid([1, 2, 3], x=[8, 6, 4])
    -8.0

    More generally ``x`` is used to integrate along a parametric curve. We can
    estimate the integral :math:`\int_0^1 x^2 = 1/3` using:

    >>> x = np.linspace(0, 1, num=50)
    >>> y = x**2
    >>> integrate.trapezoid(y, x)
    0.33340274885464394

    Or estimate the area of a circle, noting we repeat the sample which closes
    the curve:

    >>> theta = np.linspace(0, 2 * np.pi, num=1000, endpoint=True)
    >>> integrate.trapezoid(np.cos(theta), x=np.sin(theta))
    3.141571941375841

    ``trapezoid`` can be applied along a specified axis to do multiple
    computations in one call:

    >>> a = np.arange(6).reshape(2, 3)
    >>> a
    array([[0, 1, 2],
           [3, 4, 5]])
    >>> integrate.trapezoid(a, axis=0)
    array([1.5, 2.5, 3.5])
    >>> integrate.trapezoid(a, axis=1)
    array([2., 8.])
    """
    y = np.asanyarray(y)
    if x is None:
        d = dx
    else:
        x = np.asanyarray(x)
        if x.ndim == 1:
            d = np.diff(x)
            # reshape to correct shape
            shape = [1] * y.ndim
            shape[axis] = d.shape[0]
            d = d.reshape(shape)
        else:
            d = np.diff(x, axis=axis)
    nd = y.ndim
    slice1 = [slice(None)] * nd
    slice2 = [slice(None)] * nd
    slice1[axis] = slice(1, None)
    slice2[axis] = slice(None, -1)
    try:
        ret = (d * (y[tuple(slice1)] + y[tuple(slice2)]) / 2.0).sum(axis)
    except ValueError:
        # Operations didn't work, cast to ndarray
        d = np.asarray(d)
        y = np.asarray(y)
        ret = np.add.reduce(d * (y[tuple(slice1)] + y[tuple(slice2)]) / 2.0, axis)
    return ret

# The following implementation of roc_auc_score() is adapted from
# scikit-learn, which is distributed under the New BSD License.
# Copyright (c) 2007–2019 The scikit-learn developers.

@@ -1024,9 +1158,9 @@ def _auc(x, y):
    else:
        raise ValueError(Errors.E164.format(x=x))

-    area = direction * np.trapz(y, x)
+    area = direction * trapezoid(y, x)
    if isinstance(area, np.memmap):
-        # Reductions such as .sum used internally in np.trapz do not return a
+        # Reductions such as .sum used internally in trapezoid do not return a
        # scalar by default for numpy.memmap instances contrary to
        # regular numpy.ndarray instances.
        area = area.dtype.type(area)
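To make the swap concrete: `trapezoid` is a drop-in replacement for the `np.trapz` call (deprecated in NumPy 2.0), so `_auc` still computes the signed area under the `(x, y)` curve. A minimal sketch with made-up ROC points, assuming a spaCy version that includes the vendored helper:

```python
import numpy as np
from spacy.scorer import trapezoid  # the helper added above

# made-up ROC points for a perfect classifier: FPR on x, TPR on y
x = np.array([0.0, 0.0, 1.0])
y = np.array([0.0, 1.0, 1.0])
print(trapezoid(y, x))  # 1.0 -- the area _auc would report
```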
@@ -162,6 +162,11 @@ def fi_tokenizer():
    return get_lang_class("fi")().tokenizer


+@pytest.fixture(scope="session")
+def fo_tokenizer():
+    return get_lang_class("fo")().tokenizer
+
+
@pytest.fixture(scope="session")
def fr_tokenizer():
    return get_lang_class("fr")().tokenizer

@@ -317,6 +322,11 @@ def nl_tokenizer():
    return get_lang_class("nl")().tokenizer


+@pytest.fixture(scope="session")
+def nn_tokenizer():
+    return get_lang_class("nn")().tokenizer
+
+
@pytest.fixture(scope="session")
def pl_tokenizer():
    return get_lang_class("pl")().tokenizer
@@ -731,3 +731,12 @@ def test_for_no_ent_sents():
    sents = list(doc.ents[0].sents)
    assert len(sents) == 1
    assert str(sents[0]) == str(doc.ents[0].sent) == "ENTITY"
+
+
+def test_span_api_richcmp_other(en_tokenizer):
+    doc1 = en_tokenizer("a b")
+    doc2 = en_tokenizer("b c")
+    assert not doc1[1:2] == doc1[1]
+    assert not doc1[1:2] == doc2[0]
+    assert not doc1[1:2] == doc2[0:1]
+    assert not doc1[0:1] == doc2
@@ -294,3 +294,12 @@ def test_missing_head_dep(en_vocab):
    assert aligned_heads[0] == ref_heads[0]
    assert aligned_deps[5] == ref_deps[5]
    assert aligned_heads[5] == ref_heads[5]
+
+
+def test_token_api_richcmp_other(en_tokenizer):
+    doc1 = en_tokenizer("a b")
+    doc2 = en_tokenizer("b c")
+    assert not doc1[1] == doc1[0:1]
+    assert not doc1[1] == doc2[1:2]
+    assert not doc1[1] == doc2[0]
+    assert not doc1[0] == doc2
0	spacy/tests/lang/fo/__init__.py	Normal file
26	spacy/tests/lang/fo/test_tokenizer.py	Normal file

@@ -0,0 +1,26 @@
import pytest

# examples taken from Basic Language Resource Kit 1.0 for Faroese (https://maltokni.fo/en/resources) licensed with CC BY 4.0 (https://creativecommons.org/licenses/by/4.0/)
# fmt: off
FO_TOKEN_EXCEPTION_TESTS = [
    (
        "Eftir løgtingslóg um samsýning og eftirløn landsstýrismanna v.m., skulu løgmaður og landsstýrismenn vanliga siga frá sær størv í almennari tænastu ella privatum virkjum, samtøkum ella stovnum. ",
        [
            "Eftir", "løgtingslóg", "um", "samsýning", "og", "eftirløn", "landsstýrismanna", "v.m.", ",", "skulu", "løgmaður", "og", "landsstýrismenn", "vanliga", "siga", "frá", "sær", "størv", "í", "almennari", "tænastu", "ella", "privatum", "virkjum", ",", "samtøkum", "ella", "stovnum", ".",
        ],
    ),
    (
        "Sambandsflokkurin gongur aftur við 2,7 prosentum í mun til valið í 1994, tá flokkurin fekk undirtøku frá 23,4 prosent av veljarunum.",
        [
            "Sambandsflokkurin", "gongur", "aftur", "við", "2,7", "prosentum", "í", "mun", "til", "valið", "í", "1994", ",", "tá", "flokkurin", "fekk", "undirtøku", "frá", "23,4", "prosent", "av", "veljarunum", ".",
        ],
    ),
]
# fmt: on


@pytest.mark.parametrize("text,expected_tokens", FO_TOKEN_EXCEPTION_TESTS)
def test_fo_tokenizer_handles_exception_cases(fo_tokenizer, text, expected_tokens):
    tokens = fo_tokenizer(text)
    token_list = [token.text for token in tokens if not token.is_space]
    assert expected_tokens == token_list
0	spacy/tests/lang/nn/__init__.py	Normal file
38	spacy/tests/lang/nn/test_tokenizer.py	Normal file

@@ -0,0 +1,38 @@
import pytest

# examples taken from Omsetjingsminne frå Nynorsk pressekontor 2022 (https://www.nb.no/sprakbanken/en/resource-catalogue/oai-nb-no-sbr-80/)
# fmt: off
NN_TOKEN_EXCEPTION_TESTS = [
    (
        "Målet til direktoratet er at alle skal bli tilbydd jobb i politiet så raskt som mogleg i 2014.",
        [
            "Målet", "til", "direktoratet", "er", "at", "alle", "skal", "bli", "tilbydd", "jobb", "i", "politiet", "så", "raskt", "som", "mogleg", "i", "2014", ".",
        ],
    ),
    (
        "Han ønskjer ikkje at staten skal vere med på å finansiere slik undervisning, men dette er rektor på skulen ueinig i.",
        [
            "Han", "ønskjer", "ikkje", "at", "staten", "skal", "vere", "med", "på", "å", "finansiere", "slik", "undervisning", ",", "men", "dette", "er", "rektor", "på", "skulen", "ueinig", "i", ".",
        ],
    ),
    (
        "Ifølgje China Daily vart det 8.848 meter høge fjellet flytta 3 centimeter sørvestover under jordskjelvet, som vart målt til 7,8.",
        [
            "Ifølgje", "China", "Daily", "vart", "det", "8.848", "meter", "høge", "fjellet", "flytta", "3", "centimeter", "sørvestover", "under", "jordskjelvet", ",", "som", "vart", "målt", "til", "7,8", ".",
        ],
    ),
    (
        "Brukssesongen er frå nov. til mai, med ein topp i mars.",
        [
            "Brukssesongen", "er", "frå", "nov.", "til", "mai", ",", "med", "ein", "topp", "i", "mars", ".",
        ],
    ),
]
# fmt: on


@pytest.mark.parametrize("text,expected_tokens", NN_TOKEN_EXCEPTION_TESTS)
def test_nn_tokenizer_handles_exception_cases(nn_tokenizer, text, expected_tokens):
    tokens = nn_tokenizer(text)
    token_list = [token.text for token in tokens if not token.is_space]
    assert expected_tokens == token_list
@@ -203,7 +203,7 @@ def test_pipe_class_component_model():
        "@architectures": "spacy.TextCatEnsemble.v2",
        "tok2vec": DEFAULT_TOK2VEC_MODEL,
        "linear_model": {
-            "@architectures": "spacy.TextCatBOW.v2",
+            "@architectures": "spacy.TextCatBOW.v3",
            "exclusive_classes": False,
            "ngram_size": 1,
            "no_output_layer": False,
@@ -28,6 +28,8 @@ from spacy.tokens import Doc, DocBin
from spacy.training import Example
from spacy.training.initialize import init_nlp

+# Ensure that the architecture gets added to the registry.
+from ..tok2vec import build_lazy_init_tok2vec as _
from ..util import make_tempdir

TRAIN_DATA_SINGLE_LABEL = [

@@ -40,6 +42,13 @@ TRAIN_DATA_MULTI_LABEL = [
    ("I'm confused but happy", {"cats": {"ANGRY": 0.0, "CONFUSED": 1.0, "HAPPY": 1.0}}),
]

+lazy_init_model_config = """
+[model]
+@architectures = "test.LazyInitTok2Vec.v1"
+width = 96
+"""
+LAZY_INIT_TOK2VEC_MODEL = Config().from_str(lazy_init_model_config)["model"]
+

def make_get_examples_single_label(nlp):
    train_examples = []
|
|||
@pytest.mark.parametrize(
|
||||
"name,textcat_config",
|
||||
[
|
||||
# BOW
|
||||
# BOW V1
|
||||
("textcat", {"@architectures": "spacy.TextCatBOW.v1", "exclusive_classes": True, "no_output_layer": False, "ngram_size": 3}),
|
||||
("textcat", {"@architectures": "spacy.TextCatBOW.v1", "exclusive_classes": True, "no_output_layer": True, "ngram_size": 3}),
|
||||
("textcat_multilabel", {"@architectures": "spacy.TextCatBOW.v1", "exclusive_classes": False, "no_output_layer": False, "ngram_size": 3}),
|
||||
|
@@ -451,14 +460,14 @@ def test_no_resize(name, textcat_config):
@pytest.mark.parametrize(
    "name,textcat_config",
    [
-        # BOW
-        ("textcat", {"@architectures": "spacy.TextCatBOW.v2", "exclusive_classes": True, "no_output_layer": False, "ngram_size": 3}),
-        ("textcat", {"@architectures": "spacy.TextCatBOW.v2", "exclusive_classes": True, "no_output_layer": True, "ngram_size": 3}),
-        ("textcat_multilabel", {"@architectures": "spacy.TextCatBOW.v2", "exclusive_classes": False, "no_output_layer": False, "ngram_size": 3}),
-        ("textcat_multilabel", {"@architectures": "spacy.TextCatBOW.v2", "exclusive_classes": False, "no_output_layer": True, "ngram_size": 3}),
+        # BOW V3
+        ("textcat", {"@architectures": "spacy.TextCatBOW.v3", "exclusive_classes": True, "no_output_layer": False, "ngram_size": 3}),
+        ("textcat", {"@architectures": "spacy.TextCatBOW.v3", "exclusive_classes": True, "no_output_layer": True, "ngram_size": 3}),
+        ("textcat_multilabel", {"@architectures": "spacy.TextCatBOW.v3", "exclusive_classes": False, "no_output_layer": False, "ngram_size": 3}),
+        ("textcat_multilabel", {"@architectures": "spacy.TextCatBOW.v3", "exclusive_classes": False, "no_output_layer": True, "ngram_size": 3}),
        # CNN
-        ("textcat", {"@architectures": "spacy.TextCatCNN.v2", "tok2vec": DEFAULT_TOK2VEC_MODEL, "exclusive_classes": True}),
-        ("textcat_multilabel", {"@architectures": "spacy.TextCatCNN.v2", "tok2vec": DEFAULT_TOK2VEC_MODEL, "exclusive_classes": False}),
+        ("textcat", {"@architectures": "spacy.TextCatReduce.v1", "tok2vec": DEFAULT_TOK2VEC_MODEL, "exclusive_classes": True, "use_reduce_first": True, "use_reduce_last": True, "use_reduce_max": True, "use_reduce_mean": True}),
+        ("textcat_multilabel", {"@architectures": "spacy.TextCatReduce.v1", "tok2vec": DEFAULT_TOK2VEC_MODEL, "exclusive_classes": False, "use_reduce_first": True, "use_reduce_last": True, "use_reduce_max": True, "use_reduce_mean": True}),
    ],
)
# fmt: on
@@ -480,14 +489,14 @@ def test_resize(name, textcat_config):
@pytest.mark.parametrize(
    "name,textcat_config",
    [
-        # BOW
-        ("textcat", {"@architectures": "spacy.TextCatBOW.v2", "exclusive_classes": True, "no_output_layer": False, "ngram_size": 3}),
-        ("textcat", {"@architectures": "spacy.TextCatBOW.v2", "exclusive_classes": True, "no_output_layer": True, "ngram_size": 3}),
-        ("textcat_multilabel", {"@architectures": "spacy.TextCatBOW.v2", "exclusive_classes": False, "no_output_layer": False, "ngram_size": 3}),
-        ("textcat_multilabel", {"@architectures": "spacy.TextCatBOW.v2", "exclusive_classes": False, "no_output_layer": True, "ngram_size": 3}),
-        # CNN
-        ("textcat", {"@architectures": "spacy.TextCatCNN.v2", "tok2vec": DEFAULT_TOK2VEC_MODEL, "exclusive_classes": True}),
-        ("textcat_multilabel", {"@architectures": "spacy.TextCatCNN.v2", "tok2vec": DEFAULT_TOK2VEC_MODEL, "exclusive_classes": False}),
+        # BOW v3
+        ("textcat", {"@architectures": "spacy.TextCatBOW.v3", "exclusive_classes": True, "no_output_layer": False, "ngram_size": 3}),
+        ("textcat", {"@architectures": "spacy.TextCatBOW.v3", "exclusive_classes": True, "no_output_layer": True, "ngram_size": 3}),
+        ("textcat_multilabel", {"@architectures": "spacy.TextCatBOW.v3", "exclusive_classes": False, "no_output_layer": False, "ngram_size": 3}),
+        ("textcat_multilabel", {"@architectures": "spacy.TextCatBOW.v3", "exclusive_classes": False, "no_output_layer": True, "ngram_size": 3}),
+        # REDUCE
+        ("textcat", {"@architectures": "spacy.TextCatReduce.v1", "tok2vec": DEFAULT_TOK2VEC_MODEL, "exclusive_classes": True, "use_reduce_first": True, "use_reduce_last": True, "use_reduce_max": True, "use_reduce_mean": True}),
+        ("textcat_multilabel", {"@architectures": "spacy.TextCatReduce.v1", "tok2vec": DEFAULT_TOK2VEC_MODEL, "exclusive_classes": False, "use_reduce_first": True, "use_reduce_last": True, "use_reduce_max": True, "use_reduce_mean": True}),
    ],
)
# fmt: on
@@ -546,6 +555,34 @@ def test_error_with_multi_labels():
    nlp.initialize(get_examples=lambda: train_examples)


+# fmt: off
+@pytest.mark.parametrize(
+    "name,textcat_config",
+    [
+        # ENSEMBLE V2
+        ("textcat_multilabel", {"@architectures": "spacy.TextCatEnsemble.v2", "tok2vec": LAZY_INIT_TOK2VEC_MODEL, "linear_model": {"@architectures": "spacy.TextCatBOW.v3", "exclusive_classes": False, "ngram_size": 1, "no_output_layer": False}}),
+        ("textcat", {"@architectures": "spacy.TextCatEnsemble.v2", "tok2vec": LAZY_INIT_TOK2VEC_MODEL, "linear_model": {"@architectures": "spacy.TextCatBOW.v3", "exclusive_classes": True, "ngram_size": 5, "no_output_layer": False}}),
+        # PARAMETRIC ATTENTION V1
+        ("textcat", {"@architectures": "spacy.TextCatParametricAttention.v1", "tok2vec": LAZY_INIT_TOK2VEC_MODEL, "exclusive_classes": True}),
+        ("textcat_multilabel", {"@architectures": "spacy.TextCatParametricAttention.v1", "tok2vec": LAZY_INIT_TOK2VEC_MODEL, "exclusive_classes": False}),
+        # REDUCE
+        ("textcat", {"@architectures": "spacy.TextCatReduce.v1", "tok2vec": LAZY_INIT_TOK2VEC_MODEL, "exclusive_classes": True, "use_reduce_first": True, "use_reduce_last": True, "use_reduce_max": True, "use_reduce_mean": True}),
+        ("textcat_multilabel", {"@architectures": "spacy.TextCatReduce.v1", "tok2vec": LAZY_INIT_TOK2VEC_MODEL, "exclusive_classes": False, "use_reduce_first": True, "use_reduce_last": True, "use_reduce_max": True, "use_reduce_mean": True}),
+    ],
+)
+# fmt: on
+def test_tok2vec_lazy_init(name, textcat_config):
+    # Check that we can properly initialize and use a textcat model using
+    # a lazily-initialized tok2vec.
+    nlp = English()
+    pipe_config = {"model": textcat_config}
+    textcat = nlp.add_pipe(name, config=pipe_config)
+    textcat.add_label("POSITIVE")
+    textcat.add_label("NEGATIVE")
+    nlp.initialize()
+    nlp.pipe(["This is a test."])
+
+
@pytest.mark.parametrize(
    "name,get_examples, train_data",
    [
@@ -693,12 +730,23 @@ def test_overfitting_IO_multi():
-        ("textcat", TRAIN_DATA_SINGLE_LABEL, {"@architectures": "spacy.TextCatBOW.v2", "exclusive_classes": True, "ngram_size": 4, "no_output_layer": False}),
-        ("textcat_multilabel", TRAIN_DATA_MULTI_LABEL, {"@architectures": "spacy.TextCatBOW.v2", "exclusive_classes": False, "ngram_size": 3, "no_output_layer": True}),
-        ("textcat", TRAIN_DATA_SINGLE_LABEL, {"@architectures": "spacy.TextCatBOW.v2", "exclusive_classes": True, "ngram_size": 2, "no_output_layer": True}),
+        # BOW V3
+        ("textcat_multilabel", TRAIN_DATA_MULTI_LABEL, {"@architectures": "spacy.TextCatBOW.v3", "exclusive_classes": False, "ngram_size": 1, "no_output_layer": False}),
+        ("textcat", TRAIN_DATA_SINGLE_LABEL, {"@architectures": "spacy.TextCatBOW.v3", "exclusive_classes": True, "ngram_size": 4, "no_output_layer": False}),
+        ("textcat_multilabel", TRAIN_DATA_MULTI_LABEL, {"@architectures": "spacy.TextCatBOW.v3", "exclusive_classes": False, "ngram_size": 3, "no_output_layer": True}),
+        ("textcat", TRAIN_DATA_SINGLE_LABEL, {"@architectures": "spacy.TextCatBOW.v3", "exclusive_classes": True, "ngram_size": 2, "no_output_layer": True}),
        # ENSEMBLE V2
-        ("textcat_multilabel", TRAIN_DATA_MULTI_LABEL, {"@architectures": "spacy.TextCatEnsemble.v2", "tok2vec": DEFAULT_TOK2VEC_MODEL, "linear_model": {"@architectures": "spacy.TextCatBOW.v2", "exclusive_classes": False, "ngram_size": 1, "no_output_layer": False}}),
-        ("textcat", TRAIN_DATA_SINGLE_LABEL, {"@architectures": "spacy.TextCatEnsemble.v2", "tok2vec": DEFAULT_TOK2VEC_MODEL, "linear_model": {"@architectures": "spacy.TextCatBOW.v2", "exclusive_classes": True, "ngram_size": 5, "no_output_layer": False}}),
-        # CNN V2
+        ("textcat_multilabel", TRAIN_DATA_MULTI_LABEL, {"@architectures": "spacy.TextCatEnsemble.v2", "tok2vec": DEFAULT_TOK2VEC_MODEL, "linear_model": {"@architectures": "spacy.TextCatBOW.v3", "exclusive_classes": False, "ngram_size": 1, "no_output_layer": False}}),
+        ("textcat", TRAIN_DATA_SINGLE_LABEL, {"@architectures": "spacy.TextCatEnsemble.v2", "tok2vec": DEFAULT_TOK2VEC_MODEL, "linear_model": {"@architectures": "spacy.TextCatBOW.v3", "exclusive_classes": True, "ngram_size": 5, "no_output_layer": False}}),
+        # CNN V2 (legacy)
        ("textcat", TRAIN_DATA_SINGLE_LABEL, {"@architectures": "spacy.TextCatCNN.v2", "tok2vec": DEFAULT_TOK2VEC_MODEL, "exclusive_classes": True}),
        ("textcat_multilabel", TRAIN_DATA_MULTI_LABEL, {"@architectures": "spacy.TextCatCNN.v2", "tok2vec": DEFAULT_TOK2VEC_MODEL, "exclusive_classes": False}),
+        # PARAMETRIC ATTENTION V1
+        ("textcat", TRAIN_DATA_SINGLE_LABEL, {"@architectures": "spacy.TextCatParametricAttention.v1", "tok2vec": DEFAULT_TOK2VEC_MODEL, "exclusive_classes": True}),
+        ("textcat_multilabel", TRAIN_DATA_MULTI_LABEL, {"@architectures": "spacy.TextCatParametricAttention.v1", "tok2vec": DEFAULT_TOK2VEC_MODEL, "exclusive_classes": False}),
+        # REDUCE V1
+        ("textcat", TRAIN_DATA_SINGLE_LABEL, {"@architectures": "spacy.TextCatReduce.v1", "tok2vec": DEFAULT_TOK2VEC_MODEL, "exclusive_classes": True, "use_reduce_first": True, "use_reduce_last": True, "use_reduce_max": True, "use_reduce_mean": True}),
+        ("textcat_multilabel", TRAIN_DATA_MULTI_LABEL, {"@architectures": "spacy.TextCatReduce.v1", "tok2vec": DEFAULT_TOK2VEC_MODEL, "exclusive_classes": False, "use_reduce_first": True, "use_reduce_last": True, "use_reduce_max": True, "use_reduce_mean": True}),
    ],
)
# fmt: on
@@ -15,7 +15,12 @@ def doc_w_attrs(en_tokenizer):
    Token.set_extension("_test_token", default="t0")
    doc[1]._._test_token = "t1"

-    return doc
+    yield doc
+
+    Doc.remove_extension("_test_attr")
+    Doc.remove_extension("_test_prop")
+    Doc.remove_extension("_test_method")
+    Token.remove_extension("_test_token")


def test_serialize_ext_attrs_from_bytes(doc_w_attrs):
@@ -12,7 +12,6 @@ from thinc.api import Config

import spacy
from spacy import about
-from spacy import info as spacy_info
from spacy.cli import info
from spacy.cli._util import parse_config_overrides, string_to_list, walk_directory
from spacy.cli.apply import apply

@@ -193,9 +192,6 @@ def test_cli_info():
    raw_data = info(tmp_dir, exclude=[""])
    assert raw_data["lang"] == "nl"
    assert raw_data["components"] == ["textcat"]
-    raw_data = spacy_info(tmp_dir, exclude=[""])
-    assert raw_data["lang"] == "nl"
-    assert raw_data["components"] == ["textcat"]


def test_cli_converters_conllu_to_docs():

@@ -1065,3 +1061,8 @@ def test_debug_data_trainable_lemmatizer_not_annotated():

    data = _compile_gold(train_examples, ["trainable_lemmatizer"], nlp, True)
    assert data["no_lemma_annotations"] == 2
+
+
+def test_project_api_imports():
+    from spacy.cli import project_run
+    from spacy.cli.project.run import project_run  # noqa: F401, F811
@@ -214,9 +214,6 @@ def test_project_clone(options):
    assert (out / "README.md").is_file()


-@pytest.mark.skipif(
-    sys.version_info >= (3, 12), reason="Python 3.12+ not supported for remotes"
-)
def test_project_push_pull(project_dir):
    proj = dict(SAMPLE_PROJECT)
    remote = "xyz"

@@ -241,7 +238,7 @@ def test_project_push_pull(project_dir):

def test_find_function_valid():
    # example of architecture in main code base
-    function = "spacy.TextCatBOW.v2"
+    function = "spacy.TextCatBOW.v3"
    result = CliRunner().invoke(app, ["find-function", function, "-r", "architectures"])
    assert f"Found registered function '{function}'" in result.stdout
    assert "textcat.py" in result.stdout

@@ -260,7 +257,7 @@ def test_find_function_valid():

def test_find_function_invalid():
    # invalid registry
-    function = "spacy.TextCatBOW.v2"
+    function = "spacy.TextCatBOW.v3"
    registry = "foobar"
    result = CliRunner().invoke(
        app, ["find-function", function, "--registry", registry]
@@ -2,7 +2,7 @@ import numpy
import pytest

from spacy import displacy
-from spacy.displacy.render import DependencyRenderer, EntityRenderer
+from spacy.displacy.render import DependencyRenderer, EntityRenderer, SpanRenderer
from spacy.lang.en import English
from spacy.lang.fa import Persian
from spacy.tokens import Doc, Span

@@ -468,3 +468,23 @@ def test_issue12816(en_vocab) -> None:
    # Verify that the HTML tag is still escaped
    html = displacy.render(doc, style="span")
    assert "&lt;TEST&gt;" in html
+
+
+@pytest.mark.issue(13056)
+def test_displacy_span_stacking():
+    """Test whether span stacking works properly for multiple overlapping spans."""
+    spans = [
+        {"start_token": 2, "end_token": 5, "label": "SkillNC"},
+        {"start_token": 0, "end_token": 2, "label": "Skill"},
+        {"start_token": 1, "end_token": 3, "label": "Skill"},
+    ]
+    tokens = ["Welcome", "to", "the", "Bank", "of", "China", "."]
+    per_token_info = SpanRenderer._assemble_per_token_info(spans=spans, tokens=tokens)
+
+    assert len(per_token_info) == len(tokens)
+    assert all([len(per_token_info[i]["entities"]) == 1 for i in (0, 3, 4)])
+    assert all([len(per_token_info[i]["entities"]) == 2 for i in (1, 2)])
+    assert per_token_info[1]["entities"][0]["render_slot"] == 1
+    assert per_token_info[1]["entities"][1]["render_slot"] == 2
+    assert per_token_info[2]["entities"][0]["render_slot"] == 2
+    assert per_token_info[2]["entities"][1]["render_slot"] == 3
@@ -376,8 +376,9 @@ def test_util_dot_section():
    factory = "textcat"

    [components.textcat.model]
-    @architectures = "spacy.TextCatBOW.v2"
+    @architectures = "spacy.TextCatBOW.v3"
    exclusive_classes = true
+    length = 262144
    ngram_size = 1
    no_output_layer = false
    """
@@ -485,8 +486,8 @@ def test_to_ternary_int():

def test_find_available_port():
    host = "0.0.0.0"
-    port = 5000
-    assert find_available_port(port, host) == port, "Port 5000 isn't free"
+    port = 5001
+    assert find_available_port(port, host) == port, "Port 5001 isn't free"

    from wsgiref.simple_server import demo_app, make_server
@@ -26,6 +26,7 @@ from spacy.ml.models import (
    build_Tok2Vec_model,
)
from spacy.ml.staticvectors import StaticVectors
+from spacy.util import registry


def get_textcat_bow_kwargs():

@@ -284,3 +285,17 @@ def test_spancat_model_forward_backward(nO=5):
    Y, backprop = model((docs, spans), is_train=True)
    assert Y.shape == (spans.dataXd.shape[0], nO)
    backprop(Y)
+
+
+def test_textcat_reduce_invalid_args():
+    textcat_reduce = registry.architectures.get("spacy.TextCatReduce.v1")
+    tok2vec = make_test_tok2vec()
+    with pytest.raises(ValueError, match=r"must be used with at least one reduction"):
+        textcat_reduce(
+            tok2vec=tok2vec,
+            exclusive_classes=False,
+            use_reduce_first=False,
+            use_reduce_last=False,
+            use_reduce_max=False,
+            use_reduce_mean=False,
+        )
36	spacy/tests/tok2vec.py	Normal file

@@ -0,0 +1,36 @@
from typing import List

from thinc.api import Model
from thinc.types import Floats2d

from spacy.tokens import Doc
from spacy.util import registry


@registry.architectures("test.LazyInitTok2Vec.v1")
def build_lazy_init_tok2vec(*, width: int) -> Model[List[Doc], List[Floats2d]]:
    """tok2vec model of which the output size is only known after
    initialization. This implementation does not output meaningful
    embeddings, it is strictly for testing."""
    return Model(
        "lazy_init_tok2vec",
        lazy_init_tok2vec_forward,
        init=lazy_init_tok2vec_init,
        dims={"nO": None},
        attrs={"width": width},
    )


def lazy_init_tok2vec_init(model: Model, X=None, Y=None):
    width = model.attrs["width"]
    model.set_dim("nO", width)


def lazy_init_tok2vec_forward(model: Model, X: List[Doc], is_train: bool):
    width = model.get_dim("nO")
    Y = [model.ops.alloc2f(len(doc), width) for doc in X]

    def backprop(dY):
        return []

    return Y, backprop
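A quick sketch of how this helper behaves, assuming the module above has been imported so that `test.LazyInitTok2Vec.v1` is registered: the output dimension stays unset until `initialize` runs.

```python
import spacy
from spacy.util import registry

build = registry.architectures.get("test.LazyInitTok2Vec.v1")
model = build(width=96)
assert model.maybe_get_dim("nO") is None  # unknown before initialization
model.initialize()                        # lazy_init_tok2vec_init sets nO
assert model.get_dim("nO") == 96

doc = spacy.blank("en")("a test")
(vectors,), _ = model([doc], is_train=False)
assert vectors.shape == (2, 96)  # one all-zero row per token
```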
@@ -85,6 +85,18 @@ def test_tokenizer_explain_special_matcher(en_vocab):
    assert tokens == explain_tokens


+def test_tokenizer_explain_special_matcher_whitespace(en_vocab):
+    rules = {":]": [{"ORTH": ":]"}]}
+    tokenizer = Tokenizer(
+        en_vocab,
+        rules=rules,
+    )
+    text = ": ]"
+    tokens = [t.text for t in tokenizer(text)]
+    explain_tokens = [t[1] for t in tokenizer.explain(text)]
+    assert tokens == explain_tokens
+
+
@hypothesis.strategies.composite
def sentence_strategy(draw: hypothesis.strategies.DrawFn, max_n_words: int = 4) -> str:
    """

@@ -123,6 +135,9 @@ def test_tokenizer_explain_fuzzy(lang: str, sentence: str) -> None:
    """

    tokenizer: Tokenizer = spacy.blank(lang).tokenizer
-    tokens = [t.text for t in tokenizer(sentence) if not t.is_space]
+    # Tokenizer.explain is not intended to handle whitespace or control
+    # characters in the same way as Tokenizer
+    sentence = re.sub(r"\s+", " ", sentence).strip()
+    tokens = [t.text for t in tokenizer(sentence)]
    debug_tokens = [t[1] for t in tokenizer.explain(sentence)]
    assert tokens == debug_tokens, f"{tokens}, {debug_tokens}, {sentence}"
@@ -730,9 +730,16 @@ cdef class Tokenizer:
            if i in spans_by_start:
                span = spans_by_start[i]
                exc = [d[ORTH] for d in special_cases[span.label_]]
-                for j, orth in enumerate(exc):
-                    final_tokens.append((f"SPECIAL-{j + 1}", self.vocab.strings[orth]))
-                i += len(span)
+                # The phrase matcher can overmatch for tokens separated by
+                # spaces in the text but not in the underlying rule, so skip
+                # cases where the texts aren't identical
+                if span.text != "".join([self.vocab.strings[orth] for orth in exc]):
+                    final_tokens.append(tokens[i])
+                    i += 1
+                else:
+                    for j, orth in enumerate(exc):
+                        final_tokens.append((f"SPECIAL-{j + 1}", self.vocab.strings[orth]))
+                    i += len(span)
            else:
                final_tokens.append(tokens[i])
                i += 1
@@ -5,4 +5,4 @@ from .span import Span
from .span_group import SpanGroup
from .token import Token

-__all__ = ["Doc", "Token", "Span", "SpanGroup", "DocBin", "MorphAnalysis"]
+__all__ = ["Doc", "DocBin", "MorphAnalysis", "Span", "SpanGroup", "Token"]
@@ -42,7 +42,7 @@ class Doc:
    user_hooks: Dict[str, Callable[..., Any]]
    user_token_hooks: Dict[str, Callable[..., Any]]
    user_span_hooks: Dict[str, Callable[..., Any]]
-    tensor: np.ndarray[Any, np.dtype[np.float_]]
+    tensor: np.ndarray[Any, np.dtype[np.float64]]
    user_data: Dict[str, Any]
    has_unknown_spaces: bool
    _context: Any
@@ -125,7 +125,7 @@ class Doc:
        vector: Optional[Floats1d] = ...,
        alignment_mode: str = ...,
        span_id: Union[int, str] = ...,
-    ) -> Span: ...
+    ) -> Optional[Span]: ...
    def similarity(self, other: Union[Doc, Span, Token, Lexeme]) -> float: ...
    @property
    def has_vector(self) -> bool: ...
@@ -166,7 +166,7 @@ class Doc:
    ) -> Doc: ...
    def to_array(
        self, py_attr_ids: Union[int, str, List[Union[int, str]]]
-    ) -> np.ndarray[Any, np.dtype[np.float_]]: ...
+    ) -> np.ndarray[Any, np.dtype[np.float64]]: ...
    @staticmethod
    def from_docs(
        docs: List[Doc],
@@ -179,15 +179,13 @@ class Doc:
        self, path: Union[str, Path], *, exclude: Iterable[str] = ...
    ) -> None: ...
    def from_disk(
-        self, path: Union[str, Path], *, exclude: Union[List[str], Tuple[str]] = ...
+        self, path: Union[str, Path], *, exclude: Iterable[str] = ...
    ) -> Doc: ...
-    def to_bytes(self, *, exclude: Union[List[str], Tuple[str]] = ...) -> bytes: ...
-    def from_bytes(
-        self, bytes_data: bytes, *, exclude: Union[List[str], Tuple[str]] = ...
-    ) -> Doc: ...
-    def to_dict(self, *, exclude: Union[List[str], Tuple[str]] = ...) -> bytes: ...
+    def to_bytes(self, *, exclude: Iterable[str] = ...) -> bytes: ...
+    def from_bytes(self, bytes_data: bytes, *, exclude: Iterable[str] = ...) -> Doc: ...
+    def to_dict(self, *, exclude: Iterable[str] = ...) -> Dict[str, Any]: ...
    def from_dict(
-        self, msg: bytes, *, exclude: Union[List[str], Tuple[str]] = ...
+        self, msg: Dict[str, Any], *, exclude: Iterable[str] = ...
    ) -> Doc: ...
    def extend_tensor(self, tensor: Floats2d) -> None: ...
    def retokenize(self) -> Retokenizer: ...
@@ -1326,7 +1326,7 @@ cdef class Doc:

        path (str / Path): A path to a directory. Paths may be either
            strings or `Path`-like objects.
-        exclude (list): String names of serialization fields to exclude.
+        exclude (Iterable[str]): String names of serialization fields to exclude.
        RETURNS (Doc): The modified `Doc` object.

        DOCS: https://spacy.io/api/doc#from_disk
@@ -1339,7 +1339,7 @@ cdef class Doc:
    def to_bytes(self, *, exclude=tuple()):
        """Serialize, i.e. export the document contents to a binary string.

-        exclude (list): String names of serialization fields to exclude.
+        exclude (Iterable[str]): String names of serialization fields to exclude.
        RETURNS (bytes): A losslessly serialized copy of the `Doc`, including
            all annotations.
@@ -1351,7 +1351,7 @@ cdef class Doc:
        """Deserialize, i.e. import the document contents from a binary string.

        data (bytes): The string to load from.
-        exclude (list): String names of serialization fields to exclude.
+        exclude (Iterable[str]): String names of serialization fields to exclude.
        RETURNS (Doc): Itself.

        DOCS: https://spacy.io/api/doc#from_bytes
@@ -1361,11 +1361,8 @@ cdef class Doc:
    def to_dict(self, *, exclude=tuple()):
        """Export the document contents to a dictionary for serialization.

-        exclude (list): String names of serialization fields to exclude.
-        RETURNS (bytes): A losslessly serialized copy of the `Doc`, including
-            all annotations.
-
-        DOCS: https://spacy.io/api/doc#to_bytes
+        exclude (Iterable[str]): String names of serialization fields to exclude.
+        RETURNS (Dict[str, Any]): A dictionary representation of the `Doc`
        """
        array_head = Doc._get_array_attrs()
        strings = set()
@@ -1411,13 +1408,11 @@ cdef class Doc:
        return util.to_dict(serializers, exclude)

    def from_dict(self, msg, *, exclude=tuple()):
-        """Deserialize, i.e. import the document contents from a binary string.
+        """Deserialize the document contents from a dictionary representation.

-        data (bytes): The string to load from.
-        exclude (list): String names of serialization fields to exclude.
+        msg (Dict[str, Any]): The dictionary to load from.
+        exclude (Iterable[str]): String names of serialization fields to exclude.
        RETURNS (Doc): Itself.

        DOCS: https://spacy.io/api/doc#from_dict
        """
        if self.length != 0:
            raise ValueError(Errors.E033.format(length=self.length))
@@ -127,14 +127,17 @@ cdef class Span:
        self._vector = vector
        self._vector_norm = vector_norm

-    def __richcmp__(self, Span other, int op):
+    def __richcmp__(self, object other, int op):
        if other is None:
            if op == 0 or op == 1 or op == 2:
                return False
            else:
                return True
+        if not isinstance(other, Span):
+            return False
+        cdef Span other_span = other
        self_tuple = (self.c.start_char, self.c.end_char, self.c.label, self.c.kb_id, self.id, self.doc)
-        other_tuple = (other.c.start_char, other.c.end_char, other.c.label, other.c.kb_id, other.id, other.doc)
+        other_tuple = (other_span.c.start_char, other_span.c.end_char, other_span.c.label, other_span.c.kb_id, other_span.id, other_span.doc)
        # <
        if op == 0:
            return self_tuple < other_tuple
@@ -53,7 +53,12 @@ class Token:
    def __bytes__(self) -> bytes: ...
    def __str__(self) -> str: ...
    def __repr__(self) -> str: ...
-    def __richcmp__(self, other: Token, op: int) -> bool: ...
+    def __lt__(self, other: Any) -> bool: ...
+    def __le__(self, other: Any) -> bool: ...
+    def __eq__(self, other: Any) -> bool: ...
+    def __ne__(self, other: Any) -> bool: ...
+    def __gt__(self, other: Any) -> bool: ...
+    def __ge__(self, other: Any) -> bool: ...
    @property
    def _(self) -> Underscore: ...
    def nbor(self, i: int = ...) -> Token: ...
@@ -139,17 +139,20 @@ cdef class Token:
    def __repr__(self):
        return self.__str__()

-    def __richcmp__(self, Token other, int op):
+    def __richcmp__(self, object other, int op):
        # http://cython.readthedocs.io/en/latest/src/userguide/special_methods.html
        if other is None:
            if op in (0, 1, 2):
                return False
            else:
                return True
+        if not isinstance(other, Token):
+            return False
+        cdef Token other_token = other
        cdef Doc my_doc = self.doc
-        cdef Doc other_doc = other.doc
+        cdef Doc other_doc = other_token.doc
        my = self.idx
-        their = other.idx
+        their = other_token.idx
        if op == 0:
            return my < their
        elif op == 2:
@@ -16,3 +16,28 @@ from .iob_utils import (  # noqa: F401
    tags_to_entities,
)
from .loggers import console_logger  # noqa: F401
+
+__all__ = [
+    "Alignment",
+    "Corpus",
+    "Example",
+    "JsonlCorpus",
+    "PlainTextCorpus",
+    "biluo_tags_to_offsets",
+    "biluo_tags_to_spans",
+    "biluo_to_iob",
+    "create_copy_from_base_model",
+    "docs_to_json",
+    "dont_augment",
+    "iob_to_biluo",
+    "minibatch_by_padded_size",
+    "minibatch_by_words",
+    "offsets_to_biluo_tags",
+    "orth_variants_augmenter",
+    "read_json_file",
+    "remove_bilu_prefix",
+    "split_bilu_label",
+    "tags_to_entities",
+    "validate_get_examples",
+    "validate_examples",
+]
@@ -1077,20 +1077,38 @@ def make_tempdir() -> Generator[Path, None, None]:


def is_in_jupyter() -> bool:
-    """Check if user is running spaCy from a Jupyter notebook by detecting the
-    IPython kernel. Mainly used for the displaCy visualizer.
-    RETURNS (bool): True if in Jupyter, False if not.
+    """Check if user is running spaCy from a Jupyter or Colab notebook by
+    detecting the IPython kernel. Mainly used for the displaCy visualizer.
+    RETURNS (bool): True if in Jupyter/Colab, False if not.
    """
    # https://stackoverflow.com/a/39662359/6400719
+    # https://stackoverflow.com/questions/15411967
    try:
-        shell = get_ipython().__class__.__name__  # type: ignore[name-defined]
-        if shell == "ZMQInteractiveShell":
+        if get_ipython().__class__.__name__ == "ZMQInteractiveShell":  # type: ignore[name-defined]
            return True  # Jupyter notebook or qtconsole
+        if get_ipython().__class__.__module__ == "google.colab._shell":  # type: ignore[name-defined]
+            return True  # Colab notebook
    except NameError:
-        return False  # Probably standard Python interpreter
+        pass  # Probably standard Python interpreter
+    # additional check for Colab
+    try:
+        import google.colab
+
+        return True  # Colab notebook
+    except ImportError:
+        pass
+    return False
+
+
+def is_in_interactive() -> bool:
+    """Check if user is running spaCy from an interactive Python
+    shell. Will return True in Jupyter notebooks too.
+    RETURNS (bool): True if in interactive mode, False if not.
+    """
+    # https://stackoverflow.com/questions/2356399/tell-if-python-is-in-interactive-mode
+    return hasattr(sys, "ps1") or hasattr(sys, "ps2")


def get_object_name(obj: Any) -> str:
    """Get a human-readable name of a Python object, e.g. a pipeline component.
@@ -78,16 +78,16 @@ subword features, and a
[MaxoutWindowEncoder](/api/architectures#MaxoutWindowEncoder) encoding layer
consisting of a CNN and a layer-normalized maxout activation function.

| Name                 | Description |
| -------------------- | ----------- |
| `width`              | The width of the input and output. These are required to be the same, so that residual connections can be used. Recommended values are `96`, `128` or `300`. ~~int~~ |
| `depth`              | The number of convolutional layers to use. Recommended values are between `2` and `8`. ~~int~~ |
| `embed_size`         | The number of rows in the hash embedding tables. This can be surprisingly small, due to the use of the hash embeddings. Recommended values are between `2000` and `10000`. ~~int~~ |
| `window_size`        | The number of tokens on either side to concatenate during the convolutions. The receptive field of the CNN will be `depth * window_size * 2 + 1`, so a 4-layer network with a window size of `2` will be sensitive to 17 words at a time. Recommended value is `1`. ~~int~~ |
| `maxout_pieces`      | The number of pieces to use in the maxout non-linearity. If `1`, the [`Mish`](https://thinc.ai/docs/api-layers#mish) non-linearity is used instead. Recommended values are `1`-`3`. ~~int~~ |
| `subword_features`   | Whether to also embed subword features, specifically the prefix, suffix and word shape. This is recommended for alphabetic languages like English, but not if single-character tokens are used for a language such as Chinese. ~~bool~~ |
| `pretrained_vectors` | Whether to also use static vectors. ~~bool~~ |
| **CREATES**          | The model using the architecture. ~~Model[List[Doc], List[Floats2d]]~~ |
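For reference, the same parameters can be used to construct the layer directly from the registry; a minimal sketch with illustrative values taken from the table above:

```python
from spacy.util import registry

build_tok2vec = registry.architectures.get("spacy.HashEmbedCNN.v2")
tok2vec = build_tok2vec(
    width=96,               # input/output width; residual connections need them equal
    depth=4,                # number of convolutional layers
    embed_size=2000,        # rows in the hash embedding tables
    window_size=1,          # receptive field: depth * window_size * 2 + 1 = 9 tokens
    maxout_pieces=3,        # pieces in the maxout non-linearity
    subword_features=True,  # embed prefix, suffix and word shape
    pretrained_vectors=None,
)
```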
### spacy.Tok2VecListener.v1 {id="Tok2VecListener"}
@@ -962,8 +962,9 @@ single-label use-cases where `exclusive_classes = true`, while the
> nO = null
>
> [model.linear_model]
-> @architectures = "spacy.TextCatBOW.v2"
+> @architectures = "spacy.TextCatBOW.v3"
> exclusive_classes = true
+> length = 262144
> ngram_size = 1
> no_output_layer = false
>
@@ -1017,54 +1018,15 @@ but used an internal `tok2vec` instead of taking it as argument:

</Accordion>

-### spacy.TextCatCNN.v2 {id="TextCatCNN"}
-
-> #### Example Config
->
-> ```ini
-> [model]
-> @architectures = "spacy.TextCatCNN.v2"
-> exclusive_classes = false
-> nO = null
->
-> [model.tok2vec]
-> @architectures = "spacy.HashEmbedCNN.v2"
-> pretrained_vectors = null
-> width = 96
-> depth = 4
-> embed_size = 2000
-> window_size = 1
-> maxout_pieces = 3
-> subword_features = true
-> ```
-
-A neural network model where token vectors are calculated using a CNN. The
-vectors are mean pooled and used as features in a feed-forward network. This
-architecture is usually less accurate than the ensemble, but runs faster.
-
-| Name                | Description |
-| ------------------- | ----------- |
-| `exclusive_classes` | Whether or not categories are mutually exclusive. ~~bool~~ |
-| `tok2vec`           | The [`tok2vec`](#tok2vec) layer of the model. ~~Model~~ |
-| `nO`                | Output dimension, determined by the number of different labels. If not set, the [`TextCategorizer`](/api/textcategorizer) component will set it when `initialize` is called. ~~Optional[int]~~ |
-| **CREATES**         | The model using the architecture. ~~Model[List[Doc], Floats2d]~~ |
-
-<Accordion title="spacy.TextCatCNN.v1 definition" spaced>
-
-[TextCatCNN.v1](/api/legacy#TextCatCNN_v1) had the exact same signature, but was
-not yet resizable. Since v2, new labels can be added to this component, even
-after training.
-
-</Accordion>
-
-### spacy.TextCatBOW.v2 {id="TextCatBOW"}
+### spacy.TextCatBOW.v3 {id="TextCatBOW"}

> #### Example Config
>
> ```ini
> [model]
-> @architectures = "spacy.TextCatBOW.v2"
+> @architectures = "spacy.TextCatBOW.v3"
> exclusive_classes = false
> length = 262144
> ngram_size = 1
> no_output_layer = false
> nO = null
@@ -1078,17 +1040,108 @@ the others, but may not be as accurate, especially if texts are short.
| `exclusive_classes` | Whether or not categories are mutually exclusive. ~~bool~~ |
| `ngram_size`        | Determines the maximum length of the n-grams in the BOW model. For instance, `ngram_size=3` would give unigram, trigram and bigram features. ~~int~~ |
| `no_output_layer`   | Whether or not to add an output layer to the model (`Softmax` activation if `exclusive_classes` is `True`, else `Logistic`). ~~bool~~ |
| `length`            | The size of the weights vector. The length will be rounded up to the next power of two if it is not a power of two. Defaults to `262144`. ~~int~~ |
| `nO`                | Output dimension, determined by the number of different labels. If not set, the [`TextCategorizer`](/api/textcategorizer) component will set it when `initialize` is called. ~~Optional[int]~~ |
| **CREATES**         | The model using the architecture. ~~Model[List[Doc], Floats2d]~~ |

-<Accordion title="spacy.TextCatBOW.v1 definition" spaced>
+<Accordion title="Previous versions of spacy.TextCatBOW" spaced>

-[TextCatBOW.v1](/api/legacy#TextCatBOW_v1) had the exact same signature, but was
-not yet resizable. Since v2, new labels can be added to this component, even
-after training.
+- [TextCatBOW.v1](/api/legacy#TextCatBOW_v1) was not yet resizable. Since v2,
+  new labels can be added to this component, even after training.
+- [TextCatBOW.v1](/api/legacy#TextCatBOW_v1) and
+  [TextCatBOW.v2](/api/legacy#TextCatBOW_v2) used an erroneous sparse linear
+  layer that only used a small number of the allocated parameters.
+- [TextCatBOW.v1](/api/legacy#TextCatBOW_v1) and
+  [TextCatBOW.v2](/api/legacy#TextCatBOW_v2) did not have the `length` argument.

</Accordion>
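As a usage sketch (label names and example text are made up), the architecture above can be plugged into a `textcat` component via its config:

```python
# Sketch: add a textcat component that uses spacy.TextCatBOW.v3
import spacy

nlp = spacy.blank("en")
config = {
    "model": {
        "@architectures": "spacy.TextCatBOW.v3",
        "exclusive_classes": True,
        "ngram_size": 1,
        "length": 262144,
        "no_output_layer": False,
    }
}
textcat = nlp.add_pipe("textcat", config=config)
textcat.add_label("POSITIVE")  # made-up labels for illustration
textcat.add_label("NEGATIVE")
nlp.initialize()
doc = nlp("This is great.")
print(doc.cats)  # scores for POSITIVE and NEGATIVE
```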
||||
### spacy.TextCatParametricAttention.v1 {id="TextCatParametricAttention"}
|
||||
|
||||
> #### Example Config
|
||||
>
|
||||
> ```ini
|
||||
> [model]
|
||||
> @architectures = "spacy.TextCatParametricAttention.v1"
|
||||
> exclusive_classes = true
|
||||
> nO = null
|
||||
>
|
||||
> [model.tok2vec]
|
||||
> @architectures = "spacy.Tok2Vec.v2"
|
||||
>
|
||||
> [model.tok2vec.embed]
|
||||
> @architectures = "spacy.MultiHashEmbed.v2"
|
||||
> width = 64
|
||||
> rows = [2000, 2000, 1000, 1000, 1000, 1000]
|
||||
> attrs = ["ORTH", "LOWER", "PREFIX", "SUFFIX", "SHAPE", "ID"]
|
||||
> include_static_vectors = false
|
||||
>
|
||||
> [model.tok2vec.encode]
|
||||
> @architectures = "spacy.MaxoutWindowEncoder.v2"
|
||||
> width = ${model.tok2vec.embed.width}
|
||||
> window_size = 1
|
||||
> maxout_pieces = 3
|
||||
> depth = 2
|
||||
> ```
|
||||
|
||||
A neural network model that is built upon Tok2Vec and uses parametric attention
|
||||
to attend to tokens that are relevant to text classification.
|
||||
|
||||
| Name | Description |
|
||||
| ------------------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
|
||||
| `tok2vec` | The `tok2vec` layer to build the neural network upon. ~~Model[List[Doc], List[Floats2d]]~~ |
|
||||
| `exclusive_classes` | Whether or not categories are mutually exclusive. ~~bool~~ |
|
||||
| `nO` | Output dimension, determined by the number of different labels. If not set, the [`TextCategorizer`](/api/textcategorizer) component will set it when `initialize` is called. ~~Optional[int]~~ |
|
||||
| **CREATES** | The model using the architecture. ~~Model[List[Doc], Floats2d]~~ |
|
||||
|
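A usage sketch mirroring the table above; the label names are made up, and the nested `HashEmbedCNN` tok2vec is an illustrative stand-in for the embed/encode stack shown in the example config:

```python
# Sketch: textcat using spacy.TextCatParametricAttention.v1
import spacy

nlp = spacy.blank("en")
config = {
    "model": {
        "@architectures": "spacy.TextCatParametricAttention.v1",
        "exclusive_classes": True,
        "tok2vec": {
            "@architectures": "spacy.HashEmbedCNN.v2",
            "pretrained_vectors": None,
            "width": 64,
            "depth": 2,
            "embed_size": 2000,
            "window_size": 1,
            "maxout_pieces": 3,
            "subword_features": True,
        },
    }
}
textcat = nlp.add_pipe("textcat", config=config)
textcat.add_label("SPAM")
textcat.add_label("HAM")
nlp.initialize()
```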
||||
### spacy.TextCatReduce.v1 {id="TextCatReduce"}
|
||||
|
||||
> #### Example Config
|
||||
>
|
||||
> ```ini
|
||||
> [model]
|
||||
> @architectures = "spacy.TextCatReduce.v1"
|
||||
> exclusive_classes = false
|
||||
> use_reduce_first = false
|
||||
> use_reduce_last = false
|
||||
> use_reduce_max = false
|
||||
> use_reduce_mean = true
|
||||
> nO = null
|
||||
>
|
||||
> [model.tok2vec]
|
||||
> @architectures = "spacy.HashEmbedCNN.v2"
|
||||
> pretrained_vectors = null
|
||||
> width = 96
|
||||
> depth = 4
|
||||
> embed_size = 2000
|
||||
> window_size = 1
|
||||
> maxout_pieces = 3
|
||||
> subword_features = true
|
||||
> ```
|
||||
|
||||
A classifier that pools token hidden representations of each `Doc` using first,
|
||||
max or mean reduction and then applies a classification layer. Reductions are
|
||||
concatenated when multiple reductions are used.
|
||||
|
||||
<Infobox variant="warning" title="Relation to TextCatCNN" id="TextCatCNN">
|
||||
|
||||
`TextCatReduce` is a generalization of the older
|
||||
[`TextCatCNN`](/api/legacy#TextCatCNN_v2) model. `TextCatCNN` always uses a mean
|
||||
reduction, whereas `TextCatReduce` also supports first/max reductions.
|
||||
|
||||
</Infobox>
|
||||
|
||||
| Name | Description |
|
||||
| ------------------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
|
||||
| `exclusive_classes` | Whether or not categories are mutually exclusive. ~~bool~~ |
|
||||
| `tok2vec` | The [`tok2vec`](#tok2vec) layer of the model. ~~Model~~ |
|
||||
| `use_reduce_first` | Pool by using the hidden representation of the first token of a `Doc`. ~~bool~~ |
|
||||
| `use_reduce_last` | Pool by using the hidden representation of the last token of a `Doc`. ~~bool~~ |
|
||||
| `use_reduce_max` | Pool by taking the maximum values of the hidden representations of a `Doc`. ~~bool~~ |
|
||||
| `use_reduce_mean` | Pool by taking the mean of all hidden representations of a `Doc`. ~~bool~~ |
|
||||
| `nO` | Output dimension, determined by the number of different labels. If not set, the [`TextCategorizer`](/api/textcategorizer) component will set it when `initialize` is called. ~~Optional[int]~~ |
|
||||
| **CREATES** | The model using the architecture. ~~Model[List[Doc], Floats2d]~~ |
|
||||
|
||||
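When several reductions are enabled, their outputs are concatenated, so the classification layer sees the tok2vec width multiplied by the number of reductions. A sketch combining max and mean pooling in a multi-label pipeline (labels and settings are illustrative):

```python
# Sketch: textcat_multilabel with concatenated max + mean reductions
import spacy

nlp = spacy.blank("en")
config = {
    "model": {
        "@architectures": "spacy.TextCatReduce.v1",
        "exclusive_classes": False,
        "use_reduce_first": False,
        "use_reduce_last": False,
        "use_reduce_max": True,
        "use_reduce_mean": True,  # max and mean outputs are concatenated
        "tok2vec": {
            "@architectures": "spacy.HashEmbedCNN.v2",
            "pretrained_vectors": None,
            "width": 96,   # classifier input becomes 96 * 2 = 192 features
            "depth": 4,
            "embed_size": 2000,
            "window_size": 1,
            "maxout_pieces": 3,
            "subword_features": True,
        },
    }
}
textcat = nlp.add_pipe("textcat_multilabel", config=config)
```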
## Span classification architectures {id="spancat",source="spacy/ml/models/spancat.py"}
### spacy.SpanCategorizer.v1 {id="SpanCategorizer"}
@ -1268,20 +1268,21 @@ the [binary `.spacy` format](/api/data-formats#binary-training). The pipeline is
warmed up before any measurements are taken.

```cli
$ python -m spacy benchmark speed [model] [data_path] [--batch_size] [--no-shuffle] [--gpu-id] [--batches] [--warmup]
$ python -m spacy benchmark speed [model] [data_path] [--code] [--batch_size] [--no-shuffle] [--gpu-id] [--batches] [--warmup]
```

| Name | Description |
| -------------------- | -------------------------------------------------------------------------------------------------------- |
| `model` | Pipeline to benchmark the speed of. Can be a package or a path to a data directory. ~~str (positional)~~ |
| `data_path` | Location of benchmark data in spaCy's [binary format](/api/data-formats#training). ~~Path (positional)~~ |
| `--batch-size`, `-b` | Set the batch size. If not set, the pipeline's batch size is used. ~~Optional[int] \(option)~~ |
| `--no-shuffle` | Do not shuffle documents in the benchmark data. ~~bool (flag)~~ |
| `--gpu-id`, `-g` | GPU to use, if any. Defaults to `-1` for CPU. ~~int (option)~~ |
| `--batches` | Number of batches to benchmark on. Defaults to `50`. ~~Optional[int] \(option)~~ |
| `--warmup`, `-w` | Iterations over the benchmark data for warmup. Defaults to `3`. ~~Optional[int] \(option)~~ |
| `--help`, `-h` | Show help message and available arguments. ~~bool (flag)~~ |
| **PRINTS** | Pipeline speed in words per second with a 95% confidence interval. |

| Name | Description |
| -------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ |
| `model` | Pipeline to benchmark the speed of. Can be a package or a path to a data directory. ~~str (positional)~~ |
| `data_path` | Location of benchmark data in spaCy's [binary format](/api/data-formats#training). ~~Path (positional)~~ |
| `--code`, `-c` | Path to Python file with additional code to be imported. Allows [registering custom functions](/usage/training#custom-functions) for new architectures. ~~Optional[Path] \(option)~~ |
| `--batch-size`, `-b` | Set the batch size. If not set, the pipeline's batch size is used. ~~Optional[int] \(option)~~ |
| `--no-shuffle` | Do not shuffle documents in the benchmark data. ~~bool (flag)~~ |
| `--gpu-id`, `-g` | GPU to use, if any. Defaults to `-1` for CPU. ~~int (option)~~ |
| `--batches` | Number of batches to benchmark on. Defaults to `50`. ~~Optional[int] \(option)~~ |
| `--warmup`, `-w` | Iterations over the benchmark data for warmup. Defaults to `3`. ~~Optional[int] \(option)~~ |
| `--help`, `-h` | Show help message and available arguments. ~~bool (flag)~~ |
| **PRINTS** | Pipeline speed in words per second with a 95% confidence interval. |
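
For instance, a concrete invocation might look like the following; the pipeline
name and data path are placeholders, not files shipped with spaCy:

```cli
$ python -m spacy benchmark speed en_core_web_sm ./corpus/dev.spacy --batches 100 --warmup 5
```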

## apply {id="apply", version="3.5", tag="command"}
@ -1547,9 +1548,9 @@ obsolete files is left up to you.
Remotes can be defined in the `remotes` section of the
[`project.yml`](/usage/projects#project-yml). Under the hood, spaCy uses
[`Pathy`](https://github.com/justindujardin/pathy) to communicate with the
remote storages, so you can use any protocol that `Pathy` supports, including
[S3](https://aws.amazon.com/s3/),
[`cloudpathlib`](https://cloudpathlib.drivendata.org) to communicate with the
remote storages, so you can use any protocol that `cloudpathlib` supports,
including [S3](https://aws.amazon.com/s3/),
[Google Cloud Storage](https://cloud.google.com/storage), and the local
filesystem, although you may need to install extra dependencies to use certain
protocols.
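
As a rough illustration of what the `cloudpathlib` backend makes possible, here
is a minimal sketch. It assumes a hypothetical S3 bucket `my-bucket` and that
the S3 extra is installed (`pip install "cloudpathlib[s3]"`):

```python
from cloudpathlib import CloudPath

# Hypothetical remote matching a `remotes` entry in project.yml, e.g.
#   remotes:
#     default: "s3://my-bucket/spacy-projects"
remote = CloudPath("s3://my-bucket/spacy-projects")

# Upload a local training artifact to the remote storage ...
(remote / "training/model-best.tar.gz").upload_from("training/model-best.tar.gz")

# ... and download it again on another machine.
(remote / "training/model-best.tar.gz").download_to("training/model-best.tar.gz")
```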
@ -162,7 +162,10 @@ network has an internal CNN Tok2Vec layer and uses attention.
Since `spacy.TextCatCNN.v2`, this architecture has become resizable, which means
that you can add labels to a previously trained textcat. `TextCatCNN` v1 did not
yet support that.
yet support that. `TextCatCNN` has been replaced by the more general
[`TextCatReduce`](/api/architectures#TextCatReduce) layer. `TextCatCNN` is
identical to `TextCatReduce` with `use_reduce_mean=true`,
`use_reduce_first=false`, `use_reduce_last=false` and `use_reduce_max=false`.

> #### Example Config
>
@ -194,11 +197,58 @@ architecture is usually less accurate than the ensemble, but runs faster.
| `nO` | Output dimension, determined by the number of different labels. If not set, the [`TextCategorizer`](/api/textcategorizer) component will set it when `initialize` is called. ~~Optional[int]~~ |
| **CREATES** | The model using the architecture. ~~Model[List[Doc], Floats2d]~~ |

### spacy.TextCatCNN.v2 {id="TextCatCNN_v2"}

> #### Example Config
>
> ```ini
> [model]
> @architectures = "spacy.TextCatCNN.v2"
> exclusive_classes = false
> nO = null
>
> [model.tok2vec]
> @architectures = "spacy.HashEmbedCNN.v2"
> pretrained_vectors = null
> width = 96
> depth = 4
> embed_size = 2000
> window_size = 1
> maxout_pieces = 3
> subword_features = true
> ```

A neural network model where token vectors are calculated using a CNN. The
vectors are mean pooled and used as features in a feed-forward network. This
architecture is usually less accurate than the ensemble, but runs faster.

`TextCatCNN` has been replaced by the more general
[`TextCatReduce`](/api/architectures#TextCatReduce) layer. `TextCatCNN` is
identical to `TextCatReduce` with `use_reduce_mean=true`,
`use_reduce_first=false`, `use_reduce_last=false` and `use_reduce_max=false`.

| Name | Description |
| ------------------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `exclusive_classes` | Whether or not categories are mutually exclusive. ~~bool~~ |
| `tok2vec` | The [`tok2vec`](#tok2vec) layer of the model. ~~Model~~ |
| `nO` | Output dimension, determined by the number of different labels. If not set, the [`TextCategorizer`](/api/textcategorizer) component will set it when `initialize` is called. ~~Optional[int]~~ |
| **CREATES** | The model using the architecture. ~~Model[List[Doc], Floats2d]~~ |

<Accordion title="spacy.TextCatCNN.v1 definition" spaced>

[TextCatCNN.v1](/api/legacy#TextCatCNN_v1) had the exact same signature, but was
not yet resizable. Since v2, new labels can be added to this component, even
after training.

</Accordion>

### spacy.TextCatBOW.v1 {id="TextCatBOW_v1"}

Since `spacy.TextCatBOW.v2`, this architecture has become resizable, which means
that you can add labels to a previously trained textcat. `TextCatBOW` v1 did not
yet support that.
yet support that. Versions of this model before `spacy.TextCatBOW.v3` used an
erroneous sparse linear layer that only used a small number of the allocated
parameters.

> #### Example Config
>
@ -222,6 +272,33 @@ the others, but may not be as accurate, especially if texts are short.
| `nO` | Output dimension, determined by the number of different labels. If not set, the [`TextCategorizer`](/api/textcategorizer) component will set it when `initialize` is called. ~~Optional[int]~~ |
| **CREATES** | The model using the architecture. ~~Model[List[Doc], Floats2d]~~ |

### spacy.TextCatBOW.v2 {id="TextCatBOW"}

Versions of this model before `spacy.TextCatBOW.v3` used an erroneous sparse
linear layer that only used a small number of the allocated parameters.

> #### Example Config
>
> ```ini
> [model]
> @architectures = "spacy.TextCatBOW.v2"
> exclusive_classes = false
> ngram_size = 1
> no_output_layer = false
> nO = null
> ```

An n-gram "bag-of-words" model. This architecture should run much faster than
the others, but may not be as accurate, especially if texts are short.

| Name | Description |
| ------------------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `exclusive_classes` | Whether or not categories are mutually exclusive. ~~bool~~ |
| `ngram_size` | Determines the maximum length of the n-grams in the BOW model. For instance, `ngram_size=3` would give unigram, bigram and trigram features. ~~int~~ |
| `no_output_layer` | Whether or not to add an output layer to the model (`Softmax` activation if `exclusive_classes` is `True`, else `Logistic`). ~~bool~~ |
| `nO` | Output dimension, determined by the number of different labels. If not set, the [`TextCategorizer`](/api/textcategorizer) component will set it when `initialize` is called. ~~Optional[int]~~ |
| **CREATES** | The model using the architecture. ~~Model[List[Doc], Floats2d]~~ |
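
To make the `ngram_size` setting concrete, here is a small standalone sketch.
It is plain Python, not spaCy's internal implementation, and only illustrates
which feature sets the setting implies:

```python
def ngram_features(tokens, ngram_size):
    """Collect all n-grams up to ngram_size, mirroring the BOW feature set."""
    features = []
    for n in range(1, ngram_size + 1):
        for i in range(len(tokens) - n + 1):
            features.append(tuple(tokens[i : i + n]))
    return features

tokens = ["very", "fast", "model"]
# ngram_size=3: unigram, bigram and trigram features
print(ngram_features(tokens, 3))
# [('very',), ('fast',), ('model',), ('very', 'fast'), ('fast', 'model'),
#  ('very', 'fast', 'model')]
```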

### spacy.TransitionBasedParser.v1 {id="TransitionBasedParser_v1"}

Identical to
@ -340,15 +340,30 @@ A _task_ defines an NLP problem or question that will be sent to the LLM via a
prompt. Further, the task defines how to parse the LLM's responses back into
structured information. All tasks are registered in the `llm_tasks` registry.

Practically speaking, a task should adhere to the `Protocol` `LLMTask` defined
in [`ty.py`](https://github.com/explosion/spacy-llm/blob/main/spacy_llm/ty.py).
It needs to define a `generate_prompts` function and a `parse_responses`
function.
Practically speaking, a task should adhere to the `Protocol` named `LLMTask`
defined in
[`ty.py`](https://github.com/explosion/spacy-llm/blob/main/spacy_llm/ty.py). It
needs to define a `generate_prompts` function and a `parse_responses` function.

| Task | Description |
| --------------------------------------------------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------ |
| [`task.generate_prompts`](/api/large-language-models#task-generate-prompts) | Takes a collection of documents, and returns a collection of "prompts", which can be of type `Any`. |
| [`task.parse_responses`](/api/large-language-models#task-parse-responses) | Takes a collection of LLM responses and the original documents, parses the responses into structured information, and sets the annotations on the documents. |
Tasks may support prompt sharding (for more info see the API docs on
[sharding](/api/large-language-models#task-sharding) and
[non-sharding](/api/large-language-models#task-nonsharding) tasks). The function
signatures for `generate_prompts` and `parse_responses` depend on whether the
task supports sharding, as the two tables below show; a minimal sketch of a
non-sharding task follows them.

For tasks **not supporting** sharding:

| Task | Description |
| --------------------------------------------------------------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------ |
| [`task.generate_prompts`](/api/large-language-models#task-nonsharding-generate-prompts) | Takes a collection of documents, and returns a collection of prompts, which can be of type `Any`. |
| [`task.parse_responses`](/api/large-language-models#task-nonsharding-parse-responses) | Takes a collection of LLM responses and the original documents, parses the responses into structured information, and sets the annotations on the documents. |

For tasks **supporting** sharding:

| Task | Description |
| ------------------------------------------------------------------------------------ | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ |
| [`task.generate_prompts`](/api/large-language-models#task-sharding-generate-prompts) | Takes a collection of documents, and returns a collection of collections of prompt shards, which can be of type `Any`. |
| [`task.parse_responses`](/api/large-language-models#task-sharding-parse-responses) | Takes a collection of collections of LLM responses (one per prompt shard) and the original documents, parses the responses into structured information, sets the annotations on the doc shards, and merges those doc shards back into a single doc instance. |
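
The following is a minimal, illustrative sketch of a non-sharding task, not a
drop-in implementation: the registry name, class and custom attribute are
hypothetical, and the authoritative signatures are the ones in `ty.py`.

```python
from typing import Iterable

from spacy.tokens import Doc
from spacy_llm.registry import registry

# Hypothetical custom attribute to hold the raw LLM label.
if not Doc.has_extension("llm_sentiment"):
    Doc.set_extension("llm_sentiment", default=None)


class SimpleSentimentTask:
    """Non-sharding task: one prompt per doc, one response per doc."""

    def generate_prompts(self, docs: Iterable[Doc]) -> Iterable[str]:
        for doc in docs:
            yield f"Answer POSITIVE or NEGATIVE for this text:\n{doc.text}"

    def parse_responses(
        self, docs: Iterable[Doc], responses: Iterable[str]
    ) -> Iterable[Doc]:
        for doc, response in zip(docs, responses):
            # Store the raw label on the doc via the custom attribute above.
            doc._.llm_sentiment = response.strip().upper()
            yield doc


@registry.llm_tasks("my_project.SimpleSentiment.v1")  # hypothetical name
def make_simple_sentiment_task() -> SimpleSentimentTask:
    return SimpleSentimentTask()
```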

Moreover, the task may define an optional [`scorer` method](/api/scorer#score).
It should accept an iterable of `Example` objects as input and return a score
@ -357,6 +372,7 @@ evaluate the component.
| Component | Description |
| ----------------------------------------------------------------------- | ----------------------------------------------------------------------------------------------------------------- |
| [`spacy.EntityLinker.v1`](/api/large-language-models#el-v1) | The entity linking task prompts the model to link all entities in a given text to entries in a knowledge base. |
| [`spacy.Summarization.v1`](/api/large-language-models#summarization-v1) | The summarization task prompts the model for a concise summary of the provided text. |
| [`spacy.NER.v3`](/api/large-language-models#ner-v3) | Implements Chain-of-Thought reasoning for NER extraction - obtains higher accuracy than v1 or v2. |
| [`spacy.NER.v2`](/api/large-language-models#ner-v2) | Builds on v1 and additionally supports defining the provided labels with explicit descriptions. |
@ -369,7 +385,9 @@ evaluate the component.
| [`spacy.TextCat.v2`](/api/large-language-models#textcat-v2) | Version 2 builds on v1 and includes an improved prompt template. |
| [`spacy.TextCat.v1`](/api/large-language-models#textcat-v1) | Version 1 of the built-in TextCat task supports both zero-shot and few-shot prompting. |
| [`spacy.Lemma.v1`](/api/large-language-models#lemma-v1) | Lemmatizes the provided text and updates the `lemma_` attribute of the tokens accordingly. |
| [`spacy.Raw.v1`](/api/large-language-models#raw-v1) | Executes raw doc content as prompt to LLM. |
| [`spacy.Sentiment.v1`](/api/large-language-models#sentiment-v1) | Performs sentiment analysis on provided texts. |
| [`spacy.Translation.v1`](/api/large-language-models#translation-v1) | Translates doc content into the specified target language. |
| [`spacy.NoOp.v1`](/api/large-language-models#noop-v1) | This task is only useful for testing - it tells the LLM to do nothing, and does not set any fields on the `docs`. |
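
Wiring one of these built-in tasks into a pipeline could look roughly like the
sketch below. The task and model choices are illustrative: it assumes
`spacy-llm` is installed, OpenAI credentials are configured, and that the
sentiment task writes to its default output field.

```python
import spacy

nlp = spacy.blank("en")
nlp.add_pipe(
    "llm",
    config={
        "task": {"@llm_tasks": "spacy.Sentiment.v1"},
        "model": {"@llm_models": "spacy.GPT-3-5.v1"},  # assumes API credentials
    },
)
doc = nlp("This is terrific!")
print(doc._.sentiment)  # default output field assumed for the sentiment task
```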

#### Providing examples for few-shot prompts {id="few-shot-prompts"}
@ -153,8 +153,9 @@ maxout_pieces = 3
depth = 2

[components.textcat.model.linear_model]
@architectures = "spacy.TextCatBOW.v2"
@architectures = "spacy.TextCatBOW.v3"
exclusive_classes = true
length = 262144
ngram_size = 1
no_output_layer = false
```
@ -170,8 +171,9 @@ factory = "textcat"
labels = []

[components.textcat.model]
@architectures = "spacy.TextCatBOW.v2"
@architectures = "spacy.TextCatBOW.v3"
exclusive_classes = true
length = 262144
ngram_size = 1
no_output_layer = false
nO = null
@ -1328,8 +1328,9 @@ labels = []
# This function is created and then passed to the "textcat" component as
# the argument "model"
[components.textcat.model]
@architectures = "spacy.TextCatBOW.v2"
@architectures = "spacy.TextCatBOW.v3"
exclusive_classes = true
length = 262144
ngram_size = 1
no_output_layer = false
@ -656,9 +656,9 @@ locally.
You can list one or more remotes in the `remotes` section of your
[`project.yml`](#project-yml) by mapping a string name to the URL of the
storage. Under the hood, spaCy uses
[`Pathy`](https://github.com/justindujardin/pathy) to communicate with the
remote storages, so you can use any protocol that `Pathy` supports, including
[S3](https://aws.amazon.com/s3/),
[`cloudpathlib`](https://cloudpathlib.drivendata.org) to communicate with the
remote storages, so you can use any protocol that `cloudpathlib` supports,
including [S3](https://aws.amazon.com/s3/),
[Google Cloud Storage](https://cloud.google.com/storage), and the local
filesystem, although you may need to install extra dependencies to use certain
protocols.
@ -405,7 +405,7 @@ available to spaCy, all you need to do is install the package in your
environment:

```bash
$ python setup.py develop
$ python -m pip install .
```
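
Once installed, the package's entry points make the factory discoverable, so a
sketch like the following should work without importing the package explicitly
(using the `"snek"` component name from this example):

```python
import spacy

nlp = spacy.blank("en")
nlp.add_pipe("snek")  # factory resolved via the installed package's entry points
doc = nlp("Hiss!")
```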

spaCy is now able to create the pipeline component `"snek"` – even though you
@ -673,7 +673,7 @@ $ python -m spacy package ./en_example_pipeline ./packages
```

This command will create a pipeline package directory and will run
`python setup.py sdist` in that directory to create a binary `.whl` file or
`python -m build` in that directory to create a binary `.whl` file or
`.tar.gz` archive of your package that can be installed using `pip install`.
Installing the binary wheel is usually more efficient.
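
For example, a package built from the command above could then be installed
directly from the generated archive; the exact file name depends on your
package name and version, so this path is purely illustrative:

```cli
$ python -m pip install ./packages/en_example_pipeline-0.0.0/dist/en_example_pipeline-0.0.0.tar.gz
```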
@ -103,6 +103,10 @@
"has_examples": true,
"models": ["fi_core_news_sm", "fi_core_news_md", "fi_core_news_lg"]
},
{
"code": "fo",
"name": "Faroese"
},
{
"code": "fr",
"name": "French",
@ -290,6 +294,12 @@
"example": "Dit is een zin.",
"has_examples": true
},
{
"code": "nn",
"name": "Norwegian Nynorsk",
"example": "Det er ein meir enn i same periode i fjor.",
"has_examples": true
},
{
"code": "pl",
"name": "Polish",