Merge branch 'master' into spacy.io

Adriane Boyd 2021-04-24 12:58:47 +02:00
commit 29ac7f776a
99 changed files with 2019 additions and 560 deletions

.github/azure-steps.yml vendored Normal file

@ -0,0 +1,57 @@
parameters:
python_version: ''
architecture: ''
prefix: ''
gpu: false
num_build_jobs: 1
steps:
- task: UsePythonVersion@0
inputs:
versionSpec: ${{ parameters.python_version }}
architecture: ${{ parameters.architecture }}
- script: |
${{ parameters.prefix }} python -m pip install -U pip setuptools
${{ parameters.prefix }} python -m pip install -U -r requirements.txt
displayName: "Install dependencies"
- script: |
${{ parameters.prefix }} python setup.py build_ext --inplace -j ${{ parameters.num_build_jobs }}
${{ parameters.prefix }} python setup.py sdist --formats=gztar
displayName: "Compile and build sdist"
- task: DeleteFiles@1
inputs:
contents: "spacy"
displayName: "Delete source directory"
- script: |
${{ parameters.prefix }} python -m pip freeze --exclude torch --exclude cupy-cuda110 > installed.txt
${{ parameters.prefix }} python -m pip uninstall -y -r installed.txt
displayName: "Uninstall all packages"
- bash: |
${{ parameters.prefix }} SDIST=$(python -c "import os;print(os.listdir('./dist')[-1])" 2>&1)
${{ parameters.prefix }} python -m pip install dist/$SDIST
displayName: "Install from sdist"
- script: |
${{ parameters.prefix }} python -m pip install -U -r requirements.txt
displayName: "Install test requirements"
- script: |
${{ parameters.prefix }} python -m pip install -U cupy-cuda110
${{ parameters.prefix }} python -m pip install "torch==1.7.1+cu110" -f https://download.pytorch.org/whl/torch_stable.html
displayName: "Install GPU requirements"
condition: eq(${{ parameters.gpu }}, true)
- script: |
${{ parameters.prefix }} python -m pytest --pyargs spacy
displayName: "Run CPU tests"
condition: eq(${{ parameters.gpu }}, false)
- script: |
${{ parameters.prefix }} python -m pytest --pyargs spacy -p spacy.tests.enable_gpu
displayName: "Run GPU tests"
condition: eq(${{ parameters.gpu }}, true)
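
For illustration only: the "Install from sdist" step above selects the archive with `os.listdir('./dist')[-1]`. Because `os.listdir` returns entries in arbitrary order, a more defensive lookup could filter and sort first. The sketch below is an assumption about how such a lookup might be written; `latest_sdist` is a hypothetical helper and is not part of this commit.

```python
import os
import tempfile


def latest_sdist(dist_dir: str) -> str:
    """Return the lexicographically last .tar.gz file name in dist_dir.

    Hypothetical helper: the sort is plain string ordering, not
    version-aware, so it is only reliable when a single sdist is present
    or when the names happen to sort in version order.
    """
    names = sorted(n for n in os.listdir(dist_dir) if n.endswith(".tar.gz"))
    if not names:
        raise FileNotFoundError(f"No .tar.gz archive found in {dist_dir}")
    return names[-1]


if __name__ == "__main__":
    # Demo on a throwaway directory so the sketch runs anywhere.
    with tempfile.TemporaryDirectory() as tmp:
        for name in ("spacy-3.0.5.tar.gz", "spacy-3.0.6.tar.gz", "notes.txt"):
            open(os.path.join(tmp, name), "w").close()
        print(latest_sdist(tmp))
```

In the pipeline the `dist` directory is freshly built and normally holds a single archive, so the original one-liner is usually sufficient.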

.github/contributors/AyushExel.md vendored Normal file

@ -0,0 +1,106 @@
# spaCy contributor agreement
This spaCy Contributor Agreement (**"SCA"**) is based on the
[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf).
The SCA applies to any contribution that you make to any product or project
managed by us (the **"project"**), and sets out the intellectual property rights
you grant to us in the contributed materials. The term **"us"** shall mean
[ExplosionAI GmbH](https://explosion.ai/legal). The term
**"you"** shall mean the person or entity identified below.
If you agree to be bound by these terms, fill in the information requested
below and include the filled-in version with your first pull request, under the
folder [`.github/contributors/`](/.github/contributors/). The name of the file
should be your GitHub username, with the extension `.md`. For example, the user
example_user would create the file `.github/contributors/example_user.md`.
Read this agreement carefully before signing. These terms and conditions
constitute a binding legal agreement.
## Contributor Agreement
1. The term "contribution" or "contributed materials" means any source code,
object code, patch, tool, sample, graphic, specification, manual,
documentation, or any other material posted or submitted by you to the project.
2. With respect to any worldwide copyrights, or copyright applications and
registrations, in your contribution:
* you hereby assign to us joint ownership, and to the extent that such
assignment is or becomes invalid, ineffective or unenforceable, you hereby
grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
royalty-free, unrestricted license to exercise all rights under those
copyrights. This includes, at our option, the right to sublicense these same
rights to third parties through multiple levels of sublicensees or other
licensing arrangements;
* you agree that each of us can do all things in relation to your
contribution as if each of us were the sole owners, and if one of us makes
a derivative work of your contribution, the one who makes the derivative
work (or has it made) will be the sole owner of that derivative work;
* you agree that you will not assert any moral rights in your contribution
against us, our licensees or transferees;
* you agree that we may register a copyright in your contribution and
exercise all ownership rights associated with it; and
* you agree that neither of us has any duty to consult with, obtain the
consent of, pay or render an accounting to the other for any use or
distribution of your contribution.
3. With respect to any patents you own, or that you can license without payment
to any third party, you hereby grant to us a perpetual, irrevocable,
non-exclusive, worldwide, no-charge, royalty-free license to:
* make, have made, use, sell, offer to sell, import, and otherwise transfer
your contribution in whole or in part, alone or in combination with or
included in any product, work or materials arising out of the project to
which your contribution was submitted, and
* at our option, to sublicense these same rights to third parties through
multiple levels of sublicensees or other licensing arrangements.
4. Except as set out above, you keep all right, title, and interest in your
contribution. The rights that you grant to us under these terms are effective
on the date you first submitted a contribution to us, even if your submission
took place before the date you sign these terms.
5. You covenant, represent, warrant and agree that:
* Each contribution that you submit is and shall be an original work of
authorship and you can legally grant the rights set out in this SCA;
* to the best of your knowledge, each contribution will not violate any
third party's copyrights, trademarks, patents, or other intellectual
property rights; and
* each contribution shall be in compliance with U.S. export control laws and
other applicable export and import laws. You agree to notify us if you
become aware of any circumstance which would make any of the foregoing
representations inaccurate in any respect. We may publicly disclose your
participation in the project, including the fact that you have signed the SCA.
6. This SCA is governed by the laws of the State of California and applicable
U.S. Federal law. Any choice of law rules will not apply.
7. Please place an “x” on one of the applicable statements below. Please do NOT
mark both statements:
* [X] I am signing on behalf of myself as an individual and no other person
or entity, including my employer, has or will have rights with respect to my
contributions.
* [ ] I am signing on behalf of my employer or a legal entity and I have the
actual authority to contractually bind that entity.
## Contributor Details
| Field | Entry |
|------------------------------- | -------------------- |
| Name | Ayush Chaurasia |
| Company name (if applicable) | |
| Title or role (if applicable) | |
| Date | 2021-03-12 |
| GitHub username | AyushExel |
| Website (optional) | |

.github/contributors/broaddeep.md vendored Normal file

@ -0,0 +1,106 @@
# spaCy contributor agreement
This spaCy Contributor Agreement (**"SCA"**) is based on the
[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf).
The SCA applies to any contribution that you make to any product or project
managed by us (the **"project"**), and sets out the intellectual property rights
you grant to us in the contributed materials. The term **"us"** shall mean
[ExplosionAI GmbH](https://explosion.ai/legal). The term
**"you"** shall mean the person or entity identified below.
If you agree to be bound by these terms, fill in the information requested
below and include the filled-in version with your first pull request, under the
folder [`.github/contributors/`](/.github/contributors/). The name of the file
should be your GitHub username, with the extension `.md`. For example, the user
example_user would create the file `.github/contributors/example_user.md`.
Read this agreement carefully before signing. These terms and conditions
constitute a binding legal agreement.
## Contributor Agreement
1. The term "contribution" or "contributed materials" means any source code,
object code, patch, tool, sample, graphic, specification, manual,
documentation, or any other material posted or submitted by you to the project.
2. With respect to any worldwide copyrights, or copyright applications and
registrations, in your contribution:
* you hereby assign to us joint ownership, and to the extent that such
assignment is or becomes invalid, ineffective or unenforceable, you hereby
grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
royalty-free, unrestricted license to exercise all rights under those
copyrights. This includes, at our option, the right to sublicense these same
rights to third parties through multiple levels of sublicensees or other
licensing arrangements;
* you agree that each of us can do all things in relation to your
contribution as if each of us were the sole owners, and if one of us makes
a derivative work of your contribution, the one who makes the derivative
work (or has it made) will be the sole owner of that derivative work;
* you agree that you will not assert any moral rights in your contribution
against us, our licensees or transferees;
* you agree that we may register a copyright in your contribution and
exercise all ownership rights associated with it; and
* you agree that neither of us has any duty to consult with, obtain the
consent of, pay or render an accounting to the other for any use or
distribution of your contribution.
3. With respect to any patents you own, or that you can license without payment
to any third party, you hereby grant to us a perpetual, irrevocable,
non-exclusive, worldwide, no-charge, royalty-free license to:
* make, have made, use, sell, offer to sell, import, and otherwise transfer
your contribution in whole or in part, alone or in combination with or
included in any product, work or materials arising out of the project to
which your contribution was submitted, and
* at our option, to sublicense these same rights to third parties through
multiple levels of sublicensees or other licensing arrangements.
4. Except as set out above, you keep all right, title, and interest in your
contribution. The rights that you grant to us under these terms are effective
on the date you first submitted a contribution to us, even if your submission
took place before the date you sign these terms.
5. You covenant, represent, warrant and agree that:
* Each contribution that you submit is and shall be an original work of
authorship and you can legally grant the rights set out in this SCA;
* to the best of your knowledge, each contribution will not violate any
third party's copyrights, trademarks, patents, or other intellectual
property rights; and
* each contribution shall be in compliance with U.S. export control laws and
other applicable export and import laws. You agree to notify us if you
become aware of any circumstance which would make any of the foregoing
representations inaccurate in any respect. We may publicly disclose your
participation in the project, including the fact that you have signed the SCA.
6. This SCA is governed by the laws of the State of California and applicable
U.S. Federal law. Any choice of law rules will not apply.
7. Please place an “x” on one of the applicable statements below. Please do NOT
mark both statements:
* [x] I am signing on behalf of myself as an individual and no other person
or entity, including my employer, has or will have rights with respect to my
contributions.
* [ ] I am signing on behalf of my employer or a legal entity and I have the
actual authority to contractually bind that entity.
## Contributor Details
| Field | Entry |
|------------------------------- | -------------------- |
| Name | Dongjun Park |
| Company name (if applicable) | |
| Title or role (if applicable) | |
| Date | 2021-03-06 |
| GitHub username | broaddeep |
| Website (optional) | |

View File

@ -76,39 +76,24 @@ jobs:
 maxParallel: 4
 pool:
 vmImage: $(imageName)
 steps:
-- task: UsePythonVersion@0
-inputs:
-versionSpec: "$(python.version)"
-architecture: "x64"
-- script: |
-python -m pip install -U setuptools
-pip install -r requirements.txt
-displayName: "Install dependencies"
-- script: |
-python setup.py build_ext --inplace
-python setup.py sdist --formats=gztar
-displayName: "Compile and build sdist"
-- task: DeleteFiles@1
-inputs:
-contents: "spacy"
-displayName: "Delete source directory"
-- script: |
-pip freeze > installed.txt
-pip uninstall -y -r installed.txt
-displayName: "Uninstall all packages"
-- bash: |
-SDIST=$(python -c "import os;print(os.listdir('./dist')[-1])" 2>&1)
-pip install dist/$SDIST
-displayName: "Install from sdist"
-- script: |
-pip install -r requirements.txt
-python -m pytest --pyargs spacy
-displayName: "Run tests"
+- template: .github/azure-steps.yml
+parameters:
+python_version: '$(python.version)'
+architecture: 'x64'
+- job: "TestGPU"
+dependsOn: "Validate"
+strategy:
+matrix:
+Python38LinuxX64_GPU:
+python.version: '3.8'
+pool:
+name: "LinuxX64_GPU"
+steps:
+- template: .github/azure-steps.yml
+parameters:
+python_version: '$(python.version)'
+architecture: 'x64'
+gpu: true
+num_build_jobs: 24

View File

@ -5,7 +5,7 @@ requires = [
 "cymem>=2.0.2,<2.1.0",
 "preshed>=3.0.2,<3.1.0",
 "murmurhash>=0.28.0,<1.1.0",
-"thinc>=8.0.2,<8.1.0",
+"thinc>=8.0.3,<8.1.0",
 "blis>=0.4.0,<0.8.0",
 "pathy",
 "numpy>=1.15.0",

View File

@ -1,14 +1,14 @@
 # Our libraries
-spacy-legacy>=3.0.0,<3.1.0
+spacy-legacy>=3.0.4,<3.1.0
 cymem>=2.0.2,<2.1.0
 preshed>=3.0.2,<3.1.0
-thinc>=8.0.2,<8.1.0
+thinc>=8.0.3,<8.1.0
 blis>=0.4.0,<0.8.0
 ml_datasets>=0.2.0,<0.3.0
 murmurhash>=0.28.0,<1.1.0
 wasabi>=0.8.1,<1.1.0
-srsly>=2.4.0,<3.0.0
-catalogue>=2.0.1,<2.1.0
+srsly>=2.4.1,<3.0.0
+catalogue>=2.0.3,<2.1.0
 typer>=0.3.0,<0.4.0
 pathy>=0.3.5
 # Third party dependencies
@ -20,7 +20,6 @@ jinja2
 # Official Python utilities
 setuptools
 packaging>=20.0
-importlib_metadata>=0.20; python_version < "3.8"
 typing_extensions>=3.7.4.1,<4.0.0.0; python_version < "3.8"
 # Development dependencies
 cython>=0.25

View File

@ -34,18 +34,18 @@ setup_requires =
 cymem>=2.0.2,<2.1.0
 preshed>=3.0.2,<3.1.0
 murmurhash>=0.28.0,<1.1.0
-thinc>=8.0.2,<8.1.0
+thinc>=8.0.3,<8.1.0
 install_requires =
 # Our libraries
-spacy-legacy>=3.0.0,<3.1.0
+spacy-legacy>=3.0.4,<3.1.0
 murmurhash>=0.28.0,<1.1.0
 cymem>=2.0.2,<2.1.0
 preshed>=3.0.2,<3.1.0
-thinc>=8.0.2,<8.1.0
+thinc>=8.0.3,<8.1.0
 blis>=0.4.0,<0.8.0
 wasabi>=0.8.1,<1.1.0
-srsly>=2.4.0,<3.0.0
-catalogue>=2.0.1,<2.1.0
+srsly>=2.4.1,<3.0.0
+catalogue>=2.0.3,<2.1.0
 typer>=0.3.0,<0.4.0
 pathy>=0.3.5
 # Third-party dependencies
@ -57,7 +57,6 @@ install_requires =
 # Official Python utilities
 setuptools
 packaging>=20.0
-importlib_metadata>=0.20; python_version < "3.8"
 typing_extensions>=3.7.4,<4.0.0.0; python_version < "3.8"
 [options.entry_points]
@ -91,6 +90,8 @@ cuda110 =
 cupy-cuda110>=5.0.0b4,<9.0.0
 cuda111 =
 cupy-cuda111>=5.0.0b4,<9.0.0
+cuda112 =
+cupy-cuda112>=5.0.0b4,<9.0.0
 # Language tokenizers with external dependencies
 ja =
 sudachipy>=0.4.9

View File

@ -1,6 +1,6 @@
 # fmt: off
 __title__ = "spacy"
-__version__ = "3.0.5"
+__version__ = "3.0.6"
 __download_url__ = "https://github.com/explosion/spacy-models/releases/download"
 __compatibility__ = "https://raw.githubusercontent.com/explosion/spacy-models/master/compatibility.json"
 __projects__ = "https://github.com/explosion/projects"

View File

@ -9,6 +9,7 @@ from .info import info # noqa: F401
 from .package import package # noqa: F401
 from .profile import profile # noqa: F401
 from .train import train_cli # noqa: F401
+from .assemble import assemble_cli # noqa: F401
 from .pretrain import pretrain # noqa: F401
 from .debug_data import debug_data # noqa: F401
 from .debug_config import debug_config # noqa: F401
@ -29,9 +30,9 @@ from .project.document import project_document # noqa: F401
 @app.command("link", no_args_is_help=True, deprecated=True, hidden=True)
 def link(*args, **kwargs):
-"""As of spaCy v3.0, symlinks like "en" are deprecated. You can load trained
+"""As of spaCy v3.0, symlinks like "en" are not supported anymore. You can load trained
 pipeline packages using their full names or from a directory path."""
 msg.warn(
-"As of spaCy v3.0, model symlinks are deprecated. You can load trained "
+"As of spaCy v3.0, model symlinks are not supported anymore. You can load trained "
 "pipeline packages using their full names or from a directory path."
 )

spacy/cli/assemble.py Normal file

@ -0,0 +1,58 @@
from typing import Optional
from pathlib import Path
from wasabi import msg
import typer
import logging
from ._util import app, Arg, Opt, parse_config_overrides, show_validation_error
from ._util import import_code
from ..training.initialize import init_nlp
from .. import util
from ..util import get_sourced_components, load_model_from_config
@app.command(
"assemble",
context_settings={"allow_extra_args": True, "ignore_unknown_options": True},
)
def assemble_cli(
# fmt: off
ctx: typer.Context, # This is only used to read additional arguments
config_path: Path = Arg(..., help="Path to config file", exists=True, allow_dash=True),
output_path: Path = Arg(..., help="Output directory to store assembled pipeline in"),
code_path: Optional[Path] = Opt(None, "--code", "-c", help="Path to Python file with additional code (registered functions) to be imported"),
verbose: bool = Opt(False, "--verbose", "-V", "-VV", help="Display more information for debugging purposes"),
# fmt: on
):
"""
Assemble a spaCy pipeline from a config file. The config file includes
all settings for initializing the pipeline. To override settings in the
config, e.g. settings that point to local paths or that you want to
experiment with, you can override them as command line options. The
--code argument lets you pass in a Python file that can be used to
register custom functions that are referenced in the config.
DOCS: https://spacy.io/api/cli#assemble
"""
util.logger.setLevel(logging.DEBUG if verbose else logging.INFO)
# Make sure all files and paths exist if they are needed
if not config_path or (str(config_path) != "-" and not config_path.exists()):
msg.fail("Config file not found", config_path, exits=1)
overrides = parse_config_overrides(ctx.args)
import_code(code_path)
with show_validation_error(config_path):
config = util.load_config(config_path, overrides=overrides, interpolate=False)
msg.divider("Initializing pipeline")
nlp = load_model_from_config(config, auto_fill=True)
config = config.interpolate()
sourced = get_sourced_components(config)
# Make sure that listeners are defined before initializing further
nlp._link_components()
with nlp.select_pipes(disable=[*sourced]):
nlp.initialize()
msg.good("Initialized pipeline")
msg.divider("Serializing to disk")
if output_path is not None and not output_path.exists():
output_path.mkdir(parents=True)
msg.good(f"Created output directory: {output_path}")
nlp.to_disk(output_path)
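
The docstring above notes that config settings can be overridden as command line options; `assemble_cli` reads these from `ctx.args` and passes them through `parse_config_overrides`. The block below is a simplified stand-in for that idea, not spaCy's implementation: `parse_overrides_sketch` and the option names in the demo are hypothetical, and the real helper may handle more cases (for example environment variables or stricter validation).

```python
import json
from typing import Any, Dict, List


def parse_overrides_sketch(args: List[str]) -> Dict[str, Any]:
    """Turn extra CLI args into a dict of dotted config overrides.

    Hypothetical illustration: accepts both "--section.key value" and
    "--section.key=value"; a bare "--section.key" is treated as true.
    """
    overrides: Dict[str, Any] = {}
    pending = list(args)
    while pending:
        opt = pending.pop(0)
        if not opt.startswith("--"):
            raise ValueError(f"Expected an option starting with '--', got: {opt}")
        name = opt[2:]
        if "=" in name:
            name, value = name.split("=", 1)
        elif pending and not pending[0].startswith("--"):
            value = pending.pop(0)
        else:
            value = "true"  # bare flag, treated as a boolean
        try:
            # Interpret numbers, booleans and lists as JSON values
            overrides[name] = json.loads(value)
        except json.JSONDecodeError:
            overrides[name] = value  # anything else stays a plain string
    return overrides


if __name__ == "__main__":
    extra = ["--paths.train", "./train.spacy", "--training.dropout=0.2"]
    print(parse_overrides_sketch(extra))
```

Keeping the overrides as a flat dict of dotted names matches how they are passed to `util.load_config(config_path, overrides=overrides, ...)` in the command above.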

View File

@ -1,4 +1,4 @@
-from typing import List, Sequence, Dict, Any, Tuple, Optional
+from typing import List, Sequence, Dict, Any, Tuple, Optional, Set
 from pathlib import Path
 from collections import Counter
 import sys
@ -13,6 +13,8 @@ from ..training.initialize import get_sourced_components
 from ..schemas import ConfigSchemaTraining
 from ..pipeline._parser_internals import nonproj
 from ..pipeline._parser_internals.nonproj import DELIMITER
+from ..pipeline import Morphologizer
+from ..morphology import Morphology
 from ..language import Language
 from ..util import registry, resolve_dot_names
 from .. import util
@ -194,32 +196,32 @@ def debug_data(
 )
 label_counts = gold_train_data["ner"]
 model_labels = _get_labels_from_model(nlp, "ner")
-new_labels = [l for l in labels if l not in model_labels]
-existing_labels = [l for l in labels if l in model_labels]
 has_low_data_warning = False
 has_no_neg_warning = False
 has_ws_ents_error = False
 has_punct_ents_warning = False
 msg.divider("Named Entity Recognition")
-msg.info(
-f"{len(new_labels)} new label(s), {len(existing_labels)} existing label(s)"
-)
+msg.info(f"{len(model_labels)} label(s)")
 missing_values = label_counts["-"]
 msg.text(f"{missing_values} missing value(s) (tokens with '-' label)")
-for label in new_labels:
+for label in labels:
 if len(label) == 0:
-msg.fail("Empty label found in new labels")
-if new_labels:
-labels_with_counts = [
-(label, count)
-for label, count in label_counts.most_common()
-if label != "-"
-]
-labels_with_counts = _format_labels(labels_with_counts, counts=True)
-msg.text(f"New: {labels_with_counts}", show=verbose)
-if existing_labels:
-msg.text(f"Existing: {_format_labels(existing_labels)}", show=verbose)
+msg.fail("Empty label found in train data")
+labels_with_counts = [
+(label, count)
+for label, count in label_counts.most_common()
+if label != "-"
+]
+labels_with_counts = _format_labels(labels_with_counts, counts=True)
+msg.text(f"Labels in train data: {_format_labels(labels)}", show=verbose)
+missing_labels = model_labels - labels
+if missing_labels:
+msg.warn(
+"Some model labels are not present in the train data. The "
+"model performance may be degraded for these labels after "
+f"training: {_format_labels(missing_labels)}."
+)
 if gold_train_data["ws_ents"]:
 msg.fail(f"{gold_train_data['ws_ents']} invalid whitespace entity spans")
 has_ws_ents_error = True
@ -228,10 +230,10 @@ def debug_data(
 msg.warn(f"{gold_train_data['punct_ents']} entity span(s) with punctuation")
 has_punct_ents_warning = True
-for label in new_labels:
+for label in labels:
 if label_counts[label] <= NEW_LABEL_THRESHOLD:
 msg.warn(
-f"Low number of examples for new label '{label}' ({label_counts[label]})"
+f"Low number of examples for label '{label}' ({label_counts[label]})"
 )
 has_low_data_warning = True
@ -276,22 +278,52 @@ def debug_data(
 )
 if "textcat" in factory_names:
-msg.divider("Text Classification")
-labels = [label for label in gold_train_data["cats"]]
-model_labels = _get_labels_from_model(nlp, "textcat")
-new_labels = [l for l in labels if l not in model_labels]
-existing_labels = [l for l in labels if l in model_labels]
-msg.info(
-f"Text Classification: {len(new_labels)} new label(s), "
-f"{len(existing_labels)} existing label(s)"
+msg.divider("Text Classification (Exclusive Classes)")
+labels = _get_labels_from_model(nlp, "textcat")
+msg.info(f"Text Classification: {len(labels)} label(s)")
+msg.text(f"Labels: {_format_labels(labels)}", show=verbose)
+labels_with_counts = _format_labels(
+gold_train_data["cats"].most_common(), counts=True
 )
-if new_labels:
-labels_with_counts = _format_labels(
-gold_train_data["cats"].most_common(), counts=True
+msg.text(f"Labels in train data: {labels_with_counts}", show=verbose)
+missing_labels = labels - set(gold_train_data["cats"].keys())
+if missing_labels:
+msg.warn(
+"Some model labels are not present in the train data. The "
+"model performance may be degraded for these labels after "
+f"training: {_format_labels(missing_labels)}."
+)
+if gold_train_data["n_cats_multilabel"] > 0:
+# Note: you should never get here because you run into E895 on
+# initialization first.
+msg.warn(
+"The train data contains instances without "
+"mutually-exclusive classes. Use the component "
+"'textcat_multilabel' instead of 'textcat'."
+)
+if gold_dev_data["n_cats_multilabel"] > 0:
+msg.fail(
+"Train/dev mismatch: the dev data contains instances "
+"without mutually-exclusive classes while the train data "
+"contains only instances with mutually-exclusive classes."
+)
+if "textcat_multilabel" in factory_names:
+msg.divider("Text Classification (Multilabel)")
+labels = _get_labels_from_model(nlp, "textcat_multilabel")
+msg.info(f"Text Classification: {len(labels)} label(s)")
+msg.text(f"Labels: {_format_labels(labels)}", show=verbose)
+labels_with_counts = _format_labels(
+gold_train_data["cats"].most_common(), counts=True
+)
+msg.text(f"Labels in train data: {labels_with_counts}", show=verbose)
+missing_labels = labels - set(gold_train_data["cats"].keys())
+if missing_labels:
+msg.warn(
+"Some model labels are not present in the train data. The "
+"model performance may be degraded for these labels after "
+f"training: {_format_labels(missing_labels)}."
 )
-msg.text(f"New: {labels_with_counts}", show=verbose)
-if existing_labels:
-msg.text(f"Existing: {_format_labels(existing_labels)}", show=verbose)
 if set(gold_train_data["cats"]) != set(gold_dev_data["cats"]):
 msg.fail(
 f"The train and dev labels are not the same. "
@ -299,11 +331,6 @@ def debug_data(
 f"Dev labels: {_format_labels(gold_dev_data['cats'])}."
 )
 if gold_train_data["n_cats_multilabel"] > 0:
-msg.info(
-"The train data contains instances without "
-"mutually-exclusive classes. Use '--textcat-multilabel' "
-"when training."
-)
 if gold_dev_data["n_cats_multilabel"] == 0:
 msg.warn(
 "Potential train/dev mismatch: the train data contains "
@ -311,9 +338,10 @@ def debug_data(
 "dev data does not."
 )
 else:
-msg.info(
+msg.warn(
 "The train data contains only instances with "
-"mutually-exclusive classes."
+"mutually-exclusive classes. You can potentially use the "
+"component 'textcat' instead of 'textcat_multilabel'."
 )
 if gold_dev_data["n_cats_multilabel"] > 0:
 msg.fail(
@ -325,13 +353,37 @@ def debug_data(
 if "tagger" in factory_names:
 msg.divider("Part-of-speech Tagging")
 labels = [label for label in gold_train_data["tags"]]
-# TODO: does this need to be updated?
-msg.info(f"{len(labels)} label(s) in data")
+model_labels = _get_labels_from_model(nlp, "tagger")
+msg.info(f"{len(labels)} label(s) in train data")
+missing_labels = model_labels - set(labels)
+if missing_labels:
+msg.warn(
+"Some model labels are not present in the train data. The "
+"model performance may be degraded for these labels after "
+f"training: {_format_labels(missing_labels)}."
+)
 labels_with_counts = _format_labels(
 gold_train_data["tags"].most_common(), counts=True
 )
 msg.text(labels_with_counts, show=verbose)
+if "morphologizer" in factory_names:
+msg.divider("Morphologizer (POS+Morph)")
+labels = [label for label in gold_train_data["morphs"]]
+model_labels = _get_labels_from_model(nlp, "morphologizer")
+msg.info(f"{len(labels)} label(s) in train data")
+missing_labels = model_labels - set(labels)
+if missing_labels:
+msg.warn(
+"Some model labels are not present in the train data. The "
+"model performance may be degraded for these labels after "
+f"training: {_format_labels(missing_labels)}."
+)
+labels_with_counts = _format_labels(
+gold_train_data["morphs"].most_common(), counts=True
+)
+msg.text(labels_with_counts, show=verbose)
 if "parser" in factory_names:
 has_low_data_warning = False
 msg.divider("Dependency Parsing")
@ -491,6 +543,7 @@ def _compile_gold(
 "ner": Counter(),
 "cats": Counter(),
 "tags": Counter(),
+"morphs": Counter(),
 "deps": Counter(),
 "words": Counter(),
 "roots": Counter(),
@ -544,13 +597,36 @@ def _compile_gold(
 data["ner"][combined_label] += 1
 elif label == "-":
 data["ner"]["-"] += 1
-if "textcat" in factory_names:
+if "textcat" in factory_names or "textcat_multilabel" in factory_names:
 data["cats"].update(gold.cats)
 if list(gold.cats.values()).count(1.0) != 1:
 data["n_cats_multilabel"] += 1
 if "tagger" in factory_names:
 tags = eg.get_aligned("TAG", as_string=True)
 data["tags"].update([x for x in tags if x is not None])
+if "morphologizer" in factory_names:
+pos_tags = eg.get_aligned("POS", as_string=True)
+morphs = eg.get_aligned("MORPH", as_string=True)
+for pos, morph in zip(pos_tags, morphs):
+# POS may align (same value for multiple tokens) when morph
+# doesn't, so if either is misaligned (None), treat the
+# annotation as missing so that truths doesn't end up with an
+# unknown morph+POS combination
+if pos is None or morph is None:
+pass
+# If both are unset, the annotation is missing (empty morph
+# converted from int is "_" rather than "")
+elif pos == "" and morph == "":
+pass
+# Otherwise, generate the combined label
+else:
+label_dict = Morphology.feats_to_dict(morph)
+if pos:
+label_dict[Morphologizer.POS_FEAT] = pos
+label = eg.reference.vocab.strings[
+eg.reference.vocab.morphology.add(label_dict)
+]
+data["morphs"].update([label])
 if "parser" in factory_names:
 aligned_heads, aligned_deps = eg.get_aligned_parse(projectivize=make_proj)
 data["deps"].update([x for x in aligned_deps if x is not None])
@ -584,8 +660,8 @@ def _get_examples_without_label(data: Sequence[Example], label: str) -> int:
return count return count
def _get_labels_from_model(nlp: Language, pipe_name: str) -> Sequence[str]: def _get_labels_from_model(nlp: Language, pipe_name: str) -> Set[str]:
if pipe_name not in nlp.pipe_names: if pipe_name not in nlp.pipe_names:
return set() return set()
pipe = nlp.get_pipe(pipe_name) pipe = nlp.get_pipe(pipe_name)
return pipe.labels return set(pipe.labels)
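
The morphologizer branch added to `_compile_gold` above can be sketched in plain Python. This is a simplified illustration, not the spaCy implementation: `count_morph_labels` is a hypothetical helper, and the sorted `"|"` join stands in for the normalisation that `vocab.morphology.add` performs in the real code. The `"POS"` feature name and the `"_"` empty-morph marker are assumptions made for the example.

```python
from collections import Counter
from typing import List, Optional


def count_morph_labels(
    pos_tags: List[Optional[str]], morphs: List[Optional[str]]
) -> Counter:
    """Count combined morph+POS labels, skipping misaligned or missing annotation."""
    counts: Counter = Counter()
    for pos, morph in zip(pos_tags, morphs):
        # Either value misaligned (None): treat the annotation as missing.
        if pos is None or morph is None:
            continue
        # Both unset: the annotation is missing.
        if pos == "" and morph == "":
            continue
        # Assumption: features are "|"-separated and "_" means "no features".
        feats = [] if morph in ("", "_") else morph.split("|")
        if pos:
            feats.append(f"POS={pos}")
        # Stand-in for the normalised label string produced by the vocab.
        label = "|".join(sorted(feats)) or "_"
        counts[label] += 1
    return counts


counts = count_morph_labels(
    ["NOUN", None, "", "VERB"],
    ["Number=Sing", "Number=Plur", "", "_"],
)
print(counts)
```

Only the first and last tokens contribute a label here; the second is skipped as misaligned and the third as unannotated.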

View File

@ -206,7 +206,7 @@ factory = "tok2vec"
@architectures = "spacy.Tok2Vec.v2" @architectures = "spacy.Tok2Vec.v2"
[components.tok2vec.model.embed] [components.tok2vec.model.embed]
@architectures = "spacy.MultiHashEmbed.v1" @architectures = "spacy.MultiHashEmbed.v2"
width = ${components.tok2vec.model.encode.width} width = ${components.tok2vec.model.encode.width}
{% if has_letters -%} {% if has_letters -%}
attrs = ["NORM", "PREFIX", "SUFFIX", "SHAPE"] attrs = ["NORM", "PREFIX", "SUFFIX", "SHAPE"]

View File

@ -68,8 +68,11 @@ seed = ${system.seed}
gpu_allocator = ${system.gpu_allocator} gpu_allocator = ${system.gpu_allocator}
dropout = 0.1 dropout = 0.1
accumulate_gradient = 1 accumulate_gradient = 1
# Controls early-stopping. 0 or -1 mean unlimited. # Controls early-stopping. 0 disables early stopping.
patience = 1600 patience = 1600
# Number of epochs. 0 means unlimited. If >= 0, the train corpus is loaded once
# into memory and shuffled within the training loop. -1 means the train corpus
# is streamed rather than loaded into memory, with no shuffling within the
# training loop.
max_epochs = 0 max_epochs = 0
max_steps = 20000 max_steps = 20000
eval_frequency = 200 eval_frequency = 200

View File

@ -157,6 +157,10 @@ class Warnings:
"`spacy.load()` to ensure that the model is loaded on the correct " "`spacy.load()` to ensure that the model is loaded on the correct "
"device. More information: " "device. More information: "
"http://spacy.io/usage/v3#jupyter-notebook-gpu") "http://spacy.io/usage/v3#jupyter-notebook-gpu")
W112 = ("The model specified to use for initial vectors ({name}) has no "
"vectors. This is almost certainly a mistake.")
W113 = ("Sourced component '{name}' may not work as expected: source "
"vectors are not identical to current pipeline vectors.")
@add_codes @add_codes
@ -497,6 +501,12 @@ class Errors:
E202 = ("Unsupported alignment mode '{mode}'. Supported modes: {modes}.") E202 = ("Unsupported alignment mode '{mode}'. Supported modes: {modes}.")
# New errors added in v3.x # New errors added in v3.x
E872 = ("Unable to copy tokenizer from base model due to different "
'tokenizer settings: current tokenizer config "{curr_config}" '
'vs. base model "{base_config}"')
E873 = ("Unable to merge a span from doc.spans with key '{key}' and text "
"'{text}'. This is likely a bug in spaCy, so feel free to open an "
"issue: https://github.com/explosion/spaCy/issues")
E874 = ("Could not initialize the tok2vec model from component " E874 = ("Could not initialize the tok2vec model from component "
"'{component}' and layer '{layer}'.") "'{component}' and layer '{layer}'.")
E875 = ("To use the PretrainVectors objective, make sure that static vectors are loaded. " E875 = ("To use the PretrainVectors objective, make sure that static vectors are loaded. "
@ -631,7 +641,7 @@ class Errors:
"method, make sure it's overwritten on the subclass.") "method, make sure it's overwritten on the subclass.")
E940 = ("Found NaN values in scores.") E940 = ("Found NaN values in scores.")
E941 = ("Can't find model '{name}'. It looks like you're trying to load a " E941 = ("Can't find model '{name}'. It looks like you're trying to load a "
"model from a shortcut, which is deprecated as of spaCy v3.0. To " "model from a shortcut, which is obsolete as of spaCy v3.0. To "
"load the model, use its full name instead:\n\n" "load the model, use its full name instead:\n\n"
"nlp = spacy.load(\"{full}\")\n\nFor more details on the available " "nlp = spacy.load(\"{full}\")\n\nFor more details on the available "
"models, see the models directory: https://spacy.io/models. If you " "models, see the models directory: https://spacy.io/models. If you "
@ -646,8 +656,8 @@ class Errors:
"returned the initialized nlp object instead?") "returned the initialized nlp object instead?")
E944 = ("Can't copy pipeline component '{name}' from source '{model}': " E944 = ("Can't copy pipeline component '{name}' from source '{model}': "
"not found in pipeline. Available components: {opts}") "not found in pipeline. Available components: {opts}")
E945 = ("Can't copy pipeline component '{name}' from source. Expected loaded " E945 = ("Can't copy pipeline component '{name}' from source. Expected "
"nlp object, but got: {source}") "loaded nlp object, but got: {source}")
E947 = ("`Matcher.add` received invalid `greedy` argument: expected " E947 = ("`Matcher.add` received invalid `greedy` argument: expected "
"a string value from {expected} but got: '{arg}'") "a string value from {expected} but got: '{arg}'")
E948 = ("`Matcher.add` received invalid 'patterns' argument: expected " E948 = ("`Matcher.add` received invalid 'patterns' argument: expected "

View File

@ -17,14 +17,19 @@ _exc = {
for orth in [ for orth in [
"..", "..",
"....", "....",
"a.C.",
"al.", "al.",
"all-path", "all-path",
"art.", "art.",
"Art.", "Art.",
"artt.", "artt.",
"att.", "att.",
"avv.",
"Avv."
"by-pass", "by-pass",
"c.d.", "c.d.",
"c/c",
"C.so",
"centro-sinistra", "centro-sinistra",
"check-up", "check-up",
"Civ.", "Civ.",
@ -48,6 +53,8 @@ for orth in [
"prof.", "prof.",
"sett.", "sett.",
"s.p.a.", "s.p.a.",
"s.n.c",
"s.r.l",
"ss.", "ss.",
"St.", "St.",
"tel.", "tel.",

View File

@ -682,9 +682,14 @@ class Language:
name (str): Optional alternative name to use in current pipeline. name (str): Optional alternative name to use in current pipeline.
RETURNS (Tuple[Callable, str]): The component and its factory name. RETURNS (Tuple[Callable, str]): The component and its factory name.
""" """
# TODO: handle errors and mismatches (vectors etc.) # Check source type
if not isinstance(source, self.__class__): if not isinstance(source, Language):
raise ValueError(Errors.E945.format(name=source_name, source=type(source))) raise ValueError(Errors.E945.format(name=source_name, source=type(source)))
# Check vectors, with faster checks first
if (
self.vocab.vectors.shape != source.vocab.vectors.shape
or self.vocab.vectors.key2row != source.vocab.vectors.key2row
or self.vocab.vectors.to_bytes() != source.vocab.vectors.to_bytes()
):
util.logger.warning(Warnings.W113.format(name=source_name))
if not source_name in source.component_names: if not source_name in source.component_names:
raise KeyError( raise KeyError(
Errors.E944.format( Errors.E944.format(
@ -1673,7 +1678,16 @@ class Language:
# model with the same vocab as the current nlp object # model with the same vocab as the current nlp object
source_nlps[model] = util.load_model(model, vocab=nlp.vocab) source_nlps[model] = util.load_model(model, vocab=nlp.vocab)
source_name = pipe_cfg.get("component", pipe_name) source_name = pipe_cfg.get("component", pipe_name)
listeners_replaced = False
if "replace_listeners" in pipe_cfg:
for name, proc in source_nlps[model].pipeline:
if source_name in getattr(proc, "listening_components", []):
source_nlps[model].replace_listeners(name, source_name, pipe_cfg["replace_listeners"])
listeners_replaced = True
nlp.add_pipe(source_name, source=source_nlps[model], name=pipe_name) nlp.add_pipe(source_name, source=source_nlps[model], name=pipe_name)
# Delete from cache if listeners were replaced
if listeners_replaced:
del source_nlps[model]
disabled_pipes = [*config["nlp"]["disabled"], *disable] disabled_pipes = [*config["nlp"]["disabled"], *disable]
nlp._disabled = set(p for p in disabled_pipes if p not in exclude) nlp._disabled = set(p for p in disabled_pipes if p not in exclude)
nlp.batch_size = config["nlp"]["batch_size"] nlp.batch_size = config["nlp"]["batch_size"]
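
The vector check added to `create_pipe_from_source` above orders its comparisons from cheapest to most expensive, relying on `or` short-circuiting. A minimal sketch of that idea follows; `ToyVectors` and `vectors_differ` are hypothetical stand-ins invented for illustration and are not spaCy classes or functions.

```python
from dataclasses import dataclass
from typing import Dict, Tuple


@dataclass
class ToyVectors:
    """Hypothetical stand-in for a vectors table (not the spaCy class)."""

    shape: Tuple[int, int]
    key2row: Dict[int, int]
    data: bytes


def vectors_differ(a: ToyVectors, b: ToyVectors) -> bool:
    # `or` short-circuits, so the byte comparison only runs when the
    # shape and the key-to-row mapping are already known to be equal.
    return a.shape != b.shape or a.key2row != b.key2row or a.data != b.data


current = ToyVectors((2, 3), {10: 0, 11: 1}, b"abcdef")
same = ToyVectors((2, 3), {10: 0, 11: 1}, b"abcdef")
other_shape = ToyVectors((4, 3), {10: 0, 11: 1}, b"abcdef")
other_data = ToyVectors((2, 3), {10: 0, 11: 1}, b"abcdeX")

print(vectors_differ(current, same))
print(vectors_differ(current, other_shape))
print(vectors_differ(current, other_data))
```

In the diff a mismatch only triggers warning W113 rather than an error, so a sourced component can still be added when the vectors differ.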

View File

@ -299,7 +299,7 @@ cdef class DependencyMatcher:
if isinstance(doclike, Doc): if isinstance(doclike, Doc):
doc = doclike doc = doclike
elif isinstance(doclike, Span): elif isinstance(doclike, Span):
doc = doclike.as_doc() doc = doclike.as_doc(copy_user_data=True)
else: else:
raise ValueError(Errors.E195.format(good="Doc or Span", got=type(doclike).__name__)) raise ValueError(Errors.E195.format(good="Doc or Span", got=type(doclike).__name__))

View File

@ -46,6 +46,12 @@ cdef struct TokenPatternC:
int32_t nr_py int32_t nr_py
quantifier_t quantifier quantifier_t quantifier
hash_t key hash_t key
int32_t token_idx
cdef struct MatchAlignmentC:
int32_t token_idx
int32_t length
cdef struct PatternStateC: cdef struct PatternStateC:

View File

@ -196,7 +196,7 @@ cdef class Matcher:
else: else:
yield doc yield doc
def __call__(self, object doclike, *, as_spans=False, allow_missing=False): def __call__(self, object doclike, *, as_spans=False, allow_missing=False, with_alignments=False):
"""Find all token sequences matching the supplied pattern. """Find all token sequences matching the supplied pattern.
doclike (Doc or Span): The document to match over. doclike (Doc or Span): The document to match over.
@ -204,10 +204,16 @@ cdef class Matcher:
start, end) tuples. start, end) tuples.
allow_missing (bool): Whether to skip checks for missing annotation for allow_missing (bool): Whether to skip checks for missing annotation for
attributes included in patterns. Defaults to False. attributes included in patterns. Defaults to False.
with_alignments (bool): Return match alignment information: a
`List[int]` with one entry per token in the matched span, where each
entry is the index of the token pattern that matched that token. If
as_spans is set to True, this setting is ignored.
RETURNS (list): A list of `(match_id, start, end)` tuples, RETURNS (list): A list of `(match_id, start, end)` tuples,
describing the matches. A match tuple describes a span describing the matches. A match tuple describes a span
`doc[start:end]`. The `match_id` is an integer. If as_spans is set `doc[start:end]`. The `match_id` is an integer. If as_spans is set
to True, a list of Span objects is returned. to True, a list of Span objects is returned.
If with_alignments is set to True and as_spans is set to False,
a list of `(match_id, start, end, alignments)` tuples is returned.
""" """
if isinstance(doclike, Doc): if isinstance(doclike, Doc):
doc = doclike doc = doclike
@ -217,6 +223,9 @@ cdef class Matcher:
length = doclike.end - doclike.start length = doclike.end - doclike.start
else: else:
raise ValueError(Errors.E195.format(good="Doc or Span", got=type(doclike).__name__)) raise ValueError(Errors.E195.format(good="Doc or Span", got=type(doclike).__name__))
# Skip alignments calculations if as_spans is set
if as_spans:
with_alignments = False
cdef Pool tmp_pool = Pool() cdef Pool tmp_pool = Pool()
if not allow_missing: if not allow_missing:
for attr in (TAG, POS, MORPH, LEMMA, DEP): for attr in (TAG, POS, MORPH, LEMMA, DEP):
@ -232,18 +241,20 @@ cdef class Matcher:
error_msg = Errors.E155.format(pipe=pipe, attr=self.vocab.strings.as_string(attr)) error_msg = Errors.E155.format(pipe=pipe, attr=self.vocab.strings.as_string(attr))
raise ValueError(error_msg) raise ValueError(error_msg)
matches = find_matches(&self.patterns[0], self.patterns.size(), doclike, length, matches = find_matches(&self.patterns[0], self.patterns.size(), doclike, length,
extensions=self._extensions, predicates=self._extra_predicates) extensions=self._extensions, predicates=self._extra_predicates, with_alignments=with_alignments)
final_matches = [] final_matches = []
pairs_by_id = {} pairs_by_id = {}
# For each key, either add all matches, or only the filtered, non-overlapping ones # For each key, either add all matches, or only the filtered,
for (key, start, end) in matches: # non-overlapping ones. Each `match` is either (start, end) or
# (start, end, alignments), depending on the `with_alignments` option.
for key, *match in matches:
span_filter = self._filter.get(key) span_filter = self._filter.get(key)
if span_filter is not None: if span_filter is not None:
pairs = pairs_by_id.get(key, []) pairs = pairs_by_id.get(key, [])
pairs.append((start,end)) pairs.append(match)
pairs_by_id[key] = pairs pairs_by_id[key] = pairs
else: else:
final_matches.append((key, start, end)) final_matches.append((key, *match))
matched = <char*>tmp_pool.alloc(length, sizeof(char)) matched = <char*>tmp_pool.alloc(length, sizeof(char))
empty = <char*>tmp_pool.alloc(length, sizeof(char)) empty = <char*>tmp_pool.alloc(length, sizeof(char))
for key, pairs in pairs_by_id.items(): for key, pairs in pairs_by_id.items():
@ -255,14 +266,18 @@ cdef class Matcher:
sorted_pairs = sorted(pairs, key=lambda x: (x[1]-x[0], -x[0]), reverse=True) # reverse sort by length sorted_pairs = sorted(pairs, key=lambda x: (x[1]-x[0], -x[0]), reverse=True) # reverse sort by length
else: else:
raise ValueError(Errors.E947.format(expected=["FIRST", "LONGEST"], arg=span_filter)) raise ValueError(Errors.E947.format(expected=["FIRST", "LONGEST"], arg=span_filter))
for (start, end) in sorted_pairs: for match in sorted_pairs:
start, end = match[:2]
assert 0 <= start < end # Defend against segfaults assert 0 <= start < end # Defend against segfaults
span_len = end-start span_len = end-start
# If no tokens in the span have matched # If no tokens in the span have matched
if memcmp(&matched[start], &empty[start], span_len * sizeof(matched[0])) == 0: if memcmp(&matched[start], &empty[start], span_len * sizeof(matched[0])) == 0:
final_matches.append((key, start, end)) final_matches.append((key, *match))
# Mark tokens that have matched # Mark tokens that have matched
memset(&matched[start], 1, span_len * sizeof(matched[0])) memset(&matched[start], 1, span_len * sizeof(matched[0]))
if with_alignments:
final_matches_with_alignments = final_matches
final_matches = [(key, start, end) for key, start, end, alignments in final_matches]
# perform the callbacks on the filtered set of results # perform the callbacks on the filtered set of results
for i, (key, start, end) in enumerate(final_matches): for i, (key, start, end) in enumerate(final_matches):
on_match = self._callbacks.get(key, None) on_match = self._callbacks.get(key, None)
@ -270,6 +285,22 @@ cdef class Matcher:
on_match(self, doc, i, final_matches) on_match(self, doc, i, final_matches)
if as_spans: if as_spans:
return [Span(doc, start, end, label=key) for key, start, end in final_matches] return [Span(doc, start, end, label=key) for key, start, end in final_matches]
elif with_alignments:
# convert alignments List[Dict[str, int]] --> List[int]
final_matches = []
# when multiple alignments are found for the same length,
# keep the one with the largest token_idx
for key, start, end, alignments in final_matches_with_alignments:
sorted_alignments = sorted(alignments, key=lambda x: (x['length'], x['token_idx']), reverse=False)
alignments = [0] * (end-start)
for align in sorted_alignments:
if align['length'] >= end-start:
continue
# Since alignments are sorted in order of (length, token_idx)
# this overwrites smaller token_idx when they have same length.
alignments[align['length']] = align['token_idx']
final_matches.append((key, start, end, alignments))
return final_matches
else: else:
return final_matches return final_matches
@ -288,9 +319,9 @@ def unpickle_matcher(vocab, patterns, callbacks):
return matcher return matcher
cdef find_matches(TokenPatternC** patterns, int n, object doclike, int length, extensions=None, predicates=tuple()): cdef find_matches(TokenPatternC** patterns, int n, object doclike, int length, extensions=None, predicates=tuple(), bint with_alignments=0):
"""Find matches in a doc, with a compiled array of patterns. Matches are """Find matches in a doc, with a compiled array of patterns. Matches are
returned as a list of (id, start, end) tuples. returned as a list of (id, start, end) tuples or (id, start, end, alignments) tuples (if with_alignments != 0)
To augment the compiled patterns, we optionally also take two Python lists. To augment the compiled patterns, we optionally also take two Python lists.
@ -302,6 +333,8 @@ cdef find_matches(TokenPatternC** patterns, int n, object doclike, int length, e
""" """
cdef vector[PatternStateC] states cdef vector[PatternStateC] states
cdef vector[MatchC] matches cdef vector[MatchC] matches
cdef vector[vector[MatchAlignmentC]] align_states
cdef vector[vector[MatchAlignmentC]] align_matches
cdef PatternStateC state cdef PatternStateC state
cdef int i, j, nr_extra_attr cdef int i, j, nr_extra_attr
cdef Pool mem = Pool() cdef Pool mem = Pool()
@ -328,12 +361,14 @@ cdef find_matches(TokenPatternC** patterns, int n, object doclike, int length, e
for i in range(length): for i in range(length):
for j in range(n): for j in range(n):
states.push_back(PatternStateC(patterns[j], i, 0)) states.push_back(PatternStateC(patterns[j], i, 0))
transition_states(states, matches, predicate_cache, if with_alignments != 0:
doclike[i], extra_attr_values, predicates) align_states.resize(states.size())
transition_states(states, matches, align_states, align_matches, predicate_cache,
doclike[i], extra_attr_values, predicates, with_alignments)
extra_attr_values += nr_extra_attr extra_attr_values += nr_extra_attr
predicate_cache += len(predicates) predicate_cache += len(predicates)
# Handle matches that end in 0-width patterns # Handle matches that end in 0-width patterns
finish_states(matches, states) finish_states(matches, states, align_matches, align_states, with_alignments)
seen = set() seen = set()
for i in range(matches.size()): for i in range(matches.size()):
match = ( match = (
@ -346,16 +381,22 @@ cdef find_matches(TokenPatternC** patterns, int n, object doclike, int length, e
# first .?, or the second .? -- it doesn't matter, it's just one match. # first .?, or the second .? -- it doesn't matter, it's just one match.
# Skip 0-length matches. (TODO: fix algorithm) # Skip 0-length matches. (TODO: fix algorithm)
if match not in seen and matches[i].length > 0: if match not in seen and matches[i].length > 0:
output.append(match) if with_alignments != 0:
# align_matches has the same length as matches, so both can be indexed with the same 'i'
output.append(match + (align_matches[i],))
else:
output.append(match)
seen.add(match) seen.add(match)
return output return output
cdef void transition_states(vector[PatternStateC]& states, vector[MatchC]& matches, cdef void transition_states(vector[PatternStateC]& states, vector[MatchC]& matches,
vector[vector[MatchAlignmentC]]& align_states, vector[vector[MatchAlignmentC]]& align_matches,
int8_t* cached_py_predicates, int8_t* cached_py_predicates,
Token token, const attr_t* extra_attrs, py_predicates) except *: Token token, const attr_t* extra_attrs, py_predicates, bint with_alignments) except *:
cdef int q = 0 cdef int q = 0
cdef vector[PatternStateC] new_states cdef vector[PatternStateC] new_states
cdef vector[vector[MatchAlignmentC]] align_new_states
cdef int nr_predicate = len(py_predicates) cdef int nr_predicate = len(py_predicates)
for i in range(states.size()): for i in range(states.size()):
if states[i].pattern.nr_py >= 1: if states[i].pattern.nr_py >= 1:
@ -370,23 +411,39 @@ cdef void transition_states(vector[PatternStateC]& states, vector[MatchC]& match
# it in the states list, because q doesn't advance. # it in the states list, because q doesn't advance.
state = states[i] state = states[i]
states[q] = state states[q] = state
# Alignments are tracked separately from states, so users who only need the basic options (without alignments) pay no extra cost.
# `align_states` always corresponds to `states` 1:1.
if with_alignments != 0:
align_state = align_states[i]
align_states[q] = align_state
while action in (RETRY, RETRY_ADVANCE, RETRY_EXTEND): while action in (RETRY, RETRY_ADVANCE, RETRY_EXTEND):
# Update alignment before the transition of current state
# 'MatchAlignmentC' maps 'original token index of current pattern' to 'current matching length'
if with_alignments != 0:
align_states[q].push_back(MatchAlignmentC(states[q].pattern.token_idx, states[q].length))
if action == RETRY_EXTEND: if action == RETRY_EXTEND:
# This handles the 'extend' # This handles the 'extend'
new_states.push_back( new_states.push_back(
PatternStateC(pattern=states[q].pattern, start=state.start, PatternStateC(pattern=states[q].pattern, start=state.start,
length=state.length+1)) length=state.length+1))
if with_alignments != 0:
align_new_states.push_back(align_states[q])
if action == RETRY_ADVANCE: if action == RETRY_ADVANCE:
# This handles the 'advance' # This handles the 'advance'
new_states.push_back( new_states.push_back(
PatternStateC(pattern=states[q].pattern+1, start=state.start, PatternStateC(pattern=states[q].pattern+1, start=state.start,
length=state.length+1)) length=state.length+1))
if with_alignments != 0:
align_new_states.push_back(align_states[q])
states[q].pattern += 1 states[q].pattern += 1
if states[q].pattern.nr_py != 0: if states[q].pattern.nr_py != 0:
update_predicate_cache(cached_py_predicates, update_predicate_cache(cached_py_predicates,
states[q].pattern, token, py_predicates) states[q].pattern, token, py_predicates)
action = get_action(states[q], token.c, extra_attrs, action = get_action(states[q], token.c, extra_attrs,
cached_py_predicates) cached_py_predicates)
# Update alignment before the transition of current state
if with_alignments != 0:
align_states[q].push_back(MatchAlignmentC(states[q].pattern.token_idx, states[q].length))
if action == REJECT: if action == REJECT:
pass pass
elif action == ADVANCE: elif action == ADVANCE:
@ -399,29 +456,50 @@ cdef void transition_states(vector[PatternStateC]& states, vector[MatchC]& match
matches.push_back( matches.push_back(
MatchC(pattern_id=ent_id, start=state.start, MatchC(pattern_id=ent_id, start=state.start,
length=state.length+1)) length=state.length+1))
# `align_matches` always corresponds to `matches` 1:1
if with_alignments != 0:
align_matches.push_back(align_states[q])
elif action == MATCH_DOUBLE: elif action == MATCH_DOUBLE:
# push match without last token if length > 0 # push match without last token if length > 0
if state.length > 0: if state.length > 0:
matches.push_back( matches.push_back(
MatchC(pattern_id=ent_id, start=state.start, MatchC(pattern_id=ent_id, start=state.start,
length=state.length)) length=state.length))
# MATCH_DOUBLE emits matches twice,
# add one more to align_matches in order to keep 1:1 relationship
if with_alignments != 0:
align_matches.push_back(align_states[q])
# push match with last token # push match with last token
matches.push_back( matches.push_back(
MatchC(pattern_id=ent_id, start=state.start, MatchC(pattern_id=ent_id, start=state.start,
length=state.length+1)) length=state.length+1))
# `align_matches` always corresponds to `matches` 1:1
if with_alignments != 0:
align_matches.push_back(align_states[q])
elif action == MATCH_REJECT: elif action == MATCH_REJECT:
matches.push_back( matches.push_back(
MatchC(pattern_id=ent_id, start=state.start, MatchC(pattern_id=ent_id, start=state.start,
length=state.length)) length=state.length))
# `align_matches` always corresponds to `matches` 1:1
if with_alignments != 0:
align_matches.push_back(align_states[q])
elif action == MATCH_EXTEND: elif action == MATCH_EXTEND:
matches.push_back( matches.push_back(
MatchC(pattern_id=ent_id, start=state.start, MatchC(pattern_id=ent_id, start=state.start,
length=state.length)) length=state.length))
# `align_matches` always corresponds to `matches` 1:1
if with_alignments != 0:
align_matches.push_back(align_states[q])
states[q].length += 1 states[q].length += 1
q += 1 q += 1
states.resize(q) states.resize(q)
for i in range(new_states.size()): for i in range(new_states.size()):
states.push_back(new_states[i]) states.push_back(new_states[i])
# `align_states` always corresponds to `states` 1:1
if with_alignments != 0:
align_states.resize(q)
for i in range(align_new_states.size()):
align_states.push_back(align_new_states[i])
cdef int update_predicate_cache(int8_t* cache, cdef int update_predicate_cache(int8_t* cache,
@ -444,15 +522,27 @@ cdef int update_predicate_cache(int8_t* cache,
raise ValueError(Errors.E125.format(value=result)) raise ValueError(Errors.E125.format(value=result))
cdef void finish_states(vector[MatchC]& matches, vector[PatternStateC]& states) except *: cdef void finish_states(vector[MatchC]& matches, vector[PatternStateC]& states,
vector[vector[MatchAlignmentC]]& align_matches,
vector[vector[MatchAlignmentC]]& align_states,
bint with_alignments) except *:
"""Handle states that end in zero-width patterns.""" """Handle states that end in zero-width patterns."""
cdef PatternStateC state cdef PatternStateC state
cdef vector[MatchAlignmentC] align_state
for i in range(states.size()): for i in range(states.size()):
state = states[i] state = states[i]
if with_alignments != 0:
align_state = align_states[i]
while get_quantifier(state) in (ZERO_PLUS, ZERO_ONE): while get_quantifier(state) in (ZERO_PLUS, ZERO_ONE):
# Update alignment before the transition of current state
if with_alignments != 0:
align_state.push_back(MatchAlignmentC(state.pattern.token_idx, state.length))
is_final = get_is_final(state) is_final = get_is_final(state)
if is_final: if is_final:
ent_id = get_ent_id(state.pattern) ent_id = get_ent_id(state.pattern)
# `align_matches` always corresponds to `matches` 1:1
if with_alignments != 0:
align_matches.push_back(align_state)
matches.push_back( matches.push_back(
MatchC(pattern_id=ent_id, start=state.start, length=state.length)) MatchC(pattern_id=ent_id, start=state.start, length=state.length))
break break
@ -607,7 +697,7 @@ cdef int8_t get_quantifier(PatternStateC state) nogil:
cdef TokenPatternC* init_pattern(Pool mem, attr_t entity_id, object token_specs) except NULL: cdef TokenPatternC* init_pattern(Pool mem, attr_t entity_id, object token_specs) except NULL:
pattern = <TokenPatternC*>mem.alloc(len(token_specs) + 1, sizeof(TokenPatternC)) pattern = <TokenPatternC*>mem.alloc(len(token_specs) + 1, sizeof(TokenPatternC))
cdef int i, index cdef int i, index
for i, (quantifier, spec, extensions, predicates) in enumerate(token_specs): for i, (quantifier, spec, extensions, predicates, token_idx) in enumerate(token_specs):
pattern[i].quantifier = quantifier pattern[i].quantifier = quantifier
# Ensure attrs refers to a null pointer if nr_attr == 0 # Ensure attrs refers to a null pointer if nr_attr == 0
if len(spec) > 0: if len(spec) > 0:
@ -628,6 +718,7 @@ cdef TokenPatternC* init_pattern(Pool mem, attr_t entity_id, object token_specs)
pattern[i].py_predicates[j] = index pattern[i].py_predicates[j] = index
pattern[i].nr_py = len(predicates) pattern[i].nr_py = len(predicates)
pattern[i].key = hash64(pattern[i].attrs, pattern[i].nr_attr * sizeof(AttrValueC), 0) pattern[i].key = hash64(pattern[i].attrs, pattern[i].nr_attr * sizeof(AttrValueC), 0)
pattern[i].token_idx = token_idx
i = len(token_specs) i = len(token_specs)
# Use quantifier to identify final ID pattern node (rather than previous # Use quantifier to identify final ID pattern node (rather than previous
# uninitialized quantifier == 0/ZERO + nr_attr == 0 + non-zero-length attrs) # uninitialized quantifier == 0/ZERO + nr_attr == 0 + non-zero-length attrs)
@ -638,6 +729,7 @@ cdef TokenPatternC* init_pattern(Pool mem, attr_t entity_id, object token_specs)
pattern[i].nr_attr = 1 pattern[i].nr_attr = 1
pattern[i].nr_extra_attr = 0 pattern[i].nr_extra_attr = 0
pattern[i].nr_py = 0 pattern[i].nr_py = 0
pattern[i].token_idx = -1
return pattern return pattern
@ -655,7 +747,7 @@ def _preprocess_pattern(token_specs, vocab, extensions_table, extra_predicates):
"""This function interprets the pattern, converting the various bits of """This function interprets the pattern, converting the various bits of
syntactic sugar before we compile it into a struct with init_pattern. syntactic sugar before we compile it into a struct with init_pattern.
We need to split the pattern up into three parts: We need to split the pattern up into four parts:
* Normal attribute/value pairs, which are stored on either the token or lexeme, * Normal attribute/value pairs, which are stored on either the token or lexeme,
can be handled directly. can be handled directly.
* Extension attributes are handled specially, as we need to prefetch the * Extension attributes are handled specially, as we need to prefetch the
@ -664,13 +756,14 @@ def _preprocess_pattern(token_specs, vocab, extensions_table, extra_predicates):
functions and store them. So we store these specially as well. functions and store them. So we store these specially as well.
* Extension attributes that have extra predicates are stored within the * Extension attributes that have extra predicates are stored within the
extra_predicates. extra_predicates.
* Token index that this pattern belongs to.
""" """
tokens = [] tokens = []
string_store = vocab.strings string_store = vocab.strings
for spec in token_specs: for token_idx, spec in enumerate(token_specs):
if not spec: if not spec:
# Signifier for 'any token' # Signifier for 'any token'
tokens.append((ONE, [(NULL_ATTR, 0)], [], [])) tokens.append((ONE, [(NULL_ATTR, 0)], [], [], token_idx))
continue continue
if not isinstance(spec, dict): if not isinstance(spec, dict):
raise ValueError(Errors.E154.format()) raise ValueError(Errors.E154.format())
@ -679,7 +772,7 @@ def _preprocess_pattern(token_specs, vocab, extensions_table, extra_predicates):
extensions = _get_extensions(spec, string_store, extensions_table) extensions = _get_extensions(spec, string_store, extensions_table)
predicates = _get_extra_predicates(spec, extra_predicates, vocab) predicates = _get_extra_predicates(spec, extra_predicates, vocab)
for op in ops: for op in ops:
tokens.append((op, list(attr_values), list(extensions), list(predicates))) tokens.append((op, list(attr_values), list(extensions), list(predicates), token_idx))
return tokens return tokens
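
The `with_alignments` post-processing in `Matcher.__call__` above turns the raw `(token_idx, length)` records into one pattern index per matched token. Below is a plain-Python sketch of that conversion, mirroring the loop in the diff. The function name `flatten_alignments` and the sample records are made up for illustration.

```python
from typing import Dict, List


def flatten_alignments(records: List[Dict[str, int]], span_len: int) -> List[int]:
    """Collapse (token_idx, length) records into one pattern index per token."""
    result = [0] * span_len
    # Sorting by (length, token_idx) means that, for records sharing a
    # length, the one with the larger token_idx is written last and wins.
    for rec in sorted(records, key=lambda r: (r["length"], r["token_idx"])):
        # Records at or beyond the span length have no slot and are skipped.
        if rec["length"] >= span_len:
            continue
        result[rec["length"]] = rec["token_idx"]
    return result


# Hypothetical records for a three-token match.
records = [
    {"token_idx": 0, "length": 0},
    {"token_idx": 1, "length": 1},
    {"token_idx": 1, "length": 2},
    {"token_idx": 2, "length": 2},
    {"token_idx": 2, "length": 3},  # skipped: length is not < span_len
]
alignments = flatten_alignments(records, span_len=3)
print(alignments)  # [0, 1, 2]
```

Position `i` of the result corresponds to the token at `start + i` in the matched span, and the value is the index of the pattern token it was aligned to.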

View File

@ -3,8 +3,10 @@ from thinc.api import Model
from thinc.types import Floats2d from thinc.types import Floats2d
from ..tokens import Doc from ..tokens import Doc
from ..util import registry
@registry.layers("spacy.CharEmbed.v1")
def CharacterEmbed(nM: int, nC: int) -> Model[List[Doc], List[Floats2d]]: def CharacterEmbed(nM: int, nC: int) -> Model[List[Doc], List[Floats2d]]:
# nM: Number of dimensions per character. nC: Number of characters. # nM: Number of dimensions per character. nC: Number of characters.
return Model( return Model(

View File

@ -31,7 +31,7 @@ def get_tok2vec_width(model: Model):
return nO return nO
@registry.architectures("spacy.HashEmbedCNN.v1") @registry.architectures("spacy.HashEmbedCNN.v2")
def build_hash_embed_cnn_tok2vec( def build_hash_embed_cnn_tok2vec(
*, *,
width: int, width: int,
@ -108,7 +108,7 @@ def build_Tok2Vec_model(
return tok2vec return tok2vec
@registry.architectures("spacy.MultiHashEmbed.v1") @registry.architectures("spacy.MultiHashEmbed.v2")
def MultiHashEmbed( def MultiHashEmbed(
width: int, width: int,
attrs: List[Union[str, int]], attrs: List[Union[str, int]],
@ -182,7 +182,7 @@ def MultiHashEmbed(
return model return model
@registry.architectures("spacy.CharacterEmbed.v1") @registry.architectures("spacy.CharacterEmbed.v2")
def CharacterEmbed( def CharacterEmbed(
width: int, width: int,
rows: int, rows: int,

View File

@ -8,7 +8,7 @@ from ..tokens import Doc
from ..errors import Errors from ..errors import Errors
@registry.layers("spacy.StaticVectors.v1") @registry.layers("spacy.StaticVectors.v2")
def StaticVectors( def StaticVectors(
nO: Optional[int] = None, nO: Optional[int] = None,
nM: Optional[int] = None, nM: Optional[int] = None,
@ -38,7 +38,7 @@ def forward(
return _handle_empty(model.ops, model.get_dim("nO")) return _handle_empty(model.ops, model.get_dim("nO"))
key_attr = model.attrs["key_attr"] key_attr = model.attrs["key_attr"]
W = cast(Floats2d, model.ops.as_contig(model.get_param("W"))) W = cast(Floats2d, model.ops.as_contig(model.get_param("W")))
V = cast(Floats2d, docs[0].vocab.vectors.data) V = cast(Floats2d, model.ops.asarray(docs[0].vocab.vectors.data))
rows = model.ops.flatten( rows = model.ops.flatten(
[doc.vocab.vectors.find(keys=doc.to_array(key_attr)) for doc in docs] [doc.vocab.vectors.find(keys=doc.to_array(key_attr)) for doc in docs]
) )
@ -46,6 +46,8 @@ def forward(
vectors_data = model.ops.gemm(model.ops.as_contig(V[rows]), W, trans2=True) vectors_data = model.ops.gemm(model.ops.as_contig(V[rows]), W, trans2=True)
except ValueError: except ValueError:
raise RuntimeError(Errors.E896) raise RuntimeError(Errors.E896)
# Convert negative indices to 0-vectors (TODO: more options for UNK tokens)
vectors_data[rows < 0] = 0
output = Ragged( output = Ragged(
vectors_data, model.ops.asarray([len(doc) for doc in docs], dtype="i") vectors_data, model.ops.asarray([len(doc) for doc in docs], dtype="i")
) )
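The hunk above zeroes every output row whose lookup index is negative. A minimal sketch of why that matters, assuming (as the added comment suggests) that a negative row index marks a key with no vector: with plain indexing, `-1` silently selects the last row of the table instead of signalling "unknown". The helper name `lookup_with_zero_unk` and the toy table are hypothetical, and plain lists stand in for the array types used in the real code.

```python
from typing import List


def lookup_with_zero_unk(table: List[List[float]], rows: List[int]) -> List[List[float]]:
    """Gather rows from `table`, returning a zero vector for negative indices."""
    width = len(table[0])
    gathered = []
    for row in rows:
        if row < 0:
            # Unknown key: emit a 0-vector instead of wrapping around to the end.
            gathered.append([0.0] * width)
        else:
            gathered.append(list(table[row]))
    return gathered


table = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
rows = [0, -1, 2]

# Naive gather: the -1 entry picks up the last row of the table.
naive = [table[r] for r in rows]
# Guarded gather: the -1 entry becomes a zero vector.
fixed = lookup_with_zero_unk(table, rows)
```

In the naive version the unknown token receives the same vector as the last entry in the table, which is the kind of silent error the zeroing step guards against.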


@ -24,7 +24,7 @@ maxout_pieces = 2
use_upper = true use_upper = true
[model.tok2vec] [model.tok2vec]
@architectures = "spacy.HashEmbedCNN.v1" @architectures = "spacy.HashEmbedCNN.v2"
pretrained_vectors = null pretrained_vectors = null
width = 96 width = 96
depth = 4 depth = 4


@ -26,7 +26,7 @@ default_model_config = """
@architectures = "spacy.EntityLinker.v1" @architectures = "spacy.EntityLinker.v1"
[model.tok2vec] [model.tok2vec]
@architectures = "spacy.HashEmbedCNN.v1" @architectures = "spacy.HashEmbedCNN.v2"
pretrained_vectors = null pretrained_vectors = null
width = 96 width = 96
depth = 2 depth = 2
@ -300,77 +300,77 @@ class EntityLinker(TrainablePipe):
for i, doc in enumerate(docs): for i, doc in enumerate(docs):
sentences = [s for s in doc.sents] sentences = [s for s in doc.sents]
if len(doc) > 0: if len(doc) > 0:
# Looping through each sentence and each entity # Looping through each entity (TODO: rewrite)
# This may go wrong if there are entities across sentences - which shouldn't happen normally. for ent in doc.ents:
for sent_index, sent in enumerate(sentences): sent = ent.sent
if sent.ents: sent_index = sentences.index(sent)
# get n_neighbour sentences, clipped to the length of the document assert sent_index >= 0
start_sentence = max(0, sent_index - self.n_sents) # get n_neighbour sentences, clipped to the length of the document
end_sentence = min( start_sentence = max(0, sent_index - self.n_sents)
len(sentences) - 1, sent_index + self.n_sents end_sentence = min(
) len(sentences) - 1, sent_index + self.n_sents
start_token = sentences[start_sentence].start )
end_token = sentences[end_sentence].end start_token = sentences[start_sentence].start
sent_doc = doc[start_token:end_token].as_doc() end_token = sentences[end_sentence].end
# currently, the context is the same for each entity in a sentence (should be refined) sent_doc = doc[start_token:end_token].as_doc()
xp = self.model.ops.xp # currently, the context is the same for each entity in a sentence (should be refined)
if self.incl_context: xp = self.model.ops.xp
sentence_encoding = self.model.predict([sent_doc])[0] if self.incl_context:
sentence_encoding_t = sentence_encoding.T sentence_encoding = self.model.predict([sent_doc])[0]
sentence_norm = xp.linalg.norm(sentence_encoding_t) sentence_encoding_t = sentence_encoding.T
for ent in sent.ents: sentence_norm = xp.linalg.norm(sentence_encoding_t)
entity_count += 1 entity_count += 1
if ent.label_ in self.labels_discard: if ent.label_ in self.labels_discard:
# ignoring this entity - setting to NIL # ignoring this entity - setting to NIL
final_kb_ids.append(self.NIL) final_kb_ids.append(self.NIL)
else: else:
candidates = self.get_candidates(self.kb, ent) candidates = self.get_candidates(self.kb, ent)
if not candidates: if not candidates:
# no prediction possible for this entity - setting to NIL # no prediction possible for this entity - setting to NIL
final_kb_ids.append(self.NIL) final_kb_ids.append(self.NIL)
elif len(candidates) == 1: elif len(candidates) == 1:
# shortcut for efficiency reasons: take the 1 candidate # shortcut for efficiency reasons: take the 1 candidate
# TODO: thresholding # TODO: thresholding
final_kb_ids.append(candidates[0].entity_) final_kb_ids.append(candidates[0].entity_)
else: else:
random.shuffle(candidates) random.shuffle(candidates)
# set all prior probabilities to 0 if incl_prior=False # set all prior probabilities to 0 if incl_prior=False
prior_probs = xp.asarray( prior_probs = xp.asarray(
[c.prior_prob for c in candidates] [c.prior_prob for c in candidates]
)
if not self.incl_prior:
prior_probs = xp.asarray(
[0.0 for _ in candidates]
)
scores = prior_probs
# add in similarity from the context
if self.incl_context:
entity_encodings = xp.asarray(
[c.entity_vector for c in candidates]
)
entity_norm = xp.linalg.norm(
entity_encodings, axis=1
)
if len(entity_encodings) != len(prior_probs):
raise RuntimeError(
Errors.E147.format(
method="predict",
msg="vectors not of equal length",
)
) )
if not self.incl_prior: # cosine similarity
prior_probs = xp.asarray( sims = xp.dot(
[0.0 for _ in candidates] entity_encodings, sentence_encoding_t
) ) / (sentence_norm * entity_norm)
scores = prior_probs if sims.shape != prior_probs.shape:
# add in similarity from the context raise ValueError(Errors.E161)
if self.incl_context: scores = (
entity_encodings = xp.asarray( prior_probs + sims - (prior_probs * sims)
[c.entity_vector for c in candidates] )
) # TODO: thresholding
entity_norm = xp.linalg.norm( best_index = scores.argmax().item()
entity_encodings, axis=1 best_candidate = candidates[best_index]
) final_kb_ids.append(best_candidate.entity_)
if len(entity_encodings) != len(prior_probs):
raise RuntimeError(
Errors.E147.format(
method="predict",
msg="vectors not of equal length",
)
)
# cosine similarity
sims = xp.dot(
entity_encodings, sentence_encoding_t
) / (sentence_norm * entity_norm)
if sims.shape != prior_probs.shape:
raise ValueError(Errors.E161)
scores = (
prior_probs + sims - (prior_probs * sims)
)
# TODO: thresholding
best_index = scores.argmax().item()
best_candidate = candidates[best_index]
final_kb_ids.append(best_candidate.entity_)
if not (len(final_kb_ids) == entity_count): if not (len(final_kb_ids) == entity_count):
err = Errors.E147.format( err = Errors.E147.format(
method="predict", msg="result variables not of equal length" method="predict", msg="result variables not of equal length"
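The scoring step in this hunk combines each candidate's prior probability with the cosine similarity between its entity vector and the sentence encoding, using `prior + sim - prior * sim`. The sketch below reproduces that arithmetic on plain Python lists as an illustration only. The function names are hypothetical, the real code works on `xp` arrays, and zero-length vectors are not handled here.

```python
import math
from typing import List, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity of two non-zero vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def combine_scores(priors: Sequence[float], sims: Sequence[float]) -> List[float]:
    """Combine priors and similarities as p + s - p * s, element by element."""
    if len(priors) != len(sims):
        raise ValueError("priors and similarities must have the same length")
    return [p + s - p * s for p, s in zip(priors, sims)]


def best_candidate(
    priors: Sequence[float],
    entity_vectors: Sequence[Sequence[float]],
    context_vector: Sequence[float],
) -> int:
    """Return the index of the highest-scoring candidate."""
    sims = [cosine(vec, context_vector) for vec in entity_vectors]
    scores = combine_scores(priors, sims)
    return max(range(len(scores)), key=scores.__getitem__)


# The second candidate has the lower prior but matches the context exactly.
winner = best_candidate([0.6, 0.3], [[1.0, 0.0], [0.0, 1.0]], [0.0, 1.0])
```

When both inputs lie in [0, 1], the combined score also stays in [0, 1] and is at least as large as either input, so a strong context match can outweigh a weak prior.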


@ -175,7 +175,7 @@ class Lemmatizer(Pipe):
DOCS: https://spacy.io/api/lemmatizer#rule_lemmatize DOCS: https://spacy.io/api/lemmatizer#rule_lemmatize
""" """
cache_key = (token.orth, token.pos, token.morph) cache_key = (token.orth, token.pos, token.morph.key)
if cache_key in self.cache: if cache_key in self.cache:
return self.cache[cache_key] return self.cache[cache_key]
string = token.text string = token.text
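The cache key now stores `token.morph.key` rather than the `token.morph` object, and a later test in this commit pickles the lemmatizer. A rough sketch of the likely motivation, assuming the analysis object itself cannot be pickled: `FakeMorph` is a hypothetical stand-in (not spaCy's class) that holds a lock purely to make it unpicklable.

```python
import pickle
import threading


class FakeMorph:
    """Hypothetical stand-in for an analysis object with unpicklable state."""

    def __init__(self, key: int) -> None:
        self.key = key  # plain integer identifier
        self._lock = threading.Lock()  # locks cannot be pickled


def can_pickle(obj) -> bool:
    """Return True if pickle.dumps succeeds on obj, False otherwise."""
    try:
        pickle.dumps(obj)
    except (TypeError, AttributeError, pickle.PicklingError):
        return False
    return True


morph = FakeMorph(key=42)

# Keyed on the object: the cache drags the unpicklable object along.
cache_with_object = {(10, 20, morph): ["cope"]}
# Keyed on the integer: the cache contains only plain values.
cache_with_key = {(10, 20, morph.key): ["cope"]}
```

Keying on the integer also makes lookups depend on the value rather than on object identity, so two equal analyses share one cache entry.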


@ -27,7 +27,7 @@ default_model_config = """
@architectures = "spacy.Tok2Vec.v2" @architectures = "spacy.Tok2Vec.v2"
[model.tok2vec.embed] [model.tok2vec.embed]
@architectures = "spacy.CharacterEmbed.v1" @architectures = "spacy.CharacterEmbed.v2"
width = 128 width = 128
rows = 7000 rows = 7000
nM = 64 nM = 64


@ -22,7 +22,7 @@ maxout_pieces = 3
token_vector_width = 96 token_vector_width = 96
[model.tok2vec] [model.tok2vec]
@architectures = "spacy.HashEmbedCNN.v1" @architectures = "spacy.HashEmbedCNN.v2"
pretrained_vectors = null pretrained_vectors = null
width = 96 width = 96
depth = 4 depth = 4


@ -21,7 +21,7 @@ maxout_pieces = 2
use_upper = true use_upper = true
[model.tok2vec] [model.tok2vec]
@architectures = "spacy.HashEmbedCNN.v1" @architectures = "spacy.HashEmbedCNN.v2"
pretrained_vectors = null pretrained_vectors = null
width = 96 width = 96
depth = 4 depth = 4


@ -19,7 +19,7 @@ default_model_config = """
@architectures = "spacy.Tagger.v1" @architectures = "spacy.Tagger.v1"
[model.tok2vec] [model.tok2vec]
@architectures = "spacy.HashEmbedCNN.v1" @architectures = "spacy.HashEmbedCNN.v2"
pretrained_vectors = null pretrained_vectors = null
width = 12 width = 12
depth = 1 depth = 1


@ -26,7 +26,7 @@ default_model_config = """
@architectures = "spacy.Tagger.v1" @architectures = "spacy.Tagger.v1"
[model.tok2vec] [model.tok2vec]
@architectures = "spacy.HashEmbedCNN.v1" @architectures = "spacy.HashEmbedCNN.v2"
pretrained_vectors = null pretrained_vectors = null
width = 96 width = 96
depth = 4 depth = 4


@ -21,7 +21,7 @@ single_label_default_config = """
@architectures = "spacy.Tok2Vec.v2" @architectures = "spacy.Tok2Vec.v2"
[model.tok2vec.embed] [model.tok2vec.embed]
@architectures = "spacy.MultiHashEmbed.v1" @architectures = "spacy.MultiHashEmbed.v2"
width = 64 width = 64
rows = [2000, 2000, 1000, 1000, 1000, 1000] rows = [2000, 2000, 1000, 1000, 1000, 1000]
attrs = ["ORTH", "LOWER", "PREFIX", "SUFFIX", "SHAPE", "ID"] attrs = ["ORTH", "LOWER", "PREFIX", "SUFFIX", "SHAPE", "ID"]
@ -56,7 +56,7 @@ single_label_cnn_config = """
exclusive_classes = true exclusive_classes = true
[model.tok2vec] [model.tok2vec]
@architectures = "spacy.HashEmbedCNN.v1" @architectures = "spacy.HashEmbedCNN.v2"
pretrained_vectors = null pretrained_vectors = null
width = 96 width = 96
depth = 4 depth = 4


@ -21,7 +21,7 @@ multi_label_default_config = """
@architectures = "spacy.Tok2Vec.v1" @architectures = "spacy.Tok2Vec.v1"
[model.tok2vec.embed] [model.tok2vec.embed]
@architectures = "spacy.MultiHashEmbed.v1" @architectures = "spacy.MultiHashEmbed.v2"
width = 64 width = 64
rows = [2000, 2000, 1000, 1000, 1000, 1000] rows = [2000, 2000, 1000, 1000, 1000, 1000]
attrs = ["ORTH", "LOWER", "PREFIX", "SUFFIX", "SHAPE", "ID"] attrs = ["ORTH", "LOWER", "PREFIX", "SUFFIX", "SHAPE", "ID"]
@ -56,7 +56,7 @@ multi_label_cnn_config = """
exclusive_classes = false exclusive_classes = false
[model.tok2vec] [model.tok2vec]
@architectures = "spacy.HashEmbedCNN.v1" @architectures = "spacy.HashEmbedCNN.v2"
pretrained_vectors = null pretrained_vectors = null
width = 96 width = 96
depth = 4 depth = 4


@ -11,7 +11,7 @@ from ..errors import Errors
default_model_config = """ default_model_config = """
[model] [model]
@architectures = "spacy.HashEmbedCNN.v1" @architectures = "spacy.HashEmbedCNN.v2"
pretrained_vectors = null pretrained_vectors = null
width = 96 width = 96
depth = 4 depth = 4


@ -20,10 +20,16 @@ MISSING_VALUES = frozenset([None, 0, ""])
class PRFScore: class PRFScore:
"""A precision / recall / F score.""" """A precision / recall / F score."""
def __init__(self) -> None: def __init__(
self.tp = 0 self,
self.fp = 0 *,
self.fn = 0 tp: int = 0,
fp: int = 0,
fn: int = 0,
) -> None:
self.tp = tp
self.fp = fp
self.fn = fn
def __len__(self) -> int: def __len__(self) -> int:
return self.tp + self.fp + self.fn return self.tp + self.fp + self.fn
@ -305,6 +311,8 @@ class Scorer:
*, *,
getter: Callable[[Doc, str], Iterable[Span]] = getattr, getter: Callable[[Doc, str], Iterable[Span]] = getattr,
has_annotation: Optional[Callable[[Doc], bool]] = None, has_annotation: Optional[Callable[[Doc], bool]] = None,
labeled: bool = True,
allow_overlap: bool = False,
**cfg, **cfg,
) -> Dict[str, Any]: ) -> Dict[str, Any]:
"""Returns PRF scores for labeled spans. """Returns PRF scores for labeled spans.
@ -317,6 +325,11 @@ class Scorer:
has_annotation (Optional[Callable[[Doc], bool]]) should return whether a `Doc` has_annotation (Optional[Callable[[Doc], bool]]) should return whether a `Doc`
has annotation for this `attr`. Docs without annotation are skipped for has annotation for this `attr`. Docs without annotation are skipped for
scoring purposes. scoring purposes.
labeled (bool): Whether or not to include label information in
the evaluation. If set to 'False', two spans will be considered
equal if their start and end match, irrespective of their label.
allow_overlap (bool): Whether or not to allow overlapping spans.
If set to 'False', the alignment will automatically resolve conflicts.
RETURNS (Dict[str, Any]): A dictionary containing the PRF scores under RETURNS (Dict[str, Any]): A dictionary containing the PRF scores under
the keys attr_p/r/f and the per-type PRF scores under attr_per_type. the keys attr_p/r/f and the per-type PRF scores under attr_per_type.
@ -345,33 +358,42 @@ class Scorer:
gold_spans = set() gold_spans = set()
pred_spans = set() pred_spans = set()
for span in getter(gold_doc, attr): for span in getter(gold_doc, attr):
gold_span = (span.label_, span.start, span.end - 1) if labeled:
gold_span = (span.label_, span.start, span.end - 1)
else:
gold_span = (span.start, span.end - 1)
gold_spans.add(gold_span) gold_spans.add(gold_span)
gold_per_type[span.label_].add((span.label_, span.start, span.end - 1)) gold_per_type[span.label_].add(gold_span)
pred_per_type = {label: set() for label in labels} pred_per_type = {label: set() for label in labels}
for span in example.get_aligned_spans_x2y(getter(pred_doc, attr)): for span in example.get_aligned_spans_x2y(getter(pred_doc, attr), allow_overlap):
pred_spans.add((span.label_, span.start, span.end - 1)) if labeled:
pred_per_type[span.label_].add((span.label_, span.start, span.end - 1)) pred_span = (span.label_, span.start, span.end - 1)
else:
pred_span = (span.start, span.end - 1)
pred_spans.add(pred_span)
pred_per_type[span.label_].add(pred_span)
# Scores per label # Scores per label
for k, v in score_per_type.items(): if labeled:
if k in pred_per_type: for k, v in score_per_type.items():
v.score_set(pred_per_type[k], gold_per_type[k]) if k in pred_per_type:
v.score_set(pred_per_type[k], gold_per_type[k])
# Score for all labels # Score for all labels
score.score_set(pred_spans, gold_spans) score.score_set(pred_spans, gold_spans)
if len(score) > 0: # Assemble final result
return { final_scores = {
f"{attr}_p": score.precision,
f"{attr}_r": score.recall,
f"{attr}_f": score.fscore,
f"{attr}_per_type": {k: v.to_dict() for k, v in score_per_type.items()},
}
else:
return {
f"{attr}_p": None, f"{attr}_p": None,
f"{attr}_r": None, f"{attr}_r": None,
f"{attr}_f": None, f"{attr}_f": None,
f"{attr}_per_type": None,
} }
if labeled:
final_scores[f"{attr}_per_type"] = None
if len(score) > 0:
final_scores[f"{attr}_p"] = score.precision
final_scores[f"{attr}_r"] = score.recall
final_scores[f"{attr}_f"] = score.fscore
if labeled:
final_scores[f"{attr}_per_type"] = {k: v.to_dict() for k, v in score_per_type.items()}
return final_scores
@staticmethod @staticmethod
def score_cats( def score_cats(
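The `labeled` flag added to span scoring changes what counts as a match: with `labeled=True` a span is identified by label, start and end, and with `labeled=False` only by start and end. The following sketch shows that difference with a simplified counter. `SimplePRF`, `span_keys` and `score_spans` are hypothetical names, spans are plain `(label, start, end)` tuples rather than `Span` objects, and alignment and per-type scores are left out.

```python
from typing import Iterable, Set, Tuple

Span = Tuple[str, int, int]  # (label, start, end) with an exclusive end


class SimplePRF:
    """Simplified precision / recall / F counter with keyword-only counts."""

    def __init__(self, *, tp: int = 0, fp: int = 0, fn: int = 0) -> None:
        self.tp = tp
        self.fp = fp
        self.fn = fn

    def score_set(self, cand: Set, gold: Set) -> None:
        self.tp += len(cand & gold)
        self.fp += len(cand - gold)
        self.fn += len(gold - cand)

    @property
    def precision(self) -> float:
        total = self.tp + self.fp
        return self.tp / total if total else 0.0

    @property
    def recall(self) -> float:
        total = self.tp + self.fn
        return self.tp / total if total else 0.0

    @property
    def fscore(self) -> float:
        p, r = self.precision, self.recall
        return 2 * p * r / (p + r) if (p + r) else 0.0


def span_keys(spans: Iterable[Span], labeled: bool) -> Set[Tuple]:
    """Build comparison keys, dropping the label when labeled is False."""
    if labeled:
        return {(label, start, end - 1) for label, start, end in spans}
    return {(start, end - 1) for _, start, end in spans}


def score_spans(gold: Iterable[Span], pred: Iterable[Span], *, labeled: bool = True) -> SimplePRF:
    score = SimplePRF()
    score.score_set(span_keys(pred, labeled), span_keys(gold, labeled))
    return score


gold = [("PER", 0, 2), ("ORG", 3, 5)]
pred = [("PER", 0, 2), ("LOC", 3, 5)]  # second span has the wrong label

labeled_score = score_spans(gold, pred, labeled=True)
unlabeled_score = score_spans(gold, pred, labeled=False)
```

The mislabeled span counts as one false positive and one false negative in the labeled setting, and as a true positive in the unlabeled one.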


@ -223,7 +223,7 @@ cdef class StringStore:
it doesn't exist. Paths may be either strings or Path-like objects. it doesn't exist. Paths may be either strings or Path-like objects.
""" """
path = util.ensure_path(path) path = util.ensure_path(path)
strings = list(self) strings = sorted(self)
srsly.write_json(path, strings) srsly.write_json(path, strings)
def from_disk(self, path): def from_disk(self, path):
@ -247,7 +247,7 @@ cdef class StringStore:
RETURNS (bytes): The serialized form of the `StringStore` object. RETURNS (bytes): The serialized form of the `StringStore` object.
""" """
return srsly.json_dumps(list(self)) return srsly.json_dumps(sorted(self))
def from_bytes(self, bytes_data, **kwargs): def from_bytes(self, bytes_data, **kwargs):
"""Load state from a binary string. """Load state from a binary string.
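Switching from `list(self)` to `sorted(self)` makes the serialized output independent of insertion order. A small sketch of the effect, using the standard-library `json` module as a stand-in for `srsly` and a hypothetical `dump_strings` helper:

```python
import json
from typing import Iterable


def dump_strings(strings: Iterable[str]) -> str:
    """Serialize strings in sorted order so the output is deterministic."""
    return json.dumps(sorted(strings))


# The same strings inserted in different orders serialize identically.
first = dump_strings(["b", "a", "c"])
second = dump_strings(["c", "b", "a"])
```

Deterministic output of this kind keeps serialized files stable across runs, which helps when comparing or hashing them.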


@ -6,12 +6,14 @@ import logging
import mock import mock
from spacy.lang.xx import MultiLanguage from spacy.lang.xx import MultiLanguage
from spacy.tokens import Doc, Span from spacy.tokens import Doc, Span, Token
from spacy.vocab import Vocab from spacy.vocab import Vocab
from spacy.lexeme import Lexeme from spacy.lexeme import Lexeme
from spacy.lang.en import English from spacy.lang.en import English
from spacy.attrs import ENT_TYPE, ENT_IOB, SENT_START, HEAD, DEP, MORPH from spacy.attrs import ENT_TYPE, ENT_IOB, SENT_START, HEAD, DEP, MORPH
from .test_underscore import clean_underscore # noqa: F401
def test_doc_api_init(en_vocab): def test_doc_api_init(en_vocab):
words = ["a", "b", "c", "d"] words = ["a", "b", "c", "d"]
@ -347,15 +349,19 @@ def test_doc_from_array_morph(en_vocab):
assert [str(t.morph) for t in doc] == [str(t.morph) for t in new_doc] assert [str(t.morph) for t in doc] == [str(t.morph) for t in new_doc]
@pytest.mark.usefixtures("clean_underscore")
def test_doc_api_from_docs(en_tokenizer, de_tokenizer): def test_doc_api_from_docs(en_tokenizer, de_tokenizer):
en_texts = ["Merging the docs is fun.", "", "They don't think alike."] en_texts = ["Merging the docs is fun.", "", "They don't think alike."]
en_texts_without_empty = [t for t in en_texts if len(t)] en_texts_without_empty = [t for t in en_texts if len(t)]
de_text = "Wie war die Frage?" de_text = "Wie war die Frage?"
en_docs = [en_tokenizer(text) for text in en_texts] en_docs = [en_tokenizer(text) for text in en_texts]
docs_idx = en_texts[0].index("docs") en_docs[0].spans["group"] = [en_docs[0][1:4]]
en_docs[2].spans["group"] = [en_docs[2][1:4]]
span_group_texts = sorted([en_docs[0][1:4].text, en_docs[2][1:4].text])
de_doc = de_tokenizer(de_text) de_doc = de_tokenizer(de_text)
expected = (True, None, None, None) Token.set_extension("is_ambiguous", default=False)
en_docs[0].user_data[("._.", "is_ambiguous", docs_idx, None)] = expected en_docs[0][2]._.is_ambiguous = True # docs
en_docs[2][3]._.is_ambiguous = True # think
assert Doc.from_docs([]) is None assert Doc.from_docs([]) is None
assert de_doc is not Doc.from_docs([de_doc]) assert de_doc is not Doc.from_docs([de_doc])
assert str(de_doc) == str(Doc.from_docs([de_doc])) assert str(de_doc) == str(Doc.from_docs([de_doc]))
@ -372,11 +378,12 @@ def test_doc_api_from_docs(en_tokenizer, de_tokenizer):
en_docs_tokens = [t for doc in en_docs for t in doc] en_docs_tokens = [t for doc in en_docs for t in doc]
assert len(m_doc) == len(en_docs_tokens) assert len(m_doc) == len(en_docs_tokens)
think_idx = len(en_texts[0]) + 1 + en_texts[2].index("think") think_idx = len(en_texts[0]) + 1 + en_texts[2].index("think")
assert m_doc[2]._.is_ambiguous == True
assert m_doc[9].idx == think_idx assert m_doc[9].idx == think_idx
with pytest.raises(AttributeError): assert m_doc[9]._.is_ambiguous == True
# not callable, because it was not set via set_extension assert not any([t._.is_ambiguous for t in m_doc[3:8]])
m_doc[2]._.is_ambiguous assert "group" in m_doc.spans
assert len(m_doc.user_data) == len(en_docs[0].user_data) # but it's there assert span_group_texts == sorted([s.text for s in m_doc.spans["group"]])
m_doc = Doc.from_docs(en_docs, ensure_whitespace=False) m_doc = Doc.from_docs(en_docs, ensure_whitespace=False)
assert len(en_texts_without_empty) == len(list(m_doc.sents)) assert len(en_texts_without_empty) == len(list(m_doc.sents))
@ -388,6 +395,8 @@ def test_doc_api_from_docs(en_tokenizer, de_tokenizer):
assert len(m_doc) == len(en_docs_tokens) assert len(m_doc) == len(en_docs_tokens)
think_idx = len(en_texts[0]) + 0 + en_texts[2].index("think") think_idx = len(en_texts[0]) + 0 + en_texts[2].index("think")
assert m_doc[9].idx == think_idx assert m_doc[9].idx == think_idx
assert "group" in m_doc.spans
assert span_group_texts == sorted([s.text for s in m_doc.spans["group"]])
m_doc = Doc.from_docs(en_docs, attrs=["lemma", "length", "pos"]) m_doc = Doc.from_docs(en_docs, attrs=["lemma", "length", "pos"])
assert len(str(m_doc)) > len(en_texts[0]) + len(en_texts[1]) assert len(str(m_doc)) > len(en_texts[0]) + len(en_texts[1])
@ -399,6 +408,8 @@ def test_doc_api_from_docs(en_tokenizer, de_tokenizer):
assert len(m_doc) == len(en_docs_tokens) assert len(m_doc) == len(en_docs_tokens)
think_idx = len(en_texts[0]) + 1 + en_texts[2].index("think") think_idx = len(en_texts[0]) + 1 + en_texts[2].index("think")
assert m_doc[9].idx == think_idx assert m_doc[9].idx == think_idx
assert "group" in m_doc.spans
assert span_group_texts == sorted([s.text for s in m_doc.spans["group"]])
def test_doc_api_from_docs_ents(en_tokenizer): def test_doc_api_from_docs_ents(en_tokenizer):


@ -452,3 +452,30 @@ def test_retokenize_disallow_zero_length(en_vocab):
with pytest.raises(ValueError): with pytest.raises(ValueError):
with doc.retokenize() as retokenizer: with doc.retokenize() as retokenizer:
retokenizer.merge(doc[1:1]) retokenizer.merge(doc[1:1])
def test_doc_retokenize_merge_without_parse_keeps_sents(en_tokenizer):
text = "displaCy is a parse tool built with Javascript"
sent_starts = [1, 0, 0, 0, 1, 0, 0, 0]
tokens = en_tokenizer(text)
# merging within a sentence keeps all sentence boundaries
doc = Doc(tokens.vocab, words=[t.text for t in tokens], sent_starts=sent_starts)
assert len(list(doc.sents)) == 2
with doc.retokenize() as retokenizer:
retokenizer.merge(doc[1:3])
assert len(list(doc.sents)) == 2
# merging over a sentence boundary unsets it by default
doc = Doc(tokens.vocab, words=[t.text for t in tokens], sent_starts=sent_starts)
assert len(list(doc.sents)) == 2
with doc.retokenize() as retokenizer:
retokenizer.merge(doc[3:6])
assert doc[3].is_sent_start is None
# merging over a sentence boundary and setting sent_start
doc = Doc(tokens.vocab, words=[t.text for t in tokens], sent_starts=sent_starts)
assert len(list(doc.sents)) == 2
with doc.retokenize() as retokenizer:
retokenizer.merge(doc[3:6], attrs={"sent_start": True})
assert len(list(doc.sents)) == 2


@ -1,9 +1,11 @@
import pytest import pytest
from spacy.attrs import ORTH, LENGTH from spacy.attrs import ORTH, LENGTH
from spacy.tokens import Doc, Span from spacy.tokens import Doc, Span, Token
from spacy.vocab import Vocab from spacy.vocab import Vocab
from spacy.util import filter_spans from spacy.util import filter_spans
from .test_underscore import clean_underscore # noqa: F401
@pytest.fixture @pytest.fixture
def doc(en_tokenizer): def doc(en_tokenizer):
@ -219,11 +221,14 @@ def test_span_as_doc(doc):
assert span_doc[0].idx == 0 assert span_doc[0].idx == 0
@pytest.mark.usefixtures("clean_underscore")
def test_span_as_doc_user_data(doc): def test_span_as_doc_user_data(doc):
"""Test that the user_data can be preserved (but not by default). """ """Test that the user_data can be preserved (but not by default). """
my_key = "my_info" my_key = "my_info"
my_value = 342 my_value = 342
doc.user_data[my_key] = my_value doc.user_data[my_key] = my_value
Token.set_extension("is_x", default=False)
doc[7]._.is_x = True
span = doc[4:10] span = doc[4:10]
span_doc_with = span.as_doc(copy_user_data=True) span_doc_with = span.as_doc(copy_user_data=True)
@ -232,6 +237,12 @@ def test_span_as_doc_user_data(doc):
assert doc.user_data.get(my_key, None) is my_value assert doc.user_data.get(my_key, None) is my_value
assert span_doc_with.user_data.get(my_key, None) is my_value assert span_doc_with.user_data.get(my_key, None) is my_value
assert span_doc_without.user_data.get(my_key, None) is None assert span_doc_without.user_data.get(my_key, None) is None
for i in range(len(span_doc_with)):
if i != 3:
assert span_doc_with[i]._.is_x is False
else:
assert span_doc_with[i]._.is_x is True
assert not any([t._.is_x for t in span_doc_without])
def test_span_string_label_kb_id(doc): def test_span_string_label_kb_id(doc):


@ -0,0 +1,3 @@
from spacy import require_gpu
require_gpu()


@ -4,7 +4,9 @@ import re
import copy import copy
from mock import Mock from mock import Mock
from spacy.matcher import DependencyMatcher from spacy.matcher import DependencyMatcher
from spacy.tokens import Doc from spacy.tokens import Doc, Token
from ..doc.test_underscore import clean_underscore # noqa: F401
@pytest.fixture @pytest.fixture
@ -344,3 +346,26 @@ def test_dependency_matcher_long_matches(en_vocab, doc):
matcher = DependencyMatcher(en_vocab) matcher = DependencyMatcher(en_vocab)
with pytest.raises(ValueError): with pytest.raises(ValueError):
matcher.add("pattern", [pattern]) matcher.add("pattern", [pattern])
@pytest.mark.usefixtures("clean_underscore")
def test_dependency_matcher_span_user_data(en_tokenizer):
doc = en_tokenizer("a b c d e")
for token in doc:
token.head = doc[0]
token.dep_ = "a"
get_is_c = lambda token: token.text in ("c",)
Token.set_extension("is_c", default=False)
doc[2]._.is_c = True
pattern = [
{"RIGHT_ID": "c", "RIGHT_ATTRS": {"_": {"is_c": True}}},
]
matcher = DependencyMatcher(en_tokenizer.vocab)
matcher.add("C", [pattern])
doc_matches = matcher(doc)
offset = 1
span_matches = matcher(doc[offset:])
for doc_match, span_match in zip(sorted(doc_matches), sorted(span_matches)):
assert doc_match[0] == span_match[0]
for doc_t_i, span_t_i in zip(doc_match[1], span_match[1]):
assert doc_t_i == span_t_i + offset


@ -204,3 +204,90 @@ def test_matcher_remove():
# removing again should throw an error # removing again should throw an error
with pytest.raises(ValueError): with pytest.raises(ValueError):
matcher.remove("Rule") matcher.remove("Rule")
def test_matcher_with_alignments_greedy_longest(en_vocab):
cases = [
("aaab", "a* b", [0, 0, 0, 1]),
("baab", "b a* b", [0, 1, 1, 2]),
("aaab", "a a a b", [0, 1, 2, 3]),
("aaab", "a+ b", [0, 0, 0, 1]),
("aaba", "a+ b a+", [0, 0, 1, 2]),
("aabaa", "a+ b a+", [0, 0, 1, 2, 2]),
("aaba", "a+ b a*", [0, 0, 1, 2]),
("aaaa", "a*", [0, 0, 0, 0]),
("baab", "b a* b b*", [0, 1, 1, 2]),
("aabb", "a* b* a*", [0, 0, 1, 1]),
("aaab", "a+ a+ a b", [0, 1, 2, 3]),
("aaab", "a+ a+ a+ b", [0, 1, 2, 3]),
("aaab", "a+ a a b", [0, 1, 2, 3]),
("aaab", "a+ a a", [0, 1, 2]),
("aaab", "a+ a a?", [0, 1, 2]),
("aaaa", "a a a a a?", [0, 1, 2, 3]),
("aaab", "a+ a b", [0, 0, 1, 2]),
("aaab", "a+ a+ b", [0, 0, 1, 2]),
]
for string, pattern_str, result in cases:
matcher = Matcher(en_vocab)
doc = Doc(matcher.vocab, words=list(string))
pattern = []
for part in pattern_str.split():
if part.endswith("+"):
pattern.append({"ORTH": part[0], "OP": "+"})
elif part.endswith("*"):
pattern.append({"ORTH": part[0], "OP": "*"})
elif part.endswith("?"):
pattern.append({"ORTH": part[0], "OP": "?"})
else:
pattern.append({"ORTH": part})
matcher.add("PATTERN", [pattern], greedy="LONGEST")
matches = matcher(doc, with_alignments=True)
n_matches = len(matches)
_, s, e, expected = matches[0]
assert expected == result, (string, pattern_str, s, e, n_matches)
def test_matcher_with_alignments_nongreedy(en_vocab):
cases = [
(0, "aaab", "a* b", [[0, 1], [0, 0, 1], [0, 0, 0, 1], [1]]),
(1, "baab", "b a* b", [[0, 1, 1, 2]]),
(2, "aaab", "a a a b", [[0, 1, 2, 3]]),
(3, "aaab", "a+ b", [[0, 1], [0, 0, 1], [0, 0, 0, 1]]),
(4, "aaba", "a+ b a+", [[0, 1, 2], [0, 0, 1, 2]]),
(5, "aabaa", "a+ b a+", [[0, 1, 2], [0, 0, 1, 2], [0, 0, 1, 2, 2], [0, 1, 2, 2] ]),
(6, "aaba", "a+ b a*", [[0, 1], [0, 0, 1], [0, 0, 1, 2], [0, 1, 2]]),
(7, "aaaa", "a*", [[0], [0, 0], [0, 0, 0], [0, 0, 0, 0]]),
(8, "baab", "b a* b b*", [[0, 1, 1, 2]]),
(9, "aabb", "a* b* a*", [[1], [2], [2, 2], [0, 1], [0, 0, 1], [0, 0, 1, 1], [0, 1, 1], [1, 1]]),
(10, "aaab", "a+ a+ a b", [[0, 1, 2, 3]]),
(11, "aaab", "a+ a+ a+ b", [[0, 1, 2, 3]]),
(12, "aaab", "a+ a a b", [[0, 1, 2, 3]]),
(13, "aaab", "a+ a a", [[0, 1, 2]]),
(14, "aaab", "a+ a a?", [[0, 1], [0, 1, 2]]),
(15, "aaaa", "a a a a a?", [[0, 1, 2, 3]]),
(16, "aaab", "a+ a b", [[0, 1, 2], [0, 0, 1, 2]]),
(17, "aaab", "a+ a+ b", [[0, 1, 2], [0, 0, 1, 2]]),
]
for case_id, string, pattern_str, results in cases:
matcher = Matcher(en_vocab)
doc = Doc(matcher.vocab, words=list(string))
pattern = []
for part in pattern_str.split():
if part.endswith("+"):
pattern.append({"ORTH": part[0], "OP": "+"})
elif part.endswith("*"):
pattern.append({"ORTH": part[0], "OP": "*"})
elif part.endswith("?"):
pattern.append({"ORTH": part[0], "OP": "?"})
else:
pattern.append({"ORTH": part})
matcher.add("PATTERN", [pattern])
matches = matcher(doc, with_alignments=True)
n_matches = len(matches)
for _, s, e, expected in matches:
assert expected in results, (case_id, string, pattern_str, s, e, n_matches)
assert len(expected) == e - s


@ -5,6 +5,7 @@ from spacy.tokens import Span
from spacy.language import Language from spacy.language import Language
from spacy.pipeline import EntityRuler from spacy.pipeline import EntityRuler
from spacy.errors import MatchPatternError from spacy.errors import MatchPatternError
from thinc.api import NumpyOps, get_current_ops
@pytest.fixture @pytest.fixture
@ -201,13 +202,14 @@ def test_entity_ruler_overlapping_spans(nlp):
@pytest.mark.parametrize("n_process", [1, 2]) @pytest.mark.parametrize("n_process", [1, 2])
def test_entity_ruler_multiprocessing(nlp, n_process): def test_entity_ruler_multiprocessing(nlp, n_process):
texts = ["I enjoy eating Pizza Hut pizza."] if isinstance(get_current_ops(), NumpyOps) or n_process < 2:
texts = ["I enjoy eating Pizza Hut pizza."]
patterns = [{"label": "FASTFOOD", "pattern": "Pizza Hut", "id": "1234"}] patterns = [{"label": "FASTFOOD", "pattern": "Pizza Hut", "id": "1234"}]
ruler = nlp.add_pipe("entity_ruler") ruler = nlp.add_pipe("entity_ruler")
ruler.add_patterns(patterns) ruler.add_patterns(patterns)
for doc in nlp.pipe(texts, n_process=2): for doc in nlp.pipe(texts, n_process=n_process):
for ent in doc.ents: for ent in doc.ents:
assert ent.ent_id_ == "1234" assert ent.ent_id_ == "1234"


@ -1,6 +1,7 @@
import pytest import pytest
import logging import logging
import mock import mock
import pickle
from spacy import util, registry from spacy import util, registry
from spacy.lang.en import English from spacy.lang.en import English
from spacy.lookups import Lookups from spacy.lookups import Lookups
@ -106,6 +107,9 @@ def test_lemmatizer_serialize(nlp):
doc2 = nlp2.make_doc("coping") doc2 = nlp2.make_doc("coping")
doc2[0].pos_ = "VERB" doc2[0].pos_ = "VERB"
assert doc2[0].lemma_ == "" assert doc2[0].lemma_ == ""
doc2 = lemmatizer(doc2) doc2 = lemmatizer2(doc2)
assert doc2[0].text == "coping" assert doc2[0].text == "coping"
assert doc2[0].lemma_ == "cope" assert doc2[0].lemma_ == "cope"
# Make sure that lemmatizer cache can be pickled
b = pickle.dumps(lemmatizer2)


@ -4,7 +4,7 @@ import numpy
import pytest import pytest
from numpy.testing import assert_almost_equal from numpy.testing import assert_almost_equal
from spacy.vocab import Vocab from spacy.vocab import Vocab
from thinc.api import NumpyOps, Model, data_validation from thinc.api import Model, data_validation, get_current_ops
from thinc.types import Array2d, Ragged from thinc.types import Array2d, Ragged
from spacy.lang.en import English from spacy.lang.en import English
@ -13,7 +13,7 @@ from spacy.ml._character_embed import CharacterEmbed
from spacy.tokens import Doc from spacy.tokens import Doc
OPS = NumpyOps() OPS = get_current_ops()
texts = ["These are 4 words", "Here just three"] texts = ["These are 4 words", "Here just three"]
l0 = [[1, 2], [3, 4], [5, 6], [7, 8]] l0 = [[1, 2], [3, 4], [5, 6], [7, 8]]
@ -82,7 +82,7 @@ def util_batch_unbatch_docs_list(
Y_batched = model.predict(in_data) Y_batched = model.predict(in_data)
Y_not_batched = [model.predict([u])[0] for u in in_data] Y_not_batched = [model.predict([u])[0] for u in in_data]
for i in range(len(Y_batched)): for i in range(len(Y_batched)):
assert_almost_equal(Y_batched[i], Y_not_batched[i], decimal=4) assert_almost_equal(OPS.to_numpy(Y_batched[i]), OPS.to_numpy(Y_not_batched[i]), decimal=4)
def util_batch_unbatch_docs_array( def util_batch_unbatch_docs_array(
@ -91,7 +91,7 @@ def util_batch_unbatch_docs_array(
with data_validation(True): with data_validation(True):
model.initialize(in_data, out_data) model.initialize(in_data, out_data)
Y_batched = model.predict(in_data).tolist() Y_batched = model.predict(in_data).tolist()
Y_not_batched = [model.predict([u])[0] for u in in_data] Y_not_batched = [model.predict([u])[0].tolist() for u in in_data]
assert_almost_equal(Y_batched, Y_not_batched, decimal=4) assert_almost_equal(Y_batched, Y_not_batched, decimal=4)
@ -100,8 +100,8 @@ def util_batch_unbatch_docs_ragged(
): ):
with data_validation(True): with data_validation(True):
model.initialize(in_data, out_data) model.initialize(in_data, out_data)
Y_batched = model.predict(in_data) Y_batched = model.predict(in_data).data.tolist()
Y_not_batched = [] Y_not_batched = []
for u in in_data: for u in in_data:
Y_not_batched.extend(model.predict([u]).data.tolist()) Y_not_batched.extend(model.predict([u]).data.tolist())
assert_almost_equal(Y_batched.data, Y_not_batched, decimal=4) assert_almost_equal(Y_batched, Y_not_batched, decimal=4)


@ -1,4 +1,6 @@
import pytest import pytest
import mock
import logging
from spacy.language import Language from spacy.language import Language
from spacy.lang.en import English from spacy.lang.en import English
from spacy.lang.de import German from spacy.lang.de import German
@ -402,6 +404,38 @@ def test_pipe_factories_from_source():
nlp.add_pipe("custom", source=source_nlp) nlp.add_pipe("custom", source=source_nlp)
def test_pipe_factories_from_source_language_subclass():
class CustomEnglishDefaults(English.Defaults):
stop_words = set(["custom", "stop"])
@registry.languages("custom_en")
class CustomEnglish(English):
lang = "custom_en"
Defaults = CustomEnglishDefaults
source_nlp = English()
source_nlp.add_pipe("tagger")
# custom subclass
nlp = CustomEnglish()
nlp.add_pipe("tagger", source=source_nlp)
assert "tagger" in nlp.pipe_names
# non-subclass
nlp = German()
nlp.add_pipe("tagger", source=source_nlp)
assert "tagger" in nlp.pipe_names
# mismatched vectors
nlp = English()
nlp.vocab.vectors.resize((1, 4))
nlp.vocab.vectors.add("cat", vector=[1, 2, 3, 4])
logger = logging.getLogger("spacy")
with mock.patch.object(logger, "warning") as mock_warning:
nlp.add_pipe("tagger", source=source_nlp)
mock_warning.assert_called()
def test_pipe_factories_from_source_custom(): def test_pipe_factories_from_source_custom():
"""Test adding components from a source model with custom components.""" """Test adding components from a source model with custom components."""
name = "test_pipe_factories_from_source_custom" name = "test_pipe_factories_from_source_custom"
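Review note on the mismatched-vectors check above: the test patches `warning` on the `spacy` logger and asserts that it was called. A minimal standalone sketch of that pattern follows. It assumes the standard-library `unittest.mock` rather than the `mock` package imported in the diff, and the logger name `demo` and the helper `add_component` are hypothetical stand-ins, not spaCy APIs.

```python
import logging
from unittest import mock

logger = logging.getLogger("demo")  # hypothetical logger name


def add_component(vectors_match: bool) -> None:
    """Hypothetical stand-in for a call that may log a warning."""
    if not vectors_match:
        logger.warning("vectors differ between source and target")


def warning_was_logged(vectors_match: bool) -> bool:
    # Replace logger.warning with a mock for the duration of the block;
    # the original method is restored when the block exits.
    with mock.patch.object(logger, "warning") as mock_warning:
        add_component(vectors_match)
    return mock_warning.called


print(warning_was_logged(False))
print(warning_was_logged(True))
```

Patching the method on the logger object keeps the check independent of handler and level configuration, which is presumably why the test avoids capturing log output directly.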


@ -1,7 +1,7 @@
import pytest import pytest
import random import random
import numpy.random import numpy.random
from numpy.testing import assert_equal from numpy.testing import assert_almost_equal
from thinc.api import fix_random_seed from thinc.api import fix_random_seed
from spacy import util from spacy import util
from spacy.lang.en import English from spacy.lang.en import English
@ -222,8 +222,12 @@ def test_overfitting_IO():
batch_cats_1 = [doc.cats for doc in nlp.pipe(texts)] batch_cats_1 = [doc.cats for doc in nlp.pipe(texts)]
batch_cats_2 = [doc.cats for doc in nlp.pipe(texts)] batch_cats_2 = [doc.cats for doc in nlp.pipe(texts)]
no_batch_cats = [doc.cats for doc in [nlp(text) for text in texts]] no_batch_cats = [doc.cats for doc in [nlp(text) for text in texts]]
assert_equal(batch_cats_1, batch_cats_2) for cats_1, cats_2 in zip(batch_cats_1, batch_cats_2):
assert_equal(batch_cats_1, no_batch_cats) for cat in cats_1:
assert_almost_equal(cats_1[cat], cats_2[cat], decimal=5)
for cats_1, cats_2 in zip(batch_cats_1, no_batch_cats):
for cat in cats_1:
assert_almost_equal(cats_1[cat], cats_2[cat], decimal=5)
def test_overfitting_IO_multi(): def test_overfitting_IO_multi():
@ -270,8 +274,12 @@ def test_overfitting_IO_multi():
batch_deps_1 = [doc.cats for doc in nlp.pipe(texts)] batch_deps_1 = [doc.cats for doc in nlp.pipe(texts)]
batch_deps_2 = [doc.cats for doc in nlp.pipe(texts)] batch_deps_2 = [doc.cats for doc in nlp.pipe(texts)]
no_batch_deps = [doc.cats for doc in [nlp(text) for text in texts]] no_batch_deps = [doc.cats for doc in [nlp(text) for text in texts]]
assert_equal(batch_deps_1, batch_deps_2) for cats_1, cats_2 in zip(batch_deps_1, batch_deps_2):
assert_equal(batch_deps_1, no_batch_deps) for cat in cats_1:
assert_almost_equal(cats_1[cat], cats_2[cat], decimal=5)
for cats_1, cats_2 in zip(batch_deps_1, no_batch_deps):
for cat in cats_1:
assert_almost_equal(cats_1[cat], cats_2[cat], decimal=5)
# fmt: off # fmt: off
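Review note on the two hunks above: exact equality on `doc.cats` is replaced by a per-label comparison with `decimal=5`, presumably because batched and unbatched predictions can differ in the last float digits. A standard-library sketch of the same idea follows. The helper `cats_almost_equal` is hypothetical, and its tolerance of `1.5 * 10**-decimal` is taken from how the NumPy documentation describes `assert_almost_equal`, so treat it as an approximation of that check rather than a drop-in replacement.

```python
import math


def cats_almost_equal(cats_1, cats_2, decimal=5):
    """Compare two {label: score} dicts with an absolute tolerance."""
    tol = 1.5 * 10 ** (-decimal)
    return all(
        math.isclose(cats_1[cat], cats_2[cat], rel_tol=0.0, abs_tol=tol)
        for cat in cats_1
    )


batched = {"POSITIVE": 0.9999991, "NEGATIVE": 0.0000009}
unbatched = {"POSITIVE": 0.9999993, "NEGATIVE": 0.0000007}

print(batched == unbatched)                   # exact comparison
print(cats_almost_equal(batched, unbatched))  # tolerant comparison
```

As in the test loops, a label missing from the second dict raises `KeyError` instead of being skipped silently.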


@ -8,8 +8,8 @@ from spacy.tokens import Doc
from spacy.training import Example from spacy.training import Example
from spacy import util from spacy import util
from spacy.lang.en import English from spacy.lang.en import English
from thinc.api import Config from thinc.api import Config, get_current_ops
from numpy.testing import assert_equal from numpy.testing import assert_array_equal
from ..util import get_batch, make_tempdir from ..util import get_batch, make_tempdir
@ -160,7 +160,8 @@ def test_tok2vec_listener():
doc = nlp("Running the pipeline as a whole.") doc = nlp("Running the pipeline as a whole.")
doc_tensor = tagger_tok2vec.predict([doc])[0] doc_tensor = tagger_tok2vec.predict([doc])[0]
assert_equal(doc.tensor, doc_tensor) ops = get_current_ops()
assert_array_equal(ops.to_numpy(doc.tensor), ops.to_numpy(doc_tensor))
# TODO: should this warn or error? # TODO: should this warn or error?
nlp.select_pipes(disable="tok2vec") nlp.select_pipes(disable="tok2vec")


@ -9,6 +9,7 @@ from spacy.language import Language
from spacy.util import ensure_path, load_model_from_path from spacy.util import ensure_path, load_model_from_path
import numpy import numpy
import pickle import pickle
from thinc.api import NumpyOps, get_current_ops
from ..util import make_tempdir from ..util import make_tempdir
@ -169,21 +170,22 @@ def test_issue4725_1():
def test_issue4725_2(): def test_issue4725_2():
# ensures that this runs correctly and doesn't hang or crash because of the global vectors if isinstance(get_current_ops(), NumpyOps):
# if it does crash, it's usually because of calling 'spawn' for multiprocessing (e.g. on Windows), # ensures that this runs correctly and doesn't hang or crash because of the global vectors
# or because of issues with pickling the NER (cf test_issue4725_1) # if it does crash, it's usually because of calling 'spawn' for multiprocessing (e.g. on Windows),
vocab = Vocab(vectors_name="test_vocab_add_vector") # or because of issues with pickling the NER (cf test_issue4725_1)
data = numpy.ndarray((5, 3), dtype="f") vocab = Vocab(vectors_name="test_vocab_add_vector")
data[0] = 1.0 data = numpy.ndarray((5, 3), dtype="f")
data[1] = 2.0 data[0] = 1.0
vocab.set_vector("cat", data[0]) data[1] = 2.0
vocab.set_vector("dog", data[1]) vocab.set_vector("cat", data[0])
nlp = English(vocab=vocab) vocab.set_vector("dog", data[1])
nlp.add_pipe("ner") nlp = English(vocab=vocab)
nlp.initialize() nlp.add_pipe("ner")
docs = ["Kurt is in London."] * 10 nlp.initialize()
for _ in nlp.pipe(docs, batch_size=2, n_process=2): docs = ["Kurt is in London."] * 10
pass for _ in nlp.pipe(docs, batch_size=2, n_process=2):
pass
def test_issue4849(): def test_issue4849():
@ -204,10 +206,11 @@ def test_issue4849():
count_ents += len([ent for ent in doc.ents if ent.ent_id > 0]) count_ents += len([ent for ent in doc.ents if ent.ent_id > 0])
assert count_ents == 2 assert count_ents == 2
# USING 2 PROCESSES # USING 2 PROCESSES
count_ents = 0 if isinstance(get_current_ops(), NumpyOps):
for doc in nlp.pipe([text], n_process=2): count_ents = 0
count_ents += len([ent for ent in doc.ents if ent.ent_id > 0]) for doc in nlp.pipe([text], n_process=2):
assert count_ents == 2 count_ents += len([ent for ent in doc.ents if ent.ent_id > 0])
assert count_ents == 2
@Language.factory("my_pipe") @Language.factory("my_pipe")
@ -239,10 +242,11 @@ def test_issue4903():
nlp.add_pipe("sentencizer") nlp.add_pipe("sentencizer")
nlp.add_pipe("my_pipe", after="sentencizer") nlp.add_pipe("my_pipe", after="sentencizer")
text = ["I like bananas.", "Do you like them?", "No, I prefer wasabi."] text = ["I like bananas.", "Do you like them?", "No, I prefer wasabi."]
docs = list(nlp.pipe(text, n_process=2)) if isinstance(get_current_ops(), NumpyOps):
assert docs[0].text == "I like bananas." docs = list(nlp.pipe(text, n_process=2))
assert docs[1].text == "Do you like them?" assert docs[0].text == "I like bananas."
assert docs[2].text == "No, I prefer wasabi." assert docs[1].text == "Do you like them?"
assert docs[2].text == "No, I prefer wasabi."
def test_issue4924(): def test_issue4924():
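Review note on the multiprocessing guards in this file: they only work if `get_current_ops` is called before the `isinstance` check. A short sketch of why the parentheses matter follows. The two classes and the module-level `_current_ops` are hypothetical stand-ins for the thinc objects, so the snippet runs without thinc installed.

```python
class NumpyOps:
    """Hypothetical stand-in for thinc's CPU backend class."""


class CupyOps:
    """Hypothetical stand-in for a GPU backend class."""


_current_ops = NumpyOps()


def get_current_ops():
    """Return the active backend object (stand-in for the thinc function)."""
    return _current_ops


# Passing the function itself: always False, so a guarded block never runs.
unguarded = isinstance(get_current_ops, NumpyOps)
# Calling the function first checks the actual backend object.
guarded = isinstance(get_current_ops(), NumpyOps)
print(unguarded, guarded)
```

A guard that is always false skips the test body silently, so the test passes without exercising `n_process=2` at all.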


@ -6,6 +6,7 @@ from spacy.language import Language
from spacy.lang.en.syntax_iterators import noun_chunks from spacy.lang.en.syntax_iterators import noun_chunks
from spacy.vocab import Vocab from spacy.vocab import Vocab
import spacy import spacy
from thinc.api import get_current_ops
import pytest import pytest
from ...util import make_tempdir from ...util import make_tempdir
@ -54,16 +55,17 @@ def test_issue5082():
ruler.add_patterns(patterns) ruler.add_patterns(patterns)
parsed_vectors_1 = [t.vector for t in nlp(text)] parsed_vectors_1 = [t.vector for t in nlp(text)]
assert len(parsed_vectors_1) == 4 assert len(parsed_vectors_1) == 4
numpy.testing.assert_array_equal(parsed_vectors_1[0], array1) ops = get_current_ops()
numpy.testing.assert_array_equal(parsed_vectors_1[1], array2) numpy.testing.assert_array_equal(ops.to_numpy(parsed_vectors_1[0]), array1)
numpy.testing.assert_array_equal(parsed_vectors_1[2], array3) numpy.testing.assert_array_equal(ops.to_numpy(parsed_vectors_1[1]), array2)
numpy.testing.assert_array_equal(parsed_vectors_1[3], array4) numpy.testing.assert_array_equal(ops.to_numpy(parsed_vectors_1[2]), array3)
numpy.testing.assert_array_equal(ops.to_numpy(parsed_vectors_1[3]), array4)
nlp.add_pipe("merge_entities") nlp.add_pipe("merge_entities")
parsed_vectors_2 = [t.vector for t in nlp(text)] parsed_vectors_2 = [t.vector for t in nlp(text)]
assert len(parsed_vectors_2) == 3 assert len(parsed_vectors_2) == 3
numpy.testing.assert_array_equal(parsed_vectors_2[0], array1) numpy.testing.assert_array_equal(ops.to_numpy(parsed_vectors_2[0]), array1)
numpy.testing.assert_array_equal(parsed_vectors_2[1], array2) numpy.testing.assert_array_equal(ops.to_numpy(parsed_vectors_2[1]), array2)
numpy.testing.assert_array_equal(parsed_vectors_2[2], array34) numpy.testing.assert_array_equal(ops.to_numpy(parsed_vectors_2[2]), array34)
def test_issue5137(): def test_issue5137():


@ -1,5 +1,6 @@
import pytest import pytest
from thinc.api import Config, fix_random_seed from numpy.testing import assert_almost_equal
from thinc.api import Config, fix_random_seed, get_current_ops
from spacy.lang.en import English from spacy.lang.en import English
from spacy.pipeline.textcat import single_label_default_config, single_label_bow_config from spacy.pipeline.textcat import single_label_default_config, single_label_bow_config
@ -44,11 +45,12 @@ def test_issue5551(textcat_config):
nlp.update([Example.from_dict(doc, annots)]) nlp.update([Example.from_dict(doc, annots)])
# Store the result of each iteration # Store the result of each iteration
result = pipe.model.predict([doc]) result = pipe.model.predict([doc])
results.append(list(result[0])) results.append(result[0])
# All results should be the same because of the fixed seed # All results should be the same because of the fixed seed
assert len(results) == 3 assert len(results) == 3
assert results[0] == results[1] ops = get_current_ops()
assert results[0] == results[2] assert_almost_equal(ops.to_numpy(results[0]), ops.to_numpy(results[1]))
assert_almost_equal(ops.to_numpy(results[0]), ops.to_numpy(results[2]))
def test_issue5838(): def test_issue5838():


@ -1,4 +1,6 @@
from spacy.kb import KnowledgeBase
from spacy.lang.en import English from spacy.lang.en import English
from spacy.training import Example
def test_issue7065(): def test_issue7065():
@ -16,3 +18,58 @@ def test_issue7065():
ent = doc.ents[0] ent = doc.ents[0]
assert ent.start < sent0.end < ent.end assert ent.start < sent0.end < ent.end
assert sentences.index(ent.sent) == 0 assert sentences.index(ent.sent) == 0
def test_issue7065_b():
# Test that the NEL doesn't crash when an entity crosses a sentence boundary
nlp = English()
vector_length = 3
nlp.add_pipe("sentencizer")
text = "Mahler 's Symphony No. 8 was beautiful."
entities = [(0, 6, "PERSON"), (10, 24, "WORK")]
links = {(0, 6): {"Q7304": 1.0, "Q270853": 0.0},
(10, 24): {"Q7304": 0.0, "Q270853": 1.0}}
sent_starts = [1, -1, 0, 0, 0, 0, 0, 0, 0]
doc = nlp(text)
example = Example.from_dict(doc, {"entities": entities, "links": links, "sent_starts": sent_starts})
train_examples = [example]
def create_kb(vocab):
# create artificial KB
mykb = KnowledgeBase(vocab, entity_vector_length=vector_length)
mykb.add_entity(entity="Q270853", freq=12, entity_vector=[9, 1, -7])
mykb.add_alias(
alias="No. 8",
entities=["Q270853"],
probabilities=[1.0],
)
mykb.add_entity(entity="Q7304", freq=12, entity_vector=[6, -4, 3])
mykb.add_alias(
alias="Mahler",
entities=["Q7304"],
probabilities=[1.0],
)
return mykb
# Create the Entity Linker component and add it to the pipeline
entity_linker = nlp.add_pipe("entity_linker", last=True)
entity_linker.set_kb(create_kb)
# train the NEL pipe
optimizer = nlp.initialize(get_examples=lambda: train_examples)
for i in range(2):
losses = {}
nlp.update(train_examples, sgd=optimizer, losses=losses)
# Add a custom rule-based component to mimic NER
patterns = [
{"label": "PERSON", "pattern": [{"LOWER": "mahler"}]},
{"label": "WORK", "pattern": [{"LOWER": "symphony"}, {"LOWER": "no"}, {"LOWER": "."}, {"LOWER": "8"}]}
]
ruler = nlp.add_pipe("entity_ruler", before="entity_linker")
ruler.add_patterns(patterns)
# test the trained model - this should not throw E148
doc = nlp(text)
assert doc


@ -4,7 +4,7 @@ import spacy
from spacy.lang.en import English from spacy.lang.en import English
from spacy.lang.de import German from spacy.lang.de import German
from spacy.language import Language, DEFAULT_CONFIG, DEFAULT_CONFIG_PRETRAIN_PATH from spacy.language import Language, DEFAULT_CONFIG, DEFAULT_CONFIG_PRETRAIN_PATH
from spacy.util import registry, load_model_from_config, load_config from spacy.util import registry, load_model_from_config, load_config, load_config_from_str
from spacy.ml.models import build_Tok2Vec_model, build_tb_parser_model from spacy.ml.models import build_Tok2Vec_model, build_tb_parser_model
from spacy.ml.models import MultiHashEmbed, MaxoutWindowEncoder from spacy.ml.models import MultiHashEmbed, MaxoutWindowEncoder
from spacy.schemas import ConfigSchema, ConfigSchemaPretrain from spacy.schemas import ConfigSchema, ConfigSchemaPretrain
@ -465,3 +465,32 @@ def test_config_only_resolve_relevant_blocks():
nlp.initialize() nlp.initialize()
nlp.config["initialize"]["lookups"] = None nlp.config["initialize"]["lookups"] = None
nlp.initialize() nlp.initialize()
def test_hyphen_in_config():
hyphen_config_str = """
[nlp]
lang = "en"
pipeline = ["my_punctual_component"]
[components]
[components.my_punctual_component]
factory = "my_punctual_component"
punctuation = ["?","-"]
"""
@spacy.Language.factory("my_punctual_component")
class MyPunctualComponent(object):
name = "my_punctual_component"
def __init__(
self,
nlp,
name,
punctuation,
):
self.punctuation = punctuation
nlp = English.from_config(load_config_from_str(hyphen_config_str))
assert nlp.get_pipe("my_punctual_component").punctuation == ['?', '-']


@ -26,10 +26,14 @@ def test_serialize_custom_tokenizer(en_vocab, en_tokenizer):
assert tokenizer.rules != {} assert tokenizer.rules != {}
assert tokenizer.token_match is not None assert tokenizer.token_match is not None
assert tokenizer.url_match is not None assert tokenizer.url_match is not None
assert tokenizer.prefix_search is not None
assert tokenizer.infix_finditer is not None
tokenizer.from_bytes(tokenizer_bytes) tokenizer.from_bytes(tokenizer_bytes)
assert tokenizer.rules == {} assert tokenizer.rules == {}
assert tokenizer.token_match is None assert tokenizer.token_match is None
assert tokenizer.url_match is None assert tokenizer.url_match is None
assert tokenizer.prefix_search is None
assert tokenizer.infix_finditer is None
tokenizer = Tokenizer(en_vocab, rules={"ABC.": [{"ORTH": "ABC"}, {"ORTH": "."}]}) tokenizer = Tokenizer(en_vocab, rules={"ABC.": [{"ORTH": "ABC"}, {"ORTH": "."}]})
tokenizer.rules = {} tokenizer.rules = {}


@ -49,9 +49,9 @@ def test_serialize_vocab_roundtrip_disk(strings1, strings2):
vocab1_d = Vocab().from_disk(file_path1) vocab1_d = Vocab().from_disk(file_path1)
vocab2_d = Vocab().from_disk(file_path2) vocab2_d = Vocab().from_disk(file_path2)
# check strings rather than lexemes, which are only reloaded on demand # check strings rather than lexemes, which are only reloaded on demand
assert strings1 == [s for s in vocab1_d.strings] assert set(strings1) == set([s for s in vocab1_d.strings])
assert strings2 == [s for s in vocab2_d.strings] assert set(strings2) == set([s for s in vocab2_d.strings])
if strings1 == strings2: if set(strings1) == set(strings2):
assert [s for s in vocab1_d.strings] == [s for s in vocab2_d.strings] assert [s for s in vocab1_d.strings] == [s for s in vocab2_d.strings]
else: else:
assert [s for s in vocab1_d.strings] != [s for s in vocab2_d.strings] assert [s for s in vocab1_d.strings] != [s for s in vocab2_d.strings]
@ -96,7 +96,7 @@ def test_serialize_stringstore_roundtrip_bytes(strings1, strings2):
sstore2 = StringStore(strings=strings2) sstore2 = StringStore(strings=strings2)
sstore1_b = sstore1.to_bytes() sstore1_b = sstore1.to_bytes()
sstore2_b = sstore2.to_bytes() sstore2_b = sstore2.to_bytes()
if strings1 == strings2: if set(strings1) == set(strings2):
assert sstore1_b == sstore2_b assert sstore1_b == sstore2_b
else: else:
assert sstore1_b != sstore2_b assert sstore1_b != sstore2_b
@ -104,7 +104,7 @@ def test_serialize_stringstore_roundtrip_bytes(strings1, strings2):
assert sstore1.to_bytes() == sstore1_b assert sstore1.to_bytes() == sstore1_b
new_sstore1 = StringStore().from_bytes(sstore1_b) new_sstore1 = StringStore().from_bytes(sstore1_b)
assert new_sstore1.to_bytes() == sstore1_b assert new_sstore1.to_bytes() == sstore1_b
assert list(new_sstore1) == strings1 assert set(new_sstore1) == set(strings1)
@pytest.mark.parametrize("strings1,strings2", test_strings) @pytest.mark.parametrize("strings1,strings2", test_strings)
@ -118,12 +118,12 @@ def test_serialize_stringstore_roundtrip_disk(strings1, strings2):
sstore2.to_disk(file_path2) sstore2.to_disk(file_path2)
sstore1_d = StringStore().from_disk(file_path1) sstore1_d = StringStore().from_disk(file_path1)
sstore2_d = StringStore().from_disk(file_path2) sstore2_d = StringStore().from_disk(file_path2)
assert list(sstore1_d) == list(sstore1) assert set(sstore1_d) == set(sstore1)
assert list(sstore2_d) == list(sstore2) assert set(sstore2_d) == set(sstore2)
if strings1 == strings2: if set(strings1) == set(strings2):
assert list(sstore1_d) == list(sstore2_d) assert set(sstore1_d) == set(sstore2_d)
else: else:
assert list(sstore1_d) != list(sstore2_d) assert set(sstore1_d) != set(sstore2_d)
@pytest.mark.parametrize("strings,lex_attr", test_strings_attrs) @pytest.mark.parametrize("strings,lex_attr", test_strings_attrs)
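Review note on the serialization hunks above: the assertions move from list equality to set equality, which suggests the round trip is no longer expected to preserve string order. A tiny sketch of the difference, using made-up strings and a hypothetical reloaded order:

```python
strings = ["apple", "banana", "cherry"]
# Hypothetical: a store that hands its contents back in a different order.
reloaded = ["cherry", "apple", "banana"]

print(strings == reloaded)            # order-sensitive comparison
print(set(strings) == set(reloaded))  # order-insensitive comparison
```

One trade-off worth keeping in mind: converting to sets also collapses duplicates, so `["a", "a"]` and `["a"]` compare equal after the conversion.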


@ -307,8 +307,11 @@ def test_project_config_validation2(config, n_errors):
assert len(errors) == n_errors assert len(errors) == n_errors
def test_project_config_interpolation(): @pytest.mark.parametrize(
variables = {"a": 10, "b": {"c": "foo", "d": True}} "int_value", [10, pytest.param("10", marks=pytest.mark.xfail)],
)
def test_project_config_interpolation(int_value):
variables = {"a": int_value, "b": {"c": "foo", "d": True}}
commands = [ commands = [
{"name": "x", "script": ["hello ${vars.a} ${vars.b.c}"]}, {"name": "x", "script": ["hello ${vars.a} ${vars.b.c}"]},
{"name": "y", "script": ["${vars.b.c} ${vars.b.d}"]}, {"name": "y", "script": ["${vars.b.c} ${vars.b.d}"]},
@ -317,6 +320,8 @@ def test_project_config_interpolation():
with make_tempdir() as d: with make_tempdir() as d:
srsly.write_yaml(d / "project.yml", project) srsly.write_yaml(d / "project.yml", project)
cfg = load_project_config(d) cfg = load_project_config(d)
assert type(cfg) == dict
assert type(cfg["commands"]) == list
assert cfg["commands"][0]["script"][0] == "hello 10 foo" assert cfg["commands"][0]["script"][0] == "hello 10 foo"
assert cfg["commands"][1]["script"][0] == "foo true" assert cfg["commands"][1]["script"][0] == "foo true"
commands = [{"name": "x", "script": ["hello ${vars.a} ${vars.b.e}"]}] commands = [{"name": "x", "script": ["hello ${vars.a} ${vars.b.e}"]}]
@ -325,6 +330,24 @@ def test_project_config_interpolation():
substitute_project_variables(project) substitute_project_variables(project)
@pytest.mark.parametrize(
"greeting", [342, "everyone", "tout le monde", pytest.param("42", marks=pytest.mark.xfail)],
)
def test_project_config_interpolation_override(greeting):
variables = {"a": "world"}
commands = [
{"name": "x", "script": ["hello ${vars.a}"]},
]
overrides = {"vars.a": greeting}
project = {"commands": commands, "vars": variables}
with make_tempdir() as d:
srsly.write_yaml(d / "project.yml", project)
cfg = load_project_config(d, overrides=overrides)
assert type(cfg) == dict
assert type(cfg["commands"]) == list
assert cfg["commands"][0]["script"][0] == f"hello {greeting}"
def test_project_config_interpolation_env(): def test_project_config_interpolation_env():
variables = {"a": 10} variables = {"a": 10}
env_var = "SPACY_TEST_FOO" env_var = "SPACY_TEST_FOO"


@ -10,6 +10,7 @@ from spacy.lang.en import English
from spacy.lang.de import German from spacy.lang.de import German
from spacy.util import registry, ignore_error, raise_error from spacy.util import registry, ignore_error, raise_error
import spacy import spacy
from thinc.api import NumpyOps, get_current_ops
from .util import add_vecs_to_vocab, assert_docs_equal from .util import add_vecs_to_vocab, assert_docs_equal
@ -142,25 +143,29 @@ def texts():
@pytest.mark.parametrize("n_process", [1, 2]) @pytest.mark.parametrize("n_process", [1, 2])
def test_language_pipe(nlp2, n_process, texts): def test_language_pipe(nlp2, n_process, texts):
texts = texts * 10 ops = get_current_ops()
expecteds = [nlp2(text) for text in texts] if isinstance(ops, NumpyOps) or n_process < 2:
docs = nlp2.pipe(texts, n_process=n_process, batch_size=2) texts = texts * 10
expecteds = [nlp2(text) for text in texts]
docs = nlp2.pipe(texts, n_process=n_process, batch_size=2)
for doc, expected_doc in zip(docs, expecteds): for doc, expected_doc in zip(docs, expecteds):
assert_docs_equal(doc, expected_doc) assert_docs_equal(doc, expected_doc)
@pytest.mark.parametrize("n_process", [1, 2]) @pytest.mark.parametrize("n_process", [1, 2])
def test_language_pipe_stream(nlp2, n_process, texts): def test_language_pipe_stream(nlp2, n_process, texts):
# check if nlp.pipe can handle infinite length iterator properly. ops = get_current_ops()
stream_texts = itertools.cycle(texts) if isinstance(ops, NumpyOps) or n_process < 2:
texts0, texts1 = itertools.tee(stream_texts) # check if nlp.pipe can handle infinite length iterator properly.
expecteds = (nlp2(text) for text in texts0) stream_texts = itertools.cycle(texts)
docs = nlp2.pipe(texts1, n_process=n_process, batch_size=2) texts0, texts1 = itertools.tee(stream_texts)
expecteds = (nlp2(text) for text in texts0)
docs = nlp2.pipe(texts1, n_process=n_process, batch_size=2)
n_fetch = 20 n_fetch = 20
for doc, expected_doc in itertools.islice(zip(docs, expecteds), n_fetch): for doc, expected_doc in itertools.islice(zip(docs, expecteds), n_fetch):
assert_docs_equal(doc, expected_doc) assert_docs_equal(doc, expected_doc)
def test_language_pipe_error_handler(): def test_language_pipe_error_handler():


@ -8,7 +8,8 @@ from spacy import prefer_gpu, require_gpu, require_cpu
from spacy.ml._precomputable_affine import PrecomputableAffine from spacy.ml._precomputable_affine import PrecomputableAffine
from spacy.ml._precomputable_affine import _backprop_precomputable_affine_padding from spacy.ml._precomputable_affine import _backprop_precomputable_affine_padding
from spacy.util import dot_to_object, SimpleFrozenList, import_file from spacy.util import dot_to_object, SimpleFrozenList, import_file
from thinc.api import Config, Optimizer, ConfigValidationError from thinc.api import Config, Optimizer, ConfigValidationError, get_current_ops
from thinc.api import set_current_ops
from spacy.training.batchers import minibatch_by_words from spacy.training.batchers import minibatch_by_words
from spacy.lang.en import English from spacy.lang.en import English
from spacy.lang.nl import Dutch from spacy.lang.nl import Dutch
@ -81,6 +82,7 @@ def test_PrecomputableAffine(nO=4, nI=5, nF=3, nP=2):
def test_prefer_gpu(): def test_prefer_gpu():
current_ops = get_current_ops()
try: try:
import cupy # noqa: F401 import cupy # noqa: F401
@ -88,9 +90,11 @@ def test_prefer_gpu():
assert isinstance(get_current_ops(), CupyOps) assert isinstance(get_current_ops(), CupyOps)
except ImportError: except ImportError:
assert not prefer_gpu() assert not prefer_gpu()
set_current_ops(current_ops)
def test_require_gpu(): def test_require_gpu():
current_ops = get_current_ops()
try: try:
import cupy # noqa: F401 import cupy # noqa: F401
@ -99,9 +103,11 @@ def test_require_gpu():
except ImportError: except ImportError:
with pytest.raises(ValueError): with pytest.raises(ValueError):
require_gpu() require_gpu()
set_current_ops(current_ops)
def test_require_cpu(): def test_require_cpu():
current_ops = get_current_ops()
require_cpu() require_cpu()
assert isinstance(get_current_ops(), NumpyOps) assert isinstance(get_current_ops(), NumpyOps)
try: try:
@ -113,6 +119,7 @@ def test_require_cpu():
pass pass
require_cpu() require_cpu()
assert isinstance(get_current_ops(), NumpyOps) assert isinstance(get_current_ops(), NumpyOps)
set_current_ops(current_ops)
def test_ascii_filenames(): def test_ascii_filenames():
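Review note on the three GPU/CPU tests above: each one now records the current ops on entry and restores them on the last line, so a backend switch does not leak into later tests. A sketch of the same save-and-restore idea follows, with one variation that is not in the diff: the restore sits in a `finally` block so it also runs when the body raises. The global `_current_ops` and both helpers are hypothetical stand-ins for the thinc functions.

```python
_current_ops = "numpy"  # hypothetical global backend setting


def get_current_ops():
    return _current_ops


def set_current_ops(ops):
    global _current_ops
    _current_ops = ops


def run_with_restored_ops(test_body):
    """Run test_body, then put the global backend back as it was."""
    saved = get_current_ops()
    try:
        test_body()
    finally:
        # Runs even if test_body raises, unlike a restore on the last line.
        set_current_ops(saved)


run_with_restored_ops(lambda: set_current_ops("cupy"))
print(get_current_ops())
```

In a pytest suite the same effect could be had with a fixture that yields and then restores, but that is a design choice beyond what this commit does.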


@ -1,7 +1,7 @@
from typing import List from typing import List
import pytest import pytest
from thinc.api import fix_random_seed, Adam, set_dropout_rate from thinc.api import fix_random_seed, Adam, set_dropout_rate
from numpy.testing import assert_array_equal from numpy.testing import assert_array_equal, assert_array_almost_equal
import numpy import numpy
from spacy.ml.models import build_Tok2Vec_model, MultiHashEmbed, MaxoutWindowEncoder from spacy.ml.models import build_Tok2Vec_model, MultiHashEmbed, MaxoutWindowEncoder
from spacy.ml.models import build_bow_text_classifier, build_simple_cnn_text_classifier from spacy.ml.models import build_bow_text_classifier, build_simple_cnn_text_classifier
@ -109,7 +109,7 @@ def test_models_initialize_consistently(seed, model_func, kwargs):
model2.initialize() model2.initialize()
params1 = get_all_params(model1) params1 = get_all_params(model1)
params2 = get_all_params(model2) params2 = get_all_params(model2)
assert_array_equal(params1, params2) assert_array_equal(model1.ops.to_numpy(params1), model2.ops.to_numpy(params2))
@pytest.mark.parametrize( @pytest.mark.parametrize(
@ -134,14 +134,25 @@ def test_models_predict_consistently(seed, model_func, kwargs, get_X):
for i in range(len(tok2vec1)): for i in range(len(tok2vec1)):
for j in range(len(tok2vec1[i])): for j in range(len(tok2vec1[i])):
assert_array_equal( assert_array_equal(
numpy.asarray(tok2vec1[i][j]), numpy.asarray(tok2vec2[i][j]) numpy.asarray(model1.ops.to_numpy(tok2vec1[i][j])),
numpy.asarray(model2.ops.to_numpy(tok2vec2[i][j])),
) )
try:
Y1 = model1.ops.to_numpy(Y1)
Y2 = model2.ops.to_numpy(Y2)
except Exception:
pass
if isinstance(Y1, numpy.ndarray): if isinstance(Y1, numpy.ndarray):
assert_array_equal(Y1, Y2) assert_array_equal(Y1, Y2)
elif isinstance(Y1, List): elif isinstance(Y1, List):
assert len(Y1) == len(Y2) assert len(Y1) == len(Y2)
for y1, y2 in zip(Y1, Y2): for y1, y2 in zip(Y1, Y2):
try:
y1 = model1.ops.to_numpy(y1)
y2 = model2.ops.to_numpy(y2)
except Exception:
pass
assert_array_equal(y1, y2) assert_array_equal(y1, y2)
else: else:
raise ValueError(f"Could not compare type {type(Y1)}") raise ValueError(f"Could not compare type {type(Y1)}")
@ -169,12 +180,17 @@ def test_models_update_consistently(seed, dropout, model_func, kwargs, get_X):
model.finish_update(optimizer) model.finish_update(optimizer)
updated_params = get_all_params(model) updated_params = get_all_params(model)
with pytest.raises(AssertionError): with pytest.raises(AssertionError):
assert_array_equal(initial_params, updated_params) assert_array_equal(
model.ops.to_numpy(initial_params), model.ops.to_numpy(updated_params)
)
return model return model
model1 = get_updated_model() model1 = get_updated_model()
model2 = get_updated_model() model2 = get_updated_model()
assert_array_equal(get_all_params(model1), get_all_params(model2)) assert_array_almost_equal(
model1.ops.to_numpy(get_all_params(model1)),
model2.ops.to_numpy(get_all_params(model2)),
)
@pytest.mark.parametrize("model_func,kwargs", [(StaticVectors, {"nO": 128, "nM": 300})]) @pytest.mark.parametrize("model_func,kwargs", [(StaticVectors, {"nO": 128, "nM": 300})])


@ -3,10 +3,10 @@ import pytest
from pytest import approx from pytest import approx
from spacy.training import Example from spacy.training import Example
from spacy.training.iob_utils import offsets_to_biluo_tags from spacy.training.iob_utils import offsets_to_biluo_tags
from spacy.scorer import Scorer, ROCAUCScore from spacy.scorer import Scorer, ROCAUCScore, PRFScore
from spacy.scorer import _roc_auc_score, _roc_curve from spacy.scorer import _roc_auc_score, _roc_curve
from spacy.lang.en import English from spacy.lang.en import English
from spacy.tokens import Doc from spacy.tokens import Doc, Span
test_las_apple = [ test_las_apple = [
@ -403,3 +403,68 @@ def test_roc_auc_score():
score.score_set(0.75, 1) score.score_set(0.75, 1)
with pytest.raises(ValueError): with pytest.raises(ValueError):
_ = score.score # noqa: F841 _ = score.score # noqa: F841
def test_score_spans():
nlp = English()
text = "This is just a random sentence."
key = "my_spans"
gold = nlp.make_doc(text)
pred = nlp.make_doc(text)
spans = []
spans.append(gold.char_span(0, 4, label="PERSON"))
spans.append(gold.char_span(0, 7, label="ORG"))
spans.append(gold.char_span(8, 12, label="ORG"))
gold.spans[key] = spans
def span_getter(doc, span_key):
return doc.spans[span_key]
# Predict exactly the same, but overlapping spans will be discarded
pred.spans[key] = spans
eg = Example(pred, gold)
scores = Scorer.score_spans([eg], attr=key, getter=span_getter)
assert scores[f"{key}_p"] == 1.0
assert scores[f"{key}_r"] < 1.0
# Allow overlapping, now both precision and recall should be 100%
pred.spans[key] = spans
eg = Example(pred, gold)
scores = Scorer.score_spans([eg], attr=key, getter=span_getter, allow_overlap=True)
assert scores[f"{key}_p"] == 1.0
assert scores[f"{key}_r"] == 1.0
# Change the predicted labels
new_spans = [Span(pred, span.start, span.end, label="WRONG") for span in spans]
pred.spans[key] = new_spans
eg = Example(pred, gold)
scores = Scorer.score_spans([eg], attr=key, getter=span_getter, allow_overlap=True)
assert scores[f"{key}_p"] == 0.0
assert scores[f"{key}_r"] == 0.0
assert f"{key}_per_type" in scores
# Discard labels from the evaluation
scores = Scorer.score_spans([eg], attr=key, getter=span_getter, allow_overlap=True, labeled=False)
assert scores[f"{key}_p"] == 1.0
assert scores[f"{key}_r"] == 1.0
assert f"{key}_per_type" not in scores
def test_prf_score():
cand = {"hi", "ho"}
gold1 = {"yo", "hi"}
gold2 = set()
a = PRFScore()
a.score_set(cand=cand, gold=gold1)
assert (a.precision, a.recall, a.fscore) == approx((0.5, 0.5, 0.5))
b = PRFScore()
b.score_set(cand=cand, gold=gold2)
assert (b.precision, b.recall, b.fscore) == approx((0.0, 0.0, 0.0))
c = a + b
assert (c.precision, c.recall, c.fscore) == approx((0.25, 0.5, 0.33333333))
a += b
assert (a.precision, a.recall, a.fscore) == approx((c.precision, c.recall, c.fscore))
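Review note on `test_prf_score` above: the expected values for `a + b` only make sense if addition sums the underlying counts instead of averaging the two scores. A worked sketch under that assumption follows. The `Counts` class is hypothetical and only mirrors the arithmetic; it is not spaCy's `PRFScore` implementation.

```python
from dataclasses import dataclass


@dataclass
class Counts:
    """Hypothetical tally of true positives, false positives, false negatives."""

    tp: int = 0
    fp: int = 0
    fn: int = 0

    def score_set(self, cand: set, gold: set) -> None:
        self.tp += len(cand & gold)
        self.fp += len(cand - gold)
        self.fn += len(gold - cand)

    def __add__(self, other: "Counts") -> "Counts":
        return Counts(self.tp + other.tp, self.fp + other.fp, self.fn + other.fn)

    @property
    def precision(self) -> float:
        return self.tp / (self.tp + self.fp) if self.tp + self.fp else 0.0

    @property
    def recall(self) -> float:
        return self.tp / (self.tp + self.fn) if self.tp + self.fn else 0.0

    @property
    def fscore(self) -> float:
        p, r = self.precision, self.recall
        return 2 * p * r / (p + r) if p + r else 0.0


a, b = Counts(), Counts()
a.score_set({"hi", "ho"}, {"yo", "hi"})  # tp=1, fp=1, fn=1
b.score_set({"hi", "ho"}, set())         # tp=0, fp=2, fn=0
c = a + b                                # tp=1, fp=3, fn=1
print(c.precision, c.recall, round(c.fscore, 4))
```

With summed counts, precision is 1/4, recall is 1/2, and the F-score is 2 * 0.25 * 0.5 / 0.75, about 0.3333, which agrees with the values asserted for `c` in the test.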


@ -1,5 +1,7 @@
import pytest import pytest
import re
from spacy.util import get_lang_class from spacy.util import get_lang_class
from spacy.tokenizer import Tokenizer
# Only include languages with no external dependencies # Only include languages with no external dependencies
# "is" seems to confuse importlib, so we're also excluding it for now # "is" seems to confuse importlib, so we're also excluding it for now
@ -60,3 +62,18 @@ def test_tokenizer_explain(lang):
tokens = [t.text for t in tokenizer(sentence) if not t.is_space] tokens = [t.text for t in tokenizer(sentence) if not t.is_space]
debug_tokens = [t[1] for t in tokenizer.explain(sentence)] debug_tokens = [t[1] for t in tokenizer.explain(sentence)]
assert tokens == debug_tokens assert tokens == debug_tokens
def test_tokenizer_explain_special_matcher(en_vocab):
suffix_re = re.compile(r"[\.]$")
infix_re = re.compile(r"[/]")
rules = {"a.": [{"ORTH": "a."}]}
tokenizer = Tokenizer(
en_vocab,
rules=rules,
suffix_search=suffix_re.search,
infix_finditer=infix_re.finditer,
)
tokens = [t.text for t in tokenizer("a/a.")]
explain_tokens = [t[1] for t in tokenizer.explain("a/a.")]
assert tokens == explain_tokens

View File

@ -1,4 +1,5 @@
import pytest import pytest
import re
from spacy.vocab import Vocab from spacy.vocab import Vocab
from spacy.tokenizer import Tokenizer from spacy.tokenizer import Tokenizer
from spacy.util import ensure_path from spacy.util import ensure_path
@ -186,3 +187,31 @@ def test_tokenizer_special_cases_spaces(tokenizer):
assert [t.text for t in tokenizer("a b c")] == ["a", "b", "c"] assert [t.text for t in tokenizer("a b c")] == ["a", "b", "c"]
tokenizer.add_special_case("a b c", [{"ORTH": "a b c"}]) tokenizer.add_special_case("a b c", [{"ORTH": "a b c"}])
assert [t.text for t in tokenizer("a b c")] == ["a b c"] assert [t.text for t in tokenizer("a b c")] == ["a b c"]
def test_tokenizer_flush_cache(en_vocab):
suffix_re = re.compile(r"[\.]$")
tokenizer = Tokenizer(
en_vocab,
suffix_search=suffix_re.search,
)
assert [t.text for t in tokenizer("a.")] == ["a", "."]
tokenizer.suffix_search = None
assert [t.text for t in tokenizer("a.")] == ["a."]
def test_tokenizer_flush_specials(en_vocab):
suffix_re = re.compile(r"[\.]$")
rules = {"a a": [{"ORTH": "a a"}]}
tokenizer1 = Tokenizer(
en_vocab,
suffix_search=suffix_re.search,
rules=rules,
)
tokenizer2 = Tokenizer(
en_vocab,
suffix_search=suffix_re.search,
)
assert [t.text for t in tokenizer1("a a.")] == ["a a", "."]
tokenizer1.rules = {}
assert [t.text for t in tokenizer1("a a.")] == ["a", "a", "."]

View File

@ -2,6 +2,7 @@ import pytest
from spacy.training.example import Example from spacy.training.example import Example
from spacy.tokens import Doc from spacy.tokens import Doc
from spacy.vocab import Vocab from spacy.vocab import Vocab
from spacy.util import to_ternary_int
def test_Example_init_requires_doc_objects(): def test_Example_init_requires_doc_objects():
@ -121,7 +122,7 @@ def test_Example_from_dict_with_morphology(annots):
[ [
{ {
"words": ["This", "is", "one", "sentence", "this", "is", "another"], "words": ["This", "is", "one", "sentence", "this", "is", "another"],
"sent_starts": [1, 0, 0, 0, 1, 0, 0], "sent_starts": [1, False, 0, None, True, -1, -5.7],
} }
], ],
) )
@ -131,7 +132,12 @@ def test_Example_from_dict_with_sent_start(annots):
example = Example.from_dict(predicted, annots) example = Example.from_dict(predicted, annots)
assert len(list(example.reference.sents)) == 2 assert len(list(example.reference.sents)) == 2
for i, token in enumerate(example.reference): for i, token in enumerate(example.reference):
assert bool(token.is_sent_start) == bool(annots["sent_starts"][i]) if to_ternary_int(annots["sent_starts"][i]) == 1:
assert token.is_sent_start is True
elif to_ternary_int(annots["sent_starts"][i]) == 0:
assert token.is_sent_start is None
else:
assert token.is_sent_start is False
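The updated parametrization mixes ints, bools, None and a float in sent_starts. The sketch below illustrates one mapping that is consistent with the assertions above. The helper name ternary is hypothetical, and the exact branch order is an assumption; it is not copied from spacy.util.to_ternary_int. The identity checks come first because in Python False == 0 and True == 1, so an equality test alone could not tell False apart from 0.

```python
def ternary(val):
    """Map a value to 1 (True), 0 (unset) or -1 (False).

    Hypothetical helper: assumes True/1 -> 1, None/0 -> 0, anything else -> -1.
    """
    if val is True:
        return 1
    if val is None:
        return 0
    if val is False:
        return -1
    if val == 1:
        return 1
    if val == 0:
        return 0
    return -1


sent_starts = [1, False, 0, None, True, -1, -5.7]
mapped = [ternary(v) for v in sent_starts]
# Only the values mapped to 1 open a sentence.
n_sentence_starts = sum(1 for m in mapped if m == 1)
```

Under these assumptions the seven annotations yield two sentence starts, which matches the len(list(example.reference.sents)) == 2 assertion in the test.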
@pytest.mark.parametrize( @pytest.mark.parametrize(

View File

@ -426,6 +426,29 @@ def test_aligned_spans_x2y(en_vocab, en_tokenizer):
assert [(ent.start, ent.end) for ent in ents_x2y] == [(0, 2), (4, 6)] assert [(ent.start, ent.end) for ent in ents_x2y] == [(0, 2), (4, 6)]
def test_aligned_spans_y2x_overlap(en_vocab, en_tokenizer):
text = "I flew to San Francisco Valley"
nlp = English()
doc = nlp(text)
# the reference doc has overlapping spans
gold_doc = nlp.make_doc(text)
spans = []
prefix = "I flew to "
spans.append(gold_doc.char_span(len(prefix), len(prefix + "San Francisco"), label="CITY"))
spans.append(gold_doc.char_span(len(prefix), len(prefix + "San Francisco Valley"), label="VALLEY"))
spans_key = "overlap_ents"
gold_doc.spans[spans_key] = spans
example = Example(doc, gold_doc)
spans_gold = example.reference.spans[spans_key]
assert [(ent.start, ent.end) for ent in spans_gold] == [(3, 5), (3, 6)]
# Ensure that 'get_aligned_spans_y2x' has the aligned entities correct
spans_y2x_no_overlap = example.get_aligned_spans_y2x(spans_gold, allow_overlap=False)
assert [(ent.start, ent.end) for ent in spans_y2x_no_overlap] == [(3, 5)]
spans_y2x_overlap = example.get_aligned_spans_y2x(spans_gold, allow_overlap=True)
assert [(ent.start, ent.end) for ent in spans_y2x_overlap] == [(3, 5), (3, 6)]
def test_gold_ner_missing_tags(en_tokenizer): def test_gold_ner_missing_tags(en_tokenizer):
doc = en_tokenizer("I flew to Silicon Valley via London.") doc = en_tokenizer("I flew to Silicon Valley via London.")
biluo_tags = [None, "O", "O", "B-LOC", "L-LOC", "O", "U-GPE", "O"] biluo_tags = [None, "O", "O", "B-LOC", "L-LOC", "O", "U-GPE", "O"]

View File

@ -5,6 +5,7 @@ import srsly
from spacy.tokens import Doc from spacy.tokens import Doc
from spacy.vocab import Vocab from spacy.vocab import Vocab
from spacy.util import make_tempdir # noqa: F401 from spacy.util import make_tempdir # noqa: F401
from thinc.api import get_current_ops
@contextlib.contextmanager @contextlib.contextmanager
@ -58,7 +59,10 @@ def add_vecs_to_vocab(vocab, vectors):
def get_cosine(vec1, vec2): def get_cosine(vec1, vec2):
"""Get cosine for two given vectors""" """Get cosine for two given vectors"""
return numpy.dot(vec1, vec2) / (numpy.linalg.norm(vec1) * numpy.linalg.norm(vec2)) OPS = get_current_ops()
v1 = OPS.to_numpy(OPS.asarray(vec1))
v2 = OPS.to_numpy(OPS.asarray(vec2))
return numpy.dot(v1, v2) / (numpy.linalg.norm(v1) * numpy.linalg.norm(v2))
def assert_docs_equal(doc1, doc2): def assert_docs_equal(doc1, doc2):

View File

@ -1,6 +1,7 @@
import pytest import pytest
import numpy import numpy
from numpy.testing import assert_allclose, assert_equal from numpy.testing import assert_allclose, assert_equal
from thinc.api import get_current_ops
from spacy.vocab import Vocab from spacy.vocab import Vocab
from spacy.vectors import Vectors from spacy.vectors import Vectors
from spacy.tokenizer import Tokenizer from spacy.tokenizer import Tokenizer
@ -9,6 +10,7 @@ from spacy.tokens import Doc
from ..util import add_vecs_to_vocab, get_cosine, make_tempdir from ..util import add_vecs_to_vocab, get_cosine, make_tempdir
OPS = get_current_ops()
@pytest.fixture @pytest.fixture
def strings(): def strings():
@ -18,21 +20,21 @@ def strings():
@pytest.fixture @pytest.fixture
def vectors(): def vectors():
return [ return [
("apple", [1, 2, 3]), ("apple", OPS.asarray([1, 2, 3])),
("orange", [-1, -2, -3]), ("orange", OPS.asarray([-1, -2, -3])),
("and", [-1, -1, -1]), ("and", OPS.asarray([-1, -1, -1])),
("juice", [5, 5, 10]), ("juice", OPS.asarray([5, 5, 10])),
("pie", [7, 6.3, 8.9]), ("pie", OPS.asarray([7, 6.3, 8.9])),
] ]
@pytest.fixture @pytest.fixture
def ngrams_vectors(): def ngrams_vectors():
return [ return [
("apple", [1, 2, 3]), ("apple", OPS.asarray([1, 2, 3])),
("app", [-0.1, -0.2, -0.3]), ("app", OPS.asarray([-0.1, -0.2, -0.3])),
("ppl", [-0.2, -0.3, -0.4]), ("ppl", OPS.asarray([-0.2, -0.3, -0.4])),
("pl", [0.7, 0.8, 0.9]), ("pl", OPS.asarray([0.7, 0.8, 0.9])),
] ]
@ -171,8 +173,10 @@ def test_vectors_most_similar_identical():
@pytest.mark.parametrize("text", ["apple and orange"]) @pytest.mark.parametrize("text", ["apple and orange"])
def test_vectors_token_vector(tokenizer_v, vectors, text): def test_vectors_token_vector(tokenizer_v, vectors, text):
doc = tokenizer_v(text) doc = tokenizer_v(text)
assert vectors[0] == (doc[0].text, list(doc[0].vector)) assert vectors[0][0] == doc[0].text
assert vectors[1] == (doc[2].text, list(doc[2].vector)) assert all([a == b for a, b in zip(vectors[0][1], doc[0].vector)])
assert vectors[1][0] == doc[2].text
assert all([a == b for a, b in zip(vectors[1][1], doc[2].vector)])
@pytest.mark.parametrize("text", ["apple"]) @pytest.mark.parametrize("text", ["apple"])
@ -301,7 +305,7 @@ def test_vectors_doc_doc_similarity(vocab, text1, text2):
def test_vocab_add_vector(): def test_vocab_add_vector():
vocab = Vocab(vectors_name="test_vocab_add_vector") vocab = Vocab(vectors_name="test_vocab_add_vector")
data = numpy.ndarray((5, 3), dtype="f") data = OPS.xp.ndarray((5, 3), dtype="f")
data[0] = 1.0 data[0] = 1.0
data[1] = 2.0 data[1] = 2.0
vocab.set_vector("cat", data[0]) vocab.set_vector("cat", data[0])
@ -320,10 +324,10 @@ def test_vocab_prune_vectors():
_ = vocab["cat"] # noqa: F841 _ = vocab["cat"] # noqa: F841
_ = vocab["dog"] # noqa: F841 _ = vocab["dog"] # noqa: F841
_ = vocab["kitten"] # noqa: F841 _ = vocab["kitten"] # noqa: F841
data = numpy.ndarray((5, 3), dtype="f") data = OPS.xp.ndarray((5, 3), dtype="f")
data[0] = [1.0, 1.2, 1.1] data[0] = OPS.asarray([1.0, 1.2, 1.1])
data[1] = [0.3, 1.3, 1.0] data[1] = OPS.asarray([0.3, 1.3, 1.0])
data[2] = [0.9, 1.22, 1.05] data[2] = OPS.asarray([0.9, 1.22, 1.05])
vocab.set_vector("cat", data[0]) vocab.set_vector("cat", data[0])
vocab.set_vector("dog", data[1]) vocab.set_vector("dog", data[1])
vocab.set_vector("kitten", data[2]) vocab.set_vector("kitten", data[2])
@ -332,40 +336,41 @@ def test_vocab_prune_vectors():
assert list(remap.keys()) == ["kitten"] assert list(remap.keys()) == ["kitten"]
neighbour, similarity = list(remap.values())[0] neighbour, similarity = list(remap.values())[0]
assert neighbour == "cat", remap assert neighbour == "cat", remap
assert_allclose(similarity, get_cosine(data[0], data[2]), atol=1e-4, rtol=1e-3) cosine = get_cosine(data[0], data[2])
assert_allclose(float(similarity), cosine, atol=1e-4, rtol=1e-3)
def test_vectors_serialize(): def test_vectors_serialize():
data = numpy.asarray([[4, 2, 2, 2], [4, 2, 2, 2], [1, 1, 1, 1]], dtype="f") data = OPS.asarray([[4, 2, 2, 2], [4, 2, 2, 2], [1, 1, 1, 1]], dtype="f")
v = Vectors(data=data, keys=["A", "B", "C"]) v = Vectors(data=data, keys=["A", "B", "C"])
b = v.to_bytes() b = v.to_bytes()
v_r = Vectors() v_r = Vectors()
v_r.from_bytes(b) v_r.from_bytes(b)
assert_equal(v.data, v_r.data) assert_equal(OPS.to_numpy(v.data), OPS.to_numpy(v_r.data))
assert v.key2row == v_r.key2row assert v.key2row == v_r.key2row
v.resize((5, 4)) v.resize((5, 4))
v_r.resize((5, 4)) v_r.resize((5, 4))
row = v.add("D", vector=numpy.asarray([1, 2, 3, 4], dtype="f")) row = v.add("D", vector=OPS.asarray([1, 2, 3, 4], dtype="f"))
row_r = v_r.add("D", vector=numpy.asarray([1, 2, 3, 4], dtype="f")) row_r = v_r.add("D", vector=OPS.asarray([1, 2, 3, 4], dtype="f"))
assert row == row_r assert row == row_r
assert_equal(v.data, v_r.data) assert_equal(OPS.to_numpy(v.data), OPS.to_numpy(v_r.data))
assert v.is_full == v_r.is_full assert v.is_full == v_r.is_full
with make_tempdir() as d: with make_tempdir() as d:
v.to_disk(d) v.to_disk(d)
v_r.from_disk(d) v_r.from_disk(d)
assert_equal(v.data, v_r.data) assert_equal(OPS.to_numpy(v.data), OPS.to_numpy(v_r.data))
assert v.key2row == v_r.key2row assert v.key2row == v_r.key2row
v.resize((5, 4)) v.resize((5, 4))
v_r.resize((5, 4)) v_r.resize((5, 4))
row = v.add("D", vector=numpy.asarray([10, 20, 30, 40], dtype="f")) row = v.add("D", vector=OPS.asarray([10, 20, 30, 40], dtype="f"))
row_r = v_r.add("D", vector=numpy.asarray([10, 20, 30, 40], dtype="f")) row_r = v_r.add("D", vector=OPS.asarray([10, 20, 30, 40], dtype="f"))
assert row == row_r assert row == row_r
assert_equal(v.data, v_r.data) assert_equal(OPS.to_numpy(v.data), OPS.to_numpy(v_r.data))
def test_vector_is_oov(): def test_vector_is_oov():
vocab = Vocab(vectors_name="test_vocab_is_oov") vocab = Vocab(vectors_name="test_vocab_is_oov")
data = numpy.ndarray((5, 3), dtype="f") data = OPS.xp.ndarray((5, 3), dtype="f")
data[0] = 1.0 data[0] = 1.0
data[1] = 2.0 data[1] = 2.0
vocab.set_vector("cat", data[0]) vocab.set_vector("cat", data[0])

View File

@ -23,8 +23,8 @@ cdef class Tokenizer:
cdef object _infix_finditer cdef object _infix_finditer
cdef object _rules cdef object _rules
cdef PhraseMatcher _special_matcher cdef PhraseMatcher _special_matcher
cdef int _property_init_count cdef int _property_init_count # TODO: unused, remove in v3.1
cdef int _property_init_max cdef int _property_init_max # TODO: unused, remove in v3.1
cdef Doc _tokenize_affixes(self, unicode string, bint with_special_cases) cdef Doc _tokenize_affixes(self, unicode string, bint with_special_cases)
cdef int _apply_special_cases(self, Doc doc) except -1 cdef int _apply_special_cases(self, Doc doc) except -1

View File

@ -20,11 +20,12 @@ from .attrs import intify_attrs
from .symbols import ORTH, NORM from .symbols import ORTH, NORM
from .errors import Errors, Warnings from .errors import Errors, Warnings
from . import util from . import util
from .util import registry from .util import registry, get_words_and_spaces
from .attrs import intify_attrs from .attrs import intify_attrs
from .symbols import ORTH from .symbols import ORTH
from .scorer import Scorer from .scorer import Scorer
from .training import validate_examples from .training import validate_examples
from .tokens import Span
cdef class Tokenizer: cdef class Tokenizer:
@ -68,8 +69,6 @@ cdef class Tokenizer:
self._rules = {} self._rules = {}
self._special_matcher = PhraseMatcher(self.vocab) self._special_matcher = PhraseMatcher(self.vocab)
self._load_special_cases(rules) self._load_special_cases(rules)
self._property_init_count = 0
self._property_init_max = 4
property token_match: property token_match:
def __get__(self): def __get__(self):
@ -78,8 +77,6 @@ cdef class Tokenizer:
def __set__(self, token_match): def __set__(self, token_match):
self._token_match = token_match self._token_match = token_match
self._reload_special_cases() self._reload_special_cases()
if self._property_init_count <= self._property_init_max:
self._property_init_count += 1
property url_match: property url_match:
def __get__(self): def __get__(self):
@ -87,7 +84,7 @@ cdef class Tokenizer:
def __set__(self, url_match): def __set__(self, url_match):
self._url_match = url_match self._url_match = url_match
self._flush_cache() self._reload_special_cases()
property prefix_search: property prefix_search:
def __get__(self): def __get__(self):
@ -96,8 +93,6 @@ cdef class Tokenizer:
def __set__(self, prefix_search): def __set__(self, prefix_search):
self._prefix_search = prefix_search self._prefix_search = prefix_search
self._reload_special_cases() self._reload_special_cases()
if self._property_init_count <= self._property_init_max:
self._property_init_count += 1
property suffix_search: property suffix_search:
def __get__(self): def __get__(self):
@ -106,8 +101,6 @@ cdef class Tokenizer:
def __set__(self, suffix_search): def __set__(self, suffix_search):
self._suffix_search = suffix_search self._suffix_search = suffix_search
self._reload_special_cases() self._reload_special_cases()
if self._property_init_count <= self._property_init_max:
self._property_init_count += 1
property infix_finditer: property infix_finditer:
def __get__(self): def __get__(self):
@ -116,8 +109,6 @@ cdef class Tokenizer:
def __set__(self, infix_finditer): def __set__(self, infix_finditer):
self._infix_finditer = infix_finditer self._infix_finditer = infix_finditer
self._reload_special_cases() self._reload_special_cases()
if self._property_init_count <= self._property_init_max:
self._property_init_count += 1
property rules: property rules:
def __get__(self): def __get__(self):
@ -125,7 +116,7 @@ cdef class Tokenizer:
def __set__(self, rules): def __set__(self, rules):
self._rules = {} self._rules = {}
self._reset_cache([key for key in self._cache]) self._flush_cache()
self._flush_specials() self._flush_specials()
self._cache = PreshMap() self._cache = PreshMap()
self._specials = PreshMap() self._specials = PreshMap()
@ -225,6 +216,7 @@ cdef class Tokenizer:
self.mem.free(cached) self.mem.free(cached)
def _flush_specials(self): def _flush_specials(self):
self._special_matcher = PhraseMatcher(self.vocab)
for k in self._specials: for k in self._specials:
cached = <_Cached*>self._specials.get(k) cached = <_Cached*>self._specials.get(k)
del self._specials[k] del self._specials[k]
@ -567,7 +559,6 @@ cdef class Tokenizer:
"""Add special-case tokenization rules.""" """Add special-case tokenization rules."""
if special_cases is not None: if special_cases is not None:
for chunk, substrings in sorted(special_cases.items()): for chunk, substrings in sorted(special_cases.items()):
self._validate_special_case(chunk, substrings)
self.add_special_case(chunk, substrings) self.add_special_case(chunk, substrings)
def _validate_special_case(self, chunk, substrings): def _validate_special_case(self, chunk, substrings):
@ -615,16 +606,9 @@ cdef class Tokenizer:
self._special_matcher.add(string, None, self._tokenize_affixes(string, False)) self._special_matcher.add(string, None, self._tokenize_affixes(string, False))
def _reload_special_cases(self): def _reload_special_cases(self):
try: self._flush_cache()
self._property_init_count self._flush_specials()
except AttributeError: self._load_special_cases(self._rules)
return
# only reload if all 4 of prefix, suffix, infix, token_match
# have been initialized
if self.vocab is not None and self._property_init_count >= self._property_init_max:
self._flush_cache()
self._flush_specials()
self._load_special_cases(self._rules)
def explain(self, text): def explain(self, text):
"""A debugging tokenizer that provides information about which """A debugging tokenizer that provides information about which
@ -638,8 +622,14 @@ cdef class Tokenizer:
DOCS: https://spacy.io/api/tokenizer#explain DOCS: https://spacy.io/api/tokenizer#explain
""" """
prefix_search = self.prefix_search prefix_search = self.prefix_search
if prefix_search is None:
prefix_search = re.compile("a^").search
suffix_search = self.suffix_search suffix_search = self.suffix_search
if suffix_search is None:
suffix_search = re.compile("a^").search
infix_finditer = self.infix_finditer infix_finditer = self.infix_finditer
if infix_finditer is None:
infix_finditer = re.compile("a^").finditer
token_match = self.token_match token_match = self.token_match
if token_match is None: if token_match is None:
token_match = re.compile("a^").match token_match = re.compile("a^").match
@ -687,7 +677,7 @@ cdef class Tokenizer:
tokens.append(("URL_MATCH", substring)) tokens.append(("URL_MATCH", substring))
substring = '' substring = ''
elif substring in special_cases: elif substring in special_cases:
tokens.extend(("SPECIAL-" + str(i + 1), self.vocab.strings[e[ORTH]]) for i, e in enumerate(special_cases[substring])) tokens.extend((f"SPECIAL-{i + 1}", self.vocab.strings[e[ORTH]]) for i, e in enumerate(special_cases[substring]))
substring = '' substring = ''
elif list(infix_finditer(substring)): elif list(infix_finditer(substring)):
infixes = infix_finditer(substring) infixes = infix_finditer(substring)
@ -705,7 +695,33 @@ cdef class Tokenizer:
tokens.append(("TOKEN", substring)) tokens.append(("TOKEN", substring))
substring = '' substring = ''
tokens.extend(reversed(suffixes)) tokens.extend(reversed(suffixes))
return tokens # Find matches for special cases handled by special matcher
words, spaces = get_words_and_spaces([t[1] for t in tokens], text)
t_words = []
t_spaces = []
for word, space in zip(words, spaces):
if not word.isspace():
t_words.append(word)
t_spaces.append(space)
doc = Doc(self.vocab, words=t_words, spaces=t_spaces)
matches = self._special_matcher(doc)
spans = [Span(doc, s, e, label=m_id) for m_id, s, e in matches]
spans = util.filter_spans(spans)
# Replace matched tokens with their exceptions
i = 0
final_tokens = []
spans_by_start = {s.start: s for s in spans}
while i < len(tokens):
if i in spans_by_start:
span = spans_by_start[i]
exc = [d[ORTH] for d in special_cases[span.label_]]
for j, orth in enumerate(exc):
final_tokens.append((f"SPECIAL-{j + 1}", self.vocab.strings[orth]))
i += len(span)
else:
final_tokens.append(tokens[i])
i += 1
return final_tokens
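The replacement loop added to explain can be followed on plain Python data. This is a minimal sketch under stated assumptions: the function name apply_special_spans is hypothetical, matched spans are represented as a dict from start index to (length, replacement texts) instead of Span objects, and the token labels in the sample data are illustrative. It mirrors the "a/a." case from test_tokenizer_explain_special_matcher, where affix splitting yields four pieces and the special case "a." then covers the last two.

```python
def apply_special_spans(tokens, spans_by_start):
    """Replace runs of tokens covered by a special-case match.

    tokens: list of (label, text) pairs from affix-based splitting.
    spans_by_start: {start_index: (span_length, [replacement texts])}.
    """
    final_tokens = []
    i = 0
    while i < len(tokens):
        if i in spans_by_start:
            length, pieces = spans_by_start[i]
            for j, piece in enumerate(pieces):
                final_tokens.append((f"SPECIAL-{j + 1}", piece))
            i += length  # skip every token the match covered
        else:
            final_tokens.append(tokens[i])
            i += 1
    return final_tokens


tokens = [("TOKEN", "a"), ("INFIX", "/"), ("TOKEN", "a"), ("SUFFIX", ".")]
spans_by_start = {2: (2, ["a."])}  # tokens 2..3 are covered by the rule "a."
merged = apply_special_spans(tokens, spans_by_start)
```

Indexing the matches by start position keeps the loop to a single pass, and it relies on the matches having been filtered to be non-overlapping beforehand, as the hunk does with util.filter_spans.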
def score(self, examples, **kwargs): def score(self, examples, **kwargs):
validate_examples(examples, "Tokenizer.score") validate_examples(examples, "Tokenizer.score")
@ -778,6 +794,15 @@ cdef class Tokenizer:
"url_match": lambda b: data.setdefault("url_match", b), "url_match": lambda b: data.setdefault("url_match", b),
"exceptions": lambda b: data.setdefault("rules", b) "exceptions": lambda b: data.setdefault("rules", b)
} }
# reset all properties and flush all caches (through rules),
# reset rules first so that _reload_special_cases is trivial/fast as
# the other properties are reset
self.rules = {}
self.prefix_search = None
self.suffix_search = None
self.infix_finditer = None
self.token_match = None
self.url_match = None
msg = util.from_bytes(bytes_data, deserializers, exclude) msg = util.from_bytes(bytes_data, deserializers, exclude)
if "prefix_search" in data and isinstance(data["prefix_search"], str): if "prefix_search" in data and isinstance(data["prefix_search"], str):
self.prefix_search = re.compile(data["prefix_search"]).search self.prefix_search = re.compile(data["prefix_search"]).search
@ -785,22 +810,12 @@ cdef class Tokenizer:
self.suffix_search = re.compile(data["suffix_search"]).search self.suffix_search = re.compile(data["suffix_search"]).search
if "infix_finditer" in data and isinstance(data["infix_finditer"], str): if "infix_finditer" in data and isinstance(data["infix_finditer"], str):
self.infix_finditer = re.compile(data["infix_finditer"]).finditer self.infix_finditer = re.compile(data["infix_finditer"]).finditer
# for token_match and url_match, set to None to override the language
# defaults if no regex is provided
if "token_match" in data and isinstance(data["token_match"], str): if "token_match" in data and isinstance(data["token_match"], str):
self.token_match = re.compile(data["token_match"]).match self.token_match = re.compile(data["token_match"]).match
else:
self.token_match = None
if "url_match" in data and isinstance(data["url_match"], str): if "url_match" in data and isinstance(data["url_match"], str):
self.url_match = re.compile(data["url_match"]).match self.url_match = re.compile(data["url_match"]).match
else:
self.url_match = None
if "rules" in data and isinstance(data["rules"], dict): if "rules" in data and isinstance(data["rules"], dict):
# make sure to hard reset the cache to remove data from the default exceptions self.rules = data["rules"]
self._rules = {}
self._flush_cache()
self._flush_specials()
self._load_special_cases(data["rules"])
return self return self
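The explain changes in this file fall back to re.compile("a^") whenever prefix_search, suffix_search or infix_finditer is None, which from_bytes can now leave them as. The short check below uses only the standard library and shows why that pattern works as a no-op: it asks for a literal "a" followed by the start of the string, which cannot occur without the MULTILINE flag. The sample strings are arbitrary.

```python
import re

# A literal "a" followed by the start-of-string anchor: without re.MULTILINE,
# "^" only matches at position 0, so nothing can follow a consumed character.
NEVER = re.compile("a^")

samples = ["", "a", "a^", "aa", "a\na", "a/a."]
search_hits = [NEVER.search(s) for s in samples]
match_hits = [NEVER.match(s) for s in samples]
finditer_hits = [list(NEVER.finditer(s)) for s in samples]
```

Because search and match return None and finditer yields nothing, the rest of explain can call the three functions unconditionally without adding a None check at every call site.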

View File

@ -281,7 +281,8 @@ def _merge(Doc doc, merges):
for i in range(doc.length): for i in range(doc.length):
doc.c[i].head -= i doc.c[i].head -= i
# Set the left/right children, left/right edges # Set the left/right children, left/right edges
set_children_from_heads(doc.c, 0, doc.length) if doc.has_annotation("DEP"):
set_children_from_heads(doc.c, 0, doc.length)
# Make sure ent_iob remains consistent # Make sure ent_iob remains consistent
make_iob_consistent(doc.c, doc.length) make_iob_consistent(doc.c, doc.length)
# Return the merged Python object # Return the merged Python object
@ -294,7 +295,19 @@ def _resize_tensor(tensor, ranges):
for i in range(start, end-1): for i in range(start, end-1):
delete.append(i) delete.append(i)
xp = get_array_module(tensor) xp = get_array_module(tensor)
return xp.delete(tensor, delete, axis=0) if xp is numpy:
return xp.delete(tensor, delete, axis=0)
else:
offset = 0
copy_start = 0
resized_shape = (tensor.shape[0] - len(delete), tensor.shape[1])
for start, end in ranges:
if copy_start > 0:
tensor[copy_start - offset:start - offset] = tensor[copy_start: start]
offset += end - start - 1
copy_start = end - 1
tensor[copy_start - offset:resized_shape[0]] = tensor[copy_start:]
return xp.asarray(tensor[:resized_shape[0]])
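The non-numpy branch of _resize_tensor avoids xp.delete by shifting the surviving rows forward and truncating. The sketch below replays that copy loop on a plain Python list so it can be compared with the delete-based result. The helper names are hypothetical, the rows are integers standing in for tensor rows, and ranges is assumed to be sorted and non-overlapping, as merge ranges would be.

```python
def rows_to_delete(ranges):
    # For each merged span (start, end), every row except the last is dropped.
    delete = []
    for start, end in ranges:
        delete.extend(range(start, end - 1))
    return delete


def resize_by_delete(rows, ranges):
    # Reference behaviour: drop the listed rows, as numpy.delete would.
    drop = set(rows_to_delete(ranges))
    return [row for i, row in enumerate(rows) if i not in drop]


def resize_by_shifting(rows, ranges):
    # Same copy loop as the non-numpy branch, on a copy of the input list.
    rows = list(rows)
    offset = 0
    copy_start = 0
    new_len = len(rows) - len(rows_to_delete(ranges))
    for start, end in ranges:
        if copy_start > 0:
            rows[copy_start - offset:start - offset] = rows[copy_start:start]
        offset += end - start - 1
        copy_start = end - 1
    rows[copy_start - offset:new_len] = rows[copy_start:]
    return rows[:new_len]


rows = list(range(8))
ranges = [(1, 3), (5, 8)]  # merge tokens 1-2 and tokens 5-7
by_delete = resize_by_delete(rows, ranges)
by_shifting = resize_by_shifting(rows, ranges)
```

Each slice assignment copies as many rows as it overwrites, so the list never changes length until the final truncation.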
def _split(Doc doc, int token_index, orths, heads, attrs): def _split(Doc doc, int token_index, orths, heads, attrs):
@ -331,7 +344,13 @@ def _split(Doc doc, int token_index, orths, heads, attrs):
to_process_tensor = (doc.tensor is not None and doc.tensor.size != 0) to_process_tensor = (doc.tensor is not None and doc.tensor.size != 0)
if to_process_tensor: if to_process_tensor:
xp = get_array_module(doc.tensor) xp = get_array_module(doc.tensor)
doc.tensor = xp.append(doc.tensor, xp.zeros((nb_subtokens,doc.tensor.shape[1]), dtype="float32"), axis=0) if xp is numpy:
doc.tensor = xp.append(doc.tensor, xp.zeros((nb_subtokens,doc.tensor.shape[1]), dtype="float32"), axis=0)
else:
shape = (doc.tensor.shape[0] + nb_subtokens, doc.tensor.shape[1])
resized_array = xp.zeros(shape, dtype="float32")
resized_array[:doc.tensor.shape[0]] = doc.tensor[:doc.tensor.shape[0]]
doc.tensor = resized_array
for token_to_move in range(orig_length - 1, token_index, -1): for token_to_move in range(orig_length - 1, token_index, -1):
doc.c[token_to_move + nb_subtokens - 1] = doc.c[token_to_move] doc.c[token_to_move + nb_subtokens - 1] = doc.c[token_to_move]
if to_process_tensor: if to_process_tensor:
@ -348,7 +367,7 @@ def _split(Doc doc, int token_index, orths, heads, attrs):
token.norm = 0 # reset norm token.norm = 0 # reset norm
if to_process_tensor: if to_process_tensor:
# setting the tensors of the split tokens to array of zeros # setting the tensors of the split tokens to array of zeros
doc.tensor[token_index + i] = xp.zeros((1,doc.tensor.shape[1]), dtype="float32") doc.tensor[token_index + i:token_index + i + 1] = xp.zeros((1,doc.tensor.shape[1]), dtype="float32")
# Update the character offset of the subtokens # Update the character offset of the subtokens
if i != 0: if i != 0:
token.idx = orig_token.idx + idx_offset token.idx = orig_token.idx + idx_offset
@ -392,7 +411,8 @@ def _split(Doc doc, int token_index, orths, heads, attrs):
for i in range(doc.length): for i in range(doc.length):
doc.c[i].head -= i doc.c[i].head -= i
# set children from head # set children from head
set_children_from_heads(doc.c, 0, doc.length) if doc.has_annotation("DEP"):
set_children_from_heads(doc.c, 0, doc.length)
def _validate_extensions(extensions): def _validate_extensions(extensions):

View File

@ -6,7 +6,7 @@ from libc.math cimport sqrt
from libc.stdint cimport int32_t, uint64_t from libc.stdint cimport int32_t, uint64_t
import copy import copy
from collections import Counter from collections import Counter, defaultdict
from enum import Enum from enum import Enum
import itertools import itertools
import numpy import numpy
@ -1120,13 +1120,14 @@ cdef class Doc:
concat_words = [] concat_words = []
concat_spaces = [] concat_spaces = []
concat_user_data = {} concat_user_data = {}
concat_spans = defaultdict(list)
char_offset = 0 char_offset = 0
for doc in docs: for doc in docs:
concat_words.extend(t.text for t in doc) concat_words.extend(t.text for t in doc)
concat_spaces.extend(bool(t.whitespace_) for t in doc) concat_spaces.extend(bool(t.whitespace_) for t in doc)
for key, value in doc.user_data.items(): for key, value in doc.user_data.items():
if isinstance(key, tuple) and len(key) == 4: if isinstance(key, tuple) and len(key) == 4 and key[0] == "._.":
data_type, name, start, end = key data_type, name, start, end = key
if start is not None or end is not None: if start is not None or end is not None:
start += char_offset start += char_offset
@ -1137,8 +1138,17 @@ cdef class Doc:
warnings.warn(Warnings.W101.format(name=name)) warnings.warn(Warnings.W101.format(name=name))
else: else:
warnings.warn(Warnings.W102.format(key=key, value=value)) warnings.warn(Warnings.W102.format(key=key, value=value))
for key in doc.spans:
for span in doc.spans[key]:
concat_spans[key].append((
span.start_char + char_offset,
span.end_char + char_offset,
span.label,
span.kb_id,
span.text, # included as a check
))
char_offset += len(doc.text) char_offset += len(doc.text)
if ensure_whitespace and not (len(doc) > 0 and doc[-1].is_space): if len(doc) > 0 and ensure_whitespace and not doc[-1].is_space:
char_offset += 1 char_offset += 1
arrays = [doc.to_array(attrs) for doc in docs] arrays = [doc.to_array(attrs) for doc in docs]
@ -1160,6 +1170,22 @@ cdef class Doc:
concat_doc.from_array(attrs, concat_array) concat_doc.from_array(attrs, concat_array)
for key in concat_spans:
if key not in concat_doc.spans:
concat_doc.spans[key] = []
for span_tuple in concat_spans[key]:
span = concat_doc.char_span(
span_tuple[0],
span_tuple[1],
label=span_tuple[2],
kb_id=span_tuple[3],
)
text = span_tuple[4]
if span is not None and span.text == text:
concat_doc.spans[key].append(span)
else:
raise ValueError(Errors.E873.format(key=key, text=text))
return concat_doc return concat_doc
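The span handling added to Doc.from_docs records character offsets shifted by a running char_offset and then re-checks the text after concatenation. The sketch below reproduces that bookkeeping on plain strings. It is a simplification under stated assumptions: the function name concat_with_spans is hypothetical, each input is a (text, spans) pair with spans given as (start_char, end_char, label), and a separating space is appended after every non-empty text that does not already end in whitespace.

```python
def concat_with_spans(parts, ensure_whitespace=True):
    """Join texts and shift their character spans into the joined text."""
    pieces = []
    shifted = []
    char_offset = 0
    for text, spans in parts:
        for start, end, label in spans:
            # Keep the covered text so the result can be verified afterwards.
            shifted.append((start + char_offset, end + char_offset, label, text[start:end]))
        pieces.append(text)
        char_offset += len(text)
        if text and ensure_whitespace and not text[-1].isspace():
            pieces.append(" ")
            char_offset += 1
    full = "".join(pieces)
    for start, end, label, expected in shifted:
        if full[start:end] != expected:
            raise ValueError(f"span {label!r} no longer covers {expected!r}")
    return full, [(start, end, label) for start, end, label, _ in shifted]


parts = [
    ("I flew to San Francisco", [(10, 23, "CITY")]),
    ("She lives in Berlin", [(13, 19, "CITY")]),
]
full_text, spans = concat_with_spans(parts)
```

Carrying the original span text along as a check is the same safeguard the hunk uses before raising E873: a wrong offset is reported instead of silently attaching a span to the wrong characters.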
def get_lca_matrix(self): def get_lca_matrix(self):

View File

@ -6,6 +6,7 @@ from libc.math cimport sqrt
import numpy import numpy
from thinc.api import get_array_module from thinc.api import get_array_module
import warnings import warnings
import copy
from .doc cimport token_by_start, token_by_end, get_token_attr, _get_lca_matrix from .doc cimport token_by_start, token_by_end, get_token_attr, _get_lca_matrix
from ..structs cimport TokenC, LexemeC from ..structs cimport TokenC, LexemeC
@ -241,7 +242,19 @@ cdef class Span:
if cat_start == self.start_char and cat_end == self.end_char: if cat_start == self.start_char and cat_end == self.end_char:
doc.cats[cat_label] = value doc.cats[cat_label] = value
if copy_user_data: if copy_user_data:
doc.user_data = self.doc.user_data user_data = {}
char_offset = self.start_char
for key, value in self.doc.user_data.items():
if isinstance(key, tuple) and len(key) == 4 and key[0] == "._.":
data_type, name, start, end = key
if start is not None or end is not None:
start -= char_offset
if end is not None:
end -= char_offset
user_data[(data_type, name, start, end)] = copy.copy(value)
else:
user_data[key] = copy.copy(value)
doc.user_data = user_data
return doc return doc
def _fix_dep_copy(self, attrs, array): def _fix_dep_copy(self, attrs, array):

View File

@ -8,3 +8,4 @@ from .iob_utils import biluo_tags_to_spans, tags_to_entities # noqa: F401
from .gold_io import docs_to_json, read_json_file # noqa: F401 from .gold_io import docs_to_json, read_json_file # noqa: F401
from .batchers import minibatch_by_padded_size, minibatch_by_words # noqa: F401 from .batchers import minibatch_by_padded_size, minibatch_by_words # noqa: F401
from .loggers import console_logger, wandb_logger # noqa: F401 from .loggers import console_logger, wandb_logger # noqa: F401
from .callbacks import create_copy_from_base_model # noqa: F401

View File

@ -0,0 +1,32 @@
from typing import Optional
from ..errors import Errors
from ..language import Language
from ..util import load_model, registry, logger
@registry.callbacks("spacy.copy_from_base_model.v1")
def create_copy_from_base_model(
tokenizer: Optional[str] = None,
vocab: Optional[str] = None,
) -> Language:
def copy_from_base_model(nlp):
if tokenizer:
logger.info(f"Copying tokenizer from: {tokenizer}")
base_nlp = load_model(tokenizer)
if nlp.config["nlp"]["tokenizer"] == base_nlp.config["nlp"]["tokenizer"]:
nlp.tokenizer.from_bytes(base_nlp.tokenizer.to_bytes(exclude=["vocab"]))
else:
raise ValueError(
Errors.E872.format(
curr_config=nlp.config["nlp"]["tokenizer"],
base_config=base_nlp.config["nlp"]["tokenizer"],
)
)
if vocab:
logger.info(f"Copying vocab from: {vocab}")
# only reload if the vocab is from a different model
if tokenizer != vocab:
base_nlp = load_model(vocab)
nlp.vocab.from_bytes(base_nlp.vocab.to_bytes())
return copy_from_base_model

View File

@ -124,6 +124,9 @@ def segment_sents_and_docs(doc, n_sents, doc_delimiter, model=None, msg=None):
nlp = load_model(model) nlp = load_model(model)
if "parser" in nlp.pipe_names: if "parser" in nlp.pipe_names:
msg.info(f"Segmenting sentences with parser from model '{model}'.") msg.info(f"Segmenting sentences with parser from model '{model}'.")
for name, proc in nlp.pipeline:
if "parser" in getattr(proc, "listening_components", []):
nlp.replace_listeners(name, "parser", ["model.tok2vec"])
sentencizer = nlp.get_pipe("parser") sentencizer = nlp.get_pipe("parser")
if not sentencizer: if not sentencizer:
msg.info( msg.info(


@ -2,6 +2,7 @@ import warnings
from typing import Union, List, Iterable, Iterator, TYPE_CHECKING, Callable from typing import Union, List, Iterable, Iterator, TYPE_CHECKING, Callable
from typing import Optional from typing import Optional
from pathlib import Path from pathlib import Path
import random
import srsly import srsly
from .. import util from .. import util
@ -96,6 +97,7 @@ class Corpus:
Defaults to 0, which indicates no limit. Defaults to 0, which indicates no limit.
augment (Callable[Example, Iterable[Example]]): Optional data augmentation augment (Callable[Example, Iterable[Example]]): Optional data augmentation
function, to extrapolate additional examples from your annotations. function, to extrapolate additional examples from your annotations.
shuffle (bool): Whether to shuffle the examples.
DOCS: https://spacy.io/api/corpus DOCS: https://spacy.io/api/corpus
""" """
@ -108,12 +110,14 @@ class Corpus:
gold_preproc: bool = False, gold_preproc: bool = False,
max_length: int = 0, max_length: int = 0,
augmenter: Optional[Callable] = None, augmenter: Optional[Callable] = None,
shuffle: bool = False,
) -> None: ) -> None:
self.path = util.ensure_path(path) self.path = util.ensure_path(path)
self.gold_preproc = gold_preproc self.gold_preproc = gold_preproc
self.max_length = max_length self.max_length = max_length
self.limit = limit self.limit = limit
self.augmenter = augmenter if augmenter is not None else dont_augment self.augmenter = augmenter if augmenter is not None else dont_augment
self.shuffle = shuffle
def __call__(self, nlp: "Language") -> Iterator[Example]: def __call__(self, nlp: "Language") -> Iterator[Example]:
"""Yield examples from the data. """Yield examples from the data.
@ -124,6 +128,10 @@ class Corpus:
DOCS: https://spacy.io/api/corpus#call DOCS: https://spacy.io/api/corpus#call
""" """
ref_docs = self.read_docbin(nlp.vocab, walk_corpus(self.path, FILE_TYPE)) ref_docs = self.read_docbin(nlp.vocab, walk_corpus(self.path, FILE_TYPE))
if self.shuffle:
ref_docs = list(ref_docs)
random.shuffle(ref_docs)
if self.gold_preproc: if self.gold_preproc:
examples = self.make_examples_gold_preproc(nlp, ref_docs) examples = self.make_examples_gold_preproc(nlp, ref_docs)
else: else:
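
The `shuffle` branch above has to load the lazily read docs into a list first, because `random.shuffle` works in place on a mutable sequence. A minimal stand-alone sketch of that pattern follows; `read_docs` and `maybe_shuffle` are hypothetical names used only for illustration and are not part of the change:

```python
import random
from typing import Iterable, Iterator


def read_docs(n: int) -> Iterator[int]:
    # Hypothetical stand-in for a lazy reader such as Corpus.read_docbin.
    for i in range(n):
        yield i


def maybe_shuffle(docs: Iterable[int], shuffle: bool) -> Iterable[int]:
    # random.shuffle needs a mutable sequence, so the stream is loaded
    # into memory first; without shuffling it stays lazy.
    if shuffle:
        docs = list(docs)
        random.shuffle(docs)
    return docs
```

The trade-off is memory: with `shuffle` enabled the whole corpus is held in a list, so this is likely a poor fit for corpora that are meant to be streamed.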


@ -13,7 +13,7 @@ from .iob_utils import biluo_tags_to_spans
from ..errors import Errors, Warnings from ..errors import Errors, Warnings
from ..pipeline._parser_internals import nonproj from ..pipeline._parser_internals import nonproj
from ..tokens.token cimport MISSING_DEP from ..tokens.token cimport MISSING_DEP
from ..util import logger from ..util import logger, to_ternary_int
cpdef Doc annotations_to_doc(vocab, tok_annot, doc_annot): cpdef Doc annotations_to_doc(vocab, tok_annot, doc_annot):
@ -213,18 +213,19 @@ cdef class Example:
else: else:
return [None] * len(self.x) return [None] * len(self.x)
def get_aligned_spans_x2y(self, x_spans): def get_aligned_spans_x2y(self, x_spans, allow_overlap=False):
return self._get_aligned_spans(self.y, x_spans, self.alignment.x2y) return self._get_aligned_spans(self.y, x_spans, self.alignment.x2y, allow_overlap)
def get_aligned_spans_y2x(self, y_spans): def get_aligned_spans_y2x(self, y_spans, allow_overlap=False):
return self._get_aligned_spans(self.x, y_spans, self.alignment.y2x) return self._get_aligned_spans(self.x, y_spans, self.alignment.y2x, allow_overlap)
def _get_aligned_spans(self, doc, spans, align): def _get_aligned_spans(self, doc, spans, align, allow_overlap):
seen = set() seen = set()
output = [] output = []
for span in spans: for span in spans:
indices = align[span.start : span.end].data.ravel() indices = align[span.start : span.end].data.ravel()
indices = [idx for idx in indices if idx not in seen] if not allow_overlap:
indices = [idx for idx in indices if idx not in seen]
if len(indices) >= 1: if len(indices) >= 1:
aligned_span = Span(doc, indices[0], indices[-1] + 1, label=span.label) aligned_span = Span(doc, indices[0], indices[-1] + 1, label=span.label)
target_text = span.text.lower().strip().replace(" ", "") target_text = span.text.lower().strip().replace(" ", "")
@ -237,7 +238,7 @@ cdef class Example:
def get_aligned_ner(self): def get_aligned_ner(self):
if not self.y.has_annotation("ENT_IOB"): if not self.y.has_annotation("ENT_IOB"):
return [None] * len(self.x) # should this be 'missing' instead of 'None' ? return [None] * len(self.x) # should this be 'missing' instead of 'None' ?
x_ents = self.get_aligned_spans_y2x(self.y.ents) x_ents = self.get_aligned_spans_y2x(self.y.ents, allow_overlap=False)
# Default to 'None' for missing values # Default to 'None' for missing values
x_tags = offsets_to_biluo_tags( x_tags = offsets_to_biluo_tags(
self.x, self.x,
@ -337,7 +338,7 @@ def _annot2array(vocab, tok_annot, doc_annot):
values.append([vocab.strings.add(h) if h is not None else MISSING_DEP for h in value]) values.append([vocab.strings.add(h) if h is not None else MISSING_DEP for h in value])
elif key == "SENT_START": elif key == "SENT_START":
attrs.append(key) attrs.append(key)
values.append(value) values.append([to_ternary_int(v) for v in value])
elif key == "MORPH": elif key == "MORPH":
attrs.append(key) attrs.append(key)
values.append([vocab.morphology.add(v) for v in value]) values.append([vocab.morphology.add(v) for v in value])
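
The effect of `allow_overlap` in `_get_aligned_spans` can be illustrated without spaCy. The sketch below is a simplified assumption-based model: spans are plain `(start, end)` tuples, `align` is a list mapping each source token to its target token indices, and the text-matching check from the real method is left out. `align_spans` is a hypothetical name.

```python
from typing import List, Tuple


def align_spans(
    spans: List[Tuple[int, int]],
    align: List[List[int]],
    allow_overlap: bool = False,
) -> List[Tuple[int, int]]:
    seen = set()
    output = []
    for start, end in spans:
        # Collect the target indices covered by this source span.
        indices = [i for pos in range(start, end) for i in align[pos]]
        if not allow_overlap:
            # Drop target tokens already claimed by an earlier span.
            indices = [i for i in indices if i not in seen]
        if indices:
            output.append((indices[0], indices[-1] + 1))
            seen.update(indices)
    return output
```

With a one-to-one alignment over four tokens, the spans `(0, 3)` and `(2, 4)` come back as `(0, 3)` and `(3, 4)` by default, and unchanged when `allow_overlap=True`.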


@ -121,7 +121,7 @@ def json_to_annotations(doc):
if i == 0: if i == 0:
sent_starts.append(1) sent_starts.append(1)
else: else:
sent_starts.append(0) sent_starts.append(-1)
if "brackets" in sent: if "brackets" in sent:
brackets.extend((b["first"] + sent_start_i, brackets.extend((b["first"] + sent_start_i,
b["last"] + sent_start_i, b["label"]) b["last"] + sent_start_i, b["label"])


@ -8,6 +8,7 @@ import tarfile
import gzip import gzip
import zipfile import zipfile
import tqdm import tqdm
from itertools import islice
from .pretrain import get_tok2vec_ref from .pretrain import get_tok2vec_ref
from ..lookups import Lookups from ..lookups import Lookups
@ -68,7 +69,11 @@ def init_nlp(config: Config, *, use_gpu: int = -1) -> "Language":
# Make sure that listeners are defined before initializing further # Make sure that listeners are defined before initializing further
nlp._link_components() nlp._link_components()
with nlp.select_pipes(disable=[*frozen_components, *resume_components]): with nlp.select_pipes(disable=[*frozen_components, *resume_components]):
nlp.initialize(lambda: train_corpus(nlp), sgd=optimizer) if T["max_epochs"] == -1:
logger.debug("Due to streamed train corpus, using only first 100 examples for initialization. If necessary, provide all labels in [initialize]. More info: https://spacy.io/api/cli#init_labels")
nlp.initialize(lambda: islice(train_corpus(nlp), 100), sgd=optimizer)
else:
nlp.initialize(lambda: train_corpus(nlp), sgd=optimizer)
logger.info(f"Initialized pipeline components: {nlp.pipe_names}") logger.info(f"Initialized pipeline components: {nlp.pipe_names}")
# Detect components with listeners that are not frozen consistently # Detect components with listeners that are not frozen consistently
for name, proc in nlp.pipeline: for name, proc in nlp.pipeline:
@ -133,6 +138,10 @@ def load_vectors_into_model(
) )
err = ConfigValidationError.from_error(e, title=title, desc=desc) err = ConfigValidationError.from_error(e, title=title, desc=desc)
raise err from None raise err from None
if len(vectors_nlp.vocab.vectors.keys()) == 0:
logger.warning(Warnings.W112.format(name=name))
nlp.vocab.vectors = vectors_nlp.vocab.vectors nlp.vocab.vectors = vectors_nlp.vocab.vectors
if add_strings: if add_strings:
# I guess we should add the strings from the vectors_nlp model? # I guess we should add the strings from the vectors_nlp model?
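
The streamed-corpus branch in `init_nlp` above caps initialization at the first 100 examples with `itertools.islice`. A minimal sketch of that idea, using a hypothetical endless `stream` generator in place of `train_corpus(nlp)` and a hypothetical helper `get_init_examples`:

```python
from itertools import islice
from typing import Callable, Iterator


def stream() -> Iterator[int]:
    # Hypothetical endless corpus; stands in for train_corpus(nlp).
    i = 0
    while True:
        yield i
        i += 1


def get_init_examples(
    corpus: Callable[[], Iterator[int]], max_epochs: int, n: int = 100
) -> Callable[[], Iterator[int]]:
    # With max_epochs == -1 the corpus is treated as a stream, so only the
    # first n examples are taken; otherwise the whole corpus is used.
    if max_epochs == -1:
        return lambda: islice(corpus(), n)
    return corpus
```

Because only a prefix of the stream is seen, labels that first appear later would be missed, which is why the debug message points to providing labels in `[initialize]`.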


@ -101,8 +101,13 @@ def console_logger(progress_bar: bool = False):
return setup_printer return setup_printer
@registry.loggers("spacy.WandbLogger.v1") @registry.loggers("spacy.WandbLogger.v2")
def wandb_logger(project_name: str, remove_config_values: List[str] = []): def wandb_logger(
project_name: str,
remove_config_values: List[str] = [],
model_log_interval: Optional[int] = None,
log_dataset_dir: Optional[str] = None,
):
try: try:
import wandb import wandb
from wandb import init, log, join # test that these are available from wandb import init, log, join # test that these are available
@ -119,9 +124,23 @@ def wandb_logger(project_name: str, remove_config_values: List[str] = []):
for field in remove_config_values: for field in remove_config_values:
del config_dot[field] del config_dot[field]
config = util.dot_to_dict(config_dot) config = util.dot_to_dict(config_dot)
wandb.init(project=project_name, config=config, reinit=True) run = wandb.init(project=project_name, config=config, reinit=True)
console_log_step, console_finalize = console(nlp, stdout, stderr) console_log_step, console_finalize = console(nlp, stdout, stderr)
def log_dir_artifact(
path: str,
name: str,
type: str,
metadata: Optional[Dict[str, Any]] = {},
aliases: Optional[List[str]] = [],
):
dataset_artifact = wandb.Artifact(name, type=type, metadata=metadata)
dataset_artifact.add_dir(path, name=name)
wandb.log_artifact(dataset_artifact, aliases=aliases)
if log_dataset_dir:
log_dir_artifact(path=log_dataset_dir, name="dataset", type="dataset")
def log_step(info: Optional[Dict[str, Any]]): def log_step(info: Optional[Dict[str, Any]]):
console_log_step(info) console_log_step(info)
if info is not None: if info is not None:
@ -133,6 +152,21 @@ def wandb_logger(project_name: str, remove_config_values: List[str] = []):
wandb.log({f"loss_{k}": v for k, v in losses.items()}) wandb.log({f"loss_{k}": v for k, v in losses.items()})
if isinstance(other_scores, dict): if isinstance(other_scores, dict):
wandb.log(other_scores) wandb.log(other_scores)
if model_log_interval and info.get("output_path"):
if info["step"] % model_log_interval == 0 and info["step"] != 0:
log_dir_artifact(
path=info["output_path"],
name="pipeline_" + run.id,
type="checkpoint",
metadata=info,
aliases=[
f"epoch {info['epoch']} step {info['step']}",
"latest",
"best"
if info["score"] == max(info["checkpoints"])[0]
else "",
],
)
def finalize() -> None: def finalize() -> None:
console_finalize() console_finalize()
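
The alias list built for each checkpoint artifact above can be sketched in isolation. This is a variant, not the code in the change: it assumes `info["checkpoints"]` is a list of `(score, step)` tuples and it leaves out the `"best"` alias entirely instead of passing an empty string when the checkpoint is not the best one. `checkpoint_aliases` is a hypothetical helper name.

```python
from typing import Any, Dict, List


def checkpoint_aliases(info: Dict[str, Any]) -> List[str]:
    # info["checkpoints"] is assumed to be a list of (score, step) tuples,
    # so max(...) picks the entry with the highest score.
    best_score = max(info["checkpoints"])[0]
    aliases = [f"epoch {info['epoch']} step {info['step']}", "latest"]
    if info["score"] == best_score:
        aliases.append("best")
    return aliases
```

Keeping the list free of empty strings avoids relying on how the artifact backend treats a blank alias.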


@ -78,7 +78,7 @@ def train(
training_step_iterator = train_while_improving( training_step_iterator = train_while_improving(
nlp, nlp,
optimizer, optimizer,
create_train_batches(train_corpus(nlp), batcher, T["max_epochs"]), create_train_batches(nlp, train_corpus, batcher, T["max_epochs"]),
create_evaluation_callback(nlp, dev_corpus, score_weights), create_evaluation_callback(nlp, dev_corpus, score_weights),
dropout=T["dropout"], dropout=T["dropout"],
accumulate_gradient=T["accumulate_gradient"], accumulate_gradient=T["accumulate_gradient"],
@ -96,12 +96,13 @@ def train(
log_step, finalize_logger = train_logger(nlp, stdout, stderr) log_step, finalize_logger = train_logger(nlp, stdout, stderr)
try: try:
for batch, info, is_best_checkpoint in training_step_iterator: for batch, info, is_best_checkpoint in training_step_iterator:
log_step(info if is_best_checkpoint is not None else None)
if is_best_checkpoint is not None: if is_best_checkpoint is not None:
with nlp.select_pipes(disable=frozen_components): with nlp.select_pipes(disable=frozen_components):
update_meta(T, nlp, info) update_meta(T, nlp, info)
if output_path is not None: if output_path is not None:
save_checkpoint(is_best_checkpoint) save_checkpoint(is_best_checkpoint)
info["output_path"] = str(output_path / DIR_MODEL_LAST)
log_step(info if is_best_checkpoint is not None else None)
except Exception as e: except Exception as e:
if output_path is not None: if output_path is not None:
stdout.write( stdout.write(
@ -289,17 +290,22 @@ def create_evaluation_callback(
def create_train_batches( def create_train_batches(
iterator: Iterator[Example], nlp: "Language",
corpus: Callable[["Language"], Iterable[Example]],
batcher: Callable[[Iterable[Example]], Iterable[Example]], batcher: Callable[[Iterable[Example]], Iterable[Example]],
max_epochs: int, max_epochs: int,
): ):
epoch = 0 epoch = 0
examples = list(iterator) if max_epochs >= 0:
if not examples: examples = list(corpus(nlp))
# Raise error if no data if not examples:
raise ValueError(Errors.E986) # Raise error if no data
raise ValueError(Errors.E986)
while max_epochs < 1 or epoch != max_epochs: while max_epochs < 1 or epoch != max_epochs:
random.shuffle(examples) if max_epochs >= 0:
random.shuffle(examples)
else:
examples = corpus(nlp)
for batch in batcher(examples): for batch in batcher(examples):
yield epoch, batch yield epoch, batch
epoch += 1 epoch += 1
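
The reworked `create_train_batches` above distinguishes two modes: with `max_epochs >= 0` the corpus is loaded once and reshuffled every epoch, while `max_epochs == -1` re-invokes the corpus each epoch and never materializes it. A stand-alone sketch of that control flow, with integers in place of `Example` objects and a hypothetical `pairs` batcher:

```python
import random
from typing import Callable, Iterable, Iterator, List, Tuple


def pairs(items: Iterable[int]) -> Iterator[List[int]]:
    # Hypothetical batcher: groups items into batches of up to two.
    batch: List[int] = []
    for item in items:
        batch.append(item)
        if len(batch) == 2:
            yield batch
            batch = []
    if batch:
        yield batch


def train_batches(
    corpus: Callable[[], Iterable[int]],
    batcher: Callable[[Iterable[int]], Iterable[List[int]]],
    max_epochs: int,
) -> Iterator[Tuple[int, List[int]]]:
    epoch = 0
    if max_epochs >= 0:
        # In-memory mode: load everything once and fail early if empty.
        examples = list(corpus())
        if not examples:
            raise ValueError("no training data")
    while max_epochs < 1 or epoch != max_epochs:
        if max_epochs >= 0:
            random.shuffle(examples)
        else:
            # Streaming mode: ask the corpus for a fresh iterator.
            examples = corpus()
        for batch in batcher(examples):
            yield epoch, batch
        epoch += 1
```

In streaming mode there is no shuffling and no empty-corpus check, so ordering and validation are left to the corpus reader itself.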


@ -36,7 +36,7 @@ except ImportError:
try: # Python 3.8 try: # Python 3.8
import importlib.metadata as importlib_metadata import importlib.metadata as importlib_metadata
except ImportError: except ImportError:
import importlib_metadata from catalogue import _importlib_metadata as importlib_metadata
# These are functions that were previously (v2.x) available from spacy.util # These are functions that were previously (v2.x) available from spacy.util
# and have since moved to Thinc. We're importing them here so people's code # and have since moved to Thinc. We're importing them here so people's code
@ -1526,3 +1526,18 @@ def check_lexeme_norms(vocab, component_name):
if len(lexeme_norms) == 0 and vocab.lang in LEXEME_NORM_LANGS: if len(lexeme_norms) == 0 and vocab.lang in LEXEME_NORM_LANGS:
langs = ", ".join(LEXEME_NORM_LANGS) langs = ", ".join(LEXEME_NORM_LANGS)
logger.debug(Warnings.W033.format(model=component_name, langs=langs)) logger.debug(Warnings.W033.format(model=component_name, langs=langs))
def to_ternary_int(val) -> int:
"""Convert a value to the ternary 1/0/-1 int used for True/None/False in
attributes such as SENT_START: True/1/1.0 is 1 (True), None/0/0.0 is 0
(None), any other values are -1 (False).
"""
if isinstance(val, float):
val = int(val)
if val is True or val is 1:
return 1
elif val is None or val is 0:
return 0
else:
return -1
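
The identity checks `val is 1` and `val is 0` above depend on CPython caching small integers and raise a `SyntaxWarning` on Python 3.8 and later. One possible variant, offered as a sketch rather than as the library's implementation, handles `True`, `None` and `False` by identity first and then falls back to equality:

```python
def to_ternary_int(val) -> int:
    """Map True/1/1.0 to 1, None/0/0.0 to 0 and everything else to -1."""
    # Booleans and None are checked by identity first, because
    # False == 0 and True == 1 under equality.
    if val is True:
        return 1
    elif val is None:
        return 0
    elif val is False:
        return -1
    elif val == 1:
        return 1
    elif val == 0:
        return 0
    else:
        return -1
```

Note that this variant does not truncate floats, so a value such as `0.5` maps to `-1` here instead of being converted with `int()` first.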


@ -55,7 +55,7 @@ cdef class Vectors:
"""Create a new vector store. """Create a new vector store.
shape (tuple): Size of the table, as (# entries, # columns) shape (tuple): Size of the table, as (# entries, # columns)
data (numpy.ndarray): The vector data. data (numpy.ndarray or cupy.ndarray): The vector data.
keys (iterable): A sequence of keys, aligned with the data. keys (iterable): A sequence of keys, aligned with the data.
name (str): A name to identify the vectors table. name (str): A name to identify the vectors table.
@ -65,7 +65,8 @@ cdef class Vectors:
if data is None: if data is None:
if shape is None: if shape is None:
shape = (0,0) shape = (0,0)
data = numpy.zeros(shape, dtype="f") ops = get_current_ops()
data = ops.xp.zeros(shape, dtype="f")
self.data = data self.data = data
self.key2row = {} self.key2row = {}
if self.data is not None: if self.data is not None:
@ -300,6 +301,8 @@ cdef class Vectors:
else: else:
raise ValueError(Errors.E197.format(row=row, key=key)) raise ValueError(Errors.E197.format(row=row, key=key))
if vector is not None: if vector is not None:
xp = get_array_module(self.data)
vector = xp.asarray(vector)
self.data[row] = vector self.data[row] = vector
if self._unset.count(row): if self._unset.count(row):
self._unset.erase(self._unset.find(row)) self._unset.erase(self._unset.find(row))
@ -321,10 +324,11 @@ cdef class Vectors:
RETURNS (tuple): The most similar entries as a `(keys, best_rows, scores)` RETURNS (tuple): The most similar entries as a `(keys, best_rows, scores)`
tuple. tuple.
""" """
xp = get_array_module(self.data)
filled = sorted(list({row for row in self.key2row.values()})) filled = sorted(list({row for row in self.key2row.values()}))
if len(filled) < n: if len(filled) < n:
raise ValueError(Errors.E198.format(n=n, n_rows=len(filled))) raise ValueError(Errors.E198.format(n=n, n_rows=len(filled)))
xp = get_array_module(self.data) filled = xp.asarray(filled)
norms = xp.linalg.norm(self.data[filled], axis=1, keepdims=True) norms = xp.linalg.norm(self.data[filled], axis=1, keepdims=True)
norms[norms == 0] = 1 norms[norms == 0] = 1
@ -357,8 +361,10 @@ cdef class Vectors:
# Account for numerical error we want to return in range -1, 1 # Account for numerical error we want to return in range -1, 1
scores = xp.clip(scores, a_min=-1, a_max=1, out=scores) scores = xp.clip(scores, a_min=-1, a_max=1, out=scores)
row2key = {row: key for key, row in self.key2row.items()} row2key = {row: key for key, row in self.key2row.items()}
numpy_rows = get_current_ops().to_numpy(best_rows)
keys = xp.asarray( keys = xp.asarray(
[[row2key[row] for row in best_rows[i] if row in row2key] [[row2key[row] for row in numpy_rows[i] if row in row2key]
for i in range(len(queries)) ], dtype="uint64") for i in range(len(queries)) ], dtype="uint64")
return (keys, best_rows, scores) return (keys, best_rows, scores)
@ -459,7 +465,8 @@ cdef class Vectors:
if hasattr(self.data, "from_bytes"): if hasattr(self.data, "from_bytes"):
self.data.from_bytes() self.data.from_bytes()
else: else:
self.data = srsly.msgpack_loads(b) xp = get_array_module(self.data)
self.data = xp.asarray(srsly.msgpack_loads(b))
deserializers = { deserializers = {
"key2row": lambda b: self.key2row.update(srsly.msgpack_loads(b)), "key2row": lambda b: self.key2row.update(srsly.msgpack_loads(b)),


@ -2,7 +2,7 @@
from libc.string cimport memcpy from libc.string cimport memcpy
import srsly import srsly
from thinc.api import get_array_module from thinc.api import get_array_module, get_current_ops
import functools import functools
from .lexeme cimport EMPTY_LEXEME, OOV_RANK from .lexeme cimport EMPTY_LEXEME, OOV_RANK
@ -293,7 +293,7 @@ cdef class Vocab:
among those remaining. among those remaining.
For example, suppose the original table had vectors for the words: For example, suppose the original table had vectors for the words:
['sat', 'cat', 'feline', 'reclined']. If we prune the vector table to, ['sat', 'cat', 'feline', 'reclined']. If we prune the vector table to
two rows, we would discard the vectors for 'feline' and 'reclined'. two rows, we would discard the vectors for 'feline' and 'reclined'.
These words would then be remapped to the closest remaining vector These words would then be remapped to the closest remaining vector
-- so "feline" would have the same vector as "cat", and "reclined" -- so "feline" would have the same vector as "cat", and "reclined"
@ -314,6 +314,7 @@ cdef class Vocab:
DOCS: https://spacy.io/api/vocab#prune_vectors DOCS: https://spacy.io/api/vocab#prune_vectors
""" """
ops = get_current_ops()
xp = get_array_module(self.vectors.data) xp = get_array_module(self.vectors.data)
# Make sure all vectors are in the vocab # Make sure all vectors are in the vocab
for orth in self.vectors: for orth in self.vectors:
@ -329,8 +330,9 @@ cdef class Vocab:
toss = xp.ascontiguousarray(self.vectors.data[indices[nr_row:]]) toss = xp.ascontiguousarray(self.vectors.data[indices[nr_row:]])
self.vectors = Vectors(data=keep, keys=keys[:nr_row], name=self.vectors.name) self.vectors = Vectors(data=keep, keys=keys[:nr_row], name=self.vectors.name)
syn_keys, syn_rows, scores = self.vectors.most_similar(toss, batch_size=batch_size) syn_keys, syn_rows, scores = self.vectors.most_similar(toss, batch_size=batch_size)
syn_keys = ops.to_numpy(syn_keys)
remap = {} remap = {}
for i, key in enumerate(keys[nr_row:]): for i, key in enumerate(ops.to_numpy(keys[nr_row:])):
self.vectors.add(key, row=syn_rows[i][0]) self.vectors.add(key, row=syn_rows[i][0])
word = self.strings[key] word = self.strings[key]
synonym = self.strings[syn_keys[i][0]] synonym = self.strings[syn_keys[i][0]]
@ -351,7 +353,7 @@ cdef class Vocab:
Defaults to the length of `orth`. Defaults to the length of `orth`.
maxn (int): Maximum n-gram length used for Fasttext's ngram computation. maxn (int): Maximum n-gram length used for Fasttext's ngram computation.
Defaults to the length of `orth`. Defaults to the length of `orth`.
RETURNS (numpy.ndarray): A word vector. Size RETURNS (numpy.ndarray or cupy.ndarray): A word vector. Size
and shape determined by the `vocab.vectors` instance. Usually, a and shape determined by the `vocab.vectors` instance. Usually, a
numpy ndarray of shape (300,) and dtype float32. numpy ndarray of shape (300,) and dtype float32.
@ -400,7 +402,7 @@ cdef class Vocab:
by string or int ID. by string or int ID.
orth (int / unicode): The word. orth (int / unicode): The word.
vector (numpy.ndarray[ndim=1, dtype='float32']): The vector to set. vector (numpy.ndarray or cupy.ndarray[ndim=1, dtype='float32']): The vector to set.
DOCS: https://spacy.io/api/vocab#set_vector DOCS: https://spacy.io/api/vocab#set_vector
""" """


@ -35,7 +35,7 @@ usage documentation on
> @architectures = "spacy.Tok2Vec.v2" > @architectures = "spacy.Tok2Vec.v2"
> >
> [model.embed] > [model.embed]
> @architectures = "spacy.CharacterEmbed.v1" > @architectures = "spacy.CharacterEmbed.v2"
> # ... > # ...
> >
> [model.encode] > [model.encode]
@ -54,13 +54,13 @@ blog post for background.
| `encode` | Encode context into the embeddings, using an architecture such as a CNN, BiLSTM or transformer. For example, [MaxoutWindowEncoder](/api/architectures#MaxoutWindowEncoder). ~~Model[List[Floats2d], List[Floats2d]]~~ | | `encode` | Encode context into the embeddings, using an architecture such as a CNN, BiLSTM or transformer. For example, [MaxoutWindowEncoder](/api/architectures#MaxoutWindowEncoder). ~~Model[List[Floats2d], List[Floats2d]]~~ |
| **CREATES** | The model using the architecture. ~~Model[List[Doc], List[Floats2d]]~~ | | **CREATES** | The model using the architecture. ~~Model[List[Doc], List[Floats2d]]~~ |
### spacy.HashEmbedCNN.v1 {#HashEmbedCNN} ### spacy.HashEmbedCNN.v2 {#HashEmbedCNN}
> #### Example Config > #### Example Config
> >
> ```ini > ```ini
> [model] > [model]
> @architectures = "spacy.HashEmbedCNN.v1" > @architectures = "spacy.HashEmbedCNN.v2"
> pretrained_vectors = null > pretrained_vectors = null
> width = 96 > width = 96
> depth = 4 > depth = 4
@ -96,7 +96,7 @@ consisting of a CNN and a layer-normalized maxout activation function.
> factory = "tok2vec" > factory = "tok2vec"
> >
> [components.tok2vec.model] > [components.tok2vec.model]
> @architectures = "spacy.HashEmbedCNN.v1" > @architectures = "spacy.HashEmbedCNN.v2"
> width = 342 > width = 342
> >
> [components.tagger] > [components.tagger]
@ -129,13 +129,13 @@ argument that connects to the shared `tok2vec` component in the pipeline.
| `upstream` | A string to identify the "upstream" `Tok2Vec` component to communicate with. By default, the upstream name is the wildcard string `"*"`, but you could also specify the name of the `Tok2Vec` component. You'll almost never have multiple upstream `Tok2Vec` components, so the wildcard string will almost always be fine. ~~str~~ | | `upstream` | A string to identify the "upstream" `Tok2Vec` component to communicate with. By default, the upstream name is the wildcard string `"*"`, but you could also specify the name of the `Tok2Vec` component. You'll almost never have multiple upstream `Tok2Vec` components, so the wildcard string will almost always be fine. ~~str~~ |
| **CREATES** | The model using the architecture. ~~Model[List[Doc], List[Floats2d]]~~ | | **CREATES** | The model using the architecture. ~~Model[List[Doc], List[Floats2d]]~~ |
### spacy.MultiHashEmbed.v1 {#MultiHashEmbed} ### spacy.MultiHashEmbed.v2 {#MultiHashEmbed}
> #### Example config > #### Example config
> >
> ```ini > ```ini
> [model] > [model]
> @architectures = "spacy.MultiHashEmbed.v1" > @architectures = "spacy.MultiHashEmbed.v2"
> width = 64 > width = 64
> attrs = ["NORM", "PREFIX", "SUFFIX", "SHAPE"] > attrs = ["NORM", "PREFIX", "SUFFIX", "SHAPE"]
> rows = [2000, 1000, 1000, 1000] > rows = [2000, 1000, 1000, 1000]
@ -160,13 +160,13 @@ not updated).
| `include_static_vectors` | Whether to also use static word vectors. Requires a vectors table to be loaded in the [`Doc`](/api/doc) objects' vocab. ~~bool~~ | | `include_static_vectors` | Whether to also use static word vectors. Requires a vectors table to be loaded in the [`Doc`](/api/doc) objects' vocab. ~~bool~~ |
| **CREATES** | The model using the architecture. ~~Model[List[Doc], List[Floats2d]]~~ | | **CREATES** | The model using the architecture. ~~Model[List[Doc], List[Floats2d]]~~ |
### spacy.CharacterEmbed.v1 {#CharacterEmbed} ### spacy.CharacterEmbed.v2 {#CharacterEmbed}
> #### Example config > #### Example config
> >
> ```ini > ```ini
> [model] > [model]
> @architectures = "spacy.CharacterEmbed.v1" > @architectures = "spacy.CharacterEmbed.v2"
> width = 128 > width = 128
> rows = 7000 > rows = 7000
> nM = 64 > nM = 64
@ -266,13 +266,13 @@ Encode context using bidirectional LSTM layers. Requires
| `dropout` | Creates a Dropout layer on the outputs of each LSTM layer except the last layer. Set to 0.0 to disable this functionality. ~~float~~ | | `dropout` | Creates a Dropout layer on the outputs of each LSTM layer except the last layer. Set to 0.0 to disable this functionality. ~~float~~ |
| **CREATES** | The model using the architecture. ~~Model[List[Floats2d], List[Floats2d]]~~ | | **CREATES** | The model using the architecture. ~~Model[List[Floats2d], List[Floats2d]]~~ |
### spacy.StaticVectors.v1 {#StaticVectors} ### spacy.StaticVectors.v2 {#StaticVectors}
> #### Example config > #### Example config
> >
> ```ini > ```ini
> [model] > [model]
> @architectures = "spacy.StaticVectors.v1" > @architectures = "spacy.StaticVectors.v2"
> nO = null > nO = null
> nM = null > nM = null
> dropout = 0.2 > dropout = 0.2
@ -283,8 +283,9 @@ Encode context using bidirectional LSTM layers. Requires
> ``` > ```
Embed [`Doc`](/api/doc) objects with their vocab's vectors table, applying a Embed [`Doc`](/api/doc) objects with their vocab's vectors table, applying a
learned linear projection to control the dimensionality. See the documentation learned linear projection to control the dimensionality. Unknown tokens are
on [static vectors](/usage/embeddings-transformers#static-vectors) for details. mapped to a zero vector. See the documentation on [static
vectors](/usage/embeddings-transformers#static-vectors) for details.
| Name |  Description | | Name |  Description |
| ----------- | ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | | ----------- | ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
@ -513,7 +514,7 @@ for a Tok2Vec layer.
> use_upper = true > use_upper = true
> >
> [model.tok2vec] > [model.tok2vec]
> @architectures = "spacy.HashEmbedCNN.v1" > @architectures = "spacy.HashEmbedCNN.v2"
> pretrained_vectors = null > pretrained_vectors = null
> width = 96 > width = 96
> depth = 4 > depth = 4
@ -619,7 +620,7 @@ single-label use-cases where `exclusive_classes = true`, while the
> @architectures = "spacy.Tok2Vec.v2" > @architectures = "spacy.Tok2Vec.v2"
> >
> [model.tok2vec.embed] > [model.tok2vec.embed]
> @architectures = "spacy.MultiHashEmbed.v1" > @architectures = "spacy.MultiHashEmbed.v2"
> width = 64 > width = 64
> rows = [2000, 2000, 1000, 1000, 1000, 1000] > rows = [2000, 2000, 1000, 1000, 1000, 1000]
> attrs = ["ORTH", "LOWER", "PREFIX", "SUFFIX", "SHAPE", "ID"] > attrs = ["ORTH", "LOWER", "PREFIX", "SUFFIX", "SHAPE", "ID"]
@ -676,7 +677,7 @@ taking it as argument:
> nO = null > nO = null
> >
> [model.tok2vec] > [model.tok2vec]
> @architectures = "spacy.HashEmbedCNN.v1" > @architectures = "spacy.HashEmbedCNN.v2"
> pretrained_vectors = null > pretrained_vectors = null
> width = 96 > width = 96
> depth = 4 > depth = 4
@ -744,7 +745,7 @@ into the "real world". This requires 3 main components:
> nO = null > nO = null
> >
> [model.tok2vec] > [model.tok2vec]
> @architectures = "spacy.HashEmbedCNN.v1" > @architectures = "spacy.HashEmbedCNN.v2"
> pretrained_vectors = null > pretrained_vectors = null
> width = 96 > width = 96
> depth = 2 > depth = 2


@ -12,6 +12,7 @@ menu:
- ['train', 'train'] - ['train', 'train']
- ['pretrain', 'pretrain'] - ['pretrain', 'pretrain']
- ['evaluate', 'evaluate'] - ['evaluate', 'evaluate']
- ['assemble', 'assemble']
- ['package', 'package'] - ['package', 'package']
- ['project', 'project'] - ['project', 'project']
- ['ray', 'ray'] - ['ray', 'ray']
@ -892,6 +893,34 @@ $ python -m spacy evaluate [model] [data_path] [--output] [--code] [--gold-prepr
| `--help`, `-h` | Show help message and available arguments. ~~bool (flag)~~ | | `--help`, `-h` | Show help message and available arguments. ~~bool (flag)~~ |
| **CREATES** | Training results and optional metrics and visualizations. | | **CREATES** | Training results and optional metrics and visualizations. |
## assemble {#assemble tag="command"}
Assemble a pipeline from a config file without additional training. Expects a
[config file](/api/data-formats#config) with all settings and hyperparameters.
The `--code` argument can be used to import a Python file that lets you register
[custom functions](/usage/training#custom-functions) and refer to them in your
config.
> #### Example
>
> ```cli
> $ python -m spacy assemble config.cfg ./output
> ```
```cli
$ python -m spacy assemble [config_path] [output_dir] [--code] [--verbose] [overrides]
```
| Name | Description |
| ----------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `config_path` | Path to the [config](/api/data-formats#config) file containing all settings and hyperparameters. If `-`, the data will be [read from stdin](/usage/training#config-stdin). ~~Union[Path, str] \(positional)~~ |
| `output_dir` | Directory to store the final pipeline in. Will be created if it doesn't exist. ~~Optional[Path] \(option)~~ |
| `--code`, `-c` | Path to Python file with additional code to be imported. Allows [registering custom functions](/usage/training#custom-functions). ~~Optional[Path] \(option)~~ |
| `--verbose`, `-V` | Show more detailed messages during processing. ~~bool (flag)~~ |
| `--help`, `-h` | Show help message and available arguments. ~~bool (flag)~~ |
| overrides | Config parameters to override. Should be options starting with `--` that correspond to the config section and value to override, e.g. `--paths.data ./data`. ~~Any (option/flag)~~ |
| **CREATES** | The final assembled pipeline. |
## package {#package tag="command"} ## package {#package tag="command"}
Generate an installable [Python package](/usage/training#models-generating) from Generate an installable [Python package](/usage/training#models-generating) from


@ -29,8 +29,8 @@ recommended settings for your use case, check out the
> >
> The `@` syntax lets you refer to function names registered in the > The `@` syntax lets you refer to function names registered in the
> [function registry](/api/top-level#registry). For example, > [function registry](/api/top-level#registry). For example,
> `@architectures = "spacy.HashEmbedCNN.v1"` refers to a registered function of > `@architectures = "spacy.HashEmbedCNN.v2"` refers to a registered function of
> the name [spacy.HashEmbedCNN.v1](/api/architectures#HashEmbedCNN) and all > the name [spacy.HashEmbedCNN.v2](/api/architectures#HashEmbedCNN) and all
> other values defined in its block will be passed into that function as > other values defined in its block will be passed into that function as
> arguments. Those arguments depend on the registered function. See the usage > arguments. Those arguments depend on the registered function. See the usage
> guide on [registered functions](/usage/training#config-functions) for details. > guide on [registered functions](/usage/training#config-functions) for details.
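
As a rough illustration of the `@` lookup described above, the following pure-Python sketch mimics how a config block could be resolved into a function call. The `REGISTRY` dict, the `register` decorator, the `resolve_block` helper and the `toy.HashEmbedCNN.v2` name are all hypothetical stand-ins chosen for this example; they are not spaCy's actual registry implementation.

```python
from typing import Any, Callable, Dict

# Hypothetical stand-in for a function registry: maps a registry name
# (e.g. "architectures") to its registered functions, keyed by string name.
REGISTRY: Dict[str, Dict[str, Callable[..., Any]]] = {"architectures": {}}


def register(registry_name: str, func_name: str) -> Callable:
    """Decorator that stores a function under `func_name` (illustrative only)."""

    def decorator(func: Callable[..., Any]) -> Callable[..., Any]:
        REGISTRY[registry_name][func_name] = func
        return func

    return decorator


@register("architectures", "toy.HashEmbedCNN.v2")
def build_toy_model(width: int, depth: int) -> Dict[str, int]:
    # A real architecture would return a model; a dict keeps the sketch simple.
    return {"width": width, "depth": depth}


def resolve_block(block: Dict[str, Any]) -> Any:
    """Find the single "@registry" key, look up the function it names and
    call it with all other values in the block as keyword arguments."""
    ref_keys = [key for key in block if key.startswith("@")]
    if len(ref_keys) != 1:
        raise ValueError("Expected exactly one '@' key in the block")
    registry_name = ref_keys[0][1:]
    func = REGISTRY[registry_name][block[ref_keys[0]]]
    kwargs = {key: value for key, value in block.items() if not key.startswith("@")}
    return func(**kwargs)


# Equivalent of a config block with an `@architectures` entry plus two settings.
block = {"@architectures": "toy.HashEmbedCNN.v2", "width": 96, "depth": 4}
model = resolve_block(block)
```

The point of the sketch is only that the string after `@architectures` selects the function, while every other key in the block becomes an argument to it.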
@ -193,10 +193,10 @@ process that are used when you run [`spacy train`](/api/cli#train).
| `frozen_components` | Pipeline component names that are "frozen" and shouldn't be initialized or updated during training. See [here](/usage/training#config-components) for details. Defaults to `[]`. ~~List[str]~~ | | `frozen_components` | Pipeline component names that are "frozen" and shouldn't be initialized or updated during training. See [here](/usage/training#config-components) for details. Defaults to `[]`. ~~List[str]~~ |
| `gpu_allocator` | Library for cupy to route GPU memory allocation to. Can be `"pytorch"` or `"tensorflow"`. Defaults to variable `${system.gpu_allocator}`. ~~str~~ | | `gpu_allocator` | Library for cupy to route GPU memory allocation to. Can be `"pytorch"` or `"tensorflow"`. Defaults to variable `${system.gpu_allocator}`. ~~str~~ |
| `logger` | Callable that takes the `nlp` and stdout and stderr `IO` objects, sets up the logger, and returns two new callables to log a training step and to finalize the logger. Defaults to [`ConsoleLogger`](/api/top-level#ConsoleLogger). ~~Callable[[Language, IO, IO], [Tuple[Callable[[Dict[str, Any]], None], Callable[[], None]]]]~~ | | `logger` | Callable that takes the `nlp` and stdout and stderr `IO` objects, sets up the logger, and returns two new callables to log a training step and to finalize the logger. Defaults to [`ConsoleLogger`](/api/top-level#ConsoleLogger). ~~Callable[[Language, IO, IO], [Tuple[Callable[[Dict[str, Any]], None], Callable[[], None]]]]~~ |
| `max_epochs` | Maximum number of epochs to train for. Defaults to `0`. ~~int~~ | | `max_epochs` | Maximum number of epochs to train for. `0` means an unlimited number of epochs. `-1` means that the train corpus should be streamed rather than loaded into memory with no shuffling within the training loop. Defaults to `0`. ~~int~~ |
| `max_steps` | Maximum number of update steps to train for. Defaults to `20000`. ~~int~~ | | `max_steps` | Maximum number of update steps to train for. `0` means an unlimited number of steps. Defaults to `20000`. ~~int~~ |
| `optimizer` | The optimizer. The learning rate schedule and other settings can be configured as part of the optimizer. Defaults to [`Adam`](https://thinc.ai/docs/api-optimizers#adam). ~~Optimizer~~ | | `optimizer` | The optimizer. The learning rate schedule and other settings can be configured as part of the optimizer. Defaults to [`Adam`](https://thinc.ai/docs/api-optimizers#adam). ~~Optimizer~~ |
| `patience` | How many steps to continue without improvement in evaluation score. Defaults to `1600`. ~~int~~ | | `patience` | How many steps to continue without improvement in evaluation score. `0` disables early stopping. Defaults to `1600`. ~~int~~ |
| `score_weights` | Score names shown in metrics mapped to their weight towards the final weighted score. See [here](/usage/training#metrics) for details. Defaults to `{}`. ~~Dict[str, float]~~ | | `score_weights` | Score names shown in metrics mapped to their weight towards the final weighted score. See [here](/usage/training#metrics) for details. Defaults to `{}`. ~~Dict[str, float]~~ |
| `seed` | The random seed. Defaults to variable `${system.seed}`. ~~int~~ | | `seed` | The random seed. Defaults to variable `${system.seed}`. ~~int~~ |
| `train_corpus` | Dot notation of the config location defining the train corpus. Defaults to `corpora.train`. ~~str~~ | | `train_corpus` | Dot notation of the config location defining the train corpus. Defaults to `corpora.train`. ~~str~~ |
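
The interaction between `max_steps` and `patience` can be sketched in a few lines of plain Python. This is a simplified illustration that assumes a value of `0` disables the respective criterion, as the table states; the `should_stop` and `run` helpers are hypothetical names, `max_epochs` is left out for brevity, and the real training loop may differ in detail.

```python
from typing import Iterable, Optional, Tuple


def should_stop(step: int, best_step: int, max_steps: int, patience: int) -> bool:
    """Return True once a step-based stopping criterion is met.

    A value of 0 disables the respective criterion.
    """
    if max_steps and step >= max_steps:
        return True
    if patience and (step - best_step) >= patience:
        return True
    return False


def run(
    scores: Iterable[Tuple[int, float]], max_steps: int, patience: int
) -> Optional[int]:
    """Walk over (step, score) evaluation results and return the step at which
    training would stop, or None if no criterion fired."""
    best_score, best_step = float("-inf"), 0
    for step, score in scores:
        if score > best_score:
            best_score, best_step = score, step
        if should_stop(step, best_step, max_steps, patience):
            return step
    return None


# Made-up evaluation results: the best score is reached at step 400.
evals = [(200, 0.50), (400, 0.62), (600, 0.61), (800, 0.60), (1000, 0.59)]
stopped_by_patience = run(evals, max_steps=20000, patience=400)
stopped_by_max_steps = run(evals, max_steps=600, patience=0)
never_stopped = run(evals, max_steps=0, patience=0)
```

With the made-up scores above, the patience criterion fires 400 steps after the best evaluation, while setting both values to `0` lets the loop run through all evaluations.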
@ -390,7 +390,7 @@ file to keep track of your settings and hyperparameters and your own
> "tags": List[str], > "tags": List[str],
> "pos": List[str], > "pos": List[str],
> "morphs": List[str], > "morphs": List[str],
> "sent_starts": List[bool], > "sent_starts": List[Optional[bool]],
> "deps": List[string], > "deps": List[string],
> "heads": List[int], > "heads": List[int],
> "entities": List[str], > "entities": List[str],


@ -44,7 +44,7 @@ Construct a `Doc` object. The most common way to get a `Doc` object is via the
| `lemmas` <Tag variant="new">3</Tag> | A list of strings, of the same length as `words`, to assign as `token.lemma` for each word. Defaults to `None`. ~~Optional[List[str]]~~ | | `lemmas` <Tag variant="new">3</Tag> | A list of strings, of the same length as `words`, to assign as `token.lemma` for each word. Defaults to `None`. ~~Optional[List[str]]~~ |
| `heads` <Tag variant="new">3</Tag> | A list of values, of the same length as `words`, to assign as the head for each word. Head indices are the absolute position of the head in the `Doc`. Defaults to `None`. ~~Optional[List[int]]~~ | | `heads` <Tag variant="new">3</Tag> | A list of values, of the same length as `words`, to assign as the head for each word. Head indices are the absolute position of the head in the `Doc`. Defaults to `None`. ~~Optional[List[int]]~~ |
| `deps` <Tag variant="new">3</Tag> | A list of strings, of the same length as `words`, to assign as `token.dep` for each word. Defaults to `None`. ~~Optional[List[str]]~~ | | `deps` <Tag variant="new">3</Tag> | A list of strings, of the same length as `words`, to assign as `token.dep` for each word. Defaults to `None`. ~~Optional[List[str]]~~ |
| `sent_starts` <Tag variant="new">3</Tag> | A list of values, of the same length as `words`, to assign as `token.is_sent_start`. Will be overridden by heads if `heads` is provided. Defaults to `None`. ~~Optional[List[Union[bool, None]]~~ | | `sent_starts` <Tag variant="new">3</Tag> | A list of values, of the same length as `words`, to assign as `token.is_sent_start`. Will be overridden by heads if `heads` is provided. Defaults to `None`. ~~Optional[List[Optional[bool]]]~~ |
| `ents` <Tag variant="new">3</Tag> | A list of strings, of the same length as `words`, to assign the token-based IOB tag. Defaults to `None`. ~~Optional[List[str]]~~ | | `ents` <Tag variant="new">3</Tag> | A list of strings, of the same length as `words`, to assign the token-based IOB tag. Defaults to `None`. ~~Optional[List[str]]~~ |
## Doc.\_\_getitem\_\_ {#getitem tag="method"} ## Doc.\_\_getitem\_\_ {#getitem tag="method"}


@ -33,8 +33,8 @@ both documents.
| Name | Description | | Name | Description |
| -------------- | ------------------------------------------------------------------------------------------------------------------------ | | -------------- | ------------------------------------------------------------------------------------------------------------------------ |
| `predicted` | The document containing (partial) predictions. Cannot be `None`. ~~Doc~~ | | `predicted` | The document containing (partial) predictions. Cannot be `None`. ~~Doc~~ |
| `reference` | The document containing gold-standard annotations. Cannot be `None`. ~~Doc~~ | | `reference` | The document containing gold-standard annotations. Cannot be `None`. ~~Doc~~ |
| _keyword-only_ | | | _keyword-only_ | |
| `alignment` | An object holding the alignment between the tokens of the `predicted` and `reference` documents. ~~Optional[Alignment]~~ | | `alignment` | An object holding the alignment between the tokens of the `predicted` and `reference` documents. ~~Optional[Alignment]~~ |
@ -56,11 +56,11 @@ see the [training format documentation](/api/data-formats#dict-input).
> example = Example.from_dict(predicted, {"words": token_ref, "tags": tags_ref}) > example = Example.from_dict(predicted, {"words": token_ref, "tags": tags_ref})
> ``` > ```
| Name | Description | | Name | Description |
| -------------- | ------------------------------------------------------------------------- | | -------------- | ----------------------------------------------------------------------------------- |
| `predicted` | The document containing (partial) predictions. Cannot be `None`. ~~Doc~~ | | `predicted` | The document containing (partial) predictions. Cannot be `None`. ~~Doc~~ |
| `example_dict` | `Dict[str, obj]` | The gold-standard annotations as a dictionary. Cannot be `None`. ~~Dict[str, Any]~~ | | `example_dict` | The gold-standard annotations as a dictionary. Cannot be `None`. ~~Dict[str, Any]~~ |
| **RETURNS** | The newly constructed object. ~~Example~~ | | **RETURNS** | The newly constructed object. ~~Example~~ |
## Example.text {#text tag="property"} ## Example.text {#text tag="property"}
@ -211,10 +211,11 @@ align to the tokenization in [`Example.predicted`](/api/example#predicted).
> assert [(ent.start, ent.end) for ent in ents_y2x] == [(0, 1)] > assert [(ent.start, ent.end) for ent in ents_y2x] == [(0, 1)]
> ``` > ```
| Name | Description | | Name | Description |
| ----------- | ----------------------------------------------------------------------------- | | --------------- | -------------------------------------------------------------------------------------------- |
| `y_spans` | `Span` objects aligned to the tokenization of `reference`. ~~Iterable[Span]~~ | | `y_spans` | `Span` objects aligned to the tokenization of `reference`. ~~Iterable[Span]~~ |
| **RETURNS** | `Span` objects aligned to the tokenization of `predicted`. ~~List[Span]~~ | | `allow_overlap` | Whether the resulting `Span` objects may overlap or not. Set to `False` by default. ~~bool~~ |
| **RETURNS** | `Span` objects aligned to the tokenization of `predicted`. ~~List[Span]~~ |
## Example.get_aligned_spans_x2y {#get_aligned_spans_x2y tag="method"} ## Example.get_aligned_spans_x2y {#get_aligned_spans_x2y tag="method"}
@ -238,10 +239,11 @@ against the original gold-standard annotation.
> assert [(ent.start, ent.end) for ent in ents_x2y] == [(0, 2)] > assert [(ent.start, ent.end) for ent in ents_x2y] == [(0, 2)]
> ``` > ```
| Name | Description | | Name | Description |
| ----------- | ----------------------------------------------------------------------------- | | --------------- | -------------------------------------------------------------------------------------------- |
| `x_spans` | `Span` objects aligned to the tokenization of `predicted`. ~~Iterable[Span]~~ | | `x_spans` | `Span` objects aligned to the tokenization of `predicted`. ~~Iterable[Span]~~ |
| **RETURNS** | `Span` objects aligned to the tokenization of `reference`. ~~List[Span]~~ | | `allow_overlap` | Whether the resulting `Span` objects may overlap or not. Set to `False` by default. ~~bool~~ |
| **RETURNS** | `Span` objects aligned to the tokenization of `reference`. ~~List[Span]~~ |
## Example.to_dict {#to_dict tag="method"} ## Example.to_dict {#to_dict tag="method"}


@ -5,11 +5,12 @@ source: spacy/legacy
--- ---
The [`spacy-legacy`](https://github.com/explosion/spacy-legacy) package includes The [`spacy-legacy`](https://github.com/explosion/spacy-legacy) package includes
outdated registered functions and architectures. It is installed automatically as outdated registered functions and architectures. It is installed automatically
a dependency of spaCy, and provides backwards compatibility for archived functions as a dependency of spaCy, and provides backwards compatibility for archived
that may still be used in projects. functions that may still be used in projects.
You can find the detailed documentation of each such legacy function on this page. You can find the detailed documentation of each such legacy function on this
page.
## Architectures {#architectures} ## Architectures {#architectures}
@ -44,15 +45,14 @@ blog post for background.
| Name | Description | | Name | Description |
| ----------- | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | | ----------- | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `embed` | Embed tokens into context-independent word vector representations. For example, [CharacterEmbed](/api/architectures#CharacterEmbed) or [MultiHashEmbed](/api/architectures#MultiHashEmbed). ~~Model[List[Doc], List[Floats2d]]~~ | | `embed` | Embed tokens into context-independent word vector representations. For example, [CharacterEmbed](/api/architectures#CharacterEmbed) or [MultiHashEmbed](/api/architectures#MultiHashEmbed). ~~Model[List[Doc], List[Floats2d]]~~ |
| `encode` | Encode context into the embeddings, using an architecture such as a CNN, BiLSTM or transformer. For example, [MaxoutWindowEncoder.v1](/api/legacy#MaxoutWindowEncoder_v1). ~~Model[Floats2d, Floats2d]~~ | | `encode` | Encode context into the embeddings, using an architecture such as a CNN, BiLSTM or transformer. For example, [MaxoutWindowEncoder.v1](/api/legacy#MaxoutWindowEncoder_v1). ~~Model[Floats2d, Floats2d]~~ |
| **CREATES** | The model using the architecture. ~~Model[List[Doc], List[Floats2d]]~~ | | **CREATES** | The model using the architecture. ~~Model[List[Doc], List[Floats2d]]~~ |
### spacy.MaxoutWindowEncoder.v1 {#MaxoutWindowEncoder_v1} ### spacy.MaxoutWindowEncoder.v1 {#MaxoutWindowEncoder_v1}
The `spacy.MaxoutWindowEncoder.v1` architecture was producing a model of type The `spacy.MaxoutWindowEncoder.v1` architecture was producing a model of type
`Model[Floats2d, Floats2d]`. Since `spacy.MaxoutWindowEncoder.v2`, this has been changed to output `Model[Floats2d, Floats2d]`. Since `spacy.MaxoutWindowEncoder.v2`, this has been
type `Model[List[Floats2d], List[Floats2d]]`. changed to output type `Model[List[Floats2d], List[Floats2d]]`.
> #### Example config > #### Example config
> >
@ -79,8 +79,8 @@ and residual connections.
### spacy.MishWindowEncoder.v1 {#MishWindowEncoder_v1} ### spacy.MishWindowEncoder.v1 {#MishWindowEncoder_v1}
The `spacy.MishWindowEncoder.v1` architecture was producing a model of type The `spacy.MishWindowEncoder.v1` architecture was producing a model of type
`Model[Floats2d, Floats2d]`. Since `spacy.MishWindowEncoder.v2`, this has been changed to output `Model[Floats2d, Floats2d]`. Since `spacy.MishWindowEncoder.v2`, this has been
type `Model[List[Floats2d], List[Floats2d]]`. changed to output type `Model[List[Floats2d], List[Floats2d]]`.
> #### Example config > #### Example config
> >
@ -103,12 +103,11 @@ and residual connections.
| `depth` | The number of convolutional layers. Recommended value is `4`. ~~int~~ | | `depth` | The number of convolutional layers. Recommended value is `4`. ~~int~~ |
| **CREATES** | The model using the architecture. ~~Model[Floats2d, Floats2d]~~ | | **CREATES** | The model using the architecture. ~~Model[Floats2d, Floats2d]~~ |
### spacy.TextCatEnsemble.v1 {#TextCatEnsemble_v1} ### spacy.TextCatEnsemble.v1 {#TextCatEnsemble_v1}
The `spacy.TextCatEnsemble.v1` architecture built an internal `tok2vec` and `linear_model`. The `spacy.TextCatEnsemble.v1` architecture built an internal `tok2vec` and
Since `spacy.TextCatEnsemble.v2`, this has been refactored so that the `TextCatEnsemble` takes these `linear_model`. Since `spacy.TextCatEnsemble.v2`, this has been refactored so
two sublayers as input. that the `TextCatEnsemble` takes these two sublayers as input.
> #### Example Config > #### Example Config
> >
@ -141,3 +140,61 @@ network has an internal CNN Tok2Vec layer and uses attention.
| `dropout` | The dropout rate. ~~float~~ | | `dropout` | The dropout rate. ~~float~~ |
| `nO` | Output dimension, determined by the number of different labels. If not set, the [`TextCategorizer`](/api/textcategorizer) component will set it when `initialize` is called. ~~Optional[int]~~ | | `nO` | Output dimension, determined by the number of different labels. If not set, the [`TextCategorizer`](/api/textcategorizer) component will set it when `initialize` is called. ~~Optional[int]~~ |
| **CREATES** | The model using the architecture. ~~Model[List[Doc], Floats2d]~~ | | **CREATES** | The model using the architecture. ~~Model[List[Doc], Floats2d]~~ |
### spacy.HashEmbedCNN.v1 {#HashEmbedCNN_v1}
Identical to [`spacy.HashEmbedCNN.v2`](/api/architectures#HashEmbedCNN) except
using [`spacy.StaticVectors.v1`](#StaticVectors_v1) if vectors are included.
### spacy.MultiHashEmbed.v1 {#MultiHashEmbed_v1}
Identical to [`spacy.MultiHashEmbed.v2`](/api/architectures#MultiHashEmbed)
except with [`spacy.StaticVectors.v1`](#StaticVectors_v1) if vectors are
included.
### spacy.CharacterEmbed.v1 {#CharacterEmbed_v1}
Identical to [`spacy.CharacterEmbed.v2`](/api/architectures#CharacterEmbed)
except using [`spacy.StaticVectors.v1`](#StaticVectors_v1) if vectors are
included.
## Layers {#layers}
These functions are available from `@spacy.registry.layers`.
### spacy.StaticVectors.v1 {#StaticVectors_v1}
Identical to [`spacy.StaticVectors.v2`](/api/architectures#StaticVectors) except
for the handling of tokens without vectors.
<Infobox title="Bugs for tokens without vectors" variant="warning">
`spacy.StaticVectors.v1` maps tokens without vectors to the final row in the
vectors table, which causes the model predictions to change if new vectors are
added to an existing vectors table. See more details in
[issue #7662](https://github.com/explosion/spaCy/issues/7662#issuecomment-813925655).
</Infobox>
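
To see why mapping tokens without vectors to the final row is fragile, consider this toy lookup in plain Python. It is only an illustration of the behavior described in the warning, assuming a list-of-lists table and a `key2row` dict; `lookup_v1_style` is a hypothetical helper, not the layer's actual code.

```python
from typing import Dict, List


def lookup_v1_style(
    key2row: Dict[str, int], table: List[List[float]], token: str
) -> List[float]:
    """Return the vector for `token`; tokens without a vector fall back to
    the *last* row of the table (the problematic behavior)."""
    row = key2row.get(token, len(table) - 1)
    return table[row]


# A tiny vectors table with two entries.
table = [[1.0, 0.0], [0.0, 1.0]]
key2row = {"apple": 0, "pear": 1}

# "kiwi" has no vector, so it silently receives the vector of "pear".
before = lookup_v1_style(key2row, table, "kiwi")

# Adding a new vector changes what the final row is ...
table.append([0.5, 0.5])
key2row["plum"] = 2

# ... so the same unknown token now receives a different vector.
after = lookup_v1_style(key2row, table, "kiwi")
```

Because the fallback row depends on the table's length, any model trained on top of these lookups sees different inputs for unknown tokens once the table grows.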
## Loggers {#loggers}
These functions are available from `@spacy.registry.loggers`.
### spacy.WandbLogger.v1 {#WandbLogger_v1}
The first version of the [`WandbLogger`](/api/top-level#WandbLogger) did not yet
support the `log_dataset_dir` and `model_log_interval` arguments.
> #### Example config
>
> ```ini
> [training.logger]
> @loggers = "spacy.WandbLogger.v1"
> project_name = "monitor_spacy_training"
> remove_config_values = ["paths.train", "paths.dev", "corpora.train.path", "corpora.dev.path"]
> ```
>
> | Name | Description |
> | ---------------------- | ------------------------------------------------------------------------------------------------------------------------------------- |
> | `project_name` | The name of the project in the Weights & Biases interface. The project will be created automatically if it doesn't exist yet. ~~str~~ |
> | `remove_config_values` | A list of values to exclude from the config before it is uploaded to W&B (default: empty). ~~List[str]~~ |


@ -120,13 +120,14 @@ Find all token sequences matching the supplied patterns on the `Doc` or `Span`.
> matches = matcher(doc) > matches = matcher(doc)
> ``` > ```
| Name | Description | | Name | Description |
| ------------------------------------------ | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | | ---------------------------------------------- | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `doclike` | The `Doc` or `Span` to match over. ~~Union[Doc, Span]~~ | | `doclike` | The `Doc` or `Span` to match over. ~~Union[Doc, Span]~~ |
| _keyword-only_ | | | _keyword-only_ | |
| `as_spans` <Tag variant="new">3</Tag> | Instead of tuples, return a list of [`Span`](/api/span) objects of the matches, with the `match_id` assigned as the span label. Defaults to `False`. ~~bool~~ | | `as_spans` <Tag variant="new">3</Tag> | Instead of tuples, return a list of [`Span`](/api/span) objects of the matches, with the `match_id` assigned as the span label. Defaults to `False`. ~~bool~~ |
| `allow_missing` <Tag variant="new">3</Tag> | Whether to skip checks for missing annotation for attributes included in patterns. Defaults to `False`. ~~bool~~ | | `allow_missing` <Tag variant="new">3</Tag> | Whether to skip checks for missing annotation for attributes included in patterns. Defaults to `False`. ~~bool~~ |
| **RETURNS** | A list of `(match_id, start, end)` tuples, describing the matches. A match tuple describes a span `doc[start:end]`. The `match_id` is the ID of the added match pattern. If `as_spans` is set to `True`, a list of `Span` objects is returned instead. ~~Union[List[Tuple[int, int, int]], List[Span]]~~ | | `with_alignments` <Tag variant="new">3.1</Tag> | Return match alignment information as part of the match tuple as `List[int]` with the same length as the matched span. Each entry denotes the corresponding index of the token pattern. If `as_spans` is set to `True`, this setting is ignored. Defaults to `False`. ~~bool~~ |
| **RETURNS** | A list of `(match_id, start, end)` tuples, describing the matches. A match tuple describes a span `doc[start:end]`. The `match_id` is the ID of the added match pattern. If `as_spans` is set to `True`, a list of `Span` objects is returned instead. ~~Union[List[Tuple[int, int, int]], List[Span]]~~ |
## Matcher.\_\_len\_\_ {#len tag="method" new="2"} ## Matcher.\_\_len\_\_ {#len tag="method" new="2"}


@ -137,14 +137,16 @@ Returns PRF scores for labeled or unlabeled spans.
> print(scores["ents_f"]) > print(scores["ents_f"])
> ``` > ```
| Name | Description | | Name | Description |
| ---------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ | | ---------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `examples` | The `Example` objects holding both the predictions and the correct gold-standard annotations. ~~Iterable[Example]~~ | | `examples` | The `Example` objects holding both the predictions and the correct gold-standard annotations. ~~Iterable[Example]~~ |
| `attr` | The attribute to score. ~~str~~ | | `attr` | The attribute to score. ~~str~~ |
| _keyword-only_ | | | _keyword-only_ | |
| `getter` | Defaults to `getattr`. If provided, `getter(doc, attr)` should return the `Span` objects for an individual `Doc`. ~~Callable[[Doc, str], Iterable[Span]]~~ | | `getter` | Defaults to `getattr`. If provided, `getter(doc, attr)` should return the `Span` objects for an individual `Doc`. ~~Callable[[Doc, str], Iterable[Span]]~~ |
| `has_annotation` | Defaults to `None`. If provided, `has_annotation(doc)` should return whether a `Doc` has annotation for this `attr`. Docs without annotation are skipped for scoring purposes. ~~Optional[Callable[[Doc], bool]]~~ | | `has_annotation` | Defaults to `None`. If provided, `has_annotation(doc)` should return whether a `Doc` has annotation for this `attr`. Docs without annotation are skipped for scoring purposes. ~~Optional[Callable[[Doc], bool]]~~ |
| **RETURNS** | A dictionary containing the PRF scores under the keys `{attr}_p`, `{attr}_r`, `{attr}_f` and the per-type PRF scores under `{attr}_per_type`. ~~Dict[str, Union[float, Dict[str, float]]]~~ | | `labeled` | Defaults to `True`. If set to `False`, two spans will be considered equal if their start and end match, irrespective of their label. ~~bool~~ |
| `allow_overlap` | Defaults to `False`. Whether or not to allow overlapping spans. If set to `False`, the alignment will automatically resolve conflicts. ~~bool~~ |
| **RETURNS** | A dictionary containing the PRF scores under the keys `{attr}_p`, `{attr}_r`, `{attr}_f` and the per-type PRF scores under `{attr}_per_type`. ~~Dict[str, Union[float, Dict[str, float]]]~~ |
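
The effect of the `labeled` setting can be illustrated with a small stand-alone computation. This is a simplified sketch that represents spans as `(start, end, label)` tuples and ignores alignment, overlap handling and per-type scores; `prf` and `score_spans_sketch` are hypothetical helpers, not the `Scorer` implementation.

```python
from typing import Dict, Set, Tuple

SpanTuple = Tuple[int, int, str]  # (start, end, label)


def prf(pred: Set, gold: Set) -> Dict[str, float]:
    """Precision, recall and F-score over two sets of comparable items."""
    tp = len(pred & gold)
    p = tp / len(pred) if pred else 0.0
    r = tp / len(gold) if gold else 0.0
    f = 2 * p * r / (p + r) if (p + r) else 0.0
    return {"p": p, "r": r, "f": f}


def score_spans_sketch(
    pred: Set[SpanTuple], gold: Set[SpanTuple], labeled: bool = True
) -> Dict[str, float]:
    """Score predicted spans against gold spans. With labeled=False, spans
    count as equal if their start and end match, irrespective of the label."""
    if labeled:
        return prf(set(pred), set(gold))
    pred_offsets = {(start, end) for start, end, _ in pred}
    gold_offsets = {(start, end) for start, end, _ in gold}
    return prf(pred_offsets, gold_offsets)


# The first predicted span has the right boundaries but the wrong label.
gold_spans = {(0, 2, "PERSON"), (5, 6, "GPE")}
pred_spans = {(0, 2, "ORG"), (5, 6, "GPE")}

labeled_scores = score_spans_sketch(pred_spans, gold_spans, labeled=True)
unlabeled_scores = score_spans_sketch(pred_spans, gold_spans, labeled=False)
```

In this toy case the mislabeled span counts as an error only in the labeled setting, so the unlabeled scores are higher.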
## Scorer.score_deps {#score_deps tag="staticmethod" new="3"} ## Scorer.score_deps {#score_deps tag="staticmethod" new="3"}


@ -364,7 +364,7 @@ unknown. Defaults to `True` for the first token in the `Doc`.
| Name | Description | | Name | Description |
| ----------- | --------------------------------------------- | | ----------- | --------------------------------------------- |
| **RETURNS** | Whether the token starts a sentence. ~~bool~~ | | **RETURNS** | Whether the token starts a sentence. ~~Optional[bool]~~ |
## Token.has_vector {#has_vector tag="property" model="vectors"} ## Token.has_vector {#has_vector tag="property" model="vectors"}


@ -8,6 +8,7 @@ menu:
- ['Readers', 'readers'] - ['Readers', 'readers']
- ['Batchers', 'batchers'] - ['Batchers', 'batchers']
- ['Augmenters', 'augmenters'] - ['Augmenters', 'augmenters']
- ['Callbacks', 'callbacks']
- ['Training & Alignment', 'gold'] - ['Training & Alignment', 'gold']
- ['Utility Functions', 'util'] - ['Utility Functions', 'util']
--- ---
@ -461,7 +462,7 @@ start decreasing across epochs.
</Accordion> </Accordion>
#### spacy.WandbLogger.v1 {#WandbLogger tag="registered function"} #### spacy.WandbLogger.v2 {#WandbLogger tag="registered function"}
> #### Installation > #### Installation
> >
@ -493,15 +494,19 @@ remain in the config file stored on your local system.
> >
> ```ini > ```ini
> [training.logger] > [training.logger]
> @loggers = "spacy.WandbLogger.v1" > @loggers = "spacy.WandbLogger.v2"
> project_name = "monitor_spacy_training" > project_name = "monitor_spacy_training"
> remove_config_values = ["paths.train", "paths.dev", "corpora.train.path", "corpora.dev.path"] > remove_config_values = ["paths.train", "paths.dev", "corpora.train.path", "corpora.dev.path"]
> log_dataset_dir = "corpus"
> model_log_interval = 1000
> ``` > ```
| Name | Description | | Name | Description |
| ---------------------- | ------------------------------------------------------------------------------------------------------------------------------------- | | ---------------------- | ------------------------------------------------------------------------------------------------------------------------------------- |
| `project_name` | The name of the project in the Weights & Biases interface. The project will be created automatically if it doesn't exist yet. ~~str~~ | | `project_name` | The name of the project in the Weights & Biases interface. The project will be created automatically if it doesn't exist yet. ~~str~~ |
| `remove_config_values` | A list of values to exclude from the config before it is uploaded to W&B (default: empty). ~~List[str]~~ | | `remove_config_values` | A list of values to exclude from the config before it is uploaded to W&B (default: empty). ~~List[str]~~ |
| `model_log_interval` | Steps to wait between logging model checkpoints to the W&B dashboard (default: None). ~~Optional[int]~~ |
| `log_dataset_dir` | Directory containing dataset to be logged and versioned as W&B artifact (default: None). ~~Optional[str]~~ |
<Project id="integrations/wandb"> <Project id="integrations/wandb">
@ -781,6 +786,35 @@ useful for making the model less sensitive to capitalization.
| `level` | The percentage of texts that will be augmented. ~~float~~ | | `level` | The percentage of texts that will be augmented. ~~float~~ |
| **CREATES** | A function that takes the current `nlp` object and an [`Example`](/api/example) and yields augmented `Example` objects. ~~Callable[[Language, Example], Iterator[Example]]~~ | | **CREATES** | A function that takes the current `nlp` object and an [`Example`](/api/example) and yields augmented `Example` objects. ~~Callable[[Language, Example], Iterator[Example]]~~ |
## Callbacks {#callbacks source="spacy/training/callbacks.py" new="3"}
The config supports [callbacks](/usage/training#custom-code-nlp-callbacks) at
several points in the lifecycle that can be used to modify the `nlp` object.
### spacy.copy_from_base_model.v1 {#copy_from_base_model tag="registered function"}
> #### Example config
>
> ```ini
> [initialize.before_init]
> @callbacks = "spacy.copy_from_base_model.v1"
> tokenizer = "en_core_sci_md"
> vocab = "en_core_sci_md"
> ```
Copy the tokenizer and/or vocab from the specified models. It's similar to the
v2 [base model](https://v2.spacy.io/api/cli#train) option and useful in
combination with
[sourced components](/usage/processing-pipelines#sourced-components) when
fine-tuning an existing pipeline. The vocab includes the lookups and the vectors
from the specified model. Intended for use in `[initialize.before_init]`.
| Name | Description |
| ----------- | ----------------------------------------------------------------------------------------------------------------------- |
| `tokenizer` | The pipeline to copy the tokenizer from. Defaults to `None`. ~~Optional[str]~~ |
| `vocab` | The pipeline to copy the vocab from. The vocab includes the lookups and vectors. Defaults to `None`. ~~Optional[str]~~ |
| **CREATES** | A function that takes the current `nlp` object and modifies its `tokenizer` and `vocab`. ~~Callable[[Language], None]~~ |
## Training data and alignment {#gold source="spacy/training"} ## Training data and alignment {#gold source="spacy/training"}
### training.offsets_to_biluo_tags {#offsets_to_biluo_tags tag="function"} ### training.offsets_to_biluo_tags {#offsets_to_biluo_tags tag="function"}


@ -132,7 +132,7 @@ factory = "tok2vec"
@architectures = "spacy.Tok2Vec.v2" @architectures = "spacy.Tok2Vec.v2"
[components.tok2vec.model.embed] [components.tok2vec.model.embed]
@architectures = "spacy.MultiHashEmbed.v1" @architectures = "spacy.MultiHashEmbed.v2"
[components.tok2vec.model.encode] [components.tok2vec.model.encode]
@architectures = "spacy.MaxoutWindowEncoder.v2" @architectures = "spacy.MaxoutWindowEncoder.v2"
@ -164,7 +164,7 @@ factory = "ner"
@architectures = "spacy.Tok2Vec.v2" @architectures = "spacy.Tok2Vec.v2"
[components.ner.model.tok2vec.embed] [components.ner.model.tok2vec.embed]
@architectures = "spacy.MultiHashEmbed.v1" @architectures = "spacy.MultiHashEmbed.v2"
[components.ner.model.tok2vec.encode] [components.ner.model.tok2vec.encode]
@architectures = "spacy.MaxoutWindowEncoder.v2" @architectures = "spacy.MaxoutWindowEncoder.v2"
@ -541,7 +541,7 @@ word vector tables using the `include_static_vectors` flag.
```ini ```ini
[tagger.model.tok2vec.embed] [tagger.model.tok2vec.embed]
@architectures = "spacy.MultiHashEmbed.v1" @architectures = "spacy.MultiHashEmbed.v2"
width = 128 width = 128
attrs = ["LOWER","PREFIX","SUFFIX","SHAPE"] attrs = ["LOWER","PREFIX","SUFFIX","SHAPE"]
rows = [5000,2500,2500,2500] rows = [5000,2500,2500,2500]
@ -550,7 +550,7 @@ include_static_vectors = true
<Infobox title="How it works" emoji="💡"> <Infobox title="How it works" emoji="💡">
The configuration system will look up the string `"spacy.MultiHashEmbed.v1"` in The configuration system will look up the string `"spacy.MultiHashEmbed.v2"` in
the `architectures` [registry](/api/top-level#registry), and call the returned the `architectures` [registry](/api/top-level#registry), and call the returned
object with the rest of the arguments from the block. This will result in a call object with the rest of the arguments from the block. This will result in a call
to the to the


@ -130,9 +130,9 @@ which provides a numpy-compatible interface for GPU arrays.
spaCy can be installed on GPU by specifying `spacy[cuda]`, `spacy[cuda90]`, spaCy can be installed on GPU by specifying `spacy[cuda]`, `spacy[cuda90]`,
`spacy[cuda91]`, `spacy[cuda92]`, `spacy[cuda100]`, `spacy[cuda101]`, `spacy[cuda91]`, `spacy[cuda92]`, `spacy[cuda100]`, `spacy[cuda101]`,
`spacy[cuda102]`, `spacy[cuda110]` or `spacy[cuda111]`. If you know your cuda `spacy[cuda102]`, `spacy[cuda110]`, `spacy[cuda111]` or `spacy[cuda112]`. If you
version, using the more explicit specifier allows cupy to be installed via know your cuda version, using the more explicit specifier allows cupy to be
wheel, saving some compilation time. The specifiers should install installed via wheel, saving some compilation time. The specifiers should install
[`cupy`](https://cupy.chainer.org). [`cupy`](https://cupy.chainer.org).
```bash ```bash


@ -137,7 +137,7 @@ nO = null
@architectures = "spacy.Tok2Vec.v2" @architectures = "spacy.Tok2Vec.v2"
[components.textcat.model.tok2vec.embed] [components.textcat.model.tok2vec.embed]
@architectures = "spacy.MultiHashEmbed.v1" @architectures = "spacy.MultiHashEmbed.v2"
width = 64 width = 64
rows = [2000, 2000, 1000, 1000, 1000, 1000] rows = [2000, 2000, 1000, 1000, 1000, 1000]
attrs = ["ORTH", "LOWER", "PREFIX", "SUFFIX", "SHAPE", "ID"] attrs = ["ORTH", "LOWER", "PREFIX", "SUFFIX", "SHAPE", "ID"]
@ -204,7 +204,7 @@ factory = "tok2vec"
@architectures = "spacy.Tok2Vec.v2" @architectures = "spacy.Tok2Vec.v2"
[components.tok2vec.model.embed] [components.tok2vec.model.embed]
@architectures = "spacy.MultiHashEmbed.v1" @architectures = "spacy.MultiHashEmbed.v2"
# ... # ...
[components.tok2vec.model.encode] [components.tok2vec.model.encode]
@ -220,7 +220,7 @@ architecture:
```ini ```ini
### config.cfg (excerpt) ### config.cfg (excerpt)
[components.tok2vec.model.embed] [components.tok2vec.model.embed]
@architectures = "spacy.CharacterEmbed.v1" @architectures = "spacy.CharacterEmbed.v2"
# ... # ...
[components.tok2vec.model.encode] [components.tok2vec.model.encode]
@ -638,7 +638,7 @@ that has the full implementation.
> @architectures = "rel_instance_tensor.v1" > @architectures = "rel_instance_tensor.v1"
> >
> [model.create_instance_tensor.tok2vec] > [model.create_instance_tensor.tok2vec]
> @architectures = "spacy.HashEmbedCNN.v1" > @architectures = "spacy.HashEmbedCNN.v2"
> # ... > # ...
> >
> [model.create_instance_tensor.pooling] > [model.create_instance_tensor.pooling]


@ -787,6 +787,7 @@ rather than performance:
```python ```python
def tokenizer_pseudo_code( def tokenizer_pseudo_code(
text,
special_cases, special_cases,
prefix_search, prefix_search,
suffix_search, suffix_search,
@ -840,12 +841,14 @@ def tokenizer_pseudo_code(
tokens.append(substring) tokens.append(substring)
substring = "" substring = ""
tokens.extend(reversed(suffixes)) tokens.extend(reversed(suffixes))
for match in matcher(special_cases, text):
tokens.replace(match, special_cases[match])
return tokens return tokens
``` ```
The algorithm can be summarized as follows: The algorithm can be summarized as follows:
1. Iterate over whitespace-separated substrings. 1. Iterate over space-separated substrings.
2. Look for a token match. If there is a match, stop processing and keep this 2. Look for a token match. If there is a match, stop processing and keep this
token. token.
3. Check whether we have an explicitly defined special case for this substring. 3. Check whether we have an explicitly defined special case for this substring.
@ -859,6 +862,8 @@ The algorithm can be summarized as follows:
8. Look for "infixes" (stuff like hyphens etc.) and split the substring into 8. Look for "infixes" (stuff like hyphens etc.) and split the substring into
tokens on all infixes. tokens on all infixes.
9. Once we can't consume any more of the string, handle it as a single token. 9. Once we can't consume any more of the string, handle it as a single token.
10. Make a final pass over the text to check for special cases that include
spaces or that were missed due to the incremental processing of affixes.
</Accordion> </Accordion>
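Step 10 of the summary can be illustrated with a toy sketch. This is not spaCy's implementation: `apply_special_cases` is a hypothetical helper, and tokens are modelled as `(text, trailing_whitespace)` pairs purely for illustration. It rebuilds the surface text of each run of tokens and replaces the longest run that matches a special case, which covers both special cases that contain spaces and ones that were split apart by affix handling.

```python
from typing import Dict, List, Tuple

Token = Tuple[str, str]  # (token text, trailing whitespace); hypothetical model


def apply_special_cases(
    tokens: List[Token], special_cases: Dict[str, List[str]]
) -> List[str]:
    """Toy final pass: replace token runs whose surface text is a special case."""
    output: List[str] = []
    i = 0
    while i < len(tokens):
        match_end = None
        # Try the longest run of tokens starting at position i first.
        for j in range(len(tokens), i, -1):
            # Rebuild the surface text of tokens[i:j] ...
            surface = "".join(text + ws for text, ws in tokens[i:j])
            # ... and drop the whitespace that trails the last token.
            surface = surface[: len(surface) - len(tokens[j - 1][1])]
            if surface in special_cases:
                output.extend(special_cases[surface])
                match_end = j
                break
        if match_end is None:
            # No special case starts here: keep the token as it is.
            output.append(tokens[i][0])
            i += 1
        else:
            i = match_end
    return output


# "don't" was split by affix handling; "New York" contains a space.
tokens = [("I", " "), ("don", ""), ("'", ""), ("t", " "),
          ("like", " "), ("New", " "), ("York", "")]
special_cases = {"don't": ["do", "n't"], "New York": ["New York"]}
result = apply_special_cases(tokens, special_cases)
```

The nested scan is quadratic in the number of tokens, which is acceptable for a readability-first sketch but not how a production tokenizer would do it.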


@ -995,7 +995,7 @@ your results.
> >
> ```ini > ```ini
> [training.logger] > [training.logger]
> @loggers = "spacy.WandbLogger.v1" > @loggers = "spacy.WandbLogger.v2"
> project_name = "monitor_spacy_training" > project_name = "monitor_spacy_training"
> remove_config_values = ["paths.train", "paths.dev", "corpora.train.path", "corpora.dev.path"] > remove_config_values = ["paths.train", "paths.dev", "corpora.train.path", "corpora.dev.path"]
> ``` > ```


@ -1130,8 +1130,8 @@ any other custom workflows. `corpora.train` and `corpora.dev` are used as
conventions within spaCy's default configs, but you can also define any other conventions within spaCy's default configs, but you can also define any other
custom blocks. Each section in the corpora config should resolve to a custom blocks. Each section in the corpora config should resolve to a
[`Corpus`](/api/corpus), for example using spaCy's built-in [`Corpus`](/api/corpus), for example using spaCy's built-in
[corpus reader](/api/top-level#readers) that takes a path to a binary `.spacy` [corpus reader](/api/top-level#corpus-readers) that takes a path to a binary
file. The `train_corpus` and `dev_corpus` fields in the `.spacy` file. The `train_corpus` and `dev_corpus` fields in the
[`[training]`](/api/data-formats#config-training) block specify where to find [`[training]`](/api/data-formats#config-training) block specify where to find
the corpus in your config. This makes it easy to **swap out** different corpora the corpus in your config. This makes it easy to **swap out** different corpora
by only changing a single config setting. by only changing a single config setting.
@ -1142,21 +1142,23 @@ corpora, keyed by corpus name, e.g. `"train"` and `"dev"`. This can be
especially useful if you need to split a single file into corpora for training especially useful if you need to split a single file into corpora for training
and evaluation, without loading the same file twice. and evaluation, without loading the same file twice.
By default, the training data is loaded into memory and shuffled before each
epoch. If the corpus is **too large to fit into memory** during training, stream
the corpus using a custom reader as described in the next section.
### Custom data reading and batching {#custom-code-readers-batchers} ### Custom data reading and batching {#custom-code-readers-batchers}
Some use-cases require **streaming in data** or manipulating datasets on the Some use-cases require **streaming in data** or manipulating datasets on the
fly, rather than generating all data beforehand and storing it to file. Instead fly, rather than generating all data beforehand and storing it to disk. Instead
of using the built-in [`Corpus`](/api/corpus) reader, which uses static file of using the built-in [`Corpus`](/api/corpus) reader, which uses static file
paths, you can create and register a custom function that generates paths, you can create and register a custom function that generates
[`Example`](/api/example) objects. The resulting generator can be infinite. When [`Example`](/api/example) objects.
using this dataset for training, stopping criteria such as maximum number of
steps, or stopping when the loss does not decrease further, can be used.
In this example we assume a custom function `read_custom_data` which loads or In the following example we assume a custom function `read_custom_data` which
generates texts with relevant text classification annotations. Then, small loads or generates texts with relevant text classification annotations. Then,
lexical variations of the input text are created before generating the final small lexical variations of the input text are created before generating the
[`Example`](/api/example) objects. The `@spacy.registry.readers` decorator lets final [`Example`](/api/example) objects. The `@spacy.registry.readers` decorator
you register the function creating the custom reader in the `readers` lets you register the function creating the custom reader in the `readers`
[registry](/api/top-level#registry) and assign it a string name, so it can be [registry](/api/top-level#registry) and assign it a string name, so it can be
used in your config. All arguments on the registered function become available used in your config. All arguments on the registered function become available
as **config settings**, in this case `source`. as **config settings**, in this case `source`.
@ -1199,6 +1201,80 @@ Remember that a registered function should always be a function that spaCy
</Infobox> </Infobox>
If the corpus is **too large to load into memory** or the corpus reader is an
**infinite generator**, use the setting `max_epochs = -1` to indicate that the
train corpus should be streamed. With this setting the train corpus is merely
streamed and batched, not shuffled, so any shuffling needs to be implemented in
the corpus reader itself. In the example below, a corpus reader that generates
sentences containing even or odd numbers is used with an unlimited number of
examples for the train corpus and a limited number of examples for the dev
corpus. The dev corpus should always be finite and fit in memory during the
evaluation step. `max_steps` and/or `patience` are used to determine when the
training should stop.
> #### config.cfg
>
> ```ini
> [corpora.dev]
> @readers = "even_odd.v1"
> limit = 100
>
> [corpora.train]
> @readers = "even_odd.v1"
> limit = -1
>
> [training]
> max_epochs = -1
> patience = 500
> max_steps = 2000
> ```
```python
### functions.py
from typing import Callable, Iterable, Iterator
from spacy import util
import random
from spacy.training import Example
from spacy import Language


@util.registry.readers("even_odd.v1")
def create_even_odd_corpus(limit: int = -1) -> Callable[[Language], Iterable[Example]]:
    return EvenOddCorpus(limit)


class EvenOddCorpus:
    def __init__(self, limit):
        self.limit = limit

    def __call__(self, nlp: Language) -> Iterator[Example]:
        i = 0
        while i < self.limit or self.limit < 0:
            r = random.randint(0, 1000)
            cat = r % 2 == 0
            text = "This is sentence " + str(r)
            yield Example.from_dict(
                nlp.make_doc(text), {"cats": {"EVEN": cat, "ODD": not cat}}
            )
            i += 1
```
```
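Because a streamed train corpus is batched but not shuffled, any shuffling has to happen inside the reader. One common approach is a fixed-size shuffle buffer, sketched below in plain Python. `shuffle_buffered` is a hypothetical helper and not part of spaCy; it gives only an approximate shuffle, and it works on infinite generators because it never needs the whole stream in memory.

```python
import random
from typing import Iterable, Iterator, List, Optional, TypeVar

T = TypeVar("T")


def shuffle_buffered(
    stream: Iterable[T], buffer_size: int = 1000, seed: Optional[int] = None
) -> Iterator[T]:
    """Approximately shuffle a (possibly infinite) stream with a fixed buffer."""
    if buffer_size < 1:
        raise ValueError("buffer_size must be at least 1")
    rng = random.Random(seed)
    buffer: List[T] = []
    for item in stream:
        if len(buffer) < buffer_size:
            # Fill the buffer before emitting anything.
            buffer.append(item)
            continue
        # Buffer is full: emit a random buffered item and store the new one.
        index = rng.randrange(buffer_size)
        yield buffer[index]
        buffer[index] = item
    # The stream ended (finite case): flush the remainder in random order.
    rng.shuffle(buffer)
    yield from buffer


shuffled = list(shuffle_buffered(range(100), buffer_size=10, seed=0))
```

In a reader such as the one above, the generator that yields `Example` objects could be wrapped in a call like this before being returned. A larger `buffer_size` gives a better shuffle at the cost of more memory.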
> #### config.cfg
>
> ```ini
> [initialize.components.textcat.labels]
> @readers = "spacy.read_labels.v1"
> path = "labels/textcat.json"
> require = true
> ```
If the train corpus is streamed, the initialize step peeks at the first 100
examples in the corpus to find the labels for each component. If this isn't
sufficient, you'll need to [provide the labels](#initialization-labels) for each
component in the `[initialize]` block. [`init labels`](/api/cli#init-labels) can
be used to generate JSON files in the correct format, which you can extend with
the full label set.
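For the even/odd corpus, a labels file could also be written by hand, as in the rough sketch below. It assumes that the label file for a text classifier is a plain JSON list of label strings; verify this against a file generated by `init labels` before relying on it. The helpers `dump_labels` and `write_labels` are hypothetical names, and the path in the usage note mirrors the config excerpt.

```python
import json
from pathlib import Path
from typing import Iterable


def dump_labels(labels: Iterable[str]) -> str:
    """Serialize labels as a sorted, de-duplicated JSON list (assumed format)."""
    return json.dumps(sorted(set(labels)), indent=2)


def write_labels(path: Path, labels: Iterable[str]) -> None:
    """Write the label list to `path`, creating parent directories if needed."""
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(dump_labels(labels), encoding="utf8")


serialized = dump_labels(["ODD", "EVEN", "ODD"])
```

Calling `write_labels(Path("labels/textcat.json"), ["EVEN", "ODD"])` would then produce a file at the location referenced by the `[initialize.components.textcat.labels]` block.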
We can also customize the **batching strategy** by registering a new batcher We can also customize the **batching strategy** by registering a new batcher
function in the `batchers` [registry](/api/top-level#registry). A batcher turns function in the `batchers` [registry](/api/top-level#registry). A batcher turns
a stream of items into a stream of batches. spaCy has several useful built-in a stream of items into a stream of batches. spaCy has several useful built-in


@ -616,11 +616,11 @@ Note that spaCy v3.0 now requires **Python 3.6+**.
| `spacy profile` | [`spacy debug profile`](/api/cli#debug-profile) | | `spacy profile` | [`spacy debug profile`](/api/cli#debug-profile) |
| `spacy link`, `util.set_data_path`, `util.get_data_path` | not needed, symlinks are deprecated | | `spacy link`, `util.set_data_path`, `util.get_data_path` | not needed, symlinks are deprecated |
The following deprecated methods, attributes and arguments were removed in v3.0. The following methods, attributes and arguments were removed in v3.0. Most of
Most of them have been **deprecated for a while** and many would previously them have been **deprecated for a while** and many would previously raise
raise errors. Many of them were also mostly internals. If you've been working errors. Many of them were also mostly internals. If you've been working with
with more recent versions of spaCy v2.x, it's **unlikely** that your code relied more recent versions of spaCy v2.x, it's **unlikely** that your code relied on
on them. them.
| Removed | Replacement | | Removed | Replacement |
| ----------------------------------------------------------------------------------------------------------------------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------- | | ----------------------------------------------------------------------------------------------------------------------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------- |
@ -637,10 +637,10 @@ on them.
### Downloading and loading trained pipelines {#migrating-downloading-models} ### Downloading and loading trained pipelines {#migrating-downloading-models}
Symlinks and shortcuts like `en` are now officially deprecated. There are Symlinks and shortcuts like `en` have been deprecated for a while, and are now
[many different trained pipelines](/models) with different capabilities and not not supported anymore. There are [many different trained pipelines](/models)
just one "English model". In order to download and load a package, you should with different capabilities and not just one "English model". In order to
always use its full name, for instance download and load a package, you should always use its full name, for instance
[`en_core_web_sm`](/models/en#en_core_web_sm). [`en_core_web_sm`](/models/en#en_core_web_sm).
```diff ```diff
@ -1185,9 +1185,10 @@ package isn't imported.
In Jupyter notebooks, run [`prefer_gpu`](/api/top-level#spacy.prefer_gpu), In Jupyter notebooks, run [`prefer_gpu`](/api/top-level#spacy.prefer_gpu),
[`require_gpu`](/api/top-level#spacy.require_gpu) or [`require_gpu`](/api/top-level#spacy.require_gpu) or
[`require_cpu`](/api/top-level#spacy.require_cpu) in the same cell as [`require_cpu`](/api/top-level#spacy.require_cpu) in the same cell as
[`spacy.load`](/api/top-level#spacy.load) to ensure that the model is loaded on the correct device. [`spacy.load`](/api/top-level#spacy.load) to ensure that the model is loaded on
the correct device.
Due to a bug related to `contextvars` (see the [bug Due to a bug related to `contextvars` (see the
report](https://github.com/ipython/ipython/issues/11565)), the GPU settings may [bug report](https://github.com/ipython/ipython/issues/11565)), the GPU settings
not be preserved correctly across cells, resulting in models being loaded on may not be preserved correctly across cells, resulting in models being loaded on
the wrong device or only partially on GPU. the wrong device or only partially on GPU.