Mirror of https://github.com/explosion/spaCy.git (synced 2025-01-01 04:46:38 +03:00)
Merge branch 'master' into spacy.io
commit 29ac7f776a
.github/azure-steps.yml (vendored, normal file; 57 lines changed)
@@ -0,0 +1,57 @@
+parameters:
+  python_version: ''
+  architecture: ''
+  prefix: ''
+  gpu: false
+  num_build_jobs: 1
+
+steps:
+  - task: UsePythonVersion@0
+    inputs:
+      versionSpec: ${{ parameters.python_version }}
+      architecture: ${{ parameters.architecture }}
+
+  - script: |
+      ${{ parameters.prefix }} python -m pip install -U pip setuptools
+      ${{ parameters.prefix }} python -m pip install -U -r requirements.txt
+    displayName: "Install dependencies"
+
+  - script: |
+      ${{ parameters.prefix }} python setup.py build_ext --inplace -j ${{ parameters.num_build_jobs }}
+      ${{ parameters.prefix }} python setup.py sdist --formats=gztar
+    displayName: "Compile and build sdist"
+
+  - task: DeleteFiles@1
+    inputs:
+      contents: "spacy"
+    displayName: "Delete source directory"
+
+  - script: |
+      ${{ parameters.prefix }} python -m pip freeze --exclude torch --exclude cupy-cuda110 > installed.txt
+      ${{ parameters.prefix }} python -m pip uninstall -y -r installed.txt
+    displayName: "Uninstall all packages"
+
+  - bash: |
+      ${{ parameters.prefix }} SDIST=$(python -c "import os;print(os.listdir('./dist')[-1])" 2>&1)
+      ${{ parameters.prefix }} python -m pip install dist/$SDIST
+    displayName: "Install from sdist"
+
+  - script: |
+      ${{ parameters.prefix }} python -m pip install -U -r requirements.txt
+    displayName: "Install test requirements"
+
+  - script: |
+      ${{ parameters.prefix }} python -m pip install -U cupy-cuda110
+      ${{ parameters.prefix }} python -m pip install "torch==1.7.1+cu110" -f https://download.pytorch.org/whl/torch_stable.html
+    displayName: "Install GPU requirements"
+    condition: eq(${{ parameters.gpu }}, true)
+
+  - script: |
+      ${{ parameters.prefix }} python -m pytest --pyargs spacy
+    displayName: "Run CPU tests"
+    condition: eq(${{ parameters.gpu }}, false)
+
+  - script: |
+      ${{ parameters.prefix }} python -m pytest --pyargs spacy -p spacy.tests.enable_gpu
+    displayName: "Run GPU tests"
+    condition: eq(${{ parameters.gpu }}, true)
.github/contributors/AyushExel.md (vendored, normal file; 106 lines changed)
@@ -0,0 +1,106 @@
+# spaCy contributor agreement
+
+This spaCy Contributor Agreement (**"SCA"**) is based on the
+[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf).
+The SCA applies to any contribution that you make to any product or project
+managed by us (the **"project"**), and sets out the intellectual property rights
+you grant to us in the contributed materials. The term **"us"** shall mean
+[ExplosionAI GmbH](https://explosion.ai/legal). The term
+**"you"** shall mean the person or entity identified below.
+
+If you agree to be bound by these terms, fill in the information requested
+below and include the filled-in version with your first pull request, under the
+folder [`.github/contributors/`](/.github/contributors/). The name of the file
+should be your GitHub username, with the extension `.md`. For example, the user
+example_user would create the file `.github/contributors/example_user.md`.
+
+Read this agreement carefully before signing. These terms and conditions
+constitute a binding legal agreement.
+
+## Contributor Agreement
+
+1. The term "contribution" or "contributed materials" means any source code,
+object code, patch, tool, sample, graphic, specification, manual,
+documentation, or any other material posted or submitted by you to the project.
+
+2. With respect to any worldwide copyrights, or copyright applications and
+registrations, in your contribution:
+
+    * you hereby assign to us joint ownership, and to the extent that such
+    assignment is or becomes invalid, ineffective or unenforceable, you hereby
+    grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
+    royalty-free, unrestricted license to exercise all rights under those
+    copyrights. This includes, at our option, the right to sublicense these same
+    rights to third parties through multiple levels of sublicensees or other
+    licensing arrangements;
+
+    * you agree that each of us can do all things in relation to your
+    contribution as if each of us were the sole owners, and if one of us makes
+    a derivative work of your contribution, the one who makes the derivative
+    work (or has it made will be the sole owner of that derivative work;
+
+    * you agree that you will not assert any moral rights in your contribution
+    against us, our licensees or transferees;
+
+    * you agree that we may register a copyright in your contribution and
+    exercise all ownership rights associated with it; and
+
+    * you agree that neither of us has any duty to consult with, obtain the
+    consent of, pay or render an accounting to the other for any use or
+    distribution of your contribution.
+
+3. With respect to any patents you own, or that you can license without payment
+to any third party, you hereby grant to us a perpetual, irrevocable,
+non-exclusive, worldwide, no-charge, royalty-free license to:
+
+    * make, have made, use, sell, offer to sell, import, and otherwise transfer
+    your contribution in whole or in part, alone or in combination with or
+    included in any product, work or materials arising out of the project to
+    which your contribution was submitted, and
+
+    * at our option, to sublicense these same rights to third parties through
+    multiple levels of sublicensees or other licensing arrangements.
+
+4. Except as set out above, you keep all right, title, and interest in your
+contribution. The rights that you grant to us under these terms are effective
+on the date you first submitted a contribution to us, even if your submission
+took place before the date you sign these terms.
+
+5. You covenant, represent, warrant and agree that:
+
+    * Each contribution that you submit is and shall be an original work of
+    authorship and you can legally grant the rights set out in this SCA;
+
+    * to the best of your knowledge, each contribution will not violate any
+    third party's copyrights, trademarks, patents, or other intellectual
+    property rights; and
+
+    * each contribution shall be in compliance with U.S. export control laws and
+    other applicable export and import laws. You agree to notify us if you
+    become aware of any circumstance which would make any of the foregoing
+    representations inaccurate in any respect. We may publicly disclose your
+    participation in the project, including the fact that you have signed the SCA.
+
+6. This SCA is governed by the laws of the State of California and applicable
+U.S. Federal law. Any choice of law rules will not apply.
+
+7. Please place an “x” on one of the applicable statement below. Please do NOT
+mark both statements:
+
+    * [X] I am signing on behalf of myself as an individual and no other person
+    or entity, including my employer, has or will have rights with respect to my
+    contributions.
+
+    * [ ] I am signing on behalf of my employer or a legal entity and I have the
+    actual authority to contractually bind that entity.
+
+## Contributor Details
+
+| Field                          | Entry                |
+|------------------------------- | -------------------- |
+| Name                           | Ayush Chaurasia      |
+| Company name (if applicable)   |                      |
+| Title or role (if applicable)  |                      |
+| Date                           | 2021-03-12           |
+| GitHub username                | AyushExel            |
+| Website (optional)             |                      |
.github/contributors/broaddeep.md (vendored, normal file; 106 lines changed)
@@ -0,0 +1,106 @@
+# spaCy contributor agreement
+
+This spaCy Contributor Agreement (**"SCA"**) is based on the
+[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf).
+The SCA applies to any contribution that you make to any product or project
+managed by us (the **"project"**), and sets out the intellectual property rights
+you grant to us in the contributed materials. The term **"us"** shall mean
+[ExplosionAI GmbH](https://explosion.ai/legal). The term
+**"you"** shall mean the person or entity identified below.
+
+If you agree to be bound by these terms, fill in the information requested
+below and include the filled-in version with your first pull request, under the
+folder [`.github/contributors/`](/.github/contributors/). The name of the file
+should be your GitHub username, with the extension `.md`. For example, the user
+example_user would create the file `.github/contributors/example_user.md`.
+
+Read this agreement carefully before signing. These terms and conditions
+constitute a binding legal agreement.
+
+## Contributor Agreement
+
+1. The term "contribution" or "contributed materials" means any source code,
+object code, patch, tool, sample, graphic, specification, manual,
+documentation, or any other material posted or submitted by you to the project.
+
+2. With respect to any worldwide copyrights, or copyright applications and
+registrations, in your contribution:
+
+    * you hereby assign to us joint ownership, and to the extent that such
+    assignment is or becomes invalid, ineffective or unenforceable, you hereby
+    grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
+    royalty-free, unrestricted license to exercise all rights under those
+    copyrights. This includes, at our option, the right to sublicense these same
+    rights to third parties through multiple levels of sublicensees or other
+    licensing arrangements;
+
+    * you agree that each of us can do all things in relation to your
+    contribution as if each of us were the sole owners, and if one of us makes
+    a derivative work of your contribution, the one who makes the derivative
+    work (or has it made will be the sole owner of that derivative work;
+
+    * you agree that you will not assert any moral rights in your contribution
+    against us, our licensees or transferees;
+
+    * you agree that we may register a copyright in your contribution and
+    exercise all ownership rights associated with it; and
+
+    * you agree that neither of us has any duty to consult with, obtain the
+    consent of, pay or render an accounting to the other for any use or
+    distribution of your contribution.
+
+3. With respect to any patents you own, or that you can license without payment
+to any third party, you hereby grant to us a perpetual, irrevocable,
+non-exclusive, worldwide, no-charge, royalty-free license to:
+
+    * make, have made, use, sell, offer to sell, import, and otherwise transfer
+    your contribution in whole or in part, alone or in combination with or
+    included in any product, work or materials arising out of the project to
+    which your contribution was submitted, and
+
+    * at our option, to sublicense these same rights to third parties through
+    multiple levels of sublicensees or other licensing arrangements.
+
+4. Except as set out above, you keep all right, title, and interest in your
+contribution. The rights that you grant to us under these terms are effective
+on the date you first submitted a contribution to us, even if your submission
+took place before the date you sign these terms.
+
+5. You covenant, represent, warrant and agree that:
+
+    * Each contribution that you submit is and shall be an original work of
+    authorship and you can legally grant the rights set out in this SCA;
+
+    * to the best of your knowledge, each contribution will not violate any
+    third party's copyrights, trademarks, patents, or other intellectual
+    property rights; and
+
+    * each contribution shall be in compliance with U.S. export control laws and
+    other applicable export and import laws. You agree to notify us if you
+    become aware of any circumstance which would make any of the foregoing
+    representations inaccurate in any respect. We may publicly disclose your
+    participation in the project, including the fact that you have signed the SCA.
+
+6. This SCA is governed by the laws of the State of California and applicable
+U.S. Federal law. Any choice of law rules will not apply.
+
+7. Please place an “x” on one of the applicable statement below. Please do NOT
+mark both statements:
+
+    * [x] I am signing on behalf of myself as an individual and no other person
+    or entity, including my employer, has or will have rights with respect to my
+    contributions.
+
+    * [ ] I am signing on behalf of my employer or a legal entity and I have the
+    actual authority to contractually bind that entity.
+
+## Contributor Details
+
+| Field                          | Entry                |
+|------------------------------- | -------------------- |
+| Name                           | Dongjun Park         |
+| Company name (if applicable)   |                      |
+| Title or role (if applicable)  |                      |
+| Date                           | 2021-03-06           |
+| GitHub username                | broaddeep            |
+| Website (optional)             |                      |
|
|
@@ -76,39 +76,24 @@ jobs:
       maxParallel: 4
     pool:
       vmImage: $(imageName)
 
     steps:
-      - task: UsePythonVersion@0
-        inputs:
-          versionSpec: "$(python.version)"
-          architecture: "x64"
+      - template: .github/azure-steps.yml
+        parameters:
+          python_version: '$(python.version)'
+          architecture: 'x64'
-
-      - script: |
-          python -m pip install -U setuptools
-          pip install -r requirements.txt
-        displayName: "Install dependencies"
-
-      - script: |
-          python setup.py build_ext --inplace
-          python setup.py sdist --formats=gztar
-        displayName: "Compile and build sdist"
-
-      - task: DeleteFiles@1
-        inputs:
-          contents: "spacy"
-        displayName: "Delete source directory"
-
-      - script: |
-          pip freeze > installed.txt
-          pip uninstall -y -r installed.txt
-        displayName: "Uninstall all packages"
-
-      - bash: |
-          SDIST=$(python -c "import os;print(os.listdir('./dist')[-1])" 2>&1)
-          pip install dist/$SDIST
-        displayName: "Install from sdist"
-
-      - script: |
-          pip install -r requirements.txt
-          python -m pytest --pyargs spacy
-        displayName: "Run tests"
+  - job: "TestGPU"
+    dependsOn: "Validate"
+    strategy:
+      matrix:
+        Python38LinuxX64_GPU:
+          python.version: '3.8'
+    pool:
+      name: "LinuxX64_GPU"
+    steps:
+      - template: .github/azure-steps.yml
+        parameters:
+          python_version: '$(python.version)'
+          architecture: 'x64'
+          gpu: true
+          num_build_jobs: 24
|
|
|
@@ -5,7 +5,7 @@ requires = [
     "cymem>=2.0.2,<2.1.0",
     "preshed>=3.0.2,<3.1.0",
     "murmurhash>=0.28.0,<1.1.0",
-    "thinc>=8.0.2,<8.1.0",
+    "thinc>=8.0.3,<8.1.0",
     "blis>=0.4.0,<0.8.0",
     "pathy",
     "numpy>=1.15.0",
|
|
|
@@ -1,14 +1,14 @@
 # Our libraries
-spacy-legacy>=3.0.0,<3.1.0
+spacy-legacy>=3.0.4,<3.1.0
 cymem>=2.0.2,<2.1.0
 preshed>=3.0.2,<3.1.0
-thinc>=8.0.2,<8.1.0
+thinc>=8.0.3,<8.1.0
 blis>=0.4.0,<0.8.0
 ml_datasets>=0.2.0,<0.3.0
 murmurhash>=0.28.0,<1.1.0
 wasabi>=0.8.1,<1.1.0
-srsly>=2.4.0,<3.0.0
-catalogue>=2.0.1,<2.1.0
+srsly>=2.4.1,<3.0.0
+catalogue>=2.0.3,<2.1.0
 typer>=0.3.0,<0.4.0
 pathy>=0.3.5
 # Third party dependencies
@@ -20,7 +20,6 @@ jinja2
 # Official Python utilities
 setuptools
 packaging>=20.0
-importlib_metadata>=0.20; python_version < "3.8"
 typing_extensions>=3.7.4.1,<4.0.0.0; python_version < "3.8"
 # Development dependencies
 cython>=0.25
setup.cfg (13 lines changed)
@@ -34,18 +34,18 @@ setup_requires =
     cymem>=2.0.2,<2.1.0
     preshed>=3.0.2,<3.1.0
     murmurhash>=0.28.0,<1.1.0
-    thinc>=8.0.2,<8.1.0
+    thinc>=8.0.3,<8.1.0
 install_requires =
     # Our libraries
-    spacy-legacy>=3.0.0,<3.1.0
+    spacy-legacy>=3.0.4,<3.1.0
     murmurhash>=0.28.0,<1.1.0
     cymem>=2.0.2,<2.1.0
     preshed>=3.0.2,<3.1.0
-    thinc>=8.0.2,<8.1.0
+    thinc>=8.0.3,<8.1.0
     blis>=0.4.0,<0.8.0
     wasabi>=0.8.1,<1.1.0
-    srsly>=2.4.0,<3.0.0
-    catalogue>=2.0.1,<2.1.0
+    srsly>=2.4.1,<3.0.0
+    catalogue>=2.0.3,<2.1.0
     typer>=0.3.0,<0.4.0
     pathy>=0.3.5
     # Third-party dependencies
@@ -57,7 +57,6 @@ install_requires =
     # Official Python utilities
     setuptools
     packaging>=20.0
-    importlib_metadata>=0.20; python_version < "3.8"
     typing_extensions>=3.7.4,<4.0.0.0; python_version < "3.8"
 
 [options.entry_points]
@@ -91,6 +90,8 @@ cuda110 =
     cupy-cuda110>=5.0.0b4,<9.0.0
 cuda111 =
     cupy-cuda111>=5.0.0b4,<9.0.0
+cuda112 =
+    cupy-cuda112>=5.0.0b4,<9.0.0
 # Language tokenizers with external dependencies
 ja =
     sudachipy>=0.4.9
|
|
|
@@ -1,6 +1,6 @@
 # fmt: off
 __title__ = "spacy"
-__version__ = "3.0.5"
+__version__ = "3.0.6"
 __download_url__ = "https://github.com/explosion/spacy-models/releases/download"
 __compatibility__ = "https://raw.githubusercontent.com/explosion/spacy-models/master/compatibility.json"
 __projects__ = "https://github.com/explosion/projects"
|
|
|
@@ -9,6 +9,7 @@ from .info import info  # noqa: F401
 from .package import package  # noqa: F401
 from .profile import profile  # noqa: F401
 from .train import train_cli  # noqa: F401
+from .assemble import assemble_cli  # noqa: F401
 from .pretrain import pretrain  # noqa: F401
 from .debug_data import debug_data  # noqa: F401
 from .debug_config import debug_config  # noqa: F401
@@ -29,9 +30,9 @@ from .project.document import project_document  # noqa: F401
 
 @app.command("link", no_args_is_help=True, deprecated=True, hidden=True)
 def link(*args, **kwargs):
-    """As of spaCy v3.0, symlinks like "en" are deprecated. You can load trained
+    """As of spaCy v3.0, symlinks like "en" are not supported anymore. You can load trained
     pipeline packages using their full names or from a directory path."""
     msg.warn(
-        "As of spaCy v3.0, model symlinks are deprecated. You can load trained "
+        "As of spaCy v3.0, model symlinks are not supported anymore. You can load trained "
         "pipeline packages using their full names or from a directory path."
     )
spacy/cli/assemble.py (normal file; 58 lines changed)
@@ -0,0 +1,58 @@
+from typing import Optional
+from pathlib import Path
+from wasabi import msg
+import typer
+import logging
+
+from ._util import app, Arg, Opt, parse_config_overrides, show_validation_error
+from ._util import import_code
+from ..training.initialize import init_nlp
+from .. import util
+from ..util import get_sourced_components, load_model_from_config
+
+
+@app.command(
+    "assemble",
+    context_settings={"allow_extra_args": True, "ignore_unknown_options": True},
+)
+def assemble_cli(
+    # fmt: off
+    ctx: typer.Context,  # This is only used to read additional arguments
+    config_path: Path = Arg(..., help="Path to config file", exists=True, allow_dash=True),
+    output_path: Path = Arg(..., help="Output directory to store assembled pipeline in"),
+    code_path: Optional[Path] = Opt(None, "--code", "-c", help="Path to Python file with additional code (registered functions) to be imported"),
+    verbose: bool = Opt(False, "--verbose", "-V", "-VV", help="Display more information for debugging purposes"),
+    # fmt: on
+):
+    """
+    Assemble a spaCy pipeline from a config file. The config file includes
+    all settings for initializing the pipeline. To override settings in the
+    config, e.g. settings that point to local paths or that you want to
+    experiment with, you can override them as command line options. The
+    --code argument lets you pass in a Python file that can be used to
+    register custom functions that are referenced in the config.
+
+    DOCS: https://spacy.io/api/cli#assemble
+    """
+    util.logger.setLevel(logging.DEBUG if verbose else logging.INFO)
+    # Make sure all files and paths exists if they are needed
+    if not config_path or (str(config_path) != "-" and not config_path.exists()):
+        msg.fail("Config file not found", config_path, exits=1)
+    overrides = parse_config_overrides(ctx.args)
+    import_code(code_path)
+    with show_validation_error(config_path):
+        config = util.load_config(config_path, overrides=overrides, interpolate=False)
+    msg.divider("Initializing pipeline")
+    nlp = load_model_from_config(config, auto_fill=True)
+    config = config.interpolate()
+    sourced = get_sourced_components(config)
+    # Make sure that listeners are defined before initializing further
+    nlp._link_components()
+    with nlp.select_pipes(disable=[*sourced]):
+        nlp.initialize()
+    msg.good("Initialized pipeline")
+    msg.divider("Serializing to disk")
+    if output_path is not None and not output_path.exists():
+        output_path.mkdir(parents=True)
+        msg.good(f"Created output directory: {output_path}")
+    nlp.to_disk(output_path)
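Note on the override mechanism: the `assemble` docstring says config settings can be overridden as command line options, and the command reads them from `ctx.args`. The sketch below illustrates one way such extra arguments could be turned into a dotted-key override mapping. It is a simplified stand-in written for this note, not spaCy's `parse_config_overrides`; the function name `sketch_parse_overrides` and the example option names are assumptions.

```python
import json
from typing import Any, Dict, List


def sketch_parse_overrides(args: List[str]) -> Dict[str, Any]:
    """Hypothetical helper: map ["--a.b", "1", "--c.d=x"] to {"a.b": 1, "c.d": "x"}."""
    result: Dict[str, Any] = {}
    pending = list(args)  # work on a copy so the caller's list is not mutated
    while pending:
        opt = pending.pop(0)
        if not opt.startswith("--"):
            raise ValueError(f"Invalid config override: {opt!r}")
        name = opt[2:]
        if "=" in name:
            # Form: --section.key=value
            name, raw = name.split("=", 1)
        elif pending and not pending[0].startswith("--"):
            # Form: --section.key value
            raw = pending.pop(0)
        else:
            # A bare option with no value is treated as a boolean flag
            raw = "true"
        if "." not in name:
            raise ValueError(f"Override must use a dotted name: {opt!r}")
        try:
            # Parse numbers, booleans and lists as JSON ...
            result[name] = json.loads(raw)
        except ValueError:
            # ... and fall back to the raw string (e.g. an unquoted path)
            result[name] = raw
    return result


overrides = sketch_parse_overrides(
    ["--paths.train", "./train.spacy", "--training.max_steps=100"]
)
print(overrides)
```

The JSON-then-string fallback is a design choice of this sketch: it lets paths be passed unquoted while numbers and booleans still arrive typed.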
|
|
@@ -1,4 +1,4 @@
-from typing import List, Sequence, Dict, Any, Tuple, Optional
+from typing import List, Sequence, Dict, Any, Tuple, Optional, Set
 from pathlib import Path
 from collections import Counter
 import sys
@@ -13,6 +13,8 @@ from ..training.initialize import get_sourced_components
 from ..schemas import ConfigSchemaTraining
 from ..pipeline._parser_internals import nonproj
 from ..pipeline._parser_internals.nonproj import DELIMITER
+from ..pipeline import Morphologizer
+from ..morphology import Morphology
 from ..language import Language
 from ..util import registry, resolve_dot_names
 from .. import util
@@ -194,32 +196,32 @@ def debug_data(
         )
         label_counts = gold_train_data["ner"]
         model_labels = _get_labels_from_model(nlp, "ner")
-        new_labels = [l for l in labels if l not in model_labels]
-        existing_labels = [l for l in labels if l in model_labels]
         has_low_data_warning = False
         has_no_neg_warning = False
         has_ws_ents_error = False
         has_punct_ents_warning = False
 
         msg.divider("Named Entity Recognition")
-        msg.info(
-            f"{len(new_labels)} new label(s), {len(existing_labels)} existing label(s)"
-        )
+        msg.info(f"{len(model_labels)} label(s)")
         missing_values = label_counts["-"]
         msg.text(f"{missing_values} missing value(s) (tokens with '-' label)")
-        for label in new_labels:
+        for label in labels:
             if len(label) == 0:
-                msg.fail("Empty label found in new labels")
-        if new_labels:
-            labels_with_counts = [
-                (label, count)
-                for label, count in label_counts.most_common()
-                if label != "-"
-            ]
-            labels_with_counts = _format_labels(labels_with_counts, counts=True)
-            msg.text(f"New: {labels_with_counts}", show=verbose)
-        if existing_labels:
-            msg.text(f"Existing: {_format_labels(existing_labels)}", show=verbose)
+                msg.fail("Empty label found in train data")
+        labels_with_counts = [
+            (label, count)
+            for label, count in label_counts.most_common()
+            if label != "-"
+        ]
+        labels_with_counts = _format_labels(labels_with_counts, counts=True)
+        msg.text(f"Labels in train data: {_format_labels(labels)}", show=verbose)
+        missing_labels = model_labels - labels
+        if missing_labels:
+            msg.warn(
+                "Some model labels are not present in the train data. The "
+                "model performance may be degraded for these labels after "
+                f"training: {_format_labels(missing_labels)}."
+            )
         if gold_train_data["ws_ents"]:
             msg.fail(f"{gold_train_data['ws_ents']} invalid whitespace entity spans")
             has_ws_ents_error = True
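Note on the missing-label check: the hunk above replaces the "new" versus "existing" label split with a set difference between the model's labels and the labels seen in the training data. A minimal standalone sketch of that idea follows. The function name `find_missing_labels` and the sample labels are illustrative assumptions, not part of the commit.

```python
from typing import Iterable, Set


def find_missing_labels(
    model_labels: Iterable[str], train_labels: Iterable[str]
) -> Set[str]:
    """Return labels the model defines that never occur in the training data."""
    # Both sides are converted to sets because "-" is not defined between
    # a set and a list; it would raise TypeError.
    return set(model_labels) - set(train_labels)


missing = find_missing_labels({"PERSON", "ORG", "GPE"}, ["PERSON", "ORG"])
if missing:
    print(f"Labels missing from train data: {sorted(missing)}")
```

This is presumably also why the same commit changes the return annotation of `_get_labels_from_model` from `Sequence[str]` to `Set[str]`: the difference operator needs a set on the left-hand side.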
@@ -228,10 +230,10 @@ def debug_data(
             msg.warn(f"{gold_train_data['punct_ents']} entity span(s) with punctuation")
             has_punct_ents_warning = True
 
-        for label in new_labels:
+        for label in labels:
             if label_counts[label] <= NEW_LABEL_THRESHOLD:
                 msg.warn(
-                    f"Low number of examples for new label '{label}' ({label_counts[label]})"
+                    f"Low number of examples for label '{label}' ({label_counts[label]})"
                 )
                 has_low_data_warning = True
 
@@ -276,22 +278,52 @@ def debug_data(
         )
 
     if "textcat" in factory_names:
-        msg.divider("Text Classification")
-        labels = [label for label in gold_train_data["cats"]]
-        model_labels = _get_labels_from_model(nlp, "textcat")
-        new_labels = [l for l in labels if l not in model_labels]
-        existing_labels = [l for l in labels if l in model_labels]
-        msg.info(
-            f"Text Classification: {len(new_labels)} new label(s), "
-            f"{len(existing_labels)} existing label(s)"
+        msg.divider("Text Classification (Exclusive Classes)")
+        labels = _get_labels_from_model(nlp, "textcat")
+        msg.info(f"Text Classification: {len(labels)} label(s)")
+        msg.text(f"Labels: {_format_labels(labels)}", show=verbose)
+        labels_with_counts = _format_labels(
+            gold_train_data["cats"].most_common(), counts=True
         )
-        if new_labels:
-            labels_with_counts = _format_labels(
-                gold_train_data["cats"].most_common(), counts=True
+        msg.text(f"Labels in train data: {labels_with_counts}", show=verbose)
+        missing_labels = labels - set(gold_train_data["cats"].keys())
+        if missing_labels:
+            msg.warn(
+                "Some model labels are not present in the train data. The "
+                "model performance may be degraded for these labels after "
+                f"training: {_format_labels(missing_labels)}."
+            )
+        if gold_train_data["n_cats_multilabel"] > 0:
+            # Note: you should never get here because you run into E895 on
+            # initialization first.
+            msg.warn(
+                "The train data contains instances without "
+                "mutually-exclusive classes. Use the component "
+                "'textcat_multilabel' instead of 'textcat'."
+            )
+        if gold_dev_data["n_cats_multilabel"] > 0:
+            msg.fail(
+                "Train/dev mismatch: the dev data contains instances "
+                "without mutually-exclusive classes while the train data "
+                "contains only instances with mutually-exclusive classes."
+            )
+
+    if "textcat_multilabel" in factory_names:
+        msg.divider("Text Classification (Multilabel)")
+        labels = _get_labels_from_model(nlp, "textcat_multilabel")
+        msg.info(f"Text Classification: {len(labels)} label(s)")
+        msg.text(f"Labels: {_format_labels(labels)}", show=verbose)
+        labels_with_counts = _format_labels(
+            gold_train_data["cats"].most_common(), counts=True
+        )
+        msg.text(f"Labels in train data: {labels_with_counts}", show=verbose)
+        missing_labels = labels - set(gold_train_data["cats"].keys())
+        if missing_labels:
+            msg.warn(
+                "Some model labels are not present in the train data. The "
+                "model performance may be degraded for these labels after "
+                f"training: {_format_labels(missing_labels)}."
             )
-            msg.text(f"New: {labels_with_counts}", show=verbose)
-        if existing_labels:
-            msg.text(f"Existing: {_format_labels(existing_labels)}", show=verbose)
         if set(gold_train_data["cats"]) != set(gold_dev_data["cats"]):
             msg.fail(
                 f"The train and dev labels are not the same. "
@@ -299,11 +331,6 @@ def debug_data(
                 f"Dev labels: {_format_labels(gold_dev_data['cats'])}."
             )
         if gold_train_data["n_cats_multilabel"] > 0:
-            msg.info(
-                "The train data contains instances without "
-                "mutually-exclusive classes. Use '--textcat-multilabel' "
-                "when training."
-            )
             if gold_dev_data["n_cats_multilabel"] == 0:
                 msg.warn(
                     "Potential train/dev mismatch: the train data contains "
@@ -311,9 +338,10 @@ def debug_data(
                     "dev data does not."
                 )
         else:
-            msg.info(
+            msg.warn(
                 "The train data contains only instances with "
-                "mutually-exclusive classes."
+                "mutually-exclusive classes. You can potentially use the "
+                "component 'textcat' instead of 'textcat_multilabel'."
             )
             if gold_dev_data["n_cats_multilabel"] > 0:
                 msg.fail(
@@ -325,13 +353,37 @@ def debug_data(
     if "tagger" in factory_names:
         msg.divider("Part-of-speech Tagging")
         labels = [label for label in gold_train_data["tags"]]
-        # TODO: does this need to be updated?
-        msg.info(f"{len(labels)} label(s) in data")
+        model_labels = _get_labels_from_model(nlp, "tagger")
+        msg.info(f"{len(labels)} label(s) in train data")
+        missing_labels = model_labels - set(labels)
+        if missing_labels:
+            msg.warn(
+                "Some model labels are not present in the train data. The "
+                "model performance may be degraded for these labels after "
+                f"training: {_format_labels(missing_labels)}."
+            )
         labels_with_counts = _format_labels(
             gold_train_data["tags"].most_common(), counts=True
         )
         msg.text(labels_with_counts, show=verbose)
 
+    if "morphologizer" in factory_names:
+        msg.divider("Morphologizer (POS+Morph)")
+        labels = [label for label in gold_train_data["morphs"]]
+        model_labels = _get_labels_from_model(nlp, "morphologizer")
+        msg.info(f"{len(labels)} label(s) in train data")
+        missing_labels = model_labels - set(labels)
+        if missing_labels:
+            msg.warn(
+                "Some model labels are not present in the train data. The "
+                "model performance may be degraded for these labels after "
+                f"training: {_format_labels(missing_labels)}."
+            )
+        labels_with_counts = _format_labels(
+            gold_train_data["morphs"].most_common(), counts=True
+        )
+        msg.text(labels_with_counts, show=verbose)
+
     if "parser" in factory_names:
         has_low_data_warning = False
         msg.divider("Dependency Parsing")
@@ -491,6 +543,7 @@ def _compile_gold(
         "ner": Counter(),
         "cats": Counter(),
         "tags": Counter(),
+        "morphs": Counter(),
         "deps": Counter(),
         "words": Counter(),
         "roots": Counter(),
@ -544,13 +597,36 @@ def _compile_gold(
|
|||
data["ner"][combined_label] += 1
|
||||
elif label == "-":
|
||||
data["ner"]["-"] += 1
|
||||
if "textcat" in factory_names:
|
||||
if "textcat" in factory_names or "textcat_multilabel" in factory_names:
|
||||
data["cats"].update(gold.cats)
|
||||
if list(gold.cats.values()).count(1.0) != 1:
|
||||
data["n_cats_multilabel"] += 1
|
||||
if "tagger" in factory_names:
|
||||
tags = eg.get_aligned("TAG", as_string=True)
|
||||
data["tags"].update([x for x in tags if x is not None])
|
||||
if "morphologizer" in factory_names:
|
||||
pos_tags = eg.get_aligned("POS", as_string=True)
|
||||
morphs = eg.get_aligned("MORPH", as_string=True)
|
||||
for pos, morph in zip(pos_tags, morphs):
|
||||
# POS may align (same value for multiple tokens) when morph
|
||||
# doesn't, so if either is misaligned (None), treat the
|
||||
# annotation as missing so that truths doesn't end up with an
|
||||
# unknown morph+POS combination
|
||||
if pos is None or morph is None:
|
||||
pass
|
||||
# If both are unset, the annotation is missing (empty morph
|
||||
# converted from int is "_" rather than "")
|
||||
elif pos == "" and morph == "":
|
||||
pass
|
||||
# Otherwise, generate the combined label
|
||||
else:
|
||||
label_dict = Morphology.feats_to_dict(morph)
|
||||
if pos:
|
||||
label_dict[Morphologizer.POS_FEAT] = pos
|
||||
label = eg.reference.vocab.strings[
|
||||
eg.reference.vocab.morphology.add(label_dict)
|
||||
]
|
||||
data["morphs"].update([label])
|
||||
if "parser" in factory_names:
|
||||
aligned_heads, aligned_deps = eg.get_aligned_parse(projectivize=make_proj)
|
||||
data["deps"].update([x for x in aligned_deps if x is not None])
|
||||
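Note on the morphologizer branch above: the three cases (misaligned, unset, otherwise) can be sketched without spaCy. This is a simplified stand-in under stated assumptions: `feats_to_dict` and `combined_label` here are hypothetical pure-Python helpers, the feature key `"POS"` is assumed, and joining sorted `key=value` pairs with `|` only approximates what the vocab's morphology table produces.

```python
from typing import Dict, Optional

POS_FEAT = "POS"  # assumed name of the feature key that carries the POS tag


def feats_to_dict(feats: str) -> Dict[str, str]:
    """Parse 'Case=Nom|Number=Sing' into a dict; '' and '_' mean no features."""
    if not feats or feats == "_":
        return {}
    return dict(pair.split("=", 1) for pair in feats.split("|"))


def combined_label(pos: Optional[str], morph: Optional[str]) -> Optional[str]:
    """Return a combined morph+POS label, or None if the annotation is missing."""
    # Either value misaligned: treat the whole annotation as missing.
    if pos is None or morph is None:
        return None
    # Both unset: nothing was annotated for this token.
    if pos == "" and morph == "":
        return None
    label_dict = feats_to_dict(morph)
    if pos:
        label_dict[POS_FEAT] = pos
    # Simplification: sort the features and join them into one label string.
    return "|".join(f"{key}={value}" for key, value in sorted(label_dict.items()))
```

For example, `combined_label("NOUN", "Number=Sing|Case=Nom")` gives one label that contains both the features and the POS tag, while a misaligned token contributes nothing to the counts.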
|
@ -584,8 +660,8 @@ def _get_examples_without_label(data: Sequence[Example], label: str) -> int:
|
|||
return count
|
||||
|
||||
|
||||
def _get_labels_from_model(nlp: Language, pipe_name: str) -> Sequence[str]:
|
||||
def _get_labels_from_model(nlp: Language, pipe_name: str) -> Set[str]:
|
||||
if pipe_name not in nlp.pipe_names:
|
||||
return set()
|
||||
pipe = nlp.get_pipe(pipe_name)
|
||||
return pipe.labels
|
||||
return set(pipe.labels)
|
||||
|
|
|
@ -206,7 +206,7 @@ factory = "tok2vec"
|
|||
@architectures = "spacy.Tok2Vec.v2"
|
||||
|
||||
[components.tok2vec.model.embed]
|
||||
@architectures = "spacy.MultiHashEmbed.v1"
|
||||
@architectures = "spacy.MultiHashEmbed.v2"
|
||||
width = ${components.tok2vec.model.encode.width}
|
||||
{% if has_letters -%}
|
||||
attrs = ["NORM", "PREFIX", "SUFFIX", "SHAPE"]
|
||||
|
|
|
@ -68,8 +68,11 @@ seed = ${system.seed}
|
|||
gpu_allocator = ${system.gpu_allocator}
|
||||
dropout = 0.1
|
||||
accumulate_gradient = 1
|
||||
# Controls early-stopping. 0 or -1 mean unlimited.
|
||||
# Controls early-stopping. 0 disables early stopping.
|
||||
patience = 1600
|
||||
# Number of epochs. 0 means unlimited. If >= 0, train corpus is loaded once in
|
||||
# memory and shuffled within the training loop. -1 means stream train corpus
|
||||
# rather than loading in memory with no shuffling within the training loop.
|
||||
max_epochs = 0
|
||||
max_steps = 20000
|
||||
eval_frequency = 200
|
||||
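Note on `patience`: the reworded comment says that 0 disables early stopping. A rough sketch of that rule is below. It is not spaCy's training loop: `steps_before_stop` is a hypothetical helper, and it assumes one evaluation score per step, whereas the real loop only evaluates every `eval_frequency` steps.

```python
from typing import Iterable


def steps_before_stop(scores: Iterable[float], patience: int) -> int:
    """Return how many steps run before training stops.

    Stops once `patience` steps have passed without a new best score.
    A patience of 0 disables early stopping entirely.
    """
    best_score = float("-inf")
    best_step = 0
    step = 0
    for step, score in enumerate(scores, start=1):
        if score > best_score:
            best_score, best_step = score, step
        elif patience and step - best_step >= patience:
            break
    return step


history = [0.1, 0.2, 0.2, 0.2, 0.2, 0.2]
stopped_at = steps_before_stop(history, patience=2)
ran_all = steps_before_stop(history, patience=0)
```

With `patience=2` the loop stops two steps after the last improvement. With `patience=0` it consumes every score.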
|
|
|
@ -157,6 +157,10 @@ class Warnings:
|
|||
"`spacy.load()` to ensure that the model is loaded on the correct "
|
||||
"device. More information: "
|
||||
"http://spacy.io/usage/v3#jupyter-notebook-gpu")
|
||||
W112 = ("The model specified to use for initial vectors ({name}) has no "
|
||||
"vectors. This is almost certainly a mistake.")
|
||||
W113 = ("Sourced component '{name}' may not work as expected: source "
|
||||
"vectors are not identical to current pipeline vectors.")
|
||||
|
||||
|
||||
@add_codes
|
||||
|
@ -497,6 +501,12 @@ class Errors:
|
|||
E202 = ("Unsupported alignment mode '{mode}'. Supported modes: {modes}.")
|
||||
|
||||
# New errors added in v3.x
|
||||
E872 = ("Unable to copy tokenizer from base model due to different "
|
||||
'tokenizer settings: current tokenizer config "{curr_config}" '
|
||||
'vs. base model "{base_config}"')
|
||||
E873 = ("Unable to merge a span from doc.spans with key '{key}' and text "
|
||||
"'{text}'. This is likely a bug in spaCy, so feel free to open an "
|
||||
"issue: https://github.com/explosion/spaCy/issues")
|
||||
E874 = ("Could not initialize the tok2vec model from component "
|
||||
"'{component}' and layer '{layer}'.")
|
||||
E875 = ("To use the PretrainVectors objective, make sure that static vectors are loaded. "
|
||||
|
@ -631,7 +641,7 @@ class Errors:
|
|||
"method, make sure it's overwritten on the subclass.")
|
||||
E940 = ("Found NaN values in scores.")
|
||||
E941 = ("Can't find model '{name}'. It looks like you're trying to load a "
|
||||
"model from a shortcut, which is deprecated as of spaCy v3.0. To "
|
||||
"model from a shortcut, which is obsolete as of spaCy v3.0. To "
|
||||
"load the model, use its full name instead:\n\n"
|
||||
"nlp = spacy.load(\"{full}\")\n\nFor more details on the available "
|
||||
"models, see the models directory: https://spacy.io/models. If you "
|
||||
|
@ -646,8 +656,8 @@ class Errors:
|
|||
"returned the initialized nlp object instead?")
|
||||
E944 = ("Can't copy pipeline component '{name}' from source '{model}': "
|
||||
"not found in pipeline. Available components: {opts}")
|
||||
E945 = ("Can't copy pipeline component '{name}' from source. Expected loaded "
|
||||
"nlp object, but got: {source}")
|
||||
E945 = ("Can't copy pipeline component '{name}' from source. Expected "
|
||||
"loaded nlp object, but got: {source}")
|
||||
E947 = ("`Matcher.add` received invalid `greedy` argument: expected "
|
||||
"a string value from {expected} but got: '{arg}'")
|
||||
E948 = ("`Matcher.add` received invalid 'patterns' argument: expected "
|
||||
|
|
|
@ -17,14 +17,19 @@ _exc = {
|
|||
for orth in [
|
||||
"..",
|
||||
"....",
|
||||
"a.C.",
|
||||
"al.",
|
||||
"all-path",
|
||||
"art.",
|
||||
"Art.",
|
||||
"artt.",
|
||||
"att.",
|
||||
"avv.",
|
||||
"Avv.",
|
||||
"by-pass",
|
||||
"c.d.",
|
||||
"c/c",
|
||||
"C.so",
|
||||
"centro-sinistra",
|
||||
"check-up",
|
||||
"Civ.",
|
||||
|
@ -48,6 +53,8 @@ for orth in [
|
|||
"prof.",
|
||||
"sett.",
|
||||
"s.p.a.",
|
||||
"s.n.c",
|
||||
"s.r.l",
|
||||
"ss.",
|
||||
"St.",
|
||||
"tel.",
|
||||
|
|
|
@ -682,9 +682,14 @@ class Language:
|
|||
name (str): Optional alternative name to use in current pipeline.
|
||||
RETURNS (Tuple[Callable, str]): The component and its factory name.
|
||||
"""
|
||||
# TODO: handle errors and mismatches (vectors etc.)
|
||||
if not isinstance(source, self.__class__):
|
||||
# Check source type
|
||||
if not isinstance(source, Language):
|
||||
raise ValueError(Errors.E945.format(name=source_name, source=type(source)))
|
||||
# Check vectors, with faster checks first
|
||||
if self.vocab.vectors.shape != source.vocab.vectors.shape or \
|
||||
self.vocab.vectors.key2row != source.vocab.vectors.key2row or \
|
||||
self.vocab.vectors.to_bytes() != source.vocab.vectors.to_bytes():
|
||||
util.logger.warning(Warnings.W113.format(name=source_name))
|
||||
if source_name not in source.component_names:
|
||||
raise KeyError(
|
||||
Errors.E944.format(
|
||||
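Note on the vector check above: the comment "with faster checks first" relies on `or` short-circuiting, so the expensive byte comparison only runs when shape and key mapping already agree. A small sketch of that ordering follows, using a hypothetical `ToyVectors` stand-in rather than spaCy's `Vectors` class.

```python
from dataclasses import dataclass, field
from typing import Dict, Tuple


@dataclass
class ToyVectors:
    """Hypothetical stand-in for a vectors table; not spaCy's Vectors class."""

    shape: Tuple[int, int]
    key2row: Dict[int, int]
    data: bytes
    serialisations: int = field(default=0, compare=False)

    def to_bytes(self) -> bytes:
        # Count calls so the cost of the slow path is visible.
        self.serialisations += 1
        return self.data


def vectors_differ(a: ToyVectors, b: ToyVectors) -> bool:
    """Compare cheap properties first; `or` stops at the first mismatch."""
    return (
        a.shape != b.shape
        or a.key2row != b.key2row
        or a.to_bytes() != b.to_bytes()
    )


current = ToyVectors((2, 3), {1: 0, 2: 1}, b"abc")
other = ToyVectors((4, 3), {1: 0}, b"xyz")
same = ToyVectors((2, 3), {1: 0, 2: 1}, b"abc")
```

Comparing `current` with `other` never calls `to_bytes()`, because the shapes already differ. Only the comparison with `same` reaches the serialisation step.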
|
@ -1673,7 +1678,16 @@ class Language:
|
|||
# model with the same vocab as the current nlp object
|
||||
source_nlps[model] = util.load_model(model, vocab=nlp.vocab)
|
||||
source_name = pipe_cfg.get("component", pipe_name)
|
||||
listeners_replaced = False
|
||||
if "replace_listeners" in pipe_cfg:
|
||||
for name, proc in source_nlps[model].pipeline:
|
||||
if source_name in getattr(proc, "listening_components", []):
|
||||
source_nlps[model].replace_listeners(name, source_name, pipe_cfg["replace_listeners"])
|
||||
listeners_replaced = True
|
||||
nlp.add_pipe(source_name, source=source_nlps[model], name=pipe_name)
|
||||
# Delete from cache if listeners were replaced
|
||||
if listeners_replaced:
|
||||
del source_nlps[model]
|
||||
disabled_pipes = [*config["nlp"]["disabled"], *disable]
|
||||
nlp._disabled = set(p for p in disabled_pipes if p not in exclude)
|
||||
nlp.batch_size = config["nlp"]["batch_size"]
|
||||
|
|
|
@ -299,7 +299,7 @@ cdef class DependencyMatcher:
|
|||
if isinstance(doclike, Doc):
|
||||
doc = doclike
|
||||
elif isinstance(doclike, Span):
|
||||
doc = doclike.as_doc()
|
||||
doc = doclike.as_doc(copy_user_data=True)
|
||||
else:
|
||||
raise ValueError(Errors.E195.format(good="Doc or Span", got=type(doclike).__name__))
|
||||
|
||||
|
|
|
@ -46,6 +46,12 @@ cdef struct TokenPatternC:
|
|||
int32_t nr_py
|
||||
quantifier_t quantifier
|
||||
hash_t key
|
||||
int32_t token_idx
|
||||
|
||||
|
||||
cdef struct MatchAlignmentC:
|
||||
int32_t token_idx
|
||||
int32_t length
|
||||
|
||||
|
||||
cdef struct PatternStateC:
|
||||
|
|
|
@ -196,7 +196,7 @@ cdef class Matcher:
|
|||
else:
|
||||
yield doc
|
||||
|
||||
def __call__(self, object doclike, *, as_spans=False, allow_missing=False):
|
||||
def __call__(self, object doclike, *, as_spans=False, allow_missing=False, with_alignments=False):
|
||||
"""Find all token sequences matching the supplied pattern.
|
||||
|
||||
doclike (Doc or Span): The document to match over.
|
||||
|
@ -204,10 +204,16 @@ cdef class Matcher:
|
|||
start, end) tuples.
|
||||
allow_missing (bool): Whether to skip checks for missing annotation for
|
||||
attributes included in patterns. Defaults to False.
|
||||
with_alignments (bool): Return match alignment information, which is
|
||||
`List[int]` with length of matched span. Each entry denotes the
|
||||
corresponding index of token pattern. If as_spans is set to True,
|
||||
this setting is ignored.
|
||||
RETURNS (list): A list of `(match_id, start, end)` tuples,
|
||||
describing the matches. A match tuple describes a span
|
||||
`doc[start:end]`. The `match_id` is an integer. If as_spans is set
|
||||
to True, a list of Span objects is returned.
|
||||
If with_alignments is set to True and as_spans is set to False,
|
||||
a list of `(match_id, start, end, alignments)` tuples is returned.
|
||||
"""
|
||||
if isinstance(doclike, Doc):
|
||||
doc = doclike
|
||||
|
@ -217,6 +223,9 @@ cdef class Matcher:
|
|||
length = doclike.end - doclike.start
|
||||
else:
|
||||
raise ValueError(Errors.E195.format(good="Doc or Span", got=type(doclike).__name__))
|
||||
# Skip alignments calculations if as_spans is set
|
||||
if as_spans:
|
||||
with_alignments = False
|
||||
cdef Pool tmp_pool = Pool()
|
||||
if not allow_missing:
|
||||
for attr in (TAG, POS, MORPH, LEMMA, DEP):
|
||||
|
@ -232,18 +241,20 @@ cdef class Matcher:
|
|||
error_msg = Errors.E155.format(pipe=pipe, attr=self.vocab.strings.as_string(attr))
|
||||
raise ValueError(error_msg)
|
||||
matches = find_matches(&self.patterns[0], self.patterns.size(), doclike, length,
|
||||
extensions=self._extensions, predicates=self._extra_predicates)
|
||||
extensions=self._extensions, predicates=self._extra_predicates, with_alignments=with_alignments)
|
||||
final_matches = []
|
||||
pairs_by_id = {}
|
||||
# For each key, either add all matches, or only the filtered, non-overlapping ones
|
||||
for (key, start, end) in matches:
|
||||
# For each key, either add all matches, or only the filtered,
|
||||
# non-overlapping ones. This `match` can be either (start, end) or
|
||||
# (start, end, alignments) depending on `with_alignments=` option.
|
||||
for key, *match in matches:
|
||||
span_filter = self._filter.get(key)
|
||||
if span_filter is not None:
|
||||
pairs = pairs_by_id.get(key, [])
|
||||
pairs.append((start,end))
|
||||
pairs.append(match)
|
||||
pairs_by_id[key] = pairs
|
||||
else:
|
||||
final_matches.append((key, start, end))
|
||||
final_matches.append((key, *match))
|
||||
matched = <char*>tmp_pool.alloc(length, sizeof(char))
|
||||
empty = <char*>tmp_pool.alloc(length, sizeof(char))
|
||||
for key, pairs in pairs_by_id.items():
|
||||
|
@ -255,14 +266,18 @@ cdef class Matcher:
|
|||
sorted_pairs = sorted(pairs, key=lambda x: (x[1]-x[0], -x[0]), reverse=True) # reverse sort by length
|
||||
else:
|
||||
raise ValueError(Errors.E947.format(expected=["FIRST", "LONGEST"], arg=span_filter))
|
||||
for (start, end) in sorted_pairs:
|
||||
for match in sorted_pairs:
|
||||
start, end = match[:2]
|
||||
assert 0 <= start < end # Defend against segfaults
|
||||
span_len = end-start
|
||||
# If no tokens in the span have matched
|
||||
if memcmp(&matched[start], &empty[start], span_len * sizeof(matched[0])) == 0:
|
||||
final_matches.append((key, start, end))
|
||||
final_matches.append((key, *match))
|
||||
# Mark tokens that have matched
|
||||
memset(&matched[start], 1, span_len * sizeof(matched[0]))
|
||||
if with_alignments:
|
||||
final_matches_with_alignments = final_matches
|
||||
final_matches = [(key, start, end) for key, start, end, alignments in final_matches]
|
||||
# perform the callbacks on the filtered set of results
|
||||
for i, (key, start, end) in enumerate(final_matches):
|
||||
on_match = self._callbacks.get(key, None)
|
||||
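Note on the greedy filter above: the loop now takes `start, end = match[:2]` and appends `(key, *match)`, so any trailing alignment data rides along untouched. A pure-Python sketch of that idea is below. `filter_greedy` is a hypothetical helper; the "LONGEST" sort key is copied from the hunk, while the "FIRST" sort key is an assumption because it is not shown in this diff. A boolean list stands in for the `memcmp`/`memset` bookkeeping.

```python
from typing import List, Sequence, Tuple


def filter_greedy(matches: Sequence[Tuple], length: int, greedy: str) -> List[Tuple]:
    """Keep non-overlapping matches; items are (start, end, *extras) tuples."""
    if greedy == "FIRST":
        # Assumed ordering: earliest start first, longer match on ties.
        ordered = sorted(matches, key=lambda m: (m[0], -m[1]))
    elif greedy == "LONGEST":
        # Ordering from the hunk: longest first, earlier start on ties.
        ordered = sorted(matches, key=lambda m: (m[1] - m[0], -m[0]), reverse=True)
    else:
        raise ValueError(f"Unsupported greedy value: {greedy!r}")
    taken = [False] * length
    kept = []
    for match in ordered:
        start, end = match[:2]  # extras such as alignments are ignored here
        assert 0 <= start < end
        if not any(taken[start:end]):
            kept.append(match)  # the full tuple, extras included, is preserved
            for i in range(start, end):
                taken[i] = True
    return kept


candidates = [(0, 2, "a"), (1, 4, "b"), (4, 5, "c")]
longest = filter_greedy(candidates, 5, "LONGEST")
first = filter_greedy(candidates, 5, "FIRST")
```

The third element of each tuple plays the role of the alignments payload and comes out unchanged.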
|
@ -270,6 +285,22 @@ cdef class Matcher:
|
|||
on_match(self, doc, i, final_matches)
|
||||
if as_spans:
|
||||
return [Span(doc, start, end, label=key) for key, start, end in final_matches]
|
||||
elif with_alignments:
|
||||
# convert alignments List[Dict[str, int]] --> List[int]
|
||||
final_matches = []
|
||||
# when several alignments are found for the same length,
|
||||
# keep the one with the largest token_idx
|
||||
for key, start, end, alignments in final_matches_with_alignments:
|
||||
sorted_alignments = sorted(alignments, key=lambda x: (x['length'], x['token_idx']), reverse=False)
|
||||
alignments = [0] * (end-start)
|
||||
for align in sorted_alignments:
|
||||
if align['length'] >= end-start:
|
||||
continue
|
||||
# Since alignments are sorted in order of (length, token_idx)
|
||||
# this overwrites smaller token_idx when they have same length.
|
||||
alignments[align['length']] = align['token_idx']
|
||||
final_matches.append((key, start, end, alignments))
|
||||
return final_matches
|
||||
else:
|
||||
return final_matches
|
||||
|
||||
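Note on the `with_alignments` branch above: each match carries a list of records with `token_idx` and `length` fields, and the branch flattens them into one pattern index per matched token. The sketch below reproduces that conversion in plain Python. `flatten_alignments` and the sample records are hypothetical; the sort key and the skip condition follow the hunk.

```python
from typing import Dict, List


def flatten_alignments(start: int, end: int, alignments: List[Dict[str, int]]) -> List[int]:
    """Turn alignment records into one pattern-token index per matched token."""
    span_len = end - start
    # Sort by (length, token_idx) so that, for equal lengths, the record with
    # the larger token_idx is written last and therefore wins.
    ordered = sorted(alignments, key=lambda a: (a["length"], a["token_idx"]))
    result = [0] * span_len
    for align in ordered:
        if align["length"] >= span_len:
            # Recorded at or past the end of the span: no token to assign.
            continue
        result[align["length"]] = align["token_idx"]
    return result


# Hypothetical records for a three-token pattern whose second token repeats.
records = [
    {"token_idx": 0, "length": 0},
    {"token_idx": 1, "length": 1},
    {"token_idx": 1, "length": 2},
    {"token_idx": 2, "length": 3},
    {"token_idx": -1, "length": 4},
]
flat = flatten_alignments(5, 9, records)
```

The last record sits past the end of the four-token span, so it is skipped; the remaining records map each matched token to the pattern token that consumed it.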
|
@ -288,9 +319,9 @@ def unpickle_matcher(vocab, patterns, callbacks):
|
|||
return matcher
|
||||
|
||||
|
||||
cdef find_matches(TokenPatternC** patterns, int n, object doclike, int length, extensions=None, predicates=tuple()):
|
||||
cdef find_matches(TokenPatternC** patterns, int n, object doclike, int length, extensions=None, predicates=tuple(), bint with_alignments=0):
|
||||
"""Find matches in a doc, with a compiled array of patterns. Matches are
|
||||
returned as a list of (id, start, end) tuples.
|
||||
returned as a list of (id, start, end) tuples or (id, start, end, alignments) tuples (if with_alignments != 0)
|
||||
|
||||
To augment the compiled patterns, we optionally also take two Python lists.
|
||||
|
||||
|
@ -302,6 +333,8 @@ cdef find_matches(TokenPatternC** patterns, int n, object doclike, int length, e
|
|||
"""
|
||||
cdef vector[PatternStateC] states
|
||||
cdef vector[MatchC] matches
|
||||
cdef vector[vector[MatchAlignmentC]] align_states
|
||||
cdef vector[vector[MatchAlignmentC]] align_matches
|
||||
cdef PatternStateC state
|
||||
cdef int i, j, nr_extra_attr
|
||||
cdef Pool mem = Pool()
|
||||
|
@ -328,12 +361,14 @@ cdef find_matches(TokenPatternC** patterns, int n, object doclike, int length, e
|
|||
for i in range(length):
|
||||
for j in range(n):
|
||||
states.push_back(PatternStateC(patterns[j], i, 0))
|
||||
transition_states(states, matches, predicate_cache,
|
||||
doclike[i], extra_attr_values, predicates)
|
||||
if with_alignments != 0:
|
||||
align_states.resize(states.size())
|
||||
transition_states(states, matches, align_states, align_matches, predicate_cache,
|
||||
doclike[i], extra_attr_values, predicates, with_alignments)
|
||||
extra_attr_values += nr_extra_attr
|
||||
predicate_cache += len(predicates)
|
||||
# Handle matches that end in 0-width patterns
|
||||
finish_states(matches, states)
|
||||
finish_states(matches, states, align_matches, align_states, with_alignments)
|
||||
seen = set()
|
||||
for i in range(matches.size()):
|
||||
match = (
|
||||
|
@ -346,16 +381,22 @@ cdef find_matches(TokenPatternC** patterns, int n, object doclike, int length, e
|
|||
# first .?, or the second .? -- it doesn't matter, it's just one match.
|
||||
# Skip 0-length matches. (TODO: fix algorithm)
|
||||
if match not in seen and matches[i].length > 0:
|
||||
output.append(match)
|
||||
if with_alignments != 0:
|
||||
# since align_matches has the same length as matches, we can share the same index 'i'
|
||||
output.append(match + (align_matches[i],))
|
||||
else:
|
||||
output.append(match)
|
||||
seen.add(match)
|
||||
return output
|
||||
|
||||
|
||||
cdef void transition_states(vector[PatternStateC]& states, vector[MatchC]& matches,
|
||||
vector[vector[MatchAlignmentC]]& align_states, vector[vector[MatchAlignmentC]]& align_matches,
|
||||
int8_t* cached_py_predicates,
|
||||
Token token, const attr_t* extra_attrs, py_predicates) except *:
|
||||
Token token, const attr_t* extra_attrs, py_predicates, bint with_alignments) except *:
|
||||
cdef int q = 0
|
||||
cdef vector[PatternStateC] new_states
|
||||
cdef vector[vector[MatchAlignmentC]] align_new_states
|
||||
cdef int nr_predicate = len(py_predicates)
|
||||
for i in range(states.size()):
|
||||
if states[i].pattern.nr_py >= 1:
|
||||
|
@ -370,23 +411,39 @@ cdef void transition_states(vector[PatternStateC]& states, vector[MatchC]& match
|
|||
# it in the states list, because q doesn't advance.
|
||||
state = states[i]
|
||||
states[q] = state
|
||||
# Alignments are kept separate from `states` so that users who only need the basic options (no alignments) pay no extra cost.
|
||||
# `align_states` always corresponds to `states` 1:1.
|
||||
if with_alignments != 0:
|
||||
align_state = align_states[i]
|
||||
align_states[q] = align_state
|
||||
while action in (RETRY, RETRY_ADVANCE, RETRY_EXTEND):
|
||||
# Update alignment before the transition of current state
|
||||
# 'MatchAlignmentC' maps 'original token index of current pattern' to 'current matching length'
|
||||
if with_alignments != 0:
|
||||
align_states[q].push_back(MatchAlignmentC(states[q].pattern.token_idx, states[q].length))
|
||||
if action == RETRY_EXTEND:
|
||||
# This handles the 'extend'
|
||||
new_states.push_back(
|
||||
PatternStateC(pattern=states[q].pattern, start=state.start,
|
||||
length=state.length+1))
|
||||
if with_alignments != 0:
|
||||
align_new_states.push_back(align_states[q])
|
||||
if action == RETRY_ADVANCE:
|
||||
# This handles the 'advance'
|
||||
new_states.push_back(
|
||||
PatternStateC(pattern=states[q].pattern+1, start=state.start,
|
||||
length=state.length+1))
|
||||
if with_alignments != 0:
|
||||
align_new_states.push_back(align_states[q])
|
||||
states[q].pattern += 1
|
||||
if states[q].pattern.nr_py != 0:
|
||||
update_predicate_cache(cached_py_predicates,
|
||||
states[q].pattern, token, py_predicates)
|
||||
action = get_action(states[q], token.c, extra_attrs,
|
||||
cached_py_predicates)
|
||||
# Update alignment before the transition of current state
|
||||
if with_alignments != 0:
|
||||
align_states[q].push_back(MatchAlignmentC(states[q].pattern.token_idx, states[q].length))
|
||||
if action == REJECT:
|
||||
pass
|
||||
elif action == ADVANCE:
|
||||
|
@ -399,29 +456,50 @@ cdef void transition_states(vector[PatternStateC]& states, vector[MatchC]& match
|
|||
matches.push_back(
|
||||
MatchC(pattern_id=ent_id, start=state.start,
|
||||
length=state.length+1))
|
||||
# `align_matches` always corresponds to `matches` 1:1
|
||||
if with_alignments != 0:
|
||||
align_matches.push_back(align_states[q])
|
||||
elif action == MATCH_DOUBLE:
|
||||
# push match without last token if length > 0
|
||||
if state.length > 0:
|
||||
matches.push_back(
|
||||
MatchC(pattern_id=ent_id, start=state.start,
|
||||
length=state.length))
|
||||
# MATCH_DOUBLE emits matches twice,
|
||||
# add one more to align_matches in order to keep 1:1 relationship
|
||||
if with_alignments != 0:
|
||||
align_matches.push_back(align_states[q])
|
||||
# push match with last token
|
||||
matches.push_back(
|
||||
MatchC(pattern_id=ent_id, start=state.start,
|
||||
length=state.length+1))
|
||||
# `align_matches` always corresponds to `matches` 1:1
|
||||
if with_alignments != 0:
|
||||
align_matches.push_back(align_states[q])
|
||||
elif action == MATCH_REJECT:
|
||||
matches.push_back(
|
||||
MatchC(pattern_id=ent_id, start=state.start,
|
||||
length=state.length))
|
||||
# `align_matches` always corresponds to `matches` 1:1
|
||||
if with_alignments != 0:
|
||||
align_matches.push_back(align_states[q])
|
||||
elif action == MATCH_EXTEND:
|
||||
matches.push_back(
|
||||
MatchC(pattern_id=ent_id, start=state.start,
|
||||
length=state.length))
|
||||
# `align_matches` always corresponds to `matches` 1:1
|
||||
if with_alignments != 0:
|
||||
align_matches.push_back(align_states[q])
|
||||
states[q].length += 1
|
||||
q += 1
|
||||
states.resize(q)
|
||||
for i in range(new_states.size()):
|
||||
states.push_back(new_states[i])
|
||||
# `align_states` always corresponds to `states` 1:1
|
||||
if with_alignments != 0:
|
||||
align_states.resize(q)
|
||||
for i in range(align_new_states.size()):
|
||||
align_states.push_back(align_new_states[i])
|
||||
|
||||
|
||||
cdef int update_predicate_cache(int8_t* cache,
|
||||
|
@ -444,15 +522,27 @@ cdef int update_predicate_cache(int8_t* cache,
|
|||
raise ValueError(Errors.E125.format(value=result))
|
||||
|
||||
|
||||
cdef void finish_states(vector[MatchC]& matches, vector[PatternStateC]& states) except *:
|
||||
cdef void finish_states(vector[MatchC]& matches, vector[PatternStateC]& states,
|
||||
vector[vector[MatchAlignmentC]]& align_matches,
|
||||
vector[vector[MatchAlignmentC]]& align_states,
|
||||
bint with_alignments) except *:
|
||||
"""Handle states that end in zero-width patterns."""
|
||||
cdef PatternStateC state
|
||||
cdef vector[MatchAlignmentC] align_state
|
||||
for i in range(states.size()):
|
||||
state = states[i]
|
||||
if with_alignments != 0:
|
||||
align_state = align_states[i]
|
||||
while get_quantifier(state) in (ZERO_PLUS, ZERO_ONE):
|
||||
# Update alignment before the transition of current state
|
||||
if with_alignments != 0:
|
||||
align_state.push_back(MatchAlignmentC(state.pattern.token_idx, state.length))
|
||||
is_final = get_is_final(state)
|
||||
if is_final:
|
||||
ent_id = get_ent_id(state.pattern)
|
||||
# `align_matches` always corresponds to `matches` 1:1
|
||||
if with_alignments != 0:
|
||||
align_matches.push_back(align_state)
|
||||
matches.push_back(
|
||||
MatchC(pattern_id=ent_id, start=state.start, length=state.length))
|
||||
break
|
||||
|
@ -607,7 +697,7 @@ cdef int8_t get_quantifier(PatternStateC state) nogil:
|
|||
cdef TokenPatternC* init_pattern(Pool mem, attr_t entity_id, object token_specs) except NULL:
|
||||
pattern = <TokenPatternC*>mem.alloc(len(token_specs) + 1, sizeof(TokenPatternC))
|
||||
cdef int i, index
|
||||
for i, (quantifier, spec, extensions, predicates) in enumerate(token_specs):
|
||||
for i, (quantifier, spec, extensions, predicates, token_idx) in enumerate(token_specs):
|
||||
pattern[i].quantifier = quantifier
|
||||
# Ensure attrs refers to a null pointer if nr_attr == 0
|
||||
if len(spec) > 0:
|
||||
|
@ -628,6 +718,7 @@ cdef TokenPatternC* init_pattern(Pool mem, attr_t entity_id, object token_specs)
|
|||
pattern[i].py_predicates[j] = index
|
||||
pattern[i].nr_py = len(predicates)
|
||||
pattern[i].key = hash64(pattern[i].attrs, pattern[i].nr_attr * sizeof(AttrValueC), 0)
|
||||
pattern[i].token_idx = token_idx
|
||||
i = len(token_specs)
|
||||
# Use quantifier to identify final ID pattern node (rather than previous
|
||||
# uninitialized quantifier == 0/ZERO + nr_attr == 0 + non-zero-length attrs)
|
||||
|
@ -638,6 +729,7 @@ cdef TokenPatternC* init_pattern(Pool mem, attr_t entity_id, object token_specs)
|
|||
pattern[i].nr_attr = 1
|
||||
pattern[i].nr_extra_attr = 0
|
||||
pattern[i].nr_py = 0
|
||||
pattern[i].token_idx = -1
|
||||
return pattern
|
||||
|
||||
|
||||
|
@ -655,7 +747,7 @@ def _preprocess_pattern(token_specs, vocab, extensions_table, extra_predicates):
|
|||
"""This function interprets the pattern, converting the various bits of
|
||||
syntactic sugar before we compile it into a struct with init_pattern.
|
||||
|
||||
We need to split the pattern up into three parts:
|
||||
We need to split the pattern up into four parts:
|
||||
* Normal attribute/value pairs, which are stored on either the token or lexeme,
|
||||
can be handled directly.
|
||||
* Extension attributes are handled specially, as we need to prefetch the
|
||||
|
@ -664,13 +756,14 @@ def _preprocess_pattern(token_specs, vocab, extensions_table, extra_predicates):
|
|||
functions and store them. So we store these specially as well.
|
||||
* Extension attributes that have extra predicates are stored within the
|
||||
extra_predicates.
|
||||
* Token index that this pattern belongs to.
|
||||
"""
|
||||
tokens = []
|
||||
string_store = vocab.strings
|
||||
for spec in token_specs:
|
||||
for token_idx, spec in enumerate(token_specs):
|
||||
if not spec:
|
||||
# Signifier for 'any token'
|
||||
tokens.append((ONE, [(NULL_ATTR, 0)], [], []))
|
||||
tokens.append((ONE, [(NULL_ATTR, 0)], [], [], token_idx))
|
||||
continue
|
||||
if not isinstance(spec, dict):
|
||||
raise ValueError(Errors.E154.format())
|
||||
|
@ -679,7 +772,7 @@ def _preprocess_pattern(token_specs, vocab, extensions_table, extra_predicates):
|
|||
extensions = _get_extensions(spec, string_store, extensions_table)
|
||||
predicates = _get_extra_predicates(spec, extra_predicates, vocab)
|
||||
for op in ops:
|
||||
tokens.append((op, list(attr_values), list(extensions), list(predicates)))
|
||||
tokens.append((op, list(attr_values), list(extensions), list(predicates), token_idx))
|
||||
return tokens
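Note on `token_idx`: one pattern token can expand into several compiled nodes, so each node records the index of the pattern token it came from. The sketch below illustrates this with a hypothetical `expand_pattern` helper. The operator table is an assumption about how quantifiers expand (for example `+` as "one, then zero or more") and is not shown in this diff.

```python
from typing import Any, Dict, List, Tuple

# Assumed expansion of the "OP" quantifiers into compiled node types.
OPERATORS = {
    "!": ("ZERO",),
    "*": ("ZERO_PLUS",),
    "+": ("ONE", "ZERO_PLUS"),
    "?": ("ZERO_ONE",),
    "1": ("ONE",),
}


def expand_pattern(token_specs: List[Dict[str, Any]]) -> List[Tuple[str, Dict[str, Any], int]]:
    """Expand pattern tokens into (op, attrs, token_idx) nodes."""
    nodes = []
    for token_idx, spec in enumerate(token_specs):
        if not spec:
            # An empty dict matches any single token.
            nodes.append(("ONE", {}, token_idx))
            continue
        ops = OPERATORS[spec.get("OP", "1")]
        attrs = {key: value for key, value in spec.items() if key != "OP"}
        for op in ops:
            # Every node produced by this spec shares the same token_idx.
            nodes.append((op, attrs, token_idx))
    return nodes


pattern = [{"LOWER": "a"}, {"IS_PUNCT": True, "OP": "+"}, {}]
nodes = expand_pattern(pattern)
```

The second pattern token yields two nodes, and both keep index 1, which is what lets a match be mapped back to the original pattern.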
|
||||
|
||||
|
||||
|
|
|
@ -3,8 +3,10 @@ from thinc.api import Model
|
|||
from thinc.types import Floats2d
|
||||
|
||||
from ..tokens import Doc
|
||||
from ..util import registry
|
||||
|
||||
|
||||
@registry.layers("spacy.CharEmbed.v1")
|
||||
def CharacterEmbed(nM: int, nC: int) -> Model[List[Doc], List[Floats2d]]:
|
||||
# nM: Number of dimensions per character. nC: Number of characters.
|
||||
return Model(
|
||||
|
|
|
@ -31,7 +31,7 @@ def get_tok2vec_width(model: Model):
|
|||
return nO
|
||||
|
||||
|
||||
@registry.architectures("spacy.HashEmbedCNN.v1")
|
||||
@registry.architectures("spacy.HashEmbedCNN.v2")
|
||||
def build_hash_embed_cnn_tok2vec(
|
||||
*,
|
||||
width: int,
|
||||
|
@ -108,7 +108,7 @@ def build_Tok2Vec_model(
|
|||
return tok2vec
|
||||
|
||||
|
||||
@registry.architectures("spacy.MultiHashEmbed.v1")
|
||||
@registry.architectures("spacy.MultiHashEmbed.v2")
|
||||
def MultiHashEmbed(
|
||||
width: int,
|
||||
attrs: List[Union[str, int]],
|
||||
|
@ -182,7 +182,7 @@ def MultiHashEmbed(
|
|||
return model
|
||||
|
||||
|
||||
@registry.architectures("spacy.CharacterEmbed.v1")
|
||||
@registry.architectures("spacy.CharacterEmbed.v2")
|
||||
def CharacterEmbed(
|
||||
width: int,
|
||||
rows: int,
|
||||
|
|
|
@ -8,7 +8,7 @@ from ..tokens import Doc
|
|||
from ..errors import Errors
|
||||
|
||||
|
||||
@registry.layers("spacy.StaticVectors.v1")
|
||||
@registry.layers("spacy.StaticVectors.v2")
|
||||
def StaticVectors(
|
||||
nO: Optional[int] = None,
|
||||
nM: Optional[int] = None,
|
||||
|
@ -38,7 +38,7 @@ def forward(
|
|||
return _handle_empty(model.ops, model.get_dim("nO"))
|
||||
key_attr = model.attrs["key_attr"]
|
||||
W = cast(Floats2d, model.ops.as_contig(model.get_param("W")))
|
||||
V = cast(Floats2d, docs[0].vocab.vectors.data)
|
||||
V = cast(Floats2d, model.ops.asarray(docs[0].vocab.vectors.data))
|
||||
rows = model.ops.flatten(
|
||||
[doc.vocab.vectors.find(keys=doc.to_array(key_attr)) for doc in docs]
|
||||
)
|
||||
|
@ -46,6 +46,8 @@ def forward(
|
|||
vectors_data = model.ops.gemm(model.ops.as_contig(V[rows]), W, trans2=True)
|
||||
except ValueError:
|
||||
raise RuntimeError(Errors.E896)
|
||||
# Convert negative indices to 0-vectors (TODO: more options for UNK tokens)
|
||||
vectors_data[rows < 0] = 0
|
||||
output = Ragged(
|
||||
vectors_data, model.ops.asarray([len(doc) for doc in docs], dtype="i")
|
||||
)
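Note on `vectors_data[rows < 0] = 0`: a lookup that returns a negative row for an unknown key would otherwise pick up a real row, since negative indices count from the end in array indexing. Zeroing those positions afterwards avoids that. A plain-list sketch of the intended result follows; `lookup_rows` is a hypothetical helper and skips the projection through `W`.

```python
from typing import List, Sequence


def lookup_rows(table: Sequence[Sequence[float]], rows: Sequence[int]) -> List[List[float]]:
    """Fetch one vector per row index; negative indices become zero vectors."""
    width = len(table[0]) if table else 0
    output = []
    for row in rows:
        if row < 0:
            # Unknown key: emit zeros instead of wrapping around to the last row.
            output.append([0.0] * width)
        else:
            output.append(list(table[row]))
    return output


table = [[1.0, 2.0], [3.0, 4.0]]
vectors = lookup_rows(table, [1, -1, 0])
```

The middle entry comes out as a zero vector rather than a copy of the table's last row.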
|
||||
|
|
|
@ -24,7 +24,7 @@ maxout_pieces = 2
|
|||
use_upper = true
|
||||
|
||||
[model.tok2vec]
|
||||
@architectures = "spacy.HashEmbedCNN.v1"
|
||||
@architectures = "spacy.HashEmbedCNN.v2"
|
||||
pretrained_vectors = null
|
||||
width = 96
|
||||
depth = 4
|
||||
|
|
|
@ -26,7 +26,7 @@ default_model_config = """
|
|||
@architectures = "spacy.EntityLinker.v1"
|
||||
|
||||
[model.tok2vec]
|
||||
@architectures = "spacy.HashEmbedCNN.v1"
|
||||
@architectures = "spacy.HashEmbedCNN.v2"
|
||||
pretrained_vectors = null
|
||||
width = 96
|
||||
depth = 2
|
||||
|
@ -300,77 +300,77 @@ class EntityLinker(TrainablePipe):
|
|||
for i, doc in enumerate(docs):
|
||||
sentences = [s for s in doc.sents]
|
||||
if len(doc) > 0:
|
||||
# Looping through each sentence and each entity
|
||||
# This may go wrong if there are entities across sentences - which shouldn't happen normally.
|
||||
for sent_index, sent in enumerate(sentences):
|
||||
if sent.ents:
|
||||
# get n_neighbour sentences, clipped to the length of the document
|
||||
start_sentence = max(0, sent_index - self.n_sents)
|
||||
end_sentence = min(
|
||||
len(sentences) - 1, sent_index + self.n_sents
|
||||
)
|
||||
start_token = sentences[start_sentence].start
|
||||
end_token = sentences[end_sentence].end
|
||||
sent_doc = doc[start_token:end_token].as_doc()
|
||||
# currently, the context is the same for each entity in a sentence (should be refined)
|
||||
xp = self.model.ops.xp
|
||||
if self.incl_context:
|
||||
sentence_encoding = self.model.predict([sent_doc])[0]
|
||||
sentence_encoding_t = sentence_encoding.T
|
||||
sentence_norm = xp.linalg.norm(sentence_encoding_t)
|
||||
for ent in sent.ents:
|
||||
entity_count += 1
|
||||
if ent.label_ in self.labels_discard:
|
||||
# ignoring this entity - setting to NIL
|
||||
final_kb_ids.append(self.NIL)
|
||||
else:
|
||||
candidates = self.get_candidates(self.kb, ent)
|
||||
if not candidates:
|
||||
# no prediction possible for this entity - setting to NIL
|
||||
final_kb_ids.append(self.NIL)
|
||||
elif len(candidates) == 1:
|
||||
# shortcut for efficiency reasons: take the 1 candidate
|
||||
# TODO: thresholding
|
||||
final_kb_ids.append(candidates[0].entity_)
|
||||
else:
|
||||
random.shuffle(candidates)
|
||||
# set all prior probabilities to 0 if incl_prior=False
|
||||
prior_probs = xp.asarray(
|
||||
[c.prior_prob for c in candidates]
|
||||
# Looping through each entity (TODO: rewrite)
|
||||
for ent in doc.ents:
|
||||
sent = ent.sent
|
||||
sent_index = sentences.index(sent)
|
||||
assert sent_index >= 0
|
||||
# get n_neighbour sentences, clipped to the length of the document
|
||||
start_sentence = max(0, sent_index - self.n_sents)
|
||||
end_sentence = min(
|
||||
len(sentences) - 1, sent_index + self.n_sents
|
||||
)
|
||||
start_token = sentences[start_sentence].start
|
||||
end_token = sentences[end_sentence].end
|
||||
sent_doc = doc[start_token:end_token].as_doc()
|
||||
# currently, the context is the same for each entity in a sentence (should be refined)
|
||||
xp = self.model.ops.xp
|
||||
if self.incl_context:
|
||||
sentence_encoding = self.model.predict([sent_doc])[0]
|
||||
sentence_encoding_t = sentence_encoding.T
|
||||
sentence_norm = xp.linalg.norm(sentence_encoding_t)
|
||||
entity_count += 1
|
||||
if ent.label_ in self.labels_discard:
|
||||
# ignoring this entity - setting to NIL
|
||||
final_kb_ids.append(self.NIL)
|
||||
else:
|
||||
candidates = self.get_candidates(self.kb, ent)
|
||||
if not candidates:
|
||||
# no prediction possible for this entity - setting to NIL
|
||||
final_kb_ids.append(self.NIL)
|
||||
elif len(candidates) == 1:
|
||||
# shortcut for efficiency reasons: take the 1 candidate
|
||||
# TODO: thresholding
|
||||
final_kb_ids.append(candidates[0].entity_)
|
||||
else:
|
||||
random.shuffle(candidates)
|
||||
# set all prior probabilities to 0 if incl_prior=False
|
||||
prior_probs = xp.asarray(
|
||||
[c.prior_prob for c in candidates]
|
||||
)
|
||||
if not self.incl_prior:
|
||||
prior_probs = xp.asarray(
|
||||
[0.0 for _ in candidates]
|
||||
)
|
||||
scores = prior_probs
|
||||
# add in similarity from the context
|
||||
if self.incl_context:
|
||||
entity_encodings = xp.asarray(
|
||||
[c.entity_vector for c in candidates]
|
||||
)
|
||||
entity_norm = xp.linalg.norm(
|
||||
entity_encodings, axis=1
|
||||
)
|
||||
if len(entity_encodings) != len(prior_probs):
|
||||
raise RuntimeError(
|
||||
Errors.E147.format(
|
||||
method="predict",
|
||||
msg="vectors not of equal length",
|
||||
)
|
||||
)
|
||||
if not self.incl_prior:
|
||||
prior_probs = xp.asarray(
|
||||
[0.0 for _ in candidates]
|
||||
)
|
||||
scores = prior_probs
|
||||
# add in similarity from the context
|
||||
if self.incl_context:
|
||||
entity_encodings = xp.asarray(
|
||||
[c.entity_vector for c in candidates]
|
||||
)
|
||||
entity_norm = xp.linalg.norm(
|
||||
entity_encodings, axis=1
|
||||
)
|
||||
if len(entity_encodings) != len(prior_probs):
|
||||
raise RuntimeError(
|
||||
Errors.E147.format(
|
||||
method="predict",
|
||||
msg="vectors not of equal length",
|
||||
)
|
||||
)
|
||||
# cosine similarity
|
||||
sims = xp.dot(
|
||||
entity_encodings, sentence_encoding_t
|
||||
) / (sentence_norm * entity_norm)
|
||||
if sims.shape != prior_probs.shape:
|
||||
raise ValueError(Errors.E161)
|
||||
scores = (
|
||||
prior_probs + sims - (prior_probs * sims)
|
||||
)
|
||||
# TODO: thresholding
|
||||
best_index = scores.argmax().item()
|
||||
best_candidate = candidates[best_index]
|
||||
final_kb_ids.append(best_candidate.entity_)
|
||||
# cosine similarity
|
||||
sims = xp.dot(
|
||||
entity_encodings, sentence_encoding_t
|
||||
) / (sentence_norm * entity_norm)
|
||||
if sims.shape != prior_probs.shape:
|
||||
raise ValueError(Errors.E161)
|
||||
scores = (
|
||||
prior_probs + sims - (prior_probs * sims)
|
||||
)
|
||||
# TODO: thresholding
|
||||
best_index = scores.argmax().item()
|
||||
best_candidate = candidates[best_index]
|
||||
final_kb_ids.append(best_candidate.entity_)
|
||||
if not (len(final_kb_ids) == entity_count):
|
||||
err = Errors.E147.format(
|
||||
method="predict", msg="result variables not of equal length"
|
||||
|
|
|
@ -175,7 +175,7 @@ class Lemmatizer(Pipe):
|
|||
|
||||
DOCS: https://spacy.io/api/lemmatizer#rule_lemmatize
|
||||
"""
|
||||
cache_key = (token.orth, token.pos, token.morph)
|
||||
cache_key = (token.orth, token.pos, token.morph.key)
|
||||
if cache_key in self.cache:
|
||||
return self.cache[cache_key]
|
||||
string = token.text
|
||||
|
|
|
@ -27,7 +27,7 @@ default_model_config = """
|
|||
@architectures = "spacy.Tok2Vec.v2"
|
||||
|
||||
[model.tok2vec.embed]
|
||||
@architectures = "spacy.CharacterEmbed.v1"
|
||||
@architectures = "spacy.CharacterEmbed.v2"
|
||||
width = 128
|
||||
rows = 7000
|
||||
nM = 64
|
||||
|
|
|
@ -22,7 +22,7 @@ maxout_pieces = 3
|
|||
token_vector_width = 96
|
||||
|
||||
[model.tok2vec]
|
||||
@architectures = "spacy.HashEmbedCNN.v1"
|
||||
@architectures = "spacy.HashEmbedCNN.v2"
|
||||
pretrained_vectors = null
|
||||
width = 96
|
||||
depth = 4
|
||||
|
|
|
@ -21,7 +21,7 @@ maxout_pieces = 2
|
|||
use_upper = true
|
||||
|
||||
[model.tok2vec]
|
||||
@architectures = "spacy.HashEmbedCNN.v1"
|
||||
@architectures = "spacy.HashEmbedCNN.v2"
|
||||
pretrained_vectors = null
|
||||
width = 96
|
||||
depth = 4
|
||||
|
|
|
@ -19,7 +19,7 @@ default_model_config = """
|
|||
@architectures = "spacy.Tagger.v1"
|
||||
|
||||
[model.tok2vec]
|
||||
@architectures = "spacy.HashEmbedCNN.v1"
|
||||
@architectures = "spacy.HashEmbedCNN.v2"
|
||||
pretrained_vectors = null
|
||||
width = 12
|
||||
depth = 1
|
||||
|
|
|
@ -26,7 +26,7 @@ default_model_config = """
|
|||
@architectures = "spacy.Tagger.v1"
|
||||
|
||||
[model.tok2vec]
|
||||
@architectures = "spacy.HashEmbedCNN.v1"
|
||||
@architectures = "spacy.HashEmbedCNN.v2"
|
||||
pretrained_vectors = null
|
||||
width = 96
|
||||
depth = 4
|
||||
|
|
|
@ -21,7 +21,7 @@ single_label_default_config = """
|
|||
@architectures = "spacy.Tok2Vec.v2"
|
||||
|
||||
[model.tok2vec.embed]
|
||||
@architectures = "spacy.MultiHashEmbed.v1"
|
||||
@architectures = "spacy.MultiHashEmbed.v2"
|
||||
width = 64
|
||||
rows = [2000, 2000, 1000, 1000, 1000, 1000]
|
||||
attrs = ["ORTH", "LOWER", "PREFIX", "SUFFIX", "SHAPE", "ID"]
|
||||
|
@ -56,7 +56,7 @@ single_label_cnn_config = """
|
|||
exclusive_classes = true
|
||||
|
||||
[model.tok2vec]
|
||||
@architectures = "spacy.HashEmbedCNN.v1"
|
||||
@architectures = "spacy.HashEmbedCNN.v2"
|
||||
pretrained_vectors = null
|
||||
width = 96
|
||||
depth = 4
|
||||
|
|
|
@ -21,7 +21,7 @@ multi_label_default_config = """
|
|||
@architectures = "spacy.Tok2Vec.v1"
|
||||
|
||||
[model.tok2vec.embed]
|
||||
@architectures = "spacy.MultiHashEmbed.v1"
|
||||
@architectures = "spacy.MultiHashEmbed.v2"
|
||||
width = 64
|
||||
rows = [2000, 2000, 1000, 1000, 1000, 1000]
|
||||
attrs = ["ORTH", "LOWER", "PREFIX", "SUFFIX", "SHAPE", "ID"]
|
||||
|
@ -56,7 +56,7 @@ multi_label_cnn_config = """
|
|||
exclusive_classes = false
|
||||
|
||||
[model.tok2vec]
|
||||
@architectures = "spacy.HashEmbedCNN.v1"
|
||||
@architectures = "spacy.HashEmbedCNN.v2"
|
||||
pretrained_vectors = null
|
||||
width = 96
|
||||
depth = 4
|
||||
|
|
|
@ -11,7 +11,7 @@ from ..errors import Errors
|
|||
|
||||
default_model_config = """
|
||||
[model]
|
||||
@architectures = "spacy.HashEmbedCNN.v1"
|
||||
@architectures = "spacy.HashEmbedCNN.v2"
|
||||
pretrained_vectors = null
|
||||
width = 96
|
||||
depth = 4
|
||||
|
|
|
@ -20,10 +20,16 @@ MISSING_VALUES = frozenset([None, 0, ""])
|
|||
class PRFScore:
|
||||
"""A precision / recall / F score."""
|
||||
|
||||
def __init__(self) -> None:
|
||||
self.tp = 0
|
||||
self.fp = 0
|
||||
self.fn = 0
|
||||
def __init__(
|
||||
self,
|
||||
*,
|
||||
tp: int = 0,
|
||||
fp: int = 0,
|
||||
fn: int = 0,
|
||||
) -> None:
|
||||
self.tp = tp
|
||||
self.fp = fp
|
||||
self.fn = fn
|
||||
|
||||
def __len__(self) -> int:
|
||||
return self.tp + self.fp + self.fn
|
||||
|
@ -305,6 +311,8 @@ class Scorer:
|
|||
*,
|
||||
getter: Callable[[Doc, str], Iterable[Span]] = getattr,
|
||||
has_annotation: Optional[Callable[[Doc], bool]] = None,
|
||||
labeled: bool = True,
|
||||
allow_overlap: bool = False,
|
||||
**cfg,
|
||||
) -> Dict[str, Any]:
|
||||
"""Returns PRF scores for labeled spans.
|
||||
|
@ -317,6 +325,11 @@ class Scorer:
|
|||
has_annotation (Optional[Callable[[Doc], bool]]) should return whether a `Doc`
|
||||
has annotation for this `attr`. Docs without annotation are skipped for
|
||||
scoring purposes.
|
||||
labeled (bool): Whether or not to include label information in
|
||||
the evaluation. If set to 'False', two spans will be considered
|
||||
equal if their start and end match, irrespective of their label.
|
||||
allow_overlap (bool): Whether or not to allow overlapping spans.
|
||||
If set to 'False', the alignment will automatically resolve conflicts.
|
||||
RETURNS (Dict[str, Any]): A dictionary containing the PRF scores under
|
||||
the keys attr_p/r/f and the per-type PRF scores under attr_per_type.
|
||||
|
||||
|
@ -345,33 +358,42 @@ class Scorer:
|
|||
gold_spans = set()
|
||||
pred_spans = set()
|
||||
for span in getter(gold_doc, attr):
|
||||
gold_span = (span.label_, span.start, span.end - 1)
|
||||
if labeled:
|
||||
gold_span = (span.label_, span.start, span.end - 1)
|
||||
else:
|
||||
gold_span = (span.start, span.end - 1)
|
||||
gold_spans.add(gold_span)
|
||||
gold_per_type[span.label_].add((span.label_, span.start, span.end - 1))
|
||||
gold_per_type[span.label_].add(gold_span)
|
||||
pred_per_type = {label: set() for label in labels}
|
||||
for span in example.get_aligned_spans_x2y(getter(pred_doc, attr)):
|
||||
pred_spans.add((span.label_, span.start, span.end - 1))
|
||||
pred_per_type[span.label_].add((span.label_, span.start, span.end - 1))
|
||||
for span in example.get_aligned_spans_x2y(getter(pred_doc, attr), allow_overlap):
|
||||
if labeled:
|
||||
pred_span = (span.label_, span.start, span.end - 1)
|
||||
else:
|
||||
pred_span = (span.start, span.end - 1)
|
||||
pred_spans.add(pred_span)
|
||||
pred_per_type[span.label_].add(pred_span)
|
||||
# Scores per label
|
||||
for k, v in score_per_type.items():
|
||||
if k in pred_per_type:
|
||||
v.score_set(pred_per_type[k], gold_per_type[k])
|
||||
if labeled:
|
||||
for k, v in score_per_type.items():
|
||||
if k in pred_per_type:
|
||||
v.score_set(pred_per_type[k], gold_per_type[k])
|
||||
# Score for all labels
|
||||
score.score_set(pred_spans, gold_spans)
|
||||
if len(score) > 0:
|
||||
return {
|
||||
f"{attr}_p": score.precision,
|
||||
f"{attr}_r": score.recall,
|
||||
f"{attr}_f": score.fscore,
|
||||
f"{attr}_per_type": {k: v.to_dict() for k, v in score_per_type.items()},
|
||||
}
|
||||
else:
|
||||
return {
|
||||
# Assemble final result
|
||||
final_scores = {
|
||||
f"{attr}_p": None,
|
||||
f"{attr}_r": None,
|
||||
f"{attr}_f": None,
|
||||
f"{attr}_per_type": None,
|
||||
}
|
||||
if labeled:
|
||||
final_scores[f"{attr}_per_type"] = None
|
||||
if len(score) > 0:
|
||||
final_scores[f"{attr}_p"] = score.precision
|
||||
final_scores[f"{attr}_r"] = score.recall
|
||||
final_scores[f"{attr}_f"] = score.fscore
|
||||
if labeled:
|
||||
final_scores[f"{attr}_per_type"] = {k: v.to_dict() for k, v in score_per_type.items()}
|
||||
return final_scores
|
||||
|
||||
@staticmethod
|
||||
def score_cats(
|
||||
|
|
|
@ -223,7 +223,7 @@ cdef class StringStore:
|
|||
it doesn't exist. Paths may be either strings or Path-like objects.
|
||||
"""
|
||||
path = util.ensure_path(path)
|
||||
strings = list(self)
|
||||
strings = sorted(self)
|
||||
srsly.write_json(path, strings)
|
||||
|
||||
def from_disk(self, path):
|
||||
|
@ -247,7 +247,7 @@ cdef class StringStore:
|
|||
|
||||
RETURNS (bytes): The serialized form of the `StringStore` object.
|
||||
"""
|
||||
return srsly.json_dumps(list(self))
|
||||
return srsly.json_dumps(sorted(self))
|
||||
|
||||
def from_bytes(self, bytes_data, **kwargs):
|
||||
"""Load state from a binary string.
|
||||
|
|
|
@ -6,12 +6,14 @@ import logging
|
|||
import mock
|
||||
|
||||
from spacy.lang.xx import MultiLanguage
|
||||
from spacy.tokens import Doc, Span
|
||||
from spacy.tokens import Doc, Span, Token
|
||||
from spacy.vocab import Vocab
|
||||
from spacy.lexeme import Lexeme
|
||||
from spacy.lang.en import English
|
||||
from spacy.attrs import ENT_TYPE, ENT_IOB, SENT_START, HEAD, DEP, MORPH
|
||||
|
||||
from .test_underscore import clean_underscore # noqa: F401
|
||||
|
||||
|
||||
def test_doc_api_init(en_vocab):
|
||||
words = ["a", "b", "c", "d"]
|
||||
|
@ -347,15 +349,19 @@ def test_doc_from_array_morph(en_vocab):
|
|||
assert [str(t.morph) for t in doc] == [str(t.morph) for t in new_doc]
|
||||
|
||||
|
||||
@pytest.mark.usefixtures("clean_underscore")
|
||||
def test_doc_api_from_docs(en_tokenizer, de_tokenizer):
|
||||
en_texts = ["Merging the docs is fun.", "", "They don't think alike."]
|
||||
en_texts_without_empty = [t for t in en_texts if len(t)]
|
||||
de_text = "Wie war die Frage?"
|
||||
en_docs = [en_tokenizer(text) for text in en_texts]
|
||||
docs_idx = en_texts[0].index("docs")
|
||||
en_docs[0].spans["group"] = [en_docs[0][1:4]]
|
||||
en_docs[2].spans["group"] = [en_docs[2][1:4]]
|
||||
span_group_texts = sorted([en_docs[0][1:4].text, en_docs[2][1:4].text])
|
||||
de_doc = de_tokenizer(de_text)
|
||||
expected = (True, None, None, None)
|
||||
en_docs[0].user_data[("._.", "is_ambiguous", docs_idx, None)] = expected
|
||||
Token.set_extension("is_ambiguous", default=False)
|
||||
en_docs[0][2]._.is_ambiguous = True # docs
|
||||
en_docs[2][3]._.is_ambiguous = True # think
|
||||
assert Doc.from_docs([]) is None
|
||||
assert de_doc is not Doc.from_docs([de_doc])
|
||||
assert str(de_doc) == str(Doc.from_docs([de_doc]))
|
||||
|
@ -372,11 +378,12 @@ def test_doc_api_from_docs(en_tokenizer, de_tokenizer):
|
|||
en_docs_tokens = [t for doc in en_docs for t in doc]
|
||||
assert len(m_doc) == len(en_docs_tokens)
|
||||
think_idx = len(en_texts[0]) + 1 + en_texts[2].index("think")
|
||||
assert m_doc[2]._.is_ambiguous == True
|
||||
assert m_doc[9].idx == think_idx
|
||||
with pytest.raises(AttributeError):
|
||||
# not callable, because it was not set via set_extension
|
||||
m_doc[2]._.is_ambiguous
|
||||
assert len(m_doc.user_data) == len(en_docs[0].user_data) # but it's there
|
||||
assert m_doc[9]._.is_ambiguous == True
|
||||
assert not any([t._.is_ambiguous for t in m_doc[3:8]])
|
||||
assert "group" in m_doc.spans
|
||||
assert span_group_texts == sorted([s.text for s in m_doc.spans["group"]])
|
||||
|
||||
m_doc = Doc.from_docs(en_docs, ensure_whitespace=False)
|
||||
assert len(en_texts_without_empty) == len(list(m_doc.sents))
|
||||
|
@ -388,6 +395,8 @@ def test_doc_api_from_docs(en_tokenizer, de_tokenizer):
|
|||
assert len(m_doc) == len(en_docs_tokens)
|
||||
think_idx = len(en_texts[0]) + 0 + en_texts[2].index("think")
|
||||
assert m_doc[9].idx == think_idx
|
||||
assert "group" in m_doc.spans
|
||||
assert span_group_texts == sorted([s.text for s in m_doc.spans["group"]])
|
||||
|
||||
m_doc = Doc.from_docs(en_docs, attrs=["lemma", "length", "pos"])
|
||||
assert len(str(m_doc)) > len(en_texts[0]) + len(en_texts[1])
|
||||
|
@ -399,6 +408,8 @@ def test_doc_api_from_docs(en_tokenizer, de_tokenizer):
|
|||
assert len(m_doc) == len(en_docs_tokens)
|
||||
think_idx = len(en_texts[0]) + 1 + en_texts[2].index("think")
|
||||
assert m_doc[9].idx == think_idx
|
||||
assert "group" in m_doc.spans
|
||||
assert span_group_texts == sorted([s.text for s in m_doc.spans["group"]])
|
||||
|
||||
|
||||
def test_doc_api_from_docs_ents(en_tokenizer):
|
||||
|
|
|
@ -452,3 +452,30 @@ def test_retokenize_disallow_zero_length(en_vocab):
|
|||
with pytest.raises(ValueError):
|
||||
with doc.retokenize() as retokenizer:
|
||||
retokenizer.merge(doc[1:1])
|
||||
|
||||
|
||||
def test_doc_retokenize_merge_without_parse_keeps_sents(en_tokenizer):
|
||||
text = "displaCy is a parse tool built with Javascript"
|
||||
sent_starts = [1, 0, 0, 0, 1, 0, 0, 0]
|
||||
tokens = en_tokenizer(text)
|
||||
|
||||
# merging within a sentence keeps all sentence boundaries
|
||||
doc = Doc(tokens.vocab, words=[t.text for t in tokens], sent_starts=sent_starts)
|
||||
assert len(list(doc.sents)) == 2
|
||||
with doc.retokenize() as retokenizer:
|
||||
retokenizer.merge(doc[1:3])
|
||||
assert len(list(doc.sents)) == 2
|
||||
|
||||
# merging over a sentence boundary unsets it by default
|
||||
doc = Doc(tokens.vocab, words=[t.text for t in tokens], sent_starts=sent_starts)
|
||||
assert len(list(doc.sents)) == 2
|
||||
with doc.retokenize() as retokenizer:
|
||||
retokenizer.merge(doc[3:6])
|
||||
assert doc[3].is_sent_start == None
|
||||
|
||||
# merging over a sentence boundary and setting sent_start
|
||||
doc = Doc(tokens.vocab, words=[t.text for t in tokens], sent_starts=sent_starts)
|
||||
assert len(list(doc.sents)) == 2
|
||||
with doc.retokenize() as retokenizer:
|
||||
retokenizer.merge(doc[3:6], attrs={"sent_start": True})
|
||||
assert len(list(doc.sents)) == 2
|
||||
|
|
|
@ -1,9 +1,11 @@
|
|||
import pytest
|
||||
from spacy.attrs import ORTH, LENGTH
|
||||
from spacy.tokens import Doc, Span
|
||||
from spacy.tokens import Doc, Span, Token
|
||||
from spacy.vocab import Vocab
|
||||
from spacy.util import filter_spans
|
||||
|
||||
from .test_underscore import clean_underscore # noqa: F401
|
||||
|
||||
|
||||
@pytest.fixture
|
||||
def doc(en_tokenizer):
|
||||
|
@ -219,11 +221,14 @@ def test_span_as_doc(doc):
|
|||
assert span_doc[0].idx == 0
|
||||
|
||||
|
||||
@pytest.mark.usefixtures("clean_underscore")
|
||||
def test_span_as_doc_user_data(doc):
|
||||
"""Test that the user_data can be preserved (but not by default). """
|
||||
my_key = "my_info"
|
||||
my_value = 342
|
||||
doc.user_data[my_key] = my_value
|
||||
Token.set_extension("is_x", default=False)
|
||||
doc[7]._.is_x = True
|
||||
|
||||
span = doc[4:10]
|
||||
span_doc_with = span.as_doc(copy_user_data=True)
|
||||
|
@ -232,6 +237,12 @@ def test_span_as_doc_user_data(doc):
|
|||
assert doc.user_data.get(my_key, None) is my_value
|
||||
assert span_doc_with.user_data.get(my_key, None) is my_value
|
||||
assert span_doc_without.user_data.get(my_key, None) is None
|
||||
for i in range(len(span_doc_with)):
|
||||
if i != 3:
|
||||
assert span_doc_with[i]._.is_x is False
|
||||
else:
|
||||
assert span_doc_with[i]._.is_x is True
|
||||
assert not any([t._.is_x for t in span_doc_without])
|
||||
|
||||
|
||||
def test_span_string_label_kb_id(doc):
|
||||
|
|
3
spacy/tests/enable_gpu.py
Normal file
|
@ -0,0 +1,3 @@
|
|||
from spacy import require_gpu
|
||||
|
||||
require_gpu()
|
|
@ -4,7 +4,9 @@ import re
|
|||
import copy
|
||||
from mock import Mock
|
||||
from spacy.matcher import DependencyMatcher
|
||||
from spacy.tokens import Doc
|
||||
from spacy.tokens import Doc, Token
|
||||
|
||||
from ..doc.test_underscore import clean_underscore # noqa: F401
|
||||
|
||||
|
||||
@pytest.fixture
|
||||
|
@ -344,3 +346,26 @@ def test_dependency_matcher_long_matches(en_vocab, doc):
|
|||
matcher = DependencyMatcher(en_vocab)
|
||||
with pytest.raises(ValueError):
|
||||
matcher.add("pattern", [pattern])
|
||||
|
||||
|
||||
@pytest.mark.usefixtures("clean_underscore")
|
||||
def test_dependency_matcher_span_user_data(en_tokenizer):
|
||||
doc = en_tokenizer("a b c d e")
|
||||
for token in doc:
|
||||
token.head = doc[0]
|
||||
token.dep_ = "a"
|
||||
get_is_c = lambda token: token.text in ("c",)
|
||||
Token.set_extension("is_c", default=False)
|
||||
doc[2]._.is_c = True
|
||||
pattern = [
|
||||
{"RIGHT_ID": "c", "RIGHT_ATTRS": {"_": {"is_c": True}}},
|
||||
]
|
||||
matcher = DependencyMatcher(en_tokenizer.vocab)
|
||||
matcher.add("C", [pattern])
|
||||
doc_matches = matcher(doc)
|
||||
offset = 1
|
||||
span_matches = matcher(doc[offset:])
|
||||
for doc_match, span_match in zip(sorted(doc_matches), sorted(span_matches)):
|
||||
assert doc_match[0] == span_match[0]
|
||||
for doc_t_i, span_t_i in zip(doc_match[1], span_match[1]):
|
||||
assert doc_t_i == span_t_i + offset
|
||||
|
|
|
@ -204,3 +204,90 @@ def test_matcher_remove():
|
|||
# removing again should throw an error
|
||||
with pytest.raises(ValueError):
|
||||
matcher.remove("Rule")
|
||||
|
||||
|
||||
def test_matcher_with_alignments_greedy_longest(en_vocab):
|
||||
cases = [
|
||||
("aaab", "a* b", [0, 0, 0, 1]),
|
||||
("baab", "b a* b", [0, 1, 1, 2]),
|
||||
("aaab", "a a a b", [0, 1, 2, 3]),
|
||||
("aaab", "a+ b", [0, 0, 0, 1]),
|
||||
("aaba", "a+ b a+", [0, 0, 1, 2]),
|
||||
("aabaa", "a+ b a+", [0, 0, 1, 2, 2]),
|
||||
("aaba", "a+ b a*", [0, 0, 1, 2]),
|
||||
("aaaa", "a*", [0, 0, 0, 0]),
|
||||
("baab", "b a* b b*", [0, 1, 1, 2]),
|
||||
("aabb", "a* b* a*", [0, 0, 1, 1]),
|
||||
("aaab", "a+ a+ a b", [0, 1, 2, 3]),
|
||||
("aaab", "a+ a+ a+ b", [0, 1, 2, 3]),
|
||||
("aaab", "a+ a a b", [0, 1, 2, 3]),
|
||||
("aaab", "a+ a a", [0, 1, 2]),
|
||||
("aaab", "a+ a a?", [0, 1, 2]),
|
||||
("aaaa", "a a a a a?", [0, 1, 2, 3]),
|
||||
("aaab", "a+ a b", [0, 0, 1, 2]),
|
||||
("aaab", "a+ a+ b", [0, 0, 1, 2]),
|
||||
]
|
||||
for string, pattern_str, result in cases:
|
||||
matcher = Matcher(en_vocab)
|
||||
doc = Doc(matcher.vocab, words=list(string))
|
||||
pattern = []
|
||||
for part in pattern_str.split():
|
||||
if part.endswith("+"):
|
||||
pattern.append({"ORTH": part[0], "OP": "+"})
|
||||
elif part.endswith("*"):
|
||||
pattern.append({"ORTH": part[0], "OP": "*"})
|
||||
elif part.endswith("?"):
|
||||
pattern.append({"ORTH": part[0], "OP": "?"})
|
||||
else:
|
||||
pattern.append({"ORTH": part})
|
||||
matcher.add("PATTERN", [pattern], greedy="LONGEST")
|
||||
matches = matcher(doc, with_alignments=True)
|
||||
n_matches = len(matches)
|
||||
|
||||
_, s, e, expected = matches[0]
|
||||
|
||||
assert expected == result, (string, pattern_str, s, e, n_matches)
|
||||
|
||||
|
||||
def test_matcher_with_alignments_nongreedy(en_vocab):
|
||||
cases = [
|
||||
(0, "aaab", "a* b", [[0, 1], [0, 0, 1], [0, 0, 0, 1], [1]]),
|
||||
(1, "baab", "b a* b", [[0, 1, 1, 2]]),
|
||||
(2, "aaab", "a a a b", [[0, 1, 2, 3]]),
|
||||
(3, "aaab", "a+ b", [[0, 1], [0, 0, 1], [0, 0, 0, 1]]),
|
||||
(4, "aaba", "a+ b a+", [[0, 1, 2], [0, 0, 1, 2]]),
|
||||
(5, "aabaa", "a+ b a+", [[0, 1, 2], [0, 0, 1, 2], [0, 0, 1, 2, 2], [0, 1, 2, 2] ]),
|
||||
(6, "aaba", "a+ b a*", [[0, 1], [0, 0, 1], [0, 0, 1, 2], [0, 1, 2]]),
|
||||
(7, "aaaa", "a*", [[0], [0, 0], [0, 0, 0], [0, 0, 0, 0]]),
|
||||
(8, "baab", "b a* b b*", [[0, 1, 1, 2]]),
|
||||
(9, "aabb", "a* b* a*", [[1], [2], [2, 2], [0, 1], [0, 0, 1], [0, 0, 1, 1], [0, 1, 1], [1, 1]]),
|
||||
(10, "aaab", "a+ a+ a b", [[0, 1, 2, 3]]),
|
||||
(11, "aaab", "a+ a+ a+ b", [[0, 1, 2, 3]]),
|
||||
(12, "aaab", "a+ a a b", [[0, 1, 2, 3]]),
|
||||
(13, "aaab", "a+ a a", [[0, 1, 2]]),
|
||||
(14, "aaab", "a+ a a?", [[0, 1], [0, 1, 2]]),
|
||||
(15, "aaaa", "a a a a a?", [[0, 1, 2, 3]]),
|
||||
(16, "aaab", "a+ a b", [[0, 1, 2], [0, 0, 1, 2]]),
|
||||
(17, "aaab", "a+ a+ b", [[0, 1, 2], [0, 0, 1, 2]]),
|
||||
]
|
||||
for case_id, string, pattern_str, results in cases:
|
||||
matcher = Matcher(en_vocab)
|
||||
doc = Doc(matcher.vocab, words=list(string))
|
||||
pattern = []
|
||||
for part in pattern_str.split():
|
||||
if part.endswith("+"):
|
||||
pattern.append({"ORTH": part[0], "OP": "+"})
|
||||
elif part.endswith("*"):
|
||||
pattern.append({"ORTH": part[0], "OP": "*"})
|
||||
elif part.endswith("?"):
|
||||
pattern.append({"ORTH": part[0], "OP": "?"})
|
||||
else:
|
||||
pattern.append({"ORTH": part})
|
||||
|
||||
matcher.add("PATTERN", [pattern])
|
||||
matches = matcher(doc, with_alignments=True)
|
||||
n_matches = len(matches)
|
||||
|
||||
for _, s, e, expected in matches:
|
||||
assert expected in results, (case_id, string, pattern_str, s, e, n_matches)
|
||||
assert len(expected) == e - s
|
||||
|
|
|
@ -5,6 +5,7 @@ from spacy.tokens import Span
|
|||
from spacy.language import Language
|
||||
from spacy.pipeline import EntityRuler
|
||||
from spacy.errors import MatchPatternError
|
||||
from thinc.api import NumpyOps, get_current_ops
|
||||
|
||||
|
||||
@pytest.fixture
|
||||
|
@ -201,13 +202,14 @@ def test_entity_ruler_overlapping_spans(nlp):
|
|||
|
||||
@pytest.mark.parametrize("n_process", [1, 2])
|
||||
def test_entity_ruler_multiprocessing(nlp, n_process):
|
||||
texts = ["I enjoy eating Pizza Hut pizza."]
|
||||
if isinstance(get_current_ops, NumpyOps) or n_process < 2:
|
||||
texts = ["I enjoy eating Pizza Hut pizza."]
|
||||
|
||||
patterns = [{"label": "FASTFOOD", "pattern": "Pizza Hut", "id": "1234"}]
|
||||
patterns = [{"label": "FASTFOOD", "pattern": "Pizza Hut", "id": "1234"}]
|
||||
|
||||
ruler = nlp.add_pipe("entity_ruler")
|
||||
ruler.add_patterns(patterns)
|
||||
ruler = nlp.add_pipe("entity_ruler")
|
||||
ruler.add_patterns(patterns)
|
||||
|
||||
for doc in nlp.pipe(texts, n_process=2):
|
||||
for ent in doc.ents:
|
||||
assert ent.ent_id_ == "1234"
|
||||
for doc in nlp.pipe(texts, n_process=2):
|
||||
for ent in doc.ents:
|
||||
assert ent.ent_id_ == "1234"
|
||||
|
|
|
@ -1,6 +1,7 @@
|
|||
import pytest
|
||||
import logging
|
||||
import mock
|
||||
import pickle
|
||||
from spacy import util, registry
|
||||
from spacy.lang.en import English
|
||||
from spacy.lookups import Lookups
|
||||
|
@ -106,6 +107,9 @@ def test_lemmatizer_serialize(nlp):
|
|||
doc2 = nlp2.make_doc("coping")
|
||||
doc2[0].pos_ = "VERB"
|
||||
assert doc2[0].lemma_ == ""
|
||||
doc2 = lemmatizer(doc2)
|
||||
doc2 = lemmatizer2(doc2)
|
||||
assert doc2[0].text == "coping"
|
||||
assert doc2[0].lemma_ == "cope"
|
||||
|
||||
# Make sure that lemmatizer cache can be pickled
|
||||
b = pickle.dumps(lemmatizer2)
|
||||
|
|
|
@ -4,7 +4,7 @@ import numpy
|
|||
import pytest
|
||||
from numpy.testing import assert_almost_equal
|
||||
from spacy.vocab import Vocab
|
||||
from thinc.api import NumpyOps, Model, data_validation
|
||||
from thinc.api import Model, data_validation, get_current_ops
|
||||
from thinc.types import Array2d, Ragged
|
||||
|
||||
from spacy.lang.en import English
|
||||
|
@ -13,7 +13,7 @@ from spacy.ml._character_embed import CharacterEmbed
|
|||
from spacy.tokens import Doc
|
||||
|
||||
|
||||
OPS = NumpyOps()
|
||||
OPS = get_current_ops()
|
||||
|
||||
texts = ["These are 4 words", "Here just three"]
|
||||
l0 = [[1, 2], [3, 4], [5, 6], [7, 8]]
|
||||
|
@ -82,7 +82,7 @@ def util_batch_unbatch_docs_list(
|
|||
Y_batched = model.predict(in_data)
|
||||
Y_not_batched = [model.predict([u])[0] for u in in_data]
|
||||
for i in range(len(Y_batched)):
|
||||
assert_almost_equal(Y_batched[i], Y_not_batched[i], decimal=4)
|
||||
assert_almost_equal(OPS.to_numpy(Y_batched[i]), OPS.to_numpy(Y_not_batched[i]), decimal=4)
|
||||
|
||||
|
||||
def util_batch_unbatch_docs_array(
|
||||
|
@ -91,7 +91,7 @@ def util_batch_unbatch_docs_array(
|
|||
with data_validation(True):
|
||||
model.initialize(in_data, out_data)
|
||||
Y_batched = model.predict(in_data).tolist()
|
||||
Y_not_batched = [model.predict([u])[0] for u in in_data]
|
||||
Y_not_batched = [model.predict([u])[0].tolist() for u in in_data]
|
||||
assert_almost_equal(Y_batched, Y_not_batched, decimal=4)
|
||||
|
||||
|
||||
|
@ -100,8 +100,8 @@ def util_batch_unbatch_docs_ragged(
|
|||
):
|
||||
with data_validation(True):
|
||||
model.initialize(in_data, out_data)
|
||||
Y_batched = model.predict(in_data)
|
||||
Y_batched = model.predict(in_data).data.tolist()
|
||||
Y_not_batched = []
|
||||
for u in in_data:
|
||||
Y_not_batched.extend(model.predict([u]).data.tolist())
|
||||
assert_almost_equal(Y_batched.data, Y_not_batched, decimal=4)
|
||||
assert_almost_equal(Y_batched, Y_not_batched, decimal=4)
|
||||
|
|
|
@ -1,4 +1,6 @@
|
|||
import pytest
|
||||
import mock
|
||||
import logging
|
||||
from spacy.language import Language
|
||||
from spacy.lang.en import English
|
||||
from spacy.lang.de import German
|
||||
|
@ -402,6 +404,38 @@ def test_pipe_factories_from_source():
|
|||
nlp.add_pipe("custom", source=source_nlp)
|
||||
|
||||
|
||||
def test_pipe_factories_from_source_language_subclass():
|
||||
class CustomEnglishDefaults(English.Defaults):
|
||||
stop_words = set(["custom", "stop"])
|
||||
|
||||
@registry.languages("custom_en")
|
||||
class CustomEnglish(English):
|
||||
lang = "custom_en"
|
||||
Defaults = CustomEnglishDefaults
|
||||
|
||||
source_nlp = English()
|
||||
source_nlp.add_pipe("tagger")
|
||||
|
||||
# custom subclass
|
||||
nlp = CustomEnglish()
|
||||
nlp.add_pipe("tagger", source=source_nlp)
|
||||
assert "tagger" in nlp.pipe_names
|
||||
|
||||
# non-subclass
|
||||
nlp = German()
|
||||
nlp.add_pipe("tagger", source=source_nlp)
|
||||
assert "tagger" in nlp.pipe_names
|
||||
|
||||
# mismatched vectors
|
||||
nlp = English()
|
||||
nlp.vocab.vectors.resize((1, 4))
|
||||
nlp.vocab.vectors.add("cat", vector=[1, 2, 3, 4])
|
||||
logger = logging.getLogger("spacy")
|
||||
with mock.patch.object(logger, "warning") as mock_warning:
|
||||
nlp.add_pipe("tagger", source=source_nlp)
|
||||
mock_warning.assert_called()
|
||||
|
||||
|
||||
def test_pipe_factories_from_source_custom():
|
||||
"""Test adding components from a source model with custom components."""
|
||||
name = "test_pipe_factories_from_source_custom"
|
||||
|
|
|
@ -1,7 +1,7 @@
|
|||
import pytest
|
||||
import random
|
||||
import numpy.random
|
||||
from numpy.testing import assert_equal
|
||||
from numpy.testing import assert_almost_equal
|
||||
from thinc.api import fix_random_seed
|
||||
from spacy import util
|
||||
from spacy.lang.en import English
|
||||
|
@ -222,8 +222,12 @@ def test_overfitting_IO():
|
|||
batch_cats_1 = [doc.cats for doc in nlp.pipe(texts)]
|
||||
batch_cats_2 = [doc.cats for doc in nlp.pipe(texts)]
|
||||
no_batch_cats = [doc.cats for doc in [nlp(text) for text in texts]]
|
||||
assert_equal(batch_cats_1, batch_cats_2)
|
||||
assert_equal(batch_cats_1, no_batch_cats)
|
||||
for cats_1, cats_2 in zip(batch_cats_1, batch_cats_2):
|
||||
for cat in cats_1:
|
||||
assert_almost_equal(cats_1[cat], cats_2[cat], decimal=5)
|
||||
for cats_1, cats_2 in zip(batch_cats_1, no_batch_cats):
|
||||
for cat in cats_1:
|
||||
assert_almost_equal(cats_1[cat], cats_2[cat], decimal=5)
|
||||
|
||||
|
||||
def test_overfitting_IO_multi():
|
||||
|
@ -270,8 +274,12 @@ def test_overfitting_IO_multi():
|
|||
batch_deps_1 = [doc.cats for doc in nlp.pipe(texts)]
|
||||
batch_deps_2 = [doc.cats for doc in nlp.pipe(texts)]
|
||||
no_batch_deps = [doc.cats for doc in [nlp(text) for text in texts]]
|
||||
assert_equal(batch_deps_1, batch_deps_2)
|
||||
assert_equal(batch_deps_1, no_batch_deps)
|
||||
for cats_1, cats_2 in zip(batch_deps_1, batch_deps_2):
|
||||
for cat in cats_1:
|
||||
assert_almost_equal(cats_1[cat], cats_2[cat], decimal=5)
|
||||
for cats_1, cats_2 in zip(batch_deps_1, no_batch_deps):
|
||||
for cat in cats_1:
|
||||
assert_almost_equal(cats_1[cat], cats_2[cat], decimal=5)
|
||||
|
||||
|
||||
# fmt: off
|
||||
|
|
|
@ -8,8 +8,8 @@ from spacy.tokens import Doc
|
|||
from spacy.training import Example
|
||||
from spacy import util
|
||||
from spacy.lang.en import English
|
||||
from thinc.api import Config
|
||||
from numpy.testing import assert_equal
|
||||
from thinc.api import Config, get_current_ops
|
||||
from numpy.testing import assert_array_equal
|
||||
|
||||
from ..util import get_batch, make_tempdir
|
||||
|
||||
|
@ -160,7 +160,8 @@ def test_tok2vec_listener():
|
|||
|
||||
doc = nlp("Running the pipeline as a whole.")
|
||||
doc_tensor = tagger_tok2vec.predict([doc])[0]
|
||||
assert_equal(doc.tensor, doc_tensor)
|
||||
ops = get_current_ops()
|
||||
assert_array_equal(ops.to_numpy(doc.tensor), ops.to_numpy(doc_tensor))
|
||||
|
||||
# TODO: should this warn or error?
|
||||
nlp.select_pipes(disable="tok2vec")
|
||||
|
|
|
@ -9,6 +9,7 @@ from spacy.language import Language
|
|||
from spacy.util import ensure_path, load_model_from_path
|
||||
import numpy
|
||||
import pickle
|
||||
from thinc.api import NumpyOps, get_current_ops
|
||||
|
||||
from ..util import make_tempdir
|
||||
|
||||
|
@ -169,21 +170,22 @@ def test_issue4725_1():
|
|||
|
||||
|
||||
def test_issue4725_2():
|
||||
# ensures that this runs correctly and doesn't hang or crash because of the global vectors
|
||||
# if it does crash, it's usually because of calling 'spawn' for multiprocessing (e.g. on Windows),
|
||||
# or because of issues with pickling the NER (cf test_issue4725_1)
|
||||
vocab = Vocab(vectors_name="test_vocab_add_vector")
|
||||
data = numpy.ndarray((5, 3), dtype="f")
|
||||
data[0] = 1.0
|
||||
data[1] = 2.0
|
||||
vocab.set_vector("cat", data[0])
|
||||
vocab.set_vector("dog", data[1])
|
||||
nlp = English(vocab=vocab)
|
||||
nlp.add_pipe("ner")
|
||||
nlp.initialize()
|
||||
docs = ["Kurt is in London."] * 10
|
||||
for _ in nlp.pipe(docs, batch_size=2, n_process=2):
|
||||
pass
|
||||
if isinstance(get_current_ops, NumpyOps):
|
||||
# ensures that this runs correctly and doesn't hang or crash because of the global vectors
|
||||
# if it does crash, it's usually because of calling 'spawn' for multiprocessing (e.g. on Windows),
|
||||
# or because of issues with pickling the NER (cf test_issue4725_1)
|
||||
vocab = Vocab(vectors_name="test_vocab_add_vector")
|
||||
data = numpy.ndarray((5, 3), dtype="f")
|
||||
data[0] = 1.0
|
||||
data[1] = 2.0
|
||||
vocab.set_vector("cat", data[0])
|
||||
vocab.set_vector("dog", data[1])
|
||||
nlp = English(vocab=vocab)
|
||||
nlp.add_pipe("ner")
|
||||
nlp.initialize()
|
||||
docs = ["Kurt is in London."] * 10
|
||||
for _ in nlp.pipe(docs, batch_size=2, n_process=2):
|
||||
pass
|
||||
|
||||
|
||||
def test_issue4849():
|
||||
|
@@ -204,10 +206,11 @@ def test_issue4849():
|
|||
count_ents += len([ent for ent in doc.ents if ent.ent_id > 0])
|
||||
assert count_ents == 2
|
||||
# USING 2 PROCESSES
|
||||
count_ents = 0
|
||||
for doc in nlp.pipe([text], n_process=2):
|
||||
count_ents += len([ent for ent in doc.ents if ent.ent_id > 0])
|
||||
assert count_ents == 2
|
||||
if isinstance(get_current_ops(), NumpyOps):
|
||||
count_ents = 0
|
||||
for doc in nlp.pipe([text], n_process=2):
|
||||
count_ents += len([ent for ent in doc.ents if ent.ent_id > 0])
|
||||
assert count_ents == 2
|
||||
|
||||
|
||||
@Language.factory("my_pipe")
|
||||
|
@@ -239,10 +242,11 @@ def test_issue4903():
|
|||
nlp.add_pipe("sentencizer")
|
||||
nlp.add_pipe("my_pipe", after="sentencizer")
|
||||
text = ["I like bananas.", "Do you like them?", "No, I prefer wasabi."]
|
||||
docs = list(nlp.pipe(text, n_process=2))
|
||||
assert docs[0].text == "I like bananas."
|
||||
assert docs[1].text == "Do you like them?"
|
||||
assert docs[2].text == "No, I prefer wasabi."
|
||||
if isinstance(get_current_ops(), NumpyOps):
|
||||
docs = list(nlp.pipe(text, n_process=2))
|
||||
assert docs[0].text == "I like bananas."
|
||||
assert docs[1].text == "Do you like them?"
|
||||
assert docs[2].text == "No, I prefer wasabi."
|
||||
|
||||
|
||||
def test_issue4924():
|
||||
|
|
|
@@ -6,6 +6,7 @@ from spacy.language import Language
|
|||
from spacy.lang.en.syntax_iterators import noun_chunks
|
||||
from spacy.vocab import Vocab
|
||||
import spacy
|
||||
from thinc.api import get_current_ops
|
||||
import pytest
|
||||
|
||||
from ...util import make_tempdir
|
||||
|
@@ -54,16 +55,17 @@ def test_issue5082():
|
|||
ruler.add_patterns(patterns)
|
||||
parsed_vectors_1 = [t.vector for t in nlp(text)]
|
||||
assert len(parsed_vectors_1) == 4
|
||||
numpy.testing.assert_array_equal(parsed_vectors_1[0], array1)
|
||||
numpy.testing.assert_array_equal(parsed_vectors_1[1], array2)
|
||||
numpy.testing.assert_array_equal(parsed_vectors_1[2], array3)
|
||||
numpy.testing.assert_array_equal(parsed_vectors_1[3], array4)
|
||||
ops = get_current_ops()
|
||||
numpy.testing.assert_array_equal(ops.to_numpy(parsed_vectors_1[0]), array1)
|
||||
numpy.testing.assert_array_equal(ops.to_numpy(parsed_vectors_1[1]), array2)
|
||||
numpy.testing.assert_array_equal(ops.to_numpy(parsed_vectors_1[2]), array3)
|
||||
numpy.testing.assert_array_equal(ops.to_numpy(parsed_vectors_1[3]), array4)
|
||||
nlp.add_pipe("merge_entities")
|
||||
parsed_vectors_2 = [t.vector for t in nlp(text)]
|
||||
assert len(parsed_vectors_2) == 3
|
||||
numpy.testing.assert_array_equal(parsed_vectors_2[0], array1)
|
||||
numpy.testing.assert_array_equal(parsed_vectors_2[1], array2)
|
||||
numpy.testing.assert_array_equal(parsed_vectors_2[2], array34)
|
||||
numpy.testing.assert_array_equal(ops.to_numpy(parsed_vectors_2[0]), array1)
|
||||
numpy.testing.assert_array_equal(ops.to_numpy(parsed_vectors_2[1]), array2)
|
||||
numpy.testing.assert_array_equal(ops.to_numpy(parsed_vectors_2[2]), array34)
|
||||
|
||||
|
||||
def test_issue5137():
|
||||
|
|
|
@@ -1,5 +1,6 @@
|
|||
import pytest
|
||||
from thinc.api import Config, fix_random_seed
|
||||
from numpy.testing import assert_almost_equal
|
||||
from thinc.api import Config, fix_random_seed, get_current_ops
|
||||
|
||||
from spacy.lang.en import English
|
||||
from spacy.pipeline.textcat import single_label_default_config, single_label_bow_config
|
||||
|
@@ -44,11 +45,12 @@ def test_issue5551(textcat_config):
|
|||
nlp.update([Example.from_dict(doc, annots)])
|
||||
# Store the result of each iteration
|
||||
result = pipe.model.predict([doc])
|
||||
results.append(list(result[0]))
|
||||
results.append(result[0])
|
||||
# All results should be the same because of the fixed seed
|
||||
assert len(results) == 3
|
||||
assert results[0] == results[1]
|
||||
assert results[0] == results[2]
|
||||
ops = get_current_ops()
|
||||
assert_almost_equal(ops.to_numpy(results[0]), ops.to_numpy(results[1]))
|
||||
assert_almost_equal(ops.to_numpy(results[0]), ops.to_numpy(results[2]))
|
||||
|
||||
|
||||
def test_issue5838():
|
||||
|
|
|
@@ -1,4 +1,6 @@
|
|||
from spacy.kb import KnowledgeBase
|
||||
from spacy.lang.en import English
|
||||
from spacy.training import Example
|
||||
|
||||
|
||||
def test_issue7065():
|
||||
|
@@ -16,3 +18,58 @@ def test_issue7065():
|
|||
ent = doc.ents[0]
|
||||
assert ent.start < sent0.end < ent.end
|
||||
assert sentences.index(ent.sent) == 0
|
||||
|
||||
|
||||
def test_issue7065_b():
|
||||
# Test that the NEL doesn't crash when an entity crosses a sentence boundary
|
||||
nlp = English()
|
||||
vector_length = 3
|
||||
nlp.add_pipe("sentencizer")
|
||||
|
||||
text = "Mahler 's Symphony No. 8 was beautiful."
|
||||
entities = [(0, 6, "PERSON"), (10, 24, "WORK")]
|
||||
links = {(0, 6): {"Q7304": 1.0, "Q270853": 0.0},
|
||||
(10, 24): {"Q7304": 0.0, "Q270853": 1.0}}
|
||||
sent_starts = [1, -1, 0, 0, 0, 0, 0, 0, 0]
|
||||
doc = nlp(text)
|
||||
example = Example.from_dict(doc, {"entities": entities, "links": links, "sent_starts": sent_starts})
|
||||
train_examples = [example]
|
||||
|
||||
def create_kb(vocab):
|
||||
# create artificial KB
|
||||
mykb = KnowledgeBase(vocab, entity_vector_length=vector_length)
|
||||
mykb.add_entity(entity="Q270853", freq=12, entity_vector=[9, 1, -7])
|
||||
mykb.add_alias(
|
||||
alias="No. 8",
|
||||
entities=["Q270853"],
|
||||
probabilities=[1.0],
|
||||
)
|
||||
mykb.add_entity(entity="Q7304", freq=12, entity_vector=[6, -4, 3])
|
||||
mykb.add_alias(
|
||||
alias="Mahler",
|
||||
entities=["Q7304"],
|
||||
probabilities=[1.0],
|
||||
)
|
||||
return mykb
|
||||
|
||||
# Create the Entity Linker component and add it to the pipeline
|
||||
entity_linker = nlp.add_pipe("entity_linker", last=True)
|
||||
entity_linker.set_kb(create_kb)
|
||||
|
||||
# train the NEL pipe
|
||||
optimizer = nlp.initialize(get_examples=lambda: train_examples)
|
||||
for i in range(2):
|
||||
losses = {}
|
||||
nlp.update(train_examples, sgd=optimizer, losses=losses)
|
||||
|
||||
# Add a custom rule-based component to mimic NER
|
||||
patterns = [
|
||||
{"label": "PERSON", "pattern": [{"LOWER": "mahler"}]},
|
||||
{"label": "WORK", "pattern": [{"LOWER": "symphony"}, {"LOWER": "no"}, {"LOWER": "."}, {"LOWER": "8"}]}
|
||||
]
|
||||
ruler = nlp.add_pipe("entity_ruler", before="entity_linker")
|
||||
ruler.add_patterns(patterns)
|
||||
|
||||
# test the trained model - this should not throw E148
|
||||
doc = nlp(text)
|
||||
assert doc
|
||||
|
|
|
@@ -4,7 +4,7 @@ import spacy
|
|||
from spacy.lang.en import English
|
||||
from spacy.lang.de import German
|
||||
from spacy.language import Language, DEFAULT_CONFIG, DEFAULT_CONFIG_PRETRAIN_PATH
|
||||
from spacy.util import registry, load_model_from_config, load_config
|
||||
from spacy.util import registry, load_model_from_config, load_config, load_config_from_str
|
||||
from spacy.ml.models import build_Tok2Vec_model, build_tb_parser_model
|
||||
from spacy.ml.models import MultiHashEmbed, MaxoutWindowEncoder
|
||||
from spacy.schemas import ConfigSchema, ConfigSchemaPretrain
|
||||
|
@@ -465,3 +465,32 @@ def test_config_only_resolve_relevant_blocks():
|
|||
nlp.initialize()
|
||||
nlp.config["initialize"]["lookups"] = None
|
||||
nlp.initialize()
|
||||
|
||||
|
||||
def test_hyphen_in_config():
|
||||
hyphen_config_str = """
|
||||
[nlp]
|
||||
lang = "en"
|
||||
pipeline = ["my_punctual_component"]
|
||||
|
||||
[components]
|
||||
|
||||
[components.my_punctual_component]
|
||||
factory = "my_punctual_component"
|
||||
punctuation = ["?","-"]
|
||||
"""
|
||||
|
||||
@spacy.Language.factory("my_punctual_component")
|
||||
class MyPunctualComponent(object):
|
||||
name = "my_punctual_component"
|
||||
|
||||
def __init__(
|
||||
self,
|
||||
nlp,
|
||||
name,
|
||||
punctuation,
|
||||
):
|
||||
self.punctuation = punctuation
|
||||
|
||||
nlp = English.from_config(load_config_from_str(hyphen_config_str))
|
||||
assert nlp.get_pipe("my_punctual_component").punctuation == ['?', '-']
|
||||
|
|
|
@@ -26,10 +26,14 @@ def test_serialize_custom_tokenizer(en_vocab, en_tokenizer):
|
|||
assert tokenizer.rules != {}
|
||||
assert tokenizer.token_match is not None
|
||||
assert tokenizer.url_match is not None
|
||||
assert tokenizer.prefix_search is not None
|
||||
assert tokenizer.infix_finditer is not None
|
||||
tokenizer.from_bytes(tokenizer_bytes)
|
||||
assert tokenizer.rules == {}
|
||||
assert tokenizer.token_match is None
|
||||
assert tokenizer.url_match is None
|
||||
assert tokenizer.prefix_search is None
|
||||
assert tokenizer.infix_finditer is None
|
||||
|
||||
tokenizer = Tokenizer(en_vocab, rules={"ABC.": [{"ORTH": "ABC"}, {"ORTH": "."}]})
|
||||
tokenizer.rules = {}
|
||||
|
|
|
@@ -49,9 +49,9 @@ def test_serialize_vocab_roundtrip_disk(strings1, strings2):
|
|||
vocab1_d = Vocab().from_disk(file_path1)
|
||||
vocab2_d = Vocab().from_disk(file_path2)
|
||||
# check strings rather than lexemes, which are only reloaded on demand
|
||||
assert strings1 == [s for s in vocab1_d.strings]
|
||||
assert strings2 == [s for s in vocab2_d.strings]
|
||||
if strings1 == strings2:
|
||||
assert set(strings1) == set([s for s in vocab1_d.strings])
|
||||
assert set(strings2) == set([s for s in vocab2_d.strings])
|
||||
if set(strings1) == set(strings2):
|
||||
assert [s for s in vocab1_d.strings] == [s for s in vocab2_d.strings]
|
||||
else:
|
||||
assert [s for s in vocab1_d.strings] != [s for s in vocab2_d.strings]
|
||||
|
@@ -96,7 +96,7 @@ def test_serialize_stringstore_roundtrip_bytes(strings1, strings2):
|
|||
sstore2 = StringStore(strings=strings2)
|
||||
sstore1_b = sstore1.to_bytes()
|
||||
sstore2_b = sstore2.to_bytes()
|
||||
if strings1 == strings2:
|
||||
if set(strings1) == set(strings2):
|
||||
assert sstore1_b == sstore2_b
|
||||
else:
|
||||
assert sstore1_b != sstore2_b
|
||||
|
@@ -104,7 +104,7 @@ def test_serialize_stringstore_roundtrip_bytes(strings1, strings2):
|
|||
assert sstore1.to_bytes() == sstore1_b
|
||||
new_sstore1 = StringStore().from_bytes(sstore1_b)
|
||||
assert new_sstore1.to_bytes() == sstore1_b
|
||||
assert list(new_sstore1) == strings1
|
||||
assert set(new_sstore1) == set(strings1)
|
||||
|
||||
|
||||
@pytest.mark.parametrize("strings1,strings2", test_strings)
|
||||
|
@@ -118,12 +118,12 @@ def test_serialize_stringstore_roundtrip_disk(strings1, strings2):
|
|||
sstore2.to_disk(file_path2)
|
||||
sstore1_d = StringStore().from_disk(file_path1)
|
||||
sstore2_d = StringStore().from_disk(file_path2)
|
||||
assert list(sstore1_d) == list(sstore1)
|
||||
assert list(sstore2_d) == list(sstore2)
|
||||
if strings1 == strings2:
|
||||
assert list(sstore1_d) == list(sstore2_d)
|
||||
assert set(sstore1_d) == set(sstore1)
|
||||
assert set(sstore2_d) == set(sstore2)
|
||||
if set(strings1) == set(strings2):
|
||||
assert set(sstore1_d) == set(sstore2_d)
|
||||
else:
|
||||
assert list(sstore1_d) != list(sstore2_d)
|
||||
assert set(sstore1_d) != set(sstore2_d)
|
||||
|
||||
|
||||
@pytest.mark.parametrize("strings,lex_attr", test_strings_attrs)
|
||||
|
|
|
@@ -307,8 +307,11 @@ def test_project_config_validation2(config, n_errors):
|
|||
assert len(errors) == n_errors
|
||||
|
||||
|
||||
def test_project_config_interpolation():
|
||||
variables = {"a": 10, "b": {"c": "foo", "d": True}}
|
||||
@pytest.mark.parametrize(
|
||||
"int_value", [10, pytest.param("10", marks=pytest.mark.xfail)],
|
||||
)
|
||||
def test_project_config_interpolation(int_value):
|
||||
variables = {"a": int_value, "b": {"c": "foo", "d": True}}
|
||||
commands = [
|
||||
{"name": "x", "script": ["hello ${vars.a} ${vars.b.c}"]},
|
||||
{"name": "y", "script": ["${vars.b.c} ${vars.b.d}"]},
|
||||
|
@@ -317,6 +320,8 @@ def test_project_config_interpolation():
|
|||
with make_tempdir() as d:
|
||||
srsly.write_yaml(d / "project.yml", project)
|
||||
cfg = load_project_config(d)
|
||||
assert type(cfg) == dict
|
||||
assert type(cfg["commands"]) == list
|
||||
assert cfg["commands"][0]["script"][0] == "hello 10 foo"
|
||||
assert cfg["commands"][1]["script"][0] == "foo true"
|
||||
commands = [{"name": "x", "script": ["hello ${vars.a} ${vars.b.e}"]}]
|
||||
|
@@ -325,6 +330,24 @@ def test_project_config_interpolation():
|
|||
substitute_project_variables(project)
|
||||
|
||||
|
||||
@pytest.mark.parametrize(
|
||||
"greeting", [342, "everyone", "tout le monde", pytest.param("42", marks=pytest.mark.xfail)],
|
||||
)
|
||||
def test_project_config_interpolation_override(greeting):
|
||||
variables = {"a": "world"}
|
||||
commands = [
|
||||
{"name": "x", "script": ["hello ${vars.a}"]},
|
||||
]
|
||||
overrides = {"vars.a": greeting}
|
||||
project = {"commands": commands, "vars": variables}
|
||||
with make_tempdir() as d:
|
||||
srsly.write_yaml(d / "project.yml", project)
|
||||
cfg = load_project_config(d, overrides=overrides)
|
||||
assert type(cfg) == dict
|
||||
assert type(cfg["commands"]) == list
|
||||
assert cfg["commands"][0]["script"][0] == f"hello {greeting}"
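The interpolation tests above check that `${vars.…}` placeholders in a command script are replaced with values from the `vars` section, and that an override such as `vars.a` takes precedence over the value in `project.yml`. A minimal, stdlib-only sketch of that idea follows. `interpolate`, `apply_overrides` and `PLACEHOLDER` are hypothetical names written for illustration and are not spaCy's implementation; rendering non-string values as JSON is an assumption inferred from the expected `"foo true"` output.

```python
import json
import re
from typing import Any, Dict

# Matches placeholders such as ${vars.a} or ${vars.b.c} (hypothetical pattern).
PLACEHOLDER = re.compile(r"\$\{vars\.([A-Za-z0-9_.]+)\}")


def apply_overrides(variables: Dict[str, Any], overrides: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy of `variables` with dotted "vars.x.y" overrides applied."""
    result = json.loads(json.dumps(variables))  # cheap deep copy for JSON-like data
    for dotted, value in overrides.items():
        parts = dotted.split(".")[1:]  # drop the leading "vars"
        target = result
        for part in parts[:-1]:
            target = target.setdefault(part, {})
        target[parts[-1]] = value
    return result


def interpolate(template: str, variables: Dict[str, Any]) -> str:
    """Replace each ${vars.path} placeholder with the value found at that path."""

    def lookup(match: "re.Match[str]") -> str:
        value: Any = variables
        for part in match.group(1).split("."):
            value = value[part]  # raises KeyError for an unknown variable
        # Assumption: strings are inserted as-is, other values as JSON.
        return value if isinstance(value, str) else json.dumps(value)

    return PLACEHOLDER.sub(lookup, template)
```

In this sketch an unknown variable such as `${vars.b.e}` surfaces as a `KeyError`; the exception raised by the real `substitute_project_variables` is not shown in this diff.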
|
||||
|
||||
|
||||
def test_project_config_interpolation_env():
|
||||
variables = {"a": 10}
|
||||
env_var = "SPACY_TEST_FOO"
|
||||
|
|
|
@@ -10,6 +10,7 @@ from spacy.lang.en import English
|
|||
from spacy.lang.de import German
|
||||
from spacy.util import registry, ignore_error, raise_error
|
||||
import spacy
|
||||
from thinc.api import NumpyOps, get_current_ops
|
||||
|
||||
from .util import add_vecs_to_vocab, assert_docs_equal
|
||||
|
||||
|
@@ -142,25 +143,29 @@ def texts():
|
|||
|
||||
@pytest.mark.parametrize("n_process", [1, 2])
|
||||
def test_language_pipe(nlp2, n_process, texts):
|
||||
texts = texts * 10
|
||||
expecteds = [nlp2(text) for text in texts]
|
||||
docs = nlp2.pipe(texts, n_process=n_process, batch_size=2)
|
||||
ops = get_current_ops()
|
||||
if isinstance(ops, NumpyOps) or n_process < 2:
|
||||
texts = texts * 10
|
||||
expecteds = [nlp2(text) for text in texts]
|
||||
docs = nlp2.pipe(texts, n_process=n_process, batch_size=2)
|
||||
|
||||
for doc, expected_doc in zip(docs, expecteds):
|
||||
assert_docs_equal(doc, expected_doc)
|
||||
for doc, expected_doc in zip(docs, expecteds):
|
||||
assert_docs_equal(doc, expected_doc)
|
||||
|
||||
|
||||
@pytest.mark.parametrize("n_process", [1, 2])
|
||||
def test_language_pipe_stream(nlp2, n_process, texts):
|
||||
# check if nlp.pipe can handle infinite length iterator properly.
|
||||
stream_texts = itertools.cycle(texts)
|
||||
texts0, texts1 = itertools.tee(stream_texts)
|
||||
expecteds = (nlp2(text) for text in texts0)
|
||||
docs = nlp2.pipe(texts1, n_process=n_process, batch_size=2)
|
||||
ops = get_current_ops()
|
||||
if isinstance(ops, NumpyOps) or n_process < 2:
|
||||
# check if nlp.pipe can handle infinite length iterator properly.
|
||||
stream_texts = itertools.cycle(texts)
|
||||
texts0, texts1 = itertools.tee(stream_texts)
|
||||
expecteds = (nlp2(text) for text in texts0)
|
||||
docs = nlp2.pipe(texts1, n_process=n_process, batch_size=2)
|
||||
|
||||
n_fetch = 20
|
||||
for doc, expected_doc in itertools.islice(zip(docs, expecteds), n_fetch):
|
||||
assert_docs_equal(doc, expected_doc)
|
||||
n_fetch = 20
|
||||
for doc, expected_doc in itertools.islice(zip(docs, expecteds), n_fetch):
|
||||
assert_docs_equal(doc, expected_doc)
|
||||
|
||||
|
||||
def test_language_pipe_error_handler():
|
||||
|
|
|
@@ -8,7 +8,8 @@ from spacy import prefer_gpu, require_gpu, require_cpu
|
|||
from spacy.ml._precomputable_affine import PrecomputableAffine
|
||||
from spacy.ml._precomputable_affine import _backprop_precomputable_affine_padding
|
||||
from spacy.util import dot_to_object, SimpleFrozenList, import_file
|
||||
from thinc.api import Config, Optimizer, ConfigValidationError
|
||||
from thinc.api import Config, Optimizer, ConfigValidationError, get_current_ops
|
||||
from thinc.api import set_current_ops
|
||||
from spacy.training.batchers import minibatch_by_words
|
||||
from spacy.lang.en import English
|
||||
from spacy.lang.nl import Dutch
|
||||
|
@@ -81,6 +82,7 @@ def test_PrecomputableAffine(nO=4, nI=5, nF=3, nP=2):
|
|||
|
||||
|
||||
def test_prefer_gpu():
|
||||
current_ops = get_current_ops()
|
||||
try:
|
||||
import cupy # noqa: F401
|
||||
|
||||
|
@@ -88,9 +90,11 @@ def test_prefer_gpu():
|
|||
assert isinstance(get_current_ops(), CupyOps)
|
||||
except ImportError:
|
||||
assert not prefer_gpu()
|
||||
set_current_ops(current_ops)
|
||||
|
||||
|
||||
def test_require_gpu():
|
||||
current_ops = get_current_ops()
|
||||
try:
|
||||
import cupy # noqa: F401
|
||||
|
||||
|
@@ -99,9 +103,11 @@ def test_require_gpu():
|
|||
except ImportError:
|
||||
with pytest.raises(ValueError):
|
||||
require_gpu()
|
||||
set_current_ops(current_ops)
|
||||
|
||||
|
||||
def test_require_cpu():
|
||||
current_ops = get_current_ops()
|
||||
require_cpu()
|
||||
assert isinstance(get_current_ops(), NumpyOps)
|
||||
try:
|
||||
|
@@ -113,6 +119,7 @@ def test_require_cpu():
|
|||
pass
|
||||
require_cpu()
|
||||
assert isinstance(get_current_ops(), NumpyOps)
|
||||
set_current_ops(current_ops)
|
||||
|
||||
|
||||
def test_ascii_filenames():
|
||||
|
|
|
@@ -1,7 +1,7 @@
|
|||
from typing import List
|
||||
import pytest
|
||||
from thinc.api import fix_random_seed, Adam, set_dropout_rate
|
||||
from numpy.testing import assert_array_equal
|
||||
from numpy.testing import assert_array_equal, assert_array_almost_equal
|
||||
import numpy
|
||||
from spacy.ml.models import build_Tok2Vec_model, MultiHashEmbed, MaxoutWindowEncoder
|
||||
from spacy.ml.models import build_bow_text_classifier, build_simple_cnn_text_classifier
|
||||
|
@@ -109,7 +109,7 @@ def test_models_initialize_consistently(seed, model_func, kwargs):
|
|||
model2.initialize()
|
||||
params1 = get_all_params(model1)
|
||||
params2 = get_all_params(model2)
|
||||
assert_array_equal(params1, params2)
|
||||
assert_array_equal(model1.ops.to_numpy(params1), model2.ops.to_numpy(params2))
|
||||
|
||||
|
||||
@pytest.mark.parametrize(
|
||||
|
@@ -134,14 +134,25 @@ def test_models_predict_consistently(seed, model_func, kwargs, get_X):
|
|||
for i in range(len(tok2vec1)):
|
||||
for j in range(len(tok2vec1[i])):
|
||||
assert_array_equal(
|
||||
numpy.asarray(tok2vec1[i][j]), numpy.asarray(tok2vec2[i][j])
|
||||
numpy.asarray(model1.ops.to_numpy(tok2vec1[i][j])),
|
||||
numpy.asarray(model2.ops.to_numpy(tok2vec2[i][j])),
|
||||
)
|
||||
|
||||
try:
|
||||
Y1 = model1.ops.to_numpy(Y1)
|
||||
Y2 = model2.ops.to_numpy(Y2)
|
||||
except Exception:
|
||||
pass
|
||||
if isinstance(Y1, numpy.ndarray):
|
||||
assert_array_equal(Y1, Y2)
|
||||
elif isinstance(Y1, List):
|
||||
assert len(Y1) == len(Y2)
|
||||
for y1, y2 in zip(Y1, Y2):
|
||||
try:
|
||||
y1 = model1.ops.to_numpy(y1)
|
||||
y2 = model2.ops.to_numpy(y2)
|
||||
except Exception:
|
||||
pass
|
||||
assert_array_equal(y1, y2)
|
||||
else:
|
||||
raise ValueError(f"Could not compare type {type(Y1)}")
|
||||
|
@@ -169,12 +180,17 @@ def test_models_update_consistently(seed, dropout, model_func, kwargs, get_X):
|
|||
model.finish_update(optimizer)
|
||||
updated_params = get_all_params(model)
|
||||
with pytest.raises(AssertionError):
|
||||
assert_array_equal(initial_params, updated_params)
|
||||
assert_array_equal(
|
||||
model.ops.to_numpy(initial_params), model.ops.to_numpy(updated_params)
|
||||
)
|
||||
return model
|
||||
|
||||
model1 = get_updated_model()
|
||||
model2 = get_updated_model()
|
||||
assert_array_equal(get_all_params(model1), get_all_params(model2))
|
||||
assert_array_almost_equal(
|
||||
model1.ops.to_numpy(get_all_params(model1)),
|
||||
model2.ops.to_numpy(get_all_params(model2)),
|
||||
)
|
||||
|
||||
|
||||
@pytest.mark.parametrize("model_func,kwargs", [(StaticVectors, {"nO": 128, "nM": 300})])
|
||||
|
|
|
@@ -3,10 +3,10 @@ import pytest
|
|||
from pytest import approx
|
||||
from spacy.training import Example
|
||||
from spacy.training.iob_utils import offsets_to_biluo_tags
|
||||
from spacy.scorer import Scorer, ROCAUCScore
|
||||
from spacy.scorer import Scorer, ROCAUCScore, PRFScore
|
||||
from spacy.scorer import _roc_auc_score, _roc_curve
|
||||
from spacy.lang.en import English
|
||||
from spacy.tokens import Doc
|
||||
from spacy.tokens import Doc, Span
|
||||
|
||||
|
||||
test_las_apple = [
|
||||
|
@@ -403,3 +403,68 @@ def test_roc_auc_score():
|
|||
score.score_set(0.75, 1)
|
||||
with pytest.raises(ValueError):
|
||||
_ = score.score # noqa: F841
|
||||
|
||||
|
||||
def test_score_spans():
|
||||
nlp = English()
|
||||
text = "This is just a random sentence."
|
||||
key = "my_spans"
|
||||
gold = nlp.make_doc(text)
|
||||
pred = nlp.make_doc(text)
|
||||
spans = []
|
||||
spans.append(gold.char_span(0, 4, label="PERSON"))
|
||||
spans.append(gold.char_span(0, 7, label="ORG"))
|
||||
spans.append(gold.char_span(8, 12, label="ORG"))
|
||||
gold.spans[key] = spans
|
||||
|
||||
def span_getter(doc, span_key):
|
||||
return doc.spans[span_key]
|
||||
|
||||
# Predict exactly the same, but overlapping spans will be discarded
|
||||
pred.spans[key] = spans
|
||||
eg = Example(pred, gold)
|
||||
scores = Scorer.score_spans([eg], attr=key, getter=span_getter)
|
||||
assert scores[f"{key}_p"] == 1.0
|
||||
assert scores[f"{key}_r"] < 1.0
|
||||
|
||||
# Allow overlapping, now both precision and recall should be 100%
|
||||
pred.spans[key] = spans
|
||||
eg = Example(pred, gold)
|
||||
scores = Scorer.score_spans([eg], attr=key, getter=span_getter, allow_overlap=True)
|
||||
assert scores[f"{key}_p"] == 1.0
|
||||
assert scores[f"{key}_r"] == 1.0
|
||||
|
||||
# Change the predicted labels
|
||||
new_spans = [Span(pred, span.start, span.end, label="WRONG") for span in spans]
|
||||
pred.spans[key] = new_spans
|
||||
eg = Example(pred, gold)
|
||||
scores = Scorer.score_spans([eg], attr=key, getter=span_getter, allow_overlap=True)
|
||||
assert scores[f"{key}_p"] == 0.0
|
||||
assert scores[f"{key}_r"] == 0.0
|
||||
assert f"{key}_per_type" in scores
|
||||
|
||||
# Discard labels from the evaluation
|
||||
scores = Scorer.score_spans([eg], attr=key, getter=span_getter, allow_overlap=True, labeled=False)
|
||||
assert scores[f"{key}_p"] == 1.0
|
||||
assert scores[f"{key}_r"] == 1.0
|
||||
assert f"{key}_per_type" not in scores
|
||||
|
||||
|
||||
def test_prf_score():
|
||||
cand = {"hi", "ho"}
|
||||
gold1 = {"yo", "hi"}
|
||||
gold2 = set()
|
||||
|
||||
a = PRFScore()
|
||||
a.score_set(cand=cand, gold=gold1)
|
||||
assert (a.precision, a.recall, a.fscore) == approx((0.5, 0.5, 0.5))
|
||||
|
||||
b = PRFScore()
|
||||
b.score_set(cand=cand, gold=gold2)
|
||||
assert (b.precision, b.recall, b.fscore) == approx((0.0, 0.0, 0.0))
|
||||
|
||||
c = a + b
|
||||
assert (c.precision, c.recall, c.fscore) == approx((0.25, 0.5, 0.33333333))
|
||||
|
||||
a += b
|
||||
assert (a.precision, a.recall, a.fscore) == approx((c.precision, c.recall, c.fscore))
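The expected values in `test_prf_score` follow from summing true positives, false positives and false negatives before computing the ratios, rather than averaging the per-set scores. A small sketch of that bookkeeping is shown here. `CountingPRF` is a hypothetical stand-in written for illustration, not spaCy's `PRFScore`, and returning 0.0 for an empty denominator is an assumption.

```python
from dataclasses import dataclass
from typing import Set


@dataclass
class CountingPRF:
    """Hypothetical precision/recall/F-score accumulator based on raw counts."""

    tp: int = 0
    fp: int = 0
    fn: int = 0

    def score_set(self, cand: Set[str], gold: Set[str]) -> None:
        # Items in both sets are hits; the rest are false positives or misses.
        self.tp += len(cand & gold)
        self.fp += len(cand - gold)
        self.fn += len(gold - cand)

    def __add__(self, other: "CountingPRF") -> "CountingPRF":
        # Adding two scores adds their counts, so `a += b` also works.
        return CountingPRF(self.tp + other.tp, self.fp + other.fp, self.fn + other.fn)

    @property
    def precision(self) -> float:
        total = self.tp + self.fp
        return self.tp / total if total else 0.0

    @property
    def recall(self) -> float:
        total = self.tp + self.fn
        return self.tp / total if total else 0.0

    @property
    def fscore(self) -> float:
        p, r = self.precision, self.recall
        return 2 * p * r / (p + r) if (p + r) else 0.0
```

With `cand = {"hi", "ho"}`, the first gold set contributes one hit, one false positive and one miss; the empty gold set contributes two false positives. The combined counts are `tp=1, fp=3, fn=1`, which gives precision 0.25 and recall 0.5.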
|
|
@@ -1,5 +1,7 @@
|
|||
import pytest
|
||||
import re
|
||||
from spacy.util import get_lang_class
|
||||
from spacy.tokenizer import Tokenizer
|
||||
|
||||
# Only include languages with no external dependencies
|
||||
# "is" seems to confuse importlib, so we're also excluding it for now
|
||||
|
@@ -60,3 +62,18 @@ def test_tokenizer_explain(lang):
|
|||
tokens = [t.text for t in tokenizer(sentence) if not t.is_space]
|
||||
debug_tokens = [t[1] for t in tokenizer.explain(sentence)]
|
||||
assert tokens == debug_tokens
|
||||
|
||||
|
||||
def test_tokenizer_explain_special_matcher(en_vocab):
|
||||
suffix_re = re.compile(r"[\.]$")
|
||||
infix_re = re.compile(r"[/]")
|
||||
rules = {"a.": [{"ORTH": "a."}]}
|
||||
tokenizer = Tokenizer(
|
||||
en_vocab,
|
||||
rules=rules,
|
||||
suffix_search=suffix_re.search,
|
||||
infix_finditer=infix_re.finditer,
|
||||
)
|
||||
tokens = [t.text for t in tokenizer("a/a.")]
|
||||
explain_tokens = [t[1] for t in tokenizer.explain("a/a.")]
|
||||
assert tokens == explain_tokens
|
||||
|
|
|
@@ -1,4 +1,5 @@
|
|||
import pytest
|
||||
import re
|
||||
from spacy.vocab import Vocab
|
||||
from spacy.tokenizer import Tokenizer
|
||||
from spacy.util import ensure_path
|
||||
|
@@ -186,3 +187,31 @@ def test_tokenizer_special_cases_spaces(tokenizer):
|
|||
assert [t.text for t in tokenizer("a b c")] == ["a", "b", "c"]
|
||||
tokenizer.add_special_case("a b c", [{"ORTH": "a b c"}])
|
||||
assert [t.text for t in tokenizer("a b c")] == ["a b c"]
|
||||
|
||||
|
||||
def test_tokenizer_flush_cache(en_vocab):
|
||||
suffix_re = re.compile(r"[\.]$")
|
||||
tokenizer = Tokenizer(
|
||||
en_vocab,
|
||||
suffix_search=suffix_re.search,
|
||||
)
|
||||
assert [t.text for t in tokenizer("a.")] == ["a", "."]
|
||||
tokenizer.suffix_search = None
|
||||
assert [t.text for t in tokenizer("a.")] == ["a."]
|
||||
|
||||
|
||||
def test_tokenizer_flush_specials(en_vocab):
|
||||
suffix_re = re.compile(r"[\.]$")
|
||||
rules = {"a a": [{"ORTH": "a a"}]}
|
||||
tokenizer1 = Tokenizer(
|
||||
en_vocab,
|
||||
suffix_search=suffix_re.search,
|
||||
rules=rules,
|
||||
)
|
||||
tokenizer2 = Tokenizer(
|
||||
en_vocab,
|
||||
suffix_search=suffix_re.search,
|
||||
)
|
||||
assert [t.text for t in tokenizer1("a a.")] == ["a a", "."]
|
||||
tokenizer1.rules = {}
|
||||
assert [t.text for t in tokenizer1("a a.")] == ["a", "a", "."]
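The two flush tests above rely on the tokenizer discarding cached results when one of its settings changes; otherwise the second call on `"a."` would return the stale split. A toy sketch of that pattern, assuming a cache keyed by input text that is cleared in a property setter, is given here. `CachingSplitter` is a hypothetical class for illustration only and says nothing about how spaCy's `Tokenizer` stores its cache internally.

```python
import re
from typing import Callable, Dict, List, Optional


class CachingSplitter:
    """Toy whitespace tokenizer that splits off suffixes and caches results."""

    def __init__(self, suffix_search: Optional[Callable] = None) -> None:
        self._suffix_search = suffix_search
        self._cache: Dict[str, List[str]] = {}

    @property
    def suffix_search(self) -> Optional[Callable]:
        return self._suffix_search

    @suffix_search.setter
    def suffix_search(self, value: Optional[Callable]) -> None:
        # Changing the rule invalidates every cached tokenization.
        self._suffix_search = value
        self._cache.clear()

    def __call__(self, text: str) -> List[str]:
        if text not in self._cache:
            tokens: List[str] = []
            for chunk in text.split():
                suffixes: List[str] = []
                while chunk and self._suffix_search is not None:
                    match = self._suffix_search(chunk)
                    if match is None or match.start() == match.end():
                        break
                    # Peel the matched suffix off the end of the chunk.
                    suffixes.insert(0, chunk[match.start():])
                    chunk = chunk[: match.start()]
                if chunk:
                    tokens.append(chunk)
                tokens.extend(suffixes)
            self._cache[text] = tokens
        return list(self._cache[text])
```

Without the `self._cache.clear()` call in the setter, the assertion after `suffix_search = None` would fail in this sketch, which is the kind of regression the tests guard against.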
|
||||
|
|
|
@@ -2,6 +2,7 @@ import pytest
|
|||
from spacy.training.example import Example
|
||||
from spacy.tokens import Doc
|
||||
from spacy.vocab import Vocab
|
||||
from spacy.util import to_ternary_int
|
||||
|
||||
|
||||
def test_Example_init_requires_doc_objects():
|
||||
|
@@ -121,7 +122,7 @@ def test_Example_from_dict_with_morphology(annots):
|
|||
[
|
||||
{
|
||||
"words": ["This", "is", "one", "sentence", "this", "is", "another"],
|
||||
"sent_starts": [1, 0, 0, 0, 1, 0, 0],
|
||||
"sent_starts": [1, False, 0, None, True, -1, -5.7],
|
||||
}
|
||||
],
|
||||
)
|
||||
|
@@ -131,7 +132,12 @@ def test_Example_from_dict_with_sent_start(annots):
|
|||
example = Example.from_dict(predicted, annots)
|
||||
assert len(list(example.reference.sents)) == 2
|
||||
for i, token in enumerate(example.reference):
|
||||
assert bool(token.is_sent_start) == bool(annots["sent_starts"][i])
|
||||
if to_ternary_int(annots["sent_starts"][i]) == 1:
|
||||
assert token.is_sent_start is True
|
||||
elif to_ternary_int(annots["sent_starts"][i]) == 0:
|
||||
assert token.is_sent_start is None
|
||||
else:
|
||||
assert token.is_sent_start is False
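The updated test feeds mixed values (`1`, `False`, `0`, `None`, `True`, `-1`, `-5.7`) as `sent_starts` and expects exactly two sentences, which implies a three-way mapping: 1 for a sentence start, 0 for "unknown", and -1 for "not a sentence start". The sketch below shows one mapping consistent with those expectations. `ternary_from_value` is a hypothetical helper inferred from the test, and spaCy's `to_ternary_int` may differ in details.

```python
from typing import Any


def ternary_from_value(val: Any) -> int:
    """Map a loosely typed flag to 1 (True), 0 (unknown) or -1 (False)."""
    # Identity checks come first because `False == 0` and `True == 1` in Python.
    if val is True:
        return 1
    if val is None:
        return 0
    if val is False:
        return -1
    if val == 1:
        return 1
    if val == 0:
        return 0
    # Assumption: any other value counts as "not a sentence start".
    return -1
```

The order of the checks matters: testing `val == 0` before `val is False` would classify `False` as "unknown" instead of -1.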
|
||||
|
||||
|
||||
@pytest.mark.parametrize(
|
||||
|
|
|
@@ -426,6 +426,29 @@ def test_aligned_spans_x2y(en_vocab, en_tokenizer):
|
|||
assert [(ent.start, ent.end) for ent in ents_x2y] == [(0, 2), (4, 6)]
|
||||
|
||||
|
||||
def test_aligned_spans_y2x_overlap(en_vocab, en_tokenizer):
|
||||
text = "I flew to San Francisco Valley"
|
||||
nlp = English()
|
||||
doc = nlp(text)
|
||||
# the reference doc has overlapping spans
|
||||
gold_doc = nlp.make_doc(text)
|
||||
spans = []
|
||||
prefix = "I flew to "
|
||||
spans.append(gold_doc.char_span(len(prefix), len(prefix + "San Francisco"), label="CITY"))
|
||||
spans.append(gold_doc.char_span(len(prefix), len(prefix + "San Francisco Valley"), label="VALLEY"))
|
||||
spans_key = "overlap_ents"
|
||||
gold_doc.spans[spans_key] = spans
|
||||
example = Example(doc, gold_doc)
|
||||
spans_gold = example.reference.spans[spans_key]
|
||||
assert [(ent.start, ent.end) for ent in spans_gold] == [(3, 5), (3, 6)]
|
||||
|
||||
# Ensure that 'get_aligned_spans_y2x' has the aligned entities correct
|
||||
spans_y2x_no_overlap = example.get_aligned_spans_y2x(spans_gold, allow_overlap=False)
|
||||
assert [(ent.start, ent.end) for ent in spans_y2x_no_overlap] == [(3, 5)]
|
||||
spans_y2x_overlap = example.get_aligned_spans_y2x(spans_gold, allow_overlap=True)
|
||||
assert [(ent.start, ent.end) for ent in spans_y2x_overlap] == [(3, 5), (3, 6)]
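In the test above, `allow_overlap=False` keeps only the first of two spans that share tokens, while `allow_overlap=True` keeps both. A simplified sketch of that filtering over `(start, end, label)` tuples follows. `filter_spans` is a hypothetical helper that drops any span touching an already used token; it is an assumption about the observable behaviour in this test, not a description of how `get_aligned_spans_y2x` is implemented.

```python
from typing import List, Set, Tuple

# (start token index, end token index exclusive, label)
SpanTuple = Tuple[int, int, str]


def filter_spans(spans: List[SpanTuple], allow_overlap: bool) -> List[SpanTuple]:
    """Keep spans in order, optionally skipping those that reuse a seen token."""
    kept: List[SpanTuple] = []
    seen: Set[int] = set()
    for start, end, label in spans:
        indices = set(range(start, end))
        if not allow_overlap and indices & seen:
            continue  # shares at least one token with an earlier span
        kept.append((start, end, label))
        seen.update(indices)
    return kept
```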
|
||||
|
||||
|
||||
def test_gold_ner_missing_tags(en_tokenizer):
|
||||
doc = en_tokenizer("I flew to Silicon Valley via London.")
|
||||
biluo_tags = [None, "O", "O", "B-LOC", "L-LOC", "O", "U-GPE", "O"]
|
||||
|
|
|
@@ -5,6 +5,7 @@ import srsly
|
|||
from spacy.tokens import Doc
|
||||
from spacy.vocab import Vocab
|
||||
from spacy.util import make_tempdir # noqa: F401
|
||||
from thinc.api import get_current_ops
|
||||
|
||||
|
||||
@contextlib.contextmanager
|
||||
|
@@ -58,7 +59,10 @@ def add_vecs_to_vocab(vocab, vectors):
|
|||
|
||||
def get_cosine(vec1, vec2):
|
||||
"""Get cosine for two given vectors"""
|
||||
return numpy.dot(vec1, vec2) / (numpy.linalg.norm(vec1) * numpy.linalg.norm(vec2))
|
||||
OPS = get_current_ops()
|
||||
v1 = OPS.to_numpy(OPS.asarray(vec1))
|
||||
v2 = OPS.to_numpy(OPS.asarray(vec2))
|
||||
return numpy.dot(v1, v2) / (numpy.linalg.norm(v1) * numpy.linalg.norm(v2))
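The helper above moves both vectors to host memory with `to_numpy` before computing the cosine, so the same formula works whether the inputs are NumPy or GPU arrays. For reference, the formula itself (dot product divided by the product of the norms) can be sketched without any array library. `cosine` is a hypothetical name and the function assumes two non-zero vectors of equal length.

```python
import math
from typing import Sequence


def cosine(vec1: Sequence[float], vec2: Sequence[float]) -> float:
    """Cosine similarity: dot(vec1, vec2) / (|vec1| * |vec2|)."""
    dot = sum(a * b for a, b in zip(vec1, vec2))
    norm1 = math.sqrt(sum(a * a for a in vec1))
    norm2 = math.sqrt(sum(b * b for b in vec2))
    # A zero vector would raise ZeroDivisionError here; callers must avoid it.
    return dot / (norm1 * norm2)
```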
|
||||
|
||||
|
||||
def assert_docs_equal(doc1, doc2):
|
||||
|
|
|
@@ -1,6 +1,7 @@
|
|||
import pytest
|
||||
import numpy
|
||||
from numpy.testing import assert_allclose, assert_equal
|
||||
from thinc.api import get_current_ops
|
||||
from spacy.vocab import Vocab
|
||||
from spacy.vectors import Vectors
|
||||
from spacy.tokenizer import Tokenizer
|
||||
|
@@ -9,6 +10,7 @@ from spacy.tokens import Doc
|
|||
|
||||
from ..util import add_vecs_to_vocab, get_cosine, make_tempdir
|
||||
|
||||
OPS = get_current_ops()
|
||||
|
||||
@pytest.fixture
|
||||
def strings():
|
||||
|
@@ -18,21 +20,21 @@ def strings():
|
|||
@pytest.fixture
|
||||
def vectors():
|
||||
return [
|
||||
("apple", [1, 2, 3]),
|
||||
("orange", [-1, -2, -3]),
|
||||
("and", [-1, -1, -1]),
|
||||
("juice", [5, 5, 10]),
|
||||
("pie", [7, 6.3, 8.9]),
|
||||
("apple", OPS.asarray([1, 2, 3])),
|
||||
("orange", OPS.asarray([-1, -2, -3])),
|
||||
("and", OPS.asarray([-1, -1, -1])),
|
||||
("juice", OPS.asarray([5, 5, 10])),
|
||||
("pie", OPS.asarray([7, 6.3, 8.9])),
|
||||
]
|
||||
|
||||
|
||||
@pytest.fixture
|
||||
def ngrams_vectors():
|
||||
return [
|
||||
("apple", [1, 2, 3]),
|
||||
("app", [-0.1, -0.2, -0.3]),
|
||||
("ppl", [-0.2, -0.3, -0.4]),
|
||||
("pl", [0.7, 0.8, 0.9]),
|
||||
("apple", OPS.asarray([1, 2, 3])),
|
||||
("app", OPS.asarray([-0.1, -0.2, -0.3])),
|
||||
("ppl", OPS.asarray([-0.2, -0.3, -0.4])),
|
||||
("pl", OPS.asarray([0.7, 0.8, 0.9])),
|
||||
]
|
||||
|
||||
|
||||
|
@@ -171,8 +173,10 @@ def test_vectors_most_similar_identical():
|
|||
@pytest.mark.parametrize("text", ["apple and orange"])
|
||||
def test_vectors_token_vector(tokenizer_v, vectors, text):
|
||||
doc = tokenizer_v(text)
|
||||
assert vectors[0] == (doc[0].text, list(doc[0].vector))
|
||||
assert vectors[1] == (doc[2].text, list(doc[2].vector))
|
||||
assert vectors[0][0] == doc[0].text
|
||||
assert all([a == b for a, b in zip(vectors[0][1], doc[0].vector)])
|
||||
assert vectors[1][0] == doc[2].text
|
||||
assert all([a == b for a, b in zip(vectors[1][1], doc[2].vector)])
|
||||
|
||||
|
||||
@pytest.mark.parametrize("text", ["apple"])
|
||||
|
@@ -301,7 +305,7 @@ def test_vectors_doc_doc_similarity(vocab, text1, text2):
|
|||
|
||||
def test_vocab_add_vector():
|
||||
vocab = Vocab(vectors_name="test_vocab_add_vector")
|
||||
data = numpy.ndarray((5, 3), dtype="f")
|
||||
data = OPS.xp.ndarray((5, 3), dtype="f")
|
||||
data[0] = 1.0
|
||||
data[1] = 2.0
|
||||
vocab.set_vector("cat", data[0])
|
||||
|
@@ -320,10 +324,10 @@ def test_vocab_prune_vectors():
|
|||
_ = vocab["cat"] # noqa: F841
|
||||
_ = vocab["dog"] # noqa: F841
|
||||
_ = vocab["kitten"] # noqa: F841
|
||||
data = numpy.ndarray((5, 3), dtype="f")
|
||||
data[0] = [1.0, 1.2, 1.1]
|
||||
data[1] = [0.3, 1.3, 1.0]
|
||||
data[2] = [0.9, 1.22, 1.05]
|
||||
data = OPS.xp.ndarray((5, 3), dtype="f")
|
||||
data[0] = OPS.asarray([1.0, 1.2, 1.1])
|
||||
data[1] = OPS.asarray([0.3, 1.3, 1.0])
|
||||
data[2] = OPS.asarray([0.9, 1.22, 1.05])
|
||||
vocab.set_vector("cat", data[0])
|
||||
vocab.set_vector("dog", data[1])
|
||||
vocab.set_vector("kitten", data[2])
|
||||
|
@ -332,40 +336,41 @@ def test_vocab_prune_vectors():
|
|||
assert list(remap.keys()) == ["kitten"]
|
||||
neighbour, similarity = list(remap.values())[0]
|
||||
assert neighbour == "cat", remap
|
||||
assert_allclose(similarity, get_cosine(data[0], data[2]), atol=1e-4, rtol=1e-3)
|
||||
cosine = get_cosine(data[0], data[2])
|
||||
assert_allclose(float(similarity), cosine, atol=1e-4, rtol=1e-3)
|
||||
|
||||
|
||||
def test_vectors_serialize():
|
||||
data = numpy.asarray([[4, 2, 2, 2], [4, 2, 2, 2], [1, 1, 1, 1]], dtype="f")
|
||||
data = OPS.asarray([[4, 2, 2, 2], [4, 2, 2, 2], [1, 1, 1, 1]], dtype="f")
|
||||
v = Vectors(data=data, keys=["A", "B", "C"])
|
||||
b = v.to_bytes()
|
||||
v_r = Vectors()
|
||||
v_r.from_bytes(b)
|
||||
assert_equal(v.data, v_r.data)
|
||||
assert_equal(OPS.to_numpy(v.data), OPS.to_numpy(v_r.data))
|
||||
assert v.key2row == v_r.key2row
|
||||
v.resize((5, 4))
|
||||
v_r.resize((5, 4))
|
||||
row = v.add("D", vector=numpy.asarray([1, 2, 3, 4], dtype="f"))
|
||||
row_r = v_r.add("D", vector=numpy.asarray([1, 2, 3, 4], dtype="f"))
|
||||
row = v.add("D", vector=OPS.asarray([1, 2, 3, 4], dtype="f"))
|
||||
row_r = v_r.add("D", vector=OPS.asarray([1, 2, 3, 4], dtype="f"))
|
||||
assert row == row_r
|
||||
assert_equal(v.data, v_r.data)
|
||||
assert_equal(OPS.to_numpy(v.data), OPS.to_numpy(v_r.data))
|
||||
assert v.is_full == v_r.is_full
|
||||
with make_tempdir() as d:
|
||||
v.to_disk(d)
|
||||
v_r.from_disk(d)
|
||||
assert_equal(v.data, v_r.data)
|
||||
assert_equal(OPS.to_numpy(v.data), OPS.to_numpy(v_r.data))
|
||||
assert v.key2row == v_r.key2row
|
||||
v.resize((5, 4))
|
||||
v_r.resize((5, 4))
|
||||
row = v.add("D", vector=numpy.asarray([10, 20, 30, 40], dtype="f"))
|
||||
row_r = v_r.add("D", vector=numpy.asarray([10, 20, 30, 40], dtype="f"))
|
||||
row = v.add("D", vector=OPS.asarray([10, 20, 30, 40], dtype="f"))
|
||||
row_r = v_r.add("D", vector=OPS.asarray([10, 20, 30, 40], dtype="f"))
|
||||
assert row == row_r
|
||||
assert_equal(v.data, v_r.data)
|
||||
assert_equal(OPS.to_numpy(v.data), OPS.to_numpy(v_r.data))
|
||||
|
||||
|
||||
def test_vector_is_oov():
|
||||
vocab = Vocab(vectors_name="test_vocab_is_oov")
|
||||
data = numpy.ndarray((5, 3), dtype="f")
|
||||
data = OPS.xp.ndarray((5, 3), dtype="f")
|
||||
data[0] = 1.0
|
||||
data[1] = 2.0
|
||||
vocab.set_vector("cat", data[0])
|
||||
|
|
|
@ -23,8 +23,8 @@ cdef class Tokenizer:
|
|||
cdef object _infix_finditer
|
||||
cdef object _rules
|
||||
cdef PhraseMatcher _special_matcher
|
||||
cdef int _property_init_count
|
||||
cdef int _property_init_max
|
||||
cdef int _property_init_count # TODO: unused, remove in v3.1
|
||||
cdef int _property_init_max # TODO: unused, remove in v3.1
|
||||
|
||||
cdef Doc _tokenize_affixes(self, unicode string, bint with_special_cases)
|
||||
cdef int _apply_special_cases(self, Doc doc) except -1
|
||||
|
|
|
@ -20,11 +20,12 @@ from .attrs import intify_attrs
|
|||
from .symbols import ORTH, NORM
|
||||
from .errors import Errors, Warnings
|
||||
from . import util
|
||||
from .util import registry
|
||||
from .util import registry, get_words_and_spaces
|
||||
from .attrs import intify_attrs
|
||||
from .symbols import ORTH
|
||||
from .scorer import Scorer
|
||||
from .training import validate_examples
|
||||
from .tokens import Span
|
||||
|
||||
|
||||
cdef class Tokenizer:
|
||||
|
@ -68,8 +69,6 @@ cdef class Tokenizer:
|
|||
self._rules = {}
|
||||
self._special_matcher = PhraseMatcher(self.vocab)
|
||||
self._load_special_cases(rules)
|
||||
self._property_init_count = 0
|
||||
self._property_init_max = 4
|
||||
|
||||
property token_match:
|
||||
def __get__(self):
|
||||
|
@ -78,8 +77,6 @@ cdef class Tokenizer:
|
|||
def __set__(self, token_match):
|
||||
self._token_match = token_match
|
||||
self._reload_special_cases()
|
||||
if self._property_init_count <= self._property_init_max:
|
||||
self._property_init_count += 1
|
||||
|
||||
property url_match:
|
||||
def __get__(self):
|
||||
|
@ -87,7 +84,7 @@ cdef class Tokenizer:
|
|||
|
||||
def __set__(self, url_match):
|
||||
self._url_match = url_match
|
||||
self._flush_cache()
|
||||
self._reload_special_cases()
|
||||
|
||||
property prefix_search:
|
||||
def __get__(self):
|
||||
|
@ -96,8 +93,6 @@ cdef class Tokenizer:
|
|||
def __set__(self, prefix_search):
|
||||
self._prefix_search = prefix_search
|
||||
self._reload_special_cases()
|
||||
if self._property_init_count <= self._property_init_max:
|
||||
self._property_init_count += 1
|
||||
|
||||
property suffix_search:
|
||||
def __get__(self):
|
||||
|
@ -106,8 +101,6 @@ cdef class Tokenizer:
|
|||
def __set__(self, suffix_search):
|
||||
self._suffix_search = suffix_search
|
||||
self._reload_special_cases()
|
||||
if self._property_init_count <= self._property_init_max:
|
||||
self._property_init_count += 1
|
||||
|
||||
property infix_finditer:
|
||||
def __get__(self):
|
||||
|
@ -116,8 +109,6 @@ cdef class Tokenizer:
|
|||
def __set__(self, infix_finditer):
|
||||
self._infix_finditer = infix_finditer
|
||||
self._reload_special_cases()
|
||||
if self._property_init_count <= self._property_init_max:
|
||||
self._property_init_count += 1
|
||||
|
||||
property rules:
|
||||
def __get__(self):
|
||||
|
@ -125,7 +116,7 @@ cdef class Tokenizer:
|
|||
|
||||
def __set__(self, rules):
|
||||
self._rules = {}
|
||||
self._reset_cache([key for key in self._cache])
|
||||
self._flush_cache()
|
||||
self._flush_specials()
|
||||
self._cache = PreshMap()
|
||||
self._specials = PreshMap()
|
||||
|
@ -225,6 +216,7 @@ cdef class Tokenizer:
|
|||
self.mem.free(cached)
|
||||
|
||||
def _flush_specials(self):
|
||||
self._special_matcher = PhraseMatcher(self.vocab)
|
||||
for k in self._specials:
|
||||
cached = <_Cached*>self._specials.get(k)
|
||||
del self._specials[k]
|
||||
|
@ -567,7 +559,6 @@ cdef class Tokenizer:
|
|||
"""Add special-case tokenization rules."""
|
||||
if special_cases is not None:
|
||||
for chunk, substrings in sorted(special_cases.items()):
|
||||
self._validate_special_case(chunk, substrings)
|
||||
self.add_special_case(chunk, substrings)
|
||||
|
||||
def _validate_special_case(self, chunk, substrings):
|
||||
|
@ -615,16 +606,9 @@ cdef class Tokenizer:
|
|||
self._special_matcher.add(string, None, self._tokenize_affixes(string, False))
|
||||
|
||||
def _reload_special_cases(self):
|
||||
try:
|
||||
self._property_init_count
|
||||
except AttributeError:
|
||||
return
|
||||
# only reload if all 4 of prefix, suffix, infix, token_match
|
||||
# have been initialized
|
||||
if self.vocab is not None and self._property_init_count >= self._property_init_max:
|
||||
self._flush_cache()
|
||||
self._flush_specials()
|
||||
self._load_special_cases(self._rules)
|
||||
self._flush_cache()
|
||||
self._flush_specials()
|
||||
self._load_special_cases(self._rules)
|
||||
|
||||
def explain(self, text):
|
||||
"""A debugging tokenizer that provides information about which
|
||||
|
@ -638,8 +622,14 @@ cdef class Tokenizer:
|
|||
DOCS: https://spacy.io/api/tokenizer#explain
|
||||
"""
|
||||
prefix_search = self.prefix_search
|
||||
if prefix_search is None:
|
||||
prefix_search = re.compile("a^").search
|
||||
suffix_search = self.suffix_search
|
||||
if suffix_search is None:
|
||||
suffix_search = re.compile("a^").search
|
||||
infix_finditer = self.infix_finditer
|
||||
if infix_finditer is None:
|
||||
infix_finditer = re.compile("a^").finditer
|
||||
token_match = self.token_match
|
||||
if token_match is None:
|
||||
token_match = re.compile("a^").match
|
||||
|
@ -687,7 +677,7 @@ cdef class Tokenizer:
|
|||
tokens.append(("URL_MATCH", substring))
|
||||
substring = ''
|
||||
elif substring in special_cases:
|
||||
tokens.extend(("SPECIAL-" + str(i + 1), self.vocab.strings[e[ORTH]]) for i, e in enumerate(special_cases[substring]))
|
||||
tokens.extend((f"SPECIAL-{i + 1}", self.vocab.strings[e[ORTH]]) for i, e in enumerate(special_cases[substring]))
|
||||
substring = ''
|
||||
elif list(infix_finditer(substring)):
|
||||
infixes = infix_finditer(substring)
|
||||
|
@ -705,7 +695,33 @@ cdef class Tokenizer:
|
|||
tokens.append(("TOKEN", substring))
|
||||
substring = ''
|
||||
tokens.extend(reversed(suffixes))
|
||||
return tokens
|
||||
# Find matches for special cases handled by special matcher
|
||||
words, spaces = get_words_and_spaces([t[1] for t in tokens], text)
|
||||
t_words = []
|
||||
t_spaces = []
|
||||
for word, space in zip(words, spaces):
|
||||
if not word.isspace():
|
||||
t_words.append(word)
|
||||
t_spaces.append(space)
|
||||
doc = Doc(self.vocab, words=t_words, spaces=t_spaces)
|
||||
matches = self._special_matcher(doc)
|
||||
spans = [Span(doc, s, e, label=m_id) for m_id, s, e in matches]
|
||||
spans = util.filter_spans(spans)
|
||||
# Replace matched tokens with their exceptions
|
||||
i = 0
|
||||
final_tokens = []
|
||||
spans_by_start = {s.start: s for s in spans}
|
||||
while i < len(tokens):
|
||||
if i in spans_by_start:
|
||||
span = spans_by_start[i]
|
||||
exc = [d[ORTH] for d in special_cases[span.label_]]
|
||||
for j, orth in enumerate(exc):
|
||||
final_tokens.append((f"SPECIAL-{j + 1}", self.vocab.strings[orth]))
|
||||
i += len(span)
|
||||
else:
|
||||
final_tokens.append(tokens[i])
|
||||
i += 1
|
||||
return final_tokens
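The replacement loop at the end of `explain` can be illustrated without spaCy. The following is a minimal sketch, not the library's implementation: `replace_special_spans` is a hypothetical helper, tokens are plain `(label, text)` tuples, matched spans are assumed to be non-overlapping `(start, end, key)` tuples (standing in for the filtered `Span` objects), and `special_cases` maps a key directly to its replacement strings.

```python
from typing import Dict, List, Tuple

Token = Tuple[str, str]            # (label, text), as produced by explain()
MatchSpan = Tuple[int, int, str]   # (start, end, key); hypothetical stand-in for Span


def replace_special_spans(
    tokens: List[Token],
    spans: List[MatchSpan],
    special_cases: Dict[str, List[str]],
) -> List[Token]:
    """Replace each matched token range with its special-case pieces."""
    # Index spans by their start token so the scan below is a single pass.
    spans_by_start = {start: (end, key) for start, end, key in spans}
    final_tokens: List[Token] = []
    i = 0
    while i < len(tokens):
        if i in spans_by_start:
            end, key = spans_by_start[i]
            # Emit one SPECIAL-n entry per piece of the exception.
            for j, orth in enumerate(special_cases[key]):
                final_tokens.append((f"SPECIAL-{j + 1}", orth))
            i = end  # skip every token covered by the span
        else:
            final_tokens.append(tokens[i])
            i += 1
    return final_tokens


tokens = [("TOKEN", "a"), ("TOKEN", "b"), ("SUFFIX", "."), ("TOKEN", "c")]
result = replace_special_spans(tokens, [(0, 3, "ab.")], {"ab.": ["a", "b."]})
```

Indexing spans by start position relies on the spans not overlapping, which is what the `filter_spans` call guarantees in the hunk above.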
|
||||
|
||||
def score(self, examples, **kwargs):
|
||||
validate_examples(examples, "Tokenizer.score")
|
||||
|
@ -778,6 +794,15 @@ cdef class Tokenizer:
|
|||
"url_match": lambda b: data.setdefault("url_match", b),
|
||||
"exceptions": lambda b: data.setdefault("rules", b)
|
||||
}
|
||||
# reset all properties and flush all caches (through rules),
|
||||
# reset rules first so that _reload_special_cases is trivial/fast as
|
||||
# the other properties are reset
|
||||
self.rules = {}
|
||||
self.prefix_search = None
|
||||
self.suffix_search = None
|
||||
self.infix_finditer = None
|
||||
self.token_match = None
|
||||
self.url_match = None
|
||||
msg = util.from_bytes(bytes_data, deserializers, exclude)
|
||||
if "prefix_search" in data and isinstance(data["prefix_search"], str):
|
||||
self.prefix_search = re.compile(data["prefix_search"]).search
|
||||
|
@ -785,22 +810,12 @@ cdef class Tokenizer:
|
|||
self.suffix_search = re.compile(data["suffix_search"]).search
|
||||
if "infix_finditer" in data and isinstance(data["infix_finditer"], str):
|
||||
self.infix_finditer = re.compile(data["infix_finditer"]).finditer
|
||||
# for token_match and url_match, set to None to override the language
|
||||
# defaults if no regex is provided
|
||||
if "token_match" in data and isinstance(data["token_match"], str):
|
||||
self.token_match = re.compile(data["token_match"]).match
|
||||
else:
|
||||
self.token_match = None
|
||||
if "url_match" in data and isinstance(data["url_match"], str):
|
||||
self.url_match = re.compile(data["url_match"]).match
|
||||
else:
|
||||
self.url_match = None
|
||||
if "rules" in data and isinstance(data["rules"], dict):
|
||||
# make sure to hard reset the cache to remove data from the default exceptions
|
||||
self._rules = {}
|
||||
self._flush_cache()
|
||||
self._flush_specials()
|
||||
self._load_special_cases(data["rules"])
|
||||
self.rules = data["rules"]
|
||||
return self
|
||||
|
||||
|
||||
|
|
|
@ -281,7 +281,8 @@ def _merge(Doc doc, merges):
|
|||
for i in range(doc.length):
|
||||
doc.c[i].head -= i
|
||||
# Set the left/right children, left/right edges
|
||||
set_children_from_heads(doc.c, 0, doc.length)
|
||||
if doc.has_annotation("DEP"):
|
||||
set_children_from_heads(doc.c, 0, doc.length)
|
||||
# Make sure ent_iob remains consistent
|
||||
make_iob_consistent(doc.c, doc.length)
|
||||
# Return the merged Python object
|
||||
|
@ -294,7 +295,19 @@ def _resize_tensor(tensor, ranges):
|
|||
for i in range(start, end-1):
|
||||
delete.append(i)
|
||||
xp = get_array_module(tensor)
|
||||
return xp.delete(tensor, delete, axis=0)
|
||||
if xp is numpy:
|
||||
return xp.delete(tensor, delete, axis=0)
|
||||
else:
|
||||
offset = 0
|
||||
copy_start = 0
|
||||
resized_shape = (tensor.shape[0] - len(delete), tensor.shape[1])
|
||||
for start, end in ranges:
|
||||
if copy_start > 0:
|
||||
tensor[copy_start - offset:start - offset] = tensor[copy_start: start]
|
||||
offset += end - start - 1
|
||||
copy_start = end - 1
|
||||
tensor[copy_start - offset:resized_shape[0]] = tensor[copy_start:]
|
||||
return xp.asarray(tensor[:resized_shape[0]])
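The non-numpy branch compacts the tensor by copying blocks of rows forward instead of calling `delete`. A minimal sketch of why the two approaches agree, using plain Python lists in place of arrays; the function names are hypothetical, and each `(start, end)` range is assumed to keep only its last row (`end - 1`), as in the deletion loop above.

```python
from typing import List, Sequence, Tuple

Ranges = Sequence[Tuple[int, int]]


def resize_by_delete(rows: List[int], ranges: Ranges) -> List[int]:
    """Reference behaviour: drop rows start..end-2 of every range."""
    delete = {i for start, end in ranges for i in range(start, end - 1)}
    return [row for i, row in enumerate(rows) if i not in delete]


def resize_by_block_copy(rows: List[int], ranges: Ranges) -> List[int]:
    """Same result, obtained by shifting surviving blocks towards the front."""
    rows = list(rows)  # work on a copy
    new_len = len(rows) - sum(end - start - 1 for start, end in ranges)
    offset = 0       # number of rows removed so far
    copy_start = 0   # first row of the block that still has to be moved
    for start, end in ranges:
        if copy_start > 0:
            rows[copy_start - offset:start - offset] = rows[copy_start:start]
        offset += end - start - 1
        copy_start = end - 1
    # Move the tail that follows the last range, then truncate.
    rows[copy_start - offset:new_len] = rows[copy_start:]
    return rows[:new_len]


rows = list(range(8))
ranges = [(1, 3), (4, 7)]
by_delete = resize_by_delete(rows, ranges)
by_copy = resize_by_block_copy(rows, ranges)
```

Both slice assignments copy equally sized blocks, so the list never changes length until the final truncation.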
|
||||
|
||||
|
||||
def _split(Doc doc, int token_index, orths, heads, attrs):
|
||||
|
@ -331,7 +344,13 @@ def _split(Doc doc, int token_index, orths, heads, attrs):
|
|||
to_process_tensor = (doc.tensor is not None and doc.tensor.size != 0)
|
||||
if to_process_tensor:
|
||||
xp = get_array_module(doc.tensor)
|
||||
doc.tensor = xp.append(doc.tensor, xp.zeros((nb_subtokens,doc.tensor.shape[1]), dtype="float32"), axis=0)
|
||||
if xp is numpy:
|
||||
doc.tensor = xp.append(doc.tensor, xp.zeros((nb_subtokens,doc.tensor.shape[1]), dtype="float32"), axis=0)
|
||||
else:
|
||||
shape = (doc.tensor.shape[0] + nb_subtokens, doc.tensor.shape[1])
|
||||
resized_array = xp.zeros(shape, dtype="float32")
|
||||
resized_array[:doc.tensor.shape[0]] = doc.tensor[:doc.tensor.shape[0]]
|
||||
doc.tensor = resized_array
|
||||
for token_to_move in range(orig_length - 1, token_index, -1):
|
||||
doc.c[token_to_move + nb_subtokens - 1] = doc.c[token_to_move]
|
||||
if to_process_tensor:
|
||||
|
@ -348,7 +367,7 @@ def _split(Doc doc, int token_index, orths, heads, attrs):
|
|||
token.norm = 0 # reset norm
|
||||
if to_process_tensor:
|
||||
# setting the tensors of the split tokens to array of zeros
|
||||
doc.tensor[token_index + i] = xp.zeros((1,doc.tensor.shape[1]), dtype="float32")
|
||||
doc.tensor[token_index + i:token_index + i + 1] = xp.zeros((1,doc.tensor.shape[1]), dtype="float32")
|
||||
# Update the character offset of the subtokens
|
||||
if i != 0:
|
||||
token.idx = orig_token.idx + idx_offset
|
||||
|
@ -392,7 +411,8 @@ def _split(Doc doc, int token_index, orths, heads, attrs):
|
|||
for i in range(doc.length):
|
||||
doc.c[i].head -= i
|
||||
# set children from head
|
||||
set_children_from_heads(doc.c, 0, doc.length)
|
||||
if doc.has_annotation("DEP"):
|
||||
set_children_from_heads(doc.c, 0, doc.length)
|
||||
|
||||
|
||||
def _validate_extensions(extensions):
|
||||
|
|
|
@ -6,7 +6,7 @@ from libc.math cimport sqrt
|
|||
from libc.stdint cimport int32_t, uint64_t
|
||||
|
||||
import copy
|
||||
from collections import Counter
|
||||
from collections import Counter, defaultdict
|
||||
from enum import Enum
|
||||
import itertools
|
||||
import numpy
|
||||
|
@ -1120,13 +1120,14 @@ cdef class Doc:
|
|||
concat_words = []
|
||||
concat_spaces = []
|
||||
concat_user_data = {}
|
||||
concat_spans = defaultdict(list)
|
||||
char_offset = 0
|
||||
for doc in docs:
|
||||
concat_words.extend(t.text for t in doc)
|
||||
concat_spaces.extend(bool(t.whitespace_) for t in doc)
|
||||
|
||||
for key, value in doc.user_data.items():
|
||||
if isinstance(key, tuple) and len(key) == 4:
|
||||
if isinstance(key, tuple) and len(key) == 4 and key[0] == "._.":
|
||||
data_type, name, start, end = key
|
||||
if start is not None or end is not None:
|
||||
start += char_offset
|
||||
|
@ -1137,8 +1138,17 @@ cdef class Doc:
|
|||
warnings.warn(Warnings.W101.format(name=name))
|
||||
else:
|
||||
warnings.warn(Warnings.W102.format(key=key, value=value))
|
||||
for key in doc.spans:
|
||||
for span in doc.spans[key]:
|
||||
concat_spans[key].append((
|
||||
span.start_char + char_offset,
|
||||
span.end_char + char_offset,
|
||||
span.label,
|
||||
span.kb_id,
|
||||
span.text, # included as a check
|
||||
))
|
||||
char_offset += len(doc.text)
|
||||
if ensure_whitespace and not (len(doc) > 0 and doc[-1].is_space):
|
||||
if len(doc) > 0 and ensure_whitespace and not doc[-1].is_space:
|
||||
char_offset += 1
|
||||
|
||||
arrays = [doc.to_array(attrs) for doc in docs]
|
||||
|
@ -1160,6 +1170,22 @@ cdef class Doc:
|
|||
|
||||
concat_doc.from_array(attrs, concat_array)
|
||||
|
||||
for key in concat_spans:
|
||||
if key not in concat_doc.spans:
|
||||
concat_doc.spans[key] = []
|
||||
for span_tuple in concat_spans[key]:
|
||||
span = concat_doc.char_span(
|
||||
span_tuple[0],
|
||||
span_tuple[1],
|
||||
label=span_tuple[2],
|
||||
kb_id=span_tuple[3],
|
||||
)
|
||||
text = span_tuple[4]
|
||||
if span is not None and span.text == text:
|
||||
concat_doc.spans[key].append(span)
|
||||
else:
|
||||
raise ValueError(Errors.E873.format(key=key, text=text))
|
||||
|
||||
return concat_doc
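The span handling in this hunk boils down to shifting character offsets by the length of the text seen so far and then checking that the shifted offsets still select the same text. A minimal sketch under simplifying assumptions: documents are plain `(text, spans)` pairs rather than `Doc` objects, `concat_texts_with_spans` is a hypothetical name, and a separating space is added only between documents.

```python
from typing import List, Tuple

SpanTuple = Tuple[int, int, str]  # (start_char, end_char, label)


def concat_texts_with_spans(
    docs: List[Tuple[str, List[SpanTuple]]],
    ensure_whitespace: bool = True,
) -> Tuple[str, List[SpanTuple]]:
    parts: List[str] = []
    pending = []  # (shifted start, shifted end, label, expected text)
    char_offset = 0
    for i, (text, spans) in enumerate(docs):
        for start, end, label in spans:
            pending.append(
                (start + char_offset, end + char_offset, label, text[start:end])
            )
        parts.append(text)
        char_offset += len(text)
        is_last = i == len(docs) - 1
        if ensure_whitespace and not is_last and text and not text[-1].isspace():
            parts.append(" ")
            char_offset += 1
    full_text = "".join(parts)
    shifted: List[SpanTuple] = []
    for start, end, label, expected in pending:
        # Same sanity check as in the hunk: the span must still cover its text.
        if full_text[start:end] != expected:
            raise ValueError(f"span {label!r} no longer matches {expected!r}")
        shifted.append((start, end, label))
    return full_text, shifted


full_text, shifted_spans = concat_texts_with_spans(
    [("Hello world", [(6, 11, "A")]), ("Bye now", [(0, 3, "B")])]
)
```

Carrying the original span text along as a check is cheap and catches offset mistakes early, which appears to be the intent of the `span.text` element stored in the tuples above.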
|
||||
|
||||
def get_lca_matrix(self):
|
||||
|
|
|
@ -6,6 +6,7 @@ from libc.math cimport sqrt
|
|||
import numpy
|
||||
from thinc.api import get_array_module
|
||||
import warnings
|
||||
import copy
|
||||
|
||||
from .doc cimport token_by_start, token_by_end, get_token_attr, _get_lca_matrix
|
||||
from ..structs cimport TokenC, LexemeC
|
||||
|
@ -241,7 +242,19 @@ cdef class Span:
|
|||
if cat_start == self.start_char and cat_end == self.end_char:
|
||||
doc.cats[cat_label] = value
|
||||
if copy_user_data:
|
||||
doc.user_data = self.doc.user_data
|
||||
user_data = {}
|
||||
char_offset = self.start_char
|
||||
for key, value in self.doc.user_data.items():
|
||||
if isinstance(key, tuple) and len(key) == 4 and key[0] == "._.":
|
||||
data_type, name, start, end = key
|
||||
if start is not None or end is not None:
|
||||
start -= char_offset
|
||||
if end is not None:
|
||||
end -= char_offset
|
||||
user_data[(data_type, name, start, end)] = copy.copy(value)
|
||||
else:
|
||||
user_data[key] = copy.copy(value)
|
||||
doc.user_data = user_data
|
||||
return doc
|
||||
|
||||
def _fix_dep_copy(self, attrs, array):
|
||||
|
|
|
@ -8,3 +8,4 @@ from .iob_utils import biluo_tags_to_spans, tags_to_entities # noqa: F401
|
|||
from .gold_io import docs_to_json, read_json_file # noqa: F401
|
||||
from .batchers import minibatch_by_padded_size, minibatch_by_words # noqa: F401
|
||||
from .loggers import console_logger, wandb_logger # noqa: F401
|
||||
from .callbacks import create_copy_from_base_model # noqa: F401
|
||||
|
|
32

spacy/training/callbacks.py
Normal file
|
@ -0,0 +1,32 @@
|
|||
from typing import Optional
|
||||
from ..errors import Errors
|
||||
from ..language import Language
|
||||
from ..util import load_model, registry, logger
|
||||
|
||||
|
||||
@registry.callbacks("spacy.copy_from_base_model.v1")
|
||||
def create_copy_from_base_model(
|
||||
tokenizer: Optional[str] = None,
|
||||
vocab: Optional[str] = None,
|
||||
) -> Language:
|
||||
def copy_from_base_model(nlp):
|
||||
if tokenizer:
|
||||
logger.info(f"Copying tokenizer from: {tokenizer}")
|
||||
base_nlp = load_model(tokenizer)
|
||||
if nlp.config["nlp"]["tokenizer"] == base_nlp.config["nlp"]["tokenizer"]:
|
||||
nlp.tokenizer.from_bytes(base_nlp.tokenizer.to_bytes(exclude=["vocab"]))
|
||||
else:
|
||||
raise ValueError(
|
||||
Errors.E872.format(
|
||||
curr_config=nlp.config["nlp"]["tokenizer"],
|
||||
base_config=base_nlp.config["nlp"]["tokenizer"],
|
||||
)
|
||||
)
|
||||
if vocab:
|
||||
logger.info(f"Copying vocab from: {vocab}")
|
||||
# only reload if the vocab is from a different model
|
||||
if tokenizer != vocab:
|
||||
base_nlp = load_model(vocab)
|
||||
nlp.vocab.from_bytes(base_nlp.vocab.to_bytes())
|
||||
|
||||
return copy_from_base_model
|
|
@ -124,6 +124,9 @@ def segment_sents_and_docs(doc, n_sents, doc_delimiter, model=None, msg=None):
|
|||
nlp = load_model(model)
|
||||
if "parser" in nlp.pipe_names:
|
||||
msg.info(f"Segmenting sentences with parser from model '{model}'.")
|
||||
for name, proc in nlp.pipeline:
|
||||
if "parser" in getattr(proc, "listening_components", []):
|
||||
nlp.replace_listeners(name, "parser", ["model.tok2vec"])
|
||||
sentencizer = nlp.get_pipe("parser")
|
||||
if not sentencizer:
|
||||
msg.info(
|
||||
|
|
|
@ -2,6 +2,7 @@ import warnings
|
|||
from typing import Union, List, Iterable, Iterator, TYPE_CHECKING, Callable
|
||||
from typing import Optional
|
||||
from pathlib import Path
|
||||
import random
|
||||
import srsly
|
||||
|
||||
from .. import util
|
||||
|
@ -96,6 +97,7 @@ class Corpus:
|
|||
Defaults to 0, which indicates no limit.
|
||||
augmenter (Callable[[Example], Iterable[Example]]): Optional data augmentation
|
||||
function, to extrapolate additional examples from your annotations.
|
||||
shuffle (bool): Whether to shuffle the examples.
|
||||
|
||||
DOCS: https://spacy.io/api/corpus
|
||||
"""
|
||||
|
@ -108,12 +110,14 @@ class Corpus:
|
|||
gold_preproc: bool = False,
|
||||
max_length: int = 0,
|
||||
augmenter: Optional[Callable] = None,
|
||||
shuffle: bool = False,
|
||||
) -> None:
|
||||
self.path = util.ensure_path(path)
|
||||
self.gold_preproc = gold_preproc
|
||||
self.max_length = max_length
|
||||
self.limit = limit
|
||||
self.augmenter = augmenter if augmenter is not None else dont_augment
|
||||
self.shuffle = shuffle
|
||||
|
||||
def __call__(self, nlp: "Language") -> Iterator[Example]:
|
||||
"""Yield examples from the data.
|
||||
|
@ -124,6 +128,10 @@ class Corpus:
|
|||
DOCS: https://spacy.io/api/corpus#call
|
||||
"""
|
||||
ref_docs = self.read_docbin(nlp.vocab, walk_corpus(self.path, FILE_TYPE))
|
||||
if self.shuffle:
|
||||
ref_docs = list(ref_docs)
|
||||
random.shuffle(ref_docs)
|
||||
|
||||
if self.gold_preproc:
|
||||
examples = self.make_examples_gold_preproc(nlp, ref_docs)
|
||||
else:
|
||||
|
|
|
@ -13,7 +13,7 @@ from .iob_utils import biluo_tags_to_spans
|
|||
from ..errors import Errors, Warnings
|
||||
from ..pipeline._parser_internals import nonproj
|
||||
from ..tokens.token cimport MISSING_DEP
|
||||
from ..util import logger
|
||||
from ..util import logger, to_ternary_int
|
||||
|
||||
|
||||
cpdef Doc annotations_to_doc(vocab, tok_annot, doc_annot):
|
||||
|
@ -213,18 +213,19 @@ cdef class Example:
|
|||
else:
|
||||
return [None] * len(self.x)
|
||||
|
||||
def get_aligned_spans_x2y(self, x_spans):
|
||||
return self._get_aligned_spans(self.y, x_spans, self.alignment.x2y)
|
||||
def get_aligned_spans_x2y(self, x_spans, allow_overlap=False):
|
||||
return self._get_aligned_spans(self.y, x_spans, self.alignment.x2y, allow_overlap)
|
||||
|
||||
def get_aligned_spans_y2x(self, y_spans):
|
||||
return self._get_aligned_spans(self.x, y_spans, self.alignment.y2x)
|
||||
def get_aligned_spans_y2x(self, y_spans, allow_overlap=False):
|
||||
return self._get_aligned_spans(self.x, y_spans, self.alignment.y2x, allow_overlap)
|
||||
|
||||
def _get_aligned_spans(self, doc, spans, align):
|
||||
def _get_aligned_spans(self, doc, spans, align, allow_overlap):
|
||||
seen = set()
|
||||
output = []
|
||||
for span in spans:
|
||||
indices = align[span.start : span.end].data.ravel()
|
||||
indices = [idx for idx in indices if idx not in seen]
|
||||
if not allow_overlap:
|
||||
indices = [idx for idx in indices if idx not in seen]
|
||||
if len(indices) >= 1:
|
||||
aligned_span = Span(doc, indices[0], indices[-1] + 1, label=span.label)
|
||||
target_text = span.text.lower().strip().replace(" ", "")
|
||||
|
@ -237,7 +238,7 @@ cdef class Example:
|
|||
def get_aligned_ner(self):
|
||||
if not self.y.has_annotation("ENT_IOB"):
|
||||
return [None] * len(self.x) # should this be 'missing' instead of 'None' ?
|
||||
x_ents = self.get_aligned_spans_y2x(self.y.ents)
|
||||
x_ents = self.get_aligned_spans_y2x(self.y.ents, allow_overlap=False)
|
||||
# Default to 'None' for missing values
|
||||
x_tags = offsets_to_biluo_tags(
|
||||
self.x,
|
||||
|
@ -337,7 +338,7 @@ def _annot2array(vocab, tok_annot, doc_annot):
|
|||
values.append([vocab.strings.add(h) if h is not None else MISSING_DEP for h in value])
|
||||
elif key == "SENT_START":
|
||||
attrs.append(key)
|
||||
values.append(value)
|
||||
values.append([to_ternary_int(v) for v in value])
|
||||
elif key == "MORPH":
|
||||
attrs.append(key)
|
||||
values.append([vocab.morphology.add(v) for v in value])
|
||||
|
|
|
@ -121,7 +121,7 @@ def json_to_annotations(doc):
|
|||
if i == 0:
|
||||
sent_starts.append(1)
|
||||
else:
|
||||
sent_starts.append(0)
|
||||
sent_starts.append(-1)
|
||||
if "brackets" in sent:
|
||||
brackets.extend((b["first"] + sent_start_i,
|
||||
b["last"] + sent_start_i, b["label"])
|
||||
|
|
|
@ -8,6 +8,7 @@ import tarfile
|
|||
import gzip
|
||||
import zipfile
|
||||
import tqdm
|
||||
from itertools import islice
|
||||
|
||||
from .pretrain import get_tok2vec_ref
|
||||
from ..lookups import Lookups
|
||||
|
@ -68,7 +69,11 @@ def init_nlp(config: Config, *, use_gpu: int = -1) -> "Language":
|
|||
# Make sure that listeners are defined before initializing further
|
||||
nlp._link_components()
|
||||
with nlp.select_pipes(disable=[*frozen_components, *resume_components]):
|
||||
nlp.initialize(lambda: train_corpus(nlp), sgd=optimizer)
|
||||
if T["max_epochs"] == -1:
|
||||
logger.debug("Due to streamed train corpus, using only first 100 examples for initialization. If necessary, provide all labels in [initialize]. More info: https://spacy.io/api/cli#init_labels")
|
||||
nlp.initialize(lambda: islice(train_corpus(nlp), 100), sgd=optimizer)
|
||||
else:
|
||||
nlp.initialize(lambda: train_corpus(nlp), sgd=optimizer)
|
||||
logger.info(f"Initialized pipeline components: {nlp.pipe_names}")
|
||||
# Detect components with listeners that are not frozen consistently
|
||||
for name, proc in nlp.pipeline:
|
||||
|
@ -133,6 +138,10 @@ def load_vectors_into_model(
|
|||
)
|
||||
err = ConfigValidationError.from_error(e, title=title, desc=desc)
|
||||
raise err from None
|
||||
|
||||
if len(vectors_nlp.vocab.vectors.keys()) == 0:
|
||||
logger.warning(Warnings.W112.format(name=name))
|
||||
|
||||
nlp.vocab.vectors = vectors_nlp.vocab.vectors
|
||||
if add_strings:
|
||||
# I guess we should add the strings from the vectors_nlp model?
|
||||
|
|
|
@ -101,8 +101,13 @@ def console_logger(progress_bar: bool = False):
|
|||
return setup_printer
|
||||
|
||||
|
||||
@registry.loggers("spacy.WandbLogger.v1")
|
||||
def wandb_logger(project_name: str, remove_config_values: List[str] = []):
|
||||
@registry.loggers("spacy.WandbLogger.v2")
|
||||
def wandb_logger(
|
||||
project_name: str,
|
||||
remove_config_values: List[str] = [],
|
||||
model_log_interval: Optional[int] = None,
|
||||
log_dataset_dir: Optional[str] = None,
|
||||
):
|
||||
try:
|
||||
import wandb
|
||||
from wandb import init, log, join # test that these are available
|
||||
|
@ -119,9 +124,23 @@ def wandb_logger(project_name: str, remove_config_values: List[str] = []):
|
|||
for field in remove_config_values:
|
||||
del config_dot[field]
|
||||
config = util.dot_to_dict(config_dot)
|
||||
wandb.init(project=project_name, config=config, reinit=True)
|
||||
run = wandb.init(project=project_name, config=config, reinit=True)
|
||||
console_log_step, console_finalize = console(nlp, stdout, stderr)
|
||||
|
||||
def log_dir_artifact(
|
||||
path: str,
|
||||
name: str,
|
||||
type: str,
|
||||
metadata: Optional[Dict[str, Any]] = {},
|
||||
aliases: Optional[List[str]] = [],
|
||||
):
|
||||
dataset_artifact = wandb.Artifact(name, type=type, metadata=metadata)
|
||||
dataset_artifact.add_dir(path, name=name)
|
||||
wandb.log_artifact(dataset_artifact, aliases=aliases)
|
||||
|
||||
if log_dataset_dir:
|
||||
log_dir_artifact(path=log_dataset_dir, name="dataset", type="dataset")
|
||||
|
||||
def log_step(info: Optional[Dict[str, Any]]):
|
||||
console_log_step(info)
|
||||
if info is not None:
|
||||
|
@ -133,6 +152,21 @@ def wandb_logger(project_name: str, remove_config_values: List[str] = []):
|
|||
wandb.log({f"loss_{k}": v for k, v in losses.items()})
|
||||
if isinstance(other_scores, dict):
|
||||
wandb.log(other_scores)
|
||||
if model_log_interval and info.get("output_path"):
|
||||
if info["step"] % model_log_interval == 0 and info["step"] != 0:
|
||||
log_dir_artifact(
|
||||
path=info["output_path"],
|
||||
name="pipeline_" + run.id,
|
||||
type="checkpoint",
|
||||
metadata=info,
|
||||
aliases=[
|
||||
f"epoch {info['epoch']} step {info['step']}",
|
||||
"latest",
|
||||
"best"
|
||||
if info["score"] == max(info["checkpoints"])[0]
|
||||
else "",
|
||||
],
|
||||
)
|
||||
|
||||
def finalize() -> None:
|
||||
console_finalize()
|
||||
|
|
|
@ -78,7 +78,7 @@ def train(
|
|||
training_step_iterator = train_while_improving(
|
||||
nlp,
|
||||
optimizer,
|
||||
create_train_batches(train_corpus(nlp), batcher, T["max_epochs"]),
|
||||
create_train_batches(nlp, train_corpus, batcher, T["max_epochs"]),
|
||||
create_evaluation_callback(nlp, dev_corpus, score_weights),
|
||||
dropout=T["dropout"],
|
||||
accumulate_gradient=T["accumulate_gradient"],
|
||||
|
@ -96,12 +96,13 @@ def train(
|
|||
log_step, finalize_logger = train_logger(nlp, stdout, stderr)
|
||||
try:
|
||||
for batch, info, is_best_checkpoint in training_step_iterator:
|
||||
log_step(info if is_best_checkpoint is not None else None)
|
||||
if is_best_checkpoint is not None:
|
||||
with nlp.select_pipes(disable=frozen_components):
|
||||
update_meta(T, nlp, info)
|
||||
if output_path is not None:
|
||||
save_checkpoint(is_best_checkpoint)
|
||||
info["output_path"] = str(output_path / DIR_MODEL_LAST)
|
||||
log_step(info if is_best_checkpoint is not None else None)
|
||||
except Exception as e:
|
||||
if output_path is not None:
|
||||
stdout.write(
|
||||
|
@ -289,17 +290,22 @@ def create_evaluation_callback(
|
|||
|
||||
|
||||
def create_train_batches(
|
||||
iterator: Iterator[Example],
|
||||
nlp: "Language",
|
||||
corpus: Callable[["Language"], Iterable[Example]],
|
||||
batcher: Callable[[Iterable[Example]], Iterable[Example]],
|
||||
max_epochs: int,
|
||||
):
|
||||
epoch = 0
|
||||
examples = list(iterator)
|
||||
if not examples:
|
||||
# Raise error if no data
|
||||
raise ValueError(Errors.E986)
|
||||
if max_epochs >= 0:
|
||||
examples = list(corpus(nlp))
|
||||
if not examples:
|
||||
# Raise error if no data
|
||||
raise ValueError(Errors.E986)
|
||||
while max_epochs < 1 or epoch != max_epochs:
|
||||
random.shuffle(examples)
|
||||
if max_epochs >= 0:
|
||||
random.shuffle(examples)
|
||||
else:
|
||||
examples = corpus(nlp)
|
||||
for batch in batcher(examples):
|
||||
yield epoch, batch
|
||||
epoch += 1
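The difference between the in-memory and the streamed mode of `create_train_batches` can be seen in a stripped-down version that takes no `nlp` object. This is a sketch under stated assumptions: `create_batches` and `fixed_size_batcher` are hypothetical names, the corpus is any zero-argument callable returning an iterable, and a plain `ValueError` stands in for `Errors.E986`.

```python
import random
from itertools import islice
from typing import Callable, Iterable, Iterator, List, Tuple


def fixed_size_batcher(size: int) -> Callable[[Iterable[int]], Iterator[List[int]]]:
    """Hypothetical batcher: groups items into lists of at most `size`."""
    def batcher(items: Iterable[int]) -> Iterator[List[int]]:
        batch: List[int] = []
        for item in items:
            batch.append(item)
            if len(batch) == size:
                yield batch
                batch = []
        if batch:
            yield batch
    return batcher


def create_batches(
    corpus: Callable[[], Iterable[int]],
    batcher: Callable[[Iterable[int]], Iterator[List[int]]],
    max_epochs: int,
) -> Iterator[Tuple[int, List[int]]]:
    epoch = 0
    examples: Iterable[int] = []
    if max_epochs >= 0:
        # In-memory mode: load everything once so it can be shuffled.
        examples = list(corpus())
        if not examples:
            raise ValueError("no training examples found")
    while max_epochs < 1 or epoch != max_epochs:
        if max_epochs >= 0:
            random.shuffle(examples)
        else:
            # Streamed mode: re-read the corpus lazily on every pass.
            examples = corpus()
        for batch in batcher(examples):
            yield epoch, batch
        epoch += 1


corpus = lambda: iter(range(5))
in_memory = list(create_batches(corpus, fixed_size_batcher(2), max_epochs=2))
streamed = list(islice(create_batches(corpus, fixed_size_batcher(2), max_epochs=-1), 7))
```

With `max_epochs=-1` the generator never terminates on its own, so the consumer (here `islice`) decides when to stop; nothing is shuffled in that mode because the examples are never materialised.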
|
||||
|
|
|
@ -36,7 +36,7 @@ except ImportError:
|
|||
try: # Python 3.8
|
||||
import importlib.metadata as importlib_metadata
|
||||
except ImportError:
|
||||
import importlib_metadata
|
||||
from catalogue import _importlib_metadata as importlib_metadata
|
||||
|
||||
# These are functions that were previously (v2.x) available from spacy.util
|
||||
# and have since moved to Thinc. We're importing them here so people's code
|
||||
|
@ -1526,3 +1526,18 @@ def check_lexeme_norms(vocab, component_name):
|
|||
if len(lexeme_norms) == 0 and vocab.lang in LEXEME_NORM_LANGS:
|
||||
langs = ", ".join(LEXEME_NORM_LANGS)
|
||||
logger.debug(Warnings.W033.format(model=component_name, langs=langs))
|
||||
|
||||
|
||||
def to_ternary_int(val) -> int:
|
||||
"""Convert a value to the ternary 1/0/-1 int used for True/None/False in
|
||||
attributes such as SENT_START: True/1/1.0 is 1 (True), None/0/0.0 is 0
|
||||
(None), any other values are -1 (False).
|
||||
"""
|
||||
if isinstance(val, float):
|
||||
val = int(val)
|
||||
if val is True or val == 1:
|
||||
return 1
|
||||
elif val is None or (val is not False and val == 0):
|
||||
return 0
|
||||
else:
|
||||
return -1
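The mapping described in the docstring can be written as a standalone function that uses identity checks only for `True`, `False` and `None`, and equality for numeric values. This is a sketch of one possible formulation, not necessarily the code spaCy ships; `ternary_from_value` is a hypothetical name.

```python
def ternary_from_value(val) -> int:
    """Map True/1/1.0 to 1, None/0/0.0 to 0 and anything else to -1."""
    # Singletons first: False must not fall through to the `== 0` check,
    # because `False == 0` is true in Python.
    if val is True:
        return 1
    if val is None:
        return 0
    if val is False:
        return -1
    # Equality (not identity) covers ints and floats alike.
    if val == 1:
        return 1
    if val == 0:
        return 0
    return -1


converted = [ternary_from_value(v) for v in (True, 1, 1.0, None, 0, 0.0, False, -1, 2)]
```

Checking the three singletons before the numeric comparisons keeps `False` distinct from `0`, which matters because the two are meant to map to different ternary values.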
|
||||
|
|
|
@ -55,7 +55,7 @@ cdef class Vectors:
|
|||
"""Create a new vector store.
|
||||
|
||||
shape (tuple): Size of the table, as (# entries, # columns)
|
||||
data (numpy.ndarray): The vector data.
|
||||
data (numpy.ndarray or cupy.ndarray): The vector data.
|
||||
keys (iterable): A sequence of keys, aligned with the data.
|
||||
name (str): A name to identify the vectors table.
|
||||
|
||||
|
@ -65,7 +65,8 @@ cdef class Vectors:
|
|||
if data is None:
|
||||
if shape is None:
|
||||
shape = (0,0)
|
||||
data = numpy.zeros(shape, dtype="f")
|
||||
ops = get_current_ops()
|
||||
data = ops.xp.zeros(shape, dtype="f")
|
||||
self.data = data
|
||||
self.key2row = {}
|
||||
if self.data is not None:
|
||||
|
@ -300,6 +301,8 @@ cdef class Vectors:
|
|||
else:
|
||||
raise ValueError(Errors.E197.format(row=row, key=key))
|
||||
if vector is not None:
|
||||
xp = get_array_module(self.data)
|
||||
vector = xp.asarray(vector)
|
||||
self.data[row] = vector
|
||||
if self._unset.count(row):
|
||||
self._unset.erase(self._unset.find(row))
|
||||
|
@ -321,10 +324,11 @@ cdef class Vectors:
|
|||
RETURNS (tuple): The most similar entries as a `(keys, best_rows, scores)`
|
||||
tuple.
|
||||
"""
|
||||
xp = get_array_module(self.data)
|
||||
filled = sorted(list({row for row in self.key2row.values()}))
|
||||
if len(filled) < n:
|
||||
raise ValueError(Errors.E198.format(n=n, n_rows=len(filled)))
|
||||
xp = get_array_module(self.data)
|
||||
filled = xp.asarray(filled)
|
||||
|
||||
norms = xp.linalg.norm(self.data[filled], axis=1, keepdims=True)
|
||||
norms[norms == 0] = 1
|
||||
|
@ -357,8 +361,10 @@ cdef class Vectors:
|
|||
# Account for numerical error we want to return in range -1, 1
|
||||
scores = xp.clip(scores, a_min=-1, a_max=1, out=scores)
|
||||
row2key = {row: key for key, row in self.key2row.items()}
|
||||
|
||||
numpy_rows = get_current_ops().to_numpy(best_rows)
|
||||
keys = xp.asarray(
|
||||
[[row2key[row] for row in best_rows[i] if row in row2key]
|
||||
[[row2key[row] for row in numpy_rows[i] if row in row2key]
|
||||
for i in range(len(queries)) ], dtype="uint64")
|
||||
return (keys, best_rows, scores)
|
||||
|
||||
|
@ -459,7 +465,8 @@ cdef class Vectors:
|
|||
if hasattr(self.data, "from_bytes"):
|
||||
self.data.from_bytes()
|
||||
else:
|
||||
self.data = srsly.msgpack_loads(b)
|
||||
xp = get_array_module(self.data)
|
||||
self.data = xp.asarray(srsly.msgpack_loads(b))
|
||||
|
||||
deserializers = {
|
||||
"key2row": lambda b: self.key2row.update(srsly.msgpack_loads(b)),
|
||||
|
|
|
@ -2,7 +2,7 @@
|
|||
from libc.string cimport memcpy
|
||||
|
||||
import srsly
|
||||
from thinc.api import get_array_module
|
||||
from thinc.api import get_array_module, get_current_ops
|
||||
import functools
|
||||
|
||||
from .lexeme cimport EMPTY_LEXEME, OOV_RANK
|
||||
|
@ -293,7 +293,7 @@ cdef class Vocab:
|
|||
among those remaining.
|
||||
|
||||
For example, suppose the original table had vectors for the words:
|
||||
['sat', 'cat', 'feline', 'reclined']. If we prune the vector table to,
|
||||
['sat', 'cat', 'feline', 'reclined']. If we prune the vector table to
|
||||
two rows, we would discard the vectors for 'feline' and 'reclined'.
|
||||
These words would then be remapped to the closest remaining vector
|
||||
-- so "feline" would have the same vector as "cat", and "reclined"
|
||||
|
@ -314,6 +314,7 @@ cdef class Vocab:
|
|||
|
||||
DOCS: https://spacy.io/api/vocab#prune_vectors
|
||||
"""
|
||||
ops = get_current_ops()
|
||||
xp = get_array_module(self.vectors.data)
|
||||
# Make sure all vectors are in the vocab
|
||||
for orth in self.vectors:
|
||||
|
@ -329,8 +330,9 @@ cdef class Vocab:
|
|||
toss = xp.ascontiguousarray(self.vectors.data[indices[nr_row:]])
|
||||
self.vectors = Vectors(data=keep, keys=keys[:nr_row], name=self.vectors.name)
|
||||
syn_keys, syn_rows, scores = self.vectors.most_similar(toss, batch_size=batch_size)
|
||||
syn_keys = ops.to_numpy(syn_keys)
|
||||
remap = {}
|
||||
for i, key in enumerate(keys[nr_row:]):
|
||||
for i, key in enumerate(ops.to_numpy(keys[nr_row:])):
|
||||
self.vectors.add(key, row=syn_rows[i][0])
|
||||
word = self.strings[key]
|
||||
synonym = self.strings[syn_keys[i][0]]
|
||||
|
@ -351,7 +353,7 @@ cdef class Vocab:
|
|||
Defaults to the length of `orth`.
|
||||
maxn (int): Maximum n-gram length used for Fasttext's ngram computation.
|
||||
Defaults to the length of `orth`.
|
||||
RETURNS (numpy.ndarray): A word vector. Size
|
||||
RETURNS (numpy.ndarray or cupy.ndarray): A word vector. Size
|
||||
and shape determined by the `vocab.vectors` instance. Usually, a
|
||||
numpy ndarray of shape (300,) and dtype float32.
|
||||
|
||||
|
@ -400,7 +402,7 @@ cdef class Vocab:
|
|||
by string or int ID.
|
||||
|
||||
orth (int / unicode): The word.
|
||||
vector (numpy.ndarray[ndim=1, dtype='float32']): The vector to set.
|
||||
vector (numpy.ndarray or cupy.ndarray[ndim=1, dtype='float32']): The vector to set.
|
||||
|
||||
DOCS: https://spacy.io/api/vocab#set_vector
|
||||
"""
|
||||
|
|
|
@ -35,7 +35,7 @@ usage documentation on
|
|||
> @architectures = "spacy.Tok2Vec.v2"
|
||||
>
|
||||
> [model.embed]
|
||||
> @architectures = "spacy.CharacterEmbed.v1"
|
||||
> @architectures = "spacy.CharacterEmbed.v2"
|
||||
> # ...
|
||||
>
|
||||
> [model.encode]
|
||||
|
@ -54,13 +54,13 @@ blog post for background.
|
|||
| `encode` | Encode context into the embeddings, using an architecture such as a CNN, BiLSTM or transformer. For example, [MaxoutWindowEncoder](/api/architectures#MaxoutWindowEncoder). ~~Model[List[Floats2d], List[Floats2d]]~~ |
|
||||
| **CREATES** | The model using the architecture. ~~Model[List[Doc], List[Floats2d]]~~ |
|
||||
|
||||
### spacy.HashEmbedCNN.v1 {#HashEmbedCNN}
|
||||
### spacy.HashEmbedCNN.v2 {#HashEmbedCNN}
|
||||
|
||||
> #### Example Config
|
||||
>
|
||||
> ```ini
|
||||
> [model]
|
||||
> @architectures = "spacy.HashEmbedCNN.v1"
|
||||
> @architectures = "spacy.HashEmbedCNN.v2"
|
||||
> pretrained_vectors = null
|
||||
> width = 96
|
||||
> depth = 4
|
||||
|
@ -96,7 +96,7 @@ consisting of a CNN and a layer-normalized maxout activation function.
|
|||
> factory = "tok2vec"
|
||||
>
|
||||
> [components.tok2vec.model]
|
||||
> @architectures = "spacy.HashEmbedCNN.v1"
|
||||
> @architectures = "spacy.HashEmbedCNN.v2"
|
||||
> width = 342
|
||||
>
|
||||
> [components.tagger]
|
||||
|
@ -129,13 +129,13 @@ argument that connects to the shared `tok2vec` component in the pipeline.
|
|||
| `upstream` | A string to identify the "upstream" `Tok2Vec` component to communicate with. By default, the upstream name is the wildcard string `"*"`, but you could also specify the name of the `Tok2Vec` component. You'll almost never have multiple upstream `Tok2Vec` components, so the wildcard string will almost always be fine. ~~str~~ |
|
||||
| **CREATES** | The model using the architecture. ~~Model[List[Doc], List[Floats2d]]~~ |
|
||||
|
||||
### spacy.MultiHashEmbed.v1 {#MultiHashEmbed}
|
||||
### spacy.MultiHashEmbed.v2 {#MultiHashEmbed}
|
||||
|
||||
> #### Example config
|
||||
>
|
||||
> ```ini
|
||||
> [model]
|
||||
> @architectures = "spacy.MultiHashEmbed.v1"
|
||||
> @architectures = "spacy.MultiHashEmbed.v2"
|
||||
> width = 64
|
||||
> attrs = ["NORM", "PREFIX", "SUFFIX", "SHAPE"]
|
||||
> rows = [2000, 1000, 1000, 1000]
|
||||
|
@ -160,13 +160,13 @@ not updated).
|
|||
| `include_static_vectors` | Whether to also use static word vectors. Requires a vectors table to be loaded in the [`Doc`](/api/doc) objects' vocab. ~~bool~~ |
|
||||
| **CREATES** | The model using the architecture. ~~Model[List[Doc], List[Floats2d]]~~ |
|
||||
|
||||
### spacy.CharacterEmbed.v1 {#CharacterEmbed}
|
||||
### spacy.CharacterEmbed.v2 {#CharacterEmbed}
|
||||
|
||||
> #### Example config
|
||||
>
|
||||
> ```ini
|
||||
> [model]
|
||||
> @architectures = "spacy.CharacterEmbed.v1"
|
||||
> @architectures = "spacy.CharacterEmbed.v2"
|
||||
> width = 128
|
||||
> rows = 7000
|
||||
> nM = 64
|
||||
|
@ -266,13 +266,13 @@ Encode context using bidirectional LSTM layers. Requires
|
|||
| `dropout` | Creates a Dropout layer on the outputs of each LSTM layer except the last layer. Set to 0.0 to disable this functionality. ~~float~~ |
|
||||
| **CREATES** | The model using the architecture. ~~Model[List[Floats2d], List[Floats2d]]~~ |
|
||||
|
||||
### spacy.StaticVectors.v1 {#StaticVectors}
|
||||
### spacy.StaticVectors.v2 {#StaticVectors}
|
||||
|
||||
> #### Example config
|
||||
>
|
||||
> ```ini
|
||||
> [model]
|
||||
> @architectures = "spacy.StaticVectors.v1"
|
||||
> @architectures = "spacy.StaticVectors.v2"
|
||||
> nO = null
|
||||
> nM = null
|
||||
> dropout = 0.2
|
||||
|
@ -283,8 +283,9 @@ Encode context using bidirectional LSTM layers. Requires
|
|||
> ```
|
||||
|
||||
Embed [`Doc`](/api/doc) objects with their vocab's vectors table, applying a
|
||||
learned linear projection to control the dimensionality. See the documentation
|
||||
on [static vectors](/usage/embeddings-transformers#static-vectors) for details.
|
||||
learned linear projection to control the dimensionality. Unknown tokens are
|
||||
mapped to a zero vector. See the documentation on [static
|
||||
vectors](/usage/embeddings-transformers#static-vectors) for details.
|
||||
|
||||
| Name | Description |
|
||||
| ----------- | ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
|
||||
|
@ -513,7 +514,7 @@ for a Tok2Vec layer.
|
|||
> use_upper = true
|
||||
>
|
||||
> [model.tok2vec]
|
||||
> @architectures = "spacy.HashEmbedCNN.v1"
|
||||
> @architectures = "spacy.HashEmbedCNN.v2"
|
||||
> pretrained_vectors = null
|
||||
> width = 96
|
||||
> depth = 4
|
||||
|
@ -619,7 +620,7 @@ single-label use-cases where `exclusive_classes = true`, while the
|
|||
> @architectures = "spacy.Tok2Vec.v2"
|
||||
>
|
||||
> [model.tok2vec.embed]
|
||||
> @architectures = "spacy.MultiHashEmbed.v1"
|
||||
> @architectures = "spacy.MultiHashEmbed.v2"
|
||||
> width = 64
|
||||
> rows = [2000, 2000, 1000, 1000, 1000, 1000]
|
||||
> attrs = ["ORTH", "LOWER", "PREFIX", "SUFFIX", "SHAPE", "ID"]
|
||||
|
@ -676,7 +677,7 @@ taking it as argument:
|
|||
> nO = null
|
||||
>
|
||||
> [model.tok2vec]
|
||||
> @architectures = "spacy.HashEmbedCNN.v1"
|
||||
> @architectures = "spacy.HashEmbedCNN.v2"
|
||||
> pretrained_vectors = null
|
||||
> width = 96
|
||||
> depth = 4
|
||||
|
@ -744,7 +745,7 @@ into the "real world". This requires 3 main components:
|
|||
> nO = null
|
||||
>
|
||||
> [model.tok2vec]
|
||||
> @architectures = "spacy.HashEmbedCNN.v1"
|
||||
> @architectures = "spacy.HashEmbedCNN.v2"
|
||||
> pretrained_vectors = null
|
||||
> width = 96
|
||||
> depth = 2
|
||||
|
|
|
@ -12,6 +12,7 @@ menu:
|
|||
- ['train', 'train']
|
||||
- ['pretrain', 'pretrain']
|
||||
- ['evaluate', 'evaluate']
|
||||
- ['assemble', 'assemble']
|
||||
- ['package', 'package']
|
||||
- ['project', 'project']
|
||||
- ['ray', 'ray']
|
||||
|
@ -892,6 +893,34 @@ $ python -m spacy evaluate [model] [data_path] [--output] [--code] [--gold-prepr
|
|||
| `--help`, `-h` | Show help message and available arguments. ~~bool (flag)~~ |
|
||||
| **CREATES** | Training results and optional metrics and visualizations. |
|
||||
|
||||
## assemble {#assemble tag="command"}
|
||||
|
||||
Assemble a pipeline from a config file without additional training. Expects a
|
||||
[config file](/api/data-formats#config) with all settings and hyperparameters.
|
||||
The `--code` argument can be used to import a Python file that lets you register
|
||||
[custom functions](/usage/training#custom-functions) and refer to them in your
|
||||
config.
|
||||
|
||||
> #### Example
|
||||
>
|
||||
> ```cli
|
||||
> $ python -m spacy assemble config.cfg ./output
|
||||
> ```
|
||||
|
||||
```cli
|
||||
$ python -m spacy assemble [config_path] [output_dir] [--code] [--verbose] [overrides]
|
||||
```
|
||||
|
||||
| Name | Description |
|
||||
| ----------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
|
||||
| `config_path` | Path to the [config](/api/data-formats#config) file containing all settings and hyperparameters. If `-`, the data will be [read from stdin](/usage/training#config-stdin). ~~Union[Path, str] \(positional)~~ |
|
||||
| `output_dir` | Directory to store the final pipeline in. Will be created if it doesn't exist. ~~Optional[Path] \(option)~~ |
|
||||
| `--code`, `-c` | Path to Python file with additional code to be imported. Allows [registering custom functions](/usage/training#custom-functions). ~~Optional[Path] \(option)~~ |
|
||||
| `--verbose`, `-V` | Show more detailed messages during processing. ~~bool (flag)~~ |
|
||||
| `--help`, `-h` | Show help message and available arguments. ~~bool (flag)~~ |
|
||||
| overrides | Config parameters to override. Should be options starting with `--` that correspond to the config section and value to override, e.g. `--paths.data ./data`. ~~Any (option/flag)~~ |
|
||||
| **CREATES** | The final assembled pipeline. |
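
For scripted use, the command line documented above can be put together programmatically. The following is a minimal sketch that only builds the argument list from the options in the table. `build_assemble_command` is a hypothetical helper, not part of spaCy, and it assumes overrides are passed as a dictionary of dotted config paths to values.

```python
import sys


def build_assemble_command(config_path, output_dir, code=None,
                           verbose=False, overrides=None):
    """Return an argv list for `python -m spacy assemble` (hypothetical helper)."""
    cmd = [sys.executable, "-m", "spacy", "assemble",
           str(config_path), str(output_dir)]
    if code is not None:
        cmd += ["--code", str(code)]
    if verbose:
        cmd.append("--verbose")
    # Overrides are options starting with `--` that name a config section
    # and value, e.g. `--paths.data ./data`.
    for key, value in (overrides or {}).items():
        cmd += [f"--{key}", str(value)]
    return cmd


cmd = build_assemble_command(
    "config.cfg", "./output", overrides={"paths.data": "./data"}
)
```

The resulting list could then be handed to `subprocess.run(cmd, check=True)` in an environment where spaCy is installed.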
|
||||
|
||||
## package {#package tag="command"}
|
||||
|
||||
Generate an installable [Python package](/usage/training#models-generating) from
|
||||
|
|
|
@ -29,8 +29,8 @@ recommended settings for your use case, check out the
|
|||
>
|
||||
> The `@` syntax lets you refer to function names registered in the
|
||||
> [function registry](/api/top-level#registry). For example,
|
||||
> `@architectures = "spacy.HashEmbedCNN.v1"` refers to a registered function of
|
||||
> the name [spacy.HashEmbedCNN.v1](/api/architectures#HashEmbedCNN) and all
|
||||
> `@architectures = "spacy.HashEmbedCNN.v2"` refers to a registered function of
|
||||
> the name [spacy.HashEmbedCNN.v2](/api/architectures#HashEmbedCNN) and all
|
||||
> other values defined in its block will be passed into that function as
|
||||
> arguments. Those arguments depend on the registered function. See the usage
|
||||
> guide on [registered functions](/usage/training#config-functions) for details.
|
||||
|
@ -193,10 +193,10 @@ process that are used when you run [`spacy train`](/api/cli#train).
|
|||
| `frozen_components` | Pipeline component names that are "frozen" and shouldn't be initialized or updated during training. See [here](/usage/training#config-components) for details. Defaults to `[]`. ~~List[str]~~ |
|
||||
| `gpu_allocator` | Library for cupy to route GPU memory allocation to. Can be `"pytorch"` or `"tensorflow"`. Defaults to variable `${system.gpu_allocator}`. ~~str~~ |
|
||||
| `logger` | Callable that takes the `nlp` and stdout and stderr `IO` objects, sets up the logger, and returns two new callables to log a training step and to finalize the logger. Defaults to [`ConsoleLogger`](/api/top-level#ConsoleLogger). ~~Callable[[Language, IO, IO], [Tuple[Callable[[Dict[str, Any]], None], Callable[[], None]]]]~~ |
|
||||
| `max_epochs` | Maximum number of epochs to train for. Defaults to `0`. ~~int~~ |
|
||||
| `max_steps` | Maximum number of update steps to train for. Defaults to `20000`. ~~int~~ |
|
||||
| `max_epochs`        | Maximum number of epochs to train for. `0` means an unlimited number of epochs. `-1` means that the train corpus should be streamed rather than loaded into memory, with no shuffling within the training loop. Defaults to `0`. ~~int~~                                                                       |
|
||||
| `max_steps` | Maximum number of update steps to train for. `0` means an unlimited number of steps. Defaults to `20000`. ~~int~~ |
|
||||
| `optimizer` | The optimizer. The learning rate schedule and other settings can be configured as part of the optimizer. Defaults to [`Adam`](https://thinc.ai/docs/api-optimizers#adam). ~~Optimizer~~ |
|
||||
| `patience` | How many steps to continue without improvement in evaluation score. Defaults to `1600`. ~~int~~ |
|
||||
| `patience` | How many steps to continue without improvement in evaluation score. `0` disables early stopping. Defaults to `1600`. ~~int~~ |
|
||||
| `score_weights` | Score names shown in metrics mapped to their weight towards the final weighted score. See [here](/usage/training#metrics) for details. Defaults to `{}`. ~~Dict[str, float]~~ |
|
||||
| `seed` | The random seed. Defaults to variable `${system.seed}`. ~~int~~ |
|
||||
| `train_corpus` | Dot notation of the config location defining the train corpus. Defaults to `corpora.train`. ~~str~~ |
|
||||
|
@ -390,7 +390,7 @@ file to keep track of your settings and hyperparameters and your own
|
|||
> "tags": List[str],
|
||||
> "pos": List[str],
|
||||
> "morphs": List[str],
|
||||
> "sent_starts": List[bool],
|
||||
> "sent_starts": List[Optional[bool]],
|
||||
> "deps": List[string],
|
||||
> "heads": List[int],
|
||||
> "entities": List[str],
|
||||
|
|
|
@ -44,7 +44,7 @@ Construct a `Doc` object. The most common way to get a `Doc` object is via the
|
|||
| `lemmas` <Tag variant="new">3</Tag> | A list of strings, of the same length as `words`, to assign as `token.lemma` for each word. Defaults to `None`. ~~Optional[List[str]]~~ |
|
||||
| `heads` <Tag variant="new">3</Tag> | A list of values, of the same length as `words`, to assign as the head for each word. Head indices are the absolute position of the head in the `Doc`. Defaults to `None`. ~~Optional[List[int]]~~ |
|
||||
| `deps` <Tag variant="new">3</Tag> | A list of strings, of the same length as `words`, to assign as `token.dep` for each word. Defaults to `None`. ~~Optional[List[str]]~~ |
|
||||
| `sent_starts` <Tag variant="new">3</Tag> | A list of values, of the same length as `words`, to assign as `token.is_sent_start`. Will be overridden by heads if `heads` is provided. Defaults to `None`. ~~Optional[List[Union[bool, None]]~~ |
|
||||
| `sent_starts` <Tag variant="new">3</Tag> | A list of values, of the same length as `words`, to assign as `token.is_sent_start`. Will be overridden by heads if `heads` is provided. Defaults to `None`. ~~Optional[List[Optional[bool]]]~~ |
|
||||
| `ents` <Tag variant="new">3</Tag>        | A list of strings, of the same length as `words`, to assign the token-based IOB tag. Defaults to `None`. ~~Optional[List[str]]~~                                                                        |
|
||||
|
||||
## Doc.\_\_getitem\_\_ {#getitem tag="method"}
|
||||
|
|
|
@ -33,8 +33,8 @@ both documents.
|
|||
|
||||
| Name | Description |
|
||||
| -------------- | ------------------------------------------------------------------------------------------------------------------------ |
|
||||
| `predicted` | The document containing (partial) predictions. Cannot be `None`. ~~Doc~~ |
|
||||
| `reference` | The document containing gold-standard annotations. Cannot be `None`. ~~Doc~~ |
|
||||
| `predicted` | The document containing (partial) predictions. Cannot be `None`. ~~Doc~~ |
|
||||
| `reference` | The document containing gold-standard annotations. Cannot be `None`. ~~Doc~~ |
|
||||
| _keyword-only_ | |
|
||||
| `alignment` | An object holding the alignment between the tokens of the `predicted` and `reference` documents. ~~Optional[Alignment]~~ |
|
||||
|
||||
|
@ -56,11 +56,11 @@ see the [training format documentation](/api/data-formats#dict-input).
|
|||
> example = Example.from_dict(predicted, {"words": token_ref, "tags": tags_ref})
|
||||
> ```
|
||||
|
||||
| Name | Description |
|
||||
| -------------- | ------------------------------------------------------------------------- |
|
||||
| `predicted` | The document containing (partial) predictions. Cannot be `None`. ~~Doc~~ |
|
||||
| `example_dict` | `Dict[str, obj]` | The gold-standard annotations as a dictionary. Cannot be `None`. ~~Dict[str, Any]~~ |
|
||||
| **RETURNS** | The newly constructed object. ~~Example~~ |
|
||||
| Name | Description |
|
||||
| -------------- | ----------------------------------------------------------------------------------- |
|
||||
| `predicted` | The document containing (partial) predictions. Cannot be `None`. ~~Doc~~ |
|
||||
| `example_dict` | The gold-standard annotations as a dictionary. Cannot be `None`. ~~Dict[str, Any]~~ |
|
||||
| **RETURNS** | The newly constructed object. ~~Example~~ |
|
||||
|
||||
## Example.text {#text tag="property"}
|
||||
|
||||
|
@ -211,10 +211,11 @@ align to the tokenization in [`Example.predicted`](/api/example#predicted).
|
|||
> assert [(ent.start, ent.end) for ent in ents_y2x] == [(0, 1)]
|
||||
> ```
|
||||
|
||||
| Name | Description |
|
||||
| ----------- | ----------------------------------------------------------------------------- |
|
||||
| `y_spans` | `Span` objects aligned to the tokenization of `reference`. ~~Iterable[Span]~~ |
|
||||
| **RETURNS** | `Span` objects aligned to the tokenization of `predicted`. ~~List[Span]~~ |
|
||||
| Name | Description |
|
||||
| --------------- | -------------------------------------------------------------------------------------------- |
|
||||
| `y_spans` | `Span` objects aligned to the tokenization of `reference`. ~~Iterable[Span]~~ |
|
||||
| `allow_overlap` | Whether the resulting `Span` objects may overlap or not. Set to `False` by default. ~~bool~~ |
|
||||
| **RETURNS** | `Span` objects aligned to the tokenization of `predicted`. ~~List[Span]~~ |
|
||||
|
||||
## Example.get_aligned_spans_x2y {#get_aligned_spans_x2y tag="method"}
|
||||
|
||||
|
@ -238,10 +239,11 @@ against the original gold-standard annotation.
|
|||
> assert [(ent.start, ent.end) for ent in ents_x2y] == [(0, 2)]
|
||||
> ```
|
||||
|
||||
| Name | Description |
|
||||
| ----------- | ----------------------------------------------------------------------------- |
|
||||
| `x_spans` | `Span` objects aligned to the tokenization of `predicted`. ~~Iterable[Span]~~ |
|
||||
| **RETURNS** | `Span` objects aligned to the tokenization of `reference`. ~~List[Span]~~ |
|
||||
| Name | Description |
|
||||
| --------------- | -------------------------------------------------------------------------------------------- |
|
||||
| `x_spans` | `Span` objects aligned to the tokenization of `predicted`. ~~Iterable[Span]~~ |
|
||||
| `allow_overlap` | Whether the resulting `Span` objects may overlap or not. Set to `False` by default. ~~bool~~ |
|
||||
| **RETURNS** | `Span` objects aligned to the tokenization of `reference`. ~~List[Span]~~ |
|
||||
|
||||
## Example.to_dict {#to_dict tag="method"}
|
||||
|
||||
|
|
|
@ -5,11 +5,12 @@ source: spacy/legacy
|
|||
---
|
||||
|
||||
The [`spacy-legacy`](https://github.com/explosion/spacy-legacy) package includes
|
||||
outdated registered functions and architectures. It is installed automatically as
|
||||
a dependency of spaCy, and provides backwards compatibility for archived functions
|
||||
that may still be used in projects.
|
||||
outdated registered functions and architectures. It is installed automatically
|
||||
as a dependency of spaCy, and provides backwards compatibility for archived
|
||||
functions that may still be used in projects.
|
||||
|
||||
You can find the detailed documentation of each such legacy function on this page.
|
||||
You can find the detailed documentation of each such legacy function on this
|
||||
page.
|
||||
|
||||
## Architectures {#architectures}
|
||||
|
||||
|
@ -44,15 +45,14 @@ blog post for background.
|
|||
| Name | Description |
|
||||
| ----------- | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
|
||||
| `embed` | Embed tokens into context-independent word vector representations. For example, [CharacterEmbed](/api/architectures#CharacterEmbed) or [MultiHashEmbed](/api/architectures#MultiHashEmbed). ~~Model[List[Doc], List[Floats2d]]~~ |
|
||||
| `encode` | Encode context into the embeddings, using an architecture such as a CNN, BiLSTM or transformer. For example, [MaxoutWindowEncoder.v1](/api/legacy#MaxoutWindowEncoder_v1). ~~Model[Floats2d, Floats2d]~~ |
|
||||
| `encode` | Encode context into the embeddings, using an architecture such as a CNN, BiLSTM or transformer. For example, [MaxoutWindowEncoder.v1](/api/legacy#MaxoutWindowEncoder_v1). ~~Model[Floats2d, Floats2d]~~ |
|
||||
| **CREATES** | The model using the architecture. ~~Model[List[Doc], List[Floats2d]]~~ |
|
||||
|
||||
### spacy.MaxoutWindowEncoder.v1 {#MaxoutWindowEncoder_v1}
|
||||
|
||||
The `spacy.MaxoutWindowEncoder.v1` architecture was producing a model of type
|
||||
`Model[Floats2D, Floats2D]`. Since `spacy.MaxoutWindowEncoder.v2`, this has been changed to output
|
||||
type `Model[List[Floats2d], List[Floats2d]]`.
|
||||
|
||||
`Model[Floats2d, Floats2d]`. Since `spacy.MaxoutWindowEncoder.v2`, this has been
|
||||
changed to output type `Model[List[Floats2d], List[Floats2d]]`.
|
||||
|
||||
> #### Example config
|
||||
>
|
||||
|
@ -79,8 +79,8 @@ and residual connections.
|
|||
### spacy.MishWindowEncoder.v1 {#MishWindowEncoder_v1}
|
||||
|
||||
The `spacy.MishWindowEncoder.v1` architecture was producing a model of type
|
||||
`Model[Floats2D, Floats2D]`. Since `spacy.MishWindowEncoder.v2`, this has been changed to output
|
||||
type `Model[List[Floats2d], List[Floats2d]]`.
|
||||
`Model[Floats2d, Floats2d]`. Since `spacy.MishWindowEncoder.v2`, this has been
|
||||
changed to output type `Model[List[Floats2d], List[Floats2d]]`.
|
||||
|
||||
> #### Example config
|
||||
>
|
||||
|
@ -103,12 +103,11 @@ and residual connections.
|
|||
| `depth` | The number of convolutional layers. Recommended value is `4`. ~~int~~ |
|
||||
| **CREATES** | The model using the architecture. ~~Model[Floats2d, Floats2d]~~ |
|
||||
|
||||
|
||||
### spacy.TextCatEnsemble.v1 {#TextCatEnsemble_v1}
|
||||
|
||||
The `spacy.TextCatEnsemble.v1` architecture built an internal `tok2vec` and `linear_model`.
|
||||
Since `spacy.TextCatEnsemble.v2`, this has been refactored so that the `TextCatEnsemble` takes these
|
||||
two sublayers as input.
|
||||
The `spacy.TextCatEnsemble.v1` architecture built an internal `tok2vec` and
|
||||
`linear_model`. Since `spacy.TextCatEnsemble.v2`, this has been refactored so
|
||||
that the `TextCatEnsemble` takes these two sublayers as input.
|
||||
|
||||
> #### Example Config
|
||||
>
|
||||
|
@ -141,3 +140,61 @@ network has an internal CNN Tok2Vec layer and uses attention.
|
|||
| `dropout` | The dropout rate. ~~float~~ |
|
||||
| `nO` | Output dimension, determined by the number of different labels. If not set, the [`TextCategorizer`](/api/textcategorizer) component will set it when `initialize` is called. ~~Optional[int]~~ |
|
||||
| **CREATES** | The model using the architecture. ~~Model[List[Doc], Floats2d]~~ |
|
||||
|
||||
### spacy.HashEmbedCNN.v1 {#HashEmbedCNN_v1}
|
||||
|
||||
Identical to [`spacy.HashEmbedCNN.v2`](/api/architectures#HashEmbedCNN) except
|
||||
using [`spacy.StaticVectors.v1`](#StaticVectors_v1) if vectors are included.
|
||||
|
||||
### spacy.MultiHashEmbed.v1 {#MultiHashEmbed_v1}
|
||||
|
||||
Identical to [`spacy.MultiHashEmbed.v2`](/api/architectures#MultiHashEmbed)
|
||||
except with [`spacy.StaticVectors.v1`](#StaticVectors_v1) if vectors are
|
||||
included.
|
||||
|
||||
### spacy.CharacterEmbed.v1 {#CharacterEmbed_v1}
|
||||
|
||||
Identical to [`spacy.CharacterEmbed.v2`](/api/architectures#CharacterEmbed)
|
||||
except using [`spacy.StaticVectors.v1`](#StaticVectors_v1) if vectors are
|
||||
included.
|
||||
|
||||
## Layers {#layers}
|
||||
|
||||
These functions are available from `@spacy.registry.layers`.
|
||||
|
||||
### spacy.StaticVectors.v1 {#StaticVectors_v1}
|
||||
|
||||
Identical to [`spacy.StaticVectors.v2`](/api/architectures#StaticVectors) except
|
||||
for the handling of tokens without vectors.
|
||||
|
||||
<Infobox title="Bugs for tokens without vectors" variant="warning">
|
||||
|
||||
`spacy.StaticVectors.v1` maps tokens without vectors to the final row in the
|
||||
vectors table, which causes the model predictions to change if new vectors are
|
||||
added to an existing vectors table. See more details in
|
||||
[issue #7662](https://github.com/explosion/spaCy/issues/7662#issuecomment-813925655).
|
||||
|
||||
</Infobox>
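
To see why the fallback matters, here is a toy sketch in plain Python. It is not the layer's implementation: the two lookup functions, the list-based `table` and the `key2row` dictionary are hypothetical simplifications that only mimic the two behaviours described above (final-row fallback versus zero vector).

```python
def lookup_last_row_fallback(table, key2row, key):
    """Unknown keys fall back to the final row (behaviour described for v1)."""
    row = key2row.get(key, len(table) - 1)
    return table[row]


def lookup_zero_fallback(table, key2row, key):
    """Unknown keys get a zero vector (behaviour described for v2)."""
    if key in key2row:
        return table[key2row[key]]
    return [0.0] * len(table[0])


table = [[1.0, 2.0], [3.0, 4.0]]
key2row = {"cat": 0, "dog": 1}

before_last = lookup_last_row_fallback(table, key2row, "unknown")
before_zero = lookup_zero_fallback(table, key2row, "unknown")

# Adding a new vector changes which row is "final".
table.append([5.0, 6.0])
key2row["bird"] = 2

after_last = lookup_last_row_fallback(table, key2row, "unknown")
after_zero = lookup_zero_fallback(table, key2row, "unknown")
```

In this toy setup the final-row fallback returns a different vector for the same unknown key once a row is appended, while the zero-vector fallback stays stable.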
|
||||
|
||||
## Loggers {#loggers}
|
||||
|
||||
These functions are available from `@spacy.registry.loggers`.
|
||||
|
||||
### spacy.WandbLogger.v1 {#WandbLogger_v1}
|
||||
|
||||
The first version of the [`WandbLogger`](/api/top-level#WandbLogger) did not yet
|
||||
support the `log_dataset_dir` and `model_log_interval` arguments.
|
||||
|
||||
> #### Example config
|
||||
>
|
||||
> ```ini
|
||||
> [training.logger]
|
||||
> @loggers = "spacy.WandbLogger.v1"
|
||||
> project_name = "monitor_spacy_training"
|
||||
> remove_config_values = ["paths.train", "paths.dev", "corpora.train.path", "corpora.dev.path"]
|
||||
> ```
|
||||
>
|
||||
> | Name | Description |
|
||||
> | ---------------------- | ------------------------------------------------------------------------------------------------------------------------------------- |
|
||||
> | `project_name` | The name of the project in the Weights & Biases interface. The project will be created automatically if it doesn't exist yet. ~~str~~ |
|
||||
> | `remove_config_values` | A list of values to exclude from the config before it is uploaded to W&B (default: empty). ~~List[str]~~                              |
|
||||
|
|
|
@ -120,13 +120,14 @@ Find all token sequences matching the supplied patterns on the `Doc` or `Span`.
|
|||
> matches = matcher(doc)
|
||||
> ```
|
||||
|
||||
| Name | Description |
|
||||
| ------------------------------------------ | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
|
||||
| `doclike` | The `Doc` or `Span` to match over. ~~Union[Doc, Span]~~ |
|
||||
| _keyword-only_ | |
|
||||
| `as_spans` <Tag variant="new">3</Tag> | Instead of tuples, return a list of [`Span`](/api/span) objects of the matches, with the `match_id` assigned as the span label. Defaults to `False`. ~~bool~~ |
|
||||
| `allow_missing` <Tag variant="new">3</Tag> | Whether to skip checks for missing annotation for attributes included in patterns. Defaults to `False`. ~~bool~~ |
|
||||
| **RETURNS** | A list of `(match_id, start, end)` tuples, describing the matches. A match tuple describes a span `doc[start:end`]. The `match_id` is the ID of the added match pattern. If `as_spans` is set to `True`, a list of `Span` objects is returned instead. ~~Union[List[Tuple[int, int, int]], List[Span]]~~ |
|
||||
| Name | Description |
|
||||
| ---------------------------------------------- | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
|
||||
| `doclike` | The `Doc` or `Span` to match over. ~~Union[Doc, Span]~~ |
|
||||
| _keyword-only_ | |
|
||||
| `as_spans` <Tag variant="new">3</Tag> | Instead of tuples, return a list of [`Span`](/api/span) objects of the matches, with the `match_id` assigned as the span label. Defaults to `False`. ~~bool~~ |
|
||||
| `allow_missing` <Tag variant="new">3</Tag> | Whether to skip checks for missing annotation for attributes included in patterns. Defaults to `False`. ~~bool~~ |
|
||||
| `with_alignments` <Tag variant="new">3.1</Tag> | Return match alignment information as part of the match tuple as `List[int]` with the same length as the matched span. Each entry denotes the corresponding index of the token pattern. If `as_spans` is set to `True`, this setting is ignored. Defaults to `False`. ~~bool~~ |
|
||||
| **RETURNS**                                    | A list of `(match_id, start, end)` tuples, describing the matches. A match tuple describes a span `doc[start:end]`. The `match_id` is the ID of the added match pattern. If `as_spans` is set to `True`, a list of `Span` objects is returned instead. ~~Union[List[Tuple[int, int, int]], List[Span]]~~ |
|
||||
|
||||
## Matcher.\_\_len\_\_ {#len tag="method" new="2"}
|
||||
|
||||
|
|
|
@ -137,14 +137,16 @@ Returns PRF scores for labeled or unlabeled spans.
|
|||
> print(scores["ents_f"])
|
||||
> ```
|
||||
|
||||
| Name | Description |
|
||||
| ---------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ |
|
||||
| `examples` | The `Example` objects holding both the predictions and the correct gold-standard annotations. ~~Iterable[Example]~~ |
|
||||
| `attr` | The attribute to score. ~~str~~ |
|
||||
| _keyword-only_ | |
|
||||
| `getter` | Defaults to `getattr`. If provided, `getter(doc, attr)` should return the `Span` objects for an individual `Doc`. ~~Callable[[Doc, str], Iterable[Span]]~~ |
|
||||
| `has_annotation` | Defaults to `None`. If provided, `has_annotation(doc)` should return whether a `Doc` has annotation for this `attr`. Docs without annotation are skipped for scoring purposes. ~~Optional[Callable[[Doc], bool]]~~ |
|
||||
| **RETURNS** | A dictionary containing the PRF scores under the keys `{attr}_p`, `{attr}_r`, `{attr}_f` and the per-type PRF scores under `{attr}_per_type`. ~~Dict[str, Union[float, Dict[str, float]]]~~ |
|
||||
| Name | Description |
|
||||
| ---------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
|
||||
| `examples` | The `Example` objects holding both the predictions and the correct gold-standard annotations. ~~Iterable[Example]~~ |
|
||||
| `attr` | The attribute to score. ~~str~~ |
|
||||
| _keyword-only_ | |
|
||||
| `getter` | Defaults to `getattr`. If provided, `getter(doc, attr)` should return the `Span` objects for an individual `Doc`. ~~Callable[[Doc, str], Iterable[Span]]~~ |
|
||||
| `has_annotation` | Defaults to `None`. If provided, `has_annotation(doc)` should return whether a `Doc` has annotation for this `attr`. Docs without annotation are skipped for scoring purposes. ~~Optional[Callable[[Doc], bool]]~~ |
|
||||
| `labeled` | Defaults to `True`. If set to `False`, two spans will be considered equal if their start and end match, irrespective of their label. ~~bool~~ |
|
||||
| `allow_overlap` | Defaults to `False`. Whether or not to allow overlapping spans. If set to `False`, the alignment will automatically resolve conflicts. ~~bool~~ |
|
||||
| **RETURNS** | A dictionary containing the PRF scores under the keys `{attr}_p`, `{attr}_r`, `{attr}_f` and the per-type PRF scores under `{attr}_per_type`. ~~Dict[str, Union[float, Dict[str, float]]]~~ |
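
To make the effect of `labeled` concrete, here is a minimal sketch in plain Python. It is not spaCy's implementation: the helper `prf_spans` and the `(start, end, label)` tuples are assumptions made for illustration, and it ignores per-type scores, `has_annotation` and `allow_overlap`.

```python
from typing import Dict, Iterable, Tuple

SpanTuple = Tuple[int, int, str]  # hypothetical (start, end, label) triple


def prf_spans(
    gold: Iterable[SpanTuple], pred: Iterable[SpanTuple], labeled: bool = True
) -> Dict[str, float]:
    """Compute precision, recall and F-score over sets of spans."""

    def key(span: SpanTuple):
        start, end, label = span
        # With labeled=False, spans are compared by their boundaries only.
        return (start, end, label) if labeled else (start, end)

    gold_set = {key(span) for span in gold}
    pred_set = {key(span) for span in pred}
    true_pos = len(gold_set & pred_set)
    p = true_pos / len(pred_set) if pred_set else 0.0
    r = true_pos / len(gold_set) if gold_set else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return {"p": p, "r": r, "f": f}


gold = [(0, 2, "PERSON"), (5, 6, "ORG")]
pred = [(0, 2, "PERSON"), (5, 6, "GPE")]  # second span has the wrong label
labeled_scores = prf_spans(gold, pred)
unlabeled_scores = prf_spans(gold, pred, labeled=False)
```

In this sketch the mislabeled span counts as an error when `labeled=True` and as a correct match when `labeled=False`, which is the distinction the `labeled` setting describes.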
|
||||
|
||||
## Scorer.score_deps {#score_deps tag="staticmethod" new="3"}
|
||||
|
||||
|
|
|
@ -364,7 +364,7 @@ unknown. Defaults to `True` for the first token in the `Doc`.
|
|||
|
||||
| Name | Description |
|
||||
| ----------- | --------------------------------------------- |
|
||||
| **RETURNS** | Whether the token starts a sentence. ~~bool~~ |
|
||||
| **RETURNS** | Whether the token starts a sentence. ~~Optional[bool]~~ |
|
||||
|
||||
## Token.has_vector {#has_vector tag="property" model="vectors"}
|
||||
|
||||
|
|
|
@ -8,6 +8,7 @@ menu:
|
|||
- ['Readers', 'readers']
|
||||
- ['Batchers', 'batchers']
|
||||
- ['Augmenters', 'augmenters']
|
||||
- ['Callbacks', 'callbacks']
|
||||
- ['Training & Alignment', 'gold']
|
||||
- ['Utility Functions', 'util']
|
||||
---
|
||||
|
@ -461,7 +462,7 @@ start decreasing across epochs.
|
|||
|
||||
</Accordion>
|
||||
|
||||
#### spacy.WandbLogger.v1 {#WandbLogger tag="registered function"}
|
||||
#### spacy.WandbLogger.v2 {#WandbLogger tag="registered function"}
|
||||
|
||||
> #### Installation
|
||||
>
|
||||
|
@ -493,15 +494,19 @@ remain in the config file stored on your local system.
|
|||
>
|
||||
> ```ini
|
||||
> [training.logger]
|
||||
> @loggers = "spacy.WandbLogger.v1"
|
||||
> @loggers = "spacy.WandbLogger.v2"
|
||||
> project_name = "monitor_spacy_training"
|
||||
> remove_config_values = ["paths.train", "paths.dev", "corpora.train.path", "corpora.dev.path"]
|
||||
> log_dataset_dir = "corpus"
|
||||
> model_log_interval = 1000
|
||||
> ```
|
||||
|
||||
| Name | Description |
|
||||
| ---------------------- | ------------------------------------------------------------------------------------------------------------------------------------- |
|
||||
| `project_name` | The name of the project in the Weights & Biases interface. The project will be created automatically if it doesn't exist yet. ~~str~~ |
|
||||
| `remove_config_values` | A list of values to exclude from the config before it is uploaded to W&B (default: empty). ~~List[str]~~ |
|
||||
| `model_log_interval` | Steps to wait between logging model checkpoints to the W&B dashboard (default: None). ~~Optional[int]~~ |
|
||||
| `log_dataset_dir` | Directory containing dataset to be logged and versioned as W&B artifact (default: None). ~~Optional[str]~~ |
|
||||
|
||||
<Project id="integrations/wandb">
|
||||
|
||||
|
@ -781,6 +786,35 @@ useful for making the model less sensitive to capitalization.
|
|||
| `level` | The percentage of texts that will be augmented. ~~float~~ |
|
||||
| **CREATES** | A function that takes the current `nlp` object and an [`Example`](/api/example) and yields augmented `Example` objects. ~~Callable[[Language, Example], Iterator[Example]]~~ |
|
||||
|
||||
## Callbacks {#callbacks source="spacy/training/callbacks.py" new="3"}
|
||||
|
||||
The config supports [callbacks](/usage/training#custom-code-nlp-callbacks) at
|
||||
several points in the lifecycle that can be used to modify the `nlp` object.
|
||||
|
||||
### spacy.copy_from_base_model.v1 {#copy_from_base_model tag="registered function"}
|
||||
|
||||
> #### Example config
|
||||
>
|
||||
> ```ini
|
||||
> [initialize.before_init]
|
||||
> @callbacks = "spacy.copy_from_base_model.v1"
|
||||
> tokenizer = "en_core_sci_md"
|
||||
> vocab = "en_core_sci_md"
|
||||
> ```
|
||||
|
||||
Copy the tokenizer and/or vocab from the specified models. It's similar to the
|
||||
v2 [base model](https://v2.spacy.io/api/cli#train) option and useful in
|
||||
combination with
|
||||
[sourced components](/usage/processing-pipelines#sourced-components) when
|
||||
fine-tuning an existing pipeline. The vocab includes the lookups and the vectors
|
||||
from the specified model. Intended for use in `[initialize.before_init]`.
|
||||
|
||||
| Name | Description |
|
||||
| ----------- | ----------------------------------------------------------------------------------------------------------------------- |
|
||||
| `tokenizer` | The pipeline to copy the tokenizer from. Defaults to `None`. ~~Optional[str]~~ |
|
||||
| `vocab` | The pipeline to copy the vocab from. The vocab includes the lookups and vectors. Defaults to `None`. ~~Optional[str]~~ |
|
||||
| **CREATES** | A function that takes the current `nlp` object and modifies its `tokenizer` and `vocab`. ~~Callable[[Language], None]~~ |
|
||||
|
||||
## Training data and alignment {#gold source="spacy/training"}
|
||||
|
||||
### training.offsets_to_biluo_tags {#offsets_to_biluo_tags tag="function"}
|
||||
|
|
|
@ -132,7 +132,7 @@ factory = "tok2vec"
|
|||
@architectures = "spacy.Tok2Vec.v2"
|
||||
|
||||
[components.tok2vec.model.embed]
|
||||
@architectures = "spacy.MultiHashEmbed.v1"
|
||||
@architectures = "spacy.MultiHashEmbed.v2"
|
||||
|
||||
[components.tok2vec.model.encode]
|
||||
@architectures = "spacy.MaxoutWindowEncoder.v2"
|
||||
|
@ -164,7 +164,7 @@ factory = "ner"
|
|||
@architectures = "spacy.Tok2Vec.v2"
|
||||
|
||||
[components.ner.model.tok2vec.embed]
|
||||
@architectures = "spacy.MultiHashEmbed.v1"
|
||||
@architectures = "spacy.MultiHashEmbed.v2"
|
||||
|
||||
[components.ner.model.tok2vec.encode]
|
||||
@architectures = "spacy.MaxoutWindowEncoder.v2"
|
||||
|
@ -541,7 +541,7 @@ word vector tables using the `include_static_vectors` flag.
|
|||
|
||||
```ini
|
||||
[tagger.model.tok2vec.embed]
|
||||
@architectures = "spacy.MultiHashEmbed.v1"
|
||||
@architectures = "spacy.MultiHashEmbed.v2"
|
||||
width = 128
|
||||
attrs = ["LOWER","PREFIX","SUFFIX","SHAPE"]
|
||||
rows = [5000,2500,2500,2500]
|
||||
|
@ -550,7 +550,7 @@ include_static_vectors = true
|
|||
|
||||
<Infobox title="How it works" emoji="💡">
|
||||
|
||||
The configuration system will look up the string `"spacy.MultiHashEmbed.v1"` in
|
||||
The configuration system will look up the string `"spacy.MultiHashEmbed.v2"` in
|
||||
the `architectures` [registry](/api/top-level#registry), and call the returned
|
||||
object with the rest of the arguments from the block. This will result in a call
|
||||
to the
|
||||
|
|
|
@ -130,9 +130,9 @@ which provides a numpy-compatible interface for GPU arrays.
|
|||
|
||||
spaCy can be installed on GPU by specifying `spacy[cuda]`, `spacy[cuda90]`,
|
||||
`spacy[cuda91]`, `spacy[cuda92]`, `spacy[cuda100]`, `spacy[cuda101]`,
|
||||
`spacy[cuda102]`, `spacy[cuda110]` or `spacy[cuda111]`. If you know your cuda
|
||||
version, using the more explicit specifier allows cupy to be installed via
|
||||
wheel, saving some compilation time. The specifiers should install
|
||||
`spacy[cuda102]`, `spacy[cuda110]`, `spacy[cuda111]` or `spacy[cuda112]`. If you
|
||||
know your cuda version, using the more explicit specifier allows cupy to be
|
||||
installed via wheel, saving some compilation time. The specifiers should install
|
||||
[`cupy`](https://cupy.chainer.org).
|
||||
|
||||
```bash
|
||||
|
|
|
@ -137,7 +137,7 @@ nO = null
|
|||
@architectures = "spacy.Tok2Vec.v2"
|
||||
|
||||
[components.textcat.model.tok2vec.embed]
|
||||
@architectures = "spacy.MultiHashEmbed.v1"
|
||||
@architectures = "spacy.MultiHashEmbed.v2"
|
||||
width = 64
|
||||
rows = [2000, 2000, 1000, 1000, 1000, 1000]
|
||||
attrs = ["ORTH", "LOWER", "PREFIX", "SUFFIX", "SHAPE", "ID"]
|
||||
|
@ -204,7 +204,7 @@ factory = "tok2vec"
|
|||
@architectures = "spacy.Tok2Vec.v2"
|
||||
|
||||
[components.tok2vec.model.embed]
|
||||
@architectures = "spacy.MultiHashEmbed.v1"
|
||||
@architectures = "spacy.MultiHashEmbed.v2"
|
||||
# ...
|
||||
|
||||
[components.tok2vec.model.encode]
|
||||
|
@ -220,7 +220,7 @@ architecture:
|
|||
```ini
|
||||
### config.cfg (excerpt)
|
||||
[components.tok2vec.model.embed]
|
||||
@architectures = "spacy.CharacterEmbed.v1"
|
||||
@architectures = "spacy.CharacterEmbed.v2"
|
||||
# ...
|
||||
|
||||
[components.tok2vec.model.encode]
|
||||
|
@ -638,7 +638,7 @@ that has the full implementation.
|
|||
> @architectures = "rel_instance_tensor.v1"
|
||||
>
|
||||
> [model.create_instance_tensor.tok2vec]
|
||||
> @architectures = "spacy.HashEmbedCNN.v1"
|
||||
> @architectures = "spacy.HashEmbedCNN.v2"
|
||||
> # ...
|
||||
>
|
||||
> [model.create_instance_tensor.pooling]
|
||||
|
|
|
@ -787,6 +787,7 @@ rather than performance:
|
|||
|
||||
```python
|
||||
def tokenizer_pseudo_code(
|
||||
text,
|
||||
special_cases,
|
||||
prefix_search,
|
||||
suffix_search,
|
||||
|
@ -840,12 +841,14 @@ def tokenizer_pseudo_code(
|
|||
tokens.append(substring)
|
||||
substring = ""
|
||||
tokens.extend(reversed(suffixes))
|
||||
for match in matcher(special_cases, text):
|
||||
tokens.replace(match, special_cases[match])
|
||||
return tokens
|
||||
```
|
||||
|
||||
The algorithm can be summarized as follows:
|
||||
|
||||
1. Iterate over whitespace-separated substrings.
|
||||
1. Iterate over space-separated substrings.
|
||||
2. Look for a token match. If there is a match, stop processing and keep this
|
||||
token.
|
||||
3. Check whether we have an explicitly defined special case for this substring.
|
||||
|
@ -859,6 +862,8 @@ The algorithm can be summarized as follows:
|
|||
8. Look for "infixes" – stuff like hyphens etc. and split the substring into
|
||||
tokens on all infixes.
|
||||
9. Once we can't consume any more of the string, handle it as a single token.
|
||||
10. Make a final pass over the text to check for special cases that include
|
||||
spaces or that were missed due to the incremental processing of affixes.
|
||||
|
||||
</Accordion>
|
||||
|
||||
|
|
|
@ -995,7 +995,7 @@ your results.
|
|||
>
|
||||
> ```ini
|
||||
> [training.logger]
|
||||
> @loggers = "spacy.WandbLogger.v1"
|
||||
> @loggers = "spacy.WandbLogger.v2"
|
||||
> project_name = "monitor_spacy_training"
|
||||
> remove_config_values = ["paths.train", "paths.dev", "corpora.train.path", "corpora.dev.path"]
|
||||
> ```
|
||||
|
|
|
@ -1130,8 +1130,8 @@ any other custom workflows. `corpora.train` and `corpora.dev` are used as
|
|||
conventions within spaCy's default configs, but you can also define any other
|
||||
custom blocks. Each section in the corpora config should resolve to a
|
||||
[`Corpus`](/api/corpus) – for example, using spaCy's built-in
|
||||
[corpus reader](/api/top-level#readers) that takes a path to a binary `.spacy`
|
||||
file. The `train_corpus` and `dev_corpus` fields in the
|
||||
[corpus reader](/api/top-level#corpus-readers) that takes a path to a binary
|
||||
`.spacy` file. The `train_corpus` and `dev_corpus` fields in the
|
||||
[`[training]`](/api/data-formats#config-training) block specify where to find
|
||||
the corpus in your config. This makes it easy to **swap out** different corpora
|
||||
by only changing a single config setting.
|
||||
|
@ -1142,21 +1142,23 @@ corpora, keyed by corpus name, e.g. `"train"` and `"dev"`. This can be
|
|||
especially useful if you need to split a single file into corpora for training
|
||||
and evaluation, without loading the same file twice.
|
||||
|
||||
By default, the training data is loaded into memory and shuffled before each
|
||||
epoch. If the corpus is **too large to fit into memory** during training, stream
|
||||
the corpus using a custom reader as described in the next section.
|
||||
|
||||
### Custom data reading and batching {#custom-code-readers-batchers}
|
||||
|
||||
Some use-cases require **streaming in data** or manipulating datasets on the
|
||||
fly, rather than generating all data beforehand and storing it to file. Instead
|
||||
fly, rather than generating all data beforehand and storing it to disk. Instead
|
||||
of using the built-in [`Corpus`](/api/corpus) reader, which uses static file
|
||||
paths, you can create and register a custom function that generates
|
||||
[`Example`](/api/example) objects. The resulting generator can be infinite. When
|
||||
using this dataset for training, stopping criteria such as maximum number of
|
||||
steps, or stopping when the loss does not decrease further, can be used.
|
||||
[`Example`](/api/example) objects.
|
||||
|
||||
In this example we assume a custom function `read_custom_data` which loads or
|
||||
generates texts with relevant text classification annotations. Then, small
|
||||
lexical variations of the input text are created before generating the final
|
||||
[`Example`](/api/example) objects. The `@spacy.registry.readers` decorator lets
|
||||
you register the function creating the custom reader in the `readers`
|
||||
In the following example we assume a custom function `read_custom_data` which
|
||||
loads or generates texts with relevant text classification annotations. Then,
|
||||
small lexical variations of the input text are created before generating the
|
||||
final [`Example`](/api/example) objects. The `@spacy.registry.readers` decorator
|
||||
lets you register the function creating the custom reader in the `readers`
|
||||
[registry](/api/top-level#registry) and assign it a string name, so it can be
|
||||
used in your config. All arguments on the registered function become available
|
||||
as **config settings** – in this case, `source`.
|
||||
|
@ -1199,6 +1201,80 @@ Remember that a registered function should always be a function that spaCy
|
|||
|
||||
</Infobox>
|
||||
|
||||
If the corpus is **too large to load into memory** or the corpus reader is an
|
||||
**infinite generator**, use the setting `max_epochs = -1` to indicate that the
|
||||
train corpus should be streamed. With this setting the train corpus is merely
|
||||
streamed and batched, not shuffled, so any shuffling needs to be implemented in
|
||||
the corpus reader itself. In the example below, a corpus reader that generates
|
||||
sentences containing even or odd numbers is used with an unlimited number of
|
||||
examples for the train corpus and a limited number of examples for the dev
|
||||
corpus. The dev corpus should always be finite and fit in memory during the
|
||||
evaluation step. `max_steps` and/or `patience` are used to determine when the
|
||||
training should stop.
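
The interplay of `max_steps` and `patience` on an unbounded stream can be sketched roughly as follows. This is a simplified illustration, not spaCy's training loop: `train_streamed`, `plateau` and the `eval_frequency` argument are hypothetical names, and the model update itself is left out.

```python
import itertools
from typing import Callable, Iterable, Tuple


def train_streamed(
    stream: Iterable[object],
    evaluate: Callable[[int], float],
    max_steps: int = 0,
    patience: int = 0,
    eval_frequency: int = 10,
) -> Tuple[int, int]:
    """Consume a possibly infinite stream until a stopping criterion fires.

    Returns (last_step, best_step). A value of 0 disables a criterion.
    """
    best_score = float("-inf")
    best_step = 0
    step = 0
    for step, _example in enumerate(stream, start=1):
        # A real loop would update the model on the example (or batch) here.
        if step % eval_frequency == 0:
            score = evaluate(step)
            if score > best_score:
                best_score, best_step = score, step
        if max_steps and step >= max_steps:
            break  # hard cap on the number of steps
        if patience and step - best_step >= patience:
            break  # no improvement for `patience` steps
    return step, best_step


def plateau(step: int) -> float:
    """Hypothetical dev score that improves until step 30, then stays flat."""
    return float(min(step, 30))


stopped_by_patience = train_streamed(
    itertools.count(), plateau, max_steps=2000, patience=50
)
stopped_by_max_steps = train_streamed(
    itertools.count(), plateau, max_steps=60, patience=500
)
```

Without at least one of the two criteria, a loop over an infinite generator like this would never terminate, which is why a streamed train corpus needs `max_steps` and/or `patience`.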
|
||||
|
||||
> #### config.cfg
|
||||
>
|
||||
> ```ini
|
||||
> [corpora.dev]
|
||||
> @readers = "even_odd.v1"
|
||||
> limit = 100
|
||||
>
|
||||
> [corpora.train]
|
||||
> @readers = "even_odd.v1"
|
||||
> limit = -1
|
||||
>
|
||||
> [training]
|
||||
> max_epochs = -1
|
||||
> patience = 500
|
||||
> max_steps = 2000
|
||||
> ```
|
||||
|
||||
```python
|
||||
### functions.py
|
||||
from typing import Callable, Iterable, Iterator
|
||||
from spacy import util
|
||||
import random
|
||||
from spacy.training import Example
|
||||
from spacy import Language
|
||||
|
||||
|
||||
@util.registry.readers("even_odd.v1")
|
||||
def create_even_odd_corpus(limit: int = -1) -> Callable[[Language], Iterable[Example]]:
|
||||
return EvenOddCorpus(limit)
|
||||
|
||||
|
||||
class EvenOddCorpus:
|
||||
def __init__(self, limit):
|
||||
self.limit = limit
|
||||
|
||||
def __call__(self, nlp: Language) -> Iterator[Example]:
|
||||
i = 0
|
||||
while i < self.limit or self.limit < 0:
|
||||
r = random.randint(0, 1000)
|
||||
cat = r % 2 == 0
|
||||
text = "This is sentence " + str(r)
|
||||
yield Example.from_dict(
|
||||
nlp.make_doc(text), {"cats": {"EVEN": cat, "ODD": not cat}}
|
||||
)
|
||||
i += 1
|
||||
```
|
||||
|
||||
> #### config.cfg
|
||||
>
|
||||
> ```ini
|
||||
> [initialize.components.textcat.labels]
|
||||
> @readers = "spacy.read_labels.v1"
|
||||
> path = "labels/textcat.json"
|
||||
> require = true
|
||||
> ```
|
||||
|
||||
If the train corpus is streamed, the initialize step peeks at the first 100
|
||||
examples in the corpus to find the labels for each component. If this isn't
|
||||
sufficient, you'll need to [provide the labels](#initialization-labels) for each
|
||||
component in the `[initialize]` block. [`init labels`](/api/cli#init-labels) can
|
||||
be used to generate JSON files in the correct format, which you can extend with
|
||||
the full label set.
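
Why peeking at a fixed number of examples can miss labels is easy to see in a small sketch. The code below is plain Python and only mimics the idea: `stream_examples` and `peek_labels` are hypothetical helpers, and the `LATE` label is invented to show a category that first appears after the peeked window.

```python
import itertools
from typing import Dict, Iterable, Iterator, List, Set


def stream_examples() -> Iterator[Dict[str, bool]]:
    """Hypothetical infinite reader yielding text classification annotations."""
    for i in itertools.count():
        cats = {"EVEN": i % 2 == 0, "ODD": i % 2 == 1}
        if i >= 150:
            # A label that only shows up late in the stream.
            cats["LATE"] = i % 50 == 0
        yield cats


def peek_labels(stream: Iterable[Dict[str, bool]], n: int = 100) -> List[str]:
    """Collect the label names seen in the first n items of a stream."""
    labels: Set[str] = set()
    # islice stops after n items, so an infinite stream is safe to pass in.
    for cats in itertools.islice(stream, n):
        labels.update(cats)
    return sorted(labels)


labels = peek_labels(stream_examples())
more_labels = peek_labels(stream_examples(), n=200)
```

Here the first 100 items never contain `LATE`, so a component initialized from `labels` alone would not know about it. Providing the full label set explicitly avoids that.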
|
||||
|
||||
We can also customize the **batching strategy** by registering a new batcher
|
||||
function in the `batchers` [registry](/api/top-level#registry). A batcher turns
|
||||
a stream of items into a stream of batches. spaCy has several useful built-in
|
||||
|
|
|
@ -616,11 +616,11 @@ Note that spaCy v3.0 now requires **Python 3.6+**.
|
|||
| `spacy profile` | [`spacy debug profile`](/api/cli#debug-profile) |
|
||||
| `spacy link`, `util.set_data_path`, `util.get_data_path` | not needed, symlinks are deprecated |
|
||||
|
||||
The following deprecated methods, attributes and arguments were removed in v3.0.
|
||||
Most of them have been **deprecated for a while** and many would previously
|
||||
raise errors. Many of them were also mostly internals. If you've been working
|
||||
with more recent versions of spaCy v2.x, it's **unlikely** that your code relied
|
||||
on them.
|
||||
The following methods, attributes and arguments were removed in v3.0. Most of
|
||||
them have been **deprecated for a while** and many would previously raise
|
||||
errors. Many of them were also mostly internals. If you've been working with
|
||||
more recent versions of spaCy v2.x, it's **unlikely** that your code relied on
|
||||
them.
|
||||
|
||||
| Removed | Replacement |
|
||||
| ----------------------------------------------------------------------------------------------------------------------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------- |
|
||||
|
@ -637,10 +637,10 @@ on them.
|
|||
|
||||
### Downloading and loading trained pipelines {#migrating-downloading-models}
|
||||
|
||||
Symlinks and shortcuts like `en` are now officially deprecated. There are
|
||||
[many different trained pipelines](/models) with different capabilities and not
|
||||
just one "English model". In order to download and load a package, you should
|
||||
always use its full name – for instance,
|
||||
Symlinks and shortcuts like `en` have been deprecated for a while and are no
|
||||
longer supported. There are [many different trained pipelines](/models)
|
||||
with different capabilities and not just one "English model". In order to
|
||||
download and load a package, you should always use its full name – for instance,
|
||||
[`en_core_web_sm`](/models/en#en_core_web_sm).
|
||||
|
||||
```diff
|
||||
|
@ -1185,9 +1185,10 @@ package isn't imported.
|
|||
In Jupyter notebooks, run [`prefer_gpu`](/api/top-level#spacy.prefer_gpu),
|
||||
[`require_gpu`](/api/top-level#spacy.require_gpu) or
|
||||
[`require_cpu`](/api/top-level#spacy.require_cpu) in the same cell as
|
||||
[`spacy.load`](/api/top-level#spacy.load) to ensure that the model is loaded on the correct device.
|
||||
[`spacy.load`](/api/top-level#spacy.load) to ensure that the model is loaded on
|
||||
the correct device.
|
||||
|
||||
Due to a bug related to `contextvars` (see the [bug
|
||||
report](https://github.com/ipython/ipython/issues/11565)), the GPU settings may
|
||||
not be preserved correctly across cells, resulting in models being loaded on
|
||||
Due to a bug related to `contextvars` (see the
|
||||
[bug report](https://github.com/ipython/ipython/issues/11565)), the GPU settings
|
||||
may not be preserved correctly across cells, resulting in models being loaded on
|
||||
the wrong device or only partially on GPU.
|
||||
|
|