Mirror of https://github.com/explosion/spaCy.git (synced 2025-01-01 04:46:38 +03:00)

Commit 29ac7f776a: Merge branch 'master' into spacy.io
.github/azure-steps.yml (new file, 57 lines)

parameters:
  python_version: ''
  architecture: ''
  prefix: ''
  gpu: false
  num_build_jobs: 1

steps:
  - task: UsePythonVersion@0
    inputs:
      versionSpec: ${{ parameters.python_version }}
      architecture: ${{ parameters.architecture }}

  - script: |
      ${{ parameters.prefix }} python -m pip install -U pip setuptools
      ${{ parameters.prefix }} python -m pip install -U -r requirements.txt
    displayName: "Install dependencies"

  - script: |
      ${{ parameters.prefix }} python setup.py build_ext --inplace -j ${{ parameters.num_build_jobs }}
      ${{ parameters.prefix }} python setup.py sdist --formats=gztar
    displayName: "Compile and build sdist"

  - task: DeleteFiles@1
    inputs:
      contents: "spacy"
    displayName: "Delete source directory"

  - script: |
      ${{ parameters.prefix }} python -m pip freeze --exclude torch --exclude cupy-cuda110 > installed.txt
      ${{ parameters.prefix }} python -m pip uninstall -y -r installed.txt
    displayName: "Uninstall all packages"

  - bash: |
      ${{ parameters.prefix }} SDIST=$(python -c "import os;print(os.listdir('./dist')[-1])" 2>&1)
      ${{ parameters.prefix }} python -m pip install dist/$SDIST
    displayName: "Install from sdist"

  - script: |
      ${{ parameters.prefix }} python -m pip install -U -r requirements.txt
    displayName: "Install test requirements"

  - script: |
      ${{ parameters.prefix }} python -m pip install -U cupy-cuda110
      ${{ parameters.prefix }} python -m pip install "torch==1.7.1+cu110" -f https://download.pytorch.org/whl/torch_stable.html
    displayName: "Install GPU requirements"
    condition: eq(${{ parameters.gpu }}, true)

  - script: |
      ${{ parameters.prefix }} python -m pytest --pyargs spacy
    displayName: "Run CPU tests"
    condition: eq(${{ parameters.gpu }}, false)

  - script: |
      ${{ parameters.prefix }} python -m pytest --pyargs spacy -p spacy.tests.enable_gpu
    displayName: "Run GPU tests"
    condition: eq(${{ parameters.gpu }}, true)
.github/contributors/AyushExel.md (new file, 106 lines)

# spaCy contributor agreement

This spaCy Contributor Agreement (**"SCA"**) is based on the
[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf).
The SCA applies to any contribution that you make to any product or project
managed by us (the **"project"**), and sets out the intellectual property rights
you grant to us in the contributed materials. The term **"us"** shall mean
[ExplosionAI GmbH](https://explosion.ai/legal). The term
**"you"** shall mean the person or entity identified below.

If you agree to be bound by these terms, fill in the information requested
below and include the filled-in version with your first pull request, under the
folder [`.github/contributors/`](/.github/contributors/). The name of the file
should be your GitHub username, with the extension `.md`. For example, the user
example_user would create the file `.github/contributors/example_user.md`.

Read this agreement carefully before signing. These terms and conditions
constitute a binding legal agreement.

## Contributor Agreement

1. The term "contribution" or "contributed materials" means any source code,
object code, patch, tool, sample, graphic, specification, manual,
documentation, or any other material posted or submitted by you to the project.

2. With respect to any worldwide copyrights, or copyright applications and
registrations, in your contribution:

    * you hereby assign to us joint ownership, and to the extent that such
    assignment is or becomes invalid, ineffective or unenforceable, you hereby
    grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
    royalty-free, unrestricted license to exercise all rights under those
    copyrights. This includes, at our option, the right to sublicense these same
    rights to third parties through multiple levels of sublicensees or other
    licensing arrangements;

    * you agree that each of us can do all things in relation to your
    contribution as if each of us were the sole owners, and if one of us makes
    a derivative work of your contribution, the one who makes the derivative
    work (or has it made) will be the sole owner of that derivative work;

    * you agree that you will not assert any moral rights in your contribution
    against us, our licensees or transferees;

    * you agree that we may register a copyright in your contribution and
    exercise all ownership rights associated with it; and

    * you agree that neither of us has any duty to consult with, obtain the
    consent of, pay or render an accounting to the other for any use or
    distribution of your contribution.

3. With respect to any patents you own, or that you can license without payment
to any third party, you hereby grant to us a perpetual, irrevocable,
non-exclusive, worldwide, no-charge, royalty-free license to:

    * make, have made, use, sell, offer to sell, import, and otherwise transfer
    your contribution in whole or in part, alone or in combination with or
    included in any product, work or materials arising out of the project to
    which your contribution was submitted, and

    * at our option, to sublicense these same rights to third parties through
    multiple levels of sublicensees or other licensing arrangements.

4. Except as set out above, you keep all right, title, and interest in your
contribution. The rights that you grant to us under these terms are effective
on the date you first submitted a contribution to us, even if your submission
took place before the date you sign these terms.

5. You covenant, represent, warrant and agree that:

    * each contribution that you submit is and shall be an original work of
    authorship and you can legally grant the rights set out in this SCA;

    * to the best of your knowledge, each contribution will not violate any
    third party's copyrights, trademarks, patents, or other intellectual
    property rights; and

    * each contribution shall be in compliance with U.S. export control laws and
    other applicable export and import laws. You agree to notify us if you
    become aware of any circumstance which would make any of the foregoing
    representations inaccurate in any respect. We may publicly disclose your
    participation in the project, including the fact that you have signed the SCA.

6. This SCA is governed by the laws of the State of California and applicable
U.S. Federal law. Any choice of law rules will not apply.

7. Please place an “x” on one of the applicable statements below. Please do NOT
mark both statements:

    * [X] I am signing on behalf of myself as an individual and no other person
    or entity, including my employer, has or will have rights with respect to my
    contributions.

    * [ ] I am signing on behalf of my employer or a legal entity and I have the
    actual authority to contractually bind that entity.

## Contributor Details

| Field                          | Entry                |
|------------------------------- | -------------------- |
| Name                           | Ayush Chaurasia      |
| Company name (if applicable)   |                      |
| Title or role (if applicable)  |                      |
| Date                           | 2021-03-12           |
| GitHub username                | AyushExel            |
| Website (optional)             |                      |
.github/contributors/broaddeep.md (new file, 106 lines)

# spaCy contributor agreement

The body of this file is the same standard spaCy Contributor Agreement
reproduced in full above (sections 1 through 7), signed as an individual
("[x]").

## Contributor Details

| Field                          | Entry                |
|------------------------------- | -------------------- |
| Name                           | Dongjun Park         |
| Company name (if applicable)   |                      |
| Title or role (if applicable)  |                      |
| Date                           | 2021-03-06           |
| GitHub username                | broaddeep            |
| Website (optional)             |                      |
azure-pipelines.yml
@@ -76,39 +76,24 @@ jobs:
       maxParallel: 4
     pool:
       vmImage: $(imageName)

     steps:
-      - task: UsePythonVersion@0
-        inputs:
-          versionSpec: "$(python.version)"
-          architecture: "x64"
-
-      - script: |
-          python -m pip install -U setuptools
-          pip install -r requirements.txt
-        displayName: "Install dependencies"
-
-      - script: |
-          python setup.py build_ext --inplace
-          python setup.py sdist --formats=gztar
-        displayName: "Compile and build sdist"
-
-      - task: DeleteFiles@1
-        inputs:
-          contents: "spacy"
-        displayName: "Delete source directory"
-
-      - script: |
-          pip freeze > installed.txt
-          pip uninstall -y -r installed.txt
-        displayName: "Uninstall all packages"
-
-      - bash: |
-          SDIST=$(python -c "import os;print(os.listdir('./dist')[-1])" 2>&1)
-          pip install dist/$SDIST
-        displayName: "Install from sdist"
-
-      - script: |
-          pip install -r requirements.txt
-          python -m pytest --pyargs spacy
-        displayName: "Run tests"
+      - template: .github/azure-steps.yml
+        parameters:
+          python_version: '$(python.version)'
+          architecture: 'x64'
+
+  - job: "TestGPU"
+    dependsOn: "Validate"
+    strategy:
+      matrix:
+        Python38LinuxX64_GPU:
+          python.version: '3.8'
+    pool:
+      name: "LinuxX64_GPU"
+    steps:
+      - template: .github/azure-steps.yml
+        parameters:
+          python_version: '$(python.version)'
+          architecture: 'x64'
+          gpu: true
+          num_build_jobs: 24
pyproject.toml
@@ -5,7 +5,7 @@ requires = [
     "cymem>=2.0.2,<2.1.0",
     "preshed>=3.0.2,<3.1.0",
     "murmurhash>=0.28.0,<1.1.0",
-    "thinc>=8.0.2,<8.1.0",
+    "thinc>=8.0.3,<8.1.0",
     "blis>=0.4.0,<0.8.0",
     "pathy",
     "numpy>=1.15.0",
requirements.txt
@@ -1,14 +1,14 @@
 # Our libraries
-spacy-legacy>=3.0.0,<3.1.0
+spacy-legacy>=3.0.4,<3.1.0
 cymem>=2.0.2,<2.1.0
 preshed>=3.0.2,<3.1.0
-thinc>=8.0.2,<8.1.0
+thinc>=8.0.3,<8.1.0
 blis>=0.4.0,<0.8.0
 ml_datasets>=0.2.0,<0.3.0
 murmurhash>=0.28.0,<1.1.0
 wasabi>=0.8.1,<1.1.0
-srsly>=2.4.0,<3.0.0
+srsly>=2.4.1,<3.0.0
-catalogue>=2.0.1,<2.1.0
+catalogue>=2.0.3,<2.1.0
 typer>=0.3.0,<0.4.0
 pathy>=0.3.5
 # Third party dependencies
@@ -20,7 +20,6 @@ jinja2
 # Official Python utilities
 setuptools
 packaging>=20.0
-importlib_metadata>=0.20; python_version < "3.8"
 typing_extensions>=3.7.4.1,<4.0.0.0; python_version < "3.8"
 # Development dependencies
 cython>=0.25
setup.cfg
@@ -34,18 +34,18 @@ setup_requires =
     cymem>=2.0.2,<2.1.0
     preshed>=3.0.2,<3.1.0
     murmurhash>=0.28.0,<1.1.0
-    thinc>=8.0.2,<8.1.0
+    thinc>=8.0.3,<8.1.0
 install_requires =
     # Our libraries
-    spacy-legacy>=3.0.0,<3.1.0
+    spacy-legacy>=3.0.4,<3.1.0
     murmurhash>=0.28.0,<1.1.0
     cymem>=2.0.2,<2.1.0
     preshed>=3.0.2,<3.1.0
-    thinc>=8.0.2,<8.1.0
+    thinc>=8.0.3,<8.1.0
     blis>=0.4.0,<0.8.0
     wasabi>=0.8.1,<1.1.0
-    srsly>=2.4.0,<3.0.0
+    srsly>=2.4.1,<3.0.0
-    catalogue>=2.0.1,<2.1.0
+    catalogue>=2.0.3,<2.1.0
     typer>=0.3.0,<0.4.0
     pathy>=0.3.5
     # Third-party dependencies
@@ -57,7 +57,6 @@ install_requires =
     # Official Python utilities
     setuptools
     packaging>=20.0
-    importlib_metadata>=0.20; python_version < "3.8"
     typing_extensions>=3.7.4,<4.0.0.0; python_version < "3.8"

 [options.entry_points]
@@ -91,6 +90,8 @@ cuda110 =
     cupy-cuda110>=5.0.0b4,<9.0.0
 cuda111 =
     cupy-cuda111>=5.0.0b4,<9.0.0
+cuda112 =
+    cupy-cuda112>=5.0.0b4,<9.0.0
 # Language tokenizers with external dependencies
 ja =
     sudachipy>=0.4.9
spacy/about.py
@@ -1,6 +1,6 @@
 # fmt: off
 __title__ = "spacy"
-__version__ = "3.0.5"
+__version__ = "3.0.6"
 __download_url__ = "https://github.com/explosion/spacy-models/releases/download"
 __compatibility__ = "https://raw.githubusercontent.com/explosion/spacy-models/master/compatibility.json"
 __projects__ = "https://github.com/explosion/projects"
spacy/cli/__init__.py
@@ -9,6 +9,7 @@ from .info import info  # noqa: F401
 from .package import package  # noqa: F401
 from .profile import profile  # noqa: F401
 from .train import train_cli  # noqa: F401
+from .assemble import assemble_cli  # noqa: F401
 from .pretrain import pretrain  # noqa: F401
 from .debug_data import debug_data  # noqa: F401
 from .debug_config import debug_config  # noqa: F401
@@ -29,9 +30,9 @@ from .project.document import project_document  # noqa: F401

 @app.command("link", no_args_is_help=True, deprecated=True, hidden=True)
 def link(*args, **kwargs):
-    """As of spaCy v3.0, symlinks like "en" are deprecated. You can load trained
+    """As of spaCy v3.0, symlinks like "en" are not supported anymore. You can load trained
     pipeline packages using their full names or from a directory path."""
     msg.warn(
-        "As of spaCy v3.0, model symlinks are deprecated. You can load trained "
+        "As of spaCy v3.0, model symlinks are not supported anymore. You can load trained "
         "pipeline packages using their full names or from a directory path."
     )
spacy/cli/assemble.py (new file, 58 lines)

from typing import Optional
from pathlib import Path
from wasabi import msg
import typer
import logging

from ._util import app, Arg, Opt, parse_config_overrides, show_validation_error
from ._util import import_code
from ..training.initialize import init_nlp
from .. import util
from ..util import get_sourced_components, load_model_from_config


@app.command(
    "assemble",
    context_settings={"allow_extra_args": True, "ignore_unknown_options": True},
)
def assemble_cli(
    # fmt: off
    ctx: typer.Context,  # This is only used to read additional arguments
    config_path: Path = Arg(..., help="Path to config file", exists=True, allow_dash=True),
    output_path: Path = Arg(..., help="Output directory to store assembled pipeline in"),
    code_path: Optional[Path] = Opt(None, "--code", "-c", help="Path to Python file with additional code (registered functions) to be imported"),
    verbose: bool = Opt(False, "--verbose", "-V", "-VV", help="Display more information for debugging purposes"),
    # fmt: on
):
    """
    Assemble a spaCy pipeline from a config file. The config file includes
    all settings for initializing the pipeline. To override settings in the
    config, e.g. settings that point to local paths or that you want to
    experiment with, you can override them as command line options. The
    --code argument lets you pass in a Python file that can be used to
    register custom functions that are referenced in the config.

    DOCS: https://spacy.io/api/cli#assemble
    """
    util.logger.setLevel(logging.DEBUG if verbose else logging.INFO)
    # Make sure all files and paths exist if they are needed
    if not config_path or (str(config_path) != "-" and not config_path.exists()):
        msg.fail("Config file not found", config_path, exits=1)
    overrides = parse_config_overrides(ctx.args)
    import_code(code_path)
    with show_validation_error(config_path):
        config = util.load_config(config_path, overrides=overrides, interpolate=False)
    msg.divider("Initializing pipeline")
    nlp = load_model_from_config(config, auto_fill=True)
    config = config.interpolate()
    sourced = get_sourced_components(config)
    # Make sure that listeners are defined before initializing further
    nlp._link_components()
    with nlp.select_pipes(disable=[*sourced]):
        nlp.initialize()
    msg.good("Initialized pipeline")
    msg.divider("Serializing to disk")
    if output_path is not None and not output_path.exists():
        output_path.mkdir(parents=True)
        msg.good(f"Created output directory: {output_path}")
    nlp.to_disk(output_path)
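The assemble command above boils down to: load the config, build the pipeline from it, initialize, and write it to disk. For readers who want the same flow without the CLI wrapper, here is a minimal Python sketch of that sequence (not the CLI itself; the config path and output directory below are placeholders, and error handling is omitted):

    from pathlib import Path
    from spacy import util

    # Placeholder paths, for illustration only
    config_path = Path("config.cfg")
    output_path = Path("output_pipeline")

    # Load the config without interpolation, as assemble_cli does
    config = util.load_config(config_path, interpolate=False)
    # Build the pipeline object from the config, filling in defaults
    nlp = util.load_model_from_config(config, auto_fill=True)
    # Initialize components that need it, then serialize the pipeline
    nlp.initialize()
    output_path.mkdir(parents=True, exist_ok=True)
    nlp.to_disk(output_path)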
spacy/cli/debug_data.py
@@ -1,4 +1,4 @@
-from typing import List, Sequence, Dict, Any, Tuple, Optional
+from typing import List, Sequence, Dict, Any, Tuple, Optional, Set
 from pathlib import Path
 from collections import Counter
 import sys
@@ -13,6 +13,8 @@ from ..training.initialize import get_sourced_components
 from ..schemas import ConfigSchemaTraining
 from ..pipeline._parser_internals import nonproj
 from ..pipeline._parser_internals.nonproj import DELIMITER
+from ..pipeline import Morphologizer
+from ..morphology import Morphology
 from ..language import Language
 from ..util import registry, resolve_dot_names
 from .. import util
@@ -194,32 +196,32 @@ def debug_data(
         )
         label_counts = gold_train_data["ner"]
         model_labels = _get_labels_from_model(nlp, "ner")
-        new_labels = [l for l in labels if l not in model_labels]
-        existing_labels = [l for l in labels if l in model_labels]
         has_low_data_warning = False
         has_no_neg_warning = False
         has_ws_ents_error = False
         has_punct_ents_warning = False

         msg.divider("Named Entity Recognition")
-        msg.info(
-            f"{len(new_labels)} new label(s), {len(existing_labels)} existing label(s)"
-        )
+        msg.info(f"{len(model_labels)} label(s)")
         missing_values = label_counts["-"]
         msg.text(f"{missing_values} missing value(s) (tokens with '-' label)")
-        for label in new_labels:
+        for label in labels:
             if len(label) == 0:
-                msg.fail("Empty label found in new labels")
-        if new_labels:
-            labels_with_counts = [
-                (label, count)
-                for label, count in label_counts.most_common()
-                if label != "-"
-            ]
-            labels_with_counts = _format_labels(labels_with_counts, counts=True)
-            msg.text(f"New: {labels_with_counts}", show=verbose)
-        if existing_labels:
-            msg.text(f"Existing: {_format_labels(existing_labels)}", show=verbose)
+                msg.fail("Empty label found in train data")
+        labels_with_counts = [
+            (label, count)
+            for label, count in label_counts.most_common()
+            if label != "-"
+        ]
+        labels_with_counts = _format_labels(labels_with_counts, counts=True)
+        msg.text(f"Labels in train data: {_format_labels(labels)}", show=verbose)
+        missing_labels = model_labels - labels
+        if missing_labels:
+            msg.warn(
+                "Some model labels are not present in the train data. The "
+                "model performance may be degraded for these labels after "
+                f"training: {_format_labels(missing_labels)}."
+            )
         if gold_train_data["ws_ents"]:
             msg.fail(f"{gold_train_data['ws_ents']} invalid whitespace entity spans")
             has_ws_ents_error = True
@@ -228,10 +230,10 @@ def debug_data(
             msg.warn(f"{gold_train_data['punct_ents']} entity span(s) with punctuation")
             has_punct_ents_warning = True

-        for label in new_labels:
+        for label in labels:
             if label_counts[label] <= NEW_LABEL_THRESHOLD:
                 msg.warn(
-                    f"Low number of examples for new label '{label}' ({label_counts[label]})"
+                    f"Low number of examples for label '{label}' ({label_counts[label]})"
                 )
                 has_low_data_warning = True

@@ -276,22 +278,52 @@ def debug_data(
         )

     if "textcat" in factory_names:
-        msg.divider("Text Classification")
-        labels = [label for label in gold_train_data["cats"]]
-        model_labels = _get_labels_from_model(nlp, "textcat")
-        new_labels = [l for l in labels if l not in model_labels]
-        existing_labels = [l for l in labels if l in model_labels]
-        msg.info(
-            f"Text Classification: {len(new_labels)} new label(s), "
-            f"{len(existing_labels)} existing label(s)"
-        )
-        if new_labels:
-            labels_with_counts = _format_labels(
-                gold_train_data["cats"].most_common(), counts=True
-            )
-            msg.text(f"New: {labels_with_counts}", show=verbose)
-        if existing_labels:
-            msg.text(f"Existing: {_format_labels(existing_labels)}", show=verbose)
+        msg.divider("Text Classification (Exclusive Classes)")
+        labels = _get_labels_from_model(nlp, "textcat")
+        msg.info(f"Text Classification: {len(labels)} label(s)")
+        msg.text(f"Labels: {_format_labels(labels)}", show=verbose)
+        labels_with_counts = _format_labels(
+            gold_train_data["cats"].most_common(), counts=True
+        )
+        msg.text(f"Labels in train data: {labels_with_counts}", show=verbose)
+        missing_labels = labels - set(gold_train_data["cats"].keys())
+        if missing_labels:
+            msg.warn(
+                "Some model labels are not present in the train data. The "
+                "model performance may be degraded for these labels after "
+                f"training: {_format_labels(missing_labels)}."
+            )
+        if gold_train_data["n_cats_multilabel"] > 0:
+            # Note: you should never get here because you run into E895 on
+            # initialization first.
+            msg.warn(
+                "The train data contains instances without "
+                "mutually-exclusive classes. Use the component "
+                "'textcat_multilabel' instead of 'textcat'."
+            )
+        if gold_dev_data["n_cats_multilabel"] > 0:
+            msg.fail(
+                "Train/dev mismatch: the dev data contains instances "
+                "without mutually-exclusive classes while the train data "
+                "contains only instances with mutually-exclusive classes."
+            )
+
+    if "textcat_multilabel" in factory_names:
+        msg.divider("Text Classification (Multilabel)")
+        labels = _get_labels_from_model(nlp, "textcat_multilabel")
+        msg.info(f"Text Classification: {len(labels)} label(s)")
+        msg.text(f"Labels: {_format_labels(labels)}", show=verbose)
+        labels_with_counts = _format_labels(
+            gold_train_data["cats"].most_common(), counts=True
+        )
+        msg.text(f"Labels in train data: {labels_with_counts}", show=verbose)
+        missing_labels = labels - set(gold_train_data["cats"].keys())
+        if missing_labels:
+            msg.warn(
+                "Some model labels are not present in the train data. The "
+                "model performance may be degraded for these labels after "
+                f"training: {_format_labels(missing_labels)}."
+            )
         if set(gold_train_data["cats"]) != set(gold_dev_data["cats"]):
             msg.fail(
                 f"The train and dev labels are not the same. "
@@ -299,11 +331,6 @@ def debug_data(
                 f"Dev labels: {_format_labels(gold_dev_data['cats'])}."
             )
         if gold_train_data["n_cats_multilabel"] > 0:
-            msg.info(
-                "The train data contains instances without "
-                "mutually-exclusive classes. Use '--textcat-multilabel' "
-                "when training."
-            )
             if gold_dev_data["n_cats_multilabel"] == 0:
                 msg.warn(
                     "Potential train/dev mismatch: the train data contains "
@@ -311,9 +338,10 @@ def debug_data(
                     "dev data does not."
                 )
         else:
-            msg.info(
+            msg.warn(
                 "The train data contains only instances with "
-                "mutually-exclusive classes."
+                "mutually-exclusive classes. You can potentially use the "
+                "component 'textcat' instead of 'textcat_multilabel'."
             )
             if gold_dev_data["n_cats_multilabel"] > 0:
                 msg.fail(
@@ -325,13 +353,37 @@ def debug_data(
     if "tagger" in factory_names:
         msg.divider("Part-of-speech Tagging")
         labels = [label for label in gold_train_data["tags"]]
-        # TODO: does this need to be updated?
-        msg.info(f"{len(labels)} label(s) in data")
+        model_labels = _get_labels_from_model(nlp, "tagger")
+        msg.info(f"{len(labels)} label(s) in train data")
+        missing_labels = model_labels - set(labels)
+        if missing_labels:
+            msg.warn(
+                "Some model labels are not present in the train data. The "
+                "model performance may be degraded for these labels after "
+                f"training: {_format_labels(missing_labels)}."
+            )
         labels_with_counts = _format_labels(
             gold_train_data["tags"].most_common(), counts=True
         )
         msg.text(labels_with_counts, show=verbose)

+    if "morphologizer" in factory_names:
+        msg.divider("Morphologizer (POS+Morph)")
+        labels = [label for label in gold_train_data["morphs"]]
+        model_labels = _get_labels_from_model(nlp, "morphologizer")
+        msg.info(f"{len(labels)} label(s) in train data")
+        missing_labels = model_labels - set(labels)
+        if missing_labels:
+            msg.warn(
+                "Some model labels are not present in the train data. The "
+                "model performance may be degraded for these labels after "
+                f"training: {_format_labels(missing_labels)}."
+            )
+        labels_with_counts = _format_labels(
+            gold_train_data["morphs"].most_common(), counts=True
+        )
+        msg.text(labels_with_counts, show=verbose)
+
     if "parser" in factory_names:
         has_low_data_warning = False
         msg.divider("Dependency Parsing")
@@ -491,6 +543,7 @@ def _compile_gold(
         "ner": Counter(),
         "cats": Counter(),
         "tags": Counter(),
+        "morphs": Counter(),
         "deps": Counter(),
         "words": Counter(),
         "roots": Counter(),
@@ -544,13 +597,36 @@ def _compile_gold(
                     data["ner"][combined_label] += 1
                 elif label == "-":
                     data["ner"]["-"] += 1
-        if "textcat" in factory_names:
+        if "textcat" in factory_names or "textcat_multilabel" in factory_names:
             data["cats"].update(gold.cats)
             if list(gold.cats.values()).count(1.0) != 1:
                 data["n_cats_multilabel"] += 1
         if "tagger" in factory_names:
             tags = eg.get_aligned("TAG", as_string=True)
             data["tags"].update([x for x in tags if x is not None])
+        if "morphologizer" in factory_names:
+            pos_tags = eg.get_aligned("POS", as_string=True)
+            morphs = eg.get_aligned("MORPH", as_string=True)
+            for pos, morph in zip(pos_tags, morphs):
+                # POS may align (same value for multiple tokens) when morph
+                # doesn't, so if either is misaligned (None), treat the
+                # annotation as missing so that truths doesn't end up with an
+                # unknown morph+POS combination
+                if pos is None or morph is None:
+                    pass
+                # If both are unset, the annotation is missing (empty morph
+                # converted from int is "_" rather than "")
+                elif pos == "" and morph == "":
+                    pass
+                # Otherwise, generate the combined label
+                else:
+                    label_dict = Morphology.feats_to_dict(morph)
+                    if pos:
+                        label_dict[Morphologizer.POS_FEAT] = pos
+                    label = eg.reference.vocab.strings[
+                        eg.reference.vocab.morphology.add(label_dict)
+                    ]
+                    data["morphs"].update([label])
         if "parser" in factory_names:
             aligned_heads, aligned_deps = eg.get_aligned_parse(projectivize=make_proj)
             data["deps"].update([x for x in aligned_deps if x is not None])
@@ -584,8 +660,8 @@ def _get_examples_without_label(data: Sequence[Example], label: str) -> int:
     return count


-def _get_labels_from_model(nlp: Language, pipe_name: str) -> Sequence[str]:
+def _get_labels_from_model(nlp: Language, pipe_name: str) -> Set[str]:
     if pipe_name not in nlp.pipe_names:
         return set()
     pipe = nlp.get_pipe(pipe_name)
-    return pipe.labels
+    return set(pipe.labels)
spacy/cli/templates/quickstart_training.jinja
@@ -206,7 +206,7 @@ factory = "tok2vec"
 @architectures = "spacy.Tok2Vec.v2"

 [components.tok2vec.model.embed]
-@architectures = "spacy.MultiHashEmbed.v1"
+@architectures = "spacy.MultiHashEmbed.v2"
 width = ${components.tok2vec.model.encode.width}
 {% if has_letters -%}
 attrs = ["NORM", "PREFIX", "SUFFIX", "SHAPE"]
spacy/default_config.cfg
@@ -68,8 +68,11 @@ seed = ${system.seed}
 gpu_allocator = ${system.gpu_allocator}
 dropout = 0.1
 accumulate_gradient = 1
-# Controls early-stopping. 0 or -1 mean unlimited.
+# Controls early-stopping. 0 disables early stopping.
 patience = 1600
+# Number of epochs. 0 means unlimited. If >= 0, train corpus is loaded once in
+# memory and shuffled within the training loop. -1 means stream train corpus
+# rather than loading in memory with no shuffling within the training loop.
 max_epochs = 0
 max_steps = 20000
 eval_frequency = 200
|
@ -157,6 +157,10 @@ class Warnings:
|
||||||
"`spacy.load()` to ensure that the model is loaded on the correct "
|
"`spacy.load()` to ensure that the model is loaded on the correct "
|
||||||
"device. More information: "
|
"device. More information: "
|
||||||
"http://spacy.io/usage/v3#jupyter-notebook-gpu")
|
"http://spacy.io/usage/v3#jupyter-notebook-gpu")
|
||||||
|
W112 = ("The model specified to use for initial vectors ({name}) has no "
|
||||||
|
"vectors. This is almost certainly a mistake.")
|
||||||
|
W113 = ("Sourced component '{name}' may not work as expected: source "
|
||||||
|
"vectors are not identical to current pipeline vectors.")
|
||||||
|
|
||||||
|
|
||||||
@add_codes
|
@add_codes
|
||||||
|
@ -497,6 +501,12 @@ class Errors:
|
||||||
E202 = ("Unsupported alignment mode '{mode}'. Supported modes: {modes}.")
|
E202 = ("Unsupported alignment mode '{mode}'. Supported modes: {modes}.")
|
||||||
|
|
||||||
# New errors added in v3.x
|
# New errors added in v3.x
|
||||||
|
E872 = ("Unable to copy tokenizer from base model due to different "
|
||||||
|
'tokenizer settings: current tokenizer config "{curr_config}" '
|
||||||
|
'vs. base model "{base_config}"')
|
||||||
|
E873 = ("Unable to merge a span from doc.spans with key '{key}' and text "
|
||||||
|
"'{text}'. This is likely a bug in spaCy, so feel free to open an "
|
||||||
|
"issue: https://github.com/explosion/spaCy/issues")
|
||||||
E874 = ("Could not initialize the tok2vec model from component "
|
E874 = ("Could not initialize the tok2vec model from component "
|
||||||
"'{component}' and layer '{layer}'.")
|
"'{component}' and layer '{layer}'.")
|
||||||
E875 = ("To use the PretrainVectors objective, make sure that static vectors are loaded. "
|
E875 = ("To use the PretrainVectors objective, make sure that static vectors are loaded. "
|
||||||
|
@ -631,7 +641,7 @@ class Errors:
|
||||||
"method, make sure it's overwritten on the subclass.")
|
"method, make sure it's overwritten on the subclass.")
|
||||||
E940 = ("Found NaN values in scores.")
|
E940 = ("Found NaN values in scores.")
|
||||||
E941 = ("Can't find model '{name}'. It looks like you're trying to load a "
|
E941 = ("Can't find model '{name}'. It looks like you're trying to load a "
|
||||||
"model from a shortcut, which is deprecated as of spaCy v3.0. To "
|
"model from a shortcut, which is obsolete as of spaCy v3.0. To "
|
||||||
"load the model, use its full name instead:\n\n"
|
"load the model, use its full name instead:\n\n"
|
||||||
"nlp = spacy.load(\"{full}\")\n\nFor more details on the available "
|
"nlp = spacy.load(\"{full}\")\n\nFor more details on the available "
|
||||||
"models, see the models directory: https://spacy.io/models. If you "
|
"models, see the models directory: https://spacy.io/models. If you "
|
||||||
|
@ -646,8 +656,8 @@ class Errors:
|
||||||
"returned the initialized nlp object instead?")
|
"returned the initialized nlp object instead?")
|
||||||
E944 = ("Can't copy pipeline component '{name}' from source '{model}': "
|
E944 = ("Can't copy pipeline component '{name}' from source '{model}': "
|
||||||
"not found in pipeline. Available components: {opts}")
|
"not found in pipeline. Available components: {opts}")
|
||||||
E945 = ("Can't copy pipeline component '{name}' from source. Expected loaded "
|
E945 = ("Can't copy pipeline component '{name}' from source. Expected "
|
||||||
"nlp object, but got: {source}")
|
"loaded nlp object, but got: {source}")
|
||||||
E947 = ("`Matcher.add` received invalid `greedy` argument: expected "
|
E947 = ("`Matcher.add` received invalid `greedy` argument: expected "
|
||||||
"a string value from {expected} but got: '{arg}'")
|
"a string value from {expected} but got: '{arg}'")
|
||||||
E948 = ("`Matcher.add` received invalid 'patterns' argument: expected "
|
E948 = ("`Matcher.add` received invalid 'patterns' argument: expected "
|
||||||
|
|
|
spacy/lang/it/tokenizer_exceptions.py
@@ -17,14 +17,19 @@ _exc = {
 for orth in [
     "..",
     "....",
+    "a.C.",
     "al.",
     "all-path",
     "art.",
     "Art.",
     "artt.",
     "att.",
+    "avv.",
+    "Avv."
     "by-pass",
     "c.d.",
+    "c/c",
+    "C.so",
     "centro-sinistra",
     "check-up",
     "Civ.",
@@ -48,6 +53,8 @@ for orth in [
     "prof.",
     "sett.",
     "s.p.a.",
+    "s.n.c",
+    "s.r.l",
     "ss.",
     "St.",
     "tel.",
spacy/language.py
@@ -682,9 +682,14 @@ class Language:
         name (str): Optional alternative name to use in current pipeline.
         RETURNS (Tuple[Callable, str]): The component and its factory name.
         """
-        # TODO: handle errors and mismatches (vectors etc.)
-        if not isinstance(source, self.__class__):
+        # Check source type
+        if not isinstance(source, Language):
             raise ValueError(Errors.E945.format(name=source_name, source=type(source)))
+        # Check vectors, with faster checks first
+        if self.vocab.vectors.shape != source.vocab.vectors.shape or \
+                self.vocab.vectors.key2row != source.vocab.vectors.key2row or \
+                self.vocab.vectors.to_bytes() != source.vocab.vectors.to_bytes():
+            util.logger.warning(Warnings.W113.format(name=source_name))
         if not source_name in source.component_names:
             raise KeyError(
                 Errors.E944.format(
@@ -1673,7 +1678,16 @@ class Language:
                         # model with the same vocab as the current nlp object
                         source_nlps[model] = util.load_model(model, vocab=nlp.vocab)
                     source_name = pipe_cfg.get("component", pipe_name)
+                    listeners_replaced = False
+                    if "replace_listeners" in pipe_cfg:
+                        for name, proc in source_nlps[model].pipeline:
+                            if source_name in getattr(proc, "listening_components", []):
+                                source_nlps[model].replace_listeners(name, source_name, pipe_cfg["replace_listeners"])
+                                listeners_replaced = True
                     nlp.add_pipe(source_name, source=source_nlps[model], name=pipe_name)
+                    # Delete from cache if listeners were replaced
+                    if listeners_replaced:
+                        del source_nlps[model]
         disabled_pipes = [*config["nlp"]["disabled"], *disable]
         nlp._disabled = set(p for p in disabled_pipes if p not in exclude)
         nlp.batch_size = config["nlp"]["batch_size"]
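Both changes above sit on the code path used when sourcing components from another pipeline. For context, a minimal usage sketch (it assumes the en_core_web_sm package is installed; warning W113 is only logged if the source pipeline's vectors differ from the current pipeline's vectors):

    import spacy

    # Pipeline to source components from
    source_nlp = spacy.load("en_core_web_sm")

    # Copy the trained NER component into a fresh pipeline. If the vocab
    # vectors of the two pipelines are not identical, spaCy now logs W113
    # instead of silently copying the component.
    nlp = spacy.blank("en")
    nlp.add_pipe("ner", source=source_nlp)
    print(nlp.pipe_names)  # ['ner']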
spacy/matcher/dependencymatcher.pyx
@@ -299,7 +299,7 @@ cdef class DependencyMatcher:
         if isinstance(doclike, Doc):
             doc = doclike
         elif isinstance(doclike, Span):
-            doc = doclike.as_doc()
+            doc = doclike.as_doc(copy_user_data=True)
         else:
             raise ValueError(Errors.E195.format(good="Doc or Span", got=type(doclike).__name__))

spacy/matcher/matcher.pxd
@@ -46,6 +46,12 @@ cdef struct TokenPatternC:
     int32_t nr_py
     quantifier_t quantifier
     hash_t key
+    int32_t token_idx
+
+
+cdef struct MatchAlignmentC:
+    int32_t token_idx
+    int32_t length


 cdef struct PatternStateC:
spacy/matcher/matcher.pyx
@@ -196,7 +196,7 @@ cdef class Matcher:
         else:
             yield doc

-    def __call__(self, object doclike, *, as_spans=False, allow_missing=False):
+    def __call__(self, object doclike, *, as_spans=False, allow_missing=False, with_alignments=False):
         """Find all token sequences matching the supplied pattern.

         doclike (Doc or Span): The document to match over.
@@ -204,10 +204,16 @@ cdef class Matcher:
             start, end) tuples.
         allow_missing (bool): Whether to skip checks for missing annotation for
             attributes included in patterns. Defaults to False.
+        with_alignments (bool): Return match alignment information, which is
+            `List[int]` with length of matched span. Each entry denotes the
+            corresponding index of token pattern. If as_spans is set to True,
+            this setting is ignored.
         RETURNS (list): A list of `(match_id, start, end)` tuples,
             describing the matches. A match tuple describes a span
             `doc[start:end]`. The `match_id` is an integer. If as_spans is set
             to True, a list of Span objects is returned.
+            If with_alignments is set to True and as_spans is set to False,
+            A list of `(match_id, start, end, alignments)` tuples is returned.
         """
         if isinstance(doclike, Doc):
             doc = doclike
@@ -217,6 +223,9 @@ cdef class Matcher:
             length = doclike.end - doclike.start
         else:
             raise ValueError(Errors.E195.format(good="Doc or Span", got=type(doclike).__name__))
+        # Skip alignments calculations if as_spans is set
+        if as_spans:
+            with_alignments = False
         cdef Pool tmp_pool = Pool()
         if not allow_missing:
             for attr in (TAG, POS, MORPH, LEMMA, DEP):
@@ -232,18 +241,20 @@ cdef class Matcher:
                     error_msg = Errors.E155.format(pipe=pipe, attr=self.vocab.strings.as_string(attr))
                     raise ValueError(error_msg)
         matches = find_matches(&self.patterns[0], self.patterns.size(), doclike, length,
-                               extensions=self._extensions, predicates=self._extra_predicates)
+                               extensions=self._extensions, predicates=self._extra_predicates, with_alignments=with_alignments)
         final_matches = []
         pairs_by_id = {}
-        # For each key, either add all matches, or only the filtered, non-overlapping ones
-        for (key, start, end) in matches:
+        # For each key, either add all matches, or only the filtered,
+        # non-overlapping ones this `match` can be either (start, end) or
+        # (start, end, alignments) depending on `with_alignments=` option.
+        for key, *match in matches:
             span_filter = self._filter.get(key)
             if span_filter is not None:
                 pairs = pairs_by_id.get(key, [])
-                pairs.append((start,end))
+                pairs.append(match)
                 pairs_by_id[key] = pairs
             else:
-                final_matches.append((key, start, end))
+                final_matches.append((key, *match))
         matched = <char*>tmp_pool.alloc(length, sizeof(char))
         empty = <char*>tmp_pool.alloc(length, sizeof(char))
         for key, pairs in pairs_by_id.items():
@@ -255,14 +266,18 @@ cdef class Matcher:
                 sorted_pairs = sorted(pairs, key=lambda x: (x[1]-x[0], -x[0]), reverse=True)  # reverse sort by length
             else:
                 raise ValueError(Errors.E947.format(expected=["FIRST", "LONGEST"], arg=span_filter))
-            for (start, end) in sorted_pairs:
+            for match in sorted_pairs:
+                start, end = match[:2]
                 assert 0 <= start < end  # Defend against segfaults
                 span_len = end-start
                 # If no tokens in the span have matched
                 if memcmp(&matched[start], &empty[start], span_len * sizeof(matched[0])) == 0:
-                    final_matches.append((key, start, end))
+                    final_matches.append((key, *match))
                     # Mark tokens that have matched
                     memset(&matched[start], 1, span_len * sizeof(matched[0]))
+        if with_alignments:
+            final_matches_with_alignments = final_matches
+            final_matches = [(key, start, end) for key, start, end, alignments in final_matches]
         # perform the callbacks on the filtered set of results
         for i, (key, start, end) in enumerate(final_matches):
             on_match = self._callbacks.get(key, None)
@@ -270,6 +285,22 @@ cdef class Matcher:
                 on_match(self, doc, i, final_matches)
         if as_spans:
             return [Span(doc, start, end, label=key) for key, start, end in final_matches]
+        elif with_alignments:
+            # convert alignments List[Dict[str, int]] --> List[int]
+            final_matches = []
+            # when multiple alignment (belongs to the same length) is found,
+            # keeps the alignment that has largest token_idx
+            for key, start, end, alignments in final_matches_with_alignments:
+                sorted_alignments = sorted(alignments, key=lambda x: (x['length'], x['token_idx']), reverse=False)
+                alignments = [0] * (end-start)
+                for align in sorted_alignments:
+                    if align['length'] >= end-start:
+                        continue
+                    # Since alignments are sorted in order of (length, token_idx)
+                    # this overwrites smaller token_idx when they have same length.
+                    alignments[align['length']] = align['token_idx']
+                final_matches.append((key, start, end, alignments))
+            return final_matches
         else:
             return final_matches

@@ -288,9 +319,9 @@ def unpickle_matcher(vocab, patterns, callbacks):
     return matcher


-cdef find_matches(TokenPatternC** patterns, int n, object doclike, int length, extensions=None, predicates=tuple()):
+cdef find_matches(TokenPatternC** patterns, int n, object doclike, int length, extensions=None, predicates=tuple(), bint with_alignments=0):
     """Find matches in a doc, with a compiled array of patterns. Matches are
-    returned as a list of (id, start, end) tuples.
+    returned as a list of (id, start, end) tuples or (id, start, end, alignments) tuples (if with_alignments != 0)

     To augment the compiled patterns, we optionally also take two Python lists.

@@ -302,6 +333,8 @@ cdef find_matches(TokenPatternC** patterns, int n, object doclike, int length, e
     """
     cdef vector[PatternStateC] states
     cdef vector[MatchC] matches
+    cdef vector[vector[MatchAlignmentC]] align_states
+    cdef vector[vector[MatchAlignmentC]] align_matches
     cdef PatternStateC state
     cdef int i, j, nr_extra_attr
     cdef Pool mem = Pool()
@@ -328,12 +361,14 @@ cdef find_matches(TokenPatternC** patterns, int n, object doclike, int length, e
     for i in range(length):
         for j in range(n):
             states.push_back(PatternStateC(patterns[j], i, 0))
-        transition_states(states, matches, predicate_cache,
-                          doclike[i], extra_attr_values, predicates)
+        if with_alignments != 0:
+            align_states.resize(states.size())
+        transition_states(states, matches, align_states, align_matches, predicate_cache,
+                          doclike[i], extra_attr_values, predicates, with_alignments)
|
||||||
extra_attr_values += nr_extra_attr
|
extra_attr_values += nr_extra_attr
|
||||||
predicate_cache += len(predicates)
|
predicate_cache += len(predicates)
|
||||||
# Handle matches that end in 0-width patterns
|
# Handle matches that end in 0-width patterns
|
||||||
finish_states(matches, states)
|
finish_states(matches, states, align_matches, align_states, with_alignments)
|
||||||
seen = set()
|
seen = set()
|
||||||
for i in range(matches.size()):
|
for i in range(matches.size()):
|
||||||
match = (
|
match = (
|
||||||
|
@ -346,16 +381,22 @@ cdef find_matches(TokenPatternC** patterns, int n, object doclike, int length, e
|
||||||
# first .?, or the second .? -- it doesn't matter, it's just one match.
|
# first .?, or the second .? -- it doesn't matter, it's just one match.
|
||||||
# Skip 0-length matches. (TODO: fix algorithm)
|
# Skip 0-length matches. (TODO: fix algorithm)
|
||||||
if match not in seen and matches[i].length > 0:
|
if match not in seen and matches[i].length > 0:
|
||||||
output.append(match)
|
if with_alignments != 0:
|
||||||
|
# since the length of align_matches equals to that of match, we can share same 'i'
|
||||||
|
output.append(match + (align_matches[i],))
|
||||||
|
else:
|
||||||
|
output.append(match)
|
||||||
seen.add(match)
|
seen.add(match)
|
||||||
return output
|
return output
|
||||||
|
|
||||||
|
|
||||||
cdef void transition_states(vector[PatternStateC]& states, vector[MatchC]& matches,
|
cdef void transition_states(vector[PatternStateC]& states, vector[MatchC]& matches,
|
||||||
|
vector[vector[MatchAlignmentC]]& align_states, vector[vector[MatchAlignmentC]]& align_matches,
|
||||||
int8_t* cached_py_predicates,
|
int8_t* cached_py_predicates,
|
||||||
Token token, const attr_t* extra_attrs, py_predicates) except *:
|
Token token, const attr_t* extra_attrs, py_predicates, bint with_alignments) except *:
|
||||||
cdef int q = 0
|
cdef int q = 0
|
||||||
cdef vector[PatternStateC] new_states
|
cdef vector[PatternStateC] new_states
|
||||||
|
cdef vector[vector[MatchAlignmentC]] align_new_states
|
||||||
cdef int nr_predicate = len(py_predicates)
|
cdef int nr_predicate = len(py_predicates)
|
||||||
for i in range(states.size()):
|
for i in range(states.size()):
|
||||||
if states[i].pattern.nr_py >= 1:
|
if states[i].pattern.nr_py >= 1:
|
||||||
|
@ -370,23 +411,39 @@ cdef void transition_states(vector[PatternStateC]& states, vector[MatchC]& match
|
||||||
# it in the states list, because q doesn't advance.
|
# it in the states list, because q doesn't advance.
|
||||||
state = states[i]
|
state = states[i]
|
||||||
states[q] = state
|
states[q] = state
|
||||||
|
# Separate from states, performance is guaranteed for users who only need basic options (without alignments).
|
||||||
|
# `align_states` always corresponds to `states` 1:1.
|
||||||
|
if with_alignments != 0:
|
||||||
|
align_state = align_states[i]
|
||||||
|
align_states[q] = align_state
|
||||||
while action in (RETRY, RETRY_ADVANCE, RETRY_EXTEND):
|
while action in (RETRY, RETRY_ADVANCE, RETRY_EXTEND):
|
||||||
|
# Update alignment before the transition of current state
|
||||||
|
# 'MatchAlignmentC' maps 'original token index of current pattern' to 'current matching length'
|
||||||
|
if with_alignments != 0:
|
||||||
|
align_states[q].push_back(MatchAlignmentC(states[q].pattern.token_idx, states[q].length))
|
||||||
if action == RETRY_EXTEND:
|
if action == RETRY_EXTEND:
|
||||||
# This handles the 'extend'
|
# This handles the 'extend'
|
||||||
new_states.push_back(
|
new_states.push_back(
|
||||||
PatternStateC(pattern=states[q].pattern, start=state.start,
|
PatternStateC(pattern=states[q].pattern, start=state.start,
|
||||||
length=state.length+1))
|
length=state.length+1))
|
||||||
|
if with_alignments != 0:
|
||||||
|
align_new_states.push_back(align_states[q])
|
||||||
if action == RETRY_ADVANCE:
|
if action == RETRY_ADVANCE:
|
||||||
# This handles the 'advance'
|
# This handles the 'advance'
|
||||||
new_states.push_back(
|
new_states.push_back(
|
||||||
PatternStateC(pattern=states[q].pattern+1, start=state.start,
|
PatternStateC(pattern=states[q].pattern+1, start=state.start,
|
||||||
length=state.length+1))
|
length=state.length+1))
|
||||||
|
if with_alignments != 0:
|
||||||
|
align_new_states.push_back(align_states[q])
|
||||||
states[q].pattern += 1
|
states[q].pattern += 1
|
||||||
if states[q].pattern.nr_py != 0:
|
if states[q].pattern.nr_py != 0:
|
||||||
update_predicate_cache(cached_py_predicates,
|
update_predicate_cache(cached_py_predicates,
|
||||||
states[q].pattern, token, py_predicates)
|
states[q].pattern, token, py_predicates)
|
||||||
action = get_action(states[q], token.c, extra_attrs,
|
action = get_action(states[q], token.c, extra_attrs,
|
||||||
cached_py_predicates)
|
cached_py_predicates)
|
||||||
|
# Update alignment before the transition of current state
|
||||||
|
if with_alignments != 0:
|
||||||
|
align_states[q].push_back(MatchAlignmentC(states[q].pattern.token_idx, states[q].length))
|
||||||
if action == REJECT:
|
if action == REJECT:
|
||||||
pass
|
pass
|
||||||
elif action == ADVANCE:
|
elif action == ADVANCE:
|
||||||
|
@ -399,29 +456,50 @@ cdef void transition_states(vector[PatternStateC]& states, vector[MatchC]& match
|
||||||
matches.push_back(
|
matches.push_back(
|
||||||
MatchC(pattern_id=ent_id, start=state.start,
|
MatchC(pattern_id=ent_id, start=state.start,
|
||||||
length=state.length+1))
|
length=state.length+1))
|
||||||
|
# `align_matches` always corresponds to `matches` 1:1
|
||||||
|
if with_alignments != 0:
|
||||||
|
align_matches.push_back(align_states[q])
|
||||||
elif action == MATCH_DOUBLE:
|
elif action == MATCH_DOUBLE:
|
||||||
# push match without last token if length > 0
|
# push match without last token if length > 0
|
||||||
if state.length > 0:
|
if state.length > 0:
|
||||||
matches.push_back(
|
matches.push_back(
|
||||||
MatchC(pattern_id=ent_id, start=state.start,
|
MatchC(pattern_id=ent_id, start=state.start,
|
||||||
length=state.length))
|
length=state.length))
|
||||||
|
# MATCH_DOUBLE emits matches twice,
|
||||||
|
# add one more to align_matches in order to keep 1:1 relationship
|
||||||
|
if with_alignments != 0:
|
||||||
|
align_matches.push_back(align_states[q])
|
||||||
# push match with last token
|
# push match with last token
|
||||||
matches.push_back(
|
matches.push_back(
|
||||||
MatchC(pattern_id=ent_id, start=state.start,
|
MatchC(pattern_id=ent_id, start=state.start,
|
||||||
length=state.length+1))
|
length=state.length+1))
|
||||||
|
# `align_matches` always corresponds to `matches` 1:1
|
||||||
|
if with_alignments != 0:
|
||||||
|
align_matches.push_back(align_states[q])
|
||||||
elif action == MATCH_REJECT:
|
elif action == MATCH_REJECT:
|
||||||
matches.push_back(
|
matches.push_back(
|
||||||
MatchC(pattern_id=ent_id, start=state.start,
|
MatchC(pattern_id=ent_id, start=state.start,
|
||||||
length=state.length))
|
length=state.length))
|
||||||
|
# `align_matches` always corresponds to `matches` 1:1
|
||||||
|
if with_alignments != 0:
|
||||||
|
align_matches.push_back(align_states[q])
|
||||||
elif action == MATCH_EXTEND:
|
elif action == MATCH_EXTEND:
|
||||||
matches.push_back(
|
matches.push_back(
|
||||||
MatchC(pattern_id=ent_id, start=state.start,
|
MatchC(pattern_id=ent_id, start=state.start,
|
||||||
length=state.length))
|
length=state.length))
|
||||||
|
# `align_matches` always corresponds to `matches` 1:1
|
||||||
|
if with_alignments != 0:
|
||||||
|
align_matches.push_back(align_states[q])
|
||||||
states[q].length += 1
|
states[q].length += 1
|
||||||
q += 1
|
q += 1
|
||||||
states.resize(q)
|
states.resize(q)
|
||||||
for i in range(new_states.size()):
|
for i in range(new_states.size()):
|
||||||
states.push_back(new_states[i])
|
states.push_back(new_states[i])
|
||||||
|
# `align_states` always corresponds to `states` 1:1
|
||||||
|
if with_alignments != 0:
|
||||||
|
align_states.resize(q)
|
||||||
|
for i in range(align_new_states.size()):
|
||||||
|
align_states.push_back(align_new_states[i])
|
||||||
|
|
||||||
|
|
||||||
cdef int update_predicate_cache(int8_t* cache,
|
cdef int update_predicate_cache(int8_t* cache,
|
||||||
|
@ -444,15 +522,27 @@ cdef int update_predicate_cache(int8_t* cache,
|
||||||
raise ValueError(Errors.E125.format(value=result))
|
raise ValueError(Errors.E125.format(value=result))
|
||||||
|
|
||||||
|
|
||||||
cdef void finish_states(vector[MatchC]& matches, vector[PatternStateC]& states) except *:
|
cdef void finish_states(vector[MatchC]& matches, vector[PatternStateC]& states,
|
||||||
|
vector[vector[MatchAlignmentC]]& align_matches,
|
||||||
|
vector[vector[MatchAlignmentC]]& align_states,
|
||||||
|
bint with_alignments) except *:
|
||||||
"""Handle states that end in zero-width patterns."""
|
"""Handle states that end in zero-width patterns."""
|
||||||
cdef PatternStateC state
|
cdef PatternStateC state
|
||||||
|
cdef vector[MatchAlignmentC] align_state
|
||||||
for i in range(states.size()):
|
for i in range(states.size()):
|
||||||
state = states[i]
|
state = states[i]
|
||||||
|
if with_alignments != 0:
|
||||||
|
align_state = align_states[i]
|
||||||
while get_quantifier(state) in (ZERO_PLUS, ZERO_ONE):
|
while get_quantifier(state) in (ZERO_PLUS, ZERO_ONE):
|
||||||
|
# Update alignment before the transition of current state
|
||||||
|
if with_alignments != 0:
|
||||||
|
align_state.push_back(MatchAlignmentC(state.pattern.token_idx, state.length))
|
||||||
is_final = get_is_final(state)
|
is_final = get_is_final(state)
|
||||||
if is_final:
|
if is_final:
|
||||||
ent_id = get_ent_id(state.pattern)
|
ent_id = get_ent_id(state.pattern)
|
||||||
|
# `align_matches` always corresponds to `matches` 1:1
|
||||||
|
if with_alignments != 0:
|
||||||
|
align_matches.push_back(align_state)
|
||||||
matches.push_back(
|
matches.push_back(
|
||||||
MatchC(pattern_id=ent_id, start=state.start, length=state.length))
|
MatchC(pattern_id=ent_id, start=state.start, length=state.length))
|
||||||
break
|
break
|
||||||
|
@ -607,7 +697,7 @@ cdef int8_t get_quantifier(PatternStateC state) nogil:
|
||||||
cdef TokenPatternC* init_pattern(Pool mem, attr_t entity_id, object token_specs) except NULL:
|
cdef TokenPatternC* init_pattern(Pool mem, attr_t entity_id, object token_specs) except NULL:
|
||||||
pattern = <TokenPatternC*>mem.alloc(len(token_specs) + 1, sizeof(TokenPatternC))
|
pattern = <TokenPatternC*>mem.alloc(len(token_specs) + 1, sizeof(TokenPatternC))
|
||||||
cdef int i, index
|
cdef int i, index
|
||||||
for i, (quantifier, spec, extensions, predicates) in enumerate(token_specs):
|
for i, (quantifier, spec, extensions, predicates, token_idx) in enumerate(token_specs):
|
||||||
pattern[i].quantifier = quantifier
|
pattern[i].quantifier = quantifier
|
||||||
# Ensure attrs refers to a null pointer if nr_attr == 0
|
# Ensure attrs refers to a null pointer if nr_attr == 0
|
||||||
if len(spec) > 0:
|
if len(spec) > 0:
|
||||||
|
@ -628,6 +718,7 @@ cdef TokenPatternC* init_pattern(Pool mem, attr_t entity_id, object token_specs)
|
||||||
pattern[i].py_predicates[j] = index
|
pattern[i].py_predicates[j] = index
|
||||||
pattern[i].nr_py = len(predicates)
|
pattern[i].nr_py = len(predicates)
|
||||||
pattern[i].key = hash64(pattern[i].attrs, pattern[i].nr_attr * sizeof(AttrValueC), 0)
|
pattern[i].key = hash64(pattern[i].attrs, pattern[i].nr_attr * sizeof(AttrValueC), 0)
|
||||||
|
pattern[i].token_idx = token_idx
|
||||||
i = len(token_specs)
|
i = len(token_specs)
|
||||||
# Use quantifier to identify final ID pattern node (rather than previous
|
# Use quantifier to identify final ID pattern node (rather than previous
|
||||||
# uninitialized quantifier == 0/ZERO + nr_attr == 0 + non-zero-length attrs)
|
# uninitialized quantifier == 0/ZERO + nr_attr == 0 + non-zero-length attrs)
|
||||||
|
@ -638,6 +729,7 @@ cdef TokenPatternC* init_pattern(Pool mem, attr_t entity_id, object token_specs)
|
||||||
pattern[i].nr_attr = 1
|
pattern[i].nr_attr = 1
|
||||||
pattern[i].nr_extra_attr = 0
|
pattern[i].nr_extra_attr = 0
|
||||||
pattern[i].nr_py = 0
|
pattern[i].nr_py = 0
|
||||||
|
pattern[i].token_idx = -1
|
||||||
return pattern
|
return pattern
|
||||||
|
|
||||||
|
|
||||||
|
@ -655,7 +747,7 @@ def _preprocess_pattern(token_specs, vocab, extensions_table, extra_predicates):
|
||||||
"""This function interprets the pattern, converting the various bits of
|
"""This function interprets the pattern, converting the various bits of
|
||||||
syntactic sugar before we compile it into a struct with init_pattern.
|
syntactic sugar before we compile it into a struct with init_pattern.
|
||||||
|
|
||||||
We need to split the pattern up into three parts:
|
We need to split the pattern up into four parts:
|
||||||
* Normal attribute/value pairs, which are stored on either the token or lexeme,
|
* Normal attribute/value pairs, which are stored on either the token or lexeme,
|
||||||
can be handled directly.
|
can be handled directly.
|
||||||
* Extension attributes are handled specially, as we need to prefetch the
|
* Extension attributes are handled specially, as we need to prefetch the
|
||||||
|
@ -664,13 +756,14 @@ def _preprocess_pattern(token_specs, vocab, extensions_table, extra_predicates):
|
||||||
functions and store them. So we store these specially as well.
|
functions and store them. So we store these specially as well.
|
||||||
* Extension attributes that have extra predicates are stored within the
|
* Extension attributes that have extra predicates are stored within the
|
||||||
extra_predicates.
|
extra_predicates.
|
||||||
|
* Token index that this pattern belongs to.
|
||||||
"""
|
"""
|
||||||
tokens = []
|
tokens = []
|
||||||
string_store = vocab.strings
|
string_store = vocab.strings
|
||||||
for spec in token_specs:
|
for token_idx, spec in enumerate(token_specs):
|
||||||
if not spec:
|
if not spec:
|
||||||
# Signifier for 'any token'
|
# Signifier for 'any token'
|
||||||
tokens.append((ONE, [(NULL_ATTR, 0)], [], []))
|
tokens.append((ONE, [(NULL_ATTR, 0)], [], [], token_idx))
|
||||||
continue
|
continue
|
||||||
if not isinstance(spec, dict):
|
if not isinstance(spec, dict):
|
||||||
raise ValueError(Errors.E154.format())
|
raise ValueError(Errors.E154.format())
|
||||||
|
@ -679,7 +772,7 @@ def _preprocess_pattern(token_specs, vocab, extensions_table, extra_predicates):
|
||||||
extensions = _get_extensions(spec, string_store, extensions_table)
|
extensions = _get_extensions(spec, string_store, extensions_table)
|
||||||
predicates = _get_extra_predicates(spec, extra_predicates, vocab)
|
predicates = _get_extra_predicates(spec, extra_predicates, vocab)
|
||||||
for op in ops:
|
for op in ops:
|
||||||
tokens.append((op, list(attr_values), list(extensions), list(predicates)))
|
tokens.append((op, list(attr_values), list(extensions), list(predicates), token_idx))
|
||||||
return tokens
|
return tokens
|
||||||
|
|
||||||
|
|
||||||
|
|
|
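The hunks above thread an optional with_alignments flag through Matcher.__call__, find_matches, transition_states and finish_states, so each match can also report which pattern token matched which document token. A minimal usage sketch, assuming a spaCy build that includes this change (the pattern and text are illustrative only):

    import spacy
    from spacy.matcher import Matcher

    nlp = spacy.blank("en")
    doc = nlp("a a a b")

    matcher = Matcher(nlp.vocab)
    # One greedy pattern: one or more "a" tokens followed by a literal "b".
    matcher.add("PATTERN", [[{"ORTH": "a", "OP": "+"}, {"ORTH": "b"}]], greedy="LONGEST")

    # With with_alignments=True each match is (match_id, start, end, alignments),
    # where alignments[i] is the pattern-token index that matched doc[start + i].
    for match_id, start, end, alignments in matcher(doc, with_alignments=True):
        print(doc[start:end].text, alignments)  # "a a a b" -> [0, 0, 0, 1]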

@@ -3,8 +3,10 @@ from thinc.api import Model
 from thinc.types import Floats2d

 from ..tokens import Doc
+from ..util import registry


+@registry.layers("spacy.CharEmbed.v1")
 def CharacterEmbed(nM: int, nC: int) -> Model[List[Doc], List[Floats2d]]:
 # nM: Number of dimensions per character. nC: Number of characters.
 return Model(

@@ -31,7 +31,7 @@ def get_tok2vec_width(model: Model):
 return nO


-@registry.architectures("spacy.HashEmbedCNN.v1")
+@registry.architectures("spacy.HashEmbedCNN.v2")
 def build_hash_embed_cnn_tok2vec(
 *,
 width: int,
@@ -108,7 +108,7 @@ def build_Tok2Vec_model(
 return tok2vec


-@registry.architectures("spacy.MultiHashEmbed.v1")
+@registry.architectures("spacy.MultiHashEmbed.v2")
 def MultiHashEmbed(
 width: int,
 attrs: List[Union[str, int]],
@@ -182,7 +182,7 @@ def MultiHashEmbed(
 return model


-@registry.architectures("spacy.CharacterEmbed.v1")
+@registry.architectures("spacy.CharacterEmbed.v2")
 def CharacterEmbed(
 width: int,
 rows: int,

@@ -8,7 +8,7 @@ from ..tokens import Doc
 from ..errors import Errors


-@registry.layers("spacy.StaticVectors.v1")
+@registry.layers("spacy.StaticVectors.v2")
 def StaticVectors(
 nO: Optional[int] = None,
 nM: Optional[int] = None,
@@ -38,7 +38,7 @@ def forward(
 return _handle_empty(model.ops, model.get_dim("nO"))
 key_attr = model.attrs["key_attr"]
 W = cast(Floats2d, model.ops.as_contig(model.get_param("W")))
-V = cast(Floats2d, docs[0].vocab.vectors.data)
+V = cast(Floats2d, model.ops.asarray(docs[0].vocab.vectors.data))
 rows = model.ops.flatten(
 [doc.vocab.vectors.find(keys=doc.to_array(key_attr)) for doc in docs]
 )
@@ -46,6 +46,8 @@ def forward(
 vectors_data = model.ops.gemm(model.ops.as_contig(V[rows]), W, trans2=True)
 except ValueError:
 raise RuntimeError(Errors.E896)
+# Convert negative indices to 0-vectors (TODO: more options for UNK tokens)
+vectors_data[rows < 0] = 0
 output = Ragged(
 vectors_data, model.ops.asarray([len(doc) for doc in docs], dtype="i")
 )
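The two added lines guard against tokens whose key is missing from the vectors table: as the new comment says, negative row indices mark unknown keys, and those rows are zeroed instead of carrying stale values. The masking itself is plain NumPy; a small sketch of the idea (the array values are made up):

    import numpy

    rows = numpy.array([12, -1, 7, -1])      # -1 marks tokens with no vector
    vectors_data = numpy.ones((4, 3))        # stand-in for the gemm output
    vectors_data[rows < 0] = 0               # unknown tokens become 0-vectors
    print(vectors_data[1], vectors_data[3])  # [0. 0. 0.] [0. 0. 0.]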

@@ -24,7 +24,7 @@ maxout_pieces = 2
 use_upper = true

 [model.tok2vec]
-@architectures = "spacy.HashEmbedCNN.v1"
+@architectures = "spacy.HashEmbedCNN.v2"
 pretrained_vectors = null
 width = 96
 depth = 4

@@ -26,7 +26,7 @@ default_model_config = """
 @architectures = "spacy.EntityLinker.v1"

 [model.tok2vec]
-@architectures = "spacy.HashEmbedCNN.v1"
+@architectures = "spacy.HashEmbedCNN.v2"
 pretrained_vectors = null
 width = 96
 depth = 2
@@ -300,77 +300,77 @@ class EntityLinker(TrainablePipe):
 for i, doc in enumerate(docs):
 sentences = [s for s in doc.sents]
 if len(doc) > 0:
-# Looping through each sentence and each entity
-# This may go wrong if there are entities across sentences - which shouldn't happen normally.
-for sent_index, sent in enumerate(sentences):
-if sent.ents:
-# get n_neightbour sentences, clipped to the length of the document
-start_sentence = max(0, sent_index - self.n_sents)
-end_sentence = min(
-len(sentences) - 1, sent_index + self.n_sents
-)
-start_token = sentences[start_sentence].start
-end_token = sentences[end_sentence].end
-sent_doc = doc[start_token:end_token].as_doc()
-# currently, the context is the same for each entity in a sentence (should be refined)
-xp = self.model.ops.xp
-if self.incl_context:
-sentence_encoding = self.model.predict([sent_doc])[0]
-sentence_encoding_t = sentence_encoding.T
-sentence_norm = xp.linalg.norm(sentence_encoding_t)
-for ent in sent.ents:
+# Looping through each entity (TODO: rewrite)
+for ent in doc.ents:
+sent = ent.sent
+sent_index = sentences.index(sent)
+assert sent_index >= 0
+# get n_neightbour sentences, clipped to the length of the document
+start_sentence = max(0, sent_index - self.n_sents)
+end_sentence = min(
+len(sentences) - 1, sent_index + self.n_sents
+)
+start_token = sentences[start_sentence].start
+end_token = sentences[end_sentence].end
+sent_doc = doc[start_token:end_token].as_doc()
+# currently, the context is the same for each entity in a sentence (should be refined)
+xp = self.model.ops.xp
+if self.incl_context:
+sentence_encoding = self.model.predict([sent_doc])[0]
+sentence_encoding_t = sentence_encoding.T
+sentence_norm = xp.linalg.norm(sentence_encoding_t)
 entity_count += 1
 if ent.label_ in self.labels_discard:
 # ignoring this entity - setting to NIL
 final_kb_ids.append(self.NIL)
 else:
 candidates = self.get_candidates(self.kb, ent)
 if not candidates:
 # no prediction possible for this entity - setting to NIL
 final_kb_ids.append(self.NIL)
 elif len(candidates) == 1:
 # shortcut for efficiency reasons: take the 1 candidate
 # TODO: thresholding
 final_kb_ids.append(candidates[0].entity_)
 else:
 random.shuffle(candidates)
 # set all prior probabilities to 0 if incl_prior=False
 prior_probs = xp.asarray(
 [c.prior_prob for c in candidates]
+)
+if not self.incl_prior:
+prior_probs = xp.asarray(
+[0.0 for _ in candidates]
+)
+scores = prior_probs
+# add in similarity from the context
+if self.incl_context:
+entity_encodings = xp.asarray(
+[c.entity_vector for c in candidates]
+)
+entity_norm = xp.linalg.norm(
+entity_encodings, axis=1
+)
+if len(entity_encodings) != len(prior_probs):
+raise RuntimeError(
+Errors.E147.format(
+method="predict",
+msg="vectors not of equal length",
+)
 )
-if not self.incl_prior:
-prior_probs = xp.asarray(
-[0.0 for _ in candidates]
-)
-scores = prior_probs
-# add in similarity from the context
-if self.incl_context:
-entity_encodings = xp.asarray(
-[c.entity_vector for c in candidates]
-)
-entity_norm = xp.linalg.norm(
-entity_encodings, axis=1
-)
+# cosine similarity
+sims = xp.dot(
+entity_encodings, sentence_encoding_t
+) / (sentence_norm * entity_norm)
+if sims.shape != prior_probs.shape:
+raise ValueError(Errors.E161)
+scores = (
+prior_probs + sims - (prior_probs * sims)
+)
+# TODO: thresholding
+best_index = scores.argmax().item()
+best_candidate = candidates[best_index]
+final_kb_ids.append(best_candidate.entity_)
-if len(entity_encodings) != len(prior_probs):
-raise RuntimeError(
-Errors.E147.format(
-method="predict",
-msg="vectors not of equal length",
-)
-)
-# cosine similarity
-sims = xp.dot(
-entity_encodings, sentence_encoding_t
-) / (sentence_norm * entity_norm)
-if sims.shape != prior_probs.shape:
-raise ValueError(Errors.E161)
-scores = (
-prior_probs + sims - (prior_probs * sims)
-)
-# TODO: thresholding
-best_index = scores.argmax().item()
-best_candidate = candidates[best_index]
-final_kb_ids.append(best_candidate.entity_)
 if not (len(final_kb_ids) == entity_count):
 err = Errors.E147.format(
 method="predict", msg="result variables not of equal length"
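The rewritten loop keeps the same scoring rule as before: the knowledge-base prior and the context similarity are combined as prior + sim - prior * sim, i.e. a probabilistic OR of the two signals. A scalar sketch of that combination (the numbers are invented):

    prior_prob = 0.6   # prior probability from the knowledge base
    sim = 0.5          # cosine similarity of sentence and entity encodings
    score = prior_prob + sim - (prior_prob * sim)
    print(score)       # 0.8 -- high when either signal is high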

@@ -175,7 +175,7 @@ class Lemmatizer(Pipe):

 DOCS: https://spacy.io/api/lemmatizer#rule_lemmatize
 """
-cache_key = (token.orth, token.pos, token.morph)
+cache_key = (token.orth, token.pos, token.morph.key)
 if cache_key in self.cache:
 return self.cache[cache_key]
 string = token.text

@@ -27,7 +27,7 @@ default_model_config = """
 @architectures = "spacy.Tok2Vec.v2"

 [model.tok2vec.embed]
-@architectures = "spacy.CharacterEmbed.v1"
+@architectures = "spacy.CharacterEmbed.v2"
 width = 128
 rows = 7000
 nM = 64

@@ -22,7 +22,7 @@ maxout_pieces = 3
 token_vector_width = 96

 [model.tok2vec]
-@architectures = "spacy.HashEmbedCNN.v1"
+@architectures = "spacy.HashEmbedCNN.v2"
 pretrained_vectors = null
 width = 96
 depth = 4

@@ -21,7 +21,7 @@ maxout_pieces = 2
 use_upper = true

 [model.tok2vec]
-@architectures = "spacy.HashEmbedCNN.v1"
+@architectures = "spacy.HashEmbedCNN.v2"
 pretrained_vectors = null
 width = 96
 depth = 4

@@ -19,7 +19,7 @@ default_model_config = """
 @architectures = "spacy.Tagger.v1"

 [model.tok2vec]
-@architectures = "spacy.HashEmbedCNN.v1"
+@architectures = "spacy.HashEmbedCNN.v2"
 pretrained_vectors = null
 width = 12
 depth = 1

@@ -26,7 +26,7 @@ default_model_config = """
 @architectures = "spacy.Tagger.v1"

 [model.tok2vec]
-@architectures = "spacy.HashEmbedCNN.v1"
+@architectures = "spacy.HashEmbedCNN.v2"
 pretrained_vectors = null
 width = 96
 depth = 4

@@ -21,7 +21,7 @@ single_label_default_config = """
 @architectures = "spacy.Tok2Vec.v2"

 [model.tok2vec.embed]
-@architectures = "spacy.MultiHashEmbed.v1"
+@architectures = "spacy.MultiHashEmbed.v2"
 width = 64
 rows = [2000, 2000, 1000, 1000, 1000, 1000]
 attrs = ["ORTH", "LOWER", "PREFIX", "SUFFIX", "SHAPE", "ID"]
@@ -56,7 +56,7 @@ single_label_cnn_config = """
 exclusive_classes = true

 [model.tok2vec]
-@architectures = "spacy.HashEmbedCNN.v1"
+@architectures = "spacy.HashEmbedCNN.v2"
 pretrained_vectors = null
 width = 96
 depth = 4

@@ -21,7 +21,7 @@ multi_label_default_config = """
 @architectures = "spacy.Tok2Vec.v1"

 [model.tok2vec.embed]
-@architectures = "spacy.MultiHashEmbed.v1"
+@architectures = "spacy.MultiHashEmbed.v2"
 width = 64
 rows = [2000, 2000, 1000, 1000, 1000, 1000]
 attrs = ["ORTH", "LOWER", "PREFIX", "SUFFIX", "SHAPE", "ID"]
@@ -56,7 +56,7 @@ multi_label_cnn_config = """
 exclusive_classes = false

 [model.tok2vec]
-@architectures = "spacy.HashEmbedCNN.v1"
+@architectures = "spacy.HashEmbedCNN.v2"
 pretrained_vectors = null
 width = 96
 depth = 4

@@ -11,7 +11,7 @@ from ..errors import Errors

 default_model_config = """
 [model]
-@architectures = "spacy.HashEmbedCNN.v1"
+@architectures = "spacy.HashEmbedCNN.v2"
 pretrained_vectors = null
 width = 96
 depth = 4

@@ -20,10 +20,16 @@ MISSING_VALUES = frozenset([None, 0, ""])
 class PRFScore:
 """A precision / recall / F score."""

-def __init__(self) -> None:
-self.tp = 0
-self.fp = 0
-self.fn = 0
+def __init__(
+self,
+*,
+tp: int = 0,
+fp: int = 0,
+fn: int = 0,
+) -> None:
+self.tp = tp
+self.fp = fp
+self.fn = fn

 def __len__(self) -> int:
 return self.tp + self.fp + self.fn
@@ -305,6 +311,8 @@ class Scorer:
 *,
 getter: Callable[[Doc, str], Iterable[Span]] = getattr,
 has_annotation: Optional[Callable[[Doc], bool]] = None,
+labeled: bool = True,
+allow_overlap: bool = False,
 **cfg,
 ) -> Dict[str, Any]:
 """Returns PRF scores for labeled spans.
@@ -317,6 +325,11 @@ class Scorer:
 has_annotation (Optional[Callable[[Doc], bool]]) should return whether a `Doc`
 has annotation for this `attr`. Docs without annotation are skipped for
 scoring purposes.
+labeled (bool): Whether or not to include label information in
+the evaluation. If set to 'False', two spans will be considered
+equal if their start and end match, irrespective of their label.
+allow_overlap (bool): Whether or not to allow overlapping spans.
+If set to 'False', the alignment will automatically resolve conflicts.
 RETURNS (Dict[str, Any]): A dictionary containing the PRF scores under
 the keys attr_p/r/f and the per-type PRF scores under attr_per_type.

@@ -345,33 +358,42 @@ class Scorer:
 gold_spans = set()
 pred_spans = set()
 for span in getter(gold_doc, attr):
-gold_span = (span.label_, span.start, span.end - 1)
+if labeled:
+gold_span = (span.label_, span.start, span.end - 1)
+else:
+gold_span = (span.start, span.end - 1)
 gold_spans.add(gold_span)
-gold_per_type[span.label_].add((span.label_, span.start, span.end - 1))
+gold_per_type[span.label_].add(gold_span)
 pred_per_type = {label: set() for label in labels}
-for span in example.get_aligned_spans_x2y(getter(pred_doc, attr)):
-pred_spans.add((span.label_, span.start, span.end - 1))
-pred_per_type[span.label_].add((span.label_, span.start, span.end - 1))
+for span in example.get_aligned_spans_x2y(getter(pred_doc, attr), allow_overlap):
+if labeled:
+pred_span = (span.label_, span.start, span.end - 1)
+else:
+pred_span = (span.start, span.end - 1)
+pred_spans.add(pred_span)
+pred_per_type[span.label_].add(pred_span)
 # Scores per label
-for k, v in score_per_type.items():
-if k in pred_per_type:
-v.score_set(pred_per_type[k], gold_per_type[k])
+if labeled:
+for k, v in score_per_type.items():
+if k in pred_per_type:
+v.score_set(pred_per_type[k], gold_per_type[k])
 # Score for all labels
 score.score_set(pred_spans, gold_spans)
-if len(score) > 0:
-return {
-f"{attr}_p": score.precision,
-f"{attr}_r": score.recall,
-f"{attr}_f": score.fscore,
-f"{attr}_per_type": {k: v.to_dict() for k, v in score_per_type.items()},
-}
-else:
-return {
+# Assemble final result
+final_scores = {
 f"{attr}_p": None,
 f"{attr}_r": None,
 f"{attr}_f": None,
-f"{attr}_per_type": None,
 }
+if labeled:
+final_scores[f"{attr}_per_type"] = None
+if len(score) > 0:
+final_scores[f"{attr}_p"] = score.precision
+final_scores[f"{attr}_r"] = score.recall
+final_scores[f"{attr}_f"] = score.fscore
+if labeled:
+final_scores[f"{attr}_per_type"] = {k: v.to_dict() for k, v in score_per_type.items()}
+return final_scores

 @staticmethod
 def score_cats(
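PRFScore now accepts keyword-only initial counts instead of always starting from zero, which lets callers seed a score object directly. A small sketch, assuming PRFScore is importable from spacy.scorer as in current spaCy:

    from spacy.scorer import PRFScore

    score = PRFScore(tp=6, fp=2, fn=2)
    assert len(score) == 10            # __len__ returns tp + fp + fn
    print(round(score.precision, 2))   # 0.75
    print(round(score.recall, 2))      # 0.75
    print(round(score.fscore, 2))      # 0.75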

@@ -223,7 +223,7 @@ cdef class StringStore:
 it doesn't exist. Paths may be either strings or Path-like objects.
 """
 path = util.ensure_path(path)
-strings = list(self)
+strings = sorted(self)
 srsly.write_json(path, strings)

 def from_disk(self, path):
@@ -247,7 +247,7 @@ cdef class StringStore:

 RETURNS (bytes): The serialized form of the `StringStore` object.
 """
-return srsly.json_dumps(list(self))
+return srsly.json_dumps(sorted(self))

 def from_bytes(self, bytes_data, **kwargs):
 """Load state from a binary string.
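Serialising the sorted strings rather than the raw insertion order makes StringStore output deterministic: two stores that hold the same strings should now serialise to identical bytes regardless of the order in which the strings were added. A quick sketch, assuming StringStore is importable from spacy.strings and accepts an iterable of strings:

    from spacy.strings import StringStore

    s1 = StringStore(["apple", "banana"])
    s2 = StringStore(["banana", "apple"])
    # With sorted() serialization the two stores round-trip to the same JSON list.
    assert s1.to_bytes() == s2.to_bytes()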

@@ -6,12 +6,14 @@ import logging
 import mock

 from spacy.lang.xx import MultiLanguage
-from spacy.tokens import Doc, Span
+from spacy.tokens import Doc, Span, Token
 from spacy.vocab import Vocab
 from spacy.lexeme import Lexeme
 from spacy.lang.en import English
 from spacy.attrs import ENT_TYPE, ENT_IOB, SENT_START, HEAD, DEP, MORPH

+from .test_underscore import clean_underscore  # noqa: F401
+

 def test_doc_api_init(en_vocab):
 words = ["a", "b", "c", "d"]
@@ -347,15 +349,19 @@ def test_doc_from_array_morph(en_vocab):
 assert [str(t.morph) for t in doc] == [str(t.morph) for t in new_doc]


+@pytest.mark.usefixtures("clean_underscore")
 def test_doc_api_from_docs(en_tokenizer, de_tokenizer):
 en_texts = ["Merging the docs is fun.", "", "They don't think alike."]
 en_texts_without_empty = [t for t in en_texts if len(t)]
 de_text = "Wie war die Frage?"
 en_docs = [en_tokenizer(text) for text in en_texts]
-docs_idx = en_texts[0].index("docs")
+en_docs[0].spans["group"] = [en_docs[0][1:4]]
+en_docs[2].spans["group"] = [en_docs[2][1:4]]
+span_group_texts = sorted([en_docs[0][1:4].text, en_docs[2][1:4].text])
 de_doc = de_tokenizer(de_text)
-expected = (True, None, None, None)
-en_docs[0].user_data[("._.", "is_ambiguous", docs_idx, None)] = expected
+Token.set_extension("is_ambiguous", default=False)
+en_docs[0][2]._.is_ambiguous = True  # docs
+en_docs[2][3]._.is_ambiguous = True  # think
 assert Doc.from_docs([]) is None
 assert de_doc is not Doc.from_docs([de_doc])
 assert str(de_doc) == str(Doc.from_docs([de_doc]))
@@ -372,11 +378,12 @@ def test_doc_api_from_docs(en_tokenizer, de_tokenizer):
 en_docs_tokens = [t for doc in en_docs for t in doc]
 assert len(m_doc) == len(en_docs_tokens)
 think_idx = len(en_texts[0]) + 1 + en_texts[2].index("think")
+assert m_doc[2]._.is_ambiguous == True
 assert m_doc[9].idx == think_idx
-with pytest.raises(AttributeError):
-# not callable, because it was not set via set_extension
-m_doc[2]._.is_ambiguous
-assert len(m_doc.user_data) == len(en_docs[0].user_data)  # but it's there
+assert m_doc[9]._.is_ambiguous == True
+assert not any([t._.is_ambiguous for t in m_doc[3:8]])
+assert "group" in m_doc.spans
+assert span_group_texts == sorted([s.text for s in m_doc.spans["group"]])

 m_doc = Doc.from_docs(en_docs, ensure_whitespace=False)
 assert len(en_texts_without_empty) == len(list(m_doc.sents))
@@ -388,6 +395,8 @@ def test_doc_api_from_docs(en_tokenizer, de_tokenizer):
 assert len(m_doc) == len(en_docs_tokens)
 think_idx = len(en_texts[0]) + 0 + en_texts[2].index("think")
 assert m_doc[9].idx == think_idx
+assert "group" in m_doc.spans
+assert span_group_texts == sorted([s.text for s in m_doc.spans["group"]])

 m_doc = Doc.from_docs(en_docs, attrs=["lemma", "length", "pos"])
 assert len(str(m_doc)) > len(en_texts[0]) + len(en_texts[1])
@@ -399,6 +408,8 @@ def test_doc_api_from_docs(en_tokenizer, de_tokenizer):
 assert len(m_doc) == len(en_docs_tokens)
 think_idx = len(en_texts[0]) + 1 + en_texts[2].index("think")
 assert m_doc[9].idx == think_idx
+assert "group" in m_doc.spans
+assert span_group_texts == sorted([s.text for s in m_doc.spans["group"]])


 def test_doc_api_from_docs_ents(en_tokenizer):

@@ -452,3 +452,30 @@ def test_retokenize_disallow_zero_length(en_vocab):
 with pytest.raises(ValueError):
 with doc.retokenize() as retokenizer:
 retokenizer.merge(doc[1:1])
+
+
+def test_doc_retokenize_merge_without_parse_keeps_sents(en_tokenizer):
+text = "displaCy is a parse tool built with Javascript"
+sent_starts = [1, 0, 0, 0, 1, 0, 0, 0]
+tokens = en_tokenizer(text)
+
+# merging within a sentence keeps all sentence boundaries
+doc = Doc(tokens.vocab, words=[t.text for t in tokens], sent_starts=sent_starts)
+assert len(list(doc.sents)) == 2
+with doc.retokenize() as retokenizer:
+retokenizer.merge(doc[1:3])
+assert len(list(doc.sents)) == 2
+
+# merging over a sentence boundary unsets it by default
+doc = Doc(tokens.vocab, words=[t.text for t in tokens], sent_starts=sent_starts)
+assert len(list(doc.sents)) == 2
+with doc.retokenize() as retokenizer:
+retokenizer.merge(doc[3:6])
+assert doc[3].is_sent_start == None
+
+# merging over a sentence boundary and setting sent_start
+doc = Doc(tokens.vocab, words=[t.text for t in tokens], sent_starts=sent_starts)
+assert len(list(doc.sents)) == 2
+with doc.retokenize() as retokenizer:
+retokenizer.merge(doc[3:6], attrs={"sent_start": True})
+assert len(list(doc.sents)) == 2

@@ -1,9 +1,11 @@
 import pytest
 from spacy.attrs import ORTH, LENGTH
-from spacy.tokens import Doc, Span
+from spacy.tokens import Doc, Span, Token
 from spacy.vocab import Vocab
 from spacy.util import filter_spans

+from .test_underscore import clean_underscore  # noqa: F401
+

 @pytest.fixture
 def doc(en_tokenizer):
@@ -219,11 +221,14 @@ def test_span_as_doc(doc):
 assert span_doc[0].idx == 0


+@pytest.mark.usefixtures("clean_underscore")
 def test_span_as_doc_user_data(doc):
 """Test that the user_data can be preserved (but not by default). """
 my_key = "my_info"
 my_value = 342
 doc.user_data[my_key] = my_value
+Token.set_extension("is_x", default=False)
+doc[7]._.is_x = True

 span = doc[4:10]
 span_doc_with = span.as_doc(copy_user_data=True)
@@ -232,6 +237,12 @@ def test_span_as_doc_user_data(doc):
 assert doc.user_data.get(my_key, None) is my_value
 assert span_doc_with.user_data.get(my_key, None) is my_value
 assert span_doc_without.user_data.get(my_key, None) is None
+for i in range(len(span_doc_with)):
+if i != 3:
+assert span_doc_with[i]._.is_x is False
+else:
+assert span_doc_with[i]._.is_x is True
+assert not any([t._.is_x for t in span_doc_without])


 def test_span_string_label_kb_id(doc):

3
spacy/tests/enable_gpu.py
Normal file
@@ -0,0 +1,3 @@
+from spacy import require_gpu
+
+require_gpu()

@@ -4,7 +4,9 @@ import re
 import copy
 from mock import Mock
 from spacy.matcher import DependencyMatcher
-from spacy.tokens import Doc
+from spacy.tokens import Doc, Token

+from ..doc.test_underscore import clean_underscore  # noqa: F401
+

 @pytest.fixture
@@ -344,3 +346,26 @@ def test_dependency_matcher_long_matches(en_vocab, doc):
 matcher = DependencyMatcher(en_vocab)
 with pytest.raises(ValueError):
 matcher.add("pattern", [pattern])
+
+
+@pytest.mark.usefixtures("clean_underscore")
+def test_dependency_matcher_span_user_data(en_tokenizer):
+doc = en_tokenizer("a b c d e")
+for token in doc:
+token.head = doc[0]
+token.dep_ = "a"
+get_is_c = lambda token: token.text in ("c",)
+Token.set_extension("is_c", default=False)
+doc[2]._.is_c = True
+pattern = [
+{"RIGHT_ID": "c", "RIGHT_ATTRS": {"_": {"is_c": True}}},
+]
+matcher = DependencyMatcher(en_tokenizer.vocab)
+matcher.add("C", [pattern])
+doc_matches = matcher(doc)
+offset = 1
+span_matches = matcher(doc[offset:])
+for doc_match, span_match in zip(sorted(doc_matches), sorted(span_matches)):
+assert doc_match[0] == span_match[0]
+for doc_t_i, span_t_i in zip(doc_match[1], span_match[1]):
+assert doc_t_i == span_t_i + offset

@@ -204,3 +204,90 @@ def test_matcher_remove():
 # removing again should throw an error
 with pytest.raises(ValueError):
 matcher.remove("Rule")
+
+
+def test_matcher_with_alignments_greedy_longest(en_vocab):
+cases = [
+("aaab", "a* b", [0, 0, 0, 1]),
+("baab", "b a* b", [0, 1, 1, 2]),
+("aaab", "a a a b", [0, 1, 2, 3]),
+("aaab", "a+ b", [0, 0, 0, 1]),
+("aaba", "a+ b a+", [0, 0, 1, 2]),
+("aabaa", "a+ b a+", [0, 0, 1, 2, 2]),
+("aaba", "a+ b a*", [0, 0, 1, 2]),
+("aaaa", "a*", [0, 0, 0, 0]),
+("baab", "b a* b b*", [0, 1, 1, 2]),
+("aabb", "a* b* a*", [0, 0, 1, 1]),
+("aaab", "a+ a+ a b", [0, 1, 2, 3]),
+("aaab", "a+ a+ a+ b", [0, 1, 2, 3]),
+("aaab", "a+ a a b", [0, 1, 2, 3]),
+("aaab", "a+ a a", [0, 1, 2]),
+("aaab", "a+ a a?", [0, 1, 2]),
+("aaaa", "a a a a a?", [0, 1, 2, 3]),
+("aaab", "a+ a b", [0, 0, 1, 2]),
+("aaab", "a+ a+ b", [0, 0, 1, 2]),
+]
+for string, pattern_str, result in cases:
+matcher = Matcher(en_vocab)
+doc = Doc(matcher.vocab, words=list(string))
+pattern = []
+for part in pattern_str.split():
+if part.endswith("+"):
+pattern.append({"ORTH": part[0], "OP": "+"})
+elif part.endswith("*"):
+pattern.append({"ORTH": part[0], "OP": "*"})
+elif part.endswith("?"):
+pattern.append({"ORTH": part[0], "OP": "?"})
+else:
+pattern.append({"ORTH": part})
+matcher.add("PATTERN", [pattern], greedy="LONGEST")
+matches = matcher(doc, with_alignments=True)
+n_matches = len(matches)
+
+_, s, e, expected = matches[0]
+
+assert expected == result, (string, pattern_str, s, e, n_matches)
+
+
+def test_matcher_with_alignments_nongreedy(en_vocab):
+cases = [
+(0, "aaab", "a* b", [[0, 1], [0, 0, 1], [0, 0, 0, 1], [1]]),
+(1, "baab", "b a* b", [[0, 1, 1, 2]]),
+(2, "aaab", "a a a b", [[0, 1, 2, 3]]),
+(3, "aaab", "a+ b", [[0, 1], [0, 0, 1], [0, 0, 0, 1]]),
+(4, "aaba", "a+ b a+", [[0, 1, 2], [0, 0, 1, 2]]),
+(5, "aabaa", "a+ b a+", [[0, 1, 2], [0, 0, 1, 2], [0, 0, 1, 2, 2], [0, 1, 2, 2] ]),
+(6, "aaba", "a+ b a*", [[0, 1], [0, 0, 1], [0, 0, 1, 2], [0, 1, 2]]),
+(7, "aaaa", "a*", [[0], [0, 0], [0, 0, 0], [0, 0, 0, 0]]),
+(8, "baab", "b a* b b*", [[0, 1, 1, 2]]),
+(9, "aabb", "a* b* a*", [[1], [2], [2, 2], [0, 1], [0, 0, 1], [0, 0, 1, 1], [0, 1, 1], [1, 1]]),
+(10, "aaab", "a+ a+ a b", [[0, 1, 2, 3]]),
+(11, "aaab", "a+ a+ a+ b", [[0, 1, 2, 3]]),
+(12, "aaab", "a+ a a b", [[0, 1, 2, 3]]),
+(13, "aaab", "a+ a a", [[0, 1, 2]]),
+(14, "aaab", "a+ a a?", [[0, 1], [0, 1, 2]]),
+(15, "aaaa", "a a a a a?", [[0, 1, 2, 3]]),
+(16, "aaab", "a+ a b", [[0, 1, 2], [0, 0, 1, 2]]),
+(17, "aaab", "a+ a+ b", [[0, 1, 2], [0, 0, 1, 2]]),
+]
+for case_id, string, pattern_str, results in cases:
+matcher = Matcher(en_vocab)
+doc = Doc(matcher.vocab, words=list(string))
+pattern = []
+for part in pattern_str.split():
+if part.endswith("+"):
+pattern.append({"ORTH": part[0], "OP": "+"})
+elif part.endswith("*"):
+pattern.append({"ORTH": part[0], "OP": "*"})
+elif part.endswith("?"):
+pattern.append({"ORTH": part[0], "OP": "?"})
+else:
+pattern.append({"ORTH": part})
+
+matcher.add("PATTERN", [pattern])
+matches = matcher(doc, with_alignments=True)
+n_matches = len(matches)
+
+for _, s, e, expected in matches:
+assert expected in results, (case_id, string, pattern_str, s, e, n_matches)
+assert len(expected) == e - s

@@ -5,6 +5,7 @@ from spacy.tokens import Span
 from spacy.language import Language
 from spacy.pipeline import EntityRuler
 from spacy.errors import MatchPatternError
+from thinc.api import NumpyOps, get_current_ops


 @pytest.fixture
@@ -201,13 +202,14 @@ def test_entity_ruler_overlapping_spans(nlp):

 @pytest.mark.parametrize("n_process", [1, 2])
 def test_entity_ruler_multiprocessing(nlp, n_process):
-texts = ["I enjoy eating Pizza Hut pizza."]
+if isinstance(get_current_ops, NumpyOps) or n_process < 2:
+texts = ["I enjoy eating Pizza Hut pizza."]

 patterns = [{"label": "FASTFOOD", "pattern": "Pizza Hut", "id": "1234"}]

 ruler = nlp.add_pipe("entity_ruler")
 ruler.add_patterns(patterns)

 for doc in nlp.pipe(texts, n_process=2):
 for ent in doc.ents:
 assert ent.ent_id_ == "1234"

@@ -1,6 +1,7 @@
 import pytest
 import logging
 import mock
+import pickle
 from spacy import util, registry
 from spacy.lang.en import English
 from spacy.lookups import Lookups
@@ -106,6 +107,9 @@ def test_lemmatizer_serialize(nlp):
 doc2 = nlp2.make_doc("coping")
 doc2[0].pos_ = "VERB"
 assert doc2[0].lemma_ == ""
-doc2 = lemmatizer(doc2)
+doc2 = lemmatizer2(doc2)
 assert doc2[0].text == "coping"
 assert doc2[0].lemma_ == "cope"
+
+# Make sure that lemmatizer cache can be pickled
+b = pickle.dumps(lemmatizer2)
@@ -4,7 +4,7 @@ import numpy
 import pytest
 from numpy.testing import assert_almost_equal
 from spacy.vocab import Vocab
-from thinc.api import NumpyOps, Model, data_validation
+from thinc.api import Model, data_validation, get_current_ops
 from thinc.types import Array2d, Ragged

 from spacy.lang.en import English
@@ -13,7 +13,7 @@ from spacy.ml._character_embed import CharacterEmbed
 from spacy.tokens import Doc


-OPS = NumpyOps()
+OPS = get_current_ops()

 texts = ["These are 4 words", "Here just three"]
 l0 = [[1, 2], [3, 4], [5, 6], [7, 8]]
@@ -82,7 +82,7 @@ def util_batch_unbatch_docs_list(
         Y_batched = model.predict(in_data)
         Y_not_batched = [model.predict([u])[0] for u in in_data]
         for i in range(len(Y_batched)):
-            assert_almost_equal(Y_batched[i], Y_not_batched[i], decimal=4)
+            assert_almost_equal(OPS.to_numpy(Y_batched[i]), OPS.to_numpy(Y_not_batched[i]), decimal=4)


 def util_batch_unbatch_docs_array(
@@ -91,7 +91,7 @@ def util_batch_unbatch_docs_array(
     with data_validation(True):
         model.initialize(in_data, out_data)
         Y_batched = model.predict(in_data).tolist()
-        Y_not_batched = [model.predict([u])[0] for u in in_data]
+        Y_not_batched = [model.predict([u])[0].tolist() for u in in_data]
         assert_almost_equal(Y_batched, Y_not_batched, decimal=4)


@@ -100,8 +100,8 @@ def util_batch_unbatch_docs_ragged(
 ):
     with data_validation(True):
         model.initialize(in_data, out_data)
-        Y_batched = model.predict(in_data)
+        Y_batched = model.predict(in_data).data.tolist()
         Y_not_batched = []
         for u in in_data:
             Y_not_batched.extend(model.predict([u]).data.tolist())
-        assert_almost_equal(Y_batched.data, Y_not_batched, decimal=4)
+        assert_almost_equal(Y_batched, Y_not_batched, decimal=4)
@@ -1,4 +1,6 @@
 import pytest
+import mock
+import logging
 from spacy.language import Language
 from spacy.lang.en import English
 from spacy.lang.de import German
@@ -402,6 +404,38 @@ def test_pipe_factories_from_source():
         nlp.add_pipe("custom", source=source_nlp)


+def test_pipe_factories_from_source_language_subclass():
+    class CustomEnglishDefaults(English.Defaults):
+        stop_words = set(["custom", "stop"])
+
+    @registry.languages("custom_en")
+    class CustomEnglish(English):
+        lang = "custom_en"
+        Defaults = CustomEnglishDefaults
+
+    source_nlp = English()
+    source_nlp.add_pipe("tagger")
+
+    # custom subclass
+    nlp = CustomEnglish()
+    nlp.add_pipe("tagger", source=source_nlp)
+    assert "tagger" in nlp.pipe_names
+
+    # non-subclass
+    nlp = German()
+    nlp.add_pipe("tagger", source=source_nlp)
+    assert "tagger" in nlp.pipe_names
+
+    # mismatched vectors
+    nlp = English()
+    nlp.vocab.vectors.resize((1, 4))
+    nlp.vocab.vectors.add("cat", vector=[1, 2, 3, 4])
+    logger = logging.getLogger("spacy")
+    with mock.patch.object(logger, "warning") as mock_warning:
+        nlp.add_pipe("tagger", source=source_nlp)
+        mock_warning.assert_called()
+
+
 def test_pipe_factories_from_source_custom():
     """Test adding components from a source model with custom components."""
     name = "test_pipe_factories_from_source_custom"
@@ -1,7 +1,7 @@
 import pytest
 import random
 import numpy.random
-from numpy.testing import assert_equal
+from numpy.testing import assert_almost_equal
 from thinc.api import fix_random_seed
 from spacy import util
 from spacy.lang.en import English
@@ -222,8 +222,12 @@ def test_overfitting_IO():
     batch_cats_1 = [doc.cats for doc in nlp.pipe(texts)]
     batch_cats_2 = [doc.cats for doc in nlp.pipe(texts)]
     no_batch_cats = [doc.cats for doc in [nlp(text) for text in texts]]
-    assert_equal(batch_cats_1, batch_cats_2)
-    assert_equal(batch_cats_1, no_batch_cats)
+    for cats_1, cats_2 in zip(batch_cats_1, batch_cats_2):
+        for cat in cats_1:
+            assert_almost_equal(cats_1[cat], cats_2[cat], decimal=5)
+    for cats_1, cats_2 in zip(batch_cats_1, no_batch_cats):
+        for cat in cats_1:
+            assert_almost_equal(cats_1[cat], cats_2[cat], decimal=5)


 def test_overfitting_IO_multi():
@@ -270,8 +274,12 @@ def test_overfitting_IO_multi():
     batch_deps_1 = [doc.cats for doc in nlp.pipe(texts)]
     batch_deps_2 = [doc.cats for doc in nlp.pipe(texts)]
     no_batch_deps = [doc.cats for doc in [nlp(text) for text in texts]]
-    assert_equal(batch_deps_1, batch_deps_2)
-    assert_equal(batch_deps_1, no_batch_deps)
+    for cats_1, cats_2 in zip(batch_deps_1, batch_deps_2):
+        for cat in cats_1:
+            assert_almost_equal(cats_1[cat], cats_2[cat], decimal=5)
+    for cats_1, cats_2 in zip(batch_deps_1, no_batch_deps):
+        for cat in cats_1:
+            assert_almost_equal(cats_1[cat], cats_2[cat], decimal=5)


 # fmt: off
@@ -8,8 +8,8 @@ from spacy.tokens import Doc
 from spacy.training import Example
 from spacy import util
 from spacy.lang.en import English
-from thinc.api import Config
-from numpy.testing import assert_equal
+from thinc.api import Config, get_current_ops
+from numpy.testing import assert_array_equal

 from ..util import get_batch, make_tempdir

@@ -160,7 +160,8 @@ def test_tok2vec_listener():

     doc = nlp("Running the pipeline as a whole.")
     doc_tensor = tagger_tok2vec.predict([doc])[0]
-    assert_equal(doc.tensor, doc_tensor)
+    ops = get_current_ops()
+    assert_array_equal(ops.to_numpy(doc.tensor), ops.to_numpy(doc_tensor))

     # TODO: should this warn or error?
     nlp.select_pipes(disable="tok2vec")
@@ -9,6 +9,7 @@ from spacy.language import Language
 from spacy.util import ensure_path, load_model_from_path
 import numpy
 import pickle
+from thinc.api import NumpyOps, get_current_ops

 from ..util import make_tempdir

@@ -169,21 +170,22 @@ def test_issue4725_1():


 def test_issue4725_2():
-    # ensures that this runs correctly and doesn't hang or crash because of the global vectors
-    # if it does crash, it's usually because of calling 'spawn' for multiprocessing (e.g. on Windows),
-    # or because of issues with pickling the NER (cf test_issue4725_1)
-    vocab = Vocab(vectors_name="test_vocab_add_vector")
-    data = numpy.ndarray((5, 3), dtype="f")
-    data[0] = 1.0
-    data[1] = 2.0
-    vocab.set_vector("cat", data[0])
-    vocab.set_vector("dog", data[1])
-    nlp = English(vocab=vocab)
-    nlp.add_pipe("ner")
-    nlp.initialize()
-    docs = ["Kurt is in London."] * 10
-    for _ in nlp.pipe(docs, batch_size=2, n_process=2):
-        pass
+    if isinstance(get_current_ops, NumpyOps):
+        # ensures that this runs correctly and doesn't hang or crash because of the global vectors
+        # if it does crash, it's usually because of calling 'spawn' for multiprocessing (e.g. on Windows),
+        # or because of issues with pickling the NER (cf test_issue4725_1)
+        vocab = Vocab(vectors_name="test_vocab_add_vector")
+        data = numpy.ndarray((5, 3), dtype="f")
+        data[0] = 1.0
+        data[1] = 2.0
+        vocab.set_vector("cat", data[0])
+        vocab.set_vector("dog", data[1])
+        nlp = English(vocab=vocab)
+        nlp.add_pipe("ner")
+        nlp.initialize()
+        docs = ["Kurt is in London."] * 10
+        for _ in nlp.pipe(docs, batch_size=2, n_process=2):
+            pass


 def test_issue4849():
@@ -204,10 +206,11 @@ def test_issue4849():
         count_ents += len([ent for ent in doc.ents if ent.ent_id > 0])
     assert count_ents == 2
     # USING 2 PROCESSES
-    count_ents = 0
-    for doc in nlp.pipe([text], n_process=2):
-        count_ents += len([ent for ent in doc.ents if ent.ent_id > 0])
-    assert count_ents == 2
+    if isinstance(get_current_ops, NumpyOps):
+        count_ents = 0
+        for doc in nlp.pipe([text], n_process=2):
+            count_ents += len([ent for ent in doc.ents if ent.ent_id > 0])
+        assert count_ents == 2


 @Language.factory("my_pipe")
@@ -239,10 +242,11 @@ def test_issue4903():
     nlp.add_pipe("sentencizer")
     nlp.add_pipe("my_pipe", after="sentencizer")
     text = ["I like bananas.", "Do you like them?", "No, I prefer wasabi."]
-    docs = list(nlp.pipe(text, n_process=2))
-    assert docs[0].text == "I like bananas."
-    assert docs[1].text == "Do you like them?"
-    assert docs[2].text == "No, I prefer wasabi."
+    if isinstance(get_current_ops(), NumpyOps):
+        docs = list(nlp.pipe(text, n_process=2))
+        assert docs[0].text == "I like bananas."
+        assert docs[1].text == "Do you like them?"
+        assert docs[2].text == "No, I prefer wasabi."


 def test_issue4924():
@@ -6,6 +6,7 @@ from spacy.language import Language
 from spacy.lang.en.syntax_iterators import noun_chunks
 from spacy.vocab import Vocab
 import spacy
+from thinc.api import get_current_ops
 import pytest

 from ...util import make_tempdir
@@ -54,16 +55,17 @@ def test_issue5082():
     ruler.add_patterns(patterns)
     parsed_vectors_1 = [t.vector for t in nlp(text)]
     assert len(parsed_vectors_1) == 4
-    numpy.testing.assert_array_equal(parsed_vectors_1[0], array1)
-    numpy.testing.assert_array_equal(parsed_vectors_1[1], array2)
-    numpy.testing.assert_array_equal(parsed_vectors_1[2], array3)
-    numpy.testing.assert_array_equal(parsed_vectors_1[3], array4)
+    ops = get_current_ops()
+    numpy.testing.assert_array_equal(ops.to_numpy(parsed_vectors_1[0]), array1)
+    numpy.testing.assert_array_equal(ops.to_numpy(parsed_vectors_1[1]), array2)
+    numpy.testing.assert_array_equal(ops.to_numpy(parsed_vectors_1[2]), array3)
+    numpy.testing.assert_array_equal(ops.to_numpy(parsed_vectors_1[3]), array4)
     nlp.add_pipe("merge_entities")
     parsed_vectors_2 = [t.vector for t in nlp(text)]
     assert len(parsed_vectors_2) == 3
-    numpy.testing.assert_array_equal(parsed_vectors_2[0], array1)
-    numpy.testing.assert_array_equal(parsed_vectors_2[1], array2)
-    numpy.testing.assert_array_equal(parsed_vectors_2[2], array34)
+    numpy.testing.assert_array_equal(ops.to_numpy(parsed_vectors_2[0]), array1)
+    numpy.testing.assert_array_equal(ops.to_numpy(parsed_vectors_2[1]), array2)
+    numpy.testing.assert_array_equal(ops.to_numpy(parsed_vectors_2[2]), array34)


 def test_issue5137():
@@ -1,5 +1,6 @@
 import pytest
-from thinc.api import Config, fix_random_seed
+from numpy.testing import assert_almost_equal
+from thinc.api import Config, fix_random_seed, get_current_ops

 from spacy.lang.en import English
 from spacy.pipeline.textcat import single_label_default_config, single_label_bow_config
@@ -44,11 +45,12 @@ def test_issue5551(textcat_config):
         nlp.update([Example.from_dict(doc, annots)])
         # Store the result of each iteration
         result = pipe.model.predict([doc])
-        results.append(list(result[0]))
+        results.append(result[0])
     # All results should be the same because of the fixed seed
     assert len(results) == 3
-    assert results[0] == results[1]
-    assert results[0] == results[2]
+    ops = get_current_ops()
+    assert_almost_equal(ops.to_numpy(results[0]), ops.to_numpy(results[1]))
+    assert_almost_equal(ops.to_numpy(results[0]), ops.to_numpy(results[2]))


 def test_issue5838():
@@ -1,4 +1,6 @@
+from spacy.kb import KnowledgeBase
 from spacy.lang.en import English
+from spacy.training import Example


 def test_issue7065():
@@ -16,3 +18,58 @@ def test_issue7065():
     ent = doc.ents[0]
     assert ent.start < sent0.end < ent.end
     assert sentences.index(ent.sent) == 0
+
+
+def test_issue7065_b():
+    # Test that the NEL doesn't crash when an entity crosses a sentence boundary
+    nlp = English()
+    vector_length = 3
+    nlp.add_pipe("sentencizer")
+
+    text = "Mahler 's Symphony No. 8 was beautiful."
+    entities = [(0, 6, "PERSON"), (10, 24, "WORK")]
+    links = {(0, 6): {"Q7304": 1.0, "Q270853": 0.0},
+             (10, 24): {"Q7304": 0.0, "Q270853": 1.0}}
+    sent_starts = [1, -1, 0, 0, 0, 0, 0, 0, 0]
+    doc = nlp(text)
+    example = Example.from_dict(doc, {"entities": entities, "links": links, "sent_starts": sent_starts})
+    train_examples = [example]
+
+    def create_kb(vocab):
+        # create artificial KB
+        mykb = KnowledgeBase(vocab, entity_vector_length=vector_length)
+        mykb.add_entity(entity="Q270853", freq=12, entity_vector=[9, 1, -7])
+        mykb.add_alias(
+            alias="No. 8",
+            entities=["Q270853"],
+            probabilities=[1.0],
+        )
+        mykb.add_entity(entity="Q7304", freq=12, entity_vector=[6, -4, 3])
+        mykb.add_alias(
+            alias="Mahler",
+            entities=["Q7304"],
+            probabilities=[1.0],
+        )
+        return mykb
+
+    # Create the Entity Linker component and add it to the pipeline
+    entity_linker = nlp.add_pipe("entity_linker", last=True)
+    entity_linker.set_kb(create_kb)
+
+    # train the NEL pipe
+    optimizer = nlp.initialize(get_examples=lambda: train_examples)
+    for i in range(2):
+        losses = {}
+        nlp.update(train_examples, sgd=optimizer, losses=losses)
+
+    # Add a custom rule-based component to mimick NER
+    patterns = [
+        {"label": "PERSON", "pattern": [{"LOWER": "mahler"}]},
+        {"label": "WORK", "pattern": [{"LOWER": "symphony"}, {"LOWER": "no"}, {"LOWER": "."}, {"LOWER": "8"}]}
+    ]
+    ruler = nlp.add_pipe("entity_ruler", before="entity_linker")
+    ruler.add_patterns(patterns)
+
+    # test the trained model - this should not throw E148
+    doc = nlp(text)
+    assert doc
@@ -4,7 +4,7 @@ import spacy
 from spacy.lang.en import English
 from spacy.lang.de import German
 from spacy.language import Language, DEFAULT_CONFIG, DEFAULT_CONFIG_PRETRAIN_PATH
-from spacy.util import registry, load_model_from_config, load_config
+from spacy.util import registry, load_model_from_config, load_config, load_config_from_str
 from spacy.ml.models import build_Tok2Vec_model, build_tb_parser_model
 from spacy.ml.models import MultiHashEmbed, MaxoutWindowEncoder
 from spacy.schemas import ConfigSchema, ConfigSchemaPretrain
@@ -465,3 +465,32 @@ def test_config_only_resolve_relevant_blocks():
     nlp.initialize()
     nlp.config["initialize"]["lookups"] = None
     nlp.initialize()
+
+
+def test_hyphen_in_config():
+    hyphen_config_str = """
+    [nlp]
+    lang = "en"
+    pipeline = ["my_punctual_component"]
+
+    [components]
+
+    [components.my_punctual_component]
+    factory = "my_punctual_component"
+    punctuation = ["?","-"]
+    """
+
+    @spacy.Language.factory("my_punctual_component")
+    class MyPunctualComponent(object):
+        name = "my_punctual_component"
+
+        def __init__(
+            self,
+            nlp,
+            name,
+            punctuation,
+        ):
+            self.punctuation = punctuation
+
+    nlp = English.from_config(load_config_from_str(hyphen_config_str))
+    assert nlp.get_pipe("my_punctual_component").punctuation == ['?', '-']
@@ -26,10 +26,14 @@ def test_serialize_custom_tokenizer(en_vocab, en_tokenizer):
     assert tokenizer.rules != {}
     assert tokenizer.token_match is not None
     assert tokenizer.url_match is not None
+    assert tokenizer.prefix_search is not None
+    assert tokenizer.infix_finditer is not None
     tokenizer.from_bytes(tokenizer_bytes)
     assert tokenizer.rules == {}
     assert tokenizer.token_match is None
     assert tokenizer.url_match is None
+    assert tokenizer.prefix_search is None
+    assert tokenizer.infix_finditer is None

     tokenizer = Tokenizer(en_vocab, rules={"ABC.": [{"ORTH": "ABC"}, {"ORTH": "."}]})
     tokenizer.rules = {}
@@ -49,9 +49,9 @@ def test_serialize_vocab_roundtrip_disk(strings1, strings2):
         vocab1_d = Vocab().from_disk(file_path1)
         vocab2_d = Vocab().from_disk(file_path2)
         # check strings rather than lexemes, which are only reloaded on demand
-        assert strings1 == [s for s in vocab1_d.strings]
-        assert strings2 == [s for s in vocab2_d.strings]
-        if strings1 == strings2:
+        assert set(strings1) == set([s for s in vocab1_d.strings])
+        assert set(strings2) == set([s for s in vocab2_d.strings])
+        if set(strings1) == set(strings2):
             assert [s for s in vocab1_d.strings] == [s for s in vocab2_d.strings]
         else:
             assert [s for s in vocab1_d.strings] != [s for s in vocab2_d.strings]
@@ -96,7 +96,7 @@ def test_serialize_stringstore_roundtrip_bytes(strings1, strings2):
     sstore2 = StringStore(strings=strings2)
     sstore1_b = sstore1.to_bytes()
     sstore2_b = sstore2.to_bytes()
-    if strings1 == strings2:
+    if set(strings1) == set(strings2):
         assert sstore1_b == sstore2_b
     else:
         assert sstore1_b != sstore2_b
@@ -104,7 +104,7 @@ def test_serialize_stringstore_roundtrip_bytes(strings1, strings2):
     assert sstore1.to_bytes() == sstore1_b
     new_sstore1 = StringStore().from_bytes(sstore1_b)
     assert new_sstore1.to_bytes() == sstore1_b
-    assert list(new_sstore1) == strings1
+    assert set(new_sstore1) == set(strings1)


 @pytest.mark.parametrize("strings1,strings2", test_strings)
@@ -118,12 +118,12 @@ def test_serialize_stringstore_roundtrip_disk(strings1, strings2):
         sstore2.to_disk(file_path2)
         sstore1_d = StringStore().from_disk(file_path1)
         sstore2_d = StringStore().from_disk(file_path2)
-        assert list(sstore1_d) == list(sstore1)
-        assert list(sstore2_d) == list(sstore2)
-        if strings1 == strings2:
-            assert list(sstore1_d) == list(sstore2_d)
+        assert set(sstore1_d) == set(sstore1)
+        assert set(sstore2_d) == set(sstore2)
+        if set(strings1) == set(strings2):
+            assert set(sstore1_d) == set(sstore2_d)
         else:
-            assert list(sstore1_d) != list(sstore2_d)
+            assert set(sstore1_d) != set(sstore2_d)


 @pytest.mark.parametrize("strings,lex_attr", test_strings_attrs)
@@ -307,8 +307,11 @@ def test_project_config_validation2(config, n_errors):
     assert len(errors) == n_errors


-def test_project_config_interpolation():
-    variables = {"a": 10, "b": {"c": "foo", "d": True}}
+@pytest.mark.parametrize(
+    "int_value", [10, pytest.param("10", marks=pytest.mark.xfail)],
+)
+def test_project_config_interpolation(int_value):
+    variables = {"a": int_value, "b": {"c": "foo", "d": True}}
     commands = [
         {"name": "x", "script": ["hello ${vars.a} ${vars.b.c}"]},
         {"name": "y", "script": ["${vars.b.c} ${vars.b.d}"]},
@@ -317,6 +320,8 @@ def test_project_config_interpolation():
     with make_tempdir() as d:
         srsly.write_yaml(d / "project.yml", project)
         cfg = load_project_config(d)
+    assert type(cfg) == dict
+    assert type(cfg["commands"]) == list
     assert cfg["commands"][0]["script"][0] == "hello 10 foo"
     assert cfg["commands"][1]["script"][0] == "foo true"
     commands = [{"name": "x", "script": ["hello ${vars.a} ${vars.b.e}"]}]
@@ -325,6 +330,24 @@ def test_project_config_interpolation():
         substitute_project_variables(project)


+@pytest.mark.parametrize(
+    "greeting", [342, "everyone", "tout le monde", pytest.param("42", marks=pytest.mark.xfail)],
+)
+def test_project_config_interpolation_override(greeting):
+    variables = {"a": "world"}
+    commands = [
+        {"name": "x", "script": ["hello ${vars.a}"]},
+    ]
+    overrides = {"vars.a": greeting}
+    project = {"commands": commands, "vars": variables}
+    with make_tempdir() as d:
+        srsly.write_yaml(d / "project.yml", project)
+        cfg = load_project_config(d, overrides=overrides)
+        assert type(cfg) == dict
+        assert type(cfg["commands"]) == list
+        assert cfg["commands"][0]["script"][0] == f"hello {greeting}"
+
+
 def test_project_config_interpolation_env():
     variables = {"a": 10}
     env_var = "SPACY_TEST_FOO"
@@ -10,6 +10,7 @@ from spacy.lang.en import English
 from spacy.lang.de import German
 from spacy.util import registry, ignore_error, raise_error
 import spacy
+from thinc.api import NumpyOps, get_current_ops

 from .util import add_vecs_to_vocab, assert_docs_equal

@@ -142,25 +143,29 @@ def texts():

 @pytest.mark.parametrize("n_process", [1, 2])
 def test_language_pipe(nlp2, n_process, texts):
-    texts = texts * 10
-    expecteds = [nlp2(text) for text in texts]
-    docs = nlp2.pipe(texts, n_process=n_process, batch_size=2)
+    ops = get_current_ops()
+    if isinstance(ops, NumpyOps) or n_process < 2:
+        texts = texts * 10
+        expecteds = [nlp2(text) for text in texts]
+        docs = nlp2.pipe(texts, n_process=n_process, batch_size=2)

-    for doc, expected_doc in zip(docs, expecteds):
-        assert_docs_equal(doc, expected_doc)
+        for doc, expected_doc in zip(docs, expecteds):
+            assert_docs_equal(doc, expected_doc)


 @pytest.mark.parametrize("n_process", [1, 2])
 def test_language_pipe_stream(nlp2, n_process, texts):
-    # check if nlp.pipe can handle infinite length iterator properly.
-    stream_texts = itertools.cycle(texts)
-    texts0, texts1 = itertools.tee(stream_texts)
-    expecteds = (nlp2(text) for text in texts0)
-    docs = nlp2.pipe(texts1, n_process=n_process, batch_size=2)
+    ops = get_current_ops()
+    if isinstance(ops, NumpyOps) or n_process < 2:
+        # check if nlp.pipe can handle infinite length iterator properly.
+        stream_texts = itertools.cycle(texts)
+        texts0, texts1 = itertools.tee(stream_texts)
+        expecteds = (nlp2(text) for text in texts0)
+        docs = nlp2.pipe(texts1, n_process=n_process, batch_size=2)

-    n_fetch = 20
-    for doc, expected_doc in itertools.islice(zip(docs, expecteds), n_fetch):
-        assert_docs_equal(doc, expected_doc)
+        n_fetch = 20
+        for doc, expected_doc in itertools.islice(zip(docs, expecteds), n_fetch):
+            assert_docs_equal(doc, expected_doc)


 def test_language_pipe_error_handler():
@@ -8,7 +8,8 @@ from spacy import prefer_gpu, require_gpu, require_cpu
 from spacy.ml._precomputable_affine import PrecomputableAffine
 from spacy.ml._precomputable_affine import _backprop_precomputable_affine_padding
 from spacy.util import dot_to_object, SimpleFrozenList, import_file
-from thinc.api import Config, Optimizer, ConfigValidationError
+from thinc.api import Config, Optimizer, ConfigValidationError, get_current_ops
+from thinc.api import set_current_ops
 from spacy.training.batchers import minibatch_by_words
 from spacy.lang.en import English
 from spacy.lang.nl import Dutch
@@ -81,6 +82,7 @@ def test_PrecomputableAffine(nO=4, nI=5, nF=3, nP=2):


 def test_prefer_gpu():
+    current_ops = get_current_ops()
     try:
         import cupy  # noqa: F401

@@ -88,9 +90,11 @@ def test_prefer_gpu():
         assert isinstance(get_current_ops(), CupyOps)
     except ImportError:
         assert not prefer_gpu()
+    set_current_ops(current_ops)


 def test_require_gpu():
+    current_ops = get_current_ops()
     try:
         import cupy  # noqa: F401

@@ -99,9 +103,11 @@ def test_require_gpu():
     except ImportError:
         with pytest.raises(ValueError):
             require_gpu()
+    set_current_ops(current_ops)


 def test_require_cpu():
+    current_ops = get_current_ops()
     require_cpu()
     assert isinstance(get_current_ops(), NumpyOps)
     try:
@@ -113,6 +119,7 @@ def test_require_cpu():
         pass
     require_cpu()
     assert isinstance(get_current_ops(), NumpyOps)
+    set_current_ops(current_ops)


 def test_ascii_filenames():
@@ -1,7 +1,7 @@
 from typing import List
 import pytest
 from thinc.api import fix_random_seed, Adam, set_dropout_rate
-from numpy.testing import assert_array_equal
+from numpy.testing import assert_array_equal, assert_array_almost_equal
 import numpy
 from spacy.ml.models import build_Tok2Vec_model, MultiHashEmbed, MaxoutWindowEncoder
 from spacy.ml.models import build_bow_text_classifier, build_simple_cnn_text_classifier
@@ -109,7 +109,7 @@ def test_models_initialize_consistently(seed, model_func, kwargs):
     model2.initialize()
     params1 = get_all_params(model1)
     params2 = get_all_params(model2)
-    assert_array_equal(params1, params2)
+    assert_array_equal(model1.ops.to_numpy(params1), model2.ops.to_numpy(params2))


 @pytest.mark.parametrize(
@@ -134,14 +134,25 @@ def test_models_predict_consistently(seed, model_func, kwargs, get_X):
         for i in range(len(tok2vec1)):
             for j in range(len(tok2vec1[i])):
                 assert_array_equal(
-                    numpy.asarray(tok2vec1[i][j]), numpy.asarray(tok2vec2[i][j])
+                    numpy.asarray(model1.ops.to_numpy(tok2vec1[i][j])),
+                    numpy.asarray(model2.ops.to_numpy(tok2vec2[i][j])),
                 )

+    try:
+        Y1 = model1.ops.to_numpy(Y1)
+        Y2 = model2.ops.to_numpy(Y2)
+    except Exception:
+        pass
     if isinstance(Y1, numpy.ndarray):
         assert_array_equal(Y1, Y2)
     elif isinstance(Y1, List):
         assert len(Y1) == len(Y2)
         for y1, y2 in zip(Y1, Y2):
+            try:
+                y1 = model1.ops.to_numpy(y1)
+                y2 = model2.ops.to_numpy(y2)
+            except Exception:
+                pass
             assert_array_equal(y1, y2)
     else:
         raise ValueError(f"Could not compare type {type(Y1)}")
@@ -169,12 +180,17 @@ def test_models_update_consistently(seed, dropout, model_func, kwargs, get_X):
         model.finish_update(optimizer)
         updated_params = get_all_params(model)
         with pytest.raises(AssertionError):
-            assert_array_equal(initial_params, updated_params)
+            assert_array_equal(
+                model.ops.to_numpy(initial_params), model.ops.to_numpy(updated_params)
+            )
         return model

     model1 = get_updated_model()
     model2 = get_updated_model()
-    assert_array_equal(get_all_params(model1), get_all_params(model2))
+    assert_array_almost_equal(
+        model1.ops.to_numpy(get_all_params(model1)),
+        model2.ops.to_numpy(get_all_params(model2)),
+    )


 @pytest.mark.parametrize("model_func,kwargs", [(StaticVectors, {"nO": 128, "nM": 300})])
@@ -3,10 +3,10 @@ import pytest
 from pytest import approx
 from spacy.training import Example
 from spacy.training.iob_utils import offsets_to_biluo_tags
-from spacy.scorer import Scorer, ROCAUCScore
+from spacy.scorer import Scorer, ROCAUCScore, PRFScore
 from spacy.scorer import _roc_auc_score, _roc_curve
 from spacy.lang.en import English
-from spacy.tokens import Doc
+from spacy.tokens import Doc, Span


 test_las_apple = [
@@ -403,3 +403,68 @@ def test_roc_auc_score():
     score.score_set(0.75, 1)
     with pytest.raises(ValueError):
         _ = score.score  # noqa: F841
+
+
+def test_score_spans():
+    nlp = English()
+    text = "This is just a random sentence."
+    key = "my_spans"
+    gold = nlp.make_doc(text)
+    pred = nlp.make_doc(text)
+    spans = []
+    spans.append(gold.char_span(0, 4, label="PERSON"))
+    spans.append(gold.char_span(0, 7, label="ORG"))
+    spans.append(gold.char_span(8, 12, label="ORG"))
+    gold.spans[key] = spans
+
+    def span_getter(doc, span_key):
+        return doc.spans[span_key]
+
+    # Predict exactly the same, but overlapping spans will be discarded
+    pred.spans[key] = spans
+    eg = Example(pred, gold)
+    scores = Scorer.score_spans([eg], attr=key, getter=span_getter)
+    assert scores[f"{key}_p"] == 1.0
+    assert scores[f"{key}_r"] < 1.0
+
+    # Allow overlapping, now both precision and recall should be 100%
+    pred.spans[key] = spans
+    eg = Example(pred, gold)
+    scores = Scorer.score_spans([eg], attr=key, getter=span_getter, allow_overlap=True)
+    assert scores[f"{key}_p"] == 1.0
+    assert scores[f"{key}_r"] == 1.0
+
+    # Change the predicted labels
+    new_spans = [Span(pred, span.start, span.end, label="WRONG") for span in spans]
+    pred.spans[key] = new_spans
+    eg = Example(pred, gold)
+    scores = Scorer.score_spans([eg], attr=key, getter=span_getter, allow_overlap=True)
+    assert scores[f"{key}_p"] == 0.0
+    assert scores[f"{key}_r"] == 0.0
+    assert f"{key}_per_type" in scores
+
+    # Discard labels from the evaluation
+    scores = Scorer.score_spans([eg], attr=key, getter=span_getter, allow_overlap=True, labeled=False)
+    assert scores[f"{key}_p"] == 1.0
+    assert scores[f"{key}_r"] == 1.0
+    assert f"{key}_per_type" not in scores
+
+
+def test_prf_score():
+    cand = {"hi", "ho"}
+    gold1 = {"yo", "hi"}
+    gold2 = set()
+
+    a = PRFScore()
+    a.score_set(cand=cand, gold=gold1)
+    assert (a.precision, a.recall, a.fscore) == approx((0.5, 0.5, 0.5))
+
+    b = PRFScore()
+    b.score_set(cand=cand, gold=gold2)
+    assert (b.precision, b.recall, b.fscore) == approx((0.0, 0.0, 0.0))
+
+    c = a + b
+    assert (c.precision, c.recall, c.fscore) == approx((0.25, 0.5, 0.33333333))
+
+    a += b
+    assert (a.precision, a.recall, a.fscore) == approx((c.precision, c.recall, c.fscore))
@@ -1,5 +1,7 @@
 import pytest
+import re
 from spacy.util import get_lang_class
+from spacy.tokenizer import Tokenizer

 # Only include languages with no external dependencies
 # "is" seems to confuse importlib, so we're also excluding it for now
@@ -60,3 +62,18 @@ def test_tokenizer_explain(lang):
     tokens = [t.text for t in tokenizer(sentence) if not t.is_space]
     debug_tokens = [t[1] for t in tokenizer.explain(sentence)]
     assert tokens == debug_tokens
+
+
+def test_tokenizer_explain_special_matcher(en_vocab):
+    suffix_re = re.compile(r"[\.]$")
+    infix_re = re.compile(r"[/]")
+    rules = {"a.": [{"ORTH": "a."}]}
+    tokenizer = Tokenizer(
+        en_vocab,
+        rules=rules,
+        suffix_search=suffix_re.search,
+        infix_finditer=infix_re.finditer,
+    )
+    tokens = [t.text for t in tokenizer("a/a.")]
+    explain_tokens = [t[1] for t in tokenizer.explain("a/a.")]
+    assert tokens == explain_tokens
@@ -1,4 +1,5 @@
 import pytest
+import re
 from spacy.vocab import Vocab
 from spacy.tokenizer import Tokenizer
 from spacy.util import ensure_path
@@ -186,3 +187,31 @@ def test_tokenizer_special_cases_spaces(tokenizer):
     assert [t.text for t in tokenizer("a b c")] == ["a", "b", "c"]
     tokenizer.add_special_case("a b c", [{"ORTH": "a b c"}])
     assert [t.text for t in tokenizer("a b c")] == ["a b c"]
+
+
+def test_tokenizer_flush_cache(en_vocab):
+    suffix_re = re.compile(r"[\.]$")
+    tokenizer = Tokenizer(
+        en_vocab,
+        suffix_search=suffix_re.search,
+    )
+    assert [t.text for t in tokenizer("a.")] == ["a", "."]
+    tokenizer.suffix_search = None
+    assert [t.text for t in tokenizer("a.")] == ["a."]
+
+
+def test_tokenizer_flush_specials(en_vocab):
+    suffix_re = re.compile(r"[\.]$")
+    rules = {"a a": [{"ORTH": "a a"}]}
+    tokenizer1 = Tokenizer(
+        en_vocab,
+        suffix_search=suffix_re.search,
+        rules=rules,
+    )
+    tokenizer2 = Tokenizer(
+        en_vocab,
+        suffix_search=suffix_re.search,
+    )
+    assert [t.text for t in tokenizer1("a a.")] == ["a a", "."]
+    tokenizer1.rules = {}
+    assert [t.text for t in tokenizer1("a a.")] == ["a", "a", "."]
@@ -2,6 +2,7 @@ import pytest
 from spacy.training.example import Example
 from spacy.tokens import Doc
 from spacy.vocab import Vocab
+from spacy.util import to_ternary_int


 def test_Example_init_requires_doc_objects():
@@ -121,7 +122,7 @@ def test_Example_from_dict_with_morphology(annots):
     [
         {
             "words": ["This", "is", "one", "sentence", "this", "is", "another"],
-            "sent_starts": [1, 0, 0, 0, 1, 0, 0],
+            "sent_starts": [1, False, 0, None, True, -1, -5.7],
         }
     ],
 )
@@ -131,7 +132,12 @@ def test_Example_from_dict_with_sent_start(annots):
     example = Example.from_dict(predicted, annots)
    assert len(list(example.reference.sents)) == 2
    for i, token in enumerate(example.reference):
-        assert bool(token.is_sent_start) == bool(annots["sent_starts"][i])
+        if to_ternary_int(annots["sent_starts"][i]) == 1:
+            assert token.is_sent_start is True
+        elif to_ternary_int(annots["sent_starts"][i]) == 0:
+            assert token.is_sent_start is None
+        else:
+            assert token.is_sent_start is False


 @pytest.mark.parametrize(
@@ -426,6 +426,29 @@ def test_aligned_spans_x2y(en_vocab, en_tokenizer):
     assert [(ent.start, ent.end) for ent in ents_x2y] == [(0, 2), (4, 6)]


+def test_aligned_spans_y2x_overlap(en_vocab, en_tokenizer):
+    text = "I flew to San Francisco Valley"
+    nlp = English()
+    doc = nlp(text)
+    # the reference doc has overlapping spans
+    gold_doc = nlp.make_doc(text)
+    spans = []
+    prefix = "I flew to "
+    spans.append(gold_doc.char_span(len(prefix), len(prefix + "San Francisco"), label="CITY"))
+    spans.append(gold_doc.char_span(len(prefix), len(prefix + "San Francisco Valley"), label="VALLEY"))
+    spans_key = "overlap_ents"
+    gold_doc.spans[spans_key] = spans
+    example = Example(doc, gold_doc)
+    spans_gold = example.reference.spans[spans_key]
+    assert [(ent.start, ent.end) for ent in spans_gold] == [(3, 5), (3, 6)]
+
+    # Ensure that 'get_aligned_spans_y2x' has the aligned entities correct
+    spans_y2x_no_overlap = example.get_aligned_spans_y2x(spans_gold, allow_overlap=False)
+    assert [(ent.start, ent.end) for ent in spans_y2x_no_overlap] == [(3, 5)]
+    spans_y2x_overlap = example.get_aligned_spans_y2x(spans_gold, allow_overlap=True)
+    assert [(ent.start, ent.end) for ent in spans_y2x_overlap] == [(3, 5), (3, 6)]
+
+
 def test_gold_ner_missing_tags(en_tokenizer):
     doc = en_tokenizer("I flew to Silicon Valley via London.")
     biluo_tags = [None, "O", "O", "B-LOC", "L-LOC", "O", "U-GPE", "O"]
@@ -5,6 +5,7 @@ import srsly
 from spacy.tokens import Doc
 from spacy.vocab import Vocab
 from spacy.util import make_tempdir  # noqa: F401
+from thinc.api import get_current_ops


 @contextlib.contextmanager
@@ -58,7 +59,10 @@ def add_vecs_to_vocab(vocab, vectors):

 def get_cosine(vec1, vec2):
     """Get cosine for two given vectors"""
-    return numpy.dot(vec1, vec2) / (numpy.linalg.norm(vec1) * numpy.linalg.norm(vec2))
+    OPS = get_current_ops()
+    v1 = OPS.to_numpy(OPS.asarray(vec1))
+    v2 = OPS.to_numpy(OPS.asarray(vec2))
+    return numpy.dot(v1, v2) / (numpy.linalg.norm(v1) * numpy.linalg.norm(v2))


 def assert_docs_equal(doc1, doc2):
@ -1,6 +1,7 @@
|
||||||
import pytest
|
import pytest
|
||||||
import numpy
|
import numpy
|
||||||
from numpy.testing import assert_allclose, assert_equal
|
from numpy.testing import assert_allclose, assert_equal
|
||||||
|
from thinc.api import get_current_ops
|
||||||
from spacy.vocab import Vocab
|
from spacy.vocab import Vocab
|
||||||
from spacy.vectors import Vectors
|
from spacy.vectors import Vectors
|
||||||
from spacy.tokenizer import Tokenizer
|
from spacy.tokenizer import Tokenizer
|
||||||
|
@ -9,6 +10,7 @@ from spacy.tokens import Doc
|
||||||
|
|
||||||
from ..util import add_vecs_to_vocab, get_cosine, make_tempdir
|
from ..util import add_vecs_to_vocab, get_cosine, make_tempdir
|
||||||
|
|
||||||
|
OPS = get_current_ops()
|
||||||
|
|
||||||
@pytest.fixture
|
@pytest.fixture
|
||||||
def strings():
|
def strings():
|
||||||
|
@ -18,21 +20,21 @@ def strings():
|
||||||
@pytest.fixture
|
@pytest.fixture
|
||||||
def vectors():
|
def vectors():
|
||||||
return [
|
return [
|
||||||
("apple", [1, 2, 3]),
|
("apple", OPS.asarray([1, 2, 3])),
|
||||||
("orange", [-1, -2, -3]),
|
("orange", OPS.asarray([-1, -2, -3])),
|
||||||
("and", [-1, -1, -1]),
|
("and", OPS.asarray([-1, -1, -1])),
|
||||||
("juice", [5, 5, 10]),
|
("juice", OPS.asarray([5, 5, 10])),
|
||||||
("pie", [7, 6.3, 8.9]),
|
("pie", OPS.asarray([7, 6.3, 8.9])),
|
||||||
]
|
]
|
||||||
|
|
||||||
|
|
||||||
@pytest.fixture
|
@pytest.fixture
|
||||||
def ngrams_vectors():
|
def ngrams_vectors():
|
||||||
return [
|
return [
|
||||||
("apple", [1, 2, 3]),
|
("apple", OPS.asarray([1, 2, 3])),
|
||||||
("app", [-0.1, -0.2, -0.3]),
|
("app", OPS.asarray([-0.1, -0.2, -0.3])),
|
||||||
("ppl", [-0.2, -0.3, -0.4]),
|
("ppl", OPS.asarray([-0.2, -0.3, -0.4])),
|
||||||
("pl", [0.7, 0.8, 0.9]),
|
("pl", OPS.asarray([0.7, 0.8, 0.9])),
|
||||||
]
|
]
|
||||||
|
|
||||||
|
|
||||||
|
@ -171,8 +173,10 @@ def test_vectors_most_similar_identical():
|
||||||
@pytest.mark.parametrize("text", ["apple and orange"])
|
@pytest.mark.parametrize("text", ["apple and orange"])
|
||||||
def test_vectors_token_vector(tokenizer_v, vectors, text):
|
def test_vectors_token_vector(tokenizer_v, vectors, text):
|
||||||
doc = tokenizer_v(text)
|
doc = tokenizer_v(text)
|
||||||
assert vectors[0] == (doc[0].text, list(doc[0].vector))
|
assert vectors[0][0] == doc[0].text
|
||||||
assert vectors[1] == (doc[2].text, list(doc[2].vector))
|
assert all([a == b for a, b in zip(vectors[0][1], doc[0].vector)])
|
||||||
|
assert vectors[1][0] == doc[2].text
|
||||||
|
assert all([a == b for a, b in zip(vectors[1][1], doc[2].vector)])
|
||||||
|
|
||||||
|
|
||||||
@pytest.mark.parametrize("text", ["apple"])
|
@pytest.mark.parametrize("text", ["apple"])
|
||||||
|
@ -301,7 +305,7 @@ def test_vectors_doc_doc_similarity(vocab, text1, text2):
|
||||||
|
|
||||||
def test_vocab_add_vector():
|
def test_vocab_add_vector():
|
||||||
vocab = Vocab(vectors_name="test_vocab_add_vector")
|
vocab = Vocab(vectors_name="test_vocab_add_vector")
|
||||||
data = numpy.ndarray((5, 3), dtype="f")
|
data = OPS.xp.ndarray((5, 3), dtype="f")
|
||||||
data[0] = 1.0
|
data[0] = 1.0
|
||||||
data[1] = 2.0
|
data[1] = 2.0
|
||||||
vocab.set_vector("cat", data[0])
|
vocab.set_vector("cat", data[0])
|
||||||
|
@ -320,10 +324,10 @@ def test_vocab_prune_vectors():
|
||||||
_ = vocab["cat"] # noqa: F841
|
_ = vocab["cat"] # noqa: F841
|
||||||
_ = vocab["dog"] # noqa: F841
|
_ = vocab["dog"] # noqa: F841
|
||||||
_ = vocab["kitten"] # noqa: F841
|
_ = vocab["kitten"] # noqa: F841
|
||||||
data = numpy.ndarray((5, 3), dtype="f")
|
data = OPS.xp.ndarray((5, 3), dtype="f")
|
||||||
-    data[0] = [1.0, 1.2, 1.1]
+    data[0] = OPS.asarray([1.0, 1.2, 1.1])
-    data[1] = [0.3, 1.3, 1.0]
+    data[1] = OPS.asarray([0.3, 1.3, 1.0])
-    data[2] = [0.9, 1.22, 1.05]
+    data[2] = OPS.asarray([0.9, 1.22, 1.05])
     vocab.set_vector("cat", data[0])
     vocab.set_vector("dog", data[1])
     vocab.set_vector("kitten", data[2])
@@ -332,40 +336,41 @@ def test_vocab_prune_vectors():
     assert list(remap.keys()) == ["kitten"]
     neighbour, similarity = list(remap.values())[0]
     assert neighbour == "cat", remap
-    assert_allclose(similarity, get_cosine(data[0], data[2]), atol=1e-4, rtol=1e-3)
+    cosine = get_cosine(data[0], data[2])
+    assert_allclose(float(similarity), cosine, atol=1e-4, rtol=1e-3)


 def test_vectors_serialize():
-    data = numpy.asarray([[4, 2, 2, 2], [4, 2, 2, 2], [1, 1, 1, 1]], dtype="f")
+    data = OPS.asarray([[4, 2, 2, 2], [4, 2, 2, 2], [1, 1, 1, 1]], dtype="f")
     v = Vectors(data=data, keys=["A", "B", "C"])
     b = v.to_bytes()
     v_r = Vectors()
     v_r.from_bytes(b)
-    assert_equal(v.data, v_r.data)
+    assert_equal(OPS.to_numpy(v.data), OPS.to_numpy(v_r.data))
     assert v.key2row == v_r.key2row
     v.resize((5, 4))
     v_r.resize((5, 4))
-    row = v.add("D", vector=numpy.asarray([1, 2, 3, 4], dtype="f"))
+    row = v.add("D", vector=OPS.asarray([1, 2, 3, 4], dtype="f"))
-    row_r = v_r.add("D", vector=numpy.asarray([1, 2, 3, 4], dtype="f"))
+    row_r = v_r.add("D", vector=OPS.asarray([1, 2, 3, 4], dtype="f"))
     assert row == row_r
-    assert_equal(v.data, v_r.data)
+    assert_equal(OPS.to_numpy(v.data), OPS.to_numpy(v_r.data))
     assert v.is_full == v_r.is_full
     with make_tempdir() as d:
         v.to_disk(d)
         v_r.from_disk(d)
-        assert_equal(v.data, v_r.data)
+        assert_equal(OPS.to_numpy(v.data), OPS.to_numpy(v_r.data))
         assert v.key2row == v_r.key2row
         v.resize((5, 4))
         v_r.resize((5, 4))
-        row = v.add("D", vector=numpy.asarray([10, 20, 30, 40], dtype="f"))
+        row = v.add("D", vector=OPS.asarray([10, 20, 30, 40], dtype="f"))
-        row_r = v_r.add("D", vector=numpy.asarray([10, 20, 30, 40], dtype="f"))
+        row_r = v_r.add("D", vector=OPS.asarray([10, 20, 30, 40], dtype="f"))
         assert row == row_r
-        assert_equal(v.data, v_r.data)
+        assert_equal(OPS.to_numpy(v.data), OPS.to_numpy(v_r.data))


 def test_vector_is_oov():
     vocab = Vocab(vectors_name="test_vocab_is_oov")
-    data = numpy.ndarray((5, 3), dtype="f")
+    data = OPS.xp.ndarray((5, 3), dtype="f")
     data[0] = 1.0
     data[1] = 2.0
     vocab.set_vector("cat", data[0])
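The test changes above replace direct `numpy` calls with a module-level `OPS` object so the same assertions pass on CPU and GPU. A minimal sketch of that pattern, assuming `OPS = get_current_ops()` as elsewhere in this diff:

```python
from numpy.testing import assert_equal
from thinc.api import get_current_ops

OPS = get_current_ops()                        # NumpyOps on CPU, CupyOps on GPU
data = OPS.asarray([[4, 2, 2, 2]], dtype="f")  # array lives on the active device
# convert back to numpy before comparing, so the assert works on either backend
assert_equal(OPS.to_numpy(data), OPS.to_numpy(data))
```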
@@ -23,8 +23,8 @@ cdef class Tokenizer:
     cdef object _infix_finditer
     cdef object _rules
     cdef PhraseMatcher _special_matcher
-    cdef int _property_init_count
-    cdef int _property_init_max
+    cdef int _property_init_count # TODO: unused, remove in v3.1
+    cdef int _property_init_max # TODO: unused, remove in v3.1

     cdef Doc _tokenize_affixes(self, unicode string, bint with_special_cases)
     cdef int _apply_special_cases(self, Doc doc) except -1
@ -20,11 +20,12 @@ from .attrs import intify_attrs
|
||||||
from .symbols import ORTH, NORM
|
from .symbols import ORTH, NORM
|
||||||
from .errors import Errors, Warnings
|
from .errors import Errors, Warnings
|
||||||
from . import util
|
from . import util
|
||||||
from .util import registry
|
from .util import registry, get_words_and_spaces
|
||||||
from .attrs import intify_attrs
|
from .attrs import intify_attrs
|
||||||
from .symbols import ORTH
|
from .symbols import ORTH
|
||||||
from .scorer import Scorer
|
from .scorer import Scorer
|
||||||
from .training import validate_examples
|
from .training import validate_examples
|
||||||
|
from .tokens import Span
|
||||||
|
|
||||||
|
|
||||||
cdef class Tokenizer:
|
cdef class Tokenizer:
|
||||||
|
@ -68,8 +69,6 @@ cdef class Tokenizer:
|
||||||
self._rules = {}
|
self._rules = {}
|
||||||
self._special_matcher = PhraseMatcher(self.vocab)
|
self._special_matcher = PhraseMatcher(self.vocab)
|
||||||
self._load_special_cases(rules)
|
self._load_special_cases(rules)
|
||||||
self._property_init_count = 0
|
|
||||||
self._property_init_max = 4
|
|
||||||
|
|
||||||
property token_match:
|
property token_match:
|
||||||
def __get__(self):
|
def __get__(self):
|
||||||
|
@ -78,8 +77,6 @@ cdef class Tokenizer:
|
||||||
def __set__(self, token_match):
|
def __set__(self, token_match):
|
||||||
self._token_match = token_match
|
self._token_match = token_match
|
||||||
self._reload_special_cases()
|
self._reload_special_cases()
|
||||||
if self._property_init_count <= self._property_init_max:
|
|
||||||
self._property_init_count += 1
|
|
||||||
|
|
||||||
property url_match:
|
property url_match:
|
||||||
def __get__(self):
|
def __get__(self):
|
||||||
|
@ -87,7 +84,7 @@ cdef class Tokenizer:
|
||||||
|
|
||||||
def __set__(self, url_match):
|
def __set__(self, url_match):
|
||||||
self._url_match = url_match
|
self._url_match = url_match
|
||||||
self._flush_cache()
|
self._reload_special_cases()
|
||||||
|
|
||||||
property prefix_search:
|
property prefix_search:
|
||||||
def __get__(self):
|
def __get__(self):
|
||||||
|
@ -96,8 +93,6 @@ cdef class Tokenizer:
|
||||||
def __set__(self, prefix_search):
|
def __set__(self, prefix_search):
|
||||||
self._prefix_search = prefix_search
|
self._prefix_search = prefix_search
|
||||||
self._reload_special_cases()
|
self._reload_special_cases()
|
||||||
if self._property_init_count <= self._property_init_max:
|
|
||||||
self._property_init_count += 1
|
|
||||||
|
|
||||||
property suffix_search:
|
property suffix_search:
|
||||||
def __get__(self):
|
def __get__(self):
|
||||||
|
@ -106,8 +101,6 @@ cdef class Tokenizer:
|
||||||
def __set__(self, suffix_search):
|
def __set__(self, suffix_search):
|
||||||
self._suffix_search = suffix_search
|
self._suffix_search = suffix_search
|
||||||
self._reload_special_cases()
|
self._reload_special_cases()
|
||||||
if self._property_init_count <= self._property_init_max:
|
|
||||||
self._property_init_count += 1
|
|
||||||
|
|
||||||
property infix_finditer:
|
property infix_finditer:
|
||||||
def __get__(self):
|
def __get__(self):
|
||||||
|
@ -116,8 +109,6 @@ cdef class Tokenizer:
|
||||||
def __set__(self, infix_finditer):
|
def __set__(self, infix_finditer):
|
||||||
self._infix_finditer = infix_finditer
|
self._infix_finditer = infix_finditer
|
||||||
self._reload_special_cases()
|
self._reload_special_cases()
|
||||||
if self._property_init_count <= self._property_init_max:
|
|
||||||
self._property_init_count += 1
|
|
||||||
|
|
||||||
property rules:
|
property rules:
|
||||||
def __get__(self):
|
def __get__(self):
|
||||||
|
@ -125,7 +116,7 @@ cdef class Tokenizer:
|
||||||
|
|
||||||
def __set__(self, rules):
|
def __set__(self, rules):
|
||||||
self._rules = {}
|
self._rules = {}
|
||||||
self._reset_cache([key for key in self._cache])
|
self._flush_cache()
|
||||||
self._flush_specials()
|
self._flush_specials()
|
||||||
self._cache = PreshMap()
|
self._cache = PreshMap()
|
||||||
self._specials = PreshMap()
|
self._specials = PreshMap()
|
||||||
|
@ -225,6 +216,7 @@ cdef class Tokenizer:
|
||||||
self.mem.free(cached)
|
self.mem.free(cached)
|
||||||
|
|
||||||
def _flush_specials(self):
|
def _flush_specials(self):
|
||||||
|
self._special_matcher = PhraseMatcher(self.vocab)
|
||||||
for k in self._specials:
|
for k in self._specials:
|
||||||
cached = <_Cached*>self._specials.get(k)
|
cached = <_Cached*>self._specials.get(k)
|
||||||
del self._specials[k]
|
del self._specials[k]
|
||||||
|
@ -567,7 +559,6 @@ cdef class Tokenizer:
|
||||||
"""Add special-case tokenization rules."""
|
"""Add special-case tokenization rules."""
|
||||||
if special_cases is not None:
|
if special_cases is not None:
|
||||||
for chunk, substrings in sorted(special_cases.items()):
|
for chunk, substrings in sorted(special_cases.items()):
|
||||||
self._validate_special_case(chunk, substrings)
|
|
||||||
self.add_special_case(chunk, substrings)
|
self.add_special_case(chunk, substrings)
|
||||||
|
|
||||||
def _validate_special_case(self, chunk, substrings):
|
def _validate_special_case(self, chunk, substrings):
|
||||||
|
@ -615,16 +606,9 @@ cdef class Tokenizer:
|
||||||
self._special_matcher.add(string, None, self._tokenize_affixes(string, False))
|
self._special_matcher.add(string, None, self._tokenize_affixes(string, False))
|
||||||
|
|
||||||
def _reload_special_cases(self):
|
def _reload_special_cases(self):
|
||||||
try:
|
self._flush_cache()
|
||||||
self._property_init_count
|
self._flush_specials()
|
||||||
except AttributeError:
|
self._load_special_cases(self._rules)
|
||||||
return
|
|
||||||
# only reload if all 4 of prefix, suffix, infix, token_match have
|
|
||||||
# have been initialized
|
|
||||||
if self.vocab is not None and self._property_init_count >= self._property_init_max:
|
|
||||||
self._flush_cache()
|
|
||||||
self._flush_specials()
|
|
||||||
self._load_special_cases(self._rules)
|
|
||||||
|
|
||||||
def explain(self, text):
|
def explain(self, text):
|
||||||
"""A debugging tokenizer that provides information about which
|
"""A debugging tokenizer that provides information about which
|
||||||
|
@ -638,8 +622,14 @@ cdef class Tokenizer:
|
||||||
DOCS: https://spacy.io/api/tokenizer#explain
|
DOCS: https://spacy.io/api/tokenizer#explain
|
||||||
"""
|
"""
|
||||||
prefix_search = self.prefix_search
|
prefix_search = self.prefix_search
|
||||||
|
if prefix_search is None:
|
||||||
|
prefix_search = re.compile("a^").search
|
||||||
suffix_search = self.suffix_search
|
suffix_search = self.suffix_search
|
||||||
|
if suffix_search is None:
|
||||||
|
suffix_search = re.compile("a^").search
|
||||||
infix_finditer = self.infix_finditer
|
infix_finditer = self.infix_finditer
|
||||||
|
if infix_finditer is None:
|
||||||
|
infix_finditer = re.compile("a^").finditer
|
||||||
token_match = self.token_match
|
token_match = self.token_match
|
||||||
if token_match is None:
|
if token_match is None:
|
||||||
token_match = re.compile("a^").match
|
token_match = re.compile("a^").match
|
||||||
|
@ -687,7 +677,7 @@ cdef class Tokenizer:
|
||||||
tokens.append(("URL_MATCH", substring))
|
tokens.append(("URL_MATCH", substring))
|
||||||
substring = ''
|
substring = ''
|
||||||
elif substring in special_cases:
|
elif substring in special_cases:
|
||||||
tokens.extend(("SPECIAL-" + str(i + 1), self.vocab.strings[e[ORTH]]) for i, e in enumerate(special_cases[substring]))
|
tokens.extend((f"SPECIAL-{i + 1}", self.vocab.strings[e[ORTH]]) for i, e in enumerate(special_cases[substring]))
|
||||||
substring = ''
|
substring = ''
|
||||||
elif list(infix_finditer(substring)):
|
elif list(infix_finditer(substring)):
|
||||||
infixes = infix_finditer(substring)
|
infixes = infix_finditer(substring)
|
||||||
|
@ -705,7 +695,33 @@ cdef class Tokenizer:
|
||||||
tokens.append(("TOKEN", substring))
|
tokens.append(("TOKEN", substring))
|
||||||
substring = ''
|
substring = ''
|
||||||
tokens.extend(reversed(suffixes))
|
tokens.extend(reversed(suffixes))
|
||||||
return tokens
|
# Find matches for special cases handled by special matcher
|
||||||
|
words, spaces = get_words_and_spaces([t[1] for t in tokens], text)
|
||||||
|
t_words = []
|
||||||
|
t_spaces = []
|
||||||
|
for word, space in zip(words, spaces):
|
||||||
|
if not word.isspace():
|
||||||
|
t_words.append(word)
|
||||||
|
t_spaces.append(space)
|
||||||
|
doc = Doc(self.vocab, words=t_words, spaces=t_spaces)
|
||||||
|
matches = self._special_matcher(doc)
|
||||||
|
spans = [Span(doc, s, e, label=m_id) for m_id, s, e in matches]
|
||||||
|
spans = util.filter_spans(spans)
|
||||||
|
# Replace matched tokens with their exceptions
|
||||||
|
i = 0
|
||||||
|
final_tokens = []
|
||||||
|
spans_by_start = {s.start: s for s in spans}
|
||||||
|
while i < len(tokens):
|
||||||
|
if i in spans_by_start:
|
||||||
|
span = spans_by_start[i]
|
||||||
|
exc = [d[ORTH] for d in special_cases[span.label_]]
|
||||||
|
for j, orth in enumerate(exc):
|
||||||
|
final_tokens.append((f"SPECIAL-{j + 1}", self.vocab.strings[orth]))
|
||||||
|
i += len(span)
|
||||||
|
else:
|
||||||
|
final_tokens.append(tokens[i])
|
||||||
|
i += 1
|
||||||
|
return final_tokens
|
||||||
|
|
||||||
def score(self, examples, **kwargs):
|
def score(self, examples, **kwargs):
|
||||||
validate_examples(examples, "Tokenizer.score")
|
validate_examples(examples, "Tokenizer.score")
|
||||||
|
@ -778,6 +794,15 @@ cdef class Tokenizer:
|
||||||
"url_match": lambda b: data.setdefault("url_match", b),
|
"url_match": lambda b: data.setdefault("url_match", b),
|
||||||
"exceptions": lambda b: data.setdefault("rules", b)
|
"exceptions": lambda b: data.setdefault("rules", b)
|
||||||
}
|
}
|
||||||
|
# reset all properties and flush all caches (through rules),
|
||||||
|
# reset rules first so that _reload_special_cases is trivial/fast as
|
||||||
|
# the other properties are reset
|
||||||
|
self.rules = {}
|
||||||
|
self.prefix_search = None
|
||||||
|
self.suffix_search = None
|
||||||
|
self.infix_finditer = None
|
||||||
|
self.token_match = None
|
||||||
|
self.url_match = None
|
||||||
msg = util.from_bytes(bytes_data, deserializers, exclude)
|
msg = util.from_bytes(bytes_data, deserializers, exclude)
|
||||||
if "prefix_search" in data and isinstance(data["prefix_search"], str):
|
if "prefix_search" in data and isinstance(data["prefix_search"], str):
|
||||||
self.prefix_search = re.compile(data["prefix_search"]).search
|
self.prefix_search = re.compile(data["prefix_search"]).search
|
||||||
|
@ -785,22 +810,12 @@ cdef class Tokenizer:
|
||||||
self.suffix_search = re.compile(data["suffix_search"]).search
|
self.suffix_search = re.compile(data["suffix_search"]).search
|
||||||
if "infix_finditer" in data and isinstance(data["infix_finditer"], str):
|
if "infix_finditer" in data and isinstance(data["infix_finditer"], str):
|
||||||
self.infix_finditer = re.compile(data["infix_finditer"]).finditer
|
self.infix_finditer = re.compile(data["infix_finditer"]).finditer
|
||||||
# for token_match and url_match, set to None to override the language
|
|
||||||
# defaults if no regex is provided
|
|
||||||
if "token_match" in data and isinstance(data["token_match"], str):
|
if "token_match" in data and isinstance(data["token_match"], str):
|
||||||
self.token_match = re.compile(data["token_match"]).match
|
self.token_match = re.compile(data["token_match"]).match
|
||||||
else:
|
|
||||||
self.token_match = None
|
|
||||||
if "url_match" in data and isinstance(data["url_match"], str):
|
if "url_match" in data and isinstance(data["url_match"], str):
|
||||||
self.url_match = re.compile(data["url_match"]).match
|
self.url_match = re.compile(data["url_match"]).match
|
||||||
else:
|
|
||||||
self.url_match = None
|
|
||||||
if "rules" in data and isinstance(data["rules"], dict):
|
if "rules" in data and isinstance(data["rules"], dict):
|
||||||
# make sure to hard reset the cache to remove data from the default exceptions
|
self.rules = data["rules"]
|
||||||
self._rules = {}
|
|
||||||
self._flush_cache()
|
|
||||||
self._flush_specials()
|
|
||||||
self._load_special_cases(data["rules"])
|
|
||||||
return self
|
return self
|
||||||
|
|
||||||
|
|
||||||
|
|
|
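With the rewrite above, `Tokenizer.explain` also reports matches handled by the special-case matcher, so its output should line up with the tokens the tokenizer itself produces. A small usage sketch (the text and the blank pipeline are illustrative):

```python
import spacy

nlp = spacy.blank("en")
# each entry is (pattern_name, substring), e.g. ("SPECIAL-1", "do"), ("SPECIAL-2", "n't")
for pattern_name, substring in nlp.tokenizer.explain("don't stop"):
    print(pattern_name, substring)
```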
@ -281,7 +281,8 @@ def _merge(Doc doc, merges):
|
||||||
for i in range(doc.length):
|
for i in range(doc.length):
|
||||||
doc.c[i].head -= i
|
doc.c[i].head -= i
|
||||||
# Set the left/right children, left/right edges
|
# Set the left/right children, left/right edges
|
||||||
set_children_from_heads(doc.c, 0, doc.length)
|
if doc.has_annotation("DEP"):
|
||||||
|
set_children_from_heads(doc.c, 0, doc.length)
|
||||||
# Make sure ent_iob remains consistent
|
# Make sure ent_iob remains consistent
|
||||||
make_iob_consistent(doc.c, doc.length)
|
make_iob_consistent(doc.c, doc.length)
|
||||||
# Return the merged Python object
|
# Return the merged Python object
|
||||||
|
@ -294,7 +295,19 @@ def _resize_tensor(tensor, ranges):
|
||||||
for i in range(start, end-1):
|
for i in range(start, end-1):
|
||||||
delete.append(i)
|
delete.append(i)
|
||||||
xp = get_array_module(tensor)
|
xp = get_array_module(tensor)
|
||||||
return xp.delete(tensor, delete, axis=0)
|
if xp is numpy:
|
||||||
|
return xp.delete(tensor, delete, axis=0)
|
||||||
|
else:
|
||||||
|
offset = 0
|
||||||
|
copy_start = 0
|
||||||
|
resized_shape = (tensor.shape[0] - len(delete), tensor.shape[1])
|
||||||
|
for start, end in ranges:
|
||||||
|
if copy_start > 0:
|
||||||
|
tensor[copy_start - offset:start - offset] = tensor[copy_start: start]
|
||||||
|
offset += end - start - 1
|
||||||
|
copy_start = end - 1
|
||||||
|
tensor[copy_start - offset:resized_shape[0]] = tensor[copy_start:]
|
||||||
|
return xp.asarray(tensor[:resized_shape[0]])
|
||||||
|
|
||||||
|
|
||||||
def _split(Doc doc, int token_index, orths, heads, attrs):
|
def _split(Doc doc, int token_index, orths, heads, attrs):
|
||||||
|
@ -331,7 +344,13 @@ def _split(Doc doc, int token_index, orths, heads, attrs):
|
||||||
to_process_tensor = (doc.tensor is not None and doc.tensor.size != 0)
|
to_process_tensor = (doc.tensor is not None and doc.tensor.size != 0)
|
||||||
if to_process_tensor:
|
if to_process_tensor:
|
||||||
xp = get_array_module(doc.tensor)
|
xp = get_array_module(doc.tensor)
|
||||||
doc.tensor = xp.append(doc.tensor, xp.zeros((nb_subtokens,doc.tensor.shape[1]), dtype="float32"), axis=0)
|
if xp is numpy:
|
||||||
|
doc.tensor = xp.append(doc.tensor, xp.zeros((nb_subtokens,doc.tensor.shape[1]), dtype="float32"), axis=0)
|
||||||
|
else:
|
||||||
|
shape = (doc.tensor.shape[0] + nb_subtokens, doc.tensor.shape[1])
|
||||||
|
resized_array = xp.zeros(shape, dtype="float32")
|
||||||
|
resized_array[:doc.tensor.shape[0]] = doc.tensor[:doc.tensor.shape[0]]
|
||||||
|
doc.tensor = resized_array
|
||||||
for token_to_move in range(orig_length - 1, token_index, -1):
|
for token_to_move in range(orig_length - 1, token_index, -1):
|
||||||
doc.c[token_to_move + nb_subtokens - 1] = doc.c[token_to_move]
|
doc.c[token_to_move + nb_subtokens - 1] = doc.c[token_to_move]
|
||||||
if to_process_tensor:
|
if to_process_tensor:
|
||||||
|
@ -348,7 +367,7 @@ def _split(Doc doc, int token_index, orths, heads, attrs):
|
||||||
token.norm = 0 # reset norm
|
token.norm = 0 # reset norm
|
||||||
if to_process_tensor:
|
if to_process_tensor:
|
||||||
# setting the tensors of the split tokens to array of zeros
|
# setting the tensors of the split tokens to array of zeros
|
||||||
doc.tensor[token_index + i] = xp.zeros((1,doc.tensor.shape[1]), dtype="float32")
|
doc.tensor[token_index + i:token_index + i + 1] = xp.zeros((1,doc.tensor.shape[1]), dtype="float32")
|
||||||
# Update the character offset of the subtokens
|
# Update the character offset of the subtokens
|
||||||
if i != 0:
|
if i != 0:
|
||||||
token.idx = orig_token.idx + idx_offset
|
token.idx = orig_token.idx + idx_offset
|
||||||
|
@ -392,7 +411,8 @@ def _split(Doc doc, int token_index, orths, heads, attrs):
|
||||||
for i in range(doc.length):
|
for i in range(doc.length):
|
||||||
doc.c[i].head -= i
|
doc.c[i].head -= i
|
||||||
# set children from head
|
# set children from head
|
||||||
set_children_from_heads(doc.c, 0, doc.length)
|
if doc.has_annotation("DEP"):
|
||||||
|
set_children_from_heads(doc.c, 0, doc.length)
|
||||||
|
|
||||||
|
|
||||||
def _validate_extensions(extensions):
|
def _validate_extensions(extensions):
|
||||||
|
|
|
@ -6,7 +6,7 @@ from libc.math cimport sqrt
|
||||||
from libc.stdint cimport int32_t, uint64_t
|
from libc.stdint cimport int32_t, uint64_t
|
||||||
|
|
||||||
import copy
|
import copy
|
||||||
from collections import Counter
|
from collections import Counter, defaultdict
|
||||||
from enum import Enum
|
from enum import Enum
|
||||||
import itertools
|
import itertools
|
||||||
import numpy
|
import numpy
|
||||||
|
@ -1120,13 +1120,14 @@ cdef class Doc:
|
||||||
concat_words = []
|
concat_words = []
|
||||||
concat_spaces = []
|
concat_spaces = []
|
||||||
concat_user_data = {}
|
concat_user_data = {}
|
||||||
|
concat_spans = defaultdict(list)
|
||||||
char_offset = 0
|
char_offset = 0
|
||||||
for doc in docs:
|
for doc in docs:
|
||||||
concat_words.extend(t.text for t in doc)
|
concat_words.extend(t.text for t in doc)
|
||||||
concat_spaces.extend(bool(t.whitespace_) for t in doc)
|
concat_spaces.extend(bool(t.whitespace_) for t in doc)
|
||||||
|
|
||||||
for key, value in doc.user_data.items():
|
for key, value in doc.user_data.items():
|
||||||
if isinstance(key, tuple) and len(key) == 4:
|
if isinstance(key, tuple) and len(key) == 4 and key[0] == "._.":
|
||||||
data_type, name, start, end = key
|
data_type, name, start, end = key
|
||||||
if start is not None or end is not None:
|
if start is not None or end is not None:
|
||||||
start += char_offset
|
start += char_offset
|
||||||
|
@ -1137,8 +1138,17 @@ cdef class Doc:
|
||||||
warnings.warn(Warnings.W101.format(name=name))
|
warnings.warn(Warnings.W101.format(name=name))
|
||||||
else:
|
else:
|
||||||
warnings.warn(Warnings.W102.format(key=key, value=value))
|
warnings.warn(Warnings.W102.format(key=key, value=value))
|
||||||
|
for key in doc.spans:
|
||||||
|
for span in doc.spans[key]:
|
||||||
|
concat_spans[key].append((
|
||||||
|
span.start_char + char_offset,
|
||||||
|
span.end_char + char_offset,
|
||||||
|
span.label,
|
||||||
|
span.kb_id,
|
||||||
|
span.text, # included as a check
|
||||||
|
))
|
||||||
char_offset += len(doc.text)
|
char_offset += len(doc.text)
|
||||||
if ensure_whitespace and not (len(doc) > 0 and doc[-1].is_space):
|
if len(doc) > 0 and ensure_whitespace and not doc[-1].is_space:
|
||||||
char_offset += 1
|
char_offset += 1
|
||||||
|
|
||||||
arrays = [doc.to_array(attrs) for doc in docs]
|
arrays = [doc.to_array(attrs) for doc in docs]
|
||||||
|
@ -1160,6 +1170,22 @@ cdef class Doc:
|
||||||
|
|
||||||
concat_doc.from_array(attrs, concat_array)
|
concat_doc.from_array(attrs, concat_array)
|
||||||
|
|
||||||
|
for key in concat_spans:
|
||||||
|
if key not in concat_doc.spans:
|
||||||
|
concat_doc.spans[key] = []
|
||||||
|
for span_tuple in concat_spans[key]:
|
||||||
|
span = concat_doc.char_span(
|
||||||
|
span_tuple[0],
|
||||||
|
span_tuple[1],
|
||||||
|
label=span_tuple[2],
|
||||||
|
kb_id=span_tuple[3],
|
||||||
|
)
|
||||||
|
text = span_tuple[4]
|
||||||
|
if span is not None and span.text == text:
|
||||||
|
concat_doc.spans[key].append(span)
|
||||||
|
else:
|
||||||
|
raise ValueError(Errors.E873.format(key=key, text=text))
|
||||||
|
|
||||||
return concat_doc
|
return concat_doc
|
||||||
|
|
||||||
def get_lca_matrix(self):
|
def get_lca_matrix(self):
|
||||||
|
|
|
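The `Doc.from_docs` changes above carry `doc.spans` over into the merged document (with character offsets shifted) and only copy extension-style `user_data` keys. A hedged sketch of the resulting behaviour; the span group name `"events"` is only an illustration:

```python
import spacy
from spacy.tokens import Doc, Span

nlp = spacy.blank("en")
doc1 = nlp("I like cats")
doc1.spans["events"] = [Span(doc1, 2, 3, label="ANIMAL")]
doc2 = nlp("They purr loudly")
merged = Doc.from_docs([doc1, doc2])
print(merged.spans["events"])  # spans re-anchored at their new character offsets
```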
@ -6,6 +6,7 @@ from libc.math cimport sqrt
|
||||||
import numpy
|
import numpy
|
||||||
from thinc.api import get_array_module
|
from thinc.api import get_array_module
|
||||||
import warnings
|
import warnings
|
||||||
|
import copy
|
||||||
|
|
||||||
from .doc cimport token_by_start, token_by_end, get_token_attr, _get_lca_matrix
|
from .doc cimport token_by_start, token_by_end, get_token_attr, _get_lca_matrix
|
||||||
from ..structs cimport TokenC, LexemeC
|
from ..structs cimport TokenC, LexemeC
|
||||||
|
@ -241,7 +242,19 @@ cdef class Span:
|
||||||
if cat_start == self.start_char and cat_end == self.end_char:
|
if cat_start == self.start_char and cat_end == self.end_char:
|
||||||
doc.cats[cat_label] = value
|
doc.cats[cat_label] = value
|
||||||
if copy_user_data:
|
if copy_user_data:
|
||||||
doc.user_data = self.doc.user_data
|
user_data = {}
|
||||||
|
char_offset = self.start_char
|
||||||
|
for key, value in self.doc.user_data.items():
|
||||||
|
if isinstance(key, tuple) and len(key) == 4 and key[0] == "._.":
|
||||||
|
data_type, name, start, end = key
|
||||||
|
if start is not None or end is not None:
|
||||||
|
start -= char_offset
|
||||||
|
if end is not None:
|
||||||
|
end -= char_offset
|
||||||
|
user_data[(data_type, name, start, end)] = copy.copy(value)
|
||||||
|
else:
|
||||||
|
user_data[key] = copy.copy(value)
|
||||||
|
doc.user_data = user_data
|
||||||
return doc
|
return doc
|
||||||
|
|
||||||
def _fix_dep_copy(self, attrs, array):
|
def _fix_dep_copy(self, attrs, array):
|
||||||
|
|
|
@@ -8,3 +8,4 @@ from .iob_utils import biluo_tags_to_spans, tags_to_entities # noqa: F401
 from .gold_io import docs_to_json, read_json_file # noqa: F401
 from .batchers import minibatch_by_padded_size, minibatch_by_words # noqa: F401
 from .loggers import console_logger, wandb_logger # noqa: F401
+from .callbacks import create_copy_from_base_model # noqa: F401
32
spacy/training/callbacks.py
Normal file

@@ -0,0 +1,32 @@
+from typing import Optional
+from ..errors import Errors
+from ..language import Language
+from ..util import load_model, registry, logger
+
+
+@registry.callbacks("spacy.copy_from_base_model.v1")
+def create_copy_from_base_model(
+    tokenizer: Optional[str] = None,
+    vocab: Optional[str] = None,
+) -> Language:
+    def copy_from_base_model(nlp):
+        if tokenizer:
+            logger.info(f"Copying tokenizer from: {tokenizer}")
+            base_nlp = load_model(tokenizer)
+            if nlp.config["nlp"]["tokenizer"] == base_nlp.config["nlp"]["tokenizer"]:
+                nlp.tokenizer.from_bytes(base_nlp.tokenizer.to_bytes(exclude=["vocab"]))
+            else:
+                raise ValueError(
+                    Errors.E872.format(
+                        curr_config=nlp.config["nlp"]["tokenizer"],
+                        base_config=base_nlp.config["nlp"]["tokenizer"],
+                    )
+                )
+        if vocab:
+            logger.info(f"Copying vocab from: {vocab}")
+            # only reload if the vocab is from a different model
+            if tokenizer != vocab:
+                base_nlp = load_model(vocab)
+            nlp.vocab.from_bytes(base_nlp.vocab.to_bytes())
+
+    return copy_from_base_model
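In a training config the new callback is referenced from the `[initialize]` block; a hedged sketch (the pipeline name `en_core_web_sm` is a placeholder):

```ini
[initialize.before_init]
@callbacks = "spacy.copy_from_base_model.v1"
tokenizer = "en_core_web_sm"
vocab = "en_core_web_sm"
```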
@@ -124,6 +124,9 @@ def segment_sents_and_docs(doc, n_sents, doc_delimiter, model=None, msg=None):
         nlp = load_model(model)
         if "parser" in nlp.pipe_names:
             msg.info(f"Segmenting sentences with parser from model '{model}'.")
+            for name, proc in nlp.pipeline:
+                if "parser" in getattr(proc, "listening_components", []):
+                    nlp.replace_listeners(name, "parser", ["model.tok2vec"])
             sentencizer = nlp.get_pipe("parser")
     if not sentencizer:
         msg.info(
@@ -2,6 +2,7 @@ import warnings
 from typing import Union, List, Iterable, Iterator, TYPE_CHECKING, Callable
 from typing import Optional
 from pathlib import Path
+import random
 import srsly

 from .. import util
@@ -96,6 +97,7 @@ class Corpus:
         Defaults to 0, which indicates no limit.
     augment (Callable[Example, Iterable[Example]]): Optional data augmentation
         function, to extrapolate additional examples from your annotations.
+    shuffle (bool): Whether to shuffle the examples.

     DOCS: https://spacy.io/api/corpus
     """
@@ -108,12 +110,14 @@ class Corpus:
         gold_preproc: bool = False,
         max_length: int = 0,
         augmenter: Optional[Callable] = None,
+        shuffle: bool = False,
     ) -> None:
         self.path = util.ensure_path(path)
         self.gold_preproc = gold_preproc
         self.max_length = max_length
         self.limit = limit
         self.augmenter = augmenter if augmenter is not None else dont_augment
+        self.shuffle = shuffle

     def __call__(self, nlp: "Language") -> Iterator[Example]:
         """Yield examples from the data.
@@ -124,6 +128,10 @@ class Corpus:
         DOCS: https://spacy.io/api/corpus#call
         """
         ref_docs = self.read_docbin(nlp.vocab, walk_corpus(self.path, FILE_TYPE))
+        if self.shuffle:
+            ref_docs = list(ref_docs)
+            random.shuffle(ref_docs)
+
         if self.gold_preproc:
             examples = self.make_examples_gold_preproc(nlp, ref_docs)
         else:
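The new `shuffle` flag loads the parsed docs into memory and shuffles them before batching. A usage sketch, with an illustrative local path:

```python
import spacy
from spacy.training import Corpus

nlp = spacy.blank("en")
corpus = Corpus("./corpus/train.spacy", shuffle=True)  # reshuffled on each call
examples = list(corpus(nlp))
```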
@ -13,7 +13,7 @@ from .iob_utils import biluo_tags_to_spans
|
||||||
from ..errors import Errors, Warnings
|
from ..errors import Errors, Warnings
|
||||||
from ..pipeline._parser_internals import nonproj
|
from ..pipeline._parser_internals import nonproj
|
||||||
from ..tokens.token cimport MISSING_DEP
|
from ..tokens.token cimport MISSING_DEP
|
||||||
from ..util import logger
|
from ..util import logger, to_ternary_int
|
||||||
|
|
||||||
|
|
||||||
cpdef Doc annotations_to_doc(vocab, tok_annot, doc_annot):
|
cpdef Doc annotations_to_doc(vocab, tok_annot, doc_annot):
|
||||||
|
@ -213,18 +213,19 @@ cdef class Example:
|
||||||
else:
|
else:
|
||||||
return [None] * len(self.x)
|
return [None] * len(self.x)
|
||||||
|
|
||||||
def get_aligned_spans_x2y(self, x_spans):
|
def get_aligned_spans_x2y(self, x_spans, allow_overlap=False):
|
||||||
return self._get_aligned_spans(self.y, x_spans, self.alignment.x2y)
|
return self._get_aligned_spans(self.y, x_spans, self.alignment.x2y, allow_overlap)
|
||||||
|
|
||||||
def get_aligned_spans_y2x(self, y_spans):
|
def get_aligned_spans_y2x(self, y_spans, allow_overlap=False):
|
||||||
return self._get_aligned_spans(self.x, y_spans, self.alignment.y2x)
|
return self._get_aligned_spans(self.x, y_spans, self.alignment.y2x, allow_overlap)
|
||||||
|
|
||||||
def _get_aligned_spans(self, doc, spans, align):
|
def _get_aligned_spans(self, doc, spans, align, allow_overlap):
|
||||||
seen = set()
|
seen = set()
|
||||||
output = []
|
output = []
|
||||||
for span in spans:
|
for span in spans:
|
||||||
indices = align[span.start : span.end].data.ravel()
|
indices = align[span.start : span.end].data.ravel()
|
||||||
indices = [idx for idx in indices if idx not in seen]
|
if not allow_overlap:
|
||||||
|
indices = [idx for idx in indices if idx not in seen]
|
||||||
if len(indices) >= 1:
|
if len(indices) >= 1:
|
||||||
aligned_span = Span(doc, indices[0], indices[-1] + 1, label=span.label)
|
aligned_span = Span(doc, indices[0], indices[-1] + 1, label=span.label)
|
||||||
target_text = span.text.lower().strip().replace(" ", "")
|
target_text = span.text.lower().strip().replace(" ", "")
|
||||||
|
@ -237,7 +238,7 @@ cdef class Example:
|
||||||
def get_aligned_ner(self):
|
def get_aligned_ner(self):
|
||||||
if not self.y.has_annotation("ENT_IOB"):
|
if not self.y.has_annotation("ENT_IOB"):
|
||||||
return [None] * len(self.x) # should this be 'missing' instead of 'None' ?
|
return [None] * len(self.x) # should this be 'missing' instead of 'None' ?
|
||||||
x_ents = self.get_aligned_spans_y2x(self.y.ents)
|
x_ents = self.get_aligned_spans_y2x(self.y.ents, allow_overlap=False)
|
||||||
# Default to 'None' for missing values
|
# Default to 'None' for missing values
|
||||||
x_tags = offsets_to_biluo_tags(
|
x_tags = offsets_to_biluo_tags(
|
||||||
self.x,
|
self.x,
|
||||||
|
@ -337,7 +338,7 @@ def _annot2array(vocab, tok_annot, doc_annot):
|
||||||
values.append([vocab.strings.add(h) if h is not None else MISSING_DEP for h in value])
|
values.append([vocab.strings.add(h) if h is not None else MISSING_DEP for h in value])
|
||||||
elif key == "SENT_START":
|
elif key == "SENT_START":
|
||||||
attrs.append(key)
|
attrs.append(key)
|
||||||
values.append(value)
|
values.append([to_ternary_int(v) for v in value])
|
||||||
elif key == "MORPH":
|
elif key == "MORPH":
|
||||||
attrs.append(key)
|
attrs.append(key)
|
||||||
values.append([vocab.morphology.add(v) for v in value])
|
values.append([vocab.morphology.add(v) for v in value])
|
||||||
|
|
|
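`get_aligned_spans_y2x` (and its `x2y` counterpart) gain an `allow_overlap` flag, so callers can keep overlapping gold spans instead of having them silently dropped. A hedged sketch:

```python
import spacy
from spacy.training import Example

nlp = spacy.blank("en")
doc = nlp("Apple Inc. in San Francisco")
example = Example.from_dict(doc, {"entities": [(0, 10, "ORG"), (14, 27, "GPE")]})
# with allow_overlap=True, spans sharing tokens are no longer filtered out
spans = example.get_aligned_spans_y2x(example.reference.ents, allow_overlap=True)
```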
@@ -121,7 +121,7 @@ def json_to_annotations(doc):
                 if i == 0:
                     sent_starts.append(1)
                 else:
-                    sent_starts.append(0)
+                    sent_starts.append(-1)
                 if "brackets" in sent:
                     brackets.extend((b["first"] + sent_start_i,
                                      b["last"] + sent_start_i, b["label"])
@ -8,6 +8,7 @@ import tarfile
|
||||||
import gzip
|
import gzip
|
||||||
import zipfile
|
import zipfile
|
||||||
import tqdm
|
import tqdm
|
||||||
|
from itertools import islice
|
||||||
|
|
||||||
from .pretrain import get_tok2vec_ref
|
from .pretrain import get_tok2vec_ref
|
||||||
from ..lookups import Lookups
|
from ..lookups import Lookups
|
||||||
|
@ -68,7 +69,11 @@ def init_nlp(config: Config, *, use_gpu: int = -1) -> "Language":
|
||||||
# Make sure that listeners are defined before initializing further
|
# Make sure that listeners are defined before initializing further
|
||||||
nlp._link_components()
|
nlp._link_components()
|
||||||
with nlp.select_pipes(disable=[*frozen_components, *resume_components]):
|
with nlp.select_pipes(disable=[*frozen_components, *resume_components]):
|
||||||
nlp.initialize(lambda: train_corpus(nlp), sgd=optimizer)
|
if T["max_epochs"] == -1:
|
||||||
|
logger.debug("Due to streamed train corpus, using only first 100 examples for initialization. If necessary, provide all labels in [initialize]. More info: https://spacy.io/api/cli#init_labels")
|
||||||
|
nlp.initialize(lambda: islice(train_corpus(nlp), 100), sgd=optimizer)
|
||||||
|
else:
|
||||||
|
nlp.initialize(lambda: train_corpus(nlp), sgd=optimizer)
|
||||||
logger.info(f"Initialized pipeline components: {nlp.pipe_names}")
|
logger.info(f"Initialized pipeline components: {nlp.pipe_names}")
|
||||||
# Detect components with listeners that are not frozen consistently
|
# Detect components with listeners that are not frozen consistently
|
||||||
for name, proc in nlp.pipeline:
|
for name, proc in nlp.pipeline:
|
||||||
|
@ -133,6 +138,10 @@ def load_vectors_into_model(
|
||||||
)
|
)
|
||||||
err = ConfigValidationError.from_error(e, title=title, desc=desc)
|
err = ConfigValidationError.from_error(e, title=title, desc=desc)
|
||||||
raise err from None
|
raise err from None
|
||||||
|
|
||||||
|
if len(vectors_nlp.vocab.vectors.keys()) == 0:
|
||||||
|
logger.warning(Warnings.W112.format(name=name))
|
||||||
|
|
||||||
nlp.vocab.vectors = vectors_nlp.vocab.vectors
|
nlp.vocab.vectors = vectors_nlp.vocab.vectors
|
||||||
if add_strings:
|
if add_strings:
|
||||||
# I guess we should add the strings from the vectors_nlp model?
|
# I guess we should add the strings from the vectors_nlp model?
|
||||||
|
|
|
@ -101,8 +101,13 @@ def console_logger(progress_bar: bool = False):
|
||||||
return setup_printer
|
return setup_printer
|
||||||
|
|
||||||
|
|
||||||
@registry.loggers("spacy.WandbLogger.v1")
|
@registry.loggers("spacy.WandbLogger.v2")
|
||||||
def wandb_logger(project_name: str, remove_config_values: List[str] = []):
|
def wandb_logger(
|
||||||
|
project_name: str,
|
||||||
|
remove_config_values: List[str] = [],
|
||||||
|
model_log_interval: Optional[int] = None,
|
||||||
|
log_dataset_dir: Optional[str] = None,
|
||||||
|
):
|
||||||
try:
|
try:
|
||||||
import wandb
|
import wandb
|
||||||
from wandb import init, log, join # test that these are available
|
from wandb import init, log, join # test that these are available
|
||||||
|
@ -119,9 +124,23 @@ def wandb_logger(project_name: str, remove_config_values: List[str] = []):
|
||||||
for field in remove_config_values:
|
for field in remove_config_values:
|
||||||
del config_dot[field]
|
del config_dot[field]
|
||||||
config = util.dot_to_dict(config_dot)
|
config = util.dot_to_dict(config_dot)
|
||||||
wandb.init(project=project_name, config=config, reinit=True)
|
run = wandb.init(project=project_name, config=config, reinit=True)
|
||||||
console_log_step, console_finalize = console(nlp, stdout, stderr)
|
console_log_step, console_finalize = console(nlp, stdout, stderr)
|
||||||
|
|
||||||
|
def log_dir_artifact(
|
||||||
|
path: str,
|
||||||
|
name: str,
|
||||||
|
type: str,
|
||||||
|
metadata: Optional[Dict[str, Any]] = {},
|
||||||
|
aliases: Optional[List[str]] = [],
|
||||||
|
):
|
||||||
|
dataset_artifact = wandb.Artifact(name, type=type, metadata=metadata)
|
||||||
|
dataset_artifact.add_dir(path, name=name)
|
||||||
|
wandb.log_artifact(dataset_artifact, aliases=aliases)
|
||||||
|
|
||||||
|
if log_dataset_dir:
|
||||||
|
log_dir_artifact(path=log_dataset_dir, name="dataset", type="dataset")
|
||||||
|
|
||||||
def log_step(info: Optional[Dict[str, Any]]):
|
def log_step(info: Optional[Dict[str, Any]]):
|
||||||
console_log_step(info)
|
console_log_step(info)
|
||||||
if info is not None:
|
if info is not None:
|
||||||
|
@ -133,6 +152,21 @@ def wandb_logger(project_name: str, remove_config_values: List[str] = []):
|
||||||
wandb.log({f"loss_{k}": v for k, v in losses.items()})
|
wandb.log({f"loss_{k}": v for k, v in losses.items()})
|
||||||
if isinstance(other_scores, dict):
|
if isinstance(other_scores, dict):
|
||||||
wandb.log(other_scores)
|
wandb.log(other_scores)
|
||||||
|
if model_log_interval and info.get("output_path"):
|
||||||
|
if info["step"] % model_log_interval == 0 and info["step"] != 0:
|
||||||
|
log_dir_artifact(
|
||||||
|
path=info["output_path"],
|
||||||
|
name="pipeline_" + run.id,
|
||||||
|
type="checkpoint",
|
||||||
|
metadata=info,
|
||||||
|
aliases=[
|
||||||
|
f"epoch {info['epoch']} step {info['step']}",
|
||||||
|
"latest",
|
||||||
|
"best"
|
||||||
|
if info["score"] == max(info["checkpoints"])[0]
|
||||||
|
else "",
|
||||||
|
],
|
||||||
|
)
|
||||||
|
|
||||||
def finalize() -> None:
|
def finalize() -> None:
|
||||||
console_finalize()
|
console_finalize()
|
||||||
|
|
|
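The logger upgrade above registers `spacy.WandbLogger.v2` with two new optional settings for checkpoint and dataset logging. A hedged config sketch (project name and paths are placeholders):

```ini
[training.logger]
@loggers = "spacy.WandbLogger.v2"
project_name = "my_spacy_project"
remove_config_values = ["paths.train", "paths.dev"]
model_log_interval = 1000
log_dataset_dir = "./corpus"
```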
@@ -78,7 +78,7 @@ def train(
     training_step_iterator = train_while_improving(
         nlp,
         optimizer,
-        create_train_batches(train_corpus(nlp), batcher, T["max_epochs"]),
+        create_train_batches(nlp, train_corpus, batcher, T["max_epochs"]),
         create_evaluation_callback(nlp, dev_corpus, score_weights),
         dropout=T["dropout"],
         accumulate_gradient=T["accumulate_gradient"],
@@ -96,12 +96,13 @@ def train(
     log_step, finalize_logger = train_logger(nlp, stdout, stderr)
     try:
         for batch, info, is_best_checkpoint in training_step_iterator:
-            log_step(info if is_best_checkpoint is not None else None)
             if is_best_checkpoint is not None:
                 with nlp.select_pipes(disable=frozen_components):
                     update_meta(T, nlp, info)
                 if output_path is not None:
                     save_checkpoint(is_best_checkpoint)
+                    info["output_path"] = str(output_path / DIR_MODEL_LAST)
+            log_step(info if is_best_checkpoint is not None else None)
     except Exception as e:
         if output_path is not None:
             stdout.write(
@@ -289,17 +290,22 @@ def create_evaluation_callback(


 def create_train_batches(
-    iterator: Iterator[Example],
+    nlp: "Language",
+    corpus: Callable[["Language"], Iterable[Example]],
     batcher: Callable[[Iterable[Example]], Iterable[Example]],
     max_epochs: int,
 ):
     epoch = 0
-    examples = list(iterator)
-    if not examples:
-        # Raise error if no data
-        raise ValueError(Errors.E986)
+    if max_epochs >= 0:
+        examples = list(corpus(nlp))
+        if not examples:
+            # Raise error if no data
+            raise ValueError(Errors.E986)
     while max_epochs < 1 or epoch != max_epochs:
-        random.shuffle(examples)
+        if max_epochs >= 0:
+            random.shuffle(examples)
+        else:
+            examples = corpus(nlp)
         for batch in batcher(examples):
             yield epoch, batch
         epoch += 1
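Together, the `create_train_batches` and `init_nlp` changes above treat `max_epochs = -1` as a streamed corpus: examples are re-read from the corpus every epoch instead of being held (and shuffled) in memory, and initialization only peeks at the first 100 examples. A hedged config sketch:

```ini
[training]
max_epochs = -1  # stream the corpus instead of loading it into memory
```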
@@ -36,7 +36,7 @@ except ImportError:
 try: # Python 3.8
     import importlib.metadata as importlib_metadata
 except ImportError:
-    import importlib_metadata
+    from catalogue import _importlib_metadata as importlib_metadata

 # These are functions that were previously (v2.x) available from spacy.util
 # and have since moved to Thinc. We're importing them here so people's code
@@ -1526,3 +1526,18 @@ def check_lexeme_norms(vocab, component_name):
     if len(lexeme_norms) == 0 and vocab.lang in LEXEME_NORM_LANGS:
         langs = ", ".join(LEXEME_NORM_LANGS)
         logger.debug(Warnings.W033.format(model=component_name, langs=langs))
+
+
+def to_ternary_int(val) -> int:
+    """Convert a value to the ternary 1/0/-1 int used for True/None/False in
+    attributes such as SENT_START: True/1/1.0 is 1 (True), None/0/0.0 is 0
+    (None), any other values are -1 (False).
+    """
+    if isinstance(val, float):
+        val = int(val)
+    if val is True or val is 1:
+        return 1
+    elif val is None or val is 0:
+        return 0
+    else:
+        return -1
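A quick check of the helper's contract, hedged as a sketch:

```python
from spacy.util import to_ternary_int

assert to_ternary_int(True) == 1 and to_ternary_int(1.0) == 1
assert to_ternary_int(None) == 0 and to_ternary_int(0) == 0
assert to_ternary_int(False) == -1 and to_ternary_int(-3) == -1
```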
@ -55,7 +55,7 @@ cdef class Vectors:
|
||||||
"""Create a new vector store.
|
"""Create a new vector store.
|
||||||
|
|
||||||
shape (tuple): Size of the table, as (# entries, # columns)
|
shape (tuple): Size of the table, as (# entries, # columns)
|
||||||
data (numpy.ndarray): The vector data.
|
data (numpy.ndarray or cupy.ndarray): The vector data.
|
||||||
keys (iterable): A sequence of keys, aligned with the data.
|
keys (iterable): A sequence of keys, aligned with the data.
|
||||||
name (str): A name to identify the vectors table.
|
name (str): A name to identify the vectors table.
|
||||||
|
|
||||||
|
@ -65,7 +65,8 @@ cdef class Vectors:
|
||||||
if data is None:
|
if data is None:
|
||||||
if shape is None:
|
if shape is None:
|
||||||
shape = (0,0)
|
shape = (0,0)
|
||||||
data = numpy.zeros(shape, dtype="f")
|
ops = get_current_ops()
|
||||||
|
data = ops.xp.zeros(shape, dtype="f")
|
||||||
self.data = data
|
self.data = data
|
||||||
self.key2row = {}
|
self.key2row = {}
|
||||||
if self.data is not None:
|
if self.data is not None:
|
||||||
|
@ -300,6 +301,8 @@ cdef class Vectors:
|
||||||
else:
|
else:
|
||||||
raise ValueError(Errors.E197.format(row=row, key=key))
|
raise ValueError(Errors.E197.format(row=row, key=key))
|
||||||
if vector is not None:
|
if vector is not None:
|
||||||
|
xp = get_array_module(self.data)
|
||||||
|
vector = xp.asarray(vector)
|
||||||
self.data[row] = vector
|
self.data[row] = vector
|
||||||
if self._unset.count(row):
|
if self._unset.count(row):
|
||||||
self._unset.erase(self._unset.find(row))
|
self._unset.erase(self._unset.find(row))
|
||||||
|
@ -321,10 +324,11 @@ cdef class Vectors:
|
||||||
RETURNS (tuple): The most similar entries as a `(keys, best_rows, scores)`
|
RETURNS (tuple): The most similar entries as a `(keys, best_rows, scores)`
|
||||||
tuple.
|
tuple.
|
||||||
"""
|
"""
|
||||||
|
xp = get_array_module(self.data)
|
||||||
filled = sorted(list({row for row in self.key2row.values()}))
|
filled = sorted(list({row for row in self.key2row.values()}))
|
||||||
if len(filled) < n:
|
if len(filled) < n:
|
||||||
raise ValueError(Errors.E198.format(n=n, n_rows=len(filled)))
|
raise ValueError(Errors.E198.format(n=n, n_rows=len(filled)))
|
||||||
xp = get_array_module(self.data)
|
filled = xp.asarray(filled)
|
||||||
|
|
||||||
norms = xp.linalg.norm(self.data[filled], axis=1, keepdims=True)
|
norms = xp.linalg.norm(self.data[filled], axis=1, keepdims=True)
|
||||||
norms[norms == 0] = 1
|
norms[norms == 0] = 1
|
||||||
|
@ -357,8 +361,10 @@ cdef class Vectors:
|
||||||
# Account for numerical error we want to return in range -1, 1
|
# Account for numerical error we want to return in range -1, 1
|
||||||
scores = xp.clip(scores, a_min=-1, a_max=1, out=scores)
|
scores = xp.clip(scores, a_min=-1, a_max=1, out=scores)
|
||||||
row2key = {row: key for key, row in self.key2row.items()}
|
row2key = {row: key for key, row in self.key2row.items()}
|
||||||
|
|
||||||
|
numpy_rows = get_current_ops().to_numpy(best_rows)
|
||||||
keys = xp.asarray(
|
keys = xp.asarray(
|
||||||
[[row2key[row] for row in best_rows[i] if row in row2key]
|
[[row2key[row] for row in numpy_rows[i] if row in row2key]
|
||||||
for i in range(len(queries)) ], dtype="uint64")
|
for i in range(len(queries)) ], dtype="uint64")
|
||||||
return (keys, best_rows, scores)
|
return (keys, best_rows, scores)
|
||||||
|
|
||||||
|
@ -459,7 +465,8 @@ cdef class Vectors:
|
||||||
if hasattr(self.data, "from_bytes"):
|
if hasattr(self.data, "from_bytes"):
|
||||||
self.data.from_bytes()
|
self.data.from_bytes()
|
||||||
else:
|
else:
|
||||||
self.data = srsly.msgpack_loads(b)
|
xp = get_array_module(self.data)
|
||||||
|
self.data = xp.asarray(srsly.msgpack_loads(b))
|
||||||
|
|
||||||
deserializers = {
|
deserializers = {
|
||||||
"key2row": lambda b: self.key2row.update(srsly.msgpack_loads(b)),
|
"key2row": lambda b: self.key2row.update(srsly.msgpack_loads(b)),
|
||||||
|
|
|
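With the `Vectors` changes above, the table is allocated through the active backend, so on GPU the data is a `cupy` array and may need `ops.to_numpy()` before comparison. A hedged sketch:

```python
from thinc.api import get_current_ops
from spacy.vectors import Vectors

ops = get_current_ops()
data = ops.xp.zeros((3, 300), dtype="f")   # numpy on CPU, cupy on GPU
vectors = Vectors(data=data, keys=["cat", "dog", "kitten"])
print(type(vectors.data))
```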
@ -2,7 +2,7 @@
|
||||||
from libc.string cimport memcpy
|
from libc.string cimport memcpy
|
||||||
|
|
||||||
import srsly
|
import srsly
|
||||||
from thinc.api import get_array_module
|
from thinc.api import get_array_module, get_current_ops
|
||||||
import functools
|
import functools
|
||||||
|
|
||||||
from .lexeme cimport EMPTY_LEXEME, OOV_RANK
|
from .lexeme cimport EMPTY_LEXEME, OOV_RANK
|
||||||
|
@ -293,7 +293,7 @@ cdef class Vocab:
|
||||||
among those remaining.
|
among those remaining.
|
||||||
|
|
||||||
For example, suppose the original table had vectors for the words:
|
For example, suppose the original table had vectors for the words:
|
||||||
['sat', 'cat', 'feline', 'reclined']. If we prune the vector table to,
|
['sat', 'cat', 'feline', 'reclined']. If we prune the vector table to
|
||||||
two rows, we would discard the vectors for 'feline' and 'reclined'.
|
two rows, we would discard the vectors for 'feline' and 'reclined'.
|
||||||
These words would then be remapped to the closest remaining vector
|
These words would then be remapped to the closest remaining vector
|
||||||
-- so "feline" would have the same vector as "cat", and "reclined"
|
-- so "feline" would have the same vector as "cat", and "reclined"
|
||||||
|
@ -314,6 +314,7 @@ cdef class Vocab:
|
||||||
|
|
||||||
DOCS: https://spacy.io/api/vocab#prune_vectors
|
DOCS: https://spacy.io/api/vocab#prune_vectors
|
||||||
"""
|
"""
|
||||||
|
ops = get_current_ops()
|
||||||
xp = get_array_module(self.vectors.data)
|
xp = get_array_module(self.vectors.data)
|
||||||
# Make sure all vectors are in the vocab
|
# Make sure all vectors are in the vocab
|
||||||
for orth in self.vectors:
|
for orth in self.vectors:
|
||||||
|
@ -329,8 +330,9 @@ cdef class Vocab:
|
||||||
toss = xp.ascontiguousarray(self.vectors.data[indices[nr_row:]])
|
toss = xp.ascontiguousarray(self.vectors.data[indices[nr_row:]])
|
||||||
self.vectors = Vectors(data=keep, keys=keys[:nr_row], name=self.vectors.name)
|
self.vectors = Vectors(data=keep, keys=keys[:nr_row], name=self.vectors.name)
|
||||||
syn_keys, syn_rows, scores = self.vectors.most_similar(toss, batch_size=batch_size)
|
syn_keys, syn_rows, scores = self.vectors.most_similar(toss, batch_size=batch_size)
|
||||||
|
syn_keys = ops.to_numpy(syn_keys)
|
||||||
remap = {}
|
remap = {}
|
||||||
for i, key in enumerate(keys[nr_row:]):
|
for i, key in enumerate(ops.to_numpy(keys[nr_row:])):
|
||||||
self.vectors.add(key, row=syn_rows[i][0])
|
self.vectors.add(key, row=syn_rows[i][0])
|
||||||
word = self.strings[key]
|
word = self.strings[key]
|
||||||
synonym = self.strings[syn_keys[i][0]]
|
synonym = self.strings[syn_keys[i][0]]
|
||||||
|
@ -351,7 +353,7 @@ cdef class Vocab:
|
||||||
Defaults to the length of `orth`.
|
Defaults to the length of `orth`.
|
||||||
maxn (int): Maximum n-gram length used for Fasttext's ngram computation.
|
maxn (int): Maximum n-gram length used for Fasttext's ngram computation.
|
||||||
Defaults to the length of `orth`.
|
Defaults to the length of `orth`.
|
||||||
RETURNS (numpy.ndarray): A word vector. Size
|
RETURNS (numpy.ndarray or cupy.ndarray): A word vector. Size
|
||||||
and shape determined by the `vocab.vectors` instance. Usually, a
|
and shape determined by the `vocab.vectors` instance. Usually, a
|
||||||
numpy ndarray of shape (300,) and dtype float32.
|
numpy ndarray of shape (300,) and dtype float32.
|
||||||
|
|
||||||
|
@ -400,7 +402,7 @@ cdef class Vocab:
|
||||||
by string or int ID.
|
by string or int ID.
|
||||||
|
|
||||||
orth (int / unicode): The word.
|
orth (int / unicode): The word.
|
||||||
vector (numpy.ndarray[ndim=1, dtype='float32']): The vector to set.
|
vector (numpy.ndarray or cupy.nadarry[ndim=1, dtype='float32']): The vector to set.
|
||||||
|
|
||||||
DOCS: https://spacy.io/api/vocab#set_vector
|
DOCS: https://spacy.io/api/vocab#set_vector
|
||||||
"""
|
"""
|
||||||
|
|
|
@ -35,7 +35,7 @@ usage documentation on
|
||||||
> @architectures = "spacy.Tok2Vec.v2"
|
> @architectures = "spacy.Tok2Vec.v2"
|
||||||
>
|
>
|
||||||
> [model.embed]
|
> [model.embed]
|
||||||
> @architectures = "spacy.CharacterEmbed.v1"
|
> @architectures = "spacy.CharacterEmbed.v2"
|
||||||
> # ...
|
> # ...
|
||||||
>
|
>
|
||||||
> [model.encode]
|
> [model.encode]
|
||||||
|
@ -54,13 +54,13 @@ blog post for background.
|
||||||
| `encode` | Encode context into the embeddings, using an architecture such as a CNN, BiLSTM or transformer. For example, [MaxoutWindowEncoder](/api/architectures#MaxoutWindowEncoder). ~~Model[List[Floats2d], List[Floats2d]]~~ |
|
| `encode` | Encode context into the embeddings, using an architecture such as a CNN, BiLSTM or transformer. For example, [MaxoutWindowEncoder](/api/architectures#MaxoutWindowEncoder). ~~Model[List[Floats2d], List[Floats2d]]~~ |
|
||||||
| **CREATES** | The model using the architecture. ~~Model[List[Doc], List[Floats2d]]~~ |
|
| **CREATES** | The model using the architecture. ~~Model[List[Doc], List[Floats2d]]~~ |
|
||||||
|
|
||||||
### spacy.HashEmbedCNN.v1 {#HashEmbedCNN}
|
### spacy.HashEmbedCNN.v2 {#HashEmbedCNN}
|
||||||
|
|
||||||
> #### Example Config
|
> #### Example Config
|
||||||
>
|
>
|
||||||
> ```ini
|
> ```ini
|
||||||
> [model]
|
> [model]
|
||||||
> @architectures = "spacy.HashEmbedCNN.v1"
|
> @architectures = "spacy.HashEmbedCNN.v2"
|
||||||
> pretrained_vectors = null
|
> pretrained_vectors = null
|
||||||
> width = 96
|
> width = 96
|
||||||
> depth = 4
|
> depth = 4
|
||||||
|
@ -96,7 +96,7 @@ consisting of a CNN and a layer-normalized maxout activation function.
|
||||||
> factory = "tok2vec"
|
> factory = "tok2vec"
|
||||||
>
|
>
|
||||||
> [components.tok2vec.model]
|
> [components.tok2vec.model]
|
||||||
> @architectures = "spacy.HashEmbedCNN.v1"
|
> @architectures = "spacy.HashEmbedCNN.v2"
|
||||||
> width = 342
|
> width = 342
|
||||||
>
|
>
|
||||||
> [components.tagger]
|
> [components.tagger]
|
||||||
|
@ -129,13 +129,13 @@ argument that connects to the shared `tok2vec` component in the pipeline.
|
||||||
| `upstream` | A string to identify the "upstream" `Tok2Vec` component to communicate with. By default, the upstream name is the wildcard string `"*"`, but you could also specify the name of the `Tok2Vec` component. You'll almost never have multiple upstream `Tok2Vec` components, so the wildcard string will almost always be fine. ~~str~~ |
|
| `upstream` | A string to identify the "upstream" `Tok2Vec` component to communicate with. By default, the upstream name is the wildcard string `"*"`, but you could also specify the name of the `Tok2Vec` component. You'll almost never have multiple upstream `Tok2Vec` components, so the wildcard string will almost always be fine. ~~str~~ |
|
||||||
| **CREATES** | The model using the architecture. ~~Model[List[Doc], List[Floats2d]]~~ |
|
| **CREATES** | The model using the architecture. ~~Model[List[Doc], List[Floats2d]]~~ |
|
||||||
|
|
||||||
### spacy.MultiHashEmbed.v1 {#MultiHashEmbed}
|
### spacy.MultiHashEmbed.v2 {#MultiHashEmbed}
|
||||||
|
|
||||||
> #### Example config
|
> #### Example config
|
||||||
>
|
>
|
||||||
> ```ini
|
> ```ini
|
||||||
> [model]
|
> [model]
|
||||||
> @architectures = "spacy.MultiHashEmbed.v1"
|
> @architectures = "spacy.MultiHashEmbed.v2"
|
||||||
> width = 64
|
> width = 64
|
||||||
> attrs = ["NORM", "PREFIX", "SUFFIX", "SHAPE"]
|
> attrs = ["NORM", "PREFIX", "SUFFIX", "SHAPE"]
|
||||||
> rows = [2000, 1000, 1000, 1000]
|
> rows = [2000, 1000, 1000, 1000]
|
||||||
|
@ -160,13 +160,13 @@ not updated).
|
||||||
| `include_static_vectors` | Whether to also use static word vectors. Requires a vectors table to be loaded in the [`Doc`](/api/doc) objects' vocab. ~~bool~~ |
|
| `include_static_vectors` | Whether to also use static word vectors. Requires a vectors table to be loaded in the [`Doc`](/api/doc) objects' vocab. ~~bool~~ |
|
||||||
| **CREATES** | The model using the architecture. ~~Model[List[Doc], List[Floats2d]]~~ |
|
| **CREATES** | The model using the architecture. ~~Model[List[Doc], List[Floats2d]]~~ |
|
||||||
|
|
||||||
### spacy.CharacterEmbed.v1 {#CharacterEmbed}
|
### spacy.CharacterEmbed.v2 {#CharacterEmbed}
|
||||||
|
|
||||||
> #### Example config
|
> #### Example config
|
||||||
>
|
>
|
||||||
> ```ini
|
> ```ini
|
||||||
> [model]
|
> [model]
|
||||||
> @architectures = "spacy.CharacterEmbed.v1"
|
> @architectures = "spacy.CharacterEmbed.v2"
|
||||||
> width = 128
|
> width = 128
|
||||||
> rows = 7000
|
> rows = 7000
|
||||||
> nM = 64
|
> nM = 64
|
||||||
|
@ -266,13 +266,13 @@ Encode context using bidirectional LSTM layers. Requires
|
||||||
| `dropout` | Creates a Dropout layer on the outputs of each LSTM layer except the last layer. Set to 0.0 to disable this functionality. ~~float~~ |
|
| `dropout` | Creates a Dropout layer on the outputs of each LSTM layer except the last layer. Set to 0.0 to disable this functionality. ~~float~~ |
|
||||||
| **CREATES** | The model using the architecture. ~~Model[List[Floats2d], List[Floats2d]]~~ |
|
| **CREATES** | The model using the architecture. ~~Model[List[Floats2d], List[Floats2d]]~~ |
|
||||||
|
|
||||||
### spacy.StaticVectors.v1 {#StaticVectors}
|
### spacy.StaticVectors.v2 {#StaticVectors}
|
||||||
|
|
||||||
> #### Example config
|
> #### Example config
|
||||||
>
|
>
|
||||||
> ```ini
|
> ```ini
|
||||||
> [model]
|
> [model]
|
||||||
> @architectures = "spacy.StaticVectors.v1"
|
> @architectures = "spacy.StaticVectors.v2"
|
||||||
> nO = null
|
> nO = null
|
||||||
> nM = null
|
> nM = null
|
||||||
> dropout = 0.2
|
> dropout = 0.2
|
||||||
|
@@ -283,8 +283,9 @@ Encode context using bidirectional LSTM layers. Requires
 > ```

 Embed [`Doc`](/api/doc) objects with their vocab's vectors table, applying a
-learned linear projection to control the dimensionality. See the documentation
-on [static vectors](/usage/embeddings-transformers#static-vectors) for details.
+learned linear projection to control the dimensionality. Unknown tokens are
+mapped to a zero vector. See the documentation on [static
+vectors](/usage/embeddings-transformers#static-vectors) for details.

 | Name | Description |
 | ----------- | ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |

@@ -513,7 +514,7 @@ for a Tok2Vec layer.
 > use_upper = true
 >
 > [model.tok2vec]
-> @architectures = "spacy.HashEmbedCNN.v1"
+> @architectures = "spacy.HashEmbedCNN.v2"
 > pretrained_vectors = null
 > width = 96
 > depth = 4

@@ -619,7 +620,7 @@ single-label use-cases where `exclusive_classes = true`, while the
 > @architectures = "spacy.Tok2Vec.v2"
 >
 > [model.tok2vec.embed]
-> @architectures = "spacy.MultiHashEmbed.v1"
+> @architectures = "spacy.MultiHashEmbed.v2"
 > width = 64
 > rows = [2000, 2000, 1000, 1000, 1000, 1000]
 > attrs = ["ORTH", "LOWER", "PREFIX", "SUFFIX", "SHAPE", "ID"]

@@ -676,7 +677,7 @@ taking it as argument:
 > nO = null
 >
 > [model.tok2vec]
-> @architectures = "spacy.HashEmbedCNN.v1"
+> @architectures = "spacy.HashEmbedCNN.v2"
 > pretrained_vectors = null
 > width = 96
 > depth = 4

@@ -744,7 +745,7 @@ into the "real world". This requires 3 main components:
 > nO = null
 >
 > [model.tok2vec]
-> @architectures = "spacy.HashEmbedCNN.v1"
+> @architectures = "spacy.HashEmbedCNN.v2"
 > pretrained_vectors = null
 > width = 96
 > depth = 2
@@ -12,6 +12,7 @@ menu:
   - ['train', 'train']
   - ['pretrain', 'pretrain']
   - ['evaluate', 'evaluate']
+  - ['assemble', 'assemble']
   - ['package', 'package']
   - ['project', 'project']
   - ['ray', 'ray']

@@ -892,6 +893,34 @@ $ python -m spacy evaluate [model] [data_path] [--output] [--code] [--gold-prepr
 | `--help`, `-h` | Show help message and available arguments. ~~bool (flag)~~ |
 | **CREATES** | Training results and optional metrics and visualizations. |

+## assemble {#assemble tag="command"}
+
+Assemble a pipeline from a config file without additional training. Expects a
+[config file](/api/data-formats#config) with all settings and hyperparameters.
+The `--code` argument can be used to import a Python file that lets you register
+[custom functions](/usage/training#custom-functions) and refer to them in your
+config.
+
+> #### Example
+>
+> ```cli
+> $ python -m spacy assemble config.cfg ./output
+> ```
+
+```cli
+$ python -m spacy assemble [config_path] [output_dir] [--code] [--verbose] [overrides]
+```
+
+| Name | Description |
+| ----------------- | ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
+| `config_path` | Path to the [config](/api/data-formats#config) file containing all settings and hyperparameters. If `-`, the data will be [read from stdin](/usage/training#config-stdin). ~~Union[Path, str] \(positional)~~ |
+| `output_dir` | Directory to store the final pipeline in. Will be created if it doesn't exist. ~~Optional[Path] \(option)~~ |
+| `--code`, `-c` | Path to Python file with additional code to be imported. Allows [registering custom functions](/usage/training#custom-functions). ~~Optional[Path] \(option)~~ |
+| `--verbose`, `-V` | Show more detailed messages during processing. ~~bool (flag)~~ |
+| `--help`, `-h` | Show help message and available arguments. ~~bool (flag)~~ |
+| overrides | Config parameters to override. Should be options starting with `--` that correspond to the config section and value to override, e.g. `--paths.data ./data`. ~~Any (option/flag)~~ |
+| **CREATES** | The final assembled pipeline. |
+
 ## package {#package tag="command"}

 Generate an installable [Python package](/usage/training#models-generating) from
@@ -29,8 +29,8 @@ recommended settings for your use case, check out the
 >
 > The `@` syntax lets you refer to function names registered in the
 > [function registry](/api/top-level#registry). For example,
-> `@architectures = "spacy.HashEmbedCNN.v1"` refers to a registered function of
-> the name [spacy.HashEmbedCNN.v1](/api/architectures#HashEmbedCNN) and all
+> `@architectures = "spacy.HashEmbedCNN.v2"` refers to a registered function of
+> the name [spacy.HashEmbedCNN.v2](/api/architectures#HashEmbedCNN) and all
 > other values defined in its block will be passed into that function as
 > arguments. Those arguments depend on the registered function. See the usage
 > guide on [registered functions](/usage/training#config-functions) for details.

@@ -193,10 +193,10 @@ process that are used when you run [`spacy train`](/api/cli#train).
 | `frozen_components` | Pipeline component names that are "frozen" and shouldn't be initialized or updated during training. See [here](/usage/training#config-components) for details. Defaults to `[]`. ~~List[str]~~ |
 | `gpu_allocator` | Library for cupy to route GPU memory allocation to. Can be `"pytorch"` or `"tensorflow"`. Defaults to variable `${system.gpu_allocator}`. ~~str~~ |
 | `logger` | Callable that takes the `nlp` and stdout and stderr `IO` objects, sets up the logger, and returns two new callables to log a training step and to finalize the logger. Defaults to [`ConsoleLogger`](/api/top-level#ConsoleLogger). ~~Callable[[Language, IO, IO], [Tuple[Callable[[Dict[str, Any]], None], Callable[[], None]]]]~~ |
-| `max_epochs` | Maximum number of epochs to train for. Defaults to `0`. ~~int~~ |
-| `max_steps` | Maximum number of update steps to train for. Defaults to `20000`. ~~int~~ |
+| `max_epochs` | Maximum number of epochs to train for. `0` means an unlimited number of epochs. `-1` means that the train corpus should be streamed rather than loaded into memory with no shuffling within the training loop. Defaults to `0`. ~~int~~ |
+| `max_steps` | Maximum number of update steps to train for. `0` means an unlimited number of steps. Defaults to `20000`. ~~int~~ |
 | `optimizer` | The optimizer. The learning rate schedule and other settings can be configured as part of the optimizer. Defaults to [`Adam`](https://thinc.ai/docs/api-optimizers#adam). ~~Optimizer~~ |
-| `patience` | How many steps to continue without improvement in evaluation score. Defaults to `1600`. ~~int~~ |
+| `patience` | How many steps to continue without improvement in evaluation score. `0` disables early stopping. Defaults to `1600`. ~~int~~ |
 | `score_weights` | Score names shown in metrics mapped to their weight towards the final weighted score. See [here](/usage/training#metrics) for details. Defaults to `{}`. ~~Dict[str, float]~~ |
 | `seed` | The random seed. Defaults to variable `${system.seed}`. ~~int~~ |
 | `train_corpus` | Dot notation of the config location defining the train corpus. Defaults to `corpora.train`. ~~str~~ |
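A `[training]` block that uses the new streaming behavior documented above might look like the following minimal sketch; the specific values mirror the even/odd corpus example later in this commit and are only illustrative:

```ini
[training]
# -1 streams the train corpus instead of loading it into memory
max_epochs = -1
# stop after 2000 steps, or earlier if the score hasn't improved for 500 steps
patience = 500
max_steps = 2000
```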
@@ -390,7 +390,7 @@ file to keep track of your settings and hyperparameters and your own
 > "tags": List[str],
 > "pos": List[str],
 > "morphs": List[str],
-> "sent_starts": List[bool],
+> "sent_starts": List[Optional[bool]],
 > "deps": List[string],
 > "heads": List[int],
 > "entities": List[str],

@@ -44,7 +44,7 @@ Construct a `Doc` object. The most common way to get a `Doc` object is via the
 | `lemmas` <Tag variant="new">3</Tag> | A list of strings, of the same length as `words`, to assign as `token.lemma` for each word. Defaults to `None`. ~~Optional[List[str]]~~ |
 | `heads` <Tag variant="new">3</Tag> | A list of values, of the same length as `words`, to assign as the head for each word. Head indices are the absolute position of the head in the `Doc`. Defaults to `None`. ~~Optional[List[int]]~~ |
 | `deps` <Tag variant="new">3</Tag> | A list of strings, of the same length as `words`, to assign as `token.dep` for each word. Defaults to `None`. ~~Optional[List[str]]~~ |
-| `sent_starts` <Tag variant="new">3</Tag> | A list of values, of the same length as `words`, to assign as `token.is_sent_start`. Will be overridden by heads if `heads` is provided. Defaults to `None`. ~~Optional[List[Union[bool, None]]~~ |
+| `sent_starts` <Tag variant="new">3</Tag> | A list of values, of the same length as `words`, to assign as `token.is_sent_start`. Will be overridden by heads if `heads` is provided. Defaults to `None`. ~~Optional[List[Optional[bool]]]~~ |
 | `ents` <Tag variant="new">3</Tag> | A list of strings, of the same length of `words`, to assign the token-based IOB tag. Defaults to `None`. ~~Optional[List[str]]~~ |

 ## Doc.\_\_getitem\_\_ {#getitem tag="method"}
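As a minimal sketch of the `Optional[bool]` values in `sent_starts` above (the words and the blank English pipeline are made up for illustration; `None` marks tokens whose sentence-start status is unknown):

```python
import spacy
from spacy.tokens import Doc

nlp = spacy.blank("en")
words = ["Hello", "world", ".", "How", "are", "you", "?"]
spaces = [True, False, False, True, True, False, False]
# True = starts a sentence, False = does not, None = unknown
sent_starts = [True, False, False, True, None, None, None]
doc = Doc(nlp.vocab, words=words, spaces=spaces, sent_starts=sent_starts)
print([token.is_sent_start for token in doc])
```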
@@ -33,8 +33,8 @@ both documents.

 | Name | Description |
 | -------------- | ------------------------------------------------------------------------------------------------------------------------ |
 | `predicted` | The document containing (partial) predictions. Cannot be `None`. ~~Doc~~ |
 | `reference` | The document containing gold-standard annotations. Cannot be `None`. ~~Doc~~ |
 | _keyword-only_ | |
 | `alignment` | An object holding the alignment between the tokens of the `predicted` and `reference` documents. ~~Optional[Alignment]~~ |

@@ -56,11 +56,11 @@ see the [training format documentation](/api/data-formats#dict-input).
 > example = Example.from_dict(predicted, {"words": token_ref, "tags": tags_ref})
 > ```

 | Name | Description |
-| -------------- | ------------------------------------------------------------------------- |
+| -------------- | ----------------------------------------------------------------------------------- |
 | `predicted` | The document containing (partial) predictions. Cannot be `None`. ~~Doc~~ |
-| `example_dict` | `Dict[str, obj]` | The gold-standard annotations as a dictionary. Cannot be `None`. ~~Dict[str, Any]~~ |
+| `example_dict` | The gold-standard annotations as a dictionary. Cannot be `None`. ~~Dict[str, Any]~~ |
 | **RETURNS** | The newly constructed object. ~~Example~~ |

 ## Example.text {#text tag="property"}

@@ -211,10 +211,11 @@ align to the tokenization in [`Example.predicted`](/api/example#predicted).
 > assert [(ent.start, ent.end) for ent in ents_y2x] == [(0, 1)]
 > ```

 | Name | Description |
-| ----------- | ----------------------------------------------------------------------------- |
+| --------------- | -------------------------------------------------------------------------------------------- |
 | `y_spans` | `Span` objects aligned to the tokenization of `reference`. ~~Iterable[Span]~~ |
-| **RETURNS** | `Span` objects aligned to the tokenization of `predicted`. ~~List[Span]~~ |
+| `allow_overlap` | Whether the resulting `Span` objects may overlap or not. Set to `False` by default. ~~bool~~ |
+| **RETURNS** | `Span` objects aligned to the tokenization of `predicted`. ~~List[Span]~~ |

 ## Example.get_aligned_spans_x2y {#get_aligned_spans_x2y tag="method"}

@@ -238,10 +239,11 @@ against the original gold-standard annotation.
 > assert [(ent.start, ent.end) for ent in ents_x2y] == [(0, 2)]
 > ```

 | Name | Description |
-| ----------- | ----------------------------------------------------------------------------- |
+| --------------- | -------------------------------------------------------------------------------------------- |
 | `x_spans` | `Span` objects aligned to the tokenization of `predicted`. ~~Iterable[Span]~~ |
-| **RETURNS** | `Span` objects aligned to the tokenization of `reference`. ~~List[Span]~~ |
+| `allow_overlap` | Whether the resulting `Span` objects may overlap or not. Set to `False` by default. ~~bool~~ |
+| **RETURNS** | `Span` objects aligned to the tokenization of `reference`. ~~List[Span]~~ |

 ## Example.to_dict {#to_dict tag="method"}
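A quick sketch of the `allow_overlap` argument documented above, using made-up example data; here the predicted and reference tokenizations happen to line up, so the span projects onto the same indices:

```python
import spacy
from spacy.tokens import Doc, Span
from spacy.training import Example

nlp = spacy.blank("en")
# "Predicted" tokenization, one token per word
predicted = Doc(nlp.vocab, words=["New", "York", "City", "is", "big"])
# Gold annotation given as character offsets over the same text
example = Example.from_dict(predicted, {"entities": [(0, 13, "GPE")]})

# Project a span from the predicted tokenization onto the reference tokenization
x_spans = [Span(predicted, 0, 3, label="GPE")]
ents_x2y = example.get_aligned_spans_x2y(x_spans, allow_overlap=False)
print([(span.start, span.end, span.label_) for span in ents_x2y])
```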
@@ -4,12 +4,13 @@ teaser: Archived implementations available through spacy-legacy
 source: spacy/legacy
 ---

 The [`spacy-legacy`](https://github.com/explosion/spacy-legacy) package includes
-outdated registered functions and architectures. It is installed automatically as
-a dependency of spaCy, and provides backwards compatibility for archived functions
-that may still be used in projects.
+outdated registered functions and architectures. It is installed automatically
+as a dependency of spaCy, and provides backwards compatibility for archived
+functions that may still be used in projects.

-You can find the detailed documentation of each such legacy function on this page.
+You can find the detailed documentation of each such legacy function on this
+page.

 ## Architectures {#architectures}

@@ -17,8 +18,8 @@ These functions are available from `@spacy.registry.architectures`.

 ### spacy.Tok2Vec.v1 {#Tok2Vec_v1}

 The `spacy.Tok2Vec.v1` architecture was expecting an `encode` model of type
 `Model[Floats2D, Floats2D]` such as `spacy.MaxoutWindowEncoder.v1` or
 `spacy.MishWindowEncoder.v1`.

 > #### Example config

@@ -44,15 +45,14 @@ blog post for background.
 | Name | Description |
 | ----------- | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
 | `embed` | Embed tokens into context-independent word vector representations. For example, [CharacterEmbed](/api/architectures#CharacterEmbed) or [MultiHashEmbed](/api/architectures#MultiHashEmbed). ~~Model[List[Doc], List[Floats2d]]~~ |
 | `encode` | Encode context into the embeddings, using an architecture such as a CNN, BiLSTM or transformer. For example, [MaxoutWindowEncoder.v1](/api/legacy#MaxoutWindowEncoder_v1). ~~Model[Floats2d, Floats2d]~~ |
 | **CREATES** | The model using the architecture. ~~Model[List[Doc], List[Floats2d]]~~ |

 ### spacy.MaxoutWindowEncoder.v1 {#MaxoutWindowEncoder_v1}

 The `spacy.MaxoutWindowEncoder.v1` architecture was producing a model of type
-`Model[Floats2D, Floats2D]`. Since `spacy.MaxoutWindowEncoder.v2`, this has been changed to output
-type `Model[List[Floats2d], List[Floats2d]]`.
+`Model[Floats2D, Floats2D]`. Since `spacy.MaxoutWindowEncoder.v2`, this has been
+changed to output type `Model[List[Floats2d], List[Floats2d]]`.

-
 > #### Example config
 >

@@ -78,9 +78,9 @@ and residual connections.

 ### spacy.MishWindowEncoder.v1 {#MishWindowEncoder_v1}

 The `spacy.MishWindowEncoder.v1` architecture was producing a model of type
-`Model[Floats2D, Floats2D]`. Since `spacy.MishWindowEncoder.v2`, this has been changed to output
-type `Model[List[Floats2d], List[Floats2d]]`.
+`Model[Floats2D, Floats2D]`. Since `spacy.MishWindowEncoder.v2`, this has been
+changed to output type `Model[List[Floats2d], List[Floats2d]]`.

 > #### Example config
 >

@@ -103,12 +103,11 @@ and residual connections.
 | `depth` | The number of convolutional layers. Recommended value is `4`. ~~int~~ |
 | **CREATES** | The model using the architecture. ~~Model[Floats2d, Floats2d]~~ |

-
 ### spacy.TextCatEnsemble.v1 {#TextCatEnsemble_v1}

-The `spacy.TextCatEnsemble.v1` architecture built an internal `tok2vec` and `linear_model`.
-Since `spacy.TextCatEnsemble.v2`, this has been refactored so that the `TextCatEnsemble` takes these
-two sublayers as input.
+The `spacy.TextCatEnsemble.v1` architecture built an internal `tok2vec` and
+`linear_model`. Since `spacy.TextCatEnsemble.v2`, this has been refactored so
+that the `TextCatEnsemble` takes these two sublayers as input.

 > #### Example Config
 >

@@ -140,4 +139,62 @@ network has an internal CNN Tok2Vec layer and uses attention.
 | `ngram_size` | Determines the maximum length of the n-grams in the BOW model. For instance, `ngram_size=3`would give unigram, trigram and bigram features. ~~int~~ |
 | `dropout` | The dropout rate. ~~float~~ |
 | `nO` | Output dimension, determined by the number of different labels. If not set, the [`TextCategorizer`](/api/textcategorizer) component will set it when `initialize` is called. ~~Optional[int]~~ |
 | **CREATES** | The model using the architecture. ~~Model[List[Doc], Floats2d]~~ |
+
+### spacy.HashEmbedCNN.v1 {#HashEmbedCNN_v1}
+
+Identical to [`spacy.HashEmbedCNN.v2`](/api/architectures#HashEmbedCNN) except
+using [`spacy.StaticVectors.v1`](#StaticVectors_v1) if vectors are included.
+
+### spacy.MultiHashEmbed.v1 {#MultiHashEmbed_v1}
+
+Identical to [`spacy.MultiHashEmbed.v2`](/api/architectures#MultiHashEmbed)
+except with [`spacy.StaticVectors.v1`](#StaticVectors_v1) if vectors are
+included.
+
+### spacy.CharacterEmbed.v1 {#CharacterEmbed_v1}
+
+Identical to [`spacy.CharacterEmbed.v2`](/api/architectures#CharacterEmbed)
+except using [`spacy.StaticVectors.v1`](#StaticVectors_v1) if vectors are
+included.
+
+## Layers {#layers}
+
+These functions are available from `@spacy.registry.layers`.
+
+### spacy.StaticVectors.v1 {#StaticVectors_v1}
+
+Identical to [`spacy.StaticVectors.v2`](/api/architectures#StaticVectors) except
+for the handling of tokens without vectors.
+
+<Infobox title="Bugs for tokens without vectors" variant="warning">
+
+`spacy.StaticVectors.v1` maps tokens without vectors to the final row in the
+vectors table, which causes the model predictions to change if new vectors are
+added to an existing vectors table. See more details in
+[issue #7662](https://github.com/explosion/spaCy/issues/7662#issuecomment-813925655).
+
+</Infobox>
+
+## Loggers {#loggers}
+
+These functions are available from `@spacy.registry.loggers`.
+
+### spacy.WandbLogger.v1 {#WandbLogger_v1}
+
+The first version of the [`WandbLogger`](/api/top-level#WandbLogger) did not yet
+support the `log_dataset_dir` and `model_log_interval` arguments.
+
+> #### Example config
+>
+> ```ini
+> [training.logger]
+> @loggers = "spacy.WandbLogger.v1"
+> project_name = "monitor_spacy_training"
+> remove_config_values = ["paths.train", "paths.dev", "corpora.train.path", "corpora.dev.path"]
+> ```
+>
+> | Name | Description |
+> | ---------------------- | ------------------------------------------------------------------------------------------------------------------------------------- |
+> | `project_name` | The name of the project in the Weights & Biases interface. The project will be created automatically if it doesn't exist yet. ~~str~~ |
+> | `remove_config_values` | A list of values to include from the config before it is uploaded to W&B (default: empty). ~~List[str]~~ |
@@ -120,13 +120,14 @@ Find all token sequences matching the supplied patterns on the `Doc` or `Span`.
 > matches = matcher(doc)
 > ```

 | Name | Description |
-| ------------------------------------------ | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
+| ---------------------------------------------- | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
 | `doclike` | The `Doc` or `Span` to match over. ~~Union[Doc, Span]~~ |
 | _keyword-only_ | |
 | `as_spans` <Tag variant="new">3</Tag> | Instead of tuples, return a list of [`Span`](/api/span) objects of the matches, with the `match_id` assigned as the span label. Defaults to `False`. ~~bool~~ |
 | `allow_missing` <Tag variant="new">3</Tag> | Whether to skip checks for missing annotation for attributes included in patterns. Defaults to `False`. ~~bool~~ |
-| **RETURNS** | A list of `(match_id, start, end)` tuples, describing the matches. A match tuple describes a span `doc[start:end`]. The `match_id` is the ID of the added match pattern. If `as_spans` is set to `True`, a list of `Span` objects is returned instead. ~~Union[List[Tuple[int, int, int]], List[Span]]~~ |
+| `with_alignments` <Tag variant="new">3.1</Tag> | Return match alignment information as part of the match tuple as `List[int]` with the same length as the matched span. Each entry denotes the corresponding index of the token pattern. If `as_spans` is set to `True`, this setting is ignored. Defaults to `False`. ~~bool~~ |
+| **RETURNS** | A list of `(match_id, start, end)` tuples, describing the matches. A match tuple describes a span `doc[start:end`]. The `match_id` is the ID of the added match pattern. If `as_spans` is set to `True`, a list of `Span` objects is returned instead. ~~Union[List[Tuple[int, int, int]], List[Span]]~~ |

 ## Matcher.\_\_len\_\_ {#len tag="method" new="2"}
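A minimal sketch of the new `with_alignments` flag described in the table above; the pattern, text and registered name are made up, and the exact shape of the returned tuples should be checked against the table:

```python
import spacy
from spacy.matcher import Matcher

nlp = spacy.blank("en")
matcher = Matcher(nlp.vocab)
# Two-token pattern: a form of "hello" followed by any token
matcher.add("GREETING", [[{"LOWER": "hello"}, {}]])

doc = nlp("Hello world")
# Each match now also carries per-token pattern indices for the matched span
print(matcher(doc, with_alignments=True))
```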
@@ -137,14 +137,16 @@ Returns PRF scores for labeled or unlabeled spans.
 > print(scores["ents_f"])
 > ```

 | Name | Description |
-| ---------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ |
+| ---------------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
 | `examples` | The `Example` objects holding both the predictions and the correct gold-standard annotations. ~~Iterable[Example]~~ |
 | `attr` | The attribute to score. ~~str~~ |
 | _keyword-only_ | |
 | `getter` | Defaults to `getattr`. If provided, `getter(doc, attr)` should return the `Span` objects for an individual `Doc`. ~~Callable[[Doc, str], Iterable[Span]]~~ |
-| `has_annotation` | Defaults to `None`. If provided, `has_annotation(doc)` should return whether a `Doc` has annotation for this `attr`. Docs without annotation are skipped for scoring purposes. ~~Optional[Callable[[Doc], bool]]~~ |
-| **RETURNS** | A dictionary containing the PRF scores under the keys `{attr}_p`, `{attr}_r`, `{attr}_f` and the per-type PRF scores under `{attr}_per_type`. ~~Dict[str, Union[float, Dict[str, float]]]~~ |
+| `has_annotation` | Defaults to `None`. If provided, `has_annotation(doc)` should return whether a `Doc` has annotation for this `attr`. Docs without annotation are skipped for scoring purposes. ~~str~~ |
+| `labeled` | Defaults to `True`. If set to `False`, two spans will be considered equal if their start and end match, irrespective of their label. ~~bool~~ |
+| `allow_overlap` | Defaults to `False`. Whether or not to allow overlapping spans. If set to `False`, the alignment will automatically resolve conflicts. ~~bool~~ |
+| **RETURNS** | A dictionary containing the PRF scores under the keys `{attr}_p`, `{attr}_r`, `{attr}_f` and the per-type PRF scores under `{attr}_per_type`. ~~Dict[str, Union[float, Dict[str, float]]]~~ |

 ## Scorer.score_deps {#score_deps tag="staticmethod" new="3"}
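As a sketch of the new `labeled` argument from the table above (the sentence and labels are invented; with `labeled=False` the span counts as correct even though the labels differ):

```python
import spacy
from spacy.tokens import Span
from spacy.scorer import Scorer
from spacy.training import Example

nlp = spacy.blank("en")

# One Example whose predicted and gold spans cover the same tokens but use different labels
predicted = nlp("Apple is opening a store in Paris")
predicted.ents = [Span(predicted, 6, 7, label="CITY")]
reference = nlp("Apple is opening a store in Paris")
reference.ents = [Span(reference, 6, 7, label="GPE")]
example = Example(predicted, reference)

scores = Scorer.score_spans([example], "ents", labeled=False)
print(scores["ents_f"])  # 1.0, since the boundaries match
```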
@@ -364,7 +364,7 @@ unknown. Defaults to `True` for the first token in the `Doc`.

 | Name | Description |
 | ----------- | --------------------------------------------- |
-| **RETURNS** | Whether the token starts a sentence. ~~bool~~ |
+| **RETURNS** | Whether the token starts a sentence. ~~Optional[bool]~~ |

 ## Token.has_vector {#has_vector tag="property" model="vectors"}
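A small sketch of the `Optional[bool]` return value above, assuming a blank English pipeline with no component that sets sentence boundaries:

```python
import spacy

nlp = spacy.blank("en")
doc = nlp("Give it back! He pleaded.")
# Only the first token is known to start a sentence; the rest are None ("unknown")
print([token.is_sent_start for token in doc])
```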
@@ -8,6 +8,7 @@ menu:
   - ['Readers', 'readers']
   - ['Batchers', 'batchers']
   - ['Augmenters', 'augmenters']
+  - ['Callbacks', 'callbacks']
   - ['Training & Alignment', 'gold']
   - ['Utility Functions', 'util']
 ---

@@ -461,7 +462,7 @@ start decreasing across epochs.

 </Accordion>

-#### spacy.WandbLogger.v1 {#WandbLogger tag="registered function"}
+#### spacy.WandbLogger.v2 {#WandbLogger tag="registered function"}

 > #### Installation
 >

@@ -493,15 +494,19 @@ remain in the config file stored on your local system.
 >
 > ```ini
 > [training.logger]
-> @loggers = "spacy.WandbLogger.v1"
+> @loggers = "spacy.WandbLogger.v2"
 > project_name = "monitor_spacy_training"
 > remove_config_values = ["paths.train", "paths.dev", "corpora.train.path", "corpora.dev.path"]
+> log_dataset_dir = "corpus"
+> model_log_interval = 1000
 > ```

 | Name | Description |
 | ---------------------- | ------------------------------------------------------------------------------------------------------------------------------------- |
 | `project_name` | The name of the project in the Weights & Biases interface. The project will be created automatically if it doesn't exist yet. ~~str~~ |
 | `remove_config_values` | A list of values to include from the config before it is uploaded to W&B (default: empty). ~~List[str]~~ |
+| `model_log_interval` | Steps to wait between logging model checkpoints to W&B dasboard (default: None). ~~Optional[int]~~ |
+| `log_dataset_dir` | Directory containing dataset to be logged and versioned as W&B artifact (default: None). ~~Optional[str]~~ |

 <Project id="integrations/wandb">

@@ -781,6 +786,35 @@ useful for making the model less sensitive to capitalization.
 | `level` | The percentage of texts that will be augmented. ~~float~~ |
 | **CREATES** | A function that takes the current `nlp` object and an [`Example`](/api/example) and yields augmented `Example` objects. ~~Callable[[Language, Example], Iterator[Example]]~~ |

+## Callbacks {#callbacks source="spacy/training/callbacks.py" new="3"}
+
+The config supports [callbacks](/usage/training#custom-code-nlp-callbacks) at
+several points in the lifecycle that can be used modify the `nlp` object.
+
+### spacy.copy_from_base_model.v1 {#copy_from_base_model tag="registered function"}
+
+> #### Example config
+>
+> ```ini
+> [initialize.before_init]
+> @callbacks = "spacy.copy_from_base_model.v1"
+> tokenizer = "en_core_sci_md"
+> vocab = "en_core_sci_md"
+> ```
+
+Copy the tokenizer and/or vocab from the specified models. It's similar to the
+v2 [base model](https://v2.spacy.io/api/cli#train) option and useful in
+combination with
+[sourced components](/usage/processing-pipelines#sourced-components) when
+fine-tuning an existing pipeline. The vocab includes the lookups and the vectors
+from the specified model. Intended for use in `[initialize.before_init]`.
+
+| Name | Description |
+| ----------- | ----------------------------------------------------------------------------------------------------------------------- |
+| `tokenizer` | The pipeline to copy the tokenizer from. Defaults to `None`. ~~Optional[str]~~ |
+| `vocab` | The pipeline to copy the vocab from. The vocab includes the lookups and vectors. Defaults to `None`. ~~Optional[str]~~ |
+| **CREATES** | A function that takes the current `nlp` object and modifies its `tokenizer` and `vocab`. ~~Callable[[Language], None]~~ |
+
 ## Training data and alignment {#gold source="spacy/training"}

 ### training.offsets_to_biluo_tags {#offsets_to_biluo_tags tag="function"}
@@ -132,7 +132,7 @@ factory = "tok2vec"
 @architectures = "spacy.Tok2Vec.v2"

 [components.tok2vec.model.embed]
-@architectures = "spacy.MultiHashEmbed.v1"
+@architectures = "spacy.MultiHashEmbed.v2"

 [components.tok2vec.model.encode]
 @architectures = "spacy.MaxoutWindowEncoder.v2"

@@ -164,7 +164,7 @@ factory = "ner"
 @architectures = "spacy.Tok2Vec.v2"

 [components.ner.model.tok2vec.embed]
-@architectures = "spacy.MultiHashEmbed.v1"
+@architectures = "spacy.MultiHashEmbed.v2"

 [components.ner.model.tok2vec.encode]
 @architectures = "spacy.MaxoutWindowEncoder.v2"

@@ -541,7 +541,7 @@ word vector tables using the `include_static_vectors` flag.

 ```ini
 [tagger.model.tok2vec.embed]
-@architectures = "spacy.MultiHashEmbed.v1"
+@architectures = "spacy.MultiHashEmbed.v2"
 width = 128
 attrs = ["LOWER","PREFIX","SUFFIX","SHAPE"]
 rows = [5000,2500,2500,2500]

@@ -550,7 +550,7 @@ include_static_vectors = true

 <Infobox title="How it works" emoji="💡">

-The configuration system will look up the string `"spacy.MultiHashEmbed.v1"` in
+The configuration system will look up the string `"spacy.MultiHashEmbed.v2"` in
 the `architectures` [registry](/api/top-level#registry), and call the returned
 object with the rest of the arguments from the block. This will result in a call
 to the
@@ -130,9 +130,9 @@ which provides a numpy-compatible interface for GPU arrays.

 spaCy can be installed on GPU by specifying `spacy[cuda]`, `spacy[cuda90]`,
 `spacy[cuda91]`, `spacy[cuda92]`, `spacy[cuda100]`, `spacy[cuda101]`,
-`spacy[cuda102]`, `spacy[cuda110]` or `spacy[cuda111]`. If you know your cuda
-version, using the more explicit specifier allows cupy to be installed via
-wheel, saving some compilation time. The specifiers should install
+`spacy[cuda102]`, `spacy[cuda110]`, `spacy[cuda111]` or `spacy[cuda112]`. If you
+know your cuda version, using the more explicit specifier allows cupy to be
+installed via wheel, saving some compilation time. The specifiers should install
 [`cupy`](https://cupy.chainer.org).

 ```bash
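The docs' own bash block follows the hunk above; purely as an illustration of the new `cuda112` extra, an install command might look like this (the exact extra must match your local CUDA version):

```bash
# CUDA 11.2 wheels via the new extras specifier
pip install -U 'spacy[cuda112]'
```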
@@ -137,7 +137,7 @@ nO = null
 @architectures = "spacy.Tok2Vec.v2"

 [components.textcat.model.tok2vec.embed]
-@architectures = "spacy.MultiHashEmbed.v1"
+@architectures = "spacy.MultiHashEmbed.v2"
 width = 64
 rows = [2000, 2000, 1000, 1000, 1000, 1000]
 attrs = ["ORTH", "LOWER", "PREFIX", "SUFFIX", "SHAPE", "ID"]

@@ -204,7 +204,7 @@ factory = "tok2vec"
 @architectures = "spacy.Tok2Vec.v2"

 [components.tok2vec.model.embed]
-@architectures = "spacy.MultiHashEmbed.v1"
+@architectures = "spacy.MultiHashEmbed.v2"
 # ...

 [components.tok2vec.model.encode]

@@ -220,7 +220,7 @@ architecture:
 ```ini
 ### config.cfg (excerpt)
 [components.tok2vec.model.embed]
-@architectures = "spacy.CharacterEmbed.v1"
+@architectures = "spacy.CharacterEmbed.v2"
 # ...

 [components.tok2vec.model.encode]

@@ -638,7 +638,7 @@ that has the full implementation.
 > @architectures = "rel_instance_tensor.v1"
 >
 > [model.create_instance_tensor.tok2vec]
-> @architectures = "spacy.HashEmbedCNN.v1"
+> @architectures = "spacy.HashEmbedCNN.v2"
 > # ...
 >
 > [model.create_instance_tensor.pooling]

@@ -787,6 +787,7 @@ rather than performance:

 ```python
 def tokenizer_pseudo_code(
+    text,
     special_cases,
     prefix_search,
     suffix_search,

@@ -840,12 +841,14 @@ def tokenizer_pseudo_code(
             tokens.append(substring)
             substring = ""
         tokens.extend(reversed(suffixes))
+    for match in matcher(special_cases, text):
+        tokens.replace(match, special_cases[match])
     return tokens
 ```

 The algorithm can be summarized as follows:

-1. Iterate over whitespace-separated substrings.
+1. Iterate over space-separated substrings.
 2. Look for a token match. If there is a match, stop processing and keep this
    token.
 3. Check whether we have an explicitly defined special case for this substring.

@@ -859,6 +862,8 @@ The algorithm can be summarized as follows:
 8. Look for "infixes" – stuff like hyphens etc. and split the substring into
    tokens on all infixes.
 9. Once we can't consume any more of the string, handle it as a single token.
+10. Make a final pass over the text to check for special cases that include
+    spaces or that were missed due to the incremental processing of affixes.

 </Accordion>
@@ -995,7 +995,7 @@ your results.
 >
 > ```ini
 > [training.logger]
-> @loggers = "spacy.WandbLogger.v1"
+> @loggers = "spacy.WandbLogger.v2"
 > project_name = "monitor_spacy_training"
 > remove_config_values = ["paths.train", "paths.dev", "corpora.train.path", "corpora.dev.path"]
 > ```

@@ -1130,8 +1130,8 @@ any other custom workflows. `corpora.train` and `corpora.dev` are used as
 conventions within spaCy's default configs, but you can also define any other
 custom blocks. Each section in the corpora config should resolve to a
 [`Corpus`](/api/corpus) – for example, using spaCy's built-in
-[corpus reader](/api/top-level#readers) that takes a path to a binary `.spacy`
-file. The `train_corpus` and `dev_corpus` fields in the
+[corpus reader](/api/top-level#corpus-readers) that takes a path to a binary
+`.spacy` file. The `train_corpus` and `dev_corpus` fields in the
 [`[training]`](/api/data-formats#config-training) block specify where to find
 the corpus in your config. This makes it easy to **swap out** different corpora
 by only changing a single config setting.

@@ -1142,21 +1142,23 @@ corpora, keyed by corpus name, e.g. `"train"` and `"dev"`. This can be
 especially useful if you need to split a single file into corpora for training
 and evaluation, without loading the same file twice.

+By default, the training data is loaded into memory and shuffled before each
+epoch. If the corpus is **too large to fit into memory** during training, stream
+the corpus using a custom reader as described in the next section.
+
 ### Custom data reading and batching {#custom-code-readers-batchers}

 Some use-cases require **streaming in data** or manipulating datasets on the
-fly, rather than generating all data beforehand and storing it to file. Instead
+fly, rather than generating all data beforehand and storing it to disk. Instead
 of using the built-in [`Corpus`](/api/corpus) reader, which uses static file
 paths, you can create and register a custom function that generates
-[`Example`](/api/example) objects. The resulting generator can be infinite. When
-using this dataset for training, stopping criteria such as maximum number of
-steps, or stopping when the loss does not decrease further, can be used.
+[`Example`](/api/example) objects.

-In this example we assume a custom function `read_custom_data` which loads or
-generates texts with relevant text classification annotations. Then, small
-lexical variations of the input text are created before generating the final
-[`Example`](/api/example) objects. The `@spacy.registry.readers` decorator lets
-you register the function creating the custom reader in the `readers`
+In the following example we assume a custom function `read_custom_data` which
+loads or generates texts with relevant text classification annotations. Then,
+small lexical variations of the input text are created before generating the
+final [`Example`](/api/example) objects. The `@spacy.registry.readers` decorator
+lets you register the function creating the custom reader in the `readers`
 [registry](/api/top-level#registry) and assign it a string name, so it can be
 used in your config. All arguments on the registered function become available
 as **config settings** – in this case, `source`.

@@ -1199,6 +1201,80 @@ Remember that a registered function should always be a function that spaCy

 </Infobox>

+If the corpus is **too large to load into memory** or the corpus reader is an
+**infinite generator**, use the setting `max_epochs = -1` to indicate that the
+train corpus should be streamed. With this setting the train corpus is merely
+streamed and batched, not shuffled, so any shuffling needs to be implemented in
+the corpus reader itself. In the example below, a corpus reader that generates
+sentences containing even or odd numbers is used with an unlimited number of
+examples for the train corpus and a limited number of examples for the dev
+corpus. The dev corpus should always be finite and fit in memory during the
+evaluation step. `max_steps` and/or `patience` are used to determine when the
+training should stop.
+
+> #### config.cfg
+>
+> ```ini
+> [corpora.dev]
+> @readers = "even_odd.v1"
+> limit = 100
+>
+> [corpora.train]
+> @readers = "even_odd.v1"
+> limit = -1
+>
+> [training]
+> max_epochs = -1
+> patience = 500
+> max_steps = 2000
+> ```
+
+```python
+### functions.py
+from typing import Callable, Iterable, Iterator
+from spacy import util
+import random
+from spacy.training import Example
+from spacy import Language
+
+
+@util.registry.readers("even_odd.v1")
+def create_even_odd_corpus(limit: int = -1) -> Callable[[Language], Iterable[Example]]:
+    return EvenOddCorpus(limit)
+
+
+class EvenOddCorpus:
+    def __init__(self, limit):
+        self.limit = limit
+
+    def __call__(self, nlp: Language) -> Iterator[Example]:
+        i = 0
+        while i < self.limit or self.limit < 0:
+            r = random.randint(0, 1000)
+            cat = r % 2 == 0
+            text = "This is sentence " + str(r)
+            yield Example.from_dict(
+                nlp.make_doc(text), {"cats": {"EVEN": cat, "ODD": not cat}}
+            )
+            i += 1
+```
+
+> #### config.cfg
+>
+> ```ini
+> [initialize.components.textcat.labels]
+> @readers = "spacy.read_labels.v1"
+> path = "labels/textcat.json"
+> require = true
+> ```
+
+If the train corpus is streamed, the initialize step peeks at the first 100
+examples in the corpus to find the labels for each component. If this isn't
+sufficient, you'll need to [provide the labels](#initialization-labels) for each
+component in the `[initialize]` block. [`init labels`](/api/cli#init-labels) can
+be used to generate JSON files in the correct format, which you can extend with
+the full label set.
+
 We can also customize the **batching strategy** by registering a new batcher
 function in the `batchers` [registry](/api/top-level#registry). A batcher turns
 a stream of items into a stream of batches. spaCy has several useful built-in
@@ -616,11 +616,11 @@ Note that spaCy v3.0 now requires **Python 3.6+**.
 | `spacy profile` | [`spacy debug profile`](/api/cli#debug-profile) |
 | `spacy link`, `util.set_data_path`, `util.get_data_path` | not needed, symlinks are deprecated |

-The following deprecated methods, attributes and arguments were removed in v3.0.
-Most of them have been **deprecated for a while** and many would previously
-raise errors. Many of them were also mostly internals. If you've been working
-with more recent versions of spaCy v2.x, it's **unlikely** that your code relied
-on them.
+The following methods, attributes and arguments were removed in v3.0. Most of
+them have been **deprecated for a while** and many would previously raise
+errors. Many of them were also mostly internals. If you've been working with
+more recent versions of spaCy v2.x, it's **unlikely** that your code relied on
+them.

 | Removed | Replacement |
 | ----------------------------------------------------------------------------------------------------------------------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------- |

@@ -637,10 +637,10 @@ on them.

 ### Downloading and loading trained pipelines {#migrating-downloading-models}

-Symlinks and shortcuts like `en` are now officially deprecated. There are
-[many different trained pipelines](/models) with different capabilities and not
-just one "English model". In order to download and load a package, you should
-always use its full name – for instance,
+Symlinks and shortcuts like `en` have been deprecated for a while, and are now
+not supported anymore. There are [many different trained pipelines](/models)
+with different capabilities and not just one "English model". In order to
+download and load a package, you should always use its full name – for instance,
 [`en_core_web_sm`](/models/en#en_core_web_sm).

 ```diff

@@ -1185,9 +1185,10 @@ package isn't imported.
 In Jupyter notebooks, run [`prefer_gpu`](/api/top-level#spacy.prefer_gpu),
 [`require_gpu`](/api/top-level#spacy.require_gpu) or
 [`require_cpu`](/api/top-level#spacy.require_cpu) in the same cell as
-[`spacy.load`](/api/top-level#spacy.load) to ensure that the model is loaded on the correct device.
+[`spacy.load`](/api/top-level#spacy.load) to ensure that the model is loaded on
+the correct device.

-Due to a bug related to `contextvars` (see the [bug
-report](https://github.com/ipython/ipython/issues/11565)), the GPU settings may
-not be preserved correctly across cells, resulting in models being loaded on
+Due to a bug related to `contextvars` (see the
+[bug report](https://github.com/ipython/ipython/issues/11565)), the GPU settings
+may not be preserved correctly across cells, resulting in models being loaded on
 the wrong device or only partially on GPU.
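As an illustration of the Jupyter advice above, a minimal sketch of a single notebook cell (assuming the `en_core_web_sm` package is installed and a GPU is available):

```python
import spacy

# Run the GPU call and spacy.load in the same cell so the setting isn't lost between cells
spacy.prefer_gpu()
nlp = spacy.load("en_core_web_sm")
doc = nlp("This pipeline now runs on the GPU if one is available.")
print(doc.ents)
```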