Merge remote-tracking branch 'upstream/master' into refactor/parser-gpu

This commit is contained in:
svlandeg 2022-01-20 17:10:25 +01:00
commit 4f9c54001b
284 changed files with 10268 additions and 5896 deletions

106
.github/contributors/Pantalaymon.md vendored Normal file
View File

@@ -0,0 +1,106 @@
# spaCy contributor agreement
This spaCy Contributor Agreement (**"SCA"**) is based on the
[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf).
The SCA applies to any contribution that you make to any product or project
managed by us (the **"project"**), and sets out the intellectual property rights
you grant to us in the contributed materials. The term **"us"** shall mean
[ExplosionAI GmbH](https://explosion.ai/legal). The term
**"you"** shall mean the person or entity identified below.
If you agree to be bound by these terms, fill in the information requested
below and include the filled-in version with your first pull request, under the
folder [`.github/contributors/`](/.github/contributors/). The name of the file
should be your GitHub username, with the extension `.md`. For example, the user
example_user would create the file `.github/contributors/example_user.md`.
Read this agreement carefully before signing. These terms and conditions
constitute a binding legal agreement.
## Contributor Agreement
1. The term "contribution" or "contributed materials" means any source code,
object code, patch, tool, sample, graphic, specification, manual,
documentation, or any other material posted or submitted by you to the project.
2. With respect to any worldwide copyrights, or copyright applications and
registrations, in your contribution:
* you hereby assign to us joint ownership, and to the extent that such
assignment is or becomes invalid, ineffective or unenforceable, you hereby
grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
royalty-free, unrestricted license to exercise all rights under those
copyrights. This includes, at our option, the right to sublicense these same
rights to third parties through multiple levels of sublicensees or other
licensing arrangements;
* you agree that each of us can do all things in relation to your
contribution as if each of us were the sole owners, and if one of us makes
a derivative work of your contribution, the one who makes the derivative
work (or has it made) will be the sole owner of that derivative work;
* you agree that you will not assert any moral rights in your contribution
against us, our licensees or transferees;
* you agree that we may register a copyright in your contribution and
exercise all ownership rights associated with it; and
* you agree that neither of us has any duty to consult with, obtain the
consent of, pay or render an accounting to the other for any use or
distribution of your contribution.
3. With respect to any patents you own, or that you can license without payment
to any third party, you hereby grant to us a perpetual, irrevocable,
non-exclusive, worldwide, no-charge, royalty-free license to:
* make, have made, use, sell, offer to sell, import, and otherwise transfer
your contribution in whole or in part, alone or in combination with or
included in any product, work or materials arising out of the project to
which your contribution was submitted, and
* at our option, to sublicense these same rights to third parties through
multiple levels of sublicensees or other licensing arrangements.
4. Except as set out above, you keep all right, title, and interest in your
contribution. The rights that you grant to us under these terms are effective
on the date you first submitted a contribution to us, even if your submission
took place before the date you sign these terms.
5. You covenant, represent, warrant and agree that:
* Each contribution that you submit is and shall be an original work of
authorship and you can legally grant the rights set out in this SCA;
* to the best of your knowledge, each contribution will not violate any
third party's copyrights, trademarks, patents, or other intellectual
property rights; and
* each contribution shall be in compliance with U.S. export control laws and
other applicable export and import laws. You agree to notify us if you
become aware of any circumstance which would make any of the foregoing
representations inaccurate in any respect. We may publicly disclose your
participation in the project, including the fact that you have signed the SCA.
6. This SCA is governed by the laws of the State of California and applicable
U.S. Federal law. Any choice of law rules will not apply.
7. Please place an “x” on one of the applicable statements below. Please do NOT
mark both statements:
* [x] I am signing on behalf of myself as an individual and no other person
or entity, including my employer, has or will have rights with respect to my
contributions.
* [ ] I am signing on behalf of my employer or a legal entity and I have the
actual authority to contractually bind that entity.
## Contributor Details
| Field | Entry |
|------------------------------- | -------------------- |
| Name |Valentin-Gabriel Soumah|
| Company name (if applicable) | |
| Title or role (if applicable) | |
| Date | 2021-11-23 |
| GitHub username | Pantalaymon |
| Website (optional) | |

106
.github/contributors/avi197.md vendored Normal file
View File

@@ -0,0 +1,106 @@
# spaCy contributor agreement
This spaCy Contributor Agreement (**"SCA"**) is based on the
[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf).
The SCA applies to any contribution that you make to any product or project
managed by us (the **"project"**), and sets out the intellectual property rights
you grant to us in the contributed materials. The term **"us"** shall mean
[ExplosionAI GmbH](https://explosion.ai/legal). The term
**"you"** shall mean the person or entity identified below.
If you agree to be bound by these terms, fill in the information requested
below and include the filled-in version with your first pull request, under the
folder [`.github/contributors/`](/.github/contributors/). The name of the file
should be your GitHub username, with the extension `.md`. For example, the user
example_user would create the file `.github/contributors/example_user.md`.
Read this agreement carefully before signing. These terms and conditions
constitute a binding legal agreement.
## Contributor Agreement
1. The term "contribution" or "contributed materials" means any source code,
object code, patch, tool, sample, graphic, specification, manual,
documentation, or any other material posted or submitted by you to the project.
2. With respect to any worldwide copyrights, or copyright applications and
registrations, in your contribution:
* you hereby assign to us joint ownership, and to the extent that such
assignment is or becomes invalid, ineffective or unenforceable, you hereby
grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
royalty-free, unrestricted license to exercise all rights under those
copyrights. This includes, at our option, the right to sublicense these same
rights to third parties through multiple levels of sublicensees or other
licensing arrangements;
* you agree that each of us can do all things in relation to your
contribution as if each of us were the sole owners, and if one of us makes
a derivative work of your contribution, the one who makes the derivative
work (or has it made) will be the sole owner of that derivative work;
* you agree that you will not assert any moral rights in your contribution
against us, our licensees or transferees;
* you agree that we may register a copyright in your contribution and
exercise all ownership rights associated with it; and
* you agree that neither of us has any duty to consult with, obtain the
consent of, pay or render an accounting to the other for any use or
distribution of your contribution.
3. With respect to any patents you own, or that you can license without payment
to any third party, you hereby grant to us a perpetual, irrevocable,
non-exclusive, worldwide, no-charge, royalty-free license to:
* make, have made, use, sell, offer to sell, import, and otherwise transfer
your contribution in whole or in part, alone or in combination with or
included in any product, work or materials arising out of the project to
which your contribution was submitted, and
* at our option, to sublicense these same rights to third parties through
multiple levels of sublicensees or other licensing arrangements.
4. Except as set out above, you keep all right, title, and interest in your
contribution. The rights that you grant to us under these terms are effective
on the date you first submitted a contribution to us, even if your submission
took place before the date you sign these terms.
5. You covenant, represent, warrant and agree that:
* Each contribution that you submit is and shall be an original work of
authorship and you can legally grant the rights set out in this SCA;
* to the best of your knowledge, each contribution will not violate any
third party's copyrights, trademarks, patents, or other intellectual
property rights; and
* each contribution shall be in compliance with U.S. export control laws and
other applicable export and import laws. You agree to notify us if you
become aware of any circumstance which would make any of the foregoing
representations inaccurate in any respect. We may publicly disclose your
participation in the project, including the fact that you have signed the SCA.
6. This SCA is governed by the laws of the State of California and applicable
U.S. Federal law. Any choice of law rules will not apply.
7. Please place an “x” on one of the applicable statements below. Please do NOT
mark both statements:
* [x] I am signing on behalf of myself as an individual and no other person
or entity, including my employer, has or will have rights with respect to my
contributions.
* [ ] I am signing on behalf of my employer or a legal entity and I have the
actual authority to contractually bind that entity.
## Contributor Details
| Field | Entry |
|------------------------------- | -------------------- |
| Name | Son Pham |
| Company name (if applicable) | |
| Title or role (if applicable) | |
| Date | 09/10/2021 |
| GitHub username | Avi197 |
| Website (optional) | |

106
.github/contributors/fgaim.md vendored Normal file
View File

@@ -0,0 +1,106 @@
# spaCy contributor agreement
This spaCy Contributor Agreement (**"SCA"**) is based on the
[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf).
The SCA applies to any contribution that you make to any product or project
managed by us (the **"project"**), and sets out the intellectual property rights
you grant to us in the contributed materials. The term **"us"** shall mean
[ExplosionAI GmbH](https://explosion.ai/legal). The term
**"you"** shall mean the person or entity identified below.
If you agree to be bound by these terms, fill in the information requested
below and include the filled-in version with your first pull request, under the
folder [`.github/contributors/`](/.github/contributors/). The name of the file
should be your GitHub username, with the extension `.md`. For example, the user
example_user would create the file `.github/contributors/example_user.md`.
Read this agreement carefully before signing. These terms and conditions
constitute a binding legal agreement.
## Contributor Agreement
1. The term "contribution" or "contributed materials" means any source code,
object code, patch, tool, sample, graphic, specification, manual,
documentation, or any other material posted or submitted by you to the project.
2. With respect to any worldwide copyrights, or copyright applications and
registrations, in your contribution:
* you hereby assign to us joint ownership, and to the extent that such
assignment is or becomes invalid, ineffective or unenforceable, you hereby
grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
royalty-free, unrestricted license to exercise all rights under those
copyrights. This includes, at our option, the right to sublicense these same
rights to third parties through multiple levels of sublicensees or other
licensing arrangements;
* you agree that each of us can do all things in relation to your
contribution as if each of us were the sole owners, and if one of us makes
a derivative work of your contribution, the one who makes the derivative
work (or has it made) will be the sole owner of that derivative work;
* you agree that you will not assert any moral rights in your contribution
against us, our licensees or transferees;
* you agree that we may register a copyright in your contribution and
exercise all ownership rights associated with it; and
* you agree that neither of us has any duty to consult with, obtain the
consent of, pay or render an accounting to the other for any use or
distribution of your contribution.
3. With respect to any patents you own, or that you can license without payment
to any third party, you hereby grant to us a perpetual, irrevocable,
non-exclusive, worldwide, no-charge, royalty-free license to:
* make, have made, use, sell, offer to sell, import, and otherwise transfer
your contribution in whole or in part, alone or in combination with or
included in any product, work or materials arising out of the project to
which your contribution was submitted, and
* at our option, to sublicense these same rights to third parties through
multiple levels of sublicensees or other licensing arrangements.
4. Except as set out above, you keep all right, title, and interest in your
contribution. The rights that you grant to us under these terms are effective
on the date you first submitted a contribution to us, even if your submission
took place before the date you sign these terms.
5. You covenant, represent, warrant and agree that:
* Each contribution that you submit is and shall be an original work of
authorship and you can legally grant the rights set out in this SCA;
* to the best of your knowledge, each contribution will not violate any
third party's copyrights, trademarks, patents, or other intellectual
property rights; and
* each contribution shall be in compliance with U.S. export control laws and
other applicable export and import laws. You agree to notify us if you
become aware of any circumstance which would make any of the foregoing
representations inaccurate in any respect. We may publicly disclose your
participation in the project, including the fact that you have signed the SCA.
6. This SCA is governed by the laws of the State of California and applicable
U.S. Federal law. Any choice of law rules will not apply.
7. Please place an “x” on one of the applicable statements below. Please do NOT
mark both statements:
* [x] I am signing on behalf of myself as an individual and no other person
or entity, including my employer, has or will have rights with respect to my
contributions.
* [ ] I am signing on behalf of my employer or a legal entity and I have the
actual authority to contractually bind that entity.
## Contributor Details
| Field | Entry |
|------------------------------- | -------------------- |
| Name | Fitsum Gaim |
| Company name (if applicable) | |
| Title or role (if applicable) | |
| Date | 2021-08-07 |
| GitHub username | fgaim |
| Website (optional) | |

106
.github/contributors/syrull.md vendored Normal file
View File

@@ -0,0 +1,106 @@
# spaCy contributor agreement
This spaCy Contributor Agreement (**"SCA"**) is based on the
[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf).
The SCA applies to any contribution that you make to any product or project
managed by us (the **"project"**), and sets out the intellectual property rights
you grant to us in the contributed materials. The term **"us"** shall mean
[ExplosionAI GmbH](https://explosion.ai/legal). The term
**"you"** shall mean the person or entity identified below.
If you agree to be bound by these terms, fill in the information requested
below and include the filled-in version with your first pull request, under the
folder [`.github/contributors/`](/.github/contributors/). The name of the file
should be your GitHub username, with the extension `.md`. For example, the user
example_user would create the file `.github/contributors/example_user.md`.
Read this agreement carefully before signing. These terms and conditions
constitute a binding legal agreement.
## Contributor Agreement
1. The term "contribution" or "contributed materials" means any source code,
object code, patch, tool, sample, graphic, specification, manual,
documentation, or any other material posted or submitted by you to the project.
2. With respect to any worldwide copyrights, or copyright applications and
registrations, in your contribution:
* you hereby assign to us joint ownership, and to the extent that such
assignment is or becomes invalid, ineffective or unenforceable, you hereby
grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
royalty-free, unrestricted license to exercise all rights under those
copyrights. This includes, at our option, the right to sublicense these same
rights to third parties through multiple levels of sublicensees or other
licensing arrangements;
* you agree that each of us can do all things in relation to your
contribution as if each of us were the sole owners, and if one of us makes
a derivative work of your contribution, the one who makes the derivative
work (or has it made) will be the sole owner of that derivative work;
* you agree that you will not assert any moral rights in your contribution
against us, our licensees or transferees;
* you agree that we may register a copyright in your contribution and
exercise all ownership rights associated with it; and
* you agree that neither of us has any duty to consult with, obtain the
consent of, pay or render an accounting to the other for any use or
distribution of your contribution.
3. With respect to any patents you own, or that you can license without payment
to any third party, you hereby grant to us a perpetual, irrevocable,
non-exclusive, worldwide, no-charge, royalty-free license to:
* make, have made, use, sell, offer to sell, import, and otherwise transfer
your contribution in whole or in part, alone or in combination with or
included in any product, work or materials arising out of the project to
which your contribution was submitted, and
* at our option, to sublicense these same rights to third parties through
multiple levels of sublicensees or other licensing arrangements.
4. Except as set out above, you keep all right, title, and interest in your
contribution. The rights that you grant to us under these terms are effective
on the date you first submitted a contribution to us, even if your submission
took place before the date you sign these terms.
5. You covenant, represent, warrant and agree that:
* Each contribution that you submit is and shall be an original work of
authorship and you can legally grant the rights set out in this SCA;
* to the best of your knowledge, each contribution will not violate any
third party's copyrights, trademarks, patents, or other intellectual
property rights; and
* each contribution shall be in compliance with U.S. export control laws and
other applicable export and import laws. You agree to notify us if you
become aware of any circumstance which would make any of the foregoing
representations inaccurate in any respect. We may publicly disclose your
participation in the project, including the fact that you have signed the SCA.
6. This SCA is governed by the laws of the State of California and applicable
U.S. Federal law. Any choice of law rules will not apply.
7. Please place an “x” on one of the applicable statements below. Please do NOT
mark both statements:
* [x] I am signing on behalf of myself as an individual and no other person
or entity, including my employer, has or will have rights with respect to my
contributions.
* [ ] I am signing on behalf of my employer or a legal entity and I have the
actual authority to contractually bind that entity.
## Contributor Details
| Field | Entry |
|------------------------------- | -------------------- |
| Name | Dimitar Ganev |
| Company name (if applicable) | |
| Title or role (if applicable) | |
| Date | 2021/8/2 |
| GitHub username | syrull |
| Website (optional) | |

1
.gitignore vendored
View File

@@ -9,6 +9,7 @@ keys/
spacy/tests/package/setup.cfg
spacy/tests/package/pyproject.toml
spacy/tests/package/requirements.txt
+spacy/tests/universe/universe.json
# Website
website/.cache/

CONTRIBUTING.md
View File

@@ -143,15 +143,25 @@ Changes to `.py` files will be effective immediately.
### Fixing bugs
When fixing a bug, first create an
-[issue](https://github.com/explosion/spaCy/issues) if one does not already exist.
-The description text can be very short – we don't want to make this too
+[issue](https://github.com/explosion/spaCy/issues) if one does not already
+exist. The description text can be very short – we don't want to make this too
bureaucratic.
-Next, create a test file named `test_issue[ISSUE NUMBER].py` in the
-[`spacy/tests/regression`](spacy/tests/regression) folder. Test for the bug
-you're fixing, and make sure the test fails. Next, add and commit your test file
-referencing the issue number in the commit message. Finally, fix the bug, make
-sure your test passes and reference the issue in your commit message.
+Next, add a test to the relevant file in the
+[`spacy/tests`](spacy/tests) folder. Then add a [pytest
+mark](https://docs.pytest.org/en/6.2.x/example/markers.html#working-with-custom-markers),
+`@pytest.mark.issue(NUMBER)`, to reference the issue number.
+
+```python
+# Assume you're fixing Issue #1234
+@pytest.mark.issue(1234)
+def test_issue1234():
+    ...
+```
+
+Test for the bug you're fixing, and make sure the test fails. Next, add and
+commit your test file. Finally, fix the bug, make sure your test passes and
+reference the issue number in your pull request description.
📖 **For more information on how to add tests, check out the [tests README](spacy/tests/README.md).**

LICENSE
View File

@@ -1,6 +1,6 @@
The MIT License (MIT)
-Copyright (C) 2016-2021 ExplosionAI GmbH, 2016 spaCy GmbH, 2015 Matthew Honnibal
+Copyright (C) 2016-2022 ExplosionAI GmbH, 2016 spaCy GmbH, 2015 Matthew Honnibal
Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal

MANIFEST.in
View File

@@ -1,11 +1,8 @@
-recursive-include include *.h
recursive-include spacy *.pyi *.pyx *.pxd *.txt *.cfg *.jinja *.toml
include LICENSE
include README.md
include pyproject.toml
include spacy/py.typed
-recursive-exclude spacy/lang *.json
-recursive-include spacy/lang *.json.gz
-recursive-include spacy/cli *.json *.yml
+recursive-include spacy/cli *.yml
recursive-include licenses *
recursive-exclude spacy *.cpp

README.md
View File

@@ -16,7 +16,7 @@ production-ready [**training system**](https://spacy.io/usage/training) and easy
model packaging, deployment and workflow management. spaCy is commercial
open-source software, released under the MIT license.
-💫 **Version 3.0 out now!**
+💫 **Version 3.2 out now!**
[Check out the release notes here.](https://github.com/explosion/spaCy/releases)
[![Azure Pipelines](https://img.shields.io/azure-devops/build/explosion-ai/public/8/master.svg?logo=azure-pipelines&style=flat-square&label=build)](https://dev.azure.com/explosion-ai/public/_build?definitionId=8)

azure-pipelines.yml
View File

@@ -23,7 +23,7 @@ jobs:
# defined in .flake8 and overwrites the selected codes.
- job: "Validate"
pool:
-vmImage: "ubuntu-18.04"
+vmImage: "ubuntu-latest"
steps:
- task: UsePythonVersion@0
inputs:
@@ -39,49 +39,49 @@ jobs:
matrix:
# We're only running one platform per Python version to speed up builds
Python36Linux:
-imageName: "ubuntu-18.04"
+imageName: "ubuntu-latest"
python.version: "3.6"
# Python36Windows:
-# imageName: "windows-2019"
+# imageName: "windows-latest"
# python.version: "3.6"
# Python36Mac:
-# imageName: "macos-10.14"
+# imageName: "macos-latest"
# python.version: "3.6"
# Python37Linux:
-# imageName: "ubuntu-18.04"
+# imageName: "ubuntu-latest"
# python.version: "3.7"
Python37Windows:
-imageName: "windows-2019"
+imageName: "windows-latest"
python.version: "3.7"
# Python37Mac:
-# imageName: "macos-10.14"
+# imageName: "macos-latest"
# python.version: "3.7"
# Python38Linux:
-# imageName: "ubuntu-18.04"
+# imageName: "ubuntu-latest"
# python.version: "3.8"
# Python38Windows:
-# imageName: "windows-2019"
+# imageName: "windows-latest"
# python.version: "3.8"
Python38Mac:
-imageName: "macos-10.14"
+imageName: "macos-latest"
python.version: "3.8"
Python39Linux:
-imageName: "ubuntu-18.04"
+imageName: "ubuntu-latest"
python.version: "3.9"
# Python39Windows:
-# imageName: "windows-2019"
+# imageName: "windows-latest"
# python.version: "3.9"
# Python39Mac:
-# imageName: "macos-10.14"
+# imageName: "macos-latest"
# python.version: "3.9"
Python310Linux:
-imageName: "ubuntu-20.04"
+imageName: "ubuntu-latest"
python.version: "3.10"
Python310Windows:
-imageName: "windows-2019"
+imageName: "windows-latest"
python.version: "3.10"
Python310Mac:
-imageName: "macos-10.15"
+imageName: "macos-latest"
python.version: "3.10"
maxParallel: 4
pool:

spacy/tests/README.md
View File

@@ -444,7 +444,7 @@ spaCy uses the [`pytest`](http://doc.pytest.org/) framework for testing. Tests f
When adding tests, make sure to use descriptive names and only test for one behavior at a time. Tests should be grouped into modules dedicated to the same type of functionality and some test modules are organized as directories of test files related to the same larger area of the library, e.g. `matcher` or `tokenizer`.
-Regression tests are tests that refer to bugs reported in specific issues. They should live in the `regression` module and are named according to the issue number (e.g. `test_issue1234.py`). This system allows us to relate tests for specific bugs back to the original reported issue, which is especially useful if we introduce a regression and a previously passing regression tests suddenly fails again. When fixing a bug, it's often useful to create a regression test for it first. Every once in a while, we go through the `regression` module and group tests together into larger files by issue number, in groups of 500 to 1000 numbers. This prevents us from ending up with too many individual files over time.
+Regression tests are tests that refer to bugs reported in specific issues. They should live in the relevant module of the test suite, named according to the issue number (e.g., `test_issue1234.py`), and [marked](https://docs.pytest.org/en/6.2.x/example/markers.html#working-with-custom-markers) appropriately (e.g. `@pytest.mark.issue(1234)`). This system allows us to relate tests for specific bugs back to the original reported issue, which is especially useful if we introduce a regression and a previously passing regression tests suddenly fails again. When fixing a bug, it's often useful to create a regression test for it first.
The test suite also provides [fixtures](https://github.com/explosion/spaCy/blob/master/spacy/tests/conftest.py) for different language tokenizers that can be used as function arguments of the same name and will be passed in automatically. Those should only be used for tests related to those specific languages. We also have [test utility functions](https://github.com/explosion/spaCy/blob/master/spacy/tests/util.py) for common operations, like creating a temporary file.
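To make the new convention concrete, here is a brief illustrative sketch of such a regression test. The issue number is made up, and the `en_tokenizer` fixture is assumed to come from the shared conftest mentioned above:

```python
import pytest


@pytest.mark.issue(1234)
def test_issue1234(en_tokenizer):
    # Reproduce the reported bug with the shared English tokenizer fixture
    doc = en_tokenizer("This is a sentence.")
    assert len(doc) == 5
```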

requirements.txt
View File

@@ -1,5 +1,6 @@
# Our libraries
spacy-legacy>=3.0.8,<3.1.0
+spacy-loggers>=1.0.0,<2.0.0
cymem>=2.0.2,<2.1.0
preshed>=3.0.2,<3.1.0
thinc>=8.0.12,<8.1.0
@@ -17,6 +18,7 @@ requests>=2.13.0,<3.0.0
tqdm>=4.38.0,<5.0.0
pydantic>=1.7.4,!=1.8,!=1.8.1,<1.9.0
jinja2
+langcodes>=3.2.0,<4.0.0
# Official Python utilities
setuptools
packaging>=20.0
@@ -29,7 +31,7 @@ pytest-timeout>=1.3.0,<2.0.0
mock>=2.0.0,<3.0.0
flake8>=3.8.0,<3.10.0
hypothesis>=3.27.0,<7.0.0
-mypy>=0.910
+mypy==0.910
types-dataclasses>=0.1.3; python_version < "3.7"
types-mock>=0.1.1
types-requests

setup.cfg
View File

@@ -42,6 +42,7 @@ setup_requires =
install_requires =
    # Our libraries
    spacy-legacy>=3.0.8,<3.1.0
+    spacy-loggers>=1.0.0,<2.0.0
    murmurhash>=0.28.0,<1.1.0
    cymem>=2.0.2,<2.1.0
    preshed>=3.0.2,<3.1.0
@@ -62,6 +63,7 @@ install_requires =
    setuptools
    packaging>=20.0
    typing_extensions>=3.7.4,<4.0.0.0; python_version < "3.8"
+    langcodes>=3.2.0,<4.0.0
[options.entry_points]
console_scripts =
@@ -69,43 +71,45 @@ console_scripts =
[options.extras_require]
lookups =
-    spacy_lookups_data>=1.0.2,<1.1.0
+    spacy_lookups_data>=1.0.3,<1.1.0
transformers =
-    spacy_transformers>=1.0.1,<1.2.0
+    spacy_transformers>=1.1.2,<1.2.0
ray =
    spacy_ray>=0.1.0,<1.0.0
cuda =
-    cupy>=5.0.0b4,<10.0.0
+    cupy>=5.0.0b4,<11.0.0
cuda80 =
-    cupy-cuda80>=5.0.0b4,<10.0.0
+    cupy-cuda80>=5.0.0b4,<11.0.0
cuda90 =
-    cupy-cuda90>=5.0.0b4,<10.0.0
+    cupy-cuda90>=5.0.0b4,<11.0.0
cuda91 =
-    cupy-cuda91>=5.0.0b4,<10.0.0
+    cupy-cuda91>=5.0.0b4,<11.0.0
cuda92 =
-    cupy-cuda92>=5.0.0b4,<10.0.0
+    cupy-cuda92>=5.0.0b4,<11.0.0
cuda100 =
-    cupy-cuda100>=5.0.0b4,<10.0.0
+    cupy-cuda100>=5.0.0b4,<11.0.0
cuda101 =
-    cupy-cuda101>=5.0.0b4,<10.0.0
+    cupy-cuda101>=5.0.0b4,<11.0.0
cuda102 =
-    cupy-cuda102>=5.0.0b4,<10.0.0
+    cupy-cuda102>=5.0.0b4,<11.0.0
cuda110 =
-    cupy-cuda110>=5.0.0b4,<10.0.0
+    cupy-cuda110>=5.0.0b4,<11.0.0
cuda111 =
-    cupy-cuda111>=5.0.0b4,<10.0.0
+    cupy-cuda111>=5.0.0b4,<11.0.0
cuda112 =
-    cupy-cuda112>=5.0.0b4,<10.0.0
+    cupy-cuda112>=5.0.0b4,<11.0.0
cuda113 =
-    cupy-cuda113>=5.0.0b4,<10.0.0
+    cupy-cuda113>=5.0.0b4,<11.0.0
cuda114 =
-    cupy-cuda114>=5.0.0b4,<10.0.0
+    cupy-cuda114>=5.0.0b4,<11.0.0
+cuda115 =
+    cupy-cuda115>=5.0.0b4,<11.0.0
apple =
    thinc-apple-ops>=0.0.4,<1.0.0
# Language tokenizers with external dependencies
ja =
-    sudachipy>=0.4.9
-    sudachidict_core>=20200330
+    sudachipy>=0.5.2,!=0.6.1
+    sudachidict_core>=20211220
ko =
    natto-py==0.9.0
th =

setup.py
View File

@@ -78,6 +78,7 @@ COPY_FILES = {
    ROOT / "setup.cfg": PACKAGE_ROOT / "tests" / "package",
    ROOT / "pyproject.toml": PACKAGE_ROOT / "tests" / "package",
    ROOT / "requirements.txt": PACKAGE_ROOT / "tests" / "package",
+    ROOT / "website" / "meta" / "universe.json": PACKAGE_ROOT / "tests" / "universe",
}

spacy/about.py
View File

@@ -1,6 +1,6 @@
# fmt: off
__title__ = "spacy"
-__version__ = "3.1.4"
+__version__ = "3.2.1"
__download_url__ = "https://github.com/explosion/spacy-models/releases/download"
__compatibility__ = "https://raw.githubusercontent.com/explosion/spacy-models/master/compatibility.json"
__projects__ = "https://github.com/explosion/projects"

spacy/attrs.pyx
View File

@@ -1,3 +1,6 @@
+from .errors import Errors
+
+IOB_STRINGS = ("", "I", "O", "B")
+
IDS = {
    "": NULL_ATTR,
@@ -64,7 +67,6 @@ IDS = {
    "FLAG61": FLAG61,
    "FLAG62": FLAG62,
    "FLAG63": FLAG63,
-
    "ID": ID,
    "ORTH": ORTH,
    "LOWER": LOWER,
@@ -72,7 +74,6 @@ IDS = {
    "SHAPE": SHAPE,
    "PREFIX": PREFIX,
    "SUFFIX": SUFFIX,
-
    "LENGTH": LENGTH,
    "LEMMA": LEMMA,
    "POS": POS,
@@ -87,7 +88,7 @@ IDS = {
    "SPACY": SPACY,
    "LANG": LANG,
    "MORPH": MORPH,
-    "IDX": IDX
+    "IDX": IDX,
}
@@ -109,28 +110,66 @@ def intify_attrs(stringy_attrs, strings_map=None, _do_deprecated=False):
    """
    inty_attrs = {}
    if _do_deprecated:
-        if 'F' in stringy_attrs:
+        if "F" in stringy_attrs:
            stringy_attrs["ORTH"] = stringy_attrs.pop("F")
-        if 'L' in stringy_attrs:
+        if "L" in stringy_attrs:
            stringy_attrs["LEMMA"] = stringy_attrs.pop("L")
-        if 'pos' in stringy_attrs:
+        if "pos" in stringy_attrs:
            stringy_attrs["TAG"] = stringy_attrs.pop("pos")
-        if 'morph' in stringy_attrs:
-            morphs = stringy_attrs.pop('morph')
-        if 'number' in stringy_attrs:
-            stringy_attrs.pop('number')
-        if 'tenspect' in stringy_attrs:
-            stringy_attrs.pop('tenspect')
+        if "morph" in stringy_attrs:
+            morphs = stringy_attrs.pop("morph")
+        if "number" in stringy_attrs:
+            stringy_attrs.pop("number")
+        if "tenspect" in stringy_attrs:
+            stringy_attrs.pop("tenspect")
        morph_keys = [
-            'PunctType', 'PunctSide', 'Other', 'Degree', 'AdvType', 'Number',
-            'VerbForm', 'PronType', 'Aspect', 'Tense', 'PartType', 'Poss',
-            'Hyph', 'ConjType', 'NumType', 'Foreign', 'VerbType', 'NounType',
-            'Gender', 'Mood', 'Negative', 'Tense', 'Voice', 'Abbr',
-            'Derivation', 'Echo', 'Foreign', 'NameType', 'NounType', 'NumForm',
-            'NumValue', 'PartType', 'Polite', 'StyleVariant',
-            'PronType', 'AdjType', 'Person', 'Variant', 'AdpType',
-            'Reflex', 'Negative', 'Mood', 'Aspect', 'Case',
-            'Polarity', 'PrepCase', 'Animacy'  # U20
+            "PunctType",
+            "PunctSide",
+            "Other",
+            "Degree",
+            "AdvType",
+            "Number",
+            "VerbForm",
+            "PronType",
+            "Aspect",
+            "Tense",
+            "PartType",
+            "Poss",
+            "Hyph",
+            "ConjType",
+            "NumType",
+            "Foreign",
+            "VerbType",
+            "NounType",
+            "Gender",
+            "Mood",
+            "Negative",
+            "Tense",
+            "Voice",
+            "Abbr",
+            "Derivation",
+            "Echo",
+            "Foreign",
+            "NameType",
+            "NounType",
+            "NumForm",
+            "NumValue",
+            "PartType",
+            "Polite",
+            "StyleVariant",
+            "PronType",
+            "AdjType",
+            "Person",
+            "Variant",
+            "AdpType",
+            "Reflex",
+            "Negative",
+            "Mood",
+            "Aspect",
+            "Case",
+            "Polarity",
+            "PrepCase",
+            "Animacy",  # U20
        ]
        for key in morph_keys:
            if key in stringy_attrs:
@@ -142,8 +181,13 @@ def intify_attrs(stringy_attrs, strings_map=None, _do_deprecated=False):
    for name, value in stringy_attrs.items():
        int_key = intify_attr(name)
        if int_key is not None:
-            if strings_map is not None and isinstance(value, basestring):
-                if hasattr(strings_map, 'add'):
+            if int_key == ENT_IOB:
+                if value in IOB_STRINGS:
+                    value = IOB_STRINGS.index(value)
+                elif isinstance(value, str):
+                    raise ValueError(Errors.E1025.format(value=value))
+            if strings_map is not None and isinstance(value, str):
+                if hasattr(strings_map, "add"):
                    value = strings_map.add(value)
                else:
                    value = strings_map[value]

spacy/cli/debug_config.py
View File

@@ -25,7 +25,7 @@ def debug_config_cli(
    show_vars: bool = Opt(False, "--show-variables", "-V", help="Show an overview of all variables referenced in the config and their values. This will also reflect variables overwritten on the CLI.")
    # fmt: on
):
-    """Debug a config.cfg file and show validation errors. The command will
+    """Debug a config file and show validation errors. The command will
    create all objects in the tree and validate them. Note that some config
    validation errors are blocking and will prevent the rest of the config from
    being resolved. This means that you may not see all validation errors at

spacy/cli/debug_data.py
View File

@@ -14,7 +14,7 @@ from ..training.initialize import get_sourced_components
from ..schemas import ConfigSchemaTraining
from ..pipeline._parser_internals import nonproj
from ..pipeline._parser_internals.nonproj import DELIMITER
-from ..pipeline import Morphologizer
+from ..pipeline import Morphologizer, SpanCategorizer
from ..morphology import Morphology
from ..language import Language
from ..util import registry, resolve_dot_names
@@ -203,6 +203,7 @@ def debug_data(
    has_low_data_warning = False
    has_no_neg_warning = False
    has_ws_ents_error = False
+    has_boundary_cross_ents_warning = False
    msg.divider("Named Entity Recognition")
    msg.info(f"{len(model_labels)} label(s)")
@@ -242,12 +243,20 @@ def debug_data(
            msg.warn(f"No examples for texts WITHOUT new label '{label}'")
            has_no_neg_warning = True
+    if gold_train_data["boundary_cross_ents"]:
+        msg.warn(
+            f"{gold_train_data['boundary_cross_ents']} entity span(s) crossing sentence boundaries"
+        )
+        has_boundary_cross_ents_warning = True
    if not has_low_data_warning:
        msg.good("Good amount of examples for all labels")
    if not has_no_neg_warning:
        msg.good("Examples without occurrences available for all labels")
    if not has_ws_ents_error:
        msg.good("No entities consisting of or starting/ending with whitespace")
+    if not has_boundary_cross_ents_warning:
+        msg.good("No entities crossing sentence boundaries")
    if has_low_data_warning:
        msg.text(
@@ -565,6 +574,7 @@ def _compile_gold(
        "words": Counter(),
        "roots": Counter(),
        "ws_ents": 0,
+        "boundary_cross_ents": 0,
        "n_words": 0,
        "n_misaligned_words": 0,
        "words_missing_vectors": Counter(),
@@ -602,6 +612,8 @@ def _compile_gold(
            if label.startswith(("B-", "U-")):
                combined_label = label.split("-")[1]
                data["ner"][combined_label] += 1
+            if gold[i].is_sent_start and label.startswith(("I-", "L-")):
+                data["boundary_cross_ents"] += 1
            elif label == "-":
                data["ner"]["-"] += 1
    if "textcat" in factory_names or "textcat_multilabel" in factory_names:
@@ -687,8 +699,34 @@ def _get_examples_without_label(data: Sequence[Example], label: str) -> int:
    return count
-def _get_labels_from_model(nlp: Language, pipe_name: str) -> Set[str]:
-    if pipe_name not in nlp.pipe_names:
-        return set()
-    pipe = nlp.get_pipe(pipe_name)
-    return set(pipe.labels)
+def _get_labels_from_model(
+    nlp: Language, factory_name: str
+) -> Set[str]:
+    pipe_names = [
+        pipe_name
+        for pipe_name in nlp.pipe_names
+        if nlp.get_pipe_meta(pipe_name).factory == factory_name
+    ]
+    labels: Set[str] = set()
+    for pipe_name in pipe_names:
+        pipe = nlp.get_pipe(pipe_name)
+        labels.update(pipe.labels)
+    return labels
+def _get_labels_from_spancat(
+    nlp: Language
+) -> Dict[str, Set[str]]:
+    pipe_names = [
+        pipe_name
+        for pipe_name in nlp.pipe_names
+        if nlp.get_pipe_meta(pipe_name).factory == "spancat"
+    ]
+    labels: Dict[str, Set[str]] = {}
+    for pipe_name in pipe_names:
+        pipe = nlp.get_pipe(pipe_name)
+        assert isinstance(pipe, SpanCategorizer)
+        if pipe.key not in labels:
+            labels[pipe.key] = set()
+        labels[pipe.key].update(pipe.labels)
+    return labels

spacy/cli/init_config.py
View File

@@ -27,7 +27,7 @@ class Optimizations(str, Enum):
@init_cli.command("config")
def init_config_cli(
    # fmt: off
-    output_file: Path = Arg(..., help="File to save config.cfg to or - for stdout (will only output config and no additional logging info)", allow_dash=True),
+    output_file: Path = Arg(..., help="File to save the config to or - for stdout (will only output config and no additional logging info)", allow_dash=True),
    lang: str = Opt("en", "--lang", "-l", help="Two-letter code of the language to use"),
    pipeline: str = Opt("tagger,parser,ner", "--pipeline", "-p", help="Comma-separated names of trainable pipeline components to include (without 'tok2vec' or 'transformer')"),
    optimize: Optimizations = Opt(Optimizations.efficiency.value, "--optimize", "-o", help="Whether to optimize for efficiency (faster inference, smaller model, lower memory consumption) or higher accuracy (potentially larger and slower model). This will impact the choice of architecture, pretrained weights and related hyperparameters."),
@@ -37,7 +37,7 @@ def init_config_cli(
    # fmt: on
):
    """
-    Generate a starter config.cfg for training. Based on your requirements
+    Generate a starter config file for training. Based on your requirements
    specified via the CLI arguments, this command generates a config with the
    optimal settings for your use case. This includes the choice of architecture,
    pretrained weights and related hyperparameters.
@@ -66,15 +66,15 @@ def init_config_cli(
@init_cli.command("fill-config")
def init_fill_config_cli(
    # fmt: off
-    base_path: Path = Arg(..., help="Base config to fill", exists=True, dir_okay=False),
-    output_file: Path = Arg("-", help="File to save config.cfg to (or - for stdout)", allow_dash=True),
+    base_path: Path = Arg(..., help="Path to base config to fill", exists=True, dir_okay=False),
+    output_file: Path = Arg("-", help="Path to output .cfg file (or - for stdout)", allow_dash=True),
    pretraining: bool = Opt(False, "--pretraining", "-pt", help="Include config for pretraining (with 'spacy pretrain')"),
    diff: bool = Opt(False, "--diff", "-D", help="Print a visual diff highlighting the changes"),
    code_path: Optional[Path] = Opt(None, "--code-path", "--code", "-c", help="Path to Python file with additional code (registered functions) to be imported"),
    # fmt: on
):
    """
-    Fill partial config.cfg with default values. Will add all missing settings
+    Fill partial config file with default values. Will add all missing settings
    from the default config and will create all objects, check the registered
    functions for their default values and update the base config. This command
    can be used with a config generated via the training quickstart widget:

spacy/cli/init_pipeline.py
View File

@@ -20,6 +20,7 @@ def init_vectors_cli(
    output_dir: Path = Arg(..., help="Pipeline output directory"),
    prune: int = Opt(-1, "--prune", "-p", help="Optional number of vectors to prune to"),
    truncate: int = Opt(0, "--truncate", "-t", help="Optional number of vectors to truncate to when reading in vectors file"),
+    mode: str = Opt("default", "--mode", "-m", help="Vectors mode: default or floret"),
    name: Optional[str] = Opt(None, "--name", "-n", help="Optional name for the word vectors, e.g. en_core_web_lg.vectors"),
    verbose: bool = Opt(False, "--verbose", "-V", "-VV", help="Display more information for debugging purposes"),
    jsonl_loc: Optional[Path] = Opt(None, "--lexemes-jsonl", "-j", help="Location of JSONL-formatted attributes file", hidden=True),
@@ -34,7 +35,14 @@ def init_vectors_cli(
    nlp = util.get_lang_class(lang)()
    if jsonl_loc is not None:
        update_lexemes(nlp, jsonl_loc)
-    convert_vectors(nlp, vectors_loc, truncate=truncate, prune=prune, name=name)
+    convert_vectors(
+        nlp,
+        vectors_loc,
+        truncate=truncate,
+        prune=prune,
+        name=name,
+        mode=mode,
+    )
    msg.good(f"Successfully converted {len(nlp.vocab.vectors)} vectors")
    nlp.to_disk(output_dir)
    msg.good(

spacy/cli/package.py
View File

@@ -4,6 +4,7 @@ from pathlib import Path
from wasabi import Printer, MarkdownRenderer, get_raw_input
from thinc.api import Config
from collections import defaultdict
+from catalogue import RegistryError
import srsly
import sys
@@ -212,9 +213,18 @@ def get_third_party_dependencies(
        if "factory" in component:
            funcs["factories"].add(component["factory"])
    modules = set()
+    lang = config["nlp"]["lang"]
    for reg_name, func_names in funcs.items():
        for func_name in func_names:
-            func_info = util.registry.find(reg_name, func_name)
+            # Try the lang-specific version and fall back
+            try:
+                func_info = util.registry.find(reg_name, lang + "." + func_name)
+            except RegistryError:
+                try:
+                    func_info = util.registry.find(reg_name, func_name)
+                except RegistryError as regerr:
+                    # lang-specific version being absent is not actually an issue
+                    raise regerr from None
            module_name = func_info.get("module")  # type: ignore[attr-defined]
            if module_name:  # the code is part of a module, not a --code file
                modules.add(func_info["module"].split(".")[0])  # type: ignore[index]
@@ -397,7 +407,7 @@ def _format_label_scheme(data: Dict[str, Any]) -> str:
            continue
        col1 = md.bold(md.code(pipe))
        col2 = ", ".join(
-            [md.code(label.replace("|", "\\|")) for label in labels]
+            [md.code(str(label).replace("|", "\\|")) for label in labels]
        )  # noqa: W605
        label_data.append((col1, col2))
        n_labels += len(labels)

spacy/cli/project/assets.py
View File

@@ -1,6 +1,7 @@
from typing import Any, Dict, Optional
from pathlib import Path
from wasabi import msg
+import os
import re
import shutil
import requests
@@ -129,10 +130,17 @@ def fetch_asset(
    the asset failed.
    """
    dest_path = (project_path / dest).resolve()
-    if dest_path.exists() and checksum:
+    if dest_path.exists():
        # If there's already a file, check for checksum
-        if checksum == get_checksum(dest_path):
-            msg.good(f"Skipping download with matching checksum: {dest}")
+        if checksum:
+            if checksum == get_checksum(dest_path):
+                msg.good(f"Skipping download with matching checksum: {dest}")
+                return
+        else:
+            # If there's not a checksum, make sure the file is a possibly valid size
+            if os.path.getsize(dest_path) == 0:
+                msg.warn(f"Asset exists but with size of 0 bytes, deleting: {dest}")
+                os.remove(dest_path)
    # We might as well support the user here and create parent directories in
    # case the asset dir isn't listed as a dir to create in the project.yml
    if not dest_path.parent.exists():

spacy/cli/templates/quickstart_training.jinja
View File

@@ -16,8 +16,10 @@ gpu_allocator = null
[nlp]
lang = "{{ lang }}"
-{%- set no_tok2vec = components|length == 1 and (("textcat" in components or "textcat_multilabel" in components) and optimize == "efficiency")-%}
-{%- if not no_tok2vec and ("tagger" in components or "morphologizer" in components or "parser" in components or "ner" in components or "entity_linker" in components or "textcat" in components or "textcat_multilabel" in components) -%}
+{%- set has_textcat = ("textcat" in components or "textcat_multilabel" in components) -%}
+{%- set with_accuracy = optimize == "accuracy" -%}
+{%- set has_accurate_textcat = has_textcat and with_accuracy -%}
+{%- if ("tagger" in components or "morphologizer" in components or "parser" in components or "ner" in components or "entity_linker" in components or has_accurate_textcat) -%}
{%- set full_pipeline = ["transformer" if use_transformer else "tok2vec"] + components %}
{%- else -%}
{%- set full_pipeline = components %}
@@ -197,7 +199,7 @@ no_output_layer = false
{# NON-TRANSFORMER PIPELINE #}
{% else -%}
-{% if not no_tok2vec-%}
+{% if "tok2vec" in full_pipeline -%}
[components.tok2vec]
factory = "tok2vec"

spacy/default_config.cfg
View File

@@ -68,12 +68,14 @@ seed = ${system.seed}
gpu_allocator = ${system.gpu_allocator}
dropout = 0.1
accumulate_gradient = 1
-# Controls early-stopping. 0 disables early stopping.
+# Controls early-stopping, i.e., the number of steps to continue without
+# improvement before stopping. 0 disables early stopping.
patience = 1600
# Number of epochs. 0 means unlimited. If >= 0, train corpus is loaded once in
# memory and shuffled within the training loop. -1 means stream train corpus
# rather than loading in memory with no shuffling within the training loop.
max_epochs = 0
+# Maximum number of update steps to train for. 0 means an unlimited number of steps.
max_steps = 20000
eval_frequency = 200
# Control how scores are printed and checkpoints are evaluated.

spacy/default_config_pretraining.cfg
View File

@@ -5,6 +5,7 @@ raw_text = null
max_epochs = 1000
dropout = 0.2
n_save_every = null
+n_save_epoch = null
component = "tok2vec"
layer = ""
corpus = "corpora.pretrain"

spacy/displacy/__init__.py
View File

@ -181,11 +181,19 @@ def parse_deps(orig_doc: Doc, options: Dict[str, Any] = {}) -> Dict[str, Any]:
def parse_ents(doc: Doc, options: Dict[str, Any] = {}) -> Dict[str, Any]: def parse_ents(doc: Doc, options: Dict[str, Any] = {}) -> Dict[str, Any]:
"""Generate named entities in [{start: i, end: i, label: 'label'}] format. """Generate named entities in [{start: i, end: i, label: 'label'}] format.
doc (Doc): Document do parse. doc (Doc): Document to parse.
options (Dict[str, Any]): NER-specific visualisation options.
RETURNS (dict): Generated entities keyed by text (original text) and ents. RETURNS (dict): Generated entities keyed by text (original text) and ents.
""" """
kb_url_template = options.get("kb_url_template", None)
ents = [ ents = [
{"start": ent.start_char, "end": ent.end_char, "label": ent.label_} {
"start": ent.start_char,
"end": ent.end_char,
"label": ent.label_,
"kb_id": ent.kb_id_ if ent.kb_id_ else "",
"kb_url": kb_url_template.format(ent.kb_id_) if kb_url_template else "#",
}
for ent in doc.ents for ent in doc.ents
] ]
if not ents: if not ents:
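The extra `kb_id` and `kb_url` fields are what the entity visualizer uses to link entities out to a knowledge base. A minimal usage sketch (the entity span and the Wikidata URL template are invented for illustration; normally `ent.kb_id_` would be set by an `entity_linker`):

import spacy
from spacy import displacy
from spacy.tokens import Span

nlp = spacy.blank("en")
doc = nlp("Sundar Pichai is the CEO of Google.")
# Attach an entity with a KB ID by hand, purely for the example.
doc.ents = [Span(doc, 0, 2, label="PERSON", kb_id="Q3503829")]
html = displacy.render(
    doc,
    style="ent",
    options={"kb_url_template": "https://www.wikidata.org/wiki/{}"},
)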


@ -18,7 +18,7 @@ DEFAULT_LABEL_COLORS = {
"LOC": "#ff9561", "LOC": "#ff9561",
"PERSON": "#aa9cfc", "PERSON": "#aa9cfc",
"NORP": "#c887fb", "NORP": "#c887fb",
"FACILITY": "#9cc9cc", "FAC": "#9cc9cc",
"EVENT": "#ffeb80", "EVENT": "#ffeb80",
"LAW": "#ff8197", "LAW": "#ff8197",
"LANGUAGE": "#ff8197", "LANGUAGE": "#ff8197",


@ -1,18 +1,13 @@
import warnings import warnings
def add_codes(err_cls): class ErrorsWithCodes(type):
"""Add error codes to string messages via class attribute names.""" def __getattribute__(self, code):
msg = super().__getattribute__(code)
class ErrorsWithCodes(err_cls): if code.startswith("__"): # python system attributes like __class__
def __getattribute__(self, code): return msg
msg = super(ErrorsWithCodes, self).__getattribute__(code) else:
if code.startswith("__"): # python system attributes like __class__ return "[{code}] {msg}".format(code=code, msg=msg)
return msg
else:
return "[{code}] {msg}".format(code=code, msg=msg)
return ErrorsWithCodes()
def setup_default_warnings(): def setup_default_warnings():
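The refactor above replaces the `add_codes` decorator with a metaclass. A self-contained sketch of the same pattern shows how the code gets prepended on attribute access (the message text is shortened for the example):

class ErrorsWithCodes(type):
    def __getattribute__(self, code):
        msg = super().__getattribute__(code)
        if code.startswith("__"):  # leave dunders like __class__ untouched
            return msg
        return "[{code}] {msg}".format(code=code, msg=msg)

class Warnings(metaclass=ErrorsWithCodes):
    W005 = "Doc object not parsed."

print(Warnings.W005)  # -> "[W005] Doc object not parsed."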
@ -27,6 +22,9 @@ def setup_default_warnings():
# warn once about lemmatizer without required POS # warn once about lemmatizer without required POS
filter_warning("once", error_msg=Warnings.W108) filter_warning("once", error_msg=Warnings.W108)
# floret vector table cannot be modified
filter_warning("once", error_msg="[W114]")
def filter_warning(action: str, error_msg: str): def filter_warning(action: str, error_msg: str):
"""Customize how spaCy should handle a certain warning. """Customize how spaCy should handle a certain warning.
@ -44,8 +42,7 @@ def _escape_warning_msg(msg):
# fmt: off # fmt: off
@add_codes class Warnings(metaclass=ErrorsWithCodes):
class Warnings:
W005 = ("Doc object not parsed. This means displaCy won't be able to " W005 = ("Doc object not parsed. This means displaCy won't be able to "
"generate a dependency visualization for it. Make sure the Doc " "generate a dependency visualization for it. Make sure the Doc "
"was processed with a model that supports dependency parsing, and " "was processed with a model that supports dependency parsing, and "
@ -192,10 +189,12 @@ class Warnings:
"vectors are not identical to current pipeline vectors.") "vectors are not identical to current pipeline vectors.")
W114 = ("Using multiprocessing with GPU models is not recommended and may " W114 = ("Using multiprocessing with GPU models is not recommended and may "
"lead to errors.") "lead to errors.")
W115 = ("Skipping {method}: the floret vector table cannot be modified. "
"Vectors are calculated from character ngrams.")
W116 = ("Unable to clean attribute '{attr}'.")
@add_codes class Errors(metaclass=ErrorsWithCodes):
class Errors:
E001 = ("No component '{name}' found in pipeline. Available names: {opts}") E001 = ("No component '{name}' found in pipeline. Available names: {opts}")
E002 = ("Can't find factory for '{name}' for language {lang} ({lang_code}). " E002 = ("Can't find factory for '{name}' for language {lang} ({lang_code}). "
"This usually happens when spaCy calls `nlp.{method}` with a custom " "This usually happens when spaCy calls `nlp.{method}` with a custom "
@ -284,7 +283,7 @@ class Errors:
"you forget to call the `set_extension` method?") "you forget to call the `set_extension` method?")
E047 = ("Can't assign a value to unregistered extension attribute " E047 = ("Can't assign a value to unregistered extension attribute "
"'{name}'. Did you forget to call the `set_extension` method?") "'{name}'. Did you forget to call the `set_extension` method?")
E048 = ("Can't import language {lang} from spacy.lang: {err}") E048 = ("Can't import language {lang} or any matching language from spacy.lang: {err}")
E050 = ("Can't find model '{name}'. It doesn't seem to be a Python " E050 = ("Can't find model '{name}'. It doesn't seem to be a Python "
"package or a valid path to a data directory.") "package or a valid path to a data directory.")
E052 = ("Can't find model directory: {path}") E052 = ("Can't find model directory: {path}")
@ -518,13 +517,24 @@ class Errors:
E199 = ("Unable to merge 0-length span at `doc[{start}:{end}]`.") E199 = ("Unable to merge 0-length span at `doc[{start}:{end}]`.")
E200 = ("Can't yet set {attr} from Span. Vote for this feature on the " E200 = ("Can't yet set {attr} from Span. Vote for this feature on the "
"issue tracker: http://github.com/explosion/spaCy/issues") "issue tracker: http://github.com/explosion/spaCy/issues")
E202 = ("Unsupported alignment mode '{mode}'. Supported modes: {modes}.") E202 = ("Unsupported {name} mode '{mode}'. Supported modes: {modes}.")
# New errors added in v3.x # New errors added in v3.x
E866 = ("A SpanGroup is not functional after the corresponding Doc has " E858 = ("The {mode} vector table does not support this operation. "
"{alternative}")
E859 = ("The floret vector table cannot be modified.")
E860 = ("Can't truncate fasttext-bloom vectors.")
E861 = ("No 'keys' should be provided when initializing floret vectors "
"with 'minn' and 'maxn'.")
E862 = ("'hash_count' must be between 1-4 for floret vectors.")
E863 = ("'maxn' must be greater than or equal to 'minn'.")
E864 = ("The complete vector table 'data' is required to initialize floret "
"vectors.")
E865 = ("A SpanGroup is not functional after the corresponding Doc has "
"been garbage collected. To keep using the spans, make sure that " "been garbage collected. To keep using the spans, make sure that "
"the corresponding Doc object is still available in the scope of " "the corresponding Doc object is still available in the scope of "
"your function.") "your function.")
E866 = ("Expected a string or 'Doc' as input, but got: {type}.")
E867 = ("The 'textcat' component requires at least two labels because it " E867 = ("The 'textcat' component requires at least two labels because it "
"uses mutually exclusive classes where exactly one label is True " "uses mutually exclusive classes where exactly one label is True "
"for each doc. For binary classification tasks, you can use two " "for each doc. For binary classification tasks, you can use two "
@ -632,7 +642,7 @@ class Errors:
E912 = ("Failed to initialize lemmatizer. Missing lemmatizer table(s) found " E912 = ("Failed to initialize lemmatizer. Missing lemmatizer table(s) found "
"for mode '{mode}'. Required tables: {tables}. Found: {found}.") "for mode '{mode}'. Required tables: {tables}. Found: {found}.")
E913 = ("Corpus path can't be None. Maybe you forgot to define it in your " E913 = ("Corpus path can't be None. Maybe you forgot to define it in your "
"config.cfg or override it on the CLI?") ".cfg file or override it on the CLI?")
E914 = ("Executing {name} callback failed. Expected the function to " E914 = ("Executing {name} callback failed. Expected the function to "
"return the nlp object but got: {value}. Maybe you forgot to return " "return the nlp object but got: {value}. Maybe you forgot to return "
"the modified object in your function?") "the modified object in your function?")
@ -878,7 +888,13 @@ class Errors:
E1021 = ("`pos` value \"{pp}\" is not a valid Universal Dependencies tag. " E1021 = ("`pos` value \"{pp}\" is not a valid Universal Dependencies tag. "
"Non-UD tags should use the `tag` property.") "Non-UD tags should use the `tag` property.")
E1022 = ("Words must be of type str or int, but input is of type '{wtype}'") E1022 = ("Words must be of type str or int, but input is of type '{wtype}'")
E1023 = ("Couldn't read EntityRuler from the {path}. This file doesn't "
"exist.")
E1024 = ("A pattern with ID \"{ent_id}\" is not present in EntityRuler "
"patterns.")
E1025 = ("Cannot intify the value '{value}' as an IOB string. The only "
"supported values are: 'I', 'O', 'B' and ''")
# Deprecated model shortcuts, only used in errors and warnings # Deprecated model shortcuts, only used in errors and warnings
OLD_MODEL_SHORTCUTS = { OLD_MODEL_SHORTCUTS = {


@ -124,7 +124,7 @@ cdef class KnowledgeBase:
def get_alias_strings(self): def get_alias_strings(self):
return [self.vocab.strings[x] for x in self._alias_index] return [self.vocab.strings[x] for x in self._alias_index]
def add_entity(self, unicode entity, float freq, vector[float] entity_vector): def add_entity(self, str entity, float freq, vector[float] entity_vector):
""" """
Add an entity to the KB, optionally specifying its log probability based on corpus frequency Add an entity to the KB, optionally specifying its log probability based on corpus frequency
Return the hash of the entity ID/name at the end. Return the hash of the entity ID/name at the end.
@ -185,15 +185,15 @@ cdef class KnowledgeBase:
i += 1 i += 1
def contains_entity(self, unicode entity): def contains_entity(self, str entity):
cdef hash_t entity_hash = self.vocab.strings.add(entity) cdef hash_t entity_hash = self.vocab.strings.add(entity)
return entity_hash in self._entry_index return entity_hash in self._entry_index
def contains_alias(self, unicode alias): def contains_alias(self, str alias):
cdef hash_t alias_hash = self.vocab.strings.add(alias) cdef hash_t alias_hash = self.vocab.strings.add(alias)
return alias_hash in self._alias_index return alias_hash in self._alias_index
def add_alias(self, unicode alias, entities, probabilities): def add_alias(self, str alias, entities, probabilities):
""" """
For a given alias, add its potential entities and prior probabilities to the KB. For a given alias, add its potential entities and prior probabilities to the KB.
Return the alias_hash at the end Return the alias_hash at the end
@ -239,7 +239,7 @@ cdef class KnowledgeBase:
raise RuntimeError(Errors.E891.format(alias=alias)) raise RuntimeError(Errors.E891.format(alias=alias))
return alias_hash return alias_hash
def append_alias(self, unicode alias, unicode entity, float prior_prob, ignore_warnings=False): def append_alias(self, str alias, str entity, float prior_prob, ignore_warnings=False):
""" """
For an alias already existing in the KB, extend its potential entities with one more. For an alias already existing in the KB, extend its potential entities with one more.
Throw a warning if either the alias or the entity is unknown, Throw a warning if either the alias or the entity is unknown,
@ -286,7 +286,7 @@ cdef class KnowledgeBase:
alias_entry.probs = probs alias_entry.probs = probs
self._aliases_table[alias_index] = alias_entry self._aliases_table[alias_index] = alias_entry
def get_alias_candidates(self, unicode alias) -> Iterator[Candidate]: def get_alias_candidates(self, str alias) -> Iterator[Candidate]:
""" """
Return candidate entities for an alias. Each candidate defines the entity, the original alias, Return candidate entities for an alias. Each candidate defines the entity, the original alias,
and the prior probability of that alias resolving to that entity. and the prior probability of that alias resolving to that entity.
@ -307,7 +307,7 @@ cdef class KnowledgeBase:
for (entry_index, prior_prob) in zip(alias_entry.entry_indices, alias_entry.probs) for (entry_index, prior_prob) in zip(alias_entry.entry_indices, alias_entry.probs)
if entry_index != 0] if entry_index != 0]
def get_vector(self, unicode entity): def get_vector(self, str entity):
cdef hash_t entity_hash = self.vocab.strings[entity] cdef hash_t entity_hash = self.vocab.strings[entity]
# Return an empty list if this entity is unknown in this KB # Return an empty list if this entity is unknown in this KB
@ -317,7 +317,7 @@ cdef class KnowledgeBase:
return self._vectors_table[self._entries[entry_index].vector_index] return self._vectors_table[self._entries[entry_index].vector_index]
def get_prior_prob(self, unicode entity, unicode alias): def get_prior_prob(self, str entity, str alias):
""" Return the prior probability of a given alias being linked to a given entity, """ Return the prior probability of a given alias being linked to a given entity,
or return 0.0 when this combination is not known in the knowledge base""" or return 0.0 when this combination is not known in the knowledge base"""
cdef hash_t alias_hash = self.vocab.strings[alias] cdef hash_t alias_hash = self.vocab.strings[alias]
@ -587,7 +587,7 @@ cdef class Writer:
def __init__(self, path): def __init__(self, path):
assert isinstance(path, Path) assert isinstance(path, Path)
content = bytes(path) content = bytes(path)
cdef bytes bytes_loc = content.encode('utf8') if type(content) == unicode else content cdef bytes bytes_loc = content.encode('utf8') if type(content) == str else content
self._fp = fopen(<char*>bytes_loc, 'wb') self._fp = fopen(<char*>bytes_loc, 'wb')
if not self._fp: if not self._fp:
raise IOError(Errors.E146.format(path=path)) raise IOError(Errors.E146.format(path=path))
@ -629,7 +629,7 @@ cdef class Writer:
cdef class Reader: cdef class Reader:
def __init__(self, path): def __init__(self, path):
content = bytes(path) content = bytes(path)
cdef bytes bytes_loc = content.encode('utf8') if type(content) == unicode else content cdef bytes bytes_loc = content.encode('utf8') if type(content) == str else content
self._fp = fopen(<char*>bytes_loc, 'rb') self._fp = fopen(<char*>bytes_loc, 'rb')
if not self._fp: if not self._fp:
PyErr_SetFromErrno(IOError) PyErr_SetFromErrno(IOError)
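Since the signatures above now take plain `str`, here is a short usage sketch of the `KnowledgeBase` methods touched by this hunk (the entity ID, alias and vectors are invented for the example):

import spacy
from spacy.kb import KnowledgeBase

nlp = spacy.blank("en")
kb = KnowledgeBase(vocab=nlp.vocab, entity_vector_length=3)
kb.add_entity(entity="Q42", freq=42.0, entity_vector=[1.0, 2.0, 3.0])
kb.add_alias(alias="Douglas", entities=["Q42"], probabilities=[0.9])
candidates = kb.get_alias_candidates("Douglas")
print([(c.entity_, c.prior_prob) for c in candidates])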


@ -1,7 +1,7 @@
from ..char_classes import LIST_PUNCT, LIST_ELLIPSES, LIST_QUOTES, CURRENCY from ..char_classes import LIST_PUNCT, LIST_ELLIPSES, LIST_QUOTES, CURRENCY
from ..char_classes import UNITS, ALPHA_UPPER from ..char_classes import UNITS, ALPHA_UPPER
_list_punct = LIST_PUNCT + "፡ ። ፣ ፤ ፥ ፦ ፧".strip().split() _list_punct = LIST_PUNCT + "፡ ። ፣ ፤ ፥ ፦ ፧ ፠ ፨".strip().split()
_suffixes = ( _suffixes = (
_list_punct _list_punct


@ -1,265 +1,79 @@
# Source: https://github.com/Alir3z4/stop-words """
References:
https://github.com/Alir3z4/stop-words - Original list, serves as a base.
https://postvai.com/books/stop-dumi.pdf - Additions to the original list in order to improve it.
"""
STOP_WORDS = set( STOP_WORDS = set(
""" """
а а автентичен аз ако ала
автентичен
аз бе без беше би бивш бивша бившо бивши бил била били било благодаря близо бъдат
ако бъде бъда бяха
ала
бе в вас ваш ваша вашата вашият вероятно вече взема ви вие винаги внимава време все
без всеки всички вместо всичко вследствие всъщност всяка втори във въпреки върху
беше вътре веднъж
би
бивш г ги главен главна главно глас го годно година години годишен
бивша
бившо д да дали далеч далече два двама двамата две двете ден днес дни до добра добре
бил добро добър достатъчно докато докога дори досега доста друг друга другаде други
била
били е евтин едва един една еднаква еднакви еднакъв едно екип ето
било
благодаря живот жив
близо
бъдат за здравей здрасти знае зная забавям зад зададени заедно заради засега заспал
бъде затова запазва започвам защо защото завинаги
бяха
в и из или им има имат иска искам използвайки изглежда изглеждаше изглеждайки
вас извън имайки
ваш
ваша й йо
вероятно
вече каза казва казвайки казвам как каква какво както какъв като кога кауза каузи
взема когато когото което които кой който колко която къде където към край кратък
ви кръгъл
вие
винаги лесен лесно ли летя летиш летим лош
внимава
време м май малко макар малцина междувременно минус ме между мек мен месец ми мис
все мисля много мнозина мога могат може мой можем мокър моля момента му
всеки
всички н на над назад най наш навсякъде навътре нагоре направи напред надолу наистина
всичко например наопаки наполовина напоследък нека независимо нас насам наскоро
всяка настрана необходимо него негов нещо нея ни ние никой нито нищо но нов някак нова
във нови новина някои някой някога някъде няколко няма
въпреки
върху о обаче около описан опитах опитва опитвайки опитвам определен определено освен
г обикновено осигурява обратно означава особен особено от ох отвъд отгоре отдолу
ги отново отива отивам отидох отсега отделно отколкото откъдето очевидно оттам
главен относно още
главна
главно п пак по повече повечето под поне просто пряко поради после последен последно
глас посочен почти прави прав прави правя пред преди през при пък първата първи първо
го път пъти плюс
година
години равен равна различен различни разумен разумно
годишен
д с са сам само себе сериозно сигурен сигурно се сега си син скоро скорошен след
да следващ следващия следва следното следователно случва сме смях собствен
дали сравнително смея според сред става срещу съвсем съдържа съдържащ съжалявам
два съответен съответно сте съм със също
двама
двамата т така техен техни такива такъв твърде там трета твой те тези ти то това
две тогава този той търси толкова точно три трябва тук тъй тя тях
двете
ден у утре ужасно употреба успоредно уточнен уточняване
днес
дни харесва харесали хиляди
до
добра ч часа ценя цяло цялостен че често чрез чудя
добре
добро ще щеше щом щяха
добър
докато
докога
дори
досега
доста
друг
друга
други
е
евтин
едва
един
една
еднаква
еднакви
еднакъв
едно
екип
ето
живот
за
забавям
зад
заедно
заради
засега
заспал
затова
защо
защото
и
из
или
им
има
имат
иска
й
каза
как
каква
какво
както
какъв
като
кога
когато
което
които
кой
който
колко
която
къде
където
към
лесен
лесно
ли
лош
м
май
малко
ме
между
мек
мен
месец
ми
много
мнозина
мога
могат
може
мокър
моля
момента
му
н
на
над
назад
най
направи
напред
например
нас
не
него
нещо
нея
ни
ние
никой
нито
нищо
но
нов
нова
нови
новина
някои
някой
няколко
няма
обаче
около
освен
особено
от
отгоре
отново
още
пак
по
повече
повечето
под
поне
поради
после
почти
прави
пред
преди
през
при
пък
първата
първи
първо
пъти
равен
равна
с
са
сам
само
се
сега
си
син
скоро
след
следващ
сме
смях
според
сред
срещу
сте
съм
със
също
т
тази
така
такива
такъв
там
твой
те
тези
ти
т.н.
то
това
тогава
този
той
толкова
точно
три
трябва
тук
тъй
тя
тях
у
утре
харесва
хиляди
ч
часа
че
често
чрез
ще
щом
юмрук юмрук
я
як я як
""".split() """.split()
) )
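A quick way to see the merged list in use (a sketch; the sample sentence is arbitrary):

import spacy
from spacy.lang.bg.stop_words import STOP_WORDS

print("автентичен" in STOP_WORDS)  # True
nlp = spacy.blank("bg")
print([(t.text, t.is_stop) for t in nlp("аз съм тук")])  # expect True for all three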


@ -1,10 +1,16 @@
"""
References:
https://slovored.com/bg/abbr/grammar/ - Additional refs for abbreviations
(countries, occupations, fields of studies and more).
"""
from ...symbols import ORTH, NORM from ...symbols import ORTH, NORM
_exc = {} _exc = {}
# measurements
_abbr_exc = [ for abbr in [
{ORTH: "м", NORM: "метър"}, {ORTH: "м", NORM: "метър"},
{ORTH: "мм", NORM: "милиметър"}, {ORTH: "мм", NORM: "милиметър"},
{ORTH: "см", NORM: "сантиметър"}, {ORTH: "см", NORM: "сантиметър"},
@ -17,51 +23,191 @@ _abbr_exc = [
{ORTH: "хл", NORM: "хектолиър"}, {ORTH: "хл", NORM: "хектолиър"},
{ORTH: "дкл", NORM: "декалитър"}, {ORTH: "дкл", NORM: "декалитър"},
{ORTH: "л", NORM: "литър"}, {ORTH: "л", NORM: "литър"},
] ]:
for abbr in _abbr_exc:
_exc[abbr[ORTH]] = [abbr] _exc[abbr[ORTH]] = [abbr]
_abbr_line_exc = [ # line abbreviations
for abbr in [
{ORTH: "г-жа", NORM: "госпожа"}, {ORTH: "г-жа", NORM: "госпожа"},
{ORTH: "г", NORM: "господин"}, {ORTH: "г", NORM: "господин"},
{ORTH: "г-ца", NORM: "госпожица"}, {ORTH: "г-ца", NORM: "госпожица"},
{ORTH: "д-р", NORM: "доктор"}, {ORTH: "д-р", NORM: "доктор"},
{ORTH: "о", NORM: "остров"}, {ORTH: "о", NORM: "остров"},
{ORTH: "п-в", NORM: "полуостров"}, {ORTH: "п-в", NORM: "полуостров"},
] {ORTH: "с-у", NORM: "срещу"},
{ORTH: "в-у", NORM: "върху"},
for abbr in _abbr_line_exc: {ORTH: "м-у", NORM: "между"},
]:
_exc[abbr[ORTH]] = [abbr] _exc[abbr[ORTH]] = [abbr]
_abbr_dot_exc = [ # foreign language related abbreviations
for abbr in [
{ORTH: "англ.", NORM: "английски"},
{ORTH: "ан.", NORM: "английски термин"},
{ORTH: "араб.", NORM: "арабски"},
{ORTH: "афр.", NORM: "африкански"},
{ORTH: "гр.", NORM: "гръцки"},
{ORTH: "лат.", NORM: "латински"},
{ORTH: "рим.", NORM: "римски"},
{ORTH: "старогр.", NORM: "старогръцки"},
{ORTH: "староевр.", NORM: "староеврейски"},
{ORTH: "фр.", NORM: "френски"},
{ORTH: "хол.", NORM: "холандски"},
{ORTH: "швед.", NORM: "шведски"},
{ORTH: "шотл.", NORM: "шотландски"},
{ORTH: "яп.", NORM: "японски"},
]:
_exc[abbr[ORTH]] = [abbr]
# profession and academic titles abbreviations
for abbr in [
{ORTH: "акад.", NORM: "академик"}, {ORTH: "акад.", NORM: "академик"},
{ORTH: "ал.", NORM: "алинея"},
{ORTH: "арх.", NORM: "архитект"}, {ORTH: "арх.", NORM: "архитект"},
{ORTH: "инж.", NORM: "инженер"},
{ORTH: "канц.", NORM: "канцлер"},
{ORTH: "проф.", NORM: "професор"},
{ORTH: "св.", NORM: "свети"},
]:
_exc[abbr[ORTH]] = [abbr]
# fields of studies
for abbr in [
{ORTH: "агр.", NORM: "агрономия"},
{ORTH: "ав.", NORM: "авиация"},
{ORTH: "агр.", NORM: "агрономия"},
{ORTH: "археол.", NORM: "археология"},
{ORTH: "астр.", NORM: "астрономия"},
{ORTH: "геод.", NORM: "геодезия"},
{ORTH: "геол.", NORM: "геология"},
{ORTH: "геом.", NORM: "геометрия"},
{ORTH: "гимн.", NORM: "гимнастика"},
{ORTH: "грам.", NORM: "граматика"},
{ORTH: "жур.", NORM: "журналистика"},
{ORTH: "журн.", NORM: "журналистика"},
{ORTH: "зем.", NORM: "земеделие"},
{ORTH: "икон.", NORM: "икономика"},
{ORTH: "лит.", NORM: "литература"},
{ORTH: "мат.", NORM: "математика"},
{ORTH: "мед.", NORM: "медицина"},
{ORTH: "муз.", NORM: "музика"},
{ORTH: "печ.", NORM: "печатарство"},
{ORTH: "пол.", NORM: "политика"},
{ORTH: "псих.", NORM: "психология"},
{ORTH: "соц.", NORM: "социология"},
{ORTH: "стат.", NORM: "статистика"},
{ORTH: "стил.", NORM: "стилистика"},
{ORTH: "топогр.", NORM: "топография"},
{ORTH: "търг.", NORM: "търговия"},
{ORTH: "фарм.", NORM: "фармацевтика"},
{ORTH: "фехт.", NORM: "фехтовка"},
{ORTH: "физиол.", NORM: "физиология"},
{ORTH: "физ.", NORM: "физика"},
{ORTH: "фил.", NORM: "философия"},
{ORTH: "фин.", NORM: "финанси"},
{ORTH: "фолкл.", NORM: "фолклор"},
{ORTH: "фон.", NORM: "фонетика"},
{ORTH: "фот.", NORM: "фотография"},
{ORTH: "футб.", NORM: "футбол"},
{ORTH: "хим.", NORM: "химия"},
{ORTH: "хир.", NORM: "хирургия"},
{ORTH: "ел.", NORM: "електротехника"},
]:
_exc[abbr[ORTH]] = [abbr]
for abbr in [
{ORTH: "ал.", NORM: "алинея"},
{ORTH: "авт.", NORM: "автоматично"},
{ORTH: "адм.", NORM: "администрация"},
{ORTH: "арт.", NORM: "артилерия"},
{ORTH: "бл.", NORM: "блок"}, {ORTH: "бл.", NORM: "блок"},
{ORTH: "бр.", NORM: "брой"}, {ORTH: "бр.", NORM: "брой"},
{ORTH: "бул.", NORM: "булевард"}, {ORTH: "бул.", NORM: "булевард"},
{ORTH: "букв.", NORM: "буквално"},
{ORTH: "в.", NORM: "век"}, {ORTH: "в.", NORM: "век"},
{ORTH: "вр.", NORM: "време"},
{ORTH: "вм.", NORM: "вместо"},
{ORTH: "воен.", NORM: "военен термин"},
{ORTH: "г.", NORM: "година"}, {ORTH: "г.", NORM: "година"},
{ORTH: "гр.", NORM: "град"}, {ORTH: "гр.", NORM: "град"},
{ORTH: "гл.", NORM: "глагол"},
{ORTH: "др.", NORM: "други"},
{ORTH: "ез.", NORM: "езеро"},
{ORTH: "ж.р.", NORM: "женски род"}, {ORTH: "ж.р.", NORM: "женски род"},
{ORTH: "инж.", NORM: "инженер"}, {ORTH: "жп.", NORM: "железопът"},
{ORTH: "застр.", NORM: "застрахователно дело"},
{ORTH: "знач.", NORM: "значение"},
{ORTH: "и др.", NORM: "и други"},
{ORTH: "и под.", NORM: "и подобни"},
{ORTH: "и пр.", NORM: "и прочие"},
{ORTH: "изр.", NORM: "изречение"},
{ORTH: "изт.", NORM: "източен"},
{ORTH: "конкр.", NORM: "конкретно"},
{ORTH: "лв.", NORM: "лев"}, {ORTH: "лв.", NORM: "лев"},
{ORTH: "л.", NORM: "лице"},
{ORTH: "м.р.", NORM: "мъжки род"}, {ORTH: "м.р.", NORM: "мъжки род"},
{ORTH: "мат.", NORM: "математика"}, {ORTH: "мин.вр.", NORM: "минало време"},
{ORTH: "мед.", NORM: "медицина"}, {ORTH: "мн.ч.", NORM: "множествено число"},
{ORTH: "напр.", NORM: "например"},
{ORTH: "нар.", NORM: "наречие"},
{ORTH: "науч.", NORM: "научен термин"},
{ORTH: "непр.", NORM: "неправилно"},
{ORTH: "обик.", NORM: "обикновено"},
{ORTH: "опред.", NORM: "определение"},
{ORTH: "особ.", NORM: "особено"},
{ORTH: "ост.", NORM: "остаряло"},
{ORTH: "относ.", NORM: "относително"},
{ORTH: "отр.", NORM: "отрицателно"},
{ORTH: "пл.", NORM: "площад"}, {ORTH: "пл.", NORM: "площад"},
{ORTH: "проф.", NORM: "професор"}, {ORTH: "пад.", NORM: "падеж"},
{ORTH: "парл.", NORM: "парламентарен"},
{ORTH: "погов.", NORM: "поговорка"},
{ORTH: "пон.", NORM: "понякога"},
{ORTH: "правосл.", NORM: "православен"},
{ORTH: "прибл.", NORM: "приблизително"},
{ORTH: "прил.", NORM: "прилагателно име"},
{ORTH: "пр.", NORM: "прочие"},
{ORTH: "с.", NORM: "село"}, {ORTH: "с.", NORM: "село"},
{ORTH: "с.р.", NORM: "среден род"}, {ORTH: "с.р.", NORM: "среден род"},
{ORTH: "св.", NORM: "свети"},
{ORTH: "сп.", NORM: "списание"}, {ORTH: "сп.", NORM: "списание"},
{ORTH: "стр.", NORM: "страница"}, {ORTH: "стр.", NORM: "страница"},
{ORTH: "сз.", NORM: "съюз"},
{ORTH: "сег.", NORM: "сегашно"},
{ORTH: "сп.", NORM: "спорт"},
{ORTH: "срв.", NORM: "сравни"},
{ORTH: "с.ст.", NORM: "селскостопанска техника"},
{ORTH: "счет.", NORM: "счетоводство"},
{ORTH: "съкр.", NORM: "съкратено"},
{ORTH: "съобщ.", NORM: "съобщение"},
{ORTH: "същ.", NORM: "съществително"},
{ORTH: "текст.", NORM: "текстилен"},
{ORTH: "телев.", NORM: "телевизия"},
{ORTH: "тел.", NORM: "телефон"},
{ORTH: "т.е.", NORM: "тоест"},
{ORTH: "т.н.", NORM: "така нататък"},
{ORTH: "т.нар.", NORM: "така наречен"},
{ORTH: "търж.", NORM: "тържествено"},
{ORTH: "ул.", NORM: "улица"}, {ORTH: "ул.", NORM: "улица"},
{ORTH: "уч.", NORM: "училище"},
{ORTH: "унив.", NORM: "университет"},
{ORTH: "харт.", NORM: "хартия"},
{ORTH: "хидр.", NORM: "хидравлика"},
{ORTH: "хран.", NORM: "хранителна"},
{ORTH: "църк.", NORM: "църковен термин"},
{ORTH: "числ.", NORM: "числително"},
{ORTH: "чл.", NORM: "член"}, {ORTH: "чл.", NORM: "член"},
] {ORTH: "ч.", NORM: "число"},
{ORTH: "числ.", NORM: "числително"},
for abbr in _abbr_dot_exc: {ORTH: "шахм.", NORM: "шахмат"},
{ORTH: "шах.", NORM: "шахмат"},
{ORTH: "юр.", NORM: "юридически"},
]:
_exc[abbr[ORTH]] = [abbr] _exc[abbr[ORTH]] = [abbr]
# slash abbreviations
for abbr in [
{ORTH: "м/у", NORM: "между"},
{ORTH: "с/у", NORM: "срещу"},
]:
_exc[abbr[ORTH]] = [abbr]
TOKENIZER_EXCEPTIONS = _exc TOKENIZER_EXCEPTIONS = _exc
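The effect of these exceptions can be checked with a blank Bulgarian pipeline (a sketch; the sentence is made up):

import spacy

nlp = spacy.blank("bg")
doc = nlp("проф. Иванов живее на бул. Витоша")
print([t.text for t in doc])  # "проф." and "бул." stay single tokens
print(doc[0].norm_)           # "професор"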


@ -1,4 +1,4 @@
from typing import Optional from typing import Optional, Callable
from thinc.api import Model from thinc.api import Model
from .tokenizer_exceptions import TOKENIZER_EXCEPTIONS from .tokenizer_exceptions import TOKENIZER_EXCEPTIONS
from .punctuation import TOKENIZER_PREFIXES, TOKENIZER_SUFFIXES, TOKENIZER_INFIXES from .punctuation import TOKENIZER_PREFIXES, TOKENIZER_SUFFIXES, TOKENIZER_INFIXES
@ -23,13 +23,25 @@ class Bengali(Language):
@Bengali.factory( @Bengali.factory(
"lemmatizer", "lemmatizer",
assigns=["token.lemma"], assigns=["token.lemma"],
default_config={"model": None, "mode": "rule", "overwrite": False}, default_config={
"model": None,
"mode": "rule",
"overwrite": False,
"scorer": {"@scorers": "spacy.lemmatizer_scorer.v1"},
},
default_score_weights={"lemma_acc": 1.0}, default_score_weights={"lemma_acc": 1.0},
) )
def make_lemmatizer( def make_lemmatizer(
nlp: Language, model: Optional[Model], name: str, mode: str, overwrite: bool nlp: Language,
model: Optional[Model],
name: str,
mode: str,
overwrite: bool,
scorer: Optional[Callable],
): ):
return Lemmatizer(nlp.vocab, model, name, mode=mode, overwrite=overwrite) return Lemmatizer(
nlp.vocab, model, name, mode=mode, overwrite=overwrite, scorer=scorer
)
__all__ = ["Bengali"] __all__ = ["Bengali"]

spacy/lang/ca/__init__.py Normal file → Executable file

@ -1,9 +1,9 @@
from typing import Optional from typing import Optional, Callable
from thinc.api import Model from thinc.api import Model
from .tokenizer_exceptions import TOKENIZER_EXCEPTIONS from .tokenizer_exceptions import TOKENIZER_EXCEPTIONS
from .punctuation import TOKENIZER_INFIXES, TOKENIZER_SUFFIXES from .punctuation import TOKENIZER_INFIXES, TOKENIZER_SUFFIXES, TOKENIZER_PREFIXES
from .stop_words import STOP_WORDS from .stop_words import STOP_WORDS
from .lex_attrs import LEX_ATTRS from .lex_attrs import LEX_ATTRS
from .syntax_iterators import SYNTAX_ITERATORS from .syntax_iterators import SYNTAX_ITERATORS
@ -15,6 +15,7 @@ class CatalanDefaults(BaseDefaults):
tokenizer_exceptions = TOKENIZER_EXCEPTIONS tokenizer_exceptions = TOKENIZER_EXCEPTIONS
infixes = TOKENIZER_INFIXES infixes = TOKENIZER_INFIXES
suffixes = TOKENIZER_SUFFIXES suffixes = TOKENIZER_SUFFIXES
prefixes = TOKENIZER_PREFIXES
stop_words = STOP_WORDS stop_words = STOP_WORDS
lex_attr_getters = LEX_ATTRS lex_attr_getters = LEX_ATTRS
syntax_iterators = SYNTAX_ITERATORS syntax_iterators = SYNTAX_ITERATORS
@ -28,13 +29,25 @@ class Catalan(Language):
@Catalan.factory( @Catalan.factory(
"lemmatizer", "lemmatizer",
assigns=["token.lemma"], assigns=["token.lemma"],
default_config={"model": None, "mode": "rule", "overwrite": False}, default_config={
"model": None,
"mode": "rule",
"overwrite": False,
"scorer": {"@scorers": "spacy.lemmatizer_scorer.v1"},
},
default_score_weights={"lemma_acc": 1.0}, default_score_weights={"lemma_acc": 1.0},
) )
def make_lemmatizer( def make_lemmatizer(
nlp: Language, model: Optional[Model], name: str, mode: str, overwrite: bool nlp: Language,
model: Optional[Model],
name: str,
mode: str,
overwrite: bool,
scorer: Optional[Callable],
): ):
return CatalanLemmatizer(nlp.vocab, model, name, mode=mode, overwrite=overwrite) return CatalanLemmatizer(
nlp.vocab, model, name, mode=mode, overwrite=overwrite, scorer=scorer
)
__all__ = ["Catalan"] __all__ = ["Catalan"]

spacy/lang/ca/punctuation.py Normal file → Executable file

@ -1,4 +1,5 @@
from ..char_classes import LIST_PUNCT, LIST_ELLIPSES, LIST_QUOTES, LIST_ICONS from ..char_classes import LIST_PUNCT, LIST_ELLIPSES, LIST_QUOTES, LIST_ICONS
from ..char_classes import LIST_CURRENCY
from ..char_classes import CURRENCY from ..char_classes import CURRENCY
from ..char_classes import CONCAT_QUOTES, ALPHA_LOWER, ALPHA_UPPER, ALPHA, PUNCT from ..char_classes import CONCAT_QUOTES, ALPHA_LOWER, ALPHA_UPPER, ALPHA, PUNCT
from ..char_classes import merge_chars, _units from ..char_classes import merge_chars, _units
@ -6,6 +7,14 @@ from ..char_classes import merge_chars, _units
ELISION = " ' ".strip().replace(" ", "").replace("\n", "") ELISION = " ' ".strip().replace(" ", "").replace("\n", "")
_prefixes = (
["§", "%", "=", "", "", "-", r"\+(?![0-9])"]
+ LIST_PUNCT
+ LIST_ELLIPSES
+ LIST_QUOTES
+ LIST_CURRENCY
+ LIST_ICONS
)
_infixes = ( _infixes = (
LIST_ELLIPSES LIST_ELLIPSES
@ -18,6 +27,7 @@ _infixes = (
r"(?<=[{a}]),(?=[{a}])".format(a=ALPHA), r"(?<=[{a}]),(?=[{a}])".format(a=ALPHA),
r"(?<=[{a}0-9])[:<>=/](?=[{a}])".format(a=ALPHA), r"(?<=[{a}0-9])[:<>=/](?=[{a}])".format(a=ALPHA),
r"(?<=[{a}][{el}])(?=[{a}0-9])".format(a=ALPHA, el=ELISION), r"(?<=[{a}][{el}])(?=[{a}0-9])".format(a=ALPHA, el=ELISION),
r"('ls|'l|'ns|'t|'m|'n|-les|-la|-lo|-li|-los|-me|-nos|-te|-vos|-se|-hi|-ne|-ho)(?![A-Za-z])|(-l'|-m'|-t'|-n')",
] ]
) )
@ -44,3 +54,4 @@ _suffixes = (
TOKENIZER_INFIXES = _infixes TOKENIZER_INFIXES = _infixes
TOKENIZER_SUFFIXES = _suffixes TOKENIZER_SUFFIXES = _suffixes
TOKENIZER_PREFIXES = _prefixes

spacy/lang/ca/tokenizer_exceptions.py Normal file → Executable file

@ -18,12 +18,21 @@ for exc_data in [
{ORTH: "nov.", NORM: "novembre"}, {ORTH: "nov.", NORM: "novembre"},
{ORTH: "dec.", NORM: "desembre"}, {ORTH: "dec.", NORM: "desembre"},
{ORTH: "Dr.", NORM: "doctor"}, {ORTH: "Dr.", NORM: "doctor"},
{ORTH: "Dra.", NORM: "doctora"},
{ORTH: "Sr.", NORM: "senyor"}, {ORTH: "Sr.", NORM: "senyor"},
{ORTH: "Sra.", NORM: "senyora"}, {ORTH: "Sra.", NORM: "senyora"},
{ORTH: "Srta.", NORM: "senyoreta"}, {ORTH: "Srta.", NORM: "senyoreta"},
{ORTH: "núm", NORM: "número"}, {ORTH: "núm", NORM: "número"},
{ORTH: "St.", NORM: "sant"}, {ORTH: "St.", NORM: "sant"},
{ORTH: "Sta.", NORM: "santa"}, {ORTH: "Sta.", NORM: "santa"},
{ORTH: "pl.", NORM: "plaça"},
{ORTH: "à."},
{ORTH: "è."},
{ORTH: "é."},
{ORTH: "í."},
{ORTH: "ò."},
{ORTH: "ó."},
{ORTH: "ú."},
{ORTH: "'l"}, {ORTH: "'l"},
{ORTH: "'ls"}, {ORTH: "'ls"},
{ORTH: "'m"}, {ORTH: "'m"},
@ -34,6 +43,18 @@ for exc_data in [
]: ]:
_exc[exc_data[ORTH]] = [exc_data] _exc[exc_data[ORTH]] = [exc_data]
_exc["del"] = [{ORTH: "d", NORM: "de"}, {ORTH: "el"}]
_exc["dels"] = [{ORTH: "d", NORM: "de"}, {ORTH: "els"}]
_exc["al"] = [{ORTH: "a"}, {ORTH: "l", NORM: "el"}]
_exc["als"] = [{ORTH: "a"}, {ORTH: "ls", NORM: "els"}]
_exc["pel"] = [{ORTH: "p", NORM: "per"}, {ORTH: "el"}]
_exc["pels"] = [{ORTH: "p", NORM: "per"}, {ORTH: "els"}]
_exc["holahola"] = [{ORTH: "holahola", NORM: "cocacola"}]
# Times # Times
_exc["12m."] = [{ORTH: "12"}, {ORTH: "m.", NORM: "p.m."}] _exc["12m."] = [{ORTH: "12"}, {ORTH: "m.", NORM: "p.m."}]


@ -1,4 +1,4 @@
from typing import Optional from typing import Optional, Callable
from thinc.api import Model from thinc.api import Model
from .tokenizer_exceptions import TOKENIZER_EXCEPTIONS from .tokenizer_exceptions import TOKENIZER_EXCEPTIONS
@ -28,13 +28,25 @@ class Greek(Language):
@Greek.factory( @Greek.factory(
"lemmatizer", "lemmatizer",
assigns=["token.lemma"], assigns=["token.lemma"],
default_config={"model": None, "mode": "rule", "overwrite": False}, default_config={
"model": None,
"mode": "rule",
"overwrite": False,
"scorer": {"@scorers": "spacy.lemmatizer_scorer.v1"},
},
default_score_weights={"lemma_acc": 1.0}, default_score_weights={"lemma_acc": 1.0},
) )
def make_lemmatizer( def make_lemmatizer(
nlp: Language, model: Optional[Model], name: str, mode: str, overwrite: bool nlp: Language,
model: Optional[Model],
name: str,
mode: str,
overwrite: bool,
scorer: Optional[Callable],
): ):
return GreekLemmatizer(nlp.vocab, model, name, mode=mode, overwrite=overwrite) return GreekLemmatizer(
nlp.vocab, model, name, mode=mode, overwrite=overwrite, scorer=scorer
)
__all__ = ["Greek"] __all__ = ["Greek"]


@ -1,4 +1,4 @@
from typing import Optional from typing import Optional, Callable
from thinc.api import Model from thinc.api import Model
from .tokenizer_exceptions import TOKENIZER_EXCEPTIONS from .tokenizer_exceptions import TOKENIZER_EXCEPTIONS
@ -26,13 +26,25 @@ class English(Language):
@English.factory( @English.factory(
"lemmatizer", "lemmatizer",
assigns=["token.lemma"], assigns=["token.lemma"],
default_config={"model": None, "mode": "rule", "overwrite": False}, default_config={
"model": None,
"mode": "rule",
"overwrite": False,
"scorer": {"@scorers": "spacy.lemmatizer_scorer.v1"},
},
default_score_weights={"lemma_acc": 1.0}, default_score_weights={"lemma_acc": 1.0},
) )
def make_lemmatizer( def make_lemmatizer(
nlp: Language, model: Optional[Model], name: str, mode: str, overwrite: bool nlp: Language,
model: Optional[Model],
name: str,
mode: str,
overwrite: bool,
scorer: Optional[Callable],
): ):
return EnglishLemmatizer(nlp.vocab, model, name, mode=mode, overwrite=overwrite) return EnglishLemmatizer(
nlp.vocab, model, name, mode=mode, overwrite=overwrite, scorer=scorer
)
__all__ = ["English"] __all__ = ["English"]


@ -10,7 +10,7 @@ class EnglishLemmatizer(Lemmatizer):
Check whether we're dealing with an uninflected paradigm, so we can Check whether we're dealing with an uninflected paradigm, so we can
avoid lemmatization entirely. avoid lemmatization entirely.
univ_pos (unicode / int): The token's universal part-of-speech tag. univ_pos (str / int): The token's universal part-of-speech tag.
morphology (dict): The token's morphological features following the morphology (dict): The token's morphological features following the
Universal Dependencies scheme. Universal Dependencies scheme.
""" """


@ -1,4 +1,4 @@
from typing import Optional from typing import Optional, Callable
from thinc.api import Model from thinc.api import Model
from .tokenizer_exceptions import TOKENIZER_EXCEPTIONS from .tokenizer_exceptions import TOKENIZER_EXCEPTIONS
from .stop_words import STOP_WORDS from .stop_words import STOP_WORDS
@ -26,13 +26,25 @@ class Spanish(Language):
@Spanish.factory( @Spanish.factory(
"lemmatizer", "lemmatizer",
assigns=["token.lemma"], assigns=["token.lemma"],
default_config={"model": None, "mode": "rule", "overwrite": False}, default_config={
"model": None,
"mode": "rule",
"overwrite": False,
"scorer": {"@scorers": "spacy.lemmatizer_scorer.v1"},
},
default_score_weights={"lemma_acc": 1.0}, default_score_weights={"lemma_acc": 1.0},
) )
def make_lemmatizer( def make_lemmatizer(
nlp: Language, model: Optional[Model], name: str, mode: str, overwrite: bool nlp: Language,
model: Optional[Model],
name: str,
mode: str,
overwrite: bool,
scorer: Optional[Callable],
): ):
return SpanishLemmatizer(nlp.vocab, model, name, mode=mode, overwrite=overwrite) return SpanishLemmatizer(
nlp.vocab, model, name, mode=mode, overwrite=overwrite, scorer=scorer
)
__all__ = ["Spanish"] __all__ = ["Spanish"]


@ -1,58 +1,76 @@
from typing import Union, Iterator, Tuple from typing import Union, Iterator, Tuple
from ...symbols import NOUN, PROPN, PRON, VERB, AUX from ...symbols import NOUN, PROPN, PRON
from ...errors import Errors from ...errors import Errors
from ...tokens import Doc, Span, Token from ...tokens import Doc, Span
def noun_chunks(doclike: Union[Doc, Span]) -> Iterator[Tuple[int, int, int]]: def noun_chunks(doclike: Union[Doc, Span]) -> Iterator[Tuple[int, int, int]]:
"""Detect base noun phrases from a dependency parse. Works on Doc and Span.""" """
doc = doclike.doc Detect base noun phrases from a dependency parse. Works on both Doc and Span.
"""
labels = [
"nsubj",
"nsubj:pass",
"obj",
"obl",
"nmod",
"pcomp",
"appos",
"ROOT",
]
post_modifiers = ["flat", "fixed", "compound"]
doc = doclike.doc # Ensure works on both Doc and Span.
if not doc.has_annotation("DEP"): if not doc.has_annotation("DEP"):
raise ValueError(Errors.E029) raise ValueError(Errors.E029)
if not len(doc): np_deps = {doc.vocab.strings.add(label) for label in labels}
return np_modifs = {doc.vocab.strings.add(modifier) for modifier in post_modifiers}
np_label = doc.vocab.strings.add("NP") np_label = doc.vocab.strings.add("NP")
left_labels = ["det", "fixed", "neg"] # ['nunmod', 'det', 'appos', 'fixed'] adj_label = doc.vocab.strings.add("amod")
right_labels = ["flat", "fixed", "compound", "neg"] adp_label = doc.vocab.strings.add("ADP")
stop_labels = ["punct"] conj = doc.vocab.strings.add("conj")
np_left_deps = [doc.vocab.strings.add(label) for label in left_labels] conj_pos = doc.vocab.strings.add("CCONJ")
np_right_deps = [doc.vocab.strings.add(label) for label in right_labels] prev_end = -1
stop_deps = [doc.vocab.strings.add(label) for label in stop_labels] for i, word in enumerate(doclike):
if word.pos not in (NOUN, PROPN, PRON):
continue
# Prevent nested chunks from being produced
if word.left_edge.i <= prev_end:
continue
if word.dep in np_deps:
right_childs = list(word.rights)
right_child = right_childs[0] if right_childs else None
prev_right = -1 if right_child:
for token in doclike: if right_child.dep == adj_label:
if token.pos in [PROPN, NOUN, PRON]: right_end = right_child.right_edge
left, right = noun_bounds( elif right_child.dep in np_modifs: # Check if we can expand to right
doc, token, np_left_deps, np_right_deps, stop_deps right_end = word.right_edge
) else:
if left.i <= prev_right: right_end = word
continue
yield left.i, right.i + 1, np_label
prev_right = right.i
def is_verb_token(token: Token) -> bool:
return token.pos in [VERB, AUX]
def noun_bounds(doc, root, np_left_deps, np_right_deps, stop_deps):
left_bound = root
for token in reversed(list(root.lefts)):
if token.dep in np_left_deps:
left_bound = token
right_bound = root
for token in root.rights:
if token.dep in np_right_deps:
left, right = noun_bounds(
doc, token, np_left_deps, np_right_deps, stop_deps
)
filter_func = lambda t: is_verb_token(t) or t.dep in stop_deps
if list(filter(filter_func, doc[left_bound.i : right.i])):
break
else: else:
right_bound = right right_end = word
return left_bound, right_bound prev_end = right_end.i
left_index = word.left_edge.i
left_index = (
left_index + 1 if word.left_edge.pos == adp_label else left_index
) # Eliminate left attached de, del
yield left_index, right_end.i + 1, np_label
elif word.dep == conj:
head = word.head
while head.dep == conj and head.head.i < head.i:
head = head.head
# If the head is an NP, and we're coordinated to it, we're an NP
if head.dep in np_deps:
prev_end = word.i
left_index = word.left_edge.i # eliminate left attached conjunction
left_index = (
left_index + 1 if word.left_edge.pos == conj_pos else left_index
)
yield left_index, word.i + 1, np_label
SYNTAX_ITERATORS = {"noun_chunks": noun_chunks} SYNTAX_ITERATORS = {"noun_chunks": noun_chunks}
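Usage is unchanged: the iterator is picked up through `Doc.noun_chunks` once a parsed Spanish pipeline is available. A sketch, assuming the separately installed `es_core_news_sm` model (not part of this diff):

import spacy

nlp = spacy.load("es_core_news_sm")
doc = nlp("La camisa roja de mi hermano está sobre la mesa.")
print([chunk.text for chunk in doc.noun_chunks])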


@ -1,4 +1,4 @@
from typing import Optional from typing import Optional, Callable
from thinc.api import Model from thinc.api import Model
from .stop_words import STOP_WORDS from .stop_words import STOP_WORDS
from .lex_attrs import LEX_ATTRS from .lex_attrs import LEX_ATTRS
@ -26,13 +26,25 @@ class Persian(Language):
@Persian.factory( @Persian.factory(
"lemmatizer", "lemmatizer",
assigns=["token.lemma"], assigns=["token.lemma"],
default_config={"model": None, "mode": "rule", "overwrite": False}, default_config={
"model": None,
"mode": "rule",
"overwrite": False,
"scorer": {"@scorers": "spacy.lemmatizer_scorer.v1"},
},
default_score_weights={"lemma_acc": 1.0}, default_score_weights={"lemma_acc": 1.0},
) )
def make_lemmatizer( def make_lemmatizer(
nlp: Language, model: Optional[Model], name: str, mode: str, overwrite: bool nlp: Language,
model: Optional[Model],
name: str,
mode: str,
overwrite: bool,
scorer: Optional[Callable],
): ):
return Lemmatizer(nlp.vocab, model, name, mode=mode, overwrite=overwrite) return Lemmatizer(
nlp.vocab, model, name, mode=mode, overwrite=overwrite, scorer=scorer
)
__all__ = ["Persian"] __all__ = ["Persian"]


@ -1,4 +1,4 @@
from typing import Optional from typing import Optional, Callable
from thinc.api import Model from thinc.api import Model
@ -31,13 +31,25 @@ class French(Language):
@French.factory( @French.factory(
"lemmatizer", "lemmatizer",
assigns=["token.lemma"], assigns=["token.lemma"],
default_config={"model": None, "mode": "rule", "overwrite": False}, default_config={
"model": None,
"mode": "rule",
"overwrite": False,
"scorer": {"@scorers": "spacy.lemmatizer_scorer.v1"},
},
default_score_weights={"lemma_acc": 1.0}, default_score_weights={"lemma_acc": 1.0},
) )
def make_lemmatizer( def make_lemmatizer(
nlp: Language, model: Optional[Model], name: str, mode: str, overwrite: bool nlp: Language,
model: Optional[Model],
name: str,
mode: str,
overwrite: bool,
scorer: Optional[Callable],
): ):
return FrenchLemmatizer(nlp.vocab, model, name, mode=mode, overwrite=overwrite) return FrenchLemmatizer(
nlp.vocab, model, name, mode=mode, overwrite=overwrite, scorer=scorer
)
__all__ = ["French"] __all__ = ["French"]


@ -1,6 +1,11 @@
from typing import Optional
from thinc.api import Model
from .tokenizer_exceptions import TOKENIZER_EXCEPTIONS from .tokenizer_exceptions import TOKENIZER_EXCEPTIONS
from .stop_words import STOP_WORDS from .stop_words import STOP_WORDS
from ...language import Language, BaseDefaults from ...language import Language, BaseDefaults
from .lemmatizer import IrishLemmatizer
class IrishDefaults(BaseDefaults): class IrishDefaults(BaseDefaults):
@ -13,4 +18,16 @@ class Irish(Language):
Defaults = IrishDefaults Defaults = IrishDefaults
@Irish.factory(
"lemmatizer",
assigns=["token.lemma"],
default_config={"model": None, "mode": "pos_lookup", "overwrite": False},
default_score_weights={"lemma_acc": 1.0},
)
def make_lemmatizer(
nlp: Language, model: Optional[Model], name: str, mode: str, overwrite: bool
):
return IrishLemmatizer(nlp.vocab, model, name, mode=mode, overwrite=overwrite)
__all__ = ["Irish"] __all__ = ["Irish"]


@ -1,35 +0,0 @@
# fmt: off
consonants = ["b", "c", "d", "f", "g", "h", "j", "k", "l", "m", "n", "p", "q", "r", "s", "t", "v", "w", "x", "z"]
broad_vowels = ["a", "á", "o", "ó", "u", "ú"]
slender_vowels = ["e", "é", "i", "í"]
vowels = broad_vowels + slender_vowels
# fmt: on
def ends_dentals(word):
if word != "" and word[-1] in ["d", "n", "t", "s"]:
return True
else:
return False
def devoice(word):
if len(word) > 2 and word[-2] == "s" and word[-1] == "d":
return word[:-1] + "t"
else:
return word
def ends_with_vowel(word):
return word != "" and word[-1] in vowels
def starts_with_vowel(word):
return word != "" and word[0] in vowels
def deduplicate(word):
if len(word) > 2 and word[-2] == word[-1] and word[-1] in consonants:
return word[:-1]
else:
return word

spacy/lang/ga/lemmatizer.py Normal file

@ -0,0 +1,162 @@
from typing import List, Dict, Tuple
from ...pipeline import Lemmatizer
from ...tokens import Token
class IrishLemmatizer(Lemmatizer):
# This is a lookup-based lemmatiser using data extracted from
# BuNaMo (https://github.com/michmech/BuNaMo)
@classmethod
def get_lookups_config(cls, mode: str) -> Tuple[List[str], List[str]]:
if mode == "pos_lookup":
# fmt: off
required = [
"lemma_lookup_adj", "lemma_lookup_adp",
"lemma_lookup_noun", "lemma_lookup_verb"
]
# fmt: on
return (required, [])
else:
return super().get_lookups_config(mode)
def pos_lookup_lemmatize(self, token: Token) -> List[str]:
univ_pos = token.pos_
string = unponc(token.text)
if univ_pos not in ["PROPN", "ADP", "ADJ", "NOUN", "VERB"]:
return [string.lower()]
demutated = demutate(string)
secondary = ""
if string[0:1].lower() == "h" and string[1:2].lower() in "aáeéiíoóuú":
secondary = string[1:]
lookup_pos = univ_pos.lower()
if univ_pos == "PROPN":
lookup_pos = "noun"
if token.has_morph():
# TODO: lookup is actually required for the genitive forms, but
# this is not in BuNaMo, and would not be of use with IDT.
if univ_pos == "NOUN" and (
"VerbForm=Vnoun" in token.morph or "VerbForm=Inf" in token.morph
):
hpref = "Form=HPref" in token.morph
return [demutate(string, hpref).lower()]
elif univ_pos == "ADJ" and "VerbForm=Part" in token.morph:
return [demutate(string).lower()]
lookup_table = self.lookups.get_table("lemma_lookup_" + lookup_pos, {})
def to_list(value):
if value is None:
value = []
elif not isinstance(value, list):
value = [value]
return value
if univ_pos == "ADP":
return to_list(lookup_table.get(string, string.lower()))
ret = []
if univ_pos == "PROPN":
ret.extend(to_list(lookup_table.get(demutated)))
ret.extend(to_list(lookup_table.get(secondary)))
else:
ret.extend(to_list(lookup_table.get(demutated.lower())))
ret.extend(to_list(lookup_table.get(secondary.lower())))
if len(ret) == 0:
ret = [string.lower()]
return ret
def demutate(word: str, is_hpref: bool = False) -> str:
UVOWELS = "AÁEÉIÍOÓUÚ"
LVOWELS = "aáeéiíoóuú"
lc = word.lower()
# remove eclipsis
if lc.startswith("bhf"):
word = word[2:]
elif lc.startswith("mb"):
word = word[1:]
elif lc.startswith("gc"):
word = word[1:]
elif lc.startswith("nd"):
word = word[1:]
elif lc.startswith("ng"):
word = word[1:]
elif lc.startswith("bp"):
word = word[1:]
elif lc.startswith("dt"):
word = word[1:]
elif word[0:1] == "n" and word[1:2] in UVOWELS:
word = word[1:]
elif lc.startswith("n-") and word[2:3] in LVOWELS:
word = word[2:]
# non-standard eclipsis
elif lc.startswith("bh-f"):
word = word[3:]
elif lc.startswith("m-b"):
word = word[2:]
elif lc.startswith("g-c"):
word = word[2:]
elif lc.startswith("n-d"):
word = word[2:]
elif lc.startswith("n-g"):
word = word[2:]
elif lc.startswith("b-p"):
word = word[2:]
elif lc.startswith("d-t"):
word = word[2:]
# t-prothesis
elif lc.startswith("ts"):
word = word[1:]
elif lc.startswith("t-s"):
word = word[2:]
# h-prothesis, if known to be present
elif is_hpref and word[0:1] == "h":
word = word[1:]
# h-prothesis, simple case
# words can also begin with 'h', but unlike eclipsis,
# a hyphen is not used, so that needs to be handled
# elsewhere
elif word[0:1] == "h" and word[1:2] in UVOWELS:
word = word[1:]
# lenition
# this breaks the previous if, to handle super-non-standard
# text where both eclipsis and lenition were used.
if lc[0:1] in "bcdfgmpst" and lc[1:2] == "h":
word = word[0:1] + word[2:]
return word
def unponc(word: str) -> str:
# fmt: off
PONC = {
"": "bh",
"ċ": "ch",
"": "dh",
"": "fh",
"ġ": "gh",
"": "mh",
"": "ph",
"": "sh",
"": "th",
"": "BH",
"Ċ": "CH",
"": "DH",
"": "FH",
"Ġ": "GH",
"": "MH",
"": "PH",
"": "SH",
"": "TH"
}
# fmt: on
buf = []
for ch in word:
if ch in PONC:
buf.append(PONC[ch])
else:
buf.append(ch)
return "".join(buf)


@ -9,6 +9,8 @@ _exc = {
"ded'": [{ORTH: "de", NORM: "de"}, {ORTH: "d'", NORM: "do"}], "ded'": [{ORTH: "de", NORM: "de"}, {ORTH: "d'", NORM: "do"}],
"lem'": [{ORTH: "le", NORM: "le"}, {ORTH: "m'", NORM: "mo"}], "lem'": [{ORTH: "le", NORM: "le"}, {ORTH: "m'", NORM: "mo"}],
"led'": [{ORTH: "le", NORM: "le"}, {ORTH: "d'", NORM: "do"}], "led'": [{ORTH: "le", NORM: "le"}, {ORTH: "d'", NORM: "do"}],
"théis": [{ORTH: "th", NORM: "tar"}, {ORTH: "éis", NORM: "éis"}],
"tréis": [{ORTH: "tr", NORM: "tar"}, {ORTH: "éis", NORM: "éis"}],
} }
for exc_data in [ for exc_data in [


@ -646,5 +646,10 @@ _nums = r"(({ne})|({t})|({on})|({c}))({s})?".format(
) )
for u in "cfkCFK":
_exc[f"°{u}"] = [{ORTH: f"°{u}"}]
_exc[f"°{u}."] = [{ORTH: f"°{u}"}, {ORTH: "."}]
TOKENIZER_EXCEPTIONS = update_exc(BASE_EXCEPTIONS, _exc) TOKENIZER_EXCEPTIONS = update_exc(BASE_EXCEPTIONS, _exc)
TOKEN_MATCH = re.compile(r"^{n}$".format(n=_nums)).match TOKEN_MATCH = re.compile(r"^{n}$".format(n=_nums)).match


@ -1,4 +1,4 @@
from typing import Optional from typing import Optional, Callable
from thinc.api import Model from thinc.api import Model
from .stop_words import STOP_WORDS from .stop_words import STOP_WORDS
@ -23,13 +23,25 @@ class Italian(Language):
@Italian.factory( @Italian.factory(
"lemmatizer", "lemmatizer",
assigns=["token.lemma"], assigns=["token.lemma"],
default_config={"model": None, "mode": "pos_lookup", "overwrite": False}, default_config={
"model": None,
"mode": "pos_lookup",
"overwrite": False,
"scorer": {"@scorers": "spacy.lemmatizer_scorer.v1"},
},
default_score_weights={"lemma_acc": 1.0}, default_score_weights={"lemma_acc": 1.0},
) )
def make_lemmatizer( def make_lemmatizer(
nlp: Language, model: Optional[Model], name: str, mode: str, overwrite: bool nlp: Language,
model: Optional[Model],
name: str,
mode: str,
overwrite: bool,
scorer: Optional[Callable],
): ):
return ItalianLemmatizer(nlp.vocab, model, name, mode=mode, overwrite=overwrite) return ItalianLemmatizer(
nlp.vocab, model, name, mode=mode, overwrite=overwrite, scorer=scorer
)
__all__ = ["Italian"] __all__ = ["Italian"]


@ -1,21 +1,25 @@
from typing import Optional, Union, Dict, Any from typing import Optional, Union, Dict, Any, Callable
from pathlib import Path from pathlib import Path
import srsly import srsly
from collections import namedtuple from collections import namedtuple
from thinc.api import Model
import re
from .stop_words import STOP_WORDS from .stop_words import STOP_WORDS
from .syntax_iterators import SYNTAX_ITERATORS from .syntax_iterators import SYNTAX_ITERATORS
from .tag_map import TAG_MAP from .tag_map import TAG_MAP
from .tag_orth_map import TAG_ORTH_MAP from .tag_orth_map import TAG_ORTH_MAP
from .tag_bigram_map import TAG_BIGRAM_MAP from .tag_bigram_map import TAG_BIGRAM_MAP
from ...compat import copy_reg
from ...errors import Errors from ...errors import Errors
from ...language import Language, BaseDefaults from ...language import Language, BaseDefaults
from ...pipeline import Morphologizer
from ...pipeline.morphologizer import DEFAULT_MORPH_MODEL
from ...scorer import Scorer from ...scorer import Scorer
from ...symbols import POS from ...symbols import POS
from ...tokens import Doc from ...tokens import Doc, MorphAnalysis
from ...training import validate_examples from ...training import validate_examples
from ...util import DummyTokenizer, registry, load_config_from_str from ...util import DummyTokenizer, registry, load_config_from_str
from ...vocab import Vocab
from ... import util from ... import util
@ -31,16 +35,21 @@ split_mode = null
@registry.tokenizers("spacy.ja.JapaneseTokenizer") @registry.tokenizers("spacy.ja.JapaneseTokenizer")
def create_tokenizer(split_mode: Optional[str] = None): def create_tokenizer(split_mode: Optional[str] = None):
def japanese_tokenizer_factory(nlp): def japanese_tokenizer_factory(nlp):
return JapaneseTokenizer(nlp, split_mode=split_mode) return JapaneseTokenizer(nlp.vocab, split_mode=split_mode)
return japanese_tokenizer_factory return japanese_tokenizer_factory
class JapaneseTokenizer(DummyTokenizer): class JapaneseTokenizer(DummyTokenizer):
def __init__(self, nlp: Language, split_mode: Optional[str] = None) -> None: def __init__(self, vocab: Vocab, split_mode: Optional[str] = None) -> None:
self.vocab = nlp.vocab self.vocab = vocab
self.split_mode = split_mode self.split_mode = split_mode
self.tokenizer = try_sudachi_import(self.split_mode) self.tokenizer = try_sudachi_import(self.split_mode)
# if we're using split mode A we don't need subtokens
self.need_subtokens = not (split_mode is None or split_mode == "A")
def __reduce__(self):
return JapaneseTokenizer, (self.vocab, self.split_mode)
def __call__(self, text: str) -> Doc: def __call__(self, text: str) -> Doc:
# convert sudachipy.morpheme.Morpheme to DetailedToken and merge continuous spaces # convert sudachipy.morpheme.Morpheme to DetailedToken and merge continuous spaces
@ -49,8 +58,8 @@ class JapaneseTokenizer(DummyTokenizer):
dtokens, spaces = get_dtokens_and_spaces(dtokens, text) dtokens, spaces = get_dtokens_and_spaces(dtokens, text)
# create Doc with tag bi-gram based part-of-speech identification rules # create Doc with tag bi-gram based part-of-speech identification rules
words, tags, inflections, lemmas, readings, sub_tokens_list = ( words, tags, inflections, lemmas, norms, readings, sub_tokens_list = (
zip(*dtokens) if dtokens else [[]] * 6 zip(*dtokens) if dtokens else [[]] * 7
) )
sub_tokens_list = list(sub_tokens_list) sub_tokens_list = list(sub_tokens_list)
doc = Doc(self.vocab, words=words, spaces=spaces) doc = Doc(self.vocab, words=words, spaces=spaces)
@ -68,9 +77,18 @@ class JapaneseTokenizer(DummyTokenizer):
) )
# if there's no lemma info (it's an unk) just use the surface # if there's no lemma info (it's an unk) just use the surface
token.lemma_ = dtoken.lemma if dtoken.lemma else dtoken.surface token.lemma_ = dtoken.lemma if dtoken.lemma else dtoken.surface
doc.user_data["inflections"] = inflections morph = {}
doc.user_data["reading_forms"] = readings if dtoken.inf:
doc.user_data["sub_tokens"] = sub_tokens_list # it's normal for this to be empty for non-inflecting types
morph["Inflection"] = dtoken.inf
token.norm_ = dtoken.norm
if dtoken.reading:
# punctuation is its own reading, but we don't want values like
# "=" here
morph["Reading"] = re.sub("[=|]", "_", dtoken.reading)
token.morph = MorphAnalysis(self.vocab, morph)
if self.need_subtokens:
doc.user_data["sub_tokens"] = sub_tokens_list
return doc return doc
def _get_dtokens(self, sudachipy_tokens, need_sub_tokens: bool = True): def _get_dtokens(self, sudachipy_tokens, need_sub_tokens: bool = True):
@ -81,9 +99,10 @@ class JapaneseTokenizer(DummyTokenizer):
DetailedToken( DetailedToken(
token.surface(), # orth token.surface(), # orth
"-".join([xx for xx in token.part_of_speech()[:4] if xx != "*"]), # tag "-".join([xx for xx in token.part_of_speech()[:4] if xx != "*"]), # tag
",".join([xx for xx in token.part_of_speech()[4:] if xx != "*"]), # inf ";".join([xx for xx in token.part_of_speech()[4:] if xx != "*"]), # inf
token.dictionary_form(), # lemma token.dictionary_form(), # lemma
token.reading_form(), # user_data['reading_forms'] token.normalized_form(),
token.reading_form(),
sub_tokens_list[idx] sub_tokens_list[idx]
if sub_tokens_list if sub_tokens_list
else None, # user_data['sub_tokens'] else None, # user_data['sub_tokens']
@ -105,9 +124,8 @@ class JapaneseTokenizer(DummyTokenizer):
] ]
def _get_sub_tokens(self, sudachipy_tokens): def _get_sub_tokens(self, sudachipy_tokens):
if ( # do nothing for default split mode
self.split_mode is None or self.split_mode == "A" if not self.need_subtokens:
): # do nothing for default split mode
return None return None
sub_tokens_list = [] # list of (list of list of DetailedToken | None) sub_tokens_list = [] # list of (list of list of DetailedToken | None)
@ -176,9 +194,37 @@ class Japanese(Language):
Defaults = JapaneseDefaults Defaults = JapaneseDefaults
@Japanese.factory(
"morphologizer",
assigns=["token.morph", "token.pos"],
default_config={
"model": DEFAULT_MORPH_MODEL,
"overwrite": True,
"extend": True,
"scorer": {"@scorers": "spacy.morphologizer_scorer.v1"},
},
default_score_weights={
"pos_acc": 0.5,
"morph_micro_f": 0.5,
"morph_per_feat": None,
},
)
def make_morphologizer(
nlp: Language,
model: Model,
name: str,
overwrite: bool,
extend: bool,
scorer: Optional[Callable],
):
return Morphologizer(
nlp.vocab, model, name, overwrite=overwrite, extend=extend, scorer=scorer
)
# Hold the attributes we need with convenient names # Hold the attributes we need with convenient names
DetailedToken = namedtuple( DetailedToken = namedtuple(
"DetailedToken", ["surface", "tag", "inf", "lemma", "reading", "sub_tokens"] "DetailedToken", ["surface", "tag", "inf", "lemma", "norm", "reading", "sub_tokens"]
) )
@ -254,7 +300,7 @@ def get_dtokens_and_spaces(dtokens, text, gap_tag="空白"):
return text_dtokens, text_spaces return text_dtokens, text_spaces
elif len([word for word in words if not word.isspace()]) == 0: elif len([word for word in words if not word.isspace()]) == 0:
assert text.isspace() assert text.isspace()
text_dtokens = [DetailedToken(text, gap_tag, "", text, None, None)] text_dtokens = [DetailedToken(text, gap_tag, "", text, text, None, None)]
text_spaces = [False] text_spaces = [False]
return text_dtokens, text_spaces return text_dtokens, text_spaces
@ -271,7 +317,7 @@ def get_dtokens_and_spaces(dtokens, text, gap_tag="空白"):
# space token # space token
if word_start > 0: if word_start > 0:
w = text[text_pos : text_pos + word_start] w = text[text_pos : text_pos + word_start]
text_dtokens.append(DetailedToken(w, gap_tag, "", w, None, None)) text_dtokens.append(DetailedToken(w, gap_tag, "", w, w, None, None))
text_spaces.append(False) text_spaces.append(False)
text_pos += word_start text_pos += word_start
@ -287,16 +333,10 @@ def get_dtokens_and_spaces(dtokens, text, gap_tag="空白"):
# trailing space token # trailing space token
if text_pos < len(text): if text_pos < len(text):
w = text[text_pos:] w = text[text_pos:]
text_dtokens.append(DetailedToken(w, gap_tag, "", w, None, None)) text_dtokens.append(DetailedToken(w, gap_tag, "", w, w, None, None))
text_spaces.append(False) text_spaces.append(False)
return text_dtokens, text_spaces return text_dtokens, text_spaces
def pickle_japanese(instance):
return Japanese, tuple()
copy_reg.pickle(Japanese, pickle_japanese)
__all__ = ["Japanese"] __all__ = ["Japanese"]

View File

@ -5,11 +5,11 @@ from .tag_map import TAG_MAP
from .lex_attrs import LEX_ATTRS from .lex_attrs import LEX_ATTRS
from ...language import Language, BaseDefaults from ...language import Language, BaseDefaults
from ...tokens import Doc from ...tokens import Doc
from ...compat import copy_reg
from ...scorer import Scorer from ...scorer import Scorer
from ...symbols import POS from ...symbols import POS
from ...training import validate_examples from ...training import validate_examples
from ...util import DummyTokenizer, registry, load_config_from_str from ...util import DummyTokenizer, registry, load_config_from_str
from ...vocab import Vocab
DEFAULT_CONFIG = """ DEFAULT_CONFIG = """
@ -23,17 +23,20 @@ DEFAULT_CONFIG = """
@registry.tokenizers("spacy.ko.KoreanTokenizer") @registry.tokenizers("spacy.ko.KoreanTokenizer")
def create_tokenizer(): def create_tokenizer():
def korean_tokenizer_factory(nlp): def korean_tokenizer_factory(nlp):
return KoreanTokenizer(nlp) return KoreanTokenizer(nlp.vocab)
return korean_tokenizer_factory return korean_tokenizer_factory
class KoreanTokenizer(DummyTokenizer): class KoreanTokenizer(DummyTokenizer):
def __init__(self, nlp: Language): def __init__(self, vocab: Vocab):
self.vocab = nlp.vocab self.vocab = vocab
MeCab = try_mecab_import() # type: ignore[func-returns-value] MeCab = try_mecab_import() # type: ignore[func-returns-value]
self.mecab_tokenizer = MeCab("-F%f[0],%f[7]") self.mecab_tokenizer = MeCab("-F%f[0],%f[7]")
def __reduce__(self):
return KoreanTokenizer, (self.vocab,)
def __del__(self): def __del__(self):
self.mecab_tokenizer.__del__() self.mecab_tokenizer.__del__()
@ -106,10 +109,4 @@ def check_spaces(text, tokens):
yield False yield False
def pickle_korean(instance):
return Korean, tuple()
copy_reg.pickle(Korean, pickle_korean)
__all__ = ["Korean"] __all__ = ["Korean"]

View File

@ -3,6 +3,7 @@ import unicodedata
import re import re
from .. import attrs from .. import attrs
from .tokenizer_exceptions import URL_MATCH
_like_email = re.compile(r"([a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+)").match _like_email = re.compile(r"([a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+)").match
@ -109,6 +110,8 @@ def like_url(text: str) -> bool:
return True return True
if tld.isalpha() and tld in _tlds: if tld.isalpha() and tld in _tlds:
return True return True
if URL_MATCH(text):
return True
return False return False
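A small sketch of the effect of the URL_MATCH fallback; which extra strings gain coverage depends on the tokenizer's URL regex, so the middle example is only a plausible illustration:

    from spacy.lang.lex_attrs import like_url

    print(like_url("spacy.io"))                     # True via the TLD list
    print(like_url("ftp://files.example.org/a/b"))  # expected True via the URL_MATCH fallback
    print(like_url("not a url"))                    # False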

View File

@ -1,4 +1,4 @@
from typing import Optional from typing import Optional, Callable
from thinc.api import Model from thinc.api import Model
from .lemmatizer import MacedonianLemmatizer from .lemmatizer import MacedonianLemmatizer
from .stop_words import STOP_WORDS from .stop_words import STOP_WORDS
@ -38,13 +38,25 @@ class Macedonian(Language):
@Macedonian.factory( @Macedonian.factory(
"lemmatizer", "lemmatizer",
assigns=["token.lemma"], assigns=["token.lemma"],
default_config={"model": None, "mode": "rule", "overwrite": False}, default_config={
"model": None,
"mode": "rule",
"overwrite": False,
"scorer": {"@scorers": "spacy.lemmatizer_scorer.v1"},
},
default_score_weights={"lemma_acc": 1.0}, default_score_weights={"lemma_acc": 1.0},
) )
def make_lemmatizer( def make_lemmatizer(
nlp: Language, model: Optional[Model], name: str, mode: str, overwrite: bool nlp: Language,
model: Optional[Model],
name: str,
mode: str,
overwrite: bool,
scorer: Optional[Callable],
): ):
return MacedonianLemmatizer(nlp.vocab, model, name, mode=mode, overwrite=overwrite) return MacedonianLemmatizer(
nlp.vocab, model, name, mode=mode, overwrite=overwrite, scorer=scorer
)
__all__ = ["Macedonian"] __all__ = ["Macedonian"]

View File

@ -1,4 +1,4 @@
from typing import Optional from typing import Optional, Callable
from thinc.api import Model from thinc.api import Model
from .tokenizer_exceptions import TOKENIZER_EXCEPTIONS from .tokenizer_exceptions import TOKENIZER_EXCEPTIONS
from .punctuation import TOKENIZER_PREFIXES, TOKENIZER_INFIXES from .punctuation import TOKENIZER_PREFIXES, TOKENIZER_INFIXES
@ -26,13 +26,25 @@ class Norwegian(Language):
@Norwegian.factory( @Norwegian.factory(
"lemmatizer", "lemmatizer",
assigns=["token.lemma"], assigns=["token.lemma"],
default_config={"model": None, "mode": "rule", "overwrite": False}, default_config={
"model": None,
"mode": "rule",
"overwrite": False,
"scorer": {"@scorers": "spacy.lemmatizer_scorer.v1"},
},
default_score_weights={"lemma_acc": 1.0}, default_score_weights={"lemma_acc": 1.0},
) )
def make_lemmatizer( def make_lemmatizer(
nlp: Language, model: Optional[Model], name: str, mode: str, overwrite: bool nlp: Language,
model: Optional[Model],
name: str,
mode: str,
overwrite: bool,
scorer: Optional[Callable],
): ):
return Lemmatizer(nlp.vocab, model, name, mode=mode, overwrite=overwrite) return Lemmatizer(
nlp.vocab, model, name, mode=mode, overwrite=overwrite, scorer=scorer
)
__all__ = ["Norwegian"] __all__ = ["Norwegian"]

View File

@ -1,4 +1,4 @@
from typing import Optional from typing import Optional, Callable
from thinc.api import Model from thinc.api import Model
@ -30,13 +30,25 @@ class Dutch(Language):
@Dutch.factory( @Dutch.factory(
"lemmatizer", "lemmatizer",
assigns=["token.lemma"], assigns=["token.lemma"],
default_config={"model": None, "mode": "rule", "overwrite": False}, default_config={
"model": None,
"mode": "rule",
"overwrite": False,
"scorer": {"@scorers": "spacy.lemmatizer_scorer.v1"},
},
default_score_weights={"lemma_acc": 1.0}, default_score_weights={"lemma_acc": 1.0},
) )
def make_lemmatizer( def make_lemmatizer(
nlp: Language, model: Optional[Model], name: str, mode: str, overwrite: bool nlp: Language,
model: Optional[Model],
name: str,
mode: str,
overwrite: bool,
scorer: Optional[Callable],
): ):
return DutchLemmatizer(nlp.vocab, model, name, mode=mode, overwrite=overwrite) return DutchLemmatizer(
nlp.vocab, model, name, mode=mode, overwrite=overwrite, scorer=scorer
)
__all__ = ["Dutch"] __all__ = ["Dutch"]

View File

@ -1,4 +1,4 @@
from typing import Optional from typing import Optional, Callable
from thinc.api import Model from thinc.api import Model
@ -33,13 +33,25 @@ class Polish(Language):
@Polish.factory( @Polish.factory(
"lemmatizer", "lemmatizer",
assigns=["token.lemma"], assigns=["token.lemma"],
default_config={"model": None, "mode": "pos_lookup", "overwrite": False}, default_config={
"model": None,
"mode": "pos_lookup",
"overwrite": False,
"scorer": {"@scorers": "spacy.lemmatizer_scorer.v1"},
},
default_score_weights={"lemma_acc": 1.0}, default_score_weights={"lemma_acc": 1.0},
) )
def make_lemmatizer( def make_lemmatizer(
nlp: Language, model: Optional[Model], name: str, mode: str, overwrite: bool nlp: Language,
model: Optional[Model],
name: str,
mode: str,
overwrite: bool,
scorer: Optional[Callable],
): ):
return PolishLemmatizer(nlp.vocab, model, name, mode=mode, overwrite=overwrite) return PolishLemmatizer(
nlp.vocab, model, name, mode=mode, overwrite=overwrite, scorer=scorer
)
__all__ = ["Polish"] __all__ = ["Polish"]

View File

@ -1,6 +1,7 @@
from .tokenizer_exceptions import TOKENIZER_EXCEPTIONS from .tokenizer_exceptions import TOKENIZER_EXCEPTIONS
from .stop_words import STOP_WORDS from .stop_words import STOP_WORDS
from .lex_attrs import LEX_ATTRS from .lex_attrs import LEX_ATTRS
from .syntax_iterators import SYNTAX_ITERATORS
from .punctuation import TOKENIZER_INFIXES, TOKENIZER_PREFIXES from .punctuation import TOKENIZER_INFIXES, TOKENIZER_PREFIXES
from ...language import Language, BaseDefaults from ...language import Language, BaseDefaults
@ -10,6 +11,7 @@ class PortugueseDefaults(BaseDefaults):
infixes = TOKENIZER_INFIXES infixes = TOKENIZER_INFIXES
prefixes = TOKENIZER_PREFIXES prefixes = TOKENIZER_PREFIXES
lex_attr_getters = LEX_ATTRS lex_attr_getters = LEX_ATTRS
syntax_iterators = SYNTAX_ITERATORS
stop_words = STOP_WORDS stop_words = STOP_WORDS

View File

@ -0,0 +1,85 @@
from typing import Union, Iterator, Tuple
from ...symbols import NOUN, PROPN, PRON
from ...errors import Errors
from ...tokens import Doc, Span
def noun_chunks(doclike: Union[Doc, Span]) -> Iterator[Tuple[int, int, int]]:
"""
Detect base noun phrases from a dependency parse. Works on both Doc and Span.
"""
labels = [
"nsubj",
"nsubj:pass",
"obj",
"obl",
"obl:agent",
"nmod",
"pcomp",
"appos",
"ROOT",
]
post_modifiers = ["flat", "flat:name", "fixed", "compound"]
doc = doclike.doc # Ensure works on both Doc and Span.
if not doc.has_annotation("DEP"):
raise ValueError(Errors.E029)
np_deps = {doc.vocab.strings.add(label) for label in labels}
np_modifs = {doc.vocab.strings.add(modifier) for modifier in post_modifiers}
np_label = doc.vocab.strings.add("NP")
adj_label = doc.vocab.strings.add("amod")
det_label = doc.vocab.strings.add("det")
det_pos = doc.vocab.strings.add("DET")
adp_label = doc.vocab.strings.add("ADP")
conj = doc.vocab.strings.add("conj")
conj_pos = doc.vocab.strings.add("CCONJ")
prev_end = -1
for i, word in enumerate(doclike):
if word.pos not in (NOUN, PROPN, PRON):
continue
# Prevent nested chunks from being produced
if word.left_edge.i <= prev_end:
continue
if word.dep in np_deps:
right_childs = list(word.rights)
right_child = right_childs[0] if right_childs else None
if right_child:
if (
right_child.dep == adj_label
): # allow chain of adjectives by expanding to right
right_end = right_child.right_edge
elif (
right_child.dep == det_label and right_child.pos == det_pos
): # cut relative pronouns here
right_end = right_child
elif right_child.dep in np_modifs: # Check if we can expand to right
right_end = word.right_edge
else:
right_end = word
else:
right_end = word
prev_end = right_end.i
left_index = word.left_edge.i
left_index = (
left_index + 1 if word.left_edge.pos == adp_label else left_index
)
yield left_index, right_end.i + 1, np_label
elif word.dep == conj:
head = word.head
while head.dep == conj and head.head.i < head.i:
head = head.head
# If the head is an NP, and we're coordinated to it, we're an NP
if head.dep in np_deps:
prev_end = word.i
left_index = word.left_edge.i # eliminate left attached conjunction
left_index = (
left_index + 1 if word.left_edge.pos == conj_pos else left_index
)
yield left_index, word.i + 1, np_label
SYNTAX_ITERATORS = {"noun_chunks": noun_chunks}

View File

@ -1,4 +1,4 @@
from typing import Optional from typing import Optional, Callable
from thinc.api import Model from thinc.api import Model
from .stop_words import STOP_WORDS from .stop_words import STOP_WORDS
@ -22,7 +22,12 @@ class Russian(Language):
@Russian.factory( @Russian.factory(
"lemmatizer", "lemmatizer",
assigns=["token.lemma"], assigns=["token.lemma"],
default_config={"model": None, "mode": "pymorphy2", "overwrite": False}, default_config={
"model": None,
"mode": "pymorphy2",
"overwrite": False,
"scorer": {"@scorers": "spacy.lemmatizer_scorer.v1"},
},
default_score_weights={"lemma_acc": 1.0}, default_score_weights={"lemma_acc": 1.0},
) )
def make_lemmatizer( def make_lemmatizer(
@ -31,8 +36,11 @@ def make_lemmatizer(
name: str, name: str,
mode: str, mode: str,
overwrite: bool, overwrite: bool,
scorer: Optional[Callable],
): ):
return RussianLemmatizer(nlp.vocab, model, name, mode=mode, overwrite=overwrite) return RussianLemmatizer(
nlp.vocab, model, name, mode=mode, overwrite=overwrite, scorer=scorer
)
__all__ = ["Russian"] __all__ = ["Russian"]

View File

@ -1,8 +1,9 @@
from typing import Optional, List, Dict, Tuple from typing import Optional, List, Dict, Tuple, Callable
from thinc.api import Model from thinc.api import Model
from ...pipeline import Lemmatizer from ...pipeline import Lemmatizer
from ...pipeline.lemmatizer import lemmatizer_score
from ...symbols import POS from ...symbols import POS
from ...tokens import Token from ...tokens import Token
from ...vocab import Vocab from ...vocab import Vocab
@ -20,6 +21,7 @@ class RussianLemmatizer(Lemmatizer):
*, *,
mode: str = "pymorphy2", mode: str = "pymorphy2",
overwrite: bool = False, overwrite: bool = False,
scorer: Optional[Callable] = lemmatizer_score,
) -> None: ) -> None:
if mode == "pymorphy2": if mode == "pymorphy2":
try: try:
@ -31,7 +33,9 @@ class RussianLemmatizer(Lemmatizer):
) from None ) from None
if getattr(self, "_morph", None) is None: if getattr(self, "_morph", None) is None:
self._morph = MorphAnalyzer() self._morph = MorphAnalyzer()
super().__init__(vocab, model, name, mode=mode, overwrite=overwrite) super().__init__(
vocab, model, name, mode=mode, overwrite=overwrite, scorer=scorer
)
def pymorphy2_lemmatize(self, token: Token) -> List[str]: def pymorphy2_lemmatize(self, token: Token) -> List[str]:
string = token.text string = token.text

View File

@ -1,47 +1,195 @@
STOP_WORDS = set( STOP_WORDS = set(
""" """
අතර සහ
එචචර සමග
එපමණ සමඟ
එල අහ
එව ආහ
කට ඕහ
කද අන
අඳ
අප
අප
අය
ආය
ඌය
නම
පමණ
පමණ
චර
පමණ
බඳ
වන
අය
ලද අය
වග
බවට
බව
බව
නම
මහ
මහ
පමණ
පමණ
පමන
වන වන
තර
වත ඇත
වද
සමඟ වශය
යන
සඳහ
මග
ඉත
එම
අතර
සමග
බඳව
බඳ
බව
මහ
තට
වට
අන
නව
බඳ
නට
එහ
එහ
තවත
තව
සහ සහ
දක
බවත
බවද
මත
ඇත
ඇත
වඩ
වඩ
තර
ඉක
යල
ඉත
ටන
පටන
දක
වක
පව
වත
ඇය
මන
වත වත
පත
තව
ඉත
වහ
හන
එම
එමබල
නම
වන
කල
ඉඳ
අන
ඔන
උද
සඳහ
අරබය
එන
එබ
අන
පර
වට
නම
එනම
වස
පර
එහ
""".split() """.split()
) )

View File

@ -1,4 +1,4 @@
from typing import Optional from typing import Optional, Callable
from thinc.api import Model from thinc.api import Model
from .tokenizer_exceptions import TOKENIZER_EXCEPTIONS from .tokenizer_exceptions import TOKENIZER_EXCEPTIONS
from .stop_words import STOP_WORDS from .stop_words import STOP_WORDS
@ -29,13 +29,25 @@ class Swedish(Language):
@Swedish.factory( @Swedish.factory(
"lemmatizer", "lemmatizer",
assigns=["token.lemma"], assigns=["token.lemma"],
default_config={"model": None, "mode": "rule", "overwrite": False}, default_config={
"model": None,
"mode": "rule",
"overwrite": False,
"scorer": {"@scorers": "spacy.lemmatizer_scorer.v1"},
},
default_score_weights={"lemma_acc": 1.0}, default_score_weights={"lemma_acc": 1.0},
) )
def make_lemmatizer( def make_lemmatizer(
nlp: Language, model: Optional[Model], name: str, mode: str, overwrite: bool nlp: Language,
model: Optional[Model],
name: str,
mode: str,
overwrite: bool,
scorer: Optional[Callable],
): ):
return Lemmatizer(nlp.vocab, model, name, mode=mode, overwrite=overwrite) return Lemmatizer(
nlp.vocab, model, name, mode=mode, overwrite=overwrite, scorer=scorer
)
__all__ = ["Swedish"] __all__ = ["Swedish"]

View File

@ -3,6 +3,7 @@ from .lex_attrs import LEX_ATTRS
from ...language import Language, BaseDefaults from ...language import Language, BaseDefaults
from ...tokens import Doc from ...tokens import Doc
from ...util import DummyTokenizer, registry, load_config_from_str from ...util import DummyTokenizer, registry, load_config_from_str
from ...vocab import Vocab
DEFAULT_CONFIG = """ DEFAULT_CONFIG = """
@ -16,13 +17,13 @@ DEFAULT_CONFIG = """
@registry.tokenizers("spacy.th.ThaiTokenizer") @registry.tokenizers("spacy.th.ThaiTokenizer")
def create_thai_tokenizer(): def create_thai_tokenizer():
def thai_tokenizer_factory(nlp): def thai_tokenizer_factory(nlp):
return ThaiTokenizer(nlp) return ThaiTokenizer(nlp.vocab)
return thai_tokenizer_factory return thai_tokenizer_factory
class ThaiTokenizer(DummyTokenizer): class ThaiTokenizer(DummyTokenizer):
def __init__(self, nlp: Language) -> None: def __init__(self, vocab: Vocab) -> None:
try: try:
from pythainlp.tokenize import word_tokenize from pythainlp.tokenize import word_tokenize
except ImportError: except ImportError:
@ -31,7 +32,7 @@ class ThaiTokenizer(DummyTokenizer):
"https://github.com/PyThaiNLP/pythainlp" "https://github.com/PyThaiNLP/pythainlp"
) from None ) from None
self.word_tokenize = word_tokenize self.word_tokenize = word_tokenize
self.vocab = nlp.vocab self.vocab = vocab
def __call__(self, text: str) -> Doc: def __call__(self, text: str) -> Doc:
words = list(self.word_tokenize(text)) words = list(self.word_tokenize(text))

View File

@ -2,7 +2,7 @@ from ...attrs import LIKE_NUM
_num_words = [ _num_words = [
"ዜሮ", "ዜሮ",
"", "",
"ክልተ", "ክልተ",
"ሰለስተ", "ሰለስተ",
"ኣርባዕተ", "ኣርባዕተ",
@ -11,66 +11,37 @@ _num_words = [
"ሸውዓተ", "ሸውዓተ",
"ሽሞንተ", "ሽሞንተ",
"ትሽዓተ", "ትሽዓተ",
"ኣሰርተ", "ዓሰርተ",
"ኣሰርተ ሐደ",
"ኣሰርተ ክልተ",
"ኣሰርተ ሰለስተ",
"ኣሰርተ ኣርባዕተ",
"ኣሰርተ ሓሙሽተ",
"ኣሰርተ ሽድሽተ",
"ኣሰርተ ሸውዓተ",
"ኣሰርተ ሽሞንተ",
"ኣሰርተ ትሽዓተ",
"ዕስራ", "ዕስራ",
"ሰላሳ", "ሰላሳ",
"ኣርብዓ", "ኣርብዓ",
"ምሳ", "ሓምሳ",
"ስል", "ሱሳ",
"ሰብዓ", "ሰብዓ",
"ሰማንያ", "ሰማንያ",
"ስዓ", "ቴስዓ",
"ሚእቲ", "ሚእቲ",
"ሺሕ", "ሺሕ",
"ሚልዮን", "ሚልዮን",
"ቢልዮን", "ቢልዮን",
"ትሪልዮን", "ትሪልዮን",
"ኳድሪልዮን", "ኳድሪልዮን",
"ገጅልዮን", "ጋዚልዮን",
"ልዮን", "ልዮን",
] ]
# Tigrinya ordinals above 10 are the same as _num_words but start with "መበል "
_ordinal_words = [ _ordinal_words = [
"ቀዳማይ", "ቀዳማይ",
"ካልኣይ", "ካልኣይ",
"ሳልሳይ", "ሳልሳይ",
"ራብ", "ራብ",
"ሓምሻይ", "ሓምሻይ",
"ሻድሻይ", "ሻድሻይ",
"ሻውዓይ", "ሻውዓይ",
"ሻምናይ", "ሻምናይ",
"ዘጠነኛ", "ታሽዓይ",
"አስረኛ", "ዓስራይ",
"ኣሰርተ አንደኛ",
"ኣሰርተ ሁለተኛ",
"ኣሰርተ ሶስተኛ",
"ኣሰርተ አራተኛ",
"ኣሰርተ አምስተኛ",
"ኣሰርተ ስድስተኛ",
"ኣሰርተ ሰባተኛ",
"ኣሰርተ ስምንተኛ",
"ኣሰርተ ዘጠነኛ",
"ሃያኛ",
"ሰላሳኛ" "አርባኛ",
"አምሳኛ",
"ስድሳኛ",
"ሰባኛ",
"ሰማንያኛ",
"ዘጠናኛ",
"መቶኛ",
"ሺኛ",
"ሚሊዮንኛ",
"ቢሊዮንኛ",
"ትሪሊዮንኛ",
] ]
@ -92,7 +63,7 @@ def like_num(text):
# Check ordinal number # Check ordinal number
if text_lower in _ordinal_words: if text_lower in _ordinal_words:
return True return True
if text_lower.endswith(""): if text_lower.endswith(""):
if text_lower[:-2].isdigit(): if text_lower[:-2].isdigit():
return True return True

View File

@ -1,7 +1,7 @@
from ..char_classes import LIST_PUNCT, LIST_ELLIPSES, LIST_QUOTES, CURRENCY from ..char_classes import LIST_PUNCT, LIST_ELLIPSES, LIST_QUOTES, CURRENCY
from ..char_classes import UNITS, ALPHA_UPPER from ..char_classes import UNITS, ALPHA_UPPER
_list_punct = LIST_PUNCT + "፡ ። ፣ ፤ ፥ ፦ ፧".strip().split() _list_punct = LIST_PUNCT + "፡ ። ፣ ፤ ፥ ፦ ፧ ፠ ፨".strip().split()
_suffixes = ( _suffixes = (
_list_punct _list_punct

View File

@ -1,6 +1,27 @@
# Stop words from Tigrinya Wordcount: https://github.com/fgaim/Tigrinya-WordCount/blob/main/ti_stop_words.txt
# Stop words # Stop words
STOP_WORDS = set( STOP_WORDS = set(
""" """
ግን ግና ንስኻ ንስኺ ንስኻትክን ንስኻትኩም ናትካ ናትኪ ናትክን ናትኩም 'ምበር ' '' ''ውን '' '' 'ዮም 'ዮን
ልዕሊ ሒዙ ሒዛ ሕጂ መበል መን መንጎ መጠን ማለት ምስ ምባል
ምእንቲ ምኽንያቱ ምኽንያት ምዃኑ ምዃንና ምዃኖም
ስለ ስለዚ ስለዝበላ ሽዑ ቅድሚ በለ በቲ በዚ ብምባል ብተወሳኺ ብኸመይ
ብዘይ ብዘይካ ብዙሕ ብዛዕባ ብፍላይ ተባሂሉ ነበረ ነቲ ነታ ነቶም
ነዚ ነይሩ ነገራት ነገር ናብ ናብቲ ናትኩም ናትኪ ናትካ ናትክን
ናይ ናይቲ ንሕና ንሱ ንሳ ንሳቶም ንስኺ ንስኻ ንስኻትኩም ንስኻትክን ንዓይ
ኢለ ኢሉ ኢላ ኢልካ ኢሎም ኢና ኢኻ ኢዩ ኣለኹ
ኣለዉ ኣለዎ ኣሎ ኣብ ኣብቲ ኣብታ ኣብኡ ኣብዚ ኣነ ኣዝዩ ኣይኮነን ኣይኰነን
እምበር እሞ እተን እቲ እታ እቶም እንተ እንተሎ
ኣላ እንተኾነ እንታይ እንከሎ እኳ እዋን እውን እዚ እዛ እዞም
እየ እየን እዩ እያ እዮም
ከሎ ከመይ ከም ከምቲ ከምኡ ከምዘሎ
ከምዚ ከኣ ኩሉ ካልእ ካልኦት ካብ ካብቲ ካብቶም ክሳብ ክሳዕ ክብል
ክንደይ ክንዲ ክኸውን ኮይኑ ኰይኑ ኵሉ ኸም ኸኣ ወይ
ዋላ ዘለና ዘለዉ ዘለዋ ዘለዎ ዘለዎም ዘላ ዘሎ ዘይብሉ
ዝርከብ ዝበሃል ዝበለ ዝብል ዝተባህለ ዝተኻየደ ዝተፈላለየ ዝተፈላለዩ
ዝነበረ ዝነበረት ዝነበሩ ዝካየድ ዝኸውን ዝኽእል ዝኾነ ዝዀነ
የለን ይቕረብ ይብል ይኸውን ይኹን ይኽእል ደኣ ድሕሪ ድማ
ገለ ገሊጹ ገና ገይሩ ግና ግን ጥራይ
""".split() """.split()
) )

View File

@ -250,3 +250,9 @@ o.0
for orth in emoticons: for orth in emoticons:
BASE_EXCEPTIONS[orth] = [{ORTH: orth}] BASE_EXCEPTIONS[orth] = [{ORTH: orth}]
# Moved from a suffix setting due to #9155 removing prefixes from consideration
# for lookbehinds
for u in "cfkCFK":
BASE_EXCEPTIONS[f"°{u}."] = [{ORTH: "°"}, {ORTH: f"{u}"}, {ORTH: "."}]

View File

@ -1,4 +1,4 @@
from typing import Optional from typing import Optional, Callable
from thinc.api import Model from thinc.api import Model
@ -23,13 +23,25 @@ class Ukrainian(Language):
@Ukrainian.factory( @Ukrainian.factory(
"lemmatizer", "lemmatizer",
assigns=["token.lemma"], assigns=["token.lemma"],
default_config={"model": None, "mode": "pymorphy2", "overwrite": False}, default_config={
"model": None,
"mode": "pymorphy2",
"overwrite": False,
"scorer": {"@scorers": "spacy.lemmatizer_scorer.v1"},
},
default_score_weights={"lemma_acc": 1.0}, default_score_weights={"lemma_acc": 1.0},
) )
def make_lemmatizer( def make_lemmatizer(
nlp: Language, model: Optional[Model], name: str, mode: str, overwrite: bool nlp: Language,
model: Optional[Model],
name: str,
mode: str,
overwrite: bool,
scorer: Optional[Callable],
): ):
return UkrainianLemmatizer(nlp.vocab, model, name, mode=mode, overwrite=overwrite) return UkrainianLemmatizer(
nlp.vocab, model, name, mode=mode, overwrite=overwrite, scorer=scorer
)
__all__ = ["Ukrainian"] __all__ = ["Ukrainian"]

View File

@ -1,8 +1,9 @@
from typing import Optional from typing import Optional, Callable
from thinc.api import Model from thinc.api import Model
from ..ru.lemmatizer import RussianLemmatizer from ..ru.lemmatizer import RussianLemmatizer
from ...pipeline.lemmatizer import lemmatizer_score
from ...vocab import Vocab from ...vocab import Vocab
@ -15,6 +16,7 @@ class UkrainianLemmatizer(RussianLemmatizer):
*, *,
mode: str = "pymorphy2", mode: str = "pymorphy2",
overwrite: bool = False, overwrite: bool = False,
scorer: Optional[Callable] = lemmatizer_score,
) -> None: ) -> None:
if mode == "pymorphy2": if mode == "pymorphy2":
try: try:
@ -27,4 +29,6 @@ class UkrainianLemmatizer(RussianLemmatizer):
) from None ) from None
if getattr(self, "_morph", None) is None: if getattr(self, "_morph", None) is None:
self._morph = MorphAnalyzer(lang="uk") self._morph = MorphAnalyzer(lang="uk")
super().__init__(vocab, model, name, mode=mode, overwrite=overwrite) super().__init__(
vocab, model, name, mode=mode, overwrite=overwrite, scorer=scorer
)

View File

@ -9,6 +9,7 @@ from .lex_attrs import LEX_ATTRS
from ...language import Language, BaseDefaults from ...language import Language, BaseDefaults
from ...tokens import Doc from ...tokens import Doc
from ...util import DummyTokenizer, registry, load_config_from_str from ...util import DummyTokenizer, registry, load_config_from_str
from ...vocab import Vocab
from ... import util from ... import util
@ -24,14 +25,14 @@ use_pyvi = true
@registry.tokenizers("spacy.vi.VietnameseTokenizer") @registry.tokenizers("spacy.vi.VietnameseTokenizer")
def create_vietnamese_tokenizer(use_pyvi: bool = True): def create_vietnamese_tokenizer(use_pyvi: bool = True):
def vietnamese_tokenizer_factory(nlp): def vietnamese_tokenizer_factory(nlp):
return VietnameseTokenizer(nlp, use_pyvi=use_pyvi) return VietnameseTokenizer(nlp.vocab, use_pyvi=use_pyvi)
return vietnamese_tokenizer_factory return vietnamese_tokenizer_factory
class VietnameseTokenizer(DummyTokenizer): class VietnameseTokenizer(DummyTokenizer):
def __init__(self, nlp: Language, use_pyvi: bool = False): def __init__(self, vocab: Vocab, use_pyvi: bool = False):
self.vocab = nlp.vocab self.vocab = vocab
self.use_pyvi = use_pyvi self.use_pyvi = use_pyvi
if self.use_pyvi: if self.use_pyvi:
try: try:
@ -45,6 +46,9 @@ class VietnameseTokenizer(DummyTokenizer):
) )
raise ImportError(msg) from None raise ImportError(msg) from None
def __reduce__(self):
return VietnameseTokenizer, (self.vocab, self.use_pyvi)
def __call__(self, text: str) -> Doc: def __call__(self, text: str) -> Doc:
if self.use_pyvi: if self.use_pyvi:
words = self.pyvi_tokenize(text) words = self.pyvi_tokenize(text)
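As with the Korean tokenizer, the new __reduce__ makes VietnameseTokenizer picklable on its own; a sketch assuming pyvi is installed:

    import pickle
    import spacy

    nlp = spacy.blank("vi")                          # uses pyvi by default
    tok = pickle.loads(pickle.dumps(nlp.tokenizer))  # round-trips via __reduce__
    print(tok.use_pyvi)                              # True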

17
spacy/lang/vi/examples.py Normal file
View File

@ -0,0 +1,17 @@
"""
Example sentences to test spaCy and its language models.
>>> from spacy.lang.vi.examples import sentences
>>> docs = nlp.pipe(sentences)
"""
sentences = [
"Đây là đâu, tôi là ai?",
"Căn phòng có nhiều cửa sổ nên nó khá sáng",
"Đại dịch COVID vừa qua đã gây ảnh hưởng rất lớn tới nhiều doanh nghiệp lớn nhỏ.",
"Thành phố Hồ Chí Minh đã bị ảnh hưởng nặng nề trong thời gian vừa qua.",
"Ông bạn đang ở đâu thế?",
"Ai là người giải phóng đất nước Việt Nam khỏi ách đô hộ?",
"Vị tướng nào là người đã làm nên chiến thắng lịch sử Điện Biên Phủ?",
"Làm việc nhiều chán quá, đi chơi đâu đi?",
]

View File

@ -9,11 +9,14 @@ _num_words = [
"bốn", "bốn",
"năm", "năm",
"sáu", "sáu",
"bảy",
"bẩy", "bẩy",
"tám", "tám",
"chín", "chín",
"mười", "mười",
"chục",
"trăm", "trăm",
"nghìn",
"tỷ", "tỷ",
] ]

View File

@ -11,6 +11,7 @@ from ...scorer import Scorer
from ...tokens import Doc from ...tokens import Doc
from ...training import validate_examples, Example from ...training import validate_examples, Example
from ...util import DummyTokenizer, registry, load_config_from_str from ...util import DummyTokenizer, registry, load_config_from_str
from ...vocab import Vocab
from .lex_attrs import LEX_ATTRS from .lex_attrs import LEX_ATTRS
from .stop_words import STOP_WORDS from .stop_words import STOP_WORDS
from ... import util from ... import util
@ -48,14 +49,14 @@ class Segmenter(str, Enum):
@registry.tokenizers("spacy.zh.ChineseTokenizer") @registry.tokenizers("spacy.zh.ChineseTokenizer")
def create_chinese_tokenizer(segmenter: Segmenter = Segmenter.char): def create_chinese_tokenizer(segmenter: Segmenter = Segmenter.char):
def chinese_tokenizer_factory(nlp): def chinese_tokenizer_factory(nlp):
return ChineseTokenizer(nlp, segmenter=segmenter) return ChineseTokenizer(nlp.vocab, segmenter=segmenter)
return chinese_tokenizer_factory return chinese_tokenizer_factory
class ChineseTokenizer(DummyTokenizer): class ChineseTokenizer(DummyTokenizer):
def __init__(self, nlp: Language, segmenter: Segmenter = Segmenter.char): def __init__(self, vocab: Vocab, segmenter: Segmenter = Segmenter.char):
self.vocab = nlp.vocab self.vocab = vocab
self.segmenter = ( self.segmenter = (
segmenter.value if isinstance(segmenter, Segmenter) else segmenter segmenter.value if isinstance(segmenter, Segmenter) else segmenter
) )

View File

@ -115,7 +115,7 @@ class Language:
Defaults (class): Settings, data and factory methods for creating the `nlp` Defaults (class): Settings, data and factory methods for creating the `nlp`
object and processing pipeline. object and processing pipeline.
lang (str): Two-letter language ID, i.e. ISO code. lang (str): IETF language code, such as 'en'.
DOCS: https://spacy.io/api/language DOCS: https://spacy.io/api/language
""" """
@ -228,6 +228,7 @@ class Language:
"vectors": len(self.vocab.vectors), "vectors": len(self.vocab.vectors),
"keys": self.vocab.vectors.n_keys, "keys": self.vocab.vectors.n_keys,
"name": self.vocab.vectors.name, "name": self.vocab.vectors.name,
"mode": self.vocab.vectors.mode,
} }
self._meta["labels"] = dict(self.pipe_labels) self._meta["labels"] = dict(self.pipe_labels)
# TODO: Adding this back to prevent breaking people's code etc., but # TODO: Adding this back to prevent breaking people's code etc., but
@ -700,7 +701,8 @@ class Language:
if ( if (
self.vocab.vectors.shape != source.vocab.vectors.shape self.vocab.vectors.shape != source.vocab.vectors.shape
or self.vocab.vectors.key2row != source.vocab.vectors.key2row or self.vocab.vectors.key2row != source.vocab.vectors.key2row
or self.vocab.vectors.to_bytes() != source.vocab.vectors.to_bytes() or self.vocab.vectors.to_bytes(exclude=["strings"])
!= source.vocab.vectors.to_bytes(exclude=["strings"])
): ):
warnings.warn(Warnings.W113.format(name=source_name)) warnings.warn(Warnings.W113.format(name=source_name))
if source_name not in source.component_names: if source_name not in source.component_names:
@ -978,7 +980,7 @@ class Language:
def __call__( def __call__(
self, self,
text: str, text: Union[str, Doc],
*, *,
disable: Iterable[str] = SimpleFrozenList(), disable: Iterable[str] = SimpleFrozenList(),
component_cfg: Optional[Dict[str, Dict[str, Any]]] = None, component_cfg: Optional[Dict[str, Dict[str, Any]]] = None,
@ -987,7 +989,9 @@ class Language:
and can contain arbitrary whitespace. Alignment into the original string and can contain arbitrary whitespace. Alignment into the original string
is preserved. is preserved.
text (str): The text to be processed. text (Union[str, Doc]): If `str`, the text to be processed. If `Doc`,
the doc will be passed directly to the pipeline, skipping
`Language.make_doc`.
disable (List[str]): Names of the pipeline components to disable. disable (List[str]): Names of the pipeline components to disable.
component_cfg (Dict[str, dict]): An optional dictionary with extra component_cfg (Dict[str, dict]): An optional dictionary with extra
keyword arguments for specific components. keyword arguments for specific components.
@ -995,7 +999,7 @@ class Language:
DOCS: https://spacy.io/api/language#call DOCS: https://spacy.io/api/language#call
""" """
doc = self.make_doc(text) doc = self._ensure_doc(text)
if component_cfg is None: if component_cfg is None:
component_cfg = {} component_cfg = {}
for name, proc in self.pipeline: for name, proc in self.pipeline:
@ -1080,6 +1084,20 @@ class Language:
) )
return self.tokenizer(text) return self.tokenizer(text)
def _ensure_doc(self, doc_like: Union[str, Doc]) -> Doc:
"""Create a Doc if need be, or raise an error if the input is not a Doc or a string."""
if isinstance(doc_like, Doc):
return doc_like
if isinstance(doc_like, str):
return self.make_doc(doc_like)
raise ValueError(Errors.E866.format(type=type(doc_like)))
def _ensure_doc_with_context(self, doc_like: Union[str, Doc], context: Any) -> Doc:
"""Create a Doc if need be and add as_tuples context, or raise an error if the input is not a Doc or a string."""
doc = self._ensure_doc(doc_like)
doc._context = context
return doc
def update( def update(
self, self,
examples: Iterable[Example], examples: Iterable[Example],
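A short sketch of the new Doc passthrough in Language.__call__; with a blank (empty) pipeline the same Doc object is expected to come back unchanged:

    import spacy
    from spacy.tokens import Doc

    nlp = spacy.blank("en")
    doc = Doc(nlp.vocab, words=["Hello", "world"], spaces=[True, False])
    out = nlp(doc)      # Doc input skips make_doc and goes straight to the pipeline
    assert out is doc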
@ -1267,9 +1285,9 @@ class Language:
) )
except IOError: except IOError:
raise IOError(Errors.E884.format(vectors=I["vectors"])) raise IOError(Errors.E884.format(vectors=I["vectors"]))
if self.vocab.vectors.data.shape[1] >= 1: if self.vocab.vectors.shape[1] >= 1:
ops = get_current_ops() ops = get_current_ops()
self.vocab.vectors.data = ops.asarray(self.vocab.vectors.data) self.vocab.vectors.to_ops(ops)
if hasattr(self.tokenizer, "initialize"): if hasattr(self.tokenizer, "initialize"):
tok_settings = validate_init_settings( tok_settings = validate_init_settings(
self.tokenizer.initialize, # type: ignore[union-attr] self.tokenizer.initialize, # type: ignore[union-attr]
@ -1314,8 +1332,8 @@ class Language:
DOCS: https://spacy.io/api/language#resume_training DOCS: https://spacy.io/api/language#resume_training
""" """
ops = get_current_ops() ops = get_current_ops()
if self.vocab.vectors.data.shape[1] >= 1: if self.vocab.vectors.shape[1] >= 1:
self.vocab.vectors.data = ops.asarray(self.vocab.vectors.data) self.vocab.vectors.to_ops(ops)
for name, proc in self.pipeline: for name, proc in self.pipeline:
if hasattr(proc, "_rehearsal_model"): if hasattr(proc, "_rehearsal_model"):
proc._rehearsal_model = deepcopy(proc.model) # type: ignore[attr-defined] proc._rehearsal_model = deepcopy(proc.model) # type: ignore[attr-defined]
@ -1386,20 +1404,13 @@ class Language:
for eg in examples: for eg in examples:
self.make_doc(eg.reference.text) self.make_doc(eg.reference.text)
# apply all pipeline components # apply all pipeline components
for name, pipe in self.pipeline: docs = self.pipe(
kwargs = component_cfg.get(name, {}) (eg.predicted for eg in examples),
kwargs.setdefault("batch_size", batch_size) batch_size=batch_size,
for doc, eg in zip( component_cfg=component_cfg,
_pipe( )
(eg.predicted for eg in examples), for eg, doc in zip(examples, docs):
proc=pipe, eg.predicted = doc
name=name,
default_error_handler=self.default_error_handler,
kwargs=kwargs,
),
examples,
):
eg.predicted = doc
end_time = timer() end_time = timer()
results = scorer.score(examples) results = scorer.score(examples)
n_words = sum(len(eg.predicted) for eg in examples) n_words = sum(len(eg.predicted) for eg in examples)
@ -1450,7 +1461,7 @@ class Language:
@overload @overload
def pipe( def pipe(
self, self,
texts: Iterable[str], texts: Iterable[Union[str, Doc]],
*, *,
as_tuples: Literal[False] = ..., as_tuples: Literal[False] = ...,
batch_size: Optional[int] = ..., batch_size: Optional[int] = ...,
@ -1463,7 +1474,7 @@ class Language:
@overload @overload
def pipe( # noqa: F811 def pipe( # noqa: F811
self, self,
texts: Iterable[Tuple[str, _AnyContext]], texts: Iterable[Tuple[Union[str, Doc], _AnyContext]],
*, *,
as_tuples: Literal[True] = ..., as_tuples: Literal[True] = ...,
batch_size: Optional[int] = ..., batch_size: Optional[int] = ...,
@ -1475,7 +1486,9 @@ class Language:
def pipe( # noqa: F811 def pipe( # noqa: F811
self, self,
texts: Union[Iterable[str], Iterable[Tuple[str, _AnyContext]]], texts: Union[
Iterable[Union[str, Doc]], Iterable[Tuple[Union[str, Doc], _AnyContext]]
],
*, *,
as_tuples: bool = False, as_tuples: bool = False,
batch_size: Optional[int] = None, batch_size: Optional[int] = None,
@ -1485,7 +1498,8 @@ class Language:
) -> Union[Iterator[Doc], Iterator[Tuple[Doc, _AnyContext]]]: ) -> Union[Iterator[Doc], Iterator[Tuple[Doc, _AnyContext]]]:
"""Process texts as a stream, and yield `Doc` objects in order. """Process texts as a stream, and yield `Doc` objects in order.
texts (Iterable[str]): A sequence of texts to process. texts (Iterable[Union[str, Doc]]): A sequence of texts or docs to
process.
as_tuples (bool): If set to True, inputs should be a sequence of as_tuples (bool): If set to True, inputs should be a sequence of
(text, context) tuples. Output will then be a sequence of (text, context) tuples. Output will then be a sequence of
(doc, context) tuples. Defaults to False. (doc, context) tuples. Defaults to False.
@ -1500,23 +1514,24 @@ class Language:
""" """
# Handle texts with context as tuples # Handle texts with context as tuples
if as_tuples: if as_tuples:
texts = cast(Iterable[Tuple[str, _AnyContext]], texts) texts = cast(Iterable[Tuple[Union[str, Doc], _AnyContext]], texts)
text_context1, text_context2 = itertools.tee(texts) docs_with_contexts = (
texts = (tc[0] for tc in text_context1) self._ensure_doc_with_context(text, context) for text, context in texts
contexts = (tc[1] for tc in text_context2) )
docs = self.pipe( docs = self.pipe(
texts, docs_with_contexts,
batch_size=batch_size, batch_size=batch_size,
disable=disable, disable=disable,
n_process=n_process, n_process=n_process,
component_cfg=component_cfg, component_cfg=component_cfg,
) )
for doc, context in zip(docs, contexts): for doc in docs:
context = doc._context
doc._context = None
yield (doc, context) yield (doc, context)
return return
# At this point, we know that we're dealing with an iterable of plain texts texts = cast(Iterable[Union[str, Doc]], texts)
texts = cast(Iterable[str], texts)
# Set argument defaults # Set argument defaults
if n_process == -1: if n_process == -1:
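A sketch of the as_tuples flow after this change: the context now rides along on doc._context through the pipeline (and across worker processes) instead of being re-zipped afterwards. The example data is illustrative:

    import spacy

    nlp = spacy.blank("en")
    data = [("First text.", {"id": 1}), ("Second text.", {"id": 2})]
    for doc, ctx in nlp.pipe(data, as_tuples=True):
        print(ctx["id"], doc.text)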
@ -1551,7 +1566,7 @@ class Language:
docs = self._multiprocessing_pipe(texts, pipes, n_process, batch_size) docs = self._multiprocessing_pipe(texts, pipes, n_process, batch_size)
else: else:
# if n_process == 1, no processes are forked. # if n_process == 1, no processes are forked.
docs = (self.make_doc(text) for text in texts) docs = (self._ensure_doc(text) for text in texts)
for pipe in pipes: for pipe in pipes:
docs = pipe(docs) docs = pipe(docs)
for doc in docs: for doc in docs:
@ -1570,7 +1585,7 @@ class Language:
def _multiprocessing_pipe( def _multiprocessing_pipe(
self, self,
texts: Iterable[str], texts: Iterable[Union[str, Doc]],
pipes: Iterable[Callable[..., Iterator[Doc]]], pipes: Iterable[Callable[..., Iterator[Doc]]],
n_process: int, n_process: int,
batch_size: int, batch_size: int,
@ -1596,7 +1611,7 @@ class Language:
procs = [ procs = [
mp.Process( mp.Process(
target=_apply_pipes, target=_apply_pipes,
args=(self.make_doc, pipes, rch, sch, Underscore.get_state()), args=(self._ensure_doc, pipes, rch, sch, Underscore.get_state()),
) )
for rch, sch in zip(texts_q, bytedocs_send_ch) for rch, sch in zip(texts_q, bytedocs_send_ch)
] ]
@ -1609,11 +1624,12 @@ class Language:
recv.recv() for recv in cycle(bytedocs_recv_ch) recv.recv() for recv in cycle(bytedocs_recv_ch)
) )
try: try:
for i, (_, (byte_doc, byte_error)) in enumerate( for i, (_, (byte_doc, byte_context, byte_error)) in enumerate(
zip(raw_texts, byte_tuples), 1 zip(raw_texts, byte_tuples), 1
): ):
if byte_doc is not None: if byte_doc is not None:
doc = Doc(self.vocab).from_bytes(byte_doc) doc = Doc(self.vocab).from_bytes(byte_doc)
doc._context = byte_context
yield doc yield doc
elif byte_error is not None: elif byte_error is not None:
error = srsly.msgpack_loads(byte_error) error = srsly.msgpack_loads(byte_error)
@ -1800,7 +1816,9 @@ class Language:
) )
if model not in source_nlp_vectors_hashes: if model not in source_nlp_vectors_hashes:
source_nlp_vectors_hashes[model] = hash( source_nlp_vectors_hashes[model] = hash(
source_nlps[model].vocab.vectors.to_bytes() source_nlps[model].vocab.vectors.to_bytes(
exclude=["strings"]
)
) )
if "_sourced_vectors_hashes" not in nlp.meta: if "_sourced_vectors_hashes" not in nlp.meta:
nlp.meta["_sourced_vectors_hashes"] = {} nlp.meta["_sourced_vectors_hashes"] = {}
@ -2138,7 +2156,7 @@ def _copy_examples(examples: Iterable[Example]) -> List[Example]:
def _apply_pipes( def _apply_pipes(
make_doc: Callable[[str], Doc], ensure_doc: Callable[[Union[str, Doc]], Doc],
pipes: Iterable[Callable[..., Iterator[Doc]]], pipes: Iterable[Callable[..., Iterator[Doc]]],
receiver, receiver,
sender, sender,
@ -2146,7 +2164,8 @@ def _apply_pipes(
) -> None: ) -> None:
"""Worker for Language.pipe """Worker for Language.pipe
make_doc (Callable[[str,] Doc]): Function to create Doc from text. ensure_doc (Callable[[Union[str, Doc]], Doc]): Function to create Doc from text
or raise an error if the input is neither a Doc nor a string.
pipes (Iterable[Pipe]): The components to apply. pipes (Iterable[Pipe]): The components to apply.
receiver (multiprocessing.Connection): Pipe to receive text. Usually receiver (multiprocessing.Connection): Pipe to receive text. Usually
created by `multiprocessing.Pipe()` created by `multiprocessing.Pipe()`
@ -2159,16 +2178,16 @@ def _apply_pipes(
while True: while True:
try: try:
texts = receiver.get() texts = receiver.get()
docs = (make_doc(text) for text in texts) docs = (ensure_doc(text) for text in texts)
for pipe in pipes: for pipe in pipes:
docs = pipe(docs) # type: ignore[arg-type, assignment] docs = pipe(docs) # type: ignore[arg-type, assignment]
# Connection does not accept unpicklable objects, so send list. # Connection does not accept unpicklable objects, so send list.
byte_docs = [(doc.to_bytes(), None) for doc in docs] byte_docs = [(doc.to_bytes(), doc._context, None) for doc in docs]
padding = [(None, None)] * (len(texts) - len(byte_docs)) padding = [(None, None, None)] * (len(texts) - len(byte_docs))
sender.send(byte_docs + padding) # type: ignore[operator] sender.send(byte_docs + padding) # type: ignore[operator]
except Exception: except Exception:
error_msg = [(None, srsly.msgpack_dumps(traceback.format_exc()))] error_msg = [(None, None, srsly.msgpack_dumps(traceback.format_exc()))]
padding = [(None, None)] * (len(texts) - 1) padding = [(None, None, None)] * (len(texts) - 1)
sender.send(error_msg + padding) sender.send(error_msg + padding)

View File

@ -19,7 +19,7 @@ class Lexeme:
@property @property
def vector_norm(self) -> float: ... def vector_norm(self) -> float: ...
vector: Floats1d vector: Floats1d
rank: str rank: int
sentiment: float sentiment: float
@property @property
def orth_(self) -> str: ... def orth_(self) -> str: ...

View File

@ -130,8 +130,10 @@ cdef class Lexeme:
return 0.0 return 0.0
vector = self.vector vector = self.vector
xp = get_array_module(vector) xp = get_array_module(vector)
return (xp.dot(vector, other.vector) / (self.vector_norm * other.vector_norm)) result = xp.dot(vector, other.vector) / (self.vector_norm * other.vector_norm)
# ensure we get a scalar back (numpy does this automatically but cupy doesn't)
return result.item()
@property @property
def has_vector(self): def has_vector(self):
"""RETURNS (bool): Whether a word vector is associated with the object. """RETURNS (bool): Whether a word vector is associated with the object.
@ -284,7 +286,7 @@ cdef class Lexeme:
def __get__(self): def __get__(self):
return self.vocab.strings[self.c.lower] return self.vocab.strings[self.c.lower]
def __set__(self, unicode x): def __set__(self, str x):
self.c.lower = self.vocab.strings.add(x) self.c.lower = self.vocab.strings.add(x)
property norm_: property norm_:
@ -294,7 +296,7 @@ cdef class Lexeme:
def __get__(self): def __get__(self):
return self.vocab.strings[self.c.norm] return self.vocab.strings[self.c.norm]
def __set__(self, unicode x): def __set__(self, str x):
self.norm = self.vocab.strings.add(x) self.norm = self.vocab.strings.add(x)
property shape_: property shape_:
@ -304,7 +306,7 @@ cdef class Lexeme:
def __get__(self): def __get__(self):
return self.vocab.strings[self.c.shape] return self.vocab.strings[self.c.shape]
def __set__(self, unicode x): def __set__(self, str x):
self.c.shape = self.vocab.strings.add(x) self.c.shape = self.vocab.strings.add(x)
property prefix_: property prefix_:
@ -314,7 +316,7 @@ cdef class Lexeme:
def __get__(self): def __get__(self):
return self.vocab.strings[self.c.prefix] return self.vocab.strings[self.c.prefix]
def __set__(self, unicode x): def __set__(self, str x):
self.c.prefix = self.vocab.strings.add(x) self.c.prefix = self.vocab.strings.add(x)
property suffix_: property suffix_:
@ -324,7 +326,7 @@ cdef class Lexeme:
def __get__(self): def __get__(self):
return self.vocab.strings[self.c.suffix] return self.vocab.strings[self.c.suffix]
def __set__(self, unicode x): def __set__(self, str x):
self.c.suffix = self.vocab.strings.add(x) self.c.suffix = self.vocab.strings.add(x)
property lang_: property lang_:
@ -332,7 +334,7 @@ cdef class Lexeme:
def __get__(self): def __get__(self):
return self.vocab.strings[self.c.lang] return self.vocab.strings[self.c.lang]
def __set__(self, unicode x): def __set__(self, str x):
self.c.lang = self.vocab.strings.add(x) self.c.lang = self.vocab.strings.add(x)
property flags: property flags:
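Regarding the similarity change above: calling .item() on the dot-product result means Lexeme.similarity returns a plain Python float on both numpy and cupy. A sketch assuming a pipeline with word vectors (en_core_web_md is only an example):

    import spacy

    nlp = spacy.load("en_core_web_md")
    sim = nlp.vocab["cat"].similarity(nlp.vocab["dog"])
    print(type(sim).__name__, sim)   # float, on CPU and GPU alike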

View File

@ -148,9 +148,9 @@ cdef class DependencyMatcher:
Creates a token key to be used by the matcher Creates a token key to be used by the matcher
""" """
return self._normalize_key( return self._normalize_key(
unicode(key) + DELIMITER + str(key) + DELIMITER +
unicode(pattern_idx) + DELIMITER + str(pattern_idx) + DELIMITER +
unicode(token_idx) str(token_idx)
) )
def add(self, key, patterns, *, on_match=None): def add(self, key, patterns, *, on_match=None):
@ -424,7 +424,7 @@ cdef class DependencyMatcher:
return [doc[child.i] for child in doc[node].head.children if child.i < node] return [doc[child.i] for child in doc[node].head.children if child.i < node]
def _normalize_key(self, key): def _normalize_key(self, key):
if isinstance(key, basestring): if isinstance(key, str):
return self.vocab.strings.add(key) return self.vocab.strings.add(key)
else: else:
return key return key

View File

@ -18,7 +18,7 @@ from ..tokens.doc cimport Doc, get_token_attr_for_matcher
from ..tokens.span cimport Span from ..tokens.span cimport Span
from ..tokens.token cimport Token from ..tokens.token cimport Token
from ..tokens.morphanalysis cimport MorphAnalysis from ..tokens.morphanalysis cimport MorphAnalysis
from ..attrs cimport ID, attr_id_t, NULL_ATTR, ORTH, POS, TAG, DEP, LEMMA, MORPH from ..attrs cimport ID, attr_id_t, NULL_ATTR, ORTH, POS, TAG, DEP, LEMMA, MORPH, ENT_IOB
from ..schemas import validate_token_pattern from ..schemas import validate_token_pattern
from ..errors import Errors, MatchPatternError, Warnings from ..errors import Errors, MatchPatternError, Warnings
@ -312,7 +312,7 @@ cdef class Matcher:
return final_results return final_results
def _normalize_key(self, key): def _normalize_key(self, key):
if isinstance(key, basestring): if isinstance(key, str):
return self.vocab.strings.add(key) return self.vocab.strings.add(key)
else: else:
return key return key
@ -360,7 +360,7 @@ cdef find_matches(TokenPatternC** patterns, int n, object doclike, int length, e
for i, token in enumerate(doclike): for i, token in enumerate(doclike):
for name, index in extensions.items(): for name, index in extensions.items():
value = token._.get(name) value = token._.get(name)
if isinstance(value, basestring): if isinstance(value, str):
value = token.vocab.strings[value] value = token.vocab.strings[value]
extra_attr_values[i * nr_extra_attr + index] = value extra_attr_values[i * nr_extra_attr + index] = value
# Main loop # Main loop
@ -786,7 +786,7 @@ def _preprocess_pattern(token_specs, vocab, extensions_table, extra_predicates):
def _get_attr_values(spec, string_store): def _get_attr_values(spec, string_store):
attr_values = [] attr_values = []
for attr, value in spec.items(): for attr, value in spec.items():
if isinstance(attr, basestring): if isinstance(attr, str):
attr = attr.upper() attr = attr.upper()
if attr == '_': if attr == '_':
continue continue
@ -797,8 +797,11 @@ def _get_attr_values(spec, string_store):
if attr == "IS_SENT_START": if attr == "IS_SENT_START":
attr = "SENT_START" attr = "SENT_START"
attr = IDS.get(attr) attr = IDS.get(attr)
if isinstance(value, basestring): if isinstance(value, str):
value = string_store.add(value) if attr == ENT_IOB and value in Token.iob_strings():
value = Token.iob_strings().index(value)
else:
value = string_store.add(value)
elif isinstance(value, bool): elif isinstance(value, bool):
value = int(value) value = int(value)
elif isinstance(value, int): elif isinstance(value, int):
@ -938,7 +941,7 @@ def _get_extra_predicates(spec, extra_predicates, vocab):
seen_predicates = {pred.key: pred.i for pred in extra_predicates} seen_predicates = {pred.key: pred.i for pred in extra_predicates}
output = [] output = []
for attr, value in spec.items(): for attr, value in spec.items():
if isinstance(attr, basestring): if isinstance(attr, str):
if attr == "_": if attr == "_":
output.extend( output.extend(
_get_extension_extra_predicates( _get_extension_extra_predicates(
@ -995,7 +998,7 @@ def _get_operators(spec):
"?": (ZERO_ONE,), "1": (ONE,), "!": (ZERO,)} "?": (ZERO_ONE,), "1": (ONE,), "!": (ZERO,)}
# Fix casing # Fix casing
spec = {key.upper(): values for key, values in spec.items() spec = {key.upper(): values for key, values in spec.items()
if isinstance(key, basestring)} if isinstance(key, str)}
if "OP" not in spec: if "OP" not in spec:
return (ONE,) return (ONE,)
elif spec["OP"] in lookup: elif spec["OP"] in lookup:
@ -1013,7 +1016,7 @@ def _get_extensions(spec, string_store, name2index):
if isinstance(value, dict): if isinstance(value, dict):
# Handle predicates (e.g. "IN", in the extra_predicates, not here. # Handle predicates (e.g. "IN", in the extra_predicates, not here.
continue continue
if isinstance(value, basestring): if isinstance(value, str):
value = string_store.add(value) value = string_store.add(value)
if name not in name2index: if name not in name2index:
name2index[name] = len(name2index) name2index[name] = len(name2index)
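With the ENT_IOB handling above, string values "B"/"I"/"O" in token patterns are mapped to the integer codes from Token.iob_strings() rather than interned as strings. A sketch assuming a pipeline with an NER component (en_core_web_sm is only an example):

    import spacy
    from spacy.matcher import Matcher

    nlp = spacy.load("en_core_web_sm")
    matcher = Matcher(nlp.vocab)
    matcher.add("ENT_START", [[{"ENT_IOB": "B"}]])   # matches entity-initial tokens
    doc = nlp("Apple is looking at buying U.K. startup DeepMind.")
    print([doc[start:end].text for _, start, end in matcher(doc)])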

View File

@ -8,12 +8,9 @@ class PhraseMatcher:
def __init__( def __init__(
self, vocab: Vocab, attr: Optional[Union[int, str]], validate: bool = ... self, vocab: Vocab, attr: Optional[Union[int, str]], validate: bool = ...
) -> None: ... ) -> None: ...
def __call__( def __reduce__(self) -> Any: ...
self, def __len__(self) -> int: ...
doclike: Union[Doc, Span], def __contains__(self, key: str) -> bool: ...
*,
as_spans: bool = ...,
) -> Union[List[Tuple[int, int, int]], List[Span]]: ...
def add( def add(
self, self,
key: str, key: str,
@ -23,3 +20,10 @@ class PhraseMatcher:
Callable[[Matcher, Doc, int, List[Tuple[Any, ...]]], Any] Callable[[Matcher, Doc, int, List[Tuple[Any, ...]]], Any]
] = ..., ] = ...,
) -> None: ... ) -> None: ...
def remove(self, key: str) -> None: ...
def __call__(
self,
doclike: Union[Doc, Span],
*,
as_spans: bool = ...,
) -> Union[List[Tuple[int, int, int]], List[Span]]: ...
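The stubs above now cover the existing runtime API (__len__, __contains__, remove, __reduce__); a minimal sketch of the calls they describe:

    import spacy
    from spacy.matcher import PhraseMatcher

    nlp = spacy.blank("en")
    matcher = PhraseMatcher(nlp.vocab)
    matcher.add("TOOLS", [nlp.make_doc("spaCy"), nlp.make_doc("Thinc")])
    print(len(matcher), "TOOLS" in matcher)   # 1 True
    matcher.remove("TOOLS")
    print(len(matcher), "TOOLS" in matcher)   # 0 False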

View File

@ -28,7 +28,13 @@ def forward(
X, spans = source_spans X, spans = source_spans
assert spans.dataXd.ndim == 2 assert spans.dataXd.ndim == 2
indices = _get_span_indices(ops, spans, X.lengths) indices = _get_span_indices(ops, spans, X.lengths)
Y = Ragged(X.dataXd[indices], spans.dataXd[:, 1] - spans.dataXd[:, 0]) # type: ignore[arg-type, index] if len(indices) > 0:
Y = Ragged(X.dataXd[indices], spans.dataXd[:, 1] - spans.dataXd[:, 0]) # type: ignore[arg-type, index]
else:
Y = Ragged(
ops.xp.zeros(X.dataXd.shape, dtype=X.dataXd.dtype),
ops.xp.zeros((len(X.lengths),), dtype="i"),
)
x_shape = X.dataXd.shape x_shape = X.dataXd.shape
x_lengths = X.lengths x_lengths = X.lengths
@ -53,7 +59,7 @@ def _get_span_indices(ops, spans: Ragged, lengths: Ints1d) -> Ints1d:
for j in range(spans_i.shape[0]): for j in range(spans_i.shape[0]):
indices.append(ops.xp.arange(spans_i[j, 0], spans_i[j, 1])) # type: ignore[call-overload, index] indices.append(ops.xp.arange(spans_i[j, 0], spans_i[j, 1])) # type: ignore[call-overload, index]
offset += length offset += length
return ops.flatten(indices) return ops.flatten(indices, dtype="i", ndim_if_empty=1)
def _ensure_cpu(spans: Ragged, lengths: Ints1d) -> Tuple[Ragged, Ints1d]: def _ensure_cpu(spans: Ragged, lengths: Ints1d) -> Tuple[Ragged, Ints1d]:

View File

@ -23,7 +23,7 @@ def create_pretrain_vectors(
maxout_pieces: int, hidden_size: int, loss: str maxout_pieces: int, hidden_size: int, loss: str
) -> Callable[["Vocab", Model], Model]: ) -> Callable[["Vocab", Model], Model]:
def create_vectors_objective(vocab: "Vocab", tok2vec: Model) -> Model: def create_vectors_objective(vocab: "Vocab", tok2vec: Model) -> Model:
if vocab.vectors.data.shape[1] == 0: if vocab.vectors.shape[1] == 0:
raise ValueError(Errors.E875) raise ValueError(Errors.E875)
model = build_cloze_multi_task_model( model = build_cloze_multi_task_model(
vocab, tok2vec, hidden_size=hidden_size, maxout_pieces=maxout_pieces vocab, tok2vec, hidden_size=hidden_size, maxout_pieces=maxout_pieces
@ -116,7 +116,7 @@ def build_multi_task_model(
def build_cloze_multi_task_model( def build_cloze_multi_task_model(
vocab: "Vocab", tok2vec: Model, maxout_pieces: int, hidden_size: int vocab: "Vocab", tok2vec: Model, maxout_pieces: int, hidden_size: int
) -> Model: ) -> Model:
nO = vocab.vectors.data.shape[1] nO = vocab.vectors.shape[1]
output_layer = chain( output_layer = chain(
cast(Model[List["Floats2d"], Floats2d], list2array()), cast(Model[List["Floats2d"], Floats2d], list2array()),
Maxout( Maxout(

View File

@ -53,7 +53,7 @@ def build_hash_embed_cnn_tok2vec(
window_size (int): The number of tokens on either side to concatenate during window_size (int): The number of tokens on either side to concatenate during
the convolutions. The receptive field of the CNN will be the convolutions. The receptive field of the CNN will be
depth * (window_size * 2 + 1), so a 4-layer network with window_size of depth * (window_size * 2 + 1), so a 4-layer network with window_size of
2 will be sensitive to 17 words at a time. Recommended value is 1. 2 will be sensitive to 20 words at a time. Recommended value is 1.
embed_size (int): The number of rows in the hash embedding tables. This can embed_size (int): The number of rows in the hash embedding tables. This can
be surprisingly small, due to the use of the hash embeddings. Recommended be surprisingly small, due to the use of the hash embeddings. Recommended
values are between 2000 and 10000. values are between 2000 and 10000.
@ -123,7 +123,7 @@ def MultiHashEmbed(
attributes are NORM, PREFIX, SUFFIX and SHAPE. This lets the model take into attributes are NORM, PREFIX, SUFFIX and SHAPE. This lets the model take into
account some subword information, without constructing a fully character-based account some subword information, without constructing a fully character-based
representation. If pretrained vectors are available, they can be included in representation. If pretrained vectors are available, they can be included in
the representation as well, with the vectors table will be kept static the representation as well, with the vectors table kept static
(i.e. it's not updated). (i.e. it's not updated).
The `width` parameter specifies the output width of the layer and the widths The `width` parameter specifies the output width of the layer and the widths

View File

@ -1,11 +1,13 @@
from typing import List, Tuple, Callable, Optional, cast from typing import List, Tuple, Callable, Optional, Sequence, cast
from thinc.initializers import glorot_uniform_init from thinc.initializers import glorot_uniform_init
from thinc.util import partial from thinc.util import partial
from thinc.types import Ragged, Floats2d, Floats1d from thinc.types import Ragged, Floats2d, Floats1d, Ints1d
from thinc.api import Model, Ops, registry from thinc.api import Model, Ops, registry
from ..tokens import Doc from ..tokens import Doc
from ..errors import Errors from ..errors import Errors
from ..vectors import Mode
from ..vocab import Vocab
@registry.layers("spacy.StaticVectors.v2") @registry.layers("spacy.StaticVectors.v2")
@ -34,20 +36,32 @@ def StaticVectors(
def forward( def forward(
model: Model[List[Doc], Ragged], docs: List[Doc], is_train: bool model: Model[List[Doc], Ragged], docs: List[Doc], is_train: bool
) -> Tuple[Ragged, Callable]: ) -> Tuple[Ragged, Callable]:
if not sum(len(doc) for doc in docs): token_count = sum(len(doc) for doc in docs)
if not token_count:
return _handle_empty(model.ops, model.get_dim("nO")) return _handle_empty(model.ops, model.get_dim("nO"))
key_attr = model.attrs["key_attr"] key_attr: int = model.attrs["key_attr"]
W = cast(Floats2d, model.ops.as_contig(model.get_param("W"))) keys: Ints1d = model.ops.flatten(
V = cast(Floats2d, model.ops.asarray(docs[0].vocab.vectors.data)) cast(Sequence, [doc.to_array(key_attr) for doc in docs])
rows = model.ops.flatten(
[doc.vocab.vectors.find(keys=doc.to_array(key_attr)) for doc in docs]
) )
vocab: Vocab = docs[0].vocab
W = cast(Floats2d, model.ops.as_contig(model.get_param("W")))
if vocab.vectors.mode == Mode.default:
V = cast(Floats2d, model.ops.asarray(vocab.vectors.data))
rows = vocab.vectors.find(keys=keys)
V = model.ops.as_contig(V[rows])
elif vocab.vectors.mode == Mode.floret:
V = cast(Floats2d, vocab.vectors.get_batch(keys))
V = model.ops.as_contig(V)
else:
raise RuntimeError(Errors.E896)
try: try:
vectors_data = model.ops.gemm(model.ops.as_contig(V[rows]), W, trans2=True) vectors_data = model.ops.gemm(V, W, trans2=True)
except ValueError: except ValueError:
raise RuntimeError(Errors.E896) raise RuntimeError(Errors.E896)
# Convert negative indices to 0-vectors (TODO: more options for UNK tokens) if vocab.vectors.mode == Mode.default:
vectors_data[rows < 0] = 0 # Convert negative indices to 0-vectors
# TODO: more options for UNK tokens
vectors_data[rows < 0] = 0
output = Ragged( output = Ragged(
vectors_data, model.ops.asarray([len(doc) for doc in docs], dtype="i") # type: ignore vectors_data, model.ops.asarray([len(doc) for doc in docs], dtype="i") # type: ignore
) )
@ -63,7 +77,7 @@ def forward(
model.inc_grad( model.inc_grad(
"W", "W",
model.ops.gemm( model.ops.gemm(
cast(Floats2d, d_output.data), model.ops.as_contig(V[rows]), trans1=True cast(Floats2d, d_output.data), model.ops.as_contig(V), trans1=True
), ),
) )
return [] return []
@ -80,7 +94,7 @@ def init(
nM = model.get_dim("nM") if model.has_dim("nM") else None nM = model.get_dim("nM") if model.has_dim("nM") else None
nO = model.get_dim("nO") if model.has_dim("nO") else None nO = model.get_dim("nO") if model.has_dim("nO") else None
if X is not None and len(X): if X is not None and len(X):
nM = X[0].vocab.vectors.data.shape[1] nM = X[0].vocab.vectors.shape[1]
if Y is not None: if Y is not None:
nO = Y.data.shape[1] nO = Y.data.shape[1]

View File

@@ -1,3 +1,4 @@
+from cython.operator cimport dereference as deref, preincrement as incr
 from libc.string cimport memcpy, memset
 from libc.stdlib cimport calloc, free
 from libc.stdint cimport uint32_t, uint64_t
@@ -185,16 +186,20 @@ cdef cppclass StateC:
     int L(int head, int idx) nogil const:
         if idx < 1 or this._left_arcs.size() == 0:
             return -1
-        cdef vector[int] lefts
-        for i in range(this._left_arcs.size()):
-            arc = this._left_arcs.at(i)
+        # Work backwards through left-arcs to find the arc at the
+        # requested index more quickly.
+        cdef size_t child_index = 0
+        it = this._left_arcs.const_rbegin()
+        while it != this._left_arcs.rend():
+            arc = deref(it)
             if arc.head == head and arc.child != -1 and arc.child < head:
-                lefts.push_back(arc.child)
-        idx = (<int>lefts.size()) - idx
-        if idx < 0:
-            return -1
-        else:
-            return lefts.at(idx)
+                child_index += 1
+                if child_index == idx:
+                    return arc.child
+            incr(it)
+        return -1

     int R(int head, int idx) nogil const:
         if idx < 1 or this._right_arcs.size() == 0:
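A rough Python analogue of the reversed scan in `L()` above: instead of materialising every left child and indexing from the end, it walks the arc list backwards and stops at the idx-th match. This assumes `left_arcs` is a sequence of arc objects with `head` and `child` fields in insertion order:

def nth_left_child(left_arcs, head, idx):
    if idx < 1 or not left_arcs:
        return -1
    matches = 0
    for arc in reversed(left_arcs):
        if arc.head == head and arc.child != -1 and arc.child < head:
            matches += 1
            if matches == idx:
                return arc.child
    return -1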


@ -17,7 +17,7 @@ from ...errors import Errors
from thinc.extra.search cimport Beam from thinc.extra.search cimport Beam
cdef weight_t MIN_SCORE = -90000 cdef weight_t MIN_SCORE = -90000
cdef attr_t SUBTOK_LABEL = hash_string(u'subtok') cdef attr_t SUBTOK_LABEL = hash_string('subtok')
DEF NON_MONOTONIC = True DEF NON_MONOTONIC = True
@ -585,7 +585,10 @@ cdef class ArcEager(TransitionSystem):
actions[RIGHT][label] = 1 actions[RIGHT][label] = 1
actions[REDUCE][label] = 1 actions[REDUCE][label] = 1
for example in kwargs.get('examples', []): for example in kwargs.get('examples', []):
heads, labels = example.get_aligned_parse(projectivize=True) # use heads and labels from the reference parse (without regard to
# misalignments between the predicted and reference)
example_gold_preproc = Example(example.reference, example.reference)
heads, labels = example_gold_preproc.get_aligned_parse(projectivize=True)
for child, (head, label) in enumerate(zip(heads, labels)): for child, (head, label) in enumerate(zip(heads, labels)):
if head is None or label is None: if head is None or label is None:
continue continue
@ -601,7 +604,7 @@ cdef class ArcEager(TransitionSystem):
actions[SHIFT][''] += 1 actions[SHIFT][''] += 1
if min_freq is not None: if min_freq is not None:
for action, label_freqs in actions.items(): for action, label_freqs in actions.items():
for label, freq in list(label_freqs.items()): for label, freq in label_freqs.copy().items():
if freq < min_freq: if freq < min_freq:
label_freqs.pop(label) label_freqs.pop(label)
# Ensure these actions are present # Ensure these actions are present
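The comment in the hunk above refers to building a "gold preprocessing" view of each training Example, so that action and label counts come only from the reference parse and ignore alignment against the predicted tokenization. A minimal sketch of that construction, using the same names as the diff (assumes `example` is an existing spacy.training.Example):

from spacy.training import Example

example_gold_preproc = Example(example.reference, example.reference)
heads, labels = example_gold_preproc.get_aligned_parse(projectivize=True)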


@ -5,15 +5,15 @@ from pathlib import Path
from .pipe import Pipe from .pipe import Pipe
from ..errors import Errors from ..errors import Errors
from ..training import validate_examples, Example from ..training import Example
from ..language import Language from ..language import Language
from ..matcher import Matcher from ..matcher import Matcher
from ..scorer import Scorer from ..scorer import Scorer
from ..symbols import IDS, TAG, POS, MORPH, LEMMA from ..symbols import IDS
from ..tokens import Doc, Span from ..tokens import Doc, Span
from ..tokens._retokenize import normalize_token_attrs, set_token_attrs from ..tokens._retokenize import normalize_token_attrs, set_token_attrs
from ..vocab import Vocab from ..vocab import Vocab
from ..util import SimpleFrozenList from ..util import SimpleFrozenList, registry
from .. import util from .. import util
@ -23,9 +23,41 @@ TagMapType = Dict[str, Dict[Union[int, str], Union[int, str]]]
MorphRulesType = Dict[str, Dict[str, Dict[Union[int, str], Union[int, str]]]] MorphRulesType = Dict[str, Dict[str, Dict[Union[int, str], Union[int, str]]]]
-@Language.factory("attribute_ruler", default_config={"validate": False})
-def make_attribute_ruler(nlp: Language, name: str, validate: bool):
-    return AttributeRuler(nlp.vocab, name, validate=validate)
+@Language.factory(
+    "attribute_ruler",
+    default_config={
+        "validate": False,
+        "scorer": {"@scorers": "spacy.attribute_ruler_scorer.v1"},
+    },
+)
+def make_attribute_ruler(
+    nlp: Language, name: str, validate: bool, scorer: Optional[Callable]
+):
+    return AttributeRuler(nlp.vocab, name, validate=validate, scorer=scorer)
def attribute_ruler_score(examples: Iterable[Example], **kwargs) -> Dict[str, Any]:
def morph_key_getter(token, attr):
return getattr(token, attr).key
results = {}
results.update(Scorer.score_token_attr(examples, "tag", **kwargs))
results.update(Scorer.score_token_attr(examples, "pos", **kwargs))
results.update(
Scorer.score_token_attr(examples, "morph", getter=morph_key_getter, **kwargs)
)
results.update(
Scorer.score_token_attr_per_feat(
examples, "morph", getter=morph_key_getter, **kwargs
)
)
results.update(Scorer.score_token_attr(examples, "lemma", **kwargs))
return results
@registry.scorers("spacy.attribute_ruler_scorer.v1")
def make_attribute_ruler_scorer():
return attribute_ruler_score
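The pattern above replaces the component's `score` method with a scorer function resolved from the `scorers` registry, so it can be swapped out in the pipeline config (a `[components.attribute_ruler.scorer]` block referencing a registered name). A minimal sketch of registering an alternative scorer; the name "my_attribute_ruler_scorer.v1" and the reported metric are illustrative:

from typing import Any, Dict, Iterable
from spacy.scorer import Scorer
from spacy.training import Example
from spacy.util import registry

@registry.scorers("my_attribute_ruler_scorer.v1")
def make_custom_scorer():
    def score(examples: Iterable[Example], **kwargs) -> Dict[str, Any]:
        # only report lemma accuracy, for example
        return Scorer.score_token_attr(examples, "lemma", **kwargs)
    return score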
class AttributeRuler(Pipe): class AttributeRuler(Pipe):
@ -36,7 +68,12 @@ class AttributeRuler(Pipe):
""" """
def __init__( def __init__(
self, vocab: Vocab, name: str = "attribute_ruler", *, validate: bool = False self,
vocab: Vocab,
name: str = "attribute_ruler",
*,
validate: bool = False,
scorer: Optional[Callable] = attribute_ruler_score,
) -> None: ) -> None:
"""Create the AttributeRuler. After creation, you can add patterns """Create the AttributeRuler. After creation, you can add patterns
with the `.initialize()` or `.add_patterns()` methods, or load patterns with the `.initialize()` or `.add_patterns()` methods, or load patterns
@ -45,6 +82,10 @@ class AttributeRuler(Pipe):
vocab (Vocab): The vocab. vocab (Vocab): The vocab.
name (str): The pipe name. Defaults to "attribute_ruler". name (str): The pipe name. Defaults to "attribute_ruler".
scorer (Optional[Callable]): The scoring method. Defaults to
Scorer.score_token_attr for the attributes "tag", "pos", "morph" and
"lemma" and Scorer.score_token_attr_per_feat for the attribute
"morph".
RETURNS (AttributeRuler): The AttributeRuler component. RETURNS (AttributeRuler): The AttributeRuler component.
@ -57,6 +98,7 @@ class AttributeRuler(Pipe):
self.attrs: List[Dict] = [] self.attrs: List[Dict] = []
self._attrs_unnormed: List[Dict] = [] # store for reference self._attrs_unnormed: List[Dict] = [] # store for reference
self.indices: List[int] = [] self.indices: List[int] = []
self.scorer = scorer
def clear(self) -> None: def clear(self) -> None:
"""Reset all patterns.""" """Reset all patterns."""
@ -228,45 +270,6 @@ class AttributeRuler(Pipe):
all_patterns.append(p) all_patterns.append(p)
return all_patterns # type: ignore[return-value] return all_patterns # type: ignore[return-value]
def score(self, examples: Iterable[Example], **kwargs) -> Dict[str, Any]:
"""Score a batch of examples.
examples (Iterable[Example]): The examples to score.
RETURNS (Dict[str, Any]): The scores, produced by
Scorer.score_token_attr for the attributes "tag", "pos", "morph"
and "lemma" for the target token attributes.
DOCS: https://spacy.io/api/tagger#score
"""
def morph_key_getter(token, attr):
return getattr(token, attr).key
validate_examples(examples, "AttributeRuler.score")
results = {}
attrs = set() # type: ignore
for token_attrs in self.attrs:
attrs.update(token_attrs)
for attr in attrs:
if attr == TAG:
results.update(Scorer.score_token_attr(examples, "tag", **kwargs))
elif attr == POS:
results.update(Scorer.score_token_attr(examples, "pos", **kwargs))
elif attr == MORPH:
results.update(
Scorer.score_token_attr(
examples, "morph", getter=morph_key_getter, **kwargs
)
)
results.update(
Scorer.score_token_attr_per_feat(
examples, "morph", getter=morph_key_getter, **kwargs
)
)
elif attr == LEMMA:
results.update(Scorer.score_token_attr(examples, "lemma", **kwargs))
return results
def to_bytes(self, exclude: Iterable[str] = SimpleFrozenList()) -> bytes: def to_bytes(self, exclude: Iterable[str] = SimpleFrozenList()) -> bytes:
"""Serialize the AttributeRuler to a bytestring. """Serialize the AttributeRuler to a bytestring.


@ -1,6 +1,6 @@
# cython: infer_types=True, profile=True, binding=True # cython: infer_types=True, profile=True, binding=True
from collections import defaultdict from collections import defaultdict
from typing import Optional, Iterable from typing import Optional, Iterable, Callable
from thinc.api import Model, Config from thinc.api import Model, Config
from ._parser_internals.transition_system import TransitionSystem from ._parser_internals.transition_system import TransitionSystem
@ -12,7 +12,7 @@ from ..language import Language
from ._parser_internals import nonproj from ._parser_internals import nonproj
from ._parser_internals.nonproj import DELIMITER from ._parser_internals.nonproj import DELIMITER
from ..scorer import Scorer from ..scorer import Scorer
from ..training import validate_examples from ..util import registry
default_model_config = """ default_model_config = """
@ -45,6 +45,7 @@ DEFAULT_PARSER_MODEL = Config().from_str(default_model_config)["model"]
"learn_tokens": False, "learn_tokens": False,
"min_action_freq": 30, "min_action_freq": 30,
"model": DEFAULT_PARSER_MODEL, "model": DEFAULT_PARSER_MODEL,
"scorer": {"@scorers": "spacy.parser_scorer.v1"},
}, },
default_score_weights={ default_score_weights={
"dep_uas": 0.5, "dep_uas": 0.5,
@ -63,6 +64,7 @@ def make_parser(
update_with_oracle_cut_size: int, update_with_oracle_cut_size: int,
learn_tokens: bool, learn_tokens: bool,
min_action_freq: int, min_action_freq: int,
scorer: Optional[Callable],
): ):
"""Create a transition-based DependencyParser component. The dependency parser """Create a transition-based DependencyParser component. The dependency parser
jointly learns sentence segmentation and labelled dependency parsing, and can jointly learns sentence segmentation and labelled dependency parsing, and can
@ -99,6 +101,7 @@ def make_parser(
primarily affects the label accuracy, it can also affect the attachment primarily affects the label accuracy, it can also affect the attachment
structure, as the labels are used to represent the pseudo-projectivity structure, as the labels are used to represent the pseudo-projectivity
transformation. transformation.
scorer (Optional[Callable]): The scoring method.
""" """
return DependencyParser( return DependencyParser(
nlp.vocab, nlp.vocab,
@ -115,6 +118,7 @@ def make_parser(
# At some point in the future we can try to implement support for # At some point in the future we can try to implement support for
# partial annotations, perhaps only in the beam objective. # partial annotations, perhaps only in the beam objective.
incorrect_spans_key=None, incorrect_spans_key=None,
scorer=scorer,
) )
@ -130,6 +134,7 @@ def make_parser(
"learn_tokens": False, "learn_tokens": False,
"min_action_freq": 30, "min_action_freq": 30,
"model": DEFAULT_PARSER_MODEL, "model": DEFAULT_PARSER_MODEL,
"scorer": {"@scorers": "spacy.parser_scorer.v1"},
}, },
default_score_weights={ default_score_weights={
"dep_uas": 0.5, "dep_uas": 0.5,
@ -151,6 +156,7 @@ def make_beam_parser(
beam_width: int, beam_width: int,
beam_density: float, beam_density: float,
beam_update_prob: float, beam_update_prob: float,
scorer: Optional[Callable],
): ):
"""Create a transition-based DependencyParser component that uses beam-search. """Create a transition-based DependencyParser component that uses beam-search.
The dependency parser jointly learns sentence segmentation and labelled The dependency parser jointly learns sentence segmentation and labelled
@ -208,9 +214,40 @@ def make_beam_parser(
# At some point in the future we can try to implement support for # At some point in the future we can try to implement support for
# partial annotations, perhaps only in the beam objective. # partial annotations, perhaps only in the beam objective.
incorrect_spans_key=None, incorrect_spans_key=None,
scorer=scorer,
) )
def parser_score(examples, **kwargs):
"""Score a batch of examples.
examples (Iterable[Example]): The examples to score.
RETURNS (Dict[str, Any]): The scores, produced by Scorer.score_spans
and Scorer.score_deps.
DOCS: https://spacy.io/api/dependencyparser#score
"""
def has_sents(doc):
return doc.has_annotation("SENT_START")
def dep_getter(token, attr):
dep = getattr(token, attr)
dep = token.vocab.strings.as_string(dep).lower()
return dep
results = {}
results.update(Scorer.score_spans(examples, "sents", has_annotation=has_sents, **kwargs))
kwargs.setdefault("getter", dep_getter)
kwargs.setdefault("ignore_labels", ("p", "punct"))
results.update(Scorer.score_deps(examples, "dep", **kwargs))
del results["sents_per_type"]
return results
@registry.scorers("spacy.parser_scorer.v1")
def make_parser_scorer():
return parser_score
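Note that `parser_score` seeds its defaults with `setdefault`, so callers can still override them; for instance, attachments labelled "p"/"punct" are excluded from the dependency scores unless a different `ignore_labels` is passed. A hedged sketch of calling the registered scorer directly, assuming `examples` is a list of Example objects with reference parses:

scores = parser_score(examples)                               # punctuation excluded from UAS/LAS
scores_with_punct = parser_score(examples, ignore_labels=())  # include punctuation attachments
print(scores["dep_uas"], scores["dep_las"], scores["sents_f"])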
class DependencyParser(Parser): class DependencyParser(Parser):
"""Pipeline component for dependency parsing. """Pipeline component for dependency parsing.
@ -234,6 +271,7 @@ class DependencyParser(Parser):
beam_update_prob=0.0, beam_update_prob=0.0,
multitasks=tuple(), multitasks=tuple(),
incorrect_spans_key=None, incorrect_spans_key=None,
scorer=parser_score,
): ):
"""Create a DependencyParser.""" """Create a DependencyParser."""
super().__init__( super().__init__(
@ -249,6 +287,7 @@ class DependencyParser(Parser):
beam_update_prob=beam_update_prob, beam_update_prob=beam_update_prob,
multitasks=multitasks, multitasks=multitasks,
incorrect_spans_key=incorrect_spans_key, incorrect_spans_key=incorrect_spans_key,
scorer=scorer,
) )
@property @property
@ -281,36 +320,6 @@ class DependencyParser(Parser):
labels.add(label) labels.add(label)
return tuple(sorted(labels)) return tuple(sorted(labels))
def score(self, examples, **kwargs):
"""Score a batch of examples.
examples (Iterable[Example]): The examples to score.
RETURNS (Dict[str, Any]): The scores, produced by Scorer.score_spans
and Scorer.score_deps.
DOCS: https://spacy.io/api/dependencyparser#score
"""
def has_sents(doc):
return doc.has_annotation("SENT_START")
validate_examples(examples, "DependencyParser.score")
def dep_getter(token, attr):
dep = getattr(token, attr)
dep = token.vocab.strings.as_string(dep).lower()
return dep
results = {}
results.update(
Scorer.score_spans(examples, "sents", has_annotation=has_sents, **kwargs)
)
kwargs.setdefault("getter", dep_getter)
kwargs.setdefault("ignore_labels", ("p", "punct"))
results.update(Scorer.score_deps(examples, "dep", **kwargs))
del results["sents_per_type"]
return results
def scored_parses(self, beams): def scored_parses(self, beams):
"""Return two dictionaries with scores for each beam/doc that was processed: """Return two dictionaries with scores for each beam/doc that was processed:
one containing (i, head) keys, and another containing (i, label) keys. one containing (i, head) keys, and another containing (i, label) keys.


@ -17,10 +17,12 @@ from ..language import Language
from ..vocab import Vocab from ..vocab import Vocab
from ..training import Example, validate_examples, validate_get_examples from ..training import Example, validate_examples, validate_get_examples
from ..errors import Errors, Warnings from ..errors import Errors, Warnings
from ..util import SimpleFrozenList from ..util import SimpleFrozenList, registry
from .. import util from .. import util
from ..scorer import Scorer from ..scorer import Scorer
# See #9050
BACKWARD_OVERWRITE = True
default_model_config = """ default_model_config = """
[model] [model]
@ -51,6 +53,8 @@ DEFAULT_NEL_MODEL = Config().from_str(default_model_config)["model"]
"incl_context": True, "incl_context": True,
"entity_vector_length": 64, "entity_vector_length": 64,
"get_candidates": {"@misc": "spacy.CandidateGenerator.v1"}, "get_candidates": {"@misc": "spacy.CandidateGenerator.v1"},
"overwrite": True,
"scorer": {"@scorers": "spacy.entity_linker_scorer.v1"},
}, },
default_score_weights={ default_score_weights={
"nel_micro_f": 1.0, "nel_micro_f": 1.0,
@ -69,6 +73,8 @@ def make_entity_linker(
incl_context: bool, incl_context: bool,
entity_vector_length: int, entity_vector_length: int,
get_candidates: Callable[[KnowledgeBase, Span], Iterable[Candidate]], get_candidates: Callable[[KnowledgeBase, Span], Iterable[Candidate]],
overwrite: bool,
scorer: Optional[Callable],
): ):
"""Construct an EntityLinker component. """Construct an EntityLinker component.
@ -82,6 +88,7 @@ def make_entity_linker(
entity_vector_length (int): Size of encoding vectors in the KB. entity_vector_length (int): Size of encoding vectors in the KB.
get_candidates (Callable[[KnowledgeBase, "Span"], Iterable[Candidate]]): Function that get_candidates (Callable[[KnowledgeBase, "Span"], Iterable[Candidate]]): Function that
produces a list of candidates, given a certain knowledge base and a textual mention. produces a list of candidates, given a certain knowledge base and a textual mention.
scorer (Optional[Callable]): The scoring method.
""" """
return EntityLinker( return EntityLinker(
nlp.vocab, nlp.vocab,
@ -93,9 +100,20 @@ def make_entity_linker(
incl_context=incl_context, incl_context=incl_context,
entity_vector_length=entity_vector_length, entity_vector_length=entity_vector_length,
get_candidates=get_candidates, get_candidates=get_candidates,
overwrite=overwrite,
scorer=scorer,
) )
def entity_linker_score(examples, **kwargs):
return Scorer.score_links(examples, negative_labels=[EntityLinker.NIL], **kwargs)
@registry.scorers("spacy.entity_linker_scorer.v1")
def make_entity_linker_scorer():
return entity_linker_score
class EntityLinker(TrainablePipe): class EntityLinker(TrainablePipe):
"""Pipeline component for named entity linking. """Pipeline component for named entity linking.
@ -116,6 +134,8 @@ class EntityLinker(TrainablePipe):
incl_context: bool, incl_context: bool,
entity_vector_length: int, entity_vector_length: int,
get_candidates: Callable[[KnowledgeBase, Span], Iterable[Candidate]], get_candidates: Callable[[KnowledgeBase, Span], Iterable[Candidate]],
overwrite: bool = BACKWARD_OVERWRITE,
scorer: Optional[Callable] = entity_linker_score,
) -> None: ) -> None:
"""Initialize an entity linker. """Initialize an entity linker.
@ -130,6 +150,8 @@ class EntityLinker(TrainablePipe):
entity_vector_length (int): Size of encoding vectors in the KB. entity_vector_length (int): Size of encoding vectors in the KB.
get_candidates (Callable[[KnowledgeBase, Span], Iterable[Candidate]]): Function that get_candidates (Callable[[KnowledgeBase, Span], Iterable[Candidate]]): Function that
produces a list of candidates, given a certain knowledge base and a textual mention. produces a list of candidates, given a certain knowledge base and a textual mention.
scorer (Optional[Callable]): The scoring method. Defaults to
Scorer.score_links.
DOCS: https://spacy.io/api/entitylinker#init DOCS: https://spacy.io/api/entitylinker#init
""" """
@ -141,11 +163,12 @@ class EntityLinker(TrainablePipe):
self.incl_prior = incl_prior self.incl_prior = incl_prior
self.incl_context = incl_context self.incl_context = incl_context
self.get_candidates = get_candidates self.get_candidates = get_candidates
self.cfg: Dict[str, Any] = {} self.cfg: Dict[str, Any] = {"overwrite": overwrite}
self.distance = CosineDistance(normalize=False) self.distance = CosineDistance(normalize=False)
# how many neighbour sentences to take into account # how many neighbour sentences to take into account
# create an empty KB by default. If you want to load a predefined one, specify it in 'initialize'. # create an empty KB by default. If you want to load a predefined one, specify it in 'initialize'.
self.kb = empty_kb(entity_vector_length)(self.vocab) self.kb = empty_kb(entity_vector_length)(self.vocab)
self.scorer = scorer
def set_kb(self, kb_loader: Callable[[Vocab], KnowledgeBase]): def set_kb(self, kb_loader: Callable[[Vocab], KnowledgeBase]):
"""Define the KB of this pipe by providing a function that will """Define the KB of this pipe by providing a function that will
@ -384,23 +407,14 @@ class EntityLinker(TrainablePipe):
if count_ents != len(kb_ids): if count_ents != len(kb_ids):
raise ValueError(Errors.E148.format(ents=count_ents, ids=len(kb_ids))) raise ValueError(Errors.E148.format(ents=count_ents, ids=len(kb_ids)))
i = 0 i = 0
overwrite = self.cfg["overwrite"]
for doc in docs: for doc in docs:
for ent in doc.ents: for ent in doc.ents:
kb_id = kb_ids[i] kb_id = kb_ids[i]
i += 1 i += 1
for token in ent: for token in ent:
token.ent_kb_id_ = kb_id if token.ent_kb_id == 0 or overwrite:
token.ent_kb_id_ = kb_id
def score(self, examples, **kwargs):
"""Score a batch of examples.
examples (Iterable[Example]): The examples to score.
RETURNS (Dict[str, Any]): The scores.
DOCS TODO: https://spacy.io/api/entity_linker#score
"""
validate_examples(examples, "EntityLinker.score")
return Scorer.score_links(examples, negative_labels=[self.NIL])
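The `overwrite` flag introduced above controls whether predicted KB ids may replace ids that are already set on the tokens (previously they always did). A hedged configuration sketch, assuming an existing pipeline `nlp` whose knowledge base is set up elsewhere:

# keep KB ids written by an earlier component; only fill in the missing ones
nlp.add_pipe("entity_linker", config={"overwrite": False})
# (setting "overwrite": True restores the previous always-overwrite behaviour)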
def to_bytes(self, *, exclude=tuple()): def to_bytes(self, *, exclude=tuple()):
"""Serialize the pipe to a bytestring. """Serialize the pipe to a bytestring.


@ -9,11 +9,10 @@ from .pipe import Pipe
from ..training import Example from ..training import Example
from ..language import Language from ..language import Language
from ..errors import Errors, Warnings from ..errors import Errors, Warnings
from ..util import ensure_path, to_disk, from_disk, SimpleFrozenList from ..util import ensure_path, to_disk, from_disk, SimpleFrozenList, registry
from ..tokens import Doc, Span from ..tokens import Doc, Span
from ..matcher import Matcher, PhraseMatcher from ..matcher import Matcher, PhraseMatcher
from ..scorer import get_ner_prf from ..scorer import get_ner_prf
from ..training import validate_examples
DEFAULT_ENT_ID_SEP = "||" DEFAULT_ENT_ID_SEP = "||"
@ -28,6 +27,7 @@ PatternType = Dict[str, Union[str, List[Dict[str, Any]]]]
"validate": False, "validate": False,
"overwrite_ents": False, "overwrite_ents": False,
"ent_id_sep": DEFAULT_ENT_ID_SEP, "ent_id_sep": DEFAULT_ENT_ID_SEP,
"scorer": {"@scorers": "spacy.entity_ruler_scorer.v1"},
}, },
default_score_weights={ default_score_weights={
"ents_f": 1.0, "ents_f": 1.0,
@ -43,6 +43,7 @@ def make_entity_ruler(
validate: bool, validate: bool,
overwrite_ents: bool, overwrite_ents: bool,
ent_id_sep: str, ent_id_sep: str,
scorer: Optional[Callable],
): ):
return EntityRuler( return EntityRuler(
nlp, nlp,
@ -51,9 +52,19 @@ def make_entity_ruler(
validate=validate, validate=validate,
overwrite_ents=overwrite_ents, overwrite_ents=overwrite_ents,
ent_id_sep=ent_id_sep, ent_id_sep=ent_id_sep,
scorer=scorer,
) )
def entity_ruler_score(examples, **kwargs):
return get_ner_prf(examples)
@registry.scorers("spacy.entity_ruler_scorer.v1")
def make_entity_ruler_scorer():
return entity_ruler_score
class EntityRuler(Pipe): class EntityRuler(Pipe):
"""The EntityRuler lets you add spans to the `Doc.ents` using token-based """The EntityRuler lets you add spans to the `Doc.ents` using token-based
rules or exact phrase matches. It can be combined with the statistical rules or exact phrase matches. It can be combined with the statistical
@ -75,6 +86,7 @@ class EntityRuler(Pipe):
overwrite_ents: bool = False, overwrite_ents: bool = False,
ent_id_sep: str = DEFAULT_ENT_ID_SEP, ent_id_sep: str = DEFAULT_ENT_ID_SEP,
patterns: Optional[List[PatternType]] = None, patterns: Optional[List[PatternType]] = None,
scorer: Optional[Callable] = entity_ruler_score,
) -> None: ) -> None:
"""Initialize the entity ruler. If patterns are supplied here, they """Initialize the entity ruler. If patterns are supplied here, they
need to be a list of dictionaries with a `"label"` and `"pattern"` need to be a list of dictionaries with a `"label"` and `"pattern"`
@ -95,6 +107,8 @@ class EntityRuler(Pipe):
overwrite_ents (bool): If existing entities are present, e.g. entities overwrite_ents (bool): If existing entities are present, e.g. entities
added by the model, overwrite them by matches if necessary. added by the model, overwrite them by matches if necessary.
ent_id_sep (str): Separator used internally for entity IDs. ent_id_sep (str): Separator used internally for entity IDs.
scorer (Optional[Callable]): The scoring method. Defaults to
spacy.scorer.get_ner_prf.
DOCS: https://spacy.io/api/entityruler#init DOCS: https://spacy.io/api/entityruler#init
""" """
@ -113,6 +127,7 @@ class EntityRuler(Pipe):
self._ent_ids = defaultdict(tuple) # type: ignore self._ent_ids = defaultdict(tuple) # type: ignore
if patterns is not None: if patterns is not None:
self.add_patterns(patterns) self.add_patterns(patterns)
self.scorer = scorer
def __len__(self) -> int: def __len__(self) -> int:
"""The number of all patterns added to the entity ruler.""" """The number of all patterns added to the entity ruler."""
@ -333,6 +348,46 @@ class EntityRuler(Pipe):
self.nlp.vocab, attr=self.phrase_matcher_attr, validate=self._validate self.nlp.vocab, attr=self.phrase_matcher_attr, validate=self._validate
) )
def remove(self, ent_id: str) -> None:
"""Remove a pattern by its ent_id if a pattern with this ent_id was added before
ent_id (str): id of the pattern to be removed
RETURNS: None
DOCS: https://spacy.io/api/entityruler#remove
"""
label_id_pairs = [
(label, eid) for (label, eid) in self._ent_ids.values() if eid == ent_id
]
if not label_id_pairs:
raise ValueError(Errors.E1024.format(ent_id=ent_id))
created_labels = [
self._create_label(label, eid) for (label, eid) in label_id_pairs
]
# remove the patterns from self.phrase_patterns
self.phrase_patterns = defaultdict(
list,
{
label: val
for (label, val) in self.phrase_patterns.items()
if label not in created_labels
},
)
# remove the patterns from self.token_pattern
self.token_patterns = defaultdict(
list,
{
label: val
for (label, val) in self.token_patterns.items()
if label not in created_labels
},
)
# remove the patterns from the token matcher and phrase matcher
for label in created_labels:
if label in self.phrase_matcher:
self.phrase_matcher.remove(label)
else:
self.matcher.remove(label)
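A short usage sketch for the new `remove` method, assuming an existing pipeline `nlp` (pattern and id values are illustrative):

ruler = nlp.add_pipe("entity_ruler")
ruler.add_patterns([{"label": "ORG", "pattern": "Acme Corp", "id": "acme"}])
ruler.remove("acme")  # removes the token and phrase patterns added under this id
# ruler.remove("unknown-id") would raise ValueError (Errors.E1024)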
def _require_patterns(self) -> None: def _require_patterns(self) -> None:
"""Raise a warning if this component has no patterns defined.""" """Raise a warning if this component has no patterns defined."""
if len(self) == 0: if len(self) == 0:
@ -363,10 +418,6 @@ class EntityRuler(Pipe):
label = f"{label}{self.ent_id_sep}{ent_id}" label = f"{label}{self.ent_id_sep}{ent_id}"
return label return label
def score(self, examples, **kwargs):
validate_examples(examples, "EntityRuler.score")
return get_ner_prf(examples)
def from_bytes( def from_bytes(
self, patterns_bytes: bytes, *, exclude: Iterable[str] = SimpleFrozenList() self, patterns_bytes: bytes, *, exclude: Iterable[str] = SimpleFrozenList()
) -> "EntityRuler": ) -> "EntityRuler":
@@ -420,10 +471,16 @@ class EntityRuler(Pipe):
         path = ensure_path(path)
         self.clear()
         depr_patterns_path = path.with_suffix(".jsonl")
-        if depr_patterns_path.is_file():
+        if path.suffix == ".jsonl":  # user provides a jsonl
+            if path.is_file():
+                patterns = srsly.read_jsonl(path)
+                self.add_patterns(patterns)
+            else:
+                raise ValueError(Errors.E1023.format(path=path))
+        elif depr_patterns_path.is_file():
             patterns = srsly.read_jsonl(depr_patterns_path)
             self.add_patterns(patterns)
-        else:
+        elif path.is_dir():  # path is a valid directory
             cfg = {}
             deserializers_patterns = {
                 "patterns": lambda p: self.add_patterns(
@@ -440,6 +497,8 @@ class EntityRuler(Pipe):
                     self.nlp.vocab, attr=self.phrase_matcher_attr
                 )
             from_disk(path, deserializers_patterns, {})
+        else:  # path is not a valid directory or file
+            raise ValueError(Errors.E146.format(path=path))
         return self
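With the change above, `from_disk` distinguishes three cases: an explicit `.jsonl` file, the deprecated sibling `.jsonl` path, and a directory produced by `to_disk`; anything else raises E146. A sketch with illustrative paths, assuming an existing pipeline `nlp`:

ruler = nlp.add_pipe("entity_ruler")
ruler.from_disk("patterns.jsonl")            # one JSON pattern per line
ruler.from_disk("my_pipeline/entity_ruler")  # directory written by to_disk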
def to_disk( def to_disk(


@ -1,6 +1,8 @@
from typing import Dict, Any from typing import Dict, Any
import srsly import srsly
import warnings
from ..errors import Warnings
from ..language import Language from ..language import Language
from ..matcher import Matcher from ..matcher import Matcher
from ..tokens import Doc from ..tokens import Doc
@ -136,3 +138,65 @@ class TokenSplitter:
"cfg": lambda p: self._set_config(srsly.read_json(p)), "cfg": lambda p: self._set_config(srsly.read_json(p)),
} }
util.from_disk(path, serializers, []) util.from_disk(path, serializers, [])
@Language.factory(
"doc_cleaner",
default_config={"attrs": {"tensor": None, "_.trf_data": None}, "silent": True},
)
def make_doc_cleaner(nlp: Language, name: str, *, attrs: Dict[str, Any], silent: bool):
return DocCleaner(attrs, silent=silent)
class DocCleaner:
def __init__(self, attrs: Dict[str, Any], *, silent: bool = True):
self.cfg: Dict[str, Any] = {"attrs": dict(attrs), "silent": silent}
def __call__(self, doc: Doc) -> Doc:
attrs: dict = self.cfg["attrs"]
silent: bool = self.cfg["silent"]
for attr, value in attrs.items():
obj = doc
parts = attr.split(".")
skip = False
for part in parts[:-1]:
if hasattr(obj, part):
obj = getattr(obj, part)
else:
skip = True
if not silent:
warnings.warn(Warnings.W116.format(attr=attr))
if not skip:
if hasattr(obj, parts[-1]):
setattr(obj, parts[-1], value)
else:
if not silent:
warnings.warn(Warnings.W116.format(attr=attr))
return doc
def to_bytes(self, **kwargs):
serializers = {
"cfg": lambda: srsly.json_dumps(self.cfg),
}
return util.to_bytes(serializers, [])
def from_bytes(self, data, **kwargs):
deserializers = {
"cfg": lambda b: self.cfg.update(srsly.json_loads(b)),
}
util.from_bytes(data, deserializers, [])
return self
def to_disk(self, path, **kwargs):
path = util.ensure_path(path)
serializers = {
"cfg": lambda p: srsly.write_json(p, self.cfg),
}
return util.to_disk(path, serializers, [])
def from_disk(self, path, **kwargs):
path = util.ensure_path(path)
serializers = {
"cfg": lambda p: self.cfg.update(srsly.read_json(p)),
}
util.from_disk(path, serializers, [])
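The `doc_cleaner` component added above walks attribute paths such as "tensor" or "_.trf_data" on each Doc and resets them, which is mainly useful for trimming memory before serializing processed docs. A usage sketch, assuming an existing pipeline `nlp` (the config values shown are the defaults from this diff):

# reset doc.tensor and the custom extension doc._.trf_data on every document
nlp.add_pipe("doc_cleaner")

# only clear the tensor, and warn (W116) instead of silently skipping missing attributes
nlp.add_pipe("doc_cleaner", name="tensor_cleaner", config={"attrs": {"tensor": None}, "silent": False})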


@@ -12,21 +12,41 @@ from ..lookups import Lookups, load_lookups
 from ..scorer import Scorer
 from ..tokens import Doc, Token
 from ..vocab import Vocab
-from ..training import validate_examples
-from ..util import logger, SimpleFrozenList
+from ..util import logger, SimpleFrozenList, registry
 from .. import util

 @Language.factory(
     "lemmatizer",
     assigns=["token.lemma"],
-    default_config={"model": None, "mode": "lookup", "overwrite": False},
+    default_config={
+        "model": None,
+        "mode": "lookup",
+        "overwrite": False,
+        "scorer": {"@scorers": "spacy.lemmatizer_scorer.v1"},
+    },
     default_score_weights={"lemma_acc": 1.0},
 )
 def make_lemmatizer(
-    nlp: Language, model: Optional[Model], name: str, mode: str, overwrite: bool = False
+    nlp: Language,
+    model: Optional[Model],
+    name: str,
+    mode: str,
+    overwrite: bool,
+    scorer: Optional[Callable],
 ):
-    return Lemmatizer(nlp.vocab, model, name, mode=mode, overwrite=overwrite)
+    return Lemmatizer(
+        nlp.vocab, model, name, mode=mode, overwrite=overwrite, scorer=scorer
+    )
def lemmatizer_score(examples: Iterable[Example], **kwargs) -> Dict[str, Any]:
return Scorer.score_token_attr(examples, "lemma", **kwargs)
@registry.scorers("spacy.lemmatizer_scorer.v1")
def make_lemmatizer_scorer():
return lemmatizer_score
class Lemmatizer(Pipe): class Lemmatizer(Pipe):
@ -60,6 +80,7 @@ class Lemmatizer(Pipe):
*, *,
mode: str = "lookup", mode: str = "lookup",
overwrite: bool = False, overwrite: bool = False,
scorer: Optional[Callable] = lemmatizer_score,
) -> None: ) -> None:
"""Initialize a Lemmatizer. """Initialize a Lemmatizer.
@ -69,6 +90,8 @@ class Lemmatizer(Pipe):
mode (str): The lemmatizer mode: "lookup", "rule". Defaults to "lookup". mode (str): The lemmatizer mode: "lookup", "rule". Defaults to "lookup".
overwrite (bool): Whether to overwrite existing lemmas. Defaults to overwrite (bool): Whether to overwrite existing lemmas. Defaults to
`False`. `False`.
scorer (Optional[Callable]): The scoring method. Defaults to
Scorer.score_token_attr for the attribute "lemma".
DOCS: https://spacy.io/api/lemmatizer#init DOCS: https://spacy.io/api/lemmatizer#init
""" """
@ -89,6 +112,7 @@ class Lemmatizer(Pipe):
raise ValueError(Errors.E1003.format(mode=mode)) raise ValueError(Errors.E1003.format(mode=mode))
self.lemmatize = getattr(self, mode_attr) self.lemmatize = getattr(self, mode_attr)
self.cache = {} # type: ignore[var-annotated] self.cache = {} # type: ignore[var-annotated]
self.scorer = scorer
@property @property
def mode(self): def mode(self):
@ -247,17 +271,6 @@ class Lemmatizer(Pipe):
""" """
return False return False
def score(self, examples: Iterable[Example], **kwargs) -> Dict[str, Any]:
"""Score a batch of examples.
examples (Iterable[Example]): The examples to score.
RETURNS (Dict[str, Any]): The scores.
DOCS: https://spacy.io/api/lemmatizer#score
"""
validate_examples(examples, "Lemmatizer.score")
return Scorer.score_token_attr(examples, "lemma", **kwargs)
def to_disk( def to_disk(
self, path: Union[str, Path], *, exclude: Iterable[str] = SimpleFrozenList() self, path: Union[str, Path], *, exclude: Iterable[str] = SimpleFrozenList()
): ):


@ -1,5 +1,5 @@
# cython: infer_types=True, profile=True, binding=True # cython: infer_types=True, profile=True, binding=True
from typing import Optional, Union, Dict from typing import Optional, Union, Dict, Callable
import srsly import srsly
from thinc.api import SequenceCategoricalCrossentropy, Model, Config from thinc.api import SequenceCategoricalCrossentropy, Model, Config
from itertools import islice from itertools import islice
@ -17,7 +17,11 @@ from .tagger import Tagger
from .. import util from .. import util
from ..scorer import Scorer from ..scorer import Scorer
from ..training import validate_examples, validate_get_examples from ..training import validate_examples, validate_get_examples
from ..util import registry
# See #9050
BACKWARD_OVERWRITE = True
BACKWARD_EXTEND = False
default_model_config = """ default_model_config = """
[model] [model]
@@ -48,15 +52,35 @@ DEFAULT_MORPH_MODEL = Config().from_str(default_model_config)["model"]
 @Language.factory(
     "morphologizer",
     assigns=["token.morph", "token.pos"],
-    default_config={"model": DEFAULT_MORPH_MODEL},
+    default_config={"model": DEFAULT_MORPH_MODEL, "overwrite": True, "extend": False, "scorer": {"@scorers": "spacy.morphologizer_scorer.v1"}},
     default_score_weights={"pos_acc": 0.5, "morph_acc": 0.5, "morph_per_feat": None},
 )
 def make_morphologizer(
     nlp: Language,
     model: Model,
     name: str,
+    overwrite: bool,
+    extend: bool,
+    scorer: Optional[Callable],
 ):
-    return Morphologizer(nlp.vocab, model, name)
+    return Morphologizer(nlp.vocab, model, name, overwrite=overwrite, extend=extend, scorer=scorer)
def morphologizer_score(examples, **kwargs):
def morph_key_getter(token, attr):
return getattr(token, attr).key
results = {}
results.update(Scorer.score_token_attr(examples, "pos", **kwargs))
results.update(Scorer.score_token_attr(examples, "morph", getter=morph_key_getter, **kwargs))
results.update(Scorer.score_token_attr_per_feat(examples,
"morph", getter=morph_key_getter, **kwargs))
return results
@registry.scorers("spacy.morphologizer_scorer.v1")
def make_morphologizer_scorer():
return morphologizer_score
class Morphologizer(Tagger): class Morphologizer(Tagger):
@ -67,6 +91,10 @@ class Morphologizer(Tagger):
vocab: Vocab, vocab: Vocab,
model: Model, model: Model,
name: str = "morphologizer", name: str = "morphologizer",
*,
overwrite: bool = BACKWARD_OVERWRITE,
extend: bool = BACKWARD_EXTEND,
scorer: Optional[Callable] = morphologizer_score,
): ):
"""Initialize a morphologizer. """Initialize a morphologizer.
@ -74,6 +102,9 @@ class Morphologizer(Tagger):
model (thinc.api.Model): The Thinc Model powering the pipeline component. model (thinc.api.Model): The Thinc Model powering the pipeline component.
name (str): The component instance name, used to add entries to the name (str): The component instance name, used to add entries to the
losses during training. losses during training.
scorer (Optional[Callable]): The scoring method. Defaults to
Scorer.score_token_attr for the attributes "pos" and "morph" and
Scorer.score_token_attr_per_feat for the attribute "morph".
DOCS: https://spacy.io/api/morphologizer#init DOCS: https://spacy.io/api/morphologizer#init
""" """
@ -85,8 +116,14 @@ class Morphologizer(Tagger):
# store mappings from morph+POS labels to token-level annotations: # store mappings from morph+POS labels to token-level annotations:
# 1) labels_morph stores a mapping from morph+POS->morph # 1) labels_morph stores a mapping from morph+POS->morph
# 2) labels_pos stores a mapping from morph+POS->POS # 2) labels_pos stores a mapping from morph+POS->POS
cfg = {"labels_morph": {}, "labels_pos": {}} cfg = {
"labels_morph": {},
"labels_pos": {},
"overwrite": overwrite,
"extend": extend,
}
self.cfg = dict(sorted(cfg.items())) self.cfg = dict(sorted(cfg.items()))
self.scorer = scorer
@property @property
def labels(self): def labels(self):
@@ -192,14 +229,35 @@
             docs = [docs]
         cdef Doc doc
         cdef Vocab vocab = self.vocab
+        cdef bint overwrite = self.cfg["overwrite"]
+        cdef bint extend = self.cfg["extend"]
+        labels = self.labels
         for i, doc in enumerate(docs):
             doc_tag_ids = batch_tag_ids[i]
             if hasattr(doc_tag_ids, "get"):
                 doc_tag_ids = doc_tag_ids.get()
             for j, tag_id in enumerate(doc_tag_ids):
-                morph = self.labels[tag_id]
-                doc.c[j].morph = self.vocab.morphology.add(self.cfg["labels_morph"].get(morph, 0))
-                doc.c[j].pos = self.cfg["labels_pos"].get(morph, 0)
+                morph = labels[tag_id]
+                # set morph
+                if doc.c[j].morph == 0 or overwrite or extend:
+                    if overwrite and extend:
+                        # morphologizer morph overwrites any existing features
+                        # while extending
+                        extended_morph = Morphology.feats_to_dict(self.vocab.strings[doc.c[j].morph])
+                        extended_morph.update(Morphology.feats_to_dict(self.cfg["labels_morph"].get(morph, 0)))
+                        doc.c[j].morph = self.vocab.morphology.add(extended_morph)
+                    elif extend:
+                        # existing features are preserved and any new features
+                        # are added
+                        extended_morph = Morphology.feats_to_dict(self.cfg["labels_morph"].get(morph, 0))
+                        extended_morph.update(Morphology.feats_to_dict(self.vocab.strings[doc.c[j].morph]))
+                        doc.c[j].morph = self.vocab.morphology.add(extended_morph)
+                    else:
+                        # clobber
+                        doc.c[j].morph = self.vocab.morphology.add(self.cfg["labels_morph"].get(morph, 0))
+                # set POS
+                if doc.c[j].pos == 0 or overwrite:
+                    doc.c[j].pos = self.cfg["labels_pos"].get(morph, 0)
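The overwrite/extend branches above combine pre-set and predicted features as follows. For a token that already carries `Number=Plur` when the morphologizer predicts `Degree=Pos|Number=Sing` (values illustrative), and assuming an existing pipeline `nlp`:

# overwrite=False, extend=False  ->  Number=Plur             (existing annotation kept)
# overwrite=False, extend=True   ->  Degree=Pos|Number=Plur  (existing wins on conflicting features)
# overwrite=True,  extend=False  ->  Degree=Pos|Number=Sing  (prediction replaces everything)
# overwrite=True,  extend=True   ->  Degree=Pos|Number=Sing  (prediction wins on conflicting features)
nlp.add_pipe("morphologizer", config={"overwrite": True, "extend": False})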
def get_loss(self, examples, scores): def get_loss(self, examples, scores):
"""Find the loss and gradient of loss for the batch of documents and """Find the loss and gradient of loss for the batch of documents and
@ -246,24 +304,3 @@ class Morphologizer(Tagger):
if self.model.ops.xp.isnan(loss): if self.model.ops.xp.isnan(loss):
raise ValueError(Errors.E910.format(name=self.name)) raise ValueError(Errors.E910.format(name=self.name))
return float(loss), d_scores return float(loss), d_scores
def score(self, examples, **kwargs):
"""Score a batch of examples.
examples (Iterable[Example]): The examples to score.
RETURNS (Dict[str, Any]): The scores, produced by
Scorer.score_token_attr for the attributes "pos" and "morph" and
Scorer.score_token_attr_per_feat for the attribute "morph".
DOCS: https://spacy.io/api/morphologizer#score
"""
def morph_key_getter(token, attr):
return getattr(token, attr).key
validate_examples(examples, "Morphologizer.score")
results = {}
results.update(Scorer.score_token_attr(examples, "pos", **kwargs))
results.update(Scorer.score_token_attr(examples, "morph", getter=morph_key_getter, **kwargs))
results.update(Scorer.score_token_attr_per_feat(examples,
"morph", getter=morph_key_getter, **kwargs))
return results


@ -1,6 +1,6 @@
# cython: infer_types=True, profile=True, binding=True # cython: infer_types=True, profile=True, binding=True
from collections import defaultdict from collections import defaultdict
from typing import Optional, Iterable from typing import Optional, Iterable, Callable
from thinc.api import Model, Config from thinc.api import Model, Config
from ._parser_internals.transition_system import TransitionSystem from ._parser_internals.transition_system import TransitionSystem
@ -10,6 +10,7 @@ from ._parser_internals.ner import BiluoPushDown
from ..language import Language from ..language import Language
from ..scorer import get_ner_prf, PRFScore from ..scorer import get_ner_prf, PRFScore
from ..training import validate_examples from ..training import validate_examples
from ..util import registry
default_model_config = """ default_model_config = """
@ -41,6 +42,7 @@ DEFAULT_NER_MODEL = Config().from_str(default_model_config)["model"]
"update_with_oracle_cut_size": 100, "update_with_oracle_cut_size": 100,
"model": DEFAULT_NER_MODEL, "model": DEFAULT_NER_MODEL,
"incorrect_spans_key": None, "incorrect_spans_key": None,
"scorer": {"@scorers": "spacy.ner_scorer.v1"},
}, },
default_score_weights={ default_score_weights={
"ents_f": 1.0, "ents_f": 1.0,
@ -55,7 +57,8 @@ def make_ner(
model: Model, model: Model,
moves: Optional[TransitionSystem], moves: Optional[TransitionSystem],
update_with_oracle_cut_size: int, update_with_oracle_cut_size: int,
incorrect_spans_key: Optional[str] = None, incorrect_spans_key: Optional[str],
scorer: Optional[Callable],
): ):
"""Create a transition-based EntityRecognizer component. The entity recognizer """Create a transition-based EntityRecognizer component. The entity recognizer
identifies non-overlapping labelled spans of tokens. identifies non-overlapping labelled spans of tokens.
@ -83,6 +86,7 @@ def make_ner(
incorrect_spans_key (Optional[str]): Identifies spans that are known incorrect_spans_key (Optional[str]): Identifies spans that are known
to be incorrect entity annotations. The incorrect entity annotations to be incorrect entity annotations. The incorrect entity annotations
can be stored in the span group, under this key. can be stored in the span group, under this key.
scorer (Optional[Callable]): The scoring method.
""" """
return EntityRecognizer( return EntityRecognizer(
nlp.vocab, nlp.vocab,
@ -95,6 +99,7 @@ def make_ner(
beam_width=1, beam_width=1,
beam_density=0.0, beam_density=0.0,
beam_update_prob=0.0, beam_update_prob=0.0,
scorer=scorer,
) )
@ -109,6 +114,7 @@ def make_ner(
"beam_update_prob": 0.5, "beam_update_prob": 0.5,
"beam_width": 32, "beam_width": 32,
"incorrect_spans_key": None, "incorrect_spans_key": None,
"scorer": None,
}, },
default_score_weights={ default_score_weights={
"ents_f": 1.0, "ents_f": 1.0,
@ -126,7 +132,8 @@ def make_beam_ner(
beam_width: int, beam_width: int,
beam_density: float, beam_density: float,
beam_update_prob: float, beam_update_prob: float,
incorrect_spans_key: Optional[str] = None, incorrect_spans_key: Optional[str],
scorer: Optional[Callable],
): ):
"""Create a transition-based EntityRecognizer component that uses beam-search. """Create a transition-based EntityRecognizer component that uses beam-search.
The entity recognizer identifies non-overlapping labelled spans of tokens. The entity recognizer identifies non-overlapping labelled spans of tokens.
@ -162,6 +169,7 @@ def make_beam_ner(
and are faster to compute. and are faster to compute.
incorrect_spans_key (Optional[str]): Optional key into span groups of incorrect_spans_key (Optional[str]): Optional key into span groups of
entities known to be non-entities. entities known to be non-entities.
scorer (Optional[Callable]): The scoring method.
""" """
return EntityRecognizer( return EntityRecognizer(
nlp.vocab, nlp.vocab,
@ -174,9 +182,19 @@ def make_beam_ner(
beam_density=beam_density, beam_density=beam_density,
beam_update_prob=beam_update_prob, beam_update_prob=beam_update_prob,
incorrect_spans_key=incorrect_spans_key, incorrect_spans_key=incorrect_spans_key,
scorer=scorer,
) )
def ner_score(examples, **kwargs):
return get_ner_prf(examples, **kwargs)
@registry.scorers("spacy.ner_scorer.v1")
def make_ner_scorer():
return ner_score
class EntityRecognizer(Parser): class EntityRecognizer(Parser):
"""Pipeline component for named entity recognition. """Pipeline component for named entity recognition.
@ -198,6 +216,7 @@ class EntityRecognizer(Parser):
beam_update_prob=0.0, beam_update_prob=0.0,
multitasks=tuple(), multitasks=tuple(),
incorrect_spans_key=None, incorrect_spans_key=None,
scorer=ner_score,
): ):
"""Create an EntityRecognizer.""" """Create an EntityRecognizer."""
super().__init__( super().__init__(
@ -213,6 +232,7 @@ class EntityRecognizer(Parser):
beam_update_prob=beam_update_prob, beam_update_prob=beam_update_prob,
multitasks=multitasks, multitasks=multitasks,
incorrect_spans_key=incorrect_spans_key, incorrect_spans_key=incorrect_spans_key,
scorer=scorer,
) )
def add_multitask_objective(self, mt_component): def add_multitask_objective(self, mt_component):
@ -239,17 +259,6 @@ class EntityRecognizer(Parser):
) )
return tuple(sorted(labels)) return tuple(sorted(labels))
def score(self, examples, **kwargs):
"""Score a batch of examples.
examples (Iterable[Example]): The examples to score.
RETURNS (Dict[str, Any]): The NER precision, recall and f-scores.
DOCS: https://spacy.io/api/entityrecognizer#score
"""
validate_examples(examples, "EntityRecognizer.score")
return get_ner_prf(examples)
def scored_ents(self, beams): def scored_ents(self, beams):
"""Return a dictionary of (start, end, label) tuples with corresponding scores """Return a dictionary of (start, end, label) tuples with corresponding scores
for each beam/doc that was processed. for each beam/doc that was processed.

View File

@ -81,6 +81,17 @@ cdef class Pipe:
DOCS: https://spacy.io/api/pipe#score DOCS: https://spacy.io/api/pipe#score
""" """
if hasattr(self, "scorer") and self.scorer is not None:
scorer_kwargs = {}
# use default settings from cfg (e.g., threshold)
if hasattr(self, "cfg") and isinstance(self.cfg, dict):
scorer_kwargs.update(self.cfg)
# override self.cfg["labels"] with self.labels
if hasattr(self, "labels"):
scorer_kwargs["labels"] = self.labels
# override with kwargs settings
scorer_kwargs.update(kwargs)
return self.scorer(examples, **scorer_kwargs)
return {} return {}
@property @property
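The default `Pipe.score` above now delegates to the registered scorer and builds its keyword arguments in a fixed order of precedence: the component's `cfg` first, then the component's `labels`, then whatever the caller passes. A rough pure-Python restatement (the helper name is illustrative):

def resolve_scorer_kwargs(cfg, labels, caller_kwargs):
    kwargs = {}
    kwargs.update(cfg)             # component defaults, e.g. "threshold" or "spans_key"
    if labels is not None:
        kwargs["labels"] = labels  # the component's actual labels override cfg["labels"]
    kwargs.update(caller_kwargs)   # explicit arguments passed to score() win overall
    return kwargs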


@ -1,26 +1,32 @@
# cython: infer_types=True, profile=True, binding=True # cython: infer_types=True, profile=True, binding=True
from typing import Optional, List from typing import Optional, List, Callable
import srsly import srsly
from ..tokens.doc cimport Doc from ..tokens.doc cimport Doc
from .pipe import Pipe from .pipe import Pipe
from .senter import senter_score
from ..language import Language from ..language import Language
from ..scorer import Scorer from ..scorer import Scorer
from ..training import validate_examples
from .. import util from .. import util
# see #9050
BACKWARD_OVERWRITE = False
 @Language.factory(
     "sentencizer",
     assigns=["token.is_sent_start", "doc.sents"],
-    default_config={"punct_chars": None},
+    default_config={"punct_chars": None, "overwrite": False, "scorer": {"@scorers": "spacy.senter_scorer.v1"}},
     default_score_weights={"sents_f": 1.0, "sents_p": 0.0, "sents_r": 0.0},
 )
 def make_sentencizer(
     nlp: Language,
     name: str,
-    punct_chars: Optional[List[str]]
+    punct_chars: Optional[List[str]],
+    overwrite: bool,
+    scorer: Optional[Callable],
 ):
-    return Sentencizer(name, punct_chars=punct_chars)
+    return Sentencizer(name, punct_chars=punct_chars, overwrite=overwrite, scorer=scorer)
class Sentencizer(Pipe): class Sentencizer(Pipe):
@ -41,12 +47,20 @@ class Sentencizer(Pipe):
'𑩃', '𑪛', '𑪜', '𑱁', '𑱂', '𖩮', '𖩯', '𖫵', '𖬷', '𖬸', '𖭄', '𛲟', '𝪈', '𑩃', '𑪛', '𑪜', '𑱁', '𑱂', '𖩮', '𖩯', '𖫵', '𖬷', '𖬸', '𖭄', '𛲟', '𝪈',
'', ''] '', '']
def __init__(self, name="sentencizer", *, punct_chars=None): def __init__(
self,
name="sentencizer",
*,
punct_chars=None,
overwrite=BACKWARD_OVERWRITE,
scorer=senter_score,
):
"""Initialize the sentencizer. """Initialize the sentencizer.
punct_chars (list): Punctuation characters to split on. Will be punct_chars (list): Punctuation characters to split on. Will be
serialized with the nlp object. serialized with the nlp object.
RETURNS (Sentencizer): The sentencizer component. scorer (Optional[Callable]): The scoring method. Defaults to
Scorer.score_spans for the attribute "sents".
DOCS: https://spacy.io/api/sentencizer#init DOCS: https://spacy.io/api/sentencizer#init
""" """
@ -55,6 +69,8 @@ class Sentencizer(Pipe):
self.punct_chars = set(punct_chars) self.punct_chars = set(punct_chars)
else: else:
self.punct_chars = set(self.default_punct_chars) self.punct_chars = set(self.default_punct_chars)
self.overwrite = overwrite
self.scorer = scorer
def __call__(self, doc): def __call__(self, doc):
"""Apply the sentencizer to a Doc and set Token.is_sent_start. """Apply the sentencizer to a Doc and set Token.is_sent_start.
@ -115,29 +131,12 @@ class Sentencizer(Pipe):
for i, doc in enumerate(docs): for i, doc in enumerate(docs):
doc_tag_ids = batch_tag_ids[i] doc_tag_ids = batch_tag_ids[i]
for j, tag_id in enumerate(doc_tag_ids): for j, tag_id in enumerate(doc_tag_ids):
# Don't clobber existing sentence boundaries if doc.c[j].sent_start == 0 or self.overwrite:
if doc.c[j].sent_start == 0:
if tag_id: if tag_id:
doc.c[j].sent_start = 1 doc.c[j].sent_start = 1
else: else:
doc.c[j].sent_start = -1 doc.c[j].sent_start = -1
def score(self, examples, **kwargs):
"""Score a batch of examples.
examples (Iterable[Example]): The examples to score.
RETURNS (Dict[str, Any]): The scores, produced by Scorer.score_spans.
DOCS: https://spacy.io/api/sentencizer#score
"""
def has_sents(doc):
return doc.has_annotation("SENT_START")
validate_examples(examples, "Sentencizer.score")
results = Scorer.score_spans(examples, "sents", has_annotation=has_sents, **kwargs)
del results["sents_per_type"]
return results
def to_bytes(self, *, exclude=tuple()): def to_bytes(self, *, exclude=tuple()):
"""Serialize the sentencizer to a bytestring. """Serialize the sentencizer to a bytestring.
@ -145,7 +144,7 @@ class Sentencizer(Pipe):
DOCS: https://spacy.io/api/sentencizer#to_bytes DOCS: https://spacy.io/api/sentencizer#to_bytes
""" """
return srsly.msgpack_dumps({"punct_chars": list(self.punct_chars)}) return srsly.msgpack_dumps({"punct_chars": list(self.punct_chars), "overwrite": self.overwrite})
def from_bytes(self, bytes_data, *, exclude=tuple()): def from_bytes(self, bytes_data, *, exclude=tuple()):
"""Load the sentencizer from a bytestring. """Load the sentencizer from a bytestring.
@ -157,6 +156,7 @@ class Sentencizer(Pipe):
""" """
cfg = srsly.msgpack_loads(bytes_data) cfg = srsly.msgpack_loads(bytes_data)
self.punct_chars = set(cfg.get("punct_chars", self.default_punct_chars)) self.punct_chars = set(cfg.get("punct_chars", self.default_punct_chars))
self.overwrite = cfg.get("overwrite", self.overwrite)
return self return self
def to_disk(self, path, *, exclude=tuple()): def to_disk(self, path, *, exclude=tuple()):
@ -166,7 +166,7 @@ class Sentencizer(Pipe):
""" """
path = util.ensure_path(path) path = util.ensure_path(path)
path = path.with_suffix(".json") path = path.with_suffix(".json")
srsly.write_json(path, {"punct_chars": list(self.punct_chars)}) srsly.write_json(path, {"punct_chars": list(self.punct_chars), "overwrite": self.overwrite})
def from_disk(self, path, *, exclude=tuple()): def from_disk(self, path, *, exclude=tuple()):
@ -178,4 +178,5 @@ class Sentencizer(Pipe):
path = path.with_suffix(".json") path = path.with_suffix(".json")
cfg = srsly.read_json(path) cfg = srsly.read_json(path)
self.punct_chars = set(cfg.get("punct_chars", self.default_punct_chars)) self.punct_chars = set(cfg.get("punct_chars", self.default_punct_chars))
self.overwrite = cfg.get("overwrite", self.overwrite)
return self return self


@ -1,5 +1,6 @@
# cython: infer_types=True, profile=True, binding=True # cython: infer_types=True, profile=True, binding=True
from itertools import islice from itertools import islice
from typing import Optional, Callable
import srsly import srsly
from thinc.api import Model, SequenceCategoricalCrossentropy, Config from thinc.api import Model, SequenceCategoricalCrossentropy, Config
@ -11,8 +12,11 @@ from ..language import Language
from ..errors import Errors from ..errors import Errors
from ..scorer import Scorer from ..scorer import Scorer
from ..training import validate_examples, validate_get_examples from ..training import validate_examples, validate_get_examples
from ..util import registry
from .. import util from .. import util
# See #9050
BACKWARD_OVERWRITE = False
default_model_config = """ default_model_config = """
[model] [model]
@@ -34,11 +38,25 @@ DEFAULT_SENTER_MODEL = Config().from_str(default_model_config)["model"]
 @Language.factory(
     "senter",
     assigns=["token.is_sent_start"],
-    default_config={"model": DEFAULT_SENTER_MODEL},
+    default_config={"model": DEFAULT_SENTER_MODEL, "overwrite": False, "scorer": {"@scorers": "spacy.senter_scorer.v1"}},
     default_score_weights={"sents_f": 1.0, "sents_p": 0.0, "sents_r": 0.0},
 )
-def make_senter(nlp: Language, name: str, model: Model):
-    return SentenceRecognizer(nlp.vocab, model, name)
+def make_senter(nlp: Language, name: str, model: Model, overwrite: bool, scorer: Optional[Callable]):
+    return SentenceRecognizer(nlp.vocab, model, name, overwrite=overwrite, scorer=scorer)
def senter_score(examples, **kwargs):
def has_sents(doc):
return doc.has_annotation("SENT_START")
results = Scorer.score_spans(examples, "sents", has_annotation=has_sents, **kwargs)
del results["sents_per_type"]
return results
@registry.scorers("spacy.senter_scorer.v1")
def make_senter_scorer():
return senter_score
class SentenceRecognizer(Tagger): class SentenceRecognizer(Tagger):
@ -46,13 +64,23 @@ class SentenceRecognizer(Tagger):
DOCS: https://spacy.io/api/sentencerecognizer DOCS: https://spacy.io/api/sentencerecognizer
""" """
def __init__(self, vocab, model, name="senter"): def __init__(
self,
vocab,
model,
name="senter",
*,
overwrite=BACKWARD_OVERWRITE,
scorer=senter_score,
):
"""Initialize a sentence recognizer. """Initialize a sentence recognizer.
vocab (Vocab): The shared vocabulary. vocab (Vocab): The shared vocabulary.
model (thinc.api.Model): The Thinc Model powering the pipeline component. model (thinc.api.Model): The Thinc Model powering the pipeline component.
name (str): The component instance name, used to add entries to the name (str): The component instance name, used to add entries to the
losses during training. losses during training.
scorer (Optional[Callable]): The scoring method. Defaults to
Scorer.score_spans for the attribute "sents".
DOCS: https://spacy.io/api/sentencerecognizer#init DOCS: https://spacy.io/api/sentencerecognizer#init
""" """
@ -60,7 +88,8 @@ class SentenceRecognizer(Tagger):
self.model = model self.model = model
self.name = name self.name = name
self._rehearsal_model = None self._rehearsal_model = None
self.cfg = {} self.cfg = {"overwrite": overwrite}
self.scorer = scorer
@property @property
def labels(self): def labels(self):
@ -85,13 +114,13 @@ class SentenceRecognizer(Tagger):
if isinstance(docs, Doc): if isinstance(docs, Doc):
docs = [docs] docs = [docs]
cdef Doc doc cdef Doc doc
cdef bint overwrite = self.cfg["overwrite"]
for i, doc in enumerate(docs): for i, doc in enumerate(docs):
doc_tag_ids = batch_tag_ids[i] doc_tag_ids = batch_tag_ids[i]
if hasattr(doc_tag_ids, "get"): if hasattr(doc_tag_ids, "get"):
doc_tag_ids = doc_tag_ids.get() doc_tag_ids = doc_tag_ids.get()
for j, tag_id in enumerate(doc_tag_ids): for j, tag_id in enumerate(doc_tag_ids):
# Don't clobber existing sentence boundaries if doc.c[j].sent_start == 0 or overwrite:
if doc.c[j].sent_start == 0:
if tag_id == 1: if tag_id == 1:
doc.c[j].sent_start = 1 doc.c[j].sent_start = 1
else: else:
@ -153,18 +182,3 @@ class SentenceRecognizer(Tagger):
def add_label(self, label, values=None): def add_label(self, label, values=None):
raise NotImplementedError raise NotImplementedError
def score(self, examples, **kwargs):
"""Score a batch of examples.
examples (Iterable[Example]): The examples to score.
RETURNS (Dict[str, Any]): The scores, produced by Scorer.score_spans.
DOCS: https://spacy.io/api/sentencerecognizer#score
"""
def has_sents(doc):
return doc.has_annotation("SENT_START")
validate_examples(examples, "SentenceRecognizer.score")
results = Scorer.score_spans(examples, "sents", has_annotation=has_sents, **kwargs)
del results["sents_per_type"]
return results
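Note: a minimal usage sketch of the settings the senter factory now exposes (the pipeline and values below are illustrative, not part of the diff):

import spacy

nlp = spacy.blank("en")
# "overwrite" controls whether predicted sentence starts may replace existing
# boundaries (the factory default above is False); the scorer resolves through
# the new scorers registry entry.
senter = nlp.add_pipe(
    "senter",
    config={"overwrite": True, "scorer": {"@scorers": "spacy.senter_scorer.v1"}},
)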
@@ -78,7 +78,7 @@ def build_ngram_suggester(sizes: List[int]) -> Suggester:
         if len(spans) > 0:
             output = Ragged(ops.xp.vstack(spans), lengths_array)
         else:
-            output = Ragged(ops.xp.zeros((0, 0)), lengths_array)
+            output = Ragged(ops.xp.zeros((0, 0), dtype="i"), lengths_array)
         assert output.dataXd.ndim == 2
         return output
@@ -104,6 +104,7 @@ def build_ngram_range_suggester(min_size: int, max_size: int) -> Suggester:
         "max_positive": None,
         "model": DEFAULT_SPANCAT_MODEL,
         "suggester": {"@misc": "spacy.ngram_suggester.v1", "sizes": [1, 2, 3]},
+        "scorer": {"@scorers": "spacy.spancat_scorer.v1"},
     },
     default_score_weights={"spans_sc_f": 1.0, "spans_sc_p": 0.0, "spans_sc_r": 0.0},
 )
@@ -113,8 +114,9 @@ def make_spancat(
     suggester: Suggester,
     model: Model[Tuple[List[Doc], Ragged], Floats2d],
     spans_key: str,
-    threshold: float = 0.5,
-    max_positive: Optional[int] = None,
+    scorer: Optional[Callable],
+    threshold: float,
+    max_positive: Optional[int],
 ) -> "SpanCategorizer":
     """Create a SpanCategorizer component. The span categorizer consists of two
     parts: a suggester function that proposes candidate spans, and a labeller
@@ -144,9 +146,28 @@ def make_spancat(
         threshold=threshold,
         max_positive=max_positive,
         name=name,
+        scorer=scorer,
     )
+def spancat_score(examples: Iterable[Example], **kwargs) -> Dict[str, Any]:
+    kwargs = dict(kwargs)
+    attr_prefix = "spans_"
+    key = kwargs["spans_key"]
+    kwargs.setdefault("attr", f"{attr_prefix}{key}")
+    kwargs.setdefault("allow_overlap", True)
+    kwargs.setdefault(
+        "getter", lambda doc, key: doc.spans.get(key[len(attr_prefix) :], [])
+    )
+    kwargs.setdefault("has_annotation", lambda doc: key in doc.spans)
+    return Scorer.score_spans(examples, **kwargs)
+@registry.scorers("spacy.spancat_scorer.v1")
+def make_spancat_scorer():
+    return spancat_score
 class SpanCategorizer(TrainablePipe):
     """Pipeline component to label spans of text.
@@ -163,8 +184,25 @@ class SpanCategorizer(TrainablePipe):
         spans_key: str = "spans",
         threshold: float = 0.5,
         max_positive: Optional[int] = None,
+        scorer: Optional[Callable] = spancat_score,
     ) -> None:
         """Initialize the span categorizer.
+        vocab (Vocab): The shared vocabulary.
+        model (thinc.api.Model): The Thinc Model powering the pipeline component.
+        name (str): The component instance name, used to add entries to the
+            losses during training.
+        spans_key (str): Key of the Doc.spans dict to save the spans under.
+            During initialization and training, the component will look for
+            spans on the reference document under the same key. Defaults to
+            `"spans"`.
+        threshold (float): Minimum probability to consider a prediction
+            positive. Spans with a positive prediction will be saved on the Doc.
+            Defaults to 0.5.
+        max_positive (Optional[int]): Maximum number of labels to consider
+            positive per span. Defaults to None, indicating no limit.
+        scorer (Optional[Callable]): The scoring method. Defaults to
+            Scorer.score_spans for the Doc.spans[spans_key] with overlapping
+            spans allowed.
         DOCS: https://spacy.io/api/spancategorizer#init
         """
@@ -178,6 +216,7 @@ class SpanCategorizer(TrainablePipe):
         self.suggester = suggester
         self.model = model
         self.name = name
+        self.scorer = scorer
     @property
     def key(self) -> str:
@@ -379,26 +418,6 @@ class SpanCategorizer(TrainablePipe):
         else:
             self.model.initialize()
-    def score(self, examples: Iterable[Example], **kwargs) -> Dict[str, Any]:
-        """Score a batch of examples.
-        examples (Iterable[Example]): The examples to score.
-        RETURNS (Dict[str, Any]): The scores, produced by Scorer.score_cats.
-        DOCS: https://spacy.io/api/spancategorizer#score
-        """
-        validate_examples(examples, "SpanCategorizer.score")
-        self._validate_categories(examples)
-        kwargs = dict(kwargs)
-        attr_prefix = "spans_"
-        kwargs.setdefault("attr", f"{attr_prefix}{self.key}")
-        kwargs.setdefault("allow_overlap", True)
-        kwargs.setdefault(
-            "getter", lambda doc, key: doc.spans.get(key[len(attr_prefix) :], [])
-        )
-        kwargs.setdefault("has_annotation", lambda doc: self.key in doc.spans)
-        return Scorer.score_spans(examples, **kwargs)
     def _validate_categories(self, examples: Iterable[Example]):
         # TODO
         pass
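Note: the scorer is now just another config entry, so it can be set (or swapped) when the component is added; a sketch with illustrative values, using only key names taken from the hunks above:

import spacy

nlp = spacy.blank("en")
# "sc" mirrors the spans key implied by the spans_sc_* score weights above;
# "spacy.spancat_scorer.v1" resolves to spancat_score via the scorers registry.
spancat = nlp.add_pipe(
    "spancat",
    config={
        "spans_key": "sc",
        "scorer": {"@scorers": "spacy.spancat_scorer.v1"},
    },
)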
@@ -1,4 +1,5 @@
 # cython: infer_types=True, profile=True, binding=True
+from typing import Callable, Optional
 import numpy
 import srsly
 from thinc.api import Model, set_dropout_rate, SequenceCategoricalCrossentropy, Config
@@ -18,8 +19,11 @@ from ..parts_of_speech import X
 from ..errors import Errors, Warnings
 from ..scorer import Scorer
 from ..training import validate_examples, validate_get_examples
+from ..util import registry
 from .. import util
+# See #9050
+BACKWARD_OVERWRITE = False
 default_model_config = """
 [model]
@@ -41,10 +45,17 @@ DEFAULT_TAGGER_MODEL = Config().from_str(default_model_config)["model"]
 @Language.factory(
     "tagger",
     assigns=["token.tag"],
-    default_config={"model": DEFAULT_TAGGER_MODEL},
+    default_config={"model": DEFAULT_TAGGER_MODEL, "overwrite": False, "scorer": {"@scorers": "spacy.tagger_scorer.v1"}, "neg_prefix": "!"},
     default_score_weights={"tag_acc": 1.0},
 )
-def make_tagger(nlp: Language, name: str, model: Model):
+def make_tagger(
+    nlp: Language,
+    name: str,
+    model: Model,
+    overwrite: bool,
+    scorer: Optional[Callable],
+    neg_prefix: str,
+):
     """Construct a part-of-speech tagger component.
     model (Model[List[Doc], List[Floats2d]]): A model instance that predicts
@@ -52,7 +63,16 @@ def make_tagger(nlp: Language, name: str, model: Model):
         in size, and be normalized as probabilities (all scores between 0 and 1,
         with the rows summing to 1).
     """
-    return Tagger(nlp.vocab, model, name)
+    return Tagger(nlp.vocab, model, name, overwrite=overwrite, scorer=scorer, neg_prefix=neg_prefix)
+def tagger_score(examples, **kwargs):
+    return Scorer.score_token_attr(examples, "tag", **kwargs)
+@registry.scorers("spacy.tagger_scorer.v1")
+def make_tagger_scorer():
+    return tagger_score
 class Tagger(TrainablePipe):
@@ -60,13 +80,24 @@ class Tagger(TrainablePipe):
     DOCS: https://spacy.io/api/tagger
     """
-    def __init__(self, vocab, model, name="tagger"):
+    def __init__(
+        self,
+        vocab,
+        model,
+        name="tagger",
+        *,
+        overwrite=BACKWARD_OVERWRITE,
+        scorer=tagger_score,
+        neg_prefix="!",
+    ):
         """Initialize a part-of-speech tagger.
         vocab (Vocab): The shared vocabulary.
         model (thinc.api.Model): The Thinc Model powering the pipeline component.
         name (str): The component instance name, used to add entries to the
             losses during training.
+        scorer (Optional[Callable]): The scoring method. Defaults to
+            Scorer.score_token_attr for the attribute "tag".
         DOCS: https://spacy.io/api/tagger#init
         """
@@ -74,8 +105,9 @@ class Tagger(TrainablePipe):
         self.model = model
         self.name = name
         self._rehearsal_model = None
-        cfg = {"labels": []}
+        cfg = {"labels": [], "overwrite": overwrite, "neg_prefix": neg_prefix}
         self.cfg = dict(sorted(cfg.items()))
+        self.scorer = scorer
     @property
     def labels(self):
@@ -135,14 +167,15 @@ class Tagger(TrainablePipe):
             docs = [docs]
         cdef Doc doc
         cdef Vocab vocab = self.vocab
+        cdef bint overwrite = self.cfg["overwrite"]
+        labels = self.labels
         for i, doc in enumerate(docs):
             doc_tag_ids = batch_tag_ids[i]
             if hasattr(doc_tag_ids, "get"):
                 doc_tag_ids = doc_tag_ids.get()
             for j, tag_id in enumerate(doc_tag_ids):
-                # Don't clobber preset POS tags
-                if doc.c[j].tag == 0:
-                    doc.c[j].tag = self.vocab.strings[self.labels[tag_id]]
+                if doc.c[j].tag == 0 or overwrite:
+                    doc.c[j].tag = self.vocab.strings[labels[tag_id]]
     def update(self, examples, *, drop=0., sgd=None, losses=None):
         """Learn from a batch of documents and gold-standard information,
@@ -222,7 +255,7 @@ class Tagger(TrainablePipe):
         DOCS: https://spacy.io/api/tagger#get_loss
         """
         validate_examples(examples, "Tagger.get_loss")
-        loss_func = SequenceCategoricalCrossentropy(names=self.labels, normalize=False, neg_prefix="!")
+        loss_func = SequenceCategoricalCrossentropy(names=self.labels, normalize=False, neg_prefix=self.cfg["neg_prefix"])
         # Convert empty tag "" to missing value None so that both misaligned
         # tokens and tokens with missing annotation have the default missing
         # value None.
@@ -289,15 +322,3 @@ class Tagger(TrainablePipe):
         self.cfg["labels"].append(label)
         self.vocab.strings.add(label)
         return 1
-    def score(self, examples, **kwargs):
-        """Score a batch of examples.
-        examples (Iterable[Example]): The examples to score.
-        RETURNS (Dict[str, Any]): The scores, produced by
-            Scorer.score_token_attr for the attributes "tag".
-        DOCS: https://spacy.io/api/tagger#score
-        """
-        validate_examples(examples, "Tagger.score")
-        return Scorer.score_token_attr(examples, "tag", **kwargs)
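Note: the same pattern applies to the tagger, where the previously hard-coded negative-label prefix and the clobbering behaviour become plain settings; a sketch with illustrative values (the pipeline is assumed, the keys come from the default_config above):

import spacy

nlp = spacy.blank("en")
# overwrite=True lets predictions replace preset tags; neg_prefix feeds the
# SequenceCategoricalCrossentropy loss in the get_loss hunk above.
tagger = nlp.add_pipe(
    "tagger",
    config={"overwrite": True, "neg_prefix": "!"},
)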
@@ -10,6 +10,7 @@ from ..training import Example, validate_examples, validate_get_examples
 from ..errors import Errors
 from ..scorer import Scorer
 from ..tokens import Doc
+from ..util import registry
 from ..vocab import Vocab
@@ -70,7 +71,11 @@ subword_features = true
 @Language.factory(
     "textcat",
     assigns=["doc.cats"],
-    default_config={"threshold": 0.5, "model": DEFAULT_SINGLE_TEXTCAT_MODEL},
+    default_config={
+        "threshold": 0.5,
+        "model": DEFAULT_SINGLE_TEXTCAT_MODEL,
+        "scorer": {"@scorers": "spacy.textcat_scorer.v1"},
+    },
     default_score_weights={
         "cats_score": 1.0,
         "cats_score_desc": None,
@@ -86,7 +91,11 @@ subword_features = true
     },
 )
 def make_textcat(
-    nlp: Language, name: str, model: Model[List[Doc], List[Floats2d]], threshold: float
+    nlp: Language,
+    name: str,
+    model: Model[List[Doc], List[Floats2d]],
+    threshold: float,
+    scorer: Optional[Callable],
 ) -> "TextCategorizer":
     """Create a TextCategorizer component. The text categorizer predicts categories
     over a whole document. It can learn one or more labels, and the labels are considered
@@ -95,8 +104,23 @@ def make_textcat(
     model (Model[List[Doc], List[Floats2d]]): A model instance that predicts
         scores for each category.
     threshold (float): Cutoff to consider a prediction "positive".
+    scorer (Optional[Callable]): The scoring method.
     """
-    return TextCategorizer(nlp.vocab, model, name, threshold=threshold)
+    return TextCategorizer(nlp.vocab, model, name, threshold=threshold, scorer=scorer)
+def textcat_score(examples: Iterable[Example], **kwargs) -> Dict[str, Any]:
+    return Scorer.score_cats(
+        examples,
+        "cats",
+        multi_label=False,
+        **kwargs,
+    )
+@registry.scorers("spacy.textcat_scorer.v1")
+def make_textcat_scorer():
+    return textcat_score
 class TextCategorizer(TrainablePipe):
@@ -106,7 +130,13 @@ class TextCategorizer(TrainablePipe):
     """
     def __init__(
-        self, vocab: Vocab, model: Model, name: str = "textcat", *, threshold: float
+        self,
+        vocab: Vocab,
+        model: Model,
+        name: str = "textcat",
+        *,
+        threshold: float,
+        scorer: Optional[Callable] = textcat_score,
     ) -> None:
         """Initialize a text categorizer for single-label classification.
@@ -115,6 +145,8 @@ class TextCategorizer(TrainablePipe):
         name (str): The component instance name, used to add entries to the
             losses during training.
         threshold (float): Cutoff to consider a prediction "positive".
+        scorer (Optional[Callable]): The scoring method. Defaults to
+            Scorer.score_cats for the attribute "cats".
         DOCS: https://spacy.io/api/textcategorizer#init
         """
@@ -124,6 +156,7 @@ class TextCategorizer(TrainablePipe):
         self._rehearsal_model = None
         cfg = {"labels": [], "threshold": threshold, "positive_label": None}
         self.cfg = dict(cfg)
+        self.scorer = scorer
     @property
     def labels(self) -> Tuple[str]:
@@ -353,26 +386,6 @@ class TextCategorizer(TrainablePipe):
         assert len(label_sample) > 0, Errors.E923.format(name=self.name)
         self.model.initialize(X=doc_sample, Y=label_sample)
-    def score(self, examples: Iterable[Example], **kwargs) -> Dict[str, Any]:
-        """Score a batch of examples.
-        examples (Iterable[Example]): The examples to score.
-        RETURNS (Dict[str, Any]): The scores, produced by Scorer.score_cats.
-        DOCS: https://spacy.io/api/textcategorizer#score
-        """
-        validate_examples(examples, "TextCategorizer.score")
-        self._validate_categories(examples)
-        kwargs.setdefault("threshold", self.cfg["threshold"])
-        kwargs.setdefault("positive_label", self.cfg["positive_label"])
-        return Scorer.score_cats(
-            examples,
-            "cats",
-            labels=self.labels,
-            multi_label=False,
-            **kwargs,
-        )
     def _validate_categories(self, examples: Iterable[Example]):
         """Check whether the provided examples all have single-label cats annotations."""
         for ex in examples:
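Note: because the scorers are ordinary registered functions, a project can register its own and point the component at it. A sketch under assumptions: the registry name "my.textcat_scorer.v1" and the wrapper function are hypothetical; only the scorers registry and the "scorer" config key come from the diff.

from typing import Any, Dict, Iterable

import spacy
from spacy.scorer import Scorer
from spacy.training import Example


def my_textcat_score(examples: Iterable[Example], **kwargs) -> Dict[str, Any]:
    # Same shape as textcat_score above: single-label scoring over doc.cats.
    return Scorer.score_cats(examples, "cats", multi_label=False, **kwargs)


@spacy.registry.scorers("my.textcat_scorer.v1")
def make_my_textcat_scorer():
    return my_textcat_score


nlp = spacy.blank("en")
textcat = nlp.add_pipe(
    "textcat",
    config={"threshold": 0.5, "scorer": {"@scorers": "my.textcat_scorer.v1"}},
)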
@@ -5,10 +5,11 @@ from thinc.api import Model, Config
 from thinc.types import Floats2d
 from ..language import Language
-from ..training import Example, validate_examples, validate_get_examples
+from ..training import Example, validate_get_examples
 from ..errors import Errors
 from ..scorer import Scorer
 from ..tokens import Doc
+from ..util import registry
 from ..vocab import Vocab
 from .textcat import TextCategorizer
@@ -70,7 +71,11 @@ subword_features = true
 @Language.factory(
     "textcat_multilabel",
     assigns=["doc.cats"],
-    default_config={"threshold": 0.5, "model": DEFAULT_MULTI_TEXTCAT_MODEL},
+    default_config={
+        "threshold": 0.5,
+        "model": DEFAULT_MULTI_TEXTCAT_MODEL,
+        "scorer": {"@scorers": "spacy.textcat_multilabel_scorer.v1"},
+    },
     default_score_weights={
         "cats_score": 1.0,
         "cats_score_desc": None,
@@ -86,7 +91,11 @@ subword_features = true
     },
 )
 def make_multilabel_textcat(
-    nlp: Language, name: str, model: Model[List[Doc], List[Floats2d]], threshold: float
+    nlp: Language,
+    name: str,
+    model: Model[List[Doc], List[Floats2d]],
+    threshold: float,
+    scorer: Optional[Callable],
 ) -> "TextCategorizer":
     """Create a TextCategorizer component. The text categorizer predicts categories
     over a whole document. It can learn one or more labels, and the labels are considered
@@ -97,7 +106,23 @@ def make_multilabel_textcat(
         scores for each category.
     threshold (float): Cutoff to consider a prediction "positive".
     """
-    return MultiLabel_TextCategorizer(nlp.vocab, model, name, threshold=threshold)
+    return MultiLabel_TextCategorizer(
+        nlp.vocab, model, name, threshold=threshold, scorer=scorer
+    )
+def textcat_multilabel_score(examples: Iterable[Example], **kwargs) -> Dict[str, Any]:
+    return Scorer.score_cats(
+        examples,
+        "cats",
+        multi_label=True,
+        **kwargs,
+    )
+@registry.scorers("spacy.textcat_multilabel_scorer.v1")
+def make_textcat_multilabel_scorer():
+    return textcat_multilabel_score
 class MultiLabel_TextCategorizer(TextCategorizer):
@@ -113,6 +138,7 @@ class MultiLabel_TextCategorizer(TextCategorizer):
         name: str = "textcat_multilabel",
         *,
         threshold: float,
+        scorer: Optional[Callable] = textcat_multilabel_score,
     ) -> None:
         """Initialize a text categorizer for multi-label classification.
@@ -130,6 +156,7 @@ class MultiLabel_TextCategorizer(TextCategorizer):
         self._rehearsal_model = None
         cfg = {"labels": [], "threshold": threshold}
         self.cfg = dict(cfg)
+        self.scorer = scorer
     def initialize(  # type: ignore[override]
         self,
@@ -166,24 +193,6 @@ class MultiLabel_TextCategorizer(TextCategorizer):
         assert len(label_sample) > 0, Errors.E923.format(name=self.name)
         self.model.initialize(X=doc_sample, Y=label_sample)
-    def score(self, examples: Iterable[Example], **kwargs) -> Dict[str, Any]:
-        """Score a batch of examples.
-        examples (Iterable[Example]): The examples to score.
-        RETURNS (Dict[str, Any]): The scores, produced by Scorer.score_cats.
-        DOCS: https://spacy.io/api/textcategorizer#score
-        """
-        validate_examples(examples, "MultiLabel_TextCategorizer.score")
-        kwargs.setdefault("threshold", self.cfg["threshold"])
-        return Scorer.score_cats(
-            examples,
-            "cats",
-            labels=self.labels,
-            multi_label=True,
-            **kwargs,
-        )
     def _validate_categories(self, examples: Iterable[Example]):
         """This component allows any type of single- or multi-label annotations.
         This method overwrites the more strict one from 'textcat'."""
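Note: a sketch of exercising the registered multilabel scorer directly on hand-built Examples; the texts, labels, and scores below are made up, and fetching the function through the scorers registry (rather than importing the module) is an assumption about typical usage.

import spacy
from spacy.training import Example

# The registry entry returns make_textcat_multilabel_scorer; calling it yields
# textcat_multilabel_score from the hunks above.
scorer = spacy.registry.scorers.get("spacy.textcat_multilabel_scorer.v1")()

nlp = spacy.blank("en")
examples = []
for text, gold in [
    ("This is great", {"POSITIVE": 1.0, "NEGATIVE": 0.0}),
    ("This is terrible", {"POSITIVE": 0.0, "NEGATIVE": 1.0}),
]:
    pred = nlp.make_doc(text)
    pred.cats = dict(gold)  # pretend-perfect predictions, purely for the sketch
    ref = nlp.make_doc(text)
    ref.cats = gold
    examples.append(Example(pred, ref))

scores = scorer(examples, labels=["POSITIVE", "NEGATIVE"])
print(scores["cats_score"])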
Some files were not shown because too many files have changed in this diff.