Merge pull request #9296 from adrianeboyd/chore/update-develop-from-master-v3.1-2

Update develop from master
Adriane Boyd 2021-09-27 11:19:00 +02:00 committed by GitHub
commit 200121a035
80 changed files with 3017 additions and 321 deletions


@ -14,6 +14,6 @@ or new feature, or a change to the documentation? -->
 ## Checklist
 <!--- Before you submit the PR, go over this checklist and make sure you can
 tick off all the boxes. [] -> [x] -->
-- [ ] I have submitted the spaCy Contributor Agreement.
+- [ ] I confirm that I have the right to submit this contribution under the project's MIT license.
 - [ ] I ran the tests, and all new and existing tests passed.
 - [ ] My changes don't require a change to the documentation, or if they do, I've added all required information.

.github/contributors/bbieniek.md

@ -0,0 +1,106 @@
# spaCy contributor agreement
This spaCy Contributor Agreement (**"SCA"**) is based on the
[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf).
The SCA applies to any contribution that you make to any product or project
managed by us (the **"project"**), and sets out the intellectual property rights
you grant to us in the contributed materials. The term **"us"** shall mean
[ExplosionAI GmbH](https://explosion.ai/legal). The term
**"you"** shall mean the person or entity identified below.
If you agree to be bound by these terms, fill in the information requested
below and include the filled-in version with your first pull request, under the
folder [`.github/contributors/`](/.github/contributors/). The name of the file
should be your GitHub username, with the extension `.md`. For example, the user
example_user would create the file `.github/contributors/example_user.md`.
Read this agreement carefully before signing. These terms and conditions
constitute a binding legal agreement.
## Contributor Agreement
1. The term "contribution" or "contributed materials" means any source code,
object code, patch, tool, sample, graphic, specification, manual,
documentation, or any other material posted or submitted by you to the project.
2. With respect to any worldwide copyrights, or copyright applications and
registrations, in your contribution:
* you hereby assign to us joint ownership, and to the extent that such
assignment is or becomes invalid, ineffective or unenforceable, you hereby
grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
royalty-free, unrestricted license to exercise all rights under those
copyrights. This includes, at our option, the right to sublicense these same
rights to third parties through multiple levels of sublicensees or other
licensing arrangements;
* you agree that each of us can do all things in relation to your
contribution as if each of us were the sole owners, and if one of us makes
a derivative work of your contribution, the one who makes the derivative
work (or has it made) will be the sole owner of that derivative work;
* you agree that you will not assert any moral rights in your contribution
against us, our licensees or transferees;
* you agree that we may register a copyright in your contribution and
exercise all ownership rights associated with it; and
* you agree that neither of us has any duty to consult with, obtain the
consent of, pay or render an accounting to the other for any use or
distribution of your contribution.
3. With respect to any patents you own, or that you can license without payment
to any third party, you hereby grant to us a perpetual, irrevocable,
non-exclusive, worldwide, no-charge, royalty-free license to:
* make, have made, use, sell, offer to sell, import, and otherwise transfer
your contribution in whole or in part, alone or in combination with or
included in any product, work or materials arising out of the project to
which your contribution was submitted, and
* at our option, to sublicense these same rights to third parties through
multiple levels of sublicensees or other licensing arrangements.
4. Except as set out above, you keep all right, title, and interest in your
contribution. The rights that you grant to us under these terms are effective
on the date you first submitted a contribution to us, even if your submission
took place before the date you sign these terms.
5. You covenant, represent, warrant and agree that:
* Each contribution that you submit is and shall be an original work of
authorship and you can legally grant the rights set out in this SCA;
* to the best of your knowledge, each contribution will not violate any
third party's copyrights, trademarks, patents, or other intellectual
property rights; and
* each contribution shall be in compliance with U.S. export control laws and
other applicable export and import laws. You agree to notify us if you
become aware of any circumstance which would make any of the foregoing
representations inaccurate in any respect. We may publicly disclose your
participation in the project, including the fact that you have signed the SCA.
6. This SCA is governed by the laws of the State of California and applicable
U.S. Federal law. Any choice of law rules will not apply.
7. Please place an “x” on one of the applicable statements below. Please do NOT
mark both statements:
* [X] I am signing on behalf of myself as an individual and no other person
or entity, including my employer, has or will have rights with respect to my
contributions.
* [ ] I am signing on behalf of my employer or a legal entity and I have the
actual authority to contractually bind that entity.
## Contributor Details
| Field | Entry |
|------------------------------- | -------------------- |
| Name | Baltazar Bieniek |
| Company name (if applicable) | |
| Title or role (if applicable) | |
| Date | 2021.08.19 |
| GitHub username | bbieniek |
| Website (optional) | https://baltazar.bieniek.org.pl/ |

.github/contributors/hlasse.md

@ -0,0 +1,106 @@
# spaCy contributor agreement
This spaCy Contributor Agreement (**"SCA"**) is based on the
[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf).
The SCA applies to any contribution that you make to any product or project
managed by us (the **"project"**), and sets out the intellectual property rights
you grant to us in the contributed materials. The term **"us"** shall mean
[ExplosionAI GmbH](https://explosion.ai/legal). The term
**"you"** shall mean the person or entity identified below.
If you agree to be bound by these terms, fill in the information requested
below and include the filled-in version with your first pull request, under the
folder [`.github/contributors/`](/.github/contributors/). The name of the file
should be your GitHub username, with the extension `.md`. For example, the user
example_user would create the file `.github/contributors/example_user.md`.
Read this agreement carefully before signing. These terms and conditions
constitute a binding legal agreement.
## Contributor Agreement
1. The term "contribution" or "contributed materials" means any source code,
object code, patch, tool, sample, graphic, specification, manual,
documentation, or any other material posted or submitted by you to the project.
2. With respect to any worldwide copyrights, or copyright applications and
registrations, in your contribution:
* you hereby assign to us joint ownership, and to the extent that such
assignment is or becomes invalid, ineffective or unenforceable, you hereby
grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
royalty-free, unrestricted license to exercise all rights under those
copyrights. This includes, at our option, the right to sublicense these same
rights to third parties through multiple levels of sublicensees or other
licensing arrangements;
* you agree that each of us can do all things in relation to your
contribution as if each of us were the sole owners, and if one of us makes
a derivative work of your contribution, the one who makes the derivative
work (or has it made) will be the sole owner of that derivative work;
* you agree that you will not assert any moral rights in your contribution
against us, our licensees or transferees;
* you agree that we may register a copyright in your contribution and
exercise all ownership rights associated with it; and
* you agree that neither of us has any duty to consult with, obtain the
consent of, pay or render an accounting to the other for any use or
distribution of your contribution.
3. With respect to any patents you own, or that you can license without payment
to any third party, you hereby grant to us a perpetual, irrevocable,
non-exclusive, worldwide, no-charge, royalty-free license to:
* make, have made, use, sell, offer to sell, import, and otherwise transfer
your contribution in whole or in part, alone or in combination with or
included in any product, work or materials arising out of the project to
which your contribution was submitted, and
* at our option, to sublicense these same rights to third parties through
multiple levels of sublicensees or other licensing arrangements.
4. Except as set out above, you keep all right, title, and interest in your
contribution. The rights that you grant to us under these terms are effective
on the date you first submitted a contribution to us, even if your submission
took place before the date you sign these terms.
5. You covenant, represent, warrant and agree that:
* Each contribution that you submit is and shall be an original work of
authorship and you can legally grant the rights set out in this SCA;
* to the best of your knowledge, each contribution will not violate any
third party's copyrights, trademarks, patents, or other intellectual
property rights; and
* each contribution shall be in compliance with U.S. export control laws and
other applicable export and import laws. You agree to notify us if you
become aware of any circumstance which would make any of the foregoing
representations inaccurate in any respect. We may publicly disclose your
participation in the project, including the fact that you have signed the SCA.
6. This SCA is governed by the laws of the State of California and applicable
U.S. Federal law. Any choice of law rules will not apply.
7. Please place an “x” on one of the applicable statements below. Please do NOT
mark both statements:
* [X] I am signing on behalf of myself as an individual and no other person
or entity, including my employer, has or will have rights with respect to my
contributions.
* [ ] I am signing on behalf of my employer or a legal entity and I have the
actual authority to contractually bind that entity.
## Contributor Details
| Field | Entry |
|------------------------------- | -------------------------- |
| Name | Lasse Hansen |
| Company name (if applicable) | |
| Title or role (if applicable) | |
| Date | 2021-08-11 |
| GitHub username | HLasse |
| Website (optional) | www.lassehansen.me |

.github/contributors/philipvollet.md

@ -0,0 +1,106 @@
# spaCy contributor agreement
This spaCy Contributor Agreement (**"SCA"**) is based on the
[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf).
The SCA applies to any contribution that you make to any product or project
managed by us (the **"project"**), and sets out the intellectual property rights
you grant to us in the contributed materials. The term **"us"** shall mean
[ExplosionAI GmbH](https://explosion.ai/legal). The term
**"you"** shall mean the person or entity identified below.
If you agree to be bound by these terms, fill in the information requested
below and include the filled-in version with your first pull request, under the
folder [`.github/contributors/`](/.github/contributors/). The name of the file
should be your GitHub username, with the extension `.md`. For example, the user
example_user would create the file `.github/contributors/example_user.md`.
Read this agreement carefully before signing. These terms and conditions
constitute a binding legal agreement.
## Contributor Agreement
1. The term "contribution" or "contributed materials" means any source code,
object code, patch, tool, sample, graphic, specification, manual,
documentation, or any other material posted or submitted by you to the project.
2. With respect to any worldwide copyrights, or copyright applications and
registrations, in your contribution:
* you hereby assign to us joint ownership, and to the extent that such
assignment is or becomes invalid, ineffective or unenforceable, you hereby
grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
royalty-free, unrestricted license to exercise all rights under those
copyrights. This includes, at our option, the right to sublicense these same
rights to third parties through multiple levels of sublicensees or other
licensing arrangements;
* you agree that each of us can do all things in relation to your
contribution as if each of us were the sole owners, and if one of us makes
a derivative work of your contribution, the one who makes the derivative
work (or has it made) will be the sole owner of that derivative work;
* you agree that you will not assert any moral rights in your contribution
against us, our licensees or transferees;
* you agree that we may register a copyright in your contribution and
exercise all ownership rights associated with it; and
* you agree that neither of us has any duty to consult with, obtain the
consent of, pay or render an accounting to the other for any use or
distribution of your contribution.
3. With respect to any patents you own, or that you can license without payment
to any third party, you hereby grant to us a perpetual, irrevocable,
non-exclusive, worldwide, no-charge, royalty-free license to:
* make, have made, use, sell, offer to sell, import, and otherwise transfer
your contribution in whole or in part, alone or in combination with or
included in any product, work or materials arising out of the project to
which your contribution was submitted, and
* at our option, to sublicense these same rights to third parties through
multiple levels of sublicensees or other licensing arrangements.
4. Except as set out above, you keep all right, title, and interest in your
contribution. The rights that you grant to us under these terms are effective
on the date you first submitted a contribution to us, even if your submission
took place before the date you sign these terms.
5. You covenant, represent, warrant and agree that:
* Each contribution that you submit is and shall be an original work of
authorship and you can legally grant the rights set out in this SCA;
* to the best of your knowledge, each contribution will not violate any
third party's copyrights, trademarks, patents, or other intellectual
property rights; and
* each contribution shall be in compliance with U.S. export control laws and
other applicable export and import laws. You agree to notify us if you
become aware of any circumstance which would make any of the foregoing
representations inaccurate in any respect. We may publicly disclose your
participation in the project, including the fact that you have signed the SCA.
6. This SCA is governed by the laws of the State of California and applicable
U.S. Federal law. Any choice of law rules will not apply.
7. Please place an “x” on one of the applicable statements below. Please do NOT
mark both statements:
* [x] I am signing on behalf of myself as an individual and no other person
or entity, including my employer, has or will have rights with respect to my
contributions.
* [ ] I am signing on behalf of my employer or a legal entity and I have the
actual authority to contractually bind that entity.
## Contributor Details
| Field | Entry |
|------------------------------- | -------------------- |
| Name | Philip Vollet |
| Company name (if applicable) | |
| Title or role (if applicable) | |
| Date | 22.09.2021 |
| GitHub username | philipvollet |
| Website (optional) | |

.github/contributors/shigapov.md

@ -0,0 +1,106 @@
# spaCy contributor agreement
This spaCy Contributor Agreement (**"SCA"**) is based on the
[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf).
The SCA applies to any contribution that you make to any product or project
managed by us (the **"project"**), and sets out the intellectual property rights
you grant to us in the contributed materials. The term **"us"** shall mean
[ExplosionAI GmbH](https://explosion.ai/legal). The term
**"you"** shall mean the person or entity identified below.
If you agree to be bound by these terms, fill in the information requested
below and include the filled-in version with your first pull request, under the
folder [`.github/contributors/`](/.github/contributors/). The name of the file
should be your GitHub username, with the extension `.md`. For example, the user
example_user would create the file `.github/contributors/example_user.md`.
Read this agreement carefully before signing. These terms and conditions
constitute a binding legal agreement.
## Contributor Agreement
1. The term "contribution" or "contributed materials" means any source code,
object code, patch, tool, sample, graphic, specification, manual,
documentation, or any other material posted or submitted by you to the project.
2. With respect to any worldwide copyrights, or copyright applications and
registrations, in your contribution:
* you hereby assign to us joint ownership, and to the extent that such
assignment is or becomes invalid, ineffective or unenforceable, you hereby
grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
royalty-free, unrestricted license to exercise all rights under those
copyrights. This includes, at our option, the right to sublicense these same
rights to third parties through multiple levels of sublicensees or other
licensing arrangements;
* you agree that each of us can do all things in relation to your
contribution as if each of us were the sole owners, and if one of us makes
a derivative work of your contribution, the one who makes the derivative
work (or has it made) will be the sole owner of that derivative work;
* you agree that you will not assert any moral rights in your contribution
against us, our licensees or transferees;
* you agree that we may register a copyright in your contribution and
exercise all ownership rights associated with it; and
* you agree that neither of us has any duty to consult with, obtain the
consent of, pay or render an accounting to the other for any use or
distribution of your contribution.
3. With respect to any patents you own, or that you can license without payment
to any third party, you hereby grant to us a perpetual, irrevocable,
non-exclusive, worldwide, no-charge, royalty-free license to:
* make, have made, use, sell, offer to sell, import, and otherwise transfer
your contribution in whole or in part, alone or in combination with or
included in any product, work or materials arising out of the project to
which your contribution was submitted, and
* at our option, to sublicense these same rights to third parties through
multiple levels of sublicensees or other licensing arrangements.
4. Except as set out above, you keep all right, title, and interest in your
contribution. The rights that you grant to us under these terms are effective
on the date you first submitted a contribution to us, even if your submission
took place before the date you sign these terms.
5. You covenant, represent, warrant and agree that:
* Each contribution that you submit is and shall be an original work of
authorship and you can legally grant the rights set out in this SCA;
* to the best of your knowledge, each contribution will not violate any
third party's copyrights, trademarks, patents, or other intellectual
property rights; and
* each contribution shall be in compliance with U.S. export control laws and
other applicable export and import laws. You agree to notify us if you
become aware of any circumstance which would make any of the foregoing
representations inaccurate in any respect. We may publicly disclose your
participation in the project, including the fact that you have signed the SCA.
6. This SCA is governed by the laws of the State of California and applicable
U.S. Federal law. Any choice of law rules will not apply.
7. Please place an “x” on one of the applicable statements below. Please do NOT
mark both statements:
* [x] I am signing on behalf of myself as an individual and no other person
or entity, including my employer, has or will have rights with respect to my
contributions.
* [ ] I am signing on behalf of my employer or a legal entity and I have the
actual authority to contractually bind that entity.
## Contributor Details
| Field | Entry |
|------------------------------- | ------------------------ |
| Name | Renat Shigapov |
| Company name (if applicable) | |
| Title or role (if applicable) | |
| Date | 2021-09-09 |
| GitHub username | shigapov |
| Website (optional) | |

.github/contributors/swfarnsworth.md

@ -0,0 +1,88 @@
## Contributor Agreement
1. The term "contribution" or "contributed materials" means any source code,
object code, patch, tool, sample, graphic, specification, manual,
documentation, or any other material posted or submitted by you to the project.
2. With respect to any worldwide copyrights, or copyright applications and
registrations, in your contribution:
* you hereby assign to us joint ownership, and to the extent that such
assignment is or becomes invalid, ineffective or unenforceable, you hereby
grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
royalty-free, unrestricted license to exercise all rights under those
copyrights. This includes, at our option, the right to sublicense these same
rights to third parties through multiple levels of sublicensees or other
licensing arrangements;
* you agree that each of us can do all things in relation to your
contribution as if each of us were the sole owners, and if one of us makes
a derivative work of your contribution, the one who makes the derivative
work (or has it made) will be the sole owner of that derivative work;
* you agree that you will not assert any moral rights in your contribution
against us, our licensees or transferees;
* you agree that we may register a copyright in your contribution and
exercise all ownership rights associated with it; and
* you agree that neither of us has any duty to consult with, obtain the
consent of, pay or render an accounting to the other for any use or
distribution of your contribution.
3. With respect to any patents you own, or that you can license without payment
to any third party, you hereby grant to us a perpetual, irrevocable,
non-exclusive, worldwide, no-charge, royalty-free license to:
* make, have made, use, sell, offer to sell, import, and otherwise transfer
your contribution in whole or in part, alone or in combination with or
included in any product, work or materials arising out of the project to
which your contribution was submitted, and
* at our option, to sublicense these same rights to third parties through
multiple levels of sublicensees or other licensing arrangements.
4. Except as set out above, you keep all right, title, and interest in your
contribution. The rights that you grant to us under these terms are effective
on the date you first submitted a contribution to us, even if your submission
took place before the date you sign these terms.
5. You covenant, represent, warrant and agree that:
* Each contribution that you submit is and shall be an original work of
authorship and you can legally grant the rights set out in this SCA;
* to the best of your knowledge, each contribution will not violate any
third party's copyrights, trademarks, patents, or other intellectual
property rights; and
* each contribution shall be in compliance with U.S. export control laws and
other applicable export and import laws. You agree to notify us if you
become aware of any circumstance which would make any of the foregoing
representations inaccurate in any respect. We may publicly disclose your
participation in the project, including the fact that you have signed the SCA.
6. This SCA is governed by the laws of the State of California and applicable
U.S. Federal law. Any choice of law rules will not apply.
7. Please place an “x” on one of the applicable statements below. Please do NOT
mark both statements:
* [x] I am signing on behalf of myself as an individual and no other person
or entity, including my employer, has or will have rights with respect to my
contributions.
* [ ] I am signing on behalf of my employer or a legal entity and I have the
actual authority to contractually bind that entity.
## Contributor Details
| Field | Entry |
|------------------------------- | -------------------- |
| Name | Steele Farnsworth |
| Company name (if applicable) | |
| Title or role (if applicable) | |
| Date | 13 August, 2021 |
| GitHub username | swfarnsworth |
| Website (optional) | |


@ -140,17 +140,6 @@ Changes to `.py` files will be effective immediately.
 📖 **For more details and instructions, see the documentation on [compiling spaCy from source](https://spacy.io/usage/#source) and the [quickstart widget](https://spacy.io/usage/#section-quickstart) to get the right commands for your platform and Python version.**
-### Contributor agreement
-If you've made a contribution to spaCy, you should fill in the
-[spaCy contributor agreement](.github/CONTRIBUTOR_AGREEMENT.md) to ensure that
-your contribution can be used across the project. If you agree to be bound by
-the terms of the agreement, fill in the [template](.github/CONTRIBUTOR_AGREEMENT.md)
-and include it with your pull request, or submit it separately to
-[`.github/contributors/`](/.github/contributors). The name of the file should be
-your GitHub username, with the extension `.md`. For example, the user
-example_user would create the file `.github/contributors/example_user.md`.
 ### Fixing bugs
 When fixing a bug, first create an
@ -185,7 +174,6 @@ Each time a `git commit` is initiated, `black` and `flake8` will run automatical
 In case of error, or when `black` modified a file, the modified file needs to be `git add` once again and a new
 `git commit` has to be issued.
 ### Code formatting
 [`black`](https://github.com/ambv/black) is an opinionated Python code
@ -414,14 +402,7 @@ all test files and test functions need to be prefixed with `test_`.
 When adding tests, make sure to use descriptive names, keep the code short and
 concise and only test for one behavior at a time. Try to `parametrize` test
 cases wherever possible, use our pre-defined fixtures for spaCy components and
-avoid unnecessary imports.
-Extensive tests that take a long time should be marked with `@pytest.mark.slow`.
-Tests that require the model to be loaded should be marked with
-`@pytest.mark.models`. Loading the models is expensive and not necessary if
-you're not actually testing the model performance. If all you need is a `Doc`
-object with annotations like heads, POS tags or the dependency parse, you can
-use the `Doc` constructor to construct it manually.
+avoid unnecessary imports. Extensive tests that take a long time should be marked with `@pytest.mark.slow`.
 📖 **For more guidelines and information on how to add tests, check out the [tests README](spacy/tests/README.md).**


@ -0,0 +1,546 @@
# Code Conventions
For a general overview of code conventions for contributors, see the [section in the contributing guide](https://github.com/explosion/spaCy/blob/master/CONTRIBUTING.md#code-conventions).
1. [Code compatibility](#code-compatibility)
2. [Auto-formatting](#auto-formatting)
3. [Linting](#linting)
4. [Documenting code](#documenting-code)
5. [Type hints](#type-hints)
6. [Structuring logic](#structuring-logic)
7. [Naming](#naming)
8. [Error handling](#error-handling)
9. [Writing tests](#writing-tests)
## Code compatibility
spaCy supports **Python 3.6** and above, so all code should be written to be compatible with 3.6. This means that there are certain new syntax features that we won't be able to use until we drop support for older Python versions. Some newer features provide backports that we can conditionally install for older versions, although we only want to do this if it's absolutely necessary. If we need to use conditional imports based on the Python version or other custom compatibility-specific helpers, those should live in `compat.py`.
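As a minimal sketch of what a version-conditional import in `compat.py` could look like, the snippet below takes `Literal` from the standard library on Python 3.8+ and falls back to a backport on older versions. The choice of `Literal`, the `typing_extensions` backport and the `set_mode` helper are assumptions made for illustration, not a description of what `compat.py` actually contains.

```python
import sys

# Illustrative only: `Literal` was added to the `typing` module in Python 3.8.
if sys.version_info[:2] >= (3, 8):
    from typing import Literal
else:
    # Assumes the `typing_extensions` backport is installed on Python 3.6/3.7.
    from typing_extensions import Literal


def set_mode(mode: Literal["train", "eval"]) -> str:
    """Hypothetical helper that uses the conditionally imported type."""
    return f"mode={mode}"
```

Other modules would then import the name from the compatibility module instead of repeating the version check.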
## Auto-formatting
spaCy uses `black` for auto-formatting (which is also available as a pre-commit hook). It's recommended to configure your editor to perform this automatically, either triggered manually or whenever you save a file. We also have a GitHub action that regularly formats the code base and submits a PR if changes are available. Note that auto-formatting is currently only available for `.py` (Python) files, not for `.pyx` (Cython).
As a rule of thumb, if the auto-formatting produces output that looks messy, it can often indicate that there's a better way to structure the code to make it more concise.
```diff
- range_suggester = registry.misc.get("spacy.ngram_range_suggester.v1")(
- min_size=1, max_size=3
- )
+ suggester_factory = registry.misc.get("spacy.ngram_range_suggester.v1")
+ range_suggester = suggester_factory(min_size=1, max_size=3)
```
In some specific cases, e.g. in the tests, it can make sense to disable auto-formatting for a specific block. You can do this by wrapping the code in `# fmt: off` and `# fmt: on`:
```diff
+ # fmt: off
  text = "I look forward to using Thingamajig. I've been told it will make my life easier..."
  deps = ["nsubj", "ROOT", "advmod", "prep", "pcomp", "dobj", "punct", "",
          "nsubjpass", "aux", "auxpass", "ROOT", "nsubj", "aux", "ccomp",
          "poss", "nsubj", "ccomp", "punct"]
+ # fmt: on
```
## Linting
[`flake8`](http://flake8.pycqa.org/en/latest/) is a tool for enforcing code style. It scans one or more files and outputs errors and warnings. This feedback can help you stick to general standards and conventions, and can be very useful for spotting potential mistakes and inconsistencies in your code. Code you write should be compatible with our flake8 rules and not cause any warnings.
```bash
flake8 spacy
```
The most common problems surfaced by linting are:
- **Trailing or missing whitespace.** This is related to formatting and should be fixed automatically by running `black`.
- **Unused imports.** Those should be removed if the imports aren't actually used. If they're required, e.g. to expose them so they can be imported from the given module, you can add a comment and a `# noqa: F401` exception (see details below).
- **Unused variables.** This can often indicate bugs, e.g. a variable that's declared and not correctly passed on or returned. To prevent ambiguity here, your code shouldn't contain unused variables. If you're unpacking a list of tuples and end up with variables you don't need, you can call them `_` to indicate that they're unused.
- **Redefinition of function.** This can also indicate bugs, e.g. a copy-pasted function that you forgot to rename and that now replaces the original function.
- **Repeated dictionary keys.** This either indicates a bug or unnecessary duplication.
- **Comparison with `True`, `False`, `None`**. This is mostly a stylistic thing: when checking whether a value is `True`, `False` or `None`, you should be using `is` instead of `==`. For example, `if value is None`.
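As a minimal sketch of two of the points above (naming unused values `_` and comparing to `None` with `is`), consider the following. The function `get_labels` and its data are hypothetical and exist only for this illustration.

```python
from typing import List, Optional, Tuple


def get_labels(pairs: List[Tuple[str, str]], limit: Optional[int] = None) -> List[str]:
    # Only the label is needed, so the unused first element is named `_`
    labels = [label for _, label in pairs]
    # Compare against None with `is`, not `==`
    if limit is None:
        return labels
    return labels[:limit]


pairs = [("Berlin", "GPE"), ("Ada", "PERSON"), ("spaCy", "ORG")]
all_labels = get_labels(pairs)
first_two = get_labels(pairs, limit=2)
```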
### Ignoring linter rules for special cases
To ignore a given line, you can add a comment like `# noqa: F401`, specifying the code of the error or warning you want to ignore. It's also possible to ignore several comma-separated codes at once, e.g. `# noqa: E731,E123`. In general, you should always **specify the code(s)** you want to ignore. Otherwise, you may end up missing actual problems.
```python
# The imported class isn't used in this file, but imported here, so it can be
# imported *from* here by another module.
from .submodule import SomeClass  # noqa: F401

try:
    do_something()
except:  # noqa: E722
    # This bare except is justified, for some specific reason
    do_something_else()
```
## Documenting code
All functions and methods you write should be documented with a docstring inline. The docstring can contain a simple summary, and an overview of the arguments and their (simplified) types. Modern editors will show this information to users when they call the function or method in their code.
If it's part of the public API and there's a documentation section available, we usually add the link as `DOCS:` at the end. This allows us to keep the docstrings simple and concise, while also providing additional information and examples if necessary.
```python
def has_pipe(self, name: str) -> bool:
    """Check if a component name is present in the pipeline. Equivalent to
    `name in nlp.pipe_names`.

    name (str): Name of the component.
    RETURNS (bool): Whether a component of the name exists in the pipeline.

    DOCS: https://spacy.io/api/language#has_pipe
    """
    ...
```
We specifically chose this approach of maintaining the docstrings and API reference separately, instead of auto-generating the API docs from the docstrings like other packages do. We want to be able to provide extensive explanations and examples in the documentation and use our own custom markup for it that would otherwise clog up the docstrings. We also want to be able to update the documentation independently of the code base. It's slightly more work, but it's absolutely worth it in terms of user and developer experience.
### Inline code comments
We don't expect you to add inline comments for everything you're doing. This should be obvious from reading the code. If it's not, the first thing to check is whether your code can be improved to make it more explicit. That said, if your code includes complex logic or aspects that may be unintuitive at first glance (or even included a subtle bug that you ended up fixing), you should leave a quick comment that provides more context.
```diff
token_index = indices[value]
+ # Index describes Token.i of last token but Span end indices are exclusive
span = doc[prev_token_index:token_index + 1]
```
```diff
+ # To create the components we need to use the final interpolated config
+ # so all values are available (if component configs use variables).
+ # Later we replace the component config with the raw config again.
interpolated = filled.interpolate() if not filled.is_interpolated else filled
```
Don't be shy about including comments for tricky parts that _you_ found hard to implement or get right. Those may come in handy for the next person working on this code, or even future you!
If your change implements a fix to a specific issue, it can often be helpful to include the issue number in the comment, especially if it's a relatively straightforward adjustment:
```diff
+ # Ensure object is a Span, not a Doc (#1234)
  if isinstance(obj, Doc):
      obj = obj[obj.start:obj.end]
```
### Including TODOs
It's fine to include code comments that indicate future TODOs, using the `TODO:` prefix. Modern editors typically format this in a different color, so it's easy to spot. TODOs don't necessarily have to be things that are absolutely critical to fix right now. Those should already be addressed in your pull request once it's ready for review. But they can include notes about potential future improvements.
```diff
+ # TODO: this is currently pretty slow
  dir_checksum = hashlib.md5()
  for sub_file in sorted(fp for fp in path.rglob("*") if fp.is_file()):
      dir_checksum.update(sub_file.read_bytes())
```
If any of the TODOs you've added are important and should be fixed soon, you should add a task for this on Explosion's internal Ora board or an issue on the public issue tracker to make sure we don't forget to address it.
## Type hints
We use Python type hints across the `.py` files wherever possible. This makes it easy to understand what a function expects and returns, and modern editors will be able to show this information to you when you call an annotated function. Type hints are not currently used in the `.pyx` (Cython) code, except for definitions of registered functions and component factories, where they're used for config validation.
If possible, you should always use the more descriptive type hints like `List[str]` or even `List[Any]` instead of only `list`. We also annotate arguments and return types of `Callable`, although you can simplify this if the type otherwise gets too verbose (e.g. functions that return factories to create callbacks). Remember that `Callable` takes two values: a **list** of the argument type(s) in order, and the return type.
```diff
- def func(some_arg: dict) -> None:
+ def func(some_arg: Dict[str, Any]) -> None:
      ...
```
```python
def create_callback(some_arg: bool) -> Callable[[str, int], List[str]]:
    def callback(arg1: str, arg2: int) -> List[str]:
        ...

    return callback
```
For model architectures, Thinc also provides a collection of [custom types](https://thinc.ai/docs/api-types), including more specific types for arrays and model inputs/outputs. Even outside of static type checking, using these types will make the code a lot easier to read and follow, since it's always clear what array types are expected (and what might go wrong if the output is different from the expected type).
```python
def build_tagger_model(
    tok2vec: Model[List[Doc], List[Floats2d]], nO: Optional[int] = None
) -> Model[List[Doc], List[Floats2d]]:
    ...
```
If you need to use a type hint that refers to something later declared in the same module, or the class that a method belongs to, you can use a string value instead:
```python
class SomeClass:
    def from_bytes(self, data: bytes) -> "SomeClass":
        ...
```
In some cases, you won't be able to import a class from a different module to use it as a type hint because it'd cause circular imports. For instance, `spacy/util.py` includes various helper functions that return an instance of `Language`, but we couldn't import it, because `spacy/language.py` imports `util` itself. In this case, we can provide `"Language"` as a string and make the import conditional on `typing.TYPE_CHECKING` so it only runs when the code is evaluated by a type checker:
```python
from typing import TYPE_CHECKING

if TYPE_CHECKING:
    from .language import Language


def load_model(name: str) -> "Language":
    ...
```
## Structuring logic
### Positional and keyword arguments
We generally try to avoid writing functions and methods with too many arguments, and use keyword-only arguments wherever possible. Python lets you define arguments as keyword-only by separating them with a `, *`. If you're writing functions with additional arguments that customize the behavior, you typically want to make those arguments keyword-only, so their names have to be provided explicitly.
```diff
- def do_something(name: str, validate: bool = False):
+ def do_something(name: str, *, validate: bool = False):
      ...

- do_something("some_name", True)
+ do_something("some_name", validate=True)
```
This makes the function calls easier to read, because it's immediately clear what the additional values mean. It also makes it easier to extend arguments or change their order later on, because you don't end up with any function calls that depend on a specific positional order.
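A minimal runnable sketch of this behavior, reusing the `do_something` signature from the diff above (the function body is made up for the illustration). The `try`/`except` here only serves to show the error Python raises when the flag is passed positionally.

```python
def do_something(name: str, *, validate: bool = False) -> str:
    # Everything after the bare `*` can only be passed by keyword
    return f"{name} (validated)" if validate else name


result = do_something("some_name", validate=True)

try:
    # Passing the flag positionally is rejected by Python itself
    do_something("some_name", True)
    raised = False
except TypeError:
    raised = True
```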
### Avoid mutable default arguments
A common Python gotcha is [mutable default arguments](https://docs.python-guide.org/writing/gotchas/#mutable-default-arguments): if an argument defines a mutable default value like `[]` or `{}` and the function then mutates it, the default value is created _once_ when the function is defined and the same object is then mutated every time the function is called. This can be pretty unintuitive when you first encounter it. We therefore avoid writing logic that does this.
If your arguments need to default to an empty list or dict, you can use the `SimpleFrozenList` and `SimpleFrozenDict` helpers provided by spaCy. They are simple frozen implementations that raise an error if they're being mutated to prevent bugs and logic that accidentally mutates default arguments.
```diff
- def to_bytes(self, *, exclude: List[str] = []):
+ def to_bytes(self, *, exclude: List[str] = SimpleFrozenList()):
      ...
```
```diff
  def do_something(values: List[str] = SimpleFrozenList()):
      if some_condition:
-         values.append("foo")  # raises an error
+         values = [*values, "foo"]
      return values
```
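To see the gotcha in isolation, here is a small standard-library sketch. Both functions are hypothetical, and the `None` default in `add_item` is only a stand-in so the example runs without spaCy. Within spaCy's code base, `SimpleFrozenList()` as shown above is the preferred default.

```python
from typing import List, Optional


def add_item_buggy(item: str, items: List[str] = []) -> List[str]:
    # The default list is created once, at function definition time
    items.append(item)
    return items


def add_item(item: str, items: Optional[List[str]] = None) -> List[str]:
    # Build a new list instead of mutating the argument or a shared default
    return [*(items or []), item]


first = add_item_buggy("foo")
second = add_item_buggy("bar")
shared = first is second  # both calls returned the same default object

safe_first = add_item("foo")
safe_second = add_item("bar")
```

After the two buggy calls, the single shared default list contains both items, while the non-mutating version returns an independent list each time.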
### Don't use `try`/`except` for control flow
We strongly discourage using `try`/`except` blocks for anything that's not third-party error handling or error handling that we otherwise have little control over. There's usually a way to anticipate the _actual_ problem and **check for it explicitly**, which makes the code easier to follow and understand, and prevents bugs:
```diff
- try:
-     token = doc[i]
- except IndexError:
-     token = doc[-1]

+ if i < len(doc):
+     token = doc[i]
+ else:
+     token = doc[-1]
```
Even if you end up having to check for multiple conditions explicitly, this is still preferred over a catch-all `try`/`except`. It can be very helpful to think about the exact scenarios you need to cover, and what could go wrong at each step, which often leads to better code and fewer bugs. `try`/`except` blocks can also easily mask _other_ bugs and problems that raise the same errors you're catching, which makes those bugs much harder to find.
If you have to use `try`/`except`, make sure to only include what's **absolutely necessary** in the `try` block and define the exception(s) explicitly. Otherwise, you may end up masking very different exceptions caused by other bugs.
```diff
- try:
-     value1 = get_some_value()
-     value2 = get_some_other_value()
-     score = external_library.compute_some_score(value1, value2)
- except:
-     score = 0.0

+ value1 = get_some_value()
+ value2 = get_some_other_value()
+ try:
+     score = external_library.compute_some_score(value1, value2)
+ except ValueError:
+     score = 0.0
```
### Avoid lambda functions
`lambda` functions can be useful for defining simple anonymous functions in a single line, but they also introduce problems: for instance, they require [additional logic](https://stackoverflow.com/questions/25348532/can-python-pickle-lambda-functions) in order to be pickled and are pretty ugly to type-annotate. So we typically avoid them in the code base and only use them in the serialization handlers and within tests for simplicity. Instead of `lambda`s, check if your code can be refactored to not need them, or use helper functions instead.
```diff
- split_string: Callable[[str], List[str]] = lambda value: [v.strip() for v in value.split(",")]
+ def split_string(value: str) -> List[str]:
+     return [v.strip() for v in value.split(",")]
```
### Iteration and comprehensions
We generally avoid using built-in functions like `filter` or `map` in favor of list or generator comprehensions.
```diff
- filtered = filter(lambda x: x in ["foo", "bar"], values)
+ filtered = (x for x in values if x in ["foo", "bar"])
- filtered = list(filter(lambda x: x in ["foo", "bar"], values))
+ filtered = [x for x in values if x in ["foo", "bar"]]
- result = map(lambda x: { x: x in ["foo", "bar"]}, values)
+ result = ({x: x in ["foo", "bar"]} for x in values)
- result = list(map(lambda x: { x: x in ["foo", "bar"]}, values))
+ result = [{x: x in ["foo", "bar"]} for x in values]
```
If your logic is more complex, it's often better to write a loop instead, even if it adds more lines of code in total. The result will be much easier to follow and understand.
```diff
- result = [{"key": key, "scores": {f"{i}": score for i, score in enumerate(scores)}} for key, scores in values]
+ result = []
+ for key, scores in values:
+     scores_dict = {f"{i}": score for i, score in enumerate(scores)}
+     result.append({"key": key, "scores": scores_dict})
```
### Composition vs. inheritance
Although spaCy uses a lot of classes, **inheritance is viewed with some suspicion** — it's seen as a mechanism of last resort. You should discuss plans to extend the class hierarchy before implementing. Unless you're implementing a new data structure or pipeline component, you typically shouldn't have to use classes at all.
### Don't use `print`
The core library never `print`s anything. While we encourage using `print` statements for simple debugging (it's the most straightforward way of looking at what's happening), make sure to clean them up once you're ready to submit your pull request. If you want to output warnings or debugging information for users, use the respective dedicated mechanisms for this instead (see sections on warnings and logging for details).
The only exceptions are the CLI functions, which pretty-print messages for the user, and methods that are explicitly intended for printing things, e.g. `Language.analyze_pipes` with `pretty=True` enabled. For this, we use our lightweight helper library [`wasabi`](https://github.com/ines/wasabi).
## Naming
Naming is hard and often a topic of long internal discussions. We don't expect you to come up with the perfect names for everything you write. Finding the right names is often an iterative and collaborative process. That said, we do try to follow some basic conventions.
Consistent with general Python conventions, we use `CamelCase` for class names including dataclasses, `snake_case` for methods, functions and variables, and `UPPER_SNAKE_CASE` for constants, typically defined at the top of a module. We also avoid using variable names that shadow the names of built-in functions, e.g. `input`, `help` or `list`.
### Naming variables
Variable names should always make it clear _what exactly_ the variable is and what it's used for. Instances of common classes should use the same consistent names. For example, you should avoid naming a text string (or anything else that's not a `Doc` object) `doc`. The most common class-to-variable mappings are:
| Class | Variable | Example |
| ---------- | --------------------- | ------------------------------------------- |
| `Language` | `nlp` | `nlp = spacy.blank("en")` |
| `Doc` | `doc` | `doc = nlp("Some text")` |
| `Span` | `span`, `ent`, `sent` | `span = doc[1:4]`, `ent = doc.ents[0]` |
| `Token` | `token` | `token = doc[0]` |
| `Lexeme` | `lexeme`, `lex` | `lex = nlp.vocab["foo"]` |
| `Vocab` | `vocab` | `vocab = Vocab()` |
| `Example` | `example`, `eg` | `example = Example.from_dict(doc, gold)` |
| `Config` | `config`, `cfg` | `config = Config().from_disk("config.cfg")` |
We try to avoid introducing too many temporary variables, as these clutter your namespace. It's okay to re-assign to an existing variable, but only if the value has the same type.
```diff
ents = get_a_list_of_entities()
ents = [ent for ent in doc.ents if ent.label_ == "PERSON"]
- ents = {(ent.start, ent.end): ent.label_ for ent in ents}
+ ent_mappings = {(ent.start, ent.end): ent.label_ for ent in ents}
```
### Naming methods and functions
Try choosing short and descriptive names wherever possible and imperative verbs for methods that do something, e.g. `disable_pipes`, `add_patterns` or `get_vector`. Private methods and functions that are not intended to be part of the user-facing API should be prefixed with an underscore `_`. It's often helpful to look at the existing classes for inspiration.
Objects that can be serialized, e.g. data structures and pipeline components, should implement the same consistent methods for serialization. Those usually include at least `to_disk`, `from_disk`, `to_bytes` and `from_bytes`. Some objects can also implement more specific methods like `{to/from}_dict` or `{to/from}_str`.
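The method names above can be sketched on a toy class as follows. `LabelMap` is hypothetical and the JSON format is an assumption made purely for the illustration. spaCy's own objects use their own serialization format and accept additional arguments that are omitted here.

```python
import json
import tempfile
from pathlib import Path
from typing import Dict, Union


class LabelMap:
    """Hypothetical data structure, used only to illustrate the method names."""

    def __init__(self) -> None:
        self.labels: Dict[str, int] = {}

    def to_bytes(self) -> bytes:
        return json.dumps(self.labels, sort_keys=True).encode("utf8")

    def from_bytes(self, data: bytes) -> "LabelMap":
        # Load in place and return self, so calls can be chained
        self.labels = json.loads(data.decode("utf8"))
        return self

    def to_disk(self, path: Union[str, Path]) -> None:
        Path(path).write_bytes(self.to_bytes())

    def from_disk(self, path: Union[str, Path]) -> "LabelMap":
        return self.from_bytes(Path(path).read_bytes())


label_map = LabelMap()
label_map.labels = {"PERSON": 0, "ORG": 1}
with tempfile.TemporaryDirectory() as tmp_dir:
    file_path = Path(tmp_dir) / "labels.json"
    label_map.to_disk(file_path)
    restored = LabelMap().from_disk(file_path)
```

Building the disk methods on top of the bytes methods keeps the two round trips consistent, which is one possible design and not necessarily how every spaCy object does it.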
## Error handling
We always encourage writing helpful and detailed custom error messages for everything we can anticipate going wrong, and including as much detail as possible. spaCy provides a directory of error messages in `errors.py` with unique codes for each message. This allows us to keep the code base more concise and avoids long and nested blocks of text throughout the code that disrupt the reading flow. The codes make it easy to find references to the same error in different places, and also help identify problems reported by users (since we can just search for the error code).
Errors can be referenced via their code, e.g. `Errors.E123`. Messages can also include placeholders for values that can be populated by formatting the string with `.format()`.
```python
class Errors:
    E123 = "Something went wrong"
    E456 = "Unexpected value: {value}"
```
```diff
  if something_went_wrong:
-     raise ValueError("Something went wrong!")
+     raise ValueError(Errors.E123)

  if not isinstance(value, int):
-     raise ValueError(f"Unexpected value: {value}")
+     raise ValueError(Errors.E456.format(value=value))
```
As a general rule of thumb, all error messages raised within the **core library** should be added to `Errors`. The only place where we write errors and messages as strings is `spacy.cli`, since these functions typically pretty-print and generate a lot of output that'd otherwise be very difficult to separate from the actual logic.
### Re-raising exceptions
If we anticipate possible errors in third-party code that we don't control, or our own code in a very different context, we typically try to provide custom and more specific error messages if possible. If we need to re-raise an exception within a `try`/`except` block, we can re-raise a custom exception.
[Re-raising `from`](https://docs.python.org/3/tutorial/errors.html#exception-chaining) the original caught exception lets us chain the exceptions, so the user sees both the original error, as well as the custom message with a note "The above exception was the direct cause of the following exception".
```diff
  try:
      run_third_party_code_that_might_fail()
  except ValueError as e:
+     raise ValueError(Errors.E123) from e
```
In some cases, it makes sense to suppress the original exception, e.g. if we know what it is and know that it's not particularly helpful. In that case, we can raise `from None`. This prevents clogging up the user's terminal with multiple and irrelevant chained exceptions.
```diff
  try:
      run_our_own_code_that_might_fail_confusingly()
  except ValueError:
+     raise ValueError(Errors.E123) from None
```
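The difference between the two forms can be inspected on the raised exception itself. The sketch below uses two hypothetical helper functions and plain string messages to stay self-contained. In spaCy, the messages would come from `Errors` as described above.

```python
def parse_count(value: str) -> int:
    try:
        return int(value)
    except ValueError as e:
        # Chain the original exception so both tracebacks are shown
        raise ValueError(f"Invalid count: {value!r}") from e


def parse_count_quiet(value: str) -> int:
    try:
        return int(value)
    except ValueError:
        # Suppress the original exception in the traceback
        raise ValueError(f"Invalid count: {value!r}") from None


try:
    parse_count("abc")
except ValueError as err:
    chained_cause = err.__cause__

try:
    parse_count_quiet("abc")
except ValueError as err:
    quiet_cause = err.__cause__
    quiet_suppressed = err.__suppress_context__
```

With `from e`, the original exception is stored as `__cause__`. With `from None`, `__cause__` is `None` and `__suppress_context__` is set, which is what hides the original error from the printed traceback.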
### Avoid using naked `assert`
During development, it can sometimes be helpful to add `assert` statements throughout your code to make sure that the values you're working with are what you expect. However, as you clean up your code, those should either be removed or replaced by more explicit error handling:
```diff
- assert score >= 0.0
+ if score < 0.0:
+     raise ValueError(Errors.E789.format(score=score))
```
Otherwise, the user will get to see a naked `AssertionError` with no further explanation, which is very unhelpful. Instead of adding an error message to `assert`, it's always better to `raise` more explicit errors for specific conditions. If you're checking for something that _has to be right_ and would otherwise be a bug in spaCy, you can express this in the error message:
```python
E161 = ("Found an internal inconsistency when predicting entity links. "
"This is likely a bug in spaCy, so feel free to open an issue: "
"https://github.com/explosion/spaCy/issues")
```
### Warnings
Instead of raising an error, some parts of the code base can raise warnings to notify the user of a potential problem. This is done using Python's `warnings.warn` and the messages defined in `Warnings` in the `errors.py`. Whether or not warnings are shown can be controlled by the user, including custom filters for disabling specific warnings using a regular expression matching our internal codes, e.g. `W123`.
```diff
- print("Warning: No examples provided for validation")
+ warnings.warn(Warnings.W123)
```
When adding warnings, make sure you're not calling `warnings.warn` repeatedly, e.g. in a loop, which will clog up the terminal output. Instead, you can collect the potential problems first and then raise a single warning. If the problem is critical, consider raising an error instead.
```diff
+ n_empty = 0
  for spans in lots_of_annotations:
      if len(spans) == 0:
-         warnings.warn(Warnings.W456)
+         n_empty += 1
+ warnings.warn(Warnings.W456.format(count=n_empty))
```
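A self-contained sketch of both ideas, collecting problems into a single warning and letting the user filter it by code, might look like this. The function `count_empty` and the code `W999` are made up for the example, and the message is written inline so the sketch runs without spaCy.

```python
import warnings
from typing import List


def count_empty(annotations: List[List[str]]) -> int:
    n_empty = sum(1 for spans in annotations if len(spans) == 0)
    if n_empty > 0:
        # One warning for the whole batch; "[W999]" is a made-up code
        warnings.warn(f"[W999] Found {n_empty} examples without spans", UserWarning)
    return n_empty


annotations = [["a"], [], ["b", "c"], []]

with warnings.catch_warnings(record=True) as record:
    warnings.simplefilter("always")
    count_empty(annotations)
n_warnings = len(record)

with warnings.catch_warnings(record=True) as filtered:
    warnings.simplefilter("always")
    # `message` is a regular expression matched against the start of the text
    warnings.filterwarnings("ignore", message=r"\[W999\]")
    count_empty(annotations)
n_filtered = len(filtered)
```

Two empty examples produce one warning in the first block, and none in the second block, where the filter on the code prefix is active.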
### Logging
Log statements can be added via spaCy's `logger`, which uses Python's native `logging` module under the hood. We generally only use logging for debugging information that **the user may choose to see** in debugging mode or that's **relevant during training** but not at runtime.
```diff
+ logger.info("Set up nlp object from config")
config = nlp.config.interpolate()
```
`spacy train` and similar CLI commands will enable all log statements of level `INFO` by default (which is not the case at runtime). This allows outputting specific information within certain parts of the core library during training, without having it shown at runtime. `DEBUG`-level logs are only shown if the user enables `--verbose` logging during training. They can be used to provide more specific and potentially more verbose details, especially in areas that can indicate bugs or problems, or to surface more details about what spaCy does under the hood. You should only use logging statements if absolutely necessary and important.
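The effect of the log level can be sketched with Python's `logging` module directly. The logger name `example_library` and the messages are placeholders. spaCy exposes its own `logger` object, and the CLI takes care of setting the level.

```python
import io
import logging

# Hypothetical logger name, standing in for the library's own logger
logger = logging.getLogger("example_library")
stream = io.StringIO()
handler = logging.StreamHandler(stream)
handler.setFormatter(logging.Formatter("%(levelname)s: %(message)s"))
logger.addHandler(handler)
logger.propagate = False

# With the level set to INFO, DEBUG records are dropped
logger.setLevel(logging.INFO)
logger.info("Set up nlp object from config")
logger.debug("Resolved 3 pipeline components")
info_output = stream.getvalue()

# Lowering the threshold to DEBUG lets the more verbose records through
logger.setLevel(logging.DEBUG)
logger.debug("Resolved 3 pipeline components")
debug_output = stream.getvalue()
```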
## Writing tests
spaCy uses the [`pytest`](http://doc.pytest.org/) framework for testing. Tests for spaCy modules and classes live in their own directories of the same name and all test files should be prefixed with `test_`. Tests included in the core library only cover the code and do not depend on any trained pipelines. When implementing a new feature or fixing a bug, it's usually good to start by writing some tests that describe what _should_ happen. As you write your code, you can then keep running the relevant tests until all of them pass.
### Test suite structure
When adding tests, make sure to use descriptive names and only test for one behavior at a time. Tests should be grouped into modules dedicated to the same type of functionality and some test modules are organized as directories of test files related to the same larger area of the library, e.g. `matcher` or `tokenizer`.
Regression tests are tests that refer to bugs reported in specific issues. They should live in the `regression` module and are named according to the issue number (e.g. `test_issue1234.py`). This system allows us to relate tests for specific bugs back to the original reported issue, which is especially useful if we introduce a regression and a previously passing regression test suddenly fails again. When fixing a bug, it's often useful to create a regression test for it first. Every once in a while, we go through the `regression` module and group tests together into larger files by issue number, in groups of 500 to 1000 numbers. This prevents us from ending up with too many individual files over time.
The test suite also provides [fixtures](https://github.com/explosion/spaCy/blob/master/spacy/tests/conftest.py) for different language tokenizers that can be used as function arguments of the same name and will be passed in automatically. Those should only be used for tests related to those specific languages. We also have [test utility functions](https://github.com/explosion/spaCy/blob/master/spacy/tests/util.py) for common operations, like creating a temporary file.
### Constructing objects and state
Test functions usually follow the same simple structure: they set up some state, perform the operation you want to test and `assert` conditions that you expect to be true, usually before and after the operation.
Tests should focus on exactly what they're testing and avoid dependencies on other unrelated library functionality wherever possible. If all your test needs is a `Doc` object with certain annotations set, you should always construct it manually:
```python
def test_doc_creation_with_pos():
    doc = Doc(Vocab(), words=["hello", "world"], pos=["NOUN", "VERB"])
    assert doc[0].pos_ == "NOUN"
    assert doc[1].pos_ == "VERB"
```
### Parametrizing tests
If you need to run the same test function over different input examples, you usually want to parametrize the test cases instead of using a loop within your test. This lets you keep a better separation between test cases and test logic, and it'll result in more useful output because `pytest` will be able to tell you which exact test case failed.
The `@pytest.mark.parametrize` decorator takes two arguments: a string defining one or more comma-separated arguments that should be passed to the test function and a list of corresponding test cases (or a list of tuples to provide multiple arguments).
```python
@pytest.mark.parametrize("words", [["hello", "world"], ["this", "is", "a", "test"]])
def test_doc_length(words):
    doc = Doc(Vocab(), words=words)
    assert len(doc) == len(words)
```
```python
@pytest.mark.parametrize("text,expected_len", [("hello world", 2), ("I can't!", 4)])
def test_token_length(en_tokenizer, text, expected_len): # en_tokenizer is a fixture
    doc = en_tokenizer(text)
    assert len(doc) == expected_len
```
You can also stack `@pytest.mark.parametrize` decorators, although this is not recommended unless it's absolutely needed or required for the test. When stacking decorators, keep in mind that this will run the test with all possible combinations of the respective parametrized values, which is often not what you want and can slow down the test suite.
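To make the combinatorial effect concrete, here is a small sketch with a hypothetical test that doesn't depend on spaCy. With three values for `text` and two for `lower`, `pytest` would collect and run this test six times (3 × 2).

```python
import pytest


@pytest.mark.parametrize("lower", [True, False])
@pytest.mark.parametrize("text", ["Hello", "World", "Test"])
def test_casing(text, lower):
    # Runs once for every (text, lower) combination
    result = text.lower() if lower else text.upper()
    assert result.lower() == text.lower()
```

If only some of the combinations are meaningful, a single decorator with explicit tuples, as in the `text,expected_len` example above, keeps the number of test runs under control.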
### Handling failing tests
`xfail` means that a test **should pass but currently fails**, i.e. is expected to fail. You can mark a test as currently xfailing by adding the `@pytest.mark.xfail` decorator. This should only be used for tests that don't yet work, not for logic that causes errors we raise on purpose (see the section on testing errors for this). It's often very helpful to implement tests for edge cases that we don't yet cover and mark them as `xfail`. You can also provide a `reason` keyword argument to the decorator with an explanation of why the test currently fails.
```diff
+ @pytest.mark.xfail(reason="Issue #225 - not yet implemented")
  def test_en_tokenizer_splits_em_dash_infix(en_tokenizer):
      doc = en_tokenizer("Will this road take me to Puddleton?\u2014No.")
      assert doc[8].text == "\u2014"
```
When you run the test suite, you may come across tests that are reported as `xpass`. This means that they're marked as `xfail` but didn't actually fail. This is worth looking into: sometimes, it can mean that we have since fixed a bug that caused the test to previously fail, so we can remove the decorator. In other cases, especially when it comes to machine learning model implementations, it can also indicate that the **test is flaky**: it sometimes passes and sometimes fails. This can be caused by a bug, or by constraints being too narrowly defined. If a test shows different behavior depending on whether it's run in isolation or not, this can indicate that it reacts to global state set in a previous test, which is not ideal and should be avoided.
### Writing slow tests
If a test is useful but potentially quite slow, you can mark it with the `@pytest.mark.slow` decorator. This is a special marker we introduced and tests decorated with it only run if you run the test suite with `--slow`, but not as part of the main CI process. Before introducing a slow test, double-check that there isn't another and more efficient way to test for the behavior. You should also consider adding a simpler test with maybe only a subset of the test cases that can always run, so we at least have some coverage.
### Skipping tests
The `@pytest.mark.skip` decorator lets you skip tests entirely. You only want to do this for failing tests that may be slow to run or cause memory errors or segfaults, which would otherwise terminate the entire process and wouldn't be caught by `xfail`. We also sometimes use the `skip` decorator for old and outdated regression tests that we want to keep around but that don't apply anymore. When using the `skip` decorator, make sure to provide the `reason` keyword argument with a quick explanation of why you chose to skip this test.
### Testing errors and warnings
`pytest` lets you check whether a given error is raised by using the `pytest.raises` contextmanager. This is very useful when implementing custom error handling, so make sure you're not only testing for the correct behavior but also for errors resulting from incorrect inputs. If you're testing errors, you should always check for `pytest.raises` explicitly and not use `xfail`.
```python
words = ["a", "b", "c", "d", "e"]
ents = ["Q-PERSON", "I-PERSON", "O", "I-PERSON", "I-GPE"]
with pytest.raises(ValueError):
    Doc(Vocab(), words=words, ents=ents)
```
You can also use the `pytest.warns` contextmanager to check that a given warning type is raised. The first argument is the warning type or `None` (which will capture a list of warnings that you can `assert` is empty).
```python
def test_phrase_matcher_validation(en_vocab):
    doc1 = Doc(en_vocab, words=["Test"], deps=["ROOT"])
    doc2 = Doc(en_vocab, words=["Test"])
    matcher = PhraseMatcher(en_vocab, validate=True)
    with pytest.warns(UserWarning):
        # Warn about unnecessarily parsed document
        matcher.add("TEST1", [doc1])
    with pytest.warns(None) as record:
        matcher.add("TEST2", [doc2])
        assert not record.list
```
Keep in mind that your tests will fail if you're using the `pytest.warns` contextmanager with a given warning and the warning is _not_ shown. So you should only use it to check that spaCy handles and outputs warnings correctly. If your test outputs a warning that's expected but not relevant to what you're testing, you can use the `@pytest.mark.filterwarnings` decorator and ignore specific warnings starting with a given code:
```python
@pytest.mark.filterwarnings("ignore:\\[W036")
def test_matcher_empty(en_vocab):
    matcher = Matcher(en_vocab)
    matcher(Doc(en_vocab, words=["test"]))
```
### Testing trained pipelines
Our regular test suite does not depend on any of the trained pipelines, since their outputs can vary and aren't generally required to test the library functionality. We test pipelines separately using the tests included in the [`spacy-models`](https://github.com/explosion/spacy-models) repository, which run whenever we train a new suite of models. The tests here mostly focus on making sure that the packages can be loaded and that the predictions seem reasonable, and they include checks for common bugs we encountered previously. If your test does not primarily focus on verifying a model's predictions, it should be part of the core library tests and construct the required objects manually, instead of being added to the models tests.
Keep in mind that specific predictions may change, and we can't test for all incorrect predictions reported by users. Different models make different mistakes, so even a model that's significantly more accurate overall may end up making wrong predictions that it previously didn't. However, some surprising incorrect predictions may indicate deeper bugs that we definitely want to investigate.

@ -0,0 +1,150 @@
# Language
> Reference: `spacy/language.py`
1. [Constructing the `nlp` object from a config](#1-constructing-the-nlp-object-from-a-config)
   - [A. Overview of `Language.from_config`](#1a-overview)
   - [B. Component factories](#1b-how-pipeline-component-factories-work-in-the-config)
   - [C. Sourcing a component](#1c-sourcing-a-pipeline-component)
   - [D. Tracking components as they're modified](#1d-tracking-components-as-theyre-modified)
   - [E. spaCy's config utility functions](#1e-spacys-config-utility-functions)
2. [Initialization](#2-initialization)
   - [A. Initialization for training](#2a-initialization-for-training): `init_nlp`
   - [B. Initializing the `nlp` object](#2b-initializing-the-nlp-object): `Language.initialize`
   - [C. Initializing the vocab](#2c-initializing-the-vocab): `init_vocab`
## 1. Constructing the `nlp` object from a config
### 1A. Overview
Most of the functions referenced in the config are regular functions with arbitrary arguments registered via the function registry. However, the pipeline components are a bit special: they don't only receive arguments passed in via the config file, but also the current `nlp` object and the string `name` of the individual component instance (so a user can have multiple components created with the same factory, e.g. `ner_one` and `ner_two`). This name can then be used by the components to add to the losses and scores. This special requirement means that pipeline components can't just be resolved via the config the "normal" way: we need to retrieve the component functions manually and pass them their arguments, plus the `nlp` and `name`.
The `Language.from_config` classmethod takes care of constructing the `nlp` object from a config. It's the single place where this happens and what `spacy.load` delegates to under the hood. Its main responsibilities are:
- **Load and validate the config**, and optionally **auto-fill** all missing values that we either have defaults for in the config template or that registered function arguments define defaults for. This helps ensure backwards-compatibility, because we're able to add a new argument `foo: str = "bar"` to an existing function, without breaking configs that don't specify it.
- **Execute relevant callbacks** for pipeline creation, e.g. optional functions called before and after creation of the `nlp` object and pipeline.
- **Initialize language subclass and create tokenizer**. The `from_config` classmethod will always be called on a language subclass, e.g. `English`, not on `Language` directly. Initializing the subclass takes a callback to create the tokenizer.
- **Set up the pipeline components**. Components can either refer to a component factory or a `source`, i.e. an existing pipeline that's loaded and that the component is then copied from. We also need to ensure that we update the information about which components are disabled.
- **Manage listeners.** If sourced components "listen" to other components (`tok2vec`, `transformer`), we need to ensure that the references are valid. If the config specifies that listeners should be replaced by copies (e.g. to give the `ner` component its own `tok2vec` model instead of listening to the shared `tok2vec` component in the pipeline), we also need to take care of that.
Note that we only resolve and load **selected sections** in `Language.from_config`, i.e. only the parts that are relevant at runtime, which is `[nlp]` and `[components]`. We don't want to be resolving anything related to training or initialization, since this would mean loading and constructing unnecessary functions, including functions that require information that isn't necessarily available at runtime, like `paths.train`.
### 1B. How pipeline component factories work in the config
As opposed to regular registered functions that refer to a registry and function name (e.g. `"@misc": "foo.v1"`), pipeline components follow a different format and refer to their component `factory` name. This corresponds to the name defined via the `@Language.component` or `@Language.factory` decorator. We need this decorator to define additional meta information for the components, like their default config and score weights.
```ini
[components.my_component]
factory = "foo"
some_arg = "bar"
other_arg = ${paths.some_path}
```
This means that we need to create and resolve the `config["components"]` separately from the rest of the config. There are some important considerations and things we need to manage explicitly to avoid unexpected behavior:
#### Variable interpolation
When a config is resolved, references to variables are replaced, so that the functions receive the correct value instead of just the variable name. To interpolate a config, we need it in its entirety: we couldn't just interpolate a subsection that refers to variables defined in a different subsection. So we first interpolate the entire config.
However, the `nlp.config` should include the original config with variables intact. Otherwise, loading a pipeline and saving it to disk will destroy all logic implemented via variables and hard-code the values all over the place. This means that when we create the components, we need to keep two versions of the config: the interpolated config with the "real" values and the `raw_config` including the variable references.
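To make the distinction concrete, here is a minimal sketch of interpolation over plain nested dicts. It is not spaCy's or Thinc's implementation: the `interpolate` and `lookup` helpers and the `VARIABLE` pattern are hypothetical, and only the `${section.key}` syntax is taken from the config example above.

```python
import re

# Hypothetical stand-in for a parsed config; not the actual Config class.
raw_config = {
    "paths": {"some_path": "/data/vectors"},
    "components": {
        "my_component": {"factory": "foo", "other_arg": "${paths.some_path}"}
    },
}

# Matches variable references such as ${paths.some_path}.
VARIABLE = re.compile(r"\$\{([\w.]+)\}")


def lookup(config, dotted_key):
    """Follow a dotted key such as 'paths.some_path' through nested dicts."""
    value = config
    for part in dotted_key.split("."):
        value = value[part]
    return value


def interpolate(config, root=None):
    """Return a new dict with every ${...} reference replaced by its value."""
    root = config if root is None else root
    result = {}
    for key, value in config.items():
        if isinstance(value, dict):
            result[key] = interpolate(value, root)
        elif isinstance(value, str):
            result[key] = VARIABLE.sub(
                lambda match: str(lookup(root, match.group(1))), value
            )
        else:
            result[key] = value
    return result


# The interpolated copy has the "real" values, the raw config is untouched.
interpolated = interpolate(raw_config)
```

Because the references are resolved against the root of the config, interpolating only the `components` subsection would fail to find `paths.some_path`, which is why the whole config has to be interpolated first.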
#### Factory registry
Component factories are special and use the `@Language.factory` or `@Language.component` decorator to register themselves and their meta. When the decorator runs, it performs some basic validation, stores the meta information for the factory on the `Language` class (default config, scores etc.) and then adds the factory function to `registry.factories`. The `component` decorator can be used for registering simple functions that just take a `Doc` object and return it, so in that case, we create the factory for the user automatically.
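The decorator mechanics can be sketched roughly as follows. This is a simplified illustration, not the real decorators: `FACTORY_META` and `FACTORIES` are hypothetical stand-ins for the meta stored on `Language` and for `registry.factories`, and validation is left out.

```python
from typing import Any, Callable, Dict, Optional

# Hypothetical stand-ins for the factory meta and the factories registry.
FACTORY_META: Dict[str, Dict[str, Any]] = {}
FACTORIES: Dict[str, Callable] = {}


def factory(name: str, default_config: Optional[Dict[str, Any]] = None):
    """Register a factory function together with its meta information."""

    def decorator(func: Callable) -> Callable:
        FACTORY_META[name] = {"default_config": default_config or {}}
        FACTORIES[name] = func
        return func

    return decorator


def component(name: str):
    """Register a plain doc -> doc function by creating a factory for it."""

    def decorator(func: Callable) -> Callable:
        def make_component(nlp, name):
            # A stateless component ignores nlp and name and is returned as-is.
            return func

        factory(name)(make_component)
        return func

    return decorator


@component("upper")
def upper(doc):
    # Toy "doc" handling: the doc is just a string in this sketch.
    return doc.upper()


# The registry now holds a factory that produces the component function.
pipe = FACTORIES["upper"](nlp=None, name="upper")
```

The point of the sketch is that registration is a side effect of the decorator running, which is why importing the decorated function is enough to make the factory available.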
There's one important detail to note about how factories are registered via entry points: A package that wants to expose spaCy components still needs to register them via the `@Language` decorators so we have the component meta information and can perform required checks. All we care about here is that the decorated function is **loaded and imported**. When it is, the `@Language` decorator takes care of everything, including actually registering the component factory.
Normally, adding to the registry via an entry point will just add the function to the registry under the given name. But for `spacy_factories`, we don't actually want that: all we care about is that the function decorated with `@Language` is imported so the decorator runs. So we only exploit Python's entry point system to automatically import the function, and the `spacy_factories` entry point group actually adds to a **separate registry**, `registry._factories`, under the hood. Its only purpose is that the functions are imported. The decorator then runs, creates the factory if needed and adds it to the `registry.factories` registry.
#### Language-specific factories
spaCy supports registering factories on the `Language` base class, as well as language-specific subclasses like `English` or `German`. This allows providing different factories depending on the language, e.g. a different default lemmatizer. The `Language.get_factory_name` classmethod constructs the factory name as `{lang}.{name}` if a language is available (i.e. if it's a subclass) and falls back to `{name}` otherwise. So `@German.factory("foo")` will add a factory `de.foo` under the hood. If you add `nlp.add_pipe("foo")`, we first check if there's a factory for `{nlp.lang}.foo` and if not, we fall back to checking for a factory `foo`.
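A rough sketch of this lookup, assuming a hypothetical set of registered internal names (`registered`) and a hypothetical `resolve_factory` helper. Only the `{lang}.{name}` naming scheme and the fallback order come from the description above.

```python
from typing import Optional

# Hypothetical set of internal factory names that have been registered.
registered = {"lemmatizer", "de.lemmatizer", "foo"}


def get_factory_name(lang: Optional[str], name: str) -> str:
    """Build the internal name: '{lang}.{name}' if a language is set."""
    return f"{lang}.{name}" if lang else name


def resolve_factory(lang: Optional[str], name: str) -> str:
    """Prefer the language-specific factory, then fall back to the global one."""
    lang_specific = get_factory_name(lang, name)
    if lang_specific in registered:
        return lang_specific
    if name in registered:
        return name
    raise KeyError(f"No factory registered for '{name}'")


german = resolve_factory("de", "lemmatizer")
english = resolve_factory("en", "lemmatizer")
```

Here `german` resolves to the language-specific `de.lemmatizer`, while `english` falls back to the global `lemmatizer`, because no `en.lemmatizer` exists in the toy registry.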
#### Creating a pipeline component from a factory
`Language.add_pipe` takes care of adding a pipeline component, given its factory name and its config. If no source pipeline to copy the component from is provided, it delegates to `Language.create_pipe`, which sets up the actual component function in the following steps:
- Validate the config and make sure that the factory was registered via the decorator and that we have meta for it.
- Update the component config with any defaults specified by the component's `default_config`, if available. This is done by merging the values we receive into the defaults. It ensures that you can still add a component without having to specify its _entire_ config including more complex settings like `model`. If no `model` is defined, we use the default.
- Check if we have a language-specific factory for the given `nlp.lang` and if not, fall back to the global factory.
- Construct the component config, consisting of whatever arguments were provided, plus the current `nlp` object and `name`, which are default expected arguments of all factories. We also add a reference to the `@factories` registry, so we can resolve the config via the registry, like any other config. With the added `nlp` and `name`, it should now include all expected arguments of the given function.
- Fill the config to make sure all unspecified defaults from the function arguments are added and update the `raw_config` (uninterpolated with variables intact) with that information, so the component config we store in `nlp.config` is up to date. We do this by adding the `raw_config` _into_ the filled config; otherwise, the references to variables would be overwritten.
- Resolve the config and create all functions it refers to (e.g. `model`). This gives us the actual component function that we can insert into the pipeline.
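The default-merging and config-construction steps above can be sketched with plain dicts. The `merge_into_defaults` helper, the architecture name `my.DefaultModel.v1` and the placeholder `nlp` value are all hypothetical. The sketch only illustrates the idea of merging the received values into the defaults and then adding `nlp`, `name` and the `@factories` reference.

```python
def merge_into_defaults(defaults, overrides):
    """Recursively merge the values we received into the default config."""
    merged = dict(defaults)
    for key, value in overrides.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = merge_into_defaults(merged[key], value)
        else:
            merged[key] = value
    return merged


# Hypothetical default config of a factory, including a more complex model block.
default_config = {
    "model": {"@architectures": "my.DefaultModel.v1", "width": 96},
    "threshold": 0.5,
}
# The user only overrides a single setting.
user_config = {"threshold": 0.8}

component_config = merge_into_defaults(default_config, user_config)
# Add the arguments every factory expects, plus the registry reference,
# so the block can be resolved like any other registered function.
component_config.update(
    {"nlp": "<current nlp object>", "name": "my_component", "@factories": "foo"}
)
```

The user's `threshold` wins, the untouched `model` block comes from the defaults, and the defaults themselves are not modified.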
### 1C. Sourcing a pipeline component
```ini
[components.ner]
source = "en_core_web_sm"
```
spaCy also allows ["sourcing" a component](https://spacy.io/usage/processing-pipelines#sourced-components), which will copy it over from an existing pipeline. In this case, `Language.add_pipe` will delegate to `Language.create_pipe_from_source`. In order to copy a component effectively and validate it, the source pipeline first needs to be loaded. This is done in `Language.from_config`, so a source pipeline only has to be loaded once if multiple components source from it. Sourcing a component will perform the following checks and modifications:
- For each sourced pipeline component loaded in `Language.from_config`, a hash of the vectors data from the source pipeline is stored in the pipeline meta so we're able to check whether the vectors match and warn if not (since different vectors that are used as features in components can lead to degraded performance). Because the vectors are not loaded at the point when components are sourced, the check is postponed to `init_vocab` as part of `Language.initialize`.
- If the sourced pipeline component is loaded through `Language.add_pipe(source=)`, the vectors are already loaded and can be compared directly. The check compares the shape and keys first and finally falls back to comparing the actual byte representation of the vectors (which is slower).
- Ensure that the component is available in the pipeline.
- Interpolate the entire config of the source pipeline so all variables are replaced and the component's config that's copied over doesn't include references to variables that are not available in the destination config.
- Add the source `vocab.strings` to the destination's `vocab.strings` so we don't end up with unavailable strings in the final pipeline (which would also include labels used by the sourced component).
Note that there may be other incompatibilities that we're currently not checking for and that could cause a sourced component to not work in the destination pipeline. We're interested in adding more checks here but there'll always be a small number of edge cases we'll never be able to catch, including a sourced component depending on other pipeline state that's not available in the destination pipeline.
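The staged vectors comparison described above (cheap checks first, byte comparison last) could look roughly like this. The dict layout and the `vectors_match` helper are hypothetical, and the real check operates on the vectors objects rather than on plain dicts.

```python
def vectors_match(source, target):
    """Compare shape and keys first, then fall back to the raw bytes."""
    if source["shape"] != target["shape"]:
        return False
    if source["keys"] != target["keys"]:
        return False
    # Slowest check last: compare the serialized vectors data.
    return source["data"] == target["data"]


# Hypothetical toy vectors tables: two rows with two dimensions each.
source_vectors = {"shape": (2, 2), "keys": ["cat", "dog"], "data": b"\x01\x02\x03\x04"}
same_vectors = {"shape": (2, 2), "keys": ["cat", "dog"], "data": b"\x01\x02\x03\x04"}
other_vectors = {"shape": (2, 2), "keys": ["cat", "dog"], "data": b"\x01\x02\x03\x05"}

matching = vectors_match(source_vectors, same_vectors)
mismatching = vectors_match(source_vectors, other_vectors)
```

Ordering the checks this way means that most mismatches are caught without ever touching the byte representation.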
### 1D. Tracking components as they're modified
The `Language` class implements methods for removing, replacing or renaming pipeline components. Whenever we make these changes, we need to update the information stored on the `Language` object to ensure that it matches the current state of the pipeline. If a user just writes to `nlp.config` manually, we obviously can't ensure that the config matches the reality, but since we offer modification via the pipe methods, it's expected that spaCy keeps the config in sync under the hood. Otherwise, saving a modified pipeline to disk and loading it back wouldn't work. The internal attributes we need to keep in sync here are:
| Attribute | Type | Description |
| ------------------------ | ---------------------------- | --------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `Language._components` | `List[Tuple[str, Callable]]` | All pipeline components as `(name, func)` tuples. This is used as the source of truth for `Language.pipeline`, `Language.pipe_names` and `Language.components`. |
| `Language._pipe_meta` | `Dict[str, FactoryMeta]` | The meta information of a component's factory, keyed by component name. This can include multiple components referring to the same factory meta. |
| `Language._pipe_configs` | `Dict[str, Config]` | The component's config, keyed by component name. |
| `Language._disabled` | `Set[str]` | Names of components that are currently disabled. |
| `Language._config` | `Config` | The underlying config. This is internal only and will be used as the basis for constructing the config in the `Language.config` property. |
In addition to the actual component settings in `[components]`, the config also allows specifying component-specific arguments via the `[initialize.components]` block, which are passed to the component's `initialize` method during initialization if it's available. So we also need to keep this in sync in the underlying config.
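As an illustration of what "keeping in sync" means, here is a simplified sketch of a rename operation. `ToyPipeline` is a hypothetical class that only mirrors the attribute names from the table, and `_init_configs` stands in for the `[initialize.components]` block of the underlying config.

```python
class ToyPipeline:
    """Hypothetical, simplified stand-in for the bookkeeping on `Language`."""

    def __init__(self):
        self._components = []    # (name, func) tuples, the source of truth
        self._pipe_meta = {}     # factory meta, keyed by component name
        self._pipe_configs = {}  # component config, keyed by component name
        self._disabled = set()   # names of disabled components
        self._init_configs = {}  # stand-in for [initialize.components]

    def add_pipe(self, name, func, meta, config, init_config=None):
        self._components.append((name, func))
        self._pipe_meta[name] = meta
        self._pipe_configs[name] = config
        if init_config is not None:
            self._init_configs[name] = init_config

    def rename_pipe(self, old_name, new_name):
        names = [name for name, _ in self._components]
        if new_name in names:
            raise ValueError(f"'{new_name}' is already in the pipeline")
        index = names.index(old_name)
        # Every attribute keyed by the component name has to be updated.
        self._components[index] = (new_name, self._components[index][1])
        self._pipe_meta[new_name] = self._pipe_meta.pop(old_name)
        self._pipe_configs[new_name] = self._pipe_configs.pop(old_name)
        if old_name in self._disabled:
            self._disabled.remove(old_name)
            self._disabled.add(new_name)
        if old_name in self._init_configs:
            self._init_configs[new_name] = self._init_configs.pop(old_name)


pipeline = ToyPipeline()
pipeline.add_pipe("ner", lambda doc: doc, {"factory": "ner"}, {"x": 1}, {"labels": []})
pipeline.rename_pipe("ner", "ner_one")
```

Forgetting any one of these updates would leave the exported config out of step with the actual pipeline.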
### 1E. spaCy's config utility functions
When working with configs in spaCy, make sure to use the utility functions provided by spaCy if available, instead of calling the respective `Config` methods. The utilities take care of providing spaCy-specific error messages and ensure a consistent order of config sections by setting the `section_order` argument. This ensures that exported configs always have the same consistent format.
- `util.load_config`: load a config from a file
- `util.load_config_from_str`: load a config from a string representation
- `util.copy_config`: deepcopy a config
## 2. Initialization
Initialization is a separate step of the [config lifecycle](https://spacy.io/usage/training#config-lifecycle) that's not performed at runtime. It's implemented via the `training.initialize.init_nlp` helper and calls into the `Language.initialize` method, which sets up the pipeline and component models before training. The `initialize` method takes a callback that returns a sample of examples, which is used to initialize the component models, add all required labels and perform shape inference if applicable.
Components can also define custom initialization settings via the `[initialize.components]` block, e.g. if they require external data like lookup tables to be loaded in. All config settings defined here will be passed to the component's `initialize` method, if it implements one. Components are expected to handle their own serialization after they're initialized so that any data or settings they require are saved with the pipeline and will be available from disk when the pipeline is loaded back at runtime.
### 2A. Initialization for training
The `init_nlp` function is called before training and returns an initialized `nlp` object that can be updated with the examples. It only needs the config and does the following:
- Load and validate the config. In order to validate certain settings like the `seed`, we also interpolate the config to get the final value (because in theory, a user could provide this via a variable).
- Set up the GPU allocation, if required.
- Create the `nlp` object from the raw, uninterpolated config, which delegates to `Language.from_config`. Since this method may modify and auto-fill the config and pipeline component settings, we then use the interpolated version of `nlp.config` going forward, to ensure that what we're training with is up to date.
- Resolve the `[training]` block of the config and perform validation, e.g. to check that the corpora are available.
- Determine the components that should be frozen (not updated during training) or resumed (sourced components from a different pipeline that should be updated from the examples and not reset and re-initialized). To resume training, we can call the `nlp.resume_training` method.
- Initialize the `nlp` object via `nlp.initialize` and pass it a `get_examples` callback that returns the training corpus (used for shape inference, setting up labels etc.). If the training corpus is streamed, we only provide a small sample of the data, which can potentially be infinite. `nlp.initialize` will delegate to the components as well and pass the data sample forward.
- Check the listeners and warn about component dependencies, e.g. if a frozen component listens to a component that is retrained, or vice versa (which can degrade results).
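The handling of streamed corpora in the `get_examples` step can be illustrated with a small sketch. The `streamed_corpus` generator, the `make_get_examples` helper and the `sample_size` parameter are hypothetical. The idea is simply to take a bounded sample from a potentially infinite stream before handing it to initialization.

```python
from itertools import count, islice


def streamed_corpus():
    """Hypothetical infinite corpus that yields placeholder examples."""
    for i in count():
        yield f"example {i}"


def make_get_examples(corpus, sample_size=3):
    """Build a callback that returns a small, fixed sample of the stream."""
    sample = list(islice(corpus(), sample_size))

    def get_examples():
        return sample

    return get_examples


get_examples = make_get_examples(streamed_corpus)
```

Materializing the sample once means every component sees the same examples during initialization, and the infinite stream is never exhausted.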
### 2B. Initializing the `nlp` object
The `Language.initialize` method does the following:
- **Resolve the config** defined in the `[initialize]` block separately (since everything else is already available in the loaded `nlp` object), based on the fully interpolated config.
- **Execute callbacks**, i.e. `before_init` and `after_init`, if they're defined.
- **Initialize the vocab**, including vocab data, lookup tables and vectors.
- **Initialize the tokenizer** if it implements an `initialize` method. This is not the case for the default tokenizers, but it allows custom tokenizers to depend on external data resources that are loaded in on initialization.
- **Initialize all pipeline components** if they implement an `initialize` method and pass them the `get_examples` callback, the current `nlp` object as well as additional initialization config settings provided in the component-specific block.
- **Initialize pretraining** if a `[pretraining]` block is available in the config. This allows loading pretrained tok2vec weights in `spacy pretrain`.
- **Register listeners** if token-to-vector embedding layers of a component model "listen" to a previous component (`tok2vec`, `transformer`) in the pipeline.
- **Create an optimizer** on the `Language` class, either by adding the optimizer passed as `sgd` to `initialize`, or by creating the optimizer defined in the config's training settings.
### 2C. Initializing the vocab
Vocab initialization is handled in the `training.initialize.init_vocab` helper. It takes the relevant loaded functions and values from the config and takes care of the following:
- Add lookup tables defined in the config initialization, e.g. custom lemmatization tables. Those will be added to `nlp.vocab.lookups` from where they can be accessed by components.
- Add JSONL-formatted [vocabulary data](https://spacy.io/api/data-formats#vocab-jsonl) to pre-populate the lexical attributes.
- Load vectors into the pipeline. Vectors are defined as a name or path to a saved `nlp` object containing the vectors, e.g. `en_vectors_web_lg`. It's loaded and the vectors are ported over, while ensuring that all source strings are available in the destination strings. We also warn if there's a mismatch between sourced vectors, since this can lead to problems.

@ -0,0 +1,220 @@
# Listeners
1. [Overview](#1-overview)
2. [Initialization](#2-initialization)
   - [A. Linking listeners to the embedding component](#2a-linking-listeners-to-the-embedding-component)
   - [B. Shape inference](#2b-shape-inference)
3. [Internal communication](#3-internal-communication)
   - [A. During prediction](#3a-during-prediction)
   - [B. During training](#3b-during-training)
   - [C. Frozen components](#3c-frozen-components)
4. [Replacing listener with standalone](#4-replacing-listener-with-standalone)
## 1. Overview
Trainable spaCy components typically use some sort of `tok2vec` layer as part of the `model` definition.
This `tok2vec` layer produces embeddings and is either a standard `Tok2Vec` layer, or a Transformer-based one.
Both versions can be used either inline/standalone, which means that they are defined and used
by only one specific component (e.g. NER), or
[shared](https://spacy.io/usage/embeddings-transformers#embedding-layers),
in which case the embedding functionality becomes a separate component that can
feed embeddings to multiple components downstream, using a listener-pattern.
| Type | Usage | Model Architecture |
| ------------- | ---------- | -------------------------------------------------------------------------------------------------- |
| `Tok2Vec` | standalone | [`spacy.Tok2Vec`](https://spacy.io/api/architectures#Tok2Vec) |
| `Tok2Vec` | listener | [`spacy.Tok2VecListener`](https://spacy.io/api/architectures#Tok2VecListener) |
| `Transformer` | standalone | [`spacy-transformers.Tok2VecTransformer`](https://spacy.io/api/architectures#Tok2VecTransformer) |
| `Transformer` | listener | [`spacy-transformers.TransformerListener`](https://spacy.io/api/architectures#TransformerListener) |
Here we discuss the listener pattern and its implementation in code in more detail.
## 2. Initialization
### 2A. Linking listeners to the embedding component
To allow sharing a `tok2vec` layer, a separate `tok2vec` component needs to be defined in the config:
```
[components.tok2vec]
factory = "tok2vec"
[components.tok2vec.model]
@architectures = "spacy.Tok2Vec.v2"
```
A listener can then be set up by making sure the correct `upstream` name is defined, referring to the
name of the `tok2vec` component (which equals the factory name by default), or `*` as a wildcard:
```
[components.ner.model.tok2vec]
@architectures = "spacy.Tok2VecListener.v1"
upstream = "tok2vec"
```
When an [`nlp`](https://github.com/explosion/spaCy/blob/master/extra/DEVELOPER_DOCS/Language.md) object is
initialized or deserialized, it will make sure to link each `tok2vec` component to its listeners. This is
implemented in the method `nlp._link_components()` which loops over each
component in the pipeline and calls `find_listeners()` on a component if it's defined.
The [`tok2vec` component](https://github.com/explosion/spaCy/blob/master/spacy/pipeline/tok2vec.py)'s implementation
of this `find_listeners()` method will specifically identify sublayers of a model definition that are of type
`Tok2VecListener` with a matching upstream name and will then add that listener to the internal `self.listener_map`.
If it's a Transformer-based pipeline, a
[`transformer` component](https://github.com/explosion/spacy-transformers/blob/master/spacy_transformers/pipeline_component.py)
has a similar implementation but its `find_listeners()` function will specifically look for `TransformerListener`
sublayers of downstream components.
### 2B. Shape inference
Typically, the output dimension `nO` of a listener's model equals the `nO` (or `width`) of the upstream embedding layer.
For a standard `Tok2Vec`-based component, this is typically known up-front and defined as such in the config:
```
[components.ner.model.tok2vec]
@architectures = "spacy.Tok2VecListener.v1"
width = ${components.tok2vec.model.encode.width}
```
A `transformer` component however only knows its `nO` dimension after the HuggingFace transformer
is set with the function `model.attrs["set_transformer"]`,
[implemented](https://github.com/explosion/spacy-transformers/blob/master/spacy_transformers/layers/transformer_model.py)
by `set_pytorch_transformer`.
This is why, upon linking of the transformer listeners, the `transformer` component also makes sure to set
the listener's output dimension correctly.
This shape inference mechanism also needs to happen with resumed/frozen components, which means that for some CLI
commands (`assemble` and `train`), we need to call `nlp._link_components` even before initializing the `nlp`
object. To cover all use-cases and avoid negative side effects, the code base ensures that performing the
linking twice is not harmful.
## 3. Internal communication
The internal communication between a listener and its downstream components is organized by sending and
receiving information across the components - either directly or implicitly.
The details are different depending on whether the pipeline is currently training, or predicting.
Either way, the `tok2vec` or `transformer` component always needs to run before the listener.
### 3A. During prediction
When the `Tok2Vec` pipeline component is called, its `predict()` method is executed to produce the results,
which are then stored by `set_annotations()` in the `doc.tensor` field of the document(s).
Similarly, the `Transformer` component stores the produced embeddings
in `doc._.trf_data`. Next, the `forward` pass of a
[`Tok2VecListener`](https://github.com/explosion/spaCy/blob/master/spacy/pipeline/tok2vec.py)
or a
[`TransformerListener`](https://github.com/explosion/spacy-transformers/blob/master/spacy_transformers/layers/listener.py)
accesses these fields on the `Doc` directly. Both listener implementations have a fallback mechanism for when these
properties were not set on the `Doc`: in that case an all-zero tensor is produced and returned.
We need this fallback mechanism to enable shape inference methods in Thinc, but the code
is slightly risky and at times might hide another bug - so it's a good spot to be aware of.
### 3B. During training
During training, the `update()` methods of the `Tok2Vec` & `Transformer` components don't necessarily set the
annotations on the `Doc` (though since 3.1 they can if they are part of the `annotating_components` list in the config).
Instead, we rely on a caching mechanism between the original embedding component and its listener.
Specifically, the produced embeddings are sent to the listeners by calling `listener.receive()` and uniquely
identifying the batch of documents with a `batch_id`. This `receive()` call also sends the appropriate `backprop`
call to ensure that gradients from the downstream component flow back to the trainable `Tok2Vec` or `Transformer`
network.
We rely on the `nlp` object properly batching the data and sending each batch through the pipeline in sequence,
which means that only one such batch needs to be kept in memory for each listener.
When the downstream component runs and the listener should produce embeddings, it accesses the batch in memory,
runs the backpropagation, and returns the results and the gradients.
There are two ways in which this mechanism can fail; both are detected by `verify_inputs()`:
- `E953` if a different batch is in memory than the requested one - signaling some kind of out-of-sync state of the
training pipeline.
- `E954` if no batch is in memory at all - signaling that the pipeline is probably not set up correctly.
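The caching and verification logic can be sketched as follows. `ToyListener` is a hypothetical, heavily simplified class: the way the batch ID is computed and the error messages are placeholders, and only the `receive()` / `verify_inputs()` split and the two failure modes are taken from the description above.

```python
class ToyListener:
    """Hypothetical listener that caches one batch of upstream outputs."""

    def __init__(self):
        self._batch_id = None
        self._outputs = None
        self._backprop = None

    @staticmethod
    def get_batch_id(docs):
        # Placeholder ID: any value that identifies the batch would do.
        return hash(tuple(docs))

    def receive(self, batch_id, outputs, backprop):
        """Called by the embedding component to hand over its results."""
        self._batch_id = batch_id
        self._outputs = outputs
        self._backprop = backprop

    def verify_inputs(self, docs):
        """Check that the cached batch is the one that's being requested."""
        if self._batch_id is None and self._outputs is None:
            raise ValueError("E954: no batch in memory")
        if self.get_batch_id(docs) != self._batch_id:
            raise ValueError("E953: a different batch is in memory")
        return True


batch = ["first doc", "second doc"]
listener = ToyListener()
listener.receive(ToyListener.get_batch_id(batch), [[0.1], [0.2]], lambda d: d)
in_sync = listener.verify_inputs(batch)
```

Since only a single batch is cached per listener, the check relies on the upstream component having processed exactly the batch that the downstream component is about to use.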
#### Training with multiple listeners
One `Tok2Vec` or `Transformer` component may be listened to by several downstream components, e.g.
a tagger and a parser could be sharing the same embeddings. In this case, we need to be careful about how we do
the backpropagation. When the `Tok2Vec` or `Transformer` sends out data to the listener with `receive()`, they will
send an `accumulate_gradient` function call to all listeners, except the last one. This function will keep track
of the gradients received so far. Only the final listener in the pipeline will get an actual `backprop` call that
will initiate the backpropagation of the `tok2vec` or `transformer` model with the accumulated gradients.
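A minimal sketch of this accumulation scheme, using plain floats instead of gradient arrays. The `make_listener_callbacks` helper and `upstream_backprop` are hypothetical names. The sketch only shows that all listeners but the last accumulate, and that the last one triggers a single backward pass.

```python
from typing import Callable, List


def make_listener_callbacks(
    n_listeners: int, backprop: Callable[[float], float]
) -> List[Callable[[float], float]]:
    """Create one callback per listener, in pipeline order."""
    total = [0.0]

    def accumulate_gradient(d_output: float) -> float:
        # Only keep track of the gradient, don't backpropagate yet.
        total[0] += d_output
        return 0.0

    def final_backprop(d_output: float) -> float:
        # The last listener adds its gradient and starts the real backward pass.
        accumulate_gradient(d_output)
        return backprop(total[0])

    return [accumulate_gradient] * (n_listeners - 1) + [final_backprop]


calls = []


def upstream_backprop(d_total: float) -> float:
    """Stand-in for the backward pass of the embedding model."""
    calls.append(d_total)
    return d_total


callbacks = make_listener_callbacks(3, upstream_backprop)
for callback, gradient in zip(callbacks, [1.0, 2.0, 3.0]):
    callback(gradient)
```

After the loop, the upstream backward pass has run exactly once, with the sum of all three gradients.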
### 3C. Frozen components
The listener pattern can get particularly tricky in combination with frozen components. To detect components
with listeners that are not frozen consistently, `init_nlp()` (which is called by `spacy train`) goes through
the listeners and their upstream components and warns in two scenarios.
#### The Tok2Vec or Transformer is frozen
If the `Tok2Vec` or `Transformer` was already trained,
e.g. by [pretraining](https://spacy.io/usage/embeddings-transformers#pretraining),
it could be a valid use-case to freeze the embedding architecture and only train downstream components such
as a tagger or a parser. This used to be impossible before 3.1, but has become supported since then by putting the
embedding component in the [`annotating_components`](https://spacy.io/usage/training#annotating-components)
list of the config. This works like any other "annotating component" because it relies on the `Doc` attributes.
However, if the `Tok2Vec` or `Transformer` is frozen, and not present in `annotating_components`, and a related
listener isn't frozen, then a `W086` warning is shown and further training of the pipeline will likely end with `E954`.
#### The downstream component is frozen
If a downstream component with a listener is frozen but the underlying `Tok2Vec` or `Transformer` isn't, the performance of
the downstream component will be degraded after training. In this case, a `W087` warning is shown, explaining
how to use the `replace_listeners` functionality to prevent this problem.
## 4. Replacing listener with standalone
The [`replace_listeners`](https://spacy.io/api/language#replace_listeners) functionality changes the architecture
of a downstream component from using a listener pattern to a standalone `tok2vec` or `transformer` layer,
effectively making the downstream component independent of any other components in the pipeline.
It is implemented by `nlp.replace_listeners()` and typically executed by `nlp.from_config()`.
First, it fetches the original `Model` of the original component that creates the embeddings:
```
tok2vec = self.get_pipe(tok2vec_name)
tok2vec_model = tok2vec.model
```
This is either a [`Tok2Vec` model](https://github.com/explosion/spaCy/blob/master/spacy/ml/models/tok2vec.py) or a
[`TransformerModel`](https://github.com/explosion/spacy-transformers/blob/master/spacy_transformers/layers/transformer_model.py).
In the case of the `tok2vec`, this model can be copied as-is into the configuration and architecture of the
downstream component. However, for the `transformer`, this doesn't work.
The reason is that the `TransformerListener` architecture chains the listener with
[`trfs2arrays`](https://github.com/explosion/spacy-transformers/blob/master/spacy_transformers/layers/trfs2arrays.py):
```
model = chain(
    TransformerListener(upstream_name=upstream),
    trfs2arrays(pooling, grad_factor),
)
```
but the standalone `Tok2VecTransformer` has an additional `split_trf_batch` chained in between the model
and `trfs2arrays`:
```
model = chain(
    TransformerModel(name, get_spans, tokenizer_config),
    split_trf_batch(),
    trfs2arrays(pooling, grad_factor),
)
```
So you can't just take the model from the listener, and drop that into the component internally. You need to
adjust the model and the config. To facilitate this, `nlp.replace_listeners()` will check whether additional
[functions](https://github.com/explosion/spacy-transformers/blob/master/spacy_transformers/layers/_util.py) are
[defined](https://github.com/explosion/spacy-transformers/blob/master/spacy_transformers/layers/transformer_model.py)
in `model.attrs`, and if so, it will essentially call these to make the appropriate changes:
```
replace_func = tok2vec_model.attrs["replace_listener_cfg"]
new_config = replace_func(tok2vec_cfg["model"], pipe_cfg["model"]["tok2vec"])
...
new_model = tok2vec_model.attrs["replace_listener"](new_model)
```
The new config and model are then properly stored on the `nlp` object.
Note that this functionality (running the replacement for a transformer listener) was broken prior to
`spacy-transformers` 1.0.5.

@ -0,0 +1,7 @@
<a href="https://explosion.ai"><img src="https://explosion.ai/assets/img/logo.svg" width="125" height="125" align="right" /></a>
# Developer Documentation
This directory includes additional documentation and explanations of spaCy's internals. It's mostly intended for the spaCy core development team and contributors interested in the more complex parts of the library. The documents generally focus on more abstract implementation details and how specific methods and algorithms work, and they assume knowledge of what's already available in the [usage documentation](https://spacy.io/usage) and [API reference](https://spacy.io/api).
If you're looking to contribute to spaCy, make sure to check out the documentation and [contributing guide](https://github.com/explosion/spaCy/blob/master/CONTRIBUTING.md) first.

@ -0,0 +1,216 @@
# StringStore & Vocab
> Reference: `spacy/strings.pyx`
> Reference: `spacy/vocab.pyx`
## Overview
spaCy represents most strings internally using a `uint64` in Cython which
corresponds to a hash. The magic required to make this largely transparent is
handled by the `StringStore`, and is integrated into the pipelines using the
`Vocab`, which also connects it to some other information.
These are mostly internal details that average library users should never have
to think about. On the other hand, when developing a component it's normal to
interact with the Vocab for lexeme data or word vectors, and it's not unusual
to add labels to the `StringStore`.
## StringStore
### Overview
The `StringStore` is a `cdef class` that looks a bit like a two-way dictionary,
though it is not a subclass of anything in particular.
The main functionality of the `StringStore` is that `__getitem__` converts
hashes into strings or strings into hashes.
The full details of the conversion are complicated. Normally you shouldn't have
to worry about them, but the first applicable case here is used to get the
return value:
1. 0 and the empty string are special cased to each other
2. internal symbols use a lookup table (`SYMBOLS_BY_STR`)
3. normal strings or bytes are hashed
4. internal symbol IDs in `SYMBOLS_BY_INT` are handled
5. anything not yet handled is used as a hash to look up a string
For the symbol enums, see [`symbols.pxd`](https://github.com/explosion/spaCy/blob/master/spacy/symbols.pxd).
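The resolution order above can be sketched in plain Python. This is a toy model, not the Cython implementation: the two-entry symbol table is invented for illustration, and `toy_hash` uses `hashlib.blake2b` as a stand-in for murmurhash, so the resulting hash values will not match spaCy's.

```python
import hashlib

# Hypothetical two-entry symbol table; the real one is defined in symbols.pxd.
SYMBOLS_BY_STR = {"IS_ALPHA": 1, "IS_DIGIT": 2}
SYMBOLS_BY_INT = {i: s for s, i in SYMBOLS_BY_STR.items()}


def toy_hash(text):
    """64-bit stand-in hash over UTF-8 bytes (spaCy uses murmurhash)."""
    if isinstance(text, str):
        text = text.encode("utf8")
    digest = hashlib.blake2b(text, digest_size=8).digest()
    return int.from_bytes(digest, "little")


class ToyStringStore:
    def __init__(self):
        self._strings = {}  # hash -> interned string

    def add(self, text):
        key = toy_hash(text)
        self._strings[key] = text
        return key

    def __getitem__(self, key):
        if isinstance(key, (str, bytes)):
            if len(key) == 0:
                return 0                    # 1. "" -> 0
            if isinstance(key, str) and key in SYMBOLS_BY_STR:
                return SYMBOLS_BY_STR[key]  # 2. symbol name -> symbol ID
            return toy_hash(key)            # 3. hashed, but not interned
        if key == 0:
            return ""                       # 1. 0 -> ""
        if key in SYMBOLS_BY_INT:
            return SYMBOLS_BY_INT[key]      # 4. symbol ID -> symbol name
        return self._strings[key]           # 5. hash -> string, or KeyError


store = ToyStringStore()
print(store[""], repr(store[0]), store["IS_ALPHA"], store[1])
```

Because case 3 only computes a hash, looking up that hash again raises `KeyError` until the string has been passed to `add`.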
Almost all strings in spaCy are stored in the `StringStore`. This naturally
includes tokens, but also includes things like labels (not just NER/POS/dep,
but also categories etc.), lemmas, lowercase forms, word shapes, and so on. One
of the main results of this is that tokens can be represented by a compact C
struct ([`LexemeC`](https://spacy.io/api/cython-structs#lexemec)/[`TokenC`](https://spacy.io/api/cython-structs#tokenc)) that mostly consists of string hashes. This also means that converting
input for the models is straightforward, and there's not a token mapping step
like in many machine learning frameworks. Additionally, because the token IDs
in spaCy are based on hashes, they are consistent across environments or
models.
One pattern you'll see a lot in spaCy APIs is that `something.value` returns an
`int` and `something.value_` returns a string. That's implemented using the
`StringStore`. Typically the `int` is stored in a C struct and the string is
generated via a property that calls into the `StringStore` with the `int`.
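A minimal sketch of that `value` / `value_` pattern, assuming a made-up `ToyStore` and `ToyToken` (neither is a spaCy class, and the SHA-1 based key is only a stand-in for the real hash):

```python
import hashlib


class ToyStore:
    """Hypothetical minimal string store; not the real StringStore."""

    def __init__(self):
        self._strings = {}

    def add(self, text):
        digest = hashlib.sha1(text.encode("utf8")).digest()
        key = int.from_bytes(digest[:8], "little")
        self._strings[key] = text
        return key

    def __getitem__(self, key):
        return self._strings[key]


class ToyToken:
    def __init__(self, store, lemma):
        self._store = store
        self.lemma = store.add(lemma)  # only the int is stored on the token

    @property
    def lemma_(self):
        # The string is generated on demand from the stored int.
        return self._store[self.lemma]

    @lemma_.setter
    def lemma_(self, value):
        self.lemma = self._store.add(value)


store = ToyStore()
token = ToyToken(store, "be")
print(token.lemma, token.lemma_)
```

Keeping only the integer on the object is what lets the real structs stay compact; the string form is derived from it whenever it is needed.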
Besides `__getitem__`, the `StringStore` has functions to return specifically a
string or specifically a hash, regardless of whether the input was a string or
hash to begin with, though these are only used occasionally.
### Implementation Details: Hashes and Allocations
Hashes are 64-bit and are computed using [murmurhash][] on UTF-8 bytes. There is no
mechanism for detecting and avoiding collisions. To date there has never been a
reproducible collision or user report about any related issues.
[murmurhash]: https://github.com/explosion/murmurhash
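To put the collision risk in perspective, the following back-of-the-envelope sketch applies the standard union (birthday) bound. It assumes hashes behave like uniformly random 64-bit values, which is an assumption made for the estimate rather than a documented property of murmurhash, and `collision_probability` is a hypothetical helper.

```python
def collision_probability(n_strings: int, bits: int = 64) -> float:
    """Upper bound on P(any collision): number of pairs divided by 2**bits."""
    pairs = n_strings * (n_strings - 1) // 2
    return pairs / 2**bits


for n in (10**6, 10**8, 10**9):
    print(f"{n:>13,} strings: at most ~{collision_probability(n):.1e}")
```

Under that assumption, a million distinct strings give a bound of roughly 3 in 100 million, which is consistent with collisions not showing up in practice.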
The empty string is not hashed; it's just converted to/from 0.
A small number of strings use indices into a lookup table (so low integers)
rather than hashes. This is mostly Universal Dependencies labels or other
strings considered "core" in spaCy. This was critical in v1, which hadn't
introduced hashing yet. Since v2 it's important for items in `spacy.attrs`,
especially lexeme flags, but is otherwise only maintained for backwards
compatibility.
You can call `strings["mystring"]` with a string the `StringStore` has never seen
before and it will return a hash. But in order to do the reverse operation, you
need to call `strings.add("mystring")` first. Without a call to `add` the
string will not be interned.
Example:
```
from spacy.strings import StringStore

ss = StringStore()
hashval = ss["spacy"] # 10639093010105930009

try:
    # this won't work
    ss[hashval]
except KeyError:
    print(f"key {hashval} unknown in the StringStore.")

ss.add("spacy")
assert ss[hashval] == "spacy" # it works now

# There is no `.keys` property, but you can iterate over keys
# The empty string will never be in the list of keys
for key in ss:
    print(key)
```
In normal use nothing is ever removed from the `StringStore`. In theory this
means that if you do something like iterate through all hex values of a certain
length you can have explosive memory usage. In practice this has never been an
issue. (Note that this is also different from using `sys.intern` to intern
Python strings, which does not guarantee they won't be garbage collected later.)
Strings are stored in the `StringStore` in a peculiar way: each string uses a
union that is either an eight-byte `char[]` or a `char*`. Short strings are
stored directly in the `char[]`, while longer strings are stored in allocated
memory and prefixed with their length. This is a strategy to reduce indirection
and memory fragmentation. See `decode_Utf8Str` and `_allocate` in
`strings.pyx` for the implementation.
### When to Use the StringStore?
While you can ignore the `StringStore` in many cases, there are situations where
you should make use of it to avoid errors.
Any time you introduce a string that may be set on a `Doc` field that has a hash,
you should add the string to the `StringStore`. This mainly happens when adding
labels in components, but there are some other cases:
- syntax iterators, mainly `get_noun_chunks`
- external data used in components, like the `KnowledgeBase` in the `entity_linker`
- labels used in tests
## Vocab
The `Vocab` is a core component of a `Language` pipeline. Its main function is
to manage `Lexeme`s, which are structs that contain information about a token
that depends only on its surface form, without context. `Lexeme`s store much of
the data associated with `Token`s. As a side effect of this the `Vocab` also
manages the `StringStore` for a pipeline and a grab-bag of other data.
These are things stored in the vocab:
- `Lexeme`s
- `StringStore`
- `Morphology`: manages info used in `MorphAnalysis` objects
- `vectors`: basically a dict for word vectors
- `lookups`: language specific data like lemmas
- `writing_system`: language specific metadata
- `get_noun_chunks`: a syntax iterator
- lex attribute getters: functions like `is_punct`, set in language defaults
- `cfg`: **not** the pipeline config, this is mostly unused
- `_unused_object`: Formerly an unused object, kept around until v4 for compatibility
Some of these, like the Morphology and Vectors, are complex enough that they
need their own explanations. Here we'll just look at Vocab-specific items.
### Lexemes
A `Lexeme` is a type that mainly wraps a `LexemeC`, a struct consisting of ints
that identify various context-free token attributes. Lexemes are the core data
of the `Vocab`, and can be accessed using `__getitem__` on the `Vocab`. The memory
for storing `LexemeC` objects is managed by a pool that belongs to the `Vocab`.
Note that `__getitem__` on the `Vocab` works much like the `StringStore`, in
that it accepts either a string or a hash/ID, with one important difference: if
you do a lookup using a string, that value is added to the `StringStore` automatically.
The attributes stored in a `LexemeC` are:
- orth (the raw text)
- lower
- norm
- shape
- prefix
- suffix
Most of these are straightforward. All of them can be customized, and (except
`orth`) probably should be since the defaults are based on English, but in
practice this is rarely done at present.
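As a rough illustration of what English-oriented defaults for these attributes can look like, here is a simplified sketch. The helpers `toy_shape` and `toy_lexeme_attrs` are hypothetical, the exact rules are assumptions, and the values are kept as strings for readability, whereas a `LexemeC` stores them as hashes.

```python
def toy_shape(text: str) -> str:
    """Map characters to X/x/d and cap runs of the same class at four."""
    shape = []
    last = ""
    run = 0
    for char in text:
        if char.isalpha():
            shape_char = "X" if char.isupper() else "x"
        elif char.isdigit():
            shape_char = "d"
        else:
            shape_char = char
        run = run + 1 if shape_char == last else 1
        last = shape_char
        if run <= 4:
            shape.append(shape_char)
    return "".join(shape)


def toy_lexeme_attrs(text: str) -> dict:
    return {
        "orth": text,
        "lower": text.lower(),
        "norm": text.lower(),      # assumed default: same as `lower`
        "shape": toy_shape(text),
        "prefix": text[:1],        # assumed default: first character
        "suffix": text[-3:],       # assumed default: last three characters
    }


print(toy_lexeme_attrs("Walking"))
```

A three-character suffix such as `ing` is informative for English but says little for languages with different morphology or writing systems, which is why customizing these getters can make sense.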
### Lookups
This is basically a dict of dicts, implemented using a `Table` for each
sub-dict, that stores lemmas and other language-specific lookup data.
A `Table` is a subclass of `OrderedDict` used for string-to-string data. It uses
Bloom filters to speed up misses and has some extra serialization features.
Tables are not used outside of the lookups.
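The idea of putting a Bloom filter in front of a dict can be sketched as follows. `ToyBloom` and `ToyTable` are hypothetical and much simpler than the real `Table`: keys are plain strings, the filter sizes are arbitrary, and deletion and serialization are not handled.

```python
import hashlib
from collections import OrderedDict


class ToyBloom:
    """Tiny Bloom filter: no false negatives, occasional false positives."""

    def __init__(self, n_bits: int = 1024, n_hashes: int = 3):
        self.n_bits = n_bits
        self.n_hashes = n_hashes
        self.bits = bytearray(n_bits)  # one byte per bit, for simplicity

    def _positions(self, key: str):
        for seed in range(self.n_hashes):
            digest = hashlib.blake2b(
                key.encode("utf8"), digest_size=8, salt=bytes([seed])
            ).digest()
            yield int.from_bytes(digest, "little") % self.n_bits

    def add(self, key: str) -> None:
        for pos in self._positions(key):
            self.bits[pos] = 1

    def __contains__(self, key: str) -> bool:
        return all(self.bits[pos] for pos in self._positions(key))


class ToyTable(OrderedDict):
    def __init__(self):
        super().__init__()
        self.bloom = ToyBloom()

    def __setitem__(self, key, value):
        self.bloom.add(key)
        super().__setitem__(key, value)

    def __contains__(self, key):
        # A Bloom miss is definitive, so most misses skip the dict lookup.
        return key in self.bloom and super().__contains__(key)

    def get(self, key, default=None):
        if key not in self.bloom:
            return default
        return super().get(key, default)


table = ToyTable()
table["mice"] = "mouse"
print(table.get("mice"), table.get("geese", "geese"))
```

Lookup tables for lemmas are queried with many words that are not present, so a cheap definitive "no" for most misses is where the filter pays off.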
### Lex Attribute Getters
Lexical Attribute Getters like `is_punct` are defined on a per-language basis,
much like lookups, but take the form of functions rather than string-to-string
dicts, so they're stored separately.
### Writing System
This is a dict with three attributes:
- `direction`: ltr or rtl (default ltr)
- `has_case`: bool (default `True`)
- `has_letters`: bool (default `True`, `False` only for CJK for now)
Currently these are not used much - the main use is that `direction` is used in
visualizers, though `rtl` doesn't quite work (see
[#4854](https://github.com/explosion/spaCy/issues/4854)). In the future they
could be used when choosing hyperparameters for subwords, controlling word
shape generation, and similar tasks.
### Other Vocab Members
The Vocab is kind of the default place to store things from `Language.defaults`
that don't belong to the Tokenizer. The following properties are in the Vocab
just because they don't have anywhere else to go.
- `get_noun_chunks`
- `cfg`: This is a dict that just stores `oov_prob` (hardcoded to `-20`)
- `_unused_object`: Leftover C member, should be removed in next major version

@ -104,3 +104,26 @@ AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE. SOFTWARE.
importlib_metadata
------------------
* Files: util.py
The implementation of packages_distributions() is adapted from
importlib_metadata, which is distributed under the following license:
Copyright 2017-2019 Jason R. Coombs, Barry Warsaw
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.

@ -5,7 +5,7 @@ requires = [
"cymem>=2.0.2,<2.1.0", "cymem>=2.0.2,<2.1.0",
"preshed>=3.0.2,<3.1.0", "preshed>=3.0.2,<3.1.0",
"murmurhash>=0.28.0,<1.1.0", "murmurhash>=0.28.0,<1.1.0",
"thinc>=8.0.8,<8.1.0", "thinc>=8.0.10,<8.1.0",
"blis>=0.4.0,<0.8.0", "blis>=0.4.0,<0.8.0",
"pathy", "pathy",
"numpy>=1.15.0", "numpy>=1.15.0",

@ -1,15 +1,15 @@
# Our libraries # Our libraries
spacy-legacy>=3.0.7,<3.1.0 spacy-legacy>=3.0.8,<3.1.0
cymem>=2.0.2,<2.1.0 cymem>=2.0.2,<2.1.0
preshed>=3.0.2,<3.1.0 preshed>=3.0.2,<3.1.0
thinc>=8.0.8,<8.1.0 thinc>=8.0.10,<8.1.0
blis>=0.4.0,<0.8.0 blis>=0.4.0,<0.8.0
ml_datasets>=0.2.0,<0.3.0 ml_datasets>=0.2.0,<0.3.0
murmurhash>=0.28.0,<1.1.0 murmurhash>=0.28.0,<1.1.0
wasabi>=0.8.1,<1.1.0 wasabi>=0.8.1,<1.1.0
srsly>=2.4.1,<3.0.0 srsly>=2.4.1,<3.0.0
catalogue>=2.0.4,<2.1.0 catalogue>=2.0.6,<2.1.0
typer>=0.3.0,<0.4.0 typer>=0.3.0,<0.5.0
pathy>=0.3.5 pathy>=0.3.5
# Third party dependencies # Third party dependencies
numpy>=1.15.0 numpy>=1.15.0

@ -37,19 +37,19 @@ setup_requires =
cymem>=2.0.2,<2.1.0 cymem>=2.0.2,<2.1.0
preshed>=3.0.2,<3.1.0 preshed>=3.0.2,<3.1.0
murmurhash>=0.28.0,<1.1.0 murmurhash>=0.28.0,<1.1.0
thinc>=8.0.8,<8.1.0 thinc>=8.0.10,<8.1.0
install_requires = install_requires =
# Our libraries # Our libraries
spacy-legacy>=3.0.7,<3.1.0 spacy-legacy>=3.0.8,<3.1.0
murmurhash>=0.28.0,<1.1.0 murmurhash>=0.28.0,<1.1.0
cymem>=2.0.2,<2.1.0 cymem>=2.0.2,<2.1.0
preshed>=3.0.2,<3.1.0 preshed>=3.0.2,<3.1.0
thinc>=8.0.8,<8.1.0 thinc>=8.0.10,<8.1.0
blis>=0.4.0,<0.8.0 blis>=0.4.0,<0.8.0
wasabi>=0.8.1,<1.1.0 wasabi>=0.8.1,<1.1.0
srsly>=2.4.1,<3.0.0 srsly>=2.4.1,<3.0.0
catalogue>=2.0.4,<2.1.0 catalogue>=2.0.6,<2.1.0
typer>=0.3.0,<0.4.0 typer>=0.3.0,<0.5.0
pathy>=0.3.5 pathy>=0.3.5
# Third-party dependencies # Third-party dependencies
tqdm>=4.38.0,<5.0.0 tqdm>=4.38.0,<5.0.0

@ -1,6 +1,6 @@
# fmt: off # fmt: off
__title__ = "spacy" __title__ = "spacy"
__version__ = "3.1.1" __version__ = "3.1.3"
__download_url__ = "https://github.com/explosion/spacy-models/releases/download" __download_url__ = "https://github.com/explosion/spacy-models/releases/download"
__compatibility__ = "https://raw.githubusercontent.com/explosion/spacy-models/master/compatibility.json" __compatibility__ = "https://raw.githubusercontent.com/explosion/spacy-models/master/compatibility.json"
__projects__ = "https://github.com/explosion/projects" __projects__ = "https://github.com/explosion/projects"

@ -397,7 +397,11 @@ def git_checkout(
run_command(cmd, capture=True) run_command(cmd, capture=True)
# We need Path(name) to make sure we also support subdirectories # We need Path(name) to make sure we also support subdirectories
try: try:
shutil.copytree(str(tmp_dir / Path(subpath)), str(dest)) source_path = tmp_dir / Path(subpath)
if not is_subpath_of(tmp_dir, source_path):
err = f"'{subpath}' is a path outside of the cloned repository."
msg.fail(err, repo, exits=1)
shutil.copytree(str(source_path), str(dest))
except FileNotFoundError: except FileNotFoundError:
err = f"Can't clone {subpath}. Make sure the directory exists in the repo (branch '{branch}')" err = f"Can't clone {subpath}. Make sure the directory exists in the repo (branch '{branch}')"
msg.fail(err, repo, exits=1) msg.fail(err, repo, exits=1)
@ -445,8 +449,14 @@ def git_sparse_checkout(repo, subpath, dest, branch):
# And finally, we can checkout our subpath # And finally, we can checkout our subpath
cmd = f"git -C {tmp_dir} checkout {branch} {subpath}" cmd = f"git -C {tmp_dir} checkout {branch} {subpath}"
run_command(cmd, capture=True) run_command(cmd, capture=True)
# We need Path(name) to make sure we also support subdirectories
shutil.move(str(tmp_dir / Path(subpath)), str(dest)) # Get a subdirectory of the cloned path, if appropriate
source_path = tmp_dir / Path(subpath)
if not is_subpath_of(tmp_dir, source_path):
err = f"'{subpath}' is a path outside of the cloned repository."
msg.fail(err, repo, exits=1)
shutil.move(str(source_path), str(dest))
def get_git_version( def get_git_version(
@ -477,6 +487,19 @@ def _http_to_git(repo: str) -> str:
return repo return repo
def is_subpath_of(parent, child):
"""
Check whether `child` is a path contained within `parent`.
"""
# Based on https://stackoverflow.com/a/37095733 .
# In Python 3.9, the `Path.is_relative_to()` method will supplant this, so
# we can stop using crusty old os.path functions.
parent_realpath = os.path.realpath(parent)
child_realpath = os.path.realpath(child)
return os.path.commonpath([parent_realpath, child_realpath]) == parent_realpath
def string_to_list(value: str, intify: bool = False) -> Union[List[str], List[int]]: def string_to_list(value: str, intify: bool = False) -> Union[List[str], List[int]]:
"""Parse a comma-separated string to a list and account for various """Parse a comma-separated string to a list and account for various
formatting options. Mostly used to handle CLI arguments that take a list of formatting options. Mostly used to handle CLI arguments that take a list of

@ -2,6 +2,8 @@ from typing import Optional, Union, Any, Dict, List, Tuple
import shutil import shutil
from pathlib import Path from pathlib import Path
from wasabi import Printer, MarkdownRenderer, get_raw_input from wasabi import Printer, MarkdownRenderer, get_raw_input
from thinc.api import Config
from collections import defaultdict
import srsly import srsly
import sys import sys
@ -99,6 +101,12 @@ def package(
msg.fail("Can't load pipeline meta.json", meta_path, exits=1) msg.fail("Can't load pipeline meta.json", meta_path, exits=1)
meta = srsly.read_json(meta_path) meta = srsly.read_json(meta_path)
meta = get_meta(input_dir, meta) meta = get_meta(input_dir, meta)
if meta["requirements"]:
msg.good(
f"Including {len(meta['requirements'])} package requirement(s) from "
f"meta and config",
", ".join(meta["requirements"]),
)
if name is not None: if name is not None:
meta["name"] = name meta["name"] = name
if version is not None: if version is not None:
@ -175,6 +183,55 @@ def has_wheel() -> bool:
return False return False
def get_third_party_dependencies(
config: Config, exclude: List[str] = util.SimpleFrozenList()
) -> List[str]:
"""If the config includes references to registered functions that are
provided by third-party packages (spacy-transformers, other libraries), we
want to include them in meta["requirements"] so that the package specifies
them as dependencies and the user won't have to do it manually.
We do this by:
- traversing the config to check for registered function (@ keys)
- looking up the functions and getting their module
- looking up the module version and generating an appropriate version range
config (Config): The pipeline config.
exclude (list): List of packages to exclude (e.g. that already exist in meta).
RETURNS (list): The versioned requirements.
"""
own_packages = ("spacy", "spacy-legacy", "spacy-nightly", "thinc", "srsly")
distributions = util.packages_distributions()
funcs = defaultdict(set)
# We only want to look at runtime-relevant sections, not [training] or [initialize]
for section in ("nlp", "components"):
for path, value in util.walk_dict(config[section]):
if path[-1].startswith("@"): # collect all function references by registry
funcs[path[-1][1:]].add(value)
for component in config.get("components", {}).values():
if "factory" in component:
funcs["factories"].add(component["factory"])
modules = set()
for reg_name, func_names in funcs.items():
for func_name in func_names:
func_info = util.registry.find(reg_name, func_name)
module_name = func_info.get("module")
if module_name: # the code is part of a module, not a --code file
modules.add(func_info["module"].split(".")[0])
dependencies = []
for module_name in modules:
if module_name in distributions:
dist = distributions.get(module_name)
if dist:
pkg = dist[0]
if pkg in own_packages or pkg in exclude:
continue
version = util.get_package_version(pkg)
version_range = util.get_minor_version_range(version)
dependencies.append(f"{pkg}{version_range}")
return dependencies
def get_build_formats(formats: List[str]) -> Tuple[bool, bool]: def get_build_formats(formats: List[str]) -> Tuple[bool, bool]:
supported = ["sdist", "wheel", "none"] supported = ["sdist", "wheel", "none"]
for form in formats: for form in formats:
@ -208,7 +265,7 @@ def get_meta(
nlp = util.load_model_from_path(Path(model_path)) nlp = util.load_model_from_path(Path(model_path))
meta.update(nlp.meta) meta.update(nlp.meta)
meta.update(existing_meta) meta.update(existing_meta)
meta["spacy_version"] = util.get_model_version_range(about.__version__) meta["spacy_version"] = util.get_minor_version_range(about.__version__)
meta["vectors"] = { meta["vectors"] = {
"width": nlp.vocab.vectors_length, "width": nlp.vocab.vectors_length,
"vectors": len(nlp.vocab.vectors), "vectors": len(nlp.vocab.vectors),
@ -217,6 +274,11 @@ def get_meta(
} }
if about.__title__ != "spacy": if about.__title__ != "spacy":
meta["parent_package"] = about.__title__ meta["parent_package"] = about.__title__
meta.setdefault("requirements", [])
# Update the requirements with all third-party packages in the config
existing_reqs = [util.split_requirement(req)[0] for req in meta["requirements"]]
reqs = get_third_party_dependencies(nlp.config, exclude=existing_reqs)
meta["requirements"].extend(reqs)
return meta return meta

@ -1,18 +1,24 @@
from typing import Optional from typing import Any, Dict, Optional
from pathlib import Path from pathlib import Path
from wasabi import msg from wasabi import msg
import re import re
import shutil import shutil
import requests import requests
import typer
from ...util import ensure_path, working_dir from ...util import ensure_path, working_dir
from .._util import project_cli, Arg, Opt, PROJECT_FILE, load_project_config from .._util import project_cli, Arg, Opt, PROJECT_FILE, load_project_config
from .._util import get_checksum, download_file, git_checkout, get_git_version from .._util import get_checksum, download_file, git_checkout, get_git_version
from .._util import SimpleFrozenDict, parse_config_overrides
@project_cli.command("assets") @project_cli.command(
"assets",
context_settings={"allow_extra_args": True, "ignore_unknown_options": True},
)
def project_assets_cli( def project_assets_cli(
# fmt: off # fmt: off
ctx: typer.Context, # This is only used to read additional arguments
project_dir: Path = Arg(Path.cwd(), help="Path to cloned project. Defaults to current working directory.", exists=True, file_okay=False), project_dir: Path = Arg(Path.cwd(), help="Path to cloned project. Defaults to current working directory.", exists=True, file_okay=False),
sparse_checkout: bool = Opt(False, "--sparse", "-S", help="Use sparse checkout for assets provided via Git, to only check out and clone the files needed. Requires Git v22.2+.") sparse_checkout: bool = Opt(False, "--sparse", "-S", help="Use sparse checkout for assets provided via Git, to only check out and clone the files needed. Requires Git v22.2+.")
# fmt: on # fmt: on
@ -24,16 +30,22 @@ def project_assets_cli(
DOCS: https://spacy.io/api/cli#project-assets DOCS: https://spacy.io/api/cli#project-assets
""" """
project_assets(project_dir, sparse_checkout=sparse_checkout) overrides = parse_config_overrides(ctx.args)
project_assets(project_dir, overrides=overrides, sparse_checkout=sparse_checkout)
def project_assets(project_dir: Path, *, sparse_checkout: bool = False) -> None: def project_assets(
project_dir: Path,
*,
overrides: Dict[str, Any] = SimpleFrozenDict(),
sparse_checkout: bool = False,
) -> None:
"""Fetch assets for a project using DVC if possible. """Fetch assets for a project using DVC if possible.
project_dir (Path): Path to project directory. project_dir (Path): Path to project directory.
""" """
project_path = ensure_path(project_dir) project_path = ensure_path(project_dir)
config = load_project_config(project_path) config = load_project_config(project_path, overrides=overrides)
assets = config.get("assets", {}) assets = config.get("assets", {})
if not assets: if not assets:
msg.warn(f"No assets specified in {PROJECT_FILE}", exits=0) msg.warn(f"No assets specified in {PROJECT_FILE}", exits=0)
@ -59,6 +71,15 @@ def project_assets(project_dir: Path, *, sparse_checkout: bool = False) -> None:
shutil.rmtree(dest) shutil.rmtree(dest)
else: else:
dest.unlink() dest.unlink()
if "repo" not in asset["git"] or asset["git"]["repo"] is None:
msg.fail(
"A git asset must include 'repo', the repository address.", exits=1
)
if "path" not in asset["git"] or asset["git"]["path"] is None:
msg.fail(
"A git asset must include 'path' - use \"\" to get the entire repository.",
exits=1,
)
git_checkout( git_checkout(
asset["git"]["repo"], asset["git"]["repo"],
asset["git"]["path"], asset["git"]["path"],

@ -1,6 +1,7 @@
from typing import Optional, List, Dict, Sequence, Any, Iterable from typing import Optional, List, Dict, Sequence, Any, Iterable
from pathlib import Path from pathlib import Path
from wasabi import msg from wasabi import msg
from wasabi.util import locale_escape
import sys import sys
import srsly import srsly
import typer import typer
@ -57,6 +58,7 @@ def project_run(
project_dir (Path): Path to project directory. project_dir (Path): Path to project directory.
subcommand (str): Name of command to run. subcommand (str): Name of command to run.
overrides (Dict[str, Any]): Optional config overrides.
force (bool): Force re-running, even if nothing changed. force (bool): Force re-running, even if nothing changed.
dry (bool): Perform a dry run and don't execute commands. dry (bool): Perform a dry run and don't execute commands.
capture (bool): Whether to capture the output and errors of individual commands. capture (bool): Whether to capture the output and errors of individual commands.
@ -72,7 +74,14 @@ def project_run(
if subcommand in workflows: if subcommand in workflows:
msg.info(f"Running workflow '{subcommand}'") msg.info(f"Running workflow '{subcommand}'")
for cmd in workflows[subcommand]: for cmd in workflows[subcommand]:
project_run(project_dir, cmd, force=force, dry=dry, capture=capture) project_run(
project_dir,
cmd,
overrides=overrides,
force=force,
dry=dry,
capture=capture,
)
else: else:
cmd = commands[subcommand] cmd = commands[subcommand]
for dep in cmd.get("deps", []): for dep in cmd.get("deps", []):
@ -127,7 +136,7 @@ def print_run_help(project_dir: Path, subcommand: Optional[str] = None) -> None:
print("") print("")
title = config.get("title") title = config.get("title")
if title: if title:
print(f"{title}\n") print(f"{locale_escape(title)}\n")
if config_commands: if config_commands:
print(f"Available commands in {PROJECT_FILE}") print(f"Available commands in {PROJECT_FILE}")
print(f"Usage: {COMMAND} project run [COMMAND] {project_loc}") print(f"Usage: {COMMAND} project run [COMMAND] {project_loc}")

@ -41,10 +41,10 @@ da:
word_vectors: da_core_news_lg word_vectors: da_core_news_lg
transformer: transformer:
efficiency: efficiency:
name: DJSammy/bert-base-danish-uncased_BotXO,ai name: Maltehb/danish-bert-botxo
size_factor: 3 size_factor: 3
accuracy: accuracy:
name: DJSammy/bert-base-danish-uncased_BotXO,ai name: Maltehb/danish-bert-botxo
size_factor: 3 size_factor: 3
de: de:
word_vectors: de_core_news_lg word_vectors: de_core_news_lg

@ -27,6 +27,14 @@ try: # Python 3.8+
except ImportError: except ImportError:
from typing_extensions import Literal # noqa: F401 from typing_extensions import Literal # noqa: F401
# Important note: The importlib_metadata "backport" includes functionality
# that's not part of the built-in importlib.metadata. We should treat this
# import like the built-in and only use what's available there.
try: # Python 3.8+
import importlib.metadata as importlib_metadata
except ImportError:
from catalogue import _importlib_metadata as importlib_metadata # noqa: F401
from thinc.api import Optimizer # noqa: F401 from thinc.api import Optimizer # noqa: F401
pickle = pickle pickle = pickle

@ -3,7 +3,7 @@ import uuid
from .templates import TPL_DEP_SVG, TPL_DEP_WORDS, TPL_DEP_WORDS_LEMMA, TPL_DEP_ARCS from .templates import TPL_DEP_SVG, TPL_DEP_WORDS, TPL_DEP_WORDS_LEMMA, TPL_DEP_ARCS
from .templates import TPL_ENT, TPL_ENT_RTL, TPL_FIGURE, TPL_TITLE, TPL_PAGE from .templates import TPL_ENT, TPL_ENT_RTL, TPL_FIGURE, TPL_TITLE, TPL_PAGE
from .templates import TPL_ENTS from .templates import TPL_ENTS, TPL_KB_LINK
from ..util import minify_html, escape_html, registry from ..util import minify_html, escape_html, registry
from ..errors import Errors from ..errors import Errors
@ -305,7 +305,7 @@ class EntityRenderer:
"""Render entities in text. """Render entities in text.
text (str): Original text. text (str): Original text.
spans (list): Individual entity spans and their start, end and label. spans (list): Individual entity spans and their start, end, label, kb_id and kb_url.
title (str / None): Document title set in Doc.user_data['title']. title (str / None): Document title set in Doc.user_data['title'].
""" """
markup = "" markup = ""
@ -314,6 +314,9 @@ class EntityRenderer:
label = span["label"] label = span["label"]
start = span["start"] start = span["start"]
end = span["end"] end = span["end"]
kb_id = span.get("kb_id", "")
kb_url = span.get("kb_url", "#")
kb_link = TPL_KB_LINK.format(kb_id=kb_id, kb_url=kb_url) if kb_id else ""
additional_params = span.get("params", {}) additional_params = span.get("params", {})
entity = escape_html(text[start:end]) entity = escape_html(text[start:end])
fragments = text[offset:start].split("\n") fragments = text[offset:start].split("\n")
@ -323,7 +326,12 @@ class EntityRenderer:
markup += "</br>" markup += "</br>"
if self.ents is None or label.upper() in self.ents: if self.ents is None or label.upper() in self.ents:
color = self.colors.get(label.upper(), self.default_color) color = self.colors.get(label.upper(), self.default_color)
ent_settings = {"label": label, "text": entity, "bg": color} ent_settings = {
"label": label,
"text": entity,
"bg": color,
"kb_link": kb_link,
}
ent_settings.update(additional_params) ent_settings.update(additional_params)
markup += self.ent_template.format(**ent_settings) markup += self.ent_template.format(**ent_settings)
else: else:

@ -51,17 +51,22 @@ TPL_ENTS = """
TPL_ENT = """ TPL_ENT = """
<mark class="entity" style="background: {bg}; padding: 0.45em 0.6em; margin: 0 0.25em; line-height: 1; border-radius: 0.35em;"> <mark class="entity" style="background: {bg}; padding: 0.45em 0.6em; margin: 0 0.25em; line-height: 1; border-radius: 0.35em;">
{text} {text}
<span style="font-size: 0.8em; font-weight: bold; line-height: 1; border-radius: 0.35em; vertical-align: middle; margin-left: 0.5rem">{label}</span> <span style="font-size: 0.8em; font-weight: bold; line-height: 1; border-radius: 0.35em; vertical-align: middle; margin-left: 0.5rem">{label}{kb_link}</span>
</mark> </mark>
""" """
TPL_ENT_RTL = """ TPL_ENT_RTL = """
<mark class="entity" style="background: {bg}; padding: 0.45em 0.6em; margin: 0 0.25em; line-height: 1; border-radius: 0.35em"> <mark class="entity" style="background: {bg}; padding: 0.45em 0.6em; margin: 0 0.25em; line-height: 1; border-radius: 0.35em">
{text} {text}
<span style="font-size: 0.8em; font-weight: bold; line-height: 1; border-radius: 0.35em; vertical-align: middle; margin-right: 0.5rem">{label}</span> <span style="font-size: 0.8em; font-weight: bold; line-height: 1; border-radius: 0.35em; vertical-align: middle; margin-right: 0.5rem">{label}{kb_link}</span>
</mark> </mark>
""" """
# Important: this needs to start with a space!
TPL_KB_LINK = """
<a style="text-decoration: none; color: inherit; font-weight: normal" href="{kb_url}">{kb_id}</a>
"""
TPL_PAGE = """ TPL_PAGE = """
<!DOCTYPE html> <!DOCTYPE html>

@ -116,13 +116,11 @@ class Warnings:
# New warnings added in v3.x # New warnings added in v3.x
W086 = ("Component '{listener}' will be (re)trained, but it needs the component " W086 = ("Component '{listener}' will be (re)trained, but it needs the component "
"'{name}' which is frozen. You can either freeze both, or neither " "'{name}' which is frozen. If you want to prevent retraining '{name}' "
"of the two. If you're sourcing the component from " "but want to train '{listener}' on top of it, you should add '{name}' to the "
"an existing pipeline, you can use the `replace_listeners` setting in " "list of 'annotating_components' in the 'training' block in the config. "
"the config block to replace its token-to-vector listener with a copy " "See the documentation for details: "
"and make it independent. For example, `replace_listeners = " "https://spacy.io/usage/training#annotating-components")
"[\"model.tok2vec\"]` See the documentation for details: "
"https://spacy.io/usage/training#config-components-listeners")
W087 = ("Component '{name}' will be (re)trained, but the component '{listener}' " W087 = ("Component '{name}' will be (re)trained, but the component '{listener}' "
"depends on it via a listener and is frozen. This means that the " "depends on it via a listener and is frozen. This means that the "
"performance of '{listener}' will be degraded. You can either freeze " "performance of '{listener}' will be degraded. You can either freeze "
@ -521,6 +519,10 @@ class Errors:
E202 = ("Unsupported alignment mode '{mode}'. Supported modes: {modes}.") E202 = ("Unsupported alignment mode '{mode}'. Supported modes: {modes}.")
# New errors added in v3.x # New errors added in v3.x
E865 = ("A SpanGroup is not functional after the corresponding Doc has "
"been garbage collected. To keep using the spans, make sure that "
"the corresponding Doc object is still available in the scope of "
"your function.")
E866 = ("Expected a string or 'Doc' as input, but got: {type}.") E866 = ("Expected a string or 'Doc' as input, but got: {type}.")
E867 = ("The 'textcat' component requires at least two labels because it " E867 = ("The 'textcat' component requires at least two labels because it "
"uses mutually exclusive classes where exactly one label is True " "uses mutually exclusive classes where exactly one label is True "
@ -868,6 +870,10 @@ class Errors:
E1019 = ("`noun_chunks` requires the pos tagging, which requires a " E1019 = ("`noun_chunks` requires the pos tagging, which requires a "
"statistical model to be installed and loaded. For more info, see " "statistical model to be installed and loaded. For more info, see "
"the documentation:\nhttps://spacy.io/usage/models") "the documentation:\nhttps://spacy.io/usage/models")
E1020 = ("No `epoch_resume` value specified and could not infer one from "
"filename. Specify an epoch to resume from.")
E1021 = ("`pos` value \"{pp}\" is not a valid Universal Dependencies tag. "
"Non-UD tags should use the `tag` property.")
# Deprecated model shortcuts, only used in errors and warnings # Deprecated model shortcuts, only used in errors and warnings

View File

@ -95,6 +95,7 @@ GLOSSARY = {
"XX": "unknown", "XX": "unknown",
"BES": 'auxiliary "be"', "BES": 'auxiliary "be"',
"HVS": 'forms of "have"', "HVS": 'forms of "have"',
"_SP": "whitespace",
# POS Tags (German) # POS Tags (German)
# TIGER Treebank # TIGER Treebank
# http://www.ims.uni-stuttgart.de/forschung/ressourcen/korpora/TIGERCorpus/annotation/tiger_introduction.pdf # http://www.ims.uni-stuttgart.de/forschung/ressourcen/korpora/TIGERCorpus/annotation/tiger_introduction.pdf

View File

@ -82,7 +82,8 @@ for orth in [
for verb in [ for verb in [
"a", "a",
"est" "semble", "est",
"semble",
"indique", "indique",
"moque", "moque",
"passe", "passe",

View File

@ -199,7 +199,7 @@ class Language:
DOCS: https://spacy.io/api/language#meta DOCS: https://spacy.io/api/language#meta
""" """
spacy_version = util.get_model_version_range(about.__version__) spacy_version = util.get_minor_version_range(about.__version__)
if self.vocab.lang: if self.vocab.lang:
self._meta.setdefault("lang", self.vocab.lang) self._meta.setdefault("lang", self.vocab.lang)
else: else:
@ -605,7 +605,7 @@ class Language:
factory_name: str, factory_name: str,
name: Optional[str] = None, name: Optional[str] = None,
*, *,
config: Optional[Dict[str, Any]] = SimpleFrozenDict(), config: Dict[str, Any] = SimpleFrozenDict(),
raw_config: Optional[Config] = None, raw_config: Optional[Config] = None,
validate: bool = True, validate: bool = True,
) -> Callable[[Doc], Doc]: ) -> Callable[[Doc], Doc]:
@ -615,8 +615,8 @@ class Language:
factory_name (str): Name of component factory. factory_name (str): Name of component factory.
name (Optional[str]): Optional name to assign to component instance. name (Optional[str]): Optional name to assign to component instance.
Defaults to factory name if not set. Defaults to factory name if not set.
config (Optional[Dict[str, Any]]): Config parameters to use for this config (Dict[str, Any]): Config parameters to use for this component.
component. Will be merged with default config, if available. Will be merged with default config, if available.
raw_config (Optional[Config]): Internals: the non-interpolated config. raw_config (Optional[Config]): Internals: the non-interpolated config.
validate (bool): Whether to validate the component config against the validate (bool): Whether to validate the component config against the
arguments and types expected by the factory. arguments and types expected by the factory.
@ -640,7 +640,6 @@ class Language:
) )
raise ValueError(err) raise ValueError(err)
pipe_meta = self.get_factory_meta(factory_name) pipe_meta = self.get_factory_meta(factory_name)
config = config or {}
# This is unideal, but the alternative would mean you always need to # This is unideal, but the alternative would mean you always need to
# specify the full config settings, which is not really viable. # specify the full config settings, which is not really viable.
if pipe_meta.default_config: if pipe_meta.default_config:
@ -722,7 +721,7 @@ class Language:
first: Optional[bool] = None, first: Optional[bool] = None,
last: Optional[bool] = None, last: Optional[bool] = None,
source: Optional["Language"] = None, source: Optional["Language"] = None,
config: Optional[Dict[str, Any]] = SimpleFrozenDict(), config: Dict[str, Any] = SimpleFrozenDict(),
raw_config: Optional[Config] = None, raw_config: Optional[Config] = None,
validate: bool = True, validate: bool = True,
) -> Callable[[Doc], Doc]: ) -> Callable[[Doc], Doc]:
@ -743,8 +742,8 @@ class Language:
last (bool): If True, insert component last in the pipeline. last (bool): If True, insert component last in the pipeline.
source (Language): Optional loaded nlp object to copy the pipeline source (Language): Optional loaded nlp object to copy the pipeline
component from. component from.
config (Optional[Dict[str, Any]]): Config parameters to use for this config (Dict[str, Any]): Config parameters to use for this component.
component. Will be merged with default config, if available. Will be merged with default config, if available.
raw_config (Optional[Config]): Internals: the non-interpolated config. raw_config (Optional[Config]): Internals: the non-interpolated config.
validate (bool): Whether to validate the component config against the validate (bool): Whether to validate the component config against the
arguments and types expected by the factory. arguments and types expected by the factory.

View File

@ -3,7 +3,6 @@ from typing import List
from collections import defaultdict from collections import defaultdict
from itertools import product from itertools import product
import numpy
import warnings import warnings
from .matcher cimport Matcher from .matcher cimport Matcher
@ -122,9 +121,7 @@ cdef class DependencyMatcher:
raise ValueError(Errors.E099.format(key=key)) raise ValueError(Errors.E099.format(key=key))
visited_nodes[relation["RIGHT_ID"]] = True visited_nodes[relation["RIGHT_ID"]] = True
else: else:
required_keys = set( required_keys = {"RIGHT_ID", "RIGHT_ATTRS", "REL_OP", "LEFT_ID"}
("RIGHT_ID", "RIGHT_ATTRS", "REL_OP", "LEFT_ID")
)
relation_keys = set(relation.keys()) relation_keys = set(relation.keys())
missing = required_keys - relation_keys missing = required_keys - relation_keys
if missing: if missing:
@ -179,28 +176,22 @@ cdef class DependencyMatcher:
self._callbacks[key] = on_match self._callbacks[key] = on_match
# Add 'RIGHT_ATTRS' to self._patterns[key] # Add 'RIGHT_ATTRS' to self._patterns[key]
_patterns = [] _patterns = [[[pat["RIGHT_ATTRS"]] for pat in pattern] for pattern in patterns]
for pattern in patterns:
token_patterns = []
for i in range(len(pattern)):
token_pattern = [pattern[i]["RIGHT_ATTRS"]]
token_patterns.append(token_pattern)
_patterns.append(token_patterns)
self._patterns[key].extend(_patterns) self._patterns[key].extend(_patterns)
# Add each node pattern of all the input patterns individually to the # Add each node pattern of all the input patterns individually to the
# matcher. This enables only a single instance of Matcher to be used. # matcher. This enables only a single instance of Matcher to be used.
# Multiple adds are required to track each node pattern. # Multiple adds are required to track each node pattern.
tokens_to_key_list = [] tokens_to_key_list = []
for i in range(len(_patterns)): for i, current_patterns in enumerate(_patterns):
# Preallocate list space # Preallocate list space
tokens_to_key = [None]*len(_patterns[i]) tokens_to_key = [None] * len(current_patterns)
# TODO: Better ways to hash edges in pattern? # TODO: Better ways to hash edges in pattern?
for j in range(len(_patterns[i])): for j, _pattern in enumerate(current_patterns):
k = self._get_matcher_key(key, i, j) k = self._get_matcher_key(key, i, j)
self._matcher.add(k, [_patterns[i][j]]) self._matcher.add(k, [_pattern])
tokens_to_key[j] = k tokens_to_key[j] = k
tokens_to_key_list.append(tokens_to_key) tokens_to_key_list.append(tokens_to_key)
@ -337,7 +328,7 @@ cdef class DependencyMatcher:
# position of the matched tokens # position of the matched tokens
for candidate_match in product(*all_positions): for candidate_match in product(*all_positions):
# A potential match is a valid match if all relationhips between the # A potential match is a valid match if all relationships between the
# matched tokens are satisfied. # matched tokens are satisfied.
is_valid = True is_valid = True
for left_idx in range(len(candidate_match)): for left_idx in range(len(candidate_match)):
@ -424,18 +415,10 @@ cdef class DependencyMatcher:
return [] return []
def _right_sib(self, doc, node): def _right_sib(self, doc, node):
candidate_children = [] return [doc[child.i] for child in doc[node].head.children if child.i > node]
for child in list(doc[node].head.children):
if child.i > node:
candidate_children.append(doc[child.i])
return candidate_children
def _left_sib(self, doc, node): def _left_sib(self, doc, node):
candidate_children = [] return [doc[child.i] for child in doc[node].head.children if child.i < node]
for child in list(doc[node].head.children):
if child.i < node:
candidate_children.append(doc[child.i])
return candidate_children
def _normalize_key(self, key): def _normalize_key(self, key):
if isinstance(key, str): if isinstance(key, str):
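
The `_patterns` refactor in this file replaces explicit nested loops with a nested list comprehension. A minimal sketch of why the two forms should be equivalent is given below; the function names and the sample pattern are illustrative assumptions, not part of the `DependencyMatcher` source.

```python
def right_attrs_loop(patterns):
    """Pre-refactor shape: build the nested lists with explicit loops."""
    result = []
    for pattern in patterns:
        token_patterns = []
        for i in range(len(pattern)):
            # Each node contributes a one-element token pattern.
            token_patterns.append([pattern[i]["RIGHT_ATTRS"]])
        result.append(token_patterns)
    return result


def right_attrs_comprehension(patterns):
    """Post-refactor shape: the same structure as one nested comprehension."""
    return [[[pat["RIGHT_ATTRS"]] for pat in pattern] for pattern in patterns]


# Hypothetical dependency pattern with an anchor node and one related node.
patterns = [
    [
        {"RIGHT_ID": "verb", "RIGHT_ATTRS": {"POS": "VERB"}},
        {
            "LEFT_ID": "verb",
            "REL_OP": ">",
            "RIGHT_ID": "subject",
            "RIGHT_ATTRS": {"DEP": "nsubj"},
        },
    ]
]

print(right_attrs_loop(patterns) == right_attrs_comprehension(patterns))
```

Both helpers only read the `RIGHT_ATTRS` key, so the anchor node (which has no `LEFT_ID` or `REL_OP`) is handled the same way as the other nodes.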

View File

@ -281,28 +281,19 @@ cdef class Matcher:
final_matches.append((key, *match)) final_matches.append((key, *match))
# Mark tokens that have matched # Mark tokens that have matched
memset(&matched[start], 1, span_len * sizeof(matched[0])) memset(&matched[start], 1, span_len * sizeof(matched[0]))
if with_alignments:
final_matches_with_alignments = final_matches
final_matches = [(key, start, end) for key, start, end, alignments in final_matches]
# perform the callbacks on the filtered set of results
for i, (key, start, end) in enumerate(final_matches):
on_match = self._callbacks.get(key, None)
if on_match is not None:
on_match(self, doc, i, final_matches)
if as_spans: if as_spans:
spans = [] final_results = []
for key, start, end in final_matches: for key, start, end, *_ in final_matches:
if isinstance(doclike, Span): if isinstance(doclike, Span):
start += doclike.start start += doclike.start
end += doclike.start end += doclike.start
spans.append(Span(doc, start, end, label=key)) final_results.append(Span(doc, start, end, label=key))
return spans
elif with_alignments: elif with_alignments:
# convert alignments List[Dict[str, int]] --> List[int] # convert alignments List[Dict[str, int]] --> List[int]
final_matches = []
# when multiple alignment (belongs to the same length) is found, # when multiple alignment (belongs to the same length) is found,
# keeps the alignment that has largest token_idx # keeps the alignment that has largest token_idx
for key, start, end, alignments in final_matches_with_alignments: final_results = []
for key, start, end, alignments in final_matches:
sorted_alignments = sorted(alignments, key=lambda x: (x['length'], x['token_idx']), reverse=False) sorted_alignments = sorted(alignments, key=lambda x: (x['length'], x['token_idx']), reverse=False)
alignments = [0] * (end-start) alignments = [0] * (end-start)
for align in sorted_alignments: for align in sorted_alignments:
@ -311,10 +302,16 @@ cdef class Matcher:
# Since alignments are sorted in order of (length, token_idx) # Since alignments are sorted in order of (length, token_idx)
# this overwrites smaller token_idx when they have same length. # this overwrites smaller token_idx when they have same length.
alignments[align['length']] = align['token_idx'] alignments[align['length']] = align['token_idx']
final_matches.append((key, start, end, alignments)) final_results.append((key, start, end, alignments))
return final_matches final_matches = final_results # for callbacks
else: else:
return final_matches final_results = final_matches
# perform the callbacks on the filtered set of results
for i, (key, *_) in enumerate(final_matches):
on_match = self._callbacks.get(key, None)
if on_match is not None:
on_match(self, doc, i, final_matches)
return final_results
def _normalize_key(self, key): def _normalize_key(self, key):
if isinstance(key, str): if isinstance(key, str):
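
The hunk above moves the callback loop to the end so that, with `with_alignments=True`, callbacks appear to receive the same list that the matcher returns. A rough pure-Python sketch of that ordering follows; `run_callbacks_last`, the `convert` argument and the simplified callback signature are assumptions made for illustration and do not reproduce the real `Matcher` internals.

```python
def run_callbacks_last(final_matches, callbacks, convert=None):
    """Post-process matches first, then fire callbacks on the processed list.

    `convert` is a stand-in for the alignment post-processing step; when it
    is None the matches are passed through unchanged.
    """
    if convert is not None:
        final_results = [convert(match) for match in final_matches]
    else:
        final_results = final_matches
    # Callbacks run only after post-processing, so they see the final list.
    for i, (key, *_) in enumerate(final_results):
        on_match = callbacks.get(key)
        if on_match is not None:
            on_match(i, final_results)
    return final_results


seen = []
callbacks = {"Rule": lambda i, matches: seen.append((i, matches))}

# Hypothetical raw match: (key, start, end, alignments as dicts).
raw = [("Rule", 3, 4, [{"length": 0, "token_idx": 0}])]


def flatten(match):
    """Simplified stand-in that turns alignment dicts into plain ints."""
    key, start, end, alignments = match
    return (key, start, end, [a["token_idx"] for a in alignments])


returned = run_callbacks_last(raw, callbacks, convert=flatten)
print(returned, seen[0][1] is returned)
```

This mirrors what `test_matcher_callback_with_alignments` checks: the mock is called once with the returned `matches` object.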
@ -340,7 +337,7 @@ cdef find_matches(TokenPatternC** patterns, int n, object doclike, int length, e
The "predicates" list contains functions that take a Python list and return a The "predicates" list contains functions that take a Python list and return a
boolean value. It's mostly used for regular expressions. boolean value. It's mostly used for regular expressions.
The "extra_getters" list contains functions that take a Python list and return The "extensions" list contains functions that take a Python list and return
an attr ID. It's mostly used for extension attributes. an attr ID. It's mostly used for extension attributes.
""" """
cdef vector[PatternStateC] states cdef vector[PatternStateC] states

View File

@ -56,7 +56,7 @@ def build_tb_parser_model(
non-linearity if use_upper=False. non-linearity if use_upper=False.
use_upper (bool): Whether to use an additional hidden layer after the state use_upper (bool): Whether to use an additional hidden layer after the state
vector in order to predict the action scores. It is recommended to set vector in order to predict the action scores. It is recommended to set
this to False for large pretrained models such as transformers, and False this to False for large pretrained models such as transformers, and True
for smaller networks. The upper layer is computed on CPU, which becomes for smaller networks. The upper layer is computed on CPU, which becomes
a bottleneck on larger GPU-based models, where it's also less necessary. a bottleneck on larger GPU-based models, where it's also less necessary.
nO (int or None): The number of actions the model will predict between. nO (int or None): The number of actions the model will predict between.

View File

@ -417,7 +417,9 @@ class SpanCategorizer(TrainablePipe):
pass pass
def _get_aligned_spans(self, eg: Example): def _get_aligned_spans(self, eg: Example):
return eg.get_aligned_spans_y2x(eg.reference.spans.get(self.key, [])) return eg.get_aligned_spans_y2x(
eg.reference.spans.get(self.key, []), allow_overlap=True
)
def _make_span_group( def _make_span_group(
self, doc: Doc, indices: Ints2d, scores: Floats2d, labels: List[str] self, doc: Doc, indices: Ints2d, scores: Floats2d, labels: List[str]
@ -425,16 +427,24 @@ class SpanCategorizer(TrainablePipe):
spans = SpanGroup(doc, name=self.key) spans = SpanGroup(doc, name=self.key)
max_positive = self.cfg["max_positive"] max_positive = self.cfg["max_positive"]
threshold = self.cfg["threshold"] threshold = self.cfg["threshold"]
keeps = scores >= threshold
ranked = (scores * -1).argsort()
if max_positive is not None:
filter = ranked[:, max_positive:]
for i, row in enumerate(filter):
keeps[i, row] = False
spans.attrs["scores"] = scores[keeps].flatten()
indices = self.model.ops.to_numpy(indices)
keeps = self.model.ops.to_numpy(keeps)
for i in range(indices.shape[0]): for i in range(indices.shape[0]):
start = int(indices[i, 0]) start = indices[i, 0]
end = int(indices[i, 1]) end = indices[i, 1]
positives = []
for j, score in enumerate(scores[i]): for j, keep in enumerate(keeps[i]):
if score >= threshold: if keep:
positives.append((score, start, end, labels[j])) spans.append(Span(doc, start, end, label=labels[j]))
positives.sort(reverse=True)
if max_positive:
positives = positives[:max_positive]
for score, start, end, label in positives:
spans.append(Span(doc, start, end, label=label))
return spans return spans
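
The new `_make_span_group` logic keeps a label for a span only if its score reaches `threshold` and it ranks within the top `max_positive` labels for that span. The sketch below restates that rule in plain Python without NumPy; `select_labels` is a hypothetical helper, and tie-breaking between equal scores may differ from `argsort` in the real code.

```python
def select_labels(scores, labels, threshold, max_positive=None):
    """Return, per candidate span, the (label, score) pairs that are kept."""
    selected = []
    for row in scores:
        # Rank label indices from highest to lowest score.
        ranked = sorted(range(len(row)), key=lambda j: -row[j])
        if max_positive is None:
            allowed = set(ranked)
        else:
            allowed = set(ranked[:max_positive])
        # Keep labels in their original order, as the keep-mask does.
        kept = [
            (labels[j], row[j])
            for j in range(len(row))
            if row[j] >= threshold and j in allowed
        ]
        selected.append(kept)
    return selected


labels = ["Thing", "City", "Person", "GreatCity"]
scores = [
    [0.2, 0.4, 0.3, 0.1],
    [0.1, 0.6, 0.2, 0.4],
    [0.8, 0.7, 0.3, 0.9],
]

for max_positive in (None, 1, 2, 3, 4):
    kept = select_labels(scores, labels, threshold=0.5, max_positive=max_positive)
    print(max_positive, sum(len(row) for row in kept))
```

The sample scores and labels are the ones used in `test_make_spangroup`, so the per-setting totals can be compared against the `nr_results` values in that test's parametrization.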

View File

@ -70,3 +70,10 @@ def test_create_with_heads_and_no_deps(vocab):
heads = list(range(len(words))) heads = list(range(len(words)))
with pytest.raises(ValueError): with pytest.raises(ValueError):
Doc(vocab, words=words, heads=heads) Doc(vocab, words=words, heads=heads)
def test_create_invalid_pos(vocab):
words = "I like ginger".split()
pos = "QQ ZZ XX".split()
with pytest.raises(ValueError):
Doc(vocab, words=words, pos=pos)

View File

@ -5,7 +5,9 @@ from spacy.attrs import ORTH, LENGTH
from spacy.tokens import Doc, Span, Token from spacy.tokens import Doc, Span, Token
from spacy.vocab import Vocab from spacy.vocab import Vocab
from spacy.util import filter_spans from spacy.util import filter_spans
from thinc.api import get_current_ops
from ..util import add_vecs_to_vocab
from .test_underscore import clean_underscore # noqa: F401 from .test_underscore import clean_underscore # noqa: F401
@ -412,3 +414,23 @@ def test_sent(en_tokenizer):
assert not span.doc.has_annotation("SENT_START") assert not span.doc.has_annotation("SENT_START")
with pytest.raises(ValueError): with pytest.raises(ValueError):
span.sent span.sent
def test_span_with_vectors(doc):
ops = get_current_ops()
prev_vectors = doc.vocab.vectors
vectors = [
("apple", ops.asarray([1, 2, 3])),
("orange", ops.asarray([-1, -2, -3])),
("And", ops.asarray([-1, -1, -1])),
("juice", ops.asarray([5, 5, 10])),
("pie", ops.asarray([7, 6.3, 8.9])),
]
add_vecs_to_vocab(doc.vocab, vectors)
# 0-length span
assert_array_equal(ops.to_numpy(doc[0:0].vector), numpy.zeros((3,)))
# longer span with no vector
assert_array_equal(ops.to_numpy(doc[0:4].vector), numpy.zeros((3,)))
# single-token span with vector
assert_array_equal(ops.to_numpy(doc[10:11].vector), [-1, -1, -1])
doc.vocab.vectors = prev_vectors

View File

@ -203,6 +203,12 @@ def test_set_pos():
assert doc[1].pos_ == "VERB" assert doc[1].pos_ == "VERB"
def test_set_invalid_pos():
doc = Doc(Vocab(), words=["hello", "world"])
with pytest.raises(ValueError):
doc[0].pos_ = "blah"
def test_tokens_sent(doc): def test_tokens_sent(doc):
"""Test token.sent property""" """Test token.sent property"""
assert len(list(doc.sents)) == 3 assert len(list(doc.sents)) == 3

View File

@ -576,6 +576,16 @@ def test_matcher_callback(en_vocab):
mock.assert_called_once_with(matcher, doc, 0, matches) mock.assert_called_once_with(matcher, doc, 0, matches)
def test_matcher_callback_with_alignments(en_vocab):
mock = Mock()
matcher = Matcher(en_vocab)
pattern = [{"ORTH": "test"}]
matcher.add("Rule", [pattern], on_match=mock)
doc = Doc(en_vocab, words=["This", "is", "a", "test", "."])
matches = matcher(doc, with_alignments=True)
mock.assert_called_once_with(matcher, doc, 0, matches)
def test_matcher_span(matcher): def test_matcher_span(matcher):
text = "JavaScript is good but Java is better" text = "JavaScript is good but Java is better"
doc = Doc(matcher.vocab, words=text.split()) doc = Doc(matcher.vocab, words=text.split())

View File

@ -1,9 +1,15 @@
import pytest import pytest
from numpy.testing import assert_equal, assert_array_equal import numpy
from numpy.testing import assert_array_equal, assert_almost_equal
from thinc.api import get_current_ops from thinc.api import get_current_ops
from spacy import util
from spacy.lang.en import English
from spacy.language import Language from spacy.language import Language
from spacy.tokens.doc import SpanGroups
from spacy.tokens import SpanGroup
from spacy.training import Example from spacy.training import Example
from spacy.util import fix_random_seed, registry from spacy.util import fix_random_seed, registry, make_tempdir
OPS = get_current_ops() OPS = get_current_ops()
@ -17,17 +23,21 @@ TRAIN_DATA = [
), ),
] ]
TRAIN_DATA_OVERLAPPING = [
("Who is Shaka Khan?", {"spans": {SPAN_KEY: [(7, 17, "PERSON")]}}),
(
"I like London and Berlin",
{"spans": {SPAN_KEY: [(7, 13, "LOC"), (18, 24, "LOC"), (7, 24, "DOUBLE_LOC")]}},
),
]
def make_get_examples(nlp):
def make_examples(nlp, data=TRAIN_DATA):
train_examples = [] train_examples = []
for t in TRAIN_DATA: for t in data:
eg = Example.from_dict(nlp.make_doc(t[0]), t[1]) eg = Example.from_dict(nlp.make_doc(t[0]), t[1])
train_examples.append(eg) train_examples.append(eg)
return train_examples
def get_examples():
return train_examples
return get_examples
def test_no_label(): def test_no_label():
@ -54,9 +64,7 @@ def test_implicit_labels():
nlp = Language() nlp = Language()
spancat = nlp.add_pipe("spancat", config={"spans_key": SPAN_KEY}) spancat = nlp.add_pipe("spancat", config={"spans_key": SPAN_KEY})
assert len(spancat.labels) == 0 assert len(spancat.labels) == 0
train_examples = [] train_examples = make_examples(nlp)
for t in TRAIN_DATA:
train_examples.append(Example.from_dict(nlp.make_doc(t[0]), t[1]))
nlp.initialize(get_examples=lambda: train_examples) nlp.initialize(get_examples=lambda: train_examples)
assert spancat.labels == ("PERSON", "LOC") assert spancat.labels == ("PERSON", "LOC")
@ -71,24 +79,75 @@ def test_explicit_labels():
assert spancat.labels == ("PERSON", "LOC") assert spancat.labels == ("PERSON", "LOC")
def test_simple_train(): def test_doc_gc():
fix_random_seed(0) # If the Doc object is garbage collected, the spans won't be functional afterwards
nlp = Language() nlp = Language()
spancat = nlp.add_pipe("spancat", config={"spans_key": SPAN_KEY}) spancat = nlp.add_pipe("spancat", config={"spans_key": SPAN_KEY})
get_examples = make_get_examples(nlp) spancat.add_label("PERSON")
nlp.initialize(get_examples) nlp.initialize()
sgd = nlp.create_optimizer() texts = [
assert len(spancat.labels) != 0 "Just a sentence.",
for i in range(40): "I like London and Berlin",
losses = {} "I like Berlin",
nlp.update(list(get_examples()), losses=losses, drop=0.1, sgd=sgd) "I eat ham.",
doc = nlp("I like London and Berlin.") ]
assert doc.spans[spancat.key] == doc.spans[SPAN_KEY] all_spans = [doc.spans for doc in nlp.pipe(texts)]
assert len(doc.spans[spancat.key]) == 2 for text, spangroups in zip(texts, all_spans):
assert doc.spans[spancat.key][0].text == "London" assert isinstance(spangroups, SpanGroups)
scores = nlp.evaluate(get_examples()) for key, spangroup in spangroups.items():
assert f"spans_{SPAN_KEY}_f" in scores assert isinstance(spangroup, SpanGroup)
assert scores[f"spans_{SPAN_KEY}_f"] == 1.0 assert len(spangroup) > 0
with pytest.raises(RuntimeError):
span = spangroup[0]
@pytest.mark.parametrize(
"max_positive,nr_results", [(None, 4), (1, 2), (2, 3), (3, 4), (4, 4)]
)
def test_make_spangroup(max_positive, nr_results):
fix_random_seed(0)
nlp = Language()
spancat = nlp.add_pipe(
"spancat",
config={"spans_key": SPAN_KEY, "threshold": 0.5, "max_positive": max_positive},
)
doc = nlp.make_doc("Greater London")
ngram_suggester = registry.misc.get("spacy.ngram_suggester.v1")(sizes=[1, 2])
indices = ngram_suggester([doc])[0].dataXd
assert_array_equal(indices, numpy.asarray([[0, 1], [1, 2], [0, 2]]))
labels = ["Thing", "City", "Person", "GreatCity"]
scores = numpy.asarray(
[[0.2, 0.4, 0.3, 0.1], [0.1, 0.6, 0.2, 0.4], [0.8, 0.7, 0.3, 0.9]], dtype="f"
)
spangroup = spancat._make_span_group(doc, indices, scores, labels)
assert len(spangroup) == nr_results
# first span is always the second token "London"
assert spangroup[0].text == "London"
assert spangroup[0].label_ == "City"
assert_almost_equal(0.6, spangroup.attrs["scores"][0], 5)
# second span depends on the number of positives that were allowed
assert spangroup[1].text == "Greater London"
if max_positive == 1:
assert spangroup[1].label_ == "GreatCity"
assert_almost_equal(0.9, spangroup.attrs["scores"][1], 5)
else:
assert spangroup[1].label_ == "Thing"
assert_almost_equal(0.8, spangroup.attrs["scores"][1], 5)
if nr_results > 2:
assert spangroup[2].text == "Greater London"
if max_positive == 2:
assert spangroup[2].label_ == "GreatCity"
assert_almost_equal(0.9, spangroup.attrs["scores"][2], 5)
else:
assert spangroup[2].label_ == "City"
assert_almost_equal(0.7, spangroup.attrs["scores"][2], 5)
assert spangroup[-1].text == "Greater London"
assert spangroup[-1].label_ == "GreatCity"
assert_almost_equal(0.9, spangroup.attrs["scores"][-1], 5)
def test_ngram_suggester(en_tokenizer): def test_ngram_suggester(en_tokenizer):
@ -209,3 +268,100 @@ def test_ngram_sizes(en_tokenizer):
range_suggester = suggester_factory(min_size=2, max_size=4) range_suggester = suggester_factory(min_size=2, max_size=4)
ngrams_3 = range_suggester(docs) ngrams_3 = range_suggester(docs)
assert_array_equal(OPS.to_numpy(ngrams_3.lengths), [0, 1, 3, 6, 9]) assert_array_equal(OPS.to_numpy(ngrams_3.lengths), [0, 1, 3, 6, 9])
def test_overfitting_IO():
# Simple test to try and quickly overfit the spancat component - ensuring the ML models work correctly
fix_random_seed(0)
nlp = English()
spancat = nlp.add_pipe("spancat", config={"spans_key": SPAN_KEY})
train_examples = make_examples(nlp)
optimizer = nlp.initialize(get_examples=lambda: train_examples)
assert spancat.model.get_dim("nO") == 2
assert set(spancat.labels) == {"LOC", "PERSON"}
for i in range(50):
losses = {}
nlp.update(train_examples, sgd=optimizer, losses=losses)
assert losses["spancat"] < 0.01
# test the trained model
test_text = "I like London and Berlin"
doc = nlp(test_text)
assert doc.spans[spancat.key] == doc.spans[SPAN_KEY]
spans = doc.spans[SPAN_KEY]
assert len(spans) == 2
assert len(spans.attrs["scores"]) == 2
assert min(spans.attrs["scores"]) > 0.9
assert set([span.text for span in spans]) == {"London", "Berlin"}
assert set([span.label_ for span in spans]) == {"LOC"}
# Also test the results are still the same after IO
with make_tempdir() as tmp_dir:
nlp.to_disk(tmp_dir)
nlp2 = util.load_model_from_path(tmp_dir)
doc2 = nlp2(test_text)
spans2 = doc2.spans[SPAN_KEY]
assert len(spans2) == 2
assert len(spans2.attrs["scores"]) == 2
assert min(spans2.attrs["scores"]) > 0.9
assert set([span.text for span in spans2]) == {"London", "Berlin"}
assert set([span.label_ for span in spans2]) == {"LOC"}
# Test scoring
scores = nlp.evaluate(train_examples)
assert f"spans_{SPAN_KEY}_f" in scores
assert scores[f"spans_{SPAN_KEY}_p"] == 1.0
assert scores[f"spans_{SPAN_KEY}_r"] == 1.0
assert scores[f"spans_{SPAN_KEY}_f"] == 1.0
# also test that the spancat works for just a single entity in a sentence
doc = nlp("London")
assert len(doc.spans[spancat.key]) == 1
def test_overfitting_IO_overlapping():
# Test for overfitting on overlapping entities
fix_random_seed(0)
nlp = English()
spancat = nlp.add_pipe("spancat", config={"spans_key": SPAN_KEY})
train_examples = make_examples(nlp, data=TRAIN_DATA_OVERLAPPING)
optimizer = nlp.initialize(get_examples=lambda: train_examples)
assert spancat.model.get_dim("nO") == 3
assert set(spancat.labels) == {"PERSON", "LOC", "DOUBLE_LOC"}
for i in range(50):
losses = {}
nlp.update(train_examples, sgd=optimizer, losses=losses)
assert losses["spancat"] < 0.01
# test the trained model
test_text = "I like London and Berlin"
doc = nlp(test_text)
spans = doc.spans[SPAN_KEY]
assert len(spans) == 3
assert len(spans.attrs["scores"]) == 3
assert min(spans.attrs["scores"]) > 0.9
assert set([span.text for span in spans]) == {
"London",
"Berlin",
"London and Berlin",
}
assert set([span.label_ for span in spans]) == {"LOC", "DOUBLE_LOC"}
# Also test the results are still the same after IO
with make_tempdir() as tmp_dir:
nlp.to_disk(tmp_dir)
nlp2 = util.load_model_from_path(tmp_dir)
doc2 = nlp2(test_text)
spans2 = doc2.spans[SPAN_KEY]
assert len(spans2) == 3
assert len(spans2.attrs["scores"]) == 3
assert min(spans2.attrs["scores"]) > 0.9
assert set([span.text for span in spans2]) == {
"London",
"Berlin",
"London and Berlin",
}
assert set([span.label_ for span in spans2]) == {"LOC", "DOUBLE_LOC"}

View File

@ -9,11 +9,13 @@ from spacy.cli import info
from spacy.cli.init_config import init_config, RECOMMENDATIONS from spacy.cli.init_config import init_config, RECOMMENDATIONS
from spacy.cli._util import validate_project_commands, parse_config_overrides from spacy.cli._util import validate_project_commands, parse_config_overrides
from spacy.cli._util import load_project_config, substitute_project_variables from spacy.cli._util import load_project_config, substitute_project_variables
from spacy.cli._util import is_subpath_of
from spacy.cli._util import string_to_list from spacy.cli._util import string_to_list
from spacy import about from spacy import about
from spacy.util import get_minor_version from spacy.util import get_minor_version
from spacy.cli.validate import get_model_pkgs from spacy.cli.validate import get_model_pkgs
from spacy.cli.download import get_compatibility, get_version from spacy.cli.download import get_compatibility, get_version
from spacy.cli.package import get_third_party_dependencies
from thinc.api import ConfigValidationError, Config from thinc.api import ConfigValidationError, Config
import srsly import srsly
import os import os
@ -532,3 +534,43 @@ def test_init_labels(component_name):
assert len(nlp2.get_pipe(component_name).labels) == 0 assert len(nlp2.get_pipe(component_name).labels) == 0
nlp2.initialize() nlp2.initialize()
assert len(nlp2.get_pipe(component_name).labels) == 4 assert len(nlp2.get_pipe(component_name).labels) == 4
def test_get_third_party_dependencies():
# We can't easily test the detection of third-party packages here, but we
# can at least make sure that the function and its importlib magic runs.
nlp = Dutch()
# Test with component factory based on Cython module
nlp.add_pipe("tagger")
assert get_third_party_dependencies(nlp.config) == []
# Test with legacy function
nlp = Dutch()
nlp.add_pipe(
"textcat",
config={
"model": {
# Do not update from legacy architecture spacy.TextCatBOW.v1
"@architectures": "spacy.TextCatBOW.v1",
"exclusive_classes": True,
"ngram_size": 1,
"no_output_layer": False,
}
},
)
assert get_third_party_dependencies(nlp.config) == []
@pytest.mark.parametrize(
"parent,child,expected",
[
("/tmp", "/tmp", True),
("/tmp", "/", False),
("/tmp", "/tmp/subdir", True),
("/tmp", "/tmpdir", False),
("/tmp", "/tmp/subdir/..", True),
("/tmp", "/tmp/..", False),
],
)
def test_is_subpath_of(parent, child, expected):
assert is_subpath_of(parent, child) == expected
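
The parametrized cases above describe the expected behavior of a subpath check: equal paths count as subpaths, `..` segments are resolved first, and a shared string prefix such as `/tmpdir` does not. One way such a check could be written with the standard library is sketched below; `is_subpath_of_sketch` is a hypothetical name, POSIX-style paths are assumed, and this is not necessarily how `spacy.cli._util.is_subpath_of` is implemented.

```python
import os


def is_subpath_of_sketch(parent, child):
    """Return True if `child` resolves to `parent` or to a path inside it."""
    parent_real = os.path.realpath(parent)
    child_real = os.path.realpath(child)
    # commonpath compares whole path components, so "/tmp" is not
    # treated as a prefix of "/tmpdir".
    return os.path.commonpath([parent_real, child_real]) == parent_real


cases = [
    ("/tmp", "/tmp", True),
    ("/tmp", "/", False),
    ("/tmp", "/tmp/subdir", True),
    ("/tmp", "/tmpdir", False),
    ("/tmp", "/tmp/subdir/..", True),
    ("/tmp", "/tmp/..", False),
]
for parent, child, expected in cases:
    print(parent, child, is_subpath_of_sketch(parent, child) == expected)
```

Comparing resolved paths rather than raw strings is the design choice that makes the `..` and `/tmpdir` cases come out as the test expects.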

View File

@ -30,6 +30,7 @@ from ..compat import copy_reg, pickle
from ..errors import Errors, Warnings from ..errors import Errors, Warnings
from ..morphology import Morphology from ..morphology import Morphology
from .. import util from .. import util
from .. import parts_of_speech
from .underscore import Underscore, get_ext_args from .underscore import Underscore, get_ext_args
from ._retokenize import Retokenizer from ._retokenize import Retokenizer
from ._serialize import ALL_ATTRS as DOCBIN_ALL_ATTRS from ._serialize import ALL_ATTRS as DOCBIN_ALL_ATTRS
@ -285,6 +286,10 @@ cdef class Doc:
sent_starts[i] = -1 sent_starts[i] = -1
elif sent_starts[i] is None or sent_starts[i] not in [-1, 0, 1]: elif sent_starts[i] is None or sent_starts[i] not in [-1, 0, 1]:
sent_starts[i] = 0 sent_starts[i] = 0
if pos is not None:
for pp in set(pos):
if pp not in parts_of_speech.IDS:
raise ValueError(Errors.E1021.format(pp=pp))
ent_iobs = None ent_iobs = None
ent_types = None ent_types = None
if ents is not None: if ents is not None:
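
The check added to `Doc.__init__` rejects `pos` values that are not in the part-of-speech table and raises `E1021`. A self-contained sketch of the same idea follows; `UD_POS_TAGS` is an assumed stand-in holding the 17 Universal Dependencies UPOS tags (spaCy's `parts_of_speech.IDS` table may accept additional values), and `validate_pos` is a hypothetical helper name.

```python
# Assumed stand-in for the lookup table: the 17 UPOS tags from
# Universal Dependencies.
UD_POS_TAGS = frozenset(
    "ADJ ADP ADV AUX CCONJ DET INTJ NOUN NUM PART PRON PROPN PUNCT "
    "SCONJ SYM VERB X".split()
)


def validate_pos(pos):
    """Raise ValueError for the first unknown coarse-grained POS value."""
    if pos is None:
        return
    # Checking the set of values avoids re-testing repeated tags.
    for pp in set(pos):
        if pp not in UD_POS_TAGS:
            raise ValueError(
                f'`pos` value "{pp}" is not a valid Universal Dependencies '
                "tag. Non-UD tags should use the `tag` property."
            )


validate_pos(["PRON", "VERB", "NOUN"])  # passes silently
try:
    validate_pos("QQ ZZ XX".split())
except ValueError as err:
    print(err)
```

The failing input is the same `"QQ ZZ XX"` sequence used in `test_create_invalid_pos`.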

View File

@ -87,9 +87,10 @@ cdef class Span:
start (int): The index of the first token of the span. start (int): The index of the first token of the span.
end (int): The index of the first token after the span. end (int): The index of the first token after the span.
label (uint64): A label to attach to the Span, e.g. for named entities. label (uint64): A label to attach to the Span, e.g. for named entities.
kb_id (uint64): An identifier from a Knowledge Base to capture the meaning of a named entity.
vector (ndarray[ndim=1, dtype='float32']): A meaning representation vector (ndarray[ndim=1, dtype='float32']): A meaning representation
of the span. of the span.
vector_norm (float): The L2 norm of the span's vector representation.
kb_id (uint64): An identifier from a Knowledge Base to capture the meaning of a named entity.
DOCS: https://spacy.io/api/span#init DOCS: https://spacy.io/api/span#init
""" """
@ -216,10 +217,12 @@ cdef class Span:
return Underscore(Underscore.span_extensions, self, return Underscore(Underscore.span_extensions, self,
start=self.c.start_char, end=self.c.end_char) start=self.c.start_char, end=self.c.end_char)
def as_doc(self, *, bint copy_user_data=False): def as_doc(self, *, bint copy_user_data=False, array_head=None, array=None):
"""Create a `Doc` object with a copy of the `Span`'s data. """Create a `Doc` object with a copy of the `Span`'s data.
copy_user_data (bool): Whether or not to copy the original doc's user data. copy_user_data (bool): Whether or not to copy the original doc's user data.
array_head (tuple): `Doc` array attrs, can be passed in to speed up computation.
array (ndarray): `Doc` as array, can be passed in to speed up computation.
RETURNS (Doc): The `Doc` copy of the span. RETURNS (Doc): The `Doc` copy of the span.
DOCS: https://spacy.io/api/span#as_doc DOCS: https://spacy.io/api/span#as_doc
@ -227,8 +230,10 @@ cdef class Span:
words = [t.text for t in self] words = [t.text for t in self]
spaces = [bool(t.whitespace_) for t in self] spaces = [bool(t.whitespace_) for t in self]
cdef Doc doc = Doc(self.doc.vocab, words=words, spaces=spaces) cdef Doc doc = Doc(self.doc.vocab, words=words, spaces=spaces)
array_head = self.doc._get_array_attrs() if array_head is None:
array = self.doc.to_array(array_head) array_head = self.doc._get_array_attrs()
if array is None:
array = self.doc.to_array(array_head)
array = array[self.start : self.end] array = array[self.start : self.end]
self._fix_dep_copy(array_head, array) self._fix_dep_copy(array_head, array)
# Fix initial IOB so the entities are valid for doc.ents below. # Fix initial IOB so the entities are valid for doc.ents below.
@ -467,7 +472,11 @@ cdef class Span:
if "vector" in self.doc.user_span_hooks: if "vector" in self.doc.user_span_hooks:
return self.doc.user_span_hooks["vector"](self) return self.doc.user_span_hooks["vector"](self)
if self._vector is None: if self._vector is None:
self._vector = sum(t.vector for t in self) / len(self) if not len(self):
xp = get_array_module(self.vocab.vectors.data)
self._vector = xp.zeros((self.vocab.vectors_length,), dtype="f")
else:
self._vector = sum(t.vector for t in self) / len(self)
return self._vector return self._vector
@property @property
@ -480,10 +489,10 @@ cdef class Span:
""" """
if "vector_norm" in self.doc.user_span_hooks: if "vector_norm" in self.doc.user_span_hooks:
return self.doc.user_span_hooks["vector"](self) return self.doc.user_span_hooks["vector"](self)
vector = self.vector
xp = get_array_module(vector)
if self._vector_norm is None: if self._vector_norm is None:
vector = self.vector
total = (vector*vector).sum() total = (vector*vector).sum()
xp = get_array_module(vector)
self._vector_norm = xp.sqrt(total) if total != 0. else 0. self._vector_norm = xp.sqrt(total) if total != 0. else 0.
return self._vector_norm return self._vector_norm
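The `Span.vector` change above guards against dividing by zero when the span contains no tokens. A rough pure-Python sketch of the same idea is shown here. It uses plain lists instead of the array module the real code obtains from `get_array_module`, and `mean_vector` is a hypothetical helper name.

```python
from typing import List, Sequence


def mean_vector(vectors: Sequence[Sequence[float]], width: int) -> List[float]:
    """Average equal-length vectors; fall back to zeros when there are none."""
    if not vectors:
        # An empty span has nothing to average, so return a zero vector
        # of the expected width instead of dividing by zero.
        return [0.0] * width
    return [sum(column) / len(vectors) for column in zip(*vectors)]


print(mean_vector([], 3))
print(mean_vector([[1.0, 3.0], [3.0, 5.0]], 2))
```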

View File

@ -1,6 +1,8 @@
import weakref import weakref
import struct import struct
import srsly import srsly
from spacy.errors import Errors
from .span cimport Span from .span cimport Span
from libc.stdint cimport uint64_t, uint32_t, int32_t from libc.stdint cimport uint64_t, uint32_t, int32_t
@ -58,7 +60,11 @@ cdef class SpanGroup:
DOCS: https://spacy.io/api/spangroup#doc DOCS: https://spacy.io/api/spangroup#doc
""" """
return self._doc_ref() doc = self._doc_ref()
if doc is None:
# referent has been garbage collected
raise RuntimeError(Errors.E865)
return doc
@property @property
def has_overlap(self): def has_overlap(self):
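The `SpanGroup.doc` change relies on the fact that calling a `weakref.ref` returns `None` once its referent has been collected. The sketch below illustrates the pattern with hypothetical stand-in classes (`Owner` for the `Doc`, `Group` for the `SpanGroup`); the error message is invented for illustration and is not the text of `Errors.E865`.

```python
import gc
import weakref


class Owner:
    """Hypothetical stand-in for a Doc."""


class Group:
    """Hypothetical stand-in for a SpanGroup holding a weak reference."""

    def __init__(self, owner: Owner) -> None:
        self._owner_ref = weakref.ref(owner)

    @property
    def owner(self) -> Owner:
        owner = self._owner_ref()
        if owner is None:
            # The referent has been garbage collected.
            raise RuntimeError("the owner of this group no longer exists")
        return owner


owner = Owner()
group = Group(owner)
assert group.owner is owner

del owner      # drop the only strong reference
gc.collect()   # make collection explicit for the example
try:
    group.owner
except RuntimeError as err:
    print(err)
```

Raising here gives callers a clear failure at the point of access rather than a confusing `None` that breaks later.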

View File

@ -867,6 +867,8 @@ cdef class Token:
return parts_of_speech.NAMES[self.c.pos] return parts_of_speech.NAMES[self.c.pos]
def __set__(self, pos_name): def __set__(self, pos_name):
if pos_name not in parts_of_speech.IDS:
raise ValueError(Errors.E1021.format(pp=pos_name))
self.c.pos = parts_of_speech.IDS[pos_name] self.c.pos = parts_of_speech.IDS[pos_name]
property tag_: property tag_:

View File

@ -95,7 +95,8 @@ def init_nlp(config: Config, *, use_gpu: int = -1) -> "Language":
logger.warning(Warnings.W087.format(name=name, listener=listener)) logger.warning(Warnings.W087.format(name=name, listener=listener))
# We always check this regardless, in case user freezes tok2vec # We always check this regardless, in case user freezes tok2vec
if listener not in frozen_components and name in frozen_components: if listener not in frozen_components and name in frozen_components:
logger.warning(Warnings.W086.format(name=name, listener=listener)) if name not in T["annotating_components"]:
logger.warning(Warnings.W086.format(name=name, listener=listener))
return nlp return nlp

View File

@ -177,3 +177,89 @@ def wandb_logger(
return log_step, finalize return log_step, finalize
return setup_logger return setup_logger
@registry.loggers("spacy.WandbLogger.v3")
def wandb_logger(
project_name: str,
remove_config_values: List[str] = [],
model_log_interval: Optional[int] = None,
log_dataset_dir: Optional[str] = None,
entity: Optional[str] = None,
run_name: Optional[str] = None,
):
try:
import wandb
# test that these are available
from wandb import init, log, join # noqa: F401
except ImportError:
raise ImportError(Errors.E880)
console = console_logger(progress_bar=False)
def setup_logger(
nlp: "Language", stdout: IO = sys.stdout, stderr: IO = sys.stderr
) -> Tuple[Callable[[Dict[str, Any]], None], Callable[[], None]]:
config = nlp.config.interpolate()
config_dot = util.dict_to_dot(config)
for field in remove_config_values:
del config_dot[field]
config = util.dot_to_dict(config_dot)
run = wandb.init(
project=project_name, config=config, entity=entity, reinit=True
)
if run_name:
wandb.run.name = run_name
console_log_step, console_finalize = console(nlp, stdout, stderr)
def log_dir_artifact(
path: str,
name: str,
type: str,
metadata: Optional[Dict[str, Any]] = {},
aliases: Optional[List[str]] = [],
):
dataset_artifact = wandb.Artifact(name, type=type, metadata=metadata)
dataset_artifact.add_dir(path, name=name)
wandb.log_artifact(dataset_artifact, aliases=aliases)
if log_dataset_dir:
log_dir_artifact(path=log_dataset_dir, name="dataset", type="dataset")
def log_step(info: Optional[Dict[str, Any]]):
console_log_step(info)
if info is not None:
score = info["score"]
other_scores = info["other_scores"]
losses = info["losses"]
wandb.log({"score": score})
if losses:
wandb.log({f"loss_{k}": v for k, v in losses.items()})
if isinstance(other_scores, dict):
wandb.log(other_scores)
if model_log_interval and info.get("output_path"):
if info["step"] % model_log_interval == 0 and info["step"] != 0:
log_dir_artifact(
path=info["output_path"],
name="pipeline_" + run.id,
type="checkpoint",
metadata=info,
aliases=[
f"epoch {info['epoch']} step {info['step']}",
"latest",
"best"
if info["score"] == max(info["checkpoints"])[0]
else "",
],
)
def finalize() -> None:
console_finalize()
wandb.join()
return log_step, finalize
return setup_logger
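The alias list passed to `log_dir_artifact` in the v3 logger can be isolated as a small function. This sketch assumes `info["checkpoints"]` is a list of `(score, step)` tuples, which is what the `max(...)[0]` expression in the diff suggests; `checkpoint_aliases` is a hypothetical name and nothing here calls `wandb`.

```python
from typing import Any, Dict, List


def checkpoint_aliases(info: Dict[str, Any]) -> List[str]:
    """Build the alias list the same way the logger in this diff does."""
    # max() over (score, step) tuples compares scores first,
    # so element [0] of the result is the best score seen so far.
    best_score = max(info["checkpoints"])[0]
    return [
        f"epoch {info['epoch']} step {info['step']}",
        "latest",
        "best" if info["score"] == best_score else "",
    ]


info = {
    "epoch": 2,
    "step": 400,
    "score": 0.81,
    "checkpoints": [(0.75, 200), (0.81, 400)],
}
print(checkpoint_aliases(info))
```

Note that a checkpoint that is not the best still receives an empty-string alias in this logic, exactly as in the diff.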

View File

@ -41,10 +41,11 @@ def pretrain(
optimizer = P["optimizer"] optimizer = P["optimizer"]
# Load in pretrained weights to resume from # Load in pretrained weights to resume from
if resume_path is not None: if resume_path is not None:
_resume_model(model, resume_path, epoch_resume, silent=silent) epoch_resume = _resume_model(model, resume_path, epoch_resume, silent=silent)
else: else:
# Without '--resume-path' the '--epoch-resume' argument is ignored # Without '--resume-path' the '--epoch-resume' argument is ignored
epoch_resume = 0 epoch_resume = 0
objective = model.attrs["loss"] objective = model.attrs["loss"]
# TODO: move this to logger function? # TODO: move this to logger function?
tracker = ProgressTracker(frequency=10000) tracker = ProgressTracker(frequency=10000)
@ -101,20 +102,25 @@ def ensure_docs(examples_or_docs: Iterable[Union[Doc, Example]]) -> List[Doc]:
def _resume_model( def _resume_model(
model: Model, resume_path: Path, epoch_resume: int, silent: bool = True model: Model, resume_path: Path, epoch_resume: int, silent: bool = True
) -> None: ) -> int:
msg = Printer(no_print=silent) msg = Printer(no_print=silent)
msg.info(f"Resume training tok2vec from: {resume_path}") msg.info(f"Resume training tok2vec from: {resume_path}")
with resume_path.open("rb") as file_: with resume_path.open("rb") as file_:
weights_data = file_.read() weights_data = file_.read()
model.get_ref("tok2vec").from_bytes(weights_data) model.get_ref("tok2vec").from_bytes(weights_data)
# Parse the epoch number from the given weight file
model_name = re.search(r"model\d+\.bin", str(resume_path)) if epoch_resume is None:
if model_name: # Parse the epoch number from the given weight file
# Default weight file name so read epoch_start from it by cutting off 'model' and '.bin' model_name = re.search(r"model\d+\.bin", str(resume_path))
epoch_resume = int(model_name.group(0)[5:][:-4]) + 1 if model_name:
msg.info(f"Resuming from epoch: {epoch_resume}") # Default weight file name so read epoch_start from it by cutting off 'model' and '.bin'
else: epoch_resume = int(model_name.group(0)[5:][:-4]) + 1
msg.info(f"Resuming from epoch: {epoch_resume}") else:
# No epoch given and couldn't infer it
raise ValueError(Errors.E1020)
msg.info(f"Resuming from epoch: {epoch_resume}")
return epoch_resume
def make_update( def make_update(
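The reworked `_resume_model` prefers an explicit epoch, falls back to the number embedded in a default weight file name, and raises if neither is available. A stand-alone sketch of that decision is given below. `infer_epoch_resume` is a hypothetical name, the error text is illustrative rather than the wording of `Errors.E1020`, and a capture group replaces the string slicing used in the diff.

```python
import re
from typing import Optional


def infer_epoch_resume(resume_path: str, epoch_resume: Optional[int] = None) -> int:
    """Return the epoch to resume from, preferring an explicit value."""
    if epoch_resume is not None:
        return epoch_resume
    # Default weight files are named like "model12.bin".
    match = re.search(r"model(\d+)\.bin", resume_path)
    if match is None:
        # No epoch given and none could be inferred from the file name.
        raise ValueError("Can't infer the epoch to resume from; pass it explicitly.")
    # Weights saved after epoch N mean training continues at epoch N + 1.
    return int(match.group(1)) + 1


print(infer_epoch_resume("output/model12.bin"))
print(infer_epoch_resume("output/custom_weights.bin", epoch_resume=5))
```

Returning the value matters because the caller in `pretrain` now assigns the result back to `epoch_resume`.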

View File

@ -20,8 +20,10 @@ import sys
import warnings import warnings
from packaging.specifiers import SpecifierSet, InvalidSpecifier from packaging.specifiers import SpecifierSet, InvalidSpecifier
from packaging.version import Version, InvalidVersion from packaging.version import Version, InvalidVersion
from packaging.requirements import Requirement
import subprocess import subprocess
from contextlib import contextmanager from contextlib import contextmanager
from collections import defaultdict
import tempfile import tempfile
import shutil import shutil
import shlex import shlex
@ -33,11 +35,6 @@ try:
except ImportError: except ImportError:
cupy = None cupy = None
try: # Python 3.8
import importlib.metadata as importlib_metadata
except ImportError:
from catalogue import _importlib_metadata as importlib_metadata
# These are functions that were previously (v2.x) available from spacy.util # These are functions that were previously (v2.x) available from spacy.util
# and have since moved to Thinc. We're importing them here so people's code # and have since moved to Thinc. We're importing them here so people's code
# doesn't break, but they should always be imported from Thinc from now on, # doesn't break, but they should always be imported from Thinc from now on,
@ -46,7 +43,7 @@ from thinc.api import fix_random_seed, compounding, decaying # noqa: F401
from .symbols import ORTH from .symbols import ORTH
from .compat import cupy, CudaStream, is_windows from .compat import cupy, CudaStream, is_windows, importlib_metadata
from .errors import Errors, Warnings, OLD_MODEL_SHORTCUTS from .errors import Errors, Warnings, OLD_MODEL_SHORTCUTS
from . import about from . import about
@ -144,6 +141,32 @@ class registry(thinc.registry):
) from None ) from None
return func return func
@classmethod
def find(cls, registry_name: str, func_name: str) -> Callable:
"""Get info about a registered function from the registry."""
# We're overwriting this classmethod so we're able to provide more
# specific error messages and implement a fallback to spacy-legacy.
if not hasattr(cls, registry_name):
names = ", ".join(cls.get_registry_names()) or "none"
raise RegistryError(Errors.E892.format(name=registry_name, available=names))
reg = getattr(cls, registry_name)
try:
func_info = reg.find(func_name)
except RegistryError:
if func_name.startswith("spacy."):
legacy_name = func_name.replace("spacy.", "spacy-legacy.")
try:
return reg.find(legacy_name)
except catalogue.RegistryError:
pass
available = ", ".join(sorted(reg.get_all().keys())) or "none"
raise RegistryError(
Errors.E893.format(
name=func_name, reg_name=registry_name, available=available
)
) from None
return func_info
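The new `registry.find` retries a failed `spacy.*` lookup under the `spacy-legacy.*` prefix before reporting what is available. The sketch below reproduces only that control flow with a plain dictionary. `REGISTRY` and its entries are hypothetical, and `KeyError` stands in for the `RegistryError` raised by the real code.

```python
from typing import Callable, Dict

# Hypothetical registry contents, for illustration only.
REGISTRY: Dict[str, Callable[[], str]] = {
    "spacy.Current.v2": lambda: "current",
    "spacy-legacy.Old.v1": lambda: "legacy",
}


def find(func_name: str) -> Callable[[], str]:
    """Look up a name, retrying under the 'spacy-legacy.' prefix."""
    if func_name in REGISTRY:
        return REGISTRY[func_name]
    if func_name.startswith("spacy."):
        legacy_name = func_name.replace("spacy.", "spacy-legacy.")
        if legacy_name in REGISTRY:
            return REGISTRY[legacy_name]
    # Nothing matched: list the known names to help the caller.
    available = ", ".join(sorted(REGISTRY)) or "none"
    raise KeyError(f"Can't find '{func_name}'. Available names: {available}")


print(find("spacy.Old.v1")())      # resolved through the legacy fallback
print(find("spacy.Current.v2")())  # resolved directly
```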
@classmethod @classmethod
def has(cls, registry_name: str, func_name: str) -> bool: def has(cls, registry_name: str, func_name: str) -> bool:
"""Check whether a function is available in a registry.""" """Check whether a function is available in a registry."""
@ -640,13 +663,18 @@ def is_unconstrained_version(
return True return True
def get_model_version_range(spacy_version: str) -> str: def split_requirement(requirement: str) -> Tuple[str, str]:
"""Generate a version range like >=1.2.3,<1.3.0 based on a given spaCy """Split a requirement like spacy>=1.2.3 into ("spacy", ">=1.2.3")."""
version. Models are always compatible across patch versions but not req = Requirement(requirement)
across minor or major versions. return (req.name, str(req.specifier))
def get_minor_version_range(version: str) -> str:
"""Generate a version range like >=1.2.3,<1.3.0 based on a given version
(e.g. of spaCy).
""" """
release = Version(spacy_version).release release = Version(version).release
return f">={spacy_version},<{release[0]}.{release[1] + 1}.0" return f">={version},<{release[0]}.{release[1] + 1}.0"
def get_model_lower_version(constraint: str) -> Optional[str]: def get_model_lower_version(constraint: str) -> Optional[str]:
@ -734,7 +762,7 @@ def load_meta(path: Union[str, Path]) -> Dict[str, Any]:
model=f"{meta['lang']}_{meta['name']}", model=f"{meta['lang']}_{meta['name']}",
model_version=meta["version"], model_version=meta["version"],
version=meta["spacy_version"], version=meta["spacy_version"],
example=get_model_version_range(about.__version__), example=get_minor_version_range(about.__version__),
) )
warnings.warn(warn_msg) warnings.warn(warn_msg)
return meta return meta
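To make the output of `get_minor_version_range` concrete, here is a simplified standard-library sketch. It only handles plain numeric versions such as `3.1.3`, whereas the function in the diff parses the input with `packaging.version.Version`; `minor_version_range` is a hypothetical name for this reduced variant.

```python
def minor_version_range(version: str) -> str:
    """Return a range like '>=1.2.3,<1.3.0' for a plain 'X.Y.Z' version."""
    # Only the major and minor parts are needed for the upper bound.
    major, minor = (int(part) for part in version.split(".")[:2])
    return f">={version},<{major}.{minor + 1}.0"


print(minor_version_range("3.1.3"))
```

The range accepts any later patch release of the same minor version and excludes the next minor version.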
@ -1550,3 +1578,19 @@ def to_ternary_int(val) -> int:
return 0 return 0
else: else:
return -1 return -1
# The following implementation of packages_distributions() is adapted from
# importlib_metadata, which is distributed under the Apache 2.0 License.
# Copyright (c) 2017-2019 Jason R. Coombs, Barry Warsaw
# See licenses/3rd_party_licenses.txt
def packages_distributions() -> Dict[str, List[str]]:
"""Return a mapping of top-level packages to their distributions. We're
inlining this helper from the importlib_metadata "backport" here, since
it's not available in the builtin importlib.metadata.
"""
pkg_to_dist = defaultdict(list)
for dist in importlib_metadata.distributions():
for pkg in (dist.read_text("top_level.txt") or "").split():
pkg_to_dist[pkg].append(dist.metadata["Name"])
return dict(pkg_to_dist)
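A short usage sketch for the helper above, written against the standard-library `importlib.metadata` (Python 3.8+) rather than spaCy's `compat` import. The function body repeats the one in the diff; the lookup at the end is illustrative, and its result depends entirely on what is installed in the current environment.

```python
from collections import defaultdict
from importlib import metadata
from typing import Dict, List


def packages_distributions() -> Dict[str, List[str]]:
    """Map top-level import names to the distributions that provide them."""
    pkg_to_dist = defaultdict(list)
    for dist in metadata.distributions():
        # top_level.txt lists the importable top-level names, if present.
        for pkg in (dist.read_text("top_level.txt") or "").split():
            pkg_to_dist[pkg].append(dist.metadata["Name"])
    return dict(pkg_to_dist)


mapping = packages_distributions()
# The import name and the distribution name can differ, which is why
# the mapping is useful when generating requirements.
print(mapping.get("pkg_resources", []))
```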

View File

@ -555,8 +555,8 @@ consists of either two or three subnetworks:
<Accordion title="spacy.TransitionBasedParser.v1 definition" spaced> <Accordion title="spacy.TransitionBasedParser.v1 definition" spaced>
[TransitionBasedParser.v1](/api/legacy#TransitionBasedParser_v1) had the exact same signature, [TransitionBasedParser.v1](/api/legacy#TransitionBasedParser_v1) had the exact
but the `use_upper` argument was `True` by default. same signature, but the `use_upper` argument was `True` by default.
</Accordion> </Accordion>

View File

@ -283,6 +283,10 @@ CLI [`train`](/api/cli#train) command. The built-in
of the `.conllu` format used by the of the `.conllu` format used by the
[Universal Dependencies corpora](https://github.com/UniversalDependencies). [Universal Dependencies corpora](https://github.com/UniversalDependencies).
Note that while this is the format used to save training data, you do not have
to understand the internal details to use it or create training data. See the
section on [preparing training data](/usage/training#training-data).
### JSON training format {#json-input tag="deprecated"} ### JSON training format {#json-input tag="deprecated"}
<Infobox variant="warning" title="Changed in v3.0"> <Infobox variant="warning" title="Changed in v3.0">

View File

@ -25,6 +25,20 @@ current state. The weights are updated such that the scores assigned to the set
of optimal actions is increased, while scores assigned to other actions are of optimal actions is increased, while scores assigned to other actions are
decreased. Note that more than one action may be optimal for a given state. decreased. Note that more than one action may be optimal for a given state.
## Assigned Attributes {#assigned-attributes}
Dependency predictions are assigned to the `Token.dep` and `Token.head` fields.
Besides the dependencies themselves, the parser decides sentence boundaries,
which are saved in `Token.is_sent_start` and accessible via `Doc.sents`.
| Location | Value |
| --------------------- | --------------------------------------------------------------------------------------------------------------------------------------------- |
| `Token.dep` | The type of dependency relation (hash). ~~int~~ |
| `Token.dep_` | The type of dependency relation. ~~str~~ |
| `Token.head` | The syntactic parent, or "governor", of this token. ~~Token~~ |
| `Token.is_sent_start` | A boolean value indicating whether the token starts a sentence. After the parser runs this will be `True` or `False` for all tokens. ~~bool~~ |
| `Doc.sents` | An iterator over sentences in the `Doc`, determined by `Token.is_sent_start` values. ~~Iterator[Span]~~ |
## Config and implementation {#config} ## Config and implementation {#config}
The default config is defined by the pipeline component factory and describes The default config is defined by the pipeline component factory and describes

View File

@ -212,7 +212,7 @@ alignment mode `"strict".
| Name | Description | | Name | Description |
| ------------------------------------ | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | | ------------------------------------ | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `start` | The index of the first character of the span. ~~int~~ | | `start` | The index of the first character of the span. ~~int~~ |
| `end` | The index of the last character after the span. ~int~~ | | `end` | The index of the last character after the span. ~~int~~ |
| `label` | A label to attach to the span, e.g. for named entities. ~~Union[int, str]~~ | | `label` | A label to attach to the span, e.g. for named entities. ~~Union[int, str]~~ |
| `kb_id` <Tag variant="new">2.2</Tag> | An ID from a knowledge base to capture the meaning of a named entity. ~~Union[int, str]~~ | | `kb_id` <Tag variant="new">2.2</Tag> | An ID from a knowledge base to capture the meaning of a named entity. ~~Union[int, str]~~ |
| `vector` | A meaning representation of the span. ~~numpy.ndarray[ndim=1, dtype=float32]~~ | | `vector` | A meaning representation of the span. ~~numpy.ndarray[ndim=1, dtype=float32]~~ |
@ -571,9 +571,9 @@ objects, if the entity recognizer has been applied.
> assert ents[0].text == "Mr. Best" > assert ents[0].text == "Mr. Best"
> ``` > ```
| Name | Description | | Name | Description |
| ----------- | --------------------------------------------------------------------- | | ----------- | ---------------------------------------------------------------- |
| **RETURNS** | Entities in the document, one `Span` per entity. ~~Tuple[Span, ...]~~ | | **RETURNS** | Entities in the document, one `Span` per entity. ~~Tuple[Span]~~ |
## Doc.spans {#spans tag="property"} ## Doc.spans {#spans tag="property"}

View File

@ -16,7 +16,7 @@ document from the `DocBin`. The serialization format is gzipped msgpack, where
the msgpack object has the following structure: the msgpack object has the following structure:
```python ```python
### msgpack object structrue ### msgpack object structure
{ {
"version": str, # DocBin version number "version": str, # DocBin version number
"attrs": List[uint64], # e.g. [TAG, HEAD, ENT_IOB, ENT_TYPE] "attrs": List[uint64], # e.g. [TAG, HEAD, ENT_IOB, ENT_TYPE]

View File

@ -16,6 +16,16 @@ plausible candidates from that `KnowledgeBase` given a certain textual mention,
and a machine learning model to pick the right candidate, given the local and a machine learning model to pick the right candidate, given the local
context of the mention. context of the mention.
## Assigned Attributes {#assigned-attributes}
Predictions, in the form of knowledge base IDs, will be assigned to
`Token.ent_kb_id_`.
| Location | Value |
| ------------------ | --------------------------------- |
| `Token.ent_kb_id` | Knowledge base ID (hash). ~~int~~ |
| `Token.ent_kb_id_` | Knowledge base ID. ~~str~~ |
## Config and implementation {#config} ## Config and implementation {#config}
The default config is defined by the pipeline component factory and describes The default config is defined by the pipeline component factory and describes

View File

@ -20,6 +20,24 @@ your entities will be close to their initial tokens. If your entities are long
and characterized by tokens in their middle, the component will likely not be a and characterized by tokens in their middle, the component will likely not be a
good fit for your task. good fit for your task.
## Assigned Attributes {#assigned-attributes}
Predictions will be saved to `Doc.ents` as a tuple. Each label will also be
reflected in each underlying token, where it is saved in the `Token.ent_type`
and `Token.ent_iob` fields. Note that by definition each token can only have one
label.
When setting `Doc.ents` to create training data, all the spans must be valid and
non-overlapping, or an error will be thrown.
| Location | Value |
| ----------------- | ----------------------------------------------------------------- |
| `Doc.ents` | The annotated spans. ~~Tuple[Span]~~ |
| `Token.ent_iob` | An enum encoding of the IOB part of the named entity tag. ~~int~~ |
| `Token.ent_iob_` | The IOB part of the named entity tag. ~~str~~ |
| `Token.ent_type` | The label part of the named entity tag (hash). ~~int~~ |
| `Token.ent_type_` | The label part of the named entity tag. ~~str~~ |
## Config and implementation {#config} ## Config and implementation {#config}
The default config is defined by the pipeline component factory and describes The default config is defined by the pipeline component factory and describes

View File

@ -15,6 +15,27 @@ used on its own to implement a purely rule-based entity recognition system. For
usage examples, see the docs on usage examples, see the docs on
[rule-based entity recognition](/usage/rule-based-matching#entityruler). [rule-based entity recognition](/usage/rule-based-matching#entityruler).
## Assigned Attributes {#assigned-attributes}
This component assigns predictions in essentially the same way as the
[`EntityRecognizer`](/api/entityrecognizer).
Predictions can be accessed under `Doc.ents` as a tuple. Each label will also be
reflected in each underlying token, where it is saved in the `Token.ent_type`
and `Token.ent_iob` fields. Note that by definition each token can only have one
label.
When setting `Doc.ents` to create training data, all the spans must be valid and
non-overlapping, or an error will be thrown.
| Location | Value |
| ----------------- | ----------------------------------------------------------------- |
| `Doc.ents` | The annotated spans. ~~Tuple[Span]~~ |
| `Token.ent_iob` | An enum encoding of the IOB part of the named entity tag. ~~int~~ |
| `Token.ent_iob_` | The IOB part of the named entity tag. ~~str~~ |
| `Token.ent_type` | The label part of the named entity tag (hash). ~~int~~ |
| `Token.ent_type_` | The label part of the named entity tag. ~~str~~ |
## Config and implementation {#config} ## Config and implementation {#config}
The default config is defined by the pipeline component factory and describes The default config is defined by the pipeline component factory and describes

View File

@ -446,7 +446,7 @@ component, adds it to the pipeline and returns it.
| `after` | Component name or index to insert component directly after. ~~Optional[Union[str, int]]~~ | | `after` | Component name or index to insert component directly after. ~~Optional[Union[str, int]]~~ |
| `first` | Insert component first / not first in the pipeline. ~~Optional[bool]~~ | | `first` | Insert component first / not first in the pipeline. ~~Optional[bool]~~ |
| `last` | Insert component last / not last in the pipeline. ~~Optional[bool]~~ | | `last` | Insert component last / not last in the pipeline. ~~Optional[bool]~~ |
| `config` <Tag variant="new">3</Tag> | Optional config parameters to use for this component. Will be merged with the `default_config` specified by the component factory. ~~Optional[Dict[str, Any]]~~ | | `config` <Tag variant="new">3</Tag> | Optional config parameters to use for this component. Will be merged with the `default_config` specified by the component factory. ~~Dict[str, Any]~~ |
| `source` <Tag variant="new">3</Tag> | Optional source pipeline to copy component from. If a source is provided, the `factory_name` is interpreted as the name of the component in the source pipeline. Make sure that the vocab, vectors and settings of the source pipeline match the target pipeline. ~~Optional[Language]~~ | | `source` <Tag variant="new">3</Tag> | Optional source pipeline to copy component from. If a source is provided, the `factory_name` is interpreted as the name of the component in the source pipeline. Make sure that the vocab, vectors and settings of the source pipeline match the target pipeline. ~~Optional[Language]~~ |
| `validate` <Tag variant="new">3</Tag> | Whether to validate the component config and arguments against the types expected by the factory. Defaults to `True`. ~~bool~~ | | `validate` <Tag variant="new">3</Tag> | Whether to validate the component config and arguments against the types expected by the factory. Defaults to `True`. ~~bool~~ |
| **RETURNS** | The pipeline component. ~~Callable[[Doc], Doc]~~ | | **RETURNS** | The pipeline component. ~~Callable[[Doc], Doc]~~ |
@ -476,7 +476,7 @@ To create a component and add it to the pipeline, you should always use
| `factory_name` | Name of the registered component factory. ~~str~~ | | `factory_name` | Name of the registered component factory. ~~str~~ |
| `name` | Optional unique name of pipeline component instance. If not set, the factory name is used. An error is raised if the name already exists in the pipeline. ~~Optional[str]~~ | | `name` | Optional unique name of pipeline component instance. If not set, the factory name is used. An error is raised if the name already exists in the pipeline. ~~Optional[str]~~ |
| _keyword-only_ | | | _keyword-only_ | |
| `config` <Tag variant="new">3</Tag> | Optional config parameters to use for this component. Will be merged with the `default_config` specified by the component factory. ~~Optional[Dict[str, Any]]~~ | | `config` <Tag variant="new">3</Tag> | Optional config parameters to use for this component. Will be merged with the `default_config` specified by the component factory. ~~Dict[str, Any]~~ |
| `validate` <Tag variant="new">3</Tag> | Whether to validate the component config and arguments against the types expected by the factory. Defaults to `True`. ~~bool~~ | | `validate` <Tag variant="new">3</Tag> | Whether to validate the component config and arguments against the types expected by the factory. Defaults to `True`. ~~bool~~ |
| **RETURNS** | The pipeline component. ~~Callable[[Doc], Doc]~~ | | **RETURNS** | The pipeline component. ~~Callable[[Doc], Doc]~~ |

View File

@ -105,7 +105,8 @@ and residual connections.
### spacy.TransitionBasedParser.v1 {#TransitionBasedParser_v1} ### spacy.TransitionBasedParser.v1 {#TransitionBasedParser_v1}
Identical to [`spacy.TransitionBasedParser.v2`](/api/architectures#TransitionBasedParser) Identical to
[`spacy.TransitionBasedParser.v2`](/api/architectures#TransitionBasedParser)
except that `use_upper` was set to `True` by default. except that `use_upper` was set to `True` by default.
### spacy.TextCatEnsemble.v1 {#TextCatEnsemble_v1} ### spacy.TextCatEnsemble.v1 {#TextCatEnsemble_v1}

View File

@ -31,6 +31,15 @@ available in the pipeline and runs _before_ the lemmatizer.
</Infobox> </Infobox>
## Assigned Attributes {#assigned-attributes}
Lemmas, whether generated by rules or predicted, will be saved to `Token.lemma`.
| Location | Value |
| -------------- | ------------------------- |
| `Token.lemma` | The lemma (hash). ~~int~~ |
| `Token.lemma_` | The lemma. ~~str~~ |
## Config and implementation ## Config and implementation
The default config is defined by the pipeline component factory and describes The default config is defined by the pipeline component factory and describes

View File

@ -15,6 +15,16 @@ coarse-grained POS tags following the Universal Dependencies
[FEATS](https://universaldependencies.org/format.html#morphological-annotation) [FEATS](https://universaldependencies.org/format.html#morphological-annotation)
annotation guidelines. annotation guidelines.
## Assigned Attributes {#assigned-attributes}
Predictions are saved to `Token.morph` and `Token.pos`.
| Location | Value |
| ------------- | ----------------------------------------- |
| `Token.pos` | The UPOS part of speech (hash). ~~int~~ |
| `Token.pos_` | The UPOS part of speech. ~~str~~ |
| `Token.morph` | Morphological features. ~~MorphAnalysis~~ |
## Config and implementation {#config} ## Config and implementation {#config}
The default config is defined by the pipeline component factory and describes The default config is defined by the pipeline component factory and describes

View File

@ -105,11 +105,11 @@ representation.
## Attributes {#attributes} ## Attributes {#attributes}
| Name | Description | | Name | Description |
| ------------- | ---------------------------------------------------------------------------------------------------------------------------- | ---------- | | ------------- | ------------------------------------------------------------------------------------------------------------------------------ |
| `FEATURE_SEP` | The [FEATS](https://universaldependencies.org/format.html#morphological-annotation) feature separator. Default is ` | `. ~~str~~ | | `FEATURE_SEP` | The [FEATS](https://universaldependencies.org/format.html#morphological-annotation) feature separator. Default is `|`. ~~str~~ |
| `FIELD_SEP` | The [FEATS](https://universaldependencies.org/format.html#morphological-annotation) field separator. Default is `=`. ~~str~~ | | `FIELD_SEP` | The [FEATS](https://universaldependencies.org/format.html#morphological-annotation) field separator. Default is `=`. ~~str~~ |
| `VALUE_SEP` | The [FEATS](https://universaldependencies.org/format.html#morphological-annotation) value separator. Default is `,`. ~~str~~ | | `VALUE_SEP` | The [FEATS](https://universaldependencies.org/format.html#morphological-annotation) value separator. Default is `,`. ~~str~~ |
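The three separators are enough to split a FEATS string into fields and values. The following is a minimal sketch that hard-codes the documented defaults; `parse_feats` is a hypothetical helper and does not go through the `Morphology` class.

```python
from typing import Dict, List

# Defaults as documented in the table above.
FEATURE_SEP = "|"
FIELD_SEP = "="
VALUE_SEP = ","


def parse_feats(feats: str) -> Dict[str, List[str]]:
    """Split a FEATS string into a mapping of field name to values."""
    if not feats:
        return {}
    parsed: Dict[str, List[str]] = {}
    for feature in feats.split(FEATURE_SEP):
        # "Case=Nom,Acc" -> field "Case", values ["Nom", "Acc"]
        field, _, values = feature.partition(FIELD_SEP)
        parsed[field] = values.split(VALUE_SEP)
    return parsed


print(parse_feats("Case=Nom,Acc|Number=Sing"))
```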
## MorphAnalysis {#morphanalysis tag="class" source="spacy/tokens/morphanalysis.pyx"} ## MorphAnalysis {#morphanalysis tag="class" source="spacy/tokens/morphanalysis.pyx"}

View File

@ -149,8 +149,8 @@ patterns = [nlp("health care reform"), nlp("healthcare reform")]
</Infobox> </Infobox>
| Name | Description | | Name | Description |
| -------------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------- | --- | | -------------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `match_id` | An ID for the thing you're matching. ~~str~~ | | | `match_id` | An ID for the thing you're matching. ~~str~~ | |
| `docs` | `Doc` objects of the phrases to match. ~~List[Doc]~~ | | `docs` | `Doc` objects of the phrases to match. ~~List[Doc]~~ |
| _keyword-only_ | | | _keyword-only_ | |
| `on_match` | Callback function to act on matches. Takes the arguments `matcher`, `doc`, `i` and `matches`. ~~Optional[Callable[[Matcher, Doc, int, List[tuple], Any]]~~ | | `on_match` | Callback function to act on matches. Takes the arguments `matcher`, `doc`, `i` and `matches`. ~~Optional[Callable[[Matcher, Doc, int, List[tuple], Any]]~~ |

View File

@ -12,6 +12,16 @@ api_trainable: true
A trainable pipeline component for sentence segmentation. For a simpler, A trainable pipeline component for sentence segmentation. For a simpler,
rule-based strategy, see the [`Sentencizer`](/api/sentencizer). rule-based strategy, see the [`Sentencizer`](/api/sentencizer).
## Assigned Attributes {#assigned-attributes}
Predicted values will be assigned to `Token.is_sent_start`. The resulting
sentences can be accessed using `Doc.sents`.
| Location | Value |
| --------------------- | ------------------------------------------------------------------------------------------------------------------------------ |
| `Token.is_sent_start` | A boolean value indicating whether the token starts a sentence. This will be either `True` or `False` for all tokens. ~~bool~~ |
| `Doc.sents` | An iterator over sentences in the `Doc`, determined by `Token.is_sent_start` values. ~~Iterator[Span]~~ |
## Config and implementation {#config} ## Config and implementation {#config}
The default config is defined by the pipeline component factory and describes The default config is defined by the pipeline component factory and describes

View File

@ -13,6 +13,16 @@ performed by the [`DependencyParser`](/api/dependencyparser), so the
`Sentencizer` lets you implement a simpler, rule-based strategy that doesn't `Sentencizer` lets you implement a simpler, rule-based strategy that doesn't
require a statistical model to be loaded. require a statistical model to be loaded.
## Assigned Attributes {#assigned-attributes}
Calculated values will be assigned to `Token.is_sent_start`. The resulting
sentences can be accessed using `Doc.sents`.
| Location | Value |
| --------------------- | ------------------------------------------------------------------------------------------------------------------------------ |
| `Token.is_sent_start` | A boolean value indicating whether the token starts a sentence. This will be either `True` or `False` for all tokens. ~~bool~~ |
| `Doc.sents` | An iterator over sentences in the `Doc`, determined by `Token.is_sent_start` values. ~~Iterator[Span]~~ |
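As a short sketch of how these attributes can be read back, the following assumes a blank English pipeline with only the `sentencizer` added; the example text is made up.

```python
import spacy

# A blank pipeline is enough: the sentencizer needs no trained model.
nlp = spacy.blank("en")
nlp.add_pipe("sentencizer")

doc = nlp("This is a sentence. This is another sentence.")

# Doc.sents yields one Span per sentence.
sentences = [sent.text for sent in doc.sents]
# Token.is_sent_start is True only for the first token of each sentence.
starts = [token.text for token in doc if token.is_sent_start]

print(sentences)
print(starts)
```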
## Config and implementation {#config} ## Config and implementation {#config}
The default config is defined by the pipeline component factory and describes The default config is defined by the pipeline component factory and describes

View File

@ -18,14 +18,15 @@ Create a `Span` object from the slice `doc[start : end]`.
> assert [t.text for t in span] == ["it", "back", "!"] > assert [t.text for t in span] == ["it", "back", "!"]
> ``` > ```
| Name | Description | | Name | Description |
| -------- | --------------------------------------------------------------------------------------- | | ------------- | --------------------------------------------------------------------------------------- |
| `doc` | The parent document. ~~Doc~~ | | `doc` | The parent document. ~~Doc~~ |
| `start` | The index of the first token of the span. ~~int~~ | | `start` | The index of the first token of the span. ~~int~~ |
| `end` | The index of the first token after the span. ~~int~~ | | `end` | The index of the first token after the span. ~~int~~ |
| `label` | A label to attach to the span, e.g. for named entities. ~~Union[str, int]~~ | | `label` | A label to attach to the span, e.g. for named entities. ~~Union[str, int]~~ |
| `kb_id` | A knowledge base ID to attach to the span, e.g. for named entities. ~~Union[str, int]~~ | | `vector` | A meaning representation of the span. ~~numpy.ndarray[ndim=1, dtype=float32]~~ |
| `vector` | A meaning representation of the span. ~~numpy.ndarray[ndim=1, dtype=float32]~~ | | `vector_norm` | The L2 norm of the span's vector representation. ~~float~~ |
| `kb_id` | A knowledge base ID to attach to the span, e.g. for named entities. ~~Union[str, int]~~ |
## Span.\_\_getitem\_\_ {#getitem tag="method"} ## Span.\_\_getitem\_\_ {#getitem tag="method"}
@ -303,6 +304,10 @@ not been implemeted for the given language, a `NotImplementedError` is raised.
Create a new `Doc` object corresponding to the `Span`, with a copy of the data. Create a new `Doc` object corresponding to the `Span`, with a copy of the data.
When calling this on many spans from the same doc, passing in a precomputed
array representation of the doc using the `array_head` and `array` args can save
time.
> #### Example > #### Example
> >
> ```python > ```python
@ -312,10 +317,12 @@ Create a new `Doc` object corresponding to the `Span`, with a copy of the data.
> assert doc2.text == "New York" > assert doc2.text == "New York"
> ``` > ```
| Name | Description | | Name | Description |
| ---------------- | ------------------------------------------------------------- | | ---------------- | -------------------------------------------------------------------------------------------------------------------- |
| `copy_user_data` | Whether or not to copy the original doc's user data. ~~bool~~ | | `copy_user_data` | Whether or not to copy the original doc's user data. ~~bool~~ |
| **RETURNS** | A `Doc` object of the `Span`'s content. ~~Doc~~ | | `array_head` | Precomputed array attributes (headers) of the original doc, as generated by `Doc._get_array_attrs()`. ~~Tuple~~ |
| `array` | Precomputed array version of the original doc as generated by [`Doc.to_array`](/api/doc#to_array). ~~numpy.ndarray~~ |
| **RETURNS** | A `Doc` object of the `Span`'s content. ~~Doc~~ |
## Span.root {#root tag="property" model="parser"} ## Span.root {#root tag="property" model="parser"}

View File

@ -13,6 +13,22 @@ A span categorizer consists of two parts: a [suggester function](#suggesters)
that proposes candidate spans, which may or may not overlap, and a labeler model that proposes candidate spans, which may or may not overlap, and a labeler model
that predicts zero or more labels for each candidate. that predicts zero or more labels for each candidate.
Predicted spans will be saved in a [`SpanGroup`](/api/spangroup) on the doc.
Individual span scores can be found in `spangroup.attrs["scores"]`.
## Assigned Attributes {#assigned-attributes}
Predictions will be saved to `Doc.spans[spans_key]` as a
[`SpanGroup`](/api/spangroup). The scores for the spans in the `SpanGroup` will
be saved in `SpanGroup.attrs["scores"]`.
`spans_key` defaults to `"sc"`, but can be passed as a parameter.
| Location | Value |
| -------------------------------------- | -------------------------------------------------------- |
| `Doc.spans[spans_key]` | The annotated spans. ~~SpanGroup~~ |
| `Doc.spans[spans_key].attrs["scores"]` | The score for each span in the `SpanGroup`. ~~Floats1d~~ |
## Config and implementation {#config} ## Config and implementation {#config}
The default config is defined by the pipeline component factory and describes The default config is defined by the pipeline component factory and describes

View File

@ -46,6 +46,16 @@ Create a `SpanGroup`.
The [`Doc`](/api/doc) object the span group is referring to. The [`Doc`](/api/doc) object the span group is referring to.
<Infobox title="SpanGroup and Doc lifecycle" variant="warning">
When a `Doc` object is garbage collected, any related `SpanGroup` object won't
be functional anymore, as these objects use a `weakref` to refer to the
document. An error will be raised as the internal `doc` object will be `None`.
To avoid this, make sure that the original `Doc` objects are still available in
the scope of your function.
</Infobox>
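The lifecycle issue described in the warning can be reproduced with plain Python weak references. The sketch below uses hypothetical stand-in classes (`FakeDoc`, `FakeSpanGroup`) rather than spaCy objects, and the error type is an assumption chosen for illustration.

```python
import gc
import weakref


class FakeDoc:
    """Hypothetical stand-in for a Doc; not a spaCy class."""

    def __init__(self, text):
        self.text = text


class FakeSpanGroup:
    """Hypothetical stand-in that only holds a weak reference to its doc."""

    def __init__(self, doc):
        self._doc_ref = weakref.ref(doc)

    @property
    def doc(self):
        doc = self._doc_ref()
        if doc is None:
            raise RuntimeError("The parent document has been garbage collected")
        return doc


def make_group():
    # The doc only lives inside this function, so it is collected on return.
    doc = FakeDoc("hello world")
    return FakeSpanGroup(doc)


group = make_group()
gc.collect()
try:
    group.doc
    doc_available = True
except RuntimeError:
    doc_available = False

# Keeping a reference to the doc in scope keeps the group usable.
kept_doc = FakeDoc("still referenced")
kept_group = FakeSpanGroup(kept_doc)
print(doc_available, kept_group.doc.text)
```

In practice this means returning or storing the `Doc` alongside any `SpanGroup` you want to keep using.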
> #### Example > #### Example
> >
> ```python > ```python

View File

@ -8,6 +8,21 @@ api_string_name: tagger
api_trainable: true api_trainable: true
--- ---
A trainable pipeline component to predict part-of-speech tags for any
part-of-speech tag set.
In the pre-trained pipelines, the tag schemas vary by language; see the
[individual model pages](/models) for details.
## Assigned Attributes {#assigned-attributes}
Predictions are assigned to `Token.tag`.
| Location | Value |
| ------------ | ---------------------------------- |
| `Token.tag` | The part of speech (hash). ~~int~~ |
| `Token.tag_` | The part of speech. ~~str~~ |
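The relationship between the hash and string forms can be sketched as follows. The tag is set by hand here on a blank pipeline instead of being predicted by a trained tagger, and the tag value `"NNS"` is just an example.

```python
import spacy

nlp = spacy.blank("en")
doc = nlp("I like cats")

# Set the tag manually; a trained tagger would assign it during processing.
doc[2].tag_ = "NNS"

# Token.tag holds the hash of the string stored in Token.tag_.
tag_hash = doc[2].tag
tag_text = nlp.vocab.strings[tag_hash]
print(tag_hash, tag_text)
```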
## Config and implementation {#config} ## Config and implementation {#config}
The default config is defined by the pipeline component factory and describes The default config is defined by the pipeline component factory and describes

View File

@ -29,6 +29,22 @@ only.
</Infobox> </Infobox>
## Assigned Attributes {#assigned-attributes}
Predictions will be saved to `doc.cats` as a dictionary, where the key is the
name of the category and the value is a score between 0 and 1 (inclusive). For
`textcat` (exclusive categories), the scores will sum to 1, while for
`textcat_multilabel` there is no particular guarantee about their sum.
Note that when assigning values to create training data, the score of each
category must be 0 or 1. Using other values, for example to create a document
that is a little bit in category A and a little bit in category B, is not
supported.
| Location | Value |
| ---------- | ------------------------------------- |
| `Doc.cats` | Category scores. ~~Dict[str, float]~~ |
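The difference between the two score guarantees can be illustrated with plain Python. This is a sketch under the assumption that exclusive scores come from a softmax and multilabel scores from independent sigmoids; the labels and raw scores are made up, and the code does not call spaCy.

```python
import math


def softmax(logits):
    """Normalize raw scores so that they are positive and sum to 1."""
    highest = max(logits)
    exps = [math.exp(x - highest) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sigmoid(x):
    """Squash a single raw score into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


labels = ["SPORTS", "POLITICS", "TECH"]
logits = [2.0, 0.5, -1.0]  # made-up raw scores

# Mutually exclusive categories: the scores sum to 1.
exclusive_cats = dict(zip(labels, softmax(logits)))
# Independent categories: each score is in (0, 1), with no guarantee on the sum.
multilabel_cats = {label: sigmoid(x) for label, x in zip(labels, logits)}

print(sum(exclusive_cats.values()))
print(sum(multilabel_cats.values()))
```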
## Config and implementation {#config} ## Config and implementation {#config}
The default config is defined by the pipeline component factory and describes The default config is defined by the pipeline component factory and describes

View File

@ -463,7 +463,7 @@ start decreasing across epochs.
</Accordion> </Accordion>
#### spacy.WandbLogger.v2 {#WandbLogger tag="registered function"} #### spacy.WandbLogger.v3 {#WandbLogger tag="registered function"}
> #### Installation > #### Installation
> >
@ -495,19 +495,21 @@ remain in the config file stored on your local system.
> >
> ```ini > ```ini
> [training.logger] > [training.logger]
> @loggers = "spacy.WandbLogger.v2" > @loggers = "spacy.WandbLogger.v3"
> project_name = "monitor_spacy_training" > project_name = "monitor_spacy_training"
> remove_config_values = ["paths.train", "paths.dev", "corpora.train.path", "corpora.dev.path"] > remove_config_values = ["paths.train", "paths.dev", "corpora.train.path", "corpora.dev.path"]
> log_dataset_dir = "corpus" > log_dataset_dir = "corpus"
> model_log_interval = 1000 > model_log_interval = 1000
> ``` > ```
| Name | Description | | Name | Description |
| ---------------------- | ------------------------------------------------------------------------------------------------------------------------------------- | | ---------------------- | --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `project_name` | The name of the project in the Weights & Biases interface. The project will be created automatically if it doesn't exist yet. ~~str~~ | | `project_name` | The name of the project in the Weights & Biases interface. The project will be created automatically if it doesn't exist yet. ~~str~~ |
| `remove_config_values` | A list of values to exclude from the config before it is uploaded to W&B (default: empty). ~~List[str]~~ | | `remove_config_values` | A list of values to exclude from the config before it is uploaded to W&B (default: empty). ~~List[str]~~ |
| `model_log_interval` | Steps to wait between logging model checkpoints to the W&B dashboard (default: None). ~~Optional[int]~~ | | `model_log_interval` | Steps to wait between logging model checkpoints to the W&B dashboard (default: None). ~~Optional[int]~~ |
| `log_dataset_dir` | Directory containing dataset to be logged and versioned as W&B artifact (default: None). ~~Optional[str]~~ | | `log_dataset_dir` | Directory containing dataset to be logged and versioned as W&B artifact (default: None). ~~Optional[str]~~ |
| `run_name` | The name of the run. If you don't specify a `run_name`, the name will be created by the `wandb` library (default: None). ~~Optional[str]~~ |
| `entity` | An entity is a username or team name where you're sending runs. If you don't specify an entity, the run will be sent to your default entity, which is usually your username. (default: None). ~~Optional[str]~~ |
<Project id="integrations/wandb"> <Project id="integrations/wandb">

View File

@ -38,12 +38,21 @@ attributes. We also calculate an alignment between the word-piece tokens and the
spaCy tokenization, so that we can use the last hidden states to set the spaCy tokenization, so that we can use the last hidden states to set the
`Doc.tensor` attribute. When multiple word-piece tokens align to the same spaCy `Doc.tensor` attribute. When multiple word-piece tokens align to the same spaCy
token, the spaCy token receives the sum of their values. To access the values, token, the spaCy token receives the sum of their values. To access the values,
you can use the custom [`Doc._.trf_data`](#custom-attributes) attribute. The you can use the custom [`Doc._.trf_data`](#assigned-attributes) attribute. The
package also adds the function registries [`@span_getters`](#span_getters) and package also adds the function registries [`@span_getters`](#span_getters) and
[`@annotation_setters`](#annotation_setters) with several built-in registered [`@annotation_setters`](#annotation_setters) with several built-in registered
functions. For more details, see the functions. For more details, see the
[usage documentation](/usage/embeddings-transformers). [usage documentation](/usage/embeddings-transformers).
## Assigned Attributes {#assigned-attributes}
The component sets the following
[custom extension attribute](/usage/processing-pipeline#custom-components-attributes):
| Location | Value |
| ---------------- | ------------------------------------------------------------------------ |
| `Doc._.trf_data` | Transformer tokens and outputs for the `Doc` object. ~~TransformerData~~ |
## Config and implementation {#config} ## Config and implementation {#config}
The default config is defined by the pipeline component factory and describes The default config is defined by the pipeline component factory and describes
@ -98,7 +107,7 @@ https://github.com/explosion/spacy-transformers/blob/master/spacy_transformers/p
Construct a `Transformer` component. One or more subsequent spaCy components can Construct a `Transformer` component. One or more subsequent spaCy components can
use the transformer outputs as features in its model, with gradients use the transformer outputs as features in its model, with gradients
backpropagated to the single shared weights. The activations from the backpropagated to the single shared weights. The activations from the
transformer are saved in the [`Doc._.trf_data`](#custom-attributes) extension transformer are saved in the [`Doc._.trf_data`](#assigned-attributes) extension
attribute. You can also provide a callback to set additional annotations. In attribute. You can also provide a callback to set additional annotations. In
your application, you would normally use a shortcut for this and instantiate the your application, you would normally use a shortcut for this and instantiate the
component using its string name and [`nlp.add_pipe`](/api/language#create_pipe). component using its string name and [`nlp.add_pipe`](/api/language#create_pipe).
@ -205,7 +214,7 @@ modifying them.
Assign the extracted features to the `Doc` objects. By default, the Assign the extracted features to the `Doc` objects. By default, the
[`TransformerData`](/api/transformer#transformerdata) object is written to the [`TransformerData`](/api/transformer#transformerdata) object is written to the
[`Doc._.trf_data`](#custom-attributes) attribute. Your `set_extra_annotations` [`Doc._.trf_data`](#assigned-attributes) attribute. Your `set_extra_annotations`
callback is then called, if provided. callback is then called, if provided.
> #### Example > #### Example
@ -383,7 +392,7 @@ are wrapped into the
[FullTransformerBatch](/api/transformer#fulltransformerbatch) object. The [FullTransformerBatch](/api/transformer#fulltransformerbatch) object. The
`FullTransformerBatch` then splits out the per-document data, which is handled `FullTransformerBatch` then splits out the per-document data, which is handled
by this class. Instances of this class are typically assigned to the by this class. Instances of this class are typically assigned to the
[`Doc._.trf_data`](/api/transformer#custom-attributes) extension attribute. [`Doc._.trf_data`](/api/transformer#assigned-attributes) extension attribute.
| Name | Description | | Name | Description |
| --------- | --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | | --------- | --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
@ -549,12 +558,3 @@ The following built-in functions are available:
| Name | Description | | Name | Description |
| ---------------------------------------------- | ------------------------------------- | | ---------------------------------------------- | ------------------------------------- |
| `spacy-transformers.null_annotation_setter.v1` | Don't set any additional annotations. | | `spacy-transformers.null_annotation_setter.v1` | Don't set any additional annotations. |
## Custom attributes {#custom-attributes}
The component sets the following
[custom extension attributes](/usage/processing-pipeline#custom-components-attributes):
| Name | Description |
| ---------------- | ------------------------------------------------------------------------ |
| `Doc._.trf_data` | Transformer tokens and outputs for the `Doc` object. ~~TransformerData~~ |

View File

@ -321,7 +321,7 @@ performed in chunks to avoid consuming too much memory. You can set the
> ``` > ```
| Name | Description | | Name | Description |
| -------------- | --------------------------------------------------------------------------- | ----------------------------------------------------------------------------------------------------------------------- | | -------------- | --------------------------------------------------------------------------- |
| `queries` | An array with one or more vectors. ~~numpy.ndarray~~ | | `queries` | An array with one or more vectors. ~~numpy.ndarray~~ |
| _keyword-only_ | | | _keyword-only_ | |
| `batch_size` | The batch size to use. Defaults to `1024`. ~~int~~ | | `batch_size` | The batch size to use. Defaults to `1024`. ~~int~~ |

View File

@ -21,14 +21,14 @@ Create the vocabulary.
> vocab = Vocab(strings=["hello", "world"]) > vocab = Vocab(strings=["hello", "world"])
> ``` > ```
| Name | Description | | Name | Description |
| ------------------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------ | | ------------------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `lex_attr_getters` | A dictionary mapping attribute IDs to functions to compute them. Defaults to `None`. ~~Optional[Dict[str, Callable[[str], Any]]]~~ | | `lex_attr_getters` | A dictionary mapping attribute IDs to functions to compute them. Defaults to `None`. ~~Optional[Dict[str, Callable[[str], Any]]]~~ |
| `strings` | A [`StringStore`](/api/stringstore) that maps strings to hash values, and vice versa, or a list of strings. ~~Union[List[str], StringStore]~~ | | `strings` | A [`StringStore`](/api/stringstore) that maps strings to hash values, and vice versa, or a list of strings. ~~Union[List[str], StringStore]~~ |
| `lookups` | A [`Lookups`](/api/lookups) that stores the `lexeme_norm` and other large lookup tables. Defaults to `None`. ~~Optional[Lookups]~~ | | `lookups` | A [`Lookups`](/api/lookups) that stores the `lexeme_norm` and other large lookup tables. Defaults to `None`. ~~Optional[Lookups]~~ |
| `oov_prob` | The default OOV probability. Defaults to `-20.0`. ~~float~~ | | `oov_prob` | The default OOV probability. Defaults to `-20.0`. ~~float~~ |
| `vectors_name` <Tag variant="new">2.2</Tag> | A name to identify the vectors table. ~~str~~ | | `vectors_name` <Tag variant="new">2.2</Tag> | A name to identify the vectors table. ~~str~~ |
| `writing_system` | A dictionary describing the language's writing system. Typically provided by [`Language.Defaults`](/api/language#defaults). ~~Dict[str, Any]~~ | | `writing_system` | A dictionary describing the language's writing system. Typically provided by [`Language.Defaults`](/api/language#defaults). ~~Dict[str, Any]~~ |
| `get_noun_chunks` | A function that yields base noun phrases used for [`Doc.noun_chunks`](/api/doc#noun_chunks). ~~Optional[Callable[[Union[Doc, Span], Iterator[Span]]]]~~ | | `get_noun_chunks` | A function that yields base noun phrases used for [`Doc.noun_chunks`](/api/doc#noun_chunks). ~~Optional[Callable[[Union[Doc, Span], Iterator[Span]]]]~~ |
## Vocab.\_\_len\_\_ {#len tag="method"} ## Vocab.\_\_len\_\_ {#len tag="method"}

Binary file not shown (size: 200 KiB).

View File

@ -671,7 +671,7 @@ You can then run [`spacy pretrain`](/api/cli#pretrain) with the updated config
and pass in optional config overrides, like the path to the raw text file: and pass in optional config overrides, like the path to the raw text file:
```cli ```cli
$ python -m spacy pretrain config_pretrain.cfg ./output --paths.raw text.jsonl $ python -m spacy pretrain config_pretrain.cfg ./output --paths.raw_text text.jsonl
``` ```
The following defaults are used for the `[pretraining]` block and merged into The following defaults are used for the `[pretraining]` block and merged into

View File

@ -795,7 +795,7 @@ if there's no state to be passed through spaCy can just take care of this fo
you. The following two code examples are equivalent: you. The following two code examples are equivalent:
```python ```python
# Statless component with @Language.factory # Stateless component with @Language.factory
@Language.factory("my_component") @Language.factory("my_component")
def create_my_component(): def create_my_component():
def my_component(doc): def my_component(doc):

View File

@ -291,7 +291,7 @@ files you need and not the whole repo.
| Name | Description | | Name | Description |
| ------------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | | ------------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `dest` | The destination path to save the downloaded asset to (relative to the project directory), including the file name. | | `dest` | The destination path to save the downloaded asset to (relative to the project directory), including the file name. |
| `git` | `repo`: The URL of the repo to download from.<br />`path`: Path of the file or directory to download, relative to the repo root.<br />`branch`: The branch to download from. Defaults to `"master"`. | | `git` | `repo`: The URL of the repo to download from.<br />`path`: Path of the file or directory to download, relative to the repo root. "" specifies the root directory.<br />`branch`: The branch to download from. Defaults to `"master"`. |
| `checksum` | Optional checksum of the file. If provided, it will be used to verify that the file matches and downloads will be skipped if a local file with the same checksum already exists. | | `checksum` | Optional checksum of the file. If provided, it will be used to verify that the file matches and downloads will be skipped if a local file with the same checksum already exists. |
| `description` | Optional asset description, used in [auto-generated docs](#custom-docs). | | `description` | Optional asset description, used in [auto-generated docs](#custom-docs). |
@ -758,16 +758,6 @@ workflows, but only one can be tracked by DVC.
### Prodigy {#prodigy} <IntegrationLogo name="prodigy" width={100} height="auto" align="right" /> ### Prodigy {#prodigy} <IntegrationLogo name="prodigy" width={100} height="auto" align="right" />
<Infobox title="This section is still under construction" emoji="🚧" variant="warning">
The Prodigy integration will require a nightly version of Prodigy that supports
spaCy v3+. You can already use annotations created with Prodigy in spaCy v3 by
exporting your data with
[`data-to-spacy`](https://prodi.gy/docs/recipes#data-to-spacy) and running
[`spacy convert`](/api/cli#convert) to convert it to the binary format.
</Infobox>
[Prodigy](https://prodi.gy) is a modern annotation tool for creating training [Prodigy](https://prodi.gy) is a modern annotation tool for creating training
data for machine learning models, developed by us. It integrates with spaCy data for machine learning models, developed by us. It integrates with spaCy
out-of-the-box and provides many different out-of-the-box and provides many different
@ -776,17 +766,23 @@ with and without a model in the loop. If Prodigy is installed in your project,
you can start the annotation server from your `project.yml` for a tight feedback you can start the annotation server from your `project.yml` for a tight feedback
loop between data development and training. loop between data development and training.
The following example command starts the Prodigy app using the <Infobox variant="warning">
[`ner.correct`](https://prodi.gy/docs/recipes#ner-correct) recipe and streams in
suggestions for the given entity labels produced by a pretrained model. You can This integration requires [Prodigy v1.11](https://prodi.gy/docs/changelog#v1.11)
then correct the suggestions manually in the UI. After you save and exit the or higher. If you're using an older version of Prodigy, you can still use your
server, the full dataset is exported in spaCy's format and split into a training annotations in spaCy v3 by exporting your data with
and evaluation set. [`data-to-spacy`](https://prodi.gy/docs/recipes#data-to-spacy) and running
[`spacy convert`](/api/cli#convert) to convert it to the binary format.
</Infobox>
The following example shows a workflow for merging and exporting NER annotations
collected with Prodigy and training a spaCy pipeline:
> #### Example usage > #### Example usage
> >
> ```cli > ```cli
> $ python -m spacy project run annotate > $ python -m spacy project run all
> ``` > ```
<!-- prettier-ignore --> <!-- prettier-ignore -->
@ -794,36 +790,71 @@ and evaluation set.
### project.yml ### project.yml
vars: vars:
prodigy: prodigy:
dataset: 'ner_articles' train_dataset: "fashion_brands_training"
labels: 'PERSON,ORG,PRODUCT' eval_dataset: "fashion_brands_eval"
model: 'en_core_web_md'
workflows:
all:
- data-to-spacy
- train_spacy
commands: commands:
- name: annotate - name: "data-to-spacy"
- script: help: "Merge your annotations and create data in spaCy's binary format"
- 'python -m prodigy ner.correct ${vars.prodigy.dataset} ${vars.prodigy.model} ./assets/raw_data.jsonl --labels ${vars.prodigy.labels}' script:
- 'python -m prodigy data-to-spacy ./corpus/train.json ./corpus/eval.json --ner ${vars.prodigy.dataset}' - "python -m prodigy data-to-spacy corpus/ --ner ${vars.prodigy.train_dataset},eval:${vars.prodigy.eval_dataset}"
- 'python -m spacy convert ./corpus/train.json ./corpus/train.spacy' outputs:
- 'python -m spacy convert ./corpus/eval.json ./corpus/eval.spacy' - "corpus/train.spacy"
- deps: - "corpus/dev.spacy"
- 'assets/raw_data.jsonl' - name: "train_spacy"
- outputs: help: "Train a named entity recognition model with spaCy"
- 'corpus/train.spacy' script:
- 'corpus/eval.spacy' - "python -m spacy train configs/config.cfg --output training/ --paths.train corpus/train.spacy --paths.dev corpus/dev.spacy"
deps:
- "corpus/train.spacy"
- "corpus/dev.spacy"
outputs:
- "training/model-best"
``` ```
You can use the same approach for other types of projects and annotation > #### Example train curve output
>
> [![Screenshot of train curve terminal output](../images/prodigy_train_curve.jpg)](https://prodi.gy/docs/recipes#train-curve)
The [`train-curve`](https://prodi.gy/docs/recipes#train-curve) recipe is another
cool workflow you can include in your project. It will run the training with
different portions of the data, e.g. 25%, 50%, 75% and 100%. As a rule of thumb,
if accuracy increases in the last segment, this could indicate that collecting
more annotations of the same type might improve the model further.
<!-- prettier-ignore -->
```yaml
### project.yml (excerpt)
- name: "train_curve"
help: "Train the model with Prodigy by using different portions of training examples to evaluate if more annotations can potentially improve the performance"
script:
- "python -m prodigy train-curve --ner ${vars.prodigy.train_dataset},eval:${vars.prodigy.eval_dataset} --config configs/${vars.config} --show-plot"
```
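The rule of thumb for reading a train curve can be sketched in a few lines of Python. The scores below are hypothetical, and `last_segment_gain` is an illustrative helper, not part of Prodigy or spaCy.

```python
def last_segment_gain(scores):
    """Return the score change between the two largest data portions."""
    portions = sorted(scores)
    return scores[portions[-1]] - scores[portions[-2]]


# Hypothetical accuracy at 25%, 50%, 75% and 100% of the training data.
scores = {0.25: 0.61, 0.50: 0.70, 0.75: 0.74, 1.00: 0.78}

gain = last_segment_gain(scores)
# If accuracy still rises in the last segment, more annotations may help.
more_data_may_help = gain > 0
print(round(gain, 2), more_data_may_help)
```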
You can use the same approach for various types of projects and annotation
workflows, including workflows, including
[text classification](https://prodi.gy/docs/recipes#textcat), [named entity recognition](https://prodi.gy/docs/named-entity-recognition),
[dependency parsing](https://prodi.gy/docs/recipes#dep), [span categorization](https://prodi.gy/docs/span-categorization),
[text classification](https://prodi.gy/docs/text-classification),
[dependency parsing](https://prodi.gy/docs/dependencies-relations),
[part-of-speech tagging](https://prodi.gy/docs/recipes#pos) or fully [part-of-speech tagging](https://prodi.gy/docs/recipes#pos) or fully
[custom recipes](https://prodi.gy/docs/custom-recipes) for instance, an A/B [custom recipes](https://prodi.gy/docs/custom-recipes). You can also use spaCy
evaluation workflow that lets you compare two different models and their project templates to quickly start the annotation server to collect more
results. annotations and add them to your Prodigy dataset.
<!-- TODO: <Project id="integrations/prodigy"> <Project id="integrations/prodigy">
</Project> --> Get started with spaCy and Prodigy using our project template. It includes
commands to create a merged training corpus from your Prodigy annotations,
train and package a spaCy pipeline, and analyze whether more annotations may
improve performance.
</Project>
--- ---

View File

@ -429,7 +429,7 @@ matcher.add("HelloWorld", [pattern])
# 🚨 Raises an error: # 🚨 Raises an error:
# MatchPatternError: Invalid token patterns for matcher rule 'HelloWorld' # MatchPatternError: Invalid token patterns for matcher rule 'HelloWorld'
# Pattern 0: # Pattern 0:
# - Additional properties are not allowed ('CASEINSENSITIVE' was unexpected) [2] # - [pattern -> 2 -> CASEINSENSITIVE] extra fields not permitted
``` ```
@ -438,7 +438,8 @@ matcher.add("HelloWorld", [pattern])
To move on to a more realistic example, let's say you're working with a large To move on to a more realistic example, let's say you're working with a large
corpus of blog articles, and you want to match all mentions of "Google I/O" corpus of blog articles, and you want to match all mentions of "Google I/O"
(which spaCy tokenizes as `['Google', 'I', '/', 'O']`). To be safe, you only (which spaCy tokenizes as `['Google', 'I', '/', 'O']`). To be safe, you only
match on the uppercase versions, in case someone has written it as "Google i/o". match on the uppercase versions, avoiding matches with phrases such as "Google
i/o".
```python ```python
### {executable="true"} ### {executable="true"}

View File

@ -6,6 +6,7 @@ menu:
- ['Introduction', 'basics'] - ['Introduction', 'basics']
- ['Quickstart', 'quickstart'] - ['Quickstart', 'quickstart']
- ['Config System', 'config'] - ['Config System', 'config']
- ['Training Data', 'training-data']
- ['Custom Training', 'config-custom'] - ['Custom Training', 'config-custom']
- ['Custom Functions', 'custom-functions'] - ['Custom Functions', 'custom-functions']
- ['Initialization', 'initialization'] - ['Initialization', 'initialization']
@ -355,6 +356,59 @@ that reference this variable.
</Infobox> </Infobox>
## Preparing Training Data {#training-data}
Training data for NLP projects comes in many different formats. For some common
formats such as CoNLL, spaCy provides [converters](/api/cli#convert) you can use
from the command line. In other cases you'll have to prepare the training data
yourself.
When converting training data for use in spaCy, the main thing is to create
[`Doc`](/api/doc) objects just like the results you want as output from the
pipeline. For example, if you're creating an NER pipeline, loading your
annotations and setting them as the `.ents` property on a `Doc` is all you need
to worry about. On disk the annotations will be saved as a
[`DocBin`](/api/docbin) in the
[`.spacy` format](/api/data-formats#binary-training), but the details of that
are handled automatically.
Here's an example of creating a `.spacy` file from some NER annotations.
```python
### preprocess.py
import spacy
from spacy.tokens import DocBin
nlp = spacy.blank("en")
training_data = [
    ("Tokyo Tower is 333m tall.", [(0, 11, "BUILDING")]),
]
# the DocBin will store the example documents
db = DocBin()
for text, annotations in training_data:
    doc = nlp(text)
    ents = []
    for start, end, label in annotations:
        span = doc.char_span(start, end, label=label)
        ents.append(span)
    doc.ents = ents
    db.add(doc)
db.to_disk("./train.spacy")
```
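One practical detail the example above does not cover: `Doc.char_span` returns `None` when the character offsets do not line up with token boundaries, and a `None` in the entity list cannot be assigned to `doc.ents`. The following is a minimal sketch of one way to guard against that. The second training example and the `skipped` list are hypothetical additions for illustration, and skipping is only one possible policy.

```python
import spacy
from spacy.tokens import DocBin

nlp = spacy.blank("en")
training_data = [
    ("Tokyo Tower is 333m tall.", [(0, 11, "BUILDING")]),
    # Hypothetical bad annotation: the span ends in the middle of "Tower"
    ("Tokyo Tower is 333m tall.", [(0, 9, "BUILDING")]),
]

db = DocBin()
skipped = []  # misaligned annotations, kept for later inspection
for text, annotations in training_data:
    doc = nlp(text)
    ents = []
    for start, end, label in annotations:
        span = doc.char_span(start, end, label=label)
        if span is None:
            # Offsets do not match token boundaries, so record and skip
            skipped.append((text, start, end, label))
            continue
        ents.append(span)
    doc.ents = ents
    db.add(doc)

print(f"{len(db)} docs stored, {len(skipped)} misaligned span(s) skipped")
```

Whether to skip, correct, or widen a misaligned span depends on your data. `Doc.char_span` also accepts an `alignment_mode` argument (for example `"expand"`) if snapping to the surrounding tokens is acceptable.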
For more examples of how to convert training data from a wide variety of formats
for use with spaCy, look at the preprocessing steps in the
[tutorial projects](https://github.com/explosion/projects/tree/v3/tutorials).
<Accordion title="What about the spaCy JSON format?" id="json-annotations" spaced>
In spaCy v2, the recommended way to store training data was in
[a particular JSON format](/api/data-formats#json-input), but in v3 this format
is deprecated. It's fine as a readable storage format, but there's no need to
convert your data to JSON before creating a `.spacy` file.
</Accordion>
## Customizing the pipeline and training {#config-custom} ## Customizing the pipeline and training {#config-custom}
### Defining pipeline components {#config-components} ### Defining pipeline components {#config-components}
@ -426,7 +480,10 @@ as-is. They are also excluded when calling
> still impact your model's performance; for instance, a sentence boundary > still impact your model's performance; for instance, a sentence boundary
> detector can impact what the parser or entity recognizer considers a valid > detector can impact what the parser or entity recognizer considers a valid
> parse. So the evaluation results should always reflect what your pipeline will > parse. So the evaluation results should always reflect what your pipeline will
> produce at runtime. > produce at runtime. If you want a frozen component to run (without updating)
> during training as well, so that downstream components can use its
> **predictions**, you can add it to the list of
> [`annotating_components`](/usage/training#annotating-components).
```ini ```ini
[nlp] [nlp]
@ -513,6 +570,10 @@ frozen_components = ["ner"]
annotating_components = ["sentencizer", "ner"] annotating_components = ["sentencizer", "ner"]
``` ```
Similarly, a pretrained `tok2vec` layer can be frozen and specified in the list
of `annotating_components` to ensure that a downstream component can use the
embedding layer without updating it.
<Infobox variant="warning" title="Training speed with annotating components" id="annotating-components-speed"> <Infobox variant="warning" title="Training speed with annotating components" id="annotating-components-speed">
Be aware that non-frozen annotating components with statistical models will Be aware that non-frozen annotating components with statistical models will
@ -645,14 +706,14 @@ excluded from the logs and the score won't be weighted.
<Accordion title="Understanding the training output and score types" spaced id="score-types"> <Accordion title="Understanding the training output and score types" spaced id="score-types">
| Name | Description | | Name | Description |
| -------------------------- | ----------------------------------------------------------------------------------------------------------------------- | | ----------------- | ----------------------------------------------------------------------------------------------------------------------- |
| **Loss** | The training loss representing the amount of work left for the optimizer. Should decrease, but usually not to `0`. | | **Loss** | The training loss representing the amount of work left for the optimizer. Should decrease, but usually not to `0`. |
| **Precision** (P) | Percentage of predicted annotations that were correct. Should increase. | | **Precision** (P) | Percentage of predicted annotations that were correct. Should increase. |
| **Recall** (R) | Percentage of reference annotations recovered. Should increase. | | **Recall** (R) | Percentage of reference annotations recovered. Should increase. |
| **F-Score** (F) | Harmonic mean of precision and recall. Should increase. | | **F-Score** (F) | Harmonic mean of precision and recall. Should increase. |
| **UAS** / **LAS** | Unlabeled and labeled attachment score for the dependency parser, i.e. the percentage of correct arcs. Should increase. | | **UAS** / **LAS** | Unlabeled and labeled attachment score for the dependency parser, i.e. the percentage of correct arcs. Should increase. |
| **Speed** | Prediction speed in words per second (WPS). Should stay stable. | | **Speed** | Prediction speed in words per second (WPS). Should stay stable. |
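To make the precision, recall and F-score rows concrete, here is a small worked sketch in plain Python. It is not spaCy's scorer, only the arithmetic behind the three numbers. The `prf` helper and the example spans are made up for illustration, and annotations are compared as exact `(start, end, label)` tuples.

```python
def prf(predicted, reference):
    """Return (precision, recall, f_score) as fractions between 0 and 1."""
    predicted, reference = set(predicted), set(reference)
    true_positives = len(predicted & reference)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(reference) if reference else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    # Harmonic mean of precision and recall
    f_score = 2 * precision * recall / (precision + recall)
    return precision, recall, f_score


# Hypothetical gold-standard and predicted entity spans
reference = {(0, 11, "BUILDING"), (15, 19, "QUANTITY"), (30, 35, "GPE")}
predicted = {(0, 11, "BUILDING"), (15, 19, "CARDINAL")}

p, r, f = prf(predicted, reference)
print(f"P={p:.2f} R={r:.2f} F={f:.2f}")  # P=0.50 R=0.33 F=0.40
```

One of two predictions is correct (precision 0.5) and one of three reference annotations is recovered (recall about 0.33). Multiply by 100 to read the values as percentages.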
Note that if the development data has raw text, some of the gold-standard Note that if the development data has raw text, some of the gold-standard
entities might not align to the predicted tokenization. These tokenization entities might not align to the predicted tokenization. These tokenization

View File

@ -328,6 +328,15 @@ position.
} }
``` ```
```python
### ENT input with knowledge base links
{
"text": "But Google is starting from behind.",
"ents": [{"start": 4, "end": 10, "label": "ORG", "kb_id": "Q95", "kb_url": "https://www.wikidata.org/entity/Q95"}],
"title": None
}
```
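If your entity links live in another system, a dictionary like the one above can be assembled programmatically. The sketch below uses only the standard library. The `make_ent_input` helper and its tuple layout are assumptions for illustration, not part of spaCy.

```python
def make_ent_input(text, entities, title=None):
    """Build a manual displaCy ENT input dict.

    entities: iterable of (start, end, label, kb_id, kb_url) tuples,
    where start and end are character offsets into text.
    """
    ents = []
    for start, end, label, kb_id, kb_url in sorted(entities):
        if not 0 <= start < end <= len(text):
            raise ValueError(f"Invalid offsets ({start}, {end}) for text")
        ents.append(
            {
                "start": start,
                "end": end,
                "label": label,
                "kb_id": kb_id,
                "kb_url": kb_url,
            }
        )
    return {"text": text, "ents": ents, "title": title}


example = make_ent_input(
    "But Google is starting from behind.",
    [(4, 10, "ORG", "Q95", "https://www.wikidata.org/entity/Q95")],
)
print(example)
```

The resulting dictionary can then be passed to `displacy.render(example, style="ent", manual=True)`.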
## Using displaCy in a web application {#webapp} ## Using displaCy in a web application {#webapp}
If you want to use the visualizers as part of a web application, for example to If you want to use the visualizers as part of a web application, for example to

View File

@ -516,12 +516,12 @@
"title": "NeuroNER", "title": "NeuroNER",
"slogan": "Named-entity recognition using neural networks", "slogan": "Named-entity recognition using neural networks",
"github": "Franck-Dernoncourt/NeuroNER", "github": "Franck-Dernoncourt/NeuroNER",
"category": ["ner"],
"pip": "pyneuroner[cpu]", "pip": "pyneuroner[cpu]",
"code_example": [ "code_example": [
"from neuroner import neuromodel", "from neuroner import neuromodel",
"nn = neuromodel.NeuroNER(train_model=False, use_pretrained_model=True)" "nn = neuromodel.NeuroNER(train_model=False, use_pretrained_model=True)"
], ],
"category": ["ner"],
"tags": ["standalone"] "tags": ["standalone"]
}, },
{ {
@ -642,6 +642,32 @@
"website": "https://ines.io" "website": "https://ines.io"
} }
}, },
{
"id": "spacyopentapioca",
"title": "spaCyOpenTapioca",
"slogan": "Named entity linking on Wikidata in spaCy via OpenTapioca",
"description": "A spaCy wrapper of OpenTapioca for named entity linking on Wikidata",
"github": "UB-Mannheim/spacyopentapioca",
"pip": "spacyopentapioca",
"code_example": [
"import spacy",
"nlp = spacy.blank('en')",
"nlp.add_pipe('opentapioca')",
"doc = nlp('Christian Drosten works in Germany.')",
"for span in doc.ents:",
" print((span.text, span.kb_id_, span.label_, span._.description, span._.score))",
"# ('Christian Drosten', 'Q1079331', 'PERSON', 'German virologist and university teacher', 3.6533377082098895)",
"# ('Germany', 'Q183', 'LOC', 'sovereign state in Central Europe', 2.1099332471902863)",
"## Check also span._.types, span._.aliases, span._.rank"
],
"category": ["models", "pipeline"],
"tags": ["NER", "NEL"],
"author": "Renat Shigapov",
"author_links": {
"twitter": "_shigapov",
"github": "shigapov"
}
},
{ {
"id": "spacy_hunspell", "id": "spacy_hunspell",
"slogan": "Add spellchecking and spelling suggestions to your spaCy pipeline using Hunspell", "slogan": "Add spellchecking and spelling suggestions to your spaCy pipeline using Hunspell",
@ -939,6 +965,29 @@
"category": ["pipeline"], "category": ["pipeline"],
"tags": ["pipeline", "danish"] "tags": ["pipeline", "danish"]
}, },
{
"id": "textdescriptives",
"title": "TextDescriptives",
"slogan": "Extraction of descriptive stats, readability, and syntactic complexity measures",
"description": "Pipeline component for spaCy v.3 that calculates descriptive statistics, readability metrics, and syntactic complexity (dependency distance).",
"github": "HLasse/TextDescriptives",
"pip": "textdescriptives",
"code_example": [
"import spacy",
"import textdescriptives as td",
"nlp = spacy.load('en_core_web_sm')",
"nlp.add_pipe('textdescriptives')",
"doc = nlp('This is a short test text')",
"doc._.readability # access some of the values",
"td.extract_df(doc) # extract all metrics to DataFrame"
],
"author": "Lasse Hansen, Kenneth Enevoldsen, Ludvig Olsen",
"author_links": {
"github": "HLasse"
},
"category": ["pipeline"],
"tags": ["pipeline", "readability", "syntactic complexity", "descriptive statistics"]
},
{ {
"id": "wmd-relax", "id": "wmd-relax",
"slogan": "Calculates word mover's distance insanely fast", "slogan": "Calculates word mover's distance insanely fast",
@ -1086,6 +1135,26 @@
}, },
"category": ["visualizers"] "category": ["visualizers"]
}, },
{
"id": "deplacy",
"slogan": "CUI-based Tree Visualizer for Universal Dependencies and Immediate Catena Analysis",
"description": "Simple dependency visualizer for [spaCy](https://spacy.io/), [UniDic2UD](https://pypi.org/project/unidic2ud), [Stanza](https://stanfordnlp.github.io/stanza/), [NLP-Cube](https://github.com/Adobe/NLP-Cube), [Trankit](https://github.com/nlp-uoregon/trankit), etc.",
"github": "KoichiYasuoka/deplacy",
"image": "https://i.imgur.com/6uOI4Op.png",
"code_example": [
"import spacy",
"import deplacy",
"",
"nlp=spacy.load('en_core_web_sm')",
"doc=nlp('I saw a horse yesterday which had no name.')",
"deplacy.render(doc)"
],
"author": "Koichi Yasuoka",
"author_links": {
"github": "KoichiYasuoka"
},
"category": ["visualizers"]
},
{ {
"id": "scattertext", "id": "scattertext",
"slogan": "Beautiful visualizations of how language differs among document types", "slogan": "Beautiful visualizations of how language differs among document types",
@ -1614,6 +1683,38 @@
"author": "Bhargav Srinivasa-Desikan", "author": "Bhargav Srinivasa-Desikan",
"category": ["books"] "category": ["books"]
}, },
{
"type": "education",
"id": "mastering-spacy",
"title": "Mastering spaCy",
"slogan": "Packt, 2021",
"description": "This is your ultimate spaCy book. Master the crucial skills to use spaCy components effectively to create real-world NLP applications with spaCy. Explaining linguistic concepts such as dependency parsing, POS-tagging and named entity extraction with many examples, this book will help you to conquer computational linguistics with spaCy. The book further focuses on ML topics with Keras and Tensorflow. You'll cover popular topics, including intent recognition, sentiment analysis and context resolution; and use them on popular datasets and interpret the results. A special hands-on section on chatbot design is included.",
"github": "PacktPublishing/Mastering-spaCy",
"cover": "https://tinyimg.io/i/aWEm0dh.jpeg",
"url": "https://www.amazon.com/Mastering-spaCy-end-end-implementing/dp/1800563353",
"author": "Duygu Altinok",
"author_links": {
"github": "DuyguA",
"website": "https://www.linkedin.com/in/duygu-altinok-4021389a"
},
"category": ["books"]
},
{
"type": "education",
"id": "applied-nlp-in-enterprise",
"title": "Applied Natural Language Processing in the Enterprise: Teaching Machines to Read, Write, and Understand",
"slogan": "O'Reilly, 2021",
"description": "Natural language processing (NLP) is one of the hottest topics in AI today. Having lagged behind other deep learning fields such as computer vision for years, NLP only recently gained mainstream popularity. Even though Google, Facebook, and OpenAI have open sourced large pretrained language models to make NLP easier, many organizations today still struggle with developing and productionizing NLP applications. This hands-on guide helps you learn the field quickly.",
"github": "nlpbook/nlpbook",
"cover": "https://i.imgur.com/6RxLBvf.jpg",
"url": "https://www.amazon.com/dp/149206257X",
"author": "Ankur A. Patel",
"author_links": {
"github": "aapatel09",
"website": "https://www.ankurapatel.io"
},
"category": ["books"]
},
{ {
"type": "education", "type": "education",
"id": "learning-path-spacy", "id": "learning-path-spacy",
@ -1625,6 +1726,16 @@
"author": "Aaron Kramer", "author": "Aaron Kramer",
"category": ["courses"] "category": ["courses"]
}, },
{
"type": "education",
"id": "introduction-into-spacy-3",
"title": "Introduction to spaCy 3",
"slogan": "A free course for beginners by Dr. W.J.B. Mattingly",
"url": "http://spacy.pythonhumanities.com/",
"thumb": "https://spacy.pythonhumanities.com/_static/freecodecamp_small.jpg",
"author": "Dr. W.J.B. Mattingly",
"category": ["courses"]
},
{ {
"type": "education", "type": "education",
"id": "spacy-course", "id": "spacy-course",
@ -2025,11 +2136,9 @@
"github": "nikitakit/self-attentive-parser", "github": "nikitakit/self-attentive-parser",
"pip": "benepar", "pip": "benepar",
"code_example": [ "code_example": [
"import spacy", "import benepar, spacy",
"from benepar.spacy_plugin import BeneparComponent", "nlp = spacy.load('en_core_web_md')",
"", "nlp.add_pipe('benepar', config={'model': 'benepar_en3'})",
"nlp = spacy.load('en')",
"nlp.add_pipe(BeneparComponent('benepar_en'))",
"doc = nlp('The time for action is now. It is never too late to do something.')", "doc = nlp('The time for action is now. It is never too late to do something.')",
"sent = list(doc.sents)[0]", "sent = list(doc.sents)[0]",
"print(sent._.parse_string)", "print(sent._.parse_string)",
@ -2493,6 +2602,75 @@
"website": "https://explosion.ai" "website": "https://explosion.ai"
} }
}, },
{
"id": "spacy-huggingface-hub",
"title": "spacy-huggingface-hub",
"slogan": "Push your spaCy pipelines to the Hugging Face Hub",
"description": "This package provides a CLI command for uploading any trained spaCy pipeline packaged with [`spacy package`](https://spacy.io/api/cli#package) to the [Hugging Face Hub](https://huggingface.co). It auto-generates all meta information for you, uploads a pretty README (requires spaCy v3.1+) and handles version control under the hood.",
"github": "explosion/spacy-huggingface-hub",
"thumb": "https://i.imgur.com/j6FO9O6.jpg",
"url": "https://github.com/explosion/spacy-huggingface-hub",
"pip": "spacy-huggingface-hub",
"category": ["pipeline", "models"],
"author": "Explosion",
"author_links": {
"twitter": "explosion_ai",
"github": "explosion",
"website": "https://explosion.ai"
}
},
{
"id": "spacy-clausie",
"title": "spacy-clausie",
"slogan": "Implementation of the ClausIE information extraction system for Python+spaCy",
"github": "mmxgn/spacy-clausie",
"url": "https://github.com/mmxgn/spacy-clausie",
"description": "ClausIE, a novel, clause-based approach to open information extraction, which extracts relations and their arguments from natural language text",
"category": ["pipeline", "scientific", "research"],
"code_example": [
"import spacy",
"import claucy",
"",
"nlp = spacy.load(\"en_core_web_sm\")",
"claucy.add_to_pipe(nlp)",
"",
"doc = nlp(\"AE died in Princeton in 1955.\")",
"",
"print(doc._.clauses)",
"# Output:",
"# <SV, AE, died, None, None, None, [in Princeton, in 1955]>",
"",
"propositions = doc._.clauses[0].to_propositions(as_text=True)",
"",
"print(propositions)",
"# Output:",
"# [AE died in Princeton in 1955, AE died in 1955, AE died in Princeton]"
],
"author": "Emmanouil Theofanis Chourdakis",
"author_links": {
"github": "mmxgn"
}
},
{
"id": "ipymarkup",
"slogan": "NER, syntax markup visualizations",
"description": "Collection of NLP visualizations for NER and syntax tree markup. Similar to [displaCy](https://explosion.ai/demos/displacy) and [displaCy ENT](https://explosion.ai/demos/displacy-ent).",
"github": "natasha/ipymarkup",
"image": "https://github.com/natasha/ipymarkup/blob/master/table.png?raw=true",
"pip": "ipymarkup",
"code_example": [
"from ipymarkup import show_span_ascii_markup, show_dep_ascii_markup",
"",
"text = 'В мероприятии примут участие не только российские учёные, но и зарубежные исследователи, в том числе, Крис Хелмбрехт - управляющий директор и совладелец креативного агентства Kollektiv (Германия, США), Ннека Угбома - руководитель проекта Mushroom works (Великобритания), Гергей Ковач - политик и лидер субкультурной партии «Dog with two tails» (Венгрия), Георг Жено - немецкий режиссёр, один из создателей экспериментального театра «Театр.doc», Театра им. Йозефа Бойса (Германия).'",
"spans = [(102, 116, 'PER'), (186, 194, 'LOC'), (196, 199, 'LOC'), (202, 214, 'PER'), (254, 268, 'LOC'), (271, 283, 'PER'), (324, 342, 'ORG'), (345, 352, 'LOC'), (355, 365, 'PER'), (445, 455, 'ORG'), (456, 468, 'PER'), (470, 478, 'LOC')]",
"show_span_ascii_markup(text, spans)"
],
"author": "Alexander Kukushkin",
"author_links": {
"github": "kuk"
},
"category": ["visualizers"]
},
{ {
"id": "negspacy", "id": "negspacy",
"title": "negspaCy", "title": "negspaCy",
@ -3175,33 +3353,61 @@
"github": "babylonhealth/hmrb", "github": "babylonhealth/hmrb",
"pip": "hmrb", "pip": "hmrb",
"code_example": [ "code_example": [
"import spacy # __version__ 3.0+", "import spacy",
"from hmrb.core import SpacyCore", "from hmrb.core import SpacyCore",
"", "",
"nlp = spacy.load(\"en_core_web_sm\")",
"sentences = \"I love gorillas. Peter loves gorillas. Jane loves Tarzan.\"",
"",
"def conj_be(subj: str) -> str:",
" if subj == \"I\":",
" return \"am\"",
" elif subj == \"you\":",
" return \"are\"",
" else:",
" return \"is\"",
"",
"@spacy.registry.callbacks(\"gorilla_callback\")",
"def gorilla_clb(seq: list, span: slice, data: dict) -> None:",
" subj = seq[span.start].text",
" be = conj_be(subj)",
" print(f\"{subj} {be} a gorilla person.\")",
"@spacy.registry.callbacks(\"lover_callback\")",
"def lover_clb(seq: list, span: slice, data: dict) -> None:",
" print(f\"{seq[span][-1].text} is a love interest of {seq[span.start].text}.\")",
"",
"grammar = \"\"\"", "grammar = \"\"\"",
"Var is_hurting:", " Law:",
"(", " - callback: \"loves_gorilla\"",
" optional (lemma: \"be\")", " (",
" (lemma: \"hurt\")", " ((pos: \"PROPN\") or (pos: \"PRON\"))",
")", " (lemma: \"love\")",
"Law:", " (lemma: \"gorilla\")",
" - package: \"headache\"", " )",
" - callback: \"mark_headache\"", " Law:",
"(", " - callback: \"loves_someone\"",
" (lemma: \"head\", pos: \"NOUN\")", " (",
" $is_hurting", " (pos: \"PROPN\")",
")\"\"\"", " (lower: \"loves\")",
" (pos: \"PROPN\")",
" )",
"\"\"\"",
"",
"@spacy.registry.augmenters(\"jsonify_span\")",
"def jsonify_span(span):",
" return [{\"lemma\": token.lemma_, \"pos\": token.pos_, \"lower\": token.lower_} for token in span]",
"", "",
"conf = {", "conf = {",
" \"rules\": grammar", " \"rules\": grammar,",
" \"callbacks\": {", " \"callbacks\": {",
" \"mark_headache\": \"callbacks.headache_handler\",", " \"loves_gorilla\": \"callbacks.gorilla_callback\",",
" },", " \"loves_someone\": \"callbacks.lover_callback\",",
" },",
" \"map_doc\": \"augmenters.jsonify_span\",", " \"map_doc\": \"augmenters.jsonify_span\",",
" \"sort_length\": True,", " \"sort_length\": True,",
"}", "}",
"nlp = spacy.load(\"en_core_web_sm\")", "",
"nlp.add_pipe(\"hammurabi\", config=conf)", "nlp.add_pipe(\"hmrb\", config=conf)",
"nlp(sentences)" "nlp(sentences)"
], ],
"code_language": "python", "code_language": "python",
@ -3222,15 +3428,17 @@
"slogan": "Forte is a toolkit for building Natural Language Processing pipelines, featuring cross-task interaction, adaptable data-model interfaces and composable pipelines.", "slogan": "Forte is a toolkit for building Natural Language Processing pipelines, featuring cross-task interaction, adaptable data-model interfaces and composable pipelines.",
"description": "Forte provides a platform to assemble state-of-the-art NLP and ML technologies in a highly-composable fashion, including a wide spectrum of tasks ranging from Information Retrieval, Natural Language Understanding to Natural Language Generation.", "description": "Forte provides a platform to assemble state-of-the-art NLP and ML technologies in a highly-composable fashion, including a wide spectrum of tasks ranging from Information Retrieval, Natural Language Understanding to Natural Language Generation.",
"github": "asyml/forte", "github": "asyml/forte",
"pip": "forte.spacy torch", "pip": "forte.spacy stave torch",
"code_example": [ "code_example": [
"from forte.spacy import SpacyProcessor", "from fortex.spacy import SpacyProcessor",
"from forte.processors.stave import StaveProcessor",
"from forte import Pipeline", "from forte import Pipeline",
"from forte.data.readers import StringReader", "from forte.data.readers import StringReader",
"", "",
"pipeline = Pipeline()", "pipeline = Pipeline()",
"pipeline.set_reader(StringReader())", "pipeline.set_reader(StringReader())",
"pipeline.add(SpacyProcessor())", "pipeline.add(SpacyProcessor())",
"pipeline.add(StaveProcessor())",
"pipeline.run('Running SpaCy with Forte!')" "pipeline.run('Running SpaCy with Forte!')"
], ],
"code_language": "python", "code_language": "python",
@ -3245,6 +3453,29 @@
}, },
"category": ["pipeline", "standalone"], "category": ["pipeline", "standalone"],
"tags": ["pipeline"] "tags": ["pipeline"]
},
{
"id": "spacy-api-docker-v3",
"slogan": "spaCy v3 REST API, wrapped in a Docker container",
"github": "bbieniek/spacy-api-docker",
"url": "https://hub.docker.com/r/bbieniek/spacyapi/",
"thumb": "https://i.imgur.com/NRnDKyj.jpg",
"code_example": [
"version: '3'",
"",
"services:",
" spacyapi:",
" image: bbieniek/spacyapi:en_v3",
" ports:",
" - \"127.0.0.1:8080:80\"",
" restart: always"
],
"code_language": "docker",
"author": "Baltazar Bieniek",
"author_links": {
"github": "bbieniek"
},
"category": ["apis"]
} }
], ],