mirror of https://github.com/explosion/spaCy.git
synced 2025-07-10 16:22:29 +03:00

Merge branch 'develop' into nightly.spacy.io
commit c2709a32c9
.github/contributors/Stannislav.md (vendored, new file, 106 lines added)
@@ -0,0 +1,106 @@
# spaCy contributor agreement

This spaCy Contributor Agreement (**"SCA"**) is based on the
[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf).
The SCA applies to any contribution that you make to any product or project
managed by us (the **"project"**), and sets out the intellectual property rights
you grant to us in the contributed materials. The term **"us"** shall mean
[ExplosionAI GmbH](https://explosion.ai/legal). The term
**"you"** shall mean the person or entity identified below.

If you agree to be bound by these terms, fill in the information requested
below and include the filled-in version with your first pull request, under the
folder [`.github/contributors/`](/.github/contributors/). The name of the file
should be your GitHub username, with the extension `.md`. For example, the user
example_user would create the file `.github/contributors/example_user.md`.

Read this agreement carefully before signing. These terms and conditions
constitute a binding legal agreement.

## Contributor Agreement

1. The term "contribution" or "contributed materials" means any source code,
object code, patch, tool, sample, graphic, specification, manual,
documentation, or any other material posted or submitted by you to the project.

2. With respect to any worldwide copyrights, or copyright applications and
registrations, in your contribution:

    * you hereby assign to us joint ownership, and to the extent that such
    assignment is or becomes invalid, ineffective or unenforceable, you hereby
    grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
    royalty-free, unrestricted license to exercise all rights under those
    copyrights. This includes, at our option, the right to sublicense these same
    rights to third parties through multiple levels of sublicensees or other
    licensing arrangements;

    * you agree that each of us can do all things in relation to your
    contribution as if each of us were the sole owners, and if one of us makes
    a derivative work of your contribution, the one who makes the derivative
    work (or has it made) will be the sole owner of that derivative work;

    * you agree that you will not assert any moral rights in your contribution
    against us, our licensees or transferees;

    * you agree that we may register a copyright in your contribution and
    exercise all ownership rights associated with it; and

    * you agree that neither of us has any duty to consult with, obtain the
    consent of, pay or render an accounting to the other for any use or
    distribution of your contribution.

3. With respect to any patents you own, or that you can license without payment
to any third party, you hereby grant to us a perpetual, irrevocable,
non-exclusive, worldwide, no-charge, royalty-free license to:

    * make, have made, use, sell, offer to sell, import, and otherwise transfer
    your contribution in whole or in part, alone or in combination with or
    included in any product, work or materials arising out of the project to
    which your contribution was submitted, and

    * at our option, to sublicense these same rights to third parties through
    multiple levels of sublicensees or other licensing arrangements.

4. Except as set out above, you keep all right, title, and interest in your
contribution. The rights that you grant to us under these terms are effective
on the date you first submitted a contribution to us, even if your submission
took place before the date you sign these terms.

5. You covenant, represent, warrant and agree that:

    * Each contribution that you submit is and shall be an original work of
    authorship and you can legally grant the rights set out in this SCA;

    * to the best of your knowledge, each contribution will not violate any
    third party's copyrights, trademarks, patents, or other intellectual
    property rights; and

    * each contribution shall be in compliance with U.S. export control laws and
    other applicable export and import laws. You agree to notify us if you
    become aware of any circumstance which would make any of the foregoing
    representations inaccurate in any respect. We may publicly disclose your
    participation in the project, including the fact that you have signed the SCA.

6. This SCA is governed by the laws of the State of California and applicable
U.S. Federal law. Any choice of law rules will not apply.

7. Please place an “x” on one of the applicable statements below. Please do NOT
mark both statements:

    * [x] I am signing on behalf of myself as an individual and no other person
    or entity, including my employer, has or will have rights with respect to my
    contributions.

    * [ ] I am signing on behalf of my employer or a legal entity and I have the
    actual authority to contractually bind that entity.

## Contributor Details

| Field                          | Entry                |
| ------------------------------ | -------------------- |
| Name                           | Stanislav Schmidt    |
| Company name (if applicable)   | Blue Brain Project   |
| Title or role (if applicable)  | ML Engineer          |
| Date                           | 2020-10-02           |
| GitHub username                | Stannislav           |
| Website (optional)             |                      |
.github/contributors/rasyidf.md (vendored, new file, 106 lines added)
@@ -0,0 +1,106 @@
# spaCy contributor agreement

This spaCy Contributor Agreement (**"SCA"**) is based on the
[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf).
The SCA applies to any contribution that you make to any product or project
managed by us (the **"project"**), and sets out the intellectual property rights
you grant to us in the contributed materials. The term **"us"** shall mean
[ExplosionAI UG (haftungsbeschränkt)](https://explosion.ai/legal). The term
**"you"** shall mean the person or entity identified below.

If you agree to be bound by these terms, fill in the information requested
below and include the filled-in version with your first pull request, under the
folder [`.github/contributors/`](/.github/contributors/). The name of the file
should be your GitHub username, with the extension `.md`. For example, the user
example_user would create the file `.github/contributors/example_user.md`.

Read this agreement carefully before signing. These terms and conditions
constitute a binding legal agreement.

## Contributor Agreement

1. The term "contribution" or "contributed materials" means any source code,
object code, patch, tool, sample, graphic, specification, manual,
documentation, or any other material posted or submitted by you to the project.

2. With respect to any worldwide copyrights, or copyright applications and
registrations, in your contribution:

    * you hereby assign to us joint ownership, and to the extent that such
    assignment is or becomes invalid, ineffective or unenforceable, you hereby
    grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
    royalty-free, unrestricted license to exercise all rights under those
    copyrights. This includes, at our option, the right to sublicense these same
    rights to third parties through multiple levels of sublicensees or other
    licensing arrangements;

    * you agree that each of us can do all things in relation to your
    contribution as if each of us were the sole owners, and if one of us makes
    a derivative work of your contribution, the one who makes the derivative
    work (or has it made) will be the sole owner of that derivative work;

    * you agree that you will not assert any moral rights in your contribution
    against us, our licensees or transferees;

    * you agree that we may register a copyright in your contribution and
    exercise all ownership rights associated with it; and

    * you agree that neither of us has any duty to consult with, obtain the
    consent of, pay or render an accounting to the other for any use or
    distribution of your contribution.

3. With respect to any patents you own, or that you can license without payment
to any third party, you hereby grant to us a perpetual, irrevocable,
non-exclusive, worldwide, no-charge, royalty-free license to:

    * make, have made, use, sell, offer to sell, import, and otherwise transfer
    your contribution in whole or in part, alone or in combination with or
    included in any product, work or materials arising out of the project to
    which your contribution was submitted, and

    * at our option, to sublicense these same rights to third parties through
    multiple levels of sublicensees or other licensing arrangements.

4. Except as set out above, you keep all right, title, and interest in your
contribution. The rights that you grant to us under these terms are effective
on the date you first submitted a contribution to us, even if your submission
took place before the date you sign these terms.

5. You covenant, represent, warrant and agree that:

    * Each contribution that you submit is and shall be an original work of
    authorship and you can legally grant the rights set out in this SCA;

    * to the best of your knowledge, each contribution will not violate any
    third party's copyrights, trademarks, patents, or other intellectual
    property rights; and

    * each contribution shall be in compliance with U.S. export control laws and
    other applicable export and import laws. You agree to notify us if you
    become aware of any circumstance which would make any of the foregoing
    representations inaccurate in any respect. We may publicly disclose your
    participation in the project, including the fact that you have signed the SCA.

6. This SCA is governed by the laws of the State of California and applicable
U.S. Federal law. Any choice of law rules will not apply.

7. Please place an “x” on one of the applicable statements below. Please do NOT
mark both statements:

    * [x] I am signing on behalf of myself as an individual and no other person
    or entity, including my employer, has or will have rights with respect to my
    contributions.

    * [ ] I am signing on behalf of my employer or a legal entity and I have the
    actual authority to contractually bind that entity.

## Contributor Details

| Field                          | Entry                    |
| ------------------------------ | ------------------------ |
| Name                           | Muhammad Fahmi Rasyid    |
| Company name (if applicable)   |                          |
| Title or role (if applicable)  |                          |
| Date                           | 2020-09-23               |
| GitHub username                | rasyidf                  |
| Website (optional)             | http://rasyidf.github.io |
@@ -1,6 +1,6 @@
 # fmt: off
 __title__ = "spacy-nightly"
-__version__ = "3.0.0a29"
+__version__ = "3.0.0a32"
 __release__ = True
 __download_url__ = "https://github.com/explosion/spacy-models/releases/download"
 __compatibility__ = "https://raw.githubusercontent.com/explosion/spacy-models/master/compatibility.json"
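This hunk bumps the `spacy-nightly` prerelease version from 3.0.0a29 to 3.0.0a32. For reference, the installed prerelease can be checked from the package itself; a minimal sketch using the standard version attribute (not part of this diff):

```python
import spacy

# spacy.__version__ mirrors the __version__ string set in the package metadata,
# e.g. "3.0.0a32" once the matching spacy-nightly wheel is installed
print(spacy.__version__)
```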
@@ -322,8 +322,7 @@ def git_checkout(
     if dest.exists():
         msg.fail("Destination of checkout must not exist", exits=1)
     if not dest.parent.exists():
-        raise IOError("Parent of destination of checkout must exist")
-
+        msg.fail("Parent of destination of checkout must exist", exits=1)
     if sparse and git_version >= (2, 22):
         return git_sparse_checkout(repo, subpath, dest, branch)
     elif sparse:
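With this change a missing parent directory is reported through wasabi's CLI messaging and a clean exit, rather than surfacing as an uncaught `IOError` traceback. A minimal standalone sketch of the `msg.fail` behavior the CLI relies on, assuming the `wasabi` package is installed:

```python
from wasabi import msg

# Prints a formatted error message; exits=1 additionally calls sys.exit(1),
# so the command terminates with a non-zero status instead of raising.
msg.fail("Parent of destination of checkout must exist", exits=1)
```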
@@ -171,7 +171,7 @@ def debug_data(
         n_missing_vectors = sum(gold_train_data["words_missing_vectors"].values())
         msg.warn(
             "{} words in training data without vectors ({:0.2f}%)".format(
-                n_missing_vectors, n_missing_vectors / gold_train_data["n_words"],
+                n_missing_vectors, n_missing_vectors / gold_train_data["n_words"]
             ),
         )
         msg.text(
@@ -3,6 +3,7 @@ from pathlib import Path
 from wasabi import msg
 import typer
 import logging
+import sys

 from ._util import app, Arg, Opt, parse_config_overrides, show_validation_error
 from ._util import import_code, setup_gpu

@@ -39,7 +40,12 @@ def train_cli(
     DOCS: https://nightly.spacy.io/api/cli#train
     """
     util.logger.setLevel(logging.DEBUG if verbose else logging.INFO)
-    verify_cli_args(config_path, output_path)
+    # Make sure all files and paths exists if they are needed
+    if not config_path or not config_path.exists():
+        msg.fail("Config file not found", config_path, exits=1)
+    if output_path is not None and not output_path.exists():
+        output_path.mkdir()
+        msg.good(f"Created output directory: {output_path}")
     overrides = parse_config_overrides(ctx.args)
     import_code(code_path)
     setup_gpu(use_gpu)

@@ -50,14 +56,4 @@ def train_cli(
     nlp = init_nlp(config, use_gpu=use_gpu)
     msg.good("Initialized pipeline")
     msg.divider("Training pipeline")
-    train(nlp, output_path, use_gpu=use_gpu, silent=False)
-
-
-def verify_cli_args(config_path: Path, output_path: Optional[Path] = None) -> None:
-    # Make sure all files and paths exists if they are needed
-    if not config_path or not config_path.exists():
-        msg.fail("Config file not found", config_path, exits=1)
-    if output_path is not None:
-        if not output_path.exists():
-            output_path.mkdir()
-            msg.good(f"Created output directory: {output_path}")
+    train(nlp, output_path, use_gpu=use_gpu, stdout=sys.stdout, stderr=sys.stderr)
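These hunks fold the removed `verify_cli_args` helper directly into `train_cli`: fail fast if the config file is missing, create the output directory on demand, and pass explicit `stdout`/`stderr` streams to `train`. A minimal sketch of that validation pattern on its own, with hypothetical paths and the same `pathlib`/`wasabi` calls the CLI uses:

```python
from pathlib import Path
from wasabi import msg

config_path = Path("config.cfg")  # hypothetical example paths
output_path = Path("training")

if not config_path.exists():
    # Print a formatted error and exit with a non-zero status
    msg.fail("Config file not found", config_path, exits=1)
if not output_path.exists():
    output_path.mkdir()
    msg.good(f"Created output directory: {output_path}")
```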
spacy/errors.py (327 lines changed)
@@ -16,8 +16,6 @@ def add_codes(err_cls):

 @add_codes
 class Warnings:
-    W004 = ("No text fixing enabled. Run `pip install ftfy` to enable fixing "
-            "using ftfy.fix_text if necessary.")
     W005 = ("Doc object not parsed. This means displaCy won't be able to "
             "generate a dependency visualization for it. Make sure the Doc "
             "was processed with a model that supports dependency parsing, and "
@@ -51,8 +49,6 @@ class Warnings:
     W017 = ("Alias '{alias}' already exists in the Knowledge Base.")
     W018 = ("Entity '{entity}' already exists in the Knowledge Base - "
             "ignoring the duplicate entry.")
-    W020 = ("Unnamed vectors. This won't allow multiple vectors models to be "
-            "loaded. (Shape: {shape})")
     W021 = ("Unexpected hash collision in PhraseMatcher. Matches may be "
             "incorrect. Modify PhraseMatcher._terminal_hash to fix.")
     W024 = ("Entity '{entity}' - Alias '{alias}' combination already exists in "
@@ -65,7 +61,7 @@ class Warnings:
             "be more efficient to split your training data into multiple "
             "smaller JSON files instead.")
     W028 = ("Doc.from_array was called with a vector of type '{type}', "
-            "but is expecting one of type 'uint64' instead. This may result "
+            "but is expecting one of type uint64 instead. This may result "
             "in problems with the vocab further on in the pipeline.")
     W030 = ("Some entities could not be aligned in the text \"{text}\" with "
             "entities \"{entities}\". Use "
@@ -79,13 +75,17 @@ class Warnings:
             "If this is surprising, make sure you have the spacy-lookups-data "
             "package installed. The languages with lexeme normalization tables "
             "are currently: {langs}")
-    W034 = ("Please install the package spacy-lookups-data in order to include "
-            "the default lexeme normalization table for the language '{lang}'.")
     W035 = ('Discarding subpattern "{pattern}" due to an unrecognized '
             "attribute or operator.")

     # TODO: fix numbering after merging develop into master
-    W089 = ("The nlp.begin_training method has been renamed to nlp.initialize.")
+    W088 = ("The pipeline component {name} implements a `begin_training` "
+            "method, which won't be called by spaCy. As of v3.0, `begin_training` "
+            "has been renamed to `initialize`, so you likely want to rename the "
+            "component method. See the documentation for details: "
+            "https://nightly.spacy.io/api/language#initialize")
+    W089 = ("As of spaCy v3.0, the `nlp.begin_training` method has been renamed "
+            "to `nlp.initialize`.")
     W090 = ("Could not locate any {format} files in path '{path}'.")
     W091 = ("Could not clean/remove the temp directory at {dir}: {msg}.")
     W092 = ("Ignoring annotations for sentence starts, as dependency heads are set.")
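W088 and W089 both point at the v3 rename of `begin_training` to `initialize`, for pipeline components and for the `nlp` object respectively. In user code the rename looks roughly like this (a sketch with a blank pipeline; `nlp.initialize` returns the optimizer the way `begin_training` used to):

```python
import spacy

nlp = spacy.blank("en")

# spaCy v2 code called nlp.begin_training(); in v3 the method is initialize(),
# which returns the optimizer just as begin_training() did.
optimizer = nlp.initialize()
```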
@@ -103,39 +103,33 @@ class Warnings:
             "download a newer compatible model or retrain your custom model "
             "with the current spaCy version. For more details and available "
             "updates, run: python -m spacy validate")
-    W096 = ("The method 'disable_pipes' has become deprecated - use 'select_pipes' "
-            "instead.")
-    W097 = ("No Model config was provided to create the '{name}' component, "
-            "and no default configuration could be found either.")
-    W098 = ("No Model config was provided to create the '{name}' component, "
-            "so a default configuration was used.")
-    W099 = ("Expected 'dict' type for the 'model' argument of pipe '{pipe}', "
-            "but got '{type}' instead, so ignoring it.")
+    W096 = ("The method `nlp.disable_pipes` is now deprecated - use "
+            "`nlp.select_pipes` instead.")
     W100 = ("Skipping unsupported morphological feature(s): '{feature}'. "
             "Provide features as a dict {{\"Field1\": \"Value1,Value2\"}} or "
             "string \"Field1=Value1,Value2|Field2=Value3\".")
-    W101 = ("Skipping `Doc` custom extension '{name}' while merging docs.")
+    W101 = ("Skipping Doc custom extension '{name}' while merging docs.")
     W102 = ("Skipping unsupported user data '{key}: {value}' while merging docs.")
     W103 = ("Unknown {lang} word segmenter '{segmenter}'. Supported "
             "word segmenters: {supported}. Defaulting to {default}.")
     W104 = ("Skipping modifications for '{target}' segmenter. The current "
             "segmenter is '{current}'.")
-    W105 = ("As of spaCy v3.0, the {matcher}.pipe method is deprecated. If you "
-            "need to match on a stream of documents, you can use nlp.pipe and "
+    W105 = ("As of spaCy v3.0, the `{matcher}.pipe` method is deprecated. If you "
+            "need to match on a stream of documents, you can use `nlp.pipe` and "
             "call the {matcher} on each Doc object.")
-    W107 = ("The property Doc.{prop} is deprecated. Use "
-            "Doc.has_annotation(\"{attr}\") instead.")
+    W107 = ("The property `Doc.{prop}` is deprecated. Use "
+            "`Doc.has_annotation(\"{attr}\")` instead.")


 @add_codes
 class Errors:
     E001 = ("No component '{name}' found in pipeline. Available names: {opts}")
     E002 = ("Can't find factory for '{name}' for language {lang} ({lang_code}). "
-            "This usually happens when spaCy calls nlp.{method} with custom "
+            "This usually happens when spaCy calls `nlp.{method}` with custom "
             "component name that's not registered on the current language class. "
             "If you're using a custom component, make sure you've added the "
-            "decorator @Language.component (for function components) or "
-            "@Language.factory (for class components).\n\nAvailable "
+            "decorator `@Language.component` (for function components) or "
+            "`@Language.factory` (for class components).\n\nAvailable "
             "factories: {opts}")
     E003 = ("Not a valid pipeline component. Expected callable, but "
             "got {component} (name: '{name}'). If you're using a custom "
@@ -153,14 +147,13 @@ class Errors:
     E008 = ("Can't restore disabled pipeline component '{name}' because it "
             "doesn't exist in the pipeline anymore. If you want to remove "
             "components from the pipeline, you should do it before calling "
-            "`nlp.select_pipes()` or after restoring the disabled components.")
+            "`nlp.select_pipes` or after restoring the disabled components.")
     E010 = ("Word vectors set to length 0. This may be because you don't have "
             "a model installed or loaded, or because your model doesn't "
             "include word vectors. For more info, see the docs:\n"
             "https://nightly.spacy.io/usage/models")
     E011 = ("Unknown operator: '{op}'. Options: {opts}")
     E012 = ("Cannot add pattern for zero tokens to matcher.\nKey: {key}")
-    E014 = ("Unknown tag ID: {tag}")
     E016 = ("MultitaskObjective target should be function or one of: dep, "
             "tag, ent, dep_tag_offset, ent_tag.")
     E017 = ("Can only add unicode or bytes. Got type: {value_type}")
@@ -176,27 +169,24 @@ class Errors:
             "For example, are all labels added to the model? If you're "
             "training a named entity recognizer, also make sure that none of "
             "your annotated entity spans have leading or trailing whitespace "
-            "or punctuation. "
-            "You can also use the experimental `debug data` command to "
+            "or punctuation. You can also use the `debug data` command to "
             "validate your JSON-formatted training data. For details, run:\n"
             "python -m spacy debug data --help")
     E025 = ("String is too long: {length} characters. Max is 2**30.")
     E026 = ("Error accessing token at position {i}: out of bounds in Doc of "
             "length {length}.")
-    E027 = ("Arguments 'words' and 'spaces' should be sequences of the same "
-            "length, or 'spaces' should be left default at None. spaces "
+    E027 = ("Arguments `words` and `spaces` should be sequences of the same "
+            "length, or `spaces` should be left default at None. `spaces` "
             "should be a sequence of booleans, with True meaning that the "
             "word owns a ' ' character following it.")
-    E028 = ("orths_and_spaces expects either a list of unicode string or a "
-            "list of (unicode, bool) tuples. Got bytes instance: {value}")
-    E029 = ("noun_chunks requires the dependency parse, which requires a "
+    E028 = ("`words` expects a list of unicode strings, but got bytes instance: {value}")
+    E029 = ("`noun_chunks` requires the dependency parse, which requires a "
             "statistical model to be installed and loaded. For more info, see "
             "the documentation:\nhttps://nightly.spacy.io/usage/models")
     E030 = ("Sentence boundaries unset. You can add the 'sentencizer' "
-            "component to the pipeline with: "
-            "nlp.add_pipe('sentencizer'). "
-            "Alternatively, add the dependency parser, or set sentence "
-            "boundaries by setting doc[i].is_sent_start.")
+            "component to the pipeline with: `nlp.add_pipe('sentencizer')`. "
+            "Alternatively, add the dependency parser or sentence recognizer, "
+            "or set sentence boundaries by setting `doc[i].is_sent_start`.")
     E031 = ("Invalid token: empty string ('') at position {i}.")
     E033 = ("Cannot load into non-empty Doc of length {length}.")
     E035 = ("Error creating span with start {start} and end {end} for Doc of "
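The reworded E030 points users at the v3 string-name `add_pipe` API for the sentencizer. A minimal sketch of the suggested fix:

```python
import spacy

nlp = spacy.blank("en")
# In v3, built-in components are added by their registered string name
nlp.add_pipe("sentencizer")

doc = nlp("This is a sentence. This is another one.")
print([sent.text for sent in doc.sents])
```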
@@ -210,7 +200,7 @@ class Errors:
             "issue here: http://github.com/explosion/spaCy/issues")
     E040 = ("Attempt to access token at {i}, max length {max_length}.")
     E041 = ("Invalid comparison operator: {op}. Likely a Cython bug?")
-    E042 = ("Error accessing doc[{i}].nbor({j}), for doc of length {length}.")
+    E042 = ("Error accessing `doc[{i}].nbor({j})`, for doc of length {length}.")
     E043 = ("Refusing to write to token.sent_start if its document is parsed, "
             "because this may cause inconsistent state.")
     E044 = ("Invalid value for token.sent_start: {value}. Must be one of: "
@@ -230,7 +220,7 @@ class Errors:
     E056 = ("Invalid tokenizer exception: ORTH values combined don't match "
             "original string.\nKey: {key}\nOrths: {orths}")
     E057 = ("Stepped slices not supported in Span objects. Try: "
-            "list(tokens)[start:stop:step] instead.")
+            "`list(tokens)[start:stop:step]` instead.")
     E058 = ("Could not retrieve vector for key {key}.")
     E059 = ("One (and only one) keyword arg must be set. Got: {kwargs}")
     E060 = ("Cannot add new key to vectors: the table is full. Current shape: "
@@ -239,7 +229,7 @@ class Errors:
             "and 63 are occupied. You can replace one by specifying the "
             "`flag_id` explicitly, e.g. "
             "`nlp.vocab.add_flag(your_func, flag_id=IS_ALPHA`.")
-    E063 = ("Invalid value for flag_id: {value}. Flag IDs must be between 1 "
+    E063 = ("Invalid value for `flag_id`: {value}. Flag IDs must be between 1 "
             "and 63 (inclusive).")
     E064 = ("Error fetching a Lexeme from the Vocab. When looking up a "
             "string, the lexeme returned had an orth ID that did not match "
@@ -268,7 +258,7 @@ class Errors:
     E085 = ("Can't create lexeme for string '{string}'.")
     E087 = ("Unknown displaCy style: {style}.")
     E088 = ("Text of length {length} exceeds maximum of {max_length}. The "
-            "v2.x parser and NER models require roughly 1GB of temporary "
+            "parser and NER models require roughly 1GB of temporary "
            "memory per 100,000 characters in the input. This means long "
            "texts may cause memory allocation errors. If you're not using "
            "the parser or NER, it's probably safe to increase the "
@@ -285,8 +275,8 @@ class Errors:
     E094 = ("Error reading line {line_num} in vectors file {loc}.")
     E095 = ("Can't write to frozen dictionary. This is likely an internal "
             "error. Are you writing to a default function argument?")
-    E096 = ("Invalid object passed to displaCy: Can only visualize Doc or "
-            "Span objects, or dicts if set to manual=True.")
+    E096 = ("Invalid object passed to displaCy: Can only visualize `Doc` or "
+            "Span objects, or dicts if set to `manual=True`.")
     E097 = ("Invalid pattern: expected token pattern (list of dicts) or "
             "phrase pattern (string) but got:\n{pattern}")
     E098 = ("Invalid pattern: expected both RIGHT_ID and RIGHT_ATTRS.")
@@ -303,11 +293,11 @@ class Errors:
     E103 = ("Trying to set conflicting doc.ents: '{span1}' and '{span2}'. A "
             "token can only be part of one entity, so make sure the entities "
             "you're setting don't overlap.")
-    E106 = ("Can't find doc._.{attr} attribute specified in the underscore "
+    E106 = ("Can't find `doc._.{attr}` attribute specified in the underscore "
             "settings: {opts}")
-    E107 = ("Value of doc._.{attr} is not JSON-serializable: {value}")
+    E107 = ("Value of `doc._.{attr}` is not JSON-serializable: {value}")
     E109 = ("Component '{name}' could not be run. Did you forget to "
-            "call initialize()?")
+            "call `initialize()`?")
     E110 = ("Invalid displaCy render wrapper. Expected callable, got: {obj}")
     E111 = ("Pickling a token is not supported, because tokens are only views "
             "of the parent Doc and can't exist on their own. A pickled token "
@@ -324,8 +314,8 @@ class Errors:
     E117 = ("The newly split tokens must match the text of the original token. "
             "New orths: {new}. Old text: {old}.")
     E118 = ("The custom extension attribute '{attr}' is not registered on the "
-            "Token object so it can't be set during retokenization. To "
-            "register an attribute, use the Token.set_extension classmethod.")
+            "`Token` object so it can't be set during retokenization. To "
+            "register an attribute, use the `Token.set_extension` classmethod.")
     E119 = ("Can't set custom extension attribute '{attr}' during "
             "retokenization because it's not writable. This usually means it "
             "was registered with a getter function (and no setter) or as a "
@@ -349,7 +339,7 @@ class Errors:
     E130 = ("You are running a narrow unicode build, which is incompatible "
             "with spacy >= 2.1.0. To fix this, reinstall Python and use a wide "
             "unicode build instead. You can also rebuild Python and set the "
-            "--enable-unicode=ucs4 flag.")
+            "`--enable-unicode=ucs4 flag`.")
     E131 = ("Cannot write the kb_id of an existing Span object because a Span "
             "is a read-only view of the underlying Token objects stored in "
             "the Doc. Instead, create a new Span object and specify the "
@@ -362,27 +352,20 @@ class Errors:
     E133 = ("The sum of prior probabilities for alias '{alias}' should not "
             "exceed 1, but found {sum}.")
     E134 = ("Entity '{entity}' is not defined in the Knowledge Base.")
-    E137 = ("Expected 'dict' type, but got '{type}' from '{line}'. Make sure "
-            "to provide a valid JSON object as input with either the `text` "
-            "or `tokens` key. For more info, see the docs:\n"
-            "https://nightly.spacy.io/api/cli#pretrain-jsonl")
-    E138 = ("Invalid JSONL format for raw text '{text}'. Make sure the input "
-            "includes either the `text` or `tokens` key. For more info, see "
-            "the docs:\nhttps://nightly.spacy.io/api/cli#pretrain-jsonl")
-    E139 = ("Knowledge Base for component '{name}' is empty. Use the methods "
-            "kb.add_entity and kb.add_alias to add entries.")
+    E139 = ("Knowledge base for component '{name}' is empty. Use the methods "
+            "`kb.add_entity` and `kb.add_alias` to add entries.")
     E140 = ("The list of entities, prior probabilities and entity vectors "
             "should be of equal length.")
     E141 = ("Entity vectors should be of length {required} instead of the "
             "provided {found}.")
     E143 = ("Labels for component '{name}' not initialized. This can be fixed "
             "by calling add_label, or by providing a representative batch of "
-            "examples to the component's initialize method.")
+            "examples to the component's `initialize` method.")
     E145 = ("Error reading `{param}` from input file.")
-    E146 = ("Could not access `{path}`.")
+    E146 = ("Could not access {path}.")
     E147 = ("Unexpected error in the {method} functionality of the "
             "EntityLinker: {msg}. This is likely a bug in spaCy, so feel free "
-            "to open an issue.")
+            "to open an issue: https://github.com/explosion/spaCy/issues")
     E148 = ("Expected {ents} KB identifiers but got {ids}. Make sure that "
             "each entity in `doc.ents` is assigned to a KB identifier.")
     E149 = ("Error deserializing model. Check that the config used to create "
@@ -390,18 +373,18 @@ class Errors:
     E150 = ("The language of the `nlp` object and the `vocab` should be the "
             "same, but found '{nlp}' and '{vocab}' respectively.")
     E152 = ("The attribute {attr} is not supported for token patterns. "
-            "Please use the option validate=True with Matcher, PhraseMatcher, "
+            "Please use the option `validate=True` with the Matcher, PhraseMatcher, "
             "or EntityRuler for more details.")
     E153 = ("The value type {vtype} is not supported for token patterns. "
             "Please use the option validate=True with Matcher, PhraseMatcher, "
             "or EntityRuler for more details.")
     E154 = ("One of the attributes or values is not supported for token "
-            "patterns. Please use the option validate=True with Matcher, "
+            "patterns. Please use the option `validate=True` with the Matcher, "
             "PhraseMatcher, or EntityRuler for more details.")
     E155 = ("The pipeline needs to include a {pipe} in order to use "
             "Matcher or PhraseMatcher with the attribute {attr}. "
-            "Try using nlp() instead of nlp.make_doc() or list(nlp.pipe()) "
-            "instead of list(nlp.tokenizer.pipe()).")
+            "Try using `nlp()` instead of `nlp.make_doc()` or `list(nlp.pipe())` "
+            "instead of `list(nlp.tokenizer.pipe())`.")
     E157 = ("Can't render negative values for dependency arc start or end. "
             "Make sure that you're passing in absolute token indices, not "
             "relative token offsets.\nstart: {start}, end: {end}, label: "
@@ -410,13 +393,11 @@ class Errors:
     E159 = ("Can't find table '{name}' in lookups. Available tables: {tables}")
     E160 = ("Can't find language data file: {path}")
     E161 = ("Found an internal inconsistency when predicting entity links. "
-            "This is likely a bug in spaCy, so feel free to open an issue.")
-    E162 = ("Cannot evaluate textcat model on data with different labels.\n"
-            "Labels in model: {model_labels}\nLabels in evaluation "
-            "data: {eval_labels}")
+            "This is likely a bug in spaCy, so feel free to open an issue: "
+            "https://github.com/explosion/spaCy/issues")
     E163 = ("cumsum was found to be unstable: its last element does not "
             "correspond to sum")
-    E164 = ("x is neither increasing nor decreasing: {}.")
+    E164 = ("x is neither increasing nor decreasing: {x}.")
     E165 = ("Only one class present in y_true. ROC AUC score is not defined in "
             "that case.")
     E166 = ("Can only merge DocBins with the same value for '{param}'.\n"
@@ -431,10 +412,10 @@ class Errors:
     E178 = ("Each pattern should be a list of dicts, but got: {pat}. Maybe you "
             "accidentally passed a single pattern to Matcher.add instead of a "
             "list of patterns? If you only want to add one pattern, make sure "
-            "to wrap it in a list. For example: matcher.add('{key}', [pattern])")
+            "to wrap it in a list. For example: `matcher.add('{key}', [pattern])`")
     E179 = ("Invalid pattern. Expected a list of Doc objects but got a single "
             "Doc. If you only want to add one pattern, make sure to wrap it "
-            "in a list. For example: matcher.add('{key}', [doc])")
+            "in a list. For example: `matcher.add('{key}', [doc])`")
     E180 = ("Span attributes can't be declared as required or assigned by "
             "components, since spans are only views of the Doc. Use Doc and "
             "Token attributes (or custom extension attributes) only and remove "
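E178 and E179 concern the v3 `Matcher.add` signature, which takes a list of patterns (or a list of docs for the PhraseMatcher) even when only one is added. A short sketch, also using the `validate=True` option mentioned in E152 and E154:

```python
import spacy
from spacy.matcher import Matcher

nlp = spacy.blank("en")
matcher = Matcher(nlp.vocab, validate=True)  # validate patterns when they are added

pattern = [{"LOWER": "hello"}, {"LOWER": "world"}]
matcher.add("HELLO_WORLD", [pattern])  # note: the single pattern is wrapped in a list

doc = nlp("hello world")
print(matcher(doc))  # [(match_id, start, end)]
```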
@@ -442,17 +423,16 @@ class Errors:
     E181 = ("Received invalid attributes for unkown object {obj}: {attrs}. "
             "Only Doc and Token attributes are supported.")
     E182 = ("Received invalid attribute declaration: {attr}\nDid you forget "
-            "to define the attribute? For example: {attr}.???")
+            "to define the attribute? For example: `{attr}.???`")
     E183 = ("Received invalid attribute declaration: {attr}\nOnly top-level "
             "attributes are supported, for example: {solution}")
     E184 = ("Only attributes without underscores are supported in component "
             "attribute declarations (because underscore and non-underscore "
             "attributes are connected anyways): {attr} -> {solution}")
     E185 = ("Received invalid attribute in component attribute declaration: "
-            "{obj}.{attr}\nAttribute '{attr}' does not exist on {obj}.")
-    E186 = ("'{tok_a}' and '{tok_b}' are different texts.")
+            "`{obj}.{attr}`\nAttribute '{attr}' does not exist on {obj}.")
     E187 = ("Only unicode strings are supported as labels.")
-    E189 = ("Each argument to Doc.__init__ should be of equal length.")
+    E189 = ("Each argument to `Doc.__init__` should be of equal length.")
     E190 = ("Token head out of range in `Doc.from_array()` for token index "
             "'{index}' with value '{value}' (equivalent to relative head "
             "index: '{rel_head_index}'). The head indices should be relative "
@@ -466,17 +446,32 @@ class Errors:
             "({curr_dim}).")
     E194 = ("Unable to aligned mismatched text '{text}' and words '{words}'.")
     E195 = ("Matcher can be called on {good} only, got {got}.")
-    E196 = ("Refusing to write to token.is_sent_end. Sentence boundaries can "
-            "only be fixed with token.is_sent_start.")
+    E196 = ("Refusing to write to `token.is_sent_end`. Sentence boundaries can "
+            "only be fixed with `token.is_sent_start`.")
     E197 = ("Row out of bounds, unable to add row {row} for key {key}.")
     E198 = ("Unable to return {n} most similar vectors for the current vectors "
             "table, which contains {n_rows} vectors.")
-    E199 = ("Unable to merge 0-length span at doc[{start}:{end}].")
-    E200 = ("Specifying a base model with a pretrained component '{component}' "
-            "can not be combined with adding a pretrained Tok2Vec layer.")
-    E201 = ("Span index out of range.")
+    E199 = ("Unable to merge 0-length span at `doc[{start}:{end}]`.")
+    E200 = ("Can't yet set {attr} from Span. Vote for this feature on the "
+            "issue tracker: http://github.com/explosion/spaCy/issues")

     # TODO: fix numbering after merging develop into master
+    E092 = ("The sentence-per-line IOB/IOB2 file is not formatted correctly. "
+            "Try checking whitespace and delimiters. See "
+            "https://nightly.spacy.io/api/cli#convert")
+    E093 = ("The token-per-line NER file is not formatted correctly. Try checking "
+            "whitespace and delimiters. See https://nightly.spacy.io/api/cli#convert")
+    E904 = ("Cannot initialize StaticVectors layer: nO dimension unset. This "
+            "dimension refers to the output width, after the linear projection "
+            "has been applied.")
+    E905 = ("Cannot initialize StaticVectors layer: nM dimension unset. This "
+            "dimension refers to the width of the vectors table.")
+    E906 = ("Unexpected `loss` value in pretraining objective: {loss_type}")
+    E907 = ("Unexpected `objective_type` value in pretraining objective: {objective_type}")
+    E908 = ("Can't set `spaces` without `words` in `Doc.__init__`.")
+    E909 = ("Expected {name} in parser internals. This is likely a bug in spaCy.")
+    E910 = ("Encountered NaN value when computing loss for component '{name}'.")
+    E911 = ("Invalid feature: {feat}. Must be a token attribute.")
     E912 = ("Failed to initialize lemmatizer. Missing lemmatizer table(s) found "
             "for mode '{mode}'. Required tables: {tables}. Found: {found}.")
     E913 = ("Corpus path can't be None. Maybe you forgot to define it in your "
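Several of these codes (E908 here, E027 and E028 above) concern the `words`/`spaces` arguments of `Doc.__init__`, which must be sequences of equal length and cannot be supplied as `spaces` alone. A minimal sketch of constructing a `Doc` directly:

```python
import spacy
from spacy.tokens import Doc

nlp = spacy.blank("en")
# words and spaces must have the same length; spaces[i] indicates whether
# words[i] is followed by a space character in the original text
doc = Doc(nlp.vocab, words=["Hello", "world", "!"], spaces=[True, False, False])
print(doc.text)  # "Hello world!"
```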
@@ -489,43 +484,44 @@ class Errors:
             "final score, set its weight to null in the [training.score_weights] "
             "section of your training config.")
     E916 = ("Can't log score for '{name}' in table: not a valid score ({score_type})")
-    E917 = ("Received invalid value {value} for 'state_type' in "
+    E917 = ("Received invalid value {value} for `state_type` in "
             "TransitionBasedParser: only 'parser' or 'ner' are valid options.")
     E918 = ("Received invalid value for vocab: {vocab} ({vocab_type}). Valid "
-            "values are an instance of spacy.vocab.Vocab or True to create one"
+            "values are an instance of `spacy.vocab.Vocab` or True to create one"
             " (default).")
-    E919 = ("A textcat 'positive_label' '{pos_label}' was provided for training "
+    E919 = ("A textcat `positive_label` '{pos_label}' was provided for training "
             "data that does not appear to be a binary classification problem "
             "with two labels. Labels found: {labels}")
-    E920 = ("The textcat's 'positive_label' config setting '{pos_label}' "
-            "does not match any label in the training data. Labels found: {labels}")
-    E921 = ("The method 'set_output' can only be called on components that have "
-            "a Model with a 'resize_output' attribute. Otherwise, the output "
+    E920 = ("The textcat's `positive_label` setting '{pos_label}' "
+            "does not match any label in the training data or provided during "
+            "initialization. Available labels: {labels}")
+    E921 = ("The method `set_output` can only be called on components that have "
+            "a Model with a `resize_output` attribute. Otherwise, the output "
             "layer can not be dynamically changed.")
     E922 = ("Component '{name}' has been initialized with an output dimension of "
             "{nO} - cannot add any more labels.")
     E923 = ("It looks like there is no proper sample data to initialize the "
-            "Model of component '{name}'. "
-            "This is likely a bug in spaCy, so feel free to open an issue.")
+            "Model of component '{name}'. This is likely a bug in spaCy, so "
+            "feel free to open an issue: https://github.com/explosion/spaCy/issues")
     E924 = ("The '{name}' component does not seem to be initialized properly. "
-            "This is likely a bug in spaCy, so feel free to open an issue.")
+            "This is likely a bug in spaCy, so feel free to open an issue: "
+            "https://github.com/explosion/spaCy/issues")
     E925 = ("Invalid color values for displaCy visualizer: expected dictionary "
             "mapping label names to colors but got: {obj}")
-    E926 = ("It looks like you're trying to modify nlp.{attr} directly. This "
+    E926 = ("It looks like you're trying to modify `nlp.{attr}` directly. This "
             "doesn't work because it's an immutable computed property. If you "
             "need to modify the pipeline, use the built-in methods like "
-            "nlp.add_pipe, nlp.remove_pipe, nlp.disable_pipe or nlp.enable_pipe "
-            "instead.")
+            "`nlp.add_pipe`, `nlp.remove_pipe`, `nlp.disable_pipe` or "
+            "`nlp.enable_pipe` instead.")
     E927 = ("Can't write to frozen list Maybe you're trying to modify a computed "
             "property or default function argument?")
-    E928 = ("A 'KnowledgeBase' can only be serialized to/from from a directory, "
+    E928 = ("A KnowledgeBase can only be serialized to/from from a directory, "
             "but the provided argument {loc} points to a file.")
-    E929 = ("A 'KnowledgeBase' could not be read from {loc} - the path does "
-            "not seem to exist.")
-    E930 = ("Received invalid get_examples callback in {name}.initialize. "
+    E929 = ("Couldn't read KnowledgeBase from {loc}. The path does not seem to exist.")
+    E930 = ("Received invalid get_examples callback in `{name}.initialize`. "
             "Expected function that returns an iterable of Example objects but "
             "got: {obj}")
-    E931 = ("Encountered Pipe subclass without Pipe.{method} method in component "
+    E931 = ("Encountered Pipe subclass without `Pipe.{method}` method in component "
             "'{name}'. If the component is trainable and you want to use this "
             "method, make sure it's overwritten on the subclass. If your "
             "component isn't trainable, add a method that does nothing or "
@ -538,21 +534,21 @@ class Errors:
|
||||||
"models, see the models directory: https://spacy.io/models. If you "
|
"models, see the models directory: https://spacy.io/models. If you "
|
||||||
"want to create a blank model, use spacy.blank: "
|
"want to create a blank model, use spacy.blank: "
|
||||||
"nlp = spacy.blank(\"{name}\")")
|
"nlp = spacy.blank(\"{name}\")")
|
||||||
E942 = ("Executing after_{name} callback failed. Expected the function to "
|
E942 = ("Executing `after_{name}` callback failed. Expected the function to "
|
||||||
"return an initialized nlp object but got: {value}. Maybe "
|
"return an initialized nlp object but got: {value}. Maybe "
|
||||||
"you forgot to return the modified object in your function?")
|
"you forgot to return the modified object in your function?")
|
||||||
E943 = ("Executing before_creation callback failed. Expected the function to "
|
E943 = ("Executing `before_creation` callback failed. Expected the function to "
|
||||||
"return an uninitialized Language subclass but got: {value}. Maybe "
|
"return an uninitialized Language subclass but got: {value}. Maybe "
|
||||||
"you forgot to return the modified object in your function or "
|
"you forgot to return the modified object in your function or "
|
||||||
"returned the initialized nlp object instead?")
|
"returned the initialized nlp object instead?")
|
||||||
E944 = ("Can't copy pipeline component '{name}' from source model '{model}': "
|
E944 = ("Can't copy pipeline component '{name}' from source '{model}': "
|
||||||
"not found in pipeline. Available components: {opts}")
|
"not found in pipeline. Available components: {opts}")
|
||||||
E945 = ("Can't copy pipeline component '{name}' from source. Expected loaded "
|
E945 = ("Can't copy pipeline component '{name}' from source. Expected loaded "
|
||||||
"nlp object, but got: {source}")
|
"nlp object, but got: {source}")
|
||||||
E947 = ("Matcher.add received invalid 'greedy' argument: expected "
|
E947 = ("`Matcher.add` received invalid `greedy` argument: expected "
|
||||||
"a string value from {expected} but got: '{arg}'")
|
"a string value from {expected} but got: '{arg}'")
|
||||||
E948 = ("Matcher.add received invalid 'patterns' argument: expected "
|
E948 = ("`Matcher.add` received invalid 'patterns' argument: expected "
|
||||||
"a List, but got: {arg_type}")
|
"a list, but got: {arg_type}")
|
||||||
E949 = ("Can only create an alignment when the texts are the same.")
|
E949 = ("Can only create an alignment when the texts are the same.")
|
||||||
E952 = ("The section '{name}' is not a valid section in the provided config.")
|
E952 = ("The section '{name}' is not a valid section in the provided config.")
|
||||||
E953 = ("Mismatched IDs received by the Tok2Vec listener: {id1} vs. {id2}")
|
E953 = ("Mismatched IDs received by the Tok2Vec listener: {id1} vs. {id2}")
|
||||||
|
@ -564,9 +560,9 @@ class Errors:
|
||||||
"for your language.")
|
"for your language.")
|
||||||
E956 = ("Can't find component '{name}' in [components] block in the config. "
|
E956 = ("Can't find component '{name}' in [components] block in the config. "
|
||||||
"Available components: {opts}")
|
"Available components: {opts}")
|
||||||
E957 = ("Writing directly to Language.factories isn't needed anymore in "
|
E957 = ("Writing directly to `Language.factories` isn't needed anymore in "
|
||||||
"spaCy v3. Instead, you can use the @Language.factory decorator "
|
"spaCy v3. Instead, you can use the `@Language.factory` decorator "
|
||||||
"to register your custom component factory or @Language.component "
|
"to register your custom component factory or `@Language.component` "
|
||||||
"to register a simple stateless function component that just takes "
|
"to register a simple stateless function component that just takes "
|
||||||
"a Doc and returns it.")
|
"a Doc and returns it.")
|
||||||
E958 = ("Language code defined in config ({bad_lang_code}) does not match "
|
E958 = ("Language code defined in config ({bad_lang_code}) does not match "
|
||||||
|
@ -584,99 +580,93 @@ class Errors:
|
||||||
"component.\n\n{config}")
|
"component.\n\n{config}")
|
||||||
E962 = ("Received incorrect {style} for pipe '{name}'. Expected dict, "
|
E962 = ("Received incorrect {style} for pipe '{name}'. Expected dict, "
|
||||||
"got: {cfg_type}.")
|
"got: {cfg_type}.")
|
||||||
E963 = ("Can't read component info from @Language.{decorator} decorator. "
|
E963 = ("Can't read component info from `@Language.{decorator}` decorator. "
|
||||||
"Maybe you forgot to call it? Make sure you're using "
|
"Maybe you forgot to call it? Make sure you're using "
|
||||||
"@Language.{decorator}() instead of @Language.{decorator}.")
|
"`@Language.{decorator}()` instead of `@Language.{decorator}`.")
|
||||||
E964 = ("The pipeline component factory for '{name}' needs to have the "
|
E964 = ("The pipeline component factory for '{name}' needs to have the "
|
||||||
"following named arguments, which are passed in by spaCy:\n- nlp: "
|
"following named arguments, which are passed in by spaCy:\n- nlp: "
|
||||||
"receives the current nlp object and lets you access the vocab\n- "
|
"receives the current nlp object and lets you access the vocab\n- "
|
||||||
"name: the name of the component instance, can be used to identify "
|
"name: the name of the component instance, can be used to identify "
|
||||||
"the component, output losses etc.")
|
"the component, output losses etc.")
|
||||||
E965 = ("It looks like you're using the @Language.component decorator to "
|
E965 = ("It looks like you're using the `@Language.component` decorator to "
|
||||||
"register '{name}' on a class instead of a function component. If "
|
"register '{name}' on a class instead of a function component. If "
|
||||||
"you need to register a class or function that *returns* a component "
|
"you need to register a class or function that *returns* a component "
|
||||||
"function, use the @Language.factory decorator instead.")
|
"function, use the `@Language.factory` decorator instead.")
|
||||||
E966 = ("nlp.add_pipe now takes the string name of the registered component "
|
E966 = ("`nlp.add_pipe` now takes the string name of the registered component "
|
||||||
"factory, not a callable component. Expected string, but got "
|
"factory, not a callable component. Expected string, but got "
|
||||||
"{component} (name: '{name}').\n\n- If you created your component "
|
"{component} (name: '{name}').\n\n- If you created your component "
|
||||||
"with nlp.create_pipe('name'): remove nlp.create_pipe and call "
|
"with `nlp.create_pipe('name')`: remove nlp.create_pipe and call "
|
||||||
"nlp.add_pipe('name') instead.\n\n- If you passed in a component "
|
"`nlp.add_pipe('name')` instead.\n\n- If you passed in a component "
|
||||||
"like TextCategorizer(): call nlp.add_pipe with the string name "
|
"like `TextCategorizer()`: call `nlp.add_pipe` with the string name "
|
||||||
"instead, e.g. nlp.add_pipe('textcat').\n\n- If you're using a custom "
|
"instead, e.g. `nlp.add_pipe('textcat')`.\n\n- If you're using a custom "
|
||||||
"component: Add the decorator @Language.component (for function "
|
"component: Add the decorator `@Language.component` (for function "
|
||||||
"components) or @Language.factory (for class components / factories) "
|
"components) or `@Language.factory` (for class components / factories) "
|
||||||
"to your custom component and assign it a name, e.g. "
|
"to your custom component and assign it a name, e.g. "
|
||||||
"@Language.component('your_name'). You can then run "
|
"`@Language.component('your_name')`. You can then run "
|
||||||
"nlp.add_pipe('your_name') to add it to the pipeline.")
|
"`nlp.add_pipe('your_name')` to add it to the pipeline.")
|
||||||
E967 = ("No {meta} meta information found for '{name}'. This is likely a bug in spaCy.")
|
E967 = ("No {meta} meta information found for '{name}'. This is likely a bug in spaCy.")
|
||||||
E968 = ("nlp.replace_pipe now takes the string name of the registered component "
|
E968 = ("`nlp.replace_pipe` now takes the string name of the registered component "
|
||||||
"factory, not a callable component. Expected string, but got "
|
"factory, not a callable component. Expected string, but got "
|
||||||
"{component}.\n\n- If you created your component with"
|
"{component}.\n\n- If you created your component with"
|
||||||
"with nlp.create_pipe('name'): remove nlp.create_pipe and call "
|
"with `nlp.create_pipe('name')`: remove `nlp.create_pipe` and call "
|
||||||
"nlp.replace_pipe('{name}', 'name') instead.\n\n- If you passed in a "
|
"`nlp.replace_pipe('{name}', 'name')` instead.\n\n- If you passed in a "
|
||||||
"component like TextCategorizer(): call nlp.replace_pipe with the "
|
"component like `TextCategorizer()`: call `nlp.replace_pipe` with the "
|
||||||
"string name instead, e.g. nlp.replace_pipe('{name}', 'textcat').\n\n"
|
"string name instead, e.g. `nlp.replace_pipe('{name}', 'textcat')`.\n\n"
|
||||||
"- If you're using a custom component: Add the decorator "
|
"- If you're using a custom component: Add the decorator "
|
||||||
"@Language.component (for function components) or @Language.factory "
|
"`@Language.component` (for function components) or `@Language.factory` "
|
||||||
"(for class components / factories) to your custom component and "
|
"(for class components / factories) to your custom component and "
|
||||||
"assign it a name, e.g. @Language.component('your_name'). You can "
|
"assign it a name, e.g. `@Language.component('your_name')`. You can "
|
||||||
"then run nlp.replace_pipe('{name}', 'your_name').")
|
"then run `nlp.replace_pipe('{name}', 'your_name')`.")
|
||||||
E969 = ("Expected string values for field '{field}', but received {types} instead. ")
|
E969 = ("Expected string values for field '{field}', but received {types} instead. ")
|
||||||
E970 = ("Can not execute command '{str_command}'. Do you have '{tool}' installed?")
|
E970 = ("Can not execute command '{str_command}'. Do you have '{tool}' installed?")
|
||||||
E971 = ("Found incompatible lengths in Doc.from_array: {array_length} for the "
|
E971 = ("Found incompatible lengths in `Doc.from_array`: {array_length} for the "
|
||||||
"array and {doc_length} for the Doc itself.")
|
"array and {doc_length} for the Doc itself.")
|
||||||
E972 = ("Example.__init__ got None for '{arg}'. Requires Doc.")
|
E972 = ("`Example.__init__` got None for '{arg}'. Requires Doc.")
|
||||||
E973 = ("Unexpected type for NER data")
|
E973 = ("Unexpected type for NER data")
|
||||||
E974 = ("Unknown {obj} attribute: {key}")
|
E974 = ("Unknown {obj} attribute: {key}")
|
||||||
E976 = ("The method 'Example.from_dict' expects a {type} as {n} argument, "
|
E976 = ("The method `Example.from_dict` expects a {type} as {n} argument, "
|
||||||
"but received None.")
|
"but received None.")
|
||||||
E977 = ("Can not compare a MorphAnalysis with a string object. "
|
E977 = ("Can not compare a MorphAnalysis with a string object. "
|
||||||
"This is likely a bug in spaCy, so feel free to open an issue.")
|
"This is likely a bug in spaCy, so feel free to open an issue: "
|
||||||
|
"https://github.com/explosion/spaCy/issues")
|
||||||
E978 = ("The {name} method takes a list of Example objects, but got: {types}")
|
E978 = ("The {name} method takes a list of Example objects, but got: {types}")
|
||||||
E979 = ("Cannot convert {type} to an Example object.")
|
|
||||||
E980 = ("Each link annotation should refer to a dictionary with at most one "
|
E980 = ("Each link annotation should refer to a dictionary with at most one "
|
||||||
"identifier mapping to 1.0, and all others to 0.0.")
|
"identifier mapping to 1.0, and all others to 0.0.")
|
||||||
E981 = ("The offsets of the annotations for 'links' could not be aligned "
|
E981 = ("The offsets of the annotations for `links` could not be aligned "
|
||||||
"to token boundaries.")
|
"to token boundaries.")
|
||||||
E982 = ("The 'ent_iob' attribute of a Token should be an integer indexing "
|
E982 = ("The `Token.ent_iob` attribute should be an integer indexing "
|
||||||
"into {values}, but found {value}.")
|
"into {values}, but found {value}.")
|
||||||
E983 = ("Invalid key for '{dict}': {key}. Available keys: "
|
E983 = ("Invalid key for '{dict}': {key}. Available keys: "
|
||||||
"{keys}")
|
"{keys}")
|
||||||
E984 = ("Invalid component config for '{name}': component block needs either "
|
E984 = ("Invalid component config for '{name}': component block needs either "
|
||||||
"a key 'factory' specifying the registered function used to "
|
"a key `factory` specifying the registered function used to "
|
||||||
"initialize the component, or a key 'source' key specifying a "
|
"initialize the component, or a key `source` key specifying a "
|
||||||
"spaCy model to copy the component from. For example, factory = "
|
"spaCy model to copy the component from. For example, `factory = "
|
||||||
"\"ner\" will use the 'ner' factory and all other settings in the "
|
"\"ner\"` will use the 'ner' factory and all other settings in the "
|
||||||
"block will be passed to it as arguments. Alternatively, source = "
|
"block will be passed to it as arguments. Alternatively, `source = "
|
||||||
"\"en_core_web_sm\" will copy the component from that model.\n\n{config}")
|
"\"en_core_web_sm\"` will copy the component from that model.\n\n{config}")
|
||||||
E985 = ("Can't load model from config file: no 'nlp' section found.\n\n{config}")
|
E985 = ("Can't load model from config file: no [nlp] section found.\n\n{config}")
|
||||||
E986 = ("Could not create any training batches: check your input. "
|
E986 = ("Could not create any training batches: check your input. "
|
||||||
"Are the train and dev paths defined? "
|
"Are the train and dev paths defined? Is `discard_oversize` set appropriately? ")
|
||||||
"Is 'discard_oversize' set appropriately? ")
|
E989 = ("`nlp.update()` was called with two positional arguments. This "
|
||||||
E987 = ("The text of an example training instance is either a Doc or "
|
|
||||||
"a string, but found {type} instead.")
|
|
||||||
E988 = ("Could not parse any training examples. Ensure the data is "
|
|
||||||
"formatted correctly.")
|
|
||||||
E989 = ("'nlp.update()' was called with two positional arguments. This "
|
|
||||||
"may be due to a backwards-incompatible change to the format "
|
"may be due to a backwards-incompatible change to the format "
|
||||||
"of the training data in spaCy 3.0 onwards. The 'update' "
|
"of the training data in spaCy 3.0 onwards. The 'update' "
|
||||||
"function should now be called with a batch of 'Example' "
|
"function should now be called with a batch of Example "
|
||||||
"objects, instead of (text, annotation) tuples. ")
|
"objects, instead of `(text, annotation)` tuples. ")
|
||||||
E991 = ("The function 'select_pipes' should be called with either a "
|
E991 = ("The function `nlp.select_pipes` should be called with either a "
|
||||||
"'disable' argument to list the names of the pipe components "
|
"`disable` argument to list the names of the pipe components "
|
||||||
"that should be disabled, or with an 'enable' argument that "
|
"that should be disabled, or with an 'enable' argument that "
|
||||||
"specifies which pipes should not be disabled.")
|
"specifies which pipes should not be disabled.")
|
||||||
E992 = ("The function `select_pipes` was called with `enable`={enable} "
|
E992 = ("The function `select_pipes` was called with `enable`={enable} "
|
||||||
"and `disable`={disable} but that information is conflicting "
|
"and `disable`={disable} but that information is conflicting "
|
||||||
"for the `nlp` pipeline with components {names}.")
|
"for the `nlp` pipeline with components {names}.")
|
||||||
E993 = ("The config for 'nlp' needs to include a key 'lang' specifying "
|
E993 = ("The config for the nlp object needs to include a key `lang` specifying "
|
||||||
"the code of the language to initialize it with (for example "
|
"the code of the language to initialize it with (for example "
|
||||||
"'en' for English) - this can't be 'None'.\n\n{config}")
|
"'en' for English) - this can't be None.\n\n{config}")
|
||||||
E996 = ("Could not parse {file}: {msg}")
|
|
||||||
E997 = ("Tokenizer special cases are not allowed to modify the text. "
|
E997 = ("Tokenizer special cases are not allowed to modify the text. "
|
||||||
"This would map '{chunk}' to '{orth}' given token attributes "
|
"This would map '{chunk}' to '{orth}' given token attributes "
|
||||||
"'{token_attrs}'.")
|
"'{token_attrs}'.")
|
||||||
E999 = ("Unable to merge the `Doc` objects because they do not all share "
|
E999 = ("Unable to merge the Doc objects because they do not all share "
|
||||||
"the same `Vocab`.")
|
"the same `Vocab`.")
|
||||||
E1000 = ("The Chinese word segmenter is pkuseg but no pkuseg model was "
|
E1000 = ("The Chinese word segmenter is pkuseg but no pkuseg model was "
|
||||||
"loaded. Provide the name of a pretrained model or the path to "
|
"loaded. Provide the name of a pretrained model or the path to "
|
||||||
|
@ -688,35 +678,24 @@ class Errors:
|
||||||
E1003 = ("Unsupported lemmatizer mode '{mode}'.")
|
E1003 = ("Unsupported lemmatizer mode '{mode}'.")
|
||||||
E1004 = ("Missing lemmatizer table(s) found for lemmatizer mode '{mode}'. "
|
E1004 = ("Missing lemmatizer table(s) found for lemmatizer mode '{mode}'. "
|
||||||
"Required tables: {tables}. Found: {found}. Maybe you forgot to "
|
"Required tables: {tables}. Found: {found}. Maybe you forgot to "
|
||||||
"call nlp.initialize() to load in the data?")
|
"call `nlp.initialize()` to load in the data?")
|
||||||
E1005 = ("Unable to set attribute '{attr}' in tokenizer exception for "
|
E1005 = ("Unable to set attribute '{attr}' in tokenizer exception for "
|
||||||
"'{chunk}'. Tokenizer exceptions are only allowed to specify "
|
"'{chunk}'. Tokenizer exceptions are only allowed to specify "
|
||||||
"`ORTH` and `NORM`.")
|
"ORTH and NORM.")
|
||||||
E1006 = ("Unable to initialize {name} model with 0 labels.")
|
|
||||||
E1007 = ("Unsupported DependencyMatcher operator '{op}'.")
|
E1007 = ("Unsupported DependencyMatcher operator '{op}'.")
|
||||||
E1008 = ("Invalid pattern: each pattern should be a list of dicts. Check "
|
E1008 = ("Invalid pattern: each pattern should be a list of dicts. Check "
|
||||||
"that you are providing a list of patterns as `List[List[dict]]`.")
|
"that you are providing a list of patterns as `List[List[dict]]`.")
|
||||||
E1009 = ("String for hash '{val}' not found in StringStore. Set the value "
|
|
||||||
"through token.morph_ instead or add the string to the "
|
|
||||||
"StringStore with `nlp.vocab.strings.add(string)`.")
|
|
||||||
E1010 = ("Unable to set entity information for token {i} which is included "
|
E1010 = ("Unable to set entity information for token {i} which is included "
|
||||||
"in more than one span in entities, blocked, missing or outside.")
|
"in more than one span in entities, blocked, missing or outside.")
|
||||||
E1011 = ("Unsupported default '{default}' in doc.set_ents. Available "
|
E1011 = ("Unsupported default '{default}' in `doc.set_ents`. Available "
|
||||||
"options: {modes}")
|
"options: {modes}")
|
||||||
E1012 = ("Entity spans and blocked/missing/outside spans should be "
|
E1012 = ("Entity spans and blocked/missing/outside spans should be "
|
||||||
"provided to doc.set_ents as lists of `Span` objects.")
|
"provided to `doc.set_ents` as lists of Span objects.")
|
||||||
E1013 = ("Invalid morph: the MorphAnalysis must have the same vocab as the "
|
E1013 = ("Invalid morph: the MorphAnalysis must have the same vocab as the "
|
||||||
"token itself. To set the morph from this MorphAnalysis, set from "
|
"token itself. To set the morph from this MorphAnalysis, set from "
|
||||||
"the string value with: `token.set_morph(str(other_morph))`.")
|
"the string value with: `token.set_morph(str(other_morph))`.")
|
||||||
|
|
||||||
|
|
||||||
@add_codes
|
|
||||||
class TempErrors:
|
|
||||||
T003 = ("Resizing pretrained Tagger models is not currently supported.")
|
|
||||||
T007 = ("Can't yet set {attr} from Span. Vote for this feature on the "
|
|
||||||
"issue tracker: http://github.com/explosion/spaCy/issues")
|
|
||||||
|
|
||||||
|
|
||||||
# Deprecated model shortcuts, only used in errors and warnings
|
# Deprecated model shortcuts, only used in errors and warnings
|
||||||
OLD_MODEL_SHORTCUTS = {
|
OLD_MODEL_SHORTCUTS = {
|
||||||
"en": "en_core_web_sm", "de": "de_core_news_sm", "es": "es_core_news_sm",
|
"en": "en_core_web_sm", "de": "de_core_news_sm", "es": "es_core_news_sm",
|
||||||
|
|
|
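The E904-E911 entries added above are plain format-string templates on the Errors class, like the codes around them. A minimal sketch of how they are consumed, matching the call pattern that appears in the Tagger, SentenceRecognizer and Morphologizer hunks later in this diff (the component name below is illustrative, and the snippet assumes a spaCy version that defines E910):

from spacy.errors import Errors

# "tagger" is just an example component name
message = Errors.E910.format(name="tagger")
# message == "Encountered NaN value when computing loss for component 'tagger'."
# raise ValueError(message)  # how the pipeline components below use it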
@@ -22,9 +22,13 @@ def noun_chunks(doclike: Union[Doc, Span]) -> Iterator[Span]:
np_deps = set(doc.vocab.strings.add(label) for label in labels)
close_app = doc.vocab.strings.add("nk")
rbracket = 0
+prev_end = -1
for i, word in enumerate(doclike):
if i < rbracket:
continue
+# Prevent nested chunks from being produced
+if word.left_edge.i <= prev_end:
+continue
if word.pos in (NOUN, PROPN, PRON) and word.dep in np_deps:
rbracket = word.i + 1
# try to extend the span to the right

@@ -32,6 +36,7 @@ def noun_chunks(doclike: Union[Doc, Span]) -> Iterator[Span]:
for rdep in doc[word.i].rights:
if rdep.pos in (NOUN, PROPN) and rdep.dep == close_app:
rbracket = rdep.i + 1
+prev_end = rbracket - 1
yield word.left_edge.i, rbracket, np_label
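The prev_end bookkeeping added above keeps the noun_chunks iterator from yielding chunks nested inside a chunk that was already produced. A standalone sketch of the same guard, outside spaCy and purely illustrative:

def non_nested(candidates):
    """Yield (start, end) spans, skipping any span that starts inside the
    previously yielded one - the same idea as the prev_end check above."""
    prev_end = -1
    for start, end in candidates:
        if start <= prev_end:
            continue
        prev_end = end - 1
        yield start, end

assert list(non_nested([(0, 4), (2, 4), (5, 7)])) == [(0, 4), (5, 7)]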
@@ -1,4 +1,4 @@
-from typing import List, Dict
+from typing import List, Tuple

from ...pipeline import Lemmatizer
from ...tokens import Token

@@ -15,17 +15,10 @@ class FrenchLemmatizer(Lemmatizer):
"""

@classmethod
-def get_lookups_config(cls, mode: str) -> Dict:
+def get_lookups_config(cls, mode: str) -> Tuple[List[str], List[str]]:
if mode == "rule":
-return {
-"required_tables": [
-"lemma_lookup",
-"lemma_rules",
-"lemma_exc",
-"lemma_index",
-],
-"optional_tables": [],
-}
+required = ["lemma_lookup", "lemma_rules", "lemma_exc", "lemma_index"]
+return (required, [])
else:
return super().get_lookups_config(mode)
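The FrenchLemmatizer hunk above, like the DutchLemmatizer and PolishLemmatizer hunks further down, switches get_lookups_config from returning a dict to returning a (required_tables, optional_tables) tuple. A hedged sketch of the new call shape, assuming the usual spacy.lang.fr.lemmatizer module layout:

from spacy.lang.fr.lemmatizer import FrenchLemmatizer

required, optional = FrenchLemmatizer.get_lookups_config("rule")
# required -> ["lemma_lookup", "lemma_rules", "lemma_exc", "lemma_index"]
# optional -> []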
@@ -7,8 +7,8 @@ Example sentences to test spaCy and its language models.


sentences = [
-"Al Qaidah mengklaim bom mobil yang menewaskan 60 Orang di Mali",
-"Abu Sayyaf mengeksekusi sandera warga Filipina",
+"Indonesia merupakan negara kepulauan yang kaya akan budaya.",
+"Berapa banyak warga yang dibutuhkan saat kerja bakti?",
"Penyaluran pupuk berasal dari lima lokasi yakni Bontang, Kalimantan Timur, Surabaya, Banyuwangi, Semarang, dan Makassar.",
"PT Pupuk Kaltim telah menyalurkan 274.707 ton pupuk bersubsidi ke wilayah penyaluran di 14 provinsi.",
"Jakarta adalah kota besar yang nyaris tidak pernah tidur."
@@ -1,4 +1,4 @@
-from typing import List, Dict
+from typing import List, Tuple

from ...pipeline import Lemmatizer
from ...tokens import Token

@@ -6,16 +6,10 @@ from ...tokens import Token

class DutchLemmatizer(Lemmatizer):
@classmethod
-def get_lookups_config(cls, mode: str) -> Dict:
+def get_lookups_config(cls, mode: str) -> Tuple[List[str], List[str]]:
if mode == "rule":
-return {
-"required_tables": [
-"lemma_lookup",
-"lemma_rules",
-"lemma_exc",
-"lemma_index",
-],
-}
+required = ["lemma_lookup", "lemma_rules", "lemma_exc", "lemma_index"]
+return (required, [])
else:
return super().get_lookups_config(mode)
@@ -8,7 +8,6 @@ from .stop_words import STOP_WORDS
from .lex_attrs import LEX_ATTRS
from .lemmatizer import PolishLemmatizer
from ..tokenizer_exceptions import BASE_EXCEPTIONS
-from ...lookups import Lookups
from ...language import Language
@@ -1,4 +1,4 @@
-from typing import List, Dict
+from typing import List, Dict, Tuple

from ...pipeline import Lemmatizer
from ...tokens import Token

@@ -11,21 +11,16 @@ class PolishLemmatizer(Lemmatizer):
# lemmatization, as well as case-sensitive lemmatization for nouns.

@classmethod
-def get_lookups_config(cls, mode: str) -> Dict:
+def get_lookups_config(cls, mode: str) -> Tuple[List[str], List[str]]:
if mode == "pos_lookup":
-return {
-"required_tables": [
-"lemma_lookup_adj",
-"lemma_lookup_adp",
-"lemma_lookup_adv",
-"lemma_lookup_aux",
-"lemma_lookup_noun",
-"lemma_lookup_num",
-"lemma_lookup_part",
-"lemma_lookup_pron",
-"lemma_lookup_verb",
-]
-}
+# fmt: off
+required = [
+"lemma_lookup_adj", "lemma_lookup_adp", "lemma_lookup_adv",
+"lemma_lookup_aux", "lemma_lookup_noun", "lemma_lookup_num",
+"lemma_lookup_part", "lemma_lookup_pron", "lemma_lookup_verb"
+]
+# fmt: on
+return (required, [])
else:
return super().get_lookups_config(mode)
@@ -47,7 +47,7 @@ class Segmenter(str, Enum):


@registry.tokenizers("spacy.zh.ChineseTokenizer")
-def create_chinese_tokenizer(segmenter: Segmenter = Segmenter.char,):
+def create_chinese_tokenizer(segmenter: Segmenter = Segmenter.char):
def chinese_tokenizer_factory(nlp):
return ChineseTokenizer(nlp, segmenter=segmenter)
@@ -896,6 +896,10 @@ class Language:
self._components[i] = (new_name, self._components[i][1])
self._pipe_meta[new_name] = self._pipe_meta.pop(old_name)
self._pipe_configs[new_name] = self._pipe_configs.pop(old_name)
+# Make sure [initialize] config is adjusted
+if old_name in self._config["initialize"]["components"]:
+init_cfg = self._config["initialize"]["components"].pop(old_name)
+self._config["initialize"]["components"][new_name] = init_cfg

def remove_pipe(self, name: str) -> Tuple[str, Callable[[Doc], Doc]]:
"""Remove a component from the pipeline.

@@ -912,6 +916,9 @@ class Language:
# because factory may be used for something else
self._pipe_meta.pop(name)
self._pipe_configs.pop(name)
+# Make sure name is removed from the [initialize] config
+if name in self._config["initialize"]["components"]:
+self._config["initialize"]["components"].pop(name)
# Make sure the name is also removed from the set of disabled components
if name in self.disabled:
self._disabled.remove(name)
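The two Language hunks above keep the initialize.components section of the pipeline config in step with pipe renames and removals. A hedged sketch of the observable effect; the component names are illustrative and the exact config contents depend on the pipeline:

import spacy

nlp = spacy.blank("en")
nlp.add_pipe("tagger")
nlp.rename_pipe("tagger", "my_tagger")
# Any "tagger" entry under the config's initialize.components section is now
# keyed by "my_tagger"; removing the pipe drops the entry entirely.
nlp.remove_pipe("my_tagger")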
@@ -6,6 +6,7 @@ from thinc.api import expand_window, residual, Maxout, Mish, PyTorchLSTM

from ...tokens import Doc
from ...util import registry
+from ...errors import Errors
from ...ml import _character_embed
from ..staticvectors import StaticVectors
from ..featureextractor import FeatureExtractor

@@ -165,8 +166,12 @@ def MultiHashEmbed(

@registry.architectures.register("spacy.CharacterEmbed.v1")
def CharacterEmbed(
-width: int, rows: int, nM: int, nC: int, also_use_static_vectors: bool,
-feature: Union[int, str]="LOWER"
+width: int,
+rows: int,
+nM: int,
+nC: int,
+also_use_static_vectors: bool,
+feature: Union[int, str] = "LOWER",
) -> Model[List[Doc], List[Floats2d]]:
"""Construct an embedded representation based on character embeddings, using
a feed-forward network. A fixed number of UTF-8 byte characters are used for

@@ -197,7 +202,7 @@ def CharacterEmbed(
"""
feature = intify_attr(feature)
if feature is None:
-raise ValueError("Invalid feature: Must be a token attribute.")
+raise ValueError(Errors.E911(feat=feature))
if also_use_static_vectors:
model = chain(
concatenate(
@@ -1,11 +1,11 @@
from typing import List, Tuple, Callable, Optional, cast

from thinc.initializers import glorot_uniform_init
from thinc.util import partial
from thinc.types import Ragged, Floats2d, Floats1d
from thinc.api import Model, Ops, registry

from ..tokens import Doc
+from ..errors import Errors


@registry.layers("spacy.StaticVectors.v1")

@@ -76,16 +76,9 @@ def init(
nO = Y.data.shape[1]

if nM is None:
-raise ValueError(
-"Cannot initialize StaticVectors layer: nM dimension unset. "
-"This dimension refers to the width of the vectors table."
-)
+raise ValueError(Errors.E905)
if nO is None:
-raise ValueError(
-"Cannot initialize StaticVectors layer: nO dimension unset. "
-"This dimension refers to the output width, after the linear "
-"projection has been applied."
-)
+raise ValueError(Errors.E904)
model.set_dim("nM", nM)
model.set_dim("nO", nO)
model.set_param("W", init_W(model.ops, (nO, nM)))
@@ -9,10 +9,11 @@ from ...strings cimport hash_string
from ...structs cimport TokenC
from ...tokens.doc cimport Doc, set_children_from_heads
from ...training.example cimport Example
-from ...errors import Errors
from .stateclass cimport StateClass
from ._state cimport StateC

+from ...errors import Errors

# Calculate cost as gold/not gold. We don't use scalar value anyway.
cdef int BINARY_COSTS = 1
cdef weight_t MIN_SCORE = -90000

@@ -86,7 +87,7 @@ cdef GoldParseStateC create_gold_state(Pool mem, StateClass stcls,
SENT_START_UNKNOWN,
0
)

elif is_sent_start is None:
gs.state_bits[i] = set_state_flag(
gs.state_bits[i],

@@ -109,7 +110,7 @@ cdef GoldParseStateC create_gold_state(Pool mem, StateClass stcls,
IS_SENT_START,
0
)

for i, (head, label) in enumerate(zip(heads, labels)):
if head is not None:
gs.heads[i] = head

@@ -158,7 +159,7 @@ cdef void update_gold_state(GoldParseStateC* gs, StateClass stcls) nogil:
)
gs.n_kids_in_stack[i] = 0
gs.n_kids_in_buffer[i] = 0

for i in range(stcls.stack_depth()):
s_i = stcls.S(i)
if not is_head_unknown(gs, s_i):

@@ -403,7 +404,7 @@ cdef class RightArc:
return 0
sent_start = st._sent[st.B_(0).l_edge].sent_start
return sent_start != 1 and st.H(st.S(0)) != st.B(0)

@staticmethod
cdef int transition(StateC* st, attr_t label) nogil:
st.add_arc(st.S(0), st.B(0), label)

@@ -701,10 +702,10 @@ cdef class ArcEager(TransitionSystem):
output[i] = self.c[i].is_valid(st, self.c[i].label)
else:
output[i] = is_valid[self.c[i].move]

def get_cost(self, StateClass stcls, gold, int i):
if not isinstance(gold, ArcEagerGold):
-raise TypeError("Expected ArcEagerGold")
+raise TypeError(Errors.E909.format(name="ArcEagerGold"))
cdef ArcEagerGold gold_ = gold
gold_state = gold_.c
n_gold = 0

@@ -717,7 +718,7 @@ cdef class ArcEager(TransitionSystem):
cdef int set_costs(self, int* is_valid, weight_t* costs,
StateClass stcls, gold) except -1:
if not isinstance(gold, ArcEagerGold):
-raise TypeError("Expected ArcEagerGold")
+raise TypeError(Errors.E909.format(name="ArcEagerGold"))
cdef ArcEagerGold gold_ = gold
gold_.update(stcls)
gold_state = gold_.c
@@ -1,16 +1,18 @@
-from collections import Counter
from libc.stdint cimport int32_t
from cymem.cymem cimport Pool

+from collections import Counter

from ...typedefs cimport weight_t, attr_t
from ...lexeme cimport Lexeme
from ...attrs cimport IS_SPACE
from ...training.example cimport Example
-from ...errors import Errors
from .stateclass cimport StateClass
from ._state cimport StateC
from .transition_system cimport Transition, do_func_t

+from ...errors import Errors


cdef enum:
MISSING

@@ -248,7 +250,7 @@ cdef class BiluoPushDown(TransitionSystem):

def get_cost(self, StateClass stcls, gold, int i):
if not isinstance(gold, BiluoGold):
-raise TypeError("Expected BiluoGold")
+raise TypeError(Errors.E909.format(name="BiluoGold"))
cdef BiluoGold gold_ = gold
gold_state = gold_.c
n_gold = 0

@@ -261,7 +263,7 @@ cdef class BiluoPushDown(TransitionSystem):
cdef int set_costs(self, int* is_valid, weight_t* costs,
StateClass stcls, gold) except -1:
if not isinstance(gold, BiluoGold):
-raise TypeError("Expected BiluoGold")
+raise TypeError(Errors.E909.format(name="BiluoGold"))
cdef BiluoGold gold_ = gold
gold_.update(stcls)
gold_state = gold_.c
@@ -1,10 +1,11 @@
+from typing import List, Dict, Union, Iterable, Any, Optional, Callable, Iterator
+from typing import Tuple
import srsly
-from typing import List, Dict, Union, Iterable, Any, Optional
from pathlib import Path

from .pipe import Pipe
from ..errors import Errors
-from ..training import validate_examples
+from ..training import validate_examples, Example
from ..language import Language
from ..matcher import Matcher
from ..scorer import Scorer

@@ -18,20 +19,13 @@ from .. import util

MatcherPatternType = List[Dict[Union[int, str], Any]]
AttributeRulerPatternType = Dict[str, Union[MatcherPatternType, Dict, int]]
+TagMapType = Dict[str, Dict[Union[int, str], Union[int, str]]]
+MorphRulesType = Dict[str, Dict[str, Dict[Union[int, str], Union[int, str]]]]


-@Language.factory(
-"attribute_ruler", default_config={"pattern_dicts": None, "validate": False}
-)
-def make_attribute_ruler(
-nlp: Language,
-name: str,
-pattern_dicts: Optional[Iterable[AttributeRulerPatternType]],
-validate: bool,
-):
-return AttributeRuler(
-nlp.vocab, name, pattern_dicts=pattern_dicts, validate=validate
-)
+@Language.factory("attribute_ruler", default_config={"validate": False})
+def make_attribute_ruler(nlp: Language, name: str, validate: bool):
+return AttributeRuler(nlp.vocab, name, validate=validate)


class AttributeRuler(Pipe):

@@ -42,20 +36,15 @@ class AttributeRuler(Pipe):
"""

def __init__(
-self,
-vocab: Vocab,
-name: str = "attribute_ruler",
-*,
-pattern_dicts: Optional[Iterable[AttributeRulerPatternType]] = None,
-validate: bool = False,
+self, vocab: Vocab, name: str = "attribute_ruler", *, validate: bool = False
) -> None:
-"""Initialize the AttributeRuler.
+"""Create the AttributeRuler. After creation, you can add patterns
+with the `.initialize()` or `.add_patterns()` methods, or load patterns
+with `.from_bytes()` or `.from_disk()`. Loading patterns will remove
+any patterns you've added previously.

vocab (Vocab): The vocab.
name (str): The pipe name. Defaults to "attribute_ruler".
-pattern_dicts (Iterable[Dict]): A list of pattern dicts with the keys as
-the arguments to AttributeRuler.add (`patterns`/`attrs`/`index`) to add
-as patterns.

RETURNS (AttributeRuler): The AttributeRuler component.

@@ -68,8 +57,27 @@ class AttributeRuler(Pipe):
self._attrs_unnormed = [] # store for reference
self.indices = []

-if pattern_dicts:
-self.add_patterns(pattern_dicts)
+def initialize(
+self,
+get_examples: Optional[Callable[[], Iterable[Example]]],
+*,
+nlp: Optional[Language] = None,
+patterns: Optional[Iterable[AttributeRulerPatternType]] = None,
+tag_map: Optional[TagMapType] = None,
+morph_rules: Optional[MorphRulesType] = None,
+):
+"""Initialize the attribute ruler by adding zero or more patterns.
+
+Rules can be specified as a sequence of dicts using the `patterns`
+keyword argument. You can also provide rules using the "tag map" or
+"morph rules" formats supported by spaCy prior to v3.
+"""
+if patterns:
+self.add_patterns(patterns)
+if tag_map:
+self.load_from_tag_map(tag_map)
+if morph_rules:
+self.load_from_morph_rules(morph_rules)

def __call__(self, doc: Doc) -> Doc:
"""Apply the AttributeRuler to a Doc and set all attribute exceptions.

@@ -106,7 +114,7 @@ class AttributeRuler(Pipe):
set_token_attrs(span[index], attrs)
return doc

-def pipe(self, stream, *, batch_size=128):
+def pipe(self, stream: Iterable[Doc], *, batch_size: int = 128) -> Iterator[Doc]:
"""Apply the pipe to a stream of documents. This usually happens under
the hood when the nlp object is called on a text and all components are
applied to the Doc.

@@ -190,16 +198,16 @@ class AttributeRuler(Pipe):
self.attrs.append(attrs)
self.indices.append(index)

-def add_patterns(self, pattern_dicts: Iterable[AttributeRulerPatternType]) -> None:
+def add_patterns(self, patterns: Iterable[AttributeRulerPatternType]) -> None:
"""Add patterns from a list of pattern dicts with the keys as the
arguments to AttributeRuler.add.
-pattern_dicts (Iterable[dict]): A list of pattern dicts with the keys
+patterns (Iterable[dict]): A list of pattern dicts with the keys
as the arguments to AttributeRuler.add (patterns/attrs/index) to
add as patterns.

DOCS: https://nightly.spacy.io/api/attributeruler#add_patterns
"""
-for p in pattern_dicts:
+for p in patterns:
self.add(**p)

@property

@@ -214,7 +222,7 @@ class AttributeRuler(Pipe):
all_patterns.append(p)
return all_patterns

-def score(self, examples, **kwargs):
+def score(self, examples: Iterable[Example], **kwargs) -> Dict[str, Any]:
"""Score a batch of examples.

examples (Iterable[Example]): The examples to score.

@@ -255,7 +263,7 @@ class AttributeRuler(Pipe):

def from_bytes(
self, bytes_data: bytes, exclude: Iterable[str] = SimpleFrozenList()
-):
+) -> "AttributeRuler":
"""Load the AttributeRuler from a bytestring.

bytes_data (bytes): The data to load.

@@ -273,7 +281,6 @@ class AttributeRuler(Pipe):
"patterns": load_patterns,
}
util.from_bytes(bytes_data, deserialize, exclude)

return self

def to_disk(

@@ -283,6 +290,7 @@ class AttributeRuler(Pipe):

path (Union[Path, str]): A path to a directory.
exclude (Iterable[str]): String names of serialization fields to exclude.

DOCS: https://nightly.spacy.io/api/attributeruler#to_disk
"""
serialize = {

@@ -293,11 +301,13 @@ class AttributeRuler(Pipe):

def from_disk(
self, path: Union[Path, str], exclude: Iterable[str] = SimpleFrozenList()
-) -> None:
+) -> "AttributeRuler":
"""Load the AttributeRuler from disk.

path (Union[Path, str]): A path to a directory.
exclude (Iterable[str]): String names of serialization fields to exclude.
+RETURNS (AttributeRuler): The loaded object.

DOCS: https://nightly.spacy.io/api/attributeruler#from_disk
"""

@@ -309,11 +319,10 @@ class AttributeRuler(Pipe):
"patterns": load_patterns,
}
util.from_disk(path, deserialize, exclude)

return self


-def _split_morph_attrs(attrs):
+def _split_morph_attrs(attrs: dict) -> Tuple[dict, dict]:
"""Split entries from a tag map or morph rules dict into to two dicts, one
with the token-level features (POS, LEMMA) and one with the remaining
features, which are presumed to be individual MORPH features."""
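With the factory and constructor changes above, patterns are no longer passed in through a pattern_dicts argument; they go through initialize() (or add_patterns) after the component is created. A hedged usage sketch with an illustrative pattern:

import spacy

nlp = spacy.blank("en")
ruler = nlp.add_pipe("attribute_ruler")
patterns = [{"patterns": [[{"ORTH": "The"}]], "attrs": {"TAG": "DT"}}]
# get_examples can be a no-op callable here, since only `patterns` is used
ruler.initialize(lambda: [], nlp=nlp, patterns=patterns)
doc = nlp("The cat sat")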
@@ -134,7 +134,7 @@ class Morphologizer(Tagger):
self.cfg["labels_pos"][norm_label] = POS_IDS[pos]
return 1

-def initialize(self, get_examples, *, nlp=None):
+def initialize(self, get_examples, *, nlp=None, labels=None):
"""Initialize the pipe for training, using a representative set
of data examples.

@@ -145,20 +145,24 @@ class Morphologizer(Tagger):
DOCS: https://nightly.spacy.io/api/morphologizer#initialize
"""
self._ensure_examples(get_examples)
-# First, fetch all labels from the data
-for example in get_examples():
-for i, token in enumerate(example.reference):
-pos = token.pos_
-morph = str(token.morph)
-# create and add the combined morph+POS label
-morph_dict = Morphology.feats_to_dict(morph)
-if pos:
-morph_dict[self.POS_FEAT] = pos
-norm_label = self.vocab.strings[self.vocab.morphology.add(morph_dict)]
-# add label->morph and label->POS mappings
-if norm_label not in self.cfg["labels_morph"]:
-self.cfg["labels_morph"][norm_label] = morph
-self.cfg["labels_pos"][norm_label] = POS_IDS[pos]
+if labels is not None:
+self.cfg["labels_morph"] = labels["morph"]
+self.cfg["labels_pos"] = labels["pos"]
+else:
+# First, fetch all labels from the data
+for example in get_examples():
+for i, token in enumerate(example.reference):
+pos = token.pos_
+morph = str(token.morph)
+# create and add the combined morph+POS label
+morph_dict = Morphology.feats_to_dict(morph)
+if pos:
+morph_dict[self.POS_FEAT] = pos
+norm_label = self.vocab.strings[self.vocab.morphology.add(morph_dict)]
+# add label->morph and label->POS mappings
+if norm_label not in self.cfg["labels_morph"]:
+self.cfg["labels_morph"][norm_label] = morph
+self.cfg["labels_pos"][norm_label] = POS_IDS[pos]
if len(self.labels) <= 1:
raise ValueError(Errors.E143.format(name=self.name))
doc_sample = []

@@ -234,7 +238,7 @@ class Morphologizer(Tagger):
truths.append(eg_truths)
d_scores, loss = loss_func(scores, truths)
if self.model.ops.xp.isnan(loss):
-raise ValueError("nan value when computing loss")
+raise ValueError(Errors.E910.format(name=self.name))
return float(loss), d_scores

def score(self, examples, **kwargs):
@@ -1,4 +1,5 @@
 # cython: infer_types=True, profile=True
+import warnings
 from typing import Optional, Tuple
 import srsly
 from thinc.api import set_dropout_rate, Model

@@ -6,7 +7,7 @@ from thinc.api import set_dropout_rate, Model
 from ..tokens.doc cimport Doc

 from ..training import validate_examples
-from ..errors import Errors
+from ..errors import Errors, Warnings
 from .. import util


@@ -33,6 +34,13 @@ cdef class Pipe:
         self.name = name
         self.cfg = dict(cfg)

+    @classmethod
+    def __init_subclass__(cls, **kwargs):
+        """Raise a warning if an inheriting class implements 'begin_training'
+        (from v2) instead of the new 'initialize' method (from v3)"""
+        if hasattr(cls, "begin_training"):
+            warnings.warn(Warnings.W088.format(name=cls.__name__))
+
     @property
     def labels(self) -> Optional[Tuple[str]]:
         return []
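The __init_subclass__ hook fires as soon as a subclass is defined, so the W088 warning can be observed without ever instantiating a component. A small sketch mirroring the new test_warning_pipe_begin_training test later in this commit (the component class is hypothetical):

import warnings
from spacy.pipeline import Pipe

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")

    class LegacyComponent(Pipe):
        def begin_training(self, *args, **kwargs):  # v2-style hook, now discouraged
            ...

print(caught[0].message)  # W088, pointing to the new initialize() method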
@@ -73,7 +73,7 @@ class SentenceRecognizer(Tagger):

     @property
     def label_data(self):
-        return self.labels
+        return None

     def set_annotations(self, docs, batch_tag_ids):
         """Modify a batch of documents, using pre-computed scores.

@@ -125,7 +125,7 @@ class SentenceRecognizer(Tagger):
         truths.append(eg_truth)
         d_scores, loss = loss_func(scores, truths)
         if self.model.ops.xp.isnan(loss):
-            raise ValueError("nan value when computing loss")
+            raise ValueError(Errors.E910.format(name=self.name))
         return float(loss), d_scores

     def initialize(self, get_examples, *, nlp=None):
@@ -15,7 +15,7 @@ from .pipe import Pipe, deserialize_config
 from ..language import Language
 from ..attrs import POS, ID
 from ..parts_of_speech import X
-from ..errors import Errors, TempErrors, Warnings
+from ..errors import Errors, Warnings
 from ..scorer import Scorer
 from ..training import validate_examples
 from .. import util

@@ -258,7 +258,7 @@ class Tagger(Pipe):
         truths = [eg.get_aligned("TAG", as_string=True) for eg in examples]
         d_scores, loss = loss_func(scores, truths)
         if self.model.ops.xp.isnan(loss):
-            raise ValueError("nan value when computing loss")
+            raise ValueError(Errors.E910.format(name=self.name))
         return float(loss), d_scores

     def initialize(self, get_examples, *, nlp=None, labels=None):
@@ -56,12 +56,7 @@ subword_features = true
 @Language.factory(
     "textcat",
     assigns=["doc.cats"],
-    default_config={
-        "labels": [],
-        "threshold": 0.5,
-        "positive_label": None,
-        "model": DEFAULT_TEXTCAT_MODEL,
-    },
+    default_config={"threshold": 0.5, "model": DEFAULT_TEXTCAT_MODEL},
     default_score_weights={
         "cats_score": 1.0,
         "cats_score_desc": None,

@@ -75,12 +70,7 @@ subword_features = true
     },
 )
 def make_textcat(
-    nlp: Language,
-    name: str,
-    model: Model[List[Doc], List[Floats2d]],
-    labels: List[str],
-    threshold: float,
-    positive_label: Optional[str],
+    nlp: Language, name: str, model: Model[List[Doc], List[Floats2d]], threshold: float
 ) -> "TextCategorizer":
     """Create a TextCategorizer compoment. The text categorizer predicts categories
     over a whole document. It can learn one or more labels, and the labels can

@@ -90,19 +80,9 @@ def make_textcat(

     model (Model[List[Doc], List[Floats2d]]): A model instance that predicts
         scores for each category.
-    labels (list): A list of categories to learn. If empty, the model infers the
-        categories from the data.
     threshold (float): Cutoff to consider a prediction "positive".
-    positive_label (Optional[str]): The positive label for a binary task with exclusive classes, None otherwise.
     """
-    return TextCategorizer(
-        nlp.vocab,
-        model,
-        name,
-        labels=labels,
-        threshold=threshold,
-        positive_label=positive_label,
-    )
+    return TextCategorizer(nlp.vocab, model, name, threshold=threshold)


 class TextCategorizer(Pipe):

@@ -112,14 +92,7 @@ class TextCategorizer(Pipe):
     """

     def __init__(
-        self,
-        vocab: Vocab,
-        model: Model,
-        name: str = "textcat",
-        *,
-        labels: List[str],
-        threshold: float,
-        positive_label: Optional[str],
+        self, vocab: Vocab, model: Model, name: str = "textcat", *, threshold: float
     ) -> None:
         """Initialize a text categorizer.

@@ -127,9 +100,7 @@ class TextCategorizer(Pipe):
         model (thinc.api.Model): The Thinc Model powering the pipeline component.
         name (str): The component instance name, used to add entries to the
             losses during training.
-        labels (List[str]): The labels to use.
         threshold (float): Cutoff to consider a prediction "positive".
-        positive_label (Optional[str]): The positive label for a binary task with exclusive classes, None otherwise.

         DOCS: https://nightly.spacy.io/api/textcategorizer#init
         """

@@ -137,11 +108,7 @@ class TextCategorizer(Pipe):
         self.model = model
         self.name = name
         self._rehearsal_model = None
-        cfg = {
-            "labels": labels,
-            "threshold": threshold,
-            "positive_label": positive_label,
-        }
+        cfg = {"labels": [], "threshold": threshold, "positive_label": None}
         self.cfg = dict(cfg)

     @property

@@ -348,6 +315,7 @@ class TextCategorizer(Pipe):
         *,
         nlp: Optional[Language] = None,
         labels: Optional[Dict] = None,
+        positive_label: Optional[str] = None,
     ):
         """Initialize the pipe for training, using a representative set
         of data examples.

@@ -369,6 +337,14 @@ class TextCategorizer(Pipe):
         else:
             for label in labels:
                 self.add_label(label)
+        if positive_label is not None:
+            if positive_label not in self.labels:
+                err = Errors.E920.format(pos_label=positive_label, labels=self.labels)
+                raise ValueError(err)
+            if len(self.labels) != 2:
+                err = Errors.E919.format(pos_label=positive_label, labels=self.labels)
+                raise ValueError(err)
+            self.cfg["positive_label"] = positive_label
         subbatch = list(islice(get_examples(), 10))
         doc_sample = [eg.reference for eg in subbatch]
         label_sample, _ = self._examples_to_truth(subbatch)
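Because labels and positive_label no longer live in the factory config, a binary, mutually exclusive setup is now declared when the component is initialized. A hedged sketch of the new pattern (the label names and example texts are made up for illustration):

import spacy
from spacy.training import Example

nlp = spacy.blank("en")
textcat = nlp.add_pipe("textcat", config={"threshold": 0.5})
train_examples = [
    Example.from_dict(nlp.make_doc("so good"), {"cats": {"POS": 1.0, "NEG": 0.0}}),
    Example.from_dict(nlp.make_doc("so bad"), {"cats": {"POS": 0.0, "NEG": 1.0}}),
]
# Passing an unknown positive label, or more than two labels, now raises E920/E919 here.
textcat.initialize(lambda: train_examples, labels=["POS", "NEG"], positive_label="POS")

The equivalent config-driven route is to set the same values under the [initialize] block for the textcat component and call nlp.initialize(), as the updated tests below do.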
@@ -905,7 +905,7 @@ def _auc(x, y):
            if np.all(dx <= 0):
                direction = -1
            else:
-               raise ValueError(Errors.E164.format(x))
+               raise ValueError(Errors.E164.format(x=x))

    area = direction * np.trapz(y, x)
    if isinstance(area, np.memmap):
@@ -294,7 +294,8 @@ def zh_tokenizer_pkuseg():
                "segmenter": "pkuseg",
            }
        },
-       "initialize": {"tokenizer": {
+       "initialize": {
+           "tokenizer": {
                "pkuseg_model": "default",
            }
        },
@@ -5,12 +5,14 @@ import pytest
 def i_has(en_tokenizer):
     doc = en_tokenizer("I has")
     doc[0].set_morph({"PronType": "prs"})
-    doc[1].set_morph({
-        "VerbForm": "fin",
-        "Tense": "pres",
-        "Number": "sing",
-        "Person": "three",
-    })
+    doc[1].set_morph(
+        {
+            "VerbForm": "fin",
+            "Tense": "pres",
+            "Number": "sing",
+            "Person": "three",
+        }
+    )

     return doc
@@ -196,3 +196,22 @@ def test_doc_retokenizer_realloc(en_vocab):
             token = doc[0]
             heads = [(token, 0)] * len(token)
             retokenizer.split(doc[token.i], list(token.text), heads=heads)
+
+
+def test_doc_retokenizer_split_norm(en_vocab):
+    """#6060: reset norm in split"""
+    text = "The quick brownfoxjumpsoverthe lazy dog w/ white spots"
+    doc = Doc(en_vocab, words=text.split())
+
+    # Set custom norm on the w/ token.
+    doc[5].norm_ = "with"
+
+    # Retokenize to split out the words in the token at doc[2].
+    token = doc[2]
+    with doc.retokenize() as retokenizer:
+        retokenizer.split(token, ["brown", "fox", "jumps", "over", "the"], heads=[(token, idx) for idx in range(5)])
+
+    assert doc[9].text == "w/"
+    assert doc[9].norm_ == "with"
+    assert doc[5].text == "over"
+    assert doc[5].norm_ == "over"
@@ -322,3 +322,11 @@ def test_span_boundaries(doc):
         span[-5]
     with pytest.raises(IndexError):
         span[5]
+
+
+def test_sent(en_tokenizer):
+    doc = en_tokenizer("Check span.sent raises error if doc is not sentencized.")
+    span = doc[1:3]
+    assert not span.doc.has_annotation("SENT_START")
+    with pytest.raises(ValueError):
+        span.sent
@@ -23,8 +23,9 @@ def test_lemmatizer_initialize(lang, capfd):
         lookups.add_table("lemma_rules", {"verb": [["ing", ""]]})
         return lookups

+    lang_cls = get_lang_class(lang)
     # Test that languages can be initialized
-    nlp = get_lang_class(lang)()
+    nlp = lang_cls()
     lemmatizer = nlp.add_pipe("lemmatizer", config={"mode": "lookup"})
     assert not lemmatizer.lookups.tables
     nlp.config["initialize"]["components"]["lemmatizer"] = {

@@ -41,7 +42,13 @@ def test_lemmatizer_initialize(lang, capfd):
     assert doc[0].lemma_ == "y"

     # Test initialization by calling .initialize() directly
-    nlp = get_lang_class(lang)()
+    nlp = lang_cls()
     lemmatizer = nlp.add_pipe("lemmatizer", config={"mode": "lookup"})
     lemmatizer.initialize(lookups=lemmatizer_init_lookups())
     assert nlp("x")[0].lemma_ == "y"
+
+    # Test lookups config format
+    for mode in ("rule", "lookup", "pos_lookup"):
+        required, optional = lemmatizer.get_lookups_config(mode)
+        assert isinstance(required, list)
+        assert isinstance(optional, list)
@@ -34,7 +34,8 @@ def test_zh_tokenizer_serialize_pkuseg_with_processors(zh_tokenizer_pkuseg):
                "segmenter": "pkuseg",
            }
        },
-       "initialize": {"tokenizer": {
+       "initialize": {
+           "tokenizer": {
                "pkuseg_model": "medicine",
            }
        },
@@ -63,6 +63,39 @@ def morph_rules():
     return {"DT": {"the": {"POS": "DET", "LEMMA": "a", "Case": "Nom"}}}


+def check_tag_map(ruler):
+    doc = Doc(
+        ruler.vocab,
+        words=["This", "is", "a", "test", "."],
+        tags=["DT", "VBZ", "DT", "NN", "."],
+    )
+    doc = ruler(doc)
+    for i in range(len(doc)):
+        if i == 4:
+            assert doc[i].pos_ == "PUNCT"
+            assert str(doc[i].morph) == "PunctType=peri"
+        else:
+            assert doc[i].pos_ == ""
+            assert str(doc[i].morph) == ""
+
+
+def check_morph_rules(ruler):
+    doc = Doc(
+        ruler.vocab,
+        words=["This", "is", "the", "test", "."],
+        tags=["DT", "VBZ", "DT", "NN", "."],
+    )
+    doc = ruler(doc)
+    for i in range(len(doc)):
+        if i != 2:
+            assert doc[i].pos_ == ""
+            assert str(doc[i].morph) == ""
+        else:
+            assert doc[2].pos_ == "DET"
+            assert doc[2].lemma_ == "a"
+            assert str(doc[2].morph) == "Case=Nom"
+
+
 def test_attributeruler_init(nlp, pattern_dicts):
     a = nlp.add_pipe("attribute_ruler")
     for p in pattern_dicts:

@@ -78,7 +111,8 @@ def test_attributeruler_init(nlp, pattern_dicts):

 def test_attributeruler_init_patterns(nlp, pattern_dicts):
     # initialize with patterns
-    nlp.add_pipe("attribute_ruler", config={"pattern_dicts": pattern_dicts})
+    ruler = nlp.add_pipe("attribute_ruler")
+    ruler.initialize(lambda: [], patterns=pattern_dicts)
     doc = nlp("This is a test.")
     assert doc[2].lemma_ == "the"
     assert str(doc[2].morph) == "Case=Nom|Number=Plur"

@@ -88,10 +122,11 @@ def test_attributeruler_init_patterns(nlp, pattern_dicts):
     assert doc.has_annotation("MORPH")
     nlp.remove_pipe("attribute_ruler")
     # initialize with patterns from asset
-    nlp.add_pipe(
-        "attribute_ruler",
-        config={"pattern_dicts": {"@misc": "attribute_ruler_patterns"}},
-    )
+    nlp.config["initialize"]["components"]["attribute_ruler"] = {
+        "patterns": {"@misc": "attribute_ruler_patterns"}
+    }
+    nlp.add_pipe("attribute_ruler")
+    nlp.initialize()
     doc = nlp("This is a test.")
     assert doc[2].lemma_ == "the"
     assert str(doc[2].morph) == "Case=Nom|Number=Plur"

@@ -103,18 +138,15 @@ def test_attributeruler_init_patterns(nlp, pattern_dicts):

 def test_attributeruler_score(nlp, pattern_dicts):
     # initialize with patterns
-    nlp.add_pipe("attribute_ruler", config={"pattern_dicts": pattern_dicts})
+    ruler = nlp.add_pipe("attribute_ruler")
+    ruler.initialize(lambda: [], patterns=pattern_dicts)
     doc = nlp("This is a test.")
     assert doc[2].lemma_ == "the"
     assert str(doc[2].morph) == "Case=Nom|Number=Plur"
     assert doc[3].lemma_ == "cat"
     assert str(doc[3].morph) == "Case=Nom|Number=Sing"
-    dev_examples = [
-        Example.from_dict(
-            nlp.make_doc("This is a test."), {"lemmas": ["this", "is", "a", "cat", "."]}
-        )
-    ]
+    doc = nlp.make_doc("This is a test.")
+    dev_examples = [Example.from_dict(doc, {"lemmas": ["this", "is", "a", "cat", "."]})]
     scores = nlp.evaluate(dev_examples)
     # "cat" is the only correct lemma
     assert scores["lemma_acc"] == pytest.approx(0.2)

@@ -139,40 +171,27 @@ def test_attributeruler_rule_order(nlp):


 def test_attributeruler_tag_map(nlp, tag_map):
-    a = AttributeRuler(nlp.vocab)
-    a.load_from_tag_map(tag_map)
-    doc = Doc(
-        nlp.vocab,
-        words=["This", "is", "a", "test", "."],
-        tags=["DT", "VBZ", "DT", "NN", "."],
-    )
-    doc = a(doc)
-    for i in range(len(doc)):
-        if i == 4:
-            assert doc[i].pos_ == "PUNCT"
-            assert str(doc[i].morph) == "PunctType=peri"
-        else:
-            assert doc[i].pos_ == ""
-            assert str(doc[i].morph) == ""
+    ruler = AttributeRuler(nlp.vocab)
+    ruler.load_from_tag_map(tag_map)
+    check_tag_map(ruler)
+
+
+def test_attributeruler_tag_map_initialize(nlp, tag_map):
+    ruler = nlp.add_pipe("attribute_ruler")
+    ruler.initialize(lambda: [], tag_map=tag_map)
+    check_tag_map(ruler)


 def test_attributeruler_morph_rules(nlp, morph_rules):
-    a = AttributeRuler(nlp.vocab)
-    a.load_from_morph_rules(morph_rules)
-    doc = Doc(
-        nlp.vocab,
-        words=["This", "is", "the", "test", "."],
-        tags=["DT", "VBZ", "DT", "NN", "."],
-    )
-    doc = a(doc)
-    for i in range(len(doc)):
-        if i != 2:
-            assert doc[i].pos_ == ""
-            assert str(doc[i].morph) == ""
-        else:
-            assert doc[2].pos_ == "DET"
-            assert doc[2].lemma_ == "a"
-            assert str(doc[2].morph) == "Case=Nom"
+    ruler = AttributeRuler(nlp.vocab)
+    ruler.load_from_morph_rules(morph_rules)
+    check_morph_rules(ruler)
+
+
+def test_attributeruler_morph_rules_initialize(nlp, morph_rules):
+    ruler = nlp.add_pipe("attribute_ruler")
+    ruler.initialize(lambda: [], morph_rules=morph_rules)
+    check_morph_rules(ruler)


 def test_attributeruler_indices(nlp):
@@ -1,6 +1,7 @@
 import pytest
 from spacy.language import Language
-from spacy.util import SimpleFrozenList
+from spacy.pipeline import Pipe
+from spacy.util import SimpleFrozenList, get_arg_names


 @pytest.fixture

@@ -346,3 +347,60 @@ def test_pipe_methods_frozen():
         nlp.components.sort()
     with pytest.raises(NotImplementedError):
         nlp.component_names.clear()
+
+
+@pytest.mark.parametrize(
+    "pipe", ["tagger", "parser", "ner", "textcat", "morphologizer"],
+)
+def test_pipe_label_data_exports_labels(pipe):
+    nlp = Language()
+    pipe = nlp.add_pipe(pipe)
+    # Make sure pipe has pipe labels
+    assert getattr(pipe, "label_data", None) is not None
+    # Make sure pipe can be initialized with labels
+    initialize = getattr(pipe, "initialize", None)
+    assert initialize is not None
+    assert "labels" in get_arg_names(initialize)
+
+
+@pytest.mark.parametrize("pipe", ["senter", "entity_linker"])
+def test_pipe_label_data_no_labels(pipe):
+    nlp = Language()
+    pipe = nlp.add_pipe(pipe)
+    assert getattr(pipe, "label_data", None) is None
+    initialize = getattr(pipe, "initialize", None)
+    if initialize is not None:
+        assert "labels" not in get_arg_names(initialize)
+
+
+def test_warning_pipe_begin_training():
+    with pytest.warns(UserWarning, match="begin_training"):
+
+        class IncompatPipe(Pipe):
+            def __init__(self):
+                ...
+
+            def begin_training(*args, **kwargs):
+                ...
+
+
+def test_pipe_methods_initialize():
+    """Test that the [initialize] config reflects the components correctly."""
+    nlp = Language()
+    nlp.add_pipe("tagger")
+    assert "tagger" not in nlp.config["initialize"]["components"]
+    nlp.config["initialize"]["components"]["tagger"] = {"labels": ["hello"]}
+    assert nlp.config["initialize"]["components"]["tagger"] == {"labels": ["hello"]}
+    nlp.remove_pipe("tagger")
+    assert "tagger" not in nlp.config["initialize"]["components"]
+    nlp.add_pipe("tagger")
+    assert "tagger" not in nlp.config["initialize"]["components"]
+    nlp.config["initialize"]["components"]["tagger"] = {"labels": ["hello"]}
+    nlp.rename_pipe("tagger", "my_tagger")
+    assert "tagger" not in nlp.config["initialize"]["components"]
+    assert nlp.config["initialize"]["components"]["my_tagger"] == {"labels": ["hello"]}
+    nlp.config["initialize"]["components"]["test"] = {"foo": "bar"}
+    nlp.add_pipe("ner", name="test")
+    assert "test" in nlp.config["initialize"]["components"]
+    nlp.remove_pipe("test")
+    assert "test" not in nlp.config["initialize"]["components"]
@@ -10,7 +10,6 @@ from spacy.tokens import Doc
 from spacy.pipeline.tok2vec import DEFAULT_TOK2VEC_MODEL
 from spacy.scorer import Scorer
 from spacy.training import Example
-from spacy.training.initialize import verify_textcat_config

 from ..util import make_tempdir

@@ -21,6 +20,17 @@ TRAIN_DATA = [
 ]


+def make_get_examples(nlp):
+    train_examples = []
+    for t in TRAIN_DATA:
+        train_examples.append(Example.from_dict(nlp.make_doc(t[0]), t[1]))
+
+    def get_examples():
+        return train_examples
+
+    return get_examples
+
+
 @pytest.mark.skip(reason="Test is flakey when run with others")
 def test_simple_train():
     nlp = Language()

@@ -92,10 +102,7 @@ def test_no_label():
 def test_implicit_label():
     nlp = Language()
     nlp.add_pipe("textcat")
-    train_examples = []
-    for t in TRAIN_DATA:
-        train_examples.append(Example.from_dict(nlp.make_doc(t[0]), t[1]))
-    nlp.initialize(get_examples=lambda: train_examples)
+    nlp.initialize(get_examples=make_get_examples(nlp))


 def test_no_resize():

@@ -113,29 +120,27 @@ def test_no_resize():
 def test_initialize_examples():
     nlp = Language()
     textcat = nlp.add_pipe("textcat")
-    train_examples = []
     for text, annotations in TRAIN_DATA:
-        train_examples.append(Example.from_dict(nlp.make_doc(text), annotations))
         for label, value in annotations.get("cats").items():
             textcat.add_label(label)
     # you shouldn't really call this more than once, but for testing it should be fine
     nlp.initialize()
-    nlp.initialize(get_examples=lambda: train_examples)
+    get_examples = make_get_examples(nlp)
+    nlp.initialize(get_examples=get_examples)
     with pytest.raises(ValueError):
         nlp.initialize(get_examples=lambda: None)
     with pytest.raises(ValueError):
-        nlp.initialize(get_examples=train_examples)
+        nlp.initialize(get_examples=get_examples())


 def test_overfitting_IO():
     # Simple test to try and quickly overfit the textcat component - ensuring the ML models work correctly
     fix_random_seed(0)
     nlp = English()
+    nlp.config["initialize"]["components"]["textcat"] = {"positive_label": "POSITIVE"}
     # Set exclusive labels
-    textcat = nlp.add_pipe(
-        "textcat",
-        config={"model": {"exclusive_classes": True}, "positive_label": "POSITIVE"},
-    )
+    config = {"model": {"exclusive_classes": True}}
+    textcat = nlp.add_pipe("textcat", config=config)
     train_examples = []
     for text, annotations in TRAIN_DATA:
         train_examples.append(Example.from_dict(nlp.make_doc(text), annotations))

@@ -203,28 +208,28 @@ def test_textcat_configs(textcat_config):

 def test_positive_class():
     nlp = English()
-    pipe_config = {"positive_label": "POS", "labels": ["POS", "NEG"]}
-    textcat = nlp.add_pipe("textcat", config=pipe_config)
+    textcat = nlp.add_pipe("textcat")
+    get_examples = make_get_examples(nlp)
+    textcat.initialize(get_examples, labels=["POS", "NEG"], positive_label="POS")
     assert textcat.labels == ("POS", "NEG")
-    verify_textcat_config(nlp, pipe_config)


 def test_positive_class_not_present():
     nlp = English()
-    pipe_config = {"positive_label": "POS", "labels": ["SOME", "THING"]}
-    textcat = nlp.add_pipe("textcat", config=pipe_config)
-    assert textcat.labels == ("SOME", "THING")
+    textcat = nlp.add_pipe("textcat")
+    get_examples = make_get_examples(nlp)
     with pytest.raises(ValueError):
-        verify_textcat_config(nlp, pipe_config)
+        textcat.initialize(get_examples, labels=["SOME", "THING"], positive_label="POS")


 def test_positive_class_not_binary():
     nlp = English()
-    pipe_config = {"positive_label": "POS", "labels": ["SOME", "THING", "POS"]}
-    textcat = nlp.add_pipe("textcat", config=pipe_config)
-    assert textcat.labels == ("SOME", "THING", "POS")
+    textcat = nlp.add_pipe("textcat")
+    get_examples = make_get_examples(nlp)
     with pytest.raises(ValueError):
-        verify_textcat_config(nlp, pipe_config)
+        textcat.initialize(
+            get_examples, labels=["SOME", "THING", "POS"], positive_label="POS"
+        )


 def test_textcat_evaluation():
@@ -92,7 +92,13 @@ def test_serialize_doc_bin_unknown_spaces(en_vocab):


 @pytest.mark.parametrize(
-    "writer_flag,reader_flag,reader_value", [(True, True, "bar"), (True, False, "bar"), (False, True, "nothing"), (False, False, "nothing")]
+    "writer_flag,reader_flag,reader_value",
+    [
+        (True, True, "bar"),
+        (True, False, "bar"),
+        (False, True, "nothing"),
+        (False, False, "nothing"),
+    ],
 )
 def test_serialize_custom_extension(en_vocab, writer_flag, reader_flag, reader_value):
     """Test that custom extensions are correctly serialized in DocBin."""
@@ -136,13 +136,7 @@ def test_serialize_textcat_empty(en_vocab):
     # See issue #1105
     cfg = {"model": DEFAULT_TEXTCAT_MODEL}
     model = registry.resolve(cfg, validate=True)["model"]
-    textcat = TextCategorizer(
-        en_vocab,
-        model,
-        labels=["ENTITY", "ACTION", "MODIFIER"],
-        threshold=0.5,
-        positive_label=None,
-    )
+    textcat = TextCategorizer(en_vocab, model, threshold=0.5)
     textcat.to_bytes(exclude=["vocab"])
@@ -158,7 +158,7 @@ def test_las_per_type(en_vocab):
     examples = []
     for input_, annot in test_las_apple:
         doc = Doc(
-            en_vocab, words=input_.split(" "), heads=annot["heads"], deps=annot["deps"],
+            en_vocab, words=input_.split(" "), heads=annot["heads"], deps=annot["deps"]
         )
         gold = {"heads": annot["heads"], "deps": annot["deps"]}
         doc[0].dep_ = "compound"

@@ -182,9 +182,7 @@ def test_ner_per_type(en_vocab):
     examples = []
     for input_, annot in test_ner_cardinal:
         doc = Doc(
-            en_vocab,
-            words=input_.split(" "),
-            ents=["B-CARDINAL", "O", "B-CARDINAL"],
+            en_vocab, words=input_.split(" "), ents=["B-CARDINAL", "O", "B-CARDINAL"]
         )
         entities = offsets_to_biluo_tags(doc, annot["entities"])
         example = Example.from_dict(doc, {"entities": entities})
100
spacy/tests/training/test_augmenters.py
Normal file

@@ -0,0 +1,100 @@
+import pytest
+from spacy.training import Corpus
+from spacy.training.augment import create_orth_variants_augmenter
+from spacy.training.augment import create_lower_casing_augmenter
+from spacy.lang.en import English
+from spacy.tokens import DocBin, Doc
+from contextlib import contextmanager
+import random
+
+from ..util import make_tempdir
+
+
+@contextmanager
+def make_docbin(docs, name="roundtrip.spacy"):
+    with make_tempdir() as tmpdir:
+        output_file = tmpdir / name
+        DocBin(docs=docs).to_disk(output_file)
+        yield output_file
+
+
+@pytest.fixture
+def nlp():
+    return English()
+
+
+@pytest.fixture
+def doc(nlp):
+    # fmt: off
+    words = ["Sarah", "'s", "sister", "flew", "to", "Silicon", "Valley", "via", "London", "."]
+    tags = ["NNP", "POS", "NN", "VBD", "IN", "NNP", "NNP", "IN", "NNP", "."]
+    pos = ["PROPN", "PART", "NOUN", "VERB", "ADP", "PROPN", "PROPN", "ADP", "PROPN", "PUNCT"]
+    ents = ["B-PERSON", "I-PERSON", "O", "O", "O", "B-LOC", "I-LOC", "O", "B-GPE", "O"]
+    cats = {"TRAVEL": 1.0, "BAKING": 0.0}
+    # fmt: on
+    doc = Doc(nlp.vocab, words=words, tags=tags, pos=pos, ents=ents)
+    doc.cats = cats
+    return doc
+
+
+@pytest.mark.filterwarnings("ignore::UserWarning")
+def test_make_orth_variants(nlp, doc):
+    single = [
+        {"tags": ["NFP"], "variants": ["…", "..."]},
+        {"tags": [":"], "variants": ["-", "—", "–", "--", "---", "——"]},
+    ]
+    augmenter = create_orth_variants_augmenter(
+        level=0.2, lower=0.5, orth_variants={"single": single}
+    )
+    with make_docbin([doc]) as output_file:
+        reader = Corpus(output_file, augmenter=augmenter)
+        # Due to randomness, only test that it works without errors for now
+        list(reader(nlp))
+
+
+def test_lowercase_augmenter(nlp, doc):
+    augmenter = create_lower_casing_augmenter(level=1.0)
+    with make_docbin([doc]) as output_file:
+        reader = Corpus(output_file, augmenter=augmenter)
+        corpus = list(reader(nlp))
+    eg = corpus[0]
+    assert eg.reference.text == doc.text.lower()
+    assert eg.predicted.text == doc.text.lower()
+    ents = [(e.start, e.end, e.label) for e in doc.ents]
+    assert [(e.start, e.end, e.label) for e in eg.reference.ents] == ents
+    for ref_ent, orig_ent in zip(eg.reference.ents, doc.ents):
+        assert ref_ent.text == orig_ent.text.lower()
+    assert [t.pos_ for t in eg.reference] == [t.pos_ for t in doc]
+
+
+@pytest.mark.filterwarnings("ignore::UserWarning")
+def test_custom_data_augmentation(nlp, doc):
+    def create_spongebob_augmenter(randomize: bool = False):
+        def augment(nlp, example):
+            text = example.text
+            if randomize:
+                ch = [c.lower() if random.random() < 0.5 else c.upper() for c in text]
+            else:
+                ch = [c.lower() if i % 2 else c.upper() for i, c in enumerate(text)]
+            example_dict = example.to_dict()
+            doc = nlp.make_doc("".join(ch))
+            example_dict["token_annotation"]["ORTH"] = [t.text for t in doc]
+            yield example
+            yield example.from_dict(doc, example_dict)
+
+        return augment
+
+    with make_docbin([doc]) as output_file:
+        reader = Corpus(output_file, augmenter=create_spongebob_augmenter())
+        corpus = list(reader(nlp))
+    orig_text = "Sarah 's sister flew to Silicon Valley via London . "
+    augmented = "SaRaH 's sIsTeR FlEw tO SiLiCoN VaLlEy vIa lOnDoN . "
+    assert corpus[0].text == orig_text
+    assert corpus[0].reference.text == orig_text
+    assert corpus[0].predicted.text == orig_text
+    assert corpus[1].text == augmented
+    assert corpus[1].reference.text == augmented
+    assert corpus[1].predicted.text == augmented
+    ents = [(e.start, e.end, e.label) for e in doc.ents]
+    assert [(e.start, e.end, e.label) for e in corpus[0].reference.ents] == ents
+    assert [(e.start, e.end, e.label) for e in corpus[1].reference.ents] == ents
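A custom augmenter like the one in this test can also be registered so a training config can refer to it by name. This sketch uses the same augmenters registry as the built-in callbacks elsewhere in this commit, with a made-up registry name:

import spacy

@spacy.registry.augmenters("my_spongebob_augmenter.v1")
def create_spongebob_augmenter():
    def augment(nlp, example):
        text = example.text
        ch = [c.lower() if i % 2 else c.upper() for i, c in enumerate(text)]
        example_dict = example.to_dict()
        doc = nlp.make_doc("".join(ch))
        example_dict["token_annotation"]["ORTH"] = [t.text for t in doc]
        yield example                               # keep the original example
        yield example.from_dict(doc, example_dict)  # plus the augmented copy
    return augment

A training config could then select it as the corpus augmenter by that (hypothetical) name.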
@@ -1,23 +1,20 @@
 import numpy
 from spacy.training import offsets_to_biluo_tags, biluo_tags_to_offsets, Alignment
 from spacy.training import biluo_tags_to_spans, iob_to_biluo
-from spacy.training import Corpus, docs_to_json
-from spacy.training.example import Example
+from spacy.training import Corpus, docs_to_json, Example
 from spacy.training.converters import json_to_docs
-from spacy.training.augment import create_orth_variants_augmenter
 from spacy.lang.en import English
 from spacy.tokens import Doc, DocBin
 from spacy.util import get_words_and_spaces, minibatch
 from thinc.api import compounding
 import pytest
 import srsly
-import random

 from ..util import make_tempdir


 @pytest.fixture
-def doc(en_vocab):
+def doc():
     nlp = English()  # make sure we get a new vocab every time
     # fmt: off
     words = ["Sarah", "'s", "sister", "flew", "to", "Silicon", "Valley", "via", "London", "."]

@@ -495,59 +492,6 @@ def test_roundtrip_docs_to_docbin(doc):
     assert cats["BAKING"] == reloaded_example.reference.cats["BAKING"]


-@pytest.mark.filterwarnings("ignore::UserWarning")
-def test_make_orth_variants(doc):
-    nlp = English()
-    orth_variants = {
-        "single": [
-            {"tags": ["NFP"], "variants": ["…", "..."]},
-            {"tags": [":"], "variants": ["-", "—", "–", "--", "---", "——"]},
-        ]
-    }
-    augmenter = create_orth_variants_augmenter(
-        level=0.2, lower=0.5, orth_variants=orth_variants
-    )
-    with make_tempdir() as tmpdir:
-        output_file = tmpdir / "roundtrip.spacy"
-        DocBin(docs=[doc]).to_disk(output_file)
-        # due to randomness, test only that this runs with no errors for now
-        reader = Corpus(output_file, augmenter=augmenter)
-        list(reader(nlp))
-
-
-@pytest.mark.filterwarnings("ignore::UserWarning")
-def test_custom_data_augmentation(doc):
-    def create_spongebob_augmenter(randomize: bool = False):
-        def augment(nlp, example):
-            text = example.text
-            if randomize:
-                ch = [c.lower() if random.random() < 0.5 else c.upper() for c in text]
-            else:
-                ch = [c.lower() if i % 2 else c.upper() for i, c in enumerate(text)]
-            example_dict = example.to_dict()
-            doc = nlp.make_doc("".join(ch))
-            example_dict["token_annotation"]["ORTH"] = [t.text for t in doc]
-            yield example
-            yield example.from_dict(doc, example_dict)
-
-        return augment
-
-    nlp = English()
-    with make_tempdir() as tmpdir:
-        output_file = tmpdir / "roundtrip.spacy"
-        DocBin(docs=[doc]).to_disk(output_file)
-        reader = Corpus(output_file, augmenter=create_spongebob_augmenter())
-        corpus = list(reader(nlp))
-    orig_text = "Sarah 's sister flew to Silicon Valley via London . "
-    augmented = "SaRaH 's sIsTeR FlEw tO SiLiCoN VaLlEy vIa lOnDoN . "
-    assert corpus[0].text == orig_text
-    assert corpus[0].reference.text == orig_text
-    assert corpus[0].predicted.text == orig_text
-    assert corpus[1].text == augmented
-    assert corpus[1].reference.text == augmented
-    assert corpus[1].predicted.text == augmented
-
-
 @pytest.mark.skip("Outdated")
 @pytest.mark.parametrize(
     "tokens_a,tokens_b,expected",
@@ -336,6 +336,7 @@ def _split(Doc doc, int token_index, orths, heads, attrs):
        lex = doc.vocab.get(doc.mem, orth)
        token.lex = lex
        token.lemma = 0  # reset lemma
+       token.norm = 0  # reset norm
        if to_process_tensor:
            # setting the tensors of the split tokens to array of zeros
            doc.tensor[token_index + i] = xp.zeros((1,doc.tensor.shape[1]), dtype="float32")
@@ -245,7 +245,7 @@ cdef class Doc:
            self.noun_chunks_iterator = self.vocab.get_noun_chunks
        cdef bint has_space
        if words is None and spaces is not None:
-           raise ValueError("words must be set if spaces is set")
+           raise ValueError(Errors.E908)
        elif spaces is None and words is not None:
            self.has_unknown_spaces = True
        else:

@@ -309,7 +309,7 @@ cdef class Doc:
            else:
                if len(ent) < 3 or ent[1] != "-":
                    raise ValueError(Errors.E177.format(tag=ent))
                ent_iob, ent_type = ent.split("-", 1)
                if ent_iob not in iob_strings:
                    raise ValueError(Errors.E177.format(tag=ent))
                ent_iob = iob_strings.index(ent_iob)
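The stricter constructor check above replaces a hard-coded message with Errors.E908 when spaces is passed without words. A small sketch of the constraint (the example words are made up):

from spacy.vocab import Vocab
from spacy.tokens import Doc

vocab = Vocab()
# Fine: words with a matching spaces list
doc = Doc(vocab, words=["Hello", "world", "!"], spaces=[True, False, False])

# Raises ValueError (Errors.E908): spaces given without words
try:
    Doc(vocab, spaces=[True, False])
except ValueError as err:
    print(err)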
@@ -17,7 +17,7 @@ from ..lexeme cimport Lexeme
 from ..symbols cimport dep

 from ..util import normalize_slice
-from ..errors import Errors, TempErrors, Warnings
+from ..errors import Errors, Warnings
 from .underscore import Underscore, get_ext_args


@@ -362,8 +362,6 @@ cdef class Span:
        """RETURNS (Span): The sentence span that the span is a part of."""
        if "sent" in self.doc.user_span_hooks:
            return self.doc.user_span_hooks["sent"](self)
-       # This should raise if not parsed / no custom sentence boundaries
-       self.doc.sents
        # Use `sent_start` token attribute to find sentence boundaries
        cdef int n = 0
        if self.doc.has_annotation("SENT_START"):

@@ -373,13 +371,14 @@ cdef class Span:
                start += -1
            # Find end of the sentence
            end = self.end
-           n = 0
            while end < self.doc.length and self.doc.c[end].sent_start != 1:
                end += 1
                n += 1
                if n >= self.doc.length:
                    break
            return self.doc[start:end]
+       else:
+           raise ValueError(Errors.E030)

    @property
    def ents(self):

@@ -652,7 +651,7 @@ cdef class Span:
            return self.root.ent_id

        def __set__(self, hash_t key):
-           raise NotImplementedError(TempErrors.T007.format(attr="ent_id"))
+           raise NotImplementedError(Errors.E200.format(attr="ent_id"))

    property ent_id_:
        """RETURNS (str): The (string) entity ID."""

@@ -660,7 +659,7 @@ cdef class Span:
            return self.root.ent_id_

        def __set__(self, hash_t key):
-           raise NotImplementedError(TempErrors.T007.format(attr="ent_id_"))
+           raise NotImplementedError(Errors.E200.format(attr="ent_id_"))

    @property
    def orth_(self):
@@ -30,20 +30,51 @@ class OrthVariants(BaseModel):

 @registry.augmenters("spacy.orth_variants.v1")
 def create_orth_variants_augmenter(
-    level: float, lower: float, orth_variants: OrthVariants,
+    level: float, lower: float, orth_variants: OrthVariants
 ) -> Callable[["Language", Example], Iterator[Example]]:
     """Create a data augmentation callback that uses orth-variant replacement.
     The callback can be added to a corpus or other data iterator during training.

+    level (float): The percentage of texts that will be augmented.
+    lower (float): The percentage of texts that will be lowercased.
+    orth_variants (Dict[str, dict]): A dictionary containing the single and
+        paired orth variants. Typically loaded from a JSON file.
+    RETURNS (Callable[[Language, Example], Iterator[Example]]): The augmenter.
     """
     return partial(
         orth_variants_augmenter, orth_variants=orth_variants, level=level, lower=lower
     )


+@registry.augmenters("spacy.lower_case.v1")
+def create_lower_casing_augmenter(
+    level: float,
+) -> Callable[["Language", Example], Iterator[Example]]:
+    """Create a data augmentation callback that converts documents to lowercase.
+    The callback can be added to a corpus or other data iterator during training.
+
+    level (float): The percentage of texts that will be augmented.
+    RETURNS (Callable[[Language, Example], Iterator[Example]]): The augmenter.
+    """
+    return partial(lower_casing_augmenter, level=level)
+
+
 def dont_augment(nlp: "Language", example: Example) -> Iterator[Example]:
     yield example


+def lower_casing_augmenter(
+    nlp: "Language", example: Example, *, level: float,
+) -> Iterator[Example]:
+    if random.random() >= level:
+        yield example
+    else:
+        example_dict = example.to_dict()
+        doc = nlp.make_doc(example.text.lower())
+        example_dict["token_annotation"]["ORTH"] = [t.lower_ for t in doc]
+        yield example.from_dict(doc, example_dict)
+
+
 def orth_variants_augmenter(
     nlp: "Language",
     example: Example,
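The new augmenter is registered as "spacy.lower_case.v1", so it can be used programmatically or referenced from a training config. A hedged sketch of the programmatic route (the .spacy path is a placeholder and must exist for the last line to run):

from spacy.lang.en import English
from spacy.training import Corpus
from spacy.training.augment import create_lower_casing_augmenter

nlp = English()
augmenter = create_lower_casing_augmenter(level=0.3)  # lowercase roughly 30% of examples
reader = Corpus("corpus/train.spacy", augmenter=augmenter)  # placeholder path
examples = list(reader(nlp))

In a config, the same callback would be selected as the training corpus augmenter by its registered name, with level as its setting.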
@@ -2,9 +2,9 @@ from wasabi import Printer

 from .. import tags_to_entities
 from ...training import iob_to_biluo
-from ...lang.xx import MultiLanguage
 from ...tokens import Doc, Span
-from ...util import load_model
+from ...errors import Errors
+from ...util import load_model, get_lang_class


 def conll_ner_to_docs(

@@ -86,7 +86,7 @@ def conll_ner_to_docs(
     if model:
         nlp = load_model(model)
     else:
-        nlp = MultiLanguage()
+        nlp = get_lang_class("xx")()
     output_docs = []
     for conll_doc in input_data.strip().split(doc_delimiter):
         conll_doc = conll_doc.strip()

@@ -103,11 +103,7 @@ def conll_ner_to_docs(
            lines = [line.strip() for line in conll_sent.split("\n") if line.strip()]
            cols = list(zip(*[line.split() for line in lines]))
            if len(cols) < 2:
-               raise ValueError(
-                   "The token-per-line NER file is not formatted correctly. "
-                   "Try checking whitespace and delimiters. See "
-                   "https://nightly.spacy.io/api/cli#convert"
-               )
+               raise ValueError(Errors.E093)
            length = len(cols[0])
            words.extend(cols[0])
            sent_starts.extend([True] + [False] * (length - 1))

@@ -136,7 +132,7 @@ def segment_sents_and_docs(doc, n_sents, doc_delimiter, model=None, msg=None):
            "Segmenting sentences with sentencizer. (Use `-b model` for "
            "improved parser-based sentence segmentation.)"
        )
-       nlp = MultiLanguage()
+       nlp = get_lang_class("xx")()
        sentencizer = nlp.create_pipe("sentencizer")
        lines = doc.strip().split("\n")
        words = [line.strip().split()[0] for line in lines]
@@ -4,6 +4,7 @@ from .conll_ner_to_docs import n_sents_info
 from ...vocab import Vocab
 from ...training import iob_to_biluo, tags_to_entities
 from ...tokens import Doc, Span
+from ...errors import Errors
 from ...util import minibatch


@@ -45,9 +46,7 @@ def read_iob(raw_sents, vocab, n_sents):
            sent_words, sent_iob = zip(*sent_tokens)
            sent_tags = ["-"] * len(sent_words)
        else:
-           raise ValueError(
-               "The sentence-per-line IOB/IOB2 file is not formatted correctly. Try checking whitespace and delimiters. See https://nightly.spacy.io/api/cli#convert"
-           )
+           raise ValueError(Errors.E092)
        words.extend(sent_words)
        tags.extend(sent_tags)
        iob.extend(sent_iob)
@@ -12,6 +12,7 @@ from .iob_utils import biluo_to_iob, offsets_to_biluo_tags, doc_to_biluo_tags
 from .iob_utils import biluo_tags_to_spans
 from ..errors import Errors, Warnings
 from ..pipeline._parser_internals import nonproj
+from ..util import logger


 cpdef Doc annotations_to_doc(vocab, tok_annot, doc_annot):

@@ -390,7 +391,7 @@ def _fix_legacy_dict_data(example_dict):
        if "HEAD" in token_dict and "SENT_START" in token_dict:
            # If heads are set, we don't also redundantly specify SENT_START.
            token_dict.pop("SENT_START")
-           warnings.warn(Warnings.W092)
+           logger.debug(Warnings.W092)
    return {
        "token_annotation": token_dict,
        "doc_annotation": doc_dict
@@ -50,9 +50,6 @@ def init_nlp(config: Config, *, use_gpu: int = -1) -> "Language":
    with nlp.select_pipes(disable=[*frozen_components, *resume_components]):
        nlp.initialize(lambda: train_corpus(nlp), sgd=optimizer)
    logger.info("Initialized pipeline components")
-   # Verify the config after calling 'initialize' to ensure labels
-   # are properly initialized
-   verify_config(nlp)
    return nlp


@@ -102,7 +99,7 @@ def load_vectors_into_model(
            "with the packaged vectors. Make sure that the vectors package you're "
            "loading is compatible with the current version of spaCy."
        )
-       err = ConfigValidationError.from_error(config=None, title=title, desc=desc)
+       err = ConfigValidationError.from_error(e, config=None, title=title, desc=desc)
        raise err from None
    nlp.vocab.vectors = vectors_nlp.vocab.vectors
    if add_strings:

@@ -152,33 +149,6 @@ def init_tok2vec(
    return False


-def verify_config(nlp: "Language") -> None:
-    """Perform additional checks based on the config, loaded nlp object and training data."""
-    # TODO: maybe we should validate based on the actual components, the list
-    # in config["nlp"]["pipeline"] instead?
-    for pipe_config in nlp.config["components"].values():
-        # We can't assume that the component name == the factory
-        factory = pipe_config["factory"]
-        if factory == "textcat":
-            verify_textcat_config(nlp, pipe_config)
-
-
-def verify_textcat_config(nlp: "Language", pipe_config: Dict[str, Any]) -> None:
-    # if 'positive_label' is provided: double check whether it's in the data and
-    # the task is binary
-    if pipe_config.get("positive_label"):
-        textcat_labels = nlp.get_pipe("textcat").labels
-        pos_label = pipe_config.get("positive_label")
-        if pos_label not in textcat_labels:
-            raise ValueError(
-                Errors.E920.format(pos_label=pos_label, labels=textcat_labels)
-            )
-        if len(list(textcat_labels)) != 2:
-            raise ValueError(
-                Errors.E919.format(pos_label=pos_label, labels=textcat_labels)
-            )
-
-
 def get_sourced_components(config: Union[Dict[str, Any], Config]) -> List[str]:
    """RETURNS (List[str]): All sourced components in the original config,
    e.g. {"source": "en_core_web_sm"}. If the config contains a key
@@ -1,18 +1,25 @@
-from typing import Dict, Any, Tuple, Callable, List
+from typing import TYPE_CHECKING, Dict, Any, Tuple, Callable, List, Optional, IO
+from wasabi import Printer
+import tqdm
+import sys

 from ..util import registry
 from .. import util
 from ..errors import Errors
-from wasabi import msg
+
+if TYPE_CHECKING:
+    from ..language import Language  # noqa: F401


 @registry.loggers("spacy.ConsoleLogger.v1")
-def console_logger():
+def console_logger(progress_bar: bool = False):
     def setup_printer(
-        nlp: "Language",
-    ) -> Tuple[Callable[[Dict[str, Any]], None], Callable]:
+        nlp: "Language", stdout: IO = sys.stdout, stderr: IO = sys.stderr
+    ) -> Tuple[Callable[[Optional[Dict[str, Any]]], None], Callable[[], None]]:
+        msg = Printer(no_print=True)
         # we assume here that only components are enabled that should be trained & logged
         logged_pipes = nlp.pipe_names
+        eval_frequency = nlp.config["training"]["eval_frequency"]
         score_weights = nlp.config["training"]["score_weights"]
         score_cols = [col for col, value in score_weights.items() if value is not None]
         score_widths = [max(len(col), 6) for col in score_cols]
@@ -22,10 +29,18 @@ def console_logger():
         table_header = [col.upper() for col in table_header]
         table_widths = [3, 6] + loss_widths + score_widths + [6]
         table_aligns = ["r" for _ in table_widths]
-        msg.row(table_header, widths=table_widths)
-        msg.row(["-" * width for width in table_widths])
+        stdout.write(msg.row(table_header, widths=table_widths) + "\n")
+        stdout.write(msg.row(["-" * width for width in table_widths]) + "\n")
+        progress = None

-        def log_step(info: Dict[str, Any]):
+        def log_step(info: Optional[Dict[str, Any]]) -> None:
+            nonlocal progress
+
+            if info is None:
+                # If we don't have a new checkpoint, just return.
+                if progress is not None:
+                    progress.update(1)
+                return
             try:
                 losses = [
                     "{0:.2f}".format(float(info["losses"][pipe_name]))
@@ -39,26 +54,36 @@ def console_logger():
                         keys=list(info["losses"].keys()),
                     )
                 ) from None
             scores = []
             for col in score_cols:
                 score = info["other_scores"].get(col, 0.0)
                 try:
                     score = float(score)
-                    if col != "speed":
-                        score *= 100
-                    scores.append("{0:.2f}".format(score))
                 except TypeError:
                     err = Errors.E916.format(name=col, score_type=type(score))
                     raise ValueError(err) from None
+                if col != "speed":
+                    score *= 100
+                scores.append("{0:.2f}".format(score))
             data = (
                 [info["epoch"], info["step"]]
                 + losses
                 + scores
                 + ["{0:.2f}".format(float(info["score"]))]
             )
-            msg.row(data, widths=table_widths, aligns=table_aligns)
+            if progress is not None:
+                progress.close()
+            stdout.write(msg.row(data, widths=table_widths, aligns=table_aligns) + "\n")
+            if progress_bar:
+                # Set disable=None, so that it disables on non-TTY
+                progress = tqdm.tqdm(
+                    total=eval_frequency, disable=None, leave=False, file=stderr
+                )
+                progress.set_description(f"Epoch {info['epoch']+1}")

-        def finalize():
+        def finalize() -> None:
             pass

         return log_step, finalize
@@ -70,31 +95,32 @@ def console_logger():
 def wandb_logger(project_name: str, remove_config_values: List[str] = []):
     import wandb

-    console = console_logger()
+    console = console_logger(progress_bar=False)

     def setup_logger(
-        nlp: "Language",
-    ) -> Tuple[Callable[[Dict[str, Any]], None], Callable]:
+        nlp: "Language", stdout: IO = sys.stdout, stderr: IO = sys.stderr
+    ) -> Tuple[Callable[[Dict[str, Any]], None], Callable[[], None]]:
         config = nlp.config.interpolate()
         config_dot = util.dict_to_dot(config)
         for field in remove_config_values:
             del config_dot[field]
         config = util.dot_to_dict(config_dot)
         wandb.init(project=project_name, config=config, reinit=True)
-        console_log_step, console_finalize = console(nlp)
+        console_log_step, console_finalize = console(nlp, stdout, stderr)

-        def log_step(info: Dict[str, Any]):
+        def log_step(info: Optional[Dict[str, Any]]):
             console_log_step(info)
-            score = info["score"]
-            other_scores = info["other_scores"]
-            losses = info["losses"]
-            wandb.log({"score": score})
-            if losses:
-                wandb.log({f"loss_{k}": v for k, v in losses.items()})
-            if isinstance(other_scores, dict):
-                wandb.log(other_scores)
+            if info is not None:
+                score = info["score"]
+                other_scores = info["other_scores"]
+                losses = info["losses"]
+                wandb.log({"score": score})
+                if losses:
+                    wandb.log({f"loss_{k}": v for k, v in losses.items()})
+                if isinstance(other_scores, dict):
+                    wandb.log(other_scores)

-        def finalize():
+        def finalize() -> None:
             console_finalize()
             wandb.join()
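For reference, a rough sketch of calling the updated console logger directly instead of through the training config; the `progress_bar` argument and the `(nlp, stdout, stderr)` signature come from the hunks above, while the blank pipeline and the import path are illustrative assumptions:

```python
import sys
import spacy
from spacy.training.loggers import console_logger  # assumed import path

nlp = spacy.blank("en")
nlp.add_pipe("tagger")
# console_logger() returns a setup function; calling it wires up the streams.
log_step, finalize = console_logger(progress_bar=False)(nlp, sys.stdout, sys.stderr)
log_step(None)  # no new checkpoint: with progress_bar=False this is a no-op
finalize()
```

In a config-driven run the same option would normally be set on the `spacy.ConsoleLogger.v1` block under `[training.logger]`.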
@@ -1,11 +1,11 @@
-from typing import List, Callable, Tuple, Dict, Iterable, Iterator, Union, Any
+from typing import List, Callable, Tuple, Dict, Iterable, Iterator, Union, Any, IO
 from typing import Optional, TYPE_CHECKING
 from pathlib import Path
 from timeit import default_timer as timer
 from thinc.api import Optimizer, Config, constant, fix_random_seed, set_gpu_allocator
 import random
-import tqdm
-from wasabi import Printer
+import wasabi
+import sys

 from .example import Example
 from ..schemas import ConfigSchemaTraining
@@ -21,7 +21,8 @@ def train(
     output_path: Optional[Path] = None,
     *,
     use_gpu: int = -1,
-    silent: bool = False,
+    stdout: IO = sys.stdout,
+    stderr: IO = sys.stderr,
 ) -> None:
     """Train a pipeline.

@@ -29,10 +30,15 @@ def train(
     output_path (Path): Optional output path to save trained model to.
     use_gpu (int): Whether to train on GPU. Make sure to call require_gpu
         before calling this function.
-    silent (bool): Whether to pretty-print outputs.
+    stdout (file): A file-like object to write output messages. To disable
+        printing, set to io.StringIO.
+    stderr (file): A second file-like object to write output messages. To disable
+        printing, set to io.StringIO.

     RETURNS (Path / None): The path to the final exported model.
     """
-    msg = Printer(no_print=silent)
+    # We use no_print here so we can respect the stdout/stderr options.
+    msg = wasabi.Printer(no_print=True)
     # Create iterator, which yields out info after each optimization step.
     config = nlp.config.interpolate()
     if config["training"]["seed"] is not None:
@@ -63,50 +69,47 @@ def train(
         eval_frequency=T["eval_frequency"],
         exclude=frozen_components,
     )
-    msg.info(f"Pipeline: {nlp.pipe_names}")
+    stdout.write(msg.info(f"Pipeline: {nlp.pipe_names}") + "\n")
     if frozen_components:
-        msg.info(f"Frozen components: {frozen_components}")
-    msg.info(f"Initial learn rate: {optimizer.learn_rate}")
+        stdout.write(msg.info(f"Frozen components: {frozen_components}") + "\n")
+    stdout.write(msg.info(f"Initial learn rate: {optimizer.learn_rate}") + "\n")
     with nlp.select_pipes(disable=frozen_components):
-        print_row, finalize_logger = train_logger(nlp)
+        log_step, finalize_logger = train_logger(nlp, stdout, stderr)
     try:
-        progress = tqdm.tqdm(total=T["eval_frequency"], leave=False)
-        progress.set_description(f"Epoch 1")
         for batch, info, is_best_checkpoint in training_step_iterator:
-            progress.update(1)
-            if is_best_checkpoint is not None:
-                progress.close()
-                print_row(info)
-                if is_best_checkpoint and output_path is not None:
-                    with nlp.select_pipes(disable=frozen_components):
-                        update_meta(T, nlp, info)
-                    with nlp.use_params(optimizer.averages):
-                        nlp = before_to_disk(nlp)
-                        nlp.to_disk(output_path / "model-best")
-                progress = tqdm.tqdm(total=T["eval_frequency"], leave=False)
-                progress.set_description(f"Epoch {info['epoch']}")
+            log_step(info if is_best_checkpoint is not None else None)
+            if is_best_checkpoint is not None and output_path is not None:
+                with nlp.select_pipes(disable=frozen_components):
+                    update_meta(T, nlp, info)
+                with nlp.use_params(optimizer.averages):
+                    nlp = before_to_disk(nlp)
+                    nlp.to_disk(output_path / "model-best")
     except Exception as e:
-        finalize_logger()
         if output_path is not None:
             # We don't want to swallow the traceback if we don't have a
-            # specific error.
-            msg.warn(
-                f"Aborting and saving the final best model. "
-                f"Encountered exception: {str(e)}"
+            # specific error, but we do want to warn that we're trying
+            # to do something here.
+            stdout.write(
+                msg.warn(
+                    f"Aborting and saving the final best model. "
+                    f"Encountered exception: {str(e)}"
+                )
+                + "\n"
             )
-            nlp = before_to_disk(nlp)
-            nlp.to_disk(output_path / "model-final")
         raise e
     finally:
         finalize_logger()
         if output_path is not None:
-            final_model_path = output_path / "model-final"
+            final_model_path = output_path / "model-last"
             if optimizer.averages:
                 with nlp.use_params(optimizer.averages):
                     nlp.to_disk(final_model_path)
             else:
                 nlp.to_disk(final_model_path)
-            msg.good(f"Saved pipeline to output directory", final_model_path)
+            # This will only run if we don't hit an error
+            stdout.write(
+                msg.good("Saved pipeline to output directory", final_model_path) + "\n"
            )


 def train_while_improving(
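Since `train` no longer takes `silent`, callers that want a quiet run now pass their own streams. A small sketch under the assumption that `nlp` is an already configured and initialized pipeline:

```python
import io
from spacy.training.loop import train

quiet = io.StringIO()  # collects the table output instead of printing it
# `nlp` is assumed to be an initialized pipeline (e.g. the result of init_nlp)
train(nlp, output_path=None, use_gpu=-1, stdout=quiet, stderr=quiet)
```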
@@ -16,6 +16,7 @@ from ..attrs import ID
 from ..ml.models.multi_task import build_cloze_multi_task_model
 from ..ml.models.multi_task import build_cloze_characters_multi_task_model
 from ..schemas import ConfigSchemaTraining, ConfigSchemaPretrain
+from ..errors import Errors
 from ..util import registry, load_model_from_config, dot_to_object


@@ -151,9 +152,9 @@ def create_objective(config: Config):
             distance = L2Distance(normalize=True, ignore_zeros=True)
             return partial(get_vectors_loss, distance=distance)
         else:
-            raise ValueError("Unexpected loss type", config["loss"])
+            raise ValueError(Errors.E906.format(loss_type=config["loss"]))
     else:
-        raise ValueError("Unexpected objective_type", objective_type)
+        raise ValueError(Errors.E907.format(objective_type=objective_type))


 def get_vectors_loss(ops, docs, prediction, distance):
@@ -16,7 +16,7 @@ from .errors import Errors
 from .attrs import intify_attrs, NORM, IS_STOP
 from .vectors import Vectors
 from .util import registry
-from .lookups import Lookups, load_lookups
+from .lookups import Lookups
 from . import util
 from .lang.norm_exceptions import BASE_NORMS
 from .lang.lex_attrs import LEX_ATTRS, is_stop, get_lang
@@ -4,6 +4,7 @@ tag: class
 source: spacy/pipeline/attributeruler.py
 new: 3
 teaser: 'Pipeline component for rule-based token attribute assignment'
+api_base_class: /api/pipe
 api_string_name: attribute_ruler
 api_trainable: false
 ---
@@ -25,17 +26,13 @@ how the component should be configured. You can override its settings via the
 > #### Example
 >
 > ```python
-> config = {
->    "pattern_dicts": None,
->    "validate": True,
-> }
+> config = {"validate": True}
 > nlp.add_pipe("attribute_ruler", config=config)
 > ```

 | Setting | Description |
-| --------------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
-| `pattern_dicts` | A list of pattern dicts with the keys as the arguments to [`AttributeRuler.add`](/api/attributeruler#add) (`patterns`/`attrs`/`index`) to add as patterns. Defaults to `None`. ~~Optional[Iterable[Dict[str, Union[List[dict], dict, int]]]]~~ |
-| `validate` | Whether patterns should be validated (passed to the `Matcher`). Defaults to `False`. ~~bool~~ |
+| ---------- | --------------------------------------------------------------------------------------------- |
+| `validate` | Whether patterns should be validated (passed to the `Matcher`). Defaults to `False`. ~~bool~~ |

 ```python
 %%GITHUB_SPACY/spacy/pipeline/attributeruler.py
@@ -43,36 +40,26 @@ how the component should be configured. You can override its settings via the

 ## AttributeRuler.\_\_init\_\_ {#init tag="method"}

-Initialize the attribute ruler. If pattern dicts are supplied here, they need to
-be a list of dictionaries with `"patterns"`, `"attrs"`, and optional `"index"`
-keys, e.g.:
-
-```python
-pattern_dicts = [
-    {"patterns": [[{"TAG": "VB"}]], "attrs": {"POS": "VERB"}},
-    {"patterns": [[{"LOWER": "an"}]], "attrs": {"LEMMA": "a"}},
-]
-```
+Initialize the attribute ruler.

 > #### Example
 >
 > ```python
 > # Construction via add_pipe
-> attribute_ruler = nlp.add_pipe("attribute_ruler")
+> ruler = nlp.add_pipe("attribute_ruler")
 > ```

 | Name | Description |
-| --------------- | ---------------------------------------------------------------------------------------------------------------------------------------- |
+| -------------- | ---------------------------------------------------------------------------------------------------------------------------------------- |
 | `vocab` | The shared vocabulary to pass to the matcher. ~~Vocab~~ |
 | `name` | Instance name of the current pipeline component. Typically passed in automatically from the factory when the component is added. ~~str~~ |
 | _keyword-only_ | |
-| `pattern_dicts` | Optional patterns to load in on initialization. Defaults to `None`. ~~Optional[Iterable[Dict[str, Union[List[dict], dict, int]]]]~~ |
 | `validate` | Whether patterns should be validated (passed to the [`Matcher`](/api/matcher#init)). Defaults to `False`. ~~bool~~ |

 ## AttributeRuler.\_\_call\_\_ {#call tag="method"}

-Apply the attribute ruler to a `Doc`, setting token attributes for tokens matched
-by the provided patterns.
+Apply the attribute ruler to a `Doc`, setting token attributes for tokens
+matched by the provided patterns.

 | Name | Description |
 | ----------- | -------------------------------- |
@@ -90,10 +77,10 @@ may be negative to index from the end of the span.
 > #### Example
 >
 > ```python
-> attribute_ruler = nlp.add_pipe("attribute_ruler")
+> ruler = nlp.add_pipe("attribute_ruler")
 > patterns = [[{"TAG": "VB"}]]
 > attrs = {"POS": "VERB"}
-> attribute_ruler.add(patterns=patterns, attrs=attrs)
+> ruler.add(patterns=patterns, attrs=attrs)
 > ```

 | Name | Description |
@@ -107,11 +94,10 @@ may be negative to index from the end of the span.
 > #### Example
 >
 > ```python
-> attribute_ruler = nlp.add_pipe("attribute_ruler")
-> pattern_dicts = [
+> ruler = nlp.add_pipe("attribute_ruler")
+> patterns = [
 >   {
->     "patterns": [[{"TAG": "VB"}]],
->     "attrs": {"POS": "VERB"}
+>     "patterns": [[{"TAG": "VB"}]], "attrs": {"POS": "VERB"}
 >   },
 >   {
 >     "patterns": [[{"LOWER": "two"}, {"LOWER": "apples"}]],
@@ -119,15 +105,16 @@ may be negative to index from the end of the span.
 >     "index": -1
 >   },
 > ]
-> attribute_ruler.add_patterns(pattern_dicts)
+> ruler.add_patterns(patterns)
 > ```

-Add patterns from a list of pattern dicts with the keys as the arguments to
+Add patterns from a list of pattern dicts. Each pattern dict can specify the
+keys `"patterns"`, `"attrs"` and `"index"`, which match the arguments of
 [`AttributeRuler.add`](/api/attributeruler#add).

 | Name | Description |
-| --------------- | -------------------------------------------------------------------------- |
-| `pattern_dicts` | The patterns to add. ~~Iterable[Dict[str, Union[List[dict], dict, int]]]~~ |
+| ---------- | -------------------------------------------------------------------------- |
+| `patterns` | The patterns to add. ~~Iterable[Dict[str, Union[List[dict], dict, int]]]~~ |

 ## AttributeRuler.patterns {#patterns tag="property"}

@@ -139,20 +126,39 @@ Get all patterns that have been added to the attribute ruler in the
 | ----------- | -------------------------------------------------------------------------------------------- |
 | **RETURNS** | The patterns added to the attribute ruler. ~~List[Dict[str, Union[List[dict], dict, int]]]~~ |

-## AttributeRuler.score {#score tag="method" new="3"}
+## AttributeRuler.initialize {#initialize tag="method"}

-Score a batch of examples.
+Initialize the component with data. Typically called before training to load in
+rules from a file. This method is typically called by
+[`Language.initialize`](/api/language#initialize) and lets you customize
+arguments it receives via the
+[`[initialize.components]`](/api/data-formats#config-initialize) block in the
+config.

 > #### Example
 >
 > ```python
-> scores = attribute_ruler.score(examples)
+> ruler = nlp.add_pipe("attribute_ruler")
+> ruler.initialize(lambda: [], nlp=nlp, patterns=patterns)
+> ```
+>
+> ```ini
+> ### config.cfg
+> [initialize.components.attribute_ruler]
+>
+> [initialize.components.attribute_ruler.patterns]
+> @readers = "srsly.read_json.v1"
+> path = "corpus/attribute_ruler_patterns.json
 > ```

 | Name | Description |
-| ----------- | --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
-| `examples` | The examples to score. ~~Iterable[Example]~~ |
-| **RETURNS** | The scores, produced by [`Scorer.score_token_attr`](/api/scorer#score_token_attr) for the attributes `"tag"`, `"pos"`, `"morph"` and `"lemma"` if present in any of the target token attributes. ~~Dict[str, float]~~ |
+| -------------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
+| `get_examples` | Function that returns gold-standard annotations in the form of [`Example`](/api/example) objects (the training data). Not used by this component. ~~Callable[[], Iterable[Example]]~~ |
+| _keyword-only_ | |
+| `nlp` | The current `nlp` object. Defaults to `None`. ~~Optional[Language]~~ |
+| `patterns` | A list of pattern dicts with the keys as the arguments to [`AttributeRuler.add`](/api/attributeruler#add) (`patterns`/`attrs`/`index`) to add as patterns. Defaults to `None`. ~~Optional[Iterable[Dict[str, Union[List[dict], dict, int]]]]~~ |
+| `tag_map` | The tag map that maps fine-grained tags to coarse-grained tags and morphological features. Defaults to `None`. ~~Optional[Dict[str, Dict[Union[int, str], Union[int, str]]]]~~ |
+| `morph_rules` | The morph rules that map token text and fine-grained tags to coarse-grained tags, lemmas and morphological features. Defaults to `None`. ~~Optional[Dict[str, Dict[str, Dict[Union[int, str], Union[int, str]]]]]~~ |

 ## AttributeRuler.load_from_tag_map {#load_from_tag_map tag="method"}

@@ -170,6 +176,21 @@ Load attribute ruler patterns from morph rules.
 | ------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
 | `morph_rules` | The morph rules that map token text and fine-grained tags to coarse-grained tags, lemmas and morphological features. ~~Dict[str, Dict[str, Dict[Union[int, str], Union[int, str]]]]~~ |

+## AttributeRuler.score {#score tag="method" new="3"}
+
+Score a batch of examples.
+
+> #### Example
+>
+> ```python
+> scores = ruler.score(examples)
+> ```
+
+| Name | Description |
+| ----------- | --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
+| `examples` | The examples to score. ~~Iterable[Example]~~ |
+| **RETURNS** | The scores, produced by [`Scorer.score_token_attr`](/api/scorer#score_token_attr) for the attributes `"tag"`, `"pos"`, `"morph"` and `"lemma"` if present in any of the target token attributes. ~~Dict[str, float]~~ |
+
 ## AttributeRuler.to_disk {#to_disk tag="method"}

 Serialize the pipe to disk.
@@ -177,8 +198,8 @@ Serialize the pipe to disk.
 > #### Example
 >
 > ```python
-> attribute_ruler = nlp.add_pipe("attribute_ruler")
-> attribute_ruler.to_disk("/path/to/attribute_ruler")
+> ruler = nlp.add_pipe("attribute_ruler")
+> ruler.to_disk("/path/to/attribute_ruler")
 > ```

 | Name | Description |
@@ -194,8 +215,8 @@ Load the pipe from disk. Modifies the object in place and returns it.
 > #### Example
 >
 > ```python
-> attribute_ruler = nlp.add_pipe("attribute_ruler")
-> attribute_ruler.from_disk("/path/to/attribute_ruler")
+> ruler = nlp.add_pipe("attribute_ruler")
+> ruler.from_disk("/path/to/attribute_ruler")
 > ```

 | Name | Description |
@@ -210,8 +231,8 @@ Load the pipe from disk. Modifies the object in place and returns it.
 > #### Example
 >
 > ```python
-> attribute_ruler = nlp.add_pipe("attribute_ruler")
-> attribute_ruler_bytes = attribute_ruler.to_bytes()
+> ruler = nlp.add_pipe("attribute_ruler")
+> ruler = ruler.to_bytes()
 > ```

 Serialize the pipe to a bytestring.
@@ -229,9 +250,9 @@ Load the pipe from a bytestring. Modifies the object in place and returns it.
 > #### Example
 >
 > ```python
-> attribute_ruler_bytes = attribute_ruler.to_bytes()
-> attribute_ruler = nlp.add_pipe("attribute_ruler")
-> attribute_ruler.from_bytes(attribute_ruler_bytes)
+> ruler_bytes = ruler.to_bytes()
+> ruler = nlp.add_pipe("attribute_ruler")
+> ruler.from_bytes(ruler_bytes)
 > ```

 | Name | Description |
@@ -250,12 +271,12 @@ serialization by passing in the string names via the `exclude` argument.
 > #### Example
 >
 > ```python
-> data = attribute_ruler.to_disk("/path", exclude=["vocab"])
+> data = ruler.to_disk("/path", exclude=["vocab"])
 > ```

 | Name | Description |
-| ---------- | -------------------------------------------------------------- |
+| ---------- | --------------------------------------------------------------- |
 | `vocab` | The shared [`Vocab`](/api/vocab). |
 | `patterns` | The `Matcher` patterns. You usually don't want to exclude this. |
 | `attrs` | The attributes to set. You usually don't want to exclude this. |
 | `indices` | The token indices. You usually don't want to exclude this. |
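As an aside, the `initialize`-based pattern loading documented in the hunks above amounts to roughly the following; the JSON file name is illustrative and is assumed to contain a list of pattern dicts:

```python
import spacy
import srsly

nlp = spacy.blank("en")
ruler = nlp.add_pipe("attribute_ruler")
# corpus/attribute_ruler_patterns.json is a hypothetical file containing a list
# of dicts with "patterns", "attrs" and optional "index" keys
patterns = srsly.read_json("corpus/attribute_ruler_patterns.json")
ruler.initialize(lambda: [], nlp=nlp, patterns=patterns)
```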
@@ -232,7 +232,7 @@ $ python -m spacy init labels [config_path] [output_path] [--code] [--verbose] [
 | `--gpu-id`, `-g` | GPU ID or `-1` for CPU. Defaults to `-1`. ~~int (option)~~ |
 | `--help`, `-h` | Show help message and available arguments. ~~bool (flag)~~ |
 | overrides | Config parameters to override. Should be options starting with `--` that correspond to the config section and value to override, e.g. `--paths.train ./train.spacy`. ~~Any (option/flag)~~ |
-| **CREATES** | The final trained pipeline and the best trained pipeline. |
+| **CREATES** | The best trained pipeline and the final checkpoint (if training is terminated). |

 ## convert {#convert tag="command"}
@@ -176,12 +176,12 @@ This method was previously called `begin_training`.
 > path = "corpus/labels/parser.json
 > ```

 | Name | Description |
-| -------------- | ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
+| -------------- | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
 | `get_examples` | Function that returns gold-standard annotations in the form of [`Example`](/api/example) objects. ~~Callable[[], Iterable[Example]]~~ |
 | _keyword-only_ | |
 | `nlp` | The current `nlp` object. Defaults to `None`. ~~Optional[Language]~~ |
-| `labels` | The label information to add to the component. To generate a reusable JSON file from your data, you should run the [`init labels`](/api/cli#init-labels) command. If no labels are provided, the `get_examples` callback is used to extract the labels from the data, which may be a lot slower. ~~Optional[dict]~~ |
+| `labels` | The label information to add to the component, as provided by the [`label_data`](#label_data) property after initialization. To generate a reusable JSON file from your data, you should run the [`init labels`](/api/cli#init-labels) command. If no labels are provided, the `get_examples` callback is used to extract the labels from the data, which may be a lot slower. ~~Optional[Dict[str, Dict[str, int]]]~~ |

 ## DependencyParser.predict {#predict tag="method"}

@@ -433,6 +433,24 @@ The labels currently added to the component.
 | ----------- | ------------------------------------------------------ |
 | **RETURNS** | The labels added to the component. ~~Tuple[str, ...]~~ |

+## DependencyParser.label_data {#label_data tag="property" new="3"}
+
+The labels currently added to the component and their internal meta information.
+This is the data generated by [`init labels`](/api/cli#init-labels) and used by
+[`DependencyParser.initialize`](/api/dependencyparser#initialize) to initialize
+the model with a pre-defined label set.
+
+> #### Example
+>
+> ```python
+> labels = parser.label_data
+> parser.initialize(lambda: [], nlp=nlp, labels=labels)
+> ```
+
+| Name | Description |
+| ----------- | ------------------------------------------------------------------------------- |
+| **RETURNS** | The label data added to the component. ~~Dict[str, Dict[str, Dict[str, int]]]~~ |
+
 ## Serialization fields {#serialization-fields}

 During serialization, spaCy will export several data fields used to restore
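A short sketch of how the new `label_data` property pairs with `initialize`, mirroring what `init labels` produces; the pipeline and file path below are illustrative assumptions:

```python
import spacy
import srsly

nlp = spacy.load("en_core_web_sm")          # any pipeline that has a parser
labels = nlp.get_pipe("parser").label_data  # same data `init labels` writes out
srsly.write_json("corpus/labels/parser.json", labels)

# Later, initialize a fresh parser from the saved label set instead of the data
fresh_nlp = spacy.blank("en")
parser = fresh_nlp.add_pipe("parser")
parser.initialize(lambda: [], nlp=fresh_nlp, labels=labels)
```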
@@ -165,12 +165,12 @@ This method was previously called `begin_training`.
 > path = "corpus/labels/ner.json
 > ```

 | Name | Description |
-| -------------- | ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
+| -------------- | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
 | `get_examples` | Function that returns gold-standard annotations in the form of [`Example`](/api/example) objects. ~~Callable[[], Iterable[Example]]~~ |
 | _keyword-only_ | |
 | `nlp` | The current `nlp` object. Defaults to `None`. ~~Optional[Language]~~ |
-| `labels` | The label information to add to the component. To generate a reusable JSON file from your data, you should run the [`init labels`](/api/cli#init-labels) command. If no labels are provided, the `get_examples` callback is used to extract the labels from the data, which may be a lot slower. ~~Optional[dict]~~ |
+| `labels` | The label information to add to the component, as provided by the [`label_data`](#label_data) property after initialization. To generate a reusable JSON file from your data, you should run the [`init labels`](/api/cli#init-labels) command. If no labels are provided, the `get_examples` callback is used to extract the labels from the data, which may be a lot slower. ~~Optional[Dict[str, Dict[str, int]]]~~ |

 ## EntityRecognizer.predict {#predict tag="method"}

@@ -421,6 +421,24 @@ The labels currently added to the component.
 | ----------- | ------------------------------------------------------ |
 | **RETURNS** | The labels added to the component. ~~Tuple[str, ...]~~ |

+## EntityRecognizer.label_data {#label_data tag="property" new="3"}
+
+The labels currently added to the component and their internal meta information.
+This is the data generated by [`init labels`](/api/cli#init-labels) and used by
+[`EntityRecognizer.initialize`](/api/entityrecognizer#initialize) to initialize
+the model with a pre-defined label set.
+
+> #### Example
+>
+> ```python
+> labels = ner.label_data
+> ner.initialize(lambda: [], nlp=nlp, labels=labels)
+> ```
+
+| Name | Description |
+| ----------- | ------------------------------------------------------------------------------- |
+| **RETURNS** | The label data added to the component. ~~Dict[str, Dict[str, Dict[str, int]]]~~ |
+
 ## Serialization fields {#serialization-fields}

 During serialization, spaCy will export several data fields used to restore
@@ -190,23 +190,10 @@ lemmatization entirely.
 Returns the lookups configuration settings for a given mode for use in
 [`Lemmatizer.load_lookups`](/api/lemmatizer#load_lookups).

 | Name | Description |
-| ----------- | --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
+| ----------- | -------------------------------------------------------------------------------------- |
 | `mode` | The lemmatizer mode. ~~str~~ |
-| **RETURNS** | The lookups configuration settings for this mode. Includes the keys `"required_tables"` and `"optional_tables"`, mapped to a list of table string names. ~~Dict[str, List[str]]~~ |
-
-## Lemmatizer.load_lookups {#load_lookups tag="classmethod"}
-
-Load and validate lookups tables. If the provided lookups is `None`, load the
-default lookups tables according to the language and mode settings. Confirm that
-all required tables for the language and mode are present.
-
-| Name | Description |
-| ----------- | -------------------------------------------------------------------------------------------------- |
-| `lang` | The language. ~~str~~ |
-| `mode` | The lemmatizer mode. ~~str~~ |
-| `lookups` | The provided lookups, may be `None` if the default lookups should be loaded. ~~Optional[Lookups]~~ |
-| **RETURNS** | The lookups. ~~Lookups~~ |
+| **RETURNS** | The required table names and the optional table names. ~~Tuple[List[str], List[str]]~~ |

 ## Lemmatizer.to_disk {#to_disk tag="method"}
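For illustration, the changed return type above means callers now unpack a tuple; a sketch assuming the `rule` mode and the classmethod signature shown in the docs diff:

```python
from spacy.pipeline import Lemmatizer

# get_lookups_config now returns (required_tables, optional_tables)
required_tables, optional_tables = Lemmatizer.get_lookups_config("rule")
print(required_tables, optional_tables)
```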
@@ -147,12 +147,12 @@ config.
 > path = "corpus/labels/morphologizer.json
 > ```

 | Name | Description |
-| -------------- | ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
+| -------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
 | `get_examples` | Function that returns gold-standard annotations in the form of [`Example`](/api/example) objects. ~~Callable[[], Iterable[Example]]~~ |
 | _keyword-only_ | |
 | `nlp` | The current `nlp` object. Defaults to `None`. ~~Optional[Language]~~ |
-| `labels` | The label information to add to the component. To generate a reusable JSON file from your data, you should run the [`init labels`](/api/cli#init-labels) command. If no labels are provided, the `get_examples` callback is used to extract the labels from the data, which may be a lot slower. ~~Optional[dict]~~ |
+| `labels` | The label information to add to the component, as provided by the [`label_data`](#label_data) property after initialization. To generate a reusable JSON file from your data, you should run the [`init labels`](/api/cli#init-labels) command. If no labels are provided, the `get_examples` callback is used to extract the labels from the data, which may be a lot slower. ~~Optional[dict]~~ |

 ## Morphologizer.predict {#predict tag="method"}

@@ -377,6 +377,24 @@ coarse-grained POS as the feature `POS`.
 | ----------- | ------------------------------------------------------ |
 | **RETURNS** | The labels added to the component. ~~Tuple[str, ...]~~ |

+## Morphologizer.label_data {#label_data tag="property" new="3"}
+
+The labels currently added to the component and their internal meta information.
+This is the data generated by [`init labels`](/api/cli#init-labels) and used by
+[`Morphologizer.initialize`](/api/morphologizer#initialize) to initialize the
+model with a pre-defined label set.
+
+> #### Example
+>
+> ```python
+> labels = morphologizer.label_data
+> morphologizer.initialize(lambda: [], nlp=nlp, labels=labels)
+> ```
+
+| Name | Description |
+| ----------- | ----------------------------------------------- |
+| **RETURNS** | The label data added to the component. ~~dict~~ |
+
 ## Serialization fields {#serialization-fields}

 During serialization, spaCy will export several data fields used to restore
@@ -148,12 +148,12 @@ This method was previously called `begin_training`.
 > path = "corpus/labels/tagger.json
 > ```

 | Name | Description |
-| -------------- | ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
+| -------------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
 | `get_examples` | Function that returns gold-standard annotations in the form of [`Example`](/api/example) objects. ~~Callable[[], Iterable[Example]]~~ |
 | _keyword-only_ | |
 | `nlp` | The current `nlp` object. Defaults to `None`. ~~Optional[Language]~~ |
-| `labels` | The label information to add to the component. To generate a reusable JSON file from your data, you should run the [`init labels`](/api/cli#init-labels) command. If no labels are provided, the `get_examples` callback is used to extract the labels from the data, which may be a lot slower. ~~Optional[list]~~ |
+| `labels` | The label information to add to the component, as provided by the [`label_data`](#label_data) property after initialization. To generate a reusable JSON file from your data, you should run the [`init labels`](/api/cli#init-labels) command. If no labels are provided, the `get_examples` callback is used to extract the labels from the data, which may be a lot slower. ~~Optional[Iterable[str]]~~ |

 ## Tagger.predict {#predict tag="method"}

@@ -411,6 +411,24 @@ The labels currently added to the component.
 | ----------- | ------------------------------------------------------ |
 | **RETURNS** | The labels added to the component. ~~Tuple[str, ...]~~ |

+## Tagger.label_data {#label_data tag="property" new="3"}
+
+The labels currently added to the component and their internal meta information.
+This is the data generated by [`init labels`](/api/cli#init-labels) and used by
+[`Tagger.initialize`](/api/tagger#initialize) to initialize the model with a
+pre-defined label set.
+
+> #### Example
+>
+> ```python
+> labels = tagger.label_data
+> tagger.initialize(lambda: [], nlp=nlp, labels=labels)
+> ```
+
+| Name | Description |
+| ----------- | ---------------------------------------------------------- |
+| **RETURNS** | The label data added to the component. ~~Tuple[str, ...]~~ |
+
 ## Serialization fields {#serialization-fields}

 During serialization, spaCy will export several data fields used to restore
@ -29,19 +29,16 @@ architectures and their arguments and hyperparameters.
|
||||||
> ```python
|
> ```python
|
||||||
> from spacy.pipeline.textcat import DEFAULT_TEXTCAT_MODEL
|
> from spacy.pipeline.textcat import DEFAULT_TEXTCAT_MODEL
|
||||||
> config = {
|
> config = {
|
||||||
> "labels": [],
|
|
||||||
> "threshold": 0.5,
|
> "threshold": 0.5,
|
||||||
> "model": DEFAULT_TEXTCAT_MODEL,
|
> "model": DEFAULT_TEXTCAT_MODEL,
|
||||||
> }
|
> }
|
||||||
> nlp.add_pipe("textcat", config=config)
|
> nlp.add_pipe("textcat", config=config)
|
||||||
> ```
|
> ```
|
||||||
|
|
||||||
| Setting | Description |
|
| Setting | Description |
|
||||||
| ---------------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------------- |
|
| ----------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------------- |
|
||||||
| `labels` | A list of categories to learn. If empty, the model infers the categories from the data. Defaults to `[]`. ~~Iterable[str]~~ |
|
| `threshold` | Cutoff to consider a prediction "positive", relevant when printing accuracy results. ~~float~~ |
|
||||||
| `threshold` | Cutoff to consider a prediction "positive", relevant when printing accuracy results. ~~float~~ |
|
| `model` | A model instance that predicts scores for each category. Defaults to [TextCatEnsemble](/api/architectures#TextCatEnsemble). ~~Model[List[Doc], List[Floats2d]]~~ |
|
||||||
| `positive_label` | The positive label for a binary task with exclusive classes, None otherwise and by default. ~~Optional[str]~~ |
|
|
||||||
| `model` | A model instance that predicts scores for each category. Defaults to [TextCatEnsemble](/api/architectures#TextCatEnsemble). ~~Model[List[Doc], List[Floats2d]]~~ |
|
|
||||||
|
|
||||||
```python
|
```python
|
||||||
%%GITHUB_SPACY/spacy/pipeline/textcat.py
|
%%GITHUB_SPACY/spacy/pipeline/textcat.py
|
||||||
|
@ -61,22 +58,20 @@ architectures and their arguments and hyperparameters.
|
||||||
>
|
>
|
||||||
> # Construction from class
|
> # Construction from class
|
||||||
> from spacy.pipeline import TextCategorizer
|
> from spacy.pipeline import TextCategorizer
|
||||||
> textcat = TextCategorizer(nlp.vocab, model, labels=[], threshold=0.5, positive_label="POS")
|
> textcat = TextCategorizer(nlp.vocab, model, threshold=0.5)
|
||||||
> ```
|
> ```
|
||||||
|
|
||||||
Create a new pipeline instance. In your application, you would normally use a
|
Create a new pipeline instance. In your application, you would normally use a
|
||||||
shortcut for this and instantiate the component using its string name and
|
shortcut for this and instantiate the component using its string name and
|
||||||
[`nlp.add_pipe`](/api/language#create_pipe).
|
[`nlp.add_pipe`](/api/language#create_pipe).
|
||||||
|
|
||||||
| Name | Description |
|
| Name | Description |
|
||||||
| ---------------- | -------------------------------------------------------------------------------------------------------------------------- |
|
| -------------- | -------------------------------------------------------------------------------------------------------------------------- |
|
||||||
| `vocab` | The shared vocabulary. ~~Vocab~~ |
|
| `vocab` | The shared vocabulary. ~~Vocab~~ |
|
||||||
| `model` | The Thinc [`Model`](https://thinc.ai/docs/api-model) powering the pipeline component. ~~Model[List[Doc], List[Floats2d]]~~ |
|
| `model` | The Thinc [`Model`](https://thinc.ai/docs/api-model) powering the pipeline component. ~~Model[List[Doc], List[Floats2d]]~~ |
|
||||||
| `name` | String name of the component instance. Used to add entries to the `losses` during training. ~~str~~ |
|
| `name` | String name of the component instance. Used to add entries to the `losses` during training. ~~str~~ |
|
||||||
| _keyword-only_ | |
|
| _keyword-only_ | |
|
||||||
| `labels` | The labels to use. ~~Iterable[str]~~ |
|
| `threshold` | Cutoff to consider a prediction "positive", relevant when printing accuracy results. ~~float~~ |
|
||||||
| `threshold` | Cutoff to consider a prediction "positive", relevant when printing accuracy results. ~~float~~ |
|
|
||||||
| `positive_label` | The positive label for a binary task with exclusive classes, None otherwise. ~~Optional[str]~~ |
|
|
||||||
|
|
||||||
## TextCategorizer.\_\_call\_\_ {#call tag="method"}
|
## TextCategorizer.\_\_call\_\_ {#call tag="method"}
|
||||||
|
|
||||||
|
@ -155,18 +150,20 @@ This method was previously called `begin_training`.
|
||||||
> ```ini
> ### config.cfg
> [initialize.components.textcat]
> positive_label = "POS"
>
> [initialize.components.textcat.labels]
> @readers = "spacy.read_labels.v1"
> path = "corpus/labels/textcat.json"
> ```

| Name | Description |
| ---------------- | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `get_examples` | Function that returns gold-standard annotations in the form of [`Example`](/api/example) objects. ~~Callable[[], Iterable[Example]]~~ |
| _keyword-only_ | |
| `nlp` | The current `nlp` object. Defaults to `None`. ~~Optional[Language]~~ |
| `labels` | The label information to add to the component, as provided by the [`label_data`](#label_data) property after initialization. To generate a reusable JSON file from your data, you should run the [`init labels`](/api/cli#init-labels) command. If no labels are provided, the `get_examples` callback is used to extract the labels from the data, which may be a lot slower. ~~Optional[Iterable[str]]~~ |
| `positive_label` | The positive label for a binary task with exclusive classes, None otherwise and by default. ~~Optional[str]~~ |

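As a rough sketch, the same initialization can also be done directly in Python; the labels below are purely illustrative:

```python
import spacy

nlp = spacy.blank("en")
textcat = nlp.add_pipe("textcat")
# Provide the labels up front instead of having them inferred via get_examples
textcat.initialize(
    lambda: [],
    nlp=nlp,
    labels=["POS", "NEG"],
    positive_label="POS",
)
```
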
## TextCategorizer.predict {#predict tag="method"}

@@ -425,6 +422,24 @@ The labels currently added to the component.
| ----------- | ------------------------------------------------------ |
| **RETURNS** | The labels added to the component. ~~Tuple[str, ...]~~ |

## TextCategorizer.label_data {#label_data tag="property" new="3"}

The labels currently added to the component and their internal meta information.
This is the data generated by [`init labels`](/api/cli#init-labels) and used by
[`TextCategorizer.initialize`](/api/textcategorizer#initialize) to initialize
the model with a pre-defined label set.

> #### Example
>
> ```python
> labels = textcat.label_data
> textcat.initialize(lambda: [], nlp=nlp, labels=labels)
> ```

| Name | Description |
| ----------- | ---------------------------------------------------------- |
| **RETURNS** | The label data added to the component. ~~Tuple[str, ...]~~ |

## Serialization fields {#serialization-fields}

During serialization, spaCy will export several data fields used to restore

@@ -689,7 +689,8 @@ Data augmentation is the process of applying small modifications to the training
data. It can be especially useful for punctuation and case replacement – for
example, if your corpus only uses smart quotes and you want to include
variations using regular quotes, or to make the model less sensitive to
capitalization by including a mix of capitalized and lowercase examples. See the
[usage guide](/usage/training#data-augmentation) for details and examples.

### spacy.orth_variants.v1 {#orth_variants tag="registered function"}

@@ -707,7 +708,7 @@ capitalization by including a mix of capitalized and lowercase examples. See the
> ```

Create a data augmentation callback that uses orth-variant replacement. The
callback can be added to a corpus or other data iterator during training. It's
especially useful for punctuation and case replacement, to help generalize
beyond corpora that don't have smart quotes, or only have smart quotes etc.

@@ -718,6 +719,25 @@ beyond corpora that don't have smart quotes, or only have smart quotes etc.
| `orth_variants` | A dictionary containing the single and paired orth variants. Typically loaded from a JSON file. See [`en_orth_variants.json`](https://github.com/explosion/spacy-lookups-data/blob/master/spacy_lookups_data/data/en_orth_variants.json) for an example. ~~Dict[str, Dict[List[Union[str, List[str]]]]]~~ |
| **CREATES** | A function that takes the current `nlp` object and an [`Example`](/api/example) and yields augmented `Example` objects. ~~Callable[[Language, Example], Iterator[Example]]~~ |

### spacy.lower_case.v1 {#lower_case tag="registered function"}

> #### Example config
>
> ```ini
> [corpora.train.augmenter]
> @augmenters = "spacy.lower_case.v1"
> level = 0.3
> ```

Create a data augmentation callback that lowercases documents. The callback can
be added to a corpus or other data iterator during training. It's especially
useful for making the model less sensitive to capitalization.

| Name | Description |
| ----------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------ |
| `level` | The percentage of texts that will be augmented. ~~float~~ |
| **CREATES** | A function that takes the current `nlp` object and an [`Example`](/api/example) and yields augmented `Example` objects. ~~Callable[[Language, Example], Iterator[Example]]~~ |

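Both augmenters are normally wired in through the config as shown above, but as a rough sketch you can also resolve one from the registry and pass it to a [`Corpus`](/api/corpus) directly in Python (the training path below is illustrative):

```python
import spacy
from spacy.training import Corpus

nlp = spacy.blank("en")
# Resolve the registered augmenter; level is the share of examples to augment
create_augmenter = spacy.registry.augmenters.get("spacy.lower_case.v1")
augmenter = create_augmenter(level=0.3)
corpus = Corpus("./corpus/train.spacy", augmenter=augmenter)
for example in corpus(nlp):
    print(example.text)
```
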
## Training data and alignment {#gold source="spacy/training"}

### training.offsets_to_biluo_tags {#offsets_to_biluo_tags tag="function"}

@@ -827,10 +847,10 @@ utilities.
### util.get_lang_class {#util.get_lang_class tag="function"}

Import and load a `Language` class. Allows lazy-loading
[language data](/usage/linguistic-features#language-data) and importing
languages using the two-letter language code. To add a language code for a
custom language class, you can register it using the
[`@registry.languages`](/api/top-level#registry) decorator.

> #### Example
>

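A minimal usage sketch (assuming the English language data that ships with spaCy):

```python
from spacy.util import get_lang_class

# Look up the Language subclass registered for the code "en" and build a blank pipeline
lang_class = get_lang_class("en")
nlp = lang_class()
```
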
@@ -1801,17 +1801,7 @@ print(doc2[5].tag_, doc2[5].pos_) # WP PRON

<Infobox variant="warning" title="Migrating from spaCy v2.x">

The [`AttributeRuler`](/api/attributeruler) can import a **tag map and morph rules** in the v2.x format via its built-in methods or when the component is initialized before training. See the [migration guide](/usage/v3#migrating-training-mappings-exceptions) for details.

</Infobox>

@@ -8,6 +8,7 @@ menu:
- ['Config System', 'config']
- ['Custom Training', 'config-custom']
- ['Custom Functions', 'custom-functions']
- ['Initialization', 'initialization']
- ['Data Utilities', 'data']
- ['Parallel Training', 'parallel-training']
- ['Internal API', 'api']

@@ -689,17 +690,17 @@ During training, the results of each step are passed to a logger function. By
default, these results are written to the console with the
[`ConsoleLogger`](/api/top-level#ConsoleLogger). There is also built-in support
for writing the log files to [Weights & Biases](https://www.wandb.com/) with the
[`WandbLogger`](/api/top-level#WandbLogger). On each step, the logger function
receives a **dictionary** with the following keys:

| Key | Value |
| -------------- | ------------------------------------------------------------------------------------------------------ |
| `epoch` | How many passes over the data have been completed. ~~int~~ |
| `step` | How many steps have been completed. ~~int~~ |
| `score` | The main score from the last evaluation, measured on the dev set. ~~float~~ |
| `other_scores` | The other scores from the last evaluation, measured on the dev set. ~~Dict[str, Any]~~ |
| `losses` | The accumulated training losses, keyed by component name. ~~Dict[str, float]~~ |
| `checkpoints` | A list of previous results, where each result is a `(score, step)` tuple. ~~List[Tuple[float, int]]~~ |

You can easily implement and plug in your own logger that records the training
results in a custom way, or sends them to an experiment management tracker of

@@ -715,30 +716,37 @@ tabular results to a file:

```python
### functions.py
import sys
from typing import IO, Tuple, Callable, Dict, Any, Optional
import spacy
from spacy import Language
from pathlib import Path

@spacy.registry.loggers("my_custom_logger.v1")
def custom_logger(log_path):
    def setup_logger(
        nlp: Language,
        stdout: IO=sys.stdout,
        stderr: IO=sys.stderr
    ) -> Tuple[Callable, Callable]:
        stdout.write(f"Logging to {log_path}\n")
        log_file = Path(log_path).open("w", encoding="utf8")
        log_file.write("step\\t")
        log_file.write("score\\t")
        for pipe in nlp.pipe_names:
            log_file.write(f"loss_{pipe}\\t")
        log_file.write("\\n")

        def log_step(info: Optional[Dict[str, Any]]):
            if info:
                log_file.write(f"{info['step']}\\t")
                log_file.write(f"{info['score']}\\t")
                for pipe in nlp.pipe_names:
                    log_file.write(f"{info['losses'][pipe]}\\t")
                log_file.write("\\n")

        def finalize():
            log_file.close()

        return log_step, finalize

    return setup_logger
```

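To plug this logger into training, you would then reference its registered name from the `[training.logger]` block of your config, roughly like this (the log path is illustrative):

```ini
[training.logger]
@loggers = "my_custom_logger.v1"
log_path = "training.log"
```
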
@@ -817,9 +825,101 @@ def MyModel(output_width: int) -> Model[List[Doc], List[Floats2d]]:
    return create_model(output_width)
```

## Customizing the initialization {#initialization}

When you start training a new model from scratch,
[`spacy train`](/api/cli#train) will call
[`nlp.initialize`](/api/language#initialize) to initialize the pipeline and load
the required data. All settings for this are defined in the
[`[initialize]`](/api/data-formats#config-initialize) block of the config, so
you can keep track of how the initial `nlp` object was created. The
initialization process typically includes the following:

> #### config.cfg (excerpt)
>
> ```ini
> [initialize]
> vectors = ${paths.vectors}
> init_tok2vec = ${paths.init_tok2vec}
>
> [initialize.components]
> # Settings for components
> ```

1. Load in **data resources** defined in the `[initialize]` config, including
   **word vectors** and
   [pretrained](/usage/embeddings-transformers/#pretraining) **tok2vec
   weights**.
2. Call the `initialize` methods of the tokenizer (if implemented, e.g. for
   [Chinese](/usage/models#chinese)) and pipeline components with a callback to
   access the training data, the current `nlp` object and any **custom
   arguments** defined in the `[initialize]` config.
3. In **pipeline components**: if needed, use the data to
   [infer missing shapes](/usage/layers-architectures#thinc-shape-inference) and
   set up the label scheme if no labels are provided. Components may also load
   other data like lookup tables or dictionaries.

The initialization step allows the config to define **all settings** required
for the pipeline, while keeping a separation between settings and functions that
should only be used **before training** to set up the initial pipeline, and
logic and configuration that needs to be available **at runtime**. Without that
separation, it would be very difficult to use the same, reproducible config file
because the component settings required for training (load data from an external
file) wouldn't match the component settings required at runtime (load what's
included with the saved `nlp` object and don't depend on an external file).

![Illustration of pipeline lifecycle](../images/lifecycle.svg)

<Infobox title="How components save and load data" emoji="📖">

For details and examples of how pipeline components can **save and load data
assets** like model weights or lookup tables, and how the component
initialization is implemented under the hood, see the usage guide on
[serializing and initializing component data](/usage/processing-pipelines#component-data-initialization).

</Infobox>

#### Initializing labels {#initialization-labels}

Built-in pipeline components like the
[`EntityRecognizer`](/api/entityrecognizer) or
[`DependencyParser`](/api/dependencyparser) need to know their available labels
and associated internal meta information to initialize their model weights.
Using the `get_examples` callback provided on initialization, they're able to
**read the labels off the training data** automatically, which is very
convenient – but it can also slow down the training process to compute this
information on every run.

The [`init labels`](/api/cli#init-labels) command lets you auto-generate JSON
files containing the label data for all supported components. You can then pass
in the labels in the `[initialize]` settings for the respective components to
allow them to initialize faster.

> #### config.cfg
>
> ```ini
> [initialize.components.ner]
>
> [initialize.components.ner.labels]
> @readers = "spacy.read_labels.v1"
> path = "corpus/labels/ner.json"
> ```

```cli
$ python -m spacy init labels config.cfg ./corpus --paths.train ./corpus/train.spacy
```

Under the hood, the command delegates to the `label_data` property of the
pipeline components, for instance
[`EntityRecognizer.label_data`](/api/entityrecognizer#label_data).

<Infobox variant="warning" title="Important note">

The JSON format differs for each component and some components need additional
meta information about their labels. The format exported by
[`init labels`](/api/cli#init-labels) matches what the components need, so you
should always let spaCy **auto-generate the labels** for you.

</Infobox>

## Data utilities {#data}

@@ -1298,8 +1398,8 @@ of being dropped.

> - [`nlp`](/api/language): The `nlp` object with the pipeline components and
>   their models.
> - [`nlp.initialize`](/api/language#initialize): Initialize the pipeline and
>   return an optimizer to update the component model weights.
> - [`Optimizer`](https://thinc.ai/docs/api-optimizers): Function that holds
>   state between updates.
> - [`nlp.update`](/api/language#update): Update component models with examples.

@@ -804,8 +804,30 @@ nlp = spacy.blank("en")
Instead of defining a `tag_map` and `morph_rules` in the language data, spaCy
v3.0 now manages mappings and exceptions with a separate and more flexible
pipeline component, the [`AttributeRuler`](/api/attributeruler). See the
[usage guide](/usage/linguistic-features#mappings-exceptions) for examples. If
you have tag maps and morph rules in the v2.x format, you can load them into the
attribute ruler before training using the `[initialize]` block of your config.

> #### What does the initialization do?
>
> The `[initialize]` block is used when
> [`nlp.initialize`](/api/language#initialize) is called (usually right before
> training). It lets you define data resources for initializing the pipeline in
> your `config.cfg`. After training, the rules are saved to disk with the
> exported pipeline, so your runtime model doesn't depend on local data. For
> details see the [config lifecycle](/usage/training/#config-lifecycle) and
> [initialization](/usage/training/#initialization) docs.

```ini
### config.cfg (excerpt)
[initialize.components.attribute_ruler]

[initialize.components.attribute_ruler.tag_map]
@readers = "srsly.read_json.v1"
path = "./corpus/tag_map.json"
```

The `AttributeRuler` also provides two handy helper methods
[`load_from_tag_map`](/api/attributeruler#load_from_tag_map) and
[`load_from_morph_rules`](/api/attributeruler#load_from_morph_rules) that let
you load in your existing tag map or morph rules:
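A minimal sketch of using these helpers (the two-entry tag map below is made up for illustration):

```python
import spacy

nlp = spacy.blank("en")
ruler = nlp.add_pipe("attribute_ruler")
# A tiny v2.x-style tag map: fine-grained tag -> attributes to set
tag_map = {"NN": {"pos": "NOUN"}, "VB": {"pos": "VERB"}}
ruler.load_from_tag_map(tag_map)
```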