Mirror of https://github.com/explosion/spaCy.git (synced 2025-07-10 16:22:29 +03:00)

Commit c2709a32c9: Merge branch 'develop' into nightly.spacy.io
.github/contributors/Stannislav.md (vendored, new file, 106 lines)

@@ -0,0 +1,106 @@
# spaCy contributor agreement

This spaCy Contributor Agreement (**"SCA"**) is based on the
[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf).
The SCA applies to any contribution that you make to any product or project
managed by us (the **"project"**), and sets out the intellectual property rights
you grant to us in the contributed materials. The term **"us"** shall mean
[ExplosionAI GmbH](https://explosion.ai/legal). The term
**"you"** shall mean the person or entity identified below.

If you agree to be bound by these terms, fill in the information requested
below and include the filled-in version with your first pull request, under the
folder [`.github/contributors/`](/.github/contributors/). The name of the file
should be your GitHub username, with the extension `.md`. For example, the user
example_user would create the file `.github/contributors/example_user.md`.

Read this agreement carefully before signing. These terms and conditions
constitute a binding legal agreement.

## Contributor Agreement

1. The term "contribution" or "contributed materials" means any source code,
object code, patch, tool, sample, graphic, specification, manual,
documentation, or any other material posted or submitted by you to the project.

2. With respect to any worldwide copyrights, or copyright applications and
registrations, in your contribution:

    * you hereby assign to us joint ownership, and to the extent that such
    assignment is or becomes invalid, ineffective or unenforceable, you hereby
    grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
    royalty-free, unrestricted license to exercise all rights under those
    copyrights. This includes, at our option, the right to sublicense these same
    rights to third parties through multiple levels of sublicensees or other
    licensing arrangements;

    * you agree that each of us can do all things in relation to your
    contribution as if each of us were the sole owners, and if one of us makes
    a derivative work of your contribution, the one who makes the derivative
    work (or has it made will be the sole owner of that derivative work;

    * you agree that you will not assert any moral rights in your contribution
    against us, our licensees or transferees;

    * you agree that we may register a copyright in your contribution and
    exercise all ownership rights associated with it; and

    * you agree that neither of us has any duty to consult with, obtain the
    consent of, pay or render an accounting to the other for any use or
    distribution of your contribution.

3. With respect to any patents you own, or that you can license without payment
to any third party, you hereby grant to us a perpetual, irrevocable,
non-exclusive, worldwide, no-charge, royalty-free license to:

    * make, have made, use, sell, offer to sell, import, and otherwise transfer
    your contribution in whole or in part, alone or in combination with or
    included in any product, work or materials arising out of the project to
    which your contribution was submitted, and

    * at our option, to sublicense these same rights to third parties through
    multiple levels of sublicensees or other licensing arrangements.

4. Except as set out above, you keep all right, title, and interest in your
contribution. The rights that you grant to us under these terms are effective
on the date you first submitted a contribution to us, even if your submission
took place before the date you sign these terms.

5. You covenant, represent, warrant and agree that:

    * Each contribution that you submit is and shall be an original work of
    authorship and you can legally grant the rights set out in this SCA;

    * to the best of your knowledge, each contribution will not violate any
    third party's copyrights, trademarks, patents, or other intellectual
    property rights; and

    * each contribution shall be in compliance with U.S. export control laws and
    other applicable export and import laws. You agree to notify us if you
    become aware of any circumstance which would make any of the foregoing
    representations inaccurate in any respect. We may publicly disclose your
    participation in the project, including the fact that you have signed the SCA.

6. This SCA is governed by the laws of the State of California and applicable
U.S. Federal law. Any choice of law rules will not apply.

7. Please place an “x” on one of the applicable statement below. Please do NOT
mark both statements:

    * [x] I am signing on behalf of myself as an individual and no other person
    or entity, including my employer, has or will have rights with respect to my
    contributions.

    * [ ] I am signing on behalf of my employer or a legal entity and I have the
    actual authority to contractually bind that entity.

## Contributor Details

| Field                          | Entry                |
|------------------------------- | -------------------- |
| Name                           | Stanislav Schmidt    |
| Company name (if applicable)   | Blue Brain Project   |
| Title or role (if applicable)  | ML Engineer          |
| Date                           | 2020-10-02           |
| GitHub username                | Stannislav           |
| Website (optional)             |                      |
.github/contributors/rasyidf.md (vendored, new file, 106 lines)

@@ -0,0 +1,106 @@
# spaCy contributor agreement

This spaCy Contributor Agreement (**"SCA"**) is based on the
[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf).
The SCA applies to any contribution that you make to any product or project
managed by us (the **"project"**), and sets out the intellectual property rights
you grant to us in the contributed materials. The term **"us"** shall mean
[ExplosionAI UG (haftungsbeschränkt)](https://explosion.ai/legal). The term
**"you"** shall mean the person or entity identified below.

If you agree to be bound by these terms, fill in the information requested
below and include the filled-in version with your first pull request, under the
folder [`.github/contributors/`](/.github/contributors/). The name of the file
should be your GitHub username, with the extension `.md`. For example, the user
example_user would create the file `.github/contributors/example_user.md`.

Read this agreement carefully before signing. These terms and conditions
constitute a binding legal agreement.

## Contributor Agreement

1. The term "contribution" or "contributed materials" means any source code,
object code, patch, tool, sample, graphic, specification, manual,
documentation, or any other material posted or submitted by you to the project.

2. With respect to any worldwide copyrights, or copyright applications and
registrations, in your contribution:

    * you hereby assign to us joint ownership, and to the extent that such
    assignment is or becomes invalid, ineffective or unenforceable, you hereby
    grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
    royalty-free, unrestricted license to exercise all rights under those
    copyrights. This includes, at our option, the right to sublicense these same
    rights to third parties through multiple levels of sublicensees or other
    licensing arrangements;

    * you agree that each of us can do all things in relation to your
    contribution as if each of us were the sole owners, and if one of us makes
    a derivative work of your contribution, the one who makes the derivative
    work (or has it made will be the sole owner of that derivative work;

    * you agree that you will not assert any moral rights in your contribution
    against us, our licensees or transferees;

    * you agree that we may register a copyright in your contribution and
    exercise all ownership rights associated with it; and

    * you agree that neither of us has any duty to consult with, obtain the
    consent of, pay or render an accounting to the other for any use or
    distribution of your contribution.

3. With respect to any patents you own, or that you can license without payment
to any third party, you hereby grant to us a perpetual, irrevocable,
non-exclusive, worldwide, no-charge, royalty-free license to:

    * make, have made, use, sell, offer to sell, import, and otherwise transfer
    your contribution in whole or in part, alone or in combination with or
    included in any product, work or materials arising out of the project to
    which your contribution was submitted, and

    * at our option, to sublicense these same rights to third parties through
    multiple levels of sublicensees or other licensing arrangements.

4. Except as set out above, you keep all right, title, and interest in your
contribution. The rights that you grant to us under these terms are effective
on the date you first submitted a contribution to us, even if your submission
took place before the date you sign these terms.

5. You covenant, represent, warrant and agree that:

    * Each contribution that you submit is and shall be an original work of
    authorship and you can legally grant the rights set out in this SCA;

    * to the best of your knowledge, each contribution will not violate any
    third party's copyrights, trademarks, patents, or other intellectual
    property rights; and

    * each contribution shall be in compliance with U.S. export control laws and
    other applicable export and import laws. You agree to notify us if you
    become aware of any circumstance which would make any of the foregoing
    representations inaccurate in any respect. We may publicly disclose your
    participation in the project, including the fact that you have signed the SCA.

6. This SCA is governed by the laws of the State of California and applicable
U.S. Federal law. Any choice of law rules will not apply.

7. Please place an “x” on one of the applicable statement below. Please do NOT
mark both statements:

    * [x] I am signing on behalf of myself as an individual and no other person
    or entity, including my employer, has or will have rights with respect to my
    contributions.

    * [ ] I am signing on behalf of my employer or a legal entity and I have the
    actual authority to contractually bind that entity.

## Contributor Details

| Field                          | Entry                    |
|------------------------------- | ------------------------ |
| Name                           | Muhammad Fahmi Rasyid    |
| Company name (if applicable)   |                          |
| Title or role (if applicable)  |                          |
| Date                           | 2020-09-23               |
| GitHub username                | rasyidf                  |
| Website (optional)             | http://rasyidf.github.io |

@@ -1,6 +1,6 @@
# fmt: off
__title__ = "spacy-nightly"
__version__ = "3.0.0a29"
__version__ = "3.0.0a32"
__release__ = True
__download_url__ = "https://github.com/explosion/spacy-models/releases/download"
__compatibility__ = "https://raw.githubusercontent.com/explosion/spacy-models/master/compatibility.json"
@@ -322,8 +322,7 @@ def git_checkout(
if dest.exists():
msg.fail("Destination of checkout must not exist", exits=1)
if not dest.parent.exists():
raise IOError("Parent of destination of checkout must exist")
msg.fail("Parent of destination of checkout must exist", exits=1)
if sparse and git_version >= (2, 22):
return git_sparse_checkout(repo, subpath, dest, branch)
elif sparse:
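The hunk above swaps a raised IOError for wasabi's `msg.fail`, which prints a formatted CLI error and exits. A minimal standalone sketch of that pattern, not spaCy's actual CLI wiring (the destination path below is a placeholder):

    from pathlib import Path
    from wasabi import msg

    def check_dest(dest: Path) -> None:
        # msg.fail(..., exits=1) prints the message and terminates the process,
        # so callers don't need their own exception handling.
        if dest.exists():
            msg.fail("Destination of checkout must not exist", exits=1)
        if not dest.parent.exists():
            msg.fail("Parent of destination of checkout must exist", exits=1)

    check_dest(Path("./checkout_target"))  # hypothetical destination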
@@ -171,7 +171,7 @@ def debug_data(
n_missing_vectors = sum(gold_train_data["words_missing_vectors"].values())
msg.warn(
"{} words in training data without vectors ({:0.2f}%)".format(
n_missing_vectors, n_missing_vectors / gold_train_data["n_words"],
n_missing_vectors, n_missing_vectors / gold_train_data["n_words"]
),
)
msg.text(
@@ -3,6 +3,7 @@ from pathlib import Path
from wasabi import msg
import typer
import logging
import sys

from ._util import app, Arg, Opt, parse_config_overrides, show_validation_error
from ._util import import_code, setup_gpu

@@ -39,7 +40,12 @@ def train_cli(
DOCS: https://nightly.spacy.io/api/cli#train
"""
util.logger.setLevel(logging.DEBUG if verbose else logging.INFO)
verify_cli_args(config_path, output_path)
# Make sure all files and paths exists if they are needed
if not config_path or not config_path.exists():
msg.fail("Config file not found", config_path, exits=1)
if output_path is not None and not output_path.exists():
output_path.mkdir()
msg.good(f"Created output directory: {output_path}")
overrides = parse_config_overrides(ctx.args)
import_code(code_path)
setup_gpu(use_gpu)

@@ -50,14 +56,4 @@ def train_cli(
nlp = init_nlp(config, use_gpu=use_gpu)
msg.good("Initialized pipeline")
msg.divider("Training pipeline")
train(nlp, output_path, use_gpu=use_gpu, silent=False)


def verify_cli_args(config_path: Path, output_path: Optional[Path] = None) -> None:
# Make sure all files and paths exists if they are needed
if not config_path or not config_path.exists():
msg.fail("Config file not found", config_path, exits=1)
if output_path is not None:
if not output_path.exists():
output_path.mkdir()
msg.good(f"Created output directory: {output_path}")
train(nlp, output_path, use_gpu=use_gpu, stdout=sys.stdout, stderr=sys.stderr)
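The refactor above inlines the old `verify_cli_args` checks into `train_cli` itself: the config path must exist and the output directory is created on demand. A rough standalone sketch of that validation logic, under the assumption that the file names are placeholders:

    from pathlib import Path
    from typing import Optional
    from wasabi import msg

    def check_paths(config_path: Path, output_path: Optional[Path] = None) -> None:
        # Fail early if the config is missing; create the output dir if needed.
        if not config_path or not config_path.exists():
            msg.fail("Config file not found", config_path, exits=1)
        if output_path is not None and not output_path.exists():
            output_path.mkdir()
            msg.good(f"Created output directory: {output_path}")

    check_paths(Path("config.cfg"), Path("./output"))  # hypothetical paths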
spacy/errors.py (327 lines changed)

@@ -16,8 +16,6 @@ def add_codes(err_cls):

@add_codes
class Warnings:
W004 = ("No text fixing enabled. Run `pip install ftfy` to enable fixing "
"using ftfy.fix_text if necessary.")
W005 = ("Doc object not parsed. This means displaCy won't be able to "
"generate a dependency visualization for it. Make sure the Doc "
"was processed with a model that supports dependency parsing, and "

@@ -51,8 +49,6 @@ class Warnings:
W017 = ("Alias '{alias}' already exists in the Knowledge Base.")
W018 = ("Entity '{entity}' already exists in the Knowledge Base - "
"ignoring the duplicate entry.")
W020 = ("Unnamed vectors. This won't allow multiple vectors models to be "
"loaded. (Shape: {shape})")
W021 = ("Unexpected hash collision in PhraseMatcher. Matches may be "
"incorrect. Modify PhraseMatcher._terminal_hash to fix.")
W024 = ("Entity '{entity}' - Alias '{alias}' combination already exists in "

@@ -65,7 +61,7 @@ class Warnings:
"be more efficient to split your training data into multiple "
"smaller JSON files instead.")
W028 = ("Doc.from_array was called with a vector of type '{type}', "
"but is expecting one of type 'uint64' instead. This may result "
"but is expecting one of type uint64 instead. This may result "
"in problems with the vocab further on in the pipeline.")
W030 = ("Some entities could not be aligned in the text \"{text}\" with "
"entities \"{entities}\". Use "

@@ -79,13 +75,17 @@ class Warnings:
"If this is surprising, make sure you have the spacy-lookups-data "
"package installed. The languages with lexeme normalization tables "
"are currently: {langs}")
W034 = ("Please install the package spacy-lookups-data in order to include "
"the default lexeme normalization table for the language '{lang}'.")
W035 = ('Discarding subpattern "{pattern}" due to an unrecognized '
"attribute or operator.")

# TODO: fix numbering after merging develop into master
W089 = ("The nlp.begin_training method has been renamed to nlp.initialize.")
W088 = ("The pipeline component {name} implements a `begin_training` "
"method, which won't be called by spaCy. As of v3.0, `begin_training` "
"has been renamed to `initialize`, so you likely want to rename the "
"component method. See the documentation for details: "
"https://nightly.spacy.io/api/language#initialize")
W089 = ("As of spaCy v3.0, the `nlp.begin_training` method has been renamed "
"to `nlp.initialize`.")
W090 = ("Could not locate any {format} files in path '{path}'.")
W091 = ("Could not clean/remove the temp directory at {dir}: {msg}.")
W092 = ("Ignoring annotations for sentence starts, as dependency heads are set.")
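W088 and W089 above describe the v3 rename of `begin_training` to `initialize`. A minimal sketch of the new call on a blank pipeline (purely illustrative, not taken from the diff):

    import spacy

    nlp = spacy.blank("en")
    # In spaCy v3, nlp.initialize() replaces the deprecated nlp.begin_training();
    # an optional get_examples callback can supply representative training data.
    nlp.initialize()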
@@ -103,39 +103,33 @@ class Warnings:
"download a newer compatible model or retrain your custom model "
"with the current spaCy version. For more details and available "
"updates, run: python -m spacy validate")
W096 = ("The method 'disable_pipes' has become deprecated - use 'select_pipes' "
"instead.")
W097 = ("No Model config was provided to create the '{name}' component, "
"and no default configuration could be found either.")
W098 = ("No Model config was provided to create the '{name}' component, "
"so a default configuration was used.")
W099 = ("Expected 'dict' type for the 'model' argument of pipe '{pipe}', "
"but got '{type}' instead, so ignoring it.")
W096 = ("The method `nlp.disable_pipes` is now deprecated - use "
"`nlp.select_pipes` instead.")
W100 = ("Skipping unsupported morphological feature(s): '{feature}'. "
"Provide features as a dict {{\"Field1\": \"Value1,Value2\"}} or "
"string \"Field1=Value1,Value2|Field2=Value3\".")
W101 = ("Skipping `Doc` custom extension '{name}' while merging docs.")
W101 = ("Skipping Doc custom extension '{name}' while merging docs.")
W102 = ("Skipping unsupported user data '{key}: {value}' while merging docs.")
W103 = ("Unknown {lang} word segmenter '{segmenter}'. Supported "
"word segmenters: {supported}. Defaulting to {default}.")
W104 = ("Skipping modifications for '{target}' segmenter. The current "
"segmenter is '{current}'.")
W105 = ("As of spaCy v3.0, the {matcher}.pipe method is deprecated. If you "
"need to match on a stream of documents, you can use nlp.pipe and "
W105 = ("As of spaCy v3.0, the `{matcher}.pipe` method is deprecated. If you "
"need to match on a stream of documents, you can use `nlp.pipe` and "
"call the {matcher} on each Doc object.")
W107 = ("The property Doc.{prop} is deprecated. Use "
"Doc.has_annotation(\"{attr}\") instead.")
W107 = ("The property `Doc.{prop}` is deprecated. Use "
"`Doc.has_annotation(\"{attr}\")` instead.")


@add_codes
class Errors:
E001 = ("No component '{name}' found in pipeline. Available names: {opts}")
E002 = ("Can't find factory for '{name}' for language {lang} ({lang_code}). "
"This usually happens when spaCy calls nlp.{method} with custom "
"This usually happens when spaCy calls `nlp.{method}` with custom "
"component name that's not registered on the current language class. "
"If you're using a custom component, make sure you've added the "
"decorator @Language.component (for function components) or "
"@Language.factory (for class components).\n\nAvailable "
"decorator `@Language.component` (for function components) or "
"`@Language.factory` (for class components).\n\nAvailable "
"factories: {opts}")
E003 = ("Not a valid pipeline component. Expected callable, but "
"got {component} (name: '{name}'). If you're using a custom "
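W096 above points to the v3 replacement for `disable_pipes`. A short sketch, assuming the small English pipeline is installed:

    import spacy

    nlp = spacy.load("en_core_web_sm")  # assumes this package is installed
    # select_pipes replaces the deprecated disable_pipes; used as a context
    # manager, the listed components are re-enabled on exit.
    with nlp.select_pipes(disable=["parser", "ner"]):
        doc = nlp("Only the remaining components run here.")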
@@ -153,14 +147,13 @@ class Errors:
E008 = ("Can't restore disabled pipeline component '{name}' because it "
"doesn't exist in the pipeline anymore. If you want to remove "
"components from the pipeline, you should do it before calling "
"`nlp.select_pipes()` or after restoring the disabled components.")
"`nlp.select_pipes` or after restoring the disabled components.")
E010 = ("Word vectors set to length 0. This may be because you don't have "
"a model installed or loaded, or because your model doesn't "
"include word vectors. For more info, see the docs:\n"
"https://nightly.spacy.io/usage/models")
E011 = ("Unknown operator: '{op}'. Options: {opts}")
E012 = ("Cannot add pattern for zero tokens to matcher.\nKey: {key}")
E014 = ("Unknown tag ID: {tag}")
E016 = ("MultitaskObjective target should be function or one of: dep, "
"tag, ent, dep_tag_offset, ent_tag.")
E017 = ("Can only add unicode or bytes. Got type: {value_type}")

@@ -176,27 +169,24 @@ class Errors:
"For example, are all labels added to the model? If you're "
"training a named entity recognizer, also make sure that none of "
"your annotated entity spans have leading or trailing whitespace "
"or punctuation. "
"You can also use the experimental `debug data` command to "
"or punctuation. You can also use the `debug data` command to "
"validate your JSON-formatted training data. For details, run:\n"
"python -m spacy debug data --help")
E025 = ("String is too long: {length} characters. Max is 2**30.")
E026 = ("Error accessing token at position {i}: out of bounds in Doc of "
"length {length}.")
E027 = ("Arguments 'words' and 'spaces' should be sequences of the same "
"length, or 'spaces' should be left default at None. spaces "
E027 = ("Arguments `words` and `spaces` should be sequences of the same "
"length, or `spaces` should be left default at None. `spaces` "
"should be a sequence of booleans, with True meaning that the "
"word owns a ' ' character following it.")
E028 = ("orths_and_spaces expects either a list of unicode string or a "
"list of (unicode, bool) tuples. Got bytes instance: {value}")
E029 = ("noun_chunks requires the dependency parse, which requires a "
E028 = ("`words` expects a list of unicode strings, but got bytes instance: {value}")
E029 = ("`noun_chunks` requires the dependency parse, which requires a "
"statistical model to be installed and loaded. For more info, see "
"the documentation:\nhttps://nightly.spacy.io/usage/models")
E030 = ("Sentence boundaries unset. You can add the 'sentencizer' "
"component to the pipeline with: "
"nlp.add_pipe('sentencizer'). "
"Alternatively, add the dependency parser, or set sentence "
"boundaries by setting doc[i].is_sent_start.")
"component to the pipeline with: `nlp.add_pipe('sentencizer')`. "
"Alternatively, add the dependency parser or sentence recognizer, "
"or set sentence boundaries by setting `doc[i].is_sent_start`.")
E031 = ("Invalid token: empty string ('') at position {i}.")
E033 = ("Cannot load into non-empty Doc of length {length}.")
E035 = ("Error creating span with start {start} and end {end} for Doc of "
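E030 above recommends the rule-based `sentencizer` when no parser sets sentence boundaries; a minimal illustrative example:

    import spacy

    nlp = spacy.blank("en")
    nlp.add_pipe("sentencizer")  # rule-based sentence boundaries
    doc = nlp("This is a sentence. This is another one.")
    print([sent.text for sent in doc.sents])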
@@ -210,7 +200,7 @@ class Errors:
"issue here: http://github.com/explosion/spaCy/issues")
E040 = ("Attempt to access token at {i}, max length {max_length}.")
E041 = ("Invalid comparison operator: {op}. Likely a Cython bug?")
E042 = ("Error accessing doc[{i}].nbor({j}), for doc of length {length}.")
E042 = ("Error accessing `doc[{i}].nbor({j})`, for doc of length {length}.")
E043 = ("Refusing to write to token.sent_start if its document is parsed, "
"because this may cause inconsistent state.")
E044 = ("Invalid value for token.sent_start: {value}. Must be one of: "

@@ -230,7 +220,7 @@ class Errors:
E056 = ("Invalid tokenizer exception: ORTH values combined don't match "
"original string.\nKey: {key}\nOrths: {orths}")
E057 = ("Stepped slices not supported in Span objects. Try: "
"list(tokens)[start:stop:step] instead.")
"`list(tokens)[start:stop:step]` instead.")
E058 = ("Could not retrieve vector for key {key}.")
E059 = ("One (and only one) keyword arg must be set. Got: {kwargs}")
E060 = ("Cannot add new key to vectors: the table is full. Current shape: "

@@ -239,7 +229,7 @@ class Errors:
"and 63 are occupied. You can replace one by specifying the "
"`flag_id` explicitly, e.g. "
"`nlp.vocab.add_flag(your_func, flag_id=IS_ALPHA`.")
E063 = ("Invalid value for flag_id: {value}. Flag IDs must be between 1 "
E063 = ("Invalid value for `flag_id`: {value}. Flag IDs must be between 1 "
"and 63 (inclusive).")
E064 = ("Error fetching a Lexeme from the Vocab. When looking up a "
"string, the lexeme returned had an orth ID that did not match "

@@ -268,7 +258,7 @@ class Errors:
E085 = ("Can't create lexeme for string '{string}'.")
E087 = ("Unknown displaCy style: {style}.")
E088 = ("Text of length {length} exceeds maximum of {max_length}. The "
"v2.x parser and NER models require roughly 1GB of temporary "
"parser and NER models require roughly 1GB of temporary "
"memory per 100,000 characters in the input. This means long "
"texts may cause memory allocation errors. If you're not using "
"the parser or NER, it's probably safe to increase the "

@@ -285,8 +275,8 @@ class Errors:
E094 = ("Error reading line {line_num} in vectors file {loc}.")
E095 = ("Can't write to frozen dictionary. This is likely an internal "
"error. Are you writing to a default function argument?")
E096 = ("Invalid object passed to displaCy: Can only visualize Doc or "
"Span objects, or dicts if set to manual=True.")
E096 = ("Invalid object passed to displaCy: Can only visualize `Doc` or "
"Span objects, or dicts if set to `manual=True`.")
E097 = ("Invalid pattern: expected token pattern (list of dicts) or "
"phrase pattern (string) but got:\n{pattern}")
E098 = ("Invalid pattern: expected both RIGHT_ID and RIGHT_ATTRS.")
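E096 above concerns what displaCy will accept: it visualizes Doc or Span objects, or pre-built dicts when manual=True. A small sketch, assuming a pipeline with a parser is installed:

    import spacy
    from spacy import displacy

    nlp = spacy.load("en_core_web_sm")  # assumes this package is installed
    doc = nlp("Autonomous cars shift insurance liability toward manufacturers.")
    # render() returns HTML markup; style="dep" draws the dependency parse.
    html = displacy.render(doc, style="dep")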
@@ -303,11 +293,11 @@ class Errors:
E103 = ("Trying to set conflicting doc.ents: '{span1}' and '{span2}'. A "
"token can only be part of one entity, so make sure the entities "
"you're setting don't overlap.")
E106 = ("Can't find doc._.{attr} attribute specified in the underscore "
E106 = ("Can't find `doc._.{attr}` attribute specified in the underscore "
"settings: {opts}")
E107 = ("Value of doc._.{attr} is not JSON-serializable: {value}")
E107 = ("Value of `doc._.{attr}` is not JSON-serializable: {value}")
E109 = ("Component '{name}' could not be run. Did you forget to "
"call initialize()?")
"call `initialize()`?")
E110 = ("Invalid displaCy render wrapper. Expected callable, got: {obj}")
E111 = ("Pickling a token is not supported, because tokens are only views "
"of the parent Doc and can't exist on their own. A pickled token "

@@ -324,8 +314,8 @@ class Errors:
E117 = ("The newly split tokens must match the text of the original token. "
"New orths: {new}. Old text: {old}.")
E118 = ("The custom extension attribute '{attr}' is not registered on the "
"Token object so it can't be set during retokenization. To "
"register an attribute, use the Token.set_extension classmethod.")
"`Token` object so it can't be set during retokenization. To "
"register an attribute, use the `Token.set_extension` classmethod.")
E119 = ("Can't set custom extension attribute '{attr}' during "
"retokenization because it's not writable. This usually means it "
"was registered with a getter function (and no setter) or as a "

@@ -349,7 +339,7 @@ class Errors:
E130 = ("You are running a narrow unicode build, which is incompatible "
"with spacy >= 2.1.0. To fix this, reinstall Python and use a wide "
"unicode build instead. You can also rebuild Python and set the "
"--enable-unicode=ucs4 flag.")
"`--enable-unicode=ucs4 flag`.")
E131 = ("Cannot write the kb_id of an existing Span object because a Span "
"is a read-only view of the underlying Token objects stored in "
"the Doc. Instead, create a new Span object and specify the "

@@ -362,27 +352,20 @@ class Errors:
E133 = ("The sum of prior probabilities for alias '{alias}' should not "
"exceed 1, but found {sum}.")
E134 = ("Entity '{entity}' is not defined in the Knowledge Base.")
E137 = ("Expected 'dict' type, but got '{type}' from '{line}'. Make sure "
"to provide a valid JSON object as input with either the `text` "
"or `tokens` key. For more info, see the docs:\n"
"https://nightly.spacy.io/api/cli#pretrain-jsonl")
E138 = ("Invalid JSONL format for raw text '{text}'. Make sure the input "
"includes either the `text` or `tokens` key. For more info, see "
"the docs:\nhttps://nightly.spacy.io/api/cli#pretrain-jsonl")
E139 = ("Knowledge Base for component '{name}' is empty. Use the methods "
"kb.add_entity and kb.add_alias to add entries.")
E139 = ("Knowledge base for component '{name}' is empty. Use the methods "
"`kb.add_entity` and `kb.add_alias` to add entries.")
E140 = ("The list of entities, prior probabilities and entity vectors "
"should be of equal length.")
E141 = ("Entity vectors should be of length {required} instead of the "
"provided {found}.")
E143 = ("Labels for component '{name}' not initialized. This can be fixed "
"by calling add_label, or by providing a representative batch of "
"examples to the component's initialize method.")
"examples to the component's `initialize` method.")
E145 = ("Error reading `{param}` from input file.")
E146 = ("Could not access `{path}`.")
E146 = ("Could not access {path}.")
E147 = ("Unexpected error in the {method} functionality of the "
"EntityLinker: {msg}. This is likely a bug in spaCy, so feel free "
"to open an issue.")
"to open an issue: https://github.com/explosion/spaCy/issues")
E148 = ("Expected {ents} KB identifiers but got {ids}. Make sure that "
"each entity in `doc.ents` is assigned to a KB identifier.")
E149 = ("Error deserializing model. Check that the config used to create "

@@ -390,18 +373,18 @@ class Errors:
E150 = ("The language of the `nlp` object and the `vocab` should be the "
"same, but found '{nlp}' and '{vocab}' respectively.")
E152 = ("The attribute {attr} is not supported for token patterns. "
"Please use the option validate=True with Matcher, PhraseMatcher, "
"Please use the option `validate=True` with the Matcher, PhraseMatcher, "
"or EntityRuler for more details.")
E153 = ("The value type {vtype} is not supported for token patterns. "
"Please use the option validate=True with Matcher, PhraseMatcher, "
"or EntityRuler for more details.")
E154 = ("One of the attributes or values is not supported for token "
"patterns. Please use the option validate=True with Matcher, "
"patterns. Please use the option `validate=True` with the Matcher, "
"PhraseMatcher, or EntityRuler for more details.")
E155 = ("The pipeline needs to include a {pipe} in order to use "
"Matcher or PhraseMatcher with the attribute {attr}. "
"Try using nlp() instead of nlp.make_doc() or list(nlp.pipe()) "
"instead of list(nlp.tokenizer.pipe()).")
"Try using `nlp()` instead of `nlp.make_doc()` or `list(nlp.pipe())` "
"instead of `list(nlp.tokenizer.pipe())`.")
E157 = ("Can't render negative values for dependency arc start or end. "
"Make sure that you're passing in absolute token indices, not "
"relative token offsets.\nstart: {start}, end: {end}, label: "
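E155 above is about matching on attributes that a pipeline component has to assign first (for example POS tags), which is why it recommends `nlp()` over `nlp.make_doc()`. A small sketch of that distinction, assuming the small English pipeline is installed:

    import spacy
    from spacy.matcher import Matcher

    nlp = spacy.load("en_core_web_sm")  # assumes this package is installed
    matcher = Matcher(nlp.vocab)
    matcher.add("NOUN_PAIR", [[{"POS": "NOUN"}, {"POS": "NOUN"}]])
    # nlp() runs the tagger so POS attributes exist; nlp.make_doc() only tokenizes.
    doc = nlp("The insurance company reviewed the damage report.")
    matches = matcher(doc)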
@ -410,13 +393,11 @@ class Errors:
|
|||
E159 = ("Can't find table '{name}' in lookups. Available tables: {tables}")
|
||||
E160 = ("Can't find language data file: {path}")
|
||||
E161 = ("Found an internal inconsistency when predicting entity links. "
|
||||
"This is likely a bug in spaCy, so feel free to open an issue.")
|
||||
E162 = ("Cannot evaluate textcat model on data with different labels.\n"
|
||||
"Labels in model: {model_labels}\nLabels in evaluation "
|
||||
"data: {eval_labels}")
|
||||
"This is likely a bug in spaCy, so feel free to open an issue: "
|
||||
"https://github.com/explosion/spaCy/issues")
|
||||
E163 = ("cumsum was found to be unstable: its last element does not "
|
||||
"correspond to sum")
|
||||
E164 = ("x is neither increasing nor decreasing: {}.")
|
||||
E164 = ("x is neither increasing nor decreasing: {x}.")
|
||||
E165 = ("Only one class present in y_true. ROC AUC score is not defined in "
|
||||
"that case.")
|
||||
E166 = ("Can only merge DocBins with the same value for '{param}'.\n"
|
||||
|
@ -431,10 +412,10 @@ class Errors:
|
|||
E178 = ("Each pattern should be a list of dicts, but got: {pat}. Maybe you "
|
||||
"accidentally passed a single pattern to Matcher.add instead of a "
|
||||
"list of patterns? If you only want to add one pattern, make sure "
|
||||
"to wrap it in a list. For example: matcher.add('{key}', [pattern])")
|
||||
"to wrap it in a list. For example: `matcher.add('{key}', [pattern])`")
|
||||
E179 = ("Invalid pattern. Expected a list of Doc objects but got a single "
|
||||
"Doc. If you only want to add one pattern, make sure to wrap it "
|
||||
"in a list. For example: matcher.add('{key}', [doc])")
|
||||
"in a list. For example: `matcher.add('{key}', [doc])`")
|
||||
E180 = ("Span attributes can't be declared as required or assigned by "
|
||||
"components, since spans are only views of the Doc. Use Doc and "
|
||||
"Token attributes (or custom extension attributes) only and remove "
|
||||
|
@ -442,17 +423,16 @@ class Errors:
|
|||
E181 = ("Received invalid attributes for unkown object {obj}: {attrs}. "
|
||||
"Only Doc and Token attributes are supported.")
|
||||
E182 = ("Received invalid attribute declaration: {attr}\nDid you forget "
|
||||
"to define the attribute? For example: {attr}.???")
|
||||
"to define the attribute? For example: `{attr}.???`")
|
||||
E183 = ("Received invalid attribute declaration: {attr}\nOnly top-level "
|
||||
"attributes are supported, for example: {solution}")
|
||||
E184 = ("Only attributes without underscores are supported in component "
|
||||
"attribute declarations (because underscore and non-underscore "
|
||||
"attributes are connected anyways): {attr} -> {solution}")
|
||||
E185 = ("Received invalid attribute in component attribute declaration: "
|
||||
"{obj}.{attr}\nAttribute '{attr}' does not exist on {obj}.")
|
||||
E186 = ("'{tok_a}' and '{tok_b}' are different texts.")
|
||||
"`{obj}.{attr}`\nAttribute '{attr}' does not exist on {obj}.")
|
||||
E187 = ("Only unicode strings are supported as labels.")
|
||||
E189 = ("Each argument to Doc.__init__ should be of equal length.")
|
||||
E189 = ("Each argument to `Doc.__init__` should be of equal length.")
|
||||
E190 = ("Token head out of range in `Doc.from_array()` for token index "
|
||||
"'{index}' with value '{value}' (equivalent to relative head "
|
||||
"index: '{rel_head_index}'). The head indices should be relative "
|
||||
|
@ -466,17 +446,32 @@ class Errors:
|
|||
"({curr_dim}).")
|
||||
E194 = ("Unable to aligned mismatched text '{text}' and words '{words}'.")
|
||||
E195 = ("Matcher can be called on {good} only, got {got}.")
|
||||
E196 = ("Refusing to write to token.is_sent_end. Sentence boundaries can "
|
||||
"only be fixed with token.is_sent_start.")
|
||||
E196 = ("Refusing to write to `token.is_sent_end`. Sentence boundaries can "
|
||||
"only be fixed with `token.is_sent_start`.")
|
||||
E197 = ("Row out of bounds, unable to add row {row} for key {key}.")
|
||||
E198 = ("Unable to return {n} most similar vectors for the current vectors "
|
||||
"table, which contains {n_rows} vectors.")
|
||||
E199 = ("Unable to merge 0-length span at doc[{start}:{end}].")
|
||||
E200 = ("Specifying a base model with a pretrained component '{component}' "
|
||||
"can not be combined with adding a pretrained Tok2Vec layer.")
|
||||
E201 = ("Span index out of range.")
|
||||
E199 = ("Unable to merge 0-length span at `doc[{start}:{end}]`.")
|
||||
E200 = ("Can't yet set {attr} from Span. Vote for this feature on the "
|
||||
"issue tracker: http://github.com/explosion/spaCy/issues")
|
||||
|
||||
# TODO: fix numbering after merging develop into master
|
||||
E092 = ("The sentence-per-line IOB/IOB2 file is not formatted correctly. "
|
||||
"Try checking whitespace and delimiters. See "
|
||||
"https://nightly.spacy.io/api/cli#convert")
|
||||
E093 = ("The token-per-line NER file is not formatted correctly. Try checking "
|
||||
"whitespace and delimiters. See https://nightly.spacy.io/api/cli#convert")
|
||||
E904 = ("Cannot initialize StaticVectors layer: nO dimension unset. This "
|
||||
"dimension refers to the output width, after the linear projection "
|
||||
"has been applied.")
|
||||
E905 = ("Cannot initialize StaticVectors layer: nM dimension unset. This "
|
||||
"dimension refers to the width of the vectors table.")
|
||||
E906 = ("Unexpected `loss` value in pretraining objective: {loss_type}")
|
||||
E907 = ("Unexpected `objective_type` value in pretraining objective: {objective_type}")
|
||||
E908 = ("Can't set `spaces` without `words` in `Doc.__init__`.")
|
||||
E909 = ("Expected {name} in parser internals. This is likely a bug in spaCy.")
|
||||
E910 = ("Encountered NaN value when computing loss for component '{name}'.")
|
||||
E911 = ("Invalid feature: {feat}. Must be a token attribute.")
|
||||
E912 = ("Failed to initialize lemmatizer. Missing lemmatizer table(s) found "
|
||||
"for mode '{mode}'. Required tables: {tables}. Found: {found}.")
|
||||
E913 = ("Corpus path can't be None. Maybe you forgot to define it in your "
|
||||
|
@ -489,43 +484,44 @@ class Errors:
|
|||
"final score, set its weight to null in the [training.score_weights] "
|
||||
"section of your training config.")
|
||||
E916 = ("Can't log score for '{name}' in table: not a valid score ({score_type})")
|
||||
E917 = ("Received invalid value {value} for 'state_type' in "
|
||||
E917 = ("Received invalid value {value} for `state_type` in "
|
||||
"TransitionBasedParser: only 'parser' or 'ner' are valid options.")
|
||||
E918 = ("Received invalid value for vocab: {vocab} ({vocab_type}). Valid "
|
||||
"values are an instance of spacy.vocab.Vocab or True to create one"
|
||||
"values are an instance of `spacy.vocab.Vocab` or True to create one"
|
||||
" (default).")
|
||||
E919 = ("A textcat 'positive_label' '{pos_label}' was provided for training "
|
||||
E919 = ("A textcat `positive_label` '{pos_label}' was provided for training "
|
||||
"data that does not appear to be a binary classification problem "
|
||||
"with two labels. Labels found: {labels}")
|
||||
E920 = ("The textcat's 'positive_label' config setting '{pos_label}' "
|
||||
"does not match any label in the training data. Labels found: {labels}")
|
||||
E921 = ("The method 'set_output' can only be called on components that have "
|
||||
"a Model with a 'resize_output' attribute. Otherwise, the output "
|
||||
E920 = ("The textcat's `positive_label` setting '{pos_label}' "
|
||||
"does not match any label in the training data or provided during "
|
||||
"initialization. Available labels: {labels}")
|
||||
E921 = ("The method `set_output` can only be called on components that have "
|
||||
"a Model with a `resize_output` attribute. Otherwise, the output "
|
||||
"layer can not be dynamically changed.")
|
||||
E922 = ("Component '{name}' has been initialized with an output dimension of "
|
||||
"{nO} - cannot add any more labels.")
|
||||
E923 = ("It looks like there is no proper sample data to initialize the "
|
||||
"Model of component '{name}'. "
|
||||
"This is likely a bug in spaCy, so feel free to open an issue.")
|
||||
"Model of component '{name}'. This is likely a bug in spaCy, so "
|
||||
"feel free to open an issue: https://github.com/explosion/spaCy/issues")
|
||||
E924 = ("The '{name}' component does not seem to be initialized properly. "
|
||||
"This is likely a bug in spaCy, so feel free to open an issue.")
|
||||
"This is likely a bug in spaCy, so feel free to open an issue: "
|
||||
"https://github.com/explosion/spaCy/issues")
|
||||
E925 = ("Invalid color values for displaCy visualizer: expected dictionary "
|
||||
"mapping label names to colors but got: {obj}")
|
||||
E926 = ("It looks like you're trying to modify nlp.{attr} directly. This "
|
||||
E926 = ("It looks like you're trying to modify `nlp.{attr}` directly. This "
|
||||
"doesn't work because it's an immutable computed property. If you "
|
||||
"need to modify the pipeline, use the built-in methods like "
|
||||
"nlp.add_pipe, nlp.remove_pipe, nlp.disable_pipe or nlp.enable_pipe "
|
||||
"instead.")
|
||||
"`nlp.add_pipe`, `nlp.remove_pipe`, `nlp.disable_pipe` or "
|
||||
"`nlp.enable_pipe` instead.")
|
||||
E927 = ("Can't write to frozen list Maybe you're trying to modify a computed "
|
||||
"property or default function argument?")
|
||||
E928 = ("A 'KnowledgeBase' can only be serialized to/from from a directory, "
|
||||
E928 = ("A KnowledgeBase can only be serialized to/from from a directory, "
|
||||
"but the provided argument {loc} points to a file.")
|
||||
E929 = ("A 'KnowledgeBase' could not be read from {loc} - the path does "
|
||||
"not seem to exist.")
|
||||
E930 = ("Received invalid get_examples callback in {name}.initialize. "
|
||||
E929 = ("Couldn't read KnowledgeBase from {loc}. The path does not seem to exist.")
|
||||
E930 = ("Received invalid get_examples callback in `{name}.initialize`. "
|
||||
"Expected function that returns an iterable of Example objects but "
|
||||
"got: {obj}")
|
||||
E931 = ("Encountered Pipe subclass without Pipe.{method} method in component "
|
||||
E931 = ("Encountered Pipe subclass without `Pipe.{method}` method in component "
|
||||
"'{name}'. If the component is trainable and you want to use this "
|
||||
"method, make sure it's overwritten on the subclass. If your "
|
||||
"component isn't trainable, add a method that does nothing or "
|
||||
|
@ -538,21 +534,21 @@ class Errors:
|
|||
"models, see the models directory: https://spacy.io/models. If you "
|
||||
"want to create a blank model, use spacy.blank: "
|
||||
"nlp = spacy.blank(\"{name}\")")
|
||||
E942 = ("Executing after_{name} callback failed. Expected the function to "
|
||||
E942 = ("Executing `after_{name}` callback failed. Expected the function to "
|
||||
"return an initialized nlp object but got: {value}. Maybe "
|
||||
"you forgot to return the modified object in your function?")
|
||||
E943 = ("Executing before_creation callback failed. Expected the function to "
|
||||
E943 = ("Executing `before_creation` callback failed. Expected the function to "
|
||||
"return an uninitialized Language subclass but got: {value}. Maybe "
|
||||
"you forgot to return the modified object in your function or "
|
||||
"returned the initialized nlp object instead?")
|
||||
E944 = ("Can't copy pipeline component '{name}' from source model '{model}': "
|
||||
E944 = ("Can't copy pipeline component '{name}' from source '{model}': "
|
||||
"not found in pipeline. Available components: {opts}")
|
||||
E945 = ("Can't copy pipeline component '{name}' from source. Expected loaded "
|
||||
"nlp object, but got: {source}")
|
||||
E947 = ("Matcher.add received invalid 'greedy' argument: expected "
|
||||
E947 = ("`Matcher.add` received invalid `greedy` argument: expected "
|
||||
"a string value from {expected} but got: '{arg}'")
|
||||
E948 = ("Matcher.add received invalid 'patterns' argument: expected "
|
||||
"a List, but got: {arg_type}")
|
||||
E948 = ("`Matcher.add` received invalid 'patterns' argument: expected "
|
||||
"a list, but got: {arg_type}")
|
||||
E949 = ("Can only create an alignment when the texts are the same.")
|
||||
E952 = ("The section '{name}' is not a valid section in the provided config.")
|
||||
E953 = ("Mismatched IDs received by the Tok2Vec listener: {id1} vs. {id2}")
|
||||
|
@ -564,9 +560,9 @@ class Errors:
|
|||
"for your language.")
|
||||
E956 = ("Can't find component '{name}' in [components] block in the config. "
|
||||
"Available components: {opts}")
|
||||
E957 = ("Writing directly to Language.factories isn't needed anymore in "
|
||||
"spaCy v3. Instead, you can use the @Language.factory decorator "
|
||||
"to register your custom component factory or @Language.component "
|
||||
E957 = ("Writing directly to `Language.factories` isn't needed anymore in "
|
||||
"spaCy v3. Instead, you can use the `@Language.factory` decorator "
|
||||
"to register your custom component factory or `@Language.component` "
|
||||
"to register a simple stateless function component that just takes "
|
||||
"a Doc and returns it.")
|
||||
E958 = ("Language code defined in config ({bad_lang_code}) does not match "
|
||||
|
@ -584,99 +580,93 @@ class Errors:
|
|||
"component.\n\n{config}")
|
||||
E962 = ("Received incorrect {style} for pipe '{name}'. Expected dict, "
|
||||
"got: {cfg_type}.")
|
||||
E963 = ("Can't read component info from @Language.{decorator} decorator. "
|
||||
E963 = ("Can't read component info from `@Language.{decorator}` decorator. "
|
||||
"Maybe you forgot to call it? Make sure you're using "
|
||||
"@Language.{decorator}() instead of @Language.{decorator}.")
|
||||
"`@Language.{decorator}()` instead of `@Language.{decorator}`.")
|
||||
E964 = ("The pipeline component factory for '{name}' needs to have the "
|
||||
"following named arguments, which are passed in by spaCy:\n- nlp: "
|
||||
"receives the current nlp object and lets you access the vocab\n- "
|
||||
"name: the name of the component instance, can be used to identify "
|
||||
"the component, output losses etc.")
|
||||
E965 = ("It looks like you're using the @Language.component decorator to "
|
||||
E965 = ("It looks like you're using the `@Language.component` decorator to "
|
||||
"register '{name}' on a class instead of a function component. If "
|
||||
"you need to register a class or function that *returns* a component "
|
||||
"function, use the @Language.factory decorator instead.")
|
||||
E966 = ("nlp.add_pipe now takes the string name of the registered component "
|
||||
"function, use the `@Language.factory` decorator instead.")
|
||||
E966 = ("`nlp.add_pipe` now takes the string name of the registered component "
|
||||
"factory, not a callable component. Expected string, but got "
|
||||
"{component} (name: '{name}').\n\n- If you created your component "
|
||||
"with nlp.create_pipe('name'): remove nlp.create_pipe and call "
|
||||
"nlp.add_pipe('name') instead.\n\n- If you passed in a component "
|
||||
"like TextCategorizer(): call nlp.add_pipe with the string name "
|
||||
"instead, e.g. nlp.add_pipe('textcat').\n\n- If you're using a custom "
|
||||
"component: Add the decorator @Language.component (for function "
|
||||
"components) or @Language.factory (for class components / factories) "
|
||||
"with `nlp.create_pipe('name')`: remove nlp.create_pipe and call "
|
||||
"`nlp.add_pipe('name')` instead.\n\n- If you passed in a component "
|
||||
"like `TextCategorizer()`: call `nlp.add_pipe` with the string name "
|
||||
"instead, e.g. `nlp.add_pipe('textcat')`.\n\n- If you're using a custom "
|
||||
"component: Add the decorator `@Language.component` (for function "
|
||||
"components) or `@Language.factory` (for class components / factories) "
|
||||
"to your custom component and assign it a name, e.g. "
|
||||
"@Language.component('your_name'). You can then run "
|
||||
"nlp.add_pipe('your_name') to add it to the pipeline.")
|
||||
"`@Language.component('your_name')`. You can then run "
|
||||
"`nlp.add_pipe('your_name')` to add it to the pipeline.")
|
||||
E967 = ("No {meta} meta information found for '{name}'. This is likely a bug in spaCy.")
|
||||
E968 = ("nlp.replace_pipe now takes the string name of the registered component "
|
||||
E968 = ("`nlp.replace_pipe` now takes the string name of the registered component "
|
||||
"factory, not a callable component. Expected string, but got "
|
||||
"{component}.\n\n- If you created your component with"
|
||||
"with nlp.create_pipe('name'): remove nlp.create_pipe and call "
|
||||
"nlp.replace_pipe('{name}', 'name') instead.\n\n- If you passed in a "
|
||||
"component like TextCategorizer(): call nlp.replace_pipe with the "
|
||||
"string name instead, e.g. nlp.replace_pipe('{name}', 'textcat').\n\n"
|
||||
"with `nlp.create_pipe('name')`: remove `nlp.create_pipe` and call "
|
||||
"`nlp.replace_pipe('{name}', 'name')` instead.\n\n- If you passed in a "
|
||||
"component like `TextCategorizer()`: call `nlp.replace_pipe` with the "
|
||||
"string name instead, e.g. `nlp.replace_pipe('{name}', 'textcat')`.\n\n"
|
||||
"- If you're using a custom component: Add the decorator "
|
||||
"@Language.component (for function components) or @Language.factory "
|
||||
"`@Language.component` (for function components) or `@Language.factory` "
|
||||
"(for class components / factories) to your custom component and "
|
||||
"assign it a name, e.g. @Language.component('your_name'). You can "
|
||||
"then run nlp.replace_pipe('{name}', 'your_name').")
|
||||
"assign it a name, e.g. `@Language.component('your_name')`. You can "
|
||||
"then run `nlp.replace_pipe('{name}', 'your_name')`.")
|
||||
E969 = ("Expected string values for field '{field}', but received {types} instead. ")
|
||||
E970 = ("Can not execute command '{str_command}'. Do you have '{tool}' installed?")
|
||||
E971 = ("Found incompatible lengths in Doc.from_array: {array_length} for the "
|
||||
E971 = ("Found incompatible lengths in `Doc.from_array`: {array_length} for the "
|
||||
"array and {doc_length} for the Doc itself.")
|
||||
E972 = ("Example.__init__ got None for '{arg}'. Requires Doc.")
|
||||
E972 = ("`Example.__init__` got None for '{arg}'. Requires Doc.")
|
||||
E973 = ("Unexpected type for NER data")
|
||||
E974 = ("Unknown {obj} attribute: {key}")
|
||||
E976 = ("The method 'Example.from_dict' expects a {type} as {n} argument, "
|
||||
E976 = ("The method `Example.from_dict` expects a {type} as {n} argument, "
|
||||
"but received None.")
|
||||
E977 = ("Can not compare a MorphAnalysis with a string object. "
|
||||
"This is likely a bug in spaCy, so feel free to open an issue.")
|
||||
"This is likely a bug in spaCy, so feel free to open an issue: "
|
||||
"https://github.com/explosion/spaCy/issues")
|
||||
E978 = ("The {name} method takes a list of Example objects, but got: {types}")
|
||||
E979 = ("Cannot convert {type} to an Example object.")
|
||||
E980 = ("Each link annotation should refer to a dictionary with at most one "
|
||||
"identifier mapping to 1.0, and all others to 0.0.")
|
||||
E981 = ("The offsets of the annotations for 'links' could not be aligned "
|
||||
E981 = ("The offsets of the annotations for `links` could not be aligned "
|
||||
"to token boundaries.")
|
||||
E982 = ("The 'ent_iob' attribute of a Token should be an integer indexing "
|
||||
E982 = ("The `Token.ent_iob` attribute should be an integer indexing "
|
||||
"into {values}, but found {value}.")
|
||||
E983 = ("Invalid key for '{dict}': {key}. Available keys: "
|
||||
"{keys}")
|
||||
E984 = ("Invalid component config for '{name}': component block needs either "
|
||||
"a key 'factory' specifying the registered function used to "
|
||||
"initialize the component, or a key 'source' key specifying a "
|
||||
"spaCy model to copy the component from. For example, factory = "
|
||||
"\"ner\" will use the 'ner' factory and all other settings in the "
|
||||
"block will be passed to it as arguments. Alternatively, source = "
|
||||
"\"en_core_web_sm\" will copy the component from that model.\n\n{config}")
|
||||
E985 = ("Can't load model from config file: no 'nlp' section found.\n\n{config}")
|
||||
"a key `factory` specifying the registered function used to "
|
||||
"initialize the component, or a key `source` key specifying a "
|
||||
"spaCy model to copy the component from. For example, `factory = "
|
||||
"\"ner\"` will use the 'ner' factory and all other settings in the "
|
||||
"block will be passed to it as arguments. Alternatively, `source = "
|
||||
"\"en_core_web_sm\"` will copy the component from that model.\n\n{config}")
|
||||
E985 = ("Can't load model from config file: no [nlp] section found.\n\n{config}")
|
||||
E986 = ("Could not create any training batches: check your input. "
|
||||
"Are the train and dev paths defined? "
|
||||
"Is 'discard_oversize' set appropriately? ")
|
||||
E987 = ("The text of an example training instance is either a Doc or "
|
||||
"a string, but found {type} instead.")
|
||||
E988 = ("Could not parse any training examples. Ensure the data is "
|
||||
"formatted correctly.")
|
||||
E989 = ("'nlp.update()' was called with two positional arguments. This "
|
||||
"Are the train and dev paths defined? Is `discard_oversize` set appropriately? ")
|
||||
E989 = ("`nlp.update()` was called with two positional arguments. This "
|
||||
"may be due to a backwards-incompatible change to the format "
|
||||
"of the training data in spaCy 3.0 onwards. The 'update' "
|
||||
"function should now be called with a batch of 'Example' "
|
||||
"objects, instead of (text, annotation) tuples. ")
|
||||
E991 = ("The function 'select_pipes' should be called with either a "
|
||||
"'disable' argument to list the names of the pipe components "
|
||||
"function should now be called with a batch of Example "
|
||||
"objects, instead of `(text, annotation)` tuples. ")
|
||||
E991 = ("The function `nlp.select_pipes` should be called with either a "
|
||||
"`disable` argument to list the names of the pipe components "
|
||||
"that should be disabled, or with an 'enable' argument that "
|
||||
"specifies which pipes should not be disabled.")
|
||||
E992 = ("The function `select_pipes` was called with `enable`={enable} "
|
||||
"and `disable`={disable} but that information is conflicting "
|
||||
"for the `nlp` pipeline with components {names}.")
|
||||
E993 = ("The config for 'nlp' needs to include a key 'lang' specifying "
|
||||
E993 = ("The config for the nlp object needs to include a key `lang` specifying "
|
||||
"the code of the language to initialize it with (for example "
|
||||
"'en' for English) - this can't be 'None'.\n\n{config}")
|
||||
E996 = ("Could not parse {file}: {msg}")
|
||||
"'en' for English) - this can't be None.\n\n{config}")
|
||||
E997 = ("Tokenizer special cases are not allowed to modify the text. "
|
||||
"This would map '{chunk}' to '{orth}' given token attributes "
|
||||
"'{token_attrs}'.")
|
||||
E999 = ("Unable to merge the `Doc` objects because they do not all share "
|
||||
E999 = ("Unable to merge the Doc objects because they do not all share "
|
||||
"the same `Vocab`.")
|
||||
E1000 = ("The Chinese word segmenter is pkuseg but no pkuseg model was "
|
||||
"loaded. Provide the name of a pretrained model or the path to "
|
||||
|
@ -688,35 +678,24 @@ class Errors:
|
|||
E1003 = ("Unsupported lemmatizer mode '{mode}'.")
|
||||
E1004 = ("Missing lemmatizer table(s) found for lemmatizer mode '{mode}'. "
|
||||
"Required tables: {tables}. Found: {found}. Maybe you forgot to "
|
||||
"call nlp.initialize() to load in the data?")
|
||||
"call `nlp.initialize()` to load in the data?")
|
||||
E1005 = ("Unable to set attribute '{attr}' in tokenizer exception for "
|
||||
"'{chunk}'. Tokenizer exceptions are only allowed to specify "
|
||||
"`ORTH` and `NORM`.")
|
||||
E1006 = ("Unable to initialize {name} model with 0 labels.")
|
||||
"ORTH and NORM.")
|
||||
E1007 = ("Unsupported DependencyMatcher operator '{op}'.")
|
||||
E1008 = ("Invalid pattern: each pattern should be a list of dicts. Check "
|
||||
"that you are providing a list of patterns as `List[List[dict]]`.")
|
||||
E1009 = ("String for hash '{val}' not found in StringStore. Set the value "
|
||||
"through token.morph_ instead or add the string to the "
|
||||
"StringStore with `nlp.vocab.strings.add(string)`.")
|
||||
E1010 = ("Unable to set entity information for token {i} which is included "
|
||||
"in more than one span in entities, blocked, missing or outside.")
|
||||
E1011 = ("Unsupported default '{default}' in doc.set_ents. Available "
|
||||
E1011 = ("Unsupported default '{default}' in `doc.set_ents`. Available "
|
||||
"options: {modes}")
|
||||
E1012 = ("Entity spans and blocked/missing/outside spans should be "
|
||||
"provided to doc.set_ents as lists of `Span` objects.")
|
||||
"provided to `doc.set_ents` as lists of Span objects.")
|
||||
E1013 = ("Invalid morph: the MorphAnalysis must have the same vocab as the "
|
||||
"token itself. To set the morph from this MorphAnalysis, set from "
|
||||
"the string value with: `token.set_morph(str(other_morph))`.")
|
||||
|
||||
|
||||
@add_codes
|
||||
class TempErrors:
|
||||
T003 = ("Resizing pretrained Tagger models is not currently supported.")
|
||||
T007 = ("Can't yet set {attr} from Span. Vote for this feature on the "
|
||||
"issue tracker: http://github.com/explosion/spaCy/issues")
|
||||
|
||||
|
||||
# Deprecated model shortcuts, only used in errors and warnings
|
||||
OLD_MODEL_SHORTCUTS = {
|
||||
"en": "en_core_web_sm", "de": "de_core_news_sm", "es": "es_core_news_sm",
|
||||
|
|
|
@ -22,9 +22,13 @@ def noun_chunks(doclike: Union[Doc, Span]) -> Iterator[Span]:
|
|||
np_deps = set(doc.vocab.strings.add(label) for label in labels)
|
||||
close_app = doc.vocab.strings.add("nk")
|
||||
rbracket = 0
|
||||
prev_end = -1
|
||||
for i, word in enumerate(doclike):
|
||||
if i < rbracket:
|
||||
continue
|
||||
# Prevent nested chunks from being produced
|
||||
if word.left_edge.i <= prev_end:
|
||||
continue
|
||||
if word.pos in (NOUN, PROPN, PRON) and word.dep in np_deps:
|
||||
rbracket = word.i + 1
|
||||
# try to extend the span to the right
|
||||
|
@ -32,6 +36,7 @@ def noun_chunks(doclike: Union[Doc, Span]) -> Iterator[Span]:
|
|||
for rdep in doc[word.i].rights:
|
||||
if rdep.pos in (NOUN, PROPN) and rdep.dep == close_app:
|
||||
rbracket = rdep.i + 1
|
||||
prev_end = rbracket - 1
|
||||
yield word.left_edge.i, rbracket, np_label
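The `prev_end` bookkeeping above keeps nested candidates out of the iterator. Consumed through the public API it looks roughly like this (a sketch; the `nk` dependency suggests this is the German syntax iterator, so a German pipeline such as de_core_news_sm is assumed to be installed):

import spacy

nlp = spacy.load("de_core_news_sm")  # assumed model name; any German pipeline with a parser works
doc = nlp("Die Regierung der Stadt plant den Bau einer neuen Brücke.")
for chunk in doc.noun_chunks:        # nested spans are skipped by the prev_end check
    print(chunk.text, chunk.root.dep_)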
|
||||
|
||||
|
||||
|
|
|
@ -1,4 +1,4 @@
|
|||
from typing import List, Dict
|
||||
from typing import List, Tuple
|
||||
|
||||
from ...pipeline import Lemmatizer
|
||||
from ...tokens import Token
|
||||
|
@ -15,17 +15,10 @@ class FrenchLemmatizer(Lemmatizer):
|
|||
"""
|
||||
|
||||
@classmethod
|
||||
def get_lookups_config(cls, mode: str) -> Dict:
|
||||
def get_lookups_config(cls, mode: str) -> Tuple[List[str], List[str]]:
|
||||
if mode == "rule":
|
||||
return {
|
||||
"required_tables": [
|
||||
"lemma_lookup",
|
||||
"lemma_rules",
|
||||
"lemma_exc",
|
||||
"lemma_index",
|
||||
],
|
||||
"optional_tables": [],
|
||||
}
|
||||
required = ["lemma_lookup", "lemma_rules", "lemma_exc", "lemma_index"]
|
||||
return (required, [])
|
||||
else:
|
||||
return super().get_lookups_config(mode)
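As the hunk above shows, `get_lookups_config` now returns a `(required, optional)` tuple of table names instead of a dict. A sketch of how a caller might consume it (import path assumed):

from spacy.lang.fr.lemmatizer import FrenchLemmatizer  # assumed import path

required, optional = FrenchLemmatizer.get_lookups_config("rule")
assert required == ["lemma_lookup", "lemma_rules", "lemma_exc", "lemma_index"]
assert optional == []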
|
||||
|
||||
|
|
|
@ -7,8 +7,8 @@ Example sentences to test spaCy and its language models.
|
|||
|
||||
|
||||
sentences = [
    "Al Qaidah mengklaim bom mobil yang menewaskan 60 Orang di Mali",
    "Abu Sayyaf mengeksekusi sandera warga Filipina",
    "Indonesia merupakan negara kepulauan yang kaya akan budaya.",
    "Berapa banyak warga yang dibutuhkan saat kerja bakti?",
    "Penyaluran pupuk berasal dari lima lokasi yakni Bontang, Kalimantan Timur, Surabaya, Banyuwangi, Semarang, dan Makassar.",
    "PT Pupuk Kaltim telah menyalurkan 274.707 ton pupuk bersubsidi ke wilayah penyaluran di 14 provinsi.",
    "Jakarta adalah kota besar yang nyaris tidak pernah tidur."
|
|
|
@ -1,4 +1,4 @@
|
|||
from typing import List, Dict
|
||||
from typing import List, Tuple
|
||||
|
||||
from ...pipeline import Lemmatizer
|
||||
from ...tokens import Token
|
||||
|
@ -6,16 +6,10 @@ from ...tokens import Token
|
|||
|
||||
class DutchLemmatizer(Lemmatizer):
|
||||
@classmethod
|
||||
def get_lookups_config(cls, mode: str) -> Dict:
|
||||
def get_lookups_config(cls, mode: str) -> Tuple[List[str], List[str]]:
|
||||
if mode == "rule":
|
||||
return {
|
||||
"required_tables": [
|
||||
"lemma_lookup",
|
||||
"lemma_rules",
|
||||
"lemma_exc",
|
||||
"lemma_index",
|
||||
],
|
||||
}
|
||||
required = ["lemma_lookup", "lemma_rules", "lemma_exc", "lemma_index"]
|
||||
return (required, [])
|
||||
else:
|
||||
return super().get_lookups_config(mode)
|
||||
|
||||
|
|
|
@ -8,7 +8,6 @@ from .stop_words import STOP_WORDS
|
|||
from .lex_attrs import LEX_ATTRS
|
||||
from .lemmatizer import PolishLemmatizer
|
||||
from ..tokenizer_exceptions import BASE_EXCEPTIONS
|
||||
from ...lookups import Lookups
|
||||
from ...language import Language
|
||||
|
||||
|
||||
|
|
|
@ -1,4 +1,4 @@
|
|||
from typing import List, Dict
|
||||
from typing import List, Dict, Tuple
|
||||
|
||||
from ...pipeline import Lemmatizer
|
||||
from ...tokens import Token
|
||||
|
@ -11,21 +11,16 @@ class PolishLemmatizer(Lemmatizer):
|
|||
# lemmatization, as well as case-sensitive lemmatization for nouns.
|
||||
|
||||
@classmethod
|
||||
def get_lookups_config(cls, mode: str) -> Dict:
|
||||
def get_lookups_config(cls, mode: str) -> Tuple[List[str], List[str]]:
|
||||
if mode == "pos_lookup":
|
||||
return {
|
||||
"required_tables": [
|
||||
"lemma_lookup_adj",
|
||||
"lemma_lookup_adp",
|
||||
"lemma_lookup_adv",
|
||||
"lemma_lookup_aux",
|
||||
"lemma_lookup_noun",
|
||||
"lemma_lookup_num",
|
||||
"lemma_lookup_part",
|
||||
"lemma_lookup_pron",
|
||||
"lemma_lookup_verb",
|
||||
]
|
||||
}
|
||||
# fmt: off
|
||||
required = [
|
||||
"lemma_lookup_adj", "lemma_lookup_adp", "lemma_lookup_adv",
|
||||
"lemma_lookup_aux", "lemma_lookup_noun", "lemma_lookup_num",
|
||||
"lemma_lookup_part", "lemma_lookup_pron", "lemma_lookup_verb"
|
||||
]
|
||||
# fmt: on
|
||||
return (required, [])
|
||||
else:
|
||||
return super().get_lookups_config(mode)
|
||||
|
||||
|
|
|
@ -47,7 +47,7 @@ class Segmenter(str, Enum):
|
|||
|
||||
|
||||
@registry.tokenizers("spacy.zh.ChineseTokenizer")
|
||||
def create_chinese_tokenizer(segmenter: Segmenter = Segmenter.char,):
|
||||
def create_chinese_tokenizer(segmenter: Segmenter = Segmenter.char):
|
||||
def chinese_tokenizer_factory(nlp):
|
||||
return ChineseTokenizer(nlp, segmenter=segmenter)
|
||||
|
||||
|
|
|
@ -896,6 +896,10 @@ class Language:
|
|||
self._components[i] = (new_name, self._components[i][1])
|
||||
self._pipe_meta[new_name] = self._pipe_meta.pop(old_name)
|
||||
self._pipe_configs[new_name] = self._pipe_configs.pop(old_name)
|
||||
# Make sure [initialize] config is adjusted
|
||||
if old_name in self._config["initialize"]["components"]:
|
||||
init_cfg = self._config["initialize"]["components"].pop(old_name)
|
||||
self._config["initialize"]["components"][new_name] = init_cfg
|
||||
|
||||
def remove_pipe(self, name: str) -> Tuple[str, Callable[[Doc], Doc]]:
|
||||
"""Remove a component from the pipeline.
|
||||
|
@ -912,6 +916,9 @@ class Language:
|
|||
# because factory may be used for something else
|
||||
self._pipe_meta.pop(name)
|
||||
self._pipe_configs.pop(name)
|
||||
# Make sure name is removed from the [initialize] config
|
||||
if name in self._config["initialize"]["components"]:
|
||||
self._config["initialize"]["components"].pop(name)
|
||||
# Make sure the name is also removed from the set of disabled components
|
||||
if name in self.disabled:
|
||||
self._disabled.remove(name)
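A minimal sketch of the behaviour these two hunks add, mirroring the test added later in this commit:

from spacy.language import Language

nlp = Language()
nlp.add_pipe("tagger")
nlp.config["initialize"]["components"]["tagger"] = {"labels": ["hello"]}
nlp.rename_pipe("tagger", "my_tagger")
assert "tagger" not in nlp.config["initialize"]["components"]
assert nlp.config["initialize"]["components"]["my_tagger"] == {"labels": ["hello"]}
nlp.remove_pipe("my_tagger")
assert "my_tagger" not in nlp.config["initialize"]["components"]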
|
||||
|
|
|
@ -6,6 +6,7 @@ from thinc.api import expand_window, residual, Maxout, Mish, PyTorchLSTM
|
|||
|
||||
from ...tokens import Doc
|
||||
from ...util import registry
|
||||
from ...errors import Errors
|
||||
from ...ml import _character_embed
|
||||
from ..staticvectors import StaticVectors
|
||||
from ..featureextractor import FeatureExtractor
|
||||
|
@ -165,8 +166,12 @@ def MultiHashEmbed(
|
|||
|
||||
@registry.architectures.register("spacy.CharacterEmbed.v1")
|
||||
def CharacterEmbed(
|
||||
width: int, rows: int, nM: int, nC: int, also_use_static_vectors: bool,
|
||||
feature: Union[int, str]="LOWER"
|
||||
width: int,
|
||||
rows: int,
|
||||
nM: int,
|
||||
nC: int,
|
||||
also_use_static_vectors: bool,
|
||||
feature: Union[int, str] = "LOWER",
|
||||
) -> Model[List[Doc], List[Floats2d]]:
|
||||
"""Construct an embedded representation based on character embeddings, using
|
||||
a feed-forward network. A fixed number of UTF-8 byte characters are used for
|
||||
|
@ -197,7 +202,7 @@ def CharacterEmbed(
|
|||
"""
|
||||
feature = intify_attr(feature)
|
||||
if feature is None:
|
||||
raise ValueError("Invalid feature: Must be a token attribute.")
|
||||
raise ValueError(Errors.E911(feat=feature))
|
||||
if also_use_static_vectors:
|
||||
model = chain(
|
||||
concatenate(
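For orientation, the CharacterEmbed signature reformatted above corresponds to an architecture block along these lines (a hypothetical sketch; the numeric values are illustrative only):

from spacy.util import registry

cfg = {
    "embed": {
        "@architectures": "spacy.CharacterEmbed.v1",
        "width": 128,
        "rows": 7000,
        "nM": 64,
        "nC": 8,
        "also_use_static_vectors": False,
        "feature": "LOWER",
    }
}
embed_model = registry.resolve(cfg, validate=True)["embed"]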
|
||||
|
|
|
@ -1,11 +1,11 @@
|
|||
from typing import List, Tuple, Callable, Optional, cast
|
||||
|
||||
from thinc.initializers import glorot_uniform_init
|
||||
from thinc.util import partial
|
||||
from thinc.types import Ragged, Floats2d, Floats1d
|
||||
from thinc.api import Model, Ops, registry
|
||||
|
||||
from ..tokens import Doc
|
||||
from ..errors import Errors
|
||||
|
||||
|
||||
@registry.layers("spacy.StaticVectors.v1")
|
||||
|
@ -76,16 +76,9 @@ def init(
|
|||
nO = Y.data.shape[1]
|
||||
|
||||
if nM is None:
|
||||
raise ValueError(
|
||||
"Cannot initialize StaticVectors layer: nM dimension unset. "
|
||||
"This dimension refers to the width of the vectors table."
|
||||
)
|
||||
raise ValueError(Errors.E905)
|
||||
if nO is None:
|
||||
raise ValueError(
|
||||
"Cannot initialize StaticVectors layer: nO dimension unset. "
|
||||
"This dimension refers to the output width, after the linear "
|
||||
"projection has been applied."
|
||||
)
|
||||
raise ValueError(Errors.E904)
|
||||
model.set_dim("nM", nM)
|
||||
model.set_dim("nO", nO)
|
||||
model.set_param("W", init_W(model.ops, (nO, nM)))
|
||||
|
|
|
@ -9,10 +9,11 @@ from ...strings cimport hash_string
|
|||
from ...structs cimport TokenC
|
||||
from ...tokens.doc cimport Doc, set_children_from_heads
|
||||
from ...training.example cimport Example
|
||||
from ...errors import Errors
|
||||
from .stateclass cimport StateClass
|
||||
from ._state cimport StateC
|
||||
|
||||
from ...errors import Errors
|
||||
|
||||
# Calculate cost as gold/not gold. We don't use scalar value anyway.
|
||||
cdef int BINARY_COSTS = 1
|
||||
cdef weight_t MIN_SCORE = -90000
|
||||
|
@ -704,7 +705,7 @@ cdef class ArcEager(TransitionSystem):
|
|||
|
||||
def get_cost(self, StateClass stcls, gold, int i):
|
||||
if not isinstance(gold, ArcEagerGold):
|
||||
raise TypeError("Expected ArcEagerGold")
|
||||
raise TypeError(Errors.E909.format(name="ArcEagerGold"))
|
||||
cdef ArcEagerGold gold_ = gold
|
||||
gold_state = gold_.c
|
||||
n_gold = 0
|
||||
|
@ -717,7 +718,7 @@ cdef class ArcEager(TransitionSystem):
|
|||
cdef int set_costs(self, int* is_valid, weight_t* costs,
|
||||
StateClass stcls, gold) except -1:
|
||||
if not isinstance(gold, ArcEagerGold):
|
||||
raise TypeError("Expected ArcEagerGold")
|
||||
raise TypeError(Errors.E909.format(name="ArcEagerGold"))
|
||||
cdef ArcEagerGold gold_ = gold
|
||||
gold_.update(stcls)
|
||||
gold_state = gold_.c
|
||||
|
|
|
@ -1,16 +1,18 @@
|
|||
from collections import Counter
|
||||
from libc.stdint cimport int32_t
|
||||
from cymem.cymem cimport Pool
|
||||
|
||||
from collections import Counter
|
||||
|
||||
from ...typedefs cimport weight_t, attr_t
|
||||
from ...lexeme cimport Lexeme
|
||||
from ...attrs cimport IS_SPACE
|
||||
from ...training.example cimport Example
|
||||
from ...errors import Errors
|
||||
from .stateclass cimport StateClass
|
||||
from ._state cimport StateC
|
||||
from .transition_system cimport Transition, do_func_t
|
||||
|
||||
from ...errors import Errors
|
||||
|
||||
|
||||
cdef enum:
|
||||
MISSING
|
||||
|
@ -248,7 +250,7 @@ cdef class BiluoPushDown(TransitionSystem):
|
|||
|
||||
def get_cost(self, StateClass stcls, gold, int i):
|
||||
if not isinstance(gold, BiluoGold):
|
||||
raise TypeError("Expected BiluoGold")
|
||||
raise TypeError(Errors.E909.format(name="BiluoGold"))
|
||||
cdef BiluoGold gold_ = gold
|
||||
gold_state = gold_.c
|
||||
n_gold = 0
|
||||
|
@ -261,7 +263,7 @@ cdef class BiluoPushDown(TransitionSystem):
|
|||
cdef int set_costs(self, int* is_valid, weight_t* costs,
|
||||
StateClass stcls, gold) except -1:
|
||||
if not isinstance(gold, BiluoGold):
|
||||
raise TypeError("Expected BiluoGold")
|
||||
raise TypeError(Errors.E909.format(name="BiluoGold"))
|
||||
cdef BiluoGold gold_ = gold
|
||||
gold_.update(stcls)
|
||||
gold_state = gold_.c
|
||||
|
|
|
@ -1,10 +1,11 @@
|
|||
from typing import List, Dict, Union, Iterable, Any, Optional, Callable, Iterator
|
||||
from typing import Tuple
|
||||
import srsly
|
||||
from typing import List, Dict, Union, Iterable, Any, Optional
|
||||
from pathlib import Path
|
||||
|
||||
from .pipe import Pipe
|
||||
from ..errors import Errors
|
||||
from ..training import validate_examples
|
||||
from ..training import validate_examples, Example
|
||||
from ..language import Language
|
||||
from ..matcher import Matcher
|
||||
from ..scorer import Scorer
|
||||
|
@ -18,20 +19,13 @@ from .. import util
|
|||
|
||||
MatcherPatternType = List[Dict[Union[int, str], Any]]
|
||||
AttributeRulerPatternType = Dict[str, Union[MatcherPatternType, Dict, int]]
|
||||
TagMapType = Dict[str, Dict[Union[int, str], Union[int, str]]]
|
||||
MorphRulesType = Dict[str, Dict[str, Dict[Union[int, str], Union[int, str]]]]
|
||||
|
||||
|
||||
@Language.factory(
|
||||
"attribute_ruler", default_config={"pattern_dicts": None, "validate": False}
|
||||
)
|
||||
def make_attribute_ruler(
|
||||
nlp: Language,
|
||||
name: str,
|
||||
pattern_dicts: Optional[Iterable[AttributeRulerPatternType]],
|
||||
validate: bool,
|
||||
):
|
||||
return AttributeRuler(
|
||||
nlp.vocab, name, pattern_dicts=pattern_dicts, validate=validate
|
||||
)
|
||||
@Language.factory("attribute_ruler", default_config={"validate": False})
|
||||
def make_attribute_ruler(nlp: Language, name: str, validate: bool):
|
||||
return AttributeRuler(nlp.vocab, name, validate=validate)
|
||||
|
||||
|
||||
class AttributeRuler(Pipe):
|
||||
|
@ -42,20 +36,15 @@ class AttributeRuler(Pipe):
|
|||
"""
|
||||
|
||||
def __init__(
|
||||
self,
|
||||
vocab: Vocab,
|
||||
name: str = "attribute_ruler",
|
||||
*,
|
||||
pattern_dicts: Optional[Iterable[AttributeRulerPatternType]] = None,
|
||||
validate: bool = False,
|
||||
self, vocab: Vocab, name: str = "attribute_ruler", *, validate: bool = False
|
||||
) -> None:
|
||||
"""Initialize the AttributeRuler.
|
||||
"""Create the AttributeRuler. After creation, you can add patterns
|
||||
with the `.initialize()` or `.add_patterns()` methods, or load patterns
|
||||
with `.from_bytes()` or `.from_disk()`. Loading patterns will remove
|
||||
any patterns you've added previously.
|
||||
|
||||
vocab (Vocab): The vocab.
|
||||
name (str): The pipe name. Defaults to "attribute_ruler".
|
||||
pattern_dicts (Iterable[Dict]): A list of pattern dicts with the keys as
|
||||
the arguments to AttributeRuler.add (`patterns`/`attrs`/`index`) to add
|
||||
as patterns.
|
||||
|
||||
RETURNS (AttributeRuler): The AttributeRuler component.
|
||||
|
||||
|
@ -68,8 +57,27 @@ class AttributeRuler(Pipe):
|
|||
self._attrs_unnormed = [] # store for reference
|
||||
self.indices = []
|
||||
|
||||
if pattern_dicts:
|
||||
self.add_patterns(pattern_dicts)
|
||||
def initialize(
|
||||
self,
|
||||
get_examples: Optional[Callable[[], Iterable[Example]]],
|
||||
*,
|
||||
nlp: Optional[Language] = None,
|
||||
patterns: Optional[Iterable[AttributeRulerPatternType]] = None,
|
||||
tag_map: Optional[TagMapType] = None,
|
||||
morph_rules: Optional[MorphRulesType] = None,
|
||||
):
|
||||
"""Initialize the attribute ruler by adding zero or more patterns.
|
||||
|
||||
Rules can be specified as a sequence of dicts using the `patterns`
|
||||
keyword argument. You can also provide rules using the "tag map" or
|
||||
"morph rules" formats supported by spaCy prior to v3.
|
||||
"""
|
||||
if patterns:
|
||||
self.add_patterns(patterns)
|
||||
if tag_map:
|
||||
self.load_from_tag_map(tag_map)
|
||||
if morph_rules:
|
||||
self.load_from_morph_rules(morph_rules)
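The tests later in this commit exercise this roughly as follows (pattern_dicts, tag_map and morph_rules are placeholders for your own data):

ruler = nlp.add_pipe("attribute_ruler")
ruler.initialize(
    lambda: [],               # no examples are required for rule-based patterns
    patterns=pattern_dicts,   # placeholder: dicts with the AttributeRuler.add keys
    tag_map=tag_map,          # placeholder: optional v2-style tag map
    morph_rules=morph_rules,  # placeholder: optional v2-style morph rules
)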
|
||||
|
||||
def __call__(self, doc: Doc) -> Doc:
|
||||
"""Apply the AttributeRuler to a Doc and set all attribute exceptions.
|
||||
|
@ -106,7 +114,7 @@ class AttributeRuler(Pipe):
|
|||
set_token_attrs(span[index], attrs)
|
||||
return doc
|
||||
|
||||
def pipe(self, stream, *, batch_size=128):
|
||||
def pipe(self, stream: Iterable[Doc], *, batch_size: int = 128) -> Iterator[Doc]:
|
||||
"""Apply the pipe to a stream of documents. This usually happens under
|
||||
the hood when the nlp object is called on a text and all components are
|
||||
applied to the Doc.
|
||||
|
@ -190,16 +198,16 @@ class AttributeRuler(Pipe):
|
|||
self.attrs.append(attrs)
|
||||
self.indices.append(index)
|
||||
|
||||
def add_patterns(self, pattern_dicts: Iterable[AttributeRulerPatternType]) -> None:
|
||||
def add_patterns(self, patterns: Iterable[AttributeRulerPatternType]) -> None:
|
||||
"""Add patterns from a list of pattern dicts with the keys as the
|
||||
arguments to AttributeRuler.add.
|
||||
pattern_dicts (Iterable[dict]): A list of pattern dicts with the keys
|
||||
patterns (Iterable[dict]): A list of pattern dicts with the keys
|
||||
as the arguments to AttributeRuler.add (patterns/attrs/index) to
|
||||
add as patterns.
|
||||
|
||||
DOCS: https://nightly.spacy.io/api/attributeruler#add_patterns
|
||||
"""
|
||||
for p in pattern_dicts:
|
||||
for p in patterns:
|
||||
self.add(**p)
|
||||
|
||||
@property
|
||||
|
@ -214,7 +222,7 @@ class AttributeRuler(Pipe):
|
|||
all_patterns.append(p)
|
||||
return all_patterns
|
||||
|
||||
def score(self, examples, **kwargs):
|
||||
def score(self, examples: Iterable[Example], **kwargs) -> Dict[str, Any]:
|
||||
"""Score a batch of examples.
|
||||
|
||||
examples (Iterable[Example]): The examples to score.
|
||||
|
@ -255,7 +263,7 @@ class AttributeRuler(Pipe):
|
|||
|
||||
def from_bytes(
|
||||
self, bytes_data: bytes, exclude: Iterable[str] = SimpleFrozenList()
|
||||
):
|
||||
) -> "AttributeRuler":
|
||||
"""Load the AttributeRuler from a bytestring.
|
||||
|
||||
bytes_data (bytes): The data to load.
|
||||
|
@ -273,7 +281,6 @@ class AttributeRuler(Pipe):
|
|||
"patterns": load_patterns,
|
||||
}
|
||||
util.from_bytes(bytes_data, deserialize, exclude)
|
||||
|
||||
return self
|
||||
|
||||
def to_disk(
|
||||
|
@ -283,6 +290,7 @@ class AttributeRuler(Pipe):
|
|||
|
||||
path (Union[Path, str]): A path to a directory.
|
||||
exclude (Iterable[str]): String names of serialization fields to exclude.
|
||||
|
||||
DOCS: https://nightly.spacy.io/api/attributeruler#to_disk
|
||||
"""
|
||||
serialize = {
|
||||
|
@ -293,11 +301,13 @@ class AttributeRuler(Pipe):
|
|||
|
||||
def from_disk(
|
||||
self, path: Union[Path, str], exclude: Iterable[str] = SimpleFrozenList()
|
||||
) -> None:
|
||||
) -> "AttributeRuler":
|
||||
"""Load the AttributeRuler from disk.
|
||||
|
||||
path (Union[Path, str]): A path to a directory.
|
||||
exclude (Iterable[str]): String names of serialization fields to exclude.
|
||||
RETURNS (AttributeRuler): The loaded object.
|
||||
|
||||
DOCS: https://nightly.spacy.io/api/attributeruler#from_disk
|
||||
"""
|
||||
|
||||
|
@ -309,11 +319,10 @@ class AttributeRuler(Pipe):
|
|||
"patterns": load_patterns,
|
||||
}
|
||||
util.from_disk(path, deserialize, exclude)
|
||||
|
||||
return self
|
||||
|
||||
|
||||
def _split_morph_attrs(attrs):
|
||||
def _split_morph_attrs(attrs: dict) -> Tuple[dict, dict]:
|
||||
"""Split entries from a tag map or morph rules dict into to two dicts, one
|
||||
with the token-level features (POS, LEMMA) and one with the remaining
|
||||
features, which are presumed to be individual MORPH features."""
|
||||
|
|
|
@ -134,7 +134,7 @@ class Morphologizer(Tagger):
|
|||
self.cfg["labels_pos"][norm_label] = POS_IDS[pos]
|
||||
return 1
|
||||
|
||||
def initialize(self, get_examples, *, nlp=None):
|
||||
def initialize(self, get_examples, *, nlp=None, labels=None):
|
||||
"""Initialize the pipe for training, using a representative set
|
||||
of data examples.
|
||||
|
||||
|
@ -145,20 +145,24 @@ class Morphologizer(Tagger):
|
|||
DOCS: https://nightly.spacy.io/api/morphologizer#initialize
|
||||
"""
|
||||
self._ensure_examples(get_examples)
|
||||
# First, fetch all labels from the data
|
||||
for example in get_examples():
|
||||
for i, token in enumerate(example.reference):
|
||||
pos = token.pos_
|
||||
morph = str(token.morph)
|
||||
# create and add the combined morph+POS label
|
||||
morph_dict = Morphology.feats_to_dict(morph)
|
||||
if pos:
|
||||
morph_dict[self.POS_FEAT] = pos
|
||||
norm_label = self.vocab.strings[self.vocab.morphology.add(morph_dict)]
|
||||
# add label->morph and label->POS mappings
|
||||
if norm_label not in self.cfg["labels_morph"]:
|
||||
self.cfg["labels_morph"][norm_label] = morph
|
||||
self.cfg["labels_pos"][norm_label] = POS_IDS[pos]
|
||||
if labels is not None:
|
||||
self.cfg["labels_morph"] = labels["morph"]
|
||||
self.cfg["labels_pos"] = labels["pos"]
|
||||
else:
|
||||
# First, fetch all labels from the data
|
||||
for example in get_examples():
|
||||
for i, token in enumerate(example.reference):
|
||||
pos = token.pos_
|
||||
morph = str(token.morph)
|
||||
# create and add the combined morph+POS label
|
||||
morph_dict = Morphology.feats_to_dict(morph)
|
||||
if pos:
|
||||
morph_dict[self.POS_FEAT] = pos
|
||||
norm_label = self.vocab.strings[self.vocab.morphology.add(morph_dict)]
|
||||
# add label->morph and label->POS mappings
|
||||
if norm_label not in self.cfg["labels_morph"]:
|
||||
self.cfg["labels_morph"][norm_label] = morph
|
||||
self.cfg["labels_pos"][norm_label] = POS_IDS[pos]
|
||||
if len(self.labels) <= 1:
|
||||
raise ValueError(Errors.E143.format(name=self.name))
|
||||
doc_sample = []
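A sketch of the new labels shortcut. This assumes the morphologizer's label_data exposes the {"morph": ..., "pos": ...} mapping read above; trained_nlp, nlp and get_examples are placeholders:

labels = trained_nlp.get_pipe("morphologizer").label_data
morphologizer = nlp.add_pipe("morphologizer")
morphologizer.initialize(get_examples, nlp=nlp, labels=labels)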
|
||||
|
@ -234,7 +238,7 @@ class Morphologizer(Tagger):
|
|||
truths.append(eg_truths)
|
||||
d_scores, loss = loss_func(scores, truths)
|
||||
if self.model.ops.xp.isnan(loss):
|
||||
raise ValueError("nan value when computing loss")
|
||||
raise ValueError(Errors.E910.format(name=self.name))
|
||||
return float(loss), d_scores
|
||||
|
||||
def score(self, examples, **kwargs):
|
||||
|
|
|
@ -1,4 +1,5 @@
|
|||
# cython: infer_types=True, profile=True
|
||||
import warnings
|
||||
from typing import Optional, Tuple
|
||||
import srsly
|
||||
from thinc.api import set_dropout_rate, Model
|
||||
|
@ -6,7 +7,7 @@ from thinc.api import set_dropout_rate, Model
|
|||
from ..tokens.doc cimport Doc
|
||||
|
||||
from ..training import validate_examples
|
||||
from ..errors import Errors
|
||||
from ..errors import Errors, Warnings
|
||||
from .. import util
|
||||
|
||||
|
||||
|
@ -33,6 +34,13 @@ cdef class Pipe:
|
|||
self.name = name
|
||||
self.cfg = dict(cfg)
|
||||
|
||||
@classmethod
|
||||
def __init_subclass__(cls, **kwargs):
|
||||
"""Raise a warning if an inheriting class implements 'begin_training'
|
||||
(from v2) instead of the new 'initialize' method (from v3)"""
|
||||
if hasattr(cls, "begin_training"):
|
||||
warnings.warn(Warnings.W088.format(name=cls.__name__))
|
||||
|
||||
@property
|
||||
def labels(self) -> Optional[Tuple[str]]:
|
||||
return []
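The subclass hook warns when a component still uses the v2 method name, for example (sketch mirroring the test added later in this commit):

from spacy.pipeline import Pipe

class LegacyPipe(Pipe):
    def begin_training(self, *args, **kwargs):
        # Defining the v2-style method is enough to trigger Warnings.W088 at
        # class definition time; v3 components should implement initialize().
        ...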
|
||||
|
|
|
@ -73,7 +73,7 @@ class SentenceRecognizer(Tagger):
|
|||
|
||||
@property
|
||||
def label_data(self):
|
||||
return self.labels
|
||||
return None
|
||||
|
||||
def set_annotations(self, docs, batch_tag_ids):
|
||||
"""Modify a batch of documents, using pre-computed scores.
|
||||
|
@ -125,7 +125,7 @@ class SentenceRecognizer(Tagger):
|
|||
truths.append(eg_truth)
|
||||
d_scores, loss = loss_func(scores, truths)
|
||||
if self.model.ops.xp.isnan(loss):
|
||||
raise ValueError("nan value when computing loss")
|
||||
raise ValueError(Errors.E910.format(name=self.name))
|
||||
return float(loss), d_scores
|
||||
|
||||
def initialize(self, get_examples, *, nlp=None):
|
||||
|
|
|
@ -15,7 +15,7 @@ from .pipe import Pipe, deserialize_config
|
|||
from ..language import Language
|
||||
from ..attrs import POS, ID
|
||||
from ..parts_of_speech import X
|
||||
from ..errors import Errors, TempErrors, Warnings
|
||||
from ..errors import Errors, Warnings
|
||||
from ..scorer import Scorer
|
||||
from ..training import validate_examples
|
||||
from .. import util
|
||||
|
@ -258,7 +258,7 @@ class Tagger(Pipe):
|
|||
truths = [eg.get_aligned("TAG", as_string=True) for eg in examples]
|
||||
d_scores, loss = loss_func(scores, truths)
|
||||
if self.model.ops.xp.isnan(loss):
|
||||
raise ValueError("nan value when computing loss")
|
||||
raise ValueError(Errors.E910.format(name=self.name))
|
||||
return float(loss), d_scores
|
||||
|
||||
def initialize(self, get_examples, *, nlp=None, labels=None):
|
||||
|
|
|
@ -56,12 +56,7 @@ subword_features = true
|
|||
@Language.factory(
|
||||
"textcat",
|
||||
assigns=["doc.cats"],
|
||||
default_config={
|
||||
"labels": [],
|
||||
"threshold": 0.5,
|
||||
"positive_label": None,
|
||||
"model": DEFAULT_TEXTCAT_MODEL,
|
||||
},
|
||||
default_config={"threshold": 0.5, "model": DEFAULT_TEXTCAT_MODEL},
|
||||
default_score_weights={
|
||||
"cats_score": 1.0,
|
||||
"cats_score_desc": None,
|
||||
|
@ -75,12 +70,7 @@ subword_features = true
|
|||
},
|
||||
)
|
||||
def make_textcat(
|
||||
nlp: Language,
|
||||
name: str,
|
||||
model: Model[List[Doc], List[Floats2d]],
|
||||
labels: List[str],
|
||||
threshold: float,
|
||||
positive_label: Optional[str],
|
||||
nlp: Language, name: str, model: Model[List[Doc], List[Floats2d]], threshold: float
|
||||
) -> "TextCategorizer":
|
||||
"""Create a TextCategorizer compoment. The text categorizer predicts categories
|
||||
over a whole document. It can learn one or more labels, and the labels can
|
||||
|
@ -90,19 +80,9 @@ def make_textcat(
|
|||
|
||||
model (Model[List[Doc], List[Floats2d]]): A model instance that predicts
|
||||
scores for each category.
|
||||
labels (list): A list of categories to learn. If empty, the model infers the
|
||||
categories from the data.
|
||||
threshold (float): Cutoff to consider a prediction "positive".
|
||||
positive_label (Optional[str]): The positive label for a binary task with exclusive classes, None otherwise.
|
||||
"""
|
||||
return TextCategorizer(
|
||||
nlp.vocab,
|
||||
model,
|
||||
name,
|
||||
labels=labels,
|
||||
threshold=threshold,
|
||||
positive_label=positive_label,
|
||||
)
|
||||
return TextCategorizer(nlp.vocab, model, name, threshold=threshold)
|
||||
|
||||
|
||||
class TextCategorizer(Pipe):
|
||||
|
@ -112,14 +92,7 @@ class TextCategorizer(Pipe):
|
|||
"""
|
||||
|
||||
def __init__(
|
||||
self,
|
||||
vocab: Vocab,
|
||||
model: Model,
|
||||
name: str = "textcat",
|
||||
*,
|
||||
labels: List[str],
|
||||
threshold: float,
|
||||
positive_label: Optional[str],
|
||||
self, vocab: Vocab, model: Model, name: str = "textcat", *, threshold: float
|
||||
) -> None:
|
||||
"""Initialize a text categorizer.
|
||||
|
||||
|
@ -127,9 +100,7 @@ class TextCategorizer(Pipe):
|
|||
model (thinc.api.Model): The Thinc Model powering the pipeline component.
|
||||
name (str): The component instance name, used to add entries to the
|
||||
losses during training.
|
||||
labels (List[str]): The labels to use.
|
||||
threshold (float): Cutoff to consider a prediction "positive".
|
||||
positive_label (Optional[str]): The positive label for a binary task with exclusive classes, None otherwise.
|
||||
|
||||
DOCS: https://nightly.spacy.io/api/textcategorizer#init
|
||||
"""
|
||||
|
@ -137,11 +108,7 @@ class TextCategorizer(Pipe):
|
|||
self.model = model
|
||||
self.name = name
|
||||
self._rehearsal_model = None
|
||||
cfg = {
|
||||
"labels": labels,
|
||||
"threshold": threshold,
|
||||
"positive_label": positive_label,
|
||||
}
|
||||
cfg = {"labels": [], "threshold": threshold, "positive_label": None}
|
||||
self.cfg = dict(cfg)
|
||||
|
||||
@property
|
||||
|
@ -348,6 +315,7 @@ class TextCategorizer(Pipe):
|
|||
*,
|
||||
nlp: Optional[Language] = None,
|
||||
labels: Optional[Dict] = None,
|
||||
positive_label: Optional[str] = None,
|
||||
):
|
||||
"""Initialize the pipe for training, using a representative set
|
||||
of data examples.
|
||||
|
@ -369,6 +337,14 @@ class TextCategorizer(Pipe):
|
|||
else:
|
||||
for label in labels:
|
||||
self.add_label(label)
|
||||
if positive_label is not None:
|
||||
if positive_label not in self.labels:
|
||||
err = Errors.E920.format(pos_label=positive_label, labels=self.labels)
|
||||
raise ValueError(err)
|
||||
if len(self.labels) != 2:
|
||||
err = Errors.E919.format(pos_label=positive_label, labels=self.labels)
|
||||
raise ValueError(err)
|
||||
self.cfg["positive_label"] = positive_label
|
||||
subbatch = list(islice(get_examples(), 10))
|
||||
doc_sample = [eg.reference for eg in subbatch]
|
||||
label_sample, _ = self._examples_to_truth(subbatch)
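With labels and positive_label removed from the factory, label setup moves to initialize. The updated tests use it roughly like this (make_get_examples is the helper added later in this commit):

from spacy.lang.en import English

nlp = English()
textcat = nlp.add_pipe("textcat")  # config now only carries "threshold" and "model"
textcat.initialize(make_get_examples(nlp), labels=["POS", "NEG"], positive_label="POS")
assert textcat.labels == ("POS", "NEG")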
|
||||
|
|
|
@ -905,7 +905,7 @@ def _auc(x, y):
|
|||
if np.all(dx <= 0):
|
||||
direction = -1
|
||||
else:
|
||||
raise ValueError(Errors.E164.format(x))
|
||||
raise ValueError(Errors.E164.format(x=x))
|
||||
|
||||
area = direction * np.trapz(y, x)
|
||||
if isinstance(area, np.memmap):
|
||||
|
|
|
@ -294,7 +294,8 @@ def zh_tokenizer_pkuseg():
|
|||
"segmenter": "pkuseg",
|
||||
}
|
||||
},
|
||||
"initialize": {"tokenizer": {
|
||||
"initialize": {
|
||||
"tokenizer": {
|
||||
"pkuseg_model": "default",
|
||||
}
|
||||
},
|
||||
|
|
|
@ -5,12 +5,14 @@ import pytest
|
|||
def i_has(en_tokenizer):
|
||||
doc = en_tokenizer("I has")
|
||||
doc[0].set_morph({"PronType": "prs"})
|
||||
doc[1].set_morph({
|
||||
"VerbForm": "fin",
|
||||
"Tense": "pres",
|
||||
"Number": "sing",
|
||||
"Person": "three",
|
||||
})
|
||||
doc[1].set_morph(
|
||||
{
|
||||
"VerbForm": "fin",
|
||||
"Tense": "pres",
|
||||
"Number": "sing",
|
||||
"Person": "three",
|
||||
}
|
||||
)
|
||||
|
||||
return doc
|
||||
|
||||
|
|
|
@ -196,3 +196,22 @@ def test_doc_retokenizer_realloc(en_vocab):
|
|||
token = doc[0]
|
||||
heads = [(token, 0)] * len(token)
|
||||
retokenizer.split(doc[token.i], list(token.text), heads=heads)
|
||||
|
||||
|
||||
def test_doc_retokenizer_split_norm(en_vocab):
|
||||
"""#6060: reset norm in split"""
|
||||
text = "The quick brownfoxjumpsoverthe lazy dog w/ white spots"
|
||||
doc = Doc(en_vocab, words=text.split())
|
||||
|
||||
# Set custom norm on the w/ token.
|
||||
doc[5].norm_ = "with"
|
||||
|
||||
# Retokenize to split out the words in the token at doc[2].
|
||||
token = doc[2]
|
||||
with doc.retokenize() as retokenizer:
|
||||
retokenizer.split(token, ["brown", "fox", "jumps", "over", "the"], heads=[(token, idx) for idx in range(5)])
|
||||
|
||||
assert doc[9].text == "w/"
|
||||
assert doc[9].norm_ == "with"
|
||||
assert doc[5].text == "over"
|
||||
assert doc[5].norm_ == "over"
|
||||
|
|
|
@ -322,3 +322,11 @@ def test_span_boundaries(doc):
|
|||
span[-5]
|
||||
with pytest.raises(IndexError):
|
||||
span[5]
|
||||
|
||||
|
||||
def test_sent(en_tokenizer):
|
||||
doc = en_tokenizer("Check span.sent raises error if doc is not sentencized.")
|
||||
span = doc[1:3]
|
||||
assert not span.doc.has_annotation("SENT_START")
|
||||
with pytest.raises(ValueError):
|
||||
span.sent
|
||||
|
|
|
@ -23,8 +23,9 @@ def test_lemmatizer_initialize(lang, capfd):
|
|||
lookups.add_table("lemma_rules", {"verb": [["ing", ""]]})
|
||||
return lookups
|
||||
|
||||
lang_cls = get_lang_class(lang)
|
||||
# Test that languages can be initialized
|
||||
nlp = get_lang_class(lang)()
|
||||
nlp = lang_cls()
|
||||
lemmatizer = nlp.add_pipe("lemmatizer", config={"mode": "lookup"})
|
||||
assert not lemmatizer.lookups.tables
|
||||
nlp.config["initialize"]["components"]["lemmatizer"] = {
|
||||
|
@ -41,7 +42,13 @@ def test_lemmatizer_initialize(lang, capfd):
|
|||
assert doc[0].lemma_ == "y"
|
||||
|
||||
# Test initialization by calling .initialize() directly
|
||||
nlp = get_lang_class(lang)()
|
||||
nlp = lang_cls()
|
||||
lemmatizer = nlp.add_pipe("lemmatizer", config={"mode": "lookup"})
|
||||
lemmatizer.initialize(lookups=lemmatizer_init_lookups())
|
||||
assert nlp("x")[0].lemma_ == "y"
|
||||
|
||||
# Test lookups config format
|
||||
for mode in ("rule", "lookup", "pos_lookup"):
|
||||
required, optional = lemmatizer.get_lookups_config(mode)
|
||||
assert isinstance(required, list)
|
||||
assert isinstance(optional, list)
|
||||
|
|
|
@ -34,7 +34,8 @@ def test_zh_tokenizer_serialize_pkuseg_with_processors(zh_tokenizer_pkuseg):
|
|||
"segmenter": "pkuseg",
|
||||
}
|
||||
},
|
||||
"initialize": {"tokenizer": {
|
||||
"initialize": {
|
||||
"tokenizer": {
|
||||
"pkuseg_model": "medicine",
|
||||
}
|
||||
},
|
||||
|
|
|
@ -63,6 +63,39 @@ def morph_rules():
|
|||
return {"DT": {"the": {"POS": "DET", "LEMMA": "a", "Case": "Nom"}}}
|
||||
|
||||
|
||||
def check_tag_map(ruler):
|
||||
doc = Doc(
|
||||
ruler.vocab,
|
||||
words=["This", "is", "a", "test", "."],
|
||||
tags=["DT", "VBZ", "DT", "NN", "."],
|
||||
)
|
||||
doc = ruler(doc)
|
||||
for i in range(len(doc)):
|
||||
if i == 4:
|
||||
assert doc[i].pos_ == "PUNCT"
|
||||
assert str(doc[i].morph) == "PunctType=peri"
|
||||
else:
|
||||
assert doc[i].pos_ == ""
|
||||
assert str(doc[i].morph) == ""
|
||||
|
||||
|
||||
def check_morph_rules(ruler):
|
||||
doc = Doc(
|
||||
ruler.vocab,
|
||||
words=["This", "is", "the", "test", "."],
|
||||
tags=["DT", "VBZ", "DT", "NN", "."],
|
||||
)
|
||||
doc = ruler(doc)
|
||||
for i in range(len(doc)):
|
||||
if i != 2:
|
||||
assert doc[i].pos_ == ""
|
||||
assert str(doc[i].morph) == ""
|
||||
else:
|
||||
assert doc[2].pos_ == "DET"
|
||||
assert doc[2].lemma_ == "a"
|
||||
assert str(doc[2].morph) == "Case=Nom"
|
||||
|
||||
|
||||
def test_attributeruler_init(nlp, pattern_dicts):
|
||||
a = nlp.add_pipe("attribute_ruler")
|
||||
for p in pattern_dicts:
|
||||
|
@ -78,7 +111,8 @@ def test_attributeruler_init(nlp, pattern_dicts):
|
|||
|
||||
def test_attributeruler_init_patterns(nlp, pattern_dicts):
|
||||
# initialize with patterns
|
||||
nlp.add_pipe("attribute_ruler", config={"pattern_dicts": pattern_dicts})
|
||||
ruler = nlp.add_pipe("attribute_ruler")
|
||||
ruler.initialize(lambda: [], patterns=pattern_dicts)
|
||||
doc = nlp("This is a test.")
|
||||
assert doc[2].lemma_ == "the"
|
||||
assert str(doc[2].morph) == "Case=Nom|Number=Plur"
|
||||
|
@ -88,10 +122,11 @@ def test_attributeruler_init_patterns(nlp, pattern_dicts):
|
|||
assert doc.has_annotation("MORPH")
|
||||
nlp.remove_pipe("attribute_ruler")
|
||||
# initialize with patterns from asset
|
||||
nlp.add_pipe(
|
||||
"attribute_ruler",
|
||||
config={"pattern_dicts": {"@misc": "attribute_ruler_patterns"}},
|
||||
)
|
||||
nlp.config["initialize"]["components"]["attribute_ruler"] = {
|
||||
"patterns": {"@misc": "attribute_ruler_patterns"}
|
||||
}
|
||||
nlp.add_pipe("attribute_ruler")
|
||||
nlp.initialize()
|
||||
doc = nlp("This is a test.")
|
||||
assert doc[2].lemma_ == "the"
|
||||
assert str(doc[2].morph) == "Case=Nom|Number=Plur"
|
||||
|
@ -103,18 +138,15 @@ def test_attributeruler_init_patterns(nlp, pattern_dicts):
|
|||
|
||||
def test_attributeruler_score(nlp, pattern_dicts):
|
||||
# initialize with patterns
|
||||
nlp.add_pipe("attribute_ruler", config={"pattern_dicts": pattern_dicts})
|
||||
ruler = nlp.add_pipe("attribute_ruler")
|
||||
ruler.initialize(lambda: [], patterns=pattern_dicts)
|
||||
doc = nlp("This is a test.")
|
||||
assert doc[2].lemma_ == "the"
|
||||
assert str(doc[2].morph) == "Case=Nom|Number=Plur"
|
||||
assert doc[3].lemma_ == "cat"
|
||||
assert str(doc[3].morph) == "Case=Nom|Number=Sing"
|
||||
|
||||
dev_examples = [
|
||||
Example.from_dict(
|
||||
nlp.make_doc("This is a test."), {"lemmas": ["this", "is", "a", "cat", "."]}
|
||||
)
|
||||
]
|
||||
doc = nlp.make_doc("This is a test.")
|
||||
dev_examples = [Example.from_dict(doc, {"lemmas": ["this", "is", "a", "cat", "."]})]
|
||||
scores = nlp.evaluate(dev_examples)
|
||||
# "cat" is the only correct lemma
|
||||
assert scores["lemma_acc"] == pytest.approx(0.2)
|
||||
|
@ -139,40 +171,27 @@ def test_attributeruler_rule_order(nlp):
|
|||
|
||||
|
||||
def test_attributeruler_tag_map(nlp, tag_map):
|
||||
a = AttributeRuler(nlp.vocab)
|
||||
a.load_from_tag_map(tag_map)
|
||||
doc = Doc(
|
||||
nlp.vocab,
|
||||
words=["This", "is", "a", "test", "."],
|
||||
tags=["DT", "VBZ", "DT", "NN", "."],
|
||||
)
|
||||
doc = a(doc)
|
||||
for i in range(len(doc)):
|
||||
if i == 4:
|
||||
assert doc[i].pos_ == "PUNCT"
|
||||
assert str(doc[i].morph) == "PunctType=peri"
|
||||
else:
|
||||
assert doc[i].pos_ == ""
|
||||
assert str(doc[i].morph) == ""
|
||||
ruler = AttributeRuler(nlp.vocab)
|
||||
ruler.load_from_tag_map(tag_map)
|
||||
check_tag_map(ruler)
|
||||
|
||||
|
||||
def test_attributeruler_tag_map_initialize(nlp, tag_map):
|
||||
ruler = nlp.add_pipe("attribute_ruler")
|
||||
ruler.initialize(lambda: [], tag_map=tag_map)
|
||||
check_tag_map(ruler)
|
||||
|
||||
|
||||
def test_attributeruler_morph_rules(nlp, morph_rules):
|
||||
a = AttributeRuler(nlp.vocab)
|
||||
a.load_from_morph_rules(morph_rules)
|
||||
doc = Doc(
|
||||
nlp.vocab,
|
||||
words=["This", "is", "the", "test", "."],
|
||||
tags=["DT", "VBZ", "DT", "NN", "."],
|
||||
)
|
||||
doc = a(doc)
|
||||
for i in range(len(doc)):
|
||||
if i != 2:
|
||||
assert doc[i].pos_ == ""
|
||||
assert str(doc[i].morph) == ""
|
||||
else:
|
||||
assert doc[2].pos_ == "DET"
|
||||
assert doc[2].lemma_ == "a"
|
||||
assert str(doc[2].morph) == "Case=Nom"
|
||||
ruler = AttributeRuler(nlp.vocab)
|
||||
ruler.load_from_morph_rules(morph_rules)
|
||||
check_morph_rules(ruler)
|
||||
|
||||
|
||||
def test_attributeruler_morph_rules_initialize(nlp, morph_rules):
|
||||
ruler = nlp.add_pipe("attribute_ruler")
|
||||
ruler.initialize(lambda: [], morph_rules=morph_rules)
|
||||
check_morph_rules(ruler)
|
||||
|
||||
|
||||
def test_attributeruler_indices(nlp):
|
||||
|
|
|
@ -1,6 +1,7 @@
|
|||
import pytest
|
||||
from spacy.language import Language
|
||||
from spacy.util import SimpleFrozenList
|
||||
from spacy.pipeline import Pipe
|
||||
from spacy.util import SimpleFrozenList, get_arg_names
|
||||
|
||||
|
||||
@pytest.fixture
|
||||
|
@ -346,3 +347,60 @@ def test_pipe_methods_frozen():
|
|||
nlp.components.sort()
|
||||
with pytest.raises(NotImplementedError):
|
||||
nlp.component_names.clear()
|
||||
|
||||
|
||||
@pytest.mark.parametrize(
|
||||
"pipe", ["tagger", "parser", "ner", "textcat", "morphologizer"],
|
||||
)
|
||||
def test_pipe_label_data_exports_labels(pipe):
|
||||
nlp = Language()
|
||||
pipe = nlp.add_pipe(pipe)
|
||||
# Make sure pipe has pipe labels
|
||||
assert getattr(pipe, "label_data", None) is not None
|
||||
# Make sure pipe can be initialized with labels
|
||||
initialize = getattr(pipe, "initialize", None)
|
||||
assert initialize is not None
|
||||
assert "labels" in get_arg_names(initialize)
|
||||
|
||||
|
||||
@pytest.mark.parametrize("pipe", ["senter", "entity_linker"])
|
||||
def test_pipe_label_data_no_labels(pipe):
|
||||
nlp = Language()
|
||||
pipe = nlp.add_pipe(pipe)
|
||||
assert getattr(pipe, "label_data", None) is None
|
||||
initialize = getattr(pipe, "initialize", None)
|
||||
if initialize is not None:
|
||||
assert "labels" not in get_arg_names(initialize)
|
||||
|
||||
|
||||
def test_warning_pipe_begin_training():
|
||||
with pytest.warns(UserWarning, match="begin_training"):
|
||||
|
||||
class IncompatPipe(Pipe):
|
||||
def __init__(self):
|
||||
...
|
||||
|
||||
def begin_training(*args, **kwargs):
|
||||
...
|
||||
|
||||
|
||||
def test_pipe_methods_initialize():
|
||||
"""Test that the [initialize] config reflects the components correctly."""
|
||||
nlp = Language()
|
||||
nlp.add_pipe("tagger")
|
||||
assert "tagger" not in nlp.config["initialize"]["components"]
|
||||
nlp.config["initialize"]["components"]["tagger"] = {"labels": ["hello"]}
|
||||
assert nlp.config["initialize"]["components"]["tagger"] == {"labels": ["hello"]}
|
||||
nlp.remove_pipe("tagger")
|
||||
assert "tagger" not in nlp.config["initialize"]["components"]
|
||||
nlp.add_pipe("tagger")
|
||||
assert "tagger" not in nlp.config["initialize"]["components"]
|
||||
nlp.config["initialize"]["components"]["tagger"] = {"labels": ["hello"]}
|
||||
nlp.rename_pipe("tagger", "my_tagger")
|
||||
assert "tagger" not in nlp.config["initialize"]["components"]
|
||||
assert nlp.config["initialize"]["components"]["my_tagger"] == {"labels": ["hello"]}
|
||||
nlp.config["initialize"]["components"]["test"] = {"foo": "bar"}
|
||||
nlp.add_pipe("ner", name="test")
|
||||
assert "test" in nlp.config["initialize"]["components"]
|
||||
nlp.remove_pipe("test")
|
||||
assert "test" not in nlp.config["initialize"]["components"]
|
||||
|
|
|
@ -10,7 +10,6 @@ from spacy.tokens import Doc
|
|||
from spacy.pipeline.tok2vec import DEFAULT_TOK2VEC_MODEL
|
||||
from spacy.scorer import Scorer
|
||||
from spacy.training import Example
|
||||
from spacy.training.initialize import verify_textcat_config
|
||||
|
||||
from ..util import make_tempdir
|
||||
|
||||
|
@ -21,6 +20,17 @@ TRAIN_DATA = [
|
|||
]
|
||||
|
||||
|
||||
def make_get_examples(nlp):
|
||||
train_examples = []
|
||||
for t in TRAIN_DATA:
|
||||
train_examples.append(Example.from_dict(nlp.make_doc(t[0]), t[1]))
|
||||
|
||||
def get_examples():
|
||||
return train_examples
|
||||
|
||||
return get_examples
|
||||
|
||||
|
||||
@pytest.mark.skip(reason="Test is flakey when run with others")
|
||||
def test_simple_train():
|
||||
nlp = Language()
|
||||
|
@ -92,10 +102,7 @@ def test_no_label():
|
|||
def test_implicit_label():
|
||||
nlp = Language()
|
||||
nlp.add_pipe("textcat")
|
||||
train_examples = []
|
||||
for t in TRAIN_DATA:
|
||||
train_examples.append(Example.from_dict(nlp.make_doc(t[0]), t[1]))
|
||||
nlp.initialize(get_examples=lambda: train_examples)
|
||||
nlp.initialize(get_examples=make_get_examples(nlp))
|
||||
|
||||
|
||||
def test_no_resize():
|
||||
|
@ -113,29 +120,27 @@ def test_no_resize():
|
|||
def test_initialize_examples():
|
||||
nlp = Language()
|
||||
textcat = nlp.add_pipe("textcat")
|
||||
train_examples = []
|
||||
for text, annotations in TRAIN_DATA:
|
||||
train_examples.append(Example.from_dict(nlp.make_doc(text), annotations))
|
||||
for label, value in annotations.get("cats").items():
|
||||
textcat.add_label(label)
|
||||
# you shouldn't really call this more than once, but for testing it should be fine
|
||||
nlp.initialize()
|
||||
nlp.initialize(get_examples=lambda: train_examples)
|
||||
get_examples = make_get_examples(nlp)
|
||||
nlp.initialize(get_examples=get_examples)
|
||||
with pytest.raises(ValueError):
|
||||
nlp.initialize(get_examples=lambda: None)
|
||||
with pytest.raises(ValueError):
|
||||
nlp.initialize(get_examples=train_examples)
|
||||
nlp.initialize(get_examples=get_examples())
|
||||
|
||||
|
||||
def test_overfitting_IO():
|
||||
# Simple test to try and quickly overfit the textcat component - ensuring the ML models work correctly
|
||||
fix_random_seed(0)
|
||||
nlp = English()
|
||||
nlp.config["initialize"]["components"]["textcat"] = {"positive_label": "POSITIVE"}
|
||||
# Set exclusive labels
|
||||
textcat = nlp.add_pipe(
|
||||
"textcat",
|
||||
config={"model": {"exclusive_classes": True}, "positive_label": "POSITIVE"},
|
||||
)
|
||||
config = {"model": {"exclusive_classes": True}}
|
||||
textcat = nlp.add_pipe("textcat", config=config)
|
||||
train_examples = []
|
||||
for text, annotations in TRAIN_DATA:
|
||||
train_examples.append(Example.from_dict(nlp.make_doc(text), annotations))
|
||||
|
@ -203,28 +208,28 @@ def test_textcat_configs(textcat_config):
|
|||
|
||||
def test_positive_class():
|
||||
nlp = English()
|
||||
pipe_config = {"positive_label": "POS", "labels": ["POS", "NEG"]}
|
||||
textcat = nlp.add_pipe("textcat", config=pipe_config)
|
||||
textcat = nlp.add_pipe("textcat")
|
||||
get_examples = make_get_examples(nlp)
|
||||
textcat.initialize(get_examples, labels=["POS", "NEG"], positive_label="POS")
|
||||
assert textcat.labels == ("POS", "NEG")
|
||||
verify_textcat_config(nlp, pipe_config)
|
||||
|
||||
|
||||
def test_positive_class_not_present():
|
||||
nlp = English()
|
||||
pipe_config = {"positive_label": "POS", "labels": ["SOME", "THING"]}
|
||||
textcat = nlp.add_pipe("textcat", config=pipe_config)
|
||||
assert textcat.labels == ("SOME", "THING")
|
||||
textcat = nlp.add_pipe("textcat")
|
||||
get_examples = make_get_examples(nlp)
|
||||
with pytest.raises(ValueError):
|
||||
verify_textcat_config(nlp, pipe_config)
|
||||
textcat.initialize(get_examples, labels=["SOME", "THING"], positive_label="POS")
|
||||
|
||||
|
||||
def test_positive_class_not_binary():
|
||||
nlp = English()
|
||||
pipe_config = {"positive_label": "POS", "labels": ["SOME", "THING", "POS"]}
|
||||
textcat = nlp.add_pipe("textcat", config=pipe_config)
|
||||
assert textcat.labels == ("SOME", "THING", "POS")
|
||||
textcat = nlp.add_pipe("textcat")
|
||||
get_examples = make_get_examples(nlp)
|
||||
with pytest.raises(ValueError):
|
||||
verify_textcat_config(nlp, pipe_config)
|
||||
textcat.initialize(
|
||||
get_examples, labels=["SOME", "THING", "POS"], positive_label="POS"
|
||||
)
|
||||
|
||||
|
||||
def test_textcat_evaluation():
|
||||
|
|
|
@ -92,7 +92,13 @@ def test_serialize_doc_bin_unknown_spaces(en_vocab):
|
|||
|
||||
|
||||
@pytest.mark.parametrize(
|
||||
"writer_flag,reader_flag,reader_value", [(True, True, "bar"), (True, False, "bar"), (False, True, "nothing"), (False, False, "nothing")]
|
||||
"writer_flag,reader_flag,reader_value",
|
||||
[
|
||||
(True, True, "bar"),
|
||||
(True, False, "bar"),
|
||||
(False, True, "nothing"),
|
||||
(False, False, "nothing"),
|
||||
],
|
||||
)
|
||||
def test_serialize_custom_extension(en_vocab, writer_flag, reader_flag, reader_value):
|
||||
"""Test that custom extensions are correctly serialized in DocBin."""
|
||||
|
|
|
@ -136,13 +136,7 @@ def test_serialize_textcat_empty(en_vocab):
|
|||
# See issue #1105
|
||||
cfg = {"model": DEFAULT_TEXTCAT_MODEL}
|
||||
model = registry.resolve(cfg, validate=True)["model"]
|
||||
textcat = TextCategorizer(
|
||||
en_vocab,
|
||||
model,
|
||||
labels=["ENTITY", "ACTION", "MODIFIER"],
|
||||
threshold=0.5,
|
||||
positive_label=None,
|
||||
)
|
||||
textcat = TextCategorizer(en_vocab, model, threshold=0.5)
|
||||
textcat.to_bytes(exclude=["vocab"])
|
||||
|
||||
|
||||
|
|
|
@ -158,7 +158,7 @@ def test_las_per_type(en_vocab):
|
|||
examples = []
|
||||
for input_, annot in test_las_apple:
|
||||
doc = Doc(
|
||||
en_vocab, words=input_.split(" "), heads=annot["heads"], deps=annot["deps"],
|
||||
en_vocab, words=input_.split(" "), heads=annot["heads"], deps=annot["deps"]
|
||||
)
|
||||
gold = {"heads": annot["heads"], "deps": annot["deps"]}
|
||||
doc[0].dep_ = "compound"
|
||||
|
@ -182,9 +182,7 @@ def test_ner_per_type(en_vocab):
|
|||
examples = []
|
||||
for input_, annot in test_ner_cardinal:
|
||||
doc = Doc(
|
||||
en_vocab,
|
||||
words=input_.split(" "),
|
||||
ents=["B-CARDINAL", "O", "B-CARDINAL"],
|
||||
en_vocab, words=input_.split(" "), ents=["B-CARDINAL", "O", "B-CARDINAL"]
|
||||
)
|
||||
entities = offsets_to_biluo_tags(doc, annot["entities"])
|
||||
example = Example.from_dict(doc, {"entities": entities})
|
||||
|
|
100
spacy/tests/training/test_augmenters.py
Normal file
|
@ -0,0 +1,100 @@
|
|||
import pytest
|
||||
from spacy.training import Corpus
|
||||
from spacy.training.augment import create_orth_variants_augmenter
|
||||
from spacy.training.augment import create_lower_casing_augmenter
|
||||
from spacy.lang.en import English
|
||||
from spacy.tokens import DocBin, Doc
|
||||
from contextlib import contextmanager
|
||||
import random
|
||||
|
||||
from ..util import make_tempdir
|
||||
|
||||
|
||||
@contextmanager
|
||||
def make_docbin(docs, name="roundtrip.spacy"):
|
||||
with make_tempdir() as tmpdir:
|
||||
output_file = tmpdir / name
|
||||
DocBin(docs=docs).to_disk(output_file)
|
||||
yield output_file
|
||||
|
||||
|
||||
@pytest.fixture
|
||||
def nlp():
|
||||
return English()
|
||||
|
||||
|
||||
@pytest.fixture
|
||||
def doc(nlp):
|
||||
# fmt: off
|
||||
words = ["Sarah", "'s", "sister", "flew", "to", "Silicon", "Valley", "via", "London", "."]
|
||||
tags = ["NNP", "POS", "NN", "VBD", "IN", "NNP", "NNP", "IN", "NNP", "."]
|
||||
pos = ["PROPN", "PART", "NOUN", "VERB", "ADP", "PROPN", "PROPN", "ADP", "PROPN", "PUNCT"]
|
||||
ents = ["B-PERSON", "I-PERSON", "O", "O", "O", "B-LOC", "I-LOC", "O", "B-GPE", "O"]
|
||||
cats = {"TRAVEL": 1.0, "BAKING": 0.0}
|
||||
# fmt: on
|
||||
doc = Doc(nlp.vocab, words=words, tags=tags, pos=pos, ents=ents)
|
||||
doc.cats = cats
|
||||
return doc
|
||||
|
||||
|
||||
@pytest.mark.filterwarnings("ignore::UserWarning")
|
||||
def test_make_orth_variants(nlp, doc):
|
||||
single = [
|
||||
{"tags": ["NFP"], "variants": ["…", "..."]},
|
||||
{"tags": [":"], "variants": ["-", "—", "–", "--", "---", "——"]},
|
||||
]
|
||||
augmenter = create_orth_variants_augmenter(
|
||||
level=0.2, lower=0.5, orth_variants={"single": single}
|
||||
)
|
||||
with make_docbin([doc]) as output_file:
|
||||
reader = Corpus(output_file, augmenter=augmenter)
|
||||
# Due to randomness, only test that it works without errors for now
|
||||
list(reader(nlp))
|
||||
|
||||
|
||||
def test_lowercase_augmenter(nlp, doc):
|
||||
augmenter = create_lower_casing_augmenter(level=1.0)
|
||||
with make_docbin([doc]) as output_file:
|
||||
reader = Corpus(output_file, augmenter=augmenter)
|
||||
corpus = list(reader(nlp))
|
||||
eg = corpus[0]
|
||||
assert eg.reference.text == doc.text.lower()
|
||||
assert eg.predicted.text == doc.text.lower()
|
||||
ents = [(e.start, e.end, e.label) for e in doc.ents]
|
||||
assert [(e.start, e.end, e.label) for e in eg.reference.ents] == ents
|
||||
for ref_ent, orig_ent in zip(eg.reference.ents, doc.ents):
|
||||
assert ref_ent.text == orig_ent.text.lower()
|
||||
assert [t.pos_ for t in eg.reference] == [t.pos_ for t in doc]
|
||||
|
||||
|
||||
@pytest.mark.filterwarnings("ignore::UserWarning")
|
||||
def test_custom_data_augmentation(nlp, doc):
|
||||
def create_spongebob_augmenter(randomize: bool = False):
|
||||
def augment(nlp, example):
|
||||
text = example.text
|
||||
if randomize:
|
||||
ch = [c.lower() if random.random() < 0.5 else c.upper() for c in text]
|
||||
else:
|
||||
ch = [c.lower() if i % 2 else c.upper() for i, c in enumerate(text)]
|
||||
example_dict = example.to_dict()
|
||||
doc = nlp.make_doc("".join(ch))
|
||||
example_dict["token_annotation"]["ORTH"] = [t.text for t in doc]
|
||||
yield example
|
||||
yield example.from_dict(doc, example_dict)
|
||||
|
||||
return augment
|
||||
|
||||
with make_docbin([doc]) as output_file:
|
||||
reader = Corpus(output_file, augmenter=create_spongebob_augmenter())
|
||||
corpus = list(reader(nlp))
|
||||
orig_text = "Sarah 's sister flew to Silicon Valley via London . "
|
||||
augmented = "SaRaH 's sIsTeR FlEw tO SiLiCoN VaLlEy vIa lOnDoN . "
|
||||
assert corpus[0].text == orig_text
|
||||
assert corpus[0].reference.text == orig_text
|
||||
assert corpus[0].predicted.text == orig_text
|
||||
assert corpus[1].text == augmented
|
||||
assert corpus[1].reference.text == augmented
|
||||
assert corpus[1].predicted.text == augmented
|
||||
ents = [(e.start, e.end, e.label) for e in doc.ents]
|
||||
assert [(e.start, e.end, e.label) for e in corpus[0].reference.ents] == ents
|
||||
assert [(e.start, e.end, e.label) for e in corpus[1].reference.ents] == ents
|
|
@ -1,23 +1,20 @@
|
|||
import numpy
|
||||
from spacy.training import offsets_to_biluo_tags, biluo_tags_to_offsets, Alignment
|
||||
from spacy.training import biluo_tags_to_spans, iob_to_biluo
|
||||
from spacy.training import Corpus, docs_to_json
|
||||
from spacy.training.example import Example
|
||||
from spacy.training import Corpus, docs_to_json, Example
|
||||
from spacy.training.converters import json_to_docs
|
||||
from spacy.training.augment import create_orth_variants_augmenter
|
||||
from spacy.lang.en import English
|
||||
from spacy.tokens import Doc, DocBin
|
||||
from spacy.util import get_words_and_spaces, minibatch
|
||||
from thinc.api import compounding
|
||||
import pytest
|
||||
import srsly
|
||||
import random
|
||||
|
||||
from ..util import make_tempdir
|
||||
|
||||
|
||||
@pytest.fixture
|
||||
def doc(en_vocab):
|
||||
def doc():
|
||||
nlp = English() # make sure we get a new vocab every time
|
||||
# fmt: off
|
||||
words = ["Sarah", "'s", "sister", "flew", "to", "Silicon", "Valley", "via", "London", "."]
|
||||
|
@ -495,59 +492,6 @@ def test_roundtrip_docs_to_docbin(doc):
|
|||
assert cats["BAKING"] == reloaded_example.reference.cats["BAKING"]
|
||||
|
||||
|
||||
@pytest.mark.filterwarnings("ignore::UserWarning")
|
||||
def test_make_orth_variants(doc):
|
||||
nlp = English()
|
||||
orth_variants = {
|
||||
"single": [
|
||||
{"tags": ["NFP"], "variants": ["…", "..."]},
|
||||
{"tags": [":"], "variants": ["-", "—", "–", "--", "---", "——"]},
|
||||
]
|
||||
}
|
||||
augmenter = create_orth_variants_augmenter(
|
||||
level=0.2, lower=0.5, orth_variants=orth_variants
|
||||
)
|
||||
with make_tempdir() as tmpdir:
|
||||
output_file = tmpdir / "roundtrip.spacy"
|
||||
DocBin(docs=[doc]).to_disk(output_file)
|
||||
# due to randomness, test only that this runs with no errors for now
|
||||
reader = Corpus(output_file, augmenter=augmenter)
|
||||
list(reader(nlp))
|
||||
|
||||
|
||||
@pytest.mark.filterwarnings("ignore::UserWarning")
|
||||
def test_custom_data_augmentation(doc):
|
||||
def create_spongebob_augmenter(randomize: bool = False):
|
||||
def augment(nlp, example):
|
||||
text = example.text
|
||||
if randomize:
|
||||
ch = [c.lower() if random.random() < 0.5 else c.upper() for c in text]
|
||||
else:
|
||||
ch = [c.lower() if i % 2 else c.upper() for i, c in enumerate(text)]
|
||||
example_dict = example.to_dict()
|
||||
doc = nlp.make_doc("".join(ch))
|
||||
example_dict["token_annotation"]["ORTH"] = [t.text for t in doc]
|
||||
yield example
|
||||
yield example.from_dict(doc, example_dict)
|
||||
|
||||
return augment
|
||||
|
||||
nlp = English()
|
||||
with make_tempdir() as tmpdir:
|
||||
output_file = tmpdir / "roundtrip.spacy"
|
||||
DocBin(docs=[doc]).to_disk(output_file)
|
||||
reader = Corpus(output_file, augmenter=create_spongebob_augmenter())
|
||||
corpus = list(reader(nlp))
|
||||
orig_text = "Sarah 's sister flew to Silicon Valley via London . "
|
||||
augmented = "SaRaH 's sIsTeR FlEw tO SiLiCoN VaLlEy vIa lOnDoN . "
|
||||
assert corpus[0].text == orig_text
|
||||
assert corpus[0].reference.text == orig_text
|
||||
assert corpus[0].predicted.text == orig_text
|
||||
assert corpus[1].text == augmented
|
||||
assert corpus[1].reference.text == augmented
|
||||
assert corpus[1].predicted.text == augmented
|
||||
|
||||
|
||||
@pytest.mark.skip("Outdated")
|
||||
@pytest.mark.parametrize(
|
||||
"tokens_a,tokens_b,expected",
|
||||
|
|
|
@ -336,6 +336,7 @@ def _split(Doc doc, int token_index, orths, heads, attrs):
|
|||
lex = doc.vocab.get(doc.mem, orth)
|
||||
token.lex = lex
|
||||
token.lemma = 0 # reset lemma
|
||||
token.norm = 0 # reset norm
|
||||
if to_process_tensor:
|
||||
# setting the tensors of the split tokens to array of zeros
|
||||
doc.tensor[token_index + i] = xp.zeros((1,doc.tensor.shape[1]), dtype="float32")
|
||||
|
|
|
@@ -245,7 +245,7 @@ cdef class Doc:
        self.noun_chunks_iterator = self.vocab.get_noun_chunks
        cdef bint has_space
        if words is None and spaces is not None:
            raise ValueError("words must be set if spaces is set")
            raise ValueError(Errors.E908)
        elif spaces is None and words is not None:
            self.has_unknown_spaces = True
        else:
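The validation above enforces that `spaces` may only be passed together with `words`; if only `words` are given, the doc is flagged as having unknown spaces. A small sketch of the intended usage, using `get_words_and_spaces` (imported in the test changes earlier in this commit) to align both arguments to a raw text:

```python
from spacy.lang.en import English
from spacy.tokens import Doc
from spacy.util import get_words_and_spaces

nlp = English()
text = "Sarah's sister flew to Silicon Valley."
words = ["Sarah", "'s", "sister", "flew", "to", "Silicon", "Valley", "."]
words, spaces = get_words_and_spaces(words, text)
doc = Doc(nlp.vocab, words=words, spaces=spaces)
assert doc.text == text
```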
@@ -17,7 +17,7 @@ from ..lexeme cimport Lexeme
from ..symbols cimport dep

from ..util import normalize_slice
from ..errors import Errors, TempErrors, Warnings
from ..errors import Errors, Warnings
from .underscore import Underscore, get_ext_args
@@ -362,8 +362,6 @@ cdef class Span:
        """RETURNS (Span): The sentence span that the span is a part of."""
        if "sent" in self.doc.user_span_hooks:
            return self.doc.user_span_hooks["sent"](self)
        # This should raise if not parsed / no custom sentence boundaries
        self.doc.sents
        # Use `sent_start` token attribute to find sentence boundaries
        cdef int n = 0
        if self.doc.has_annotation("SENT_START"):
@@ -373,13 +371,14 @@ cdef class Span:
                start += -1
            # Find end of the sentence
            end = self.end
            n = 0
            while end < self.doc.length and self.doc.c[end].sent_start != 1:
                end += 1
                n += 1
                if n >= self.doc.length:
                    break
            return self.doc[start:end]
        else:
            raise ValueError(Errors.E030)

    @property
    def ents(self):
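In short, `Span.sent` resolves via a `"sent"` user hook, a parse, or `SENT_START` annotation, and raises `E030` otherwise. A small usage sketch under the assumption that the sentencizer provides the boundaries:

```python
import spacy

nlp = spacy.blank("en")
nlp.add_pipe("sentencizer")
doc = nlp("Sarah flew to London. She landed at night.")
span = doc[5:7]  # "She landed"
print(span.sent.text)  # expected: "She landed at night."
```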
@@ -652,7 +651,7 @@ cdef class Span:
            return self.root.ent_id

        def __set__(self, hash_t key):
            raise NotImplementedError(TempErrors.T007.format(attr="ent_id"))
            raise NotImplementedError(Errors.E200.format(attr="ent_id"))

    property ent_id_:
        """RETURNS (str): The (string) entity ID."""
@@ -660,7 +659,7 @@ cdef class Span:
            return self.root.ent_id_

        def __set__(self, hash_t key):
            raise NotImplementedError(TempErrors.T007.format(attr="ent_id_"))
            raise NotImplementedError(Errors.E200.format(attr="ent_id_"))

    @property
    def orth_(self):
@@ -30,20 +30,51 @@ class OrthVariants(BaseModel):

@registry.augmenters("spacy.orth_variants.v1")
def create_orth_variants_augmenter(
    level: float, lower: float, orth_variants: OrthVariants,
    level: float, lower: float, orth_variants: OrthVariants
) -> Callable[["Language", Example], Iterator[Example]]:
    """Create a data augmentation callback that uses orth-variant replacement.
    The callback can be added to a corpus or other data iterator during training.

    level (float): The percentage of texts that will be augmented.
    lower (float): The percentage of texts that will be lowercased.
    orth_variants (Dict[str, dict]): A dictionary containing the single and
        paired orth variants. Typically loaded from a JSON file.
    RETURNS (Callable[[Language, Example], Iterator[Example]]): The augmenter.
    """
    return partial(
        orth_variants_augmenter, orth_variants=orth_variants, level=level, lower=lower
    )


@registry.augmenters("spacy.lower_case.v1")
def create_lower_casing_augmenter(
    level: float,
) -> Callable[["Language", Example], Iterator[Example]]:
    """Create a data augmentation callback that converts documents to lowercase.
    The callback can be added to a corpus or other data iterator during training.

    level (float): The percentage of texts that will be augmented.
    RETURNS (Callable[[Language, Example], Iterator[Example]]): The augmenter.
    """
    return partial(lower_casing_augmenter, level=level)


def dont_augment(nlp: "Language", example: Example) -> Iterator[Example]:
    yield example


def lower_casing_augmenter(
    nlp: "Language", example: Example, *, level: float,
) -> Iterator[Example]:
    if random.random() >= level:
        yield example
    else:
        example_dict = example.to_dict()
        doc = nlp.make_doc(example.text.lower())
        example_dict["token_annotation"]["ORTH"] = [t.lower_ for t in doc]
        yield example.from_dict(doc, example_dict)


def orth_variants_augmenter(
    nlp: "Language",
    example: Example,
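The docstring above notes that the orth variants are typically loaded from a JSON file. A hedged sketch of that wiring outside the test suite follows; the file path is hypothetical, and the import paths assume the augmenter lives in `spacy.training.augment` alongside `Corpus` in `spacy.training`.

```python
import srsly
from spacy.training import Corpus
from spacy.training.augment import create_orth_variants_augmenter

# JSON file holding the {"single": [...], "paired": [...]} structure
orth_variants = srsly.read_json("orth_variants.json")
augmenter = create_orth_variants_augmenter(
    level=0.1, lower=0.5, orth_variants=orth_variants
)
corpus = Corpus("corpus/train.spacy", augmenter=augmenter)
```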
@@ -2,9 +2,9 @@ from wasabi import Printer

from .. import tags_to_entities
from ...training import iob_to_biluo
from ...lang.xx import MultiLanguage
from ...tokens import Doc, Span
from ...util import load_model
from ...errors import Errors
from ...util import load_model, get_lang_class


def conll_ner_to_docs(
@@ -86,7 +86,7 @@ def conll_ner_to_docs(
    if model:
        nlp = load_model(model)
    else:
        nlp = MultiLanguage()
        nlp = get_lang_class("xx")()
    output_docs = []
    for conll_doc in input_data.strip().split(doc_delimiter):
        conll_doc = conll_doc.strip()
@@ -103,11 +103,7 @@ def conll_ner_to_docs(
            lines = [line.strip() for line in conll_sent.split("\n") if line.strip()]
            cols = list(zip(*[line.split() for line in lines]))
            if len(cols) < 2:
                raise ValueError(
                    "The token-per-line NER file is not formatted correctly. "
                    "Try checking whitespace and delimiters. See "
                    "https://nightly.spacy.io/api/cli#convert"
                )
                raise ValueError(Errors.E093)
            length = len(cols[0])
            words.extend(cols[0])
            sent_starts.extend([True] + [False] * (length - 1))
@@ -136,7 +132,7 @@ def segment_sents_and_docs(doc, n_sents, doc_delimiter, model=None, msg=None):
            "Segmenting sentences with sentencizer. (Use `-b model` for "
            "improved parser-based sentence segmentation.)"
        )
        nlp = MultiLanguage()
        nlp = get_lang_class("xx")()
        sentencizer = nlp.create_pipe("sentencizer")
    lines = doc.strip().split("\n")
    words = [line.strip().split()[0] for line in lines]
@@ -4,6 +4,7 @@ from .conll_ner_to_docs import n_sents_info
from ...vocab import Vocab
from ...training import iob_to_biluo, tags_to_entities
from ...tokens import Doc, Span
from ...errors import Errors
from ...util import minibatch
@@ -45,9 +46,7 @@ def read_iob(raw_sents, vocab, n_sents):
            sent_words, sent_iob = zip(*sent_tokens)
            sent_tags = ["-"] * len(sent_words)
        else:
            raise ValueError(
                "The sentence-per-line IOB/IOB2 file is not formatted correctly. Try checking whitespace and delimiters. See https://nightly.spacy.io/api/cli#convert"
            )
            raise ValueError(Errors.E092)
        words.extend(sent_words)
        tags.extend(sent_tags)
        iob.extend(sent_iob)
@@ -12,6 +12,7 @@ from .iob_utils import biluo_to_iob, offsets_to_biluo_tags, doc_to_biluo_tags
from .iob_utils import biluo_tags_to_spans
from ..errors import Errors, Warnings
from ..pipeline._parser_internals import nonproj
from ..util import logger


cpdef Doc annotations_to_doc(vocab, tok_annot, doc_annot):
@@ -390,7 +391,7 @@ def _fix_legacy_dict_data(example_dict):
    if "HEAD" in token_dict and "SENT_START" in token_dict:
        # If heads are set, we don't also redundantly specify SENT_START.
        token_dict.pop("SENT_START")
        warnings.warn(Warnings.W092)
        logger.debug(Warnings.W092)
    return {
        "token_annotation": token_dict,
        "doc_annotation": doc_dict
@@ -50,9 +50,6 @@ def init_nlp(config: Config, *, use_gpu: int = -1) -> "Language":
    with nlp.select_pipes(disable=[*frozen_components, *resume_components]):
        nlp.initialize(lambda: train_corpus(nlp), sgd=optimizer)
    logger.info("Initialized pipeline components")
    # Verify the config after calling 'initialize' to ensure labels
    # are properly initialized
    verify_config(nlp)
    return nlp
@@ -102,7 +99,7 @@ def load_vectors_into_model(
            "with the packaged vectors. Make sure that the vectors package you're "
            "loading is compatible with the current version of spaCy."
        )
        err = ConfigValidationError.from_error(config=None, title=title, desc=desc)
        err = ConfigValidationError.from_error(e, config=None, title=title, desc=desc)
        raise err from None
    nlp.vocab.vectors = vectors_nlp.vocab.vectors
    if add_strings:
@@ -152,33 +149,6 @@ def init_tok2vec(
    return False


def verify_config(nlp: "Language") -> None:
    """Perform additional checks based on the config, loaded nlp object and training data."""
    # TODO: maybe we should validate based on the actual components, the list
    # in config["nlp"]["pipeline"] instead?
    for pipe_config in nlp.config["components"].values():
        # We can't assume that the component name == the factory
        factory = pipe_config["factory"]
        if factory == "textcat":
            verify_textcat_config(nlp, pipe_config)


def verify_textcat_config(nlp: "Language", pipe_config: Dict[str, Any]) -> None:
    # if 'positive_label' is provided: double check whether it's in the data and
    # the task is binary
    if pipe_config.get("positive_label"):
        textcat_labels = nlp.get_pipe("textcat").labels
        pos_label = pipe_config.get("positive_label")
        if pos_label not in textcat_labels:
            raise ValueError(
                Errors.E920.format(pos_label=pos_label, labels=textcat_labels)
            )
        if len(list(textcat_labels)) != 2:
            raise ValueError(
                Errors.E919.format(pos_label=pos_label, labels=textcat_labels)
            )


def get_sourced_components(config: Union[Dict[str, Any], Config]) -> List[str]:
    """RETURNS (List[str]): All sourced components in the original config,
    e.g. {"source": "en_core_web_sm"}. If the config contains a key
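To illustrate the constraint `verify_textcat_config` enforces: when a positive label is configured, it must be one of exactly two textcat labels, otherwise `E919`/`E920` is raised at initialization. A minimal sketch of a setup that satisfies the check; the label names are just examples.

```python
import spacy

nlp = spacy.blank("en")
textcat = nlp.add_pipe("textcat")
textcat.add_label("POS")
textcat.add_label("NEG")

labels = nlp.get_pipe("textcat").labels
pos_label = "POS"  # example value; normally taken from the component/init config
assert pos_label in labels
assert len(list(labels)) == 2
```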
@ -1,18 +1,25 @@
|
|||
from typing import Dict, Any, Tuple, Callable, List
|
||||
from typing import TYPE_CHECKING, Dict, Any, Tuple, Callable, List, Optional, IO
|
||||
from wasabi import Printer
|
||||
import tqdm
|
||||
import sys
|
||||
|
||||
from ..util import registry
|
||||
from .. import util
|
||||
from ..errors import Errors
|
||||
from wasabi import msg
|
||||
|
||||
if TYPE_CHECKING:
|
||||
from ..language import Language # noqa: F401
|
||||
|
||||
|
||||
@registry.loggers("spacy.ConsoleLogger.v1")
|
||||
def console_logger():
|
||||
def console_logger(progress_bar: bool = False):
|
||||
def setup_printer(
|
||||
nlp: "Language",
|
||||
) -> Tuple[Callable[[Dict[str, Any]], None], Callable]:
|
||||
nlp: "Language", stdout: IO = sys.stdout, stderr: IO = sys.stderr
|
||||
) -> Tuple[Callable[[Optional[Dict[str, Any]]], None], Callable[[], None]]:
|
||||
msg = Printer(no_print=True)
|
||||
# we assume here that only components are enabled that should be trained & logged
|
||||
logged_pipes = nlp.pipe_names
|
||||
eval_frequency = nlp.config["training"]["eval_frequency"]
|
||||
score_weights = nlp.config["training"]["score_weights"]
|
||||
score_cols = [col for col, value in score_weights.items() if value is not None]
|
||||
score_widths = [max(len(col), 6) for col in score_cols]
|
||||
|
@ -22,10 +29,18 @@ def console_logger():
|
|||
table_header = [col.upper() for col in table_header]
|
||||
table_widths = [3, 6] + loss_widths + score_widths + [6]
|
||||
table_aligns = ["r" for _ in table_widths]
|
||||
msg.row(table_header, widths=table_widths)
|
||||
msg.row(["-" * width for width in table_widths])
|
||||
stdout.write(msg.row(table_header, widths=table_widths) + "\n")
|
||||
stdout.write(msg.row(["-" * width for width in table_widths]) + "\n")
|
||||
progress = None
|
||||
|
||||
def log_step(info: Dict[str, Any]):
|
||||
def log_step(info: Optional[Dict[str, Any]]) -> None:
|
||||
nonlocal progress
|
||||
|
||||
if info is None:
|
||||
# If we don't have a new checkpoint, just return.
|
||||
if progress is not None:
|
||||
progress.update(1)
|
||||
return
|
||||
try:
|
||||
losses = [
|
||||
"{0:.2f}".format(float(info["losses"][pipe_name]))
|
||||
|
@ -39,26 +54,36 @@ def console_logger():
|
|||
keys=list(info["losses"].keys()),
|
||||
)
|
||||
) from None
|
||||
|
||||
scores = []
|
||||
for col in score_cols:
|
||||
score = info["other_scores"].get(col, 0.0)
|
||||
try:
|
||||
score = float(score)
|
||||
if col != "speed":
|
||||
score *= 100
|
||||
scores.append("{0:.2f}".format(score))
|
||||
except TypeError:
|
||||
err = Errors.E916.format(name=col, score_type=type(score))
|
||||
raise ValueError(err) from None
|
||||
if col != "speed":
|
||||
score *= 100
|
||||
scores.append("{0:.2f}".format(score))
|
||||
|
||||
data = (
|
||||
[info["epoch"], info["step"]]
|
||||
+ losses
|
||||
+ scores
|
||||
+ ["{0:.2f}".format(float(info["score"]))]
|
||||
)
|
||||
msg.row(data, widths=table_widths, aligns=table_aligns)
|
||||
if progress is not None:
|
||||
progress.close()
|
||||
stdout.write(msg.row(data, widths=table_widths, aligns=table_aligns) + "\n")
|
||||
if progress_bar:
|
||||
# Set disable=None, so that it disables on non-TTY
|
||||
progress = tqdm.tqdm(
|
||||
total=eval_frequency, disable=None, leave=False, file=stderr
|
||||
)
|
||||
progress.set_description(f"Epoch {info['epoch']+1}")
|
||||
|
||||
def finalize():
|
||||
def finalize() -> None:
|
||||
pass
|
||||
|
||||
return log_step, finalize
|
||||
|
@ -70,31 +95,32 @@ def console_logger():
|
|||
def wandb_logger(project_name: str, remove_config_values: List[str] = []):
|
||||
import wandb
|
||||
|
||||
console = console_logger()
|
||||
console = console_logger(progress_bar=False)
|
||||
|
||||
def setup_logger(
|
||||
nlp: "Language",
|
||||
) -> Tuple[Callable[[Dict[str, Any]], None], Callable]:
|
||||
nlp: "Language", stdout: IO = sys.stdout, stderr: IO = sys.stderr
|
||||
) -> Tuple[Callable[[Dict[str, Any]], None], Callable[[], None]]:
|
||||
config = nlp.config.interpolate()
|
||||
config_dot = util.dict_to_dot(config)
|
||||
for field in remove_config_values:
|
||||
del config_dot[field]
|
||||
config = util.dot_to_dict(config_dot)
|
||||
wandb.init(project=project_name, config=config, reinit=True)
|
||||
console_log_step, console_finalize = console(nlp)
|
||||
console_log_step, console_finalize = console(nlp, stdout, stderr)
|
||||
|
||||
def log_step(info: Dict[str, Any]):
|
||||
def log_step(info: Optional[Dict[str, Any]]):
|
||||
console_log_step(info)
|
||||
score = info["score"]
|
||||
other_scores = info["other_scores"]
|
||||
losses = info["losses"]
|
||||
wandb.log({"score": score})
|
||||
if losses:
|
||||
wandb.log({f"loss_{k}": v for k, v in losses.items()})
|
||||
if isinstance(other_scores, dict):
|
||||
wandb.log(other_scores)
|
||||
if info is not None:
|
||||
score = info["score"]
|
||||
other_scores = info["other_scores"]
|
||||
losses = info["losses"]
|
||||
wandb.log({"score": score})
|
||||
if losses:
|
||||
wandb.log({f"loss_{k}": v for k, v in losses.items()})
|
||||
if isinstance(other_scores, dict):
|
||||
wandb.log(other_scores)
|
||||
|
||||
def finalize():
|
||||
def finalize() -> None:
|
||||
console_finalize()
|
||||
wandb.join()
|
||||
|
||||
|
|
|
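The logger changes in the hunks above switch to a `setup(nlp, stdout, stderr)` function that returns `(log_step, finalize)`, where `log_step` may receive `None` between checkpoints. A hedged sketch of a custom logger following that same contract; the registered name `"my_plain_logger.v1"` is hypothetical.

```python
import sys
from typing import IO, Optional, Dict, Any
from spacy.util import registry


@registry.loggers("my_plain_logger.v1")
def plain_logger():
    def setup(nlp, stdout: IO = sys.stdout, stderr: IO = sys.stderr):
        def log_step(info: Optional[Dict[str, Any]]) -> None:
            # info is None between evaluation checkpoints
            if info is not None:
                stdout.write(f"step={info['step']} score={info['score']:.3f}\n")

        def finalize() -> None:
            pass

        return log_step, finalize

    return setup
```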
@@ -1,11 +1,11 @@
from typing import List, Callable, Tuple, Dict, Iterable, Iterator, Union, Any
from typing import List, Callable, Tuple, Dict, Iterable, Iterator, Union, Any, IO
from typing import Optional, TYPE_CHECKING
from pathlib import Path
from timeit import default_timer as timer
from thinc.api import Optimizer, Config, constant, fix_random_seed, set_gpu_allocator
import random
import tqdm
from wasabi import Printer
import wasabi
import sys

from .example import Example
from ..schemas import ConfigSchemaTraining
@@ -21,7 +21,8 @@ def train(
    output_path: Optional[Path] = None,
    *,
    use_gpu: int = -1,
    silent: bool = False,
    stdout: IO = sys.stdout,
    stderr: IO = sys.stderr,
) -> None:
    """Train a pipeline.

@@ -29,10 +30,15 @@ def train(
    output_path (Path): Optional output path to save trained model to.
    use_gpu (int): Whether to train on GPU. Make sure to call require_gpu
        before calling this function.
    silent (bool): Whether to pretty-print outputs.
    stdout (file): A file-like object to write output messages. To disable
        printing, set to io.StringIO.
    stderr (file): A second file-like object to write output messages. To disable
        printing, set to io.StringIO.

    RETURNS (Path / None): The path to the final exported model.
    """
    msg = Printer(no_print=silent)
    # We use no_print here so we can respect the stdout/stderr options.
    msg = wasabi.Printer(no_print=True)
    # Create iterator, which yields out info after each optimization step.
    config = nlp.config.interpolate()
    if config["training"]["seed"] is not None:
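A small sketch of the `wasabi` pattern adopted above: with `no_print=True`, the `Printer` methods return the formatted string instead of printing it, so the caller decides which stream receives the output. The messages below are just examples.

```python
import io
from wasabi import Printer

msg = Printer(no_print=True)
stdout = io.StringIO()  # could equally be sys.stdout or a log file
stdout.write(msg.info("Pipeline: ['tok2vec', 'ner']") + "\n")
stdout.write(msg.good("Saved pipeline to output directory") + "\n")
print(stdout.getvalue())
```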
@ -63,50 +69,47 @@ def train(
|
|||
eval_frequency=T["eval_frequency"],
|
||||
exclude=frozen_components,
|
||||
)
|
||||
msg.info(f"Pipeline: {nlp.pipe_names}")
|
||||
stdout.write(msg.info(f"Pipeline: {nlp.pipe_names}") + "\n")
|
||||
if frozen_components:
|
||||
msg.info(f"Frozen components: {frozen_components}")
|
||||
msg.info(f"Initial learn rate: {optimizer.learn_rate}")
|
||||
stdout.write(msg.info(f"Frozen components: {frozen_components}") + "\n")
|
||||
stdout.write(msg.info(f"Initial learn rate: {optimizer.learn_rate}") + "\n")
|
||||
with nlp.select_pipes(disable=frozen_components):
|
||||
print_row, finalize_logger = train_logger(nlp)
|
||||
log_step, finalize_logger = train_logger(nlp, stdout, stderr)
|
||||
try:
|
||||
progress = tqdm.tqdm(total=T["eval_frequency"], leave=False)
|
||||
progress.set_description(f"Epoch 1")
|
||||
for batch, info, is_best_checkpoint in training_step_iterator:
|
||||
progress.update(1)
|
||||
if is_best_checkpoint is not None:
|
||||
progress.close()
|
||||
print_row(info)
|
||||
if is_best_checkpoint and output_path is not None:
|
||||
with nlp.select_pipes(disable=frozen_components):
|
||||
update_meta(T, nlp, info)
|
||||
with nlp.use_params(optimizer.averages):
|
||||
nlp = before_to_disk(nlp)
|
||||
nlp.to_disk(output_path / "model-best")
|
||||
progress = tqdm.tqdm(total=T["eval_frequency"], leave=False)
|
||||
progress.set_description(f"Epoch {info['epoch']}")
|
||||
log_step(info if is_best_checkpoint is not None else None)
|
||||
if is_best_checkpoint is not None and output_path is not None:
|
||||
with nlp.select_pipes(disable=frozen_components):
|
||||
update_meta(T, nlp, info)
|
||||
with nlp.use_params(optimizer.averages):
|
||||
nlp = before_to_disk(nlp)
|
||||
nlp.to_disk(output_path / "model-best")
|
||||
except Exception as e:
|
||||
finalize_logger()
|
||||
if output_path is not None:
|
||||
# We don't want to swallow the traceback if we don't have a
|
||||
# specific error.
|
||||
msg.warn(
|
||||
f"Aborting and saving the final best model. "
|
||||
f"Encountered exception: {str(e)}"
|
||||
# specific error, but we do want to warn that we're trying
|
||||
# to do something here.
|
||||
stdout.write(
|
||||
msg.warn(
|
||||
f"Aborting and saving the final best model. "
|
||||
f"Encountered exception: {str(e)}"
|
||||
)
|
||||
+ "\n"
|
||||
)
|
||||
nlp = before_to_disk(nlp)
|
||||
nlp.to_disk(output_path / "model-final")
|
||||
raise e
|
||||
finally:
|
||||
finalize_logger()
|
||||
if output_path is not None:
|
||||
final_model_path = output_path / "model-final"
|
||||
final_model_path = output_path / "model-last"
|
||||
if optimizer.averages:
|
||||
with nlp.use_params(optimizer.averages):
|
||||
nlp.to_disk(final_model_path)
|
||||
else:
|
||||
nlp.to_disk(final_model_path)
|
||||
msg.good(f"Saved pipeline to output directory", final_model_path)
|
||||
# This will only run if we don't hit an error
|
||||
stdout.write(
|
||||
msg.good("Saved pipeline to output directory", final_model_path) + "\n"
|
||||
)
|
||||
|
||||
|
||||
def train_while_improving(
|
||||
|
|
|
@@ -16,6 +16,7 @@ from ..attrs import ID
from ..ml.models.multi_task import build_cloze_multi_task_model
from ..ml.models.multi_task import build_cloze_characters_multi_task_model
from ..schemas import ConfigSchemaTraining, ConfigSchemaPretrain
from ..errors import Errors
from ..util import registry, load_model_from_config, dot_to_object
@@ -151,9 +152,9 @@ def create_objective(config: Config):
            distance = L2Distance(normalize=True, ignore_zeros=True)
            return partial(get_vectors_loss, distance=distance)
        else:
            raise ValueError("Unexpected loss type", config["loss"])
            raise ValueError(Errors.E906.format(loss_type=config["loss"]))
    else:
        raise ValueError("Unexpected objective_type", objective_type)
        raise ValueError(Errors.E907.format(objective_type=objective_type))


def get_vectors_loss(ops, docs, prediction, distance):
@@ -16,7 +16,7 @@ from .errors import Errors
from .attrs import intify_attrs, NORM, IS_STOP
from .vectors import Vectors
from .util import registry
from .lookups import Lookups, load_lookups
from .lookups import Lookups
from . import util
from .lang.norm_exceptions import BASE_NORMS
from .lang.lex_attrs import LEX_ATTRS, is_stop, get_lang
@@ -4,6 +4,7 @@ tag: class
source: spacy/pipeline/attributeruler.py
new: 3
teaser: 'Pipeline component for rule-based token attribute assignment'
api_base_class: /api/pipe
api_string_name: attribute_ruler
api_trainable: false
---
@ -25,17 +26,13 @@ how the component should be configured. You can override its settings via the
|
|||
> #### Example
|
||||
>
|
||||
> ```python
|
||||
> config = {
|
||||
> "pattern_dicts": None,
|
||||
> "validate": True,
|
||||
> }
|
||||
> config = {"validate": True}
|
||||
> nlp.add_pipe("attribute_ruler", config=config)
|
||||
> ```
|
||||
|
||||
| Setting | Description |
|
||||
| --------------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
|
||||
| `pattern_dicts` | A list of pattern dicts with the keys as the arguments to [`AttributeRuler.add`](/api/attributeruler#add) (`patterns`/`attrs`/`index`) to add as patterns. Defaults to `None`. ~~Optional[Iterable[Dict[str, Union[List[dict], dict, int]]]]~~ |
|
||||
| `validate` | Whether patterns should be validated (passed to the `Matcher`). Defaults to `False`. ~~bool~~ |
|
||||
| Setting | Description |
|
||||
| ---------- | --------------------------------------------------------------------------------------------- |
|
||||
| `validate` | Whether patterns should be validated (passed to the `Matcher`). Defaults to `False`. ~~bool~~ |
|
||||
|
||||
```python
|
||||
%%GITHUB_SPACY/spacy/pipeline/attributeruler.py
|
||||
|
@ -43,36 +40,26 @@ how the component should be configured. You can override its settings via the
|
|||
|
||||
## AttributeRuler.\_\_init\_\_ {#init tag="method"}
|
||||
|
||||
Initialize the attribute ruler. If pattern dicts are supplied here, they need to
|
||||
be a list of dictionaries with `"patterns"`, `"attrs"`, and optional `"index"`
|
||||
keys, e.g.:
|
||||
|
||||
```python
|
||||
pattern_dicts = [
|
||||
{"patterns": [[{"TAG": "VB"}]], "attrs": {"POS": "VERB"}},
|
||||
{"patterns": [[{"LOWER": "an"}]], "attrs": {"LEMMA": "a"}},
|
||||
]
|
||||
```
|
||||
Initialize the attribute ruler.
|
||||
|
||||
> #### Example
|
||||
>
|
||||
> ```python
|
||||
> # Construction via add_pipe
|
||||
> attribute_ruler = nlp.add_pipe("attribute_ruler")
|
||||
> ruler = nlp.add_pipe("attribute_ruler")
|
||||
> ```
|
||||
|
||||
| Name | Description |
|
||||
| --------------- | ---------------------------------------------------------------------------------------------------------------------------------------- |
|
||||
| `vocab` | The shared vocabulary to pass to the matcher. ~~Vocab~~ |
|
||||
| `name` | Instance name of the current pipeline component. Typically passed in automatically from the factory when the component is added. ~~str~~ |
|
||||
| _keyword-only_ | |
|
||||
| `pattern_dicts` | Optional patterns to load in on initialization. Defaults to `None`. ~~Optional[Iterable[Dict[str, Union[List[dict], dict, int]]]]~~ |
|
||||
| `validate` | Whether patterns should be validated (passed to the [`Matcher`](/api/matcher#init)). Defaults to `False`. ~~bool~~ |
|
||||
| Name | Description |
|
||||
| -------------- | ---------------------------------------------------------------------------------------------------------------------------------------- |
|
||||
| `vocab` | The shared vocabulary to pass to the matcher. ~~Vocab~~ |
|
||||
| `name` | Instance name of the current pipeline component. Typically passed in automatically from the factory when the component is added. ~~str~~ |
|
||||
| _keyword-only_ | |
|
||||
| `validate` | Whether patterns should be validated (passed to the [`Matcher`](/api/matcher#init)). Defaults to `False`. ~~bool~~ |
|
||||
|
||||
## AttributeRuler.\_\_call\_\_ {#call tag="method"}
|
||||
|
||||
Apply the attribute ruler to a `Doc`, setting token attributes for tokens matched
|
||||
by the provided patterns.
|
||||
Apply the attribute ruler to a `Doc`, setting token attributes for tokens
|
||||
matched by the provided patterns.
|
||||
|
||||
| Name | Description |
|
||||
| ----------- | -------------------------------- |
|
||||
|
@ -90,10 +77,10 @@ may be negative to index from the end of the span.
|
|||
> #### Example
|
||||
>
|
||||
> ```python
|
||||
> attribute_ruler = nlp.add_pipe("attribute_ruler")
|
||||
> ruler = nlp.add_pipe("attribute_ruler")
|
||||
> patterns = [[{"TAG": "VB"}]]
|
||||
> attrs = {"POS": "VERB"}
|
||||
> attribute_ruler.add(patterns=patterns, attrs=attrs)
|
||||
> ruler.add(patterns=patterns, attrs=attrs)
|
||||
> ```
|
||||
|
||||
| Name | Description |
|
||||
|
@ -107,11 +94,10 @@ may be negative to index from the end of the span.
|
|||
> #### Example
|
||||
>
|
||||
> ```python
|
||||
> attribute_ruler = nlp.add_pipe("attribute_ruler")
|
||||
> pattern_dicts = [
|
||||
> ruler = nlp.add_pipe("attribute_ruler")
|
||||
> patterns = [
|
||||
> {
|
||||
> "patterns": [[{"TAG": "VB"}]],
|
||||
> "attrs": {"POS": "VERB"}
|
||||
> "patterns": [[{"TAG": "VB"}]], "attrs": {"POS": "VERB"}
|
||||
> },
|
||||
> {
|
||||
> "patterns": [[{"LOWER": "two"}, {"LOWER": "apples"}]],
|
||||
|
@ -119,15 +105,16 @@ may be negative to index from the end of the span.
|
|||
> "index": -1
|
||||
> },
|
||||
> ]
|
||||
> attribute_ruler.add_patterns(pattern_dicts)
|
||||
> ruler.add_patterns(patterns)
|
||||
> ```
|
||||
|
||||
Add patterns from a list of pattern dicts with the keys as the arguments to
|
||||
Add patterns from a list of pattern dicts. Each pattern dict can specify the
|
||||
keys `"patterns"`, `"attrs"` and `"index"`, which match the arguments of
|
||||
[`AttributeRuler.add`](/api/attributeruler#add).
|
||||
|
||||
| Name | Description |
|
||||
| --------------- | -------------------------------------------------------------------------- |
|
||||
| `pattern_dicts` | The patterns to add. ~~Iterable[Dict[str, Union[List[dict], dict, int]]]~~ |
|
||||
| Name | Description |
|
||||
| ---------- | -------------------------------------------------------------------------- |
|
||||
| `patterns` | The patterns to add. ~~Iterable[Dict[str, Union[List[dict], dict, int]]]~~ |
|
||||
|
||||
## AttributeRuler.patterns {#patterns tag="property"}
|
||||
|
||||
|
@ -139,20 +126,39 @@ Get all patterns that have been added to the attribute ruler in the
|
|||
| ----------- | -------------------------------------------------------------------------------------------- |
|
||||
| **RETURNS** | The patterns added to the attribute ruler. ~~List[Dict[str, Union[List[dict], dict, int]]]~~ |
|
||||
|
||||
## AttributeRuler.score {#score tag="method" new="3"}
|
||||
## AttributeRuler.initialize {#initialize tag="method"}
|
||||
|
||||
Score a batch of examples.
|
||||
Initialize the component with data. Typically called before training to load in
|
||||
rules from a file. This method is typically called by
|
||||
[`Language.initialize`](/api/language#initialize) and lets you customize
|
||||
arguments it receives via the
|
||||
[`[initialize.components]`](/api/data-formats#config-initialize) block in the
|
||||
config.
|
||||
|
||||
> #### Example
|
||||
>
|
||||
> ```python
|
||||
> scores = attribute_ruler.score(examples)
|
||||
> ruler = nlp.add_pipe("attribute_ruler")
|
||||
> ruler.initialize(lambda: [], nlp=nlp, patterns=patterns)
|
||||
> ```
|
||||
>
|
||||
> ```ini
|
||||
> ### config.cfg
|
||||
> [initialize.components.attribute_ruler]
|
||||
>
|
||||
> [initialize.components.attribute_ruler.patterns]
|
||||
> @readers = "srsly.read_json.v1"
|
||||
> path = "corpus/attribute_ruler_patterns.json
|
||||
> ```
|
||||
|
||||
| Name | Description |
|
||||
| ----------- | --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
|
||||
| `examples` | The examples to score. ~~Iterable[Example]~~ |
|
||||
| **RETURNS** | The scores, produced by [`Scorer.score_token_attr`](/api/scorer#score_token_attr) for the attributes `"tag"`, `"pos"`, `"morph"` and `"lemma"` if present in any of the target token attributes. ~~Dict[str, float]~~ |
|
||||
| Name | Description |
|
||||
| -------------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
|
||||
| `get_examples` | Function that returns gold-standard annotations in the form of [`Example`](/api/example) objects (the training data). Not used by this component. ~~Callable[[], Iterable[Example]]~~ |
|
||||
| _keyword-only_ | |
|
||||
| `nlp` | The current `nlp` object. Defaults to `None`. ~~Optional[Language]~~ |
|
||||
| `patterns` | A list of pattern dicts with the keys as the arguments to [`AttributeRuler.add`](/api/attributeruler#add) (`patterns`/`attrs`/`index`) to add as patterns. Defaults to `None`. ~~Optional[Iterable[Dict[str, Union[List[dict], dict, int]]]]~~ |
|
||||
| `tag_map` | The tag map that maps fine-grained tags to coarse-grained tags and morphological features. Defaults to `None`. ~~Optional[Dict[str, Dict[Union[int, str], Union[int, str]]]]~~ |
|
||||
| `morph_rules` | The morph rules that map token text and fine-grained tags to coarse-grained tags, lemmas and morphological features. Defaults to `None`. ~~Optional[Dict[str, Dict[str, Dict[Union[int, str], Union[int, str]]]]]~~ |
|
||||
|
||||
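As a complement to the config block above, here is a hedged sketch of producing the patterns file that the `srsly.read_json.v1` reader loads at initialization. The pattern dicts reuse the structure documented for `AttributeRuler.add`, and the output path simply mirrors the config example.

```python
import srsly

patterns = [
    {"patterns": [[{"TAG": "VB"}]], "attrs": {"POS": "VERB"}},
    {"patterns": [[{"LOWER": "an"}]], "attrs": {"LEMMA": "a"}},
]
srsly.write_json("corpus/attribute_ruler_patterns.json", patterns)
```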
## AttributeRuler.load_from_tag_map {#load_from_tag_map tag="method"}
|
||||
|
||||
|
@ -170,6 +176,21 @@ Load attribute ruler patterns from morph rules.
|
|||
| ------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
|
||||
| `morph_rules` | The morph rules that map token text and fine-grained tags to coarse-grained tags, lemmas and morphological features. ~~Dict[str, Dict[str, Dict[Union[int, str], Union[int, str]]]]~~ |
|
||||
|
||||
## AttributeRuler.score {#score tag="method" new="3"}
|
||||
|
||||
Score a batch of examples.
|
||||
|
||||
> #### Example
|
||||
>
|
||||
> ```python
|
||||
> scores = ruler.score(examples)
|
||||
> ```
|
||||
|
||||
| Name | Description |
|
||||
| ----------- | --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
|
||||
| `examples` | The examples to score. ~~Iterable[Example]~~ |
|
||||
| **RETURNS** | The scores, produced by [`Scorer.score_token_attr`](/api/scorer#score_token_attr) for the attributes `"tag"`, `"pos"`, `"morph"` and `"lemma"` if present in any of the target token attributes. ~~Dict[str, float]~~ |
|
||||
|
||||
## AttributeRuler.to_disk {#to_disk tag="method"}
|
||||
|
||||
Serialize the pipe to disk.
|
||||
|
@ -177,8 +198,8 @@ Serialize the pipe to disk.
|
|||
> #### Example
|
||||
>
|
||||
> ```python
|
||||
> attribute_ruler = nlp.add_pipe("attribute_ruler")
|
||||
> attribute_ruler.to_disk("/path/to/attribute_ruler")
|
||||
> ruler = nlp.add_pipe("attribute_ruler")
|
||||
> ruler.to_disk("/path/to/attribute_ruler")
|
||||
> ```
|
||||
|
||||
| Name | Description |
|
||||
|
@ -194,8 +215,8 @@ Load the pipe from disk. Modifies the object in place and returns it.
|
|||
> #### Example
|
||||
>
|
||||
> ```python
|
||||
> attribute_ruler = nlp.add_pipe("attribute_ruler")
|
||||
> attribute_ruler.from_disk("/path/to/attribute_ruler")
|
||||
> ruler = nlp.add_pipe("attribute_ruler")
|
||||
> ruler.from_disk("/path/to/attribute_ruler")
|
||||
> ```
|
||||
|
||||
| Name | Description |
|
||||
|
@ -210,8 +231,8 @@ Load the pipe from disk. Modifies the object in place and returns it.
|
|||
> #### Example
|
||||
>
|
||||
> ```python
|
||||
> attribute_ruler = nlp.add_pipe("attribute_ruler")
|
||||
> attribute_ruler_bytes = attribute_ruler.to_bytes()
|
||||
> ruler = nlp.add_pipe("attribute_ruler")
|
||||
> ruler = ruler.to_bytes()
|
||||
> ```
|
||||
|
||||
Serialize the pipe to a bytestring.
|
||||
|
@ -229,9 +250,9 @@ Load the pipe from a bytestring. Modifies the object in place and returns it.
|
|||
> #### Example
|
||||
>
|
||||
> ```python
|
||||
> attribute_ruler_bytes = attribute_ruler.to_bytes()
|
||||
> attribute_ruler = nlp.add_pipe("attribute_ruler")
|
||||
> attribute_ruler.from_bytes(attribute_ruler_bytes)
|
||||
> ruler_bytes = ruler.to_bytes()
|
||||
> ruler = nlp.add_pipe("attribute_ruler")
|
||||
> ruler.from_bytes(ruler_bytes)
|
||||
> ```
|
||||
|
||||
| Name | Description |
|
||||
|
@ -250,12 +271,12 @@ serialization by passing in the string names via the `exclude` argument.
|
|||
> #### Example
|
||||
>
|
||||
> ```python
|
||||
> data = attribute_ruler.to_disk("/path", exclude=["vocab"])
|
||||
> data = ruler.to_disk("/path", exclude=["vocab"])
|
||||
> ```
|
||||
|
||||
| Name | Description |
|
||||
| ---------- | -------------------------------------------------------------- |
|
||||
| `vocab` | The shared [`Vocab`](/api/vocab). |
|
||||
| `patterns` | The `Matcher` patterns. You usually don't want to exclude this. |
|
||||
| `attrs` | The attributes to set. You usually don't want to exclude this. |
|
||||
| `indices` | The token indices. You usually don't want to exclude this. |
|
||||
| Name | Description |
|
||||
| ---------- | --------------------------------------------------------------- |
|
||||
| `vocab` | The shared [`Vocab`](/api/vocab). |
|
||||
| `patterns` | The `Matcher` patterns. You usually don't want to exclude this. |
|
||||
| `attrs` | The attributes to set. You usually don't want to exclude this. |
|
||||
| `indices` | The token indices. You usually don't want to exclude this. |
|
||||
|
|
|
@ -232,7 +232,7 @@ $ python -m spacy init labels [config_path] [output_path] [--code] [--verbose] [
|
|||
| `--gpu-id`, `-g` | GPU ID or `-1` for CPU. Defaults to `-1`. ~~int (option)~~ |
|
||||
| `--help`, `-h` | Show help message and available arguments. ~~bool (flag)~~ |
|
||||
| overrides | Config parameters to override. Should be options starting with `--` that correspond to the config section and value to override, e.g. `--paths.train ./train.spacy`. ~~Any (option/flag)~~ |
|
||||
| **CREATES** | The final trained pipeline and the best trained pipeline. |
|
||||
| **CREATES** | The best trained pipeline and the final checkpoint (if training is terminated). |
|
||||
|
||||
## convert {#convert tag="command"}
|
||||
|
||||
|
|
|
@ -176,12 +176,12 @@ This method was previously called `begin_training`.
|
|||
> path = "corpus/labels/parser.json
|
||||
> ```
|
||||
|
||||
| Name | Description |
|
||||
| -------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
|
||||
| `get_examples` | Function that returns gold-standard annotations in the form of [`Example`](/api/example) objects. ~~Callable[[], Iterable[Example]]~~ |
|
||||
| _keyword-only_ | |
|
||||
| `nlp` | The current `nlp` object. Defaults to `None`. ~~Optional[Language]~~ |
|
||||
| `labels` | The label information to add to the component. To generate a reusable JSON file from your data, you should run the [`init labels`](/api/cli#init-labels) command. If no labels are provided, the `get_examples` callback is used to extract the labels from the data, which may be a lot slower. ~~Optional[dict]~~ |
|
||||
| Name | Description |
|
||||
| -------------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
|
||||
| `get_examples` | Function that returns gold-standard annotations in the form of [`Example`](/api/example) objects. ~~Callable[[], Iterable[Example]]~~ |
|
||||
| _keyword-only_ | |
|
||||
| `nlp` | The current `nlp` object. Defaults to `None`. ~~Optional[Language]~~ |
|
||||
| `labels` | The label information to add to the component, as provided by the [`label_data`](#label_data) property after initialization. To generate a reusable JSON file from your data, you should run the [`init labels`](/api/cli#init-labels) command. If no labels are provided, the `get_examples` callback is used to extract the labels from the data, which may be a lot slower. ~~Optional[Dict[str, Dict[str, int]]]~~ |
|
||||
|
||||
## DependencyParser.predict {#predict tag="method"}
|
||||
|
||||
|
@ -433,6 +433,24 @@ The labels currently added to the component.
|
|||
| ----------- | ------------------------------------------------------ |
|
||||
| **RETURNS** | The labels added to the component. ~~Tuple[str, ...]~~ |
|
||||
|
||||
## DependencyParser.label_data {#label_data tag="property" new="3"}
|
||||
|
||||
The labels currently added to the component and their internal meta information.
|
||||
This is the data generated by [`init labels`](/api/cli#init-labels) and used by
|
||||
[`DependencyParser.initialize`](/api/dependencyparser#initialize) to initialize
|
||||
the model with a pre-defined label set.
|
||||
|
||||
> #### Example
|
||||
>
|
||||
> ```python
|
||||
> labels = parser.label_data
|
||||
> parser.initialize(lambda: [], nlp=nlp, labels=labels)
|
||||
> ```
|
||||
|
||||
| Name | Description |
|
||||
| ----------- | ------------------------------------------------------------------------------- |
|
||||
| **RETURNS** | The label data added to the component. ~~Dict[str, Dict[str, Dict[str, int]]]~~ |
|
||||
|
||||
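A hedged sketch of reusing `label_data` outside the `init labels` CLI: dump the property to JSON so the `[initialize.components.parser.labels]` reader shown above can load it on the next run. This assumes an installed trained pipeline such as `en_core_web_sm`; the output path is an example.

```python
import spacy
import srsly

nlp = spacy.load("en_core_web_sm")  # assumes this pipeline package is installed
parser = nlp.get_pipe("parser")
srsly.write_json("corpus/labels/parser.json", parser.label_data)
```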
## Serialization fields {#serialization-fields}
|
||||
|
||||
During serialization, spaCy will export several data fields used to restore
|
||||
|
|
|
@ -165,12 +165,12 @@ This method was previously called `begin_training`.
|
|||
> path = "corpus/labels/ner.json
|
||||
> ```
|
||||
|
||||
| Name | Description |
|
||||
| -------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
|
||||
| `get_examples` | Function that returns gold-standard annotations in the form of [`Example`](/api/example) objects. ~~Callable[[], Iterable[Example]]~~ |
|
||||
| _keyword-only_ | |
|
||||
| `nlp` | The current `nlp` object. Defaults to `None`. ~~Optional[Language]~~ |
|
||||
| `labels` | The label information to add to the component. To generate a reusable JSON file from your data, you should run the [`init labels`](/api/cli#init-labels) command. If no labels are provided, the `get_examples` callback is used to extract the labels from the data, which may be a lot slower. ~~Optional[dict]~~ |
|
||||
| Name | Description |
|
||||
| -------------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
|
||||
| `get_examples` | Function that returns gold-standard annotations in the form of [`Example`](/api/example) objects. ~~Callable[[], Iterable[Example]]~~ |
|
||||
| _keyword-only_ | |
|
||||
| `nlp` | The current `nlp` object. Defaults to `None`. ~~Optional[Language]~~ |
|
||||
| `labels` | The label information to add to the component, as provided by the [`label_data`](#label_data) property after initialization. To generate a reusable JSON file from your data, you should run the [`init labels`](/api/cli#init-labels) command. If no labels are provided, the `get_examples` callback is used to extract the labels from the data, which may be a lot slower. ~~Optional[Dict[str, Dict[str, int]]]~~ |
|
||||
|
||||
## EntityRecognizer.predict {#predict tag="method"}
|
||||
|
||||
|
@ -421,6 +421,24 @@ The labels currently added to the component.
|
|||
| ----------- | ------------------------------------------------------ |
|
||||
| **RETURNS** | The labels added to the component. ~~Tuple[str, ...]~~ |
|
||||
|
||||
## EntityRecognizer.label_data {#label_data tag="property" new="3"}
|
||||
|
||||
The labels currently added to the component and their internal meta information.
|
||||
This is the data generated by [`init labels`](/api/cli#init-labels) and used by
|
||||
[`EntityRecognizer.initialize`](/api/entityrecognizer#initialize) to initialize
|
||||
the model with a pre-defined label set.
|
||||
|
||||
> #### Example
|
||||
>
|
||||
> ```python
|
||||
> labels = ner.label_data
|
||||
> ner.initialize(lambda: [], nlp=nlp, labels=labels)
|
||||
> ```
|
||||
|
||||
| Name | Description |
|
||||
| ----------- | ------------------------------------------------------------------------------- |
|
||||
| **RETURNS** | The label data added to the component. ~~Dict[str, Dict[str, Dict[str, int]]]~~ |
|
||||
|
||||
## Serialization fields {#serialization-fields}
|
||||
|
||||
During serialization, spaCy will export several data fields used to restore
|
||||
|
|
|
@ -190,23 +190,10 @@ lemmatization entirely.
|
|||
Returns the lookups configuration settings for a given mode for use in
|
||||
[`Lemmatizer.load_lookups`](/api/lemmatizer#load_lookups).
|
||||
|
||||
| Name | Description |
|
||||
| ----------- | --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
|
||||
| `mode` | The lemmatizer mode. ~~str~~ |
|
||||
| **RETURNS** | The lookups configuration settings for this mode. Includes the keys `"required_tables"` and `"optional_tables"`, mapped to a list of table string names. ~~Dict[str, List[str]]~~ |
|
||||
|
||||
## Lemmatizer.load_lookups {#load_lookups tag="classmethod"}
|
||||
|
||||
Load and validate lookups tables. If the provided lookups is `None`, load the
|
||||
default lookups tables according to the language and mode settings. Confirm that
|
||||
all required tables for the language and mode are present.
|
||||
|
||||
| Name | Description |
|
||||
| ----------- | -------------------------------------------------------------------------------------------------- |
|
||||
| `lang` | The language. ~~str~~ |
|
||||
| `mode` | The lemmatizer mode. ~~str~~ |
|
||||
| `lookups` | The provided lookups, may be `None` if the default lookups should be loaded. ~~Optional[Lookups]~~ |
|
||||
| **RETURNS** | The lookups. ~~Lookups~~ |
|
||||
| Name | Description |
|
||||
| ----------- | -------------------------------------------------------------------------------------- |
|
||||
| `mode` | The lemmatizer mode. ~~str~~ |
|
||||
| **RETURNS** | The required table names and the optional table names. ~~Tuple[List[str], List[str]]~~ |
|
||||
|
||||
## Lemmatizer.to_disk {#to_disk tag="method"}
|
||||
|
||||
|
|
|
@ -147,12 +147,12 @@ config.
|
|||
> path = "corpus/labels/morphologizer.json
|
||||
> ```
|
||||
|
||||
| Name | Description |
|
||||
| -------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
|
||||
| `get_examples` | Function that returns gold-standard annotations in the form of [`Example`](/api/example) objects. ~~Callable[[], Iterable[Example]]~~ |
|
||||
| _keyword-only_ | |
|
||||
| `nlp` | The current `nlp` object. Defaults to `None`. ~~Optional[Language]~~ |
|
||||
| `labels` | The label information to add to the component. To generate a reusable JSON file from your data, you should run the [`init labels`](/api/cli#init-labels) command. If no labels are provided, the `get_examples` callback is used to extract the labels from the data, which may be a lot slower. ~~Optional[dict]~~ |
|
||||
| Name | Description |
|
||||
| -------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
|
||||
| `get_examples` | Function that returns gold-standard annotations in the form of [`Example`](/api/example) objects. ~~Callable[[], Iterable[Example]]~~ |
|
||||
| _keyword-only_ | |
|
||||
| `nlp` | The current `nlp` object. Defaults to `None`. ~~Optional[Language]~~ |
|
||||
| `labels` | The label information to add to the component, as provided by the [`label_data`](#label_data) property after initialization. To generate a reusable JSON file from your data, you should run the [`init labels`](/api/cli#init-labels) command. If no labels are provided, the `get_examples` callback is used to extract the labels from the data, which may be a lot slower. ~~Optional[dict]~~ |
|
||||
|
||||
## Morphologizer.predict {#predict tag="method"}
|
||||
|
||||
|
@ -377,6 +377,24 @@ coarse-grained POS as the feature `POS`.
|
|||
| ----------- | ------------------------------------------------------ |
|
||||
| **RETURNS** | The labels added to the component. ~~Tuple[str, ...]~~ |
|
||||
|
||||
## Morphologizer.label_data {#label_data tag="property" new="3"}
|
||||
|
||||
The labels currently added to the component and their internal meta information.
|
||||
This is the data generated by [`init labels`](/api/cli#init-labels) and used by
|
||||
[`Morphologizer.initialize`](/api/morphologizer#initialize) to initialize the
|
||||
model with a pre-defined label set.
|
||||
|
||||
> #### Example
|
||||
>
|
||||
> ```python
|
||||
> labels = morphologizer.label_data
|
||||
> morphologizer.initialize(lambda: [], nlp=nlp, labels=labels)
|
||||
> ```
|
||||
|
||||
| Name | Description |
|
||||
| ----------- | ----------------------------------------------- |
|
||||
| **RETURNS** | The label data added to the component. ~~dict~~ |
|
||||
|
||||
## Serialization fields {#serialization-fields}
|
||||
|
||||
During serialization, spaCy will export several data fields used to restore
|
||||
|
|
|
@ -148,12 +148,12 @@ This method was previously called `begin_training`.
|
|||
> path = "corpus/labels/tagger.json
|
||||
> ```
|
||||
|
||||
| Name | Description |
|
||||
| -------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
|
||||
| `get_examples` | Function that returns gold-standard annotations in the form of [`Example`](/api/example) objects. ~~Callable[[], Iterable[Example]]~~ |
|
||||
| _keyword-only_ | |
|
||||
| `nlp` | The current `nlp` object. Defaults to `None`. ~~Optional[Language]~~ |
|
||||
| `labels` | The label information to add to the component. To generate a reusable JSON file from your data, you should run the [`init labels`](/api/cli#init-labels) command. If no labels are provided, the `get_examples` callback is used to extract the labels from the data, which may be a lot slower. ~~Optional[list]~~ |
|
||||
| Name | Description |
|
||||
| -------------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
|
||||
| `get_examples` | Function that returns gold-standard annotations in the form of [`Example`](/api/example) objects. ~~Callable[[], Iterable[Example]]~~ |
|
||||
| _keyword-only_ | |
|
||||
| `nlp` | The current `nlp` object. Defaults to `None`. ~~Optional[Language]~~ |
|
||||
| `labels` | The label information to add to the component, as provided by the [`label_data`](#label_data) property after initialization. To generate a reusable JSON file from your data, you should run the [`init labels`](/api/cli#init-labels) command. If no labels are provided, the `get_examples` callback is used to extract the labels from the data, which may be a lot slower. ~~Optional[Iterable[str]]~~ |
|
||||
|
||||
## Tagger.predict {#predict tag="method"}
|
||||
|
||||
|
@ -411,6 +411,24 @@ The labels currently added to the component.
|
|||
| ----------- | ------------------------------------------------------ |
|
||||
| **RETURNS** | The labels added to the component. ~~Tuple[str, ...]~~ |
|
||||
|
||||
## Tagger.label_data {#label_data tag="property" new="3"}
|
||||
|
||||
The labels currently added to the component and their internal meta information.
|
||||
This is the data generated by [`init labels`](/api/cli#init-labels) and used by
|
||||
[`Tagger.initialize`](/api/tagger#initialize) to initialize the model with a
|
||||
pre-defined label set.
|
||||
|
||||
> #### Example
|
||||
>
|
||||
> ```python
|
||||
> labels = tagger.label_data
|
||||
> tagger.initialize(lambda: [], nlp=nlp, labels=labels)
|
||||
> ```
|
||||
|
||||
| Name | Description |
|
||||
| ----------- | ---------------------------------------------------------- |
|
||||
| **RETURNS** | The label data added to the component. ~~Tuple[str, ...]~~ |
|
||||
|
||||
## Serialization fields {#serialization-fields}
|
||||
|
||||
During serialization, spaCy will export several data fields used to restore
|
||||
|
|
|
@ -29,19 +29,16 @@ architectures and their arguments and hyperparameters.
|
|||
> ```python
|
||||
> from spacy.pipeline.textcat import DEFAULT_TEXTCAT_MODEL
|
||||
> config = {
|
||||
> "labels": [],
|
||||
> "threshold": 0.5,
|
||||
> "model": DEFAULT_TEXTCAT_MODEL,
|
||||
> }
|
||||
> nlp.add_pipe("textcat", config=config)
|
||||
> ```
|
||||
|
||||
| Setting | Description |
|
||||
| ---------------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------------- |
|
||||
| `labels` | A list of categories to learn. If empty, the model infers the categories from the data. Defaults to `[]`. ~~Iterable[str]~~ |
|
||||
| `threshold` | Cutoff to consider a prediction "positive", relevant when printing accuracy results. ~~float~~ |
|
||||
| `positive_label` | The positive label for a binary task with exclusive classes, None otherwise and by default. ~~Optional[str]~~ |
|
||||
| `model` | A model instance that predicts scores for each category. Defaults to [TextCatEnsemble](/api/architectures#TextCatEnsemble). ~~Model[List[Doc], List[Floats2d]]~~ |
|
||||
| Setting | Description |
|
||||
| ----------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------------- |
|
||||
| `threshold` | Cutoff to consider a prediction "positive", relevant when printing accuracy results. ~~float~~ |
|
||||
| `model` | A model instance that predicts scores for each category. Defaults to [TextCatEnsemble](/api/architectures#TextCatEnsemble). ~~Model[List[Doc], List[Floats2d]]~~ |
|
||||
|
||||
```python
|
||||
%%GITHUB_SPACY/spacy/pipeline/textcat.py
|
||||
|
@ -61,22 +58,20 @@ architectures and their arguments and hyperparameters.
|
|||
>
|
||||
> # Construction from class
|
||||
> from spacy.pipeline import TextCategorizer
|
||||
> textcat = TextCategorizer(nlp.vocab, model, labels=[], threshold=0.5, positive_label="POS")
|
||||
> textcat = TextCategorizer(nlp.vocab, model, threshold=0.5)
|
||||
> ```
|
||||
|
||||
Create a new pipeline instance. In your application, you would normally use a
|
||||
shortcut for this and instantiate the component using its string name and
|
||||
[`nlp.add_pipe`](/api/language#create_pipe).
|
||||
|
||||
| Name | Description |
|
||||
| ---------------- | -------------------------------------------------------------------------------------------------------------------------- |
|
||||
| `vocab` | The shared vocabulary. ~~Vocab~~ |
|
||||
| `model` | The Thinc [`Model`](https://thinc.ai/docs/api-model) powering the pipeline component. ~~Model[List[Doc], List[Floats2d]]~~ |
|
||||
| `name` | String name of the component instance. Used to add entries to the `losses` during training. ~~str~~ |
|
||||
| _keyword-only_ | |
|
||||
| `labels` | The labels to use. ~~Iterable[str]~~ |
|
||||
| `threshold` | Cutoff to consider a prediction "positive", relevant when printing accuracy results. ~~float~~ |
|
||||
| `positive_label` | The positive label for a binary task with exclusive classes, None otherwise. ~~Optional[str]~~ |
|
||||
| Name | Description |
|
||||
| -------------- | -------------------------------------------------------------------------------------------------------------------------- |
|
||||
| `vocab` | The shared vocabulary. ~~Vocab~~ |
|
||||
| `model` | The Thinc [`Model`](https://thinc.ai/docs/api-model) powering the pipeline component. ~~Model[List[Doc], List[Floats2d]]~~ |
|
||||
| `name` | String name of the component instance. Used to add entries to the `losses` during training. ~~str~~ |
|
||||
| _keyword-only_ | |
|
||||
| `threshold` | Cutoff to consider a prediction "positive", relevant when printing accuracy results. ~~float~~ |
|
||||
|
||||
## TextCategorizer.\_\_call\_\_ {#call tag="method"}
|
||||
|
||||
|
@ -155,18 +150,20 @@ This method was previously called `begin_training`.
|
|||
> ```ini
|
||||
> ### config.cfg
|
||||
> [initialize.components.textcat]
|
||||
> positive_label = "POS"
|
||||
>
|
||||
> [initialize.components.textcat.labels]
|
||||
> @readers = "spacy.read_labels.v1"
|
||||
> path = "corpus/labels/textcat.json
|
||||
> ```
|
||||
|
||||

| Name           | Description |
| -------------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `get_examples` | Function that returns gold-standard annotations in the form of [`Example`](/api/example) objects. ~~Callable[[], Iterable[Example]]~~ |
| _keyword-only_ | |
| `nlp`          | The current `nlp` object. Defaults to `None`. ~~Optional[Language]~~ |
| `labels`       | The label information to add to the component. To generate a reusable JSON file from your data, you should run the [`init labels`](/api/cli#init-labels) command. If no labels are provided, the `get_examples` callback is used to extract the labels from the data, which may be a lot slower. ~~Optional[dict]~~ |

| Name             | Description |
| ---------------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `get_examples`   | Function that returns gold-standard annotations in the form of [`Example`](/api/example) objects. ~~Callable[[], Iterable[Example]]~~ |
| _keyword-only_   | |
| `nlp`            | The current `nlp` object. Defaults to `None`. ~~Optional[Language]~~ |
| `labels`         | The label information to add to the component, as provided by the [`label_data`](#label_data) property after initialization. To generate a reusable JSON file from your data, you should run the [`init labels`](/api/cli#init-labels) command. If no labels are provided, the `get_examples` callback is used to extract the labels from the data, which may be a lot slower. ~~Optional[Iterable[str]]~~ |
| `positive_label` | The positive label for a binary task with exclusive classes, None otherwise and by default. ~~Optional[str]~~ |
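
A rough sketch of calling `initialize` directly with these arguments, assuming
an `nlp` object and some training examples already exist and using placeholder
labels:

```python
# Sketch only: `train_examples` is assumed to be a list of Example objects
# and the labels are placeholders.
textcat = nlp.get_pipe("textcat")
textcat.initialize(
    lambda: train_examples,
    nlp=nlp,
    labels=["POS", "NEG"],
    positive_label="POS",
)
```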

## TextCategorizer.predict {#predict tag="method"}

@@ -425,6 +422,24 @@ The labels currently added to the component.

| ----------- | ------------------------------------------------------ |
| **RETURNS** | The labels added to the component. ~~Tuple[str, ...]~~ |

## TextCategorizer.label_data {#label_data tag="property" new="3"}

The labels currently added to the component and their internal meta
information. This is the data generated by [`init labels`](/api/cli#init-labels)
and used by [`TextCategorizer.initialize`](/api/textcategorizer#initialize) to
initialize the model with a pre-defined label set.

> #### Example
>
> ```python
> labels = textcat.label_data
> textcat.initialize(lambda: [], nlp=nlp, labels=labels)
> ```

| Name        | Description |
| ----------- | ----------------------------------------------------------- |
| **RETURNS** | The label data added to the component. ~~Tuple[str, ...]~~ |

## Serialization fields {#serialization-fields}

During serialization, spaCy will export several data fields used to restore

@@ -689,7 +689,8 @@ Data augmentation is the process of applying small modifications to the training

data. It can be especially useful for punctuation and case replacement – for
example, if your corpus only uses smart quotes and you want to include
variations using regular quotes, or to make the model less sensitive to
capitalization by including a mix of capitalized and lowercase examples. See the [usage guide](/usage/training#data-augmentation) for details and examples.
capitalization by including a mix of capitalized and lowercase examples. See the
[usage guide](/usage/training#data-augmentation) for details and examples.

### spacy.orth_variants.v1 {#orth_variants tag="registered function"}

@@ -707,7 +708,7 @@ capitalization by including a mix of capitalized and lowercase examples. See the

> ```

Create a data augmentation callback that uses orth-variant replacement. The
callback can be added to a corpus or other data iterator during training. This
callback can be added to a corpus or other data iterator during training. It
is especially useful for punctuation and case replacement, to help generalize
beyond corpora that don't have smart quotes, or only have smart quotes, etc.

@@ -718,6 +719,25 @@ beyond corpora that don't have smart quotes, or only have smart quotes etc.

| `orth_variants` | A dictionary containing the single and paired orth variants. Typically loaded from a JSON file. See [`en_orth_variants.json`](https://github.com/explosion/spacy-lookups-data/blob/master/spacy_lookups_data/data/en_orth_variants.json) for an example. ~~Dict[str, Dict[List[Union[str, List[str]]]]]~~ |
| **CREATES**     | A function that takes the current `nlp` object and an [`Example`](/api/example) and yields augmented `Example` objects. ~~Callable[[Language, Example], Iterator[Example]]~~ |

### spacy.lower_case.v1 {#lower_case tag="registered function"}

> #### Example config
>
> ```ini
> [corpora.train.augmenter]
> @augmenters = "spacy.lower_case.v1"
> level = 0.3
> ```

Create a data augmentation callback that lowercases documents. The callback can
be added to a corpus or other data iterator during training. It's especially
useful for making the model less sensitive to capitalization.

| Name        | ----------- |
| ----------- | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `level`     | The percentage of texts that will be augmented. ~~float~~ |
| **CREATES** | A function that takes the current `nlp` object and an [`Example`](/api/example) and yields augmented `Example` objects. ~~Callable[[Language, Example], Iterator[Example]]~~ |
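
Outside of the config, a registered augmenter can also be resolved from the
registry and passed to a [`Corpus`](/api/corpus) directly. A minimal sketch,
assuming the augmenter is available via `spacy.registry.augmenters` and using a
placeholder path for the training data:

```python
import spacy
from spacy.training import Corpus

# Resolve the registered augmenter factory, create the callback with the
# documented "level" setting, and attach it to a corpus.
make_augmenter = spacy.registry.augmenters.get("spacy.lower_case.v1")
augmenter = make_augmenter(level=0.3)
corpus = Corpus("./corpus/train.spacy", augmenter=augmenter)
```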

## Training data and alignment {#gold source="spacy/training"}

### training.offsets_to_biluo_tags {#offsets_to_biluo_tags tag="function"}

@@ -827,10 +847,10 @@ utilities.

### util.get_lang_class {#util.get_lang_class tag="function"}

Import and load a `Language` class. Allows lazy-loading
[language data](/usage/linguistic-features#language-data) and importing languages using the
two-letter language code. To add a language code for a custom language class,
you can register it using the [`@registry.languages`](/api/top-level#registry)
decorator.
[language data](/usage/linguistic-features#language-data) and importing
languages using the two-letter language code. To add a language code for a
custom language class, you can register it using the
[`@registry.languages`](/api/top-level#registry) decorator.

> #### Example
>
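
For reference, a minimal usage sketch of `util.get_lang_class`:

```python
from spacy.util import get_lang_class

# Look up the Language subclass registered for the "en" language code
# and create a blank pipeline from it.
lang_cls = get_lang_class("en")
nlp = lang_cls()
```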

@@ -1801,17 +1801,7 @@ print(doc2[5].tag_, doc2[5].pos_) # WP PRON

<Infobox variant="warning" title="Migrating from spaCy v2.x">

For easy migration from spaCy v2 to v3, the
[`AttributeRuler`](/api/attributeruler) can import a **tag map and morph rules**
in the v2 format with the methods
[`load_from_tag_map`](/api/attributeruler#load_from_tag_map) and
[`load_from_morph_rules`](/api/attributeruler#load_from_morph_rules).

```diff
nlp = spacy.blank("en")
+ ruler = nlp.add_pipe("attribute_ruler")
+ ruler.load_from_tag_map(YOUR_TAG_MAP)
```

The [`AttributeRuler`](/api/attributeruler) can import a **tag map and morph
rules** in the v2.x format via its built-in methods or when the component is
initialized before training. See the
[migration guide](/usage/v3#migrating-training-mappings-exceptions) for details.

</Infobox>

@@ -8,6 +8,7 @@ menu:

  - ['Config System', 'config']
  - ['Custom Training', 'config-custom']
  - ['Custom Functions', 'custom-functions']
  - ['Initialization', 'initialization']
  - ['Data Utilities', 'data']
  - ['Parallel Training', 'parallel-training']
  - ['Internal API', 'api']

@@ -689,17 +690,17 @@ During training, the results of each step are passed to a logger function. By

default, these results are written to the console with the
[`ConsoleLogger`](/api/top-level#ConsoleLogger). There is also built-in support
for writing the log files to [Weights & Biases](https://www.wandb.com/) with the
[`WandbLogger`](/api/top-level#WandbLogger). The logger function receives a
**dictionary** with the following keys:
[`WandbLogger`](/api/top-level#WandbLogger). On each step, the logger function
receives a **dictionary** with the following keys:

| Key            | Value |
| -------------- | ---------------------------------------------------------------------------------------------- |
| `epoch`        | How many passes over the data have been completed. ~~int~~ |
| `step`         | How many steps have been completed. ~~int~~ |
| `score`        | The main score from the last evaluation, measured on the dev set. ~~float~~ |
| `other_scores` | The other scores from the last evaluation, measured on the dev set. ~~Dict[str, Any]~~ |
| `losses`       | The accumulated training losses, keyed by component name. ~~Dict[str, float]~~ |
| `checkpoints`  | A list of previous results, where each result is a (score, step, epoch) tuple. ~~List[Tuple]~~ |

| Key            | Value |
| -------------- | ------------------------------------------------------------------------------------------------------ |
| `epoch`        | How many passes over the data have been completed. ~~int~~ |
| `step`         | How many steps have been completed. ~~int~~ |
| `score`        | The main score from the last evaluation, measured on the dev set. ~~float~~ |
| `other_scores` | The other scores from the last evaluation, measured on the dev set. ~~Dict[str, Any]~~ |
| `losses`       | The accumulated training losses, keyed by component name. ~~Dict[str, float]~~ |
| `checkpoints`  | A list of previous results, where each result is a `(score, step)` tuple. ~~List[Tuple[float, int]]~~ |

You can easily implement and plug in your own logger that records the training
results in a custom way, or sends them to an experiment management tracker of

@@ -715,30 +716,37 @@ tabular results to a file:

```python
### functions.py
from typing import Tuple, Callable, Dict, Any
import sys
from typing import IO, Tuple, Callable, Dict, Any, Optional
import spacy
from spacy import Language
from pathlib import Path


@spacy.registry.loggers("my_custom_logger.v1")
def custom_logger(log_path):
    def setup_logger(nlp: "Language") -> Tuple[Callable, Callable]:
        with Path(log_path).open("w", encoding="utf8") as file_:
            file_.write("step\t")
            file_.write("score\t")
            for pipe in nlp.pipe_names:
                file_.write(f"loss_{pipe}\t")
            file_.write("\n")
    def setup_logger(
        nlp: Language,
        stdout: IO = sys.stdout,
        stderr: IO = sys.stderr
    ) -> Tuple[Callable, Callable]:
        stdout.write(f"Logging to {log_path}\n")
        log_file = Path(log_path).open("w", encoding="utf8")
        log_file.write("step\t")
        log_file.write("score\t")
        for pipe in nlp.pipe_names:
            log_file.write(f"loss_{pipe}\t")
        log_file.write("\n")

        def log_step(info: Dict[str, Any]):
            with Path(log_path).open("a") as file_:
                file_.write(f"{info['step']}\t")
                file_.write(f"{info['score']}\t")
        def log_step(info: Optional[Dict[str, Any]]):
            if info:
                log_file.write(f"{info['step']}\t")
                log_file.write(f"{info['score']}\t")
                for pipe in nlp.pipe_names:
                    file_.write(f"{info['losses'][pipe]}\t")
                file_.write("\n")
                    log_file.write(f"{info['losses'][pipe]}\t")
                log_file.write("\n")

        def finalize():
            pass
            log_file.close()

        return log_step, finalize
```

@@ -817,9 +825,101 @@ def MyModel(output_width: int) -> Model[List[Doc], List[Floats2d]]:

    return create_model(output_width)
```

### Customizing the initialization {#initialization}
## Customizing the initialization {#initialization}

When you start training a new model from scratch,
[`spacy train`](/api/cli#train) will call
[`nlp.initialize`](/api/language#initialize) to initialize the pipeline and
load the required data. All settings for this are defined in the
[`[initialize]`](/api/data-formats#config-initialize) block of the config, so
you can keep track of how the initial `nlp` object was created. The
initialization process typically includes the following:

> #### config.cfg (excerpt)
>
> ```ini
> [initialize]
> vectors = ${paths.vectors}
> init_tok2vec = ${paths.init_tok2vec}
>
> [initialize.components]
> # Settings for components
> ```

1. Load in **data resources** defined in the `[initialize]` config, including
   **word vectors** and
   [pretrained](/usage/embeddings-transformers/#pretraining) **tok2vec
   weights**.
2. Call the `initialize` methods of the tokenizer (if implemented, e.g. for
   [Chinese](/usage/models#chinese)) and pipeline components with a callback to
   access the training data, the current `nlp` object and any **custom
   arguments** defined in the `[initialize]` config.
3. In **pipeline components**: if needed, use the data to
   [infer missing shapes](/usage/layers-architectures#thinc-shape-inference)
   and set up the label scheme if no labels are provided. Components may also
   load other data like lookup tables or dictionaries.
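
A minimal sketch of triggering this initialization outside of `spacy train`,
with a single placeholder example standing in for a real training corpus:

```python
import spacy
from spacy.training import Example

# Sketch: initialize a fresh pipeline directly. The NER component reads
# its label scheme ("ORG" here) off the provided example data (step 3).
nlp = spacy.blank("en")
nlp.add_pipe("ner")
doc = nlp.make_doc("Apple is looking at buying a U.K. startup.")
example = Example.from_dict(doc, {"entities": [(0, 5, "ORG")]})
nlp.initialize(get_examples=lambda: [example])
```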

The initialization step allows the config to define **all settings** required
for the pipeline, while keeping a separation between settings and functions
that should only be used **before training** to set up the initial pipeline,
and logic and configuration that needs to be available **at runtime**. Without
that separation, it would be very difficult to use the same, reproducible
config file, because the component settings required for training (load data
from an external file) wouldn't match the component settings required at
runtime (load what's included with the saved `nlp` object and don't depend on
an external file).


<Infobox title="How components save and load data" emoji="📖">
|
||||
|
||||
For details and examples of how pipeline components can **save and load data
|
||||
assets** like model weights or lookup tables, and how the component
|
||||
initialization is implemented under the hood, see the usage guide on
|
||||
[serializing and initializing component data](/usage/processing-pipelines#component-data-initialization).
|
||||
|
||||
</Infobox>
|
||||
|
||||

#### Initializing labels {#initialization-labels}

Built-in pipeline components like the
[`EntityRecognizer`](/api/entityrecognizer) or
[`DependencyParser`](/api/dependencyparser) need to know their available labels
and associated internal meta information to initialize their model weights.
Using the `get_examples` callback provided on initialization, they're able to
**read the labels off the training data** automatically, which is very
convenient – but it can also slow down the training process to compute this
information on every run.

The [`init labels`](/api/cli#init-labels) command lets you auto-generate JSON
files containing the label data for all supported components. You can then pass
in the labels in the `[initialize]` settings for the respective components to
allow them to initialize faster.

> #### config.cfg
>
> ```ini
> [initialize.components.ner]
>
> [initialize.components.ner.labels]
> @readers = "spacy.read_labels.v1"
> path = "corpus/labels/ner.json"
> ```

```cli
$ python -m spacy init labels config.cfg ./corpus --paths.train ./corpus/train.spacy
```

Under the hood, the command delegates to the `label_data` property of the
pipeline components, for instance
[`EntityRecognizer.label_data`](/api/entityrecognizer#label_data).
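
Conceptually, that corresponds to something like the following sketch. The
paths and the use of a trained pipeline are placeholders, not the actual
implementation of the command:

```python
import spacy
import srsly

# Sketch: export a component's label_data to JSON so it can be passed
# back in via [initialize.components.ner.labels].
nlp = spacy.load("en_core_web_sm")
srsly.write_json("corpus/labels/ner.json", nlp.get_pipe("ner").label_data)
```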

<Infobox variant="warning" title="Important note">

The JSON format differs for each component and some components need additional
meta information about their labels. The format exported by
[`init labels`](/api/cli#init-labels) matches what the components need, so you
should always let spaCy **auto-generate the labels** for you.

<Infobox title="This section is still under construction" emoji="🚧" variant="warning">

</Infobox>

## Data utilities {#data}

@@ -1298,8 +1398,8 @@ of being dropped.

> - [`nlp`](/api/language): The `nlp` object with the pipeline components and
>   their models.
> - [`nlp.initialize`](/api/language#initialize): Start the training and return
>   an optimizer to update the component model weights.
> - [`nlp.initialize`](/api/language#initialize): Initialize the pipeline and
>   return an optimizer to update the component model weights.
> - [`Optimizer`](https://thinc.ai/docs/api-optimizers): Function that holds
>   state between updates.
> - [`nlp.update`](/api/language#update): Update component models with examples.
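
Putting these pieces together, a minimal sketch of such a loop with toy data
(the single example and iteration count are placeholders):

```python
import random
import spacy
from spacy.training import Example

# Toy training loop: initialize the pipeline, then update it in a loop.
nlp = spacy.blank("en")
nlp.add_pipe("textcat")
doc = nlp.make_doc("This is great!")
train_examples = [Example.from_dict(doc, {"cats": {"POS": 1.0, "NEG": 0.0}})]
optimizer = nlp.initialize(lambda: train_examples)
for i in range(10):
    random.shuffle(train_examples)
    losses = {}
    nlp.update(train_examples, sgd=optimizer, losses=losses)
    print(i, losses)
```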

@@ -804,8 +804,30 @@ nlp = spacy.blank("en")

Instead of defining a `tag_map` and `morph_rules` in the language data, spaCy
v3.0 now manages mappings and exceptions with a separate and more flexible
pipeline component, the [`AttributeRuler`](/api/attributeruler). See the
[usage guide](/usage/linguistic-features#mappings-exceptions) for examples. The
`AttributeRuler` provides two handy helper methods
[usage guide](/usage/linguistic-features#mappings-exceptions) for examples. If
you have tag maps and morph rules in the v2.x format, you can load them into
the attribute ruler before training using the `[initialize]` block of your
config.

> #### What does the initialization do?
>
> The `[initialize]` block is used when
> [`nlp.initialize`](/api/language#initialize) is called (usually right before
> training). It lets you define data resources for initializing the pipeline in
> your `config.cfg`. After training, the rules are saved to disk with the
> exported pipeline, so your runtime model doesn't depend on local data. For
> details see the [config lifecycle](/usage/training/#config-lifecycle) and
> [initialization](/usage/training/#initialization) docs.

```ini
### config.cfg (excerpt)
[initialize.components.attribute_ruler]

[initialize.components.attribute_ruler.tag_map]
@readers = "srsly.read_json.v1"
path = "./corpus/tag_map.json"
```

The `AttributeRuler` also provides two handy helper methods
[`load_from_tag_map`](/api/attributeruler#load_from_tag_map) and
[`load_from_morph_rules`](/api/attributeruler#load_from_morph_rules) that let
you load in your existing tag map or morph rules:
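
A minimal sketch of that workflow, with a placeholder tag map standing in for
your existing v2.x data:

```python
import spacy

# Placeholder v2.x-style tag map: fine-grained tags mapped to attribute dicts
YOUR_TAG_MAP = {"NN": {"pos": "NOUN"}, "VB": {"pos": "VERB"}}

nlp = spacy.blank("en")
ruler = nlp.add_pipe("attribute_ruler")
ruler.load_from_tag_map(YOUR_TAG_MAP)
```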