Mirror of https://github.com/explosion/spaCy.git (synced 2024-11-11 20:28:20 +03:00)

Commit 3246cf8b2b: Merge branch 'master' into spacy.io
.github/contributors/peter-exos.md (new file, 106 lines, vendored)
@@ -0,0 +1,106 @@
# spaCy contributor agreement

This spaCy Contributor Agreement (**"SCA"**) is based on the
[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf).
The SCA applies to any contribution that you make to any product or project
managed by us (the **"project"**), and sets out the intellectual property rights
you grant to us in the contributed materials. The term **"us"** shall mean
[ExplosionAI GmbH](https://explosion.ai/legal). The term
**"you"** shall mean the person or entity identified below.

If you agree to be bound by these terms, fill in the information requested
below and include the filled-in version with your first pull request, under the
folder [`.github/contributors/`](/.github/contributors/). The name of the file
should be your GitHub username, with the extension `.md`. For example, the user
example_user would create the file `.github/contributors/example_user.md`.

Read this agreement carefully before signing. These terms and conditions
constitute a binding legal agreement.

## Contributor Agreement

1. The term "contribution" or "contributed materials" means any source code,
object code, patch, tool, sample, graphic, specification, manual,
documentation, or any other material posted or submitted by you to the project.

2. With respect to any worldwide copyrights, or copyright applications and
registrations, in your contribution:

    * you hereby assign to us joint ownership, and to the extent that such
    assignment is or becomes invalid, ineffective or unenforceable, you hereby
    grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
    royalty-free, unrestricted license to exercise all rights under those
    copyrights. This includes, at our option, the right to sublicense these same
    rights to third parties through multiple levels of sublicensees or other
    licensing arrangements;

    * you agree that each of us can do all things in relation to your
    contribution as if each of us were the sole owners, and if one of us makes
    a derivative work of your contribution, the one who makes the derivative
    work (or has it made) will be the sole owner of that derivative work;

    * you agree that you will not assert any moral rights in your contribution
    against us, our licensees or transferees;

    * you agree that we may register a copyright in your contribution and
    exercise all ownership rights associated with it; and

    * you agree that neither of us has any duty to consult with, obtain the
    consent of, pay or render an accounting to the other for any use or
    distribution of your contribution.

3. With respect to any patents you own, or that you can license without payment
to any third party, you hereby grant to us a perpetual, irrevocable,
non-exclusive, worldwide, no-charge, royalty-free license to:

    * make, have made, use, sell, offer to sell, import, and otherwise transfer
    your contribution in whole or in part, alone or in combination with or
    included in any product, work or materials arising out of the project to
    which your contribution was submitted, and

    * at our option, to sublicense these same rights to third parties through
    multiple levels of sublicensees or other licensing arrangements.

4. Except as set out above, you keep all right, title, and interest in your
contribution. The rights that you grant to us under these terms are effective
on the date you first submitted a contribution to us, even if your submission
took place before the date you sign these terms.

5. You covenant, represent, warrant and agree that:

    * each contribution that you submit is and shall be an original work of
    authorship and you can legally grant the rights set out in this SCA;

    * to the best of your knowledge, each contribution will not violate any
    third party's copyrights, trademarks, patents, or other intellectual
    property rights; and

    * each contribution shall be in compliance with U.S. export control laws and
    other applicable export and import laws. You agree to notify us if you
    become aware of any circumstance which would make any of the foregoing
    representations inaccurate in any respect. We may publicly disclose your
    participation in the project, including the fact that you have signed the SCA.

6. This SCA is governed by the laws of the State of California and applicable
U.S. Federal law. Any choice of law rules will not apply.

7. Please place an “x” next to one of the applicable statements below. Please do NOT
mark both statements:

    * [ ] I am signing on behalf of myself as an individual and no other person
    or entity, including my employer, has or will have rights with respect to my
    contributions.

    * [x] I am signing on behalf of my employer or a legal entity and I have the
    actual authority to contractually bind that entity.

## Contributor Details

| Field                          | Entry          |
| ------------------------------ | -------------- |
| Name                           | Peter Baumann  |
| Company name (if applicable)   | Exos Financial |
| Title or role (if applicable)  | data scientist |
| Date                           | Feb 1st, 2021  |
| GitHub username                | peter-exos     |
| Website (optional)             |                |
@@ -1,6 +1,6 @@
 # fmt: off
 __title__ = "spacy"
-__version__ = "3.0.1"
+__version__ = "3.0.3"
 __download_url__ = "https://github.com/explosion/spacy-models/releases/download"
 __compatibility__ = "https://raw.githubusercontent.com/explosion/spacy-models/master/compatibility.json"
 __projects__ = "https://github.com/explosion/projects"
@@ -16,7 +16,7 @@ import os
 from ..schemas import ProjectConfigSchema, validate
 from ..util import import_file, run_command, make_tempdir, registry, logger
-from ..util import is_compatible_version, ENV_VARS
+from ..util import is_compatible_version, SimpleFrozenDict, ENV_VARS
 from .. import about

 if TYPE_CHECKING:

@@ -111,26 +111,33 @@ def _parse_overrides(args: List[str], is_cli: bool = False) -> Dict[str, Any]:
                 value = "true"
             else:
                 value = args.pop(0)
-            # Just like we do in the config, we're calling json.loads on the
-            # values. But since they come from the CLI, it'd be unintuitive to
-            # explicitly mark strings with escaped quotes. So we're working
-            # around that here by falling back to a string if parsing fails.
-            # TODO: improve logic to handle simple types like list of strings?
-            try:
-                result[opt] = srsly.json_loads(value)
-            except ValueError:
-                result[opt] = str(value)
+            result[opt] = _parse_override(value)
         else:
             msg.fail(f"{err}: name should start with --", exits=1)
     return result


-def load_project_config(path: Path, interpolate: bool = True) -> Dict[str, Any]:
+def _parse_override(value: Any) -> Any:
+    # Just like we do in the config, we're calling json.loads on the
+    # values. But since they come from the CLI, it'd be unintuitive to
+    # explicitly mark strings with escaped quotes. So we're working
+    # around that here by falling back to a string if parsing fails.
+    # TODO: improve logic to handle simple types like list of strings?
+    try:
+        return srsly.json_loads(value)
+    except ValueError:
+        return str(value)
+
+
+def load_project_config(
+    path: Path, interpolate: bool = True, overrides: Dict[str, Any] = SimpleFrozenDict()
+) -> Dict[str, Any]:
     """Load the project.yml file from a directory and validate it. Also make
     sure that all directories defined in the config exist.

     path (Path): The path to the project directory.
     interpolate (bool): Whether to substitute project variables.
+    overrides (Dict[str, Any]): Optional config overrides.
     RETURNS (Dict[str, Any]): The loaded project.yml.
     """
     config_path = path / PROJECT_FILE

@@ -154,20 +161,36 @@ def load_project_config(path: Path, interpolate: bool = True) -> Dict[str, Any]:
         if not dir_path.exists():
             dir_path.mkdir(parents=True)
     if interpolate:
-        err = "project.yml validation error"
+        err = f"{PROJECT_FILE} validation error"
         with show_validation_error(title=err, hint_fill=False):
-            config = substitute_project_variables(config)
+            config = substitute_project_variables(config, overrides)
     return config


-def substitute_project_variables(config: Dict[str, Any], overrides: Dict = {}):
-    key = "vars"
+def substitute_project_variables(
+    config: Dict[str, Any],
+    overrides: Dict[str, Any] = SimpleFrozenDict(),
+    key: str = "vars",
+    env_key: str = "env",
+) -> Dict[str, Any]:
+    """Interpolate variables in the project file using the config system.
+
+    config (Dict[str, Any]): The project config.
+    overrides (Dict[str, Any]): Optional config overrides.
+    key (str): Key containing variables in project config.
+    env_key (str): Key containing environment variable mapping in project config.
+    RETURNS (Dict[str, Any]): The interpolated project config.
+    """
     config.setdefault(key, {})
-    config[key].update(overrides)
+    config.setdefault(env_key, {})
+    # Substitute references to env vars with their values
+    for config_var, env_var in config[env_key].items():
+        config[env_key][config_var] = _parse_override(os.environ.get(env_var, ""))
     # Need to put variables in the top scope again so we can have a top-level
     # section "project" (otherwise, a list of commands in the top scope wouldn't)
     # be allowed by Thinc's config system
-    cfg = Config({"project": config, key: config[key]})
+    cfg = Config({"project": config, key: config[key], env_key: config[env_key]})
+    cfg = Config().from_str(cfg.to_str(), overrides=overrides)
     interpolated = cfg.interpolate()
     return dict(interpolated["project"])
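The hunks above add two things to project.yml handling: config-style overrides and an "env" section that maps variable names to environment variable names. A minimal sketch of what that enables, assuming the helpers are importable from spacy.cli._util (the relative imports above suggest this module) and reusing the SPACY_TEST_FOO variable from the test at the bottom of this diff:

import os
from spacy.cli._util import substitute_project_variables

# project.yml contents as a dict: "vars" holds plain values, "env" maps
# variable names to environment variable names.
project = {
    "vars": {"a": 10},
    "env": {"foo": "SPACY_TEST_FOO"},
    "commands": [{"name": "x", "script": ["hello ${vars.a} ${env.foo}"]}],
}
os.environ["SPACY_TEST_FOO"] = "bar"
interpolated = substitute_project_variables(project, overrides={"vars.a": 20})
# The override replaces vars.a and the env var is substituted, so this
# should print something like: hello 20 bar
print(interpolated["commands"][0]["script"][0])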
@@ -175,10 +175,13 @@ def render_parses(
 def print_prf_per_type(
     msg: Printer, scores: Dict[str, Dict[str, float]], name: str, type: str
 ) -> None:
-    data = [
-        (k, f"{v['p']*100:.2f}", f"{v['r']*100:.2f}", f"{v['f']*100:.2f}")
-        for k, v in scores.items()
-    ]
+    data = []
+    for key, value in scores.items():
+        row = [key]
+        for k in ("p", "r", "f"):
+            v = value[k]
+            row.append(f"{v * 100:.2f}" if isinstance(v, (int, float)) else v)
+        data.append(row)
     msg.table(
         data,
         header=("", "P", "R", "F"),

@@ -191,7 +194,10 @@ def print_textcats_auc_per_cat(
     msg: Printer, scores: Dict[str, Dict[str, float]]
 ) -> None:
     msg.table(
-        [(k, f"{v:.2f}") for k, v in scores.items()],
+        [
+            (k, f"{v:.2f}" if isinstance(v, (float, int)) else v)
+            for k, v in scores.items()
+        ],
         header=("", "ROC AUC"),
         aligns=("l", "r"),
         title="Textcat ROC AUC (per label)",
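The isinstance() guards above exist so that a missing score (None, e.g. for a label without examples) is printed as-is instead of crashing the f-string formatting; the regression test for issue 7019 further down in this diff exercises exactly this. A small sketch:

from wasabi import msg
from spacy.cli.evaluate import print_prf_per_type, print_textcats_auc_per_cat

# Both tables now tolerate None values instead of raising on formatting,
# matching test_issue7019 below.
print_textcats_auc_per_cat(msg, {"LABEL_A": 0.4, "LABEL_C": None})
print_prf_per_type(
    msg, {"LABEL_B": {"p": None, "r": None, "f": None}}, name="textcat", type="label"
)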
@@ -3,19 +3,23 @@ from pathlib import Path
 from wasabi import msg
 import sys
 import srsly
+import typer

 from ... import about
 from ...git_info import GIT_VERSION
 from ...util import working_dir, run_command, split_command, is_cwd, join_command
 from ...util import SimpleFrozenList, is_minor_version_match, ENV_VARS
-from ...util import check_bool_env_var
+from ...util import check_bool_env_var, SimpleFrozenDict
 from .._util import PROJECT_FILE, PROJECT_LOCK, load_project_config, get_hash
-from .._util import get_checksum, project_cli, Arg, Opt, COMMAND
+from .._util import get_checksum, project_cli, Arg, Opt, COMMAND, parse_config_overrides


-@project_cli.command("run")
+@project_cli.command(
+    "run", context_settings={"allow_extra_args": True, "ignore_unknown_options": True}
+)
 def project_run_cli(
     # fmt: off
+    ctx: typer.Context,  # This is only used to read additional arguments
     subcommand: str = Arg(None, help=f"Name of command defined in the {PROJECT_FILE}"),
     project_dir: Path = Arg(Path.cwd(), help="Location of project directory. Defaults to current working directory.", exists=True, file_okay=False),
     force: bool = Opt(False, "--force", "-F", help="Force re-running steps, even if nothing changed"),

@@ -33,13 +37,15 @@ def project_run_cli(
     if show_help or not subcommand:
         print_run_help(project_dir, subcommand)
     else:
-        project_run(project_dir, subcommand, force=force, dry=dry)
+        overrides = parse_config_overrides(ctx.args)
+        project_run(project_dir, subcommand, overrides=overrides, force=force, dry=dry)


 def project_run(
     project_dir: Path,
     subcommand: str,
     *,
+    overrides: Dict[str, Any] = SimpleFrozenDict(),
     force: bool = False,
     dry: bool = False,
     capture: bool = False,

@@ -59,7 +65,7 @@ def project_run(
     when you want to turn over execution to the command, and capture=True
     when you want to run the command more like a function.
     """
-    config = load_project_config(project_dir)
+    config = load_project_config(project_dir, overrides=overrides)
     commands = {cmd["name"]: cmd for cmd in config.get("commands", [])}
     workflows = config.get("workflows", {})
     validate_subcommand(commands.keys(), workflows.keys(), subcommand)
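With allow_extra_args enabled and the extra arguments routed through parse_config_overrides, project variables can now be overridden per run. A sketch under the assumption that the project defines a "train" command and a vars.batch_size setting (both hypothetical names used only for illustration):

from pathlib import Path
from spacy.cli.project.run import project_run

# Equivalent CLI form (the extra args are picked up via ctx.args):
#   python -m spacy project run train . --vars.batch_size 128
project_run(
    Path("."),            # project directory containing project.yml
    "train",              # hypothetical command name
    overrides={"vars.batch_size": 128},
    dry=True,             # show what would run without executing it
)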
@@ -28,6 +28,15 @@ bg:
     accuracy:
       name: iarfmoose/roberta-base-bulgarian
       size_factor: 3
+bn:
+  word_vectors: null
+  transformer:
+    efficiency:
+      name: sagorsarker/bangla-bert-base
+      size_factor: 3
+    accuracy:
+      name: sagorsarker/bangla-bert-base
+      size_factor: 3
 da:
   word_vectors: da_core_news_lg
   transformer:

@@ -104,10 +113,10 @@ hi:
   word_vectors: null
   transformer:
     efficiency:
-      name: monsoon-nlp/hindi-tpu-electra
+      name: ai4bharat/indic-bert
       size_factor: 3
     accuracy:
-      name: monsoon-nlp/hindi-tpu-electra
+      name: ai4bharat/indic-bert
       size_factor: 3
 id:
   word_vectors: null

@@ -185,10 +194,10 @@ si:
   word_vectors: null
   transformer:
     efficiency:
-      name: keshan/SinhalaBERTo
+      name: setu4993/LaBSE
       size_factor: 3
     accuracy:
-      name: keshan/SinhalaBERTo
+      name: setu4993/LaBSE
       size_factor: 3
 sv:
   word_vectors: null

@@ -203,10 +212,10 @@ ta:
   word_vectors: null
   transformer:
     efficiency:
-      name: monsoon-nlp/tamillion
+      name: ai4bharat/indic-bert
       size_factor: 3
     accuracy:
-      name: monsoon-nlp/tamillion
+      name: ai4bharat/indic-bert
       size_factor: 3
 te:
   word_vectors: null
@@ -579,8 +579,8 @@ class Errors:
     E922 = ("Component '{name}' has been initialized with an output dimension of "
             "{nO} - cannot add any more labels.")
     E923 = ("It looks like there is no proper sample data to initialize the "
-            "Model of component '{name}'. This is likely a bug in spaCy, so "
-            "feel free to open an issue: https://github.com/explosion/spaCy/issues")
+            "Model of component '{name}'. To check your input data paths and "
+            "annotation, run: python -m spacy debug data config.cfg")
     E924 = ("The '{name}' component does not seem to be initialized properly. "
             "This is likely a bug in spaCy, so feel free to open an issue: "
             "https://github.com/explosion/spaCy/issues")
spacy/lang/tn/__init__.py (new file, 18 lines)
@@ -0,0 +1,18 @@
from .stop_words import STOP_WORDS
from .lex_attrs import LEX_ATTRS
from .punctuation import TOKENIZER_INFIXES
from ...language import Language


class SetswanaDefaults(Language.Defaults):
    infixes = TOKENIZER_INFIXES
    stop_words = STOP_WORDS
    lex_attr_getters = LEX_ATTRS


class Setswana(Language):
    lang = "tn"
    Defaults = SetswanaDefaults


__all__ = ["Setswana"]
spacy/lang/tn/examples.py (new file, 15 lines)
@@ -0,0 +1,15 @@
"""
Example sentences to test spaCy and its language models.

>>> from spacy.lang.tn.examples import sentences
>>> docs = nlp.pipe(sentences)
"""

sentences = [
    "Apple e nyaka go reka JSE ka tlhwatlhwa ta R1 billion",
    "Johannesburg ke toropo e kgolo mo Afrika Borwa.",
    "O ko kae?",
    "ke mang presidente ya Afrika Borwa?",
    "ke eng toropo kgolo ya Afrika Borwa?",
    "Nelson Mandela o belegwe leng?",
]
spacy/lang/tn/lex_attrs.py (new file, 107 lines)
@@ -0,0 +1,107 @@
from ...attrs import LIKE_NUM

_num_words = [
    "lefela",
    "nngwe",
    "pedi",
    "tharo",
    "nne",
    "tlhano",
    "thataro",
    "supa",
    "robedi",
    "robongwe",
    "lesome",
    "lesomenngwe",
    "lesomepedi",
    "sometharo",
    "somenne",
    "sometlhano",
    "somethataro",
    "somesupa",
    "somerobedi",
    "somerobongwe",
    "someamabedi",
    "someamararo",
    "someamane",
    "someamatlhano",
    "someamarataro",
    "someamasupa",
    "someamarobedi",
    "someamarobongwe",
    "lekgolo",
    "sekete",
    "milione",
    "bilione",
    "terilione",
    "kwatirilione",
    "gajillione",
    "bazillione",
]


_ordinal_words = [
    "ntlha",
    "bobedi",
    "boraro",
    "bone",
    "botlhano",
    "borataro",
    "bosupa",
    "borobedi",
    "borobongwe",
    "bolesome",
    "bolesomengwe",
    "bolesomepedi",
    "bolesometharo",
    "bolesomenne",
    "bolesometlhano",
    "bolesomethataro",
    "bolesomesupa",
    "bolesomerobedi",
    "bolesomerobongwe",
    "somamabedi",
    "someamararo",
    "someamane",
    "someamatlhano",
    "someamarataro",
    "someamasupa",
    "someamarobedi",
    "someamarobongwe",
    "lekgolo",
    "sekete",
    "milione",
    "bilione",
    "terilione",
    "kwatirilione",
    "gajillione",
    "bazillione",
]


def like_num(text):
    if text.startswith(("+", "-", "±", "~")):
        text = text[1:]
    text = text.replace(",", "").replace(".", "")
    if text.isdigit():
        return True
    if text.count("/") == 1:
        num, denom = text.split("/")
        if num.isdigit() and denom.isdigit():
            return True

    text_lower = text.lower()
    if text_lower in _num_words:
        return True

    # Check ordinal number
    if text_lower in _ordinal_words:
        return True
    if text_lower.endswith("th"):
        if text_lower[:-2].isdigit():
            return True

    return False


LEX_ATTRS = {LIKE_NUM: like_num}
spacy/lang/tn/punctuation.py (new file, 19 lines)
@@ -0,0 +1,19 @@
from ..char_classes import LIST_ELLIPSES, LIST_ICONS, HYPHENS
from ..char_classes import CONCAT_QUOTES, ALPHA_LOWER, ALPHA_UPPER, ALPHA

_infixes = (
    LIST_ELLIPSES
    + LIST_ICONS
    + [
        r"(?<=[0-9])[+\-\*^](?=[0-9-])",
        r"(?<=[{al}{q}])\.(?=[{au}{q}])".format(
            al=ALPHA_LOWER, au=ALPHA_UPPER, q=CONCAT_QUOTES
        ),
        r"(?<=[{a}]),(?=[{a}])".format(a=ALPHA),
        r"(?<=[{a}0-9])(?:{h})(?=[{a}])".format(a=ALPHA, h=HYPHENS),
        r"(?<=[{a}0-9])[:<>=/](?=[{a}])".format(a=ALPHA),
    ]
)


TOKENIZER_INFIXES = _infixes
spacy/lang/tn/stop_words.py (new file, 20 lines)
@@ -0,0 +1,20 @@
# Stop words
STOP_WORDS = set(
    """
ke gareng ga selekanyo tlhwatlhwa yo mongwe se
sengwe fa go le jalo gongwe ba na mo tikologong
jaaka kwa morago nna gonne ka sa pele nako teng
tlase fela ntle magareng tsona feta bobedi kgabaganya
moo gape kgatlhanong botlhe tsotlhe bokana e esi
setseng mororo dinako golo kgolo nnye wena gago
o ntse ntle tla goreng gangwe mang yotlhe gore
eo yona tseraganyo eng ne sentle re rona thata
godimo fitlha pedi masomamabedi lesomepedi mmogo
tharo tseo boraro tseno yone jaanong bobona bona
lesome tsaya tsamaiso nngwe masomethataro thataro
tsa mmatota tota sale thoko supa dira tshwanetse di mmalwa masisi
bonala e tshwanang bogolo tsenya tsweetswee karolo
sepe tlhalosa dirwa robedi robongwe lesomenngwe gaisa
tlhano lesometlhano botlalo lekgolo
""".split()
)
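The five new files above register Setswana under the language code "tn". A quick usage sketch (the printed output is illustrative, not taken from the diff):

import spacy

# A blank Setswana pipeline picks up the tokenizer infixes, stop words and
# lexical attributes defined above.
nlp = spacy.blank("tn")
doc = nlp("Johannesburg ke toropo e kgolo mo Afrika Borwa.")
print([token.text for token in doc])
print([token.is_stop for token in doc])
# like_num recognises the Setswana number words from _num_words:
print([token.like_num for token in nlp("lesome 10 3/4")])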
@@ -451,7 +451,7 @@ cdef class Lexeme:
         Lexeme.c_set_flag(self.c, IS_QUOTE, x)

     property is_left_punct:
-        """RETURNS (bool): Whether the lexeme is left punctuation, e.g. )."""
+        """RETURNS (bool): Whether the lexeme is left punctuation, e.g. (."""
         def __get__(self):
             return Lexeme.c_check_flag(self.c, IS_LEFT_PUNCT)
@@ -18,4 +18,4 @@ cdef class PhraseMatcher:
     cdef Pool mem
     cdef key_t _terminal_hash

-    cdef void find_matches(self, Doc doc, vector[SpanC] *matches) nogil
+    cdef void find_matches(self, Doc doc, int start_idx, int end_idx, vector[SpanC] *matches) nogil
@@ -230,10 +230,10 @@ cdef class PhraseMatcher:
             result = internal_node
         map_set(self.mem, <MapStruct*>result, self.vocab.strings[key], NULL)

-    def __call__(self, doc, *, as_spans=False):
+    def __call__(self, object doclike, *, as_spans=False):
         """Find all sequences matching the supplied patterns on the `Doc`.

-        doc (Doc): The document to match over.
+        doclike (Doc or Span): The document to match over.
         as_spans (bool): Return Span objects with labels instead of (match_id,
             start, end) tuples.
         RETURNS (list): A list of `(match_id, start, end)` tuples,

@@ -244,12 +244,22 @@ cdef class PhraseMatcher:
         DOCS: https://spacy.io/api/phrasematcher#call
         """
         matches = []
-        if doc is None or len(doc) == 0:
+        if doclike is None or len(doclike) == 0:
             # if doc is empty or None just return empty list
             return matches
+        if isinstance(doclike, Doc):
+            doc = doclike
+            start_idx = 0
+            end_idx = len(doc)
+        elif isinstance(doclike, Span):
+            doc = doclike.doc
+            start_idx = doclike.start
+            end_idx = doclike.end
+        else:
+            raise ValueError(Errors.E195.format(good="Doc or Span", got=type(doclike).__name__))
+
         cdef vector[SpanC] c_matches
-        self.find_matches(doc, &c_matches)
+        self.find_matches(doc, start_idx, end_idx, &c_matches)
         for i in range(c_matches.size()):
             matches.append((c_matches[i].label, c_matches[i].start, c_matches[i].end))
         for i, (ent_id, start, end) in enumerate(matches):

@@ -261,17 +271,17 @@ cdef class PhraseMatcher:
         else:
             return matches

-    cdef void find_matches(self, Doc doc, vector[SpanC] *matches) nogil:
+    cdef void find_matches(self, Doc doc, int start_idx, int end_idx, vector[SpanC] *matches) nogil:
         cdef MapStruct* current_node = self.c_map
         cdef int start = 0
-        cdef int idx = 0
-        cdef int idy = 0
+        cdef int idx = start_idx
+        cdef int idy = start_idx
         cdef key_t key
         cdef void* value
         cdef int i = 0
         cdef SpanC ms
         cdef void* result
-        while idx < doc.length:
+        while idx < end_idx:
             start = idx
             token = Token.get_struct_attr(&doc.c[idx], self.attr)
             # look for sequences from this position

@@ -279,7 +289,7 @@ cdef class PhraseMatcher:
             if result:
                 current_node = <MapStruct*>result
                 idy = idx + 1
-                while idy < doc.length:
+                while idy < end_idx:
                     result = map_get(current_node, self._terminal_hash)
                     if result:
                         i = 0
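A short sketch of the new behaviour (it mirrors the tests added further down in this diff): calling the matcher on a Span restricts matching to that window of the Doc.

import spacy
from spacy.matcher import PhraseMatcher

nlp = spacy.blank("en")
matcher = PhraseMatcher(nlp.vocab)
matcher.add("SPACY", [nlp.make_doc("Spans and Docs")])
doc = nlp("I like Spans and Docs in my input, and nothing else.")
print(matcher(doc))      # list of (match_id, start, end) over the whole Doc
print(matcher(doc[:8]))  # same call on a Span: only matches inside doc[:8]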
@@ -107,6 +107,7 @@ def init_ensemble_textcat(model, X, Y) -> Model:
     model.get_ref("maxout_layer").set_dim("nO", tok2vec_width)
     model.get_ref("maxout_layer").set_dim("nI", tok2vec_width)
     model.get_ref("norm_layer").set_dim("nI", tok2vec_width)
+    model.get_ref("norm_layer").set_dim("nO", tok2vec_width)
     init_chain(model, X, Y)
     return model
@@ -273,7 +273,7 @@ class EntityLinker(TrainablePipe):
         gradients = self.distance.get_grad(sentence_encodings, entity_encodings)
         loss = self.distance.get_loss(sentence_encodings, entity_encodings)
         loss = loss / len(entity_encodings)
-        return loss, gradients
+        return float(loss), gradients

     def predict(self, docs: Iterable[Doc]) -> List[str]:
         """Apply the pipeline's model to a batch of docs, without modifying them.
@@ -76,7 +76,7 @@ def merge_subtokens(doc: Doc, label: str = "subtok") -> Doc:
     retokenizes=True,
 )
 def make_token_splitter(
-    nlp: Language, name: str, *, min_length=0, split_length=0,
+    nlp: Language, name: str, *, min_length: int = 0, split_length: int = 0
 ):
     return TokenSplitter(min_length=min_length, split_length=split_length)
@@ -197,7 +197,7 @@ class ClozeMultitask(TrainablePipe):
         target = vectors[ids]
         gradient = self.distance.get_grad(prediction, target)
         loss = self.distance.get_loss(prediction, target)
-        return loss, gradient
+        return float(loss), gradient

     def update(self, examples, *, drop=0., sgd=None, losses=None):
         pass
@@ -121,7 +121,7 @@ class Tok2Vec(TrainablePipe):
         tokvecs = self.model.predict(docs)
         batch_id = Tok2VecListener.get_batch_id(docs)
         for listener in self.listeners:
-            listener.receive(batch_id, tokvecs, lambda dX: [])
+            listener.receive(batch_id, tokvecs, _empty_backprop)
         return tokvecs

     def set_annotations(self, docs: Sequence[Doc], tokvecses) -> None:

@@ -291,12 +291,18 @@ def forward(model: Tok2VecListener, inputs, is_train: bool):
         # of data.
         # When the components batch differently, we don't receive a matching
         # prediction from the upstream, so we can't predict.
-        if not all(doc.tensor.size for doc in inputs):
-            # But we do need to do *something* if the tensor hasn't been set.
-            # The compromise is to at least return data of the right shape,
-            # so the output is valid.
-            width = model.get_dim("nO")
-            outputs = [model.ops.alloc2f(len(doc), width) for doc in inputs]
-        else:
-            outputs = [doc.tensor for doc in inputs]
+        outputs = []
+        width = model.get_dim("nO")
+        for doc in inputs:
+            if doc.tensor.size == 0:
+                # But we do need to do *something* if the tensor hasn't been set.
+                # The compromise is to at least return data of the right shape,
+                # so the output is valid.
+                outputs.append(model.ops.alloc2f(len(doc), width))
+            else:
+                outputs.append(doc.tensor)
         return outputs, lambda dX: []


+def _empty_backprop(dX):  # for pickling
+    return []
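Replacing the lambda passed to the listeners with the module-level _empty_backprop is what makes a pipeline containing a Tok2VecListener picklable after prediction; test_issue6950 further down builds such a pipeline from a config and checks exactly this. A minimal sketch, assuming some installed pipeline that uses tok2vec listeners (en_core_web_sm is used here only as an example):

import pickle
import spacy

nlp = spacy.load("en_core_web_sm")  # assumed to be installed
nlp("hello")              # runs predict(), which hands _empty_backprop to the listeners
data = pickle.dumps(nlp)  # previously this could fail on the un-picklable lambda
print(len(data) > 0)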
@@ -446,6 +446,7 @@ class ProjectConfigCommand(BaseModel):
 class ProjectConfigSchema(BaseModel):
     # fmt: off
     vars: Dict[StrictStr, Any] = Field({}, title="Optional variables to substitute in commands")
+    env: Dict[StrictStr, Any] = Field({}, title="Optional variable names to substitute in commands, mapped to environment variable names")
     assets: List[Union[ProjectConfigAssetURL, ProjectConfigAssetGit]] = Field([], title="Data assets")
     workflows: Dict[StrictStr, List[StrictStr]] = Field({}, title="Named workflows, mapped to list of project commands to run in order")
     commands: List[ProjectConfigCommand] = Field([], title="Project command shortucts")
@@ -8,7 +8,8 @@ from spacy.util import get_lang_class
 LANGUAGES = ["af", "ar", "bg", "bn", "ca", "cs", "da", "de", "el", "en", "es",
              "et", "fa", "fi", "fr", "ga", "he", "hi", "hr", "hu", "id", "is",
              "it", "kn", "lt", "lv", "nb", "nl", "pl", "pt", "ro", "si", "sk",
-             "sl", "sq", "sr", "sv", "ta", "te", "tl", "tr", "tt", "ur", 'yo']
+             "sl", "sq", "sr", "sv", "ta", "te", "tl", "tn", "tr", "tt", "ur",
+             "yo"]
 # fmt: on
@@ -323,3 +323,39 @@ def test_phrase_matcher_deprecated(en_vocab):
 @pytest.mark.parametrize("attr", ["SENT_START", "IS_SENT_START"])
 def test_phrase_matcher_sent_start(en_vocab, attr):
     _ = PhraseMatcher(en_vocab, attr=attr)  # noqa: F841
+
+
+def test_span_in_phrasematcher(en_vocab):
+    """Ensure that PhraseMatcher accepts Span and Doc as input"""
+    # fmt: off
+    words = ["I", "like", "Spans", "and", "Docs", "in", "my", "input", ",", "and", "nothing", "else", "."]
+    # fmt: on
+    doc = Doc(en_vocab, words=words)
+    span = doc[:8]
+    pattern = Doc(en_vocab, words=["Spans", "and", "Docs"])
+    matcher = PhraseMatcher(en_vocab)
+    matcher.add("SPACY", [pattern])
+    matches_doc = matcher(doc)
+    matches_span = matcher(span)
+    assert len(matches_doc) == 1
+    assert len(matches_span) == 1
+
+
+def test_span_v_doc_in_phrasematcher(en_vocab):
+    """Ensure that PhraseMatcher only returns matches in input Span and not in entire Doc"""
+    # fmt: off
+    words = [
+        "I", "like", "Spans", "and", "Docs", "in", "my", "input", ",", "Spans",
+        "and", "Docs", "in", "my", "matchers", "," "and", "Spans", "and", "Docs",
+        "everywhere", "."
+    ]
+    # fmt: on
+    doc = Doc(en_vocab, words=words)
+    span = doc[9:15]  # second clause
+    pattern = Doc(en_vocab, words=["Spans", "and", "Docs"])
+    matcher = PhraseMatcher(en_vocab)
+    matcher.add("SPACY", [pattern])
+    matches_doc = matcher(doc)
+    matches_span = matcher(span)
+    assert len(matches_doc) == 3
+    assert len(matches_span) == 1
@@ -451,13 +451,27 @@ def test_pipe_factories_from_source_config():
     assert config["arg"] == "world"


-def test_pipe_factories_decorator_idempotent():
+class PipeFactoriesIdempotent:
+    def __init__(self, nlp, name):
+        ...
+
+    def __call__(self, doc):
+        ...
+
+
+@pytest.mark.parametrize(
+    "i,func,func2",
+    [
+        (0, lambda nlp, name: lambda doc: doc, lambda doc: doc),
+        (1, PipeFactoriesIdempotent, PipeFactoriesIdempotent(None, None)),
+    ],
+)
+def test_pipe_factories_decorator_idempotent(i, func, func2):
     """Check that decorator can be run multiple times if the function is the
     same. This is especially relevant for live reloading because we don't
     want spaCy to raise an error if a module registering components is reloaded.
     """
-    name = "test_pipe_factories_decorator_idempotent"
-    func = lambda nlp, name: lambda doc: doc
+    name = f"test_pipe_factories_decorator_idempotent_{i}"
     for i in range(5):
         Language.factory(name, func=func)
     nlp = Language()

@@ -466,7 +480,6 @@ def test_pipe_factories_decorator_idempotent():
     # Make sure it also works for component decorator, which creates the
     # factory function
     name2 = f"{name}2"
-    func2 = lambda doc: doc
     for i in range(5):
         Language.component(name2, func=func2)
     nlp = Language()
spacy/tests/regression/test_issue6501-7000.py (new file, 229 lines)
@@ -0,0 +1,229 @@
import pytest
from spacy.lang.en import English
import numpy as np
import spacy
from spacy.tokens import Doc
from spacy.matcher import PhraseMatcher
from spacy.tokens import DocBin
from spacy.util import load_config_from_str
from spacy.training import Example
from spacy.training.initialize import init_nlp
import pickle

from ..util import make_tempdir


def test_issue6730(en_vocab):
    """Ensure that the KB does not accept empty strings, but otherwise IO works fine."""
    from spacy.kb import KnowledgeBase

    kb = KnowledgeBase(en_vocab, entity_vector_length=3)
    kb.add_entity(entity="1", freq=148, entity_vector=[1, 2, 3])

    with pytest.raises(ValueError):
        kb.add_alias(alias="", entities=["1"], probabilities=[0.4])
    assert kb.contains_alias("") is False

    kb.add_alias(alias="x", entities=["1"], probabilities=[0.2])
    kb.add_alias(alias="y", entities=["1"], probabilities=[0.1])

    with make_tempdir() as tmp_dir:
        kb.to_disk(tmp_dir)
        kb.from_disk(tmp_dir)
    assert kb.get_size_aliases() == 2
    assert set(kb.get_alias_strings()) == {"x", "y"}


def test_issue6755(en_tokenizer):
    doc = en_tokenizer("This is a magnificent sentence.")
    span = doc[:0]
    assert span.text_with_ws == ""
    assert span.text == ""


@pytest.mark.parametrize(
    "sentence, start_idx,end_idx,label",
    [("Welcome to Mumbai, my friend", 11, 17, "GPE")],
)
def test_issue6815_1(sentence, start_idx, end_idx, label):
    nlp = English()
    doc = nlp(sentence)
    span = doc[:].char_span(start_idx, end_idx, label=label)
    assert span.label_ == label


@pytest.mark.parametrize(
    "sentence, start_idx,end_idx,kb_id", [("Welcome to Mumbai, my friend", 11, 17, 5)]
)
def test_issue6815_2(sentence, start_idx, end_idx, kb_id):
    nlp = English()
    doc = nlp(sentence)
    span = doc[:].char_span(start_idx, end_idx, kb_id=kb_id)
    assert span.kb_id == kb_id


@pytest.mark.parametrize(
    "sentence, start_idx,end_idx,vector",
    [("Welcome to Mumbai, my friend", 11, 17, np.array([0.1, 0.2, 0.3]))],
)
def test_issue6815_3(sentence, start_idx, end_idx, vector):
    nlp = English()
    doc = nlp(sentence)
    span = doc[:].char_span(start_idx, end_idx, vector=vector)
    assert (span.vector == vector).all()


def test_issue6839(en_vocab):
    """Ensure that PhraseMatcher accepts Span as input"""
    # fmt: off
    words = ["I", "like", "Spans", "and", "Docs", "in", "my", "input", ",", "and", "nothing", "else", "."]
    # fmt: on
    doc = Doc(en_vocab, words=words)
    span = doc[:8]
    pattern = Doc(en_vocab, words=["Spans", "and", "Docs"])
    matcher = PhraseMatcher(en_vocab)
    matcher.add("SPACY", [pattern])
    matches = matcher(span)
    assert matches


CONFIG_ISSUE_6908 = """
[paths]
train = "TRAIN_PLACEHOLDER"
raw = null
init_tok2vec = null
vectors = null

[system]
seed = 0
gpu_allocator = null

[nlp]
lang = "en"
pipeline = ["textcat"]
tokenizer = {"@tokenizers":"spacy.Tokenizer.v1"}
disabled = []
before_creation = null
after_creation = null
after_pipeline_creation = null
batch_size = 1000

[components]

[components.textcat]
factory = "TEXTCAT_PLACEHOLDER"

[corpora]

[corpora.train]
@readers = "spacy.Corpus.v1"
path = ${paths:train}

[corpora.dev]
@readers = "spacy.Corpus.v1"
path = ${paths:train}


[training]
train_corpus = "corpora.train"
dev_corpus = "corpora.dev"
seed = ${system.seed}
gpu_allocator = ${system.gpu_allocator}
frozen_components = []
before_to_disk = null

[pretraining]

[initialize]
vectors = ${paths.vectors}
init_tok2vec = ${paths.init_tok2vec}
vocab_data = null
lookups = null
before_init = null
after_init = null

[initialize.components]

[initialize.components.textcat]
labels = ['label1', 'label2']

[initialize.tokenizer]
"""


@pytest.mark.parametrize(
    "component_name", ["textcat", "textcat_multilabel"],
)
def test_issue6908(component_name):
    """Test initializing textcat with labels in a list"""

    def create_data(out_file):
        nlp = spacy.blank("en")
        doc = nlp.make_doc("Some text")
        doc.cats = {"label1": 0, "label2": 1}
        out_data = DocBin(docs=[doc]).to_bytes()
        with out_file.open("wb") as file_:
            file_.write(out_data)

    with make_tempdir() as tmp_path:
        train_path = tmp_path / "train.spacy"
        create_data(train_path)
        config_str = CONFIG_ISSUE_6908.replace("TEXTCAT_PLACEHOLDER", component_name)
        config_str = config_str.replace("TRAIN_PLACEHOLDER", train_path.as_posix())
        config = load_config_from_str(config_str)
        init_nlp(config)


CONFIG_ISSUE_6950 = """
[nlp]
lang = "en"
pipeline = ["tok2vec", "tagger"]

[components]

[components.tok2vec]
factory = "tok2vec"

[components.tok2vec.model]
@architectures = "spacy.Tok2Vec.v1"

[components.tok2vec.model.embed]
@architectures = "spacy.MultiHashEmbed.v1"
width = ${components.tok2vec.model.encode:width}
attrs = ["NORM","PREFIX","SUFFIX","SHAPE"]
rows = [5000,2500,2500,2500]
include_static_vectors = false

[components.tok2vec.model.encode]
@architectures = "spacy.MaxoutWindowEncoder.v1"
width = 96
depth = 4
window_size = 1
maxout_pieces = 3

[components.ner]
factory = "ner"

[components.tagger]
factory = "tagger"

[components.tagger.model]
@architectures = "spacy.Tagger.v1"
nO = null

[components.tagger.model.tok2vec]
@architectures = "spacy.Tok2VecListener.v1"
width = ${components.tok2vec.model.encode:width}
upstream = "*"
"""


def test_issue6950():
    """Test that the nlp object with initialized tok2vec with listeners pickles
    correctly (and doesn't have lambdas).
    """
    nlp = English.from_config(load_config_from_str(CONFIG_ISSUE_6950))
    nlp.initialize(lambda: [Example.from_dict(nlp.make_doc("hello"), {"tags": ["V"]})])
    pickle.dumps(nlp)
    nlp("hello")
    pickle.dumps(nlp)
@@ -1,23 +0,0 @@
(deleted file: its test_issue6730, the KnowledgeBase empty-alias IO test, is identical to the version consolidated into spacy/tests/regression/test_issue6501-7000.py above)

@@ -1,5 +0,0 @@
(deleted file: its test_issue6755 empty-span test is identical to the version consolidated into test_issue6501-7000.py above)

@@ -1,35 +0,0 @@
(deleted file: its parametrized char_span tests, test_char_span_label, test_char_span_kb_id and test_char_span_vector, reappear in test_issue6501-7000.py above as test_issue6815_1/2/3)

@@ -1,102 +0,0 @@
(deleted file: its TEXTCAT_WITH_LABELS_ARRAY_CONFIG constant and test_textcat_initialize_labels_validation reappear in test_issue6501-7000.py above as CONFIG_ISSUE_6908 and test_issue6908; the deleted module additionally imported Language, util and ConfigSchemaInit)
spacy/tests/regression/test_issue7019.py (new file, 12 lines)
@@ -0,0 +1,12 @@
from spacy.cli.evaluate import print_textcats_auc_per_cat, print_prf_per_type
from wasabi import msg


def test_issue7019():
    scores = {"LABEL_A": 0.39829102, "LABEL_B": 0.938298329382, "LABEL_C": None}
    print_textcats_auc_per_cat(msg, scores)
    scores = {
        "LABEL_A": {"p": 0.3420302, "r": 0.3929020, "f": 0.49823928932},
        "LABEL_B": {"p": None, "r": None, "f": None},
    }
    print_prf_per_type(msg, scores, name="foo", type="bar")
spacy/tests/regression/test_issue7029.py (new file, 67 lines)
@@ -0,0 +1,67 @@
from spacy.lang.en import English
from spacy.training import Example
from spacy.util import load_config_from_str


CONFIG = """
[nlp]
lang = "en"
pipeline = ["tok2vec", "tagger"]

[components]

[components.tok2vec]
factory = "tok2vec"

[components.tok2vec.model]
@architectures = "spacy.Tok2Vec.v1"

[components.tok2vec.model.embed]
@architectures = "spacy.MultiHashEmbed.v1"
width = ${components.tok2vec.model.encode:width}
attrs = ["NORM","PREFIX","SUFFIX","SHAPE"]
rows = [5000,2500,2500,2500]
include_static_vectors = false

[components.tok2vec.model.encode]
@architectures = "spacy.MaxoutWindowEncoder.v1"
width = 96
depth = 4
window_size = 1
maxout_pieces = 3

[components.tagger]
factory = "tagger"

[components.tagger.model]
@architectures = "spacy.Tagger.v1"
nO = null

[components.tagger.model.tok2vec]
@architectures = "spacy.Tok2VecListener.v1"
width = ${components.tok2vec.model.encode:width}
upstream = "*"
"""


TRAIN_DATA = [
    ("I like green eggs", {"tags": ["N", "V", "J", "N"]}),
    ("Eat blue ham", {"tags": ["V", "J", "N"]}),
]


def test_issue7029():
    """Test that an empty document doesn't mess up an entire batch."""
    nlp = English.from_config(load_config_from_str(CONFIG))
    train_examples = []
    for t in TRAIN_DATA:
        train_examples.append(Example.from_dict(nlp.make_doc(t[0]), t[1]))
    optimizer = nlp.initialize(get_examples=lambda: train_examples)
    for i in range(50):
        losses = {}
        nlp.update(train_examples, sgd=optimizer, losses=losses)
    texts = ["first", "second", "third", "fourth", "and", "then", "some", ""]
    nlp.select_pipes(enable=["tok2vec", "tagger"])
    docs1 = list(nlp.pipe(texts, batch_size=1))
    docs2 = list(nlp.pipe(texts, batch_size=4))
    assert [doc[0].tag_ for doc in docs1[:-1]] == [doc[0].tag_ for doc in docs2[:-1]]
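The regression test above guards against an empty document breaking batched processing. As a quick standalone illustration of the scenario (a sketch, not part of the diff), an empty string simply becomes a zero-token `Doc` when piped through a pipeline:

```python
import spacy

# Minimal illustration: an empty string in a batch yields an empty Doc
# and should not affect the other documents processed alongside it.
nlp = spacy.blank("en")
docs = list(nlp.pipe(["first text", "", "third text"]))
assert len(docs) == 3
assert len(docs[1]) == 0  # the empty input is a zero-token Doc
```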
@@ -325,6 +325,23 @@ def test_project_config_interpolation():
    substitute_project_variables(project)


+ def test_project_config_interpolation_env():
+     variables = {"a": 10}
+     env_var = "SPACY_TEST_FOO"
+     env_vars = {"foo": env_var}
+     commands = [{"name": "x", "script": ["hello ${vars.a} ${env.foo}"]}]
+     project = {"commands": commands, "vars": variables, "env": env_vars}
+     with make_tempdir() as d:
+         srsly.write_yaml(d / "project.yml", project)
+         cfg = load_project_config(d)
+     assert cfg["commands"][0]["script"][0] == "hello 10 "
+     os.environ[env_var] = "123"
+     with make_tempdir() as d:
+         srsly.write_yaml(d / "project.yml", project)
+         cfg = load_project_config(d)
+     assert cfg["commands"][0]["script"][0] == "hello 10 123"
+
+
@pytest.mark.parametrize(
    "args,expected",
    [
@@ -1,7 +1,9 @@
import pytest
import numpy
import srsly
+ from spacy.lang.en import English
from spacy.strings import StringStore
+ from spacy.tokens import Doc
from spacy.vocab import Vocab
from spacy.attrs import NORM

@@ -20,7 +22,10 @@ def test_pickle_string_store(text1, text2):

@pytest.mark.parametrize("text1,text2", [("dog", "cat")])
def test_pickle_vocab(text1, text2):
-     vocab = Vocab(lex_attr_getters={int(NORM): lambda string: string[:-1]})
+     vocab = Vocab(
+         lex_attr_getters={int(NORM): lambda string: string[:-1]},
+         get_noun_chunks=English.Defaults.syntax_iterators.get("noun_chunks"),
+     )
    vocab.set_vector("dog", numpy.ones((5,), dtype="f"))
    lex1 = vocab[text1]
    lex2 = vocab[text2]

@@ -34,4 +39,23 @@ def test_pickle_vocab(text1, text2):
    assert unpickled[text2].norm == lex2.norm
    assert unpickled[text1].norm != unpickled[text2].norm
    assert unpickled.vectors is not None
+     assert unpickled.get_noun_chunks is not None
    assert list(vocab["dog"].vector) == [1.0, 1.0, 1.0, 1.0, 1.0]
+
+
+ def test_pickle_doc(en_vocab):
+     words = ["a", "b", "c"]
+     deps = ["dep"] * len(words)
+     heads = [0] * len(words)
+     doc = Doc(
+         en_vocab,
+         words=words,
+         deps=deps,
+         heads=heads,
+     )
+     data = srsly.pickle_dumps(doc)
+     unpickled = srsly.pickle_loads(data)
+     assert [t.text for t in unpickled] == words
+     assert [t.dep_ for t in unpickled] == deps
+     assert [t.head.i for t in unpickled] == heads
+     assert list(doc.noun_chunks) == []
@@ -55,6 +55,7 @@ def test_vocab_lexeme_add_flag_provided_id(en_vocab):
    assert en_vocab["199"].check_flag(IS_DIGIT) is False
    assert en_vocab["the"].check_flag(is_len4) is False
    assert en_vocab["dogs"].check_flag(is_len4) is True
+     en_vocab.add_flag(lambda string: string.isdigit(), flag_id=IS_DIGIT)


def test_vocab_lexeme_oov_rank(en_vocab):
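For context, `Vocab.add_flag` registers a custom boolean lexeme attribute and returns a flag ID that can be checked with `Lexeme.check_flag`. A short sketch of the general pattern (the flag name and predicate here are made up for illustration):

```python
import spacy

# Hypothetical custom flag: mark all-uppercase lexemes.
nlp = spacy.blank("en")
IS_SHOUTING = nlp.vocab.add_flag(lambda text: text.isupper())
assert nlp.vocab["HELLO"].check_flag(IS_SHOUTING) is True
assert nlp.vocab["hello"].check_flag(IS_SHOUTING) is False
```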
@@ -245,7 +245,7 @@ cdef class Tokenizer:
        cdef int offset
        cdef int modified_doc_length
        # Find matches for special cases
-         self._special_matcher.find_matches(doc, &c_matches)
+         self._special_matcher.find_matches(doc, 0, doc.length, &c_matches)
        # Skip processing if no matches
        if c_matches.size() == 0:
            return True
|
@ -215,8 +215,7 @@ def convert_vectors(
|
||||||
|
|
||||||
|
|
||||||
def read_vectors(vectors_loc: Path, truncate_vectors: int):
|
def read_vectors(vectors_loc: Path, truncate_vectors: int):
|
||||||
f = open_file(vectors_loc)
|
f = ensure_shape(vectors_loc)
|
||||||
f = ensure_shape(f)
|
|
||||||
shape = tuple(int(size) for size in next(f).split())
|
shape = tuple(int(size) for size in next(f).split())
|
||||||
if truncate_vectors >= 1:
|
if truncate_vectors >= 1:
|
||||||
shape = (truncate_vectors, shape[1])
|
shape = (truncate_vectors, shape[1])
|
||||||
|
@ -251,11 +250,12 @@ def open_file(loc: Union[str, Path]) -> IO:
|
||||||
return loc.open("r", encoding="utf8")
|
return loc.open("r", encoding="utf8")
|
||||||
|
|
||||||
|
|
||||||
def ensure_shape(lines):
|
def ensure_shape(vectors_loc):
|
||||||
"""Ensure that the first line of the data is the vectors shape.
|
"""Ensure that the first line of the data is the vectors shape.
|
||||||
If it's not, we read in the data and output the shape as the first result,
|
If it's not, we read in the data and output the shape as the first result,
|
||||||
so that the reader doesn't have to deal with the problem.
|
so that the reader doesn't have to deal with the problem.
|
||||||
"""
|
"""
|
||||||
|
lines = open_file(vectors_loc)
|
||||||
first_line = next(lines)
|
first_line = next(lines)
|
||||||
try:
|
try:
|
||||||
shape = tuple(int(size) for size in first_line.split())
|
shape = tuple(int(size) for size in first_line.split())
|
||||||
|
@ -269,7 +269,11 @@ def ensure_shape(lines):
|
||||||
# Figure out the shape, make it the first value, and then give the
|
# Figure out the shape, make it the first value, and then give the
|
||||||
# rest of the data.
|
# rest of the data.
|
||||||
width = len(first_line.split()) - 1
|
width = len(first_line.split()) - 1
|
||||||
captured = [first_line] + list(lines)
|
length = 1
|
||||||
length = len(captured)
|
for _ in lines:
|
||||||
|
length += 1
|
||||||
yield f"{length} {width}"
|
yield f"{length} {width}"
|
||||||
yield from captured
|
# Reading the lines in again from file. This to avoid having to
|
||||||
|
# store all the results in a list in memory
|
||||||
|
lines2 = open_file(vectors_loc)
|
||||||
|
yield from lines2
|
||||||
|
|
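The change above avoids materializing the whole vectors file in memory just to count its rows: the generator counts lines in a first pass and then re-opens the file to stream them. A standalone sketch of the same two-pass pattern on a plain text file (the helper name is hypothetical, not spaCy's internal function):

```python
from pathlib import Path
from typing import Iterator


def stream_with_shape(path: Path) -> Iterator[str]:
    """Yield "<rows> <width>" first, then every line of the file.

    Sketch of the two-pass approach: count the rows without buffering
    them, then re-open the file and stream the rows themselves.
    Assumes a non-empty whitespace-separated vectors file.
    """
    with path.open("r", encoding="utf8") as f:
        first = next(f)
        width = len(first.split()) - 1   # first column is the key
        length = 1 + sum(1 for _ in f)   # count remaining rows lazily
    yield f"{length} {width}"
    with path.open("r", encoding="utf8") as f:
        yield from f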
@@ -930,6 +930,8 @@ def is_same_func(func1: Callable, func2: Callable) -> bool:
    """
    if not callable(func1) or not callable(func2):
        return False
+     if not hasattr(func1, "__qualname__") or not hasattr(func2, "__qualname__"):
+         return False
    same_name = func1.__qualname__ == func2.__qualname__
    same_file = inspect.getfile(func1) == inspect.getfile(func2)
    same_code = inspect.getsourcelines(func1) == inspect.getsourcelines(func2)
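The new guard matters because not every callable carries a `__qualname__`. `functools.partial` objects are one common example, shown here purely as an illustration rather than as the case that motivated the change:

```python
from functools import partial


def evaluate(batch_size, verbose=False):
    return batch_size


wrapped = partial(evaluate, verbose=True)
# A partial is callable but has no __qualname__, so comparing
# __qualname__ directly would raise AttributeError without the guard.
assert callable(wrapped)
assert not hasattr(wrapped, "__qualname__")
```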
@@ -551,12 +551,13 @@ def pickle_vocab(vocab):
    data_dir = vocab.data_dir
    lex_attr_getters = srsly.pickle_dumps(vocab.lex_attr_getters)
    lookups = vocab.lookups
+     get_noun_chunks = vocab.get_noun_chunks
    return (unpickle_vocab,
-         (sstore, vectors, morph, data_dir, lex_attr_getters, lookups))
+         (sstore, vectors, morph, data_dir, lex_attr_getters, lookups, get_noun_chunks))


def unpickle_vocab(sstore, vectors, morphology, data_dir,
-         lex_attr_getters, lookups):
+         lex_attr_getters, lookups, get_noun_chunks):
    cdef Vocab vocab = Vocab()
    vocab.vectors = vectors
    vocab.strings = sstore

@@ -564,6 +565,7 @@ def unpickle_vocab(sstore, vectors, morphology, data_dir,
    vocab.data_dir = data_dir
    vocab.lex_attr_getters = srsly.pickle_loads(lex_attr_getters)
    vocab.lookups = lookups
+     vocab.get_noun_chunks = get_noun_chunks
    return vocab
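With `get_noun_chunks` now part of the pickle state, the noun-chunk iterator survives a `Vocab` round trip. A short sketch of what that looks like from user code (assuming a language that ships syntax iterators, such as English):

```python
import spacy
import srsly

# Sketch: pickle a Vocab and check that the noun-chunk iterator is restored.
nlp = spacy.blank("en")
data = srsly.pickle_dumps(nlp.vocab)
restored = srsly.pickle_loads(data)
assert restored.get_noun_chunks is not None
```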
@@ -67,7 +67,7 @@ data format used by the lookup and rule-based lemmatizers, see
> lemmatizer = nlp.add_pipe("lemmatizer")
>
> # Construction via add_pipe with custom settings
- > config = {"mode": "rule", overwrite=True}
+ > config = {"mode": "rule", "overwrite": True}
> lemmatizer = nlp.add_pipe("lemmatizer", config=config)
> ```
@@ -44,7 +44,7 @@ be shown.

## PhraseMatcher.\_\_call\_\_ {#call tag="method"}

- Find all token sequences matching the supplied patterns on the `Doc`.
+ Find all token sequences matching the supplied patterns on the `Doc` or `Span`.

> #### Example
>

@@ -59,7 +59,7 @@ Find all token sequences matching the supplied patterns on the `Doc`.

| Name | Description |
| --- | --- |
- | `doc` | The document to match over. ~~Doc~~ |
+ | `doclike` | The `Doc` or `Span` to match over. ~~Union[Doc, Span]~~ |
| _keyword-only_ | |
| `as_spans` <Tag variant="new">3</Tag> | Instead of tuples, return a list of [`Span`](/api/span) objects of the matches, with the `match_id` assigned as the span label. Defaults to `False`. ~~bool~~ |
| **RETURNS** | A list of `(match_id, start, end)` tuples, describing the matches. A match tuple describes a span `doc[start:end`]. The `match_id` is the ID of the added match pattern. If `as_spans` is set to `True`, a list of `Span` objects is returned instead. ~~Union[List[Tuple[int, int, int]], List[Span]]~~ |
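The doc change above reflects that `PhraseMatcher.__call__` now accepts a `Span` as well as a `Doc`. A brief sketch of what that enables (assuming a spaCy version that includes this change; the pattern and texts are placeholders):

```python
import spacy
from spacy.matcher import PhraseMatcher

nlp = spacy.blank("en")
matcher = PhraseMatcher(nlp.vocab)
matcher.add("GREETING", [nlp.make_doc("good morning")])

doc = nlp("He said good morning. Then he left. She said good morning too.")
# Matching over a Span restricts the search to that slice of the Doc,
# while calling on the whole Doc behaves as before.
span_matches = matcher(doc[0:5])
doc_matches = matcher(doc)
print(len(span_matches), len(doc_matches))
```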
@@ -727,7 +727,7 @@ capitalization by including a mix of capitalized and lowercase examples. See the

Create a data augmentation callback that uses orth-variant replacement. The
callback can be added to a corpus or other data iterator during training. It's
- is especially useful for punctuation and case replacement, to help generalize
+ especially useful for punctuation and case replacement, to help generalize
beyond corpora that don't have smart quotes, or only have smart quotes etc.

| Name | Description |
@@ -4,8 +4,8 @@ import { Help } from 'components/typography'; import Link from 'components/link'

| Pipeline | Parser | Tagger | NER |
| --- | ---: | ---: | ---: |
- | [`en_core_web_trf`](/models/en#en_core_web_trf) (spaCy v3) | 95.2 | 97.8 | 89.9 |
+ | [`en_core_web_trf`](/models/en#en_core_web_trf) (spaCy v3) | 95.1 | 97.8 | 89.8 |
- | [`en_core_web_lg`](/models/en#en_core_web_lg) (spaCy v3) | 91.9 | 97.4 | 85.5 |
+ | [`en_core_web_lg`](/models/en#en_core_web_lg) (spaCy v3) | 92.0 | 97.4 | 85.5 |
| `en_core_web_lg` (spaCy v2) | 91.9 | 97.2 | 85.5 |

<figcaption class="caption">

@@ -22,7 +22,7 @@ the development set).

| Named Entity Recognition System | OntoNotes | CoNLL '03 |
| --- | ---: | ---: |
- | spaCy RoBERTa (2020) | 89.7 | 91.6 |
+ | spaCy RoBERTa (2020) | 89.8 | 91.6 |
| Stanza (StanfordNLP)<sup>1</sup> | 88.8 | 92.1 |
| Flair<sup>2</sup> | 89.7 | 93.1 |

@@ -77,7 +77,7 @@ import Benchmarks from 'usage/\_benchmarks-models.md'

| Dependency Parsing System | UAS | LAS |
| --- | ---: | ---: |
- | spaCy RoBERTa (2020) | 95.5 | 94.3 |
+ | spaCy RoBERTa (2020) | 95.1 | 93.7 |
| [Mrini et al.](https://khalilmrini.github.io/Label_Attention_Layer.pdf) (2019) | 97.4 | 96.3 |
| [Zhou and Zhao](https://www.aclweb.org/anthology/P19-1230/) (2019) | 97.2 | 95.7 |
@@ -69,9 +69,9 @@ python -m spacy project clone pipelines/tagger_parser_ud

By default, the project will be cloned into the current working directory. You
can specify an optional second argument to define the output directory. The
- `--repo` option lets you define a custom repo to clone from if you don't want
- to use the spaCy [`projects`](https://github.com/explosion/projects) repo. You
- can also use any private repo you have access to with Git.
+ `--repo` option lets you define a custom repo to clone from if you don't want to
+ use the spaCy [`projects`](https://github.com/explosion/projects) repo. You can
+ also use any private repo you have access to with Git.

### 2. Fetch the project assets {#assets}
@@ -221,6 +221,7 @@ pipelines.
| `title` | An optional project title used in `--help` message and [auto-generated docs](#custom-docs). |
| `description` | An optional project description used in [auto-generated docs](#custom-docs). |
| `vars` | A dictionary of variables that can be referenced in paths, URLs and scripts, just like [`config.cfg` variables](/usage/training#config-interpolation). For example, `${vars.name}` will use the value of the variable `name`. Variables need to be defined in the section `vars`, but can be a nested dict, so you're able to reference `${vars.model.name}`. |
+ | `env` | A dictionary of variables, mapped to the names of environment variables that will be read in when running the project. For example, `${env.name}` will use the value of the environment variable defined as `name`. |
| `directories` | An optional list of [directories](#project-files) that should be created in the project for assets, training outputs, metrics etc. spaCy will make sure that these directories always exist. |
| `assets` | A list of assets that can be fetched with the [`project assets`](/api/cli#project-assets) command. `url` defines a URL or local path, `dest` is the destination file relative to the project directory, and an optional `checksum` ensures that an error is raised if the file's checksum doesn't match. Instead of `url`, you can also provide a `git` block with the keys `repo`, `branch` and `path`, to download from a Git repo. |
| `workflows` | A dictionary of workflow names, mapped to a list of command names, to execute in order. Workflows can be run with the [`project run`](/api/cli#project-run) command. |
@@ -310,8 +311,8 @@ company-internal and not available over the internet. In that case, you can
specify the destination paths and a checksum, and leave out the URL. When your
teammates clone and run your project, they can place the files in the respective
directory themselves. The [`project assets`](/api/cli#project-assets) command
- will alert you about missing files and mismatched checksums, so you can ensure that
- others are running your project with the same data.
+ will alert you about missing files and mismatched checksums, so you can ensure
+ that others are running your project with the same data.

### Dependencies and outputs {#deps-outputs}
@@ -358,9 +359,10 @@ graphs based on the dependencies and outputs, and won't re-run previous steps
automatically. For instance, if you only run the command `train` that depends on
data created by `preprocess` and those files are missing, spaCy will show an
error – it won't just re-run `preprocess`. If you're looking for more advanced
- data management, check out the [Data Version Control (DVC) integration](#dvc). If you're planning on integrating your spaCy project with DVC, you
- can also use `outputs_no_cache` instead of `outputs` to define outputs that
- won't be cached or tracked.
+ data management, check out the [Data Version Control (DVC) integration](#dvc).
+ If you're planning on integrating your spaCy project with DVC, you can also use
+ `outputs_no_cache` instead of `outputs` to define outputs that won't be cached
+ or tracked.

### Files and directory structure {#project-files}
@@ -467,7 +469,9 @@ In your `project.yml`, you can then run the script by calling
`python scripts/custom_evaluation.py` with the function arguments. You can also
use the `vars` section to define reusable variables that will be substituted in
commands, paths and URLs. In this example, the batch size is defined as a
- variable will be added in place of `${vars.batch_size}` in the script.
+ variable will be added in place of `${vars.batch_size}` in the script. Just like
+ in the [training config](/usage/training##config-overrides), you can also
+ override settings on the command line – for example using `--vars.batch_size`.

> #### Calling into Python
>
@@ -491,6 +495,29 @@ commands:
    - 'corpus/eval.json'
```

+ You can also use the `env` section to reference **environment variables** and
+ make their values available to the commands. This can be useful for overriding
+ settings on the command line and passing through system-level settings.
+
+ > #### Usage example
+ >
+ > ```bash
+ > export GPU_ID=1
+ > BATCH_SIZE=128 python -m spacy project run evaluate
+ > ```
+
+ ```yaml
+ ### project.yml
+ env:
+   batch_size: BATCH_SIZE
+   gpu_id: GPU_ID
+
+ commands:
+   - name: evaluate
+     script:
+       - 'python scripts/custom_evaluation.py ${env.batch_size}'
+ ```

### Documenting your project {#custom-docs}

> #### Readme Example
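For completeness, the resolved `${env.batch_size}` value simply arrives as a command-line argument to the script named in the command. A hypothetical `scripts/custom_evaluation.py` (not part of the diff; the argument handling is just one way to do it) might look like:

```python
import sys

# Hypothetical sketch: the project command passes the resolved
# ${env.batch_size} value as the first CLI argument.
def main(batch_size: int) -> None:
    print(f"Evaluating with batch size {batch_size}")


if __name__ == "__main__":
    main(int(sys.argv[1]))
```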
@@ -185,7 +185,7 @@ sections of a config file are:

For a full overview of spaCy's config format and settings, see the
[data format documentation](/api/data-formats#config) and
- [Thinc's config system docs](https://thinc.ai/usage/config). The settings
+ [Thinc's config system docs](https://thinc.ai/docs/usage-config). The settings
available for the different architectures are documented with the
[model architectures API](/api/architectures). See the Thinc documentation for
[optimizers](https://thinc.ai/docs/api-optimizers) and
@@ -198,6 +198,7 @@
    "has_examples": true
  },
  { "code": "tl", "name": "Tagalog" },
+   { "code": "tn", "name": "Setswana", "has_examples": true },
  { "code": "tr", "name": "Turkish", "example": "Bu bir cümledir.", "has_examples": true },
  { "code": "tt", "name": "Tatar", "has_examples": true },
  {