mirror of
https://github.com/explosion/spaCy.git
synced 2025-01-03 22:06:37 +03:00
Merge branch 'master' into spacy.io
commit
3246cf8b2b
106
.github/contributors/peter-exos.md
vendored
Normal file
|
@ -0,0 +1,106 @@
|
|||
# spaCy contributor agreement
|
||||
|
||||
This spaCy Contributor Agreement (**"SCA"**) is based on the
|
||||
[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf).
|
||||
The SCA applies to any contribution that you make to any product or project
|
||||
managed by us (the **"project"**), and sets out the intellectual property rights
|
||||
you grant to us in the contributed materials. The term **"us"** shall mean
|
||||
[ExplosionAI GmbH](https://explosion.ai/legal). The term
|
||||
**"you"** shall mean the person or entity identified below.
|
||||
|
||||
If you agree to be bound by these terms, fill in the information requested
|
||||
below and include the filled-in version with your first pull request, under the
|
||||
folder [`.github/contributors/`](/.github/contributors/). The name of the file
|
||||
should be your GitHub username, with the extension `.md`. For example, the user
|
||||
example_user would create the file `.github/contributors/example_user.md`.
|
||||
|
||||
Read this agreement carefully before signing. These terms and conditions
|
||||
constitute a binding legal agreement.
|
||||
|
||||
## Contributor Agreement
|
||||
|
||||
1. The term "contribution" or "contributed materials" means any source code,
|
||||
object code, patch, tool, sample, graphic, specification, manual,
|
||||
documentation, or any other material posted or submitted by you to the project.
|
||||
|
||||
2. With respect to any worldwide copyrights, or copyright applications and
|
||||
registrations, in your contribution:
|
||||
|
||||
* you hereby assign to us joint ownership, and to the extent that such
|
||||
assignment is or becomes invalid, ineffective or unenforceable, you hereby
|
||||
grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
|
||||
royalty-free, unrestricted license to exercise all rights under those
|
||||
copyrights. This includes, at our option, the right to sublicense these same
|
||||
rights to third parties through multiple levels of sublicensees or other
|
||||
licensing arrangements;
|
||||
|
||||
* you agree that each of us can do all things in relation to your
|
||||
contribution as if each of us were the sole owners, and if one of us makes
|
||||
a derivative work of your contribution, the one who makes the derivative
|
||||
work (or has it made) will be the sole owner of that derivative work;
|
||||
|
||||
* you agree that you will not assert any moral rights in your contribution
|
||||
against us, our licensees or transferees;
|
||||
|
||||
* you agree that we may register a copyright in your contribution and
|
||||
exercise all ownership rights associated with it; and
|
||||
|
||||
* you agree that neither of us has any duty to consult with, obtain the
|
||||
consent of, pay or render an accounting to the other for any use or
|
||||
distribution of your contribution.
|
||||
|
||||
3. With respect to any patents you own, or that you can license without payment
|
||||
to any third party, you hereby grant to us a perpetual, irrevocable,
|
||||
non-exclusive, worldwide, no-charge, royalty-free license to:
|
||||
|
||||
* make, have made, use, sell, offer to sell, import, and otherwise transfer
|
||||
your contribution in whole or in part, alone or in combination with or
|
||||
included in any product, work or materials arising out of the project to
|
||||
which your contribution was submitted, and
|
||||
|
||||
* at our option, to sublicense these same rights to third parties through
|
||||
multiple levels of sublicensees or other licensing arrangements.
|
||||
|
||||
4. Except as set out above, you keep all right, title, and interest in your
|
||||
contribution. The rights that you grant to us under these terms are effective
|
||||
on the date you first submitted a contribution to us, even if your submission
|
||||
took place before the date you sign these terms.
|
||||
|
||||
5. You covenant, represent, warrant and agree that:
|
||||
|
||||
* Each contribution that you submit is and shall be an original work of
|
||||
authorship and you can legally grant the rights set out in this SCA;
|
||||
|
||||
* to the best of your knowledge, each contribution will not violate any
|
||||
third party's copyrights, trademarks, patents, or other intellectual
|
||||
property rights; and
|
||||
|
||||
* each contribution shall be in compliance with U.S. export control laws and
|
||||
other applicable export and import laws. You agree to notify us if you
|
||||
become aware of any circumstance which would make any of the foregoing
|
||||
representations inaccurate in any respect. We may publicly disclose your
|
||||
participation in the project, including the fact that you have signed the SCA.
|
||||
|
||||
6. This SCA is governed by the laws of the State of California and applicable
|
||||
U.S. Federal law. Any choice of law rules will not apply.
|
||||
|
||||
7. Please place an “x” on one of the applicable statements below. Please do NOT
|
||||
mark both statements:
|
||||
|
||||
* [ ] I am signing on behalf of myself as an individual and no other person
|
||||
or entity, including my employer, has or will have rights with respect to my
|
||||
contributions.
|
||||
|
||||
* [x] I am signing on behalf of my employer or a legal entity and I have the
|
||||
actual authority to contractually bind that entity.
|
||||
|
||||
## Contributor Details
|
||||
|
||||
| Field | Entry |
|
||||
|------------------------------- | -------------------- |
|
||||
| Name | Peter Baumann |
|
||||
| Company name (if applicable) | Exos Financial |
|
||||
| Title or role (if applicable) | data scientist |
|
||||
| Date | Feb 1st, 2021 |
|
||||
| GitHub username | peter-exos |
|
||||
| Website (optional) | |
|
|
@ -1,6 +1,6 @@
|
|||
# fmt: off
|
||||
__title__ = "spacy"
|
||||
__version__ = "3.0.1"
|
||||
__version__ = "3.0.3"
|
||||
__download_url__ = "https://github.com/explosion/spacy-models/releases/download"
|
||||
__compatibility__ = "https://raw.githubusercontent.com/explosion/spacy-models/master/compatibility.json"
|
||||
__projects__ = "https://github.com/explosion/projects"
|
||||
|
|
|
@ -16,7 +16,7 @@ import os
|
|||
|
||||
from ..schemas import ProjectConfigSchema, validate
|
||||
from ..util import import_file, run_command, make_tempdir, registry, logger
|
||||
from ..util import is_compatible_version, ENV_VARS
|
||||
from ..util import is_compatible_version, SimpleFrozenDict, ENV_VARS
|
||||
from .. import about
|
||||
|
||||
if TYPE_CHECKING:
|
||||
|
@ -111,26 +111,33 @@ def _parse_overrides(args: List[str], is_cli: bool = False) -> Dict[str, Any]:
|
|||
value = "true"
|
||||
else:
|
||||
value = args.pop(0)
|
||||
# Just like we do in the config, we're calling json.loads on the
|
||||
# values. But since they come from the CLI, it'd be unintuitive to
|
||||
# explicitly mark strings with escaped quotes. So we're working
|
||||
# around that here by falling back to a string if parsing fails.
|
||||
# TODO: improve logic to handle simple types like list of strings?
|
||||
try:
|
||||
result[opt] = srsly.json_loads(value)
|
||||
except ValueError:
|
||||
result[opt] = str(value)
|
||||
result[opt] = _parse_override(value)
|
||||
else:
|
||||
msg.fail(f"{err}: name should start with --", exits=1)
|
||||
return result
|
||||
|
||||
|
||||
def load_project_config(path: Path, interpolate: bool = True) -> Dict[str, Any]:
|
||||
def _parse_override(value: Any) -> Any:
|
||||
# Just like we do in the config, we're calling json.loads on the
|
||||
# values. But since they come from the CLI, it'd be unintuitive to
|
||||
# explicitly mark strings with escaped quotes. So we're working
|
||||
# around that here by falling back to a string if parsing fails.
|
||||
# TODO: improve logic to handle simple types like list of strings?
|
||||
try:
|
||||
return srsly.json_loads(value)
|
||||
except ValueError:
|
||||
return str(value)
|
||||
|
||||
|
||||
def load_project_config(
|
||||
path: Path, interpolate: bool = True, overrides: Dict[str, Any] = SimpleFrozenDict()
|
||||
) -> Dict[str, Any]:
|
||||
"""Load the project.yml file from a directory and validate it. Also make
|
||||
sure that all directories defined in the config exist.
|
||||
|
||||
path (Path): The path to the project directory.
|
||||
interpolate (bool): Whether to substitute project variables.
|
||||
overrides (Dict[str, Any]): Optional config overrides.
|
||||
RETURNS (Dict[str, Any]): The loaded project.yml.
|
||||
"""
|
||||
config_path = path / PROJECT_FILE
|
||||
|
@ -154,20 +161,36 @@ def load_project_config(path: Path, interpolate: bool = True) -> Dict[str, Any]:
|
|||
if not dir_path.exists():
|
||||
dir_path.mkdir(parents=True)
|
||||
if interpolate:
|
||||
err = "project.yml validation error"
|
||||
err = f"{PROJECT_FILE} validation error"
|
||||
with show_validation_error(title=err, hint_fill=False):
|
||||
config = substitute_project_variables(config)
|
||||
config = substitute_project_variables(config, overrides)
|
||||
return config
|
||||
|
||||
|
||||
def substitute_project_variables(config: Dict[str, Any], overrides: Dict = {}):
|
||||
key = "vars"
|
||||
def substitute_project_variables(
|
||||
config: Dict[str, Any],
|
||||
overrides: Dict[str, Any] = SimpleFrozenDict(),
|
||||
key: str = "vars",
|
||||
env_key: str = "env",
|
||||
) -> Dict[str, Any]:
|
||||
"""Interpolate variables in the project file using the config system.
|
||||
|
||||
config (Dict[str, Any]): The project config.
|
||||
overrides (Dict[str, Any]): Optional config overrides.
|
||||
key (str): Key containing variables in project config.
|
||||
env_key (str): Key containing environment variable mapping in project config.
|
||||
RETURNS (Dict[str, Any]): The interpolated project config.
|
||||
"""
|
||||
config.setdefault(key, {})
|
||||
config[key].update(overrides)
|
||||
config.setdefault(env_key, {})
|
||||
# Substitute references to env vars with their values
|
||||
for config_var, env_var in config[env_key].items():
|
||||
config[env_key][config_var] = _parse_override(os.environ.get(env_var, ""))
|
||||
# Need to put variables in the top scope again so we can have a top-level
|
||||
# section "project" (otherwise, a list of commands in the top scope wouldn't
|
||||
# be allowed by Thinc's config system)
|
||||
cfg = Config({"project": config, key: config[key]})
|
||||
cfg = Config({"project": config, key: config[key], env_key: config[env_key]})
|
||||
cfg = Config().from_str(cfg.to_str(), overrides=overrides)
|
||||
interpolated = cfg.interpolate()
|
||||
return dict(interpolated["project"])
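The `env` handling added in this hunk can be illustrated with a small standalone sketch. It assumes plain `json` in place of `srsly`, and the helper names `parse_override` and `resolve_env`, the mapping and the fake environment are hypothetical stand-ins for illustration only, not spaCy's API.

```python
import json
from typing import Any, Dict


def parse_override(value: str) -> Any:
    """Try to read the value as JSON; fall back to the raw string."""
    try:
        return json.loads(value)
    except ValueError:
        # Unquoted strings such as "baseline" are not valid JSON.
        return str(value)


def resolve_env(env_mapping: Dict[str, str], environ: Dict[str, str]) -> Dict[str, Any]:
    """Replace each environment variable name with its parsed value."""
    return {
        name: parse_override(environ.get(env_var, ""))
        for name, env_var in env_mapping.items()
    }


# Hypothetical environment and mapping (config variable -> env variable name).
fake_environ = {"BATCH_SIZE": "128", "USE_GPU": "true", "RUN_NAME": "baseline"}
mapping = {
    "batch_size": "BATCH_SIZE",
    "gpu": "USE_GPU",
    "name": "RUN_NAME",
    "missing": "NOT_SET",
}
resolved = resolve_env(mapping, fake_environ)
```

In this sketch an unset variable resolves to an empty string, because `""` is not valid JSON and the fallback branch returns it unchanged. Passing `os.environ` instead of `fake_environ` would read the real environment.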
|
||||
|
||||
|
|
|
@ -175,10 +175,13 @@ def render_parses(
|
|||
def print_prf_per_type(
|
||||
msg: Printer, scores: Dict[str, Dict[str, float]], name: str, type: str
|
||||
) -> None:
|
||||
data = [
|
||||
(k, f"{v['p']*100:.2f}", f"{v['r']*100:.2f}", f"{v['f']*100:.2f}")
|
||||
for k, v in scores.items()
|
||||
]
|
||||
data = []
|
||||
for key, value in scores.items():
|
||||
row = [key]
|
||||
for k in ("p", "r", "f"):
|
||||
v = value[k]
|
||||
row.append(f"{v * 100:.2f}" if isinstance(v, (int, float)) else v)
|
||||
data.append(row)
|
||||
msg.table(
|
||||
data,
|
||||
header=("", "P", "R", "F"),
|
||||
|
@ -191,7 +194,10 @@ def print_textcats_auc_per_cat(
|
|||
msg: Printer, scores: Dict[str, Dict[str, float]]
|
||||
) -> None:
|
||||
msg.table(
|
||||
[(k, f"{v:.2f}") for k, v in scores.items()],
|
||||
[
|
||||
(k, f"{v:.2f}" if isinstance(v, (float, int)) else v)
|
||||
for k, v in scores.items()
|
||||
],
|
||||
header=("", "ROC AUC"),
|
||||
aligns=("l", "r"),
|
||||
title="Textcat ROC AUC (per label)",
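The change in both printing helpers follows the same pattern: format a score only if it is numeric and pass anything else through untouched. A minimal sketch of that pattern, assuming a missing score shows up as `None` (the diff does not say which non-numeric values occur) and using a hypothetical `format_prf_rows` helper:

```python
from typing import Any, Dict, List


def format_prf_rows(scores: Dict[str, Dict[str, Any]]) -> List[List[Any]]:
    """Build table rows, formatting only numeric precision/recall/F values."""
    rows = []
    for label, values in scores.items():
        row = [label]
        for key in ("p", "r", "f"):
            v = values[key]
            # Non-numeric values are kept as they are instead of raising.
            row.append(f"{v * 100:.2f}" if isinstance(v, (int, float)) else v)
        rows.append(row)
    return rows


rows = format_prf_rows(
    {
        "PERSON": {"p": 0.9, "r": 0.8, "f": 0.847},
        "ORG": {"p": None, "r": None, "f": None},  # assumed "no score" marker
    }
)
```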
|
||||
|
|
|
@ -3,19 +3,23 @@ from pathlib import Path
|
|||
from wasabi import msg
|
||||
import sys
|
||||
import srsly
|
||||
import typer
|
||||
|
||||
from ... import about
|
||||
from ...git_info import GIT_VERSION
|
||||
from ...util import working_dir, run_command, split_command, is_cwd, join_command
|
||||
from ...util import SimpleFrozenList, is_minor_version_match, ENV_VARS
|
||||
from ...util import check_bool_env_var
|
||||
from ...util import check_bool_env_var, SimpleFrozenDict
|
||||
from .._util import PROJECT_FILE, PROJECT_LOCK, load_project_config, get_hash
|
||||
from .._util import get_checksum, project_cli, Arg, Opt, COMMAND
|
||||
from .._util import get_checksum, project_cli, Arg, Opt, COMMAND, parse_config_overrides
|
||||
|
||||
|
||||
@project_cli.command("run")
|
||||
@project_cli.command(
|
||||
"run", context_settings={"allow_extra_args": True, "ignore_unknown_options": True}
|
||||
)
|
||||
def project_run_cli(
|
||||
# fmt: off
|
||||
ctx: typer.Context, # This is only used to read additional arguments
|
||||
subcommand: str = Arg(None, help=f"Name of command defined in the {PROJECT_FILE}"),
|
||||
project_dir: Path = Arg(Path.cwd(), help="Location of project directory. Defaults to current working directory.", exists=True, file_okay=False),
|
||||
force: bool = Opt(False, "--force", "-F", help="Force re-running steps, even if nothing changed"),
|
||||
|
@ -33,13 +37,15 @@ def project_run_cli(
|
|||
if show_help or not subcommand:
|
||||
print_run_help(project_dir, subcommand)
|
||||
else:
|
||||
project_run(project_dir, subcommand, force=force, dry=dry)
|
||||
overrides = parse_config_overrides(ctx.args)
|
||||
project_run(project_dir, subcommand, overrides=overrides, force=force, dry=dry)
|
||||
|
||||
|
||||
def project_run(
|
||||
project_dir: Path,
|
||||
subcommand: str,
|
||||
*,
|
||||
overrides: Dict[str, Any] = SimpleFrozenDict(),
|
||||
force: bool = False,
|
||||
dry: bool = False,
|
||||
capture: bool = False,
|
||||
|
@ -59,7 +65,7 @@ def project_run(
|
|||
when you want to turn over execution to the command, and capture=True
|
||||
when you want to run the command more like a function.
|
||||
"""
|
||||
config = load_project_config(project_dir)
|
||||
config = load_project_config(project_dir, overrides=overrides)
|
||||
commands = {cmd["name"]: cmd for cmd in config.get("commands", [])}
|
||||
workflows = config.get("workflows", {})
|
||||
validate_subcommand(commands.keys(), workflows.keys(), subcommand)
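With `allow_extra_args` enabled, any extra command-line arguments end up in `ctx.args` and are turned into an overrides dictionary. The sketch below shows one way such a list could be parsed. It is a simplified assumption for illustration: the function name `parse_overrides` is hypothetical, it uses plain `json` in place of `srsly`, and it is not a copy of spaCy's `parse_config_overrides`.

```python
import json
from typing import Any, Dict, List


def parse_overrides(args: List[str]) -> Dict[str, Any]:
    """Turn ["--a.b", "1", "--c=x", "--flag"] into a dict of overrides."""
    args = list(args)  # work on a copy so the caller's list is untouched
    result: Dict[str, Any] = {}
    while args:
        opt = args.pop(0)
        if not opt.startswith("--"):
            raise ValueError(f"Invalid override '{opt}': name should start with --")
        opt = opt[2:]
        if "=" in opt:
            # Form: --name=value
            opt, value = opt.split("=", 1)
        elif not args or args[0].startswith("--"):
            # A flag without a value is treated as boolean true.
            value = "true"
        else:
            # Form: --name value
            value = args.pop(0)
        try:
            result[opt] = json.loads(value)
        except ValueError:
            result[opt] = value
    return result


overrides = parse_overrides(["--vars.epochs", "10", "--vars.name=demo", "--vars.gpu"])
```

The JSON-with-fallback step means numbers and booleans arrive typed, while bare words such as `demo` stay strings without needing escaped quotes on the command line.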
|
||||
|
|
|
@ -28,6 +28,15 @@ bg:
|
|||
accuracy:
|
||||
name: iarfmoose/roberta-base-bulgarian
|
||||
size_factor: 3
|
||||
bn:
|
||||
word_vectors: null
|
||||
transformer:
|
||||
efficiency:
|
||||
name: sagorsarker/bangla-bert-base
|
||||
size_factor: 3
|
||||
accuracy:
|
||||
name: sagorsarker/bangla-bert-base
|
||||
size_factor: 3
|
||||
da:
|
||||
word_vectors: da_core_news_lg
|
||||
transformer:
|
||||
|
@ -104,10 +113,10 @@ hi:
|
|||
word_vectors: null
|
||||
transformer:
|
||||
efficiency:
|
||||
name: monsoon-nlp/hindi-tpu-electra
|
||||
name: ai4bharat/indic-bert
|
||||
size_factor: 3
|
||||
accuracy:
|
||||
name: monsoon-nlp/hindi-tpu-electra
|
||||
name: ai4bharat/indic-bert
|
||||
size_factor: 3
|
||||
id:
|
||||
word_vectors: null
|
||||
|
@ -185,10 +194,10 @@ si:
|
|||
word_vectors: null
|
||||
transformer:
|
||||
efficiency:
|
||||
name: keshan/SinhalaBERTo
|
||||
name: setu4993/LaBSE
|
||||
size_factor: 3
|
||||
accuracy:
|
||||
name: keshan/SinhalaBERTo
|
||||
name: setu4993/LaBSE
|
||||
size_factor: 3
|
||||
sv:
|
||||
word_vectors: null
|
||||
|
@ -203,10 +212,10 @@ ta:
|
|||
word_vectors: null
|
||||
transformer:
|
||||
efficiency:
|
||||
name: monsoon-nlp/tamillion
|
||||
name: ai4bharat/indic-bert
|
||||
size_factor: 3
|
||||
accuracy:
|
||||
name: monsoon-nlp/tamillion
|
||||
name: ai4bharat/indic-bert
|
||||
size_factor: 3
|
||||
te:
|
||||
word_vectors: null
|
||||
|
|
|
@ -579,8 +579,8 @@ class Errors:
|
|||
E922 = ("Component '{name}' has been initialized with an output dimension of "
|
||||
"{nO} - cannot add any more labels.")
|
||||
E923 = ("It looks like there is no proper sample data to initialize the "
|
||||
"Model of component '{name}'. This is likely a bug in spaCy, so "
|
||||
"feel free to open an issue: https://github.com/explosion/spaCy/issues")
|
||||
"Model of component '{name}'. To check your input data paths and "
|
||||
"annotation, run: python -m spacy debug data config.cfg")
|
||||
E924 = ("The '{name}' component does not seem to be initialized properly. "
|
||||
"This is likely a bug in spaCy, so feel free to open an issue: "
|
||||
"https://github.com/explosion/spaCy/issues")
|
||||
|
|
|
@ -4,30 +4,30 @@
|
|||
STOP_WORDS = set(
|
||||
"""
|
||||
ግን አንቺ አንተ እናንተ ያንተ ያንቺ የናንተ ራስህን ራስሽን ራሳችሁን
|
||||
ሁሉ ኋላ በሰሞኑ አሉ በኋላ ሁኔታ በኩል አስታውቀዋል ሆነ በውስጥ
|
||||
አስታውሰዋል ሆኑ ባጣም እስካሁን ሆኖም በተለይ አሳሰበ ሁል በተመለከተ
|
||||
አሳስበዋል ላይ በተመሳሳይ አስፈላጊ ሌላ የተለያየ አስገነዘቡ ሌሎች የተለያዩ
|
||||
አስገንዝበዋል ልዩ ተባለ አብራርተዋል መሆኑ ተገለጸ አስረድተዋል ተገልጿል
|
||||
ማለቱ ተጨማሪ እባክህ የሚገኝ ተከናወነ እባክሽ ማድረግ ችግር አንጻር ማን
|
||||
ትናንት እስኪደርስ ነበረች እንኳ ሰሞኑን ነበሩ እንኳን ሲሆን ነበር እዚሁ ሲል
|
||||
ነው እንደገለጹት አለ ና እንደተናገሩት ቢሆን ነገር እንዳስረዱት ብለዋል ነገሮች
|
||||
እንደገና ብዙ ናት ወቅት ቦታ ናቸው እንዲሁም በርካታ አሁን እንጂ እስከ
|
||||
ማለት የሚሆኑት ስለማናቸውም ውስጥ ይሆናሉ ሲባል ከሆነው ስለዚሁ ከአንድ
|
||||
ያልሆነ ሳለ የነበረውን ከአንዳንድ በማናቸውም በሙሉ የሆነው ያሉ በእነዚሁ
|
||||
ወር መሆናቸው ከሌሎች በዋና አንዲት ወይም
|
||||
በላይ እንደ በማቀድ ለሌሎች በሆኑ ቢሆንም ጊዜና ይሆኑበታል በሆነ አንዱ
|
||||
ለዚህ ለሆነው ለነዚህ ከዚህ የሌላውን ሶስተኛ አንዳንድ ለማንኛውም የሆነ ከሁለት
|
||||
የነገሩ ሰኣት አንደኛ እንዲሆን እንደነዚህ ማንኛውም ካልሆነ የሆኑት ጋር ቢያንስ
|
||||
ሁሉ ኋላ በሰሞኑ አሉ በኋላ ሁኔታ በኩል አስታውቀዋል ሆነ በውስጥ
|
||||
አስታውሰዋል ሆኑ ባጣም እስካሁን ሆኖም በተለይ አሳሰበ ሁል በተመለከተ
|
||||
አሳስበዋል ላይ በተመሳሳይ አስፈላጊ ሌላ የተለያየ አስገነዘቡ ሌሎች የተለያዩ
|
||||
አስገንዝበዋል ልዩ ተባለ አብራርተዋል መሆኑ ተገለጸ አስረድተዋል ተገልጿል
|
||||
ማለቱ ተጨማሪ እባክህ የሚገኝ ተከናወነ እባክሽ ማድረግ ችግር አንጻር ማን
|
||||
ትናንት እስኪደርስ ነበረች እንኳ ሰሞኑን ነበሩ እንኳን ሲሆን ነበር እዚሁ ሲል
|
||||
ነው እንደገለጹት አለ ና እንደተናገሩት ቢሆን ነገር እንዳስረዱት ብለዋል ነገሮች
|
||||
እንደገና ብዙ ናት ወቅት ቦታ ናቸው እንዲሁም በርካታ አሁን እንጂ እስከ
|
||||
ማለት የሚሆኑት ስለማናቸውም ውስጥ ይሆናሉ ሲባል ከሆነው ስለዚሁ ከአንድ
|
||||
ያልሆነ ሳለ የነበረውን ከአንዳንድ በማናቸውም በሙሉ የሆነው ያሉ በእነዚሁ
|
||||
ወር መሆናቸው ከሌሎች በዋና አንዲት ወይም
|
||||
በላይ እንደ በማቀድ ለሌሎች በሆኑ ቢሆንም ጊዜና ይሆኑበታል በሆነ አንዱ
|
||||
ለዚህ ለሆነው ለነዚህ ከዚህ የሌላውን ሶስተኛ አንዳንድ ለማንኛውም የሆነ ከሁለት
|
||||
የነገሩ ሰኣት አንደኛ እንዲሆን እንደነዚህ ማንኛውም ካልሆነ የሆኑት ጋር ቢያንስ
|
||||
ይህንንም እነደሆነ እነዚህን ይኸው የማናቸውም
|
||||
በሙሉም ይህችው በተለይም አንዱን የሚችለውን በነዚህ ከእነዚህ በሌላ
|
||||
የዚሁ ከእነዚሁ ለዚሁ በሚገባ ለእያንዳንዱ የአንቀጹ ወደ ይህም ስለሆነ ወይ
|
||||
ማናቸውንም ተብሎ እነዚህ መሆናቸውን የሆነችን ከአስር ሳይሆን ከዚያ የለውም
|
||||
የማይበልጥ እንደሆነና እንዲሆኑ በሚችሉ ብቻ ብሎ ከሌላ የሌላቸውን
|
||||
ለሆነ በሌሎች ሁለቱንም በቀር ይህ በታች አንደሆነ በነሱ
|
||||
ይህን የሌላ እንዲህ ከሆነ ያላቸው በነዚሁ በሚል የዚህ ይህንኑ
|
||||
በእንደዚህ ቁጥር ማናቸውም ሆነው ባሉ በዚህ በስተቀር ሲሆንና
|
||||
በዚህም መሆን ምንጊዜም እነዚህም በዚህና ያለ ስም
|
||||
ሲኖር ከዚህም መሆኑን በሁኔታው የማያንስ እነዚህኑ ማንም ከነዚሁ
|
||||
በሙሉም ይህችው በተለይም አንዱን የሚችለውን በነዚህ ከእነዚህ በሌላ
|
||||
የዚሁ ከእነዚሁ ለዚሁ በሚገባ ለእያንዳንዱ የአንቀጹ ወደ ይህም ስለሆነ ወይ
|
||||
ማናቸውንም ተብሎ እነዚህ መሆናቸውን የሆነችን ከአስር ሳይሆን ከዚያ የለውም
|
||||
የማይበልጥ እንደሆነና እንዲሆኑ በሚችሉ ብቻ ብሎ ከሌላ የሌላቸውን
|
||||
ለሆነ በሌሎች ሁለቱንም በቀር ይህ በታች አንደሆነ በነሱ
|
||||
ይህን የሌላ እንዲህ ከሆነ ያላቸው በነዚሁ በሚል የዚህ ይህንኑ
|
||||
በእንደዚህ ቁጥር ማናቸውም ሆነው ባሉ በዚህ በስተቀር ሲሆንና
|
||||
በዚህም መሆን ምንጊዜም እነዚህም በዚህና ያለ ስም
|
||||
ሲኖር ከዚህም መሆኑን በሁኔታው የማያንስ እነዚህኑ ማንም ከነዚሁ
|
||||
ያላቸውን እጅግ ሲሆኑ ለሆኑ ሊሆን ለማናቸውም
|
||||
""".split()
|
||||
)
|
||||
|
|
18
spacy/lang/tn/__init__.py
Normal file
|
@ -0,0 +1,18 @@
|
|||
from .stop_words import STOP_WORDS
|
||||
from .lex_attrs import LEX_ATTRS
|
||||
from .punctuation import TOKENIZER_INFIXES
|
||||
from ...language import Language
|
||||
|
||||
|
||||
class SetswanaDefaults(Language.Defaults):
|
||||
infixes = TOKENIZER_INFIXES
|
||||
stop_words = STOP_WORDS
|
||||
lex_attr_getters = LEX_ATTRS
|
||||
|
||||
|
||||
class Setswana(Language):
|
||||
lang = "tn"
|
||||
Defaults = SetswanaDefaults
|
||||
|
||||
|
||||
__all__ = ["Setswana"]
|
15
spacy/lang/tn/examples.py
Normal file
|
@ -0,0 +1,15 @@
|
|||
"""
|
||||
Example sentences to test spaCy and its language models.
|
||||
>>> from spacy.lang.tn.examples import sentences
|
||||
>>> docs = nlp.pipe(sentences)
|
||||
"""
|
||||
|
||||
|
||||
sentences = [
|
||||
"Apple e nyaka go reka JSE ka tlhwatlhwa ta R1 billion",
|
||||
"Johannesburg ke toropo e kgolo mo Afrika Borwa.",
|
||||
"O ko kae?",
|
||||
"ke mang presidente ya Afrika Borwa?",
|
||||
"ke eng toropo kgolo ya Afrika Borwa?",
|
||||
"Nelson Mandela o belegwe leng?",
|
||||
]
|
107
spacy/lang/tn/lex_attrs.py
Normal file
|
@ -0,0 +1,107 @@
|
|||
from ...attrs import LIKE_NUM
|
||||
|
||||
_num_words = [
|
||||
"lefela",
|
||||
"nngwe",
|
||||
"pedi",
|
||||
"tharo",
|
||||
"nne",
|
||||
"tlhano",
|
||||
"thataro",
|
||||
"supa",
|
||||
"robedi",
|
||||
"robongwe",
|
||||
"lesome",
|
||||
"lesomenngwe",
|
||||
"lesomepedi",
|
||||
"sometharo",
|
||||
"somenne",
|
||||
"sometlhano",
|
||||
"somethataro",
|
||||
"somesupa",
|
||||
"somerobedi",
|
||||
"somerobongwe",
|
||||
"someamabedi",
|
||||
"someamararo",
|
||||
"someamane",
|
||||
"someamatlhano",
|
||||
"someamarataro",
|
||||
"someamasupa",
|
||||
"someamarobedi",
|
||||
"someamarobongwe",
|
||||
"lekgolo",
|
||||
"sekete",
|
||||
"milione",
|
||||
"bilione",
|
||||
"terilione",
|
||||
"kwatirilione",
|
||||
"gajillione",
|
||||
"bazillione",
|
||||
]
|
||||
|
||||
|
||||
_ordinal_words = [
|
||||
"ntlha",
|
||||
"bobedi",
|
||||
"boraro",
|
||||
"bone",
|
||||
"botlhano",
|
||||
"borataro",
|
||||
"bosupa",
|
||||
"borobedi",
|
||||
"borobongwe",
|
||||
"bolesome",
|
||||
"bolesomengwe",
|
||||
"bolesomepedi",
|
||||
"bolesometharo",
|
||||
"bolesomenne",
|
||||
"bolesometlhano",
|
||||
"bolesomethataro",
|
||||
"bolesomesupa",
|
||||
"bolesomerobedi",
|
||||
"bolesomerobongwe",
|
||||
"somamabedi",
|
||||
"someamararo",
|
||||
"someamane",
|
||||
"someamatlhano",
|
||||
"someamarataro",
|
||||
"someamasupa",
|
||||
"someamarobedi",
|
||||
"someamarobongwe",
|
||||
"lekgolo",
|
||||
"sekete",
|
||||
"milione",
|
||||
"bilione",
|
||||
"terilione",
|
||||
"kwatirilione",
|
||||
"gajillione",
|
||||
"bazillione",
|
||||
]
|
||||
|
||||
|
||||
def like_num(text):
|
||||
if text.startswith(("+", "-", "±", "~")):
|
||||
text = text[1:]
|
||||
text = text.replace(",", "").replace(".", "")
|
||||
if text.isdigit():
|
||||
return True
|
||||
if text.count("/") == 1:
|
||||
num, denom = text.split("/")
|
||||
if num.isdigit() and denom.isdigit():
|
||||
return True
|
||||
|
||||
text_lower = text.lower()
|
||||
if text_lower in _num_words:
|
||||
return True
|
||||
|
||||
# Check ordinal number
|
||||
if text_lower in _ordinal_words:
|
||||
return True
|
||||
if text_lower.endswith("th"):
|
||||
if text_lower[:-2].isdigit():
|
||||
return True
|
||||
|
||||
return False
|
||||
|
||||
|
||||
LEX_ATTRS = {LIKE_NUM: like_num}
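The behaviour of `like_num` can be checked outside spaCy with a standalone copy. The sketch below keeps the same steps but assumes a shortened word list for brevity and leaves out the English-style `th` suffix check; the constant names are hypothetical.

```python
# Assumption: a small subset of the word lists from the file above.
NUM_WORDS = {"nngwe", "pedi", "tharo", "lesome", "lekgolo"}
ORDINAL_WORDS = {"bobedi", "boraro"}


def like_num(text: str) -> bool:
    """Return True if the token text looks like a number."""
    # Drop a single leading sign or approximation marker.
    if text.startswith(("+", "-", "±", "~")):
        text = text[1:]
    # Ignore thousands separators and decimal points.
    text = text.replace(",", "").replace(".", "")
    if text.isdigit():
        return True
    # Simple fractions such as 1/2.
    if text.count("/") == 1:
        num, denom = text.split("/")
        if num.isdigit() and denom.isdigit():
            return True
    lowered = text.lower()
    return lowered in NUM_WORDS or lowered in ORDINAL_WORDS


checks = {t: like_num(t) for t in ["1,000", "-3.5", "1/2", "Pedi", "boraro", "toropo"]}
```

Because membership is tested against the exact lowercased token, every entry in the word lists has to be free of stray whitespace to ever match.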
|
19
spacy/lang/tn/punctuation.py
Normal file
|
@ -0,0 +1,19 @@
|
|||
from ..char_classes import LIST_ELLIPSES, LIST_ICONS, HYPHENS
|
||||
from ..char_classes import CONCAT_QUOTES, ALPHA_LOWER, ALPHA_UPPER, ALPHA
|
||||
|
||||
_infixes = (
|
||||
LIST_ELLIPSES
|
||||
+ LIST_ICONS
|
||||
+ [
|
||||
r"(?<=[0-9])[+\-\*^](?=[0-9-])",
|
||||
r"(?<=[{al}{q}])\.(?=[{au}{q}])".format(
|
||||
al=ALPHA_LOWER, au=ALPHA_UPPER, q=CONCAT_QUOTES
|
||||
),
|
||||
r"(?<=[{a}]),(?=[{a}])".format(a=ALPHA),
|
||||
r"(?<=[{a}0-9])(?:{h})(?=[{a}])".format(a=ALPHA, h=HYPHENS),
|
||||
r"(?<=[{a}0-9])[:<>=/](?=[{a}])".format(a=ALPHA),
|
||||
]
|
||||
)
|
||||
|
||||
|
||||
TOKENIZER_INFIXES = _infixes
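To see what the infix patterns do, three of them can be tried with the standard `re` module. This is a rough sketch under stated assumptions: `ALPHA` is replaced by an ASCII-only stand-in, and the helper `split_on_infixes` is hypothetical and much simpler than spaCy's tokenizer.

```python
import re

ALPHA = "a-zA-Z"  # assumption: ASCII-only stand-in for spaCy's ALPHA class

infix_patterns = [
    r"(?<=[0-9])[+\-\*^](?=[0-9-])",                   # operators between digits
    r"(?<=[{a}]),(?=[{a}])".format(a=ALPHA),           # comma between letters
    r"(?<=[{a}0-9])[:<>=/](?=[{a}])".format(a=ALPHA),  # :<>=/ before a letter
]
infix_re = re.compile("|".join(infix_patterns))


def split_on_infixes(text: str) -> list:
    """Split text at every infix match, keeping the infix as its own piece."""
    pieces = []
    start = 0
    for match in infix_re.finditer(text):
        if match.start() > start:
            pieces.append(text[start:match.start()])
        pieces.append(match.group())
        start = match.end()
    if start < len(text):
        pieces.append(text[start:])
    return pieces


examples = {t: split_on_infixes(t) for t in ["10+5", "kgolo,nnye", "1,000", "a=b"]}
```

The lookbehind and lookahead assertions are what keep `1,000` in one piece: the comma rule only fires when letters sit on both sides.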
|
20
spacy/lang/tn/stop_words.py
Normal file
|
@ -0,0 +1,20 @@
|
|||
# Stop words
|
||||
STOP_WORDS = set(
|
||||
"""
|
||||
ke gareng ga selekanyo tlhwatlhwa yo mongwe se
|
||||
sengwe fa go le jalo gongwe ba na mo tikologong
|
||||
jaaka kwa morago nna gonne ka sa pele nako teng
|
||||
tlase fela ntle magareng tsona feta bobedi kgabaganya
|
||||
moo gape kgatlhanong botlhe tsotlhe bokana e esi
|
||||
setseng mororo dinako golo kgolo nnye wena gago
|
||||
o ntse ntle tla goreng gangwe mang yotlhe gore
|
||||
eo yona tseraganyo eng ne sentle re rona thata
|
||||
godimo fitlha pedi masomamabedi lesomepedi mmogo
|
||||
tharo tseo boraro tseno yone jaanong bobona bona
|
||||
lesome tsaya tsamaiso nngwe masomethataro thataro
|
||||
tsa mmatota tota sale thoko supa dira tshwanetse di mmalwa masisi
|
||||
bonala e tshwanang bogolo tsenya tsweetswee karolo
|
||||
sepe tlhalosa dirwa robedi robongwe lesomenngwe gaisa
|
||||
tlhano lesometlhano botlalo lekgolo
|
||||
""".split()
|
||||
)
|
|
@ -451,7 +451,7 @@ cdef class Lexeme:
|
|||
Lexeme.c_set_flag(self.c, IS_QUOTE, x)
|
||||
|
||||
property is_left_punct:
|
||||
"""RETURNS (bool): Whether the lexeme is left punctuation, e.g. )."""
|
||||
"""RETURNS (bool): Whether the lexeme is left punctuation, e.g. (."""
|
||||
def __get__(self):
|
||||
return Lexeme.c_check_flag(self.c, IS_LEFT_PUNCT)
|
||||
|
||||
|
|
|
@ -18,4 +18,4 @@ cdef class PhraseMatcher:
|
|||
cdef Pool mem
|
||||
cdef key_t _terminal_hash
|
||||
|
||||
cdef void find_matches(self, Doc doc, vector[SpanC] *matches) nogil
|
||||
cdef void find_matches(self, Doc doc, int start_idx, int end_idx, vector[SpanC] *matches) nogil
|
||||
|
|
|
@ -230,10 +230,10 @@ cdef class PhraseMatcher:
|
|||
result = internal_node
|
||||
map_set(self.mem, <MapStruct*>result, self.vocab.strings[key], NULL)
|
||||
|
||||
def __call__(self, doc, *, as_spans=False):
|
||||
def __call__(self, object doclike, *, as_spans=False):
|
||||
"""Find all sequences matching the supplied patterns on the `Doc`.
|
||||
|
||||
doc (Doc): The document to match over.
|
||||
doclike (Doc or Span): The document to match over.
|
||||
as_spans (bool): Return Span objects with labels instead of (match_id,
|
||||
start, end) tuples.
|
||||
RETURNS (list): A list of `(match_id, start, end)` tuples,
|
||||
|
@ -244,12 +244,22 @@ cdef class PhraseMatcher:
|
|||
DOCS: https://spacy.io/api/phrasematcher#call
|
||||
"""
|
||||
matches = []
|
||||
if doc is None or len(doc) == 0:
|
||||
if doclike is None or len(doclike) == 0:
|
||||
# if doc is empty or None just return empty list
|
||||
return matches
|
||||
if isinstance(doclike, Doc):
|
||||
doc = doclike
|
||||
start_idx = 0
|
||||
end_idx = len(doc)
|
||||
elif isinstance(doclike, Span):
|
||||
doc = doclike.doc
|
||||
start_idx = doclike.start
|
||||
end_idx = doclike.end
|
||||
else:
|
||||
raise ValueError(Errors.E195.format(good="Doc or Span", got=type(doclike).__name__))
|
||||
|
||||
cdef vector[SpanC] c_matches
|
||||
self.find_matches(doc, &c_matches)
|
||||
self.find_matches(doc, start_idx, end_idx, &c_matches)
|
||||
for i in range(c_matches.size()):
|
||||
matches.append((c_matches[i].label, c_matches[i].start, c_matches[i].end))
|
||||
for i, (ent_id, start, end) in enumerate(matches):
|
||||
|
@ -261,17 +271,17 @@ cdef class PhraseMatcher:
|
|||
else:
|
||||
return matches
|
||||
|
||||
cdef void find_matches(self, Doc doc, vector[SpanC] *matches) nogil:
|
||||
cdef void find_matches(self, Doc doc, int start_idx, int end_idx, vector[SpanC] *matches) nogil:
|
||||
cdef MapStruct* current_node = self.c_map
|
||||
cdef int start = 0
|
||||
cdef int idx = 0
|
||||
cdef int idy = 0
|
||||
cdef int idx = start_idx
|
||||
cdef int idy = start_idx
|
||||
cdef key_t key
|
||||
cdef void* value
|
||||
cdef int i = 0
|
||||
cdef SpanC ms
|
||||
cdef void* result
|
||||
while idx < doc.length:
|
||||
while idx < end_idx:
|
||||
start = idx
|
||||
token = Token.get_struct_attr(&doc.c[idx], self.attr)
|
||||
# look for sequences from this position
|
||||
|
@ -279,7 +289,7 @@ cdef class PhraseMatcher:
|
|||
if result:
|
||||
current_node = <MapStruct*>result
|
||||
idy = idx + 1
|
||||
while idy < doc.length:
|
||||
while idy < end_idx:
|
||||
result = map_get(current_node, self._terminal_hash)
|
||||
if result:
|
||||
i = 0
|
||||
|
|
|
@ -107,6 +107,7 @@ def init_ensemble_textcat(model, X, Y) -> Model:
|
|||
model.get_ref("maxout_layer").set_dim("nO", tok2vec_width)
|
||||
model.get_ref("maxout_layer").set_dim("nI", tok2vec_width)
|
||||
model.get_ref("norm_layer").set_dim("nI", tok2vec_width)
|
||||
model.get_ref("norm_layer").set_dim("nO", tok2vec_width)
|
||||
init_chain(model, X, Y)
|
||||
return model
|
||||
|
||||
|
|
|
@ -273,7 +273,7 @@ class EntityLinker(TrainablePipe):
|
|||
gradients = self.distance.get_grad(sentence_encodings, entity_encodings)
|
||||
loss = self.distance.get_loss(sentence_encodings, entity_encodings)
|
||||
loss = loss / len(entity_encodings)
|
||||
return loss, gradients
|
||||
return float(loss), gradients
|
||||
|
||||
def predict(self, docs: Iterable[Doc]) -> List[str]:
|
||||
"""Apply the pipeline's model to a batch of docs, without modifying them.
|
||||
|
|
|
@ -76,7 +76,7 @@ def merge_subtokens(doc: Doc, label: str = "subtok") -> Doc:
|
|||
retokenizes=True,
|
||||
)
|
||||
def make_token_splitter(
|
||||
nlp: Language, name: str, *, min_length=0, split_length=0,
|
||||
nlp: Language, name: str, *, min_length: int = 0, split_length: int = 0
|
||||
):
|
||||
return TokenSplitter(min_length=min_length, split_length=split_length)
|
||||
|
||||
|
|
|
@ -197,7 +197,7 @@ class ClozeMultitask(TrainablePipe):
|
|||
target = vectors[ids]
|
||||
gradient = self.distance.get_grad(prediction, target)
|
||||
loss = self.distance.get_loss(prediction, target)
|
||||
return loss, gradient
|
||||
return float(loss), gradient
|
||||
|
||||
def update(self, examples, *, drop=0., sgd=None, losses=None):
|
||||
pass
|
||||
|
|
|
@ -121,7 +121,7 @@ class Tok2Vec(TrainablePipe):
|
|||
tokvecs = self.model.predict(docs)
|
||||
batch_id = Tok2VecListener.get_batch_id(docs)
|
||||
for listener in self.listeners:
|
||||
listener.receive(batch_id, tokvecs, lambda dX: [])
|
||||
listener.receive(batch_id, tokvecs, _empty_backprop)
|
||||
return tokvecs
|
||||
|
||||
def set_annotations(self, docs: Sequence[Doc], tokvecses) -> None:
|
||||
|
@ -291,12 +291,18 @@ def forward(model: Tok2VecListener, inputs, is_train: bool):
|
|||
# of data.
|
||||
# When the components batch differently, we don't receive a matching
|
||||
# prediction from the upstream, so we can't predict.
|
||||
if not all(doc.tensor.size for doc in inputs):
|
||||
# But we do need to do *something* if the tensor hasn't been set.
|
||||
# The compromise is to at least return data of the right shape,
|
||||
# so the output is valid.
|
||||
width = model.get_dim("nO")
|
||||
outputs = [model.ops.alloc2f(len(doc), width) for doc in inputs]
|
||||
else:
|
||||
outputs = [doc.tensor for doc in inputs]
|
||||
outputs = []
|
||||
width = model.get_dim("nO")
|
||||
for doc in inputs:
|
||||
if doc.tensor.size == 0:
|
||||
# But we do need to do *something* if the tensor hasn't been set.
|
||||
# The compromise is to at least return data of the right shape,
|
||||
# so the output is valid.
|
||||
outputs.append(model.ops.alloc2f(len(doc), width))
|
||||
else:
|
||||
outputs.append(doc.tensor)
|
||||
return outputs, lambda dX: []
|
||||
|
||||
|
||||
def _empty_backprop(dX): # for pickling
|
||||
return []
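The rewritten `forward` decides per document rather than for the whole batch, so one document without a tensor no longer discards the tensors of the others. A pure-Python sketch of that decision, assuming nested lists in place of arrays and a hypothetical `collect_outputs` helper:

```python
from typing import List

Matrix = List[List[float]]


def collect_outputs(tensors: List[Matrix], lengths: List[int], width: int) -> List[Matrix]:
    """Use each doc's tensor if it is set, otherwise zeros of the right shape."""
    outputs = []
    for tensor, length in zip(tensors, lengths):
        if len(tensor) == 0:
            # No upstream prediction: return a (length x width) block of zeros.
            outputs.append([[0.0] * width for _ in range(length)])
        else:
            outputs.append(tensor)
    return outputs


# First doc has a tensor for its single token, second doc (3 tokens) has none.
outputs = collect_outputs([[[1.0, 2.0]], []], lengths=[1, 3], width=2)
```

Under the previous batch-level check, the first document in this example would also have received zeros.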
|
||||
|
|
|
@ -446,6 +446,7 @@ class ProjectConfigCommand(BaseModel):
|
|||
class ProjectConfigSchema(BaseModel):
|
||||
# fmt: off
|
||||
vars: Dict[StrictStr, Any] = Field({}, title="Optional variables to substitute in commands")
|
||||
env: Dict[StrictStr, Any] = Field({}, title="Optional variable names to substitute in commands, mapped to environment variable names")
|
||||
assets: List[Union[ProjectConfigAssetURL, ProjectConfigAssetGit]] = Field([], title="Data assets")
|
||||
workflows: Dict[StrictStr, List[StrictStr]] = Field({}, title="Named workflows, mapped to list of project commands to run in order")
|
||||
commands: List[ProjectConfigCommand] = Field([], title="Project command shortcuts")
|
||||
|
|
|
@ -8,7 +8,8 @@ from spacy.util import get_lang_class
|
|||
LANGUAGES = ["af", "ar", "bg", "bn", "ca", "cs", "da", "de", "el", "en", "es",
|
||||
"et", "fa", "fi", "fr", "ga", "he", "hi", "hr", "hu", "id", "is",
|
||||
"it", "kn", "lt", "lv", "nb", "nl", "pl", "pt", "ro", "si", "sk",
|
||||
"sl", "sq", "sr", "sv", "ta", "te", "tl", "tr", "tt", "ur", 'yo']
|
||||
"sl", "sq", "sr", "sv", "ta", "te", "tl", "tn", "tr", "tt", "ur",
|
||||
"yo"]
|
||||
# fmt: on
|
||||
|
||||
|
||||
|
|
|
@ -323,3 +323,39 @@ def test_phrase_matcher_deprecated(en_vocab):
|
|||
@pytest.mark.parametrize("attr", ["SENT_START", "IS_SENT_START"])
|
||||
def test_phrase_matcher_sent_start(en_vocab, attr):
|
||||
_ = PhraseMatcher(en_vocab, attr=attr) # noqa: F841
|
||||
|
||||
|
||||
def test_span_in_phrasematcher(en_vocab):
|
||||
"""Ensure that PhraseMatcher accepts Span and Doc as input"""
|
||||
# fmt: off
|
||||
words = ["I", "like", "Spans", "and", "Docs", "in", "my", "input", ",", "and", "nothing", "else", "."]
|
||||
# fmt: on
|
||||
doc = Doc(en_vocab, words=words)
|
||||
span = doc[:8]
|
||||
pattern = Doc(en_vocab, words=["Spans", "and", "Docs"])
|
||||
matcher = PhraseMatcher(en_vocab)
|
||||
matcher.add("SPACY", [pattern])
|
||||
matches_doc = matcher(doc)
|
||||
matches_span = matcher(span)
|
||||
assert len(matches_doc) == 1
|
||||
assert len(matches_span) == 1
|
||||
|
||||
|
||||
def test_span_v_doc_in_phrasematcher(en_vocab):
|
||||
"""Ensure that PhraseMatcher only returns matches in input Span and not in entire Doc"""
|
||||
# fmt: off
|
||||
words = [
|
||||
"I", "like", "Spans", "and", "Docs", "in", "my", "input", ",", "Spans",
|
||||
"and", "Docs", "in", "my", "matchers", ",", "and", "Spans", "and", "Docs",
|
||||
"everywhere", "."
|
||||
]
|
||||
# fmt: on
|
||||
doc = Doc(en_vocab, words=words)
|
||||
span = doc[9:15] # second clause
|
||||
pattern = Doc(en_vocab, words=["Spans", "and", "Docs"])
|
||||
matcher = PhraseMatcher(en_vocab)
|
||||
matcher.add("SPACY", [pattern])
|
||||
matches_doc = matcher(doc)
|
||||
matches_span = matcher(span)
|
||||
assert len(matches_doc) == 3
|
||||
assert len(matches_span) == 1
|
||||
|
|
|
@ -451,13 +451,27 @@ def test_pipe_factories_from_source_config():
|
|||
assert config["arg"] == "world"
|
||||
|
||||
|
||||
def test_pipe_factories_decorator_idempotent():
|
||||
class PipeFactoriesIdempotent:
|
||||
def __init__(self, nlp, name):
|
||||
...
|
||||
|
||||
def __call__(self, doc):
|
||||
...
|
||||
|
||||
|
||||
@pytest.mark.parametrize(
|
||||
"i,func,func2",
|
||||
[
|
||||
(0, lambda nlp, name: lambda doc: doc, lambda doc: doc),
|
||||
(1, PipeFactoriesIdempotent, PipeFactoriesIdempotent(None, None)),
|
||||
],
|
||||
)
|
||||
def test_pipe_factories_decorator_idempotent(i, func, func2):
|
||||
"""Check that decorator can be run multiple times if the function is the
|
||||
same. This is especially relevant for live reloading because we don't
|
||||
want spaCy to raise an error if a module registering components is reloaded.
|
||||
"""
|
||||
name = "test_pipe_factories_decorator_idempotent"
|
||||
func = lambda nlp, name: lambda doc: doc
|
||||
name = f"test_pipe_factories_decorator_idempotent_{i}"
|
||||
for i in range(5):
|
||||
Language.factory(name, func=func)
|
||||
nlp = Language()
|
||||
|
@ -466,7 +480,6 @@ def test_pipe_factories_decorator_idempotent():
|
|||
# Make sure it also works for component decorator, which creates the
|
||||
# factory function
|
||||
name2 = f"{name}2"
|
||||
func2 = lambda doc: doc
|
||||
for i in range(5):
|
||||
Language.component(name2, func=func2)
|
||||
nlp = Language()
|
||||
|
|
229
spacy/tests/regression/test_issue6501-7000.py
Normal file
|
@ -0,0 +1,229 @@
|
|||
import pytest
|
||||
from spacy.lang.en import English
|
||||
import numpy as np
|
||||
import spacy
|
||||
from spacy.tokens import Doc
|
||||
from spacy.matcher import PhraseMatcher
|
||||
from spacy.tokens import DocBin
|
||||
from spacy.util import load_config_from_str
|
||||
from spacy.training import Example
|
||||
from spacy.training.initialize import init_nlp
|
||||
import pickle
|
||||
|
||||
from ..util import make_tempdir
|
||||
|
||||
|
||||
def test_issue6730(en_vocab):
|
||||
"""Ensure that the KB does not accept empty strings, but otherwise IO works fine."""
|
||||
from spacy.kb import KnowledgeBase
|
||||
|
||||
kb = KnowledgeBase(en_vocab, entity_vector_length=3)
|
||||
kb.add_entity(entity="1", freq=148, entity_vector=[1, 2, 3])
|
||||
|
||||
with pytest.raises(ValueError):
|
||||
kb.add_alias(alias="", entities=["1"], probabilities=[0.4])
|
||||
assert kb.contains_alias("") is False
|
||||
|
||||
kb.add_alias(alias="x", entities=["1"], probabilities=[0.2])
|
||||
kb.add_alias(alias="y", entities=["1"], probabilities=[0.1])
|
||||
|
||||
with make_tempdir() as tmp_dir:
|
||||
kb.to_disk(tmp_dir)
|
||||
kb.from_disk(tmp_dir)
|
||||
assert kb.get_size_aliases() == 2
|
||||
assert set(kb.get_alias_strings()) == {"x", "y"}
|
||||
|
||||
|
||||
def test_issue6755(en_tokenizer):
|
||||
doc = en_tokenizer("This is a magnificent sentence.")
|
||||
span = doc[:0]
|
||||
assert span.text_with_ws == ""
|
||||
assert span.text == ""
|
||||
|
||||
|
||||
@pytest.mark.parametrize(
|
||||
"sentence, start_idx,end_idx,label",
|
||||
[("Welcome to Mumbai, my friend", 11, 17, "GPE")],
|
||||
)
|
||||
def test_issue6815_1(sentence, start_idx, end_idx, label):
|
||||
nlp = English()
|
||||
doc = nlp(sentence)
|
||||
span = doc[:].char_span(start_idx, end_idx, label=label)
|
||||
assert span.label_ == label
|
||||
|
||||
|
||||
@pytest.mark.parametrize(
|
||||
"sentence, start_idx,end_idx,kb_id", [("Welcome to Mumbai, my friend", 11, 17, 5)]
|
||||
)
|
||||
def test_issue6815_2(sentence, start_idx, end_idx, kb_id):
|
||||
nlp = English()
|
||||
doc = nlp(sentence)
|
||||
span = doc[:].char_span(start_idx, end_idx, kb_id=kb_id)
|
||||
assert span.kb_id == kb_id
|
||||
|
||||
|
||||
@pytest.mark.parametrize(
|
||||
"sentence, start_idx,end_idx,vector",
|
||||
[("Welcome to Mumbai, my friend", 11, 17, np.array([0.1, 0.2, 0.3]))],
|
||||
)
|
||||
def test_issue6815_3(sentence, start_idx, end_idx, vector):
|
||||
nlp = English()
|
||||
doc = nlp(sentence)
|
||||
span = doc[:].char_span(start_idx, end_idx, vector=vector)
|
||||
assert (span.vector == vector).all()
|
||||
|
||||
|
||||
def test_issue6839(en_vocab):
|
||||
"""Ensure that PhraseMatcher accepts Span as input"""
|
||||
# fmt: off
|
||||
words = ["I", "like", "Spans", "and", "Docs", "in", "my", "input", ",", "and", "nothing", "else", "."]
|
||||
# fmt: on
|
||||
doc = Doc(en_vocab, words=words)
|
||||
span = doc[:8]
|
||||
pattern = Doc(en_vocab, words=["Spans", "and", "Docs"])
|
||||
matcher = PhraseMatcher(en_vocab)
|
||||
matcher.add("SPACY", [pattern])
|
||||
matches = matcher(span)
|
||||
assert matches
|
||||
|
||||
|
||||
CONFIG_ISSUE_6908 = """
|
||||
[paths]
|
||||
train = "TRAIN_PLACEHOLDER"
|
||||
raw = null
|
||||
init_tok2vec = null
|
||||
vectors = null
|
||||
|
||||
[system]
|
||||
seed = 0
|
||||
gpu_allocator = null
|
||||
|
||||
[nlp]
|
||||
lang = "en"
|
||||
pipeline = ["textcat"]
|
||||
tokenizer = {"@tokenizers":"spacy.Tokenizer.v1"}
|
||||
disabled = []
|
||||
before_creation = null
|
||||
after_creation = null
|
||||
after_pipeline_creation = null
|
||||
batch_size = 1000
|
||||
|
||||
[components]
|
||||
|
||||
[components.textcat]
|
||||
factory = "TEXTCAT_PLACEHOLDER"
|
||||
|
||||
[corpora]
|
||||
|
||||
[corpora.train]
|
||||
@readers = "spacy.Corpus.v1"
|
||||
path = ${paths:train}
|
||||
|
||||
[corpora.dev]
|
||||
@readers = "spacy.Corpus.v1"
|
||||
path = ${paths:train}
|
||||
|
||||
|
||||
[training]
|
||||
train_corpus = "corpora.train"
|
||||
dev_corpus = "corpora.dev"
|
||||
seed = ${system.seed}
|
||||
gpu_allocator = ${system.gpu_allocator}
|
||||
frozen_components = []
|
||||
before_to_disk = null
|
||||
|
||||
[pretraining]
|
||||
|
||||
[initialize]
|
||||
vectors = ${paths.vectors}
|
||||
init_tok2vec = ${paths.init_tok2vec}
|
||||
vocab_data = null
|
||||
lookups = null
|
||||
before_init = null
|
||||
after_init = null
|
||||
|
||||
[initialize.components]
|
||||
|
||||
[initialize.components.textcat]
|
||||
labels = ['label1', 'label2']
|
||||
|
||||
[initialize.tokenizer]
|
||||
"""
|
||||
|
||||
|
||||
@pytest.mark.parametrize(
|
||||
"component_name", ["textcat", "textcat_multilabel"],
|
||||
)
|
||||
def test_issue6908(component_name):
|
||||
"""Test initializing textcat with labels in a list"""
|
||||
|
||||
def create_data(out_file):
|
||||
nlp = spacy.blank("en")
|
||||
doc = nlp.make_doc("Some text")
|
||||
doc.cats = {"label1": 0, "label2": 1}
|
||||
out_data = DocBin(docs=[doc]).to_bytes()
|
||||
with out_file.open("wb") as file_:
|
||||
file_.write(out_data)
|
||||
|
||||
with make_tempdir() as tmp_path:
|
||||
train_path = tmp_path / "train.spacy"
|
||||
create_data(train_path)
|
||||
config_str = CONFIG_ISSUE_6908.replace("TEXTCAT_PLACEHOLDER", component_name)
|
||||
config_str = config_str.replace("TRAIN_PLACEHOLDER", train_path.as_posix())
|
||||
config = load_config_from_str(config_str)
|
||||
init_nlp(config)
|
||||
|
||||
|
||||
CONFIG_ISSUE_6950 = """
|
||||
[nlp]
|
||||
lang = "en"
|
||||
pipeline = ["tok2vec", "tagger"]
|
||||
|
||||
[components]
|
||||
|
||||
[components.tok2vec]
|
||||
factory = "tok2vec"
|
||||
|
||||
[components.tok2vec.model]
|
||||
@architectures = "spacy.Tok2Vec.v1"
|
||||
|
||||
[components.tok2vec.model.embed]
|
||||
@architectures = "spacy.MultiHashEmbed.v1"
|
||||
width = ${components.tok2vec.model.encode:width}
|
||||
attrs = ["NORM","PREFIX","SUFFIX","SHAPE"]
|
||||
rows = [5000,2500,2500,2500]
|
||||
include_static_vectors = false
|
||||
|
||||
[components.tok2vec.model.encode]
|
||||
@architectures = "spacy.MaxoutWindowEncoder.v1"
|
||||
width = 96
|
||||
depth = 4
|
||||
window_size = 1
|
||||
maxout_pieces = 3
|
||||
|
||||
[components.ner]
|
||||
factory = "ner"
|
||||
|
||||
[components.tagger]
|
||||
factory = "tagger"
|
||||
|
||||
[components.tagger.model]
|
||||
@architectures = "spacy.Tagger.v1"
|
||||
nO = null
|
||||
|
||||
[components.tagger.model.tok2vec]
|
||||
@architectures = "spacy.Tok2VecListener.v1"
|
||||
width = ${components.tok2vec.model.encode:width}
|
||||
upstream = "*"
|
||||
"""
|
||||
|
||||
|
||||
def test_issue6950():
|
||||
"""Test that the nlp object with initialized tok2vec with listeners pickles
|
||||
correctly (and doesn't have lambdas).
|
||||
"""
|
||||
nlp = English.from_config(load_config_from_str(CONFIG_ISSUE_6950))
|
||||
nlp.initialize(lambda: [Example.from_dict(nlp.make_doc("hello"), {"tags": ["V"]})])
|
||||
pickle.dumps(nlp)
|
||||
nlp("hello")
|
||||
pickle.dumps(nlp)
|
|
@ -1,23 +0,0 @@
|
|||
import pytest
|
||||
from ..util import make_tempdir
|
||||
|
||||
|
||||
def test_issue6730(en_vocab):
|
||||
"""Ensure that the KB does not accept empty strings, but otherwise IO works fine."""
|
||||
from spacy.kb import KnowledgeBase
|
||||
|
||||
kb = KnowledgeBase(en_vocab, entity_vector_length=3)
|
||||
kb.add_entity(entity="1", freq=148, entity_vector=[1, 2, 3])
|
||||
|
||||
with pytest.raises(ValueError):
|
||||
kb.add_alias(alias="", entities=["1"], probabilities=[0.4])
|
||||
assert kb.contains_alias("") is False
|
||||
|
||||
kb.add_alias(alias="x", entities=["1"], probabilities=[0.2])
|
||||
kb.add_alias(alias="y", entities=["1"], probabilities=[0.1])
|
||||
|
||||
with make_tempdir() as tmp_dir:
|
||||
kb.to_disk(tmp_dir)
|
||||
kb.from_disk(tmp_dir)
|
||||
assert kb.get_size_aliases() == 2
|
||||
assert set(kb.get_alias_strings()) == {"x", "y"}
|
|
@ -1,5 +0,0 @@
|
|||
def test_issue6755(en_tokenizer):
|
||||
doc = en_tokenizer("This is a magnificent sentence.")
|
||||
span = doc[:0]
|
||||
assert span.text_with_ws == ""
|
||||
assert span.text == ""
|
|
@ -1,35 +0,0 @@
|
|||
import pytest
|
||||
from spacy.lang.en import English
|
||||
import numpy as np
|
||||
|
||||
|
||||
@pytest.mark.parametrize(
|
||||
"sentence, start_idx,end_idx,label",
|
||||
[("Welcome to Mumbai, my friend", 11, 17, "GPE")],
|
||||
)
|
||||
def test_char_span_label(sentence, start_idx, end_idx, label):
|
||||
nlp = English()
|
||||
doc = nlp(sentence)
|
||||
span = doc[:].char_span(start_idx, end_idx, label=label)
|
||||
assert span.label_ == label
|
||||
|
||||
|
||||
@pytest.mark.parametrize(
|
||||
"sentence, start_idx,end_idx,kb_id", [("Welcome to Mumbai, my friend", 11, 17, 5)]
|
||||
)
|
||||
def test_char_span_kb_id(sentence, start_idx, end_idx, kb_id):
|
||||
nlp = English()
|
||||
doc = nlp(sentence)
|
||||
span = doc[:].char_span(start_idx, end_idx, kb_id=kb_id)
|
||||
assert span.kb_id == kb_id
|
||||
|
||||
|
||||
@pytest.mark.parametrize(
|
||||
"sentence, start_idx,end_idx,vector",
|
||||
[("Welcome to Mumbai, my friend", 11, 17, np.array([0.1, 0.2, 0.3]))],
|
||||
)
|
||||
def test_char_span_vector(sentence, start_idx, end_idx, vector):
|
||||
nlp = English()
|
||||
doc = nlp(sentence)
|
||||
span = doc[:].char_span(start_idx, end_idx, vector=vector)
|
||||
assert (span.vector == vector).all()
|
|
@ -1,102 +0,0 @@
|
|||
import pytest
|
||||
import spacy
|
||||
from spacy.language import Language
|
||||
from spacy.tokens import DocBin
|
||||
from spacy import util
|
||||
from spacy.schemas import ConfigSchemaInit
|
||||
|
||||
from spacy.training.initialize import init_nlp
|
||||
|
||||
from ..util import make_tempdir
|
||||
|
||||
TEXTCAT_WITH_LABELS_ARRAY_CONFIG = """
|
||||
[paths]
|
||||
train = "TRAIN_PLACEHOLDER"
|
||||
raw = null
|
||||
init_tok2vec = null
|
||||
vectors = null
|
||||
|
||||
[system]
|
||||
seed = 0
|
||||
gpu_allocator = null
|
||||
|
||||
[nlp]
|
||||
lang = "en"
|
||||
pipeline = ["textcat"]
|
||||
tokenizer = {"@tokenizers":"spacy.Tokenizer.v1"}
|
||||
disabled = []
|
||||
before_creation = null
|
||||
after_creation = null
|
||||
after_pipeline_creation = null
|
||||
batch_size = 1000
|
||||
|
||||
[components]
|
||||
|
||||
[components.textcat]
|
||||
factory = "TEXTCAT_PLACEHOLDER"
|
||||
|
||||
[corpora]
|
||||
|
||||
[corpora.train]
|
||||
@readers = "spacy.Corpus.v1"
|
||||
path = ${paths:train}
|
||||
|
||||
[corpora.dev]
|
||||
@readers = "spacy.Corpus.v1"
|
||||
path = ${paths:train}
|
||||
|
||||
|
||||
[training]
|
||||
train_corpus = "corpora.train"
|
||||
dev_corpus = "corpora.dev"
|
||||
seed = ${system.seed}
|
||||
gpu_allocator = ${system.gpu_allocator}
|
||||
frozen_components = []
|
||||
before_to_disk = null
|
||||
|
||||
[pretraining]
|
||||
|
||||
[initialize]
|
||||
vectors = ${paths.vectors}
|
||||
init_tok2vec = ${paths.init_tok2vec}
|
||||
vocab_data = null
|
||||
lookups = null
|
||||
before_init = null
|
||||
after_init = null
|
||||
|
||||
[initialize.components]
|
||||
|
||||
[initialize.components.textcat]
|
||||
labels = ['label1', 'label2']
|
||||
|
||||
[initialize.tokenizer]
|
||||
"""
|
||||
|
||||
|
||||
@pytest.mark.parametrize(
|
||||
"component_name",
|
||||
["textcat", "textcat_multilabel"],
|
||||
)
|
||||
def test_textcat_initialize_labels_validation(component_name):
|
||||
"""Test initializing textcat with labels in a list"""
|
||||
|
||||
def create_data(out_file):
|
||||
nlp = spacy.blank("en")
|
||||
doc = nlp.make_doc("Some text")
|
||||
doc.cats = {"label1": 0, "label2": 1}
|
||||
|
||||
out_data = DocBin(docs=[doc]).to_bytes()
|
||||
with out_file.open("wb") as file_:
|
||||
file_.write(out_data)
|
||||
|
||||
with make_tempdir() as tmp_path:
|
||||
train_path = tmp_path / "train.spacy"
|
||||
create_data(train_path)
|
||||
|
||||
config_str = TEXTCAT_WITH_LABELS_ARRAY_CONFIG.replace(
|
||||
"TEXTCAT_PLACEHOLDER", component_name
|
||||
)
|
||||
config_str = config_str.replace("TRAIN_PLACEHOLDER", train_path.as_posix())
|
||||
|
||||
config = util.load_config_from_str(config_str)
|
||||
init_nlp(config)
|
12
spacy/tests/regression/test_issue7019.py
Normal file
|
@ -0,0 +1,12 @@
|
|||
from spacy.cli.evaluate import print_textcats_auc_per_cat, print_prf_per_type
|
||||
from wasabi import msg
|
||||
|
||||
|
||||
def test_issue7019():
|
||||
scores = {"LABEL_A": 0.39829102, "LABEL_B": 0.938298329382, "LABEL_C": None}
|
||||
print_textcats_auc_per_cat(msg, scores)
|
||||
scores = {
|
||||
"LABEL_A": {"p": 0.3420302, "r": 0.3929020, "f": 0.49823928932},
|
||||
"LABEL_B": {"p": None, "r": None, "f": None},
|
||||
}
|
||||
print_prf_per_type(msg, scores, name="foo", type="bar")
|
67
spacy/tests/regression/test_issue7029.py
Normal file
|
@ -0,0 +1,67 @@
|
|||
from spacy.lang.en import English
|
||||
from spacy.training import Example
|
||||
from spacy.util import load_config_from_str
|
||||
|
||||
|
||||
CONFIG = """
|
||||
[nlp]
|
||||
lang = "en"
|
||||
pipeline = ["tok2vec", "tagger"]
|
||||
|
||||
[components]
|
||||
|
||||
[components.tok2vec]
|
||||
factory = "tok2vec"
|
||||
|
||||
[components.tok2vec.model]
|
||||
@architectures = "spacy.Tok2Vec.v1"
|
||||
|
||||
[components.tok2vec.model.embed]
|
||||
@architectures = "spacy.MultiHashEmbed.v1"
|
||||
width = ${components.tok2vec.model.encode:width}
|
||||
attrs = ["NORM","PREFIX","SUFFIX","SHAPE"]
|
||||
rows = [5000,2500,2500,2500]
|
||||
include_static_vectors = false
|
||||
|
||||
[components.tok2vec.model.encode]
|
||||
@architectures = "spacy.MaxoutWindowEncoder.v1"
|
||||
width = 96
|
||||
depth = 4
|
||||
window_size = 1
|
||||
maxout_pieces = 3
|
||||
|
||||
[components.tagger]
|
||||
factory = "tagger"
|
||||
|
||||
[components.tagger.model]
|
||||
@architectures = "spacy.Tagger.v1"
|
||||
nO = null
|
||||
|
||||
[components.tagger.model.tok2vec]
|
||||
@architectures = "spacy.Tok2VecListener.v1"
|
||||
width = ${components.tok2vec.model.encode:width}
|
||||
upstream = "*"
|
||||
"""
|
||||
|
||||
|
||||
TRAIN_DATA = [
|
||||
("I like green eggs", {"tags": ["N", "V", "J", "N"]}),
|
||||
("Eat blue ham", {"tags": ["V", "J", "N"]}),
|
||||
]
|
||||
|
||||
|
||||
def test_issue7029():
|
||||
"""Test that an empty document doesn't mess up an entire batch."""
|
||||
nlp = English.from_config(load_config_from_str(CONFIG))
|
||||
train_examples = []
|
||||
for t in TRAIN_DATA:
|
||||
train_examples.append(Example.from_dict(nlp.make_doc(t[0]), t[1]))
|
||||
optimizer = nlp.initialize(get_examples=lambda: train_examples)
|
||||
for i in range(50):
|
||||
losses = {}
|
||||
nlp.update(train_examples, sgd=optimizer, losses=losses)
|
||||
texts = ["first", "second", "third", "fourth", "and", "then", "some", ""]
|
||||
nlp.select_pipes(enable=["tok2vec", "tagger"])
|
||||
docs1 = list(nlp.pipe(texts, batch_size=1))
|
||||
docs2 = list(nlp.pipe(texts, batch_size=4))
|
||||
assert [doc[0].tag_ for doc in docs1[:-1]] == [doc[0].tag_ for doc in docs2[:-1]]
|
|
@ -325,6 +325,23 @@ def test_project_config_interpolation():
|
|||
substitute_project_variables(project)
|
||||
|
||||
|
||||
def test_project_config_interpolation_env():
|
||||
variables = {"a": 10}
|
||||
env_var = "SPACY_TEST_FOO"
|
||||
env_vars = {"foo": env_var}
|
||||
commands = [{"name": "x", "script": ["hello ${vars.a} ${env.foo}"]}]
|
||||
project = {"commands": commands, "vars": variables, "env": env_vars}
|
||||
with make_tempdir() as d:
|
||||
srsly.write_yaml(d / "project.yml", project)
|
||||
cfg = load_project_config(d)
|
||||
assert cfg["commands"][0]["script"][0] == "hello 10 "
|
||||
os.environ[env_var] = "123"
|
||||
with make_tempdir() as d:
|
||||
srsly.write_yaml(d / "project.yml", project)
|
||||
cfg = load_project_config(d)
|
||||
assert cfg["commands"][0]["script"][0] == "hello 10 123"
|
||||
|
||||
|
||||
@pytest.mark.parametrize(
|
||||
"args,expected",
|
||||
[
|
||||
|
|
|
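The new test above expects `${env.foo}` to resolve to the value of the environment variable named in the `env` section, and to an empty string when that variable is unset. A rough sketch of that behaviour follows. `substitute_sketch` is a hypothetical helper based on a regular expression; it is not how spaCy's `substitute_project_variables` is implemented, and the variable name `SPACY_SKETCH_FOO` is made up for the example.

```python
import os
import re
from typing import Any, Dict


def substitute_sketch(script: str, variables: Dict[str, Any], env_names: Dict[str, str]) -> str:
    """Replace ${vars.x} and ${env.x} placeholders in a command string."""

    def lookup(match: "re.Match") -> str:
        section, key = match.group(1), match.group(2)
        if section == "vars":
            return str(variables[key])
        # For env, the key maps to an environment variable *name*;
        # an unset variable is treated as an empty string.
        return os.environ.get(env_names[key], "")

    return re.sub(r"\$\{(vars|env)\.(\w+)\}", lookup, script)


env_var = "SPACY_SKETCH_FOO"  # hypothetical variable name
os.environ.pop(env_var, None)
unset_result = substitute_sketch("hello ${vars.a} ${env.foo}", {"a": 10}, {"foo": env_var})
os.environ[env_var] = "123"
set_result = substitute_sketch("hello ${vars.a} ${env.foo}", {"a": 10}, {"foo": env_var})
```

The two results mirror the two assertions in the test: one with the variable unset, one after it has been exported.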
@ -1,7 +1,9 @@
|
|||
import pytest
|
||||
import numpy
|
||||
import srsly
|
||||
from spacy.lang.en import English
|
||||
from spacy.strings import StringStore
|
||||
from spacy.tokens import Doc
|
||||
from spacy.vocab import Vocab
|
||||
from spacy.attrs import NORM
|
||||
|
||||
|
@ -20,7 +22,10 @@ def test_pickle_string_store(text1, text2):
|
|||
|
||||
@pytest.mark.parametrize("text1,text2", [("dog", "cat")])
|
||||
def test_pickle_vocab(text1, text2):
|
||||
vocab = Vocab(lex_attr_getters={int(NORM): lambda string: string[:-1]})
|
||||
vocab = Vocab(
|
||||
lex_attr_getters={int(NORM): lambda string: string[:-1]},
|
||||
get_noun_chunks=English.Defaults.syntax_iterators.get("noun_chunks"),
|
||||
)
|
||||
vocab.set_vector("dog", numpy.ones((5,), dtype="f"))
|
||||
lex1 = vocab[text1]
|
||||
lex2 = vocab[text2]
|
||||
|
@ -34,4 +39,23 @@ def test_pickle_vocab(text1, text2):
|
|||
assert unpickled[text2].norm == lex2.norm
|
||||
assert unpickled[text1].norm != unpickled[text2].norm
|
||||
assert unpickled.vectors is not None
|
||||
assert unpickled.get_noun_chunks is not None
|
||||
assert list(vocab["dog"].vector) == [1.0, 1.0, 1.0, 1.0, 1.0]
|
||||
|
||||
|
||||
def test_pickle_doc(en_vocab):
|
||||
words = ["a", "b", "c"]
|
||||
deps = ["dep"] * len(words)
|
||||
heads = [0] * len(words)
|
||||
doc = Doc(
|
||||
en_vocab,
|
||||
words=words,
|
||||
deps=deps,
|
||||
heads=heads,
|
||||
)
|
||||
data = srsly.pickle_dumps(doc)
|
||||
unpickled = srsly.pickle_loads(data)
|
||||
assert [t.text for t in unpickled] == words
|
||||
assert [t.dep_ for t in unpickled] == deps
|
||||
assert [t.head.i for t in unpickled] == heads
|
||||
assert list(doc.noun_chunks) == []
|
||||
|
|
|
@ -55,6 +55,7 @@ def test_vocab_lexeme_add_flag_provided_id(en_vocab):
|
|||
assert en_vocab["199"].check_flag(IS_DIGIT) is False
|
||||
assert en_vocab["the"].check_flag(is_len4) is False
|
||||
assert en_vocab["dogs"].check_flag(is_len4) is True
|
||||
en_vocab.add_flag(lambda string: string.isdigit(), flag_id=IS_DIGIT)
|
||||
|
||||
|
||||
def test_vocab_lexeme_oov_rank(en_vocab):
|
||||
|
|
|
@ -245,7 +245,7 @@ cdef class Tokenizer:
|
|||
cdef int offset
|
||||
cdef int modified_doc_length
|
||||
# Find matches for special cases
|
||||
self._special_matcher.find_matches(doc, &c_matches)
|
||||
self._special_matcher.find_matches(doc, 0, doc.length, &c_matches)
|
||||
# Skip processing if no matches
|
||||
if c_matches.size() == 0:
|
||||
return True
|
||||
|
|
|
@ -215,8 +215,7 @@ def convert_vectors(
|
|||
|
||||
|
||||
def read_vectors(vectors_loc: Path, truncate_vectors: int):
|
||||
f = open_file(vectors_loc)
|
||||
f = ensure_shape(f)
|
||||
f = ensure_shape(vectors_loc)
|
||||
shape = tuple(int(size) for size in next(f).split())
|
||||
if truncate_vectors >= 1:
|
||||
shape = (truncate_vectors, shape[1])
|
||||
|
@ -251,11 +250,12 @@ def open_file(loc: Union[str, Path]) -> IO:
|
|||
return loc.open("r", encoding="utf8")
|
||||
|
||||
|
||||
def ensure_shape(lines):
|
||||
def ensure_shape(vectors_loc):
|
||||
"""Ensure that the first line of the data is the vectors shape.
|
||||
If it's not, we read in the data and output the shape as the first result,
|
||||
so that the reader doesn't have to deal with the problem.
|
||||
"""
|
||||
lines = open_file(vectors_loc)
|
||||
first_line = next(lines)
|
||||
try:
|
||||
shape = tuple(int(size) for size in first_line.split())
|
||||
|
@ -269,7 +269,11 @@ def ensure_shape(lines):
|
|||
# Figure out the shape, make it the first value, and then give the
|
||||
# rest of the data.
|
||||
width = len(first_line.split()) - 1
|
||||
captured = [first_line] + list(lines)
|
||||
length = len(captured)
|
||||
length = 1
|
||||
for _ in lines:
|
||||
length += 1
|
||||
yield f"{length} {width}"
|
||||
yield from captured
|
||||
# Reading the lines in again from file. This is to avoid having to
|
||||
# store all the results in a list in memory
|
||||
lines2 = open_file(vectors_loc)
|
||||
yield from lines2
|
||||
|
|
|
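The change to `ensure_shape` trades memory for a second pass over the file: the lines are counted once, the shape header is emitted, and the file is then reopened and streamed. A self-contained sketch of that two-pass approach is shown below. `ensure_shape_sketch` is a hypothetical name, the header check is an assumption (the corresponding lines fall outside this hunk), and the file is assumed to be non-empty.

```python
import tempfile
from pathlib import Path
from typing import Iterator


def ensure_shape_sketch(vectors_loc: Path) -> Iterator[str]:
    with vectors_loc.open("r", encoding="utf8") as lines:
        first_line = next(lines)
        try:
            shape = tuple(int(size) for size in first_line.split())
        except ValueError:
            shape = ()
        if len(shape) == 2:
            # Assumed check: the file already starts with "<rows> <width>".
            yield first_line
            yield from lines
            return
        # First pass: count rows without keeping them in memory.
        width = len(first_line.split()) - 1
        length = 1
        for _ in lines:
            length += 1
    yield f"{length} {width}"
    # Second pass: reopen the file and stream the rows.
    with vectors_loc.open("r", encoding="utf8") as lines2:
        yield from lines2


with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "vectors.txt"
    path.write_text("a 1.0 2.0\nb 3.0 4.0\n", encoding="utf8")
    rows = list(ensure_shape_sketch(path))
```

The cost is reading the file twice, which is usually acceptable for large vector tables where holding every line in a list would dominate memory use.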
@ -930,6 +930,8 @@ def is_same_func(func1: Callable, func2: Callable) -> bool:
|
|||
"""
|
||||
if not callable(func1) or not callable(func2):
|
||||
return False
|
||||
if not hasattr(func1, "__qualname__") or not hasattr(func2, "__qualname__"):
|
||||
return False
|
||||
same_name = func1.__qualname__ == func2.__qualname__
|
||||
same_file = inspect.getfile(func1) == inspect.getfile(func2)
|
||||
same_code = inspect.getsourcelines(func1) == inspect.getsourcelines(func2)
|
||||
|
|
|
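The added `__qualname__` guard matters for callables that are class instances, such as the `PipeFactoriesIdempotent(None, None)` object used in the updated test: instances are callable but carry no `__qualname__`. The sketch below shows the guard in isolation. The final `return` line is an assumption, since it lies outside the hunk, and `Component` is a hypothetical class.

```python
import inspect
from typing import Any


def is_same_func_sketch(func1: Any, func2: Any) -> bool:
    if not callable(func1) or not callable(func2):
        return False
    # Instances of classes defining __call__ are callable but have no
    # __qualname__, so bail out before touching that attribute.
    if not hasattr(func1, "__qualname__") or not hasattr(func2, "__qualname__"):
        return False
    same_name = func1.__qualname__ == func2.__qualname__
    same_file = inspect.getfile(func1) == inspect.getfile(func2)
    same_code = inspect.getsourcelines(func1) == inspect.getsourcelines(func2)
    return same_name and same_file and same_code  # assumed combination


class Component:
    """Hypothetical callable component."""

    def __call__(self, doc):
        return doc


instances_equal = is_same_func_sketch(Component(), Component())
non_callables_equal = is_same_func_sketch(1, 2)
```

Without the guard, the `__qualname__` comparison would raise `AttributeError` for such instances instead of returning `False`.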
@ -551,12 +551,13 @@ def pickle_vocab(vocab):
|
|||
data_dir = vocab.data_dir
|
||||
lex_attr_getters = srsly.pickle_dumps(vocab.lex_attr_getters)
|
||||
lookups = vocab.lookups
|
||||
get_noun_chunks = vocab.get_noun_chunks
|
||||
return (unpickle_vocab,
|
||||
(sstore, vectors, morph, data_dir, lex_attr_getters, lookups))
|
||||
(sstore, vectors, morph, data_dir, lex_attr_getters, lookups, get_noun_chunks))
|
||||
|
||||
|
||||
def unpickle_vocab(sstore, vectors, morphology, data_dir,
|
||||
lex_attr_getters, lookups):
|
||||
lex_attr_getters, lookups, get_noun_chunks):
|
||||
cdef Vocab vocab = Vocab()
|
||||
vocab.vectors = vectors
|
||||
vocab.strings = sstore
|
||||
|
@ -564,6 +565,7 @@ def unpickle_vocab(sstore, vectors, morphology, data_dir,
|
|||
vocab.data_dir = data_dir
|
||||
vocab.lex_attr_getters = srsly.pickle_loads(lex_attr_getters)
|
||||
vocab.lookups = lookups
|
||||
vocab.get_noun_chunks = get_noun_chunks
|
||||
return vocab
|
||||
|
||||
|
||||
|
|
|
@ -67,7 +67,7 @@ data format used by the lookup and rule-based lemmatizers, see
|
|||
> lemmatizer = nlp.add_pipe("lemmatizer")
|
||||
>
|
||||
> # Construction via add_pipe with custom settings
|
||||
> config = {"mode": "rule", overwrite=True}
|
||||
> config = {"mode": "rule", "overwrite": True}
|
||||
> lemmatizer = nlp.add_pipe("lemmatizer", config=config)
|
||||
> ```
|
||||
|
||||
|
|
|
@ -44,7 +44,7 @@ be shown.
|
|||
|
||||
## PhraseMatcher.\_\_call\_\_ {#call tag="method"}
|
||||
|
||||
Find all token sequences matching the supplied patterns on the `Doc`.
|
||||
Find all token sequences matching the supplied patterns on the `Doc` or `Span`.
|
||||
|
||||
> #### Example
|
||||
>
|
||||
|
@ -59,7 +59,7 @@ Find all token sequences matching the supplied patterns on the `Doc`.
|
|||
|
||||
| Name | Description |
|
||||
| ------------------------------------- | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
|
||||
| `doc` | The document to match over. ~~Doc~~ |
|
||||
| `doclike` | The `Doc` or `Span` to match over. ~~Union[Doc, Span]~~ |
|
||||
| _keyword-only_ | |
|
||||
| `as_spans` <Tag variant="new">3</Tag> | Instead of tuples, return a list of [`Span`](/api/span) objects of the matches, with the `match_id` assigned as the span label. Defaults to `False`. ~~bool~~ |
|
||||
| **RETURNS** | A list of `(match_id, start, end)` tuples, describing the matches. A match tuple describes a span `doc[start:end]`. The `match_id` is the ID of the added match pattern. If `as_spans` is set to `True`, a list of `Span` objects is returned instead. ~~Union[List[Tuple[int, int, int]], List[Span]]~~ |
|
||||
|
|
|
@ -727,7 +727,7 @@ capitalization by including a mix of capitalized and lowercase examples. See the
|
|||
|
||||
Create a data augmentation callback that uses orth-variant replacement. The
|
||||
callback can be added to a corpus or other data iterator during training. It's
|
||||
is especially useful for punctuation and case replacement, to help generalize
|
||||
especially useful for punctuation and case replacement, to help generalize
|
||||
beyond corpora that don't have smart quotes, or only have smart quotes etc.
|
||||
|
||||
| Name | Description |
|
||||
|
|
|
@ -4,8 +4,8 @@ import { Help } from 'components/typography'; import Link from 'components/link'
|
|||
|
||||
| Pipeline | Parser | Tagger | NER |
|
||||
| ---------------------------------------------------------- | -----: | -----: | ---: |
|
||||
| [`en_core_web_trf`](/models/en#en_core_web_trf) (spaCy v3) | 95.2 | 97.8 | 89.9 |
|
||||
| [`en_core_web_lg`](/models/en#en_core_web_lg) (spaCy v3) | 91.9 | 97.4 | 85.5 |
|
||||
| [`en_core_web_trf`](/models/en#en_core_web_trf) (spaCy v3) | 95.1 | 97.8 | 89.8 |
|
||||
| [`en_core_web_lg`](/models/en#en_core_web_lg) (spaCy v3) | 92.0 | 97.4 | 85.5 |
|
||||
| `en_core_web_lg` (spaCy v2) | 91.9 | 97.2 | 85.5 |
|
||||
|
||||
<figcaption class="caption">
|
||||
|
@ -22,7 +22,7 @@ the development set).
|
|||
|
||||
| Named Entity Recognition System | OntoNotes | CoNLL '03 |
|
||||
| -------------------------------- | --------: | --------: |
|
||||
| spaCy RoBERTa (2020) | 89.7 | 91.6 |
|
||||
| spaCy RoBERTa (2020) | 89.8 | 91.6 |
|
||||
| Stanza (StanfordNLP)<sup>1</sup> | 88.8 | 92.1 |
|
||||
| Flair<sup>2</sup> | 89.7 | 93.1 |
|
||||
|
||||
|
|
|
@ -77,7 +77,7 @@ import Benchmarks from 'usage/\_benchmarks-models.md'
|
|||
|
||||
| Dependency Parsing System | UAS | LAS |
|
||||
| ------------------------------------------------------------------------------ | ---: | ---: |
|
||||
| spaCy RoBERTa (2020) | 95.5 | 94.3 |
|
||||
| spaCy RoBERTa (2020) | 95.1 | 93.7 |
|
||||
| [Mrini et al.](https://khalilmrini.github.io/Label_Attention_Layer.pdf) (2019) | 97.4 | 96.3 |
|
||||
| [Zhou and Zhao](https://www.aclweb.org/anthology/P19-1230/) (2019) | 97.2 | 95.7 |
|
||||
|
||||
|
|
|
@ -69,9 +69,9 @@ python -m spacy project clone pipelines/tagger_parser_ud
|
|||
|
||||
By default, the project will be cloned into the current working directory. You
|
||||
can specify an optional second argument to define the output directory. The
|
||||
`--repo` option lets you define a custom repo to clone from if you don't want
|
||||
to use the spaCy [`projects`](https://github.com/explosion/projects) repo. You
|
||||
can also use any private repo you have access to with Git.
|
||||
`--repo` option lets you define a custom repo to clone from if you don't want to
|
||||
use the spaCy [`projects`](https://github.com/explosion/projects) repo. You can
|
||||
also use any private repo you have access to with Git.
|
||||
|
||||
### 2. Fetch the project assets {#assets}
|
||||
|
||||
|
@ -221,6 +221,7 @@ pipelines.
|
|||
| `title` | An optional project title used in `--help` message and [auto-generated docs](#custom-docs). |
|
||||
| `description` | An optional project description used in [auto-generated docs](#custom-docs). |
|
||||
| `vars` | A dictionary of variables that can be referenced in paths, URLs and scripts, just like [`config.cfg` variables](/usage/training#config-interpolation). For example, `${vars.name}` will use the value of the variable `name`. Variables need to be defined in the section `vars`, but can be a nested dict, so you're able to reference `${vars.model.name}`. |
|
||||
| `env` | A dictionary of variables, mapped to the names of environment variables that will be read in when running the project. For example, `${env.name}` will use the value of the environment variable defined as `name`. |
|
||||
| `directories` | An optional list of [directories](#project-files) that should be created in the project for assets, training outputs, metrics etc. spaCy will make sure that these directories always exist. |
|
||||
| `assets` | A list of assets that can be fetched with the [`project assets`](/api/cli#project-assets) command. `url` defines a URL or local path, `dest` is the destination file relative to the project directory, and an optional `checksum` ensures that an error is raised if the file's checksum doesn't match. Instead of `url`, you can also provide a `git` block with the keys `repo`, `branch` and `path`, to download from a Git repo. |
|
||||
| `workflows` | A dictionary of workflow names, mapped to a list of command names, to execute in order. Workflows can be run with the [`project run`](/api/cli#project-run) command. |
|
||||
|
@ -310,8 +311,8 @@ company-internal and not available over the internet. In that case, you can
|
|||
specify the destination paths and a checksum, and leave out the URL. When your
|
||||
teammates clone and run your project, they can place the files in the respective
|
||||
directory themselves. The [`project assets`](/api/cli#project-assets) command
|
||||
will alert you about missing files and mismatched checksums, so you can ensure that
|
||||
others are running your project with the same data.
|
||||
will alert you about missing files and mismatched checksums, so you can ensure
|
||||
that others are running your project with the same data.
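
The check described above can be pictured roughly as follows. This is a minimal sketch rather than spaCy's actual implementation: it assumes the `checksum` is an MD5 hex digest, and `file_checksum` and `check_asset` are hypothetical helper names used only for illustration.

```python
import hashlib
from pathlib import Path


def file_checksum(path, chunk_size=8192):
    """Return the MD5 hex digest of a file, read in chunks (assumed hash type)."""
    md5 = hashlib.md5()
    with Path(path).open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            md5.update(chunk)
    return md5.hexdigest()


def check_asset(dest, expected=None):
    """Classify an asset as 'missing', 'mismatch' or 'ok' (hypothetical helper)."""
    path = Path(dest)
    if not path.exists():
        return "missing"
    if expected is not None and file_checksum(path) != expected:
        return "mismatch"
    return "ok"
```

A teammate who places a private file at its `dest` path by hand would get `"ok"` only if the file's digest matches the one recorded in `project.yml`.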
|
||||
|
||||
### Dependencies and outputs {#deps-outputs}
|
||||
|
||||
|
@ -358,9 +359,10 @@ graphs based on the dependencies and outputs, and won't re-run previous steps
|
|||
automatically. For instance, if you only run the command `train` that depends on
|
||||
data created by `preprocess` and those files are missing, spaCy will show an
|
||||
error – it won't just re-run `preprocess`. If you're looking for more advanced
|
||||
data management, check out the [Data Version Control (DVC) integration](#dvc). If you're planning on integrating your spaCy project with DVC, you
|
||||
can also use `outputs_no_cache` instead of `outputs` to define outputs that
|
||||
won't be cached or tracked.
|
||||
data management, check out the [Data Version Control (DVC) integration](#dvc).
|
||||
If you're planning on integrating your spaCy project with DVC, you can also use
|
||||
`outputs_no_cache` instead of `outputs` to define outputs that won't be cached
|
||||
or tracked.
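
To illustrate the behavior described above, here is a minimal sketch of a dependency check that reports missing files instead of re-running earlier steps. It is an illustration under assumptions, not spaCy's own code: `missing_deps` is a hypothetical helper, and the command is assumed to be a plain dict with a `deps` list like the entries in `project.yml`.

```python
from pathlib import Path


def missing_deps(command, project_dir="."):
    """Return the declared dependencies of a command that don't exist on disk."""
    root = Path(project_dir)
    return [dep for dep in command.get("deps", []) if not (root / dep).exists()]


def ensure_deps(command, project_dir="."):
    """Raise an error for missing dependencies rather than re-running other steps."""
    missing = missing_deps(command, project_dir)
    if missing:
        raise FileNotFoundError(
            f"Missing dependencies for '{command['name']}': {', '.join(missing)}"
        )
```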
|
||||
|
||||
### Files and directory structure {#project-files}
|
||||
|
||||
|
@ -467,7 +469,9 @@ In your `project.yml`, you can then run the script by calling
|
|||
`python scripts/custom_evaluation.py` with the function arguments. You can also
|
||||
use the `vars` section to define reusable variables that will be substituted in
|
||||
commands, paths and URLs. In this example, the batch size is defined as a
|
||||
variable and will be added in place of `${vars.batch_size}` in the script.
|
||||
variable and will be added in place of `${vars.batch_size}` in the script. Just like
|
||||
in the [training config](/usage/training#config-overrides), you can also
|
||||
override settings on the command line – for example using `--vars.batch_size`.
|
||||
|
||||
> #### Calling into Python
|
||||
>
|
||||
|
@ -491,6 +495,29 @@ commands:
|
|||
- 'corpus/eval.json'
|
||||
```
|
||||
|
||||
You can also use the `env` section to reference **environment variables** and
|
||||
make their values available to the commands. This can be useful for overriding
|
||||
settings on the command line and passing through system-level settings.
|
||||
|
||||
> #### Usage example
|
||||
>
|
||||
> ```bash
|
||||
> export GPU_ID=1
|
||||
> BATCH_SIZE=128 python -m spacy project run evaluate
|
||||
> ```
|
||||
|
||||
```yaml
|
||||
### project.yml
|
||||
env:
|
||||
batch_size: BATCH_SIZE
|
||||
gpu_id: GPU_ID
|
||||
|
||||
commands:
|
||||
- name: evaluate
|
||||
script:
|
||||
- 'python scripts/custom_evaluation.py ${env.batch_size}'
|
||||
```
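
Conceptually, the `env` section is resolved in two steps: each key is looked up via the environment variable it is mapped to, and the resulting value replaces the `${env.…}` placeholder in the script. The sketch below illustrates that idea only. `resolve_env` and `substitute` are hypothetical names, and falling back to an empty string for unset variables is an assumption, not documented behavior.

```python
import os
import re


def resolve_env(mapping, environ=None):
    """Map each project-level key to the value of its environment variable."""
    environ = os.environ if environ is None else environ
    # Assumption: unset environment variables resolve to an empty string
    return {key: environ.get(env_name, "") for key, env_name in mapping.items()}


def substitute(command, env_values):
    """Replace every ${env.<key>} placeholder with its resolved value."""
    return re.sub(
        r"\$\{env\.(\w+)\}",
        lambda match: str(env_values[match.group(1)]),
        command,
    )


mapping = {"batch_size": "BATCH_SIZE", "gpu_id": "GPU_ID"}
values = resolve_env(mapping, {"BATCH_SIZE": "128", "GPU_ID": "1"})
command = substitute("python scripts/custom_evaluation.py ${env.batch_size}", values)
print(command)  # python scripts/custom_evaluation.py 128
```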
|
||||
|
||||
### Documenting your project {#custom-docs}
|
||||
|
||||
> #### Readme Example
|
||||
|
|
|
@ -185,7 +185,7 @@ sections of a config file are:
|
|||
|
||||
For a full overview of spaCy's config format and settings, see the
|
||||
[data format documentation](/api/data-formats#config) and
|
||||
[Thinc's config system docs](https://thinc.ai/usage/config). The settings
|
||||
[Thinc's config system docs](https://thinc.ai/docs/usage-config). The settings
|
||||
available for the different architectures are documented with the
|
||||
[model architectures API](/api/architectures). See the Thinc documentation for
|
||||
[optimizers](https://thinc.ai/docs/api-optimizers) and
|
||||
|
|
|
@ -198,6 +198,7 @@
|
|||
"has_examples": true
|
||||
},
|
||||
{ "code": "tl", "name": "Tagalog" },
|
||||
{ "code": "tn", "name": "Setswana", "has_examples": true },
|
||||
{ "code": "tr", "name": "Turkish", "example": "Bu bir cümledir.", "has_examples": true },
|
||||
{ "code": "tt", "name": "Tatar", "has_examples": true },
|
||||
{
|
||||
|
|