Merge branch 'master' into spacy.io

This commit is contained in:
Ines Montani 2021-02-14 13:38:33 +11:00
commit 3246cf8b2b
47 changed files with 898 additions and 280 deletions

106
.github/contributors/peter-exos.md vendored Normal file
View File

@ -0,0 +1,106 @@
# spaCy contributor agreement
This spaCy Contributor Agreement (**"SCA"**) is based on the
[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf).
The SCA applies to any contribution that you make to any product or project
managed by us (the **"project"**), and sets out the intellectual property rights
you grant to us in the contributed materials. The term **"us"** shall mean
[ExplosionAI GmbH](https://explosion.ai/legal). The term
**"you"** shall mean the person or entity identified below.
If you agree to be bound by these terms, fill in the information requested
below and include the filled-in version with your first pull request, under the
folder [`.github/contributors/`](/.github/contributors/). The name of the file
should be your GitHub username, with the extension `.md`. For example, the user
example_user would create the file `.github/contributors/example_user.md`.
Read this agreement carefully before signing. These terms and conditions
constitute a binding legal agreement.
## Contributor Agreement
1. The term "contribution" or "contributed materials" means any source code,
object code, patch, tool, sample, graphic, specification, manual,
documentation, or any other material posted or submitted by you to the project.
2. With respect to any worldwide copyrights, or copyright applications and
registrations, in your contribution:
* you hereby assign to us joint ownership, and to the extent that such
assignment is or becomes invalid, ineffective or unenforceable, you hereby
grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
royalty-free, unrestricted license to exercise all rights under those
copyrights. This includes, at our option, the right to sublicense these same
rights to third parties through multiple levels of sublicensees or other
licensing arrangements;
* you agree that each of us can do all things in relation to your
contribution as if each of us were the sole owners, and if one of us makes
a derivative work of your contribution, the one who makes the derivative
work (or has it made will be the sole owner of that derivative work;
* you agree that you will not assert any moral rights in your contribution
against us, our licensees or transferees;
* you agree that we may register a copyright in your contribution and
exercise all ownership rights associated with it; and
* you agree that neither of us has any duty to consult with, obtain the
consent of, pay or render an accounting to the other for any use or
distribution of your contribution.
3. With respect to any patents you own, or that you can license without payment
to any third party, you hereby grant to us a perpetual, irrevocable,
non-exclusive, worldwide, no-charge, royalty-free license to:
* make, have made, use, sell, offer to sell, import, and otherwise transfer
your contribution in whole or in part, alone or in combination with or
included in any product, work or materials arising out of the project to
which your contribution was submitted, and
* at our option, to sublicense these same rights to third parties through
multiple levels of sublicensees or other licensing arrangements.
4. Except as set out above, you keep all right, title, and interest in your
contribution. The rights that you grant to us under these terms are effective
on the date you first submitted a contribution to us, even if your submission
took place before the date you sign these terms.
5. You covenant, represent, warrant and agree that:
* Each contribution that you submit is and shall be an original work of
authorship and you can legally grant the rights set out in this SCA;
* to the best of your knowledge, each contribution will not violate any
third party's copyrights, trademarks, patents, or other intellectual
property rights; and
* each contribution shall be in compliance with U.S. export control laws and
other applicable export and import laws. You agree to notify us if you
become aware of any circumstance which would make any of the foregoing
representations inaccurate in any respect. We may publicly disclose your
participation in the project, including the fact that you have signed the SCA.
6. This SCA is governed by the laws of the State of California and applicable
U.S. Federal law. Any choice of law rules will not apply.
7. Please place an “x” on one of the applicable statement below. Please do NOT
mark both statements:
* [ ] I am signing on behalf of myself as an individual and no other person
or entity, including my employer, has or will have rights with respect to my
contributions.
* [x] I am signing on behalf of my employer or a legal entity and I have the
actual authority to contractually bind that entity.
## Contributor Details
| Field | Entry |
|------------------------------- | -------------------- |
| Name | Peter Baumann |
| Company name (if applicable) | Exos Financial |
| Title or role (if applicable) | data scientist |
| Date | Feb 1st, 2021 |
| GitHub username | peter-exos |
| Website (optional) | |

View File

@ -1,6 +1,6 @@
# fmt: off
__title__ = "spacy"
__version__ = "3.0.1"
__version__ = "3.0.3"
__download_url__ = "https://github.com/explosion/spacy-models/releases/download"
__compatibility__ = "https://raw.githubusercontent.com/explosion/spacy-models/master/compatibility.json"
__projects__ = "https://github.com/explosion/projects"

View File

@ -16,7 +16,7 @@ import os
from ..schemas import ProjectConfigSchema, validate
from ..util import import_file, run_command, make_tempdir, registry, logger
from ..util import is_compatible_version, ENV_VARS
from ..util import is_compatible_version, SimpleFrozenDict, ENV_VARS
from .. import about
if TYPE_CHECKING:
@ -111,26 +111,33 @@ def _parse_overrides(args: List[str], is_cli: bool = False) -> Dict[str, Any]:
value = "true"
else:
value = args.pop(0)
# Just like we do in the config, we're calling json.loads on the
# values. But since they come from the CLI, it'd be unintuitive to
# explicitly mark strings with escaped quotes. So we're working
# around that here by falling back to a string if parsing fails.
# TODO: improve logic to handle simple types like list of strings?
try:
result[opt] = srsly.json_loads(value)
except ValueError:
result[opt] = str(value)
result[opt] = _parse_override(value)
else:
msg.fail(f"{err}: name should start with --", exits=1)
return result
def load_project_config(path: Path, interpolate: bool = True) -> Dict[str, Any]:
def _parse_override(value: Any) -> Any:
# Just like we do in the config, we're calling json.loads on the
# values. But since they come from the CLI, it'd be unintuitive to
# explicitly mark strings with escaped quotes. So we're working
# around that here by falling back to a string if parsing fails.
# TODO: improve logic to handle simple types like list of strings?
try:
return srsly.json_loads(value)
except ValueError:
return str(value)
def load_project_config(
path: Path, interpolate: bool = True, overrides: Dict[str, Any] = SimpleFrozenDict()
) -> Dict[str, Any]:
"""Load the project.yml file from a directory and validate it. Also make
sure that all directories defined in the config exist.
path (Path): The path to the project directory.
interpolate (bool): Whether to substitute project variables.
overrides (Dict[str, Any]): Optional config overrides.
RETURNS (Dict[str, Any]): The loaded project.yml.
"""
config_path = path / PROJECT_FILE
@ -154,20 +161,36 @@ def load_project_config(path: Path, interpolate: bool = True) -> Dict[str, Any]:
if not dir_path.exists():
dir_path.mkdir(parents=True)
if interpolate:
err = "project.yml validation error"
err = f"{PROJECT_FILE} validation error"
with show_validation_error(title=err, hint_fill=False):
config = substitute_project_variables(config)
config = substitute_project_variables(config, overrides)
return config
def substitute_project_variables(config: Dict[str, Any], overrides: Dict = {}):
key = "vars"
def substitute_project_variables(
config: Dict[str, Any],
overrides: Dict[str, Any] = SimpleFrozenDict(),
key: str = "vars",
env_key: str = "env",
) -> Dict[str, Any]:
"""Interpolate variables in the project file using the config system.
config (Dict[str, Any]): The project config.
overrides (Dict[str, Any]): Optional config overrides.
key (str): Key containing variables in project config.
env_key (str): Key containing environment variable mapping in project config.
RETURNS (Dict[str, Any]): The interpolated project config.
"""
config.setdefault(key, {})
config[key].update(overrides)
config.setdefault(env_key, {})
# Substitute references to env vars with their values
for config_var, env_var in config[env_key].items():
config[env_key][config_var] = _parse_override(os.environ.get(env_var, ""))
# Need to put variables in the top scope again so we can have a top-level
# section "project" (otherwise, a list of commands in the top scope wouldn't)
# be allowed by Thinc's config system
cfg = Config({"project": config, key: config[key]})
cfg = Config({"project": config, key: config[key], env_key: config[env_key]})
cfg = Config().from_str(cfg.to_str(), overrides=overrides)
interpolated = cfg.interpolate()
return dict(interpolated["project"])

View File

@ -175,10 +175,13 @@ def render_parses(
def print_prf_per_type(
msg: Printer, scores: Dict[str, Dict[str, float]], name: str, type: str
) -> None:
data = [
(k, f"{v['p']*100:.2f}", f"{v['r']*100:.2f}", f"{v['f']*100:.2f}")
for k, v in scores.items()
]
data = []
for key, value in scores.items():
row = [key]
for k in ("p", "r", "f"):
v = value[k]
row.append(f"{v * 100:.2f}" if isinstance(v, (int, float)) else v)
data.append(row)
msg.table(
data,
header=("", "P", "R", "F"),
@ -191,7 +194,10 @@ def print_textcats_auc_per_cat(
msg: Printer, scores: Dict[str, Dict[str, float]]
) -> None:
msg.table(
[(k, f"{v:.2f}") for k, v in scores.items()],
[
(k, f"{v:.2f}" if isinstance(v, (float, int)) else v)
for k, v in scores.items()
],
header=("", "ROC AUC"),
aligns=("l", "r"),
title="Textcat ROC AUC (per label)",

View File

@ -3,19 +3,23 @@ from pathlib import Path
from wasabi import msg
import sys
import srsly
import typer
from ... import about
from ...git_info import GIT_VERSION
from ...util import working_dir, run_command, split_command, is_cwd, join_command
from ...util import SimpleFrozenList, is_minor_version_match, ENV_VARS
from ...util import check_bool_env_var
from ...util import check_bool_env_var, SimpleFrozenDict
from .._util import PROJECT_FILE, PROJECT_LOCK, load_project_config, get_hash
from .._util import get_checksum, project_cli, Arg, Opt, COMMAND
from .._util import get_checksum, project_cli, Arg, Opt, COMMAND, parse_config_overrides
@project_cli.command("run")
@project_cli.command(
"run", context_settings={"allow_extra_args": True, "ignore_unknown_options": True}
)
def project_run_cli(
# fmt: off
ctx: typer.Context, # This is only used to read additional arguments
subcommand: str = Arg(None, help=f"Name of command defined in the {PROJECT_FILE}"),
project_dir: Path = Arg(Path.cwd(), help="Location of project directory. Defaults to current working directory.", exists=True, file_okay=False),
force: bool = Opt(False, "--force", "-F", help="Force re-running steps, even if nothing changed"),
@ -33,13 +37,15 @@ def project_run_cli(
if show_help or not subcommand:
print_run_help(project_dir, subcommand)
else:
project_run(project_dir, subcommand, force=force, dry=dry)
overrides = parse_config_overrides(ctx.args)
project_run(project_dir, subcommand, overrides=overrides, force=force, dry=dry)
def project_run(
project_dir: Path,
subcommand: str,
*,
overrides: Dict[str, Any] = SimpleFrozenDict(),
force: bool = False,
dry: bool = False,
capture: bool = False,
@ -59,7 +65,7 @@ def project_run(
when you want to turn over execution to the command, and capture=True
when you want to run the command more like a function.
"""
config = load_project_config(project_dir)
config = load_project_config(project_dir, overrides=overrides)
commands = {cmd["name"]: cmd for cmd in config.get("commands", [])}
workflows = config.get("workflows", {})
validate_subcommand(commands.keys(), workflows.keys(), subcommand)

View File

@ -28,6 +28,15 @@ bg:
accuracy:
name: iarfmoose/roberta-base-bulgarian
size_factor: 3
bn:
word_vectors: null
transformer:
efficiency:
name: sagorsarker/bangla-bert-base
size_factor: 3
accuracy:
name: sagorsarker/bangla-bert-base
size_factor: 3
da:
word_vectors: da_core_news_lg
transformer:
@ -104,10 +113,10 @@ hi:
word_vectors: null
transformer:
efficiency:
name: monsoon-nlp/hindi-tpu-electra
name: ai4bharat/indic-bert
size_factor: 3
accuracy:
name: monsoon-nlp/hindi-tpu-electra
name: ai4bharat/indic-bert
size_factor: 3
id:
word_vectors: null
@ -185,10 +194,10 @@ si:
word_vectors: null
transformer:
efficiency:
name: keshan/SinhalaBERTo
name: setu4993/LaBSE
size_factor: 3
accuracy:
name: keshan/SinhalaBERTo
name: setu4993/LaBSE
size_factor: 3
sv:
word_vectors: null
@ -203,10 +212,10 @@ ta:
word_vectors: null
transformer:
efficiency:
name: monsoon-nlp/tamillion
name: ai4bharat/indic-bert
size_factor: 3
accuracy:
name: monsoon-nlp/tamillion
name: ai4bharat/indic-bert
size_factor: 3
te:
word_vectors: null

View File

@ -579,8 +579,8 @@ class Errors:
E922 = ("Component '{name}' has been initialized with an output dimension of "
"{nO} - cannot add any more labels.")
E923 = ("It looks like there is no proper sample data to initialize the "
"Model of component '{name}'. This is likely a bug in spaCy, so "
"feel free to open an issue: https://github.com/explosion/spaCy/issues")
"Model of component '{name}'. To check your input data paths and "
"annotation, run: python -m spacy debug data config.cfg")
E924 = ("The '{name}' component does not seem to be initialized properly. "
"This is likely a bug in spaCy, so feel free to open an issue: "
"https://github.com/explosion/spaCy/issues")

View File

@ -4,30 +4,30 @@
STOP_WORDS = set(
"""
ግን አንቺ አንተ እናንተ ያንተ ያንቺ የናንተ ራስህን ራስሽን ራሳችሁን
ሁሉ ኋላ በሰሞኑ አሉ በኋላ ሁኔታ በኩል አስታውቀዋል ሆነ በውስጥ
አስታውሰዋል ሆኑ ባጣም እስካሁን ሆኖም በተለይ አሳሰበ ሁል በተመለከተ
አሳስበዋል ላይ በተመሳሳይ አስፈላጊ ሌላ የተለያየ አስገነዘቡ ሌሎች የተለያዩ
አስገንዝበዋል ልዩ ተባለ አብራርተዋል መሆኑ ተገለጸ አስረድተዋል ተገልጿል
ማለቱ ተጨማሪ እባክህ የሚገኝ ተከናወነ እባክሽ ማድረግ ችግር አንጻር ማን
ትናንት እስኪደርስ ነበረች እንኳ ሰሞኑን ነበሩ እንኳን ሲሆን ነበር እዚሁ ሲል
ነው እንደገለጹት አለ እንደተናገሩት ቢሆን ነገር እንዳስረዱት ብለዋል ነገሮች
እንደገና ብዙ ናት ወቅት ቦታ ናቸው እንዲሁም በርካታ አሁን እንጂ እስከ
ማለት የሚሆኑት ስለማናቸውም ውስጥ ይሆናሉ ሲባል ከሆነው ስለዚሁ ከአንድ
ያልሆነ ሳለ የነበረውን ከአንዳንድ በማናቸውም በሙሉ የሆነው ያሉ በእነዚሁ
ወር መሆናቸው ከሌሎች በዋና አንዲት ወይም
በላይ እንደ በማቀድ ለሌሎች በሆኑ ቢሆንም ጊዜና ይሆኑበታል በሆነ አንዱ
ለዚህ ለሆነው ለነዚህ ከዚህ የሌላውን ሶስተኛ አንዳንድ ለማንኛውም የሆነ ከሁለት
የነገሩ ሰኣት አንደኛ እንዲሆን እንደነዚህ ማንኛውም ካልሆነ የሆኑት ጋር ቢያንስ
ሁሉ ኋላ በሰሞኑ አሉ በኋላ ሁኔታ በኩል አስታውቀዋል ሆነ በውስጥ
አስታውሰዋል ሆኑ ባጣም እስካሁን ሆኖም በተለይ አሳሰበ ሁል በተመለከተ
አሳስበዋል ላይ በተመሳሳይ አስፈላጊ ሌላ የተለያየ አስገነዘቡ ሌሎች የተለያዩ
አስገንዝበዋል ልዩ ተባለ አብራርተዋል መሆኑ ተገለጸ አስረድተዋል ተገልጿል
ማለቱ ተጨማሪ እባክህ የሚገኝ ተከናወነ እባክሽ ማድረግ ችግር አንጻር ማን
ትናንት እስኪደርስ ነበረች እንኳ ሰሞኑን ነበሩ እንኳን ሲሆን ነበር እዚሁ ሲል
ነው እንደገለጹት አለ እንደተናገሩት ቢሆን ነገር እንዳስረዱት ብለዋል ነገሮች
እንደገና ብዙ ናት ወቅት ቦታ ናቸው እንዲሁም በርካታ አሁን እንጂ እስከ
ማለት የሚሆኑት ስለማናቸውም ውስጥ ይሆናሉ ሲባል ከሆነው ስለዚሁ ከአንድ
ያልሆነ ሳለ የነበረውን ከአንዳንድ በማናቸውም በሙሉ የሆነው ያሉ በእነዚሁ
ወር መሆናቸው ከሌሎች በዋና አንዲት ወይም
በላይ እንደ በማቀድ ለሌሎች በሆኑ ቢሆንም ጊዜና ይሆኑበታል በሆነ አንዱ
ለዚህ ለሆነው ለነዚህ ከዚህ የሌላውን ሶስተኛ አንዳንድ ለማንኛውም የሆነ ከሁለት
የነገሩ ሰኣት አንደኛ እንዲሆን እንደነዚህ ማንኛውም ካልሆነ የሆኑት ጋር ቢያንስ
ይህንንም እነደሆነ እነዚህን ይኸው የማናቸውም
በሙሉም ይህችው በተለይም አንዱን የሚችለውን በነዚህ ከእነዚህ በሌላ
የዚሁ ከእነዚሁ ለዚሁ በሚገባ ለእያንዳንዱ የአንቀጹ ወደ ይህም ስለሆነ ወይ
ማናቸውንም ተብሎ እነዚህ መሆናቸውን የሆነችን ከአስር ሳይሆን ከዚያ የለውም
የማይበልጥ እንደሆነና እንዲሆኑ በሚችሉ ብቻ ብሎ ከሌላ የሌላቸውን
ለሆነ በሌሎች ሁለቱንም በቀር ይህ በታች አንደሆነ በነሱ
ይህን የሌላ እንዲህ ከሆነ ያላቸው በነዚሁ በሚል የዚህ ይህንኑ
በእንደዚህ ቁጥር ማናቸውም ሆነው ባሉ በዚህ በስተቀር ሲሆንና
በዚህም መሆን ምንጊዜም እነዚህም በዚህና ያለ ስም
ሲኖር ከዚህም መሆኑን በሁኔታው የማያንስ እነዚህኑ ማንም ከነዚሁ
በሙሉም ይህችው በተለይም አንዱን የሚችለውን በነዚህ ከእነዚህ በሌላ
የዚሁ ከእነዚሁ ለዚሁ በሚገባ ለእያንዳንዱ የአንቀጹ ወደ ይህም ስለሆነ ወይ
ማናቸውንም ተብሎ እነዚህ መሆናቸውን የሆነችን ከአስር ሳይሆን ከዚያ የለውም
የማይበልጥ እንደሆነና እንዲሆኑ በሚችሉ ብቻ ብሎ ከሌላ የሌላቸውን
ለሆነ በሌሎች ሁለቱንም በቀር ይህ በታች አንደሆነ በነሱ
ይህን የሌላ እንዲህ ከሆነ ያላቸው በነዚሁ በሚል የዚህ ይህንኑ
በእንደዚህ ቁጥር ማናቸውም ሆነው ባሉ በዚህ በስተቀር ሲሆንና
በዚህም መሆን ምንጊዜም እነዚህም በዚህና ያለ ስም
ሲኖር ከዚህም መሆኑን በሁኔታው የማያንስ እነዚህኑ ማንም ከነዚሁ
ያላቸውን እጅግ ሲሆኑ ለሆኑ ሊሆን ለማናቸውም
""".split()
)

18
spacy/lang/tn/__init__.py Normal file
View File

@ -0,0 +1,18 @@
from .stop_words import STOP_WORDS
from .lex_attrs import LEX_ATTRS
from .punctuation import TOKENIZER_INFIXES
from ...language import Language
class SetswanaDefaults(Language.Defaults):
infixes = TOKENIZER_INFIXES
stop_words = STOP_WORDS
lex_attr_getters = LEX_ATTRS
class Setswana(Language):
lang = "tn"
Defaults = SetswanaDefaults
__all__ = ["Setswana"]

15
spacy/lang/tn/examples.py Normal file
View File

@ -0,0 +1,15 @@
"""
Example sentences to test spaCy and its language models.
>>> from spacy.lang.tn.examples import sentences
>>> docs = nlp.pipe(sentences)
"""
sentences = [
"Apple e nyaka go reka JSE ka tlhwatlhwa ta R1 billion",
"Johannesburg ke toropo e kgolo mo Afrika Borwa.",
"O ko kae?",
"ke mang presidente ya Afrika Borwa?",
"ke eng toropo kgolo ya Afrika Borwa?",
"Nelson Mandela o belegwe leng?",
]

107
spacy/lang/tn/lex_attrs.py Normal file
View File

@ -0,0 +1,107 @@
from ...attrs import LIKE_NUM
_num_words = [
"lefela",
"nngwe",
"pedi",
"tharo",
"nne",
"tlhano",
"thataro",
"supa",
"robedi",
"robongwe",
"lesome",
"lesomenngwe",
"lesomepedi",
"sometharo",
"somenne",
"sometlhano",
"somethataro",
"somesupa",
"somerobedi",
"somerobongwe",
"someamabedi",
"someamararo",
"someamane",
"someamatlhano",
"someamarataro",
"someamasupa",
"someamarobedi",
"someamarobongwe",
"lekgolo",
"sekete",
"milione",
"bilione",
"terilione",
"kwatirilione",
"gajillione",
"bazillione",
]
_ordinal_words = [
"ntlha",
"bobedi",
"boraro",
"bone",
"botlhano",
"borataro",
"bosupa",
"borobedi ",
"borobongwe",
"bolesome",
"bolesomengwe",
"bolesomepedi",
"bolesometharo",
"bolesomenne",
"bolesometlhano",
"bolesomethataro",
"bolesomesupa",
"bolesomerobedi",
"bolesomerobongwe",
"somamabedi",
"someamararo",
"someamane",
"someamatlhano",
"someamarataro",
"someamasupa",
"someamarobedi",
"someamarobongwe",
"lekgolo",
"sekete",
"milione",
"bilione",
"terilione",
"kwatirilione",
"gajillione",
"bazillione",
]
def like_num(text):
if text.startswith(("+", "-", "±", "~")):
text = text[1:]
text = text.replace(",", "").replace(".", "")
if text.isdigit():
return True
if text.count("/") == 1:
num, denom = text.split("/")
if num.isdigit() and denom.isdigit():
return True
text_lower = text.lower()
if text_lower in _num_words:
return True
# CHeck ordinal number
if text_lower in _ordinal_words:
return True
if text_lower.endswith("th"):
if text_lower[:-2].isdigit():
return True
return False
LEX_ATTRS = {LIKE_NUM: like_num}

View File

@ -0,0 +1,19 @@
from ..char_classes import LIST_ELLIPSES, LIST_ICONS, HYPHENS
from ..char_classes import CONCAT_QUOTES, ALPHA_LOWER, ALPHA_UPPER, ALPHA
_infixes = (
LIST_ELLIPSES
+ LIST_ICONS
+ [
r"(?<=[0-9])[+\-\*^](?=[0-9-])",
r"(?<=[{al}{q}])\.(?=[{au}{q}])".format(
al=ALPHA_LOWER, au=ALPHA_UPPER, q=CONCAT_QUOTES
),
r"(?<=[{a}]),(?=[{a}])".format(a=ALPHA),
r"(?<=[{a}0-9])(?:{h})(?=[{a}])".format(a=ALPHA, h=HYPHENS),
r"(?<=[{a}0-9])[:<>=/](?=[{a}])".format(a=ALPHA),
]
)
TOKENIZER_INFIXES = _infixes

View File

@ -0,0 +1,20 @@
# Stop words
STOP_WORDS = set(
"""
ke gareng ga selekanyo tlhwatlhwa yo mongwe se
sengwe fa go le jalo gongwe ba na mo tikologong
jaaka kwa morago nna gonne ka sa pele nako teng
tlase fela ntle magareng tsona feta bobedi kgabaganya
moo gape kgatlhanong botlhe tsotlhe bokana e esi
setseng mororo dinako golo kgolo nnye wena gago
o ntse ntle tla goreng gangwe mang yotlhe gore
eo yona tseraganyo eng ne sentle re rona thata
godimo fitlha pedi masomamabedi lesomepedi mmogo
tharo tseo boraro tseno yone jaanong bobona bona
lesome tsaya tsamaiso nngwe masomethataro thataro
tsa mmatota tota sale thoko supa dira tshwanetse di mmalwa masisi
bonala e tshwanang bogolo tsenya tsweetswee karolo
sepe tlhalosa dirwa robedi robongwe lesomenngwe gaisa
tlhano lesometlhano botlalo lekgolo
""".split()
)

View File

@ -451,7 +451,7 @@ cdef class Lexeme:
Lexeme.c_set_flag(self.c, IS_QUOTE, x)
property is_left_punct:
"""RETURNS (bool): Whether the lexeme is left punctuation, e.g. )."""
"""RETURNS (bool): Whether the lexeme is left punctuation, e.g. (."""
def __get__(self):
return Lexeme.c_check_flag(self.c, IS_LEFT_PUNCT)

View File

@ -18,4 +18,4 @@ cdef class PhraseMatcher:
cdef Pool mem
cdef key_t _terminal_hash
cdef void find_matches(self, Doc doc, vector[SpanC] *matches) nogil
cdef void find_matches(self, Doc doc, int start_idx, int end_idx, vector[SpanC] *matches) nogil

View File

@ -230,10 +230,10 @@ cdef class PhraseMatcher:
result = internal_node
map_set(self.mem, <MapStruct*>result, self.vocab.strings[key], NULL)
def __call__(self, doc, *, as_spans=False):
def __call__(self, object doclike, *, as_spans=False):
"""Find all sequences matching the supplied patterns on the `Doc`.
doc (Doc): The document to match over.
doclike (Doc or Span): The document to match over.
as_spans (bool): Return Span objects with labels instead of (match_id,
start, end) tuples.
RETURNS (list): A list of `(match_id, start, end)` tuples,
@ -244,12 +244,22 @@ cdef class PhraseMatcher:
DOCS: https://spacy.io/api/phrasematcher#call
"""
matches = []
if doc is None or len(doc) == 0:
if doclike is None or len(doclike) == 0:
# if doc is empty or None just return empty list
return matches
if isinstance(doclike, Doc):
doc = doclike
start_idx = 0
end_idx = len(doc)
elif isinstance(doclike, Span):
doc = doclike.doc
start_idx = doclike.start
end_idx = doclike.end
else:
raise ValueError(Errors.E195.format(good="Doc or Span", got=type(doclike).__name__))
cdef vector[SpanC] c_matches
self.find_matches(doc, &c_matches)
self.find_matches(doc, start_idx, end_idx, &c_matches)
for i in range(c_matches.size()):
matches.append((c_matches[i].label, c_matches[i].start, c_matches[i].end))
for i, (ent_id, start, end) in enumerate(matches):
@ -261,17 +271,17 @@ cdef class PhraseMatcher:
else:
return matches
cdef void find_matches(self, Doc doc, vector[SpanC] *matches) nogil:
cdef void find_matches(self, Doc doc, int start_idx, int end_idx, vector[SpanC] *matches) nogil:
cdef MapStruct* current_node = self.c_map
cdef int start = 0
cdef int idx = 0
cdef int idy = 0
cdef int idx = start_idx
cdef int idy = start_idx
cdef key_t key
cdef void* value
cdef int i = 0
cdef SpanC ms
cdef void* result
while idx < doc.length:
while idx < end_idx:
start = idx
token = Token.get_struct_attr(&doc.c[idx], self.attr)
# look for sequences from this position
@ -279,7 +289,7 @@ cdef class PhraseMatcher:
if result:
current_node = <MapStruct*>result
idy = idx + 1
while idy < doc.length:
while idy < end_idx:
result = map_get(current_node, self._terminal_hash)
if result:
i = 0

View File

@ -107,6 +107,7 @@ def init_ensemble_textcat(model, X, Y) -> Model:
model.get_ref("maxout_layer").set_dim("nO", tok2vec_width)
model.get_ref("maxout_layer").set_dim("nI", tok2vec_width)
model.get_ref("norm_layer").set_dim("nI", tok2vec_width)
model.get_ref("norm_layer").set_dim("nO", tok2vec_width)
init_chain(model, X, Y)
return model

View File

@ -273,7 +273,7 @@ class EntityLinker(TrainablePipe):
gradients = self.distance.get_grad(sentence_encodings, entity_encodings)
loss = self.distance.get_loss(sentence_encodings, entity_encodings)
loss = loss / len(entity_encodings)
return loss, gradients
return float(loss), gradients
def predict(self, docs: Iterable[Doc]) -> List[str]:
"""Apply the pipeline's model to a batch of docs, without modifying them.

View File

@ -76,7 +76,7 @@ def merge_subtokens(doc: Doc, label: str = "subtok") -> Doc:
retokenizes=True,
)
def make_token_splitter(
nlp: Language, name: str, *, min_length=0, split_length=0,
nlp: Language, name: str, *, min_length: int = 0, split_length: int = 0
):
return TokenSplitter(min_length=min_length, split_length=split_length)

View File

@ -197,7 +197,7 @@ class ClozeMultitask(TrainablePipe):
target = vectors[ids]
gradient = self.distance.get_grad(prediction, target)
loss = self.distance.get_loss(prediction, target)
return loss, gradient
return float(loss), gradient
def update(self, examples, *, drop=0., sgd=None, losses=None):
pass

View File

@ -121,7 +121,7 @@ class Tok2Vec(TrainablePipe):
tokvecs = self.model.predict(docs)
batch_id = Tok2VecListener.get_batch_id(docs)
for listener in self.listeners:
listener.receive(batch_id, tokvecs, lambda dX: [])
listener.receive(batch_id, tokvecs, _empty_backprop)
return tokvecs
def set_annotations(self, docs: Sequence[Doc], tokvecses) -> None:
@ -291,12 +291,18 @@ def forward(model: Tok2VecListener, inputs, is_train: bool):
# of data.
# When the components batch differently, we don't receive a matching
# prediction from the upstream, so we can't predict.
if not all(doc.tensor.size for doc in inputs):
# But we do need to do *something* if the tensor hasn't been set.
# The compromise is to at least return data of the right shape,
# so the output is valid.
width = model.get_dim("nO")
outputs = [model.ops.alloc2f(len(doc), width) for doc in inputs]
else:
outputs = [doc.tensor for doc in inputs]
outputs = []
width = model.get_dim("nO")
for doc in inputs:
if doc.tensor.size == 0:
# But we do need to do *something* if the tensor hasn't been set.
# The compromise is to at least return data of the right shape,
# so the output is valid.
outputs.append(model.ops.alloc2f(len(doc), width))
else:
outputs.append(doc.tensor)
return outputs, lambda dX: []
def _empty_backprop(dX): # for pickling
return []

View File

@ -446,6 +446,7 @@ class ProjectConfigCommand(BaseModel):
class ProjectConfigSchema(BaseModel):
# fmt: off
vars: Dict[StrictStr, Any] = Field({}, title="Optional variables to substitute in commands")
env: Dict[StrictStr, Any] = Field({}, title="Optional variable names to substitute in commands, mapped to environment variable names")
assets: List[Union[ProjectConfigAssetURL, ProjectConfigAssetGit]] = Field([], title="Data assets")
workflows: Dict[StrictStr, List[StrictStr]] = Field({}, title="Named workflows, mapped to list of project commands to run in order")
commands: List[ProjectConfigCommand] = Field([], title="Project command shortucts")

View File

@ -8,7 +8,8 @@ from spacy.util import get_lang_class
LANGUAGES = ["af", "ar", "bg", "bn", "ca", "cs", "da", "de", "el", "en", "es",
"et", "fa", "fi", "fr", "ga", "he", "hi", "hr", "hu", "id", "is",
"it", "kn", "lt", "lv", "nb", "nl", "pl", "pt", "ro", "si", "sk",
"sl", "sq", "sr", "sv", "ta", "te", "tl", "tr", "tt", "ur", 'yo']
"sl", "sq", "sr", "sv", "ta", "te", "tl", "tn", "tr", "tt", "ur",
"yo"]
# fmt: on

View File

@ -323,3 +323,39 @@ def test_phrase_matcher_deprecated(en_vocab):
@pytest.mark.parametrize("attr", ["SENT_START", "IS_SENT_START"])
def test_phrase_matcher_sent_start(en_vocab, attr):
_ = PhraseMatcher(en_vocab, attr=attr) # noqa: F841
def test_span_in_phrasematcher(en_vocab):
"""Ensure that PhraseMatcher accepts Span and Doc as input"""
# fmt: off
words = ["I", "like", "Spans", "and", "Docs", "in", "my", "input", ",", "and", "nothing", "else", "."]
# fmt: on
doc = Doc(en_vocab, words=words)
span = doc[:8]
pattern = Doc(en_vocab, words=["Spans", "and", "Docs"])
matcher = PhraseMatcher(en_vocab)
matcher.add("SPACY", [pattern])
matches_doc = matcher(doc)
matches_span = matcher(span)
assert len(matches_doc) == 1
assert len(matches_span) == 1
def test_span_v_doc_in_phrasematcher(en_vocab):
"""Ensure that PhraseMatcher only returns matches in input Span and not in entire Doc"""
# fmt: off
words = [
"I", "like", "Spans", "and", "Docs", "in", "my", "input", ",", "Spans",
"and", "Docs", "in", "my", "matchers", "," "and", "Spans", "and", "Docs",
"everywhere", "."
]
# fmt: on
doc = Doc(en_vocab, words=words)
span = doc[9:15] # second clause
pattern = Doc(en_vocab, words=["Spans", "and", "Docs"])
matcher = PhraseMatcher(en_vocab)
matcher.add("SPACY", [pattern])
matches_doc = matcher(doc)
matches_span = matcher(span)
assert len(matches_doc) == 3
assert len(matches_span) == 1

View File

@ -451,13 +451,27 @@ def test_pipe_factories_from_source_config():
assert config["arg"] == "world"
def test_pipe_factories_decorator_idempotent():
class PipeFactoriesIdempotent:
def __init__(self, nlp, name):
...
def __call__(self, doc):
...
@pytest.mark.parametrize(
"i,func,func2",
[
(0, lambda nlp, name: lambda doc: doc, lambda doc: doc),
(1, PipeFactoriesIdempotent, PipeFactoriesIdempotent(None, None)),
],
)
def test_pipe_factories_decorator_idempotent(i, func, func2):
"""Check that decorator can be run multiple times if the function is the
same. This is especially relevant for live reloading because we don't
want spaCy to raise an error if a module registering components is reloaded.
"""
name = "test_pipe_factories_decorator_idempotent"
func = lambda nlp, name: lambda doc: doc
name = f"test_pipe_factories_decorator_idempotent_{i}"
for i in range(5):
Language.factory(name, func=func)
nlp = Language()
@ -466,7 +480,6 @@ def test_pipe_factories_decorator_idempotent():
# Make sure it also works for component decorator, which creates the
# factory function
name2 = f"{name}2"
func2 = lambda doc: doc
for i in range(5):
Language.component(name2, func=func2)
nlp = Language()

View File

@ -0,0 +1,229 @@
import pytest
from spacy.lang.en import English
import numpy as np
import spacy
from spacy.tokens import Doc
from spacy.matcher import PhraseMatcher
from spacy.tokens import DocBin
from spacy.util import load_config_from_str
from spacy.training import Example
from spacy.training.initialize import init_nlp
import pickle
from ..util import make_tempdir
def test_issue6730(en_vocab):
"""Ensure that the KB does not accept empty strings, but otherwise IO works fine."""
from spacy.kb import KnowledgeBase
kb = KnowledgeBase(en_vocab, entity_vector_length=3)
kb.add_entity(entity="1", freq=148, entity_vector=[1, 2, 3])
with pytest.raises(ValueError):
kb.add_alias(alias="", entities=["1"], probabilities=[0.4])
assert kb.contains_alias("") is False
kb.add_alias(alias="x", entities=["1"], probabilities=[0.2])
kb.add_alias(alias="y", entities=["1"], probabilities=[0.1])
with make_tempdir() as tmp_dir:
kb.to_disk(tmp_dir)
kb.from_disk(tmp_dir)
assert kb.get_size_aliases() == 2
assert set(kb.get_alias_strings()) == {"x", "y"}
def test_issue6755(en_tokenizer):
doc = en_tokenizer("This is a magnificent sentence.")
span = doc[:0]
assert span.text_with_ws == ""
assert span.text == ""
@pytest.mark.parametrize(
"sentence, start_idx,end_idx,label",
[("Welcome to Mumbai, my friend", 11, 17, "GPE")],
)
def test_issue6815_1(sentence, start_idx, end_idx, label):
nlp = English()
doc = nlp(sentence)
span = doc[:].char_span(start_idx, end_idx, label=label)
assert span.label_ == label
@pytest.mark.parametrize(
"sentence, start_idx,end_idx,kb_id", [("Welcome to Mumbai, my friend", 11, 17, 5)]
)
def test_issue6815_2(sentence, start_idx, end_idx, kb_id):
nlp = English()
doc = nlp(sentence)
span = doc[:].char_span(start_idx, end_idx, kb_id=kb_id)
assert span.kb_id == kb_id
@pytest.mark.parametrize(
"sentence, start_idx,end_idx,vector",
[("Welcome to Mumbai, my friend", 11, 17, np.array([0.1, 0.2, 0.3]))],
)
def test_issue6815_3(sentence, start_idx, end_idx, vector):
nlp = English()
doc = nlp(sentence)
span = doc[:].char_span(start_idx, end_idx, vector=vector)
assert (span.vector == vector).all()
def test_issue6839(en_vocab):
"""Ensure that PhraseMatcher accepts Span as input"""
# fmt: off
words = ["I", "like", "Spans", "and", "Docs", "in", "my", "input", ",", "and", "nothing", "else", "."]
# fmt: on
doc = Doc(en_vocab, words=words)
span = doc[:8]
pattern = Doc(en_vocab, words=["Spans", "and", "Docs"])
matcher = PhraseMatcher(en_vocab)
matcher.add("SPACY", [pattern])
matches = matcher(span)
assert matches
CONFIG_ISSUE_6908 = """
[paths]
train = "TRAIN_PLACEHOLDER"
raw = null
init_tok2vec = null
vectors = null
[system]
seed = 0
gpu_allocator = null
[nlp]
lang = "en"
pipeline = ["textcat"]
tokenizer = {"@tokenizers":"spacy.Tokenizer.v1"}
disabled = []
before_creation = null
after_creation = null
after_pipeline_creation = null
batch_size = 1000
[components]
[components.textcat]
factory = "TEXTCAT_PLACEHOLDER"
[corpora]
[corpora.train]
@readers = "spacy.Corpus.v1"
path = ${paths:train}
[corpora.dev]
@readers = "spacy.Corpus.v1"
path = ${paths:train}
[training]
train_corpus = "corpora.train"
dev_corpus = "corpora.dev"
seed = ${system.seed}
gpu_allocator = ${system.gpu_allocator}
frozen_components = []
before_to_disk = null
[pretraining]
[initialize]
vectors = ${paths.vectors}
init_tok2vec = ${paths.init_tok2vec}
vocab_data = null
lookups = null
before_init = null
after_init = null
[initialize.components]
[initialize.components.textcat]
labels = ['label1', 'label2']
[initialize.tokenizer]
"""
@pytest.mark.parametrize(
"component_name", ["textcat", "textcat_multilabel"],
)
def test_issue6908(component_name):
"""Test intializing textcat with labels in a list"""
def create_data(out_file):
nlp = spacy.blank("en")
doc = nlp.make_doc("Some text")
doc.cats = {"label1": 0, "label2": 1}
out_data = DocBin(docs=[doc]).to_bytes()
with out_file.open("wb") as file_:
file_.write(out_data)
with make_tempdir() as tmp_path:
train_path = tmp_path / "train.spacy"
create_data(train_path)
config_str = CONFIG_ISSUE_6908.replace("TEXTCAT_PLACEHOLDER", component_name)
config_str = config_str.replace("TRAIN_PLACEHOLDER", train_path.as_posix())
config = load_config_from_str(config_str)
init_nlp(config)
CONFIG_ISSUE_6950 = """
[nlp]
lang = "en"
pipeline = ["tok2vec", "tagger"]
[components]
[components.tok2vec]
factory = "tok2vec"
[components.tok2vec.model]
@architectures = "spacy.Tok2Vec.v1"
[components.tok2vec.model.embed]
@architectures = "spacy.MultiHashEmbed.v1"
width = ${components.tok2vec.model.encode:width}
attrs = ["NORM","PREFIX","SUFFIX","SHAPE"]
rows = [5000,2500,2500,2500]
include_static_vectors = false
[components.tok2vec.model.encode]
@architectures = "spacy.MaxoutWindowEncoder.v1"
width = 96
depth = 4
window_size = 1
maxout_pieces = 3
[components.ner]
factory = "ner"
[components.tagger]
factory = "tagger"
[components.tagger.model]
@architectures = "spacy.Tagger.v1"
nO = null
[components.tagger.model.tok2vec]
@architectures = "spacy.Tok2VecListener.v1"
width = ${components.tok2vec.model.encode:width}
upstream = "*"
"""
def test_issue6950():
"""Test that the nlp object with initialized tok2vec with listeners pickles
correctly (and doesn't have lambdas).
"""
nlp = English.from_config(load_config_from_str(CONFIG_ISSUE_6950))
nlp.initialize(lambda: [Example.from_dict(nlp.make_doc("hello"), {"tags": ["V"]})])
pickle.dumps(nlp)
nlp("hello")
pickle.dumps(nlp)

View File

@ -1,23 +0,0 @@
import pytest
from ..util import make_tempdir
def test_issue6730(en_vocab):
"""Ensure that the KB does not accept empty strings, but otherwise IO works fine."""
from spacy.kb import KnowledgeBase
kb = KnowledgeBase(en_vocab, entity_vector_length=3)
kb.add_entity(entity="1", freq=148, entity_vector=[1, 2, 3])
with pytest.raises(ValueError):
kb.add_alias(alias="", entities=["1"], probabilities=[0.4])
assert kb.contains_alias("") is False
kb.add_alias(alias="x", entities=["1"], probabilities=[0.2])
kb.add_alias(alias="y", entities=["1"], probabilities=[0.1])
with make_tempdir() as tmp_dir:
kb.to_disk(tmp_dir)
kb.from_disk(tmp_dir)
assert kb.get_size_aliases() == 2
assert set(kb.get_alias_strings()) == {"x", "y"}

View File

@ -1,5 +0,0 @@
def test_issue6755(en_tokenizer):
doc = en_tokenizer("This is a magnificent sentence.")
span = doc[:0]
assert span.text_with_ws == ""
assert span.text == ""

View File

@ -1,35 +0,0 @@
import pytest
from spacy.lang.en import English
import numpy as np
@pytest.mark.parametrize(
"sentence, start_idx,end_idx,label",
[("Welcome to Mumbai, my friend", 11, 17, "GPE")],
)
def test_char_span_label(sentence, start_idx, end_idx, label):
nlp = English()
doc = nlp(sentence)
span = doc[:].char_span(start_idx, end_idx, label=label)
assert span.label_ == label
@pytest.mark.parametrize(
"sentence, start_idx,end_idx,kb_id", [("Welcome to Mumbai, my friend", 11, 17, 5)]
)
def test_char_span_kb_id(sentence, start_idx, end_idx, kb_id):
nlp = English()
doc = nlp(sentence)
span = doc[:].char_span(start_idx, end_idx, kb_id=kb_id)
assert span.kb_id == kb_id
@pytest.mark.parametrize(
"sentence, start_idx,end_idx,vector",
[("Welcome to Mumbai, my friend", 11, 17, np.array([0.1, 0.2, 0.3]))],
)
def test_char_span_vector(sentence, start_idx, end_idx, vector):
nlp = English()
doc = nlp(sentence)
span = doc[:].char_span(start_idx, end_idx, vector=vector)
assert (span.vector == vector).all()

View File

@ -1,102 +0,0 @@
import pytest
import spacy
from spacy.language import Language
from spacy.tokens import DocBin
from spacy import util
from spacy.schemas import ConfigSchemaInit
from spacy.training.initialize import init_nlp
from ..util import make_tempdir
TEXTCAT_WITH_LABELS_ARRAY_CONFIG = """
[paths]
train = "TRAIN_PLACEHOLDER"
raw = null
init_tok2vec = null
vectors = null
[system]
seed = 0
gpu_allocator = null
[nlp]
lang = "en"
pipeline = ["textcat"]
tokenizer = {"@tokenizers":"spacy.Tokenizer.v1"}
disabled = []
before_creation = null
after_creation = null
after_pipeline_creation = null
batch_size = 1000
[components]
[components.textcat]
factory = "TEXTCAT_PLACEHOLDER"
[corpora]
[corpora.train]
@readers = "spacy.Corpus.v1"
path = ${paths:train}
[corpora.dev]
@readers = "spacy.Corpus.v1"
path = ${paths:train}
[training]
train_corpus = "corpora.train"
dev_corpus = "corpora.dev"
seed = ${system.seed}
gpu_allocator = ${system.gpu_allocator}
frozen_components = []
before_to_disk = null
[pretraining]
[initialize]
vectors = ${paths.vectors}
init_tok2vec = ${paths.init_tok2vec}
vocab_data = null
lookups = null
before_init = null
after_init = null
[initialize.components]
[initialize.components.textcat]
labels = ['label1', 'label2']
[initialize.tokenizer]
"""
@pytest.mark.parametrize(
"component_name",
["textcat", "textcat_multilabel"],
)
def test_textcat_initialize_labels_validation(component_name):
"""Test intializing textcat with labels in a list"""
def create_data(out_file):
nlp = spacy.blank("en")
doc = nlp.make_doc("Some text")
doc.cats = {"label1": 0, "label2": 1}
out_data = DocBin(docs=[doc]).to_bytes()
with out_file.open("wb") as file_:
file_.write(out_data)
with make_tempdir() as tmp_path:
train_path = tmp_path / "train.spacy"
create_data(train_path)
config_str = TEXTCAT_WITH_LABELS_ARRAY_CONFIG.replace(
"TEXTCAT_PLACEHOLDER", component_name
)
config_str = config_str.replace("TRAIN_PLACEHOLDER", train_path.as_posix())
config = util.load_config_from_str(config_str)
init_nlp(config)

View File

@ -0,0 +1,12 @@
from spacy.cli.evaluate import print_textcats_auc_per_cat, print_prf_per_type
from wasabi import msg
def test_issue7019():
scores = {"LABEL_A": 0.39829102, "LABEL_B": 0.938298329382, "LABEL_C": None}
print_textcats_auc_per_cat(msg, scores)
scores = {
"LABEL_A": {"p": 0.3420302, "r": 0.3929020, "f": 0.49823928932},
"LABEL_B": {"p": None, "r": None, "f": None},
}
print_prf_per_type(msg, scores, name="foo", type="bar")

View File

@ -0,0 +1,67 @@
from spacy.lang.en import English
from spacy.training import Example
from spacy.util import load_config_from_str
CONFIG = """
[nlp]
lang = "en"
pipeline = ["tok2vec", "tagger"]
[components]
[components.tok2vec]
factory = "tok2vec"
[components.tok2vec.model]
@architectures = "spacy.Tok2Vec.v1"
[components.tok2vec.model.embed]
@architectures = "spacy.MultiHashEmbed.v1"
width = ${components.tok2vec.model.encode:width}
attrs = ["NORM","PREFIX","SUFFIX","SHAPE"]
rows = [5000,2500,2500,2500]
include_static_vectors = false
[components.tok2vec.model.encode]
@architectures = "spacy.MaxoutWindowEncoder.v1"
width = 96
depth = 4
window_size = 1
maxout_pieces = 3
[components.tagger]
factory = "tagger"
[components.tagger.model]
@architectures = "spacy.Tagger.v1"
nO = null
[components.tagger.model.tok2vec]
@architectures = "spacy.Tok2VecListener.v1"
width = ${components.tok2vec.model.encode:width}
upstream = "*"
"""
TRAIN_DATA = [
("I like green eggs", {"tags": ["N", "V", "J", "N"]}),
("Eat blue ham", {"tags": ["V", "J", "N"]}),
]
def test_issue7029():
"""Test that an empty document doesn't mess up an entire batch."""
nlp = English.from_config(load_config_from_str(CONFIG))
train_examples = []
for t in TRAIN_DATA:
train_examples.append(Example.from_dict(nlp.make_doc(t[0]), t[1]))
optimizer = nlp.initialize(get_examples=lambda: train_examples)
for i in range(50):
losses = {}
nlp.update(train_examples, sgd=optimizer, losses=losses)
texts = ["first", "second", "third", "fourth", "and", "then", "some", ""]
nlp.select_pipes(enable=["tok2vec", "tagger"])
docs1 = list(nlp.pipe(texts, batch_size=1))
docs2 = list(nlp.pipe(texts, batch_size=4))
assert [doc[0].tag_ for doc in docs1[:-1]] == [doc[0].tag_ for doc in docs2[:-1]]

View File

@ -325,6 +325,23 @@ def test_project_config_interpolation():
substitute_project_variables(project)
def test_project_config_interpolation_env():
variables = {"a": 10}
env_var = "SPACY_TEST_FOO"
env_vars = {"foo": env_var}
commands = [{"name": "x", "script": ["hello ${vars.a} ${env.foo}"]}]
project = {"commands": commands, "vars": variables, "env": env_vars}
with make_tempdir() as d:
srsly.write_yaml(d / "project.yml", project)
cfg = load_project_config(d)
assert cfg["commands"][0]["script"][0] == "hello 10 "
os.environ[env_var] = "123"
with make_tempdir() as d:
srsly.write_yaml(d / "project.yml", project)
cfg = load_project_config(d)
assert cfg["commands"][0]["script"][0] == "hello 10 123"
@pytest.mark.parametrize(
"args,expected",
[

View File

@ -1,7 +1,9 @@
import pytest
import numpy
import srsly
from spacy.lang.en import English
from spacy.strings import StringStore
from spacy.tokens import Doc
from spacy.vocab import Vocab
from spacy.attrs import NORM
@ -20,7 +22,10 @@ def test_pickle_string_store(text1, text2):
@pytest.mark.parametrize("text1,text2", [("dog", "cat")])
def test_pickle_vocab(text1, text2):
vocab = Vocab(lex_attr_getters={int(NORM): lambda string: string[:-1]})
vocab = Vocab(
lex_attr_getters={int(NORM): lambda string: string[:-1]},
get_noun_chunks=English.Defaults.syntax_iterators.get("noun_chunks"),
)
vocab.set_vector("dog", numpy.ones((5,), dtype="f"))
lex1 = vocab[text1]
lex2 = vocab[text2]
@ -34,4 +39,23 @@ def test_pickle_vocab(text1, text2):
assert unpickled[text2].norm == lex2.norm
assert unpickled[text1].norm != unpickled[text2].norm
assert unpickled.vectors is not None
assert unpickled.get_noun_chunks is not None
assert list(vocab["dog"].vector) == [1.0, 1.0, 1.0, 1.0, 1.0]
def test_pickle_doc(en_vocab):
words = ["a", "b", "c"]
deps = ["dep"] * len(words)
heads = [0] * len(words)
doc = Doc(
en_vocab,
words=words,
deps=deps,
heads=heads,
)
data = srsly.pickle_dumps(doc)
unpickled = srsly.pickle_loads(data)
assert [t.text for t in unpickled] == words
assert [t.dep_ for t in unpickled] == deps
assert [t.head.i for t in unpickled] == heads
assert list(doc.noun_chunks) == []

View File

@ -55,6 +55,7 @@ def test_vocab_lexeme_add_flag_provided_id(en_vocab):
assert en_vocab["199"].check_flag(IS_DIGIT) is False
assert en_vocab["the"].check_flag(is_len4) is False
assert en_vocab["dogs"].check_flag(is_len4) is True
en_vocab.add_flag(lambda string: string.isdigit(), flag_id=IS_DIGIT)
def test_vocab_lexeme_oov_rank(en_vocab):

View File

@ -245,7 +245,7 @@ cdef class Tokenizer:
cdef int offset
cdef int modified_doc_length
# Find matches for special cases
self._special_matcher.find_matches(doc, &c_matches)
self._special_matcher.find_matches(doc, 0, doc.length, &c_matches)
# Skip processing if no matches
if c_matches.size() == 0:
return True

View File

@ -215,8 +215,7 @@ def convert_vectors(
def read_vectors(vectors_loc: Path, truncate_vectors: int):
f = open_file(vectors_loc)
f = ensure_shape(f)
f = ensure_shape(vectors_loc)
shape = tuple(int(size) for size in next(f).split())
if truncate_vectors >= 1:
shape = (truncate_vectors, shape[1])
@ -251,11 +250,12 @@ def open_file(loc: Union[str, Path]) -> IO:
return loc.open("r", encoding="utf8")
def ensure_shape(lines):
def ensure_shape(vectors_loc):
"""Ensure that the first line of the data is the vectors shape.
If it's not, we read in the data and output the shape as the first result,
so that the reader doesn't have to deal with the problem.
"""
lines = open_file(vectors_loc)
first_line = next(lines)
try:
shape = tuple(int(size) for size in first_line.split())
@ -269,7 +269,11 @@ def ensure_shape(lines):
# Figure out the shape, make it the first value, and then give the
# rest of the data.
width = len(first_line.split()) - 1
captured = [first_line] + list(lines)
length = len(captured)
length = 1
for _ in lines:
length += 1
yield f"{length} {width}"
yield from captured
# Reading the lines in again from file. This to avoid having to
# store all the results in a list in memory
lines2 = open_file(vectors_loc)
yield from lines2

View File

@ -930,6 +930,8 @@ def is_same_func(func1: Callable, func2: Callable) -> bool:
"""
if not callable(func1) or not callable(func2):
return False
if not hasattr(func1, "__qualname__") or not hasattr(func2, "__qualname__"):
return False
same_name = func1.__qualname__ == func2.__qualname__
same_file = inspect.getfile(func1) == inspect.getfile(func2)
same_code = inspect.getsourcelines(func1) == inspect.getsourcelines(func2)

View File

@ -551,12 +551,13 @@ def pickle_vocab(vocab):
data_dir = vocab.data_dir
lex_attr_getters = srsly.pickle_dumps(vocab.lex_attr_getters)
lookups = vocab.lookups
get_noun_chunks = vocab.get_noun_chunks
return (unpickle_vocab,
(sstore, vectors, morph, data_dir, lex_attr_getters, lookups))
(sstore, vectors, morph, data_dir, lex_attr_getters, lookups, get_noun_chunks))
def unpickle_vocab(sstore, vectors, morphology, data_dir,
lex_attr_getters, lookups):
lex_attr_getters, lookups, get_noun_chunks):
cdef Vocab vocab = Vocab()
vocab.vectors = vectors
vocab.strings = sstore
@ -564,6 +565,7 @@ def unpickle_vocab(sstore, vectors, morphology, data_dir,
vocab.data_dir = data_dir
vocab.lex_attr_getters = srsly.pickle_loads(lex_attr_getters)
vocab.lookups = lookups
vocab.get_noun_chunks = get_noun_chunks
return vocab

View File

@ -67,7 +67,7 @@ data format used by the lookup and rule-based lemmatizers, see
> lemmatizer = nlp.add_pipe("lemmatizer")
>
> # Construction via add_pipe with custom settings
> config = {"mode": "rule", overwrite=True}
> config = {"mode": "rule", "overwrite": True}
> lemmatizer = nlp.add_pipe("lemmatizer", config=config)
> ```

View File

@ -44,7 +44,7 @@ be shown.
## PhraseMatcher.\_\_call\_\_ {#call tag="method"}
Find all token sequences matching the supplied patterns on the `Doc`.
Find all token sequences matching the supplied patterns on the `Doc` or `Span`.
> #### Example
>
@ -59,7 +59,7 @@ Find all token sequences matching the supplied patterns on the `Doc`.
| Name | Description |
| ------------------------------------- | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `doc` | The document to match over. ~~Doc~~ |
| `doclike` | The `Doc` or `Span` to match over. ~~Union[Doc, Span]~~ |
| _keyword-only_ | |
| `as_spans` <Tag variant="new">3</Tag> | Instead of tuples, return a list of [`Span`](/api/span) objects of the matches, with the `match_id` assigned as the span label. Defaults to `False`. ~~bool~~ |
| **RETURNS** | A list of `(match_id, start, end)` tuples, describing the matches. A match tuple describes a span `doc[start:end`]. The `match_id` is the ID of the added match pattern. If `as_spans` is set to `True`, a list of `Span` objects is returned instead. ~~Union[List[Tuple[int, int, int]], List[Span]]~~ |

View File

@ -727,7 +727,7 @@ capitalization by including a mix of capitalized and lowercase examples. See the
Create a data augmentation callback that uses orth-variant replacement. The
callback can be added to a corpus or other data iterator during training. It's
is especially useful for punctuation and case replacement, to help generalize
especially useful for punctuation and case replacement, to help generalize
beyond corpora that don't have smart quotes, or only have smart quotes etc.
| Name | Description |

View File

@ -4,8 +4,8 @@ import { Help } from 'components/typography'; import Link from 'components/link'
| Pipeline | Parser | Tagger | NER |
| ---------------------------------------------------------- | -----: | -----: | ---: |
| [`en_core_web_trf`](/models/en#en_core_web_trf) (spaCy v3) | 95.2 | 97.8 | 89.9 |
| [`en_core_web_lg`](/models/en#en_core_web_lg) (spaCy v3) | 91.9 | 97.4 | 85.5 |
| [`en_core_web_trf`](/models/en#en_core_web_trf) (spaCy v3) | 95.1 | 97.8 | 89.8 |
| [`en_core_web_lg`](/models/en#en_core_web_lg) (spaCy v3) | 92.0 | 97.4 | 85.5 |
| `en_core_web_lg` (spaCy v2) | 91.9 | 97.2 | 85.5 |
<figcaption class="caption">
@ -22,7 +22,7 @@ the development set).
| Named Entity Recognition System | OntoNotes | CoNLL '03 |
| -------------------------------- | --------: | --------: |
| spaCy RoBERTa (2020) | 89.7 | 91.6 |
| spaCy RoBERTa (2020) | 89.8 | 91.6 |
| Stanza (StanfordNLP)<sup>1</sup> | 88.8 | 92.1 |
| Flair<sup>2</sup> | 89.7 | 93.1 |

View File

@ -77,7 +77,7 @@ import Benchmarks from 'usage/\_benchmarks-models.md'
| Dependency Parsing System | UAS | LAS |
| ------------------------------------------------------------------------------ | ---: | ---: |
| spaCy RoBERTa (2020) | 95.5 | 94.3 |
| spaCy RoBERTa (2020) | 95.1 | 93.7 |
| [Mrini et al.](https://khalilmrini.github.io/Label_Attention_Layer.pdf) (2019) | 97.4 | 96.3 |
| [Zhou and Zhao](https://www.aclweb.org/anthology/P19-1230/) (2019) | 97.2 | 95.7 |

View File

@ -69,9 +69,9 @@ python -m spacy project clone pipelines/tagger_parser_ud
By default, the project will be cloned into the current working directory. You
can specify an optional second argument to define the output directory. The
`--repo` option lets you define a custom repo to clone from if you don't want
to use the spaCy [`projects`](https://github.com/explosion/projects) repo. You
can also use any private repo you have access to with Git.
`--repo` option lets you define a custom repo to clone from if you don't want to
use the spaCy [`projects`](https://github.com/explosion/projects) repo. You can
also use any private repo you have access to with Git.
### 2. Fetch the project assets {#assets}
@ -221,6 +221,7 @@ pipelines.
| `title` | An optional project title used in `--help` message and [auto-generated docs](#custom-docs). |
| `description` | An optional project description used in [auto-generated docs](#custom-docs). |
| `vars` | A dictionary of variables that can be referenced in paths, URLs and scripts, just like [`config.cfg` variables](/usage/training#config-interpolation). For example, `${vars.name}` will use the value of the variable `name`. Variables need to be defined in the section `vars`, but can be a nested dict, so you're able to reference `${vars.model.name}`. |
| `env` | A dictionary of variables, mapped to the names of environment variables that will be read in when running the project. For example, `${env.name}` will use the value of the environment variable defined as `name`. |
| `directories` | An optional list of [directories](#project-files) that should be created in the project for assets, training outputs, metrics etc. spaCy will make sure that these directories always exist. |
| `assets` | A list of assets that can be fetched with the [`project assets`](/api/cli#project-assets) command. `url` defines a URL or local path, `dest` is the destination file relative to the project directory, and an optional `checksum` ensures that an error is raised if the file's checksum doesn't match. Instead of `url`, you can also provide a `git` block with the keys `repo`, `branch` and `path`, to download from a Git repo. |
| `workflows` | A dictionary of workflow names, mapped to a list of command names, to execute in order. Workflows can be run with the [`project run`](/api/cli#project-run) command. |
@ -310,8 +311,8 @@ company-internal and not available over the internet. In that case, you can
specify the destination paths and a checksum, and leave out the URL. When your
teammates clone and run your project, they can place the files in the respective
directory themselves. The [`project assets`](/api/cli#project-assets) command
will alert you about missing files and mismatched checksums, so you can ensure that
others are running your project with the same data.
will alert you about missing files and mismatched checksums, so you can ensure
that others are running your project with the same data.
### Dependencies and outputs {#deps-outputs}
@ -358,9 +359,10 @@ graphs based on the dependencies and outputs, and won't re-run previous steps
automatically. For instance, if you only run the command `train` that depends on
data created by `preprocess` and those files are missing, spaCy will show an
error it won't just re-run `preprocess`. If you're looking for more advanced
data management, check out the [Data Version Control (DVC) integration](#dvc). If you're planning on integrating your spaCy project with DVC, you
can also use `outputs_no_cache` instead of `outputs` to define outputs that
won't be cached or tracked.
data management, check out the [Data Version Control (DVC) integration](#dvc).
If you're planning on integrating your spaCy project with DVC, you can also use
`outputs_no_cache` instead of `outputs` to define outputs that won't be cached
or tracked.
### Files and directory structure {#project-files}
@ -467,7 +469,9 @@ In your `project.yml`, you can then run the script by calling
`python scripts/custom_evaluation.py` with the function arguments. You can also
use the `vars` section to define reusable variables that will be substituted in
commands, paths and URLs. In this example, the batch size is defined as a
variable will be added in place of `${vars.batch_size}` in the script.
variable will be added in place of `${vars.batch_size}` in the script. Just like
in the [training config](/usage/training##config-overrides), you can also
override settings on the command line for example using `--vars.batch_size`.
> #### Calling into Python
>
@ -491,6 +495,29 @@ commands:
- 'corpus/eval.json'
```
You can also use the `env` section to reference **environment variables** and
make their values available to the commands. This can be useful for overriding
settings on the command line and passing through system-level settings.
> #### Usage example
>
> ```bash
> export GPU_ID=1
> BATCH_SIZE=128 python -m spacy project run evaluate
> ```
```yaml
### project.yml
env:
batch_size: BATCH_SIZE
gpu_id: GPU_ID
commands:
- name: evaluate
script:
- 'python scripts/custom_evaluation.py ${env.batch_size}'
```
### Documenting your project {#custom-docs}
> #### Readme Example

View File

@ -185,7 +185,7 @@ sections of a config file are:
For a full overview of spaCy's config format and settings, see the
[data format documentation](/api/data-formats#config) and
[Thinc's config system docs](https://thinc.ai/usage/config). The settings
[Thinc's config system docs](https://thinc.ai/docs/usage-config). The settings
available for the different architectures are documented with the
[model architectures API](/api/architectures). See the Thinc documentation for
[optimizers](https://thinc.ai/docs/api-optimizers) and

View File

@ -198,6 +198,7 @@
"has_examples": true
},
{ "code": "tl", "name": "Tagalog" },
{ "code": "tn", "name": "Setswana", "has_examples": true },
{ "code": "tr", "name": "Turkish", "example": "Bu bir cümledir.", "has_examples": true },
{ "code": "tt", "name": "Tatar", "has_examples": true },
{