Merge branch 'master' into spacy.io

Ines Montani 2021-02-14 13:38:33 +11:00
commit 3246cf8b2b
47 changed files with 898 additions and 280 deletions

.github/contributors/peter-exos.md (new file, 106 lines)

@@ -0,0 +1,106 @@
# spaCy contributor agreement
This spaCy Contributor Agreement (**"SCA"**) is based on the
[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf).
The SCA applies to any contribution that you make to any product or project
managed by us (the **"project"**), and sets out the intellectual property rights
you grant to us in the contributed materials. The term **"us"** shall mean
[ExplosionAI GmbH](https://explosion.ai/legal). The term
**"you"** shall mean the person or entity identified below.
If you agree to be bound by these terms, fill in the information requested
below and include the filled-in version with your first pull request, under the
folder [`.github/contributors/`](/.github/contributors/). The name of the file
should be your GitHub username, with the extension `.md`. For example, the user
example_user would create the file `.github/contributors/example_user.md`.
Read this agreement carefully before signing. These terms and conditions
constitute a binding legal agreement.
## Contributor Agreement
1. The term "contribution" or "contributed materials" means any source code,
object code, patch, tool, sample, graphic, specification, manual,
documentation, or any other material posted or submitted by you to the project.
2. With respect to any worldwide copyrights, or copyright applications and
registrations, in your contribution:
* you hereby assign to us joint ownership, and to the extent that such
assignment is or becomes invalid, ineffective or unenforceable, you hereby
grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
royalty-free, unrestricted license to exercise all rights under those
copyrights. This includes, at our option, the right to sublicense these same
rights to third parties through multiple levels of sublicensees or other
licensing arrangements;
* you agree that each of us can do all things in relation to your
contribution as if each of us were the sole owners, and if one of us makes
a derivative work of your contribution, the one who makes the derivative
work (or has it made) will be the sole owner of that derivative work;
* you agree that you will not assert any moral rights in your contribution
against us, our licensees or transferees;
* you agree that we may register a copyright in your contribution and
exercise all ownership rights associated with it; and
* you agree that neither of us has any duty to consult with, obtain the
consent of, pay or render an accounting to the other for any use or
distribution of your contribution.
3. With respect to any patents you own, or that you can license without payment
to any third party, you hereby grant to us a perpetual, irrevocable,
non-exclusive, worldwide, no-charge, royalty-free license to:
* make, have made, use, sell, offer to sell, import, and otherwise transfer
your contribution in whole or in part, alone or in combination with or
included in any product, work or materials arising out of the project to
which your contribution was submitted, and
* at our option, to sublicense these same rights to third parties through
multiple levels of sublicensees or other licensing arrangements.
4. Except as set out above, you keep all right, title, and interest in your
contribution. The rights that you grant to us under these terms are effective
on the date you first submitted a contribution to us, even if your submission
took place before the date you sign these terms.
5. You covenant, represent, warrant and agree that:
* Each contribution that you submit is and shall be an original work of
authorship and you can legally grant the rights set out in this SCA;
* to the best of your knowledge, each contribution will not violate any
third party's copyrights, trademarks, patents, or other intellectual
property rights; and
* each contribution shall be in compliance with U.S. export control laws and
other applicable export and import laws. You agree to notify us if you
become aware of any circumstance which would make any of the foregoing
representations inaccurate in any respect. We may publicly disclose your
participation in the project, including the fact that you have signed the SCA.
6. This SCA is governed by the laws of the State of California and applicable
U.S. Federal law. Any choice of law rules will not apply.
7. Please place an “x” on one of the applicable statements below. Please do NOT
mark both statements:
* [ ] I am signing on behalf of myself as an individual and no other person
or entity, including my employer, has or will have rights with respect to my
contributions.
* [x] I am signing on behalf of my employer or a legal entity and I have the
actual authority to contractually bind that entity.
## Contributor Details
| Field | Entry |
|------------------------------- | -------------------- |
| Name | Peter Baumann |
| Company name (if applicable) | Exos Financial |
| Title or role (if applicable) | data scientist |
| Date | Feb 1st, 2021 |
| GitHub username | peter-exos |
| Website (optional) | |

@@ -1,6 +1,6 @@
 # fmt: off
 __title__ = "spacy"
-__version__ = "3.0.1"
+__version__ = "3.0.3"
 __download_url__ = "https://github.com/explosion/spacy-models/releases/download"
 __compatibility__ = "https://raw.githubusercontent.com/explosion/spacy-models/master/compatibility.json"
 __projects__ = "https://github.com/explosion/projects"

@@ -16,7 +16,7 @@ import os
 from ..schemas import ProjectConfigSchema, validate
 from ..util import import_file, run_command, make_tempdir, registry, logger
-from ..util import is_compatible_version, ENV_VARS
+from ..util import is_compatible_version, SimpleFrozenDict, ENV_VARS
 from .. import about

 if TYPE_CHECKING:
@@ -111,26 +111,33 @@ def _parse_overrides(args: List[str], is_cli: bool = False) -> Dict[str, Any]:
                 value = "true"
             else:
                 value = args.pop(0)
-            # Just like we do in the config, we're calling json.loads on the
-            # values. But since they come from the CLI, it'd be unintuitive to
-            # explicitly mark strings with escaped quotes. So we're working
-            # around that here by falling back to a string if parsing fails.
-            # TODO: improve logic to handle simple types like list of strings?
-            try:
-                result[opt] = srsly.json_loads(value)
-            except ValueError:
-                result[opt] = str(value)
+            result[opt] = _parse_override(value)
         else:
             msg.fail(f"{err}: name should start with --", exits=1)
     return result


-def load_project_config(path: Path, interpolate: bool = True) -> Dict[str, Any]:
+def _parse_override(value: Any) -> Any:
+    # Just like we do in the config, we're calling json.loads on the
+    # values. But since they come from the CLI, it'd be unintuitive to
+    # explicitly mark strings with escaped quotes. So we're working
+    # around that here by falling back to a string if parsing fails.
+    # TODO: improve logic to handle simple types like list of strings?
+    try:
+        return srsly.json_loads(value)
+    except ValueError:
+        return str(value)
+
+
+def load_project_config(
+    path: Path, interpolate: bool = True, overrides: Dict[str, Any] = SimpleFrozenDict()
+) -> Dict[str, Any]:
     """Load the project.yml file from a directory and validate it. Also make
     sure that all directories defined in the config exist.

     path (Path): The path to the project directory.
     interpolate (bool): Whether to substitute project variables.
+    overrides (Dict[str, Any]): Optional config overrides.
     RETURNS (Dict[str, Any]): The loaded project.yml.
     """
     config_path = path / PROJECT_FILE
@@ -154,20 +161,36 @@ def load_project_config(path: Path, interpolate: bool = True) -> Dict[str, Any]:
         if not dir_path.exists():
             dir_path.mkdir(parents=True)
     if interpolate:
-        err = "project.yml validation error"
+        err = f"{PROJECT_FILE} validation error"
        with show_validation_error(title=err, hint_fill=False):
-            config = substitute_project_variables(config)
+            config = substitute_project_variables(config, overrides)
     return config


-def substitute_project_variables(config: Dict[str, Any], overrides: Dict = {}):
-    key = "vars"
+def substitute_project_variables(
+    config: Dict[str, Any],
+    overrides: Dict[str, Any] = SimpleFrozenDict(),
+    key: str = "vars",
+    env_key: str = "env",
+) -> Dict[str, Any]:
+    """Interpolate variables in the project file using the config system.
+
+    config (Dict[str, Any]): The project config.
+    overrides (Dict[str, Any]): Optional config overrides.
+    key (str): Key containing variables in project config.
+    env_key (str): Key containing environment variable mapping in project config.
+    RETURNS (Dict[str, Any]): The interpolated project config.
+    """
     config.setdefault(key, {})
-    config[key].update(overrides)
+    config.setdefault(env_key, {})
+    # Substitute references to env vars with their values
+    for config_var, env_var in config[env_key].items():
+        config[env_key][config_var] = _parse_override(os.environ.get(env_var, ""))
     # Need to put variables in the top scope again so we can have a top-level
     # section "project" (otherwise, a list of commands in the top scope wouldn't)
     # be allowed by Thinc's config system
-    cfg = Config({"project": config, key: config[key]})
+    cfg = Config({"project": config, key: config[key], env_key: config[env_key]})
+    cfg = Config().from_str(cfg.to_str(), overrides=overrides)
     interpolated = cfg.interpolate()
     return dict(interpolated["project"])
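The new `env` block works alongside `vars`: each entry maps a project variable name to the name of an environment variable, whose value is read at load time and can be referenced in command scripts as `${env.<name>}`. A minimal sketch of the behavior, mirroring the test added further down in this commit (the import paths are assumptions based on the relative imports shown above):

```python
import os
import srsly
from spacy.cli._util import load_project_config  # assumed module path for the hunk above
from spacy.util import make_tempdir

os.environ["SPACY_TEST_FOO"] = "123"
project = {
    "vars": {"a": 10},
    "env": {"foo": "SPACY_TEST_FOO"},
    "commands": [{"name": "x", "script": ["hello ${vars.a} ${env.foo}"]}],
}
with make_tempdir() as d:
    srsly.write_yaml(d / "project.yml", project)
    cfg = load_project_config(d)
assert cfg["commands"][0]["script"][0] == "hello 10 123"
```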

@@ -175,10 +175,13 @@ def render_parses(
 def print_prf_per_type(
     msg: Printer, scores: Dict[str, Dict[str, float]], name: str, type: str
 ) -> None:
-    data = [
-        (k, f"{v['p']*100:.2f}", f"{v['r']*100:.2f}", f"{v['f']*100:.2f}")
-        for k, v in scores.items()
-    ]
+    data = []
+    for key, value in scores.items():
+        row = [key]
+        for k in ("p", "r", "f"):
+            v = value[k]
+            row.append(f"{v * 100:.2f}" if isinstance(v, (int, float)) else v)
+        data.append(row)
     msg.table(
         data,
         header=("", "P", "R", "F"),
@@ -191,7 +194,10 @@ def print_textcats_auc_per_cat(
     msg: Printer, scores: Dict[str, Dict[str, float]]
 ) -> None:
     msg.table(
-        [(k, f"{v:.2f}") for k, v in scores.items()],
+        [
+            (k, f"{v:.2f}" if isinstance(v, (float, int)) else v)
+            for k, v in scores.items()
+        ],
         header=("", "ROC AUC"),
         aligns=("l", "r"),
         title="Textcat ROC AUC (per label)",

@@ -3,19 +3,23 @@ from pathlib import Path
 from wasabi import msg
 import sys
 import srsly
+import typer

 from ... import about
 from ...git_info import GIT_VERSION
 from ...util import working_dir, run_command, split_command, is_cwd, join_command
 from ...util import SimpleFrozenList, is_minor_version_match, ENV_VARS
-from ...util import check_bool_env_var
+from ...util import check_bool_env_var, SimpleFrozenDict
 from .._util import PROJECT_FILE, PROJECT_LOCK, load_project_config, get_hash
-from .._util import get_checksum, project_cli, Arg, Opt, COMMAND
+from .._util import get_checksum, project_cli, Arg, Opt, COMMAND, parse_config_overrides


-@project_cli.command("run")
+@project_cli.command(
+    "run", context_settings={"allow_extra_args": True, "ignore_unknown_options": True}
+)
 def project_run_cli(
     # fmt: off
+    ctx: typer.Context,  # This is only used to read additional arguments
     subcommand: str = Arg(None, help=f"Name of command defined in the {PROJECT_FILE}"),
     project_dir: Path = Arg(Path.cwd(), help="Location of project directory. Defaults to current working directory.", exists=True, file_okay=False),
     force: bool = Opt(False, "--force", "-F", help="Force re-running steps, even if nothing changed"),
@@ -33,13 +37,15 @@ def project_run_cli(
     if show_help or not subcommand:
         print_run_help(project_dir, subcommand)
     else:
-        project_run(project_dir, subcommand, force=force, dry=dry)
+        overrides = parse_config_overrides(ctx.args)
+        project_run(project_dir, subcommand, overrides=overrides, force=force, dry=dry)


 def project_run(
     project_dir: Path,
     subcommand: str,
     *,
+    overrides: Dict[str, Any] = SimpleFrozenDict(),
     force: bool = False,
     dry: bool = False,
     capture: bool = False,
@@ -59,7 +65,7 @@ def project_run(
     when you want to turn over execution to the command, and capture=True
     when you want to run the command more like a function.
     """
-    config = load_project_config(project_dir)
+    config = load_project_config(project_dir, overrides=overrides)
     commands = {cmd["name"]: cmd for cmd in config.get("commands", [])}
     workflows = config.get("workflows", {})
     validate_subcommand(commands.keys(), workflows.keys(), subcommand)
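Because the `run` command now accepts extra arguments and passes them through `parse_config_overrides`, project variables can be overridden per invocation. A hedged sketch (the command name and variable are hypothetical, and the module path is assumed from the relative imports above):

```python
# CLI form (illustrative): python -m spacy project run train . --vars.gpu_id 0
from pathlib import Path
from spacy.cli.project.run import project_run

# Programmatic form: overrides use dotted keys into project.yml sections,
# assuming a project directory with a "train" command and a vars.gpu_id variable.
project_run(Path("."), "train", overrides={"vars.gpu_id": 0}, dry=True)
```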

@@ -28,6 +28,15 @@ bg:
     accuracy:
       name: iarfmoose/roberta-base-bulgarian
       size_factor: 3
+bn:
+  word_vectors: null
+  transformer:
+    efficiency:
+      name: sagorsarker/bangla-bert-base
+      size_factor: 3
+    accuracy:
+      name: sagorsarker/bangla-bert-base
+      size_factor: 3
 da:
   word_vectors: da_core_news_lg
   transformer:
@@ -104,10 +113,10 @@ hi:
   word_vectors: null
   transformer:
     efficiency:
-      name: monsoon-nlp/hindi-tpu-electra
+      name: ai4bharat/indic-bert
       size_factor: 3
     accuracy:
-      name: monsoon-nlp/hindi-tpu-electra
+      name: ai4bharat/indic-bert
       size_factor: 3
 id:
   word_vectors: null
@@ -185,10 +194,10 @@ si:
   word_vectors: null
   transformer:
     efficiency:
-      name: keshan/SinhalaBERTo
+      name: setu4993/LaBSE
      size_factor: 3
     accuracy:
-      name: keshan/SinhalaBERTo
+      name: setu4993/LaBSE
       size_factor: 3
 sv:
   word_vectors: null
@@ -203,10 +212,10 @@ ta:
   word_vectors: null
   transformer:
     efficiency:
-      name: monsoon-nlp/tamillion
+      name: ai4bharat/indic-bert
       size_factor: 3
     accuracy:
-      name: monsoon-nlp/tamillion
+      name: ai4bharat/indic-bert
       size_factor: 3
 te:
   word_vectors: null

@@ -579,8 +579,8 @@ class Errors:
     E922 = ("Component '{name}' has been initialized with an output dimension of "
             "{nO} - cannot add any more labels.")
     E923 = ("It looks like there is no proper sample data to initialize the "
-            "Model of component '{name}'. This is likely a bug in spaCy, so "
-            "feel free to open an issue: https://github.com/explosion/spaCy/issues")
+            "Model of component '{name}'. To check your input data paths and "
+            "annotation, run: python -m spacy debug data config.cfg")
     E924 = ("The '{name}' component does not seem to be initialized properly. "
             "This is likely a bug in spaCy, so feel free to open an issue: "
             "https://github.com/explosion/spaCy/issues")

spacy/lang/tn/__init__.py (new file, 18 lines)

@@ -0,0 +1,18 @@
from .stop_words import STOP_WORDS
from .lex_attrs import LEX_ATTRS
from .punctuation import TOKENIZER_INFIXES
from ...language import Language
class SetswanaDefaults(Language.Defaults):
infixes = TOKENIZER_INFIXES
stop_words = STOP_WORDS
lex_attr_getters = LEX_ATTRS
class Setswana(Language):
lang = "tn"
Defaults = SetswanaDefaults
__all__ = ["Setswana"]
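With the `Setswana` class registered for the language code `tn`, a blank pipeline can be created the usual way (assuming a spaCy install that includes this new module):

```python
import spacy

nlp = spacy.blank("tn")  # tokenizer-only Setswana pipeline
doc = nlp("Johannesburg ke toropo e kgolo mo Afrika Borwa.")
print([token.text for token in doc])
```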

spacy/lang/tn/examples.py (new file, 15 lines)

@@ -0,0 +1,15 @@
"""
Example sentences to test spaCy and its language models.
>>> from spacy.lang.tn.examples import sentences
>>> docs = nlp.pipe(sentences)
"""
sentences = [
"Apple e nyaka go reka JSE ka tlhwatlhwa ta R1 billion",
"Johannesburg ke toropo e kgolo mo Afrika Borwa.",
"O ko kae?",
"ke mang presidente ya Afrika Borwa?",
"ke eng toropo kgolo ya Afrika Borwa?",
"Nelson Mandela o belegwe leng?",
]

spacy/lang/tn/lex_attrs.py (new file, 107 lines)

@@ -0,0 +1,107 @@
from ...attrs import LIKE_NUM
_num_words = [
"lefela",
"nngwe",
"pedi",
"tharo",
"nne",
"tlhano",
"thataro",
"supa",
"robedi",
"robongwe",
"lesome",
"lesomenngwe",
"lesomepedi",
"sometharo",
"somenne",
"sometlhano",
"somethataro",
"somesupa",
"somerobedi",
"somerobongwe",
"someamabedi",
"someamararo",
"someamane",
"someamatlhano",
"someamarataro",
"someamasupa",
"someamarobedi",
"someamarobongwe",
"lekgolo",
"sekete",
"milione",
"bilione",
"terilione",
"kwatirilione",
"gajillione",
"bazillione",
]
_ordinal_words = [
"ntlha",
"bobedi",
"boraro",
"bone",
"botlhano",
"borataro",
"bosupa",
"borobedi ",
"borobongwe",
"bolesome",
"bolesomengwe",
"bolesomepedi",
"bolesometharo",
"bolesomenne",
"bolesometlhano",
"bolesomethataro",
"bolesomesupa",
"bolesomerobedi",
"bolesomerobongwe",
"somamabedi",
"someamararo",
"someamane",
"someamatlhano",
"someamarataro",
"someamasupa",
"someamarobedi",
"someamarobongwe",
"lekgolo",
"sekete",
"milione",
"bilione",
"terilione",
"kwatirilione",
"gajillione",
"bazillione",
]
def like_num(text):
if text.startswith(("+", "-", "±", "~")):
text = text[1:]
text = text.replace(",", "").replace(".", "")
if text.isdigit():
return True
if text.count("/") == 1:
num, denom = text.split("/")
if num.isdigit() and denom.isdigit():
return True
text_lower = text.lower()
if text_lower in _num_words:
return True
# Check ordinal number
if text_lower in _ordinal_words:
return True
if text_lower.endswith("th"):
if text_lower[:-2].isdigit():
return True
return False
LEX_ATTRS = {LIKE_NUM: like_num}
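A short usage sketch for the `LIKE_NUM` getter defined above (import path taken from the new module's location; the example words follow the lists above):

```python
from spacy.lang.tn.lex_attrs import like_num

assert like_num("pedi")        # Setswana number word
assert like_num("bobedi")      # ordinal word
assert like_num("3,000")       # digits with separators
assert like_num("11/2")        # simple fraction
assert not like_num("toropo")  # ordinary word
```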

@@ -0,0 +1,19 @@
from ..char_classes import LIST_ELLIPSES, LIST_ICONS, HYPHENS
from ..char_classes import CONCAT_QUOTES, ALPHA_LOWER, ALPHA_UPPER, ALPHA
_infixes = (
LIST_ELLIPSES
+ LIST_ICONS
+ [
r"(?<=[0-9])[+\-\*^](?=[0-9-])",
r"(?<=[{al}{q}])\.(?=[{au}{q}])".format(
al=ALPHA_LOWER, au=ALPHA_UPPER, q=CONCAT_QUOTES
),
r"(?<=[{a}]),(?=[{a}])".format(a=ALPHA),
r"(?<=[{a}0-9])(?:{h})(?=[{a}])".format(a=ALPHA, h=HYPHENS),
r"(?<=[{a}0-9])[:<>=/](?=[{a}])".format(a=ALPHA),
]
)
TOKENIZER_INFIXES = _infixes

@@ -0,0 +1,20 @@
# Stop words
STOP_WORDS = set(
"""
ke gareng ga selekanyo tlhwatlhwa yo mongwe se
sengwe fa go le jalo gongwe ba na mo tikologong
jaaka kwa morago nna gonne ka sa pele nako teng
tlase fela ntle magareng tsona feta bobedi kgabaganya
moo gape kgatlhanong botlhe tsotlhe bokana e esi
setseng mororo dinako golo kgolo nnye wena gago
o ntse ntle tla goreng gangwe mang yotlhe gore
eo yona tseraganyo eng ne sentle re rona thata
godimo fitlha pedi masomamabedi lesomepedi mmogo
tharo tseo boraro tseno yone jaanong bobona bona
lesome tsaya tsamaiso nngwe masomethataro thataro
tsa mmatota tota sale thoko supa dira tshwanetse di mmalwa masisi
bonala e tshwanang bogolo tsenya tsweetswee karolo
sepe tlhalosa dirwa robedi robongwe lesomenngwe gaisa
tlhano lesometlhano botlalo lekgolo
""".split()
)

@@ -451,7 +451,7 @@ cdef class Lexeme:
             Lexeme.c_set_flag(self.c, IS_QUOTE, x)

     property is_left_punct:
-        """RETURNS (bool): Whether the lexeme is left punctuation, e.g. )."""
+        """RETURNS (bool): Whether the lexeme is left punctuation, e.g. (."""
         def __get__(self):
             return Lexeme.c_check_flag(self.c, IS_LEFT_PUNCT)

@@ -18,4 +18,4 @@ cdef class PhraseMatcher:
     cdef Pool mem
     cdef key_t _terminal_hash

-    cdef void find_matches(self, Doc doc, vector[SpanC] *matches) nogil
+    cdef void find_matches(self, Doc doc, int start_idx, int end_idx, vector[SpanC] *matches) nogil

@@ -230,10 +230,10 @@ cdef class PhraseMatcher:
                 result = internal_node
             map_set(self.mem, <MapStruct*>result, self.vocab.strings[key], NULL)

-    def __call__(self, doc, *, as_spans=False):
+    def __call__(self, object doclike, *, as_spans=False):
         """Find all sequences matching the supplied patterns on the `Doc`.

-        doc (Doc): The document to match over.
+        doclike (Doc or Span): The document to match over.
         as_spans (bool): Return Span objects with labels instead of (match_id,
             start, end) tuples.
         RETURNS (list): A list of `(match_id, start, end)` tuples,
@@ -244,12 +244,22 @@ cdef class PhraseMatcher:
         DOCS: https://spacy.io/api/phrasematcher#call
         """
         matches = []
-        if doc is None or len(doc) == 0:
+        if doclike is None or len(doclike) == 0:
             # if doc is empty or None just return empty list
             return matches
+        if isinstance(doclike, Doc):
+            doc = doclike
+            start_idx = 0
+            end_idx = len(doc)
+        elif isinstance(doclike, Span):
+            doc = doclike.doc
+            start_idx = doclike.start
+            end_idx = doclike.end
+        else:
+            raise ValueError(Errors.E195.format(good="Doc or Span", got=type(doclike).__name__))
         cdef vector[SpanC] c_matches
-        self.find_matches(doc, &c_matches)
+        self.find_matches(doc, start_idx, end_idx, &c_matches)
         for i in range(c_matches.size()):
             matches.append((c_matches[i].label, c_matches[i].start, c_matches[i].end))
         for i, (ent_id, start, end) in enumerate(matches):
@@ -261,17 +271,17 @@ cdef class PhraseMatcher:
         else:
             return matches

-    cdef void find_matches(self, Doc doc, vector[SpanC] *matches) nogil:
+    cdef void find_matches(self, Doc doc, int start_idx, int end_idx, vector[SpanC] *matches) nogil:
         cdef MapStruct* current_node = self.c_map
         cdef int start = 0
-        cdef int idx = 0
-        cdef int idy = 0
+        cdef int idx = start_idx
+        cdef int idy = start_idx
         cdef key_t key
         cdef void* value
         cdef int i = 0
         cdef SpanC ms
         cdef void* result
-        while idx < doc.length:
+        while idx < end_idx:
             start = idx
             token = Token.get_struct_attr(&doc.c[idx], self.attr)
             # look for sequences from this position
@@ -279,7 +289,7 @@ cdef class PhraseMatcher:
             if result:
                 current_node = <MapStruct*>result
                 idy = idx + 1
-                while idy < doc.length:
+                while idy < end_idx:
                     result = map_get(current_node, self._terminal_hash)
                     if result:
                         i = 0
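With `start_idx`/`end_idx` threaded through `find_matches`, the matcher can now be called on a `Span` and only considers tokens inside it. A minimal sketch mirroring the tests added later in this commit:

```python
from spacy.lang.en import English
from spacy.matcher import PhraseMatcher

nlp = English()
doc = nlp("I like Spans and Docs in my input, and nothing else.")
matcher = PhraseMatcher(nlp.vocab)
matcher.add("SPACY", [nlp.make_doc("Spans and Docs")])
assert len(matcher(doc)) == 1       # match in the full Doc
assert len(matcher(doc[:8])) == 1   # match inside the Span
assert len(matcher(doc[8:])) == 0   # no match outside the Span
```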

@@ -107,6 +107,7 @@ def init_ensemble_textcat(model, X, Y) -> Model:
     model.get_ref("maxout_layer").set_dim("nO", tok2vec_width)
     model.get_ref("maxout_layer").set_dim("nI", tok2vec_width)
     model.get_ref("norm_layer").set_dim("nI", tok2vec_width)
+    model.get_ref("norm_layer").set_dim("nO", tok2vec_width)
     init_chain(model, X, Y)
     return model

@@ -273,7 +273,7 @@ class EntityLinker(TrainablePipe):
         gradients = self.distance.get_grad(sentence_encodings, entity_encodings)
         loss = self.distance.get_loss(sentence_encodings, entity_encodings)
         loss = loss / len(entity_encodings)
-        return loss, gradients
+        return float(loss), gradients

     def predict(self, docs: Iterable[Doc]) -> List[str]:
         """Apply the pipeline's model to a batch of docs, without modifying them.

@@ -76,7 +76,7 @@ def merge_subtokens(doc: Doc, label: str = "subtok") -> Doc:
     retokenizes=True,
 )
 def make_token_splitter(
-    nlp: Language, name: str, *, min_length=0, split_length=0,
+    nlp: Language, name: str, *, min_length: int = 0, split_length: int = 0
 ):
     return TokenSplitter(min_length=min_length, split_length=split_length)

@@ -197,7 +197,7 @@ class ClozeMultitask(TrainablePipe):
         target = vectors[ids]
         gradient = self.distance.get_grad(prediction, target)
         loss = self.distance.get_loss(prediction, target)
-        return loss, gradient
+        return float(loss), gradient

     def update(self, examples, *, drop=0., sgd=None, losses=None):
         pass

@@ -121,7 +121,7 @@ class Tok2Vec(TrainablePipe):
         tokvecs = self.model.predict(docs)
         batch_id = Tok2VecListener.get_batch_id(docs)
         for listener in self.listeners:
-            listener.receive(batch_id, tokvecs, lambda dX: [])
+            listener.receive(batch_id, tokvecs, _empty_backprop)
         return tokvecs

     def set_annotations(self, docs: Sequence[Doc], tokvecses) -> None:
@@ -291,12 +291,18 @@ def forward(model: Tok2VecListener, inputs, is_train: bool):
     # of data.
     # When the components batch differently, we don't receive a matching
     # prediction from the upstream, so we can't predict.
-    if not all(doc.tensor.size for doc in inputs):
-        # But we do need to do *something* if the tensor hasn't been set.
-        # The compromise is to at least return data of the right shape,
-        # so the output is valid.
-        width = model.get_dim("nO")
-        outputs = [model.ops.alloc2f(len(doc), width) for doc in inputs]
-    else:
-        outputs = [doc.tensor for doc in inputs]
+    outputs = []
+    width = model.get_dim("nO")
+    for doc in inputs:
+        if doc.tensor.size == 0:
+            # But we do need to do *something* if the tensor hasn't been set.
+            # The compromise is to at least return data of the right shape,
+            # so the output is valid.
+            outputs.append(model.ops.alloc2f(len(doc), width))
+        else:
+            outputs.append(doc.tensor)
     return outputs, lambda dX: []


+def _empty_backprop(dX):  # for pickling
+    return []
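Replacing the `lambda dX: []` callback passed to the listeners with the module-level `_empty_backprop` is what makes the pipeline picklable (exercised by `test_issue6950` further down): pickle serializes module-level functions by reference but refuses lambdas. A small illustration of that general Python behavior:

```python
import pickle

def _empty_backprop(dX):  # module-level: picklable by reference
    return []

pickle.dumps(_empty_backprop)      # works
try:
    pickle.dumps(lambda dX: [])    # lambdas cannot be pickled
except (pickle.PicklingError, AttributeError):
    pass
```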

@@ -446,6 +446,7 @@ class ProjectConfigCommand(BaseModel):
 class ProjectConfigSchema(BaseModel):
     # fmt: off
     vars: Dict[StrictStr, Any] = Field({}, title="Optional variables to substitute in commands")
+    env: Dict[StrictStr, Any] = Field({}, title="Optional variable names to substitute in commands, mapped to environment variable names")
     assets: List[Union[ProjectConfigAssetURL, ProjectConfigAssetGit]] = Field([], title="Data assets")
     workflows: Dict[StrictStr, List[StrictStr]] = Field({}, title="Named workflows, mapped to list of project commands to run in order")
     commands: List[ProjectConfigCommand] = Field([], title="Project command shortucts")

@@ -8,7 +8,8 @@ from spacy.util import get_lang_class
 LANGUAGES = ["af", "ar", "bg", "bn", "ca", "cs", "da", "de", "el", "en", "es",
              "et", "fa", "fi", "fr", "ga", "he", "hi", "hr", "hu", "id", "is",
              "it", "kn", "lt", "lv", "nb", "nl", "pl", "pt", "ro", "si", "sk",
-             "sl", "sq", "sr", "sv", "ta", "te", "tl", "tr", "tt", "ur", 'yo']
+             "sl", "sq", "sr", "sv", "ta", "te", "tl", "tn", "tr", "tt", "ur",
+             "yo"]
 # fmt: on

@@ -323,3 +323,39 @@ def test_phrase_matcher_deprecated(en_vocab):
 @pytest.mark.parametrize("attr", ["SENT_START", "IS_SENT_START"])
 def test_phrase_matcher_sent_start(en_vocab, attr):
     _ = PhraseMatcher(en_vocab, attr=attr)  # noqa: F841

def test_span_in_phrasematcher(en_vocab):
"""Ensure that PhraseMatcher accepts Span and Doc as input"""
# fmt: off
words = ["I", "like", "Spans", "and", "Docs", "in", "my", "input", ",", "and", "nothing", "else", "."]
# fmt: on
doc = Doc(en_vocab, words=words)
span = doc[:8]
pattern = Doc(en_vocab, words=["Spans", "and", "Docs"])
matcher = PhraseMatcher(en_vocab)
matcher.add("SPACY", [pattern])
matches_doc = matcher(doc)
matches_span = matcher(span)
assert len(matches_doc) == 1
assert len(matches_span) == 1
def test_span_v_doc_in_phrasematcher(en_vocab):
"""Ensure that PhraseMatcher only returns matches in input Span and not in entire Doc"""
# fmt: off
words = [
"I", "like", "Spans", "and", "Docs", "in", "my", "input", ",", "Spans",
"and", "Docs", "in", "my", "matchers", "," "and", "Spans", "and", "Docs",
"everywhere", "."
]
# fmt: on
doc = Doc(en_vocab, words=words)
span = doc[9:15] # second clause
pattern = Doc(en_vocab, words=["Spans", "and", "Docs"])
matcher = PhraseMatcher(en_vocab)
matcher.add("SPACY", [pattern])
matches_doc = matcher(doc)
matches_span = matcher(span)
assert len(matches_doc) == 3
assert len(matches_span) == 1

@@ -451,13 +451,27 @@ def test_pipe_factories_from_source_config():
     assert config["arg"] == "world"


-def test_pipe_factories_decorator_idempotent():
+class PipeFactoriesIdempotent:
+    def __init__(self, nlp, name):
+        ...
+
+    def __call__(self, doc):
+        ...
+
+
+@pytest.mark.parametrize(
+    "i,func,func2",
+    [
+        (0, lambda nlp, name: lambda doc: doc, lambda doc: doc),
+        (1, PipeFactoriesIdempotent, PipeFactoriesIdempotent(None, None)),
+    ],
+)
+def test_pipe_factories_decorator_idempotent(i, func, func2):
     """Check that decorator can be run multiple times if the function is the
     same. This is especially relevant for live reloading because we don't
     want spaCy to raise an error if a module registering components is reloaded.
     """
-    name = "test_pipe_factories_decorator_idempotent"
-    func = lambda nlp, name: lambda doc: doc
+    name = f"test_pipe_factories_decorator_idempotent_{i}"
     for i in range(5):
         Language.factory(name, func=func)
     nlp = Language()
@@ -466,7 +480,6 @@ def test_pipe_factories_decorator_idempotent():
     # Make sure it also works for component decorator, which creates the
     # factory function
     name2 = f"{name}2"
-    func2 = lambda doc: doc
     for i in range(5):
         Language.component(name2, func=func2)
     nlp = Language()

@@ -0,0 +1,229 @@
import pytest
from spacy.lang.en import English
import numpy as np
import spacy
from spacy.tokens import Doc
from spacy.matcher import PhraseMatcher
from spacy.tokens import DocBin
from spacy.util import load_config_from_str
from spacy.training import Example
from spacy.training.initialize import init_nlp
import pickle
from ..util import make_tempdir
def test_issue6730(en_vocab):
"""Ensure that the KB does not accept empty strings, but otherwise IO works fine."""
from spacy.kb import KnowledgeBase
kb = KnowledgeBase(en_vocab, entity_vector_length=3)
kb.add_entity(entity="1", freq=148, entity_vector=[1, 2, 3])
with pytest.raises(ValueError):
kb.add_alias(alias="", entities=["1"], probabilities=[0.4])
assert kb.contains_alias("") is False
kb.add_alias(alias="x", entities=["1"], probabilities=[0.2])
kb.add_alias(alias="y", entities=["1"], probabilities=[0.1])
with make_tempdir() as tmp_dir:
kb.to_disk(tmp_dir)
kb.from_disk(tmp_dir)
assert kb.get_size_aliases() == 2
assert set(kb.get_alias_strings()) == {"x", "y"}
def test_issue6755(en_tokenizer):
doc = en_tokenizer("This is a magnificent sentence.")
span = doc[:0]
assert span.text_with_ws == ""
assert span.text == ""
@pytest.mark.parametrize(
"sentence, start_idx,end_idx,label",
[("Welcome to Mumbai, my friend", 11, 17, "GPE")],
)
def test_issue6815_1(sentence, start_idx, end_idx, label):
nlp = English()
doc = nlp(sentence)
span = doc[:].char_span(start_idx, end_idx, label=label)
assert span.label_ == label
@pytest.mark.parametrize(
"sentence, start_idx,end_idx,kb_id", [("Welcome to Mumbai, my friend", 11, 17, 5)]
)
def test_issue6815_2(sentence, start_idx, end_idx, kb_id):
nlp = English()
doc = nlp(sentence)
span = doc[:].char_span(start_idx, end_idx, kb_id=kb_id)
assert span.kb_id == kb_id
@pytest.mark.parametrize(
"sentence, start_idx,end_idx,vector",
[("Welcome to Mumbai, my friend", 11, 17, np.array([0.1, 0.2, 0.3]))],
)
def test_issue6815_3(sentence, start_idx, end_idx, vector):
nlp = English()
doc = nlp(sentence)
span = doc[:].char_span(start_idx, end_idx, vector=vector)
assert (span.vector == vector).all()
def test_issue6839(en_vocab):
"""Ensure that PhraseMatcher accepts Span as input"""
# fmt: off
words = ["I", "like", "Spans", "and", "Docs", "in", "my", "input", ",", "and", "nothing", "else", "."]
# fmt: on
doc = Doc(en_vocab, words=words)
span = doc[:8]
pattern = Doc(en_vocab, words=["Spans", "and", "Docs"])
matcher = PhraseMatcher(en_vocab)
matcher.add("SPACY", [pattern])
matches = matcher(span)
assert matches
CONFIG_ISSUE_6908 = """
[paths]
train = "TRAIN_PLACEHOLDER"
raw = null
init_tok2vec = null
vectors = null
[system]
seed = 0
gpu_allocator = null
[nlp]
lang = "en"
pipeline = ["textcat"]
tokenizer = {"@tokenizers":"spacy.Tokenizer.v1"}
disabled = []
before_creation = null
after_creation = null
after_pipeline_creation = null
batch_size = 1000
[components]
[components.textcat]
factory = "TEXTCAT_PLACEHOLDER"
[corpora]
[corpora.train]
@readers = "spacy.Corpus.v1"
path = ${paths:train}
[corpora.dev]
@readers = "spacy.Corpus.v1"
path = ${paths:train}
[training]
train_corpus = "corpora.train"
dev_corpus = "corpora.dev"
seed = ${system.seed}
gpu_allocator = ${system.gpu_allocator}
frozen_components = []
before_to_disk = null
[pretraining]
[initialize]
vectors = ${paths.vectors}
init_tok2vec = ${paths.init_tok2vec}
vocab_data = null
lookups = null
before_init = null
after_init = null
[initialize.components]
[initialize.components.textcat]
labels = ['label1', 'label2']
[initialize.tokenizer]
"""
@pytest.mark.parametrize(
"component_name", ["textcat", "textcat_multilabel"],
)
def test_issue6908(component_name):
"""Test intializing textcat with labels in a list"""
def create_data(out_file):
nlp = spacy.blank("en")
doc = nlp.make_doc("Some text")
doc.cats = {"label1": 0, "label2": 1}
out_data = DocBin(docs=[doc]).to_bytes()
with out_file.open("wb") as file_:
file_.write(out_data)
with make_tempdir() as tmp_path:
train_path = tmp_path / "train.spacy"
create_data(train_path)
config_str = CONFIG_ISSUE_6908.replace("TEXTCAT_PLACEHOLDER", component_name)
config_str = config_str.replace("TRAIN_PLACEHOLDER", train_path.as_posix())
config = load_config_from_str(config_str)
init_nlp(config)
CONFIG_ISSUE_6950 = """
[nlp]
lang = "en"
pipeline = ["tok2vec", "tagger"]
[components]
[components.tok2vec]
factory = "tok2vec"
[components.tok2vec.model]
@architectures = "spacy.Tok2Vec.v1"
[components.tok2vec.model.embed]
@architectures = "spacy.MultiHashEmbed.v1"
width = ${components.tok2vec.model.encode:width}
attrs = ["NORM","PREFIX","SUFFIX","SHAPE"]
rows = [5000,2500,2500,2500]
include_static_vectors = false
[components.tok2vec.model.encode]
@architectures = "spacy.MaxoutWindowEncoder.v1"
width = 96
depth = 4
window_size = 1
maxout_pieces = 3
[components.ner]
factory = "ner"
[components.tagger]
factory = "tagger"
[components.tagger.model]
@architectures = "spacy.Tagger.v1"
nO = null
[components.tagger.model.tok2vec]
@architectures = "spacy.Tok2VecListener.v1"
width = ${components.tok2vec.model.encode:width}
upstream = "*"
"""
def test_issue6950():
"""Test that the nlp object with initialized tok2vec with listeners pickles
correctly (and doesn't have lambdas).
"""
nlp = English.from_config(load_config_from_str(CONFIG_ISSUE_6950))
nlp.initialize(lambda: [Example.from_dict(nlp.make_doc("hello"), {"tags": ["V"]})])
pickle.dumps(nlp)
nlp("hello")
pickle.dumps(nlp)

@@ -1,23 +0,0 @@
import pytest
from ..util import make_tempdir
def test_issue6730(en_vocab):
"""Ensure that the KB does not accept empty strings, but otherwise IO works fine."""
from spacy.kb import KnowledgeBase
kb = KnowledgeBase(en_vocab, entity_vector_length=3)
kb.add_entity(entity="1", freq=148, entity_vector=[1, 2, 3])
with pytest.raises(ValueError):
kb.add_alias(alias="", entities=["1"], probabilities=[0.4])
assert kb.contains_alias("") is False
kb.add_alias(alias="x", entities=["1"], probabilities=[0.2])
kb.add_alias(alias="y", entities=["1"], probabilities=[0.1])
with make_tempdir() as tmp_dir:
kb.to_disk(tmp_dir)
kb.from_disk(tmp_dir)
assert kb.get_size_aliases() == 2
assert set(kb.get_alias_strings()) == {"x", "y"}

@@ -1,5 +0,0 @@
def test_issue6755(en_tokenizer):
doc = en_tokenizer("This is a magnificent sentence.")
span = doc[:0]
assert span.text_with_ws == ""
assert span.text == ""

@@ -1,35 +0,0 @@
import pytest
from spacy.lang.en import English
import numpy as np
@pytest.mark.parametrize(
"sentence, start_idx,end_idx,label",
[("Welcome to Mumbai, my friend", 11, 17, "GPE")],
)
def test_char_span_label(sentence, start_idx, end_idx, label):
nlp = English()
doc = nlp(sentence)
span = doc[:].char_span(start_idx, end_idx, label=label)
assert span.label_ == label
@pytest.mark.parametrize(
"sentence, start_idx,end_idx,kb_id", [("Welcome to Mumbai, my friend", 11, 17, 5)]
)
def test_char_span_kb_id(sentence, start_idx, end_idx, kb_id):
nlp = English()
doc = nlp(sentence)
span = doc[:].char_span(start_idx, end_idx, kb_id=kb_id)
assert span.kb_id == kb_id
@pytest.mark.parametrize(
"sentence, start_idx,end_idx,vector",
[("Welcome to Mumbai, my friend", 11, 17, np.array([0.1, 0.2, 0.3]))],
)
def test_char_span_vector(sentence, start_idx, end_idx, vector):
nlp = English()
doc = nlp(sentence)
span = doc[:].char_span(start_idx, end_idx, vector=vector)
assert (span.vector == vector).all()

@@ -1,102 +0,0 @@
import pytest
import spacy
from spacy.language import Language
from spacy.tokens import DocBin
from spacy import util
from spacy.schemas import ConfigSchemaInit
from spacy.training.initialize import init_nlp
from ..util import make_tempdir
TEXTCAT_WITH_LABELS_ARRAY_CONFIG = """
[paths]
train = "TRAIN_PLACEHOLDER"
raw = null
init_tok2vec = null
vectors = null
[system]
seed = 0
gpu_allocator = null
[nlp]
lang = "en"
pipeline = ["textcat"]
tokenizer = {"@tokenizers":"spacy.Tokenizer.v1"}
disabled = []
before_creation = null
after_creation = null
after_pipeline_creation = null
batch_size = 1000
[components]
[components.textcat]
factory = "TEXTCAT_PLACEHOLDER"
[corpora]
[corpora.train]
@readers = "spacy.Corpus.v1"
path = ${paths:train}
[corpora.dev]
@readers = "spacy.Corpus.v1"
path = ${paths:train}
[training]
train_corpus = "corpora.train"
dev_corpus = "corpora.dev"
seed = ${system.seed}
gpu_allocator = ${system.gpu_allocator}
frozen_components = []
before_to_disk = null
[pretraining]
[initialize]
vectors = ${paths.vectors}
init_tok2vec = ${paths.init_tok2vec}
vocab_data = null
lookups = null
before_init = null
after_init = null
[initialize.components]
[initialize.components.textcat]
labels = ['label1', 'label2']
[initialize.tokenizer]
"""
@pytest.mark.parametrize(
"component_name",
["textcat", "textcat_multilabel"],
)
def test_textcat_initialize_labels_validation(component_name):
"""Test intializing textcat with labels in a list"""
def create_data(out_file):
nlp = spacy.blank("en")
doc = nlp.make_doc("Some text")
doc.cats = {"label1": 0, "label2": 1}
out_data = DocBin(docs=[doc]).to_bytes()
with out_file.open("wb") as file_:
file_.write(out_data)
with make_tempdir() as tmp_path:
train_path = tmp_path / "train.spacy"
create_data(train_path)
config_str = TEXTCAT_WITH_LABELS_ARRAY_CONFIG.replace(
"TEXTCAT_PLACEHOLDER", component_name
)
config_str = config_str.replace("TRAIN_PLACEHOLDER", train_path.as_posix())
config = util.load_config_from_str(config_str)
init_nlp(config)

@@ -0,0 +1,12 @@
from spacy.cli.evaluate import print_textcats_auc_per_cat, print_prf_per_type
from wasabi import msg
def test_issue7019():
scores = {"LABEL_A": 0.39829102, "LABEL_B": 0.938298329382, "LABEL_C": None}
print_textcats_auc_per_cat(msg, scores)
scores = {
"LABEL_A": {"p": 0.3420302, "r": 0.3929020, "f": 0.49823928932},
"LABEL_B": {"p": None, "r": None, "f": None},
}
print_prf_per_type(msg, scores, name="foo", type="bar")

@@ -0,0 +1,67 @@
from spacy.lang.en import English
from spacy.training import Example
from spacy.util import load_config_from_str
CONFIG = """
[nlp]
lang = "en"
pipeline = ["tok2vec", "tagger"]
[components]
[components.tok2vec]
factory = "tok2vec"
[components.tok2vec.model]
@architectures = "spacy.Tok2Vec.v1"
[components.tok2vec.model.embed]
@architectures = "spacy.MultiHashEmbed.v1"
width = ${components.tok2vec.model.encode:width}
attrs = ["NORM","PREFIX","SUFFIX","SHAPE"]
rows = [5000,2500,2500,2500]
include_static_vectors = false
[components.tok2vec.model.encode]
@architectures = "spacy.MaxoutWindowEncoder.v1"
width = 96
depth = 4
window_size = 1
maxout_pieces = 3
[components.tagger]
factory = "tagger"
[components.tagger.model]
@architectures = "spacy.Tagger.v1"
nO = null
[components.tagger.model.tok2vec]
@architectures = "spacy.Tok2VecListener.v1"
width = ${components.tok2vec.model.encode:width}
upstream = "*"
"""
TRAIN_DATA = [
("I like green eggs", {"tags": ["N", "V", "J", "N"]}),
("Eat blue ham", {"tags": ["V", "J", "N"]}),
]
def test_issue7029():
"""Test that an empty document doesn't mess up an entire batch."""
nlp = English.from_config(load_config_from_str(CONFIG))
train_examples = []
for t in TRAIN_DATA:
train_examples.append(Example.from_dict(nlp.make_doc(t[0]), t[1]))
optimizer = nlp.initialize(get_examples=lambda: train_examples)
for i in range(50):
losses = {}
nlp.update(train_examples, sgd=optimizer, losses=losses)
texts = ["first", "second", "third", "fourth", "and", "then", "some", ""]
nlp.select_pipes(enable=["tok2vec", "tagger"])
docs1 = list(nlp.pipe(texts, batch_size=1))
docs2 = list(nlp.pipe(texts, batch_size=4))
assert [doc[0].tag_ for doc in docs1[:-1]] == [doc[0].tag_ for doc in docs2[:-1]]

@@ -325,6 +325,23 @@ def test_project_config_interpolation():
         substitute_project_variables(project)


+def test_project_config_interpolation_env():
+    variables = {"a": 10}
+    env_var = "SPACY_TEST_FOO"
+    env_vars = {"foo": env_var}
+    commands = [{"name": "x", "script": ["hello ${vars.a} ${env.foo}"]}]
+    project = {"commands": commands, "vars": variables, "env": env_vars}
+    with make_tempdir() as d:
+        srsly.write_yaml(d / "project.yml", project)
+        cfg = load_project_config(d)
+    assert cfg["commands"][0]["script"][0] == "hello 10 "
+    os.environ[env_var] = "123"
+    with make_tempdir() as d:
+        srsly.write_yaml(d / "project.yml", project)
+        cfg = load_project_config(d)
+    assert cfg["commands"][0]["script"][0] == "hello 10 123"
+
+
 @pytest.mark.parametrize(
     "args,expected",
     [

@@ -1,7 +1,9 @@
 import pytest
 import numpy
 import srsly
+from spacy.lang.en import English
 from spacy.strings import StringStore
+from spacy.tokens import Doc
 from spacy.vocab import Vocab
 from spacy.attrs import NORM
@@ -20,7 +22,10 @@ def test_pickle_string_store(text1, text2):
 @pytest.mark.parametrize("text1,text2", [("dog", "cat")])
 def test_pickle_vocab(text1, text2):
-    vocab = Vocab(lex_attr_getters={int(NORM): lambda string: string[:-1]})
+    vocab = Vocab(
+        lex_attr_getters={int(NORM): lambda string: string[:-1]},
+        get_noun_chunks=English.Defaults.syntax_iterators.get("noun_chunks"),
+    )
     vocab.set_vector("dog", numpy.ones((5,), dtype="f"))
     lex1 = vocab[text1]
     lex2 = vocab[text2]
@@ -34,4 +39,23 @@ def test_pickle_vocab(text1, text2):
     assert unpickled[text2].norm == lex2.norm
     assert unpickled[text1].norm != unpickled[text2].norm
     assert unpickled.vectors is not None
+    assert unpickled.get_noun_chunks is not None
     assert list(vocab["dog"].vector) == [1.0, 1.0, 1.0, 1.0, 1.0]
+
+
+def test_pickle_doc(en_vocab):
+    words = ["a", "b", "c"]
+    deps = ["dep"] * len(words)
+    heads = [0] * len(words)
+    doc = Doc(
+        en_vocab,
+        words=words,
+        deps=deps,
+        heads=heads,
+    )
+    data = srsly.pickle_dumps(doc)
+    unpickled = srsly.pickle_loads(data)
+    assert [t.text for t in unpickled] == words
+    assert [t.dep_ for t in unpickled] == deps
+    assert [t.head.i for t in unpickled] == heads
+    assert list(doc.noun_chunks) == []

@@ -55,6 +55,7 @@ def test_vocab_lexeme_add_flag_provided_id(en_vocab):
     assert en_vocab["199"].check_flag(IS_DIGIT) is False
     assert en_vocab["the"].check_flag(is_len4) is False
     assert en_vocab["dogs"].check_flag(is_len4) is True
+    en_vocab.add_flag(lambda string: string.isdigit(), flag_id=IS_DIGIT)


 def test_vocab_lexeme_oov_rank(en_vocab):

@@ -245,7 +245,7 @@ cdef class Tokenizer:
         cdef int offset
         cdef int modified_doc_length
         # Find matches for special cases
-        self._special_matcher.find_matches(doc, &c_matches)
+        self._special_matcher.find_matches(doc, 0, doc.length, &c_matches)
         # Skip processing if no matches
         if c_matches.size() == 0:
             return True

@@ -215,8 +215,7 @@ def convert_vectors(

 def read_vectors(vectors_loc: Path, truncate_vectors: int):
-    f = open_file(vectors_loc)
-    f = ensure_shape(f)
+    f = ensure_shape(vectors_loc)
     shape = tuple(int(size) for size in next(f).split())
     if truncate_vectors >= 1:
         shape = (truncate_vectors, shape[1])
@@ -251,11 +250,12 @@ def open_file(loc: Union[str, Path]) -> IO:
         return loc.open("r", encoding="utf8")


-def ensure_shape(lines):
+def ensure_shape(vectors_loc):
     """Ensure that the first line of the data is the vectors shape.
     If it's not, we read in the data and output the shape as the first result,
     so that the reader doesn't have to deal with the problem.
     """
+    lines = open_file(vectors_loc)
     first_line = next(lines)
     try:
         shape = tuple(int(size) for size in first_line.split())
@@ -269,7 +269,11 @@ def ensure_shape(lines):
     # Figure out the shape, make it the first value, and then give the
     # rest of the data.
     width = len(first_line.split()) - 1
-    captured = [first_line] + list(lines)
-    length = len(captured)
+    length = 1
+    for _ in lines:
+        length += 1
     yield f"{length} {width}"
-    yield from captured
+    # Reading the lines in again from file. This to avoid having to
+    # store all the results in a list in memory
+    lines2 = open_file(vectors_loc)
+    yield from lines2

@@ -930,6 +930,8 @@ def is_same_func(func1: Callable, func2: Callable) -> bool:
     """
     if not callable(func1) or not callable(func2):
         return False
+    if not hasattr(func1, "__qualname__") or not hasattr(func2, "__qualname__"):
+        return False
     same_name = func1.__qualname__ == func2.__qualname__
     same_file = inspect.getfile(func1) == inspect.getfile(func2)
     same_code = inspect.getsourcelines(func1) == inspect.getsourcelines(func2)
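The added `hasattr` guard matters for callables that aren't plain functions, such as class instances registered as components (compare the `PipeFactoriesIdempotent` test earlier in this commit): instances don't expose `__qualname__`, so the comparison would otherwise raise. A small illustration of that general Python behavior:

```python
class CallableComponent:
    def __call__(self, doc):
        return doc

assert hasattr(CallableComponent, "__qualname__")        # the class has it
assert not hasattr(CallableComponent(), "__qualname__")  # an instance does not
```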

@@ -551,12 +551,13 @@ def pickle_vocab(vocab):
     data_dir = vocab.data_dir
     lex_attr_getters = srsly.pickle_dumps(vocab.lex_attr_getters)
     lookups = vocab.lookups
+    get_noun_chunks = vocab.get_noun_chunks
     return (unpickle_vocab,
-        (sstore, vectors, morph, data_dir, lex_attr_getters, lookups))
+        (sstore, vectors, morph, data_dir, lex_attr_getters, lookups, get_noun_chunks))


 def unpickle_vocab(sstore, vectors, morphology, data_dir,
-                   lex_attr_getters, lookups):
+                   lex_attr_getters, lookups, get_noun_chunks):
     cdef Vocab vocab = Vocab()
     vocab.vectors = vectors
     vocab.strings = sstore
@@ -564,6 +565,7 @@ def unpickle_vocab(sstore, vectors, morphology, data_dir,
     vocab.data_dir = data_dir
     vocab.lex_attr_getters = srsly.pickle_loads(lex_attr_getters)
     vocab.lookups = lookups
+    vocab.get_noun_chunks = get_noun_chunks
     return vocab

@@ -67,7 +67,7 @@ data format used by the lookup and rule-based lemmatizers, see
 > lemmatizer = nlp.add_pipe("lemmatizer")
 >
 > # Construction via add_pipe with custom settings
-> config = {"mode": "rule", overwrite=True}
+> config = {"mode": "rule", "overwrite": True}
 > lemmatizer = nlp.add_pipe("lemmatizer", config=config)
 > ```

@@ -44,7 +44,7 @@ be shown.

 ## PhraseMatcher.\_\_call\_\_ {#call tag="method"}

-Find all token sequences matching the supplied patterns on the `Doc`.
+Find all token sequences matching the supplied patterns on the `Doc` or `Span`.

 > #### Example
 >
@@ -59,7 +59,7 @@ Find all token sequences matching the supplied patterns on the `Doc`.
 | Name                                  | Description |
 | ------------------------------------- | ----------- |
-| `doc`                                 | The document to match over. ~~Doc~~ |
+| `doclike`                             | The `Doc` or `Span` to match over. ~~Union[Doc, Span]~~ |
 | _keyword-only_                        | |
 | `as_spans` <Tag variant="new">3</Tag> | Instead of tuples, return a list of [`Span`](/api/span) objects of the matches, with the `match_id` assigned as the span label. Defaults to `False`. ~~bool~~ |
 | **RETURNS**                           | A list of `(match_id, start, end)` tuples, describing the matches. A match tuple describes a span `doc[start:end`]. The `match_id` is the ID of the added match pattern. If `as_spans` is set to `True`, a list of `Span` objects is returned instead. ~~Union[List[Tuple[int, int, int]], List[Span]]~~ |

View File

@@ -727,7 +727,7 @@ capitalization by including a mix of capitalized and lowercase examples. See the
 Create a data augmentation callback that uses orth-variant replacement. The
 callback can be added to a corpus or other data iterator during training. It's
-is especially useful for punctuation and case replacement, to help generalize
+especially useful for punctuation and case replacement, to help generalize
 beyond corpora that don't have smart quotes, or only have smart quotes etc.
 | Name | Description |
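For orientation: this entry documents the augmenter registered as `spacy.orth_variants.v1`. Its parameter table is cut off in this hunk, so the sketch below only shows how the registered name resolves to a factory without guessing its arguments; in practice the augmenter is referenced from the training config (for example in a `[corpora.train.augmenter]` block) rather than constructed by hand.

```python
from spacy import registry

# Look up the registered augmenter factory by its string name.
orth_variants_factory = registry.augmenters.get("spacy.orth_variants.v1")
print(orth_variants_factory.__name__)
```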
@@ -4,8 +4,8 @@ import { Help } from 'components/typography'; import Link from 'components/link'
 | Pipeline | Parser | Tagger | NER |
 | ---------------------------------------------------------- | -----: | -----: | ---: |
-| [`en_core_web_trf`](/models/en#en_core_web_trf) (spaCy v3) | 95.2 | 97.8 | 89.9 |
-| [`en_core_web_lg`](/models/en#en_core_web_lg) (spaCy v3) | 91.9 | 97.4 | 85.5 |
+| [`en_core_web_trf`](/models/en#en_core_web_trf) (spaCy v3) | 95.1 | 97.8 | 89.8 |
+| [`en_core_web_lg`](/models/en#en_core_web_lg) (spaCy v3) | 92.0 | 97.4 | 85.5 |
 | `en_core_web_lg` (spaCy v2) | 91.9 | 97.2 | 85.5 |
 <figcaption class="caption">
@@ -22,7 +22,7 @@ the development set).
 | Named Entity Recognition System | OntoNotes | CoNLL '03 |
 | -------------------------------- | --------: | --------: |
-| spaCy RoBERTa (2020) | 89.7 | 91.6 |
+| spaCy RoBERTa (2020) | 89.8 | 91.6 |
 | Stanza (StanfordNLP)<sup>1</sup> | 88.8 | 92.1 |
 | Flair<sup>2</sup> | 89.7 | 93.1 |
@@ -77,7 +77,7 @@ import Benchmarks from 'usage/\_benchmarks-models.md'
 | Dependency Parsing System | UAS | LAS |
 | ------------------------------------------------------------------------------ | ---: | ---: |
-| spaCy RoBERTa (2020) | 95.5 | 94.3 |
+| spaCy RoBERTa (2020) | 95.1 | 93.7 |
 | [Mrini et al.](https://khalilmrini.github.io/Label_Attention_Layer.pdf) (2019) | 97.4 | 96.3 |
 | [Zhou and Zhao](https://www.aclweb.org/anthology/P19-1230/) (2019) | 97.2 | 95.7 |
@@ -69,9 +69,9 @@ python -m spacy project clone pipelines/tagger_parser_ud
 By default, the project will be cloned into the current working directory. You
 can specify an optional second argument to define the output directory. The
-`--repo` option lets you define a custom repo to clone from if you don't want
-to use the spaCy [`projects`](https://github.com/explosion/projects) repo. You
-can also use any private repo you have access to with Git.
+`--repo` option lets you define a custom repo to clone from if you don't want to
+use the spaCy [`projects`](https://github.com/explosion/projects) repo. You can
+also use any private repo you have access to with Git.
 ### 2. Fetch the project assets {#assets}
@@ -221,6 +221,7 @@ pipelines.
 | `title` | An optional project title used in `--help` message and [auto-generated docs](#custom-docs). |
 | `description` | An optional project description used in [auto-generated docs](#custom-docs). |
 | `vars` | A dictionary of variables that can be referenced in paths, URLs and scripts, just like [`config.cfg` variables](/usage/training#config-interpolation). For example, `${vars.name}` will use the value of the variable `name`. Variables need to be defined in the section `vars`, but can be a nested dict, so you're able to reference `${vars.model.name}`. |
+| `env` | A dictionary of variables, mapped to the names of environment variables that will be read in when running the project. For example, `${env.name}` will use the value of the environment variable defined as `name`. |
 | `directories` | An optional list of [directories](#project-files) that should be created in the project for assets, training outputs, metrics etc. spaCy will make sure that these directories always exist. |
 | `assets` | A list of assets that can be fetched with the [`project assets`](/api/cli#project-assets) command. `url` defines a URL or local path, `dest` is the destination file relative to the project directory, and an optional `checksum` ensures that an error is raised if the file's checksum doesn't match. Instead of `url`, you can also provide a `git` block with the keys `repo`, `branch` and `path`, to download from a Git repo. |
 | `workflows` | A dictionary of workflow names, mapped to a list of command names, to execute in order. Workflows can be run with the [`project run`](/api/cli#project-run) command. |
@@ -310,8 +311,8 @@ company-internal and not available over the internet. In that case, you can
 specify the destination paths and a checksum, and leave out the URL. When your
 teammates clone and run your project, they can place the files in the respective
 directory themselves. The [`project assets`](/api/cli#project-assets) command
-will alert you about missing files and mismatched checksums, so you can ensure that
-others are running your project with the same data.
+will alert you about missing files and mismatched checksums, so you can ensure
+that others are running your project with the same data.
 ### Dependencies and outputs {#deps-outputs}
@@ -358,9 +359,10 @@ graphs based on the dependencies and outputs, and won't re-run previous steps
 automatically. For instance, if you only run the command `train` that depends on
 data created by `preprocess` and those files are missing, spaCy will show an
 error; it won't just re-run `preprocess`. If you're looking for more advanced
-data management, check out the [Data Version Control (DVC) integration](#dvc). If you're planning on integrating your spaCy project with DVC, you
-can also use `outputs_no_cache` instead of `outputs` to define outputs that
-won't be cached or tracked.
+data management, check out the [Data Version Control (DVC) integration](#dvc).
+If you're planning on integrating your spaCy project with DVC, you can also use
+`outputs_no_cache` instead of `outputs` to define outputs that won't be cached
+or tracked.
 ### Files and directory structure {#project-files}
@@ -467,7 +469,9 @@ In your `project.yml`, you can then run the script by calling
 `python scripts/custom_evaluation.py` with the function arguments. You can also
 use the `vars` section to define reusable variables that will be substituted in
 commands, paths and URLs. In this example, the batch size is defined as a
-variable will be added in place of `${vars.batch_size}` in the script.
+variable and will be added in place of `${vars.batch_size}` in the script. Just
+like in the [training config](/usage/training#config-overrides), you can also
+override settings on the command line, for example using `--vars.batch_size`.
 > #### Calling into Python
 >
@@ -491,6 +495,29 @@ commands:
       - 'corpus/eval.json'
 ```
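The command above hands the substituted batch size to `scripts/custom_evaluation.py` as an ordinary command-line argument. A hypothetical sketch of such a script (the argument handling and defaults are illustrative, not part of the spaCy docs); it works the same whether the value comes from `${vars.batch_size}` or, as described below, from `${env.batch_size}`:

```python
# scripts/custom_evaluation.py (hypothetical sketch)
import sys

def evaluate(batch_size: int, eval_path: str = "corpus/eval.json") -> None:
    # The substituted value arrives as plain text on the command line,
    # so convert it explicitly before using it.
    print(f"Evaluating {eval_path} with batch size {batch_size}")

if __name__ == "__main__":
    evaluate(int(sys.argv[1]))
```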
+You can also use the `env` section to reference **environment variables** and
+make their values available to the commands. This can be useful for overriding
+settings on the command line and passing through system-level settings.
+
+> #### Usage example
+>
+> ```bash
+> export GPU_ID=1
+> BATCH_SIZE=128 python -m spacy project run evaluate
+> ```
+
+```yaml
+### project.yml
+env:
+  batch_size: BATCH_SIZE
+  gpu_id: GPU_ID
+commands:
+  - name: evaluate
+    script:
+      - 'python scripts/custom_evaluation.py ${env.batch_size}'
+```
 ### Documenting your project {#custom-docs}
 > #### Readme Example
@@ -185,7 +185,7 @@ sections of a config file are:
 For a full overview of spaCy's config format and settings, see the
 [data format documentation](/api/data-formats#config) and
-[Thinc's config system docs](https://thinc.ai/usage/config). The settings
+[Thinc's config system docs](https://thinc.ai/docs/usage-config). The settings
 available for the different architectures are documented with the
 [model architectures API](/api/architectures). See the Thinc documentation for
 [optimizers](https://thinc.ai/docs/api-optimizers) and
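As a quick orientation for the config system the corrected link points to, here is a small sketch (assuming a `config.cfg` generated by `spacy init config` sits in the working directory) that loads a training config with Thinc's `Config` class and reads a couple of its settings:

```python
from thinc.api import Config

# Load the config from disk; sections such as [nlp] and [training]
# become nested dictionaries.
config = Config().from_disk("config.cfg")
print(config["nlp"]["lang"])
print(config["training"].get("max_epochs"))
```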
@@ -198,6 +198,7 @@
       "has_examples": true
     },
     { "code": "tl", "name": "Tagalog" },
+    { "code": "tn", "name": "Setswana", "has_examples": true },
     { "code": "tr", "name": "Turkish", "example": "Bu bir cümledir.", "has_examples": true },
     { "code": "tt", "name": "Tatar", "has_examples": true },
     {