Merge remote-tracking branch 'upstream/master' into chore/update-develop-from-master-v3.1-1

Adriane Boyd 2021-08-09 13:13:13 +02:00
commit a79888ed67
40 changed files with 1255 additions and 85 deletions

.github/contributors/ezorita.md (new file, 106 lines)

@ -0,0 +1,106 @@
# spaCy contributor agreement
This spaCy Contributor Agreement (**"SCA"**) is based on the
[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf).
The SCA applies to any contribution that you make to any product or project
managed by us (the **"project"**), and sets out the intellectual property rights
you grant to us in the contributed materials. The term **"us"** shall mean
[ExplosionAI GmbH](https://explosion.ai/legal). The term
**"you"** shall mean the person or entity identified below.
If you agree to be bound by these terms, fill in the information requested
below and include the filled-in version with your first pull request, under the
folder [`.github/contributors/`](/.github/contributors/). The name of the file
should be your GitHub username, with the extension `.md`. For example, the user
example_user would create the file `.github/contributors/example_user.md`.
Read this agreement carefully before signing. These terms and conditions
constitute a binding legal agreement.
## Contributor Agreement
1. The term "contribution" or "contributed materials" means any source code,
object code, patch, tool, sample, graphic, specification, manual,
documentation, or any other material posted or submitted by you to the project.
2. With respect to any worldwide copyrights, or copyright applications and
registrations, in your contribution:
* you hereby assign to us joint ownership, and to the extent that such
assignment is or becomes invalid, ineffective or unenforceable, you hereby
grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
royalty-free, unrestricted license to exercise all rights under those
copyrights. This includes, at our option, the right to sublicense these same
rights to third parties through multiple levels of sublicensees or other
licensing arrangements;
* you agree that each of us can do all things in relation to your
contribution as if each of us were the sole owners, and if one of us makes
a derivative work of your contribution, the one who makes the derivative
work (or has it made) will be the sole owner of that derivative work;
* you agree that you will not assert any moral rights in your contribution
against us, our licensees or transferees;
* you agree that we may register a copyright in your contribution and
exercise all ownership rights associated with it; and
* you agree that neither of us has any duty to consult with, obtain the
consent of, pay or render an accounting to the other for any use or
distribution of your contribution.
3. With respect to any patents you own, or that you can license without payment
to any third party, you hereby grant to us a perpetual, irrevocable,
non-exclusive, worldwide, no-charge, royalty-free license to:
* make, have made, use, sell, offer to sell, import, and otherwise transfer
your contribution in whole or in part, alone or in combination with or
included in any product, work or materials arising out of the project to
which your contribution was submitted, and
* at our option, to sublicense these same rights to third parties through
multiple levels of sublicensees or other licensing arrangements.
4. Except as set out above, you keep all right, title, and interest in your
contribution. The rights that you grant to us under these terms are effective
on the date you first submitted a contribution to us, even if your submission
took place before the date you sign these terms.
5. You covenant, represent, warrant and agree that:
* Each contribution that you submit is and shall be an original work of
authorship and you can legally grant the rights set out in this SCA;
* to the best of your knowledge, each contribution will not violate any
third party's copyrights, trademarks, patents, or other intellectual
property rights; and
* each contribution shall be in compliance with U.S. export control laws and
other applicable export and import laws. You agree to notify us if you
become aware of any circumstance which would make any of the foregoing
representations inaccurate in any respect. We may publicly disclose your
participation in the project, including the fact that you have signed the SCA.
6. This SCA is governed by the laws of the State of California and applicable
U.S. Federal law. Any choice of law rules will not apply.
7. Please place an “x” on one of the applicable statements below. Please do NOT
mark both statements:
* [x] I am signing on behalf of myself as an individual and no other person
or entity, including my employer, has or will have rights with respect to my
contributions.
* [ ] I am signing on behalf of my employer or a legal entity and I have the
actual authority to contractually bind that entity.
## Contributor Details
| Field | Entry |
|------------------------------- | -------------------- |
| Name | Eduard Zorita |
| Company name (if applicable) | |
| Title or role (if applicable) | |
| Date | 06/17/2021 |
| GitHub username | ezorita |
| Website (optional) | |

.github/contributors/nsorros.md (new file, 106 lines)

@ -0,0 +1,106 @@
# spaCy contributor agreement
This spaCy Contributor Agreement (**"SCA"**) is based on the
[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf).
The SCA applies to any contribution that you make to any product or project
managed by us (the **"project"**), and sets out the intellectual property rights
you grant to us in the contributed materials. The term **"us"** shall mean
[ExplosionAI GmbH](https://explosion.ai/legal). The term
**"you"** shall mean the person or entity identified below.
If you agree to be bound by these terms, fill in the information requested
below and include the filled-in version with your first pull request, under the
folder [`.github/contributors/`](/.github/contributors/). The name of the file
should be your GitHub username, with the extension `.md`. For example, the user
example_user would create the file `.github/contributors/example_user.md`.
Read this agreement carefully before signing. These terms and conditions
constitute a binding legal agreement.
## Contributor Agreement
1. The term "contribution" or "contributed materials" means any source code,
object code, patch, tool, sample, graphic, specification, manual,
documentation, or any other material posted or submitted by you to the project.
2. With respect to any worldwide copyrights, or copyright applications and
registrations, in your contribution:
* you hereby assign to us joint ownership, and to the extent that such
assignment is or becomes invalid, ineffective or unenforceable, you hereby
grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
royalty-free, unrestricted license to exercise all rights under those
copyrights. This includes, at our option, the right to sublicense these same
rights to third parties through multiple levels of sublicensees or other
licensing arrangements;
* you agree that each of us can do all things in relation to your
contribution as if each of us were the sole owners, and if one of us makes
a derivative work of your contribution, the one who makes the derivative
work (or has it made) will be the sole owner of that derivative work;
* you agree that you will not assert any moral rights in your contribution
against us, our licensees or transferees;
* you agree that we may register a copyright in your contribution and
exercise all ownership rights associated with it; and
* you agree that neither of us has any duty to consult with, obtain the
consent of, pay or render an accounting to the other for any use or
distribution of your contribution.
3. With respect to any patents you own, or that you can license without payment
to any third party, you hereby grant to us a perpetual, irrevocable,
non-exclusive, worldwide, no-charge, royalty-free license to:
* make, have made, use, sell, offer to sell, import, and otherwise transfer
your contribution in whole or in part, alone or in combination with or
included in any product, work or materials arising out of the project to
which your contribution was submitted, and
* at our option, to sublicense these same rights to third parties through
multiple levels of sublicensees or other licensing arrangements.
4. Except as set out above, you keep all right, title, and interest in your
contribution. The rights that you grant to us under these terms are effective
on the date you first submitted a contribution to us, even if your submission
took place before the date you sign these terms.
5. You covenant, represent, warrant and agree that:
* Each contribution that you submit is and shall be an original work of
authorship and you can legally grant the rights set out in this SCA;
* to the best of your knowledge, each contribution will not violate any
third party's copyrights, trademarks, patents, or other intellectual
property rights; and
* each contribution shall be in compliance with U.S. export control laws and
other applicable export and import laws. You agree to notify us if you
become aware of any circumstance which would make any of the foregoing
representations inaccurate in any respect. We may publicly disclose your
participation in the project, including the fact that you have signed the SCA.
6. This SCA is governed by the laws of the State of California and applicable
U.S. Federal law. Any choice of law rules will not apply.
7. Please place an “x” on one of the applicable statements below. Please do NOT
mark both statements:
* [x] I am signing on behalf of myself as an individual and no other person
or entity, including my employer, has or will have rights with respect to my
contributions.
* [ ] I am signing on behalf of my employer or a legal entity and I have the
actual authority to contractually bind that entity.
## Contributor Details
| Field | Entry |
|------------------------------- | -------------------- |
| Name | Nick Sorros |
| Company name (if applicable) | |
| Title or role (if applicable) | |
| Date | 2/8/2021 |
| GitHub username | nsorros |
| Website (optional) | |

@ -2,7 +2,7 @@ from pathlib import Path
from wasabi import msg
from .remote_storage import RemoteStorage
from .remote_storage import get_command_hash
from .._util import project_cli, Arg
from .._util import project_cli, Arg, logger
from .._util import load_project_config
from .run import update_lockfile
@ -39,11 +39,15 @@ def project_pull(project_dir: Path, remote: str, *, verbose: bool = False):
# in the list.
while commands:
for i, cmd in enumerate(list(commands)):
logger.debug(f"CMD: {cmd['name']}.")
deps = [project_dir / dep for dep in cmd.get("deps", [])]
if all(dep.exists() for dep in deps):
cmd_hash = get_command_hash("", "", deps, cmd["script"])
for output_path in cmd.get("outputs", []):
url = storage.pull(output_path, command_hash=cmd_hash)
logger.debug(
f"URL: {url} for {output_path} with command hash {cmd_hash}"
)
yield url, output_path
out_locs = [project_dir / out for out in cmd.get("outputs", [])]
@ -53,6 +57,8 @@ def project_pull(project_dir: Path, remote: str, *, verbose: bool = False):
# we iterate over the loop again.
commands.pop(i)
break
else:
logger.debug(f"Dependency missing. Skipping {cmd['name']} outputs.")
else:
# If we didn't break the for loop, break the while loop.
break
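The control flow in `project_pull` can be hard to follow in a flattened diff: a `while` loop wraps a `for ... else`, and the scan restarts every time one command's outputs become available. A minimal sketch of that pattern follows, assuming a plain set of names stands in for files on disk and for the remote storage lookup. The function `resolve_in_order` and the sample command names are hypothetical and are not part of spaCy.

```python
from typing import Dict, List, Set


def resolve_in_order(commands: List[Dict], available: Set[str]) -> List[str]:
    """Return command names in the order their outputs could be fetched.

    `available` is a hypothetical stand-in for paths that exist locally.
    """
    available = set(available)  # copy so the caller's set is not mutated
    pending = list(commands)
    done: List[str] = []
    while pending:
        for i, cmd in enumerate(list(pending)):
            if all(dep in available for dep in cmd.get("deps", [])):
                # Assume every output of this command was pulled successfully.
                available.update(cmd.get("outputs", []))
                done.append(cmd["name"])
                pending.pop(i)
                break  # restart the scan: new files may unblock other commands
        else:
            # A full pass finished without a break, so nothing can progress.
            break
    return done


commands = [
    {"name": "train", "deps": ["corpus"], "outputs": ["model"]},
    {"name": "convert", "deps": ["raw"], "outputs": ["corpus"]},
    {"name": "evaluate", "deps": ["model", "gold"], "outputs": ["metrics"]},
]
order = resolve_in_order(commands, {"raw"})
```

Here `convert` is resolved first, which unblocks `train`; `evaluate` is skipped because `gold` never appears. That skipped case is the one the new `Dependency missing` debug message reports.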

@ -3,7 +3,7 @@ from wasabi import msg
from .remote_storage import RemoteStorage
from .remote_storage import get_content_hash, get_command_hash
from .._util import load_project_config
from .._util import project_cli, Arg
from .._util import project_cli, Arg, logger
@project_cli.command("push")
@ -37,12 +37,15 @@ def project_push(project_dir: Path, remote: str):
remote = config["remotes"][remote]
storage = RemoteStorage(project_dir, remote)
for cmd in config.get("commands", []):
logger.debug(f"CMD: {cmd['name']}")
deps = [project_dir / dep for dep in cmd.get("deps", [])]
if any(not dep.exists() for dep in deps):
logger.debug(f"Dependency missing. Skipping {cmd['name']} outputs")
continue
cmd_hash = get_command_hash(
"", "", [project_dir / dep for dep in cmd.get("deps", [])], cmd["script"]
)
logger.debug(f"CMD_HASH: {cmd_hash}")
for output_path in cmd.get("outputs", []):
output_loc = project_dir / output_path
if output_loc.exists() and _is_not_empty_dir(output_loc):
@ -51,6 +54,9 @@ def project_push(project_dir: Path, remote: str):
command_hash=cmd_hash,
content_hash=get_content_hash(output_loc),
)
logger.debug(
f"URL: {url} for output {output_path} with cmd_hash {cmd_hash}"
)
yield output_path, url

@ -43,9 +43,13 @@ def train_cli(
# Make sure all files and paths exists if they are needed
if not config_path or (str(config_path) != "-" and not config_path.exists()):
msg.fail("Config file not found", config_path, exits=1)
if output_path is not None and not output_path.exists():
if not output_path:
msg.info("No output directory provided")
else:
if not output_path.exists():
output_path.mkdir(parents=True)
msg.good(f"Created output directory: {output_path}")
msg.info(f"Saving to output directory: {output_path}")
overrides = parse_config_overrides(ctx.args)
import_code(code_path)
setup_gpu(use_gpu)
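The reworked branch in `train_cli` separates three cases: no output path, a path that has to be created, and a path that already exists. The sketch below reproduces that ordering with the standard library only. The helper name `ensure_output_dir` is hypothetical, and the messages are returned as strings instead of being printed through `wasabi`.

```python
import tempfile
from pathlib import Path
from typing import Optional


def ensure_output_dir(output_path: Optional[Path]) -> str:
    """Hypothetical helper mirroring the branch order of the hunk above."""
    if not output_path:
        return "No output directory provided"
    if not output_path.exists():
        # parents=True also creates any missing intermediate directories.
        output_path.mkdir(parents=True)
    return f"Saving to output directory: {output_path}"


with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "runs" / "exp1"
    message = ensure_output_dir(target)
    created = target.is_dir()

no_dir_message = ensure_output_dir(None)
```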

@ -356,8 +356,8 @@ class Errors:
E098 = ("Invalid pattern: expected both RIGHT_ID and RIGHT_ATTRS.")
E099 = ("Invalid pattern: the first node of pattern should be an anchor "
"node. The node should only contain RIGHT_ID and RIGHT_ATTRS.")
E100 = ("Nodes other than the anchor node should all contain LEFT_ID, "
"REL_OP and RIGHT_ID.")
E100 = ("Nodes other than the anchor node should all contain {required}, "
"but these are missing: {missing}")
E101 = ("RIGHT_ID should be a new node and LEFT_ID should already have "
"been declared in previous edges.")
E102 = ("Can't merge non-disjoint spans. '{token}' is already part of "

@ -1698,7 +1698,6 @@ class Language:
# them here so they're only loaded once
source_nlps = {}
source_nlp_vectors_hashes = {}
nlp.meta["_sourced_vectors_hashes"] = {}
for pipe_name in config["nlp"]["pipeline"]:
if pipe_name not in pipeline:
opts = ", ".join(pipeline.keys())
@ -1747,6 +1746,8 @@ class Language:
source_nlp_vectors_hashes[model] = hash(
source_nlps[model].vocab.vectors.to_bytes()
)
if "_sourced_vectors_hashes" not in nlp.meta:
nlp.meta["_sourced_vectors_hashes"] = {}
nlp.meta["_sourced_vectors_hashes"][
pipe_name
] = source_nlp_vectors_hashes[model]
@ -1908,7 +1909,7 @@ class Language:
if not hasattr(proc, "to_disk"):
continue
serializers[name] = lambda p, proc=proc: proc.to_disk(p, exclude=["vocab"])
serializers["vocab"] = lambda p: self.vocab.to_disk(p)
serializers["vocab"] = lambda p: self.vocab.to_disk(p, exclude=exclude)
util.to_disk(path, serializers, exclude)
def from_disk(
@ -1939,7 +1940,7 @@ class Language:
def deserialize_vocab(path: Path) -> None:
if path.exists():
self.vocab.from_disk(path)
self.vocab.from_disk(path, exclude=exclude)
path = util.ensure_path(path)
deserializers = {}
@ -1977,7 +1978,7 @@ class Language:
DOCS: https://spacy.io/api/language#to_bytes
"""
serializers = {}
serializers["vocab"] = lambda: self.vocab.to_bytes()
serializers["vocab"] = lambda: self.vocab.to_bytes(exclude=exclude)
serializers["tokenizer"] = lambda: self.tokenizer.to_bytes(exclude=["vocab"])
serializers["meta.json"] = lambda: srsly.json_dumps(self.meta)
serializers["config.cfg"] = lambda: self.config.to_bytes()
@ -2013,7 +2014,7 @@ class Language:
b, interpolate=False
)
deserializers["meta.json"] = deserialize_meta
deserializers["vocab"] = self.vocab.from_bytes
deserializers["vocab"] = lambda b: self.vocab.from_bytes(b, exclude=exclude)
deserializers["tokenizer"] = lambda b: self.tokenizer.from_bytes(
b, exclude=["vocab"]
)
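The recurring change in these hunks is that the `exclude` list is now forwarded to the nested vocab instead of stopping at the top-level serializer table. A toy sketch of the difference follows. `ToyVocab`, `ToyPipeline`, the `forward` flag and the part names are all made up for illustration and do not reflect spaCy's actual classes.

```python
from typing import Dict, Iterable, List


class ToyVocab:
    """Stand-in for a vocab with two serializable parts (names are illustrative)."""

    def to_parts(self, exclude: Iterable[str] = ()) -> List[str]:
        return [name for name in ("strings", "vectors") if name not in exclude]


class ToyPipeline:
    def __init__(self) -> None:
        self.vocab = ToyVocab()

    def to_parts(
        self, exclude: Iterable[str] = (), forward: bool = True
    ) -> Dict[str, List[str]]:
        exclude = list(exclude)
        if forward:
            # New behaviour: the nested vocab sees the same exclude list.
            serializers = {"vocab": lambda: self.vocab.to_parts(exclude=exclude)}
        else:
            # Old behaviour: the exclude list only filters top-level keys.
            serializers = {"vocab": lambda: self.vocab.to_parts()}
        return {key: fn() for key, fn in serializers.items() if key not in exclude}


pipeline = ToyPipeline()
old = pipeline.to_parts(exclude=["vectors"], forward=False)
new = pipeline.to_parts(exclude=["vectors"], forward=True)
```

With forwarding turned off, excluding `"vectors"` has no effect because that key only exists inside the vocab; with forwarding on, the nested part is dropped.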

spacy/lexeme.pyi (new file, 61 lines)

@ -0,0 +1,61 @@
from typing import (
Union,
Any,
)
from thinc.types import Floats1d
from .tokens import Doc, Span, Token
from .vocab import Vocab
class Lexeme:
def __init__(self, vocab: Vocab, orth: int) -> None: ...
def __richcmp__(self, other: Lexeme, op: int) -> bool: ...
def __hash__(self) -> int: ...
def set_attrs(self, **attrs: Any) -> None: ...
def set_flag(self, flag_id: int, value: bool) -> None: ...
def check_flag(self, flag_id: int) -> bool: ...
def similarity(self, other: Union[Doc, Span, Token, Lexeme]) -> float: ...
@property
def has_vector(self) -> bool: ...
@property
def vector_norm(self) -> float: ...
vector: Floats1d
rank: int
sentiment: float
@property
def orth_(self) -> str: ...
@property
def text(self) -> str: ...
lower: int
norm: int
shape: int
prefix: int
suffix: int
cluster: int
lang: int
prob: float
lower_: str
norm_: str
shape_: str
prefix_: str
suffix_: str
lang_: str
flags: int
@property
def is_oov(self) -> bool: ...
is_stop: bool
is_alpha: bool
is_ascii: bool
is_digit: bool
is_lower: bool
is_upper: bool
is_title: bool
is_punct: bool
is_space: bool
is_bracket: bool
is_quote: bool
is_left_punct: bool
is_right_punct: bool
is_currency: bool
like_url: bool
like_num: bool
like_email: bool

@ -122,13 +122,17 @@ cdef class DependencyMatcher:
raise ValueError(Errors.E099.format(key=key))
visited_nodes[relation["RIGHT_ID"]] = True
else:
if not(
"RIGHT_ID" in relation
and "RIGHT_ATTRS" in relation
and "REL_OP" in relation
and "LEFT_ID" in relation
):
raise ValueError(Errors.E100.format(key=key))
required_keys = set(
("RIGHT_ID", "RIGHT_ATTRS", "REL_OP", "LEFT_ID")
)
relation_keys = set(relation.keys())
missing = required_keys - relation_keys
if missing:
missing_txt = ", ".join(list(missing))
raise ValueError(Errors.E100.format(
required=required_keys,
missing=missing_txt
))
if (
relation["RIGHT_ID"] in visited_nodes
or relation["LEFT_ID"] not in visited_nodes
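The new validation replaces a chained `in` test with a set difference, so the error can name exactly which keys are absent. A standalone sketch of the same idea follows. The function `check_relation` is hypothetical, and unlike the hunk above it sorts the key names so the message is deterministic.

```python
from typing import Any, Dict

REQUIRED_KEYS = {"RIGHT_ID", "RIGHT_ATTRS", "REL_OP", "LEFT_ID"}


def check_relation(relation: Dict[str, Any]) -> None:
    """Raise ValueError naming every required key the relation lacks."""
    missing = REQUIRED_KEYS - set(relation)
    if missing:
        raise ValueError(
            "Nodes other than the anchor node should all contain "
            f"{', '.join(sorted(REQUIRED_KEYS))}, but these are missing: "
            f"{', '.join(sorted(missing))}"
        )


try:
    check_relation({"RIGHT_ID": "subject", "RIGHT_ATTRS": {"DEP": "nsubj"}})
    error_text = ""
except ValueError as err:
    error_text = str(err)
```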

spacy/matcher/matcher.pyi (new file, 41 lines)

@ -0,0 +1,41 @@
from typing import Any, List, Dict, Tuple, Optional, Callable, Union, Iterator, Iterable
from ..vocab import Vocab
from ..tokens import Doc, Span
class Matcher:
def __init__(self, vocab: Vocab, validate: bool = ...) -> None: ...
def __reduce__(self) -> Any: ...
def __len__(self) -> int: ...
def __contains__(self, key: str) -> bool: ...
def add(
self,
key: str,
patterns: List[List[Dict[str, Any]]],
*,
on_match: Optional[
Callable[[Matcher, Doc, int, List[Tuple[Any, ...]]], Any]
] = ...,
greedy: Optional[str] = ...
) -> None: ...
def remove(self, key: str) -> None: ...
def has_key(self, key: Union[str, int]) -> bool: ...
def get(
self, key: Union[str, int], default: Optional[Any] = ...
) -> Tuple[Optional[Callable[[Any], Any]], List[List[Dict[Any, Any]]]]: ...
def pipe(
self,
docs: Iterable[Tuple[Doc, Any]],
batch_size: int = ...,
return_matches: bool = ...,
as_tuples: bool = ...,
) -> Union[
Iterator[Tuple[Tuple[Doc, Any], Any]], Iterator[Tuple[Doc, Any]], Iterator[Doc]
]: ...
def __call__(
self,
doclike: Union[Doc, Span],
*,
as_spans: bool = ...,
allow_missing: bool = ...,
with_alignments: bool = ...
) -> Union[List[Tuple[int, int, int]], List[Span]]: ...

@ -845,7 +845,7 @@ class _RegexPredicate:
class _SetPredicate:
operators = ("IN", "NOT_IN", "IS_SUBSET", "IS_SUPERSET")
operators = ("IN", "NOT_IN", "IS_SUBSET", "IS_SUPERSET", "INTERSECTS")
def __init__(self, i, attr, value, predicate, is_extension=False, vocab=None):
self.i = i
@ -868,14 +868,16 @@ class _SetPredicate:
else:
value = get_token_attr_for_matcher(token.c, self.attr)
if self.predicate in ("IS_SUBSET", "IS_SUPERSET"):
if self.predicate in ("IS_SUBSET", "IS_SUPERSET", "INTERSECTS"):
if self.attr == MORPH:
# break up MORPH into individual Feat=Val values
value = set(get_string_id(v) for v in MorphAnalysis.from_id(self.vocab, value))
else:
# IS_SUBSET for other attrs will be equivalent to "IN"
# IS_SUPERSET will only match for other attrs with 0 or 1 values
value = set([value])
# treat a single value as a list
if isinstance(value, (str, int)):
value = set([get_string_id(value)])
else:
value = set(get_string_id(v) for v in value)
if self.predicate == "IN":
return value in self.value
elif self.predicate == "NOT_IN":
@ -884,6 +886,8 @@ class _SetPredicate:
return value <= self.value
elif self.predicate == "IS_SUPERSET":
return value >= self.value
elif self.predicate == "INTERSECTS":
return bool(value & self.value)
def __repr__(self):
return repr(("SetPredicate", self.i, self.attr, self.value, self.predicate))
@ -928,6 +932,7 @@ def _get_extra_predicates(spec, extra_predicates, vocab):
"NOT_IN": _SetPredicate,
"IS_SUBSET": _SetPredicate,
"IS_SUPERSET": _SetPredicate,
"INTERSECTS": _SetPredicate,
"==": _ComparisonPredicate,
"!=": _ComparisonPredicate,
">=": _ComparisonPredicate,
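The three set-valued predicates reduce to ordinary Python set operators once the token value has been normalised to a set. The sketch below shows only that core comparison. The names `as_set` and `set_predicate` are hypothetical, the string hashing done by `get_string_id` is left out, and `IN` / `NOT_IN` are omitted because they compare a single value.

```python
from typing import Any, Iterable, Set


def as_set(value: Any) -> Set[Any]:
    """Treat a single str/int as a one-element set, anything else as a collection."""
    if isinstance(value, (str, int)):
        return {value}
    return set(value)


def set_predicate(predicate: str, token_value: Any, pattern_values: Iterable[Any]) -> bool:
    pattern = set(pattern_values)
    value = as_set(token_value)
    if predicate == "IS_SUBSET":
        return value <= pattern
    if predicate == "IS_SUPERSET":
        return value >= pattern
    if predicate == "INTERSECTS":
        # True when at least one element is shared.
        return bool(value & pattern)
    raise ValueError(f"Unsupported predicate: {predicate}")


shares_a = set_predicate("INTERSECTS", ["A", "B"], ["A", "C"])
empty_pattern = set_predicate("INTERSECTS", "A", [])
empty_value_is_subset = set_predicate("IS_SUBSET", [], ["A", "B"])
```

An empty pattern list can never intersect anything, while an empty token value is a subset of every pattern, which is why a token with a default `[]` extension value still counts as an `IS_SUBSET` match.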

@ -276,7 +276,7 @@ class AttributeRuler(Pipe):
DOCS: https://spacy.io/api/attributeruler#to_bytes
"""
serialize = {}
serialize["vocab"] = self.vocab.to_bytes
serialize["vocab"] = lambda: self.vocab.to_bytes(exclude=exclude)
serialize["patterns"] = lambda: srsly.msgpack_dumps(self.patterns)
return util.to_bytes(serialize, exclude)
@ -296,7 +296,7 @@ class AttributeRuler(Pipe):
self.add_patterns(srsly.msgpack_loads(b))
deserialize = {
"vocab": lambda b: self.vocab.from_bytes(b),
"vocab": lambda b: self.vocab.from_bytes(b, exclude=exclude),
"patterns": load_patterns,
}
util.from_bytes(bytes_data, deserialize, exclude)
@ -313,7 +313,7 @@ class AttributeRuler(Pipe):
DOCS: https://spacy.io/api/attributeruler#to_disk
"""
serialize = {
"vocab": lambda p: self.vocab.to_disk(p),
"vocab": lambda p: self.vocab.to_disk(p, exclude=exclude),
"patterns": lambda p: srsly.write_msgpack(p, self.patterns),
}
util.to_disk(path, serialize, exclude)
@ -334,7 +334,7 @@ class AttributeRuler(Pipe):
self.add_patterns(srsly.read_msgpack(p))
deserialize = {
"vocab": lambda p: self.vocab.from_disk(p),
"vocab": lambda p: self.vocab.from_disk(p, exclude=exclude),
"patterns": load_patterns,
}
util.from_disk(path, deserialize, exclude)

@ -412,7 +412,7 @@ class EntityLinker(TrainablePipe):
serialize = {}
if hasattr(self, "cfg") and self.cfg is not None:
serialize["cfg"] = lambda: srsly.json_dumps(self.cfg)
serialize["vocab"] = self.vocab.to_bytes
serialize["vocab"] = lambda: self.vocab.to_bytes(exclude=exclude)
serialize["kb"] = self.kb.to_bytes
serialize["model"] = self.model.to_bytes
return util.to_bytes(serialize, exclude)
@ -436,7 +436,7 @@ class EntityLinker(TrainablePipe):
deserialize = {}
if hasattr(self, "cfg") and self.cfg is not None:
deserialize["cfg"] = lambda b: self.cfg.update(srsly.json_loads(b))
deserialize["vocab"] = lambda b: self.vocab.from_bytes(b)
deserialize["vocab"] = lambda b: self.vocab.from_bytes(b, exclude=exclude)
deserialize["kb"] = lambda b: self.kb.from_bytes(b)
deserialize["model"] = load_model
util.from_bytes(bytes_data, deserialize, exclude)
@ -453,7 +453,7 @@ class EntityLinker(TrainablePipe):
DOCS: https://spacy.io/api/entitylinker#to_disk
"""
serialize = {}
serialize["vocab"] = lambda p: self.vocab.to_disk(p)
serialize["vocab"] = lambda p: self.vocab.to_disk(p, exclude=exclude)
serialize["cfg"] = lambda p: srsly.write_json(p, self.cfg)
serialize["kb"] = lambda p: self.kb.to_disk(p)
serialize["model"] = lambda p: self.model.to_disk(p)
@ -480,6 +480,7 @@ class EntityLinker(TrainablePipe):
deserialize = {}
deserialize["cfg"] = lambda p: self.cfg.update(deserialize_config(p))
deserialize["vocab"] = lambda p: self.vocab.from_disk(p, exclude=exclude)
deserialize["kb"] = lambda p: self.kb.from_disk(p)
deserialize["model"] = load_model
util.from_disk(path, deserialize, exclude)

@ -269,7 +269,7 @@ class Lemmatizer(Pipe):
DOCS: https://spacy.io/api/lemmatizer#to_disk
"""
serialize = {}
serialize["vocab"] = lambda p: self.vocab.to_disk(p)
serialize["vocab"] = lambda p: self.vocab.to_disk(p, exclude=exclude)
serialize["lookups"] = lambda p: self.lookups.to_disk(p)
util.to_disk(path, serialize, exclude)
@ -285,7 +285,7 @@ class Lemmatizer(Pipe):
DOCS: https://spacy.io/api/lemmatizer#from_disk
"""
deserialize = {}
deserialize["vocab"] = lambda p: self.vocab.from_disk(p)
deserialize["vocab"] = lambda p: self.vocab.from_disk(p, exclude=exclude)
deserialize["lookups"] = lambda p: self.lookups.from_disk(p)
util.from_disk(path, deserialize, exclude)
self._validate_tables()
@ -300,7 +300,7 @@ class Lemmatizer(Pipe):
DOCS: https://spacy.io/api/lemmatizer#to_bytes
"""
serialize = {}
serialize["vocab"] = self.vocab.to_bytes
serialize["vocab"] = lambda: self.vocab.to_bytes(exclude=exclude)
serialize["lookups"] = self.lookups.to_bytes
return util.to_bytes(serialize, exclude)
@ -316,7 +316,7 @@ class Lemmatizer(Pipe):
DOCS: https://spacy.io/api/lemmatizer#from_bytes
"""
deserialize = {}
deserialize["vocab"] = lambda b: self.vocab.from_bytes(b)
deserialize["vocab"] = lambda b: self.vocab.from_bytes(b, exclude=exclude)
deserialize["lookups"] = lambda b: self.lookups.from_bytes(b)
util.from_bytes(bytes_data, deserialize, exclude)
self._validate_tables()

@ -273,7 +273,7 @@ cdef class TrainablePipe(Pipe):
serialize = {}
if hasattr(self, "cfg") and self.cfg is not None:
serialize["cfg"] = lambda: srsly.json_dumps(self.cfg)
serialize["vocab"] = self.vocab.to_bytes
serialize["vocab"] = lambda: self.vocab.to_bytes(exclude=exclude)
serialize["model"] = self.model.to_bytes
return util.to_bytes(serialize, exclude)
@ -296,7 +296,7 @@ cdef class TrainablePipe(Pipe):
deserialize = {}
if hasattr(self, "cfg") and self.cfg is not None:
deserialize["cfg"] = lambda b: self.cfg.update(srsly.json_loads(b))
deserialize["vocab"] = lambda b: self.vocab.from_bytes(b)
deserialize["vocab"] = lambda b: self.vocab.from_bytes(b, exclude=exclude)
deserialize["model"] = load_model
util.from_bytes(bytes_data, deserialize, exclude)
return self
@ -313,7 +313,7 @@ cdef class TrainablePipe(Pipe):
serialize = {}
if hasattr(self, "cfg") and self.cfg is not None:
serialize["cfg"] = lambda p: srsly.write_json(p, self.cfg)
serialize["vocab"] = lambda p: self.vocab.to_disk(p)
serialize["vocab"] = lambda p: self.vocab.to_disk(p, exclude=exclude)
serialize["model"] = lambda p: self.model.to_disk(p)
util.to_disk(path, serialize, exclude)
@ -338,7 +338,7 @@ cdef class TrainablePipe(Pipe):
deserialize = {}
if hasattr(self, "cfg") and self.cfg is not None:
deserialize["cfg"] = lambda p: self.cfg.update(deserialize_config(p))
deserialize["vocab"] = lambda p: self.vocab.from_disk(p)
deserialize["vocab"] = lambda p: self.vocab.from_disk(p, exclude=exclude)
deserialize["model"] = load_model
util.from_disk(path, deserialize, exclude)
return self

@ -569,7 +569,7 @@ cdef class Parser(TrainablePipe):
def to_disk(self, path, exclude=tuple()):
serializers = {
"model": lambda p: (self.model.to_disk(p) if self.model is not True else True),
"vocab": lambda p: self.vocab.to_disk(p),
"vocab": lambda p: self.vocab.to_disk(p, exclude=exclude),
"moves": lambda p: self.moves.to_disk(p, exclude=["strings"]),
"cfg": lambda p: srsly.write_json(p, self.cfg)
}
@ -577,7 +577,7 @@ cdef class Parser(TrainablePipe):
def from_disk(self, path, exclude=tuple()):
deserializers = {
"vocab": lambda p: self.vocab.from_disk(p),
"vocab": lambda p: self.vocab.from_disk(p, exclude=exclude),
"moves": lambda p: self.moves.from_disk(p, exclude=["strings"]),
"cfg": lambda p: self.cfg.update(srsly.read_json(p)),
"model": lambda p: None,
@ -597,7 +597,7 @@ cdef class Parser(TrainablePipe):
def to_bytes(self, exclude=tuple()):
serializers = {
"model": lambda: (self.model.to_bytes()),
"vocab": lambda: self.vocab.to_bytes(),
"vocab": lambda: self.vocab.to_bytes(exclude=exclude),
"moves": lambda: self.moves.to_bytes(exclude=["strings"]),
"cfg": lambda: srsly.json_dumps(self.cfg, indent=2, sort_keys=True)
}
@ -605,7 +605,7 @@ cdef class Parser(TrainablePipe):
def from_bytes(self, bytes_data, exclude=tuple()):
deserializers = {
"vocab": lambda b: self.vocab.from_bytes(b),
"vocab": lambda b: self.vocab.from_bytes(b, exclude=exclude),
"moves": lambda b: self.moves.from_bytes(b, exclude=["strings"]),
"cfg": lambda b: self.cfg.update(srsly.json_loads(b)),
"model": lambda b: None,

@ -159,6 +159,7 @@ class TokenPatternString(BaseModel):
NOT_IN: Optional[List[StrictStr]] = Field(None, alias="not_in")
IS_SUBSET: Optional[List[StrictStr]] = Field(None, alias="is_subset")
IS_SUPERSET: Optional[List[StrictStr]] = Field(None, alias="is_superset")
INTERSECTS: Optional[List[StrictStr]] = Field(None, alias="intersects")
class Config:
extra = "forbid"
@ -175,8 +176,9 @@ class TokenPatternNumber(BaseModel):
REGEX: Optional[StrictStr] = Field(None, alias="regex")
IN: Optional[List[StrictInt]] = Field(None, alias="in")
NOT_IN: Optional[List[StrictInt]] = Field(None, alias="not_in")
ISSUBSET: Optional[List[StrictInt]] = Field(None, alias="issubset")
ISSUPERSET: Optional[List[StrictInt]] = Field(None, alias="issuperset")
IS_SUBSET: Optional[List[StrictInt]] = Field(None, alias="is_subset")
IS_SUPERSET: Optional[List[StrictInt]] = Field(None, alias="is_superset")
INTERSECTS: Optional[List[StrictInt]] = Field(None, alias="intersects")
EQ: Union[StrictInt, StrictFloat] = Field(None, alias="==")
NEQ: Union[StrictInt, StrictFloat] = Field(None, alias="!=")
GEQ: Union[StrictInt, StrictFloat] = Field(None, alias=">=")

spacy/strings.pyi (new file, 22 lines)

@ -0,0 +1,22 @@
from typing import Optional, Iterable, Iterator, Union, Any
from pathlib import Path
def get_string_id(key: str) -> int: ...
class StringStore:
def __init__(
self, strings: Optional[Iterable[str]] = ..., freeze: bool = ...
) -> None: ...
def __getitem__(self, string_or_id: Union[bytes, str, int]) -> Union[str, int]: ...
def as_int(self, key: Union[bytes, str, int]) -> int: ...
def as_string(self, key: Union[bytes, str, int]) -> str: ...
def add(self, string: str) -> int: ...
def __len__(self) -> int: ...
def __contains__(self, string: str) -> bool: ...
def __iter__(self) -> Iterator[str]: ...
def __reduce__(self) -> Any: ...
def to_disk(self, path: Union[str, Path]) -> None: ...
def from_disk(self, path: Union[str, Path]) -> StringStore: ...
def to_bytes(self, **kwargs: Any) -> bytes: ...
def from_bytes(self, bytes_data: bytes, **kwargs: Any) -> StringStore: ...
def _reset_and_load(self, strings: Iterable[str]) -> None: ...

@ -357,6 +357,9 @@ def test_span_eq_hash(doc, doc_not_parsed):
assert hash(doc[0:2]) != hash(doc[1:3])
assert hash(doc[0:2]) != hash(doc_not_parsed[0:2])
# check that an out-of-bounds is not equivalent to the span of the full doc
assert doc[0 : len(doc)] != doc[len(doc) : len(doc) + 1]
def test_span_boundaries(doc):
start = 1
@ -369,6 +372,33 @@ def test_span_boundaries(doc):
with pytest.raises(IndexError):
span[5]
empty_span_0 = doc[0:0]
assert empty_span_0.text == ""
assert empty_span_0.start == 0
assert empty_span_0.end == 0
assert empty_span_0.start_char == 0
assert empty_span_0.end_char == 0
empty_span_1 = doc[1:1]
assert empty_span_1.text == ""
assert empty_span_1.start == 1
assert empty_span_1.end == 1
assert empty_span_1.start_char == empty_span_1.end_char
oob_span_start = doc[-len(doc) - 1 : -len(doc) - 10]
assert oob_span_start.text == ""
assert oob_span_start.start == 0
assert oob_span_start.end == 0
assert oob_span_start.start_char == 0
assert oob_span_start.end_char == 0
oob_span_end = doc[len(doc) + 1 : len(doc) + 10]
assert oob_span_end.text == ""
assert oob_span_end.start == len(doc)
assert oob_span_end.end == len(doc)
assert oob_span_end.start_char == len(doc.text)
assert oob_span_end.end_char == len(doc.text)
def test_span_lemma(doc):
# span lemmas should have the same number of spaces as the span
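The `start` and `end` values asserted for the empty and out-of-bounds spans are the same ones Python's built-in `slice.indices` produces when clamping bounds to a sequence length. The sketch below illustrates that clamping in plain Python. It is only an analogy: whether spaCy's span slicing is implemented this way is not shown in this diff, and `clamp` is a hypothetical helper.

```python
from typing import Tuple


def clamp(start: int, end: int, length: int) -> Tuple[int, int]:
    """Clamp slice bounds into [0, length] using Python's own slice rules."""
    lo, hi, _step = slice(start, end).indices(length)
    return lo, hi


n = 5  # pretend the doc has five tokens
empty_at_zero = clamp(0, 0, n)
empty_at_one = clamp(1, 1, n)
before_start = clamp(-n - 1, -n - 10, n)
past_end = clamp(n + 1, n + 10, n)
```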

@ -270,6 +270,16 @@ def test_matcher_subset_value_operator(en_vocab):
doc[0].tag_ = "A"
assert len(matcher(doc)) == 0
# IS_SUBSET with a list value
Token.set_extension("ext", default=[])
matcher = Matcher(en_vocab)
pattern = [{"_": {"ext": {"IS_SUBSET": ["A", "B"]}}}]
matcher.add("M", [pattern])
doc = Doc(en_vocab, words=["a", "b", "c"])
doc[0]._.ext = ["A"]
doc[1]._.ext = ["C", "D"]
assert len(matcher(doc)) == 2
def test_matcher_superset_value_operator(en_vocab):
matcher = Matcher(en_vocab)
@ -308,6 +318,72 @@ def test_matcher_superset_value_operator(en_vocab):
doc[0].tag_ = "A"
assert len(matcher(doc)) == 3
# IS_SUPERSET with a list value
Token.set_extension("ext", default=[])
matcher = Matcher(en_vocab)
pattern = [{"_": {"ext": {"IS_SUPERSET": ["A"]}}}]
matcher.add("M", [pattern])
doc = Doc(en_vocab, words=["a", "b", "c"])
doc[0]._.ext = ["A", "B"]
assert len(matcher(doc)) == 1
def test_matcher_intersect_value_operator(en_vocab):
matcher = Matcher(en_vocab)
pattern = [{"MORPH": {"INTERSECTS": ["Feat=Val", "Feat2=Val2", "Feat3=Val3"]}}]
matcher.add("M", [pattern])
doc = Doc(en_vocab, words=["a", "b", "c"])
assert len(matcher(doc)) == 0
doc[0].set_morph("Feat=Val")
assert len(matcher(doc)) == 1
doc[0].set_morph("Feat=Val|Feat2=Val2")
assert len(matcher(doc)) == 1
doc[0].set_morph("Feat=Val|Feat2=Val2|Feat3=Val3")
assert len(matcher(doc)) == 1
doc[0].set_morph("Feat=Val|Feat2=Val2|Feat3=Val3|Feat4=Val4")
assert len(matcher(doc)) == 1
# INTERSECTS with a single value is the same as IN
matcher = Matcher(en_vocab)
pattern = [{"TAG": {"INTERSECTS": ["A", "B"]}}]
matcher.add("M", [pattern])
doc = Doc(en_vocab, words=["a", "b", "c"])
doc[0].tag_ = "A"
assert len(matcher(doc)) == 1
# INTERSECTS with an empty pattern list matches nothing
matcher = Matcher(en_vocab)
pattern = [{"TAG": {"INTERSECTS": []}}]
matcher.add("M", [pattern])
doc = Doc(en_vocab, words=["a", "b", "c"])
doc[0].tag_ = "A"
assert len(matcher(doc)) == 0
# INTERSECTS with a list value
Token.set_extension("ext", default=[])
matcher = Matcher(en_vocab)
pattern = [{"_": {"ext": {"INTERSECTS": ["A", "C"]}}}]
matcher.add("M", [pattern])
doc = Doc(en_vocab, words=["a", "b", "c"])
doc[0]._.ext = ["A", "B"]
assert len(matcher(doc)) == 1
# INTERSECTS with an empty pattern list matches nothing
matcher = Matcher(en_vocab)
pattern = [{"_": {"ext": {"INTERSECTS": []}}}]
matcher.add("M", [pattern])
doc = Doc(en_vocab, words=["a", "b", "c"])
doc[0]._.ext = ["A", "B"]
assert len(matcher(doc)) == 0
# INTERSECTS with an empty value matches nothing
matcher = Matcher(en_vocab)
pattern = [{"_": {"ext": {"INTERSECTS": ["A", "B"]}}}]
matcher.add("M", [pattern])
doc = Doc(en_vocab, words=["a", "b", "c"])
doc[0]._.ext = []
assert len(matcher(doc)) == 0
def test_matcher_morph_handling(en_vocab):
# order of features in pattern doesn't matter

View File

@ -1,9 +1,11 @@
import pytest
from numpy.testing import assert_equal
from numpy.testing import assert_equal, assert_array_equal
from thinc.api import get_current_ops
from spacy.language import Language
from spacy.training import Example
from spacy.util import fix_random_seed, registry
OPS = get_current_ops()
SPAN_KEY = "labeled_spans"
@ -116,12 +118,15 @@ def test_ngram_suggester(en_tokenizer):
for span in spans:
assert 0 <= span[0] < len(doc)
assert 0 < span[1] <= len(doc)
spans_set.add((span[0], span[1]))
spans_set.add((int(span[0]), int(span[1])))
# spans are unique
assert spans.shape[0] == len(spans_set)
offset += ngrams.lengths[i]
# the number of spans is correct
assert_equal(ngrams.lengths, [max(0, len(doc) - (size - 1)) for doc in docs])
assert_array_equal(
OPS.to_numpy(ngrams.lengths),
[max(0, len(doc) - (size - 1)) for doc in docs],
)
# test 1-3-gram suggestions
ngram_suggester = registry.misc.get("spacy.ngram_suggester.v1")(sizes=[1, 2, 3])
@ -129,9 +134,9 @@ def test_ngram_suggester(en_tokenizer):
en_tokenizer(text) for text in ["a", "a b", "a b c", "a b c d", "a b c d e"]
]
ngrams = ngram_suggester(docs)
assert_equal(ngrams.lengths, [1, 3, 6, 9, 12])
assert_equal(
ngrams.data,
assert_array_equal(OPS.to_numpy(ngrams.lengths), [1, 3, 6, 9, 12])
assert_array_equal(
OPS.to_numpy(ngrams.data),
[
# doc 0
[0, 1],
@ -176,13 +181,13 @@ def test_ngram_suggester(en_tokenizer):
ngram_suggester = registry.misc.get("spacy.ngram_suggester.v1")(sizes=[1])
docs = [en_tokenizer(text) for text in ["", "a", ""]]
ngrams = ngram_suggester(docs)
assert_equal(ngrams.lengths, [len(doc) for doc in docs])
assert_array_equal(OPS.to_numpy(ngrams.lengths), [len(doc) for doc in docs])
# test all empty docs
ngram_suggester = registry.misc.get("spacy.ngram_suggester.v1")(sizes=[1])
docs = [en_tokenizer(text) for text in ["", "", ""]]
ngrams = ngram_suggester(docs)
assert_equal(ngrams.lengths, [len(doc) for doc in docs])
assert_array_equal(OPS.to_numpy(ngrams.lengths), [len(doc) for doc in docs])
def test_ngram_sizes(en_tokenizer):
@ -195,12 +200,12 @@ def test_ngram_sizes(en_tokenizer):
]
ngrams_1 = size_suggester(docs)
ngrams_2 = range_suggester(docs)
assert_equal(ngrams_1.lengths, [1, 3, 6, 9, 12])
assert_equal(ngrams_1.lengths, ngrams_2.lengths)
assert_equal(ngrams_1.data, ngrams_2.data)
assert_array_equal(OPS.to_numpy(ngrams_1.lengths), [1, 3, 6, 9, 12])
assert_array_equal(OPS.to_numpy(ngrams_1.lengths), OPS.to_numpy(ngrams_2.lengths))
assert_array_equal(OPS.to_numpy(ngrams_1.data), OPS.to_numpy(ngrams_2.data))
# one more variation
suggester_factory = registry.misc.get("spacy.ngram_range_suggester.v1")
range_suggester = suggester_factory(min_size=2, max_size=4)
ngrams_3 = range_suggester(docs)
assert_equal(ngrams_3.lengths, [0, 1, 3, 6, 9])
assert_array_equal(OPS.to_numpy(ngrams_3.lengths), [0, 1, 3, 6, 9])
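The expected lengths asserted in these tests follow from a simple count: a document with `n` tokens contains `max(0, n - (size - 1))` n-grams of a given `size`. The sketch below reproduces that count in plain Python without calling spaCy. `ngram_spans` is a hypothetical helper introduced only for illustration, and the order in which it returns spans is an assumption, not the suggester's documented order.

```python
from typing import List, Sequence, Tuple


def ngram_spans(n_tokens: int, sizes: Sequence[int]) -> List[Tuple[int, int]]:
    """Enumerate (start, end) token offsets for every n-gram of the given sizes.

    Hypothetical helper: it models the counting rule used in the tests,
    not the actual implementation of the spaCy suggester.
    """
    spans = []
    for size in sizes:
        # A doc with n tokens has max(0, n - (size - 1)) n-grams of this size.
        for start in range(max(0, n_tokens - (size - 1))):
            spans.append((start, start + size))
    return spans


# Docs with 1 to 5 tokens, mirroring "a", "a b", ..., "a b c d e".
lengths_1_to_3 = [len(ngram_spans(n, [1, 2, 3])) for n in range(1, 6)]
lengths_2_to_4 = [len(ngram_spans(n, [2, 3, 4])) for n in range(1, 6)]
print(lengths_1_to_3)  # [1, 3, 6, 9, 12]
print(lengths_2_to_4)  # [0, 1, 3, 6, 9]
```

Empty docs fall out of the same rule: with zero tokens the `max(0, ...)` term is zero for every size, so no spans are suggested.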

View File

@ -1,5 +1,5 @@
import pytest
from spacy import registry, Vocab
from spacy import registry, Vocab, load
from spacy.pipeline import Tagger, DependencyParser, EntityRecognizer
from spacy.pipeline import TextCategorizer, SentenceRecognizer, TrainablePipe
from spacy.pipeline.dep_parser import DEFAULT_PARSER_MODEL
@ -268,3 +268,21 @@ def test_serialize_custom_trainable_pipe():
pipe.to_disk(d)
new_pipe = CustomPipe(Vocab(), Linear()).from_disk(d)
assert new_pipe.to_bytes() == pipe_bytes
def test_load_without_strings():
nlp = spacy.blank("en")
orig_strings_length = len(nlp.vocab.strings)
word = "unlikely_word_" * 20
nlp.vocab.strings.add(word)
assert len(nlp.vocab.strings) == orig_strings_length + 1
with make_tempdir() as d:
nlp.to_disk(d)
# reload with strings
reloaded_nlp = load(d)
assert len(nlp.vocab.strings) == len(reloaded_nlp.vocab.strings)
assert word in reloaded_nlp.vocab.strings
# reload without strings
reloaded_nlp = load(d, exclude=["strings"])
assert orig_strings_length == len(reloaded_nlp.vocab.strings)
assert word not in reloaded_nlp.vocab.strings

View File

@ -765,7 +765,7 @@ cdef class Tokenizer:
DOCS: https://spacy.io/api/tokenizer#to_bytes
"""
serializers = {
"vocab": lambda: self.vocab.to_bytes(),
"vocab": lambda: self.vocab.to_bytes(exclude=exclude),
"prefix_search": lambda: _get_regex_pattern(self.prefix_search),
"suffix_search": lambda: _get_regex_pattern(self.suffix_search),
"infix_finditer": lambda: _get_regex_pattern(self.infix_finditer),
@ -786,7 +786,7 @@ cdef class Tokenizer:
"""
data = {}
deserializers = {
"vocab": lambda b: self.vocab.from_bytes(b),
"vocab": lambda b: self.vocab.from_bytes(b, exclude=exclude),
"prefix_search": lambda b: data.setdefault("prefix_search", b),
"suffix_search": lambda b: data.setdefault("suffix_search", b),
"infix_finditer": lambda b: data.setdefault("infix_finditer", b),

View File

@ -0,0 +1,17 @@
from typing import Dict, Any, Union, List, Tuple
from .doc import Doc
from .span import Span
from .token import Token
class Retokenizer:
def __init__(self, doc: Doc) -> None: ...
def merge(self, span: Span, attrs: Dict[Union[str, int], Any] = ...) -> None: ...
def split(
self,
token: Token,
orths: List[str],
heads: List[Union[Token, Tuple[Token, int]]],
attrs: Dict[Union[str, int], List[Any]] = ...,
) -> None: ...
def __enter__(self) -> Retokenizer: ...
def __exit__(self, *args: Any) -> None: ...

180
spacy/tokens/doc.pyi Normal file
View File

@ -0,0 +1,180 @@
from typing import (
Callable,
Protocol,
Iterable,
Iterator,
Optional,
Union,
Tuple,
List,
Dict,
Any,
overload,
)
from cymem.cymem import Pool
from thinc.types import Floats1d, Floats2d, Ints2d
from .span import Span
from .token import Token
from ._dict_proxies import SpanGroups
from ._retokenize import Retokenizer
from ..lexeme import Lexeme
from ..vocab import Vocab
from .underscore import Underscore
from pathlib import Path
import numpy
class DocMethod(Protocol):
def __call__(self: Doc, *args: Any, **kwargs: Any) -> Any: ...
class Doc:
vocab: Vocab
mem: Pool
spans: SpanGroups
max_length: int
length: int
sentiment: float
cats: Dict[str, float]
user_hooks: Dict[str, Callable[..., Any]]
user_token_hooks: Dict[str, Callable[..., Any]]
user_span_hooks: Dict[str, Callable[..., Any]]
tensor: numpy.ndarray
user_data: Dict[str, Any]
has_unknown_spaces: bool
@classmethod
def set_extension(
cls,
name: str,
default: Optional[Any] = ...,
getter: Optional[Callable[[Doc], Any]] = ...,
setter: Optional[Callable[[Doc, Any], None]] = ...,
method: Optional[DocMethod] = ...,
force: bool = ...,
) -> None: ...
@classmethod
def get_extension(
cls, name: str
) -> Tuple[
Optional[Any],
Optional[DocMethod],
Optional[Callable[[Doc], Any]],
Optional[Callable[[Doc, Any], None]],
]: ...
@classmethod
def has_extension(cls, name: str) -> bool: ...
@classmethod
def remove_extension(
cls, name: str
) -> Tuple[
Optional[Any],
Optional[DocMethod],
Optional[Callable[[Doc], Any]],
Optional[Callable[[Doc, Any], None]],
]: ...
def __init__(
self,
vocab: Vocab,
words: Optional[List[str]] = ...,
spaces: Optional[List[bool]] = ...,
user_data: Optional[Dict[Any, Any]] = ...,
tags: Optional[List[str]] = ...,
pos: Optional[List[str]] = ...,
morphs: Optional[List[str]] = ...,
lemmas: Optional[List[str]] = ...,
heads: Optional[List[int]] = ...,
deps: Optional[List[str]] = ...,
sent_starts: Optional[List[Union[bool, None]]] = ...,
ents: Optional[List[str]] = ...,
) -> None: ...
@property
def _(self) -> Underscore: ...
@property
def is_tagged(self) -> bool: ...
@property
def is_parsed(self) -> bool: ...
@property
def is_nered(self) -> bool: ...
@property
def is_sentenced(self) -> bool: ...
def has_annotation(
self, attr: Union[int, str], *, require_complete: bool = ...
) -> bool: ...
@overload
def __getitem__(self, i: int) -> Token: ...
@overload
def __getitem__(self, i: slice) -> Span: ...
def __iter__(self) -> Iterator[Token]: ...
def __len__(self) -> int: ...
def __unicode__(self) -> str: ...
def __bytes__(self) -> bytes: ...
def __str__(self) -> str: ...
def __repr__(self) -> str: ...
@property
def doc(self) -> Doc: ...
def char_span(
self,
start_idx: int,
end_idx: int,
label: Union[int, str] = ...,
kb_id: Union[int, str] = ...,
vector: Optional[Floats1d] = ...,
alignment_mode: str = ...,
) -> Span: ...
def similarity(self, other: Union[Doc, Span, Token, Lexeme]) -> float: ...
@property
def has_vector(self) -> bool: ...
vector: Floats1d
vector_norm: float
@property
def text(self) -> str: ...
@property
def text_with_ws(self) -> str: ...
ents: Tuple[Span]
def set_ents(
self,
entities: List[Span],
*,
blocked: Optional[List[Span]] = ...,
missing: Optional[List[Span]] = ...,
outside: Optional[List[Span]] = ...,
default: str = ...
) -> None: ...
@property
def noun_chunks(self) -> Iterator[Span]: ...
@property
def sents(self) -> Iterator[Span]: ...
@property
def lang(self) -> int: ...
@property
def lang_(self) -> str: ...
def count_by(
self, attr_id: int, exclude: Optional[Any] = ..., counts: Optional[Any] = ...
) -> Dict[Any, int]: ...
def from_array(self, attrs: List[int], array: Ints2d) -> Doc: ...
@staticmethod
def from_docs(
docs: List[Doc],
ensure_whitespace: bool = ...,
attrs: Optional[Union[Tuple[Union[str, int]], List[Union[int, str]]]] = ...,
) -> Doc: ...
def get_lca_matrix(self) -> Ints2d: ...
def copy(self) -> Doc: ...
def to_disk(
self, path: Union[str, Path], *, exclude: Iterable[str] = ...
) -> None: ...
def from_disk(
self, path: Union[str, Path], *, exclude: Union[List[str], Tuple[str]] = ...
) -> Doc: ...
def to_bytes(self, *, exclude: Union[List[str], Tuple[str]] = ...) -> bytes: ...
def from_bytes(
self, bytes_data: bytes, *, exclude: Union[List[str], Tuple[str]] = ...
) -> Doc: ...
def to_dict(self, *, exclude: Union[List[str], Tuple[str]] = ...) -> Dict[str, Any]: ...
def from_dict(
self, msg: Dict[str, Any], *, exclude: Union[List[str], Tuple[str]] = ...
) -> Doc: ...
def extend_tensor(self, tensor: Floats2d) -> None: ...
def retokenize(self) -> Retokenizer: ...
def to_json(self, underscore: Optional[List[str]] = ...) -> Dict[str, Any]: ...
def to_utf8_array(self, nr_char: int = ...) -> Ints2d: ...
@staticmethod
def _get_array_attrs() -> Tuple[Any]: ...

View File

@ -0,0 +1,20 @@
from typing import Any, Dict, Iterator, List, Union
from ..vocab import Vocab
class MorphAnalysis:
def __init__(
self, vocab: Vocab, features: Union[Dict[str, str], str] = ...
) -> None: ...
@classmethod
def from_id(cls, vocab: Vocab, key: Any) -> MorphAnalysis: ...
def __contains__(self, feature: str) -> bool: ...
def __iter__(self) -> Iterator[str]: ...
def __len__(self) -> int: ...
def __hash__(self) -> int: ...
def __eq__(self, other: MorphAnalysis) -> bool: ...
def __ne__(self, other: MorphAnalysis) -> bool: ...
def get(self, field: Any) -> List[str]: ...
def to_json(self) -> str: ...
def to_dict(self) -> Dict[str, str]: ...
def __str__(self) -> str: ...
def __repr__(self) -> str: ...

124
spacy/tokens/span.pyi Normal file
View File

@ -0,0 +1,124 @@
from typing import Callable, Protocol, Iterator, Optional, Union, Tuple, Any, overload
from thinc.types import Floats1d, Ints2d, FloatsXd
from .doc import Doc
from .token import Token
from .underscore import Underscore
from ..lexeme import Lexeme
from ..vocab import Vocab
class SpanMethod(Protocol):
def __call__(self: Span, *args: Any, **kwargs: Any) -> Any: ...
class Span:
@classmethod
def set_extension(
cls,
name: str,
default: Optional[Any] = ...,
getter: Optional[Callable[[Span], Any]] = ...,
setter: Optional[Callable[[Span, Any], None]] = ...,
method: Optional[SpanMethod] = ...,
force: bool = ...,
) -> None: ...
@classmethod
def get_extension(
cls, name: str
) -> Tuple[
Optional[Any],
Optional[SpanMethod],
Optional[Callable[[Span], Any]],
Optional[Callable[[Span, Any], None]],
]: ...
@classmethod
def has_extension(cls, name: str) -> bool: ...
@classmethod
def remove_extension(
cls, name: str
) -> Tuple[
Optional[Any],
Optional[SpanMethod],
Optional[Callable[[Span], Any]],
Optional[Callable[[Span, Any], None]],
]: ...
def __init__(
self,
doc: Doc,
start: int,
end: int,
label: int = ...,
vector: Optional[Floats1d] = ...,
vector_norm: Optional[float] = ...,
kb_id: Optional[int] = ...,
) -> None: ...
def __richcmp__(self, other: Span, op: int) -> bool: ...
def __hash__(self) -> int: ...
def __len__(self) -> int: ...
def __repr__(self) -> str: ...
@overload
def __getitem__(self, i: int) -> Token: ...
@overload
def __getitem__(self, i: slice) -> Span: ...
def __iter__(self) -> Iterator[Token]: ...
@property
def _(self) -> Underscore: ...
def as_doc(self, *, copy_user_data: bool = ...) -> Doc: ...
def get_lca_matrix(self) -> Ints2d: ...
def similarity(self, other: Union[Doc, Span, Token, Lexeme]) -> float: ...
@property
def vocab(self) -> Vocab: ...
@property
def sent(self) -> Span: ...
@property
def ents(self) -> Tuple[Span]: ...
@property
def has_vector(self) -> bool: ...
@property
def vector(self) -> Floats1d: ...
@property
def vector_norm(self) -> float: ...
@property
def tensor(self) -> FloatsXd: ...
@property
def sentiment(self) -> float: ...
@property
def text(self) -> str: ...
@property
def text_with_ws(self) -> str: ...
@property
def noun_chunks(self) -> Iterator[Span]: ...
@property
def root(self) -> Token: ...
def char_span(
self,
start_idx: int,
end_idx: int,
label: int = ...,
kb_id: int = ...,
vector: Optional[Floats1d] = ...,
) -> Span: ...
@property
def conjuncts(self) -> Tuple[Token]: ...
@property
def lefts(self) -> Iterator[Token]: ...
@property
def rights(self) -> Iterator[Token]: ...
@property
def n_lefts(self) -> int: ...
@property
def n_rights(self) -> int: ...
@property
def subtree(self) -> Iterator[Token]: ...
start: int
end: int
start_char: int
end_char: int
label: int
kb_id: int
ent_id: int
ent_id_: str
@property
def orth_(self) -> str: ...
@property
def lemma_(self) -> str: ...
label_: str
kb_id_: str

View File

@ -105,13 +105,18 @@ cdef class Span:
if label not in doc.vocab.strings:
raise ValueError(Errors.E084.format(label=label))
start_char = doc[start].idx if start < doc.length else len(doc.text)
if start == end:
end_char = start_char
else:
end_char = doc[end - 1].idx + len(doc[end - 1])
self.c = SpanC(
label=label,
kb_id=kb_id,
start=start,
end=end,
start_char=doc[start].idx if start < doc.length else 0,
end_char=doc[end - 1].idx + len(doc[end - 1]) if end >= 1 else 0,
start_char=start_char,
end_char=end_char,
)
self._vector = vector
self._vector_norm = vector_norm
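The character-offset logic in this hunk can be modelled in plain Python to see why empty and out-of-bounds spans now get consistent offsets. This is a minimal sketch, not the Cython implementation: `span_char_offsets` is a hypothetical function, tokens are represented by two parallel lists (character index and length), and `start`/`end` are assumed to have already been clamped to the document bounds, as `Doc` slicing does.

```python
from typing import List, Tuple


def span_char_offsets(
    token_idx: List[int],
    token_len: List[int],
    text_len: int,
    start: int,
    end: int,
) -> Tuple[int, int]:
    """Return (start_char, end_char) for the token span [start, end).

    Hypothetical model of the updated logic; assumes 0 <= start <= end <= n.
    """
    n_tokens = len(token_idx)
    # A span starting past the last token is anchored at the end of the text.
    start_char = token_idx[start] if start < n_tokens else text_len
    if start == end:
        # An empty span has zero width in characters as well.
        end_char = start_char
    else:
        end_char = token_idx[end - 1] + token_len[end - 1]
    return start_char, end_char


# Text "a bc d" -> tokens "a" (idx 0), "bc" (idx 2), "d" (idx 5).
token_idx = [0, 2, 5]
token_len = [1, 2, 1]
text_len = 6

print(span_char_offsets(token_idx, token_len, text_len, 0, 0))  # (0, 0)
print(span_char_offsets(token_idx, token_len, text_len, 1, 1))  # (2, 2)
print(span_char_offsets(token_idx, token_len, text_len, 3, 3))  # (6, 6)
print(span_char_offsets(token_idx, token_len, text_len, 0, 2))  # (0, 4)
```

The previous expressions fell back to `0` for an out-of-range start and computed `end_char` from `doc[end - 1]` even when `start == end`, so an empty span at position 1 would have ended before it started.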

View File

@ -0,0 +1,24 @@
from typing import Any, Dict, Iterable
from .doc import Doc
from .span import Span
class SpanGroup:
def __init__(
self,
doc: Doc,
*,
name: str = ...,
attrs: Dict[str, Any] = ...,
spans: Iterable[Span] = ...
) -> None: ...
def __repr__(self) -> str: ...
@property
def doc(self) -> Doc: ...
@property
def has_overlap(self) -> bool: ...
def __len__(self) -> int: ...
def append(self, span: Span) -> None: ...
def extend(self, spans: Iterable[Span]) -> None: ...
def __getitem__(self, i: int) -> Span: ...
def to_bytes(self) -> bytes: ...
def from_bytes(self, bytes_data: bytes) -> SpanGroup: ...

208
spacy/tokens/token.pyi Normal file
View File

@ -0,0 +1,208 @@
from typing import (
Callable,
Protocol,
Iterator,
Optional,
Union,
Tuple,
Any,
)
from thinc.types import Floats1d, FloatsXd
from .doc import Doc
from .span import Span
from .morphanalysis import MorphAnalysis
from ..lexeme import Lexeme
from ..vocab import Vocab
from .underscore import Underscore
class TokenMethod(Protocol):
def __call__(self: Token, *args: Any, **kwargs: Any) -> Any: ...
class Token:
i: int
doc: Doc
vocab: Vocab
@classmethod
def set_extension(
cls,
name: str,
default: Optional[Any] = ...,
getter: Optional[Callable[[Token], Any]] = ...,
setter: Optional[Callable[[Token, Any], None]] = ...,
method: Optional[TokenMethod] = ...,
force: bool = ...,
) -> None: ...
@classmethod
def get_extension(
cls, name: str
) -> Tuple[
Optional[Any],
Optional[TokenMethod],
Optional[Callable[[Token], Any]],
Optional[Callable[[Token, Any], None]],
]: ...
@classmethod
def has_extension(cls, name: str) -> bool: ...
@classmethod
def remove_extension(
cls, name: str
) -> Tuple[
Optional[Any],
Optional[TokenMethod],
Optional[Callable[[Token], Any]],
Optional[Callable[[Token, Any], None]],
]: ...
def __init__(self, vocab: Vocab, doc: Doc, offset: int) -> None: ...
def __hash__(self) -> int: ...
def __len__(self) -> int: ...
def __unicode__(self) -> str: ...
def __bytes__(self) -> bytes: ...
def __str__(self) -> str: ...
def __repr__(self) -> str: ...
def __richcmp__(self, other: Token, op: int) -> bool: ...
@property
def _(self) -> Underscore: ...
def nbor(self, i: int = ...) -> Token: ...
def similarity(self, other: Union[Doc, Span, Token, Lexeme]) -> float: ...
def has_morph(self) -> bool: ...
morph: MorphAnalysis
@property
def lex(self) -> Lexeme: ...
@property
def lex_id(self) -> int: ...
@property
def rank(self) -> int: ...
@property
def text(self) -> str: ...
@property
def text_with_ws(self) -> str: ...
@property
def prob(self) -> float: ...
@property
def sentiment(self) -> float: ...
@property
def lang(self) -> int: ...
@property
def idx(self) -> int: ...
@property
def cluster(self) -> int: ...
@property
def orth(self) -> int: ...
@property
def lower(self) -> int: ...
@property
def norm(self) -> int: ...
@property
def shape(self) -> int: ...
@property
def prefix(self) -> int: ...
@property
def suffix(self) -> int: ...
lemma: int
pos: int
tag: int
dep: int
@property
def has_vector(self) -> bool: ...
@property
def vector(self) -> Floats1d: ...
@property
def vector_norm(self) -> float: ...
@property
def tensor(self) -> Optional[FloatsXd]: ...
@property
def n_lefts(self) -> int: ...
@property
def n_rights(self) -> int: ...
@property
def sent(self) -> Span: ...
sent_start: bool
is_sent_start: Optional[bool]
is_sent_end: Optional[bool]
@property
def lefts(self) -> Iterator[Token]: ...
@property
def rights(self) -> Iterator[Token]: ...
@property
def children(self) -> Iterator[Token]: ...
@property
def subtree(self) -> Iterator[Token]: ...
@property
def left_edge(self) -> Token: ...
@property
def right_edge(self) -> Token: ...
@property
def ancestors(self) -> Iterator[Token]: ...
def is_ancestor(self, descendant: Token) -> bool: ...
def has_head(self) -> bool: ...
head: Token
@property
def conjuncts(self) -> Tuple[Token]: ...
ent_type: int
ent_type_: str
@property
def ent_iob(self) -> int: ...
@classmethod
def iob_strings(cls) -> Tuple[str]: ...
@property
def ent_iob_(self) -> str: ...
ent_id: int
ent_id_: str
ent_kb_id: int
ent_kb_id_: str
@property
def whitespace_(self) -> str: ...
@property
def orth_(self) -> str: ...
@property
def lower_(self) -> str: ...
norm_: str
@property
def shape_(self) -> str: ...
@property
def prefix_(self) -> str: ...
@property
def suffix_(self) -> str: ...
@property
def lang_(self) -> str: ...
lemma_: str
pos_: str
tag_: str
def has_dep(self) -> bool: ...
dep_: str
@property
def is_oov(self) -> bool: ...
@property
def is_stop(self) -> bool: ...
@property
def is_alpha(self) -> bool: ...
@property
def is_ascii(self) -> bool: ...
@property
def is_digit(self) -> bool: ...
@property
def is_lower(self) -> bool: ...
@property
def is_upper(self) -> bool: ...
@property
def is_title(self) -> bool: ...
@property
def is_punct(self) -> bool: ...
@property
def is_space(self) -> bool: ...
@property
def is_bracket(self) -> bool: ...
@property
def is_quote(self) -> bool: ...
@property
def is_left_punct(self) -> bool: ...
@property
def is_right_punct(self) -> bool: ...
@property
def is_currency(self) -> bool: ...
@property
def like_url(self) -> bool: ...
@property
def like_num(self) -> bool: ...
@property
def like_email(self) -> bool: ...

View File

@ -29,7 +29,7 @@ def console_logger(progress_bar: bool = False):
def setup_printer(
nlp: "Language", stdout: IO = sys.stdout, stderr: IO = sys.stderr
) -> Tuple[Callable[[Optional[Dict[str, Any]]], None], Callable[[], None]]:
write = lambda text: stdout.write(f"{text}\n")
write = lambda text: print(text, file=stdout, flush=True)
msg = Printer(no_print=True)
# ensure that only trainable components are logged
logged_pipes = [

78
spacy/vocab.pyi Normal file
View File

@ -0,0 +1,78 @@
from typing import (
Callable,
Iterator,
Optional,
Union,
Tuple,
List,
Dict,
Any,
)
from thinc.types import Floats1d, FloatsXd
from . import Language
from .strings import StringStore
from .lexeme import Lexeme
from .lookups import Lookups
from .tokens import Doc, Span
from pathlib import Path
def create_vocab(
lang: Language, defaults: Any, vectors_name: Optional[str] = ...
) -> Vocab: ...
class Vocab:
def __init__(
self,
lex_attr_getters: Optional[Dict[str, Callable[[str], Any]]] = ...,
strings: Optional[Union[List[str], StringStore]] = ...,
lookups: Optional[Lookups] = ...,
oov_prob: float = ...,
vectors_name: Optional[str] = ...,
writing_system: Dict[str, Any] = ...,
get_noun_chunks: Optional[Callable[[Union[Doc, Span]], Iterator[Span]]] = ...,
) -> None: ...
@property
def lang(self) -> Language: ...
def __len__(self) -> int: ...
def add_flag(
self, flag_getter: Callable[[str], bool], flag_id: int = ...
) -> int: ...
def __contains__(self, key: str) -> bool: ...
def __iter__(self) -> Iterator[Lexeme]: ...
def __getitem__(self, id_or_string: Union[str, int]) -> Lexeme: ...
@property
def vectors_length(self) -> int: ...
def reset_vectors(
self, *, width: Optional[int] = ..., shape: Optional[int] = ...
) -> None: ...
def prune_vectors(self, nr_row: int, batch_size: int = ...) -> Dict[str, float]: ...
def get_vector(
self,
orth: Union[int, str],
minn: Optional[int] = ...,
maxn: Optional[int] = ...,
) -> FloatsXd: ...
def set_vector(self, orth: Union[int, str], vector: Floats1d) -> None: ...
def has_vector(self, orth: Union[int, str]) -> bool: ...
lookups: Lookups
def to_disk(
self, path: Union[str, Path], *, exclude: Union[List[str], Tuple[str]] = ...
) -> None: ...
def from_disk(
self, path: Union[str, Path], *, exclude: Union[List[str], Tuple[str]] = ...
) -> Vocab: ...
def to_bytes(self, *, exclude: Union[List[str], Tuple[str]] = ...) -> bytes: ...
def from_bytes(
self, bytes_data: bytes, *, exclude: Union[List[str], Tuple[str]] = ...
) -> Vocab: ...
def pickle_vocab(vocab: Vocab) -> Any: ...
def unpickle_vocab(
sstore: StringStore,
vectors: Any,
morphology: Any,
data_dir: Any,
lex_attr_getters: Any,
lookups: Any,
get_noun_chunks: Any,
) -> Vocab: ...

View File

@ -409,7 +409,7 @@ a single token vector given zero or more wordpiece vectors.
>
> ```ini
> [model]
> @architectures = "spacy.Tok2VecTransformer.v1"
> @architectures = "spacy-transformers.Tok2VecTransformer.v1"
> name = "albert-base-v2"
> tokenizer_config = {"use_fast": false}
> grad_factor = 1.0

View File

@ -90,7 +90,6 @@ Defines the `nlp` object, its tokenizer and
> ```ini
> [components.textcat]
> factory = "textcat"
> labels = ["POSITIVE", "NEGATIVE"]
>
> [components.textcat.model]
> @architectures = "spacy.TextCatBOW.v2"

View File

@ -35,11 +35,11 @@ how the component should be configured. You can override its settings via the
> ```
| Setting | Description |
| --------------------- | --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | --- | ----------- |
| --------------------- | --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `phrase_matcher_attr` | Optional attribute name to match on for the internal [`PhraseMatcher`](/api/phrasematcher), e.g. `LOWER` to match on the lowercase token text. Defaults to `None`. ~~Optional[Union[int, str]]~~ |
| `validate` | Whether patterns should be validated (passed to the `Matcher` and `PhraseMatcher`). Defaults to `False`. ~~bool~~ |
| `overwrite_ents` | If existing entities are present, e.g. entities added by the model, overwrite them by matches if necessary. Defaults to `False`. ~~bool~~ |
| `ent_id_sep` | Separator used internally for entity IDs. Defaults to `" | | "`. ~~str~~ |
| `ent_id_sep` | Separator used internally for entity IDs. Defaults to `"\|\|"`. ~~str~~ |
```python
%%GITHUB_SPACY/spacy/pipeline/entityruler.py
@ -64,14 +64,14 @@ be a token pattern (list) or a phrase pattern (string). For example:
> ```
| Name | Description |
| --------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | --- | ----------- |
| --------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `nlp` | The shared nlp object to pass the vocab to the matchers and process phrase patterns. ~~Language~~ |
| `name` <Tag variant="new">3</Tag> | Instance name of the current pipeline component. Typically passed in automatically from the factory when the component is added. Used to disable the current entity ruler while creating phrase patterns with the nlp object. ~~str~~ |
| _keyword-only_ | |
| `phrase_matcher_attr` | Optional attribute name to match on for the internal [`PhraseMatcher`](/api/phrasematcher), e.g. `LOWER` to match on the lowercase token text. Defaults to `None`. ~~Optional[Union[int, str]]~~ |
| `validate` | Whether patterns should be validated, passed to Matcher and PhraseMatcher as `validate`. Defaults to `False`. ~~bool~~ |
| `overwrite_ents` | If existing entities are present, e.g. entities added by the model, overwrite them by matches if necessary. Defaults to `False`. ~~bool~~ |
| `ent_id_sep` | Separator used internally for entity IDs. Defaults to `" | | "`. ~~str~~ |
| `ent_id_sep` | Separator used internally for entity IDs. Defaults to `"\|\|"`. ~~str~~ |
| `patterns` | Optional patterns to load in on initialization. ~~Optional[List[Dict[str, Union[str, List[dict]]]]]~~ |
## EntityRuler.initialize {#initialize tag="method" new="3"}

View File

@ -78,11 +78,12 @@ it compares to another value.
> ```
| Attribute | Description |
| -------------------------- | ------------------------------------------------------------------------------------------------------- |
| -------------------------- | -------------------------------------------------------------------------------------------------------- |
| `IN` | Attribute value is member of a list. ~~Any~~ |
| `NOT_IN` | Attribute value is _not_ member of a list. ~~Any~~ |
| `ISSUBSET` | Attribute values (for `MORPH`) are a subset of a list. ~~Any~~ |
| `ISSUPERSET` | Attribute values (for `MORPH`) are a superset of a list. ~~Any~~ |
| `IS_SUBSET` | Attribute value (for `MORPH` or custom list attributes) is a subset of a list. ~~Any~~ |
| `IS_SUPERSET` | Attribute value (for `MORPH` or custom list attributes) is a superset of a list. ~~Any~~ |
| `INTERSECTS` | Attribute value (for `MORPH` or custom list attribute) has a non-empty intersection with a list. ~~Any~~ |
| `==`, `>=`, `<=`, `>`, `<` | Attribute value is equal, greater or equal, smaller or equal, greater or smaller. ~~Union[int, float]~~ |
## Matcher.\_\_init\_\_ {#init tag="method"}

View File

@ -29,7 +29,7 @@ Create the vocabulary.
| `oov_prob` | The default OOV probability. Defaults to `-20.0`. ~~float~~ |
| `vectors_name` <Tag variant="new">2.2</Tag> | A name to identify the vectors table. ~~str~~ |
| `writing_system` | A dictionary describing the language's writing system. Typically provided by [`Language.Defaults`](/api/language#defaults). ~~Dict[str, Any]~~ |
| `get_noun_chunks` | A function that yields base noun phrases used for [`Doc.noun_chunks`](/ap/doc#noun_chunks). ~~Optional[Callable[[Union[Doc, Span], Iterator[Span]]]]~~ |
| `get_noun_chunks` | A function that yields base noun phrases used for [`Doc.noun_chunks`](/api/doc#noun_chunks). ~~Optional[Callable[[Union[Doc, Span], Iterator[Span]]]]~~ |
## Vocab.\_\_len\_\_ {#len tag="method"}

View File

@ -232,14 +232,21 @@ following rich comparison attributes are available:
>
> # Matches tokens of length >= 10
> pattern2 = [{"LENGTH": {">=": 10}}]
>
> # Match based on morph attributes
> pattern3 = [{"MORPH": {"IS_SUBSET": ["Number=Sing", "Gender=Neut"]}}]
> # "", "Number=Sing" and "Number=Sing|Gender=Neut" will match as subsets
> # "Number=Plur|Gender=Neut" will not match
> # "Number=Sing|Gender=Neut|Polite=Infm" will not match because it's a superset
> ```
| Attribute | Description |
| -------------------------- | ------------------------------------------------------------------------------------------------------- |
| -------------------------- | --------------------------------------------------------------------------------------------------------- |
| `IN` | Attribute value is member of a list. ~~Any~~ |
| `NOT_IN` | Attribute value is _not_ member of a list. ~~Any~~ |
| `ISSUBSET` | Attribute values (for `MORPH`) are a subset of a list. ~~Any~~ |
| `ISSUPERSET` | Attribute values (for `MORPH`) are a superset of a list. ~~Any~~ |
| `IS_SUBSET` | Attribute value (for `MORPH` or custom list attributes) is a subset of a list. ~~Any~~ |
| `IS_SUPERSET` | Attribute value (for `MORPH` or custom list attributes) is a superset of a list. ~~Any~~ |
| `INTERSECTS` | Attribute value (for `MORPH` or custom list attributes) has a non-empty intersection with a list. ~~Any~~ |
| `==`, `>=`, `<=`, `>`, `<` | Attribute value is equal, greater or equal, smaller or equal, greater or smaller. ~~Union[int, float]~~ |
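The three set operators in the table can be read as ordinary set comparisons between the token's attribute value and the list given in the pattern. The sketch below models that reading in plain Python; it is an illustration of the semantics described here, not the `Matcher` implementation, and the helper names (`morph_features`, `is_subset`, `is_superset`, `intersects`) are hypothetical.

```python
from typing import Iterable, Set


def morph_features(morph: str) -> Set[str]:
    """Split a string such as "Number=Sing|Gender=Neut" into a set of features."""
    return {feat for feat in morph.split("|") if feat}


def is_subset(value: Iterable[str], pattern: Iterable[str]) -> bool:
    # IS_SUBSET: every item of the token's value appears in the pattern list.
    return set(value) <= set(pattern)


def is_superset(value: Iterable[str], pattern: Iterable[str]) -> bool:
    # IS_SUPERSET: every item of the pattern list appears in the token's value.
    return set(value) >= set(pattern)


def intersects(value: Iterable[str], pattern: Iterable[str]) -> bool:
    # INTERSECTS: the value and the pattern list share at least one item.
    return bool(set(value) & set(pattern))


pattern = ["Number=Sing", "Gender=Neut"]
for morph in [
    "",
    "Number=Sing",
    "Number=Sing|Gender=Neut",
    "Number=Plur|Gender=Neut",
    "Number=Sing|Gender=Neut|Polite=Infm",
]:
    print(repr(morph), is_subset(morph_features(morph), pattern))
```

Under this model an empty value is a subset of any list, which is why `""` matches `IS_SUBSET`, while `INTERSECTS` never matches when either the value or the pattern list is empty.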
#### Regular expressions {#regex new="2.1"}

View File

@ -652,7 +652,7 @@ excluded from the logs and the score won't be weighted.
| **Recall** (R) | Percentage of reference annotations recovered. Should increase. |
| **F-Score** (F) | Harmonic mean of precision and recall. Should increase. |
| **UAS** / **LAS** | Unlabeled and labeled attachment score for the dependency parser, i.e. the percentage of correct arcs. Should increase. |
| **Words per second** (WPS) | Prediction speed in words per second. Should stay stable. |
| **Speed** | Prediction speed in words per second (WPS). Should stay stable. |
Note that if the development data has raw text, some of the gold-standard
entities might not align to the predicted tokenization. These tokenization

View File

@ -854,6 +854,19 @@ pipeline component, the [`AttributeRuler`](/api/attributeruler). See the
you have tag maps and morph rules in the v2.x format, you can load them into the
attribute ruler before training using the `[initialize]` block of your config.
### Using Lexeme Tables
To use tables like `lexeme_prob` when training a model from scratch, you need
to add an entry to the `initialize` block in your config. Here's what that
looks like for the existing trained pipelines:
```ini
[initialize.lookups]
@misc = "spacy.LookupsDataLoader.v1"
lang = ${nlp.lang}
tables = ["lexeme_norm"]
```
> #### What does the initialization do?
>
> The `[initialize]` block is used when