mirror of
https://github.com/explosion/spaCy.git
synced 2024-12-25 17:36:30 +03:00
Merge remote-tracking branch 'upstream/master' into chore/update-develop-from-master-v3.1-1
This commit is contained in:
commit
a79888ed67
106
.github/contributors/ezorita.md
vendored
Normal file
106
.github/contributors/ezorita.md
vendored
Normal file
|
@ -0,0 +1,106 @@
|
|||
# spaCy contributor agreement
|
||||
|
||||
This spaCy Contributor Agreement (**"SCA"**) is based on the
|
||||
[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf).
|
||||
The SCA applies to any contribution that you make to any product or project
|
||||
managed by us (the **"project"**), and sets out the intellectual property rights
|
||||
you grant to us in the contributed materials. The term **"us"** shall mean
|
||||
[ExplosionAI GmbH](https://explosion.ai/legal). The term
|
||||
**"you"** shall mean the person or entity identified below.
|
||||
|
||||
If you agree to be bound by these terms, fill in the information requested
|
||||
below and include the filled-in version with your first pull request, under the
|
||||
folder [`.github/contributors/`](/.github/contributors/). The name of the file
|
||||
should be your GitHub username, with the extension `.md`. For example, the user
|
||||
example_user would create the file `.github/contributors/example_user.md`.
|
||||
|
||||
Read this agreement carefully before signing. These terms and conditions
|
||||
constitute a binding legal agreement.
|
||||
|
||||
## Contributor Agreement
|
||||
|
||||
1. The term "contribution" or "contributed materials" means any source code,
|
||||
object code, patch, tool, sample, graphic, specification, manual,
|
||||
documentation, or any other material posted or submitted by you to the project.
|
||||
|
||||
2. With respect to any worldwide copyrights, or copyright applications and
|
||||
registrations, in your contribution:
|
||||
|
||||
* you hereby assign to us joint ownership, and to the extent that such
|
||||
assignment is or becomes invalid, ineffective or unenforceable, you hereby
|
||||
grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
|
||||
royalty-free, unrestricted license to exercise all rights under those
|
||||
copyrights. This includes, at our option, the right to sublicense these same
|
||||
rights to third parties through multiple levels of sublicensees or other
|
||||
licensing arrangements;
|
||||
|
||||
* you agree that each of us can do all things in relation to your
|
||||
contribution as if each of us were the sole owners, and if one of us makes
|
||||
a derivative work of your contribution, the one who makes the derivative
|
||||
work (or has it made will be the sole owner of that derivative work;
|
||||
|
||||
* you agree that you will not assert any moral rights in your contribution
|
||||
against us, our licensees or transferees;
|
||||
|
||||
* you agree that we may register a copyright in your contribution and
|
||||
exercise all ownership rights associated with it; and
|
||||
|
||||
* you agree that neither of us has any duty to consult with, obtain the
|
||||
consent of, pay or render an accounting to the other for any use or
|
||||
distribution of your contribution.
|
||||
|
||||
3. With respect to any patents you own, or that you can license without payment
|
||||
to any third party, you hereby grant to us a perpetual, irrevocable,
|
||||
non-exclusive, worldwide, no-charge, royalty-free license to:
|
||||
|
||||
* make, have made, use, sell, offer to sell, import, and otherwise transfer
|
||||
your contribution in whole or in part, alone or in combination with or
|
||||
included in any product, work or materials arising out of the project to
|
||||
which your contribution was submitted, and
|
||||
|
||||
* at our option, to sublicense these same rights to third parties through
|
||||
multiple levels of sublicensees or other licensing arrangements.
|
||||
|
||||
4. Except as set out above, you keep all right, title, and interest in your
|
||||
contribution. The rights that you grant to us under these terms are effective
|
||||
on the date you first submitted a contribution to us, even if your submission
|
||||
took place before the date you sign these terms.
|
||||
|
||||
5. You covenant, represent, warrant and agree that:
|
||||
|
||||
* Each contribution that you submit is and shall be an original work of
|
||||
authorship and you can legally grant the rights set out in this SCA;
|
||||
|
||||
* to the best of your knowledge, each contribution will not violate any
|
||||
third party's copyrights, trademarks, patents, or other intellectual
|
||||
property rights; and
|
||||
|
||||
* each contribution shall be in compliance with U.S. export control laws and
|
||||
other applicable export and import laws. You agree to notify us if you
|
||||
become aware of any circumstance which would make any of the foregoing
|
||||
representations inaccurate in any respect. We may publicly disclose your
|
||||
participation in the project, including the fact that you have signed the SCA.
|
||||
|
||||
6. This SCA is governed by the laws of the State of California and applicable
|
||||
U.S. Federal law. Any choice of law rules will not apply.
|
||||
|
||||
7. Please place an “x” on one of the applicable statement below. Please do NOT
|
||||
mark both statements:
|
||||
|
||||
* [x] I am signing on behalf of myself as an individual and no other person
|
||||
or entity, including my employer, has or will have rights with respect to my
|
||||
contributions.
|
||||
|
||||
* [ ] I am signing on behalf of my employer or a legal entity and I have the
|
||||
actual authority to contractually bind that entity.
|
||||
|
||||
## Contributor Details
|
||||
|
||||
| Field | Entry |
|
||||
|------------------------------- | -------------------- |
|
||||
| Name | Eduard Zorita |
|
||||
| Company name (if applicable) | |
|
||||
| Title or role (if applicable) | |
|
||||
| Date | 06/17/2021 |
|
||||
| GitHub username | ezorita |
|
||||
| Website (optional) | |
|
106
.github/contributors/nsorros.md
vendored
Normal file
106
.github/contributors/nsorros.md
vendored
Normal file
|
@ -0,0 +1,106 @@
|
|||
# spaCy contributor agreement
|
||||
|
||||
This spaCy Contributor Agreement (**"SCA"**) is based on the
|
||||
[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf).
|
||||
The SCA applies to any contribution that you make to any product or project
|
||||
managed by us (the **"project"**), and sets out the intellectual property rights
|
||||
you grant to us in the contributed materials. The term **"us"** shall mean
|
||||
[ExplosionAI GmbH](https://explosion.ai/legal). The term
|
||||
**"you"** shall mean the person or entity identified below.
|
||||
|
||||
If you agree to be bound by these terms, fill in the information requested
|
||||
below and include the filled-in version with your first pull request, under the
|
||||
folder [`.github/contributors/`](/.github/contributors/). The name of the file
|
||||
should be your GitHub username, with the extension `.md`. For example, the user
|
||||
example_user would create the file `.github/contributors/example_user.md`.
|
||||
|
||||
Read this agreement carefully before signing. These terms and conditions
|
||||
constitute a binding legal agreement.
|
||||
|
||||
## Contributor Agreement
|
||||
|
||||
1. The term "contribution" or "contributed materials" means any source code,
|
||||
object code, patch, tool, sample, graphic, specification, manual,
|
||||
documentation, or any other material posted or submitted by you to the project.
|
||||
|
||||
2. With respect to any worldwide copyrights, or copyright applications and
|
||||
registrations, in your contribution:
|
||||
|
||||
* you hereby assign to us joint ownership, and to the extent that such
|
||||
assignment is or becomes invalid, ineffective or unenforceable, you hereby
|
||||
grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
|
||||
royalty-free, unrestricted license to exercise all rights under those
|
||||
copyrights. This includes, at our option, the right to sublicense these same
|
||||
rights to third parties through multiple levels of sublicensees or other
|
||||
licensing arrangements;
|
||||
|
||||
* you agree that each of us can do all things in relation to your
|
||||
contribution as if each of us were the sole owners, and if one of us makes
|
||||
a derivative work of your contribution, the one who makes the derivative
|
||||
work (or has it made will be the sole owner of that derivative work;
|
||||
|
||||
* you agree that you will not assert any moral rights in your contribution
|
||||
against us, our licensees or transferees;
|
||||
|
||||
* you agree that we may register a copyright in your contribution and
|
||||
exercise all ownership rights associated with it; and
|
||||
|
||||
* you agree that neither of us has any duty to consult with, obtain the
|
||||
consent of, pay or render an accounting to the other for any use or
|
||||
distribution of your contribution.
|
||||
|
||||
3. With respect to any patents you own, or that you can license without payment
|
||||
to any third party, you hereby grant to us a perpetual, irrevocable,
|
||||
non-exclusive, worldwide, no-charge, royalty-free license to:
|
||||
|
||||
* make, have made, use, sell, offer to sell, import, and otherwise transfer
|
||||
your contribution in whole or in part, alone or in combination with or
|
||||
included in any product, work or materials arising out of the project to
|
||||
which your contribution was submitted, and
|
||||
|
||||
* at our option, to sublicense these same rights to third parties through
|
||||
multiple levels of sublicensees or other licensing arrangements.
|
||||
|
||||
4. Except as set out above, you keep all right, title, and interest in your
|
||||
contribution. The rights that you grant to us under these terms are effective
|
||||
on the date you first submitted a contribution to us, even if your submission
|
||||
took place before the date you sign these terms.
|
||||
|
||||
5. You covenant, represent, warrant and agree that:
|
||||
|
||||
* Each contribution that you submit is and shall be an original work of
|
||||
authorship and you can legally grant the rights set out in this SCA;
|
||||
|
||||
* to the best of your knowledge, each contribution will not violate any
|
||||
third party's copyrights, trademarks, patents, or other intellectual
|
||||
property rights; and
|
||||
|
||||
* each contribution shall be in compliance with U.S. export control laws and
|
||||
other applicable export and import laws. You agree to notify us if you
|
||||
become aware of any circumstance which would make any of the foregoing
|
||||
representations inaccurate in any respect. We may publicly disclose your
|
||||
participation in the project, including the fact that you have signed the SCA.
|
||||
|
||||
6. This SCA is governed by the laws of the State of California and applicable
|
||||
U.S. Federal law. Any choice of law rules will not apply.
|
||||
|
||||
7. Please place an “x” on one of the applicable statement below. Please do NOT
|
||||
mark both statements:
|
||||
|
||||
* [x] I am signing on behalf of myself as an individual and no other person
|
||||
or entity, including my employer, has or will have rights with respect to my
|
||||
contributions.
|
||||
|
||||
* [ ] I am signing on behalf of my employer or a legal entity and I have the
|
||||
actual authority to contractually bind that entity.
|
||||
|
||||
## Contributor Details
|
||||
|
||||
| Field | Entry |
|
||||
|------------------------------- | -------------------- |
|
||||
| Name | Nick Sorros |
|
||||
| Company name (if applicable) | |
|
||||
| Title or role (if applicable) | |
|
||||
| Date | 2/8/2021 |
|
||||
| GitHub username | nsorros |
|
||||
| Website (optional) | |
|
|
@ -2,7 +2,7 @@ from pathlib import Path
|
|||
from wasabi import msg
|
||||
from .remote_storage import RemoteStorage
|
||||
from .remote_storage import get_command_hash
|
||||
from .._util import project_cli, Arg
|
||||
from .._util import project_cli, Arg, logger
|
||||
from .._util import load_project_config
|
||||
from .run import update_lockfile
|
||||
|
||||
|
@ -39,11 +39,15 @@ def project_pull(project_dir: Path, remote: str, *, verbose: bool = False):
|
|||
# in the list.
|
||||
while commands:
|
||||
for i, cmd in enumerate(list(commands)):
|
||||
logger.debug(f"CMD: {cmd['name']}.")
|
||||
deps = [project_dir / dep for dep in cmd.get("deps", [])]
|
||||
if all(dep.exists() for dep in deps):
|
||||
cmd_hash = get_command_hash("", "", deps, cmd["script"])
|
||||
for output_path in cmd.get("outputs", []):
|
||||
url = storage.pull(output_path, command_hash=cmd_hash)
|
||||
logger.debug(
|
||||
f"URL: {url} for {output_path} with command hash {cmd_hash}"
|
||||
)
|
||||
yield url, output_path
|
||||
|
||||
out_locs = [project_dir / out for out in cmd.get("outputs", [])]
|
||||
|
@ -53,6 +57,8 @@ def project_pull(project_dir: Path, remote: str, *, verbose: bool = False):
|
|||
# we iterate over the loop again.
|
||||
commands.pop(i)
|
||||
break
|
||||
else:
|
||||
logger.debug(f"Dependency missing. Skipping {cmd['name']} outputs.")
|
||||
else:
|
||||
# If we didn't break the for loop, break the while loop.
|
||||
break
|
||||
|
|
|
@ -3,7 +3,7 @@ from wasabi import msg
|
|||
from .remote_storage import RemoteStorage
|
||||
from .remote_storage import get_content_hash, get_command_hash
|
||||
from .._util import load_project_config
|
||||
from .._util import project_cli, Arg
|
||||
from .._util import project_cli, Arg, logger
|
||||
|
||||
|
||||
@project_cli.command("push")
|
||||
|
@ -37,12 +37,15 @@ def project_push(project_dir: Path, remote: str):
|
|||
remote = config["remotes"][remote]
|
||||
storage = RemoteStorage(project_dir, remote)
|
||||
for cmd in config.get("commands", []):
|
||||
logger.debug(f"CMD: cmd['name']")
|
||||
deps = [project_dir / dep for dep in cmd.get("deps", [])]
|
||||
if any(not dep.exists() for dep in deps):
|
||||
logger.debug(f"Dependency missing. Skipping {cmd['name']} outputs")
|
||||
continue
|
||||
cmd_hash = get_command_hash(
|
||||
"", "", [project_dir / dep for dep in cmd.get("deps", [])], cmd["script"]
|
||||
)
|
||||
logger.debug(f"CMD_HASH: {cmd_hash}")
|
||||
for output_path in cmd.get("outputs", []):
|
||||
output_loc = project_dir / output_path
|
||||
if output_loc.exists() and _is_not_empty_dir(output_loc):
|
||||
|
@ -51,6 +54,9 @@ def project_push(project_dir: Path, remote: str):
|
|||
command_hash=cmd_hash,
|
||||
content_hash=get_content_hash(output_loc),
|
||||
)
|
||||
logger.debug(
|
||||
f"URL: {url} for output {output_path} with cmd_hash {cmd_hash}"
|
||||
)
|
||||
yield output_path, url
|
||||
|
||||
|
||||
|
|
|
@ -43,9 +43,13 @@ def train_cli(
|
|||
# Make sure all files and paths exists if they are needed
|
||||
if not config_path or (str(config_path) != "-" and not config_path.exists()):
|
||||
msg.fail("Config file not found", config_path, exits=1)
|
||||
if output_path is not None and not output_path.exists():
|
||||
output_path.mkdir(parents=True)
|
||||
msg.good(f"Created output directory: {output_path}")
|
||||
if not output_path:
|
||||
msg.info("No output directory provided")
|
||||
else:
|
||||
if not output_path.exists():
|
||||
output_path.mkdir(parents=True)
|
||||
msg.good(f"Created output directory: {output_path}")
|
||||
msg.info(f"Saving to output directory: {output_path}")
|
||||
overrides = parse_config_overrides(ctx.args)
|
||||
import_code(code_path)
|
||||
setup_gpu(use_gpu)
|
||||
|
|
|
@ -356,8 +356,8 @@ class Errors:
|
|||
E098 = ("Invalid pattern: expected both RIGHT_ID and RIGHT_ATTRS.")
|
||||
E099 = ("Invalid pattern: the first node of pattern should be an anchor "
|
||||
"node. The node should only contain RIGHT_ID and RIGHT_ATTRS.")
|
||||
E100 = ("Nodes other than the anchor node should all contain LEFT_ID, "
|
||||
"REL_OP and RIGHT_ID.")
|
||||
E100 = ("Nodes other than the anchor node should all contain {required}, "
|
||||
"but these are missing: {missing}")
|
||||
E101 = ("RIGHT_ID should be a new node and LEFT_ID should already have "
|
||||
"have been declared in previous edges.")
|
||||
E102 = ("Can't merge non-disjoint spans. '{token}' is already part of "
|
||||
|
|
|
@ -1698,7 +1698,6 @@ class Language:
|
|||
# them here so they're only loaded once
|
||||
source_nlps = {}
|
||||
source_nlp_vectors_hashes = {}
|
||||
nlp.meta["_sourced_vectors_hashes"] = {}
|
||||
for pipe_name in config["nlp"]["pipeline"]:
|
||||
if pipe_name not in pipeline:
|
||||
opts = ", ".join(pipeline.keys())
|
||||
|
@ -1747,6 +1746,8 @@ class Language:
|
|||
source_nlp_vectors_hashes[model] = hash(
|
||||
source_nlps[model].vocab.vectors.to_bytes()
|
||||
)
|
||||
if "_sourced_vectors_hashes" not in nlp.meta:
|
||||
nlp.meta["_sourced_vectors_hashes"] = {}
|
||||
nlp.meta["_sourced_vectors_hashes"][
|
||||
pipe_name
|
||||
] = source_nlp_vectors_hashes[model]
|
||||
|
@ -1908,7 +1909,7 @@ class Language:
|
|||
if not hasattr(proc, "to_disk"):
|
||||
continue
|
||||
serializers[name] = lambda p, proc=proc: proc.to_disk(p, exclude=["vocab"])
|
||||
serializers["vocab"] = lambda p: self.vocab.to_disk(p)
|
||||
serializers["vocab"] = lambda p: self.vocab.to_disk(p, exclude=exclude)
|
||||
util.to_disk(path, serializers, exclude)
|
||||
|
||||
def from_disk(
|
||||
|
@ -1939,7 +1940,7 @@ class Language:
|
|||
|
||||
def deserialize_vocab(path: Path) -> None:
|
||||
if path.exists():
|
||||
self.vocab.from_disk(path)
|
||||
self.vocab.from_disk(path, exclude=exclude)
|
||||
|
||||
path = util.ensure_path(path)
|
||||
deserializers = {}
|
||||
|
@ -1977,7 +1978,7 @@ class Language:
|
|||
DOCS: https://spacy.io/api/language#to_bytes
|
||||
"""
|
||||
serializers = {}
|
||||
serializers["vocab"] = lambda: self.vocab.to_bytes()
|
||||
serializers["vocab"] = lambda: self.vocab.to_bytes(exclude=exclude)
|
||||
serializers["tokenizer"] = lambda: self.tokenizer.to_bytes(exclude=["vocab"])
|
||||
serializers["meta.json"] = lambda: srsly.json_dumps(self.meta)
|
||||
serializers["config.cfg"] = lambda: self.config.to_bytes()
|
||||
|
@ -2013,7 +2014,7 @@ class Language:
|
|||
b, interpolate=False
|
||||
)
|
||||
deserializers["meta.json"] = deserialize_meta
|
||||
deserializers["vocab"] = self.vocab.from_bytes
|
||||
deserializers["vocab"] = lambda b: self.vocab.from_bytes(b, exclude=exclude)
|
||||
deserializers["tokenizer"] = lambda b: self.tokenizer.from_bytes(
|
||||
b, exclude=["vocab"]
|
||||
)
|
||||
|
|
61
spacy/lexeme.pyi
Normal file
61
spacy/lexeme.pyi
Normal file
|
@ -0,0 +1,61 @@
|
|||
from typing import (
|
||||
Union,
|
||||
Any,
|
||||
)
|
||||
from thinc.types import Floats1d
|
||||
from .tokens import Doc, Span, Token
|
||||
from .vocab import Vocab
|
||||
|
||||
class Lexeme:
|
||||
def __init__(self, vocab: Vocab, orth: int) -> None: ...
|
||||
def __richcmp__(self, other: Lexeme, op: int) -> bool: ...
|
||||
def __hash__(self) -> int: ...
|
||||
def set_attrs(self, **attrs: Any) -> None: ...
|
||||
def set_flag(self, flag_id: int, value: bool) -> None: ...
|
||||
def check_flag(self, flag_id: int) -> bool: ...
|
||||
def similarity(self, other: Union[Doc, Span, Token, Lexeme]) -> float: ...
|
||||
@property
|
||||
def has_vector(self) -> bool: ...
|
||||
@property
|
||||
def vector_norm(self) -> float: ...
|
||||
vector: Floats1d
|
||||
rank: str
|
||||
sentiment: float
|
||||
@property
|
||||
def orth_(self) -> str: ...
|
||||
@property
|
||||
def text(self) -> str: ...
|
||||
lower: str
|
||||
norm: int
|
||||
shape: int
|
||||
prefix: int
|
||||
suffix: int
|
||||
cluster: int
|
||||
lang: int
|
||||
prob: float
|
||||
lower_: str
|
||||
norm_: str
|
||||
shape_: str
|
||||
prefix_: str
|
||||
suffix_: str
|
||||
lang_: str
|
||||
flags: int
|
||||
@property
|
||||
def is_oov(self) -> bool: ...
|
||||
is_stop: bool
|
||||
is_alpha: bool
|
||||
is_ascii: bool
|
||||
is_digit: bool
|
||||
is_lower: bool
|
||||
is_upper: bool
|
||||
is_title: bool
|
||||
is_punct: bool
|
||||
is_space: bool
|
||||
is_bracket: bool
|
||||
is_quote: bool
|
||||
is_left_punct: bool
|
||||
is_right_punct: bool
|
||||
is_currency: bool
|
||||
like_url: bool
|
||||
like_num: bool
|
||||
like_email: bool
|
|
@ -122,13 +122,17 @@ cdef class DependencyMatcher:
|
|||
raise ValueError(Errors.E099.format(key=key))
|
||||
visited_nodes[relation["RIGHT_ID"]] = True
|
||||
else:
|
||||
if not(
|
||||
"RIGHT_ID" in relation
|
||||
and "RIGHT_ATTRS" in relation
|
||||
and "REL_OP" in relation
|
||||
and "LEFT_ID" in relation
|
||||
):
|
||||
raise ValueError(Errors.E100.format(key=key))
|
||||
required_keys = set(
|
||||
("RIGHT_ID", "RIGHT_ATTRS", "REL_OP", "LEFT_ID")
|
||||
)
|
||||
relation_keys = set(relation.keys())
|
||||
missing = required_keys - relation_keys
|
||||
if missing:
|
||||
missing_txt = ", ".join(list(missing))
|
||||
raise ValueError(Errors.E100.format(
|
||||
required=required_keys,
|
||||
missing=missing_txt
|
||||
))
|
||||
if (
|
||||
relation["RIGHT_ID"] in visited_nodes
|
||||
or relation["LEFT_ID"] not in visited_nodes
|
||||
|
|
41
spacy/matcher/matcher.pyi
Normal file
41
spacy/matcher/matcher.pyi
Normal file
|
@ -0,0 +1,41 @@
|
|||
from typing import Any, List, Dict, Tuple, Optional, Callable, Union, Iterator, Iterable
|
||||
from ..vocab import Vocab
|
||||
from ..tokens import Doc, Span
|
||||
|
||||
class Matcher:
|
||||
def __init__(self, vocab: Vocab, validate: bool = ...) -> None: ...
|
||||
def __reduce__(self) -> Any: ...
|
||||
def __len__(self) -> int: ...
|
||||
def __contains__(self, key: str) -> bool: ...
|
||||
def add(
|
||||
self,
|
||||
key: str,
|
||||
patterns: List[List[Dict[str, Any]]],
|
||||
*,
|
||||
on_match: Optional[
|
||||
Callable[[Matcher, Doc, int, List[Tuple[Any, ...]]], Any]
|
||||
] = ...,
|
||||
greedy: Optional[str] = ...
|
||||
) -> None: ...
|
||||
def remove(self, key: str) -> None: ...
|
||||
def has_key(self, key: Union[str, int]) -> bool: ...
|
||||
def get(
|
||||
self, key: Union[str, int], default: Optional[Any] = ...
|
||||
) -> Tuple[Optional[Callable[[Any], Any]], List[List[Dict[Any, Any]]]]: ...
|
||||
def pipe(
|
||||
self,
|
||||
docs: Iterable[Tuple[Doc, Any]],
|
||||
batch_size: int = ...,
|
||||
return_matches: bool = ...,
|
||||
as_tuples: bool = ...,
|
||||
) -> Union[
|
||||
Iterator[Tuple[Tuple[Doc, Any], Any]], Iterator[Tuple[Doc, Any]], Iterator[Doc]
|
||||
]: ...
|
||||
def __call__(
|
||||
self,
|
||||
doclike: Union[Doc, Span],
|
||||
*,
|
||||
as_spans: bool = ...,
|
||||
allow_missing: bool = ...,
|
||||
with_alignments: bool = ...
|
||||
) -> Union[List[Tuple[int, int, int]], List[Span]]: ...
|
|
@ -845,7 +845,7 @@ class _RegexPredicate:
|
|||
|
||||
|
||||
class _SetPredicate:
|
||||
operators = ("IN", "NOT_IN", "IS_SUBSET", "IS_SUPERSET")
|
||||
operators = ("IN", "NOT_IN", "IS_SUBSET", "IS_SUPERSET", "INTERSECTS")
|
||||
|
||||
def __init__(self, i, attr, value, predicate, is_extension=False, vocab=None):
|
||||
self.i = i
|
||||
|
@ -868,14 +868,16 @@ class _SetPredicate:
|
|||
else:
|
||||
value = get_token_attr_for_matcher(token.c, self.attr)
|
||||
|
||||
if self.predicate in ("IS_SUBSET", "IS_SUPERSET"):
|
||||
if self.predicate in ("IS_SUBSET", "IS_SUPERSET", "INTERSECTS"):
|
||||
if self.attr == MORPH:
|
||||
# break up MORPH into individual Feat=Val values
|
||||
value = set(get_string_id(v) for v in MorphAnalysis.from_id(self.vocab, value))
|
||||
else:
|
||||
# IS_SUBSET for other attrs will be equivalent to "IN"
|
||||
# IS_SUPERSET will only match for other attrs with 0 or 1 values
|
||||
value = set([value])
|
||||
# treat a single value as a list
|
||||
if isinstance(value, (str, int)):
|
||||
value = set([get_string_id(value)])
|
||||
else:
|
||||
value = set(get_string_id(v) for v in value)
|
||||
if self.predicate == "IN":
|
||||
return value in self.value
|
||||
elif self.predicate == "NOT_IN":
|
||||
|
@ -884,6 +886,8 @@ class _SetPredicate:
|
|||
return value <= self.value
|
||||
elif self.predicate == "IS_SUPERSET":
|
||||
return value >= self.value
|
||||
elif self.predicate == "INTERSECTS":
|
||||
return bool(value & self.value)
|
||||
|
||||
def __repr__(self):
|
||||
return repr(("SetPredicate", self.i, self.attr, self.value, self.predicate))
|
||||
|
@ -928,6 +932,7 @@ def _get_extra_predicates(spec, extra_predicates, vocab):
|
|||
"NOT_IN": _SetPredicate,
|
||||
"IS_SUBSET": _SetPredicate,
|
||||
"IS_SUPERSET": _SetPredicate,
|
||||
"INTERSECTS": _SetPredicate,
|
||||
"==": _ComparisonPredicate,
|
||||
"!=": _ComparisonPredicate,
|
||||
">=": _ComparisonPredicate,
|
||||
|
|
|
@ -276,7 +276,7 @@ class AttributeRuler(Pipe):
|
|||
DOCS: https://spacy.io/api/attributeruler#to_bytes
|
||||
"""
|
||||
serialize = {}
|
||||
serialize["vocab"] = self.vocab.to_bytes
|
||||
serialize["vocab"] = lambda: self.vocab.to_bytes(exclude=exclude)
|
||||
serialize["patterns"] = lambda: srsly.msgpack_dumps(self.patterns)
|
||||
return util.to_bytes(serialize, exclude)
|
||||
|
||||
|
@ -296,7 +296,7 @@ class AttributeRuler(Pipe):
|
|||
self.add_patterns(srsly.msgpack_loads(b))
|
||||
|
||||
deserialize = {
|
||||
"vocab": lambda b: self.vocab.from_bytes(b),
|
||||
"vocab": lambda b: self.vocab.from_bytes(b, exclude=exclude),
|
||||
"patterns": load_patterns,
|
||||
}
|
||||
util.from_bytes(bytes_data, deserialize, exclude)
|
||||
|
@ -313,7 +313,7 @@ class AttributeRuler(Pipe):
|
|||
DOCS: https://spacy.io/api/attributeruler#to_disk
|
||||
"""
|
||||
serialize = {
|
||||
"vocab": lambda p: self.vocab.to_disk(p),
|
||||
"vocab": lambda p: self.vocab.to_disk(p, exclude=exclude),
|
||||
"patterns": lambda p: srsly.write_msgpack(p, self.patterns),
|
||||
}
|
||||
util.to_disk(path, serialize, exclude)
|
||||
|
@ -334,7 +334,7 @@ class AttributeRuler(Pipe):
|
|||
self.add_patterns(srsly.read_msgpack(p))
|
||||
|
||||
deserialize = {
|
||||
"vocab": lambda p: self.vocab.from_disk(p),
|
||||
"vocab": lambda p: self.vocab.from_disk(p, exclude=exclude),
|
||||
"patterns": load_patterns,
|
||||
}
|
||||
util.from_disk(path, deserialize, exclude)
|
||||
|
|
|
@ -412,7 +412,7 @@ class EntityLinker(TrainablePipe):
|
|||
serialize = {}
|
||||
if hasattr(self, "cfg") and self.cfg is not None:
|
||||
serialize["cfg"] = lambda: srsly.json_dumps(self.cfg)
|
||||
serialize["vocab"] = self.vocab.to_bytes
|
||||
serialize["vocab"] = lambda: self.vocab.to_bytes(exclude=exclude)
|
||||
serialize["kb"] = self.kb.to_bytes
|
||||
serialize["model"] = self.model.to_bytes
|
||||
return util.to_bytes(serialize, exclude)
|
||||
|
@ -436,7 +436,7 @@ class EntityLinker(TrainablePipe):
|
|||
deserialize = {}
|
||||
if hasattr(self, "cfg") and self.cfg is not None:
|
||||
deserialize["cfg"] = lambda b: self.cfg.update(srsly.json_loads(b))
|
||||
deserialize["vocab"] = lambda b: self.vocab.from_bytes(b)
|
||||
deserialize["vocab"] = lambda b: self.vocab.from_bytes(b, exclude=exclude)
|
||||
deserialize["kb"] = lambda b: self.kb.from_bytes(b)
|
||||
deserialize["model"] = load_model
|
||||
util.from_bytes(bytes_data, deserialize, exclude)
|
||||
|
@ -453,7 +453,7 @@ class EntityLinker(TrainablePipe):
|
|||
DOCS: https://spacy.io/api/entitylinker#to_disk
|
||||
"""
|
||||
serialize = {}
|
||||
serialize["vocab"] = lambda p: self.vocab.to_disk(p)
|
||||
serialize["vocab"] = lambda p: self.vocab.to_disk(p, exclude=exclude)
|
||||
serialize["cfg"] = lambda p: srsly.write_json(p, self.cfg)
|
||||
serialize["kb"] = lambda p: self.kb.to_disk(p)
|
||||
serialize["model"] = lambda p: self.model.to_disk(p)
|
||||
|
@ -480,6 +480,7 @@ class EntityLinker(TrainablePipe):
|
|||
|
||||
deserialize = {}
|
||||
deserialize["cfg"] = lambda p: self.cfg.update(deserialize_config(p))
|
||||
deserialize["vocab"] = lambda p: self.vocab.from_disk(p, exclude=exclude)
|
||||
deserialize["kb"] = lambda p: self.kb.from_disk(p)
|
||||
deserialize["model"] = load_model
|
||||
util.from_disk(path, deserialize, exclude)
|
||||
|
|
|
@ -269,7 +269,7 @@ class Lemmatizer(Pipe):
|
|||
DOCS: https://spacy.io/api/lemmatizer#to_disk
|
||||
"""
|
||||
serialize = {}
|
||||
serialize["vocab"] = lambda p: self.vocab.to_disk(p)
|
||||
serialize["vocab"] = lambda p: self.vocab.to_disk(p, exclude=exclude)
|
||||
serialize["lookups"] = lambda p: self.lookups.to_disk(p)
|
||||
util.to_disk(path, serialize, exclude)
|
||||
|
||||
|
@ -285,7 +285,7 @@ class Lemmatizer(Pipe):
|
|||
DOCS: https://spacy.io/api/lemmatizer#from_disk
|
||||
"""
|
||||
deserialize = {}
|
||||
deserialize["vocab"] = lambda p: self.vocab.from_disk(p)
|
||||
deserialize["vocab"] = lambda p: self.vocab.from_disk(p, exclude=exclude)
|
||||
deserialize["lookups"] = lambda p: self.lookups.from_disk(p)
|
||||
util.from_disk(path, deserialize, exclude)
|
||||
self._validate_tables()
|
||||
|
@ -300,7 +300,7 @@ class Lemmatizer(Pipe):
|
|||
DOCS: https://spacy.io/api/lemmatizer#to_bytes
|
||||
"""
|
||||
serialize = {}
|
||||
serialize["vocab"] = self.vocab.to_bytes
|
||||
serialize["vocab"] = lambda: self.vocab.to_bytes(exclude=exclude)
|
||||
serialize["lookups"] = self.lookups.to_bytes
|
||||
return util.to_bytes(serialize, exclude)
|
||||
|
||||
|
@ -316,7 +316,7 @@ class Lemmatizer(Pipe):
|
|||
DOCS: https://spacy.io/api/lemmatizer#from_bytes
|
||||
"""
|
||||
deserialize = {}
|
||||
deserialize["vocab"] = lambda b: self.vocab.from_bytes(b)
|
||||
deserialize["vocab"] = lambda b: self.vocab.from_bytes(b, exclude=exclude)
|
||||
deserialize["lookups"] = lambda b: self.lookups.from_bytes(b)
|
||||
util.from_bytes(bytes_data, deserialize, exclude)
|
||||
self._validate_tables()
|
||||
|
|
|
@ -273,7 +273,7 @@ cdef class TrainablePipe(Pipe):
|
|||
serialize = {}
|
||||
if hasattr(self, "cfg") and self.cfg is not None:
|
||||
serialize["cfg"] = lambda: srsly.json_dumps(self.cfg)
|
||||
serialize["vocab"] = self.vocab.to_bytes
|
||||
serialize["vocab"] = lambda: self.vocab.to_bytes(exclude=exclude)
|
||||
serialize["model"] = self.model.to_bytes
|
||||
return util.to_bytes(serialize, exclude)
|
||||
|
||||
|
@ -296,7 +296,7 @@ cdef class TrainablePipe(Pipe):
|
|||
deserialize = {}
|
||||
if hasattr(self, "cfg") and self.cfg is not None:
|
||||
deserialize["cfg"] = lambda b: self.cfg.update(srsly.json_loads(b))
|
||||
deserialize["vocab"] = lambda b: self.vocab.from_bytes(b)
|
||||
deserialize["vocab"] = lambda b: self.vocab.from_bytes(b, exclude=exclude)
|
||||
deserialize["model"] = load_model
|
||||
util.from_bytes(bytes_data, deserialize, exclude)
|
||||
return self
|
||||
|
@ -313,7 +313,7 @@ cdef class TrainablePipe(Pipe):
|
|||
serialize = {}
|
||||
if hasattr(self, "cfg") and self.cfg is not None:
|
||||
serialize["cfg"] = lambda p: srsly.write_json(p, self.cfg)
|
||||
serialize["vocab"] = lambda p: self.vocab.to_disk(p)
|
||||
serialize["vocab"] = lambda p: self.vocab.to_disk(p, exclude=exclude)
|
||||
serialize["model"] = lambda p: self.model.to_disk(p)
|
||||
util.to_disk(path, serialize, exclude)
|
||||
|
||||
|
@ -338,7 +338,7 @@ cdef class TrainablePipe(Pipe):
|
|||
deserialize = {}
|
||||
if hasattr(self, "cfg") and self.cfg is not None:
|
||||
deserialize["cfg"] = lambda p: self.cfg.update(deserialize_config(p))
|
||||
deserialize["vocab"] = lambda p: self.vocab.from_disk(p)
|
||||
deserialize["vocab"] = lambda p: self.vocab.from_disk(p, exclude=exclude)
|
||||
deserialize["model"] = load_model
|
||||
util.from_disk(path, deserialize, exclude)
|
||||
return self
|
||||
|
|
|
@ -569,7 +569,7 @@ cdef class Parser(TrainablePipe):
|
|||
def to_disk(self, path, exclude=tuple()):
|
||||
serializers = {
|
||||
"model": lambda p: (self.model.to_disk(p) if self.model is not True else True),
|
||||
"vocab": lambda p: self.vocab.to_disk(p),
|
||||
"vocab": lambda p: self.vocab.to_disk(p, exclude=exclude),
|
||||
"moves": lambda p: self.moves.to_disk(p, exclude=["strings"]),
|
||||
"cfg": lambda p: srsly.write_json(p, self.cfg)
|
||||
}
|
||||
|
@ -577,7 +577,7 @@ cdef class Parser(TrainablePipe):
|
|||
|
||||
def from_disk(self, path, exclude=tuple()):
|
||||
deserializers = {
|
||||
"vocab": lambda p: self.vocab.from_disk(p),
|
||||
"vocab": lambda p: self.vocab.from_disk(p, exclude=exclude),
|
||||
"moves": lambda p: self.moves.from_disk(p, exclude=["strings"]),
|
||||
"cfg": lambda p: self.cfg.update(srsly.read_json(p)),
|
||||
"model": lambda p: None,
|
||||
|
@ -597,7 +597,7 @@ cdef class Parser(TrainablePipe):
|
|||
def to_bytes(self, exclude=tuple()):
|
||||
serializers = {
|
||||
"model": lambda: (self.model.to_bytes()),
|
||||
"vocab": lambda: self.vocab.to_bytes(),
|
||||
"vocab": lambda: self.vocab.to_bytes(exclude=exclude),
|
||||
"moves": lambda: self.moves.to_bytes(exclude=["strings"]),
|
||||
"cfg": lambda: srsly.json_dumps(self.cfg, indent=2, sort_keys=True)
|
||||
}
|
||||
|
@ -605,7 +605,7 @@ cdef class Parser(TrainablePipe):
|
|||
|
||||
def from_bytes(self, bytes_data, exclude=tuple()):
|
||||
deserializers = {
|
||||
"vocab": lambda b: self.vocab.from_bytes(b),
|
||||
"vocab": lambda b: self.vocab.from_bytes(b, exclude=exclude),
|
||||
"moves": lambda b: self.moves.from_bytes(b, exclude=["strings"]),
|
||||
"cfg": lambda b: self.cfg.update(srsly.json_loads(b)),
|
||||
"model": lambda b: None,
|
||||
|
|
|
@ -159,6 +159,7 @@ class TokenPatternString(BaseModel):
|
|||
NOT_IN: Optional[List[StrictStr]] = Field(None, alias="not_in")
|
||||
IS_SUBSET: Optional[List[StrictStr]] = Field(None, alias="is_subset")
|
||||
IS_SUPERSET: Optional[List[StrictStr]] = Field(None, alias="is_superset")
|
||||
INTERSECTS: Optional[List[StrictStr]] = Field(None, alias="intersects")
|
||||
|
||||
class Config:
|
||||
extra = "forbid"
|
||||
|
@ -175,8 +176,9 @@ class TokenPatternNumber(BaseModel):
|
|||
REGEX: Optional[StrictStr] = Field(None, alias="regex")
|
||||
IN: Optional[List[StrictInt]] = Field(None, alias="in")
|
||||
NOT_IN: Optional[List[StrictInt]] = Field(None, alias="not_in")
|
||||
ISSUBSET: Optional[List[StrictInt]] = Field(None, alias="issubset")
|
||||
ISSUPERSET: Optional[List[StrictInt]] = Field(None, alias="issuperset")
|
||||
IS_SUBSET: Optional[List[StrictInt]] = Field(None, alias="is_subset")
|
||||
IS_SUPERSET: Optional[List[StrictInt]] = Field(None, alias="is_superset")
|
||||
INTERSECTS: Optional[List[StrictInt]] = Field(None, alias="intersects")
|
||||
EQ: Union[StrictInt, StrictFloat] = Field(None, alias="==")
|
||||
NEQ: Union[StrictInt, StrictFloat] = Field(None, alias="!=")
|
||||
GEQ: Union[StrictInt, StrictFloat] = Field(None, alias=">=")
|
||||
|
|
22
spacy/strings.pyi
Normal file
22
spacy/strings.pyi
Normal file
|
@ -0,0 +1,22 @@
|
|||
from typing import Optional, Iterable, Iterator, Union, Any
|
||||
from pathlib import Path
|
||||
|
||||
def get_string_id(key: str) -> int: ...
|
||||
|
||||
class StringStore:
|
||||
def __init__(
|
||||
self, strings: Optional[Iterable[str]] = ..., freeze: bool = ...
|
||||
) -> None: ...
|
||||
def __getitem__(self, string_or_id: Union[bytes, str, int]) -> Union[str, int]: ...
|
||||
def as_int(self, key: Union[bytes, str, int]) -> int: ...
|
||||
def as_string(self, key: Union[bytes, str, int]) -> str: ...
|
||||
def add(self, string: str) -> int: ...
|
||||
def __len__(self) -> int: ...
|
||||
def __contains__(self, string: str) -> bool: ...
|
||||
def __iter__(self) -> Iterator[str]: ...
|
||||
def __reduce__(self) -> Any: ...
|
||||
def to_disk(self, path: Union[str, Path]) -> None: ...
|
||||
def from_disk(self, path: Union[str, Path]) -> StringStore: ...
|
||||
def to_bytes(self, **kwargs: Any) -> bytes: ...
|
||||
def from_bytes(self, bytes_data: bytes, **kwargs: Any) -> StringStore: ...
|
||||
def _reset_and_load(self, strings: Iterable[str]) -> None: ...
|
|
@ -357,6 +357,9 @@ def test_span_eq_hash(doc, doc_not_parsed):
|
|||
assert hash(doc[0:2]) != hash(doc[1:3])
|
||||
assert hash(doc[0:2]) != hash(doc_not_parsed[0:2])
|
||||
|
||||
# check that an out-of-bounds is not equivalent to the span of the full doc
|
||||
assert doc[0 : len(doc)] != doc[len(doc) : len(doc) + 1]
|
||||
|
||||
|
||||
def test_span_boundaries(doc):
|
||||
start = 1
|
||||
|
@ -369,6 +372,33 @@ def test_span_boundaries(doc):
|
|||
with pytest.raises(IndexError):
|
||||
span[5]
|
||||
|
||||
empty_span_0 = doc[0:0]
|
||||
assert empty_span_0.text == ""
|
||||
assert empty_span_0.start == 0
|
||||
assert empty_span_0.end == 0
|
||||
assert empty_span_0.start_char == 0
|
||||
assert empty_span_0.end_char == 0
|
||||
|
||||
empty_span_1 = doc[1:1]
|
||||
assert empty_span_1.text == ""
|
||||
assert empty_span_1.start == 1
|
||||
assert empty_span_1.end == 1
|
||||
assert empty_span_1.start_char == empty_span_1.end_char
|
||||
|
||||
oob_span_start = doc[-len(doc) - 1 : -len(doc) - 10]
|
||||
assert oob_span_start.text == ""
|
||||
assert oob_span_start.start == 0
|
||||
assert oob_span_start.end == 0
|
||||
assert oob_span_start.start_char == 0
|
||||
assert oob_span_start.end_char == 0
|
||||
|
||||
oob_span_end = doc[len(doc) + 1 : len(doc) + 10]
|
||||
assert oob_span_end.text == ""
|
||||
assert oob_span_end.start == len(doc)
|
||||
assert oob_span_end.end == len(doc)
|
||||
assert oob_span_end.start_char == len(doc.text)
|
||||
assert oob_span_end.end_char == len(doc.text)
|
||||
|
||||
|
||||
def test_span_lemma(doc):
|
||||
# span lemmas should have the same number of spaces as the span
|
||||
|
|
|
@ -270,6 +270,16 @@ def test_matcher_subset_value_operator(en_vocab):
|
|||
doc[0].tag_ = "A"
|
||||
assert len(matcher(doc)) == 0
|
||||
|
||||
# IS_SUBSET with a list value
|
||||
Token.set_extension("ext", default=[])
|
||||
matcher = Matcher(en_vocab)
|
||||
pattern = [{"_": {"ext": {"IS_SUBSET": ["A", "B"]}}}]
|
||||
matcher.add("M", [pattern])
|
||||
doc = Doc(en_vocab, words=["a", "b", "c"])
|
||||
doc[0]._.ext = ["A"]
|
||||
doc[1]._.ext = ["C", "D"]
|
||||
assert len(matcher(doc)) == 2
|
||||
|
||||
|
||||
def test_matcher_superset_value_operator(en_vocab):
|
||||
matcher = Matcher(en_vocab)
|
||||
|
@ -308,6 +318,72 @@ def test_matcher_superset_value_operator(en_vocab):
|
|||
doc[0].tag_ = "A"
|
||||
assert len(matcher(doc)) == 3
|
||||
|
||||
# IS_SUPERSET with a list value
|
||||
Token.set_extension("ext", default=[])
|
||||
matcher = Matcher(en_vocab)
|
||||
pattern = [{"_": {"ext": {"IS_SUPERSET": ["A"]}}}]
|
||||
matcher.add("M", [pattern])
|
||||
doc = Doc(en_vocab, words=["a", "b", "c"])
|
||||
doc[0]._.ext = ["A", "B"]
|
||||
assert len(matcher(doc)) == 1
|
||||
|
||||
|
||||
def test_matcher_intersect_value_operator(en_vocab):
|
||||
matcher = Matcher(en_vocab)
|
||||
pattern = [{"MORPH": {"INTERSECTS": ["Feat=Val", "Feat2=Val2", "Feat3=Val3"]}}]
|
||||
matcher.add("M", [pattern])
|
||||
doc = Doc(en_vocab, words=["a", "b", "c"])
|
||||
assert len(matcher(doc)) == 0
|
||||
doc[0].set_morph("Feat=Val")
|
||||
assert len(matcher(doc)) == 1
|
||||
doc[0].set_morph("Feat=Val|Feat2=Val2")
|
||||
assert len(matcher(doc)) == 1
|
||||
doc[0].set_morph("Feat=Val|Feat2=Val2|Feat3=Val3")
|
||||
assert len(matcher(doc)) == 1
|
||||
doc[0].set_morph("Feat=Val|Feat2=Val2|Feat3=Val3|Feat4=Val4")
|
||||
assert len(matcher(doc)) == 1
|
||||
|
||||
# INTERSECTS with a single value is the same as IN
|
||||
matcher = Matcher(en_vocab)
|
||||
pattern = [{"TAG": {"INTERSECTS": ["A", "B"]}}]
|
||||
matcher.add("M", [pattern])
|
||||
doc = Doc(en_vocab, words=["a", "b", "c"])
|
||||
doc[0].tag_ = "A"
|
||||
assert len(matcher(doc)) == 1
|
||||
|
||||
# INTERSECTS with an empty pattern list matches nothing
|
||||
matcher = Matcher(en_vocab)
|
||||
pattern = [{"TAG": {"INTERSECTS": []}}]
|
||||
matcher.add("M", [pattern])
|
||||
doc = Doc(en_vocab, words=["a", "b", "c"])
|
||||
doc[0].tag_ = "A"
|
||||
assert len(matcher(doc)) == 0
|
||||
|
||||
# INTERSECTS with a list value
|
||||
Token.set_extension("ext", default=[])
|
||||
matcher = Matcher(en_vocab)
|
||||
pattern = [{"_": {"ext": {"INTERSECTS": ["A", "C"]}}}]
|
||||
matcher.add("M", [pattern])
|
||||
doc = Doc(en_vocab, words=["a", "b", "c"])
|
||||
doc[0]._.ext = ["A", "B"]
|
||||
assert len(matcher(doc)) == 1
|
||||
|
||||
# INTERSECTS with an empty pattern list matches nothing
|
||||
matcher = Matcher(en_vocab)
|
||||
pattern = [{"_": {"ext": {"INTERSECTS": []}}}]
|
||||
matcher.add("M", [pattern])
|
||||
doc = Doc(en_vocab, words=["a", "b", "c"])
|
||||
doc[0]._.ext = ["A", "B"]
|
||||
assert len(matcher(doc)) == 0
|
||||
|
||||
# INTERSECTS with an empty value matches nothing
|
||||
matcher = Matcher(en_vocab)
|
||||
pattern = [{"_": {"ext": {"INTERSECTS": ["A", "B"]}}}]
|
||||
matcher.add("M", [pattern])
|
||||
doc = Doc(en_vocab, words=["a", "b", "c"])
|
||||
doc[0]._.ext = []
|
||||
assert len(matcher(doc)) == 0
|
||||
|
||||
|
||||
def test_matcher_morph_handling(en_vocab):
|
||||
# order of features in pattern doesn't matter
|
||||
|
|
|
@ -1,9 +1,11 @@
|
|||
import pytest
|
||||
from numpy.testing import assert_equal
|
||||
from numpy.testing import assert_equal, assert_array_equal
|
||||
from thinc.api import get_current_ops
|
||||
from spacy.language import Language
|
||||
from spacy.training import Example
|
||||
from spacy.util import fix_random_seed, registry
|
||||
|
||||
OPS = get_current_ops()
|
||||
|
||||
SPAN_KEY = "labeled_spans"
|
||||
|
||||
|
@ -116,12 +118,15 @@ def test_ngram_suggester(en_tokenizer):
|
|||
for span in spans:
|
||||
assert 0 <= span[0] < len(doc)
|
||||
assert 0 < span[1] <= len(doc)
|
||||
spans_set.add((span[0], span[1]))
|
||||
spans_set.add((int(span[0]), int(span[1])))
|
||||
# spans are unique
|
||||
assert spans.shape[0] == len(spans_set)
|
||||
offset += ngrams.lengths[i]
|
||||
# the number of spans is correct
|
||||
assert_equal(ngrams.lengths, [max(0, len(doc) - (size - 1)) for doc in docs])
|
||||
assert_array_equal(
|
||||
OPS.to_numpy(ngrams.lengths),
|
||||
[max(0, len(doc) - (size - 1)) for doc in docs],
|
||||
)
|
||||
|
||||
# test 1-3-gram suggestions
|
||||
ngram_suggester = registry.misc.get("spacy.ngram_suggester.v1")(sizes=[1, 2, 3])
|
||||
|
@ -129,9 +134,9 @@ def test_ngram_suggester(en_tokenizer):
|
|||
en_tokenizer(text) for text in ["a", "a b", "a b c", "a b c d", "a b c d e"]
|
||||
]
|
||||
ngrams = ngram_suggester(docs)
|
||||
assert_equal(ngrams.lengths, [1, 3, 6, 9, 12])
|
||||
assert_equal(
|
||||
ngrams.data,
|
||||
assert_array_equal(OPS.to_numpy(ngrams.lengths), [1, 3, 6, 9, 12])
|
||||
assert_array_equal(
|
||||
OPS.to_numpy(ngrams.data),
|
||||
[
|
||||
# doc 0
|
||||
[0, 1],
|
||||
|
@ -176,13 +181,13 @@ def test_ngram_suggester(en_tokenizer):
|
|||
ngram_suggester = registry.misc.get("spacy.ngram_suggester.v1")(sizes=[1])
|
||||
docs = [en_tokenizer(text) for text in ["", "a", ""]]
|
||||
ngrams = ngram_suggester(docs)
|
||||
assert_equal(ngrams.lengths, [len(doc) for doc in docs])
|
||||
assert_array_equal(OPS.to_numpy(ngrams.lengths), [len(doc) for doc in docs])
|
||||
|
||||
# test all empty docs
|
||||
ngram_suggester = registry.misc.get("spacy.ngram_suggester.v1")(sizes=[1])
|
||||
docs = [en_tokenizer(text) for text in ["", "", ""]]
|
||||
ngrams = ngram_suggester(docs)
|
||||
assert_equal(ngrams.lengths, [len(doc) for doc in docs])
|
||||
assert_array_equal(OPS.to_numpy(ngrams.lengths), [len(doc) for doc in docs])
|
||||
|
||||
|
||||
def test_ngram_sizes(en_tokenizer):
|
||||
|
@ -195,12 +200,12 @@ def test_ngram_sizes(en_tokenizer):
|
|||
]
|
||||
ngrams_1 = size_suggester(docs)
|
||||
ngrams_2 = range_suggester(docs)
|
||||
assert_equal(ngrams_1.lengths, [1, 3, 6, 9, 12])
|
||||
assert_equal(ngrams_1.lengths, ngrams_2.lengths)
|
||||
assert_equal(ngrams_1.data, ngrams_2.data)
|
||||
assert_array_equal(OPS.to_numpy(ngrams_1.lengths), [1, 3, 6, 9, 12])
|
||||
assert_array_equal(OPS.to_numpy(ngrams_1.lengths), OPS.to_numpy(ngrams_2.lengths))
|
||||
assert_array_equal(OPS.to_numpy(ngrams_1.data), OPS.to_numpy(ngrams_2.data))
|
||||
|
||||
# one more variation
|
||||
suggester_factory = registry.misc.get("spacy.ngram_range_suggester.v1")
|
||||
range_suggester = suggester_factory(min_size=2, max_size=4)
|
||||
ngrams_3 = range_suggester(docs)
|
||||
assert_equal(ngrams_3.lengths, [0, 1, 3, 6, 9])
|
||||
assert_array_equal(OPS.to_numpy(ngrams_3.lengths), [0, 1, 3, 6, 9])
|
||||
|
|
|
@ -1,5 +1,5 @@
|
|||
import pytest
|
||||
from spacy import registry, Vocab
|
||||
from spacy import registry, Vocab, load
|
||||
from spacy.pipeline import Tagger, DependencyParser, EntityRecognizer
|
||||
from spacy.pipeline import TextCategorizer, SentenceRecognizer, TrainablePipe
|
||||
from spacy.pipeline.dep_parser import DEFAULT_PARSER_MODEL
|
||||
|
@ -268,3 +268,21 @@ def test_serialize_custom_trainable_pipe():
|
|||
pipe.to_disk(d)
|
||||
new_pipe = CustomPipe(Vocab(), Linear()).from_disk(d)
|
||||
assert new_pipe.to_bytes() == pipe_bytes
|
||||
|
||||
|
||||
def test_load_without_strings():
|
||||
nlp = spacy.blank("en")
|
||||
orig_strings_length = len(nlp.vocab.strings)
|
||||
word = "unlikely_word_" * 20
|
||||
nlp.vocab.strings.add(word)
|
||||
assert len(nlp.vocab.strings) == orig_strings_length + 1
|
||||
with make_tempdir() as d:
|
||||
nlp.to_disk(d)
|
||||
# reload with strings
|
||||
reloaded_nlp = load(d)
|
||||
assert len(nlp.vocab.strings) == len(reloaded_nlp.vocab.strings)
|
||||
assert word in reloaded_nlp.vocab.strings
|
||||
# reload without strings
|
||||
reloaded_nlp = load(d, exclude=["strings"])
|
||||
assert orig_strings_length == len(reloaded_nlp.vocab.strings)
|
||||
assert word not in reloaded_nlp.vocab.strings
|
||||
|
|
|
@ -765,7 +765,7 @@ cdef class Tokenizer:
|
|||
DOCS: https://spacy.io/api/tokenizer#to_bytes
|
||||
"""
|
||||
serializers = {
|
||||
"vocab": lambda: self.vocab.to_bytes(),
|
||||
"vocab": lambda: self.vocab.to_bytes(exclude=exclude),
|
||||
"prefix_search": lambda: _get_regex_pattern(self.prefix_search),
|
||||
"suffix_search": lambda: _get_regex_pattern(self.suffix_search),
|
||||
"infix_finditer": lambda: _get_regex_pattern(self.infix_finditer),
|
||||
|
@ -786,7 +786,7 @@ cdef class Tokenizer:
|
|||
"""
|
||||
data = {}
|
||||
deserializers = {
|
||||
"vocab": lambda b: self.vocab.from_bytes(b),
|
||||
"vocab": lambda b: self.vocab.from_bytes(b, exclude=exclude),
|
||||
"prefix_search": lambda b: data.setdefault("prefix_search", b),
|
||||
"suffix_search": lambda b: data.setdefault("suffix_search", b),
|
||||
"infix_finditer": lambda b: data.setdefault("infix_finditer", b),
|
||||
|
|
17
spacy/tokens/_retokenize.pyi
Normal file
17
spacy/tokens/_retokenize.pyi
Normal file
|
@ -0,0 +1,17 @@
|
|||
from typing import Dict, Any, Union, List, Tuple
|
||||
from .doc import Doc
|
||||
from .span import Span
|
||||
from .token import Token
|
||||
|
||||
class Retokenizer:
|
||||
def __init__(self, doc: Doc) -> None: ...
|
||||
def merge(self, span: Span, attrs: Dict[Union[str, int], Any] = ...) -> None: ...
|
||||
def split(
|
||||
self,
|
||||
token: Token,
|
||||
orths: List[str],
|
||||
heads: List[Union[Token, Tuple[Token, int]]],
|
||||
attrs: Dict[Union[str, int], List[Any]] = ...,
|
||||
) -> None: ...
|
||||
def __enter__(self) -> Retokenizer: ...
|
||||
def __exit__(self, *args: Any) -> None: ...
|
180
spacy/tokens/doc.pyi
Normal file
180
spacy/tokens/doc.pyi
Normal file
|
@ -0,0 +1,180 @@
|
|||
from typing import (
|
||||
Callable,
|
||||
Protocol,
|
||||
Iterable,
|
||||
Iterator,
|
||||
Optional,
|
||||
Union,
|
||||
Tuple,
|
||||
List,
|
||||
Dict,
|
||||
Any,
|
||||
overload,
|
||||
)
|
||||
from cymem.cymem import Pool
|
||||
from thinc.types import Floats1d, Floats2d, Ints2d
|
||||
from .span import Span
|
||||
from .token import Token
|
||||
from ._dict_proxies import SpanGroups
|
||||
from ._retokenize import Retokenizer
|
||||
from ..lexeme import Lexeme
|
||||
from ..vocab import Vocab
|
||||
from .underscore import Underscore
|
||||
from pathlib import Path
|
||||
import numpy
|
||||
|
||||
class DocMethod(Protocol):
|
||||
def __call__(self: Doc, *args: Any, **kwargs: Any) -> Any: ...
|
||||
|
||||
class Doc:
|
||||
vocab: Vocab
|
||||
mem: Pool
|
||||
spans: SpanGroups
|
||||
max_length: int
|
||||
length: int
|
||||
sentiment: float
|
||||
cats: Dict[str, float]
|
||||
user_hooks: Dict[str, Callable[..., Any]]
|
||||
user_token_hooks: Dict[str, Callable[..., Any]]
|
||||
user_span_hooks: Dict[str, Callable[..., Any]]
|
||||
tensor: numpy.ndarray
|
||||
user_data: Dict[str, Any]
|
||||
has_unknown_spaces: bool
|
||||
@classmethod
|
||||
def set_extension(
|
||||
cls,
|
||||
name: str,
|
||||
default: Optional[Any] = ...,
|
||||
getter: Optional[Callable[[Doc], Any]] = ...,
|
||||
setter: Optional[Callable[[Doc, Any], None]] = ...,
|
||||
method: Optional[DocMethod] = ...,
|
||||
force: bool = ...,
|
||||
) -> None: ...
|
||||
@classmethod
|
||||
def get_extension(
|
||||
cls, name: str
|
||||
) -> Tuple[
|
||||
Optional[Any],
|
||||
Optional[DocMethod],
|
||||
Optional[Callable[[Doc], Any]],
|
||||
Optional[Callable[[Doc, Any], None]],
|
||||
]: ...
|
||||
@classmethod
|
||||
def has_extension(cls, name: str) -> bool: ...
|
||||
@classmethod
|
||||
def remove_extension(
|
||||
cls, name: str
|
||||
) -> Tuple[
|
||||
Optional[Any],
|
||||
Optional[DocMethod],
|
||||
Optional[Callable[[Doc], Any]],
|
||||
Optional[Callable[[Doc, Any], None]],
|
||||
]: ...
|
||||
def __init__(
|
||||
self,
|
||||
vocab: Vocab,
|
||||
words: Optional[List[str]] = ...,
|
||||
spaces: Optional[List[bool]] = ...,
|
||||
user_data: Optional[Dict[Any, Any]] = ...,
|
||||
tags: Optional[List[str]] = ...,
|
||||
pos: Optional[List[str]] = ...,
|
||||
morphs: Optional[List[str]] = ...,
|
||||
lemmas: Optional[List[str]] = ...,
|
||||
heads: Optional[List[int]] = ...,
|
||||
deps: Optional[List[str]] = ...,
|
||||
sent_starts: Optional[List[Union[bool, None]]] = ...,
|
||||
ents: Optional[List[str]] = ...,
|
||||
) -> None: ...
|
||||
@property
|
||||
def _(self) -> Underscore: ...
|
||||
@property
|
||||
def is_tagged(self) -> bool: ...
|
||||
@property
|
||||
def is_parsed(self) -> bool: ...
|
||||
@property
|
||||
def is_nered(self) -> bool: ...
|
||||
@property
|
||||
def is_sentenced(self) -> bool: ...
|
||||
def has_annotation(
|
||||
self, attr: Union[int, str], *, require_complete: bool = ...
|
||||
) -> bool: ...
|
||||
@overload
|
||||
def __getitem__(self, i: int) -> Token: ...
|
||||
@overload
|
||||
def __getitem__(self, i: slice) -> Span: ...
|
||||
def __iter__(self) -> Iterator[Token]: ...
|
||||
def __len__(self) -> int: ...
|
||||
def __unicode__(self) -> str: ...
|
||||
def __bytes__(self) -> bytes: ...
|
||||
def __str__(self) -> str: ...
|
||||
def __repr__(self) -> str: ...
|
||||
@property
|
||||
def doc(self) -> Doc: ...
|
||||
def char_span(
|
||||
self,
|
||||
start_idx: int,
|
||||
end_idx: int,
|
||||
label: Union[int, str] = ...,
|
||||
kb_id: Union[int, str] = ...,
|
||||
vector: Optional[Floats1d] = ...,
|
||||
alignment_mode: str = ...,
|
||||
) -> Span: ...
|
||||
def similarity(self, other: Union[Doc, Span, Token, Lexeme]) -> float: ...
|
||||
@property
|
||||
def has_vector(self) -> bool: ...
|
||||
vector: Floats1d
|
||||
vector_norm: float
|
||||
@property
|
||||
def text(self) -> str: ...
|
||||
@property
|
||||
def text_with_ws(self) -> str: ...
|
||||
ents: Tuple[Span]
|
||||
def set_ents(
|
||||
self,
|
||||
entities: List[Span],
|
||||
*,
|
||||
blocked: Optional[List[Span]] = ...,
|
||||
missing: Optional[List[Span]] = ...,
|
||||
outside: Optional[List[Span]] = ...,
|
||||
default: str = ...
|
||||
) -> None: ...
|
||||
@property
|
||||
def noun_chunks(self) -> Iterator[Span]: ...
|
||||
@property
|
||||
def sents(self) -> Iterator[Span]: ...
|
||||
@property
|
||||
def lang(self) -> int: ...
|
||||
@property
|
||||
def lang_(self) -> str: ...
|
||||
def count_by(
|
||||
self, attr_id: int, exclude: Optional[Any] = ..., counts: Optional[Any] = ...
|
||||
) -> Dict[Any, int]: ...
|
||||
def from_array(self, attrs: List[int], array: Ints2d) -> Doc: ...
|
||||
@staticmethod
|
||||
def from_docs(
|
||||
docs: List[Doc],
|
||||
ensure_whitespace: bool = ...,
|
||||
attrs: Optional[Union[Tuple[Union[str, int]], List[Union[int, str]]]] = ...,
|
||||
) -> Doc: ...
|
||||
def get_lca_matrix(self) -> Ints2d: ...
|
||||
def copy(self) -> Doc: ...
|
||||
def to_disk(
|
||||
self, path: Union[str, Path], *, exclude: Iterable[str] = ...
|
||||
) -> None: ...
|
||||
def from_disk(
|
||||
self, path: Union[str, Path], *, exclude: Union[List[str], Tuple[str]] = ...
|
||||
) -> Doc: ...
|
||||
def to_bytes(self, *, exclude: Union[List[str], Tuple[str]] = ...) -> bytes: ...
|
||||
def from_bytes(
|
||||
self, bytes_data: bytes, *, exclude: Union[List[str], Tuple[str]] = ...
|
||||
) -> Doc: ...
|
||||
def to_dict(self, *, exclude: Union[List[str], Tuple[str]] = ...) -> bytes: ...
|
||||
def from_dict(
|
||||
self, msg: bytes, *, exclude: Union[List[str], Tuple[str]] = ...
|
||||
) -> Doc: ...
|
||||
def extend_tensor(self, tensor: Floats2d) -> None: ...
|
||||
def retokenize(self) -> Retokenizer: ...
|
||||
def to_json(self, underscore: Optional[List[str]] = ...) -> Dict[str, Any]: ...
|
||||
def to_utf8_array(self, nr_char: int = ...) -> Ints2d: ...
|
||||
@staticmethod
|
||||
def _get_array_attrs() -> Tuple[Any]: ...
|
20
spacy/tokens/morphanalysis.pyi
Normal file
20
spacy/tokens/morphanalysis.pyi
Normal file
|
@ -0,0 +1,20 @@
|
|||
from typing import Any, Dict, Iterator, List, Union
|
||||
from ..vocab import Vocab
|
||||
|
||||
class MorphAnalysis:
|
||||
def __init__(
|
||||
self, vocab: Vocab, features: Union[Dict[str, str], str] = ...
|
||||
) -> None: ...
|
||||
@classmethod
|
||||
def from_id(cls, vocab: Vocab, key: Any) -> MorphAnalysis: ...
|
||||
def __contains__(self, feature: str) -> bool: ...
|
||||
def __iter__(self) -> Iterator[str]: ...
|
||||
def __len__(self) -> int: ...
|
||||
def __hash__(self) -> int: ...
|
||||
def __eq__(self, other: MorphAnalysis) -> bool: ...
|
||||
def __ne__(self, other: MorphAnalysis) -> bool: ...
|
||||
def get(self, field: Any) -> List[str]: ...
|
||||
def to_json(self) -> str: ...
|
||||
def to_dict(self) -> Dict[str, str]: ...
|
||||
def __str__(self) -> str: ...
|
||||
def __repr__(self) -> str: ...
|
124
spacy/tokens/span.pyi
Normal file
124
spacy/tokens/span.pyi
Normal file
|
@ -0,0 +1,124 @@
|
|||
from typing import Callable, Protocol, Iterator, Optional, Union, Tuple, Any, overload
|
||||
from thinc.types import Floats1d, Ints2d, FloatsXd
|
||||
from .doc import Doc
|
||||
from .token import Token
|
||||
from .underscore import Underscore
|
||||
from ..lexeme import Lexeme
|
||||
from ..vocab import Vocab
|
||||
|
||||
class SpanMethod(Protocol):
|
||||
def __call__(self: Span, *args: Any, **kwargs: Any) -> Any: ...
|
||||
|
||||
class Span:
|
||||
@classmethod
|
||||
def set_extension(
|
||||
cls,
|
||||
name: str,
|
||||
default: Optional[Any] = ...,
|
||||
getter: Optional[Callable[[Span], Any]] = ...,
|
||||
setter: Optional[Callable[[Span, Any], None]] = ...,
|
||||
method: Optional[SpanMethod] = ...,
|
||||
force: bool = ...,
|
||||
) -> None: ...
|
||||
@classmethod
|
||||
def get_extension(
|
||||
cls, name: str
|
||||
) -> Tuple[
|
||||
Optional[Any],
|
||||
Optional[SpanMethod],
|
||||
Optional[Callable[[Span], Any]],
|
||||
Optional[Callable[[Span, Any], None]],
|
||||
]: ...
|
||||
@classmethod
|
||||
def has_extension(cls, name: str) -> bool: ...
|
||||
@classmethod
|
||||
def remove_extension(
|
||||
cls, name: str
|
||||
) -> Tuple[
|
||||
Optional[Any],
|
||||
Optional[SpanMethod],
|
||||
Optional[Callable[[Span], Any]],
|
||||
Optional[Callable[[Span, Any], None]],
|
||||
]: ...
|
||||
def __init__(
|
||||
self,
|
||||
doc: Doc,
|
||||
start: int,
|
||||
end: int,
|
||||
label: int = ...,
|
||||
vector: Optional[Floats1d] = ...,
|
||||
vector_norm: Optional[float] = ...,
|
||||
kb_id: Optional[int] = ...,
|
||||
) -> None: ...
|
||||
def __richcmp__(self, other: Span, op: int) -> bool: ...
|
||||
def __hash__(self) -> int: ...
|
||||
def __len__(self) -> int: ...
|
||||
def __repr__(self) -> str: ...
|
||||
@overload
|
||||
def __getitem__(self, i: int) -> Token: ...
|
||||
@overload
|
||||
def __getitem__(self, i: slice) -> Span: ...
|
||||
def __iter__(self) -> Iterator[Token]: ...
|
||||
@property
|
||||
def _(self) -> Underscore: ...
|
||||
def as_doc(self, *, copy_user_data: bool = ...) -> Doc: ...
|
||||
def get_lca_matrix(self) -> Ints2d: ...
|
||||
def similarity(self, other: Union[Doc, Span, Token, Lexeme]) -> float: ...
|
||||
@property
|
||||
def vocab(self) -> Vocab: ...
|
||||
@property
|
||||
def sent(self) -> Span: ...
|
||||
@property
|
||||
def ents(self) -> Tuple[Span]: ...
|
||||
@property
|
||||
def has_vector(self) -> bool: ...
|
||||
@property
|
||||
def vector(self) -> Floats1d: ...
|
||||
@property
|
||||
def vector_norm(self) -> float: ...
|
||||
@property
|
||||
def tensor(self) -> FloatsXd: ...
|
||||
@property
|
||||
def sentiment(self) -> float: ...
|
||||
@property
|
||||
def text(self) -> str: ...
|
||||
@property
|
||||
def text_with_ws(self) -> str: ...
|
||||
@property
|
||||
def noun_chunks(self) -> Iterator[Span]: ...
|
||||
@property
|
||||
def root(self) -> Token: ...
|
||||
def char_span(
|
||||
self,
|
||||
start_idx: int,
|
||||
end_idx: int,
|
||||
label: int = ...,
|
||||
kb_id: int = ...,
|
||||
vector: Optional[Floats1d] = ...,
|
||||
) -> Span: ...
|
||||
@property
|
||||
def conjuncts(self) -> Tuple[Token]: ...
|
||||
@property
|
||||
def lefts(self) -> Iterator[Token]: ...
|
||||
@property
|
||||
def rights(self) -> Iterator[Token]: ...
|
||||
@property
|
||||
def n_lefts(self) -> int: ...
|
||||
@property
|
||||
def n_rights(self) -> int: ...
|
||||
@property
|
||||
def subtree(self) -> Iterator[Token]: ...
|
||||
start: int
|
||||
end: int
|
||||
start_char: int
|
||||
end_char: int
|
||||
label: int
|
||||
kb_id: int
|
||||
ent_id: int
|
||||
ent_id_: str
|
||||
@property
|
||||
def orth_(self) -> str: ...
|
||||
@property
|
||||
def lemma_(self) -> str: ...
|
||||
label_: str
|
||||
kb_id_: str
|
|
@ -105,13 +105,18 @@ cdef class Span:
|
|||
if label not in doc.vocab.strings:
|
||||
raise ValueError(Errors.E084.format(label=label))
|
||||
|
||||
start_char = doc[start].idx if start < doc.length else len(doc.text)
|
||||
if start == end:
|
||||
end_char = start_char
|
||||
else:
|
||||
end_char = doc[end - 1].idx + len(doc[end - 1])
|
||||
self.c = SpanC(
|
||||
label=label,
|
||||
kb_id=kb_id,
|
||||
start=start,
|
||||
end=end,
|
||||
start_char=doc[start].idx if start < doc.length else 0,
|
||||
end_char=doc[end - 1].idx + len(doc[end - 1]) if end >= 1 else 0,
|
||||
start_char=start_char,
|
||||
end_char=end_char,
|
||||
)
|
||||
self._vector = vector
|
||||
self._vector_norm = vector_norm
|
||||
|
|
24
spacy/tokens/span_group.pyi
Normal file
24
spacy/tokens/span_group.pyi
Normal file
|
@ -0,0 +1,24 @@
|
|||
from typing import Any, Dict, Iterable
|
||||
from .doc import Doc
|
||||
from .span import Span
|
||||
|
||||
class SpanGroup:
|
||||
def __init__(
|
||||
self,
|
||||
doc: Doc,
|
||||
*,
|
||||
name: str = ...,
|
||||
attrs: Dict[str, Any] = ...,
|
||||
spans: Iterable[Span] = ...
|
||||
) -> None: ...
|
||||
def __repr__(self) -> str: ...
|
||||
@property
|
||||
def doc(self) -> Doc: ...
|
||||
@property
|
||||
def has_overlap(self) -> bool: ...
|
||||
def __len__(self) -> int: ...
|
||||
def append(self, span: Span) -> None: ...
|
||||
def extend(self, spans: Iterable[Span]) -> None: ...
|
||||
def __getitem__(self, i: int) -> Span: ...
|
||||
def to_bytes(self) -> bytes: ...
|
||||
def from_bytes(self, bytes_data: bytes) -> SpanGroup: ...
|
208
spacy/tokens/token.pyi
Normal file
208
spacy/tokens/token.pyi
Normal file
|
@ -0,0 +1,208 @@
|
|||
from typing import (
|
||||
Callable,
|
||||
Protocol,
|
||||
Iterator,
|
||||
Optional,
|
||||
Union,
|
||||
Tuple,
|
||||
Any,
|
||||
)
|
||||
from thinc.types import Floats1d, FloatsXd
|
||||
from .doc import Doc
|
||||
from .span import Span
|
||||
from .morphanalysis import MorphAnalysis
|
||||
from ..lexeme import Lexeme
|
||||
from ..vocab import Vocab
|
||||
from .underscore import Underscore
|
||||
|
||||
class TokenMethod(Protocol):
|
||||
def __call__(self: Token, *args: Any, **kwargs: Any) -> Any: ...
|
||||
|
||||
class Token:
|
||||
i: int
|
||||
doc: Doc
|
||||
vocab: Vocab
|
||||
@classmethod
|
||||
def set_extension(
|
||||
cls,
|
||||
name: str,
|
||||
default: Optional[Any] = ...,
|
||||
getter: Optional[Callable[[Token], Any]] = ...,
|
||||
setter: Optional[Callable[[Token, Any], None]] = ...,
|
||||
method: Optional[TokenMethod] = ...,
|
||||
force: bool = ...,
|
||||
) -> None: ...
|
||||
@classmethod
|
||||
def get_extension(
|
||||
cls, name: str
|
||||
) -> Tuple[
|
||||
Optional[Any],
|
||||
Optional[TokenMethod],
|
||||
Optional[Callable[[Token], Any]],
|
||||
Optional[Callable[[Token, Any], None]],
|
||||
]: ...
|
||||
@classmethod
|
||||
def has_extension(cls, name: str) -> bool: ...
|
||||
@classmethod
|
||||
def remove_extension(
|
||||
cls, name: str
|
||||
) -> Tuple[
|
||||
Optional[Any],
|
||||
Optional[TokenMethod],
|
||||
Optional[Callable[[Token], Any]],
|
||||
Optional[Callable[[Token, Any], None]],
|
||||
]: ...
|
||||
def __init__(self, vocab: Vocab, doc: Doc, offset: int) -> None: ...
|
||||
def __hash__(self) -> int: ...
|
||||
def __len__(self) -> int: ...
|
||||
def __unicode__(self) -> str: ...
|
||||
def __bytes__(self) -> bytes: ...
|
||||
def __str__(self) -> str: ...
|
||||
def __repr__(self) -> str: ...
|
||||
def __richcmp__(self, other: Token, op: int) -> bool: ...
|
||||
@property
|
||||
def _(self) -> Underscore: ...
|
||||
def nbor(self, i: int = ...) -> Token: ...
|
||||
def similarity(self, other: Union[Doc, Span, Token, Lexeme]) -> float: ...
|
||||
def has_morph(self) -> bool: ...
|
||||
morph: MorphAnalysis
|
||||
@property
|
||||
def lex(self) -> Lexeme: ...
|
||||
@property
|
||||
def lex_id(self) -> int: ...
|
||||
@property
|
||||
def rank(self) -> int: ...
|
||||
@property
|
||||
def text(self) -> str: ...
|
||||
@property
|
||||
def text_with_ws(self) -> str: ...
|
||||
@property
|
||||
def prob(self) -> float: ...
|
||||
@property
|
||||
def sentiment(self) -> float: ...
|
||||
@property
|
||||
def lang(self) -> int: ...
|
||||
@property
|
||||
def idx(self) -> int: ...
|
||||
@property
|
||||
def cluster(self) -> int: ...
|
||||
@property
|
||||
def orth(self) -> int: ...
|
||||
@property
|
||||
def lower(self) -> int: ...
|
||||
@property
|
||||
def norm(self) -> int: ...
|
||||
@property
|
||||
def shape(self) -> int: ...
|
||||
@property
|
||||
def prefix(self) -> int: ...
|
||||
@property
|
||||
def suffix(self) -> int: ...
|
||||
lemma: int
|
||||
pos: int
|
||||
tag: int
|
||||
dep: int
|
||||
@property
|
||||
def has_vector(self) -> bool: ...
|
||||
@property
|
||||
def vector(self) -> Floats1d: ...
|
||||
@property
|
||||
def vector_norm(self) -> float: ...
|
||||
@property
|
||||
def tensor(self) -> Optional[FloatsXd]: ...
|
||||
@property
|
||||
def n_lefts(self) -> int: ...
|
||||
@property
|
||||
def n_rights(self) -> int: ...
|
||||
@property
|
||||
def sent(self) -> Span: ...
|
||||
sent_start: bool
|
||||
is_sent_start: Optional[bool]
|
||||
is_sent_end: Optional[bool]
|
||||
@property
|
||||
def lefts(self) -> Iterator[Token]: ...
|
||||
@property
|
||||
def rights(self) -> Iterator[Token]: ...
|
||||
@property
|
||||
def children(self) -> Iterator[Token]: ...
|
||||
@property
|
||||
def subtree(self) -> Iterator[Token]: ...
|
||||
@property
|
||||
def left_edge(self) -> Token: ...
|
||||
@property
|
||||
def right_edge(self) -> Token: ...
|
||||
@property
|
||||
def ancestors(self) -> Iterator[Token]: ...
|
||||
def is_ancestor(self, descendant: Token) -> bool: ...
|
||||
def has_head(self) -> bool: ...
|
||||
head: Token
|
||||
@property
|
||||
def conjuncts(self) -> Tuple[Token]: ...
|
||||
ent_type: int
|
||||
ent_type_: str
|
||||
@property
|
||||
def ent_iob(self) -> int: ...
|
||||
@classmethod
|
||||
def iob_strings(cls) -> Tuple[str]: ...
|
||||
@property
|
||||
def ent_iob_(self) -> str: ...
|
||||
ent_id: int
|
||||
ent_id_: str
|
||||
ent_kb_id: int
|
||||
ent_kb_id_: str
|
||||
@property
|
||||
def whitespace_(self) -> str: ...
|
||||
@property
|
||||
def orth_(self) -> str: ...
|
||||
@property
|
||||
def lower_(self) -> str: ...
|
||||
norm_: str
|
||||
@property
|
||||
def shape_(self) -> str: ...
|
||||
@property
|
||||
def prefix_(self) -> str: ...
|
||||
@property
|
||||
def suffix_(self) -> str: ...
|
||||
@property
|
||||
def lang_(self) -> str: ...
|
||||
lemma_: str
|
||||
pos_: str
|
||||
tag_: str
|
||||
def has_dep(self) -> bool: ...
|
||||
dep_: str
|
||||
@property
|
||||
def is_oov(self) -> bool: ...
|
||||
@property
|
||||
def is_stop(self) -> bool: ...
|
||||
@property
|
||||
def is_alpha(self) -> bool: ...
|
||||
@property
|
||||
def is_ascii(self) -> bool: ...
|
||||
@property
|
||||
def is_digit(self) -> bool: ...
|
||||
@property
|
||||
def is_lower(self) -> bool: ...
|
||||
@property
|
||||
def is_upper(self) -> bool: ...
|
||||
@property
|
||||
def is_title(self) -> bool: ...
|
||||
@property
|
||||
def is_punct(self) -> bool: ...
|
||||
@property
|
||||
def is_space(self) -> bool: ...
|
||||
@property
|
||||
def is_bracket(self) -> bool: ...
|
||||
@property
|
||||
def is_quote(self) -> bool: ...
|
||||
@property
|
||||
def is_left_punct(self) -> bool: ...
|
||||
@property
|
||||
def is_right_punct(self) -> bool: ...
|
||||
@property
|
||||
def is_currency(self) -> bool: ...
|
||||
@property
|
||||
def like_url(self) -> bool: ...
|
||||
@property
|
||||
def like_num(self) -> bool: ...
|
||||
@property
|
||||
def like_email(self) -> bool: ...
|
|
@ -29,7 +29,7 @@ def console_logger(progress_bar: bool = False):
|
|||
def setup_printer(
|
||||
nlp: "Language", stdout: IO = sys.stdout, stderr: IO = sys.stderr
|
||||
) -> Tuple[Callable[[Optional[Dict[str, Any]]], None], Callable[[], None]]:
|
||||
write = lambda text: stdout.write(f"{text}\n")
|
||||
write = lambda text: print(text, file=stdout, flush=True)
|
||||
msg = Printer(no_print=True)
|
||||
# ensure that only trainable components are logged
|
||||
logged_pipes = [
|
||||
|
|
78
spacy/vocab.pyi
Normal file
78
spacy/vocab.pyi
Normal file
|
@ -0,0 +1,78 @@
|
|||
from typing import (
|
||||
Callable,
|
||||
Iterator,
|
||||
Optional,
|
||||
Union,
|
||||
Tuple,
|
||||
List,
|
||||
Dict,
|
||||
Any,
|
||||
)
|
||||
from thinc.types import Floats1d, FloatsXd
|
||||
from . import Language
|
||||
from .strings import StringStore
|
||||
from .lexeme import Lexeme
|
||||
from .lookups import Lookups
|
||||
from .tokens import Doc, Span
|
||||
from pathlib import Path
|
||||
|
||||
def create_vocab(
|
||||
lang: Language, defaults: Any, vectors_name: Optional[str] = ...
|
||||
) -> Vocab: ...
|
||||
|
||||
class Vocab:
|
||||
def __init__(
|
||||
self,
|
||||
lex_attr_getters: Optional[Dict[str, Callable[[str], Any]]] = ...,
|
||||
strings: Optional[Union[List[str], StringStore]] = ...,
|
||||
lookups: Optional[Lookups] = ...,
|
||||
oov_prob: float = ...,
|
||||
vectors_name: Optional[str] = ...,
|
||||
writing_system: Dict[str, Any] = ...,
|
||||
get_noun_chunks: Optional[Callable[[Union[Doc, Span]], Iterator[Span]]] = ...,
|
||||
) -> None: ...
|
||||
@property
|
||||
def lang(self) -> Language: ...
|
||||
def __len__(self) -> int: ...
|
||||
def add_flag(
|
||||
self, flag_getter: Callable[[str], bool], flag_id: int = ...
|
||||
) -> int: ...
|
||||
def __contains__(self, key: str) -> bool: ...
|
||||
def __iter__(self) -> Iterator[Lexeme]: ...
|
||||
def __getitem__(self, id_or_string: Union[str, int]) -> Lexeme: ...
|
||||
@property
|
||||
def vectors_length(self) -> int: ...
|
||||
def reset_vectors(
|
||||
self, *, width: Optional[int] = ..., shape: Optional[int] = ...
|
||||
) -> None: ...
|
||||
def prune_vectors(self, nr_row: int, batch_size: int = ...) -> Dict[str, float]: ...
|
||||
def get_vector(
|
||||
self,
|
||||
orth: Union[int, str],
|
||||
minn: Optional[int] = ...,
|
||||
maxn: Optional[int] = ...,
|
||||
) -> FloatsXd: ...
|
||||
def set_vector(self, orth: Union[int, str], vector: Floats1d) -> None: ...
|
||||
def has_vector(self, orth: Union[int, str]) -> bool: ...
|
||||
lookups: Lookups
|
||||
def to_disk(
|
||||
self, path: Union[str, Path], *, exclude: Union[List[str], Tuple[str]] = ...
|
||||
) -> None: ...
|
||||
def from_disk(
|
||||
self, path: Union[str, Path], *, exclude: Union[List[str], Tuple[str]] = ...
|
||||
) -> Vocab: ...
|
||||
def to_bytes(self, *, exclude: Union[List[str], Tuple[str]] = ...) -> bytes: ...
|
||||
def from_bytes(
|
||||
self, bytes_data: bytes, *, exclude: Union[List[str], Tuple[str]] = ...
|
||||
) -> Vocab: ...
|
||||
|
||||
def pickle_vocab(vocab: Vocab) -> Any: ...
|
||||
def unpickle_vocab(
|
||||
sstore: StringStore,
|
||||
vectors: Any,
|
||||
morphology: Any,
|
||||
data_dir: Any,
|
||||
lex_attr_getters: Any,
|
||||
lookups: Any,
|
||||
get_noun_chunks: Any,
|
||||
) -> Vocab: ...
|
|
@ -409,7 +409,7 @@ a single token vector given zero or more wordpiece vectors.
|
|||
>
|
||||
> ```ini
|
||||
> [model]
|
||||
> @architectures = "spacy.Tok2VecTransformer.v1"
|
||||
> @architectures = "spacy-transformers.Tok2VecTransformer.v1"
|
||||
> name = "albert-base-v2"
|
||||
> tokenizer_config = {"use_fast": false}
|
||||
> grad_factor = 1.0
|
||||
|
|
|
@ -90,7 +90,6 @@ Defines the `nlp` object, its tokenizer and
|
|||
> ```ini
|
||||
> [components.textcat]
|
||||
> factory = "textcat"
|
||||
> labels = ["POSITIVE", "NEGATIVE"]
|
||||
>
|
||||
> [components.textcat.model]
|
||||
> @architectures = "spacy.TextCatBOW.v2"
|
||||
|
|
|
@ -35,11 +35,11 @@ how the component should be configured. You can override its settings via the
|
|||
> ```
|
||||
|
||||
| Setting | Description |
|
||||
| --------------------- | --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | --- | ----------- |
|
||||
| --------------------- | --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
|
||||
| `phrase_matcher_attr` | Optional attribute name match on for the internal [`PhraseMatcher`](/api/phrasematcher), e.g. `LOWER` to match on the lowercase token text. Defaults to `None`. ~~Optional[Union[int, str]]~~ |
|
||||
| `validate` | Whether patterns should be validated (passed to the `Matcher` and `PhraseMatcher`). Defaults to `False`. ~~bool~~ |
|
||||
| `overwrite_ents` | If existing entities are present, e.g. entities added by the model, overwrite them by matches if necessary. Defaults to `False`. ~~bool~~ |
|
||||
| `ent_id_sep` | Separator used internally for entity IDs. Defaults to `" | | "`. ~~str~~ |
|
||||
| `ent_id_sep` | Separator used internally for entity IDs. Defaults to `"\|\|"`. ~~str~~ |
|
||||
|
||||
```python
|
||||
%%GITHUB_SPACY/spacy/pipeline/entityruler.py
|
||||
|
@ -64,14 +64,14 @@ be a token pattern (list) or a phrase pattern (string). For example:
|
|||
> ```
|
||||
|
||||
| Name | Description |
|
||||
| --------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | --- | ----------- |
|
||||
| --------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
|
||||
| `nlp` | The shared nlp object to pass the vocab to the matchers and process phrase patterns. ~~Language~~ |
|
||||
| `name` <Tag variant="new">3</Tag> | Instance name of the current pipeline component. Typically passed in automatically from the factory when the component is added. Used to disable the current entity ruler while creating phrase patterns with the nlp object. ~~str~~ |
|
||||
| _keyword-only_ | |
|
||||
| `phrase_matcher_attr` | Optional attribute name match on for the internal [`PhraseMatcher`](/api/phrasematcher), e.g. `LOWER` to match on the lowercase token text. Defaults to `None`. ~~Optional[Union[int, str]]~~ |
|
||||
| `validate` | Whether patterns should be validated, passed to Matcher and PhraseMatcher as `validate`. Defaults to `False`. ~~bool~~ |
|
||||
| `overwrite_ents` | If existing entities are present, e.g. entities added by the model, overwrite them by matches if necessary. Defaults to `False`. ~~bool~~ |
|
||||
| `ent_id_sep` | Separator used internally for entity IDs. Defaults to `" | | "`. ~~str~~ |
|
||||
| `ent_id_sep` | Separator used internally for entity IDs. Defaults to `"\|\|"`. ~~str~~ |
|
||||
| `patterns` | Optional patterns to load in on initialization. ~~Optional[List[Dict[str, Union[str, List[dict]]]]]~~ |
|
||||
|
||||
## EntityRuler.initialize {#initialize tag="method" new="3"}
|
||||
|
|
|
@ -77,13 +77,14 @@ it compares to another value.
|
|||
> ]
|
||||
> ```
|
||||
|
||||
| Attribute | Description |
|
||||
| -------------------------- | ------------------------------------------------------------------------------------------------------- |
|
||||
| `IN` | Attribute value is member of a list. ~~Any~~ |
|
||||
| `NOT_IN` | Attribute value is _not_ member of a list. ~~Any~~ |
|
||||
| `ISSUBSET` | Attribute values (for `MORPH`) are a subset of a list. ~~Any~~ |
|
||||
| `ISSUPERSET` | Attribute values (for `MORPH`) are a superset of a list. ~~Any~~ |
|
||||
| `==`, `>=`, `<=`, `>`, `<` | Attribute value is equal, greater or equal, smaller or equal, greater or smaller. ~~Union[int, float]~~ |
|
||||
| Attribute | Description |
|
||||
| -------------------------- | -------------------------------------------------------------------------------------------------------- |
|
||||
| `IN` | Attribute value is member of a list. ~~Any~~ |
|
||||
| `NOT_IN` | Attribute value is _not_ member of a list. ~~Any~~ |
|
||||
| `IS_SUBSET` | Attribute value (for `MORPH` or custom list attributes) is a subset of a list. ~~Any~~ |
|
||||
| `IS_SUPERSET` | Attribute value (for `MORPH` or custom list attributes) is a superset of a list. ~~Any~~ |
|
||||
| `INTERSECTS` | Attribute value (for `MORPH` or custom list attribute) has a non-empty intersection with a list. ~~Any~~ |
|
||||
| `==`, `>=`, `<=`, `>`, `<` | Attribute value is equal, greater or equal, smaller or equal, greater or smaller. ~~Union[int, float]~~ |
|
||||
|
||||
## Matcher.\_\_init\_\_ {#init tag="method"}
|
||||
|
||||
|
|
|
@ -29,7 +29,7 @@ Create the vocabulary.
|
|||
| `oov_prob` | The default OOV probability. Defaults to `-20.0`. ~~float~~ |
|
||||
| `vectors_name` <Tag variant="new">2.2</Tag> | A name to identify the vectors table. ~~str~~ |
|
||||
| `writing_system` | A dictionary describing the language's writing system. Typically provided by [`Language.Defaults`](/api/language#defaults). ~~Dict[str, Any]~~ |
|
||||
| `get_noun_chunks` | A function that yields base noun phrases used for [`Doc.noun_chunks`](/ap/doc#noun_chunks). ~~Optional[Callable[[Union[Doc, Span], Iterator[Span]]]]~~ |
|
||||
| `get_noun_chunks` | A function that yields base noun phrases used for [`Doc.noun_chunks`](/api/doc#noun_chunks). ~~Optional[Callable[[Union[Doc, Span], Iterator[Span]]]]~~ |
|
||||
|
||||
## Vocab.\_\_len\_\_ {#len tag="method"}
|
||||
|
||||
|
|
|
@ -232,15 +232,22 @@ following rich comparison attributes are available:
|
|||
>
|
||||
> # Matches tokens of length >= 10
|
||||
> pattern2 = [{"LENGTH": {">=": 10}}]
|
||||
>
|
||||
> # Match based on morph attributes
|
||||
> pattern3 = [{"MORPH": {"IS_SUBSET": ["Number=Sing", "Gender=Neut"]}}]
|
||||
> # "", "Number=Sing" and "Number=Sing|Gender=Neut" will match as subsets
|
||||
> # "Number=Plur|Gender=Neut" will not match
|
||||
> # "Number=Sing|Gender=Neut|Polite=Infm" will not match because it's a superset
|
||||
> ```
|
||||
|
||||
| Attribute | Description |
|
||||
| -------------------------- | ------------------------------------------------------------------------------------------------------- |
|
||||
| `IN` | Attribute value is member of a list. ~~Any~~ |
|
||||
| `NOT_IN` | Attribute value is _not_ member of a list. ~~Any~~ |
|
||||
| `ISSUBSET` | Attribute values (for `MORPH`) are a subset of a list. ~~Any~~ |
|
||||
| `ISSUPERSET` | Attribute values (for `MORPH`) are a superset of a list. ~~Any~~ |
|
||||
| `==`, `>=`, `<=`, `>`, `<` | Attribute value is equal, greater or equal, smaller or equal, greater or smaller. ~~Union[int, float]~~ |
|
||||
| Attribute | Description |
|
||||
| -------------------------- | --------------------------------------------------------------------------------------------------------- |
|
||||
| `IN` | Attribute value is member of a list. ~~Any~~ |
|
||||
| `NOT_IN` | Attribute value is _not_ member of a list. ~~Any~~ |
|
||||
| `IS_SUBSET` | Attribute value (for `MORPH` or custom list attributes) is a subset of a list. ~~Any~~ |
|
||||
| `IS_SUPERSET` | Attribute value (for `MORPH` or custom list attributes) is a superset of a list. ~~Any~~ |
|
||||
| `INTERSECTS` | Attribute value (for `MORPH` or custom list attributes) has a non-empty intersection with a list. ~~Any~~ |
|
||||
| `==`, `>=`, `<=`, `>`, `<` | Attribute value is equal, greater or equal, smaller or equal, greater or smaller. ~~Union[int, float]~~ |
|
||||
|
||||
#### Regular expressions {#regex new="2.1"}
|
||||
|
||||
|
|
|
@ -652,7 +652,7 @@ excluded from the logs and the score won't be weighted.
|
|||
| **Recall** (R) | Percentage of reference annotations recovered. Should increase. |
|
||||
| **F-Score** (F) | Harmonic mean of precision and recall. Should increase. |
|
||||
| **UAS** / **LAS** | Unlabeled and labeled attachment score for the dependency parser, i.e. the percentage of correct arcs. Should increase. |
|
||||
| **Words per second** (WPS) | Prediction speed in words per second. Should stay stable. |
|
||||
| **Speed** | Prediction speed in words per second (WPS). Should stay stable. |
|
||||
|
||||
Note that if the development data has raw text, some of the gold-standard
|
||||
entities might not align to the predicted tokenization. These tokenization
|
||||
|
|
|
@ -854,6 +854,19 @@ pipeline component, the [`AttributeRuler`](/api/attributeruler). See the
|
|||
you have tag maps and morph rules in the v2.x format, you can load them into the
|
||||
attribute ruler before training using the `[initialize]` block of your config.
|
||||
|
||||
### Using Lexeme Tables
|
||||
|
||||
To use tables like `lexeme_prob` when training a model from scratch, you need
|
||||
to add an entry to the `initialize` block in your config. Here's what that
|
||||
looks like for the existing trained pipelines:
|
||||
|
||||
```ini
|
||||
[initialize.lookups]
|
||||
@misc = "spacy.LookupsDataLoader.v1"
|
||||
lang = ${nlp.lang}
|
||||
tables = ["lexeme_norm"]
|
||||
```
|
||||
|
||||
> #### What does the initialization do?
|
||||
>
|
||||
> The `[initialize]` block is used when
|
||||
|
|
Loading…
Reference in New Issue
Block a user