diff --git a/.github/contributors/ezorita.md b/.github/contributors/ezorita.md
new file mode 100644
index 000000000..e5f3f5283
--- /dev/null
+++ b/.github/contributors/ezorita.md
@@ -0,0 +1,106 @@
+# spaCy contributor agreement
+
+This spaCy Contributor Agreement (**"SCA"**) is based on the
+[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf).
+The SCA applies to any contribution that you make to any product or project
+managed by us (the **"project"**), and sets out the intellectual property rights
+you grant to us in the contributed materials. The term **"us"** shall mean
+[ExplosionAI GmbH](https://explosion.ai/legal). The term
+**"you"** shall mean the person or entity identified below.
+
+If you agree to be bound by these terms, fill in the information requested
+below and include the filled-in version with your first pull request, under the
+folder [`.github/contributors/`](/.github/contributors/). The name of the file
+should be your GitHub username, with the extension `.md`. For example, the user
+example_user would create the file `.github/contributors/example_user.md`.
+
+Read this agreement carefully before signing. These terms and conditions
+constitute a binding legal agreement.
+
+## Contributor Agreement
+
+1. The term "contribution" or "contributed materials" means any source code,
+object code, patch, tool, sample, graphic, specification, manual,
+documentation, or any other material posted or submitted by you to the project.
+
+2. With respect to any worldwide copyrights, or copyright applications and
+registrations, in your contribution:
+
+ * you hereby assign to us joint ownership, and to the extent that such
+ assignment is or becomes invalid, ineffective or unenforceable, you hereby
+ grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
+ royalty-free, unrestricted license to exercise all rights under those
+ copyrights. This includes, at our option, the right to sublicense these same
+ rights to third parties through multiple levels of sublicensees or other
+ licensing arrangements;
+
+ * you agree that each of us can do all things in relation to your
+ contribution as if each of us were the sole owners, and if one of us makes
+ a derivative work of your contribution, the one who makes the derivative
+ work (or has it made) will be the sole owner of that derivative work;
+
+ * you agree that you will not assert any moral rights in your contribution
+ against us, our licensees or transferees;
+
+ * you agree that we may register a copyright in your contribution and
+ exercise all ownership rights associated with it; and
+
+ * you agree that neither of us has any duty to consult with, obtain the
+ consent of, pay or render an accounting to the other for any use or
+ distribution of your contribution.
+
+3. With respect to any patents you own, or that you can license without payment
+to any third party, you hereby grant to us a perpetual, irrevocable,
+non-exclusive, worldwide, no-charge, royalty-free license to:
+
+ * make, have made, use, sell, offer to sell, import, and otherwise transfer
+ your contribution in whole or in part, alone or in combination with or
+ included in any product, work or materials arising out of the project to
+ which your contribution was submitted, and
+
+ * at our option, to sublicense these same rights to third parties through
+ multiple levels of sublicensees or other licensing arrangements.
+
+4. Except as set out above, you keep all right, title, and interest in your
+contribution. The rights that you grant to us under these terms are effective
+on the date you first submitted a contribution to us, even if your submission
+took place before the date you sign these terms.
+
+5. You covenant, represent, warrant and agree that:
+
+ * Each contribution that you submit is and shall be an original work of
+ authorship and you can legally grant the rights set out in this SCA;
+
+ * to the best of your knowledge, each contribution will not violate any
+ third party's copyrights, trademarks, patents, or other intellectual
+ property rights; and
+
+ * each contribution shall be in compliance with U.S. export control laws and
+ other applicable export and import laws. You agree to notify us if you
+ become aware of any circumstance which would make any of the foregoing
+ representations inaccurate in any respect. We may publicly disclose your
+ participation in the project, including the fact that you have signed the SCA.
+
+6. This SCA is governed by the laws of the State of California and applicable
+U.S. Federal law. Any choice of law rules will not apply.
+
+7. Please place an “x” on one of the applicable statements below. Please do NOT
+mark both statements:
+
+ * [x] I am signing on behalf of myself as an individual and no other person
+ or entity, including my employer, has or will have rights with respect to my
+ contributions.
+
+ * [ ] I am signing on behalf of my employer or a legal entity and I have the
+ actual authority to contractually bind that entity.
+
+## Contributor Details
+
+| Field | Entry |
+|------------------------------- | -------------------- |
+| Name | Eduard Zorita |
+| Company name (if applicable) | |
+| Title or role (if applicable) | |
+| Date | 06/17/2021 |
+| GitHub username | ezorita |
+| Website (optional) | |
diff --git a/.github/contributors/nsorros.md b/.github/contributors/nsorros.md
new file mode 100644
index 000000000..a449c52e1
--- /dev/null
+++ b/.github/contributors/nsorros.md
@@ -0,0 +1,106 @@
+# spaCy contributor agreement
+
+This spaCy Contributor Agreement (**"SCA"**) is based on the
+[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf).
+The SCA applies to any contribution that you make to any product or project
+managed by us (the **"project"**), and sets out the intellectual property rights
+you grant to us in the contributed materials. The term **"us"** shall mean
+[ExplosionAI GmbH](https://explosion.ai/legal). The term
+**"you"** shall mean the person or entity identified below.
+
+If you agree to be bound by these terms, fill in the information requested
+below and include the filled-in version with your first pull request, under the
+folder [`.github/contributors/`](/.github/contributors/). The name of the file
+should be your GitHub username, with the extension `.md`. For example, the user
+example_user would create the file `.github/contributors/example_user.md`.
+
+Read this agreement carefully before signing. These terms and conditions
+constitute a binding legal agreement.
+
+## Contributor Agreement
+
+1. The term "contribution" or "contributed materials" means any source code,
+object code, patch, tool, sample, graphic, specification, manual,
+documentation, or any other material posted or submitted by you to the project.
+
+2. With respect to any worldwide copyrights, or copyright applications and
+registrations, in your contribution:
+
+ * you hereby assign to us joint ownership, and to the extent that such
+ assignment is or becomes invalid, ineffective or unenforceable, you hereby
+ grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
+ royalty-free, unrestricted license to exercise all rights under those
+ copyrights. This includes, at our option, the right to sublicense these same
+ rights to third parties through multiple levels of sublicensees or other
+ licensing arrangements;
+
+ * you agree that each of us can do all things in relation to your
+ contribution as if each of us were the sole owners, and if one of us makes
+ a derivative work of your contribution, the one who makes the derivative
+ work (or has it made) will be the sole owner of that derivative work;
+
+ * you agree that you will not assert any moral rights in your contribution
+ against us, our licensees or transferees;
+
+ * you agree that we may register a copyright in your contribution and
+ exercise all ownership rights associated with it; and
+
+ * you agree that neither of us has any duty to consult with, obtain the
+ consent of, pay or render an accounting to the other for any use or
+ distribution of your contribution.
+
+3. With respect to any patents you own, or that you can license without payment
+to any third party, you hereby grant to us a perpetual, irrevocable,
+non-exclusive, worldwide, no-charge, royalty-free license to:
+
+ * make, have made, use, sell, offer to sell, import, and otherwise transfer
+ your contribution in whole or in part, alone or in combination with or
+ included in any product, work or materials arising out of the project to
+ which your contribution was submitted, and
+
+ * at our option, to sublicense these same rights to third parties through
+ multiple levels of sublicensees or other licensing arrangements.
+
+4. Except as set out above, you keep all right, title, and interest in your
+contribution. The rights that you grant to us under these terms are effective
+on the date you first submitted a contribution to us, even if your submission
+took place before the date you sign these terms.
+
+5. You covenant, represent, warrant and agree that:
+
+ * Each contribution that you submit is and shall be an original work of
+ authorship and you can legally grant the rights set out in this SCA;
+
+ * to the best of your knowledge, each contribution will not violate any
+ third party's copyrights, trademarks, patents, or other intellectual
+ property rights; and
+
+ * each contribution shall be in compliance with U.S. export control laws and
+ other applicable export and import laws. You agree to notify us if you
+ become aware of any circumstance which would make any of the foregoing
+ representations inaccurate in any respect. We may publicly disclose your
+ participation in the project, including the fact that you have signed the SCA.
+
+6. This SCA is governed by the laws of the State of California and applicable
+U.S. Federal law. Any choice of law rules will not apply.
+
+7. Please place an “x” on one of the applicable statements below. Please do NOT
+mark both statements:
+
+ * [x] I am signing on behalf of myself as an individual and no other person
+ or entity, including my employer, has or will have rights with respect to my
+ contributions.
+
+ * [ ] I am signing on behalf of my employer or a legal entity and I have the
+ actual authority to contractually bind that entity.
+
+## Contributor Details
+
+| Field | Entry |
+|------------------------------- | -------------------- |
+| Name | Nick Sorros |
+| Company name (if applicable) | |
+| Title or role (if applicable) | |
+| Date | 2/8/2021 |
+| GitHub username | nsorros |
+| Website (optional) | |
diff --git a/spacy/cli/project/pull.py b/spacy/cli/project/pull.py
index b88387a9f..6e3cde88c 100644
--- a/spacy/cli/project/pull.py
+++ b/spacy/cli/project/pull.py
@@ -2,7 +2,7 @@ from pathlib import Path
from wasabi import msg
from .remote_storage import RemoteStorage
from .remote_storage import get_command_hash
-from .._util import project_cli, Arg
+from .._util import project_cli, Arg, logger
from .._util import load_project_config
from .run import update_lockfile
@@ -39,11 +39,15 @@ def project_pull(project_dir: Path, remote: str, *, verbose: bool = False):
# in the list.
while commands:
for i, cmd in enumerate(list(commands)):
+ logger.debug(f"CMD: {cmd['name']}.")
deps = [project_dir / dep for dep in cmd.get("deps", [])]
if all(dep.exists() for dep in deps):
cmd_hash = get_command_hash("", "", deps, cmd["script"])
for output_path in cmd.get("outputs", []):
url = storage.pull(output_path, command_hash=cmd_hash)
+ logger.debug(
+ f"URL: {url} for {output_path} with command hash {cmd_hash}"
+ )
yield url, output_path
out_locs = [project_dir / out for out in cmd.get("outputs", [])]
@@ -53,6 +57,8 @@ def project_pull(project_dir: Path, remote: str, *, verbose: bool = False):
# we iterate over the loop again.
commands.pop(i)
break
+ else:
+ logger.debug(f"Dependency missing. Skipping {cmd['name']} outputs.")
else:
# If we didn't break the for loop, break the while loop.
break
diff --git a/spacy/cli/project/push.py b/spacy/cli/project/push.py
index 44050b716..bc779e9cd 100644
--- a/spacy/cli/project/push.py
+++ b/spacy/cli/project/push.py
@@ -3,7 +3,7 @@ from wasabi import msg
from .remote_storage import RemoteStorage
from .remote_storage import get_content_hash, get_command_hash
from .._util import load_project_config
-from .._util import project_cli, Arg
+from .._util import project_cli, Arg, logger
@project_cli.command("push")
@@ -37,12 +37,15 @@ def project_push(project_dir: Path, remote: str):
remote = config["remotes"][remote]
storage = RemoteStorage(project_dir, remote)
for cmd in config.get("commands", []):
+ logger.debug(f"CMD: {cmd['name']}")
deps = [project_dir / dep for dep in cmd.get("deps", [])]
if any(not dep.exists() for dep in deps):
+ logger.debug(f"Dependency missing. Skipping {cmd['name']} outputs")
continue
cmd_hash = get_command_hash(
"", "", [project_dir / dep for dep in cmd.get("deps", [])], cmd["script"]
)
+ logger.debug(f"CMD_HASH: {cmd_hash}")
for output_path in cmd.get("outputs", []):
output_loc = project_dir / output_path
if output_loc.exists() and _is_not_empty_dir(output_loc):
@@ -51,6 +54,9 @@ def project_push(project_dir: Path, remote: str):
command_hash=cmd_hash,
content_hash=get_content_hash(output_loc),
)
+ logger.debug(
+ f"URL: {url} for output {output_path} with cmd_hash {cmd_hash}"
+ )
yield output_path, url
diff --git a/spacy/cli/train.py b/spacy/cli/train.py
index 2932edd3b..9fd87dbc1 100644
--- a/spacy/cli/train.py
+++ b/spacy/cli/train.py
@@ -43,9 +43,13 @@ def train_cli(
# Make sure all files and paths exists if they are needed
if not config_path or (str(config_path) != "-" and not config_path.exists()):
msg.fail("Config file not found", config_path, exits=1)
- if output_path is not None and not output_path.exists():
- output_path.mkdir(parents=True)
- msg.good(f"Created output directory: {output_path}")
+ if not output_path:
+ msg.info("No output directory provided")
+ else:
+ if not output_path.exists():
+ output_path.mkdir(parents=True)
+ msg.good(f"Created output directory: {output_path}")
+ msg.info(f"Saving to output directory: {output_path}")
overrides = parse_config_overrides(ctx.args)
import_code(code_path)
setup_gpu(use_gpu)
diff --git a/spacy/errors.py b/spacy/errors.py
index 5651ab0fa..9264ca6d1 100644
--- a/spacy/errors.py
+++ b/spacy/errors.py
@@ -356,8 +356,8 @@ class Errors:
E098 = ("Invalid pattern: expected both RIGHT_ID and RIGHT_ATTRS.")
E099 = ("Invalid pattern: the first node of pattern should be an anchor "
"node. The node should only contain RIGHT_ID and RIGHT_ATTRS.")
- E100 = ("Nodes other than the anchor node should all contain LEFT_ID, "
- "REL_OP and RIGHT_ID.")
+ E100 = ("Nodes other than the anchor node should all contain {required}, "
+ "but these are missing: {missing}")
E101 = ("RIGHT_ID should be a new node and LEFT_ID should already have "
"have been declared in previous edges.")
E102 = ("Can't merge non-disjoint spans. '{token}' is already part of "
diff --git a/spacy/language.py b/spacy/language.py
index 589dca2bf..a8cad1259 100644
--- a/spacy/language.py
+++ b/spacy/language.py
@@ -1698,7 +1698,6 @@ class Language:
# them here so they're only loaded once
source_nlps = {}
source_nlp_vectors_hashes = {}
- nlp.meta["_sourced_vectors_hashes"] = {}
for pipe_name in config["nlp"]["pipeline"]:
if pipe_name not in pipeline:
opts = ", ".join(pipeline.keys())
@@ -1747,6 +1746,8 @@ class Language:
source_nlp_vectors_hashes[model] = hash(
source_nlps[model].vocab.vectors.to_bytes()
)
+ if "_sourced_vectors_hashes" not in nlp.meta:
+ nlp.meta["_sourced_vectors_hashes"] = {}
nlp.meta["_sourced_vectors_hashes"][
pipe_name
] = source_nlp_vectors_hashes[model]
@@ -1908,7 +1909,7 @@ class Language:
if not hasattr(proc, "to_disk"):
continue
serializers[name] = lambda p, proc=proc: proc.to_disk(p, exclude=["vocab"])
- serializers["vocab"] = lambda p: self.vocab.to_disk(p)
+ serializers["vocab"] = lambda p: self.vocab.to_disk(p, exclude=exclude)
util.to_disk(path, serializers, exclude)
def from_disk(
@@ -1939,7 +1940,7 @@ class Language:
def deserialize_vocab(path: Path) -> None:
if path.exists():
- self.vocab.from_disk(path)
+ self.vocab.from_disk(path, exclude=exclude)
path = util.ensure_path(path)
deserializers = {}
@@ -1977,7 +1978,7 @@ class Language:
DOCS: https://spacy.io/api/language#to_bytes
"""
serializers = {}
- serializers["vocab"] = lambda: self.vocab.to_bytes()
+ serializers["vocab"] = lambda: self.vocab.to_bytes(exclude=exclude)
serializers["tokenizer"] = lambda: self.tokenizer.to_bytes(exclude=["vocab"])
serializers["meta.json"] = lambda: srsly.json_dumps(self.meta)
serializers["config.cfg"] = lambda: self.config.to_bytes()
@@ -2013,7 +2014,7 @@ class Language:
b, interpolate=False
)
deserializers["meta.json"] = deserialize_meta
- deserializers["vocab"] = self.vocab.from_bytes
+ deserializers["vocab"] = lambda b: self.vocab.from_bytes(b, exclude=exclude)
deserializers["tokenizer"] = lambda b: self.tokenizer.from_bytes(
b, exclude=["vocab"]
)
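Note on the serialization changes above: wrapping the vocab call in a lambda is what lets the caller's `exclude` reach the vocab. The following is a minimal sketch of that pattern in plain Python. `FakeVocab` and the `to_bytes` helper are hypothetical stand-ins for illustration only, not spaCy's real classes or `util.to_bytes`.

```python
from typing import Callable, Dict, Iterable


class FakeVocab:
    """Hypothetical stand-in: serializes named parts unless excluded."""

    def __init__(self) -> None:
        self.parts = {"strings": b"S", "vectors": b"V"}

    def to_bytes(self, exclude: Iterable[str] = ()) -> Dict[str, bytes]:
        # Keep only the parts whose name is not in `exclude`.
        return {k: v for k, v in self.parts.items() if k not in exclude}


def to_bytes(
    getters: Dict[str, Callable[[], object]], exclude: Iterable[str]
) -> Dict[str, object]:
    """Call every getter whose name is not excluded (assumed helper)."""
    return {name: get() for name, get in getters.items() if name not in exclude}


vocab = FakeVocab()
exclude = ["vectors"]

# Bound method: the helper calls it with no arguments, so the vocab
# never sees `exclude` and still serializes its vectors.
before = to_bytes({"vocab": vocab.to_bytes}, exclude)

# Lambda: `exclude` is captured and forwarded to the vocab.
after = to_bytes({"vocab": lambda: vocab.to_bytes(exclude=exclude)}, exclude)
```

In this sketch `before["vocab"]` still contains the `vectors` entry, while `after["vocab"]` does not.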
diff --git a/spacy/lexeme.pyi b/spacy/lexeme.pyi
new file mode 100644
index 000000000..4eae6be43
--- /dev/null
+++ b/spacy/lexeme.pyi
@@ -0,0 +1,61 @@
+from typing import (
+ Union,
+ Any,
+)
+from thinc.types import Floats1d
+from .tokens import Doc, Span, Token
+from .vocab import Vocab
+
+class Lexeme:
+ def __init__(self, vocab: Vocab, orth: int) -> None: ...
+ def __richcmp__(self, other: Lexeme, op: int) -> bool: ...
+ def __hash__(self) -> int: ...
+ def set_attrs(self, **attrs: Any) -> None: ...
+ def set_flag(self, flag_id: int, value: bool) -> None: ...
+ def check_flag(self, flag_id: int) -> bool: ...
+ def similarity(self, other: Union[Doc, Span, Token, Lexeme]) -> float: ...
+ @property
+ def has_vector(self) -> bool: ...
+ @property
+ def vector_norm(self) -> float: ...
+ vector: Floats1d
+ rank: int
+ sentiment: float
+ @property
+ def orth_(self) -> str: ...
+ @property
+ def text(self) -> str: ...
+ lower: int
+ norm: int
+ shape: int
+ prefix: int
+ suffix: int
+ cluster: int
+ lang: int
+ prob: float
+ lower_: str
+ norm_: str
+ shape_: str
+ prefix_: str
+ suffix_: str
+ lang_: str
+ flags: int
+ @property
+ def is_oov(self) -> bool: ...
+ is_stop: bool
+ is_alpha: bool
+ is_ascii: bool
+ is_digit: bool
+ is_lower: bool
+ is_upper: bool
+ is_title: bool
+ is_punct: bool
+ is_space: bool
+ is_bracket: bool
+ is_quote: bool
+ is_left_punct: bool
+ is_right_punct: bool
+ is_currency: bool
+ like_url: bool
+ like_num: bool
+ like_email: bool
diff --git a/spacy/matcher/dependencymatcher.pyx b/spacy/matcher/dependencymatcher.pyx
index b6e84a5da..9e0842d59 100644
--- a/spacy/matcher/dependencymatcher.pyx
+++ b/spacy/matcher/dependencymatcher.pyx
@@ -122,13 +122,17 @@ cdef class DependencyMatcher:
raise ValueError(Errors.E099.format(key=key))
visited_nodes[relation["RIGHT_ID"]] = True
else:
- if not(
- "RIGHT_ID" in relation
- and "RIGHT_ATTRS" in relation
- and "REL_OP" in relation
- and "LEFT_ID" in relation
- ):
- raise ValueError(Errors.E100.format(key=key))
+ required_keys = set(
+ ("RIGHT_ID", "RIGHT_ATTRS", "REL_OP", "LEFT_ID")
+ )
+ relation_keys = set(relation.keys())
+ missing = required_keys - relation_keys
+ if missing:
+ missing_txt = ", ".join(sorted(missing))
+ raise ValueError(Errors.E100.format(
+ required=required_keys,
+ missing=missing_txt
+ ))
if (
relation["RIGHT_ID"] in visited_nodes
or relation["LEFT_ID"] not in visited_nodes
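Note on the reworked E100 check above: it can be illustrated as a standalone sketch in plain Python. The names `REQUIRED_KEYS`, `E100` and `check_relation` are hypothetical and exist only for this example. Sorting the key names so the message is deterministic is a choice of this sketch.

```python
# Hypothetical standalone sketch; not spaCy's actual code.
REQUIRED_KEYS = {"RIGHT_ID", "RIGHT_ATTRS", "REL_OP", "LEFT_ID"}
E100 = (
    "Nodes other than the anchor node should all contain {required}, "
    "but these are missing: {missing}"
)


def check_relation(relation: dict) -> None:
    """Raise ValueError naming the required keys absent from `relation`."""
    missing = REQUIRED_KEYS - set(relation)
    if missing:
        raise ValueError(
            E100.format(
                required=", ".join(sorted(REQUIRED_KEYS)),
                missing=", ".join(sorted(missing)),
            )
        )
```

A relation that has all four keys passes silently. One that lacks `REL_OP` and `LEFT_ID` raises a `ValueError` that lists exactly those two names.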
diff --git a/spacy/matcher/matcher.pyi b/spacy/matcher/matcher.pyi
new file mode 100644
index 000000000..3be065bcd
--- /dev/null
+++ b/spacy/matcher/matcher.pyi
@@ -0,0 +1,41 @@
+from typing import Any, List, Dict, Tuple, Optional, Callable, Union, Iterator, Iterable
+from ..vocab import Vocab
+from ..tokens import Doc, Span
+
+class Matcher:
+ def __init__(self, vocab: Vocab, validate: bool = ...) -> None: ...
+ def __reduce__(self) -> Any: ...
+ def __len__(self) -> int: ...
+ def __contains__(self, key: str) -> bool: ...
+ def add(
+ self,
+ key: str,
+ patterns: List[List[Dict[str, Any]]],
+ *,
+ on_match: Optional[
+ Callable[[Matcher, Doc, int, List[Tuple[Any, ...]]], Any]
+ ] = ...,
+ greedy: Optional[str] = ...
+ ) -> None: ...
+ def remove(self, key: str) -> None: ...
+ def has_key(self, key: Union[str, int]) -> bool: ...
+ def get(
+ self, key: Union[str, int], default: Optional[Any] = ...
+ ) -> Tuple[Optional[Callable[[Any], Any]], List[List[Dict[Any, Any]]]]: ...
+ def pipe(
+ self,
+ docs: Iterable[Tuple[Doc, Any]],
+ batch_size: int = ...,
+ return_matches: bool = ...,
+ as_tuples: bool = ...,
+ ) -> Union[
+ Iterator[Tuple[Tuple[Doc, Any], Any]], Iterator[Tuple[Doc, Any]], Iterator[Doc]
+ ]: ...
+ def __call__(
+ self,
+ doclike: Union[Doc, Span],
+ *,
+ as_spans: bool = ...,
+ allow_missing: bool = ...,
+ with_alignments: bool = ...
+ ) -> Union[List[Tuple[int, int, int]], List[Span]]: ...
diff --git a/spacy/matcher/matcher.pyx b/spacy/matcher/matcher.pyx
index 7b1cfb633..555766f62 100644
--- a/spacy/matcher/matcher.pyx
+++ b/spacy/matcher/matcher.pyx
@@ -845,7 +845,7 @@ class _RegexPredicate:
class _SetPredicate:
- operators = ("IN", "NOT_IN", "IS_SUBSET", "IS_SUPERSET")
+ operators = ("IN", "NOT_IN", "IS_SUBSET", "IS_SUPERSET", "INTERSECTS")
def __init__(self, i, attr, value, predicate, is_extension=False, vocab=None):
self.i = i
@@ -868,14 +868,16 @@ class _SetPredicate:
else:
value = get_token_attr_for_matcher(token.c, self.attr)
- if self.predicate in ("IS_SUBSET", "IS_SUPERSET"):
+ if self.predicate in ("IS_SUBSET", "IS_SUPERSET", "INTERSECTS"):
if self.attr == MORPH:
# break up MORPH into individual Feat=Val values
value = set(get_string_id(v) for v in MorphAnalysis.from_id(self.vocab, value))
else:
- # IS_SUBSET for other attrs will be equivalent to "IN"
- # IS_SUPERSET will only match for other attrs with 0 or 1 values
- value = set([value])
+ # treat a single value as a list
+ if isinstance(value, (str, int)):
+ value = set([get_string_id(value)])
+ else:
+ value = set(get_string_id(v) for v in value)
if self.predicate == "IN":
return value in self.value
elif self.predicate == "NOT_IN":
@@ -884,6 +886,8 @@ class _SetPredicate:
return value <= self.value
elif self.predicate == "IS_SUPERSET":
return value >= self.value
+ elif self.predicate == "INTERSECTS":
+ return bool(value & self.value)
def __repr__(self):
return repr(("SetPredicate", self.i, self.attr, self.value, self.predicate))
@@ -928,6 +932,7 @@ def _get_extra_predicates(spec, extra_predicates, vocab):
"NOT_IN": _SetPredicate,
"IS_SUBSET": _SetPredicate,
"IS_SUPERSET": _SetPredicate,
+ "INTERSECTS": _SetPredicate,
"==": _ComparisonPredicate,
"!=": _ComparisonPredicate,
">=": _ComparisonPredicate,
diff --git a/spacy/pipeline/attributeruler.py b/spacy/pipeline/attributeruler.py
index a6efd5906..f95a5a48c 100644
--- a/spacy/pipeline/attributeruler.py
+++ b/spacy/pipeline/attributeruler.py
@@ -276,7 +276,7 @@ class AttributeRuler(Pipe):
DOCS: https://spacy.io/api/attributeruler#to_bytes
"""
serialize = {}
- serialize["vocab"] = self.vocab.to_bytes
+ serialize["vocab"] = lambda: self.vocab.to_bytes(exclude=exclude)
serialize["patterns"] = lambda: srsly.msgpack_dumps(self.patterns)
return util.to_bytes(serialize, exclude)
@@ -296,7 +296,7 @@ class AttributeRuler(Pipe):
self.add_patterns(srsly.msgpack_loads(b))
deserialize = {
- "vocab": lambda b: self.vocab.from_bytes(b),
+ "vocab": lambda b: self.vocab.from_bytes(b, exclude=exclude),
"patterns": load_patterns,
}
util.from_bytes(bytes_data, deserialize, exclude)
@@ -313,7 +313,7 @@ class AttributeRuler(Pipe):
DOCS: https://spacy.io/api/attributeruler#to_disk
"""
serialize = {
- "vocab": lambda p: self.vocab.to_disk(p),
+ "vocab": lambda p: self.vocab.to_disk(p, exclude=exclude),
"patterns": lambda p: srsly.write_msgpack(p, self.patterns),
}
util.to_disk(path, serialize, exclude)
@@ -334,7 +334,7 @@ class AttributeRuler(Pipe):
self.add_patterns(srsly.read_msgpack(p))
deserialize = {
- "vocab": lambda p: self.vocab.from_disk(p),
+ "vocab": lambda p: self.vocab.from_disk(p, exclude=exclude),
"patterns": load_patterns,
}
util.from_disk(path, deserialize, exclude)
diff --git a/spacy/pipeline/entity_linker.py b/spacy/pipeline/entity_linker.py
index ba7e71f15..7b52025bc 100644
--- a/spacy/pipeline/entity_linker.py
+++ b/spacy/pipeline/entity_linker.py
@@ -412,7 +412,7 @@ class EntityLinker(TrainablePipe):
serialize = {}
if hasattr(self, "cfg") and self.cfg is not None:
serialize["cfg"] = lambda: srsly.json_dumps(self.cfg)
- serialize["vocab"] = self.vocab.to_bytes
+ serialize["vocab"] = lambda: self.vocab.to_bytes(exclude=exclude)
serialize["kb"] = self.kb.to_bytes
serialize["model"] = self.model.to_bytes
return util.to_bytes(serialize, exclude)
@@ -436,7 +436,7 @@ class EntityLinker(TrainablePipe):
deserialize = {}
if hasattr(self, "cfg") and self.cfg is not None:
deserialize["cfg"] = lambda b: self.cfg.update(srsly.json_loads(b))
- deserialize["vocab"] = lambda b: self.vocab.from_bytes(b)
+ deserialize["vocab"] = lambda b: self.vocab.from_bytes(b, exclude=exclude)
deserialize["kb"] = lambda b: self.kb.from_bytes(b)
deserialize["model"] = load_model
util.from_bytes(bytes_data, deserialize, exclude)
@@ -453,7 +453,7 @@ class EntityLinker(TrainablePipe):
DOCS: https://spacy.io/api/entitylinker#to_disk
"""
serialize = {}
- serialize["vocab"] = lambda p: self.vocab.to_disk(p)
+ serialize["vocab"] = lambda p: self.vocab.to_disk(p, exclude=exclude)
serialize["cfg"] = lambda p: srsly.write_json(p, self.cfg)
serialize["kb"] = lambda p: self.kb.to_disk(p)
serialize["model"] = lambda p: self.model.to_disk(p)
@@ -480,6 +480,7 @@ class EntityLinker(TrainablePipe):
deserialize = {}
deserialize["cfg"] = lambda p: self.cfg.update(deserialize_config(p))
+ deserialize["vocab"] = lambda p: self.vocab.from_disk(p, exclude=exclude)
deserialize["kb"] = lambda p: self.kb.from_disk(p)
deserialize["model"] = load_model
util.from_disk(path, deserialize, exclude)
diff --git a/spacy/pipeline/lemmatizer.py b/spacy/pipeline/lemmatizer.py
index 87504fade..2f436c57a 100644
--- a/spacy/pipeline/lemmatizer.py
+++ b/spacy/pipeline/lemmatizer.py
@@ -269,7 +269,7 @@ class Lemmatizer(Pipe):
DOCS: https://spacy.io/api/lemmatizer#to_disk
"""
serialize = {}
- serialize["vocab"] = lambda p: self.vocab.to_disk(p)
+ serialize["vocab"] = lambda p: self.vocab.to_disk(p, exclude=exclude)
serialize["lookups"] = lambda p: self.lookups.to_disk(p)
util.to_disk(path, serialize, exclude)
@@ -285,7 +285,7 @@ class Lemmatizer(Pipe):
DOCS: https://spacy.io/api/lemmatizer#from_disk
"""
deserialize = {}
- deserialize["vocab"] = lambda p: self.vocab.from_disk(p)
+ deserialize["vocab"] = lambda p: self.vocab.from_disk(p, exclude=exclude)
deserialize["lookups"] = lambda p: self.lookups.from_disk(p)
util.from_disk(path, deserialize, exclude)
self._validate_tables()
@@ -300,7 +300,7 @@ class Lemmatizer(Pipe):
DOCS: https://spacy.io/api/lemmatizer#to_bytes
"""
serialize = {}
- serialize["vocab"] = self.vocab.to_bytes
+ serialize["vocab"] = lambda: self.vocab.to_bytes(exclude=exclude)
serialize["lookups"] = self.lookups.to_bytes
return util.to_bytes(serialize, exclude)
@@ -316,7 +316,7 @@ class Lemmatizer(Pipe):
DOCS: https://spacy.io/api/lemmatizer#from_bytes
"""
deserialize = {}
- deserialize["vocab"] = lambda b: self.vocab.from_bytes(b)
+ deserialize["vocab"] = lambda b: self.vocab.from_bytes(b, exclude=exclude)
deserialize["lookups"] = lambda b: self.lookups.from_bytes(b)
util.from_bytes(bytes_data, deserialize, exclude)
self._validate_tables()
diff --git a/spacy/pipeline/trainable_pipe.pyx b/spacy/pipeline/trainable_pipe.pyx
index ce1e133a2..76b0733cf 100644
--- a/spacy/pipeline/trainable_pipe.pyx
+++ b/spacy/pipeline/trainable_pipe.pyx
@@ -273,7 +273,7 @@ cdef class TrainablePipe(Pipe):
serialize = {}
if hasattr(self, "cfg") and self.cfg is not None:
serialize["cfg"] = lambda: srsly.json_dumps(self.cfg)
- serialize["vocab"] = self.vocab.to_bytes
+ serialize["vocab"] = lambda: self.vocab.to_bytes(exclude=exclude)
serialize["model"] = self.model.to_bytes
return util.to_bytes(serialize, exclude)
@@ -296,7 +296,7 @@ cdef class TrainablePipe(Pipe):
deserialize = {}
if hasattr(self, "cfg") and self.cfg is not None:
deserialize["cfg"] = lambda b: self.cfg.update(srsly.json_loads(b))
- deserialize["vocab"] = lambda b: self.vocab.from_bytes(b)
+ deserialize["vocab"] = lambda b: self.vocab.from_bytes(b, exclude=exclude)
deserialize["model"] = load_model
util.from_bytes(bytes_data, deserialize, exclude)
return self
@@ -313,7 +313,7 @@ cdef class TrainablePipe(Pipe):
serialize = {}
if hasattr(self, "cfg") and self.cfg is not None:
serialize["cfg"] = lambda p: srsly.write_json(p, self.cfg)
- serialize["vocab"] = lambda p: self.vocab.to_disk(p)
+ serialize["vocab"] = lambda p: self.vocab.to_disk(p, exclude=exclude)
serialize["model"] = lambda p: self.model.to_disk(p)
util.to_disk(path, serialize, exclude)
@@ -338,7 +338,7 @@ cdef class TrainablePipe(Pipe):
deserialize = {}
if hasattr(self, "cfg") and self.cfg is not None:
deserialize["cfg"] = lambda p: self.cfg.update(deserialize_config(p))
- deserialize["vocab"] = lambda p: self.vocab.from_disk(p)
+ deserialize["vocab"] = lambda p: self.vocab.from_disk(p, exclude=exclude)
deserialize["model"] = load_model
util.from_disk(path, deserialize, exclude)
return self
diff --git a/spacy/pipeline/transition_parser.pyx b/spacy/pipeline/transition_parser.pyx
index a495b1bc7..5e11f5972 100644
--- a/spacy/pipeline/transition_parser.pyx
+++ b/spacy/pipeline/transition_parser.pyx
@@ -569,7 +569,7 @@ cdef class Parser(TrainablePipe):
def to_disk(self, path, exclude=tuple()):
serializers = {
"model": lambda p: (self.model.to_disk(p) if self.model is not True else True),
- "vocab": lambda p: self.vocab.to_disk(p),
+ "vocab": lambda p: self.vocab.to_disk(p, exclude=exclude),
"moves": lambda p: self.moves.to_disk(p, exclude=["strings"]),
"cfg": lambda p: srsly.write_json(p, self.cfg)
}
@@ -577,7 +577,7 @@ cdef class Parser(TrainablePipe):
def from_disk(self, path, exclude=tuple()):
deserializers = {
- "vocab": lambda p: self.vocab.from_disk(p),
+ "vocab": lambda p: self.vocab.from_disk(p, exclude=exclude),
"moves": lambda p: self.moves.from_disk(p, exclude=["strings"]),
"cfg": lambda p: self.cfg.update(srsly.read_json(p)),
"model": lambda p: None,
@@ -597,7 +597,7 @@ cdef class Parser(TrainablePipe):
def to_bytes(self, exclude=tuple()):
serializers = {
"model": lambda: (self.model.to_bytes()),
- "vocab": lambda: self.vocab.to_bytes(),
+ "vocab": lambda: self.vocab.to_bytes(exclude=exclude),
"moves": lambda: self.moves.to_bytes(exclude=["strings"]),
"cfg": lambda: srsly.json_dumps(self.cfg, indent=2, sort_keys=True)
}
@@ -605,7 +605,7 @@ cdef class Parser(TrainablePipe):
def from_bytes(self, bytes_data, exclude=tuple()):
deserializers = {
- "vocab": lambda b: self.vocab.from_bytes(b),
+ "vocab": lambda b: self.vocab.from_bytes(b, exclude=exclude),
"moves": lambda b: self.moves.from_bytes(b, exclude=["strings"]),
"cfg": lambda b: self.cfg.update(srsly.json_loads(b)),
"model": lambda b: None,
diff --git a/spacy/schemas.py b/spacy/schemas.py
index 992e17d70..83623b104 100644
--- a/spacy/schemas.py
+++ b/spacy/schemas.py
@@ -159,6 +159,7 @@ class TokenPatternString(BaseModel):
NOT_IN: Optional[List[StrictStr]] = Field(None, alias="not_in")
IS_SUBSET: Optional[List[StrictStr]] = Field(None, alias="is_subset")
IS_SUPERSET: Optional[List[StrictStr]] = Field(None, alias="is_superset")
+ INTERSECTS: Optional[List[StrictStr]] = Field(None, alias="intersects")
class Config:
extra = "forbid"
@@ -175,8 +176,9 @@ class TokenPatternNumber(BaseModel):
REGEX: Optional[StrictStr] = Field(None, alias="regex")
IN: Optional[List[StrictInt]] = Field(None, alias="in")
NOT_IN: Optional[List[StrictInt]] = Field(None, alias="not_in")
- ISSUBSET: Optional[List[StrictInt]] = Field(None, alias="issubset")
- ISSUPERSET: Optional[List[StrictInt]] = Field(None, alias="issuperset")
+ IS_SUBSET: Optional[List[StrictInt]] = Field(None, alias="is_subset")
+ IS_SUPERSET: Optional[List[StrictInt]] = Field(None, alias="is_superset")
+ INTERSECTS: Optional[List[StrictInt]] = Field(None, alias="intersects")
EQ: Union[StrictInt, StrictFloat] = Field(None, alias="==")
NEQ: Union[StrictInt, StrictFloat] = Field(None, alias="!=")
GEQ: Union[StrictInt, StrictFloat] = Field(None, alias=">=")
diff --git a/spacy/strings.pyi b/spacy/strings.pyi
new file mode 100644
index 000000000..57bf71b93
--- /dev/null
+++ b/spacy/strings.pyi
@@ -0,0 +1,22 @@
+from typing import Optional, Iterable, Iterator, Union, Any
+from pathlib import Path
+
+def get_string_id(key: str) -> int: ...
+
+class StringStore:
+ def __init__(
+ self, strings: Optional[Iterable[str]] = ..., freeze: bool = ...
+ ) -> None: ...
+ def __getitem__(self, string_or_id: Union[bytes, str, int]) -> Union[str, int]: ...
+ def as_int(self, key: Union[bytes, str, int]) -> int: ...
+ def as_string(self, key: Union[bytes, str, int]) -> str: ...
+ def add(self, string: str) -> int: ...
+ def __len__(self) -> int: ...
+ def __contains__(self, string: str) -> bool: ...
+ def __iter__(self) -> Iterator[str]: ...
+ def __reduce__(self) -> Any: ...
+ def to_disk(self, path: Union[str, Path]) -> None: ...
+ def from_disk(self, path: Union[str, Path]) -> StringStore: ...
+ def to_bytes(self, **kwargs: Any) -> bytes: ...
+ def from_bytes(self, bytes_data: bytes, **kwargs: Any) -> StringStore: ...
+ def _reset_and_load(self, strings: Iterable[str]) -> None: ...
diff --git a/spacy/tests/doc/test_span.py b/spacy/tests/doc/test_span.py
index 6e34f2126..01b022b9d 100644
--- a/spacy/tests/doc/test_span.py
+++ b/spacy/tests/doc/test_span.py
@@ -357,6 +357,9 @@ def test_span_eq_hash(doc, doc_not_parsed):
assert hash(doc[0:2]) != hash(doc[1:3])
assert hash(doc[0:2]) != hash(doc_not_parsed[0:2])
+ # check that an out-of-bounds span is not equal to the span of the full doc
+ assert doc[0 : len(doc)] != doc[len(doc) : len(doc) + 1]
+
def test_span_boundaries(doc):
start = 1
@@ -369,6 +372,33 @@ def test_span_boundaries(doc):
with pytest.raises(IndexError):
span[5]
+ empty_span_0 = doc[0:0]
+ assert empty_span_0.text == ""
+ assert empty_span_0.start == 0
+ assert empty_span_0.end == 0
+ assert empty_span_0.start_char == 0
+ assert empty_span_0.end_char == 0
+
+ empty_span_1 = doc[1:1]
+ assert empty_span_1.text == ""
+ assert empty_span_1.start == 1
+ assert empty_span_1.end == 1
+ assert empty_span_1.start_char == empty_span_1.end_char
+
+ oob_span_start = doc[-len(doc) - 1 : -len(doc) - 10]
+ assert oob_span_start.text == ""
+ assert oob_span_start.start == 0
+ assert oob_span_start.end == 0
+ assert oob_span_start.start_char == 0
+ assert oob_span_start.end_char == 0
+
+ oob_span_end = doc[len(doc) + 1 : len(doc) + 10]
+ assert oob_span_end.text == ""
+ assert oob_span_end.start == len(doc)
+ assert oob_span_end.end == len(doc)
+ assert oob_span_end.start_char == len(doc.text)
+ assert oob_span_end.end_char == len(doc.text)
+
def test_span_lemma(doc):
# span lemmas should have the same number of spaces as the span
diff --git a/spacy/tests/matcher/test_matcher_api.py b/spacy/tests/matcher/test_matcher_api.py
index e0f655bbe..a42735eae 100644
--- a/spacy/tests/matcher/test_matcher_api.py
+++ b/spacy/tests/matcher/test_matcher_api.py
@@ -270,6 +270,16 @@ def test_matcher_subset_value_operator(en_vocab):
doc[0].tag_ = "A"
assert len(matcher(doc)) == 0
+ # IS_SUBSET with a list value
+ Token.set_extension("ext", default=[], force=True)
+ matcher = Matcher(en_vocab)
+ pattern = [{"_": {"ext": {"IS_SUBSET": ["A", "B"]}}}]
+ matcher.add("M", [pattern])
+ doc = Doc(en_vocab, words=["a", "b", "c"])
+ doc[0]._.ext = ["A"]
+ doc[1]._.ext = ["C", "D"]
+ assert len(matcher(doc)) == 2
+
def test_matcher_superset_value_operator(en_vocab):
matcher = Matcher(en_vocab)
@@ -308,6 +318,72 @@ def test_matcher_superset_value_operator(en_vocab):
doc[0].tag_ = "A"
assert len(matcher(doc)) == 3
+ # IS_SUPERSET with a list value
+ Token.set_extension("ext", default=[], force=True)
+ matcher = Matcher(en_vocab)
+ pattern = [{"_": {"ext": {"IS_SUPERSET": ["A"]}}}]
+ matcher.add("M", [pattern])
+ doc = Doc(en_vocab, words=["a", "b", "c"])
+ doc[0]._.ext = ["A", "B"]
+ assert len(matcher(doc)) == 1
+
+
+def test_matcher_intersect_value_operator(en_vocab):
+ matcher = Matcher(en_vocab)
+ pattern = [{"MORPH": {"INTERSECTS": ["Feat=Val", "Feat2=Val2", "Feat3=Val3"]}}]
+ matcher.add("M", [pattern])
+ doc = Doc(en_vocab, words=["a", "b", "c"])
+ assert len(matcher(doc)) == 0
+ doc[0].set_morph("Feat=Val")
+ assert len(matcher(doc)) == 1
+ doc[0].set_morph("Feat=Val|Feat2=Val2")
+ assert len(matcher(doc)) == 1
+ doc[0].set_morph("Feat=Val|Feat2=Val2|Feat3=Val3")
+ assert len(matcher(doc)) == 1
+ doc[0].set_morph("Feat=Val|Feat2=Val2|Feat3=Val3|Feat4=Val4")
+ assert len(matcher(doc)) == 1
+
+ # INTERSECTS with a single value is the same as IN
+ matcher = Matcher(en_vocab)
+ pattern = [{"TAG": {"INTERSECTS": ["A", "B"]}}]
+ matcher.add("M", [pattern])
+ doc = Doc(en_vocab, words=["a", "b", "c"])
+ doc[0].tag_ = "A"
+ assert len(matcher(doc)) == 1
+
+ # INTERSECTS with an empty pattern list matches nothing
+ matcher = Matcher(en_vocab)
+ pattern = [{"TAG": {"INTERSECTS": []}}]
+ matcher.add("M", [pattern])
+ doc = Doc(en_vocab, words=["a", "b", "c"])
+ doc[0].tag_ = "A"
+ assert len(matcher(doc)) == 0
+
+ # INTERSECTS with a list value
+ Token.set_extension("ext", default=[], force=True)
+ matcher = Matcher(en_vocab)
+ pattern = [{"_": {"ext": {"INTERSECTS": ["A", "C"]}}}]
+ matcher.add("M", [pattern])
+ doc = Doc(en_vocab, words=["a", "b", "c"])
+ doc[0]._.ext = ["A", "B"]
+ assert len(matcher(doc)) == 1
+
+ # INTERSECTS with an empty pattern list matches nothing
+ matcher = Matcher(en_vocab)
+ pattern = [{"_": {"ext": {"INTERSECTS": []}}}]
+ matcher.add("M", [pattern])
+ doc = Doc(en_vocab, words=["a", "b", "c"])
+ doc[0]._.ext = ["A", "B"]
+ assert len(matcher(doc)) == 0
+
+ # INTERSECTS with an empty value matches nothing
+ matcher = Matcher(en_vocab)
+ pattern = [{"_": {"ext": {"INTERSECTS": ["A", "B"]}}}]
+ matcher.add("M", [pattern])
+ doc = Doc(en_vocab, words=["a", "b", "c"])
+ doc[0]._.ext = []
+ assert len(matcher(doc)) == 0
+
def test_matcher_morph_handling(en_vocab):
# order of features in pattern doesn't matter
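Review note: the tests added in this file pin down the expected behaviour of the three set-valued operators, including the empty-list cases. A minimal pure-Python sketch of those semantics follows, assuming plain set comparison. It is not spaCy's matcher implementation, and the helper name `set_predicate` is hypothetical.

```python
def set_predicate(operator, pattern_values, token_values):
    """Evaluate a set-valued operator the way the tests above expect.

    Hypothetical helper for illustration only; it is not part of spaCy.
    """
    pattern = set(pattern_values)
    value = set(token_values)
    if operator == "IS_SUBSET":
        # An empty token value is a subset of any pattern list.
        return value <= pattern
    if operator == "IS_SUPERSET":
        return value >= pattern
    if operator == "INTERSECTS":
        # An empty set on either side can never intersect.
        return bool(value & pattern)
    raise ValueError(f"unknown operator: {operator}")
```

Under this reading, the `IS_SUBSET` test counts two matches because the token left at the default `[]` also satisfies the predicate, while `INTERSECTS` rejects an empty list on either side.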
diff --git a/spacy/tests/pipeline/test_spancat.py b/spacy/tests/pipeline/test_spancat.py
index 0364abf73..6a5ae2c66 100644
--- a/spacy/tests/pipeline/test_spancat.py
+++ b/spacy/tests/pipeline/test_spancat.py
@@ -1,9 +1,11 @@
import pytest
-from numpy.testing import assert_equal
+from numpy.testing import assert_equal, assert_array_equal
+from thinc.api import get_current_ops
from spacy.language import Language
from spacy.training import Example
from spacy.util import fix_random_seed, registry
+OPS = get_current_ops()
SPAN_KEY = "labeled_spans"
@@ -116,12 +118,15 @@ def test_ngram_suggester(en_tokenizer):
for span in spans:
assert 0 <= span[0] < len(doc)
assert 0 < span[1] <= len(doc)
- spans_set.add((span[0], span[1]))
+ spans_set.add((int(span[0]), int(span[1])))
# spans are unique
assert spans.shape[0] == len(spans_set)
offset += ngrams.lengths[i]
# the number of spans is correct
- assert_equal(ngrams.lengths, [max(0, len(doc) - (size - 1)) for doc in docs])
+ assert_array_equal(
+ OPS.to_numpy(ngrams.lengths),
+ [max(0, len(doc) - (size - 1)) for doc in docs],
+ )
# test 1-3-gram suggestions
ngram_suggester = registry.misc.get("spacy.ngram_suggester.v1")(sizes=[1, 2, 3])
@@ -129,9 +134,9 @@ def test_ngram_suggester(en_tokenizer):
en_tokenizer(text) for text in ["a", "a b", "a b c", "a b c d", "a b c d e"]
]
ngrams = ngram_suggester(docs)
- assert_equal(ngrams.lengths, [1, 3, 6, 9, 12])
- assert_equal(
- ngrams.data,
+ assert_array_equal(OPS.to_numpy(ngrams.lengths), [1, 3, 6, 9, 12])
+ assert_array_equal(
+ OPS.to_numpy(ngrams.data),
[
# doc 0
[0, 1],
@@ -176,13 +181,13 @@ def test_ngram_suggester(en_tokenizer):
ngram_suggester = registry.misc.get("spacy.ngram_suggester.v1")(sizes=[1])
docs = [en_tokenizer(text) for text in ["", "a", ""]]
ngrams = ngram_suggester(docs)
- assert_equal(ngrams.lengths, [len(doc) for doc in docs])
+ assert_array_equal(OPS.to_numpy(ngrams.lengths), [len(doc) for doc in docs])
# test all empty docs
ngram_suggester = registry.misc.get("spacy.ngram_suggester.v1")(sizes=[1])
docs = [en_tokenizer(text) for text in ["", "", ""]]
ngrams = ngram_suggester(docs)
- assert_equal(ngrams.lengths, [len(doc) for doc in docs])
+ assert_array_equal(OPS.to_numpy(ngrams.lengths), [len(doc) for doc in docs])
def test_ngram_sizes(en_tokenizer):
@@ -195,12 +200,12 @@ def test_ngram_sizes(en_tokenizer):
]
ngrams_1 = size_suggester(docs)
ngrams_2 = range_suggester(docs)
- assert_equal(ngrams_1.lengths, [1, 3, 6, 9, 12])
- assert_equal(ngrams_1.lengths, ngrams_2.lengths)
- assert_equal(ngrams_1.data, ngrams_2.data)
+ assert_array_equal(OPS.to_numpy(ngrams_1.lengths), [1, 3, 6, 9, 12])
+ assert_array_equal(OPS.to_numpy(ngrams_1.lengths), OPS.to_numpy(ngrams_2.lengths))
+ assert_array_equal(OPS.to_numpy(ngrams_1.data), OPS.to_numpy(ngrams_2.data))
# one more variation
suggester_factory = registry.misc.get("spacy.ngram_range_suggester.v1")
range_suggester = suggester_factory(min_size=2, max_size=4)
ngrams_3 = range_suggester(docs)
- assert_equal(ngrams_3.lengths, [0, 1, 3, 6, 9])
+ assert_array_equal(OPS.to_numpy(ngrams_3.lengths), [0, 1, 3, 6, 9])
diff --git a/spacy/tests/serialize/test_serialize_pipeline.py b/spacy/tests/serialize/test_serialize_pipeline.py
index c8162a690..05871a524 100644
--- a/spacy/tests/serialize/test_serialize_pipeline.py
+++ b/spacy/tests/serialize/test_serialize_pipeline.py
@@ -1,5 +1,5 @@
import pytest
-from spacy import registry, Vocab
+from spacy import registry, Vocab, load
from spacy.pipeline import Tagger, DependencyParser, EntityRecognizer
from spacy.pipeline import TextCategorizer, SentenceRecognizer, TrainablePipe
from spacy.pipeline.dep_parser import DEFAULT_PARSER_MODEL
@@ -268,3 +268,21 @@ def test_serialize_custom_trainable_pipe():
pipe.to_disk(d)
new_pipe = CustomPipe(Vocab(), Linear()).from_disk(d)
assert new_pipe.to_bytes() == pipe_bytes
+
+
+def test_load_without_strings():
+ nlp = spacy.blank("en")
+ orig_strings_length = len(nlp.vocab.strings)
+ word = "unlikely_word_" * 20
+ nlp.vocab.strings.add(word)
+ assert len(nlp.vocab.strings) == orig_strings_length + 1
+ with make_tempdir() as d:
+ nlp.to_disk(d)
+ # reload with strings
+ reloaded_nlp = load(d)
+ assert len(nlp.vocab.strings) == len(reloaded_nlp.vocab.strings)
+ assert word in reloaded_nlp.vocab.strings
+ # reload without strings
+ reloaded_nlp = load(d, exclude=["strings"])
+ assert orig_strings_length == len(reloaded_nlp.vocab.strings)
+ assert word not in reloaded_nlp.vocab.strings
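Review note: `test_load_without_strings` only passes if each component forwards `exclude` to its vocab, which is what the `Parser` and `Tokenizer` hunks in this patch change. A minimal sketch of that forwarding pattern follows, using hypothetical toy classes (`ToyVocab`, `ToyComponent`) and dicts in place of spaCy's real serialization.

```python
class ToyVocab:
    """Hypothetical stand-in for a vocab with two serializable parts."""

    def __init__(self, strings):
        self.strings = list(strings)
        self.lookups = {}

    def to_dict(self, exclude=()):
        getters = {
            "strings": lambda: list(self.strings),
            "lookups": lambda: dict(self.lookups),
        }
        return {name: get() for name, get in getters.items() if name not in exclude}


class ToyComponent:
    """Hypothetical component that owns a vocab and a config dict."""

    def __init__(self, vocab):
        self.vocab = vocab
        self.cfg = {}

    def to_dict(self, exclude=()):
        getters = {
            # Forwarding `exclude` lets a nested key such as "strings" take effect.
            "vocab": lambda: self.vocab.to_dict(exclude=exclude),
            "cfg": lambda: dict(self.cfg),
        }
        return {name: get() for name, get in getters.items() if name not in exclude}
```

Without the forwarded argument, `exclude=["strings"]` would only be checked against the component's own top-level keys and the nested strings would still be written.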
diff --git a/spacy/tokenizer.pyx b/spacy/tokenizer.pyx
index 61a7582b1..5a89e5a17 100644
--- a/spacy/tokenizer.pyx
+++ b/spacy/tokenizer.pyx
@@ -765,7 +765,7 @@ cdef class Tokenizer:
DOCS: https://spacy.io/api/tokenizer#to_bytes
"""
serializers = {
- "vocab": lambda: self.vocab.to_bytes(),
+ "vocab": lambda: self.vocab.to_bytes(exclude=exclude),
"prefix_search": lambda: _get_regex_pattern(self.prefix_search),
"suffix_search": lambda: _get_regex_pattern(self.suffix_search),
"infix_finditer": lambda: _get_regex_pattern(self.infix_finditer),
@@ -786,7 +786,7 @@ cdef class Tokenizer:
"""
data = {}
deserializers = {
- "vocab": lambda b: self.vocab.from_bytes(b),
+ "vocab": lambda b: self.vocab.from_bytes(b, exclude=exclude),
"prefix_search": lambda b: data.setdefault("prefix_search", b),
"suffix_search": lambda b: data.setdefault("suffix_search", b),
"infix_finditer": lambda b: data.setdefault("infix_finditer", b),
diff --git a/spacy/tokens/_retokenize.pyi b/spacy/tokens/_retokenize.pyi
new file mode 100644
index 000000000..b829b71a3
--- /dev/null
+++ b/spacy/tokens/_retokenize.pyi
@@ -0,0 +1,17 @@
+from typing import Dict, Any, Union, List, Tuple
+from .doc import Doc
+from .span import Span
+from .token import Token
+
+class Retokenizer:
+ def __init__(self, doc: Doc) -> None: ...
+ def merge(self, span: Span, attrs: Dict[Union[str, int], Any] = ...) -> None: ...
+ def split(
+ self,
+ token: Token,
+ orths: List[str],
+ heads: List[Union[Token, Tuple[Token, int]]],
+ attrs: Dict[Union[str, int], List[Any]] = ...,
+ ) -> None: ...
+ def __enter__(self) -> Retokenizer: ...
+ def __exit__(self, *args: Any) -> None: ...
diff --git a/spacy/tokens/doc.pyi b/spacy/tokens/doc.pyi
new file mode 100644
index 000000000..8688fb91f
--- /dev/null
+++ b/spacy/tokens/doc.pyi
@@ -0,0 +1,180 @@
+from typing import (
+ Callable,
+ Protocol,
+ Iterable,
+ Iterator,
+ Optional,
+ Union,
+ Tuple,
+ List,
+ Dict,
+ Any,
+ overload,
+)
+from cymem.cymem import Pool
+from thinc.types import Floats1d, Floats2d, Ints2d
+from .span import Span
+from .token import Token
+from ._dict_proxies import SpanGroups
+from ._retokenize import Retokenizer
+from ..lexeme import Lexeme
+from ..vocab import Vocab
+from .underscore import Underscore
+from pathlib import Path
+import numpy
+
+class DocMethod(Protocol):
+ def __call__(self: Doc, *args: Any, **kwargs: Any) -> Any: ...
+
+class Doc:
+ vocab: Vocab
+ mem: Pool
+ spans: SpanGroups
+ max_length: int
+ length: int
+ sentiment: float
+ cats: Dict[str, float]
+ user_hooks: Dict[str, Callable[..., Any]]
+ user_token_hooks: Dict[str, Callable[..., Any]]
+ user_span_hooks: Dict[str, Callable[..., Any]]
+ tensor: numpy.ndarray
+ user_data: Dict[str, Any]
+ has_unknown_spaces: bool
+ @classmethod
+ def set_extension(
+ cls,
+ name: str,
+ default: Optional[Any] = ...,
+ getter: Optional[Callable[[Doc], Any]] = ...,
+ setter: Optional[Callable[[Doc, Any], None]] = ...,
+ method: Optional[DocMethod] = ...,
+ force: bool = ...,
+ ) -> None: ...
+ @classmethod
+ def get_extension(
+ cls, name: str
+ ) -> Tuple[
+ Optional[Any],
+ Optional[DocMethod],
+ Optional[Callable[[Doc], Any]],
+ Optional[Callable[[Doc, Any], None]],
+ ]: ...
+ @classmethod
+ def has_extension(cls, name: str) -> bool: ...
+ @classmethod
+ def remove_extension(
+ cls, name: str
+ ) -> Tuple[
+ Optional[Any],
+ Optional[DocMethod],
+ Optional[Callable[[Doc], Any]],
+ Optional[Callable[[Doc, Any], None]],
+ ]: ...
+ def __init__(
+ self,
+ vocab: Vocab,
+ words: Optional[List[str]] = ...,
+ spaces: Optional[List[bool]] = ...,
+ user_data: Optional[Dict[Any, Any]] = ...,
+ tags: Optional[List[str]] = ...,
+ pos: Optional[List[str]] = ...,
+ morphs: Optional[List[str]] = ...,
+ lemmas: Optional[List[str]] = ...,
+ heads: Optional[List[int]] = ...,
+ deps: Optional[List[str]] = ...,
+ sent_starts: Optional[List[Union[bool, None]]] = ...,
+ ents: Optional[List[str]] = ...,
+ ) -> None: ...
+ @property
+ def _(self) -> Underscore: ...
+ @property
+ def is_tagged(self) -> bool: ...
+ @property
+ def is_parsed(self) -> bool: ...
+ @property
+ def is_nered(self) -> bool: ...
+ @property
+ def is_sentenced(self) -> bool: ...
+ def has_annotation(
+ self, attr: Union[int, str], *, require_complete: bool = ...
+ ) -> bool: ...
+ @overload
+ def __getitem__(self, i: int) -> Token: ...
+ @overload
+ def __getitem__(self, i: slice) -> Span: ...
+ def __iter__(self) -> Iterator[Token]: ...
+ def __len__(self) -> int: ...
+ def __unicode__(self) -> str: ...
+ def __bytes__(self) -> bytes: ...
+ def __str__(self) -> str: ...
+ def __repr__(self) -> str: ...
+ @property
+ def doc(self) -> Doc: ...
+ def char_span(
+ self,
+ start_idx: int,
+ end_idx: int,
+ label: Union[int, str] = ...,
+ kb_id: Union[int, str] = ...,
+ vector: Optional[Floats1d] = ...,
+ alignment_mode: str = ...,
+ ) -> Span: ...
+ def similarity(self, other: Union[Doc, Span, Token, Lexeme]) -> float: ...
+ @property
+ def has_vector(self) -> bool: ...
+ vector: Floats1d
+ vector_norm: float
+ @property
+ def text(self) -> str: ...
+ @property
+ def text_with_ws(self) -> str: ...
+ ents: Tuple[Span]
+ def set_ents(
+ self,
+ entities: List[Span],
+ *,
+ blocked: Optional[List[Span]] = ...,
+ missing: Optional[List[Span]] = ...,
+ outside: Optional[List[Span]] = ...,
+ default: str = ...
+ ) -> None: ...
+ @property
+ def noun_chunks(self) -> Iterator[Span]: ...
+ @property
+ def sents(self) -> Iterator[Span]: ...
+ @property
+ def lang(self) -> int: ...
+ @property
+ def lang_(self) -> str: ...
+ def count_by(
+ self, attr_id: int, exclude: Optional[Any] = ..., counts: Optional[Any] = ...
+ ) -> Dict[Any, int]: ...
+ def from_array(self, attrs: List[int], array: Ints2d) -> Doc: ...
+ @staticmethod
+ def from_docs(
+ docs: List[Doc],
+ ensure_whitespace: bool = ...,
+ attrs: Optional[Union[Tuple[Union[str, int]], List[Union[int, str]]]] = ...,
+ ) -> Doc: ...
+ def get_lca_matrix(self) -> Ints2d: ...
+ def copy(self) -> Doc: ...
+ def to_disk(
+ self, path: Union[str, Path], *, exclude: Iterable[str] = ...
+ ) -> None: ...
+ def from_disk(
+ self, path: Union[str, Path], *, exclude: Union[List[str], Tuple[str]] = ...
+ ) -> Doc: ...
+ def to_bytes(self, *, exclude: Union[List[str], Tuple[str]] = ...) -> bytes: ...
+ def from_bytes(
+ self, bytes_data: bytes, *, exclude: Union[List[str], Tuple[str]] = ...
+ ) -> Doc: ...
+ def to_dict(self, *, exclude: Union[List[str], Tuple[str]] = ...) -> Dict[str, Any]: ...
+ def from_dict(
+ self, msg: Dict[str, Any], *, exclude: Union[List[str], Tuple[str]] = ...
+ ) -> Doc: ...
+ def extend_tensor(self, tensor: Floats2d) -> None: ...
+ def retokenize(self) -> Retokenizer: ...
+ def to_json(self, underscore: Optional[List[str]] = ...) -> Dict[str, Any]: ...
+ def to_utf8_array(self, nr_char: int = ...) -> Ints2d: ...
+ @staticmethod
+ def _get_array_attrs() -> Tuple[Any]: ...
diff --git a/spacy/tokens/morphanalysis.pyi b/spacy/tokens/morphanalysis.pyi
new file mode 100644
index 000000000..c7e05e58f
--- /dev/null
+++ b/spacy/tokens/morphanalysis.pyi
@@ -0,0 +1,20 @@
+from typing import Any, Dict, Iterator, List, Union
+from ..vocab import Vocab
+
+class MorphAnalysis:
+ def __init__(
+ self, vocab: Vocab, features: Union[Dict[str, str], str] = ...
+ ) -> None: ...
+ @classmethod
+ def from_id(cls, vocab: Vocab, key: Any) -> MorphAnalysis: ...
+ def __contains__(self, feature: str) -> bool: ...
+ def __iter__(self) -> Iterator[str]: ...
+ def __len__(self) -> int: ...
+ def __hash__(self) -> int: ...
+ def __eq__(self, other: MorphAnalysis) -> bool: ...
+ def __ne__(self, other: MorphAnalysis) -> bool: ...
+ def get(self, field: Any) -> List[str]: ...
+ def to_json(self) -> str: ...
+ def to_dict(self) -> Dict[str, str]: ...
+ def __str__(self) -> str: ...
+ def __repr__(self) -> str: ...
diff --git a/spacy/tokens/span.pyi b/spacy/tokens/span.pyi
new file mode 100644
index 000000000..4f65abace
--- /dev/null
+++ b/spacy/tokens/span.pyi
@@ -0,0 +1,124 @@
+from typing import Callable, Protocol, Iterator, Optional, Union, Tuple, Any, overload
+from thinc.types import Floats1d, Ints2d, FloatsXd
+from .doc import Doc
+from .token import Token
+from .underscore import Underscore
+from ..lexeme import Lexeme
+from ..vocab import Vocab
+
+class SpanMethod(Protocol):
+ def __call__(self: Span, *args: Any, **kwargs: Any) -> Any: ...
+
+class Span:
+ @classmethod
+ def set_extension(
+ cls,
+ name: str,
+ default: Optional[Any] = ...,
+ getter: Optional[Callable[[Span], Any]] = ...,
+ setter: Optional[Callable[[Span, Any], None]] = ...,
+ method: Optional[SpanMethod] = ...,
+ force: bool = ...,
+ ) -> None: ...
+ @classmethod
+ def get_extension(
+ cls, name: str
+ ) -> Tuple[
+ Optional[Any],
+ Optional[SpanMethod],
+ Optional[Callable[[Span], Any]],
+ Optional[Callable[[Span, Any], None]],
+ ]: ...
+ @classmethod
+ def has_extension(cls, name: str) -> bool: ...
+ @classmethod
+ def remove_extension(
+ cls, name: str
+ ) -> Tuple[
+ Optional[Any],
+ Optional[SpanMethod],
+ Optional[Callable[[Span], Any]],
+ Optional[Callable[[Span, Any], None]],
+ ]: ...
+ def __init__(
+ self,
+ doc: Doc,
+ start: int,
+ end: int,
+ label: int = ...,
+ vector: Optional[Floats1d] = ...,
+ vector_norm: Optional[float] = ...,
+ kb_id: Optional[int] = ...,
+ ) -> None: ...
+ def __richcmp__(self, other: Span, op: int) -> bool: ...
+ def __hash__(self) -> int: ...
+ def __len__(self) -> int: ...
+ def __repr__(self) -> str: ...
+ @overload
+ def __getitem__(self, i: int) -> Token: ...
+ @overload
+ def __getitem__(self, i: slice) -> Span: ...
+ def __iter__(self) -> Iterator[Token]: ...
+ @property
+ def _(self) -> Underscore: ...
+ def as_doc(self, *, copy_user_data: bool = ...) -> Doc: ...
+ def get_lca_matrix(self) -> Ints2d: ...
+ def similarity(self, other: Union[Doc, Span, Token, Lexeme]) -> float: ...
+ @property
+ def vocab(self) -> Vocab: ...
+ @property
+ def sent(self) -> Span: ...
+ @property
+ def ents(self) -> Tuple[Span]: ...
+ @property
+ def has_vector(self) -> bool: ...
+ @property
+ def vector(self) -> Floats1d: ...
+ @property
+ def vector_norm(self) -> float: ...
+ @property
+ def tensor(self) -> FloatsXd: ...
+ @property
+ def sentiment(self) -> float: ...
+ @property
+ def text(self) -> str: ...
+ @property
+ def text_with_ws(self) -> str: ...
+ @property
+ def noun_chunks(self) -> Iterator[Span]: ...
+ @property
+ def root(self) -> Token: ...
+ def char_span(
+ self,
+ start_idx: int,
+ end_idx: int,
+ label: int = ...,
+ kb_id: int = ...,
+ vector: Optional[Floats1d] = ...,
+ ) -> Span: ...
+ @property
+ def conjuncts(self) -> Tuple[Token]: ...
+ @property
+ def lefts(self) -> Iterator[Token]: ...
+ @property
+ def rights(self) -> Iterator[Token]: ...
+ @property
+ def n_lefts(self) -> int: ...
+ @property
+ def n_rights(self) -> int: ...
+ @property
+ def subtree(self) -> Iterator[Token]: ...
+ start: int
+ end: int
+ start_char: int
+ end_char: int
+ label: int
+ kb_id: int
+ ent_id: int
+ ent_id_: str
+ @property
+ def orth_(self) -> str: ...
+ @property
+ def lemma_(self) -> str: ...
+ label_: str
+ kb_id_: str
diff --git a/spacy/tokens/span.pyx b/spacy/tokens/span.pyx
index 093b2a4da..48c6053c1 100644
--- a/spacy/tokens/span.pyx
+++ b/spacy/tokens/span.pyx
@@ -105,13 +105,18 @@ cdef class Span:
if label not in doc.vocab.strings:
raise ValueError(Errors.E084.format(label=label))
+ start_char = doc[start].idx if start < doc.length else len(doc.text)
+ if start == end:
+ end_char = start_char
+ else:
+ end_char = doc[end - 1].idx + len(doc[end - 1])
self.c = SpanC(
label=label,
kb_id=kb_id,
start=start,
end=end,
- start_char=doc[start].idx if start < doc.length else 0,
- end_char=doc[end - 1].idx + len(doc[end - 1]) if end >= 1 else 0,
+ start_char=start_char,
+ end_char=end_char,
)
self._vector = vector
self._vector_norm = vector_norm
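Review note: the new offset logic in this hunk can be checked in isolation. The sketch below mirrors it in pure Python, assuming `start` and `end` have already been clamped to `[0, n_tokens]`. The function name and the list-based arguments are hypothetical stand-ins for `doc[i].idx`, `len(doc[i])` and `len(doc.text)`.

```python
def span_char_offsets(token_idxs, token_lens, text_len, start, end):
    """Return (start_char, end_char) for the token span [start, end).

    Hypothetical helper mirroring the hunk above; not spaCy API.
    """
    n_tokens = len(token_idxs)
    # A span starting past the last token starts at the end of the text.
    start_char = token_idxs[start] if start < n_tokens else text_len
    if start == end:
        # Empty spans have zero width in characters as well.
        end_char = start_char
    else:
        end_char = token_idxs[end - 1] + token_lens[end - 1]
    return start_char, end_char
```

For the text `"ab cd"` with tokens at character offsets 0 and 3, the empty span `[1, 1)` maps to `(3, 3)` and the empty span at the end, `[2, 2)`, maps to `(5, 5)`, which is the behaviour asserted in `test_span_boundaries`.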
diff --git a/spacy/tokens/span_group.pyi b/spacy/tokens/span_group.pyi
new file mode 100644
index 000000000..4bd6bec27
--- /dev/null
+++ b/spacy/tokens/span_group.pyi
@@ -0,0 +1,24 @@
+from typing import Any, Dict, Iterable
+from .doc import Doc
+from .span import Span
+
+class SpanGroup:
+ def __init__(
+ self,
+ doc: Doc,
+ *,
+ name: str = ...,
+ attrs: Dict[str, Any] = ...,
+ spans: Iterable[Span] = ...
+ ) -> None: ...
+ def __repr__(self) -> str: ...
+ @property
+ def doc(self) -> Doc: ...
+ @property
+ def has_overlap(self) -> bool: ...
+ def __len__(self) -> int: ...
+ def append(self, span: Span) -> None: ...
+ def extend(self, spans: Iterable[Span]) -> None: ...
+ def __getitem__(self, i: int) -> Span: ...
+ def to_bytes(self) -> bytes: ...
+ def from_bytes(self, bytes_data: bytes) -> SpanGroup: ...
diff --git a/spacy/tokens/token.pyi b/spacy/tokens/token.pyi
new file mode 100644
index 000000000..23d028ffd
--- /dev/null
+++ b/spacy/tokens/token.pyi
@@ -0,0 +1,208 @@
+from typing import (
+ Callable,
+ Protocol,
+ Iterator,
+ Optional,
+ Union,
+ Tuple,
+ Any,
+)
+from thinc.types import Floats1d, FloatsXd
+from .doc import Doc
+from .span import Span
+from .morphanalysis import MorphAnalysis
+from ..lexeme import Lexeme
+from ..vocab import Vocab
+from .underscore import Underscore
+
+class TokenMethod(Protocol):
+ def __call__(self: Token, *args: Any, **kwargs: Any) -> Any: ...
+
+class Token:
+ i: int
+ doc: Doc
+ vocab: Vocab
+ @classmethod
+ def set_extension(
+ cls,
+ name: str,
+ default: Optional[Any] = ...,
+ getter: Optional[Callable[[Token], Any]] = ...,
+ setter: Optional[Callable[[Token, Any], None]] = ...,
+ method: Optional[TokenMethod] = ...,
+ force: bool = ...,
+ ) -> None: ...
+ @classmethod
+ def get_extension(
+ cls, name: str
+ ) -> Tuple[
+ Optional[Any],
+ Optional[TokenMethod],
+ Optional[Callable[[Token], Any]],
+ Optional[Callable[[Token, Any], None]],
+ ]: ...
+ @classmethod
+ def has_extension(cls, name: str) -> bool: ...
+ @classmethod
+ def remove_extension(
+ cls, name: str
+ ) -> Tuple[
+ Optional[Any],
+ Optional[TokenMethod],
+ Optional[Callable[[Token], Any]],
+ Optional[Callable[[Token, Any], None]],
+ ]: ...
+ def __init__(self, vocab: Vocab, doc: Doc, offset: int) -> None: ...
+ def __hash__(self) -> int: ...
+ def __len__(self) -> int: ...
+ def __unicode__(self) -> str: ...
+ def __bytes__(self) -> bytes: ...
+ def __str__(self) -> str: ...
+ def __repr__(self) -> str: ...
+ def __richcmp__(self, other: Token, op: int) -> bool: ...
+ @property
+ def _(self) -> Underscore: ...
+ def nbor(self, i: int = ...) -> Token: ...
+ def similarity(self, other: Union[Doc, Span, Token, Lexeme]) -> float: ...
+ def has_morph(self) -> bool: ...
+ morph: MorphAnalysis
+ @property
+ def lex(self) -> Lexeme: ...
+ @property
+ def lex_id(self) -> int: ...
+ @property
+ def rank(self) -> int: ...
+ @property
+ def text(self) -> str: ...
+ @property
+ def text_with_ws(self) -> str: ...
+ @property
+ def prob(self) -> float: ...
+ @property
+ def sentiment(self) -> float: ...
+ @property
+ def lang(self) -> int: ...
+ @property
+ def idx(self) -> int: ...
+ @property
+ def cluster(self) -> int: ...
+ @property
+ def orth(self) -> int: ...
+ @property
+ def lower(self) -> int: ...
+ @property
+ def norm(self) -> int: ...
+ @property
+ def shape(self) -> int: ...
+ @property
+ def prefix(self) -> int: ...
+ @property
+ def suffix(self) -> int: ...
+ lemma: int
+ pos: int
+ tag: int
+ dep: int
+ @property
+ def has_vector(self) -> bool: ...
+ @property
+ def vector(self) -> Floats1d: ...
+ @property
+ def vector_norm(self) -> float: ...
+ @property
+ def tensor(self) -> Optional[FloatsXd]: ...
+ @property
+ def n_lefts(self) -> int: ...
+ @property
+ def n_rights(self) -> int: ...
+ @property
+ def sent(self) -> Span: ...
+ sent_start: bool
+ is_sent_start: Optional[bool]
+ is_sent_end: Optional[bool]
+ @property
+ def lefts(self) -> Iterator[Token]: ...
+ @property
+ def rights(self) -> Iterator[Token]: ...
+ @property
+ def children(self) -> Iterator[Token]: ...
+ @property
+ def subtree(self) -> Iterator[Token]: ...
+ @property
+ def left_edge(self) -> Token: ...
+ @property
+ def right_edge(self) -> Token: ...
+ @property
+ def ancestors(self) -> Iterator[Token]: ...
+ def is_ancestor(self, descendant: Token) -> bool: ...
+ def has_head(self) -> bool: ...
+ head: Token
+ @property
+ def conjuncts(self) -> Tuple[Token]: ...
+ ent_type: int
+ ent_type_: str
+ @property
+ def ent_iob(self) -> int: ...
+ @classmethod
+ def iob_strings(cls) -> Tuple[str]: ...
+ @property
+ def ent_iob_(self) -> str: ...
+ ent_id: int
+ ent_id_: str
+ ent_kb_id: int
+ ent_kb_id_: str
+ @property
+ def whitespace_(self) -> str: ...
+ @property
+ def orth_(self) -> str: ...
+ @property
+ def lower_(self) -> str: ...
+ norm_: str
+ @property
+ def shape_(self) -> str: ...
+ @property
+ def prefix_(self) -> str: ...
+ @property
+ def suffix_(self) -> str: ...
+ @property
+ def lang_(self) -> str: ...
+ lemma_: str
+ pos_: str
+ tag_: str
+ def has_dep(self) -> bool: ...
+ dep_: str
+ @property
+ def is_oov(self) -> bool: ...
+ @property
+ def is_stop(self) -> bool: ...
+ @property
+ def is_alpha(self) -> bool: ...
+ @property
+ def is_ascii(self) -> bool: ...
+ @property
+ def is_digit(self) -> bool: ...
+ @property
+ def is_lower(self) -> bool: ...
+ @property
+ def is_upper(self) -> bool: ...
+ @property
+ def is_title(self) -> bool: ...
+ @property
+ def is_punct(self) -> bool: ...
+ @property
+ def is_space(self) -> bool: ...
+ @property
+ def is_bracket(self) -> bool: ...
+ @property
+ def is_quote(self) -> bool: ...
+ @property
+ def is_left_punct(self) -> bool: ...
+ @property
+ def is_right_punct(self) -> bool: ...
+ @property
+ def is_currency(self) -> bool: ...
+ @property
+ def like_url(self) -> bool: ...
+ @property
+ def like_num(self) -> bool: ...
+ @property
+ def like_email(self) -> bool: ...
diff --git a/spacy/training/loggers.py b/spacy/training/loggers.py
index f7f70226d..5cf2db6b3 100644
--- a/spacy/training/loggers.py
+++ b/spacy/training/loggers.py
@@ -29,7 +29,7 @@ def console_logger(progress_bar: bool = False):
def setup_printer(
nlp: "Language", stdout: IO = sys.stdout, stderr: IO = sys.stderr
) -> Tuple[Callable[[Optional[Dict[str, Any]]], None], Callable[[], None]]:
- write = lambda text: stdout.write(f"{text}\n")
+ write = lambda text: print(text, file=stdout, flush=True)
msg = Printer(no_print=True)
# ensure that only trainable components are logged
logged_pipes = [
diff --git a/spacy/vocab.pyi b/spacy/vocab.pyi
new file mode 100644
index 000000000..0a8ef6198
--- /dev/null
+++ b/spacy/vocab.pyi
@@ -0,0 +1,78 @@
+from typing import (
+ Callable,
+ Iterator,
+ Optional,
+ Union,
+ Tuple,
+ List,
+ Dict,
+ Any,
+)
+from thinc.types import Floats1d, FloatsXd
+from . import Language
+from .strings import StringStore
+from .lexeme import Lexeme
+from .lookups import Lookups
+from .tokens import Doc, Span
+from pathlib import Path
+
+def create_vocab(
+ lang: Language, defaults: Any, vectors_name: Optional[str] = ...
+) -> Vocab: ...
+
+class Vocab:
+ def __init__(
+ self,
+ lex_attr_getters: Optional[Dict[str, Callable[[str], Any]]] = ...,
+ strings: Optional[Union[List[str], StringStore]] = ...,
+ lookups: Optional[Lookups] = ...,
+ oov_prob: float = ...,
+ vectors_name: Optional[str] = ...,
+ writing_system: Dict[str, Any] = ...,
+ get_noun_chunks: Optional[Callable[[Union[Doc, Span]], Iterator[Span]]] = ...,
+ ) -> None: ...
+ @property
+ def lang(self) -> Language: ...
+ def __len__(self) -> int: ...
+ def add_flag(
+ self, flag_getter: Callable[[str], bool], flag_id: int = ...
+ ) -> int: ...
+ def __contains__(self, key: str) -> bool: ...
+ def __iter__(self) -> Iterator[Lexeme]: ...
+ def __getitem__(self, id_or_string: Union[str, int]) -> Lexeme: ...
+ @property
+ def vectors_length(self) -> int: ...
+ def reset_vectors(
+ self, *, width: Optional[int] = ..., shape: Optional[int] = ...
+ ) -> None: ...
+ def prune_vectors(self, nr_row: int, batch_size: int = ...) -> Dict[str, float]: ...
+ def get_vector(
+ self,
+ orth: Union[int, str],
+ minn: Optional[int] = ...,
+ maxn: Optional[int] = ...,
+ ) -> FloatsXd: ...
+ def set_vector(self, orth: Union[int, str], vector: Floats1d) -> None: ...
+ def has_vector(self, orth: Union[int, str]) -> bool: ...
+ lookups: Lookups
+ def to_disk(
+ self, path: Union[str, Path], *, exclude: Union[List[str], Tuple[str]] = ...
+ ) -> None: ...
+ def from_disk(
+ self, path: Union[str, Path], *, exclude: Union[List[str], Tuple[str]] = ...
+ ) -> Vocab: ...
+ def to_bytes(self, *, exclude: Union[List[str], Tuple[str]] = ...) -> bytes: ...
+ def from_bytes(
+ self, bytes_data: bytes, *, exclude: Union[List[str], Tuple[str]] = ...
+ ) -> Vocab: ...
+
+def pickle_vocab(vocab: Vocab) -> Any: ...
+def unpickle_vocab(
+ sstore: StringStore,
+ vectors: Any,
+ morphology: Any,
+ data_dir: Any,
+ lex_attr_getters: Any,
+ lookups: Any,
+ get_noun_chunks: Any,
+) -> Vocab: ...
diff --git a/website/docs/api/architectures.md b/website/docs/api/architectures.md
index e90dc1183..f1a11bbc4 100644
--- a/website/docs/api/architectures.md
+++ b/website/docs/api/architectures.md
@@ -409,7 +409,7 @@ a single token vector given zero or more wordpiece vectors.
>
> ```ini
> [model]
-> @architectures = "spacy.Tok2VecTransformer.v1"
+> @architectures = "spacy-transformers.Tok2VecTransformer.v1"
> name = "albert-base-v2"
> tokenizer_config = {"use_fast": false}
> grad_factor = 1.0
diff --git a/website/docs/api/data-formats.md b/website/docs/api/data-formats.md
index 7dbf50595..1bdeb509a 100644
--- a/website/docs/api/data-formats.md
+++ b/website/docs/api/data-formats.md
@@ -90,7 +90,6 @@ Defines the `nlp` object, its tokenizer and
> ```ini
> [components.textcat]
> factory = "textcat"
-> labels = ["POSITIVE", "NEGATIVE"]
>
> [components.textcat.model]
> @architectures = "spacy.TextCatBOW.v2"
diff --git a/website/docs/api/entityruler.md b/website/docs/api/entityruler.md
index 66cb6d4e4..93b5da45a 100644
--- a/website/docs/api/entityruler.md
+++ b/website/docs/api/entityruler.md
@@ -35,11 +35,11 @@ how the component should be configured. You can override its settings via the
> ```
| Setting | Description |
-| --------------------- | --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | --- | ----------- |
+| --------------------- | --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `phrase_matcher_attr` | Optional attribute name match on for the internal [`PhraseMatcher`](/api/phrasematcher), e.g. `LOWER` to match on the lowercase token text. Defaults to `None`. ~~Optional[Union[int, str]]~~ |
| `validate` | Whether patterns should be validated (passed to the `Matcher` and `PhraseMatcher`). Defaults to `False`. ~~bool~~ |
| `overwrite_ents` | If existing entities are present, e.g. entities added by the model, overwrite them by matches if necessary. Defaults to `False`. ~~bool~~ |
-| `ent_id_sep` | Separator used internally for entity IDs. Defaults to `" | | "`. ~~str~~ |
+| `ent_id_sep` | Separator used internally for entity IDs. Defaults to `"\|\|"`. ~~str~~ |
```python
%%GITHUB_SPACY/spacy/pipeline/entityruler.py
@@ -64,14 +64,14 @@ be a token pattern (list) or a phrase pattern (string). For example:
> ```
| Name | Description |
-| --------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | --- | ----------- |
+| --------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `nlp` | The shared nlp object to pass the vocab to the matchers and process phrase patterns. ~~Language~~ |
| `name` 3 | Instance name of the current pipeline component. Typically passed in automatically from the factory when the component is added. Used to disable the current entity ruler while creating phrase patterns with the nlp object. ~~str~~ |
| _keyword-only_ | |
| `phrase_matcher_attr` | Optional attribute name match on for the internal [`PhraseMatcher`](/api/phrasematcher), e.g. `LOWER` to match on the lowercase token text. Defaults to `None`. ~~Optional[Union[int, str]]~~ |
| `validate` | Whether patterns should be validated, passed to Matcher and PhraseMatcher as `validate`. Defaults to `False`. ~~bool~~ |
| `overwrite_ents` | If existing entities are present, e.g. entities added by the model, overwrite them by matches if necessary. Defaults to `False`. ~~bool~~ |
-| `ent_id_sep` | Separator used internally for entity IDs. Defaults to `" | | "`. ~~str~~ |
+| `ent_id_sep` | Separator used internally for entity IDs. Defaults to `"\|\|"`. ~~str~~ |
| `patterns` | Optional patterns to load in on initialization. ~~Optional[List[Dict[str, Union[str, List[dict]]]]]~~ |
## EntityRuler.initialize {#initialize tag="method" new="3"}
diff --git a/website/docs/api/matcher.md b/website/docs/api/matcher.md
index 9c15f8797..c34560dec 100644
--- a/website/docs/api/matcher.md
+++ b/website/docs/api/matcher.md
@@ -77,13 +77,14 @@ it compares to another value.
> ]
> ```
-| Attribute | Description |
-| -------------------------- | ------------------------------------------------------------------------------------------------------- |
-| `IN` | Attribute value is member of a list. ~~Any~~ |
-| `NOT_IN` | Attribute value is _not_ member of a list. ~~Any~~ |
-| `ISSUBSET` | Attribute values (for `MORPH`) are a subset of a list. ~~Any~~ |
-| `ISSUPERSET` | Attribute values (for `MORPH`) are a superset of a list. ~~Any~~ |
-| `==`, `>=`, `<=`, `>`, `<` | Attribute value is equal, greater or equal, smaller or equal, greater or smaller. ~~Union[int, float]~~ |
+| Attribute | Description |
+| -------------------------- | -------------------------------------------------------------------------------------------------------- |
+| `IN` | Attribute value is member of a list. ~~Any~~ |
+| `NOT_IN` | Attribute value is _not_ member of a list. ~~Any~~ |
+| `IS_SUBSET` | Attribute value (for `MORPH` or custom list attributes) is a subset of a list. ~~Any~~ |
+| `IS_SUPERSET` | Attribute value (for `MORPH` or custom list attributes) is a superset of a list. ~~Any~~ |
+| `INTERSECTS` | Attribute value (for `MORPH` or custom list attributes) has a non-empty intersection with a list. ~~Any~~ |
+| `==`, `>=`, `<=`, `>`, `<` | Attribute value is equal, greater or equal, smaller or equal, greater or smaller. ~~Union[int, float]~~ |
## Matcher.\_\_init\_\_ {#init tag="method"}
diff --git a/website/docs/api/vocab.md b/website/docs/api/vocab.md
index 8fe769cdd..320ad5605 100644
--- a/website/docs/api/vocab.md
+++ b/website/docs/api/vocab.md
@@ -29,7 +29,7 @@ Create the vocabulary.
| `oov_prob` | The default OOV probability. Defaults to `-20.0`. ~~float~~ |
| `vectors_name` 2.2 | A name to identify the vectors table. ~~str~~ |
| `writing_system` | A dictionary describing the language's writing system. Typically provided by [`Language.Defaults`](/api/language#defaults). ~~Dict[str, Any]~~ |
-| `get_noun_chunks` | A function that yields base noun phrases used for [`Doc.noun_chunks`](/ap/doc#noun_chunks). ~~Optional[Callable[[Union[Doc, Span], Iterator[Span]]]]~~ |
+| `get_noun_chunks` | A function that yields base noun phrases used for [`Doc.noun_chunks`](/api/doc#noun_chunks). ~~Optional[Callable[[Union[Doc, Span], Iterator[Span]]]]~~ |
## Vocab.\_\_len\_\_ {#len tag="method"}
diff --git a/website/docs/usage/rule-based-matching.md b/website/docs/usage/rule-based-matching.md
index 037850154..81c838584 100644
--- a/website/docs/usage/rule-based-matching.md
+++ b/website/docs/usage/rule-based-matching.md
@@ -232,15 +232,22 @@ following rich comparison attributes are available:
>
> # Matches tokens of length >= 10
> pattern2 = [{"LENGTH": {">=": 10}}]
+>
+> # Match based on morph attributes
+> pattern3 = [{"MORPH": {"IS_SUBSET": ["Number=Sing", "Gender=Neut"]}}]
+> # "", "Number=Sing" and "Number=Sing|Gender=Neut" will match as subsets
+> # "Number=Plur|Gender=Neut" will not match
+> # "Number=Sing|Gender=Neut|Polite=Infm" will not match because it's a superset
> ```
-| Attribute | Description |
-| -------------------------- | ------------------------------------------------------------------------------------------------------- |
-| `IN` | Attribute value is member of a list. ~~Any~~ |
-| `NOT_IN` | Attribute value is _not_ member of a list. ~~Any~~ |
-| `ISSUBSET` | Attribute values (for `MORPH`) are a subset of a list. ~~Any~~ |
-| `ISSUPERSET` | Attribute values (for `MORPH`) are a superset of a list. ~~Any~~ |
-| `==`, `>=`, `<=`, `>`, `<` | Attribute value is equal, greater or equal, smaller or equal, greater or smaller. ~~Union[int, float]~~ |
+| Attribute | Description |
+| -------------------------- | --------------------------------------------------------------------------------------------------------- |
+| `IN` | Attribute value is member of a list. ~~Any~~ |
+| `NOT_IN` | Attribute value is _not_ member of a list. ~~Any~~ |
+| `IS_SUBSET` | Attribute value (for `MORPH` or custom list attributes) is a subset of a list. ~~Any~~ |
+| `IS_SUPERSET` | Attribute value (for `MORPH` or custom list attributes) is a superset of a list. ~~Any~~ |
+| `INTERSECTS` | Attribute value (for `MORPH` or custom list attributes) has a non-empty intersection with a list. ~~Any~~ |
+| `==`, `>=`, `<=`, `>`, `<` | Attribute value is equal, greater or equal, smaller or equal, greater or smaller. ~~Union[int, float]~~ |
#### Regular expressions {#regex new="2.1"}
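Note on the hunk above: the semantics of `IS_SUBSET`, `IS_SUPERSET` and `INTERSECTS` can be illustrated with plain Python sets. This is only a sketch of the behaviour described in the table and in the `pattern3` comments, not spaCy's implementation. The helper names (`parse_morph`, `is_subset`, `is_superset`, `intersects`) are hypothetical, and the sketch assumes a morph value is a `|`-separated string of `Feature=Value` pairs.

```python
from typing import Iterable, Set


def parse_morph(morph: str) -> Set[str]:
    """Split a string such as "Number=Sing|Gender=Neut" into a set of features.

    Hypothetical helper: an empty string yields an empty set.
    """
    return set(morph.split("|")) if morph else set()


def is_subset(morph: str, values: Iterable[str]) -> bool:
    # IS_SUBSET: every feature on the token appears in the pattern's list.
    return parse_morph(morph) <= set(values)


def is_superset(morph: str, values: Iterable[str]) -> bool:
    # IS_SUPERSET: every item in the pattern's list appears on the token.
    return parse_morph(morph) >= set(values)


def intersects(morph: str, values: Iterable[str]) -> bool:
    # INTERSECTS: the token and the pattern's list share at least one item.
    return bool(parse_morph(morph) & set(values))


values = ["Number=Sing", "Gender=Neut"]
for morph in ["", "Number=Sing", "Number=Plur|Gender=Neut"]:
    print(repr(morph), is_subset(morph, values), intersects(morph, values))
```

Under these assumptions, the empty value is a subset of any list but intersects with none, which is why `""` matches `pattern3` in the documented example.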
diff --git a/website/docs/usage/training.md b/website/docs/usage/training.md
index 17fac05e5..6deba3761 100644
--- a/website/docs/usage/training.md
+++ b/website/docs/usage/training.md
@@ -652,7 +652,7 @@ excluded from the logs and the score won't be weighted.
| **Recall** (R) | Percentage of reference annotations recovered. Should increase. |
| **F-Score** (F) | Harmonic mean of precision and recall. Should increase. |
| **UAS** / **LAS** | Unlabeled and labeled attachment score for the dependency parser, i.e. the percentage of correct arcs. Should increase. |
-| **Words per second** (WPS) | Prediction speed in words per second. Should stay stable. |
+| **Speed** | Prediction speed in words per second (WPS). Should stay stable. |
Note that if the development data has raw text, some of the gold-standard
entities might not align to the predicted tokenization. These tokenization
diff --git a/website/docs/usage/v3.md b/website/docs/usage/v3.md
index 8b4d2de7c..980f06172 100644
--- a/website/docs/usage/v3.md
+++ b/website/docs/usage/v3.md
@@ -854,6 +854,19 @@ pipeline component, the [`AttributeRuler`](/api/attributeruler). See the
you have tag maps and morph rules in the v2.x format, you can load them into the
attribute ruler before training using the `[initialize]` block of your config.
+### Using lexeme tables
+
+To use lexeme tables like `lexeme_prob` when training a model from scratch, you
+need to add an entry to the `[initialize]` block of your config. The existing
+trained pipelines, for example, load the `lexeme_norm` table like this:
+
+```ini
+[initialize.lookups]
+@misc = "spacy.LookupsDataLoader.v1"
+lang = ${nlp.lang}
+tables = ["lexeme_norm"]
+```
+
> #### What does the initialization do?
>
> The `[initialize]` block is used when