Mirror of https://github.com/explosion/spaCy.git
Synced 2025-02-15 11:00:34 +03:00

Merge branch 'develop' into nightly.spacy.io
This commit is contained in: commit d8d25b1acb
.gitignore (vendored): 3 changes
@@ -18,8 +18,7 @@ website/.npm
 website/logs
 *.log
 npm-debug.log*
-website/www/
-website/_deploy.sh
+quickstart-training-generator.js
 
 # Cython / C extensions
 cythonize.json

@@ -5,7 +5,7 @@
 Thanks for your interest in contributing to spaCy 🎉 The project is maintained
 by [@honnibal](https://github.com/honnibal) and [@ines](https://github.com/ines),
 and we'll do our best to help you get started. This page will give you a quick
-overview of how things are organised and most importantly, how to get involved.
+overview of how things are organized and most importantly, how to get involved.
 
 ## Table of contents
 
@@ -195,7 +195,7 @@ modules in `.py` files, not Cython modules in `.pyx` and `.pxd` files.**
 ### Code formatting
 
 [`black`](https://github.com/ambv/black) is an opinionated Python code
-formatter, optimised to produce readable code and small diffs. You can run
+formatter, optimized to produce readable code and small diffs. You can run
 `black` from the command-line, or via your code editor. For example, if you're
 using [Visual Studio Code](https://code.visualstudio.com/), you can add the
 following to your `settings.json` to use `black` for formatting and auto-format
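The hunk above is cut off just before the `settings.json` snippet it refers to. For orientation, one plausible version of that snippet (the exact keys in the original file are not shown in this diff view) would be:

```json
{
    "python.formatting.provider": "black",
    "[python]": {
        "editor.formatOnSave": true
    }
}
```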
@@ -216,7 +216,7 @@ list of available editor integrations.
 #### Disabling formatting
 
 There are a few cases where auto-formatting doesn't improve readability – for
-example, in some of the the language data files like the `tag_map.py`, or in
+example, in some of the language data files like the `tag_map.py`, or in
 the tests that construct `Doc` objects from lists of words and other labels.
 Wrapping a block in `# fmt: off` and `# fmt: on` lets you disable formatting
 for that particular code. Here's an example:
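The example referenced at the end of the hunk is not included in this diff view. A minimal sketch of the `# fmt: off` / `# fmt: on` pattern it describes (the table data here is illustrative, not taken from spaCy's actual language files):

```python
# fmt: off
# Hand-aligned lookup table: black would reflow these entries onto
# separate lines, so formatting is disabled for this block only.
TAG_MAP = {
    "NN":  {"pos": "NOUN"},
    "NNS": {"pos": "NOUN", "number": "plur"},
    "VB":  {"pos": "VERB"},
}
# fmt: on
```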
@@ -286,7 +286,7 @@ Code that interacts with the file-system should accept objects that follow the
 If the function is user-facing and takes a path as an argument, it should check
 whether the path is provided as a string. Strings should be converted to
 `pathlib.Path` objects. Serialization and deserialization functions should always
-accept **file-like objects**, as it makes the library io-agnostic. Working on
+accept **file-like objects**, as it makes the library IO-agnostic. Working on
 buffers makes the code more general, easier to test, and compatible with Python
 3's asynchronous IO.
 
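The convention in this hunk (accept strings, normalize them to `pathlib.Path`) can be sketched as a small helper; it is written stand-alone here, though spaCy ships a similar utility in `spacy.util`:

```python
from pathlib import Path


def ensure_path(path):
    """Convert a string to a pathlib.Path; pass other objects through unchanged."""
    if isinstance(path, str):
        return Path(path)
    return path
```

A user-facing loader would call `ensure_path(path)` once at the top and work with `Path` methods from there, while its serialization counterpart keeps accepting any file-like object.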
@@ -384,7 +384,7 @@ of Python and C++, with additional complexity and syntax from numpy. The
 many "traps for new players". Working in Cython is very rewarding once you're
 over the initial learning curve. As with C and C++, the first way you write
 something in Cython will often be the performance-optimal approach. In contrast,
-Python optimisation generally requires a lot of experimentation. Is it faster to
+Python optimization generally requires a lot of experimentation. Is it faster to
 have an `if item in my_dict` check, or to use `.get()`? What about `try`/`except`?
 Does this numpy operation create a copy? There's no way to guess the answers to
 these questions, and you'll usually be dissatisfied with your results — so
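The kind of experimentation this passage describes is easy to run with the standard-library `timeit` module; this is a generic sketch, not a benchmark from the spaCy codebase, and which variant wins depends on your Python version and data:

```python
import timeit

data = {"cat": 3, "dog": 7}


def lookup_with_in(d, key):
    # Explicit membership check before indexing
    return d[key] if key in d else None


def lookup_with_get(d, key):
    # Single .get() call with a default of None
    return d.get(key)


# Time both variants; the point of the paragraph is to measure, not guess.
t_in = timeit.timeit(lambda: lookup_with_in(data, "cat"), number=50_000)
t_get = timeit.timeit(lambda: lookup_with_get(data, "cat"), number=50_000)
print(f"in-check: {t_in:.4f}s  .get(): {t_get:.4f}s")
```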
@@ -400,7 +400,7 @@ Python. If it's not fast enough the first time, just switch to Cython.
 - [PEP 8 Style Guide for Python Code](https://www.python.org/dev/peps/pep-0008/) (python.org)
 - [Official Cython documentation](http://docs.cython.org/en/latest/) (cython.org)
 - [Writing C in Cython](https://explosion.ai/blog/writing-c-in-cython) (explosion.ai)
-- [Multi-threading spaCy’s parser and named entity recogniser](https://explosion.ai/blog/multithreading-with-cython) (explosion.ai)
+- [Multi-threading spaCy’s parser and named entity recognizer](https://explosion.ai/blog/multithreading-with-cython) (explosion.ai)
 
 ## Adding tests
 
@@ -412,7 +412,7 @@ name. For example, tests for the `Tokenizer` can be found in
 all test files and test functions need to be prefixed with `test_`.
 
 When adding tests, make sure to use descriptive names, keep the code short and
-concise and only test for one behaviour at a time. Try to `parametrize` test
+concise and only test for one behavior at a time. Try to `parametrize` test
 cases wherever possible, use our pre-defined fixtures for spaCy components and
 avoid unnecessary imports.
 
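The conventions in this hunk (descriptive `test_`-prefixed names, one behavior per test, `parametrize`d cases) look roughly like this in practice; the contraction-splitting helper is a hypothetical stand-in, not actual spaCy tokenizer code:

```python
import pytest


def split_negation(text):
    # Hypothetical helper: split a trailing "n't" contraction off a token,
    # the kind of single behavior one parametrized test should cover.
    if text.endswith("n't") and len(text) > 3:
        return [text[:-3], "n't"]
    return [text]


@pytest.mark.parametrize(
    "text,expected",
    [
        ("don't", ["do", "n't"]),
        ("can't", ["ca", "n't"]),
        ("word", ["word"]),
    ],
)
def test_split_negation(text, expected):
    assert split_negation(text) == expected
```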
@@ -5,5 +5,5 @@ include README.md
 include pyproject.toml
 recursive-exclude spacy/lang *.json
 recursive-include spacy/lang *.json.gz
-recursive-include spacy/cli *.json
+recursive-include spacy/cli *.json *.yml
 recursive-include licenses *

@@ -49,9 +49,8 @@ It's commercial open-source software, released under the MIT license.
 
 ## 💬 Where to ask questions
 
-The spaCy project is maintained by [@honnibal](https://github.com/honnibal) and
-[@ines](https://github.com/ines), along with core contributors
-[@svlandeg](https://github.com/svlandeg) and
+The spaCy project is maintained by [@honnibal](https://github.com/honnibal),
+[@ines](https://github.com/ines), [@svlandeg](https://github.com/svlandeg) and
 [@adrianeboyd](https://github.com/adrianeboyd). Please understand that we won't
 be able to provide individual support via email. We also believe that help is
 much more valuable if it's shared publicly, so that more people can benefit from

@@ -15,7 +15,8 @@ import spacy.util
 from bin.ud import conll17_ud_eval
 from spacy.tokens import Token, Doc
 from spacy.gold import Example
-from spacy.util import compounding, minibatch, minibatch_by_words
+from spacy.util import compounding, minibatch
+from spacy.gold.batchers import minibatch_by_words
 from spacy.pipeline._parser_internals.nonproj import projectivize
 from spacy.matcher import Matcher
 from spacy import displacy

@@ -48,8 +48,7 @@ def main(model, output_dir=None):
     # You can change the dimension of vectors in your KB by using an encoder that changes the dimensionality.
     # For simplicity, we'll just use the original vector dimension here instead.
     vectors_dim = nlp.vocab.vectors.shape[1]
-    kb = KnowledgeBase(entity_vector_length=vectors_dim)
-    kb.initialize(nlp.vocab)
+    kb = KnowledgeBase(nlp.vocab, entity_vector_length=vectors_dim)
 
     # set up the data
     entity_ids = []
@@ -81,7 +80,7 @@ def main(model, output_dir=None):
         if not output_dir.exists():
             output_dir.mkdir()
         kb_path = str(output_dir / "kb")
-        kb.dump(kb_path)
+        kb.to_disk(kb_path)
         print()
         print("Saved KB to", kb_path)
 
@@ -96,9 +95,8 @@ def main(model, output_dir=None):
         print("Loading vocab from", vocab_path)
         print("Loading KB from", kb_path)
         vocab2 = Vocab().from_disk(vocab_path)
-        kb2 = KnowledgeBase(entity_vector_length=1)
-        kb.initialize(vocab2)
-        kb2.load_bulk(kb_path)
+        kb2 = KnowledgeBase(vocab2, entity_vector_length=1)
+        kb2.from_disk(kb_path)
         print()
         _print_kb(kb2)
 
@@ -83,7 +83,7 @@ def main(kb_path, vocab_path, output_dir=None, n_iter=50):
     if "entity_linker" not in nlp.pipe_names:
         print("Loading Knowledge Base from '%s'" % kb_path)
         cfg = {
-            "kb": {
+            "kb_loader": {
                 "@assets": "spacy.KBFromFile.v1",
                 "vocab_path": vocab_path,
                 "kb_path": kb_path,

@@ -36,11 +36,11 @@ redirects = [
     {from = "/docs/api/features", to = "/models/#architecture", force = true},
     {from = "/docs/api/philosophy", to = "/usage/spacy-101", force = true},
     {from = "/docs/usage/showcase", to = "/universe", force = true},
-    {from = "/tutorials/load-new-word-vectors", to = "/usage/vectors-similarity#custom", force = true},
+    {from = "/tutorials/load-new-word-vectors", to = "/usage/linguistic-features", force = true},
     {from = "/tutorials", to = "/usage/examples", force = true},
     # Old documentation pages (v2.x)
     {from = "/usage/adding-languages", to = "/usage/linguistic-features", force = true},
-    {from = "/usage/vectors-similarity", to = "/usage/vectors-embeddings", force = true},
+    {from = "/usage/vectors-similarity", to = "/usage/linguistic-features#vectors-similarity", force = true},
     {from = "/api/goldparse", to = "/api/top-level", force = true},
     {from = "/api/goldcorpus", to = "/api/corpus", force = true},
     {from = "/api/annotation", to = "/api/data-formats", force = true},

@@ -6,9 +6,10 @@ requires = [
     "cymem>=2.0.2,<2.1.0",
     "preshed>=3.0.2,<3.1.0",
     "murmurhash>=0.28.0,<1.1.0",
-    "thinc>=8.0.0a27,<8.0.0a30",
+    "thinc>=8.0.0a29,<8.0.0a40",
     "blis>=0.4.0,<0.5.0",
     "pytokenizations",
-    "smart_open>=2.0.0,<3.0.0"
+    "smart_open>=2.0.0,<3.0.0",
+    "pathy"
 ]
 build-backend = "setuptools.build_meta"

@@ -1,7 +1,7 @@
 # Our libraries
 cymem>=2.0.2,<2.1.0
 preshed>=3.0.2,<3.1.0
-thinc>=8.0.0a27,<8.0.0a30
+thinc>=8.0.0a29,<8.0.0a40
 blis>=0.4.0,<0.5.0
 ml_datasets>=0.1.1
 murmurhash>=0.28.0,<1.1.0
@@ -9,6 +9,7 @@ wasabi>=0.7.1,<1.1.0
 srsly>=2.1.0,<3.0.0
 catalogue>=0.0.7,<1.1.0
 typer>=0.3.0,<0.4.0
+pathy
 # Third party dependencies
 numpy>=1.15.0
 requests>=2.13.0,<3.0.0

@@ -34,18 +34,19 @@ setup_requires =
     cymem>=2.0.2,<2.1.0
     preshed>=3.0.2,<3.1.0
     murmurhash>=0.28.0,<1.1.0
-    thinc>=8.0.0a27,<8.0.0a30
+    thinc>=8.0.0a29,<8.0.0a40
 install_requires =
     # Our libraries
     murmurhash>=0.28.0,<1.1.0
     cymem>=2.0.2,<2.1.0
     preshed>=3.0.2,<3.1.0
-    thinc>=8.0.0a27,<8.0.0a30
+    thinc>=8.0.0a29,<8.0.0a40
     blis>=0.4.0,<0.5.0
     wasabi>=0.7.1,<1.1.0
     srsly>=2.1.0,<3.0.0
     catalogue>=0.0.7,<1.1.0
     typer>=0.3.0,<0.4.0
+    pathy
     # Third-party dependencies
     tqdm>=4.38.0,<5.0.0
     numpy>=1.15.0

@@ -1,6 +1,6 @@
 # fmt: off
 __title__ = "spacy-nightly"
-__version__ = "3.0.0a7"
+__version__ = "3.0.0a10"
 __release__ = True
 __download_url__ = "https://github.com/explosion/spacy-models/releases/download"
 __compatibility__ = "https://raw.githubusercontent.com/explosion/spacy-models/master/compatibility.json"

@@ -21,6 +21,8 @@ from .project.clone import project_clone  # noqa: F401
 from .project.assets import project_assets  # noqa: F401
 from .project.run import project_run  # noqa: F401
 from .project.dvc import project_update_dvc  # noqa: F401
+from .project.push import project_push  # noqa: F401
+from .project.pull import project_pull  # noqa: F401
 
 
 @app.command("link", no_args_is_help=True, deprecated=True, hidden=True)

@@ -1,4 +1,5 @@
-from typing import Dict, Any, Union, List, Optional
+from typing import Dict, Any, Union, List, Optional, TYPE_CHECKING
+import sys
 from pathlib import Path
 from wasabi import msg
 import srsly
@@ -8,11 +9,13 @@ from typer.main import get_command
 from contextlib import contextmanager
 from thinc.config import Config, ConfigValidationError
 from configparser import InterpolationError
-import sys
 
 from ..schemas import ProjectConfigSchema, validate
 from ..util import import_file
 
+if TYPE_CHECKING:
+    from pathy import Pathy  # noqa: F401
+
 
 PROJECT_FILE = "project.yml"
 PROJECT_LOCK = "project.lock"
@@ -68,11 +71,12 @@ def parse_config_overrides(args: List[str]) -> Dict[str, Any]:
         opt = args.pop(0)
         err = f"Invalid CLI argument '{opt}'"
         if opt.startswith("--"):  # new argument
-            opt = opt.replace("--", "").replace("-", "_")
+            opt = opt.replace("--", "")
             if "." not in opt:
                 msg.fail(f"{err}: can't override top-level section", exits=1)
             if "=" in opt:  # we have --opt=value
                 opt, value = opt.split("=", 1)
+                opt = opt.replace("-", "_")
             else:
                 if not args or args[0].startswith("--"):  # flag with no value
                     value = "true"
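The change in this hunk moves the `-` to `_` conversion so it runs only on the option name, after the value has been split off. A stand-alone sketch of the fixed behavior (simplified from the real `parse_config_overrides`):

```python
def parse_override(opt):
    # Strip the leading "--", then split off the value BEFORE converting
    # hyphens, so hyphens inside the value itself are preserved.
    opt = opt.replace("--", "")
    if "=" in opt:  # we have --opt=value
        opt, value = opt.split("=", 1)
        opt = opt.replace("-", "_")
    else:
        value = "true"
    return opt, value
```

With the old order, a hypothetical override like `--training.batch-size=my-value` would have had the value mangled to `my_value` as well.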
@@ -92,11 +96,12 @@ def parse_config_overrides(args: List[str]) -> Dict[str, Any]:
     return result
 
 
-def load_project_config(path: Path) -> Dict[str, Any]:
+def load_project_config(path: Path, interpolate: bool = True) -> Dict[str, Any]:
     """Load the project.yml file from a directory and validate it. Also make
     sure that all directories defined in the config exist.
 
     path (Path): The path to the project directory.
+    interpolate (bool): Whether to substitute project variables.
     RETURNS (Dict[str, Any]): The loaded project.yml.
     """
     config_path = path / PROJECT_FILE
@@ -109,16 +114,34 @@ def load_project_config(path: Path) -> Dict[str, Any]:
         msg.fail(invalid_err, e, exits=1)
     errors = validate(ProjectConfigSchema, config)
     if errors:
-        msg.fail(invalid_err, "\n".join(errors), exits=1)
+        msg.fail(invalid_err)
+        print("\n".join(errors))
+        sys.exit(1)
     validate_project_commands(config)
     # Make sure directories defined in config exist
     for subdir in config.get("directories", []):
         dir_path = path / subdir
         if not dir_path.exists():
             dir_path.mkdir(parents=True)
+    if interpolate:
+        err = "project.yml validation error"
+        with show_validation_error(title=err, hint_fill=False):
+            config = substitute_project_variables(config)
     return config
 
 
+def substitute_project_variables(config: Dict[str, Any], overrides: Dict = {}):
+    key = "vars"
+    config.setdefault(key, {})
+    config[key].update(overrides)
+    # Need to put variables in the top scope again so we can have a top-level
+    # section "project" (otherwise, a list of commands in the top scope wouldn't
+    # be allowed by Thinc's config system)
+    cfg = Config({"project": config, key: config[key]})
+    interpolated = cfg.interpolate()
+    return dict(interpolated["project"])
+
+
 def validate_project_commands(config: Dict[str, Any]) -> None:
     """Check that project commands and workflows are valid, don't contain
     duplicates, don't clash and only refer to commands that exist.
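The new `substitute_project_variables` delegates the actual interpolation to Thinc's `Config.interpolate()`. A dependency-free sketch of the effect, handling only flat string values with the `${vars.name}` placeholder syntax:

```python
import re


def substitute_vars(config):
    # Replace "${vars.<name>}" placeholders in string values with entries
    # from the config's top-level "vars" section. Non-string values and
    # nested structures are passed through unchanged in this sketch.
    variables = config.get("vars", {})
    pattern = re.compile(r"\$\{vars\.([^}]+)\}")

    def repl(match):
        return str(variables[match.group(1)])

    return {
        key: pattern.sub(repl, value) if isinstance(value, str) else value
        for key, value in config.items()
    }
```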
@@ -229,3 +252,39 @@ def get_sourced_components(config: Union[Dict[str, Any], Config]) -> List[str]:
         for name, cfg in config.get("components", {}).items()
         if "factory" not in cfg and "source" in cfg
     ]
+
+
+def upload_file(src: Path, dest: Union[str, "Pathy"]) -> None:
+    """Upload a file.
+
+    src (Path): The source path.
+    url (str): The destination URL to upload to.
+    """
+    dest = ensure_pathy(dest)
+    with dest.open(mode="wb") as output_file:
+        with src.open(mode="rb") as input_file:
+            output_file.write(input_file.read())
+
+
+def download_file(src: Union[str, "Pathy"], dest: Path, *, force: bool = False) -> None:
+    """Download a file using smart_open.
+
+    url (str): The URL of the file.
+    dest (Path): The destination path.
+    force (bool): Whether to force download even if file exists.
+        If False, the download will be skipped.
+    """
+    if dest.exists() and not force:
+        return None
+    src = ensure_pathy(src)
+    with src.open(mode="rb") as input_file:
+        with dest.open(mode="wb") as output_file:
+            output_file.write(input_file.read())
+
+
+def ensure_pathy(path):
+    """Temporary helper to prevent importing Pathy globally (which can cause
+    slow and annoying Google Cloud warning)."""
+    from pathy import Pathy  # noqa: F811
+
+    return Pathy(path)

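Both helpers added in this hunk are the same byte-streaming pattern with the local and remote roles swapped. The core of it, demonstrated here with plain `pathlib.Path` on both sides (a `Pathy` remote path can be dropped in on either side because it also exposes `open()`):

```python
from pathlib import Path
import tempfile


def copy_file(src: Path, dest: Path) -> None:
    # Stream bytes from one open()-able object to another.
    with src.open(mode="rb") as input_file:
        with dest.open(mode="wb") as output_file:
            output_file.write(input_file.read())


# Quick round-trip in a temporary directory
with tempfile.TemporaryDirectory() as tmp:
    src = Path(tmp) / "a.bin"
    dest = Path(tmp) / "b.bin"
    src.write_bytes(b"payload")
    copy_file(src, dest)
    assert dest.read_bytes() == b"payload"
```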
@@ -3,7 +3,7 @@ from pathlib import Path
 from collections import Counter
 import sys
 import srsly
-from wasabi import Printer, MESSAGES, msg, diff_strings
+from wasabi import Printer, MESSAGES, msg
 import typer
 
 from ._util import app, Arg, Opt, show_validation_error, parse_config_overrides
@@ -32,8 +32,6 @@ def debug_config_cli(
     ctx: typer.Context,  # This is only used to read additional arguments
     config_path: Path = Arg(..., help="Path to config file", exists=True),
     code_path: Optional[Path] = Opt(None, "--code-path", "-c", help="Path to Python file with additional code (registered functions) to be imported"),
-    auto_fill: bool = Opt(False, "--auto-fill", "-F", help="Whether or not to auto-fill the config with built-in defaults if possible"),
-    diff: bool = Opt(False, "--diff", "-D", help="Show a visual diff if config was auto-filled")
     # fmt: on
 ):
     """Debug a config.cfg file and show validation errors. The command will
|
||||||
import_code(code_path)
|
import_code(code_path)
|
||||||
with show_validation_error(config_path):
|
with show_validation_error(config_path):
|
||||||
config = util.load_config(config_path, overrides=overrides)
|
config = util.load_config(config_path, overrides=overrides)
|
||||||
nlp, _ = util.load_model_from_config(config, auto_fill=auto_fill)
|
nlp, _ = util.load_model_from_config(config)
|
||||||
if auto_fill:
|
|
||||||
orig_config = config.to_str()
|
|
||||||
filled_config = nlp.config.to_str()
|
|
||||||
if orig_config == filled_config:
|
|
||||||
msg.good("Original config is valid, no values were auto-filled")
|
|
||||||
else:
|
|
||||||
msg.good("Auto-filled config is valid")
|
|
||||||
if diff:
|
|
||||||
print(diff_strings(config.to_str(), nlp.config.to_str()))
|
|
||||||
else:
|
|
||||||
msg.good("Original config is valid")
|
msg.good("Original config is valid")
|
||||||
|
|
||||||
|
|
||||||
|
|
|
@@ -70,7 +70,7 @@ def evaluate(
     corpus = Corpus(data_path, gold_preproc=gold_preproc)
     nlp = util.load_model(model)
     dev_dataset = list(corpus(nlp))
-    scores = nlp.evaluate(dev_dataset, verbose=False)
+    scores = nlp.evaluate(dev_dataset)
     metrics = {
         "TOK": "token_acc",
         "TAG": "tag_acc",

@@ -3,17 +3,17 @@ from enum import Enum
 from pathlib import Path
 from wasabi import Printer, diff_strings
 from thinc.api import Config
-from pydantic import BaseModel
 import srsly
 import re
 
 from .. import util
+from ..schemas import RecommendationSchema
 from ._util import init_cli, Arg, Opt, show_validation_error, COMMAND
 
 
-TEMPLATE_ROOT = Path(__file__).parent / "templates"
-TEMPLATE_PATH = TEMPLATE_ROOT / "quickstart_training.jinja"
-RECOMMENDATIONS_PATH = TEMPLATE_ROOT / "quickstart_training_recommendations.json"
+ROOT = Path(__file__).parent / "templates"
+TEMPLATE_PATH = ROOT / "quickstart_training.jinja"
+RECOMMENDATIONS = srsly.read_yaml(ROOT / "quickstart_training_recommendations.yml")
 
 
 class Optimizations(str, Enum):
@@ -21,25 +21,10 @@ class Optimizations(str, Enum):
     accuracy = "accuracy"
 
 
-class RecommendationsTrfItem(BaseModel):
-    name: str
-    size_factor: int
-
-
-class RecommendationsTrf(BaseModel):
-    efficiency: RecommendationsTrfItem
-    accuracy: RecommendationsTrfItem
-
-
-class RecommendationSchema(BaseModel):
-    word_vectors: Optional[str] = None
-    transformer: Optional[RecommendationsTrf] = None
-
-
 @init_cli.command("config")
 def init_config_cli(
     # fmt: off
-    output_file: Path = Arg("-", help="File to save config.cfg to (or - for stdout)", allow_dash=True),
+    output_file: Path = Arg(..., help="File to save config.cfg to or - for stdout (will only output config and no additional logging info)", allow_dash=True),
     lang: Optional[str] = Opt("en", "--lang", "-l", help="Two-letter code of the language to use"),
     pipeline: Optional[str] = Opt("tagger,parser,ner", "--pipeline", "-p", help="Comma-separated names of trainable pipeline components to include in the model (without 'tok2vec' or 'transformer')"),
     optimize: Optimizations = Opt(Optimizations.efficiency.value, "--optimize", "-o", help="Whether to optimize for efficiency (faster inference, smaller model, lower memory consumption) or higher accuracy (potentially larger and slower model). This will impact the choice of architecture, pretrained weights and related hyperparameters."),
|
@ -111,14 +96,11 @@ def init_config(
|
||||||
from jinja2 import Template
|
from jinja2 import Template
|
||||||
except ImportError:
|
except ImportError:
|
||||||
msg.fail("This command requires jinja2", "pip install jinja2", exits=1)
|
msg.fail("This command requires jinja2", "pip install jinja2", exits=1)
|
||||||
recommendations = srsly.read_json(RECOMMENDATIONS_PATH)
|
|
||||||
lang_defaults = util.get_lang_class(lang).Defaults
|
|
||||||
has_letters = lang_defaults.writing_system.get("has_letters", True)
|
|
||||||
# Filter out duplicates since tok2vec and transformer are added by template
|
|
||||||
pipeline = [pipe for pipe in pipeline if pipe not in ("tok2vec", "transformer")]
|
|
||||||
reco = RecommendationSchema(**recommendations.get(lang, {})).dict()
|
|
||||||
with TEMPLATE_PATH.open("r") as f:
|
with TEMPLATE_PATH.open("r") as f:
|
||||||
template = Template(f.read())
|
template = Template(f.read())
|
||||||
|
# Filter out duplicates since tok2vec and transformer are added by template
|
||||||
|
pipeline = [pipe for pipe in pipeline if pipe not in ("tok2vec", "transformer")]
|
||||||
|
reco = RecommendationSchema(**RECOMMENDATIONS.get(lang, {})).dict()
|
||||||
variables = {
|
variables = {
|
||||||
"lang": lang,
|
"lang": lang,
|
||||||
"components": pipeline,
|
"components": pipeline,
|
||||||
|
@ -126,8 +108,15 @@ def init_config(
|
||||||
"hardware": "cpu" if cpu else "gpu",
|
"hardware": "cpu" if cpu else "gpu",
|
||||||
"transformer_data": reco["transformer"],
|
"transformer_data": reco["transformer"],
|
||||||
"word_vectors": reco["word_vectors"],
|
"word_vectors": reco["word_vectors"],
|
||||||
"has_letters": has_letters,
|
"has_letters": reco["has_letters"],
|
||||||
}
|
}
|
||||||
|
if variables["transformer_data"] and not has_spacy_transformers():
|
||||||
|
msg.warn(
|
||||||
|
"To generate a more effective transformer-based config (GPU-only), "
|
||||||
|
"install the spacy-transformers package and re-run this command. "
|
||||||
|
"The config generated now does not use transformers."
|
||||||
|
)
|
||||||
|
variables["transformer_data"] = None
|
||||||
base_template = template.render(variables).strip()
|
base_template = template.render(variables).strip()
|
||||||
# Giving up on getting the newlines right in jinja for now
|
# Giving up on getting the newlines right in jinja for now
|
||||||
base_template = re.sub(r"\n\n\n+", "\n\n", base_template)
|
base_template = re.sub(r"\n\n\n+", "\n\n", base_template)
|
||||||
|
@ -144,8 +133,6 @@ def init_config(
|
||||||
for label, value in use_case.items():
|
for label, value in use_case.items():
|
||||||
msg.text(f"- {label}: {value}")
|
msg.text(f"- {label}: {value}")
|
||||||
use_transformer = bool(template_vars.use_transformer)
|
use_transformer = bool(template_vars.use_transformer)
|
||||||
if use_transformer:
|
|
||||||
require_spacy_transformers(msg)
|
|
||||||
with show_validation_error(hint_fill=False):
|
with show_validation_error(hint_fill=False):
|
||||||
config = util.load_config_from_str(base_template)
|
config = util.load_config_from_str(base_template)
|
||||||
nlp, _ = util.load_model_from_config(config, auto_fill=True)
|
nlp, _ = util.load_model_from_config(config, auto_fill=True)
|
||||||
|
@ -167,12 +154,10 @@ def save_config(config: Config, output_file: Path, is_stdout: bool = False) -> N
|
||||||
print(f"{COMMAND} train {output_file.parts[-1]} {' '.join(variables)}")
|
print(f"{COMMAND} train {output_file.parts[-1]} {' '.join(variables)}")
|
||||||
|
|
||||||
|
|
||||||
def require_spacy_transformers(msg: Printer) -> None:
|
def has_spacy_transformers() -> bool:
|
||||||
try:
|
try:
|
||||||
import spacy_transformers # noqa: F401
|
import spacy_transformers # noqa: F401
|
||||||
|
|
||||||
|
return True
|
||||||
except ImportError:
|
except ImportError:
|
||||||
msg.fail(
|
return False
|
||||||
"Using a transformer-based pipeline requires spacy-transformers "
|
|
||||||
"to be installed.",
|
|
||||||
exits=1,
|
|
||||||
)
|
|
||||||
|
|
|
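The change above swaps a hard failure (`require_spacy_transformers`, which called `msg.fail` and exited) for a soft capability check (`has_spacy_transformers`) plus a warning, so the command can fall back to a non-transformer config. A minimal sketch of that soft-check pattern, using `importlib` instead of a bare import; the names and the `variables` dict here are illustrative, not spaCy's API:

```python
import importlib.util


def has_package(name: str) -> bool:
    # Report availability instead of exiting, so the caller can degrade
    # gracefully rather than abort the whole command.
    return importlib.util.find_spec(name) is not None


variables = {"transformer_data": {"name": "roberta-base"}}
if variables["transformer_data"] and not has_package("spacy_transformers"):
    # Degrade: drop the transformer section and generate a CPU-friendly config
    variables["transformer_data"] = None
```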
@@ -229,6 +229,7 @@ if __name__ == '__main__':

 TEMPLATE_MANIFEST = """
 include meta.json
+include config.cfg
 """.strip()
@@ -4,10 +4,10 @@ from wasabi import msg
 import re
 import shutil
 import requests
-import smart_open

 from ...util import ensure_path, working_dir
 from .._util import project_cli, Arg, PROJECT_FILE, load_project_config, get_checksum
+from .._util import download_file


 # TODO: find a solution for caches
@@ -44,16 +44,14 @@ def project_assets(project_dir: Path) -> None:
     if not assets:
         msg.warn(f"No assets specified in {PROJECT_FILE}", exits=0)
     msg.info(f"Fetching {len(assets)} asset(s)")
-    variables = config.get("variables", {})
     for asset in assets:
-        dest = asset["dest"].format(**variables)
+        dest = asset["dest"]
         url = asset.get("url")
         checksum = asset.get("checksum")
         if not url:
             # project.yml defines asset without URL that the user has to place
             check_private_asset(dest, checksum)
             continue
-        url = url.format(**variables)
         fetch_asset(project_path, url, dest, checksum)
@@ -132,15 +130,3 @@ def convert_asset_url(url: str) -> str:
         )
         return converted
     return url
-
-
-def download_file(url: str, dest: Path, chunk_size: int = 1024) -> None:
-    """Download a file using smart_open.
-
-    url (str): The URL of the file.
-    dest (Path): The destination path.
-    chunk_size (int): The size of chunks to read/write.
-    """
-    with smart_open.open(url, mode="rb") as input_file:
-        with dest.open(mode="wb") as output_file:
-            output_file.write(input_file.read())
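The removed `download_file` above depended on `smart_open` and read the whole payload into memory at once; the import hunk earlier shows downloading now comes from `.._util` instead. A stdlib-only sketch of a chunked replacement, assuming streaming in fixed-size chunks is the goal; this is not the actual `_util.download_file`:

```python
import shutil
import urllib.request
from pathlib import Path


def download_file(url: str, dest: Path, chunk_size: int = 1024) -> None:
    # Stream the response to disk chunk by chunk instead of holding the
    # entire payload in memory, as the removed smart_open version did.
    with urllib.request.urlopen(url) as response, dest.open("wb") as output_file:
        shutil.copyfileobj(response, output_file, length=chunk_size)
```

This works for any scheme urllib supports, including `file://` URLs, which makes it easy to exercise locally.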
@@ -99,7 +99,6 @@ def update_dvc_config(
     if ref_hash == config_hash and not force:
         return False  # Nothing has changed in project.yml, don't need to update
     dvc_config_path.unlink()
-    variables = config.get("variables", {})
     dvc_commands = []
     config_commands = {cmd["name"]: cmd for cmd in config.get("commands", [])}
     for name in workflows[workflow]:
@@ -122,7 +121,7 @@ def update_dvc_config(
         dvc_commands.append(join_command(full_cmd))
     with working_dir(path):
         dvc_flags = {"--verbose": verbose, "--quiet": silent}
-        run_dvc_commands(dvc_commands, variables, flags=dvc_flags)
+        run_dvc_commands(dvc_commands, flags=dvc_flags)
     with dvc_config_path.open("r+", encoding="utf8") as f:
         content = f.read()
         f.seek(0, 0)
@@ -131,23 +130,16 @@ def update_dvc_config(


 def run_dvc_commands(
-    commands: List[str] = tuple(),
-    variables: Dict[str, str] = {},
-    flags: Dict[str, bool] = {},
+    commands: List[str] = tuple(), flags: Dict[str, bool] = {},
 ) -> None:
     """Run a sequence of DVC commands in a subprocess, in order.

     commands (List[str]): The string commands without the leading "dvc".
-    variables (Dict[str, str]): Dictionary of variable names, mapped to their
-        values. Will be used to substitute format string variables in the
-        commands.
     flags (Dict[str, bool]): Conditional flags to be added to command. Makes it
         easier to pass flags like --quiet that depend on a variable or
         command-line setting while avoiding lots of nested conditionals.
     """
     for command in commands:
-        # Substitute variables, e.g. "./{NAME}.json"
-        command = command.format(**variables)
         command = split_command(command)
         dvc_command = ["dvc", *command]
         # Add the flags if they are set to True
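`run_dvc_commands` loses its `variables` parameter but keeps `flags`, whose docstring explains the idea: append a flag only when its value is True. That merge step can be sketched as follows (`apply_flags` is a hypothetical helper, not code from the diff):

```python
from typing import Dict, List


def apply_flags(command: List[str], flags: Dict[str, bool]) -> List[str]:
    # Append only the flags whose value is True, so callers can pass
    # {"--verbose": verbose, "--quiet": silent} without nested conditionals.
    return command + [flag for flag, is_set in flags.items() if is_set]
```

Dict insertion order is preserved in Python 3.7+, so the flags come out in the order they were declared.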
spacy/cli/project/pull.py (new file, 38 lines)
@@ -0,0 +1,38 @@
+from pathlib import Path
+from wasabi import msg
+from .remote_storage import RemoteStorage
+from .remote_storage import get_command_hash
+from .._util import project_cli, Arg
+from .._util import load_project_config
+
+
+@project_cli.command("pull")
+def project_pull_cli(
+    # fmt: off
+    remote: str = Arg("default", help="Name or path of remote storage"),
+    project_dir: Path = Arg(Path.cwd(), help="Location of project directory. Defaults to current working directory.", exists=True, file_okay=False),
+    # fmt: on
+):
+    """Retrieve any precomputed outputs from a remote storage that are available.
+    You can alias remotes in your project.yml by mapping them to storage paths.
+    A storage can be anything that the smart-open library can upload to, e.g.
+    gcs, aws, ssh, local directories etc
+    """
+    for url, output_path in project_pull(project_dir, remote):
+        if url is not None:
+            msg.good(f"Pulled {output_path} from {url}")
+
+
+def project_pull(project_dir: Path, remote: str, *, verbose: bool = False):
+    config = load_project_config(project_dir)
+    if remote in config.get("remotes", {}):
+        remote = config["remotes"][remote]
+    storage = RemoteStorage(project_dir, remote)
+    for cmd in config.get("commands", []):
+        deps = [project_dir / dep for dep in cmd.get("deps", [])]
+        if any(not dep.exists() for dep in deps):
+            continue
+        cmd_hash = get_command_hash("", "", deps, cmd["script"])
+        for output_path in cmd.get("outputs", []):
+            url = storage.pull(output_path, command_hash=cmd_hash)
+            yield url, output_path
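`project_pull` first resolves the remote name against the optional `remotes` mapping in `project.yml`; anything not found there is treated as a literal storage path or URL. That lookup reduces to a one-liner (`resolve_remote` is an illustrative name, not part of the diff):

```python
def resolve_remote(remote: str, config: dict) -> str:
    # A name like "default" may be an alias defined under "remotes";
    # otherwise it is used verbatim as the storage location.
    return config.get("remotes", {}).get(remote, remote)
```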
spacy/cli/project/push.py (new file, 51 lines)
@@ -0,0 +1,51 @@
+from pathlib import Path
+from wasabi import msg
+from .remote_storage import RemoteStorage
+from .remote_storage import get_content_hash, get_command_hash
+from .._util import load_project_config
+from .._util import project_cli, Arg
+
+
+@project_cli.command("push")
+def project_push_cli(
+    # fmt: off
+    remote: str = Arg("default", help="Name or path of remote storage"),
+    project_dir: Path = Arg(Path.cwd(), help="Location of project directory. Defaults to current working directory.", exists=True, file_okay=False),
+    # fmt: on
+):
+    """Persist outputs to a remote storage. You can alias remotes in your project.yml
+    by mapping them to storage paths. A storage can be anything that the smart-open
+    library can upload to, e.g. gcs, aws, ssh, local directories etc
+    """
+    for output_path, url in project_push(project_dir, remote):
+        if url is None:
+            msg.info(f"Skipping {output_path}")
+        else:
+            msg.good(f"Pushed {output_path} to {url}")
+
+
+def project_push(project_dir: Path, remote: str):
+    """Persist outputs to a remote storage. You can alias remotes in your project.yml
+    by mapping them to storage paths. A storage can be anything that the smart-open
+    library can upload to, e.g. gcs, aws, ssh, local directories etc
+    """
+    config = load_project_config(project_dir)
+    if remote in config.get("remotes", {}):
+        remote = config["remotes"][remote]
+    storage = RemoteStorage(project_dir, remote)
+    for cmd in config.get("commands", []):
+        deps = [project_dir / dep for dep in cmd.get("deps", [])]
+        if any(not dep.exists() for dep in deps):
+            continue
+        cmd_hash = get_command_hash(
+            "", "", [project_dir / dep for dep in cmd.get("deps", [])], cmd["script"]
+        )
+        for output_path in cmd.get("outputs", []):
+            output_loc = project_dir / output_path
+            if output_loc.exists():
+                url = storage.push(
+                    output_path,
+                    command_hash=cmd_hash,
+                    content_hash=get_content_hash(output_loc),
+                )
+                yield output_path, url
spacy/cli/project/remote_storage.py (new file, 169 lines)
@@ -0,0 +1,169 @@
+from typing import Optional, List, Dict, TYPE_CHECKING
+import os
+import site
+import hashlib
+import urllib.parse
+import tarfile
+from pathlib import Path
+
+from .._util import get_hash, get_checksum, download_file, ensure_pathy
+from ...util import make_tempdir
+
+if TYPE_CHECKING:
+    from pathy import Pathy  # noqa: F401
+
+
+class RemoteStorage:
+    """Push and pull outputs to and from a remote file storage.
+
+    Remotes can be anything that `smart-open` can support: AWS, GCS, file system,
+    ssh, etc.
+    """
+
+    def __init__(self, project_root: Path, url: str, *, compression="gz"):
+        self.root = project_root
+        self.url = ensure_pathy(url)
+        self.compression = compression
+
+    def push(self, path: Path, command_hash: str, content_hash: str) -> "Pathy":
+        """Compress a file or directory within a project and upload it to a remote
+        storage. If an object exists at the full URL, nothing is done.
+
+        Within the remote storage, files are addressed by their project path
+        (url encoded) and two user-supplied hashes, representing their creation
+        context and their file contents. If the URL already exists, the data is
+        not uploaded. Paths are archived and compressed prior to upload.
+        """
+        loc = self.root / path
+        if not loc.exists():
+            raise IOError(f"Cannot push {loc}: does not exist.")
+        url = self.make_url(path, command_hash, content_hash)
+        if url.exists():
+            return None
+        tmp: Path
+        with make_tempdir() as tmp:
+            tar_loc = tmp / self.encode_name(str(path))
+            mode_string = f"w:{self.compression}" if self.compression else "w"
+            with tarfile.open(tar_loc, mode=mode_string) as tar_file:
+                tar_file.add(str(loc), arcname=str(path))
+            with tar_loc.open(mode="rb") as input_file:
+                with url.open(mode="wb") as output_file:
+                    output_file.write(input_file.read())
+        return url
+
+    def pull(
+        self,
+        path: Path,
+        *,
+        command_hash: Optional[str] = None,
+        content_hash: Optional[str] = None,
+    ) -> Optional["Pathy"]:
+        """Retrieve a file from the remote cache. If the file already exists,
+        nothing is done.
+
+        If the command_hash and/or content_hash are specified, only matching
+        results are returned. If no results are available, an error is raised.
+        """
+        dest = self.root / path
+        if dest.exists():
+            return None
+        url = self.find(path, command_hash=command_hash, content_hash=content_hash)
+        if url is None:
+            return url
+        else:
+            # Make sure the destination exists
+            if not dest.parent.exists():
+                dest.parent.mkdir(parents=True)
+            tmp: Path
+            with make_tempdir() as tmp:
+                tar_loc = tmp / url.parts[-1]
+                download_file(url, tar_loc)
+                mode_string = f"r:{self.compression}" if self.compression else "r"
+                with tarfile.open(tar_loc, mode=mode_string) as tar_file:
+                    # This requires that the path is added correctly, relative
+                    # to root. This is how we set things up in push()
+                    tar_file.extractall(self.root)
+        return url
+
+    def find(
+        self,
+        path: Path,
+        *,
+        command_hash: Optional[str] = None,
+        content_hash: Optional[str] = None,
+    ) -> Optional["Pathy"]:
+        """Find the best matching version of a file within the storage,
+        or `None` if no match can be found. If both the creation and content hash
+        are specified, only exact matches will be returned. Otherwise, the most
+        recent matching file is preferred.
+        """
+        name = self.encode_name(str(path))
+        if command_hash is not None and content_hash is not None:
+            url = self.make_url(path, command_hash, content_hash)
+            urls = [url] if url.exists() else []
+        elif command_hash is not None:
+            urls = list((self.url / name / command_hash).iterdir())
+        else:
+            urls = list((self.url / name).iterdir())
+            if content_hash is not None:
+                urls = [url for url in urls if url.parts[-1] == content_hash]
+        return urls[-1] if urls else None
+
+    def make_url(self, path: Path, command_hash: str, content_hash: str) -> "Pathy":
+        """Construct a URL from a subpath, a creation hash and a content hash."""
+        return self.url / self.encode_name(str(path)) / command_hash / content_hash
+
+    def encode_name(self, name: str) -> str:
+        """Encode a subpath into a URL-safe name."""
+        return urllib.parse.quote_plus(name)
+
+
+def get_content_hash(loc: Path) -> str:
+    return get_checksum(loc)
+
+
+def get_command_hash(
+    site_hash: str, env_hash: str, deps: List[Path], cmd: List[str]
+) -> str:
+    """Create a hash representing the execution of a command. This includes the
+    currently installed packages, whatever environment variables have been marked
+    as relevant, and the command.
+    """
+    hashes = [site_hash, env_hash] + [get_checksum(dep) for dep in sorted(deps)]
+    hashes.extend(cmd)
+    creation_bytes = "".join(hashes).encode("utf8")
+    return hashlib.md5(creation_bytes).hexdigest()
+
+
+def get_site_hash():
+    """Hash the current Python environment's site-packages contents, including
+    the name and version of the libraries. The list we're hashing is what
+    `pip freeze` would output.
+    """
+    site_dirs = site.getsitepackages()
+    if site.ENABLE_USER_SITE:
+        site_dirs.append(site.getusersitepackages())
+    packages = set()
+    for site_dir in site_dirs:
+        site_dir = Path(site_dir)
+        for subpath in site_dir.iterdir():
+            if subpath.parts[-1].endswith("dist-info"):
+                packages.add(subpath.parts[-1].replace(".dist-info", ""))
+    package_bytes = "".join(sorted(packages)).encode("utf8")
+    return hashlib.md5(package_bytes).hexdigest()
+
+
+def get_env_hash(env: Dict[str, str]) -> str:
+    """Construct a hash of the environment variables that will be passed into
+    the commands.
+
+    Values in the env dict may be references to the current os.environ, using
+    the syntax $ENV_VAR to mean os.environ[ENV_VAR]
+    """
+    env_vars = {}
+    for key, value in env.items():
+        if value.startswith("$"):
+            env_vars[key] = os.environ.get(value[1:], "")
+        else:
+            env_vars[key] = value
+    return get_hash(env_vars)
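`RemoteStorage` addresses each stored object by three URL path segments: the URL-encoded project-relative path, the command hash (creation context) and the content hash, which is what lets `push` skip uploads when nothing changed. A condensed, self-contained sketch of that addressing scheme; the helper names below are illustrative stand-ins, and the MD5-of-joined-strings digest mirrors the shape of `get_command_hash` above:

```python
import hashlib
import urllib.parse


def encode_name(name: str) -> str:
    # A project-relative path becomes a single URL-safe segment,
    # e.g. "training/model-best" -> "training%2Fmodel-best"
    return urllib.parse.quote_plus(name)


def make_key(path: str, command_hash: str, content_hash: str) -> str:
    # path + creation context + content: outputs produced the same way
    # from the same inputs map to the same key and are never re-uploaded.
    return "/".join([encode_name(path), command_hash, content_hash])


def command_hash(hashes: list) -> str:
    # Join the partial hashes (site, env, dep checksums, script) and digest
    return hashlib.md5("".join(hashes).encode("utf8")).hexdigest()
```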
@@ -44,7 +44,6 @@ def project_run(
     dry (bool): Perform a dry run and don't execute commands.
     """
     config = load_project_config(project_dir)
-    variables = config.get("variables", {})
     commands = {cmd["name"]: cmd for cmd in config.get("commands", [])}
     workflows = config.get("workflows", {})
     validate_subcommand(commands.keys(), workflows.keys(), subcommand)
@@ -54,22 +53,20 @@ def project_run(
             project_run(project_dir, cmd, force=force, dry=dry)
     else:
         cmd = commands[subcommand]
-        variables = config.get("variables", {})
         for dep in cmd.get("deps", []):
-            dep = dep.format(**variables)
             if not (project_dir / dep).exists():
                 err = f"Missing dependency specified by command '{subcommand}': {dep}"
                 err_kwargs = {"exits": 1} if not dry else {}
                 msg.fail(err, **err_kwargs)
         with working_dir(project_dir) as current_dir:
-            rerun = check_rerun(current_dir, cmd, variables)
+            rerun = check_rerun(current_dir, cmd)
             if not rerun and not force:
                 msg.info(f"Skipping '{cmd['name']}': nothing changed")
             else:
                 msg.divider(subcommand)
-                run_commands(cmd["script"], variables, dry=dry)
+                run_commands(cmd["script"], dry=dry)
             if not dry:
-                update_lockfile(current_dir, cmd, variables)
+                update_lockfile(current_dir, cmd)


 def print_run_help(project_dir: Path, subcommand: Optional[str] = None) -> None:
@@ -115,23 +112,15 @@ def print_run_help(project_dir: Path, subcommand: Optional[str] = None) -> None:


 def run_commands(
-    commands: List[str] = tuple(),
-    variables: Dict[str, Any] = {},
-    silent: bool = False,
-    dry: bool = False,
+    commands: List[str] = tuple(), silent: bool = False, dry: bool = False,
 ) -> None:
     """Run a sequence of commands in a subprocess, in order.

     commands (List[str]): The string commands.
-    variables (Dict[str, Any]): Dictionary of variable names, mapped to their
-        values. Will be used to substitute format string variables in the
-        commands.
     silent (bool): Don't print the commands.
     dry (bool): Perform a dry run and don't execute anything.
     """
     for command in commands:
-        # Substitute variables, e.g. "./{NAME}.json"
-        command = command.format(**variables)
         command = split_command(command)
         # Not sure if this is needed or a good idea. Motivation: users may often
         # use commands in their config that reference "python" and we want to
@@ -173,15 +162,12 @@ def validate_subcommand(
     )


-def check_rerun(
-    project_dir: Path, command: Dict[str, Any], variables: Dict[str, Any]
-) -> bool:
+def check_rerun(project_dir: Path, command: Dict[str, Any]) -> bool:
     """Check if a command should be rerun because its settings or inputs/outputs
     changed.

     project_dir (Path): The current project directory.
     command (Dict[str, Any]): The command, as defined in the project.yml.
-    variables (Dict[str, Any]): The variables defined in the project.yml.
     RETURNS (bool): Whether to re-run the command.
     """
     lock_path = project_dir / PROJECT_LOCK
@@ -197,19 +183,16 @@ def check_rerun(
     # If the entry in the lockfile matches the lockfile entry that would be
     # generated from the current command, we don't rerun because it means that
     # all inputs/outputs, hashes and scripts are the same and nothing changed
-    return get_hash(get_lock_entry(project_dir, command, variables)) != get_hash(entry)
+    return get_hash(get_lock_entry(project_dir, command)) != get_hash(entry)


-def update_lockfile(
-    project_dir: Path, command: Dict[str, Any], variables: Dict[str, Any]
-) -> None:
+def update_lockfile(project_dir: Path, command: Dict[str, Any]) -> None:
     """Update the lockfile after running a command. Will create a lockfile if
     it doesn't yet exist and will add an entry for the current command, its
     script and dependencies/outputs.

     project_dir (Path): The current project directory.
     command (Dict[str, Any]): The command, as defined in the project.yml.
-    variables (Dict[str, Any]): The variables defined in the project.yml.
     """
     lock_path = project_dir / PROJECT_LOCK
     if not lock_path.exists():
@@ -217,13 +200,11 @@ def update_lockfile(
         data = {}
     else:
         data = srsly.read_yaml(lock_path)
-    data[command["name"]] = get_lock_entry(project_dir, command, variables)
+    data[command["name"]] = get_lock_entry(project_dir, command)
     srsly.write_yaml(lock_path, data)


-def get_lock_entry(
-    project_dir: Path, command: Dict[str, Any], variables: Dict[str, Any]
-) -> Dict[str, Any]:
+def get_lock_entry(project_dir: Path, command: Dict[str, Any]) -> Dict[str, Any]:
     """Get a lockfile entry for a given command. An entry includes the command,
     the script (command steps) and a list of dependencies and outputs with
     their paths and file hashes, if available. The format is based on the
@@ -231,12 +212,11 @@ def get_lock_entry(

     project_dir (Path): The current project directory.
     command (Dict[str, Any]): The command, as defined in the project.yml.
-    variables (Dict[str, Any]): The variables defined in the project.yml.
     RETURNS (Dict[str, Any]): The lockfile entry.
     """
-    deps = get_fileinfo(project_dir, command.get("deps", []), variables)
-    outs = get_fileinfo(project_dir, command.get("outputs", []), variables)
-    outs_nc = get_fileinfo(project_dir, command.get("outputs_no_cache", []), variables)
+    deps = get_fileinfo(project_dir, command.get("deps", []))
+    outs = get_fileinfo(project_dir, command.get("outputs", []))
+    outs_nc = get_fileinfo(project_dir, command.get("outputs_no_cache", []))
     return {
         "cmd": f"{COMMAND} run {command['name']}",
         "script": command["script"],
@@ -245,20 +225,16 @@ def get_lock_entry(
     }


-def get_fileinfo(
-    project_dir: Path, paths: List[str], variables: Dict[str, Any]
-) -> List[Dict[str, str]]:
+def get_fileinfo(project_dir: Path, paths: List[str]) -> List[Dict[str, str]]:
     """Generate the file information for a list of paths (dependencies, outputs).
     Includes the file path and the file's checksum.

     project_dir (Path): The current project directory.
     paths (List[str]): The file paths.
-    variables (Dict[str, Any]): The variables defined in the project.yml.
     RETURNS (List[Dict[str, str]]): The lockfile entry for a file.
     """
     data = []
     for path in paths:
-        path = path.format(**variables)
         file_path = project_dir / path
         md5 = get_checksum(file_path) if file_path.exists() else None
         data.append({"path": path, "md5": md5})
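`check_rerun` skips a command when the lockfile entry that would be written now hashes identically to the recorded one: same script, deps, outputs and checksums mean nothing changed. A self-contained sketch of that comparison; `get_hash` here is a JSON-based stand-in for spaCy's `util.get_hash`, not the real implementation:

```python
import hashlib
import json


def get_hash(data) -> str:
    # Stable digest of a JSON-serializable structure (stand-in for util.get_hash)
    return hashlib.md5(json.dumps(data, sort_keys=True).encode("utf8")).hexdigest()


def needs_rerun(lock_entry: dict, current_entry: dict) -> bool:
    # Rerun only if the entry that would be written now differs from the
    # recorded one, mirroring the check_rerun comparison in the diff.
    return get_hash(current_entry) != get_hash(lock_entry)
```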
@ -105,10 +105,10 @@ factory = "tok2vec"

 [components.tok2vec.model.embed]
 @architectures = "spacy.MultiHashEmbed.v1"
-width = ${components.tok2vec.model.encode:width}
+width = ${components.tok2vec.model.encode.width}
 rows = {{ 2000 if optimize == "efficiency" else 7000 }}
-also_embed_subwords = {{ true if has_letters else false }}
-also_use_static_vectors = {{ true if optimize == "accuracy" else false }}
+also_embed_subwords = {{ "true" if has_letters else "false" }}
+also_use_static_vectors = {{ "true" if optimize == "accuracy" else "false" }}

 [components.tok2vec.model.encode]
 @architectures = "spacy.MaxoutWindowEncoder.v1"
@ -127,7 +127,7 @@ nO = null

 [components.tagger.model.tok2vec]
 @architectures = "spacy.Tok2VecListener.v1"
-width = ${components.tok2vec.model.encode:width}
+width = ${components.tok2vec.model.encode.width}
 {%- endif %}

 {% if "parser" in components -%}
@ -144,7 +144,7 @@ nO = null

 [components.parser.model.tok2vec]
 @architectures = "spacy.Tok2VecListener.v1"
-width = ${components.tok2vec.model.encode:width}
+width = ${components.tok2vec.model.encode.width}
 {%- endif %}

 {% if "ner" in components %}
@ -161,7 +161,7 @@ nO = null

 [components.ner.model.tok2vec]
 @architectures = "spacy.Tok2VecListener.v1"
-width = ${components.tok2vec.model.encode:width}
+width = ${components.tok2vec.model.encode.width}
 {% endif %}
 {% endif %}

@ -194,12 +194,12 @@ initial_rate = 5e-5

 [training.train_corpus]
 @readers = "spacy.Corpus.v1"
-path = ${paths:train}
-max_length = {{ 500 if hardware == "gpu" else 0 }}
+path = ${paths.train}
+max_length = {{ 500 if hardware == "gpu" else 2000 }}

 [training.dev_corpus]
 @readers = "spacy.Corpus.v1"
-path = ${paths:dev}
+path = ${paths.dev}
 max_length = 0

 {% if use_transformer %}
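The quickstart template above is a Jinja template, and the quoting change matters: unquoted `true`/`false` are Jinja's boolean literals and render as Python's `True`/`False`, which is not valid in the generated config. A minimal sketch of the before/after behavior (the variable name `has_letters` comes from the template above):

```python
from jinja2 import Template

# Unquoted Jinja booleans render as Python's "True"/"False":
before = Template('also_embed_subwords = {{ true if has_letters else false }}')
# Quoting them, as the changed lines above do, emits lowercase config values:
after = Template('also_embed_subwords = {{ "true" if has_letters else "false" }}')

print(before.render(has_letters=True))  # also_embed_subwords = True
print(after.render(has_letters=True))   # also_embed_subwords = true
```

The same mechanism drives the numeric conditionals such as `{{ 2000 if optimize == "efficiency" else 7000 }}`.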
@ -1,13 +0,0 @@
-{
-    "en": {
-        "word_vectors": "en_vectors_web_lg",
-        "transformer": {
-            "efficiency": { "name": "roberta-base", "size_factor": 3 },
-            "accuracy": { "name": "roberta-base", "size_factor": 3 }
-        }
-    },
-    "de": {
-        "word_vectors": null,
-        "transformer": null
-    }
-}
103 spacy/cli/templates/quickstart_training_recommendations.yml (new file)

@ -0,0 +1,103 @@
+# Recommended settings and available resources for each language, if available.
+# Not all languages have recommended word vectors or transformers and for some,
+# the recommended transformer for efficiency and accuracy may be the same.
+en:
+  word_vectors: en_vectors_web_lg
+  transformer:
+    efficiency:
+      name: roberta-base
+      size_factor: 3
+    accuracy:
+      name: roberta-base
+      size_factor: 3
+de:
+  word_vectors: null
+  transformer:
+    efficiency:
+      name: bert-base-german-cased
+      size_factor: 3
+    accuracy:
+      name: bert-base-german-cased
+      size_factor: 3
+fr:
+  word_vectors: null
+  transformer:
+    efficiency:
+      name: camembert-base
+      size_factor: 3
+    accuracy:
+      name: camembert-base
+      size_factor: 3
+es:
+  word_vectors: null
+  transformer:
+    efficiency:
+      name: mrm8488/RuPERTa-base
+      size_factor: 3
+    accuracy:
+      name: mrm8488/RuPERTa-base
+      size_factor: 3
+sv:
+  word_vectors: null
+  transformer:
+    efficiency:
+      name: KB/bert-base-swedish-cased
+      size_factor: 3
+    accuracy:
+      name: KB/bert-base-swedish-cased
+      size_factor: 3
+fi:
+  word_vectors: null
+  transformer:
+    efficiency:
+      name: TurkuNLP/bert-base-finnish-cased-v1
+      size_factor: 3
+    accuracy:
+      name: TurkuNLP/bert-base-finnish-cased-v1
+      size_factor: 3
+el:
+  word_vectors: null
+  transformer:
+    efficiency:
+      name: nlpaueb/bert-base-greek-uncased-v1
+      size_factor: 3
+    accuracy:
+      name: nlpaueb/bert-base-greek-uncased-v1
+      size_factor: 3
+tr:
+  word_vectors: null
+  transformer:
+    efficiency:
+      name: dbmdz/bert-base-turkish-cased
+      size_factor: 3
+    accuracy:
+      name: dbmdz/bert-base-turkish-cased
+      size_factor: 3
+zh:
+  word_vectors: null
+  transformer:
+    efficiency:
+      name: bert-base-chinese
+      size_factor: 3
+    accuracy:
+      name: bert-base-chinese
+      size_factor: 3
+  has_letters: false
+ar:
+  word_vectors: null
+  transformer:
+    efficiency:
+      name: asafaya/bert-base-arabic
+      size_factor: 3
+    accuracy:
+      name: asafaya/bert-base-arabic
+      size_factor: 3
+pl:
+  word_vectors: null
+  transformer:
+    efficiency:
+      name: dkleczek/bert-base-polish-cased-v1
+      size_factor: 3
+    accuracy:
+      name: dkleczek/bert-base-polish-cased-v1
+      size_factor: 3
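The recommendations file above replaces the old JSON with nested YAML keyed by language code. A sketch of how such a file could be read and queried (this uses PyYAML's `safe_load` for illustration; the `recommended_transformer` helper and the two-language excerpt are hypothetical, not spaCy's actual loader):

```python
import yaml

# A two-language excerpt of the recommendations file added above.
RECOMMENDATIONS = """
en:
  word_vectors: en_vectors_web_lg
  transformer:
    efficiency:
      name: roberta-base
      size_factor: 3
    accuracy:
      name: roberta-base
      size_factor: 3
zh:
  word_vectors: null
  transformer:
    efficiency:
      name: bert-base-chinese
      size_factor: 3
  has_letters: false
"""

def recommended_transformer(lang: str, optimize: str = "efficiency"):
    """Look up the recommended transformer name for a language, or None."""
    recs = yaml.safe_load(RECOMMENDATIONS)
    transformer = (recs.get(lang) or {}).get("transformer") or {}
    entry = transformer.get(optimize)
    return entry["name"] if entry else None

print(recommended_transformer("en"))  # roberta-base
print(recommended_transformer("de"))  # None (not present in this excerpt)
```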
@ -75,7 +75,9 @@ def train(
         msg.info("Using CPU")
     msg.info(f"Loading config and nlp from: {config_path}")
     with show_validation_error(config_path):
-        config = util.load_config(config_path, overrides=config_overrides)
+        config = util.load_config(
+            config_path, overrides=config_overrides, interpolate=True
+        )
     if config.get("training", {}).get("seed") is not None:
         fix_random_seed(config["training"]["seed"])
     # Use original config here before it's resolved to functions
@ -162,12 +164,13 @@ def train(
                 progress = tqdm.tqdm(total=T_cfg["eval_frequency"], leave=False)
     except Exception as e:
         if output_path is not None:
+            # We don't want to swallow the traceback if we don't have a
+            # specific error.
             msg.warn(
                 f"Aborting and saving the final best model. "
-                f"Encountered exception: {str(e)}",
-                exits=1,
+                f"Encountered exception: {str(e)}"
             )
-        else:
+            nlp.to_disk(output_path / "model-final")
         raise e
     finally:
         if output_path is not None:
@ -207,7 +210,9 @@ def create_evaluation_callback(
         scores = nlp.evaluate(dev_examples)
         # Calculate a weighted sum based on score_weights for the main score
        try:
-            weighted_score = sum(scores[s] * weights.get(s, 0.0) for s in weights)
+            weighted_score = sum(
+                scores.get(s, 0.0) * weights.get(s, 0.0) for s in weights
+            )
         except KeyError as e:
             keys = list(scores.keys())
             err = Errors.E983.format(dict="score_weights", key=str(e), keys=keys)
@ -235,7 +240,7 @@ def train_while_improving(
     with each iteration yielding a tuple `(batch, info, is_best_checkpoint)`,
     where info is a dict, and is_best_checkpoint is in [True, False, None] --
     None indicating that the iteration was not evaluated as a checkpoint.
-    The evaluation is conducted by calling the evaluate callback, which should
+    The evaluation is conducted by calling the evaluate callback.

     Positional arguments:
     nlp: The spaCy pipeline to evaluate.
@ -377,7 +382,8 @@ def setup_printer(

     try:
         scores = [
-            "{0:.2f}".format(float(info["other_scores"][col])) for col in score_cols
+            "{0:.2f}".format(float(info["other_scores"].get(col, 0.0)))
+            for col in score_cols
         ]
     except KeyError as e:
         raise KeyError(
@ -403,7 +409,7 @@ def update_meta(
 ) -> None:
     nlp.meta["performance"] = {}
     for metric in training["score_weights"]:
-        nlp.meta["performance"][metric] = info["other_scores"][metric]
+        nlp.meta["performance"][metric] = info["other_scores"].get(metric, 0.0)
     for pipe_name in nlp.pipe_names:
         nlp.meta["performance"][f"{pipe_name}_loss"] = info["losses"][pipe_name]
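The `create_evaluation_callback` change above swaps direct indexing (`scores[s]`) for `scores.get(s, 0.0)`, so a metric listed in `score_weights` but missing from the evaluation output contributes zero instead of raising `KeyError`. A minimal sketch of the weighted-sum behavior (standalone, using hypothetical metric names):

```python
def weighted_score(scores: dict, weights: dict) -> float:
    # Missing metrics contribute 0.0 rather than raising KeyError,
    # mirroring the change to create_evaluation_callback above.
    return sum(scores.get(s, 0.0) * weights.get(s, 0.0) for s in weights)

weights = {"tag_acc": 0.5, "dep_uas": 0.5}
print(weighted_score({"tag_acc": 0.9}, weights))  # 0.45 (dep_uas missing -> 0.0)
```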
@ -23,12 +23,12 @@ after_pipeline_creation = null

 # Training hyper-parameters and additional features.
 [training]
-seed = ${system:seed}
+seed = ${system.seed}
 dropout = 0.1
 accumulate_gradient = 1
 # Extra resources for transfer-learning or pseudo-rehearsal
-init_tok2vec = ${paths:init_tok2vec}
-raw_text = ${paths:raw}
+init_tok2vec = ${paths.init_tok2vec}
+raw_text = ${paths.raw}
 vectors = null
 # Controls early-stopping. 0 or -1 mean unlimited.
 patience = 1600
@ -42,7 +42,7 @@ frozen_components = []

 [training.train_corpus]
 @readers = "spacy.Corpus.v1"
-path = ${paths:train}
+path = ${paths.train}
 # Whether to train on sequences with 'gold standard' sentence boundaries
 # and tokens. If you set this to true, take care to ensure your run-time
 # data is passed in sentence-by-sentence via some prior preprocessing.
@ -54,7 +54,7 @@ limit = 0

 [training.dev_corpus]
 @readers = "spacy.Corpus.v1"
-path = ${paths:dev}
+path = ${paths.dev}
 # Whether to train on sequences with 'gold standard' sentence boundaries
 # and tokens. If you set this to true, take care to ensure your run-time
 # data is passed in sentence-by-sentence via some prior preprocessing.
@ -98,8 +98,8 @@ max_length = 500
 dropout = 0.2
 n_save_every = null
 batch_size = 3000
-seed = ${system:seed}
-use_pytorch_for_gpu_memory = ${system:use_pytorch_for_gpu_memory}
+seed = ${system.seed}
+use_pytorch_for_gpu_memory = ${system.use_pytorch_for_gpu_memory}
 tok2vec_model = "components.tok2vec.model"

 [pretraining.objective]
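The config changes above move interpolation from the old `${section:key}` syntax to dot notation, `${section.key}`. As a rough illustration of what dot-notation resolution does, here is a toy resolver over a nested dict (a hedged sketch only; thinc's actual `Config` implementation is different and also handles typing and promises):

```python
import re

def interpolate(value: str, config: dict) -> str:
    """Resolve ${section.key} references against a nested config dict."""
    def resolve(match: "re.Match") -> str:
        obj = config
        # Walk one level per dot-separated part of the reference.
        for part in match.group(1).split("."):
            obj = obj[part]
        return str(obj)
    return re.sub(r"\$\{([^}]+)\}", resolve, value)

config = {"system": {"seed": 0}, "paths": {"train": "corpus/train.spacy"}}
print(interpolate("seed = ${system.seed}", config))  # seed = 0
print(interpolate("path = ${paths.train}", config))  # path = corpus/train.spacy
```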
@ -18,7 +18,7 @@ RENDER_WRAPPER = None


 def render(
-    docs: Union[Iterable[Doc], Doc],
+    docs: Union[Iterable[Union[Doc, Span]], Doc, Span],
     style: str = "dep",
     page: bool = False,
     minify: bool = False,
@ -252,8 +252,10 @@ class EntityRenderer:
         colors.update(user_color)
         colors.update(options.get("colors", {}))
         self.default_color = DEFAULT_ENTITY_COLOR
-        self.colors = colors
+        self.colors = {label.upper(): color for label, color in colors.items()}
         self.ents = options.get("ents", None)
+        if self.ents is not None:
+            self.ents = [ent.upper() for ent in self.ents]
         self.direction = DEFAULT_DIR
         self.lang = DEFAULT_LANG
         template = options.get("template")
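By uppercasing both the user-supplied color keys and the `ents` filter, the `EntityRenderer` change above makes label lookup case-insensitive. A small standalone sketch of that normalization (hypothetical helper name, pure Python):

```python
DEFAULT_ENTITY_COLOR = "#ddd"

def build_colors(user_colors: dict) -> dict:
    # Uppercase the keys so user-supplied colors match entity labels
    # regardless of case, mirroring the EntityRenderer change above.
    return {label.upper(): color for label, color in user_colors.items()}

colors = build_colors({"person": "#aa9cfc", "Org": "#7aecec"})
print(colors.get("PERSON", DEFAULT_ENTITY_COLOR))  # #aa9cfc
print(colors.get("GPE", DEFAULT_ENTITY_COLOR))     # #ddd (falls back to default)
```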
@ -51,14 +51,14 @@ TPL_ENTS = """
 TPL_ENT = """
 <mark class="entity" style="background: {bg}; padding: 0.45em 0.6em; margin: 0 0.25em; line-height: 1; border-radius: 0.35em;">
     {text}
-    <span style="font-size: 0.8em; font-weight: bold; line-height: 1; border-radius: 0.35em; text-transform: uppercase; vertical-align: middle; margin-left: 0.5rem">{label}</span>
+    <span style="font-size: 0.8em; font-weight: bold; line-height: 1; border-radius: 0.35em; vertical-align: middle; margin-left: 0.5rem">{label}</span>
 </mark>
 """

 TPL_ENT_RTL = """
 <mark class="entity" style="background: {bg}; padding: 0.45em 0.6em; margin: 0 0.25em; line-height: 1; border-radius: 0.35em">
     {text}
-    <span style="font-size: 0.8em; font-weight: bold; line-height: 1; border-radius: 0.35em; text-transform: uppercase; vertical-align: middle; margin-right: 0.5rem">{label}</span>
+    <span style="font-size: 0.8em; font-weight: bold; line-height: 1; border-radius: 0.35em; vertical-align: middle; margin-right: 0.5rem">{label}</span>
 </mark>
 """
@ -78,10 +78,11 @@ class Warnings:
             "are currently: {langs}")

     # TODO: fix numbering after merging develop into master
+    W090 = ("Could not locate any binary .spacy files in path '{path}'.")
     W091 = ("Could not clean/remove the temp directory at {dir}: {msg}.")
     W092 = ("Ignoring annotations for sentence starts, as dependency heads are set.")
     W093 = ("Could not find any data to train the {name} on. Is your "
-            "input data correctly formatted ?")
+            "input data correctly formatted?")
     W094 = ("Model '{model}' ({model_version}) specifies an under-constrained "
             "spaCy version requirement: {version}. This can lead to compatibility "
             "problems with older versions, or as new spaCy versions are "
@ -476,6 +477,10 @@ class Errors:
     E199 = ("Unable to merge 0-length span at doc[{start}:{end}].")

     # TODO: fix numbering after merging develop into master
+    E928 = ("A 'KnowledgeBase' should be written to / read from a file, but the "
+            "provided argument {loc} is an existing directory.")
+    E929 = ("A 'KnowledgeBase' could not be read from {loc} - the path does "
+            "not seem to exist.")
     E930 = ("Received invalid get_examples callback in {name}.begin_training. "
             "Expected function that returns an iterable of Example objects but "
             "got: {obj}")
@ -503,8 +508,6 @@ class Errors:
             "not found in pipeline. Available components: {opts}")
     E945 = ("Can't copy pipeline component '{name}' from source. Expected loaded "
             "nlp object, but got: {source}")
-    E946 = ("The Vocab for the knowledge base is not initialized. Did you forget to "
-            "call kb.initialize()?")
     E947 = ("Matcher.add received invalid 'greedy' argument: expected "
             "a string value from {expected} but got: '{arg}'")
     E948 = ("Matcher.add received invalid 'patterns' argument: expected "
@ -600,7 +603,8 @@ class Errors:
             "\"en_core_web_sm\" will copy the component from that model.\n\n{config}")
     E985 = ("Can't load model from config file: no 'nlp' section found.\n\n{config}")
     E986 = ("Could not create any training batches: check your input. "
-            "Perhaps discard_oversize should be set to False ?")
+            "Are the train and dev paths defined? "
+            "Is 'discard_oversize' set appropriately? ")
     E987 = ("The text of an example training instance is either a Doc or "
             "a string, but found {type} instead.")
     E988 = ("Could not parse any training examples. Ensure the data is "
@ -610,8 +614,6 @@ class Errors:
             "of the training data in spaCy 3.0 onwards. The 'update' "
             "function should now be called with a batch of 'Example' "
             "objects, instead of (text, annotation) tuples. ")
-    E990 = ("An entity linking component needs to be initialized with a "
-            "KnowledgeBase object, but found {type} instead.")
     E991 = ("The function 'select_pipes' should be called with either a "
             "'disable' argument to list the names of the pipe components "
             "that should be disabled, or with an 'enable' argument that "
@ -1,8 +1,10 @@
+import warnings
 from typing import Union, List, Iterable, Iterator, TYPE_CHECKING, Callable
 from pathlib import Path

 from .. import util
 from .example import Example
+from ..errors import Warnings
 from ..tokens import DocBin, Doc
 from ..vocab import Vocab

@ -10,6 +12,8 @@ if TYPE_CHECKING:
     # This lets us add type hints for mypy etc. without causing circular imports
     from ..language import Language  # noqa: F401

+FILE_TYPE = ".spacy"


 @util.registry.readers("spacy.Corpus.v1")
 def create_docbin_reader(
@ -53,8 +57,9 @@ class Corpus:
     @staticmethod
     def walk_corpus(path: Union[str, Path]) -> List[Path]:
         path = util.ensure_path(path)
-        if not path.is_dir():
+        if not path.is_dir() and path.parts[-1].endswith(FILE_TYPE):
             return [path]
+        orig_path = path
         paths = [path]
         locs = []
         seen = set()
@ -66,8 +71,10 @@ class Corpus:
                 continue
             elif path.is_dir():
                 paths.extend(path.iterdir())
-            elif path.parts[-1].endswith(".spacy"):
+            elif path.parts[-1].endswith(FILE_TYPE):
                 locs.append(path)
+        if len(locs) == 0:
+            warnings.warn(Warnings.W090.format(path=orig_path))
         return locs

     def __call__(self, nlp: "Language") -> Iterator[Example]:
@ -135,7 +142,7 @@ class Corpus:
         i = 0
         for loc in locs:
             loc = util.ensure_path(loc)
-            if loc.parts[-1].endswith(".spacy"):
+            if loc.parts[-1].endswith(FILE_TYPE):
                 doc_bin = DocBin().from_disk(loc)
                 docs = doc_bin.get_docs(vocab)
                 for doc in docs:
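The `walk_corpus` change above does two things: a non-directory path is only returned directly if it ends in `.spacy`, and an empty result now emits the new W090 warning. A simplified, pure-Python sketch of that behavior (standalone, not spaCy's actual method, and using `rglob` instead of the manual queue):

```python
import tempfile
import warnings
from pathlib import Path
from typing import List, Union

FILE_TYPE = ".spacy"

def walk_corpus(path: Union[str, Path]) -> List[Path]:
    """Collect binary .spacy files under a path."""
    path = Path(path)
    # A direct path to a .spacy file is returned as-is.
    if not path.is_dir() and path.name.endswith(FILE_TYPE):
        return [path]
    # Otherwise walk the directory, keeping only non-hidden .spacy files.
    locs = sorted(p for p in path.rglob(f"*{FILE_TYPE}") if not p.name.startswith("."))
    if len(locs) == 0:
        # Mirrors the new W090 warning for empty corpus paths.
        warnings.warn(f"Could not locate any binary {FILE_TYPE} files in path '{path}'.")
    return locs

with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "train.spacy").touch()
    (Path(tmp) / "notes.txt").touch()
    print([p.name for p in walk_corpus(tmp)])  # ['train.spacy']
```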
@ -140,7 +140,7 @@ cdef class KnowledgeBase:
         self._entries.push_back(entry)
         self._aliases_table.push_back(alias)

-    cpdef load_bulk(self, loc)
+    cpdef from_disk(self, loc)
     cpdef set_entities(self, entity_list, freq_list, vector_list)


54 spacy/kb.pyx

@ -1,4 +1,5 @@
 # cython: infer_types=True, profile=True
+from typing import Iterator
 from cymem.cymem cimport Pool
 from preshed.maps cimport PreshMap
 from cpython.exc cimport PyErr_SetFromErrno
@ -64,6 +65,16 @@ cdef class Candidate:
         return self.prior_prob


+def get_candidates(KnowledgeBase kb, span) -> Iterator[Candidate]:
+    """
+    Return candidate entities for a given span by using the text of the span as the alias
+    and fetching appropriate entries from the index.
+    This particular function is optimized to work with the built-in KB functionality,
+    but any other custom candidate generation method can be used in combination with the KB as well.
+    """
+    return kb.get_alias_candidates(span.text)
+
+
 cdef class KnowledgeBase:
     """A `KnowledgeBase` instance stores unique identifiers for entities and their textual aliases,
     to support entity linking of named entities to real-world concepts.
@ -71,25 +82,16 @@ cdef class KnowledgeBase:
     DOCS: https://spacy.io/api/kb
     """

-    def __init__(self, entity_vector_length):
-        """Create a KnowledgeBase. Make sure to call kb.initialize() before using it."""
+    def __init__(self, Vocab vocab, entity_vector_length):
+        """Create a KnowledgeBase."""
         self.mem = Pool()
         self.entity_vector_length = entity_vector_length

         self._entry_index = PreshMap()
         self._alias_index = PreshMap()
-        self.vocab = None
-
-    def initialize(self, Vocab vocab):
         self.vocab = vocab
         self.vocab.strings.add("")
         self._create_empty_vectors(dummy_hash=self.vocab.strings[""])

-    def require_vocab(self):
-        if self.vocab is None:
-            raise ValueError(Errors.E946)
-
     @property
     def entity_vector_length(self):
         """RETURNS (uint64): length of the entity vectors"""
@ -102,14 +104,12 @@ cdef class KnowledgeBase:
         return len(self._entry_index)

     def get_entity_strings(self):
-        self.require_vocab()
         return [self.vocab.strings[x] for x in self._entry_index]

     def get_size_aliases(self):
         return len(self._alias_index)

     def get_alias_strings(self):
-        self.require_vocab()
         return [self.vocab.strings[x] for x in self._alias_index]

     def add_entity(self, unicode entity, float freq, vector[float] entity_vector):
@ -117,7 +117,6 @@ cdef class KnowledgeBase:
         Add an entity to the KB, optionally specifying its log probability based on corpus frequency
         Return the hash of the entity ID/name at the end.
         """
-        self.require_vocab()
         cdef hash_t entity_hash = self.vocab.strings.add(entity)

         # Return if this entity was added before
@ -140,7 +139,6 @@ cdef class KnowledgeBase:
         return entity_hash

     cpdef set_entities(self, entity_list, freq_list, vector_list):
-        self.require_vocab()
         if len(entity_list) != len(freq_list) or len(entity_list) != len(vector_list):
             raise ValueError(Errors.E140)

@ -176,12 +174,10 @@ cdef class KnowledgeBase:
             i += 1

     def contains_entity(self, unicode entity):
-        self.require_vocab()
         cdef hash_t entity_hash = self.vocab.strings.add(entity)
         return entity_hash in self._entry_index

     def contains_alias(self, unicode alias):
-        self.require_vocab()
         cdef hash_t alias_hash = self.vocab.strings.add(alias)
         return alias_hash in self._alias_index

@ -190,7 +186,6 @@ cdef class KnowledgeBase:
         For a given alias, add its potential entities and prior probabilities to the KB.
         Return the alias_hash at the end
         """
-        self.require_vocab()
         # Throw an error if the length of entities and probabilities are not the same
         if not len(entities) == len(probabilities):
             raise ValueError(Errors.E132.format(alias=alias,
@ -234,7 +229,6 @@ cdef class KnowledgeBase:
         Throw an error if this entity+prior prob would exceed the sum of 1.
         For efficiency, it's best to use the method `add_alias` as much as possible instead of this one.
         """
-        self.require_vocab()
         # Check if the alias exists in the KB
         cdef hash_t alias_hash = self.vocab.strings[alias]
         if not alias_hash in self._alias_index:
@ -274,14 +268,12 @@ cdef class KnowledgeBase:
         alias_entry.probs = probs
         self._aliases_table[alias_index] = alias_entry

-    def get_candidates(self, unicode alias):
+    def get_alias_candidates(self, unicode alias) -> Iterator[Candidate]:
         """
         Return candidate entities for an alias. Each candidate defines the entity, the original alias,
         and the prior probability of that alias resolving to that entity.
         If the alias is not known in the KB, an empty list is returned.
         """
-        self.require_vocab()
         cdef hash_t alias_hash = self.vocab.strings[alias]
         if not alias_hash in self._alias_index:
             return []
@ -298,7 +290,6 @@ cdef class KnowledgeBase:
             if entry_index != 0]

     def get_vector(self, unicode entity):
-        self.require_vocab()
         cdef hash_t entity_hash = self.vocab.strings[entity]

         # Return an empty list if this entity is unknown in this KB
@ -311,7 +302,6 @@ cdef class KnowledgeBase:
     def get_prior_prob(self, unicode entity, unicode alias):
         """ Return the prior probability of a given alias being linked to a given entity,
         or return 0.0 when this combination is not known in the knowledge base"""
-        self.require_vocab()
         cdef hash_t alias_hash = self.vocab.strings[alias]
|
||||||
cdef hash_t entity_hash = self.vocab.strings[entity]
|
cdef hash_t entity_hash = self.vocab.strings[entity]
|
||||||
|
|
||||||
|
@ -329,8 +319,7 @@ cdef class KnowledgeBase:
|
||||||
return 0.0
|
return 0.0
|
||||||
|
|
||||||
|
|
||||||
def dump(self, loc):
|
def to_disk(self, loc):
|
||||||
self.require_vocab()
|
|
||||||
cdef Writer writer = Writer(loc)
|
cdef Writer writer = Writer(loc)
|
||||||
writer.write_header(self.get_size_entities(), self.entity_vector_length)
|
writer.write_header(self.get_size_entities(), self.entity_vector_length)
|
||||||
|
|
||||||
|
@ -370,7 +359,7 @@ cdef class KnowledgeBase:
|
||||||
|
|
||||||
writer.close()
|
writer.close()
|
||||||
|
|
||||||
cpdef load_bulk(self, loc):
|
cpdef from_disk(self, loc):
|
||||||
cdef hash_t entity_hash
|
cdef hash_t entity_hash
|
||||||
cdef hash_t alias_hash
|
cdef hash_t alias_hash
|
||||||
cdef int64_t entry_index
|
cdef int64_t entry_index
|
||||||
|
@ -462,12 +451,11 @@ cdef class KnowledgeBase:
|
||||||
|
|
||||||
cdef class Writer:
|
cdef class Writer:
|
||||||
def __init__(self, object loc):
|
def __init__(self, object loc):
|
||||||
if path.exists(loc):
|
|
||||||
assert not path.isdir(loc), f"{loc} is directory"
|
|
||||||
if isinstance(loc, Path):
|
if isinstance(loc, Path):
|
||||||
loc = bytes(loc)
|
loc = bytes(loc)
|
||||||
if path.exists(loc):
|
if path.exists(loc):
|
||||||
assert not path.isdir(loc), "%s is directory." % loc
|
if path.isdir(loc):
|
||||||
|
raise ValueError(Errors.E928.format(loc=loc))
|
||||||
cdef bytes bytes_loc = loc.encode('utf8') if type(loc) == unicode else loc
|
cdef bytes bytes_loc = loc.encode('utf8') if type(loc) == unicode else loc
|
||||||
self._fp = fopen(<char*>bytes_loc, 'wb')
|
self._fp = fopen(<char*>bytes_loc, 'wb')
|
||||||
if not self._fp:
|
if not self._fp:
|
||||||
|
@ -511,8 +499,10 @@ cdef class Reader:
|
||||||
def __init__(self, object loc):
|
def __init__(self, object loc):
|
||||||
if isinstance(loc, Path):
|
if isinstance(loc, Path):
|
||||||
loc = bytes(loc)
|
loc = bytes(loc)
|
||||||
assert path.exists(loc)
|
if not path.exists(loc):
|
||||||
assert not path.isdir(loc)
|
raise ValueError(Errors.E929.format(loc=loc))
|
||||||
|
if path.isdir(loc):
|
||||||
|
raise ValueError(Errors.E928.format(loc=loc))
|
||||||
cdef bytes bytes_loc = loc.encode('utf8') if type(loc) == unicode else loc
|
cdef bytes bytes_loc = loc.encode('utf8') if type(loc) == unicode else loc
|
||||||
self._fp = fopen(<char*>bytes_loc, 'rb')
|
self._fp = fopen(<char*>bytes_loc, 'rb')
|
||||||
if not self._fp:
|
if not self._fp:
|
||||||
|
|
|
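The hunks above rename the knowledge base's serialization methods from `dump`/`load_bulk` to `to_disk`/`from_disk`, and drop the separate `initialize()` step now that the vocab is passed to the constructor. A minimal sketch of the renamed round-trip, using a hypothetical `DummyKB` stand-in (the real `KnowledgeBase` is a Cython class with a binary format, not JSON):

```python
# DummyKB is a stand-in to illustrate the dump/load_bulk -> to_disk/from_disk
# rename; it is not spaCy's KnowledgeBase.
import json
import tempfile
from pathlib import Path


class DummyKB:
    def __init__(self, vocab, entity_vector_length):
        # vocab is now taken at construction time (no initialize() call)
        self.vocab = vocab
        self.entity_vector_length = entity_vector_length
        self.entities = {}

    def add_entity(self, entity, freq, entity_vector):
        self.entities[entity] = {"freq": freq, "vector": entity_vector}

    def to_disk(self, loc):   # was: dump()
        Path(loc).write_text(json.dumps(self.entities))

    def from_disk(self, loc):  # was: load_bulk()
        self.entities = json.loads(Path(loc).read_text())


kb = DummyKB(vocab="en", entity_vector_length=3)
kb.add_entity("Q1", freq=19, entity_vector=[8, 4, 3])
with tempfile.TemporaryDirectory() as d:
    kb_path = Path(d) / "kb.json"
    kb.to_disk(kb_path)
    kb2 = DummyKB(vocab="en", entity_vector_length=3)
    kb2.from_disk(kb_path)
assert kb2.entities["Q1"]["freq"] == 19
```

The same to_disk/from_disk naming matches the convention used by the other serializable spaCy objects in this commit.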
@@ -439,8 +439,6 @@ class Language:
         assigns: Iterable[str] = tuple(),
         requires: Iterable[str] = tuple(),
         retokenizes: bool = False,
-        scores: Iterable[str] = tuple(),
-        default_score_weights: Dict[str, float] = SimpleFrozenDict(),
         func: Optional[Callable[[Doc], Doc]] = None,
     ) -> Callable:
         """Register a new pipeline component. Can be used for stateless function
@@ -456,12 +454,6 @@ class Language:
             e.g. "token.ent_id". Used for pipeline analyis.
         retokenizes (bool): Whether the component changes the tokenization.
             Used for pipeline analysis.
-        scores (Iterable[str]): All scores set by the component if it's trainable,
-            e.g. ["ents_f", "ents_r", "ents_p"].
-        default_score_weights (Dict[str, float]): The scores to report during
-            training, and their default weight towards the final score used to
-            select the best model. Weights should sum to 1.0 per component and
-            will be combined and normalized for the whole pipeline.
         func (Optional[Callable]): Factory function if not used as a decorator.

         DOCS: https://spacy.io/api/language#component
@@ -482,8 +474,6 @@ class Language:
             assigns=assigns,
             requires=requires,
             retokenizes=retokenizes,
-            scores=scores,
-            default_score_weights=default_score_weights,
             func=factory_func,
         )
         return component_func
@@ -782,9 +772,15 @@ class Language:
         self.remove_pipe(name)
         if not len(self.pipeline) or pipe_index == len(self.pipeline):
             # we have no components to insert before/after, or we're replacing the last component
-            self.add_pipe(factory_name, name=name)
+            self.add_pipe(factory_name, name=name, config=config, validate=validate)
         else:
-            self.add_pipe(factory_name, name=name, before=pipe_index)
+            self.add_pipe(
+                factory_name,
+                name=name,
+                before=pipe_index,
+                config=config,
+                validate=validate,
+            )

     def rename_pipe(self, old_name: str, new_name: str) -> None:
         """Rename a pipeline component.
@@ -1112,7 +1108,6 @@ class Language:
         self,
         examples: Iterable[Example],
         *,
-        verbose: bool = False,
         batch_size: int = 256,
         scorer: Optional[Scorer] = None,
         component_cfg: Optional[Dict[str, Dict[str, Any]]] = None,
@@ -1121,7 +1116,6 @@ class Language:
         """Evaluate a model's pipeline components.

        examples (Iterable[Example]): `Example` objects.
-        verbose (bool): Print debugging information.
         batch_size (int): Batch size to use.
         scorer (Optional[Scorer]): Scorer to use. If not passed in, a new one
             will be created.
@@ -1140,7 +1134,6 @@ class Language:
            scorer_cfg = {}
        if scorer is None:
            kwargs = dict(scorer_cfg)
-            kwargs.setdefault("verbose", verbose)
            kwargs.setdefault("nlp", self)
            scorer = Scorer(**kwargs)
        texts = [eg.reference.text for eg in examples]
@@ -1163,8 +1156,7 @@ class Language:
        docs = list(docs)
        end_time = timer()
        for i, (doc, eg) in enumerate(zip(docs, examples)):
-            if verbose:
-                print(doc)
+            util.logger.debug(doc)
            eg.predicted = doc
        results = scorer.score(examples)
        n_words = sum(len(eg.predicted) for eg in examples)
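The `replace_pipe` hunk above fixes a bug where the replacement component was re-added without its `config`/`validate` arguments, so custom settings were silently dropped. A plain-Python sketch of the pattern, with `MiniPipeline` as a hypothetical stand-in for `Language`:

```python
# MiniPipeline is a stand-in illustrating why replace_pipe must forward the
# config to add_pipe; it is not spaCy's Language class.
class MiniPipeline:
    def __init__(self):
        self.pipeline = []  # list of (name, config) tuples

    def add_pipe(self, name, config=None, index=None):
        item = (name, config or {})
        if index is None:
            self.pipeline.append(item)
        else:
            self.pipeline.insert(index, item)

    def replace_pipe(self, name, new_name, config=None):
        index = [n for n, _ in self.pipeline].index(name)
        del self.pipeline[index]
        # forward config so the replacement keeps its settings (the bug was
        # calling add_pipe without it)
        self.add_pipe(new_name, config=config, index=index)


nlp = MiniPipeline()
nlp.add_pipe("tagger")
nlp.add_pipe("ner")
nlp.replace_pipe("tagger", "tagger_v2", config={"labels": ["N", "V"]})
assert nlp.pipeline[0] == ("tagger_v2", {"labels": ["N", "V"]})
```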
@@ -1,9 +1,9 @@
-from typing import Optional
+from typing import Optional, Callable, Iterable
 from thinc.api import chain, clone, list2ragged, reduce_mean, residual
 from thinc.api import Model, Maxout, Linear

 from ...util import registry
-from ...kb import KnowledgeBase
+from ...kb import KnowledgeBase, Candidate, get_candidates
 from ...vocab import Vocab


@@ -25,15 +25,23 @@ def build_nel_encoder(tok2vec: Model, nO: Optional[int] = None) -> Model:


 @registry.assets.register("spacy.KBFromFile.v1")
-def load_kb(vocab_path: str, kb_path: str) -> KnowledgeBase:
-    vocab = Vocab().from_disk(vocab_path)
-    kb = KnowledgeBase(entity_vector_length=1)
-    kb.initialize(vocab)
-    kb.load_bulk(kb_path)
-    return kb
+def load_kb(kb_path: str) -> Callable[[Vocab], KnowledgeBase]:
+    def kb_from_file(vocab):
+        kb = KnowledgeBase(vocab, entity_vector_length=1)
+        kb.from_disk(kb_path)
+        return kb
+
+    return kb_from_file


 @registry.assets.register("spacy.EmptyKB.v1")
-def empty_kb(entity_vector_length: int) -> KnowledgeBase:
-    kb = KnowledgeBase(entity_vector_length=entity_vector_length)
-    return kb
+def empty_kb(entity_vector_length: int) -> Callable[[Vocab], KnowledgeBase]:
+    def empty_kb_factory(vocab):
+        return KnowledgeBase(vocab=vocab, entity_vector_length=entity_vector_length)
+
+    return empty_kb_factory
+
+
+@registry.assets.register("spacy.CandidateGenerator.v1")
+def create_candidates() -> Callable[[KnowledgeBase, "Span"], Iterable[Candidate]]:
+    return get_candidates
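The hunks above change the registered asset functions so they no longer build a KB eagerly (which required loading a separate vocab from disk): they now return a closure that constructs the KB from the pipeline's shared vocab at initialization time. A dependency-free sketch of that factory pattern, with a dict standing in for the KB:

```python
# Sketch of the loader-factory pattern: the registered function takes only
# config-time arguments and returns a callable that receives the vocab later.
# The dict is a stand-in for a KnowledgeBase instance.
def empty_kb(entity_vector_length):
    def empty_kb_factory(vocab):
        return {"vocab": vocab, "entity_vector_length": entity_vector_length}

    return empty_kb_factory


kb_loader = empty_kb(64)        # config time: no vocab available yet
kb = kb_loader("shared-vocab")  # construction time: shared vocab injected
assert kb["entity_vector_length"] == 64
assert kb["vocab"] == "shared-vocab"
```

Deferring construction this way is what lets the `EntityLinker` below call `kb_loader(self.vocab)` instead of receiving a KB built against a different vocab.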
@@ -6,7 +6,7 @@ from thinc.api import CosineDistance, get_array_module, Model, Optimizer, Config
 from thinc.api import set_dropout_rate
 import warnings

-from ..kb import KnowledgeBase
+from ..kb import KnowledgeBase, Candidate
 from ..tokens import Doc
 from .pipe import Pipe, deserialize_config
 from ..language import Language
@@ -32,35 +32,30 @@ subword_features = true
 """
 DEFAULT_NEL_MODEL = Config().from_str(default_model_config)["model"]

-default_kb_config = """
-[kb]
-@assets = "spacy.EmptyKB.v1"
-entity_vector_length = 64
-"""
-DEFAULT_NEL_KB = Config().from_str(default_kb_config)["kb"]


 @Language.factory(
     "entity_linker",
     requires=["doc.ents", "doc.sents", "token.ent_iob", "token.ent_type"],
     assigns=["token.ent_kb_id"],
     default_config={
-        "kb": DEFAULT_NEL_KB,
+        "kb_loader": {"@assets": "spacy.EmptyKB.v1", "entity_vector_length": 64},
         "model": DEFAULT_NEL_MODEL,
         "labels_discard": [],
         "incl_prior": True,
         "incl_context": True,
+        "get_candidates": {"@assets": "spacy.CandidateGenerator.v1"},
     },
 )
 def make_entity_linker(
     nlp: Language,
     name: str,
     model: Model,
-    kb: KnowledgeBase,
+    kb_loader: Callable[[Vocab], KnowledgeBase],
     *,
     labels_discard: Iterable[str],
     incl_prior: bool,
     incl_context: bool,
+    get_candidates: Callable[[KnowledgeBase, "Span"], Iterable[Candidate]],
 ):
     """Construct an EntityLinker component.

@@ -76,10 +71,11 @@ def make_entity_linker(
         nlp.vocab,
         model,
         name,
-        kb=kb,
+        kb_loader=kb_loader,
         labels_discard=labels_discard,
         incl_prior=incl_prior,
         incl_context=incl_context,
+        get_candidates=get_candidates,
     )


@@ -97,10 +93,11 @@ class EntityLinker(Pipe):
         model: Model,
         name: str = "entity_linker",
         *,
-        kb: KnowledgeBase,
+        kb_loader: Callable[[Vocab], KnowledgeBase],
         labels_discard: Iterable[str],
         incl_prior: bool,
         incl_context: bool,
+        get_candidates: Callable[[KnowledgeBase, "Span"], Iterable[Candidate]],
     ) -> None:
         """Initialize an entity linker.

@@ -108,7 +105,7 @@ class EntityLinker(Pipe):
         model (thinc.api.Model): The Thinc Model powering the pipeline component.
         name (str): The component instance name, used to add entries to the
             losses during training.
-        kb (KnowledgeBase): The KnowledgeBase holding all entities and their aliases.
+        kb_loader (Callable[[Vocab], KnowledgeBase]): A function that creates a KnowledgeBase from a Vocab instance.
         labels_discard (Iterable[str]): NER labels that will automatically get a "NIL" prediction.
         incl_prior (bool): Whether or not to include prior probabilities from the KB in the model.
         incl_context (bool): Whether or not to include the local context in the model.
@@ -119,17 +116,12 @@ class EntityLinker(Pipe):
         self.model = model
         self.name = name
         cfg = {
-            "kb": kb,
             "labels_discard": list(labels_discard),
             "incl_prior": incl_prior,
             "incl_context": incl_context,
         }
-        if not isinstance(kb, KnowledgeBase):
-            raise ValueError(Errors.E990.format(type=type(self.kb)))
-        kb.initialize(vocab)
-        self.kb = kb
-        if "kb" in cfg:
-            del cfg["kb"]  # we don't want to duplicate its serialization
+        self.kb = kb_loader(self.vocab)
+        self.get_candidates = get_candidates
         self.cfg = dict(cfg)
         self.distance = CosineDistance(normalize=False)
         # how many neightbour sentences to take into account
@@ -326,8 +318,9 @@ class EntityLinker(Pipe):
                 end_token = sentences[end_sentence].end
                 sent_doc = doc[start_token:end_token].as_doc()
                 # currently, the context is the same for each entity in a sentence (should be refined)
-                sentence_encoding = self.model.predict([sent_doc])[0]
-                xp = get_array_module(sentence_encoding)
-                sentence_encoding_t = sentence_encoding.T
-                sentence_norm = xp.linalg.norm(sentence_encoding_t)
+                xp = self.model.ops.xp
+                if self.cfg.get("incl_context"):
+                    sentence_encoding = self.model.predict([sent_doc])[0]
+                    sentence_encoding_t = sentence_encoding.T
+                    sentence_norm = xp.linalg.norm(sentence_encoding_t)
                 for ent in sent.ents:
@@ -337,7 +330,7 @@ class EntityLinker(Pipe):
                         # ignoring this entity - setting to NIL
                         final_kb_ids.append(self.NIL)
                     else:
-                        candidates = self.kb.get_candidates(ent.text)
+                        candidates = self.get_candidates(self.kb, ent)
                         if not candidates:
                             # no prediction possible for this entity - setting to NIL
                             final_kb_ids.append(self.NIL)
@@ -421,10 +414,9 @@ class EntityLinker(Pipe):
         DOCS: https://spacy.io/api/entitylinker#to_disk
         """
         serialize = {}
-        self.cfg["entity_width"] = self.kb.entity_vector_length
         serialize["cfg"] = lambda p: srsly.write_json(p, self.cfg)
         serialize["vocab"] = lambda p: self.vocab.to_disk(p)
-        serialize["kb"] = lambda p: self.kb.dump(p)
+        serialize["kb"] = lambda p: self.kb.to_disk(p)
         serialize["model"] = lambda p: self.model.to_disk(p)
         util.to_disk(path, serialize, exclude)

@@ -446,15 +438,10 @@ class EntityLinker(Pipe):
         except AttributeError:
             raise ValueError(Errors.E149) from None

-        def load_kb(p):
-            self.kb = KnowledgeBase(entity_vector_length=self.cfg["entity_width"])
-            self.kb.initialize(self.vocab)
-            self.kb.load_bulk(p)
-
         deserialize = {}
         deserialize["vocab"] = lambda p: self.vocab.from_disk(p)
         deserialize["cfg"] = lambda p: self.cfg.update(deserialize_config(p))
-        deserialize["kb"] = load_kb
+        deserialize["kb"] = lambda p: self.kb.from_disk(p)
         deserialize["model"] = load_model
         util.from_disk(path, deserialize, exclude)
         return self
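A key change in the hunks above: candidate generation becomes pluggable. The `EntityLinker` now calls `self.get_candidates(self.kb, ent)` rather than the hard-wired `self.kb.get_candidates(ent.text)`, so a custom generator (registered via `spacy.CandidateGenerator.v1` or a user-defined asset) can be swapped in. A plain-Python sketch of the idea, where `FakeKB` and the span-as-string simplification are stand-ins:

```python
# FakeKB is a stand-in for the KnowledgeBase; spans are simplified to strings.
class FakeKB:
    def __init__(self, aliases):
        self.aliases = aliases  # alias -> list of entity ids

    def get_alias_candidates(self, alias):
        return self.aliases.get(alias, [])


def get_candidates(kb, span_text):
    # default generator: exact alias lookup
    return kb.get_alias_candidates(span_text)


def lowercase_candidates(kb, span_text):
    # custom generator: case-insensitive lookup
    return kb.get_alias_candidates(span_text.lower())


kb = FakeKB({"douglas adams": ["Q42"]})
assert get_candidates(kb, "Douglas Adams") == []      # exact match fails
assert lowercase_candidates(kb, "Douglas Adams") == ["Q42"]
```

Because the generator receives the whole span (not just its text) in the real API, richer strategies (using entity labels, context, fuzzy matching) become possible without touching the KB itself.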
@@ -68,7 +68,6 @@ class Tagger(Pipe):
         name (str): The component instance name, used to add entries to the
             losses during training.
         labels (List): The set of labels. Defaults to None.
-        set_morphology (bool): Whether to set morphological features.

         DOCS: https://spacy.io/api/tagger#init
         """
@@ -167,18 +167,20 @@ class ModelMetaSchema(BaseModel):
     lang: StrictStr = Field(..., title="Two-letter language code, e.g. 'en'")
     name: StrictStr = Field(..., title="Model name")
     version: StrictStr = Field(..., title="Model version")
-    spacy_version: Optional[StrictStr] = Field(None, title="Compatible spaCy version identifier")
-    parent_package: Optional[StrictStr] = Field("spacy", title="Name of parent spaCy package, e.g. spacy or spacy-nightly")
-    pipeline: Optional[List[StrictStr]] = Field([], title="Names of pipeline components")
-    description: Optional[StrictStr] = Field(None, title="Model description")
-    license: Optional[StrictStr] = Field(None, title="Model license")
-    author: Optional[StrictStr] = Field(None, title="Model author name")
-    email: Optional[StrictStr] = Field(None, title="Model author email")
-    url: Optional[StrictStr] = Field(None, title="Model author URL")
-    sources: Optional[Union[List[StrictStr], Dict[str, str]]] = Field(None, title="Training data sources")
-    vectors: Optional[Dict[str, Any]] = Field(None, title="Included word vectors")
-    accuracy: Optional[Dict[str, Union[float, int]]] = Field(None, title="Accuracy numbers")
-    speed: Optional[Dict[str, Union[float, int]]] = Field(None, title="Speed evaluation numbers")
+    spacy_version: StrictStr = Field("", title="Compatible spaCy version identifier")
+    parent_package: StrictStr = Field("spacy", title="Name of parent spaCy package, e.g. spacy or spacy-nightly")
+    pipeline: List[StrictStr] = Field([], title="Names of pipeline components")
+    description: StrictStr = Field("", title="Model description")
+    license: StrictStr = Field("", title="Model license")
+    author: StrictStr = Field("", title="Model author name")
+    email: StrictStr = Field("", title="Model author email")
+    url: StrictStr = Field("", title="Model author URL")
+    sources: Optional[Union[List[StrictStr], List[Dict[str, str]]]] = Field(None, title="Training data sources")
+    vectors: Dict[str, Any] = Field({}, title="Included word vectors")
+    labels: Dict[str, Dict[str, List[str]]] = Field({}, title="Component labels, keyed by component name")
+    accuracy: Dict[str, Union[float, Dict[str, float]]] = Field({}, title="Accuracy numbers")
+    speed: Dict[str, Union[float, int]] = Field({}, title="Speed evaluation numbers")
+    spacy_git_version: StrictStr = Field("", title="Commit of spaCy version used")
     # fmt: on


@@ -301,7 +303,7 @@ class ProjectConfigCommand(BaseModel):

 class ProjectConfigSchema(BaseModel):
     # fmt: off
-    variables: Dict[StrictStr, Union[str, int, float, bool]] = Field({}, title="Optional variables to substitute in commands")
+    vars: Dict[StrictStr, Any] = Field({}, title="Optional variables to substitute in commands")
     assets: List[ProjectConfigAsset] = Field([], title="Data assets")
     workflows: Dict[StrictStr, List[StrictStr]] = Field({}, title="Named workflows, mapped to list of project commands to run in order")
     commands: List[ProjectConfigCommand] = Field([], title="Project command shortucts")
@@ -309,3 +311,22 @@ class ProjectConfigSchema(BaseModel):

     class Config:
         title = "Schema for project configuration file"
+
+
+# Recommendations for init config workflows
+
+
+class RecommendationTrfItem(BaseModel):
+    name: str
+    size_factor: int
+
+
+class RecommendationTrf(BaseModel):
+    efficiency: RecommendationTrfItem
+    accuracy: RecommendationTrfItem
+
+
+class RecommendationSchema(BaseModel):
+    word_vectors: Optional[str] = None
+    transformer: Optional[RecommendationTrf] = None
+    has_letters: bool = True
@@ -2,7 +2,7 @@ from typing import Optional, Iterable, Dict, Any, Callable, Tuple, TYPE_CHECKING
 import numpy as np

 from .gold import Example
-from .tokens import Token, Doc
+from .tokens import Token, Doc, Span
 from .errors import Errors
 from .util import get_lang_class
 from .morphology import Morphology
@@ -250,15 +250,16 @@ class Scorer:
         examples: Iterable[Example],
         attr: str,
         *,
-        getter: Callable[[Doc, str], Any] = getattr,
+        getter: Callable[[Doc, str], Iterable[Span]] = getattr,
         **cfg,
     ) -> Dict[str, Any]:
         """Returns PRF scores for labeled spans.

         examples (Iterable[Example]): Examples to score
         attr (str): The attribute to score.
-        getter (Callable[[Doc, str], Any]): Defaults to getattr. If provided,
-            getter(doc, attr) should return the spans for the individual doc.
+        getter (Callable[[Doc, str], Iterable[Span]]): Defaults to getattr. If
+            provided, getter(doc, attr) should return the spans for the
+            individual doc.
         RETURNS (Dict[str, Any]): A dictionary containing the PRF scores under
             the keys attr_p/r/f and the per-type PRF scores under attr_per_type.

@@ -444,7 +445,7 @@ class Scorer:
         *,
         getter: Callable[[Token, str], Any] = getattr,
         head_attr: str = "head",
-        head_getter: Callable[[Token, str], Any] = getattr,
+        head_getter: Callable[[Token, str], Token] = getattr,
         ignore_labels: Tuple[str] = tuple(),
         **cfg,
     ) -> Dict[str, Any]:
@@ -458,7 +459,7 @@ class Scorer:
             individual token.
         head_attr (str): The attribute containing the head token. Defaults to
             'head'.
-        head_getter (Callable[[Token, str], Any]): Defaults to getattr. If provided,
+        head_getter (Callable[[Token, str], Token]): Defaults to getattr. If provided,
             head_getter(token, attr) should return the value of the head for an
             individual token.
         ignore_labels (Tuple): Labels to ignore while scoring (e.g., punct).
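The scorer hunks above only tighten the type annotations: the span `getter` now declares it returns `Iterable[Span]` rather than `Any`. The contract it documents can be sketched without spaCy, using `SimpleNamespace` as a stand-in for `Doc` and tuples as stand-in spans:

```python
# Sketch of the getter contract: getter(doc, attr) must yield the spans to
# score for one doc. SimpleNamespace stands in for Doc; tuples for Span.
from types import SimpleNamespace


def ents_getter(doc, attr):
    # equivalent to the default getattr-based getter
    return getattr(doc, attr)


def custom_getter(doc, attr):
    # a custom getter may filter or derive spans instead
    return [s for s in getattr(doc, attr) if s[1] == "PERSON"]


doc = SimpleNamespace(ents=[("Douglas Adams", "PERSON"), ("London", "GPE")])
assert list(ents_getter(doc, "ents"))[0] == ("Douglas Adams", "PERSON")
assert custom_getter(doc, "ents") == [("Douglas Adams", "PERSON")]
```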
@@ -1,6 +1,7 @@
+from typing import Callable, Iterable
 import pytest

-from spacy.kb import KnowledgeBase
+from spacy.kb import KnowledgeBase, get_candidates, Candidate

 from spacy import util, registry
 from spacy.gold import Example
@@ -21,8 +22,7 @@ def assert_almost_equal(a, b):

 def test_kb_valid_entities(nlp):
     """Test the valid construction of a KB with 3 entities and two aliases"""
-    mykb = KnowledgeBase(entity_vector_length=3)
-    mykb.initialize(nlp.vocab)
+    mykb = KnowledgeBase(nlp.vocab, entity_vector_length=3)

     # adding entities
     mykb.add_entity(entity="Q1", freq=19, entity_vector=[8, 4, 3])
@@ -51,8 +51,7 @@ def test_kb_valid_entities(nlp):

 def test_kb_invalid_entities(nlp):
     """Test the invalid construction of a KB with an alias linked to a non-existing entity"""
-    mykb = KnowledgeBase(entity_vector_length=1)
-    mykb.initialize(nlp.vocab)
+    mykb = KnowledgeBase(nlp.vocab, entity_vector_length=1)

     # adding entities
     mykb.add_entity(entity="Q1", freq=19, entity_vector=[1])
@@ -68,8 +67,7 @@ def test_kb_invalid_entities(nlp):

 def test_kb_invalid_probabilities(nlp):
     """Test the invalid construction of a KB with wrong prior probabilities"""
-    mykb = KnowledgeBase(entity_vector_length=1)
-    mykb.initialize(nlp.vocab)
+    mykb = KnowledgeBase(nlp.vocab, entity_vector_length=1)

     # adding entities
     mykb.add_entity(entity="Q1", freq=19, entity_vector=[1])
@@ -83,8 +81,7 @@ def test_kb_invalid_probabilities(nlp):

 def test_kb_invalid_combination(nlp):
     """Test the invalid construction of a KB with non-matching entity and probability lists"""
-    mykb = KnowledgeBase(entity_vector_length=1)
-    mykb.initialize(nlp.vocab)
+    mykb = KnowledgeBase(nlp.vocab, entity_vector_length=1)

     # adding entities
     mykb.add_entity(entity="Q1", freq=19, entity_vector=[1])
@@ -100,8 +97,7 @@ def test_kb_invalid_combination(nlp):

 def test_kb_invalid_entity_vector(nlp):
     """Test the invalid construction of a KB with non-matching entity vector lengths"""
-    mykb = KnowledgeBase(entity_vector_length=3)
-    mykb.initialize(nlp.vocab)
+    mykb = KnowledgeBase(nlp.vocab, entity_vector_length=3)

     # adding entities
     mykb.add_entity(entity="Q1", freq=19, entity_vector=[1, 2, 3])
@@ -117,14 +113,14 @@ def test_kb_default(nlp):
     assert len(entity_linker.kb) == 0
     assert entity_linker.kb.get_size_entities() == 0
     assert entity_linker.kb.get_size_aliases() == 0
-    # default value from pipeline.entity_linker
+    # 64 is the default value from pipeline.entity_linker
     assert entity_linker.kb.entity_vector_length == 64


 def test_kb_custom_length(nlp):
     """Test that the default (empty) KB can be configured with a custom entity length"""
     entity_linker = nlp.add_pipe(
|
entity_linker = nlp.add_pipe(
|
||||||
"entity_linker", config={"kb": {"entity_vector_length": 35}}
|
"entity_linker", config={"kb_loader": {"entity_vector_length": 35}}
|
||||||
)
|
)
|
||||||
assert len(entity_linker.kb) == 0
|
assert len(entity_linker.kb) == 0
|
||||||
assert entity_linker.kb.get_size_entities() == 0
|
assert entity_linker.kb.get_size_entities() == 0
|
||||||
|
@ -141,7 +137,7 @@ def test_kb_undefined(nlp):
|
||||||
|
|
||||||
def test_kb_empty(nlp):
|
def test_kb_empty(nlp):
|
||||||
"""Test that the EL can't train with an empty KB"""
|
"""Test that the EL can't train with an empty KB"""
|
||||||
config = {"kb": {"@assets": "spacy.EmptyKB.v1", "entity_vector_length": 342}}
|
config = {"kb_loader": {"@assets": "spacy.EmptyKB.v1", "entity_vector_length": 342}}
|
||||||
entity_linker = nlp.add_pipe("entity_linker", config=config)
|
entity_linker = nlp.add_pipe("entity_linker", config=config)
|
||||||
assert len(entity_linker.kb) == 0
|
assert len(entity_linker.kb) == 0
|
||||||
with pytest.raises(ValueError):
|
with pytest.raises(ValueError):
|
||||||
|
@ -150,8 +146,13 @@ def test_kb_empty(nlp):
|
||||||
|
|
||||||
def test_candidate_generation(nlp):
|
def test_candidate_generation(nlp):
|
||||||
"""Test correct candidate generation"""
|
"""Test correct candidate generation"""
|
||||||
mykb = KnowledgeBase(entity_vector_length=1)
|
mykb = KnowledgeBase(nlp.vocab, entity_vector_length=1)
|
||||||
mykb.initialize(nlp.vocab)
|
doc = nlp("douglas adam Adam shrubbery")
|
||||||
|
|
||||||
|
douglas_ent = doc[0:1]
|
||||||
|
adam_ent = doc[1:2]
|
||||||
|
Adam_ent = doc[2:3]
|
||||||
|
shrubbery_ent = doc[3:4]
|
||||||
|
|
||||||
# adding entities
|
# adding entities
|
||||||
mykb.add_entity(entity="Q1", freq=27, entity_vector=[1])
|
mykb.add_entity(entity="Q1", freq=27, entity_vector=[1])
|
||||||
|
@@ -163,21 +164,76 @@ def test_candidate_generation(nlp):
     mykb.add_alias(alias="adam", entities=["Q2"], probabilities=[0.9])
 
     # test the size of the relevant candidates
-    assert len(mykb.get_candidates("douglas")) == 2
-    assert len(mykb.get_candidates("adam")) == 1
-    assert len(mykb.get_candidates("shrubbery")) == 0
+    assert len(get_candidates(mykb, douglas_ent)) == 2
+    assert len(get_candidates(mykb, adam_ent)) == 1
+    assert len(get_candidates(mykb, Adam_ent)) == 0  # default case sensitive
+    assert len(get_candidates(mykb, shrubbery_ent)) == 0
 
     # test the content of the candidates
-    assert mykb.get_candidates("adam")[0].entity_ == "Q2"
-    assert mykb.get_candidates("adam")[0].alias_ == "adam"
-    assert_almost_equal(mykb.get_candidates("adam")[0].entity_freq, 12)
-    assert_almost_equal(mykb.get_candidates("adam")[0].prior_prob, 0.9)
+    assert get_candidates(mykb, adam_ent)[0].entity_ == "Q2"
+    assert get_candidates(mykb, adam_ent)[0].alias_ == "adam"
+    assert_almost_equal(get_candidates(mykb, adam_ent)[0].entity_freq, 12)
+    assert_almost_equal(get_candidates(mykb, adam_ent)[0].prior_prob, 0.9)
 
 
+def test_el_pipe_configuration(nlp):
+    """Test correct candidate generation as part of the EL pipe"""
+    nlp.add_pipe("sentencizer")
+    pattern = {"label": "PERSON", "pattern": [{"LOWER": "douglas"}]}
+    ruler = nlp.add_pipe("entity_ruler")
+    ruler.add_patterns([pattern])
+
+    @registry.assets.register("myAdamKB.v1")
+    def mykb() -> Callable[["Vocab"], KnowledgeBase]:
+        def create_kb(vocab):
+            kb = KnowledgeBase(vocab, entity_vector_length=1)
+            kb.add_entity(entity="Q2", freq=12, entity_vector=[2])
+            kb.add_entity(entity="Q3", freq=5, entity_vector=[3])
+            kb.add_alias(
+                alias="douglas", entities=["Q2", "Q3"], probabilities=[0.8, 0.1]
+            )
+            return kb
+
+        return create_kb
+
+    # run an EL pipe without a trained context encoder, to check the candidate generation step only
+    nlp.add_pipe(
+        "entity_linker",
+        config={"kb_loader": {"@assets": "myAdamKB.v1"}, "incl_context": False},
+    )
+    # With the default get_candidates function, matching is case-sensitive
+    text = "Douglas and douglas are not the same."
+    doc = nlp(text)
+    assert doc[0].ent_kb_id_ == "NIL"
+    assert doc[1].ent_kb_id_ == ""
+    assert doc[2].ent_kb_id_ == "Q2"
+
+    def get_lowercased_candidates(kb, span):
+        return kb.get_alias_candidates(span.text.lower())
+
+    @registry.assets.register("spacy.LowercaseCandidateGenerator.v1")
+    def create_candidates() -> Callable[[KnowledgeBase, "Span"], Iterable[Candidate]]:
+        return get_lowercased_candidates
+
+    # replace the pipe with a new one with a different candidate generator
+    nlp.replace_pipe(
+        "entity_linker",
+        "entity_linker",
+        config={
+            "kb_loader": {"@assets": "myAdamKB.v1"},
+            "incl_context": False,
+            "get_candidates": {"@assets": "spacy.LowercaseCandidateGenerator.v1"},
+        },
+    )
+    doc = nlp(text)
+    assert doc[0].ent_kb_id_ == "Q2"
+    assert doc[1].ent_kb_id_ == ""
+    assert doc[2].ent_kb_id_ == "Q2"
+
+
 def test_append_alias(nlp):
     """Test that we can append additional alias-entity pairs"""
-    mykb = KnowledgeBase(entity_vector_length=1)
-    mykb.initialize(nlp.vocab)
+    mykb = KnowledgeBase(nlp.vocab, entity_vector_length=1)
 
     # adding entities
     mykb.add_entity(entity="Q1", freq=27, entity_vector=[1])
@@ -189,26 +245,25 @@ def test_append_alias(nlp):
     mykb.add_alias(alias="adam", entities=["Q2"], probabilities=[0.9])
 
     # test the size of the relevant candidates
-    assert len(mykb.get_candidates("douglas")) == 2
+    assert len(mykb.get_alias_candidates("douglas")) == 2
 
     # append an alias
     mykb.append_alias(alias="douglas", entity="Q1", prior_prob=0.2)
 
     # test the size of the relevant candidates has been incremented
-    assert len(mykb.get_candidates("douglas")) == 3
+    assert len(mykb.get_alias_candidates("douglas")) == 3
 
     # append the same alias-entity pair again should not work (will throw a warning)
     with pytest.warns(UserWarning):
         mykb.append_alias(alias="douglas", entity="Q1", prior_prob=0.3)
 
     # test the size of the relevant candidates remained unchanged
-    assert len(mykb.get_candidates("douglas")) == 3
+    assert len(mykb.get_alias_candidates("douglas")) == 3
 
 
 def test_append_invalid_alias(nlp):
     """Test that append an alias will throw an error if prior probs are exceeding 1"""
-    mykb = KnowledgeBase(entity_vector_length=1)
-    mykb.initialize(nlp.vocab)
+    mykb = KnowledgeBase(nlp.vocab, entity_vector_length=1)
 
     # adding entities
     mykb.add_entity(entity="Q1", freq=27, entity_vector=[1])
@@ -228,9 +283,9 @@ def test_preserving_links_asdoc(nlp):
     """Test that Span.as_doc preserves the existing entity links"""
 
     @registry.assets.register("myLocationsKB.v1")
-    def dummy_kb() -> KnowledgeBase:
-        mykb = KnowledgeBase(entity_vector_length=1)
-        mykb.initialize(nlp.vocab)
+    def dummy_kb() -> Callable[["Vocab"], KnowledgeBase]:
+        def create_kb(vocab):
+            mykb = KnowledgeBase(vocab, entity_vector_length=1)
             # adding entities
             mykb.add_entity(entity="Q1", freq=19, entity_vector=[1])
             mykb.add_entity(entity="Q2", freq=8, entity_vector=[1])
@@ -239,6 +294,8 @@ def test_preserving_links_asdoc(nlp):
             mykb.add_alias(alias="Denver", entities=["Q2"], probabilities=[0.6])
             return mykb
 
+        return create_kb
+
     # set up pipeline with NER (Entity Ruler) and NEL (prior probability only, model not trained)
     nlp.add_pipe("sentencizer")
     patterns = [
@@ -247,7 +304,7 @@ def test_preserving_links_asdoc(nlp):
     ]
     ruler = nlp.add_pipe("entity_ruler")
     ruler.add_patterns(patterns)
-    el_config = {"kb": {"@assets": "myLocationsKB.v1"}, "incl_prior": False}
+    el_config = {"kb_loader": {"@assets": "myLocationsKB.v1"}, "incl_prior": False}
     el_pipe = nlp.add_pipe("entity_linker", config=el_config, last=True)
     el_pipe.begin_training(lambda: [])
     el_pipe.incl_context = False
@@ -331,12 +388,12 @@ def test_overfitting_IO():
         train_examples.append(Example.from_dict(doc, annotation))
 
     @registry.assets.register("myOverfittingKB.v1")
-    def dummy_kb() -> KnowledgeBase:
+    def dummy_kb() -> Callable[["Vocab"], KnowledgeBase]:
+        def create_kb(vocab):
            # create artificial KB - assign same prior weight to the two russ cochran's
            # Q2146908 (Russ Cochran): American golfer
            # Q7381115 (Russ Cochran): publisher
-        mykb = KnowledgeBase(entity_vector_length=3)
-        mykb.initialize(nlp.vocab)
+            mykb = KnowledgeBase(vocab, entity_vector_length=3)
            mykb.add_entity(entity="Q2146908", freq=12, entity_vector=[6, -4, 3])
            mykb.add_entity(entity="Q7381115", freq=12, entity_vector=[9, 1, -7])
            mykb.add_alias(
@@ -346,9 +403,13 @@ def test_overfitting_IO():
            )
            return mykb
 
+        return create_kb
+
     # Create the Entity Linker component and add it to the pipeline
     nlp.add_pipe(
-        "entity_linker", config={"kb": {"@assets": "myOverfittingKB.v1"}}, last=True
+        "entity_linker",
+        config={"kb_loader": {"@assets": "myOverfittingKB.v1"}},
+        last=True,
     )
 
     # train the NEL pipe
@@ -356,13 +356,13 @@ def test_language_factories_combine_score_weights(weights, expected):
 
 
 def test_language_factories_scores():
     name = "test_language_factories_scores"
-    func = lambda doc: doc
+    func = lambda nlp, name: lambda doc: doc
     weights1 = {"a1": 0.5, "a2": 0.5}
     weights2 = {"b1": 0.2, "b2": 0.7, "b3": 0.1}
-    Language.component(
+    Language.factory(
         f"{name}1", scores=list(weights1), default_score_weights=weights1, func=func,
     )
-    Language.component(
+    Language.factory(
         f"{name}2", scores=list(weights2), default_score_weights=weights2, func=func,
     )
     meta1 = Language.get_factory_meta(f"{name}1")
@@ -78,6 +78,14 @@ def test_replace_last_pipe(nlp):
     assert nlp.pipe_names == ["sentencizer", "ner"]
 
 
+def test_replace_pipe_config(nlp):
+    nlp.add_pipe("entity_linker")
+    nlp.add_pipe("sentencizer")
+    assert nlp.get_pipe("entity_linker").cfg["incl_prior"] == True
+    nlp.replace_pipe("entity_linker", "entity_linker", config={"incl_prior": False})
+    assert nlp.get_pipe("entity_linker").cfg["incl_prior"] == False
+
+
 @pytest.mark.parametrize("old_name,new_name", [("old_pipe", "new_pipe")])
 def test_rename_pipe(nlp, old_name, new_name):
     with pytest.raises(ValueError):
@@ -65,7 +65,7 @@ def test_issue4590(en_vocab):
 
 
 def test_issue4651_with_phrase_matcher_attr():
-    """Test that the EntityRuler PhraseMatcher is deserialize correctly using
+    """Test that the EntityRuler PhraseMatcher is deserialized correctly using
     the method from_disk when the EntityRuler argument phrase_matcher_attr is
     specified.
     """
@@ -87,7 +87,7 @@ def test_issue4651_with_phrase_matcher_attr():
 
 
 def test_issue4651_without_phrase_matcher_attr():
-    """Test that the EntityRuler PhraseMatcher is deserialize correctly using
+    """Test that the EntityRuler PhraseMatcher is deserialized correctly using
     the method from_disk when the EntityRuler argument phrase_matcher_attr is
     not specified.
     """
@@ -139,8 +139,7 @@ def test_issue4665():
 def test_issue4674():
     """Test that setting entities with overlapping identifiers does not mess up IO"""
     nlp = English()
-    kb = KnowledgeBase(entity_vector_length=3)
-    kb.initialize(nlp.vocab)
+    kb = KnowledgeBase(nlp.vocab, entity_vector_length=3)
     vector1 = [0.9, 1.1, 1.01]
     vector2 = [1.8, 2.25, 2.01]
     with pytest.warns(UserWarning):
@@ -156,10 +155,9 @@ def test_issue4674():
     if not dir_path.exists():
         dir_path.mkdir()
     file_path = dir_path / "kb"
-    kb.dump(str(file_path))
-    kb2 = KnowledgeBase(entity_vector_length=3)
-    kb2.initialize(nlp.vocab)
-    kb2.load_bulk(str(file_path))
+    kb.to_disk(str(file_path))
+    kb2 = KnowledgeBase(nlp.vocab, entity_vector_length=3)
+    kb2.from_disk(str(file_path))
     assert kb2.get_size_entities() == 1
@@ -1,3 +1,4 @@
+from typing import Callable
 import warnings
 from unittest import TestCase
 import pytest
@@ -70,13 +71,15 @@ def entity_linker():
     nlp = Language()
 
     @registry.assets.register("TestIssue5230KB.v1")
-    def dummy_kb() -> KnowledgeBase:
-        kb = KnowledgeBase(entity_vector_length=1)
-        kb.initialize(nlp.vocab)
-        kb.add_entity("test", 0.0, zeros((1, 1), dtype="f"))
-        return kb
+    def dummy_kb() -> Callable[["Vocab"], KnowledgeBase]:
+        def create_kb(vocab):
+            kb = KnowledgeBase(vocab, entity_vector_length=1)
+            kb.add_entity("test", 0.0, zeros((1, 1), dtype="f"))
+            return kb
+
+        return create_kb
 
-    config = {"kb": {"@assets": "TestIssue5230KB.v1"}}
+    config = {"kb_loader": {"@assets": "TestIssue5230KB.v1"}}
     entity_linker = nlp.add_pipe("entity_linker", config=config)
     # need to add model for two reasons:
     # 1. no model leads to error in serialization,
@@ -121,19 +124,17 @@ def test_writer_with_path_py35():
 
 
 def test_save_and_load_knowledge_base():
     nlp = Language()
-    kb = KnowledgeBase(entity_vector_length=1)
-    kb.initialize(nlp.vocab)
+    kb = KnowledgeBase(nlp.vocab, entity_vector_length=1)
     with make_tempdir() as d:
         path = d / "kb"
         try:
-            kb.dump(path)
+            kb.to_disk(path)
         except Exception as e:
             pytest.fail(str(e))
 
         try:
-            kb_loaded = KnowledgeBase(entity_vector_length=1)
-            kb_loaded.initialize(nlp.vocab)
-            kb_loaded.load_bulk(path)
+            kb_loaded = KnowledgeBase(nlp.vocab, entity_vector_length=1)
+            kb_loaded.from_disk(path)
         except Exception as e:
             pytest.fail(str(e))
@@ -20,11 +20,11 @@ dev = ""
 
 [training.train_corpus]
 @readers = "spacy.Corpus.v1"
-path = ${paths:train}
+path = ${paths.train}
 
 [training.dev_corpus]
 @readers = "spacy.Corpus.v1"
-path = ${paths:dev}
+path = ${paths.dev}
 
 [training.batcher]
 @batchers = "batch_by_words.v1"
@@ -57,7 +57,7 @@ factory = "tagger"
 
 [components.tagger.model.tok2vec]
 @architectures = "spacy.Tok2VecListener.v1"
-width = ${components.tok2vec.model:width}
+width = ${components.tok2vec.model.width}
 """
 
 
@@ -284,13 +284,13 @@ def test_config_overrides():
 
 
 def test_config_interpolation():
     config = Config().from_str(nlp_config_string, interpolate=False)
-    assert config["training"]["train_corpus"]["path"] == "${paths:train}"
+    assert config["training"]["train_corpus"]["path"] == "${paths.train}"
     interpolated = config.interpolate()
     assert interpolated["training"]["train_corpus"]["path"] == ""
     nlp = English.from_config(config)
-    assert nlp.config["training"]["train_corpus"]["path"] == "${paths:train}"
+    assert nlp.config["training"]["train_corpus"]["path"] == "${paths.train}"
     # Ensure that variables are preserved in nlp config
-    width = "${components.tok2vec.model:width}"
+    width = "${components.tok2vec.model.width}"
     assert config["components"]["tagger"]["model"]["tok2vec"]["width"] == width
     assert nlp.config["components"]["tagger"]["model"]["tok2vec"]["width"] == width
     interpolated2 = nlp.config.interpolate()
@@ -1,4 +1,8 @@
-from spacy.util import ensure_path
+from typing import Callable
+
+from spacy import util
+from spacy.lang.en import English
+from spacy.util import ensure_path, registry
 from spacy.kb import KnowledgeBase
 
 from ..util import make_tempdir
@@ -15,20 +19,16 @@ def test_serialize_kb_disk(en_vocab):
     if not dir_path.exists():
         dir_path.mkdir()
     file_path = dir_path / "kb"
-    kb1.dump(str(file_path))
-
-    kb2 = KnowledgeBase(entity_vector_length=3)
-    kb2.initialize(en_vocab)
-    kb2.load_bulk(str(file_path))
+    kb1.to_disk(str(file_path))
+    kb2 = KnowledgeBase(vocab=en_vocab, entity_vector_length=3)
+    kb2.from_disk(str(file_path))
 
     # final assertions
     _check_kb(kb2)
 
 
 def _get_dummy_kb(vocab):
-    kb = KnowledgeBase(entity_vector_length=3)
-    kb.initialize(vocab)
+    kb = KnowledgeBase(vocab, entity_vector_length=3)
 
     kb.add_entity(entity="Q53", freq=33, entity_vector=[0, 5, 3])
     kb.add_entity(entity="Q17", freq=2, entity_vector=[7, 1, 0])
     kb.add_entity(entity="Q007", freq=7, entity_vector=[0, 0, 7])
@@ -61,7 +61,7 @@ def _check_kb(kb):
     assert alias_string not in kb.get_alias_strings()
 
     # check candidates & probabilities
-    candidates = sorted(kb.get_candidates("double07"), key=lambda x: x.entity_)
+    candidates = sorted(kb.get_alias_candidates("double07"), key=lambda x: x.entity_)
     assert len(candidates) == 2
 
     assert candidates[0].entity_ == "Q007"
@@ -75,3 +75,47 @@ def _check_kb(kb):
     assert candidates[1].entity_vector == [7, 1, 0]
     assert candidates[1].alias_ == "double07"
     assert 0.099 < candidates[1].prior_prob < 0.101
+
+
+def test_serialize_subclassed_kb():
+    """Check that IO of a custom KB works fine as part of an EL pipe."""
+
+    class SubKnowledgeBase(KnowledgeBase):
+        def __init__(self, vocab, entity_vector_length, custom_field):
+            super().__init__(vocab, entity_vector_length)
+            self.custom_field = custom_field
+
+    @registry.assets.register("spacy.CustomKB.v1")
+    def custom_kb(
+        entity_vector_length: int, custom_field: int
+    ) -> Callable[["Vocab"], KnowledgeBase]:
+        def custom_kb_factory(vocab):
+            return SubKnowledgeBase(
+                vocab=vocab,
+                entity_vector_length=entity_vector_length,
+                custom_field=custom_field,
+            )
+
+        return custom_kb_factory
+
+    nlp = English()
+    config = {
+        "kb_loader": {
+            "@assets": "spacy.CustomKB.v1",
+            "entity_vector_length": 342,
+            "custom_field": 666,
+        }
+    }
+    entity_linker = nlp.add_pipe("entity_linker", config=config)
+    assert type(entity_linker.kb) == SubKnowledgeBase
+    assert entity_linker.kb.entity_vector_length == 342
+    assert entity_linker.kb.custom_field == 666
+
+    # Make sure the custom KB is serialized correctly
+    with make_tempdir() as tmp_dir:
+        nlp.to_disk(tmp_dir)
+        nlp2 = util.load_model_from_path(tmp_dir)
+        entity_linker2 = nlp2.get_pipe("entity_linker")
+        assert type(entity_linker2.kb) == SubKnowledgeBase
+        assert entity_linker2.kb.entity_vector_length == 342
+        assert entity_linker2.kb.custom_field == 666
@@ -2,14 +2,16 @@ import pytest
 from spacy.gold import docs_to_json, biluo_tags_from_offsets
 from spacy.gold.converters import iob2docs, conll_ner2docs, conllu2docs
 from spacy.lang.en import English
-from spacy.schemas import ProjectConfigSchema, validate
+from spacy.schemas import ProjectConfigSchema, RecommendationSchema, validate
 from spacy.cli.pretrain import make_docs
-from spacy.cli.init_config import init_config, RECOMMENDATIONS_PATH
-from spacy.cli.init_config import RecommendationSchema
+from spacy.cli.init_config import init_config, RECOMMENDATIONS
 from spacy.cli._util import validate_project_commands, parse_config_overrides
-from spacy.util import get_lang_class
+from spacy.cli._util import load_project_config, substitute_project_variables
+from thinc.config import ConfigValidationError
 import srsly
 
+from .util import make_tempdir
+
 
 def test_cli_converters_conllu2json():
     # from NorNE: https://github.com/ltgoslo/norne/blob/3d23274965f513f23aa48455b28b1878dad23c05/ud/nob/no_bokmaal-ud-dev.conllu
@@ -296,6 +298,24 @@ def test_project_config_validation2(config, n_errors):
     assert len(errors) == n_errors
 
 
+def test_project_config_interpolation():
+    variables = {"a": 10, "b": {"c": "foo", "d": True}}
+    commands = [
+        {"name": "x", "script": ["hello ${vars.a} ${vars.b.c}"]},
+        {"name": "y", "script": ["${vars.b.c} ${vars.b.d}"]},
+    ]
+    project = {"commands": commands, "vars": variables}
+    with make_tempdir() as d:
+        srsly.write_yaml(d / "project.yml", project)
+        cfg = load_project_config(d)
+    assert cfg["commands"][0]["script"][0] == "hello 10 foo"
+    assert cfg["commands"][1]["script"][0] == "foo true"
+    commands = [{"name": "x", "script": ["hello ${vars.a} ${vars.b.e}"]}]
+    project = {"commands": commands, "vars": variables}
+    with pytest.raises(ConfigValidationError):
+        substitute_project_variables(project)
+
+
 @pytest.mark.parametrize(
     "args,expected",
     [
@@ -335,7 +355,5 @@ def test_init_config(lang, pipeline, optimize):
 
 
 def test_model_recommendations():
-    recommendations = srsly.read_json(RECOMMENDATIONS_PATH)
-    for lang, data in recommendations.items():
-        assert get_lang_class(lang)
+    for lang, data in RECOMMENDATIONS.items():
         assert RecommendationSchema(**data)
@ -1,6 +1,6 @@
|
||||||
import pytest
|
import pytest
|
||||||
from spacy import displacy
|
from spacy import displacy
|
||||||
from spacy.displacy.render import DependencyRenderer
|
from spacy.displacy.render import DependencyRenderer, EntityRenderer
|
||||||
from spacy.tokens import Span
|
from spacy.tokens import Span
|
||||||
from spacy.lang.fa import Persian
|
from spacy.lang.fa import Persian
|
||||||
|
|
||||||
|
@@ -97,3 +97,17 @@ def test_displacy_render_wrapper(en_vocab):
     assert html.endswith("/div>TEST")
     # Restore
     displacy.set_render_wrapper(lambda html: html)
+
+
+def test_displacy_options_case():
+    ents = ["foo", "BAR"]
+    colors = {"FOO": "red", "bar": "green"}
+    renderer = EntityRenderer({"ents": ents, "colors": colors})
+    text = "abcd"
+    labels = ["foo", "bar", "FOO", "BAR"]
+    spans = [{"start": i, "end": i + 1, "label": labels[i]} for i in range(len(text))]
+    result = renderer.render_ents("abcde", spans, None).split("\n\n")
+    assert "red" in result[0] and "foo" in result[0]
+    assert "green" in result[1] and "bar" in result[1]
+    assert "red" in result[2] and "FOO" in result[2]
+    assert "green" in result[3] and "BAR" in result[3]
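The case-insensitive label/color matching that the new `test_displacy_options_case` asserts can be illustrated in isolation. `resolve_color` is a hypothetical helper written for this sketch, not displacy's API:

```python
def resolve_color(label, colors, default="#ddd"):
    # Normalize both the configured color keys and the queried label to
    # upper case, so "foo"/"FOO" and "bar"/"BAR" resolve consistently.
    normalized = {key.upper(): value for key, value in colors.items()}
    return normalized.get(label.upper(), default)

colors = {"FOO": "red", "bar": "green"}
print(resolve_color("foo", colors))  # red
print(resolve_color("BAR", colors))  # green
```

An unknown label falls back to a default color, mirroring how a renderer needs a sensible fallback for unconfigured entity types.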
@@ -47,9 +47,9 @@ cdef class Tokenizer:
         `infix_finditer` (callable): A function matching the signature of
             `re.compile(string).finditer` to find infixes.
         token_match (callable): A boolean function matching strings to be
-            recognised as tokens.
+            recognized as tokens.
         url_match (callable): A boolean function matching strings to be
-            recognised as tokens after considering prefixes and suffixes.
+            recognized as tokens after considering prefixes and suffixes.

     EXAMPLE:
         >>> tokenizer = Tokenizer(nlp.vocab)
@@ -102,8 +102,7 @@ cdef class Doc:

     Construction 2
         >>> from spacy.tokens import Doc
-        >>> doc = Doc(nlp.vocab, words=[u'hello', u'world', u'!'],
-        >>>     spaces=[True, False, False])
+        >>> doc = Doc(nlp.vocab, words=["hello", "world", "!"], spaces=[True, False, False])

     DOCS: https://spacy.io/api/doc
     """
@@ -1194,8 +1193,7 @@ cdef class Doc:
             retokenizer.merge(span, attributes[i])

     def to_json(self, underscore=None):
-        """Convert a Doc to JSON. The format it produces will be the new format
-        for the `spacy train` command (not implemented yet).
+        """Convert a Doc to JSON.

         underscore (list): Optional list of string names of custom doc._.
         attributes. Attribute values need to be JSON-serializable. Values will
@@ -1,5 +1,5 @@
 from typing import List, Union, Dict, Any, Optional, Iterable, Callable, Tuple
-from typing import Iterator, Type, Pattern, TYPE_CHECKING
+from typing import Iterator, Type, Pattern, Generator, TYPE_CHECKING
 from types import ModuleType
 import os
 import importlib
@@ -249,7 +249,16 @@ def load_model_from_package(
     disable: Iterable[str] = tuple(),
     config: Union[Dict[str, Any], Config] = SimpleFrozenDict(),
 ) -> "Language":
-    """Load a model from an installed package."""
+    """Load a model from an installed package.
+
+    name (str): The package name.
+    vocab (Vocab / True): Optional vocab to pass in on initialization. If True,
+        a new Vocab object will be created.
+    disable (Iterable[str]): Names of pipeline components to disable.
+    config (Dict[str, Any] / Config): Config overrides as nested dict or dict
+        keyed by section values in dot notation.
+    RETURNS (Language): The loaded nlp object.
+    """
     cls = importlib.import_module(name)
     return cls.load(vocab=vocab, disable=disable, config=config)
@@ -263,7 +272,17 @@ def load_model_from_path(
     config: Union[Dict[str, Any], Config] = SimpleFrozenDict(),
 ) -> "Language":
     """Load a model from a data directory path. Creates Language class with
-    pipeline from config.cfg and then calls from_disk() with path."""
+    pipeline from config.cfg and then calls from_disk() with path.
+
+    name (str): Package name or model path.
+    meta (Dict[str, Any]): Optional model meta.
+    vocab (Vocab / True): Optional vocab to pass in on initialization. If True,
+        a new Vocab object will be created.
+    disable (Iterable[str]): Names of pipeline components to disable.
+    config (Dict[str, Any] / Config): Config overrides as nested dict or dict
+        keyed by section values in dot notation.
+    RETURNS (Language): The loaded nlp object.
+    """
     if not model_path.exists():
         raise IOError(Errors.E052.format(path=model_path))
     if not meta:
@@ -284,6 +303,15 @@ def load_model_from_config(
 ) -> Tuple["Language", Config]:
     """Create an nlp object from a config. Expects the full config file including
     a section "nlp" containing the settings for the nlp object.
+
+    name (str): Package name or model path.
+    meta (Dict[str, Any]): Optional model meta.
+    vocab (Vocab / True): Optional vocab to pass in on initialization. If True,
+        a new Vocab object will be created.
+    disable (Iterable[str]): Names of pipeline components to disable.
+    auto_fill (bool): Whether to auto-fill config with missing defaults.
+    validate (bool): Whether to show config validation errors.
+    RETURNS (Language): The loaded nlp object.
     """
     if "nlp" not in config:
         raise ValueError(Errors.E985.format(config=config))
@@ -308,6 +336,13 @@ def load_model_from_init_py(
 ) -> "Language":
     """Helper function to use in the `load()` method of a model package's
     __init__.py.
+
+    vocab (Vocab / True): Optional vocab to pass in on initialization. If True,
+        a new Vocab object will be created.
+    disable (Iterable[str]): Names of pipeline components to disable.
+    config (Dict[str, Any] / Config): Config overrides as nested dict or dict
+        keyed by section values in dot notation.
+    RETURNS (Language): The loaded nlp object.
     """
     model_path = Path(init_file).parent
     meta = get_model_meta(model_path)
@@ -325,7 +360,14 @@ def load_config(
     overrides: Dict[str, Any] = SimpleFrozenDict(),
     interpolate: bool = False,
 ) -> Config:
-    """Load a config file. Takes care of path validation and section order."""
+    """Load a config file. Takes care of path validation and section order.
+
+    path (Union[str, Path]): Path to the config file.
+    overrides (Dict[str, Any]): Config overrides as nested dict or
+        dict keyed by section values in dot notation.
+    interpolate (bool): Whether to interpolate and resolve variables.
+    RETURNS (Config): The loaded config.
+    """
     config_path = ensure_path(path)
     if not config_path.exists() or not config_path.is_file():
         raise IOError(Errors.E053.format(path=config_path, name="config.cfg"))
@@ -337,7 +379,12 @@ def load_config(
 def load_config_from_str(
     text: str, overrides: Dict[str, Any] = SimpleFrozenDict(), interpolate: bool = False
 ):
-    """Load a full config from a string."""
+    """Load a full config from a string. Wrapper around Thinc's Config.from_str.
+
+    text (str): The string config to load.
+    interpolate (bool): Whether to interpolate and resolve variables.
+    RETURNS (Config): The loaded config.
+    """
     return Config(section_order=CONFIG_SECTION_ORDER).from_str(
         text, overrides=overrides, interpolate=interpolate,
     )
@@ -435,19 +482,18 @@ def get_base_version(version: str) -> str:
     return Version(version).base_version


-def get_model_meta(path: Union[str, Path]) -> Dict[str, Any]:
-    """Get model meta.json from a directory path and validate its contents.
+def load_meta(path: Union[str, Path]) -> Dict[str, Any]:
+    """Load a model meta.json from a path and validate its contents.

-    path (str / Path): Path to model directory.
-    RETURNS (Dict[str, Any]): The model's meta data.
+    path (Union[str, Path]): Path to meta.json.
+    RETURNS (Dict[str, Any]): The loaded meta.
     """
-    model_path = ensure_path(path)
-    if not model_path.exists():
-        raise IOError(Errors.E052.format(path=model_path))
-    meta_path = model_path / "meta.json"
-    if not meta_path.is_file():
-        raise IOError(Errors.E053.format(path=meta_path, name="meta.json"))
-    meta = srsly.read_json(meta_path)
+    path = ensure_path(path)
+    if not path.parent.exists():
+        raise IOError(Errors.E052.format(path=path.parent))
+    if not path.exists() or not path.is_file():
+        raise IOError(Errors.E053.format(path=path, name="meta.json"))
+    meta = srsly.read_json(path)
     for setting in ["lang", "name", "version"]:
         if setting not in meta or not meta[setting]:
             raise ValueError(Errors.E054.format(setting=setting))
@@ -471,6 +517,16 @@ def get_model_meta(path: Union[str, Path]) -> Dict[str, Any]:
     return meta


+def get_model_meta(path: Union[str, Path]) -> Dict[str, Any]:
+    """Get model meta.json from a directory path and validate its contents.
+
+    path (str / Path): Path to model directory.
+    RETURNS (Dict[str, Any]): The model's meta data.
+    """
+    model_path = ensure_path(path)
+    return load_meta(model_path / "meta.json")
+
+
 def is_package(name: str) -> bool:
     """Check if string maps to a package installed via pip.
@@ -554,7 +610,7 @@ def working_dir(path: Union[str, Path]) -> None:


 @contextmanager
-def make_tempdir() -> None:
+def make_tempdir() -> Generator[Path, None, None]:
     """Execute a block in a temporary directory and remove the directory and
     its contents at the end of the with block.
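The corrected `Generator[Path, None, None]` annotation reflects that the context manager yields a `Path`. A minimal self-contained equivalent, as a sketch rather than the exact spaCy implementation:

```python
import shutil
import tempfile
from contextlib import contextmanager
from pathlib import Path
from typing import Generator

@contextmanager
def make_tempdir() -> Generator[Path, None, None]:
    """Yield a temporary directory and remove it when the with-block exits."""
    d = Path(tempfile.mkdtemp())
    try:
        yield d
    finally:
        shutil.rmtree(str(d))

with make_tempdir() as d:
    (d / "meta.json").write_text("{}")
    assert (d / "meta.json").is_file()
assert not d.exists()  # cleaned up after the block
```

The `Generator[YieldType, SendType, ReturnType]` form is what `typing` expects for generator functions, which is why the old `-> None` annotation was wrong for a `@contextmanager`.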
@@ -886,6 +942,15 @@ def escape_html(text: str) -> str:
 def get_words_and_spaces(
     words: Iterable[str], text: str
 ) -> Tuple[List[str], List[bool]]:
+    """Given a list of words and a text, reconstruct the original tokens and
+    return a list of words and spaces that can be used to create a Doc. This
+    can help recover destructive tokenization that didn't preserve any
+    whitespace information.
+
+    words (Iterable[str]): The words.
+    text (str): The original text.
+    RETURNS (Tuple[List[str], List[bool]]): The words and spaces.
+    """
     if "".join("".join(words).split()) != "".join(text.split()):
         raise ValueError(Errors.E194.format(text=text, words=words))
     text_words = []
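What the documented utility computes can be sketched in a simplified form that only handles single spaces between tokens; spaCy's real version also treats other whitespace runs as standalone tokens:

```python
from typing import Iterable, List, Tuple

def words_and_spaces(words: Iterable[str], text: str) -> Tuple[List[str], List[bool]]:
    # Walk through the text, aligning each word in order and recording
    # whether a single space follows it.
    out_words: List[str] = []
    out_spaces: List[bool] = []
    pos = 0
    for word in words:
        idx = text.index(word, pos)  # ValueError here means misaligned input
        pos = idx + len(word)
        out_words.append(word)
        out_spaces.append(pos < len(text) and text[pos] == " ")
    return out_words, out_spaces

print(words_and_spaces(["hello", "world", "!"], "hello world!"))
# (['hello', 'world', '!'], [True, False, False])
```

The returned pair matches the `words`/`spaces` arguments the `Doc` constructor expects, as in the `Doc(nlp.vocab, words=..., spaces=...)` example earlier in this diff.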
@@ -75,7 +75,8 @@ import { H1, H2, H3, H4, H5, Label, InlineList, Comment } from

 Headlines are set in
 [HK Grotesk](http://cargocollective.com/hanken/HK-Grotesk-Open-Source-Font) by
-Hanken Design. All other body text and code uses the best-matching default
-system font to provide a "native" reading experience.
+Hanken Design. All other body text uses the best-matching default system font
+to provide a "native" reading experience. All code uses the
+[JetBrains Mono](https://www.jetbrains.com/lp/mono/) typeface by JetBrains.

 <Infobox title="Important note" variant="warning">
@@ -106,7 +107,7 @@ Tags are also available as standalone `<Tag />` components.

 | Argument | Example                    | Result                                    |
 | -------- | -------------------------- | ----------------------------------------- |
 | `tag`    | `{tag="method"}`           | <Tag>method</Tag>                         |
-| `new`    | `{new="2"}`                | <Tag variant="new">2</Tag>                |
+| `new`    | `{new="3"}`                | <Tag variant="new">3</Tag>                |
 | `model`  | `{model="tagger, parser"}` | <Tag variant="model">tagger, parser</Tag> |
 | `hidden` | `{hidden="true"}`          |                                           |
@@ -130,6 +131,8 @@ Special link styles are used depending on the link URL.

 - [I am a regular external link](https://explosion.ai)
 - [I am a link to the documentation](/api/doc)
+- [I am a link to an architecture](/api/architectures#HashEmbedCNN)
+- [I am a link to a model](/models/en#en_core_web_sm)
 - [I am a link to GitHub](https://github.com/explosion/spaCy)

 ### Abbreviations {#abbr}
@@ -188,18 +191,20 @@ the buttons are implemented as styled links instead of native button elements.

 <InlineList><Button to="#" variant="primary">Primary small</Button>
 <Button to="#" variant="secondary">Secondary small</Button></InlineList>

+<br />
+
 <InlineList><Button to="#" variant="primary" large>Primary large</Button>
 <Button to="#" variant="secondary" large>Secondary large</Button></InlineList>

 ## Components

-### Table
+### Table {#table}

 > #### Markdown
 >
 > ```markdown_
 > | Header 1 | Header 2 |
-> | --- | --- |
+> | -------- | -------- |
 > | Column 1 | Column 2 |
 > ```
 >
@@ -213,7 +218,7 @@ the buttons are implemented as styled links instead of native button elements.
 > ```

 Tables are used to present data and API documentation. Certain keywords can be
-used to mark a footer row with a distinct style, for example to visualise the
+used to mark a footer row with a distinct style, for example to visualize the
 return values of a documented function.

 | Header 1    | Header 2 | Header 3 | Header 4 |

@@ -224,7 +229,73 @@ return values of a documented function.
 | Column 1    | Column 2 | Column 3 | Column 4 |
 | **RETURNS** | Column 2 | Column 3 | Column 4 |

-### List
+Tables also support optional "divider" rows that are typically used to denote
+keyword-only arguments in API documentation. To turn a row into a dividing
+headline, it should only include content in its first cell, and its value should
+be italicized:
+
+> #### Markdown
+>
+> ```markdown_
+> | Header 1 | Header 2 | Header 3 |
+> | -------- | -------- | -------- |
+> | Column 1 | Column 2 | Column 3 |
+> | _Hello_  |          |          |
+> | Column 1 | Column 2 | Column 3 |
+> ```
+
+| Header 1 | Header 2 | Header 3 |
+| -------- | -------- | -------- |
+| Column 1 | Column 2 | Column 3 |
+| _Hello_  |          |          |
+| Column 1 | Column 2 | Column 3 |
+
+### Type Annotations {#type-annotations}
+
+> #### Markdown
+>
+> ```markdown_
+> ~~Model[List[Doc], Floats2d]~~
+> ```
+>
+> #### JSX
+>
+> ```markup
+> <TypeAnnotation>Model[List[Doc], Floats2d]</TypeAnnotation>
+> ```
+
+Type annotations are special inline code blocks used to describe Python types in
+the [type hints](https://docs.python.org/3/library/typing.html) format. The
+special component will split the type, apply syntax highlighting and link all
+types that specify links in `meta/type-annotations.json`. Types can link to
+internal or external documentation pages. To make it easy to represent the type
+annotations in Markdown, the rendering "hijacks" the `~~` tags that would
+typically be converted to a `<del>` element – but in this case, text surrounded
+by `~~` becomes a type annotation.
+
+- ~~Dict[str, List[Union[Doc, Span]]]~~
+- ~~Model[List[Doc], List[numpy.ndarray]]~~
+
+Type annotations support a special visual style in tables and will render as a
+separate row, under the cell text. This allows the API docs to display complex
+types without taking up too much space in the cell. The type annotation should
+always be the **last element** in the row.
+
+> #### Markdown
+>
+> ```markdown_
+> | Header 1 | Header 2               |
+> | -------- | ---------------------- |
+> | Column 1 | Column 2 ~~List[Doc]~~ |
+> ```
+
+| Name                | Description                                                                                                                                                                 |
+| ------------------- | --------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
+| `vocab`             | The shared vocabulary. ~~Vocab~~                                                                                                                                            |
+| `model`             | The Thinc [`Model`](https://thinc.ai/docs/api-model) wrapping the transformer. ~~Model[List[Doc], FullTransformerBatch]~~                                                   |
+| `annotation_setter` | Function that takes a batch of `Doc` objects and transformer outputs and can set additional annotations on the `Doc`. ~~Callable[[List[Doc], FullTransformerBatch], None]~~ |
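The `~~` hijack described above can be sketched as a simple substitution pass. This is an illustration only; the real site component additionally does syntax highlighting and type linking:

```python
import re

def render_type_annotations(markdown: str) -> str:
    # Turn ~~Type~~ spans (which Markdown would normally render as <del>)
    # into inline code elements marked as type annotations.
    return re.sub(r"~~(.+?)~~", r'<code class="type-annotation">\1</code>', markdown)

print(render_type_annotations("The shared vocabulary. ~~Vocab~~"))
# The shared vocabulary. <code class="type-annotation">Vocab</code>
```

Using a non-greedy `(.+?)` group keeps multiple annotations on one line from being merged into a single match.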

+### List {#list}

 > #### Markdown
 >
@@ -255,7 +326,7 @@ automatically.
 3. Lorem ipsum dolor
 4. consectetur adipiscing elit

-### Aside
+### Aside {#aside}

 > #### Markdown
 >

@@ -280,7 +351,7 @@ To make them easier to use in Markdown, paragraphs formatted as blockquotes will
 turn into asides by default. Level 4 headlines (with a leading `####`) will
 become aside titles.

-### Code Block
+### Code Block {#code-block}

 > #### Markdown
 >

@@ -387,7 +458,7 @@ original file is shown at the top of the widget.
 https://github.com/explosion/spaCy/tree/master/spacy/language.py
 ```

-### Infobox
+### Infobox {#infobox}

 import Infobox from 'components/infobox'

@@ -425,7 +496,7 @@ blocks.

 </Infobox>

-### Accordion
+### Accordion {#accordion}

 import Accordion from 'components/accordion'
@@ -11,9 +11,17 @@ menu:
   - ['Entity Linking', 'entitylinker']
 ---

-TODO: intro and how architectures work, link to
-[`registry`](/api/top-level#registry),
-[custom models](/usage/training#custom-models) usage etc.
+A **model architecture** is a function that wires up a
+[`Model`](https://thinc.ai/docs/api-model) instance, which you can then use in a
+pipeline component or as a layer of a larger network. This page documents
+spaCy's built-in architectures that are used for different NLP tasks. All
+trainable [built-in components](/api#architecture-pipeline) expect a `model`
+argument defined in the config and document their default architecture.
+Custom architectures can be registered using the
+[`@spacy.registry.architectures`](/api/top-level#registry) decorator and used as
+part of the [training config](/usage/training#custom-functions). Also see the
+usage documentation on
+[layers and model architectures](/usage/layers-architectures).
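The registration pattern described in the intro can be sketched with a plain-dict registry. spaCy's real registry is built on the `catalogue` package and is used via `@spacy.registry.architectures`; the names below are illustrative only:

```python
# Toy registry: map string names from a config file to builder functions.
ARCHITECTURES = {}

def register_architecture(name):
    """Decorator that stores a model-building function under a string name,
    so a config can refer to it by that name (e.g. in @architectures)."""
    def wrapper(func):
        ARCHITECTURES[name] = func
        return func
    return wrapper

@register_architecture("my_custom.Tok2Vec.v1")
def build_tok2vec(width: int, depth: int):
    # Stand-in for a function that would wire up a Thinc Model instance.
    return {"layer": "tok2vec", "width": width, "depth": depth}

model = ARCHITECTURES["my_custom.Tok2Vec.v1"](width=96, depth=4)
print(model["width"])  # 96
```

The config system then only needs the registered name and the keyword arguments, which is exactly what the `[model]` blocks in the examples below provide.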
## Tok2Vec architectures {#tok2vec-arch source="spacy/ml/models/tok2vec.py"}

@@ -33,18 +41,19 @@ TODO: intro and how architectures work, link to
 > subword_features = true
 > ```

-Build spaCy's 'standard' tok2vec layer, which uses hash embedding with subword
+Build spaCy's "standard" embedding layer, which uses hash embedding with subword
 features and a CNN with layer-normalized maxout.

-| Name                 | Type | Description |
-| -------------------- | ---- | ----------- |
-| `width`              | int  | The width of the input and output. These are required to be the same, so that residual connections can be used. Recommended values are `96`, `128` or `300`. |
-| `depth`              | int  | The number of convolutional layers to use. Recommended values are between `2` and `8`. |
-| `embed_size`         | int  | The number of rows in the hash embedding tables. This can be surprisingly small, due to the use of the hash embeddings. Recommended values are between `2000` and `10000`. |
-| `window_size`        | int  | The number of tokens on either side to concatenate during the convolutions. The receptive field of the CNN will be `depth * (window_size * 2 + 1)`, so a 4-layer network with a window size of `2` will be sensitive to 17 words at a time. Recommended value is `1`. |
-| `maxout_pieces`      | int  | The number of pieces to use in the maxout non-linearity. If `1`, the [`Mish`](https://thinc.ai/docs/api-layers#mish) non-linearity is used instead. Recommended values are `1`-`3`. |
-| `subword_features`   | bool | Whether to also embed subword features, specifically the prefix, suffix and word shape. This is recommended for alphabetic languages like English, but not if single-character tokens are used for a language such as Chinese. |
-| `pretrained_vectors` | bool | Whether to also use static vectors. |
+| Name                 | Description |
+| -------------------- | ----------- |
+| `width`              | The width of the input and output. These are required to be the same, so that residual connections can be used. Recommended values are `96`, `128` or `300`. ~~int~~ |
+| `depth`              | The number of convolutional layers to use. Recommended values are between `2` and `8`. ~~int~~ |
+| `embed_size`         | The number of rows in the hash embedding tables. This can be surprisingly small, due to the use of the hash embeddings. Recommended values are between `2000` and `10000`. ~~int~~ |
+| `window_size`        | The number of tokens on either side to concatenate during the convolutions. The receptive field of the CNN will be `depth * (window_size * 2 + 1)`, so a 4-layer network with a window size of `2` will be sensitive to 17 words at a time. Recommended value is `1`. ~~int~~ |
+| `maxout_pieces`      | The number of pieces to use in the maxout non-linearity. If `1`, the [`Mish`](https://thinc.ai/docs/api-layers#mish) non-linearity is used instead. Recommended values are `1`-`3`. ~~int~~ |
+| `subword_features`   | Whether to also embed subword features, specifically the prefix, suffix and word shape. This is recommended for alphabetic languages like English, but not if single-character tokens are used for a language such as Chinese. ~~bool~~ |
+| `pretrained_vectors` | Whether to also use static vectors. ~~bool~~ |
+| **CREATES**          | The model using the architecture. ~~Model[List[Doc], List[Floats2d]]~~ |
### spacy.Tok2Vec.v1 {#Tok2Vec}

@@ -67,10 +76,11 @@ Construct a tok2vec model out of embedding and encoding subnetworks. See the
 ["Embed, Encode, Attend, Predict"](https://explosion.ai/blog/deep-learning-formula-nlp)
 blog post for background.

-| Name     | Type                                       | Description |
-| -------- | ------------------------------------------ | ----------- |
-| `embed`  | [`Model`](https://thinc.ai/docs/api-model) | **Input:** `List[Doc]`. **Output:** `List[Floats2d]`. Embed tokens into context-independent word vector representations. For example, [CharacterEmbed](/api/architectures#CharacterEmbed) or [MultiHashEmbed](/api/architectures#MultiHashEmbed) |
-| `encode` | [`Model`](https://thinc.ai/docs/api-model) | **Input:** `List[Floats2d]`. **Output:** `List[Floats2d]`. Encode context into the embeddings, using an architecture such as a CNN, BiLSTM or transformer. For example, [MaxoutWindowEncoder](/api/architectures#MaxoutWindowEncoder). |
+| Name        | Description |
+| ----------- | ----------- |
+| `embed`     | Embed tokens into context-independent word vector representations. For example, [CharacterEmbed](/api/architectures#CharacterEmbed) or [MultiHashEmbed](/api/architectures#MultiHashEmbed). ~~Model[List[Doc], List[Floats2d]]~~ |
+| `encode`    | Encode context into the embeddings, using an architecture such as a CNN, BiLSTM or transformer. For example, [MaxoutWindowEncoder](/api/architectures#MaxoutWindowEncoder). ~~Model[List[Floats2d], List[Floats2d]]~~ |
+| **CREATES** | The model using the architecture. ~~Model[List[Doc], List[Floats2d]]~~ |
### spacy.Tok2VecListener.v1 {#Tok2VecListener}
|
### spacy.Tok2VecListener.v1 {#Tok2VecListener}
|
||||||
|
|
||||||
|
@ -92,7 +102,7 @@ blog post for background.
|
||||||
>
|
>
|
||||||
> [components.tagger.model.tok2vec]
|
> [components.tagger.model.tok2vec]
|
||||||
> @architectures = "spacy.Tok2VecListener.v1"
|
> @architectures = "spacy.Tok2VecListener.v1"
|
||||||
> width = ${components.tok2vec.model:width}
|
> width = ${components.tok2vec.model.width}
|
||||||
> ```
|
> ```
|
||||||
|
|
||||||
A listener is used as a sublayer within a component such as a
|
A listener is used as a sublayer within a component such as a
|
||||||
|
Instead of defining its own `Tok2Vec` instance, a model architecture like
[Tagger](/api/architectures#tagger) can define a listener as its `tok2vec`
argument that connects to the shared `tok2vec` component in the pipeline.

| Name        | Description |
| ----------- | ----------- |
| `width` | The width of the vectors produced by the "upstream" [`Tok2Vec`](/api/tok2vec) component. ~~int~~ |
| `upstream` | A string to identify the "upstream" `Tok2Vec` component to communicate with. The upstream name should either be the wildcard string `"*"`, or the name of the `Tok2Vec` component. You'll almost never have multiple upstream `Tok2Vec` components, so the wildcard string will almost always be fine. ~~str~~ |
| **CREATES** | The model using the architecture. ~~Model[List[Doc], List[Floats2d]]~~ |
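Putting the pieces together, a shared `tok2vec` component plus a downstream component that listens to it can be sketched as follows. This is an illustrative sketch, not a tuned recipe: the embed/encode choices and all numeric values are assumptions picked from the recommended ranges documented on this page.

```ini
[components.tok2vec]
factory = "tok2vec"

[components.tok2vec.model]
@architectures = "spacy.Tok2Vec.v1"

[components.tok2vec.model.embed]
@architectures = "spacy.MultiHashEmbed.v1"
width = 96
rows = 2000
also_embed_subwords = true
also_use_static_vectors = false

[components.tok2vec.model.encode]
@architectures = "spacy.MaxoutWindowEncoder.v1"
width = 96
window_size = 1
maxout_pieces = 3
depth = 4

[components.tagger]
factory = "tagger"

[components.tagger.model]
@architectures = "spacy.Tagger.v1"
nO = null

[components.tagger.model.tok2vec]
@architectures = "spacy.Tok2VecListener.v1"
width = ${components.tok2vec.model.encode.width}
upstream = "*"
```

Note how the listener's `width` is interpolated from the encoder's `width`, so the two stay in sync when you change one value.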
### spacy.MultiHashEmbed.v1 {#MultiHashEmbed}

Construct an embedding layer that separately embeds a number of lexical
attributes using hash embedding, concatenates the results, and passes it through
a feed-forward subnetwork to build a mixed representation. The features used are
the `NORM`, `PREFIX`, `SUFFIX` and `SHAPE`, which can have varying
definitions depending on the `Vocab` of the `Doc` object passed in. Vectors from
pretrained static vectors can also be incorporated into the concatenated
representation.

| Name                      | Description |
| ------------------------- | ----------- |
| `width` | The output width. Also used as the width of the embedding tables. Recommended values are between `64` and `300`. ~~int~~ |
| `rows` | The number of rows for the embedding tables. Can be low, due to the hashing trick. Embeddings for prefix, suffix and word shape use half as many rows. Recommended values are between `2000` and `10000`. ~~int~~ |
| `also_embed_subwords` | Whether to use the `PREFIX`, `SUFFIX` and `SHAPE` features in the embeddings. If not using these, you may need more rows in your hash embeddings, as there will be increased chance of collisions. ~~bool~~ |
| `also_use_static_vectors` | Whether to also use static word vectors. Requires a vectors table to be loaded in the [Doc](/api/doc) objects' vocab. ~~bool~~ |
| **CREATES** | The model using the architecture. ~~Model[List[Doc], List[Floats2d]]~~ |
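As a minimal illustrative config fragment for this layer (the values are picks from the recommended ranges above, not prescribed defaults):

```ini
[model]
@architectures = "spacy.MultiHashEmbed.v1"
width = 96
rows = 2000
also_embed_subwords = true
also_use_static_vectors = false
```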
### spacy.CharacterEmbed.v1 {#CharacterEmbed}

concatenated. A hash-embedded vector of the `NORM` of the word is also
concatenated on, and the result is then passed through a feed-forward network to
construct a single vector to represent the information.

| Name        | Description |
| ----------- | ----------- |
| `width` | The width of the output vector and the `NORM` hash embedding. ~~int~~ |
| `rows` | The number of rows in the `NORM` hash embedding table. ~~int~~ |
| `nM` | The dimensionality of the character embeddings. Recommended values are between `16` and `64`. ~~int~~ |
| `nC` | The number of UTF-8 bytes to embed per word. Recommended values are between `3` and `8`, although it may depend on the length of words in the language. ~~int~~ |
| **CREATES** | The model using the architecture. ~~Model[List[Doc], List[Floats2d]]~~ |
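An illustrative config fragment for this layer (values are mid-range picks from the recommendations above, not tuned defaults):

```ini
[model]
@architectures = "spacy.CharacterEmbed.v1"
width = 128
rows = 7000
nM = 64
nC = 8
```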
### spacy.MaxoutWindowEncoder.v1 {#MaxoutWindowEncoder}

Encode context using convolutions with maxout activation, layer normalization
and residual connections.

| Name            | Description |
| --------------- | ----------- |
| `width` | The input and output width. These are required to be the same, to allow residual connections. This value will be determined by the width of the inputs. Recommended values are between `64` and `300`. ~~int~~ |
| `window_size` | The number of words to concatenate around each token to construct the convolution. Recommended value is `1`. ~~int~~ |
| `maxout_pieces` | The number of maxout pieces to use. Recommended values are `2` or `3`. ~~int~~ |
| `depth` | The number of convolutional layers. Recommended value is `4`. ~~int~~ |
| **CREATES** | The model using the architecture. ~~Model[List[Floats2d], List[Floats2d]]~~ |
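For illustration, a config fragment using the recommended values from the table above:

```ini
[model]
@architectures = "spacy.MaxoutWindowEncoder.v1"
width = 96
window_size = 1
maxout_pieces = 3
depth = 4
```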
### spacy.MishWindowEncoder.v1 {#MishWindowEncoder}

Encode context using convolutions with
[`Mish`](https://thinc.ai/docs/api-layers#mish) activation, layer normalization
and residual connections.

| Name          | Description |
| ------------- | ----------- |
| `width` | The input and output width. These are required to be the same, to allow residual connections. This value will be determined by the width of the inputs. Recommended values are between `64` and `300`. ~~int~~ |
| `window_size` | The number of words to concatenate around each token to construct the convolution. Recommended value is `1`. ~~int~~ |
| `depth` | The number of convolutional layers. Recommended value is `4`. ~~int~~ |
| **CREATES** | The model using the architecture. ~~Model[List[Floats2d], List[Floats2d]]~~ |
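The config fragment mirrors the maxout encoder, minus `maxout_pieces` (values again are illustrative picks from the recommendations above):

```ini
[model]
@architectures = "spacy.MishWindowEncoder.v1"
width = 96
window_size = 1
depth = 4
```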
### spacy.TorchBiLSTMEncoder.v1 {#TorchBiLSTMEncoder}

Encode context using bidirectional LSTM layers. Requires
[PyTorch](https://pytorch.org).

| Name          | Description |
| ------------- | ----------- |
| `width` | The input and output width. These are required to be the same, to allow residual connections. This value will be determined by the width of the inputs. Recommended values are between `64` and `300`. ~~int~~ |
| `window_size` | The number of words to concatenate around each token to construct the convolution. Recommended value is `1`. ~~int~~ |
| `depth` | The number of recurrent layers. Recommended value is `4`. ~~int~~ |
| **CREATES** | The model using the architecture. ~~Model[List[Floats2d], List[Floats2d]]~~ |
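An illustrative sketch mirroring the parameter table above; PyTorch must be installed for this layer to resolve, and the values are assumptions rather than recommendations:

```ini
[model]
@architectures = "spacy.TorchBiLSTMEncoder.v1"
width = 96
window_size = 1
depth = 4
```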
### spacy.StaticVectors.v1 {#StaticVectors}

> #### Example config
>
> ```ini
> [model]
> @architectures = "spacy.StaticVectors.v1"
> nO = null
> nM = null
> dropout = 0.2
> key_attr = "ORTH"
>
> [model.init_W]
> @initializers = "glorot_uniform_init.v1"
> ```

Embed [`Doc`](/api/doc) objects with their vocab's vectors table, applying a
learned linear projection to control the dimensionality. See the documentation
on [static vectors](/usage/embeddings-transformers#static-vectors) for details.

| Name        | Description |
| ----------- | ----------- |
| `nO` | The output width of the layer, after the linear projection. ~~Optional[int]~~ |
| `nM` | The width of the static vectors. ~~Optional[int]~~ |
| `dropout` | Optional dropout rate. If set, it's applied per dimension over the whole batch. Defaults to `None`. ~~Optional[float]~~ |
| `init_W` | The [initialization function](https://thinc.ai/docs/api-initializers). Defaults to [`glorot_uniform_init`](https://thinc.ai/docs/api-initializers#glorot_uniform_init). ~~Callable[[Ops, Tuple[int, ...]], FloatsXd]~~ |
| `key_attr` | Defaults to `"ORTH"`. ~~str~~ |
| **CREATES** | The model using the architecture. ~~Model[List[Doc], Ragged]~~ |
## Transformer architectures {#transformers source="github.com/explosion/spacy-transformers/blob/master/spacy_transformers/architectures.py"}

The following architectures are provided by the package
[`spacy-transformers`](https://github.com/explosion/spacy-transformers). See the
[usage documentation](/usage/embeddings-transformers#transformers) for how to
integrate the architectures into your training config.

<Infobox variant="warning">

Note that in order to use these architectures in your config, you need to
install the [`spacy-transformers`](https://github.com/explosion/spacy-transformers)
package. See the
[installation docs](/usage/embeddings-transformers#transformers-installation)
for details and system requirements.

</Infobox>
### spacy-transformers.TransformerModel.v1 {#TransformerModel}

> #### Example config
>
> ```ini
> [model]
> @architectures = "spacy-transformers.TransformerModel.v1"
> name = "roberta-base"
> tokenizer_config = {"use_fast": true}
>
> [model.get_spans]
> @span_getters = "spacy-transformers.strided_spans.v1"
> window = 128
> stride = 96
> ```

Load and wrap a transformer model from the
[HuggingFace `transformers`](https://huggingface.co/transformers) library. You
can use any transformer that has pretrained weights and a PyTorch
implementation. The `name` variable is passed through to the underlying library,
so it can be either a string or a path. If it's a string, the pretrained weights
will be downloaded via the transformers library if they are not already
available locally.

In order to support longer documents, the
[TransformerModel](/api/architectures#TransformerModel) layer allows you to pass
in a `get_spans` function that will divide up the [`Doc`](/api/doc) objects
before passing them through the transformer. Your spans are allowed to overlap
or exclude tokens. This layer is usually used directly by the
[`Transformer`](/api/transformer) component, which allows you to share the
transformer weights across your pipeline. For a layer that's configured for use
in other components, see
[Tok2VecTransformer](/api/architectures#Tok2VecTransformer).

| Name               | Description |
| ------------------ | ----------- |
| `name` | Any model name that can be loaded by [`transformers.AutoModel`](https://huggingface.co/transformers/model_doc/auto.html#transformers.AutoModel). ~~str~~ |
| `get_spans` | Function that takes a batch of [`Doc`](/api/doc) objects and returns lists of [`Span`](/api/span) objects to process by the transformer. [See here](/api/transformer#span_getters) for built-in options and examples. ~~Callable[[List[Doc]], List[Span]]~~ |
| `tokenizer_config` | Tokenizer settings passed to [`transformers.AutoTokenizer`](https://huggingface.co/transformers/model_doc/auto.html#transformers.AutoTokenizer). ~~Dict[str, Any]~~ |
| **CREATES** | The model using the architecture. ~~Model[List[Doc], FullTransformerBatch]~~ |
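The `get_spans` strategy can be swapped out in the config. As a sketch, assuming the built-in `spacy-transformers.sent_spans.v1` span getter is available (see the [span getters](/api/transformer#span_getters) docs) and that sentence boundaries are set on your `Doc` objects, sentence-based spans could be configured as:

```ini
[model.get_spans]
@span_getters = "spacy-transformers.sent_spans.v1"
```

Sentence-based spans can be a better fit than fixed strides when a component benefits from transformer activations that don't cross sentence boundaries.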
### spacy-transformers.Tok2VecListener.v1 {#transformers-Tok2VecListener}

Transformer models operate over wordpieces, which usually don't align one-to-one
against spaCy tokens. The layer therefore requires a reduction operation in
order to calculate a single token vector given zero or more wordpiece vectors.

| Name          | Description |
| ------------- | ----------- |
| `pooling` | A reduction layer used to calculate the token vectors based on zero or more wordpiece vectors. If in doubt, mean pooling (see [`reduce_mean`](https://thinc.ai/docs/api-layers#reduce_mean)) is usually a good choice. ~~Model[Ragged, Floats2d]~~ |
| `grad_factor` | Reweight gradients from the component before passing them upstream. You can set this to `0` to "freeze" the transformer weights with respect to the component, or use it to make some components more significant than others. Leaving it at `1.0` is usually fine. ~~float~~ |
| **CREATES** | The model using the architecture. ~~Model[List[Doc], List[Floats2d]]~~ |
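A config sketch wiring the listener with mean pooling, assuming Thinc's registered `reduce_mean.v1` layer (the table above suggests mean pooling as the default choice):

```ini
[model]
@architectures = "spacy-transformers.Tok2VecListener.v1"
grad_factor = 1.0

[model.pooling]
@layers = "reduce_mean.v1"
```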
### spacy-transformers.Tok2VecTransformer.v1 {#Tok2VecTransformer}

Use a transformer as a [`Tok2Vec`](/api/tok2vec) layer directly. This does
**not** allow multiple components to share the transformer weights, and does
**not** allow the transformer to set annotations into the [`Doc`](/api/doc)
object, but it's a **simpler solution** if you only need the transformer within
one component.

| Name               | Description |
| ------------------ | ----------- |
| `get_spans` | Function that takes a batch of [`Doc`](/api/doc) objects and returns lists of [`Span`](/api/span) objects to process by the transformer. [See here](/api/transformer#span_getters) for built-in options and examples. ~~Callable[[List[Doc]], List[Span]]~~ |
| `tokenizer_config` | Tokenizer settings passed to [`transformers.AutoTokenizer`](https://huggingface.co/transformers/model_doc/auto.html#transformers.AutoTokenizer). ~~Dict[str, Any]~~ |
| `pooling` | A reduction layer used to calculate the token vectors based on zero or more wordpiece vectors. If in doubt, mean pooling (see [`reduce_mean`](https://thinc.ai/docs/api-layers#reduce_mean)) is usually a good choice. ~~Model[Ragged, Floats2d]~~ |
| `grad_factor` | Reweight gradients from the component before passing them upstream. You can set this to `0` to "freeze" the transformer weights with respect to the component, or use it to make some components more significant than others. Leaving it at `1.0` is usually fine. ~~float~~ |
| **CREATES** | The model using the architecture. ~~Model[List[Doc], List[Floats2d]]~~ |
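An illustrative standalone config for this layer, combining the parameters from the table above. The transformer name, window and stride values are assumptions for the sketch, not prescribed defaults:

```ini
[model]
@architectures = "spacy-transformers.Tok2VecTransformer.v1"
name = "roberta-base"
tokenizer_config = {"use_fast": true}
grad_factor = 1.0

[model.get_spans]
@span_getters = "spacy-transformers.strided_spans.v1"
window = 128
stride = 96

[model.pooling]
@layers = "reduce_mean.v1"
```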
## Parser & NER architectures {#parser}

### spacy.TransitionBasedParser.v1 {#TransitionBasedParser}

> #### Example config
>
> ```ini
> subword_features = true
> ```

Build a transition-based parser model. Can apply to NER or dependency parsing.
Transition-based parsing is an approach to structured prediction where the task
of predicting the structure is mapped to a series of state transitions. You
might find [this tutorial](https://explosion.ai/blog/parsing-english-in-python)
helpful for background information. The neural network state prediction model
consists of either two or three subnetworks:

- **tok2vec**: Map each token into a vector representation. This subnetwork is
  run once for each batch.
- **lower**: Construct a feature-specific vector for each `(token, feature)`
  pair. This is also run once for each batch.
- **upper** (optional): A feed-forward network that predicts scores from the
  state representation. If not present, the output from the lower model is used
  as action scores directly.

| Name                | Description |
| ------------------- | ----------- |
| `tok2vec` | Subnetwork to map tokens into vector representations. ~~Model[List[Doc], List[Floats2d]]~~ |
| `nr_feature_tokens` | The number of tokens in the context to use to construct the state vector. Valid choices are `1`, `2`, `3`, `6`, `8` and `13`. The `2`, `8` and `13` feature sets are designed for the parser, while the `3` and `6` feature sets are designed for the entity recognizer. The recommended feature sets are `3` for NER, and `8` for the dependency parser. ~~int~~ |
| `hidden_width` | The width of the hidden layer. ~~int~~ |
| `maxout_pieces` | How many pieces to use in the state prediction layer. Recommended values are `1`, `2` or `3`. If `1`, the maxout non-linearity is replaced with a [`Relu`](https://thinc.ai/docs/api-layers#relu) non-linearity if `use_upper` is `True`, and no non-linearity if `False`. ~~int~~ |
| `use_upper` | Whether to use an additional hidden layer after the state vector in order to predict the action scores. It is recommended to set this to `False` for large pretrained models such as transformers, and `True` for smaller networks. The upper layer is computed on CPU, which becomes a bottleneck on larger GPU-based models, where it's also less necessary. ~~bool~~ |
| `nO` | The number of actions the model will predict between. Usually inferred from data at the beginning of training, or loaded from disk. ~~int~~ |
| **CREATES** | The model using the architecture. ~~Model[List[Doc], List[List[Floats2d]]]~~ |
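As a sketch, an NER model following the recommendations in the table above (`nr_feature_tokens = 3` for NER); the tok2vec sublayer here connects to a shared component via the listener, and the numeric values are illustrative assumptions:

```ini
[model]
@architectures = "spacy.TransitionBasedParser.v1"
nr_feature_tokens = 3
hidden_width = 64
maxout_pieces = 2
use_upper = true
nO = null

[model.tok2vec]
@architectures = "spacy.Tok2VecListener.v1"
width = 96
upstream = "*"
```

For a dependency parser, the recommended settings swap to `nr_feature_tokens = 8`; for large transformer-backed pipelines, `use_upper = false` is advised in the table above.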
### spacy.BILUOTagger.v1 {#BILUOTagger source="spacy/ml/models/simple_ner.py"}

generally results in better linear separation between classes, especially for
non-CRF models, because there are more distinct classes for the different
situations ([Ratinov et al., 2009](https://www.aclweb.org/anthology/W09-1119/)).

| Name        | Description |
| ----------- | ----------- |
| `tok2vec` | Subnetwork to map tokens into vector representations. ~~Model[List[Doc], List[Floats2d]]~~ |
| **CREATES** | The model using the architecture. ~~Model[List[Doc], List[Floats2d]]~~ |
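A sketch of a BILUO tagger with an inline token-to-vector subnetwork, built from the embedding and encoding architectures documented above (all values are illustrative assumptions):

```ini
[model]
@architectures = "spacy.BILUOTagger.v1"

[model.tok2vec]
@architectures = "spacy.Tok2Vec.v1"

[model.tok2vec.embed]
@architectures = "spacy.CharacterEmbed.v1"
width = 128
rows = 7000
nM = 64
nC = 8

[model.tok2vec.encode]
@architectures = "spacy.MaxoutWindowEncoder.v1"
width = 128
window_size = 1
maxout_pieces = 3
depth = 4
```

The embed and encode widths must agree, since the encoder's input width is determined by the embedding output.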
### spacy.IOBTagger.v1 {#IOBTagger source="spacy/ml/models/simple_ner.py"}

spans into tags assigned to each token. The first token of a span is given the
tag B-LABEL, and subsequent tokens are given the tag I-LABEL. All other tokens
are assigned the tag O.

| Name        | Description |
| ----------- | ----------- |
| `tok2vec` | Subnetwork to map tokens into vector representations. ~~Model[List[Doc], List[Floats2d]]~~ |
| **CREATES** | The model using the architecture. ~~Model[List[Doc], List[Floats2d]]~~ |
## Tagging architectures {#tagger source="spacy/ml/models/tagger.py"}
|
## Tagging architectures {#tagger source="spacy/ml/models/tagger.py"}
|
||||||
|
|
||||||
|
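The BILUO and IOB tagging schemes changed in the two hunks above can be illustrated with a short, self-contained sketch. This is plain Python, independent of spaCy; the helper names and the `(start, end, label)` span format are our own, chosen for illustration:

```python
def iob_tags(tokens, spans):
    """Encode (start, end, label) spans as IOB tags: the first token of a
    span gets B-LABEL, subsequent tokens I-LABEL, all others O."""
    tags = ["O"] * len(tokens)
    for start, end, label in spans:
        tags[start] = f"B-{label}"
        for i in range(start + 1, end):
            tags[i] = f"I-{label}"
    return tags


def biluo_tags(tokens, spans):
    """Encode spans as BILUO tags: single-token spans get U-LABEL, longer
    spans end with L-LABEL, giving more distinct classes than IOB."""
    tags = ["O"] * len(tokens)
    for start, end, label in spans:
        if end - start == 1:
            tags[start] = f"U-{label}"
        else:
            tags[start] = f"B-{label}"
            for i in range(start + 1, end - 1):
                tags[i] = f"I-{label}"
            tags[end - 1] = f"L-{label}"
    return tags


print(iob_tags(["Apple", "Inc", "rose"], [(0, 2, "ORG")]))    # ['B-ORG', 'I-ORG', 'O']
print(biluo_tags(["Apple", "Inc", "rose"], [(0, 2, "ORG")]))  # ['B-ORG', 'L-ORG', 'O']
```

The extra L and U classes are what the docs mean by "better linear separation": a two-token entity and a one-token entity no longer share the same B tag pattern.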
@@ -450,10 +527,11 @@ Build a tagger model, using a provided token-to-vector component. The tagger
 model simply adds a linear layer with softmax activation to predict scores given
 the token vectors.
 
-| Name | Type | Description |
-| --------- | ------------------------------------------ | ----------------------------------------------------------------------------------------------------------- |
-| `tok2vec` | [`Model`](https://thinc.ai/docs/api-model) | **Input:** `List[Doc]`. **Output:** `List[Floats2d]`. Subnetwork to map tokens into vector representations. |
-| `nO` | int | The number of tags to output. Inferred from the data if `None`. |
+| Name | Description |
+| ----------- | ------------------------------------------------------------------------------------------ |
+| `tok2vec` | Subnetwork to map tokens into vector representations. ~~Model[List[Doc], List[Floats2d]]~~ |
+| `nO` | The number of tags to output. Inferred from the data if `None`. ~~Optional[int]~~ |
+| **CREATES** | The model using the architecture. ~~Model[List[Doc], List[Floats2d]]~~ |
 
 ## Text classification architectures {#textcat source="spacy/ml/models/textcat.py"}
 
@@ -470,9 +548,6 @@ specific data and challenge.
 
 ### spacy.TextCatEnsemble.v1 {#TextCatEnsemble}
 
-Stacked ensemble of a bag-of-words model and a neural network model. The neural
-network has an internal CNN Tok2Vec layer and uses attention.
-
 > #### Example Config
 >
 > ```ini
@@ -489,18 +564,21 @@ network has an internal CNN Tok2Vec layer and uses attention.
 > nO = null
 > ```
 
-| Name | Type | Description |
-| --------------------------- | ----- | -------------------------------------------------------------------------------------------------------------------------------------------------------- |
-| `exclusive_classes` | bool | Whether or not categories are mutually exclusive. |
-| `pretrained_vectors` | bool | Whether or not pretrained vectors will be used in addition to the feature vectors. |
-| `width` | int | Output dimension of the feature encoding step. |
-| `embed_size` | int | Input dimension of the feature encoding step. |
-| `conv_depth` | int | Depth of the Tok2Vec layer. |
-| `window_size` | int | The number of contextual vectors to [concatenate](https://thinc.ai/docs/api-layers#expand_window) from the left and from the right. |
-| `ngram_size` | int | Determines the maximum length of the n-grams in the BOW model. For instance, `ngram_size=3`would give unigram, trigram and bigram features. |
-| `dropout` | float | The dropout rate. |
-| `nO` | int | Output dimension, determined by the number of different labels. If not set, the the [`TextCategorizer`](/api/textcategorizer) component will set it when |
-| `begin_training` is called. |
+Stacked ensemble of a bag-of-words model and a neural network model. The neural
+network has an internal CNN Tok2Vec layer and uses attention.
+
+| Name | Description |
+| -------------------- | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
+| `exclusive_classes` | Whether or not categories are mutually exclusive. ~~bool~~ |
+| `pretrained_vectors` | Whether or not pretrained vectors will be used in addition to the feature vectors. ~~bool~~ |
+| `width` | Output dimension of the feature encoding step. ~~int~~ |
+| `embed_size` | Input dimension of the feature encoding step. ~~int~~ |
+| `conv_depth` | Depth of the tok2vec layer. ~~int~~ |
+| `window_size` | The number of contextual vectors to [concatenate](https://thinc.ai/docs/api-layers#expand_window) from the left and from the right. ~~int~~ |
+| `ngram_size` | Determines the maximum length of the n-grams in the BOW model. For instance, `ngram_size=3` would give unigram, trigram and bigram features. ~~int~~ |
+| `dropout` | The dropout rate. ~~float~~ |
+| `nO` | Output dimension, determined by the number of different labels. If not set, the [`TextCategorizer`](/api/textcategorizer) component will set it when `begin_training` is called. ~~Optional[int]~~ |
+| **CREATES** | The model using the architecture. ~~Model[List[Doc], Floats2d]~~ |
 
 ### spacy.TextCatCNN.v1 {#TextCatCNN}
 
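The `window_size` setting in the ensemble table above concatenates each token vector with its neighbours on both sides. A rough pure-Python sketch of that expansion, using plain lists and zero padding at the edges (this mirrors the idea of thinc's `expand_window` layer, not its actual implementation):

```python
def expand_window(vectors, window_size=1):
    """Concatenate each token vector with its window_size neighbours on
    each side, padding with zero vectors at the sequence boundaries."""
    width = len(vectors[0])
    pad = [0.0] * width
    out = []
    for i in range(len(vectors)):
        row = []
        for j in range(i - window_size, i + window_size + 1):
            # Out-of-range positions contribute a zero vector.
            row.extend(vectors[j] if 0 <= j < len(vectors) else pad)
        out.append(row)
    return out


print(expand_window([[1.0], [2.0], [3.0]]))
# [[0.0, 1.0, 2.0], [1.0, 2.0, 3.0], [2.0, 3.0, 0.0]]
```

Each output row is `(2 * window_size + 1) * width` wide, which is why widening the window grows the downstream layer's input dimension.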
@@ -527,11 +605,12 @@ A neural network model where token vectors are calculated using a CNN. The
 vectors are mean pooled and used as features in a feed-forward network. This
 architecture is usually less accurate than the ensemble, but runs faster.
 
-| Name | Type | Description |
-| ------------------- | ------------------------------------------ | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ |
-| `exclusive_classes` | bool | Whether or not categories are mutually exclusive. |
-| `tok2vec` | [`Model`](https://thinc.ai/docs/api-model) | The [`tok2vec`](#tok2vec) layer of the model. |
-| `nO` | int | Output dimension, determined by the number of different labels. If not set, the the [`TextCategorizer`](/api/textcategorizer) component will set it when `begin_training` is called. |
+| Name | Description |
+| ------------------- | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
+| `exclusive_classes` | Whether or not categories are mutually exclusive. ~~bool~~ |
+| `tok2vec` | The [`tok2vec`](#tok2vec) layer of the model. ~~Model~~ |
+| `nO` | Output dimension, determined by the number of different labels. If not set, the [`TextCategorizer`](/api/textcategorizer) component will set it when `begin_training` is called. ~~Optional[int]~~ |
+| **CREATES** | The model using the architecture. ~~Model[List[Doc], Floats2d]~~ |
 
 ### spacy.TextCatBOW.v1 {#TextCatBOW}
 
@@ -549,12 +628,13 @@ architecture is usually less accurate than the ensemble, but runs faster.
 An ngram "bag-of-words" model. This architecture should run much faster than the
 others, but may not be as accurate, especially if texts are short.
 
-| Name | Type | Description |
-| ------------------- | ----- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ |
-| `exclusive_classes` | bool | Whether or not categories are mutually exclusive. |
-| `ngram_size` | int | Determines the maximum length of the n-grams in the BOW model. For instance, `ngram_size=3`would give unigram, trigram and bigram features. |
-| `no_output_layer` | float | Whether or not to add an output layer to the model (`Softmax` activation if `exclusive_classes=True`, else `Logistic`. |
-| `nO` | int | Output dimension, determined by the number of different labels. If not set, the the [`TextCategorizer`](/api/textcategorizer) component will set it when `begin_training` is called. |
+| Name | Description |
+| ------------------- | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
+| `exclusive_classes` | Whether or not categories are mutually exclusive. ~~bool~~ |
+| `ngram_size` | Determines the maximum length of the n-grams in the BOW model. For instance, `ngram_size=3` would give unigram, trigram and bigram features. ~~int~~ |
+| `no_output_layer` | Whether or not to add an output layer to the model (`Softmax` activation if `exclusive_classes` is `True`, else `Logistic`). ~~bool~~ |
+| `nO` | Output dimension, determined by the number of different labels. If not set, the [`TextCategorizer`](/api/textcategorizer) component will set it when `begin_training` is called. ~~Optional[int]~~ |
+| **CREATES** | The model using the architecture. ~~Model[List[Doc], Floats2d]~~ |
 
 ## Entity linking architectures {#entitylinker source="spacy/ml/models/entity_linker.py"}
 
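The `ngram_size` behaviour described for `TextCatBOW` above can be sketched in a few lines. This is an illustration of what "all n-grams up to the maximum length" means, not spaCy's actual feature extraction:

```python
def ngram_features(tokens, ngram_size):
    """Collect all n-grams up to ngram_size, so ngram_size=3 yields
    unigram, bigram and trigram features for the bag-of-words model."""
    feats = []
    for n in range(1, ngram_size + 1):
        for i in range(len(tokens) - n + 1):
            feats.append(" ".join(tokens[i:i + n]))
    return feats


print(ngram_features(["very", "good", "film"], 2))
# ['very', 'good', 'film', 'very good', 'good film']
```

Because every n-gram becomes an independent feature, longer `ngram_size` values grow the feature space quickly, which is part of why this model trades accuracy for speed on short texts.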
@@ -571,9 +651,6 @@ into the "real world". This requires 3 main components:
 
 ### spacy.EntityLinker.v1 {#EntityLinker}
 
-The `EntityLinker` model architecture is a `Thinc` `Model` with a Linear output
-layer.
-
 > #### Example Config
 >
 > ```ini
@@ -599,27 +676,28 @@ layer.
 > @assets = "spacy.CandidateGenerator.v1"
 > ```
 
-| Name | Type | Description |
-| --------- | ------------------------------------------ | ---------------------------------------------------------------------------------------- |
-| `tok2vec` | [`Model`](https://thinc.ai/docs/api-model) | The [`tok2vec`](#tok2vec) layer of the model. |
-| `nO` | int | Output dimension, determined by the length of the vectors encoding each entity in the KB |
+The `EntityLinker` model architecture is a Thinc `Model` with a
+[`Linear`](https://thinc.ai/api-layers#linear) output layer.
 
-If the `nO` dimension is not set, the Entity Linking component will set it when
-`begin_training` is called.
+| Name | Description |
+| ----------- | ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
+| `tok2vec` | The [`tok2vec`](#tok2vec) layer of the model. ~~Model~~ |
+| `nO` | Output dimension, determined by the length of the vectors encoding each entity in the KB. If the `nO` dimension is not set, the entity linking component will set it when `begin_training` is called. ~~Optional[int]~~ |
+| **CREATES** | The model using the architecture. ~~Model[List[Doc], Floats2d]~~ |
 
 ### spacy.EmptyKB.v1 {#EmptyKB}
 
 A function that creates a default, empty `KnowledgeBase` from a
 [`Vocab`](/api/vocab) instance.
 
-| Name | Type | Description |
-| ---------------------- | ---- | ------------------------------------------------------------------------- |
-| `entity_vector_length` | int | The length of the vectors encoding each entity in the KB - 64 by default. |
+| Name | Description |
+| ---------------------- | ----------------------------------------------------------------------------------- |
+| `entity_vector_length` | The length of the vectors encoding each entity in the KB. Defaults to `64`. ~~int~~ |
 
 ### spacy.CandidateGenerator.v1 {#CandidateGenerator}
 
 A function that takes as input a [`KnowledgeBase`](/api/kb) and a
 [`Span`](/api/span) object denoting a named entity, and returns a list of
-plausible [`Candidate` objects](/api/kb/#candidate_init). The default
+plausible [`Candidate`](/api/kb/#candidate) objects. The default
 `CandidateGenerator` simply uses the text of a mention to find its potential
 aliases in the `KnowledgeBase`. Note that this function is case-dependent.
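The default candidate generation described above amounts to a case-sensitive lookup of the mention text among the knowledge base's aliases. A minimal sketch, where the alias index is a plain dict and the KB IDs are invented for illustration:

```python
def get_candidates(alias_index, mention_text):
    """Sketch of the default CandidateGenerator: look the mention's surface
    text up among the KB aliases. The lookup is case-dependent by design."""
    return alias_index.get(mention_text, [])


# Toy alias index mapping surface strings to candidate KB identifiers.
aliases = {"Douglas Adams": ["Q42"], "Adams": ["Q42", "Q350"]}
print(get_candidates(aliases, "Adams"))  # ['Q42', 'Q350']
print(get_candidates(aliases, "adams"))  # [] - case matters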
@@ -31,10 +31,10 @@ how the component should be configured. You can override its settings via the
 > nlp.add_pipe("attribute_ruler", config=config)
 > ```
 
-| Setting | Type | Description | Default |
-| --------------- | ---------------- | --------------------------------------------------------------------------------------------------------------------------------------- | ------- |
-| `pattern_dicts` | `Iterable[dict]` | A list of pattern dicts with the keys as the arguments to [`AttributeRuler.add`](#add) (`patterns`/`attrs`/`index`) to add as patterns. | `None` |
-| `validate` | bool | Whether patterns should be validated (passed to the `Matcher`). | `False` |
+| Setting | Description |
+| --------------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
+| `pattern_dicts` | A list of pattern dicts with the keys as the arguments to [`AttributeRuler.add`](/api/attributeruler#add) (`patterns`/`attrs`/`index`) to add as patterns. Defaults to `None`. ~~Optional[Iterable[Dict[str, Union[List[dict], dict, int]]]]~~ |
+| `validate` | Whether patterns should be validated (passed to the `Matcher`). Defaults to `False`. ~~bool~~ |
 
 ```python
 https://github.com/explosion/spaCy/blob/develop/spacy/pipeline/attributeruler.py
@@ -47,10 +47,10 @@ be a list of dictionaries with `"patterns"`, `"attrs"`, and optional `"index"`
 keys, e.g.:
 
 ```python
-pattern_dicts = \[
-    {"patterns": \[\[{"TAG": "VB"}\]\], "attrs": {"POS": "VERB"}},
-    {"patterns": \[\[{"LOWER": "an"}\]\], "attrs": {"LEMMA": "a"}},
-\]
+pattern_dicts = [
+    {"patterns": [[{"TAG": "VB"}]], "attrs": {"POS": "VERB"}},
+    {"patterns": [[{"LOWER": "an"}]], "attrs": {"LEMMA": "a"}},
+]
 ```
 
 > #### Example
@@ -60,23 +60,23 @@ pattern_dicts = \[
 > attribute_ruler = nlp.add_pipe("attribute_ruler")
 > ```
 
-| Name | Type | Description |
-| --------------- | ----------------- | ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
-| `vocab` | `Vocab` | The shared nlp object to pass the vocab to the matchers and process phrase patterns. |
-| `name` | str | Instance name of the current pipeline component. Typically passed in automatically from the factory when the component is added. Used to disable the current entity ruler while creating phrase patterns with the nlp object. |
-| _keyword-only_ | | |
-| `pattern_dicts` | `Iterable[Dict]]` | Optional patterns to load in on initialization. Defaults to `None`. |
-| `validate` | bool | Whether patterns should be validated (passed to the `Matcher`). Defaults to `False`. |
+| Name | Description |
+| --------------- | ---------------------------------------------------------------------------------------------------------------------------------------- |
+| `vocab` | The shared vocabulary to pass to the matcher. ~~Vocab~~ |
+| `name` | Instance name of the current pipeline component. Typically passed in automatically from the factory when the component is added. ~~str~~ |
+| _keyword-only_ | |
+| `pattern_dicts` | Optional patterns to load in on initialization. Defaults to `None`. ~~Optional[Iterable[Dict[str, Union[List[dict], dict, int]]]]~~ |
+| `validate` | Whether patterns should be validated (passed to the [`Matcher`](/api/matcher#init)). Defaults to `False`. ~~bool~~ |
 
 ## AttributeRuler.\_\_call\_\_ {#call tag="method"}
 
 Apply the attribute ruler to a Doc, setting token attributes for tokens matched
 by the provided patterns.
 
-| Name | Type | Description |
-| ----------- | ----- | ------------------------------------------------------------ |
-| `doc` | `Doc` | The `Doc` object to process, e.g. the `Doc` in the pipeline. |
-| **RETURNS** | `Doc` | The modified `Doc` with added entities, if available. |
+| Name | Description |
+| ----------- | -------------------------------- |
+| `doc` | The document to process. ~~Doc~~ |
+| **RETURNS** | The processed document. ~~Doc~~ |
 
 ## AttributeRuler.add {#add tag="method"}
 
@@ -95,11 +95,11 @@ may be negative to index from the end of the span.
 > attribute_ruler.add(patterns=patterns, attrs=attrs)
 > ```
 
-| Name | Type | Description |
-| -------- | ---------------------- | ----------------------------------------------------------------------------------------------------------------------- |
-| patterns | `Iterable[List[Dict]]` | A list of Matcher patterns. |
-| attrs | dict | The attributes to assign to the target token in the matched span. |
-| index | int | The index of the token in the matched span to modify. May be negative to index from the end of the span. Defaults to 0. |
+| Name | Description |
+| ---------- | --------------------------------------------------------------------------------------------------------------------------------- |
+| `patterns` | The `Matcher` patterns to add. ~~Iterable[List[Dict[Union[int, str], Any]]]~~ |
+| `attrs` | The attributes to assign to the target token in the matched span. ~~Dict[str, Any]~~ |
+| `index` | The index of the token in the matched span to modify. May be negative to index from the end of the span. Defaults to `0`. ~~int~~ |
 
 ## AttributeRuler.add_patterns {#add_patterns tag="method"}
 
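The `index` argument documented in the hunk above selects which token of a matched span receives the attributes, with negative values counting from the span's end. A sketch of that selection logic in plain Python, using token dicts rather than spaCy objects (the helper name and data shapes are our own):

```python
def assign_to_span(span_tokens, attrs, index=0):
    """Return a copy of the span's token dicts with attrs merged into the
    token at index; negative indices count from the end of the span."""
    tokens = [dict(tok) for tok in span_tokens]
    tokens[index].update(attrs)  # Python indexing handles negatives natively
    return tokens


span = [{"text": "two"}, {"text": "apples"}]
print(assign_to_span(span, {"LEMMA": "apple"}, index=-1))
# [{'text': 'two'}, {'text': 'apples', 'LEMMA': 'apple'}]
```

This mirrors the "two apples" pattern in the `add_patterns` example below it in the file, where `index=-1` targets "apples" rather than "two".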
@@ -107,52 +107,52 @@ may be negative to index from the end of the span.
 >
 > ```python
 > attribute_ruler = nlp.add_pipe("attribute_ruler")
-> pattern_dicts = \[
+> pattern_dicts = [
 >     {
->         "patterns": \[\[{"TAG": "VB"}\]\],
+>         "patterns": [[{"TAG": "VB"}]],
 >         "attrs": {"POS": "VERB"}
 >     },
 >     {
->         "patterns": \[\[{"LOWER": "two"}, {"LOWER": "apples"}\]\],
+>         "patterns": [[{"LOWER": "two"}, {"LOWER": "apples"}]],
 >         "attrs": {"LEMMA": "apple"},
 >         "index": -1
 >     },
-> \]
+> ]
 > attribute_ruler.add_patterns(pattern_dicts)
 > ```
 
 Add patterns from a list of pattern dicts with the keys as the arguments to
-[`AttributeRuler.add`](#add).
+[`AttributeRuler.add`](/api/attributeruler#add).
 
-| Name | Type | Description |
-| --------------- | ----------------- | -------------------- |
-| `pattern_dicts` | `Iterable[Dict]]` | The patterns to add. |
+| Name | Description |
+| --------------- | -------------------------------------------------------------------------- |
+| `pattern_dicts` | The patterns to add. ~~Iterable[Dict[str, Union[List[dict], dict, int]]]~~ |
 
 ## AttributeRuler.patterns {#patterns tag="property"}
 
 Get all patterns that have been added to the attribute ruler in the
 `patterns_dict` format accepted by
-[`AttributeRuler.add_patterns`](#add_patterns).
+[`AttributeRuler.add_patterns`](/api/attributeruler#add_patterns).
 
-| Name | Type | Description |
-| ----------- | ------------ | ------------------------------------------ |
-| **RETURNS** | `List[dict]` | The patterns added to the attribute ruler. |
+| Name | Description |
+| ----------- | -------------------------------------------------------------------------------------------- |
+| **RETURNS** | The patterns added to the attribute ruler. ~~List[Dict[str, Union[List[dict], dict, int]]]~~ |
 
 ## AttributeRuler.load_from_tag_map {#load_from_tag_map tag="method"}
 
 Load attribute ruler patterns from a tag map.
 
-| Name | Type | Description |
-| --------- | ---- | ------------------------------------------------------------------------------------------ |
-| `tag_map` | dict | The tag map that maps fine-grained tags to coarse-grained tags and morphological features. |
+| Name | Description |
+| --------- | ------------------------------------------------------------------------------------------------------------------------------------------------ |
+| `tag_map` | The tag map that maps fine-grained tags to coarse-grained tags and morphological features. ~~Dict[str, Dict[Union[int, str], Union[int, str]]]~~ |
 
 ## AttributeRuler.load_from_morph_rules {#load_from_morph_rules tag="method"}
 
 Load attribute ruler patterns from morph rules.
 
-| Name | Type | Description |
-| ------------- | ---- | -------------------------------------------------------------------------------------------------------------------- |
-| `morph_rules` | dict | The morph rules that map token text and fine-grained tags to coarse-grained tags, lemmas and morphological features. |
+| Name | Description |
+| ------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
+| `morph_rules` | The morph rules that map token text and fine-grained tags to coarse-grained tags, lemmas and morphological features. ~~Dict[str, Dict[str, Dict[Union[int, str], Union[int, str]]]]~~ |
 
 ## AttributeRuler.to_disk {#to_disk tag="method"}
 
@@ -165,11 +165,11 @@ Serialize the pipe to disk.
 > attribute_ruler.to_disk("/path/to/attribute_ruler")
 > ```
 
-| Name | Type | Description |
-| -------------- | --------------- | --------------------------------------------------------------------------------------------------------------------- |
-| `path` | str / `Path` | A path to a directory, which will be created if it doesn't exist. Paths may be either strings or `Path`-like objects. |
-| _keyword-only_ | | |
-| `exclude` | `Iterable[str]` | String names of [serialization fields](#serialization-fields) to exclude. |
+| Name | Description |
+| -------------- | ------------------------------------------------------------------------------------------------------------------------------------------ |
+| `path` | A path to a directory, which will be created if it doesn't exist. Paths may be either strings or `Path`-like objects. ~~Union[str, Path]~~ |
+| _keyword-only_ | |
+| `exclude` | String names of [serialization fields](#serialization-fields) to exclude. ~~Iterable[str]~~ |
 
 ## AttributeRuler.from_disk {#from_disk tag="method"}
 
@@ -182,12 +182,12 @@ Load the pipe from disk. Modifies the object in place and returns it.
 > attribute_ruler.from_disk("/path/to/attribute_ruler")
 > ```
 
-| Name | Type | Description |
-| -------------- | ---------------- | -------------------------------------------------------------------------- |
-| `path` | str / `Path` | A path to a directory. Paths may be either strings or `Path`-like objects. |
-| _keyword-only_ | | |
-| `exclude` | `Iterable[str]` | String names of [serialization fields](#serialization-fields) to exclude. |
-| **RETURNS** | `AttributeRuler` | The modified `AttributeRuler` object. |
+| Name | Description |
+| -------------- | ----------------------------------------------------------------------------------------------- |
+| `path` | A path to a directory. Paths may be either strings or `Path`-like objects. ~~Union[str, Path]~~ |
+| _keyword-only_ | |
+| `exclude` | String names of [serialization fields](#serialization-fields) to exclude. ~~Iterable[str]~~ |
+| **RETURNS** | The modified `AttributeRuler` object. ~~AttributeRuler~~ |
 
 ## AttributeRuler.to_bytes {#to_bytes tag="method"}
 
@@ -200,11 +200,11 @@ Load the pipe from disk. Modifies the object in place and returns it.
 
 Serialize the pipe to a bytestring.
 
-| Name | Type | Description |
-| -------------- | --------------- | ------------------------------------------------------------------------- |
-| _keyword-only_ | | |
-| `exclude` | `Iterable[str]` | String names of [serialization fields](#serialization-fields) to exclude. |
-| **RETURNS** | bytes | The serialized form of the `AttributeRuler` object. |
+| Name | Description |
+| -------------- | ------------------------------------------------------------------------------------------- |
+| _keyword-only_ | |
+| `exclude` | String names of [serialization fields](#serialization-fields) to exclude. ~~Iterable[str]~~ |
+| **RETURNS** | The serialized form of the `AttributeRuler` object. ~~bytes~~ |
 
 ## AttributeRuler.from_bytes {#from_bytes tag="method"}
 
@@ -218,12 +218,12 @@ Load the pipe from a bytestring. Modifies the object in place and returns it.
 > attribute_ruler.from_bytes(attribute_ruler_bytes)
 > ```
 
-| Name | Type | Description |
-| -------------- | ---------------- | ------------------------------------------------------------------------- |
-| `bytes_data` | bytes | The data to load from. |
-| _keyword-only_ | | |
-| `exclude` | `Iterable[str]` | String names of [serialization fields](#serialization-fields) to exclude. |
-| **RETURNS** | `AttributeRuler` | The `AttributeRuler` object. |
+| Name | Description |
+| -------------- | ------------------------------------------------------------------------------------------- |
+| `bytes_data` | The data to load from. ~~bytes~~ |
+| _keyword-only_ | |
+| `exclude` | String names of [serialization fields](#serialization-fields) to exclude. ~~Iterable[str]~~ |
+| **RETURNS** | The `AttributeRuler` object. ~~AttributeRuler~~ |
 
 ## Serialization fields {#serialization-fields}
 
@@ -3,17 +3,17 @@ title: Command Line Interface
 teaser: Download, train and package models, and debug spaCy
 source: spacy/cli
 menu:
-  - ['Download', 'download']
-  - ['Info', 'info']
-  - ['Validate', 'validate']
-  - ['Init', 'init']
-  - ['Convert', 'convert']
-  - ['Debug', 'debug']
-  - ['Train', 'train']
-  - ['Pretrain', 'pretrain']
-  - ['Evaluate', 'evaluate']
-  - ['Package', 'package']
-  - ['Project', 'project']
+  - ['download', 'download']
+  - ['info', 'info']
+  - ['validate', 'validate']
+  - ['init', 'init']
+  - ['convert', 'convert']
+  - ['debug', 'debug']
+  - ['train', 'train']
+  - ['pretrain', 'pretrain']
+  - ['evaluate', 'evaluate']
+  - ['package', 'package']
+  - ['project', 'project']
 ---
 
 spaCy's CLI provides a range of helpful commands for downloading and training
@@ -22,7 +22,7 @@ list of available commands, you can type `python -m spacy --help`. You can also
 add the `--help` flag to any command or subcommand to see the description,
 available arguments and usage.
 
-## Download {#download}
+## download {#download tag="command"}
 
 Download [models](/usage/models) for spaCy. The downloader finds the
 best-matching compatible version and uses `pip install` to download the model as
@@ -39,41 +39,41 @@ the model name to be specified with its version (e.g. `en_core_web_sm-2.2.0`).
 > to a local PyPi installation and fetching it straight from there. This will
 > also allow you to add it as a versioned package dependency to your project.
 
-```bash
-$ python -m spacy download [model] [--direct] [pip args]
+```cli
+$ python -m spacy download [model] [--direct] [pip_args]
 ```
 
-| Argument | Type | Description |
-| ------------------------------------- | ------------- | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
-| `model` | positional | Model name, e.g. [`en_core_web_sm`](/models/en#en_core_web_sm). |
-| `--direct`, `-d` | flag | Force direct download of exact model version. |
-| `--help`, `-h` | flag | Show help message and available arguments. |
-| pip args <Tag variant="new">2.1</Tag> | option / flag | Additional installation options to be passed to `pip install` when installing the model package. For example, `--user` to install to the user home directory or `--no-deps` to not install model dependencies. |
-| **CREATES** | directory | The installed model package in your `site-packages` directory. |
+| Name | Description |
+| ------------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ |
+| `model` | Model name, e.g. [`en_core_web_sm`](/models/en#en_core_web_sm). ~~str (positional)~~ |
+| `--direct`, `-d` | Force direct download of exact model version. ~~bool (flag)~~ |
+| `--help`, `-h` | Show help message and available arguments. ~~bool (flag)~~ |
+| pip args <Tag variant="new">2.1</Tag> | Additional installation options to be passed to `pip install` when installing the model package. For example, `--user` to install to the user home directory or `--no-deps` to not install model dependencies. ~~Any (option/flag)~~ |
+| **CREATES** | The installed model package in your `site-packages` directory. |
 
-## Info {#info}
+## info {#info tag="command"}
 
 Print information about your spaCy installation, models and local setup, and
 generate [Markdown](https://en.wikipedia.org/wiki/Markdown)-formatted markup to
 copy-paste into [GitHub issues](https://github.com/explosion/spaCy/issues).
 
-```bash
+```cli
 $ python -m spacy info [--markdown] [--silent]
 ```
 
-```bash
+```cli
 $ python -m spacy info [model] [--markdown] [--silent]
 ```
 
-| Argument | Type | Description |
-| ------------------------------------------------ | ---------- | ---------------------------------------------- |
-| `model` | positional | A model, i.e. package name or path (optional). |
-| `--markdown`, `-md` | flag | Print information as Markdown. |
-| `--silent`, `-s` <Tag variant="new">2.0.12</Tag> | flag | Don't print anything, just return the values. |
-| `--help`, `-h` | flag | Show help message and available arguments. |
-| **PRINTS** | `stdout` | Information about your spaCy installation. |
+| Name | Description |
+| ------------------------------------------------ | ------------------------------------------------------------------------------ |
+| `model` | A model, i.e. package name or path (optional). ~~Optional[str] \(positional)~~ |
+| `--markdown`, `-md` | Print information as Markdown. ~~bool (flag)~~ |
+| `--silent`, `-s` <Tag variant="new">2.0.12</Tag> | Don't print anything, just return the values. ~~bool (flag)~~ |
+| `--help`, `-h` | Show help message and available arguments. ~~bool (flag)~~ |
+| **PRINTS** | Information about your spaCy installation. |
 
-## Validate {#validate new="2"}
+## validate {#validate new="2" tag="command"}
 
 Find all models installed in the current environment and check whether they are
 compatible with the currently installed version of spaCy. Should be run after
@@ -88,20 +88,20 @@ and command for updating are shown.
 > suite, to ensure all models are up to date before proceeding. If incompatible
 > models are found, it will return `1`.
 
-```bash
+```cli
 $ python -m spacy validate
 ```
 
-| Argument | Type | Description |
-| ---------- | -------- | --------------------------------------------------------- |
-| **PRINTS** | `stdout` | Details about the compatibility of your installed models. |
+| Name | Description |
+| ---------- | --------------------------------------------------------- |
+| **PRINTS** | Details about the compatibility of your installed models. |
 
-## Init {#init new="3"}
+## init {#init new="3"}
 
 The `spacy init` CLI includes helpful commands for initializing training config
 files and model directories.
 
-### init config {#init-config new="3"}
+### init config {#init-config new="3" tag="command"}
 
 Initialize and save a [`config.cfg` file](/usage/training#config) using the
 **recommended settings** for your use case. It works just like the
@@ -111,25 +111,25 @@ config. The settings you specify will impact the suggested model architectures
 and pipeline setup, as well as the hyperparameters. You can also adjust and
 customize those settings in your config file later.
 
-> ```bash
-> ### Example {wrap="true"}
+> #### Example
+>
+> ```cli
 > $ python -m spacy init config config.cfg --lang en --pipeline ner,textcat --optimize accuracy
 > ```
 
-```bash
-$ python -m spacy init config [output_file] [--lang] [--pipeline]
-[--optimize] [--cpu]
+```cli
+$ python -m spacy init config [output_file] [--lang] [--pipeline] [--optimize] [--cpu]
 ```
 
-| Argument | Type | Description |
-| ------------------ | ---------- | ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
-| `output_file` | positional | Path to output `.cfg` file. If not set, the config is written to stdout so you can pipe it forward to a file. |
-| `--lang`, `-l` | option | Optional code of the [language](/usage/models#languages) to use. Defaults to `"en"`. |
-| `--pipeline`, `-p` | option | Comma-separated list of trainable [pipeline components](/usage/processing-pipelines#built-in) to include in the model. Defaults to `"tagger,parser,ner"`. |
-| `--optimize`, `-o` | option | `"efficiency"` or `"accuracy"`. Whether to optimize for efficiency (faster inference, smaller model, lower memory consumption) or higher accuracy (potentially larger and slower model). This will impact the choice of architecture, pretrained weights and related hyperparameters. Defaults to `"efficiency"`. |
-| `--cpu`, `-C` | flag | Whether the model needs to run on CPU. This will impact the choice of architecture, pretrained weights and related hyperparameters. |
-| `--help`, `-h` | flag | Show help message and available arguments. |
-| **CREATES** | file | The config file for training. |
+| Name | Description |
+| ------------------ | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
+| `output_file` | Path to output `.cfg` file or `-` to write the config to stdout (so you can pipe it forward to a file). Note that if you're writing to stdout, no additional logging info is printed. ~~Path (positional)~~ |
+| `--lang`, `-l` | Optional code of the [language](/usage/models#languages) to use. Defaults to `"en"`. ~~str (option)~~ |
+| `--pipeline`, `-p` | Comma-separated list of trainable [pipeline components](/usage/processing-pipelines#built-in) to include in the model. Defaults to `"tagger,parser,ner"`. ~~str (option)~~ |
+| `--optimize`, `-o` | `"efficiency"` or `"accuracy"`. Whether to optimize for efficiency (faster inference, smaller model, lower memory consumption) or higher accuracy (potentially larger and slower model). This will impact the choice of architecture, pretrained weights and related hyperparameters. Defaults to `"efficiency"`. ~~str (option)~~ |
+| `--cpu`, `-C` | Whether the model needs to run on CPU. This will impact the choice of architecture, pretrained weights and related hyperparameters. ~~bool (flag)~~ |
+| `--help`, `-h` | Show help message and available arguments. ~~bool (flag)~~ |
+| **CREATES** | The config file for training. |
 
 ### init fill-config {#init-fill-config new="3"}
 
@@ -143,33 +143,32 @@ be created, and their signatures are used to find the defaults. If your config
 contains a problem that can't be resolved automatically, spaCy will show you a
 validation error with more details.
 
-> ```bash
-> ### Example {wrap="true"}
+> #### Example
+>
+> ```cli
 > $ python -m spacy init fill-config base.cfg config.cfg
 > ```
 
-```bash
+```cli
 $ python -m spacy init fill-config [base_path] [output_file] [--diff]
 ```
 
-| Argument | Type | Description |
-| -------------- | ---------- | ------------------------------------------------------------------------------------------------------------- |
-| `base_path` | positional | Path to base config to fill, e.g. generated by the [quickstart widget](/usage/training#quickstart). |
-| `output_file` | positional | Path to output `.cfg` file. If not set, the config is written to stdout so you can pipe it forward to a file. |
-| `--diff`, `-D` | flag | Print a visual diff highlighting the changes. |
-| `--help`, `-h` | flag | Show help message and available arguments. |
-| **CREATES** | file | Complete and auto-filled config file for training. |
+| Name | Description |
+| -------------- | ----------------------------------------------------------------------------------------------------------------------------------- |
+| `base_path` | Path to base config to fill, e.g. generated by the [quickstart widget](/usage/training#quickstart). ~~Path (positional)~~ |
+| `output_file` | Path to output `.cfg` file. If not set, the config is written to stdout so you can pipe it forward to a file. ~~Path (positional)~~ |
+| `--diff`, `-D` | Print a visual diff highlighting the changes. ~~bool (flag)~~ |
+| `--help`, `-h` | Show help message and available arguments. ~~bool (flag)~~ |
+| **CREATES** | Complete and auto-filled config file for training. |
 
-### init model {#init-model new="2"}
+### init model {#init-model new="2" tag="command"}
 
-<!-- TODO: update for v3 -->
-
 Create a new model directory from raw data, like word frequencies, Brown
-clusters and word vectors. This command is similar to the `spacy model` command
-in v1.x. Note that in order to populate the model's vocab, you need to pass in a
-JSONL-formatted [vocabulary file](/api/data-formats#vocab-jsonl) as
-`--jsonl-loc` with optional `id` values that correspond to the vectors table.
-Just loading in vectors will not automatically populate the vocab.
+clusters and word vectors. Note that in order to populate the model's vocab, you
+need to pass in a JSONL-formatted
+[vocabulary file](/api/data-formats#vocab-jsonl) as `--jsonl-loc` with optional
+`id` values that correspond to the vectors table. Just loading in vectors will
+not automatically populate the vocab.
 
 <Infobox title="New in v3.0" variant="warning">
 
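The JSONL vocabulary file passed via `--jsonl-loc` contains one JSON object per line describing lexical attributes. A minimal sketch of building and round-tripping such a file, assuming illustrative field names (`orth` for the token text, `id` for the row in the vectors table; check the data-formats docs for the exact schema):

```python
import json

# One JSON object per line with lexical attributes. The "orth" and "id"
# field names here are assumptions for illustration -- see the vocab
# JSONL section of the data-formats docs for the real schema.
entries = [
    {"orth": "apple", "id": 0},
    {"orth": "orange", "id": 1},
]
jsonl = "\n".join(json.dumps(entry) for entry in entries)

# Round-trip: parse it back line by line, the way a JSONL reader would.
rows = [json.loads(line) for line in jsonl.splitlines()]
assert rows == entries
```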
@@ -177,24 +176,23 @@ The `init-model` command is now available as a subcommand of `spacy init`.
 
 </Infobox>
 
-```bash
-$ python -m spacy init model [lang] [output_dir] [--jsonl-loc] [--vectors-loc]
-[--prune-vectors]
+```cli
+$ python -m spacy init model [lang] [output_dir] [--jsonl-loc] [--vectors-loc] [--prune-vectors]
 ```
 
-| Argument | Type | Description |
-| ------------------------------------------------------- | ---------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ |
-| `lang` | positional | Model language [ISO code](https://en.wikipedia.org/wiki/List_of_ISO_639-1_codes), e.g. `en`. |
-| `output_dir` | positional | Model output directory. Will be created if it doesn't exist. |
-| `--jsonl-loc`, `-j` | option | Optional location of JSONL-formatted [vocabulary file](/api/data-formats#vocab-jsonl) with lexical attributes. |
-| `--vectors-loc`, `-v` | option | Optional location of vectors. Should be a file where the first row contains the dimensions of the vectors, followed by a space-separated Word2Vec table. File can be provided in `.txt` format or as a zipped text file in `.zip` or `.tar.gz` format. |
-| `--truncate-vectors`, `-t` <Tag variant="new">2.3</Tag> | option | Number of vectors to truncate to when reading in vectors file. Defaults to `0` for no truncation. |
-| `--prune-vectors`, `-V` | option | Number of vectors to prune the vocabulary to. Defaults to `-1` for no pruning. |
-| `--vectors-name`, `-vn` | option | Name to assign to the word vectors in the `meta.json`, e.g. `en_core_web_md.vectors`. |
-| `--help`, `-h` | flag | Show help message and available arguments. |
-| **CREATES** | model | A spaCy model containing the vocab and vectors. |
+| Name | Description |
+| ------------------------------------------------------- | ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
+| `lang` | Model language [ISO code](https://en.wikipedia.org/wiki/List_of_ISO_639-1_codes), e.g. `en`. ~~str (positional)~~ |
+| `output_dir` | Model output directory. Will be created if it doesn't exist. ~~Path (positional)~~ |
+| `--jsonl-loc`, `-j` | Optional location of JSONL-formatted [vocabulary file](/api/data-formats#vocab-jsonl) with lexical attributes. ~~Optional[Path] \(option)~~ |
+| `--vectors-loc`, `-v` | Optional location of vectors. Should be a file where the first row contains the dimensions of the vectors, followed by a space-separated Word2Vec table. File can be provided in `.txt` format or as a zipped text file in `.zip` or `.tar.gz` format. ~~Optional[Path] \(option)~~ |
+| `--truncate-vectors`, `-t` <Tag variant="new">2.3</Tag> | Number of vectors to truncate to when reading in vectors file. Defaults to `0` for no truncation. ~~int (option)~~ |
+| `--prune-vectors`, `-V` | Number of vectors to prune the vocabulary to. Defaults to `-1` for no pruning. ~~int (option)~~ |
+| `--vectors-name`, `-vn` | Name to assign to the word vectors in the `meta.json`, e.g. `en_core_web_md.vectors`. ~~str (option)~~ |
+| `--help`, `-h` | Show help message and available arguments. ~~bool (flag)~~ |
+| **CREATES** | A spaCy model containing the vocab and vectors. |
 
-## Convert {#convert}
+## convert {#convert tag="command"}
 
 Convert files into spaCy's
 [binary training data format](/api/data-formats#binary-training), a serialized
@@ -202,28 +200,26 @@ Convert files into spaCy's
 management functions. The converter can be specified on the command line, or
 chosen based on the file extension of the input file.
 
-```bash
-$ python -m spacy convert [input_file] [output_dir] [--converter]
-[--file-type] [--n-sents] [--seg-sents] [--model] [--morphology]
-[--merge-subtokens] [--ner-map] [--lang]
+```cli
+$ python -m spacy convert [input_file] [output_dir] [--converter] [--file-type] [--n-sents] [--seg-sents] [--model] [--morphology] [--merge-subtokens] [--ner-map] [--lang]
 ```
 
-| Argument | Type | Description |
-| ------------------------------------------------ | ---------- | ------------------------------------------------------------------------------------------------------------------------ |
-| `input_file` | positional | Input file. |
-| `output_dir` | positional | Output directory for converted file. Defaults to `"-"`, meaning data will be written to `stdout`. |
-| `--converter`, `-c` <Tag variant="new">2</Tag> | option | Name of converter to use (see below). |
-| `--file-type`, `-t` <Tag variant="new">2.1</Tag> | option | Type of file to create. Either `spacy` (default) for binary [`DocBin`](/api/docbin) data or `json` for v2.x JSON format. |
-| `--n-sents`, `-n` | option | Number of sentences per document. |
-| `--seg-sents`, `-s` <Tag variant="new">2.2</Tag> | flag | Segment sentences (for `-c ner`) |
-| `--model`, `-b` <Tag variant="new">2.2</Tag> | option | Model for parser-based sentence segmentation (for `-s`) |
-| `--morphology`, `-m` | option | Enable appending morphology to tags. |
-| `--ner-map`, `-nm` | option | NER tag mapping (as JSON-encoded dict of entity types). |
-| `--lang`, `-l` <Tag variant="new">2.1</Tag> | option | Language code (if tokenizer required). |
-| `--help`, `-h` | flag | Show help message and available arguments. |
-| **CREATES** | binary | Binary [`DocBin`](/api/docbin) training data that can be used with [`spacy train`](/api/cli#train). |
+| Name | Description |
+| ------------------------------------------------ | ----------------------------------------------------------------------------------------------------------------------------------------- |
+| `input_file` | Input file. ~~Path (positional)~~ |
+| `output_dir` | Output directory for converted file. Defaults to `"-"`, meaning data will be written to `stdout`. ~~Optional[Path] \(positional)~~ |
+| `--converter`, `-c` <Tag variant="new">2</Tag> | Name of converter to use (see below). ~~str (option)~~ |
+| `--file-type`, `-t` <Tag variant="new">2.1</Tag> | Type of file to create. Either `spacy` (default) for binary [`DocBin`](/api/docbin) data or `json` for v2.x JSON format. ~~str (option)~~ |
+| `--n-sents`, `-n` | Number of sentences per document. ~~int (option)~~ |
+| `--seg-sents`, `-s` <Tag variant="new">2.2</Tag> | Segment sentences (for `--converter ner`). ~~bool (flag)~~ |
+| `--model`, `-b` <Tag variant="new">2.2</Tag> | Model for parser-based sentence segmentation (for `--seg-sents`). ~~Optional[str] \(option)~~ |
+| `--morphology`, `-m` | Enable appending morphology to tags. ~~bool (flag)~~ |
+| `--ner-map`, `-nm` | NER tag mapping (as JSON-encoded dict of entity types). ~~Optional[Path] \(option)~~ |
+| `--lang`, `-l` <Tag variant="new">2.1</Tag> | Language code (if tokenizer required). ~~Optional[str] \(option)~~ |
+| `--help`, `-h` | Show help message and available arguments. ~~bool (flag)~~ |
+| **CREATES** | Binary [`DocBin`](/api/docbin) training data that can be used with [`spacy train`](/api/cli#train). |
 
-### Converters
+### Converters {#converters}
 
 | ID | Description |
 | ------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
@@ -233,12 +229,12 @@ $ python -m spacy convert [input_file] [output_dir] [--converter]
 | `ner` | NER with IOB/IOB2 tags, one token per line with columns separated by whitespace. The first column is the token and the final column is the IOB tag. Sentences are separated by blank lines and documents are separated by the line `-DOCSTART- -X- O O`. Supports CoNLL 2003 NER format. See [sample data](https://github.com/explosion/spaCy/tree/master/examples/training/ner_example_data). |
 | `iob` | NER with IOB/IOB2 tags, one sentence per line with tokens separated by whitespace and annotation separated by `|`, either `word|B-ENT` or `word|POS|B-ENT`. See [sample data](https://github.com/explosion/spaCy/tree/master/examples/training/ner_example_data). |
 
-## Debug {#debug new="3"}
+## debug {#debug new="3"}
 
 The `spacy debug` CLI includes helpful commands for debugging and profiling your
 configs, data and implementations.
 
-### debug config {#debug-config}
+### debug config {#debug-config new="3" tag="command"}
 
 Debug a [`config.cfg` file](/usage/training#config) and show validation errors.
 The command will create all objects in the tree and validate them. Note that
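The `iob` converter input described above is one sentence per line, with each token's text and annotation joined by `|` and the IOB tag always in the last field. A minimal parser sketch of that format (not spaCy's implementation, just an illustration of the layout):

```python
def parse_iob_line(line: str):
    """Split one sentence in `word|B-ENT` or `word|POS|B-ENT` form into
    (word, iob_tag) pairs. The IOB tag is always the last |-separated field."""
    pairs = []
    for token in line.split():
        fields = token.split("|")
        # fields[0] is the word; any middle field (e.g. a POS tag) is skipped.
        pairs.append((fields[0], fields[-1]))
    return pairs

print(parse_iob_line("Alex|B-PER is|O visiting|O Berlin|B-LOC"))
# [('Alex', 'B-PER'), ('is', 'O'), ('visiting', 'O'), ('Berlin', 'B-LOC')]
```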
@@ -246,15 +242,15 @@ some config validation errors are blocking and will prevent the rest of the
 config from being resolved. This means that you may not see all validation
 errors at once and some issues are only shown once previous errors have been
 fixed. To auto-fill a partial config and save the result, you can use the
-[`init config`](/api/cli#init-config) command.
+[`init fill-config`](/api/cli#init-fill-config) command.
 
-```bash
-$ python -m spacy debug config [config_path] [--code_path] [--output] [--auto_fill] [--diff] [overrides]
+```cli
+$ python -m spacy debug config [config_path] [--code_path] [overrides]
 ```
 
 > #### Example
 >
-> ```bash
+> ```cli
 > $ python -m spacy debug config ./config.cfg
 > ```
 
@@ -277,18 +273,15 @@ python -m spacy init fill-config tmp/starter-config_invalid.cfg --base tmp/start
 
 </Accordion>
 
-| Argument | Type | Default | Description |
-| --------------------- | ------------- | -------------------------------------------------------------------------------------------------------------------------------------------------------------------- | ----------- |
-| `config_path` | positional | Path to [training config](/api/data-formats#config) file containing all settings and hyperparameters. |
-| `--code_path`, `-c` | option | Path to Python file with additional code to be imported. Allows [registering custom functions](/usage/training#custom-models) for new architectures. |
-| `--auto_fill`, `-F` | option | Whether or not to auto-fill the config with built-in defaults if possible. If `False`, the provided config needs to be complete. |
-| `--output_path`, `-o` | option | Output path where the filled config can be stored. Use '-' for standard output. |
-| `--diff`, `-D` | option | Show a visual diff if config was auto-filled. |
-| `--help`, `-h` | flag | Show help message and available arguments. |
-| overrides | option / flag | Config parameters to override. Should be options starting with `--` that correspond to the config section and value to override, e.g. `--paths.train ./train.spacy`. |
-| **PRINTS** | stdout | Config validation errors, if available. |
+| Name | Description |
+| ------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ |
+| `config_path` | Path to [training config](/api/data-formats#config) file containing all settings and hyperparameters. ~~Path (positional)~~ |
+| `--code_path`, `-c` | Path to Python file with additional code to be imported. Allows [registering custom functions](/usage/training#custom-functions) for new architectures. ~~Optional[Path] \(option)~~ |
+| `--help`, `-h` | Show help message and available arguments. ~~bool (flag)~~ |
+| overrides | Config parameters to override. Should be options starting with `--` that correspond to the config section and value to override, e.g. `--paths.train ./train.spacy`. ~~Any (option/flag)~~ |
+| **PRINTS** | Config validation errors, if available. |
 
-### debug data {#debug-data}
+### debug data {#debug-data tag="command"}
 
 Analyze, debug, and validate your training and development data. Get useful
 stats, and find problems like invalid entity annotations, cyclic dependencies,
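The `overrides` arguments above map dotted option names onto config sections, e.g. `--paths.train ./train.spacy` overrides `train` in the `[paths]` section. spaCy's own parsing lives in its CLI utilities; this is only a rough sketch of the mapping:

```python
from typing import Any, Dict, List

def parse_overrides(args: List[str]) -> Dict[str, Dict[str, Any]]:
    """Turn ["--paths.train", "./train.spacy"] into
    {"paths": {"train": "./train.spacy"}}. Illustrative sketch only --
    not spaCy's actual override handling."""
    result: Dict[str, Dict[str, Any]] = {}
    # Pair up flags with the values that follow them.
    for flag, value in zip(args[::2], args[1::2]):
        section, _, key = flag.lstrip("-").partition(".")
        result.setdefault(section, {})[key] = value
    return result

print(parse_overrides(["--paths.train", "./train.spacy", "--training.max_epochs", "3"]))
# {'paths': {'train': './train.spacy'}, 'training': {'max_epochs': '3'}}
```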
@@ -303,14 +296,13 @@ takes the same arguments as `train` and reads settings off the
 
 </Infobox>
 
-```bash
-$ python -m spacy debug data [config_path] [--code] [--ignore-warnings]
-[--verbose] [--no-format] [overrides]
+```cli
+$ python -m spacy debug data [config_path] [--code] [--ignore-warnings] [--verbose] [--no-format] [overrides]
 ```
 
 > #### Example
 >
-> ```bash
+> ```cli
 > $ python -m spacy debug data ./config.cfg
 > ```
 
@@ -453,18 +445,18 @@ will not be available.
 
 </Accordion>
 
-| Argument | Type | Description |
-| -------------------------- | ------------- | -------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
-| `config_path` | positional | Path to [training config](/api/data-formats#config) file containing all settings and hyperparameters. |
-| `--code`, `-c` | option | Path to Python file with additional code to be imported. Allows [registering custom functions](/usage/training#custom-models) for new architectures. |
-| `--ignore-warnings`, `-IW` | flag | Ignore warnings, only show stats and errors. |
-| `--verbose`, `-V` | flag | Print additional information and explanations. |
-| `--no-format`, `-NF` | flag | Don't pretty-print the results. Use this if you want to write to a file. |
-| `--help`, `-h` | flag | Show help message and available arguments. |
-| overrides | option / flag | Config parameters to override. Should be options starting with `--` that correspond to the config section and value to override, e.g. `--paths.train ./train.spacy`. |
-| **PRINTS** | stdout | Debugging information. |
+| Name | Description |
+| -------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ |
+| `config_path` | Path to [training config](/api/data-formats#config) file containing all settings and hyperparameters. ~~Path (positional)~~ |
+| `--code`, `-c` | Path to Python file with additional code to be imported. Allows [registering custom functions](/usage/training#custom-functions) for new architectures. ~~Optional[Path] \(option)~~ |
+| `--ignore-warnings`, `-IW` | Ignore warnings, only show stats and errors. ~~bool (flag)~~ |
+| `--verbose`, `-V` | Print additional information and explanations. ~~bool (flag)~~ |
+| `--no-format`, `-NF` | Don't pretty-print the results. Use this if you want to write to a file. ~~bool (flag)~~ |
+| `--help`, `-h` | Show help message and available arguments. ~~bool (flag)~~ |
+| overrides | Config parameters to override. Should be options starting with `--` that correspond to the config section and value to override, e.g. `--paths.train ./train.spacy`. ~~Any (option/flag)~~ |
+| **PRINTS** | Debugging information. |
 
-### debug profile {#debug-profile}
+### debug profile {#debug-profile tag="command"}
 
 Profile which functions take the most time in a spaCy pipeline. Input should be
 formatted as one JSON object per line with a key `"text"`. It can either be
@ -478,26 +470,25 @@ The `profile` command is now available as a subcommand of `spacy debug`.
|
||||||
|
|
||||||
</Infobox>
|
</Infobox>
|
||||||
|
|
||||||
```bash
|
```cli
|
||||||
$ python -m spacy debug profile [model] [inputs] [--n-texts]
|
$ python -m spacy debug profile [model] [inputs] [--n-texts]
|
||||||
```
|
```
|
||||||
|
|
||||||
| Argument | Type | Description |
|
| Name | Description |
|
||||||
| ----------------- | ---------- | ----------------------------------------------------------------- |
|
| ----------------- | ---------------------------------------------------------------------------------- |
|
||||||
| `model` | positional | A loadable spaCy model. |
|
| `model` | A loadable spaCy model. ~~str (positional)~~ |
|
||||||
| `inputs` | positional | Optional path to input file, or `-` for standard input. |
|
| `inputs` | Optional path to input file, or `-` for standard input. ~~Path (positional)~~ |
|
||||||
| `--n-texts`, `-n` | option | Maximum number of texts to use if available. Defaults to `10000`. |
|
| `--n-texts`, `-n` | Maximum number of texts to use if available. Defaults to `10000`. ~~int (option)~~ |
|
||||||
| `--help`, `-h` | flag | Show help message and available arguments. |
|
| `--help`, `-h` | Show help message and available arguments. ~~bool (flag)~~ |
|
||||||
| **PRINTS** | stdout | Profiling information for the model. |
|
| **PRINTS** | Profiling information for the model. |
|
||||||
|
|
||||||
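The JSONL input format described above (one JSON object per line with a `"text"` key) can be sketched as follows; the file name `inputs.jsonl` and the sentences are illustrative, not taken from this commit:

```shell
# Write a minimal input file for `spacy debug profile`:
# one JSON object per line, each carrying a "text" key.
printf '%s\n' \
  '{"text": "This is a sentence."}' \
  '{"text": "And this is another one."}' \
  > inputs.jsonl
```

Such a file could then be passed as the `inputs` argument, or streamed on standard input by passing `-`.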
-### debug model {#debug-model}
+### debug model {#debug-model new="3" tag="command"}

 Debug a Thinc [`Model`](https://thinc.ai/docs/api-model) by running it on a
 sample text and checking how it updates its internal weights and parameters.

-```bash
-$ python -m spacy debug model [config_path] [component] [--layers] [-DIM]
-[-PAR] [-GRAD] [-ATTR] [-P0] [-P1] [-P2] [P3] [--gpu-id]
+```cli
+$ python -m spacy debug model [config_path] [component] [--layers] [-DIM] [-PAR] [-GRAD] [-ATTR] [-P0] [-P1] [-P2] [-P3] [--gpu-id]
 ```

 <Accordion title="Example outputs" spaced>

@@ -507,7 +498,7 @@ model ("Step 0"), which helps us to understand the internal structure of the
 Neural Network, and to focus on specific layers that we want to inspect further
 (see next example).

-```bash
+```cli
 $ python -m spacy debug model ./config.cfg tagger -P0
 ```

@@ -553,7 +544,7 @@ an all-zero matrix determined by the `nO` and `nI` dimensions. After a first
 training step (Step 2), this matrix has clearly updated its values through the
 training feedback loop.

-```bash
+```cli
 $ python -m spacy debug model ./config.cfg tagger -l "5,15" -DIM -PAR -P0 -P1 -P2
 ```

@@ -596,23 +587,24 @@ $ python -m spacy debug model ./config.cfg tagger -l "5,15" -DIM -PAR -P0 -P1 -P

 </Accordion>

-| Argument                | Type       | Description                                                                                           |
-| ----------------------- | ---------- | ----------------------------------------------------------------------------------------------------- |
-| `config_path`           | positional | Path to [training config](/api/data-formats#config) file containing all settings and hyperparameters. |  |
-| `component`             | positional | Name of the pipeline component of which the model should be analyzed.                                 |  |
-| `--layers`, `-l`        | option     | Comma-separated names of layer IDs to print.                                                          |  |
-| `--dimensions`, `-DIM`  | option     | Show dimensions of each layer.                                                                        |
-| `--parameters`, `-PAR`  | option     | Show parameters of each layer.                                                                        |
-| `--gradients`, `-GRAD`  | option     | Show gradients of each layer.                                                                         |
-| `--attributes`, `-ATTR` | option     | Show attributes of each layer.                                                                        |
-| `--print-step0`, `-P0`  | option     | Print model before training.                                                                          |
-| `--print-step1`, `-P1`  | option     | Print model after initialization.                                                                     |
-| `--print-step2`, `-P2`  | option     | Print model after training.                                                                           |
-| `--print-step3`, `-P3`  | option     | Print final predictions.                                                                              |
-| `--help`, `-h`          | flag       | Show help message and available arguments.                                                            |
-| **PRINTS**              | stdout     | Debugging information.                                                                                |
+| Name                    | Description                                                                                                                   |
+| ----------------------- | ----------------------------------------------------------------------------------------------------------------------------- |
+| `config_path`           | Path to [training config](/api/data-formats#config) file containing all settings and hyperparameters. ~~Path (positional)~~   |
+| `component`             | Name of the pipeline component of which the model should be analyzed. ~~str (positional)~~                                    |
+| `--layers`, `-l`        | Comma-separated names of layer IDs to print. ~~str (option)~~                                                                 |
+| `--dimensions`, `-DIM`  | Show dimensions of each layer. ~~bool (flag)~~                                                                                |
+| `--parameters`, `-PAR`  | Show parameters of each layer. ~~bool (flag)~~                                                                                |
+| `--gradients`, `-GRAD`  | Show gradients of each layer. ~~bool (flag)~~                                                                                 |
+| `--attributes`, `-ATTR` | Show attributes of each layer. ~~bool (flag)~~                                                                                |
+| `--print-step0`, `-P0`  | Print model before training. ~~bool (flag)~~                                                                                  |
+| `--print-step1`, `-P1`  | Print model after initialization. ~~bool (flag)~~                                                                             |
+| `--print-step2`, `-P2`  | Print model after training. ~~bool (flag)~~                                                                                   |
+| `--print-step3`, `-P3`  | Print final predictions. ~~bool (flag)~~                                                                                      |
+| `--gpu-id`, `-g`        | GPU ID or `-1` for CPU. Defaults to `-1`. ~~int (option)~~                                                                    |
+| `--help`, `-h`          | Show help message and available arguments. ~~bool (flag)~~                                                                    |
+| **PRINTS**              | Debugging information.                                                                                                        |

-## Train {#train}
+## train {#train tag="command"}

 Train a model. Expects data in spaCy's
 [binary format](/api/data-formats#training) and a
@@ -620,9 +612,9 @@ Train a model. Expects data in spaCy's
 Will save out the best model from all epochs, as well as the final model. The
 `--code` argument can be used to provide a Python file that's imported before
 the training process starts. This lets you register
-[custom functions](/usage/training#custom-models) and architectures and refer to
-them in your config, all while still using spaCy's built-in `train` workflow. If
-you need to manage complex multi-step training workflows, check out the new
+[custom functions](/usage/training#custom-functions) and architectures and refer
+to them in your config, all while still using spaCy's built-in `train` workflow.
+If you need to manage complex multi-step training workflows, check out the new
 [spaCy projects](/usage/projects).

 <Infobox title="New in v3.0" variant="warning">

@@ -636,21 +628,21 @@ in the section `[paths]`.

 </Infobox>

-```bash
+```cli
 $ python -m spacy train [config_path] [--output] [--code] [--verbose] [overrides]
 ```

-| Argument          | Type          | Description                                                                                                                                                          |
-| ----------------- | ------------- | -------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
-| `config_path`     | positional    | Path to [training config](/api/data-formats#config) file containing all settings and hyperparameters.                                                                |
-| `--output`, `-o`  | positional    | Directory to store model in. Will be created if it doesn't exist.                                                                                                    |
-| `--code`, `-c`    | option        | Path to Python file with additional code to be imported. Allows [registering custom functions](/usage/training#custom-models) for new architectures.                 |
-| `--verbose`, `-V` | flag          | Show more detailed messages during training.                                                                                                                         |
-| `--help`, `-h`    | flag          | Show help message and available arguments.                                                                                                                           |
-| overrides         | option / flag | Config parameters to override. Should be options starting with `--` that correspond to the config section and value to override, e.g. `--paths.train ./train.spacy`. |
-| **CREATES**       | model         | The final model and the best model.                                                                                                                                  |
+| Name              | Description                                                                                                                                                                                  |
+| ----------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ |
+| `config_path`     | Path to [training config](/api/data-formats#config) file containing all settings and hyperparameters. ~~Path (positional)~~                                                                  |
+| `--output`, `-o`  | Directory to store model in. Will be created if it doesn't exist. ~~Optional[Path] \(positional)~~                                                                                           |
+| `--code`, `-c`    | Path to Python file with additional code to be imported. Allows [registering custom functions](/usage/training#custom-functions) for new architectures. ~~Optional[Path] \(option)~~         |
+| `--verbose`, `-V` | Show more detailed messages during training. ~~bool (flag)~~                                                                                                                                 |
+| `--help`, `-h`    | Show help message and available arguments. ~~bool (flag)~~                                                                                                                                   |
+| overrides         | Config parameters to override. Should be options starting with `--` that correspond to the config section and value to override, e.g. `--paths.train ./train.spacy`. ~~Any (option/flag)~~   |
+| **CREATES**       | The final model and the best model.                                                                                                                                                          |

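To make the override convention for `train` concrete: assuming a config with a `[paths]` section like the hypothetical fragment below (the keys follow the docs, but the values are illustrative, not taken from this commit), the flag `--paths.train` targets the `train` key of that section:

```ini
[paths]
train = "corpus/train.spacy"
dev = "corpus/dev.spacy"
```

Running `spacy train config.cfg --paths.train ./other.spacy` would then swap in a different training corpus without editing the file, following the `--section.key value` pattern described for the overrides.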
-## Pretrain {#pretrain new="2.1" tag="experimental"}
+## pretrain {#pretrain new="2.1" tag="command,experimental"}

 Pretrain the "token to vector" ([`Tok2vec`](/api/tok2vec)) layer of pipeline
 components on [raw text](/api/data-formats#pretrain), using an approximate
@@ -673,24 +665,23 @@ the [data format](/api/data-formats#config) for details.

 </Infobox>

-```bash
-$ python -m spacy pretrain [texts_loc] [output_dir] [config_path]
-[--code] [--resume-path] [--epoch-resume] [overrides]
+```cli
+$ python -m spacy pretrain [texts_loc] [output_dir] [config_path] [--code] [--resume-path] [--epoch-resume] [overrides]
 ```

-| Argument                | Type          | Description                                                                                                                                                                  |
-| ----------------------- | ------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ |
-| `texts_loc`             | positional    | Path to JSONL file with raw texts to learn from, with text provided as the key `"text"` or tokens as the key `"tokens"`. [See here](/api/data-formats#pretrain) for details. |
-| `output_dir`            | positional    | Directory to write models to on each epoch.                                                                                                                                  |
-| `config_path`           | positional    | Path to [training config](/api/data-formats#config) file containing all settings and hyperparameters.                                                                        |
-| `--code`, `-c`          | option        | Path to Python file with additional code to be imported. Allows [registering custom functions](/usage/training#custom-models) for new architectures.                         |
-| `--resume-path`, `-r`   | option        | Path to pretrained weights from which to resume pretraining.                                                                                                                 |
-| `--epoch-resume`, `-er` | option        | The epoch to resume counting from when using `--resume-path`. Prevents unintended overwriting of existing weight files.                                                      |
-| `--help`, `-h`          | flag          | Show help message and available arguments.                                                                                                                                   |
-| overrides               | option / flag | Config parameters to override. Should be options starting with `--` that correspond to the config section and value to override, e.g. `--training.use_gpu 1`.                |
-| **CREATES**             | weights       | The pretrained weights that can be used to initialize `spacy train`.                                                                                                         |
+| Name                    | Description                                                                                                                                                                                          |
+| ----------------------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
+| `texts_loc`             | Path to JSONL file with raw texts to learn from, with text provided as the key `"text"` or tokens as the key `"tokens"`. [See here](/api/data-formats#pretrain) for details. ~~Path (positional)~~   |
+| `output_dir`            | Directory to write models to on each epoch. ~~Path (positional)~~                                                                                                                                    |
+| `config_path`           | Path to [training config](/api/data-formats#config) file containing all settings and hyperparameters. ~~Path (positional)~~                                                                          |
+| `--code`, `-c`          | Path to Python file with additional code to be imported. Allows [registering custom functions](/usage/training#custom-functions) for new architectures. ~~Optional[Path] \(option)~~                 |
+| `--resume-path`, `-r`   | Path to pretrained weights from which to resume pretraining. ~~Optional[Path] \(option)~~                                                                                                            |
+| `--epoch-resume`, `-er` | The epoch to resume counting from when using `--resume-path`. Prevents unintended overwriting of existing weight files. ~~Optional[int] \(option)~~                                                  |
+| `--help`, `-h`          | Show help message and available arguments. ~~bool (flag)~~                                                                                                                                           |
+| overrides               | Config parameters to override. Should be options starting with `--` that correspond to the config section and value to override, e.g. `--training.dropout 0.2`. ~~Any (option/flag)~~                |
+| **CREATES**             | The pretrained weights that can be used to initialize `spacy train`.                                                                                                                                 |

-## Evaluate {#evaluate new="2"}
+## evaluate {#evaluate new="2" tag="command"}

 Evaluate a model. Expects a loadable spaCy model and evaluation data in the
 [binary `.spacy` format](/api/data-formats#binary-training). The
@@ -702,32 +693,31 @@ skew. To render a sample of dependency parses in a HTML file using the
 [displaCy visualizations](/usage/visualizers), set the output directory as the
 `--displacy-path` argument.

-```bash
-$ python -m spacy evaluate [model] [data_path] [--output] [--gold-preproc]
-[--gpu-id] [--displacy-path] [--displacy-limit]
+```cli
+$ python -m spacy evaluate [model] [data_path] [--output] [--gold-preproc] [--gpu-id] [--displacy-path] [--displacy-limit]
 ```

-| Argument                  | Type                 | Description                                                                                                                                              |
-| ------------------------- | -------------------- | -------------------------------------------------------------------------------------------------------------------------------------------------------- |
-| `model`                   | positional           | Model to evaluate. Can be a package or a path to a model data directory.                                                                                 |
-| `data_path`               | positional           | Location of evaluation data in spaCy's [binary format](/api/data-formats#training).                                                                      |
-| `--output`, `-o`          | option               | Output JSON file for metrics. If not set, no metrics will be exported.                                                                                   |
-| `--gold-preproc`, `-G`    | flag                 | Use gold preprocessing.                                                                                                                                  |
-| `--gpu-id`, `-g`          | option               | GPU to use, if any. Defaults to `-1` for CPU.                                                                                                            |
-| `--displacy-path`, `-dp`  | option               | Directory to output rendered parses as HTML. If not set, no visualizations will be generated.                                                            |
-| `--displacy-limit`, `-dl` | option               | Number of parses to generate per file. Defaults to `25`. Keep in mind that a significantly higher number might cause the `.html` files to render slowly. |
-| `--help`, `-h`            | flag                 | Show help message and available arguments.                                                                                                               |
-| **CREATES**               | `stdout`, JSON, HTML | Training results and optional metrics and visualizations.                                                                                                |
+| Name                      | Description                                                                                                                                                                 |
+| ------------------------- | ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
+| `model`                   | Model to evaluate. Can be a package or a path to a model data directory. ~~str (positional)~~                                                                               |
+| `data_path`               | Location of evaluation data in spaCy's [binary format](/api/data-formats#training). ~~Path (positional)~~                                                                   |
+| `--output`, `-o`          | Output JSON file for metrics. If not set, no metrics will be exported. ~~Optional[Path] \(option)~~                                                                         |
+| `--gold-preproc`, `-G`    | Use gold preprocessing. ~~bool (flag)~~                                                                                                                                     |
+| `--gpu-id`, `-g`          | GPU to use, if any. Defaults to `-1` for CPU. ~~int (option)~~                                                                                                              |
+| `--displacy-path`, `-dp`  | Directory to output rendered parses as HTML. If not set, no visualizations will be generated. ~~Optional[Path] \(option)~~                                                  |
+| `--displacy-limit`, `-dl` | Number of parses to generate per file. Defaults to `25`. Keep in mind that a significantly higher number might cause the `.html` files to render slowly. ~~int (option)~~   |
+| `--help`, `-h`            | Show help message and available arguments. ~~bool (flag)~~                                                                                                                  |
+| **CREATES**               | Training results and optional metrics and visualizations.                                                                                                                   |

-## Package {#package}
+## package {#package tag="command"}

 Generate an installable
 [model Python package](/usage/training#models-generating) from an existing model
-data directory. All data files are copied over. If the path to a `meta.json` is
-supplied, or a `meta.json` is found in the input directory, this file is used.
-Otherwise, the data can be entered directly from the command line. spaCy will
-then create a `.tar.gz` archive file that you can distribute and install with
-`pip install`.
+data directory. All data files are copied over. If the path to a
+[`meta.json`](/api/data-formats#meta) is supplied, or a `meta.json` is found in
+the input directory, this file is used. Otherwise, the data can be entered
+directly from the command line. spaCy will then create a `.tar.gz` archive file
+that you can distribute and install with `pip install`.

 <Infobox title="New in v3.0" variant="warning">

@@ -737,38 +727,37 @@ this, you can set the `--no-sdist` flag.

 </Infobox>

-```bash
-$ python -m spacy package [input_dir] [output_dir] [--meta-path] [--create-meta]
-[--no-sdist] [--version] [--force]
+```cli
+$ python -m spacy package [input_dir] [output_dir] [--meta-path] [--create-meta] [--no-sdist] [--version] [--force]
 ```

 > #### Example
 >
-> ```bash
-> python -m spacy package /input /output
-> cd /output/en_model-0.0.0
-> pip install dist/en_model-0.0.0.tar.gz
+> ```cli
+> $ python -m spacy package /input /output
+> $ cd /output/en_model-0.0.0
+> $ pip install dist/en_model-0.0.0.tar.gz
 > ```

-| Argument                                         | Type       | Description                                                                                                                                                                                       |
-| ------------------------------------------------ | ---------- | --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
-| `input_dir`                                      | positional | Path to directory containing model data.                                                                                                                                                          |
-| `output_dir`                                     | positional | Directory to create package folder in.                                                                                                                                                            |
-| `--meta-path`, `-m` <Tag variant="new">2</Tag>   | option     | Path to `meta.json` file (optional).                                                                                                                                                              |
-| `--create-meta`, `-C` <Tag variant="new">2</Tag> | flag       | Create a `meta.json` file on the command line, even if one already exists in the directory. If an existing file is found, its entries will be shown as the defaults in the command line prompt.    |
-| `--no-sdist`, `-NS`,                             | flag       | Don't build the `.tar.gz` sdist automatically. Can be set if you want to run this step manually.                                                                                                   |
-| `--version`, `-v` <Tag variant="new">3</Tag>     | option     | Package version to override in meta. Useful when training new versions, as it doesn't require editing the meta template.                                                                          |
-| `--force`, `-f`                                  | flag       | Force overwriting of existing folder in output directory.                                                                                                                                         |
-| `--help`, `-h`                                   | flag       | Show help message and available arguments.                                                                                                                                                        |
-| **CREATES**                                      | directory  | A Python package containing the spaCy model.                                                                                                                                                      |
+| Name                                             | Description                                                                                                                                                                                                       |
+| ------------------------------------------------ | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
+| `input_dir`                                      | Path to directory containing model data. ~~Path (positional)~~                                                                                                                                                    |
+| `output_dir`                                     | Directory to create package folder in. ~~Path (positional)~~                                                                                                                                                      |
+| `--meta-path`, `-m` <Tag variant="new">2</Tag>   | Path to [`meta.json`](/api/data-formats#meta) file (optional). ~~Optional[Path] \(option)~~                                                                                                                       |
+| `--create-meta`, `-C` <Tag variant="new">2</Tag> | Create a `meta.json` file on the command line, even if one already exists in the directory. If an existing file is found, its entries will be shown as the defaults in the command line prompt. ~~bool (flag)~~    |
+| `--no-sdist`, `-NS`                              | Don't build the `.tar.gz` sdist automatically. Can be set if you want to run this step manually. ~~bool (flag)~~                                                                                                  |
+| `--version`, `-v` <Tag variant="new">3</Tag>     | Package version to override in meta. Useful when training new versions, as it doesn't require editing the meta template. ~~Optional[str] \(option)~~                                                              |
+| `--force`, `-f`                                  | Force overwriting of existing folder in output directory. ~~bool (flag)~~                                                                                                                                         |
+| `--help`, `-h`                                   | Show help message and available arguments. ~~bool (flag)~~                                                                                                                                                        |
+| **CREATES**                                      | A Python package containing the spaCy model.                                                                                                                                                                      |

-## Project {#project new="3"}
+## project {#project new="3"}

 The `spacy project` CLI includes subcommands for working with
 [spaCy projects](/usage/projects), end-to-end workflows for building and
 deploying custom spaCy models.

-### project clone {#project-clone}
+### project clone {#project-clone tag="command"}

 Clone a project template from a Git repository. Calls into `git` under the hood
 and uses the sparse checkout feature, so you're only downloading what you need.
@@ -779,31 +768,31 @@ can provide any other repo (public or private) that you have access to using the

 <!-- TODO: update example once we've decided on repo structure -->

-```bash
+```cli
 $ python -m spacy project clone [name] [dest] [--repo]
 ```

 > #### Example
 >
-> ```bash
+> ```cli
 > $ python -m spacy project clone some_example
 > ```
 >
 > Clone from custom repo:
 >
-> ```bash
+> ```cli
 > $ python -m spacy project clone template --repo https://github.com/your_org/your_repo
 > ```

-| Argument       | Type       | Description                                                                                                                  |
-| -------------- | ---------- | ------------------------------------------------------------------------------------------------------------------------------ |
-| `name`         | positional | The name of the template to clone, relative to the repo. Can be a top-level directory or a subdirectory like `dir/template`. |
-| `dest`         | positional | Where to clone the project. Defaults to current working directory.                                                           |
-| `--repo`, `-r` | option     | The repository to clone from. Can be any public or private Git repo you have access to.                                      |
-| `--help`, `-h` | flag       | Show help message and available arguments.                                                                                   |
-| **CREATES**    | directory  | The cloned [project directory](/usage/projects#project-files).                                                               |
+| Name           | Description                                                                                                                                         |
+| -------------- | --------------------------------------------------------------------------------------------------------------------------------------------------- |
+| `name`         | The name of the template to clone, relative to the repo. Can be a top-level directory or a subdirectory like `dir/template`. ~~str (positional)~~   |
+| `dest`         | Where to clone the project. Defaults to current working directory. ~~Path (positional)~~                                                            |
+| `--repo`, `-r` | The repository to clone from. Can be any public or private Git repo you have access to. ~~str (option)~~                                            |
+| `--help`, `-h` | Show help message and available arguments. ~~bool (flag)~~                                                                                          |
+| **CREATES**    | The cloned [project directory](/usage/projects#project-files).                                                                                      |

-### project assets {#project-assets}
+### project assets {#project-assets tag="command"}

 Fetch project assets like datasets and pretrained weights. Assets are defined in
 the `assets` section of the [`project.yml`](/usage/projects#project-yml). If a
@@ -814,23 +803,23 @@ considered "private" and you have to take care of putting them into the
 destination directory yourself. If a local path is provided, the asset is copied
 into the current project.

-```bash
+```cli
 $ python -m spacy project assets [project_dir]
 ```

 > #### Example
 >
-> ```bash
+> ```cli
 > $ python -m spacy project assets
 > ```

-| Argument       | Type       | Description                                                       |
-| -------------- | ---------- | ----------------------------------------------------------------- |
-| `project_dir`  | positional | Path to project directory. Defaults to current working directory. |
-| `--help`, `-h` | flag       | Show help message and available arguments.                        |
-| **CREATES**    | files      | Downloaded or copied assets defined in the `project.yml`.         |
+| Name           | Description                                                                               |
+| -------------- | ----------------------------------------------------------------------------------------- |
+| `project_dir`  | Path to project directory. Defaults to current working directory. ~~Path (positional)~~   |
+| `--help`, `-h` | Show help message and available arguments. ~~bool (flag)~~                                |
+| **CREATES**    | Downloaded or copied assets defined in the `project.yml`.                                 |

-### project run {#project-run}
+### project run {#project-run tag="command"}

 Run a named command or workflow defined in the
 [`project.yml`](/usage/projects#project-yml). If a workflow name is specified,
@@ -839,26 +828,112 @@ all commands in the workflow are run, in order. If commands define
 re-run if state has changed. For example, if the input dataset changes, a
 preprocessing command that depends on those files will be re-run.

-```bash
+```cli
 $ python -m spacy project run [subcommand] [project_dir] [--force] [--dry]
 ```

 > #### Example
 >
-> ```bash
+> ```cli
 > $ python -m spacy project run train
 > ```

-| Argument        | Type       | Description                                                       |
-| --------------- | ---------- | ----------------------------------------------------------------- |
-| `subcommand`    | positional | Name of the command or workflow to run.                           |
-| `project_dir`   | positional | Path to project directory. Defaults to current working directory. |
-| `--force`, `-F` | flag       | Force re-running steps, even if nothing changed.                  |
-| `--dry`, `-D`   | flag       | Perform a dry run and don't execute scripts.                      |
-| `--help`, `-h`  | flag       | Show help message and available arguments.                        |
-| **EXECUTES**    | script     | The command defined in the `project.yml`.                         |
+| Name            | Description                                                                               |
+| --------------- | ----------------------------------------------------------------------------------------- |
+| `subcommand`    | Name of the command or workflow to run. ~~str (positional)~~                              |
+| `project_dir`   | Path to project directory. Defaults to current working directory. ~~Path (positional)~~   |
+| `--force`, `-F` | Force re-running steps, even if nothing changed. ~~bool (flag)~~                          |
+| `--dry`, `-D`   | Perform a dry run and don't execute scripts. ~~bool (flag)~~                              |
+| `--help`, `-h`  | Show help message and available arguments. ~~bool (flag)~~                                |
+| **EXECUTES**    | The command defined in the `project.yml`.                                                 |

### project dvc {#project-dvc}
|
### project push {#project-push tag="command"}
|
||||||
|
|
||||||
|
Upload all available files or directories listed in the `outputs` section of
|
||||||
|
commands to a remote storage. Outputs are archived and compressed prior to
|
||||||
|
upload, and addressed in the remote storage using the output's relative path
|
||||||
|
(URL encoded), a hash of its command string and dependencies, and a hash of its
|
||||||
|
file contents. This means `push` should **never overwrite** a file in your
|
||||||
|
remote. If all the hashes match, the contents are the same and nothing happens.
|
||||||
|
If the contents are different, the new version of the file is uploaded. Deleting
|
||||||
|
obsolete files is left up to you.
|
||||||
|
|
||||||
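The content-addressing scheme described above can be sketched in plain Python. This is a simplified illustration, not spaCy's actual implementation: the function name and address layout are made up, but the key idea is the same, so an output whose command string, dependencies or contents change maps to a different address and never overwrites an earlier version.

```python
import hashlib
from urllib.parse import quote


def remote_address(output_path, command_string, dep_hashes, content_hash):
    """Hypothetical sketch of a content-addressed storage key for an output.

    The address combines the URL-encoded relative path, a hash of the
    command string plus its dependency hashes, and a hash of the file
    contents -- so a changed input or changed output lands at a new key.
    """
    # Hash the command string together with its dependency hashes, so
    # changed inputs map to a different address in the remote.
    site = hashlib.md5(
        (command_string + "".join(sorted(dep_hashes))).encode("utf8")
    ).hexdigest()
    # URL-encode the path so it is safe as a single storage key segment.
    return f"{quote(output_path, safe='')}/{site}/{content_hash}"
```

With this scheme, pushing the same output twice is a no-op (same address, same contents), while changing a hyperparameter in the command produces a new address alongside the old one.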
|
Remotes can be defined in the `remotes` section of the
|
||||||
|
[`project.yml`](/usage/projects#project-yml). Under the hood, spaCy uses the
|
||||||
|
[`smart-open`](https://github.com/RaRe-Technologies/smart_open) library to
|
||||||
|
communicate with the remote storages, so you can use any protocol that
|
||||||
|
`smart-open` supports, including [S3](https://aws.amazon.com/s3/),
|
||||||
|
[Google Cloud Storage](https://cloud.google.com/storage), SSH and more, although
|
||||||
|
you may need to install extra dependencies to use certain protocols.
|
||||||
|
|
||||||
|
```cli
|
||||||
|
$ python -m spacy project push [remote] [project_dir]
|
||||||
|
```
|
||||||
|
|
||||||
|
> #### Example
|
||||||
|
>
|
||||||
|
> ```cli
|
||||||
|
> $ python -m spacy project push my_bucket
|
||||||
|
> ```
|
||||||
|
>
|
||||||
|
> ```yaml
|
||||||
|
> ### project.yml
|
||||||
|
> remotes:
|
||||||
|
> my_bucket: 's3://my-spacy-bucket'
|
||||||
|
> ```
|
||||||
|
|
||||||
|
| Name | Description |
|
||||||
|
| -------------- | --------------------------------------------------------------------------------------- |
|
||||||
|
| `remote` | The name of the remote to upload to. Defaults to `"default"`. ~~str (positional)~~ |
|
||||||
|
| `project_dir` | Path to project directory. Defaults to current working directory. ~~Path (positional)~~ |
|
||||||
|
| `--help`, `-h` | Show help message and available arguments. ~~bool (flag)~~ |
|
||||||
|
| **UPLOADS** | All project outputs that exist and are not already stored in the remote. |
|
||||||
|
|
||||||
|
### project pull {#project-pull tag="command"}
|
||||||
|
|
||||||
|
Download all files or directories listed as `outputs` for commands, if they
|
||||||
|
are not already present locally. When searching for files in the remote, `pull`
|
||||||
|
won't just look at the output path, but will also consider the **command
|
||||||
|
string** and the **hashes of the dependencies**. For instance, let's say you've
|
||||||
|
previously pushed a model checkpoint to the remote, but now you've changed some
|
||||||
|
hyper-parameters. Because you've changed the inputs to the command, if you run
|
||||||
|
`pull`, you won't retrieve the stale result. If you train your model and push
|
||||||
|
the outputs to the remote, the outputs will be saved alongside the prior
|
||||||
|
outputs, so if you change the config back, you'll be able to fetch back the
|
||||||
|
result.
|
||||||
|
|
||||||
|
Remotes can be defined in the `remotes` section of the
|
||||||
|
[`project.yml`](/usage/projects#project-yml). Under the hood, spaCy uses the
|
||||||
|
[`smart-open`](https://github.com/RaRe-Technologies/smart_open) library to
|
||||||
|
communicate with the remote storages, so you can use any protocol that
|
||||||
|
`smart-open` supports, including [S3](https://aws.amazon.com/s3/),
|
||||||
|
[Google Cloud Storage](https://cloud.google.com/storage), SSH and more, although
|
||||||
|
you may need to install extra dependencies to use certain protocols.
|
||||||
|
|
||||||
|
```cli
|
||||||
|
$ python -m spacy project pull [remote] [project_dir]
|
||||||
|
```
|
||||||
|
|
||||||
|
> #### Example
|
||||||
|
>
|
||||||
|
> ```cli
|
||||||
|
> $ python -m spacy project pull my_bucket
|
||||||
|
> ```
|
||||||
|
>
|
||||||
|
> ```yaml
|
||||||
|
> ### project.yml
|
||||||
|
> remotes:
|
||||||
|
> my_bucket: 's3://my-spacy-bucket'
|
||||||
|
> ```
|
||||||
|
|
||||||
|
| Name | Description |
|
||||||
|
| -------------- | --------------------------------------------------------------------------------------- |
|
||||||
|
| `remote` | The name of the remote to download from. Defaults to `"default"`. ~~str (positional)~~ |
|
||||||
|
| `project_dir` | Path to project directory. Defaults to current working directory. ~~Path (positional)~~ |
|
||||||
|
| `--help`, `-h` | Show help message and available arguments. ~~bool (flag)~~ |
|
||||||
|
| **DOWNLOADS** | All project outputs that do not exist locally and can be found in the remote. |
|
||||||
|
|
||||||
|
### project dvc {#project-dvc tag="command"}
|
||||||
|
|
||||||
Auto-generate [Data Version Control](https://dvc.org) (DVC) config file. Calls
|
Auto-generate [Data Version Control](https://dvc.org) (DVC) config file. Calls
|
||||||
[`dvc run`](https://dvc.org/doc/command-reference/run) with `--no-exec` under
|
[`dvc run`](https://dvc.org/doc/command-reference/run) with `--no-exec` under
|
||||||
|
@ -878,23 +953,23 @@ You'll also need to add the assets you want to track with
|
||||||
|
|
||||||
</Infobox>
|
</Infobox>
|
||||||
|
|
||||||
```bash
|
```cli
|
||||||
$ python -m spacy project dvc [project_dir] [workflow] [--force] [--verbose]
|
$ python -m spacy project dvc [project_dir] [workflow] [--force] [--verbose]
|
||||||
```
|
```
|
||||||
|
|
||||||
> #### Example
|
> #### Example
|
||||||
>
|
>
|
||||||
> ```bash
|
> ```cli
|
||||||
> git init
|
> $ git init
|
||||||
> dvc init
|
> $ dvc init
|
||||||
> python -m spacy project dvc all
|
> $ python -m spacy project dvc all
|
||||||
> ```
|
> ```
|
||||||
|
|
||||||
| Argument | Type | Description |
|
| Name | Description |
|
||||||
| ----------------- | ---------- | --------------------------------------------------------------------------------------------- |
|
| ----------------- | ----------------------------------------------------------------------------------------------------------------- |
|
||||||
| `project_dir` | positional | Path to project directory. Defaults to current working directory. |
|
| `project_dir` | Path to project directory. Defaults to current working directory. ~~Path (positional)~~ |
|
||||||
| `workflow` | positional | Name of workflow defined in `project.yml`. Defaults to first workflow if not set. |
|
| `workflow` | Name of workflow defined in `project.yml`. Defaults to first workflow if not set. ~~Optional[str] \(positional)~~ |
|
||||||
| `--force`, `-F` | flag | Force-updating config file. |
|
| `--force`, `-F` | Force-updating config file. ~~bool (flag)~~ |
|
||||||
| `--verbose`, `-V` | flag | Print more output generated by DVC. |
|
| `--verbose`, `-V` | Print more output generated by DVC. ~~bool (flag)~~ |
|
||||||
| `--help`, `-h` | flag | Show help message and available arguments. |
|
| `--help`, `-h` | Show help message and available arguments. ~~bool (flag)~~ |
|
||||||
| **CREATES** | file | A `dvc.yaml` file in the project directory, based on the steps defined in the given workflow. |
|
| **CREATES** | A `dvc.yaml` file in the project directory, based on the steps defined in the given workflow. |
|
||||||
|
|
|
@ -17,7 +17,7 @@ customize the data loading during training, you can register your own
|
||||||
or evaluation data. It takes the same arguments as the `Corpus` class and
|
or evaluation data. It takes the same arguments as the `Corpus` class and
|
||||||
returns a callable that yields [`Example`](/api/example) objects. You can
|
returns a callable that yields [`Example`](/api/example) objects. You can
|
||||||
replace it with your own registered function in the
|
replace it with your own registered function in the
|
||||||
[`@readers` registry](/api/top-level#regsitry) to customize the data loading and
|
[`@readers` registry](/api/top-level#registry) to customize the data loading and
|
||||||
streaming.
|
streaming.
|
||||||
|
|
||||||
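The registry mechanism behind this can be sketched in plain Python. This is a simplified stand-in for spaCy's catalogue-based `@readers` registry (the registry dict, decorator and reader names below are illustrative): a factory function is registered under a string name, and the training config refers to it by that name.

```python
# A plain-dict stand-in for spaCy's @readers registry.
readers = {}


def register_reader(name):
    """Register a reader factory under a string name."""
    def wrapper(func):
        readers[name] = func
        return func
    return wrapper


@register_reader("custom_corpus.v1")
def create_reader(path, gold_preproc=False, max_length=0, limit=0):
    # The factory returns a callable that takes the nlp object and
    # yields Example objects (here it yields nothing, as a placeholder).
    def read_examples(nlp):
        yield from ()
    return read_examples


# The config system would look up the factory by name and call it
# with the arguments from the config block.
make_reader = readers["custom_corpus.v1"]
```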
> #### Example config
|
> #### Example config
|
||||||
|
@ -28,18 +28,18 @@ streaming.
|
||||||
>
|
>
|
||||||
> [training.train_corpus]
|
> [training.train_corpus]
|
||||||
> @readers = "spacy.Corpus.v1"
|
> @readers = "spacy.Corpus.v1"
|
||||||
> path = ${paths:train}
|
> path = ${paths.train}
|
||||||
> gold_preproc = false
|
> gold_preproc = false
|
||||||
> max_length = 0
|
> max_length = 0
|
||||||
> limit = 0
|
> limit = 0
|
||||||
> ```
|
> ```
|
||||||
|
|
||||||
| Name | Type | Description |
|
| Name | Description |
|
||||||
| --------------- | ------ | ----------------------------------------------------------------------------------------------------------------------------------------------- |
|
| --------------- | -------------------------------------------------------------------------------------------------------------------------------------------------------- |
|
||||||
| `path` | `Path` | The directory or filename to read from. Expects data in spaCy's binary [`.spacy` format](/api/data-formats#binary-training). |
|
| `path` | The directory or filename to read from. Expects data in spaCy's binary [`.spacy` format](/api/data-formats#binary-training). ~~Path~~ |
|
||||||
| `gold_preproc` | bool | Whether to set up the Example object with gold-standard sentences and tokens for the predictions. See [`Corpus`](/api/corpus#init) for details. |
|
| `gold_preproc` | Whether to set up the Example object with gold-standard sentences and tokens for the predictions. See [`Corpus`](/api/corpus#init) for details. ~~bool~~ |
|
||||||
| `max_length` | int | Maximum document length. Longer documents will be split into sentences, if sentence boundaries are available. Defaults to `0` for no limit. |
|
| `max_length` | Maximum document length. Longer documents will be split into sentences, if sentence boundaries are available. Defaults to `0` for no limit. ~~int~~ |
|
||||||
| `limit` | int | Limit corpus to a subset of examples, e.g. for debugging. Defaults to `0` for no limit. |
|
| `limit` | Limit corpus to a subset of examples, e.g. for debugging. Defaults to `0` for no limit. ~~int~~ |
|
||||||
|
|
||||||
```python
|
```python
|
||||||
https://github.com/explosion/spaCy/blob/develop/spacy/gold/corpus.py
|
https://github.com/explosion/spaCy/blob/develop/spacy/gold/corpus.py
|
||||||
|
@ -67,13 +67,13 @@ train/test skew.
|
||||||
> corpus = Corpus("./data", limit=10)
|
> corpus = Corpus("./data", limit=10)
|
||||||
> ```
|
> ```
|
||||||
|
|
||||||
| Name | Type | Description |
|
| Name | Description |
|
||||||
| --------------- | ------------ | ------------------------------------------------------------------------------------------------------------------------------------------- |
|
| --------------- | --------------------------------------------------------------------------------------------------------------------------------------------------- |
|
||||||
| `path` | str / `Path` | The directory or filename to read from. |
|
| `path` | The directory or filename to read from. ~~Union[str, Path]~~ |
|
||||||
| _keyword-only_ | | |
|
| _keyword-only_ | |
|
||||||
| `gold_preproc` | bool | Whether to set up the Example object with gold-standard sentences and tokens for the predictions. Defaults to `False`. |
|
| `gold_preproc` | Whether to set up the Example object with gold-standard sentences and tokens for the predictions. Defaults to `False`. ~~bool~~ |
|
||||||
| `max_length` | int | Maximum document length. Longer documents will be split into sentences, if sentence boundaries are available. Defaults to `0` for no limit. |
|
| `max_length` | Maximum document length. Longer documents will be split into sentences, if sentence boundaries are available. Defaults to `0` for no limit. ~~int~~ |
|
||||||
| `limit` | int | Limit corpus to a subset of examples, e.g. for debugging. Defaults to `0` for no limit. |
|
| `limit` | Limit corpus to a subset of examples, e.g. for debugging. Defaults to `0` for no limit. ~~int~~ |
|
||||||
|
|
||||||
## Corpus.\_\_call\_\_ {#call tag="method"}
|
## Corpus.\_\_call\_\_ {#call tag="method"}
|
||||||
|
|
||||||
|
@ -90,7 +90,7 @@ Yield examples from the data.
|
||||||
> train_data = corpus(nlp)
|
> train_data = corpus(nlp)
|
||||||
> ```
|
> ```
|
||||||
|
|
||||||
| Name | Type | Description |
|
| Name | Description |
|
||||||
| ---------- | ---------- | ------------------------- |
|
| ---------- | -------------------------------------- |
|
||||||
| `nlp` | `Language` | The current `nlp` object. |
|
| `nlp` | The current `nlp` object. ~~Language~~ |
|
||||||
| **YIELDS** | `Example` | The examples. |
|
| **YIELDS** | The examples. ~~Example~~ |
|
||||||
|
|
|
@ -23,13 +23,13 @@ accessed from Python. For the Python documentation, see [`Doc`](/api/doc).
|
||||||
|
|
||||||
### Attributes {#doc_attributes}
|
### Attributes {#doc_attributes}
|
||||||
|
|
||||||
| Name | Type | Description |
|
| Name | Description |
|
||||||
| ------------ | ------------ | ----------------------------------------------------------------------------------------- |
|
| ------------ | -------------------------------------------------------------------------------------------------------- |
|
||||||
| `mem` | `cymem.Pool` | A memory pool. Allocated memory will be freed once the `Doc` object is garbage collected. |
|
| `mem` | A memory pool. Allocated memory will be freed once the `Doc` object is garbage collected. ~~cymem.Pool~~ |
|
||||||
| `vocab` | `Vocab` | A reference to the shared `Vocab` object. |
|
| `vocab` | A reference to the shared `Vocab` object. ~~Vocab~~ |
|
||||||
| `c` | `TokenC*` | A pointer to a [`TokenC`](/api/cython-structs#tokenc) struct. |
|
| `c` | A pointer to a [`TokenC`](/api/cython-structs#tokenc) struct. ~~TokenC\*~~ |
|
||||||
| `length` | `int` | The number of tokens in the document. |
|
| `length` | The number of tokens in the document. ~~int~~ |
|
||||||
| `max_length` | `int` | The underlying size of the `Doc.c` array. |
|
| `max_length` | The underlying size of the `Doc.c` array. ~~int~~ |
|
||||||
|
|
||||||
### Doc.push_back {#doc_push_back tag="method"}
|
### Doc.push_back {#doc_push_back tag="method"}
|
||||||
|
|
||||||
|
@ -50,10 +50,10 @@ Append a token to the `Doc`. The token can be provided as a
|
||||||
> assert doc.text == "hello "
|
> assert doc.text == "hello "
|
||||||
> ```
|
> ```
|
||||||
|
|
||||||
| Name | Type | Description |
|
| Name | Description |
|
||||||
| ------------ | --------------- | ----------------------------------------- |
|
| ------------ | -------------------------------------------------- |
|
||||||
| `lex_or_tok` | `LexemeOrToken` | The word to append to the `Doc`. |
|
| `lex_or_tok` | The word to append to the `Doc`. ~~LexemeOrToken~~ |
|
||||||
| `has_space` | `bint` | Whether the word has trailing whitespace. |
|
| `has_space` | Whether the word has trailing whitespace. ~~bint~~ |
|
||||||
|
|
||||||
## Token {#token tag="cdef class" source="spacy/tokens/token.pxd"}
|
## Token {#token tag="cdef class" source="spacy/tokens/token.pxd"}
|
||||||
|
|
||||||
|
@ -70,12 +70,12 @@ accessed from Python. For the Python documentation, see [`Token`](/api/token).
|
||||||
|
|
||||||
### Attributes {#token_attributes}
|
### Attributes {#token_attributes}
|
||||||
|
|
||||||
| Name | Type | Description |
|
| Name | Description |
|
||||||
| ------- | --------- | ------------------------------------------------------------- |
|
| ------- | -------------------------------------------------------------------------- |
|
||||||
| `vocab` | `Vocab` | A reference to the shared `Vocab` object. |
|
| `vocab` | A reference to the shared `Vocab` object. ~~Vocab~~ |
|
||||||
| `c` | `TokenC*` | A pointer to a [`TokenC`](/api/cython-structs#tokenc) struct. |
|
| `c` | A pointer to a [`TokenC`](/api/cython-structs#tokenc) struct. ~~TokenC\*~~ |
|
||||||
| `i` | `int` | The offset of the token within the document. |
|
| `i` | The offset of the token within the document. ~~int~~ |
|
||||||
| `doc` | `Doc` | The parent document. |
|
| `doc` | The parent document. ~~Doc~~ |
|
||||||
|
|
||||||
### Token.cinit {#token_cinit tag="method"}
|
### Token.cinit {#token_cinit tag="method"}
|
||||||
|
|
||||||
|
@ -87,12 +87,12 @@ Create a `Token` object from a `TokenC*` pointer.
|
||||||
> token = Token.cinit(&doc.c[3], doc, 3)
|
> token = Token.cinit(&doc.c[3], doc, 3)
|
||||||
> ```
|
> ```
|
||||||
|
|
||||||
| Name | Type | Description |
|
| Name | Description |
|
||||||
| -------- | --------- | ------------------------------------------------------------ |
|
| -------- | -------------------------------------------------------------------------- |
|
||||||
| `vocab` | `Vocab` | A reference to the shared `Vocab`. |
|
| `vocab` | A reference to the shared `Vocab`. ~~Vocab~~ |
|
||||||
| `c` | `TokenC*` | A pointer to a [`TokenC`](/api/cython-structs#tokenc)struct. |
|
| `c` | A pointer to a [`TokenC`](/api/cython-structs#tokenc) struct. ~~TokenC\*~~ |
|
||||||
| `offset` | `int` | The offset of the token within the document. |
|
| `offset` | The offset of the token within the document. ~~int~~ |
|
||||||
| `doc` | `Doc` | The parent document. |
|
| `doc`    | The parent document. ~~Doc~~                                               |
|
||||||
|
|
||||||
## Span {#span tag="cdef class" source="spacy/tokens/span.pxd"}
|
## Span {#span tag="cdef class" source="spacy/tokens/span.pxd"}
|
||||||
|
|
||||||
|
@ -107,14 +107,14 @@ accessed from Python. For the Python documentation, see [`Span`](/api/span).
|
||||||
|
|
||||||
### Attributes {#span_attributes}
|
### Attributes {#span_attributes}
|
||||||
|
|
||||||
| Name | Type | Description |
|
| Name | Description |
|
||||||
| ------------ | -------------------------------------- | ------------------------------------------------------- |
|
| ------------ | ----------------------------------------------------------------------------- |
|
||||||
| `doc` | `Doc` | The parent document. |
|
| `doc` | The parent document. ~~Doc~~ |
|
||||||
| `start` | `int` | The index of the first token of the span. |
|
| `start` | The index of the first token of the span. ~~int~~ |
|
||||||
| `end` | `int` | The index of the first token after the span. |
|
| `end` | The index of the first token after the span. ~~int~~ |
|
||||||
| `start_char` | `int` | The index of the first character of the span. |
|
| `start_char` | The index of the first character of the span. ~~int~~ |
|
||||||
| `end_char` | `int` | The index of the last character of the span. |
|
| `end_char` | The index of the last character of the span. ~~int~~ |
|
||||||
| `label` | <Abbr title="uint64_t">`attr_t`</Abbr> | A label to attach to the span, e.g. for named entities. |
|
| `label` | A label to attach to the span, e.g. for named entities. ~~attr_t (uint64_t)~~ |
|
||||||
|
|
||||||
## Lexeme {#lexeme tag="cdef class" source="spacy/lexeme.pxd"}
|
## Lexeme {#lexeme tag="cdef class" source="spacy/lexeme.pxd"}
|
||||||
|
|
||||||
|
@ -129,11 +129,11 @@ accessed from Python. For the Python documentation, see [`Lexeme`](/api/lexeme).
|
||||||
|
|
||||||
### Attributes {#lexeme_attributes}
|
### Attributes {#lexeme_attributes}
|
||||||
|
|
||||||
| Name | Type | Description |
|
| Name | Description |
|
||||||
| ------- | -------------------------------------- | --------------------------------------------------------------- |
|
| ------- | ----------------------------------------------------------------------------- |
|
||||||
| `c` | `LexemeC*` | A pointer to a [`LexemeC`](/api/cython-structs#lexemec) struct. |
|
| `c` | A pointer to a [`LexemeC`](/api/cython-structs#lexemec) struct. ~~LexemeC\*~~ |
|
||||||
| `vocab` | `Vocab` | A reference to the shared `Vocab` object. |
|
| `vocab` | A reference to the shared `Vocab` object. ~~Vocab~~ |
|
||||||
| `orth` | <Abbr title="uint64_t">`attr_t`</Abbr> | ID of the verbatim text content. |
|
| `orth` | ID of the verbatim text content. ~~attr_t (uint64_t)~~ |
|
||||||
|
|
||||||
## Vocab {#vocab tag="cdef class" source="spacy/vocab.pxd"}
|
## Vocab {#vocab tag="cdef class" source="spacy/vocab.pxd"}
|
||||||
|
|
||||||
|
@ -149,11 +149,11 @@ accessed from Python. For the Python documentation, see [`Vocab`](/api/vocab).
|
||||||
|
|
||||||
### Attributes {#vocab_attributes}
|
### Attributes {#vocab_attributes}
|
||||||
|
|
||||||
| Name | Type | Description |
|
| Name | Description |
|
||||||
| --------- | ------------- | ------------------------------------------------------------------------------------------- |
|
| --------- | ---------------------------------------------------------------------------------------------------------- |
|
||||||
| `mem` | `cymem.Pool` | A memory pool. Allocated memory will be freed once the `Vocab` object is garbage collected. |
|
| `mem` | A memory pool. Allocated memory will be freed once the `Vocab` object is garbage collected. ~~cymem.Pool~~ |
|
||||||
| `strings` | `StringStore` | A `StringStore` that maps string to hash values and vice versa. |
|
| `strings` | A `StringStore` that maps string to hash values and vice versa. ~~StringStore~~ |
|
||||||
| `length` | `int` | The number of entries in the vocabulary. |
|
| `length` | The number of entries in the vocabulary. ~~int~~ |
|
||||||
|
|
||||||
### Vocab.get {#vocab_get tag="method"}
|
### Vocab.get {#vocab_get tag="method"}
|
||||||
|
|
||||||
|
@ -166,11 +166,11 @@ vocabulary.
|
||||||
> lexeme = vocab.get(vocab.mem, "hello")
|
> lexeme = vocab.get(vocab.mem, "hello")
|
||||||
> ```
|
> ```
|
||||||
|
|
||||||
| Name | Type | Description |
|
| Name | Description |
|
||||||
| ----------- | ---------------- | ------------------------------------------------------------------------------------------- |
|
| ----------- | ---------------------------------------------------------------------------------------------------------- |
|
||||||
| `mem` | `cymem.Pool` | A memory pool. Allocated memory will be freed once the `Vocab` object is garbage collected. |
|
| `mem` | A memory pool. Allocated memory will be freed once the `Vocab` object is garbage collected. ~~cymem.Pool~~ |
|
||||||
| `string` | str | The string of the word to look up. |
|
| `string` | The string of the word to look up. ~~str~~ |
|
||||||
| **RETURNS** | `const LexemeC*` | The lexeme in the vocabulary. |
|
| **RETURNS** | The lexeme in the vocabulary. ~~const LexemeC\*~~ |
|
||||||
|
|
||||||
### Vocab.get_by_orth {#vocab_get_by_orth tag="method"}
|
### Vocab.get_by_orth {#vocab_get_by_orth tag="method"}
|
||||||
|
|
||||||
|
@ -183,11 +183,11 @@ vocabulary.
|
||||||
> lexeme = vocab.get_by_orth(doc[0].lex.norm)
|
> lexeme = vocab.get_by_orth(doc[0].lex.norm)
|
||||||
> ```
|
> ```
|
||||||
|
|
||||||
| Name | Type | Description |
|
| Name | Description |
|
||||||
| ----------- | -------------------------------------- | ------------------------------------------------------------------------------------------- |
|
| ----------- | ---------------------------------------------------------------------------------------------------------- |
|
||||||
| `mem` | `cymem.Pool` | A memory pool. Allocated memory will be freed once the `Vocab` object is garbage collected. |
|
| `mem` | A memory pool. Allocated memory will be freed once the `Vocab` object is garbage collected. ~~cymem.Pool~~ |
|
||||||
| `orth` | <Abbr title="uint64_t">`attr_t`</Abbr> | ID of the verbatim text content. |
|
| `orth` | ID of the verbatim text content. ~~attr_t (uint64_t)~~ |
|
||||||
| **RETURNS** | `const LexemeC*` | The lexeme in the vocabulary. |
|
| **RETURNS** | The lexeme in the vocabulary. ~~const LexemeC\*~~ |
|
||||||
|
|
||||||
## StringStore {#stringstore tag="cdef class" source="spacy/strings.pxd"}
|
## StringStore {#stringstore tag="cdef class" source="spacy/strings.pxd"}
|
||||||
|
|
||||||
|
@ -203,7 +203,7 @@ accessed from Python. For the Python documentation, see
|
||||||
|
|
||||||
### Attributes {#stringstore_attributes}
|
### Attributes {#stringstore_attributes}
|
||||||
|
|
||||||
| Name | Type | Description |
|
| Name | Description |
|
||||||
| ------ | ------------------------------------------------------ | ------------------------------------------------------------------------------------------------ |
|
| ------ | ---------------------------------------------------------------------------------------------------------------- |
|
||||||
| `mem` | `cymem.Pool` | A memory pool. Allocated memory will be freed once the`StringStore` object is garbage collected. |
|
| `mem` | A memory pool. Allocated memory will be freed once the `StringStore` object is garbage collected. ~~cymem.Pool~~ |
|
||||||
| `keys` | <Abbr title="vector[uint64_t]">`vector[hash_t]`</Abbr> | A list of hash values in the `StringStore`. |
|
| `keys` | A list of hash values in the `StringStore`. ~~vector[hash_t] \(vector[uint64_t])~~ |
|
||||||
|
|
|
@ -18,26 +18,26 @@ Cython data container for the `Token` object.
|
||||||
> token_ptr = &doc.c[3]
|
> token_ptr = &doc.c[3]
|
||||||
> ```
|
> ```
|
||||||
|
|
||||||
| Name | Type | Description |
|
| Name | Description |
|
||||||
| ------------ | -------------------------------------- | ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
|
| ------------ | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
|
||||||
| `lex` | `const LexemeC*` | A pointer to the lexeme for the token. |
|
| `lex` | A pointer to the lexeme for the token. ~~const LexemeC\*~~ |
|
||||||
| `morph` | `uint64_t` | An ID allowing lookup of morphological attributes. |
|
| `morph` | An ID allowing lookup of morphological attributes. ~~uint64_t~~ |
|
||||||
| `pos` | `univ_pos_t` | Coarse-grained part-of-speech tag. |
|
| `pos` | Coarse-grained part-of-speech tag. ~~univ_pos_t~~ |
|
||||||
| `spacy` | `bint` | A binary value indicating whether the token has trailing whitespace. |
|
| `spacy` | A binary value indicating whether the token has trailing whitespace. ~~bint~~ |
|
||||||
| `tag` | <Abbr title="uint64_t">`attr_t`</Abbr> | Fine-grained part-of-speech tag. |
|
| `tag` | Fine-grained part-of-speech tag. ~~attr_t (uint64_t)~~ |
|
||||||
| `idx` | `int` | The character offset of the token within the parent document. |
|
| `idx` | The character offset of the token within the parent document. ~~int~~ |
|
||||||
| `lemma` | <Abbr title="uint64_t">`attr_t`</Abbr> | Base form of the token, with no inflectional suffixes. |
|
| `lemma` | Base form of the token, with no inflectional suffixes. ~~attr_t (uint64_t)~~ |
|
||||||
| `sense` | <Abbr title="uint64_t">`attr_t`</Abbr> | Space for storing a word sense ID, currently unused. |
|
| `sense` | Space for storing a word sense ID, currently unused. ~~attr_t (uint64_t)~~ |
|
||||||
| `head` | `int` | Offset of the syntactic parent relative to the token. |
|
| `head` | Offset of the syntactic parent relative to the token. ~~int~~ |
|
||||||
| `dep` | <Abbr title="uint64_t">`attr_t`</Abbr> | Syntactic dependency relation. |
|
| `dep` | Syntactic dependency relation. ~~attr_t (uint64_t)~~ |
|
||||||
| `l_kids` | `uint32_t` | Number of left children. |
|
| `l_kids` | Number of left children. ~~uint32_t~~ |
|
||||||
| `r_kids` | `uint32_t` | Number of right children. |
|
| `r_kids` | Number of right children. ~~uint32_t~~ |
|
||||||
| `l_edge` | `uint32_t` | Offset of the leftmost token of this token's syntactic descendants. |
|
| `l_edge` | Offset of the leftmost token of this token's syntactic descendants. ~~uint32_t~~ |
|
||||||
| `r_edge` | `uint32_t` | Offset of the rightmost token of this token's syntactic descendants. |
|
| `r_edge` | Offset of the rightmost token of this token's syntactic descendants. ~~uint32_t~~ |
|
||||||
| `sent_start` | `int` | Ternary value indicating whether the token is the first word of a sentence. `0` indicates a missing value, `-1` indicates `False` and `1` indicates `True`. The default value, 0, is interpreted as no sentence break. Sentence boundary detectors will usually set 0 for all tokens except tokens that follow a sentence boundary. |
|
| `sent_start` | Ternary value indicating whether the token is the first word of a sentence. `0` indicates a missing value, `-1` indicates `False` and `1` indicates `True`. The default value, 0, is interpreted as no sentence break. Sentence boundary detectors will usually set 0 for all tokens except tokens that follow a sentence boundary. ~~int~~ |
|
||||||
| `ent_iob` | `int` | IOB code of named entity tag. `0` indicates a missing value, `1` indicates `I`, `2` indicates `0` and `3` indicates `B`. |
|
| `ent_iob`    | IOB code of named entity tag. `0` indicates a missing value, `1` indicates `I`, `2` indicates `O` and `3` indicates `B`. ~~int~~                                                                                                                                                                                                          |
|
||||||
| `ent_type` | <Abbr title="uint64_t">`attr_t`</Abbr> | Named entity type. |
|
| `ent_type` | Named entity type. ~~attr_t (uint64_t)~~ |
|
||||||
| `ent_id` | <Abbr title="uint64_t">`attr_t`</Abbr> | ID of the entity the token is an instance of, if any. Currently not used, but potentially for coreference resolution. |
|
| `ent_id` | ID of the entity the token is an instance of, if any. Currently not used, but potentially for coreference resolution. ~~attr_t (uint64_t)~~ |
|
||||||
|
|
||||||
### Token.get_struct_attr {#token_get_struct_attr tag="staticmethod, nogil" source="spacy/tokens/token.pxd"}
|
### Token.get_struct_attr {#token_get_struct_attr tag="staticmethod, nogil" source="spacy/tokens/token.pxd"}
|
||||||
|
|
||||||
|
@ -52,11 +52,11 @@ Get the value of an attribute from the `TokenC` struct by attribute ID.
|
||||||
> is_alpha = Token.get_struct_attr(&doc.c[3], IS_ALPHA)
|
> is_alpha = Token.get_struct_attr(&doc.c[3], IS_ALPHA)
|
||||||
> ```
|
> ```
|
||||||
|
|
||||||
| Name | Type | Description |
|
| Name | Description |
|
||||||
| ----------- | -------------------------------------- | -------------------------------------------------------------------------------------- |
|
| ----------- | ---------------------------------------------------------------------------------------------------- |
|
||||||
| `token` | `const TokenC*` | A pointer to a `TokenC` struct. |
|
| `token` | A pointer to a `TokenC` struct. ~~const TokenC\*~~ |
|
||||||
| `feat_name` | `attr_id_t` | The ID of the attribute to look up. The attributes are enumerated in `spacy.typedefs`. |
|
| `feat_name` | The ID of the attribute to look up. The attributes are enumerated in `spacy.typedefs`. ~~attr_id_t~~ |
|
||||||
| **RETURNS** | <Abbr title="uint64_t">`attr_t`</Abbr> | The value of the attribute. |
|
| **RETURNS** | The value of the attribute. ~~attr_t (uint64_t)~~ |
|
||||||
|
|
||||||
### Token.set_struct_attr {#token_set_struct_attr tag="staticmethod, nogil" source="spacy/tokens/token.pxd"}
|
### Token.set_struct_attr {#token_set_struct_attr tag="staticmethod, nogil" source="spacy/tokens/token.pxd"}
|
||||||
|
|
||||||
|
@ -72,11 +72,11 @@ Set the value of an attribute of the `TokenC` struct by attribute ID.
|
||||||
> Token.set_struct_attr(token, TAG, 0)
|
> Token.set_struct_attr(token, TAG, 0)
|
||||||
> ```
|
> ```
|
||||||
|
|
||||||
| Name | Type | Description |
|
| Name | Description |
|
||||||
| ----------- | -------------------------------------- | -------------------------------------------------------------------------------------- |
|
| ----------- | ---------------------------------------------------------------------------------------------------- |
|
||||||
| `token` | `const TokenC*` | A pointer to a `TokenC` struct. |
|
| `token` | A pointer to a `TokenC` struct. ~~const TokenC\*~~ |
|
||||||
| `feat_name` | `attr_id_t` | The ID of the attribute to look up. The attributes are enumerated in `spacy.typedefs`. |
|
| `feat_name` | The ID of the attribute to look up. The attributes are enumerated in `spacy.typedefs`. ~~attr_id_t~~ |
|
||||||
| `value` | <Abbr title="uint64_t">`attr_t`</Abbr> | The value to set. |
|
| `value` | The value to set. ~~attr_t (uint64_t)~~ |
|
||||||
|
|
||||||
### token_by_start {#token_by_start tag="function" source="spacy/tokens/doc.pxd"}
|
### token_by_start {#token_by_start tag="function" source="spacy/tokens/doc.pxd"}
|
||||||
|
|
||||||
|
@@ -93,12 +93,12 @@ Find a token in a `TokenC*` array by the offset of its first character.
|
||||||
> assert token_by_start(doc.c, doc.length, 4) == -1
|
> assert token_by_start(doc.c, doc.length, 4) == -1
|
||||||
> ```
|
> ```
|
||||||
|
|
||||||
| Name | Type | Description |
|
| Name | Description |
|
||||||
| ------------ | --------------- | --------------------------------------------------------- |
|
| ------------ | ----------------------------------------------------------------- |
|
||||||
| `tokens` | `const TokenC*` | A `TokenC*` array. |
|
| `tokens` | A `TokenC*` array. ~~const TokenC\*~~ |
|
||||||
| `length` | `int` | The number of tokens in the array. |
|
| `length` | The number of tokens in the array. ~~int~~ |
|
||||||
| `start_char` | `int` | The start index to search for. |
|
| `start_char` | The start index to search for. ~~int~~ |
|
||||||
| **RETURNS** | `int` | The index of the token in the array or `-1` if not found. |
|
| **RETURNS** | The index of the token in the array or `-1` if not found. ~~int~~ |
|
||||||
|
|
||||||
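The lookup described above can be sketched in plain Python, with hypothetical `(idx, length)` pairs standing in for the character-offset fields of the `TokenC` struct (an illustration of the semantics, not the actual Cython code):

```python
# Illustrative sketch of token_by_start: find the index of the token
# whose first character sits at start_char, or -1 if no token starts
# there. Each token is modeled as an (idx, length) pair, a stand-in
# for the TokenC struct's character offset and length fields.

def token_by_start(tokens, length, start_char):
    for i in range(length):
        idx, tok_len = tokens[i]
        if idx == start_char:
            return i
        if idx > start_char:
            break  # tokens are ordered by offset, so we can stop early
    return -1

tokens = [(0, 4), (5, 2), (8, 4)]  # "Give it back" as (idx, length) pairs
assert token_by_start(tokens, len(tokens), 5) == 1
assert token_by_start(tokens, len(tokens), 4) == -1  # no token starts at 4
```

`token_by_end` works the same way, except it compares `idx + length` against the requested end offset.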
### token_by_end {#token_by_end tag="function" source="spacy/tokens/doc.pxd"}
|
### token_by_end {#token_by_end tag="function" source="spacy/tokens/doc.pxd"}
|
||||||
|
|
||||||
|
@@ -115,12 +115,12 @@ Find a token in a `TokenC*` array by the offset of its final character.
|
||||||
> assert token_by_end(doc.c, doc.length, 1) == -1
|
> assert token_by_end(doc.c, doc.length, 1) == -1
|
||||||
> ```
|
> ```
|
||||||
|
|
||||||
| Name | Type | Description |
|
| Name | Description |
|
||||||
| ----------- | --------------- | --------------------------------------------------------- |
|
| ----------- | ----------------------------------------------------------------- |
|
||||||
| `tokens` | `const TokenC*` | A `TokenC*` array. |
|
| `tokens` | A `TokenC*` array. ~~const TokenC\*~~ |
|
||||||
| `length` | `int` | The number of tokens in the array. |
|
| `length` | The number of tokens in the array. ~~int~~ |
|
||||||
| `end_char` | `int` | The end index to search for. |
|
| `end_char` | The end index to search for. ~~int~~ |
|
||||||
| **RETURNS** | `int` | The index of the token in the array or `-1` if not found. |
|
| **RETURNS** | The index of the token in the array or `-1` if not found. ~~int~~ |
|
||||||
|
|
||||||
### set_children_from_heads {#set_children_from_heads tag="function" source="spacy/tokens/doc.pxd"}
|
### set_children_from_heads {#set_children_from_heads tag="function" source="spacy/tokens/doc.pxd"}
|
||||||
|
|
||||||
|
@@ -143,10 +143,10 @@ attribute, in order to make the parse tree navigation consistent.
|
||||||
> assert doc.c[3].l_kids == 1
|
> assert doc.c[3].l_kids == 1
|
||||||
> ```
|
> ```
|
||||||
|
|
||||||
| Name | Type | Description |
|
| Name | Description |
|
||||||
| -------- | --------------- | ---------------------------------- |
|
| -------- | ------------------------------------------ |
|
||||||
| `tokens` | `const TokenC*` | A `TokenC*` array. |
|
| `tokens` | A `TokenC*` array. ~~const TokenC\*~~ |
|
||||||
| `length` | `int` | The number of tokens in the array. |
|
| `length` | The number of tokens in the array. ~~int~~ |
|
||||||
|
|
||||||
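What the function computes can be sketched in plain Python, using relative head offsets as stored in `TokenC.head` (a toy parse for illustration, not spaCy's actual implementation):

```python
# Sketch of what set_children_from_heads derives: given each token's
# head as a *relative* offset (as in TokenC.head), count the left and
# right children (l_kids / r_kids) of every token.

def children_from_heads(heads):
    n = len(heads)
    l_kids = [0] * n
    r_kids = [0] * n
    for i, rel in enumerate(heads):
        head = i + rel  # absolute index of the head token
        if head > i:
            l_kids[head] += 1  # token i is a left child of its head
        elif head < i:
            r_kids[head] += 1  # token i is a right child of its head
    return l_kids, r_kids

# Toy parse: token 0 is the root (offset 0 = self).
heads = [0, -1, -2, 1, 0]
l_kids, r_kids = children_from_heads(heads)
assert r_kids[0] == 2  # tokens 1 and 2 attach to token 0 on the right
```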
## LexemeC {#lexemec tag="C struct" source="spacy/structs.pxd"}
|
## LexemeC {#lexemec tag="C struct" source="spacy/structs.pxd"}
|
||||||
|
|
||||||
|
@@ -160,17 +160,17 @@ struct.
|
||||||
> lex = doc.c[3].lex
|
> lex = doc.c[3].lex
|
||||||
> ```
|
> ```
|
||||||
|
|
||||||
| Name | Type | Description |
|
| Name | Description |
|
||||||
| ----------- | --------------------------------------- | -------------------------------------------------------------------------------------------------------------------------- |
|
| -------- | ------------------------------------------------------------------------------------------------------------------------------------------------ |
|
||||||
| `flags` | <Abbr title="uint64_t">`flags_t`</Abbr> | Bit-field for binary lexical flag values. |
|
| `flags` | Bit-field for binary lexical flag values. ~~flags_t (uint64_t)~~ |
|
||||||
| `id` | <Abbr title="uint64_t">`attr_t`</Abbr> | Usually used to map lexemes to rows in a matrix, e.g. for word vectors. Does not need to be unique, so currently misnamed. |
|
| `id` | Usually used to map lexemes to rows in a matrix, e.g. for word vectors. Does not need to be unique, so currently misnamed. ~~attr_t (uint64_t)~~ |
|
||||||
| `length` | <Abbr title="uint64_t">`attr_t`</Abbr> | Number of unicode characters in the lexeme. |
|
| `length` | Number of unicode characters in the lexeme. ~~attr_t (uint64_t)~~ |
|
||||||
| `orth` | <Abbr title="uint64_t">`attr_t`</Abbr> | ID of the verbatim text content. |
|
| `orth` | ID of the verbatim text content. ~~attr_t (uint64_t)~~ |
|
||||||
| `lower` | <Abbr title="uint64_t">`attr_t`</Abbr> | ID of the lowercase form of the lexeme. |
|
| `lower` | ID of the lowercase form of the lexeme. ~~attr_t (uint64_t)~~ |
|
||||||
| `norm` | <Abbr title="uint64_t">`attr_t`</Abbr> | ID of the lexeme's norm, i.e. a normalized form of the text. |
|
| `norm` | ID of the lexeme's norm, i.e. a normalized form of the text. ~~attr_t (uint64_t)~~ |
|
||||||
| `shape` | <Abbr title="uint64_t">`attr_t`</Abbr> | Transform of the lexeme's string, to show orthographic features. |
|
| `shape` | Transform of the lexeme's string, to show orthographic features. ~~attr_t (uint64_t)~~ |
|
||||||
| `prefix` | <Abbr title="uint64_t">`attr_t`</Abbr> | Length-N substring from the start of the lexeme. Defaults to `N=1`. |
|
| `prefix` | Length-N substring from the start of the lexeme. Defaults to `N=1`. ~~attr_t (uint64_t)~~ |
|
||||||
| `suffix` | <Abbr title="uint64_t">`attr_t`</Abbr> | Length-N substring from the end of the lexeme. Defaults to `N=3`. |
|
| `suffix` | Length-N substring from the end of the lexeme. Defaults to `N=3`. ~~attr_t (uint64_t)~~ |
|
||||||
|
|
||||||
### Lexeme.get_struct_attr {#lexeme_get_struct_attr tag="staticmethod, nogil" source="spacy/lexeme.pxd"}
|
### Lexeme.get_struct_attr {#lexeme_get_struct_attr tag="staticmethod, nogil" source="spacy/lexeme.pxd"}
|
||||||
|
|
||||||
|
@@ -186,11 +186,11 @@ Get the value of an attribute from the `LexemeC` struct by attribute ID.
|
||||||
> is_alpha = Lexeme.get_struct_attr(lexeme, IS_ALPHA)
|
> is_alpha = Lexeme.get_struct_attr(lexeme, IS_ALPHA)
|
||||||
> ```
|
> ```
|
||||||
|
|
||||||
| Name | Type | Description |
|
| Name | Description |
|
||||||
| ----------- | -------------------------------------- | -------------------------------------------------------------------------------------- |
|
| ----------- | ---------------------------------------------------------------------------------------------------- |
|
||||||
| `lex` | `const LexemeC*` | A pointer to a `LexemeC` struct. |
|
| `lex` | A pointer to a `LexemeC` struct. ~~const LexemeC\*~~ |
|
||||||
| `feat_name` | `attr_id_t` | The ID of the attribute to look up. The attributes are enumerated in `spacy.typedefs`. |
|
| `feat_name` | The ID of the attribute to look up. The attributes are enumerated in `spacy.typedefs`. ~~attr_id_t~~ |
|
||||||
| **RETURNS** | <Abbr title="uint64_t">`attr_t`</Abbr> | The value of the attribute. |
|
| **RETURNS** | The value of the attribute. ~~attr_t (uint64_t)~~ |
|
||||||
|
|
||||||
### Lexeme.set_struct_attr {#lexeme_set_struct_attr tag="staticmethod, nogil" source="spacy/lexeme.pxd"}
|
### Lexeme.set_struct_attr {#lexeme_set_struct_attr tag="staticmethod, nogil" source="spacy/lexeme.pxd"}
|
||||||
|
|
||||||
|
@@ -206,11 +206,11 @@ Set the value of an attribute of the `LexemeC` struct by attribute ID.
|
||||||
> Lexeme.set_struct_attr(lexeme, NORM, lexeme.lower)
|
> Lexeme.set_struct_attr(lexeme, NORM, lexeme.lower)
|
||||||
> ```
|
> ```
|
||||||
|
|
||||||
| Name | Type | Description |
|
| Name | Description |
|
||||||
| ----------- | -------------------------------------- | -------------------------------------------------------------------------------------- |
|
| ----------- | ---------------------------------------------------------------------------------------------------- |
|
||||||
| `lex` | `const LexemeC*` | A pointer to a `LexemeC` struct. |
|
| `lex` | A pointer to a `LexemeC` struct. ~~const LexemeC\*~~ |
|
||||||
| `feat_name` | `attr_id_t` | The ID of the attribute to look up. The attributes are enumerated in `spacy.typedefs`. |
|
| `feat_name` | The ID of the attribute to look up. The attributes are enumerated in `spacy.typedefs`. ~~attr_id_t~~ |
|
||||||
| `value` | <Abbr title="uint64_t">`attr_t`</Abbr> | The value to set. |
|
| `value` | The value to set. ~~attr_t (uint64_t)~~ |
|
||||||
|
|
||||||
### Lexeme.c_check_flag {#lexeme_c_check_flag tag="staticmethod, nogil" source="spacy/lexeme.pxd"}
|
### Lexeme.c_check_flag {#lexeme_c_check_flag tag="staticmethod, nogil" source="spacy/lexeme.pxd"}
|
||||||
|
|
||||||
|
@@ -226,11 +226,11 @@ Check the value of a binary flag attribute.
|
||||||
> is_stop = Lexeme.c_check_flag(lexeme, IS_STOP)
|
> is_stop = Lexeme.c_check_flag(lexeme, IS_STOP)
|
||||||
> ```
|
> ```
|
||||||
|
|
||||||
| Name | Type | Description |
|
| Name | Description |
|
||||||
| ----------- | ---------------- | ------------------------------------------------------------------------------- |
|
| ----------- | --------------------------------------------------------------------------------------------- |
|
||||||
| `lexeme` | `const LexemeC*` | A pointer to a `LexemeC` struct. |
|
| `lexeme` | A pointer to a `LexemeC` struct. ~~const LexemeC\*~~ |
|
||||||
| `flag_id` | `attr_id_t` | The ID of the flag to look up. The flag IDs are enumerated in `spacy.typedefs`. |
|
| `flag_id` | The ID of the flag to look up. The flag IDs are enumerated in `spacy.typedefs`. ~~attr_id_t~~ |
|
||||||
| **RETURNS** | `bint` | The boolean value of the flag. |
|
| **RETURNS** | The boolean value of the flag. ~~bint~~ |
|
||||||
|
|
||||||
### Lexeme.c_set_flag {#lexeme_c_set_flag tag="staticmethod, nogil" source="spacy/lexeme.pxd"}
|
### Lexeme.c_set_flag {#lexeme_c_set_flag tag="staticmethod, nogil" source="spacy/lexeme.pxd"}
|
||||||
|
|
||||||
|
@@ -246,8 +246,8 @@ Set the value of a binary flag attribute.
|
||||||
> Lexeme.c_set_flag(lexeme, IS_STOP, 0)
|
> Lexeme.c_set_flag(lexeme, IS_STOP, 0)
|
||||||
> ```
|
> ```
|
||||||
|
|
||||||
| Name | Type | Description |
|
| Name | Description |
|
||||||
| --------- | ---------------- | ------------------------------------------------------------------------------- |
|
| --------- | --------------------------------------------------------------------------------------------- |
|
||||||
| `lexeme` | `const LexemeC*` | A pointer to a `LexemeC` struct. |
|
| `lexeme` | A pointer to a `LexemeC` struct. ~~const LexemeC\*~~ |
|
||||||
| `flag_id` | `attr_id_t` | The ID of the flag to look up. The flag IDs are enumerated in `spacy.typedefs`. |
|
| `flag_id` | The ID of the flag to look up. The flag IDs are enumerated in `spacy.typedefs`. ~~attr_id_t~~ |
|
||||||
| `value` | `bint` | The value to set. |
|
| `value` | The value to set. ~~bint~~ |
|
||||||
|
|
|
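The flag operations above boil down to bit arithmetic on the 64-bit `flags` field. A plain-Python sketch, where `IS_STOP = 5` is an illustrative bit position rather than the real ID from `spacy.typedefs`:

```python
# Sketch of the bit-field logic behind Lexeme.c_check_flag and
# Lexeme.c_set_flag: flags is a 64-bit integer and each flag ID is a
# bit position within it.

IS_STOP = 5  # illustrative value only

def check_flag(flags, flag_id):
    return bool(flags & (1 << flag_id))

def set_flag(flags, flag_id, value):
    if value:
        return flags | (1 << flag_id)   # set the bit
    return flags & ~(1 << flag_id)      # clear the bit

flags = 0
flags = set_flag(flags, IS_STOP, True)
assert check_flag(flags, IS_STOP) is True
flags = set_flag(flags, IS_STOP, False)
assert check_flag(flags, IS_STOP) is False
```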
@@ -5,7 +5,8 @@ menu:
|
||||||
- ['Training Config', 'config']
|
- ['Training Config', 'config']
|
||||||
- ['Training Data', 'training']
|
- ['Training Data', 'training']
|
||||||
- ['Pretraining Data', 'pretraining']
|
- ['Pretraining Data', 'pretraining']
|
||||||
- ['Vocabulary', 'vocab']
|
- ['Vocabulary', 'vocab-jsonl']
|
||||||
|
- ['Model Meta', 'meta']
|
||||||
---
|
---
|
||||||
|
|
||||||
This section documents input and output formats of data used by spaCy, including
|
This section documents input and output formats of data used by spaCy, including
|
||||||
|
@@ -73,15 +74,15 @@ your config and check that it's valid, you can run the
|
||||||
Defines the `nlp` object, its tokenizer and
|
Defines the `nlp` object, its tokenizer and
|
||||||
[processing pipeline](/usage/processing-pipelines) component names.
|
[processing pipeline](/usage/processing-pipelines) component names.
|
||||||
|
|
||||||
| Name | Type | Description | Default |
|
| Name | Description |
|
||||||
| ------------------------- | ----------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ | ----------------------------- |
|
| ------------------------- | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
|
||||||
| `lang` | str | The language code to use. | `null` |
|
| `lang` | Model language [ISO code](https://en.wikipedia.org/wiki/List_of_ISO_639-1_codes). Defaults to `null`. ~~str~~ |
|
||||||
| `pipeline` | `List[str]` | Names of pipeline components in order. Should correspond to sections in the `[components]` block, e.g. `[components.ner]`. See docs on [defining components](/usage/training#config-components). | `[]` |
|
| `pipeline` | Names of pipeline components in order. Should correspond to sections in the `[components]` block, e.g. `[components.ner]`. See docs on [defining components](/usage/training#config-components). Defaults to `[]`. ~~List[str]~~ |
|
||||||
| `load_vocab_data` | bool | Whether to load additional lexeme and vocab data from [`spacy-lookups-data`](https://github.com/explosion/spacy-lookups-data) if available. | `true` |
|
| `load_vocab_data` | Whether to load additional lexeme and vocab data from [`spacy-lookups-data`](https://github.com/explosion/spacy-lookups-data) if available. Defaults to `true`. ~~bool~~ |
|
||||||
| `before_creation` | callable | Optional [callback](/usage/training#custom-code-nlp-callbacks) to modify `Language` subclass before it's initialized. | `null` |
|
| `before_creation` | Optional [callback](/usage/training#custom-code-nlp-callbacks) to modify `Language` subclass before it's initialized. Defaults to `null`. ~~Optional[Callable[[Type[Language]], Type[Language]]]~~ |
|
||||||
| `after_creation` | callable | Optional [callback](/usage/training#custom-code-nlp-callbacks) to modify `nlp` object right after it's initialized. | `null` |
|
| `after_creation` | Optional [callback](/usage/training#custom-code-nlp-callbacks) to modify `nlp` object right after it's initialized. Defaults to `null`. ~~Optional[Callable[[Language], Language]]~~ |
|
||||||
| `after_pipeline_creation` | callable | Optional [callback](/usage/training#custom-code-nlp-callbacks) to modify `nlp` object after the pipeline components have been added. | `null` |
|
| `after_pipeline_creation` | Optional [callback](/usage/training#custom-code-nlp-callbacks) to modify `nlp` object after the pipeline components have been added. Defaults to `null`. ~~Optional[Callable[[Language], Language]]~~ |
|
||||||
| `tokenizer` | callable | The tokenizer to use. | [`Tokenizer`](/api/tokenizer) |
|
| `tokenizer` | The tokenizer to use. Defaults to [`Tokenizer`](/api/tokenizer). ~~Callable[[str], Doc]~~ |
|
||||||
|
|
||||||
### components {#config-components tag="section"}
|
### components {#config-components tag="section"}
|
||||||
|
|
||||||
|
@@ -110,15 +111,15 @@ model to copy components from). See the docs on
|
||||||
### paths, system {#config-variables tag="variables"}
|
### paths, system {#config-variables tag="variables"}
|
||||||
|
|
||||||
These sections define variables that can be referenced across the other sections.
|
These sections define variables that can be referenced across the other sections.
|
||||||
For example `${paths:train}` uses the value of `train` defined in
|
For example `${paths.train}` uses the value of `train` defined in
|
||||||
the block `[paths]`. If your config includes custom registered functions that
|
the block `[paths]`. If your config includes custom registered functions that
|
||||||
need paths, you can define them here. All config values can also be
|
need paths, you can define them here. All config values can also be
|
||||||
[overwritten](/usage/training#config-overrides) on the CLI when you run
|
[overwritten](/usage/training#config-overrides) on the CLI when you run
|
||||||
[`spacy train`](/api/cli#train), which is especially relevant for data paths
|
[`spacy train`](/api/cli#train), which is especially relevant for data paths
|
||||||
that you don't want to hard-code in your config file.
|
that you don't want to hard-code in your config file.
|
||||||
|
|
||||||
```bash
|
```cli
|
||||||
$ python -m spacy train ./config.cfg --paths.train ./corpus/train.spacy
|
$ python -m spacy train config.cfg --paths.train ./corpus/train.spacy
|
||||||
```
|
```
|
||||||
|
|
||||||
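The interpolation itself is simple to picture. A minimal sketch, not spaCy's actual implementation, of how a `${section.key}` reference resolves against the config:

```python
import re

# Resolve ${section.key} references against a nested config dict.
# Illustrative only: the real config system handles typing, nesting
# and registered functions as well.

def interpolate(value, config):
    def resolve(match):
        section, key = match.group(1).split(".")
        return str(config[section][key])
    return re.sub(r"\$\{([^}]+)\}", resolve, value)

config = {
    "paths": {"train": "./corpus/train.spacy"},
    "training": {"train_path": "${paths.train}"},
}
assert interpolate(config["training"]["train_path"], config) == "./corpus/train.spacy"
```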
### training {#config-training tag="section"}
|
### training {#config-training tag="section"}
|
||||||
|
@@ -126,26 +127,24 @@ $ python -m spacy train ./config.cfg --paths.train ./corpus/train.spacy
|
||||||
This section defines settings and controls for the training and evaluation
|
This section defines settings and controls for the training and evaluation
|
||||||
process that are used when you run [`spacy train`](/api/cli#train).
|
process that are used when you run [`spacy train`](/api/cli#train).
|
||||||
|
|
||||||
<!-- TODO: complete -->
|
| Name | Description |
|
||||||
|
| --------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ |
|
||||||
| Name | Type | Description | Default |
|
| `accumulate_gradient` | Whether to divide the batch up into substeps. Defaults to `1`. ~~int~~ |
|
||||||
| --------------------- | --------------------------------------------------- | ----------------------------------------------------------------------------------------------------------------------------------------------- | --------------------------------------------------- |
|
| `batcher` | Callable that takes an iterator of [`Doc`](/api/doc) objects and yields batches of `Doc`s. Defaults to [`batch_by_words`](/api/top-level#batch_by_words). ~~Callable[[Iterator[Doc]], Iterator[List[Doc]]]~~ |
|
||||||
| `seed` | int | The random seed. | `${system:seed}` |
|
| `dev_corpus` | Callable that takes the current `nlp` object and yields [`Example`](/api/example) objects. Defaults to [`Corpus`](/api/corpus). ~~Callable[[Language], Iterator[Example]]~~ |
|
||||||
| `dropout` | float | The dropout rate. | `0.1` |
|
| `dropout` | The dropout rate. Defaults to `0.1`. ~~float~~ |
|
||||||
| `accumulate_gradient` | int | Whether to divide the batch up into substeps. | `1` |
|
| `eval_frequency` | How often to evaluate during training (steps). Defaults to `200`. ~~int~~ |
|
||||||
| `init_tok2vec` | str | Optional path to pretrained tok2vec weights created with [`spacy pretrain`](/api/cli#pretrain). | `${paths:init_tok2vec}` |
|
| `frozen_components` | Pipeline component names that are "frozen" and shouldn't be updated during training. See [here](/usage/training#config-components) for details. Defaults to `[]`. ~~List[str]~~ |
|
||||||
| `raw_text` | str | | `${paths:raw}` |
|
| `init_tok2vec` | Optional path to pretrained tok2vec weights created with [`spacy pretrain`](/api/cli#pretrain). Defaults to variable `${paths.init_tok2vec}`. ~~Optional[str]~~ |
|
||||||
| `vectors` | str | | `null` |
|
| `max_epochs` | Maximum number of epochs to train for. Defaults to `0`. ~~int~~ |
|
||||||
| `patience` | int | How many steps to continue without improvement in evaluation score. | `1600` |
|
| `max_steps` | Maximum number of update steps to train for. Defaults to `20000`. ~~int~~ |
|
||||||
| `max_epochs` | int | Maximum number of epochs to train for. | `0` |
|
| `optimizer` | The optimizer. The learning rate schedule and other settings can be configured as part of the optimizer. Defaults to [`Adam`](https://thinc.ai/docs/api-optimizers#adam). ~~Optimizer~~ |
|
||||||
| `max_steps` | int | Maximum number of update steps to train for. | `20000` |
|
| `patience` | How many steps to continue without improvement in evaluation score. Defaults to `1600`. ~~int~~ |
|
||||||
| `eval_frequency` | int | How often to evaluate during training (steps). | `200` |
|
| `raw_text` | Optional path to a jsonl file with unlabelled text documents for a [rehearsal](/api/language#rehearse) step. Defaults to variable `${paths.raw}`. ~~Optional[str]~~ |
|
||||||
| `score_weights` | `Dict[str, float]` | Score names shown in metrics mapped to their weight towards the final weighted score. See [here](/usage/training#metrics) for details. | `{}` |
|
| `score_weights` | Score names shown in metrics mapped to their weight towards the final weighted score. See [here](/usage/training#metrics) for details. Defaults to `{}`. ~~Dict[str, float]~~ |
|
||||||
| `frozen_components` | `List[str]` | Pipeline component names that are "frozen" and shouldn't be updated during training. See [here](/usage/training#config-components) for details. | `[]` |
|
| `seed` | The random seed. Defaults to variable `${system.seed}`. ~~int~~ |
|
||||||
| `train_corpus` | callable | Callable that takes the current `nlp` object and yields [`Example`](/api/example) objects. | [`Corpus`](/api/corpus) |
|
| `train_corpus` | Callable that takes the current `nlp` object and yields [`Example`](/api/example) objects. Defaults to [`Corpus`](/api/corpus). ~~Callable[[Language], Iterator[Example]]~~ |
|
||||||
| `dev_corpus` | callable | Callable that takes the current `nlp` object and yields [`Example`](/api/example) objects. | [`Corpus`](/api/corpus) |
|
| `vectors` | Model name or path to model containing pretrained word vectors to use, e.g. created with [`init model`](/api/cli#init-model). Defaults to `null`. ~~Optional[str]~~ |
|
||||||
| `batcher` | callable | Callable that takes an iterator of [`Doc`](/api/doc) objects and yields batches of `Doc`s. | [`batch_by_words`](/api/top-level#batch_by_words) |
|
|
||||||
| `optimizer` | [`Optimizer`](https://thinc.ai/docs/api-optimizers) | The optimizer. The learning rate schedule and other settings can be configured as part of the optimizer. | [`Adam`](https://thinc.ai/docs/api-optimizers#adam) |
|
|
||||||
|
|
||||||
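For example, the `score_weights` setting combines individual metrics into the single score used to compare checkpoints. A sketch of the idea, assuming a plain weighted sum (the exact aggregation may differ):

```python
# Combine per-metric scores into one final score, weighting each metric
# by its entry in score_weights. Metrics absent from score_weights do
# not contribute.

def weighted_score(scores, score_weights):
    return sum(scores.get(name, 0.0) * weight
               for name, weight in score_weights.items())

score_weights = {"dep_uas": 0.5, "dep_las": 0.5}
scores = {"dep_uas": 0.9, "dep_las": 0.8, "tag_acc": 0.95}
assert abs(weighted_score(scores, score_weights) - 0.85) < 1e-9
```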
### pretraining {#config-pretraining tag="section,optional"}
|
### pretraining {#config-pretraining tag="section,optional"}
|
||||||
|
|
||||||
|
@@ -153,19 +152,19 @@ This section is optional and defines settings and controls for
|
||||||
[language model pretraining](/usage/training#pretraining). It's used when you
|
[language model pretraining](/usage/training#pretraining). It's used when you
|
||||||
run [`spacy pretrain`](/api/cli#pretrain).
|
run [`spacy pretrain`](/api/cli#pretrain).
|
||||||
|
|
||||||
| Name | Type | Description | Default |
|
| Name | Description |
|
||||||
| ---------------------------- | --------------------------------------------------- | ----------------------------------------------------------------------------- | --------------------------------------------------- |
|
| ---------------------------- | ------------------------------------------------------------------------------------------------------------------------------- |
|
||||||
| `max_epochs` | int | Maximum number of epochs. | `1000` |
|
| `max_epochs` | Maximum number of epochs. Defaults to `1000`. ~~int~~ |
|
||||||
| `min_length` | int | Minimum length of examples. | `5` |
|
| `min_length` | Minimum length of examples. Defaults to `5`. ~~int~~ |
|
||||||
| `max_length` | int | Maximum length of examples. | `500` |
|
| `max_length` | Maximum length of examples. Defaults to `500`. ~~int~~ |
|
||||||
| `dropout` | float | The dropout rate. | `0.2` |
|
| `dropout` | The dropout rate. Defaults to `0.2`. ~~float~~ |
|
||||||
| `n_save_every` | int | Saving frequency. | `null` |
|
| `n_save_every` | Saving frequency. Defaults to `null`. ~~Optional[int]~~ |
|
||||||
| `batch_size` | int / `Sequence[int]` | The batch size or batch size [schedule](https://thinc.ai/docs/api-schedules). | `3000` |
|
| `batch_size` | The batch size or batch size [schedule](https://thinc.ai/docs/api-schedules). Defaults to `3000`. ~~Union[int, Sequence[int]]~~ |
|
||||||
| `seed` | int | The random seed. | `${system.seed}` |
|
| `seed` | The random seed. Defaults to variable `${system.seed}`. ~~int~~ |
|
||||||
| `use_pytorch_for_gpu_memory` | bool | Allocate memory via PyTorch. | `${system:use_pytorch_for_gpu_memory}` |
|
| `use_pytorch_for_gpu_memory` | Allocate memory via PyTorch. Defaults to variable `${system.use_pytorch_for_gpu_memory}`. ~~bool~~ |
|
||||||
| `tok2vec_model` | str | tok2vec model section in the config. | `"components.tok2vec.model"` |
|
| `tok2vec_model` | The model section of the embedding component in the config. Defaults to `"components.tok2vec.model"`. ~~str~~ |
|
||||||
| `objective` | dict | The pretraining objective. | `{"type": "characters", "n_characters": 4}` |
|
| `objective` | The pretraining objective. Defaults to `{"type": "characters", "n_characters": 4}`. ~~Dict[str, Any]~~ |
|
||||||
| `optimizer` | [`Optimizer`](https://thinc.ai/docs/api-optimizers) | The optimizer. | [`Adam`](https://thinc.ai/docs/api-optimizers#adam) |
|
| `optimizer` | The optimizer. Defaults to [`Adam`](https://thinc.ai/docs/api-optimizers#adam). ~~Optimizer~~ |
|
||||||
|
|
||||||
## Training data {#training}
|
## Training data {#training}
|
||||||
|
|
||||||
|
@@ -208,8 +207,8 @@ objects to JSON, you can now serialize them directly using the
|
||||||
[`spacy convert`](/api/cli) lets you convert your JSON data to the new `.spacy`
|
[`spacy convert`](/api/cli) lets you convert your JSON data to the new `.spacy`
|
||||||
format:
|
format:
|
||||||
|
|
||||||
```bash
|
```cli
|
||||||
$ python -m spacy convert ./data.json ./output
|
$ python -m spacy convert ./data.json ./output.spacy
|
||||||
```
|
```
|
||||||
|
|
||||||
</Infobox>
|
</Infobox>
|
||||||
|
@@ -313,22 +312,22 @@ to keep track of your settings and hyperparameters and your own
|
||||||
> }
|
> }
|
||||||
> ```
|
> ```
|
||||||
|
|
||||||
| Name | Type | Description |
|
| Name | Description |
|
||||||
| ------------- | ---------------------------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
|
| ------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ |
|
||||||
| `text` | str | Raw text. |
|
| `text` | Raw text. ~~str~~ |
|
||||||
| `words` | `List[str]` | List of gold-standard tokens. |
|
| `words` | List of gold-standard tokens. ~~List[str]~~ |
|
||||||
| `lemmas` | `List[str]` | List of lemmas. |
|
| `lemmas` | List of lemmas. ~~List[str]~~ |
|
||||||
| `spaces` | `List[bool]` | List of boolean values indicating whether the corresponding token is followed by a space or not. |
|
| `spaces` | List of boolean values indicating whether the corresponding token is followed by a space or not. ~~List[bool]~~ |
|
||||||
| `tags` | `List[str]` | List of fine-grained [POS tags](/usage/linguistic-features#pos-tagging). |
|
| `tags` | List of fine-grained [POS tags](/usage/linguistic-features#pos-tagging). ~~List[str]~~ |
|
||||||
| `pos` | `List[str]` | List of coarse-grained [POS tags](/usage/linguistic-features#pos-tagging). |
|
| `pos` | List of coarse-grained [POS tags](/usage/linguistic-features#pos-tagging). ~~List[str]~~ |
|
||||||
| `morphs` | `List[str]` | List of [morphological features](/usage/linguistic-features#rule-based-morphology). |
|
| `morphs` | List of [morphological features](/usage/linguistic-features#rule-based-morphology). ~~List[str]~~ |
|
||||||
| `sent_starts` | `List[bool]` | List of boolean values indicating whether each token is the first of a sentence or not. |
|
| `sent_starts` | List of boolean values indicating whether each token is the first of a sentence or not. ~~List[bool]~~ |
|
||||||
| `deps` | `List[str]` | List of string values indicating the [dependency relation](/usage/linguistic-features#dependency-parse) of a token to its head. |
|
| `deps` | List of string values indicating the [dependency relation](/usage/linguistic-features#dependency-parse) of a token to its head. ~~List[str]~~ |
|
||||||
| `heads` | `List[int]` | List of integer values indicating the dependency head of each token, referring to the absolute index of each token in the text. |
|
| `heads` | List of integer values indicating the dependency head of each token, referring to the absolute index of each token in the text. ~~List[int]~~ |
|
||||||
| `entities` | `List[str]` | **Option 1:** List of [BILUO tags](/usage/linguistic-features#accessing-ner) per token of the format `"{action}-{label}"`, or `None` for unannotated tokens. |
|
| `entities` | **Option 1:** List of [BILUO tags](/usage/linguistic-features#accessing-ner) per token of the format `"{action}-{label}"`, or `None` for unannotated tokens. ~~List[str]~~ |
|
||||||
| `entities` | `List[Tuple[int, int, str]]` | **Option 2:** List of `"(start, end, label)"` tuples defining all entities in the text. |
|
| `entities` | **Option 2:** List of `"(start, end, label)"` tuples defining all entities in the text. ~~List[Tuple[int, int, str]]~~ |
|
||||||
| `cats` | `Dict[str, float]` | Dictionary of `label`/`value` pairs indicating how relevant a certain [text category](/api/textcategorizer) is for the text. |
|
| `cats` | Dictionary of `label`/`value` pairs indicating how relevant a certain [text category](/api/textcategorizer) is for the text. ~~Dict[str, float]~~ |
|
||||||
| `links` | `Dict[(int, int), Dict]` | Dictionary of `offset`/`dict` pairs defining [named entity links](/usage/linguistic-features#entity-linking). The character offsets are linked to a dictionary of relevant knowledge base IDs. |
|
| `links` | Dictionary of `offset`/`dict` pairs defining [named entity links](/usage/linguistic-features#entity-linking). The character offsets are linked to a dictionary of relevant knowledge base IDs. ~~Dict[Tuple[int, int], Dict]~~ |
|
||||||
|
|
||||||
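The two `entities` formats encode the same information. A simplified sketch of converting character-offset tuples into per-token BILUO tags, assuming entity boundaries align exactly with token boundaries (spaCy ships its own helpers for this; this is only an illustration of the relationship):

```python
# Map (start, end, label) character offsets to per-token BILUO tags,
# given each token's character span. Tokens outside any entity get "O";
# single-token entities get U-, multi-token entities get B-/I-/L-.

def offsets_to_biluo(token_spans, entities):
    tags = ["O"] * len(token_spans)
    for start, end, label in entities:
        covered = [i for i, (ts, te) in enumerate(token_spans)
                   if ts >= start and te <= end]
        if len(covered) == 1:
            tags[covered[0]] = f"U-{label}"
        elif covered:
            tags[covered[0]] = f"B-{label}"
            for i in covered[1:-1]:
                tags[i] = f"I-{label}"
            tags[covered[-1]] = f"L-{label}"
    return tags

# "Who is Shaka Khan?" with token character spans:
spans = [(0, 3), (4, 6), (7, 12), (13, 17), (17, 18)]
tags = offsets_to_biluo(spans, [(7, 17, "PERSON")])
assert tags == ["O", "O", "B-PERSON", "L-PERSON", "O"]
```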
<Infobox title="Notes and caveats">
|
<Infobox title="Notes and caveats">
|
||||||
|
|
||||||
|
@@ -372,11 +371,11 @@ example = Example.from_dict(doc, gold_dict)
|
||||||
|
|
||||||
## Pretraining data {#pretraining}
|
## Pretraining data {#pretraining}
|
||||||
|
|
||||||
The [`spacy pretrain`](/api/cli#pretrain) command lets you pretrain the tok2vec
|
The [`spacy pretrain`](/api/cli#pretrain) command lets you pretrain the
|
||||||
layer of pipeline components from raw text. Raw text can be provided as a
|
"token-to-vector" embedding layer of pipeline components from raw text. Raw text
|
||||||
`.jsonl` (newline-delimited JSON) file containing one input text per line
|
can be provided as a `.jsonl` (newline-delimited JSON) file containing one input
|
||||||
(roughly paragraph length is good). Optionally, custom tokenization can be
|
text per line (roughly paragraph length is good). Optionally, custom
|
||||||
provided.
|
tokenization can be provided.
|
||||||
|
|
||||||
> #### Tip: Writing JSONL
|
> #### Tip: Writing JSONL
|
||||||
>
|
>
|
||||||
|
@ -390,10 +389,10 @@ provided.
|
||||||
> srsly.write_jsonl("/path/to/text.jsonl", data)
|
> srsly.write_jsonl("/path/to/text.jsonl", data)
|
||||||
> ```
|
> ```
|
||||||
|
|
||||||
| Key | Type | Description |
|
| Key | Description |
|
||||||
| -------- | ---- | ---------------------------------------------------------- |
|
| -------- | --------------------------------------------------------------------- |
|
||||||
| `text` | str | The raw input text. Is not required if `tokens` available. |
|
| `text` | The raw input text. Is not required if `tokens` is available. ~~str~~ |
|
||||||
| `tokens` | list | Optional tokenization, one string per token. |
|
| `tokens` | Optional tokenization, one string per token. ~~List[str]~~ |
|
||||||
|
|
||||||
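The two keys above map onto a plain newline-delimited JSON file, so no special tooling is required to produce one. A minimal sketch using only the standard library (the file name and texts are illustrative):

```python
import json

# Illustrative entries: either raw text, or pre-tokenized input.
entries = [
    {"text": "Can I ask where you work now?"},
    {"tokens": ["Rats", "make", "good", "pets", "."]},
]

with open("text.jsonl", "w", encoding="utf8") as f:
    for entry in entries:
        # One JSON object per line -- the newline-delimited JSON format.
        f.write(json.dumps(entry) + "\n")
```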
```json
|
```json
|
||||||
### Example
|
### Example
|
||||||
|
@ -406,7 +405,7 @@ provided.
|
||||||
## Lexical data for vocabulary {#vocab-jsonl new="2"}
|
## Lexical data for vocabulary {#vocab-jsonl new="2"}
|
||||||
|
|
||||||
To populate a model's vocabulary, you can use the
|
To populate a model's vocabulary, you can use the
|
||||||
[`spacy init-model`](/api/cli#init-model) command and load in a
|
[`spacy init model`](/api/cli#init-model) command and load in a
|
||||||
[newline-delimited JSON](http://jsonlines.org/) (JSONL) file containing one
|
[newline-delimited JSON](http://jsonlines.org/) (JSONL) file containing one
|
||||||
lexical entry per line via the `--jsonl-loc` option. The first line defines the
|
lexical entry per line via the `--jsonl-loc` option. The first line defines the
|
||||||
language and vocabulary settings. All other lines are expected to be JSON
|
language and vocabulary settings. All other lines are expected to be JSON
|
||||||
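Since each lexical entry is a standalone JSON object, such a file can also be assembled with the standard library alone. A minimal sketch, where the file name and the exact entry keys (`"orth"`, `"prob"`, the `"settings"` dict) are illustrative assumptions modeled on the example data:

```python
import json

lines = [
    # First line: language and vocabulary settings (keys illustrative).
    {"lang": "en", "settings": {"oov_prob": -20.5}},
    # All other lines: one lexical entry each.
    {"orth": "the", "prob": -3.5},
    {"orth": ",", "prob": -3.6},
]

with open("vocab.jsonl", "w", encoding="utf8") as f:
    for entry in lines:
        f.write(json.dumps(entry) + "\n")
```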
|
@ -457,3 +456,75 @@ Here's an example of the 20 most frequent lexemes in the English training data:
|
||||||
```json
|
```json
|
||||||
https://github.com/explosion/spaCy/tree/master/examples/training/vocab-data.jsonl
|
https://github.com/explosion/spaCy/tree/master/examples/training/vocab-data.jsonl
|
||||||
```
|
```
|
||||||
|
|
||||||
|
## Model meta {#meta}
|
||||||
|
|
||||||
|
The model meta is available as the file `meta.json` and exported automatically
|
||||||
|
when you save an `nlp` object to disk. Its contents are available as
|
||||||
|
[`nlp.meta`](/api/language#meta).
|
||||||
|
|
||||||
|
<Infobox variant="warning" title="Changed in v3.0">
|
||||||
|
|
||||||
|
As of spaCy v3.0, the `meta.json` **isn't** used to construct the language class
|
||||||
|
and pipeline anymore and only contains meta information for reference and for
|
||||||
|
creating a Python package with [`spacy package`](/api/cli#package). How to set
|
||||||
|
up the `nlp` object is now defined in the
|
||||||
|
[`config.cfg`](/api/data-formats#config), which includes detailed information
|
||||||
|
about the pipeline components and their model architectures, and all other
|
||||||
|
settings and hyperparameters used to train the model. It's the **single source
|
||||||
|
of truth** used for loading a model.
|
||||||
|
|
||||||
|
</Infobox>
|
||||||
|
|
||||||
|
> #### Example
|
||||||
|
>
|
||||||
|
> ```json
|
||||||
|
> {
|
||||||
|
> "name": "example_model",
|
||||||
|
> "lang": "en",
|
||||||
|
> "version": "1.0.0",
|
||||||
|
> "spacy_version": ">=3.0.0,<3.1.0",
|
||||||
|
> "parent_package": "spacy",
|
||||||
|
> "description": "Example model for spaCy",
|
||||||
|
> "author": "You",
|
||||||
|
> "email": "you@example.com",
|
||||||
|
> "url": "https://example.com",
|
||||||
|
> "license": "CC BY-SA 3.0",
|
||||||
|
> "sources": [{ "name": "My Corpus", "license": "MIT" }],
|
||||||
|
> "vectors": { "width": 0, "vectors": 0, "keys": 0, "name": null },
|
||||||
|
> "pipeline": ["tok2vec", "ner", "textcat"],
|
||||||
|
> "labels": {
|
||||||
|
> "ner": ["PERSON", "ORG", "PRODUCT"],
|
||||||
|
> "textcat": ["POSITIVE", "NEGATIVE"]
|
||||||
|
> },
|
||||||
|
> "accuracy": {
|
||||||
|
> "ents_f": 82.7300930714,
|
||||||
|
> "ents_p": 82.135523614,
|
||||||
|
> "ents_r": 83.3333333333,
|
||||||
|
> "textcat_score": 88.364323811
|
||||||
|
> },
|
||||||
|
> "speed": { "cpu": 7667.8, "gpu": null, "nwords": 10329 },
|
||||||
|
> "spacy_git_version": "61dfdd9fb"
|
||||||
|
> }
|
||||||
|
> ```
|
||||||
|
|
||||||
|
| Name | Description |
|
||||||
|
| ---------------------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ |
|
||||||
|
| `lang` | Model language [ISO code](https://en.wikipedia.org/wiki/List_of_ISO_639-1_codes). Defaults to `"en"`. ~~str~~ |
|
||||||
|
| `name` | Model name, e.g. `"core_web_sm"`. The final model package name will be `{lang}_{name}`. Defaults to `"model"`. ~~str~~ |
|
||||||
|
| `version` | Model version. Will be used to version a Python package created with [`spacy package`](/api/cli#package). Defaults to `"0.0.0"`. ~~str~~ |
|
||||||
|
| `spacy_version` | spaCy version range the model is compatible with. Defaults to the spaCy version used to create the model, up to next minor version, which is the default compatibility for the available [pretrained models](/models). For instance, a model trained with v3.0.0 will have the version range `">=3.0.0,<3.1.0"`. ~~str~~ |
|
||||||
|
| `parent_package` | Name of the spaCy package. Typically `"spacy"` or `"spacy_nightly"`. Defaults to `"spacy"`. ~~str~~ |
|
||||||
|
| `description` | Model description. Also used for Python package. Defaults to `""`. ~~str~~ |
|
||||||
|
| `author` | Model author name. Also used for Python package. Defaults to `""`. ~~str~~ |
|
||||||
|
| `email` | Model author email. Also used for Python package. Defaults to `""`. ~~str~~ |
|
||||||
|
| `url` | Model author URL. Also used for Python package. Defaults to `""`. ~~str~~ |
|
||||||
|
| `license` | Model license. Also used for Python package. Defaults to `""`. ~~str~~ |
|
||||||
|
| `sources` | Data sources used to train the model. Typically a list of dicts with the keys `"name"`, `"url"`, `"author"` and `"license"`. [See here](https://github.com/explosion/spacy-models/tree/master/meta) for examples. Defaults to `None`. ~~Optional[List[Dict[str, str]]]~~ |
|
||||||
|
| `vectors` | Information about the word vectors included with the model. Typically a dict with the keys `"width"`, `"vectors"` (number of vectors), `"keys"` and `"name"`. ~~Dict[str, Any]~~ |
|
||||||
|
| `pipeline` | Names of the pipeline components in the model, in order. Corresponds to [`nlp.pipe_names`](/api/language#pipe_names). Only exists for reference and is not used to create the components. This information is defined in the [`config.cfg`](/api/data-formats#config). Defaults to `[]`. ~~List[str]~~ |
|
||||||
|
| `labels` | Label schemes of the trained pipeline components, keyed by component name. Corresponds to [`nlp.pipe_labels`](/api/language#pipe_labels). [See here](https://github.com/explosion/spacy-models/tree/master/meta) for examples. Defaults to `{}`. ~~Dict[str, Dict[str, List[str]]]~~ |
|
||||||
|
| `accuracy` | Training accuracy, added automatically by [`spacy train`](/api/cli#train). Dictionary of [score names](/usage/training#metrics) mapped to scores. Defaults to `{}`. ~~Dict[str, Union[float, Dict[str, float]]]~~ |
|
||||||
|
| `speed` | Model speed, added automatically by [`spacy train`](/api/cli#train). Typically a dictionary with the keys `"cpu"`, `"gpu"` and `"nwords"` (words per second). Defaults to `{}`. ~~Dict[str, Optional[Union[float, str]]]~~ |
|
||||||
|
| `spacy_git_version` <Tag variant="new">3</Tag> | Git commit of [`spacy`](https://github.com/explosion/spaCy) used to create the model. ~~str~~ |
|
||||||
|
| other | Any other custom meta information you want to add. The data is preserved in [`nlp.meta`](/api/language#meta). ~~Any~~ |
|
||||||
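Because `meta.json` is plain JSON, custom keys survive a round trip to disk unchanged. A small sketch of that behavior using only the standard library (the file path and the custom key are illustrative):

```python
import json
from pathlib import Path

meta = {
    "lang": "en",
    "name": "example_model",
    "version": "1.0.0",
    # Arbitrary extra meta information is preserved alongside the known keys.
    "my_custom_key": "preserved in nlp.meta",
}

path = Path("meta.json")
path.write_text(json.dumps(meta, indent=2), encoding="utf8")
loaded = json.loads(path.read_text(encoding="utf8"))
```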
|
|
|
@ -44,18 +44,18 @@ A pattern added to the `DependencyMatcher` consists of a list of dictionaries,
|
||||||
with each dictionary describing a node to match. Each pattern should have the
|
with each dictionary describing a node to match. Each pattern should have the
|
||||||
following top-level keys:
|
following top-level keys:
|
||||||
|
|
||||||
| Name | Type | Description |
|
| Name | Description |
|
||||||
| --------- | ---- | --------------------------------------------------------------------------------------------------------------------------- |
|
| --------- | ---------------------------------------------------------------------------------------------------------------------------------------------- |
|
||||||
| `PATTERN` | dict | The token attributes to match in the same format as patterns provided to the regular token-based [`Matcher`](/api/matcher). |
|
| `PATTERN` | The token attributes to match in the same format as patterns provided to the regular token-based [`Matcher`](/api/matcher). ~~Dict[str, Any]~~ |
|
||||||
| `SPEC` | dict | The relationships of the nodes in the subtree that should be matched. |
|
| `SPEC` | The relationships of the nodes in the subtree that should be matched. ~~Dict[str, str]~~ |
|
||||||
|
|
||||||
The `SPEC` includes the following fields:
|
The `SPEC` includes the following fields:
|
||||||
|
|
||||||
| Name | Type | Description |
|
| Name | Description |
|
||||||
| ------------ | ---- | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
|
| ------------ | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ |
|
||||||
| `NODE_NAME` | str | A unique name for this node to refer to it in other specs. |
|
| `NODE_NAME` | A unique name for this node to refer to it in other specs. ~~str~~ |
|
||||||
| `NBOR_RELOP` | str | A [Semgrex](https://nlp.stanford.edu/nlp/javadoc/javanlp/edu/stanford/nlp/semgraph/semgrex/SemgrexPattern.html) operator that describes how the two nodes are related. |
|
| `NBOR_RELOP` | A [Semgrex](https://nlp.stanford.edu/nlp/javadoc/javanlp/edu/stanford/nlp/semgraph/semgrex/SemgrexPattern.html) operator that describes how the two nodes are related. ~~str~~ |
|
||||||
| `NBOR_NAME` | str | The unique name of the node that this node is connected to. |
|
| `NBOR_NAME` | The unique name of the node that this node is connected to. ~~str~~ |
|
||||||
|
|
||||||
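A pattern is just a list of plain dictionaries combining the two top-level keys with the `SPEC` fields, so its shape can be sketched without spaCy itself (the node names and token attributes below are illustrative):

```python
# Each dict describes one node: SPEC wires the node into the subtree,
# PATTERN holds regular token-based Matcher attributes.
pattern = [
    {"SPEC": {"NODE_NAME": "founded"},
     "PATTERN": {"ORTH": "founded"}},
    {"SPEC": {"NODE_NAME": "founder",
              "NBOR_RELOP": ">",        # Semgrex operator relating the nodes
              "NBOR_NAME": "founded"},  # refers back to the anchor node
     "PATTERN": {"DEP": "nsubj"}},
]
# The anchor node omits NBOR_RELOP/NBOR_NAME; every other node must
# reference a node that has already been named.
```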
## DependencyMatcher.\_\_init\_\_ {#init tag="method"}
|
## DependencyMatcher.\_\_init\_\_ {#init tag="method"}
|
||||||
|
|
||||||
|
@ -68,9 +68,9 @@ Create a rule-based `DependencyMatcher`.
|
||||||
> matcher = DependencyMatcher(nlp.vocab)
|
> matcher = DependencyMatcher(nlp.vocab)
|
||||||
> ```
|
> ```
|
||||||
|
|
||||||
| Name | Type | Description |
|
| Name | Description |
|
||||||
| ------- | ------- | ------------------------------------------------------------------------------------------- |
|
| ------- | ----------------------------------------------------------------------------------------------------- |
|
||||||
| `vocab` | `Vocab` | The vocabulary object, which must be shared with the documents the matcher will operate on. |
|
| `vocab` | The vocabulary object, which must be shared with the documents the matcher will operate on. ~~Vocab~~ |
|
||||||
|
|
||||||
## DependencyMatcher.\_\_call\_\_ {#call tag="method"}
|
## DependencyMatcher.\_\_call\_\_ {#call tag="method"}
|
||||||
|
|
||||||
|
@ -79,9 +79,9 @@ Find all token sequences matching the supplied patterns on the `Doc` or `Span`.
|
||||||
> #### Example
|
> #### Example
|
||||||
>
|
>
|
||||||
> ```python
|
> ```python
|
||||||
> from spacy.matcher import Matcher
|
> from spacy.matcher import DependencyMatcher
|
||||||
>
|
>
|
||||||
> matcher = Matcher(nlp.vocab)
|
> matcher = DependencyMatcher(nlp.vocab)
|
||||||
> pattern = [
|
> pattern = [
|
||||||
> {"SPEC": {"NODE_NAME": "founded"}, "PATTERN": {"ORTH": "founded"}},
|
> {"SPEC": {"NODE_NAME": "founded"}, "PATTERN": {"ORTH": "founded"}},
|
||||||
> {"SPEC": {"NODE_NAME": "founder", "NBOR_RELOP": ">", "NBOR_NAME": "founded"}, "PATTERN": {"DEP": "nsubj"}},
|
> {"SPEC": {"NODE_NAME": "founder", "NBOR_RELOP": ">", "NBOR_NAME": "founded"}, "PATTERN": {"DEP": "nsubj"}},
|
||||||
|
@ -91,10 +91,10 @@ Find all token sequences matching the supplied patterns on the `Doc` or `Span`.
|
||||||
> matches = matcher(doc)
|
> matches = matcher(doc)
|
||||||
> ```
|
> ```
|
||||||
|
|
||||||
| Name | Type | Description |
|
| Name | Description |
|
||||||
| ----------- | ------------ | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------ |
|
| ----------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
|
||||||
| `doclike` | `Doc`/`Span` | The `Doc` or `Span` to match over. |
|
| `doclike` | The `Doc` or `Span` to match over. ~~Union[Doc, Span]~~ |
|
||||||
| **RETURNS** | list | A list of `(match_id, start, end)` tuples, describing the matches. A match tuple describes a span `doc[start:end]`. The `match_id` is the ID of the added match pattern. |
|
| **RETURNS** | A list of `(match_id, start, end)` tuples, describing the matches. A match tuple describes a span `doc[start:end]`. The `match_id` is the ID of the added match pattern. ~~List[Tuple[int, int, int]]~~ |
|
||||||
|
|
||||||
## DependencyMatcher.\_\_len\_\_ {#len tag="method"}
|
## DependencyMatcher.\_\_len\_\_ {#len tag="method"}
|
||||||
|
|
||||||
|
@ -115,9 +115,9 @@ number of individual patterns.
|
||||||
> assert len(matcher) == 1
|
> assert len(matcher) == 1
|
||||||
> ```
|
> ```
|
||||||
|
|
||||||
| Name | Type | Description |
|
| Name | Description |
|
||||||
| ----------- | ---- | -------------------- |
|
| ----------- | ---------------------------- |
|
||||||
| **RETURNS** | int | The number of rules. |
|
| **RETURNS** | The number of rules. ~~int~~ |
|
||||||
|
|
||||||
## DependencyMatcher.\_\_contains\_\_ {#contains tag="method"}
|
## DependencyMatcher.\_\_contains\_\_ {#contains tag="method"}
|
||||||
|
|
||||||
|
@ -132,10 +132,10 @@ Check whether the matcher contains rules for a match ID.
|
||||||
> assert "Rule" in matcher
|
> assert "Rule" in matcher
|
||||||
> ```
|
> ```
|
||||||
|
|
||||||
| Name | Type | Description |
|
| Name | Description |
|
||||||
| ----------- | ---- | ----------------------------------------------------- |
|
| ----------- | -------------------------------------------------------------- |
|
||||||
| `key` | str | The match ID. |
|
| `key` | The match ID. ~~str~~ |
|
||||||
| **RETURNS** | bool | Whether the matcher contains rules for this match ID. |
|
| **RETURNS** | Whether the matcher contains rules for this match ID. ~~bool~~ |
|
||||||
|
|
||||||
## DependencyMatcher.add {#add tag="method"}
|
## DependencyMatcher.add {#add tag="method"}
|
||||||
|
|
||||||
|
@ -151,16 +151,16 @@ will be overwritten.
|
||||||
> def on_match(matcher, doc, id, matches):
|
> def on_match(matcher, doc, id, matches):
|
||||||
> print('Matched!', matches)
|
> print('Matched!', matches)
|
||||||
>
|
>
|
||||||
> matcher = Matcher(nlp.vocab)
|
> matcher = DependencyMatcher(nlp.vocab)
|
||||||
> matcher.add("TEST_PATTERNS", patterns)
|
> matcher.add("TEST_PATTERNS", patterns)
|
||||||
> ```
|
> ```
|
||||||
|
|
||||||
| Name | Type | Description |
|
| Name | Description |
|
||||||
| -------------- | ------------------ | --------------------------------------------------------------------------------------------- |
|
| -------------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------- |
|
||||||
| `match_id` | str | An ID for the thing you're matching. |
|
| `match_id` | An ID for the thing you're matching. ~~str~~ |
|
||||||
| `patterns` | list | Match pattern. A pattern consists of a list of dicts, where each dict describes a token. |
|
| `patterns` | Match pattern. A pattern consists of a list of dicts, where each dict describes a `"PATTERN"` and `"SPEC"`. ~~List[List[Dict[str, dict]]]~~ |
|
||||||
| _keyword-only_ | | |
|
| _keyword-only_ | | |
|
||||||
| `on_match` | callable or `None` | Callback function to act on matches. Takes the arguments `matcher`, `doc`, `i` and `matches`. |
|
| `on_match` | Callback function to act on matches. Takes the arguments `matcher`, `doc`, `i` and `matches`. ~~Optional[Callable[[Matcher, Doc, int, List[tuple]], Any]]~~ |
|
||||||
|
|
||||||
## DependencyMatcher.remove {#remove tag="method"}
|
## DependencyMatcher.remove {#remove tag="method"}
|
||||||
|
|
||||||
|
@ -176,9 +176,9 @@ exist.
|
||||||
> assert "Rule" not in matcher
|
> assert "Rule" not in matcher
|
||||||
> ```
|
> ```
|
||||||
|
|
||||||
| Name | Type | Description |
|
| Name | Description |
|
||||||
| ----- | ---- | ------------------------- |
|
| ----- | --------------------------------- |
|
||||||
| `key` | str | The ID of the match rule. |
|
| `key` | The ID of the match rule. ~~str~~ |
|
||||||
|
|
||||||
## DependencyMatcher.get {#get tag="method"}
|
## DependencyMatcher.get {#get tag="method"}
|
||||||
|
|
||||||
|
@ -192,7 +192,7 @@ Retrieve the pattern stored for a key. Returns the rule as an
|
||||||
> on_match, patterns = matcher.get("Rule")
|
> on_match, patterns = matcher.get("Rule")
|
||||||
> ```
|
> ```
|
||||||
|
|
||||||
| Name | Type | Description |
|
| Name | Description |
|
||||||
| ----------- | ----- | --------------------------------------------- |
|
| ----------- | --------------------------------------------------------------------------------------------- |
|
||||||
| `key` | str | The ID of the match rule. |
|
| `key` | The ID of the match rule. ~~str~~ |
|
||||||
| **RETURNS** | tuple | The rule, as an `(on_match, patterns)` tuple. |
|
| **RETURNS** | The rule, as an `(on_match, patterns)` tuple. ~~Tuple[Optional[Callable], List[List[dict]]]~~ |
|
||||||
|
|
|
@ -48,13 +48,13 @@ architectures and their arguments and hyperparameters.
|
||||||
> nlp.add_pipe("parser", config=config)
|
> nlp.add_pipe("parser", config=config)
|
||||||
> ```
|
> ```
|
||||||
|
|
||||||
| Setting | Type | Description | Default |
|
| Setting | Description |
|
||||||
| ----------------------------- | ------------------------------------------ | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | ----------------------------------------------------------------- |
|
| ----------------------------- | --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
|
||||||
| `moves` | `List[str]` | A list of transition names. Inferred from the data if not provided. | `None` |
|
| `moves` | A list of transition names. Inferred from the data if not provided. Defaults to `None`. ~~Optional[List[str]]~~ |
|
||||||
| `update_with_oracle_cut_size` | int | During training, cut long sequences into shorter segments by creating intermediate states based on the gold-standard history. The model is not very sensitive to this parameter, so you usually won't need to change it. | `100` |
|
| `update_with_oracle_cut_size` | During training, cut long sequences into shorter segments by creating intermediate states based on the gold-standard history. The model is not very sensitive to this parameter, so you usually won't need to change it. Defaults to `100`. ~~int~~ |
|
||||||
| `learn_tokens` | bool | Whether to learn to merge subtokens that are split relative to the gold standard. Experimental. | `False` |
|
| `learn_tokens` | Whether to learn to merge subtokens that are split relative to the gold standard. Experimental. Defaults to `False`. ~~bool~~ |
|
||||||
| `min_action_freq` | int | The minimum frequency of labelled actions to retain. Rarer labelled actions have their label backed-off to "dep". While this primarily affects the label accuracy, it can also affect the attachment structure, as the labels are used to represent the pseudo-projectivity transformation. | `30` |
|
| `min_action_freq` | The minimum frequency of labelled actions to retain. Rarer labelled actions have their label backed-off to "dep". While this primarily affects the label accuracy, it can also affect the attachment structure, as the labels are used to represent the pseudo-projectivity transformation. Defaults to `30`. ~~int~~ |
|
||||||
| `model` | [`Model`](https://thinc.ai/docs/api-model) | The model to use. | [TransitionBasedParser](/api/architectures#TransitionBasedParser) |
|
| `model` | The [`Model`](https://thinc.ai/docs/api-model) powering the pipeline component. Defaults to [TransitionBasedParser](/api/architectures#TransitionBasedParser). ~~Model[List[Doc], List[Floats2d]]~~ |
|
||||||
|
|
||||||
```python
|
```python
|
||||||
https://github.com/explosion/spaCy/blob/develop/spacy/pipeline/dep_parser.pyx
|
https://github.com/explosion/spaCy/blob/develop/spacy/pipeline/dep_parser.pyx
|
||||||
|
@ -81,16 +81,16 @@ Create a new pipeline instance. In your application, you would normally use a
|
||||||
shortcut for this and instantiate the component using its string name and
|
shortcut for this and instantiate the component using its string name and
|
||||||
[`nlp.add_pipe`](/api/language#add_pipe).
|
[`nlp.add_pipe`](/api/language#add_pipe).
|
||||||
|
|
||||||
| Name | Type | Description |
|
| Name | Description |
|
||||||
| ----------------------------- | ------------------------------------------ | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
|
| ----------------------------- | --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
|
||||||
| `vocab` | `Vocab` | The shared vocabulary. |
|
| `vocab` | The shared vocabulary. ~~Vocab~~ |
|
||||||
| `model` | [`Model`](https://thinc.ai/docs/api-model) | The [`Model`](https://thinc.ai/docs/api-model) powering the pipeline component. |
|
| `model` | The [`Model`](https://thinc.ai/docs/api-model) powering the pipeline component. ~~Model[List[Doc], List[Floats2d]]~~ |
|
||||||
| `name` | str | String name of the component instance. Used to add entries to the `losses` during training. |
|
| `name` | String name of the component instance. Used to add entries to the `losses` during training. ~~str~~ |
|
||||||
| `moves` | `List[str]` | A list of transition names. Inferred from the data if not provided. |
|
| `moves` | A list of transition names. Inferred from the data if not provided. ~~Optional[List[str]]~~ |
|
||||||
| _keyword-only_ | | |
|
| _keyword-only_ | |
|
||||||
| `update_with_oracle_cut_size` | int | During training, cut long sequences into shorter segments by creating intermediate states based on the gold-standard history. The model is not very sensitive to this parameter, so you usually won't need to change it. `100` is a good default. |
|
| `update_with_oracle_cut_size` | During training, cut long sequences into shorter segments by creating intermediate states based on the gold-standard history. The model is not very sensitive to this parameter, so you usually won't need to change it. `100` is a good default. ~~int~~ |
|
||||||
| `learn_tokens` | bool | Whether to learn to merge subtokens that are split relative to the gold standard. Experimental. |
|
| `learn_tokens` | Whether to learn to merge subtokens that are split relative to the gold standard. Experimental. ~~bool~~ |
|
||||||
| `min_action_freq` | int | The minimum frequency of labelled actions to retain. Rarer labelled actions have their label backed-off to "dep". While this primarily affects the label accuracy, it can also affect the attachment structure, as the labels are used to represent the pseudo-projectivity transformation. |
|
| `min_action_freq` | The minimum frequency of labelled actions to retain. Rarer labelled actions have their label backed-off to "dep". While this primarily affects the label accuracy, it can also affect the attachment structure, as the labels are used to represent the pseudo-projectivity transformation. ~~int~~ |
|
||||||
|
|
||||||
## DependencyParser.\_\_call\_\_ {#call tag="method"}
|
## DependencyParser.\_\_call\_\_ {#call tag="method"}
|
||||||
|
|
||||||
|
@ -111,10 +111,10 @@ and all pipeline components are applied to the `Doc` in order. Both
|
||||||
> processed = parser(doc)
|
> processed = parser(doc)
|
||||||
> ```
|
> ```
|
||||||
|
|
||||||
| Name | Type | Description |
|
| Name | Description |
|
||||||
| ----------- | ----- | ------------------------ |
|
| ----------- | -------------------------------- |
|
||||||
| `doc` | `Doc` | The document to process. |
|
| `doc` | The document to process. ~~Doc~~ |
|
||||||
| **RETURNS** | `Doc` | The processed document. |
|
| **RETURNS** | The processed document. ~~Doc~~ |
|
||||||
|
|
||||||
## DependencyParser.pipe {#pipe tag="method"}
|
## DependencyParser.pipe {#pipe tag="method"}
|
||||||
|
|
||||||
|
@ -133,12 +133,12 @@ applied to the `Doc` in order. Both [`__call__`](/api/dependencyparser#call) and
|
||||||
> pass
|
> pass
|
||||||
> ```
|
> ```
|
||||||
|
|
||||||
| Name | Type | Description |
|
| Name | Description |
|
||||||
| -------------- | --------------- | ------------------------------------------------------ |
|
| -------------- | ------------------------------------------------------------- |
|
||||||
| `stream` | `Iterable[Doc]` | A stream of documents. |
|
| `docs` | A stream of documents. ~~Iterable[Doc]~~ |
|
||||||
| _keyword-only_ | | |
|
| _keyword-only_ | |
|
||||||
| `batch_size` | int | The number of texts to buffer. Defaults to `128`. |
|
| `batch_size` | The number of documents to buffer. Defaults to `128`. ~~int~~ |
|
||||||
| **YIELDS** | `Doc` | Processed documents in the order of the original text. |
|
| **YIELDS** | The processed documents in order. ~~Doc~~ |
|
||||||
|
|
||||||
## DependencyParser.begin_training {#begin_training tag="method"}
|
## DependencyParser.begin_training {#begin_training tag="method"}
|
||||||
|
|
||||||
|
@ -158,13 +158,13 @@ setting up the label scheme based on the data.
|
||||||
> optimizer = parser.begin_training(lambda: [], pipeline=nlp.pipeline)
|
> optimizer = parser.begin_training(lambda: [], pipeline=nlp.pipeline)
|
||||||
> ```
|
> ```
|
||||||
|
|
||||||
| Name | Type | Description |
|
| Name | Description |
|
||||||
| -------------- | --------------------------------------------------- | ------------------------------------------------------------------------------------------------------------------- |
|
| -------------- | ------------------------------------------------------------------------------------------------------------------------------------- |
|
||||||
| `get_examples` | `Callable[[], Iterable[Example]]` | Optional function that returns gold-standard annotations in the form of [`Example`](/api/example) objects. |
|
| `get_examples` | Function that returns gold-standard annotations in the form of [`Example`](/api/example) objects. ~~Callable[[], Iterable[Example]]~~ |
|
||||||
| _keyword-only_ | | |
|
| _keyword-only_ | |
|
||||||
| `pipeline` | `List[Tuple[str, Callable]]` | Optional list of pipeline components that this component is part of. |
|
| `pipeline` | Optional list of pipeline components that this component is part of. ~~Optional[List[Tuple[str, Callable[[Doc], Doc]]]]~~ |
|
||||||
| `sgd` | [`Optimizer`](https://thinc.ai/docs/api-optimizers) | An optional optimizer. Will be created via [`create_optimizer`](/api/dependencyparser#create_optimizer) if not set. |
|
| `sgd` | An optimizer. Will be created via [`create_optimizer`](#create_optimizer) if not set. ~~Optional[Optimizer]~~ |
|
||||||
| **RETURNS** | [`Optimizer`](https://thinc.ai/docs/api-optimizers) | The optimizer. |
|
| **RETURNS** | The optimizer. ~~Optimizer~~ |
|
||||||
|
|
||||||
## DependencyParser.predict {#predict tag="method"}
|
## DependencyParser.predict {#predict tag="method"}
|
||||||
|
|
||||||
|
@ -178,10 +178,10 @@ modifying them.
|
||||||
> scores = parser.predict([doc1, doc2])
|
> scores = parser.predict([doc1, doc2])
|
||||||
> ```
|
> ```
|
||||||
|
|
||||||
| Name | Type | Description |
|
| Name | Description |
|
||||||
| ----------- | ------------------- | ---------------------------------------------- |
|
| ----------- | ------------------------------------------------------------- |
|
||||||
| `docs` | `Iterable[Doc]` | The documents to predict. |
|
| `docs` | The documents to predict. ~~Iterable[Doc]~~ |
|
||||||
| **RETURNS** | `syntax.StateClass` | A helper class for the parse state (internal). |
|
| **RETURNS** | A helper class for the parse state (internal). ~~StateClass~~ |
|
||||||
|
|
||||||
## DependencyParser.set_annotations {#set_annotations tag="method"}
|
## DependencyParser.set_annotations {#set_annotations tag="method"}
|
||||||
|
|
||||||
|
@ -195,10 +195,10 @@ Modify a batch of [`Doc`](/api/doc) objects, using pre-computed scores.
|
||||||
> parser.set_annotations([doc1, doc2], scores)
|
> parser.set_annotations([doc1, doc2], scores)
|
||||||
> ```
|
> ```
|
||||||
|
|
||||||
| Name | Type | Description |
|
| Name | Description |
|
||||||
| -------- | ------------------- | ---------------------------------------------------------- |
|
| -------- | ------------------------------------------------------------------------------------------------------------------------------------- |
|
||||||
| `docs` | `Iterable[Doc]` | The documents to modify. |
|
| `docs` | The documents to modify. ~~Iterable[Doc]~~ |
|
||||||
| `scores` | `syntax.StateClass` | The scores to set, produced by `DependencyParser.predict`. |
|
| `scores` | The scores to set, produced by `DependencyParser.predict`. An internal helper class for the parse state. ~~List[StateClass]~~ |
|
||||||
|
|
||||||
## DependencyParser.update {#update tag="method"}
|
## DependencyParser.update {#update tag="method"}
|
||||||
|
|
||||||
|
@ -214,15 +214,15 @@ model. Delegates to [`predict`](/api/dependencyparser#predict) and
|
||||||
> losses = parser.update(examples, sgd=optimizer)
|
> losses = parser.update(examples, sgd=optimizer)
|
||||||
> ```
|
> ```
|
||||||
|
|
||||||
| Name | Type | Description |
|
| Name | Description |
|
||||||
| ----------------- | ------------------- | ---------------------------------------------------------------------------------------------------------------------------------------------- |
|
| ----------------- | ---------------------------------------------------------------------------------------------------------------------------------- |
|
||||||
| `examples` | `Iterable[Example]` | A batch of [`Example`](/api/example) objects to learn from. |
|
| `examples` | A batch of [`Example`](/api/example) objects to learn from. ~~Iterable[Example]~~ |
|
||||||
| _keyword-only_ | | |
|
| _keyword-only_ | | |
|
||||||
| `drop` | float | The dropout rate. |
|
| `drop` | The dropout rate. ~~float~~ |
|
||||||
| `set_annotations` | bool | Whether or not to update the `Example` objects with the predictions, delegating to [`set_annotations`](/api/dependencyparser#set_annotations). |
|
| `set_annotations` | Whether or not to update the `Example` objects with the predictions, delegating to [`set_annotations`](#set_annotations). ~~bool~~ |
|
||||||
| `sgd` | `Optimizer` | The [`Optimizer`](https://thinc.ai/docs/api-optimizers) object. |
|
| `sgd` | An optimizer. Will be created via [`create_optimizer`](#create_optimizer) if not set. ~~Optional[Optimizer]~~ |
|
||||||
| `losses` | `Dict[str, float]` | Optional record of the loss during training. Updated using the component name as the key. |
|
| `losses` | Optional record of the loss during training. Updated using the component name as the key. ~~Optional[Dict[str, float]]~~ |
|
||||||
| **RETURNS** | `Dict[str, float]` | The updated `losses` dictionary. |
|
| **RETURNS** | The updated `losses` dictionary. ~~Dict[str, float]~~ |
|
||||||
|
|
||||||
## DependencyParser.get_loss {#get_loss tag="method"}
|
## DependencyParser.get_loss {#get_loss tag="method"}
|
||||||
|
|
||||||
|
@ -237,11 +237,11 @@ predicted scores.
|
||||||
> loss, d_loss = parser.get_loss(examples, scores)
|
> loss, d_loss = parser.get_loss(examples, scores)
|
||||||
> ```
|
> ```
|
||||||
|
|
||||||
| Name | Type | Description |
|
| Name | Description |
|
||||||
| ----------- | --------------------- | --------------------------------------------------- |
|
| ----------- | --------------------------------------------------------------------------- |
|
||||||
| `examples` | `Iterable[Example]` | The batch of examples. |
|
| `examples` | The batch of examples. ~~Iterable[Example]~~ |
|
||||||
| `scores` | `syntax.StateClass` | Scores representing the model's predictions. |
|
| `scores` | Scores representing the model's predictions. ~~StateClass~~ |
|
||||||
| **RETURNS** | `Tuple[float, float]` | The loss and the gradient, i.e. `(loss, gradient)`. |
|
| **RETURNS** | The loss and the gradient, i.e. `(loss, gradient)`. ~~Tuple[float, float]~~ |
|
||||||
|
|
||||||
## DependencyParser.score {#score tag="method" new="3"}
|
## DependencyParser.score {#score tag="method" new="3"}
|
||||||
|
|
||||||
|
@ -253,10 +253,10 @@ Score a batch of examples.
|
||||||
> scores = parser.score(examples)
|
> scores = parser.score(examples)
|
||||||
> ```
|
> ```
|
||||||
|
|
||||||
| Name | Type | Description |
|
| Name | Description |
|
||||||
| ----------- | ------------------- | -------------------------------------------------------------------------------------------------------------------------- |
|
| ----------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------ |
|
||||||
| `examples` | `Iterable[Example]` | The examples to score. |
|
| `examples` | The examples to score. ~~Iterable[Example]~~ |
|
||||||
| **RETURNS** | `Dict[str, Any]` | The scores, produced by [`Scorer.score_spans`](/api/scorer#score_spans) and [`Scorer.score_deps`](/api/scorer#score_deps). |
|
| **RETURNS** | The scores, produced by [`Scorer.score_spans`](/api/scorer#score_spans) and [`Scorer.score_deps`](/api/scorer#score_deps). ~~Dict[str, Union[float, Dict[str, float]]]~~ |
|
||||||
|
|
||||||
## DependencyParser.create_optimizer {#create_optimizer tag="method"}
|
## DependencyParser.create_optimizer {#create_optimizer tag="method"}
|
||||||
|
|
||||||
|
@ -270,9 +270,9 @@ component.
|
||||||
> optimizer = parser.create_optimizer()
|
> optimizer = parser.create_optimizer()
|
||||||
> ```
|
> ```
|
||||||
|
|
||||||
| Name | Type | Description |
|
| Name | Description |
|
||||||
| ----------- | --------------------------------------------------- | -------------- |
|
| ----------- | ---------------------------- |
|
||||||
| **RETURNS** | [`Optimizer`](https://thinc.ai/docs/api-optimizers) | The optimizer. |
|
| **RETURNS** | The optimizer. ~~Optimizer~~ |
|
||||||
|
|
||||||
## DependencyParser.use_params {#use_params tag="method, contextmanager"}
|
## DependencyParser.use_params {#use_params tag="method, contextmanager"}
|
||||||
|
|
||||||
|
@ -287,9 +287,9 @@ context, the original parameters are restored.
|
||||||
> parser.to_disk("/best_model")
|
> parser.to_disk("/best_model")
|
||||||
> ```
|
> ```
|
||||||
|
|
||||||
| Name | Type | Description |
|
| Name | Description |
|
||||||
| -------- | ---- | ----------------------------------------- |
|
| -------- | -------------------------------------------------- |
|
||||||
| `params` | dict | The parameter values to use in the model. |
|
| `params` | The parameter values to use in the model. ~~dict~~ |
|
||||||
|
|
||||||
## DependencyParser.add_label {#add_label tag="method"}
|
## DependencyParser.add_label {#add_label tag="method"}
|
||||||
|
|
||||||
|
@ -302,10 +302,10 @@ Add a new label to the pipe.
|
||||||
> parser.add_label("MY_LABEL")
|
> parser.add_label("MY_LABEL")
|
||||||
> ```
|
> ```
|
||||||
|
|
||||||
| Name | Type | Description |
|
| Name | Description |
|
||||||
| ----------- | ---- | --------------------------------------------------- |
|
| ----------- | ----------------------------------------------------------- |
|
||||||
| `label` | str | The label to add. |
|
| `label` | The label to add. ~~str~~ |
|
||||||
| **RETURNS** | int | `0` if the label is already present, otherwise `1`. |
|
| **RETURNS** | `0` if the label is already present, otherwise `1`. ~~int~~ |
|
||||||
|
|
||||||
## DependencyParser.to_disk {#to_disk tag="method"}
|
## DependencyParser.to_disk {#to_disk tag="method"}
|
||||||
|
|
||||||
|
@ -318,11 +318,11 @@ Serialize the pipe to disk.
|
||||||
> parser.to_disk("/path/to/parser")
|
> parser.to_disk("/path/to/parser")
|
||||||
> ```
|
> ```
|
||||||
|
|
||||||
| Name | Type | Description |
|
| Name | Description |
|
||||||
| -------------- | --------------- | --------------------------------------------------------------------------------------------------------------------- |
|
| -------------- | ------------------------------------------------------------------------------------------------------------------------------------------ |
|
||||||
| `path` | str / `Path` | A path to a directory, which will be created if it doesn't exist. Paths may be either strings or `Path`-like objects. |
|
| `path` | A path to a directory, which will be created if it doesn't exist. Paths may be either strings or `Path`-like objects. ~~Union[str, Path]~~ |
|
||||||
| _keyword-only_ | | |
|
| _keyword-only_ | |
|
||||||
| `exclude` | `Iterable[str]` | String names of [serialization fields](#serialization-fields) to exclude. |
|
| `exclude` | String names of [serialization fields](#serialization-fields) to exclude. ~~Iterable[str]~~ |
|
||||||
|
|
||||||
## DependencyParser.from_disk {#from_disk tag="method"}
|
## DependencyParser.from_disk {#from_disk tag="method"}
|
||||||
|
|
||||||
|
@ -335,12 +335,12 @@ Load the pipe from disk. Modifies the object in place and returns it.
|
||||||
> parser.from_disk("/path/to/parser")
|
> parser.from_disk("/path/to/parser")
|
||||||
> ```
|
> ```
|
||||||
|
|
||||||
| Name | Type | Description |
|
| Name | Description |
|
||||||
| -------------- | ------------------ | -------------------------------------------------------------------------- |
|
| -------------- | ----------------------------------------------------------------------------------------------- |
|
||||||
| `path` | str / `Path` | A path to a directory. Paths may be either strings or `Path`-like objects. |
|
| `path` | A path to a directory. Paths may be either strings or `Path`-like objects. ~~Union[str, Path]~~ |
|
||||||
| _keyword-only_ | | |
|
| _keyword-only_ | |
|
||||||
| `exclude` | `Iterable[str]` | String names of [serialization fields](#serialization-fields) to exclude. |
|
| `exclude` | String names of [serialization fields](#serialization-fields) to exclude. ~~Iterable[str]~~ |
|
||||||
| **RETURNS** | `DependencyParser` | The modified `DependencyParser` object. |
|
| **RETURNS** | The modified `DependencyParser` object. ~~DependencyParser~~ |
|
||||||
|
|
||||||
## DependencyParser.to_bytes {#to_bytes tag="method"}
|
## DependencyParser.to_bytes {#to_bytes tag="method"}
|
||||||
|
|
||||||
|
@ -353,11 +353,11 @@ Load the pipe from disk. Modifies the object in place and returns it.
|
||||||
|
|
||||||
Serialize the pipe to a bytestring.
|
Serialize the pipe to a bytestring.
|
||||||
|
|
||||||
| Name | Type | Description |
|
| Name | Description |
|
||||||
| -------------- | --------------- | ------------------------------------------------------------------------- |
|
| -------------- | ------------------------------------------------------------------------------------------- |
|
||||||
| _keyword-only_ | | |
|
| _keyword-only_ | |
|
||||||
| `exclude` | `Iterable[str]` | String names of [serialization fields](#serialization-fields) to exclude. |
|
| `exclude` | String names of [serialization fields](#serialization-fields) to exclude. ~~Iterable[str]~~ |
|
||||||
| **RETURNS** | bytes | The serialized form of the `DependencyParser` object. |
|
| **RETURNS** | The serialized form of the `DependencyParser` object. ~~bytes~~ |
|
||||||
|
|
||||||
## DependencyParser.from_bytes {#from_bytes tag="method"}
|
## DependencyParser.from_bytes {#from_bytes tag="method"}
|
||||||
|
|
||||||
|
@ -371,12 +371,12 @@ Load the pipe from a bytestring. Modifies the object in place and returns it.
|
||||||
> parser.from_bytes(parser_bytes)
|
> parser.from_bytes(parser_bytes)
|
||||||
> ```
|
> ```
|
||||||
|
|
||||||
| Name | Type | Description |
|
| Name | Description |
|
||||||
| -------------- | ------------------ | ------------------------------------------------------------------------- |
|
| -------------- | ------------------------------------------------------------------------------------------- |
|
||||||
| `bytes_data` | bytes | The data to load from. |
|
| `bytes_data` | The data to load from. ~~bytes~~ |
|
||||||
| _keyword-only_ | | |
|
| _keyword-only_ | |
|
||||||
| `exclude` | `Iterable[str]` | String names of [serialization fields](#serialization-fields) to exclude. |
|
| `exclude` | String names of [serialization fields](#serialization-fields) to exclude. ~~Iterable[str]~~ |
|
||||||
| **RETURNS** | `DependencyParser` | The `DependencyParser` object. |
|
| **RETURNS** | The `DependencyParser` object. ~~DependencyParser~~ |
|
||||||
|
|
||||||
## DependencyParser.labels {#labels tag="property"}
|
## DependencyParser.labels {#labels tag="property"}
|
||||||
|
|
||||||
|
@ -389,9 +389,9 @@ The labels currently added to the component.
|
||||||
> assert "MY_LABEL" in parser.labels
|
> assert "MY_LABEL" in parser.labels
|
||||||
> ```
|
> ```
|
||||||
|
|
||||||
| Name | Type | Description |
|
| Name | Description |
|
||||||
| ----------- | ----- | ---------------------------------- |
|
| ----------- | ------------------------------------------------------ |
|
||||||
| **RETURNS** | tuple | The labels added to the component. |
|
| **RETURNS** | The labels added to the component. ~~Tuple[str, ...]~~ |
|
||||||
|
|
||||||
## Serialization fields {#serialization-fields}
|
## Serialization fields {#serialization-fields}
|
||||||
|
|
||||||
|
|
|
@ -30,11 +30,11 @@ Construct a `Doc` object. The most common way to get a `Doc` object is via the
|
||||||
> doc = Doc(nlp.vocab, words=words, spaces=spaces)
|
> doc = Doc(nlp.vocab, words=words, spaces=spaces)
|
||||||
> ```
|
> ```
|
||||||
|
|
||||||
| Name | Type | Description |
|
| Name | Description |
|
||||||
| -------- | -------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
|
| -------- | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
|
||||||
| `vocab` | `Vocab` | A storage container for lexical types. |
|
| `vocab` | A storage container for lexical types. ~~Vocab~~ |
|
||||||
| `words` | iterable | A list of strings to add to the container. |
|
| `words` | A list of strings to add to the container. ~~Optional[List[str]]~~ |
|
||||||
| `spaces` | iterable | A list of boolean values indicating whether each word has a subsequent space. Must have the same length as `words`, if specified. Defaults to a sequence of `True`. |
|
| `spaces` | A list of boolean values indicating whether each word has a subsequent space. Must have the same length as `words`, if specified. Defaults to a sequence of `True`. ~~Optional[List[bool]]~~ |
|
||||||
|
|
||||||
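The `words`/`spaces` pair in the constructor above fully determines the document text. A spaCy-free sketch of that contract (the helper name is illustrative, not part of the API), including the documented default of a `True` for every word:

```python
def join_words(words, spaces=None):
    """Reconstruct the text implied by a words/spaces pair.

    If spaces is None, default to a trailing space after every word,
    mirroring the Doc constructor's default of a sequence of True.
    """
    if spaces is None:
        spaces = [True] * len(words)
    if len(words) != len(spaces):
        raise ValueError("words and spaces must have the same length")
    return "".join(w + (" " if s else "") for w, s in zip(words, spaces))

print(join_words(["Hello", ",", "world", "!"], [False, True, False, False]))
# Hello, world!
```

Passing mismatched lengths raises, which is also what the real constructor enforces.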
## Doc.\_\_getitem\_\_ {#getitem tag="method"}
|
## Doc.\_\_getitem\_\_ {#getitem tag="method"}
|
||||||
|
|
||||||
|
@ -52,10 +52,10 @@ Negative indexing is supported, and follows the usual Python semantics, i.e.
|
||||||
> assert span.text == "it back"
|
> assert span.text == "it back"
|
||||||
> ```
|
> ```
|
||||||
|
|
||||||
| Name | Type | Description |
|
| Name | Description |
|
||||||
| ----------- | ------- | ----------------------- |
|
| ----------- | -------------------------------- |
|
||||||
| `i` | int | The index of the token. |
|
| `i` | The index of the token. ~~int~~ |
|
||||||
| **RETURNS** | `Token` | The token at `doc[i]`. |
|
| **RETURNS** | The token at `doc[i]`. ~~Token~~ |
|
||||||
|
|
||||||
Get a [`Span`](/api/span) object, starting at position `start` (token index) and
|
Get a [`Span`](/api/span) object, starting at position `start` (token index) and
|
||||||
ending at position `end` (token index). For instance, `doc[2:5]` produces a span
|
ending at position `end` (token index). For instance, `doc[2:5]` produces a span
|
||||||
|
@ -64,10 +64,10 @@ are not supported, as `Span` objects must be contiguous (cannot have gaps). You
|
||||||
can use negative indices and open-ended ranges, which have their normal Python
|
can use negative indices and open-ended ranges, which have their normal Python
|
||||||
semantics.
|
semantics.
|
||||||
|
|
||||||
| Name | Type | Description |
|
| Name | Description |
|
||||||
| ----------- | ------ | --------------------------------- |
|
| ----------- | ----------------------------------------------------- |
|
||||||
| `start_end` | tuple | The slice of the document to get. |
|
| `start_end` | The slice of the document to get. ~~Tuple[int, int]~~ |
|
||||||
| **RETURNS** | `Span` | The span at `doc[start:end]`. |
|
| **RETURNS** | The span at `doc[start:end]`. ~~Span~~ |
|
||||||
|
|
||||||
## Doc.\_\_iter\_\_ {#iter tag="method"}
|
## Doc.\_\_iter\_\_ {#iter tag="method"}
|
||||||
|
|
||||||
|
@ -85,9 +85,9 @@ main way annotations are accessed from Python. If faster-than-Python speeds are
|
||||||
required, you can instead access the annotations as a numpy array, or access the
|
required, you can instead access the annotations as a numpy array, or access the
|
||||||
underlying C data directly from Cython.
|
underlying C data directly from Cython.
|
||||||
|
|
||||||
| Name | Type | Description |
|
| Name | Description |
|
||||||
| ---------- | ------- | ----------------- |
|
| ---------- | --------------------------- |
|
||||||
| **YIELDS** | `Token` | A `Token` object. |
|
| **YIELDS** | A `Token` object. ~~Token~~ |
|
||||||
|
|
||||||
## Doc.\_\_len\_\_ {#len tag="method"}
|
## Doc.\_\_len\_\_ {#len tag="method"}
|
||||||
|
|
||||||
|
@ -100,9 +100,9 @@ Get the number of tokens in the document.
|
||||||
> assert len(doc) == 7
|
> assert len(doc) == 7
|
||||||
> ```
|
> ```
|
||||||
|
|
||||||
| Name | Type | Description |
|
| Name | Description |
|
||||||
| ----------- | ---- | ------------------------------------- |
|
| ----------- | --------------------------------------------- |
|
||||||
| **RETURNS** | int | The number of tokens in the document. |
|
| **RETURNS** | The number of tokens in the document. ~~int~~ |
|
||||||
|
|
||||||
## Doc.set_extension {#set_extension tag="classmethod" new="2"}
|
## Doc.set_extension {#set_extension tag="classmethod" new="2"}
|
||||||
|
|
||||||
|
@ -120,14 +120,14 @@ details, see the documentation on
|
||||||
> assert doc._.has_city
|
> assert doc._.has_city
|
||||||
> ```
|
> ```
|
||||||
|
|
||||||
| Name | Type | Description |
|
| Name | Description |
|
||||||
| --------- | -------- | ----------------------------------------------------------------------------------------------------------------------------------- |
|
| --------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
|
||||||
| `name` | str | Name of the attribute to set by the extension. For example, `"my_attr"` will be available as `doc._.my_attr`. |
|
| `name` | Name of the attribute to set by the extension. For example, `"my_attr"` will be available as `doc._.my_attr`. ~~str~~ |
|
||||||
| `default` | - | Optional default value of the attribute if no getter or method is defined. |
|
| `default` | Optional default value of the attribute if no getter or method is defined. ~~Optional[Any]~~ |
|
||||||
| `method` | callable | Set a custom method on the object, for example `doc._.compare(other_doc)`. |
|
| `method` | Set a custom method on the object, for example `doc._.compare(other_doc)`. ~~Optional[Callable[[Doc, ...], Any]]~~ |
|
||||||
| `getter` | callable | Getter function that takes the object and returns an attribute value. Is called when the user accesses the `._` attribute. |
|
| `getter` | Getter function that takes the object and returns an attribute value. Is called when the user accesses the `._` attribute. ~~Optional[Callable[[Doc], Any]]~~ |
|
||||||
| `setter` | callable | Setter function that takes the `Doc` and a value, and modifies the object. Is called when the user writes to the `Doc._` attribute. |
|
| `setter` | Setter function that takes the `Doc` and a value, and modifies the object. Is called when the user writes to the `Doc._` attribute. ~~Optional[Callable[[Doc, Any], None]]~~ |
|
||||||
| `force` | bool | Force overwriting existing attribute. |
|
| `force` | Force overwriting existing attribute. ~~bool~~ |
|
||||||
|
|
||||||
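As the `get_extension` example below shows, each extension boils down to a `(default, method, getter, setter)` tuple. A minimal, spaCy-free registry sketch of that bookkeeping (class and method names are illustrative, not spaCy internals):

```python
class ExtensionRegistry:
    """Toy registry mimicking the (default, method, getter, setter) tuples."""

    def __init__(self):
        self._extensions = {}

    def set_extension(self, name, default=None, method=None, getter=None,
                      setter=None, force=False):
        # Refuse to silently overwrite unless force=True, like Doc.set_extension.
        if name in self._extensions and not force:
            raise ValueError(f"Extension {name!r} already exists")
        self._extensions[name] = (default, method, getter, setter)

    def get_extension(self, name):
        return self._extensions.get(name)

    def has_extension(self, name):
        return name in self._extensions

    def remove_extension(self, name):
        # Returns the removed (default, method, getter, setter) tuple.
        return self._extensions.pop(name)

reg = ExtensionRegistry()
reg.set_extension("has_city", default=False)
assert reg.get_extension("has_city") == (False, None, None, None)
```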
## Doc.get_extension {#get_extension tag="classmethod" new="2"}
|
## Doc.get_extension {#get_extension tag="classmethod" new="2"}
|
||||||
|
|
||||||
|
@ -144,10 +144,10 @@ Look up a previously registered extension by name. Returns a 4-tuple
|
||||||
> assert extension == (False, None, None, None)
|
> assert extension == (False, None, None, None)
|
||||||
> ```
|
> ```
|
||||||
|
|
||||||
| Name | Type | Description |
|
| Name | Description |
|
||||||
| ----------- | ----- | ------------------------------------------------------------- |
|
| ----------- | -------------------------------------------------------------------------------------------------------------------------------------------------- |
|
||||||
| `name` | str | Name of the extension. |
|
| `name` | Name of the extension. ~~str~~ |
|
||||||
| **RETURNS** | tuple | A `(default, method, getter, setter)` tuple of the extension. |
|
| **RETURNS** | A `(default, method, getter, setter)` tuple of the extension. ~~Tuple[Optional[Any], Optional[Callable], Optional[Callable], Optional[Callable]]~~ |
|
||||||
|
|
||||||
## Doc.has_extension {#has_extension tag="classmethod" new="2"}
|
## Doc.has_extension {#has_extension tag="classmethod" new="2"}
|
||||||
|
|
||||||
|
@ -161,10 +161,10 @@ Check whether an extension has been registered on the `Doc` class.
|
||||||
> assert Doc.has_extension("has_city")
|
> assert Doc.has_extension("has_city")
|
||||||
> ```
|
> ```
|
||||||
|
|
||||||
| Name | Type | Description |
|
| Name | Description |
|
||||||
| ----------- | ---- | ------------------------------------------ |
|
| ----------- | --------------------------------------------------- |
|
||||||
| `name` | str | Name of the extension to check. |
|
| `name` | Name of the extension to check. ~~str~~ |
|
||||||
| **RETURNS** | bool | Whether the extension has been registered. |
|
| **RETURNS** | Whether the extension has been registered. ~~bool~~ |
|
||||||
|
|
||||||
## Doc.remove_extension {#remove_extension tag="classmethod" new="2.0.12"}
|
## Doc.remove_extension {#remove_extension tag="classmethod" new="2.0.12"}
|
||||||
|
|
||||||
|
@ -179,10 +179,10 @@ Remove a previously registered extension.
|
||||||
> assert not Doc.has_extension("has_city")
|
> assert not Doc.has_extension("has_city")
|
||||||
> ```
|
> ```
|
||||||
|
|
||||||
| Name | Type | Description |
|
| Name | Description |
|
||||||
| ----------- | ----- | --------------------------------------------------------------------- |
|
| ----------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------- |
|
||||||
| `name` | str | Name of the extension. |
|
| `name` | Name of the extension. ~~str~~ |
|
||||||
| **RETURNS** | tuple | A `(default, method, getter, setter)` tuple of the removed extension. |
|
| **RETURNS** | A `(default, method, getter, setter)` tuple of the removed extension. ~~Tuple[Optional[Any], Optional[Callable], Optional[Callable], Optional[Callable]]~~ |
|
||||||
|
|
||||||
## Doc.char_span {#char_span tag="method" new="2"}
|
## Doc.char_span {#char_span tag="method" new="2"}
|
||||||
|
|
||||||
|
@ -197,14 +197,14 @@ the character indices don't map to a valid span.
|
||||||
> assert span.text == "New York"
|
> assert span.text == "New York"
|
||||||
> ```
|
> ```
|
||||||
|
|
||||||
| Name | Type | Description |
|
| Name | Description |
|
||||||
| ------------------------------------ | ---------------------------------------- | --------------------------------------------------------------------- |
|
| ------------------------------------ | ----------------------------------------------------------------------------------------- |
|
||||||
| `start` | int | The index of the first character of the span. |
|
| `start` | The index of the first character of the span. ~~int~~ |
|
||||||
| `end` | int | The index of the last character after the span. |
|
| `end` | The index of the last character after the span. ~~int~~ |
|
||||||
| `label` | uint64 / str | A label to attach to the span, e.g. for named entities. |
|
| `label` | A label to attach to the span, e.g. for named entities. ~~Union[int, str]~~ |
|
||||||
| `kb_id` <Tag variant="new">2.2</Tag> | uint64 / str | An ID from a knowledge base to capture the meaning of a named entity. |
|
| `kb_id` <Tag variant="new">2.2</Tag> | An ID from a knowledge base to capture the meaning of a named entity. ~~Union[int, str]~~ |
|
||||||
| `vector` | `numpy.ndarray[ndim=1, dtype="float32"]` | A meaning representation of the span. |
|
| `vector` | A meaning representation of the span. ~~numpy.ndarray[ndim=1, dtype=float32]~~ |
|
||||||
| **RETURNS** | `Span` | The newly constructed object or `None`. |
|
| **RETURNS** | The newly constructed object or `None`. ~~Optional[Span]~~ |
|
||||||
|
|
||||||
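The alignment rule documented for `char_span` — return `None` when the character indices don't land on token boundaries — can be illustrated without spaCy, using per-token character offsets (the helper is hypothetical):

```python
def char_to_token_span(token_offsets, start, end):
    """Map a character range to a (start_token, end_token) token slice.

    token_offsets is a list of (char_start, char_end) pairs, one per token.
    Returns None if start/end don't align with token boundaries, mirroring
    Doc.char_span's behaviour.
    """
    starts = {s: i for i, (s, e) in enumerate(token_offsets)}
    ends = {e: i + 1 for i, (s, e) in enumerate(token_offsets)}
    if start not in starts or end not in ends:
        return None
    return starts[start], ends[end]

# "I like New York" -> tokens at these character offsets:
offsets = [(0, 1), (2, 6), (7, 10), (11, 15)]
print(char_to_token_span(offsets, 7, 15))   # (2, 4): "New York"
print(char_to_token_span(offsets, 7, 13))   # None: ends mid-token
```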
## Doc.similarity {#similarity tag="method" model="vectors"}
|
## Doc.similarity {#similarity tag="method" model="vectors"}
|
||||||
|
|
||||||
|
@ -221,10 +221,10 @@ using an average of word vectors.
|
||||||
> assert apples_oranges == oranges_apples
|
> assert apples_oranges == oranges_apples
|
||||||
> ```
|
> ```
|
||||||
|
|
||||||
| Name | Type | Description |
|
| Name | Description |
|
||||||
| ----------- | ----- | -------------------------------------------------------------------------------------------- |
|
| ----------- | -------------------------------------------------------------------------------------------------------------------------------- |
|
||||||
| `other` | - | The object to compare with. By default, accepts `Doc`, `Span`, `Token` and `Lexeme` objects. |
|
| `other` | The object to compare with. By default, accepts `Doc`, `Span`, `Token` and `Lexeme` objects. ~~Union[Doc, Span, Token, Lexeme]~~ |
|
||||||
| **RETURNS** | float | A scalar similarity score. Higher is more similar. |
|
| **RETURNS** | A scalar similarity score. Higher is more similar. ~~float~~ |
|
||||||
|
|
||||||
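The default similarity above is the cosine of averaged word vectors, which is why `apples_oranges == oranges_apples` holds in the example. A spaCy-free sketch of that computation on toy 2-d vectors:

```python
import math

def average_vector(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

doc1 = [[1.0, 0.0], [0.0, 1.0]]  # toy per-token word vectors
doc2 = [[0.0, 1.0], [1.0, 0.0]]
# Averaging makes the score symmetric and insensitive to token order:
sim = cosine(average_vector(doc1), average_vector(doc2))
print(abs(sim - 1.0) < 1e-9)
```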
## Doc.count_by {#count_by tag="method"}
|
## Doc.count_by {#count_by tag="method"}
|
||||||
|
|
||||||
|
@ -237,15 +237,15 @@ attribute ID.
|
||||||
> ```python
|
> ```python
|
||||||
> from spacy.attrs import ORTH
|
> from spacy.attrs import ORTH
|
||||||
> doc = nlp("apple apple orange banana")
|
> doc = nlp("apple apple orange banana")
|
||||||
> assert doc.count_by(ORTH) == {7024L: 1, 119552L: 1, 2087L: 2}
|
> assert doc.count_by(ORTH) == {7024: 1, 119552: 1, 2087: 2}
|
||||||
> doc.to_array([ORTH])
|
> doc.to_array([ORTH])
|
||||||
> # array([[11880], [11880], [7561], [12800]])
|
> # array([[11880], [11880], [7561], [12800]])
|
||||||
> ```
|
> ```
|
||||||
|
|
||||||
| Name | Type | Description |
|
| Name | Description |
|
||||||
| ----------- | ---- | -------------------------------------------------- |
|
| ----------- | --------------------------------------------------------------------- |
|
||||||
| `attr_id` | int | The attribute ID |
|
| `attr_id` | The attribute ID. ~~int~~ |
|
||||||
| **RETURNS** | dict | A dictionary mapping attributes to integer counts. |
|
| **RETURNS** | A dictionary mapping attributes to integer counts. ~~Dict[int, int]~~ |
|
||||||
|
|
||||||
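The dict returned by `count_by` is simply a frequency table over hashed attribute values. A spaCy-free sketch with `collections.Counter`, using the toy ORTH hashes from the example above:

```python
from collections import Counter

def count_by(attr_values):
    """Count occurrences of each attribute value, like Doc.count_by."""
    return dict(Counter(attr_values))

# Toy ORTH hashes for "apple apple orange banana":
orth_ids = [2087, 2087, 7024, 119552]
print(count_by(orth_ids))  # {2087: 2, 7024: 1, 119552: 1}
```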
## Doc.get_lca_matrix {#get_lca_matrix tag="method"}
|
## Doc.get_lca_matrix {#get_lca_matrix tag="method"}
|
||||||
|
|
||||||
|
@ -261,9 +261,9 @@ ancestor is found, e.g. if span excludes a necessary ancestor.
|
||||||
> # array([[0, 1, 1, 1], [1, 1, 1, 1], [1, 1, 2, 3], [1, 1, 3, 3]], dtype=int32)
|
> # array([[0, 1, 1, 1], [1, 1, 1, 1], [1, 1, 2, 3], [1, 1, 3, 3]], dtype=int32)
|
||||||
> ```
|
> ```
|
||||||
|
|
||||||
| Name | Type | Description |
|
| Name | Description |
|
||||||
| ----------- | -------------------------------------- | ----------------------------------------------- |
|
| ----------- | -------------------------------------------------------------------------------------- |
|
||||||
| **RETURNS** | `numpy.ndarray[ndim=2, dtype="int32"]` | The lowest common ancestor matrix of the `Doc`. |
|
| **RETURNS** | The lowest common ancestor matrix of the `Doc`. ~~numpy.ndarray[ndim=2, dtype=int32]~~ |
|
||||||
|
|
||||||
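The LCA matrix above can be computed from a head-index array alone. A minimal pure-Python sketch, assuming `heads[i]` gives the index of token `i`'s head and the root points to itself (here "ancestor" includes the token itself, matching the diagonal in the example output):

```python
def ancestors(i, heads):
    """Return i followed by its ancestors up to the root, nearest first."""
    chain = [i]
    while heads[i] != i:
        i = heads[i]
        chain.append(i)
    return chain

def lca_matrix(heads):
    """n x n matrix of lowest common ancestors, like Doc.get_lca_matrix."""
    n = len(heads)
    matrix = [[-1] * n for _ in range(n)]
    for i in range(n):
        anc_i = set(ancestors(i, heads))
        for j in range(n):
            # The first ancestor of j (nearest first) that is also an
            # ancestor of i is the deepest common ancestor.
            for tok in ancestors(j, heads):
                if tok in anc_i:
                    matrix[i][j] = tok
                    break
    return matrix

# "This is a test": token 1 ("is") is the root.
heads = [1, 1, 3, 1]
print(lca_matrix(heads))
# [[0, 1, 1, 1], [1, 1, 1, 1], [1, 1, 2, 3], [1, 1, 3, 3]]
```

The output reproduces the example array in the docstring above.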
## Doc.to_array {#to_array tag="method"}
|
## Doc.to_array {#to_array tag="method"}
|
||||||
|
|
||||||
|
@ -288,10 +288,10 @@ Returns a 2D array with one row per token and one column per attribute (when
|
||||||
> np_array = doc.to_array("POS")
|
> np_array = doc.to_array("POS")
|
||||||
> ```
|
> ```
|
||||||
|
|
||||||
| Name | Type | Description |
|
| Name | Description |
|
||||||
| ----------- | ---------------------------------------------------------------------------------- | -------------------------------------------------------------------------------------------- |
|
| ----------- | ---------------------------------------------------------------------------------------------------------------------------------------- |
|
||||||
| `attr_ids` | list or int or string | A list of attributes (int IDs or string names) or a single attribute (int ID or string name) |
|
| `attr_ids` | A list of attributes (int IDs or string names) or a single attribute (int ID or string name). ~~Union[int, str, List[Union[int, str]]]~~ |
|
||||||
| **RETURNS** | `numpy.ndarray[ndim=2, dtype="uint64"]` or `numpy.ndarray[ndim=1, dtype="uint64"]` | The exported attributes as a numpy array. |
|
| **RETURNS** | The exported attributes as a numpy array. ~~Union[numpy.ndarray[ndim=2, dtype=uint64], numpy.ndarray[ndim=1, dtype=uint64]]~~ |
|
||||||
|
|
||||||
## Doc.from_array {#from_array tag="method"}
|
## Doc.from_array {#from_array tag="method"}
|
||||||
|
|
||||||
|
@ -310,14 +310,14 @@ array of attributes.
|
||||||
> assert doc[0].pos_ == doc2[0].pos_
|
> assert doc[0].pos_ == doc2[0].pos_
|
||||||
> ```
|
> ```
|
||||||
|
|
||||||
| Name | Type | Description |
|
| Name | Description |
|
||||||
| ----------- | -------------------------------------- | ------------------------------------------------------------------------- |
|
| ----------- | ------------------------------------------------------------------------------------------- |
|
||||||
| `attrs` | list | A list of attribute ID ints. |
|
| `attrs` | A list of attribute ID ints. ~~List[int]~~ |
|
||||||
| `array` | `numpy.ndarray[ndim=2, dtype="int32"]` | The attribute values to load. |
|
| `array` | The attribute values to load. ~~numpy.ndarray[ndim=2, dtype=int32]~~ |
|
||||||
| `exclude` | list | String names of [serialization fields](#serialization-fields) to exclude. |
|
| `exclude` | String names of [serialization fields](#serialization-fields) to exclude. ~~Iterable[str]~~ |
|
||||||
| **RETURNS** | `Doc` | Itself. |
|
| **RETURNS** | The `Doc` itself. ~~Doc~~ |
|
||||||
|
|
||||||
## Doc.from_docs {#from_docs tag="staticmethod"}
|
## Doc.from_docs {#from_docs tag="staticmethod" new="3"}
|
||||||
|
|
||||||
Concatenate multiple `Doc` objects to form a new one. Raises an error if the
|
Concatenate multiple `Doc` objects to form a new one. Raises an error if the
|
||||||
`Doc` objects do not all share the same `Vocab`.
|
`Doc` objects do not all share the same `Vocab`.
|
||||||
|
@ -337,12 +337,12 @@ Concatenate multiple `Doc` objects to form a new one. Raises an error if the
|
||||||
> [str(ent) for doc in docs for ent in doc.ents]
|
> [str(ent) for doc in docs for ent in doc.ents]
|
||||||
> ```
|
> ```
|
||||||
|
|
||||||
| Name | Type | Description |
|
| Name | Description |
|
||||||
| ------------------- | ----- | ----------------------------------------------------------------------------------------------- |
|
| ------------------- | ----------------------------------------------------------------------------------------------------------------- |
|
||||||
| `docs` | list | A list of `Doc` objects. |
|
| `docs` | A list of `Doc` objects. ~~List[Doc]~~ |
|
||||||
| `ensure_whitespace` | bool | Insert a space between two adjacent docs whenever the first doc does not end in whitespace. |
|
| `ensure_whitespace` | Insert a space between two adjacent docs whenever the first doc does not end in whitespace. ~~bool~~ |
|
||||||
| `attrs` | list | Optional list of attribute ID ints or attribute name strings. |
|
| `attrs` | Optional list of attribute ID ints or attribute name strings. ~~Optional[List[Union[str, int]]]~~ |
|
||||||
| **RETURNS** | `Doc` | The new `Doc` object that is containing the other docs or `None`, if `docs` is empty or `None`. |
|
| **RETURNS** | The new `Doc` object containing the concatenated docs, or `None` if `docs` is empty or `None`. ~~Optional[Doc]~~ |
|
||||||
|
|
||||||
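The `ensure_whitespace` behaviour documented above — insert a space between two adjacent docs whenever the first doesn't end in whitespace — can be sketched on plain strings (illustrative only; the real method concatenates token-level data, not text):

```python
def concat_texts(texts, ensure_whitespace=True):
    """Join doc texts, inserting a space where one is missing between docs."""
    if not texts:
        return None  # mirrors from_docs returning None for empty input
    out = []
    for i, text in enumerate(texts):
        out.append(text)
        is_last = i == len(texts) - 1
        if ensure_whitespace and not is_last and text and not text[-1].isspace():
            out.append(" ")
    return "".join(out)

print(concat_texts(["Hello world!", "This is Jane."]))
# Hello world! This is Jane.
```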
## Doc.to_disk {#to_disk tag="method" new="2"}
|
## Doc.to_disk {#to_disk tag="method" new="2"}
|
||||||
|
|
||||||
|
@ -354,11 +354,11 @@ Save the current state to a directory.
|
||||||
> doc.to_disk("/path/to/doc")
|
> doc.to_disk("/path/to/doc")
|
||||||
> ```
|
> ```
|
||||||
|
|
||||||
| Name | Type | Description |
|
| Name | Description |
|
||||||
| -------------- | --------------- | --------------------------------------------------------------------------------------------------------------------- |
|
| -------------- | ------------------------------------------------------------------------------------------------------------------------------------------ |
|
||||||
| `path` | str / `Path` | A path to a directory, which will be created if it doesn't exist. Paths may be either strings or `Path`-like objects. |
|
| `path` | A path to a directory, which will be created if it doesn't exist. Paths may be either strings or `Path`-like objects. ~~Union[str, Path]~~ |
|
||||||
| _keyword-only_ | | |
|
| _keyword-only_ | |
|
||||||
| `exclude` | `Iterable[str]` | String names of [serialization fields](#serialization-fields) to exclude. |
|
| `exclude` | String names of [serialization fields](#serialization-fields) to exclude. ~~Iterable[str]~~ |
|
||||||
|
|
||||||
## Doc.from_disk {#from_disk tag="method" new="2"}
|
## Doc.from_disk {#from_disk tag="method" new="2"}
|
||||||
|
|
||||||
|
@@ -372,12 +372,12 @@ Loads state from a directory. Modifies the object in place and returns it.
|
||||||
> doc = Doc(Vocab()).from_disk("/path/to/doc")
|
> doc = Doc(Vocab()).from_disk("/path/to/doc")
|
||||||
> ```
|
> ```
|
||||||
|
|
||||||
| Name | Type | Description |
|
| Name | Description |
|
||||||
| -------------- | --------------- | -------------------------------------------------------------------------- |
|
| -------------- | ----------------------------------------------------------------------------------------------- |
|
||||||
| `path` | str / `Path` | A path to a directory. Paths may be either strings or `Path`-like objects. |
|
| `path` | A path to a directory. Paths may be either strings or `Path`-like objects. ~~Union[str, Path]~~ |
|
||||||
| _keyword-only_ | | |
|
| _keyword-only_ | |
|
||||||
| `exclude` | `Iterable[str]` | String names of [serialization fields](#serialization-fields) to exclude. |
|
| `exclude` | String names of [serialization fields](#serialization-fields) to exclude. ~~Iterable[str]~~ |
|
||||||
| **RETURNS** | `Doc` | The modified `Doc` object. |
|
| **RETURNS** | The modified `Doc` object. ~~Doc~~ |
|
||||||
|
|
||||||
## Doc.to_bytes {#to_bytes tag="method"}
|
## Doc.to_bytes {#to_bytes tag="method"}
|
||||||
|
|
||||||
|
@@ -390,11 +390,11 @@ Serialize, i.e. export the document contents to a binary string.
|
||||||
> doc_bytes = doc.to_bytes()
|
> doc_bytes = doc.to_bytes()
|
||||||
> ```
|
> ```
|
||||||
|
|
||||||
| Name | Type | Description |
|
| Name | Description |
|
||||||
| -------------- | --------------- | ------------------------------------------------------------------------- |
|
| -------------- | ------------------------------------------------------------------------------------------- |
|
||||||
| _keyword-only_ | | |
|
| _keyword-only_ | |
|
||||||
| `exclude` | `Iterable[str]` | String names of [serialization fields](#serialization-fields) to exclude. |
|
| `exclude` | String names of [serialization fields](#serialization-fields) to exclude. ~~Iterable[str]~~ |
|
||||||
| **RETURNS** | bytes | A losslessly serialized copy of the `Doc`, including all annotations. |
|
| **RETURNS** | A losslessly serialized copy of the `Doc`, including all annotations. ~~bytes~~ |
|
||||||
|
|
||||||
## Doc.from_bytes {#from_bytes tag="method"}
|
## Doc.from_bytes {#from_bytes tag="method"}
|
||||||
|
|
||||||
|
@@ -410,12 +410,12 @@ Deserialize, i.e. import the document contents from a binary string.
|
||||||
> assert doc.text == doc2.text
|
> assert doc.text == doc2.text
|
||||||
> ```
|
> ```
|
||||||
|
|
||||||
| Name | Type | Description |
|
| Name | Description |
|
||||||
| -------------- | --------------- | ------------------------------------------------------------------------- |
|
| -------------- | ------------------------------------------------------------------------------------------- |
|
||||||
| `data` | bytes | The string to load from. |
|
| `data` | The string to load from. ~~bytes~~ |
|
||||||
| _keyword-only_ | | |
|
| _keyword-only_ | |
|
||||||
| `exclude` | `Iterable[str]` | String names of [serialization fields](#serialization-fields) to exclude. |
|
| `exclude` | String names of [serialization fields](#serialization-fields) to exclude. ~~Iterable[str]~~ |
|
||||||
| **RETURNS** | `Doc` | The `Doc` object. |
|
| **RETURNS** | The `Doc` object. ~~Doc~~ |
|
||||||
|
|
||||||
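A minimal sketch of the `to_bytes`/`from_bytes` roundtrip documented in the hunks above, assuming spaCy is installed and using a blank English pipeline:

```python
import spacy
from spacy.tokens import Doc

nlp = spacy.blank("en")
doc = nlp("Hello, world!")

# Serialize the Doc to a binary string, then restore it into a fresh Doc
# that shares the same vocab.
doc_bytes = doc.to_bytes()
restored = Doc(nlp.vocab).from_bytes(doc_bytes)
assert restored.text == doc.text
```

The serialization is lossless, so all token-level annotations survive the roundtrip as well.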
## Doc.retokenize {#retokenize tag="contextmanager" new="2.1"}
|
## Doc.retokenize {#retokenize tag="contextmanager" new="2.1"}
|
||||||
|
|
||||||
|
@@ -433,9 +433,9 @@ invalidated, although they may accidentally continue to work.
|
||||||
> retokenizer.merge(doc[0:2])
|
> retokenizer.merge(doc[0:2])
|
||||||
> ```
|
> ```
|
||||||
|
|
||||||
| Name | Type | Description |
|
| Name | Description |
|
||||||
| ----------- | ------------- | ---------------- |
|
| ----------- | -------------------------------- |
|
||||||
| **RETURNS** | `Retokenizer` | The retokenizer. |
|
| **RETURNS** | The retokenizer. ~~Retokenizer~~ |
|
||||||
|
|
||||||
### Retokenizer.merge {#retokenizer.merge tag="method"}
|
### Retokenizer.merge {#retokenizer.merge tag="method"}
|
||||||
|
|
||||||
|
@@ -454,10 +454,10 @@ dictionary mapping attribute names to values as the `"_"` key.
|
||||||
> retokenizer.merge(doc[2:4], attrs=attrs)
|
> retokenizer.merge(doc[2:4], attrs=attrs)
|
||||||
> ```
|
> ```
|
||||||
|
|
||||||
| Name | Type | Description |
|
| Name | Description |
|
||||||
| ------- | ------ | -------------------------------------- |
|
| ------- | --------------------------------------------------------------------- |
|
||||||
| `span` | `Span` | The span to merge. |
|
| `span` | The span to merge. ~~Span~~ |
|
||||||
| `attrs` | dict | Attributes to set on the merged token. |
|
| `attrs` | Attributes to set on the merged token. ~~Dict[Union[str, int], Any]~~ |
|
||||||
|
|
||||||
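A short sketch of `Retokenizer.merge` as documented above, assuming a blank English pipeline (no trained components needed):

```python
import spacy

nlp = spacy.blank("en")
doc = nlp("New York is busy")
assert len(doc) == 4

# Merge the first two tokens into a single token, setting its lemma
# via the attrs dictionary.
with doc.retokenize() as retokenizer:
    retokenizer.merge(doc[0:2], attrs={"LEMMA": "New York"})

assert doc[0].text == "New York"
assert len(doc) == 3
```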
### Retokenizer.split {#retokenizer.split tag="method"}
|
### Retokenizer.split {#retokenizer.split tag="method"}
|
||||||
|
|
||||||
|
@@ -488,33 +488,12 @@ underlying lexeme (if they're context-independent lexical attributes like
|
||||||
> retokenizer.split(doc[3], ["New", "York"], heads=heads, attrs=attrs)
|
> retokenizer.split(doc[3], ["New", "York"], heads=heads, attrs=attrs)
|
||||||
> ```
|
> ```
|
||||||
|
|
||||||
| Name | Type | Description |
|
| Name | Description |
|
||||||
| ------- | ------- | ----------------------------------------------------------------------------------------------------------- |
|
| ------- | ----------------------------------------------------------------------------------------------------------------------------------------------------- |
|
||||||
| `token` | `Token` | The token to split. |
|
| `token` | The token to split. ~~Token~~ |
|
||||||
| `orths` | list | The verbatim text of the split tokens. Needs to match the text of the original token. |
|
| `orths` | The verbatim text of the split tokens. Needs to match the text of the original token. ~~List[str]~~ |
|
||||||
| `heads` | list | List of `token` or `(token, subtoken)` tuples specifying the tokens to attach the newly split subtokens to. |
|
| `heads` | List of `token` or `(token, subtoken)` tuples specifying the tokens to attach the newly split subtokens to. ~~List[Union[Token, Tuple[Token, int]]]~~ |
|
||||||
| `attrs` | dict | Attributes to set on all split tokens. Attribute names mapped to list of per-token attribute values. |
|
| `attrs` | Attributes to set on all split tokens. Attribute names mapped to list of per-token attribute values. ~~Dict[Union[str, int], List[Any]]~~ |
|
||||||
|
|
||||||
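A sketch of `Retokenizer.split` on a blank English pipeline, mirroring the `heads` convention documented above: `(token, subtoken)` tuples attach a new subtoken to another subtoken of the split, while a plain token attaches it to an existing token. The sentence and attachment choices here are illustrative assumptions:

```python
import spacy

nlp = spacy.blank("en")
doc = nlp("I like NewYork")

# Split the fused token "NewYork" into two tokens: "New" attaches to
# subtoken 1 ("York"), and "York" attaches to "like".
with doc.retokenize() as retokenizer:
    retokenizer.split(doc[2], ["New", "York"], heads=[(doc[2], 1), doc[1]])

assert [t.text for t in doc] == ["I", "like", "New", "York"]
```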
## Doc.merge {#merge tag="method"}
|
|
||||||
|
|
||||||
Retokenize the document, such that the span at `doc.text[start_idx : end_idx]`
|
|
||||||
is merged into a single token. If `start_idx` and `end_idx` do not mark start
|
|
||||||
and end token boundaries, the document remains unchanged.
|
|
||||||
|
|
||||||
> #### Example
|
|
||||||
>
|
|
||||||
> ```python
|
|
||||||
> doc = nlp("Los Angeles start.")
|
|
||||||
> doc.merge(0, len("Los Angeles"), "NNP", "Los Angeles", "GPE")
|
|
||||||
> assert [t.text for t in doc] == ["Los Angeles", "start", "."]
|
|
||||||
> ```
|
|
||||||
|
|
||||||
| Name | Type | Description |
|
|
||||||
| -------------- | ------- | ------------------------------------------------------------------------------------------------------------------------- |
|
|
||||||
| `start_idx` | int | The character index of the start of the slice to merge. |
|
|
||||||
| `end_idx` | int | The character index after the end of the slice to merge. |
|
|
||||||
| `**attributes` | - | Attributes to assign to the merged token. By default, attributes are inherited from the syntactic root token of the span. |
|
|
||||||
| **RETURNS** | `Token` | The newly merged token, or `None` if the start and end indices did not fall at token boundaries |
|
|
||||||
|
|
||||||
## Doc.ents {#ents tag="property" model="NER"}
|
## Doc.ents {#ents tag="property" model="NER"}
|
||||||
|
|
||||||
|
@@ -531,9 +510,9 @@ objects, if the entity recognizer has been applied.
|
||||||
> assert ents[0].text == "Mr. Best"
|
> assert ents[0].text == "Mr. Best"
|
||||||
> ```
|
> ```
|
||||||
|
|
||||||
| Name | Type | Description |
|
| Name | Description |
|
||||||
| ----------- | ----- | ------------------------------------------------ |
|
| ----------- | --------------------------------------------------------------------- |
|
||||||
| **RETURNS** | tuple | Entities in the document, one `Span` per entity. |
|
| **RETURNS** | Entities in the document, one `Span` per entity. ~~Tuple[Span, ...]~~ |
|
||||||
|
|
||||||
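`Doc.ents` is writable as well as readable, so entities can be annotated by hand when no entity recognizer has been applied. A minimal sketch, assuming a blank English pipeline:

```python
import spacy
from spacy.tokens import Span

nlp = spacy.blank("en")
doc = nlp("Mr. Best flew to New York")

# A blank pipeline has no NER component, so set an entity span manually.
doc.ents = [Span(doc, 0, 2, label="PERSON")]

assert [(ent.text, ent.label_) for ent in doc.ents] == [("Mr. Best", "PERSON")]
```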
## Doc.noun_chunks {#noun_chunks tag="property" model="parser"}
|
## Doc.noun_chunks {#noun_chunks tag="property" model="parser"}
|
||||||
|
|
||||||
|
@@ -552,9 +531,9 @@ relative clauses.
|
||||||
> assert chunks[1].text == "another phrase"
|
> assert chunks[1].text == "another phrase"
|
||||||
> ```
|
> ```
|
||||||
|
|
||||||
| Name | Type | Description |
|
| Name | Description |
|
||||||
| ---------- | ------ | ---------------------------- |
|
| ---------- | ------------------------------------- |
|
||||||
| **YIELDS** | `Span` | Noun chunks in the document. |
|
| **YIELDS** | Noun chunks in the document. ~~Span~~ |
|
||||||
|
|
||||||
## Doc.sents {#sents tag="property" model="parser"}
|
## Doc.sents {#sents tag="property" model="parser"}
|
||||||
|
|
||||||
|
@@ -572,9 +551,9 @@ will be unavailable.
|
||||||
> assert [s.root.text for s in sents] == ["is", "'s"]
|
> assert [s.root.text for s in sents] == ["is", "'s"]
|
||||||
> ```
|
> ```
|
||||||
|
|
||||||
| Name | Type | Description |
|
| Name | Description |
|
||||||
| ---------- | ------ | -------------------------- |
|
| ---------- | ----------------------------------- |
|
||||||
| **YIELDS** | `Span` | Sentences in the document. |
|
| **YIELDS** | Sentences in the document. ~~Span~~ |
|
||||||
|
|
||||||
## Doc.has_vector {#has_vector tag="property" model="vectors"}
|
## Doc.has_vector {#has_vector tag="property" model="vectors"}
|
||||||
|
|
||||||
|
@@ -587,9 +566,9 @@ A boolean value indicating whether a word vector is associated with the object.
|
||||||
> assert doc.has_vector
|
> assert doc.has_vector
|
||||||
> ```
|
> ```
|
||||||
|
|
||||||
| Name | Type | Description |
|
| Name | Description |
|
||||||
| ----------- | ---- | ------------------------------------------------ |
|
| ----------- | --------------------------------------------------------- |
|
||||||
| **RETURNS** | bool | Whether the document has a vector data attached. |
|
| **RETURNS** | Whether the document has vector data attached. ~~bool~~ |
|
||||||
|
|
||||||
## Doc.vector {#vector tag="property" model="vectors"}
|
## Doc.vector {#vector tag="property" model="vectors"}
|
||||||
|
|
||||||
|
@@ -604,9 +583,9 @@ vectors.
|
||||||
> assert doc.vector.shape == (300,)
|
> assert doc.vector.shape == (300,)
|
||||||
> ```
|
> ```
|
||||||
|
|
||||||
| Name | Type | Description |
|
| Name | Description |
|
||||||
| ----------- | ---------------------------------------- | ------------------------------------------------------- |
|
| ----------- | -------------------------------------------------------------------------------------------------- |
|
||||||
| **RETURNS** | `numpy.ndarray[ndim=1, dtype="float32"]` | A 1D numpy array representing the document's semantics. |
|
| **RETURNS** | A 1-dimensional array representing the document's vector. ~~numpy.ndarray[ndim=1, dtype=float32]~~ |
|
||||||
|
|
||||||
## Doc.vector_norm {#vector_norm tag="property" model="vectors"}
|
## Doc.vector_norm {#vector_norm tag="property" model="vectors"}
|
||||||
|
|
||||||
|
@@ -622,32 +601,32 @@ The L2 norm of the document's vector representation.
|
||||||
> assert doc1.vector_norm != doc2.vector_norm
|
> assert doc1.vector_norm != doc2.vector_norm
|
||||||
> ```
|
> ```
|
||||||
|
|
||||||
| Name | Type | Description |
|
| Name | Description |
|
||||||
| ----------- | ----- | ----------------------------------------- |
|
| ----------- | --------------------------------------------------- |
|
||||||
| **RETURNS** | float | The L2 norm of the vector representation. |
|
| **RETURNS** | The L2 norm of the vector representation. ~~float~~ |
|
||||||
|
|
||||||
## Attributes {#attributes}
|
## Attributes {#attributes}
|
||||||
|
|
||||||
| Name | Type | Description |
|
| Name | Description |
|
||||||
| --------------------------------------- | ------------ | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
|
| --------------------------------------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
|
||||||
| `text` | str | A string representation of the document text. |
|
| `text` | A string representation of the document text. ~~str~~ |
|
||||||
| `text_with_ws` | str | An alias of `Doc.text`, provided for duck-type compatibility with `Span` and `Token`. |
|
| `text_with_ws` | An alias of `Doc.text`, provided for duck-type compatibility with `Span` and `Token`. ~~str~~ |
|
||||||
| `mem` | `Pool` | The document's local memory heap, for all C data it owns. |
|
| `mem` | The document's local memory heap, for all C data it owns. ~~cymem.Pool~~ |
|
||||||
| `vocab` | `Vocab` | The store of lexical types. |
|
| `vocab` | The store of lexical types. ~~Vocab~~ |
|
||||||
| `tensor` <Tag variant="new">2</Tag> | `ndarray` | Container for dense vector representations. |
|
| `tensor` <Tag variant="new">2</Tag> | Container for dense vector representations. ~~numpy.ndarray~~ |
|
||||||
| `cats` <Tag variant="new">2</Tag> | dict | Maps a label to a score for categories applied to the document. The label is a string and the score should be a float. |
|
| `cats` <Tag variant="new">2</Tag> | Maps a label to a score for categories applied to the document. The label is a string and the score should be a float. ~~Dict[str, float]~~ |
|
||||||
| `user_data` | - | A generic storage area, for user custom data. |
|
| `user_data` | A generic storage area for custom user data. ~~Dict[str, Any]~~ |
|
||||||
| `lang` <Tag variant="new">2.1</Tag> | int | Language of the document's vocabulary. |
|
| `lang` <Tag variant="new">2.1</Tag> | Language of the document's vocabulary. ~~int~~ |
|
||||||
| `lang_` <Tag variant="new">2.1</Tag> | str | Language of the document's vocabulary. |
|
| `lang_` <Tag variant="new">2.1</Tag> | Language of the document's vocabulary. ~~str~~ |
|
||||||
| `is_tagged` | bool | A flag indicating that the document has been part-of-speech tagged. Returns `True` if the `Doc` is empty. |
|
| `is_tagged` | A flag indicating that the document has been part-of-speech tagged. Returns `True` if the `Doc` is empty. ~~bool~~ |
|
||||||
| `is_parsed` | bool | A flag indicating that the document has been syntactically parsed. Returns `True` if the `Doc` is empty. |
|
| `is_parsed` | A flag indicating that the document has been syntactically parsed. Returns `True` if the `Doc` is empty. ~~bool~~ |
|
||||||
| `is_sentenced` | bool | A flag indicating that sentence boundaries have been applied to the document. Returns `True` if the `Doc` is empty. |
|
| `is_sentenced` | A flag indicating that sentence boundaries have been applied to the document. Returns `True` if the `Doc` is empty. ~~bool~~ |
|
||||||
| `is_nered` <Tag variant="new">2.1</Tag> | bool | A flag indicating that named entities have been set. Will return `True` if the `Doc` is empty, or if _any_ of the tokens has an entity tag set, even if the others are unknown. |
|
| `is_nered` <Tag variant="new">2.1</Tag> | A flag indicating that named entities have been set. Will return `True` if the `Doc` is empty, or if _any_ of the tokens has an entity tag set, even if the others are unknown. ~~bool~~ |
|
||||||
| `sentiment` | float | The document's positivity/negativity score, if available. |
|
| `sentiment` | The document's positivity/negativity score, if available. ~~float~~ |
|
||||||
| `user_hooks` | dict | A dictionary that allows customization of the `Doc`'s properties. |
|
| `user_hooks` | A dictionary that allows customization of the `Doc`'s properties. ~~Dict[str, Callable]~~ |
|
||||||
| `user_token_hooks` | dict | A dictionary that allows customization of properties of `Token` children. |
|
| `user_token_hooks` | A dictionary that allows customization of properties of `Token` children. ~~Dict[str, Callable]~~ |
|
||||||
| `user_span_hooks` | dict | A dictionary that allows customization of properties of `Span` children. |
|
| `user_span_hooks` | A dictionary that allows customization of properties of `Span` children. ~~Dict[str, Callable]~~ |
|
||||||
| `_` | `Underscore` | User space for adding custom [attribute extensions](/usage/processing-pipelines#custom-components-attributes). |
|
| `_` | User space for adding custom [attribute extensions](/usage/processing-pipelines#custom-components-attributes). ~~Underscore~~ |
|
||||||
|
|
||||||
## Serialization fields {#serialization-fields}
|
## Serialization fields {#serialization-fields}
|
||||||
|
|
||||||
|
|
|
@@ -44,11 +44,11 @@ Create a `DocBin` object to hold serialized annotations.
|
||||||
> doc_bin = DocBin(attrs=["ENT_IOB", "ENT_TYPE"])
|
> doc_bin = DocBin(attrs=["ENT_IOB", "ENT_TYPE"])
|
||||||
> ```
|
> ```
|
||||||
|
|
||||||
| Argument | Type | Description |
|
| Argument | Description |
|
||||||
| ----------------- | --------------- | --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
|
| ----------------- | --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
|
||||||
| `attrs` | `Iterable[str]` | List of attributes to serialize. `ORTH` (hash of token text) and `SPACY` (whether the token is followed by whitespace) are always serialized, so they're not required. Defaults to `("ORTH", "TAG", "HEAD", "DEP", "ENT_IOB", "ENT_TYPE", "ENT_KB_ID", "LEMMA", "MORPH", "POS")`. |
|
| `attrs` | List of attributes to serialize. `ORTH` (hash of token text) and `SPACY` (whether the token is followed by whitespace) are always serialized, so they're not required. Defaults to `("ORTH", "TAG", "HEAD", "DEP", "ENT_IOB", "ENT_TYPE", "ENT_KB_ID", "LEMMA", "MORPH", "POS")`. ~~Iterable[str]~~ |
|
||||||
| `store_user_data` | bool | Whether to include the `Doc.user_data` and the values of custom extension attributes. Defaults to `False`. |
|
| `store_user_data` | Whether to include the `Doc.user_data` and the values of custom extension attributes. Defaults to `False`. ~~bool~~ |
|
||||||
| `docs` | `Iterable[Doc]` | `Doc` objects to add on initialization. |
|
| `docs` | `Doc` objects to add on initialization. ~~Iterable[Doc]~~ |
|
||||||
|
|
||||||
## DocBin.\_\_len\_\_ {#len tag="method"}
|
## DocBin.\_\_len\_\_ {#len tag="method"}
|
||||||
|
|
||||||
|
@@ -63,9 +63,9 @@ Get the number of `Doc` objects that were added to the `DocBin`.
|
||||||
> assert len(doc_bin) == 1
|
> assert len(doc_bin) == 1
|
||||||
> ```
|
> ```
|
||||||
|
|
||||||
| Argument | Type | Description |
|
| Argument | Description |
|
||||||
| ----------- | ---- | ------------------------------------------- |
|
| ----------- | --------------------------------------------------- |
|
||||||
| **RETURNS** | int | The number of `Doc`s added to the `DocBin`. |
|
| **RETURNS** | The number of `Doc`s added to the `DocBin`. ~~int~~ |
|
||||||
|
|
||||||
## DocBin.add {#add tag="method"}
|
## DocBin.add {#add tag="method"}
|
||||||
|
|
||||||
|
@@ -79,9 +79,9 @@ Add a `Doc`'s annotations to the `DocBin` for serialization.
|
||||||
> doc_bin.add(doc)
|
> doc_bin.add(doc)
|
||||||
> ```
|
> ```
|
||||||
|
|
||||||
| Argument | Type | Description |
|
| Argument | Description |
|
||||||
| -------- | ----- | ------------------------ |
|
| -------- | -------------------------------- |
|
||||||
| `doc` | `Doc` | The `Doc` object to add. |
|
| `doc` | The `Doc` object to add. ~~Doc~~ |
|
||||||
|
|
||||||
## DocBin.get_docs {#get_docs tag="method"}
|
## DocBin.get_docs {#get_docs tag="method"}
|
||||||
|
|
||||||
|
@@ -93,15 +93,15 @@ Recover `Doc` objects from the annotations, using the given vocab.
|
||||||
> docs = list(doc_bin.get_docs(nlp.vocab))
|
> docs = list(doc_bin.get_docs(nlp.vocab))
|
||||||
> ```
|
> ```
|
||||||
|
|
||||||
| Argument | Type | Description |
|
| Argument | Description |
|
||||||
| ---------- | ------- | ------------------ |
|
| ---------- | --------------------------- |
|
||||||
| `vocab` | `Vocab` | The shared vocab. |
|
| `vocab` | The shared vocab. ~~Vocab~~ |
|
||||||
| **YIELDS** | `Doc` | The `Doc` objects. |
|
| **YIELDS** | The `Doc` objects. ~~Doc~~ |
|
||||||
|
|
||||||
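The `DocBin.add`/`DocBin.get_docs` workflow documented above can be sketched as follows, assuming a blank English pipeline:

```python
import spacy
from spacy.tokens import DocBin

nlp = spacy.blank("en")
doc_bin = DocBin()
for text in ["Hello world", "This is a test"]:
    doc_bin.add(nlp(text))

assert len(doc_bin) == 2

# Recover the Doc objects, reusing the pipeline's shared vocab.
docs = list(doc_bin.get_docs(nlp.vocab))
assert [doc.text for doc in docs] == ["Hello world", "This is a test"]
```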
## DocBin.merge {#merge tag="method"}
|
## DocBin.merge {#merge tag="method"}
|
||||||
|
|
||||||
Extend the annotations of this `DocBin` with the annotations from another. Will
|
Extend the annotations of this `DocBin` with the annotations from another. Will
|
||||||
raise an error if the pre-defined attrs of the two `DocBin`s don't match.
|
raise an error if the pre-defined `attrs` of the two `DocBin`s don't match.
|
||||||
|
|
||||||
> #### Example
|
> #### Example
|
||||||
>
|
>
|
||||||
|
@@ -114,9 +114,9 @@ raise an error if the pre-defined attrs of the two `DocBin`s don't match.
|
||||||
> assert len(doc_bin1) == 2
|
> assert len(doc_bin1) == 2
|
||||||
> ```
|
> ```
|
||||||
|
|
||||||
| Argument | Type | Description |
|
| Argument | Description |
|
||||||
| -------- | -------- | ------------------------------------------- |
|
| -------- | ------------------------------------------------------ |
|
||||||
| `other` | `DocBin` | The `DocBin` to merge into the current bin. |
|
| `other` | The `DocBin` to merge into the current bin. ~~DocBin~~ |
|
||||||
|
|
||||||
## DocBin.to_bytes {#to_bytes tag="method"}
|
## DocBin.to_bytes {#to_bytes tag="method"}
|
||||||
|
|
||||||
|
@@ -130,9 +130,9 @@ Serialize the `DocBin`'s annotations to a bytestring.
|
||||||
> doc_bin_bytes = doc_bin.to_bytes()
|
> doc_bin_bytes = doc_bin.to_bytes()
|
||||||
> ```
|
> ```
|
||||||
|
|
||||||
| Argument | Type | Description |
|
| Argument | Description |
|
||||||
| ----------- | ----- | ------------------------ |
|
| ----------- | ---------------------------------- |
|
||||||
| **RETURNS** | bytes | The serialized `DocBin`. |
|
| **RETURNS** | The serialized `DocBin`. ~~bytes~~ |
|
||||||
|
|
||||||
## DocBin.from_bytes {#from_bytes tag="method"}
|
## DocBin.from_bytes {#from_bytes tag="method"}
|
||||||
|
|
||||||
|
@@ -145,10 +145,10 @@ Deserialize the `DocBin`'s annotations from a bytestring.
|
||||||
> new_doc_bin = DocBin().from_bytes(doc_bin_bytes)
|
> new_doc_bin = DocBin().from_bytes(doc_bin_bytes)
|
||||||
> ```
|
> ```
|
||||||
|
|
||||||
| Argument | Type | Description |
|
| Argument | Description |
|
||||||
| ------------ | -------- | ---------------------- |
|
| ------------ | -------------------------------- |
|
||||||
| `bytes_data` | bytes | The data to load from. |
|
| `bytes_data` | The data to load from. ~~bytes~~ |
|
||||||
| **RETURNS** | `DocBin` | The loaded `DocBin`. |
|
| **RETURNS** | The loaded `DocBin`. ~~DocBin~~ |
|
||||||
|
|
||||||
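A minimal sketch of the `DocBin` bytes roundtrip documented in the two hunks above, assuming a blank English pipeline; the bytestring is what `to_disk` writes out in the `.spacy` format:

```python
import spacy
from spacy.tokens import DocBin

nlp = spacy.blank("en")
doc_bin = DocBin()
doc_bin.add(nlp("Serialize me"))

# The bytestring can be stored or transferred, then loaded back.
data = doc_bin.to_bytes()
restored = DocBin().from_bytes(data)

assert len(restored) == 1
assert next(restored.get_docs(nlp.vocab)).text == "Serialize me"
```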
## DocBin.to_disk {#to_disk tag="method" new="3"}
|
## DocBin.to_disk {#to_disk tag="method" new="3"}
|
||||||
|
|
||||||
|
@@ -164,9 +164,9 @@ and the result can be used as the input data for
|
||||||
> doc_bin.to_disk("./data.spacy")
|
> doc_bin.to_disk("./data.spacy")
|
||||||
> ```
|
> ```
|
||||||
|
|
||||||
| Argument | Type | Description |
|
| Argument | Description |
|
||||||
| -------- | ------------ | ----------------------------------------------------- |
|
| -------- | -------------------------------------------------------------------------- |
|
||||||
| `path` | str / `Path` | The file path, typically with the `.spacy` extension. |
|
| `path` | The file path, typically with the `.spacy` extension. ~~Union[str, Path]~~ |
|
||||||
|
|
||||||
## DocBin.from_disk {#from_disk tag="method" new="3"}
|
## DocBin.from_disk {#from_disk tag="method" new="3"}
|
||||||
|
|
||||||
|
@@ -178,7 +178,7 @@ Load a serialized `DocBin` from a file. Typically uses the `.spacy` extension.
|
||||||
> doc_bin = DocBin().from_disk("./data.spacy")
|
> doc_bin = DocBin().from_disk("./data.spacy")
|
||||||
> ```
|
> ```
|
||||||
|
|
||||||
| Argument | Type | Description |
|
| Argument | Description |
|
||||||
| ----------- | ------------ | ----------------------------------------------------- |
|
| ----------- | -------------------------------------------------------------------------- |
|
||||||
| `path` | str / `Path` | The file path, typically with the `.spacy` extension. |
|
| `path` | The file path, typically with the `.spacy` extension. ~~Union[str, Path]~~ |
|
||||||
| **RETURNS** | `DocBin` | The loaded `DocBin`. |
|
| **RETURNS** | The loaded `DocBin`. ~~DocBin~~ |
|
||||||
|
|
|
@@ -40,14 +40,14 @@ architectures and their arguments and hyperparameters.
|
||||||
> nlp.add_pipe("entity_linker", config=config)
|
> nlp.add_pipe("entity_linker", config=config)
|
||||||
> ```
|
> ```
|
||||||
|
|
||||||
| Setting | Type | Description | Default |
|
| Setting | Description |
|
||||||
| ---------------- | -------------------------------------------------------- | --------------------------------------------------------------------------- | ------------------------------------------------------ |
|
| ---------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ |
|
||||||
| `labels_discard` | `Iterable[str]` | NER labels that will automatically get a "NIL" prediction. | `[]` |
|
| `labels_discard` | NER labels that will automatically get a "NIL" prediction. Defaults to `[]`. ~~Iterable[str]~~ |
|
||||||
| `incl_prior` | bool | Whether or not to include prior probabilities from the KB in the model. | `True` |
|
| `incl_prior` | Whether or not to include prior probabilities from the KB in the model. Defaults to `True`. ~~bool~~ |
|
||||||
| `incl_context` | bool | Whether or not to include the local context in the model. | `True` |
|
| `incl_context` | Whether or not to include the local context in the model. Defaults to `True`. ~~bool~~ |
|
||||||
| `model` | [`Model`](https://thinc.ai/docs/api-model) | The model to use. | [EntityLinker](/api/architectures#EntityLinker) |
|
| `model` | The [`Model`](https://thinc.ai/docs/api-model) powering the pipeline component. Defaults to [EntityLinker](/api/architectures#EntityLinker). ~~Model~~ |
|
||||||
| `kb_loader` | `Callable[[Vocab], KnowledgeBase]` | Function that creates a [`KnowledgeBase`](/api/kb) from a `Vocab` instance. | An empty KnowledgeBase with `entity_vector_length` 64. |
|
| `kb_loader` | Function that creates a [`KnowledgeBase`](/api/kb) from a `Vocab` instance. Defaults to [EmptyKB](/api/architectures#EmptyKB), a function returning an empty `KnowledgeBase` with an `entity_vector_length` of `64`. ~~Callable[[Vocab], KnowledgeBase]~~ |
|
||||||
| `get_candidates` | `Callable[[KnowledgeBase, "Span"], Iterable[Candidate]]` | Function that generates plausible candidates for a given `Span` object. | Built-in dictionary-lookup function. |
|
| `get_candidates` | Function that generates plausible candidates for a given `Span` object. Defaults to [CandidateGenerator](/api/architectures#CandidateGenerator), a function looking up exact, case-dependent aliases in the KB. ~~Callable[[KnowledgeBase, Span], Iterable[Candidate]]~~ |
|
||||||
|
|
||||||
```python
|
```python
|
||||||
https://github.com/explosion/spaCy/blob/develop/spacy/pipeline/entity_linker.py
|
https://github.com/explosion/spaCy/blob/develop/spacy/pipeline/entity_linker.py
|
||||||
|
@@ -66,7 +66,7 @@ https://github.com/explosion/spaCy/blob/develop/spacy/pipeline/entity_linker.py
|
||||||
> entity_linker = nlp.add_pipe("entity_linker", config=config)
|
> entity_linker = nlp.add_pipe("entity_linker", config=config)
|
||||||
>
|
>
|
||||||
> # Construction via add_pipe with custom KB and candidate generation
|
> # Construction via add_pipe with custom KB and candidate generation
|
||||||
> config = {"kb_loader": {"@assets": "my_kb.v1"}, "get_candidates": {"@assets": "my_candidates.v1"},}
|
> config = {"kb": {"@assets": "my_kb.v1"}}
|
||||||
> entity_linker = nlp.add_pipe("entity_linker", config=config)
|
> entity_linker = nlp.add_pipe("entity_linker", config=config)
|
||||||
>
|
>
|
||||||
> # Construction from class
|
> # Construction from class
|
||||||
|
@@ -76,22 +76,21 @@ https://github.com/explosion/spaCy/blob/develop/spacy/pipeline/entity_linker.py
|
||||||
|
|
||||||
Create a new pipeline instance. In your application, you would normally use a
|
Create a new pipeline instance. In your application, you would normally use a
|
||||||
shortcut for this and instantiate the component using its string name and
|
shortcut for this and instantiate the component using its string name and
|
||||||
[`nlp.add_pipe`](/api/language#add_pipe).
|
[`nlp.add_pipe`](/api/language#add_pipe). Note that both the internal
|
||||||
|
`KnowledgeBase` and the candidate generator can be customized by
|
||||||
|
providing custom registered functions.
|
||||||
|
|
||||||
Note that both the internal KB as well as the Candidate generator can be
|
| Name | Description |
|
||||||
customized by providing custom registered functions.
|
| ---------------- | -------------------------------------------------------------------------------------------------------------------------------- |
|
||||||
|
| `vocab` | The shared vocabulary. ~~Vocab~~ |
|
||||||
| Name | Type | Description |
|
| `model` | The [`Model`](https://thinc.ai/docs/api-model) powering the pipeline component. ~~Model~~ |
|
||||||
| ---------------- | -------------------------------------------------------- | ------------------------------------------------------------------------------------------- |
|
| `name` | String name of the component instance. Used to add entries to the `losses` during training. ~~str~~ |
|
||||||
| `vocab` | `Vocab` | The shared vocabulary. |
|
|
||||||
| `model` | `Model` | The [`Model`](https://thinc.ai/docs/api-model) powering the pipeline component. |
|
|
||||||
| `name` | str | String name of the component instance. Used to add entries to the `losses` during training. |
|
|
||||||
| _keyword-only_ | | |
|
| _keyword-only_ | |
|
||||||
| `kb_loader` | `Callable[[Vocab], KnowledgeBase]` | Function that creates a [`KnowledgeBase`](/api/kb) from a `Vocab` instance. |
|
| `kb_loader` | Function that creates a [`KnowledgeBase`](/api/kb) from a `Vocab` instance. ~~Callable[[Vocab], KnowledgeBase]~~ |
|
||||||
| `get_candidates` | `Callable[[KnowledgeBase, "Span"], Iterable[Candidate]]` | Function that generates plausible candidates for a given `Span` object. |
|
| `get_candidates` | Function that generates plausible candidates for a given `Span` object. ~~Callable[[KnowledgeBase, Span], Iterable[Candidate]]~~ |
|
||||||
| `labels_discard` | `Iterable[str]` | NER labels that will automatically get a "NIL" prediction. |
|
| `labels_discard` | NER labels that will automatically get a `"NIL"` prediction. ~~Iterable[str]~~ |
|
||||||
| `incl_prior` | bool | Whether or not to include prior probabilities from the KB in the model. |
|
| `incl_prior` | Whether or not to include prior probabilities from the KB in the model. ~~bool~~ |
|
||||||
| `incl_context` | bool | Whether or not to include the local context in the model. |
|
| `incl_context` | Whether or not to include the local context in the model. ~~bool~~ |
|
||||||
|
|
||||||
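The `kb_loader` callback described in the table above can be sketched as a plain function that builds a knowledge base from a `Vocab`. The entity ID, frequency, vector length, and alias below are illustrative, not from the source; the import fallback covers the class rename in later spaCy releases and is an assumption about the installed version.

```python
from spacy.vocab import Vocab

try:
    from spacy.kb import InMemoryLookupKB as KB  # spaCy >= 3.5 name (assumption)
except ImportError:
    from spacy.kb import KnowledgeBase as KB  # earlier spaCy v3 releases

def create_kb(vocab):
    # Build a tiny knowledge base: one entity with a dummy vector,
    # reachable through a single alias.
    kb = KB(vocab=vocab, entity_vector_length=64)
    kb.add_entity(entity="Q42", freq=12, entity_vector=[0.0] * 64)
    kb.add_alias(alias="Douglas Adams", entities=["Q42"], probabilities=[1.0])
    return kb

kb = create_kb(Vocab())
```

A function with this signature is what the `kb_loader` argument expects; registering it via `@spacy.registry.misc` makes it usable from a config file.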
 ## EntityLinker.\_\_call\_\_ {#call tag="method"}

@@ -111,10 +110,10 @@ delegate to the [`predict`](/api/entitylinker#predict) and
 > processed = entity_linker(doc)
 > ```

-| Name        | Type  | Description              |
-| ----------- | ----- | ------------------------ |
-| `doc`       | `Doc` | The document to process. |
-| **RETURNS** | `Doc` | The processed document.  |
+| Name        | Description                      |
+| ----------- | -------------------------------- |
+| `doc`       | The document to process. ~~Doc~~ |
+| **RETURNS** | The processed document. ~~Doc~~  |

 ## EntityLinker.pipe {#pipe tag="method"}

@@ -133,12 +132,12 @@ applied to the `Doc` in order. Both [`__call__`](/api/entitylinker#call) and
 > pass
 > ```

-| Name           | Type            | Description                                            |
-| -------------- | --------------- | ------------------------------------------------------ |
-| `stream`       | `Iterable[Doc]` | A stream of documents.                                 |
-| _keyword-only_ |                 |                                                        |
-| `batch_size`   | int             | The number of texts to buffer. Defaults to `128`.      |
-| **YIELDS**     | `Doc`           | Processed documents in the order of the original text. |
+| Name           | Description                                                   |
+| -------------- | ------------------------------------------------------------- |
+| `stream`       | A stream of documents. ~~Iterable[Doc]~~                      |
+| _keyword-only_ |                                                               |
+| `batch_size`   | The number of documents to buffer. Defaults to `128`. ~~int~~ |
+| **YIELDS**     | The processed documents in order. ~~Doc~~                     |

 ## EntityLinker.begin_training {#begin_training tag="method"}

@@ -158,13 +157,13 @@ setting up the label scheme based on the data.
 > optimizer = entity_linker.begin_training(lambda: [], pipeline=nlp.pipeline)
 > ```

-| Name           | Type                                                | Description                                                                                                         |
-| -------------- | --------------------------------------------------- | ------------------------------------------------------------------------------------------------------------------- |
-| `get_examples` | `Callable[[], Iterable[Example]]`                   | Optional function that returns gold-standard annotations in the form of [`Example`](/api/example) objects.          |
-| _keyword-only_ |                                                     |                                                                                                                     |
-| `pipeline`     | `List[Tuple[str, Callable]]`                        | Optional list of pipeline components that this component is part of.                                                |
-| `sgd`          | [`Optimizer`](https://thinc.ai/docs/api-optimizers) | An optional optimizer. Will be created via [`create_optimizer`](/api/dependencyparser#create_optimizer) if not set. |
-| **RETURNS**    | [`Optimizer`](https://thinc.ai/docs/api-optimizers) | The optimizer.                                                                                                      |
+| Name           | Description                                                                                                                           |
+| -------------- | ------------------------------------------------------------------------------------------------------------------------------------- |
+| `get_examples` | Function that returns gold-standard annotations in the form of [`Example`](/api/example) objects. ~~Callable[[], Iterable[Example]]~~ |
+| _keyword-only_ |                                                                                                                                       |
+| `pipeline`     | Optional list of pipeline components that this component is part of. ~~Optional[List[Tuple[str, Callable[[Doc], Doc]]]]~~             |
+| `sgd`          | An optimizer. Will be created via [`create_optimizer`](#create_optimizer) if not set. ~~Optional[Optimizer]~~                          |
+| **RETURNS**    | The optimizer. ~~Optimizer~~                                                                                                          |

 ## EntityLinker.predict {#predict tag="method"}

@@ -179,10 +178,10 @@ if there is no prediction.
 > kb_ids = entity_linker.predict([doc1, doc2])
 > ```

-| Name        | Type            | Description                                                  |
-| ----------- | --------------- | ------------------------------------------------------------ |
-| `docs`      | `Iterable[Doc]` | The documents to predict.                                    |
-| **RETURNS** | `List[str]`     | The predicted KB identifiers for the entities in the `docs`. |
+| Name        | Description                                                                |
+| ----------- | -------------------------------------------------------------------------- |
+| `docs`      | The documents to predict. ~~Iterable[Doc]~~                                |
+| **RETURNS** | The predicted KB identifiers for the entities in the `docs`. ~~List[str]~~ |

 ## EntityLinker.set_annotations {#set_annotations tag="method"}

@@ -197,10 +196,10 @@ entities.
 > entity_linker.set_annotations([doc1, doc2], kb_ids)
 > ```

-| Name     | Type            | Description                                                                                       |
-| -------- | --------------- | ------------------------------------------------------------------------------------------------- |
-| `docs`   | `Iterable[Doc]` | The documents to modify.                                                                          |
-| `kb_ids` | `List[str]`     | The knowledge base identifiers for the entities in the docs, predicted by `EntityLinker.predict`. |
+| Name     | Description                                                                                                     |
+| -------- | --------------------------------------------------------------------------------------------------------------- |
+| `docs`   | The documents to modify. ~~Iterable[Doc]~~                                                                      |
+| `kb_ids` | The knowledge base identifiers for the entities in the docs, predicted by `EntityLinker.predict`. ~~List[str]~~ |

 ## EntityLinker.update {#update tag="method"}

@@ -216,15 +215,15 @@ pipe's entity linking model and context encoder. Delegates to
 > losses = entity_linker.update(examples, sgd=optimizer)
 > ```

-| Name              | Type                                                | Description                                                                                                                                   |
-| ----------------- | --------------------------------------------------- | --------------------------------------------------------------------------------------------------------------------------------------------- |
-| `examples`        | `Iterable[Example]`                                 | A batch of [`Example`](/api/example) objects to learn from.                                                                                   |
-| _keyword-only_    |                                                     |                                                                                                                                               |
-| `drop`            | float                                               | The dropout rate.                                                                                                                             |
-| `set_annotations` | bool                                                | Whether or not to update the `Example` objects with the predictions, delegating to [`set_annotations`](/api/textcategorizer#set_annotations). |
-| `sgd`             | [`Optimizer`](https://thinc.ai/docs/api-optimizers) | The optimizer.                                                                                                                                |
-| `losses`          | `Dict[str, float]`                                  | Optional record of the loss during training. Updated using the component name as the key.                                                     |
-| **RETURNS**       | `Dict[str, float]`                                  | The updated `losses` dictionary.                                                                                                              |
+| Name              | Description                                                                                                                        |
+| ----------------- | ---------------------------------------------------------------------------------------------------------------------------------- |
+| `examples`        | A batch of [`Example`](/api/example) objects to learn from. ~~Iterable[Example]~~                                                  |
+| _keyword-only_    |                                                                                                                                    |
+| `drop`            | The dropout rate. ~~float~~                                                                                                        |
+| `set_annotations` | Whether or not to update the `Example` objects with the predictions, delegating to [`set_annotations`](#set_annotations). ~~bool~~ |
+| `sgd`             | An optimizer. Will be created via [`create_optimizer`](#create_optimizer) if not set. ~~Optional[Optimizer]~~                      |
+| `losses`          | Optional record of the loss during training. Updated using the component name as the key. ~~Optional[Dict[str, float]]~~           |
+| **RETURNS**       | The updated `losses` dictionary. ~~Dict[str, float]~~                                                                              |

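The `begin_training`/`update` pattern documented above is shared by all trainable pipes. A minimal training-loop sketch, shown with the simpler `ner` component so no knowledge base is needed: the label, text, and span offsets are illustrative, and `nlp.initialize()` is used here because the nightly-era `begin_training` name was later renamed.

```python
import spacy
from spacy.training import Example

nlp = spacy.blank("en")
ner = nlp.add_pipe("ner")
ner.add_label("GADGET")

# One illustrative training example: "iPhone" spans characters 10-16.
text = "I like my iPhone a lot"
annotations = {"entities": [(10, 16, "GADGET")]}

optimizer = nlp.initialize()  # called `begin_training` in the API shown above
losses = {}
for _ in range(5):
    # `update` accumulates this component's loss under its name in `losses`.
    example = Example.from_dict(nlp.make_doc(text), annotations)
    nlp.update([example], sgd=optimizer, losses=losses)
```

After the loop, `losses["ner"]` holds the accumulated loss, as the `losses` row in the table describes.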
 ## EntityLinker.create_optimizer {#create_optimizer tag="method"}

@@ -237,9 +236,9 @@ Create an optimizer for the pipeline component.
 > optimizer = entity_linker.create_optimizer()
 > ```

-| Name        | Type                                                | Description    |
-| ----------- | --------------------------------------------------- | -------------- |
-| **RETURNS** | [`Optimizer`](https://thinc.ai/docs/api-optimizers) | The optimizer. |
+| Name        | Description                  |
+| ----------- | ---------------------------- |
+| **RETURNS** | The optimizer. ~~Optimizer~~ |

 ## EntityLinker.use_params {#use_params tag="method, contextmanager"}

@@ -254,9 +253,9 @@ context, the original parameters are restored.
 > entity_linker.to_disk("/best_model")
 > ```

-| Name     | Type | Description                               |
-| -------- | ---- | ----------------------------------------- |
-| `params` | dict | The parameter values to use in the model. |
+| Name     | Description                                        |
+| -------- | -------------------------------------------------- |
+| `params` | The parameter values to use in the model. ~~dict~~ |

 ## EntityLinker.to_disk {#to_disk tag="method"}

@@ -269,11 +268,11 @@ Serialize the pipe to disk.
 > entity_linker.to_disk("/path/to/entity_linker")
 > ```

-| Name           | Type            | Description                                                                                                           |
-| -------------- | --------------- | --------------------------------------------------------------------------------------------------------------------- |
-| `path`         | str / `Path`    | A path to a directory, which will be created if it doesn't exist. Paths may be either strings or `Path`-like objects. |
-| _keyword-only_ |                 |                                                                                                                       |
-| `exclude`      | `Iterable[str]` | String names of [serialization fields](#serialization-fields) to exclude.                                             |
+| Name           | Description                                                                                                                                |
+| -------------- | ------------------------------------------------------------------------------------------------------------------------------------------ |
+| `path`         | A path to a directory, which will be created if it doesn't exist. Paths may be either strings or `Path`-like objects. ~~Union[str, Path]~~ |
+| _keyword-only_ |                                                                                                                                            |
+| `exclude`      | String names of [serialization fields](#serialization-fields) to exclude. ~~Iterable[str]~~                                                |

 ## EntityLinker.from_disk {#from_disk tag="method"}

@@ -286,12 +285,12 @@ Load the pipe from disk. Modifies the object in place and returns it.
 > entity_linker.from_disk("/path/to/entity_linker")
 > ```

-| Name           | Type            | Description                                                                |
-| -------------- | --------------- | -------------------------------------------------------------------------- |
-| `path`         | str / `Path`    | A path to a directory. Paths may be either strings or `Path`-like objects. |
-| _keyword-only_ |                 |                                                                            |
-| `exclude`      | `Iterable[str]` | String names of [serialization fields](#serialization-fields) to exclude.  |
-| **RETURNS**    | `EntityLinker`  | The modified `EntityLinker` object.                                        |
+| Name           | Description                                                                                     |
+| -------------- | ----------------------------------------------------------------------------------------------- |
+| `path`         | A path to a directory. Paths may be either strings or `Path`-like objects. ~~Union[str, Path]~~ |
+| _keyword-only_ |                                                                                                 |
+| `exclude`      | String names of [serialization fields](#serialization-fields) to exclude. ~~Iterable[str]~~     |
+| **RETURNS**    | The modified `EntityLinker` object. ~~EntityLinker~~                                            |

 ## Serialization fields {#serialization-fields}


@@ -41,11 +41,11 @@ architectures and their arguments and hyperparameters.
 > nlp.add_pipe("ner", config=config)
 > ```

-| Setting                       | Type                                       | Description                                                                                                                                                                                                             | Default                                                           |
-| ----------------------------- | ------------------------------------------ | ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | ----------------------------------------------------------------- |
-| `moves`                       | `List[str]`                                | A list of transition names. Inferred from the data if not provided.                                                                                                                                                     |                                                                   |
-| `update_with_oracle_cut_size` | int                                        | During training, cut long sequences into shorter segments by creating intermediate states based on the gold-standard history. The model is not very sensitive to this parameter, so you usually won't need to change it. | `100`                                                             |
-| `model`                       | [`Model`](https://thinc.ai/docs/api-model) | The model to use.                                                                                                                                                                                                       | [TransitionBasedParser](/api/architectures#TransitionBasedParser) |
+| Setting                       | Description                                                                                                                                                                                                                                          |
+| ----------------------------- | ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
+| `moves`                       | A list of transition names. Inferred from the data if not provided. Defaults to `None`. ~~Optional[List[str]]~~                                                                                                                                      |
+| `update_with_oracle_cut_size` | During training, cut long sequences into shorter segments by creating intermediate states based on the gold-standard history. The model is not very sensitive to this parameter, so you usually won't need to change it. Defaults to `100`. ~~int~~   |
+| `model`                       | The [`Model`](https://thinc.ai/docs/api-model) powering the pipeline component. Defaults to [TransitionBasedParser](/api/architectures#TransitionBasedParser). ~~Model[List[Doc], List[Floats2d]]~~                                                   |

 ```python
 https://github.com/explosion/spaCy/blob/develop/spacy/pipeline/ner.pyx

@@ -72,14 +72,14 @@ Create a new pipeline instance. In your application, you would normally use a
 shortcut for this and instantiate the component using its string name and
 [`nlp.add_pipe`](/api/language#add_pipe).

-| Name                          | Type                                       | Description                                                                                                                                                                                                                                      |
-| ----------------------------- | ------------------------------------------ | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ |
-| `vocab`                       | `Vocab`                                    | The shared vocabulary.                                                                                                                                                                                                                           |
-| `model`                       | [`Model`](https://thinc.ai/docs/api-model) | The [`Model`](https://thinc.ai/docs/api-model) powering the pipeline component.                                                                                                                                                                  |
-| `name`                        | str                                        | String name of the component instance. Used to add entries to the `losses` during training.                                                                                                                                                      |
-| `moves`                       | `List[str]`                                | A list of transition names. Inferred from the data if not provided.                                                                                                                                                                              |
-| _keyword-only_                |                                            |                                                                                                                                                                                                                                                  |
-| `update_with_oracle_cut_size` | int                                        | During training, cut long sequences into shorter segments by creating intermediate states based on the gold-standard history. The model is not very sensitive to this parameter, so you usually won't need to change it. `100` is a good default. |
+| Name                          | Description                                                                                                                                                                                                                                               |
+| ----------------------------- | ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
+| `vocab`                       | The shared vocabulary. ~~Vocab~~                                                                                                                                                                                                                          |
+| `model`                       | The [`Model`](https://thinc.ai/docs/api-model) powering the pipeline component. ~~Model[List[Doc], List[Floats2d]]~~                                                                                                                                      |
+| `name`                        | String name of the component instance. Used to add entries to the `losses` during training. ~~str~~                                                                                                                                                       |
+| `moves`                       | A list of transition names. Inferred from the data if not provided. ~~Optional[List[str]]~~                                                                                                                                                               |
+| _keyword-only_                |                                                                                                                                                                                                                                                           |
+| `update_with_oracle_cut_size` | During training, cut long sequences into shorter segments by creating intermediate states based on the gold-standard history. The model is not very sensitive to this parameter, so you usually won't need to change it. `100` is a good default. ~~int~~  |

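The config settings in the table above can be overridden per pipe when it is added to a pipeline. A minimal sketch; the value `50` is illustrative, not a recommendation:

```python
import spacy

# Override one setting from the config table; unlisted keys keep their defaults.
config = {"update_with_oracle_cut_size": 50}

nlp = spacy.blank("en")
ner = nlp.add_pipe("ner", config=config)
```

Settings left out of `config`, such as `moves` and `model`, fall back to the defaults shown in the table.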
 ## EntityRecognizer.\_\_call\_\_ {#call tag="method"}

@@ -100,10 +100,10 @@ and all pipeline components are applied to the `Doc` in order. Both
 > processed = ner(doc)
 > ```

-| Name        | Type  | Description              |
-| ----------- | ----- | ------------------------ |
-| `doc`       | `Doc` | The document to process. |
-| **RETURNS** | `Doc` | The processed document.  |
+| Name        | Description                      |
+| ----------- | -------------------------------- |
+| `doc`       | The document to process. ~~Doc~~ |
+| **RETURNS** | The processed document. ~~Doc~~  |

 ## EntityRecognizer.pipe {#pipe tag="method"}

@@ -122,12 +122,12 @@ applied to the `Doc` in order. Both [`__call__`](/api/entityrecognizer#call) and
 > pass
 > ```

-| Name           | Type            | Description                                            |
-| -------------- | --------------- | ------------------------------------------------------ |
-| `docs`         | `Iterable[Doc]` | A stream of documents.                                 |
-| _keyword-only_ |                 |                                                        |
-| `batch_size`   | int             | The number of texts to buffer. Defaults to `128`.      |
-| **YIELDS**     | `Doc`           | Processed documents in the order of the original text. |
+| Name           | Description                                                   |
+| -------------- | ------------------------------------------------------------- |
+| `docs`         | A stream of documents. ~~Iterable[Doc]~~                      |
+| _keyword-only_ |                                                               |
+| `batch_size`   | The number of documents to buffer. Defaults to `128`. ~~int~~ |
+| **YIELDS**     | The processed documents in order. ~~Doc~~                     |

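In practice you rarely call a component's `pipe` directly: `nlp.pipe` streams texts through the whole pipeline, batching internally, and each component's `pipe` is the lower-level building block. A minimal sketch with illustrative texts (the `sentencizer` stands in for a trained `ner` component so nothing needs to be trained):

```python
import spacy

nlp = spacy.blank("en")
nlp.add_pipe("sentencizer")  # a cheap component, so the sketch runs untrained

texts = ["First text.", "Second text."]
# nlp.pipe yields Doc objects in the order of the input texts.
docs = list(nlp.pipe(texts, batch_size=2))
```

The `batch_size` argument here plays the same buffering role as in the table above.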
 ## EntityRecognizer.begin_training {#begin_training tag="method"}

@@ -147,13 +147,13 @@ setting up the label scheme based on the data.
 > optimizer = ner.begin_training(lambda: [], pipeline=nlp.pipeline)
 > ```

-| Name           | Type                                                | Description                                                                                                         |
-| -------------- | --------------------------------------------------- | ------------------------------------------------------------------------------------------------------------------- |
-| `get_examples` | `Callable[[], Iterable[Example]]`                   | Optional function that returns gold-standard annotations in the form of [`Example`](/api/example) objects.          |
-| _keyword-only_ |                                                     |                                                                                                                     |
-| `pipeline`     | `List[Tuple[str, Callable]]`                        | Optional list of pipeline components that this component is part of.                                                |
-| `sgd`          | [`Optimizer`](https://thinc.ai/docs/api-optimizers) | An optional optimizer. Will be created via [`create_optimizer`](/api/entityrecognizer#create_optimizer) if not set. |
-| **RETURNS**    | [`Optimizer`](https://thinc.ai/docs/api-optimizers) | The optimizer.                                                                                                      |
+| Name           | Description                                                                                                                           |
+| -------------- | ------------------------------------------------------------------------------------------------------------------------------------- |
+| `get_examples` | Function that returns gold-standard annotations in the form of [`Example`](/api/example) objects. ~~Callable[[], Iterable[Example]]~~ |
+| _keyword-only_ |                                                                                                                                       |
+| `pipeline`     | Optional list of pipeline components that this component is part of. ~~Optional[List[Tuple[str, Callable[[Doc], Doc]]]]~~             |
+| `sgd`          | An optimizer. Will be created via [`create_optimizer`](#create_optimizer) if not set. ~~Optional[Optimizer]~~                          |
+| **RETURNS**    | The optimizer. ~~Optimizer~~                                                                                                          |

 ## EntityRecognizer.predict {#predict tag="method"}

@@ -167,10 +167,10 @@ modifying them.
 > scores = ner.predict([doc1, doc2])
 > ```

-| Name        | Type               | Description                                                                                                |
-| ----------- | ------------------ | ---------------------------------------------------------------------------------------------------------- |
-| `docs`      | `Iterable[Doc]`    | The documents to predict.                                                                                  |
-| **RETURNS** | `List[StateClass]` | List of `syntax.StateClass` objects. `syntax.StateClass` is a helper class for the parse state (internal). |
+| Name        | Description                                                   |
+| ----------- | ------------------------------------------------------------- |
+| `docs`      | The documents to predict. ~~Iterable[Doc]~~                   |
+| **RETURNS** | A helper class for the parse state (internal). ~~StateClass~~ |

 ## EntityRecognizer.set_annotations {#set_annotations tag="method"}

@@ -184,10 +184,10 @@ Modify a batch of [`Doc`](/api/doc) objects, using pre-computed scores.
 > ner.set_annotations([doc1, doc2], scores)
 > ```

-| Name     | Type               | Description                                                |
-| -------- | ------------------ | ---------------------------------------------------------- |
-| `docs`   | `Iterable[Doc]`    | The documents to modify.                                   |
-| `scores` | `List[StateClass]` | The scores to set, produced by `EntityRecognizer.predict`. |
+| Name     | Description                                                                                                                           |
+| -------- | ------------------------------------------------------------------------------------------------------------------------------------- |
+| `docs`   | The documents to modify. ~~Iterable[Doc]~~                                                                                            |
+| `scores` | The scores to set, produced by `EntityRecognizer.predict`. Returns an internal helper class for the parse state. ~~List[StateClass]~~ |

 ## EntityRecognizer.update {#update tag="method"}

@@ -203,15 +203,15 @@ model. Delegates to [`predict`](/api/entityrecognizer#predict) and
 > losses = ner.update(examples, sgd=optimizer)
 > ```

-| Name              | Type                                                | Description                                                                                                                                    |
-| ----------------- | --------------------------------------------------- | ---------------------------------------------------------------------------------------------------------------------------------------------- |
-| `examples`        | `Iterable[Example]`                                 | A batch of [`Example`](/api/example) objects to learn from.                                                                                    |
-| _keyword-only_    |                                                     |                                                                                                                                                |
-| `drop`            | float                                               | The dropout rate.                                                                                                                              |
-| `set_annotations` | bool                                                | Whether or not to update the `Example` objects with the predictions, delegating to [`set_annotations`](/api/entityrecognizer#set_annotations). |
-| `sgd`             | [`Optimizer`](https://thinc.ai/docs/api-optimizers) | The optimizer.                                                                                                                                 |
-| `losses`          | `Dict[str, float]`                                  | Optional record of the loss during training. Updated using the component name as the key.                                                      |
-| **RETURNS**       | `Dict[str, float]`                                  | The updated `losses` dictionary.                                                                                                               |
+| Name              | Description                                                                                                                        |
+| ----------------- | ---------------------------------------------------------------------------------------------------------------------------------- |
+| `examples`        | A batch of [`Example`](/api/example) objects to learn from. ~~Iterable[Example]~~                                                  |
+| _keyword-only_    |                                                                                                                                    |
+| `drop`            | The dropout rate. ~~float~~                                                                                                        |
+| `set_annotations` | Whether or not to update the `Example` objects with the predictions, delegating to [`set_annotations`](#set_annotations). ~~bool~~ |
+| `sgd`             | An optimizer. Will be created via [`create_optimizer`](#create_optimizer) if not set. ~~Optional[Optimizer]~~                      |
+| `losses`          | Optional record of the loss during training. Updated using the component name as the key. ~~Optional[Dict[str, float]]~~           |
+| **RETURNS**       | The updated `losses` dictionary. ~~Dict[str, float]~~                                                                              |

 ## EntityRecognizer.get_loss {#get_loss tag="method"}

@@ -226,11 +226,11 @@ predicted scores.
 > loss, d_loss = ner.get_loss(examples, scores)
 > ```

-| Name        | Type                  | Description                                         |
-| ----------- | --------------------- | --------------------------------------------------- |
-| `examples`  | `Iterable[Example]`   | The batch of examples.                              |
-| `scores`    | `List[StateClass]`    | Scores representing the model's predictions.        |
-| **RETURNS** | `Tuple[float, float]` | The loss and the gradient, i.e. `(loss, gradient)`. |
+| Name        | Description                                                                 |
+| ----------- | --------------------------------------------------------------------------- |
+| `examples`  | The batch of examples. ~~Iterable[Example]~~                                |
+| `scores`    | Scores representing the model's predictions. ~~StateClass~~                 |
+| **RETURNS** | The loss and the gradient, i.e. `(loss, gradient)`. ~~Tuple[float, float]~~ |

 ## EntityRecognizer.score {#score tag="method" new="3"}

@@ -242,10 +242,10 @@ Score a batch of examples.
 > scores = ner.score(examples)
 > ```

-| Name        | Type                | Description                                                              |
-| ----------- | ------------------- | ------------------------------------------------------------------------ |
-| `examples`  | `Iterable[Example]` | The examples to score.                                                   |
-| **RETURNS** | `Dict[str, Any]`    | The scores, produced by [`Scorer.score_spans`](/api/scorer#score_spans). |
+| Name        | Description                                                                                                            |
+| ----------- | ---------------------------------------------------------------------------------------------------------------------- |
+| `examples`  | The examples to score. ~~Iterable[Example]~~                                                                           |
+| **RETURNS** | The scores, produced by [`Scorer.score_spans`](/api/scorer#score_spans). ~~Dict[str, Union[float, Dict[str, float]]]~~ |

 ## EntityRecognizer.create_optimizer {#create_optimizer tag="method"}

@@ -258,9 +258,9 @@ Create an optimizer for the pipeline component.
 > optimizer = ner.create_optimizer()
 > ```

-| Name        | Type                                                | Description    |
-| ----------- | --------------------------------------------------- | -------------- |
-| **RETURNS** | [`Optimizer`](https://thinc.ai/docs/api-optimizers) | The optimizer. |
+| Name        | Description                  |
+| ----------- | ---------------------------- |
+| **RETURNS** | The optimizer. ~~Optimizer~~ |

 ## EntityRecognizer.use_params {#use_params tag="method, contextmanager"}

@@ -275,9 +275,9 @@ context, the original parameters are restored.
 > ner.to_disk("/best_model")
 > ```

-| Name     | Type | Description                               |
-| -------- | ---- | ----------------------------------------- |
-| `params` | dict | The parameter values to use in the model. |
+| Name     | Description                                        |
+| -------- | -------------------------------------------------- |
+| `params` | The parameter values to use in the model. ~~dict~~ |

 ## EntityRecognizer.add_label {#add_label tag="method"}

@@ -290,10 +290,10 @@ Add a new label to the pipe.
 > ner.add_label("MY_LABEL")
 > ```

-| Name        | Type | Description                                         |
-| ----------- | ---- | --------------------------------------------------- |
-| `label`     | str  | The label to add.                                   |
-| **RETURNS** | int  | `0` if the label is already present, otherwise `1`. |
+| Name        | Description                                                 |
+| ----------- | ----------------------------------------------------------- |
+| `label`     | The label to add. ~~str~~                                   |
+| **RETURNS** | `0` if the label is already present, otherwise `1`. ~~int~~ |

 ## EntityRecognizer.to_disk {#to_disk tag="method"}

@@ -306,11 +306,11 @@ Serialize the pipe to disk.
 > ner.to_disk("/path/to/ner")
 > ```

-| Name           | Type            | Description                                                                                                           |
-| -------------- | --------------- | --------------------------------------------------------------------------------------------------------------------- |
-| `path`         | str / `Path`    | A path to a directory, which will be created if it doesn't exist. Paths may be either strings or `Path`-like objects. |
-| _keyword-only_ |                 |                                                                                                                       |
-| `exclude`      | `Iterable[str]` | String names of [serialization fields](#serialization-fields) to exclude.                                             |
+| Name           | Description                                                                                                                                |
+| -------------- | ------------------------------------------------------------------------------------------------------------------------------------------ |
+| `path`         | A path to a directory, which will be created if it doesn't exist. Paths may be either strings or `Path`-like objects. ~~Union[str, Path]~~ |
+| _keyword-only_ |                                                                                                                                            |
+| `exclude`      | String names of [serialization fields](#serialization-fields) to exclude. ~~Iterable[str]~~                                                |

 ## EntityRecognizer.from_disk {#from_disk tag="method"}

@@ -323,12 +323,12 @@ Load the pipe from disk. Modifies the object in place and returns it.
 > ner.from_disk("/path/to/ner")
 > ```

-| Name           | Type               | Description                                                                |
-| -------------- | ------------------ | -------------------------------------------------------------------------- |
-| `path`         | str / `Path`       | A path to a directory. Paths may be either strings or `Path`-like objects. |
-| _keyword-only_ |                    |                                                                            |
-| `exclude`      | `Iterable[str]`    | String names of [serialization fields](#serialization-fields) to exclude.  |
-| **RETURNS**    | `EntityRecognizer` | The modified `EntityRecognizer` object.                                    |
+| Name           | Description                                                                                     |
+| -------------- | ----------------------------------------------------------------------------------------------- |
+| `path`         | A path to a directory. Paths may be either strings or `Path`-like objects. ~~Union[str, Path]~~ |
+| _keyword-only_ |                                                                                                 |
+| `exclude`      | String names of [serialization fields](#serialization-fields) to exclude. ~~Iterable[str]~~     |
+| **RETURNS**    | The modified `EntityRecognizer` object. ~~EntityRecognizer~~                                    |

## EntityRecognizer.to_bytes {#to_bytes tag="method"}
|
## EntityRecognizer.to_bytes {#to_bytes tag="method"}
|
||||||
|
|
||||||
|
@ -341,11 +341,11 @@ Load the pipe from disk. Modifies the object in place and returns it.
|
||||||
|
|
||||||
Serialize the pipe to a bytestring.
|
Serialize the pipe to a bytestring.
|
||||||
|
|
||||||
| Name | Type | Description |
|
| Name | Description |
|
||||||
| -------------- | --------------- | ------------------------------------------------------------------------- |
|
| -------------- | ------------------------------------------------------------------------------------------- |
|
||||||
| _keyword-only_ | | |
|
| _keyword-only_ | |
|
||||||
| `exclude` | `Iterable[str]` | String names of [serialization fields](#serialization-fields) to exclude. |
|
| `exclude` | String names of [serialization fields](#serialization-fields) to exclude. ~~Iterable[str]~~ |
|
||||||
| **RETURNS** | bytes | The serialized form of the `EntityRecognizer` object. |
|
| **RETURNS** | The serialized form of the `EntityRecognizer` object. ~~bytes~~ |
|
||||||
|
|
||||||
## EntityRecognizer.from_bytes {#from_bytes tag="method"}
|
## EntityRecognizer.from_bytes {#from_bytes tag="method"}
|
||||||
|
|
||||||
|
@ -359,12 +359,12 @@ Load the pipe from a bytestring. Modifies the object in place and returns it.
|
||||||
> ner.from_bytes(ner_bytes)
|
> ner.from_bytes(ner_bytes)
|
||||||
> ```
|
> ```
|
||||||
|
|
||||||
| Name | Type | Description |
|
| Name | Description |
|
||||||
| -------------- | ------------------ | ------------------------------------------------------------------------- |
|
| -------------- | ------------------------------------------------------------------------------------------- |
|
||||||
| `bytes_data` | bytes | The data to load from. |
|
| `bytes_data` | The data to load from. ~~bytes~~ |
|
||||||
| _keyword-only_ | | |
|
| _keyword-only_ | |
|
||||||
| `exclude` | `Iterable[str]` | String names of [serialization fields](#serialization-fields) to exclude. |
|
| `exclude` | String names of [serialization fields](#serialization-fields) to exclude. ~~Iterable[str]~~ |
|
||||||
| **RETURNS** | `EntityRecognizer` | The `EntityRecognizer` object. |
|
| **RETURNS** | The `EntityRecognizer` object. ~~EntityRecognizer~~ |
|
||||||
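The `to_bytes`/`from_bytes` pair above follows the usual serialize–restore round trip. A minimal language-agnostic sketch, using `pickle` on a stand-in state dict (spaCy's own wire format is an internal detail and is only assumed here, not shown):

```python
import pickle

# Stand-in for a component's serializable state; the keys are illustrative,
# not spaCy's actual internal layout.
state = {"cfg": {"moves": None}, "vocab": b"\x00", "model": b"\x01\x02"}
data = pickle.dumps(state)      # to_bytes: component state -> bytestring
restored = pickle.loads(data)   # from_bytes: bytestring -> component state
```

The bytestring can be stored or sent over a network and restored on the other end, which is what makes the `exclude` fields above useful when some state (e.g. the vocab) is shared.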
|
|
||||||
## EntityRecognizer.labels {#labels tag="property"}
|
## EntityRecognizer.labels {#labels tag="property"}
|
||||||
|
|
||||||
|
@ -377,9 +377,9 @@ The labels currently added to the component.
|
||||||
> assert "MY_LABEL" in ner.labels
|
> assert "MY_LABEL" in ner.labels
|
||||||
> ```
|
> ```
|
||||||
|
|
||||||
| Name | Type | Description |
|
| Name | Description |
|
||||||
| ----------- | ----- | ---------------------------------- |
|
| ----------- | ------------------------------------------------------ |
|
||||||
| **RETURNS** | tuple | The labels added to the component. |
|
| **RETURNS** | The labels added to the component. ~~Tuple[str, ...]~~ |
|
||||||
|
|
||||||
## Serialization fields {#serialization-fields}
|
## Serialization fields {#serialization-fields}
|
||||||
|
|
||||||
|
|
|
@ -34,12 +34,12 @@ how the component should be configured. You can override its settings via the
|
||||||
> nlp.add_pipe("entity_ruler", config=config)
|
> nlp.add_pipe("entity_ruler", config=config)
|
||||||
> ```
|
> ```
|
||||||
|
|
||||||
| Setting | Type | Description | Default |
|
| Setting | Description |
|
||||||
| --------------------- | ---- | ------------------------------------------------------------------------------------------------------------------------------------------- | ------- |
|
| --------------------- | --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
|
||||||
| `phrase_matcher_attr` | str | Optional attribute name match on for the internal [`PhraseMatcher`](/api/phrasematcher), e.g. `LOWER` to match on the lowercase token text. | `None` |
|
| `phrase_matcher_attr` | Optional attribute name to match on for the internal [`PhraseMatcher`](/api/phrasematcher), e.g. `LOWER` to match on the lowercase token text. Defaults to `None`. ~~Optional[Union[int, str]]~~ |
|
||||||
| `validate` | bool | Whether patterns should be validated (passed to the `Matcher` and `PhraseMatcher`). | `False` |
|
| `validate` | Whether patterns should be validated (passed to the `Matcher` and `PhraseMatcher`). Defaults to `False`. ~~bool~~ |
|
||||||
| `overwrite_ents` | bool | If existing entities are present, e.g. entities added by the model, overwrite them by matches if necessary. | `False` |
|
| `overwrite_ents` | If existing entities are present, e.g. entities added by the model, overwrite them by matches if necessary. Defaults to `False`. ~~bool~~ |
|
||||||
| `ent_id_sep` | str | Separator used internally for entity IDs. | `"||"` |
|
| `ent_id_sep` | Separator used internally for entity IDs. Defaults to `"||"`. ~~str~~ |
|
||||||
|
|
||||||
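The settings in the table above form a plain, JSON-serializable dict, which is what makes them usable both in a config file and as a `config` override. A small sketch (the values here are illustrative, not the defaults):

```python
import json

# Hypothetical override dict mirroring the settings table above; key names
# follow the documented settings, the values are example overrides.
config = {
    "phrase_matcher_attr": None,  # match the PhraseMatcher on the default attr
    "validate": True,             # validate patterns in Matcher/PhraseMatcher
    "overwrite_ents": True,       # let ruler matches replace existing entities
    "ent_id_sep": "||",           # separator used internally for entity IDs
}
# Plain data only, so it round-trips through JSON cleanly.
serialized = json.dumps(config)
```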
```python
|
```python
|
||||||
https://github.com/explosion/spaCy/blob/develop/spacy/pipeline/entityruler.py
|
https://github.com/explosion/spaCy/blob/develop/spacy/pipeline/entityruler.py
|
||||||
|
@ -63,16 +63,16 @@ be a token pattern (list) or a phrase pattern (string). For example:
|
||||||
> ruler = EntityRuler(nlp, overwrite_ents=True)
|
> ruler = EntityRuler(nlp, overwrite_ents=True)
|
||||||
> ```
|
> ```
|
||||||
|
|
||||||
| Name | Type | Description |
|
| Name | Description |
|
||||||
| --------------------------------- | ---------- | ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
|
| --------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
|
||||||
| `nlp` | `Language` | The shared nlp object to pass the vocab to the matchers and process phrase patterns. |
|
| `nlp` | The shared nlp object to pass the vocab to the matchers and process phrase patterns. ~~Language~~ |
|
||||||
| `name` <Tag variant="new">3</Tag> | str | Instance name of the current pipeline component. Typically passed in automatically from the factory when the component is added. Used to disable the current entity ruler while creating phrase patterns with the nlp object. |
|
| `name` <Tag variant="new">3</Tag> | Instance name of the current pipeline component. Typically passed in automatically from the factory when the component is added. Used to disable the current entity ruler while creating phrase patterns with the nlp object. ~~str~~ |
|
||||||
| _keyword-only_ | | |
|
| _keyword-only_ | |
|
||||||
| `phrase_matcher_attr` | int / str | Optional attribute name match on for the internal [`PhraseMatcher`](/api/phrasematcher), e.g. `LOWER` to match on the lowercase token text. Defaults to `None`. |
|
| `phrase_matcher_attr` | Optional attribute name to match on for the internal [`PhraseMatcher`](/api/phrasematcher), e.g. `LOWER` to match on the lowercase token text. Defaults to `None`. ~~Optional[Union[int, str]]~~ |
|
||||||
| `validate` | bool | Whether patterns should be validated, passed to Matcher and PhraseMatcher as `validate`. Defaults to `False`. |
|
| `validate` | Whether patterns should be validated, passed to Matcher and PhraseMatcher as `validate`. Defaults to `False`. ~~bool~~ |
|
||||||
| `overwrite_ents` | bool | If existing entities are present, e.g. entities added by the model, overwrite them by matches if necessary. Defaults to `False`. |
|
| `overwrite_ents` | If existing entities are present, e.g. entities added by the model, overwrite them by matches if necessary. Defaults to `False`. ~~bool~~ |
|
||||||
| `ent_id_sep` | str | Separator used internally for entity IDs. Defaults to `"||"`. |
|
| `ent_id_sep` | Separator used internally for entity IDs. Defaults to `"||"`. ~~str~~ |
|
||||||
| `patterns` | iterable | Optional patterns to load in on initialization. |
|
| `patterns` | Optional patterns to load in on initialization. ~~Optional[List[Dict[str, Union[str, List[dict]]]]]~~ |
|
||||||
|
|
||||||
## EntityRuler.\_\_len\_\_ {#len tag="method"}
|
## EntityRuler.\_\_len\_\_ {#len tag="method"}
|
||||||
|
|
||||||
|
@ -87,9 +87,9 @@ The number of all patterns added to the entity ruler.
|
||||||
> assert len(ruler) == 1
|
> assert len(ruler) == 1
|
||||||
> ```
|
> ```
|
||||||
|
|
||||||
| Name | Type | Description |
|
| Name | Description |
|
||||||
| ----------- | ---- | ----------------------- |
|
| ----------- | ------------------------------- |
|
||||||
| **RETURNS** | int | The number of patterns. |
|
| **RETURNS** | The number of patterns. ~~int~~ |
|
||||||
|
|
||||||
## EntityRuler.\_\_contains\_\_ {#contains tag="method"}
|
## EntityRuler.\_\_contains\_\_ {#contains tag="method"}
|
||||||
|
|
||||||
|
@ -104,10 +104,10 @@ Whether a label is present in the patterns.
|
||||||
> assert not "PERSON" in ruler
|
> assert not "PERSON" in ruler
|
||||||
> ```
|
> ```
|
||||||
|
|
||||||
| Name | Type | Description |
|
| Name | Description |
|
||||||
| ----------- | ---- | -------------------------------------------- |
|
| ----------- | ----------------------------------------------------- |
|
||||||
| `label` | str | The label to check. |
|
| `label` | The label to check. ~~str~~ |
|
||||||
| **RETURNS** | bool | Whether the entity ruler contains the label. |
|
| **RETURNS** | Whether the entity ruler contains the label. ~~bool~~ |
|
||||||
|
|
||||||
## EntityRuler.\_\_call\_\_ {#call tag="method"}
|
## EntityRuler.\_\_call\_\_ {#call tag="method"}
|
||||||
|
|
||||||
|
@ -130,10 +130,10 @@ is chosen.
|
||||||
> assert ents == [("Apple", "ORG")]
|
> assert ents == [("Apple", "ORG")]
|
||||||
> ```
|
> ```
|
||||||
|
|
||||||
| Name | Type | Description |
|
| Name | Description |
|
||||||
| ----------- | ----- | ------------------------------------------------------------ |
|
| ----------- | -------------------------------------------------------------------- |
|
||||||
| `doc` | `Doc` | The `Doc` object to process, e.g. the `Doc` in the pipeline. |
|
| `doc` | The `Doc` object to process, e.g. the `Doc` in the pipeline. ~~Doc~~ |
|
||||||
| **RETURNS** | `Doc` | The modified `Doc` with added entities, if available. |
|
| **RETURNS** | The modified `Doc` with added entities, if available. ~~Doc~~ |
|
||||||
|
|
||||||
## EntityRuler.add_patterns {#add_patterns tag="method"}
|
## EntityRuler.add_patterns {#add_patterns tag="method"}
|
||||||
|
|
||||||
|
@ -152,9 +152,9 @@ of dicts) or a phrase pattern (string). For more details, see the usage guide on
|
||||||
> ruler.add_patterns(patterns)
|
> ruler.add_patterns(patterns)
|
||||||
> ```
|
> ```
|
||||||
|
|
||||||
| Name | Type | Description |
|
| Name | Description |
|
||||||
| ---------- | ---- | -------------------- |
|
| ---------- | ---------------------------------------------------------------- |
|
||||||
| `patterns` | list | The patterns to add. |
|
| `patterns` | The patterns to add. ~~List[Dict[str, Union[str, List[dict]]]]~~ |
|
||||||
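The two documented pattern shapes can be sketched without spaCy at all: a phrase pattern is a string, a token pattern is a list of dicts, and both live under the same `{"label": ..., "pattern": ...}` wrapper:

```python
# Minimal sketch of the two pattern shapes accepted by add_patterns. The
# concrete labels and texts are illustrative.
patterns = [
    {"label": "ORG", "pattern": "Apple"},                                     # phrase pattern
    {"label": "GPE", "pattern": [{"LOWER": "san"}, {"LOWER": "francisco"}]},  # token pattern
]
phrase_patterns = [p for p in patterns if isinstance(p["pattern"], str)]
token_patterns = [p for p in patterns if isinstance(p["pattern"], list)]
```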
|
|
||||||
## EntityRuler.to_disk {#to_disk tag="method"}
|
## EntityRuler.to_disk {#to_disk tag="method"}
|
||||||
|
|
||||||
|
@ -171,9 +171,9 @@ only the patterns are saved as JSONL. If a directory name is provided, a
|
||||||
> ruler.to_disk("/path/to/entity_ruler") # saves patterns and config
|
> ruler.to_disk("/path/to/entity_ruler") # saves patterns and config
|
||||||
> ```
|
> ```
|
||||||
|
|
||||||
| Name | Type | Description |
|
| Name | Description |
|
||||||
| ------ | ------------ | ----------------------------------------------------------------------------------------------------------------------------------- |
|
| ------ | -------------------------------------------------------------------------------------------------------------------------------------------------------- |
|
||||||
| `path` | str / `Path` | A path to a JSONL file or directory, which will be created if it doesn't exist. Paths may be either strings or `Path`-like objects. |
|
| `path` | A path to a JSONL file or directory, which will be created if it doesn't exist. Paths may be either strings or `Path`-like objects. ~~Union[str, Path]~~ |
|
||||||
|
|
||||||
## EntityRuler.from_disk {#from_disk tag="method"}
|
## EntityRuler.from_disk {#from_disk tag="method"}
|
||||||
|
|
||||||
|
@ -190,10 +190,10 @@ configuration.
|
||||||
> ruler.from_disk("/path/to/entity_ruler") # loads patterns and config
|
> ruler.from_disk("/path/to/entity_ruler") # loads patterns and config
|
||||||
> ```
|
> ```
|
||||||
|
|
||||||
| Name | Type | Description |
|
| Name | Description |
|
||||||
| ----------- | ------------- | ---------------------------------------------------------------------------------------- |
|
| ----------- | ------------------------------------------------------------------------------------------------------------- |
|
||||||
| `path` | str / `Path` | A path to a JSONL file or directory. Paths may be either strings or `Path`-like objects. |
|
| `path` | A path to a JSONL file or directory. Paths may be either strings or `Path`-like objects. ~~Union[str, Path]~~ |
|
||||||
| **RETURNS** | `EntityRuler` | The modified `EntityRuler` object. |
|
| **RETURNS** | The modified `EntityRuler` object. ~~EntityRuler~~ |
|
||||||
|
|
||||||
## EntityRuler.to_bytes {#to_bytes tag="method"}
|
## EntityRuler.to_bytes {#to_bytes tag="method"}
|
||||||
|
|
||||||
|
@ -206,9 +206,9 @@ Serialize the entity ruler patterns to a bytestring.
|
||||||
> ruler_bytes = ruler.to_bytes()
|
> ruler_bytes = ruler.to_bytes()
|
||||||
> ```
|
> ```
|
||||||
|
|
||||||
| Name | Type | Description |
|
| Name | Description |
|
||||||
| ----------- | ----- | ------------------------ |
|
| ----------- | ---------------------------------- |
|
||||||
| **RETURNS** | bytes | The serialized patterns. |
|
| **RETURNS** | The serialized patterns. ~~bytes~~ |
|
||||||
|
|
||||||
## EntityRuler.from_bytes {#from_bytes tag="method"}
|
## EntityRuler.from_bytes {#from_bytes tag="method"}
|
||||||
|
|
||||||
|
@ -222,40 +222,40 @@ Load the pipe from a bytestring. Modifies the object in place and returns it.
|
||||||
> ruler.from_bytes(ruler_bytes)
|
> ruler.from_bytes(ruler_bytes)
|
||||||
> ```
|
> ```
|
||||||
|
|
||||||
| Name | Type | Description |
|
| Name | Description |
|
||||||
| ------------ | ------------- | ---------------------------------- |
|
| ------------ | -------------------------------------------------- |
|
||||||
| `bytes_data` | bytes | The bytestring to load. |
|
| `bytes_data` | The bytestring to load. ~~bytes~~ |
|
||||||
| **RETURNS** | `EntityRuler` | The modified `EntityRuler` object. |
|
| **RETURNS** | The modified `EntityRuler` object. ~~EntityRuler~~ |
|
||||||
|
|
||||||
## EntityRuler.labels {#labels tag="property"}
|
## EntityRuler.labels {#labels tag="property"}
|
||||||
|
|
||||||
All labels present in the match patterns.
|
All labels present in the match patterns.
|
||||||
|
|
||||||
| Name | Type | Description |
|
| Name | Description |
|
||||||
| ----------- | ----- | ------------------ |
|
| ----------- | -------------------------------------- |
|
||||||
| **RETURNS** | tuple | The string labels. |
|
| **RETURNS** | The string labels. ~~Tuple[str, ...]~~ |
|
||||||
|
|
||||||
## EntityRuler.ent_ids {#labels tag="property" new="2.2.2"}
|
## EntityRuler.ent_ids {#ent_ids tag="property" new="2.2.2"}
|
||||||
|
|
||||||
All entity ids present in the match patterns `id` properties.
|
All entity IDs present in the `id` properties of the match patterns.
|
||||||
|
|
||||||
| Name | Type | Description |
|
| Name | Description |
|
||||||
| ----------- | ----- | ------------------- |
|
| ----------- | ----------------------------------- |
|
||||||
| **RETURNS** | tuple | The string ent_ids. |
|
| **RETURNS** | The string IDs. ~~Tuple[str, ...]~~ |
|
||||||
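The role of `ent_id_sep` can be sketched in two lines: the separator joins a label and an entity ID into one internal key and splits them back apart. This mirrors the documented separator, not spaCy's internal helpers:

```python
# Join a label and a pattern's "id" with the documented default separator,
# then recover both parts. The concrete values are illustrative.
ent_id_sep = "||"
key = ent_id_sep.join(("ORG", "apple"))
label, ent_id = key.split(ent_id_sep)
```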
|
|
||||||
## EntityRuler.patterns {#patterns tag="property"}
|
## EntityRuler.patterns {#patterns tag="property"}
|
||||||
|
|
||||||
Get all patterns that were added to the entity ruler.
|
Get all patterns that were added to the entity ruler.
|
||||||
|
|
||||||
| Name | Type | Description |
|
| Name | Description |
|
||||||
| ----------- | ---- | -------------------------------------------------- |
|
| ----------- | ---------------------------------------------------------------------------------------- |
|
||||||
| **RETURNS** | list | The original patterns, one dictionary per pattern. |
|
| **RETURNS** | The original patterns, one dictionary per pattern. ~~List[Dict[str, Union[str, dict]]]~~ |
|
||||||
|
|
||||||
## Attributes {#attributes}
|
## Attributes {#attributes}
|
||||||
|
|
||||||
| Name | Type | Description |
|
| Name | Description |
|
||||||
| ----------------- | ------------------------------------- | ---------------------------------------------------------------- |
|
| ----------------- | --------------------------------------------------------------------------------------------------------------------- |
|
||||||
| `matcher` | [`Matcher`](/api/matcher) | The underlying matcher used to process token patterns. |
|
| `matcher` | The underlying matcher used to process token patterns. ~~Matcher~~ |
|
||||||
| `phrase_matcher` | [`PhraseMatcher`](/api/phrasematcher) | The underlying phrase matcher, used to process phrase patterns. |
|
| `phrase_matcher` | The underlying phrase matcher, used to process phrase patterns. ~~PhraseMatcher~~ |
|
||||||
| `token_patterns` | dict | The token patterns present in the entity ruler, keyed by label. |
|
| `token_patterns` | The token patterns present in the entity ruler, keyed by label. ~~Dict[str, List[Dict[str, Union[str, List[dict]]]]]~~ |
|
||||||
| `phrase_patterns` | dict | The phrase patterns present in the entity ruler, keyed by label. |
|
| `phrase_patterns` | The phrase patterns present in the entity ruler, keyed by label. ~~Dict[str, List[Doc]]~~ |
|
||||||
|
|
|
@ -8,9 +8,9 @@ new: 3.0
|
||||||
|
|
||||||
An `Example` holds the information for one training instance. It stores two
|
An `Example` holds the information for one training instance. It stores two
|
||||||
`Doc` objects: one for holding the gold-standard reference data, and one for
|
`Doc` objects: one for holding the gold-standard reference data, and one for
|
||||||
holding the predictions of the pipeline. An [`Alignment`](#alignment-object)
|
holding the predictions of the pipeline. An
|
||||||
object stores the alignment between these two documents, as they can differ in
|
[`Alignment`](/api/example#alignment-object) object stores the alignment between
|
||||||
tokenization.
|
these two documents, as they can differ in tokenization.
|
||||||
|
|
||||||
## Example.\_\_init\_\_ {#init tag="method"}
|
## Example.\_\_init\_\_ {#init tag="method"}
|
||||||
|
|
||||||
|
@ -31,12 +31,12 @@ both documents.
|
||||||
> example = Example(predicted, reference)
|
> example = Example(predicted, reference)
|
||||||
> ```
|
> ```
|
||||||
|
|
||||||
| Name | Type | Description |
|
| Name | Description |
|
||||||
| -------------- | ----------- | ------------------------------------------------------------------------------------------------ |
|
| -------------- | ------------------------------------------------------------------------------------------------------------------------ |
|
||||||
| `predicted` | `Doc` | The document containing (partial) predictions. Can not be `None`. |
|
| `predicted` | The document containing (partial) predictions. Can not be `None`. ~~Doc~~ |
|
||||||
| `reference` | `Doc` | The document containing gold-standard annotations. Can not be `None`. |
|
| `reference` | The document containing gold-standard annotations. Can not be `None`. ~~Doc~~ |
|
||||||
| _keyword-only_ | | |
|
| _keyword-only_ | |
|
||||||
| `alignment` | `Alignment` | An object holding the alignment between the tokens of the `predicted` and `reference` documents. |
|
| `alignment` | An object holding the alignment between the tokens of the `predicted` and `reference` documents. ~~Optional[Alignment]~~ |
|
||||||
|
|
||||||
## Example.from_dict {#from_dict tag="classmethod"}
|
## Example.from_dict {#from_dict tag="classmethod"}
|
||||||
|
|
||||||
|
@ -56,11 +56,11 @@ see the [training format documentation](/api/data-formats#dict-input).
|
||||||
> example = Example.from_dict(predicted, {"words": token_ref, "tags": tags_ref})
|
> example = Example.from_dict(predicted, {"words": token_ref, "tags": tags_ref})
|
||||||
> ```
|
> ```
|
||||||
|
|
||||||
| Name | Type | Description |
|
| Name | Description |
|
||||||
| -------------- | ---------------- | ----------------------------------------------------------------- |
|
| -------------- | ------------------------------------------------------------------------- |
|
||||||
| `predicted` | `Doc` | The document containing (partial) predictions. Can not be `None`. |
|
| `predicted` | The document containing (partial) predictions. Can not be `None`. ~~Doc~~ |
|
||||||
| `example_dict` | `Dict[str, obj]` | The gold-standard annotations as a dictionary. Can not be `None`. |
|
| `example_dict` | The gold-standard annotations as a dictionary. Can not be `None`. ~~Dict[str, Any]~~ |
|
||||||
| **RETURNS** | `Example` | The newly constructed object. |
|
| **RETURNS** | The newly constructed object. ~~Example~~ |
|
||||||
|
|
||||||
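The gold-standard dict passed to `Example.from_dict` is token texts plus parallel annotation lists, one value per token. A sketch of that shape (keys follow the training data format docs, the concrete values are illustrative):

```python
# Gold-standard annotations as a plain dict: "words" gives the reference
# tokenization, other keys give one annotation per token.
example_dict = {
    "words": ["I", "like", "London"],
    "tags": ["PRON", "VERB", "PROPN"],
}
# Parallel lists must line up exactly one annotation per token.
lengths_match = len(example_dict["words"]) == len(example_dict["tags"])
```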
## Example.text {#text tag="property"}
|
## Example.text {#text tag="property"}
|
||||||
|
|
||||||
|
@ -72,12 +72,14 @@ The text of the `predicted` document in this `Example`.
|
||||||
> raw_text = example.text
|
> raw_text = example.text
|
||||||
> ```
|
> ```
|
||||||
|
|
||||||
| Name | Type | Description |
|
| Name | Description |
|
||||||
| ----------- | ---- | ------------------------------------- |
|
| ----------- | --------------------------------------------- |
|
||||||
| **RETURNS** | str | The text of the `predicted` document. |
|
| **RETURNS** | The text of the `predicted` document. ~~str~~ |
|
||||||
|
|
||||||
## Example.predicted {#predicted tag="property"}
|
## Example.predicted {#predicted tag="property"}
|
||||||
|
|
||||||
|
The `Doc` holding the predictions. Occasionally also referred to as `example.x`.
|
||||||
|
|
||||||
> #### Example
|
> #### Example
|
||||||
>
|
>
|
||||||
> ```python
|
> ```python
|
||||||
|
@ -86,14 +88,15 @@ The text of the `predicted` document in this `Example`.
|
||||||
> set_annotations(docs, predictions)
|
> set_annotations(docs, predictions)
|
||||||
> ```
|
> ```
|
||||||
|
|
||||||
The `Doc` holding the predictions. Occassionally also refered to as `example.x`.
|
| Name | Description |
|
||||||
|
| ----------- | ------------------------------------------------------ |
|
||||||
| Name | Type | Description |
|
| **RETURNS** | The document containing (partial) predictions. ~~Doc~~ |
|
||||||
| ----------- | ----- | ---------------------------------------------- |
|
|
||||||
| **RETURNS** | `Doc` | The document containing (partial) predictions. |
|
|
||||||
|
|
||||||
## Example.reference {#reference tag="property"}
|
## Example.reference {#reference tag="property"}
|
||||||
|
|
||||||
|
The `Doc` holding the gold-standard annotations. Occasionally also referred to
|
||||||
|
as `example.y`.
|
||||||
|
|
||||||
> #### Example
|
> #### Example
|
||||||
>
|
>
|
||||||
> ```python
|
> ```python
|
||||||
|
@ -102,15 +105,15 @@ The `Doc` holding the predictions. Occassionally also refered to as `example.x`.
|
||||||
> gold_labels[i][j] = eg.reference.cats.get(label, 0.0)
|
> gold_labels[i][j] = eg.reference.cats.get(label, 0.0)
|
||||||
> ```
|
> ```
|
||||||
|
|
||||||
The `Doc` holding the gold-standard annotations. Occassionally also refered to
|
| Name | Description |
|
||||||
as `example.y`.
|
| ----------- | ---------------------------------------------------------- |
|
||||||
|
| **RETURNS** | The document containing gold-standard annotations. ~~Doc~~ |
|
||||||
| Name | Type | Description |
|
|
||||||
| ----------- | ----- | -------------------------------------------------- |
|
|
||||||
| **RETURNS** | `Doc` | The document containing gold-standard annotations. |
|
|
||||||
|
|
||||||
## Example.alignment {#alignment tag="property"}
|
## Example.alignment {#alignment tag="property"}
|
||||||
|
|
||||||
|
The [`Alignment`](/api/example#alignment-object) object mapping the tokens of
|
||||||
|
the `predicted` document to those of the `reference` document.
|
||||||
|
|
||||||
> #### Example
|
> #### Example
|
||||||
>
|
>
|
||||||
> ```python
|
> ```python
|
||||||
|
@ -122,15 +125,15 @@ as `example.y`.
|
||||||
> assert list(alignment.y2x.data) == [[0], [1], [2], [2]]
|
> assert list(alignment.y2x.data) == [[0], [1], [2], [2]]
|
||||||
> ```
|
> ```
|
||||||
|
|
||||||
The `Alignment` object mapping the tokens of the `predicted` document to those
|
| Name | Description |
|
||||||
of the `reference` document.
|
| ----------- | ---------------------------------------------------------------- |
|
||||||
|
| **RETURNS** | The `Alignment` object mapping the two documents. ~~Alignment~~ |
|
||||||
| Name | Type | Description |
|
|
||||||
| ----------- | ----------- | -------------------------------------------------- |
|
|
||||||
| **RETURNS** | `Alignment` | The document containing gold-standard annotations. |
|
|
||||||
|
|
||||||
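The ragged `x2y`/`y2x` data used in the example above can be sketched with plain lists: `x2y` maps each `predicted` token to `reference` token indices, and `y2x` holds the reverse mapping. The concrete index values here are illustrative:

```python
# Two-way token alignment as nested lists. Here predicted token 2 covers
# reference tokens 2 and 3 (e.g. one predicted token split into two).
x2y = [[0], [1], [2, 3]]
y2x = [[0], [1], [2], [2]]
# Consistency check: every forward mapping is mirrored in the reverse table.
mirrored = all(x in y2x[y] for x, ys in enumerate(x2y) for y in ys)
```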
## Example.get_aligned {#get_aligned tag="method"}
|
## Example.get_aligned {#get_aligned tag="method"}
|
||||||
|
|
||||||
|
Get the aligned view of a certain token attribute, denoted by its int ID or
|
||||||
|
string name.
|
||||||
|
|
||||||
> #### Example
|
> #### Example
|
||||||
>
|
>
|
||||||
> ```python
|
> ```python
|
||||||
|
@ -141,17 +144,18 @@ of the `reference` document.
|
||||||
> assert example.get_aligned("TAG", as_string=True) == ["VERB", "DET", "NOUN"]
|
> assert example.get_aligned("TAG", as_string=True) == ["VERB", "DET", "NOUN"]
|
||||||
> ```
|
> ```
|
||||||
|
|
||||||
Get the aligned view of a certain token attribute, denoted by its int ID or
|
| Name | Description |
|
||||||
string name.
|
| ----------- | -------------------------------------------------------------------------------------------------- |
|
||||||
|
| `field` | Attribute ID or string name. ~~Union[int, str]~~ |
|
||||||
| Name | Type | Description | Default |
|
| `as_string` | Whether or not to return the list of values as strings. Defaults to `False`. ~~bool~~ |
|
||||||
| ----------- | -------------------------- | ------------------------------------------------------------------ | ------- |
|
| **RETURNS** | List of integer values, or string values if `as_string` is `True`. ~~Union[List[int], List[str]]~~ |
|
||||||
| `field` | int or str | Attribute ID or string name | |
|
|
||||||
| `as_string` | bool | Whether or not to return the list of values as strings. | `False` |
|
|
||||||
| **RETURNS** | `List[int]` or `List[str]` | List of integer values, or string values if `as_string` is `True`. | |
|
|
||||||
|
|
||||||
## Example.get_aligned_parse {#get_aligned_parse tag="method"}
|
## Example.get_aligned_parse {#get_aligned_parse tag="method"}
|
||||||
|
|
||||||
|
Get the aligned view of the dependency parse. If `projectivize` is set to
|
||||||
|
`True`, non-projective dependency trees are made projective through the
|
||||||
|
Pseudo-Projective Dependency Parsing algorithm by Nivre and Nilsson (2005).
|
||||||
|
|
||||||
> #### Example
|
> #### Example
|
||||||
>
|
>
|
||||||
> ```python
|
> ```python
|
||||||
|
@ -161,17 +165,16 @@ string name.
|
||||||
> assert proj_heads == [3, 2, 3, 0, 3]
|
> assert proj_heads == [3, 2, 3, 0, 3]
|
||||||
> ```
|
> ```
|
||||||
|
|
||||||
Get the aligned view of the dependency parse. If `projectivize` is set to
|
| Name | Description |
|
||||||
`True`, non-projective dependency trees are made projective through the
|
| -------------- | -------------------------------------------------------------------------------------------------- |
|
||||||
Pseudo-Projective Dependency Parsing algorithm by Nivre and Nilsson (2005).
|
| `projectivize` | Whether or not to projectivize the dependency trees. Defaults to `True`. ~~bool~~ |
|
||||||
|
| **RETURNS** | List of aligned dependency head indices, one per token. ~~List[int]~~ |
|
||||||
| Name | Type | Description | Default |
|
|
||||||
| -------------- | -------------------------- | ------------------------------------------------------------------ | ------- |
|
|
||||||
| `projectivize` | bool | Whether or not to projectivize the dependency trees | `True` |
|
|
||||||
| **RETURNS** | `List[int]` or `List[str]` | List of integer values, or string values if `as_string` is `True`. | |
|
|
||||||
|
|
||||||
## Example.get_aligned_ner {#get_aligned_ner tag="method"}
|
## Example.get_aligned_ner {#get_aligned_ner tag="method"}
|
||||||
|
|
||||||
|
Get the aligned view of the NER
|
||||||
|
[BILUO](/usage/linguistic-features#accessing-ner) tags.
|
||||||
|
|
||||||
> #### Example
|
> #### Example
|
||||||
>
|
>
|
||||||
> ```python
|
> ```python
|
||||||
|
@ -184,15 +187,16 @@ Pseudo-Projective Dependency Parsing algorithm by Nivre and Nilsson (2005).
|
||||||
> assert ner_tags == ["B-PERSON", "L-PERSON", "O", "O", "U-LOC"]
|
> assert ner_tags == ["B-PERSON", "L-PERSON", "O", "O", "U-LOC"]
|
||||||
> ```
|
> ```
|
||||||
|
|
||||||
Get the aligned view of the NER
|
| Name | Description |
|
||||||
[BILUO](/usage/linguistic-features#accessing-ner) tags.
|
| ----------- | ------------------------------------------------------------------------------------------------- |
|
||||||
|
| **RETURNS** | List of BILUO values, denoting whether tokens are part of an NER annotation or not. ~~List[str]~~ |
|
||||||
| Name | Type | Description |
|
|
||||||
| ----------- | ----------- | ----------------------------------------------------------------------------------- |
|
|
||||||
| **RETURNS** | `List[str]` | List of BILUO values, denoting whether tokens are part of an NER annotation or not. |
|
|
||||||
|
|
||||||
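The BILUO scheme shown in the example above can be sketched in pure Python: given `(start, end, label)` spans over a token sequence, single-token entities get `U-`, multi-token entities get `B-`/`I-`/`L-`, and everything else is `O`. spaCy ships its own conversion helpers; this only illustrates the tag layout:

```python
# Convert (start, end, label) spans into BILUO tags over n_tokens tokens.
def biluo_tags(n_tokens, ents):
    tags = ["O"] * n_tokens
    for start, end, label in ents:
        if end - start == 1:
            tags[start] = f"U-{label}"        # unit-length entity
        else:
            tags[start] = f"B-{label}"        # beginning of entity
            for i in range(start + 1, end - 1):
                tags[i] = f"I-{label}"        # inside entity
            tags[end - 1] = f"L-{label}"      # last token of entity
    return tags

tags = biluo_tags(5, [(0, 2, "PERSON"), (4, 5, "LOC")])
```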
## Example.get_aligned_spans_y2x {#get_aligned_spans_y2x tag="method"}
|
## Example.get_aligned_spans_y2x {#get_aligned_spans_y2x tag="method"}
|
||||||
|
|
||||||
|
Get the aligned view of any set of [`Span`](/api/span) objects defined over
|
||||||
|
[`Example.reference`](/api/example#reference). The resulting span indices will
|
||||||
|
align to the tokenization in [`Example.predicted`](/api/example#predicted).
|
||||||
|
|
||||||
> #### Example
|
> #### Example
|
||||||
>
|
>
|
||||||
> ```python
|
> ```python
|
||||||
|
@@ -207,17 +211,19 @@ Get the aligned view of the NER
 > assert [(ent.start, ent.end) for ent in ents_y2x] == [(0, 1)]
 > ```
 
-Get the aligned view of any set of [`Span`](/api/span) objects defined over
-`example.reference`. The resulting span indices will align to the tokenization
-in `example.predicted`.
-
-| Name        | Type             | Description                                                     |
-| ----------- | ---------------- | --------------------------------------------------------------- |
-| `y_spans`   | `Iterable[Span]` | `Span` objects aligned to the tokenization of `self.reference`. |
-| **RETURNS** | `Iterable[Span]` | `Span` objects aligned to the tokenization of `self.predicted`. |
+| Name        | Description                                                                   |
+| ----------- | ----------------------------------------------------------------------------- |
+| `y_spans`   | `Span` objects aligned to the tokenization of `reference`. ~~Iterable[Span]~~ |
+| **RETURNS** | `Span` objects aligned to the tokenization of `predicted`. ~~List[Span]~~     |
 
 ## Example.get_aligned_spans_x2y {#get_aligned_spans_x2y tag="method"}
 
+Get the aligned view of any set of [`Span`](/api/span) objects defined over
+[`Example.predicted`](/api/example#predicted). The resulting span indices will
+align to the tokenization in [`Example.reference`](/api/example#reference). This
+method is particularly useful to assess the accuracy of predicted entities
+against the original gold-standard annotation.
+
 > #### Example
 >
 > ```python
@@ -232,15 +238,10 @@ in `example.predicted`.
 > assert [(ent.start, ent.end) for ent in ents_x2y] == [(0, 2)]
 > ```
 
-Get the aligned view of any set of [`Span`](/api/span) objects defined over
-`example.predicted`. The resulting span indices will align to the tokenization
-in `example.reference`. This method is particularly useful to assess the
-accuracy of predicted entities against the original gold-standard annotation.
-
-| Name        | Type             | Description                                                     |
-| ----------- | ---------------- | --------------------------------------------------------------- |
-| `x_spans`   | `Iterable[Span]` | `Span` objects aligned to the tokenization of `self.predicted`. |
-| **RETURNS** | `Iterable[Span]` | `Span` objects aligned to the tokenization of `self.reference`. |
+| Name        | Description                                                                   |
+| ----------- | ----------------------------------------------------------------------------- |
+| `x_spans`   | `Span` objects aligned to the tokenization of `predicted`. ~~Iterable[Span]~~ |
+| **RETURNS** | `Span` objects aligned to the tokenization of `reference`. ~~List[Span]~~     |
 
 ## Example.to_dict {#to_dict tag="method"}
 
@@ -253,12 +254,14 @@ reference annotation contained in this `Example`.
 > eg_dict = example.to_dict()
 > ```
 
-| Name        | Type             | Description                                            |
-| ----------- | ---------------- | ------------------------------------------------------ |
-| **RETURNS** | `Dict[str, Any]` | Dictionary representation of the reference annotation. |
+| Name        | Description                                                               |
+| ----------- | ------------------------------------------------------------------------- |
+| **RETURNS** | Dictionary representation of the reference annotation. ~~Dict[str, Any]~~ |
 
 ## Example.split_sents {#split_sents tag="method"}
 
+Split one `Example` into multiple `Example` objects, one for each sentence.
+
 > #### Example
 >
 > ```python
@@ -271,11 +274,9 @@ reference annotation contained in this `Example`.
 > assert split_examples[1].text == "had lots of fun"
 > ```
 
-Split one `Example` into multiple `Example` objects, one for each sentence.
-
-| Name        | Type            | Description                                                |
-| ----------- | --------------- | ---------------------------------------------------------- |
-| **RETURNS** | `List[Example]` | List of `Example` objects, one for each original sentence. |
+| Name        | Description                                                                  |
+| ----------- | ---------------------------------------------------------------------------- |
+| **RETURNS** | List of `Example` objects, one for each original sentence. ~~List[Example]~~ |
 
 ## Alignment {#alignment-object new="3"}
 
@@ -283,10 +284,10 @@ Calculate alignment tables between two tokenizations.
 
 ### Alignment attributes {#alignment-attributes"}
 
-| Name  | Type                                               | Description                                                |
-| ----- | -------------------------------------------------- | ---------------------------------------------------------- |
-| `x2y` | [`Ragged`](https://thinc.ai/docs/api-types#ragged) | The `Ragged` object holding the alignment from `x` to `y`. |
-| `y2x` | [`Ragged`](https://thinc.ai/docs/api-types#ragged) | The `Ragged` object holding the alignment from `y` to `x`. |
+| Name  | Description                                                           |
+| ----- | --------------------------------------------------------------------- |
+| `x2y` | The `Ragged` object holding the alignment from `x` to `y`. ~~Ragged~~ |
+| `y2x` | The `Ragged` object holding the alignment from `y` to `x`. ~~Ragged~~ |
 
 <Infobox title="Important note" variant="warning">
 
@@ -314,8 +315,8 @@ tokenizations add up to the same string. For example, you'll be able to align
 
 ### Alignment.from_strings {#classmethod tag="function"}
 
-| Name        | Type        | Description                                     |
-| ----------- | ----------- | ----------------------------------------------- |
-| `A`         | list        | String values of candidate tokens to align.     |
-| `B`         | list        | String values of reference tokens to align.     |
-| **RETURNS** | `Alignment` | An `Alignment` object describing the alignment. |
+| Name        | Description                                                   |
+| ----------- | ------------------------------------------------------------- |
+| `A`         | String values of candidate tokens to align. ~~List[str]~~     |
+| `B`         | String values of reference tokens to align. ~~List[str]~~     |
+| **RETURNS** | An `Alignment` object describing the alignment. ~~Alignment~~ |
@@ -9,7 +9,7 @@ new: 2.2
 ---
 
 The `KnowledgeBase` object provides a method to generate
-[`Candidate`](/api/kb/#candidate_init) objects, which are plausible external
+[`Candidate`](/api/kb/#candidate) objects, which are plausible external
 identifiers given a certain textual mention. Each such `Candidate` holds
 information from the relevant KB entities, such as its frequency in text and
 possible aliases. Each entity in the knowledge base also has a pretrained entity
@@ -27,18 +27,18 @@ Create the knowledge base.
 > kb = KnowledgeBase(vocab=vocab, entity_vector_length=64)
 > ```
 
-| Name                   | Type    | Description                              |
-| ---------------------- | ------- | ---------------------------------------- |
-| `vocab`                | `Vocab` | A `Vocab` object.                        |
-| `entity_vector_length` | int     | Length of the fixed-size entity vectors. |
+| Name                   | Description                                      |
+| ---------------------- | ------------------------------------------------ |
+| `vocab`                | The shared vocabulary. ~~Vocab~~                 |
+| `entity_vector_length` | Length of the fixed-size entity vectors. ~~int~~ |
 
 ## KnowledgeBase.entity_vector_length {#entity_vector_length tag="property"}
 
 The length of the fixed-size entity vectors in the knowledge base.
 
-| Name        | Type | Description                              |
-| ----------- | ---- | ---------------------------------------- |
-| **RETURNS** | int  | Length of the fixed-size entity vectors. |
+| Name        | Description                                      |
+| ----------- | ------------------------------------------------ |
+| **RETURNS** | Length of the fixed-size entity vectors. ~~int~~ |
 
 ## KnowledgeBase.add_entity {#add_entity tag="method"}
 
@@ -53,11 +53,11 @@ vector, which should be of length
 > kb.add_entity(entity="Q463035", freq=111, entity_vector=vector2)
 > ```
 
-| Name            | Type   | Description                                     |
-| --------------- | ------ | ----------------------------------------------- |
-| `entity`        | str    | The unique entity identifier                    |
-| `freq`          | float  | The frequency of the entity in a typical corpus |
-| `entity_vector` | vector | The pretrained vector of the entity             |
+| Name            | Description                                                |
+| --------------- | ---------------------------------------------------------- |
+| `entity`        | The unique entity identifier. ~~str~~                      |
+| `freq`          | The frequency of the entity in a typical corpus. ~~float~~ |
+| `entity_vector` | The pretrained vector of the entity. ~~numpy.ndarray~~     |
 
 ## KnowledgeBase.set_entities {#set_entities tag="method"}
 
@@ -70,11 +70,11 @@ frequency and entity vector for each entity.
 > kb.set_entities(entity_list=["Q42", "Q463035"], freq_list=[32, 111], vector_list=[vector1, vector2])
 > ```
 
-| Name          | Type     | Description                       |
-| ------------- | -------- | --------------------------------- |
-| `entity_list` | iterable | List of unique entity identifiers |
-| `freq_list`   | iterable | List of entity frequencies        |
-| `vector_list` | iterable | List of entity vectors            |
+| Name          | Description                                                      |
+| ------------- | ---------------------------------------------------------------- |
+| `entity_list` | List of unique entity identifiers. ~~Iterable[Union[str, int]]~~ |
+| `freq_list`   | List of entity frequencies. ~~Iterable[int]~~                    |
+| `vector_list` | List of entity vectors. ~~Iterable[numpy.ndarray]~~              |
 
 ## KnowledgeBase.add_alias {#add_alias tag="method"}
 
@@ -90,11 +90,11 @@ should not exceed 1.
 > kb.add_alias(alias="Douglas", entities=["Q42", "Q463035"], probabilities=[0.6, 0.3])
 > ```
 
-| Name            | Type     | Description                                        |
-| --------------- | -------- | -------------------------------------------------- |
-| `alias`         | str      | The textual mention or alias                       |
-| `entities`      | iterable | The potential entities that the alias may refer to |
-| `probabilities` | iterable | The prior probabilities of each entity             |
+| Name            | Description                                                                       |
+| --------------- | --------------------------------------------------------------------------------- |
+| `alias`         | The textual mention or alias. ~~str~~                                             |
+| `entities`      | The potential entities that the alias may refer to. ~~Iterable[Union[str, int]]~~ |
+| `probabilities` | The prior probabilities of each entity. ~~Iterable[float]~~                       |
 
 ## KnowledgeBase.\_\_len\_\_ {#len tag="method"}
 
@@ -106,9 +106,9 @@ Get the total number of entities in the knowledge base.
 > total_entities = len(kb)
 > ```
 
-| Name        | Type | Description                                   |
-| ----------- | ---- | --------------------------------------------- |
-| **RETURNS** | int  | The number of entities in the knowledge base. |
+| Name        | Description                                           |
+| ----------- | ----------------------------------------------------- |
+| **RETURNS** | The number of entities in the knowledge base. ~~int~~ |
 
 ## KnowledgeBase.get_entity_strings {#get_entity_strings tag="method"}
 
@@ -120,9 +120,9 @@ Get a list of all entity IDs in the knowledge base.
 > all_entities = kb.get_entity_strings()
 > ```
 
-| Name        | Type | Description                                 |
-| ----------- | ---- | ------------------------------------------- |
-| **RETURNS** | list | The list of entities in the knowledge base. |
+| Name        | Description                                               |
+| ----------- | --------------------------------------------------------- |
+| **RETURNS** | The list of entities in the knowledge base. ~~List[str]~~ |
 
 ## KnowledgeBase.get_size_aliases {#get_size_aliases tag="method"}
 
@@ -134,9 +134,9 @@ Get the total number of aliases in the knowledge base.
 > total_aliases = kb.get_size_aliases()
 > ```
 
-| Name        | Type | Description                                  |
-| ----------- | ---- | -------------------------------------------- |
-| **RETURNS** | int  | The number of aliases in the knowledge base. |
+| Name        | Description                                          |
+| ----------- | ---------------------------------------------------- |
+| **RETURNS** | The number of aliases in the knowledge base. ~~int~~ |
 
 ## KnowledgeBase.get_alias_strings {#get_alias_strings tag="method"}
 
@@ -148,14 +148,14 @@ Get a list of all aliases in the knowledge base.
 > all_aliases = kb.get_alias_strings()
 > ```
 
-| Name        | Type | Description                                |
-| ----------- | ---- | ------------------------------------------ |
-| **RETURNS** | list | The list of aliases in the knowledge base. |
+| Name        | Description                                              |
+| ----------- | -------------------------------------------------------- |
+| **RETURNS** | The list of aliases in the knowledge base. ~~List[str]~~ |
 
 ## KnowledgeBase.get_candidates {#get_candidates tag="method"}
 
 Given a certain textual mention as input, retrieve a list of candidate entities
-of type [`Candidate`](/api/kb/#candidate_init).
+of type [`Candidate`](/api/kb/#candidate).
 
 > #### Example
 >
@@ -163,10 +163,10 @@ of type [`Candidate`](/api/kb/#candidate_init).
 > candidates = kb.get_candidates("Douglas")
 > ```
 
-| Name        | Type     | Description                              |
-| ----------- | -------- | ---------------------------------------- |
-| `alias`     | str      | The textual mention or alias             |
-| **RETURNS** | iterable | The list of relevant `Candidate` objects |
+| Name        | Description                                                   |
+| ----------- | ------------------------------------------------------------- |
+| `alias`     | The textual mention or alias. ~~str~~                         |
+| **RETURNS** | The list of relevant `Candidate` objects. ~~List[Candidate]~~ |
 
 ## KnowledgeBase.get_vector {#get_vector tag="method"}
 
@@ -178,10 +178,10 @@ Given a certain entity ID, retrieve its pretrained entity vector.
 > vector = kb.get_vector("Q42")
 > ```
 
-| Name        | Type   | Description       |
-| ----------- | ------ | ----------------- |
-| `entity`    | str    | The entity ID     |
-| **RETURNS** | vector | The entity vector |
+| Name        | Description                          |
+| ----------- | ------------------------------------ |
+| `entity`    | The entity ID. ~~str~~               |
+| **RETURNS** | The entity vector. ~~numpy.ndarray~~ |
 
 ## KnowledgeBase.get_prior_prob {#get_prior_prob tag="method"}
 
@@ -194,27 +194,27 @@ probability of the fact that the mention links to the entity ID.
 > probability = kb.get_prior_prob("Q42", "Douglas")
 > ```
 
-| Name        | Type  | Description                                                    |
-| ----------- | ----- | -------------------------------------------------------------- |
-| `entity`    | str   | The entity ID                                                  |
-| `alias`     | str   | The textual mention or alias                                   |
-| **RETURNS** | float | The prior probability of the `alias` referring to the `entity` |
+| Name        | Description                                                               |
+| ----------- | ------------------------------------------------------------------------- |
+| `entity`    | The entity ID. ~~str~~                                                    |
+| `alias`     | The textual mention or alias. ~~str~~                                     |
+| **RETURNS** | The prior probability of the `alias` referring to the `entity`. ~~float~~ |
 
-## KnowledgeBase.dump {#dump tag="method"}
+## KnowledgeBase.to_disk {#to_disk tag="method"}
 
 Save the current state of the knowledge base to a directory.
 
 > #### Example
 >
 > ```python
-> kb.dump(loc)
+> kb.to_disk(loc)
 > ```
 
-| Name  | Type         | Description                                                                                                           |
-| ----- | ------------ | --------------------------------------------------------------------------------------------------------------------- |
-| `loc` | str / `Path` | A path to a directory, which will be created if it doesn't exist. Paths may be either strings or `Path`-like objects. |
+| Name  | Description                                                                                                                                |
+| ----- | ------------------------------------------------------------------------------------------------------------------------------------------ |
+| `loc` | A path to a directory, which will be created if it doesn't exist. Paths may be either strings or `Path`-like objects. ~~Union[str, Path]~~ |
 
-## KnowledgeBase.load_bulk {#load_bulk tag="method"}
+## KnowledgeBase.from_disk {#from_disk tag="method"}
 
 Restore the state of the knowledge base from a given directory. Note that the
 [`Vocab`](/api/vocab) should also be the same as the one used to create the KB.
@@ -226,15 +226,23 @@ Restore the state of the knowledge base from a given directory. Note that the
 > from spacy.vocab import Vocab
 > vocab = Vocab().from_disk("/path/to/vocab")
 > kb = KnowledgeBase(vocab=vocab, entity_vector_length=64)
-> kb.load_bulk("/path/to/kb")
+> kb.from_disk("/path/to/kb")
 > ```
 
-| Name        | Type            | Description                                                                |
-| ----------- | --------------- | -------------------------------------------------------------------------- |
-| `loc`       | str / `Path`    | A path to a directory. Paths may be either strings or `Path`-like objects. |
-| **RETURNS** | `KnowledgeBase` | The modified `KnowledgeBase` object.                                       |
+| Name        | Description                                                                                     |
+| ----------- | ----------------------------------------------------------------------------------------------- |
+| `loc`       | A path to a directory. Paths may be either strings or `Path`-like objects. ~~Union[str, Path]~~ |
+| **RETURNS** | The modified `KnowledgeBase` object. ~~KnowledgeBase~~                                          |
 
-## Candidate.\_\_init\_\_ {#candidate_init tag="method"}
+## Candidate {#candidate tag="class"}
 
+A `Candidate` object refers to a textual mention (alias) that may or may not be
+resolved to a specific entity from a `KnowledgeBase`. This will be used as input
+for the entity linking algorithm which will disambiguate the various candidates
+to the correct one. Each candidate `(alias, entity)` pair is assigned to a
+certain prior probability.
+
+### Candidate.\_\_init\_\_ {#candidate-init tag="method"}
+
 Construct a `Candidate` object. Usually this constructor is not called directly,
 but instead these objects are returned by the
@@ -247,22 +255,22 @@ but instead these objects are returned by the
 > candidate = Candidate(kb, entity_hash, entity_freq, entity_vector, alias_hash, prior_prob)
 > ```
 
-| Name          | Type            | Description                                                    |
-| ------------- | --------------- | -------------------------------------------------------------- |
-| `kb`          | `KnowledgeBase` | The knowledge base that defined this candidate.                |
-| `entity_hash` | int             | The hash of the entity's KB ID.                                |
-| `entity_freq` | float           | The entity frequency as recorded in the KB.                    |
-| `alias_hash`  | int             | The hash of the textual mention or alias.                      |
-| `prior_prob`  | float           | The prior probability of the `alias` referring to the `entity` |
+| Name          | Description                                                               |
+| ------------- | ------------------------------------------------------------------------- |
+| `kb`          | The knowledge base that defined this candidate. ~~KnowledgeBase~~         |
+| `entity_hash` | The hash of the entity's KB ID. ~~int~~                                   |
+| `entity_freq` | The entity frequency as recorded in the KB. ~~float~~                     |
+| `alias_hash`  | The hash of the textual mention or alias. ~~int~~                         |
+| `prior_prob`  | The prior probability of the `alias` referring to the `entity`. ~~float~~ |
 
-## Candidate attributes {#candidate_attributes}
+## Candidate attributes {#candidate-attributes}
 
-| Name            | Type   | Description                                                    |
-| --------------- | ------ | -------------------------------------------------------------- |
-| `entity`        | int    | The entity's unique KB identifier                              |
-| `entity_`       | str    | The entity's unique KB identifier                              |
-| `alias`         | int    | The alias or textual mention                                   |
-| `alias_`        | str    | The alias or textual mention                                   |
-| `prior_prob`    | long   | The prior probability of the `alias` referring to the `entity` |
-| `entity_freq`   | long   | The frequency of the entity in a typical corpus                |
-| `entity_vector` | vector | The pretrained vector of the entity                            |
+| Name            | Description                                                              |
+| --------------- | ------------------------------------------------------------------------ |
+| `entity`        | The entity's unique KB identifier. ~~int~~                               |
+| `entity_`       | The entity's unique KB identifier. ~~str~~                               |
+| `alias`         | The alias or textual mention. ~~int~~                                    |
+| `alias_`        | The alias or textual mention. ~~str~~                                    |
+| `prior_prob`    | The prior probability of the `alias` referring to the `entity`. ~~long~~ |
+| `entity_freq`   | The frequency of the entity in a typical corpus. ~~long~~                |
+| `entity_vector` | The pretrained vector of the entity. ~~numpy.ndarray~~                   |
@@ -32,15 +32,15 @@ Initialize a `Language` object.
 > nlp = Language(Vocab())
 > ```
 
-| Name               | Type       | Description                                                                                |
-| ------------------ | ---------- | ------------------------------------------------------------------------------------------ |
-| `vocab`            | `Vocab`    | A `Vocab` object. If `True`, a vocab is created using the default language data settings.  |
-| _keyword-only_     |            |                                                                                            |
-| `max_length`       | int        | Maximum number of characters allowed in a single text. Defaults to `10 ** 6`.              |
-| `meta`             | dict       | Custom meta data for the `Language` class. Is written to by models to add model meta data. |
-| `create_tokenizer` | `Callable` | Optional function that receives the `nlp` object and returns a tokenizer.                  |
+| Name               | Description                                                                                                              |
+| ------------------ | ------------------------------------------------------------------------------------------------------------------------ |
+| `vocab`            | A `Vocab` object. If `True`, a vocab is created using the default language data settings. ~~Vocab~~                      |
+| _keyword-only_     |                                                                                                                          |
+| `max_length`       | Maximum number of characters allowed in a single text. Defaults to `10 ** 6`. ~~int~~                                    |
+| `meta`             | Custom meta data for the `Language` class. Is written to by models to add model meta data. ~~dict~~                      |
+| `create_tokenizer` | Optional function that receives the `nlp` object and returns a tokenizer. ~~Callable[[Language], Callable[[str], Doc]]~~ |
 
-## Language.from_config {#from_config tag="classmethod"}
+## Language.from_config {#from_config tag="classmethod" new="3"}
 
 Create a `Language` object from a loaded config. Will set up the tokenizer and
 language data, add pipeline components based on the pipeline and components
@@ -58,14 +58,14 @@ model under the hood based on its [`config.cfg`](/api/data-formats#config).
 > nlp = Language.from_config(config)
 > ```
 
-| Name           | Type                                                                   | Description                                                                                                                             |
-| -------------- | ---------------------------------------------------------------------- | --------------------------------------------------------------------------------------------------------------------------------------- |
-| `config`       | `Dict[str, Any]` / [`Config`](https://thinc.ai/docs/api-config#config) | The loaded config.                                                                                                                      |
-| _keyword-only_ |                                                                        |                                                                                                                                         |
-| `disable`      | `Iterable[str]`                                                        | List of pipeline component names to disable.                                                                                            |
-| `auto_fill`    | bool                                                                   | Whether to automatically fill in missing values in the config, based on defaults and function argument annotations. Defaults to `True`. |
-| `validate`     | bool                                                                   | Whether to validate the component config and arguments against the types expected by the factory. Defaults to `True`.                   |
-| **RETURNS**    | `Language`                                                             | The initialized object.                                                                                                                 |
+| Name           | Description                                                                                                                                      |
+| -------------- | ------------------------------------------------------------------------------------------------------------------------------------------------ |
+| `config`       | The loaded config. ~~Union[Dict[str, Any], Config]~~                                                                                             |
+| _keyword-only_ |                                                                                                                                                  |
+| `disable`      | List of pipeline component names to disable. ~~Iterable[str]~~                                                                                   |
+| `auto_fill`    | Whether to automatically fill in missing values in the config, based on defaults and function argument annotations. Defaults to `True`. ~~bool~~ |
+| `validate`     | Whether to validate the component config and arguments against the types expected by the factory. Defaults to `True`. ~~bool~~                   |
+| **RETURNS**    | The initialized object. ~~Language~~                                                                                                             |
 
 ## Language.component {#component tag="classmethod" new="3"}
 
@@ -94,16 +94,14 @@ decorator. For more details and examples, see the
 > Language.component("my_component2", func=my_component)
 > ```
 
-| Name                    | Type                 | Description                                                                                                                                                                                                                 |
-| ----------------------- | -------------------- | --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
-| `name`                  | str                  | The name of the component factory.                                                                                                                                                                                          |
-| _keyword-only_          |                      |                                                                                                                                                                                                                             |
-| `assigns`               | `Iterable[str]`      | `Doc` or `Token` attributes assigned by this component, e.g. `["token.ent_id"]`. Used for [pipe analysis](/usage/processing-pipelines#analysis)..                                                                            |
-| `requires`              | `Iterable[str]`      | `Doc` or `Token` attributes required by this component, e.g. `["token.ent_id"]`. Used for [pipe analysis](/usage/processing-pipelines#analysis).                                                                             |
-| `retokenizes`           | bool                 | Whether the component changes tokenization. Used for [pipe analysis](/usage/processing-pipelines#analysis).                                                                                                                 |
-| `scores`                | `Iterable[str]`      | All scores set by the components if it's trainable, e.g. `["ents_f", "ents_r", "ents_p"]`. Used for [pipe analysis](/usage/processing-pipelines#analysis).                                                                   |
-| `default_score_weights` | `Dict[str, float]`   | The scores to report during training, and their default weight towards the final score used to select the best model. Weights should sum to `1.0` per component and will be combined and normalized for the whole pipeline. |
-| `func`                  | `Optional[Callable]` | Optional function if not used a a decorator.                                                                                                                                                                                |
+| Name           | Description                                                                                                                                                        |
+| -------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------ |
+| `name`         | The name of the component factory. ~~str~~                                                                                                                         |
+| _keyword-only_ |                                                                                                                                                                    |
+| `assigns`      | `Doc` or `Token` attributes assigned by this component, e.g. `["token.ent_id"]`. Used for [pipe analysis](/usage/processing-pipelines#analysis). ~~Iterable[str]~~  |
+| `requires`     | `Doc` or `Token` attributes required by this component, e.g. `["token.ent_id"]`. Used for [pipe analysis](/usage/processing-pipelines#analysis). ~~Iterable[str]~~  |
+| `retokenizes`  | Whether the component changes tokenization. Used for [pipe analysis](/usage/processing-pipelines#analysis). ~~bool~~                                                |
+| `func`         | Optional function if not used as a decorator. ~~Optional[Callable[[Doc], Doc]]~~                                                                                   |
 
 ## Language.factory {#factory tag="classmethod"}
 
@@ -141,17 +139,17 @@ examples, see the
 > )
 > ```
 
-| Name                    | Type                 | Description                                                                                                                                                                                                                 |
-| ----------------------- | -------------------- | --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
-| `name`                  | str                  | The name of the component factory.                                                                                                                                                                                          |
-| _keyword-only_          |                      |                                                                                                                                                                                                                             |
-| `default_config`        | `Dict[str, any]`     | The default config, describing the default values of the factory arguments.                                                                                                                                                 |
-| `assigns`               | `Iterable[str]`      | `Doc` or `Token` attributes assigned by this component, e.g. `["token.ent_id"]`. Used for [pipe analysis](/usage/processing-pipelines#analysis).                                                                            |
-| `requires`              | `Iterable[str]`      | `Doc` or `Token` attributes required by this component, e.g. `["token.ent_id"]`. Used for [pipe analysis](/usage/processing-pipelines#analysis).                                                                             |
-| `retokenizes`           | bool                 | Whether the component changes tokenization. Used for [pipe analysis](/usage/processing-pipelines#analysis).                                                                                                                 |
-| `scores`                | `Iterable[str]`      | All scores set by the components if it's trainable, e.g. `["ents_f", "ents_r", "ents_p"]`. Used for [pipe analysis](/usage/processing-pipelines#analysis).                                                                   |
-| `default_score_weights` | `Dict[str, float]`   | The scores to report during training, and their default weight towards the final score used to select the best model. Weights should sum to `1.0` per component and will be combined and normalized for the whole pipeline. |
+| Name                    | Description                                                                                                                                                                  |
+| ----------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ |
+| `name`                  | The name of the component factory. ~~str~~                                                                                                                                   |
+| _keyword-only_          |                                                                                                                                                                              |
+| `default_config`        | The default config, describing the default values of the factory arguments. ~~Dict[str, Any]~~                                                                               |
+| `assigns`               | `Doc` or `Token` attributes assigned by this component, e.g. `["token.ent_id"]`. Used for [pipe analysis](/usage/processing-pipelines#analysis). ~~Iterable[str]~~            |
+| `requires`              | `Doc` or `Token` attributes required by this component, e.g. `["token.ent_id"]`. Used for [pipe analysis](/usage/processing-pipelines#analysis). ~~Iterable[str]~~            |
+| `retokenizes`           | Whether the component changes tokenization. Used for [pipe analysis](/usage/processing-pipelines#analysis). ~~bool~~                                                          |
+| `scores`                | All scores set by the components if it's trainable, e.g. `["ents_f", "ents_r", "ents_p"]`. Used for [pipe analysis](/usage/processing-pipelines#analysis). ~~Iterable[str]~~ |
| `default_score_weights` | `Dict[str, float]` | The scores to report during training, and their default weight towards the final score used to select the best model. Weights should sum to `1.0` per component and will be combined and normalized for the whole pipeline. |
|
| `default_score_weights` | The scores to report during training, and their default weight towards the final score used to select the best model. Weights should sum to `1.0` per component and will be combined and normalized for the whole pipeline. ~~Dict[str, float]~~ |
|
||||||
| `func` | `Optional[Callable]` | Optional function if not used a a decorator. |
|
| `func` | Optional function if not used a a decorator. ~~Optional[Callable[[...], Callable[[Doc], Doc]]]~~ |
|
||||||
|
|
||||||
 ## Language.\_\_call\_\_ {#call tag="method"}

@@ -165,13 +163,13 @@ contain arbitrary whitespace. Alignment into the original string is preserved.
 > assert (doc[0].text, doc[0].head.tag_) == ("An", "NN")
 > ```

-| Name | Type | Description |
+| Name | Description |
-| --------------- | ----------------- | ------------------------------------------------------------ |
+| --------------- | ------------------------------------------------------------ |
-| `text` | str | The text to be processed. |
+| `text` | The text to be processed. ~~str~~ |
-| _keyword-only_ | | |
+| _keyword-only_ | |
-| `disable` | `List[str]` | Names of pipeline components to [disable](/usage/processing-pipelines#disabling). |
+| `disable` | Names of pipeline components to [disable](/usage/processing-pipelines#disabling). ~~List[str]~~ |
-| `component_cfg` | `Dict[str, dict]` | Optional dictionary of keyword arguments for components, keyed by component names. Defaults to `None`. |
+| `component_cfg` | Optional dictionary of keyword arguments for components, keyed by component names. Defaults to `None`. ~~Optional[Dict[str, Dict[str, Any]]]~~ |
-| **RETURNS** | [`Doc`](/api/doc) | A container for accessing the annotations. |
+| **RETURNS** | A container for accessing the annotations. ~~Doc~~ |
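Not part of the diff above: a minimal, hedged sketch of calling the pipeline as documented, assuming spaCy v3 is installed. `spacy.blank` creates a tokenizer-only pipeline, so no trained model needs to be downloaded:

```python
import spacy

# Tokenizer-only pipeline: calling it runs no trained components.
nlp = spacy.blank("en")

doc = nlp("This is a sentence.")
print(len(doc))                       # 5 tokens
print([token.text for token in doc])  # ['This', 'is', 'a', 'sentence', '.']
```

With a loaded model, the same call would also run the model's components (tagger, parser, NER, ...) over the returned `Doc`.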
 ## Language.pipe {#pipe tag="method"}

@@ -186,17 +184,17 @@ more efficient than processing texts one-by-one.
 > assert doc.is_parsed
 > ```

-| Name | Type | Description |
+| Name | Description |
-| ------------------------------------------ | ----------------- | ------------------------------------------------------------ |
+| ------------------------------------------ | ------------------------------------------------------------ |
-| `texts` | `Iterable[str]` | A sequence of strings. |
+| `texts` | A sequence of strings. ~~Iterable[str]~~ |
-| _keyword-only_ | | |
+| _keyword-only_ | |
-| `as_tuples` | bool | If set to `True`, inputs should be a sequence of `(text, context)` tuples. Output will then be a sequence of `(doc, context)` tuples. Defaults to `False`. |
+| `as_tuples` | If set to `True`, inputs should be a sequence of `(text, context)` tuples. Output will then be a sequence of `(doc, context)` tuples. Defaults to `False`. ~~bool~~ |
-| `batch_size` | int | The number of texts to buffer. |
+| `batch_size` | The number of texts to buffer. ~~int~~ |
-| `disable` | `List[str]` | Names of pipeline components to [disable](/usage/processing-pipelines#disabling). |
+| `disable` | Names of pipeline components to [disable](/usage/processing-pipelines#disabling). ~~List[str]~~ |
-| `cleanup` | bool | If `True`, unneeded strings are freed to control memory use. Experimental. |
+| `cleanup` | If `True`, unneeded strings are freed to control memory use. Experimental. ~~bool~~ |
-| `component_cfg` | `Dict[str, dict]` | Optional dictionary of keyword arguments for components, keyed by component names. Defaults to `None`. |
+| `component_cfg` | Optional dictionary of keyword arguments for components, keyed by component names. Defaults to `None`. ~~Optional[Dict[str, Dict[str, Any]]]~~ |
-| `n_process` <Tag variant="new">2.2.2</Tag> | int | Number of processors to use, only supported in Python 3. Defaults to `1`. |
+| `n_process` <Tag variant="new">2.2.2</Tag> | Number of processors to use. Defaults to `1`. ~~int~~ |
-| **YIELDS** | `Doc` | Documents in the order of the original text. |
+| **YIELDS** | Documents in the order of the original text. ~~Doc~~ |
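Not part of the diff above: a hedged sketch of the `as_tuples` behavior described in the table, assuming spaCy v3 is installed. The `{"id": ...}` context dicts are made up for illustration; context can be any object:

```python
import spacy

nlp = spacy.blank("en")

# With as_tuples=True the input is (text, context) pairs and the output is
# (doc, context) pairs, which is handy for carrying IDs through the stream.
data = [("First text.", {"id": 1}), ("Second text.", {"id": 2})]
for doc, context in nlp.pipe(data, as_tuples=True):
    print(context["id"], doc.text)
```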
 ## Language.begin_training {#begin_training tag="method"}

@@ -225,12 +223,12 @@ tuples of `Doc` and `GoldParse` objects.
 > optimizer = nlp.begin_training(get_examples)
 > ```

-| Name | Type | Description |
+| Name | Description |
-| -------------- | --------------------------------------------------- | ------------------------------------------------------------ |
+| -------------- | ------------------------------------------------------------ |
-| `get_examples` | `Callable[[], Iterable[Example]]` | Optional function that returns gold-standard annotations in the form of [`Example`](/api/example) objects. |
+| `get_examples` | Optional function that returns gold-standard annotations in the form of [`Example`](/api/example) objects. ~~Optional[Callable[[], Iterable[Example]]]~~ |
-| _keyword-only_ | | |
+| _keyword-only_ | |
-| `sgd` | [`Optimizer`](https://thinc.ai/docs/api-optimizers) | An optional optimizer. Will be created via [`create_optimizer`](/api/language#create_optimizer) if not set. |
+| `sgd` | An optimizer. Will be created via [`create_optimizer`](#create_optimizer) if not set. ~~Optional[Optimizer]~~ |
-| **RETURNS** | [`Optimizer`](https://thinc.ai/docs/api-optimizers) | The optimizer. |
+| **RETURNS** | The optimizer. ~~Optimizer~~ |
 ## Language.resume_training {#resume_training tag="method,experimental" new="3"}

@@ -248,11 +246,11 @@ a batch of [Example](/api/example) objects.
 > nlp.rehearse(examples, sgd=optimizer)
 > ```

-| Name | Type | Description |
+| Name | Description |
-| -------------- | --------------------------------------------------- | ------------------------------------------------------------ |
+| -------------- | ------------------------------------------------------------ |
-| _keyword-only_ | | |
+| _keyword-only_ | |
-| `sgd` | [`Optimizer`](https://thinc.ai/docs/api-optimizers) | An optional optimizer. Will be created via [`create_optimizer`](/api/language#create_optimizer) if not set. |
+| `sgd` | An optimizer. Will be created via [`create_optimizer`](#create_optimizer) if not set. ~~Optional[Optimizer]~~ |
-| **RETURNS** | [`Optimizer`](https://thinc.ai/docs/api-optimizers) | The optimizer. |
+| **RETURNS** | The optimizer. ~~Optimizer~~ |
 ## Language.update {#update tag="method"}

@@ -282,15 +280,15 @@ and custom registered functions if needed. See the
 > nlp.update([example], sgd=optimizer)
 > ```

-| Name | Type | Description |
+| Name | Description |
-| --------------- | --------------------------------------------------- | ------------------------------------------------------------ |
+| --------------- | ------------------------------------------------------------ |
-| `examples` | `Iterable[Example]` | A batch of [`Example`](/api/example) objects to learn from. |
+| `examples` | A batch of [`Example`](/api/example) objects to learn from. ~~Iterable[Example]~~ |
-| _keyword-only_ | | |
+| _keyword-only_ | |
-| `drop` | float | The dropout rate. |
+| `drop` | The dropout rate. ~~float~~ |
-| `sgd` | [`Optimizer`](https://thinc.ai/docs/api-optimizers) | The optimizer. |
+| `sgd` | An optimizer. Will be created via [`create_optimizer`](#create_optimizer) if not set. ~~Optional[Optimizer]~~ |
-| `losses` | `Dict[str, float]` | Dictionary to update with the loss, keyed by pipeline component. |
+| `losses` | Dictionary to update with the loss, keyed by pipeline component. ~~Optional[Dict[str, float]]~~ |
-| `component_cfg` | `Dict[str, dict]` | Optional dictionary of keyword arguments for components, keyed by component names. Defaults to `None`. |
+| `component_cfg` | Optional dictionary of keyword arguments for components, keyed by component names. Defaults to `None`. ~~Optional[Dict[str, Dict[str, Any]]]~~ |
-| **RETURNS** | `Dict[str, float]` | The updated `losses` dictionary. |
+| **RETURNS** | The updated `losses` dictionary. ~~Dict[str, float]~~ |
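Not part of the diff above: a hedged sketch of a training loop combining `begin_training` and `update`, assuming spaCy v3 is installed. The `textcat` labels and toy texts are made up for illustration, and note that `begin_training` was later renamed `nlp.initialize` in stable v3 releases:

```python
import random

import spacy
from spacy.training import Example

nlp = spacy.blank("en")
textcat = nlp.add_pipe("textcat")
textcat.add_label("POSITIVE")
textcat.add_label("NEGATIVE")

train_data = [
    ("I loved it", {"cats": {"POSITIVE": 1.0, "NEGATIVE": 0.0}}),
    ("I hated it", {"cats": {"POSITIVE": 0.0, "NEGATIVE": 1.0}}),
]
examples = [Example.from_dict(nlp.make_doc(t), ann) for t, ann in train_data]

# Deprecated alias of nlp.initialize in later releases; returns the optimizer.
optimizer = nlp.begin_training(lambda: examples)
losses = {}
for _ in range(5):
    random.shuffle(examples)
    nlp.update(examples, sgd=optimizer, losses=losses)
print(losses)  # accumulated loss keyed by component, e.g. {"textcat": ...}
```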
 ## Language.rehearse {#rehearse tag="method,experimental" new="3"}

@@ -305,14 +303,14 @@ the "catastrophic forgetting" problem. This feature is experimental.
 > losses = nlp.rehearse(examples, sgd=optimizer)
 > ```

-| Name | Type | Description |
+| Name | Description |
-| -------------- | --------------------------------------------------- | ------------------------------------------------------------ |
+| -------------- | ------------------------------------------------------------ |
-| `examples` | `Iterable[Example]` | A batch of [`Example`](/api/example) objects to learn from. |
+| `examples` | A batch of [`Example`](/api/example) objects to learn from. ~~Iterable[Example]~~ |
-| _keyword-only_ | | |
+| _keyword-only_ | |
-| `drop` | float | The dropout rate. |
+| `drop` | The dropout rate. ~~float~~ |
-| `sgd` | [`Optimizer`](https://thinc.ai/docs/api-optimizers) | The optimizer. |
+| `sgd` | An optimizer. Will be created via [`create_optimizer`](#create_optimizer) if not set. ~~Optional[Optimizer]~~ |
-| `losses` | `Dict[str, float]` | Optional record of the loss during training. Updated using the component name as the key. |
+| `losses` | Dictionary to update with the loss, keyed by pipeline component. ~~Optional[Dict[str, float]]~~ |
-| **RETURNS** | `Dict[str, float]` | The updated `losses` dictionary. |
+| **RETURNS** | The updated `losses` dictionary. ~~Dict[str, float]~~ |
 ## Language.evaluate {#evaluate tag="method"}

@@ -328,20 +326,19 @@ objects instead of tuples of `Doc` and `GoldParse` objects.
 > #### Example
 >
 > ```python
-> scores = nlp.evaluate(examples, verbose=True)
+> scores = nlp.evaluate(examples)
 > print(scores)
 > ```

-| Name | Type | Description |
+| Name | Description |
-| --------------- | ------------------------------- | ------------------------------------------------------------ |
+| --------------- | ------------------------------------------------------------ |
-| `examples` | `Iterable[Example]` | A batch of [`Example`](/api/example) objects to learn from. |
+| `examples` | A batch of [`Example`](/api/example) objects to learn from. ~~Iterable[Example]~~ |
-| _keyword-only_ | | |
+| _keyword-only_ | |
-| `verbose` | bool | Print debugging information. |
-| `batch_size` | int | The batch size to use. |
+| `batch_size` | The batch size to use. ~~int~~ |
-| `scorer` | `Scorer` | Optional [`Scorer`](/api/scorer) to use. If not passed in, a new one will be created. |
+| `scorer` | Optional [`Scorer`](/api/scorer) to use. If not passed in, a new one will be created. ~~Optional[Scorer]~~ |
-| `component_cfg` | `Dict[str, dict]` | Optional dictionary of keyword arguments for components, keyed by component names. Defaults to `None`. |
+| `component_cfg` | Optional dictionary of keyword arguments for components, keyed by component names. Defaults to `None`. ~~Optional[Dict[str, Dict[str, Any]]]~~ |
-| `scorer_cfg` | `Dict[str, Any]` | Optional dictionary of keyword arguments for the `Scorer`. Defaults to `None`. |
+| `scorer_cfg` | Optional dictionary of keyword arguments for the `Scorer`. Defaults to `None`. ~~Optional[Dict[str, Any]]~~ |
-| **RETURNS** | `Dict[str, Union[float, dict]]` | A dictionary of evaluation scores. |
+| **RETURNS** | A dictionary of evaluation scores. ~~Dict[str, Union[float, Dict[str, float]]]~~ |
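Not part of the diff above: a hedged sketch of the `evaluate` call shape, assuming spaCy v3 is installed. With only a tokenizer in the pipeline the returned scores are limited (essentially tokenization accuracy), but the `Example`-based input format is the same as for a trained pipeline:

```python
import spacy
from spacy.training import Example

nlp = spacy.blank("en")

# Gold annotations here only specify the reference tokenization.
examples = [
    Example.from_dict(nlp.make_doc("Hello world."),
                      {"words": ["Hello", "world", "."]}),
]
scores = nlp.evaluate(examples, batch_size=8)
print(sorted(scores))  # score names, e.g. token_acc and friends
```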
 ## Language.use_params {#use_params tag="contextmanager, method"}

@@ -356,9 +353,9 @@ their original weights after the block.
 > nlp.to_disk("/tmp/checkpoint")
 > ```

-| Name | Type | Description |
+| Name | Description |
-| -------- | ---- | --------------------------------------------- |
+| -------- | ------------------------------------------------------ |
-| `params` | dict | A dictionary of parameters keyed by model ID. |
+| `params` | A dictionary of parameters keyed by model ID. ~~dict~~ |
 ## Language.create_pipe {#create_pipe tag="method" new="2"}

@@ -380,14 +377,14 @@ To create a component and add it to the pipeline, you should always use
 > parser = nlp.create_pipe("parser")
 > ```

-| Name | Type | Description |
+| Name | Description |
-| ------------------------------------- | ---------------- | ------------------------------------------------------------ |
+| ------------------------------------- | ------------------------------------------------------------ |
-| `factory_name` | str | Name of the registered component factory. |
+| `factory_name` | Name of the registered component factory. ~~str~~ |
-| `name` | str | Optional unique name of pipeline component instance. If not set, the factory name is used. An error is raised if the name already exists in the pipeline. |
+| `name` | Optional unique name of pipeline component instance. If not set, the factory name is used. An error is raised if the name already exists in the pipeline. ~~Optional[str]~~ |
-| _keyword-only_ | | |
+| _keyword-only_ | |
-| `config` <Tag variant="new">3</Tag> | `Dict[str, Any]` | Optional config parameters to use for this component. Will be merged with the `default_config` specified by the component factory. |
+| `config` <Tag variant="new">3</Tag> | Optional config parameters to use for this component. Will be merged with the `default_config` specified by the component factory. ~~Optional[Dict[str, Any]]~~ |
-| `validate` <Tag variant="new">3</Tag> | bool | Whether to validate the component config and arguments against the types expected by the factory. Defaults to `True`. |
+| `validate` <Tag variant="new">3</Tag> | Whether to validate the component config and arguments against the types expected by the factory. Defaults to `True`. ~~bool~~ |
-| **RETURNS** | callable | The pipeline component. |
+| **RETURNS** | The pipeline component. ~~Callable[[Doc], Doc]~~ |
 ## Language.add_pipe {#add_pipe tag="method" new="2"}

@@ -423,19 +420,19 @@ component, adds it to the pipeline and returns it.
 > nlp.add_pipe("ner", source=source_nlp)
 > ```

-| Name | Type | Description |
+| Name | Description |
-| -------------------------------------- | ---------------- | ------------------------------------------------------------ |
+| ------------------------------------- | ------------------------------------------------------------ |
-| `factory_name` | str | Name of the registered component factory. |
+| `factory_name` | Name of the registered component factory. ~~str~~ |
-| `name` | str | Optional unique name of pipeline component instance. If not set, the factory name is used. An error is raised if the name already exists in the pipeline. |
+| `name` | Optional unique name of pipeline component instance. If not set, the factory name is used. An error is raised if the name already exists in the pipeline. ~~Optional[str]~~ |
-| _keyword-only_ | | |
+| _keyword-only_ | |
-| `before` | str / int | Component name or index to insert component directly before. |
+| `before` | Component name or index to insert component directly before. ~~Optional[Union[str, int]]~~ |
-| `after` | str / int | Component name or index to insert component directly after: |
+| `after` | Component name or index to insert component directly after. ~~Optional[Union[str, int]]~~ |
-| `first` | bool | Insert component first / not first in the pipeline. |
+| `first` | Insert component first / not first in the pipeline. ~~Optional[bool]~~ |
-| `last` | bool | Insert component last / not last in the pipeline. |
+| `last` | Insert component last / not last in the pipeline. ~~Optional[bool]~~ |
-| `config` <Tag variant="new">3</Tag> | `Dict[str, Any]` | Optional config parameters to use for this component. Will be merged with the `default_config` specified by the component factory. |
+| `config` <Tag variant="new">3</Tag> | Optional config parameters to use for this component. Will be merged with the `default_config` specified by the component factory. ~~Optional[Dict[str, Any]]~~ |
-| `source` <Tag variant="new">3</Tag> | `Language` | Optional source model to copy component from. If a source is provided, the `factory_name` is interpreted as the name of the component in the source pipeline. Make sure that the vocab, vectors and settings of the source model match the target model. |
+| `source` <Tag variant="new">3</Tag> | Optional source model to copy component from. If a source is provided, the `factory_name` is interpreted as the name of the component in the source pipeline. Make sure that the vocab, vectors and settings of the source model match the target model. ~~Optional[Language]~~ |
-| `validate` <Tag variant="new">3</Tag> | bool | Whether to validate the component config and arguments against the types expected by the factory. Defaults to `True`. |
+| `validate` <Tag variant="new">3</Tag> | Whether to validate the component config and arguments against the types expected by the factory. Defaults to `True`. ~~bool~~ |
-| **RETURNS** <Tag variant="new">3</Tag> | callable | The pipeline component. |
+| **RETURNS** | The pipeline component. ~~Callable[[Doc], Doc]~~ |
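Not part of the diff above: a hedged sketch of the v3 `add_pipe` behavior described in the table, assuming spaCy v3 is installed. Components are added by the string name of a registered factory, and placement is controlled with `first`/`last`/`before`/`after`:

```python
import spacy

nlp = spacy.blank("en")

# Add built-in components by factory name; add_pipe returns the instance.
nlp.add_pipe("sentencizer")
ruler = nlp.add_pipe("entity_ruler", first=True)

print(nlp.pipe_names)  # ['entity_ruler', 'sentencizer']
```

Passing a callable here raises an error in v3; custom components must first be registered with `@Language.component` or `@Language.factory`.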
 ## Language.has_factory {#has_factory tag="classmethod" new="3"}

@@ -459,10 +456,10 @@ the `Language` base class, available to all subclasses.
 > assert not Language.has_factory("component")
 > ```

-| Name | Type | Description |
+| Name | Description |
-| ----------- | ---- | ---------------------------------------------------------- |
+| ----------- | ------------------------------------------------------------------- |
-| `name` | str | Name of the pipeline factory to check. |
+| `name` | Name of the pipeline factory to check. ~~str~~ |
-| **RETURNS** | bool | Whether a factory of that name is registered on the class. |
+| **RETURNS** | Whether a factory of that name is registered on the class. ~~bool~~ |
 ## Language.has_pipe {#has_pipe tag="method" new="2"}

@@ -481,10 +478,10 @@ Check whether a component is present in the pipeline. Equivalent to
 > assert nlp.has_pipe("my_component")
 > ```

-| Name | Type | Description |
+| Name | Description |
-| ----------- | ---- | -------------------------------------------------------- |
+| ----------- | ----------------------------------------------------------------- |
-| `name` | str | Name of the pipeline component to check. |
+| `name` | Name of the pipeline component to check. ~~str~~ |
-| **RETURNS** | bool | Whether a component of that name exists in the pipeline. |
+| **RETURNS** | Whether a component of that name exists in the pipeline. ~~bool~~ |
 ## Language.get_pipe {#get_pipe tag="method" new="2"}

@@ -497,28 +494,37 @@ Get a pipeline component for a given component name.
 > custom_component = nlp.get_pipe("custom_component")
 > ```

-| Name | Type | Description |
+| Name | Description |
-| ----------- | -------- | -------------------------------------- |
+| ----------- | ------------------------------------------------ |
-| `name` | str | Name of the pipeline component to get. |
+| `name` | Name of the pipeline component to get. ~~str~~ |
-| **RETURNS** | callable | The pipeline component. |
+| **RETURNS** | The pipeline component. ~~Callable[[Doc], Doc]~~ |
 ## Language.replace_pipe {#replace_pipe tag="method" new="2"}

 Replace a component in the pipeline.

+<Infobox title="Changed in v3.0" variant="warning">
+
+As of v3.0, the `Language.replace_pipe` method doesn't take callables anymore
+and instead expects the **name of a component factory** registered using
+[`@Language.component`](/api/language#component) or
+[`@Language.factory`](/api/language#factory).
+
+</Infobox>
+
 > #### Example
 >
 > ```python
 > nlp.replace_pipe("parser", my_custom_parser)
 > ```

-| Name | Type | Description |
+| Name | Description |
-| ------------------------------------- | ---------------- | ------------------------------------------------------------ |
+| ------------------------------------- | ------------------------------------------------------------ |
-| `name` | str | Name of the component to replace. |
+| `name` | Name of the component to replace. ~~str~~ |
-| `component` | callable | The pipeline component to insert. |
+| `component` | The factory name of the component to insert. ~~str~~ |
-| _keyword-only_ | | |
+| _keyword-only_ | |
-| `config` <Tag variant="new">3</Tag> | `Dict[str, Any]` | Optional config parameters to use for the new component. Will be merged with the `default_config` specified by the component factory. |
+| `config` <Tag variant="new">3</Tag> | Optional config parameters to use for the new component. Will be merged with the `default_config` specified by the component factory. ~~Optional[Dict[str, Any]]~~ |
-| `validate` <Tag variant="new">3</Tag> | bool | Whether to validate the component config and arguments against the types expected by the factory. Defaults to `True`. |
+| `validate` <Tag variant="new">3</Tag> | Whether to validate the component config and arguments against the types expected by the factory. Defaults to `True`. ~~bool~~ |
 ## Language.rename_pipe {#rename_pipe tag="method" new="2"}

@@ -533,10 +539,10 @@ added to the pipeline, you can also use the `name` argument on
 > nlp.rename_pipe("parser", "spacy_parser")
 > ```

-| Name | Type | Description |
+| Name | Description |
-| ---------- | ---- | -------------------------------- |
+| ---------- | ---------------------------------------- |
-| `old_name` | str | Name of the component to rename. |
+| `old_name` | Name of the component to rename. ~~str~~ |
-| `new_name` | str | New name of the component. |
+| `new_name` | New name of the component. ~~str~~ |
 ## Language.remove_pipe {#remove_pipe tag="method" new="2"}

@@ -550,10 +556,10 @@ component function.
 > assert name == "parser"
 > ```

-| Name | Type | Description |
+| Name | Description |
-| ----------- | ----- | ----------------------------------------------------- |
+| ----------- | ------------------------------------------------------------------------------------------ |
-| `name` | str | Name of the component to remove. |
+| `name` | Name of the component to remove. ~~str~~ |
-| **RETURNS** | tuple | A `(name, component)` tuple of the removed component. |
+| **RETURNS** | A `(name, component)` tuple of the removed component. ~~Tuple[str, Callable[[Doc], Doc]]~~ |
 ## Language.select_pipes {#select_pipes tag="contextmanager, method" new="3"}

@@ -589,12 +595,12 @@ As of spaCy v3.0, the `disable_pipes` method has been renamed to `select_pipes`:
 </Infobox>

-| Name | Type | Description |
+| Name | Description |
-| -------------- | --------------- | ------------------------------------------------------------ |
+| -------------- | ------------------------------------------------------------ |
-| _keyword-only_ | | |
+| _keyword-only_ | |
-| `disable` | str / list | Name(s) of pipeline components to disable. |
+| `disable` | Name(s) of pipeline components to disable. ~~Optional[Union[str, Iterable[str]]]~~ |
-| `enable` | str / list | Names(s) of pipeline components that will not be disabled. |
+| `enable` | Name(s) of pipeline components that will not be disabled. ~~Optional[Union[str, Iterable[str]]]~~ |
-| **RETURNS** | `DisabledPipes` | The disabled pipes that can be restored by calling the object's `.restore()` method. |
+| **RETURNS** | The disabled pipes that can be restored by calling the object's `.restore()` method. ~~DisabledPipes~~ |
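Not part of the diff above: a hedged sketch of `select_pipes` as a context manager, assuming spaCy v3 is installed. Inside the block the disabled components are skipped; on exit they are restored automatically:

```python
import spacy

nlp = spacy.blank("en")
nlp.add_pipe("sentencizer")
nlp.add_pipe("entity_ruler")

# entity_ruler is temporarily removed from the pipeline inside the block.
with nlp.select_pipes(disable=["entity_ruler"]):
    print(nlp.pipe_names)  # ['sentencizer']
print(nlp.pipe_names)      # ['sentencizer', 'entity_ruler']
```

Calling it as a plain method instead returns the `DisabledPipes` object, which stays disabled until you call `.restore()`.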
 ## Language.get_factory_meta {#get_factory_meta tag="classmethod" new="3"}

@@ -613,10 +619,10 @@ information about the component and its default provided by the
 > print(factory_meta.default_config)
 > ```

-| Name | Type | Description |
+| Name | Description |
-| ----------- | ----------------------------- | ------------------ |
+| ----------- | --------------------------------- |
-| `name` | str | The factory name. |
+| `name` | The factory name. ~~str~~ |
-| **RETURNS** | [`FactoryMeta`](#factorymeta) | The factory meta. |
+| **RETURNS** | The factory meta. ~~FactoryMeta~~ |
## Language.get_pipe_meta {#get_pipe_meta tag="method" new="3"}
|
## Language.get_pipe_meta {#get_pipe_meta tag="method" new="3"}
|
||||||
|
|
||||||
|
@@ -636,10 +642,10 @@ contains the information about the component and its default provided by the
|
||||||
> print(factory_meta.default_config)
|
> print(factory_meta.default_config)
|
||||||
> ```
|
> ```
|
||||||
|
|
||||||
| Name | Type | Description |
|
| Name | Description |
|
||||||
| ----------- | ----------------------------- | ---------------------------- |
|
| ----------- | ------------------------------------ |
|
||||||
| `name` | str | The pipeline component name. |
|
| `name` | The pipeline component name. ~~str~~ |
|
||||||
| **RETURNS** | [`FactoryMeta`](#factorymeta) | The factory meta. |
|
| **RETURNS** | The factory meta. ~~FactoryMeta~~ |
|
||||||
|
|
||||||
## Language.analyze_pipes {#analyze_pipes tag="method" new="3"}
|
## Language.analyze_pipes {#analyze_pipes tag="method" new="3"}
|
||||||
|
|
||||||
|
@@ -725,18 +731,18 @@ token.ent_iob, token.ent_type
|
||||||
|
|
||||||
</Accordion>
|
</Accordion>
|
||||||
|
|
||||||
| Name | Type | Description |
|
| Name | Description |
|
||||||
| -------------- | ----------- | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
|
| -------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
|
||||||
| _keyword-only_ | | |
|
| _keyword-only_ | |
|
||||||
| `keys` | `List[str]` | The values to display in the table. Corresponds to attributes of the [`FactoryMeta`](/api/language#factorymeta). Defaults to `["assigns", "requires", "scores", "retokenizes"]`. |
|
| `keys` | The values to display in the table. Corresponds to attributes of the [`FactoryMeta`](/api/language#factorymeta). Defaults to `["assigns", "requires", "scores", "retokenizes"]`. ~~List[str]~~ |
|
||||||
| `pretty` | bool | Pretty-print the results as a table. Defaults to `False`. |
|
| `pretty` | Pretty-print the results as a table. Defaults to `False`. ~~bool~~ |
|
||||||
| **RETURNS** | dict | Dictionary containing the pipe analysis, keyed by `"summary"` (component meta by pipe), `"problems"` (attribute names by pipe) and `"attrs"` (pipes that assign and require an attribute, keyed by attribute). |
|
| **RETURNS** | Dictionary containing the pipe analysis, keyed by `"summary"` (component meta by pipe), `"problems"` (attribute names by pipe) and `"attrs"` (pipes that assign and require an attribute, keyed by attribute). ~~Optional[Dict[str, Any]]~~ |
|
||||||
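The `"problems"` part of the analysis above amounts to checking, for each component, which required attributes no earlier component assigns. A minimal sketch (illustrative only, not spaCy's implementation):

```python
# Illustrative sketch of the "problems" part of a pipe analysis: for each
# component, report required attributes that no *earlier* component assigns.
# Not spaCy's actual implementation.

def analyze_problems(pipeline):
    problems = {}
    assigned = set()
    for name, meta in pipeline:
        problems[name] = [req for req in meta["requires"] if req not in assigned]
        assigned.update(meta["assigns"])
    return problems


pipeline = [
    ("ner", {"assigns": ["doc.ents"], "requires": []}),
    ("entity_linker", {"assigns": ["token.ent_kb_id"],
                       "requires": ["doc.ents", "token.ent_iob"]}),
]
print(analyze_problems(pipeline))
# -> {'ner': [], 'entity_linker': ['token.ent_iob']}
```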
|
|
||||||
## Language.meta {#meta tag="property"}
|
## Language.meta {#meta tag="property"}
|
||||||
|
|
||||||
Custom meta data for the Language class. If a model is loaded, contains meta
|
Custom meta data for the Language class. If a model is loaded, contains meta
|
||||||
data of the model. The `Language.meta` is also what's serialized as the
|
data of the model. The `Language.meta` is also what's serialized as the
|
||||||
`meta.json` when you save an `nlp` object to disk.
|
[`meta.json`](/api/data-formats#meta) when you save an `nlp` object to disk.
|
||||||
|
|
||||||
> #### Example
|
> #### Example
|
||||||
>
|
>
|
||||||
|
@@ -744,9 +750,9 @@ data of the model. The `Language.meta` is also what's serialized as the
|
||||||
> print(nlp.meta)
|
> print(nlp.meta)
|
||||||
> ```
|
> ```
|
||||||
|
|
||||||
| Name | Type | Description |
|
| Name | Description |
|
||||||
| ----------- | ---- | -------------- |
|
| ----------- | --------------------------------- |
|
||||||
| **RETURNS** | dict | The meta data. |
|
| **RETURNS** | The meta data. ~~Dict[str, Any]~~ |
|
||||||
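The `meta.json` round trip described above can be sketched as follows (illustrative; spaCy handles this internally in `to_disk`/`from_disk`):

```python
# Sketch of how a meta dict round-trips through meta.json on save/load.
# Illustrative only; spaCy writes this file itself when saving a pipeline.
import json
import tempfile
from pathlib import Path

meta = {"lang": "en", "name": "example_model", "version": "3.0.0"}

with tempfile.TemporaryDirectory() as model_dir:
    path = Path(model_dir) / "meta.json"
    path.write_text(json.dumps(meta, indent=2))
    loaded = json.loads(path.read_text())

assert loaded == meta
```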
|
|
||||||
## Language.config {#config tag="property" new="3"}
|
## Language.config {#config tag="property" new="3"}
|
||||||
|
|
||||||
|
@@ -765,9 +771,9 @@ subclass of the built-in `dict`. It supports the additional methods `to_disk`
|
||||||
> print(nlp.config.to_str())
|
> print(nlp.config.to_str())
|
||||||
> ```
|
> ```
|
||||||
|
|
||||||
| Name | Type | Description |
|
| Name | Description |
|
||||||
| ----------- | --------------------------------------------------- | ----------- |
|
| ----------- | ---------------------- |
|
||||||
| **RETURNS** | [`Config`](https://thinc.ai/docs/api-config#config) | The config. |
|
| **RETURNS** | The config. ~~Config~~ |
|
||||||
|
|
||||||
## Language.to_disk {#to_disk tag="method" new="2"}
|
## Language.to_disk {#to_disk tag="method" new="2"}
|
||||||
|
|
||||||
|
@@ -780,11 +786,11 @@ the model**.
|
||||||
> nlp.to_disk("/path/to/models")
|
> nlp.to_disk("/path/to/models")
|
||||||
> ```
|
> ```
|
||||||
|
|
||||||
| Name | Type | Description |
|
| Name | Description |
|
||||||
| -------------- | --------------- | --------------------------------------------------------------------------------------------------------------------- |
|
| -------------- | ------------------------------------------------------------------------------------------------------------------------------------------ |
|
||||||
| `path` | str / `Path` | A path to a directory, which will be created if it doesn't exist. Paths may be either strings or `Path`-like objects. |
|
| `path` | A path to a directory, which will be created if it doesn't exist. Paths may be either strings or `Path`-like objects. ~~Union[str, Path]~~ |
|
||||||
| _keyword-only_ | | |
|
| _keyword-only_ | |
|
||||||
| `exclude` | `Iterable[str]` | Names of pipeline components or [serialization fields](#serialization-fields) to exclude. |
|
| `exclude` | Names of pipeline components or [serialization fields](#serialization-fields) to exclude. ~~Iterable[str]~~ |
|
||||||
|
|
||||||
## Language.from_disk {#from_disk tag="method" new="2"}
|
## Language.from_disk {#from_disk tag="method" new="2"}
|
||||||
|
|
||||||
|
@@ -806,12 +812,12 @@ loaded object.
|
||||||
> nlp = English().from_disk("/path/to/en_model")
|
> nlp = English().from_disk("/path/to/en_model")
|
||||||
> ```
|
> ```
|
||||||
|
|
||||||
| Name | Type | Description |
|
| Name | Description |
|
||||||
| -------------- | --------------- | ----------------------------------------------------------------------------------------- |
|
| -------------- | ----------------------------------------------------------------------------------------------------------- |
|
||||||
| `path` | str / `Path` | A path to a directory. Paths may be either strings or `Path`-like objects. |
|
| `path` | A path to a directory. Paths may be either strings or `Path`-like objects. ~~Union[str, Path]~~ |
|
||||||
| _keyword-only_ | | |
|
| _keyword-only_ | |
|
||||||
| `exclude` | `Iterable[str]` | Names of pipeline components or [serialization fields](#serialization-fields) to exclude. |
|
| `exclude` | Names of pipeline components or [serialization fields](#serialization-fields) to exclude. ~~Iterable[str]~~ |
|
||||||
| **RETURNS** | `Language` | The modified `Language` object. |
|
| **RETURNS** | The modified `Language` object. ~~Language~~ |
|
||||||
|
|
||||||
## Language.to_bytes {#to_bytes tag="method"}
|
## Language.to_bytes {#to_bytes tag="method"}
|
||||||
|
|
||||||
|
@@ -823,11 +829,11 @@ Serialize the current state to a binary string.
|
||||||
> nlp_bytes = nlp.to_bytes()
|
> nlp_bytes = nlp.to_bytes()
|
||||||
> ```
|
> ```
|
||||||
|
|
||||||
| Name | Type | Description |
|
| Name | Description |
|
||||||
| -------------- | --------------- | ----------------------------------------------------------------------------------------- |
|
| -------------- | ------------------------------------------------------------------------------------------------------ |
|
||||||
| _keyword-only_ | | |
|
| _keyword-only_ | |
|
||||||
| `exclude` | `Iterable[str]` | Names of pipeline components or [serialization fields](#serialization-fields) to exclude. |
|
| `exclude` | Names of pipeline components or [serialization fields](#serialization-fields) to exclude. ~~Iterable[str]~~ |
|
||||||
| **RETURNS** | bytes | The serialized form of the `Language` object. |
|
| **RETURNS** | The serialized form of the `Language` object. ~~bytes~~ |
|
||||||
|
|
||||||
## Language.from_bytes {#from_bytes tag="method"}
|
## Language.from_bytes {#from_bytes tag="method"}
|
||||||
|
|
||||||
|
@@ -845,35 +851,35 @@ available to the loaded object.
|
||||||
> nlp2.from_bytes(nlp_bytes)
|
> nlp2.from_bytes(nlp_bytes)
|
||||||
> ```
|
> ```
|
||||||
|
|
||||||
| Name | Type | Description |
|
| Name | Description |
|
||||||
| -------------- | --------------- | ----------------------------------------------------------------------------------------- |
|
| -------------- | ----------------------------------------------------------------------------------------------------------- |
|
||||||
| `bytes_data` | bytes | The data to load from. |
|
| `bytes_data` | The data to load from. ~~bytes~~ |
|
||||||
| _keyword-only_ | | |
|
| _keyword-only_ | |
|
||||||
| `exclude` | `Iterable[str]` | Names of pipeline components or [serialization fields](#serialization-fields) to exclude. |
|
| `exclude` | Names of pipeline components or [serialization fields](#serialization-fields) to exclude. ~~Iterable[str]~~ |
|
||||||
| **RETURNS** | `Language` | The `Language` object. |
|
| **RETURNS** | The `Language` object. ~~Language~~ |
|
||||||
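The `to_bytes`/`from_bytes` pattern with an `exclude` argument can be sketched like this. It's a simplified illustration using a plain dict and `pickle`, not spaCy's msgpack-based serialization:

```python
# Sketch of the to_bytes/from_bytes pattern: each serializable field
# contributes data, and excluded fields are skipped on either end.
# Illustrative only, not spaCy's actual implementation.
import pickle

def to_bytes(fields, exclude=()):
    data = {k: v for k, v in fields.items() if k not in exclude}
    return pickle.dumps(data)

def from_bytes(bytes_data, exclude=()):
    data = pickle.loads(bytes_data)
    return {k: v for k, v in data.items() if k not in exclude}

fields = {"meta": {"lang": "en"}, "tokenizer": "rules...", "ner": "weights..."}
blob = to_bytes(fields, exclude=["ner"])
restored = from_bytes(blob)
print(sorted(restored))  # -> ['meta', 'tokenizer']
```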
|
|
||||||
## Attributes {#attributes}
|
## Attributes {#attributes}
|
||||||
|
|
||||||
| Name | Type | Description |
|
| Name | Description |
|
||||||
| --------------------------------------------- | ---------------------- | ---------------------------------------------------------------------------------------- |
|
| --------------------------------------------- | ---------------------------------------------------------------------------------------------------------------------------- |
|
||||||
| `vocab` | `Vocab` | A container for the lexical types. |
|
| `vocab` | A container for the lexical types. ~~Vocab~~ |
|
||||||
| `tokenizer` | `Tokenizer` | The tokenizer. |
|
| `tokenizer` | The tokenizer. ~~Tokenizer~~ |
|
||||||
| `make_doc` | `Callable` | Callable that takes a string and returns a `Doc`. |
|
| `make_doc` | Callable that takes a string and returns a `Doc`. ~~Callable[[str], Doc]~~ |
|
||||||
| `pipeline` | `List[str, Callable]` | List of `(name, component)` tuples describing the current processing pipeline, in order. |
|
| `pipeline` | List of `(name, component)` tuples describing the current processing pipeline, in order. ~~List[Tuple[str, Callable[[Doc], Doc]]]~~ |
|
||||||
| `pipe_names` <Tag variant="new">2</Tag> | `List[str]` | List of pipeline component names, in order. |
|
| `pipe_names` <Tag variant="new">2</Tag> | List of pipeline component names, in order. ~~List[str]~~ |
|
||||||
| `pipe_labels` <Tag variant="new">2.2</Tag> | `Dict[str, List[str]]` | List of labels set by the pipeline components, if available, keyed by component name. |
|
| `pipe_labels` <Tag variant="new">2.2</Tag> | List of labels set by the pipeline components, if available, keyed by component name. ~~Dict[str, List[str]]~~ |
|
||||||
| `pipe_factories` <Tag variant="new">2.2</Tag> | `Dict[str, str]` | Dictionary of pipeline component names, mapped to their factory names. |
|
| `pipe_factories` <Tag variant="new">2.2</Tag> | Dictionary of pipeline component names, mapped to their factory names. ~~Dict[str, str]~~ |
|
||||||
| `factories` | `Dict[str, Callable]` | All available factory functions, keyed by name. |
|
| `factories` | All available factory functions, keyed by name. ~~Dict[str, Callable[[...], Callable[[Doc], Doc]]]~~ |
|
||||||
| `factory_names` <Tag variant="new">3</Tag> | `List[str]` | List of all available factory names. |
|
| `factory_names` <Tag variant="new">3</Tag> | List of all available factory names. ~~List[str]~~ |
|
||||||
| `path` <Tag variant="new">2</Tag> | `Path` | Path to the model data directory, if a model is loaded. Otherwise `None`. |
|
| `path` <Tag variant="new">2</Tag> | Path to the model data directory, if a model is loaded. Otherwise `None`. ~~Optional[Path]~~ |
|
||||||
|
|
||||||
## Class attributes {#class-attributes}
|
## Class attributes {#class-attributes}
|
||||||
|
|
||||||
| Name | Type | Description |
|
| Name | Description |
|
||||||
| ---------------- | ----- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
|
| ---------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ |
|
||||||
| `Defaults` | class | Settings, data and factory methods for creating the `nlp` object and processing pipeline. |
|
| `Defaults` | Settings, data and factory methods for creating the `nlp` object and processing pipeline. ~~Defaults~~ |
|
||||||
| `lang` | str | Two-letter language ID, i.e. [ISO code](https://en.wikipedia.org/wiki/List_of_ISO_639-1_codes). |
|
| `lang` | Two-letter language ID, i.e. [ISO code](https://en.wikipedia.org/wiki/List_of_ISO_639-1_codes). ~~str~~ |
|
||||||
| `default_config` | dict | Base [config](/usage/training#config) to use for [Language.config](/api/language#config). Defaults to [`default_config.cfg`](https://github.com/explosion/spaCy/tree/develop/spacy/default_config.cfg). |
|
| `default_config` | Base [config](/usage/training#config) to use for [Language.config](/api/language#config). Defaults to [`default_config.cfg`](https://github.com/explosion/spaCy/tree/develop/spacy/default_config.cfg). ~~Config~~ |
|
||||||
|
|
||||||
## Defaults {#defaults}
|
## Defaults {#defaults}
|
||||||
|
|
||||||
|
@@ -907,16 +913,16 @@ customize the default language data:
|
||||||
> ```
|
> ```
|
||||||
|
|
||||||
| Name | Description |
|
| Name | Description |
|
||||||
| --------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ |
|
| --------------------------------- | ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
|
||||||
| `stop_words` | List of stop words, used for `Token.is_stop`.<br />**Example:** [`stop_words.py`][stop_words.py] |
|
| `stop_words` | List of stop words, used for `Token.is_stop`.<br />**Example:** [`stop_words.py`][stop_words.py] ~~Set[str]~~ |
|
||||||
| `tokenizer_exceptions` | Tokenizer exception rules, string mapped to list of token attributes.<br />**Example:** [`de/tokenizer_exceptions.py`][de/tokenizer_exceptions.py] |
|
| `tokenizer_exceptions` | Tokenizer exception rules, string mapped to list of token attributes.<br />**Example:** [`de/tokenizer_exceptions.py`][de/tokenizer_exceptions.py] ~~Dict[str, List[dict]]~~ |
|
||||||
| `prefixes`, `suffixes`, `infixes` | Prefix, suffix and infix rules for the default tokenizer.<br />**Example:** [`puncutation.py`][punctuation.py] |
|
| `prefixes`, `suffixes`, `infixes` | Prefix, suffix and infix rules for the default tokenizer.<br />**Example:** [`punctuation.py`][punctuation.py] ~~Optional[List[Union[str, Pattern]]]~~ |
|
||||||
| `token_match` | Optional regex for matching strings that should never be split, overriding the infix rules.<br />**Example:** [`fr/tokenizer_exceptions.py`][fr/tokenizer_exceptions.py] |
|
| `token_match` | Optional regex for matching strings that should never be split, overriding the infix rules.<br />**Example:** [`fr/tokenizer_exceptions.py`][fr/tokenizer_exceptions.py] ~~Optional[Pattern]~~ |
|
||||||
| `url_match` | Regular expression for matching URLs. Prefixes and suffixes are removed before applying the match.<br />**Example:** [`tokenizer_exceptions.py`][tokenizer_exceptions.py] |
|
| `url_match` | Regular expression for matching URLs. Prefixes and suffixes are removed before applying the match.<br />**Example:** [`tokenizer_exceptions.py`][tokenizer_exceptions.py] ~~Optional[Pattern]~~ |
|
||||||
| `lex_attr_getters` | Custom functions for setting lexical attributes on tokens, e.g. `like_num`.<br />**Example:** [`lex_attrs.py`][lex_attrs.py] |
|
| `lex_attr_getters` | Custom functions for setting lexical attributes on tokens, e.g. `like_num`.<br />**Example:** [`lex_attrs.py`][lex_attrs.py] ~~Dict[int, Callable[[str], Any]]~~ |
|
||||||
| `syntax_iterators` | Functions that compute views of a `Doc` object based on its syntax. At the moment, only used for [noun chunks](/usage/linguistic-features#noun-chunks).<br />**Example:** [`syntax_iterators.py`][syntax_iterators.py]. |
|
| `syntax_iterators` | Functions that compute views of a `Doc` object based on its syntax. At the moment, only used for [noun chunks](/usage/linguistic-features#noun-chunks).<br />**Example:** [`syntax_iterators.py`][syntax_iterators.py]. ~~Dict[str, Callable[[Union[Doc, Span]], Iterator[Span]]]~~ |
|
||||||
| `writing_system` | Information about the language's writing system, available via `Vocab.writing_system`. Defaults to: `{"direction": "ltr", "has_case": True, "has_letters": True}.`.<br />**Example:** [`zh/__init__.py`][zh/__init__.py] |
|
| `writing_system` | Information about the language's writing system, available via `Vocab.writing_system`. Defaults to: `{"direction": "ltr", "has_case": True, "has_letters": True}`.<br />**Example:** [`zh/__init__.py`][zh/__init__.py] ~~Dict[str, Any]~~ |
|
||||||
| `config` | Default [config](/usage/training#config) added to `nlp.config`. This can include references to custom tokenizers or lemmatizers.<br />**Example:** [`zh/__init__.py`][zh/__init__.py] |
|
| `config` | Default [config](/usage/training#config) added to `nlp.config`. This can include references to custom tokenizers or lemmatizers.<br />**Example:** [`zh/__init__.py`][zh/__init__.py] ~~Config~~ |
|
||||||
|
|
||||||
[stop_words.py]:
|
[stop_words.py]:
|
||||||
https://github.com/explosion/spaCy/tree/master/spacy/lang/en/stop_words.py
|
https://github.com/explosion/spaCy/tree/master/spacy/lang/en/stop_words.py
|
||||||
|
@@ -949,10 +955,10 @@ serialization by passing in the string names via the `exclude` argument.
|
||||||
> ```
|
> ```
|
||||||
|
|
||||||
| Name | Description |
|
| Name | Description |
|
||||||
| ----------- | -------------------------------------------------- |
|
| ----------- | ------------------------------------------------------------------ |
|
||||||
| `vocab` | The shared [`Vocab`](/api/vocab). |
|
| `vocab` | The shared [`Vocab`](/api/vocab). |
|
||||||
| `tokenizer` | Tokenization rules and exceptions. |
|
| `tokenizer` | Tokenization rules and exceptions. |
|
||||||
| `meta` | The meta data, available as `Language.meta`. |
|
| `meta` | The meta data, available as [`Language.meta`](/api/language#meta). |
|
||||||
| ... | String names of pipeline components, e.g. `"ner"`. |
|
| ... | String names of pipeline components, e.g. `"ner"`. |
|
||||||
|
|
||||||
## FactoryMeta {#factorymeta new="3" tag="dataclass"}
|
## FactoryMeta {#factorymeta new="3" tag="dataclass"}
|
||||||
|
@@ -963,12 +969,12 @@ provided by the [`@Language.component`](/api/language#component) or
|
||||||
component is defined and stored on the `Language` class for each component
|
component is defined and stored on the `Language` class for each component
|
||||||
instance and factory instance.
|
instance and factory instance.
|
||||||
|
|
||||||
| Name | Type | Description |
|
| Name | Description |
|
||||||
| ----------------------- | ------------------ | --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
|
| ----------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ |
|
||||||
| `factory` | str | The name of the registered component factory. |
|
| `factory` | The name of the registered component factory. ~~str~~ |
|
||||||
| `default_config` | `Dict[str, Any]` | The default config, describing the default values of the factory arguments. |
|
| `default_config` | The default config, describing the default values of the factory arguments. ~~Dict[str, Any]~~ |
|
||||||
| `assigns` | `Iterable[str]` | `Doc` or `Token` attributes assigned by this component, e.g. `["token.ent_id"]`. Used for [pipe analysis](/usage/processing-pipelines#analysis). |
|
| `assigns` | `Doc` or `Token` attributes assigned by this component, e.g. `["token.ent_id"]`. Used for [pipe analysis](/usage/processing-pipelines#analysis). ~~Iterable[str]~~ |
|
||||||
| `requires` | `Iterable[str]` | `Doc` or `Token` attributes required by this component, e.g. `["token.ent_id"]`. Used for [pipe analysis](/usage/processing-pipelines#analysis). |
|
| `requires` | `Doc` or `Token` attributes required by this component, e.g. `["token.ent_id"]`. Used for [pipe analysis](/usage/processing-pipelines#analysis). ~~Iterable[str]~~ |
|
||||||
| `retokenizes` | bool | Whether the component changes tokenization. Used for [pipe analysis](/usage/processing-pipelines#analysis). |
|
| `retokenizes` | Whether the component changes tokenization. Used for [pipe analysis](/usage/processing-pipelines#analysis). ~~bool~~ |
|
||||||
| `scores` | `Iterable[str]` | All scores set by the components if it's trainable, e.g. `["ents_f", "ents_r", "ents_p"]`. Used for [pipe analysis](/usage/processing-pipelines#analysis). |
|
| `scores` | All scores set by the component if it's trainable, e.g. `["ents_f", "ents_r", "ents_p"]`. Used for [pipe analysis](/usage/processing-pipelines#analysis). ~~Iterable[str]~~ |
|
||||||
| `default_score_weights` | `Dict[str, float]` | The scores to report during training, and their default weight towards the final score used to select the best model. Weights should sum to `1.0` per component and will be combined and normalized for the whole pipeline. |
|
| `default_score_weights` | The scores to report during training, and their default weight towards the final score used to select the best model. Weights should sum to `1.0` per component and will be combined and normalized for the whole pipeline. ~~Dict[str, float]~~ |
|
||||||
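The combine-and-normalize behavior of `default_score_weights` can be sketched as follows (illustrative; the exact behavior is defined by spaCy's training internals):

```python
# Sketch of combining per-component score weights (each summing to 1.0)
# into normalized pipeline-wide weights for the final score.
# Illustrative only, not spaCy's actual implementation.

def combine_score_weights(weights_list):
    combined = {}
    for weights in weights_list:
        combined.update(weights)
    total = sum(combined.values())
    return {key: round(value / total, 2) for key, value in combined.items()}


tagger = {"tag_acc": 1.0}
ner = {"ents_f": 0.8, "ents_p": 0.1, "ents_r": 0.1}
print(combine_score_weights([tagger, ner]))
# -> {'tag_acc': 0.5, 'ents_f': 0.4, 'ents_p': 0.05, 'ents_r': 0.05}
```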
|
|
|
@@ -36,11 +36,9 @@ tags is available in the pipeline and runs _before_ the lemmatizer.
|
||||||
The default config is defined by the pipeline component factory and describes
|
The default config is defined by the pipeline component factory and describes
|
||||||
how the component should be configured. You can override its settings via the
|
how the component should be configured. You can override its settings via the
|
||||||
`config` argument on [`nlp.add_pipe`](/api/language#add_pipe) or in your
|
`config` argument on [`nlp.add_pipe`](/api/language#add_pipe) or in your
|
||||||
[`config.cfg` for training](/usage/training#config).
|
[`config.cfg` for training](/usage/training#config). For examples of the lookups
|
||||||
|
data formats used by the lookup and rule-based lemmatizers, see
|
||||||
For examples of the lookups data formats used by the lookup and rule-based
|
[`spacy-lookups-data`](https://github.com/explosion/spacy-lookups-data).
|
||||||
lemmatizers, see the
|
|
||||||
[`spacy-lookups-data`](https://github.com/explosion/spacy-lookups-data) repo.
|
|
||||||
|
|
||||||
> #### Example
|
> #### Example
|
||||||
>
|
>
|
||||||
|
@@ -49,12 +47,12 @@ lemmatizers, see the
|
||||||
> nlp.add_pipe("lemmatizer", config=config)
|
> nlp.add_pipe("lemmatizer", config=config)
|
||||||
> ```
|
> ```
|
||||||
|
|
||||||
| Setting | Type | Description | Default |
|
| Setting | Description |
|
||||||
| ----------- | ------------------------------------------ | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | ---------- |
|
| ----------- | ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
|
||||||
| `mode` | str | The lemmatizer mode, e.g. `"lookup"` or `"rule"`. | `"lookup"` |
|
| `mode` | The lemmatizer mode, e.g. `"lookup"` or `"rule"`. Defaults to `"lookup"`. ~~str~~ |
|
||||||
| `lookups` | [`Lookups`](/api/lookups) | The lookups object containing the tables such as `"lemma_rules"`, `"lemma_index"`, `"lemma_exc"` and `"lemma_lookup"`. If `None`, default tables are loaded from `spacy-lookups-data`. | `None` |
|
| `lookups` | The lookups object containing the tables such as `"lemma_rules"`, `"lemma_index"`, `"lemma_exc"` and `"lemma_lookup"`. If `None`, default tables are loaded from [`spacy-lookups-data`](https://github.com/explosion/spacy-lookups-data). Defaults to `None`. ~~Optional[Lookups]~~ |
|
||||||
| `overwrite` | bool | Whether to overwrite existing lemmas. | `False` |
|
| `overwrite` | Whether to overwrite existing lemmas. Defaults to `False`. ~~bool~~ |
|
||||||
| `model` | [`Model`](https://thinc.ai/docs/api-model) | **Not yet implemented:** the model to use. | `None` |
|
| `model` | **Not yet implemented:** the model to use. ~~Model~~ |
|
||||||
|
|
||||||
```python
|
```python
|
||||||
https://github.com/explosion/spaCy/blob/develop/spacy/pipeline/lemmatizer.py
|
https://github.com/explosion/spaCy/blob/develop/spacy/pipeline/lemmatizer.py
|
||||||
|
@@ -77,15 +75,15 @@ Create a new pipeline instance. In your application, you would normally use a
|
||||||
shortcut for this and instantiate the component using its string name and
|
shortcut for this and instantiate the component using its string name and
|
||||||
[`nlp.add_pipe`](/api/language#add_pipe).
|
[`nlp.add_pipe`](/api/language#add_pipe).
|
||||||
|
|
||||||
| Name | Type | Description |
|
| Name | Description |
|
||||||
| -------------- | ------------------------------------------ | ---------------------------------------------------------------------------------------------------------------------------------------- |
|
| -------------- | -------------------------------------------------------------------------------------------------------------------------------------------------------------- |
|
||||||
| `vocab` | [`Vocab`](/api/vocab) | The vocab. |
|
| `vocab` | The shared vocabulary. ~~Vocab~~ |
|
||||||
| `model` | [`Model`](https://thinc.ai/docs/api-model) | A model (not yet implemented). |
|
| `model` | **Not yet implemented:** The model to use. ~~Model~~ |
|
||||||
| `name` | str | String name of the component instance. Used to add entries to the `losses` during training. |
|
| `name` | String name of the component instance. Used to add entries to the `losses` during training. ~~str~~ |
|
||||||
| _keyword-only_ | | |
|
| _keyword-only_ | | |
|
||||||
| mode | str | The lemmatizer mode, e.g. `"lookup"` or `"rule"`. Defaults to `"lookup"`. |
|
| `mode` | The lemmatizer mode, e.g. `"lookup"` or `"rule"`. Defaults to `"lookup"`. ~~str~~ |
|
||||||
| lookups | [`Lookups`](/api/lookups) | A lookups object containing the tables such as `"lemma_rules"`, `"lemma_index"`, `"lemma_exc"` and `"lemma_lookup"`. Defaults to `None`. |
|
| `lookups` | A lookups object containing the tables such as `"lemma_rules"`, `"lemma_index"`, `"lemma_exc"` and `"lemma_lookup"`. Defaults to `None`. ~~Optional[Lookups]~~ |
|
||||||
| overwrite | bool | Whether to overwrite existing lemmas. |
|
| `overwrite` | Whether to overwrite existing lemmas. ~~bool~~ |
|
||||||
|
|
||||||
## Lemmatizer.\_\_call\_\_ {#call tag="method"}
|
## Lemmatizer.\_\_call\_\_ {#call tag="method"}
|
||||||
|
|
||||||
|
@@ -102,10 +100,10 @@ and all pipeline components are applied to the `Doc` in order.
|
||||||
> processed = lemmatizer(doc)
|
> processed = lemmatizer(doc)
|
||||||
> ```
|
> ```
|
||||||
|
|
||||||
| Name | Type | Description |
|
| Name | Description |
|
||||||
| ----------- | ----- | ------------------------ |
|
| ----------- | -------------------------------- |
|
||||||
| `doc` | `Doc` | The document to process. |
|
| `doc` | The document to process. ~~Doc~~ |
|
||||||
| **RETURNS** | `Doc` | The processed document. |
|
| **RETURNS** | The processed document. ~~Doc~~ |
|
||||||
|
|
||||||
## Lemmatizer.pipe {#pipe tag="method"}
|
## Lemmatizer.pipe {#pipe tag="method"}
|
||||||
|
|
||||||
|
@@ -121,12 +119,12 @@ applied to the `Doc` in order.
|
||||||
> pass
|
> pass
|
||||||
> ```
|
> ```
|
||||||
|
|
||||||
| Name | Type | Description |
|
| Name | Description |
|
||||||
| -------------- | --------------- | ------------------------------------------------------ |
|
| -------------- | ------------------------------------------------------------- |
|
||||||
| `stream` | `Iterable[Doc]` | A stream of documents. |
|
| `stream` | A stream of documents. ~~Iterable[Doc]~~ |
|
||||||
| _keyword-only_ | | |
|
| _keyword-only_ | |
|
||||||
| `batch_size` | int | The number of texts to buffer. Defaults to `128`. |
|
| `batch_size` | The number of documents to buffer. Defaults to `128`. ~~int~~ |
|
||||||
| **YIELDS** | `Doc` | Processed documents in the order of the original text. |
|
| **YIELDS** | The processed documents in order. ~~Doc~~ |
|
||||||
|
|
||||||
## Lemmatizer.lookup_lemmatize {#lookup_lemmatize tag="method"}
|
## Lemmatizer.lookup_lemmatize {#lookup_lemmatize tag="method"}
|
||||||
|
|
||||||
|
@@ -134,39 +132,39 @@ Lemmatize a token using a lookup-based approach. If no lemma is found, the
|
||||||
original string is returned. Languages can provide a
|
original string is returned. Languages can provide a
|
||||||
[lookup table](/usage/adding-languages#lemmatizer) via the `Lookups`.
|
[lookup table](/usage/adding-languages#lemmatizer) via the `Lookups`.
|
||||||
|
|
||||||
| Name | Type | Description |
|
| Name | Description |
|
||||||
| ----------- | --------------------- | ------------------------------------- |
|
| ----------- | --------------------------------------------------- |
|
||||||
| `token` | [`Token`](/api/token) | The token to lemmatize. |
|
| `token` | The token to lemmatize. ~~Token~~ |
|
||||||
| **RETURNS** | `List[str]` | A list containing one or more lemmas. |
|
| **RETURNS** | A list containing one or more lemmas. ~~List[str]~~ |
|
||||||
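The lookup approach described above is essentially a table lookup with a fallback to the original string. A minimal sketch (illustrative; the table names and contents are assumptions for the example):

```python
# Sketch of lookup-based lemmatization: look the token text up in a lemma
# table and fall back to the original string. Illustrative only.

def lookup_lemmatize(text, lemma_lookup):
    return [lemma_lookup.get(text, text)]


table = {"mice": "mouse", "was": "be"}
print(lookup_lemmatize("mice", table))   # -> ['mouse']
print(lookup_lemmatize("spacy", table))  # -> ['spacy']
```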
|
|
||||||
## Lemmatizer.rule_lemmatize {#rule_lemmatize tag="method"}
|
## Lemmatizer.rule_lemmatize {#rule_lemmatize tag="method"}
|
||||||
|
|
||||||
Lemmatize a token using a rule-based approach. Typically relies on POS tags.
|
Lemmatize a token using a rule-based approach. Typically relies on POS tags.
|
||||||
|
|
||||||
| Name | Type | Description |
|
| Name | Description |
|
||||||
| ----------- | --------------------- | ------------------------------------- |
|
| ----------- | --------------------------------------------------- |
|
||||||
| `token` | [`Token`](/api/token) | The token to lemmatize. |
|
| `token` | The token to lemmatize. ~~Token~~ |
|
||||||
| **RETURNS** | `List[str]` | A list containing one or more lemmas. |
|
| **RETURNS** | A list containing one or more lemmas. ~~List[str]~~ |
|
||||||
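The rule-based approach can be sketched as exceptions plus POS-keyed suffix rewrites. This is a simplified illustration with made-up rules, not spaCy's actual algorithm:

```python
# Sketch of POS-aware, rule-based lemmatization: check exception tables
# first, then apply suffix-rewrite rules for the token's POS.
# Illustrative only; the rules and exceptions below are invented.

def rule_lemmatize(text, pos, rules, exceptions):
    if text in exceptions.get(pos, {}):
        return [exceptions[pos][text]]
    lemmas = []
    for old_suffix, new_suffix in rules.get(pos, []):
        if text.endswith(old_suffix):
            lemmas.append(text[: len(text) - len(old_suffix)] + new_suffix)
    return lemmas or [text]


rules = {"VERB": [("ing", ""), ("ed", "")]}
exceptions = {"VERB": {"was": "be"}}
print(rule_lemmatize("walking", "VERB", rules, exceptions))  # -> ['walk']
print(rule_lemmatize("was", "VERB", rules, exceptions))      # -> ['be']
```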
|
|
||||||
## Lemmatizer.is_base_form {#is_base_form tag="method"}
|
## Lemmatizer.is_base_form {#is_base_form tag="method"}
|
||||||
|
|
||||||
Check whether we're dealing with an uninflected paradigm, so we can avoid
|
Check whether we're dealing with an uninflected paradigm, so we can avoid
|
||||||
lemmatization entirely.
|
lemmatization entirely.
|
||||||
|
|
||||||
| Name | Type | Description |
|
| Name | Description |
|
||||||
| ----------- | --------------------- | ------------------------------------------------------------------------------------------------------- |
|
| ----------- | ---------------------------------------------------------------------------------------------------------------- |
|
||||||
| `token` | [`Token`](/api/token) | The token to analyze. |
|
| `token` | The token to analyze. ~~Token~~ |
|
||||||
| **RETURNS** | bool | Whether the token's attributes (e.g., part-of-speech tag, morphological features) describe a base form. |
|
| **RETURNS** | Whether the token's attributes (e.g., part-of-speech tag, morphological features) describe a base form. ~~bool~~ |
|
||||||
|
|
||||||
## Lemmatizer.get_lookups_config {#get_lookups_config tag="classmethod"}
|
## Lemmatizer.get_lookups_config {#get_lookups_config tag="classmethod"}
|
||||||
|
|
||||||
Returns the lookups configuration settings for a given mode for use in
|
Returns the lookups configuration settings for a given mode for use in
|
||||||
[`Lemmatizer.load_lookups`](#load_lookups).
|
[`Lemmatizer.load_lookups`](/api/lemmatizer#load_lookups).
|
||||||
|
|
||||||
| Name | Type | Description |
|
| Name | Description |
|
||||||
| ----------- | ---- | ------------------------------------------------- |
|
| ----------- | --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
|
||||||
| `mode` | str | The lemmatizer mode. |
|
| `mode` | The lemmatizer mode. ~~str~~ |
|
||||||
| **RETURNS** | dict | The lookups configuration settings for this mode. |
|
| **RETURNS** | The lookups configuration settings for this mode. Includes the keys `"required_tables"` and `"optional_tables"`, mapped to a list of table string names. ~~Dict[str, List[str]]~~ |
|
||||||
|
|
||||||
## Lemmatizer.load_lookups {#load_lookups tag="classmethod"}
|
## Lemmatizer.load_lookups {#load_lookups tag="classmethod"}
|
||||||
|
|
||||||
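As context for the hunk above: the lookup-mode contract the tables describe (return the table entry if present, otherwise the original string, always as a list) is easy to demonstrate. Below is a minimal, dependency-free sketch — `LOOKUP_TABLE` and `lookup_lemmatize` are hypothetical stand-ins, not spaCy's actual implementation:

```python
# Hypothetical stand-in for the lookup mode described above: the table
# maps a token's text to its lemma; misses fall back to the original text.
LOOKUP_TABLE = {"was": "be", "mice": "mouse"}

def lookup_lemmatize(text):
    """Return a list containing one or more lemmas for `text`."""
    return [LOOKUP_TABLE.get(text, text)]

assert lookup_lemmatize("mice") == ["mouse"]
assert lookup_lemmatize("spaCy") == ["spaCy"]  # no entry: original string
```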
@@ -174,12 +172,12 @@ Load and validate lookups tables. If the provided lookups is `None`, load the
 default lookups tables according to the language and mode settings. Confirm that
 all required tables for the language and mode are present.

-| Name | Type | Description |
-| ----------- | ------------------------- | ---------------------------------------------------------------------------- |
-| `lang` | str | The language. |
-| `mode` | str | The lemmatizer mode. |
-| `lookups` | [`Lookups`](/api/lookups) | The provided lookups, may be `None` if the default lookups should be loaded. |
-| **RETURNS** | [`Lookups`](/api/lookups) | The lookups object. |
+| Name | Description |
+| ----------- | -------------------------------------------------------------------------------------------------- |
+| `lang` | The language. ~~str~~ |
+| `mode` | The lemmatizer mode. ~~str~~ |
+| `lookups` | The provided lookups, may be `None` if the default lookups should be loaded. ~~Optional[Lookups]~~ |
+| **RETURNS** | The lookups. ~~Lookups~~ |

 ## Lemmatizer.to_disk {#to_disk tag="method"}

@@ -192,11 +190,11 @@ Serialize the pipe to disk.
 > lemmatizer.to_disk("/path/to/lemmatizer")
 > ```

-| Name | Type | Description |
-| -------------- | --------------- | --------------------------------------------------------------------------------------------------------------------- |
-| `path` | str / `Path` | A path to a directory, which will be created if it doesn't exist. Paths may be either strings or `Path`-like objects. |
-| _keyword-only_ | | |
-| `exclude` | `Iterable[str]` | String names of [serialization fields](#serialization-fields) to exclude. |
+| Name | Description |
+| -------------- | ------------------------------------------------------------------------------------------------------------------------------------------ |
+| `path` | A path to a directory, which will be created if it doesn't exist. Paths may be either strings or `Path`-like objects. ~~Union[str, Path]~~ |
+| _keyword-only_ | |
+| `exclude` | String names of [serialization fields](#serialization-fields) to exclude. ~~Iterable[str]~~ |

 ## Lemmatizer.from_disk {#from_disk tag="method"}

@@ -209,12 +207,12 @@ Load the pipe from disk. Modifies the object in place and returns it.
 > lemmatizer.from_disk("/path/to/lemmatizer")
 > ```

-| Name | Type | Description |
-| -------------- | --------------- | -------------------------------------------------------------------------- |
-| `path` | str / `Path` | A path to a directory. Paths may be either strings or `Path`-like objects. |
-| _keyword-only_ | | |
-| `exclude` | `Iterable[str]` | String names of [serialization fields](#serialization-fields) to exclude. |
-| **RETURNS** | `Lemmatizer` | The modified `Lemmatizer` object. |
+| Name | Description |
+| -------------- | ----------------------------------------------------------------------------------------------- |
+| `path` | A path to a directory. Paths may be either strings or `Path`-like objects. ~~Union[str, Path]~~ |
+| _keyword-only_ | |
+| `exclude` | String names of [serialization fields](#serialization-fields) to exclude. ~~Iterable[str]~~ |
+| **RETURNS** | The modified `Lemmatizer` object. ~~Lemmatizer~~ |

 ## Lemmatizer.to_bytes {#to_bytes tag="method"}

@@ -227,11 +225,11 @@ Load the pipe from disk. Modifies the object in place and returns it.

 Serialize the pipe to a bytestring.

-| Name | Type | Description |
-| -------------- | --------------- | ------------------------------------------------------------------------- |
-| _keyword-only_ | | |
-| `exclude` | `Iterable[str]` | String names of [serialization fields](#serialization-fields) to exclude. |
-| **RETURNS** | bytes | The serialized form of the `Lemmatizer` object. |
+| Name | Description |
+| -------------- | ------------------------------------------------------------------------------------------- |
+| _keyword-only_ | |
+| `exclude` | String names of [serialization fields](#serialization-fields) to exclude. ~~Iterable[str]~~ |
+| **RETURNS** | The serialized form of the `Lemmatizer` object. ~~bytes~~ |

 ## Lemmatizer.from_bytes {#from_bytes tag="method"}

@@ -245,27 +243,20 @@ Load the pipe from a bytestring. Modifies the object in place and returns it.
 > lemmatizer.from_bytes(lemmatizer_bytes)
 > ```

-| Name | Type | Description |
-| -------------- | --------------- | ------------------------------------------------------------------------- |
-| `bytes_data` | bytes | The data to load from. |
-| _keyword-only_ | | |
-| `exclude` | `Iterable[str]` | String names of [serialization fields](#serialization-fields) to exclude. |
-| **RETURNS** | `Lemmatizer` | The `Lemmatizer` object. |
+| Name | Description |
+| -------------- | ------------------------------------------------------------------------------------------- |
+| `bytes_data` | The data to load from. ~~bytes~~ |
+| _keyword-only_ | |
+| `exclude` | String names of [serialization fields](#serialization-fields) to exclude. ~~Iterable[str]~~ |
+| **RETURNS** | The `Lemmatizer` object. ~~Lemmatizer~~ |

-## Lemmatizer.mode {#mode tag="property"}
-
-The lemmatizer mode.
-
-| Name | Type | Description |
-| ----------- | ----- | -------------------- |
-| **RETURNS** | `str` | The lemmatizer mode. |
-
 ## Attributes {#attributes}

-| Name | Type | Description |
-| --------- | --------------------------------- | ------------------- |
-| `vocab` | The shared [`Vocab`](/api/vocab). |
-| `lookups` | [`Lookups`](/api/lookups) | The lookups object. |
+| Name | Description |
+| --------- | ------------------------------------------- |
+| `vocab` | The shared [`Vocab`](/api/vocab). ~~Vocab~~ |
+| `lookups` | The lookups object. ~~Lookups~~ |
+| `mode` | The lemmatizer mode. ~~str~~ |

 ## Serialization fields {#serialization-fields}

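The `to_disk`/`from_disk`/`to_bytes`/`from_bytes` hunks above all follow one pattern: serialize everything except the fields named in `exclude`, and the `from_*` methods modify the object in place and return it. A dependency-free sketch of that contract — `TinyPipe` is a hypothetical stand-in, not spaCy's `Lemmatizer`:

```python
import json

class TinyPipe:
    """Hypothetical stand-in illustrating the to_bytes/from_bytes contract."""

    def __init__(self):
        self.fields = {"lookups": {"was": "be"}, "mode": "lookup"}

    def to_bytes(self, *, exclude=tuple()):
        # Serialize everything except the excluded field names.
        data = {k: v for k, v in self.fields.items() if k not in exclude}
        return json.dumps(data).encode("utf8")

    def from_bytes(self, bytes_data, *, exclude=tuple()):
        # Modify the object in place and return it.
        for key, value in json.loads(bytes_data).items():
            if key not in exclude:
                self.fields[key] = value
        return self

pipe_bytes = TinyPipe().to_bytes(exclude=["mode"])
assert "mode" not in json.loads(pipe_bytes.decode("utf8"))
restored = TinyPipe().from_bytes(pipe_bytes)
```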
@@ -13,10 +13,10 @@ lemmatization depends on the part-of-speech tag).

 Create a `Lexeme` object.

-| Name | Type | Description |
-| ------- | ------- | -------------------------- |
-| `vocab` | `Vocab` | The parent vocabulary. |
-| `orth` | int | The orth id of the lexeme. |
+| Name | Description |
+| ------- | ---------------------------------- |
+| `vocab` | The parent vocabulary. ~~Vocab~~ |
+| `orth` | The orth id of the lexeme. ~~int~~ |

 ## Lexeme.set_flag {#set_flag tag="method"}

@@ -29,10 +29,10 @@ Change the value of a boolean flag.
 > nlp.vocab["spaCy"].set_flag(COOL_FLAG, True)
 > ```

-| Name | Type | Description |
-| --------- | ---- | ------------------------------------ |
-| `flag_id` | int | The attribute ID of the flag to set. |
-| `value` | bool | The new value of the flag. |
+| Name | Description |
+| --------- | -------------------------------------------- |
+| `flag_id` | The attribute ID of the flag to set. ~~int~~ |
+| `value` | The new value of the flag. ~~bool~~ |

 ## Lexeme.check_flag {#check_flag tag="method"}

@@ -46,10 +46,10 @@ Check the value of a boolean flag.
 > assert nlp.vocab["spaCy"].check_flag(MY_LIBRARY) == True
 > ```

-| Name | Type | Description |
-| ----------- | ---- | -------------------------------------- |
-| `flag_id` | int | The attribute ID of the flag to query. |
-| **RETURNS** | bool | The value of the flag. |
+| Name | Description |
+| ----------- | ---------------------------------------------- |
+| `flag_id` | The attribute ID of the flag to query. ~~int~~ |
+| **RETURNS** | The value of the flag. ~~bool~~ |

 ## Lexeme.similarity {#similarity tag="method" model="vectors"}

@@ -65,10 +65,10 @@ Compute a semantic similarity estimate. Defaults to cosine over vectors.
 > assert apple_orange == orange_apple
 > ```

-| Name | Type | Description |
-| ----------- | ----- | -------------------------------------------------------------------------------------------- |
-| other | - | The object to compare with. By default, accepts `Doc`, `Span`, `Token` and `Lexeme` objects. |
-| **RETURNS** | float | A scalar similarity score. Higher is more similar. |
+| Name | Description |
+| ----------- | -------------------------------------------------------------------------------------------------------------------------------- |
+| other | The object to compare with. By default, accepts `Doc`, `Span`, `Token` and `Lexeme` objects. ~~Union[Doc, Span, Token, Lexeme]~~ |
+| **RETURNS** | A scalar similarity score. Higher is more similar. ~~float~~ |

 ## Lexeme.has_vector {#has_vector tag="property" model="vectors"}

@@ -81,9 +81,9 @@ A boolean value indicating whether a word vector is associated with the lexeme.
 > assert apple.has_vector
 > ```

-| Name | Type | Description |
-| ----------- | ---- | ---------------------------------------------- |
-| **RETURNS** | bool | Whether the lexeme has a vector data attached. |
+| Name | Description |
+| ----------- | ------------------------------------------------------- |
+| **RETURNS** | Whether the lexeme has a vector data attached. ~~bool~~ |

 ## Lexeme.vector {#vector tag="property" model="vectors"}

@@ -97,9 +97,9 @@ A real-valued meaning representation.
 > assert apple.vector.shape == (300,)
 > ```

-| Name | Type | Description |
-| ----------- | ---------------------------------------- | ----------------------------------------------------- |
-| **RETURNS** | `numpy.ndarray[ndim=1, dtype='float32']` | A 1D numpy array representing the lexeme's semantics. |
+| Name | Description |
+| ----------- | ------------------------------------------------------------------------------------------------ |
+| **RETURNS** | A 1-dimensional array representing the lexeme's vector. ~~numpy.ndarray[ndim=1, dtype=float32]~~ |

 ## Lexeme.vector_norm {#vector_norm tag="property" model="vectors"}

@@ -115,50 +115,50 @@ The L2 norm of the lexeme's vector representation.
 > assert apple.vector_norm != pasta.vector_norm
 > ```

-| Name | Type | Description |
-| ----------- | ----- | ----------------------------------------- |
-| **RETURNS** | float | The L2 norm of the vector representation. |
+| Name | Description |
+| ----------- | --------------------------------------------------- |
+| **RETURNS** | The L2 norm of the vector representation. ~~float~~ |

 ## Attributes {#attributes}

-| Name | Type | Description |
-| -------------------------------------------- | ------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ |
-| `vocab` | `Vocab` | The lexeme's vocabulary. |
-| `text` | str | Verbatim text content. |
-| `orth` | int | ID of the verbatim text content. |
-| `orth_` | str | Verbatim text content (identical to `Lexeme.text`). Exists mostly for consistency with the other attributes. |
-| `rank` | int | Sequential ID of the lexemes's lexical type, used to index into tables, e.g. for word vectors. |
-| `flags` | int | Container of the lexeme's binary flags. |
-| `norm` | int | The lexemes's norm, i.e. a normalized form of the lexeme text. |
-| `norm_` | str | The lexemes's norm, i.e. a normalized form of the lexeme text. |
-| `lower` | int | Lowercase form of the word. |
-| `lower_` | str | Lowercase form of the word. |
-| `shape` | int | Transform of the words's string, to show orthographic features. Alphabetic characters are replaced by `x` or `X`, and numeric characters are replaced by `d`, and sequences of the same character are truncated after length 4. For example, `"Xxxx"` or `"dd"`. |
-| `shape_` | str | Transform of the word's string, to show orthographic features. Alphabetic characters are replaced by `x` or `X`, and numeric characters are replaced by `d`, and sequences of the same character are truncated after length 4. For example, `"Xxxx"` or `"dd"`. |
-| `prefix` | int | Length-N substring from the start of the word. Defaults to `N=1`. |
-| `prefix_` | str | Length-N substring from the start of the word. Defaults to `N=1`. |
-| `suffix` | int | Length-N substring from the end of the word. Defaults to `N=3`. |
-| `suffix_` | str | Length-N substring from the start of the word. Defaults to `N=3`. |
-| `is_alpha` | bool | Does the lexeme consist of alphabetic characters? Equivalent to `lexeme.text.isalpha()`. |
-| `is_ascii` | bool | Does the lexeme consist of ASCII characters? Equivalent to `[any(ord(c) >= 128 for c in lexeme.text)]`. |
-| `is_digit` | bool | Does the lexeme consist of digits? Equivalent to `lexeme.text.isdigit()`. |
-| `is_lower` | bool | Is the lexeme in lowercase? Equivalent to `lexeme.text.islower()`. |
-| `is_upper` | bool | Is the lexeme in uppercase? Equivalent to `lexeme.text.isupper()`. |
-| `is_title` | bool | Is the lexeme in titlecase? Equivalent to `lexeme.text.istitle()`. |
-| `is_punct` | bool | Is the lexeme punctuation? |
-| `is_left_punct` | bool | Is the lexeme a left punctuation mark, e.g. `(`? |
-| `is_right_punct` | bool | Is the lexeme a right punctuation mark, e.g. `)`? |
-| `is_space` | bool | Does the lexeme consist of whitespace characters? Equivalent to `lexeme.text.isspace()`. |
-| `is_bracket` | bool | Is the lexeme a bracket? |
-| `is_quote` | bool | Is the lexeme a quotation mark? |
-| `is_currency` <Tag variant="new">2.0.8</Tag> | bool | Is the lexeme a currency symbol? |
-| `like_url` | bool | Does the lexeme resemble a URL? |
-| `like_num` | bool | Does the lexeme represent a number? e.g. "10.9", "10", "ten", etc. |
-| `like_email` | bool | Does the lexeme resemble an email address? |
-| `is_oov` | bool | Does the lexeme have a word vector? |
-| `is_stop` | bool | Is the lexeme part of a "stop list"? |
-| `lang` | int | Language of the parent vocabulary. |
-| `lang_` | str | Language of the parent vocabulary. |
-| `prob` | float | Smoothed log probability estimate of the lexeme's word type (context-independent entry in the vocabulary). |
-| `cluster` | int | Brown cluster ID. |
-| `sentiment` | float | A scalar value indicating the positivity or negativity of the lexeme. |
+| Name | Description |
+| -------------------------------------------- | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
+| `vocab` | The lexeme's vocabulary. ~~Vocab~~ |
+| `text` | Verbatim text content. ~~str~~ |
+| `orth` | ID of the verbatim text content. ~~int~~ |
+| `orth_` | Verbatim text content (identical to `Lexeme.text`). Exists mostly for consistency with the other attributes. ~~str~~ |
+| `rank` | Sequential ID of the lexemes's lexical type, used to index into tables, e.g. for word vectors. ~~int~~ |
+| `flags` | Container of the lexeme's binary flags. ~~int~~ |
+| `norm` | The lexemes's norm, i.e. a normalized form of the lexeme text. ~~int~~ |
+| `norm_` | The lexemes's norm, i.e. a normalized form of the lexeme text. ~~str~~ |
+| `lower` | Lowercase form of the word. ~~int~~ |
+| `lower_` | Lowercase form of the word. ~~str~~ |
+| `shape` | Transform of the words's string, to show orthographic features. Alphabetic characters are replaced by `x` or `X`, and numeric characters are replaced by `d`, and sequences of the same character are truncated after length 4. For example, `"Xxxx"` or `"dd"`. ~~int~~ |
+| `shape_` | Transform of the word's string, to show orthographic features. Alphabetic characters are replaced by `x` or `X`, and numeric characters are replaced by `d`, and sequences of the same character are truncated after length 4. For example, `"Xxxx"` or `"dd"`. ~~str~~ |
+| `prefix` | Length-N substring from the start of the word. Defaults to `N=1`. ~~int~~ |
+| `prefix_` | Length-N substring from the start of the word. Defaults to `N=1`. ~~str~~ |
+| `suffix` | Length-N substring from the end of the word. Defaults to `N=3`. ~~int~~ |
+| `suffix_` | Length-N substring from the start of the word. Defaults to `N=3`. ~~str~~ |
+| `is_alpha` | Does the lexeme consist of alphabetic characters? Equivalent to `lexeme.text.isalpha()`. ~~bool~~ |
+| `is_ascii` | Does the lexeme consist of ASCII characters? Equivalent to `[any(ord(c) >= 128 for c in lexeme.text)]`. ~~bool~~ |
+| `is_digit` | Does the lexeme consist of digits? Equivalent to `lexeme.text.isdigit()`. ~~bool~~ |
+| `is_lower` | Is the lexeme in lowercase? Equivalent to `lexeme.text.islower()`. ~~bool~~ |
+| `is_upper` | Is the lexeme in uppercase? Equivalent to `lexeme.text.isupper()`. ~~bool~~ |
+| `is_title` | Is the lexeme in titlecase? Equivalent to `lexeme.text.istitle()`. ~~bool~~ |
+| `is_punct` | Is the lexeme punctuation? ~~bool~~ |
+| `is_left_punct` | Is the lexeme a left punctuation mark, e.g. `(`? ~~bool~~ |
+| `is_right_punct` | Is the lexeme a right punctuation mark, e.g. `)`? ~~bool~~ |
+| `is_space` | Does the lexeme consist of whitespace characters? Equivalent to `lexeme.text.isspace()`. ~~bool~~ |
+| `is_bracket` | Is the lexeme a bracket? ~~bool~~ |
+| `is_quote` | Is the lexeme a quotation mark? ~~bool~~ |
+| `is_currency` <Tag variant="new">2.0.8</Tag> | Is the lexeme a currency symbol? ~~bool~~ |
+| `like_url` | Does the lexeme resemble a URL? ~~bool~~ |
+| `like_num` | Does the lexeme represent a number? e.g. "10.9", "10", "ten", etc. ~~bool~~ |
+| `like_email` | Does the lexeme resemble an email address? ~~bool~~ |
+| `is_oov` | Does the lexeme have a word vector? ~~bool~~ |
+| `is_stop` | Is the lexeme part of a "stop list"? ~~bool~~ |
+| `lang` | Language of the parent vocabulary. ~~int~~ |
+| `lang_` | Language of the parent vocabulary. ~~str~~ |
+| `prob` | Smoothed log probability estimate of the lexeme's word type (context-independent entry in the vocabulary). ~~float~~ |
+| `cluster` | Brown cluster ID. ~~int~~ |
+| `sentiment` | A scalar value indicating the positivity or negativity of the lexeme. ~~float~~ |
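As context for the `similarity` and `vector_norm` hunks above, cosine-over-vectors and the L2 norm can be sketched without numpy. These are plain-Python stand-ins, not spaCy's vectorized implementation:

```python
import math

def vector_norm(vec):
    """L2 norm of a vector, as described for Lexeme.vector_norm."""
    return math.sqrt(sum(x * x for x in vec))

def similarity(a, b):
    """Cosine similarity, the default estimate for Lexeme.similarity."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (vector_norm(a) * vector_norm(b))

apple, orange = [1.0, 2.0, 0.5], [0.9, 2.1, 0.4]
# Symmetry, as in the docs example: apple.similarity(orange) == orange.similarity(apple)
assert abs(similarity(apple, orange) - similarity(orange, apple)) < 1e-12
```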
@ -24,10 +24,6 @@ Create a `Lookups` object.
|
||||||
> lookups = Lookups()
|
> lookups = Lookups()
|
||||||
> ```
|
> ```
|
||||||
|
|
||||||
| Name | Type | Description |
|
|
||||||
| ----------- | --------- | ----------------------------- |
|
|
||||||
| **RETURNS** | `Lookups` | The newly constructed object. |
|
|
||||||
|
|
||||||
## Lookups.\_\_len\_\_ {#len tag="method"}
|
## Lookups.\_\_len\_\_ {#len tag="method"}
|
||||||
|
|
||||||
Get the current number of tables in the lookups.
|
Get the current number of tables in the lookups.
|
||||||
|
@ -39,9 +35,9 @@ Get the current number of tables in the lookups.
|
||||||
> assert len(lookups) == 0
|
> assert len(lookups) == 0
|
||||||
> ```
|
> ```
|
||||||
|
|
||||||
| Name | Type | Description |
|
| Name | Description |
|
||||||
| ----------- | ---- | ------------------------------------ |
|
| ----------- | -------------------------------------------- |
|
||||||
| **RETURNS** | int | The number of tables in the lookups. |
|
| **RETURNS** | The number of tables in the lookups. ~~int~~ |
|
||||||
|
|
||||||
## Lookups.\_\contains\_\_ {#contains tag="method"}
|
## Lookups.\_\contains\_\_ {#contains tag="method"}
|
||||||
|
|
||||||
|
@ -56,10 +52,10 @@ Check if the lookups contain a table of a given name. Delegates to
|
||||||
> assert "some_table" in lookups
|
> assert "some_table" in lookups
|
||||||
> ```
|
> ```
|
||||||
|
|
||||||
| Name | Type | Description |
|
| Name | Description |
|
||||||
| ----------- | ---- | ----------------------------------------------- |
|
| ----------- | -------------------------------------------------------- |
|
||||||
| `name` | str | Name of the table. |
|
| `name` | Name of the table. ~~str~~ |
|
||||||
| **RETURNS** | bool | Whether a table of that name is in the lookups. |
|
| **RETURNS** | Whether a table of that name is in the lookups. ~~bool~~ |
|
||||||
|
|
||||||
## Lookups.tables {#tables tag="property"}
|
## Lookups.tables {#tables tag="property"}
|
||||||
|
|
||||||
|
@ -73,9 +69,9 @@ Get the names of all tables in the lookups.
|
||||||
> assert lookups.tables == ["some_table"]
|
> assert lookups.tables == ["some_table"]
|
||||||
> ```
|
> ```
|
||||||
|
|
||||||
| Name | Type | Description |
|
| Name | Description |
|
||||||
| ----------- | ---- | ----------------------------------- |
|
| ----------- | ------------------------------------------------- |
|
||||||
| **RETURNS** | list | Names of the tables in the lookups. |
|
| **RETURNS** | Names of the tables in the lookups. ~~List[str]~~ |
|
||||||
|
|
||||||
## Lookups.add_table {#add_table tag="method"}
|
## Lookups.add_table {#add_table tag="method"}
|
||||||
|
|
||||||
|
@ -89,11 +85,11 @@ exists.
|
||||||
> lookups.add_table("some_table", {"foo": "bar"})
|
> lookups.add_table("some_table", {"foo": "bar"})
|
||||||
> ```
|
> ```
|
||||||
|
|
||||||
| Name | Type | Description |
|
| Name | Description |
|
||||||
| ----------- | ----------------------------- | ---------------------------------- |
|
| ----------- | ------------------------------------------- |
|
||||||
| `name` | str | Unique name of the table. |
|
| `name` | Unique name of the table. ~~str~~ |
|
||||||
| `data` | dict | Optional data to add to the table. |
|
| `data` | Optional data to add to the table. ~~dict~~ |
|
||||||
| **RETURNS** | [`Table`](/api/lookups#table) | The newly added table. |
|
| **RETURNS** | The newly added table. ~~Table~~ |
|
||||||
|
|
||||||
## Lookups.get_table {#get_table tag="method"}
|
## Lookups.get_table {#get_table tag="method"}
|
||||||
|
|
||||||
|
@ -108,10 +104,10 @@ Get a table from the lookups. Raises an error if the table doesn't exist.
|
||||||
> assert table["foo"] == "bar"
|
> assert table["foo"] == "bar"
|
||||||
> ```
|
> ```
|
||||||
|
|
||||||
| Name | Type | Description |
|
| Name | Description |
|
||||||
| ----------- | ----------------------------- | ------------------ |
|
| ----------- | -------------------------- |
|
||||||
| `name` | str | Name of the table. |
|
| `name` | Name of the table. ~~str~~ |
|
||||||
| **RETURNS** | [`Table`](/api/lookups#table) | The table. |
|
| **RETURNS** | The table. ~~Table~~ |
|
||||||
|
|
||||||
## Lookups.remove_table {#remove_table tag="method"}
|
## Lookups.remove_table {#remove_table tag="method"}
|
||||||
|
|
||||||
|
@ -126,10 +122,10 @@ Remove a table from the lookups. Raises an error if the table doesn't exist.
|
||||||
> assert "some_table" not in lookups
|
> assert "some_table" not in lookups
|
||||||
> ```
|
> ```
|
||||||
|
|
||||||
| Name | Type | Description |
|
| Name | Description |
|
||||||
| ----------- | ----------------------------- | ---------------------------- |
|
| ----------- | ------------------------------------ |
|
||||||
| `name` | str | Name of the table to remove. |
|
| `name` | Name of the table to remove. ~~str~~ |
|
||||||
| **RETURNS** | [`Table`](/api/lookups#table) | The removed table. |
|
| **RETURNS** | The removed table. ~~Table~~ |
|
||||||
|
|
||||||
## Lookups.has_table {#has_table tag="method"}
|
## Lookups.has_table {#has_table tag="method"}
|
||||||
|
|
||||||
|
@ -144,10 +140,10 @@ Check if the lookups contain a table of a given name. Equivalent to
|
||||||
> assert lookups.has_table("some_table")
|
> assert lookups.has_table("some_table")
|
||||||
> ```
|
> ```
|
||||||
|
|
||||||
| Name | Type | Description |
|
| Name | Description |
|
||||||
| ----------- | ---- | ----------------------------------------------- |
|
| ----------- | -------------------------------------------------------- |
|
||||||
| `name` | str | Name of the table. |
|
| `name` | Name of the table. ~~str~~ |
|
||||||
| **RETURNS** | bool | Whether a table of that name is in the lookups. |
|
| **RETURNS** | Whether a table of that name is in the lookups. ~~bool~~ |
|
||||||
|
|
||||||
## Lookups.to_bytes {#to_bytes tag="method"}
|
## Lookups.to_bytes {#to_bytes tag="method"}
|
||||||
|
|
||||||
|
@ -159,9 +155,9 @@ Serialize the lookups to a bytestring.
|
||||||
> lookup_bytes = lookups.to_bytes()
|
> lookup_bytes = lookups.to_bytes()
|
||||||
> ```
|
> ```
|
||||||
|
|
||||||
| Name | Type | Description |
|
| Name | Description |
|
||||||
| ----------- | ----- | ----------------------- |
|
| ----------- | --------------------------------- |
|
||||||
| **RETURNS** | bytes | The serialized lookups. |
|
| **RETURNS** | The serialized lookups. ~~bytes~~ |
|
||||||
|
|
||||||
## Lookups.from_bytes {#from_bytes tag="method"}
|
## Lookups.from_bytes {#from_bytes tag="method"}
|
||||||
|
|
||||||
|
@ -175,10 +171,10 @@ Load the lookups from a bytestring.
|
||||||
> lookups.from_bytes(lookup_bytes)
|
> lookups.from_bytes(lookup_bytes)
|
||||||
> ```
|
> ```
|
||||||
|
|
||||||
| Name | Type | Description |
|
| Name | Description |
|
||||||
| ------------ | --------- | ---------------------- |
|
| ------------ | -------------------------------- |
|
||||||
| `bytes_data` | bytes | The data to load from. |
|
| `bytes_data` | The data to load from. ~~bytes~~ |
|
||||||
| **RETURNS** | `Lookups` | The loaded lookups. |
|
| **RETURNS** | The loaded lookups. ~~Lookups~~ |
|
||||||
|
|
||||||
## Lookups.to_disk {#to_disk tag="method"}
|
## Lookups.to_disk {#to_disk tag="method"}
|
||||||
|
|
||||||
|
@ -191,9 +187,9 @@ which will be created if it doesn't exist.
|
||||||
> lookups.to_disk("/path/to/lookups")
|
> lookups.to_disk("/path/to/lookups")
|
||||||
> ```
|
> ```
|
||||||
|
|
||||||
| Name | Type | Description |
|
| Name | Description |
|
||||||
| ------ | ------------ | --------------------------------------------------------------------------------------------------------------------- |
|
| ------ | ------------------------------------------------------------------------------------------------------------------------------------------ |
|
||||||
| `path` | str / `Path` | A path to a directory, which will be created if it doesn't exist. Paths may be either strings or `Path`-like objects. |
|
| `path` | A path to a directory, which will be created if it doesn't exist. Paths may be either strings or `Path`-like objects. ~~Union[str, Path]~~ |
|
||||||
|
|
||||||
 ## Lookups.from_disk {#from_disk tag="method"}

@@ -208,10 +204,10 @@ the file doesn't exist.
 > lookups.from_disk("/path/to/lookups")
 > ```

-| Name | Type | Description |
+| Name | Description |
-| ----------- | ------------ | -------------------------------------------------------------------------- |
+| ----------- | ----------------------------------------------------------------------------------------------- |
-| `path` | str / `Path` | A path to a directory. Paths may be either strings or `Path`-like objects. |
+| `path` | A path to a directory. Paths may be either strings or `Path`-like objects. ~~Union[str, Path]~~ |
-| **RETURNS** | `Lookups` | The loaded lookups. |
+| **RETURNS** | The loaded lookups. ~~Lookups~~ |

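The `to_disk`/`from_disk` and `to_bytes`/`from_bytes` pairs documented in the new-style tables above round-trip the whole container. A minimal sketch, assuming a spaCy v3 install (the table name `"lemma_lookup"` is just an example):

```python
from spacy.lookups import Lookups

# Build a Lookups container with one named table.
lookups = Lookups()
lookups.add_table("lemma_lookup", {"going": "go", "went": "go"})

# Serialize to bytes and restore into a fresh container;
# from_bytes returns the loaded Lookups object.
data = lookups.to_bytes()
restored = Lookups().from_bytes(data)

# String keys are hashed internally, but lookup by string still works.
assert restored.get_table("lemma_lookup")["went"] == "go"
```

The same round-trip works on disk via `lookups.to_disk(path)` and `Lookups().from_disk(path)`.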
 ## Table {#table tag="class, ordererddict"}

@@ -236,9 +232,9 @@ Initialize a new table.
 > assert table["foo"] == "bar"
 > ```

-| Name | Type | Description |
+| Name | Description |
-| ------ | ---- | ---------------------------------- |
+| ------ | ------------------------------------------ |
-| `name` | str | Optional table name for reference. |
+| `name` | Optional table name for reference. ~~str~~ |

 ### Table.from_dict {#table.from_dict tag="classmethod"}

@@ -252,11 +248,11 @@ Initialize a new table from a dict.
 > table = Table.from_dict(data, name="some_table")
 > ```

-| Name | Type | Description |
+| Name | Description |
-| ----------- | ------- | ---------------------------------- |
+| ----------- | ------------------------------------------ |
-| `data` | dict | The dictionary. |
+| `data` | The dictionary. ~~dict~~ |
-| `name` | str | Optional table name for reference. |
+| `name` | Optional table name for reference. ~~str~~ |
-| **RETURNS** | `Table` | The newly constructed object. |
+| **RETURNS** | The newly constructed object. ~~Table~~ |

 ### Table.set {#table.set tag="method"}

@@ -272,10 +268,10 @@ Set a new key / value pair. String keys will be hashed. Same as
 > assert table["foo"] == "bar"
 > ```

-| Name | Type | Description |
+| Name | Description |
-| ------- | --------- | ----------- |
+| ------- | ---------------------------- |
-| `key` | str / int | The key. |
+| `key` | The key. ~~Union[str, int]~~ |
-| `value` | - | The value. |
+| `value` | The value. |

 ### Table.to_bytes {#table.to_bytes tag="method"}

@@ -287,9 +283,9 @@ Serialize the table to a bytestring.
 > table_bytes = table.to_bytes()
 > ```

-| Name | Type | Description |
+| Name | Description |
-| ----------- | ----- | --------------------- |
+| ----------- | ------------------------------- |
-| **RETURNS** | bytes | The serialized table. |
+| **RETURNS** | The serialized table. ~~bytes~~ |

 ### Table.from_bytes {#table.from_bytes tag="method"}

@@ -303,15 +299,15 @@ Load a table from a bytestring.
 > table.from_bytes(table_bytes)
 > ```

-| Name | Type | Description |
+| Name | Description |
-| ------------ | ------- | ----------------- |
+| ------------ | --------------------------- |
-| `bytes_data` | bytes | The data to load. |
+| `bytes_data` | The data to load. ~~bytes~~ |
-| **RETURNS** | `Table` | The loaded table. |
+| **RETURNS** | The loaded table. ~~Table~~ |

 ### Attributes {#table-attributes}

-| Name | Type | Description |
+| Name | Description |
-| -------------- | --------------------------- | ----------------------------------------------------- |
+| -------------- | ------------------------------------------------------------- |
-| `name` | str | Table name. |
+| `name` | Table name. ~~str~~ |
-| `default_size` | int | Default size of bloom filters if no data is provided. |
+| `default_size` | Default size of bloom filters if no data is provided. ~~int~~ |
-| `bloom` | `preshed.bloom.BloomFilter` | The bloom filters. |
+| `bloom` | The bloom filters. ~~preshed.BloomFilter~~ |

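The `Table` methods covered by the hunks above compose naturally: construct from a dict, mutate with `set`, and round-trip through bytes. A sketch, assuming spaCy v3 is installed (the table name `"demo"` is arbitrary):

```python
from spacy.lookups import Table

# Construct from a dict, then add an entry; string keys are hashed internally.
table = Table.from_dict({"foo": "bar"}, name="demo")
table.set("baz", 100)
assert table["foo"] == "bar"
assert table["baz"] == 100

# Bytes round-trip via to_bytes / from_bytes; from_bytes returns the loaded table.
restored = Table().from_bytes(table.to_bytes())
assert restored["foo"] == "bar"
```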
@@ -30,20 +30,20 @@ pattern keys correspond to a number of
 [`Token` attributes](/api/token#attributes). The supported attributes for
 rule-based matching are:

-| Attribute | Type | Description |
+| Attribute | Description |
-| -------------------------------------- | ---- | ------------------------------------------------------------------------------------------------------ |
+| -------------------------------------- | ------------------------------------------------------------------------------------------------------------------------- |
-| `ORTH` | str | The exact verbatim text of a token. |
+| `ORTH` | The exact verbatim text of a token. ~~str~~ |
-| `TEXT` <Tag variant="new">2.1</Tag> | str | The exact verbatim text of a token. |
+| `TEXT` <Tag variant="new">2.1</Tag> | The exact verbatim text of a token. ~~str~~ |
-| `LOWER` | str | The lowercase form of the token text. |
+| `LOWER` | The lowercase form of the token text. ~~str~~ |
-| `LENGTH` | int | The length of the token text. |
+| `LENGTH` | The length of the token text. ~~int~~ |
-| `IS_ALPHA`, `IS_ASCII`, `IS_DIGIT` | bool | Token text consists of alphabetic characters, ASCII characters, digits. |
+| `IS_ALPHA`, `IS_ASCII`, `IS_DIGIT` | Token text consists of alphabetic characters, ASCII characters, digits. ~~bool~~ |
-| `IS_LOWER`, `IS_UPPER`, `IS_TITLE` | bool | Token text is in lowercase, uppercase, titlecase. |
+| `IS_LOWER`, `IS_UPPER`, `IS_TITLE` | Token text is in lowercase, uppercase, titlecase. ~~bool~~ |
-| `IS_PUNCT`, `IS_SPACE`, `IS_STOP` | bool | Token is punctuation, whitespace, stop word. |
+| `IS_PUNCT`, `IS_SPACE`, `IS_STOP` | Token is punctuation, whitespace, stop word. ~~bool~~ |
-| `LIKE_NUM`, `LIKE_URL`, `LIKE_EMAIL` | bool | Token text resembles a number, URL, email. |
+| `LIKE_NUM`, `LIKE_URL`, `LIKE_EMAIL` | Token text resembles a number, URL, email. ~~bool~~ |
-| `POS`, `TAG`, `DEP`, `LEMMA`, `SHAPE` | str | The token's simple and extended part-of-speech tag, dependency label, lemma, shape. |
+| `POS`, `TAG`, `DEP`, `LEMMA`, `SHAPE` | The token's simple and extended part-of-speech tag, dependency label, lemma, shape. ~~str~~ |
-| `ENT_TYPE` | str | The token's entity label. |
+| `ENT_TYPE` | The token's entity label. ~~str~~ |
-| `_` <Tag variant="new">2.1</Tag> | dict | Properties in [custom extension attributes](/usage/processing-pipelines#custom-components-attributes). |
+| `_` <Tag variant="new">2.1</Tag> | Properties in [custom extension attributes](/usage/processing-pipelines#custom-components-attributes). ~~Dict[str, Any]~~ |
-| `OP` | str | Operator or quantifier to determine how often to match a token pattern. |
+| `OP` | Operator or quantifier to determine how often to match a token pattern. ~~str~~ |

 Operators and quantifiers define **how often** a token pattern should be
 matched:

@@ -75,11 +75,11 @@ it compares to another value.
 > ]
 > ```

-| Attribute | Type | Description |
+| Attribute | Description |
-| -------------------------- | ---------- | --------------------------------------------------------------------------------- |
+| -------------------------- | ------------------------------------------------------------------------------------------------------- |
-| `IN` | any | Attribute value is member of a list. |
+| `IN` | Attribute value is member of a list. ~~Any~~ |
-| `NOT_IN` | any | Attribute value is _not_ member of a list. |
+| `NOT_IN` | Attribute value is _not_ member of a list. ~~Any~~ |
-| `==`, `>=`, `<=`, `>`, `<` | int, float | Attribute value is equal, greater or equal, smaller or equal, greater or smaller. |
+| `==`, `>=`, `<=`, `>`, `<` | Attribute value is equal, greater or equal, smaller or equal, greater or smaller. ~~Union[int, float]~~ |

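The rich comparison attributes in the table above nest inside ordinary token attributes in a pattern. A sketch against spaCy v3 (the rule name and texts are arbitrary):

```python
import spacy
from spacy.matcher import Matcher

nlp = spacy.blank("en")
matcher = Matcher(nlp.vocab)

# Match a greeting word followed by "!" or "." using the IN comparison.
pattern = [{"LOWER": {"IN": ["hello", "hi"]}}, {"ORTH": {"IN": ["!", "."]}}]
matcher.add("Greeting", [pattern])

doc = nlp("Hi! Hello there.")
matches = matcher(doc)
# Only "Hi!" matches: "Hello" is followed by "there", not punctuation.
assert len(matches) == 1
assert doc[matches[0][1]:matches[0][2]].text == "Hi!"
```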
 ## Matcher.\_\_init\_\_ {#init tag="method"}

@@ -95,10 +95,10 @@ string where an integer is expected) or unexpected property names.
 > matcher = Matcher(nlp.vocab)
 > ```

-| Name | Type | Description |
+| Name | Description |
-| --------------------------------------- | ------- | ------------------------------------------------------------------------------------------- |
+| --------------------------------------- | ----------------------------------------------------------------------------------------------------- |
-| `vocab` | `Vocab` | The vocabulary object, which must be shared with the documents the matcher will operate on. |
+| `vocab` | The vocabulary object, which must be shared with the documents the matcher will operate on. ~~Vocab~~ |
-| `validate` <Tag variant="new">2.1</Tag> | bool | Validate all patterns added to this matcher. |
+| `validate` <Tag variant="new">2.1</Tag> | Validate all patterns added to this matcher. ~~bool~~ |

 ## Matcher.\_\_call\_\_ {#call tag="method"}

@@ -116,10 +116,10 @@ Find all token sequences matching the supplied patterns on the `Doc` or `Span`.
 > matches = matcher(doc)
 > ```

-| Name | Type | Description |
+| Name | Description |
-| ----------- | ------------ | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------ |
+| ----------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
-| `doclike` | `Doc`/`Span` | The `Doc` or `Span` to match over. |
+| `doclike` | The `Doc` or `Span` to match over. ~~Union[Doc, Span]~~ |
-| **RETURNS** | list | A list of `(match_id, start, end)` tuples, describing the matches. A match tuple describes a span `doc[start:end]`. The `match_id` is the ID of the added match pattern. |
+| **RETURNS** | A list of `(match_id, start, end)` tuples, describing the matches. A match tuple describes a span `doc[start:end]`. The `match_id` is the ID of the added match pattern. ~~List[Tuple[int, int, int]]~~ |

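Putting `__init__` and `__call__` together, a sketch assuming spaCy v3 (the rule name `"HelloWorld"` is just an example):

```python
import spacy
from spacy.matcher import Matcher

nlp = spacy.blank("en")
matcher = Matcher(nlp.vocab)  # vocab must be shared with the docs it matches over
matcher.add("HelloWorld", [[{"LOWER": "hello"}, {"IS_PUNCT": True}, {"LOWER": "world"}]])

doc = nlp("Hello, world!")
for match_id, start, end in matcher(doc):
    # match_id is the hash of the rule name; resolve it via the StringStore.
    assert nlp.vocab.strings[match_id] == "HelloWorld"
    assert doc[start:end].text == "Hello, world"
```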
 ## Matcher.pipe {#pipe tag="method"}

@@ -134,13 +134,13 @@ Match a stream of documents, yielding them in turn.
 > pass
 > ```

-| Name | Type | Description |
+| Name | Description |
-| --------------------------------------------- | -------- | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
+| --------------------------------------------- | ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
-| `docs` | iterable | A stream of documents or spans. |
+| `docs` | A stream of documents or spans. ~~Iterable[Union[Doc, Span]]~~ |
-| `batch_size` | int | The number of documents to accumulate into a working set. |
+| `batch_size` | The number of documents to accumulate into a working set. ~~int~~ |
-| `return_matches` <Tag variant="new">2.1</Tag> | bool | Yield the match lists along with the docs, making results `(doc, matches)` tuples. |
+| `return_matches` <Tag variant="new">2.1</Tag> | Yield the match lists along with the docs, making results `(doc, matches)` tuples. ~~bool~~ |
-| `as_tuples` | bool | Interpret the input stream as `(doc, context)` tuples, and yield `(result, context)` tuples out. If both `return_matches` and `as_tuples` are `True`, the output will be a sequence of `((doc, matches), context)` tuples. |
+| `as_tuples` | Interpret the input stream as `(doc, context)` tuples, and yield `(result, context)` tuples out. If both `return_matches` and `as_tuples` are `True`, the output will be a sequence of `((doc, matches), context)` tuples. ~~bool~~ |
-| **YIELDS** | `Doc` | Documents, in order. |
+| **YIELDS** | Documents, in order. ~~Union[Doc, Tuple[Doc, Any], Tuple[Tuple[Doc, Any], Any]]~~ |

 ## Matcher.\_\_len\_\_ {#len tag="method" new="2"}

@@ -157,9 +157,9 @@ patterns.
 > assert len(matcher) == 1
 > ```

-| Name | Type | Description |
+| Name | Description |
-| ----------- | ---- | -------------------- |
+| ----------- | ---------------------------- |
-| **RETURNS** | int | The number of rules. |
+| **RETURNS** | The number of rules. ~~int~~ |

 ## Matcher.\_\_contains\_\_ {#contains tag="method" new="2"}

@@ -174,10 +174,10 @@ Check whether the matcher contains rules for a match ID.
 > assert "Rule" in matcher
 > ```

-| Name | Type | Description |
+| Name | Description |
-| ----------- | ---- | ----------------------------------------------------- |
+| ----------- | -------------------------------------------------------------- |
-| `key` | str | The match ID. |
+| `key` | The match ID. ~~str~~ |
-| **RETURNS** | bool | Whether the matcher contains rules for this match ID. |
+| **RETURNS** | Whether the matcher contains rules for this match ID. ~~bool~~ |

 ## Matcher.add {#add tag="method" new="2"}

@@ -217,13 +217,13 @@ patterns = [[{"TEXT": "Google"}, {"TEXT": "Now"}], [{"TEXT": "GoogleNow"}]]

 </Infobox>

-| Name | Type | Description |
+| Name | Description |
-| ----------------------------------- | ------------------ | --------------------------------------------------------------------------------------------- |
+| ----------------------------------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------- |
-| `match_id` | str | An ID for the thing you're matching. |
+| `match_id` | An ID for the thing you're matching. ~~str~~ |
-| `patterns` | `List[List[dict]]` | Match pattern. A pattern consists of a list of dicts, where each dict describes a token. |
+| `patterns` | Match pattern. A pattern consists of a list of dicts, where each dict describes a token. ~~List[List[Dict[str, Any]]]~~ |
-| _keyword-only_ | | |
+| _keyword-only_ | |
-| `on_match` | callable / `None` | Callback function to act on matches. Takes the arguments `matcher`, `doc`, `i` and `matches`. |
+| `on_match` | Callback function to act on matches. Takes the arguments `matcher`, `doc`, `i` and `matches`. ~~Optional[Callable[[Matcher, Doc, int, List[tuple]], Any]]~~ |
-| `greedy` <Tag variant="new">3</Tag> | str | Optional filter for greedy matches. Can either be `"FIRST"` or `"LONGEST"`. |
+| `greedy` <Tag variant="new">3</Tag> | Optional filter for greedy matches. Can either be `"FIRST"` or `"LONGEST"`. ~~Optional[str]~~ |

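The `on_match` callback signature documented above can be exercised like this (a sketch; the rule name and example text are arbitrary):

```python
import spacy
from spacy.matcher import Matcher

nlp = spacy.blank("en")
matcher = Matcher(nlp.vocab)
hits = []

def on_match(matcher, doc, i, matches):
    # Called once per match; `i` indexes into the full `matches` list.
    _, start, end = matches[i]
    hits.append(doc[start:end].text)

matcher.add("NUMBERS", [[{"LIKE_NUM": True}]], on_match=on_match)
matcher(nlp("1 plus 2 equals 3"))
assert hits == ["1", "2", "3"]
```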
 ## Matcher.remove {#remove tag="method" new="2"}

@@ -239,9 +239,9 @@ exist.
 > assert "Rule" not in matcher
 > ```

-| Name | Type | Description |
+| Name | Description |
-| ----- | ---- | ------------------------- |
+| ----- | --------------------------------- |
-| `key` | str | The ID of the match rule. |
+| `key` | The ID of the match rule. ~~str~~ |

 ## Matcher.get {#get tag="method" new="2"}

@@ -255,7 +255,7 @@ Retrieve the pattern stored for a key. Returns the rule as an
 > on_match, patterns = matcher.get("Rule")
 > ```

-| Name | Type | Description |
+| Name | Description |
-| ----------- | ----- | --------------------------------------------- |
+| ----------- | --------------------------------------------------------------------------------------------- |
-| `key` | str | The ID of the match rule. |
+| `key` | The ID of the match rule. ~~str~~ |
-| **RETURNS** | tuple | The rule, as an `(on_match, patterns)` tuple. |
+| **RETURNS** | The rule, as an `(on_match, patterns)` tuple. ~~Tuple[Optional[Callable], List[List[dict]]]~~ |

@@ -1,142 +0,0 @@
----
-title: MorphAnalysis
-tag: class
-source: spacy/tokens/morphanalysis.pyx
----
-
-Stores a single morphological analysis.
-
-## MorphAnalysis.\_\_init\_\_ {#init tag="method"}
-
-Initialize a MorphAnalysis object from a UD FEATS string or a dictionary of
-morphological features.
-
-> #### Example
->
-> ```python
-> from spacy.tokens import MorphAnalysis
->
-> feats = "Feat1=Val1|Feat2=Val2"
-> m = MorphAnalysis(nlp.vocab, feats)
-> ```
-
-| Name | Type | Description |
-| ---------- | ------------------ | --------------------------- |
-| `vocab` | `Vocab` | The vocab. |
-| `features` | `Union[Dict, str]` | The morphological features. |
-
-## MorphAnalysis.\_\_contains\_\_ {#contains tag="method"}
-
-Whether a feature/value pair is in the analysis.
-
-> #### Example
->
-> ```python
-> feats = "Feat1=Val1,Val2|Feat2=Val2"
-> morph = MorphAnalysis(nlp.vocab, feats)
-> assert "Feat1=Val1" in morph
-> ```
-
-| Name | Type | Description |
-| ----------- | ----- | ------------------------------------- |
-| **RETURNS** | `str` | A feature/value pair in the analysis. |
-
-## MorphAnalysis.\_\_iter\_\_ {#iter tag="method"}
-
-Iterate over the feature/value pairs in the analysis.
-
-> #### Example
->
-> ```python
-> feats = "Feat1=Val1,Val3|Feat2=Val2"
-> morph = MorphAnalysis(nlp.vocab, feats)
-> assert list(morph) == ["Feat1=Va1", "Feat1=Val3", "Feat2=Val2"]
-> ```
-
-| Name | Type | Description |
-| ---------- | ----- | ------------------------------------- |
-| **YIELDS** | `str` | A feature/value pair in the analysis. |
-
-## MorphAnalysis.\_\_len\_\_ {#len tag="method"}
-
-Returns the number of features in the analysis.
-
-> #### Example
->
-> ```python
-> feats = "Feat1=Val1,Val2|Feat2=Val2"
-> morph = MorphAnalysis(nlp.vocab, feats)
-> assert len(morph) == 3
-> ```
-
-| Name | Type | Description |
-| ----------- | ----- | --------------------------------------- |
-| **RETURNS** | `int` | The number of features in the analysis. |
-
-## MorphAnalysis.\_\_str\_\_ {#str tag="method"}
-
-Returns the morphological analysis in the UD FEATS string format.
-
-> #### Example
->
-> ```python
-> feats = "Feat1=Val1,Val2|Feat2=Val2"
-> morph = MorphAnalysis(nlp.vocab, feats)
-> assert str(morph) == feats
-> ```
-
-| Name | Type | Description |
-| ----------- | ----- | -------------------------------- |
-| **RETURNS** | `str` | The analysis in UD FEATS format. |
-
-## MorphAnalysis.get {#get tag="method"}
-
-Retrieve values for a feature by field.
-
-> #### Example
->
-> ```python
-> feats = "Feat1=Val1,Val2"
-> morph = MorphAnalysis(nlp.vocab, feats)
-> assert morph.get("Feat1") == ["Val1", "Val2"]
-> ```
-
-| Name | Type | Description |
-| ----------- | ------ | ---------------------------------- |
-| `field` | `str` | The field to retrieve. |
-| **RETURNS** | `list` | A list of the individual features. |
-
-## MorphAnalysis.to_dict {#to_dict tag="method"}
-
-Produce a dict representation of the analysis, in the same format as the tag
-map.
-
-> #### Example
->
-> ```python
-> feats = "Feat1=Val1,Val2|Feat2=Val2"
-> morph = MorphAnalysis(nlp.vocab, feats)
-> assert morph.to_dict() == {"Feat1": "Val1,Val2", "Feat2": "Val2"}
-> ```
-
-| Name | Type | Description |
-| ----------- | ------ | ---------------------------------------- |
-| **RETURNS** | `dict` | The dict representation of the analysis. |
-
-## MorphAnalysis.from_id {#from_id tag="classmethod"}
-
-Create a morphological analysis from a given hash ID.
-
-> #### Example
->
-> ```python
-> feats = "Feat1=Val1|Feat2=Val2"
-> hash = nlp.vocab.strings[feats]
-> morph = MorphAnalysis.from_id(nlp.vocab, hash)
-> assert str(morph) == feats
-> ```
-
-| Name | Type | Description |
-| ------- | ------- | -------------------------------- |
-| `vocab` | `Vocab` | The vocab. |
-| `key` | `int` | The hash of the features string. |

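Although this commit removes the page, the `MorphAnalysis` class itself remains part of spaCy, and the deleted examples above condense into one runnable sketch (assuming spaCy v3; no trained pipeline is needed, a bare `Vocab` suffices):

```python
from spacy.vocab import Vocab
from spacy.tokens import MorphAnalysis

feats = "Feat1=Val1,Val2|Feat2=Val2"
morph = MorphAnalysis(Vocab(), feats)

assert "Feat1=Val1" in morph  # membership is per feature/value pair
assert len(morph) == 3        # three feature/value pairs in total
assert str(morph) == feats    # round-trips to the UD FEATS format
assert morph.get("Feat1") == ["Val1", "Val2"]
assert morph.to_dict() == {"Feat1": "Val1,Val2", "Feat2": "Val2"}
```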
|
@ -32,9 +32,9 @@ architectures and their arguments and hyperparameters.
|
||||||
> nlp.add_pipe("morphologizer", config=config)
|
> nlp.add_pipe("morphologizer", config=config)
|
||||||
> ```
|
> ```
|
||||||
|
|
||||||
| Setting | Type | Description | Default |
|
| Setting | Description |
|
||||||
| ------- | ------------------------------------------ | ----------------- | ----------------------------------- |
|
| ------- | ------------------------------------------------------------------------------------------------------- |
|
||||||
| `model` | [`Model`](https://thinc.ai/docs/api-model) | The model to use. | [Tagger](/api/architectures#Tagger) |
|
| `model` | The model to use. Defaults to [Tagger](/api/architectures#Tagger). ~~Model[List[Doc], List[Floats2d]]~~ |
|
||||||
|
|
||||||
```python
|
```python
|
||||||
https://github.com/explosion/spaCy/blob/develop/spacy/pipeline/morphologizer.pyx
|
https://github.com/explosion/spaCy/blob/develop/spacy/pipeline/morphologizer.pyx
|
||||||
|
@ -42,7 +42,9 @@ https://github.com/explosion/spaCy/blob/develop/spacy/pipeline/morphologizer.pyx
|
||||||
|
|
||||||
## Morphologizer.\_\_init\_\_ {#init tag="method"}
|
## Morphologizer.\_\_init\_\_ {#init tag="method"}
|
||||||
|
|
||||||
Initialize the morphologizer.
|
Create a new pipeline instance. In your application, you would normally use a
|
||||||
|
shortcut for this and instantiate the component using its string name and
|
||||||
|
[`nlp.add_pipe`](/api/language#add_pipe).
|
||||||
|
|
||||||
> #### Example
|
> #### Example
|
||||||
>
|
>
|
||||||
|
@ -59,18 +61,14 @@ Initialize the morphologizer.
|
||||||
> morphologizer = Morphologizer(nlp.vocab, model)
|
> morphologizer = Morphologizer(nlp.vocab, model)
|
||||||
> ```
|
> ```
|
||||||
|
|
||||||
Create a new pipeline instance. In your application, you would normally use a
|
| Name | Description |
|
||||||
shortcut for this and instantiate the component using its string name and
|
| -------------- | -------------------------------------------------------------------------------------------------------------------- |
|
||||||
[`nlp.add_pipe`](/api/language#add_pipe).
|
| `vocab` | The shared vocabulary. ~~Vocab~~ |
|
||||||
|
| `model` | The [`Model`](https://thinc.ai/docs/api-model) powering the pipeline component. ~~Model[List[Doc], List[Floats2d]]~~ |
|
||||||
| Name | Type | Description |
|
| `name` | String name of the component instance. Used to add entries to the `losses` during training. ~~str~~ |
|
||||||
| -------------- | ------- | ------------------------------------------------------------------------------------------- |
|
| _keyword-only_ | |
|
||||||
| `vocab` | `Vocab` | The shared vocabulary. |
|
| `labels_morph` | Mapping of morph + POS tags to morph labels. ~~Dict[str, str]~~ |
|
||||||
| `model` | `Model` | The [`Model`](https://thinc.ai/docs/api-model) powering the pipeline component. |
|
| `labels_pos` | Mapping of morph + POS tags to POS tags. ~~Dict[str, str]~~ |
|
||||||
| `name` | str | String name of the component instance. Used to add entries to the `losses` during training. |
|
|
||||||
| _keyword-only_ | | |
|
|
||||||
| `labels_morph` | dict | Mapping of morph + POS tags to morph labels. |
|
|
||||||
| `labels_pos` | dict | Mapping of morph + POS tags to POS tags. |
|
|
||||||
|
|
||||||
## Morphologizer.\_\_call\_\_ {#call tag="method"}
|
## Morphologizer.\_\_call\_\_ {#call tag="method"}
|
||||||
|
|
||||||
|
@ -90,10 +88,10 @@ delegate to the [`predict`](/api/morphologizer#predict) and
|
||||||
> processed = morphologizer(doc)
|
> processed = morphologizer(doc)
|
||||||
> ```
|
> ```
|
||||||
|
|
||||||
| Name | Type | Description |
|
| Name | Description |
|
||||||
| ----------- | ----- | ------------------------ |
|
| ----------- | -------------------------------- |
|
||||||
| `doc` | `Doc` | The document to process. |
|
| `doc` | The document to process. ~~Doc~~ |
|
||||||
| **RETURNS** | `Doc` | The processed document. |
|
| **RETURNS** | The processed document. ~~Doc~~ |
|
||||||
|
|
||||||
## Morphologizer.pipe {#pipe tag="method"}
|
## Morphologizer.pipe {#pipe tag="method"}
|
||||||
|
|
||||||
|
@ -112,12 +110,12 @@ applied to the `Doc` in order. Both [`__call__`](/api/morphologizer#call) and
|
||||||
> pass
|
> pass
|
||||||
> ```
|
> ```
|
||||||
|
|
||||||
| Name | Type | Description |
|
| Name | Description |
|
||||||
| -------------- | --------------- | ------------------------------------------------------ |
|
| -------------- | ------------------------------------------------------------- |
|
||||||
| `stream` | `Iterable[Doc]` | A stream of documents. |
|
| `stream` | A stream of documents. ~~Iterable[Doc]~~ |
|
||||||
| _keyword-only_ | | |
|
| _keyword-only_ | |
|
||||||
| `batch_size` | int | The number of texts to buffer. Defaults to `128`. |
|
| `batch_size` | The number of documents to buffer. Defaults to `128`. ~~int~~ |
|
||||||
| **YIELDS** | `Doc` | Processed documents in the order of the original text. |
|
| **YIELDS** | The processed documents in order. ~~Doc~~ |
|
||||||
|
|
||||||
## Morphologizer.begin_training {#begin_training tag="method"}
|
## Morphologizer.begin_training {#begin_training tag="method"}
|
||||||
|
|
||||||
|
@ -138,13 +136,13 @@ setting up the label scheme based on the data.
|
||||||
> optimizer = morphologizer.begin_training(lambda: [], pipeline=nlp.pipeline)
|
> optimizer = morphologizer.begin_training(lambda: [], pipeline=nlp.pipeline)
|
||||||
> ```
|
> ```
|
||||||
|
|
||||||
| Name | Type | Description |
|
| Name | Description |
|
||||||
| -------------- | --------------------------------------------------- | --------------------------------------------------------------------------------------------------------------------- |
|
| -------------- | ------------------------------------------------------------------------------------------------------------------------------------- |
|
||||||
| `get_examples` | `Callable[[], Iterable[Example]]` | Optional function that returns gold-standard annotations in the form of [`Example`](/api/example) objects. |
|
| `get_examples` | Function that returns gold-standard annotations in the form of [`Example`](/api/example) objects. ~~Callable[[], Iterable[Example]]~~ |
|
||||||
| _keyword-only_ | | |
|
| _keyword-only_ | | |
|
||||||
| `pipeline` | `List[Tuple[str, Callable]]` | Optional list of pipeline components that this component is part of. |
|
| `pipeline` | Optional list of pipeline components that this component is part of. ~~Optional[List[Tuple[str, Callable[[Doc], Doc]]]]~~ |
|
||||||
| `sgd` | [`Optimizer`](https://thinc.ai/docs/api-optimizers) | An optional optimizer. Will be created via [`create_optimizer`](/api/sentencerecognizer#create_optimizer) if not set. |
|
| `sgd` | An optimizer. Will be created via [`create_optimizer`](#create_optimizer) if not set. ~~Optional[Optimizer]~~ |
|
||||||
| **RETURNS** | [`Optimizer`](https://thinc.ai/docs/api-optimizers) | The optimizer. |
|
| **RETURNS** | The optimizer. ~~Optimizer~~ |
|
||||||
|
|
||||||
## Morphologizer.predict {#predict tag="method"}
|
## Morphologizer.predict {#predict tag="method"}
|
||||||
|
|
||||||
|
@ -158,10 +156,10 @@ modifying them.
|
||||||
> scores = morphologizer.predict([doc1, doc2])
|
> scores = morphologizer.predict([doc1, doc2])
|
||||||
> ```
|
> ```
|
||||||
|
|
||||||
| Name | Type | Description |
|
| Name | Description |
|
||||||
| ----------- | --------------- | ----------------------------------------- |
|
| ----------- | ------------------------------------------- |
|
||||||
| `docs` | `Iterable[Doc]` | The documents to predict. |
|
| `docs` | The documents to predict. ~~Iterable[Doc]~~ |
|
||||||
| **RETURNS** | - | The model's prediction for each document. |
|
| **RETURNS** | The model's prediction for each document. |
|
||||||
|
|
||||||
## Morphologizer.set_annotations {#set_annotations tag="method"}
|
## Morphologizer.set_annotations {#set_annotations tag="method"}
|
||||||
|
|
||||||
|
@ -175,10 +173,10 @@ Modify a batch of [`Doc`](/api/doc) objects, using pre-computed scores.
|
||||||
> morphologizer.set_annotations([doc1, doc2], scores)
|
> morphologizer.set_annotations([doc1, doc2], scores)
|
||||||
> ```
|
> ```
|
||||||
|
|
||||||
| Name | Type | Description |
|
| Name | Description |
|
||||||
| -------- | --------------- | ------------------------------------------------------- |
|
| -------- | ------------------------------------------------------- |
|
||||||
| `docs` | `Iterable[Doc]` | The documents to modify. |
|
| `docs` | The documents to modify. ~~Iterable[Doc]~~ |
|
||||||
| `scores` | - | The scores to set, produced by `Morphologizer.predict`. |
|
| `scores` | The scores to set, produced by `Morphologizer.predict`. |
|
||||||
|
|
||||||
## Morphologizer.update {#update tag="method"}
|
## Morphologizer.update {#update tag="method"}
|
||||||
|
|
||||||
|
@ -195,15 +193,15 @@ Delegates to [`predict`](/api/morphologizer#predict) and
|
||||||
> losses = morphologizer.update(examples, sgd=optimizer)
|
> losses = morphologizer.update(examples, sgd=optimizer)
|
||||||
> ```
|
> ```
|
||||||
|
|
||||||
| Name | Type | Description |
|
| Name | Description |
|
||||||
| ----------------- | --------------------------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------ |
|
| ----------------- | ---------------------------------------------------------------------------------------------------------------------------------- |
|
||||||
| `examples` | `Iterable[Example]` | A batch of [`Example`](/api/example) objects to learn from. |
|
| `examples` | A batch of [`Example`](/api/example) objects to learn from. ~~Iterable[Example]~~ |
|
||||||
| _keyword-only_ | | |
|
| _keyword-only_ |  |
|
||||||
| `drop` | float | The dropout rate. |
|
| `drop` | The dropout rate. ~~float~~ |
|
||||||
| `set_annotations` | bool | Whether or not to update the `Example` objects with the predictions, delegating to [`set_annotations`](/api/sentencerecognizer#set_annotations). |
|
| `set_annotations` | Whether or not to update the `Example` objects with the predictions, delegating to [`set_annotations`](#set_annotations). ~~bool~~ |
|
||||||
| `sgd` | [`Optimizer`](https://thinc.ai/docs/api-optimizers) | The optimizer. |
|
| `sgd` | An optimizer. Will be created via [`create_optimizer`](#create_optimizer) if not set. ~~Optional[Optimizer]~~ |
|
||||||
| `losses` | `Dict[str, float]` | Optional record of the loss during training. The value keyed by the model's name is updated. |
|
| `losses` | Optional record of the loss during training. Updated using the component name as the key. ~~Optional[Dict[str, float]]~~ |
|
||||||
| **RETURNS** | `Dict[str, float]` | The updated `losses` dictionary. |
|
| **RETURNS** | The updated `losses` dictionary. ~~Dict[str, float]~~ |
|
||||||
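The `losses` bookkeeping described in the table above follows a common pattern: each component accumulates its batch loss in a shared dictionary keyed by its own name. A minimal pure-Python sketch of that pattern (illustrative only, not spaCy's implementation; `update_sketch` is a hypothetical name):

```python
def update_sketch(name, batch_loss, losses=None):
    # Record a batch loss in a shared dict, keyed by the component's name.
    # Mirrors the `losses` semantics described in the table above.
    if losses is None:
        losses = {}
    losses.setdefault(name, 0.0)
    losses[name] += batch_loss
    return losses

losses = {}
for batch_loss in [0.5, 0.25]:
    losses = update_sketch("morphologizer", batch_loss, losses)
assert losses == {"morphologizer": 0.75}
```

Passing the same dict back in on each call is what lets several components share one record of the loss during training.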
|
|
||||||
## Morphologizer.get_loss {#get_loss tag="method"}
|
## Morphologizer.get_loss {#get_loss tag="method"}
|
||||||
|
|
||||||
|
@ -218,11 +216,11 @@ predicted scores.
|
||||||
> loss, d_loss = morphologizer.get_loss(examples, scores)
|
> loss, d_loss = morphologizer.get_loss(examples, scores)
|
||||||
> ```
|
> ```
|
||||||
|
|
||||||
| Name | Type | Description |
|
| Name | Description |
|
||||||
| ----------- | --------------------- | --------------------------------------------------- |
|
| ----------- | --------------------------------------------------------------------------- |
|
||||||
| `examples` | `Iterable[Example]` | The batch of examples. |
|
| `examples` | The batch of examples. ~~Iterable[Example]~~ |
|
||||||
| `scores` | - | Scores representing the model's predictions. |
|
| `scores` | Scores representing the model's predictions. |
|
||||||
| **RETURNS** | `Tuple[float, float]` | The loss and the gradient, i.e. `(loss, gradient)`. |
|
| **RETURNS** | The loss and the gradient, i.e. `(loss, gradient)`. ~~Tuple[float, float]~~ |
|
||||||
|
|
||||||
## Morphologizer.create_optimizer {#create_optimizer tag="method"}
|
## Morphologizer.create_optimizer {#create_optimizer tag="method"}
|
||||||
|
|
||||||
|
@ -235,9 +233,9 @@ Create an optimizer for the pipeline component.
|
||||||
> optimizer = morphologizer.create_optimizer()
|
> optimizer = morphologizer.create_optimizer()
|
||||||
> ```
|
> ```
|
||||||
|
|
||||||
| Name | Type | Description |
|
| Name | Description |
|
||||||
| ----------- | --------------------------------------------------- | -------------- |
|
| ----------- | ---------------------------- |
|
||||||
| **RETURNS** | [`Optimizer`](https://thinc.ai/docs/api-optimizers) | The optimizer. |
|
| **RETURNS** | The optimizer. ~~Optimizer~~ |
|
||||||
|
|
||||||
## Morphologizer.use_params {#use_params tag="method, contextmanager"}
|
## Morphologizer.use_params {#use_params tag="method, contextmanager"}
|
||||||
|
|
||||||
|
@ -252,9 +250,9 @@ context, the original parameters are restored.
|
||||||
> morphologizer.to_disk("/best_model")
|
> morphologizer.to_disk("/best_model")
|
||||||
> ```
|
> ```
|
||||||
|
|
||||||
| Name | Type | Description |
|
| Name | Description |
|
||||||
| -------- | ---- | ----------------------------------------- |
|
| -------- | -------------------------------------------------- |
|
||||||
| `params` | dict | The parameter values to use in the model. |
|
| `params` | The parameter values to use in the model. ~~dict~~ |
|
||||||
|
|
||||||
## Morphologizer.add_label {#add_label tag="method"}
|
## Morphologizer.add_label {#add_label tag="method"}
|
||||||
|
|
||||||
|
@ -268,10 +266,10 @@ both `pos` and `morph`, the label should include the UPOS as the feature `POS`.
|
||||||
> morphologizer.add_label("Mood=Ind|POS=VERB|Tense=Past|VerbForm=Fin")
|
> morphologizer.add_label("Mood=Ind|POS=VERB|Tense=Past|VerbForm=Fin")
|
||||||
> ```
|
> ```
|
||||||
|
|
||||||
| Name | Type | Description |
|
| Name | Description |
|
||||||
| ----------- | ---- | --------------------------------------------------- |
|
| ----------- | ----------------------------------------------------------- |
|
||||||
| `label` | str | The label to add. |
|
| `label` | The label to add. ~~str~~ |
|
||||||
| **RETURNS** | int | `0` if the label is already present, otherwise `1`. |
|
| **RETURNS** | `0` if the label is already present, otherwise `1`. ~~int~~ |
|
||||||
|
|
||||||
## Morphologizer.to_disk {#to_disk tag="method"}
|
## Morphologizer.to_disk {#to_disk tag="method"}
|
||||||
|
|
||||||
|
@ -284,11 +282,11 @@ Serialize the pipe to disk.
|
||||||
> morphologizer.to_disk("/path/to/morphologizer")
|
> morphologizer.to_disk("/path/to/morphologizer")
|
||||||
> ```
|
> ```
|
||||||
|
|
||||||
| Name | Type | Description |
|
| Name | Description |
|
||||||
| -------------- | --------------- | --------------------------------------------------------------------------------------------------------------------- |
|
| -------------- | ------------------------------------------------------------------------------------------------------------------------------------------ |
|
||||||
| `path` | str / `Path` | A path to a directory, which will be created if it doesn't exist. Paths may be either strings or `Path`-like objects. |
|
| `path` | A path to a directory, which will be created if it doesn't exist. Paths may be either strings or `Path`-like objects. ~~Union[str, Path]~~ |
|
||||||
| _keyword-only_ | | |
|
| _keyword-only_ | |
|
||||||
| `exclude` | `Iterable[str]` | String names of [serialization fields](#serialization-fields) to exclude. |
|
| `exclude` | String names of [serialization fields](#serialization-fields) to exclude. ~~Iterable[str]~~ |
|
||||||
|
|
||||||
## Morphologizer.from_disk {#from_disk tag="method"}
|
## Morphologizer.from_disk {#from_disk tag="method"}
|
||||||
|
|
||||||
|
@ -301,12 +299,12 @@ Load the pipe from disk. Modifies the object in place and returns it.
|
||||||
> morphologizer.from_disk("/path/to/morphologizer")
|
> morphologizer.from_disk("/path/to/morphologizer")
|
||||||
> ```
|
> ```
|
||||||
|
|
||||||
| Name | Type | Description |
|
| Name | Description |
|
||||||
| -------------- | --------------- | -------------------------------------------------------------------------- |
|
| -------------- | ----------------------------------------------------------------------------------------------- |
|
||||||
| `path` | str / `Path` | A path to a directory. Paths may be either strings or `Path`-like objects. |
|
| `path` | A path to a directory. Paths may be either strings or `Path`-like objects. ~~Union[str, Path]~~ |
|
||||||
| _keyword-only_ | | |
|
| _keyword-only_ | |
|
||||||
| `exclude` | `Iterable[str]` | String names of [serialization fields](#serialization-fields) to exclude. |
|
| `exclude` | String names of [serialization fields](#serialization-fields) to exclude. ~~Iterable[str]~~ |
|
||||||
| **RETURNS** | `Morphologizer` | The modified `Morphologizer` object. |
|
| **RETURNS** | The modified `Morphologizer` object. ~~Morphologizer~~ |
|
||||||
|
|
||||||
## Morphologizer.to_bytes {#to_bytes tag="method"}
|
## Morphologizer.to_bytes {#to_bytes tag="method"}
|
||||||
|
|
||||||
|
@ -319,11 +317,11 @@ Load the pipe from disk. Modifies the object in place and returns it.
|
||||||
|
|
||||||
Serialize the pipe to a bytestring.
|
Serialize the pipe to a bytestring.
|
||||||
|
|
||||||
| Name | Type | Description |
|
| Name | Description |
|
||||||
| -------------- | --------------- | ------------------------------------------------------------------------- |
|
| -------------- | ------------------------------------------------------------------------------------------- |
|
||||||
| _keyword-only_ | | |
|
| _keyword-only_ | |
|
||||||
| `exclude` | `Iterable[str]` | String names of [serialization fields](#serialization-fields) to exclude. |
|
| `exclude` | String names of [serialization fields](#serialization-fields) to exclude. ~~Iterable[str]~~ |
|
||||||
| **RETURNS** | bytes | The serialized form of the `Morphologizer` object. |
|
| **RETURNS** | The serialized form of the `Morphologizer` object. ~~bytes~~ |
|
||||||
|
|
||||||
## Morphologizer.from_bytes {#from_bytes tag="method"}
|
## Morphologizer.from_bytes {#from_bytes tag="method"}
|
||||||
|
|
||||||
|
@ -337,19 +335,19 @@ Load the pipe from a bytestring. Modifies the object in place and returns it.
|
||||||
> morphologizer.from_bytes(morphologizer_bytes)
|
> morphologizer.from_bytes(morphologizer_bytes)
|
||||||
> ```
|
> ```
|
||||||
|
|
||||||
| Name | Type | Description |
|
| Name | Description |
|
||||||
| -------------- | --------------- | ------------------------------------------------------------------------- |
|
| -------------- | ------------------------------------------------------------------------------------------- |
|
||||||
| `bytes_data` | bytes | The data to load from. |
|
| `bytes_data` | The data to load from. ~~bytes~~ |
|
||||||
| _keyword-only_ | | |
|
| _keyword-only_ | |
|
||||||
| `exclude` | `Iterable[str]` | String names of [serialization fields](#serialization-fields) to exclude. |
|
| `exclude` | String names of [serialization fields](#serialization-fields) to exclude. ~~Iterable[str]~~ |
|
||||||
| **RETURNS** | `Morphologizer` | The `Morphologizer` object. |
|
| **RETURNS** | The `Morphologizer` object. ~~Morphologizer~~ |
|
||||||
|
|
||||||
## Morphologizer.labels {#labels tag="property"}
|
## Morphologizer.labels {#labels tag="property"}
|
||||||
|
|
||||||
The labels currently added to the component in Universal Dependencies
|
The labels currently added to the component in the Universal Dependencies
|
||||||
[FEATS format](https://universaldependencies.org/format.html#morphological-annotation).
|
[FEATS](https://universaldependencies.org/format.html#morphological-annotation)
|
||||||
Note that even for a blank component, this will always include the internal
|
format. Note that even for a blank component, this will always include the
|
||||||
empty label `_`. If POS features are used, the labels will include the
|
internal empty label `_`. If POS features are used, the labels will include the
|
||||||
coarse-grained POS as the feature `POS`.
|
coarse-grained POS as the feature `POS`.
|
||||||
|
|
||||||
> #### Example
|
> #### Example
|
||||||
|
@ -359,9 +357,9 @@ coarse-grained POS as the feature `POS`.
|
||||||
> assert "Mood=Ind|POS=VERB|Tense=Past|VerbForm=Fin" in morphologizer.labels
|
> assert "Mood=Ind|POS=VERB|Tense=Past|VerbForm=Fin" in morphologizer.labels
|
||||||
> ```
|
> ```
|
||||||
|
|
||||||
| Name | Type | Description |
|
| Name | Description |
|
||||||
| ----------- | ----- | ---------------------------------- |
|
| ----------- | ------------------------------------------------------ |
|
||||||
| **RETURNS** | tuple | The labels added to the component. |
|
| **RETURNS** | The labels added to the component. ~~Tuple[str, ...]~~ |
|
||||||
|
|
||||||
## Serialization fields {#serialization-fields}
|
## Serialization fields {#serialization-fields}
|
||||||
|
|
||||||
|
|
|
@ -7,7 +7,8 @@ source: spacy/morphology.pyx
|
||||||
Store the possible morphological analyses for a language, and index them by
|
Store the possible morphological analyses for a language, and index them by
|
||||||
hash. To save space on each token, tokens only know the hash of their
|
hash. To save space on each token, tokens only know the hash of their
|
||||||
morphological analysis, so queries of morphological attributes are delegated to
|
morphological analysis, so queries of morphological attributes are delegated to
|
||||||
this class.
|
this class. See [`MorphAnalysis`](/api/morphology#morphanalysis) for the
|
||||||
|
container storing a single morphological analysis.
|
||||||
|
|
||||||
## Morphology.\_\_init\_\_ {#init tag="method"}
|
## Morphology.\_\_init\_\_ {#init tag="method"}
|
||||||
|
|
||||||
|
@ -21,15 +22,17 @@ Create a Morphology object.
|
||||||
> morphology = Morphology(strings)
|
> morphology = Morphology(strings)
|
||||||
> ```
|
> ```
|
||||||
|
|
||||||
| Name | Type | Description |
|
| Name | Description |
|
||||||
| --------- | ------------- | ----------------- |
|
| --------- | --------------------------------- |
|
||||||
| `strings` | `StringStore` | The string store. |
|
| `strings` | The string store. ~~StringStore~~ |
|
||||||
|
|
||||||
## Morphology.add {#add tag="method"}
|
## Morphology.add {#add tag="method"}
|
||||||
|
|
||||||
Insert a morphological analysis in the morphology table, if not already present.
|
Insert a morphological analysis in the morphology table, if not already present.
|
||||||
The morphological analysis may be provided in the UD FEATS format as a string or
|
The morphological analysis may be provided in the Universal Dependencies
|
||||||
in the tag map dictionary format. Returns the hash of the new analysis.
|
[FEATS](https://universaldependencies.org/format.html#morphological-annotation)
|
||||||
|
format as a string or in the tag map dictionary format. Returns the hash of the
|
||||||
|
new analysis.
|
||||||
|
|
||||||
> #### Example
|
> #### Example
|
||||||
>
|
>
|
||||||
|
@ -39,9 +42,9 @@ in the tag map dictionary format. Returns the hash of the new analysis.
|
||||||
> assert hash == nlp.vocab.strings[feats]
|
> assert hash == nlp.vocab.strings[feats]
|
||||||
> ```
|
> ```
|
||||||
|
|
||||||
| Name | Type | Description |
|
| Name | Description |
|
||||||
| ---------- | ------------------ | --------------------------- |
|
| ---------- | ------------------------------------------------ |
|
||||||
| `features` | `Union[Dict, str]` | The morphological features. |
|
| `features` | The morphological features. ~~Union[Dict, str]~~ |
|
||||||
|
|
||||||
## Morphology.get {#get tag="method"}
|
## Morphology.get {#get tag="method"}
|
||||||
|
|
||||||
|
@ -53,16 +56,20 @@ in the tag map dictionary format. Returns the hash of the new analysis.
|
||||||
> assert nlp.vocab.morphology.get(hash) == feats
|
> assert nlp.vocab.morphology.get(hash) == feats
|
||||||
> ```
|
> ```
|
||||||
|
|
||||||
Get the FEATS string for the hash of the morphological analysis.
|
Get the
|
||||||
|
[FEATS](https://universaldependencies.org/format.html#morphological-annotation)
|
||||||
|
string for the hash of the morphological analysis.
|
||||||
|
|
||||||
| Name | Type | Description |
|
| Name | Description |
|
||||||
| ------- | ---- | --------------------------------------- |
|
| ------- | ----------------------------------------------- |
|
||||||
| `morph` | int | The hash of the morphological analysis. |
|
| `morph` | The hash of the morphological analysis. ~~int~~ |
|
||||||
|
|
||||||
## Morphology.feats_to_dict {#feats_to_dict tag="staticmethod"}
|
## Morphology.feats_to_dict {#feats_to_dict tag="staticmethod"}
|
||||||
|
|
||||||
Convert a string FEATS representation to a dictionary of features and values in
|
Convert a string
|
||||||
the same format as the tag map.
|
[FEATS](https://universaldependencies.org/format.html#morphological-annotation)
|
||||||
|
representation to a dictionary of features and values in the same format as the
|
||||||
|
tag map.
|
||||||
|
|
||||||
> #### Example
|
> #### Example
|
||||||
>
|
>
|
||||||
|
@ -72,14 +79,16 @@ the same format as the tag map.
|
||||||
> assert d == {"Feat1": "Val1", "Feat2": "Val2"}
|
> assert d == {"Feat1": "Val1", "Feat2": "Val2"}
|
||||||
> ```
|
> ```
|
||||||
|
|
||||||
| Name | Type | Description |
|
| Name | Description |
|
||||||
| ----------- | ---- | ------------------------------------------------------------------ |
|
| ----------- | ---------------------------------------------------------------------------------------------------------------------------------------------------- |
|
||||||
| `feats` | str | The morphological features in Universal Dependencies FEATS format. |
|
| `feats` | The morphological features in Universal Dependencies [FEATS](https://universaldependencies.org/format.html#morphological-annotation) format. ~~str~~ |
|
||||||
| **RETURNS** | dict | The morphological features as a dictionary. |
|
| **RETURNS** | The morphological features as a dictionary. ~~Dict[str, str]~~ |
|
||||||
|
|
||||||
## Morphology.dict_to_feats {#dict_to_feats tag="staticmethod"}
|
## Morphology.dict_to_feats {#dict_to_feats tag="staticmethod"}
|
||||||
|
|
||||||
Convert a dictionary of features and values to a string FEATS representation.
|
Convert a dictionary of features and values to a string
|
||||||
|
[FEATS](https://universaldependencies.org/format.html#morphological-annotation)
|
||||||
|
representation.
|
||||||
|
|
||||||
> #### Example
|
> #### Example
|
||||||
>
|
>
|
||||||
|
@ -89,15 +98,157 @@ Convert a dictionary of features and values to a string FEATS representation.
|
||||||
> assert f == "Feat1=Val1|Feat2=Val2"
|
> assert f == "Feat1=Val1|Feat2=Val2"
|
||||||
> ```
|
> ```
|
||||||
|
|
||||||
| Name | Type | Description |
|
| Name | Description |
|
||||||
| ------------ | ----------------- | --------------------------------------------------------------------- |
|
| ------------ | ------------------------------------------------------------------------------------------------------------------------------------------------------- |
|
||||||
| `feats_dict` | `Dict[str, Dict]` | The morphological features as a dictionary. |
|
| `feats_dict` | The morphological features as a dictionary. ~~Dict[str, str]~~ |
|
||||||
| **RETURNS** | str | The morphological features as in Universal Dependencies FEATS format. |
|
| **RETURNS** | The morphological features as in Universal Dependencies [FEATS](https://universaldependencies.org/format.html#morphological-annotation) format. ~~str~~ |
|
||||||
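The two conversions above can be sketched in plain Python. This only illustrates the FEATS string/dict mapping, not spaCy's actual implementation, and the helper names are hypothetical:

```python
def feats_to_dict(feats):
    # "Feat1=Val1|Feat2=Val2" -> {"Feat1": "Val1", "Feat2": "Val2"}
    if not feats or feats == "_":  # "_" is the empty analysis
        return {}
    return dict(part.split("=", 1) for part in feats.split("|"))

def dict_to_feats(feats_dict):
    # {"Feat2": "Val2", "Feat1": "Val1"} -> "Feat1=Val1|Feat2=Val2"
    # FEATS fields are joined with "|", sorted by field name.
    return "|".join(f"{k}={v}" for k, v in sorted(feats_dict.items()))

assert feats_to_dict("Feat1=Val1|Feat2=Val2") == {"Feat1": "Val1", "Feat2": "Val2"}
assert dict_to_feats({"Feat2": "Val2", "Feat1": "Val1"}) == "Feat1=Val1|Feat2=Val2"
```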
|
|
||||||
## Attributes {#attributes}
|
## Attributes {#attributes}
|
||||||
|
|
||||||
| Name | Type | Description |
|
| Name | Description |
|
||||||
| ------------- | ----- | -------------------------------------------- |
|
| ------------- | ------------------------------------------------------------------------------------------------------------------------------ |
|
||||||
| `FEATURE_SEP` | `str` | The FEATS feature separator. Default is `|`. |
|
| `FEATURE_SEP` | The [FEATS](https://universaldependencies.org/format.html#morphological-annotation) feature separator. Default is `|`. ~~str~~ |
|
||||||
| `FIELD_SEP` | `str` | The FEATS field separator. Default is `=`. |
|
| `FIELD_SEP` | The [FEATS](https://universaldependencies.org/format.html#morphological-annotation) field separator. Default is `=`. ~~str~~ |
|
||||||
| `VALUE_SEP` | `str` | The FEATS value separator. Default is `,`. |
|
| `VALUE_SEP` | The [FEATS](https://universaldependencies.org/format.html#morphological-annotation) value separator. Default is `,`. ~~str~~ |
|
||||||
|
|
||||||
|
## MorphAnalysis {#morphanalysis tag="class" source="spacy/tokens/morphanalysis.pyx"}
|
||||||
|
|
||||||
|
Stores a single morphological analysis.
|
||||||
|
|
||||||
|
### MorphAnalysis.\_\_init\_\_ {#morphanalysis-init tag="method"}
|
||||||
|
|
||||||
|
Initialize a MorphAnalysis object from a Universal Dependencies
|
||||||
|
[FEATS](https://universaldependencies.org/format.html#morphological-annotation)
|
||||||
|
string or a dictionary of morphological features.
|
||||||
|
|
||||||
|
> #### Example
|
||||||
|
>
|
||||||
|
> ```python
|
||||||
|
> from spacy.tokens import MorphAnalysis
|
||||||
|
>
|
||||||
|
> feats = "Feat1=Val1|Feat2=Val2"
|
||||||
|
> m = MorphAnalysis(nlp.vocab, feats)
|
||||||
|
> ```
|
||||||
|
|
||||||
|
| Name | Description |
|
||||||
|
| ---------- | ---------------------------------------------------------- |
|
||||||
|
| `vocab` | The vocab. ~~Vocab~~ |
|
||||||
|
| `features` | The morphological features. ~~Union[Dict[str, str], str]~~ |
|
||||||
|
|
||||||
|
### MorphAnalysis.\_\_contains\_\_ {#morphanalysis-contains tag="method"}
|
||||||
|
|
||||||
|
Whether a feature/value pair is in the analysis.
|
||||||
|
|
||||||
|
> #### Example
|
||||||
|
>
|
||||||
|
> ```python
|
||||||
|
> feats = "Feat1=Val1,Val2|Feat2=Val2"
|
||||||
|
> morph = MorphAnalysis(nlp.vocab, feats)
|
||||||
|
> assert "Feat1=Val1" in morph
|
||||||
|
> ```
|
||||||
|
|
||||||
|
| Name | Description |
|
||||||
|
| ----------- | --------------------------------------------- |
|
||||||
|
| **RETURNS** | Whether the feature/value pair is in the analysis. ~~bool~~ |
|
||||||
|
|
||||||
|
### MorphAnalysis.\_\_iter\_\_ {#morphanalysis-iter tag="method"}
|
||||||
|
|
||||||
|
Iterate over the feature/value pairs in the analysis.
|
||||||
|
|
||||||
|
> #### Example
|
||||||
|
>
|
||||||
|
> ```python
|
||||||
|
> feats = "Feat1=Val1,Val3|Feat2=Val2"
|
||||||
|
> morph = MorphAnalysis(nlp.vocab, feats)
|
||||||
|
> assert list(morph) == ["Feat1=Val1", "Feat1=Val3", "Feat2=Val2"]
|
||||||
|
> ```
|
||||||
|
|
||||||
|
| Name | Description |
|
||||||
|
| ---------- | --------------------------------------------- |
|
||||||
|
| **YIELDS** | A feature/value pair in the analysis. ~~str~~ |
|
||||||
|
|
||||||
|
### MorphAnalysis.\_\_len\_\_ {#morphanalysis-len tag="method"}
|
||||||
|
|
||||||
|
Returns the number of features in the analysis.
|
||||||
|
|
||||||
|
> #### Example
|
||||||
|
>
|
||||||
|
> ```python
|
||||||
|
> feats = "Feat1=Val1,Val2|Feat2=Val2"
|
||||||
|
> morph = MorphAnalysis(nlp.vocab, feats)
|
||||||
|
> assert len(morph) == 3
|
||||||
|
> ```
|
||||||
|
|
||||||
|
| Name | Description |
|
||||||
|
| ----------- | ----------------------------------------------- |
|
||||||
|
| **RETURNS** | The number of features in the analysis. ~~int~~ |
|
||||||
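The pair semantics behind `__contains__`, `__iter__` and `__len__` can be illustrated in plain Python: a field with multiple values such as `Feat1=Val1,Val2` expands to one feature/value pair per value. This is a sketch of the behavior, not spaCy's implementation:

```python
def feats_pairs(feats):
    # "Feat1=Val1,Val2|Feat2=Val2" -> ["Feat1=Val1", "Feat1=Val2", "Feat2=Val2"]
    pairs = []
    for field in feats.split("|"):
        name, values = field.split("=", 1)
        for value in values.split(","):
            pairs.append(f"{name}={value}")
    return pairs

pairs = feats_pairs("Feat1=Val1,Val2|Feat2=Val2")
assert len(pairs) == 3            # mirrors len(morph)
assert "Feat1=Val1" in pairs      # mirrors "Feat1=Val1" in morph
```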
|
|
||||||
|
### MorphAnalysis.\_\_str\_\_ {#morphanalysis-str tag="method"}
|
||||||
|
|
||||||
|
Returns the morphological analysis in the Universal Dependencies
|
||||||
|
[FEATS](https://universaldependencies.org/format.html#morphological-annotation)
|
||||||
|
string format.
|
||||||
|
|
||||||
|
> #### Example
|
||||||
|
>
|
||||||
|
> ```python
|
||||||
|
> feats = "Feat1=Val1,Val2|Feat2=Val2"
|
||||||
|
> morph = MorphAnalysis(nlp.vocab, feats)
|
||||||
|
> assert str(morph) == feats
|
||||||
|
> ```
|
||||||
|
|
||||||
|
| Name | Description |
|
||||||
|
| ----------- | ------------------------------------------------------------------------------------------------------------------------------------------ |
|
||||||
|
| **RETURNS** | The analysis in the Universal Dependencies [FEATS](https://universaldependencies.org/format.html#morphological-annotation) format. ~~str~~ |
|
||||||
|
|
||||||
|
### MorphAnalysis.get {#morphanalysis-get tag="method"}
|
||||||
|
|
||||||
|
Retrieve values for a feature by field.
|
||||||
|
|
||||||
|
> #### Example
|
||||||
|
>
|
||||||
|
> ```python
|
||||||
|
> feats = "Feat1=Val1,Val2"
|
||||||
|
> morph = MorphAnalysis(nlp.vocab, feats)
|
||||||
|
> assert morph.get("Feat1") == ["Val1", "Val2"]
|
||||||
|
> ```
|
||||||
|
|
||||||
|
| Name | Description |
|
||||||
|
| ----------- | ------------------------------------------------ |
|
||||||
|
| `field` | The field to retrieve. ~~str~~ |
|
||||||
|
| **RETURNS** | A list of the individual features. ~~List[str]~~ |
|
||||||
|
|
||||||
|
### MorphAnalysis.to_dict {#morphanalysis-to_dict tag="method"}
|
||||||
|
|
||||||
|
Produce a dict representation of the analysis, in the same format as the tag
|
||||||
|
map.
|
||||||
|
|
||||||
|
> #### Example
|
||||||
|
>
|
||||||
|
> ```python
|
||||||
|
> feats = "Feat1=Val1,Val2|Feat2=Val2"
|
||||||
|
> morph = MorphAnalysis(nlp.vocab, feats)
|
||||||
|
> assert morph.to_dict() == {"Feat1": "Val1,Val2", "Feat2": "Val2"}
|
||||||
|
> ```
|
||||||
|
|
||||||
|
| Name | Description |
|
||||||
|
| ----------- | ----------------------------------------------------------- |
|
||||||
|
| **RETURNS** | The dict representation of the analysis. ~~Dict[str, str]~~ |
|
||||||
|
|
||||||
|
### MorphAnalysis.from_id {#morphanalysis-from_id tag="classmethod"}
|
||||||
|
|
||||||
|
Create a morphological analysis from a given hash ID.
|
||||||
|
|
||||||
|
> #### Example
|
||||||
|
>
|
||||||
|
> ```python
|
||||||
|
> feats = "Feat1=Val1|Feat2=Val2"
|
||||||
|
> hash = nlp.vocab.strings[feats]
|
||||||
|
> morph = MorphAnalysis.from_id(nlp.vocab, hash)
|
||||||
|
> assert str(morph) == feats
|
||||||
|
> ```
|
||||||
|
|
||||||
|
| Name | Description |
|
||||||
|
| ------- | ---------------------------------------- |
|
||||||
|
| `vocab` | The vocab. ~~Vocab~~ |
|
||||||
|
| `key` | The hash of the features string. ~~int~~ |
|
||||||
|
|
|
@ -36,11 +36,11 @@ be shown.
|
||||||
> matcher = PhraseMatcher(nlp.vocab)
|
> matcher = PhraseMatcher(nlp.vocab)
|
||||||
> ```
|
> ```
|
||||||
|
|
||||||
| Name | Type | Description |
|
| Name | Description |
|
||||||
| --------------------------------------- | --------- | ------------------------------------------------------------------------------------------- |
|
| --------------------------------------- | ------------------------------------------------------------------------------------------------------ |
|
||||||
| `vocab` | `Vocab` | The vocabulary object, which must be shared with the documents the matcher will operate on. |
|
| `vocab` | The vocabulary object, which must be shared with the documents the matcher will operate on. ~~Vocab~~ |
|
||||||
| `attr` <Tag variant="new">2.1</Tag> | int / str | The token attribute to match on. Defaults to `ORTH`, i.e. the verbatim token text. |
|
| `attr` <Tag variant="new">2.1</Tag> | The token attribute to match on. Defaults to `ORTH`, i.e. the verbatim token text. ~~Union[int, str]~~ |
|
||||||
| `validate` <Tag variant="new">2.1</Tag> | bool | Validate patterns added to the matcher. |
|
| `validate` <Tag variant="new">2.1</Tag> | Validate patterns added to the matcher. ~~bool~~ |
|
||||||
|
|
||||||
## PhraseMatcher.\_\_call\_\_ {#call tag="method"}
|
## PhraseMatcher.\_\_call\_\_ {#call tag="method"}
|
||||||
|
|
||||||
|
@ -57,10 +57,10 @@ Find all token sequences matching the supplied patterns on the `Doc`.
|
||||||
> matches = matcher(doc)
|
> matches = matcher(doc)
|
||||||
> ```
|
> ```
|
||||||
|
|
||||||
| Name | Type | Description |
|
| Name | Description |
|
||||||
| ----------- | ----- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------ |
|
| ----------- | ----------------------------------- |
|
||||||
| `doc` | `Doc` | The document to match over. |
|
| `doc` | The document to match over. ~~Doc~~ |
|
||||||
| **RETURNS** | list | A list of `(match_id, start, end)` tuples, describing the matches. A match tuple describes a span `doc[start:end]`. The `match_id` is the ID of the added match pattern. |
|
| **RETURNS** | A list of `(match_id, start, end)` tuples, describing the matches. A match tuple describes a span `doc[start:end]`. The `match_id` is the ID of the added match pattern. ~~List[Tuple[int, int, int]]~~ |
|
||||||
|
|
||||||
<Infobox title="Note on retrieving the string representation of the match_id" variant="warning">
|
<Infobox title="Note on retrieving the string representation of the match_id" variant="warning">
|
||||||
|
|
||||||
|
@ -87,11 +87,13 @@ Match a stream of documents, yielding them in turn.
|
||||||
> pass
|
> pass
|
||||||
> ```
|
> ```
|
||||||
|
|
||||||
| Name | Type | Description |
|
| Name | Description |
|
||||||
| ------------ | -------- | --------------------------------------------------------- |
|
| --------------------------------------------- | ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
|
||||||
| `docs` | iterable | A stream of documents. |
|
| `docs` | A stream of documents. ~~Iterable[Doc]~~ |
|
||||||
| `batch_size` | int | The number of documents to accumulate into a working set. |
|
| `batch_size` | The number of documents to accumulate into a working set. ~~int~~ |
|
||||||
| **YIELDS** | `Doc` | Documents, in order. |
|
| `return_matches` <Tag variant="new">2.1</Tag> | Yield the match lists along with the docs, making results `(doc, matches)` tuples. ~~bool~~ |
|
||||||
|
| `as_tuples` | Interpret the input stream as `(doc, context)` tuples, and yield `(result, context)` tuples out. If both `return_matches` and `as_tuples` are `True`, the output will be a sequence of `((doc, matches), context)` tuples. ~~bool~~ |
|
||||||
|
| **YIELDS** | Documents and optional matches or context in order. ~~Union[Doc, Tuple[Doc, Any], Tuple[Tuple[Doc, Any], Any]]~~ |
|
||||||
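The four output shapes of `pipe` — plain docs, `(doc, matches)` tuples, `(result, context)` tuples, and `((doc, matches), context)` tuples — can be sketched with a generic generator. This is illustrative only; `match` stands in for calling the matcher, and docs are plain token lists:

```python
def pipe_sketch(match, stream, return_matches=False, as_tuples=False):
    # Yields docs, (doc, matches), (result, context), or
    # ((doc, matches), context), depending on the two flags.
    for item in stream:
        doc, context = item if as_tuples else (item, None)
        result = (doc, match(doc)) if return_matches else doc
        yield (result, context) if as_tuples else result

def match(tokens):
    # Stand-in matcher over token lists.
    return [("ID", i, i + 1) for i, t in enumerate(tokens) if t == "Obama"]

out = list(pipe_sketch(match, [(["Barack", "Obama"], "ctx1")],
                       return_matches=True, as_tuples=True))
assert out == [((["Barack", "Obama"], [("ID", 1, 2)]), "ctx1")]
```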
|
|
||||||
## PhraseMatcher.\_\_len\_\_ {#len tag="method"}
|
## PhraseMatcher.\_\_len\_\_ {#len tag="method"}
|
||||||
|
|
||||||
|
@ -108,9 +110,9 @@ patterns.
|
||||||
> assert len(matcher) == 1
|
> assert len(matcher) == 1
|
||||||
> ```
|
> ```
|
||||||
|
|
||||||
| Name | Type | Description |
|
| Name | Description |
|
||||||
| ----------- | ---- | -------------------- |
|
| ----------- | ---------------------------- |
|
||||||
| **RETURNS** | int | The number of rules. |
|
| **RETURNS** | The number of rules. ~~int~~ |
|
||||||
|
|
||||||
## PhraseMatcher.\_\_contains\_\_ {#contains tag="method"}
|
## PhraseMatcher.\_\_contains\_\_ {#contains tag="method"}
|
||||||
|
|
||||||
|
@ -125,10 +127,10 @@ Check whether the matcher contains rules for a match ID.
|
||||||
> assert "OBAMA" in matcher
|
> assert "OBAMA" in matcher
|
||||||
> ```
|
> ```
|
||||||
|
|
||||||
| Name | Type | Description |
|
| Name | Description |
|
||||||
| ----------- | ---- | ----------------------------------------------------- |
|
| ----------- | -------------------------------------------------------------- |
|
||||||
| `key` | str | The match ID. |
|
| `key` | The match ID. ~~str~~ |
|
||||||
| **RETURNS** | bool | Whether the matcher contains rules for this match ID. |
|
| **RETURNS** | Whether the matcher contains rules for this match ID. ~~bool~~ |
|
||||||
|
|
||||||
## PhraseMatcher.add {#add tag="method"}
|
## PhraseMatcher.add {#add tag="method"}
|
||||||
|
|
||||||
|
@ -165,12 +167,12 @@ patterns = [nlp("health care reform"), nlp("healthcare reform")]
|
||||||
|
|
||||||
</Infobox>
|
</Infobox>
|
||||||
|
|
||||||
| Name | Type | Description |
|
| Name | Description |
|
||||||
| -------------- | ------------------ | --------------------------------------------------------------------------------------------- |
|
| -------------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------- |
|
||||||
| `match_id` | str | An ID for the thing you're matching. |
|
| `match_id` | An ID for the thing you're matching. ~~str~~ |
|
||||||
| `docs` | list | `Doc` objects of the phrases to match. |
|
| `docs` | `Doc` objects of the phrases to match. ~~List[Doc]~~ |
|
||||||
| _keyword-only_ | | |
|
| _keyword-only_ | |
|
||||||
| `on_match` | callable or `None` | Callback function to act on matches. Takes the arguments `matcher`, `doc`, `i` and `matches`. |
|
| `on_match` | Callback function to act on matches. Takes the arguments `matcher`, `doc`, `i` and `matches`. ~~Optional[Callable[[Matcher, Doc, int, List[tuple]], Any]]~~ |
|
||||||
|
|
||||||
## PhraseMatcher.remove {#remove tag="method" new="2.2"}
|
## PhraseMatcher.remove {#remove tag="method" new="2.2"}
|
||||||
|
|
||||||
|
@@ -187,6 +189,6 @@ does not exist.
|
||||||
> assert "OBAMA" not in matcher
|
> assert "OBAMA" not in matcher
|
||||||
> ```
|
> ```
|
||||||
|
|
||||||
| Name | Type | Description |
|
| Name | Description |
|
||||||
| ----- | ---- | ------------------------- |
|
| ----- | --------------------------------- |
|
||||||
| `key` | str | The ID of the match rule. |
|
| `key` | The ID of the match rule. ~~str~~ |
|
||||||
|
|
|
@@ -45,12 +45,12 @@ Create a new pipeline instance. In your application, you would normally use a
|
||||||
shortcut for this and instantiate the component using its string name and
|
shortcut for this and instantiate the component using its string name and
|
||||||
[`nlp.add_pipe`](/api/language#create_pipe).
|
[`nlp.add_pipe`](/api/language#create_pipe).
|
||||||
|
|
||||||
| Name | Type | Description |
|
| Name | Description |
|
||||||
| ------- | ------------------------------------------ | ------------------------------------------------------------------------------------------------------------------------------- |
|
| ------- | ------------------------------------------------------------------------------------------------------------------------------- |
|
||||||
| `vocab` | `Vocab` | The shared vocabulary. |
|
| `vocab` | The shared vocabulary. ~~Vocab~~ |
|
||||||
| `model` | [`Model`](https://thinc.ai/docs/api-model) | The Thinc [`Model`](https://thinc.ai/docs/api-model) powering the pipeline component. |
|
| `model` | The Thinc [`Model`](https://thinc.ai/docs/api-model) powering the pipeline component. ~~Model[List[Doc], Any]~~ |
|
||||||
| `name` | str | String name of the component instance. Used to add entries to the `losses` during training. |
|
| `name` | String name of the component instance. Used to add entries to the `losses` during training. ~~str~~ |
|
||||||
| `**cfg` | | Additional config parameters and settings. Will be available as the dictionary `Pipe.cfg` and is serialized with the component. |
|
| `**cfg` | Additional config parameters and settings. Will be available as the dictionary `Pipe.cfg` and is serialized with the component. |
|
||||||
|
|
||||||
## Pipe.\_\_call\_\_ {#call tag="method"}
|
## Pipe.\_\_call\_\_ {#call tag="method"}
|
||||||
|
|
||||||
|
@@ -70,10 +70,10 @@ and all pipeline components are applied to the `Doc` in order. Both
|
||||||
> processed = pipe(doc)
|
> processed = pipe(doc)
|
||||||
> ```
|
> ```
|
||||||
|
|
||||||
| Name | Type | Description |
|
| Name | Description |
|
||||||
| ----------- | ----- | ------------------------ |
|
| ----------- | -------------------------------- |
|
||||||
| `doc` | `Doc` | The document to process. |
|
| `doc` | The document to process. ~~Doc~~ |
|
||||||
| **RETURNS** | `Doc` | The processed document. |
|
| **RETURNS** | The processed document. ~~Doc~~ |
|
||||||
|
|
||||||
## Pipe.pipe {#pipe tag="method"}
|
## Pipe.pipe {#pipe tag="method"}
|
||||||
|
|
||||||
|
@@ -91,12 +91,12 @@ applied to the `Doc` in order. Both [`__call__`](/api/pipe#call) and
|
||||||
> pass
|
> pass
|
||||||
> ```
|
> ```
|
||||||
|
|
||||||
| Name | Type | Description |
|
| Name | Description |
|
||||||
| -------------- | --------------- | ----------------------------------------------------- |
|
| -------------- | ------------------------------------------------------------- |
|
||||||
| `stream` | `Iterable[Doc]` | A stream of documents. |
|
| `stream` | A stream of documents. ~~Iterable[Doc]~~ |
|
||||||
| _keyword-only_ | | |
|
| _keyword-only_ | |
|
||||||
| `batch_size` | int | The number of documents to buffer. Defaults to `128`. |
|
| `batch_size` | The number of documents to buffer. Defaults to `128`. ~~int~~ |
|
||||||
| **YIELDS** | `Doc` | The processed documents in order. |
|
| **YIELDS** | The processed documents in order. ~~Doc~~ |
|
||||||
|
|
||||||
## Pipe.begin_training {#begin_training tag="method"}
|
## Pipe.begin_training {#begin_training tag="method"}
|
||||||
|
|
||||||
|
@@ -116,13 +116,13 @@ setting up the label scheme based on the data.
|
||||||
> optimizer = pipe.begin_training(lambda: [], pipeline=nlp.pipeline)
|
> optimizer = pipe.begin_training(lambda: [], pipeline=nlp.pipeline)
|
||||||
> ```
|
> ```
|
||||||
|
|
||||||
| Name | Type | Description |
|
| Name | Description |
|
||||||
| -------------- | --------------------------------------------------- | ---------------------------------------------------------------------------------------------------------- |
|
| -------------- | ------------------------------------------------------------------------------------------------------------------------------------- |
|
||||||
| `get_examples` | `Callable[[], Iterable[Example]]` | Optional function that returns gold-standard annotations in the form of [`Example`](/api/example) objects. |
|
| `get_examples` | Function that returns gold-standard annotations in the form of [`Example`](/api/example) objects. ~~Callable[[], Iterable[Example]]~~ |
|
||||||
| _keyword-only_ | | |
|
| _keyword-only_ | |
|
||||||
| `pipeline` | `List[Tuple[str, Callable]]` | Optional list of pipeline components that this component is part of. |
|
| `pipeline` | Optional list of pipeline components that this component is part of. ~~Optional[List[Tuple[str, Callable[[Doc], Doc]]]]~~ |
|
||||||
| `sgd` | [`Optimizer`](https://thinc.ai/docs/api-optimizers) | An optional optimizer. Will be created via [`create_optimizer`](/api/pipe#create_optimizer) if not set. |
|
| `sgd` | An optimizer. Will be created via [`create_optimizer`](#create_optimizer) if not set. ~~Optional[Optimizer]~~ |
|
||||||
| **RETURNS** | [`Optimizer`](https://thinc.ai/docs/api-optimizers) | The optimizer. |
|
| **RETURNS** | The optimizer. ~~Optimizer~~ |
|
||||||
|
|
||||||
## Pipe.predict {#predict tag="method"}
|
## Pipe.predict {#predict tag="method"}
|
||||||
|
|
||||||
|
@@ -142,10 +142,10 @@ This method needs to be overwritten with your own custom `predict` method.
|
||||||
> scores = pipe.predict([doc1, doc2])
|
> scores = pipe.predict([doc1, doc2])
|
||||||
> ```
|
> ```
|
||||||
|
|
||||||
| Name | Type | Description |
|
| Name | Description |
|
||||||
| ----------- | --------------- | ----------------------------------------- |
|
| ----------- | ------------------------------------------- |
|
||||||
| `docs` | `Iterable[Doc]` | The documents to predict. |
|
| `docs` | The documents to predict. ~~Iterable[Doc]~~ |
|
||||||
| **RETURNS** | - | The model's prediction for each document. |
|
| **RETURNS** | The model's prediction for each document. |
|
||||||
|
|
||||||
## Pipe.set_annotations {#set_annotations tag="method"}
|
## Pipe.set_annotations {#set_annotations tag="method"}
|
||||||
|
|
||||||
|
@@ -166,10 +166,10 @@ method.
|
||||||
> pipe.set_annotations(docs, scores)
|
> pipe.set_annotations(docs, scores)
|
||||||
> ```
|
> ```
|
||||||
|
|
||||||
| Name | Type | Description |
|
| Name | Description |
|
||||||
| -------- | --------------- | ---------------------------------------------- |
|
| -------- | ------------------------------------------------ |
|
||||||
| `docs` | `Iterable[Doc]` | The documents to modify. |
|
| `docs` | The documents to modify. ~~Iterable[Doc]~~ |
|
||||||
| `scores` | - | The scores to set, produced by `Pipe.predict`. |
|
| `scores` | The scores to set, produced by `Pipe.predict`. |
|
||||||
|
|
||||||
## Pipe.update {#update tag="method"}
|
## Pipe.update {#update tag="method"}
|
||||||
|
|
||||||
|
@@ -184,15 +184,15 @@ predictions and gold-standard annotations, and update the component's model.
|
||||||
> losses = pipe.update(examples, sgd=optimizer)
|
> losses = pipe.update(examples, sgd=optimizer)
|
||||||
> ```
|
> ```
|
||||||
|
|
||||||
| Name | Type | Description |
|
| Name | Description |
|
||||||
| ----------------- | --------------------------------------------------- | ---------------------------------------------------------------------------------------------------------------------------------- |
|
| ----------------- | ---------------------------------------------------------------------------------------------------------------------------------- |
|
||||||
| `examples` | `Iterable[Example]` | A batch of [`Example`](/api/example) objects to learn from. |
|
| `examples` | A batch of [`Example`](/api/example) objects to learn from. ~~Iterable[Example]~~ |
|
||||||
| _keyword-only_ | | |
|
| _keyword-only_ | |
|
||||||
| `drop` | float | The dropout rate. |
|
| `drop` | The dropout rate. ~~float~~ |
|
||||||
| `set_annotations` | bool | Whether or not to update the `Example` objects with the predictions, delegating to [`set_annotations`](/api/pipe#set_annotations). |
|
| `set_annotations` | Whether or not to update the `Example` objects with the predictions, delegating to [`set_annotations`](#set_annotations). ~~bool~~ |
|
||||||
| `sgd` | [`Optimizer`](https://thinc.ai/docs/api-optimizers) | The optimizer. |
|
| `sgd` | An optimizer. Will be created via [`create_optimizer`](#create_optimizer) if not set. ~~Optional[Optimizer]~~ |
|
||||||
| `losses` | `Dict[str, float]` | Optional record of the loss during training. Updated using the component name as the key. |
|
| `losses` | Optional record of the loss during training. Updated using the component name as the key. ~~Optional[Dict[str, float]]~~ |
|
||||||
| **RETURNS** | `Dict[str, float]` | The updated `losses` dictionary. |
|
| **RETURNS** | The updated `losses` dictionary. ~~Dict[str, float]~~ |
|
||||||
|
|
||||||
## Pipe.rehearse {#rehearse tag="method,experimental" new="3"}
|
## Pipe.rehearse {#rehearse tag="method,experimental" new="3"}
|
||||||
|
|
||||||
|
@@ -208,14 +208,14 @@ the "catastrophic forgetting" problem. This feature is experimental.
|
||||||
> losses = pipe.rehearse(examples, sgd=optimizer)
|
> losses = pipe.rehearse(examples, sgd=optimizer)
|
||||||
> ```
|
> ```
|
||||||
|
|
||||||
| Name | Type | Description |
|
| Name | Description |
|
||||||
| -------------- | --------------------------------------------------- | ----------------------------------------------------------------------------------------- |
|
| -------------- | ------------------------------------------------------------------------------------------------------------------------ |
|
||||||
| `examples` | `Iterable[Example]` | A batch of [`Example`](/api/example) objects to learn from. |
|
| `examples` | A batch of [`Example`](/api/example) objects to learn from. ~~Iterable[Example]~~ |
|
||||||
| _keyword-only_ | | |
|
| _keyword-only_ | |
|
||||||
| `drop` | float | The dropout rate. |
|
| `drop` | The dropout rate. ~~float~~ |
|
||||||
| `sgd` | [`Optimizer`](https://thinc.ai/docs/api-optimizers) | The optimizer. |
|
| `sgd` | An optimizer. Will be created via [`create_optimizer`](#create_optimizer) if not set. ~~Optional[Optimizer]~~ |
|
||||||
| `losses` | `Dict[str, float]` | Optional record of the loss during training. Updated using the component name as the key. |
|
| `losses` | Optional record of the loss during training. Updated using the component name as the key. ~~Optional[Dict[str, float]]~~ |
|
||||||
| **RETURNS** | `Dict[str, float]` | The updated `losses` dictionary. |
|
| **RETURNS** | The updated `losses` dictionary. ~~Dict[str, float]~~ |
|
||||||
|
|
||||||
## Pipe.get_loss {#get_loss tag="method"}
|
## Pipe.get_loss {#get_loss tag="method"}
|
||||||
|
|
||||||
|
@@ -230,11 +230,11 @@ predicted scores.
|
||||||
> loss, d_loss = ner.get_loss(examples, scores)
|
> loss, d_loss = ner.get_loss(examples, scores)
|
||||||
> ```
|
> ```
|
||||||
|
|
||||||
| Name | Type | Description |
|
| Name | Description |
|
||||||
| ----------- | --------------------- | --------------------------------------------------- |
|
| ----------- | --------------------------------------------------------------------------- |
|
||||||
| `examples` | `Iterable[Example]` | The batch of examples. |
|
| `examples` | The batch of examples. ~~Iterable[Example]~~ |
|
||||||
| `scores` | | Scores representing the model's predictions. |
|
| `scores` | Scores representing the model's predictions. |
|
||||||
| **RETURNS** | `Tuple[float, float]` | The loss and the gradient, i.e. `(loss, gradient)`. |
|
| **RETURNS** | The loss and the gradient, i.e. `(loss, gradient)`. ~~Tuple[float, float]~~ |
|
||||||
|
|
||||||
## Pipe.score {#score tag="method" new="3"}
|
## Pipe.score {#score tag="method" new="3"}
|
||||||
|
|
||||||
|
@@ -246,10 +246,10 @@ Score a batch of examples.
|
||||||
> scores = pipe.score(examples)
|
> scores = pipe.score(examples)
|
||||||
> ```
|
> ```
|
||||||
|
|
||||||
| Name | Type | Description |
|
| Name | Description |
|
||||||
| ----------- | ------------------- | --------------------------------------------------------- |
|
| ----------- | ------------------------------------------------------------------------------------------------------- |
|
||||||
| `examples` | `Iterable[Example]` | The examples to score. |
|
| `examples` | The examples to score. ~~Iterable[Example]~~ |
|
||||||
| **RETURNS** | `Dict[str, Any]` | The scores, e.g. produced by the [`Scorer`](/api/scorer). |
|
| **RETURNS** | The scores, e.g. produced by the [`Scorer`](/api/scorer). ~~Dict[str, Union[float, Dict[str, float]]]~~ |
|
||||||
|
|
||||||
## Pipe.create_optimizer {#create_optimizer tag="method"}
|
## Pipe.create_optimizer {#create_optimizer tag="method"}
|
||||||
|
|
||||||
|
@@ -263,26 +263,9 @@ Create an optimizer for the pipeline component. Defaults to
|
||||||
> optimizer = pipe.create_optimizer()
|
> optimizer = pipe.create_optimizer()
|
||||||
> ```
|
> ```
|
||||||
|
|
||||||
| Name | Type | Description |
|
| Name | Description |
|
||||||
| ----------- | --------------------------------------------------- | -------------- |
|
| ----------- | ---------------------------- |
|
||||||
| **RETURNS** | [`Optimizer`](https://thinc.ai/docs/api-optimizers) | The optimizer. |
|
| **RETURNS** | The optimizer. ~~Optimizer~~ |
|
||||||
|
|
||||||
## Pipe.add_label {#add_label tag="method"}
|
|
||||||
|
|
||||||
Add a new label to the pipe. It's possible to extend pretrained models with new
|
|
||||||
labels, but care should be taken to avoid the "catastrophic forgetting" problem.
|
|
||||||
|
|
||||||
> #### Example
|
|
||||||
>
|
|
||||||
> ```python
|
|
||||||
> pipe = nlp.add_pipe("your_custom_pipe")
|
|
||||||
> pipe.add_label("MY_LABEL")
|
|
||||||
> ```
|
|
||||||
|
|
||||||
| Name | Type | Description |
|
|
||||||
| ----------- | ---- | --------------------------------------------------- |
|
|
||||||
| `label` | str | The label to add. |
|
|
||||||
| **RETURNS** | int | `0` if the label is already present, otherwise `1`. |
|
|
||||||
|
|
||||||
## Pipe.use_params {#use_params tag="method, contextmanager"}
|
## Pipe.use_params {#use_params tag="method, contextmanager"}
|
||||||
|
|
||||||
|
@@ -297,9 +280,26 @@ context, the original parameters are restored.
|
||||||
> pipe.to_disk("/best_model")
|
> pipe.to_disk("/best_model")
|
||||||
> ```
|
> ```
|
||||||
|
|
||||||
| Name | Type | Description |
|
| Name | Description |
|
||||||
| -------- | ---- | ----------------------------------------- |
|
| -------- | -------------------------------------------------- |
|
||||||
| `params` | dict | The parameter values to use in the model. |
|
| `params` | The parameter values to use in the model. ~~dict~~ |
|
||||||
|
|
||||||
|
## Pipe.add_label {#add_label tag="method"}
|
||||||
|
|
||||||
|
Add a new label to the pipe. It's possible to extend pretrained models with new
|
||||||
|
labels, but care should be taken to avoid the "catastrophic forgetting" problem.
|
||||||
|
|
||||||
|
> #### Example
|
||||||
|
>
|
||||||
|
> ```python
|
||||||
|
> pipe = nlp.add_pipe("your_custom_pipe")
|
||||||
|
> pipe.add_label("MY_LABEL")
|
||||||
|
> ```
|
||||||
|
|
||||||
|
| Name | Description |
|
||||||
|
| ----------- | ----------------------------------------------------------- |
|
||||||
|
| `label` | The label to add. ~~str~~ |
|
||||||
|
| **RETURNS** | `0` if the label is already present, otherwise `1`. ~~int~~ |
|
||||||
|
|
||||||
## Pipe.to_disk {#to_disk tag="method"}
|
## Pipe.to_disk {#to_disk tag="method"}
|
||||||
|
|
||||||
|
@@ -312,11 +312,11 @@ Serialize the pipe to disk.
|
||||||
> pipe.to_disk("/path/to/pipe")
|
> pipe.to_disk("/path/to/pipe")
|
||||||
> ```
|
> ```
|
||||||
|
|
||||||
| Name | Type | Description |
|
| Name | Description |
|
||||||
| -------------- | --------------- | --------------------------------------------------------------------------------------------------------------------- |
|
| -------------- | ------------------------------------------------------------------------------------------------------------------------------------------ |
|
||||||
| `path` | str / `Path` | A path to a directory, which will be created if it doesn't exist. Paths may be either strings or `Path`-like objects. |
|
| `path` | A path to a directory, which will be created if it doesn't exist. Paths may be either strings or `Path`-like objects. ~~Union[str, Path]~~ |
|
||||||
| _keyword-only_ | | |
|
| _keyword-only_ | |
|
||||||
| `exclude` | `Iterable[str]` | String names of [serialization fields](#serialization-fields) to exclude. |
|
| `exclude` | String names of [serialization fields](#serialization-fields) to exclude. ~~Iterable[str]~~ |
|
||||||
|
|
||||||
## Pipe.from_disk {#from_disk tag="method"}
|
## Pipe.from_disk {#from_disk tag="method"}
|
||||||
|
|
||||||
|
@@ -329,12 +329,12 @@ Load the pipe from disk. Modifies the object in place and returns it.
|
||||||
> pipe.from_disk("/path/to/pipe")
|
> pipe.from_disk("/path/to/pipe")
|
||||||
> ```
|
> ```
|
||||||
|
|
||||||
| Name | Type | Description |
|
| Name | Description |
|
||||||
| -------------- | --------------- | -------------------------------------------------------------------------- |
|
| -------------- | ----------------------------------------------------------------------------------------------- |
|
||||||
| `path` | str / `Path` | A path to a directory. Paths may be either strings or `Path`-like objects. |
|
| `path` | A path to a directory. Paths may be either strings or `Path`-like objects. ~~Union[str, Path]~~ |
|
||||||
| _keyword-only_ | | |
|
| _keyword-only_ | |
|
||||||
| `exclude` | `Iterable[str]` | String names of [serialization fields](#serialization-fields) to exclude. |
|
| `exclude` | String names of [serialization fields](#serialization-fields) to exclude. ~~Iterable[str]~~ |
|
||||||
| **RETURNS** | `Pipe` | The modified pipe. |
|
| **RETURNS** | The modified pipe. ~~Pipe~~ |
|
||||||
|
|
||||||
## Pipe.to_bytes {#to_bytes tag="method"}
|
## Pipe.to_bytes {#to_bytes tag="method"}
|
||||||
|
|
||||||
|
@@ -347,11 +347,11 @@ Load the pipe from disk. Modifies the object in place and returns it.
|
||||||
|
|
||||||
Serialize the pipe to a bytestring.
|
Serialize the pipe to a bytestring.
|
||||||
|
|
||||||
| Name | Type | Description |
|
| Name | Description |
|
||||||
| -------------- | --------------- | ------------------------------------------------------------------------- |
|
| -------------- | ------------------------------------------------------------------------------------------- |
|
||||||
| _keyword-only_ | | |
|
| _keyword-only_ | |
|
||||||
| `exclude` | `Iterable[str]` | String names of [serialization fields](#serialization-fields) to exclude. |
|
| `exclude` | String names of [serialization fields](#serialization-fields) to exclude. ~~Iterable[str]~~ |
|
||||||
| **RETURNS** | bytes | The serialized form of the pipe. |
|
| **RETURNS** | The serialized form of the pipe. ~~bytes~~ |
|
||||||
|
|
||||||
## Pipe.from_bytes {#from_bytes tag="method"}
|
## Pipe.from_bytes {#from_bytes tag="method"}
|
||||||
|
|
||||||
|
@@ -365,21 +365,21 @@ Load the pipe from a bytestring. Modifies the object in place and returns it.
|
||||||
> pipe.from_bytes(pipe_bytes)
|
> pipe.from_bytes(pipe_bytes)
|
||||||
> ```
|
> ```
|
||||||
|
|
||||||
| Name | Type | Description |
|
| Name | Description |
|
||||||
| -------------- | --------------- | ------------------------------------------------------------------------- |
|
| -------------- | ------------------------------------------------------------------------------------------- |
|
||||||
| `bytes_data` | bytes | The data to load from. |
|
| `bytes_data` | The data to load from. ~~bytes~~ |
|
||||||
| _keyword-only_ | | |
|
| _keyword-only_ | |
|
||||||
| `exclude` | `Iterable[str]` | String names of [serialization fields](#serialization-fields) to exclude. |
|
| `exclude` | String names of [serialization fields](#serialization-fields) to exclude. ~~Iterable[str]~~ |
|
||||||
| **RETURNS** | `Pipe` | The pipe. |
|
| **RETURNS** | The pipe. ~~Pipe~~ |
|
||||||
|
|
||||||
## Attributes {#attributes}
|
## Attributes {#attributes}
|
||||||
|
|
||||||
| Name | Type | Description |
|
| Name | Description |
|
||||||
| ------- | ------------------------------------------ | ----------------------------------------------------------------------------------------------------- |
|
| ------- | ------------------------------------------------------------------------------------------------------------------------ |
|
||||||
| `vocab` | [`Vocab`](/api/vocab) | The shared vocabulary that's passed in on initialization. |
|
| `vocab` | The shared vocabulary that's passed in on initialization. ~~Vocab~~ |
|
||||||
| `model` | [`Model`](https://thinc.ai/docs/api-model) | The model powering the component. |
|
| `model` | The model powering the component. ~~Model[List[Doc], Any]~~ |
|
||||||
| `name` | str | The name of the component instance in the pipeline. Can be used in the losses. |
|
| `name` | The name of the component instance in the pipeline. Can be used in the losses. ~~str~~ |
|
||||||
| `cfg` | dict | Keyword arguments passed to [`Pipe.__init__`](/api/pipe#init). Will be serialized with the component. |
|
| `cfg` | Keyword arguments passed to [`Pipe.__init__`](/api/pipe#init). Will be serialized with the component. ~~Dict[str, Any]~~ |
|
||||||
|
|
||||||
## Serialization fields {#serialization-fields}
|
## Serialization fields {#serialization-fields}
|
||||||
|
|
||||||
|
|
|
@@ -33,10 +33,10 @@ all other components.
|
||||||
|
|
||||||
</Infobox>
|
</Infobox>
|
||||||
|
|
||||||
| Name | Type | Description |
|
| Name | Description |
|
||||||
| ----------- | ----- | ------------------------------------------------------------ |
|
| ----------- | -------------------------------------------------------------------- |
|
||||||
| `doc` | `Doc` | The `Doc` object to process, e.g. the `Doc` in the pipeline. |
|
| `doc` | The `Doc` object to process, e.g. the `Doc` in the pipeline. ~~Doc~~ |
|
||||||
| **RETURNS** | `Doc` | The modified `Doc` with merged noun chunks. |
|
| **RETURNS** | The modified `Doc` with merged noun chunks. ~~Doc~~ |
|
||||||
|
|
||||||
## merge_entities {#merge_entities tag="function"}
|
## merge_entities {#merge_entities tag="function"}
|
||||||
|
|
||||||
|
@@ -63,10 +63,10 @@ components to the end of the pipeline and after all other components.
|
||||||
|
|
||||||
</Infobox>
|
</Infobox>
|
||||||
|
|
||||||
| Name | Type | Description |
|
| Name | Description |
|
||||||
| ----------- | ----- | ------------------------------------------------------------ |
|
| ----------- | -------------------------------------------------------------------- |
|
||||||
| `doc` | `Doc` | The `Doc` object to process, e.g. the `Doc` in the pipeline. |
|
| `doc` | The `Doc` object to process, e.g. the `Doc` in the pipeline. ~~Doc~~ |
|
||||||
| **RETURNS** | `Doc` | The modified `Doc` with merged entities. |
|
| **RETURNS** | The modified `Doc` with merged entities. ~~Doc~~ |
|
||||||
|
|
||||||
## merge_subtokens {#merge_subtokens tag="function" new="2.1"}
|
## merge_subtokens {#merge_subtokens tag="function" new="2.1"}
|
||||||
|
|
||||||
|
@@ -102,8 +102,8 @@ end of the pipeline and after all other components.
|
||||||
|
|
||||||
</Infobox>
|
</Infobox>
|
||||||
|
|
||||||
| Name | Type | Description |
|
| Name | Description |
|
||||||
| ----------- | ----- | ------------------------------------------------------------ |
|
| ----------- | -------------------------------------------------------------------- |
|
||||||
| `doc` | `Doc` | The `Doc` object to process, e.g. the `Doc` in the pipeline. |
|
| `doc` | The `Doc` object to process, e.g. the `Doc` in the pipeline. ~~Doc~~ |
|
||||||
| `label` | str | The subtoken dependency label. Defaults to `"subtok"`. |
|
| `label` | The subtoken dependency label. Defaults to `"subtok"`. ~~str~~ |
|
||||||
| **RETURNS** | `Doc` | The modified `Doc` with merged subtokens. |
|
| **RETURNS** | The modified `Doc` with merged subtokens. ~~Doc~~ |
|
||||||
|
|
|
@@ -27,9 +27,9 @@ Create a new `Scorer`.
|
||||||
> scorer = Scorer(nlp)
|
> scorer = Scorer(nlp)
|
||||||
> ```
|
> ```
|
||||||
|
|
||||||
| Name | Type | Description |
|
| Name | Description |
|
||||||
| ----- | -------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
|
| ----- | ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
|
||||||
| `nlp` | Language | The pipeline to use for scoring, where each pipeline component may provide a scoring method. If none is provided, then a default pipeline for the multi-language code `xx` is constructed containing: `senter`, `tagger`, `morphologizer`, `parser`, `ner`, `textcat`. |
|
| `nlp` | The pipeline to use for scoring, where each pipeline component may provide a scoring method. If none is provided, then a default pipeline for the multi-language code `xx` is constructed containing: `senter`, `tagger`, `morphologizer`, `parser`, `ner`, `textcat`. ~~Language~~ |
|
||||||
|
|
||||||
## Scorer.score {#score tag="method"}
|
## Scorer.score {#score tag="method"}
|
||||||
|
|
||||||
|
@@ -55,10 +55,10 @@ attribute being scored:
|
||||||
> scores = scorer.score(examples)
|
> scores = scorer.score(examples)
|
||||||
> ```
|
> ```
|
||||||
|
|
||||||
| Name | Type | Description |
|
| Name | Description |
|
||||||
| ----------- | ------------------- | --------------------------------------------------------------------------------------------- |
|
| ----------- | ------------------------------------------------------------------------------------------------------------------- |
|
||||||
| `examples` | `Iterable[Example]` | The `Example` objects holding both the predictions and the correct gold-standard annotations. |
|
| `examples` | The `Example` objects holding both the predictions and the correct gold-standard annotations. ~~Iterable[Example]~~ |
|
||||||
| **RETURNS** | `Dict` | A dictionary of scores. |
|
| **RETURNS** | A dictionary of scores. ~~Dict[str, Union[float, Dict[str, float]]]~~ |
|
||||||
|
|
||||||
## Scorer.score_tokenization {#score_tokenization tag="staticmethod" new="3"}
|
## Scorer.score_tokenization {#score_tokenization tag="staticmethod" new="3"}
|
||||||
|
|
||||||
|
@@ -74,10 +74,10 @@ Scores the tokenization:
|
||||||
> scores = Scorer.score_tokenization(examples)
|
> scores = Scorer.score_tokenization(examples)
|
||||||
> ```
|
> ```
|
||||||
|
|
||||||
| Name | Type | Description |
|
| Name | Description |
|
||||||
| ----------- | ------------------- | --------------------------------------------------------------------------------------------- |
|
| ----------- | ------------------------------------------------------------------------------------------------------------------- |
|
||||||
| `examples` | `Iterable[Example]` | The `Example` objects holding both the predictions and the correct gold-standard annotations. |
|
| `examples` | The `Example` objects holding both the predictions and the correct gold-standard annotations. ~~Iterable[Example]~~ |
|
||||||
| **RETURNS** | `Dict` | A dictionary containing the scores `token_acc`, `token_p`, `token_r`, `token_f`. |
|
| **RETURNS** | A dictionary containing the scores `token_acc`, `token_p`, `token_r`, `token_f`. ~~Dict[str, float]~~ |
|
||||||
|
|
||||||
## Scorer.score_token_attr {#score_token_attr tag="staticmethod" new="3"}
|
## Scorer.score_token_attr {#score_token_attr tag="staticmethod" new="3"}
|
||||||
|
|
||||||
|
@@ -90,18 +90,19 @@ Scores a single token attribute.
|
||||||
> print(scores["pos_acc"])
|
> print(scores["pos_acc"])
|
||||||
> ```
|
> ```
|
||||||
|
|
||||||
| Name | Type | Description |
|
| Name | Description |
|
||||||
| -------------- | ------------------- | ----------------------------------------------------------------------------------------------------------------------------- |
|
| -------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------- |
|
||||||
| `examples` | `Iterable[Example]` | The `Example` objects holding both the predictions and the correct gold-standard annotations. |
|
| `examples` | The `Example` objects holding both the predictions and the correct gold-standard annotations. ~~Iterable[Example]~~ |
|
||||||
| `attr` | `str` | The attribute to score. |
|
| `attr` | The attribute to score. ~~str~~ |
|
||||||
| _keyword-only_ | | |
|
| _keyword-only_ | |
|
||||||
| `getter` | `Callable` | Defaults to `getattr`. If provided, `getter(token, attr)` should return the value of the attribute for an individual `Token`. |
|
| `getter` | Defaults to `getattr`. If provided, `getter(token, attr)` should return the value of the attribute for an individual `Token`. ~~Callable[[Token, str], Any]~~ |
|
||||||
| **RETURNS** | `Dict[str, float]` | A dictionary containing the score `{attr}_acc`. |
|
| **RETURNS** | A dictionary containing the score `{attr}_acc`. ~~Dict[str, float]~~ |
|
||||||
|
|
||||||
## Scorer.score_token_attr_per_feat {#score_token_attr_per_feat tag="staticmethod" new="3"}
|
## Scorer.score_token_attr_per_feat {#score_token_attr_per_feat tag="staticmethod" new="3"}
|
||||||
|
|
||||||
Scores a single token attribute per feature for a token attribute in
|
Scores a single token attribute per feature for a token attribute in the
|
||||||
[UFEATS](https://universaldependencies.org/format.html#morphological-annotation)
|
Universal Dependencies
|
||||||
|
[FEATS](https://universaldependencies.org/format.html#morphological-annotation)
|
||||||
format.
|
format.
|
||||||
|
|
||||||
> #### Example
|
> #### Example
|
||||||
|
@@ -111,13 +112,13 @@ format.
|
||||||
> print(scores["morph_per_feat"])
|
> print(scores["morph_per_feat"])
|
||||||
> ```
|
> ```
|
||||||
|
|
||||||
| Name | Type | Description |
|
| Name | Description |
|
||||||
| -------------- | ------------------- | ----------------------------------------------------------------------------------------------------------------------------- |
|
| -------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------- |
|
||||||
| `examples` | `Iterable[Example]` | The `Example` objects holding both the predictions and the correct gold-standard annotations. |
|
| `examples` | The `Example` objects holding both the predictions and the correct gold-standard annotations. ~~Iterable[Example]~~ |
|
||||||
| `attr` | `str` | The attribute to score. |
|
| `attr` | The attribute to score. ~~str~~ |
|
||||||
| _keyword-only_ | | |
|
| _keyword-only_ | |
|
||||||
| `getter` | `Callable` | Defaults to `getattr`. If provided, `getter(token, attr)` should return the value of the attribute for an individual `Token`. |
|
| `getter` | Defaults to `getattr`. If provided, `getter(token, attr)` should return the value of the attribute for an individual `Token`. ~~Callable[[Token, str], Any]~~ |
|
||||||
| **RETURNS** | `Dict` | A dictionary containing the per-feature PRF scores under the key `{attr}_per_feat`. |
|
| **RETURNS** | A dictionary containing the per-feature PRF scores under the key `{attr}_per_feat`. ~~Dict[str, Dict[str, float]]~~ |
|
||||||
|
|
||||||
## Scorer.score_spans {#score_spans tag="staticmethod" new="3"}

@ -130,13 +131,13 @@ Returns PRF scores for labeled or unlabeled spans.
> print(scores["ents_f"])
> ```

-| Name | Type | Description |
-| -------------- | ------------------- | --------------------------------------------------------------------------------------------------------------------------------------------- |
-| `examples` | `Iterable[Example]` | The `Example` objects holding both the predictions and the correct gold-standard annotations. |
-| `attr` | `str` | The attribute to score. |
-| _keyword-only_ | | |
-| `getter` | `Callable` | Defaults to `getattr`. If provided, `getter(doc, attr)` should return the `Span` objects for an individual `Doc`. |
-| **RETURNS** | `Dict` | A dictionary containing the PRF scores under the keys `{attr}_p`, `{attr}_r`, `{attr}_f` and the per-type PRF scores under `{attr}_per_type`. |
+| Name | Description |
+| -------------- | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
+| `examples` | The `Example` objects holding both the predictions and the correct gold-standard annotations. ~~Iterable[Example]~~ |
+| `attr` | The attribute to score. ~~str~~ |
+| _keyword-only_ | |
+| `getter` | Defaults to `getattr`. If provided, `getter(doc, attr)` should return the `Span` objects for an individual `Doc`. ~~Callable[[Doc, str], Iterable[Span]]~~ |
+| **RETURNS** | A dictionary containing the PRF scores under the keys `{attr}_p`, `{attr}_r`, `{attr}_f` and the per-type PRF scores under `{attr}_per_type`. ~~Dict[str, Union[float, Dict[str, float]]]~~ |

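The PRF numbers that `score_spans` reports can be illustrated with a short pure-Python sketch. This is a hypothetical helper for illustration only, not spaCy's implementation: spans are compared as `(start, end, label)` triples, and a predicted span only counts as correct if both boundaries and label match the gold span.

```python
def prf_spans(gold_spans, pred_spans):
    """Sketch of span-level precision/recall/F, as reported under
    {attr}_p, {attr}_r and {attr}_f. Spans are (start, end, label)."""
    gold = set(gold_spans)
    pred = set(pred_spans)
    tp = len(gold & pred)  # exact boundary + label matches
    p = tp / len(pred) if pred else 0.0
    r = tp / len(gold) if gold else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return {"p": p, "r": r, "f": f}
```

For example, predicting one of two gold entities with correct boundaries gives a precision of 1.0 and a recall of 0.5.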
## Scorer.score_deps {#score_deps tag="staticmethod" new="3"}

@ -159,16 +160,16 @@ Calculate the UAS, LAS, and LAS per type scores for dependency parses.
> print(scores["dep_uas"], scores["dep_las"])
> ```

-| Name | Type | Description |
-| --------------- | ------------------- | ----------------------------------------------------------------------------------------------------------------------------- |
-| `examples` | `Iterable[Example]` | The `Example` objects holding both the predictions and the correct gold-standard annotations. |
-| `attr` | `str` | The attribute containing the dependency label. |
-| _keyword-only_ | | |
-| `getter` | `Callable` | Defaults to `getattr`. If provided, `getter(token, attr)` should return the value of the attribute for an individual `Token`. |
-| `head_attr` | `str` | The attribute containing the head token. |
-| `head_getter` | `callable` | Defaults to `getattr`. If provided, `head_getter(token, attr)` should return the head for an individual `Token`. |
-| `ignore_labels` | `Tuple` | Labels to ignore while scoring (e.g., `punct`). |
-| **RETURNS** | `Dict` | A dictionary containing the scores: `{attr}_uas`, `{attr}_las`, and `{attr}_las_per_type`. |
+| Name | Description |
+| --------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------- |
+| `examples` | The `Example` objects holding both the predictions and the correct gold-standard annotations. ~~Iterable[Example]~~ |
+| `attr` | The attribute to score. ~~str~~ |
+| _keyword-only_ | |
+| `getter` | Defaults to `getattr`. If provided, `getter(token, attr)` should return the value of the attribute for an individual `Token`. ~~Callable[[Token, str], Any]~~ |
+| `head_attr` | The attribute containing the head token. ~~str~~ |
+| `head_getter` | Defaults to `getattr`. If provided, `head_getter(token, attr)` should return the head for an individual `Token`. ~~Callable[[Doc, str], Token]~~ |
+| `ignore_labels` | Labels to ignore while scoring (e.g. `"punct"`). ~~Iterable[str]~~ |
+| **RETURNS** | A dictionary containing the scores: `{attr}_uas`, `{attr}_las`, and `{attr}_las_per_type`. ~~Dict[str, Union[float, Dict[str, float]]]~~ |

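The attachment scores behind `score_deps` can be sketched in a few lines of pure Python (an illustration of the metric, not spaCy's code): UAS counts tokens whose predicted head matches the gold head, while LAS additionally requires the dependency label to match.

```python
def uas_las(gold, pred):
    """Sketch of unlabeled/labeled attachment scores. Each of gold and
    pred is a list with one (head_index, dep_label) pair per token."""
    n = len(gold)
    uas = sum(g[0] == p[0] for g, p in zip(gold, pred)) / n
    las = sum(g == p for g, p in zip(gold, pred)) / n
    return uas, las
```

A parse with every head correct but one wrong label thus scores a perfect UAS and a lower LAS.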
## Scorer.score_cats {#score_cats tag="staticmethod" new="3"}

@ -195,13 +196,13 @@ depends on the scorer settings:
> print(scores["cats_macro_auc"])
> ```

-| Name | Type | Description |
-| ---------------- | ------------------- | ------------------------------------------------------------------------------------------------------- |
-| `examples` | `Iterable[Example]` | The `Example` objects holding both the predictions and the correct gold-standard annotations. |
-| `attr` | `str` | The attribute to score. |
-| _keyword-only_ | | |
-| `getter` | `Callable` | Defaults to `getattr`. If provided, `getter(doc, attr)` should return the cats for an individual `Doc`. |
-| labels | `Iterable[str]` | The set of possible labels. Defaults to `[]`. |
-| `multi_label` | `bool` | Whether the attribute allows multiple labels. Defaults to `True`. |
-| `positive_label` | `str` | The positive label for a binary task with exclusive classes. Defaults to `None`. |
-| **RETURNS** | `Dict` | A dictionary containing the scores, with inapplicable scores as `None`. |
+| Name | Description |
+| ---------------- | -------------------------------------------------------------------------------------------------------------------------------------------------- |
+| `examples` | The `Example` objects holding both the predictions and the correct gold-standard annotations. ~~Iterable[Example]~~ |
+| `attr` | The attribute to score. ~~str~~ |
+| _keyword-only_ | |
+| `getter` | Defaults to `getattr`. If provided, `getter(doc, attr)` should return the cats for an individual `Doc`. ~~Callable[[Doc, str], Dict[str, float]]~~ |
+| labels | The set of possible labels. Defaults to `[]`. ~~Iterable[str]~~ |
+| `multi_label` | Whether the attribute allows multiple labels. Defaults to `True`. ~~bool~~ |
+| `positive_label` | The positive label for a binary task with exclusive classes. Defaults to `None`. ~~Optional[str]~~ |
+| **RETURNS** | A dictionary containing the scores, with inapplicable scores as `None`. ~~Dict[str, Optional[float]]~~ |

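One of the score families `score_cats` can produce, micro-averaged PRF over multiple labels, can be sketched as follows. This is a hypothetical helper for illustration, not spaCy's implementation: each example pairs a set of gold labels with a dict of predicted scores, and a label is treated as predicted when its score clears a threshold.

```python
def cats_micro_prf(examples, threshold=0.5):
    """Sketch of micro-averaged PRF over text categories.
    examples: list of (gold_labels: set, predicted_scores: dict)."""
    tp = fp = fn = 0
    for gold, scores in examples:
        pred = {label for label, score in scores.items() if score >= threshold}
        tp += len(gold & pred)   # correctly predicted labels
        fp += len(pred - gold)   # spurious labels
        fn += len(gold - pred)   # missed labels
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return {"p": p, "r": r, "f": f}
```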
@ -29,9 +29,9 @@ architectures and their arguments and hyperparameters.
> nlp.add_pipe("senter", config=config)
> ```

-| Setting | Type | Description | Default |
-| ------- | ------------------------------------------ | ----------------- | ----------------------------------- |
-| `model` | [`Model`](https://thinc.ai/docs/api-model) | The model to use. | [Tagger](/api/architectures#Tagger) |
+| Setting | Description |
+| ------- | --------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
+| `model` | The [`Model`](https://thinc.ai/docs/api-model) powering the pipeline component. Defaults to [Tagger](/api/architectures#Tagger). ~~Model[List[Doc], List[Floats2d]]~~ |

```python
https://github.com/explosion/spaCy/blob/develop/spacy/pipeline/senter.pyx

@ -60,11 +60,11 @@ Create a new pipeline instance. In your application, you would normally use a
shortcut for this and instantiate the component using its string name and
[`nlp.add_pipe`](/api/language#add_pipe).

-| Name | Type | Description |
-| ------- | ------- | ------------------------------------------------------------------------------------------- |
-| `vocab` | `Vocab` | The shared vocabulary. |
-| `model` | `Model` | The [`Model`](https://thinc.ai/docs/api-model) powering the pipeline component. |
-| `name` | str | String name of the component instance. Used to add entries to the `losses` during training. |
+| Name | Description |
+| ------- | -------------------------------------------------------------------------------------------------------------------- |
+| `vocab` | The shared vocabulary. ~~Vocab~~ |
+| `model` | The [`Model`](https://thinc.ai/docs/api-model) powering the pipeline component. ~~Model[List[Doc], List[Floats2d]]~~ |
+| `name` | String name of the component instance. Used to add entries to the `losses` during training. ~~str~~ |

## SentenceRecognizer.\_\_call\_\_ {#call tag="method"}

@ -85,10 +85,10 @@ and all pipeline components are applied to the `Doc` in order. Both
> processed = senter(doc)
> ```

-| Name | Type | Description |
-| ----------- | ----- | ------------------------ |
-| `doc` | `Doc` | The document to process. |
-| **RETURNS** | `Doc` | The processed document. |
+| Name | Description |
+| ----------- | -------------------------------- |
+| `doc` | The document to process. ~~Doc~~ |
+| **RETURNS** | The processed document. ~~Doc~~ |

## SentenceRecognizer.pipe {#pipe tag="method"}

@ -107,12 +107,12 @@ and [`pipe`](/api/sentencerecognizer#pipe) delegate to the
> pass
> ```

-| Name | Type | Description |
-| -------------- | --------------- | ------------------------------------------------------ |
-| `stream` | `Iterable[Doc]` | A stream of documents. |
-| _keyword-only_ | | |
-| `batch_size` | int | The number of texts to buffer. Defaults to `128`. |
-| **YIELDS** | `Doc` | Processed documents in the order of the original text. |
+| Name | Description |
+| -------------- | ------------------------------------------------------------- |
+| `stream` | A stream of documents. ~~Iterable[Doc]~~ |
+| _keyword-only_ | |
+| `batch_size` | The number of documents to buffer. Defaults to `128`. ~~int~~ |
+| **YIELDS** | The processed documents in order. ~~Doc~~ |

## SentenceRecognizer.begin_training {#begin_training tag="method"}

@ -132,13 +132,13 @@ setting up the label scheme based on the data.
> optimizer = senter.begin_training(lambda: [], pipeline=nlp.pipeline)
> ```

-| Name | Type | Description |
-| -------------- | --------------------------------------------------- | --------------------------------------------------------------------------------------------------------------------- |
-| `get_examples` | `Callable[[], Iterable[Example]]` | Optional function that returns gold-standard annotations in the form of [`Example`](/api/example) objects. |
-| _keyword-only_ | | |
-| `pipeline` | `List[Tuple[str, Callable]]` | Optional list of pipeline components that this component is part of. |
-| `sgd` | [`Optimizer`](https://thinc.ai/docs/api-optimizers) | An optional optimizer. Will be created via [`create_optimizer`](/api/sentencerecognizer#create_optimizer) if not set. |
-| **RETURNS** | [`Optimizer`](https://thinc.ai/docs/api-optimizers) | The optimizer. |
+| Name | Description |
+| -------------- | ------------------------------------------------------------------------------------------------------------------------------------- |
+| `get_examples` | Function that returns gold-standard annotations in the form of [`Example`](/api/example) objects. ~~Callable[[], Iterable[Example]]~~ |
+| _keyword-only_ | |
+| `pipeline` | Optional list of pipeline components that this component is part of. ~~Optional[List[Tuple[str, Callable[[Doc], Doc]]]]~~ |
+| `sgd` | An optimizer. Will be created via [`create_optimizer`](#create_optimizer) if not set. ~~Optional[Optimizer]~~ |
+| **RETURNS** | The optimizer. ~~Optimizer~~ |

## SentenceRecognizer.predict {#predict tag="method"}

@ -152,10 +152,10 @@ modifying them.
> scores = senter.predict([doc1, doc2])
> ```

-| Name | Type | Description |
-| ----------- | --------------- | ----------------------------------------- |
-| `docs` | `Iterable[Doc]` | The documents to predict. |
-| **RETURNS** | - | The model's prediction for each document. |
+| Name | Description |
+| ----------- | ------------------------------------------- |
+| `docs` | The documents to predict. ~~Iterable[Doc]~~ |
+| **RETURNS** | The model's prediction for each document. |

## SentenceRecognizer.set_annotations {#set_annotations tag="method"}

@ -169,10 +169,10 @@ Modify a batch of [`Doc`](/api/doc) objects, using pre-computed scores.
> senter.set_annotations([doc1, doc2], scores)
> ```

-| Name | Type | Description |
-| -------- | --------------- | ------------------------------------------------------------ |
-| `docs` | `Iterable[Doc]` | The documents to modify. |
-| `scores` | - | The scores to set, produced by `SentenceRecognizer.predict`. |
+| Name | Description |
+| -------- | ------------------------------------------------------------ |
+| `docs` | The documents to modify. ~~Iterable[Doc]~~ |
+| `scores` | The scores to set, produced by `SentenceRecognizer.predict`. |

## SentenceRecognizer.update {#update tag="method"}

@ -189,15 +189,15 @@ Delegates to [`predict`](/api/sentencerecognizer#predict) and
> losses = senter.update(examples, sgd=optimizer)
> ```

-| Name | Type | Description |
-| ----------------- | --------------------------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------ |
-| `examples` | `Iterable[Example]` | A batch of [`Example`](/api/example) objects to learn from. |
+| Name | Description |
+| ----------------- | ---------------------------------------------------------------------------------------------------------------------------------- |
+| `examples` | A batch of [`Example`](/api/example) objects to learn from. ~~Iterable[Example]~~ |
| _keyword-only_ | | |
-| `drop` | float | The dropout rate. |
-| `set_annotations` | bool | Whether or not to update the `Example` objects with the predictions, delegating to [`set_annotations`](/api/sentencerecognizer#set_annotations). |
-| `sgd` | [`Optimizer`](https://thinc.ai/docs/api-optimizers) | The optimizer. |
-| `losses` | `Dict[str, float]` | Optional record of the loss during training. The value keyed by the model's name is updated. |
-| **RETURNS** | `Dict[str, float]` | The updated `losses` dictionary. |
+| `drop` | The dropout rate. ~~float~~ |
+| `set_annotations` | Whether or not to update the `Example` objects with the predictions, delegating to [`set_annotations`](#set_annotations). ~~bool~~ |
+| `sgd` | An optimizer. Will be created via [`create_optimizer`](#create_optimizer) if not set. ~~Optional[Optimizer]~~ |
+| `losses` | Optional record of the loss during training. Updated using the component name as the key. ~~Optional[Dict[str, float]]~~ |
+| **RETURNS** | The updated `losses` dictionary. ~~Dict[str, float]~~ |

## SentenceRecognizer.rehearse {#rehearse tag="method,experimental" new="3"}

@ -213,14 +213,14 @@ the "catastrophic forgetting" problem. This feature is experimental.
> losses = senter.rehearse(examples, sgd=optimizer)
> ```

-| Name | Type | Description |
-| -------------- | --------------------------------------------------- | ----------------------------------------------------------------------------------------- |
-| `examples` | `Iterable[Example]` | A batch of [`Example`](/api/example) objects to learn from. |
+| Name | Description |
+| -------------- | ------------------------------------------------------------------------------------------------------------------------ |
+| `examples` | A batch of [`Example`](/api/example) objects to learn from. ~~Iterable[Example]~~ |
| _keyword-only_ | | |
-| `drop` | float | The dropout rate. |
-| `sgd` | [`Optimizer`](https://thinc.ai/docs/api-optimizers) | The optimizer. |
-| `losses` | `Dict[str, float]` | Optional record of the loss during training. Updated using the component name as the key. |
-| **RETURNS** | `Dict[str, float]` | The updated `losses` dictionary. |
+| `drop` | The dropout rate. ~~float~~ |
+| `sgd` | An optimizer. Will be created via [`create_optimizer`](#create_optimizer) if not set. ~~Optional[Optimizer]~~ |
+| `losses` | Optional record of the loss during training. Updated using the component name as the key. ~~Optional[Dict[str, float]]~~ |
+| **RETURNS** | The updated `losses` dictionary. ~~Dict[str, float]~~ |

## SentenceRecognizer.get_loss {#get_loss tag="method"}

@ -235,11 +235,11 @@ predicted scores.
> loss, d_loss = senter.get_loss(examples, scores)
> ```

-| Name | Type | Description |
-| ----------- | --------------------- | --------------------------------------------------- |
-| `examples` | `Iterable[Example]` | The batch of examples. |
-| `scores` | - | Scores representing the model's predictions. |
-| **RETURNS** | `Tuple[float, float]` | The loss and the gradient, i.e. `(loss, gradient)`. |
+| Name | Description |
+| ----------- | --------------------------------------------------------------------------- |
+| `examples` | The batch of examples. ~~Iterable[Example]~~ |
+| `scores` | Scores representing the model's predictions. |
+| **RETURNS** | The loss and the gradient, i.e. `(loss, gradient)`. ~~Tuple[float, float]~~ |

## SentenceRecognizer.score {#score tag="method" new="3"}

@ -251,10 +251,10 @@ Score a batch of examples.
> scores = senter.score(examples)
> ```

-| Name | Type | Description |
-| ----------- | ------------------- | ------------------------------------------------------------------------ |
-| `examples` | `Iterable[Example]` | The examples to score. |
-| **RETURNS** | `Dict[str, Any]` | The scores, produced by [`Scorer.score_spans`](/api/scorer#score_spans). |
+| Name | Description |
+| ----------- | --------------------------------------------------------------------------------------------------------------------------------------------------------- |
+| `examples` | The examples to score. ~~Iterable[Example]~~ |
+| **RETURNS** | The scores, produced by [`Scorer.score_token_attr`](/api/scorer#score_token_attr) for the attributes `"pos"`, `"tag"` and `"lemma"`. ~~Dict[str, float]~~ |

## SentenceRecognizer.create_optimizer {#create_optimizer tag="method"}

@ -267,9 +267,9 @@ Create an optimizer for the pipeline component.
> optimizer = senter.create_optimizer()
> ```

-| Name | Type | Description |
-| ----------- | --------------------------------------------------- | -------------- |
-| **RETURNS** | [`Optimizer`](https://thinc.ai/docs/api-optimizers) | The optimizer. |
+| Name | Description |
+| ----------- | ---------------------------- |
+| **RETURNS** | The optimizer. ~~Optimizer~~ |

## SentenceRecognizer.use_params {#use_params tag="method, contextmanager"}

@ -284,9 +284,9 @@ context, the original parameters are restored.
> senter.to_disk("/best_model")
> ```

-| Name | Type | Description |
-| -------- | ---- | ----------------------------------------- |
-| `params` | dict | The parameter values to use in the model. |
+| Name | Description |
+| -------- | -------------------------------------------------- |
+| `params` | The parameter values to use in the model. ~~dict~~ |

## SentenceRecognizer.to_disk {#to_disk tag="method"}

@ -299,11 +299,11 @@ Serialize the pipe to disk.
> senter.to_disk("/path/to/senter")
> ```

-| Name | Type | Description |
-| -------------- | --------------- | --------------------------------------------------------------------------------------------------------------------- |
-| `path` | str / `Path` | A path to a directory, which will be created if it doesn't exist. Paths may be either strings or `Path`-like objects. |
-| _keyword-only_ | | |
-| `exclude` | `Iterable[str]` | String names of [serialization fields](#serialization-fields) to exclude. |
+| Name | Description |
+| -------------- | ------------------------------------------------------------------------------------------------------------------------------------------ |
+| `path` | A path to a directory, which will be created if it doesn't exist. Paths may be either strings or `Path`-like objects. ~~Union[str, Path]~~ |
+| _keyword-only_ | |
+| `exclude` | String names of [serialization fields](#serialization-fields) to exclude. ~~Iterable[str]~~ |

## SentenceRecognizer.from_disk {#from_disk tag="method"}

@ -316,12 +316,12 @@ Load the pipe from disk. Modifies the object in place and returns it.
> senter.from_disk("/path/to/senter")
> ```

-| Name | Type | Description |
-| -------------- | -------------------- | -------------------------------------------------------------------------- |
-| `path` | str / `Path` | A path to a directory. Paths may be either strings or `Path`-like objects. |
-| _keyword-only_ | | |
-| `exclude` | `Iterable[str]` | String names of [serialization fields](#serialization-fields) to exclude. |
-| **RETURNS** | `SentenceRecognizer` | The modified `SentenceRecognizer` object. |
+| Name | Description |
+| -------------- | ----------------------------------------------------------------------------------------------- |
+| `path` | A path to a directory. Paths may be either strings or `Path`-like objects. ~~Union[str, Path]~~ |
+| _keyword-only_ | |
+| `exclude` | String names of [serialization fields](#serialization-fields) to exclude. ~~Iterable[str]~~ |
+| **RETURNS** | The modified `SentenceRecognizer` object. ~~SentenceRecognizer~~ |

## SentenceRecognizer.to_bytes {#to_bytes tag="method"}

@ -334,11 +334,11 @@ Load the pipe from disk. Modifies the object in place and returns it.

Serialize the pipe to a bytestring.

-| Name | Type | Description |
-| -------------- | --------------- | ------------------------------------------------------------------------- |
-| _keyword-only_ | | |
-| `exclude` | `Iterable[str]` | String names of [serialization fields](#serialization-fields) to exclude. |
-| **RETURNS** | bytes | The serialized form of the `SentenceRecognizer` object. |
+| Name | Description |
+| -------------- | ------------------------------------------------------------------------------------------- |
+| _keyword-only_ | |
+| `exclude` | String names of [serialization fields](#serialization-fields) to exclude. ~~Iterable[str]~~ |
+| **RETURNS** | The serialized form of the `SentenceRecognizer` object. ~~bytes~~ |

## SentenceRecognizer.from_bytes {#from_bytes tag="method"}

@ -352,12 +352,12 @@ Load the pipe from a bytestring. Modifies the object in place and returns it.
> senter.from_bytes(senter_bytes)
> ```

-| Name | Type | Description |
-| -------------- | -------------------- | ------------------------------------------------------------------------- |
-| `bytes_data` | bytes | The data to load from. |
-| _keyword-only_ | | |
-| `exclude` | `Iterable[str]` | String names of [serialization fields](#serialization-fields) to exclude. |
-| **RETURNS** | `SentenceRecognizer` | The `SentenceRecognizer` object. |
+| Name | Description |
+| -------------- | ------------------------------------------------------------------------------------------- |
+| `bytes_data` | The data to load from. ~~bytes~~ |
+| _keyword-only_ | |
+| `exclude` | String names of [serialization fields](#serialization-fields) to exclude. ~~Iterable[str]~~ |
+| **RETURNS** | The `SentenceRecognizer` object. ~~SentenceRecognizer~~ |

## Serialization fields {#serialization-fields}

@ -28,9 +28,9 @@ how the component should be configured. You can override its settings via the
> nlp.add_pipe("entity_ruler", config=config)
> ```

-| Setting | Type | Description | Default |
-| ------------- | ----------- | ---------------------------------------------------------------------------------------------------------- | ------- |
-| `punct_chars` | `List[str]` | Optional custom list of punctuation characters that mark sentence ends. See below for defaults if not set. | `None` |
+| Setting | Description |
+| ------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------ |
+| `punct_chars` | Optional custom list of punctuation characters that mark sentence ends. See below for defaults if not set. Defaults to `None`. ~~Optional[List[str]]~~ |

```python
https://github.com/explosion/spaCy/blob/develop/spacy/pipeline/sentencizer.pyx

@ -51,10 +51,10 @@ Initialize the sentencizer.
> sentencizer = Sentencizer()
> ```

-| Name | Type | Description |
-| -------------- | ----------- | ----------------------------------------------------------------------------------------------- |
+| Name | Description |
+| -------------- | ----------------------------------------------------------------------------------------------------------------------- |
| _keyword-only_ | | |
-| `punct_chars` | `List[str]` | Optional custom list of punctuation characters that mark sentence ends. See below for defaults. |
+| `punct_chars` | Optional custom list of punctuation characters that mark sentence ends. See below for defaults. ~~Optional[List[str]]~~ |

```python
### punct_chars defaults

@ -87,10 +87,10 @@ the component has been added to the pipeline using
|
||||||
> assert len(list(doc.sents)) == 2
|
> assert len(list(doc.sents)) == 2
|
||||||
> ```
|
> ```
|
||||||
|
|
||||||
| Name | Type | Description |
|
| Name | Description |
|
||||||
| ----------- | ----- | ------------------------------------------------------------ |
|
| ----------- | -------------------------------------------------------------------- |
|
||||||
| `doc` | `Doc` | The `Doc` object to process, e.g. the `Doc` in the pipeline. |
|
| `doc` | The `Doc` object to process, e.g. the `Doc` in the pipeline. ~~Doc~~ |
|
||||||
| **RETURNS** | `Doc` | The modified `Doc` with added sentence boundaries. |
|
| **RETURNS** | The modified `Doc` with added sentence boundaries. ~~Doc~~ |
|
||||||
|
|
||||||
## Sentencizer.pipe {#pipe tag="method"}

@@ -106,12 +106,12 @@ applied to the `Doc` in order.

> pass
> ```

-| Name | Type | Description |
+| Name | Description |
-| -------------- | --------------- | ----------------------------------------------------- |
+| -------------- | ------------------------------------------------------------- |
-| `stream` | `Iterable[Doc]` | A stream of documents. |
+| `stream` | A stream of documents. ~~Iterable[Doc]~~ |
-| _keyword-only_ | | |
+| _keyword-only_ | |
-| `batch_size` | int | The number of documents to buffer. Defaults to `128`. |
+| `batch_size` | The number of documents to buffer. Defaults to `128`. ~~int~~ |
-| **YIELDS** | `Doc` | The processed documents in order. |
+| **YIELDS** | The processed documents in order. ~~Doc~~ |
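The `pipe` signature above can be sketched with a blank pipeline; the example texts are illustrative:

```python
import spacy

nlp = spacy.blank("en")
nlp.add_pipe("sentencizer")
sentencizer = nlp.get_pipe("sentencizer")
texts = ["One sentence. Two sentences.", "Just one."]
# `pipe` consumes a stream of Doc objects; `batch_size` is keyword-only.
docs = list(sentencizer.pipe((nlp.make_doc(t) for t in texts), batch_size=2))
counts = [len(list(d.sents)) for d in docs]
assert counts == [2, 1]
```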
## Sentencizer.score {#score tag="method" new="3"}

@@ -123,10 +123,10 @@ Score a batch of examples.

> scores = sentencizer.score(examples)
> ```

-| Name | Type | Description |
+| Name | Description |
-| ----------- | ------------------- | ------------------------------------------------------------------------ |
+| ----------- | ---------------------------------------------------------------------------------------------------------------------- |
-| `examples` | `Iterable[Example]` | The examples to score. |
+| `examples` | The examples to score. ~~Iterable[Example]~~ |
-| **RETURNS** | `Dict[str, Any]` | The scores, produced by [`Scorer.score_spans`](/api/scorer#score_spans). |
+| **RETURNS** | The scores, produced by [`Scorer.score_spans`](/api/scorer#score_spans). ~~Dict[str, Union[float, Dict[str, float]]]~~ |
## Sentencizer.to_disk {#to_disk tag="method"}

@@ -142,9 +142,9 @@ a file `sentencizer.json`. This also happens automatically when you save an

> sentencizer.to_disk("/path/to/sentencizer.json")
> ```

-| Name | Type | Description |
+| Name | Description |
-| ------ | ------------ | --------------------------------------------------------------------------------------------------------------------- |
+| ------ | ------------------------------------------------------------------------------------------------------------------------------------------ |
-| `path` | str / `Path` | A path to a JSON file, which will be created if it doesn't exist. Paths may be either strings or `Path`-like objects. |
+| `path` | A path to a JSON file, which will be created if it doesn't exist. Paths may be either strings or `Path`-like objects. ~~Union[str, Path]~~ |

## Sentencizer.from_disk {#from_disk tag="method"}

@@ -159,10 +159,10 @@ added to its pipeline.

> sentencizer.from_disk("/path/to/sentencizer.json")
> ```

-| Name | Type | Description |
+| Name | Description |
-| ----------- | ------------- | -------------------------------------------------------------------------- |
+| ----------- | ----------------------------------------------------------------------------------------------- |
-| `path` | str / `Path` | A path to a JSON file. Paths may be either strings or `Path`-like objects. |
+| `path` | A path to a JSON file. Paths may be either strings or `Path`-like objects. ~~Union[str, Path]~~ |
-| **RETURNS** | `Sentencizer` | The modified `Sentencizer` object. |
+| **RETURNS** | The modified `Sentencizer` object. ~~Sentencizer~~ |
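The `to_disk`/`from_disk` round trip can be sketched with a temporary directory instead of the docs' placeholder path; the `punct_chars` value is illustrative:

```python
import tempfile
from pathlib import Path

from spacy.pipeline import Sentencizer

sentencizer = Sentencizer(punct_chars=["!"])
with tempfile.TemporaryDirectory() as tmp_dir:
    # Write the settings to sentencizer.json and load them into a fresh instance.
    path = Path(tmp_dir) / "sentencizer.json"
    sentencizer.to_disk(path)
    restored = Sentencizer().from_disk(path)
assert "!" in restored.punct_chars
```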
## Sentencizer.to_bytes {#to_bytes tag="method"}

@@ -176,9 +176,9 @@ Serialize the sentencizer settings to a bytestring.

> sentencizer_bytes = sentencizer.to_bytes()
> ```

-| Name | Type | Description |
+| Name | Description |
-| ----------- | ----- | -------------------- |
+| ----------- | ------------------------------ |
-| **RETURNS** | bytes | The serialized data. |
+| **RETURNS** | The serialized data. ~~bytes~~ |

## Sentencizer.from_bytes {#from_bytes tag="method"}

@@ -192,7 +192,7 @@ Load the pipe from a bytestring. Modifies the object in place and returns it.

> sentencizer.from_bytes(sentencizer_bytes)
> ```

-| Name | Type | Description |
+| Name | Description |
-| ------------ | ------------- | ---------------------------------- |
+| ------------ | -------------------------------------------------- |
-| `bytes_data` | bytes | The bytestring to load. |
+| `bytes_data` | The bytestring to load. ~~bytes~~ |
-| **RETURNS** | `Sentencizer` | The modified `Sentencizer` object. |
+| **RETURNS** | The modified `Sentencizer` object. ~~Sentencizer~~ |
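The bytestring round trip mirrors the disk round trip; a minimal sketch with an illustrative `punct_chars` value:

```python
from spacy.pipeline import Sentencizer

sentencizer = Sentencizer(punct_chars=["?", "!"])
# Serialize the settings and restore them into a fresh instance.
data = sentencizer.to_bytes()
assert isinstance(data, bytes)
restored = Sentencizer().from_bytes(data)
assert "?" in restored.punct_chars
```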
@@ -18,14 +18,14 @@ Create a Span object from the slice `doc[start : end]`.

> assert [t.text for t in span] == ["it", "back", "!"]
> ```

-| Name | Type | Description |
+| Name | Description |
-| -------- | ---------------------------------------- | --------------------------------------------------------------------------------------------------------- |
+| -------- | --------------------------------------------------------------------------------------- |
-| `doc` | `Doc` | The parent document. |
+| `doc` | The parent document. ~~Doc~~ |
-| `start` | int | The index of the first token of the span. |
+| `start` | The index of the first token of the span. ~~int~~ |
-| `end` | int | The index of the first token after the span. |
+| `end` | The index of the first token after the span. ~~int~~ |
-| `label` | int / str | A label to attach to the span, e.g. for named entities. As of v2.1, the label can also be a string. |
+| `label` | A label to attach to the span, e.g. for named entities. ~~Union[str, int]~~ |
-| `kb_id` | int / str | A knowledge base ID to attach to the span, e.g. for named entities. The ID can be an integer or a string. |
+| `kb_id` | A knowledge base ID to attach to the span, e.g. for named entities. ~~Union[str, int]~~ |
-| `vector` | `numpy.ndarray[ndim=1, dtype="float32"]` | A meaning representation of the span. |
+| `vector` | A meaning representation of the span. ~~numpy.ndarray[ndim=1, dtype=float32]~~ |
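The constructor parameters above can be sketched with a blank pipeline; `"FRAGMENT"` is an illustrative label, not a built-in entity type:

```python
import spacy
from spacy.tokens import Span

nlp = spacy.blank("en")
doc = nlp("Give it back! He pleaded.")
# Tokens 1-3 inclusive, with a string label.
span = Span(doc, 1, 4, label="FRAGMENT")
assert span.text == "it back!"
assert span.label_ == "FRAGMENT"
```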
## Span.\_\_getitem\_\_ {#getitem tag="method"}

@@ -39,10 +39,10 @@ Get a `Token` object.

> assert span[1].text == "back"
> ```

-| Name | Type | Description |
+| Name | Description |
-| ----------- | ------- | --------------------------------------- |
+| ----------- | ----------------------------------------------- |
-| `i` | int | The index of the token within the span. |
+| `i` | The index of the token within the span. ~~int~~ |
-| **RETURNS** | `Token` | The token at `span[i]`. |
+| **RETURNS** | The token at `span[i]`. ~~Token~~ |

Get a `Span` object.

@@ -54,10 +54,10 @@ Get a `Span` object.

> assert span[1:3].text == "back!"
> ```

-| Name | Type | Description |
+| Name | Description |
-| ----------- | ------ | -------------------------------- |
+| ----------- | ------------------------------------------------- |
-| `start_end` | tuple | The slice of the span to get. |
+| `start_end` | The slice of the span to get. ~~Tuple[int, int]~~ |
-| **RETURNS** | `Span` | The span at `span[start : end]`. |
+| **RETURNS** | The span at `span[start : end]`. ~~Span~~ |

## Span.\_\_iter\_\_ {#iter tag="method"}

@@ -71,9 +71,9 @@ Iterate over `Token` objects.

> assert [t.text for t in span] == ["it", "back", "!"]
> ```

-| Name | Type | Description |
+| Name | Description |
-| ---------- | ------- | ----------------- |
+| ---------- | --------------------------- |
-| **YIELDS** | `Token` | A `Token` object. |
+| **YIELDS** | A `Token` object. ~~Token~~ |

## Span.\_\_len\_\_ {#len tag="method"}

@@ -87,9 +87,9 @@ Get the number of tokens in the span.

> assert len(span) == 3
> ```

-| Name | Type | Description |
+| Name | Description |
-| ----------- | ---- | --------------------------------- |
+| ----------- | ----------------------------------------- |
-| **RETURNS** | int | The number of tokens in the span. |
+| **RETURNS** | The number of tokens in the span. ~~int~~ |
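The dunder methods above can all be exercised on one span; a minimal sketch using a blank pipeline:

```python
import spacy

nlp = spacy.blank("en")
doc = nlp("Give it back! He pleaded.")
span = doc[1:4]  # "it back!"
assert span[1].text == "back"                         # __getitem__ with an int -> Token
assert span[1:3].text == "back!"                      # __getitem__ with a slice -> Span
assert [t.text for t in span] == ["it", "back", "!"]  # __iter__
assert len(span) == 3                                 # __len__
```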
## Span.set_extension {#set_extension tag="classmethod" new="2"}

@@ -107,14 +107,14 @@ For details, see the documentation on

> assert doc[1:4]._.has_city
> ```

-| Name | Type | Description |
+| Name | Description |
-| --------- | -------- | ------------------------------------------------------------------------------------------------------------------------------------- |
+| --------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
-| `name` | str | Name of the attribute to set by the extension. For example, `"my_attr"` will be available as `span._.my_attr`. |
+| `name` | Name of the attribute to set by the extension. For example, `"my_attr"` will be available as `span._.my_attr`. ~~str~~ |
-| `default` | - | Optional default value of the attribute if no getter or method is defined. |
+| `default` | Optional default value of the attribute if no getter or method is defined. ~~Optional[Any]~~ |
-| `method` | callable | Set a custom method on the object, for example `span._.compare(other_span)`. |
+| `method` | Set a custom method on the object, for example `span._.compare(other_span)`. ~~Optional[Callable[[Span, ...], Any]]~~ |
-| `getter` | callable | Getter function that takes the object and returns an attribute value. Is called when the user accesses the `._` attribute. |
+| `getter` | Getter function that takes the object and returns an attribute value. Is called when the user accesses the `._` attribute. ~~Optional[Callable[[Span], Any]]~~ |
-| `setter` | callable | Setter function that takes the `Span` and a value, and modifies the object. Is called when the user writes to the `Span._` attribute. |
+| `setter` | Setter function that takes the `Span` and a value, and modifies the object. Is called when the user writes to the `Span._` attribute. ~~Optional[Callable[[Span, Any], None]]~~ |
-| `force` | bool | Force overwriting existing attribute. |
+| `force` | Force overwriting existing attribute. ~~bool~~ |

## Span.get_extension {#get_extension tag="classmethod" new="2"}

@@ -131,10 +131,10 @@ Look up a previously registered extension by name. Returns a 4-tuple

> assert extension == (False, None, None, None)
> ```

-| Name | Type | Description |
+| Name | Description |
-| ----------- | ----- | ------------------------------------------------------------- |
+| ----------- | -------------------------------------------------------------------------------------------------------------------------------------------------- |
-| `name` | str | Name of the extension. |
+| `name` | Name of the extension. ~~str~~ |
-| **RETURNS** | tuple | A `(default, method, getter, setter)` tuple of the extension. |
+| **RETURNS** | A `(default, method, getter, setter)` tuple of the extension. ~~Tuple[Optional[Any], Optional[Callable], Optional[Callable], Optional[Callable]]~~ |

## Span.has_extension {#has_extension tag="classmethod" new="2"}

@@ -148,10 +148,10 @@ Check whether an extension has been registered on the `Span` class.

> assert Span.has_extension("is_city")
> ```

-| Name | Type | Description |
+| Name | Description |
-| ----------- | ---- | ------------------------------------------ |
+| ----------- | --------------------------------------------------- |
-| `name` | str | Name of the extension to check. |
+| `name` | Name of the extension to check. ~~str~~ |
-| **RETURNS** | bool | Whether the extension has been registered. |
+| **RETURNS** | Whether the extension has been registered. ~~bool~~ |

## Span.remove_extension {#remove_extension tag="classmethod" new="2.0.12"}

@@ -166,10 +166,10 @@ Remove a previously registered extension.

> assert not Span.has_extension("is_city")
> ```

-| Name | Type | Description |
+| Name | Description |
-| ----------- | ----- | --------------------------------------------------------------------- |
+| ----------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------- |
-| `name` | str | Name of the extension. |
+| `name` | Name of the extension. ~~str~~ |
-| **RETURNS** | tuple | A `(default, method, getter, setter)` tuple of the removed extension. |
+| **RETURNS** | A `(default, method, getter, setter)` tuple of the removed extension. ~~Tuple[Optional[Any], Optional[Callable], Optional[Callable], Optional[Callable]]~~ |
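The full extension lifecycle — register, query, and remove — can be sketched in one snippet; `has_city` and its getter are illustrative names:

```python
import spacy
from spacy.tokens import Span

nlp = spacy.blank("en")
city_getter = lambda span: any(city in span.text for city in ("Berlin", "Paris"))
# force=True guards against double registration if the snippet runs twice.
Span.set_extension("has_city", getter=city_getter, force=True)
doc = nlp("I like Berlin in winter")
assert doc[1:4]._.has_city
assert Span.has_extension("has_city")
default, method, getter, setter = Span.get_extension("has_city")
assert getter is city_getter
Span.remove_extension("has_city")
assert not Span.has_extension("has_city")
```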
## Span.char_span {#char_span tag="method" new="2.2.4"}

@@ -184,14 +184,14 @@ the character indices don't map to a valid span.

> assert span.text == "New York"
> ```

-| Name | Type | Description |
+| Name | Description |
-| ----------- | ---------------------------------------- | --------------------------------------------------------------------- |
+| ------------------------------------ | ----------------------------------------------------------------------------------------- |
-| `start` | int | The index of the first character of the span. |
+| `start` | The index of the first character of the span. ~~int~~ |
-| `end` | int | The index of the last character after the span. |
+| `end` | The index of the last character after the span. ~~int~~ |
-| `label` | uint64 / str | A label to attach to the span, e.g. for named entities. |
+| `label` | A label to attach to the span, e.g. for named entities. ~~Union[int, str]~~ |
-| `kb_id` | uint64 / str | An ID from a knowledge base to capture the meaning of a named entity. |
+| `kb_id` <Tag variant="new">2.2</Tag> | An ID from a knowledge base to capture the meaning of a named entity. ~~Union[int, str]~~ |
-| `vector` | `numpy.ndarray[ndim=1, dtype="float32"]` | A meaning representation of the span. |
+| `vector` | A meaning representation of the span. ~~numpy.ndarray[ndim=1, dtype=float32]~~ |
-| **RETURNS** | `Span` | The newly constructed object or `None`. |
+| **RETURNS** | The newly constructed object or `None`. ~~Optional[Span]~~ |
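A sketch of `char_span` with a blank pipeline, including the `None` case for misaligned indices; note the character offsets are relative to the span's own text:

```python
import spacy

nlp = spacy.blank("en")
doc = nlp("I like New York in Autumn")
span = doc[1:4]  # "like New York"
# Offsets [5, 13) within the span's text cover "New York".
inner = span.char_span(5, 13)
assert inner.text == "New York"
# Offsets that don't map to token boundaries return None.
assert span.char_span(5, 12) is None
```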
## Span.similarity {#similarity tag="method" model="vectors"}

@@ -209,10 +209,10 @@ using an average of word vectors.

> assert apples_oranges == oranges_apples
> ```

-| Name | Type | Description |
+| Name | Description |
-| ----------- | ----- | -------------------------------------------------------------------------------------------- |
+| ----------- | -------------------------------------------------------------------------------------------------------------------------------- |
-| `other` | - | The object to compare with. By default, accepts `Doc`, `Span`, `Token` and `Lexeme` objects. |
+| `other` | The object to compare with. By default, accepts `Doc`, `Span`, `Token` and `Lexeme` objects. ~~Union[Doc, Span, Token, Lexeme]~~ |
-| **RETURNS** | float | A scalar similarity score. Higher is more similar. |
+| **RETURNS** | A scalar similarity score. Higher is more similar. ~~float~~ |
## Span.get_lca_matrix {#get_lca_matrix tag="method"}

@@ -229,9 +229,9 @@ ancestor is found, e.g. if span excludes a necessary ancestor.

> # array([[0, 0, 0], [0, 1, 2], [0, 2, 2]], dtype=int32)
> ```

-| Name | Type | Description |
+| Name | Description |
-| ----------- | -------------------------------------- | ------------------------------------------------ |
+| ----------- | --------------------------------------------------------------------------------------- |
-| **RETURNS** | `numpy.ndarray[ndim=2, dtype="int32"]` | The lowest common ancestor matrix of the `Span`. |
+| **RETURNS** | The lowest common ancestor matrix of the `Span`. ~~numpy.ndarray[ndim=2, dtype=int32]~~ |
## Span.to_array {#to_array tag="method" new="2"}

@@ -249,10 +249,10 @@ shape `(N, M)`, where `N` is the length of the document. The values will be

> np_array = span.to_array([LOWER, POS, ENT_TYPE, IS_ALPHA])
> ```

-| Name | Type | Description |
+| Name | Description |
-| ----------- | ----------------------------- | -------------------------------------------------------------------------------------------------------- |
+| ----------- | ---------------------------------------------------------------------------------------------------------------------------------------- |
-| `attr_ids` | list | A list of attribute ID ints. |
+| `attr_ids` | A list of attributes (int IDs or string names) or a single attribute (int ID or string name). ~~Union[int, str, List[Union[int, str]]]~~ |
-| **RETURNS** | `numpy.ndarray[long, ndim=2]` | A feature matrix, with one row per word, and one column per attribute indicated in the input `attr_ids`. |
+| **RETURNS** | The exported attributes as a numpy array. ~~Union[numpy.ndarray[ndim=2, dtype=uint64], numpy.ndarray[ndim=1, dtype=uint64]]~~ |
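A sketch of `to_array` on a blank pipeline, using the list-of-attributes form from the table above:

```python
import spacy
from spacy.attrs import LOWER, IS_ALPHA

nlp = spacy.blank("en")
doc = nlp("Give it back! He pleaded.")
span = doc[1:4]  # "it back!"
# One row per token, one column per requested attribute.
arr = span.to_array([LOWER, IS_ALPHA])
assert arr.shape == (3, 2)
```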
## Span.ents {#ents tag="property" new="2.0.13" model="ner"}

@@ -270,9 +270,9 @@ if the entity recognizer has been applied.

> assert ents[0].text == "Mr. Best"
> ```

-| Name | Type | Description |
+| Name | Description |
-| ----------- | ----- | -------------------------------------------- |
+| ----------- | ----------------------------------------------------------------- |
-| **RETURNS** | tuple | Entities in the span, one `Span` per entity. |
+| **RETURNS** | Entities in the span, one `Span` per entity. ~~Tuple[Span, ...]~~ |

## Span.as_doc {#as_doc tag="method"}

@@ -287,10 +287,10 @@ Create a new `Doc` object corresponding to the `Span`, with a copy of the data.

> assert doc2.text == "New York"
> ```

-| Name | Type | Description |
+| Name | Description |
-| ---------------- | ----- | ---------------------------------------------------- |
+| ---------------- | ------------------------------------------------------------- |
-| `copy_user_data` | bool | Whether or not to copy the original doc's user data. |
+| `copy_user_data` | Whether or not to copy the original doc's user data. ~~bool~~ |
-| **RETURNS** | `Doc` | A `Doc` object of the `Span`'s content. |
+| **RETURNS** | A `Doc` object of the `Span`'s content. ~~Doc~~ |
## Span.root {#root tag="property" model="parser"}

@@ -309,9 +309,9 @@ taken.

> assert new_york.root.text == "York"
> ```

-| Name | Type | Description |
+| Name | Description |
-| ----------- | ------- | --------------- |
+| ----------- | ------------------------- |
-| **RETURNS** | `Token` | The root token. |
+| **RETURNS** | The root token. ~~Token~~ |

## Span.conjuncts {#conjuncts tag="property" model="parser"}

@@ -325,9 +325,9 @@ A tuple of tokens coordinated to `span.root`.

> assert [t.text for t in apples_conjuncts] == ["oranges"]
> ```

-| Name | Type | Description |
+| Name | Description |
-| ----------- | ------- | ----------------------- |
+| ----------- | --------------------------------------------- |
-| **RETURNS** | `tuple` | The coordinated tokens. |
+| **RETURNS** | The coordinated tokens. ~~Tuple[Token, ...]~~ |
## Span.lefts {#lefts tag="property" model="parser"}

@@ -341,9 +341,9 @@ Tokens that are to the left of the span, whose heads are within the span.

> assert lefts == ["New"]
> ```

-| Name | Type | Description |
+| Name | Description |
-| ---------- | ------- | ------------------------------------ |
+| ---------- | ---------------------------------------------- |
-| **YIELDS** | `Token` | A left-child of a token of the span. |
+| **YIELDS** | A left-child of a token of the span. ~~Token~~ |

## Span.rights {#rights tag="property" model="parser"}

@@ -357,9 +357,9 @@ Tokens that are to the right of the span, whose heads are within the span.

> assert rights == ["in"]
> ```

-| Name | Type | Description |
+| Name | Description |
-| ---------- | ------- | ------------------------------------- |
+| ---------- | ----------------------------------------------- |
-| **YIELDS** | `Token` | A right-child of a token of the span. |
+| **YIELDS** | A right-child of a token of the span. ~~Token~~ |
## Span.n_lefts {#n_lefts tag="property" model="parser"}

@@ -373,9 +373,9 @@ the span.

> assert doc[3:7].n_lefts == 1
> ```

-| Name | Type | Description |
+| Name | Description |
-| ----------- | ---- | -------------------------------- |
+| ----------- | ---------------------------------------- |
-| **RETURNS** | int | The number of left-child tokens. |
+| **RETURNS** | The number of left-child tokens. ~~int~~ |

## Span.n_rights {#n_rights tag="property" model="parser"}

@@ -389,9 +389,9 @@ the span.

> assert doc[2:4].n_rights == 1
> ```

-| Name | Type | Description |
+| Name | Description |
-| ----------- | ---- | --------------------------------- |
+| ----------- | ----------------------------------------- |
-| **RETURNS** | int | The number of right-child tokens. |
+| **RETURNS** | The number of right-child tokens. ~~int~~ |
## Span.subtree {#subtree tag="property" model="parser"}

@@ -405,9 +405,9 @@ Tokens within the span and tokens which descend from them.

> assert subtree == ["Give", "it", "back", "!"]
> ```

-| Name | Type | Description |
+| Name | Description |
-| ---------- | ------- | ------------------------------------------------- |
+| ---------- | ----------------------------------------------------------- |
-| **YIELDS** | `Token` | A token within the span, or a descendant from it. |
+| **YIELDS** | A token within the span, or a descendant from it. ~~Token~~ |

## Span.has_vector {#has_vector tag="property" model="vectors"}

@@ -420,9 +420,9 @@ A boolean value indicating whether a word vector is associated with the object.

> assert doc[1:].has_vector
> ```

-| Name | Type | Description |
+| Name | Description |
-| ----------- | ---- | -------------------------------------------- |
+| ----------- | ----------------------------------------------------- |
-| **RETURNS** | bool | Whether the span has a vector data attached. |
+| **RETURNS** | Whether the span has a vector data attached. ~~bool~~ |
## Span.vector {#vector tag="property" model="vectors"}

@@ -437,9 +437,9 @@ vectors.

> assert doc[1:].vector.shape == (300,)
> ```

-| Name | Type | Description |
+| Name | Description |
-| ----------- | ---------------------------------------- | --------------------------------------------------- |
+| ----------- | ---------------------------------------------------------------------------------------------- |
-| **RETURNS** | `numpy.ndarray[ndim=1, dtype="float32"]` | A 1D numpy array representing the span's semantics. |
+| **RETURNS** | A 1-dimensional array representing the span's vector. ~~numpy.ndarray[ndim=1, dtype=float32]~~ |

## Span.vector_norm {#vector_norm tag="property" model="vectors"}

@@ -454,31 +454,31 @@ The L2 norm of the span's vector representation.

> assert doc[1:].vector_norm != doc[2:].vector_norm
> ```

-| Name | Type | Description |
+| Name | Description |
-| ----------- | ----- | ----------------------------------------- |
+| ----------- | --------------------------------------------------- |
-| **RETURNS** | float | The L2 norm of the vector representation. |
+| **RETURNS** | The L2 norm of the vector representation. ~~float~~ |
## Attributes {#attributes}

-| Name | Type | Description |
+| Name | Description |
-| --------------------------------------- | ------------ | -------------------------------------------------------------------------------------------------------------- |
+| --------------------------------------- | ----------------------------------------------------------------------------------------------------------------------------- |
-| `doc` | `Doc` | The parent document. |
+| `doc` | The parent document. ~~Doc~~ |
-| `tensor` <Tag variant="new">2.1.7</Tag> | `ndarray` | The span's slice of the parent `Doc`'s tensor. |
+| `tensor` <Tag variant="new">2.1.7</Tag> | The span's slice of the parent `Doc`'s tensor. ~~numpy.ndarray~~ |
-| `sent` | `Span` | The sentence span that this span is a part of. |
+| `sent` | The sentence span that this span is a part of. ~~Span~~ |
-| `start` | int | The token offset for the start of the span. |
+| `start` | The token offset for the start of the span. ~~int~~ |
-| `end` | int | The token offset for the end of the span. |
+| `end` | The token offset for the end of the span. ~~int~~ |
-| `start_char` | int | The character offset for the start of the span. |
+| `start_char` | The character offset for the start of the span. ~~int~~ |
-| `end_char` | int | The character offset for the end of the span. |
+| `end_char` | The character offset for the end of the span. ~~int~~ |
-| `text` | str | A string representation of the span text. |
+| `text` | A string representation of the span text. ~~str~~ |
-| `text_with_ws` | str | The text content of the span with a trailing whitespace character if the last token has one. |
+| `text_with_ws` | The text content of the span with a trailing whitespace character if the last token has one. ~~str~~ |
-| `orth` | int | ID of the verbatim text content. |
+| `orth` | ID of the verbatim text content. ~~int~~ |
-| `orth_` | str | Verbatim text content (identical to `Span.text`). Exists mostly for consistency with the other attributes. |
+| `orth_` | Verbatim text content (identical to `Span.text`). Exists mostly for consistency with the other attributes. ~~str~~ |
-| `label` | int | The hash value of the span's label. |
+| `label` | The hash value of the span's label. ~~int~~ |
-| `label_` | str | The span's label. |
+| `label_` | The span's label. ~~str~~ |
-| `lemma_` | str | The span's lemma. |
+| `lemma_` | The span's lemma. Equivalent to `"".join(token.lemma_ + token.whitespace_ for token in span).strip()`. ~~str~~ |
-| `kb_id` | int | The hash value of the knowledge base ID referred to by the span. |
+| `kb_id` | The hash value of the knowledge base ID referred to by the span. ~~int~~ |
-| `kb_id_` | str | The knowledge base ID referred to by the span. |
+| `kb_id_` | The knowledge base ID referred to by the span. ~~str~~ |
-| `ent_id` | int | The hash value of the named entity the token is an instance of. |
+| `ent_id` | The hash value of the named entity the token is an instance of. ~~int~~ |
-| `ent_id_` | str | The string ID of the named entity the token is an instance of. |
+| `ent_id_` | The string ID of the named entity the token is an instance of. ~~str~~ |
-| `sentiment` | float | A scalar value indicating the positivity or negativity of the span. |
+| `sentiment` | A scalar value indicating the positivity or negativity of the span. ~~float~~ |
-| `_` | `Underscore` | User space for adding custom [attribute extensions](/usage/processing-pipelines#custom-components-attributes). |
+| `_` | User space for adding custom [attribute extensions](/usage/processing-pipelines#custom-components-attributes). ~~Underscore~~ |
@@ -19,9 +19,9 @@ Create the `StringStore`.
|
||||||
> stringstore = StringStore(["apple", "orange"])
|
> stringstore = StringStore(["apple", "orange"])
|
||||||
> ```
|
> ```
|
||||||
|
|
||||||
| Name | Type | Description |
|
| Name | Description |
|
||||||
| --------- | -------- | ------------------------------------------ |
|
| --------- | ---------------------------------------------------------------------- |
|
||||||
| `strings` | iterable | A sequence of strings to add to the store. |
|
| `strings` | A sequence of strings to add to the store. ~~Optional[Iterable[str]]~~ |
|
||||||
|
|
||||||
## StringStore.\_\_len\_\_ {#len tag="method"}
|
## StringStore.\_\_len\_\_ {#len tag="method"}
|
||||||
|
|
||||||
|
@@ -34,9 +34,9 @@ Get the number of strings in the store.
|
||||||
> assert len(stringstore) == 2
|
> assert len(stringstore) == 2
|
||||||
> ```
|
> ```
|
||||||
|
|
||||||
| Name | Type | Description |
|
| Name | Description |
|
||||||
| ----------- | ---- | ----------------------------------- |
|
| ----------- | ------------------------------------------- |
|
||||||
| **RETURNS** | int | The number of strings in the store. |
|
| **RETURNS** | The number of strings in the store. ~~int~~ |
|
||||||
|
|
||||||
## StringStore.\_\_getitem\_\_ {#getitem tag="method"}
|
## StringStore.\_\_getitem\_\_ {#getitem tag="method"}
|
||||||
|
|
||||||
|
@@ -51,10 +51,10 @@ Retrieve a string from a given hash, or vice versa.
|
||||||
> assert stringstore[apple_hash] == "apple"
|
> assert stringstore[apple_hash] == "apple"
|
||||||
> ```
|
> ```
|
||||||
|
|
||||||
| Name | Type | Description |
|
| Name | Description |
|
||||||
| -------------- | -------------------- | -------------------------- |
|
| -------------- | ----------------------------------------------- |
|
||||||
| `string_or_id` | bytes, str or uint64 | The value to encode. |
|
| `string_or_id` | The value to encode. ~~Union[bytes, str, int]~~ |
|
||||||
| **RETURNS** | str or int | The value to be retrieved. |
|
| **RETURNS** | The value to be retrieved. ~~Union[str, int]~~ |
|
||||||
|
|
||||||
## StringStore.\_\_contains\_\_ {#contains tag="method"}
|
## StringStore.\_\_contains\_\_ {#contains tag="method"}
|
||||||
|
|
||||||
|
@@ -68,15 +68,15 @@ Check whether a string is in the store.
|
||||||
> assert not "cherry" in stringstore
|
> assert not "cherry" in stringstore
|
||||||
> ```
|
> ```
|
||||||
|
|
||||||
| Name | Type | Description |
|
| Name | Description |
|
||||||
| ----------- | ---- | -------------------------------------- |
|
| ----------- | ----------------------------------------------- |
|
||||||
| `string` | str | The string to check. |
|
| `string` | The string to check. ~~str~~ |
|
||||||
| **RETURNS** | bool | Whether the store contains the string. |
|
| **RETURNS** | Whether the store contains the string. ~~bool~~ |
|
||||||
|
|
||||||
## StringStore.\_\_iter\_\_ {#iter tag="method"}
|
## StringStore.\_\_iter\_\_ {#iter tag="method"}
|
||||||
|
|
||||||
Iterate over the strings in the store, in order. Note that a newly initialized
|
Iterate over the strings in the store, in order. Note that a newly initialized
|
||||||
store will always include an empty string `''` at position `0`.
|
store will always include an empty string `""` at position `0`.
|
||||||
|
|
||||||
> #### Example
|
> #### Example
|
||||||
>
|
>
|
||||||
|
@@ -86,9 +86,9 @@ store will always include an empty string `''` at position `0`.
|
||||||
> assert all_strings == ["apple", "orange"]
|
> assert all_strings == ["apple", "orange"]
|
||||||
> ```
|
> ```
|
||||||
|
|
||||||
| Name | Type | Description |
|
| Name | Description |
|
||||||
| ---------- | ---- | ---------------------- |
|
| ---------- | ------------------------------ |
|
||||||
| **YIELDS** | str | A string in the store. |
|
| **YIELDS** | A string in the store. ~~str~~ |
|
||||||
|
|
||||||
## StringStore.add {#add tag="method" new="2"}
|
## StringStore.add {#add tag="method" new="2"}
|
||||||
|
|
||||||
|
@@ -105,10 +105,10 @@ Add a string to the `StringStore`.
|
||||||
> assert stringstore["banana"] == banana_hash
|
> assert stringstore["banana"] == banana_hash
|
||||||
> ```
|
> ```
|
||||||
|
|
||||||
| Name | Type | Description |
|
| Name | Description |
|
||||||
| ----------- | ------ | ------------------------ |
|
| ----------- | -------------------------------- |
|
||||||
| `string` | str | The string to add. |
|
| `string` | The string to add. ~~str~~ |
|
||||||
| **RETURNS** | uint64 | The string's hash value. |
|
| **RETURNS** | The string's hash value. ~~int~~ |
|
||||||
|
|
||||||
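The `StringStore` semantics documented in the tables above — two-way lookup between strings and 64-bit hashes, with `add` returning the hash — can be sketched in plain Python. This is an illustrative stand-in, not spaCy's implementation: spaCy hashes with MurmurHash3 in Cython, so the toy `hash64` below produces different values, and the automatic empty string at position `0` is omitted.

```python
# Toy stand-in for a StringStore-like object -- NOT spaCy's implementation.
# hashlib.blake2b substitutes for spaCy's MurmurHash3, so hash values differ.
import hashlib


def hash64(string):
    """Stable stand-in 64-bit hash of a string."""
    digest = hashlib.blake2b(string.encode("utf8"), digest_size=8).digest()
    return int.from_bytes(digest, "little")


class ToyStringStore:
    def __init__(self, strings=None):
        self._strings = []   # insertion order, for __iter__
        self._by_hash = {}   # hash -> string
        for s in strings or []:
            self.add(s)

    def add(self, string):
        h = hash64(string)
        if h not in self._by_hash:
            self._by_hash[h] = string
            self._strings.append(string)
        return h  # the string's hash value

    def __len__(self):
        return len(self._strings)

    def __contains__(self, string):
        return hash64(string) in self._by_hash

    def __iter__(self):
        return iter(self._strings)

    def __getitem__(self, string_or_id):
        # str -> hash, int -> str: the "or vice versa" lookup described above
        if isinstance(string_or_id, str):
            return hash64(string_or_id)
        return self._by_hash[string_or_id]


store = ToyStringStore(["apple", "orange"])
apple_hash = store.add("apple")
assert store[apple_hash] == "apple"
assert "apple" in store and "cherry" not in store
assert len(store) == 2
```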
## StringStore.to_disk {#to_disk tag="method" new="2"}
|
## StringStore.to_disk {#to_disk tag="method" new="2"}
|
||||||
|
|
||||||
|
@@ -120,9 +120,9 @@ Save the current state to a directory.
|
||||||
> stringstore.to_disk("/path/to/strings")
|
> stringstore.to_disk("/path/to/strings")
|
||||||
> ```
|
> ```
|
||||||
|
|
||||||
| Name | Type | Description |
|
| Name | Description |
|
||||||
| ------ | ------------ | --------------------------------------------------------------------------------------------------------------------- |
|
| ------ | ------------------------------------------------------------------------------------------------------------------------------------------ |
|
||||||
| `path` | str / `Path` | A path to a directory, which will be created if it doesn't exist. Paths may be either strings or `Path`-like objects. |
|
| `path` | A path to a directory, which will be created if it doesn't exist. Paths may be either strings or `Path`-like objects. ~~Union[str, Path]~~ |
|
||||||
|
|
||||||
## StringStore.from_disk {#from_disk tag="method" new="2"}
|
## StringStore.from_disk {#from_disk tag="method" new="2"}
|
||||||
|
|
||||||
|
@@ -135,10 +135,10 @@ Loads state from a directory. Modifies the object in place and returns it.
|
||||||
> stringstore = StringStore().from_disk("/path/to/strings")
|
> stringstore = StringStore().from_disk("/path/to/strings")
|
||||||
> ```
|
> ```
|
||||||
|
|
||||||
| Name | Type | Description |
|
| Name | Description |
|
||||||
| ----------- | ------------- | -------------------------------------------------------------------------- |
|
| ----------- | ----------------------------------------------------------------------------------------------- |
|
||||||
| `path` | str / `Path` | A path to a directory. Paths may be either strings or `Path`-like objects. |
|
| `path` | A path to a directory. Paths may be either strings or `Path`-like objects. ~~Union[str, Path]~~ |
|
||||||
| **RETURNS** | `StringStore` | The modified `StringStore` object. |
|
| **RETURNS** | The modified `StringStore` object. ~~StringStore~~ |
|
||||||
|
|
||||||
## StringStore.to_bytes {#to_bytes tag="method"}
|
## StringStore.to_bytes {#to_bytes tag="method"}
|
||||||
|
|
||||||
|
@@ -150,9 +150,9 @@ Serialize the current state to a binary string.
|
||||||
> store_bytes = stringstore.to_bytes()
|
> store_bytes = stringstore.to_bytes()
|
||||||
> ```
|
> ```
|
||||||
|
|
||||||
| Name | Type | Description |
|
| Name | Description |
|
||||||
| ----------- | ----- | ------------------------------------------------ |
|
| ----------- | ---------------------------------------------------------- |
|
||||||
| **RETURNS** | bytes | The serialized form of the `StringStore` object. |
|
| **RETURNS** | The serialized form of the `StringStore` object. ~~bytes~~ |
|
||||||
|
|
||||||
## StringStore.from_bytes {#from_bytes tag="method"}
|
## StringStore.from_bytes {#from_bytes tag="method"}
|
||||||
|
|
||||||
|
@@ -166,10 +166,10 @@ Load state from a binary string.
|
||||||
> new_store = StringStore().from_bytes(store_bytes)
|
> new_store = StringStore().from_bytes(store_bytes)
|
||||||
> ```
|
> ```
|
||||||
|
|
||||||
| Name | Type | Description |
|
| Name | Description |
|
||||||
| ------------ | ------------- | ------------------------- |
|
| ------------ | ----------------------------------------- |
|
||||||
| `bytes_data` | bytes | The data to load from. |
|
| `bytes_data` | The data to load from. ~~bytes~~ |
|
||||||
| **RETURNS** | `StringStore` | The `StringStore` object. |
|
| **RETURNS** | The `StringStore` object. ~~StringStore~~ |
|
||||||
|
|
||||||
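The serialization contract documented above — `from_bytes` "modifies the object in place and returns it" — is worth spelling out, since it lets loading be chained onto a fresh instance. A minimal stand-in using JSON (spaCy's real format is binary, not JSON):

```python
# Sketch of the from_bytes() contract: modify in place AND return self.
# Toy stand-in only -- spaCy serializes to a binary format, not JSON.
import json


class ToyStore:
    def __init__(self, strings=None):
        self.strings = list(strings or [])

    def to_bytes(self):
        return json.dumps(self.strings).encode("utf8")

    def from_bytes(self, bytes_data):
        self.strings = json.loads(bytes_data.decode("utf8"))
        return self  # modified in place, returned for chaining


store_bytes = ToyStore(["apple", "orange"]).to_bytes()
new_store = ToyStore().from_bytes(store_bytes)
assert new_store.strings == ["apple", "orange"]
```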
## Utilities {#util}
|
## Utilities {#util}
|
||||||
|
|
||||||
|
@@ -184,7 +184,7 @@ Get a 64-bit hash for a given string.
|
||||||
> assert hash_string("apple") == 8566208034543834098
|
> assert hash_string("apple") == 8566208034543834098
|
||||||
> ```
|
> ```
|
||||||
|
|
||||||
| Name | Type | Description |
|
| Name | Description |
|
||||||
| ----------- | ------ | ------------------- |
|
| ----------- | --------------------------- |
|
||||||
| `string` | str | The string to hash. |
|
| `string` | The string to hash. ~~str~~ |
|
||||||
| **RETURNS** | uint64 | The hash. |
|
| **RETURNS** | The hash. ~~int~~ |
|
||||||
|
|
|
@@ -28,10 +28,10 @@ architectures and their arguments and hyperparameters.
|
||||||
> nlp.add_pipe("tagger", config=config)
|
> nlp.add_pipe("tagger", config=config)
|
||||||
> ```
|
> ```
|
||||||
|
|
||||||
| Setting | Type | Description | Default |
|
| Setting | Description |
|
||||||
| ---------------- | ------------------------------------------ | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | ----------------------------------- |
|
| ---------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ |
|
||||||
| `set_morphology` | bool | Whether to set morphological features. | `False` |
|
| `set_morphology` | Whether to set morphological features. Defaults to `False`. ~~bool~~ |
|
||||||
| `model` | [`Model`](https://thinc.ai/docs/api-model) | A model instance that predicts the tag probabilities. The output vectors should match the number of tags in size, and be normalized as probabilities (all scores between 0 and 1, with the rows summing to `1`). | [Tagger](/api/architectures#Tagger) |
|
| `model` | A model instance that predicts the tag probabilities. The output vectors should match the number of tags in size, and be normalized as probabilities (all scores between 0 and 1, with the rows summing to `1`). Defaults to [Tagger](/api/architectures#Tagger). ~~Model[List[Doc], List[Floats2d]]~~ |
|
||||||
|
|
||||||
```python
|
```python
|
||||||
https://github.com/explosion/spaCy/blob/develop/spacy/pipeline/tagger.pyx
|
https://github.com/explosion/spaCy/blob/develop/spacy/pipeline/tagger.pyx
|
||||||
|
@@ -58,13 +58,13 @@ Create a new pipeline instance. In your application, you would normally use a
|
||||||
shortcut for this and instantiate the component using its string name and
|
shortcut for this and instantiate the component using its string name and
|
||||||
[`nlp.add_pipe`](/api/language#add_pipe).
|
[`nlp.add_pipe`](/api/language#add_pipe).
|
||||||
|
|
||||||
| Name | Type | Description |
|
| Name | Description |
|
||||||
| ---------------- | ------------------------------------------ | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
|
| ---------------- | ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
|
||||||
| `vocab` | `Vocab` | The shared vocabulary. |
|
| `vocab` | The shared vocabulary. ~~Vocab~~ |
|
||||||
| `model` | [`Model`](https://thinc.ai/docs/api-model) | A model instance that predicts the tag probabilities. The output vectors should match the number of tags in size, and be normalized as probabilities (all scores between 0 and 1, with the rows summing to `1`). |
|
| `model` | A model instance that predicts the tag probabilities. The output vectors should match the number of tags in size, and be normalized as probabilities (all scores between 0 and 1, with the rows summing to `1`). ~~Model[List[Doc], List[Floats2d]]~~ |
|
||||||
| `name` | str | String name of the component instance. Used to add entries to the `losses` during training. |
|
| `name` | String name of the component instance. Used to add entries to the `losses` during training. ~~str~~ |
|
||||||
| _keyword-only_ | | |
|
| _keyword-only_ | |
|
||||||
| `set_morphology` | bool | Whether to set morphological features. |
|
| `set_morphology` | Whether to set morphological features. ~~bool~~ |
|
||||||
|
|
||||||
## Tagger.\_\_call\_\_ {#call tag="method"}
|
## Tagger.\_\_call\_\_ {#call tag="method"}
|
||||||
|
|
||||||
|
@@ -84,10 +84,10 @@ and all pipeline components are applied to the `Doc` in order. Both
|
||||||
> processed = tagger(doc)
|
> processed = tagger(doc)
|
||||||
> ```
|
> ```
|
||||||
|
|
||||||
| Name | Type | Description |
|
| Name | Description |
|
||||||
| ----------- | ----- | ------------------------ |
|
| ----------- | -------------------------------- |
|
||||||
| `doc` | `Doc` | The document to process. |
|
| `doc` | The document to process. ~~Doc~~ |
|
||||||
| **RETURNS** | `Doc` | The processed document. |
|
| **RETURNS** | The processed document. ~~Doc~~ |
|
||||||
|
|
||||||
## Tagger.pipe {#pipe tag="method"}
|
## Tagger.pipe {#pipe tag="method"}
|
||||||
|
|
||||||
|
@@ -105,12 +105,12 @@ applied to the `Doc` in order. Both [`__call__`](/api/tagger#call) and
|
||||||
> pass
|
> pass
|
||||||
> ```
|
> ```
|
||||||
|
|
||||||
| Name | Type | Description |
|
| Name | Description |
|
||||||
| -------------- | --------------- | ------------------------------------------------------ |
|
| -------------- | ------------------------------------------------------------- |
|
||||||
| `stream` | `Iterable[Doc]` | A stream of documents. |
|
| `stream` | A stream of documents. ~~Iterable[Doc]~~ |
|
||||||
| _keyword-only_ | | |
|
| _keyword-only_ | |
|
||||||
| `batch_size` | int | The number of texts to buffer. Defaults to `128`. |
|
| `batch_size` | The number of documents to buffer. Defaults to `128`. ~~int~~ |
|
||||||
| **YIELDS** | `Doc` | Processed documents in the order of the original text. |
|
| **YIELDS** | The processed documents in order. ~~Doc~~ |
|
||||||
|
|
||||||
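The `pipe` behavior described above — buffering `batch_size` documents at a time and yielding processed documents back in their original order — can be sketched as a generator. This is a toy stand-in operating on strings, not spaCy's actual implementation:

```python
# Illustrative sketch of a pipe()-style batching generator: buffer
# `batch_size` items, process each batch, and yield results in order.
from itertools import islice


def pipe(docs, process_batch, batch_size=128):
    docs = iter(docs)
    while True:
        batch = list(islice(docs, batch_size))  # buffer up to batch_size docs
        if not batch:
            return
        yield from process_batch(batch)  # order within the batch is preserved


# Toy "component" that uppercases strings in place of real Doc processing:
texts = ["a", "b", "c", "d", "e"]
out = list(pipe(texts, lambda batch: [t.upper() for t in batch], batch_size=2))
assert out == ["A", "B", "C", "D", "E"]
```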
## Tagger.begin_training {#begin_training tag="method"}
|
## Tagger.begin_training {#begin_training tag="method"}
|
||||||
|
|
||||||
|
@@ -130,13 +130,13 @@ setting up the label scheme based on the data.
|
||||||
> optimizer = tagger.begin_training(lambda: [], pipeline=nlp.pipeline)
|
> optimizer = tagger.begin_training(lambda: [], pipeline=nlp.pipeline)
|
||||||
> ```
|
> ```
|
||||||
|
|
||||||
| Name | Type | Description |
|
| Name | Description |
|
||||||
| -------------- | --------------------------------------------------- | ---------------------------------------------------------------------------------------------------------- |
|
| -------------- | ------------------------------------------------------------------------------------------------------------------------------------- |
|
||||||
| `get_examples` | `Callable[[], Iterable[Example]]` | Optional function that returns gold-standard annotations in the form of [`Example`](/api/example) objects. |
|
| `get_examples` | Function that returns gold-standard annotations in the form of [`Example`](/api/example) objects. ~~Callable[[], Iterable[Example]]~~ |
|
||||||
| _keyword-only_ | | |
|
| _keyword-only_ | |
|
||||||
| `pipeline` | `List[Tuple[str, Callable]]` | Optional list of pipeline components that this component is part of. |
|
| `pipeline` | Optional list of pipeline components that this component is part of. ~~Optional[List[Tuple[str, Callable[[Doc], Doc]]]]~~ |
|
||||||
| `sgd` | [`Optimizer`](https://thinc.ai/docs/api-optimizers) | An optional optimizer. Will be created via [`create_optimizer`](/api/tagger#create_optimizer) if not set. |
|
| `sgd` | An optimizer. Will be created via [`create_optimizer`](#create_optimizer) if not set. ~~Optional[Optimizer]~~ |
|
||||||
| **RETURNS** | [`Optimizer`](https://thinc.ai/docs/api-optimizers) | The optimizer. |
|
| **RETURNS** | The optimizer. ~~Optimizer~~ |
|
||||||
|
|
||||||
## Tagger.predict {#predict tag="method"}
|
## Tagger.predict {#predict tag="method"}
|
||||||
|
|
||||||
|
@@ -150,10 +150,10 @@ modifying them.
|
||||||
> scores = tagger.predict([doc1, doc2])
|
> scores = tagger.predict([doc1, doc2])
|
||||||
> ```
|
> ```
|
||||||
|
|
||||||
| Name | Type | Description |
|
| Name | Description |
|
||||||
| ----------- | --------------- | ----------------------------------------- |
|
| ----------- | ------------------------------------------- |
|
||||||
| `docs` | `Iterable[Doc]` | The documents to predict. |
|
| `docs` | The documents to predict. ~~Iterable[Doc]~~ |
|
||||||
| **RETURNS** | - | The model's prediction for each document. |
|
| **RETURNS** | The model's prediction for each document. |
|
||||||
|
|
||||||
## Tagger.set_annotations {#set_annotations tag="method"}
|
## Tagger.set_annotations {#set_annotations tag="method"}
|
||||||
|
|
||||||
|
@@ -167,10 +167,10 @@ Modify a batch of [`Doc`](/api/doc) objects, using pre-computed scores.
|
||||||
> tagger.set_annotations([doc1, doc2], scores)
|
> tagger.set_annotations([doc1, doc2], scores)
|
||||||
> ```
|
> ```
|
||||||
|
|
||||||
| Name | Type | Description |
|
| Name | Description |
|
||||||
| -------- | --------------- | ------------------------------------------------ |
|
| -------- | ------------------------------------------------ |
|
||||||
| `docs` | `Iterable[Doc]` | The documents to modify. |
|
| `docs` | The documents to modify. ~~Iterable[Doc]~~ |
|
||||||
| `scores` | - | The scores to set, produced by `Tagger.predict`. |
|
| `scores` | The scores to set, produced by `Tagger.predict`. |
|
||||||
|
|
||||||
## Tagger.update {#update tag="method"}
|
## Tagger.update {#update tag="method"}
|
||||||
|
|
||||||
|
@@ -187,15 +187,15 @@ Delegates to [`predict`](/api/tagger#predict) and
|
||||||
> losses = tagger.update(examples, sgd=optimizer)
|
> losses = tagger.update(examples, sgd=optimizer)
|
||||||
> ```
|
> ```
|
||||||
|
|
||||||
| Name | Type | Description |
|
| Name | Description |
|
||||||
| ----------------- | --------------------------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------ |
|
| ----------------- | ---------------------------------------------------------------------------------------------------------------------------------- |
|
||||||
| `examples` | `Iterable[Example]` | A batch of [`Example`](/api/example) objects to learn from. |
|
| `examples` | A batch of [`Example`](/api/example) objects to learn from. ~~Iterable[Example]~~ |
|
||||||
| _keyword-only_ | | |
|
| _keyword-only_ | |
|
||||||
| `drop` | float | The dropout rate. |
|
| `drop` | The dropout rate. ~~float~~ |
|
||||||
| `set_annotations` | bool | Whether or not to update the `Example` objects with the predictions, delegating to [`set_annotations`](/api/tagger#set_annotations). |
|
| `set_annotations` | Whether or not to update the `Example` objects with the predictions, delegating to [`set_annotations`](#set_annotations). ~~bool~~ |
|
||||||
| `sgd` | [`Optimizer`](https://thinc.ai/docs/api-optimizers) | The optimizer. |
|
| `sgd` | An optimizer. Will be created via [`create_optimizer`](#create_optimizer) if not set. ~~Optional[Optimizer]~~ |
|
||||||
| `losses` | `Dict[str, float]` | Optional record of the loss during training. The value keyed by the model's name is updated. |
|
| `losses` | Optional record of the loss during training. Updated using the component name as the key. ~~Optional[Dict[str, float]]~~ |
|
||||||
| **RETURNS** | `Dict[str, float]` | The updated `losses` dictionary. |
|
| **RETURNS** | The updated `losses` dictionary. ~~Dict[str, float]~~ |
|
||||||
|
|
||||||
## Tagger.rehearse {#rehearse tag="method,experimental" new="3"}
|
## Tagger.rehearse {#rehearse tag="method,experimental" new="3"}
|
||||||
|
|
||||||
|
@@ -211,14 +211,14 @@ the "catastrophic forgetting" problem. This feature is experimental.
|
||||||
> losses = tagger.rehearse(examples, sgd=optimizer)
|
> losses = tagger.rehearse(examples, sgd=optimizer)
|
||||||
> ```
|
> ```
|
||||||
|
|
||||||
| Name | Type | Description |
|
| Name | Description |
|
||||||
| -------------- | --------------------------------------------------- | ----------------------------------------------------------------------------------------- |
|
| -------------- | ------------------------------------------------------------------------------------------------------------------------ |
|
||||||
| `examples` | `Iterable[Example]` | A batch of [`Example`](/api/example) objects to learn from. |
|
| `examples` | A batch of [`Example`](/api/example) objects to learn from. ~~Iterable[Example]~~ |
|
||||||
| _keyword-only_ | | |
|
| _keyword-only_ | |
|
||||||
| `drop` | float | The dropout rate. |
|
| `drop` | The dropout rate. ~~float~~ |
|
||||||
| `sgd` | [`Optimizer`](https://thinc.ai/docs/api-optimizers) | The optimizer. |
|
| `sgd` | An optimizer. Will be created via [`create_optimizer`](#create_optimizer) if not set. ~~Optional[Optimizer]~~ |
|
||||||
| `losses` | `Dict[str, float]` | Optional record of the loss during training. Updated using the component name as the key. |
|
| `losses` | Optional record of the loss during training. Updated using the component name as the key. ~~Optional[Dict[str, float]]~~ |
|
||||||
| **RETURNS** | `Dict[str, float]` | The updated `losses` dictionary. |
|
| **RETURNS** | The updated `losses` dictionary. ~~Dict[str, float]~~ |
|
||||||
|
|
||||||
## Tagger.get_loss {#get_loss tag="method"}
|
## Tagger.get_loss {#get_loss tag="method"}
|
||||||
|
|
||||||
|
@@ -233,11 +233,11 @@ predicted scores.
|
||||||
> loss, d_loss = tagger.get_loss(examples, scores)
|
> loss, d_loss = tagger.get_loss(examples, scores)
|
||||||
> ```
|
> ```
|
||||||
|
|
||||||
| Name | Type | Description |
|
| Name | Description |
|
||||||
| ----------- | --------------------- | --------------------------------------------------- |
|
| ----------- | --------------------------------------------------------------------------- |
|
||||||
| `examples` | `Iterable[Example]` | The batch of examples. |
|
| `examples` | The batch of examples. ~~Iterable[Example]~~ |
|
||||||
| `scores` | - | Scores representing the model's predictions. |
|
| `scores` | Scores representing the model's predictions. |
|
||||||
| **RETURNS** | `Tuple[float, float]` | The loss and the gradient, i.e. `(loss, gradient)`. |
|
| **RETURNS** | The loss and the gradient, i.e. `(loss, gradient)`. ~~Tuple[float, float]~~ |
|
||||||
|
|
||||||
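Since the model's outputs are normalized probabilities, a natural choice for the `(loss, gradient)` pair returned by a `get_loss`-style method is categorical cross-entropy, whose gradient with respect to softmax scores is simply `predicted - one_hot(truth)`. The tables above don't pin down the exact objective spaCy uses, so the following is only a sketch of that standard formulation:

```python
# Hedged sketch of a get_loss()-style computation for one row of normalized
# tag probabilities, assuming categorical cross-entropy (the docs above do
# not specify spaCy's exact loss).
import math


def get_loss(probs, gold_index):
    """probs: normalized tag probabilities; gold_index: index of the true tag."""
    loss = -math.log(probs[gold_index])
    # For softmax outputs, d(loss)/d(scores) = probs - one_hot(gold_index):
    gradient = [p - (1.0 if i == gold_index else 0.0) for i, p in enumerate(probs)]
    return loss, gradient


loss, d_scores = get_loss([0.7, 0.2, 0.1], gold_index=0)
assert abs(loss - (-math.log(0.7))) < 1e-12
assert abs(sum(d_scores)) < 1e-9  # gradient sums to ~0 for normalized probs
```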
## Tagger.score {#score tag="method" new="3"}
|
## Tagger.score {#score tag="method" new="3"}
|
||||||
|
|
||||||
|
@@ -249,10 +249,10 @@ Score a batch of examples.
|
||||||
> scores = tagger.score(examples)
|
> scores = tagger.score(examples)
|
||||||
> ```
|
> ```
|
||||||
|
|
||||||
| Name | Type | Description |
|
| Name | Description |
|
||||||
| ----------- | ------------------- | ------------------------------------------------------------------------------------------------------------------------------------ |
|
| ----------- | --------------------------------------------------------------------------------------------------------------------------------------------------------- |
|
||||||
| `examples` | `Iterable[Example]` | The examples to score. |
|
| `examples` | The examples to score. ~~Iterable[Example]~~ |
|
||||||
| **RETURNS** | `Dict[str, Any]` | The scores, produced by [`Scorer.score_token_attr`](/api/scorer#score_token_attr) for the attributes `"pos"`, `"tag"` and `"lemma"`. |
|
| **RETURNS** | The scores, produced by [`Scorer.score_token_attr`](/api/scorer#score_token_attr) for the attributes `"pos"`, `"tag"` and `"lemma"`. ~~Dict[str, float]~~ |
|
||||||
|
|
||||||
## Tagger.create_optimizer {#create_optimizer tag="method"}
|
## Tagger.create_optimizer {#create_optimizer tag="method"}
|
||||||
|
|
||||||
|
@@ -265,9 +265,9 @@ Create an optimizer for the pipeline component.
|
||||||
> optimizer = tagger.create_optimizer()
|
> optimizer = tagger.create_optimizer()
|
||||||
> ```
|
> ```
|
||||||
|
|
||||||
| Name | Type | Description |
|
| Name | Description |
|
||||||
| ----------- | --------------------------------------------------- | -------------- |
|
| ----------- | ---------------------------- |
|
||||||
| **RETURNS** | [`Optimizer`](https://thinc.ai/docs/api-optimizers) | The optimizer. |
|
| **RETURNS** | The optimizer. ~~Optimizer~~ |
|
||||||
|
|
||||||
## Tagger.use_params {#use_params tag="method, contextmanager"}
|
## Tagger.use_params {#use_params tag="method, contextmanager"}
|
||||||
|
|
||||||
|
@@ -282,9 +282,9 @@ context, the original parameters are restored.
|
||||||
> tagger.to_disk("/best_model")
|
> tagger.to_disk("/best_model")
|
||||||
> ```
|
> ```
|
||||||
|
|
||||||
| Name | Type | Description |
|
| Name | Description |
|
||||||
| -------- | ---- | ----------------------------------------- |
|
| -------- | -------------------------------------------------- |
|
||||||
| `params` | dict | The parameter values to use in the model. |
|
| `params` | The parameter values to use in the model. ~~dict~~ |
|
||||||
|
|
||||||
## Tagger.add_label {#add_label tag="method"}
|
## Tagger.add_label {#add_label tag="method"}
|
||||||
|
|
||||||
|
@@ -297,10 +297,10 @@ Add a new label to the pipe.
|
||||||
> tagger.add_label("MY_LABEL")
|
> tagger.add_label("MY_LABEL")
|
||||||
> ```
|
> ```
|
||||||
|
|
||||||
| Name | Type | Description |
|
| Name | Description |
|
||||||
| ----------- | ---- | --------------------------------------------------- |
|
| ----------- | ----------------------------------------------------------- |
|
||||||
| `label` | str | The label to add. |
|
| `label` | The label to add. ~~str~~ |
|
||||||
| **RETURNS** | int | `0` if the label is already present, otherwise `1`. |
|
| **RETURNS** | `0` if the label is already present, otherwise `1`. ~~int~~ |
|
||||||
|
|
||||||
## Tagger.to_disk {#to_disk tag="method"}
|
## Tagger.to_disk {#to_disk tag="method"}
|
||||||
|
|
||||||
|
@@ -313,11 +313,11 @@ Serialize the pipe to disk.
|
||||||
> tagger.to_disk("/path/to/tagger")
|
> tagger.to_disk("/path/to/tagger")
|
||||||
> ```
|
> ```
|
||||||
|
|
||||||
| Name | Type | Description |
|
| Name | Description |
|
||||||
| -------------- | --------------- | --------------------------------------------------------------------------------------------------------------------- |
|
| -------------- | ------------------------------------------------------------------------------------------------------------------------------------------ |
|
||||||
| `path` | str / `Path` | A path to a directory, which will be created if it doesn't exist. Paths may be either strings or `Path`-like objects. |
|
| `path` | A path to a directory, which will be created if it doesn't exist. Paths may be either strings or `Path`-like objects. ~~Union[str, Path]~~ |
|
||||||
| _keyword-only_ | | |
|
| _keyword-only_ | |
|
||||||
| `exclude` | `Iterable[str]` | String names of [serialization fields](#serialization-fields) to exclude. |
|
| `exclude` | String names of [serialization fields](#serialization-fields) to exclude. ~~Iterable[str]~~ |
|
||||||
|
|
||||||
## Tagger.from_disk {#from_disk tag="method"}
|
## Tagger.from_disk {#from_disk tag="method"}
|
||||||
|
|
||||||
|
@@ -330,12 +330,12 @@ Load the pipe from disk. Modifies the object in place and returns it.
|
||||||
> tagger.from_disk("/path/to/tagger")
|
> tagger.from_disk("/path/to/tagger")
|
||||||
> ```
|
> ```
|
||||||
|
|
||||||
| Name | Type | Description |
|
| Name | Description |
|
||||||
| -------------- | --------------- | -------------------------------------------------------------------------- |
|
| -------------- | ----------------------------------------------------------------------------------------------- |
|
||||||
| `path` | str / `Path` | A path to a directory. Paths may be either strings or `Path`-like objects. |
|
| `path` | A path to a directory. Paths may be either strings or `Path`-like objects. ~~Union[str, Path]~~ |
|
||||||
| _keyword-only_ | | |
|
| _keyword-only_ | |
|
||||||
| `exclude` | `Iterable[str]` | String names of [serialization fields](#serialization-fields) to exclude. |
|
| `exclude` | String names of [serialization fields](#serialization-fields) to exclude. ~~Iterable[str]~~ |
|
||||||
| **RETURNS** | `Tagger` | The modified `Tagger` object. |
|
| **RETURNS** | The modified `Tagger` object. ~~Tagger~~ |
|
||||||
|
|
||||||
## Tagger.to_bytes {#to_bytes tag="method"}
|
## Tagger.to_bytes {#to_bytes tag="method"}
|
||||||
|
|
||||||
|
@@ -348,11 +348,11 @@ Load the pipe from disk. Modifies the object in place and returns it.
|
||||||
|
|
||||||
Serialize the pipe to a bytestring.
|
Serialize the pipe to a bytestring.
|
||||||
|
|
||||||
| Name | Type | Description |
|
| Name | Description |
|
||||||
| -------------- | --------------- | ------------------------------------------------------------------------- |
|
| -------------- | ------------------------------------------------------------------------------------------- |
|
||||||
| _keyword-only_ | | |
|
| _keyword-only_ | |
|
||||||
| `exclude` | `Iterable[str]` | String names of [serialization fields](#serialization-fields) to exclude. |
|
| `exclude` | String names of [serialization fields](#serialization-fields) to exclude. ~~Iterable[str]~~ |
|
||||||
| **RETURNS** | bytes | The serialized form of the `Tagger` object. |
|
| **RETURNS** | The serialized form of the `Tagger` object. ~~bytes~~ |
|
||||||
|
|
||||||
## Tagger.from_bytes {#from_bytes tag="method"}
|
## Tagger.from_bytes {#from_bytes tag="method"}
|
||||||
|
|
||||||
|
@@ -366,12 +366,12 @@ Load the pipe from a bytestring. Modifies the object in place and returns it.
|
||||||
> tagger.from_bytes(tagger_bytes)
|
> tagger.from_bytes(tagger_bytes)
|
||||||
> ```
|
> ```
|
||||||
|
|
||||||
| Name | Type | Description |
|
| Name | Description |
|
||||||
| -------------- | --------------- | ------------------------------------------------------------------------- |
|
| -------------- | ------------------------------------------------------------------------------------------- |
|
||||||
| `bytes_data` | bytes | The data to load from. |
|
| `bytes_data` | The data to load from. ~~bytes~~ |
|
||||||
| _keyword-only_ | | |
|
| _keyword-only_ | |
|
||||||
| `exclude` | `Iterable[str]` | String names of [serialization fields](#serialization-fields) to exclude. |
|
| `exclude` | String names of [serialization fields](#serialization-fields) to exclude. ~~Iterable[str]~~ |
|
||||||
| **RETURNS** | `Tagger` | The `Tagger` object. |
|
| **RETURNS** | The `Tagger` object. ~~Tagger~~ |
|
||||||
|
|
||||||
## Tagger.labels {#labels tag="property"}
|
## Tagger.labels {#labels tag="property"}
|
||||||
|
|
||||||
|
@@ -384,9 +384,9 @@ The labels currently added to the component.
|
||||||
> assert "MY_LABEL" in tagger.labels
|
> assert "MY_LABEL" in tagger.labels
|
||||||
> ```
|
> ```
|
||||||
|
|
||||||
| Name | Type | Description |
|
| Name | Description |
|
||||||
| ----------- | ------------ | ---------------------------------- |
|
| ----------- | ------------------------------------------------------ |
|
||||||
| **RETURNS** | `Tuple[str]` | The labels added to the component. |
|
| **RETURNS** | The labels added to the component. ~~Tuple[str, ...]~~ |
|
||||||
|
|
||||||
## Serialization fields {#serialization-fields}
|
## Serialization fields {#serialization-fields}
|
||||||
|
|
||||||
|
|
|
@@ -35,10 +35,10 @@ architectures and their arguments and hyperparameters.
 > nlp.add_pipe("textcat", config=config)
 > ```
 
-| Setting | Type | Description | Default |
-| -------- | ------------------------------------------ | --------------------------------------------------------------------------------------- | ----------------------------------------------------- |
-| `labels` | `List[str]` | A list of categories to learn. If empty, the model infers the categories from the data. | `[]` |
-| `model` | [`Model`](https://thinc.ai/docs/api-model) | A model instance that predicts scores for each category. | [TextCatEnsemble](/api/architectures#TextCatEnsemble) |
+| Setting | Description |
+| -------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------------- |
+| `labels` | A list of categories to learn. If empty, the model infers the categories from the data. Defaults to `[]`. ~~Iterable[str]~~ |
+| `model` | A model instance that predicts scores for each category. Defaults to [TextCatEnsemble](/api/architectures#TextCatEnsemble). ~~Model[List[Doc], List[Floats2d]]~~ |
 
 ```python
 https://github.com/explosion/spaCy/blob/develop/spacy/pipeline/textcat.py
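As a side note on the hunk above: in practice these settings usually live in the training `config.cfg` rather than an inline dict. A hedged sketch of the corresponding block — the exact `@architectures` version string (`spacy.TextCatEnsemble.v1`) is an assumption based on the architectures registry at the time:

```ini
[components.textcat]
factory = "textcat"
labels = []

[components.textcat.model]
@architectures = "spacy.TextCatEnsemble.v1"
```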
@@ -65,13 +65,13 @@ Create a new pipeline instance. In your application, you would normally use a
 shortcut for this and instantiate the component using its string name and
 [`nlp.add_pipe`](/api/language#create_pipe).
 
-| Name | Type | Description |
-| -------------- | ------------------------------------------ | ------------------------------------------------------------------------------------------- |
-| `vocab` | `Vocab` | The shared vocabulary. |
-| `model` | [`Model`](https://thinc.ai/docs/api-model) | The Thinc [`Model`](https://thinc.ai/docs/api-model) powering the pipeline component. |
-| `name` | str | String name of the component instance. Used to add entries to the `losses` during training. |
-| _keyword-only_ | | |
-| `labels` | `Iterable[str]` | The labels to use. |
+| Name | Description |
+| -------------- | -------------------------------------------------------------------------------------------------------------------------- |
+| `vocab` | The shared vocabulary. ~~Vocab~~ |
+| `model` | The Thinc [`Model`](https://thinc.ai/docs/api-model) powering the pipeline component. ~~Model[List[Doc], List[Floats2d]]~~ |
+| `name` | String name of the component instance. Used to add entries to the `losses` during training. ~~str~~ |
+| _keyword-only_ | |
+| `labels` | The labels to use. ~~Iterable[str]~~ |
 
 ## TextCategorizer.\_\_call\_\_ {#call tag="method"}
@@ -91,10 +91,10 @@ delegate to the [`predict`](/api/textcategorizer#predict) and
 > processed = textcat(doc)
 > ```
 
-| Name | Type | Description |
-| ----------- | ----- | ------------------------ |
-| `doc` | `Doc` | The document to process. |
-| **RETURNS** | `Doc` | The processed document. |
+| Name | Description |
+| ----------- | -------------------------------- |
+| `doc` | The document to process. ~~Doc~~ |
+| **RETURNS** | The processed document. ~~Doc~~ |
 
 ## TextCategorizer.pipe {#pipe tag="method"}
@@ -113,12 +113,12 @@ applied to the `Doc` in order. Both [`__call__`](/api/textcategorizer#call) and
 > pass
 > ```
 
-| Name | Type | Description |
-| -------------- | --------------- | ----------------------------------------------------- |
-| `stream` | `Iterable[Doc]` | A stream of documents. |
-| _keyword-only_ | | |
-| `batch_size` | int | The number of documents to buffer. Defaults to `128`. |
-| **YIELDS** | `Doc` | The processed documents in order. |
+| Name | Description |
+| -------------- | ------------------------------------------------------------- |
+| `stream` | A stream of documents. ~~Iterable[Doc]~~ |
+| _keyword-only_ | |
+| `batch_size` | The number of documents to buffer. Defaults to `128`. ~~int~~ |
+| **YIELDS** | The processed documents in order. ~~Doc~~ |
 
 ## TextCategorizer.begin_training {#begin_training tag="method"}
@@ -138,13 +138,13 @@ setting up the label scheme based on the data.
 > optimizer = textcat.begin_training(lambda: [], pipeline=nlp.pipeline)
 > ```
 
-| Name | Type | Description |
-| -------------- | --------------------------------------------------- | ------------------------------------------------------------------------------------------------------------------ |
-| `get_examples` | `Callable[[], Iterable[Example]]` | Optional function that returns gold-standard annotations in the form of [`Example`](/api/example) objects. |
-| _keyword-only_ | | |
-| `pipeline` | `List[Tuple[str, Callable]]` | Optional list of pipeline components that this component is part of. |
-| `sgd` | [`Optimizer`](https://thinc.ai/docs/api-optimizers) | An optional optimizer. Will be created via [`create_optimizer`](/api/textcategorizer#create_optimizer) if not set. |
-| **RETURNS** | [`Optimizer`](https://thinc.ai/docs/api-optimizers) | The optimizer. |
+| Name | Description |
+| -------------- | ------------------------------------------------------------------------------------------------------------------------------------- |
+| `get_examples` | Function that returns gold-standard annotations in the form of [`Example`](/api/example) objects. ~~Callable[[], Iterable[Example]]~~ |
+| _keyword-only_ | | |
+| `pipeline` | Optional list of pipeline components that this component is part of. ~~Optional[List[Tuple[str, Callable[[Doc], Doc]]]]~~ |
+| `sgd` | An optimizer. Will be created via [`create_optimizer`](#create_optimizer) if not set. ~~Optional[Optimizer]~~ |
+| **RETURNS** | The optimizer. ~~Optimizer~~ |
 
 ## TextCategorizer.predict {#predict tag="method"}
@@ -158,10 +158,10 @@ modifying them.
 > scores = textcat.predict([doc1, doc2])
 > ```
 
-| Name | Type | Description |
-| ----------- | --------------- | ----------------------------------------- |
-| `docs` | `Iterable[Doc]` | The documents to predict. |
-| **RETURNS** | - | The model's prediction for each document. |
+| Name | Description |
+| ----------- | ------------------------------------------- |
+| `docs` | The documents to predict. ~~Iterable[Doc]~~ |
+| **RETURNS** | The model's prediction for each document. |
 
 ## TextCategorizer.set_annotations {#set_annotations tag="method"}
@@ -175,10 +175,10 @@ Modify a batch of [`Doc`](/api/doc) objects, using pre-computed scores.
 > textcat.set_annotations(docs, scores)
 > ```
 
-| Name | Type | Description |
-| -------- | --------------- | --------------------------------------------------------- |
-| `docs` | `Iterable[Doc]` | The documents to modify. |
-| `scores` | - | The scores to set, produced by `TextCategorizer.predict`. |
+| Name | Description |
+| -------- | --------------------------------------------------------- |
+| `docs` | The documents to modify. ~~Iterable[Doc]~~ |
+| `scores` | The scores to set, produced by `TextCategorizer.predict`. |
 
 ## TextCategorizer.update {#update tag="method"}
@@ -195,15 +195,15 @@ Delegates to [`predict`](/api/textcategorizer#predict) and
 > losses = textcat.update(examples, sgd=optimizer)
 > ```
 
-| Name | Type | Description |
-| ----------------- | --------------------------------------------------- | --------------------------------------------------------------------------------------------------------------------------------------------- |
-| `examples` | `Iterable[Example]` | A batch of [`Example`](/api/example) objects to learn from. |
-| _keyword-only_ | | |
-| `drop` | float | The dropout rate. |
-| `set_annotations` | bool | Whether or not to update the `Example` objects with the predictions, delegating to [`set_annotations`](/api/textcategorizer#set_annotations). |
-| `sgd` | [`Optimizer`](https://thinc.ai/docs/api-optimizers) | The optimizer. |
-| `losses` | `Dict[str, float]` | Optional record of the loss during training. Updated using the component name as the key. |
-| **RETURNS** | `Dict[str, float]` | The updated `losses` dictionary. |
+| Name | Description |
+| ----------------- | ---------------------------------------------------------------------------------------------------------------------------------- |
+| `examples` | A batch of [`Example`](/api/example) objects to learn from. ~~Iterable[Example]~~ |
+| _keyword-only_ | | |
+| `drop` | The dropout rate. ~~float~~ |
+| `set_annotations` | Whether or not to update the `Example` objects with the predictions, delegating to [`set_annotations`](#set_annotations). ~~bool~~ |
+| `sgd` | An optimizer. Will be created via [`create_optimizer`](#create_optimizer) if not set. ~~Optional[Optimizer]~~ |
+| `losses` | Optional record of the loss during training. Updated using the component name as the key. ~~Optional[Dict[str, float]]~~ |
+| **RETURNS** | The updated `losses` dictionary. ~~Dict[str, float]~~ |
 
 ## TextCategorizer.rehearse {#rehearse tag="method,experimental" new="3"}
@@ -219,14 +219,14 @@ the "catastrophic forgetting" problem. This feature is experimental.
 > losses = textcat.rehearse(examples, sgd=optimizer)
 > ```
 
-| Name | Type | Description |
-| -------------- | --------------------------------------------------- | ----------------------------------------------------------------------------------------- |
-| `examples` | `Iterable[Example]` | A batch of [`Example`](/api/example) objects to learn from. |
-| _keyword-only_ | | |
-| `drop` | float | The dropout rate. |
-| `sgd` | [`Optimizer`](https://thinc.ai/docs/api-optimizers) | The optimizer. |
-| `losses` | `Dict[str, float]` | Optional record of the loss during training. Updated using the component name as the key. |
-| **RETURNS** | `Dict[str, float]` | The updated `losses` dictionary. |
+| Name | Description |
+| -------------- | ------------------------------------------------------------------------------------------------------------------------ |
+| `examples` | A batch of [`Example`](/api/example) objects to learn from. ~~Iterable[Example]~~ |
+| _keyword-only_ | | |
+| `drop` | The dropout rate. ~~float~~ |
+| `sgd` | An optimizer. Will be created via [`create_optimizer`](#create_optimizer) if not set. ~~Optional[Optimizer]~~ |
+| `losses` | Optional record of the loss during training. Updated using the component name as the key. ~~Optional[Dict[str, float]]~~ |
+| **RETURNS** | The updated `losses` dictionary. ~~Dict[str, float]~~ |
 
 ## TextCategorizer.get_loss {#get_loss tag="method"}
@@ -241,11 +241,11 @@ predicted scores.
 > loss, d_loss = textcat.get_loss(examples, scores)
 > ```
 
-| Name | Type | Description |
-| ----------- | --------------------- | --------------------------------------------------- |
-| `examples` | `Iterable[Example]` | The batch of examples. |
-| `scores` | - | Scores representing the model's predictions. |
-| **RETURNS** | `Tuple[float, float]` | The loss and the gradient, i.e. `(loss, gradient)`. |
+| Name | Description |
+| ----------- | --------------------------------------------------------------------------- |
+| `examples` | The batch of examples. ~~Iterable[Example]~~ |
+| `scores` | Scores representing the model's predictions. |
+| **RETURNS** | The loss and the gradient, i.e. `(loss, gradient)`. ~~Tuple[float, float]~~ |
 
 ## TextCategorizer.score {#score tag="method" new="3"}
@@ -257,12 +257,12 @@ Score a batch of examples.
 > scores = textcat.score(examples)
 > ```
 
-| Name | Type | Description |
-| ---------------- | ------------------- | ---------------------------------------------------------------------- |
-| `examples` | `Iterable[Example]` | The examples to score. |
-| _keyword-only_ | | |
-| `positive_label` | str | Optional positive label. |
-| **RETURNS** | `Dict[str, Any]` | The scores, produced by [`Scorer.score_cats`](/api/scorer#score_cats). |
+| Name | Description |
+| ---------------- | -------------------------------------------------------------------------------------------------------------------- |
+| `examples` | The examples to score. ~~Iterable[Example]~~ |
+| _keyword-only_ | | |
+| `positive_label` | Optional positive label. ~~Optional[str]~~ |
+| **RETURNS** | The scores, produced by [`Scorer.score_cats`](/api/scorer#score_cats). ~~Dict[str, Union[float, Dict[str, float]]]~~ |
 
 ## TextCategorizer.create_optimizer {#create_optimizer tag="method"}
@@ -275,25 +275,9 @@ Create an optimizer for the pipeline component.
 > optimizer = textcat.create_optimizer()
 > ```
 
-| Name | Type | Description |
-| ----------- | --------------------------------------------------- | -------------- |
-| **RETURNS** | [`Optimizer`](https://thinc.ai/docs/api-optimizers) | The optimizer. |
+| Name | Description |
+| ----------- | ---------------------------- |
+| **RETURNS** | The optimizer. ~~Optimizer~~ |
 
-## TextCategorizer.add_label {#add_label tag="method"}
-
-Add a new label to the pipe.
-
-> #### Example
->
-> ```python
-> textcat = nlp.add_pipe("textcat")
-> textcat.add_label("MY_LABEL")
-> ```
-
-| Name | Type | Description |
-| ----------- | ---- | --------------------------------------------------- |
-| `label` | str | The label to add. |
-| **RETURNS** | int | `0` if the label is already present, otherwise `1`. |
-
 ## TextCategorizer.use_params {#use_params tag="method, contextmanager"}
@ -307,9 +291,25 @@ Modify the pipe's model, to use the given parameter values.
|
||||||
> textcat.to_disk("/best_model")
|
> textcat.to_disk("/best_model")
|
||||||
> ```
|
> ```
|
||||||
|
|
||||||
| Name | Type | Description |
|
| Name | Description |
|
||||||
| -------- | ---- | ----------------------------------------- |
|
| -------- | -------------------------------------------------- |
|
||||||
| `params` | dict | The parameter values to use in the model. |
|
| `params` | The parameter values to use in the model. ~~dict~~ |
|
||||||
|
|
||||||
|
## TextCategorizer.add_label {#add_label tag="method"}
|
||||||
|
|
||||||
|
Add a new label to the pipe.
|
||||||
|
|
||||||
|
> #### Example
|
||||||
|
>
|
||||||
|
> ```python
|
||||||
|
> textcat = nlp.add_pipe("textcat")
|
||||||
|
> textcat.add_label("MY_LABEL")
|
||||||
|
> ```
|
||||||
|
|
||||||
|
| Name | Description |
|
||||||
|
| ----------- | ----------------------------------------------------------- |
|
||||||
|
| `label` | The label to add. ~~str~~ |
|
||||||
|
| **RETURNS** | `0` if the label is already present, otherwise `1`. ~~int~~ |
|
||||||
|
|
||||||
## TextCategorizer.to_disk {#to_disk tag="method"}
|
## TextCategorizer.to_disk {#to_disk tag="method"}
|
||||||
|
|
||||||
|
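To sanity-check the `add_label` behavior documented in the hunk above — a hedged sketch, assuming a spaCy v3 (nightly) install; the return values follow the table's own contract:

```python
import spacy

# Build a blank pipeline and add the textcat component by its string name.
nlp = spacy.blank("en")
textcat = nlp.add_pipe("textcat")

# Per the docs: add_label returns 1 for a new label, 0 if already present.
assert textcat.add_label("POSITIVE") == 1
assert textcat.add_label("POSITIVE") == 0
assert "POSITIVE" in textcat.labels
```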
@@ -322,11 +322,11 @@ Serialize the pipe to disk.
 > textcat.to_disk("/path/to/textcat")
 > ```
 
-| Name | Type | Description |
-| -------------- | --------------- | --------------------------------------------------------------------------------------------------------------------- |
-| `path` | str / `Path` | A path to a directory, which will be created if it doesn't exist. Paths may be either strings or `Path`-like objects. |
-| _keyword-only_ | | |
-| `exclude` | `Iterable[str]` | String names of [serialization fields](#serialization-fields) to exclude. |
+| Name | Description |
+| -------------- | ------------------------------------------------------------------------------------------------------------------------------------------ |
+| `path` | A path to a directory, which will be created if it doesn't exist. Paths may be either strings or `Path`-like objects. ~~Union[str, Path]~~ |
+| _keyword-only_ | |
+| `exclude` | String names of [serialization fields](#serialization-fields) to exclude. ~~Iterable[str]~~ |
 
 ## TextCategorizer.from_disk {#from_disk tag="method"}
@@ -339,12 +339,12 @@ Load the pipe from disk. Modifies the object in place and returns it.
 > textcat.from_disk("/path/to/textcat")
 > ```
 
-| Name | Type | Description |
-| -------------- | ----------------- | -------------------------------------------------------------------------- |
-| `path` | str / `Path` | A path to a directory. Paths may be either strings or `Path`-like objects. |
-| _keyword-only_ | | |
-| `exclude` | `Iterable[str]` | String names of [serialization fields](#serialization-fields) to exclude. |
-| **RETURNS** | `TextCategorizer` | The modified `TextCategorizer` object. |
+| Name | Description |
+| -------------- | ----------------------------------------------------------------------------------------------- |
+| `path` | A path to a directory. Paths may be either strings or `Path`-like objects. ~~Union[str, Path]~~ |
+| _keyword-only_ | |
+| `exclude` | String names of [serialization fields](#serialization-fields) to exclude. ~~Iterable[str]~~ |
+| **RETURNS** | The modified `TextCategorizer` object. ~~TextCategorizer~~ |
 
 ## TextCategorizer.to_bytes {#to_bytes tag="method"}
@@ -357,11 +357,11 @@ Load the pipe from disk. Modifies the object in place and returns it.
 
 Serialize the pipe to a bytestring.
 
-| Name | Type | Description |
-| -------------- | --------------- | ------------------------------------------------------------------------- |
-| _keyword-only_ | | |
-| `exclude` | `Iterable[str]` | String names of [serialization fields](#serialization-fields) to exclude. |
-| **RETURNS** | bytes | The serialized form of the `TextCategorizer` object. |
+| Name | Description |
+| -------------- | ------------------------------------------------------------------------------------------- |
+| _keyword-only_ | |
+| `exclude` | String names of [serialization fields](#serialization-fields) to exclude. ~~Iterable[str]~~ |
+| **RETURNS** | The serialized form of the `TextCategorizer` object. ~~bytes~~ |
 
 ## TextCategorizer.from_bytes {#from_bytes tag="method"}
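The `to_bytes`/`from_bytes` pair in the hunks above follows a round-trip contract: `from_bytes` consumes what `to_bytes` produced, modifies the object in place, and returns it. A minimal stdlib-only sketch of that contract (a hypothetical `ToyPipe`, not the spaCy implementation):

```python
import json


class ToyPipe:
    """Hypothetical stand-in illustrating the to_bytes/from_bytes contract."""

    def __init__(self, labels=()):
        self.labels = tuple(labels)

    def to_bytes(self, *, exclude=()):
        # Serialize all fields not listed in `exclude` to a bytestring.
        data = {"labels": list(self.labels)}
        for field in exclude:
            data.pop(field, None)
        return json.dumps(data).encode("utf8")

    def from_bytes(self, bytes_data, *, exclude=()):
        # Modify the object in place and return it.
        data = json.loads(bytes_data.decode("utf8"))
        if "labels" in data and "labels" not in exclude:
            self.labels = tuple(data["labels"])
        return self


pipe = ToyPipe(["POSITIVE", "NEGATIVE"])
restored = ToyPipe().from_bytes(pipe.to_bytes())
assert restored.labels == ("POSITIVE", "NEGATIVE")
```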
@@ -375,12 +375,12 @@ Load the pipe from a bytestring. Modifies the object in place and returns it.
 > textcat.from_bytes(textcat_bytes)
 > ```
 
-| Name | Type | Description |
-| -------------- | ----------------- | ------------------------------------------------------------------------- |
-| `bytes_data` | bytes | The data to load from. |
-| _keyword-only_ | | |
-| `exclude` | `Iterable[str]` | String names of [serialization fields](#serialization-fields) to exclude. |
-| **RETURNS** | `TextCategorizer` | The `TextCategorizer` object. |
+| Name | Description |
+| -------------- | ------------------------------------------------------------------------------------------- |
+| `bytes_data` | The data to load from. ~~bytes~~ |
+| _keyword-only_ | |
+| `exclude` | String names of [serialization fields](#serialization-fields) to exclude. ~~Iterable[str]~~ |
+| **RETURNS** | The `TextCategorizer` object. ~~TextCategorizer~~ |
 
 ## TextCategorizer.labels {#labels tag="property"}
@@ -393,9 +393,9 @@ The labels currently added to the component.
 > assert "MY_LABEL" in textcat.labels
 > ```
 
-| Name | Type | Description |
-| ----------- | ----- | ---------------------------------- |
-| **RETURNS** | tuple | The labels added to the component. |
+| Name | Description |
+| ----------- | ------------------------------------------------------ |
+| **RETURNS** | The labels added to the component. ~~Tuple[str, ...]~~ |
 
 ## Serialization fields {#serialization-fields}
@@ -15,7 +15,7 @@ multiple components, e.g. to have one embedding and CNN network shared between a
 [`EntityRecognizer`](/api/entityrecognizer).
 
 In order to use the `Tok2Vec` predictions, subsequent components should use the
-[Tok2VecListener](/api/architectures#Tok2VecListener) layer as the tok2vec
+[Tok2VecListener](/api/architectures#Tok2VecListener) layer as the `tok2vec`
 subnetwork of their model. This layer will read data from the `doc.tensor`
 attribute during prediction. During training, the `Tok2Vec` component will save
 its prediction and backprop callback for each batch, so that the subsequent
@@ -40,9 +40,9 @@ architectures and their arguments and hyperparameters.
 > nlp.add_pipe("tok2vec", config=config)
 > ```
 
-| Setting | Type | Description | Default |
-| ------- | ------------------------------------------ | ----------------------------------------------------------------------- | ----------------------------------------------- |
-| `model` | [`Model`](https://thinc.ai/docs/api-model) | **Input:** `List[Doc]`. **Output:** `List[Floats2d]`. The model to use. | [HashEmbedCNN](/api/architectures#HashEmbedCNN) |
+| Setting | Description |
+| ------- | ------------------------------------------------------------------------------------------------------------------- |
+| `model` | The model to use. Defaults to [HashEmbedCNN](/api/architectures#HashEmbedCNN). ~~Model[List[Doc], List[Floats2d]]~~ |
 
 ```python
 https://github.com/explosion/spaCy/blob/develop/spacy/pipeline/tok2vec.py
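As with the textcat hunk earlier, the `tok2vec` setting above would normally be expressed in the training `config.cfg`. A hedged sketch — the `@architectures` name (`spacy.HashEmbedCNN.v1`) is an assumption; the [HashEmbedCNN](/api/architectures#HashEmbedCNN) entry is authoritative for its name and parameters:

```ini
[components.tok2vec]
factory = "tok2vec"

[components.tok2vec.model]
@architectures = "spacy.HashEmbedCNN.v1"
```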
@@ -69,11 +69,11 @@ Create a new pipeline instance. In your application, you would normally use a
 shortcut for this and instantiate the component using its string name and
 [`nlp.add_pipe`](/api/language#create_pipe).
 
-| Name | Type | Description |
-| ------- | ------------------------------------------ | ------------------------------------------------------------------------------------------- |
-| `vocab` | `Vocab` | The shared vocabulary. |
-| `model` | [`Model`](https://thinc.ai/docs/api-model) | The Thinc [`Model`](https://thinc.ai/docs/api-model) powering the pipeline component. |
-| `name` | str | String name of the component instance. Used to add entries to the `losses` during training. |
+| Name | Description |
+| ------- | -------------------------------------------------------------------------------------------------------------------------- |
+| `vocab` | The shared vocabulary. ~~Vocab~~ |
+| `model` | The Thinc [`Model`](https://thinc.ai/docs/api-model) powering the pipeline component. ~~Model[List[Doc], List[Floats2d]]~~ |
+| `name` | String name of the component instance. Used to add entries to the `losses` during training. ~~str~~ |
 
 ## Tok2Vec.\_\_call\_\_ {#call tag="method"}
@@ -95,10 +95,10 @@ pipeline components are applied to the `Doc` in order. Both
 > processed = tok2vec(doc)
 > ```
 
-| Name | Type | Description |
-| ----------- | ----- | ------------------------ |
-| `doc` | `Doc` | The document to process. |
-| **RETURNS** | `Doc` | The processed document. |
+| Name | Description |
+| ----------- | -------------------------------- |
+| `doc` | The document to process. ~~Doc~~ |
+| **RETURNS** | The processed document. ~~Doc~~ |
 
 ## Tok2Vec.pipe {#pipe tag="method"}
@@ -116,12 +116,12 @@ and [`set_annotations`](/api/tok2vec#set_annotations) methods.
 > pass
 > ```
 
-| Name | Type | Description |
-| -------------- | --------------- | ----------------------------------------------------- |
-| `stream` | `Iterable[Doc]` | A stream of documents. |
-| _keyword-only_ | | |
-| `batch_size` | int | The number of documents to buffer. Defaults to `128`. |
-| **YIELDS** | `Doc` | The processed documents in order. |
+| Name | Description |
+| -------------- | ------------------------------------------------------------- |
+| `stream` | A stream of documents. ~~Iterable[Doc]~~ |
+| _keyword-only_ | |
+| `batch_size` | The number of documents to buffer. Defaults to `128`. ~~int~~ |
+| **YIELDS** | The processed documents in order. ~~Doc~~ |
 
 ## Tok2Vec.begin_training {#begin_training tag="method"}
@@ -141,13 +141,13 @@ setting up the label scheme based on the data.
 > optimizer = tok2vec.begin_training(lambda: [], pipeline=nlp.pipeline)
 > ```
 
-| Name | Type | Description |
-| -------------- | --------------------------------------------------- | ---------------------------------------------------------------------------------------------------------- |
-| `get_examples` | `Callable[[], Iterable[Example]]` | Optional function that returns gold-standard annotations in the form of [`Example`](/api/example) objects. |
-| _keyword-only_ | | |
-| `pipeline` | `List[Tuple[str, Callable]]` | Optional list of pipeline components that this component is part of. |
-| `sgd` | [`Optimizer`](https://thinc.ai/docs/api-optimizers) | An optional optimizer. Will be created via [`create_optimizer`](/api/tok2vec#create_optimizer) if not set. |
-| **RETURNS** | [`Optimizer`](https://thinc.ai/docs/api-optimizers) | The optimizer. |
+| Name | Description |
+| -------------- | ------------------------------------------------------------------------------------------------------------------------------------- |
+| `get_examples` | Function that returns gold-standard annotations in the form of [`Example`](/api/example) objects. ~~Callable[[], Iterable[Example]]~~ |
+| _keyword-only_ | | |
+| `pipeline` | Optional list of pipeline components that this component is part of. ~~Optional[List[Tuple[str, Callable[[Doc], Doc]]]]~~ |
+| `sgd` | An optimizer. Will be created via [`create_optimizer`](#create_optimizer) if not set. ~~Optional[Optimizer]~~ |
+| **RETURNS** | The optimizer. ~~Optimizer~~ |
 
 ## Tok2Vec.predict {#predict tag="method"}
@@ -161,10 +161,10 @@ modifying them.
 > scores = tok2vec.predict([doc1, doc2])
 > ```
 
-| Name | Type | Description |
-| ----------- | --------------- | ----------------------------------------- |
-| `docs` | `Iterable[Doc]` | The documents to predict. |
-| **RETURNS** | - | The model's prediction for each document. |
+| Name | Description |
+| ----------- | ------------------------------------------- |
+| `docs` | The documents to predict. ~~Iterable[Doc]~~ |
+| **RETURNS** | The model's prediction for each document. |
 
 ## Tok2Vec.set_annotations {#set_annotations tag="method"}
@ -178,10 +178,10 @@ Modify a batch of [`Doc`](/api/doc) objects, using pre-computed scores.
|
||||||
> tok2vec.set_annotations(docs, scores)
|
> tok2vec.set_annotations(docs, scores)
|
||||||
> ```
|
> ```
|
||||||
|
|
||||||
| Name | Type | Description |
|
| Name | Description |
|
||||||
| -------- | --------------- | ------------------------------------------------- |
|
| -------- | ------------------------------------------------- |
|
||||||
| `docs` | `Iterable[Doc]` | The documents to modify. |
|
| `docs` | The documents to modify. ~~Iterable[Doc]~~ |
|
||||||
| `scores` | - | The scores to set, produced by `Tok2Vec.predict`. |
|
| `scores` | The scores to set, produced by `Tok2Vec.predict`. |
|
||||||
|
|
||||||
## Tok2Vec.update {#update tag="method"}

@ -197,15 +197,15 @@ Delegates to [`predict`](/api/tok2vec#predict).

> ```python
> losses = tok2vec.update(examples, sgd=optimizer)
> ```

| Name              | Description                                                                                                                          |
| ----------------- | ------------------------------------------------------------------------------------------------------------------------------------ |
| `examples`        | A batch of [`Example`](/api/example) objects to learn from. ~~Iterable[Example]~~                                                    |
| _keyword-only_    |                                                                                                                                      |
| `drop`            | The dropout rate. ~~float~~                                                                                                          |
| `set_annotations` | Whether or not to update the `Example` objects with the predictions, delegating to [`set_annotations`](#set_annotations). ~~bool~~   |
| `sgd`             | An optimizer. Will be created via [`create_optimizer`](#create_optimizer) if not set. ~~Optional[Optimizer]~~                        |
| `losses`          | Optional record of the loss during training. Updated using the component name as the key. ~~Optional[Dict[str, float]]~~             |
| **RETURNS**       | The updated `losses` dictionary. ~~Dict[str, float]~~                                                                                |
## Tok2Vec.create_optimizer {#create_optimizer tag="method"}

@ -218,9 +218,9 @@ Create an optimizer for the pipeline component.

> ```python
> optimizer = tok2vec.create_optimizer()
> ```

| Name        | Description                  |
| ----------- | ---------------------------- |
| **RETURNS** | The optimizer. ~~Optimizer~~ |
## Tok2Vec.use_params {#use_params tag="method, contextmanager"}

@ -235,9 +235,9 @@ context, the original parameters are restored.

> ```python
> tok2vec.to_disk("/best_model")
> ```

| Name     | Description                                        |
| -------- | -------------------------------------------------- |
| `params` | The parameter values to use in the model. ~~dict~~ |
## Tok2Vec.to_disk {#to_disk tag="method"}

@ -250,11 +250,11 @@ Serialize the pipe to disk.

> ```python
> tok2vec.to_disk("/path/to/tok2vec")
> ```

| Name           | Description                                                                                                                                |
| -------------- | ------------------------------------------------------------------------------------------------------------------------------------------ |
| `path`         | A path to a directory, which will be created if it doesn't exist. Paths may be either strings or `Path`-like objects. ~~Union[str, Path]~~ |
| _keyword-only_ |                                                                                                                                            |
| `exclude`      | String names of [serialization fields](#serialization-fields) to exclude. ~~Iterable[str]~~                                                |
## Tok2Vec.from_disk {#from_disk tag="method"}

@ -267,12 +267,12 @@ Load the pipe from disk. Modifies the object in place and returns it.

> ```python
> tok2vec.from_disk("/path/to/tok2vec")
> ```

| Name           | Description                                                                                     |
| -------------- | ----------------------------------------------------------------------------------------------- |
| `path`         | A path to a directory. Paths may be either strings or `Path`-like objects. ~~Union[str, Path]~~ |
| _keyword-only_ |                                                                                                 |
| `exclude`      | String names of [serialization fields](#serialization-fields) to exclude. ~~Iterable[str]~~     |
| **RETURNS**    | The modified `Tok2Vec` object. ~~Tok2Vec~~                                                      |
## Tok2Vec.to_bytes {#to_bytes tag="method"}

@ -285,11 +285,11 @@ Load the pipe from disk. Modifies the object in place and returns it.

Serialize the pipe to a bytestring.

| Name           | Description                                                                                 |
| -------------- | ------------------------------------------------------------------------------------------- |
| _keyword-only_ |                                                                                             |
| `exclude`      | String names of [serialization fields](#serialization-fields) to exclude. ~~Iterable[str]~~ |
| **RETURNS**    | The serialized form of the `Tok2Vec` object. ~~bytes~~                                      |
## Tok2Vec.from_bytes {#from_bytes tag="method"}

@ -303,12 +303,12 @@ Load the pipe from a bytestring. Modifies the object in place and returns it.

> ```python
> tok2vec.from_bytes(tok2vec_bytes)
> ```

| Name           | Description                                                                                 |
| -------------- | ------------------------------------------------------------------------------------------- |
| `bytes_data`   | The data to load from. ~~bytes~~                                                            |
| _keyword-only_ |                                                                                             |
| `exclude`      | String names of [serialization fields](#serialization-fields) to exclude. ~~Iterable[str]~~ |
| **RETURNS**    | The `Tok2Vec` object. ~~Tok2Vec~~                                                           |
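As an illustrative sketch of the bytes-based roundtrip (this assumes a blank English pipeline whose `tok2vec` component has been initialized so its model weights exist; it is not taken from the docs above):

```python
import spacy

# Build a minimal pipeline with an uninitialized tok2vec component,
# initialize it so the model weights exist, then roundtrip via bytes.
nlp = spacy.blank("en")
tok2vec = nlp.add_pipe("tok2vec")
nlp.initialize()

data = tok2vec.to_bytes()

# Restore the weights into a second, identically configured component.
nlp2 = spacy.blank("en")
tok2vec2 = nlp2.add_pipe("tok2vec")
nlp2.initialize()
tok2vec2.from_bytes(data)
```

The same pattern applies to `to_disk`/`from_disk`, with a directory path in place of the bytestring.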
## Serialization fields {#serialization-fields}
@ -17,11 +17,11 @@ Construct a `Token` object.
|
||||||
> assert token.text == "Give"
|
> assert token.text == "Give"
|
||||||
> ```
|
> ```
|
||||||
|
|
||||||
| Name | Type | Description |
|
| Name | Description |
|
||||||
| -------- | ------- | ------------------------------------------- |
|
| -------- | --------------------------------------------------- |
|
||||||
| `vocab` | `Vocab` | A storage container for lexical types. |
|
| `vocab` | A storage container for lexical types. ~~Vocab~~ |
|
||||||
| `doc` | `Doc` | The parent document. |
|
| `doc` | The parent document. ~~Doc~~ |
|
||||||
| `offset` | int | The index of the token within the document. |
|
| `offset` | The index of the token within the document. ~~int~~ |
|
||||||
|
|
||||||
## Token.\_\_len\_\_ {#len tag="method"}

@ -35,9 +35,9 @@ The number of unicode characters in the token, i.e. `token.text`.

> ```python
> assert len(token) == 4
> ```

| Name        | Description                                            |
| ----------- | ------------------------------------------------------ |
| **RETURNS** | The number of unicode characters in the token. ~~int~~ |
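For instance, with a blank English pipeline (no trained components needed, the tokenizer alone is enough):

```python
import spacy

# Tokenize with a blank pipeline; len(token) counts unicode characters.
nlp = spacy.blank("en")
doc = nlp("Give it back!")
token = doc[0]
assert len(token) == 4  # "Give"
```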
## Token.set_extension {#set_extension tag="classmethod" new="2"}

@ -55,14 +55,14 @@ For details, see the documentation on

> ```python
> assert doc[3]._.is_fruit
> ```

| Name      | Description                                                                                                                                                                        |
| --------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `name`    | Name of the attribute to set by the extension. For example, `"my_attr"` will be available as `token._.my_attr`. ~~str~~                                                            |
| `default` | Optional default value of the attribute if no getter or method is defined. ~~Optional[Any]~~                                                                                       |
| `method`  | Set a custom method on the object, for example `token._.compare(other_token)`. ~~Optional[Callable[[Token, ...], Any]]~~                                                           |
| `getter`  | Getter function that takes the object and returns an attribute value. Is called when the user accesses the `._` attribute. ~~Optional[Callable[[Token], Any]]~~                    |
| `setter`  | Setter function that takes the `Token` and a value, and modifies the object. Is called when the user writes to the `Token._` attribute. ~~Optional[Callable[[Token, Any], None]]~~ |
| `force`   | Force overwriting existing attribute. ~~bool~~                                                                                                                                     |
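A minimal sketch of a getter-based extension (the attribute name `is_fruit` and the word set are illustrative, not part of the API):

```python
import spacy
from spacy.tokens import Token

# Hypothetical attribute: True if the token text is in a small word set.
fruits = {"apple", "orange", "banana"}
Token.set_extension(
    "is_fruit",
    getter=lambda token: token.text.lower() in fruits,
    force=True,  # overwrite if it was registered before
)

nlp = spacy.blank("en")
doc = nlp("I like banana")
assert doc[2]._.is_fruit
assert not doc[0]._.is_fruit
```

A getter is re-evaluated on every access, so it stays consistent if the underlying token changes; a `default` value, by contrast, is stored per token.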
## Token.get_extension {#get_extension tag="classmethod" new="2"}

@ -79,10 +79,10 @@ Look up a previously registered extension by name. Returns a 4-tuple

> ```python
> assert extension == (False, None, None, None)
> ```

| Name        | Description                                                                                                                                        |
| ----------- | -------------------------------------------------------------------------------------------------------------------------------------------------- |
| `name`      | Name of the extension. ~~str~~                                                                                                                     |
| **RETURNS** | A `(default, method, getter, setter)` tuple of the extension. ~~Tuple[Optional[Any], Optional[Callable], Optional[Callable], Optional[Callable]]~~ |
## Token.has_extension {#has_extension tag="classmethod" new="2"}

@ -96,10 +96,10 @@ Check whether an extension has been registered on the `Token` class.

> ```python
> assert Token.has_extension("is_fruit")
> ```

| Name        | Description                                         |
| ----------- | --------------------------------------------------- |
| `name`      | Name of the extension to check. ~~str~~             |
| **RETURNS** | Whether the extension has been registered. ~~bool~~ |
## Token.remove_extension {#remove_extension tag="classmethod" new="2.0.11"}

@ -114,10 +114,10 @@ Remove a previously registered extension.

> ```python
> assert not Token.has_extension("is_fruit")
> ```

| Name        | Description                                                                                                                                                |
| ----------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `name`      | Name of the extension. ~~str~~                                                                                                                             |
| **RETURNS** | A `(default, method, getter, setter)` tuple of the removed extension. ~~Tuple[Optional[Any], Optional[Callable], Optional[Callable], Optional[Callable]]~~ |
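The full register/inspect/remove lifecycle can be sketched as follows (the extension name `on_sale` is illustrative):

```python
from spacy.tokens import Token

# Register a simple default-valued extension, inspect it, then remove it.
Token.set_extension("on_sale", default=False, force=True)
assert Token.has_extension("on_sale")

# get_extension returns the (default, method, getter, setter) 4-tuple.
default, method, getter, setter = Token.get_extension("on_sale")
assert default is False
assert method is None and getter is None and setter is None

# remove_extension returns the same 4-tuple for the removed extension.
removed = Token.remove_extension("on_sale")
assert removed[0] is False
assert not Token.has_extension("on_sale")
```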
## Token.check_flag {#check_flag tag="method"}

@ -132,10 +132,10 @@ Check the value of a boolean flag.

> ```python
> assert token.check_flag(IS_TITLE) == True
> ```

| Name        | Description                                    |
| ----------- | ---------------------------------------------- |
| `flag_id`   | The attribute ID of the flag to check.  ~~int~~ |
| **RETURNS** | Whether the flag is set. ~~bool~~              |
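A small sketch using the flag IDs from `spacy.attrs` on a blank pipeline:

```python
import spacy
from spacy.attrs import IS_TITLE, IS_PUNCT

nlp = spacy.blank("en")
doc = nlp("Give it back!")
assert doc[0].check_flag(IS_TITLE)      # "Give" is titlecased
assert doc[3].check_flag(IS_PUNCT)      # "!" is punctuation
assert not doc[1].check_flag(IS_TITLE)  # "it" is lowercase
```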
## Token.similarity {#similarity tag="method" model="vectors"}

@ -150,10 +150,10 @@ Compute a semantic similarity estimate. Defaults to cosine over vectors.

> ```python
> assert apples_oranges == oranges_apples
> ```

| Name        | Description                                                                                                                        |
| ----------- | ---------------------------------------------------------------------------------------------------------------------------------- |
| `other`     | The object to compare with. By default, accepts `Doc`, `Span`, `Token` and `Lexeme` objects. ~~Union[Doc, Span, Token, Lexeme]~~   |
| **RETURNS** | A scalar similarity score. Higher is more similar. ~~float~~                                                                       |
## Token.nbor {#nbor tag="method"}

@ -167,10 +167,10 @@ Get a neighboring token.

> ```python
> assert give_nbor.text == "it"
> ```

| Name        | Description                                                         |
| ----------- | ------------------------------------------------------------------- |
| `i`         | The relative position of the token to get. Defaults to `1`. ~~int~~ |
| **RETURNS** | The token at position `self.doc[self.i+i]`. ~~Token~~               |
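`nbor` takes a relative offset, so negative values step backwards; a quick sketch on a blank pipeline:

```python
import spacy

nlp = spacy.blank("en")
doc = nlp("Give it back!")
give = doc[0]
assert give.nbor().text == "it"        # defaults to i=1
assert give.nbor(2).text == "back"     # two tokens ahead
assert doc[1].nbor(-1).text == "Give"  # one token back
```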
## Token.is_ancestor {#is_ancestor tag="method" model="parser"}

@ -186,10 +186,10 @@ dependency tree.

> ```python
> assert give.is_ancestor(it)
> ```

| Name         | Description                                                    |
| ------------ | -------------------------------------------------------------- |
| `descendant` | Another token. ~~Token~~                                       |
| **RETURNS**  | Whether this token is the ancestor of the descendant. ~~bool~~ |
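No trained parser is needed to try this: a parse can be supplied by hand by passing absolute head indices and dependency labels to the `Doc` constructor (the labels below are illustrative):

```python
import spacy
from spacy.tokens import Doc

# Hand-annotated parse: "Give" is the root and heads every other token.
nlp = spacy.blank("en")
doc = Doc(
    nlp.vocab,
    words=["Give", "it", "back", "!"],
    heads=[0, 0, 0, 0],
    deps=["ROOT", "dobj", "prt", "punct"],
)
give, it = doc[0], doc[1]
assert give.is_ancestor(it)
assert not it.is_ancestor(give)
```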
## Token.ancestors {#ancestors tag="property" model="parser"}

@ -205,9 +205,9 @@ The rightmost token of this token's syntactic descendants.

> ```python
> assert [t.text for t in he_ancestors] == ["pleaded"]
> ```

| Name       | Description                                                                     |
| ---------- | ------------------------------------------------------------------------------- |
| **YIELDS** | A sequence of ancestor tokens such that `ancestor.is_ancestor(self)`. ~~Token~~ |
## Token.conjuncts {#conjuncts tag="property" model="parser"}

@ -221,9 +221,9 @@ A tuple of coordinated tokens, not including the token itself.

> ```python
> assert [t.text for t in apples_conjuncts] == ["oranges"]
> ```

| Name        | Description                                   |
| ----------- | --------------------------------------------- |
| **RETURNS** | The coordinated tokens. ~~Tuple[Token, ...]~~ |
## Token.children {#children tag="property" model="parser"}

@ -237,9 +237,9 @@ A sequence of the token's immediate syntactic children.

> ```python
> assert [t.text for t in give_children] == ["it", "back", "!"]
> ```

| Name       | Description                                             |
| ---------- | ------------------------------------------------------- |
| **YIELDS** | A child token such that `child.head == self`. ~~Token~~ |
## Token.lefts {#lefts tag="property" model="parser"}

@ -253,9 +253,9 @@ The leftward immediate children of the word, in the syntactic dependency parse.

> ```python
> assert lefts == ["New"]
> ```

| Name       | Description                          |
| ---------- | ------------------------------------ |
| **YIELDS** | A left-child of the token. ~~Token~~ |
## Token.rights {#rights tag="property" model="parser"}

@ -269,9 +269,9 @@ The rightward immediate children of the word, in the syntactic dependency parse.

> ```python
> assert rights == ["in"]
> ```

| Name       | Description                           |
| ---------- | ------------------------------------- |
| **YIELDS** | A right-child of the token. ~~Token~~ |
## Token.n_lefts {#n_lefts tag="property" model="parser"}

@ -285,9 +285,9 @@ dependency parse.

> ```python
> assert doc[3].n_lefts == 1
> ```

| Name        | Description                              |
| ----------- | ---------------------------------------- |
| **RETURNS** | The number of left-child tokens. ~~int~~ |
## Token.n_rights {#n_rights tag="property" model="parser"}

@ -301,9 +301,9 @@ dependency parse.

> ```python
> assert doc[3].n_rights == 1
> ```

| Name        | Description                               |
| ----------- | ----------------------------------------- |
| **RETURNS** | The number of right-child tokens. ~~int~~ |
## Token.subtree {#subtree tag="property" model="parser"}

@ -317,9 +317,9 @@ A sequence containing the token and all the token's syntactic descendants.

> ```python
> assert [t.text for t in give_subtree] == ["Give", "it", "back", "!"]
> ```

| Name       | Description                                                                          |
| ---------- | ------------------------------------------------------------------------------------ |
| **YIELDS** | A descendant token such that `self.is_ancestor(token)` or `token == self`. ~~Token~~ |
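The tree-navigation properties can all be exercised on a hand-annotated parse, again without a trained parser (head indices and labels below are illustrative):

```python
import spacy
from spacy.tokens import Doc

# "Give" heads every other token, so it has three right-children
# and its subtree covers the whole sentence.
nlp = spacy.blank("en")
doc = Doc(
    nlp.vocab,
    words=["Give", "it", "back", "!"],
    heads=[0, 0, 0, 0],
    deps=["ROOT", "dobj", "prt", "punct"],
)
give = doc[0]
assert [t.text for t in give.children] == ["it", "back", "!"]
assert [t.text for t in give.subtree] == ["Give", "it", "back", "!"]
assert give.n_lefts == 0
assert give.n_rights == 3
```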
## Token.is_sent_start {#is_sent_start tag="property" new="2"}

@ -334,9 +334,9 @@ unknown. Defaults to `True` for the first token in the `Doc`.

> ```python
> assert not doc[5].is_sent_start
> ```

| Name        | Description                                   |
| ----------- | --------------------------------------------- |
| **RETURNS** | Whether the token starts a sentence. ~~bool~~ |
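The property is also writable, which is how rule-based sentence boundaries are set on an unparsed `Doc`; a minimal sketch:

```python
import spacy

nlp = spacy.blank("en")
doc = nlp("Hello world . It rains")

# The first token starts a sentence by default.
assert doc[0].is_sent_start

# Boundaries can be set manually, e.g. in a custom pipeline component.
doc[3].is_sent_start = True
assert doc[3].is_sent_start
```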
## Token.has_vector {#has_vector tag="property" model="vectors"}

@ -350,9 +350,9 @@ A boolean value indicating whether a word vector is associated with the token.

> ```python
> assert apples.has_vector
> ```

| Name        | Description                                            |
| ----------- | ------------------------------------------------------ |
| **RETURNS** | Whether the token has a vector data attached. ~~bool~~ |
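The negative case is easy to check without downloading a model, since a blank pipeline ships no word vectors:

```python
import spacy

# A blank pipeline has an empty vectors table, so no token has a vector.
nlp = spacy.blank("en")
doc = nlp("apples")
token = doc[0]
assert not token.has_vector
```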
## Token.vector {#vector tag="property" model="vectors"}

@ -367,9 +367,9 @@ A real-valued meaning representation.

> ```python
> assert apples.vector.shape == (300,)
> ```

| Name        | Description                                                                                     |
| ----------- | ----------------------------------------------------------------------------------------------- |
| **RETURNS** | A 1-dimensional array representing the token's vector. ~~numpy.ndarray[ndim=1, dtype=float32]~~ |
## Token.vector_norm {#vector_norm tag="property" model="vectors"}

@ -386,80 +386,80 @@ The L2 norm of the token's vector representation.

> ```python
> assert apples.vector_norm != pasta.vector_norm
> ```

| Name        | Description                                         |
| ----------- | --------------------------------------------------- |
| **RETURNS** | The L2 norm of the vector representation. ~~float~~ |
## Attributes {#attributes}

| Name                                         | Description                                                                                                                                                                        |
| -------------------------------------------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `doc`                                        | The parent document. ~~Doc~~                                                                                                                                                       |
| `lex` <Tag variant="new">3</Tag>             | The underlying lexeme. ~~Lexeme~~                                                                                                                                                  |
| `sent` <Tag variant="new">2.0.12</Tag>       | The sentence span that this token is a part of. ~~Span~~                                                                                                                           |
| `text`                                       | Verbatim text content. ~~str~~                                                                                                                                                     |
| `text_with_ws`                               | Text content, with trailing space character if present. ~~str~~                                                                                                                    |
| `whitespace_`                                | Trailing space character if present. ~~str~~                                                                                                                                       |
| `orth`                                       | ID of the verbatim text content. ~~int~~                                                                                                                                           |
| `orth_`                                      | Verbatim text content (identical to `Token.text`). Exists mostly for consistency with the other attributes. ~~str~~                                                                |
| `vocab`                                      | The vocab object of the parent `Doc`. ~~Vocab~~                                                                                                                                    |
| `tensor` <Tag variant="new">2.1.7</Tag>      | The token's slice of the parent `Doc`'s tensor. ~~numpy.ndarray~~                                                                                                                  |
| `head`                                       | The syntactic parent, or "governor", of this token. ~~Token~~                                                                                                                      |
| `left_edge`                                  | The leftmost token of this token's syntactic descendants. ~~Token~~                                                                                                                |
| `right_edge`                                 | The rightmost token of this token's syntactic descendants. ~~Token~~                                                                                                               |
| `i`                                          | The index of the token within the parent document. ~~int~~                                                                                                                         |
| `ent_type`                                   | Named entity type. ~~int~~                                                                                                                                                         |
| `ent_type_`                                  | Named entity type. ~~str~~                                                                                                                                                         |
| `ent_iob`                                    | IOB code of named entity tag. `3` means the token begins an entity, `2` means it is outside an entity, `1` means it is inside an entity, and `0` means no entity tag is set. ~~int~~ |
| `ent_iob_`                                   | IOB code of named entity tag. "B" means the token begins an entity, "I" means it is inside an entity, "O" means it is outside an entity, and "" means no entity tag is set. ~~str~~ |
| `ent_kb_id` <Tag variant="new">2.2</Tag>     | Knowledge base ID that refers to the named entity this token is a part of, if any. ~~int~~                                                                                         |
| `ent_kb_id_` <Tag variant="new">2.2</Tag>    | Knowledge base ID that refers to the named entity this token is a part of, if any. ~~str~~                                                                                         |
| `ent_id`                                     | ID of the entity the token is an instance of, if any. Currently not used, but potentially for coreference resolution. ~~int~~                                                      |
| `ent_id_`                                    | ID of the entity the token is an instance of, if any. Currently not used, but potentially for coreference resolution. ~~str~~                                                      |
| `lemma`                                      | Base form of the token, with no inflectional suffixes. ~~int~~                                                                                                                     |
| `lemma_`                                     | Base form of the token, with no inflectional suffixes. ~~str~~                                                                                                                     |
| `norm`                                       | The token's norm, i.e. a normalized form of the token text. Can be set in the language's [tokenizer exceptions](/usage/adding-languages#tokenizer-exceptions). ~~int~~             |
| `norm_`                                      | The token's norm, i.e. a normalized form of the token text. Can be set in the language's [tokenizer exceptions](/usage/adding-languages#tokenizer-exceptions). ~~str~~             |
|
| `norm_` | The token's norm, i.e. a normalized form of the token text. Can be set in the language's [tokenizer exceptions](/usage/adding-languages#tokenizer-exceptions). ~~str~~ |
|
||||||
| `lower` | int | Lowercase form of the token. |
|
| `lower` | Lowercase form of the token. ~~int~~ |
|
||||||
| `lower_` | str | Lowercase form of the token text. Equivalent to `Token.text.lower()`. |
|
| `lower_` | Lowercase form of the token text. Equivalent to `Token.text.lower()`. ~~str~~ |
|
||||||
| `shape` | int | Transform of the token's string, to show orthographic features. Alphabetic characters are replaced by `x` or `X`, numeric characters are replaced by `d`, and sequences of the same character are truncated after length 4. For example, `"Xxxx"` or `"dd"`. |

| `shape` | Transform of the token's string, to show orthographic features. Alphabetic characters are replaced by `x` or `X`, numeric characters are replaced by `d`, and sequences of the same character are truncated after length 4. For example, `"Xxxx"` or `"dd"`. ~~int~~ |
|
||||||
| `shape_` | str | Transform of the token's string, to show orthographic features. Alphabetic characters are replaced by `x` or `X`, numeric characters are replaced by `d`, and sequences of the same character are truncated after length 4. For example, `"Xxxx"` or `"dd"`. |

| `shape_` | Transform of the token's string, to show orthographic features. Alphabetic characters are replaced by `x` or `X`, numeric characters are replaced by `d`, and sequences of the same character are truncated after length 4. For example, `"Xxxx"` or `"dd"`. ~~str~~ |
|
||||||
| `prefix` | int | Hash value of a length-N substring from the start of the token. Defaults to `N=1`. |
|
| `prefix` | Hash value of a length-N substring from the start of the token. Defaults to `N=1`. ~~int~~ |
|
||||||
| `prefix_` | str | A length-N substring from the start of the token. Defaults to `N=1`. |
|
| `prefix_` | A length-N substring from the start of the token. Defaults to `N=1`. ~~str~~ |
|
||||||
| `suffix` | int | Hash value of a length-N substring from the end of the token. Defaults to `N=3`. |
|
| `suffix` | Hash value of a length-N substring from the end of the token. Defaults to `N=3`. ~~int~~ |
|
||||||
| `suffix_` | str | Length-N substring from the end of the token. Defaults to `N=3`. |
|
| `suffix_` | Length-N substring from the end of the token. Defaults to `N=3`. ~~str~~ |
|
||||||
| `is_alpha` | bool | Does the token consist of alphabetic characters? Equivalent to `token.text.isalpha()`. |
|
| `is_alpha` | Does the token consist of alphabetic characters? Equivalent to `token.text.isalpha()`. ~~bool~~ |
|
||||||
| `is_ascii` | bool | Does the token consist of ASCII characters? Equivalent to `all(ord(c) < 128 for c in token.text)`. |
|
| `is_ascii` | Does the token consist of ASCII characters? Equivalent to `all(ord(c) < 128 for c in token.text)`. ~~bool~~ |
|
||||||
| `is_digit` | bool | Does the token consist of digits? Equivalent to `token.text.isdigit()`. |
|
| `is_digit` | Does the token consist of digits? Equivalent to `token.text.isdigit()`. ~~bool~~ |
|
||||||
| `is_lower` | bool | Is the token in lowercase? Equivalent to `token.text.islower()`. |
|
| `is_lower` | Is the token in lowercase? Equivalent to `token.text.islower()`. ~~bool~~ |
|
||||||
| `is_upper` | bool | Is the token in uppercase? Equivalent to `token.text.isupper()`. |
|
| `is_upper` | Is the token in uppercase? Equivalent to `token.text.isupper()`. ~~bool~~ |
|
||||||
| `is_title` | bool | Is the token in titlecase? Equivalent to `token.text.istitle()`. |
|
| `is_title` | Is the token in titlecase? Equivalent to `token.text.istitle()`. ~~bool~~ |
|
||||||
| `is_punct` | bool | Is the token punctuation? |
|
| `is_punct` | Is the token punctuation? ~~bool~~ |
|
||||||
| `is_left_punct` | bool | Is the token a left punctuation mark, e.g. `"("` ? |
|
| `is_left_punct` | Is the token a left punctuation mark, e.g. `"("`? ~~bool~~ |
|
||||||
| `is_right_punct` | bool | Is the token a right punctuation mark, e.g. `")"` ? |
|
| `is_right_punct` | Is the token a right punctuation mark, e.g. `")"`? ~~bool~~ |
|
||||||
| `is_space` | bool | Does the token consist of whitespace characters? Equivalent to `token.text.isspace()`. |
|
| `is_space` | Does the token consist of whitespace characters? Equivalent to `token.text.isspace()`. ~~bool~~ |
|
||||||
| `is_bracket` | bool | Is the token a bracket? |
|
| `is_bracket` | Is the token a bracket? ~~bool~~ |
|
||||||
| `is_quote` | bool | Is the token a quotation mark? |
|
| `is_quote` | Is the token a quotation mark? ~~bool~~ |
|
||||||
| `is_currency` <Tag variant="new">2.0.8</Tag> | bool | Is the token a currency symbol? |
|
| `is_currency` <Tag variant="new">2.0.8</Tag> | Is the token a currency symbol? ~~bool~~ |
|
||||||
| `like_url` | bool | Does the token resemble a URL? |
|
| `like_url` | Does the token resemble a URL? ~~bool~~ |
|
||||||
| `like_num` | bool | Does the token represent a number? e.g. "10.9", "10", "ten", etc. |
|
| `like_num` | Does the token represent a number? e.g. "10.9", "10", "ten", etc. ~~bool~~ |
|
||||||
| `like_email` | bool | Does the token resemble an email address? |
|
| `like_email` | Does the token resemble an email address? ~~bool~~ |
|
||||||
| `is_oov` | bool | Does the token have a word vector? |
|
| `is_oov` | Is the token out-of-vocabulary (i.e. does it not have a word vector)? ~~bool~~ |
|
||||||
| `is_stop` | bool | Is the token part of a "stop list"? |
|
| `is_stop` | Is the token part of a "stop list"? ~~bool~~ |
|
||||||
| `pos` | int | Coarse-grained part-of-speech from the [Universal POS tag set](https://universaldependencies.org/docs/u/pos/). |
|
| `pos` | Coarse-grained part-of-speech from the [Universal POS tag set](https://universaldependencies.org/docs/u/pos/). ~~int~~ |
|
||||||
| `pos_` | str | Coarse-grained part-of-speech from the [Universal POS tag set](https://universaldependencies.org/docs/u/pos/). |
|
| `pos_` | Coarse-grained part-of-speech from the [Universal POS tag set](https://universaldependencies.org/docs/u/pos/). ~~str~~ |
|
||||||
| `tag` | int | Fine-grained part-of-speech. |
|
| `tag` | Fine-grained part-of-speech. ~~int~~ |
|
||||||
| `tag_` | str | Fine-grained part-of-speech. |
|
| `tag_` | Fine-grained part-of-speech. ~~str~~ |
|
||||||
| `morph` | `MorphAnalysis` | Morphological analysis. |
|
| `morph` <Tag variant="new">3</Tag> | Morphological analysis. ~~MorphAnalysis~~ |
|
||||||
| `morph_` | str | Morphological analysis in UD FEATS format. |
|
| `morph_` <Tag variant="new">3</Tag> | Morphological analysis in the Universal Dependencies [FEATS](https://universaldependencies.org/format.html#morphological-annotation) format. ~~str~~ |
|
||||||
| `dep` | int | Syntactic dependency relation. |
|
| `dep` | Syntactic dependency relation. ~~int~~ |
|
||||||
| `dep_` | str | Syntactic dependency relation. |
|
| `dep_` | Syntactic dependency relation. ~~str~~ |
|
||||||
| `lang` | int | Language of the parent document's vocabulary. |
|
| `lang` | Language of the parent document's vocabulary. ~~int~~ |
|
||||||
| `lang_` | str | Language of the parent document's vocabulary. |
|
| `lang_` | Language of the parent document's vocabulary. ~~str~~ |
|
||||||
| `prob` | float | Smoothed log probability estimate of token's word type (context-independent entry in the vocabulary). |
|
| `prob` | Smoothed log probability estimate of token's word type (context-independent entry in the vocabulary). ~~float~~ |
|
||||||
| `idx` | int | The character offset of the token within the parent document. |
|
| `idx` | The character offset of the token within the parent document. ~~int~~ |
|
||||||
| `sentiment` | float | A scalar value indicating the positivity or negativity of the token. |
|
| `sentiment` | A scalar value indicating the positivity or negativity of the token. ~~float~~ |
|
||||||
| `lex_id` | int | Sequential ID of the token's lexical type, used to index into tables, e.g. for word vectors. |
|
| `lex_id` | Sequential ID of the token's lexical type, used to index into tables, e.g. for word vectors. ~~int~~ |
|
||||||
| `rank` | int | Sequential ID of the token's lexical type, used to index into tables, e.g. for word vectors. |
|
| `rank` | Sequential ID of the token's lexical type, used to index into tables, e.g. for word vectors. ~~int~~ |
|
||||||
| `cluster` | int | Brown cluster ID. |
|
| `cluster` | Brown cluster ID. ~~int~~ |
|
||||||
| `_` | `Underscore` | User space for adding custom [attribute extensions](/usage/processing-pipelines#custom-components-attributes). |
|
| `_` | User space for adding custom [attribute extensions](/usage/processing-pipelines#custom-components-attributes). ~~Underscore~~ |
|
||||||
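The attributes above can be read straight off each `Token` of a processed `Doc`. A minimal sketch, using a blank English pipeline (so only tokenizer-level attributes like `shape_` and `is_alpha` are set; `pos_`, `dep_` and `ent_type_` require trained components):

```python
import spacy

# A blank pipeline only runs the tokenizer
nlp = spacy.blank("en")
doc = nlp("Apple isn't looking at $1 billion")
for token in doc:
    # index, raw text, lowercase form, orthographic shape, alpha flag
    print(token.i, token.text, token.lower_, token.shape_, token.is_alpha)
```

For example, `doc[0].shape_` here is `"Xxxxx"` and the contraction "isn't" is split into "is" and "n't" by the English tokenizer exceptions.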
|
|
|
@ -45,15 +45,15 @@ the
|
||||||
> tokenizer = nlp.tokenizer
|
> tokenizer = nlp.tokenizer
|
||||||
> ```
|
> ```
|
||||||
|
|
||||||
| Name | Type | Description |
|
| Name | Description |
|
||||||
| ---------------- | -------- | ------------------------------------------------------------------------------------------------------------------------------ |
|
| ---------------- | ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
|
||||||
| `vocab` | `Vocab` | A storage container for lexical types. |
|
| `vocab` | A storage container for lexical types. ~~Vocab~~ |
|
||||||
| `rules` | dict | Exceptions and special-cases for the tokenizer. |
|
| `rules` | Exceptions and special-cases for the tokenizer. ~~Optional[Dict[str, List[Dict[int, str]]]]~~ |
|
||||||
| `prefix_search` | callable | A function matching the signature of `re.compile(string).search` to match prefixes. |
|
| `prefix_search` | A function matching the signature of `re.compile(string).search` to match prefixes. ~~Optional[Callable[[str], Optional[Match]]]~~ |
|
||||||
| `suffix_search` | callable | A function matching the signature of `re.compile(string).search` to match suffixes. |
|
| `suffix_search` | A function matching the signature of `re.compile(string).search` to match suffixes. ~~Optional[Callable[[str], Optional[Match]]]~~ |
|
||||||
| `infix_finditer` | callable | A function matching the signature of `re.compile(string).finditer` to find infixes. |
|
| `infix_finditer` | A function matching the signature of `re.compile(string).finditer` to find infixes. ~~Optional[Callable[[str], Iterator[Match]]]~~ |
|
||||||
| `token_match` | callable | A function matching the signature of `re.compile(string).match` to find token matches. |
|
| `token_match` | A function matching the signature of `re.compile(string).match` to find token matches. ~~Optional[Callable[[str], Optional[Match]]]~~ |
|
||||||
| `url_match` | callable | A function matching the signature of `re.compile(string).match` to find token matches after considering prefixes and suffixes. |
|
| `url_match` | A function matching the signature of `re.compile(string).match` to find token matches after considering prefixes and suffixes. ~~Optional[Callable[[str], Optional[Match]]]~~ |
|
||||||
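To illustrate the constructor arguments above, here's a minimal sketch that builds a `Tokenizer` with only a custom `infix_finditer` (a deliberately tiny pattern, not the full default rule set), so hyphens become their own tokens:

```python
import re
import spacy
from spacy.tokenizer import Tokenizer

nlp = spacy.blank("en")
# Hypothetical minimal infix pattern: split on hyphens and tildes only
infix_re = re.compile(r"[-~]")
tokenizer = Tokenizer(nlp.vocab, infix_finditer=infix_re.finditer)
doc = tokenizer("mother-in-law")
print([t.text for t in doc])
```

Since no `prefix_search` or `suffix_search` is passed, no prefix or suffix stripping happens; only the infix rule applies.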
|
|
||||||
## Tokenizer.\_\_call\_\_ {#call tag="method"}
|
## Tokenizer.\_\_call\_\_ {#call tag="method"}
|
||||||
|
|
||||||
|
@ -66,10 +66,10 @@ Tokenize a string.
|
||||||
> assert len(tokens) == 4
|
> assert len(tokens) == 4
|
||||||
> ```
|
> ```
|
||||||
|
|
||||||
| Name | Type | Description |
|
| Name | Description |
|
||||||
| ----------- | ----- | --------------------------------------- |
|
| ----------- | ----------------------------------------------- |
|
||||||
| `string` | str | The string to tokenize. |
|
| `string` | The string to tokenize. ~~str~~ |
|
||||||
| **RETURNS** | `Doc` | A container for linguistic annotations. |
|
| **RETURNS** | A container for linguistic annotations. ~~Doc~~ |
|
||||||
|
|
||||||
## Tokenizer.pipe {#pipe tag="method"}
|
## Tokenizer.pipe {#pipe tag="method"}
|
||||||
|
|
||||||
|
@ -83,40 +83,40 @@ Tokenize a stream of texts.
|
||||||
> pass
|
> pass
|
||||||
> ```
|
> ```
|
||||||
|
|
||||||
| Name | Type | Description |
|
| Name | Description |
|
||||||
| ------------ | ----- | ---------------------------------------------------------------------------- |
|
| ------------ | ------------------------------------------------------------------------------------ |
|
||||||
| `texts` | - | A sequence of unicode texts. |
|
| `texts` | A sequence of unicode texts. ~~Iterable[str]~~ |
|
||||||
| `batch_size` | int | The number of texts to accumulate in an internal buffer. Defaults to `1000`. |
|
| `batch_size` | The number of texts to accumulate in an internal buffer. Defaults to `1000`. ~~int~~ |
|
||||||
| **YIELDS** | `Doc` | A sequence of Doc objects, in order. |
|
| **YIELDS** | The tokenized Doc objects, in order. ~~Doc~~ |
|
||||||
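A short sketch of streaming tokenization with `Tokenizer.pipe`, which buffers `batch_size` texts at a time and yields one `Doc` per input text, in order:

```python
import spacy

nlp = spacy.blank("en")
texts = ["One document.", "...", "Lots of documents"]
# Tokenize the stream; batch_size is illustrative
docs = list(nlp.tokenizer.pipe(texts, batch_size=2))
print(len(docs), docs[0][0].text)
```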
|
|
||||||
## Tokenizer.find_infix {#find_infix tag="method"}
|
## Tokenizer.find_infix {#find_infix tag="method"}
|
||||||
|
|
||||||
Find internal split points of the string.
|
Find internal split points of the string.
|
||||||
|
|
||||||
| Name | Type | Description |
|
| Name | Description |
|
||||||
| ----------- | ---- | -------------------------------------------------------------------------------------------------------------------------------------------------- |
|
| ----------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------ |
|
||||||
| `string` | str | The string to split. |
|
| `string` | The string to split. ~~str~~ |
|
||||||
| **RETURNS** | list | A list of `re.MatchObject` objects that have `.start()` and `.end()` methods, denoting the placement of internal segment separators, e.g. hyphens. |
|
| **RETURNS** | A list of `re.MatchObject` objects that have `.start()` and `.end()` methods, denoting the placement of internal segment separators, e.g. hyphens. ~~List[Match]~~ |
|
||||||
|
|
||||||
## Tokenizer.find_prefix {#find_prefix tag="method"}
|
## Tokenizer.find_prefix {#find_prefix tag="method"}
|
||||||
|
|
||||||
Find the length of a prefix that should be segmented from the string, or `None`
|
Find the length of a prefix that should be segmented from the string, or `None`
|
||||||
if no prefix rules match.
|
if no prefix rules match.
|
||||||
|
|
||||||
| Name | Type | Description |
|
| Name | Description |
|
||||||
| ----------- | ---- | ------------------------------------------------------ |
|
| ----------- | ------------------------------------------------------------------------ |
|
||||||
| `string` | str | The string to segment. |
|
| `string` | The string to segment. ~~str~~ |
|
||||||
| **RETURNS** | int | The length of the prefix if present, otherwise `None`. |
|
| **RETURNS** | The length of the prefix if present, otherwise `None`. ~~Optional[int]~~ |
|
||||||
|
|
||||||
## Tokenizer.find_suffix {#find_suffix tag="method"}
|
## Tokenizer.find_suffix {#find_suffix tag="method"}
|
||||||
|
|
||||||
Find the length of a suffix that should be segmented from the string, or `None`
|
Find the length of a suffix that should be segmented from the string, or `None`
|
||||||
if no suffix rules match.
|
if no suffix rules match.
|
||||||
|
|
||||||
| Name | Type | Description |
|
| Name | Description |
|
||||||
| ----------- | ------------ | ------------------------------------------------------ |
|
| ----------- | ------------------------------------------------------------------------ |
|
||||||
| `string` | str | The string to segment. |
|
| `string` | The string to segment. ~~str~~ |
|
||||||
| **RETURNS** | int / `None` | The length of the suffix if present, otherwise `None`. |
|
| **RETURNS** | The length of the suffix if present, otherwise `None`. ~~Optional[int]~~ |
|
||||||
|
|
||||||
## Tokenizer.add_special_case {#add_special_case tag="method"}
|
## Tokenizer.add_special_case {#add_special_case tag="method"}
|
||||||
|
|
||||||
|
@ -134,10 +134,10 @@ and examples.
|
||||||
> tokenizer.add_special_case("don't", case)
|
> tokenizer.add_special_case("don't", case)
|
||||||
> ```
|
> ```
|
||||||
|
|
||||||
| Name | Type | Description |
|
| Name | Description |
|
||||||
| ------------- | -------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------ |
|
| ------------- | ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
|
||||||
| `string` | str | The string to specially tokenize. |
|
| `string` | The string to specially tokenize. ~~str~~ |
|
||||||
| `token_attrs` | iterable | A sequence of dicts, where each dict describes a token and its attributes. The `ORTH` fields of the attributes must exactly match the string when they are concatenated. |
|
| `token_attrs` | A sequence of dicts, where each dict describes a token and its attributes. The `ORTH` fields of the attributes must exactly match the string when they are concatenated. ~~Iterable[Dict[int, str]]~~ |
|
||||||
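A complete minimal sketch of `add_special_case`: each dict in the sequence describes one output token via `ORTH`, and the `ORTH` values must concatenate back to the input string exactly:

```python
import spacy
from spacy.symbols import ORTH

nlp = spacy.blank("en")
# "gim" + "me" concatenates exactly to "gimme"
special_case = [{ORTH: "gim"}, {ORTH: "me"}]
nlp.tokenizer.add_special_case("gimme", special_case)
doc = nlp("gimme that")
print([t.text for t in doc])
```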
|
|
||||||
## Tokenizer.explain {#explain tag="method"}
|
## Tokenizer.explain {#explain tag="method"}
|
||||||
|
|
||||||
|
@ -153,10 +153,10 @@ produced are identical to `Tokenizer.__call__` except for whitespace tokens.
|
||||||
> assert [t[1] for t in tok_exp] == ["(", "do", "n't", ")"]
|
> assert [t[1] for t in tok_exp] == ["(", "do", "n't", ")"]
|
||||||
> ```
|
> ```
|
||||||
|
|
||||||
| Name | Type | Description |
|
| Name | Description |
|
||||||
| ----------- | ---- | --------------------------------------------------- |
|
| ----------- | ---------------------------------------------------------------------------- |
|
||||||
| `string` | str | The string to tokenize with the debugging tokenizer |
|
| `string` | The string to tokenize with the debugging tokenizer. ~~str~~ |
|
||||||
| **RETURNS** | list | A list of `(pattern_string, token_string)` tuples |
|
| **RETURNS** | A list of `(pattern_string, token_string)` tuples. ~~List[Tuple[str, str]]~~ |
|
||||||
|
|
||||||
## Tokenizer.to_disk {#to_disk tag="method"}
|
## Tokenizer.to_disk {#to_disk tag="method"}
|
||||||
|
|
||||||
|
@ -169,11 +169,11 @@ Serialize the tokenizer to disk.
|
||||||
> tokenizer.to_disk("/path/to/tokenizer")
|
> tokenizer.to_disk("/path/to/tokenizer")
|
||||||
> ```
|
> ```
|
||||||
|
|
||||||
| Name | Type | Description |
|
| Name | Description |
|
||||||
| -------------- | --------------- | --------------------------------------------------------------------------------------------------------------------- |
|
| -------------- | ------------------------------------------------------------------------------------------------------------------------------------------ |
|
||||||
| `path` | str / `Path` | A path to a directory, which will be created if it doesn't exist. Paths may be either strings or `Path`-like objects. |
|
| `path` | A path to a directory, which will be created if it doesn't exist. Paths may be either strings or `Path`-like objects. ~~Union[str, Path]~~ |
|
||||||
| _keyword-only_ | | |
|
| _keyword-only_ | |
|
||||||
| `exclude` | `Iterable[str]` | String names of [serialization fields](#serialization-fields) to exclude. |
|
| `exclude` | String names of [serialization fields](#serialization-fields) to exclude. ~~Iterable[str]~~ |
|
||||||
|
|
||||||
## Tokenizer.from_disk {#from_disk tag="method"}
|
## Tokenizer.from_disk {#from_disk tag="method"}
|
||||||
|
|
||||||
|
@ -186,12 +186,12 @@ Load the tokenizer from disk. Modifies the object in place and returns it.
|
||||||
> tokenizer.from_disk("/path/to/tokenizer")
|
> tokenizer.from_disk("/path/to/tokenizer")
|
||||||
> ```
|
> ```
|
||||||
|
|
||||||
| Name | Type | Description |
|
| Name | Description |
|
||||||
| -------------- | --------------- | -------------------------------------------------------------------------- |
|
| -------------- | ----------------------------------------------------------------------------------------------- |
|
||||||
| `path` | str / `Path` | A path to a directory. Paths may be either strings or `Path`-like objects. |
|
| `path` | A path to a directory. Paths may be either strings or `Path`-like objects. ~~Union[str, Path]~~ |
|
||||||
| _keyword-only_ | | |
|
| _keyword-only_ | |
|
||||||
| `exclude` | `Iterable[str]` | String names of [serialization fields](#serialization-fields) to exclude. |
|
| `exclude` | String names of [serialization fields](#serialization-fields) to exclude. ~~Iterable[str]~~ |
|
||||||
| **RETURNS** | `Tokenizer` | The modified `Tokenizer` object. |
|
| **RETURNS** | The modified `Tokenizer` object. ~~Tokenizer~~ |
|
||||||
|
|
||||||
## Tokenizer.to_bytes {#to_bytes tag="method"}
|
## Tokenizer.to_bytes {#to_bytes tag="method"}
|
||||||
|
|
||||||
|
@ -204,11 +204,11 @@ Load the tokenizer from disk. Modifies the object in place and returns it.
|
||||||
|
|
||||||
Serialize the tokenizer to a bytestring.
|
Serialize the tokenizer to a bytestring.
|
||||||
|
|
||||||
| Name | Type | Description |
|
| Name | Description |
|
||||||
| -------------- | --------------- | ------------------------------------------------------------------------- |
|
| -------------- | ------------------------------------------------------------------------------------------- |
|
||||||
| _keyword-only_ | | |
|
| _keyword-only_ | |
|
||||||
| `exclude` | `Iterable[str]` | String names of [serialization fields](#serialization-fields) to exclude. |
|
| `exclude` | String names of [serialization fields](#serialization-fields) to exclude. ~~Iterable[str]~~ |
|
||||||
| **RETURNS** | bytes | The serialized form of the `Tokenizer` object. |
|
| **RETURNS** | The serialized form of the `Tokenizer` object. ~~bytes~~ |
|
||||||
|
|
||||||
## Tokenizer.from_bytes {#from_bytes tag="method"}
|
## Tokenizer.from_bytes {#from_bytes tag="method"}
|
||||||
|
|
||||||
|
@ -223,23 +223,23 @@ it.
|
||||||
> tokenizer.from_bytes(tokenizer_bytes)
|
> tokenizer.from_bytes(tokenizer_bytes)
|
||||||
> ```
|
> ```
|
||||||
|
|
||||||
| Name | Type | Description |
|
| Name | Description |
|
||||||
| -------------- | --------------- | ------------------------------------------------------------------------- |
|
| -------------- | ------------------------------------------------------------------------------------------- |
|
||||||
| `bytes_data` | bytes | The data to load from. |
|
| `bytes_data` | The data to load from. ~~bytes~~ |
|
||||||
| _keyword-only_ | | |
|
| _keyword-only_ | |
|
||||||
| `exclude` | `Iterable[str]` | String names of [serialization fields](#serialization-fields) to exclude. |
|
| `exclude` | String names of [serialization fields](#serialization-fields) to exclude. ~~Iterable[str]~~ |
|
||||||
| **RETURNS** | `Tokenizer` | The `Tokenizer` object. |
|
| **RETURNS** | The `Tokenizer` object. ~~Tokenizer~~ |
|
||||||
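A sketch of a full bytes round trip: serialize a tokenizer (including its special-case rules) and restore it into a fresh pipeline of the same language with `from_bytes`:

```python
import spacy
from spacy.symbols import ORTH

nlp = spacy.blank("en")
nlp.tokenizer.add_special_case("gimme", [{ORTH: "gim"}, {ORTH: "me"}])
data = nlp.tokenizer.to_bytes()

# Restore into a second, fresh pipeline; the special case survives
nlp2 = spacy.blank("en")
nlp2.tokenizer.from_bytes(data)
doc = nlp2("gimme that")
print([t.text for t in doc])
```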
|
|
||||||
## Attributes {#attributes}
|
## Attributes {#attributes}
|
||||||
|
|
||||||
| Name | Type | Description |
|
| Name | Description |
|
||||||
| ---------------- | ------- | -------------------------------------------------------------------------------------------------------------------------- |
|
| ---------------- | ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
|
||||||
| `vocab` | `Vocab` | The vocab object of the parent `Doc`. |
|
| `vocab` | The vocab object of the parent `Doc`. ~~Vocab~~ |
|
||||||
| `prefix_search` | - | A function to find segment boundaries from the start of a string. Returns the length of the segment, or `None`. |
|
| `prefix_search` | A function to find segment boundaries from the start of a string. Returns the length of the segment, or `None`. ~~Optional[Callable[[str], Optional[Match]]]~~ |
|
||||||
| `suffix_search` | - | A function to find segment boundaries from the end of a string. Returns the length of the segment, or `None`. |
|
| `suffix_search` | A function to find segment boundaries from the end of a string. Returns the length of the segment, or `None`. ~~Optional[Callable[[str], Optional[Match]]]~~ |
|
||||||
| `infix_finditer` | - | A function to find internal segment separators, e.g. hyphens. Returns a (possibly empty) list of `re.MatchObject` objects. |
|
| `infix_finditer` | A function to find internal segment separators, e.g. hyphens. Returns a (possibly empty) sequence of `re.MatchObject` objects. ~~Optional[Callable[[str], Iterator[Match]]]~~ |
|
||||||
| `token_match` | - | A function matching the signature of `re.compile(string).match to find token matches. Returns an`re.MatchObject`or`None. |
|
| `token_match` | A function matching the signature of `re.compile(string).match` to find token matches. Returns an `re.MatchObject` or `None`. ~~Optional[Callable[[str], Optional[Match]]]~~ |
|
||||||
| `rules` | dict | A dictionary of tokenizer exceptions and special cases. |
|
| `rules` | A dictionary of tokenizer exceptions and special cases. ~~Optional[Dict[str, List[Dict[int, str]]]]~~ |
|
||||||
|
|
||||||
## Serialization fields {#serialization-fields}
|
## Serialization fields {#serialization-fields}
|
||||||
|
|
||||||
|
|
|
@ -18,9 +18,10 @@ Load a model using the name of an installed
|
||||||
`Path`-like object. spaCy will try resolving the load argument in this order. If
|
`Path`-like object. spaCy will try resolving the load argument in this order. If
|
||||||
a model is loaded from a model name, spaCy will assume it's a Python package and
|
a model is loaded from a model name, spaCy will assume it's a Python package and
|
||||||
import it and call the model's own `load()` method. If a model is loaded from a
|
import it and call the model's own `load()` method. If a model is loaded from a
|
||||||
path, spaCy will assume it's a data directory, read the language and pipeline
|
path, spaCy will assume it's a data directory, load its
|
||||||
settings off the meta.json and initialize the `Language` class. The data will be
|
[`config.cfg`](/api/data-formats#config) and use the language and pipeline
|
||||||
loaded in via [`Language.from_disk`](/api/language#from_disk).
|
information to construct the `Language` class. The data will be loaded in via
|
||||||
|
[`Language.from_disk`](/api/language#from_disk).
|
||||||
|
|
||||||
> #### Example
|
> #### Example
|
||||||
>
|
>
|
||||||
|
@ -32,17 +33,18 @@ loaded in via [`Language.from_disk`](/api/language#from_disk).
|
||||||
> nlp = spacy.load("en_core_web_sm", disable=["parser", "tagger"])
|
> nlp = spacy.load("en_core_web_sm", disable=["parser", "tagger"])
|
||||||
> ```
|
> ```
|
||||||
|
|
||||||
| Name | Type | Description |
|
| Name | Description |
|
||||||
| ----------------------------------- | ---------------------------------------------------------------------- | -------------------------------------------------------------------------------------------------------------------------------- |
|
| ----------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------ |
|
||||||
| `name` | str / `Path` | Model to load, i.e. package name or path. |
|
| `name` | Model to load, i.e. package name or path. ~~Union[str, Path]~~ |
|
||||||
| _keyword-only_ | | |
|
| _keyword-only_ | |
|
||||||
| `disable` | `List[str]` | Names of pipeline components to [disable](/usage/processing-pipelines#disabling). |
|
| `disable` | Names of pipeline components to [disable](/usage/processing-pipelines#disabling). ~~List[str]~~ |
|
||||||
| `config` <Tag variant="new">3</Tag> | `Dict[str, Any]` / [`Config`](https://thinc.ai/docs/api-config#config) | Optional config overrides, either as nested dict or dict keyed by section value in dot notation, e.g. `"components.name.value"`. |
|
| `config` <Tag variant="new">3</Tag> | Optional config overrides, either as nested dict or dict keyed by section value in dot notation, e.g. `"components.name.value"`. ~~Union[Dict[str, Any], Config]~~ |
|
||||||
| **RETURNS** | `Language` | A `Language` object with the loaded model. |
|
| **RETURNS** | A `Language` object with the loaded model. ~~Language~~ |
|
||||||
|
|
||||||
Essentially, `spacy.load()` is a convenience wrapper that reads the language ID
|
Essentially, `spacy.load()` is a convenience wrapper that reads the model's
|
||||||
and pipeline components from a model's `meta.json`, initializes the `Language`
|
[`config.cfg`](/api/data-formats#config), uses the language and pipeline
|
||||||
class, loads in the model data and returns it.
|
information to construct a `Language` object, loads in the model data and
|
||||||
|
returns it.
|
||||||
 
 ```python
 ### Abstract example
@@ -65,12 +67,12 @@ Create a blank model of a given language class. This function is the twin of
 > nlp_de = spacy.blank("de") # equivalent to German()
 > ```
 
-| Name | Type | Description |
+| Name | Description |
-| ----------- | ---------- | ------------------------------------------------------------------------------------------------ |
+| ----------- | -------------------------------------------------------------------------------------------------------- |
-| `name` | str | [ISO code](https://en.wikipedia.org/wiki/List_of_ISO_639-1_codes) of the language class to load. |
+| `name` | [ISO code](https://en.wikipedia.org/wiki/List_of_ISO_639-1_codes) of the language class to load. ~~str~~ |
-| **RETURNS** | `Language` | An empty `Language` object of the appropriate subclass. |
+| **RETURNS** | An empty `Language` object of the appropriate subclass. ~~Language~~ |
 
-#### spacy.info {#spacy.info tag="function"}
+### spacy.info {#spacy.info tag="function"}
 
 The same as the [`info` command](/api/cli#info). Pretty-print information about
 your installation, models and local setup from within spaCy. To get the model
@@ -85,12 +87,12 @@ meta data as a dictionary instead, you can use the `meta` attribute on your
 > markdown = spacy.info(markdown=True, silent=True)
 > ```
 
-| Name | Type | Description |
+| Name | Description |
-| -------------- | ---- | ------------------------------------------------ |
+| -------------- | ------------------------------------------------------------------ |
-| `model` | str | A model, i.e. a package name or path (optional). |
+| `model` | A model, i.e. a package name or path (optional). ~~Optional[str]~~ |
-| _keyword-only_ | | |
+| _keyword-only_ | |
-| `markdown` | bool | Print information as Markdown. |
+| `markdown` | Print information as Markdown. ~~bool~~ |
-| `silent` | bool | Don't print anything, just return. |
+| `silent` | Don't print anything, just return. ~~bool~~ |
 
 ### spacy.explain {#spacy.explain tag="function"}
 
@@ -111,10 +113,10 @@ list of available terms, see
 > # world NN noun, singular or mass
 > ```
 
-| Name | Type | Description |
+| Name | Description |
-| ----------- | ---- | -------------------------------------------------------- |
+| ----------- | -------------------------------------------------------------------------- |
-| `term` | str | Term to explain. |
+| `term` | Term to explain. ~~str~~ |
-| **RETURNS** | str | The explanation, or `None` if not found in the glossary. |
+| **RETURNS** | The explanation, or `None` if not found in the glossary. ~~Optional[str]~~ |
 
 ### spacy.prefer_gpu {#spacy.prefer_gpu tag="function" new="2.0.14"}
 
@@ -131,9 +133,9 @@ models.
 > nlp = spacy.load("en_core_web_sm")
 > ```
 
-| Name | Type | Description |
+| Name | Description |
-| ----------- | ---- | ------------------------------ |
+| ----------- | --------------------------------------- |
-| **RETURNS** | bool | Whether the GPU was activated. |
+| **RETURNS** | Whether the GPU was activated. ~~bool~~ |
 
 ### spacy.require_gpu {#spacy.require_gpu tag="function" new="2.0.14"}
 
@@ -150,9 +152,9 @@ and _before_ loading any models.
 > nlp = spacy.load("en_core_web_sm")
 > ```
 
-| Name | Type | Description |
+| Name | Description |
-| ----------- | ---- | ----------- |
+| ----------- | --------------- |
-| **RETURNS** | bool | `True` |
+| **RETURNS** | `True` ~~bool~~ |
 
 ## displaCy {#displacy source="spacy/displacy"}
 
@@ -175,16 +177,16 @@ browser. Will run a simple web server.
 > displacy.serve([doc1, doc2], style="dep")
 > ```
 
-| Name | Type | Description | Default |
+| Name | Description |
-| --------- | ------------------- | ------------------------------------------------------------------------------------------------------------------------------------ | ----------- |
+| --------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------ |
-| `docs` | list, `Doc`, `Span` | Document(s) to visualize. |
+| `docs` | Document(s) or span(s) to visualize. ~~Union[Iterable[Union[Doc, Span]], Doc, Span]~~ |
-| `style` | str | Visualization style, `'dep'` or `'ent'`. | `'dep'` |
+| `style` | Visualization style, `"dep"` or `"ent"`. Defaults to `"dep"`. ~~str~~ |
-| `page` | bool | Render markup as full HTML page. | `True` |
+| `page` | Render markup as full HTML page. Defaults to `True`. ~~bool~~ |
-| `minify` | bool | Minify HTML markup. | `False` |
+| `minify` | Minify HTML markup. Defaults to `False`. ~~bool~~ |
-| `options` | dict | [Visualizer-specific options](#displacy_options), e.g. colors. | `{}` |
+| `options` | [Visualizer-specific options](#displacy_options), e.g. colors. ~~Dict[str, Any]~~ |
-| `manual` | bool | Don't parse `Doc` and instead, expect a dict or list of dicts. [See here](/usage/visualizers#manual-usage) for formats and examples. | `False` |
+| `manual` | Don't parse `Doc` and instead, expect a dict or list of dicts. [See here](/usage/visualizers#manual-usage) for formats and examples. Defaults to `False`. ~~bool~~ |
-| `port` | int | Port to serve visualization. | `5000` |
+| `port` | Port to serve visualization. Defaults to `5000`. ~~int~~ |
-| `host` | str | Host to serve visualization. | `'0.0.0.0'` |
+| `host` | Host to serve visualization. Defaults to `"0.0.0.0"`. ~~str~~ |
 
 ### displacy.render {#displacy.render tag="method" new="2"}
 
@@ -200,16 +202,16 @@ Render a dependency parse tree or named entity visualization.
 > html = displacy.render(doc, style="dep")
 > ```
 
-| Name | Type | Description | Default |
+| Name | Description |
-| ----------- | ------------------- | --------------------------------------------------------------------------------------------------------------------------------------------------------- | ------- |
+| ----------- | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
-| `docs` | list, `Doc`, `Span` | Document(s) to visualize. |
+| `docs` | Document(s) or span(s) to visualize. ~~Union[Iterable[Union[Doc, Span]], Doc, Span]~~ |
-| `style` | str | Visualization style, `'dep'` or `'ent'`. | `'dep'` |
+| `style` | Visualization style, `"dep"` or `"ent"`. Defaults to `"dep"`. ~~str~~ |
-| `page` | bool | Render markup as full HTML page. | `False` |
+| `page` | Render markup as full HTML page. Defaults to `False`. ~~bool~~ |
-| `minify` | bool | Minify HTML markup. | `False` |
+| `minify` | Minify HTML markup. Defaults to `False`. ~~bool~~ |
-| `jupyter` | bool | Explicitly enable or disable "[Jupyter](http://jupyter.org/) mode" to return markup ready to be rendered in a notebook. Detected automatically if `None`. | `None` |
-| `options` | dict | [Visualizer-specific options](#displacy_options), e.g. colors. | `{}` |
-| `manual` | bool | Don't parse `Doc` and instead, expect a dict or list of dicts. [See here](/usage/visualizers#manual-usage) for formats and examples. | `False` |
+| `options` | [Visualizer-specific options](#displacy_options), e.g. colors. ~~Dict[str, Any]~~ |
+| `manual` | Don't parse `Doc` and instead, expect a dict or list of dicts. [See here](/usage/visualizers#manual-usage) for formats and examples. Defaults to `False`. ~~bool~~ |
+| `jupyter` | Explicitly enable or disable "[Jupyter](http://jupyter.org/) mode" to return markup ready to be rendered in a notebook. Detected automatically if `None` (default). ~~Optional[bool]~~ |
-| **RETURNS** | str | Rendered HTML markup. |
+| **RETURNS** | The rendered HTML markup. ~~str~~ |
 
 ### Visualizer options {#displacy_options}
 
@@ -225,22 +227,22 @@ If a setting is not present in the options, the default value will be used.
 > displacy.serve(doc, style="dep", options=options)
 > ```
 
-| Name | Type | Description | Default |
+| Name | Description |
-| ------------------------------------------ | ---- | --------------------------------------------------------------------------------------------------------------- | ----------------------- |
+| ------------------------------------------ | -------------------------------------------------------------------------------------------------------------------------------------------- |
-| `fine_grained` | bool | Use fine-grained part-of-speech tags (`Token.tag_`) instead of coarse-grained tags (`Token.pos_`). | `False` |
+| `fine_grained` | Use fine-grained part-of-speech tags (`Token.tag_`) instead of coarse-grained tags (`Token.pos_`). Defaults to `False`. ~~bool~~ |
-| `add_lemma` <Tag variant="new">2.2.4</Tag> | bool | Print the lemma's in a separate row below the token texts. | `False` |
+| `add_lemma` <Tag variant="new">2.2.4</Tag> | Print the lemmas in a separate row below the token texts. Defaults to `False`. ~~bool~~ |
-| `collapse_punct` | bool | Attach punctuation to tokens. Can make the parse more readable, as it prevents long arcs to attach punctuation. | `True` |
+| `collapse_punct` | Attach punctuation to tokens. Can make the parse more readable, as it prevents long arcs to attach punctuation. Defaults to `True`. ~~bool~~ |
-| `collapse_phrases` | bool | Merge noun phrases into one token. | `False` |
+| `collapse_phrases` | Merge noun phrases into one token. Defaults to `False`. ~~bool~~ |
-| `compact` | bool | "Compact mode" with square arrows that takes up less space. | `False` |
+| `compact` | "Compact mode" with square arrows that takes up less space. Defaults to `False`. ~~bool~~ |
-| `color` | str | Text color (HEX, RGB or color names). | `'#000000'` |
+| `color` | Text color (HEX, RGB or color names). Defaults to `"#000000"`. ~~str~~ |
-| `bg` | str | Background color (HEX, RGB or color names). | `'#ffffff'` |
+| `bg` | Background color (HEX, RGB or color names). Defaults to `"#ffffff"`. ~~str~~ |
-| `font` | str | Font name or font family for all text. | `'Arial'` |
+| `font` | Font name or font family for all text. Defaults to `"Arial"`. ~~str~~ |
-| `offset_x` | int | Spacing on left side of the SVG in px. | `50` |
+| `offset_x` | Spacing on left side of the SVG in px. Defaults to `50`. ~~int~~ |
-| `arrow_stroke` | int | Width of arrow path in px. | `2` |
+| `arrow_stroke` | Width of arrow path in px. Defaults to `2`. ~~int~~ |
-| `arrow_width` | int | Width of arrow head in px. | `10` / `8` (compact) |
+| `arrow_width` | Width of arrow head in px. Defaults to `10` in regular mode and `8` in compact mode. ~~int~~ |
-| `arrow_spacing` | int | Spacing between arrows in px to avoid overlaps. | `20` / `12` (compact) |
+| `arrow_spacing` | Spacing between arrows in px to avoid overlaps. Defaults to `20` in regular mode and `12` in compact mode. ~~int~~ |
-| `word_spacing` | int | Vertical spacing between words and arcs in px. | `45` |
+| `word_spacing` | Vertical spacing between words and arcs in px. Defaults to `45`. ~~int~~ |
-| `distance` | int | Distance between words in px. | `175` / `150` (compact) |
+| `distance` | Distance between words in px. Defaults to `175` in regular mode and `150` in compact mode. ~~int~~ |
 
 #### Named Entity Visualizer options {#displacy_options-ent}
 
@@ -252,11 +254,11 @@ If a setting is not present in the options, the default value will be used.
 > displacy.serve(doc, style="ent", options=options)
 > ```
 
-| Name | Type | Description | Default |
+| Name | Description |
-| --------------------------------------- | ---- | ------------------------------------------------------------------------------------------------------------------------------------------ | ------------------------------------------------------------------------------------------------ |
+| --------------------------------------- | --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
-| `ents` | list | Entity types to highlight (`None` for all types). | `None` |
+| `ents` | Entity types to highlight or `None` for all types (default). ~~Optional[List[str]]~~ |
-| `colors` | dict | Color overrides. Entity types in uppercase should be mapped to color names or values. | `{}` |
+| `colors` | Color overrides. Entity types should be mapped to color names or values. ~~Dict[str, str]~~ |
-| `template` <Tag variant="new">2.2</Tag> | str | Optional template to overwrite the HTML used to render entity spans. Should be a format string and can use `{bg}`, `{text}` and `{label}`. | see [`templates.py`](https://github.com/explosion/spaCy/blob/master/spacy/displacy/templates.py) |
+| `template` <Tag variant="new">2.2</Tag> | Optional template to overwrite the HTML used to render entity spans. Should be a format string and can use `{bg}`, `{text}` and `{label}`. See [`templates.py`](https://github.com/explosion/spaCy/blob/master/spacy/displacy/templates.py) for examples. ~~Optional[str]~~ |
 
 By default, displaCy comes with colors for all entity types used by
 [spaCy models](/models). If you're using custom entity types, you can use the
|
@ -280,43 +282,44 @@ concept of function registries. spaCy also uses the function registry for
|
||||||
language subclasses, model architecture, lookups and pipeline component
|
language subclasses, model architecture, lookups and pipeline component
|
||||||
factories.
|
factories.
|
||||||
|
|
||||||
<!-- TODO: improve example? -->
|
|
||||||
|
|
||||||
> #### Example
|
> #### Example
|
||||||
>
|
>
|
||||||
> ```python
|
> ```python
|
||||||
|
> from typing import Iterator
|
||||||
> import spacy
|
> import spacy
|
||||||
> from thinc.api import Model
|
|
||||||
>
|
>
|
||||||
> @spacy.registry.architectures("CustomNER.v1")
|
> @spacy.registry.schedules("waltzing.v1")
|
||||||
> def custom_ner(n0: int) -> Model:
|
> def waltzing() -> Iterator[float]:
|
||||||
> return Model("custom", forward, dims={"nO": nO})
|
> i = 0
|
||||||
|
> while True:
|
||||||
|
> yield i % 3 + 1
|
||||||
|
> i += 1
|
||||||
> ```
|
> ```
|
||||||
|
|
||||||
| Registry name | Description |
|
| Registry name | Description |
|
||||||
| ----------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
|
| ----------------- | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
|
||||||
| `architectures` | Registry for functions that create [model architectures](/api/architectures). Can be used to register custom model architectures and reference them in the `config.cfg`. |
|
| `architectures` | Registry for functions that create [model architectures](/api/architectures). Can be used to register custom model architectures and reference them in the `config.cfg`. |
|
||||||
| `factories` | Registry for functions that create [pipeline components](/usage/processing-pipelines#custom-components). Added automatically when you use the `@spacy.component` decorator and also reads from [entry points](/usage/saving-loading#entry-points) |
|
|
||||||
| `tokenizers` | Registry for tokenizer factories. Registered functions should return a callback that receives the `nlp` object and returns a [`Tokenizer`](/api/tokenizer) or a custom callable. |
|
|
||||||
| `languages` | Registry for language-specific `Language` subclasses. Automatically reads from [entry points](/usage/saving-loading#entry-points). |
|
|
||||||
| `lookups` | Registry for large lookup tables available via `vocab.lookups`. |
|
|
||||||
| `displacy_colors` | Registry for custom color scheme for the [`displacy` NER visualizer](/usage/visualizers). Automatically reads from [entry points](/usage/saving-loading#entry-points). |
|
|
||||||
| `assets` | Registry for data assets, knowledge bases etc. |
|
| `assets` | Registry for data assets, knowledge bases etc. |
|
||||||
| `callbacks` | Registry for custom callbacks to [modify the `nlp` object](/usage/training#custom-code-nlp-callbacks) before training. |
|
|
||||||
| `readers` | Registry for training and evaluation data readers like [`Corpus`](/api/corpus). |
|
|
||||||
| `batchers` | Registry for training and evaluation [data batchers](#batchers). |
|
| `batchers` | Registry for training and evaluation [data batchers](#batchers). |
|
||||||
| `optimizers` | Registry for functions that create [optimizers](https://thinc.ai/docs/api-optimizers). |
|
| `callbacks` | Registry for custom callbacks to [modify the `nlp` object](/usage/training#custom-code-nlp-callbacks) before training. |
|
||||||
| `schedules` | Registry for functions that create [schedules](https://thinc.ai/docs/api-schedules). |
|
| `displacy_colors` | Registry for custom color scheme for the [`displacy` NER visualizer](/usage/visualizers). Automatically reads from [entry points](/usage/saving-loading#entry-points). |
|
||||||
| `layers` | Registry for functions that create [layers](https://thinc.ai/docs/api-layers). |
|
| `factories` | Registry for functions that create [pipeline components](/usage/processing-pipelines#custom-components). Added automatically when you use the `@spacy.component` decorator and also reads from [entry points](/usage/saving-loading#entry-points). |
|
||||||
| `losses` | Registry for functions that create [losses](https://thinc.ai/docs/api-loss). |
|
|
||||||
| `initializers` | Registry for functions that create [initializers](https://thinc.ai/docs/api-initializers). |
|
| `initializers` | Registry for functions that create [initializers](https://thinc.ai/docs/api-initializers). |
|
||||||
|
| `languages` | Registry for language-specific `Language` subclasses. Automatically reads from [entry points](/usage/saving-loading#entry-points). |
|
||||||
|
| `layers` | Registry for functions that create [layers](https://thinc.ai/docs/api-layers). |
|
||||||
|
| `lookups` | Registry for large lookup tables available via `vocab.lookups`. |
|
||||||
|
| `losses` | Registry for functions that create [losses](https://thinc.ai/docs/api-loss). |
|
||||||
|
| `optimizers` | Registry for functions that create [optimizers](https://thinc.ai/docs/api-optimizers). |
|
||||||
|
| `readers` | Registry for training and evaluation data readers like [`Corpus`](/api/corpus). |
|
||||||
|
| `schedules` | Registry for functions that create [schedules](https://thinc.ai/docs/api-schedules). |
|
||||||
|
| `tokenizers` | Registry for tokenizer factories. Registered functions should return a callback that receives the `nlp` object and returns a [`Tokenizer`](/api/tokenizer) or a custom callable. |
|
||||||
|
|
||||||
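The `waltzing.v1` schedule registered in the example above is just a plain generator; stripped of the `@spacy.registry.schedules` decorator it can be run anywhere, without spaCy installed:

```python
from itertools import islice


def waltzing():
    # Yields 1, 2, 3, 1, 2, 3, ... forever, as in the
    # @spacy.registry.schedules("waltzing.v1") example above.
    i = 0
    while True:
        yield i % 3 + 1
        i += 1


print(list(islice(waltzing(), 6)))  # → [1, 2, 3, 1, 2, 3]
```

Registered schedule functions like this one can then be referenced by name (e.g. `waltzing.v1`) from the `config.cfg`, which is the point of the registry: config files stay declarative while the behavior lives in code.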
 ### spacy-transformers registry {#registry-transformers}
 
 The following registries are added by the
 [`spacy-transformers`](https://github.com/explosion/spacy-transformers) package.
 See the [`Transformer`](/api/transformer) API reference and
-[usage docs](/usage/transformers) for details.
+[usage docs](/usage/embeddings-transformers) for details.
 
 > #### Example
 >
@@ -338,7 +341,17 @@ See the [`Transformer`](/api/transformer) API reference and
 
 ## Batchers {#batchers source="spacy/gold/batchers.py" new="3"}
 
-<!-- TODO: intro -->
+A data batcher implements a batching strategy that essentially turns a stream of
+items into a stream of batches, with each batch consisting of one item or a list
+of items. During training, the models update their weights after processing one
+batch at a time. Typical batching strategies include presenting the training
+data as a stream of batches with similar sizes, or with increasing batch sizes.
+See the Thinc documentation on
+[`schedules`](https://thinc.ai/docs/api-schedules) for a few standard examples.
+
+Instead of using one of the built-in batchers listed here, you can also
+[implement your own](/usage/training#custom-code-readers-batchers), which may or
+may not use a custom schedule.
 
 #### batch_by_words.v1 {#batch_by_words tag="registered function"}
 
@@ -359,13 +372,13 @@ themselves, or be discarded if `discard_oversize` is set to `True`. The argument
 > get_length = null
 > ```
 
-| Name | Type | Description |
+| Name | Description |
-| ------------------ | ---------------------- | --------------------------------------------------------------------------------------------------------------------------------------------------------- |
+| ------------------ | --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
-| `seqs` | `Iterable[Any]` | The sequences to minibatch. |
+| `seqs` | The sequences to minibatch. ~~Iterable[Any]~~ |
-| `size` | `Iterable[int]` / int | The target number of words per batch. Can also be a block referencing a schedule, e.g. [`compounding`](https://thinc.ai/docs/api-schedules/#compounding). |
+| `size` | The target number of words per batch. Can also be a block referencing a schedule, e.g. [`compounding`](https://thinc.ai/docs/api-schedules/#compounding). ~~Union[int, Sequence[int]]~~ |
-| `tolerance` | float | What percentage of the size to allow batches to exceed. |
+| `tolerance` | What percentage of the size to allow batches to exceed. ~~float~~ |
-| `discard_oversize` | bool | Whether to discard sequences that by themselves exceed the tolerated size. |
+| `discard_oversize` | Whether to discard sequences that by themselves exceed the tolerated size. ~~bool~~ |
-| `get_length` | `Callable[[Any], int]` | Optional function that receives a sequence item and returns its length. Defaults to the built-in `len()` if not set. |
+| `get_length` | Optional function that receives a sequence item and returns its length. Defaults to the built-in `len()` if not set. ~~Optional[Callable[[Any], int]]~~ |
 
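The word-count batching that `batch_by_words.v1` describes can be sketched in plain Python. This is an illustrative simplification under stated assumptions (a fixed integer `size`, no schedule support), not spaCy's actual implementation:

```python
def minibatch_by_words(seqs, size, tolerance=0.2,
                       discard_oversize=False, get_length=len):
    """Group seqs into batches whose total length stays near `size`.

    Simplified sketch of the batch_by_words.v1 idea: `tolerance` lets a
    batch exceed `size` by that fraction; sequences longer than the
    tolerated size form their own batch or are discarded.
    """
    max_size = size + size * tolerance
    batch, batch_words = [], 0
    for seq in seqs:
        n = get_length(seq)
        if n > max_size:
            # Oversized sequence: flush the pending batch first, then
            # emit it alone or drop it, depending on discard_oversize.
            if batch:
                yield batch
                batch, batch_words = [], 0
            if not discard_oversize:
                yield [seq]
            continue
        if batch_words + n > max_size and batch:
            yield batch
            batch, batch_words = [], 0
        batch.append(seq)
        batch_words += n
    if batch:
        yield batch


docs = [["a"] * 3, ["b"] * 4, ["c"] * 2, ["d"] * 10]
print([len(b) for b in minibatch_by_words(docs, size=6, tolerance=0.0)])
# → [1, 2, 1]: [3-word doc], [4-word + 2-word docs], [oversized 10-word doc]
```

Swapping in a different `get_length` (e.g. counting subword tokens instead of words) changes the budget without changing the batching logic, which is why the real function exposes it as a parameter.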
 #### batch_by_sequence.v1 {#batch_by_sequence tag="registered function"}
 
@@ -380,10 +393,10 @@ themselves, or be discarded if `discard_oversize` is set to `True`. The argument
 
 Create a batcher that creates batches of the specified size.
 
-| Name | Type | Description |
+| Name | Description |
-| ------------ | ---------------------- | --------------------------------------------------------------------------------------------------------------------------------------------------------- |
+| ------------ | --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
-| `size` | `Iterable[int]` / int | The target number of items per batch. Can also be a block referencing a schedule, e.g. [`compounding`](https://thinc.ai/docs/api-schedules/#compounding). |
+| `size` | The target number of items per batch. Can also be a block referencing a schedule, e.g. [`compounding`](https://thinc.ai/docs/api-schedules/#compounding). ~~Union[int, Sequence[int]]~~ |
-| `get_length` | `Callable[[Any], int]` | Optional function that receives a sequence item and returns its length. Defaults to the built-in `len()` if not set. |
+| `get_length` | Optional function that receives a sequence item and returns its length. Defaults to the built-in `len()` if not set. ~~Optional[Callable[[Any], int]]~~ |
 
 #### batch_by_padded.v1 {#batch_by_padded tag="registered function"}
 
@@ -403,12 +416,12 @@ sequences binned by length within a window. The padded size is defined as the
 maximum length of sequences within the batch multiplied by the number of
 sequences in the batch.
 
-| Name | Type | Description |
+| Name | Description |
-| ------------------ | ---------------------- | ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
+| ------------------ | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
-| `size` | `Iterable[int]` / int | The largest padded size to batch sequences into. Can also be a block referencing a schedule, e.g. [`compounding`](https://thinc.ai/docs/api-schedules/#compounding). |
+| `size` | The largest padded size to batch sequences into. Can also be a block referencing a schedule, e.g. [`compounding`](https://thinc.ai/docs/api-schedules/#compounding). ~~Union[int, Sequence[int]]~~ |
-| `buffer` | int | The number of sequences to accumulate before sorting by length. A larger buffer will result in more even sizing, but if the buffer is very large, the iteration order will be less random, which can result in suboptimal training. |
+| `buffer` | The number of sequences to accumulate before sorting by length. A larger buffer will result in more even sizing, but if the buffer is very large, the iteration order will be less random, which can result in suboptimal training. ~~int~~ |
-| `discard_oversize` | bool | Whether to discard sequences that are by themselves longer than the largest padded batch size. |
+| `discard_oversize` | Whether to discard sequences that are by themselves longer than the largest padded batch size. ~~bool~~ |
-| `get_length` | `Callable[[Any], int]` | Optional function that receives a sequence item and returns its length. Defaults to the built-in `len()` if not set. |
+| `get_length` | Optional function that receives a sequence item and returns its length. Defaults to the built-in `len()` if not set. ~~Optional[Callable[[Any], int]]~~ |
 
 ## Training data and alignment {#gold source="spacy/gold"}
 
@@ -436,11 +449,11 @@ single-token entity.
 > assert tags == ["O", "O", "U-LOC", "O"]
 > ```
 
-| Name | Type | Description |
+| Name | Description |
-| ----------- | -------- | ----------------------------------------------------------------------------------------------------------------------------------------------- |
+| ----------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ |
-| `doc` | `Doc` | The document that the entity offsets refer to. The output tags will refer to the token boundaries within the document. |
+| `doc` | The document that the entity offsets refer to. The output tags will refer to the token boundaries within the document. ~~Doc~~ |
-| `entities` | iterable | A sequence of `(start, end, label)` triples. `start` and `end` should be character-offset integers denoting the slice into the original string. |
+| `entities` | A sequence of `(start, end, label)` triples. `start` and `end` should be character-offset integers denoting the slice into the original string. ~~List[Tuple[int, int, Union[str, int]]]~~ |
-| **RETURNS** | list | str strings, describing the [BILUO](/usage/linguistic-features#accessing-ner) tags. |
+| **RETURNS** | A list of strings, describing the [BILUO](/usage/linguistic-features#accessing-ner) tags. ~~List[str]~~ |
 
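The BILUO encoding performed by `biluo_tags_from_offsets` can be sketched over plain `(start, end)` token spans. This is an illustrative simplification: the real function works on a `Doc` and also handles misaligned entity boundaries, which this sketch ignores:

```python
def biluo_from_offsets(token_spans, entities):
    """token_spans: (start, end) character offsets, one pair per token.
    entities: (start, end, label) triples. Returns one BILUO tag per token."""
    tags = ["O"] * len(token_spans)
    for ent_start, ent_end, label in entities:
        # Tokens fully contained in the entity's character span
        inside = [i for i, (s, e) in enumerate(token_spans)
                  if s >= ent_start and e <= ent_end]
        if len(inside) == 1:
            tags[inside[0]] = f"U-{label}"    # Unit: single-token entity
        elif inside:
            tags[inside[0]] = f"B-{label}"    # Beginning
            for i in inside[1:-1]:
                tags[i] = f"I-{label}"        # In
            tags[inside[-1]] = f"L-{label}"   # Last
    return tags


# Token offsets for "I like London." -> I, like, London, .
tokens = [(0, 1), (2, 6), (7, 13), (13, 14)]
print(biluo_from_offsets(tokens, [(7, 13, "LOC")]))
# → ["O", "O", "U-LOC", "O"], matching the example above
```

This makes the U-vs-B/I/L distinction concrete: an entity covering exactly one token gets `U-`, while multi-token entities get `B-`, zero or more `I-`, then `L-`.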
### gold.offsets_from_biluo_tags {#offsets_from_biluo_tags tag="function"}
|
### gold.offsets_from_biluo_tags {#offsets_from_biluo_tags tag="function"}
|
||||||
|
|
||||||
|
@ -458,11 +471,11 @@ Encode per-token tags following the
|
||||||
> assert entities == [(7, 13, "LOC")]
|
> assert entities == [(7, 13, "LOC")]
|
||||||
> ```
|
> ```
|
||||||
|
|
||||||
| Name | Type | Description |
|
| Name | Description |
|
||||||
| ----------- | -------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
|
| ----------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ |
|
||||||
| `doc` | `Doc` | The document that the BILUO tags refer to. |
|
| `doc` | The document that the BILUO tags refer to. ~~Doc~~ |
|
||||||
| `entities` | iterable | A sequence of [BILUO](/usage/linguistic-features#accessing-ner) tags with each tag describing one token. Each tag string will be of the form of either `""`, `"O"` or `"{action}-{label}"`, where action is one of `"B"`, `"I"`, `"L"`, `"U"`. |
|
| `entities` | A sequence of [BILUO](/usage/linguistic-features#accessing-ner) tags with each tag describing one token. Each tag string will be of the form of either `""`, `"O"` or `"{action}-{label}"`, where action is one of `"B"`, `"I"`, `"L"`, `"U"`. ~~List[str]~~ |
|
||||||
| **RETURNS** | list | A sequence of `(start, end, label)` triples. `start` and `end` will be character-offset integers denoting the slice into the original string. |
|
| **RETURNS** | A sequence of `(start, end, label)` triples. `start` and `end` will be character-offset integers denoting the slice into the original string. ~~List[Tuple[int, int, str]]~~ |
|
||||||
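The decoding that this function performs can be sketched in plain Python. The helper and hard-coded token offsets below are hypothetical — spaCy derives the character offsets from the `Doc` — but they show how `B`/`I`/`L`/`U` actions map to `(start, end, label)` triples.

```python
# Minimal sketch of decoding BILUO tags into character offsets, using
# (start_char, end_char) pairs in place of Doc tokens.
def biluo_to_offsets(token_offsets, tags):
    entities = []
    start = None
    for (tok_start, tok_end), tag in zip(token_offsets, tags):
        if tag.startswith("U-"):        # single-token entity
            entities.append((tok_start, tok_end, tag[2:]))
        elif tag.startswith("B-"):      # entity begins
            start = tok_start
        elif tag.startswith("L-"):      # entity ends
            entities.append((start, tok_end, tag[2:]))
            start = None
    return entities

# "I like London": tokens at (0, 1), (2, 6), (7, 13)
entities = biluo_to_offsets([(0, 1), (2, 6), (7, 13)], ["O", "O", "U-LOC"])
# entities == [(7, 13, "LOC")]
```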
|
|
||||||
### gold.spans_from_biluo_tags {#spans_from_biluo_tags tag="function" new="2.1"}
|
### gold.spans_from_biluo_tags {#spans_from_biluo_tags tag="function" new="2.1"}
|
||||||
|
|
||||||
|
@ -481,11 +494,11 @@ token-based tags, e.g. to overwrite the `doc.ents`.
|
||||||
> doc.ents = spans_from_biluo_tags(doc, tags)
|
> doc.ents = spans_from_biluo_tags(doc, tags)
|
||||||
> ```
|
> ```
|
||||||
|
|
||||||
| Name | Type | Description |
|
| Name | Description |
|
||||||
| ----------- | -------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
|
| ----------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ |
|
||||||
| `doc` | `Doc` | The document that the BILUO tags refer to. |
|
| `doc` | The document that the BILUO tags refer to. ~~Doc~~ |
|
||||||
| `entities` | iterable | A sequence of [BILUO](/usage/linguistic-features#accessing-ner) tags with each tag describing one token. Each tag string will be of the form of either `""`, `"O"` or `"{action}-{label}"`, where action is one of `"B"`, `"I"`, `"L"`, `"U"`. |
|
| `entities` | A sequence of [BILUO](/usage/linguistic-features#accessing-ner) tags with each tag describing one token. Each tag string will be of the form of either `""`, `"O"` or `"{action}-{label}"`, where action is one of `"B"`, `"I"`, `"L"`, `"U"`. ~~List[str]~~ |
|
||||||
| **RETURNS** | list | A sequence of `Span` objects with added entity labels. |
|
| **RETURNS** | A sequence of `Span` objects with added entity labels. ~~List[Span]~~ |
|
||||||
|
|
||||||
## Utility functions {#util source="spacy/util.py"}
|
## Utility functions {#util source="spacy/util.py"}
|
||||||
|
|
||||||
|
@ -497,14 +510,13 @@ page should be safe to use and we'll try to ensure backwards compatibility.
|
||||||
However, we recommend having additional tests in place if your application
|
However, we recommend having additional tests in place if your application
|
||||||
depends on any of spaCy's utilities.
|
depends on any of spaCy's utilities.
|
||||||
|
|
||||||
<!-- TODO: document new config-related util functions? -->
|
|
||||||
|
|
||||||
### util.get_lang_class {#util.get_lang_class tag="function"}
|
### util.get_lang_class {#util.get_lang_class tag="function"}
|
||||||
|
|
||||||
Import and load a `Language` class. Allows lazy-loading
|
Import and load a `Language` class. Allows lazy-loading
|
||||||
[language data](/usage/adding-languages) and importing languages using the
|
[language data](/usage/adding-languages) and importing languages using the
|
||||||
two-letter language code. To add a language code for a custom language class,
|
two-letter language code. To add a language code for a custom language class,
|
||||||
you can use the [`set_lang_class`](/api/top-level#util.set_lang_class) helper.
|
you can register it using the [`@registry.languages`](/api/top-level#registry)
|
||||||
|
decorator.
|
||||||
|
|
||||||
> #### Example
|
> #### Example
|
||||||
>
|
>
|
||||||
|
@ -514,36 +526,14 @@ you can use the [`set_lang_class`](/api/top-level#util.set_lang_class) helper.
|
||||||
> lang = lang_class()
|
> lang = lang_class()
|
||||||
> ```
|
> ```
|
||||||
|
|
||||||
| Name | Type | Description |
|
| Name | Description |
|
||||||
| ----------- | ---------- | -------------------------------------- |
|
| ----------- | ---------------------------------------------- |
|
||||||
| `lang` | str | Two-letter language code, e.g. `'en'`. |
|
| `lang` | Two-letter language code, e.g. `"en"`. ~~str~~ |
|
||||||
| **RETURNS** | `Language` | Language class. |
|
| **RETURNS** | The respective subclass. ~~Language~~ |
|
||||||
|
|
||||||
### util.set_lang_class {#util.set_lang_class tag="function"}
|
|
||||||
|
|
||||||
Set a custom `Language` class name that can be loaded via
|
|
||||||
[`get_lang_class`](/api/top-level#util.get_lang_class). If your model uses a
|
|
||||||
custom language, this is required so that spaCy can load the correct class from
|
|
||||||
the two-letter language code.
|
|
||||||
|
|
||||||
> #### Example
|
|
||||||
>
|
|
||||||
> ```python
|
|
||||||
> from spacy.lang.xy import CustomLanguage
|
|
||||||
>
|
|
||||||
> util.set_lang_class('xy', CustomLanguage)
|
|
||||||
> lang_class = util.get_lang_class('xy')
|
|
||||||
> nlp = lang_class()
|
|
||||||
> ```
|
|
||||||
|
|
||||||
| Name | Type | Description |
|
|
||||||
| ------ | ---------- | -------------------------------------- |
|
|
||||||
| `name` | str | Two-letter language code, e.g. `'en'`. |
|
|
||||||
| `cls` | `Language` | The language class, e.g. `English`. |
|
|
||||||
|
|
||||||
### util.lang_class_is_loaded {#util.lang_class_is_loaded tag="function" new="2.1"}
|
### util.lang_class_is_loaded {#util.lang_class_is_loaded tag="function" new="2.1"}
|
||||||
|
|
||||||
Check whether a `Language` class is already loaded. `Language` classes are
|
Check whether a `Language` subclass is already loaded. `Language` subclasses are
|
||||||
loaded lazily, to avoid expensive setup code associated with the language data.
|
loaded lazily, to avoid expensive setup code associated with the language data.
|
||||||
|
|
||||||
> #### Example
|
> #### Example
|
||||||
|
@ -554,19 +544,19 @@ loaded lazily, to avoid expensive setup code associated with the language data.
|
||||||
> assert util.lang_class_is_loaded("de") is False
|
> assert util.lang_class_is_loaded("de") is False
|
||||||
> ```
|
> ```
|
||||||
|
|
||||||
| Name | Type | Description |
|
| Name | Description |
|
||||||
| ----------- | ---- | -------------------------------------- |
|
| ----------- | ---------------------------------------------- |
|
||||||
| `name` | str | Two-letter language code, e.g. `'en'`. |
|
| `name` | Two-letter language code, e.g. `"en"`. ~~str~~ |
|
||||||
| **RETURNS** | bool | Whether the class has been loaded. |
|
| **RETURNS** | Whether the class has been loaded. ~~bool~~ |
|
||||||
|
|
||||||
### util.load_model {#util.load_model tag="function" new="2"}
|
### util.load_model {#util.load_model tag="function" new="2"}
|
||||||
|
|
||||||
Load a model from a package or data path. If called with a package name, spaCy
|
Load a model from a package or data path. If called with a package name, spaCy
|
||||||
will assume the model is a Python package and import and call its `load()`
|
will assume the model is a Python package and import and call its `load()`
|
||||||
method. If called with a path, spaCy will assume it's a data directory, read the
|
method. If called with a path, spaCy will assume it's a data directory, read the
|
||||||
language and pipeline settings from the meta.json and initialize a `Language`
|
language and pipeline settings from the [`config.cfg`](/api/data-formats#config)
|
||||||
class. The model data will then be loaded in via
|
and create a `Language` object. The model data will then be loaded in via
|
||||||
[`Language.from_disk()`](/api/language#from_disk).
|
[`Language.from_disk`](/api/language#from_disk).
|
||||||
|
|
||||||
> #### Example
|
> #### Example
|
||||||
>
|
>
|
||||||
|
@ -576,31 +566,13 @@ class. The model data will then be loaded in via
|
||||||
> nlp = util.load_model("/path/to/data")
|
> nlp = util.load_model("/path/to/data")
|
||||||
> ```
|
> ```
|
||||||
|
|
||||||
| Name | Type | Description |
|
| Name | Description |
|
||||||
| ------------- | ---------- | -------------------------------------------------------- |
|
| ----------------------------------- | ---------------------------------------------------------------------------------------------------------------------------------------------- |
|
||||||
| `name` | str | Package name or model path. |
|
| `name` | Package name or model path. ~~str~~ |
|
||||||
| `**overrides` | - | Specific overrides, like pipeline components to disable. |
|
| `vocab` <Tag variant="new">3</Tag>  | Optional shared vocab to pass in on initialization. If `True` (default), a new `Vocab` object will be created. ~~Union[Vocab, bool]~~ |
|
||||||
| **RETURNS** | `Language` | `Language` class with the loaded model. |
|
| `disable` | Names of pipeline components to disable. ~~Iterable[str]~~ |
|
||||||
|
| `config` <Tag variant="new">3</Tag> | Config overrides as nested dict or flat dict keyed by section values in dot notation, e.g. `"nlp.pipeline"`. ~~Union[Dict[str, Any], Config]~~ |
|
||||||
### util.load_model_from_path {#util.load_model_from_path tag="function" new="2"}
|
| **RETURNS** | `Language` class with the loaded model. ~~Language~~ |
|
||||||
|
|
||||||
Load a model from a data directory path. Creates the [`Language`](/api/language)
|
|
||||||
class and pipeline based on the directory's meta.json and then calls
|
|
||||||
[`from_disk()`](/api/language#from_disk) with the path. This function also makes
|
|
||||||
it easy to test a new model that you haven't packaged yet.
|
|
||||||
|
|
||||||
> #### Example
|
|
||||||
>
|
|
||||||
> ```python
|
|
||||||
> nlp = load_model_from_path("/path/to/data")
|
|
||||||
> ```
|
|
||||||
|
|
||||||
| Name | Type | Description |
|
|
||||||
| ------------- | ---------- | ---------------------------------------------------------------------------------------------------- |
|
|
||||||
| `model_path` | str | Path to model data directory. |
|
|
||||||
| `meta` | dict | Model meta data. If `False`, spaCy will try to load the meta from a meta.json in the same directory. |
|
|
||||||
| `**overrides` | - | Specific overrides, like pipeline components to disable. |
|
|
||||||
| **RETURNS** | `Language` | `Language` class with the loaded model. |
|
|
||||||
|
|
||||||
### util.load_model_from_init_py {#util.load_model_from_init_py tag="function" new="2"}
|
### util.load_model_from_init_py {#util.load_model_from_init_py tag="function" new="2"}
|
||||||
|
|
||||||
|
@ -616,26 +588,66 @@ A helper function to use in the `load()` method of a model package's
|
||||||
> return load_model_from_init_py(__file__, **overrides)
|
> return load_model_from_init_py(__file__, **overrides)
|
||||||
> ```
|
> ```
|
||||||
|
|
||||||
| Name | Type | Description |
|
| Name | Description |
|
||||||
| ------------- | ---------- | -------------------------------------------------------- |
|
| ----------------------------------- | ---------------------------------------------------------------------------------------------------------------------------------------------- |
|
||||||
| `init_file` | str | Path to model's `__init__.py`, i.e. `__file__`. |
|
| `init_file` | Path to model's `__init__.py`, i.e. `__file__`. ~~Union[str, Path]~~ |
|
||||||
| `**overrides` | - | Specific overrides, like pipeline components to disable. |
|
| `vocab` <Tag variant="new">3</Tag>  | Optional shared vocab to pass in on initialization. If `True` (default), a new `Vocab` object will be created. ~~Union[Vocab, bool]~~ |
|
||||||
| **RETURNS** | `Language` | `Language` class with the loaded model. |
|
| `disable` | Names of pipeline components to disable. ~~Iterable[str]~~ |
|
||||||
|
| `config` <Tag variant="new">3</Tag> | Config overrides as nested dict or flat dict keyed by section values in dot notation, e.g. `"nlp.pipeline"`. ~~Union[Dict[str, Any], Config]~~ |
|
||||||
|
| **RETURNS** | `Language` class with the loaded model. ~~Language~~ |
|
||||||
|
|
||||||
### util.get_model_meta {#util.get_model_meta tag="function" new="2"}
|
### util.load_config {#util.load_config tag="function" new="3"}
|
||||||
|
|
||||||
Get a model's meta.json from a directory path and validate its contents.
|
Load a model's [`config.cfg`](/api/data-formats#config) from a file path. The
|
||||||
|
config typically includes details about the model pipeline and how its
|
||||||
|
components are created, as well as all training settings and hyperparameters.
|
||||||
|
|
||||||
> #### Example
|
> #### Example
|
||||||
>
|
>
|
||||||
> ```python
|
> ```python
|
||||||
> meta = util.get_model_meta("/path/to/model")
|
> config = util.load_config("/path/to/model/config.cfg")
|
||||||
|
> print(config.to_str())
|
||||||
> ```
|
> ```
|
||||||
|
|
||||||
| Name | Type | Description |
|
| Name | Description |
|
||||||
| ----------- | ------------ | ------------------------ |
|
| ------------- | --------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
|
||||||
| `path` | str / `Path` | Path to model directory. |
|
| `path` | Path to the model's `config.cfg`. ~~Union[str, Path]~~ |
|
||||||
| **RETURNS** | dict | The model's meta data. |
|
| `overrides`   | Optional config overrides to replace in the loaded config. Can be provided as a nested dict, or as a flat dict with keys in dot notation, e.g. `"nlp.pipeline"`. ~~Dict[str, Any]~~ |
|
||||||
|
| `interpolate` | Whether to interpolate the config and replace variables like `${paths.train}` with their values. Defaults to `False`. ~~bool~~ |
|
||||||
|
| **RETURNS** | The model's config. ~~Config~~ |
|
||||||
|
|
||||||
|
### util.load_meta {#util.load_meta tag="function" new="3"}
|
||||||
|
|
||||||
|
Get a model's [`meta.json`](/api/data-formats#meta) from a file path and
|
||||||
|
validate its contents.
|
||||||
|
|
||||||
|
> #### Example
|
||||||
|
>
|
||||||
|
> ```python
|
||||||
|
> meta = util.load_meta("/path/to/model/meta.json")
|
||||||
|
> ```
|
||||||
|
|
||||||
|
| Name | Description |
|
||||||
|
| ----------- | ----------------------------------------------------- |
|
||||||
|
| `path` | Path to the model's `meta.json`. ~~Union[str, Path]~~ |
|
||||||
|
| **RETURNS** | The model's meta data. ~~Dict[str, Any]~~ |
|
||||||
|
|
||||||
|
### util.get_installed_models {#util.get_installed_models tag="function" new="3"}
|
||||||
|
|
||||||
|
List all model packages installed in the current environment. This will include
|
||||||
|
any spaCy model that was packaged with [`spacy package`](/api/cli#package).
|
||||||
|
Under the hood, model packages expose a Python entry point that spaCy can check,
|
||||||
|
without having to load the model.
|
||||||
|
|
||||||
|
> #### Example
|
||||||
|
>
|
||||||
|
> ```python
|
||||||
|
> model_names = util.get_installed_models()
|
||||||
|
> ```
|
||||||
|
|
||||||
|
| Name | Description |
|
||||||
|
| ----------- | ---------------------------------------------------------------------------------- |
|
||||||
|
| **RETURNS** | The string names of the models installed in the current environment. ~~List[str]~~ |
|
||||||
|
|
||||||
### util.is_package {#util.is_package tag="function"}
|
### util.is_package {#util.is_package tag="function"}
|
||||||
|
|
||||||
|
@ -649,10 +661,10 @@ Check if string maps to a package installed via pip. Mainly used to validate
|
||||||
> util.is_package("xyz") # False
|
> util.is_package("xyz") # False
|
||||||
> ```
|
> ```
|
||||||
|
|
||||||
| Name | Type | Description |
|
| Name | Description |
|
||||||
| ----------- | ------ | -------------------------------------------- |
|
| ----------- | ----------------------------------------------------- |
|
||||||
| `name` | str | Name of package. |
|
| `name` | Name of package. ~~str~~ |
|
||||||
| **RETURNS** | `bool` | `True` if installed package, `False` if not. |
|
| **RETURNS** | `True` if installed package, `False` if not. ~~bool~~ |
|
||||||
|
|
||||||
### util.get_package_path {#util.get_package_path tag="function" new="2"}
|
### util.get_package_path {#util.get_package_path tag="function" new="2"}
|
||||||
|
|
||||||
|
@ -666,10 +678,10 @@ Get path to an installed package. Mainly used to resolve the location of
|
||||||
> # /usr/lib/python3.6/site-packages/en_core_web_sm
|
> # /usr/lib/python3.6/site-packages/en_core_web_sm
|
||||||
> ```
|
> ```
|
||||||
|
|
||||||
| Name | Type | Description |
|
| Name | Description |
|
||||||
| -------------- | ------ | -------------------------------- |
|
| -------------- | ----------------------------------------- |
|
||||||
| `package_name` | str | Name of installed package. |
|
| `package_name` | Name of installed package. ~~str~~ |
|
||||||
| **RETURNS** | `Path` | Path to model package directory. |
|
| **RETURNS** | Path to model package directory. ~~Path~~ |
|
||||||
|
|
||||||
### util.is_in_jupyter {#util.is_in_jupyter tag="function" new="2"}
|
### util.is_in_jupyter {#util.is_in_jupyter tag="function" new="2"}
|
||||||
|
|
||||||
|
@ -686,9 +698,9 @@ detecting the IPython kernel. Mainly used for the
|
||||||
> display(HTML(html))
|
> display(HTML(html))
|
||||||
> ```
|
> ```
|
||||||
|
|
||||||
| Name | Type | Description |
|
| Name | Description |
|
||||||
| ----------- | ---- | ------------------------------------- |
|
| ----------- | ---------------------------------------------- |
|
||||||
| **RETURNS** | bool | `True` if in Jupyter, `False` if not. |
|
| **RETURNS** | `True` if in Jupyter, `False` if not. ~~bool~~ |
|
||||||
|
|
||||||
### util.compile_prefix_regex {#util.compile_prefix_regex tag="function"}
|
### util.compile_prefix_regex {#util.compile_prefix_regex tag="function"}
|
||||||
|
|
||||||
|
@ -702,10 +714,10 @@ Compile a sequence of prefix rules into a regex object.
|
||||||
> nlp.tokenizer.prefix_search = prefix_regex.search
|
> nlp.tokenizer.prefix_search = prefix_regex.search
|
||||||
> ```
|
> ```
|
||||||
|
|
||||||
| Name | Type | Description |
|
| Name | Description |
|
||||||
| ----------- | ------------------------------------------------------------- | ----------------------------------------------------------------------------------------------------------------------------------------- |
|
| ----------- | --------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
|
||||||
| `entries` | tuple | The prefix rules, e.g. [`lang.punctuation.TOKENIZER_PREFIXES`](https://github.com/explosion/spaCy/tree/master/spacy/lang/punctuation.py). |
|
| `entries` | The prefix rules, e.g. [`lang.punctuation.TOKENIZER_PREFIXES`](https://github.com/explosion/spaCy/tree/master/spacy/lang/punctuation.py). ~~Iterable[Union[str, Pattern]]~~ |
|
||||||
| **RETURNS** | [regex](https://docs.python.org/3/library/re.html#re-objects) | The regex object. to be used for [`Tokenizer.prefix_search`](/api/tokenizer#attributes). |
|
| **RETURNS** | The regex object to be used for [`Tokenizer.prefix_search`](/api/tokenizer#attributes). ~~Pattern~~ |
|
||||||
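The compilation step the docs describe is easy to sketch in plain Python. The helper below is a simplified, standalone illustration (not spaCy's implementation): each prefix rule becomes an alternative anchored at the start of the string.

```python
import re

# Simplified sketch: join the prefix rules into one alternation, with each
# piece anchored at the start of the string.
def compile_prefix_regex(entries):
    expression = "|".join("^" + piece for piece in entries if piece.strip())
    return re.compile(expression)

prefix_regex = compile_prefix_regex([re.escape("("), re.escape('"'), "§"])
print(prefix_regex.search('"Hello"').group())  # "
```

The same joining strategy applies to suffix rules (anchored with `$` at the end) and infix rules (unanchored, used with `finditer`).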
|
|
||||||
### util.compile_suffix_regex {#util.compile_suffix_regex tag="function"}
|
### util.compile_suffix_regex {#util.compile_suffix_regex tag="function"}
|
||||||
|
|
||||||
|
@ -719,10 +731,10 @@ Compile a sequence of suffix rules into a regex object.
|
||||||
> nlp.tokenizer.suffix_search = suffix_regex.search
|
> nlp.tokenizer.suffix_search = suffix_regex.search
|
||||||
> ```
|
> ```
|
||||||
|
|
||||||
| Name | Type | Description |
|
| Name | Description |
|
||||||
| ----------- | ------------------------------------------------------------- | ----------------------------------------------------------------------------------------------------------------------------------------- |
|
| ----------- | --------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
|
||||||
| `entries` | tuple | The suffix rules, e.g. [`lang.punctuation.TOKENIZER_SUFFIXES`](https://github.com/explosion/spaCy/tree/master/spacy/lang/punctuation.py). |
|
| `entries` | The suffix rules, e.g. [`lang.punctuation.TOKENIZER_SUFFIXES`](https://github.com/explosion/spaCy/tree/master/spacy/lang/punctuation.py). ~~Iterable[Union[str, Pattern]]~~ |
|
||||||
| **RETURNS** | [regex](https://docs.python.org/3/library/re.html#re-objects) | The regex object. to be used for [`Tokenizer.suffix_search`](/api/tokenizer#attributes). |
|
| **RETURNS** | The regex object to be used for [`Tokenizer.suffix_search`](/api/tokenizer#attributes). ~~Pattern~~ |
|
||||||
|
|
||||||
### util.compile_infix_regex {#util.compile_infix_regex tag="function"}
|
### util.compile_infix_regex {#util.compile_infix_regex tag="function"}
|
||||||
|
|
||||||
|
@ -736,10 +748,10 @@ Compile a sequence of infix rules into a regex object.
|
||||||
> nlp.tokenizer.infix_finditer = infix_regex.finditer
|
> nlp.tokenizer.infix_finditer = infix_regex.finditer
|
||||||
> ```
|
> ```
|
||||||
|
|
||||||
| Name | Type | Description |
|
| Name | Description |
|
||||||
| ----------- | ------------------------------------------------------------- | --------------------------------------------------------------------------------------------------------------------------------------- |
|
| ----------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
|
||||||
| `entries` | tuple | The infix rules, e.g. [`lang.punctuation.TOKENIZER_INFIXES`](https://github.com/explosion/spaCy/tree/master/spacy/lang/punctuation.py). |
|
| `entries` | The infix rules, e.g. [`lang.punctuation.TOKENIZER_INFIXES`](https://github.com/explosion/spaCy/tree/master/spacy/lang/punctuation.py). ~~Iterable[Union[str, Pattern]]~~ |
|
||||||
| **RETURNS** | [regex](https://docs.python.org/3/library/re.html#re-objects) | The regex object. to be used for [`Tokenizer.infix_finditer`](/api/tokenizer#attributes). |
|
| **RETURNS** | The regex object to be used for [`Tokenizer.infix_finditer`](/api/tokenizer#attributes). ~~Pattern~~ |
|
||||||
|
|
||||||
### util.minibatch {#util.minibatch tag="function" new="2"}
|
### util.minibatch {#util.minibatch tag="function" new="2"}
|
||||||
|
|
||||||
|
@ -754,11 +766,11 @@ vary on each step.
|
||||||
> nlp.update(batch)
|
> nlp.update(batch)
|
||||||
> ```
|
> ```
|
||||||
|
|
||||||
| Name | Type | Description |
|
| Name | Description |
|
||||||
| ---------- | -------------- | ---------------------- |
|
| ---------- | ---------------------------------------- |
|
||||||
| `items` | iterable | The items to batch up. |
|
| `items` | The items to batch up. ~~Iterable[Any]~~ |
|
||||||
| `size` | int / iterable | The batch size(s). |
|
| `size`    | The batch size(s). ~~Union[int, Sequence[int]]~~ |
|
||||||
| **YIELDS** | list | The batches. |
|
| **YIELDS** | The batches. ~~List[Any]~~ |
|
||||||
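The signature above — a fixed `int` size or an iterable of sizes that can vary on each step — can be sketched in a few lines. This is a simplified standalone version, not spaCy's implementation; it stops when either the items or the size iterable run out.

```python
from itertools import islice, repeat

# Simplified sketch of util.minibatch: "size" may be a fixed int or an
# iterable of ints, so the batch size can vary on each step.
def minibatch(items, size):
    sizes = repeat(size) if isinstance(size, int) else iter(size)
    items = iter(items)
    while True:
        try:
            n = next(sizes)
        except StopIteration:
            return
        batch = list(islice(items, n))
        if not batch:
            return
        yield batch

batches = list(minibatch(range(7), size=3))
# batches == [[0, 1, 2], [3, 4, 5], [6]]
```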
|
|
||||||
### util.filter_spans {#util.filter_spans tag="function" new="2.1.4"}
|
### util.filter_spans {#util.filter_spans tag="function" new="2.1.4"}
|
||||||
|
|
||||||
|
@ -776,17 +788,30 @@ of one entity) or when merging spans with
|
||||||
> filtered = filter_spans(spans)
|
> filtered = filter_spans(spans)
|
||||||
> ```
|
> ```
|
||||||
|
|
||||||
| Name | Type | Description |
|
| Name | Description |
|
||||||
| ----------- | -------- | -------------------- |
|
| ----------- | --------------------------------------- |
|
||||||
| `spans` | iterable | The spans to filter. |
|
| `spans` | The spans to filter. ~~Iterable[Span]~~ |
|
||||||
| **RETURNS** | list | The filtered spans. |
|
| **RETURNS** | The filtered spans. ~~List[Span]~~ |
|
||||||
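The greedy logic behind the filter can be illustrated with plain `(start, end, label)` tuples in place of `Span` objects. This is a hypothetical sketch of the documented behavior, not spaCy's code: longer spans win, ties go to the span that starts first, and anything overlapping an already-kept span is dropped.

```python
# Simplified sketch of the greedy overlap filter on (start, end, label)
# tuples: sort by length (longest first), breaking ties by earlier start,
# then keep a span only if none of its positions were claimed already.
def filter_spans(spans):
    by_priority = sorted(spans, key=lambda s: (s[1] - s[0], -s[0]), reverse=True)
    result, seen = [], set()
    for start, end, label in by_priority:
        if not any(i in seen for i in range(start, end)):
            result.append((start, end, label))
            seen.update(range(start, end))
    return sorted(result, key=lambda s: s[0])

spans = [(0, 2, "A"), (1, 4, "B"), (5, 6, "C")]
print(filter_spans(spans))  # [(1, 4, 'B'), (5, 6, 'C')]
```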
|
|
||||||
### util.get_words_and_spaces {#get_words_and_spaces tag="function" new="3"}
|
### util.get_words_and_spaces {#get_words_and_spaces tag="function" new="3"}
|
||||||
|
|
||||||
<!-- TODO: document -->
|
Given a list of words and a text, reconstruct the original tokens and return a
|
||||||
|
list of words and spaces that can be used to create a [`Doc`](/api/doc#init).
|
||||||
|
This can help recover destructive tokenization that didn't preserve any
|
||||||
|
whitespace information.
|
||||||
|
|
||||||
| Name | Type | Description |
|
> #### Example
|
||||||
| ----------- | ----- | ----------- |
|
>
|
||||||
| `words` | list | |
|
> ```python
|
||||||
| `text` | str | |
|
> orig_words = ["Hey", ",", "what", "'s", "up", "?"]
|
||||||
| **RETURNS** | tuple | |
|
> orig_text = "Hey, what's up?"
|
||||||
|
> words, spaces = get_words_and_spaces(orig_words, orig_text)
|
||||||
|
> # ['Hey', ',', 'what', "'s", 'up', '?']
|
||||||
|
> # [False, True, False, True, False, False]
|
||||||
|
> ```
|
||||||
|
|
||||||
|
| Name | Description |
|
||||||
|
| ----------- | -------------------------------------------------------------------------------------------------------------------------------------------------- |
|
||||||
|
| `words` | The list of words. ~~Iterable[str]~~ |
|
||||||
|
| `text` | The original text. ~~str~~ |
|
||||||
|
| **RETURNS** | A list of words and a list of boolean values indicating whether the word at this position is followed by a space. ~~Tuple[List[str], List[bool]]~~ |
|
||||||
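The reconstruction can be sketched as a simple scan over the original text. This standalone sketch assumes the happy path — every word occurs in the text, in order — and omits the validation spaCy would need for mismatched input.

```python
# Simplified sketch: locate each word in the text, then check whether the
# character immediately after it is a space.
def get_words_and_spaces(words, text):
    spaces = []
    idx = 0
    for word in words:
        idx = text.index(word, idx) + len(word)
        spaces.append(idx < len(text) and text[idx] == " ")
    return list(words), spaces

words, spaces = get_words_and_spaces(
    ["Hey", ",", "what", "'s", "up", "?"], "Hey, what's up?"
)
# spaces == [False, True, False, True, False, False]
```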
|
|
|
@ -41,7 +41,8 @@ token, the spaCy token receives the sum of their values. To access the values,
|
||||||
you can use the custom [`Doc._.trf_data`](#custom-attributes) attribute. The
|
you can use the custom [`Doc._.trf_data`](#custom-attributes) attribute. The
|
||||||
package also adds the function registries [`@span_getters`](#span_getters) and
|
package also adds the function registries [`@span_getters`](#span_getters) and
|
||||||
[`@annotation_setters`](#annotation_setters) with several built-in registered
|
[`@annotation_setters`](#annotation_setters) with several built-in registered
|
||||||
functions. For more details, see the [usage documentation](/usage/transformers).
|
functions. For more details, see the
|
||||||
|
[usage documentation](/usage/embeddings-transformers).
|
||||||
|
|
||||||
## Config and implementation {#config}
|
## Config and implementation {#config}
|
||||||
|
|
||||||
|
@ -60,11 +61,11 @@ architectures and their arguments and hyperparameters.
|
||||||
> nlp.add_pipe("transformer", config=DEFAULT_CONFIG)
|
> nlp.add_pipe("transformer", config=DEFAULT_CONFIG)
|
||||||
> ```
|
> ```
|
||||||
|
|
||||||
| Setting | Type | Description | Default |
|
| Setting | Description |
|
||||||
| ------------------- | ------------------------------------------ | ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | ------------------------------------------------------- |
|
| ------------------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
|
||||||
| `max_batch_items` | int | Maximum size of a padded batch. | `4096` |
|
| `max_batch_items` | Maximum size of a padded batch. Defaults to `4096`. ~~int~~ |
|
||||||
| `annotation_setter` | Callable | Function that takes a batch of `Doc` objects and a [`FullTransformerBatch`](/api/transformer#fulltransformerbatch) and can set additional annotations on the `Doc`. The `Doc._.transformer_data` attribute is set prior to calling the callback. By default, no additional annotations are set. | `null_annotation_setter` |
|
| `annotation_setter` | Function that takes a batch of `Doc` objects and transformer outputs and can set additional annotations on the `Doc`. The `Doc._.transformer_data` attribute is set prior to calling the callback. Defaults to `null_annotation_setter` (no additional annotations). ~~Callable[[List[Doc], FullTransformerBatch], None]~~ |
|
||||||
| `model` | [`Model`](https://thinc.ai/docs/api-model) | **Input:** `List[Doc]`. **Output:** [`FullTransformerBatch`](/api/transformer#fulltransformerbatch). The Thinc [`Model`](https://thinc.ai/docs/api-model) wrapping the transformer. | [TransformerModel](/api/architectures#TransformerModel) |
|
| `model` | The Thinc [`Model`](https://thinc.ai/docs/api-model) wrapping the transformer. Defaults to [TransformerModel](/api/architectures#TransformerModel). ~~Model[List[Doc], FullTransformerBatch]~~ |
|
||||||
|
|
||||||
```python
|
```python
|
||||||
https://github.com/explosion/spacy-transformers/blob/master/spacy_transformers/pipeline_component.py
|
https://github.com/explosion/spacy-transformers/blob/master/spacy_transformers/pipeline_component.py
|
||||||
|
@ -101,14 +102,14 @@ attribute. You can also provide a callback to set additional annotations. In
|
||||||
your application, you would normally use a shortcut for this and instantiate the
|
your application, you would normally use a shortcut for this and instantiate the
|
||||||
component using its string name and [`nlp.add_pipe`](/api/language#create_pipe).
|
component using its string name and [`nlp.add_pipe`](/api/language#create_pipe).
|
||||||
|
|
||||||
| Name | Type | Description |
|
| Name | Description |
|
||||||
| ------------------- | ------------------------------------------ | ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
|
| ------------------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
|
||||||
| `vocab` | `Vocab` | The shared vocabulary. |
|
| `vocab` | The shared vocabulary. ~~Vocab~~ |
|
||||||
| `model` | [`Model`](https://thinc.ai/docs/api-model) | **Input:** `List[Doc]`. **Output:** [`FullTransformerBatch`](/api/transformer#fulltransformerbatch). The Thinc [`Model`](https://thinc.ai/docs/api-model) wrapping the transformer. Usually you will want to use the [TransformerModel](/api/architectures#TransformerModel) layer for this. |
|
| `model` | The Thinc [`Model`](https://thinc.ai/docs/api-model) wrapping the transformer. Usually you will want to use the [TransformerModel](/api/architectures#TransformerModel) layer for this. ~~Model[List[Doc], FullTransformerBatch]~~ |
|
||||||
| `annotation_setter` | `Callable` | Function that takes a batch of `Doc` objects and a [`FullTransformerBatch`](/api/transformer#fulltransformerbatch) and can set additional annotations on the `Doc`. The `Doc._.transformer_data` attribute is set prior to calling the callback. By default, no additional annotations are set. |
|
| `annotation_setter` | Function that takes a batch of `Doc` objects and transformer outputs and can set additional annotations on the `Doc`. The `Doc._.transformer_data` attribute is set prior to calling the callback. By default, no annotations are set. ~~Callable[[List[Doc], FullTransformerBatch], None]~~ |
|
||||||
| _keyword-only_ | | |
|
| _keyword-only_ | |
|
||||||
| `name` | str | String name of the component instance. Used to add entries to the `losses` during training. |
|
| `name` | String name of the component instance. Used to add entries to the `losses` during training. ~~str~~ |
|
||||||
| `max_batch_items` | int | Maximum size of a padded batch. Defaults to `128*32`. |
|
| `max_batch_items` | Maximum size of a padded batch. Defaults to `128*32`. ~~int~~ |
|
||||||
|
|
||||||
## Transformer.\_\_call\_\_ {#call tag="method"}

@@ -128,10 +129,10 @@ to the [`predict`](/api/transformer#predict) and
> processed = transformer(doc)
> ```

-| Name | Type | Description |
+| Name | Description |
-| ----------- | ----- | ------------------------ |
+| ----------- | -------------------------------- |
-| `doc` | `Doc` | The document to process. |
+| `doc` | The document to process. ~~Doc~~ |
-| **RETURNS** | `Doc` | The processed document. |
+| **RETURNS** | The processed document. ~~Doc~~ |
## Transformer.pipe {#pipe tag="method"}

@@ -150,12 +151,12 @@ applied to the `Doc` in order. Both [`__call__`](/api/transformer#call) and
>     pass
> ```

-| Name | Type | Description |
+| Name | Description |
-| -------------- | --------------- | ----------------------------------------------------- |
+| -------------- | ------------------------------------------------------------- |
-| `stream` | `Iterable[Doc]` | A stream of documents. |
+| `stream` | A stream of documents. ~~Iterable[Doc]~~ |
-| _keyword-only_ | | |
+| _keyword-only_ | |
-| `batch_size` | int | The number of documents to buffer. Defaults to `128`. |
+| `batch_size` | The number of documents to buffer. Defaults to `128`. ~~int~~ |
-| **YIELDS** | `Doc` | The processed documents in order. |
+| **YIELDS** | The processed documents in order. ~~Doc~~ |
## Transformer.begin_training {#begin_training tag="method"}

@@ -175,13 +176,13 @@ setting up the label scheme based on the data.
> optimizer = trf.begin_training(lambda: [], pipeline=nlp.pipeline)
> ```

-| Name | Type | Description |
+| Name | Description |
-| -------------- | --------------------------------------------------- | -------------------------------------------------------------------------------------------------------------- |
+| -------------- | ------------------------------------------------------------------------------------------------------------------------------------- |
-| `get_examples` | `Callable[[], Iterable[Example]]` | Optional function that returns gold-standard annotations in the form of [`Example`](/api/example) objects. |
+| `get_examples` | Function that returns gold-standard annotations in the form of [`Example`](/api/example) objects. ~~Callable[[], Iterable[Example]]~~ |
-| _keyword-only_ | | |
+| _keyword-only_ | |
-| `pipeline` | `List[Tuple[str, Callable]]` | Optional list of pipeline components that this component is part of. |
+| `pipeline` | Optional list of pipeline components that this component is part of. ~~Optional[List[Tuple[str, Callable[[Doc], Doc]]]]~~ |
-| `sgd` | [`Optimizer`](https://thinc.ai/docs/api-optimizers) | An optional optimizer. Will be created via [`create_optimizer`](/api/transformer#create_optimizer) if not set. |
+| `sgd` | An optimizer. Will be created via [`create_optimizer`](#create_optimizer) if not set. ~~Optional[Optimizer]~~ |
-| **RETURNS** | [`Optimizer`](https://thinc.ai/docs/api-optimizers) | The optimizer. |
+| **RETURNS** | The optimizer. ~~Optimizer~~ |
## Transformer.predict {#predict tag="method"}

@@ -195,10 +196,10 @@ modifying them.
> scores = trf.predict([doc1, doc2])
> ```

-| Name | Type | Description |
+| Name | Description |
-| ----------- | --------------- | ----------------------------------------- |
+| ----------- | ------------------------------------------- |
-| `docs` | `Iterable[Doc]` | The documents to predict. |
+| `docs` | The documents to predict. ~~Iterable[Doc]~~ |
-| **RETURNS** | - | The model's prediction for each document. |
+| **RETURNS** | The model's prediction for each document. |
## Transformer.set_annotations {#set_annotations tag="method"}

@@ -215,10 +216,10 @@ callback is then called, if provided.
> trf.set_annotations(docs, scores)
> ```

-| Name | Type | Description |
+| Name | Description |
-| -------- | --------------- | ----------------------------------------------------- |
+| -------- | ----------------------------------------------------- |
-| `docs` | `Iterable[Doc]` | The documents to modify. |
+| `docs` | The documents to modify. ~~Iterable[Doc]~~ |
-| `scores` | - | The scores to set, produced by `Transformer.predict`. |
+| `scores` | The scores to set, produced by `Transformer.predict`. |
## Transformer.update {#update tag="method"}

@@ -244,15 +245,15 @@ and call the optimizer, while the others simply increment the gradients.
> losses = trf.update(examples, sgd=optimizer)
> ```

-| Name | Type | Description |
+| Name | Description |
-| ----------------- | --------------------------------------------------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------- |
+| ----------------- | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
-| `examples` | `Iterable[Example]` | A batch of [`Example`](/api/example) objects. Only the [`Example.predicted`](/api/example#predicted) `Doc` object is used, the reference `Doc` is ignored. |
+| `examples` | A batch of [`Example`](/api/example) objects. Only the [`Example.predicted`](/api/example#predicted) `Doc` object is used, the reference `Doc` is ignored. ~~Iterable[Example]~~ |
-| _keyword-only_ | | |
+| _keyword-only_ | |
-| `drop` | float | The dropout rate. |
+| `drop` | The dropout rate. ~~float~~ |
-| `set_annotations` | bool | Whether or not to update the `Example` objects with the predictions, delegating to [`set_annotations`](/api/transformer#set_annotations). |
+| `set_annotations` | Whether or not to update the `Example` objects with the predictions, delegating to [`set_annotations`](#set_annotations). ~~bool~~ |
-| `sgd` | [`Optimizer`](https://thinc.ai/docs/api-optimizers) | The optimizer. |
+| `sgd` | An optimizer. Will be created via [`create_optimizer`](#create_optimizer) if not set. ~~Optional[Optimizer]~~ |
-| `losses` | `Dict[str, float]` | Optional record of the loss during training. Updated using the component name as the key. |
+| `losses` | Optional record of the loss during training. Updated using the component name as the key. ~~Optional[Dict[str, float]]~~ |
-| **RETURNS** | `Dict[str, float]` | The updated `losses` dictionary. |
+| **RETURNS** | The updated `losses` dictionary. ~~Dict[str, float]~~ |
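The gradient-sharing behavior described above (several downstream components backprop into one shared transformer; each listener increments the gradients and only the last one triggers the optimizer) can be sketched in plain Python. This is an illustrative model of the idea, not spaCy's actual implementation; all names here are made up:

```python
class SharedGrads:
    """Toy accumulator for a gradient shared between several listeners."""

    def __init__(self):
        self.grad = 0.0

    def accumulate(self, g):
        # Each listener's backward pass just increments the shared gradient.
        self.grad += g

    def step(self, weights, lr=0.1):
        # Only the last listener applies the optimizer, then resets the gradient.
        weights = weights - lr * self.grad
        self.grad = 0.0
        return weights


shared = SharedGrads()
for g in (0.5, 1.5):          # two listeners increment the gradients
    shared.accumulate(g)
w = shared.step(1.0)          # the final component calls the optimizer
# w == 1.0 - 0.1 * (0.5 + 1.5) == 0.8
```

This is why the order of components matters: the accumulated gradient is only applied once per batch, regardless of how many components share the layer.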
## Transformer.create_optimizer {#create_optimizer tag="method"}

@@ -265,9 +266,9 @@ Create an optimizer for the pipeline component.
> optimizer = trf.create_optimizer()
> ```

-| Name | Type | Description |
+| Name | Description |
-| ----------- | --------------------------------------------------- | -------------- |
+| ----------- | ---------------------------- |
-| **RETURNS** | [`Optimizer`](https://thinc.ai/docs/api-optimizers) | The optimizer. |
+| **RETURNS** | The optimizer. ~~Optimizer~~ |
## Transformer.use_params {#use_params tag="method, contextmanager"}

@@ -282,9 +283,9 @@ context, the original parameters are restored.
>     trf.to_disk("/best_model")
> ```

-| Name | Type | Description |
+| Name | Description |
-| -------- | ---- | ----------------------------------------- |
+| -------- | -------------------------------------------------- |
-| `params` | dict | The parameter values to use in the model. |
+| `params` | The parameter values to use in the model. ~~dict~~ |
## Transformer.to_disk {#to_disk tag="method"}

@@ -297,11 +298,11 @@ Serialize the pipe to disk.
> trf.to_disk("/path/to/transformer")
> ```

-| Name | Type | Description |
+| Name | Description |
-| -------------- | --------------- | --------------------------------------------------------------------------------------------------------------------- |
+| -------------- | ------------------------------------------------------------------------------------------------------------------------------------------ |
-| `path` | str / `Path` | A path to a directory, which will be created if it doesn't exist. Paths may be either strings or `Path`-like objects. |
+| `path` | A path to a directory, which will be created if it doesn't exist. Paths may be either strings or `Path`-like objects. ~~Union[str, Path]~~ |
-| _keyword-only_ | | |
+| _keyword-only_ | |
-| `exclude` | `Iterable[str]` | String names of [serialization fields](#serialization-fields) to exclude. |
+| `exclude` | String names of [serialization fields](#serialization-fields) to exclude. ~~Iterable[str]~~ |
## Transformer.from_disk {#from_disk tag="method"}

@@ -314,12 +315,12 @@ Load the pipe from disk. Modifies the object in place and returns it.
> trf.from_disk("/path/to/transformer")
> ```

-| Name | Type | Description |
+| Name | Description |
-| -------------- | --------------- | -------------------------------------------------------------------------- |
+| -------------- | ----------------------------------------------------------------------------------------------- |
-| `path` | str / `Path` | A path to a directory. Paths may be either strings or `Path`-like objects. |
+| `path` | A path to a directory. Paths may be either strings or `Path`-like objects. ~~Union[str, Path]~~ |
-| _keyword-only_ | | |
+| _keyword-only_ | |
-| `exclude` | `Iterable[str]` | String names of [serialization fields](#serialization-fields) to exclude. |
+| `exclude` | String names of [serialization fields](#serialization-fields) to exclude. ~~Iterable[str]~~ |
-| **RETURNS** | `Tok2Vec` | The modified `Tok2Vec` object. |
+| **RETURNS** | The modified `Transformer` object. ~~Transformer~~ |
## Transformer.to_bytes {#to_bytes tag="method"}

@@ -332,11 +333,11 @@ Load the pipe from disk. Modifies the object in place and returns it.

Serialize the pipe to a bytestring.

-| Name | Type | Description |
+| Name | Description |
-| -------------- | --------------- | ------------------------------------------------------------------------- |
+| -------------- | ------------------------------------------------------------------------------------------- |
-| _keyword-only_ | | |
+| _keyword-only_ | |
-| `exclude` | `Iterable[str]` | String names of [serialization fields](#serialization-fields) to exclude. |
+| `exclude` | String names of [serialization fields](#serialization-fields) to exclude. ~~Iterable[str]~~ |
-| **RETURNS** | bytes | The serialized form of the `Tok2Vec` object. |
+| **RETURNS** | The serialized form of the `Transformer` object. ~~bytes~~ |
## Transformer.from_bytes {#from_bytes tag="method"}

@@ -350,12 +351,12 @@ Load the pipe from a bytestring. Modifies the object in place and returns it.
> trf.from_bytes(trf_bytes)
> ```

-| Name | Type | Description |
+| Name | Description |
-| -------------- | --------------- | ------------------------------------------------------------------------- |
+| -------------- | ------------------------------------------------------------------------------------------- |
-| `bytes_data` | bytes | The data to load from. |
+| `bytes_data` | The data to load from. ~~bytes~~ |
-| _keyword-only_ | | |
+| _keyword-only_ | |
-| `exclude` | `Iterable[str]` | String names of [serialization fields](#serialization-fields) to exclude. |
+| `exclude` | String names of [serialization fields](#serialization-fields) to exclude. ~~Iterable[str]~~ |
-| **RETURNS** | `Tok2Vec` | The `Tok2Vec` object. |
+| **RETURNS** | The `Transformer` object. ~~Transformer~~ |
## Serialization fields {#serialization-fields}

@@ -386,20 +387,20 @@ by this class. Instances of this class
are typically assigned to the [`Doc._.trf_data`](/api/transformer#custom-attributes)
extension attribute.

-| Name | Type | Description |
+| Name | Description |
-| --------- | -------------------------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ |
+| --------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
-| `tokens` | `Dict` | A slice of the tokens data produced by the tokenizer. This may have several fields, including the token IDs, the texts, and the attention mask. See the [`transformers.BatchEncoding`](https://huggingface.co/transformers/main_classes/tokenizer.html#transformers.BatchEncoding) object for details. |
+| `tokens` | A slice of the tokens data produced by the tokenizer. This may have several fields, including the token IDs, the texts, and the attention mask. See the [`transformers.BatchEncoding`](https://huggingface.co/transformers/main_classes/tokenizer.html#transformers.BatchEncoding) object for details. ~~dict~~ |
-| `tensors` | `List[FloatsXd]` | The activations for the Doc from the transformer. Usually the last tensor that is 3-dimensional will be the most important, as that will provide the final hidden state. Generally activations that are 2-dimensional will be attention weights. Details of this variable will differ depending on the underlying transformer model. |
+| `tensors` | The activations for the Doc from the transformer. Usually the last tensor that is 3-dimensional will be the most important, as that will provide the final hidden state. Generally activations that are 2-dimensional will be attention weights. Details of this variable will differ depending on the underlying transformer model. ~~List[FloatsXd]~~ |
-| `align` | [`Ragged`](https://thinc.ai/docs/api-types#ragged) | Alignment from the `Doc`'s tokenization to the wordpieces. This is a ragged array, where `align.lengths[i]` indicates the number of wordpiece tokens that token `i` aligns against. The actual indices are provided at `align[i].dataXd`. |
+| `align` | Alignment from the `Doc`'s tokenization to the wordpieces. This is a ragged array, where `align.lengths[i]` indicates the number of wordpiece tokens that token `i` aligns against. The actual indices are provided at `align[i].dataXd`. ~~Ragged~~ |
-| `width` | int | The width of the last hidden layer. |
+| `width` | The width of the last hidden layer. ~~int~~ |
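The ragged alignment can be unpacked with a simple prefix-sum walk: `align.lengths[i]` tells you how many wordpieces token `i` maps to, and their indices are stored contiguously in the flat data array. Here is a plain-Python sketch of that indexing with made-up example data (thinc's `Ragged` stores the same information):

```python
lengths = [1, 2, 1]   # token i aligns against lengths[i] wordpieces
flat = [0, 1, 2, 3]   # wordpiece indices, concatenated per token

aligned = []
start = 0
for n in lengths:
    # Slice out the wordpiece indices belonging to this token.
    aligned.append(flat[start:start + n])
    start += n

# aligned == [[0], [1, 2], [3]]: token 1 maps to wordpieces 1 and 2
```

In the real `TransformerData` object, `align[i].dataXd` performs this same lookup for token `i`.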
### TransformerData.empty {#transformerdata-emoty tag="classmethod"}

Create an empty `TransformerData` container.

-| Name | Type | Description |
+| Name | Description |
-| ----------- | ----------------- | -------------- |
+| ----------- | ---------------------------------- |
-| **RETURNS** | `TransformerData` | The container. |
+| **RETURNS** | The container. ~~TransformerData~~ |
## FullTransformerBatch {#fulltransformerbatch tag="dataclass"}

@@ -407,13 +408,13 @@ Holds a batch of input and output objects for a transformer model. The data can
then be split to a list of [`TransformerData`](/api/transformer#transformerdata)
objects to associate the outputs to each [`Doc`](/api/doc) in the batch.

-| Name | Type | Description |
+| Name | Description |
-| ---------- | -------------------------------------------------------------------------------------------------------------------------- | --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
+| ---------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ |
-| `spans` | `List[List[Span]]` | The batch of input spans. The outer list refers to the Doc objects in the batch, and the inner list are the spans for that `Doc`. Note that spans are allowed to overlap or exclude tokens, but each Span can only refer to one `Doc` (by definition). This means that within a `Doc`, the regions of the output tensors that correspond to each Span may overlap or have gaps, but for each `Doc`, there is a non-overlapping contiguous slice of the outputs. |
+| `spans` | The batch of input spans. The outer list refers to the Doc objects in the batch, and the inner list are the spans for that `Doc`. Note that spans are allowed to overlap or exclude tokens, but each Span can only refer to one `Doc` (by definition). This means that within a `Doc`, the regions of the output tensors that correspond to each Span may overlap or have gaps, but for each `Doc`, there is a non-overlapping contiguous slice of the outputs. ~~List[List[Span]]~~ |
-| `tokens` | [`transformers.BatchEncoding`](https://huggingface.co/transformers/main_classes/tokenizer.html#transformers.BatchEncoding) | The output of the tokenizer. |
+| `tokens` | The output of the tokenizer. ~~transformers.BatchEncoding~~ |
-| `tensors` | `List[torch.Tensor]` | The output of the transformer model. |
+| `tensors` | The output of the transformer model. ~~List[torch.Tensor]~~ |
-| `align` | [`Ragged`](https://thinc.ai/docs/api-types#ragged) | Alignment from the spaCy tokenization to the wordpieces. This is a ragged array, where `align.lengths[i]` indicates the number of wordpiece tokens that token `i` aligns against. The actual indices are provided at `align[i].dataXd`. |
+| `align` | Alignment from the spaCy tokenization to the wordpieces. This is a ragged array, where `align.lengths[i]` indicates the number of wordpiece tokens that token `i` aligns against. The actual indices are provided at `align[i].dataXd`. ~~Ragged~~ |
-| `doc_data` | `List[TransformerData]` | The outputs, split per `Doc` object. |
+| `doc_data` | The outputs, split per `Doc` object. ~~List[TransformerData]~~ |
### FullTransformerBatch.unsplit_by_doc {#fulltransformerbatch-unsplit_by_doc tag="method"}

@@ -422,19 +423,19 @@ current object's spans, tokens and alignment. This is used during the backward
pass, in order to construct the gradients to pass back into the transformer
model.

-| Name | Type | Description |
+| Name | Description |
-| ----------- | ---------------------- | ------------------------------- |
+| ----------- | -------------------------------------------------------- |
-| `arrays` | `List[List[Floats3d]]` | The split batch of activations. |
+| `arrays` | The split batch of activations. ~~List[List[Floats3d]]~~ |
-| **RETURNS** | `FullTransformerBatch` | The transformer batch. |
+| **RETURNS** | The transformer batch. ~~FullTransformerBatch~~ |
### FullTransformerBatch.split_by_doc {#fulltransformerbatch-split_by_doc tag="method"}

Split a `TransformerData` object that represents a batch into a list with one
`TransformerData` per `Doc`.

-| Name | Type | Description |
+| Name | Description |
-| ----------- | ----------------------- | ---------------- |
+| ----------- | ------------------------------------------ |
-| **RETURNS** | `List[TransformerData]` | The split batch. |
+| **RETURNS** | The split batch. ~~List[TransformerData]~~ |
## Span getters {#span_getters source="github.com/explosion/spacy-transformers/blob/master/spacy_transformers/span_getters.py"}

@@ -460,10 +461,10 @@ decorator.
>     return get_sent_spans
> ```

-| Name | Type | Description |
+| Name | Description |
-| ----------- | ------------------ | ---------------------------------------- |
+| ----------- | ------------------------------------------------------------- |
-| `docs` | `Iterable[Doc]` | A batch of `Doc` objects. |
+| `docs` | A batch of `Doc` objects. ~~Iterable[Doc]~~ |
-| **RETURNS** | `List[List[Span]]` | The spans to process by the transformer. |
+| **RETURNS** | The spans to process by the transformer. ~~List[List[Span]]~~ |

### doc_spans.v1 {#doc_spans tag="registered function"}

@@ -510,10 +511,10 @@ than `window` will allow for an overlap, so that some tokens are counted twice.
This can be desirable, because it allows all tokens to have both a left and
right context.

-| Name | Type | Description |
+| Name | Description |
-| --------- | ---- | ---------------- |
+| -------- | ------------------------ |
-| `window` | int | The window size. |
+| `window` | The window size. ~~int~~ |
-| `stride` | int | The stride size. |
+| `stride` | The stride size. ~~int~~ |
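The window/stride overlap described above is easy to see in a standalone sketch. The helper below is an illustration of the semantics (start indices advance by `stride`, each span covers up to `window` tokens), not the library's actual implementation:

```python
def strided_windows(n_tokens, window, stride):
    """Compute (start, end) token spans: starts advance by `stride`,
    each span covers up to `window` tokens. With stride < window,
    consecutive spans overlap by window - stride tokens."""
    spans = []
    start = 0
    while start < n_tokens:
        end = min(start + window, n_tokens)
        spans.append((start, end))
        if end == n_tokens:
            break
        start += stride
    return spans


# window=4, stride=3: each pair of neighboring spans shares one token
print(strided_windows(10, 4, 3))  # [(0, 4), (3, 7), (6, 10)]
```

With spans computed this way, every token gets both left and right context in at least one window, at the cost of some tokens being processed twice.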
## Annotation setters {#annotation_setters tag="registered functions" source="github.com/explosion/spacy-transformers/blob/master/spacy_transformers/annotation_setters.py"}

@@ -526,7 +527,7 @@ You can register custom annotation setters using the
> #### Example
>
> ```python
-> @registry.annotation_setters("spacy-transformer.null_annotation_setter.v1")
+> @registry.annotation_setters("spacy-transformers.null_annotation_setter.v1")
> def configure_null_annotation_setter() -> Callable:
>     def setter(docs: List[Doc], trf_data: FullTransformerBatch) -> None:
>         pass

@@ -534,22 +535,22 @@ You can register custom annotation setters using the
>     return setter
> ```

-| Name | Type | Description |
+| Name | Description |
-| ---------- | ---------------------- | ------------------------------------ |
+| ---------- | ------------------------------------------------------------- |
-| `docs` | `List[Doc]` | A batch of `Doc` objects. |
+| `docs` | A batch of `Doc` objects. ~~List[Doc]~~ |
-| `trf_data` | `FullTransformerBatch` | The transformers data for the batch. |
+| `trf_data` | The transformers data for the batch. ~~FullTransformerBatch~~ |

The following built-in functions are available:

| Name | Description |
-| --------------------------------------------- | ------------------------------------- |
+| ---------------------------------------------- | ------------------------------------- |
-| `spacy-transformer.null_annotation_setter.v1` | Don't set any additional annotations. |
+| `spacy-transformers.null_annotation_setter.v1` | Don't set any additional annotations. |
## Custom attributes {#custom-attributes}

The component sets the following
[custom extension attributes](/usage/processing-pipeline#custom-components-attributes):

-| Name | Type | Description |
+| Name | Description |
-| -------------- | ----------------------------------------------------- | ---------------------------------------------------- |
+| -------------- | ------------------------------------------------------------------------ |
-| `Doc.trf_data` | [`TransformerData`](/api/transformer#transformerdata) | Transformer tokens and outputs for the `Doc` object. |
+| `Doc.trf_data` | Transformer tokens and outputs for the `Doc` object. ~~TransformerData~~ |
@@ -30,13 +30,13 @@ you can add vectors to later.
> vectors = Vectors(data=data, keys=keys)
> ```

-| Name | Type | Description |
+| Name | Description |
-| -------------- | ---------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------ |
+| -------------- | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
-| _keyword-only_ | | |
+| _keyword-only_ | |
-| `shape` | tuple | Size of the table as `(n_entries, n_columns)`, the number of entries and number of columns. Not required if you're initializing the object with `data` and `keys`. |
+| `shape` | Size of the table as `(n_entries, n_columns)`, the number of entries and number of columns. Not required if you're initializing the object with `data` and `keys`. ~~Tuple[int, int]~~ |
-| `data` | `ndarray[ndim=1, dtype='float32']` | The vector data. |
+| `data` | The vector data. ~~numpy.ndarray[ndim=1, dtype=float32]~~ |
-| `keys` | iterable | A sequence of keys aligned with the data. |
+| `keys` | A sequence of keys aligned with the data. ~~Iterable[Union[str, int]]~~ |
-| `name` | str | A name to identify the vectors table. |
+| `name` | A name to identify the vectors table. ~~str~~ |
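Conceptually, a `Vectors` table is a 2D data array plus a mapping from keys to rows. The toy class below illustrates that key-to-row design in plain Python; every name here is invented for illustration and the real class stores its data in a NumPy/CuPy array:

```python
class TinyVectors:
    """Toy key-to-row vector table, mirroring the Vectors design."""

    def __init__(self, shape):
        n_rows, n_cols = shape
        self.data = [[0.0] * n_cols for _ in range(n_rows)]
        self.key2row = {}

    def add(self, key, row=None, vector=None):
        # If no row is given, assign the next free one.
        if row is None:
            row = len(self.key2row)
        self.key2row[key] = row
        if vector is not None:
            self.data[row] = list(vector)
        return row

    def __getitem__(self, key):
        return self.data[self.key2row[key]]

    def __contains__(self, key):
        return key in self.key2row


vectors = TinyVectors(shape=(2, 3))
row = vectors.add("cat", vector=[0.1, 0.2, 0.3])
assert "cat" in vectors and vectors["cat"] == [0.1, 0.2, 0.3]
```

Because the mapping and the data are separate, two keys can share a row, which is how the real table supports more keys than vectors.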
## Vectors.\_\_getitem\_\_ {#getitem tag="method"}

@@ -51,10 +51,10 @@ raised.
> assert cat_vector == nlp.vocab["cat"].vector
> ```

-| Name | Type | Description |
+| Name | Description |
-| ------- | ---------------------------------- | ------------------------------ |
+| ----------- | ---------------------------------------------------------------- |
-| `key` | int | The key to get the vector for. |
+| `key` | The key to get the vector for. ~~int~~ |
-| returns | `ndarray[ndim=1, dtype='float32']` | The vector for the key. |
+| **RETURNS** | The vector for the key. ~~numpy.ndarray[ndim=1, dtype=float32]~~ |
## Vectors.\_\_setitem\_\_ {#setitem tag="method"}

@@ -68,10 +68,10 @@ Set a vector for the given key.
> nlp.vocab.vectors[cat_id] = vector
> ```

-| Name | Type | Description |
+| Name | Description |
-| -------- | ---------------------------------- | ------------------------------ |
+| -------- | ----------------------------------------------------------- |
-| `key` | int | The key to set the vector for. |
+| `key` | The key to set the vector for. ~~int~~ |
-| `vector` | `ndarray[ndim=1, dtype='float32']` | The vector to set. |
+| `vector` | The vector to set. ~~numpy.ndarray[ndim=1, dtype=float32]~~ |
## Vectors.\_\_iter\_\_ {#iter tag="method"}

@@ -84,9 +84,9 @@ Iterate over the keys in the table.
>     print(key, nlp.vocab.strings[key])
> ```

-| Name | Type | Description |
+| Name | Description |
-| ---------- | ---- | ------------------- |
+| ---------- | --------------------------- |
-| **YIELDS** | int | A key in the table. |
+| **YIELDS** | A key in the table. ~~int~~ |
## Vectors.\_\_len\_\_ {#len tag="method"}
|
## Vectors.\_\_len\_\_ {#len tag="method"}
|
||||||
|
|
||||||
|
@ -99,9 +99,9 @@ Return the number of vectors in the table.
|
||||||
> assert len(vectors) == 3
|
> assert len(vectors) == 3
|
||||||
> ```
|
> ```
|
||||||
|
|
||||||
| Name | Type | Description |
|
| Name | Description |
|
||||||
| ----------- | ---- | ----------------------------------- |
|
| ----------- | ------------------------------------------- |
|
||||||
| **RETURNS** | int | The number of vectors in the table. |
|
| **RETURNS** | The number of vectors in the table. ~~int~~ |
|
||||||
|
|
||||||
## Vectors.\_\_contains\_\_ {#contains tag="method"}
|
## Vectors.\_\_contains\_\_ {#contains tag="method"}
|
||||||
|
|
||||||
|
@ -115,10 +115,10 @@ Check whether a key has been mapped to a vector entry in the table.
|
||||||
> assert cat_id in vectors
|
> assert cat_id in vectors
|
||||||
> ```
|
> ```
|
||||||
|
|
||||||
| Name | Type | Description |
|
| Name | Description |
|
||||||
| ----------- | ---- | ----------------------------------- |
|
| ----------- | -------------------------------------------- |
|
||||||
| `key` | int | The key to check. |
|
| `key` | The key to check. ~~int~~ |
|
||||||
| **RETURNS** | bool | Whether the key has a vector entry. |
|
| **RETURNS** | Whether the key has a vector entry. ~~bool~~ |
|
||||||
|
|
||||||
## Vectors.add {#add tag="method"}
|
## Vectors.add {#add tag="method"}
|
||||||
|
|
||||||
|
@ -138,13 +138,13 @@ mapping separately. If you need to manage the strings, you should use the
|
||||||
> nlp.vocab.vectors.add("dog", row=0)
|
> nlp.vocab.vectors.add("dog", row=0)
|
||||||
> ```
|
> ```
|
||||||
|
|
||||||
| Name | Type | Description |
|
| Name | Description |
|
||||||
| -------------- | ---------------------------------- | ----------------------------------------------------- |
|
| -------------- | ------------------------------------------------------------------------------- |
|
||||||
| `key` | str / int | The key to add. |
|
| `key` | The key to add. ~~Union[str, int]~~ |
|
||||||
| _keyword-only_ | | |
|
| _keyword-only_ | |
|
||||||
| `vector` | `ndarray[ndim=1, dtype='float32']` | An optional vector to add for the key. |
|
| `vector` | An optional vector to add for the key. ~~numpy.ndarray[ndim=1, dtype=float32]~~ |
|
||||||
| `row` | int | An optional row number of a vector to map the key to. |
|
| `row` | An optional row number of a vector to map the key to. ~~int~~ |
|
||||||
| **RETURNS** | int | The row the vector was added to. |
|
| **RETURNS** | The row the vector was added to. ~~int~~ |
|
||||||
|
|
||||||
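The key-to-row bookkeeping that `Vectors.add` documents can be sketched in plain Python. This is an illustrative toy, not spaCy's implementation — the real class stores rows in a `numpy` array:

```python
# Toy sketch of the key -> row bookkeeping behind Vectors.add.
# Illustrative plain Python, not spaCy's implementation.

class ToyVectors:
    def __init__(self):
        self.key2row = {}  # key -> row index
        self.data = []     # list of row vectors

    def add(self, key, vector=None, row=None):
        """Add a key with a vector or an explicit row; return the row."""
        if row is None:
            if vector is None:
                raise ValueError("need a vector or a row")
            row = len(self.data)
            self.data.append(vector)
        self.key2row[key] = row
        return row

vectors = ToyVectors()
row = vectors.add("cat", vector=[0.1, 0.2, 0.3])
# "dog" reuses an existing row, as in nlp.vocab.vectors.add("dog", row=0)
vectors.add("dog", row=row)
```

Mapping a second key to an existing row mirrors how spaCy lets many lexemes share one vector.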
 ## Vectors.resize {#resize tag="method"}
 
@@ -160,11 +160,11 @@ These removed items are returned as a list of `(key, row)` tuples.
 > removed = nlp.vocab.vectors.resize((10000, 300))
 > ```
 
-| Name | Type | Description |
-| ----------- | ----- | -------------------------------------------------------------------- |
-| `shape` | tuple | A `(rows, dims)` tuple describing the number of rows and dimensions. |
-| `inplace` | bool | Reallocate the memory. |
-| **RETURNS** | list | The removed items as a list of `(key, row)` tuples. |
+| Name | Description |
+| ----------- | ---------------------------------------------------------------------------------------- |
+| `shape` | A `(rows, dims)` tuple describing the number of rows and dimensions. ~~Tuple[int, int]~~ |
+| `inplace` | Reallocate the memory. ~~bool~~ |
+| **RETURNS** | The removed items as a list of `(key, row)` tuples. ~~List[Tuple[int, int]]~~ |
 
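The resize behavior described above — dropping entries whose rows no longer fit and returning them as `(key, row)` tuples — can be sketched as a toy function (plain Python, not spaCy code):

```python
# Toy sketch of resizing a vectors table: rows beyond the new shape are
# dropped and returned as (key, row) tuples, as Vectors.resize documents.
# Plain Python for illustration, not spaCy's implementation.

def resize_table(key2row, data, shape):
    rows, dims = shape
    removed = [(k, r) for k, r in key2row.items() if r >= rows]
    for k, _ in removed:
        del key2row[k]
    # Truncate/pad to the new number of rows and dimensions.
    data = [row[:dims] + [0.0] * (dims - len(row)) for row in data[:rows]]
    data += [[0.0] * dims for _ in range(rows - len(data))]
    return removed, data

key2row = {"cat": 0, "dog": 1, "fish": 2}
data = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
removed, data = resize_table(key2row, data, (2, 2))
```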
 ## Vectors.keys {#keys tag="method"}
 
@@ -177,9 +177,9 @@ A sequence of the keys in the table.
 > print(key, nlp.vocab.strings[key])
 > ```
 
-| Name | Type | Description |
-| ----------- | -------- | ----------- |
-| **RETURNS** | iterable | The keys. |
+| Name | Description |
+| ----------- | --------------------------- |
+| **RETURNS** | The keys. ~~Iterable[int]~~ |
 
 ## Vectors.values {#values tag="method"}
 
@@ -194,9 +194,9 @@ the length of the vectors table.
 > print(vector)
 > ```
 
-| Name | Type | Description |
-| ---------- | ---------------------------------- | ---------------------- |
-| **YIELDS** | `ndarray[ndim=1, dtype='float32']` | A vector in the table. |
+| Name | Description |
+| ---------- | --------------------------------------------------------------- |
+| **YIELDS** | A vector in the table. ~~numpy.ndarray[ndim=1, dtype=float32]~~ |
 
 ## Vectors.items {#items tag="method"}
 
@@ -209,9 +209,9 @@ Iterate over `(key, vector)` pairs, in order.
 > print(key, nlp.vocab.strings[key], vector)
 > ```
 
-| Name | Type | Description |
-| ---------- | ----- | -------------------------------- |
-| **YIELDS** | tuple | `(key, vector)` pairs, in order. |
+| Name | Description |
+| ---------- | ------------------------------------------------------------------------------------- |
+| **YIELDS** | `(key, vector)` pairs, in order. ~~Tuple[int, numpy.ndarray[ndim=1, dtype=float32]]~~ |
 
 ## Vectors.find {#find tag="method"}
 
@@ -226,14 +226,14 @@ Look up one or more keys by row, or vice versa.
 > keys = nlp.vocab.vectors.find(rows=[18, 256, 985])
 > ```
 
-| Name | Type | Description |
-| -------------- | ------------------------------------- | ------------------------------------------------------------------------ |
-| _keyword-only_ | | |
-| `key` | str / int | Find the row that the given key points to. Returns int, `-1` if missing. |
-| `keys` | iterable | Find rows that the keys point to. Returns `ndarray`. |
-| `row` | int | Find the first key that points to the row. Returns int. |
-| `rows` | iterable | Find the keys that point to the rows. Returns `ndarray`. |
+| Name | Description |
+| -------------- | -------------------------------------------------------------------------------------------- |
+| _keyword-only_ | |
+| `key` | Find the row that the given key points to. Returns int, `-1` if missing. ~~Union[str, int]~~ |
+| `keys` | Find rows that the keys point to. Returns `numpy.ndarray`. ~~Iterable[Union[str, int]]~~ |
+| `row` | Find the first key that points to the row. Returns integer. ~~int~~ |
+| `rows` | Find the keys that point to the rows. Returns `numpy.ndarray`. ~~Iterable[int]~~ |
+| **RETURNS** | The requested key, keys, row or rows. ~~Union[int, numpy.ndarray[ndim=1, dtype=float32]]~~ |
 
 ## Vectors.shape {#shape tag="property"}
 
@@ -250,9 +250,9 @@ vector table.
 > assert dims == 300
 > ```
 
-| Name | Type | Description |
-| ----------- | ----- | ---------------------- |
-| **RETURNS** | tuple | A `(rows, dims)` pair. |
+| Name | Description |
+| ----------- | ------------------------------------------ |
+| **RETURNS** | A `(rows, dims)` pair. ~~Tuple[int, int]~~ |
 
 ## Vectors.size {#size tag="property"}
 
@@ -265,9 +265,9 @@ The vector size, i.e. `rows * dims`.
 > assert vectors.size == 150000
 > ```
 
-| Name | Type | Description |
-| ----------- | ---- | ---------------- |
-| **RETURNS** | int | The vector size. |
+| Name | Description |
+| ----------- | ------------------------ |
+| **RETURNS** | The vector size. ~~int~~ |
 
 ## Vectors.is_full {#is_full tag="property"}
 
@@ -283,9 +283,9 @@ If a table is full, it can be resized using
 > assert vectors.is_full
 > ```
 
-| Name | Type | Description |
-| ----------- | ---- | ---------------------------------- |
-| **RETURNS** | bool | Whether the vectors table is full. |
+| Name | Description |
+| ----------- | ------------------------------------------- |
+| **RETURNS** | Whether the vectors table is full. ~~bool~~ |
 
 ## Vectors.n_keys {#n_keys tag="property"}
 
@@ -301,9 +301,9 @@ vectors, they will be counted individually.
 > assert vectors.n_keys == 0
 > ```
 
-| Name | Type | Description |
-| ----------- | ---- | ------------------------------------ |
-| **RETURNS** | int | The number of all keys in the table. |
+| Name | Description |
+| ----------- | -------------------------------------------- |
+| **RETURNS** | The number of all keys in the table. ~~int~~ |
 
 ## Vectors.most_similar {#most_similar tag="method"}
 
@@ -320,14 +320,14 @@ performed in chunks, to avoid consuming too much memory. You can set the
 > most_similar = nlp.vocab.vectors.most_similar(queries, n=10)
 > ```
 
-| Name | Type | Description |
-| -------------- | --------- | ------------------------------------------------------------------ |
-| `queries` | `ndarray` | An array with one or more vectors. |
-| _keyword-only_ | | |
-| `batch_size` | int | The batch size to use. Defaults to `1024`. |
-| `n` | int | The number of entries to return for each query. Defaults to `1`. |
-| `sort` | bool | Whether to sort the entries returned by score. Defaults to `True`. |
-| **RETURNS** | tuple | The most similar entries as a `(keys, best_rows, scores)` tuple. |
+| Name | Description |
+| -------------- | --------------------------------------------------------------------------------------------------------------------- |
+| `queries` | An array with one or more vectors. ~~numpy.ndarray~~ |
+| _keyword-only_ | |
+| `batch_size` | The batch size to use. Defaults to `1024`. ~~int~~ |
+| `n` | The number of entries to return for each query. Defaults to `1`. ~~int~~ |
+| `sort` | Whether to sort the entries returned by score. Defaults to `True`. ~~bool~~ |
+| **RETURNS** | The most similar entries as a `(keys, best_rows, scores)` tuple. ~~Tuple[numpy.ndarray, numpy.ndarray, numpy.ndarray]~~ |
 
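The core of a `most_similar` query — cosine similarity of a query vector against every row, keeping the `n` best — can be sketched in plain Python. This toy version scores one query against a small table; spaCy does the same math batched over the `numpy` array:

```python
import math

# Toy sketch of a most_similar lookup: cosine similarity of one query
# against every entry, returning the n best (key, score) pairs.
# Plain Python for illustration, not spaCy's batched implementation.

def most_similar(query, table, n=1):
    def cos(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
        return dot / norm

    scored = sorted(((cos(query, vec), key) for key, vec in table.items()),
                    reverse=True)
    return [(key, score) for score, key in scored[:n]]

table = {"cat": [1.0, 0.0], "dog": [0.9, 0.1], "car": [0.0, 1.0]}
best = most_similar([1.0, 0.05], table, n=2)
```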
 ## Vectors.to_disk {#to_disk tag="method"}
 
@@ -340,9 +340,9 @@ Save the current state to a directory.
 >
 > ```
 
-| Name | Type | Description |
-| ------ | ------------ | --------------------------------------------------------------------------------------------------------------------- |
-| `path` | str / `Path` | A path to a directory, which will be created if it doesn't exist. Paths may be either strings or `Path`-like objects. |
+| Name | Description |
+| ------ | ------------------------------------------------------------------------------------------------------------------------------------------ |
+| `path` | A path to a directory, which will be created if it doesn't exist. Paths may be either strings or `Path`-like objects. ~~Union[str, Path]~~ |
 
 ## Vectors.from_disk {#from_disk tag="method"}
 
@@ -355,10 +355,10 @@ Loads state from a directory. Modifies the object in place and returns it.
 > vectors.from_disk("/path/to/vectors")
 > ```
 
-| Name | Type | Description |
-| ----------- | ------------ | -------------------------------------------------------------------------- |
-| `path` | str / `Path` | A path to a directory. Paths may be either strings or `Path`-like objects. |
-| **RETURNS** | `Vectors` | The modified `Vectors` object. |
+| Name | Description |
+| ----------- | ----------------------------------------------------------------------------------------------- |
+| `path` | A path to a directory. Paths may be either strings or `Path`-like objects. ~~Union[str, Path]~~ |
+| **RETURNS** | The modified `Vectors` object. ~~Vectors~~ |
 
 ## Vectors.to_bytes {#to_bytes tag="method"}
 
@@ -370,9 +370,9 @@ Serialize the current state to a binary string.
 > vectors_bytes = vectors.to_bytes()
 > ```
 
-| Name | Type | Description |
-| ----------- | ----- | -------------------------------------------- |
-| **RETURNS** | bytes | The serialized form of the `Vectors` object. |
+| Name | Description |
+| ----------- | ------------------------------------------------------ |
+| **RETURNS** | The serialized form of the `Vectors` object. ~~bytes~~ |
 
 ## Vectors.from_bytes {#from_bytes tag="method"}
 
@@ -387,15 +387,15 @@ Load state from a binary string.
 > new_vectors.from_bytes(vectors_bytes)
 > ```
 
-| Name | Type | Description |
-| ----------- | --------- | ---------------------- |
-| `data` | bytes | The data to load from. |
-| **RETURNS** | `Vectors` | The `Vectors` object. |
+| Name | Description |
+| ----------- | --------------------------------- |
+| `data` | The data to load from. ~~bytes~~ |
+| **RETURNS** | The `Vectors` object. ~~Vectors~~ |
 
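The `to_bytes`/`from_bytes` contract documented above — a binary blob out, the same state restored from it — can be sketched with a toy serializer. JSON is used here purely for illustration; spaCy uses its own binary format:

```python
import json

# Toy sketch of the to_bytes / from_bytes round trip: serialize a
# key -> row mapping plus vector data to bytes and restore it.
# Illustrative only; spaCy's actual wire format is binary, not JSON.

def to_bytes(key2row, data):
    return json.dumps({"key2row": key2row, "data": data}).encode("utf8")

def from_bytes(blob):
    msg = json.loads(blob.decode("utf8"))
    return msg["key2row"], msg["data"]

blob = to_bytes({"cat": 0}, [[0.1, 0.2]])
key2row, data = from_bytes(blob)
```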
 ## Attributes {#attributes}
 
-| Name | Type | Description |
-| --------- | ---------------------------------- | ------------------------------------------------------------------------------- |
-| `data` | `ndarray[ndim=1, dtype='float32']` | Stored vectors data. `numpy` is used for CPU vectors, `cupy` for GPU vectors. |
-| `key2row` | dict | Dictionary mapping word hashes to rows in the `Vectors.data` table. |
-| `keys` | `ndarray[ndim=1, dtype='float32']` | Array keeping the keys in order, such that `keys[vectors.key2row[key]] == key`. |
+| Name | Description |
+| --------- | -------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
+| `data` | Stored vectors data. `numpy` is used for CPU vectors, `cupy` for GPU vectors. ~~Union[numpy.ndarray[ndim=1, dtype=float32], cupy.ndarray[ndim=1, dtype=float32]]~~ |
+| `key2row` | Dictionary mapping word hashes to rows in the `Vectors.data` table. ~~Dict[int, int]~~ |
+| `keys` | Array keeping the keys in order, such that `keys[vectors.key2row[key]] == key`. ~~Union[numpy.ndarray[ndim=1, dtype=float32], cupy.ndarray[ndim=1, dtype=float32]]~~ |
 
@@ -21,14 +21,15 @@ Create the vocabulary.
 > vocab = Vocab(strings=["hello", "world"])
 > ```
 
-| Name | Type | Description |
-| -------------------------------------------- | -------------------- | ----------------------------------------------------------------------------------------------------------------------------------------------------------- |
-| `lex_attr_getters` | dict | A dictionary mapping attribute IDs to functions to compute them. Defaults to `None`. |
-| `strings` | `StringStore` / list | A [`StringStore`](/api/stringstore) that maps strings to hash values, and vice versa, or a list of strings. |
-| `lookups` | `Lookups` | A [`Lookups`](/api/lookups) that stores the `lemma_\*`, `lexeme_norm` and other large lookup tables. Defaults to `None`. |
-| `lookups_extra` <Tag variant="new">2.3</Tag> | `Lookups` | A [`Lookups`](/api/lookups) that stores the optional `lexeme_cluster`/`lexeme_prob`/`lexeme_sentiment`/`lexeme_settings` lookup tables. Defaults to `None`. |
-| `oov_prob` | float | The default OOV probability. Defaults to `-20.0`. |
-| `vectors_name` <Tag variant="new">2.2</Tag> | str | A name to identify the vectors table. |
+| Name | Description |
+| ------------------------------------------- | -------------------------------------------------------------------------------------------------------------------------------------------------------- |
+| `lex_attr_getters` | A dictionary mapping attribute IDs to functions to compute them. Defaults to `None`. ~~Optional[Dict[str, Callable[[str], Any]]]~~ |
+| `strings` | A [`StringStore`](/api/stringstore) that maps strings to hash values, and vice versa, or a list of strings. ~~Union[List[str], StringStore]~~ |
+| `lookups` | A [`Lookups`](/api/lookups) that stores the `lexeme_norm` and other large lookup tables. Defaults to `None`. ~~Optional[Lookups]~~ |
+| `oov_prob` | The default OOV probability. Defaults to `-20.0`. ~~float~~ |
+| `vectors_name` <Tag variant="new">2.2</Tag> | A name to identify the vectors table. ~~str~~ |
+| `writing_system` | A dictionary describing the language's writing system. Typically provided by [`Language.Defaults`](/api/language#defaults). ~~Dict[str, Any]~~ |
+| `get_noun_chunks` | A function that yields base noun phrases, used for [`Doc.noun_chunks`](/api/doc#noun_chunks). ~~Optional[Callable[[Union[Doc, Span]], Iterator[Span]]]~~ |
 
 ## Vocab.\_\_len\_\_ {#len tag="method"}
 
@@ -41,9 +42,9 @@ Get the current number of lexemes in the vocabulary.
 > assert len(nlp.vocab) > 0
 > ```
 
-| Name | Type | Description |
-| ----------- | ---- | ---------------------------------------- |
-| **RETURNS** | int | The number of lexemes in the vocabulary. |
+| Name | Description |
+| ----------- | ------------------------------------------------ |
+| **RETURNS** | The number of lexemes in the vocabulary. ~~int~~ |
 
 ## Vocab.\_\_getitem\_\_ {#getitem tag="method"}
 
@@ -57,10 +58,10 @@ given, a new lexeme is created and stored.
 > assert nlp.vocab[apple] == nlp.vocab["apple"]
 > ```
 
-| Name | Type | Description |
-| -------------- | --------- | ---------------------------------------- |
-| `id_or_string` | int / str | The hash value of a word, or its string. |
-| **RETURNS** | `Lexeme` | The lexeme indicated by the given ID. |
+| Name | Description |
+| -------------- | ------------------------------------------------------------ |
+| `id_or_string` | The hash value of a word, or its string. ~~Union[int, str]~~ |
+| **RETURNS** | The lexeme indicated by the given ID. ~~Lexeme~~ |
 
 ## Vocab.\_\_iter\_\_ {#iter tag="method"}
 
@@ -72,9 +73,9 @@ Iterate over the lexemes in the vocabulary.
 > stop_words = (lex for lex in nlp.vocab if lex.is_stop)
 > ```
 
-| Name | Type | Description |
-| ---------- | -------- | --------------------------- |
-| **YIELDS** | `Lexeme` | An entry in the vocabulary. |
+| Name | Description |
+| ---------- | -------------------------------------- |
+| **YIELDS** | An entry in the vocabulary. ~~Lexeme~~ |
 
 ## Vocab.\_\_contains\_\_ {#contains tag="method"}
 
@@ -91,10 +92,10 @@ given string, you need to look it up in
 > assert oov not in nlp.vocab
 > ```
 
-| Name | Type | Description |
-| ----------- | ---- | -------------------------------------------------- |
-| `string` | str | The ID string. |
-| **RETURNS** | bool | Whether the string has an entry in the vocabulary. |
+| Name | Description |
+| ----------- | ----------------------------------------------------------- |
+| `string` | The ID string. ~~str~~ |
+| **RETURNS** | Whether the string has an entry in the vocabulary. ~~bool~~ |
 
 ## Vocab.add_flag {#add_flag tag="method"}
 
@@ -115,11 +116,11 @@ using `token.check_flag(flag_id)`.
 > assert doc[2].check_flag(MY_PRODUCT) == True
 > ```
 
-| Name | Type | Description |
-| ------------- | ---- | ----------------------------------------------------------------------------------------------------------------------------------------------- |
-| `flag_getter` | dict | A function `f(str) -> bool`, to get the flag value. |
-| `flag_id` | int | An integer between 1 and 63 (inclusive), specifying the bit at which the flag will be stored. If `-1`, the lowest available bit will be chosen. |
-| **RETURNS** | int | The integer ID by which the flag value can be checked. |
+| Name | Description |
+| ------------- | ----------------------------------------------------------------------------------------------------------------------------------------------------------- |
+| `flag_getter` | A function that takes the lexeme text and returns the boolean flag value. ~~Callable[[str], bool]~~ |
+| `flag_id` | An integer between `1` and `63` (inclusive), specifying the bit at which the flag will be stored. If `-1`, the lowest available bit will be chosen. ~~int~~ |
+| **RETURNS** | The integer ID by which the flag value can be checked. ~~int~~ |
 
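The bit-level mechanics behind `Vocab.add_flag` — picking a free bit between 1 and 63, then setting and checking it — can be sketched with plain bitwise operations. This is a toy illustration of the idea, not spaCy code:

```python
# Toy sketch of storing binary flags in a 64-bit field, as
# Vocab.add_flag documents: pick a bit between 1 and 63, then set and
# check it with bitwise operations. Plain Python, not spaCy code.

def add_flag(used_bits, flag_id=-1):
    if flag_id == -1:
        # -1 means: choose the lowest available bit in 1..63
        flag_id = next(b for b in range(1, 64) if b not in used_bits)
    used_bits.add(flag_id)
    return flag_id

def set_flag(flags, flag_id, value):
    return flags | (1 << flag_id) if value else flags & ~(1 << flag_id)

def check_flag(flags, flag_id):
    return bool(flags & (1 << flag_id))

used = {1, 2}                      # bits already claimed by other flags
my_flag = add_flag(used)           # lowest free bit
flags = set_flag(0, my_flag, True)
```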
 ## Vocab.reset_vectors {#reset_vectors tag="method" new="2"}
 
@@ -133,11 +134,11 @@ have to call this to change the size of the vectors. Only one of the `width` and
 > nlp.vocab.reset_vectors(width=300)
 > ```
 
-| Name | Type | Description |
-| -------------- | ---- | -------------------------------------- |
-| _keyword-only_ | | |
-| `width` | int | The new width (keyword argument only). |
-| `shape` | int | The new shape (keyword argument only). |
+| Name | Description |
+| -------------- | ---------------------- |
+| _keyword-only_ | |
+| `width` | The new width. ~~int~~ |
+| `shape` | The new shape. ~~int~~ |
 
 ## Vocab.prune_vectors {#prune_vectors tag="method" new="2"}
 
@@ -158,11 +159,11 @@ cosines are calculated in minibatches, to reduce memory usage.
 > assert len(nlp.vocab.vectors) <= 1000
 > ```
 
-| Name | Type | Description |
-| ------------ | ---- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
-| `nr_row` | int | The number of rows to keep in the vector table. |
-| `batch_size` | int | Batch of vectors for calculating the similarities. Larger batch sizes might be faster, while temporarily requiring more memory. |
-| **RETURNS** | dict | A dictionary keyed by removed words mapped to `(string, score)` tuples, where `string` is the entry the removed word was mapped to, and `score` the similarity score between the two words. |
+| Name | Description |
+| ------------ | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
+| `nr_row` | The number of rows to keep in the vector table. ~~int~~ |
+| `batch_size` | Batch of vectors for calculating the similarities. Larger batch sizes might be faster, while temporarily requiring more memory. ~~int~~ |
+| **RETURNS** | A dictionary keyed by removed words mapped to `(string, score)` tuples, where `string` is the entry the removed word was mapped to, and `score` the similarity score between the two words. ~~Dict[str, Tuple[str, float]]~~ |
 
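The pruning behavior documented above — keep the first `nr_row` vectors and remap each removed word to its most similar kept word, returning a `{removed: (kept, score)}` mapping — can be sketched as a toy function (plain Python, not spaCy's minibatched implementation):

```python
import math

# Toy sketch of prune_vectors: keep the first nr_row entries and remap
# each removed word to its most similar kept word, returning the
# documented {removed: (kept, score)} mapping. Not spaCy code.

def prune(table, nr_row):
    def cos(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        return dot / (math.sqrt(sum(x * x for x in a)) *
                      math.sqrt(sum(y * y for y in b)))

    items = list(table.items())
    kept, removed = dict(items[:nr_row]), items[nr_row:]
    remap = {}
    for word, vec in removed:
        score, target = max((cos(vec, kv), kw) for kw, kv in kept.items())
        remap[word] = (target, score)
    return kept, remap

table = {"cat": [1.0, 0.0], "car": [0.0, 1.0], "kitten": [0.9, 0.1]}
kept, remap = prune(table, 2)
```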
 ## Vocab.get_vector {#get_vector tag="method" new="2"}
 
@@ -178,12 +179,12 @@ subword features by average over ngrams of `orth` (introduced in spaCy `v2.1`).
 > nlp.vocab.get_vector("apple", minn=1, maxn=5)
 > ```
 
-| Name | Type | Description |
-| ----------------------------------- | ---------------------------------------- | ---------------------------------------------------------------------------------------------- |
-| `orth` | int / str | The hash value of a word, or its unicode string. |
-| `minn` <Tag variant="new">2.1</Tag> | int | Minimum n-gram length used for FastText's ngram computation. Defaults to the length of `orth`. |
-| `maxn` <Tag variant="new">2.1</Tag> | int | Maximum n-gram length used for FastText's ngram computation. Defaults to the length of `orth`. |
-| **RETURNS** | `numpy.ndarray[ndim=1, dtype='float32']` | A word vector. Size and shape are determined by the `Vocab.vectors` instance. |
+| Name | Description |
+| ----------------------------------- | ---------------------------------------------------------------------------------------------------------------------- |
+| `orth` | The hash value of a word, or its unicode string. ~~Union[int, str]~~ |
+| `minn` <Tag variant="new">2.1</Tag> | Minimum n-gram length used for FastText's ngram computation. Defaults to the length of `orth`. ~~int~~ |
+| `maxn` <Tag variant="new">2.1</Tag> | Maximum n-gram length used for FastText's ngram computation. Defaults to the length of `orth`. ~~int~~ |
+| **RETURNS** | A word vector. Size and shape are determined by the `Vocab.vectors` instance. ~~numpy.ndarray[ndim=1, dtype=float32]~~ |
 
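The FastText-style subword fallback that the `minn`/`maxn` parameters describe — collect character n-grams of the word and average their vectors — can be sketched as a toy. The helper names and the tiny lookup table here are hypothetical, for illustration only:

```python
# Toy sketch of the FastText-style ngram averaging behind get_vector's
# minn/maxn parameters. Helper names and the lookup table are
# hypothetical; this is not spaCy's implementation.

def char_ngrams(word, minn, maxn):
    return [
        word[i : i + n]
        for n in range(minn, maxn + 1)
        for i in range(len(word) - n + 1)
    ]

def ngram_vector(word, lookup, minn, maxn, dims=2):
    grams = [g for g in char_ngrams(word, minn, maxn) if g in lookup]
    if not grams:
        return [0.0] * dims  # no known subwords: zero vector
    return [sum(lookup[g][d] for g in grams) / len(grams) for d in range(dims)]

lookup = {"ap": [1.0, 0.0], "pp": [0.0, 1.0]}
vec = ngram_vector("apple", lookup, minn=2, maxn=2)
```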
## Vocab.set_vector {#set_vector tag="method" new="2"}
|
## Vocab.set_vector {#set_vector tag="method" new="2"}
|
||||||
|
|
||||||
|
@ -196,10 +197,10 @@ or hash value.
|
||||||
> nlp.vocab.set_vector("apple", array([...]))
|
> nlp.vocab.set_vector("apple", array([...]))
|
||||||
> ```
|
> ```
|
||||||
|
|
||||||
| Name | Type | Description |
|
| Name | Description |
|
||||||
| -------- | ---------------------------------------- | ------------------------------------------------ |
|
| -------- | -------------------------------------------------------------------- |
|
||||||
| `orth` | int / str | The hash value of a word, or its unicode string. |
|
| `orth` | The hash value of a word, or its unicode string. ~~Union[int, str]~~ |
|
||||||
| `vector` | `numpy.ndarray[ndim=1, dtype='float32']` | The vector to set. |
|
| `vector` | The vector to set. ~~numpy.ndarray[ndim=1, dtype=float32]~~ |
|
||||||
|
|
||||||
## Vocab.has_vector {#has_vector tag="method" new="2"}
|
## Vocab.has_vector {#has_vector tag="method" new="2"}
|
||||||
|
|
||||||
|
@ -213,10 +214,10 @@ Words can be looked up by string or hash value.
|
||||||
> vector = nlp.vocab.get_vector("apple")
|
> vector = nlp.vocab.get_vector("apple")
|
||||||
> ```
|
> ```
|
||||||
|
|
||||||
| Name | Type | Description |
|
| Name | Description |
|
||||||
| ----------- | --------- | ------------------------------------------------ |
|
| ----------- | -------------------------------------------------------------------- |
|
||||||
| `orth` | int / str | The hash value of a word, or its unicode string. |
|
| `orth` | The hash value of a word, or its unicode string. ~~Union[int, str]~~ |
|
||||||
| **RETURNS** | bool | Whether the word has a vector. |
|
| **RETURNS** | Whether the word has a vector. ~~bool~~ |
|
||||||
|
|
||||||
## Vocab.to_disk {#to_disk tag="method" new="2"}
|
## Vocab.to_disk {#to_disk tag="method" new="2"}
|
||||||
|
|
||||||
|
@ -228,11 +229,11 @@ Save the current state to a directory.
|
||||||
> nlp.vocab.to_disk("/path/to/vocab")
|
> nlp.vocab.to_disk("/path/to/vocab")
|
||||||
> ```
|
> ```
|
||||||
|
|
||||||
| Name | Type | Description |
|
| Name | Description |
|
||||||
| -------------- | --------------- | --------------------------------------------------------------------------------------------------------------------- |
|
| -------------- | ------------------------------------------------------------------------------------------------------------------------------------------ |
|
||||||
| `path` | str / `Path` | A path to a directory, which will be created if it doesn't exist. Paths may be either strings or `Path`-like objects. |
|
| `path` | A path to a directory, which will be created if it doesn't exist. Paths may be either strings or `Path`-like objects. ~~Union[str, Path]~~ |
|
||||||
| _keyword-only_ | | |
|
| _keyword-only_ | |
|
||||||
| `exclude` | `Iterable[str]` | String names of [serialization fields](#serialization-fields) to exclude. |
|
| `exclude` | String names of [serialization fields](#serialization-fields) to exclude. ~~Iterable[str]~~ |
|
||||||
|
|
||||||
## Vocab.from_disk {#from_disk tag="method" new="2"}
|
## Vocab.from_disk {#from_disk tag="method" new="2"}
|
||||||
|
|
||||||
|
@@ -245,12 +246,12 @@ Loads state from a directory. Modifies the object in place and returns it.

 > vocab = Vocab().from_disk("/path/to/vocab")
 > ```

-| Name | Type | Description |
-| -------------- | --------------- | -------------------------------------------------------------------------- |
-| `path` | str / `Path` | A path to a directory. Paths may be either strings or `Path`-like objects. |
-| _keyword-only_ | | |
-| `exclude` | `Iterable[str]` | String names of [serialization fields](#serialization-fields) to exclude. |
-| **RETURNS** | `Vocab` | The modified `Vocab` object. |
+| Name | Description |
+| -------------- | ----------------------------------------------------------------------------------------------- |
+| `path` | A path to a directory. Paths may be either strings or `Path`-like objects. ~~Union[str, Path]~~ |
+| _keyword-only_ | |
+| `exclude` | String names of [serialization fields](#serialization-fields) to exclude. ~~Iterable[str]~~ |
+| **RETURNS** | The modified `Vocab` object. ~~Vocab~~ |
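Taken together, `to_disk` and `from_disk` give a simple round trip. A minimal sketch, assuming spaCy is installed (the temporary directory and the `"coffee"` entry are illustrative, not part of the docs above):

```python
import tempfile
from pathlib import Path

from spacy.vocab import Vocab

vocab = Vocab()
vocab.strings.add("coffee")  # give the vocab something to serialize

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "vocab"  # to_disk creates the directory if needed
    vocab.to_disk(path)
    # from_disk modifies the Vocab in place and returns it
    restored = Vocab().from_disk(path)

assert "coffee" in restored.strings
```

Both methods also take the keyword-only `exclude` argument listed above, e.g. `vocab.to_disk(path, exclude=["vectors"])` to skip the vector table.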
 ## Vocab.to_bytes {#to_bytes tag="method"}

@@ -262,11 +263,11 @@ Serialize the current state to a binary string.

 > vocab_bytes = nlp.vocab.to_bytes()
 > ```

-| Name | Type | Description |
-| -------------- | --------------- | ------------------------------------------------------------------------- |
-| _keyword-only_ | | |
-| `exclude` | `Iterable[str]` | String names of [serialization fields](#serialization-fields) to exclude. |
-| **RETURNS** | bytes | The serialized form of the `Vocab` object. |
+| Name | Description |
+| -------------- | ------------------------------------------------------------------------------------------- |
+| _keyword-only_ | |
+| `exclude` | String names of [serialization fields](#serialization-fields) to exclude. ~~Iterable[str]~~ |
+| **RETURNS** | The serialized form of the `Vocab` object. ~~bytes~~ |
 ## Vocab.from_bytes {#from_bytes tag="method"}

@@ -281,12 +282,12 @@ Load state from a binary string.

 > vocab.from_bytes(vocab_bytes)
 > ```

-| Name | Type | Description |
-| -------------- | --------------- | ------------------------------------------------------------------------- |
-| `bytes_data` | bytes | The data to load from. |
-| _keyword-only_ | | |
-| `exclude` | `Iterable[str]` | String names of [serialization fields](#serialization-fields) to exclude. |
-| **RETURNS** | `Vocab` | The `Vocab` object. |
+| Name | Description |
+| -------------- | ------------------------------------------------------------------------------------------- |
+| `bytes_data` | The data to load from. ~~bytes~~ |
+| _keyword-only_ | |
+| `exclude` | String names of [serialization fields](#serialization-fields) to exclude. ~~Iterable[str]~~ |
+| **RETURNS** | The `Vocab` object. ~~Vocab~~ |
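The bytes-based counterpart works the same way as the disk methods. A minimal sketch, assuming spaCy is installed (the `"avocado"` entry is illustrative):

```python
from spacy.vocab import Vocab

vocab = Vocab()
vocab.strings.add("avocado")

vocab_bytes = vocab.to_bytes()  # serialized state as a bytes object
assert isinstance(vocab_bytes, bytes)

# from_bytes loads into an existing Vocab and returns it
restored = Vocab().from_bytes(vocab_bytes)
assert "avocado" in restored.strings
```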
 ## Attributes {#attributes}

@@ -299,13 +300,13 @@ Load state from a binary string.

 > assert type(PERSON) == int
 > ```

-| Name | Type | Description |
-| --------------------------------------------- | ------------- | ------------------------------------------------------------ |
-| `strings` | `StringStore` | A table managing the string-to-int mapping. |
-| `vectors` <Tag variant="new">2</Tag> | `Vectors` | A table associating word IDs to word vectors. |
-| `vectors_length` | int | Number of dimensions for each word vector. |
-| `lookups` | `Lookups` | The available lookup tables in this vocab. |
-| `writing_system` <Tag variant="new">2.1</Tag> | dict | A dict with information about the language's writing system. |
+| Name | Description |
+| --------------------------------------------- | ------------------------------------------------------------------------------- |
+| `strings` | A table managing the string-to-int mapping. ~~StringStore~~ |
+| `vectors` <Tag variant="new">2</Tag> | A table associating word IDs to word vectors. ~~Vectors~~ |
+| `vectors_length` | Number of dimensions for each word vector. ~~int~~ |
+| `lookups` | The available lookup tables in this vocab. ~~Lookups~~ |
+| `writing_system` <Tag variant="new">2.1</Tag> | A dict with information about the language's writing system. ~~Dict[str, Any]~~ |
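A few of the attributes above can be exercised directly on a bare `Vocab`. A small sketch, assuming spaCy is installed (the `"spaCy"` string is illustrative):

```python
from spacy.vocab import Vocab

vocab = Vocab()

# strings is a StringStore: adding a string returns its 64-bit hash,
# and the hash maps back to the original string
h = vocab.strings.add("spaCy")
assert vocab.strings[h] == "spaCy"

# vectors_length is an int: the width of the vector table
# (0 when no vectors are loaded)
assert isinstance(vocab.vectors_length, int)

# writing_system is a dict describing the language's script
assert isinstance(vocab.writing_system, dict)
```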
 ## Serialization fields {#serialization-fields}
 97  website/docs/images/layers-architectures.svg  (new file, 50 KiB; diff suppressed because one or more lines are too long)
 92  website/docs/images/projects.svg              (new file, 40 KiB; diff suppressed because one or more lines are too long)
BIN  website/docs/images/sense2vec.jpg             (new binary file, 224 KiB; not shown)
Some files were not shown because too many files have changed in this diff.