Merge branch 'develop' of https://github.com/explosion/spaCy into develop

This commit is contained in:
Matthew Honnibal 2020-08-23 18:32:24 +02:00
commit 3828bc3ed0
50 changed files with 1424 additions and 358 deletions

View File

@ -5,7 +5,7 @@
Thanks for your interest in contributing to spaCy 🎉 The project is maintained
by [@honnibal](https://github.com/honnibal) and [@ines](https://github.com/ines),
and we'll do our best to help you get started. This page will give you a quick
overview of how things are organised and most importantly, how to get involved.
overview of how things are organized and most importantly, how to get involved.
## Table of contents
@ -195,7 +195,7 @@ modules in `.py` files, not Cython modules in `.pyx` and `.pxd` files.**
### Code formatting
[`black`](https://github.com/ambv/black) is an opinionated Python code
formatter, optimised to produce readable code and small diffs. You can run
formatter, optimized to produce readable code and small diffs. You can run
`black` from the command-line, or via your code editor. For example, if you're
using [Visual Studio Code](https://code.visualstudio.com/), you can add the
following to your `settings.json` to use `black` for formatting and auto-format
@ -286,7 +286,7 @@ Code that interacts with the file-system should accept objects that follow the
If the function is user-facing and takes a path as an argument, it should check
whether the path is provided as a string. Strings should be converted to
`pathlib.Path` objects. Serialization and deserialization functions should always
accept **file-like objects**, as it makes the library io-agnostic. Working on
accept **file-like objects**, as it makes the library IO-agnostic. Working on
buffers makes the code more general, easier to test, and compatible with Python
3's asynchronous IO.
@ -384,7 +384,7 @@ of Python and C++, with additional complexity and syntax from numpy. The
many "traps for new players". Working in Cython is very rewarding once you're
over the initial learning curve. As with C and C++, the first way you write
something in Cython will often be the performance-optimal approach. In contrast,
Python optimisation generally requires a lot of experimentation. Is it faster to
Python optimization generally requires a lot of experimentation. Is it faster to
have an `if item in my_dict` check, or to use `.get()`? What about `try`/`except`?
Does this numpy operation create a copy? There's no way to guess the answers to
these questions, and you'll usually be dissatisfied with your results — so
@ -400,7 +400,7 @@ Python. If it's not fast enough the first time, just switch to Cython.
- [PEP 8 Style Guide for Python Code](https://www.python.org/dev/peps/pep-0008/) (python.org)
- [Official Cython documentation](http://docs.cython.org/en/latest/) (cython.org)
- [Writing C in Cython](https://explosion.ai/blog/writing-c-in-cython) (explosion.ai)
- [Multi-threading spaCys parser and named entity recogniser](https://explosion.ai/blog/multithreading-with-cython) (explosion.ai)
- [Multi-threading spaCys parser and named entity recognizer](https://explosion.ai/blog/multithreading-with-cython) (explosion.ai)
## Adding tests
@ -412,7 +412,7 @@ name. For example, tests for the `Tokenizer` can be found in
all test files and test functions need to be prefixed with `test_`.
When adding tests, make sure to use descriptive names, keep the code short and
concise and only test for one behaviour at a time. Try to `parametrize` test
concise and only test for one behavior at a time. Try to `parametrize` test
cases wherever possible, use our pre-defined fixtures for spaCy components and
avoid unnecessary imports.

View File

@ -49,9 +49,8 @@ It's commercial open-source software, released under the MIT license.
## 💬 Where to ask questions
The spaCy project is maintained by [@honnibal](https://github.com/honnibal) and
[@ines](https://github.com/ines), along with core contributors
[@svlandeg](https://github.com/svlandeg) and
The spaCy project is maintained by [@honnibal](https://github.com/honnibal),
[@ines](https://github.com/ines), [@svlandeg](https://github.com/svlandeg) and
[@adrianeboyd](https://github.com/adrianeboyd). Please understand that we won't
be able to provide individual support via email. We also believe that help is
much more valuable if it's shared publicly, so that more people can benefit from

View File

@ -6,9 +6,10 @@ requires = [
"cymem>=2.0.2,<2.1.0",
"preshed>=3.0.2,<3.1.0",
"murmurhash>=0.28.0,<1.1.0",
"thinc>=8.0.0a28,<8.0.0a30",
"thinc>=8.0.0a29,<8.0.0a40",
"blis>=0.4.0,<0.5.0",
"pytokenizations",
"smart_open>=2.0.0,<3.0.0"
"smart_open>=2.0.0,<3.0.0",
"pathy"
]
build-backend = "setuptools.build_meta"

View File

@ -1,7 +1,7 @@
# Our libraries
cymem>=2.0.2,<2.1.0
preshed>=3.0.2,<3.1.0
thinc>=8.0.0a28,<8.0.0a30
thinc>=8.0.0a29,<8.0.0a40
blis>=0.4.0,<0.5.0
ml_datasets>=0.1.1
murmurhash>=0.28.0,<1.1.0
@ -9,6 +9,7 @@ wasabi>=0.7.1,<1.1.0
srsly>=2.1.0,<3.0.0
catalogue>=0.0.7,<1.1.0
typer>=0.3.0,<0.4.0
pathy
# Third party dependencies
numpy>=1.15.0
requests>=2.13.0,<3.0.0

View File

@ -34,18 +34,19 @@ setup_requires =
cymem>=2.0.2,<2.1.0
preshed>=3.0.2,<3.1.0
murmurhash>=0.28.0,<1.1.0
thinc>=8.0.0a28,<8.0.0a30
thinc>=8.0.0a29,<8.0.0a40
install_requires =
# Our libraries
murmurhash>=0.28.0,<1.1.0
cymem>=2.0.2,<2.1.0
preshed>=3.0.2,<3.1.0
thinc>=8.0.0a28,<8.0.0a30
thinc>=8.0.0a29,<8.0.0a40
blis>=0.4.0,<0.5.0
wasabi>=0.7.1,<1.1.0
srsly>=2.1.0,<3.0.0
catalogue>=0.0.7,<1.1.0
typer>=0.3.0,<0.4.0
pathy
# Third-party dependencies
tqdm>=4.38.0,<5.0.0
numpy>=1.15.0

View File

@ -1,6 +1,6 @@
# fmt: off
__title__ = "spacy-nightly"
__version__ = "3.0.0a8"
__version__ = "3.0.0a9"
__release__ = True
__download_url__ = "https://github.com/explosion/spacy-models/releases/download"
__compatibility__ = "https://raw.githubusercontent.com/explosion/spacy-models/master/compatibility.json"

View File

@ -21,6 +21,8 @@ from .project.clone import project_clone # noqa: F401
from .project.assets import project_assets # noqa: F401
from .project.run import project_run # noqa: F401
from .project.dvc import project_update_dvc # noqa: F401
from .project.push import project_push # noqa: F401
from .project.pull import project_pull # noqa: F401
@app.command("link", no_args_is_help=True, deprecated=True, hidden=True)

View File

@ -1,4 +1,5 @@
from typing import Dict, Any, Union, List, Optional
from typing import Dict, Any, Union, List, Optional, TYPE_CHECKING
import sys
from pathlib import Path
from wasabi import msg
import srsly
@ -8,11 +9,13 @@ from typer.main import get_command
from contextlib import contextmanager
from thinc.config import Config, ConfigValidationError
from configparser import InterpolationError
import sys
from ..schemas import ProjectConfigSchema, validate
from ..util import import_file
if TYPE_CHECKING:
from pathy import Pathy # noqa: F401
PROJECT_FILE = "project.yml"
PROJECT_LOCK = "project.lock"
@ -93,11 +96,12 @@ def parse_config_overrides(args: List[str]) -> Dict[str, Any]:
return result
def load_project_config(path: Path) -> Dict[str, Any]:
def load_project_config(path: Path, interpolate: bool = True) -> Dict[str, Any]:
"""Load the project.yml file from a directory and validate it. Also make
sure that all directories defined in the config exist.
path (Path): The path to the project directory.
interpolate (bool): Whether to substitute project variables.
RETURNS (Dict[str, Any]): The loaded project.yml.
"""
config_path = path / PROJECT_FILE
@ -110,16 +114,34 @@ def load_project_config(path: Path) -> Dict[str, Any]:
msg.fail(invalid_err, e, exits=1)
errors = validate(ProjectConfigSchema, config)
if errors:
msg.fail(invalid_err, "\n".join(errors), exits=1)
msg.fail(invalid_err)
print("\n".join(errors))
sys.exit(1)
validate_project_commands(config)
# Make sure directories defined in config exist
for subdir in config.get("directories", []):
dir_path = path / subdir
if not dir_path.exists():
dir_path.mkdir(parents=True)
if interpolate:
err = "project.yml validation error"
with show_validation_error(title=err, hint_fill=False):
config = substitute_project_variables(config)
return config
def substitute_project_variables(config: Dict[str, Any], overrides: Dict = {}):
key = "vars"
config.setdefault(key, {})
config[key].update(overrides)
# Need to put variables in the top scope again so we can have a top-level
# section "project" (otherwise, a list of commands in the top scope wouldn't)
# be allowed by Thinc's config system
cfg = Config({"project": config, key: config[key]})
interpolated = cfg.interpolate()
return dict(interpolated["project"])
def validate_project_commands(config: Dict[str, Any]) -> None:
"""Check that project commands and workflows are valid, don't contain
duplicates, don't clash and only refer to commands that exist.
@ -230,3 +252,39 @@ def get_sourced_components(config: Union[Dict[str, Any], Config]) -> List[str]:
for name, cfg in config.get("components", {}).items()
if "factory" not in cfg and "source" in cfg
]
def upload_file(src: Path, dest: Union[str, "Pathy"]) -> None:
"""Upload a file.
src (Path): The source path.
url (str): The destination URL to upload to.
"""
dest = ensure_pathy(dest)
with dest.open(mode="wb") as output_file:
with src.open(mode="rb") as input_file:
output_file.write(input_file.read())
def download_file(src: Union[str, "Pathy"], dest: Path, *, force: bool = False) -> None:
"""Download a file using smart_open.
url (str): The URL of the file.
dest (Path): The destination path.
force (bool): Whether to force download even if file exists.
If False, the download will be skipped.
"""
if dest.exists() and not force:
return None
src = ensure_pathy(src)
with src.open(mode="rb") as input_file:
with dest.open(mode="wb") as output_file:
output_file.write(input_file.read())
def ensure_pathy(path):
"""Temporary helper to prevent importing Pathy globally (which can cause
slow and annoying Google Cloud warning)."""
from pathy import Pathy # noqa: F811
return Pathy(path)

View File

@ -24,7 +24,7 @@ class Optimizations(str, Enum):
@init_cli.command("config")
def init_config_cli(
# fmt: off
output_file: Path = Arg("-", help="File to save config.cfg to (or - for stdout)", allow_dash=True),
output_file: Path = Arg(..., help="File to save config.cfg to or - for stdout (will only output config and no additional logging info)", allow_dash=True),
lang: Optional[str] = Opt("en", "--lang", "-l", help="Two-letter code of the language to use"),
pipeline: Optional[str] = Opt("tagger,parser,ner", "--pipeline", "-p", help="Comma-separated names of trainable pipeline components to include in the model (without 'tok2vec' or 'transformer')"),
optimize: Optimizations = Opt(Optimizations.efficiency.value, "--optimize", "-o", help="Whether to optimize for efficiency (faster inference, smaller model, lower memory consumption) or higher accuracy (potentially larger and slower model). This will impact the choice of architecture, pretrained weights and related hyperparameters."),
@ -110,6 +110,13 @@ def init_config(
"word_vectors": reco["word_vectors"],
"has_letters": reco["has_letters"],
}
if variables["transformer_data"] and not has_spacy_transformers():
msg.warn(
"To generate a more effective transformer-based config (GPU-only), "
"install the spacy-transformers package and re-run this command. "
"The config generated now does not use transformers."
)
variables["transformer_data"] = None
base_template = template.render(variables).strip()
# Giving up on getting the newlines right in jinja for now
base_template = re.sub(r"\n\n\n+", "\n\n", base_template)
@ -126,8 +133,6 @@ def init_config(
for label, value in use_case.items():
msg.text(f"- {label}: {value}")
use_transformer = bool(template_vars.use_transformer)
if use_transformer:
require_spacy_transformers(msg)
with show_validation_error(hint_fill=False):
config = util.load_config_from_str(base_template)
nlp, _ = util.load_model_from_config(config, auto_fill=True)
@ -149,12 +154,10 @@ def save_config(config: Config, output_file: Path, is_stdout: bool = False) -> N
print(f"{COMMAND} train {output_file.parts[-1]} {' '.join(variables)}")
def require_spacy_transformers(msg: Printer) -> None:
def has_spacy_transformers() -> bool:
try:
import spacy_transformers # noqa: F401
return True
except ImportError:
msg.fail(
"Using a transformer-based pipeline requires spacy-transformers "
"to be installed.",
exits=1,
)
return False

View File

@ -4,10 +4,10 @@ from wasabi import msg
import re
import shutil
import requests
import smart_open
from ...util import ensure_path, working_dir
from .._util import project_cli, Arg, PROJECT_FILE, load_project_config, get_checksum
from .._util import download_file
# TODO: find a solution for caches
@ -44,16 +44,14 @@ def project_assets(project_dir: Path) -> None:
if not assets:
msg.warn(f"No assets specified in {PROJECT_FILE}", exits=0)
msg.info(f"Fetching {len(assets)} asset(s)")
variables = config.get("variables", {})
for asset in assets:
dest = asset["dest"].format(**variables)
dest = asset["dest"]
url = asset.get("url")
checksum = asset.get("checksum")
if not url:
# project.yml defines asset without URL that the user has to place
check_private_asset(dest, checksum)
continue
url = url.format(**variables)
fetch_asset(project_path, url, dest, checksum)
@ -132,15 +130,3 @@ def convert_asset_url(url: str) -> str:
)
return converted
return url
def download_file(url: str, dest: Path, chunk_size: int = 1024) -> None:
"""Download a file using smart_open.
url (str): The URL of the file.
dest (Path): The destination path.
chunk_size (int): The size of chunks to read/write.
"""
with smart_open.open(url, mode="rb") as input_file:
with dest.open(mode="wb") as output_file:
output_file.write(input_file.read())

View File

@ -99,7 +99,6 @@ def update_dvc_config(
if ref_hash == config_hash and not force:
return False # Nothing has changed in project.yml, don't need to update
dvc_config_path.unlink()
variables = config.get("variables", {})
dvc_commands = []
config_commands = {cmd["name"]: cmd for cmd in config.get("commands", [])}
for name in workflows[workflow]:
@ -122,7 +121,7 @@ def update_dvc_config(
dvc_commands.append(join_command(full_cmd))
with working_dir(path):
dvc_flags = {"--verbose": verbose, "--quiet": silent}
run_dvc_commands(dvc_commands, variables, flags=dvc_flags)
run_dvc_commands(dvc_commands, flags=dvc_flags)
with dvc_config_path.open("r+", encoding="utf8") as f:
content = f.read()
f.seek(0, 0)
@ -131,23 +130,16 @@ def update_dvc_config(
def run_dvc_commands(
commands: List[str] = tuple(),
variables: Dict[str, str] = {},
flags: Dict[str, bool] = {},
commands: List[str] = tuple(), flags: Dict[str, bool] = {},
) -> None:
"""Run a sequence of DVC commands in a subprocess, in order.
commands (List[str]): The string commands without the leading "dvc".
variables (Dict[str, str]): Dictionary of variable names, mapped to their
values. Will be used to substitute format string variables in the
commands.
flags (Dict[str, bool]): Conditional flags to be added to command. Makes it
easier to pass flags like --quiet that depend on a variable or
command-line setting while avoiding lots of nested conditionals.
"""
for command in commands:
# Substitute variables, e.g. "./{NAME}.json"
command = command.format(**variables)
command = split_command(command)
dvc_command = ["dvc", *command]
# Add the flags if they are set to True

36
spacy/cli/project/pull.py Normal file
View File

@ -0,0 +1,36 @@
from pathlib import Path
from wasabi import msg
from .remote_storage import RemoteStorage
from .remote_storage import get_command_hash
from .._util import project_cli, Arg
from .._util import load_project_config
@project_cli.command("pull")
def project_pull_cli(
# fmt: off
remote: str = Arg("default", help="Name or path of remote storage"),
project_dir: Path = Arg(Path.cwd(), help="Location of project directory. Defaults to current working directory.", exists=True, file_okay=False),
# fmt: on
):
"""Retrieve any precomputed outputs from a remote storage that are available.
You can alias remotes in your project.yml by mapping them to storage paths.
A storage can be anything that the smart-open library can upload to, e.g.
gcs, aws, ssh, local directories etc
"""
for url, output_path in project_pull(project_dir, remote):
if url is not None:
msg.good(f"Pulled {output_path} from {url}")
def project_pull(project_dir: Path, remote: str, *, verbose: bool = False):
config = load_project_config(project_dir)
if remote in config.get("remotes", {}):
remote = config["remotes"][remote]
storage = RemoteStorage(project_dir, remote)
for cmd in config.get("commands", []):
deps = [project_dir / dep for dep in cmd.get("deps", [])]
cmd_hash = get_command_hash("", "", deps, cmd["script"])
for output_path in cmd.get("outputs", []):
url = storage.pull(output_path, command_hash=cmd_hash)
yield url, output_path

48
spacy/cli/project/push.py Normal file
View File

@ -0,0 +1,48 @@
from pathlib import Path
from wasabi import msg
from .remote_storage import RemoteStorage
from .remote_storage import get_content_hash, get_command_hash
from .._util import load_project_config
from .._util import project_cli, Arg
@project_cli.command("push")
def project_push_cli(
# fmt: off
remote: str = Arg("default", help="Name or path of remote storage"),
project_dir: Path = Arg(Path.cwd(), help="Location of project directory. Defaults to current working directory.", exists=True, file_okay=False),
# fmt: on
):
"""Persist outputs to a remote storage. You can alias remotes in your project.yml
by mapping them to storage paths. A storage can be anything that the smart-open
library can upload to, e.g. gcs, aws, ssh, local directories etc
"""
for output_path, url in project_push(project_dir, remote):
if url is None:
msg.info(f"Skipping {output_path}")
else:
msg.good(f"Pushed {output_path} to {url}")
def project_push(project_dir: Path, remote: str):
"""Persist outputs to a remote storage. You can alias remotes in your project.yml
by mapping them to storage paths. A storage can be anything that the smart-open
library can upload to, e.g. gcs, aws, ssh, local directories etc
"""
config = load_project_config(project_dir)
if remote in config.get("remotes", {}):
remote = config["remotes"][remote]
storage = RemoteStorage(project_dir, remote)
for cmd in config.get("commands", []):
cmd_hash = get_command_hash(
"", "", [project_dir / dep for dep in cmd.get("deps", [])], cmd["script"]
)
for output_path in cmd.get("outputs", []):
output_loc = project_dir / output_path
if output_loc.exists():
url = storage.push(
output_path,
command_hash=cmd_hash,
content_hash=get_content_hash(output_loc),
)
yield output_path, url

View File

@ -0,0 +1,169 @@
from typing import Optional, List, Dict, TYPE_CHECKING
import os
import site
import hashlib
import urllib.parse
import tarfile
from pathlib import Path
from .._util import get_hash, get_checksum, download_file, ensure_pathy
from ...util import make_tempdir
if TYPE_CHECKING:
from pathy import Pathy # noqa: F401
class RemoteStorage:
"""Push and pull outputs to and from a remote file storage.
Remotes can be anything that `smart-open` can support: AWS, GCS, file system,
ssh, etc.
"""
def __init__(self, project_root: Path, url: str, *, compression="gz"):
self.root = project_root
self.url = ensure_pathy(url)
self.compression = compression
def push(self, path: Path, command_hash: str, content_hash: str) -> "Pathy":
"""Compress a file or directory within a project and upload it to a remote
storage. If an object exists at the full URL, nothing is done.
Within the remote storage, files are addressed by their project path
(url encoded) and two user-supplied hashes, representing their creation
context and their file contents. If the URL already exists, the data is
not uploaded. Paths are archived and compressed prior to upload.
"""
loc = self.root / path
if not loc.exists():
raise IOError(f"Cannot push {loc}: does not exist.")
url = self.make_url(path, command_hash, content_hash)
if url.exists():
return None
tmp: Path
with make_tempdir() as tmp:
tar_loc = tmp / self.encode_name(str(path))
mode_string = f"w:{self.compression}" if self.compression else "w"
with tarfile.open(tar_loc, mode=mode_string) as tar_file:
tar_file.add(str(loc), arcname=str(path))
with tar_loc.open(mode="rb") as input_file:
with url.open(mode="wb") as output_file:
output_file.write(input_file.read())
return url
def pull(
self,
path: Path,
*,
command_hash: Optional[str] = None,
content_hash: Optional[str] = None,
) -> Optional["Pathy"]:
"""Retrieve a file from the remote cache. If the file already exists,
nothing is done.
If the command_hash and/or content_hash are specified, only matching
results are returned. If no results are available, an error is raised.
"""
dest = self.root / path
if dest.exists():
return None
url = self.find(path, command_hash=command_hash, content_hash=content_hash)
if url is None:
return url
else:
# Make sure the destination exists
if not dest.parent.exists():
dest.parent.mkdir(parents=True)
tmp: Path
with make_tempdir() as tmp:
tar_loc = tmp / url.parts[-1]
download_file(url, tar_loc)
mode_string = f"r:{self.compression}" if self.compression else "r"
with tarfile.open(tar_loc, mode=mode_string) as tar_file:
# This requires that the path is added correctly, relative
# to root. This is how we set things up in push()
tar_file.extractall(self.root)
return url
def find(
self,
path: Path,
*,
command_hash: Optional[str] = None,
content_hash: Optional[str] = None,
) -> Optional["Pathy"]:
"""Find the best matching version of a file within the storage,
or `None` if no match can be found. If both the creation and content hash
are specified, only exact matches will be returned. Otherwise, the most
recent matching file is preferred.
"""
name = self.encode_name(str(path))
if command_hash is not None and content_hash is not None:
url = self.make_url(path, command_hash, content_hash)
urls = [url] if url.exists() else []
elif command_hash is not None:
urls = list((self.url / name / command_hash).iterdir())
else:
urls = list((self.url / name).iterdir())
if content_hash is not None:
urls = [url for url in urls if url.parts[-1] == content_hash]
return urls[-1] if urls else None
def make_url(self, path: Path, command_hash: str, content_hash: str) -> "Pathy":
"""Construct a URL from a subpath, a creation hash and a content hash."""
return self.url / self.encode_name(str(path)) / command_hash / content_hash
def encode_name(self, name: str) -> str:
"""Encode a subpath into a URL-safe name."""
return urllib.parse.quote_plus(name)
def get_content_hash(loc: Path) -> str:
return get_checksum(loc)
def get_command_hash(
site_hash: str, env_hash: str, deps: List[Path], cmd: List[str]
) -> str:
"""Create a hash representing the execution of a command. This includes the
currently installed packages, whatever environment variables have been marked
as relevant, and the command.
"""
hashes = [site_hash, env_hash] + [get_checksum(dep) for dep in sorted(deps)]
hashes.extend(cmd)
creation_bytes = "".join(hashes).encode("utf8")
return hashlib.md5(creation_bytes).hexdigest()
def get_site_hash():
"""Hash the current Python environment's site-packages contents, including
the name and version of the libraries. The list we're hashing is what
`pip freeze` would output.
"""
site_dirs = site.getsitepackages()
if site.ENABLE_USER_SITE:
site_dirs.extend(site.getusersitepackages())
packages = set()
for site_dir in site_dirs:
site_dir = Path(site_dir)
for subpath in site_dir.iterdir():
if subpath.parts[-1].endswith("dist-info"):
packages.add(subpath.parts[-1].replace(".dist-info", ""))
package_bytes = "".join(sorted(packages)).encode("utf8")
return hashlib.md5sum(package_bytes).hexdigest()
def get_env_hash(env: Dict[str, str]) -> str:
"""Construct a hash of the environment variables that will be passed into
the commands.
Values in the env dict may be references to the current os.environ, using
the syntax $ENV_VAR to mean os.environ[ENV_VAR]
"""
env_vars = {}
for key, value in env.items():
if value.startswith("$"):
env_vars[key] = os.environ.get(value[1:], "")
else:
env_vars[key] = value
return get_hash(env_vars)

View File

@ -44,7 +44,6 @@ def project_run(
dry (bool): Perform a dry run and don't execute commands.
"""
config = load_project_config(project_dir)
variables = config.get("variables", {})
commands = {cmd["name"]: cmd for cmd in config.get("commands", [])}
workflows = config.get("workflows", {})
validate_subcommand(commands.keys(), workflows.keys(), subcommand)
@ -54,22 +53,20 @@ def project_run(
project_run(project_dir, cmd, force=force, dry=dry)
else:
cmd = commands[subcommand]
variables = config.get("variables", {})
for dep in cmd.get("deps", []):
dep = dep.format(**variables)
if not (project_dir / dep).exists():
err = f"Missing dependency specified by command '{subcommand}': {dep}"
err_kwargs = {"exits": 1} if not dry else {}
msg.fail(err, **err_kwargs)
with working_dir(project_dir) as current_dir:
rerun = check_rerun(current_dir, cmd, variables)
rerun = check_rerun(current_dir, cmd)
if not rerun and not force:
msg.info(f"Skipping '{cmd['name']}': nothing changed")
else:
msg.divider(subcommand)
run_commands(cmd["script"], variables, dry=dry)
run_commands(cmd["script"], dry=dry)
if not dry:
update_lockfile(current_dir, cmd, variables)
update_lockfile(current_dir, cmd)
def print_run_help(project_dir: Path, subcommand: Optional[str] = None) -> None:
@ -115,23 +112,15 @@ def print_run_help(project_dir: Path, subcommand: Optional[str] = None) -> None:
def run_commands(
commands: List[str] = tuple(),
variables: Dict[str, Any] = {},
silent: bool = False,
dry: bool = False,
commands: List[str] = tuple(), silent: bool = False, dry: bool = False,
) -> None:
"""Run a sequence of commands in a subprocess, in order.
commands (List[str]): The string commands.
variables (Dict[str, Any]): Dictionary of variable names, mapped to their
values. Will be used to substitute format string variables in the
commands.
silent (bool): Don't print the commands.
dry (bool): Perform a dry run and don't execut anything.
"""
for command in commands:
# Substitute variables, e.g. "./{NAME}.json"
command = command.format(**variables)
command = split_command(command)
# Not sure if this is needed or a good idea. Motivation: users may often
# use commands in their config that reference "python" and we want to
@ -173,15 +162,12 @@ def validate_subcommand(
)
def check_rerun(
project_dir: Path, command: Dict[str, Any], variables: Dict[str, Any]
) -> bool:
def check_rerun(project_dir: Path, command: Dict[str, Any]) -> bool:
"""Check if a command should be rerun because its settings or inputs/outputs
changed.
project_dir (Path): The current project directory.
command (Dict[str, Any]): The command, as defined in the project.yml.
variables (Dict[str, Any]): The variables defined in the project.yml.
RETURNS (bool): Whether to re-run the command.
"""
lock_path = project_dir / PROJECT_LOCK
@ -197,19 +183,16 @@ def check_rerun(
# If the entry in the lockfile matches the lockfile entry that would be
# generated from the current command, we don't rerun because it means that
# all inputs/outputs, hashes and scripts are the same and nothing changed
return get_hash(get_lock_entry(project_dir, command, variables)) != get_hash(entry)
return get_hash(get_lock_entry(project_dir, command)) != get_hash(entry)
def update_lockfile(
project_dir: Path, command: Dict[str, Any], variables: Dict[str, Any]
) -> None:
def update_lockfile(project_dir: Path, command: Dict[str, Any]) -> None:
"""Update the lockfile after running a command. Will create a lockfile if
it doesn't yet exist and will add an entry for the current command, its
script and dependencies/outputs.
project_dir (Path): The current project directory.
command (Dict[str, Any]): The command, as defined in the project.yml.
variables (Dict[str, Any]): The variables defined in the project.yml.
"""
lock_path = project_dir / PROJECT_LOCK
if not lock_path.exists():
@ -217,13 +200,11 @@ def update_lockfile(
data = {}
else:
data = srsly.read_yaml(lock_path)
data[command["name"]] = get_lock_entry(project_dir, command, variables)
data[command["name"]] = get_lock_entry(project_dir, command)
srsly.write_yaml(lock_path, data)
def get_lock_entry(
project_dir: Path, command: Dict[str, Any], variables: Dict[str, Any]
) -> Dict[str, Any]:
def get_lock_entry(project_dir: Path, command: Dict[str, Any]) -> Dict[str, Any]:
"""Get a lockfile entry for a given command. An entry includes the command,
the script (command steps) and a list of dependencies and outputs with
their paths and file hashes, if available. The format is based on the
@ -231,12 +212,11 @@ def get_lock_entry(
project_dir (Path): The current project directory.
command (Dict[str, Any]): The command, as defined in the project.yml.
variables (Dict[str, Any]): The variables defined in the project.yml.
RETURNS (Dict[str, Any]): The lockfile entry.
"""
deps = get_fileinfo(project_dir, command.get("deps", []), variables)
outs = get_fileinfo(project_dir, command.get("outputs", []), variables)
outs_nc = get_fileinfo(project_dir, command.get("outputs_no_cache", []), variables)
deps = get_fileinfo(project_dir, command.get("deps", []))
outs = get_fileinfo(project_dir, command.get("outputs", []))
outs_nc = get_fileinfo(project_dir, command.get("outputs_no_cache", []))
return {
"cmd": f"{COMMAND} run {command['name']}",
"script": command["script"],
@ -245,20 +225,16 @@ def get_lock_entry(
}
def get_fileinfo(
project_dir: Path, paths: List[str], variables: Dict[str, Any]
) -> List[Dict[str, str]]:
def get_fileinfo(project_dir: Path, paths: List[str]) -> List[Dict[str, str]]:
"""Generate the file information for a list of paths (dependencies, outputs).
Includes the file path and the file's checksum.
project_dir (Path): The current project directory.
paths (List[str]): The file paths.
variables (Dict[str, Any]): The variables defined in the project.yml.
RETURNS (List[Dict[str, str]]): The lockfile entry for a file.
"""
data = []
for path in paths:
path = path.format(**variables)
file_path = project_dir / path
md5 = get_checksum(file_path) if file_path.exists() else None
data.append({"path": path, "md5": md5})

View File

@ -252,8 +252,10 @@ class EntityRenderer:
colors.update(user_color)
colors.update(options.get("colors", {}))
self.default_color = DEFAULT_ENTITY_COLOR
self.colors = colors
self.colors = {label.upper(): color for label, color in colors.items()}
self.ents = options.get("ents", None)
if self.ents is not None:
self.ents = [ent.upper() for ent in self.ents]
self.direction = DEFAULT_DIR
self.lang = DEFAULT_LANG
template = options.get("template")

View File

@ -51,14 +51,14 @@ TPL_ENTS = """
TPL_ENT = """
<mark class="entity" style="background: {bg}; padding: 0.45em 0.6em; margin: 0 0.25em; line-height: 1; border-radius: 0.35em;">
{text}
<span style="font-size: 0.8em; font-weight: bold; line-height: 1; border-radius: 0.35em; text-transform: uppercase; vertical-align: middle; margin-left: 0.5rem">{label}</span>
<span style="font-size: 0.8em; font-weight: bold; line-height: 1; border-radius: 0.35em; vertical-align: middle; margin-left: 0.5rem">{label}</span>
</mark>
"""
TPL_ENT_RTL = """
<mark class="entity" style="background: {bg}; padding: 0.45em 0.6em; margin: 0 0.25em; line-height: 1; border-radius: 0.35em">
{text}
<span style="font-size: 0.8em; font-weight: bold; line-height: 1; border-radius: 0.35em; text-transform: uppercase; vertical-align: middle; margin-right: 0.5rem">{label}</span>
<span style="font-size: 0.8em; font-weight: bold; line-height: 1; border-radius: 0.35em; vertical-align: middle; margin-right: 0.5rem">{label}</span>
</mark>
"""

View File

@ -303,7 +303,7 @@ class ProjectConfigCommand(BaseModel):
class ProjectConfigSchema(BaseModel):
# fmt: off
variables: Dict[StrictStr, Union[str, int, float, bool]] = Field({}, title="Optional variables to substitute in commands")
vars: Dict[StrictStr, Any] = Field({}, title="Optional variables to substitute in commands")
assets: List[ProjectConfigAsset] = Field([], title="Data assets")
workflows: Dict[StrictStr, List[StrictStr]] = Field({}, title="Named workflows, mapped to list of project commands to run in order")
commands: List[ProjectConfigCommand] = Field([], title="Project command shortucts")

View File

@ -65,7 +65,7 @@ def test_issue4590(en_vocab):
def test_issue4651_with_phrase_matcher_attr():
"""Test that the EntityRuler PhraseMatcher is deserialize correctly using
"""Test that the EntityRuler PhraseMatcher is deserialized correctly using
the method from_disk when the EntityRuler argument phrase_matcher_attr is
specified.
"""
@ -87,7 +87,7 @@ def test_issue4651_with_phrase_matcher_attr():
def test_issue4651_without_phrase_matcher_attr():
"""Test that the EntityRuler PhraseMatcher is deserialize correctly using
"""Test that the EntityRuler PhraseMatcher is deserialized correctly using
the method from_disk when the EntityRuler argument phrase_matcher_attr is
not specified.
"""

View File

@ -6,9 +6,12 @@ from spacy.schemas import ProjectConfigSchema, RecommendationSchema, validate
from spacy.cli.pretrain import make_docs
from spacy.cli.init_config import init_config, RECOMMENDATIONS
from spacy.cli._util import validate_project_commands, parse_config_overrides
from spacy.util import get_lang_class
from spacy.cli._util import load_project_config, substitute_project_variables
from thinc.config import ConfigValidationError
import srsly
from .util import make_tempdir
def test_cli_converters_conllu2json():
# from NorNE: https://github.com/ltgoslo/norne/blob/3d23274965f513f23aa48455b28b1878dad23c05/ud/nob/no_bokmaal-ud-dev.conllu
@ -295,6 +298,24 @@ def test_project_config_validation2(config, n_errors):
assert len(errors) == n_errors
def test_project_config_interpolation():
variables = {"a": 10, "b": {"c": "foo", "d": True}}
commands = [
{"name": "x", "script": ["hello ${vars.a} ${vars.b.c}"]},
{"name": "y", "script": ["${vars.b.c} ${vars.b.d}"]},
]
project = {"commands": commands, "vars": variables}
with make_tempdir() as d:
srsly.write_yaml(d / "project.yml", project)
cfg = load_project_config(d)
assert cfg["commands"][0]["script"][0] == "hello 10 foo"
assert cfg["commands"][1]["script"][0] == "foo true"
commands = [{"name": "x", "script": ["hello ${vars.a} ${vars.b.e}"]}]
project = {"commands": commands, "vars": variables}
with pytest.raises(ConfigValidationError):
substitute_project_variables(project)
@pytest.mark.parametrize(
"args,expected",
[

View File

@ -1,6 +1,6 @@
import pytest
from spacy import displacy
from spacy.displacy.render import DependencyRenderer
from spacy.displacy.render import DependencyRenderer, EntityRenderer
from spacy.tokens import Span
from spacy.lang.fa import Persian
@ -97,3 +97,17 @@ def test_displacy_render_wrapper(en_vocab):
assert html.endswith("/div>TEST")
# Restore
displacy.set_render_wrapper(lambda html: html)
def test_displacy_options_case():
ents = ["foo", "BAR"]
colors = {"FOO": "red", "bar": "green"}
renderer = EntityRenderer({"ents": ents, "colors": colors})
text = "abcd"
labels = ["foo", "bar", "FOO", "BAR"]
spans = [{"start": i, "end": i + 1, "label": labels[i]} for i in range(len(text))]
result = renderer.render_ents("abcde", spans, None).split("\n\n")
assert "red" in result[0] and "foo" in result[0]
assert "green" in result[1] and "bar" in result[1]
assert "red" in result[2] and "FOO" in result[2]
assert "green" in result[3] and "BAR" in result[3]

View File

@ -47,9 +47,9 @@ cdef class Tokenizer:
`infix_finditer` (callable): A function matching the signature of
`re.compile(string).finditer` to find infixes.
token_match (callable): A boolean function matching strings to be
recognised as tokens.
recognized as tokens.
url_match (callable): A boolean function matching strings to be
recognised as tokens after considering prefixes and suffixes.
recognized as tokens after considering prefixes and suffixes.
EXAMPLE:
>>> tokenizer = Tokenizer(nlp.vocab)

View File

@ -1193,8 +1193,7 @@ cdef class Doc:
retokenizer.merge(span, attributes[i])
def to_json(self, underscore=None):
"""Convert a Doc to JSON. The format it produces will be the new format
for the `spacy train` command (not implemented yet).
"""Convert a Doc to JSON.
underscore (list): Optional list of string names of custom doc._.
attributes. Attribute values need to be JSON-serializable. Values will

View File

@ -1,5 +1,5 @@
from typing import List, Union, Dict, Any, Optional, Iterable, Callable, Tuple
from typing import Iterator, Type, Pattern, TYPE_CHECKING
from typing import Iterator, Type, Pattern, Generator, TYPE_CHECKING
from types import ModuleType
import os
import importlib
@ -610,7 +610,7 @@ def working_dir(path: Union[str, Path]) -> None:
@contextmanager
def make_tempdir() -> None:
def make_tempdir() -> Generator[Path, None, None]:
"""Execute a block in a temporary directory and remove the directory and
its contents at the end of the with block.

View File

@ -11,9 +11,17 @@ menu:
- ['Entity Linking', 'entitylinker']
---
TODO: intro and how architectures work, link to
[`registry`](/api/top-level#registry),
[custom functions](/usage/training#custom-functions) usage etc.
A **model architecture** is a function that wires up a
[`Model`](https://thinc.ai/docs/api-model) instance, which you can then use in a
pipeline component or as a layer of a larger network. This page documents
spaCy's built-in architectures that are used for different NLP tasks. All
trainable [built-in components](/api#architecture-pipeline) expect a `model`
argument defined in the config and document their the default architecture.
Custom architectures can be registered using the
[`@spacy.registry.architectures`](/api/top-level#regsitry) decorator and used as
part of the [training config](/usage/training#custom-functions). Also see the
usage documentation on
[layers and model architectures](/usage/layers-architectures).
## Tok2Vec architectures {#tok2vec-arch source="spacy/ml/models/tok2vec.py"}
@ -110,13 +118,11 @@ Instead of defining its own `Tok2Vec` instance, a model architecture like
[Tagger](/api/architectures#tagger) can define a listener as its `tok2vec`
argument that connects to the shared `tok2vec` component in the pipeline.
<!-- TODO: return type -->
| Name | Description |
| ----------- | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `width` | The width of the vectors produced by the "upstream" [`Tok2Vec`](/api/tok2vec) component. ~~int~~ |
| `upstream` | A string to identify the "upstream" `Tok2Vec` component to communicate with. The upstream name should either be the wildcard string `"*"`, or the name of the `Tok2Vec` component. You'll almost never have multiple upstream `Tok2Vec` components, so the wildcard string will almost always be fine. ~~str~~ |
| **CREATES** | The model using the architecture. ~~Model~~ |
| **CREATES** | The model using the architecture. ~~Model[List[Doc], List[Floats2d]]~~ |
### spacy.MultiHashEmbed.v1 {#MultiHashEmbed}
@ -139,15 +145,13 @@ definitions depending on the `Vocab` of the `Doc` object passed in. Vectors from
pretrained static vectors can also be incorporated into the concatenated
representation.
<!-- TODO: model return type -->
| Name | Description |
| ------------------------- | ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `width` | The output width. Also used as the width of the embedding tables. Recommended values are between `64` and `300`. ~~int~~ |
| `rows` | The number of rows for the embedding tables. Can be low, due to the hashing trick. Embeddings for prefix, suffix and word shape use half as many rows. Recommended values are between `2000` and `10000`. ~~int~~ |
| `also_embed_subwords` | Whether to use the `PREFIX`, `SUFFIX` and `SHAPE` features in the embeddings. If not using these, you may need more rows in your hash embeddings, as there will be increased chance of collisions. ~~bool~~ |
| `also_use_static_vectors` | Whether to also use static word vectors. Requires a vectors table to be loaded in the [Doc](/api/doc) objects' vocab. ~~bool~~ |
| **CREATES** | The model using the architecture. ~~Model~~ |
| **CREATES** | The model using the architecture. ~~Model[List[Doc], List[Floats2d]]~~ |
### spacy.CharacterEmbed.v1 {#CharacterEmbed}
@ -178,15 +182,13 @@ concatenated. A hash-embedded vector of the `NORM` of the word is also
concatenated on, and the result is then passed through a feed-forward network to
construct a single vector to represent the information.
<!-- TODO: model return type -->
| Name | Description |
| ----------- | --------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `width` | The width of the output vector and the `NORM` hash embedding. ~~int~~ |
| `rows` | The number of rows in the `NORM` hash embedding table. ~~int~~ |
| `nM` | The dimensionality of the character embeddings. Recommended values are between `16` and `64`. ~~int~~ |
| `nC` | The number of UTF-8 bytes to embed per word. Recommended values are between `3` and `8`, although it may depend on the length of words in the language. ~~int~~ |
| **CREATES** | The model using the architecture. ~~Model~~ |
| **CREATES** | The model using the architecture. ~~Model[List[Doc], List[Floats2d]]~~ |
### spacy.MaxoutWindowEncoder.v1 {#MaxoutWindowEncoder}
@ -277,12 +279,10 @@ Embed [`Doc`](/api/doc) objects with their vocab's vectors table, applying a
learned linear projection to control the dimensionality. See the documentation
on [static vectors](/usage/embeddings-transformers#static-vectors) for details.
<!-- TODO: document argument descriptions -->
| Name |  Description |
| ----------- | ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `nO` | Defaults to `None`. ~~Optional[int]~~ |
| `nM` | Defaults to `None`. ~~Optional[int]~~ |
| `nO` | The output width of the layer, after the linear projection. ~~Optional[int]~~ |
| `nM` | The width of the static vectors. ~~Optional[int]~~ |
| `dropout` | Optional dropout rate. If set, it's applied per dimension over the whole batch. Defaults to `None`. ~~Optional[float]~~ |
| `init_W` | The [initialization function](https://thinc.ai/docs/api-initializers). Defaults to [`glorot_uniform_init`](https://thinc.ai/docs/api-initializers#glorot_uniform_init). ~~Callable[[Ops, Tuple[int, ...]]], FloatsXd]~~ |
| `key_attr` | Defaults to `"ORTH"`. ~~str~~ |
@ -292,8 +292,18 @@ on [static vectors](/usage/embeddings-transformers#static-vectors) for details.
The following architectures are provided by the package
[`spacy-transformers`](https://github.com/explosion/spacy-transformers). See the
[usage documentation](/usage/embeddings-transformers) for how to integrate the
architectures into your training config.
[usage documentation](/usage/embeddings-transformers#transformers) for how to
integrate the architectures into your training config.
<Infobox variant="warning">
Note that in order to use these architectures in your config, you need to
install the
[`spacy-transformers`](https://github.com/explosion/spacy-transformers). See the
[installation docs](/usage/embeddings-transformers#transformers-installation)
for details and system requirements.
</Infobox>
### spacy-transformers.TransformerModel.v1 {#TransformerModel}
@ -311,7 +321,23 @@ architectures into your training config.
> stride = 96
> ```
<!-- TODO: description -->
Load and wrap a transformer model from the
[HuggingFace `transformers`](https://huggingface.co/transformers) library. You
can any transformer that has pretrained weights and a PyTorch implementation.
The `name` variable is passed through to the underlying library, so it can be
either a string or a path. If it's a string, the pretrained weights will be
downloaded via the transformers library if they are not already available
locally.
In order to support longer documents, the
[TransformerModel](/api/architectures#TransformerModel) layer allows you to pass
in a `get_spans` function that will divide up the [`Doc`](/api/doc) objects
before passing them through the transformer. Your spans are allowed to overlap
or exclude tokens. This layer is usually used directly by the
[`Transformer`](/api/transformer) component, which allows you to share the
transformer weights across your pipeline. For a layer that's configured for use
in other components, see
[Tok2VecTransformer](/api/architectures#Tok2VecTransformer).
| Name | Description |
| ------------------ | ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
@ -541,8 +567,6 @@ specific data and challenge.
Stacked ensemble of a bag-of-words model and a neural network model. The neural
network has an internal CNN Tok2Vec layer and uses attention.
<!-- TODO: model return type -->
| Name | Description |
| -------------------- | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `exclusive_classes` | Whether or not categories are mutually exclusive. ~~bool~~ |
@ -554,7 +578,7 @@ network has an internal CNN Tok2Vec layer and uses attention.
| `ngram_size` | Determines the maximum length of the n-grams in the BOW model. For instance, `ngram_size=3`would give unigram, trigram and bigram features. ~~int~~ |
| `dropout` | The dropout rate. ~~float~~ |
| `nO` | Output dimension, determined by the number of different labels. If not set, the [`TextCategorizer`](/api/textcategorizer) component will set it when `begin_training` is called. ~~Optional[int]~~ |
| **CREATES** | The model using the architecture. ~~Model~~ |
| **CREATES** | The model using the architecture. ~~Model[List[Doc], Floats2d]~~ |
### spacy.TextCatCNN.v1 {#TextCatCNN}
@ -581,14 +605,12 @@ A neural network model where token vectors are calculated using a CNN. The
vectors are mean pooled and used as features in a feed-forward network. This
architecture is usually less accurate than the ensemble, but runs faster.
<!-- TODO: model return type -->
| Name | Description |
| ------------------- | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `exclusive_classes` | Whether or not categories are mutually exclusive. ~~bool~~ |
| `tok2vec` | The [`tok2vec`](#tok2vec) layer of the model. ~~Model~~ |
| `nO` | Output dimension, determined by the number of different labels. If not set, the [`TextCategorizer`](/api/textcategorizer) component will set it when `begin_training` is called. ~~Optional[int]~~ |
| **CREATES** | The model using the architecture. ~~Model~~ |
| **CREATES** | The model using the architecture. ~~Model[List[Doc], Floats2d]~~ |
### spacy.TextCatBOW.v1 {#TextCatBOW}
@ -606,15 +628,13 @@ architecture is usually less accurate than the ensemble, but runs faster.
An ngram "bag-of-words" model. This architecture should run much faster than the
others, but may not be as accurate, especially if texts are short.
<!-- TODO: model return type -->
| Name | Description |
| ------------------- | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `exclusive_classes` | Whether or not categories are mutually exclusive. ~~bool~~ |
| `ngram_size` | Determines the maximum length of the n-grams in the BOW model. For instance, `ngram_size=3`would give unigram, trigram and bigram features. ~~int~~ |
| `no_output_layer` | Whether or not to add an output layer to the model (`Softmax` activation if `exclusive_classes` is `True`, else `Logistic`. ~~bool~~ |
| `nO` | Output dimension, determined by the number of different labels. If not set, the [`TextCategorizer`](/api/textcategorizer) component will set it when `begin_training` is called. ~~Optional[int]~~ |
| **CREATES** | The model using the architecture. ~~Model~~ |
| **CREATES** | The model using the architecture. ~~Model[List[Doc], Floats2d]~~ |
## Entity linking architectures {#entitylinker source="spacy/ml/models/entity_linker.py"}
@ -659,13 +679,11 @@ into the "real world". This requires 3 main components:
The `EntityLinker` model architecture is a Thinc `Model` with a
[`Linear`](https://thinc.ai/api-layers#linear) output layer.
<!-- TODO: model return type -->
| Name | Description |
| ----------- | ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `tok2vec` | The [`tok2vec`](#tok2vec) layer of the model. ~~Model~~ |
| `nO` | Output dimension, determined by the length of the vectors encoding each entity in the KB. If the `nO` dimension is not set, the entity linking component will set it when `begin_training` is called. ~~Optional[int]~~ |
| **CREATES** | The model using the architecture. ~~Model~~ |
| **CREATES** | The model using the architecture. ~~Model[List[Doc], Floats2d]~~ |
### spacy.EmptyKB.v1 {#EmptyKB}

View File

@ -123,7 +123,7 @@ $ python -m spacy init config [output_file] [--lang] [--pipeline] [--optimize] [
| Name | Description |
| ------------------ | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `output_file` | Path to output `.cfg` file. If not set, the config is written to stdout so you can pipe it forward to a file. ~~Path (positional)~~ |
| `output_file` | Path to output `.cfg` file or `-` to write the config to stdout (so you can pipe it forward to a file). Note that if you're writing to stdout, no additional logging info is printed. ~~Path (positional)~~ |
| `--lang`, `-l` | Optional code of the [language](/usage/models#languages) to use. Defaults to `"en"`. ~~str (option)~~ |
| `--pipeline`, `-p` | Comma-separated list of trainable [pipeline components](/usage/processing-pipelines#built-in) to include in the model. Defaults to `"tagger,parser,ner"`. ~~str (option)~~ |
| `--optimize`, `-o` | `"efficiency"` or `"accuracy"`. Whether to optimize for efficiency (faster inference, smaller model, lower memory consumption) or higher accuracy (potentially larger and slower model). This will impact the choice of architecture, pretrained weights and related hyperparameters. Defaults to `"efficiency"`. ~~str (option)~~ |
@ -847,6 +847,92 @@ $ python -m spacy project run [subcommand] [project_dir] [--force] [--dry]
| `--help`, `-h` | Show help message and available arguments. ~~bool (flag)~~ |
| **EXECUTES** | The command defined in the `project.yml`. |
### project push {#project-push tag="command"}
Upload all available files or directories listed as in the `outputs` section of
commands to a remote storage. Outputs are archived and compressed prior to
upload, and addressed in the remote storage using the output's relative path
(URL encoded), a hash of its command string and dependencies, and a hash of its
file contents. This means `push` should **never overwrite** a file in your
remote. If all the hashes match, the contents are the same and nothing happens.
If the contents are different, the new version of the file is uploaded. Deleting
obsolete files is left up to you.
Remotes can be defined in the `remotes` section of the
[`project.yml`](/usage/projects#project-yml). Under the hood, spaCy uses the
[`smart-open`](https://github.com/RaRe-Technologies/smart_open) library to
communicate with the remote storages, so you can use any protocol that
`smart-open` supports, including [S3](https://aws.amazon.com/s3/),
[Google Cloud Storage](https://cloud.google.com/storage), SSH and more, although
you may need to install extra dependencies to use certain protocols.
```cli
$ python -m spacy project push [remote] [project_dir]
```
> #### Example
>
> ```cli
> $ python -m spacy project push my_bucket
> ```
>
> ```yaml
> ### project.yml
> remotes:
> my_bucket: 's3://my-spacy-bucket'
> ```
| Name | Description |
| -------------- | --------------------------------------------------------------------------------------- |
| `remote` | The name of the remote to upload to. Defaults to `"default"`. ~~str (positional)~~ |
| `project_dir` | Path to project directory. Defaults to current working directory. ~~Path (positional)~~ |
| `--help`, `-h` | Show help message and available arguments. ~~bool (flag)~~ |
| **UPLOADS** | All project outputs that exist and are not already stored in the remote. |
### project pull {#project-pull tag="command"}
Download all files or directories listed as `outputs` for commands, unless they
are not already present locally. When searching for files in the remote, `pull`
won't just look at the output path, but will also consider the **command
string** and the **hashes of the dependencies**. For instance, let's say you've
previously pushed a model checkpoint to the remote, but now you've changed some
hyper-parameters. Because you've changed the inputs to the command, if you run
`pull`, you won't retrieve the stale result. If you train your model and push
the outputs to the remote, the outputs will be saved alongside the prior
outputs, so if you change the config back, you'll be able to fetch back the
result.
Remotes can be defined in the `remotes` section of the
[`project.yml`](/usage/projects#project-yml). Under the hood, spaCy uses the
[`smart-open`](https://github.com/RaRe-Technologies/smart_open) library to
communicate with the remote storages, so you can use any protocol that
`smart-open` supports, including [S3](https://aws.amazon.com/s3/),
[Google Cloud Storage](https://cloud.google.com/storage), SSH and more, although
you may need to install extra dependencies to use certain protocols.
```cli
$ python -m spacy project pull [remote] [project_dir]
```
> #### Example
>
> ```cli
> $ python -m spacy project pull my_bucket
> ```
>
> ```yaml
> ### project.yml
> remotes:
> my_bucket: 's3://my-spacy-bucket'
> ```
| Name | Description |
| -------------- | --------------------------------------------------------------------------------------- |
| `remote` | The name of the remote to download from. Defaults to `"default"`. ~~str (positional)~~ |
| `project_dir` | Path to project directory. Defaults to current working directory. ~~Path (positional)~~ |
| `--help`, `-h` | Show help message and available arguments. ~~bool (flag)~~ |
| **DOWNLOADS** | All project outputs that do not exist locally and can be found in the remote. |
### project dvc {#project-dvc tag="command"}
Auto-generate [Data Version Control](https://dvc.org) (DVC) config file. Calls

View File

@ -127,26 +127,24 @@ $ python -m spacy train config.cfg --paths.train ./corpus/train.spacy
This section defines settings and controls for the training and evaluation
process that are used when you run [`spacy train`](/api/cli#train).
<!-- TODO: complete -->
| Name | Description |
| --------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ |
| `seed` | The random seed. Defaults to variable `${system.seed}`. ~~int~~ |
| `dropout` | The dropout rate. Defaults to `0.1`. ~~float~~ |
| `accumulate_gradient` | Whether to divide the batch up into substeps. Defaults to `1`. ~~int~~ |
| `batcher` | Callable that takes an iterator of [`Doc`](/api/doc) objects and yields batches of `Doc`s. Defaults to [`batch_by_words`](/api/top-level#batch_by_words). ~~Callable[[Iterator[Doc], Iterator[List[Doc]]]]~~ |
| `dev_corpus` | Callable that takes the current `nlp` object and yields [`Example`](/api/example) objects. Defaults to [`Corpus`](/api/corpus). ~~Callable[[Language], Iterator[Example]]~~ |
| `dropout` | The dropout rate. Defaults to `0.1`. ~~float~~ |
| `eval_frequency` | How often to evaluate during training (steps). Defaults to `200`. ~~int~~ |
| `frozen_components` | Pipeline component names that are "frozen" and shouldn't be updated during training. See [here](/usage/training#config-components) for details. Defaults to `[]`. ~~List[str]~~ |
| `init_tok2vec` | Optional path to pretrained tok2vec weights created with [`spacy pretrain`](/api/cli#pretrain). Defaults to variable `${paths.init_tok2vec}`. ~~Optional[str]~~ |
| `raw_text` | Optional path to a jsonl file with unlabelled text documents for a [rehearsal](/api/language#rehearse) step. Defaults to variable `${paths.raw}`. ~~Optional[str]~~ |
| `vectors` | Model name or path to model containing pretrained word vectors to use, e.g. created with [`init model`](/api/cli#init-model). Defaults to `null`. ~~Optional[str]~~ |
| `patience` | How many steps to continue without improvement in evaluation score. Defaults to `1600`. ~~int~~ |
| `max_epochs` | Maximum number of epochs to train for. Defaults to `0`. ~~int~~ |
| `max_steps` | Maximum number of update steps to train for. Defaults to `20000`. ~~int~~ |
| `eval_frequency` | How often to evaluate during training (steps). Defaults to `200`. ~~int~~ |
| `score_weights` | Score names shown in metrics mapped to their weight towards the final weighted score. See [here](/usage/training#metrics) for details. Defaults to `{}`. ~~Dict[str, float]~~ |
| `frozen_components` | Pipeline component names that are "frozen" and shouldn't be updated during training. See [here](/usage/training#config-components) for details. Defaults to `[]`. ~~List[str]~~ |
| `train_corpus` | Callable that takes the current `nlp` object and yields [`Example`](/api/example) objects. Defaults to [`Corpus`](/api/corpus). ~~Callable[[Language], Iterator[Example]]~~ |
| `dev_corpus` | Callable that takes the current `nlp` object and yields [`Example`](/api/example) objects. Defaults to [`Corpus`](/api/corpus). ~~Callable[[Language], Iterator[Example]]~~ |
| `batcher` | Callable that takes an iterator of [`Doc`](/api/doc) objects and yields batches of `Doc`s. Defaults to [`batch_by_words`](/api/top-level#batch_by_words). ~~Callable[[Iterator[Doc], Iterator[List[Doc]]]]~~ |
| `optimizer` | The optimizer. The learning rate schedule and other settings can be configured as part of the optimizer. Defaults to [`Adam`](https://thinc.ai/docs/api-optimizers#adam). ~~Optimizer~~ |
| `patience` | How many steps to continue without improvement in evaluation score. Defaults to `1600`. ~~int~~ |
| `raw_text` | Optional path to a jsonl file with unlabelled text documents for a [rehearsal](/api/language#rehearse) step. Defaults to variable `${paths.raw}`. ~~Optional[str]~~ |
| `score_weights` | Score names shown in metrics mapped to their weight towards the final weighted score. See [here](/usage/training#metrics) for details. Defaults to `{}`. ~~Dict[str, float]~~ |
| `seed` | The random seed. Defaults to variable `${system.seed}`. ~~int~~ |
| `train_corpus` | Callable that takes the current `nlp` object and yields [`Example`](/api/example) objects. Defaults to [`Corpus`](/api/corpus). ~~Callable[[Language], Iterator[Example]]~~ |
| `vectors` | Model name or path to model containing pretrained word vectors to use, e.g. created with [`init model`](/api/cli#init-model). Defaults to `null`. ~~Optional[str]~~ |
### pretraining {#config-pretraining tag="section,optional"}

View File

@ -7,7 +7,7 @@ source: spacy/morphology.pyx
Store the possible morphological analyses for a language, and index them by
hash. To save space on each token, tokens only know the hash of their
morphological analysis, so queries of morphological attributes are delegated to
this class. See [`MorphAnalysis`](/api/morphology#morphansalysis) for the
this class. See [`MorphAnalysis`](/api/morphology#morphanalysis) for the
container storing a single morphological analysis.
## Morphology.\_\_init\_\_ {#init tag="method"}

View File

@ -450,8 +450,8 @@ The L2 norm of the token's vector representation.
| `pos_` | Coarse-grained part-of-speech from the [Universal POS tag set](https://universaldependencies.org/docs/u/pos/). ~~str~~ |
| `tag` | Fine-grained part-of-speech. ~~int~~ |
| `tag_` | Fine-grained part-of-speech. ~~str~~ |
| `morph` | Morphological analysis. ~~MorphAnalysis~~ |
| `morph_` | Morphological analysis in the Universal Dependencies [FEATS]https://universaldependencies.org/format.html#morphological-annotation format. ~~str~~ |
| `morph` <Tag variant="new">3</Tag> | Morphological analysis. ~~MorphAnalysis~~ |
| `morph_` <Tag variant="new">3</Tag> | Morphological analysis in the Universal Dependencies [FEATS]https://universaldependencies.org/format.html#morphological-annotation format. ~~str~~ |
| `dep` | Syntactic dependency relation. ~~int~~ |
| `dep_` | Syntactic dependency relation. ~~str~~ |
| `lang` | Language of the parent document's vocabulary. ~~int~~ |

View File

@ -257,7 +257,7 @@ If a setting is not present in the options, the default value will be used.
| Name | Description |
| --------------------------------------- | --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `ents` | Entity types to highlight or `None` for all types (default). ~~Optional[List[str]]~~ |
| `colors` | Color overrides. Entity types in uppercase should be mapped to color names or values. ~~Dict[str, str]~~ |
| `colors` | Color overrides. Entity types should be mapped to color names or values. ~~Dict[str, str]~~ |
| `template` <Tag variant="new">2.2</Tag> | Optional template to overwrite the HTML used to render entity spans. Should be a format string and can use `{bg}`, `{text}` and `{label}`. See [`templates.py`](https://github.com/explosion/spaCy/blob/master/spacy/displacy/templates.py) for examples. ~~Optional[str]~~ |
By default, displaCy comes with colors for all entity types used by
@ -299,20 +299,20 @@ factories.
| Registry name | Description |
| ----------------- | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `architectures` | Registry for functions that create [model architectures](/api/architectures). Can be used to register custom model architectures and reference them in the `config.cfg`. |
| `factories` | Registry for functions that create [pipeline components](/usage/processing-pipelines#custom-components). Added automatically when you use the `@spacy.component` decorator and also reads from [entry points](/usage/saving-loading#entry-points). |
| `tokenizers` | Registry for tokenizer factories. Registered functions should return a callback that receives the `nlp` object and returns a [`Tokenizer`](/api/tokenizer) or a custom callable. |
| `languages` | Registry for language-specific `Language` subclasses. Automatically reads from [entry points](/usage/saving-loading#entry-points). |
| `lookups` | Registry for large lookup tables available via `vocab.lookups`. |
| `displacy_colors` | Registry for custom color scheme for the [`displacy` NER visualizer](/usage/visualizers). Automatically reads from [entry points](/usage/saving-loading#entry-points). |
| `assets` | Registry for data assets, knowledge bases etc. |
| `callbacks` | Registry for custom callbacks to [modify the `nlp` object](/usage/training#custom-code-nlp-callbacks) before training. |
| `readers` | Registry for training and evaluation data readers like [`Corpus`](/api/corpus). |
| `batchers` | Registry for training and evaluation [data batchers](#batchers). |
| `optimizers` | Registry for functions that create [optimizers](https://thinc.ai/docs/api-optimizers). |
| `schedules` | Registry for functions that create [schedules](https://thinc.ai/docs/api-schedules). |
| `layers` | Registry for functions that create [layers](https://thinc.ai/docs/api-layers). |
| `losses` | Registry for functions that create [losses](https://thinc.ai/docs/api-loss). |
| `callbacks` | Registry for custom callbacks to [modify the `nlp` object](/usage/training#custom-code-nlp-callbacks) before training. |
| `displacy_colors` | Registry for custom color scheme for the [`displacy` NER visualizer](/usage/visualizers). Automatically reads from [entry points](/usage/saving-loading#entry-points). |
| `factories` | Registry for functions that create [pipeline components](/usage/processing-pipelines#custom-components). Added automatically when you use the `@spacy.component` decorator and also reads from [entry points](/usage/saving-loading#entry-points). |
| `initializers` | Registry for functions that create [initializers](https://thinc.ai/docs/api-initializers). |
| `languages` | Registry for language-specific `Language` subclasses. Automatically reads from [entry points](/usage/saving-loading#entry-points). |
| `layers` | Registry for functions that create [layers](https://thinc.ai/docs/api-layers). |
| `lookups` | Registry for large lookup tables available via `vocab.lookups`. |
| `losses` | Registry for functions that create [losses](https://thinc.ai/docs/api-loss). |
| `optimizers` | Registry for functions that create [optimizers](https://thinc.ai/docs/api-optimizers). |
| `readers` | Registry for training and evaluation data readers like [`Corpus`](/api/corpus). |
| `schedules` | Registry for functions that create [schedules](https://thinc.ai/docs/api-schedules). |
| `tokenizers` | Registry for tokenizer factories. Registered functions should return a callback that receives the `nlp` object and returns a [`Tokenizer`](/api/tokenizer) or a custom callable. |
### spacy-transformers registry {#registry-transformers}
@ -632,6 +632,23 @@ validate its contents.
| `path` | Path to the model's `meta.json`. ~~Union[str, Path]~~ |
| **RETURNS** | The model's meta data. ~~Dict[str, Any]~~ |
### util.get_installed_models {#util.get_installed_models tag="function" new="3"}
List all model packages installed in the current environment. This will include
any spaCy model that was packaged with [`spacy package`](/api/cli#package).
Under the hood, model packages expose a Python entry point that spaCy can check,
without having to load the model.
> #### Example
>
> ```python
> model_names = util.get_installed_models()
> ```
| Name | Description |
| ----------- | ---------------------------------------------------------------------------------- |
| **RETURNS** | The string names of the models installed in the current environment. ~~List[str]~~ |
### util.is_package {#util.is_package tag="function"}
Check if string maps to a package installed via pip. Mainly used to validate

File diff suppressed because one or more lines are too long

After

Width:  |  Height:  |  Size: 50 KiB

File diff suppressed because one or more lines are too long

After

Width:  |  Height:  |  Size: 40 KiB

Binary file not shown.

After

Width:  |  Height:  |  Size: 152 KiB

View File

@ -9,7 +9,24 @@ menu:
next: /usage/training
---
<!-- TODO: intro, short explanation of embeddings/transformers, Tok2Vec and Transformer components, point user to processing pipelines docs for more general info that user should know first -->
spaCy supports a number of **transfer and multi-task learning** workflows that
can often help improve your pipeline's efficiency or accuracy. Transfer learning
refers to techniques such as word vector tables and language model pretraining.
These techniques can be used to import knowledge from raw text into your
pipeline, so that your models are able to generalize better from your annotated
examples.
You can convert **word vectors** from popular tools like
[FastText](https://fasttext.cc) and [Gensim](https://radimrehurek.com/gensim),
or you can load in any pretrained **transformer model** if you install
[`spacy-transformers`](https://github.com/explosion/spacy-transformers). You can
also do your own language model pretraining via the
[`spacy pretrain`](/api/cli#pretrain) command. You can even **share** your
transformer or other contextual embedding model across multiple components,
which can make long pipelines several times more efficient. To use transfer
learning, you'll need at least a few annotated examples for what you're trying
to predict. Otherwise, you could try using a "one-shot learning" approach using
[vectors and similarity](/usage/linguistic-features#vectors-similarity).
<Accordion title="Whats the difference between word vectors and language models?" id="vectors-vs-language-models">
@ -53,19 +70,46 @@ of performance.
## Shared embedding layers {#embedding-layers}
<!-- TODO: write -->
spaCy lets you share a single transformer or other token-to-vector ("tok2vec")
embedding layer between multiple components. You can even update the shared
layer, performing **multi-task learning**. Reusing the tok2vec layer between
components can make your pipeline run a lot faster and result in much smaller
models. However, it can make the pipeline less modular and make it more
difficult to swap components or retrain parts of the pipeline. Multi-task
learning can affect your accuracy (either positively or negatively), and may
require some retuning of your hyper-parameters.
![Pipeline components using a shared embedding component vs. independent embedding layers](../images/tok2vec.svg)
| Shared | Independent |
| ------------------------------------------------------------------------------------------- | ----------------------------------------------------------------------- |
| ✅ **smaller:** models only need to include a single copy of the embeddings | ❌ **larger:** models need to include the embeddings for each component |
| ✅ **faster:** | ❌ **slower:** |
| ✅ **faster:** embed the documents once for your whole pipeline | ❌ **slower:** rerun the embedding for each component |
| ❌ **less composable:** all components require the same embedding component in the pipeline | ✅ **modular:** components can be moved and swapped freely |
You can share a single transformer or other tok2vec model between multiple
components by adding a [`Transformer`](/api/transformer) or
[`Tok2Vec`](/api/tok2vec) component near the start of your pipeline. Components
later in the pipeline can "connect" to it by including a **listener layer** like
[Tok2VecListener](/api/architectures#Tok2VecListener) within their model.
![Pipeline components listening to shared embedding component](../images/tok2vec-listener.svg)
<!-- TODO: explain the listener concept, how it works etc. -->
At the beginning of training, the [`Tok2Vec`](/api/tok2vec) component will grab
a reference to the relevant listener layers in the rest of your pipeline. When
it processes a batch of documents, it will pass forward its predictions to the
listeners, allowing the listeners to **reuse the predictions** when they are
eventually called. A similar mechanism is used to pass gradients from the
listeners back to the model. The [`Transformer`](/api/transformer) component and
[TransformerListener](/api/architectures#TransformerListener) layer do the same
thing for transformer models, but the `Transformer` component will also save the
transformer outputs to the
[`Doc._.trf_data`](/api/transformer#custom_attributes) extension attribute,
giving you access to them after the pipeline has finished running.
<!-- TODO: show example of implementation via config, side by side -->
<!-- TODO: Once rehearsal is tested, mention it here. -->
## Using transformer models {#transformers}
@ -180,7 +224,7 @@ yourself. For details on how to get started with training your own model, check
out the [training quickstart](/usage/training#quickstart).
<!-- TODO:
<Project id="en_core_bert">
<Project id="en_core_trf_lg">
The easiest way to get started is to clone a transformers-based project
template. Swap in your data, edit the settings and hyperparameters and train,

View File

@ -368,7 +368,7 @@ from is called `spacy`. So, when using spaCy, never call anything else `spacy`.
</Accordion>
<Accordion title="NER model doesn't recognise other entities anymore after training" id="catastrophic-forgetting">
<Accordion title="NER model doesn't recognize other entities anymore after training" id="catastrophic-forgetting">
If your training data only contained new entities and you didn't mix in any
examples the model previously recognized, it can cause the model to "forget"

View File

@ -0,0 +1,185 @@
---
title: Layers and Model Architectures
teaser: Power spaCy components with custom neural networks
menu:
- ['Type Signatures', 'type-sigs']
- ['Defining Sublayers', 'sublayers']
- ['PyTorch & TensorFlow', 'frameworks']
- ['Trainable Components', 'components']
next: /usage/projects
---
A **model architecture** is a function that wires up a
[Thinc `Model`](https://thinc.ai/docs/api-model) instance, which you can then
use in a component or as a layer of a larger network. You can use Thinc as a
thin wrapper around frameworks such as PyTorch, TensorFlow or MXNet, or you can
implement your logic in Thinc directly. spaCy's built-in components will never
construct their `Model` instances themselves, so you won't have to subclass the
component to change its model architecture. You can just **update the config**
so that it refers to a different registered function. Once the component has
been created, its model instance has already been assigned, so you cannot change
its model architecture. The architecture is like a recipe for the network, and
you can't change the recipe once the dish has already been prepared. You have to
make a new one.
![Diagram of a pipeline component with its model](../images/layers-architectures.svg)
## Type signatures {#type-sigs}
<!-- TODO: update example, maybe simplify definition? -->
> #### Example
>
> ```python
> @spacy.registry.architectures.register("spacy.Tagger.v1")
> def build_tagger_model(
> tok2vec: Model[List[Doc], List[Floats2d]], nO: Optional[int] = None
> ) -> Model[List[Doc], List[Floats2d]]:
> t2v_width = tok2vec.get_dim("nO") if tok2vec.has_dim("nO") else None
> output_layer = Softmax(nO, t2v_width, init_W=zero_init)
> softmax = with_array(output_layer)
> model = chain(tok2vec, softmax)
> model.set_ref("tok2vec", tok2vec)
> model.set_ref("softmax", output_layer)
> model.set_ref("output_layer", output_layer)
> return model
> ```
The Thinc `Model` class is a **generic type** that can specify its input and
output types. Python uses a square-bracket notation for this, so the type
~~Model[List, Dict]~~ says that each batch of inputs to the model will be a
list, and the outputs will be a dictionary. Both `typing.List` and `typing.Dict`
are also generics, allowing you to be more specific about the data. For
instance, you can write ~~Model[List[Doc], Dict[str, float]]~~ to specify that
the model expects a list of [`Doc`](/api/doc) objects as input, and returns a
dictionary mapping strings to floats. Some of the most common types you'll see
are:
| Type | Description |
| ------------------ | ---------------------------------------------------------------------------------------------------- |
| ~~List[Doc]~~ | A batch of [`Doc`](/api/doc) objects. Most components expect their models to take this as input. |
| ~~Floats2d~~ | A two-dimensional `numpy` or `cupy` array of floats. Usually 32-bit. |
| ~~Ints2d~~ | A two-dimensional `numpy` or `cupy` array of integers. Common dtypes include uint64, int32 and int8. |
| ~~List[Floats2d]~~ | A list of two-dimensional arrays, generally with one array per `Doc` and one row per token. |
| ~~Ragged~~ | A container to handle variable-length sequence data in an unpadded contiguous array. |
| ~~Padded~~ | A container to handle variable-length sequence data in a passed contiguous array. |
The model type signatures help you figure out which model architectures and
components can **fit together**. For instance, the
[`TextCategorizer`](/api/textcategorizer) class expects a model typed
~~Model[List[Doc], Floats2d]~~, because the model will predict one row of
category probabilities per [`Doc`](/api/doc). In contrast, the
[`Tagger`](/api/tagger) class expects a model typed ~~Model[List[Doc],
List[Floats2d]]~~, because it needs to predict one row of probabilities per
token.
There's no guarantee that two models with the same type signature can be used
interchangeably. There are many other ways they could be incompatible. However,
if the types don't match, they almost surely _won't_ be compatible. This little
bit of validation goes a long way, especially if you
[configure your editor](https://thinc.ai/docs/usage-type-checking) or other
tools to highlight these errors early. Thinc will also verify that your types
match correctly when your config file is processed at the beginning of training.
<Infobox title="Tip: Static type checking in your editor" emoji="💡">
If you're using a modern editor like Visual Studio Code, you can
[set up `mypy`](https://thinc.ai/docs/usage-type-checking#install) with the
custom Thinc plugin and get live feedback about mismatched types as you write
code.
[![](../images/thinc_mypy.jpg)](https://thinc.ai/docs/usage-type-checking#linting)
</Infobox>
## Defining sublayers {#sublayers}
Model architecture functions often accept **sublayers as arguments**, so that
you can try **substituting a different layer** into the network. Depending on
how the architecture function is structured, you might be able to define your
network structure entirely through the [config system](/usage/training#config),
using layers that have already been defined. The
[transformers documentation](/usage/embeddings-transformers#transformers)
section shows a common example of swapping in a different sublayer.
In most neural network models for NLP, the most important parts of the network
are what we refer to as the
[embed and encode](https://explosion.ai/blog/embed-encode-attend-predict) steps.
These steps together compute dense, context-sensitive representations of the
tokens. Most of spaCy's default architectures accept a
[`tok2vec` embedding layer](/api/architectures#tok2vec-arch) as an argument, so
you can control this important part of the network separately. This makes it
easy to **switch between** transformer, CNN, BiLSTM or other feature extraction
approaches. And if you want to define your own solution, all you need to do is
register a ~~Model[List[Doc], List[Floats2d]]~~ architecture function, and
you'll be able to try it out in any of spaCy components.
<!-- TODO: example of switching sublayers -->
### Registering new architectures
- Recap concept, link to config docs.
## Wrapping PyTorch, TensorFlow and other frameworks {#frameworks}
<!-- TODO: this is copied over from the Thinc docs and we probably want to shorten it and make it more spaCy-specific -->
Thinc allows you to wrap models written in other machine learning frameworks
like PyTorch, TensorFlow and MXNet using a unified
[`Model`](https://thinc.ai/docs/api-model) API. As well as **wrapping whole
models**, Thinc lets you call into an external framework for just **part of your
model**: you can have a model where you use PyTorch just for the transformer
layers, using "native" Thinc layers to do fiddly input and output
transformations and add on task-specific "heads", as efficiency is less of a
consideration for those parts of the network.
Thinc uses a special class, [`Shim`](https://thinc.ai/docs/api-model#shim), to
hold references to external objects. This allows each wrapper space to define a
custom type, with whatever attributes and methods are helpful, to assist in
managing the communication between Thinc and the external library. The
[`Model`](https://thinc.ai/docs/api-model#model) class holds `shim` instances in
a separate list, and communicates with the shims about updates, serialization,
changes of device, etc.
The wrapper will receive each batch of inputs, convert them into a suitable form
for the underlying model instance, and pass them over to the shim, which will
**manage the actual communication** with the model. The output is then passed
back into the wrapper, and converted for use in the rest of the network. The
equivalent procedure happens during backpropagation. Array conversion is handled
via the [DLPack](https://github.com/dmlc/dlpack) standard wherever possible, so
that data can be passed between the frameworks **without copying the data back**
to the host device unnecessarily.
| Framework | Wrapper layer | Shim | DLPack |
| -------------- | ------------------------------------------------------------------------- | --------------------------------------------------------- | --------------- |
| **PyTorch** | [`PyTorchWrapper`](https://thinc.ai/docs/api-layers#pytorchwrapper) | [`PyTorchShim`](https://thinc.ai/docs/api-model#shims) | ✅ |
| **TensorFlow** | [`TensorFlowWrapper`](https://thinc.ai/docs/api-layers#tensorflowwrapper) | [`TensorFlowShim`](https://thinc.ai/docs/api-model#shims) | ❌ <sup>1</sup> |
| **MXNet** | [`MXNetWrapper`](https://thinc.ai/docs/api-layers#mxnetwrapper) | [`MXNetShim`](https://thinc.ai/docs/api-model#shims) | ✅ |
1. DLPack support in TensorFlow is now
[available](<(https://github.com/tensorflow/tensorflow/issues/24453)>) but
still experimental.
<!-- TODO:
- Explain concept
- Link off to notebook
-->
## Models for trainable components {#components}
- Interaction with `predict`, `get_loss` and `set_annotations`
- Initialization life-cycle with `begin_training`.
- Link to relation extraction notebook.
```python
def update(self, examples):
docs = [ex.predicted for ex in examples]
refs = [ex.reference for ex in examples]
predictions, backprop = self.model.begin_update(docs)
gradient = self.get_loss(predictions, refs)
backprop(gradient)
def __call__(self, doc):
predictions = self.model([doc])
self.set_annotations(predictions)
```

View File

@ -429,7 +429,7 @@ nlp = spacy.load("en_core_web_sm")
doc = nlp("fb is hiring a new vice president of global policy")
ents = [(e.text, e.start_char, e.end_char, e.label_) for e in doc.ents]
print('Before', ents)
# the model didn't recognise "fb" as an entity :(
# The model didn't recognize "fb" as an entity :(
fb_ent = Span(doc, 0, 1, label="ORG") # create a Span for the new entity
doc.ents = list(doc.ents) + [fb_ent]
@ -558,11 +558,11 @@ import spacy
nlp = spacy.load("my_custom_el_model")
doc = nlp("Ada Lovelace was born in London")
# document level
# Document level
ents = [(e.text, e.label_, e.kb_id_) for e in doc.ents]
print(ents) # [('Ada Lovelace', 'PERSON', 'Q7259'), ('London', 'GPE', 'Q84')]
# token level
# Token level
ent_ada_0 = [doc[0].text, doc[0].ent_type_, doc[0].ent_kb_id_]
ent_ada_1 = [doc[1].text, doc[1].ent_type_, doc[1].ent_kb_id_]
ent_london_5 = [doc[5].text, doc[5].ent_type_, doc[5].ent_kb_id_]
@ -914,12 +914,12 @@ from spacy.lang.char_classes import ALPHA, ALPHA_LOWER, ALPHA_UPPER
from spacy.lang.char_classes import CONCAT_QUOTES, LIST_ELLIPSES, LIST_ICONS
from spacy.util import compile_infix_regex
# default tokenizer
# Default tokenizer
nlp = spacy.load("en_core_web_sm")
doc = nlp("mother-in-law")
print([t.text for t in doc]) # ['mother', '-', 'in', '-', 'law']
# modify tokenizer infix patterns
# Modify tokenizer infix patterns
infixes = (
LIST_ELLIPSES
+ LIST_ICONS
@ -929,7 +929,7 @@ infixes = (
al=ALPHA_LOWER, au=ALPHA_UPPER, q=CONCAT_QUOTES
),
r"(?<=[{a}]),(?=[{a}])".format(a=ALPHA),
# EDIT: commented out regex that splits on hyphens between letters:
# ✅ Commented out regex that splits on hyphens between letters:
# r"(?<=[{a}])(?:{h})(?=[{a}])".format(a=ALPHA, h=HYPHENS),
r"(?<=[{a}0-9])[:<>=/](?=[{a}])".format(a=ALPHA),
]

View File

@ -108,11 +108,11 @@ class, or defined within a [model package](/usage/saving-loading#models).
>
> [components.tagger]
> factory = "tagger"
> # settings for the tagger component
> # Settings for the tagger component
>
> [components.parser]
> factory = "parser"
> # settings for the parser component
> # Settings for the parser component
> ```
When you load a model, spaCy first consults the model's
@ -171,7 +171,7 @@ lang = "en"
pipeline = ["tagger", "parser", "ner"]
data_path = "path/to/en_core_web_sm/en_core_web_sm-2.0.0"
cls = spacy.util.get_lang_class(lang) # 1. Get Language instance, e.g. English()
cls = spacy.util.get_lang_class(lang) # 1. Get Language class, e.g. English
nlp = cls() # 2. Initialize it
for name in pipeline:
nlp.add_pipe(name) # 3. Add the component to the pipeline
@ -187,9 +187,9 @@ which is then processed by the component next in the pipeline.
```python
### The pipeline under the hood
doc = nlp.make_doc("This is a sentence") # create a Doc from raw text
for name, proc in nlp.pipeline: # iterate over components in order
doc = proc(doc) # apply each component
doc = nlp.make_doc("This is a sentence") # Create a Doc from raw text
for name, proc in nlp.pipeline: # Iterate over components in order
doc = proc(doc) # Apply each component
```
The current processing pipeline is available as `nlp.pipeline`, which returns a
@ -473,7 +473,7 @@ only being able to modify it afterwards.
>
> @Language.component("my_component")
> def my_component(doc):
> # do something to the doc here
> # Do something to the doc here
> return doc
> ```

View File

@ -5,9 +5,12 @@ menu:
- ['Intro & Workflow', 'intro']
- ['Directory & Assets', 'directory']
- ['Custom Projects', 'custom']
- ['Remote Storage', 'remote']
- ['Integrations', 'integrations']
---
## Introduction and workflow {#intro hidden="true"}
> #### 🪐 Project templates
>
> Our [`projects`](https://github.com/explosion/projects) repo includes various
@ -19,20 +22,17 @@ spaCy projects let you manage and share **end-to-end spaCy workflows** for
different **use cases and domains**, and orchestrate training, packaging and
serving your custom models. You can start off by cloning a pre-defined project
template, adjust it to fit your needs, load in your data, train a model, export
it as a Python package and share the project templates with your team. spaCy
projects can be used via the new [`spacy project`](/api/cli#project) command.
For an overview of the available project templates, check out the
[`projects`](https://github.com/explosion/projects) repo. spaCy projects also
[integrate](#integrations) with many other cool machine learning and data
science tools to track and manage your data and experiments, iterate on demos
and prototypes and ship your models into production.
it as a Python package, upload your outputs to a remote storage and share your
results with your team. spaCy projects can be used via the new
[`spacy project`](/api/cli#project) command and we provide templates in our
[`projects`](https://github.com/explosion/projects) repo.
<!-- TODO: mention integrations -->
## Introduction and workflow {#intro}
<!-- TODO: decide how to introduce concept -->
![Illustration of project workflow and commands](../images/projects.svg)
<!-- TODO:
<Project id="some_example_project">
@ -155,8 +155,8 @@ other. For instance, to generate a packaged model, you might start by converting
your data, then run [`spacy train`](/api/cli#train) to train your model on the
converted data and if that's successful, run [`spacy package`](/api/cli#package)
to turn the best model artifact into an installable Python package. The
following command runs the workflow named `all` defined in the `project.yml`, and
executes the commands it specifies, in order:
following command runs the workflow named `all` defined in the `project.yml`,
and executes the commands it specifies, in order:
```cli
$ python -m spacy project run all
@ -171,6 +171,31 @@ advanced data pipelines and track your changes in Git, check out the
from a workflow defined in your `project.yml` so you can manage your spaCy
project as a DVC repo.
### 5. Optional: Push to remote storage {#push}
> ```yaml
> ### project.yml
> remotes:
> default: 's3://my-spacy-bucket'
> local: '/mnt/scratch/cache'
> ```
After training a model, you can optionally use the
[`spacy project push`](/api/cli#project-push) command to upload your outputs to
a remote storage, using protocols like [S3](https://aws.amazon.com/s3/),
[Google Cloud Storage](https://cloud.google.com/storage) or SSH. This can help
you **export** your model packages, **share** work with your team, or **cache
results** to avoid repeating work.
```cli
$ python -m spacy project push
```
The `remotes` section in your `project.yml` lets you assign names to the
different storages. To download state from a remote storage, you can use the
[`spacy project pull`](/api/cli#project-pull) command. For more details, see the
docs on [remote storage](#remote).
## Project directory and assets {#directory}
### project.yml {#project-yml}
@ -190,7 +215,7 @@ https://github.com/explosion/spacy-boilerplates/blob/master/ner_fashion/project.
| Section | Description |
| ------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ |
| `variables` | A dictionary of variables that can be referenced in paths, URLs and scripts. For example, `{NAME}` will use the value of the variable `NAME`. |
| `vars` | A dictionary of variables that can be referenced in paths, URLs and scripts, just like [`config.cfg` variables](/usage/training#config-interpolation). For example, `${vars.name}` will use the value of the variable `name`. Variables need to be defined in the section `vars`, but can be a nested dict, so you're able to reference `${vars.model.name}`. |
| `directories` | An optional list of [directories](#project-files) that should be created in the project for assets, training outputs, metrics etc. spaCy will make sure that these directories always exist. |
| `assets` | A list of assets that can be fetched with the [`project assets`](/api/cli#project-assets) command. `url` defines a URL or local path, `dest` is the destination file relative to the project directory, and an optional `checksum` ensures that an error is raised if the file's checksum doesn't match. |
| `workflows` | A dictionary of workflow names, mapped to a list of command names, to execute in order. Workflows can be run with the [`project run`](/api/cli#project-run) command. |
@ -349,9 +374,9 @@ if __name__ == "__main__":
In your `project.yml`, you can then run the script by calling
`python scripts/custom_evaluation.py` with the function arguments. You can also
use the `variables` section to define reusable variables that will be
substituted in commands, paths and URLs. In this example, the `BATCH_SIZE` is
defined as a variable will be added in place of `{BATCH_SIZE}` in the script.
use the `vars` section to define reusable variables that will be substituted in
commands, paths and URLs. In this example, the batch size is defined as a
variable will be added in place of `${vars.batch_size}` in the script.
> #### Calling into Python
>
@ -363,13 +388,13 @@ defined as a variable will be added in place of `{BATCH_SIZE}` in the script.
<!-- prettier-ignore -->
```yaml
### project.yml
variables:
BATCH_SIZE: 128
vars:
batch_size: 128
commands:
- name: evaluate
script:
- 'python scripts/custom_evaluation.py {BATCH_SIZE} ./training/model-best ./corpus/eval.json'
- 'python scripts/custom_evaluation.py ${batch_size} ./training/model-best ./corpus/eval.json'
deps:
- 'training/model-best'
- 'corpus/eval.json'
@ -421,6 +446,114 @@ assets:
checksum: '5113dc04e03f079525edd8df3f4f39e3'
```
## Remote Storage {#remote}
You can persist your project outputs to a remote storage using the
[`project push`](/api/cli#project-push) command. This can help you **export**
your model packages, **share** work with your team, or **cache results** to
avoid repeating work. The [`project pull`](/api/cli#project-pull) command will
download any outputs that are in the remote storage and aren't available
locally.
You can list one or more remotes in the `remotes` section of your
[`project.yml`](#project-yml) by mapping a string name to the URL of the
storage. Under the hood, spaCy uses the
[`smart-open`](https://github.com/RaRe-Technologies/smart_open) library to
communicate with the remote storages, so you can use any protocol that
`smart-open` supports, including [S3](https://aws.amazon.com/s3/),
[Google Cloud Storage](https://cloud.google.com/storage), SSH and more, although
you may need to install extra dependencies to use certain protocols.
> #### Example
>
> ```cli
> $ python -m spacy project pull local
> ```
```yaml
### project.yml
remotes:
default: 's3://my-spacy-bucket'
local: '/mnt/scratch/cache'
stuff: 'ssh://myserver.example.com/whatever'
```
<Infobox title="How it works" emoji="💡">
Inside the remote storage, spaCy uses a clever **directory structure** to avoid
overwriting files. The top level of the directory structure is a URL-encoded
version of the output's path. Within this directory are subdirectories named
according to a hash of the command string and the command's dependencies.
Finally, within those directories are files, named according to an MD5 hash of
their contents.
<!-- TODO: update with actual real example? -->
<!-- prettier-ignore -->
```yaml
└── urlencoded_file_path # Path of original file
├── some_command_hash # Hash of command you ran
│ ├── some_content_hash # Hash of file content
│ └── another_content_hash
└── another_command_hash
└── third_content_hash
```
</Infobox>
For instance, let's say you had the following command in your `project.yml`:
```yaml
### project.yml
- name: train
help: 'Train a spaCy model using the specified corpus and config'
script:
- 'spacy train ./config.cfg --output training/'
deps:
- 'corpus/train'
- 'corpus/dev'
- 'config.cfg'
outputs:
- 'training/model-best'
```
> #### Example
>
> ```
> └── s3://my-spacy-bucket/training%2Fmodel-best
> └── 1d8cb33a06cc345ad3761c6050934a1b
> └── d8e20c3537a084c5c10d95899fe0b1ff
> ```
After you finish training, you run [`project push`](/api/cli#project-push) to
make sure the `training/model-best` output is saved to remote storage. spaCy
will then construct a hash from your command script and the listed dependencies,
`corpus/train`, `corpus/dev` and `config.cfg`, in order to identify the
execution context of your output. It would then compute an MD5 hash of the
`training/model-best` directory, and use those three pieces of information to
construct the storage URL.
```cli
$ python -m spacy project run train
$ python -m spacy project push
```
If you change the command or one of its dependencies (for instance, by editing
the [`config.cfg`](/usage/training#config) file to tune the hyperparameters, a
different creation hash will be calculated, so when you use
[`project push`](/api/cli#project-push) you won't be overwriting your previous
file. The system even supports multiple outputs for the same file and the same
context, which can happen if your training process is not deterministic, or if
you have dependencies that aren't represented in the command.
In summary, the [`spacy project`](/api/cli#project) remote storages are designed
to make a particular set of trade-offs. Priority is placed on **convenience**,
**correctness** and **avoiding data loss**. You can use
[`project push`](/api/cli#project-push) freely, as you'll never overwrite remote
state, and you don't have to come up with names or version numbers. However,
it's up to you to manage the size of your remote storage, and to remove files
that are no longer relevant to you.
## Integrations {#integrations}
### Data Version Control (DVC) {#dvc} <IntegrationLogo name="dvc" title="DVC" width={70} height="auto" align="right" />
@ -517,16 +650,17 @@ and evaluation set.
<!-- prettier-ignore -->
```yaml
### project.yml
variables:
PRODIGY_DATASET: 'ner_articles'
PRODIGY_LABELS: 'PERSON,ORG,PRODUCT'
PRODIGY_MODEL: 'en_core_web_md'
vars:
prodigy:
dataset: 'ner_articles'
labels: 'PERSON,ORG,PRODUCT'
model: 'en_core_web_md'
commands:
- name: annotate
- script:
- 'python -m prodigy ner.correct {PRODIGY_DATASET} ./assets/raw_data.jsonl {PRODIGY_MODEL} --labels {PRODIGY_LABELS}'
- 'python -m prodigy data-to-spacy ./corpus/train.json ./corpus/eval.json --ner {PRODIGY_DATASET}'
- 'python -m prodigy ner.correct ${vars.prodigy.dataset} ./assets/raw_data.jsonl ${vars.prodigy.model} --labels ${vars.prodigy.labels}'
- 'python -m prodigy data-to-spacy ./corpus/train.json ./corpus/eval.json --ner ${vars.prodigy.dataset}'
- 'python -m spacy convert ./corpus/train.json ./corpus/train.spacy'
- 'python -m spacy convert ./corpus/eval.json ./corpus/eval.spacy'
- deps:

View File

@ -511,21 +511,21 @@ from spacy.language import Language
from spacy.matcher import Matcher
from spacy.tokens import Token
# We're using a component factory because the component needs to be initialized
# with the shared vocab via the nlp object
# We're using a component factory because the component needs to be
# initialized with the shared vocab via the nlp object
@Language.factory("html_merger")
def create_bad_html_merger(nlp, name):
return BadHTMLMerger(nlp)
return BadHTMLMerger(nlp.vocab)
class BadHTMLMerger:
def __init__(self, nlp):
def __init__(self, vocab):
patterns = [
[{"ORTH": "<"}, {"LOWER": "br"}, {"ORTH": ">"}],
[{"ORTH": "<"}, {"LOWER": "br/"}, {"ORTH": ">"}],
]
# Register a new token extension to flag bad HTML
Token.set_extension("bad_html", default=False)
self.matcher = Matcher(nlp.vocab)
self.matcher = Matcher(vocab)
self.matcher.add("BAD_HTML", patterns)
def __call__(self, doc):

View File

@ -243,7 +243,7 @@ file `data.json` in its subdirectory:
### Directory structure {highlight="2-3"}
└── /path/to/model
├── my_component # data serialized by "my_component"
| └── data.json
└── data.json
├── ner # data for "ner" component
├── parser # data for "parser" component
├── tagger # data for "tagger" component

View File

@ -1,13 +1,12 @@
---
title: Training Models
next: /usage/projects
next: /usage/layers-architectures
menu:
- ['Introduction', 'basics']
- ['Quickstart', 'quickstart']
- ['Config System', 'config']
- ['Custom Functions', 'custom-functions']
- ['Transfer Learning', 'transfer-learning']
- ['Parallel Training', 'parallel-training']
# - ['Parallel Training', 'parallel-training']
- ['Internal API', 'api']
---
@ -35,8 +34,8 @@ ready-to-use spaCy models.
The recommended way to train your spaCy models is via the
[`spacy train`](/api/cli#train) command on the command line. It only needs a
single [`config.cfg`](#config) **configuration file** that includes all settings
and hyperparameters. You can optionally [overwritten](#config-overrides)
settings on the command line, and load in a Python file to register
and hyperparameters. You can optionally [overwrite](#config-overrides) settings
on the command line, and load in a Python file to register
[custom functions](#custom-code) and architectures. This quickstart widget helps
you generate a starter config with the **recommended settings** for your
specific use case. It's also available in spaCy as the
@ -82,7 +81,7 @@ $ python -m spacy init fill-config base_config.cfg config.cfg
Instead of exporting your starter config from the quickstart widget and
auto-filling it, you can also use the [`init config`](/api/cli#init-config)
command and specify your requirement and settings and CLI arguments. You can now
command and specify your requirement and settings as CLI arguments. You can now
add your data and run [`train`](/api/cli#train) with your config. See the
[`convert`](/api/cli#convert) command for details on how to convert your data to
spaCy's binary `.spacy` format. You can either include the data paths in the
@ -92,23 +91,8 @@ spaCy's binary `.spacy` format. You can either include the data paths in the
$ python -m spacy train config.cfg --output ./output --paths.train ./train.spacy --paths.dev ./dev.spacy
```
<!-- TODO:
<Project id="some_example_project">
The easiest way to get started with an end-to-end training process is to clone a
[project](/usage/projects) template. Projects let you manage multi-step
workflows, from data preprocessing to training and packaging your model.
</Project>
-->
## Training config {#config}
<!-- > #### Migration from spaCy v2.x
>
> TODO: once we have an answer for how to update the training command
> (`spacy migrate`?), add details here -->
Training config files include all **settings and hyperparameters** for training
your model. Instead of providing lots of arguments on the command line, you only
need to pass your `config.cfg` file to [`spacy train`](/api/cli#train). Under
@ -126,9 +110,10 @@ Some of the main advantages and features of spaCy's training config are:
functions like [model architectures](/api/architectures),
[optimizers](https://thinc.ai/docs/api-optimizers) or
[schedules](https://thinc.ai/docs/api-schedules) and define arguments that are
passed into them. You can also register your own functions to define
[custom architectures](#custom-functions), reference them in your config and
tweak their parameters.
passed into them. You can also
[register your own functions](#custom-functions) to define custom
architectures or methods, reference them in your config and tweak their
parameters.
- **Interpolation.** If you have hyperparameters or other settings used by
multiple components, define them once and reference them as
[variables](#config-interpolation).
@ -226,21 +211,21 @@ passed to the component factory as arguments. This lets you configure the model
settings and hyperparameters. If a component block defines a `source`, the
component will be copied over from an existing pretrained model, with its
existing weights. This lets you include an already trained component in your
model pipeline, or update a pretrained components with more data specific to
your use case.
model pipeline, or update a pretrained component with more data specific to your
use case.
```ini
### config.cfg (excerpt)
[components]
# "parser" and "ner" are sourced from pretrained model
# "parser" and "ner" are sourced from a pretrained model
[components.parser]
source = "en_core_web_sm"
[components.ner]
source = "en_core_web_sm"
# "textcat" and "custom" are created blank from built-in / custom factory
# "textcat" and "custom" are created blank from a built-in / custom factory
[components.textcat]
factory = "textcat"
@ -294,11 +279,11 @@ batch_size = 128
```
To refer to a function instead, you can make `[training.batch_size]` its own
section and use the `@` syntax specify the function and its arguments in this
case [`compounding.v1`](https://thinc.ai/docs/api-schedules#compounding) defined
in the [function registry](/api/top-level#registry). All other values defined in
the block are passed to the function as keyword arguments when it's initialized.
You can also use this mechanism to register
section and use the `@` syntax to specify the function and its arguments in
this case [`compounding.v1`](https://thinc.ai/docs/api-schedules#compounding)
defined in the [function registry](/api/top-level#registry). All other values
defined in the block are passed to the function as keyword arguments when it's
initialized. You can also use this mechanism to register
[custom implementations and architectures](#custom-functions) and reference them
from your configs.
@ -404,13 +389,11 @@ recipe once the dish has already been prepared. You have to make a new one.
spaCy includes a variety of built-in [architectures](/api/architectures) for
different tasks. For example:
<!-- TODO: model return types -->
| Architecture | Description |
| ----------------------------------------------------------------- | --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| [HashEmbedCNN](/api/architectures#HashEmbedCNN) | Build spaCys "standard" embedding layer, which uses hash embedding with subword features and a CNN with layer-normalized maxout. ~~Model[List[Doc], List[Floats2d]]~~ |
| [TransitionBasedParser](/api/architectures#TransitionBasedParser) | Build a [transition-based parser](https://explosion.ai/blog/parsing-english-in-python) model used in the default [`EntityRecognizer`](/api/entityrecognizer) and [`DependencyParser`](/api/dependencyparser). ~~Model[List[Docs], List[List[Floats2d]]]~~ |
| [TextCatEnsemble](/api/architectures#TextCatEnsemble) | Stacked ensemble of a bag-of-words model and a neural network model with an internal CNN embedding layer. Used in the default [`TextCategorizer`](/api/textcategorizer). ~~Model~~ |
| [TextCatEnsemble](/api/architectures#TextCatEnsemble) | Stacked ensemble of a bag-of-words model and a neural network model with an internal CNN embedding layer. Used in the default [`TextCategorizer`](/api/textcategorizer). ~~Model[List[Doc], Floats2d]~~ |
<!-- TODO: link to not yet existing usage page on custom architectures etc. -->
@ -726,9 +709,9 @@ a stream of items into a stream of batches. spaCy has several useful built-in
[batching strategies](/api/top-level#batchers) with customizable sizes, but it's
also easy to implement your own. For instance, the following function takes the
stream of generated [`Example`](/api/example) objects, and removes those which
have the exact same underlying raw text, to avoid duplicates within each batch.
Note that in a more realistic implementation, you'd also want to check whether
the annotations are exactly the same.
have the same underlying raw text, to avoid duplicates within each batch. Note
that in a more realistic implementation, you'd also want to check whether the
annotations are the same.
> #### config.cfg
>
@ -759,71 +742,10 @@ def filter_batch(size: int) -> Callable[[Iterable[Example]], Iterator[List[Examp
return create_filtered_batches
```
<!-- TODO:
<Project id="example_pytorch_model">
Lorem ipsum dolor sit amet, consectetur adipiscing elit. Phasellus interdum
sodales lectus, ut sodales orci ullamcorper id. Sed condimentum neque ut erat
mattis pretium.
</Project>
-->
### Defining custom architectures {#custom-architectures}
<!-- TODO: this should probably move to new section on models -->
## Transfer learning {#transfer-learning}
<!-- TODO: write something, link to embeddings and transformers page should probably wait until transformers/embeddings/transfer learning docs are done -->
### Using transformer models like BERT {#transformers}
spaCy v3.0 lets you use almost any statistical model to power your pipeline. You
can use models implemented in a variety of frameworks. A transformer model is
just a statistical model, so the
[`spacy-transformers`](https://github.com/explosion/spacy-transformers) package
actually has very little work to do: it just has to provide a few functions that
do the required plumbing. It also provides a pipeline component,
[`Transformer`](/api/transformer), that lets you do multi-task learning and lets
you save the transformer outputs for later use.
<!-- TODO:
<Project id="en_core_bert">
Try out a BERT-based model pipeline using this project template: swap in your
data, edit the settings and hyperparameters and train, evaluate, package and
visualize your model.
</Project>
-->
For more details on how to integrate transformer models into your training
config and customize the implementations, see the usage guide on
[training transformers](/usage/embeddings-transformers#transformers-training).
### Pretraining with spaCy {#pretraining}
<!-- TODO: document spacy pretrain, objectives etc. should probably wait until transformers/embeddings/transfer learning docs are done -->
## Parallel Training with Ray {#parallel-training}
<!-- TODO:
<Project id="some_example_project">
Lorem ipsum dolor sit amet, consectetur adipiscing elit. Phasellus interdum
sodales lectus, ut sodales orci ullamcorper id. Sed condimentum neque ut erat
mattis pretium.
</Project>
-->
## Internal training API {#api}
<Infobox variant="warning">
@ -843,8 +765,8 @@ called the **gold standard**. It's initialized with a [`Doc`](/api/doc) object
that will hold the predictions, and another `Doc` object that holds the
gold-standard annotations. It also includes the **alignment** between those two
documents if they differ in tokenization. The `Example` class ensures that spaCy
can rely on one **standardized format** that's passed through the pipeline.
Here's an example of a simple `Example` for part-of-speech tags:
can rely on one **standardized format** that's passed through the pipeline. For
instance, let's say we want to define gold-standard part-of-speech tags:
```python
words = ["I", "like", "stuff"]
@ -856,9 +778,10 @@ reference = Doc(vocab, words=words).from_array("TAG", numpy.array(tag_ids, dtype
example = Example(predicted, reference)
```
Alternatively, the `reference` `Doc` with the gold-standard annotations can be
created from a dictionary with keyword arguments specifying the annotations,
like `tags` or `entities`. Using the `Example` object and its gold-standard
As this is quite verbose, there's an alternative way to create the reference
`Doc` with the gold-standard annotations. The function `Example.from_dict` takes
a dictionary with keyword arguments specifying the annotations, like `tags` or
`entities`. Using the resulting `Example` object and its gold-standard
annotations, the model can be updated to learn a sentence of three words with
their assigned part-of-speech tags.
@ -883,8 +806,8 @@ example = Example.from_dict(predicted, {"tags": tags})
Here's another example that shows how to define gold-standard named entities.
The letters added before the labels refer to the tags of the
[BILUO scheme](/usage/linguistic-features#updating-biluo) `O` is a token
outside an entity, `U` an single entity unit, `B` the beginning of an entity,
`I` a token inside an entity and `L` the last token of an entity.
outside an entity, `U` a single entity unit, `B` the beginning of an entity, `I`
a token inside an entity and `L` the last token of an entity.
```python
doc = Doc(nlp.vocab, words=["Facebook", "released", "React", "in", "2014"])
@ -958,7 +881,7 @@ dictionary of annotations:
```diff
text = "Facebook released React in 2014"
annotations = {"entities": ["U-ORG", "O", "U-TECHNOLOGY", "O", "U-DATE"]}
+ example = Example.from_dict(nlp.make_doc(text), {"entities": entities})
+ example = Example.from_dict(nlp.make_doc(text), annotations)
- nlp.update([text], [annotations])
+ nlp.update([example])
```

View File

@ -10,6 +10,32 @@ menu:
## Summary {#summary}
<Grid cols={2}>
<div>
</div>
<Infobox title="Table of Contents" id="toc">
- [Summary](#summary)
- [New features](#features)
- [Training & config system](#features-training)
- [Transformer-based pipelines](#features-transformers)
- [Custom models](#features-custom-models)
- [End-to-end project workflows](#features-projects)
- [New built-in components](#features-pipeline-components)
- [New custom component API](#features-components)
- [Python type hints](#features-types)
- [New methods & attributes](#new-methods)
- [New & updated documentation](#new-docs)
- [Backwards incompatibilities](#incompat)
- [Migrating from spaCy v2.x](#migrating)
</Infobox>
</Grid>
## New Features {#features}
### New training workflow and config system {#features-training}
@ -28,6 +54,8 @@ menu:
### Transformer-based pipelines {#features-transformers}
![Pipeline components listening to shared embedding component](../images/tok2vec-listener.svg)
<Infobox title="Details & Documentation" emoji="📖" list>
- **Usage:** [Embeddings & Transformers](/usage/embeddings-transformers),
@ -38,7 +66,7 @@ menu:
- **Architectures: ** [TransformerModel](/api/architectures#TransformerModel),
[Tok2VecListener](/api/architectures#transformers-Tok2VecListener),
[Tok2VecTransformer](/api/architectures#Tok2VecTransformer)
- **Models:** [`en_core_bert_sm`](/models/en)
- **Models:** [`en_core_trf_lg_sm`](/models/en)
- **Implementation:**
[`spacy-transformers`](https://github.com/explosion/spacy-transformers)
@ -46,8 +74,57 @@ menu:
### Custom models using any framework {#features-custom-models}
<Infobox title="Details & Documentation" emoji="📖" list>
<!-- TODO: link to new custom models page -->
- **Thinc: **
[Wrapping PyTorch, TensorFlow & MXNet](https://thinc.ai/docs/usage-frameworks)
- **API:** [Model architectures](/api/architectures), [`Pipe`](/api/pipe)
</Infobox>
### Manage end-to-end workflows with projects {#features-projects}
<!-- TODO: update example -->
> #### Example
>
> ```cli
> # Clone a project template
> $ python -m spacy project clone example
> $ cd example
> # Download data assets
> $ python -m spacy project assets
> # Run a workflow
> $ python -m spacy project run train
> ```
spaCy projects let you manage and share **end-to-end spaCy workflows** for
different **use cases and domains**, and orchestrate training, packaging and
serving your custom models. You can start off by cloning a pre-defined project
template, adjust it to fit your needs, load in your data, train a model, export
it as a Python package, upload your outputs to a remote storage and share your
results with your team.
![Illustration of project workflow and commands](../images/projects.svg)
spaCy projects also make it easy to **integrate with other tools** in the data
science and machine learning ecosystem, including [DVC](/usage/projects#dvc) for
data version control, [Prodigy](/usage/projects#prodigy) for creating labelled
data, [Streamlit](/usage/projects#streamlit) for building interactive apps,
[FastAPI](/usage/projects#fastapi) for serving models in production,
[Ray](/usage/projects#ray) for parallel training,
[Weights & Biases](/usage/projects#wandb) for experiment tracking, and more!
<!-- <Project id="some_example_project">
The easiest way to get started with an end-to-end training process is to clone a
[project](/usage/projects) template. Projects let you manage multi-step
workflows, from data preprocessing to training and packaging your model.
</Project>-->
<Infobox title="Details & Documentation" emoji="📖" list>
- **Usage:** [spaCy projects](/usage/projects),
@ -59,6 +136,16 @@ menu:
### New built-in pipeline components {#features-pipeline-components}
spaCy v3.0 includes several new trainable and rule-based components that you can
add to your pipeline and customize for your use case:
> #### Example
>
> ```python
> nlp = spacy.blank("en")
> nlp.add_pipe("lemmatizer")
> ```
| Name | Description |
| ----------------------------------------------- | ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| [`SentenceRecognizer`](/api/sentencerecognizer) | Trainable component for sentence segmentation. |
@ -78,15 +165,37 @@ menu:
### New and improved pipeline component APIs {#features-components}
- `Language.factory`, `Language.component`
- `Language.analyze_pipes`
- Adding components from other models
> #### Example
>
> ```python
> @Language.component("my_component")
> def my_component(doc):
> return doc
>
> nlp.add_pipe("my_component")
> nlp.add_pipe("ner", source=other_nlp)
> nlp.analyze_pipes(pretty=True)
> ```
Defining, configuring, reusing, training and analyzing pipeline components is
now easier and more convenient. The `@Language.component` and
`@Language.factory` decorators let you register your component, define its
default configuration and meta data, like the attribute values it assigns and
requires. Any custom component can be included during training, and sourcing
components from existing pretrained models lets you **mix and match custom
pipelines**. The `nlp.analyze_pipes` method outputs structured information about
the current pipeline and its components, including the attributes they assign,
the scores they compute during training and whether any required attributes
aren't set.
<Infobox title="Details & Documentation" emoji="📖" list>
- **Usage:** [Custom components](/usage/processing-pipelines#custom_components),
[Defining components during training](/usage/training#config-components)
- **API:** [`Language`](/api/language)
[Defining components for training](/usage/training#config-components)
- **API:** [`@Language.component`](/api/language#component),
[`@Language.factory`](/api/language#factory),
[`Language.add_pipe`](/api/language#add_pipe),
[`Language.analyze_pipes`](/api/language#analyze_pipes)
- **Implementation:**
[`spacy/language.py`](https://github.com/explosion/spaCy/tree/develop/spacy/language.py)
@ -136,13 +245,14 @@ in your config and see validation errors if the argument values don't match.
</Infobox>
### New methods, attributes and commands
### New methods, attributes and commands {#new-methods}
The following methods, attributes and commands are new in spaCy v3.0.
| Name | Description |
| ----------------------------------------------------------------------------------------------------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ |
| [`Token.lex`](/api/token#attributes) | Access a token's [`Lexeme`](/api/lexeme). |
| [`Token.morph`](/api/token#attributes) [`Token.morph_`](/api/token#attributes) | Access a token's morphological analysis. |
| [`Language.select_pipes`](/api/language#select_pipes) | Context manager for enabling or disabling specific pipeline components for a block. |
| [`Language.analyze_pipes`](/api/language#analyze_pipes) | [Analyze](/usage/processing-pipelines#analysis) components and their interdependencies. |
| [`Language.resume_training`](/api/language#resume_training) | Experimental: continue training a pretrained model and initialize "rehearsal" for components that implement a `rehearse` method to prevent catastrophic forgetting. |
@ -153,9 +263,55 @@ The following methods, attributes and commands are new in spaCy v3.0.
| [`Pipe.score`](/api/pipe#score) | Method on trainable pipeline components that returns a dictionary of evaluation scores. |
| [`registry`](/api/top-level#registry) | Function registry to map functions to string names that can be referenced in [configs](/usage/training#config). |
| [`util.load_meta`](/api/top-level#util.load_meta) [`util.load_config`](/api/top-level#util.load_config) | Updated helpers for loading a model's [`meta.json`](/api/data-formats#meta) and [`config.cfg`](/api/data-formats#config). |
| [`util.get_installed_models`](/api/top-level#util.get_installed_models) | Names of all models installed in the environment. |
| [`init config`](/api/cli#init-config) [`init fill-config`](/api/cli#init-fill-config) [`debug config`](/api/cli#debug-config) | CLI commands for initializing, auto-filling and debugging [training configs](/usage/training). |
| [`project`](/api/cli#project) | Suite of CLI commands for cloning, running and managing [spaCy projects](/usage/projects). |
### New and updated documentation {#new-docs}
<Grid cols={2} gutterBottom={false}>
<div>
To help you get started with spaCy v3.0 and the new features, we've added
several new or rewritten documentation pages, including a new usage guide on
[embeddings, transformers and transfer learning](/usage/embeddings-transformers),
a guide on [training models](/usage/training) rewritten from scratch, a page
explaining the new [spaCy projects](/usage/projects) and updated usage
documentation on
[custom pipeline components](/usage/processing-pipelines#custom-components).
We've also added a bunch of new illustrations and new API reference pages
documenting spaCy's machine learning [model architectures](/api/architectures)
and the expected [data formats](/api/data-formats). API pages about
[pipeline components](/api/#architecture-pipeline) now include more information,
like the default config and implementation, and we've adopted a more detailed
format for documenting argument and return types.
</div>
[![Library architecture](../images/architecture.svg)](/api)
</Grid>
<Infobox title="New or reworked documentation" emoji="📖" list>
- **Usage: ** [Embeddings & Transformers](/usage/embeddings-transformers),
[Training models](/usage/training),
[Layers & Architectures](/usage/layers-architectures),
[Projects](/usage/projects),
[Custom pipeline components](/usage/processing-pipelines#custom-components),
[Custom tokenizers](/usage/linguistic-features#custom-tokenizer)
- **API Reference: ** [Library architecture](/api),
[Model architectures](/api/architectures), [Data formats](/api/data-formats)
- **New Classes: ** [`Example`](/api/example), [`Tok2Vec`](/api/tok2vec),
[`Transformer`](/api/transformer), [`Lemmatizer`](/api/lemmatizer),
[`Morphologizer`](/api/morphologizer),
[`AttributeRuler`](/api/attributeruler),
[`SentenceRecognizer`](/api/sentencerecognizer), [`Pipe`](/api/pipe),
[`Corpus`](/api/corpus)
</Infobox>
## Backwards Incompatibilities {#incompat}
As always, we've tried to keep the breaking changes to a minimum and focus on
@ -186,7 +342,7 @@ Note that spaCy v3.0 now requires **Python 3.6+**.
[training config](/usage/training#config).
- [`Language.add_pipe`](/api/language#add_pipe) now takes the **string name** of
the component factory instead of the component function.
- **Custom pipeline components** now needs to be decorated with the
- **Custom pipeline components** now need to be decorated with the
[`@Language.component`](/api/language#component) or
[`@Language.factory`](/api/language#factory) decorator.
- [`Language.update`](/api/language#update) now takes a batch of
@ -213,14 +369,15 @@ Note that spaCy v3.0 now requires **Python 3.6+**.
### Removed or renamed API {#incompat-removed}
| Removed | Replacement |
| ------------------------------------------------------ | ----------------------------------------------------------------------------------------- |
| -------------------------------------------------------- | ------------------------------------------------------------------------------------------ |
| `Language.disable_pipes` | [`Language.select_pipes`](/api/language#select_pipes) |
| `GoldParse` | [`Example`](/api/example) |
| `GoldCorpus` | [`Corpus`](/api/corpus) |
| `KnowledgeBase.load_bulk` `KnowledgeBase.dump` | [`KnowledgeBase.from_disk`](/api/kb#from_disk) [`KnowledgeBase.to_disk`](/api/kb#to_disk) |
| `KnowledgeBase.load_bulk`, `KnowledgeBase.dump` | [`KnowledgeBase.from_disk`](/api/kb#from_disk), [`KnowledgeBase.to_disk`](/api/kb#to_disk) |
| `spacy init-model` | [`spacy init model`](/api/cli#init-model) |
| `spacy debug-data` | [`spacy debug data`](/api/cli#debug-data) |
| `spacy profile` | [`spacy debug profile`](/api/cli#debug-profile) |
| `spacy link` `util.set_data_path` `util.get_data_path` | not needed, model symlinks are deprecated |
| `spacy link`, `util.set_data_path`, `util.get_data_path` | not needed, model symlinks are deprecated |
The following deprecated methods, attributes and arguments were removed in v3.0.
Most of them have been **deprecated for a while** and many would previously
@ -236,7 +393,7 @@ on them.
| `Language.tagger`, `Language.parser`, `Language.entity` | [`Language.get_pipe`](/api/language#get_pipe) |
| keyword-arguments like `vocab=False` on `to_disk`, `from_disk`, `to_bytes`, `from_bytes` | `exclude=["vocab"]` |
| `n_threads` argument on [`Tokenizer`](/api/tokenizer), [`Matcher`](/api/matcher), [`PhraseMatcher`](/api/phrasematcher) | `n_process` |
| `verbose` argument on [`Language.evaluate`] | logging |
| `verbose` argument on [`Language.evaluate`](/api/language#evaluate) | logging (`DEBUG`) |
| `SentenceSegmenter` hook, `SimilarityHook` | [user hooks](/usage/processing-pipelines#custom-components-user-hooks), [`Sentencizer`](/api/sentencizer), [`SentenceRecognizer`](/api/sentenceregognizer) |
## Migrating from v2.x {#migrating}

View File

@ -122,9 +122,9 @@ import DisplacyEntHtml from 'images/displacy-ent2.html'
The entity visualizer lets you customize the following `options`:
| Argument | Description |
| -------- | -------------------------------------------------------------------------------------------------------------------------- |
| -------- | ------------------------------------------------------------------------------------------------------------- |
| `ents` | Entity types to highlight (`None` for all types). Defaults to `None`. ~~Optional[List[str]]~~ | `None` |
| `colors` | Color overrides. Entity types in uppercase should be mapped to color names or values. Defaults to `{}`. ~~Dict[str, str]~~ |
| `colors` | Color overrides. Entity types should be mapped to color names or values. Defaults to `{}`. ~~Dict[str, str]~~ |
If you specify a list of `ents`, only those entity types will be rendered for
example, you can choose to display `PERSON` entities. Internally, the visualizer

View File

@ -24,6 +24,11 @@
"tag": "new"
},
{ "text": "Training Models", "url": "/usage/training", "tag": "new" },
{
"text": "Layers & Model Architectures",
"url": "/usage/layers-architectures",
"tag": "new"
},
{ "text": "spaCy Projects", "url": "/usage/projects", "tag": "new" },
{ "text": "Saving & Loading", "url": "/usage/saving-loading" },
{ "text": "Visualizers", "url": "/usage/visualizers" }

View File

@ -29,6 +29,8 @@
"Optimizer": "https://thinc.ai/docs/api-optimizers",
"Model": "https://thinc.ai/docs/api-model",
"Ragged": "https://thinc.ai/docs/api-types#ragged",
"Padded": "https://thinc.ai/docs/api-types#padded",
"Ints2d": "https://thinc.ai/docs/api-types#types",
"Floats2d": "https://thinc.ai/docs/api-types#types",
"Floats3d": "https://thinc.ai/docs/api-types#types",
"FloatsXd": "https://thinc.ai/docs/api-types#types",

View File

@ -5,6 +5,8 @@ import Icon from './icon'
import { isString } from './util'
import classes from '../styles/table.module.sass'
const FOOT_ROW_REGEX = /^(RETURNS|YIELDS|CREATES|PRINTS|EXECUTES|UPLOADS|DOWNLOADS)/
function isNum(children) {
return isString(children) && /^\d+[.,]?[\dx]+?(|x|ms|mb|gb|k|m)?$/i.test(children)
}
@ -43,7 +45,6 @@ function isDividerRow(children) {
}
function isFootRow(children) {
const rowRegex = /^(RETURNS|YIELDS|CREATES|PRINTS|EXECUTES)/
if (children.length && children[0].props.name === 'td') {
const cellChildren = children[0].props.children
if (
@ -52,7 +53,7 @@ function isFootRow(children) {
cellChildren.props.children &&
isString(cellChildren.props.children)
) {
return rowRegex.test(cellChildren.props.children)
return FOOT_ROW_REGEX.test(cellChildren.props.children)
}
}
return false

View File

@ -67,7 +67,7 @@
border: 0
// Special style for types in API tables
td > &:last-child
td:not(:first-child) > &:last-child
display: block
border-top: 1px dotted var(--color-subtle)
border-radius: 0

View File

@ -12,7 +12,7 @@
background: var(--color-subtle-light)
color: var(--color-subtle-dark)
border-radius: 50%
padding: 0.5rem
padding: 0.5rem 0.65rem 0.5rem 0
transition: color 0.2s ease
float: right
margin-left: 3rem