Mirror of https://github.com/explosion/spaCy.git
Synced 2025-11-04 01:48:04 +03:00

Merge branch 'master' into spacy.io

This commit is contained in:
commit 3246cf8b2b

.github/contributors/peter-exos.md  (new file, 106 lines, vendored)
@@ -0,0 +1,106 @@
# spaCy contributor agreement

This spaCy Contributor Agreement (**"SCA"**) is based on the
[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf).
The SCA applies to any contribution that you make to any product or project
managed by us (the **"project"**), and sets out the intellectual property rights
you grant to us in the contributed materials. The term **"us"** shall mean
[ExplosionAI GmbH](https://explosion.ai/legal). The term
**"you"** shall mean the person or entity identified below.

If you agree to be bound by these terms, fill in the information requested
below and include the filled-in version with your first pull request, under the
folder [`.github/contributors/`](/.github/contributors/). The name of the file
should be your GitHub username, with the extension `.md`. For example, the user
example_user would create the file `.github/contributors/example_user.md`.

Read this agreement carefully before signing. These terms and conditions
constitute a binding legal agreement.

## Contributor Agreement

1. The term "contribution" or "contributed materials" means any source code,
object code, patch, tool, sample, graphic, specification, manual,
documentation, or any other material posted or submitted by you to the project.

2. With respect to any worldwide copyrights, or copyright applications and
registrations, in your contribution:

    * you hereby assign to us joint ownership, and to the extent that such
    assignment is or becomes invalid, ineffective or unenforceable, you hereby
    grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
    royalty-free, unrestricted license to exercise all rights under those
    copyrights. This includes, at our option, the right to sublicense these same
    rights to third parties through multiple levels of sublicensees or other
    licensing arrangements;

    * you agree that each of us can do all things in relation to your
    contribution as if each of us were the sole owners, and if one of us makes
    a derivative work of your contribution, the one who makes the derivative
    work (or has it made) will be the sole owner of that derivative work;

    * you agree that you will not assert any moral rights in your contribution
    against us, our licensees or transferees;

    * you agree that we may register a copyright in your contribution and
    exercise all ownership rights associated with it; and

    * you agree that neither of us has any duty to consult with, obtain the
    consent of, pay or render an accounting to the other for any use or
    distribution of your contribution.

3. With respect to any patents you own, or that you can license without payment
to any third party, you hereby grant to us a perpetual, irrevocable,
non-exclusive, worldwide, no-charge, royalty-free license to:

    * make, have made, use, sell, offer to sell, import, and otherwise transfer
    your contribution in whole or in part, alone or in combination with or
    included in any product, work or materials arising out of the project to
    which your contribution was submitted, and

    * at our option, to sublicense these same rights to third parties through
    multiple levels of sublicensees or other licensing arrangements.

4. Except as set out above, you keep all right, title, and interest in your
contribution. The rights that you grant to us under these terms are effective
on the date you first submitted a contribution to us, even if your submission
took place before the date you sign these terms.

5. You covenant, represent, warrant and agree that:

    * each contribution that you submit is and shall be an original work of
    authorship and you can legally grant the rights set out in this SCA;

    * to the best of your knowledge, each contribution will not violate any
    third party's copyrights, trademarks, patents, or other intellectual
    property rights; and

    * each contribution shall be in compliance with U.S. export control laws and
    other applicable export and import laws. You agree to notify us if you
    become aware of any circumstance which would make any of the foregoing
    representations inaccurate in any respect. We may publicly disclose your
    participation in the project, including the fact that you have signed the SCA.

6. This SCA is governed by the laws of the State of California and applicable
U.S. Federal law. Any choice of law rules will not apply.

7. Please place an “x” next to one of the applicable statements below. Please do
NOT mark both statements:

    * [ ] I am signing on behalf of myself as an individual and no other person
    or entity, including my employer, has or will have rights with respect to my
    contributions.

    * [x] I am signing on behalf of my employer or a legal entity and I have the
    actual authority to contractually bind that entity.

## Contributor Details

| Field                         | Entry          |
| ----------------------------- | -------------- |
| Name                          | Peter Baumann  |
| Company name (if applicable)  | Exos Financial |
| Title or role (if applicable) | Data scientist |
| Date                          | Feb 1st, 2021  |
| GitHub username               | peter-exos     |
| Website (optional)            |                |

spacy/about.py
@@ -1,6 +1,6 @@
 # fmt: off
 __title__ = "spacy"
-__version__ = "3.0.1"
+__version__ = "3.0.3"
 __download_url__ = "https://github.com/explosion/spacy-models/releases/download"
 __compatibility__ = "https://raw.githubusercontent.com/explosion/spacy-models/master/compatibility.json"
 __projects__ = "https://github.com/explosion/projects"

spacy/cli/_util.py
@@ -16,7 +16,7 @@ import os

 from ..schemas import ProjectConfigSchema, validate
 from ..util import import_file, run_command, make_tempdir, registry, logger
-from ..util import is_compatible_version, ENV_VARS
+from ..util import is_compatible_version, SimpleFrozenDict, ENV_VARS
 from .. import about

 if TYPE_CHECKING:
@@ -111,26 +111,33 @@ def _parse_overrides(args: List[str], is_cli: bool = False) -> Dict[str, Any]:
                     value = "true"
                 else:
                     value = args.pop(0)
-            try:
-                result[opt] = srsly.json_loads(value)
-            except ValueError:
-                result[opt] = str(value)
+            result[opt] = _parse_override(value)
         else:
             msg.fail(f"{err}: name should start with --", exits=1)
     return result


+def _parse_override(value: Any) -> Any:
+    # Just like we do in the config, we're calling json.loads on the
+    # values. But since they come from the CLI, it'd be unintuitive to
+    # explicitly mark strings with escaped quotes. So we're working
+    # around that here by falling back to a string if parsing fails.
+    # TODO: improve logic to handle simple types like list of strings?
+    try:
+        return srsly.json_loads(value)
+    except ValueError:
+        return str(value)
+
+
-def load_project_config(path: Path, interpolate: bool = True) -> Dict[str, Any]:
+def load_project_config(
+    path: Path, interpolate: bool = True, overrides: Dict[str, Any] = SimpleFrozenDict()
+) -> Dict[str, Any]:
     """Load the project.yml file from a directory and validate it. Also make
     sure that all directories defined in the config exist.

     path (Path): The path to the project directory.
     interpolate (bool): Whether to substitute project variables.
+    overrides (Dict[str, Any]): Optional config overrides.
     RETURNS (Dict[str, Any]): The loaded project.yml.
     """
     config_path = path / PROJECT_FILE
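For illustration, a minimal sketch of how the extracted helper treats CLI values, assuming only srsly is installed; the helper name and sample values here are illustrative, not taken from the diff:

import srsly


def parse_override(value):
    # Try JSON first so numbers, booleans and lists round-trip;
    # fall back to a plain string if parsing fails.
    try:
        return srsly.json_loads(value)
    except ValueError:
        return str(value)


assert parse_override("123") == 123                          # JSON number
assert parse_override("true") is True                        # JSON boolean
assert parse_override("[1, 2]") == [1, 2]                    # JSON list
assert parse_override("en_core_web_sm") == "en_core_web_sm"  # string fallback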
@@ -154,20 +161,36 @@ def load_project_config(path: Path, interpolate: bool = True) -> Dict[str, Any]:
         if not dir_path.exists():
             dir_path.mkdir(parents=True)
     if interpolate:
-        err = "project.yml validation error"
+        err = f"{PROJECT_FILE} validation error"
         with show_validation_error(title=err, hint_fill=False):
-            config = substitute_project_variables(config)
+            config = substitute_project_variables(config, overrides)
     return config


-def substitute_project_variables(config: Dict[str, Any], overrides: Dict = {}):
-    key = "vars"
+def substitute_project_variables(
+    config: Dict[str, Any],
+    overrides: Dict[str, Any] = SimpleFrozenDict(),
+    key: str = "vars",
+    env_key: str = "env",
+) -> Dict[str, Any]:
+    """Interpolate variables in the project file using the config system.
+
+    config (Dict[str, Any]): The project config.
+    overrides (Dict[str, Any]): Optional config overrides.
+    key (str): Key containing variables in project config.
+    env_key (str): Key containing environment variable mapping in project config.
+    RETURNS (Dict[str, Any]): The interpolated project config.
+    """
     config.setdefault(key, {})
-    config[key].update(overrides)
+    config.setdefault(env_key, {})
+    # Substitute references to env vars with their values
+    for config_var, env_var in config[env_key].items():
+        config[env_key][config_var] = _parse_override(os.environ.get(env_var, ""))
     # Need to put variables in the top scope again so we can have a top-level
     # section "project" (otherwise, a list of commands in the top scope wouldn't
     # be allowed by Thinc's config system)
-    cfg = Config({"project": config, key: config[key]})
+    cfg = Config({"project": config, key: config[key], env_key: config[env_key]})
+    cfg = Config().from_str(cfg.to_str(), overrides=overrides)
     interpolated = cfg.interpolate()
     return dict(interpolated["project"])
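Together, the overrides and env handling let a project.yml map its own variables to environment variables. A hypothetical project.yml sketch; the variable and command names are invented, not from this commit:

vars:
  gpu: -1

env:
  gpu: GPU_ID  # resolved from os.environ["GPU_ID"] via _parse_override

commands:
  - name: train
    script:
      - "python -m spacy train config.cfg --gpu-id ${env.gpu}"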

spacy/cli/evaluate.py
@@ -175,10 +175,13 @@ def render_parses(
 def print_prf_per_type(
     msg: Printer, scores: Dict[str, Dict[str, float]], name: str, type: str
 ) -> None:
-    data = [
-        (k, f"{v['p']*100:.2f}", f"{v['r']*100:.2f}", f"{v['f']*100:.2f}")
-        for k, v in scores.items()
-    ]
+    data = []
+    for key, value in scores.items():
+        row = [key]
+        for k in ("p", "r", "f"):
+            v = value[k]
+            row.append(f"{v * 100:.2f}" if isinstance(v, (int, float)) else v)
+        data.append(row)
     msg.table(
         data,
         header=("", "P", "R", "F"),
@@ -191,7 +194,10 @@ def print_textcats_auc_per_cat(
     msg: Printer, scores: Dict[str, Dict[str, float]]
 ) -> None:
     msg.table(
-        [(k, f"{v:.2f}") for k, v in scores.items()],
+        [
+            (k, f"{v:.2f}" if isinstance(v, (float, int)) else v)
+            for k, v in scores.items()
+        ],
         header=("", "ROC AUC"),
         aligns=("l", "r"),
         title="Textcat ROC AUC (per label)",
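The new isinstance guards matter because a per-type score may be non-numeric (for example None when a label never occurs), and applying a float format spec to None raises a TypeError. A small sketch of the guarded formatting, with invented values:

v = None  # e.g. a score that couldn't be computed for a label
cell = f"{v * 100:.2f}" if isinstance(v, (int, float)) else v
assert cell is None  # passed through unformatted

v = 0.875
cell = f"{v * 100:.2f}" if isinstance(v, (int, float)) else v
assert cell == "87.50"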

spacy/cli/project/run.py
@@ -3,19 +3,23 @@ from pathlib import Path
 from wasabi import msg
 import sys
+import srsly
 import typer

 from ... import about
 from ...git_info import GIT_VERSION
 from ...util import working_dir, run_command, split_command, is_cwd, join_command
 from ...util import SimpleFrozenList, is_minor_version_match, ENV_VARS
-from ...util import check_bool_env_var
+from ...util import check_bool_env_var, SimpleFrozenDict
 from .._util import PROJECT_FILE, PROJECT_LOCK, load_project_config, get_hash
-from .._util import get_checksum, project_cli, Arg, Opt, COMMAND
+from .._util import get_checksum, project_cli, Arg, Opt, COMMAND, parse_config_overrides


-@project_cli.command("run")
+@project_cli.command(
+    "run", context_settings={"allow_extra_args": True, "ignore_unknown_options": True}
+)
 def project_run_cli(
     # fmt: off
+    ctx: typer.Context,  # This is only used to read additional arguments
     subcommand: str = Arg(None, help=f"Name of command defined in the {PROJECT_FILE}"),
     project_dir: Path = Arg(Path.cwd(), help="Location of project directory. Defaults to current working directory.", exists=True, file_okay=False),
     force: bool = Opt(False, "--force", "-F", help="Force re-running steps, even if nothing changed"),
@@ -33,13 +37,15 @@ def project_run_cli(
     if show_help or not subcommand:
         print_run_help(project_dir, subcommand)
     else:
-        project_run(project_dir, subcommand, force=force, dry=dry)
+        overrides = parse_config_overrides(ctx.args)
+        project_run(project_dir, subcommand, overrides=overrides, force=force, dry=dry)


 def project_run(
     project_dir: Path,
     subcommand: str,
     *,
+    overrides: Dict[str, Any] = SimpleFrozenDict(),
     force: bool = False,
     dry: bool = False,
     capture: bool = False,
@@ -59,7 +65,7 @@ def project_run(
         when you want to turn over execution to the command, and capture=True
         when you want to run the command more like a function.
     """
-    config = load_project_config(project_dir)
+    config = load_project_config(project_dir, overrides=overrides)
     commands = {cmd["name"]: cmd for cmd in config.get("commands", [])}
     workflows = config.get("workflows", {})
     validate_subcommand(commands.keys(), workflows.keys(), subcommand)
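With parse_config_overrides reading ctx.args, extra command-line flags now become variable overrides for the project config. A hypothetical invocation; the subcommand and variable names are invented:

python -m spacy project run train . --vars.gpu_id 0

The --vars.gpu_id value is parsed like any other override (JSON first, string fallback) and substituted into the project config before the command runs.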

spacy/cli/templates/quickstart_training_recommendations.yml
@@ -28,6 +28,15 @@ bg:
     accuracy:
       name: iarfmoose/roberta-base-bulgarian
       size_factor: 3
+bn:
+  word_vectors: null
+  transformer:
+    efficiency:
+      name: sagorsarker/bangla-bert-base
+      size_factor: 3
+    accuracy:
+      name: sagorsarker/bangla-bert-base
+      size_factor: 3
 da:
   word_vectors: da_core_news_lg
   transformer:
@@ -104,10 +113,10 @@ hi:
   word_vectors: null
   transformer:
     efficiency:
-      name: monsoon-nlp/hindi-tpu-electra
+      name: ai4bharat/indic-bert
       size_factor: 3
     accuracy:
-      name: monsoon-nlp/hindi-tpu-electra
+      name: ai4bharat/indic-bert
       size_factor: 3
 id:
   word_vectors: null
@@ -185,10 +194,10 @@ si:
   word_vectors: null
   transformer:
     efficiency:
-      name: keshan/SinhalaBERTo
+      name: setu4993/LaBSE
       size_factor: 3
     accuracy:
-      name: keshan/SinhalaBERTo
+      name: setu4993/LaBSE
       size_factor: 3
 sv:
   word_vectors: null
@@ -203,10 +212,10 @@ ta:
   word_vectors: null
   transformer:
     efficiency:
-      name: monsoon-nlp/tamillion
+      name: ai4bharat/indic-bert
       size_factor: 3
     accuracy:
-      name: monsoon-nlp/tamillion
+      name: ai4bharat/indic-bert
       size_factor: 3
 te:
   word_vectors: null

spacy/errors.py
@@ -579,8 +579,8 @@ class Errors:
     E922 = ("Component '{name}' has been initialized with an output dimension of "
             "{nO} - cannot add any more labels.")
     E923 = ("It looks like there is no proper sample data to initialize the "
-            "Model of component '{name}'. This is likely a bug in spaCy, so "
-            "feel free to open an issue: https://github.com/explosion/spaCy/issues")
+            "Model of component '{name}'. To check your input data paths and "
+            "annotation, run: python -m spacy debug data config.cfg")
     E924 = ("The '{name}' component does not seem to be initialized properly. "
             "This is likely a bug in spaCy, so feel free to open an issue: "
             "https://github.com/explosion/spaCy/issues")

spacy/lang/tn/__init__.py  (new file, 18 lines)
@@ -0,0 +1,18 @@
from .stop_words import STOP_WORDS
from .lex_attrs import LEX_ATTRS
from .punctuation import TOKENIZER_INFIXES
from ...language import Language


class SetswanaDefaults(Language.Defaults):
    infixes = TOKENIZER_INFIXES
    stop_words = STOP_WORDS
    lex_attr_getters = LEX_ATTRS


class Setswana(Language):
    lang = "tn"
    Defaults = SetswanaDefaults


__all__ = ["Setswana"]
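With the language registered, a blank Setswana pipeline can be created like any other. A short usage sketch, assuming a spaCy build that includes this commit:

import spacy

nlp = spacy.blank("tn")  # Setswana, resolved via spacy.lang.tn
doc = nlp("Johannesburg ke toropo e kgolo mo Afrika Borwa.")
print([token.text for token in doc])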

spacy/lang/tn/examples.py  (new file, 15 lines)
@@ -0,0 +1,15 @@
"""
Example sentences to test spaCy and its language models.
>>> from spacy.lang.tn.examples import sentences
>>> docs = nlp.pipe(sentences)
"""


sentences = [
    "Apple e nyaka go reka JSE ka tlhwatlhwa ta R1 billion",
    "Johannesburg ke toropo e kgolo mo Afrika Borwa.",
    "O ko kae?",
    "ke mang presidente ya Afrika Borwa?",
    "ke eng toropo kgolo ya Afrika Borwa?",
    "Nelson Mandela o belegwe leng?",
]

spacy/lang/tn/lex_attrs.py  (new file, 107 lines)
@@ -0,0 +1,107 @@
from ...attrs import LIKE_NUM

_num_words = [
    "lefela",
    "nngwe",
    "pedi",
    "tharo",
    "nne",
    "tlhano",
    "thataro",
    "supa",
    "robedi",
    "robongwe",
    "lesome",
    "lesomenngwe",
    "lesomepedi",
    "sometharo",
    "somenne",
    "sometlhano",
    "somethataro",
    "somesupa",
    "somerobedi",
    "somerobongwe",
    "someamabedi",
    "someamararo",
    "someamane",
    "someamatlhano",
    "someamarataro",
    "someamasupa",
    "someamarobedi",
    "someamarobongwe",
    "lekgolo",
    "sekete",
    "milione",
    "bilione",
    "terilione",
    "kwatirilione",
    "gajillione",
    "bazillione",
]


_ordinal_words = [
    "ntlha",
    "bobedi",
    "boraro",
    "bone",
    "botlhano",
    "borataro",
    "bosupa",
    "borobedi",
    "borobongwe",
    "bolesome",
    "bolesomengwe",
    "bolesomepedi",
    "bolesometharo",
    "bolesomenne",
    "bolesometlhano",
    "bolesomethataro",
    "bolesomesupa",
    "bolesomerobedi",
    "bolesomerobongwe",
    "somamabedi",
    "someamararo",
    "someamane",
    "someamatlhano",
    "someamarataro",
    "someamasupa",
    "someamarobedi",
    "someamarobongwe",
    "lekgolo",
    "sekete",
    "milione",
    "bilione",
    "terilione",
    "kwatirilione",
    "gajillione",
    "bazillione",
]


def like_num(text):
    if text.startswith(("+", "-", "±", "~")):
        text = text[1:]
    text = text.replace(",", "").replace(".", "")
    if text.isdigit():
        return True
    if text.count("/") == 1:
        num, denom = text.split("/")
        if num.isdigit() and denom.isdigit():
            return True

    text_lower = text.lower()
    if text_lower in _num_words:
        return True

    # Check ordinal number
    if text_lower in _ordinal_words:
        return True
    if text_lower.endswith("th"):
        if text_lower[:-2].isdigit():
            return True

    return False


LEX_ATTRS = {LIKE_NUM: like_num}
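A brief sketch of what the new like_num lexical attribute accepts, assuming the module above is importable:

from spacy.lang.tn.lex_attrs import like_num

assert like_num("pedi") is True     # cardinal number word
assert like_num("bobedi") is True   # ordinal number word
assert like_num("3/4") is True      # simple fraction
assert like_num("toropo") is False  # ordinary word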

spacy/lang/tn/punctuation.py  (new file, 19 lines)
@@ -0,0 +1,19 @@
from ..char_classes import LIST_ELLIPSES, LIST_ICONS, HYPHENS
from ..char_classes import CONCAT_QUOTES, ALPHA_LOWER, ALPHA_UPPER, ALPHA

_infixes = (
    LIST_ELLIPSES
    + LIST_ICONS
    + [
        r"(?<=[0-9])[+\-\*^](?=[0-9-])",
        r"(?<=[{al}{q}])\.(?=[{au}{q}])".format(
            al=ALPHA_LOWER, au=ALPHA_UPPER, q=CONCAT_QUOTES
        ),
        r"(?<=[{a}]),(?=[{a}])".format(a=ALPHA),
        r"(?<=[{a}0-9])(?:{h})(?=[{a}])".format(a=ALPHA, h=HYPHENS),
        r"(?<=[{a}0-9])[:<>=/](?=[{a}])".format(a=ALPHA),
    ]
)


TOKENIZER_INFIXES = _infixes

spacy/lang/tn/stop_words.py  (new file, 20 lines)
@@ -0,0 +1,20 @@
# Stop words
STOP_WORDS = set(
    """
ke gareng ga selekanyo tlhwatlhwa yo mongwe se
sengwe fa go le jalo gongwe ba na mo tikologong
jaaka kwa morago nna gonne ka sa pele nako teng
tlase fela ntle magareng tsona feta bobedi kgabaganya
moo gape kgatlhanong botlhe tsotlhe bokana e esi
setseng mororo dinako golo kgolo nnye wena gago
o ntse ntle tla goreng gangwe mang yotlhe gore
eo yona tseraganyo eng ne sentle re rona thata
godimo fitlha pedi masomamabedi lesomepedi mmogo
tharo tseo boraro tseno yone jaanong bobona bona
lesome tsaya tsamaiso nngwe masomethataro thataro
tsa mmatota tota sale thoko supa dira tshwanetse di mmalwa masisi
bonala e tshwanang bogolo tsenya tsweetswee karolo
sepe tlhalosa dirwa robedi robongwe lesomenngwe gaisa
tlhano lesometlhano botlalo lekgolo
""".split()
)

spacy/lexeme.pyx
@@ -451,7 +451,7 @@ cdef class Lexeme:
             Lexeme.c_set_flag(self.c, IS_QUOTE, x)

     property is_left_punct:
-        """RETURNS (bool): Whether the lexeme is left punctuation, e.g. )."""
+        """RETURNS (bool): Whether the lexeme is left punctuation, e.g. (."""
         def __get__(self):
             return Lexeme.c_check_flag(self.c, IS_LEFT_PUNCT)


spacy/matcher/phrasematcher.pxd
@@ -18,4 +18,4 @@ cdef class PhraseMatcher:
     cdef Pool mem
     cdef key_t _terminal_hash

-    cdef void find_matches(self, Doc doc, vector[SpanC] *matches) nogil
+    cdef void find_matches(self, Doc doc, int start_idx, int end_idx, vector[SpanC] *matches) nogil

spacy/matcher/phrasematcher.pyx
@@ -230,10 +230,10 @@ cdef class PhraseMatcher:
                 result = internal_node
             map_set(self.mem, <MapStruct*>result, self.vocab.strings[key], NULL)

-    def __call__(self, doc, *, as_spans=False):
+    def __call__(self, object doclike, *, as_spans=False):
         """Find all sequences matching the supplied patterns on the `Doc`.

-        doc (Doc): The document to match over.
+        doclike (Doc or Span): The document to match over.
         as_spans (bool): Return Span objects with labels instead of (match_id,
             start, end) tuples.
         RETURNS (list): A list of `(match_id, start, end)` tuples,
@@ -244,12 +244,22 @@ cdef class PhraseMatcher:
         DOCS: https://spacy.io/api/phrasematcher#call
         """
         matches = []
-        if doc is None or len(doc) == 0:
+        if doclike is None or len(doclike) == 0:
             # if doc is empty or None just return empty list
             return matches
+        if isinstance(doclike, Doc):
+            doc = doclike
+            start_idx = 0
+            end_idx = len(doc)
+        elif isinstance(doclike, Span):
+            doc = doclike.doc
+            start_idx = doclike.start
+            end_idx = doclike.end
+        else:
+            raise ValueError(Errors.E195.format(good="Doc or Span", got=type(doclike).__name__))

         cdef vector[SpanC] c_matches
-        self.find_matches(doc, &c_matches)
+        self.find_matches(doc, start_idx, end_idx, &c_matches)
         for i in range(c_matches.size()):
             matches.append((c_matches[i].label, c_matches[i].start, c_matches[i].end))
         for i, (ent_id, start, end) in enumerate(matches):
@@ -261,17 +271,17 @@ cdef class PhraseMatcher:
         else:
             return matches

-    cdef void find_matches(self, Doc doc, vector[SpanC] *matches) nogil:
+    cdef void find_matches(self, Doc doc, int start_idx, int end_idx, vector[SpanC] *matches) nogil:
         cdef MapStruct* current_node = self.c_map
         cdef int start = 0
-        cdef int idx = 0
-        cdef int idy = 0
+        cdef int idx = start_idx
+        cdef int idy = start_idx
         cdef key_t key
         cdef void* value
         cdef int i = 0
         cdef SpanC ms
         cdef void* result
-        while idx < doc.length:
+        while idx < end_idx:
             start = idx
             token = Token.get_struct_attr(&doc.c[idx], self.attr)
             # look for sequences from this position
@@ -279,7 +289,7 @@ cdef class PhraseMatcher:
             if result:
                 current_node = <MapStruct*>result
                 idy = idx + 1
-                while idy < doc.length:
+                while idy < end_idx:
                     result = map_get(current_node, self._terminal_hash)
                     if result:
                         i = 0
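A short usage sketch of the new Span support; the example text is invented. Note that match offsets stay relative to the underlying Doc:

from spacy.lang.en import English
from spacy.matcher import PhraseMatcher

nlp = English()
doc = nlp("I like Spans and Docs in my input")
matcher = PhraseMatcher(nlp.vocab)
matcher.add("SPACY", [nlp("Spans and Docs")])
# Matching over a slice only searches tokens 0..7,
# but the returned indices are Doc-level.
print(matcher(doc[0:8]))  # e.g. [(match_id, 2, 5)]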

spacy/ml/models/textcat.py
@@ -107,6 +107,7 @@ def init_ensemble_textcat(model, X, Y) -> Model:
     model.get_ref("maxout_layer").set_dim("nO", tok2vec_width)
     model.get_ref("maxout_layer").set_dim("nI", tok2vec_width)
     model.get_ref("norm_layer").set_dim("nI", tok2vec_width)
+    model.get_ref("norm_layer").set_dim("nO", tok2vec_width)
     init_chain(model, X, Y)
     return model


spacy/pipeline/entity_linker.py
@@ -273,7 +273,7 @@ class EntityLinker(TrainablePipe):
         gradients = self.distance.get_grad(sentence_encodings, entity_encodings)
         loss = self.distance.get_loss(sentence_encodings, entity_encodings)
         loss = loss / len(entity_encodings)
-        return loss, gradients
+        return float(loss), gradients

     def predict(self, docs: Iterable[Doc]) -> List[str]:
         """Apply the pipeline's model to a batch of docs, without modifying them.
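The float(...) casts here and in multitask.py below convert what is typically a numpy scalar into a plain Python float. One plausible practical difference, sketched under the assumption that the loss arrives as numpy.float32:

import json

import numpy

loss = numpy.float32(0.25)
try:
    json.dumps({"loss": loss})  # numpy scalars aren't JSON serializable
except TypeError:
    pass
json.dumps({"loss": float(loss)})  # a plain float serializes cleanly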

spacy/pipeline/functions.py
@@ -76,7 +76,7 @@ def merge_subtokens(doc: Doc, label: str = "subtok") -> Doc:
     retokenizes=True,
 )
 def make_token_splitter(
-    nlp: Language, name: str, *, min_length=0, split_length=0,
+    nlp: Language, name: str, *, min_length: int = 0, split_length: int = 0
 ):
     return TokenSplitter(min_length=min_length, split_length=split_length)


spacy/pipeline/multitask.py
@@ -197,7 +197,7 @@ class ClozeMultitask(TrainablePipe):
         target = vectors[ids]
         gradient = self.distance.get_grad(prediction, target)
         loss = self.distance.get_loss(prediction, target)
-        return loss, gradient
+        return float(loss), gradient

     def update(self, examples, *, drop=0., sgd=None, losses=None):
         pass

spacy/pipeline/tok2vec.py
@@ -121,7 +121,7 @@ class Tok2Vec(TrainablePipe):
         tokvecs = self.model.predict(docs)
         batch_id = Tok2VecListener.get_batch_id(docs)
         for listener in self.listeners:
-            listener.receive(batch_id, tokvecs, lambda dX: [])
+            listener.receive(batch_id, tokvecs, _empty_backprop)
         return tokvecs

     def set_annotations(self, docs: Sequence[Doc], tokvecses) -> None:
@@ -291,12 +291,18 @@ def forward(model: Tok2VecListener, inputs, is_train: bool):
         # of data.
         # When the components batch differently, we don't receive a matching
         # prediction from the upstream, so we can't predict.
-        if not all(doc.tensor.size for doc in inputs):
-            # But we do need to do *something* if the tensor hasn't been set.
-            # The compromise is to at least return data of the right shape,
-            # so the output is valid.
-            width = model.get_dim("nO")
-            outputs = [model.ops.alloc2f(len(doc), width) for doc in inputs]
-        else:
-            outputs = [doc.tensor for doc in inputs]
+        outputs = []
+        width = model.get_dim("nO")
+        for doc in inputs:
+            if doc.tensor.size == 0:
+                # But we do need to do *something* if the tensor hasn't been set.
+                # The compromise is to at least return data of the right shape,
+                # so the output is valid.
+                outputs.append(model.ops.alloc2f(len(doc), width))
+            else:
+                outputs.append(doc.tensor)
         return outputs, lambda dX: []
+
+
+def _empty_backprop(dX):  # for pickling
+    return []
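Replacing the stored lambda with a module-level function matters for pickling: the standard pickle module serializes functions by qualified name, which lambdas don't have. A minimal sketch:

import pickle


def empty_backprop(dX):  # module-level, so picklable by reference
    return []


pickle.dumps(empty_backprop)  # fine

try:
    pickle.dumps(lambda dX: [])
except (pickle.PicklingError, AttributeError):
    pass  # lambdas can't be pickled, which broke pickling the nlp object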

spacy/schemas.py
@@ -446,6 +446,7 @@ class ProjectConfigCommand(BaseModel):
 class ProjectConfigSchema(BaseModel):
     # fmt: off
     vars: Dict[StrictStr, Any] = Field({}, title="Optional variables to substitute in commands")
+    env: Dict[StrictStr, Any] = Field({}, title="Optional variable names to substitute in commands, mapped to environment variable names")
     assets: List[Union[ProjectConfigAssetURL, ProjectConfigAssetGit]] = Field([], title="Data assets")
     workflows: Dict[StrictStr, List[StrictStr]] = Field({}, title="Named workflows, mapped to list of project commands to run in order")
     commands: List[ProjectConfigCommand] = Field([], title="Project command shortcuts")
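A minimal sketch of how the new field validates, using pydantic directly; the class below is a stand-in for illustration, not the real ProjectConfigSchema:

from typing import Any, Dict

from pydantic import BaseModel, Field, StrictStr


class ProjectEnvSketch(BaseModel):
    vars: Dict[StrictStr, Any] = Field({}, title="Variables to substitute in commands")
    env: Dict[StrictStr, Any] = Field({}, title="Variable names mapped to environment variable names")


cfg = ProjectEnvSketch(vars={"gpu": -1}, env={"gpu": "GPU_ID"})
print(cfg.env)  # {'gpu': 'GPU_ID'}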

spacy/tests/lang/test_initialize.py
@@ -8,7 +8,8 @@ from spacy.util import get_lang_class
 LANGUAGES = ["af", "ar", "bg", "bn", "ca", "cs", "da", "de", "el", "en", "es",
              "et", "fa", "fi", "fr", "ga", "he", "hi", "hr", "hu", "id", "is",
              "it", "kn", "lt", "lv", "nb", "nl", "pl", "pt", "ro", "si", "sk",
-             "sl", "sq", "sr", "sv", "ta", "te", "tl", "tr", "tt", "ur", 'yo']
+             "sl", "sq", "sr", "sv", "ta", "te", "tl", "tn", "tr", "tt", "ur",
+             "yo"]
 # fmt: on


spacy/tests/matcher/test_phrase_matcher.py
@@ -323,3 +323,39 @@ def test_phrase_matcher_deprecated(en_vocab):
 @pytest.mark.parametrize("attr", ["SENT_START", "IS_SENT_START"])
 def test_phrase_matcher_sent_start(en_vocab, attr):
     _ = PhraseMatcher(en_vocab, attr=attr)  # noqa: F841
+
+
+def test_span_in_phrasematcher(en_vocab):
+    """Ensure that PhraseMatcher accepts Span and Doc as input"""
+    # fmt: off
+    words = ["I", "like", "Spans", "and", "Docs", "in", "my", "input", ",", "and", "nothing", "else", "."]
+    # fmt: on
+    doc = Doc(en_vocab, words=words)
+    span = doc[:8]
+    pattern = Doc(en_vocab, words=["Spans", "and", "Docs"])
+    matcher = PhraseMatcher(en_vocab)
+    matcher.add("SPACY", [pattern])
+    matches_doc = matcher(doc)
+    matches_span = matcher(span)
+    assert len(matches_doc) == 1
+    assert len(matches_span) == 1
+
+
+def test_span_v_doc_in_phrasematcher(en_vocab):
+    """Ensure that PhraseMatcher only returns matches in input Span and not in entire Doc"""
+    # fmt: off
+    words = [
+        "I", "like", "Spans", "and", "Docs", "in", "my", "input", ",", "Spans",
+        "and", "Docs", "in", "my", "matchers", "," "and", "Spans", "and", "Docs",
+        "everywhere", "."
+    ]
+    # fmt: on
+    doc = Doc(en_vocab, words=words)
+    span = doc[9:15]  # second clause
+    pattern = Doc(en_vocab, words=["Spans", "and", "Docs"])
+    matcher = PhraseMatcher(en_vocab)
+    matcher.add("SPACY", [pattern])
+    matches_doc = matcher(doc)
+    matches_span = matcher(span)
+    assert len(matches_doc) == 3
+    assert len(matches_span) == 1

spacy/tests/pipeline/test_pipe_factories.py
@@ -451,13 +451,27 @@ def test_pipe_factories_from_source_config():
     assert config["arg"] == "world"


-def test_pipe_factories_decorator_idempotent():
+class PipeFactoriesIdempotent:
+    def __init__(self, nlp, name):
+        ...
+
+    def __call__(self, doc):
+        ...
+
+
+@pytest.mark.parametrize(
+    "i,func,func2",
+    [
+        (0, lambda nlp, name: lambda doc: doc, lambda doc: doc),
+        (1, PipeFactoriesIdempotent, PipeFactoriesIdempotent(None, None)),
+    ],
+)
+def test_pipe_factories_decorator_idempotent(i, func, func2):
     """Check that decorator can be run multiple times if the function is the
     same. This is especially relevant for live reloading because we don't
     want spaCy to raise an error if a module registering components is reloaded.
     """
-    name = "test_pipe_factories_decorator_idempotent"
-    func = lambda nlp, name: lambda doc: doc
+    name = f"test_pipe_factories_decorator_idempotent_{i}"
     for i in range(5):
         Language.factory(name, func=func)
     nlp = Language()
@@ -466,7 +480,6 @@ def test_pipe_factories_decorator_idempotent():
     # Make sure it also works for component decorator, which creates the
     # factory function
     name2 = f"{name}2"
-    func2 = lambda doc: doc
     for i in range(5):
         Language.component(name2, func=func2)
     nlp = Language()

spacy/tests/regression/test_issue6501-7000.py  (new file, 229 lines)
@@ -0,0 +1,229 @@
import pytest
from spacy.lang.en import English
import numpy as np
import spacy
from spacy.tokens import Doc
from spacy.matcher import PhraseMatcher
from spacy.tokens import DocBin
from spacy.util import load_config_from_str
from spacy.training import Example
from spacy.training.initialize import init_nlp
import pickle

from ..util import make_tempdir


def test_issue6730(en_vocab):
    """Ensure that the KB does not accept empty strings, but otherwise IO works fine."""
    from spacy.kb import KnowledgeBase

    kb = KnowledgeBase(en_vocab, entity_vector_length=3)
    kb.add_entity(entity="1", freq=148, entity_vector=[1, 2, 3])

    with pytest.raises(ValueError):
        kb.add_alias(alias="", entities=["1"], probabilities=[0.4])
    assert kb.contains_alias("") is False

    kb.add_alias(alias="x", entities=["1"], probabilities=[0.2])
    kb.add_alias(alias="y", entities=["1"], probabilities=[0.1])

    with make_tempdir() as tmp_dir:
        kb.to_disk(tmp_dir)
        kb.from_disk(tmp_dir)
    assert kb.get_size_aliases() == 2
    assert set(kb.get_alias_strings()) == {"x", "y"}


def test_issue6755(en_tokenizer):
    doc = en_tokenizer("This is a magnificent sentence.")
    span = doc[:0]
    assert span.text_with_ws == ""
    assert span.text == ""


@pytest.mark.parametrize(
    "sentence, start_idx,end_idx,label",
    [("Welcome to Mumbai, my friend", 11, 17, "GPE")],
)
def test_issue6815_1(sentence, start_idx, end_idx, label):
    nlp = English()
    doc = nlp(sentence)
    span = doc[:].char_span(start_idx, end_idx, label=label)
    assert span.label_ == label


@pytest.mark.parametrize(
    "sentence, start_idx,end_idx,kb_id", [("Welcome to Mumbai, my friend", 11, 17, 5)]
)
def test_issue6815_2(sentence, start_idx, end_idx, kb_id):
    nlp = English()
    doc = nlp(sentence)
    span = doc[:].char_span(start_idx, end_idx, kb_id=kb_id)
    assert span.kb_id == kb_id


@pytest.mark.parametrize(
    "sentence, start_idx,end_idx,vector",
    [("Welcome to Mumbai, my friend", 11, 17, np.array([0.1, 0.2, 0.3]))],
)
def test_issue6815_3(sentence, start_idx, end_idx, vector):
    nlp = English()
    doc = nlp(sentence)
    span = doc[:].char_span(start_idx, end_idx, vector=vector)
    assert (span.vector == vector).all()


def test_issue6839(en_vocab):
    """Ensure that PhraseMatcher accepts Span as input"""
    # fmt: off
    words = ["I", "like", "Spans", "and", "Docs", "in", "my", "input", ",", "and", "nothing", "else", "."]
    # fmt: on
    doc = Doc(en_vocab, words=words)
    span = doc[:8]
    pattern = Doc(en_vocab, words=["Spans", "and", "Docs"])
    matcher = PhraseMatcher(en_vocab)
    matcher.add("SPACY", [pattern])
    matches = matcher(span)
    assert matches


CONFIG_ISSUE_6908 = """
[paths]
train = "TRAIN_PLACEHOLDER"
raw = null
init_tok2vec = null
vectors = null

[system]
seed = 0
gpu_allocator = null

[nlp]
lang = "en"
pipeline = ["textcat"]
tokenizer = {"@tokenizers":"spacy.Tokenizer.v1"}
disabled = []
before_creation = null
after_creation = null
after_pipeline_creation = null
batch_size = 1000

[components]

[components.textcat]
factory = "TEXTCAT_PLACEHOLDER"

[corpora]

[corpora.train]
@readers = "spacy.Corpus.v1"
path = ${paths:train}

[corpora.dev]
@readers = "spacy.Corpus.v1"
path = ${paths:train}


[training]
train_corpus = "corpora.train"
dev_corpus = "corpora.dev"
seed = ${system.seed}
gpu_allocator = ${system.gpu_allocator}
frozen_components = []
before_to_disk = null

[pretraining]

[initialize]
vectors = ${paths.vectors}
init_tok2vec = ${paths.init_tok2vec}
vocab_data = null
lookups = null
before_init = null
after_init = null

[initialize.components]

[initialize.components.textcat]
labels = ['label1', 'label2']

[initialize.tokenizer]
"""


@pytest.mark.parametrize(
    "component_name", ["textcat", "textcat_multilabel"],
)
def test_issue6908(component_name):
    """Test initializing textcat with labels in a list"""

    def create_data(out_file):
        nlp = spacy.blank("en")
        doc = nlp.make_doc("Some text")
        doc.cats = {"label1": 0, "label2": 1}
        out_data = DocBin(docs=[doc]).to_bytes()
        with out_file.open("wb") as file_:
            file_.write(out_data)

    with make_tempdir() as tmp_path:
        train_path = tmp_path / "train.spacy"
        create_data(train_path)
        config_str = CONFIG_ISSUE_6908.replace("TEXTCAT_PLACEHOLDER", component_name)
        config_str = config_str.replace("TRAIN_PLACEHOLDER", train_path.as_posix())
        config = load_config_from_str(config_str)
        init_nlp(config)


CONFIG_ISSUE_6950 = """
[nlp]
lang = "en"
pipeline = ["tok2vec", "tagger"]

[components]

[components.tok2vec]
factory = "tok2vec"

[components.tok2vec.model]
@architectures = "spacy.Tok2Vec.v1"

[components.tok2vec.model.embed]
@architectures = "spacy.MultiHashEmbed.v1"
width = ${components.tok2vec.model.encode:width}
attrs = ["NORM","PREFIX","SUFFIX","SHAPE"]
rows = [5000,2500,2500,2500]
include_static_vectors = false

[components.tok2vec.model.encode]
@architectures = "spacy.MaxoutWindowEncoder.v1"
width = 96
depth = 4
window_size = 1
maxout_pieces = 3

[components.ner]
factory = "ner"

[components.tagger]
factory = "tagger"

[components.tagger.model]
@architectures = "spacy.Tagger.v1"
nO = null

[components.tagger.model.tok2vec]
@architectures = "spacy.Tok2VecListener.v1"
width = ${components.tok2vec.model.encode:width}
upstream = "*"
"""


def test_issue6950():
    """Test that the nlp object with initialized tok2vec with listeners pickles
    correctly (and doesn't have lambdas).
    """
    nlp = English.from_config(load_config_from_str(CONFIG_ISSUE_6950))
    nlp.initialize(lambda: [Example.from_dict(nlp.make_doc("hello"), {"tags": ["V"]})])
    pickle.dumps(nlp)
    nlp("hello")
    pickle.dumps(nlp)
			@ -1,23 +0,0 @@
 | 
			
		|||
import pytest
 | 
			
		||||
from ..util import make_tempdir
 | 
			
		||||
 | 
			
		||||
 | 
			
		||||
def test_issue6730(en_vocab):
 | 
			
		||||
    """Ensure that the KB does not accept empty strings, but otherwise IO works fine."""
 | 
			
		||||
    from spacy.kb import KnowledgeBase
 | 
			
		||||
 | 
			
		||||
    kb = KnowledgeBase(en_vocab, entity_vector_length=3)
 | 
			
		||||
    kb.add_entity(entity="1", freq=148, entity_vector=[1, 2, 3])
 | 
			
		||||
 | 
			
		||||
    with pytest.raises(ValueError):
 | 
			
		||||
        kb.add_alias(alias="", entities=["1"], probabilities=[0.4])
 | 
			
		||||
    assert kb.contains_alias("") is False
 | 
			
		||||
 | 
			
		||||
    kb.add_alias(alias="x", entities=["1"], probabilities=[0.2])
 | 
			
		||||
    kb.add_alias(alias="y", entities=["1"], probabilities=[0.1])
 | 
			
		||||
 | 
			
		||||
    with make_tempdir() as tmp_dir:
 | 
			
		||||
        kb.to_disk(tmp_dir)
 | 
			
		||||
        kb.from_disk(tmp_dir)
 | 
			
		||||
    assert kb.get_size_aliases() == 2
 | 
			
		||||
    assert set(kb.get_alias_strings()) == {"x", "y"}
 | 
			
		||||
| 
						 | 
				
			
			@ -1,5 +0,0 @@
 | 
			
		|||
def test_issue6755(en_tokenizer):
 | 
			
		||||
    doc = en_tokenizer("This is a magnificent sentence.")
 | 
			
		||||
    span = doc[:0]
 | 
			
		||||
    assert span.text_with_ws == ""
 | 
			
		||||
    assert span.text == ""
 | 
			
		||||
@@ -1,35 +0,0 @@
-import pytest
-from spacy.lang.en import English
-import numpy as np
-
-
-@pytest.mark.parametrize(
-    "sentence, start_idx,end_idx,label",
-    [("Welcome to Mumbai, my friend", 11, 17, "GPE")],
-)
-def test_char_span_label(sentence, start_idx, end_idx, label):
-    nlp = English()
-    doc = nlp(sentence)
-    span = doc[:].char_span(start_idx, end_idx, label=label)
-    assert span.label_ == label
-
-
-@pytest.mark.parametrize(
-    "sentence, start_idx,end_idx,kb_id", [("Welcome to Mumbai, my friend", 11, 17, 5)]
-)
-def test_char_span_kb_id(sentence, start_idx, end_idx, kb_id):
-    nlp = English()
-    doc = nlp(sentence)
-    span = doc[:].char_span(start_idx, end_idx, kb_id=kb_id)
-    assert span.kb_id == kb_id
-
-
-@pytest.mark.parametrize(
-    "sentence, start_idx,end_idx,vector",
-    [("Welcome to Mumbai, my friend", 11, 17, np.array([0.1, 0.2, 0.3]))],
-)
-def test_char_span_vector(sentence, start_idx, end_idx, vector):
-    nlp = English()
-    doc = nlp(sentence)
-    span = doc[:].char_span(start_idx, end_idx, vector=vector)
-    assert (span.vector == vector).all()
@@ -1,102 +0,0 @@
-import pytest
-import spacy
-from spacy.language import Language
-from spacy.tokens import DocBin
-from spacy import util
-from spacy.schemas import ConfigSchemaInit
-
-from spacy.training.initialize import init_nlp
-
-from ..util import make_tempdir
-
-TEXTCAT_WITH_LABELS_ARRAY_CONFIG = """
-[paths]
-train = "TRAIN_PLACEHOLDER"
-raw = null
-init_tok2vec = null
-vectors = null
-
-[system]
-seed = 0
-gpu_allocator = null
-
-[nlp]
-lang = "en"
-pipeline = ["textcat"]
-tokenizer = {"@tokenizers":"spacy.Tokenizer.v1"}
-disabled = []
-before_creation = null
-after_creation = null
-after_pipeline_creation = null
-batch_size = 1000
-
-[components]
-
-[components.textcat]
-factory = "TEXTCAT_PLACEHOLDER"
-
-[corpora]
-
-[corpora.train]
-@readers = "spacy.Corpus.v1"
-path = ${paths:train}
-
-[corpora.dev]
-@readers = "spacy.Corpus.v1"
-path = ${paths:train}
-
-
-[training]
-train_corpus = "corpora.train"
-dev_corpus = "corpora.dev"
-seed = ${system.seed}
-gpu_allocator = ${system.gpu_allocator}
-frozen_components = []
-before_to_disk = null
-
-[pretraining]
-
-[initialize]
-vectors = ${paths.vectors}
-init_tok2vec = ${paths.init_tok2vec}
-vocab_data = null
-lookups = null
-before_init = null
-after_init = null
-
-[initialize.components]
-
-[initialize.components.textcat]
-labels = ['label1', 'label2']
-
-[initialize.tokenizer]
-"""
-
-
-@pytest.mark.parametrize(
-    "component_name",
-    ["textcat", "textcat_multilabel"],
-)
-def test_textcat_initialize_labels_validation(component_name):
-    """Test initializing textcat with labels in a list"""
-
-    def create_data(out_file):
-        nlp = spacy.blank("en")
-        doc = nlp.make_doc("Some text")
-        doc.cats = {"label1": 0, "label2": 1}
-
-        out_data = DocBin(docs=[doc]).to_bytes()
-        with out_file.open("wb") as file_:
-            file_.write(out_data)
-
-    with make_tempdir() as tmp_path:
-        train_path = tmp_path / "train.spacy"
-        create_data(train_path)
-
-        config_str = TEXTCAT_WITH_LABELS_ARRAY_CONFIG.replace(
-            "TEXTCAT_PLACEHOLDER", component_name
-        )
-        config_str = config_str.replace("TRAIN_PLACEHOLDER", train_path.as_posix())
-
-        config = util.load_config_from_str(config_str)
-        init_nlp(config)
spacy/tests/regression/test_issue7019.py (new file, 12 lines)
@@ -0,0 +1,12 @@
+from spacy.cli.evaluate import print_textcats_auc_per_cat, print_prf_per_type
+from wasabi import msg
+
+
+def test_issue7019():
+    scores = {"LABEL_A": 0.39829102, "LABEL_B": 0.938298329382, "LABEL_C": None}
+    print_textcats_auc_per_cat(msg, scores)
+    scores = {
+        "LABEL_A": {"p": 0.3420302, "r": 0.3929020, "f": 0.49823928932},
+        "LABEL_B": {"p": None, "r": None, "f": None},
+    }
+    print_prf_per_type(msg, scores, name="foo", type="bar")

spacy/tests/regression/test_issue7029.py (new file, 67 lines)
@@ -0,0 +1,67 @@
+from spacy.lang.en import English
+from spacy.training import Example
+from spacy.util import load_config_from_str
+
+
+CONFIG = """
+[nlp]
+lang = "en"
+pipeline = ["tok2vec", "tagger"]
+
+[components]
+
+[components.tok2vec]
+factory = "tok2vec"
+
+[components.tok2vec.model]
+@architectures = "spacy.Tok2Vec.v1"
+
+[components.tok2vec.model.embed]
+@architectures = "spacy.MultiHashEmbed.v1"
+width = ${components.tok2vec.model.encode:width}
+attrs = ["NORM","PREFIX","SUFFIX","SHAPE"]
+rows = [5000,2500,2500,2500]
+include_static_vectors = false
+
+[components.tok2vec.model.encode]
+@architectures = "spacy.MaxoutWindowEncoder.v1"
+width = 96
+depth = 4
+window_size = 1
+maxout_pieces = 3
+
+[components.tagger]
+factory = "tagger"
+
+[components.tagger.model]
+@architectures = "spacy.Tagger.v1"
+nO = null
+
+[components.tagger.model.tok2vec]
+@architectures = "spacy.Tok2VecListener.v1"
+width = ${components.tok2vec.model.encode:width}
+upstream = "*"
+"""
+
+
+TRAIN_DATA = [
+    ("I like green eggs", {"tags": ["N", "V", "J", "N"]}),
+    ("Eat blue ham", {"tags": ["V", "J", "N"]}),
+]
+
+
+def test_issue7029():
+    """Test that an empty document doesn't mess up an entire batch."""
+    nlp = English.from_config(load_config_from_str(CONFIG))
+    train_examples = []
+    for t in TRAIN_DATA:
+        train_examples.append(Example.from_dict(nlp.make_doc(t[0]), t[1]))
+    optimizer = nlp.initialize(get_examples=lambda: train_examples)
+    for i in range(50):
+        losses = {}
+        nlp.update(train_examples, sgd=optimizer, losses=losses)
+    texts = ["first", "second", "third", "fourth", "and", "then", "some", ""]
+    nlp.select_pipes(enable=["tok2vec", "tagger"])
+    docs1 = list(nlp.pipe(texts, batch_size=1))
+    docs2 = list(nlp.pipe(texts, batch_size=4))
+    assert [doc[0].tag_ for doc in docs1[:-1]] == [doc[0].tag_ for doc in docs2[:-1]]
@@ -325,6 +325,23 @@ def test_project_config_interpolation():
         substitute_project_variables(project)
 
 
+def test_project_config_interpolation_env():
+    variables = {"a": 10}
+    env_var = "SPACY_TEST_FOO"
+    env_vars = {"foo": env_var}
+    commands = [{"name": "x", "script": ["hello ${vars.a} ${env.foo}"]}]
+    project = {"commands": commands, "vars": variables, "env": env_vars}
+    with make_tempdir() as d:
+        srsly.write_yaml(d / "project.yml", project)
+        cfg = load_project_config(d)
+    assert cfg["commands"][0]["script"][0] == "hello 10 "
+    os.environ[env_var] = "123"
+    with make_tempdir() as d:
+        srsly.write_yaml(d / "project.yml", project)
+        cfg = load_project_config(d)
+    assert cfg["commands"][0]["script"][0] == "hello 10 123"
+
+
 @pytest.mark.parametrize(
     "args,expected",
     [
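The assertions above pin down the contract worth remembering: `${env.foo}` is resolved from the process environment at the moment `load_project_config` runs, and an unset variable interpolates to an empty string rather than raising. A tiny sketch of that timing (the variable name is reused from the test):

```python
import os

env_var = "SPACY_TEST_FOO"
os.environ.pop(env_var, None)  # unset: "hello ${vars.a} ${env.foo}" -> "hello 10 "
os.environ[env_var] = "123"    # set before loading: -> "hello 10 123"
```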
@@ -1,7 +1,9 @@
 import pytest
 import numpy
+import srsly
+from spacy.lang.en import English
 from spacy.strings import StringStore
 from spacy.tokens import Doc
 from spacy.vocab import Vocab
 from spacy.attrs import NORM
 
@@ -20,7 +22,10 @@ def test_pickle_string_store(text1, text2):
 
 @pytest.mark.parametrize("text1,text2", [("dog", "cat")])
 def test_pickle_vocab(text1, text2):
-    vocab = Vocab(lex_attr_getters={int(NORM): lambda string: string[:-1]})
+    vocab = Vocab(
+        lex_attr_getters={int(NORM): lambda string: string[:-1]},
+        get_noun_chunks=English.Defaults.syntax_iterators.get("noun_chunks"),
+    )
     vocab.set_vector("dog", numpy.ones((5,), dtype="f"))
     lex1 = vocab[text1]
     lex2 = vocab[text2]
@@ -34,4 +39,23 @@ def test_pickle_vocab(text1, text2):
     assert unpickled[text2].norm == lex2.norm
     assert unpickled[text1].norm != unpickled[text2].norm
     assert unpickled.vectors is not None
+    assert unpickled.get_noun_chunks is not None
     assert list(vocab["dog"].vector) == [1.0, 1.0, 1.0, 1.0, 1.0]
+
+
+def test_pickle_doc(en_vocab):
+    words = ["a", "b", "c"]
+    deps = ["dep"] * len(words)
+    heads = [0] * len(words)
+    doc = Doc(
+        en_vocab,
+        words=words,
+        deps=deps,
+        heads=heads,
+    )
+    data = srsly.pickle_dumps(doc)
+    unpickled = srsly.pickle_loads(data)
+    assert [t.text for t in unpickled] == words
+    assert [t.dep_ for t in unpickled] == deps
+    assert [t.head.i for t in unpickled] == heads
+    assert list(doc.noun_chunks) == []
@@ -55,6 +55,7 @@ def test_vocab_lexeme_add_flag_provided_id(en_vocab):
     assert en_vocab["199"].check_flag(IS_DIGIT) is False
     assert en_vocab["the"].check_flag(is_len4) is False
     assert en_vocab["dogs"].check_flag(is_len4) is True
+    en_vocab.add_flag(lambda string: string.isdigit(), flag_id=IS_DIGIT)
 
 
 def test_vocab_lexeme_oov_rank(en_vocab):
@@ -245,7 +245,7 @@ cdef class Tokenizer:
         cdef int offset
         cdef int modified_doc_length
         # Find matches for special cases
-        self._special_matcher.find_matches(doc, &c_matches)
+        self._special_matcher.find_matches(doc, 0, doc.length, &c_matches)
         # Skip processing if no matches
         if c_matches.size() == 0:
             return True
@@ -215,8 +215,7 @@ def convert_vectors(
 
 
 def read_vectors(vectors_loc: Path, truncate_vectors: int):
-    f = open_file(vectors_loc)
-    f = ensure_shape(f)
+    f = ensure_shape(vectors_loc)
     shape = tuple(int(size) for size in next(f).split())
     if truncate_vectors >= 1:
         shape = (truncate_vectors, shape[1])
@@ -251,11 +250,12 @@ def open_file(loc: Union[str, Path]) -> IO:
         return loc.open("r", encoding="utf8")
 
 
-def ensure_shape(lines):
+def ensure_shape(vectors_loc):
     """Ensure that the first line of the data is the vectors shape.
     If it's not, we read in the data and output the shape as the first result,
     so that the reader doesn't have to deal with the problem.
     """
+    lines = open_file(vectors_loc)
     first_line = next(lines)
     try:
         shape = tuple(int(size) for size in first_line.split())
@@ -269,7 +269,11 @@ def ensure_shape(lines):
         # Figure out the shape, make it the first value, and then give the
         # rest of the data.
        width = len(first_line.split()) - 1
-        captured = [first_line] + list(lines)
-        length = len(captured)
+        length = 1
+        for _ in lines:
+            length += 1
         yield f"{length} {width}"
-        yield from captured
+        # Reading the lines in again from file. This is to avoid having to
+        # store all the results in a list in memory
+        lines2 = open_file(vectors_loc)
+        yield from lines2
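The replacement above trades CPU for memory: instead of materializing every row in a list just to count it, it counts in one pass and then re-opens the file to stream the rows. A standalone sketch of the same idea (using plain `open` instead of spaCy's `open_file` helper; names are mine):

```python
def with_shape_header(path):
    """Yield "<rows> <width>" first, then the raw vector rows (sketch only)."""
    with open(path, encoding="utf8") as f:
        first = next(f)
        width = len(first.split()) - 1      # token plus its vector components
        length = 1 + sum(1 for _ in f)      # first pass: count rows, keep nothing
    yield f"{length} {width}"
    with open(path, encoding="utf8") as f:  # second pass: stream the rows
        yield from f
```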
@@ -930,6 +930,8 @@ def is_same_func(func1: Callable, func2: Callable) -> bool:
     """
     if not callable(func1) or not callable(func2):
         return False
+    if not hasattr(func1, "__qualname__") or not hasattr(func2, "__qualname__"):
+        return False
     same_name = func1.__qualname__ == func2.__qualname__
     same_file = inspect.getfile(func1) == inspect.getfile(func2)
     same_code = inspect.getsourcelines(func1) == inspect.getsourcelines(func2)
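The new `hasattr` guard covers callables that carry no `__qualname__` at all, for example `functools.partial` objects (a representative case I'm assuming, not one named by the diff):

```python
from functools import partial

def add(x, y):
    return x + y

bound = partial(add, 1)
# partial objects are callable but expose no __qualname__ (true of CPython at
# the time of this change), so comparing __qualname__ directly would raise
# AttributeError without the guard.
print(hasattr(bound, "__qualname__"))  # False
```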
@@ -551,12 +551,13 @@ def pickle_vocab(vocab):
     data_dir = vocab.data_dir
     lex_attr_getters = srsly.pickle_dumps(vocab.lex_attr_getters)
     lookups = vocab.lookups
+    get_noun_chunks = vocab.get_noun_chunks
     return (unpickle_vocab,
-            (sstore, vectors, morph, data_dir, lex_attr_getters, lookups))
+            (sstore, vectors, morph, data_dir, lex_attr_getters, lookups, get_noun_chunks))
 
 
 def unpickle_vocab(sstore, vectors, morphology, data_dir,
-                   lex_attr_getters, lookups):
+                   lex_attr_getters, lookups, get_noun_chunks):
     cdef Vocab vocab = Vocab()
     vocab.vectors = vectors
     vocab.strings = sstore
@@ -564,6 +565,7 @@ def unpickle_vocab(sstore, vectors, morphology, data_dir,
     vocab.data_dir = data_dir
     vocab.lex_attr_getters = srsly.pickle_loads(lex_attr_getters)
     vocab.lookups = lookups
+    vocab.get_noun_chunks = get_noun_chunks
     return vocab
 
 
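For context, this is the standard reduce-style pickling protocol: `pickle_vocab` returns a reconstruction function plus an argument tuple, and pickle calls that function on load, which is why the new `get_noun_chunks` value has to travel through both halves of the pair. A self-contained sketch of the mechanism (toy class, not the real `Vocab`):

```python
import copyreg
import pickle

class Box:
    def __init__(self, payload=None):
        self.payload = payload

def pickle_box(box):
    # Return (reconstructor, args): pickle stores these instead of the object.
    return (unpickle_box, (box.payload,))

def unpickle_box(payload):
    return Box(payload)

copyreg.pickle(Box, pickle_box)
restored = pickle.loads(pickle.dumps(Box("noun_chunks")))
assert restored.payload == "noun_chunks"
```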
@@ -67,7 +67,7 @@ data format used by the lookup and rule-based lemmatizers, see
 > lemmatizer = nlp.add_pipe("lemmatizer")
 >
 > # Construction via add_pipe with custom settings
-> config = {"mode": "rule", overwrite=True}
+> config = {"mode": "rule", "overwrite": True}
 > lemmatizer = nlp.add_pipe("lemmatizer", config=config)
 > ```
 
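The class of bug being fixed above is worth spelling out: inside a dict literal, `overwrite=True` is a syntax error, since dict keys must be expressions such as quoted strings. The corrected snippet from the doc runs as-is on a blank pipeline (sketch; actually initializing the rule lemmatizer still requires lookups data to be installed):

```python
import spacy

nlp = spacy.blank("en")
config = {"mode": "rule", "overwrite": True}  # keys quoted, unlike the old doc
lemmatizer = nlp.add_pipe("lemmatizer", config=config)
```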
@@ -44,7 +44,7 @@ be shown.
 
 ## PhraseMatcher.\_\_call\_\_ {#call tag="method"}
 
-Find all token sequences matching the supplied patterns on the `Doc`.
+Find all token sequences matching the supplied patterns on the `Doc` or `Span`.
 
 > #### Example
 >
@@ -59,7 +59,7 @@ Find all token sequences matching the supplied patterns on the `Doc`.
 
 | Name                                  | Description |
 | ------------------------------------- | ----------- |
-| `doc`                                 | The document to match over. ~~Doc~~ |
+| `doclike`                             | The `Doc` or `Span` to match over. ~~Union[Doc, Span]~~ |
 | _keyword-only_                        | |
 | `as_spans` <Tag variant="new">3</Tag> | Instead of tuples, return a list of [`Span`](/api/span) objects of the matches, with the `match_id` assigned as the span label. Defaults to `False`. ~~bool~~ |
 | **RETURNS**                           | A list of `(match_id, start, end)` tuples, describing the matches. A match tuple describes a span `doc[start:end]`. The `match_id` is the ID of the added match pattern. If `as_spans` is set to `True`, a list of `Span` objects is returned instead. ~~Union[List[Tuple[int, int, int]], List[Span]]~~ |
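A short usage sketch of the widened signature (standard `PhraseMatcher` API; the pattern and text are arbitrary examples of mine):

```python
import spacy
from spacy.matcher import PhraseMatcher

nlp = spacy.blank("en")
matcher = PhraseMatcher(nlp.vocab)
matcher.add("OBAMA", [nlp.make_doc("Barack Obama")])
doc = nlp.make_doc("Barack Obama lifts America one last time")
matches = matcher(doc)               # over the whole Doc, as before
matches = matcher(doc[0:4])          # now also directly over a Span
spans = matcher(doc, as_spans=True)  # Span objects labeled with the match_id
```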
@@ -727,7 +727,7 @@ capitalization by including a mix of capitalized and lowercase examples. See the
 
 Create a data augmentation callback that uses orth-variant replacement. The
 callback can be added to a corpus or other data iterator during training. It's
-is especially useful for punctuation and case replacement, to help generalize
+especially useful for punctuation and case replacement, to help generalize
 beyond corpora that don't have smart quotes, or only have smart quotes etc.
 
 | Name            | Description |
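Beyond the built-in orth-variant augmenter documented in the hunk above, an augmenter is just a callable taking `(nlp, example)` and yielding `Example` objects. A hedged sketch of a custom one (the factory name and lower-casing behavior are mine; the callback contract and the `to_dict`/`from_dict` round trip follow the v3 training docs as I understand them):

```python
import random

def make_lowercase_augmenter(level: float):
    # Yield the original example, plus sometimes a lower-cased copy.
    def augment(nlp, example):
        if random.random() < level:
            example_dict = example.to_dict()
            doc = nlp.make_doc(example.text.lower())
            example_dict["token_annotation"]["ORTH"] = [
                t.lower_ for t in example.reference
            ]
            yield example
            yield example.from_dict(doc, example_dict)
        else:
            yield example
    return augment
```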
@@ -4,8 +4,8 @@ import { Help } from 'components/typography'; import Link from 'components/link'
 
 | Pipeline                                                   | Parser | Tagger |  NER |
 | ---------------------------------------------------------- | -----: | -----: | ---: |
-| [`en_core_web_trf`](/models/en#en_core_web_trf) (spaCy v3) |   95.2 |   97.8 | 89.9 |
-| [`en_core_web_lg`](/models/en#en_core_web_lg) (spaCy v3)   |   91.9 |   97.4 | 85.5 |
+| [`en_core_web_trf`](/models/en#en_core_web_trf) (spaCy v3) |   95.1 |   97.8 | 89.8 |
+| [`en_core_web_lg`](/models/en#en_core_web_lg) (spaCy v3)   |   92.0 |   97.4 | 85.5 |
 | `en_core_web_lg` (spaCy v2)                                |   91.9 |   97.2 | 85.5 |
 
 <figcaption class="caption">
@@ -22,7 +22,7 @@ the development set).
 
 | Named Entity Recognition System  | OntoNotes | CoNLL '03 |
 | -------------------------------- | --------: | --------: |
-| spaCy RoBERTa (2020)             |      89.7 |      91.6 |
+| spaCy RoBERTa (2020)             |      89.8 |      91.6 |
 | Stanza (StanfordNLP)<sup>1</sup> |      88.8 |      92.1 |
 | Flair<sup>2</sup>                |      89.7 |      93.1 |
 
@@ -77,7 +77,7 @@ import Benchmarks from 'usage/\_benchmarks-models.md'
 
 | Dependency Parsing System                                                      |  UAS |  LAS |
 | ------------------------------------------------------------------------------ | ---: | ---: |
-| spaCy RoBERTa (2020)                                                           | 95.5 | 94.3 |
+| spaCy RoBERTa (2020)                                                           | 95.1 | 93.7 |
 | [Mrini et al.](https://khalilmrini.github.io/Label_Attention_Layer.pdf) (2019) | 97.4 | 96.3 |
 | [Zhou and Zhao](https://www.aclweb.org/anthology/P19-1230/) (2019)             | 97.2 | 95.7 |
 
@@ -69,9 +69,9 @@ python -m spacy project clone pipelines/tagger_parser_ud
 
 By default, the project will be cloned into the current working directory. You
 can specify an optional second argument to define the output directory. The
-`--repo` option lets you define a custom repo to clone from if you don't want
-to use the spaCy [`projects`](https://github.com/explosion/projects) repo. You
-can also use any private repo you have access to with Git.
+`--repo` option lets you define a custom repo to clone from if you don't want to
+use the spaCy [`projects`](https://github.com/explosion/projects) repo. You can
+also use any private repo you have access to with Git.
 
 ### 2. Fetch the project assets {#assets}
 
@@ -221,6 +221,7 @@ pipelines.
 | `title`         | An optional project title used in `--help` message and [auto-generated docs](#custom-docs). |
 | `description`   | An optional project description used in [auto-generated docs](#custom-docs). |
 | `vars`          | A dictionary of variables that can be referenced in paths, URLs and scripts, just like [`config.cfg` variables](/usage/training#config-interpolation). For example, `${vars.name}` will use the value of the variable `name`. Variables need to be defined in the section `vars`, but can be a nested dict, so you're able to reference `${vars.model.name}`. |
+| `env`           | A dictionary of variables, mapped to the names of environment variables that will be read in when running the project. For example, `${env.name}` will use the value of the environment variable defined as `name`. |
 | `directories`   | An optional list of [directories](#project-files) that should be created in the project for assets, training outputs, metrics etc. spaCy will make sure that these directories always exist. |
 | `assets`        | A list of assets that can be fetched with the [`project assets`](/api/cli#project-assets) command. `url` defines a URL or local path, `dest` is the destination file relative to the project directory, and an optional `checksum` ensures that an error is raised if the file's checksum doesn't match. Instead of `url`, you can also provide a `git` block with the keys `repo`, `branch` and `path`, to download from a Git repo. |
 | `workflows`     | A dictionary of workflow names, mapped to a list of command names, to execute in order. Workflows can be run with the [`project run`](/api/cli#project-run) command. |
@@ -310,8 +311,8 @@ company-internal and not available over the internet. In that case, you can
 specify the destination paths and a checksum, and leave out the URL. When your
 teammates clone and run your project, they can place the files in the respective
 directory themselves. The [`project assets`](/api/cli#project-assets) command
-will alert you about missing files and mismatched checksums, so you can ensure that
-others are running your project with the same data.
+will alert you about missing files and mismatched checksums, so you can ensure
+that others are running your project with the same data.
 
 ### Dependencies and outputs {#deps-outputs}
 
@@ -358,9 +359,10 @@ graphs based on the dependencies and outputs, and won't re-run previous steps
 automatically. For instance, if you only run the command `train` that depends on
 data created by `preprocess` and those files are missing, spaCy will show an
 error – it won't just re-run `preprocess`. If you're looking for more advanced
-data management, check out the [Data Version Control (DVC) integration](#dvc). If you're planning on integrating your spaCy project with DVC, you
-can also use `outputs_no_cache` instead of `outputs` to define outputs that
-won't be cached or tracked.
+data management, check out the [Data Version Control (DVC) integration](#dvc).
+If you're planning on integrating your spaCy project with DVC, you can also use
+`outputs_no_cache` instead of `outputs` to define outputs that won't be cached
+or tracked.
 
 ### Files and directory structure {#project-files}
 
@@ -467,7 +469,9 @@ In your `project.yml`, you can then run the script by calling
 `python scripts/custom_evaluation.py` with the function arguments. You can also
 use the `vars` section to define reusable variables that will be substituted in
 commands, paths and URLs. In this example, the batch size is defined as a
-variable will be added in place of `${vars.batch_size}` in the script.
+variable and will be added in place of `${vars.batch_size}` in the script. Just
+like in the [training config](/usage/training#config-overrides), you can also
+override settings on the command line – for example using `--vars.batch_size`.
 
 > #### Calling into Python
 >
@@ -491,6 +495,29 @@ commands:
       - 'corpus/eval.json'
 ```
 
+You can also use the `env` section to reference **environment variables** and
+make their values available to the commands. This can be useful for overriding
+settings on the command line and passing through system-level settings.
+
+> #### Usage example
+>
+> ```bash
+> export GPU_ID=1
+> BATCH_SIZE=128 python -m spacy project run evaluate
+> ```
+
+```yaml
+### project.yml
+env:
+  batch_size: BATCH_SIZE
+  gpu_id: GPU_ID
+
+commands:
+  - name: evaluate
+    script:
+      - 'python scripts/custom_evaluation.py ${env.batch_size}'
+```
+
 ### Documenting your project {#custom-docs}
 
 > #### Readme Example
@@ -185,7 +185,7 @@ sections of a config file are:
 
 For a full overview of spaCy's config format and settings, see the
 [data format documentation](/api/data-formats#config) and
-[Thinc's config system docs](https://thinc.ai/usage/config). The settings
+[Thinc's config system docs](https://thinc.ai/docs/usage-config). The settings
 available for the different architectures are documented with the
 [model architectures API](/api/architectures). See the Thinc documentation for
 [optimizers](https://thinc.ai/docs/api-optimizers) and
@@ -198,6 +198,7 @@
             "has_examples": true
         },
         { "code": "tl", "name": "Tagalog" },
         { "code": "tn", "name": "Setswana", "has_examples": true },
+        { "code": "tr", "name": "Turkish", "example": "Bu bir cümledir.", "has_examples": true },
         { "code": "tt", "name": "Tatar", "has_examples": true },
         {