spaCy/spacy/cli/assemble.py

import logging
from pathlib import Path
from typing import Optional

import typer
from wasabi import msg

from .. import util
from ..util import get_sourced_components, load_model_from_config
from ._util import (
    Arg,
    Opt,
    app,
    import_code_paths,
    parse_config_overrides,
    show_validation_error,
)


@app.command(
    "assemble",
    context_settings={"allow_extra_args": True, "ignore_unknown_options": True},
)
def assemble_cli(
    # fmt: off
    ctx: typer.Context,  # This is only used to read additional arguments
    config_path: Path = Arg(..., help="Path to config file", exists=True, allow_dash=True),
    output_path: Path = Arg(..., help="Output directory to store assembled pipeline in"),
    code_path: str = Opt("", "--code", "-c", help="Comma-separated paths to Python files with additional code (registered functions) to be imported"),
    verbose: bool = Opt(False, "--verbose", "-V", "-VV", help="Display more information for debugging purposes"),
    # fmt: on
):
    """
    Assemble a spaCy pipeline from a config file. The config file includes
    all settings for initializing the pipeline. Settings in the config, e.g.
    ones that point to local paths or that you want to experiment with, can
    be overridden with command line options. The --code argument lets you
    pass in one or more comma-separated Python files that register custom
    functions referenced in the config.

    DOCS: https://spacy.io/api/cli#assemble
    """
    if verbose:
        util.logger.setLevel(logging.DEBUG)
    # Make sure all files and paths exist if they are needed
    if not config_path or (str(config_path) != "-" and not config_path.exists()):
        msg.fail("Config file not found", config_path, exits=1)
    overrides = parse_config_overrides(ctx.args)
    import_code_paths(code_path)
    with show_validation_error(config_path):
        config = util.load_config(config_path, overrides=overrides, interpolate=False)
    msg.divider("Initializing pipeline")
    nlp = load_model_from_config(config, auto_fill=True)
    config = config.interpolate()
    sourced = get_sourced_components(config)
    # Make sure that listeners are defined before initializing further
    nlp._link_components()
    with nlp.select_pipes(disable=[*sourced]):
        nlp.initialize()
    msg.good("Initialized pipeline")
    msg.divider("Serializing to disk")
    if output_path is not None and not output_path.exists():
        output_path.mkdir(parents=True)
        msg.good(f"Created output directory: {output_path}")
    nlp.to_disk(output_path)
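

# ---------------------------------------------------------------------------
# Illustrative sketch only, NOT part of spaCy.
# assemble_cli passes the extra command line arguments (ctx.args) to
# parse_config_overrides, which turns them into config overrides. The
# hypothetical helper below approximates that idea under simplifying
# assumptions: each override is written as "--section.key value" or
# "--section.key=value", a flag with no value means true, and values are
# parsed as JSON when possible and kept as plain strings otherwise.
# spaCy's real parser may differ in its details.
# ---------------------------------------------------------------------------
import json
from typing import Any, Dict, List


def sketch_parse_overrides(args: List[str]) -> Dict[str, Any]:
    """Map ["--a.b", "1", "--c.d=x"] to {"a.b": 1, "c.d": "x"} (hypothetical)."""
    result: Dict[str, Any] = {}
    pending = list(args)
    while pending:
        opt = pending.pop(0)
        if not opt.startswith("--"):
            raise ValueError(f"Expected an option starting with '--', got: {opt}")
        key = opt[2:]
        if "=" in key:
            # Form: --section.key=value
            key, raw = key.split("=", 1)
        elif pending and not pending[0].startswith("--"):
            # Form: --section.key value
            raw = pending.pop(0)
        else:
            # Bare flag with no value: treat it as boolean true
            raw = "true"
        try:
            value = json.loads(raw)
        except json.JSONDecodeError:
            # Not valid JSON (e.g. a file path): keep the raw string
            value = raw
        result[key] = value
    return result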