spaCy/spacy/cli/init_model.py

from typing import Optional, List, Dict, Any, Union, IO
import math
from tqdm import tqdm
import numpy
from ast import literal_eval
from pathlib import Path
from preshed.counter import PreshCounter
import tarfile
import gzip
import zipfile
import srsly
import warnings
from wasabi import msg, Printer
import typer

from ._util import app, init_cli, Arg, Opt

DEFAULT_OOV_PROB = -20


@init_cli.command("vocab")
@app.command(
    "init-model",
    context_settings={"allow_extra_args": True, "ignore_unknown_options": True},
    hidden=True,  # hide this from main CLI help but still allow it to work with warning
)
def init_model_cli(
    # fmt: off
    ctx: typer.Context,  # This is only used to read additional arguments
    lang: str = Arg(..., help="Pipeline language"),
    output_dir: Path = Arg(..., help="Pipeline output directory"),
    freqs_loc: Optional[Path] = Arg(None, help="Location of word frequencies file", exists=True),
    clusters_loc: Optional[Path] = Opt(None, "--clusters-loc", "-c", help="Optional location of Brown clusters data", exists=True),
    jsonl_loc: Optional[Path] = Opt(None, "--jsonl-loc", "-j", help="Location of JSONL-formatted attributes file", exists=True),
    vectors_loc: Optional[Path] = Opt(None, "--vectors-loc", "-v", help="Optional vectors file in Word2Vec format", exists=True),
    prune_vectors: int = Opt(-1, "--prune-vectors", "-V", help="Optional number of vectors to prune to"),
    truncate_vectors: int = Opt(0, "--truncate-vectors", "-t", help="Optional number of vectors to truncate to when reading in vectors file"),
    vectors_name: Optional[str] = Opt(None, "--vectors-name", "-vn", help="Optional name for the word vectors, e.g. en_core_web_lg.vectors"),
    model_name: Optional[str] = Opt(None, "--meta-name", "-mn", help="Optional name of the package for the pipeline meta"),
    base_model: Optional[str] = Opt(None, "--base", "-b", help="Name of or path to base pipeline to start with (mostly relevant for pipelines with custom tokenizers)")
    # fmt: on
):
    """
    Create a new blank pipeline directory with vocab and vectors from raw data.
    If vectors are provided in Word2Vec format, they can be either a .txt or
    zipped as a .zip or .tar.gz.

    DOCS: https://nightly.spacy.io/api/cli#init-vocab
    """
    if ctx.command.name == "init-model":
        msg.warn(
            "The init-model command is now called 'init vocab'. You can run "
            "'python -m spacy init --help' for an overview of the other "
            "available initialization commands."
        )
    init_vocab(
        lang,
        output_dir,
        freqs_loc=freqs_loc,
        clusters_loc=clusters_loc,
        jsonl_loc=jsonl_loc,
        vectors_loc=vectors_loc,
        prune_vectors=prune_vectors,
        truncate_vectors=truncate_vectors,
        vectors_name=vectors_name,
        model_name=model_name,
        base_model=base_model,
        silent=False,
    )
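
The function above is registered under two names, and it warns when it is invoked through the legacy one. The sketch below illustrates that alias-with-warning idea using only the standard library's `argparse`. It is not spaCy's Typer-based implementation: `build_parser`, `run`, and `DEPRECATED_ALIAS` are hypothetical names introduced for illustration, and the subcommand is flattened to `vocab` rather than nested under `init`. Unlike `hidden=True` above, an `argparse` alias still shows up in the help output.

```python
import argparse
from typing import List, Optional, Tuple

# Hypothetical constant mirroring the legacy command name used above.
DEPRECATED_ALIAS = "init-model"


def build_parser() -> argparse.ArgumentParser:
    """Build a parser whose 'vocab' subcommand also answers to the legacy alias."""
    parser = argparse.ArgumentParser(prog="demo")
    sub = parser.add_subparsers(dest="command", required=True)
    vocab = sub.add_parser("vocab", aliases=[DEPRECATED_ALIAS])
    vocab.add_argument("lang")
    vocab.add_argument("output_dir")
    vocab.add_argument("--prune-vectors", "-V", type=int, default=-1)
    return parser


def run(argv: List[str]) -> Tuple[Optional[str], argparse.Namespace]:
    """Parse argv and return (warning, args); warning is None for the new name."""
    args = build_parser().parse_args(argv)
    warning = None
    # argparse stores the name the user actually typed in `dest`,
    # so the legacy spelling can be detected after parsing.
    if args.command == DEPRECATED_ALIAS:
        warning = "The init-model command is now called 'init vocab'."
    return warning, args


if __name__ == "__main__":
    message, parsed = run(["init-model", "en", "./output"])
    if message:
        print("WARNING:", message)
    print(parsed.lang, parsed.output_dir, parsed.prune_vectors)
```

Both spellings reach the same code path, so existing scripts keep working while users are pointed to the new name.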