* change logging call for spacy.LookupsDataLoader.v1
* substitutions in language and _util
* various more substitutions
* add string formatting guidelines to contribution guidelines
* WIP
* rm ipython embeds
* rm total
* WIP
* cleanup
* cleanup + reword
* rm component function
* remove migration support form
* fix reference dataset for dev data
* additional fixes
- set approach to identifying unique trees
- adjust line length on messages
- add logic for detecting docs without annotations
* use 0 instead of none for no annotation
* partial annotation support
* initial tests for _compile_gold lemma attributes
Using the example data from the edit tree lemmatizer tests for:
- lemmatizer_trees
- partial_lemma_annotations
- n_low_cardinality_lemmas
- no_lemma_annotations
* adds output test for cli app
* switch msg level
* rm unclear uniqueness check
* Revert "rm unclear uniqueness check"
This reverts commit 6ea2b3524b.
* remove good message on uniqueness
* formatting
* use en_vocab fixture
* clarify data set source in messages
* remove unnecessary import
Co-authored-by: svlandeg <svlandeg@github.com>
* Add a `spacy evaluate speed` subcommand
This subcommand reports the mean batch performance of a model on a data set with
a 95% confidence interval. For reliability, it first performs some warmup
rounds. Then it will measure performance on batches with randomly shuffled
documents.
To avoid having too many spaCy commands, `speed` is a subcommand of `evaluate`
and accuracy evaluation is moved to its own `evaluate accuracy` subcommand.
* Fix import cycle
* Restore `spacy evaluate`, make `spacy benchmark speed` an alias
* Add documentation for `spacy benchmark`
* CREATES -> PRINTS
* WPS -> words/s
* Disable formatting of benchmark speed arguments
* Fail with an error message when trying to speed bench empty corpus
* Make it clearer that `benchmark accuracy` is a replacement for `evaluate`
* Fix docstring webpage reference
* tests: check `evaluate` output against `benchmark accuracy`
* fix processing of "auto" in walk_directory
* add check for None
* move AUTO check to convert and fix verification of args
* add specific CLI test with CliRunner
* cleanup
* more cleanup
* update docstring
This continues work started in
https://github.com/explosion/projects/pull/147,
which provides features for automatically manipulating pipelines and
configs. The functions included are:
- merge: combine components from two pipelines and handle listeners
- use_transformer: use transformer as feature source
- use_tok2vec: use CNN tok2vec as feature source
- resume: make a version of a config for resuming training
Currently these are all grouped under a new `spacy configure` command.
That may not be the best place for them; in particular, `merge` may
belong elsewhere, since it outputs a pipeline rather than a config.
The current state of the PR is that the commands run, but there's only
one small test, and docs haven't been written yet. Docs can be started
but will depend somewhat on how the naming issues work out.
If you don't have spacy-transformers installed, but try to use `init
config` with the GPU flag, you'll get an error. The issue is that the
`use_transformers` flag in the config is conflated with the GPU flag,
and then there's an attempt to access transformers config info that may
not exist.
There may be a better way to do this, but this stops the error.
* Support local filesystem remotes for projects
* Fix support for local filesystem remotes for projects
* Use `FluidPath` instead of `Pathy` to support both filesystem and
remote paths
* Create missing parent directories if required for local filesystem
* Add a more general `_file_exists` method to support both `Pathy`,
`Path`, and `smart_open`-compatible URLs
* Add explicit `smart_open` dependency starting with support for
`compression` flag
* Update `pathy` dependency to exclude older versions that aren't
compatible with required `smart_open` version
* Update docs to refer to `Pathy` instead of `smart_open` for project
remotes (technically you can still push to any `smart_open`-compatible
path but you can't pull from them)
* Add tests for local filesystem remotes
* Update pathy for general BlobStat sorting
* Add import
* Remove _file_exists since only Pathy remotes are supported
* Format CLI docs
* Clean up merge
* Update warning, add tests for project requirements check
* Make warning more general for differences between PEP 508 and pip
* Add tests for _check_requirements
* Parameterize test
* Add fallback in requirements check, only check once
* Rename to skip_requirements_check
* Update spacy/cli/project/run.py
Co-authored-by: Paul O'Leary McCann <polm@dampfkraft.com>
Co-authored-by: Paul O'Leary McCann <polm@dampfkraft.com>
* Fix flag handling in dvc
Prior to this commit, if a flag (--verbose or --quiet) was passed to
DVC, it would be added to the end of the generated dvc command line.
This would result in the command being interpreted as part of the actual
command to run, rather than an argument to dvc. This would result in
command lines like:
spacy project run preprocess --verbose
That would fail with an error that there's no such directory as
`--verbose`.
This change puts the flags at the front of the dvc command so that they
are interpreted correctly. It removes the `run_dvc_commands` function,
which had been reduced to just a for loop and wasn't used elsewhere.
A separate problem is that there's no way to specify the quiet behaviour
to dvc from the command line, though it's unclear if that's a bug.
* Add dvc quiet flag to docs
* Handle case in DVC where no commands are appropriate
If only have commands with no deps or outputs (admittedly unlikely), you
get a weird error about the dvc file not existing. This gives explicit
output instead.
* Add support for quiet flag
* Fix command execution
Commands are strings now because they're joined further up.
* new error message when 'project run assets'
* new error message when 'project run assets'
* Update spacy/cli/project/run.py
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
Due to problems with the javascript conversion in the website
quickstart, remove the `has_letters` setting to simplify generating
`attrs` for the default `tok2vec`.
Additionally reduce `PREFIX` as in the trained pipelines.
* Add a dry run flag to download
* Remove --dry-run, add --url option to `spacy info` instead
* Make mypy happy
* Print only the URL, so it's easier to use in scripts
* Don't add the egg hash unless downloading an sdist
* Update spacy/cli/info.py
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
* Add two implementations of requirements
* Clean up requirements sample slightly
This should make mypy happy
* Update URL help string
* Remove requirements option
* Add url option to docs
* Add URL to spacy info model output, when available
* Add types-setuptools to testing reqs
* Add types-setuptools to requirements
* Add "compatible", expand docstring
* Update spacy/cli/info.py
Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
* Run prettier on CLI docs
* Update docs
Add a sidebar about finding download URLs, with some examples of the new
command.
* Add download URLs to table on model page
* Apply suggestions from code review
Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
* Updates from review
* download url -> download link
* Update docs
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
* Try cloning repo from main & master
* fixup! Try cloning repo from main & master
* fixup! fixup! Try cloning repo from main & master
* refactor clone and check for repo:branch existence
* spacing fix
* make mypy happy
* type util function
* Update spacy/cli/project/clone.py
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
Co-authored-by: Peter Baumgartner <5107405+pmbaumgartner@users.noreply.github.com>
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
* account for NER labels with a hyphen in the name
* cleanup
* fix docstring
* add return type to helper method
* shorter method and few more occurrences
* user helper method across repo
* fix circular import
* partial revert to avoid circular import