Commit Graph

9158 Commits

Author SHA1 Message Date
Paul O'Leary McCann
858565a567
Fix issues with DVC commands (#11592)
* Fix flag handling in dvc

Prior to this commit, if a flag (--verbose or --quiet) was passed to
DVC, it would be added to the end of the generated dvc command line.
This would result in the flag being interpreted as part of the actual
command to run, rather than as an argument to dvc. This would result in
command lines like:

    spacy project run preprocess --verbose

That would fail with an error that there's no such directory as
`--verbose`.

This change puts the flags at the front of the dvc command so that they
are interpreted correctly. It removes the `run_dvc_commands` function,
which had been reduced to just a for loop and wasn't used elsewhere.
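
A minimal sketch of the ordering fix described above, assuming a hypothetical helper named `build_dvc_command` (not spaCy's actual function) and an illustrative wrapped command. It only shows that DVC's own flags are placed before the wrapped command, so that DVC consumes them itself:

```python
from typing import List


def build_dvc_command(
    name: str, project_cmd: List[str], verbose: bool = False, quiet: bool = False
) -> List[str]:
    """Build a `dvc run` invocation with DVC's own flags placed up front.

    Hypothetical helper for illustration only.
    """
    flags: List[str] = []
    if verbose:
        flags.append("--verbose")
    if quiet:
        flags.append("--quiet")
    # Flags come before the stage name and the wrapped command, so they are
    # parsed as dvc arguments rather than as part of the command to run.
    return ["dvc", "run", *flags, "-n", name, *project_cmd]


cmd = build_dvc_command(
    "preprocess",
    ["python", "-m", "spacy", "project", "run", "preprocess"],
    verbose=True,
)
print(" ".join(cmd))
```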

A separate problem is that there's no way to specify the quiet behaviour
to dvc from the command line, though it's unclear if that's a bug.

* Add dvc quiet flag to docs

* Handle case in DVC where no commands are appropriate

If you only have commands with no deps or outputs (admittedly unlikely), you
get a weird error about the dvc file not existing. This gives explicit
output instead.

* Add support for quiet flag

* Fix command execution

Commands are strings now because they're joined further up.
2022-10-18 15:11:39 +09:00
github-actions[bot]
ceb62352bf
Auto-format code with black (#11649)
Co-authored-by: explosion-bot <explosion-bot@users.noreply.github.com>
2022-10-14 18:04:55 +09:00
Adriane Boyd
6b5a3e7219
Extend to pydantic v1.10 (#11635)
* Update types in `spacy.schemas` for updated pydantic+mypy
2022-10-14 08:16:49 +02:00
Sofie Van Landeghem
4d869fcc11
Small fixes to docstrings (#11610)
* add missing scorer arg to docstring

* fix class names in textcat_multilabel

* add missing scorer to docstrings
2022-10-12 15:17:40 +02:00
Adriane Boyd
fe06e037bc
Fix init for pymorphy2_lookup lemmatizer mode (#11631)
2022-10-12 12:18:39 +02:00
Sofie Van Landeghem
29649589fc
remove dtype (#11615)
2022-10-11 15:25:05 +02:00
Sofie Van Landeghem
ef74f8f5e4
Fix mypy error in edittree lemmatizer (#11612)
* cleanup imports

* try limiting Thinc to previous release

* remove Model specification

* fix code and revert Thinc constraint
2022-10-11 14:15:22 +02:00
Sofie Van Landeghem
bcda8bc1e7
update mypy to latest version (#11546)
* update mypy and disable it for python 3.6

* ignoring mypy's type redefinition error
2022-09-29 14:24:40 +02:00
Adriane Boyd
6d7630c5d3
Allow overriding spacy_version in spacy package meta (#11552)
2022-09-29 10:44:06 +02:00
Peter Baumgartner
e794d4ae39
debug data Spancat Table Improvements (#11504)
* update

* fix format function

* pull out _format_number

* format with black
2022-09-28 17:16:05 +02:00
Raphael Mitsch
aea16719be
Simplify and clarify enable/disable behavior of spacy.load() (#11459)
* Change enable/disable behavior so that arguments take precedence over config options. Extend error message on conflict. Add warning message in case of overwriting config option with arguments.

* Fix tests in test_serialize_pipeline.py to reflect changes to handling of enable/disable.

* Fix type issue.

* Move comment.

* Move comment.

* Issue UserWarning instead of printing wasabi message. Adjust test.

* Added pytest.warns(UserWarning) for expected warning to fix tests.

* Update warning message.

* Move type handling out of fetch_pipes_status().

* Add global variable for default value. Use id() to determine whether used values are default value.

* Fix default value for disable.

* Rename DEFAULT_PIPE_STATUS to _DEFAULT_EMPTY_PIPES.
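
The default-value detection mentioned in the list above can be sketched with a module-level sentinel and an identity check (`is` compares object identity, which is what comparing `id()` values does). The function name `resolve_disabled`, the `"disabled"` config key, and the warning text are assumptions made for illustration, not spaCy's actual code:

```python
import warnings
from typing import Dict, Iterable, List

# Module-level sentinel: one unique object used only as the default value.
# It is never mutated, so sharing it between calls is safe.
_DEFAULT_EMPTY_PIPES: List[str] = []


def resolve_disabled(
    config: Dict[str, List[str]],
    disable: Iterable[str] = _DEFAULT_EMPTY_PIPES,
) -> List[str]:
    """Hypothetical helper: choose disabled pipes, letting the argument win."""
    # Identity, not equality: an explicitly passed empty list is equal to the
    # sentinel but is a different object, so it counts as a user choice.
    if disable is _DEFAULT_EMPTY_PIPES:
        return list(config.get("disabled", []))
    chosen = list(disable)
    if config.get("disabled") and chosen != config["disabled"]:
        warnings.warn("Argument overrides 'disabled' from the config.", UserWarning)
    return chosen
```

With this pattern, omitting the argument falls back to the config, while passing `[]` explicitly overrides it.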
2022-09-27 14:22:36 +02:00
Adriane Boyd
877671e09a
Preserve missing entity annotation in augmenters (#11540)
Preserve both `-` and `O` annotation in augmenters rather than relying
on `Example.to_dict`'s default support for one option outside of labeled
entity spans.

This is intended as a temporary workaround for augmenters for v3.4.x.
The behavior of `Example` and related IOB utils could be improved in the
general case for v3.5.
2022-09-27 10:16:51 +02:00
Richard Hudson
6f692a06d5
Remove side effects from Doc.__init__() (#11506)
* Remove side effects from Doc.__init__()

* Changes based on review comment

* Readd test

* Change interface of Doc.__init__()

* Simplify test

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

* Update doc.md

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
2022-09-26 15:58:21 +02:00
Raphael Mitsch
af9b01ef97
Add dependency check to project step runs (#11226)
* Add dependency check to project step running.

* Fix dependency mismatch warning.

* Remove newline.

* Add types-setuptools to setup.cfg.

* Move types-setuptools to test requirements. Move warnings into _validate_requirements(). Handle file reading in project_run().

* Remove newline formatting for output of package conflicts.

* Show full version conflict message instead of just package name.

* Update spacy/cli/project/run.py

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

* Fix typo.

* Re-add rephrasing of message for conflicting packages. Remove requirements path redundancy.

* Update spacy/cli/project/run.py

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

* Update spacy/cli/project/run.py

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

* Print unified message for requirement conflicts and missing requirements.

* Update spacy/cli/project/run.py

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

* Fix warning message.

* Print conflict/missing messages individually.

* Print conflict/missing messages individually.

* Add check_requirements setting in project.yml to disable requirements check.

* Update website/docs/usage/projects.md

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

* Update website/docs/usage/projects.md

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

* Update description of project.yml structure in projects.md.

* Update website/docs/usage/projects.md

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Prettify projects docs.

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
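
As a rough illustration of the requirements check described in this commit, the sketch below reports packages that are not installed. It is an assumption-laden simplification: `find_missing` is a hypothetical name, the check uses the standard library's `importlib.metadata` rather than whatever the commit uses, and version conflicts are not handled at all:

```python
from importlib import metadata
from typing import Iterable, List


def find_missing(package_names: Iterable[str]) -> List[str]:
    """Return the names of packages that are not installed (hypothetical helper)."""
    missing: List[str] = []
    for name in package_names:
        try:
            # Raises PackageNotFoundError if no distribution has this name.
            metadata.version(name)
        except metadata.PackageNotFoundError:
            missing.append(name)
    return missing


for name in find_missing(["surely-not-an-installed-package-xyz"]):
    print(f"Missing requirement: {name}")
```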
2022-09-16 16:54:31 +02:00
github-actions[bot]
279358be63
Auto-format code with black (#11513)
Co-authored-by: explosion-bot <explosion-bot@users.noreply.github.com>
2022-09-16 11:50:19 +02:00
Sofie Van Landeghem
0509f90874
add dot (#11500)
2022-09-15 17:29:42 +02:00
Adriane Boyd
7c98245c0c
Add levenshtein from polyleven (#11418)
Add a simple levenshtein distance function using the implementation from
the polyleven library as `spacy.matcher.levenshtein`.
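
For readers unfamiliar with the metric, here is a plain-Python reference implementation of Levenshtein distance. It is only a sketch of what the function computes; it is not the polyleven implementation that the commit adds:

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance where insertion, deletion and substitution each cost 1."""
    # Keep the shorter string in the inner loop to use less memory.
    if len(a) < len(b):
        a, b = b, a
    # previous[j] holds the distance between the processed prefix of a and b[:j].
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(
                min(
                    previous[j] + 1,         # delete ch_a
                    current[j - 1] + 1,      # insert ch_b
                    previous[j - 1] + cost,  # substitute, or match for free
                )
            )
        previous = current
    return previous[-1]


print(levenshtein("kitten", "sitting"))  # 3
```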
2022-09-14 17:05:22 +02:00
Madeesh Kannan
0ec9a696e6
Fix config validation failures caused by NVTX pipeline wrappers (#11460)
* Enable Cython<->Python bindings for `Pipe` and `TrainablePipe` methods

* `pipes_with_nvtx_range`: Skip hooking methods whose signature cannot be ascertained

When loading pipelines from a config file, the arguments passed to individual pipeline components are validated by `pydantic` during init. For this, the validation model attempts to parse the function signature of the component's constructor/entry point so that it can check if all mandatory parameters are present in the config file.

When using `models_and_pipes_with_nvtx_range` as an `after_pipeline_creation` callback, the methods of all pipeline components get replaced by an NVTX range wrapper **before** the above-mentioned validation takes place. This can be problematic for components that are implemented as Cython extension types: if the extension type is not compiled with Python bindings for its methods, they will have no signatures at runtime. This resulted in `pydantic` matching the *wrapper's* parameters with those in the config and raising errors.

To avoid this, we now skip applying the wrapper to any (Cython) methods that do not have signatures.
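
The skip logic can be sketched roughly as follows, assuming hypothetical helper names (`has_signature`, `wrap_if_inspectable`) and a placeholder where a real wrapper would open and close an NVTX range. `inspect.signature` raises `ValueError` or `TypeError` when no signature can be determined:

```python
import functools
import inspect
from typing import Any, Callable


def has_signature(func: Callable[..., Any]) -> bool:
    """Return True if a signature can be determined for `func`."""
    try:
        inspect.signature(func)
    except (ValueError, TypeError):
        return False
    return True


def wrap_if_inspectable(func: Callable[..., Any]) -> Callable[..., Any]:
    """Wrap `func`, unless its signature cannot be ascertained (hypothetical)."""
    if not has_signature(func):
        # Leave the method untouched so that later signature-based
        # validation is not confused by the wrapper's own parameters.
        return func

    @functools.wraps(func)
    def wrapper(*args: Any, **kwargs: Any) -> Any:
        # A real wrapper would push an NVTX range here and pop it afterwards.
        return func(*args, **kwargs)

    return wrapper
```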
2022-09-12 14:55:41 +02:00
kadarakos
6b83fee58d
Assets message (#11458)
* new error message when 'project run assets'

* new error message when 'project run assets'

* Update spacy/cli/project/run.py

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
2022-09-09 17:17:10 +02:00
Adriane Boyd
8a86a35eab
Remove has_letters in config template (#11465)
Due to problems with the javascript conversion in the website
quickstart, remove the `has_letters` setting to simplify generating
`attrs` for the default `tok2vec`.

Additionally reduce `PREFIX` as in the trained pipelines.
2022-09-09 15:10:04 +02:00
github-actions[bot]
0c72c6bb2c
Auto-format code with black (#11468)
Co-authored-by: explosion-bot <explosion-bot@users.noreply.github.com>
2022-09-09 11:21:17 +02:00
Paul O'Leary McCann
977dc33312
Add a way to get the URL to download a pipeline to the CLI (#11175)
* Add a dry run flag to download

* Remove --dry-run, add --url option to `spacy info` instead

* Make mypy happy

* Print only the URL, so it's easier to use in scripts

* Don't add the egg hash unless downloading an sdist

* Update spacy/cli/info.py

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Add two implementations of requirements

* Clean up requirements sample slightly

This should make mypy happy

* Update URL help string

* Remove requirements option

* Add url option to docs

* Add URL to spacy info model output, when available

* Add types-setuptools to testing reqs

* Add types-setuptools to requirements

* Add "compatible", expand docstring

* Update spacy/cli/info.py

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

* Run prettier on CLI docs

* Update docs

Add a sidebar about finding download URLs, with some examples of the new
command.

* Add download URLs to table on model page

* Apply suggestions from code review

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

* Updates from review

* download url -> download link

* Update docs

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
2022-09-02 11:58:21 +02:00
github-actions[bot]
71884d0942
Auto-format code with black (#11427)
Co-authored-by: explosion-bot <explosion-bot@users.noreply.github.com>
2022-09-02 11:43:20 +02:00
Madeesh Kannan
d1760ebe02
Better handling of unexpected types in SetPredicate (#11312)
* `Matcher`: Better type checking of values in `SetPredicate`
`SetPredicate`: Emit warning and return `False` on unexpected value types

* Rename `value_type_mismatch` variable

* Inline warning

* Remove unexpected type warning from `_SetPredicate`

* Ensure that `str` values are not interpreted as sequences
Check elements of sequence values for convertibility to `str` or `int`

* Add more `INTERSECT` and `IN` test cases

* Test for inputs with multiple characters

* Return `False` early instead of using a boolean flag

* Remove superfluous `int` check, parentheses

* Apply suggestions from code review

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

* Apply suggestions from code review

* Clarify test comment

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
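
The type handling described in this commit (strings are not treated as sequences, element types are checked, and `False` is returned early) might look roughly like the sketch below. The function `intersects` and its plain-set comparison are assumptions for illustration; the actual `Matcher` predicate works differently internally:

```python
from typing import Any, Set


def intersects(value: Any, allowed: Set[str]) -> bool:
    """Hypothetical INTERSECTS-style check: is any element of `value` allowed?"""
    # A str is iterable, but here it is one value, not a sequence of values.
    if isinstance(value, str):
        return False
    # Anything that is not a supported container is an unexpected type.
    if not isinstance(value, (list, tuple, set, frozenset)):
        return False
    # Return early if any element cannot be handled as a str or int.
    if not all(isinstance(item, (str, int)) for item in value):
        return False
    return any(str(item) in allowed for item in value)
```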
2022-09-02 09:09:48 +02:00
Adriane Boyd
78f5503a29
Check for any non-Doc returned value for components (#11424)
2022-09-01 19:37:23 +02:00
Sofie Van Landeghem
8fc0efc502
Allow string argument for disable/enable/exclude (#11406)
* adding unit test for spacy.load with disable/exclude string arg

* allow pure strings in from_config

* update docs

* upstream type adjustments

* docs update

* make docstring more consistent

* Update spacy/language.py

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

* two more cleanups

* fix type in internal method

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
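
Accepting either a single name or a collection of names usually comes down to one normalization step. The sketch below uses a hypothetical helper, `as_name_list`, and is not the code from this commit:

```python
from typing import Iterable, List, Union


def as_name_list(names: Union[str, Iterable[str]]) -> List[str]:
    """Normalize a single component name or an iterable of names to a list."""
    # Without this check, list("ner") would give ["n", "e", "r"].
    if isinstance(names, str):
        return [names]
    return list(names)


print(as_name_list("ner"))
print(as_name_list(["ner", "parser"]))
```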
2022-08-31 09:02:34 +02:00
Daniël de Kok
3f4b4b7b4f
Fix test_{prefer,require}_gpu (#11390)
* Fix `test_{prefer,require}_gpu`

These tests assumed that GPUs are only supported with CuPy, but since Thinc 8.1
we also support Metal Performance Shaders.

* test_misc: arrange thinc imports to be together
2022-08-30 14:21:02 +02:00
Patrick J. Burns
5ae63b1fbd
Add Latin language support (#11349)
* Add lang folder for la (Latin)

* Add Latin lang classes

* Add minimal tokenizer exceptions

* Add minimal stopwords

* Add minimal lex_attrs

* Update stopwords, tokenizer exceptions

* Add la tests; register la_tokenizer in conftest.py

* Update spacy/lang/la/lex_attrs.py

Remove duplicate form in Latin lex_attrs

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Update natto-py version spec (#11222)

* Update natto-py version spec

* Update setup.cfg

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

* Add scorer to textcat API docs config settings (#11263)

* Update docs for pipeline initialize() methods (#11221)

* Update documentation for dependency parser

* Update documentation for trainable_lemmatizer

* Update documentation for entity_linker

* Update documentation for ner

* Update documentation for morphologizer

* Update documentation for senter

* Update documentation for spancat

* Update documentation for tagger

* Update documentation for textcat

* Update documentation for tok2vec

* Run prettier on edited files

* Apply similar changes in transformer docs

* Remove need to say annotated example explicitly

I removed the need to say "Must contain at least one annotated Example"
because it's often a given that Examples will contain some gold-standard
annotation.

* Run prettier on transformer docs

* chore: add 'concepCy' to spacy universe (#11255)

* chore: add 'concepCy' to spacy universe

* docs: add 'slogan' to concepCy

* Support full prerelease versions in the compat table (#11228)

* Support full prerelease versions in the compat table

* Fix types

* adding spans to doc_annotation in Example.to_dict (#11261)

* adding spans to doc_annotation in Example.to_dict

* to_dict compatible with from_dict: tuples instead of spans

* use strings for label and kb_id

* Simplify test

* Update data formats docs

Co-authored-by: Stefanie Wolf <stefanie.wolf@vitecsoftware.com>
Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

* Fix regex invalid escape sequences (#11276)

* Add W605 to the errors raised by flake8 in the CI (#11283)

* Clean up automated label-based issue handling (#11284)

* Clean up automated label-based issue handling

1. upgrade tiangolo/issue-manager to latest
2. move needs-more-info to tiangolo
3. change needs-more-info close time to 7 days
4. delete old needs-more-info config

* Use old, longer message

* Fix label name

* Fix Dutch noun chunks to skip overlapping spans (#11275)

* Add test for overlapping noun chunks

* Skip overlapping noun chunks

* Update spacy/tests/lang/nl/test_noun_chunks.py

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Docs: displaCy documentation - data types, `parse_{deps,ents,spans}`, spans example (#10950)

* add in spans example and parse references

* rm autoformatter

* rm extra ents copy

* TypedDict draft

* type fixes

* restore non-documentation files

* docs update

* fix spans example

* fix hyperlinks

* add parse example

* example fix + argument fix

* fix api arg in docs

* fix bad variable replacement

* fix spacing in style

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* fix spacing on table

* fix spacing on table

* rm temp files

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* include span_ruler for default warning filter (#11333)

* Add uk pipelines to website (#11332)

* Check for . in factory names (#11336)

* Make fixes for PR #11349

* Fix roman numeral coverage in #11349

Co-authored-by: Patrick J. Burns <patricks@diyclassics.org>
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
Co-authored-by: Paul O'Leary McCann <polm@dampfkraft.com>
Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
Co-authored-by: Lj Miranda <12949683+ljvmiranda921@users.noreply.github.com>
Co-authored-by: Jules Belveze <32683010+JulesBelveze@users.noreply.github.com>
Co-authored-by: stefawolf <wlf.ste@gmail.com>
Co-authored-by: Stefanie Wolf <stefanie.wolf@vitecsoftware.com>
Co-authored-by: Peter Baumgartner <5107405+pmbaumgartner@users.noreply.github.com>
2022-08-30 14:04:54 +02:00
Edward
6723d76f24
Add ConsoleLogger.v2 (#11214)
* Init

* Change logger to ConsoleLogger.v2

* adjust naming

* More naming adjustments

* Fix output_file reference error

* ignore type

* Add basic test for logger

* Hopefully fix mypy issue

* mypy ignore line

* Update mypy line

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

* Update test method name

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

* Change file saving logic

* Fix finalize method

* increase spacy-legacy version in requirements

* Update docs

* small adjustments

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
2022-08-29 10:23:05 +02:00
Tobius Saul
c09d2fa25b
luganda language extension (#10847)
* luganda language extension

* __init__.py changes

* New enhancements

* Lexical attribute changed

* punctuation and sentence additions

* Remove comment header

* Fix typos, reformat

* reformatted version

* Add tokenizer test

* Remove contractions from stop words

* Format

* Add Luganda to website

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
2022-08-23 13:09:36 +02:00
Edward
5afa98aabf
Support custom attributes for tokens and spans in json conversion (#11125)
* Add token and span custom attributes to to_json()

* Change logic for to_json

* Add functionality to from_json

* Small adjustments

* Move token/span attributes to new dict key

* Fix test

* Fix the same test but much better

* Add backwards compatibility tests and adjust logic

* Add test to check if attributes not set in underscore are not saved in the json

* Add tests for json compatibility

* Adjust test names

* Fix tests and clean up code

* Fix assert json tests

* small adjustment

* adjust naming and code readability

* Adjust naming, added more tests and changed logic

* Fix typo

* Adjust errors, naming, and small test optimization

* Fix byte tests

* Fix bytes tests

* Change naming and json structure

* update schema

* Update spacy/schemas.py

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

* Update spacy/tokens/doc.pyx

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

* Update spacy/tokens/doc.pyx

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

* Update spacy/schemas.py

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

* Update schema for underscore attributes

* Adjust underscore schema

* adjust schema tests

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
2022-08-23 10:05:02 +02:00
Adriane Boyd
f55bb7470d
Clean up warnings in the test suite (#11331)
2022-08-22 12:04:30 +02:00
Paul O'Leary McCann
0f07defe2c
Remove reference to voting on issue (#11335)
Not clear which issue this refers to, we don't suggest this for any
other issues, and we don't use votes in general.
2022-08-22 11:29:05 +02:00
Adriane Boyd
3e4cf1bbe1
Check for . in factory names (#11336)
2022-08-19 09:52:12 +02:00
Sofie Van Landeghem
cab263791f
include span_ruler for default warning filter (#11333)
2022-08-17 19:55:54 +02:00
Peter Baumgartner
db7b9938a4
Docs: displaCy documentation - data types, parse_{deps,ents,spans}, spans example (#10950)
* add in spans example and parse references

* rm autoformatter

* rm extra ents copy

* TypedDict draft

* type fixes

* restore non-documentation files

* docs update

* fix spans example

* fix hyperlinks

* add parse example

* example fix + argument fix

* fix api arg in docs

* fix bad variable replacement

* fix spacing in style

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* fix spacing on table

* fix spacing on table

* rm temp files

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
2022-08-16 11:23:34 -04:00
Adriane Boyd
ed4ad309e6
Fix Dutch noun chunks to skip overlapping spans (#11275)
* Add test for overlapping noun chunks

* Skip overlapping noun chunks

* Update spacy/tests/lang/nl/test_noun_chunks.py

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
2022-08-10 09:49:08 +02:00
Adriane Boyd
fc4246558b
Fix regex invalid escape sequences (#11276)
2022-08-09 10:59:36 +02:00
stefawolf
23749cfc91
adding spans to doc_annotation in Example.to_dict (#11261)
* adding spans to doc_annotation in Example.to_dict

* to_dict compatible with from_dict: tuples instead of spans

* use strings for label and kb_id

* Simplify test

* Update data formats docs

Co-authored-by: Stefanie Wolf <stefanie.wolf@vitecsoftware.com>
Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
2022-08-05 12:26:38 +02:00
Adriane Boyd
b07708d5d0
Support full prerelease versions in the compat table (#11228)
* Support full prerelease versions in the compat table

* Fix types
2022-08-04 15:14:19 +02:00
ninjalu
95a1b8aca6
add additional REL_OP (#10371)
* add additional REL_OP

* change to condition and new rel_op symbols

* add operators to docs

* add the anchor while we're in here

* add tests

Co-authored-by: Peter Baumgartner <5107405+pmbaumgartner@users.noreply.github.com>
2022-07-27 13:16:44 +02:00
Edward
360a702ecd
Add parent argument (#11210)
2022-07-26 14:35:18 +02:00
Adriane Boyd
5c2a00cef0
Set version to v3.4.1 (#11209)
2022-07-26 12:52:38 +02:00
Daniël de Kok
4ee8a06149
Fix compatibility with CuPy 9.x (#11194)
After the precomputable affine table of shape [nB, nF, nO, nP] is
computed, padding with shape [1, nF, nO, nP] is assigned to the first
row of the precomputed affine table. However, when we are indexing the
precomputed table, we get a row of shape [nF, nO, nP]. CuPy versions
before 10.0 cannot paper over this shape difference.

This change fixes compatibility with CuPy < 10.0 by squeezing the first
dimension of the padding before assignment.
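
The shape mismatch and the fix can be illustrated with NumPy standing in for CuPy. The dimension sizes below are arbitrary assumptions. NumPy itself tolerates the leading singleton dimension on assignment, so the point here is only to show how squeezing makes the two shapes match exactly:

```python
import numpy as np

# Arbitrary illustrative sizes for the dimensions named in the commit message.
nB, nF, nO, nP = 4, 3, 2, 2

table = np.zeros((nB, nF, nO, nP), dtype="float32")
padding = np.ones((1, nF, nO, nP), dtype="float32")

# Indexing the table yields a row of shape (nF, nO, nP),
# while the padding still carries a leading dimension of size 1.
print(table[0].shape, padding.shape)

# Squeeze the first dimension so both sides have exactly the same shape.
row = padding.squeeze(0)
assert row.shape == table[0].shape
table[0] = row
```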
2022-07-26 10:52:01 +02:00
Adriane Boyd
e5990db713
Revert "Temporarily skip tests that require models/compat"
This reverts commit d9320db7db.
2022-07-25 18:12:18 +02:00
Nicolai Bjerre Pedersen
2fa983aa2e
Fix span typings (#11119)
Add id, id_ to span.pyi.
2022-07-12 13:47:35 +02:00
Peter Baumgartner
36cb2029a9
displaCy Spans Vertical Alignment Fix 2 (#11092)
* add in span render slot fix

* fix spacing off by one

* rm demo

* adjust comments

* fix whitespace and overlap issue
2022-07-08 19:20:13 +02:00
github-actions[bot]
e7fd06bdbe
Auto-format code with black (#11099)
Co-authored-by: explosion-bot <explosion-bot@users.noreply.github.com>
2022-07-08 18:43:25 +09:00
Daniël de Kok
a06cbae70d
precompute_hiddens/Parser: do not look up CPU ops (3.4) (#11069)
* precompute_hiddens/Parser: do not look up CPU ops

`get_ops("cpu")` is quite expensive. To avoid this, we want to cache the
result as in #11068. However, for 3.x we do not want to change the ABI.
So we avoid the expensive lookup by using NumpyOps. This should have a
minimal impact, since `get_ops("cpu")` was only used when the model ops
were `CupyOps`. If the ops are `AppleOps`, we are still passing through
the correct BLAS implementation.

* _NUMPY_OPS -> NUMPY_OPS
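
The general idea of avoiding a repeated expensive lookup by resolving a value once at module level can be sketched as below. Everything here is a stand-in: `lookup_cpu_backend` is a hypothetical function that merely counts its calls, and `CPU_BACKEND` plays the role that a module-level constant such as `NUMPY_OPS` plays in the commit:

```python
from typing import Dict, Tuple


def lookup_cpu_backend() -> Dict[str, str]:
    """Hypothetical stand-in for an expensive registry lookup."""
    lookup_cpu_backend.calls += 1
    return {"name": "numpy"}


lookup_cpu_backend.calls = 0

# Resolved once when the module is imported, then reused everywhere.
CPU_BACKEND = lookup_cpu_backend()


def hot_path(x: int) -> Tuple[str, int]:
    # Uses the module-level constant instead of repeating the lookup.
    return (CPU_BACKEND["name"], x)


for i in range(1000):
    hot_path(i)
print(lookup_cpu_backend.calls)
```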
2022-07-05 10:53:42 +02:00
Madeesh Kannan
d36d66b7ca
Increase test deadline to 30 minutes to prevent spurious test failures (#11070)
* Increase test deadline to 30 minutes to prevent spurious test failures

* Reduce deadline to 2 minutes
2022-07-04 18:37:09 +02:00