Commit Graph

2543 Commits

Author SHA1 Message Date
Daniël de Kok
207565a788 Merge remote-tracking branch 'upstream/master' into chore/v4-merge-master-20221222 2022-12-22 10:08:54 +01:00
Raphael Mitsch
eef3d950b4
Fix SpanGroup and Span typing (#12009)
* Correct Span.label, Span.kb_id types. Fix SpanGroup.__iter__().

* Extend test.

* Rename test. Fix typo.

* Add comment.

* Fix types for Span.label, Span.kb_id, Span.char_span().

* Update spacy/tests/doc/test_span_group.py

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

* Update docs.

* Fix typo.

* Update spacy/tokens/span_group.pyx

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
2022-12-21 18:54:27 +01:00
kadarakos
c223cd7a86
Add apply CLI (#11376)
* annotate cli first try

* add batch-size and n_process

* rename to apply

* typing fix

* handle file suffixes

* walk directories

* support jsonl

* typing fix

* remove debug

* make suffix optional for walk

* revert unrelated

* don't warn but raise

* better error message

* minor touch up

* Update spacy/tests/test_cli.py

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

* Update spacy/cli/apply.py

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Update spacy/cli/apply.py

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* update tests and bugfix

* add force_overwrite

* typo

* fix adding .spacy suffix

* Update spacy/cli/apply.py

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Update spacy/cli/apply.py

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Update spacy/cli/apply.py

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* store user data and rename cmd arg

* include test for user attr

* rename cmd arg

* better help message

* documentation

* prettier

* black

* link fix

* Update spacy/cli/apply.py

Co-authored-by: Paul O'Leary McCann <polm@dampfkraft.com>

* Update website/docs/api/cli.md

Co-authored-by: Paul O'Leary McCann <polm@dampfkraft.com>

* Update website/docs/api/cli.md

Co-authored-by: Paul O'Leary McCann <polm@dampfkraft.com>

* Update website/docs/api/cli.md

Co-authored-by: Paul O'Leary McCann <polm@dampfkraft.com>

* addressing reviews

* dont quit but warn

* prettier

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
Co-authored-by: Paul O'Leary McCann <polm@dampfkraft.com>
2022-12-20 17:11:33 +01:00
Daniël de Kok
f9308aae13
Fix v4 branch to build against Thinc v9 (#11921)
* Move `thinc.extra.search` to `spacy.pipeline._parser_internals`

Backport of:
https://github.com/explosion/spaCy/pull/11317

Co-authored-by: Madeesh Kannan <shadeMe@users.noreply.github.com>

* Replace references to `thinc.backends.linalg` with `CBlas`

Backport of:
https://github.com/explosion/spaCy/pull/11292

Co-authored-by: Madeesh Kannan <shadeMe@users.noreply.github.com>

* Use cross entropy from `thinc.legacy`

* Require thinc>=9.0.0.dev0,<9.1.0

Co-authored-by: Madeesh Kannan <shadeMe@users.noreply.github.com>
2022-12-17 14:32:19 +01:00
Edward
ca75190a3d
Custom extensions for spans with equal boundaries (#11429)
* Init

* Fix return type for mypy

* adjust types and improve setting new attributes

* Add underscore changes to json conversion

* Add test and underscore changes to from_docs

* add underscore changes and test to span.to_doc

* update return values

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Add types to function

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* adjust formatting

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* shorten return type

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* add helper function to improve readability

* Improve code and add comments

* rerun azure tests

* Fix tests for json conversion

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
2022-12-12 08:55:53 +01:00
Adriane Boyd
0591e67265
Cast to uint64 for all array-based doc representations (#11933)
* Convert all individual values explicitly to uint64 for array-based doc representations

* Temporarily test with latest numpy v1.24.0rc

* Remove unnecessary conversion from attr_t

* Reduce number of individual casts

* Convert specifically from int32 to uint64

* Revert "Temporarily test with latest numpy v1.24.0rc"

This reverts commit eb0e3c5006.

* Also use int32 in tests
2022-12-12 08:45:35 +01:00
github-actions[bot]
f22fc7a113
Auto-format code with black (#11955)
Co-authored-by: explosion-bot <explosion-bot@users.noreply.github.com>
2022-12-09 10:15:52 +01:00
Paul O'Leary McCann
6b9af38eeb
Remove all references to "begin_training" (#11943)
When v3 was released, `begin_training` was renamed to `initialize`.
There were warnings in the code and docs about that. This PR removes
them.
2022-12-08 11:43:52 +01:00
Daniël de Kok
27fac7df2e
EditTreeLemmatizer: correctly add strings when initializing from labels (#11934)
Strings in replacement nodes where not added to the `StringStore`
when `EditTreeLemmatizer` was initialized from a set of labels. The
corresponding test did not capture this because it added the strings
through the examples that were passed to the initialization.

This change fixes both this bug in the initialization as the 'shadowing'
of the bug in the test.
2022-12-07 13:53:41 +09:00
Zhangrp
23085ffef4
Fix interpolation in directory names, see #11235. (#11914) 2022-12-06 17:42:12 +09:00
svlandeg
04fea09ffd Merge branch 'copy_master' into copy_v4 2022-12-05 08:56:15 +01:00
Adriane Boyd
445c670a2d
Fix spancat for zero suggestions (#11860)
* Add test for spancat predict with zero suggestions

* Fix spancat for zero suggestions

* Undo changes to extract_spans

* Use .sum() as in update
2022-12-02 09:33:52 +01:00
Adriane Boyd
6f9d630f7e
Replace Pipe type with Callable in Language (#11803)
* Replace Pipe type with Callable in Language

* Use Callable[[Doc], Doc] in the docstrings
2022-11-29 13:20:08 +01:00
Adriane Boyd
e0d43557b7
Merge pull request #11871 from adrianeboyd/chore/v3.5.0
Prepare for v3.5.0
2022-11-29 11:41:32 +01:00
Adriane Boyd
1ebe7db07c
Support local filesystem remotes for projects (#11762)
* Support local filesystem remotes for projects

* Fix support for local filesystem remotes for projects
  * Use `FluidPath` instead of `Pathy` to support both filesystem and
    remote paths
  * Create missing parent directories if required for local filesystem
  * Add a more general `_file_exists` method to support both `Pathy`,
    `Path`, and `smart_open`-compatible URLs
* Add explicit `smart_open` dependency starting with support for
  `compression` flag
* Update `pathy` dependency to exclude older versions that aren't
  compatible with required `smart_open` version
* Update docs to refer to `Pathy` instead of `smart_open` for project
  remotes (technically you can still push to any `smart_open`-compatible
  path but you can't pull from them)
* Add tests for local filesystem remotes

* Update pathy for general BlobStat sorting

* Add import

* Remove _file_exists since only Pathy remotes are supported

* Format CLI docs

* Clean up merge
2022-11-29 11:40:58 +01:00
Paul O'Leary McCann
f54bfb56c9
Don't throw an error if using displacy on an unset span key (#11845)
* Don't throw an error if using displacy on an unset span key

* List available keys in W117
2022-11-28 10:01:09 +01:00
Adriane Boyd
681ec20914
Add smart_open requirement, update deprecated options (#11864)
* Switch from deprecated `ignore_ext` to `compression`
* Add upload/download test for local files
2022-11-25 13:00:57 +01:00
Adriane Boyd
378db0eb1e Temporarily skip tests that require models/compat 2022-11-25 12:05:25 +01:00
Raphael Mitsch
c0fd8a2e71
find-threshold: CLI command for multi-label classifier threshold tuning (#11280)
* Add foundation for find-threshold CLI functionality.

* Finish first draft for find-threshold.

* Add tests.

* Revert adjusted import statements.

* Fix mypy errors.

* Fix imports.

* Harmonize arguments with spacy evaluate command.

* Generalize component and threshold handling. Harmonize arguments with 'spacy evaluate' CLI.

* Fix Spancat test.

* Add beta parameter to Scorer and PRFScore.

* Make beta a component scorer setting.

* Remove beta.

* Update nlp.config (workaround).

* Reload pipeline on threshold change. Adjust tests. Remove confection reference.

* Remove assumption of component being a Pipe object or having a .cfg attribute.

* Adjust test output and reference values.

* Remove beta references. Delete universe.json.

* Reverting unnecessary changes. Removing unused default values. Renaming variables in find-cli tests.

* Update spacy/cli/find_threshold.py

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

* Remove adding labels in tests.

* Remove unused error

* Undo changes to PRFScorer

* Change default value for n_trials. Log table iteratively.

* Add warnings for pointless applications of find_threshold().

* Fix imports.

* Adjust type check of TextCategorizer to exclude subclasses.

* Change check of if there's only one unique value in scores.

* Update spacy/cli/find_threshold.py

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Incorporate feedback.

* Fix test issue. Update docstring.

* Update docs & docstring.

* Update spacy/tests/test_cli.py

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

* Add examples to docs. Rename _nlp to nlp in tests.

* Update spacy/cli/find_threshold.py

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Update spacy/cli/find_threshold.py

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
2022-11-25 11:44:55 +01:00
Adriane Boyd
30d31fd335
Update Russian and Ukrainian lemmatizers (#11811)
* pymorph2 issues #11620, #11626, #11625:
- #11620: pymorphy2_lookup
- #11626: handle multiple forms pointing to the same normal form + handling empty POS tag
- #11625: matching DET that are labelled as PRON by pymorhp2

* Move lemmatizer algorithm changes back into RussianLemmatizer

* Fix uk pymorphy3_lookup mode init

* Move and update tests for ru/uk lookup lemmatizer modes

* Fix typo

* Remove traces of previous behavior for uninflected POS

* Refactor to private generic-looking pymorphy methods

* Remove xfailed uk lemmatizer cases

* Update spacy/lang/ru/lemmatizer.py

Co-authored-by: Richard Hudson <richard@explosion.ai>

Co-authored-by: Dmytro S Lituiev <d.lituiev@gmail.com>
Co-authored-by: Richard Hudson <richard@explosion.ai>
2022-11-25 11:12:46 +01:00
Madeesh Kannan
5ea14af32b
Add training.before_update callback (#11739)
* Add `training.before_update` callback

This callback can be used to implement training paradigms like gradual (un)freezing of components (e.g: the Transformer) after a certain number of training steps to mitigate catastrophic forgetting during fine-tuning.

* Fix type annotation, default config value

* Generalize arguments passed to the callback

* Update schema

* Pass `epoch` to callback, rename `current_step` to `step`

* Add test

* Simplify test

* Replace config string with `spacy.blank`

* Apply suggestions from code review

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

* Cleanup imports

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
2022-11-23 17:54:58 +01:00
Edward
e79910d57e
Remove sentiment extension (#11722)
* remove sentiment attribute

* remove sentiment from docs

* add test for backwards compatibility

* replace from_disk with from_bytes

* Fix docs and format file

* Fix formatting
2022-11-23 13:09:32 +01:00
Marco Edward Gorelli
f0d8309a28
fix comparison of constants (#11834)
Co-authored-by: MarcoGorelli <>
2022-11-21 08:12:03 +01:00
Paul O'Leary McCann
75bb7ad541
Check textcat values for validity (#11763)
* Check textcat values for validity

* Fix error numbers

* Clean up vals reference

* Check category value validity through training

The _validate_categories is called in update, which for multilabel is
inherited from the single label component.

* Formatting
2022-11-17 10:25:01 +01:00
Paul O'Leary McCann
c0c54e44bc
Add equality definition for vectors (#11806)
* Add equality definition for vectors

This re-uses the check from sourcing components.

* Use the equality check

* Format

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
2022-11-16 09:44:42 +01:00
Adriane Boyd
03eebe9d1c
Update warning, add tests for project requirements check (#11777)
* Update warning, add tests for project requirements check

* Make warning more general for differences between PEP 508 and pip
* Add tests for _check_requirements

* Parameterize test
2022-11-09 10:59:28 +01:00
Raphael Mitsch
20bbbe3e44
Revert disable/disabled merging behavior (#11745)
* Merge disable with disabled. Adjust warnings, errors and tests.

* Replace any() with set operation.

* Update spacy/tests/pipeline/test_pipe_methods.py

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

* Update docs.

* Remve reference to config entry nlp.enabled from docs.

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
2022-11-08 14:58:10 +01:00
Adriane Boyd
40e1000db0
Restore Doc attr getter values in Doc.to_json (#11700) 2022-11-03 11:49:08 +01:00
Adriane Boyd
68b8fa2df2 Merge remote-tracking branch 'upstream/master' into chore/update-v4-from-master-4 2022-11-03 09:42:36 +01:00
Adriane Boyd
420b1d854b
Update textcat scorer threshold behavior (#11696)
* Update textcat scorer threshold behavior

For `textcat` (with exclusive classes) the scorer should always use a
threshold of 0.0 because there should be one predicted label per doc and
the numeric score for that particular label should not matter.

* Rename to test_textcat_multilabel_threshold

* Remove all uses of threshold for multi_label=False

* Update Scorer.score_cats API docs

* Add tests for score_cats with thresholds

* Update textcat API docs

* Fix types

* Convert threshold back to float

* Fix threshold type in docstring

* Improve formatting in Scorer API docs
2022-11-02 15:35:04 +01:00
Paul O'Leary McCann
d61e742960
Handle Docs with no entities in EntityLinker (#11640)
* Handle docs with no entities

If a whole batch contains no entities it won't make it to the model, but
it's possible for individual Docs to have no entities. Before this
commit, those Docs would cause an error when attempting to concatenate
arrays because the dimensions didn't match.

It turns out the process of preparing the Ragged at the end of the span
maker forward was a little different from list2ragged, which just uses
the flatten function directly. Letting list2ragged do the conversion
avoids the dimension issue.

This did not come up before because in NEL demo projects it's typical
for data with no entities to be discarded before it reaches the NEL
component.

This includes a simple direct test that shows the issue and checks it's
resolved. It doesn't check if there are any downstream changes, so a
more complete test could be added. A full run was tested by adding an
example with no entities to the Emerson sample project.

* Add a blank instance to default training data in tests

Rather than adding a specific test, since not failing on instances with
no entities is basic functionality, it makes sense to add it to the
default set.

* Fix without modifying architecture

If the architecture is modified this would have to be a new version, but
this change isn't big enough to merit that.
2022-10-28 10:25:34 +02:00
Adriane Boyd
88d35450dc
Rename test helper method with non-test_ name (#11701) 2022-10-25 14:53:18 +02:00
Adriane Boyd
cae4589f5a
Replace EntityRuler with SpanRuler implementation (#11320)
* Replace EntityRuler with SpanRuler implementation

Remove `EntityRuler` and rename the `SpanRuler`-based
`future_entity_ruler` to `entity_ruler`.

Main changes:

* It is no longer possible to load patterns on init as with
`EntityRuler(patterns=)`.
* The older serialization formats (`patterns.jsonl`) are no longer
supported and the related tests are removed.
* The config settings are only stored in the config, not in the
serialized component (in particular the `phrase_matcher_attr` and
overwrite settings).

* Add migration guide to EntityRuler API docs

* docs update

* Minor edit

Co-authored-by: svlandeg <svlandeg@github.com>
2022-10-24 09:11:35 +02:00
Adriane Boyd
a4bd890f32
Merge pull request #11686 from adrianeboyd/chore/update-v4-from-master
Update v4 from master
2022-10-21 12:55:53 +02:00
github-actions[bot]
84d9cb6b38
Auto-format code with black (#11687)
Co-authored-by: explosion-bot <explosion-bot@users.noreply.github.com>
2022-10-21 11:54:17 +02:00
Paul O'Leary McCann
0e2b7fb28b
Remove thinc util reimports (#11665)
* Remove imports marked as v2 leftovers

There are a few functions that were in `spacy.util` in v2, but were
moved to Thinc. In v3 these were imported in `spacy.util` so that code
could be used unchanged, but the comment over them indicates they should
always be imported from Thinc. This commit removes those imports.

It doesn't look like any DeprecationWarning was ever thrown for using
these, but it is probably fine to remove them anyway with a major
version. It is not clear that they were widely used.

* Import fix_random_seed correctly

This seems to be the only place in spaCy that was using the old import.
2022-10-21 11:01:18 +02:00
Adriane Boyd
103b24fb25 Merge remote-tracking branch 'upstream/master' into chore/update-v4-from-master 2022-10-21 09:13:32 +02:00
Adriane Boyd
7e56701057 Merge remote-tracking branch 'upstream/master' into chore/update-develop-from-master-v3.5 2022-10-20 13:38:49 +02:00
Edward
d66ccb8eb0
Fix multiple entries per custom extension in doc json (#11551)
* Fix multiple extensions and character offset

* Rename token_start/end to start/end

* Refactor Doc.from_json based on review

* Iterate over user_data items

* Only add non-empty underscore entries

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
2022-10-19 15:52:47 +02:00
Adriane Boyd
fe06e037bc
Fix init for pymorphy2_lookup lemmatizer mode (#11631) 2022-10-12 12:18:39 +02:00
Madeesh Kannan
446a3ecf34
StringStore refactoring (#11344)
* `strings`: Remove unused `hash32_utf8` function

* `strings`: Make `hash_utf8` and `decode_Utf8Str` private

* `strings`: Reorganize private functions

* 'strings': Raise error when non-string/-int types are passed to functions that don't accept them

* `strings`: Add `items()` method, add type hints, remove unused methods, restrict inputs to specific types, reorganize methods

* `Morphology`: Use `StringStore.items()` to enumerate features when pickling

* `test_stringstore`: Update pre-Python 3 tests

* Update `StringStore` docs

* Fix `get_string_id` imports

* Replace redundant test with tests for type checking

* Rename `_retrieve_interned_str`, remove `.get` default arg

* Add `get_string_id` to `strings.pyi`
Remove `mypy` ignore directives from imports of the above

* `strings.pyi`: Replace functions that consume `Union`-typed params with overloads

* `strings.pyi`: Revert some function signatures

* Update `SYMBOLS_BY_INT` lookups and error codes post-merge

* Revert clobbered change introduced in a previous merge

* Remove unnecessary type hint

* Invert tuple order in `StringStore.items()`

* Add test for `StringStore.items()`

* Revert "`Morphology`: Use `StringStore.items()` to enumerate features when pickling"

This reverts commit 1af9510ceb.

* Rename `keys` and `key_map`

* Add `keys()` and `values()`

* Add comment about the inverted key-value semantics in the API

* Fix type hints

* Implement `keys()`, `values()`, `items()` without generators

* Fix type hints, remove unnecessary boxing

* Update docs

* Simplify `keys/values/items()` impl

* `mypy` fix

* Fix error message, doc fixes
2022-10-06 10:51:06 +02:00
svlandeg
d4922f25fc fix test for EL activations with refactored KB 2022-10-03 14:41:15 +02:00
svlandeg
e3027c65b8 Merge branch 'copy_develop' into copy_v4 2022-10-03 14:12:16 +02:00
svlandeg
9c8cdb403e Merge branch 'master_copy' into develop_copy 2022-09-30 15:40:26 +02:00
Raphael Mitsch
aea16719be
Simplify and clarify enable/disable behavior of spacy.load() (#11459)
* Change enable/disable behavior so that arguments take precedence over config options. Extend error message on conflict. Add warning message in case of overwriting config option with arguments.

* Fix tests in test_serialize_pipeline.py to reflect changes to handling of enable/disable.

* Fix type issue.

* Move comment.

* Move comment.

* Issue UserWarning instead of printing wasabi message. Adjust test.

* Added pytest.warns(UserWarning) for expected warning to fix tests.

* Update warning message.

* Move type handling out of fetch_pipes_status().

* Add global variable for default value. Use id() to determine whether used values are default value.

* Fix default value for disable.

* Rename DEFAULT_PIPE_STATUS to _DEFAULT_EMPTY_PIPES.
2022-09-27 14:22:36 +02:00
Jacobo Myerston
3e8bc1272f
add punctuation to grc (#11426)
* add punctuation to grc

Add support for special editorial punctuation that is common in ancient Greek texts.  Ancient Greek texts, as found in digital and print form, have been largely edited by scholars. Restorations and improvements are normally marked with special characters that need to be handled properly by the tokenizer.

* add unit tests

* simplify regex

* move generic quotes to char classes

* rename unit test

* fix regex

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

Co-authored-by: svlandeg <svlandeg@github.com>
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
2022-09-27 11:38:56 +02:00
Adriane Boyd
877671e09a
Preserve missing entity annotation in augmenters (#11540)
Preserve both `-` and `O` annotation in augmenters rather than relying
on `Example.to_dict`'s default support for one option outside of labeled
entity spans.

This is intended as a temporary workaround for augmenters for v3.4.x.
The behavior of `Example` and related IOB utils could be improved in the
general case for v3.5.
2022-09-27 10:16:51 +02:00
Richard Hudson
6f692a06d5
Remove side effects from Doc.__init__() (#11506)
* Remove side effects from Doc.__init__()

* Changes based on review comment

* Readd test

* Change interface of Doc.__init__()

* Simplify test

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

* Update doc.md

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
2022-09-26 15:58:21 +02:00
github-actions[bot]
279358be63
Auto-format code with black (#11513)
Co-authored-by: explosion-bot <explosion-bot@users.noreply.github.com>
2022-09-16 11:50:19 +02:00
Adriane Boyd
7c98245c0c
Add levenshtein from polyleven (#11418)
Add a simple levenshtein distance function using the implementation from
the polyleven library as `spacy.matcher.levenshtein`.
2022-09-14 17:05:22 +02:00