Commit Graph

9207 Commits

Author SHA1 Message Date
Madeesh Kannan
604a7c3c26
SpanGroup(s)-related optimizations (#11380)
* `SpanGroup`: Add support for binding copies to a new reference document

* `SpanGroups`: Replace superfluous serialize-deserialize roundtrip in `copy`

Instead, directly copy the in-memory representations of the constituent `SpanGroup`s.

* Update `SpanGroup.copy()` signature

* Rename `new_doc` param to `doc`

* Fix kwdarg

* Update `.pyi` file and docstrings

* `mypy` fix

* Update spacy/tokens/span_group.pyx

* Update docs

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
2022-08-31 09:03:20 +02:00
Sofie Van Landeghem
8fc0efc502
Allow string argument for disable/enable/exclude (#11406)
* adding unit test for spacy.load with disable/exclude string arg

* allow pure strings in from_config

* update docs

* upstream type adjustements

* docs update

* make docstring more consistent

* Update spacy/language.py

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

* two more cleanups

* fix type in internal method

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
2022-08-31 09:02:34 +02:00
Paul O'Leary McCann
698b8b495f
Update/remove old Matcher syntax (#11370)
* Clean up old Matcher call style related stuff

In v2 Matcher.add was called with (key, on_match, *patterns). In v3 this
was changed to (key, patterns, *, on_match=None), but there were various
points where the old call syntax was documented or handled specially.
This removes all those.

The Matcher itself didn't need any code changes, as it just gives a
generic type error. However the PhraseMatcher required some changes
because it would automatically "fix" the old call style.

Surprisingly, the tokenizer was still using the old call style in one
place.

After these changes tests failed in two places:

1. one test for the "new" call style, including the "old" call style. I
   removed this test.
2. deserializing the PhraseMatcher fails because the input docs are a
   set.

I am not sure why 2 is happening - I guess it's a quirk of the
serialization format? - so for now I just convert the set to a list when
deserializing. The check that the input Docs are a List in the
PhraseMatcher is a new check, but makes it parallel with the other
Matchers, which seemed like the right thing to do.

* Add notes related to input docs / deserialization type

* Remove Typing import

* Remove old note about call style change

* Apply suggestions from code review

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

* Use separate method for setting internal doc representations

In addition to the title change, this changes the internal dict to be a
defaultdict, instead of a dict with frequent use of setdefault.

* Add _add_from_arrays for unpickling

* Cleanup around adding from arrays

This moves adding to internal structures into the private batch method,
and removes the single-add method.

This has one behavioral change for `add`, in that if something is wrong
with the list of input Docs (such as one of the items not being a Doc),
valid items before the invalid one will not be added. Also the callback
will not be updated if anything is invalid. This change should not be
significant.

This also adds a test to check failure when given a non-Doc.

* Update spacy/matcher/phrasematcher.pyx

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
2022-08-30 15:40:31 +02:00
Daniël de Kok
3f4b4b7b4f
Fix test_{prefer,require}_gpu (#11390)
* Fix `test_{prefer,require}_gpu`

These tests assumed that GPUs are only supported with CuPy, but since Thinc 8.1
we also support Metal Performance Shaders.

* test_misc: arrange thinc imports to be together
2022-08-30 14:21:02 +02:00
Patrick J. Burns
5ae63b1fbd
Add Latin language support (#11349)
* Add lang folder for la (Latin)

* Add Latin lang classes

* Add minimal tokenizer exceptions

* Add minimal stopwords

* Add minimal lex_attrs

* Update stopwords, tokenizer exceptions

* Add la tests; register la_tokenizer in conftest.py

* Update spacy/lang/la/lex_attrs.py

Remove duplicate form in Latin lex_attrs

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Update natto-py version spec (#11222)

* Update natto-py version spec

* Update setup.cfg

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

* Add scorer to textcat API docs config settings (#11263)

* Update docs for pipeline initialize() methods (#11221)

* Update documentation for dependency parser

* Update documentation for trainable_lemmatizer

* Update documentation for entity_linker

* Update documentation for ner

* Update documentation for morphologizer

* Update documentation for senter

* Update documentation for spancat

* Update documentation for tagger

* Update documentation for textcat

* Update documentation for tok2vec

* Run prettier on edited files

* Apply similar changes in transformer docs

* Remove need to say annotated example explicitly

I removed the need to say "Must contain at least one annotated Example"
because it's often a given that Examples will contain some gold-standard
annotation.

* Run prettier on transformer docs

* chore: add 'concepCy' to spacy universe (#11255)

* chore: add 'concepCy' to spacy universe

* docs: add 'slogan' to concepCy

* Support full prerelease versions in the compat table (#11228)

* Support full prerelease versions in the compat table

* Fix types

* adding spans to doc_annotation in Example.to_dict (#11261)

* adding spans to doc_annotation in Example.to_dict

* to_dict compatible with from_dict: tuples instead of spans

* use strings for label and kb_id

* Simplify test

* Update data formats docs

Co-authored-by: Stefanie Wolf <stefanie.wolf@vitecsoftware.com>
Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

* Fix regex invalid escape sequences (#11276)

* Add W605 to the errors raised by flake8 in the CI (#11283)

* Clean up automated label-based issue handling (#11284)

* Clean up automated label-based issue handline

1. upgrade tiangolo/issue-manager to latest
2. move needs-more-info to tiangolo
3. change needs-more-info close time to 7 days
4. delete old needs-more-info config

* Use old, longer message

* Fix label name

* Fix Dutch noun chunks to skip overlapping spans (#11275)

* Add test for overlapping noun chunks

* Skip overlapping noun chunks

* Update spacy/tests/lang/nl/test_noun_chunks.py

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Docs: displaCy documentation - data types, `parse_{deps,ents,spans}`, spans example (#10950)

* add in spans example and parse references

* rm autoformatter

* rm extra ents copy

* TypedDict draft

* type fixes

* restore non-documentation files

* docs update

* fix spans example

* fix hyperlinks

* add parse example

* example fix + argument fix

* fix api arg in docs

* fix bad variable replacement

* fix spacing in style

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* fix spacing on table

* fix spacing on table

* rm temp files

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* include span_ruler for default warning filter (#11333)

* Add uk pipelines to website (#11332)

* Check for . in factory names (#11336)

* Make fixes for PR #11349

* Fix roman numeral coverage in #11349

Co-authored-by: Patrick J. Burns <patricks@diyclassics.org>
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
Co-authored-by: Paul O'Leary McCann <polm@dampfkraft.com>
Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
Co-authored-by: Lj Miranda <12949683+ljvmiranda921@users.noreply.github.com>
Co-authored-by: Jules Belveze <32683010+JulesBelveze@users.noreply.github.com>
Co-authored-by: stefawolf <wlf.ste@gmail.com>
Co-authored-by: Stefanie Wolf <stefanie.wolf@vitecsoftware.com>
Co-authored-by: Peter Baumgartner <5107405+pmbaumgartner@users.noreply.github.com>
2022-08-30 14:04:54 +02:00
Adriane Boyd
98a916e01a
Make stable private modules public and adjust names (#11353)
* Make stable private modules public and adjust names

* `spacy.ml._character_embed` -> `spacy.ml.character_embed`
* `spacy.ml._precomputable_affine` -> `spacy.ml.precomputable_affine`
* `spacy.tokens._serialize` -> `spacy.tokens.doc_bin`
* `spacy.tokens._retokenize` -> `spacy.tokens.retokenize`
* `spacy.tokens._dict_proxies` -> `spacy.tokens.span_groups`

* Skip _precomputable_affine

* retokenize -> retokenizer

* Fix imports
2022-08-30 13:56:35 +02:00
Adriane Boyd
4bce8fa755
Remove setup_requires from setup.cfg (#11384)
* Remove setup_requires from setup.cfg

* Update requirements test to ignore cython in setup.cfg
2022-08-29 13:23:24 +02:00
Paul O'Leary McCann
aafee5e1b7
Fix lookup usage in French/Catalan (fix #11347) (#11382)
* Fix lookup usage (fix #11347)

Before using the lookups table in the French (and Catalan) lemmatizers,
there's a check to see if the current term is in the table. But it's
checking a string against hashes, so it's always false. Also the table
lookup function is designed so you don't have to do that anyway.

* Use the lookup table directly

* Use string, not token
2022-08-29 10:32:38 +02:00
Edward
6723d76f24
Add ConsoleLogger.v2 (#11214)
* Init

* Change logger to ConsoleLogger.v2

* adjust naming

* More naming adjustments

* Fix output_file reference error

* ignore type

* Add basic test for logger

* Hopefully fix mypy issue

* mypy ignore line

* Update mypy line

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

* Update test method name

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

* Change file saving logic

* Fix finalize method

* increase spacy-legacy version in requirements

* Update docs

* small adjustments

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
2022-08-29 10:23:05 +02:00
Adriane Boyd
2a558a7cdc
Switch to mecab-ko as default Korean tokenizer (#11294)
* Switch to mecab-ko as default Korean tokenizer

Switch to the (confusingly-named) mecab-ko python module for default Korean
tokenization.

Maintain the previous `natto-py` tokenizer as
`spacy.KoreanNattoTokenizer.v1`.

* Temporarily run tests with mecab-ko tokenizer

* Fix types

* Fix duplicate test names

* Update requirements test

* Revert "Temporarily run tests with mecab-ko tokenizer"

This reverts commit d2083e7044.

* Add mecab_args setting, fix pickle for KoreanNattoTokenizer

* Fix length check

* Update docs

* Formatting

* Update natto-py error message

Co-authored-by: Paul O'Leary McCann <polm@dampfkraft.com>

Co-authored-by: Paul O'Leary McCann <polm@dampfkraft.com>
2022-08-26 10:11:18 +02:00
Adriane Boyd
740c33fe58 Merge remote-tracking branch 'upstream/develop' into chore/update-v4-from-develop 2022-08-24 20:43:07 +02:00
Adriane Boyd
81874265e9 Merge remote-tracking branch 'upstream/master' into chore/update-develop-from-master-v3.5-1 2022-08-24 12:47:42 +02:00
Adriane Boyd
c44d243f25 Merge remote-tracking branch 'upstream/master' into chore/update-v4-from-master 2022-08-24 07:15:41 +02:00
Tobius Saul
c09d2fa25b
luganda language extension (#10847)
* luganda language extension

* __init__.py changes

* New enhancements

* Lexical attribute changed

* punctuaction and sentence additions

* Remove comment header

* Fix typos, reformat

* reformated version

* Add tokenizer test

* Remove contractions from stop words

* Format

* Add Luganda to website

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
2022-08-23 13:09:36 +02:00
Edward
5afa98aabf
Support custom attributes for tokens and spans in json conversion (#11125)
* Add token and span custom attributes to to_json()

* Change logic for to_json

* Add functionality to from_json

* Small adjustments

* Move token/span attributes to new dict key

* Fix test

* Fix the same test but much better

* Add backwards compatibility tests and adjust logic

* Add test to check if attributes not set in underscore are not saved in the json

* Add tests for json compatibility

* Adjust test names

* Fix tests and clean up code

* Fix assert json tests

* small adjustment

* adjust naming and code readability

* Adjust naming, added more tests and changed logic

* Fix typo

* Adjust errors, naming, and small test optimization

* Fix byte tests

* Fix bytes tests

* Change naming and json structure

* update schema

* Update spacy/schemas.py

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

* Update spacy/tokens/doc.pyx

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

* Update spacy/tokens/doc.pyx

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

* Update spacy/schemas.py

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

* Update schema for underscore attributes

* Adjust underscore schema

* adjust schema tests

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
2022-08-23 10:05:02 +02:00
Adriane Boyd
bb0e178878
Make Span/Doc.ents more consistent for ent_kb_id and ent_id (#11328)
* Map `Span.id` to `Token.ent_id` in all cases when setting `Doc.ents`
* Reset `Token.ent_id` and `Token.ent_kb_id` when setting `Doc.ents`
* Make `Span.ent_id` an alias of `Span.id` rather than a read-only view
of the root token's `ent_id` annotation
2022-08-22 20:28:57 +02:00
Sofie Van Landeghem
1a5be63715
Cleanup Cython structs (#11337)
* cleanup Tokenizer fields

* remove unused object from vocab

* remove IS_OOV_DEPRECATED

* add back in as FLAG13

* FLAG 18 instead

* import fix

* fix clumpsy fingers

* revert symbol changes in favor of #11352

* bint instead of bool
2022-08-22 15:52:24 +02:00
Adriane Boyd
f55bb7470d
Clean up warnings in the test suite (#11331) 2022-08-22 12:04:30 +02:00
Paul O'Leary McCann
0f07defe2c
Remove reference to voting on issue (#11335)
Not clear which issue this refers to, we don't suggest this for any
other issues, and we don't use votes in general.
2022-08-22 11:29:05 +02:00
Adriane Boyd
5fa8f4faca
Switch ru and uk lemmatizers to pymorphy3 (#11345)
* Switch ru and uk lemmatizers to pymorphy3

* Switch to pymorphy3 in tests
2022-08-22 11:27:14 +02:00
Adriane Boyd
3e4cf1bbe1
Check for . in factory names (#11336) 2022-08-19 09:52:12 +02:00
Sofie Van Landeghem
cab263791f
include span_ruler for default warning filter (#11333) 2022-08-17 19:55:54 +02:00
Adriane Boyd
d757dec5c4
Remove intify_attrs(_do_deprecated) (#11319) 2022-08-17 12:13:54 +02:00
Peter Baumgartner
db7b9938a4
Docs: displaCy documentation - data types, parse_{deps,ents,spans}, spans example (#10950)
* add in spans example and parse references

* rm autoformatter

* rm extra ents copy

* TypedDict draft

* type fixes

* restore non-documentation files

* docs update

* fix spans example

* fix hyperlinks

* add parse example

* example fix + argument fix

* fix api arg in docs

* fix bad variable replacement

* fix spacing in style

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* fix spacing on table

* fix spacing on table

* rm temp files

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
2022-08-16 11:23:34 -04:00
antonpibm
551e73ccfc
Match private networks as URLs (#11121) 2022-08-11 11:26:26 +02:00
Sofie Van Landeghem
5d54c0e32a
Rename modules for consistency (#11286)
* rename Python module to entity_ruler

* rename Python module to attribute_ruler
2022-08-10 11:44:05 +02:00
Adriane Boyd
ed4ad309e6
Fix Dutch noun chunks to skip overlapping spans (#11275)
* Add test for overlapping noun chunks

* Skip overlapping noun chunks

* Update spacy/tests/lang/nl/test_noun_chunks.py

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
2022-08-10 09:49:08 +02:00
Adriane Boyd
fc4246558b
Fix regex invalid escape sequences (#11276) 2022-08-09 10:59:36 +02:00
stefawolf
23749cfc91
adding spans to doc_annotation in Example.to_dict (#11261)
* adding spans to doc_annotation in Example.to_dict

* to_dict compatible with from_dict: tuples instead of spans

* use strings for label and kb_id

* Simplify test

* Update data formats docs

Co-authored-by: Stefanie Wolf <stefanie.wolf@vitecsoftware.com>
Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
2022-08-05 12:26:38 +02:00
Luka Dragar
b64243ed55
Updates to Slovenian language (#11162)
* Added examples for Slovene

* Update spacy/lang/sl/examples.py

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

* Corrected a typo in one of the sentences

* Updated support for Slovenian

* Some minor changes to corrections

* Added forint currency

* Corrected HYPHENS_PERMITTED regex and some formatting

* Minor changes

* Un-xfail tokenizer test

* Format

Co-authored-by: Luka Dragar <D20124481@mytudublin.ie>
Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
2022-08-05 10:10:18 +02:00
Adriane Boyd
b07708d5d0
Support full prerelease versions in the compat table (#11228)
* Support full prerelease versions in the compat table

* Fix types
2022-08-04 15:14:19 +02:00
Daniël de Kok
e581eeac34
precompute_hiddens/Parser: look up CPU ops once (v4) (#11068)
* precompute_hiddens/Parser: look up CPU ops once

* precompute_hiddens: make cpu_ops private
2022-07-29 15:12:19 +02:00
Daniël de Kok
1ff683a50b Merge remote-tracking branch 'upstream/master' into merge-master-v4-20220728 2022-07-28 13:53:59 +02:00
ninjalu
95a1b8aca6
add additional REL_OP (#10371)
* add additional  REL_OP

* change to condition and new rel_op symbols

* add operators to docs

* add the anchor while we're in here

* add tests

Co-authored-by: Peter Baumgartner <5107405+pmbaumgartner@users.noreply.github.com>
2022-07-27 13:16:44 +02:00
Edward
360a702ecd
Add parent argument (#11210) 2022-07-26 14:35:18 +02:00
Adriane Boyd
5c2a00cef0
Set version to v3.4.1 (#11209) 2022-07-26 12:52:38 +02:00
Daniël de Kok
4ee8a06149
Fix compatibility with CuPy 9.x (#11194)
After the precomputable affine table of shape [nB, nF, nO, nP] is
computed, padding with shape [1, nF, nO, nP] is assigned to the first
row of the precomputed affine table. However, when we are indexing the
precomputed table, we get a row of shape [nF, nO, nP]. CuPy versions
before 10.0 cannot paper over this shape difference.

This change fixes compatibility with CuPy < 10.0 by squeezing the first
dimension of the padding before assignment.
2022-07-26 10:52:01 +02:00
Adriane Boyd
e5990db713 Revert "Temporarily skip tests that require models/compat"
This reverts commit d9320db7db.
2022-07-25 18:12:18 +02:00
Madeesh Kannan
ba18d2913d
Morphology/Morphologizer optimizations and refactoring (#11024)
* `Morphology`: Refactor to use C types, reduce allocations, remove unused code

* `Morphologzier`: Avoid unnecessary sorting of morpho features

* `Morphologizer`: Remove execessive reallocations of labels, improve hash lookups of labels, coerce `numpy` numeric types to native ints
Update docs

* Remove unused method

* Replace `unique_ptr` usage with `shared_ptr`

* Add type annotations to internal Python methods, rename `hash` variable, fix typos

* Add comment to clarify implementation detail

* Fix return type

* `Morphology`: Stop early when splitting fields and values
2022-07-15 11:14:08 +02:00
Nicolai Bjerre Pedersen
2fa983aa2e
Fix span typings (#11119)
Add id, id_ to span.pyi.
2022-07-12 13:47:35 +02:00
Peter Baumgartner
36cb2029a9
displaCy Spans Vertical Alignment Fix 2 (#11092)
* add in span render slot fix

* fix spacing off by one

* rm demo

* adjust comments

* fix whitespace and overlap issue
2022-07-08 19:20:13 +02:00
github-actions[bot]
e7fd06bdbe
Auto-format code with black (#11099)
Co-authored-by: explosion-bot <explosion-bot@users.noreply.github.com>
2022-07-08 18:43:25 +09:00
Daniël de Kok
a06cbae70d
precompute_hiddens/Parser: do not look up CPU ops (3.4) (#11069)
* precompute_hiddens/Parser: do not look up CPU ops

`get_ops("cpu")` is quite expensive. To avoid this, we want to cache the
result as in #11068. However, for 3.x we do not want to change the ABI.
So we avoid the expensive lookup by using NumpyOps. This should have a
minimal impact, since `get_ops("cpu")` was only used when the model ops
were `CupyOps`. If the ops are `AppleOps`, we are still passing through
the correct BLAS implementation.

* _NUMPY_OPS -> NUMPY_OPS
2022-07-05 10:53:42 +02:00
Madeesh Kannan
d36d66b7ca
Increase test deadline to 30 minutes to prevent spurious test failures (#11070)
* Increase test deadline to 30 minutes to prevent spurious test failures

* Reduce deadline to 2 minutes
2022-07-04 18:37:09 +02:00
kadarakos
5240baccfe
dont use get_array_module (#11056) 2022-07-04 17:15:33 +02:00
Raphael Mitsch
e9eb59699f
NEL confidence threshold (#11016)
* Add base for NEL abstention threshold mechanism.

* Add abstention threshold to entity linker. Add test.

* Fix entity linking tests.

* Changed abstention default threshold from 0 to None.

* Fix default values for abstention thresholds.

* Fix mypy errors.

* Replace assertion with raise of proper error code.

* Simplify threshold check. Remove thresholding from EntityLinker_v1.

* Rename test.

* Update spacy/pipeline/entity_linker.py

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Update spacy/pipeline/entity_linker.py

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Make E1043 configurable.

* Update docs.

* Rephrase description in docs. Adjusting error code message.

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
2022-07-04 17:05:21 +02:00
Madeesh Kannan
59c763eec1
StringStore-related optimizations (#10938)
* `strings`: More roubust type checking of keys/IDs, coerce `int`-like types to `hash_t`

* Preserve existing public API behaviour

* Fix return type

* Replace `bool` with `bint`, rename to `_try_coerce_to_hash`, replace `id` with `hash`

* Avoid unnecessary re-encoding and re-calculation of strings and hashs respectively

* Rename variables named `hash`
Add comment on early return
2022-07-04 15:04:03 +02:00
explosion-bot
7e55a51314 Auto-format code with black 2022-07-01 08:04:32 +00:00
Madeesh Kannan
eaf66e7431
Add NVTX ranges to TrainablePipe components (#10965)
* `TrainablePipe`: Add NVTX range decorator

* Annotate `TrainablePipe` subclasses with NVTX ranges

* Export function signature to allow introspection of args in tests

* Revert "Annotate `TrainablePipe` subclasses with NVTX ranges"

This reverts commit d8684f7372.

* Revert "Export function signature to allow introspection of args in tests"

This reverts commit f4405ca3ad.

* Revert "`TrainablePipe`: Add NVTX range decorator"

This reverts commit 26536eb6b8.

* Add `spacy.pipes_with_nvtx_range` pipeline callback

* Show warnings for all missing user-defined pipe functions that need to be annotated
Fix imports, typos

* Rename `DEFAULT_ANNOTATABLE_PIPE_METHODS` to `DEFAULT_NVTX_ANNOTATABLE_PIPE_METHODS`
Reorder import

* Walk model nodes directly whilst applying NVTX ranges
Ignore pipe method wrapper when applying range
2022-06-30 11:28:12 +02:00
Adriane Boyd
3fe9f47de4
Revert "disable failing test because Stanford servers are down (#11015)" (#11054)
This reverts commit f8116078ce.
2022-06-30 11:24:54 +02:00
Shen Qin
be00db6645
Addition of min_max quantifier in matcher {n,m} (#10981)
* Min_max_operators
1. Modified API and Usage for spaCy website to include min_max operator
2. Modified matcher.pyx to include min_max function {n,m} and its variants
3. Modified schemas.py to include min_max validation error
4. Added test cases to test_matcher_api.py, test_matcher_logic.py and test_pattern_validation.py

* attempt to fix mypy/pydantic compat issue

* formatting

* Update spacy/tests/matcher/test_pattern_validation.py

Co-authored-by: Source-Shen <82353723+Source-Shen@users.noreply.github.com>
Co-authored-by: svlandeg <svlandeg@github.com>
Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
2022-06-30 11:01:58 +02:00
Daniël de Kok
0ff14aabce
vectors: avoid expensive comparisons between numpy ints and Python ints (#10992)
* vectors: avoid expensive comparisons between numpy ints and Python ints

* vectors: avoid failure on lists of ints

* Convert another numpy int to Python
2022-06-29 12:58:31 +02:00
Peter Baumgartner
dd038b536c
fix to horizontal space (#10994) 2022-06-28 20:42:40 +02:00
Adriane Boyd
24f4908fce
Update vector handling in similarity methods (#11013)
Distinguish between vectors that are 0 vs. missing vectors when warning
about missing vectors.

Update `Doc.has_vector` to match `Span.has_vector` and
`Token.has_vector` for cases where the vocab has vectors but none of the
tokens in the container have vectors.
2022-06-28 19:50:47 +02:00
Madeesh Kannan
1d5cad0b42
Example.get_aligned_parse: Handle unit and zero length vectors correctly (#11026)
* `Example.get_aligned_parse`: Do not squeeze gold token idx vector
Correctly handle zero-size vectors passed to `np.vectorize`

* Add tests

* Use `Doc` ctor to initialize attributes

* Remove unintended change

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

* Remove unused import

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
2022-06-28 19:42:58 +02:00
Richard Hudson
a9559e7435
Handle Cyrillic combining diacritics (#10837)
* Handle Russian, Ukrainian and Bulgarian

* Corrections

* Correction

* Correction to comment

* Changes based on review

* Correction

* Reverted irrelevant change in punctuation.py

* Remove unnecessary group

* Reverted accidental change
2022-06-28 15:35:32 +02:00
Zackere
8ffff18ac4
Try cloning repo from main & master (#10843)
* Try cloning repo from main & master

* fixup! Try cloning repo from main & master

* fixup! fixup! Try cloning repo from main & master

* refactor clone and check for repo:branch existence

* spacing fix

* make mypy happy

* type util function

* Update spacy/cli/project/clone.py

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

Co-authored-by: Peter Baumgartner <5107405+pmbaumgartner@users.noreply.github.com>
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
2022-06-28 09:11:15 -04:00
Daniël de Kok
1605ef7319 Merge remote-tracking branch 'upstream/master' into merge-master-v4-20220627-2 2022-06-27 17:45:45 +02:00
github-actions[bot]
4155a59d47
Auto-format code with black (#11022)
Co-authored-by: explosion-bot <explosion-bot@users.noreply.github.com>
2022-06-27 09:35:35 +02:00
Adriane Boyd
738b38064f
Merge pull request #11021 from adrianeboyd/chore/v3.4.0
Set version to v3.4.0
2022-06-24 14:54:16 +02:00
Madeesh Kannan
8f1ba4de58
Backport parser/alignment optimizations from feature/refactor-parser (#10952) 2022-06-24 13:39:52 +02:00
Adriane Boyd
d9320db7db Temporarily skip tests that require models/compat 2022-06-24 11:20:53 +02:00
Adriane Boyd
bffe54d02b Set version to v3.4.0 2022-06-24 08:48:58 +02:00
Sofie Van Landeghem
f8116078ce
disable failing test because Stanford servers are down (#11015) 2022-06-23 10:57:46 +02:00
Sofie Van Landeghem
f00254ae27
add counts to verbose list of NER labels (#10957) 2022-06-20 09:48:40 +02:00
Raphael Mitsch
4c058eb40a
enable argument for spacy.load() (#10784)
* Enable flag on spacy.load: foundation for include, enable arguments.

* Enable flag on spacy.load: fixed tests.

* Enable flag on spacy.load: switched from pretrained model to empty model with added pipes for tests.

* Enable flag on spacy.load: switched to more consistent error on misspecification of component activity. Test refactoring. Added  to default config.

* Enable flag on spacy.load: added support for fields not in pipeline.

* Enable flag on spacy.load: removed serialization fields from supported fields.

* Enable flag on spacy.load: removed 'enable' from config again.

* Enable flag on spacy.load: relaxed checks in _resolve_component_activation_status() to allow non-standard pipes.

* Enable flag on spacy.load: fixed relaxed checks for _resolve_component_activation_status() to allow non-standard pipes. Extended tests.

* Enable flag on spacy.load: comments w.r.t. resolution workarounds.

* Enable flag on spacy.load: remove include fields. Update website docs.

* Enable flag on spacy.load: updates w.r.t. changes in master.

* Implement Doc.from_json(): update docstrings.

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

* Implement Doc.from_json(): remove newline.

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

* Implement Doc.from_json(): change error message for E1038.

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

* Enable flag on spacy.load: wrapped docstring for _resolve_component_status() at 80 chars.

* Enable flag on spacy.load: changed exmples for enable flag.

* Remove newline.

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Fix docstring for Language._resolve_component_status().

* Rename E1038 to E1042.

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
2022-06-17 20:24:13 +01:00
Sofie Van Landeghem
eaeca5eb6a
account for NER labels with a hyphen in the name (#10960)
* account for NER labels with a hyphen in the name

* cleanup

* fix docstring

* add return type to helper method

* shorter method and few more occurrences

* user helper method across repo

* fix circular import

* partial revert to avoid circular import
2022-06-17 20:02:37 +01:00
github-actions[bot]
6313787fb6
Auto-format code with black (#10977)
Co-authored-by: explosion-bot <explosion-bot@users.noreply.github.com>
2022-06-17 19:41:55 +01:00
Raphael Mitsch
d50668dbf0
Made _initialize_X() methods private. (#10978) 2022-06-17 15:55:34 +02:00
Raphael Mitsch
a7f6bc5dfb
Workaround for Typer optional default values with Python calls (#10788)
* Workaround for Typer optional default values with Python calls: added test and workaround.

* @rmitsch Workaround for Typer optional default values with Python calls: reverting some black formatting changes.

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* @rmitsch Workaround for Typer optional default values with Python calls: removing return type hint.

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Workaround for Typer optional default values with Python calls: fixed imports, added GitHub issue marker.

* Workaround for Typer optional default values with Python calls: removed forcing of default values for optional arguments in init_config_cli(). Added default values for init_config(). Synchronized default values for init_config_cli() and init_config().

* Workaround for Typer optional default values with Python calls: removed unused import.

* Workaround for Typer optional default values with Python calls: fixed usage of optimize in init_config_cli().

* Workaround for Typer optional default values with Pythhon calls: remove output_file from InitDefaultValues.

* Workaround for Typer optional default values with Python calls: rename class for default init values.

* Workaround for Typer optional default values with Python calls: remove newline.

* remove introduced newlines

* Remove test_init_config_from_python_without_optional_args().

* remove leftover import

* reformat import

* remove duplicate

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
2022-06-17 12:15:36 +02:00
Daniël de Kok
3d3fbeda9f
Update for CBlas changes in Thinc 8.1.0.dev2 (#10970) 2022-06-16 11:42:34 +02:00
Daniël de Kok
0d352c46ed
vectors: remove use of float as row number (#10955)
The float -1 was returned rather than the integer -1 as the row for
unknown keys. This doesn't introduce a realy bug, since such floats
cast (without issues) to int in the conversion to NumPy arrays. Still,
it's nice to to do the correct thing :).
2022-06-15 15:32:02 +02:00
Madeesh Kannan
126d1db123
Add failing test: test_matcher_extension_in_set_predicate (#10948) 2022-06-13 10:56:45 +02:00
Daniël de Kok
a83a501195
precomputable_biaffine: avoid concatenation (#10911)
The `forward` of `precomputable_biaffine` performs matrix multiplication
and then `vstack`s the result with padding. This creates a temporary
array used for the output of matrix concatenation.

This change avoids the temporary by pre-allocating an array that is
large enough for the output of matrix multiplication plus padding and
fills the array in-place.

This gave me a small speedup (a bit over 100 WPS) on de_core_news_lg on
M1 Max (after changing thinc-apple-ops to support in-place gemm as BLIS
does).
2022-06-10 18:12:28 +02:00
github-actions[bot]
97e8a5041b
Auto-format code with black (#10945)
Co-authored-by: explosion-bot <explosion-bot@users.noreply.github.com>
2022-06-10 13:21:33 +02:00
Daniël de Kok
2f05c6824c Merge remote-tracking branch 'upstream/master' into merge-master-v4-20220609 2022-06-09 10:18:25 +02:00
kadarakos
1bb87f35bc
Detect cycle during projectivize (#10877)
* detect cycle during projectivize

* not complete test to detect cycle in projectivize

* boolean to int type to propagate error

* use unordered_set instead of set

* moved error message to errors

* removed cycle from test case

* use find instead of count

* cycle check: only perform one lookup

* Return bool again from _has_head_as_ancestor

Communicate presence of cycles through an output argument.

* Switch to returning std::pair to encode presence of a cycle

The has_cycle pointer is too easy to misuse. Ideally, we would have a
sum type like Rust's `Result` here, but C++ is not there yet.

* _is_non_proj_arc: clarify what we are returning

* _has_head_as_ancestor: remove count

We are now explicitly checking for cycles, so the algorithm must always
terminate. Either we encounter the head, we find a root, or a cycle.

* _is_nonproj_arc: simplify condition

* Another refactor using C++ exceptions

* Remove unused error code

* Print graph with cycle on exception

* Include .hh files in source package

* Add FIXME comment

* cycle detection test

* find cycle when starting from problematic vertex

Co-authored-by: Daniël de Kok <me@danieldk.eu>
2022-06-08 19:34:11 +02:00
github-actions[bot]
24aafdffad
Auto-format code with black (#10908)
Co-authored-by: explosion-bot <explosion-bot@users.noreply.github.com>
2022-06-03 11:01:55 +02:00
Adriane Boyd
727ce6d1f5
Remove English exceptions with mismatched features (#10873)
Remove English contraction exceptions with mismatched features that lead
to exceptions like "theses" and "thisre".
2022-06-03 09:44:04 +02:00
Madeesh Kannan
41389ffe1e
Avoid pickling Doc inputs passed to Language.pipe() (#10864)
* `Language.pipe()`: Serialize `Doc` objects to bytes when using multiprocessing to avoid pickling overhead

* `Doc.to_dict()`: Serialize `_context` attribute (keeping in line with `(un)pickle_doc()`

* Correct type annotations

* Fix typo

* `Doc`: Do not serialize `_context`

* `Language.pipe`: Send context objects to child processes, Simplify `as_tuples` handling

* Fix type annotation

* `Language.pipe`: Simplify `as_tuple` multiprocessor handling

* Cleanup code, fix typos

* MyPy fixes

* Move doc preparation function into `_multiprocessing_pipe`
Whitespace changes

* Remove superfluous comma

* Rename `prepare_doc` to `prepare_input`

* Update spacy/errors.py

* Undo renaming for error

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
2022-06-02 20:06:49 +02:00
single-fingal
6c6b8da7cc
Fix: De/Serialize SpanGroups including the SpanGroup keys (#10707)
* fix: De/Serialize `SpanGroups` including the SpanGroup keys

This prevents the loss of `SpanGroup`s that have the same .name as other `SpanGroup`s within the same `SpanGroups` object (upon de/serialization of the `SpanGroups`).

Fixes #10685

* Maintain backwards compatibility for serialized `SpanGroups`

(serialized as: a list of `SpanGroup`s, or b'')

* Add tests for `SpanGroups` deserialization backwards-compatibility

* Move a `SpanGroups` de/serialization test (test_issue10685)
  to tests/serialize/test_serialize_spangroups.py

* Output a warning if deserializing a `SpanGroups` with duplicate .name-d `SpanGroup`s

* Minor refactor

* `SpanGroups.from_bytes` handles only `list` and `dict` types with
`dict` as the expected default
* For lists, keep first rather than last value encountered
* Update error message
* Rename and update tests

* Update to preserve list serialization of SpanGroups

To avoid breaking compatibility of serialized `Doc` and `DocBin` with
earlier versions of spacy v3, revert back to a list-only serialization,
but update the names just for serialization so that the SpanGroups keys
override the SpanGroup names.

* Preserve object identity and current key overwrite

* Preserve SpanGroup object identity
* Preserve last rather than first span group from SpanGroup list
  format without SpanGroups keys

* Update inline comments

* Fix types

* Add type info for SpanGroup.copy

* Deserialize `SpanGroup`s as copies

when a single SpanGroup is the value for more than 1 `SpanGroups` key.
This is because we serialize `SpanGroups` as dicts (to maintain backward-
and forward-compatibility) and we can't assume `SpanGroup`s with the same
bytes/serialization were the same (identical) object, pre-serialization.

* Update spacy/tokens/_dict_proxies.py

* Add more SpanGroups serialization tests

Test that serialized SpanGroups maintain their Span order

* small clarification on older spaCy version

* Update spacy/tests/serialize/test_serialize_span_groups.py

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
2022-06-02 15:56:27 +02:00
Adriane Boyd
7e13652d36
Fix schemas import in Doc (#10898) 2022-06-02 15:53:03 +02:00
Raphael Mitsch
8387ce4c01
Add Doc.from_json() (#10688)
* Implement Doc.from_json: rough draft.

* Implement Doc.from_json: first draft with tests.

* Implement Doc.from_json: added documentation on website for Doc.to_json(), Doc.from_json().

* Implement Doc.from_json: formatting changes.

* Implement Doc.to_json(): reverting unrelated formatting changes.

* Implement Doc.to_json(): fixing entity and span conversion. Moving fixture and doc <-> json conversion tests into single file.

* Implement Doc.from_json(): replaced entity/span converters with doc.char_span() calls.

* Implement Doc.from_json(): handling sentence boundaries in spans.

* Implementing Doc.from_json(): added parser-free sentence boundaries transfer.

* Implementing Doc.from_json(): added parser-free sentence boundaries transfer.

* Implementing Doc.from_json(): incorporated various PR feedback.

* Renaming fixture for document without dependencies.

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

* Implementing Doc.from_json(): using two sent_starts instead of one.

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

* Implementing Doc.from_json(): doc_without_dependency_parser() -> doc_without_deps.

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

* Implementing Doc.from_json(): incorporating various PR feedback. Rebased on latest master.

* Implementing Doc.from_json(): refactored Doc.from_json() to work with annotation IDs instead of their string representations.

* Implement Doc.from_json(): reverting unwanted formatting/rebasing changes.

* Implement Doc.from_json(): added check for char_span() calculation for entities.

* Update spacy/tokens/doc.pyx

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

* Implement Doc.from_json(): minor refactoring, additional check for token attribute consistency with corresponding test.

* Implement Doc.from_json(): removed redundancy in annotation type key naming.

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

* Implement Doc.from_json(): Simplifying setting annotation values.

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

* Implement doc.from_json(): renaming annot_types to token_attrs.

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

* Implement Doc.from_json(): adjustments for renaming of annot_types to token_attrs.

* Implement Doc.from_json(): removing default categories.

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

* Implement Doc.from_json(): simplifying lexeme initialization.

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

* Implement Doc.from_json(): simplifying lexeme initialization.

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

* Implement Doc.from_json(): refactoring to only have keys for present annotations.

* Implement Doc.from_json(): fix check for tokens' HEAD attributes.

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

* Implement Doc.from_json(): refactoring Doc.from_json().

* Implement Doc.from_json(): fixing span_group retrieval.

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

* Implement Doc.from_json(): fixing span retrieval.

* Implement Doc.from_json(): added schema for Doc JSON format. Minor refactoring in Doc.from_json().

* Implement Doc.from_json(): added comment regarding Token and Span extension support.

* Implement Doc.from_json(): renaming inconsistent_props to partial_attrs..

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

* Implement Doc.from_json(): adjusting error message.

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

* Implement Doc.from_json(): extending E1038 message.

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

* Implement Doc.from_json(): added params to E1038 raises.

* Implement Doc.from_json(): combined attribute collection with partial attributes check.

* Implement Doc.from_json(): added optional schema validation.

* Implement Doc.from_json(): fixed optional fields in schema, tests.

* Implement Doc.from_json(): removed redundant None check for DEP.

* Implement Doc.from_json(): added passing of schema validatoin message to E1037..

* Implement Doc.from_json(): removing redundant error E1040.

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

* Implement Doc.from_json(): changing message for E1037.

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

* Implement Doc.from_json(): adjusted website docs and docstring of Doc.from_json().

* Update spacy/tests/doc/test_json_doc_conversion.py

* Implement Doc.from_json(): docstring update.

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

* Implement Doc.from_json(): docstring update.

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

* Implement Doc.from_json(): website docs update.

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

* Implement Doc.from_json(): docstring formatting.

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

* Implement Doc.from_json(): docstring formatting.

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

* Implement Doc.from_json(): fixing Doc reference in website docs.

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

* Implement Doc.from_json(): reformatted website/docs/api/doc.md.

* Implement Doc.from_json(): bumped IDs of new errors to avoid merge conflicts.

* Implement Doc.from_json(): fixing bug in tests.

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

* Implement Doc.from_json(): fix setting of sentence starts for docs without DEP.

* Implement Doc.from_json(): add check for valid char spans when manually setting sentence boundaries. Refactor sentence boundary setting slightly. Move error message for lack of support for partial token annotations to errors.py.

* Implement Doc.from_json(): simplify token sentence start manipulation.

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

* Combine related error messages

* Update spacy/tests/doc/test_json_doc_conversion.py

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
2022-06-02 14:03:47 +02:00
Adriane Boyd
a322d6d5f2
Add SpanRuler component (#9880)
* Add SpanRuler component

Add a `SpanRuler` component similar to `EntityRuler` that saves a list
of matched spans to `Doc.spans[spans_key]`. The matches from the token
and phrase matchers are deduplicated and sorted before assignment but
are not otherwise filtered.

* Update spacy/pipeline/span_ruler.py

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Fix cast

* Add self.key property

* Use number of patterns as length

* Remove patterns kwarg from init

* Update spacy/tests/pipeline/test_span_ruler.py

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Add options for spans filter and setting to ents

* Add `spans_filter` option as a registered function'
* Make `spans_key` optional and if `None`, set to `doc.ents` instead of
`doc.spans[spans_key]`.

* Update and generalize tests

* Add test for setting doc.ents, fix key property type

* Fix typing

* Allow independent doc.spans and doc.ents

* If `spans_key` is set, set `doc.spans` with `spans_filter`.
* If `annotate_ents` is set, set `doc.ents` with `ents_fitler`.
  * Use `util.filter_spans` by default as `ents_filter`.
  * Use a custom warning if the filter does not work for `doc.ents`.

* Enable use of SpanC.id in Span

* Support id in SpanRuler as Span.id

* Update types

* `id` can only be provided as string (already by `PatternType`
definition)

* Update all uses of Span.id/ent_id in Doc

* Rename Span id kwarg to span_id

* Update types and docs

* Add ents filter to mimic EntityRuler overwrite_ents

* Refactor `ents_filter` to take `entities, spans` args for more
  filtering options
* Give registered filters more descriptive names
* Allow registered `filter_spans` filter
  (`spacy.first_longest_spans_filter.v1`) to take any number of
  `Iterable[Span]` objects as args so it can be used for spans filter
  or ents filter

* Implement future entity ruler as span ruler

Implement a compatible `entity_ruler` as `future_entity_ruler` using
`SpanRuler` as the underlying component:
* Add `sort_key` and `sort_reverse` to allow the sorting behavior to be
  customized. (Necessary for the same sorting/filtering as in
  `EntityRuler`.)
* Implement `overwrite_overlapping_ents_filter` and
  `preserve_existing_ents_filter` to support
  `EntityRuler.overwrite_ents` settings.
* Add `remove_by_id` to support `EntityRuler.remove` functionality.
* Refactor `entity_ruler` tests to parametrize all tests to test both
  `entity_ruler` and `future_entity_ruler`
* Implement `SpanRuler.token_patterns` and `SpanRuler.phrase_patterns`
  properties.

Additional changes:

* Move all config settings to top-level attributes to avoid duplicating
  settings in the config vs. `span_ruler/cfg`. (Also avoids a lot of
  casting.)

* Format

* Fix filter make method name

* Refactor to use same error for removing by label or ID

* Also provide existing spans to spans filter

* Support ids property

* Remove token_patterns and phrase_patterns

* Update docstrings

* Add span ruler docs

* Fix types

* Apply suggestions from code review

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Move sorting into filters

* Check for all tokens in seen tokens in entity ruler filters

* Remove registered sort key

* Set Token.ent_id in a backwards-compatible way in Doc.set_ents

* Remove sort options from API docs

* Update docstrings

* Rename entity ruler filters

* Fix and parameterize scoring

* Add id to Span API docs

* Fix typo in API docs

* Include explicit labeled=True for scorer

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
2022-06-02 13:12:53 +02:00
Sofie Van Landeghem
f7507c2327
fix typo + CI slow testing (#10835)
* fix typo

* one more typo
2022-06-02 00:10:16 +02:00
Paul O'Leary McCann
dca2e8c644
Minor NEL type fixes (#10860)
* Fix TODO about typing

Fix was simple: just request an array2f.

* Add type ignore

Maxout has a more restrictive type than the residual layer expects (only
Floats2d vs any Floats).

* Various cleanup

This moves a lot of lines around but doesn't change any functionality.
Details:

1. use `continue` to reduce indentation
2. move sentence doc building inside conditional since it's otherwise
   unused
3. reduces some temporary assignments
2022-06-01 00:41:28 +02:00
Daniël de Kok
09a5f03dd7
Merge pull request #10849 from danieldk/simplify-gpu-check
Simplify GPU check
2022-05-30 16:35:10 +02:00
Daniël de Kok
85dd2b6c04
Parser: use C saxpy/sgemm provided by the Ops implementation (#10773)
* Parser: use C saxpy/sgemm provided by the Ops implementation

This is a backport of https://github.com/explosion/spaCy/pull/10747
from the parser refactor branch. It eliminates the explicit calls
to BLIS, instead using the saxpy/sgemm provided by the Ops
implementation.

This allows us to use Accelerate in the parser on M1 Macs (with
an updated thinc-apple-ops).

Performance of the de_core_news_lg pipe:

BLIS 0.7.0, no thinc-apple-ops:  6385 WPS
BLIS 0.7.0, thinc-apple-ops:    36455 WPS
BLIS 0.9.0, no thinc-apple-ops: 19188 WPS
BLIS 0.9.0, thinc-apple-ops:    36682 WPS
This PR, thinc-apple-ops:       38726 WPS

Performance of the de_core_news_lg pipe (only tok2vec -> parser):

BLIS 0.7.0, no thinc-apple-ops: 13907 WPS
BLIS 0.7.0, thinc-apple-ops:    73172 WPS
BLIS 0.9.0, no thinc-apple-ops: 41576 WPS
BLIS 0.9.0, thinc-apple-ops:    72569 WPS
This PR, thinc-apple-ops:       87061 WPS

* Require thinc >=8.1.0,<8.2.0

* Lower thinc lowerbound to 8.1.0.dev0

* Use best CPU ops for CBLAS when the parser model is on the GPU

* Fix another unguarded cblas() call

* Fix: use ops as a shorthand for self.model.ops

Co-authored-by: Madeesh Kannan <shadeMe@users.noreply.github.com>

Co-authored-by: Madeesh Kannan <shadeMe@users.noreply.github.com>
2022-05-27 11:20:52 +02:00
github-actions[bot]
6172af8158
Auto-format code with black (#10857)
Co-authored-by: explosion-bot <explosion-bot@users.noreply.github.com>
2022-05-27 10:54:54 +02:00
Daniël de Kok
7c6a97559d Simplify GPU check
This change removes `thinc.util.has_cupy` from the GPU presence check.
Currently `gpu_is_available` already implies `has_cupy`.  We also want
to show this warning in the future when a machine has a non-CuPy GPU.
2022-05-25 14:06:45 +02:00
kadarakos
f6a4b80c0b
Better errors for has_annotation and Matcher (#10830)
* Show input argument instead of None

* catch invalid attr early

* moved error message from code to errors.py

* Update spacy/errors.py

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

* Update spacy/errors.py

* update E153 and E154

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
2022-05-25 11:12:29 +02:00
Richard Hudson
32954c3bcb
Fix issues for Mypy 0.950 and Pydantic 1.9.0 (#10786)
* Make changes to typing

* Correction

* Format with black

* Corrections based on review

* Bumped Thinc dependency version

* Bumped blis requirement

* Correction for older Python versions

* Update spacy/ml/models/textcat.py

Co-authored-by: Daniël de Kok <me@github.danieldk.eu>

* Corrections based on review feedback

* Readd deleted docstring line

Co-authored-by: Daniël de Kok <me@github.danieldk.eu>
2022-05-25 09:33:54 +02:00
Paul O'Leary McCann
6be09bbd07
Fix Entity Linker with tokenization mismatches (fix #9575) (#10457)
* Add failing test

* Partial fix for issue

This kind of works. The issue with token length mismatches is gone. The
problem is that when you get empty lists of encodings to compare, it
fails because the sizes are not the same, even though they're both zero:
(0, 3) vs (0,). Not sure why that happens...

* Short circuit on empties

* Remove spurious check

The check here isn't needed now the the short circuit is fixed.

* Update spacy/tests/pipeline/test_entity_linker.py

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Use "eg", not "example"

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
2022-05-23 20:42:26 +02:00
Lj Miranda
1d34aa2b3d
Add spacy-span-analyzer to debug data (#10668)
* Rename to spans_key for consistency

* Implement spans length in debug data

* Implement how span bounds and spans are obtained

In this commit, I implemented how span boundaries (the tokens) around a
given span and spans are obtained. I've put them in the compile_gold()
function so that it's accessible later on. I will do the actual
computation of the span and boundary distinctiveness in the main
function above.

* Compute for p_spans and p_bounds

* Add computation for SD and BD

* Fix mypy issues

* Add weighted average computation

* Fix compile_gold conditional logic

* Add test for frequency distribution computation

* Add tests for kl-divergence computation

* Fix weighted average computation

* Make tables more compact by rounding them

* Add more descriptive checks for spans

* Modularize span computation methods

In this commit, I added the _get_span_characteristics and
_print_span_characteristics functions so that they can be reusable
anywhere.

* Remove unnecessary arguments and make fxs more compact

* Update a few parameter arguments

* Add tests for print_span and get_span methods

* Update API to talk about span characteristics in brief

* Add better reporting of spans_length

* Add test for span length reporting

* Update formatting of span length report

Removed '' to indicate that it's not a string, then
sort the n-grams by their length, not by their frequency.

* Apply suggestions from code review

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

* Show all frequency distribution when -V

In this commit, I displayed the full frequency distribution of the
span lengths when --verbose is passed. To make things simpler, I
rewrote some of the formatter functions so that I can call them
whenever.

Another notable change is that instead of showing percentages as
Integers, I showed them as floats (max 2-decimal places). I did this
because it looks weird when it displays (0%).

* Update logic on how total is computed

The way the 90% thresholding is computed now is that we keep
adding the percentages until we reach >= 90%. I also updated the wording
and used the term "At least" to denote that >= 90% of your spans have
these distributions.

* Fix display when showing the threshold percentage

* Apply suggestions from code review

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

* Add better phrasing for span information

* Update spacy/cli/debug_data.py

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

* Add minor edits for whitespaces etc.

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
2022-05-23 19:06:38 +02:00
Paul O'Leary McCann
46982cf694
Add glossary entry for root (#10821)
* Add glossary entry for root

There was already one but it was lower case, maybe that should be
removed?

* remove lowercase root

On reflection, that was probably just a mistake.

* Add lowercase root back

It's harmless to leave it there.
2022-05-20 09:56:32 +02:00
Daniël de Kok
5586fd9311 Merge remote-tracking branch 'upstream/master' into v4-merge-master-20220518 2022-05-18 11:34:54 +02:00
Raphael Mitsch
357be2614e
Fuzz tokenizer.explain: draft for fuzzy tests. (#10771)
* Fuzz tokenizer.explain: draft for fuzzy tests.

* Fuzz tokenizer.explain: xignoring tokenizer.explain() tests. Removed deadline modification. Removed LANGUAGES_WITHOUT_TOKENIZERS.

* Fuzz tokenizer.explain: changed tokenizer initialization to avoid failus in Azure runs.

* Fuzz tokenizer.explain: type hint for tokenizer in test.

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
2022-05-17 10:23:16 +02:00
github-actions[bot]
99aeaf9bd3
Auto-format code with black (#10795)
Co-authored-by: explosion-bot <explosion-bot@users.noreply.github.com>
2022-05-13 19:02:08 +02:00
kadarakos
fd36469900
bugfix parser labels (#10797) 2022-05-13 11:41:32 +02:00
Patrick Düggelin
cb06309ed8
Fix PhraseMatcher remove overlapping terms (#10734)
* Add regression test for issue 10643

* Improve overlapping terms testcase

* Fix removing overlapping terms in phrase matcher (#10643)
2022-05-12 12:23:52 +02:00