16081 Commits

Author SHA1 Message Date
Adriane Boyd
0f9d2b01fb
Set version v3.6.0.dev1 (#12703) 2023-06-07 16:23:14 +02:00
kadarakos
c003aac29a
SpanFinder into spaCy from experimental (#12507)
* span finder integrated into spacy from experimental

* black

* isort

* black

* default spankey constant

* black

* Update spacy/pipeline/spancat.py

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

* rename

* rename

* max_length and min_length as Optional[int] and strict checking

* black

* mypy fix for integer type infinity

* revert line order

* implement all comparison operators for inf int

* avoid two for loops over all docs by not precomputing

* interleave thresholding with span creation

* black

* revert to not interleaving (realized it's faster)

* black

* Update spacy/errors.py

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

* update docstring

* enforce that the gold and predicted documents have the same text

* new error for ensuring reference and predicted texts are the same

* remove todo

* adjust test

* black

* handle misaligned tokenization

* return correct variable

* failing overfit test

* only use a single spans_key like in spancat

* black

* remove debug lines

* typo

* remove comment

* remove near-duplicate redundant method

* use the 'spans_key' variable name everywhere

* Update spacy/pipeline/span_finder.py

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

* flaky test fix suggestion, hand set bias terms

* only test suggester and test result exhaustively

* make it clear that the span_finder_suggester is more general (not specific to span_finder)

* Update spacy/tests/pipeline/test_span_finder.py

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

* Apply suggestions from code review

* remove question comment

* move preset_spans_suggester test to spancat tests

* Add docs and unify default configs for spancat and span finder

* Add `allow_overlap=True` to span finder scorer

* Fix offset bug in set_annotations

* Ignore labels in span finder scorer

* Format

* Add span_finder to quickstart template

* Move settings to self.cfg, store min/max unset as None

* Remove debugging

* Update docstrings and docs

* Update spacy/pipeline/span_finder.py

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Fix imports

---------

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
2023-06-07 15:52:28 +02:00
Basile Dura
c3c064ace4
fix: InitializableComponent type hints (#12692)
* fix: InitializableComponent type hints

* fix: avoid circular dependency

* style: clean imports in language.py

* style: use relative imports

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

* fix: apply black

---------

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
2023-06-02 14:29:52 +02:00
Adriane Boyd
c4112a1da3
Require that all SpanGroup spans are from the current doc (#12569)
* Require that all SpanGroup spans are from the current doc

The restriction on only adding spans from the current doc was already
implemented for all operations except for `SpanGroup.__init__`.

Initialize copied spans for `SpanGroup.copy` with `Doc.char_span` in
order to validate the character offsets and to make it possible to copy
spans between documents with differing tokenization. Currently there is
no validation that the document texts are identical, but the span char
offsets must be valid spans in the target doc, which prevents you from
ending up with completely invalid spans.

* Undo change in test_beam_overfitting_IO
2023-06-01 19:19:17 +02:00
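The validation step described in this commit can be pictured with a small stand-alone sketch. It does not use spaCy: the `char_span` function below is a hypothetical stand-in that only mimics the idea of mapping character offsets onto a target document's token boundaries, and the token lists are made-up data.

```python
from typing import List, Optional, Tuple

Token = Tuple[int, int]  # (start_char, end_char) of one token


def char_span(
    tokens: List[Token], start_char: int, end_char: int
) -> Optional[Tuple[int, int]]:
    """Return (start_token, end_token) if both offsets fall on token
    boundaries of the target tokenization, otherwise None."""
    starts = {tok[0]: i for i, tok in enumerate(tokens)}
    ends = {tok[1]: i + 1 for i, tok in enumerate(tokens)}
    if start_char in starts and end_char in ends:
        if starts[start_char] < ends[end_char]:
            return starts[start_char], ends[end_char]
    return None


# The text "New York City" under two different (hypothetical) tokenizations.
doc_a = [(0, 3), (4, 8), (9, 13)]  # New | York | City
doc_b = [(0, 8), (9, 13)]          # New York | City

# A span covering the whole text is valid in both documents.
whole = char_span(doc_b, 0, 13)
# "York City" starts inside a token of doc_b, so it cannot be copied there.
partial = char_span(doc_b, 4, 13)
```

In this sketch a span that does not line up with the target tokenization is reported as `None`; how the real `SpanGroup.copy` reports such a span is not shown in the commit message.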
Isabel Zimmerman
05df59fd4a
[DOCS] add vetiver to spacy universe (#12557)
* add vetiver to spacy universe

* remove image

* update logo to render correctly in thumbnail

* apply Basile's suggestion

Co-authored-by: Basile Dura <bdura@users.noreply.github.com>

* refer to the same model

---------

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
Co-authored-by: Basile Dura <bdura@users.noreply.github.com>
2023-06-01 17:11:18 +02:00
Adriane Boyd
c936db2faf
Address numpy 1.25 deprecations in test suite (#12684)
* Address upcoming numpy v1.25 deprecations in test suite

* Temporarily test most recent numpy prerelease in CI

* Revert "Temporarily test most recent numpy prerelease in CI"

This reverts commit d75a66e55e.
2023-05-31 17:23:07 +02:00
Adriane Boyd
9b7a59c325
Revert "CI: Disable fail-fast (#12658)" (#12676)
This reverts commit 1f088cbf4a.
2023-05-26 10:57:02 +02:00
Vinit Ravishankar
f0e0206b77
update universe for spacypdfreader (#12661) 2023-05-23 13:28:48 +02:00
Adriane Boyd
1f088cbf4a
CI: Disable fail-fast (#12658)
While the typing_extensions/pydantic `Literal` bugs are being sorted
out, disable fail-fast so the rest of the CI is available for
development purposes.
2023-05-23 10:48:06 +02:00
Basile Dura
6ea4155487
feat: add comparison operators in span.pyi (#12652)
* feat: add comparison operators in span.pyi

remove Cython-specific `__richcmp__`

* fix: comparison operators should be defined for any other object
2023-05-23 08:50:37 +02:00
Victoria
6930a6bf45
Add spaCy VSCode extension materials (#12592) 2023-05-19 14:38:53 +02:00
Basile Dura
95fd46b1dd
feat: add type hinting on SpanGroup.__iter__ (#12642) 2023-05-17 14:20:00 +02:00
Adriane Boyd
df083f91a5
Add Malay to website languages (#12643) 2023-05-17 13:13:43 +02:00
Sani
873c16a4df
Malay language support (#12602)
* add malay lang

* fix token len

* black format

* reformat conftest malay

* remove exceptions that do not exist in dbp

* format code
2023-05-17 12:45:21 +02:00
Lj Miranda
58779c24ef
Remove shorthand for output-file in spacy apply (#12636)
The output-file argument is positional, so can't use a shorthand like -o.
2023-05-17 12:36:29 +02:00
David Berenstein
83b6f488cb
universe: Update examples Adept Augmentation (#12620)
* Update universe.json

* chore: changed readme example as suggested by Vincent Warmerdam (koaning)
2023-05-15 14:09:33 +02:00
Adriane Boyd
3dc445df8d
Fix new tags in docs for v3.5.x (#12629)
* Fix new tags in docs for v3.5.x

* Fix new tag
2023-05-15 12:06:58 +02:00
Basile Dura
2dd8825f09
docs: add comment on offset_x argument (#12630) 2023-05-15 11:42:47 +02:00
Basile Dura
f96b9e03df
build: bump typer version to accept >=0.3<0.10 (#12631) 2023-05-15 08:06:58 +02:00
Adriane Boyd
3637148c4d
Add scorer option to return per-component scores (#12540)
* Add scorer option to return per-component scores

Add `per_component` option to `Language.evaluate` and `Scorer.score` to
return scores keyed by `tokenizer` (hard-coded) or by component name.

Add option to `evaluate` CLI to score by component. Per-component scores
can only be saved to JSON.

* Update help text and messages
2023-05-12 15:36:54 +02:00
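The difference between flat scores and per-component scores can be sketched without spaCy. The `score` function and the scorer callables below are hypothetical stand-ins, not the `Scorer.score` implementation. They only show how the same results might be merged into one dict or keyed by component name.

```python
from typing import Any, Callable, Dict, List

ScoreFn = Callable[[List[Any]], Dict[str, float]]


def score(
    examples: List[Any],
    scorers: Dict[str, ScoreFn],
    per_component: bool = False,
) -> Dict[str, Any]:
    """Run each scorer and either merge the results into one flat dict
    or keep them grouped under the name of the component."""
    scores: Dict[str, Any] = {}
    for name, scorer in scorers.items():
        result = scorer(examples)
        if per_component:
            scores[name] = result
        else:
            scores.update(result)
    return scores


# Made-up scorers with fixed results, for illustration only.
scorers = {
    "tokenizer": lambda examples: {"token_acc": 1.0},
    "ner": lambda examples: {"ents_f": 0.5},
}

flat = score([], scorers)
grouped = score([], scorers, per_component=True)
```

Here `flat` holds all metrics at the top level, while `grouped` nests them one level down under `tokenizer` and `ner`.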
Kenneth Enevoldsen
88680a6eed
docs: remove invalid huggingface-hub push argument (#12624) 2023-05-12 09:40:28 +02:00
Adriane Boyd
b5af0fe836
Revert "Use Latin normalization for Serbian attrs (#12608)" (#12621)
This reverts commit 6f314f99c4.

We are reverting this until we can support this normalization more
consistently across vectors, training corpora, and lemmatizer data.
2023-05-11 11:54:16 +02:00
royashcenazi
3252f6b13f
Parsigs universe 3 (#12617)
* parsigs universe

* added model installation explanation in the description

* Update website/meta/universe.json

Co-authored-by: Basile Dura <bdura@users.noreply.github.com>

* added model installation instructions in the code example

* added biomedical category

---------

Co-authored-by: Basile Dura <bdura@users.noreply.github.com>
2023-05-10 13:49:51 +02:00
royashcenazi
a56ab98e3c
parsigs universe (#12616)
* parsigs universe

* added model installation explanation in the description

* Update website/meta/universe.json

Co-authored-by: Basile Dura <bdura@users.noreply.github.com>

* added model installation instructions in the code example

---------

Co-authored-by: Basile Dura <bdura@users.noreply.github.com>
2023-05-10 13:19:28 +02:00
David Berenstein
d11b549195
chore: added adept-augmentations to the spacy universe (#12609)
* chore: added adept-augmentations to the spacy universe

* Apply suggestions from code review

Co-authored-by: Basile Dura <bdura@users.noreply.github.com>

* Update universe.json

---------

Co-authored-by: Basile Dura <bdura@users.noreply.github.com>
2023-05-10 13:16:16 +02:00
Patrick J. Burns
15f16db6ca
Fix typo (#12615) 2023-05-09 15:52:34 +02:00
Patrick J. Burns
eb3960a15a
Add LatinCy models to universe.json (#12597)
* Add LatinCy models to universe.json

* Update website/meta/universe.json

Add install code for LatinCy models to 'code_example'

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

* Update LatinCy ‘code_example’ in website/meta/universe.json

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

---------

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
2023-05-09 12:02:45 +02:00
Adriane Boyd
1279b464bb
In initialize only calculate current vectors hash if needed (#12607) 2023-05-08 16:51:58 +02:00
Adriane Boyd
6f314f99c4
Use Latin normalization for Serbian attrs (#12608)
* Use Latin normalization for Serbian attrs

Use Latin normalization for Serbian `NORM`, `PREFIX`, and `SUFFIX`.

* Update NORMs in tokenizer exceptions and related tests

* Add tests for all custom lex attrs

* Remove unused imports
2023-05-08 12:33:56 +02:00
Adriane Boyd
cbc6bcf434
Merge pull request #12604 from adrianeboyd/chore/v3.6.0.dev0
Set version to v3.6.0.dev0
2023-05-08 10:05:15 +02:00
Adriane Boyd
46ce66021a Temporarily skip download CLI related tests in CI 2023-05-08 09:17:33 +02:00
Adriane Boyd
fbd12eb4a4 Set version to v3.6.0.dev0 2023-05-08 09:10:35 +02:00
Adriane Boyd
dbc71ecd44
Remove #egg from download URLs (#12567)
The current URLs will become invalid in pip 25.0. According to the pip
docs, the egg= URLs are currently only needed for editable VCS installs.
2023-05-04 17:13:12 +02:00
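As a rough illustration of what dropping the `#egg` part from a URL means, the sketch below strips an `egg=` fragment with the standard library. The helper name and the example URL are hypothetical, and this is not the code changed in the commit.

```python
from urllib.parse import urldefrag


def strip_egg_fragment(url: str) -> str:
    """Return the URL without its fragment if that fragment is an
    `egg=` marker; leave every other URL unchanged."""
    base, fragment = urldefrag(url)
    if fragment.startswith("egg="):
        return base
    return url


# Hypothetical download URL, for illustration only.
example = "https://example.com/packages/demo_model-1.0.0.tar.gz#egg=demo_model==1.0.0"
cleaned = strip_egg_fragment(example)
```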
Kenneth Enevoldsen
73698326df
Update inmemorylookupkb.mdx (#12586)
Example does not refer to the in memory lookup
2023-05-02 12:51:13 +02:00
Lj Miranda
298e6036b7
Add spans in spacy benchmark (#12575)
* Add spans in spacy benchmark

The current implementation of spaCy benchmark accuracy / spacy evaluate
doesn't include the "spans" type, so calling the command doesn't render
the HTML displaCy file needed.

This PR attempts to fix that by creating a new parameter for "spans"
and calling the appropriate displaCy value.

* Reformat file with black

* Add tests for evaluate

* Fix spans -> span for displacy style

* Update test to check render instead

* Update source so mypy passes

* Add parser information to avoid warnings
2023-04-28 14:32:52 +02:00
Adriane Boyd
6817e3d372
CI: Only run test suite once with thinc-apple-ops for macos python 3.11 (#12436)
* CI: Only run test suite once with thinc-apple-ops for macos python 3.11

* Adjust syntax

* Try alternate syntax

* Try alternate syntax

* Try alternate syntax
2023-04-28 14:29:51 +02:00
kadarakos
34d1164b0e
Spancat speed improvement (#12577)
* avoid nesting then flattening

* mypy fix

* Apply suggestions from code review

* Add type for indices

* Run full matrix for mypy

* Add back modified type: ignore

* Revert "Run full matrix for mypy"

This reverts commit e218873d04.

---------

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
2023-04-27 15:27:13 +02:00
Victoria
a8dfc66135
Add spacy-wasm to universe (#12572)
* add spacy-wasm to universe

* add tag
2023-04-26 14:18:40 +02:00
moxley01
070fa16545
add spacysee project (#12568) 2023-04-25 12:30:19 +02:00
Adriane Boyd
68da580a4c
CI: Disable Azure (#12560) 2023-04-21 15:05:53 +02:00
Daniël de Kok
8a5814bf2c
Add distillation loop (#12542)
* Add distillation initialization and loop

* Fix up configuration keys

* Add docstring

* Type annotations

* init_nlp_distill -> init_nlp_student

* Do not resolve dot name distill corpus in initialization

(Since we don't use it.)

* student: do not request use of optimizer in student pipe

We finish up the updates once in the training loop instead.

Also add the necessary logic to `Language.distill` to mirror
`Language.update`.

* Correctly determine sort key in subdivide_batch

* Fix _distill_loop docstring wrt. stopping condition

* _distill_loop: fix distill_data docstring

Make similar changes in train_while_improving, since it also had
incorrect types and missing type annotations.

* Move `set_{gpu_allocator,seed}_from_config` to spacy.util

* Update Language.update docs for the sgd argument

* Type annotation

Co-authored-by: Madeesh Kannan <shadeMe@users.noreply.github.com>

---------

Co-authored-by: Madeesh Kannan <shadeMe@users.noreply.github.com>
2023-04-21 13:49:40 +02:00
Victoria
e115408514
remove survey link (#12559) 2023-04-21 10:22:26 +02:00
Patrick J. Burns
ab4ba04c32
Update LatinDefaults for lang 'la' (#12538)
* Add noun chunking to la syntax iterators

* Expand list of numeral, ordinal words

* Expand abbreviations in la tokenizer_exceptions

* Add example sents

* Update spacy/lang/la/syntax_iterators.py

Reorganize la syntax iterators

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Minor updates based on review

* fix call

---------

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
2023-04-20 16:55:40 +02:00
Adriane Boyd
b60b027927
Add default option to MorphAnalysis.get (#12545)
* Add default to MorphAnalysis.get

Similar to `dict`, allow a `default` option for `MorphAnalysis.get` for
the user to provide a default return value if the field is not found.
The default return value remains `[]`, which is not the same as
`dict.get`, but is already established as this method's default return
value with the return type `List[str]`. However the new `default` option
does not enforce that the user-provided default is actually `List[str]`.

* Restore test case
2023-04-20 14:06:32 +02:00
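The behavior described here can be sketched with a small stand-in class. `MorphSketch` is hypothetical and is not spaCy's `MorphAnalysis`. It only mirrors the two points from the commit message: the fallback stays `[]` when no default is given, and a user-provided default is returned as-is without type checking.

```python
from typing import Any, Dict, List, Optional


class MorphSketch:
    """Toy container mapping morphological fields to lists of values."""

    def __init__(self, feats: Dict[str, List[str]]):
        self._feats = feats

    def get(self, field: str, default: Optional[Any] = None) -> Any:
        # A known field returns its values.
        if field in self._feats:
            return self._feats[field]
        # Unlike dict.get, a missing field falls back to [] rather than None.
        if default is None:
            return []
        # The user-provided default is not checked against List[str].
        return default


morph = MorphSketch({"Number": ["Sing"], "Case": ["Nom"]})

found = morph.get("Number")
missing = morph.get("Tense")
with_default = morph.get("Tense", ["Pres"])
```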
Adriane Boyd
dc0a1a9808
Load exceptions last in Tokenizer.from_bytes (#12553)
In `Tokenizer.from_bytes`, the exceptions should be loaded last so that
they are only processed once as part of loading the model.

The exceptions are tokenized as phrase matcher patterns in the
background and the internal tokenization needs to be synced with all the
remaining tokenizer settings. If the exceptions are not loaded last,
there are speed regressions for `Tokenizer.from_bytes/disk` vs.
`Tokenizer.add_special_case` as the caches are reloaded more than
necessary during deserialization.
2023-04-20 11:30:34 +02:00
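Why load order matters can be shown with a toy model. `ToyTokenizer` below is a hypothetical stand-in, not spaCy's `Tokenizer`. It assumes that every settings change forces the already-loaded exceptions to be reprocessed, and it counts how often that happens.

```python
from typing import Dict, List


class ToyTokenizer:
    def __init__(self) -> None:
        self.settings: Dict[str, str] = {}
        self.exceptions: Dict[str, List[str]] = {}
        self.reprocess_count = 0

    def _reload_exceptions(self) -> None:
        # Stands in for re-tokenizing all exceptions with current settings.
        if self.exceptions:
            self.reprocess_count += 1

    def set_setting(self, key: str, value: str) -> None:
        self.settings[key] = value
        self._reload_exceptions()

    def set_exceptions(self, exceptions: Dict[str, List[str]]) -> None:
        self.exceptions = dict(exceptions)
        self._reload_exceptions()


def load(order: List[str], data: dict) -> ToyTokenizer:
    """Apply the serialized fields in the given order."""
    tok = ToyTokenizer()
    for key in order:
        if key == "exceptions":
            tok.set_exceptions(data[key])
        else:
            tok.set_setting(key, data[key])
    return tok


# Made-up serialized data, for illustration only.
data = {
    "prefix": "^[(]",
    "suffix": "[)]$",
    "infix": "-",
    "exceptions": {"don't": ["do", "n't"]},
}

exceptions_first = load(["exceptions", "prefix", "suffix", "infix"], data)
exceptions_last = load(["prefix", "suffix", "infix", "exceptions"], data)
```

Under these assumptions, loading the exceptions first reprocesses them after every later setting, while loading them last processes them once.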
Sofie Van Landeghem
8e6a3d58d8
fix typo (#12543) 2023-04-19 10:59:33 +02:00
TAN Long
923d24e885
perf(REL_OP): Replace some token.children with token.rights or token.lefts (#12528)
Co-authored-by: Tan Long <tanloong@foxmail.com>
2023-04-17 13:16:34 +02:00
TAN Long
119f959218
docs(REL_OP): modify docs for REL_OPs to match Semgrex's update on CoreNLP v4.5.2 (#12531)
Co-authored-by: Tan Long <tanloong@foxmail.com>
2023-04-17 13:14:01 +02:00
andyjessen
02259fa195
Add category to spaCy project (#12506)
ScispaCy fits within the biomedical domain. Consider adding this category.
2023-04-07 15:31:04 +02:00
Adriane Boyd
5d0f48fe69
Enforce that Span.start/end(_char) remain valid and in sync (#12268)
* Enforce that Span.start/end(_char) remain valid and in sync

Allowing span attributes to be writable starting in v3 has made it
possible for the internal `Span.start/end/start_char/end_char` to get
out-of-sync or have invalid values.

This checks that the values are valid and syncs the token and char
offsets if any attributes are modified directly. It does not yet handle
the case where the underlying doc is modified.

* Format
2023-04-06 16:01:59 +02:00
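One way to keep token and character offsets from drifting apart is sketched below. `SpanSketch` is a hypothetical stand-in, not spaCy's `Span`. As a simplification it validates writes to `start` and `end` and derives the character offsets from them, whereas the commit describes syncing in both directions. Raising `IndexError` on invalid values is also an assumption of this sketch.

```python
from typing import List, Tuple


class SpanSketch:
    """Toy span whose character offsets are derived from token offsets."""

    def __init__(self, tokens: List[Tuple[int, int]], start: int, end: int):
        self._tokens = tokens  # (start_char, end_char) per token
        self._start = 0
        self._end = 0
        self._set_bounds(start, end)

    def _set_bounds(self, start: int, end: int) -> None:
        # Reject values that would leave the span in an invalid state.
        if not (0 <= start <= end <= len(self._tokens)):
            raise IndexError(f"invalid span bounds: start={start}, end={end}")
        self._start, self._end = start, end

    @property
    def start(self) -> int:
        return self._start

    @start.setter
    def start(self, value: int) -> None:
        self._set_bounds(value, self._end)

    @property
    def end(self) -> int:
        return self._end

    @end.setter
    def end(self, value: int) -> None:
        self._set_bounds(self._start, value)

    @property
    def start_char(self) -> int:
        if self._start < len(self._tokens):
            return self._tokens[self._start][0]
        return self._tokens[-1][1] if self._tokens else 0

    @property
    def end_char(self) -> int:
        if self._end > self._start:
            return self._tokens[self._end - 1][1]
        return self.start_char


# "New York City" as three tokens (made-up data).
span = SpanSketch([(0, 3), (4, 8), (9, 13)], start=0, end=2)
before = (span.start_char, span.end_char)
span.end = 3  # character offsets follow the token offsets
after = (span.start_char, span.end_char)
```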