Commit Graph

3152 Commits

Author SHA1 Message Date
svlandeg
0e3b6a87d6 Merge branch 'upstream_master' into sync_v4 2023-07-19 16:37:31 +02:00
Adriane Boyd
95075298f5
Update pex Makefile defaults (#12832)
* Update pex Makefile defaults

- switch to python 3.8
- only install spacy-lookups-data for extra packages

* Update website for pex defaults
2023-07-18 09:29:04 +02:00
Ian Thompson
ef20e114e0
Typo fix in Language.replace_listeners docs (#12823)
* modified:   spacy/language.py
	- corrected typo in docstring for :method:`Language.replace_listeners`
	- added noqa comment on unused local variable assignment in :method:`Language.from_config` as I wasn't sure if it should be unassigned

modified:   website/docs/api/language.mdx
	- corrected typo in `Language.replace_listeners` markdown

* modified:   spacy/language.py
	- removed noqa comment

---------

Co-authored-by: Ian Thompson <ian.thompson@hrblock.com>
2023-07-14 09:45:54 +02:00
Sofie Van Landeghem
ddffd09602
Trainable lemmatizer docs link (#12795)
* add an anchor to the trainable lemmatizer section

* add requirement for morphologizer,tagger to rule-based lemmatizer

* morphologizer only
2023-07-07 15:18:16 +02:00
Adriane Boyd
1a55661cfb
Update website binder version to v3.6 (#12805) 2023-07-07 10:52:33 +02:00
Adriane Boyd
41dba5bd34
Update max_length default in span finder docs (#12803) 2023-07-07 10:17:41 +02:00
Sofie Van Landeghem
536798f9e3
Disallow False for first/last arguments of add_pipe (#12793)
* Literal True for first/last options

* add test case

* update docs

* remove old redundant test case

* black formatting

* use Optional typing in docstrings

Co-authored-by: Raphael Mitsch <r.mitsch@outlook.com>

---------

Co-authored-by: Raphael Mitsch <r.mitsch@outlook.com>
2023-07-06 15:20:13 +02:00
Adriane Boyd
4e19ec7eb8
Docs for v3.6.0 (#12792)
* Docs for v3.6.0

* Add sl performance

* Add da trf note
2023-07-06 12:58:25 +02:00
Tom Aarsen
eab929361d
Use 'exclude' instead of 'disable' (#12783)
as suggested by @svlandeg
2023-07-04 11:45:13 +02:00
Marcus Blättermann
bd239511a4
Fix problem with missing syntax highlighting languages causing runtime crash on the website (#12781)
* Fix problem with universe pages using `docker` language

* Fix problem with universe pages using `r` language

* Add fallback, in case code language is unknown
2023-07-03 10:24:25 +02:00
Daniël de Kok
57a230c6e4
Remove section about parallel training with Ray (#12770)
The Ray integration is currently broken, having these docs around
suggest that this functionality is currently available.
2023-06-28 17:09:57 +02:00
Adriane Boyd
fb0da3e097
Support custom token/lexeme attribute for vectors (#12625)
* Support custom token/lexeme attribute for vectors

* Fix imports

* Back off to ORTH without Vectors.attr

* Fallback if vectors.attr doesn't exist

* Update docs
2023-06-28 09:43:14 +02:00
Daniël de Kok
bf92ca4f10 Merge remote-tracking branch 'upstream/master' into v4-isort 2023-06-26 12:43:00 +02:00
Tom Aarsen
93983f08fc
Add SpanMarker for NER to spaCy universe (#12730)
* Add SpanMarker for NER to spaCy universe

* Escape the newlines in the text in the code example

Or at least, attempt to

* Remove now unnecessary import

* Disable NER pipeline component in code example
2023-06-20 16:47:44 +02:00
David Berenstein
53c400bd7a
docs: added reference to spacy-setfit to the spaCy Universe (#12737)
* docs: added reference to spacy-setfit

* removed package import after adding factory entry points to packages
2023-06-19 15:52:07 +02:00
Marcus Blättermann
7e4b38c841
Fix #12716 does not update the config generation section (#12718)
This is a really odd bug, where Firefox doesn't re-render the `code` element, even though `children` changed.

Two things fixed that:
- remove the `language-ini` `className`
- replace the `code` block with a `div`

Both are not ideal. Therefor this solution adds an inner `div` that now has the classes while still maintaining the semantic `code` element.

I couldn't find any explanation for why this is happening and why it only happens in Firefox. I assume it is a bug caused by one of our many dependencies (or their interplay)

To make matters worse: This bug *doesn't* occure when running the site in dev mode. You have to build and serve the site to recreate it.
2023-06-19 09:34:28 +02:00
Daniël de Kok
50c5e9a2dd Merge remote-tracking branch 'upstream/master' into sync-v4-master-20230612 2023-06-12 15:57:10 +02:00
Jacobo Myerston
daa6e0339f
Update universe.json (#12709)
* Update universe.json

* Update universe.json

add some missing commas in the greCy's description.
2023-06-12 13:55:20 +02:00
kadarakos
c003aac29a
SpanFinder into spaCy from experimental (#12507)
* span finder integrated into spacy from experimental

* black

* isort

* black

* default spankey constant

* black

* Update spacy/pipeline/spancat.py

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

* rename

* rename

* max_length and min_length as Optional[int] and strict checking

* black

* mypy fix for integer type infinity

* revert line order

* implement all comparison operators for inf int

* avoid two for loops over all docs by not precomputing

* interleave thresholding with span creation

* black

* revert to not interleaving (relized its faster)

* black

* Update spacy/errors.py

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

* update dosctring

* enforce that the gold and predicted documents have the same text

* new error for ensuring reference and predicted texts are the same

* remove todo

* adjust test

* black

* handle misaligned tokenization

* return correct variable

* failing overfit test

* only use a single spans_key like in spancat

* black

* remove debug lines

* typo

* remove comment

* remove near duplicate reduntant method

* use the 'spans_key' variable name everywhere

* Update spacy/pipeline/span_finder.py

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

* flaky test fix suggestion, hand set bias terms

* only test suggester and test result exhaustively

* make it clear that the span_finder_suggester is more general (not specific to span_finder)

* Update spacy/tests/pipeline/test_span_finder.py

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

* Apply suggestions from code review

* remove question comment

* move preset_spans_suggester test to spancat tests

* Add docs and unify default configs for spancat and span finder

* Add `allow_overlap=True` to span finder scorer

* Fix offset bug in set_annotations

* Ignore labels in span finder scorer

* Format

* Add span_finder to quickstart template

* Move settings to self.cfg, store min/max unset as None

* Remove debugging

* Update docstrings and docs

* Update spacy/pipeline/span_finder.py

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Fix imports

---------

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
2023-06-07 15:52:28 +02:00
Isabel Zimmerman
05df59fd4a
[DOCS] add vetiver to spacy universe (#12557)
* add vetiver to spacy universe

* remove image

* update logo to render correctly in thumbnail

* apply Basil's suggestion

Co-authored-by: Basile Dura <bdura@users.noreply.github.com>

* refer to the same model

---------

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
Co-authored-by: Basile Dura <bdura@users.noreply.github.com>
2023-06-01 17:11:18 +02:00
Vinit Ravishankar
f0e0206b77
update universe for spacypdfreader (#12661) 2023-05-23 13:28:48 +02:00
Victoria
6930a6bf45
Add spaCy VSCode extension materials (#12592) 2023-05-19 14:38:53 +02:00
Adriane Boyd
df083f91a5
Add Malay to website languages (#12643) 2023-05-17 13:13:43 +02:00
Lj Miranda
58779c24ef
Remove shorthand for output-file in spacy apply (#12636)
The output-file argument is positional, so can't use a shorthand like -o.
2023-05-17 12:36:29 +02:00
David Berenstein
83b6f488cb
universe: Update examples Adept Augementation (#12620)
* Update universe.json

* chore: changed readme example as suggested by Vincent Warmerdam (koaning)
2023-05-15 14:09:33 +02:00
Adriane Boyd
3dc445df8d
Fix new tags in docs for v3.5.x (#12629)
* Fix new tags in docs for v3.5.x

* Fix new tag
2023-05-15 12:06:58 +02:00
Basile Dura
2dd8825f09
docs: add comment on offset_x argument (#12630) 2023-05-15 11:42:47 +02:00
Adriane Boyd
3637148c4d
Add scorer option to return per-component scores (#12540)
* Add scorer option to return per-component scores

Add `per_component` option to `Language.evaluate` and `Scorer.score` to
return scores keyed by `tokenizer` (hard-coded) or by component name.

Add option to `evaluate` CLI to score by component. Per-component scores
can only be saved to JSON.

* Update help text and messages
2023-05-12 15:36:54 +02:00
Kenneth Enevoldsen
88680a6eed
docs: remove invalid huggingface-hub push argument (#12624) 2023-05-12 09:40:28 +02:00
royashcenazi
3252f6b13f
Parsigs universe 3 (#12617)
* parsigs universe

* added model installation explanation in the description

* Update website/meta/universe.json

Co-authored-by: Basile Dura <bdura@users.noreply.github.com>

* added model installement instruction in the code example

* added biomedical category

---------

Co-authored-by: Basile Dura <bdura@users.noreply.github.com>
2023-05-10 13:49:51 +02:00
royashcenazi
a56ab98e3c
parsigs universe (#12616)
* parsigs universe

* added model installation explanation in the description

* Update website/meta/universe.json

Co-authored-by: Basile Dura <bdura@users.noreply.github.com>

* added model installement instruction in the code example

---------

Co-authored-by: Basile Dura <bdura@users.noreply.github.com>
2023-05-10 13:19:28 +02:00
David Berenstein
d11b549195
chore: added adept-augmentations to the spacy universe (#12609)
* chore: added adept-augmentations to the spacy universe

* Apply suggestions from code review

Co-authored-by: Basile Dura <bdura@users.noreply.github.com>

* Update universe.json

---------

Co-authored-by: Basile Dura <bdura@users.noreply.github.com>
2023-05-10 13:16:16 +02:00
Patrick J. Burns
15f16db6ca
Fix typo (#12615) 2023-05-09 15:52:34 +02:00
Patrick J. Burns
eb3960a15a
Add LatinCy models to universe.json (#12597)
* Add LatinCy models to universe.json

* Update website/meta/universe.json

Add install code for LatinCy models to 'code_example'

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

* Update LatinCy ‘code_example’ in website/meta/universe.json

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

---------

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
2023-05-09 12:02:45 +02:00
Kenneth Enevoldsen
73698326df
Update inmemorylookupkb.mdx (#12586)
Example does not refer to the in memory lookup
2023-05-02 12:51:13 +02:00
Victoria
a8dfc66135
Add spacy-wasm to universe (#12572)
* add spacy-wasm to universe

* add tag
2023-04-26 14:18:40 +02:00
moxley01
070fa16545
add spacysee project (#12568) 2023-04-25 12:30:19 +02:00
Daniël de Kok
8a5814bf2c
Add distillation loop (#12542)
* Add distillation initialization and loop

* Fix up configuration keys

* Add docstring

* Type annotations

* init_nlp_distill -> init_nlp_student

* Do not resolve dot name distill corpus in initialization

(Since we don't use it.)

* student: do not request use of optimizer in student pipe

We apply finish up the updates once in the training loop instead.

Also add the necessary logic to `Language.distill` to mirror
`Language.update`.

* Correctly determine sort key in subdivide_batch

* Fix _distill_loop docstring wrt. stopping condition

* _distill_loop: fix distill_data docstring

Make similar changes in train_while_improving, since it also had
incorrect types and missing type annotations.

* Move `set_{gpu_allocator,seed}_from_config` to spacy.util

* Update Language.update docs for the sgd argument

* Type annotation

Co-authored-by: Madeesh Kannan <shadeMe@users.noreply.github.com>

---------

Co-authored-by: Madeesh Kannan <shadeMe@users.noreply.github.com>
2023-04-21 13:49:40 +02:00
Victoria
e115408514
remove survey link (#12559) 2023-04-21 10:22:26 +02:00
Adriane Boyd
b60b027927
Add default option to MorphAnalysis.get (#12545)
* Add default to MorphAnalysis.get

Similar to `dict`, allow a `default` option for `MorphAnalysis.get` for
the user to provide a default return value if the field is not found.
The default return value remains `[]`, which is not the same as
`dict.get`, but is already established as this method's default return
value with the return type `List[str]`. However the new `default` option
does not enforce that the user-provided default is actually `List[str]`.

* Restore test case
2023-04-20 14:06:32 +02:00
TAN Long
119f959218
docs(REL_OP): modify docs for REL_OPs to match Semgrex's update on CoreNLP v4.5.2 (#12531)
Co-authored-by: Tan Long <tanloong@foxmail.com>
2023-04-17 13:14:01 +02:00
andyjessen
02259fa195
Add category to spaCy project (#12506)
ScispaCy fits within biomedical domain. Consider adding this category.
2023-04-07 15:31:04 +02:00
Madeesh Kannan
6db20b354f
Docs: Fix rule-based matching example that expands named entities (#12495) 2023-04-06 11:45:58 +02:00
Edward
c95d320d28
Add more information to custom code docs (#12491)
* Add info to sections

* Update website/docs/usage/training.mdx

---------

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
2023-04-06 11:45:19 +02:00
Will Frey
8d4129e177
Fix invalid ConsoleLogger.v3 example config (#12498)
Replace `progress_bar = "all_steps"` with `progress_bar = "eval"`, which is consistent with the default behavior for `spacy.ConsoleLogger.v1` and `spacy.ConsoleLogger.v2`.
2023-04-04 20:53:07 +02:00
Edward
de32011e4c
Add model-last saving mechanism to pretraining (#12459)
* Adjust pretrain command

* chane naming and add finally block

* Add unit test

* Add unit test assertions

* Update spacy/training/pretrain.py

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

* change finally block

* Add to docs

* Update website/docs/usage/embeddings-transformers.mdx

* Add flag to skip saving model-last

---------

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
2023-04-03 15:24:03 +02:00
Ye Lei (叶磊)
ce258670b7
Allow passing a Span to displacy.parse_deps (#12477)
* Allow passing a Span to displacy.parse_deps

* Update docstring

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

* Update API docs

---------

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
2023-03-31 09:44:01 +02:00
Daniël de Kok
b734e5314d
Avoid TrainablePipe.finish_update getting called twice during training (#12450)
* Avoid `TrainablePipe.finish_update` getting called twice during training

PR #12136 fixed an issue where the tok2vec pipe was updated before
gradient were accumulated. However, it introduced a new bug that cause
`finish_update` to be called twice when using the training loop. This
causes a fairly large slowdown.

The `Language.update` method accepts the `sgd` argument for passing an
optimizer. This argument has three possible values:

- `Optimizer`: use the given optimizer to finish pipe updates.
- `None`: use a default optimizer to finish pipe updates.
- `False`: do not finish pipe updates.

However, the latter option was not documented and not valid with the
existing type of `sgd`. I assumed that this was a remnant of earlier
spaCy versions and removed handling of `False`.

However, with that change, we are passing `None` to `Language.update`.
As a result, we were calling `finish_update` in both `Language.update`
and in the training loop after all subbatches are processed.

This change restores proper handling/use of `False`. Moreover, the role
of `False` is now documented and added to the type to avoid future
accidents.

* Fix typo

* Document defaults for `Language.update`
2023-03-30 09:30:42 +02:00
Edward
dba4e7bece
Add info to stringstore and vocab (#12471) 2023-03-27 13:15:14 +02:00
sloev / Johannes Valbjørn
fd072533e7
add spacy_onnx_sentiment_english to universe (#12422)
* add spacy_onnx_sentiment_english to universe

* rename to sentimental-onix

* fix comma json error

* fix typo

* typo fix

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

* mention need to download model before example works

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

---------

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
2023-03-27 11:35:14 +02:00