Commit Graph

3178 Commits

Author SHA1 Message Date
Adriane Boyd
ff4215f1c7
Drop support for python 3.6 (#13009)
* Drop support for python 3.6

* Update docs
2023-09-25 14:48:38 +02:00
Adriane Boyd
935a5455b6
Docs: add new tag for evaluate CLI --spans-key (#13013) 2023-09-25 11:49:28 +02:00
Eliana Vornov
4e3360ad12
add --spans-key option for CLI spancat evaluation (#12981)
* add span key option for CLI evaluation

* Rephrase CLI help to refer to Doc.spans instead of spancat

* Rephrase docs to refer to Doc.spans instead of spancat

---------

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
2023-09-25 11:25:41 +02:00
Raphael Mitsch
bef9f63e13 Add gpt-3.5-turbo-instruct to list of supported OpenAI models. 2023-09-21 11:28:58 +02:00
Sofie Van Landeghem
8f0d6b0a8c
Fix in BertTokenizer docs (#12955)
* fix BertWordPieceTokenizer constructor call

* fix

* Update website/docs/usage/linguistic-features.mdx

---------

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
2023-09-13 13:21:58 +02:00
Sofie Van Landeghem
013762be41
Few spacy-llm doc fixes (#12969)
* fix construction example

* shorten task-specific factory list

* small edits to HF models

* small edit to API models

* typo

* fix space

Co-authored-by: Raphael Mitsch <r.mitsch@outlook.com>

---------

Co-authored-by: Raphael Mitsch <r.mitsch@outlook.com>
2023-09-08 11:35:38 +02:00
Sofie Van Landeghem
def7013eec
Docs for spacy-llm 0.5.0 (#12968)
* Update incorrect example config. (#12893)

* spacy-llm docs cleanup (#12945)

* Shorten NER section

* fix template references

* simplify sections

* set temperature to 0.0 in examples

* condense model information

* fix parameters for REST models

* set temperature to 0.0

* spelling fix

* trigger preview

* fix quotes

* add small note on noop.v1

* move up example noop config

* set appropriate model example configs

* explain config

* fix

Co-authored-by: Raphael Mitsch <r.mitsch@outlook.com>

---------

Co-authored-by: Raphael Mitsch <r.mitsch@outlook.com>

* Docs for ner.v3 and spancat.v3 spacy-llm tasks (#12949)

* formatting

* update usage table with NER.v3

* fix typo in links

* v3 overview of parameters

* add spancat.v3

* add further v3 explanations

* remove TODO comment

* few more small fixes

* Add doc section on LLM + task factories (#12905)

* Add section on LLM + task factories.

* Apply suggestions from code review

---------

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* add default config to openai models (#12961)

* Docs for spacy-llm 0.5.0 (#12967)

* simplify Python example

* simplify Python example

* Refer only to latest OpenAI model versions from usage doc

* Typo fix

Co-authored-by: Raphael Mitsch <r.mitsch@outlook.com>

* clarify accuracy claim

---------

Co-authored-by: Raphael Mitsch <r.mitsch@outlook.com>

---------

Co-authored-by: Raphael Mitsch <r.mitsch@outlook.com>
2023-09-08 10:25:14 +02:00
Magdalena Aniol
cc78847688
fix training.batch_size example (#12963) 2023-09-06 16:38:13 +02:00
Sofie Van Landeghem
6d1f6d9a23
Fix LLM usage example (#12950)
* fix usage example

* revert back to v2 to allow hot fix on main
2023-09-04 09:05:50 +02:00
Sofie Van Landeghem
5c1f9264c2
fix typo in link (#12948)
* fix typo in link

* fix REL.v1 parameter
2023-09-01 13:47:20 +02:00
David Berenstein
065ead4eed
updated add_pipe docs (#12947) 2023-09-01 11:05:36 +02:00
vincent d warmerdam
3e4264899c
Update large-language-models.mdx (#12944) 2023-08-30 11:58:14 +02:00
Ines Montani
52758e1afa Add headers to netlify.toml [ci skip] 2023-08-30 11:55:23 +02:00
Vinit Ravishankar
c2303858e6
Documentation for spacy-curated-transformers (#12677)
* initial

* initial documentation run

* fix typo

* Remove mentions of Torchscript and quantization

Both are disabled in the initial release of `spacy-curated-transformers`.

* Fix `piece_encoder` entries

* Remove `spacy-transformers`-specific warning

* Fix duplicate entries in tables

* Doc fixes

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Remove type aliases

* Fix copy-paste typo

* Change `debug pieces` version tag to `3.7`

* Set curated transformers API version to `3.7`

* Fix transformer listener naming

* Add docs for `init fill-config-transformer`

* Update CLI command invocation syntax

* Update intro section of the pipeline component docs

* Fix source URL

* Add a note to the architectures section about the `init fill-config-transformer` CLI command

* Apply suggestions from code review

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Update CLI command name, args

* Remove hyphen from the `curated-transformers.mdx` filename

* Fix links

* Remove placeholder text

* Add text to the model/tokenizer loader sections

* Fill in the `DocTransformerOutput` section

* Formatting fixes

* Add curated transformer page to API docs sidebar

* More formatting fixes

* Remove TODO comment

* Remove outdated info about default config

* Apply suggestions from code review

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Add link to HF model hub

* `prettier`

---------

Co-authored-by: Madeesh Kannan <shadeMe@users.noreply.github.com>
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
2023-08-29 17:52:16 +02:00
PD Hall
d8a32c1050
docs: fix ngram_range_suggester max_size description (#12939) 2023-08-29 11:10:58 +02:00
Connor Brinton
6dd56868de
📝 Fix formula for receptive field in docs (#12918)
spaCy's `HashEmbedCNN` layer performs convolutions over tokens to produce
contextualized embeddings using a `MaxoutWindowEncoder` layer. These
convolutions are implemented using Thinc's `expand_window` layer, which
concatenates `window_size` neighboring sequence items on either side of
the sequence item being processed. This is repeated across `depth`
convolutional layers.

For example, consider the sequence "ABCDE" and a `MaxoutWindowEncoder`
layer with a context window of 1 and a depth of 2. We'll focus on the
token "C". We can visually represent the contextual embedding produced
for "C" as:
```mermaid
flowchart LR
A0(A<sub>0</sub>)
B0(B<sub>0</sub>)
C0(C<sub>0</sub>)
D0(D<sub>0</sub>)
E0(E<sub>0</sub>)
B1(B<sub>1</sub>)
C1(C<sub>1</sub>)
D1(D<sub>1</sub>)
C2(C<sub>2</sub>)
A0 --> B1
B0 --> B1
C0 --> B1
B0 --> C1
C0 --> C1
D0 --> C1
C0 --> D1
D0 --> D1
E0 --> D1
B1 --> C2
C1 --> C2
D1 --> C2
```

Described in words, this graph shows that before the first layer of the
convolution, the "receptive field" centered at each token consists only
of that same token. That is, the receptive field is 1.
The first layer of the convolution adds one neighboring token on either
side to the receptive field. Since this is done on both sides, the
receptive field increases by 2, giving the first layer a receptive field
of 3. The second layer of the convolution adds an _additional_
neighboring token on either side to the receptive field, giving a final
receptive field of 5.

However, this doesn't match the formula currently given in the docs,
which reads:
> The receptive field of the CNN will be
> `depth * (window_size * 2 + 1)`, so a 4-layer network with a window
> size of `2` will be sensitive to 20 words at a time.

Substituting in our depth of 2 and window size of 1, this formula gives
us a receptive field of:
```
depth * (window_size * 2 + 1)
= 2 * (1 * 2 + 1)
= 2 * (2 + 1)
= 2 * 3
= 6
```

This not only doesn't match our computations from above, it's also an
even number! This is suspicious, since the receptive field is supposed
to be centered on a token, and not between tokens. Generally, this
formula results in an even number for any even value of `depth`.

The error in this formula is that the adjustment for the center token
is multiplied by the depth, when it should occur only once. The
corrected formula, `depth * window_size * 2 + 1`, gives the correct
value for our small example from above:
```
depth * window_size * 2 + 1
= 2 * 1 * 2 + 1
= 4 + 1
= 5
```

These changes update the docs to correct the receptive field formula and
the example receptive field size.
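
As a quick check of the arithmetic above, the following minimal sketch compares the old and corrected formulas against a brute-force simulation of stacked windows. The function names and the `seq_len` parameter are made up for this illustration and are not part of spaCy or Thinc.

```python
def old_formula(depth: int, window_size: int) -> int:
    """Formula previously given in the docs (counts the center token once per layer)."""
    return depth * (window_size * 2 + 1)


def receptive_field(depth: int, window_size: int) -> int:
    """Corrected formula: each layer adds window_size tokens per side, plus one center token."""
    return depth * window_size * 2 + 1


def simulated_receptive_field(depth: int, window_size: int, seq_len: int = 101) -> int:
    """Count the input positions that can influence the center token.

    Assumes the sequence is long enough that the window never hits the edges.
    """
    center = seq_len // 2
    field = {center}
    for _ in range(depth):
        # Each layer lets every position in the field see its neighbors
        # up to window_size away on either side.
        field = {
            j
            for i in field
            for j in range(i - window_size, i + window_size + 1)
            if 0 <= j < seq_len
        }
    return len(field)


if __name__ == "__main__":
    for depth, window_size in [(2, 1), (4, 2)]:
        print(
            depth,
            window_size,
            old_formula(depth, window_size),
            receptive_field(depth, window_size),
            simulated_receptive_field(depth, window_size),
        )
```

For the "ABCDE" example (depth 2, window size 1), the simulation agrees with the corrected formula and not with the old one.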
2023-08-21 10:52:32 +02:00
Adriane Boyd
76a9f9c6c6
Docs: clarify abstract spacy.load examples (#12889) 2023-08-16 17:28:34 +02:00
William Mattingly
64b8ee2dbe
Update universe.json (#12904)
* Update universe.json

added hobbit-spacy to the universe json

* Update universe.json

removed displacy from hobbit-spacy and added a default text.
2023-08-14 16:44:14 +02:00
Adriane Boyd
c4e378df97
Update CuPy extras (#12890)
* Add `cuda12x` for `cupy-cuda12x`.
* Drop `cuda-autodetect` from quickstart, set default to `cuda11x`
instead.
2023-08-08 12:58:28 +02:00
Sofie Van Landeghem
3b7faf4f5e
fix (#12881) 2023-08-03 08:37:43 +02:00
Arman Mohammadi
07407e07ab
fix the regular expression matching on the full text (#12883)
A mistake in the regex pattern prevented it from matching all the desired tokens. When the `r` string literal prefix is used to mark a raw string, a backslash should be written once, not doubled.
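
The raw-string pitfall described here can be sketched with Python's `re` module. The sample texts and patterns below are invented for illustration and are not the ones from the corrected docs example.

```python
import re

text = "Call 555 now"

# In a raw string, "\\d" stays as two backslashes followed by "d".
# As a regex, that means "a literal backslash, then one or more 'd' characters".
doubled = re.findall(r"\\d+", text)

# A single backslash in a raw string gives the regex escape \d (a digit).
single = re.findall(r"\d+", text)

# The doubled pattern only matches text that contains a literal backslash.
literal = re.findall(r"\\d+", r"path\ddd")

if __name__ == "__main__":
    print(doubled)  # no match: the text contains no backslash
    print(single)   # matches the run of digits
    print(literal)  # matches the backslash followed by "ddd"
```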
2023-08-02 16:52:26 +02:00
Adriane Boyd
0fe43f40f1
Support registered vectors (#12492)
* Support registered vectors

* Format

* Auto-fill [nlp] on load from config and from bytes/disk

* Only auto-fill [nlp]

* Undo all changes to Language.from_disk

* Expand BaseVectors

These methods are needed in various places for training and vector
similarity.

* isort

* More linting

* Only fill [nlp.vectors]

* Update spacy/vocab.pyx

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Revert changes to test related to auto-filling [nlp]

* Add vectors registry

* Rephrase error about vocab methods for vectors

* Switch to dummy implementation for BaseVectors.to_ops

* Add initial draft of docs

* Remove example from BaseVectors docs

* Apply suggestions from code review

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Update website/docs/api/basevectors.mdx

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Fix type and lint bpemb example

* Update website/docs/api/basevectors.mdx

---------

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
2023-08-01 15:46:08 +02:00
Andy Friedman
186889ec9c
added entry for SaysWho (#12828)
* Update universe.json

added entry for Sayswho

* Update universe.json

updated sayswho entry

* Update universe.json

* Update website/meta/universe.json

* Update website/meta/universe.json

---------

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
2023-07-31 10:52:32 +02:00
Sofie Van Landeghem
c9e9dccf79
Add displaCy data structures to docs (2) (#12875)
* Add data structures to docs

* Adjusted descriptions for more consistency

* Add _optional_ flag to parameters

* Add tests and adjust optional title key in doc

* Add title to dep visualizations

* fix typo

---------

Co-authored-by: thomashacker <EdwardSchmuhl@web.de>
2023-07-31 10:47:57 +02:00
Victoria
49055ed7c8
Add cli for finding locations of registered func (#12757)
* Add cli for finding locations of registered func

* fixes: naming and typing

* isort

* update naming

* rename to find-function

* remove file:// bit

* use registry name if given and exit gracefully if a registry was not found

* clean up failure msg

* specify registry_name options

* mypy fixes

* return location for internal usage

* add documentation

* more mypy fixes

* clean up example

* add section to menu

* add tests

---------

Co-authored-by: svlandeg <svlandeg@github.com>
2023-07-31 09:39:00 +02:00
Márton Kardos
51b9655470
Added OdyCy to spaCy Universe (#12826)
* Added OdyCy to spaCy Universe

* Replaced template tags

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

---------

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
2023-07-26 16:05:53 +02:00
Madeesh Kannan
98799d849e
SpanCat: Remove invalid threshold config argument (#12860) 2023-07-26 13:56:31 +02:00
Victoria
e2b89012a2
Add spacy-llm docs to website (#12782)
* initial commit

* update for v0.4.0

* Apply suggestions from code review

* Fix formatting

* Apply suggestions from code review

* Update website/docs/api/large-language-models.mdx

* Update website/docs/api/large-language-models.mdx

* update usage page

* Apply suggestions from review

* Apply suggestions from review

* fix links

* fix relative links

* Apply suggestions from code review

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Apply suggestions from code review

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Apply suggestions from review

* Add section on Llama 2. Format.

---------

Co-authored-by: Raphael Mitsch <r.mitsch@outlook.com>
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
2023-07-24 14:44:47 +02:00
Adriane Boyd
95075298f5
Update pex Makefile defaults (#12832)
* Update pex Makefile defaults

- switch to python 3.8
- only install spacy-lookups-data for extra packages

* Update website for pex defaults
2023-07-18 09:29:04 +02:00
Ian Thompson
ef20e114e0
Typo fix in Language.replace_listeners docs (#12823)
* modified:   spacy/language.py
	- corrected typo in docstring for :method:`Language.replace_listeners`
	- added noqa comment on unused local variable assignment in :method:`Language.from_config` as I wasn't sure if it should be unassigned

modified:   website/docs/api/language.mdx
	- corrected typo in `Language.replace_listeners` markdown

* modified:   spacy/language.py
	- removed noqa comment

---------

Co-authored-by: Ian Thompson <ian.thompson@hrblock.com>
2023-07-14 09:45:54 +02:00
Sofie Van Landeghem
ddffd09602
Trainable lemmatizer docs link (#12795)
* add an anchor to the trainable lemmatizer section

* add requirement for morphologizer,tagger to rule-based lemmatizer

* morphologizer only
2023-07-07 15:18:16 +02:00
Adriane Boyd
1a55661cfb
Update website binder version to v3.6 (#12805) 2023-07-07 10:52:33 +02:00
Adriane Boyd
41dba5bd34
Update max_length default in span finder docs (#12803) 2023-07-07 10:17:41 +02:00
Adriane Boyd
4e19ec7eb8
Docs for v3.6.0 (#12792)
* Docs for v3.6.0

* Add sl performance

* Add da trf note
2023-07-06 12:58:25 +02:00
Tom Aarsen
eab929361d
Use 'exclude' instead of 'disable' (#12783)
as suggested by @svlandeg
2023-07-04 11:45:13 +02:00
Marcus Blättermann
bd239511a4
Fix problem with missing syntax highlighting languages causing runtime crash on the website (#12781)
* Fix problem with universe pages using `docker` language

* Fix problem with universe pages using `r` language

* Add fallback, in case code language is unknown
2023-07-03 10:24:25 +02:00
Daniël de Kok
57a230c6e4
Remove section about parallel training with Ray (#12770)
The Ray integration is currently broken; having these docs around
suggests that this functionality is currently available.
2023-06-28 17:09:57 +02:00
Adriane Boyd
fb0da3e097
Support custom token/lexeme attribute for vectors (#12625)
* Support custom token/lexeme attribute for vectors

* Fix imports

* Back off to ORTH without Vectors.attr

* Fallback if vectors.attr doesn't exist

* Update docs
2023-06-28 09:43:14 +02:00
Tom Aarsen
93983f08fc
Add SpanMarker for NER to spaCy universe (#12730)
* Add SpanMarker for NER to spaCy universe

* Escape the newlines in the text in the code example

Or at least, attempt to

* Remove now unnecessary import

* Disable NER pipeline component in code example
2023-06-20 16:47:44 +02:00
David Berenstein
53c400bd7a
docs: added reference to spacy-setfit to the spaCy Universe (#12737)
* docs: added reference to spacy-setfit

* removed package import after adding factory entry points to packages
2023-06-19 15:52:07 +02:00
Marcus Blättermann
7e4b38c841
Fix #12716 does not update the config generation section (#12718)
This is a really odd bug, where Firefox doesn't re-render the `code` element, even though `children` changed.

Two things fixed that:
- remove the `language-ini` `className`
- replace the `code` block with a `div`

Neither is ideal. Therefore this solution adds an inner `div` that now has the classes while still maintaining the semantic `code` element.

I couldn't find any explanation for why this is happening and why it only happens in Firefox. I assume it is a bug caused by one of our many dependencies (or their interplay).

To make matters worse: this bug *doesn't* occur when running the site in dev mode. You have to build and serve the site to recreate it.
2023-06-19 09:34:28 +02:00
Jacobo Myerston
daa6e0339f
Update universe.json (#12709)
* Update universe.json

* Update universe.json

add some missing commas in the greCy's description.
2023-06-12 13:55:20 +02:00
kadarakos
c003aac29a
SpanFinder into spaCy from experimental (#12507)
* span finder integrated into spacy from experimental

* black

* isort

* black

* default spankey constant

* black

* Update spacy/pipeline/spancat.py

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

* rename

* rename

* max_length and min_length as Optional[int] and strict checking

* black

* mypy fix for integer type infinity

* revert line order

* implement all comparison operators for inf int

* avoid two for loops over all docs by not precomputing

* interleave thresholding with span creation

* black

* revert to not interleaving (realized it's faster)

* black

* Update spacy/errors.py

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

* update docstring

* enforce that the gold and predicted documents have the same text

* new error for ensuring reference and predicted texts are the same

* remove todo

* adjust test

* black

* handle misaligned tokenization

* return correct variable

* failing overfit test

* only use a single spans_key like in spancat

* black

* remove debug lines

* typo

* remove comment

* remove near-duplicate redundant method

* use the 'spans_key' variable name everywhere

* Update spacy/pipeline/span_finder.py

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

* flaky test fix suggestion, hand set bias terms

* only test suggester and test result exhaustively

* make it clear that the span_finder_suggester is more general (not specific to span_finder)

* Update spacy/tests/pipeline/test_span_finder.py

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

* Apply suggestions from code review

* remove question comment

* move preset_spans_suggester test to spancat tests

* Add docs and unify default configs for spancat and span finder

* Add `allow_overlap=True` to span finder scorer

* Fix offset bug in set_annotations

* Ignore labels in span finder scorer

* Format

* Add span_finder to quickstart template

* Move settings to self.cfg, store min/max unset as None

* Remove debugging

* Update docstrings and docs

* Update spacy/pipeline/span_finder.py

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Fix imports

---------

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
2023-06-07 15:52:28 +02:00
Isabel Zimmerman
05df59fd4a
[DOCS] add vetiver to spacy universe (#12557)
* add vetiver to spacy universe

* remove image

* update logo to render correctly in thumbnail

* apply Basile's suggestion

Co-authored-by: Basile Dura <bdura@users.noreply.github.com>

* refer to the same model

---------

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
Co-authored-by: Basile Dura <bdura@users.noreply.github.com>
2023-06-01 17:11:18 +02:00
Vinit Ravishankar
f0e0206b77
update universe for spacypdfreader (#12661) 2023-05-23 13:28:48 +02:00
Victoria
6930a6bf45
Add spaCy VSCode extension materials (#12592) 2023-05-19 14:38:53 +02:00
Adriane Boyd
df083f91a5
Add Malay to website languages (#12643) 2023-05-17 13:13:43 +02:00
Lj Miranda
58779c24ef
Remove shorthand for output-file in spacy apply (#12636)
The output-file argument is positional, so can't use a shorthand like -o.
2023-05-17 12:36:29 +02:00
David Berenstein
83b6f488cb
universe: Update examples Adept Augmentation (#12620)
* Update universe.json

* chore: changed readme example as suggested by Vincent Warmerdam (koaning)
2023-05-15 14:09:33 +02:00
Adriane Boyd
3dc445df8d
Fix new tags in docs for v3.5.x (#12629)
* Fix new tags in docs for v3.5.x

* Fix new tag
2023-05-15 12:06:58 +02:00
Basile Dura
2dd8825f09
docs: add comment on offset_x argument (#12630) 2023-05-15 11:42:47 +02:00
Adriane Boyd
3637148c4d
Add scorer option to return per-component scores (#12540)
* Add scorer option to return per-component scores

Add `per_component` option to `Language.evaluate` and `Scorer.score` to
return scores keyed by `tokenizer` (hard-coded) or by component name.

Add option to `evaluate` CLI to score by component. Per-component scores
can only be saved to JSON.

* Update help text and messages
2023-05-12 15:36:54 +02:00
Kenneth Enevoldsen
88680a6eed
docs: remove invalid huggingface-hub push argument (#12624) 2023-05-12 09:40:28 +02:00
royashcenazi
3252f6b13f
Parsigs universe 3 (#12617)
* parsigs universe

* added model installation explanation in the description

* Update website/meta/universe.json

Co-authored-by: Basile Dura <bdura@users.noreply.github.com>

* added model installation instructions in the code example

* added biomedical category

---------

Co-authored-by: Basile Dura <bdura@users.noreply.github.com>
2023-05-10 13:49:51 +02:00
royashcenazi
a56ab98e3c
parsigs universe (#12616)
* parsigs universe

* added model installation explanation in the description

* Update website/meta/universe.json

Co-authored-by: Basile Dura <bdura@users.noreply.github.com>

* added model installation instructions in the code example

---------

Co-authored-by: Basile Dura <bdura@users.noreply.github.com>
2023-05-10 13:19:28 +02:00
David Berenstein
d11b549195
chore: added adept-augmentations to the spacy universe (#12609)
* chore: added adept-augmentations to the spacy universe

* Apply suggestions from code review

Co-authored-by: Basile Dura <bdura@users.noreply.github.com>

* Update universe.json

---------

Co-authored-by: Basile Dura <bdura@users.noreply.github.com>
2023-05-10 13:16:16 +02:00
Patrick J. Burns
15f16db6ca
Fix typo (#12615) 2023-05-09 15:52:34 +02:00
Patrick J. Burns
eb3960a15a
Add LatinCy models to universe.json (#12597)
* Add LatinCy models to universe.json

* Update website/meta/universe.json

Add install code for LatinCy models to 'code_example'

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

* Update LatinCy ‘code_example’ in website/meta/universe.json

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

---------

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
2023-05-09 12:02:45 +02:00
Kenneth Enevoldsen
73698326df
Update inmemorylookupkb.mdx (#12586)
Example does not refer to the in memory lookup
2023-05-02 12:51:13 +02:00
Victoria
a8dfc66135
Add spacy-wasm to universe (#12572)
* add spacy-wasm to universe

* add tag
2023-04-26 14:18:40 +02:00
moxley01
070fa16545
add spacysee project (#12568) 2023-04-25 12:30:19 +02:00
Victoria
e115408514
remove survey link (#12559) 2023-04-21 10:22:26 +02:00
Adriane Boyd
b60b027927
Add default option to MorphAnalysis.get (#12545)
* Add default to MorphAnalysis.get

Similar to `dict`, allow a `default` option for `MorphAnalysis.get` for
the user to provide a default return value if the field is not found.
The default return value remains `[]`, which is not the same as
`dict.get`, but is already established as this method's default return
value with the return type `List[str]`. However the new `default` option
does not enforce that the user-provided default is actually `List[str]`.

* Restore test case
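
The `default` behaviour described in this commit can be sketched without spaCy. `ToyMorph` below is a hypothetical stand-in written for this illustration, not spaCy's `MorphAnalysis`; it only mirrors the semantics stated above (a missing field returns `[]` unless the caller supplies a `default`, and the default's type is not checked).

```python
from typing import Dict, List, Optional


class ToyMorph:
    """Hypothetical stand-in for a morphological analysis (not spaCy's class)."""

    def __init__(self, feats: str) -> None:
        # Parse a UD-style feature string such as "Number=Sing|Person=3".
        self._fields: Dict[str, List[str]] = {}
        for pair in filter(None, feats.split("|")):
            field, values = pair.split("=", 1)
            self._fields[field] = values.split(",")

    def get(self, field: str, default: Optional[List[str]] = None) -> List[str]:
        """Return the values for a field, or the default if the field is missing.

        Unlike dict.get, the fallback is an empty list rather than None.
        The user-provided default is returned as-is, without type checks.
        """
        if field in self._fields:
            return self._fields[field]
        return [] if default is None else default


morph = ToyMorph("Number=Sing|Person=3")

if __name__ == "__main__":
    print(morph.get("Number"))
    print(morph.get("Tense"))
    print(morph.get("Tense", default=["Unknown"]))
```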
2023-04-20 14:06:32 +02:00
TAN Long
119f959218
docs(REL_OP): modify docs for REL_OPs to match Semgrex's update on CoreNLP v4.5.2 (#12531)
Co-authored-by: Tan Long <tanloong@foxmail.com>
2023-04-17 13:14:01 +02:00
andyjessen
02259fa195
Add category to spaCy project (#12506)
ScispaCy fits within the biomedical domain. Consider adding this category.
2023-04-07 15:31:04 +02:00
Madeesh Kannan
6db20b354f
Docs: Fix rule-based matching example that expands named entities (#12495) 2023-04-06 11:45:58 +02:00
Edward
c95d320d28
Add more information to custom code docs (#12491)
* Add info to sections

* Update website/docs/usage/training.mdx

---------

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
2023-04-06 11:45:19 +02:00
Will Frey
8d4129e177
Fix invalid ConsoleLogger.v3 example config (#12498)
Replace `progress_bar = "all_steps"` with `progress_bar = "eval"`, which is consistent with the default behavior for `spacy.ConsoleLogger.v1` and `spacy.ConsoleLogger.v2`.
2023-04-04 20:53:07 +02:00
Edward
de32011e4c
Add model-last saving mechanism to pretraining (#12459)
* Adjust pretrain command

* change naming and add finally block

* Add unit test

* Add unit test assertions

* Update spacy/training/pretrain.py

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

* change finally block

* Add to docs

* Update website/docs/usage/embeddings-transformers.mdx

* Add flag to skip saving model-last

---------

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
2023-04-03 15:24:03 +02:00
Ye Lei (叶磊)
ce258670b7
Allow passing a Span to displacy.parse_deps (#12477)
* Allow passing a Span to displacy.parse_deps

* Update docstring

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

* Update API docs

---------

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
2023-03-31 09:44:01 +02:00
Edward
dba4e7bece
Add info to stringstore and vocab (#12471) 2023-03-27 13:15:14 +02:00
sloev / Johannes Valbjørn
fd072533e7
add spacy_onnx_sentiment_english to universe (#12422)
* add spacy_onnx_sentiment_english to universe

* rename to sentimental-onix

* fix comma json error

* fix typo

* typo fix

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

* mention need to download model before example works

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

---------

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
2023-03-27 11:35:14 +02:00
Prajakta Darade
ae7779e830
corrected example code (#12466) 2023-03-27 11:32:49 +02:00
kadarakos
d1474fdd91
add explanation about overwriting behaviour (#12464)
* add explanation about overwriting behaviour

* Update website/docs/api/spancategorizer.mdx

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

* Update website/docs/api/spancategorizer.mdx

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

* Update website/docs/api/spancategorizer.mdx

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

* format

---------

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
2023-03-27 10:27:11 +02:00
Vinit Ravishankar
28de85737f
Tagger label smoothing (#12293)
* add label smoothing

* use True/False instead of floats

* add entropy to debug data

* formatting

* docs

* change test to check difference in distributions

* Update website/docs/api/tagger.mdx

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

* Update spacy/pipeline/tagger.pyx

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

* bool -> float

* update docs

* fix seed

* black

* update tests to use label_smoothing = 0.0

* set default to 0.0, update quickstart

* Update spacy/pipeline/tagger.pyx

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

* update morphologizer, tagger test

* fix morph docs

* add url to docs

---------

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
2023-03-22 12:17:56 +01:00
Ines Montani
b479f8bfa5
Add user survey alert to the top (#12452)
* Add user survey alert to the top

* Shorter

---------

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
2023-03-22 11:09:37 +01:00
Adriane Boyd
2ce9a220db
Fix --verbose for spacy find-threshold (#12418) 2023-03-14 17:16:49 +01:00
Lj Miranda
913d74f509
Add spancat_singlelabel pipeline for multiclass and non-overlapping span labelling tasks (#11365)
* [wip] Update

* [wip] Update

* Add initial port

* [wip] Update

* Fix all imports

* Add spancat_exclusive to pipeline

* [WIP] Update

* [ci skip] Add breakpoint for debugging

* Use spacy.SpanCategorizer.v1 as default archi

* Update spacy/pipeline/spancat_exclusive.py

Co-authored-by: kadarakos <kadar.akos@gmail.com>

* [ci skip] Small updates

* Use Softmax v2 directly from thinc

* Cache the label map

* Fix mypy errors

However, I ignored line 370 because it opened up a bunch of type errors
that might be trickier to solve and might lead to a more complicated
codebase.

* avoid multiplication with 1.0

Co-authored-by: kadarakos <kadar.akos@gmail.com>

* Update spacy/pipeline/spancat_exclusive.py

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Update component versions to v2

* Add scorer to docstring

* Add _n_labels property to SpanCategorizer

Instead of using len(self.labels) in initialize() I am using a private
property self._n_labels. This achieves implementation parity and allows
me to delete the whole initialize() method for spancat_exclusive (since
it's now the same as in spancat).

* Inherit from SpanCat instead of TrainablePipe

This commit changes the inheritance structure of Exclusive_Spancat,
which now inherits from SpanCategorizer rather than TrainablePipe. This
allows me to remove duplicate methods that are already present in
the parent class.

* Revert documentation link to spancat

* Fix init call for exclusive spancat

* Update spacy/pipeline/spancat_exclusive.py

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

* Import Suggester from spancat

* Include zero_init.v1 for spancat

* Implement _allow_extra_label to use _n_labels

To ensure that spancat / spancat_exclusive cannot be resized after
initialization, I inherited the _allow_extra_label() method from
spacy/pipeline/trainable_pipe.pyx and used self._n_labels instead
of len(self.labels) for checking.

I think that changing it locally is a better solution rather than
forcing each class that inherits TrainablePipe to use the self._n_labels
attribute.

Also note that I turned off black formatting in this block of code
because it reads better without the overhang.

* Extend existing tests to spancat_exclusive

In this commit, I extended the existing tests for spancat to include
spancat_exclusive. I parametrized the test functions with 'name'
(similar var name with textcat and textcat_multilabel) for each
applicable test.

TODO: Add overfitting tests for spancat_exclusive

* Update documentation for spancat

* Turn on formatting for allow_extra_label

* Remove initializers in default config

* Use DEFAULT_EXCL_SPANCAT_MODEL

I also renamed spancat_exclusive_default_config to
spancat_excl_default_config because black otherwise makes some
unattractive formatting changes.

* Update documentation

Update grammar and usage

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

* Clarify docstring for Exclusive_SpanCategorizer

* Remove mypy ignore and typecast labels to list

* Fix documentation API

* Use a single variable for tests

* Update defaults for number of rows

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

* Put back initializers in spancat config

Whenever I remove model.scorer.init_w and model.scorer.init_b,
I encounter an error in the test:

    SystemError: <method '__getitem__' of 'dict' objects> returned a result
    with an error set.

My Thinc version is 8.1.5, but I can't seem to check what's causing the
error.

* Update spancat_exclusive docstring

* Remove init_W and init_B parameters

This commit is expected to fail until the new Thinc release.

* Require thinc>=8.1.6 for serializable Softmax defaults

* Handle zero suggestions to make tests pass

I'm not sure if this is the most elegant solution. But what should
happen is that the _make_span_group function MUST return an empty
SpanGroup if there are no suggestions.

The error happens when the 'scores' variable is empty. We cannot
get the 'predicted' and other downstream vars.

* Better approach for handling zero suggestions

* Update website/docs/api/spancategorizer.md

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

* Update spancategorizer headers

* Apply suggestions from code review

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Add default value in negative_weight in docs

* Add default value in allow_overlap in docs

* Update how spancat_exclusive is constructed

In this commit, I added the following:
- Put the default values of negative_weight and allow_overlap
    in the default_config dictionary.
- Rename make_spancat -> make_exclusive_spancat

* Run prettier on spancategorizer.mdx

* Change exactly one -> at most one

* Add suggester documentation in Exclusive_SpanCategorizer

* Add suggester to spancat docstrings

* merge multilabel and singlelabel spancat

* rename spancat_exclusive to singlelabel

* wire up different make_spangroups for single and multilabel

* black

* black

* add docstrings

* more docstring and fix negative_label

* don't rely on default arguments

* black

* remove spancat exclusive

* replace single_label with add_negative_label and adjust inference

* mypy

* logical bug in configuration check

* add spans.attrs[scores]

* single label make_spangroup test

* bugfix

* black

* tests for make_span_group with negative labels

* refactor make_span_group

* black

* Update spacy/tests/pipeline/test_spancat.py

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

* remove duplicate declaration

* Update spacy/pipeline/spancat.py

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

* raise error instead of just print

* make label mapper private

* update docs

* run prettier

* Update website/docs/api/spancategorizer.mdx

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

* Update website/docs/api/spancategorizer.mdx

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

* Update spacy/pipeline/spancat.py

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

* Update spacy/pipeline/spancat.py

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

* Update spacy/pipeline/spancat.py

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

* Update spacy/pipeline/spancat.py

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

* don't keep recomputing self._label_map for each span

* typo in docs

* Intervals to private and document 'name' param

* Update spacy/pipeline/spancat.py

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

* Update spacy/pipeline/spancat.py

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

* add Tag to new features

* replace tags

* revert

* revert

* revert

* revert

* Update website/docs/api/spancategorizer.mdx

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

* Update website/docs/api/spancategorizer.mdx

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

* prettier

* Fix merge

* Update website/docs/api/spancategorizer.mdx

* remove references to 'single_label'

* remove old paragraph

* Add spancat_singlelabel to config template

* Format

* Extend init config tests

---------

Co-authored-by: kadarakos <kadar.akos@gmail.com>
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
2023-03-09 10:30:59 +01:00
Victoria
4fdf356b29
Add links in website and readme for survey (#12385) 2023-03-09 10:01:18 +01:00
Marcus Blättermann
b309336712
Make sure to run Python setup before NPM dev mode (#12384) 2023-03-08 11:59:10 +01:00
Raphael Mitsch
6aa6b86d49
Make generation of empty KnowledgeBase instances configurable in EntityLinker (#12320)
* Make empty_kb() configurable.

* Format.

* Update docs.

* Be more specific in KB serialization test.

* Update KB serialization tests. Update docs.

* Remove doc update for batched candidate generation.

* Fix serialization of subclassed KB in tests.

* Format.

* Update docstring.

* Update docstring.

* Switch from pickle to json for custom field serialization.
2023-03-01 16:02:55 +01:00
kadarakos
56aa0cc75f
Displacy doc fix (#12352)
* more details for color setting

* more details for color setting

* prettier
2023-03-01 15:38:23 +01:00
Raphael Mitsch
efbc3d37b3
Update docs w.r.t. spacy.CandidateBatchGenerator.v1. (#12350) 2023-03-01 11:01:35 +01:00
Adriane Boyd
33864f1d07
Add new tags in docs for #12334 (#12348) 2023-03-01 10:46:13 +01:00
TAN Long
071667376a
Add new REL_OPs: >+, >-, <+, and <- (#12334)
* Add immediate left/right child/parent dependency relations

* Add tests for new REL_OPs: `>+`, `>-`, `<+`, and `<-`.
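
A rough sketch of what the four operators check, assuming my reading of
the operator names is right (`>` for child, `<` for parent, `+` for
immediately to the right, `-` for immediately to the left). The `heads`
list and all function names are hypothetical; this is plain Python, not
the DependencyMatcher code.

```python
from typing import List


def is_head_of(heads: List[int], parent: int, child: int) -> bool:
    # heads[i] is the index of token i's head; the root points at itself.
    return child != parent and heads[child] == parent


def right_immediate_child(heads: List[int], a: int, b: int) -> bool:
    # A >+ B: B is a child of A and sits directly after A.
    return is_head_of(heads, a, b) and b == a + 1


def left_immediate_child(heads: List[int], a: int, b: int) -> bool:
    # A >- B: B is a child of A and sits directly before A.
    return is_head_of(heads, a, b) and b == a - 1


def right_immediate_parent(heads: List[int], a: int, b: int) -> bool:
    # A <+ B: B is the head of A and sits directly after A.
    return is_head_of(heads, b, a) and b == a + 1


def left_immediate_parent(heads: List[int], a: int, b: int) -> bool:
    # A <- B: B is the head of A and sits directly before A.
    return is_head_of(heads, b, a) and b == a - 1


# Toy parse of "The quick fox jumped high" (indices 0..4).
heads = [2, 2, 3, 3, 3]
```

In the toy parse, "quick" is a left immediate child of "fox", while "The"
is a child of "fox" but not an immediate one.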

---------

Co-authored-by: Tan Long <tanloong@foxmail.com>
2023-02-28 14:36:33 +01:00
Adriane Boyd
4539fbae17
Revert "Fix FUZZY operator definition (#12318)" (#12336)
This reverts commit daedc45d05.

The default length depends on the length of the pattern string and was
correct for this example.
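
To illustrate the length dependence, a small sketch. The rule used here
(at least 2 edits, otherwise roughly 30% of the pattern length) is my
recollection of the matcher documentation and should be treated as an
assumption; `default_max_edits` is a hypothetical helper, not a spaCy
function.

```python
def default_max_edits(pattern: str) -> int:
    # Assumed default: never fewer than 2 edits, and about 30% of the
    # pattern length for longer patterns.
    return max(2, round(0.3 * len(pattern)))


short = default_max_edits("cat")          # short pattern hits the floor
longer = default_max_edits("definitely")  # 10 characters
```

Under this assumption a 10-character pattern allows 3 edits, while short
patterns fall back to the minimum of 2.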
2023-02-27 09:48:36 +01:00
andyjessen
daedc45d05
Fix FUZZY operator definition (#12318)
* Fix FUZZY operator definition

The default length of the FUZZY operator is 2 and not 3.

* adjust edit distance in matcher usage docs too

---------

Co-authored-by: svlandeg <svlandeg@github.com>
2023-02-23 09:37:40 +01:00
Raphael Mitsch
2d4fb94ba0
Fix wrong file name in docs for rule-based matcher. (#12262) 2023-02-09 12:58:14 +01:00
Raphael Mitsch
d38a88f0f3
Remove negation. (#12252) 2023-02-08 14:18:33 +01:00
Sofie Van Landeghem
4c60afb946
Backslash fixes in docs (#12213)
* backslash fixes

* revert unrelated change
2023-02-01 10:15:38 +01:00
Paul O'Leary McCann
8932f4dc35
Add extra flag to assets docs (#12194)
* Add extra flag to assets docs

For some reason this wasn't included.

* Add new tag to docs
2023-01-30 10:05:23 +01:00
Sofie Van Landeghem
bd739e67d6
explain KB change and how to remedy (#12189) 2023-01-27 15:13:20 +01:00
Adriane Boyd
5f8a398bb9
Add span_id to Span.char_span, update Doc/Span.char_span docs (#12196)
* Add span_id to Span.char_span, update Doc/Span.char_span docs

`Span.char_span(id=)` should be removed in the future.

* Also use Union[int, str] in Doc docstring
2023-01-27 15:09:17 +01:00
Simon Gurcke
774c10fa39
Add alignment_mode argument to Span.char_span() (#12145)
* Add alignment_mode argument to Span.char_span()

* Update website

* Update spacy/tokens/span.pyx

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

* Add test

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
2023-01-27 11:43:40 +01:00
Daniël de Kok
8d69874afb
Add spacy.PlainTextCorpusReader.v1 (#12122)
* Add `spacy.PlainTextCorpusReader.v1`

This is a corpus reader that reads plain text corpora with the following
format:

- UTF-8 encoding.
- One line per document.
- Blank lines are ignored.

It is useful for applications where we deal with very large corpora,
such as distillation, and don't want to deal with the space overhead of
serialized formats. Additionally, many large corpora already use such
a text format, keeping the necessary preprocessing to a minimum.
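
A minimal sketch of reading that format with the standard library only.
`read_plain_text` is a hypothetical helper that yields raw strings; the
actual reader is registered as `spacy.PlainTextCorpusReader.v1` and its
internals are not shown here.

```python
from pathlib import Path
from typing import Iterator


def read_plain_text(path: Path) -> Iterator[str]:
    """Yield one text per non-blank line of a UTF-8 file."""
    with open(path, encoding="utf8") as f:
        for line in f:
            # Only strip the line terminator, keeping other whitespace.
            text = line.rstrip("\r\n")
            # Blank lines are ignored.
            if text:
                yield text
```

Because the file is streamed line by line, the whole corpus never has to
be held in memory. In this sketch a line containing only spaces is kept,
since only newline and carriage return are stripped.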

* Update spacy/training/corpus.py

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

* docs: add version to `PlainTextCorpus`

* Add docstring to registry function

* Add plain text corpus tests

* Only strip newline/carriage return

* Add return type to _string_to_tmp_file helper

* Use a temporary directory in place of file name

Different OS auto delete/sharing semantics are just wonky.

* This will be new in 3.5.1 (rather than 4)

* Test improvements from code review

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
2023-01-26 11:33:22 +01:00
Marcus Blättermann
a37117abd0
Fix text colors in docs (#12186) 2023-01-26 10:30:24 +01:00
Marcus Blättermann
056b73468c
Load components dynamically (decrease initial file size for docs) (#12175)
* Extract `CodeBlock` component into own file

* Extract `InlineCode` component into own file

* Extract `TypeAnnotation` component into own file

* Convert named `export` to `default export`

* Remove unused `export`

* Simplify `TypeAnnotation` to remove dependency for Prism

* Load `Code` component dynamically

* Extract `MarkdownToReact` component into own file

* WIP Code Dynamic

* Load `MarkdownToReact` component dynamically

* Extract `htmlToReact` to own file

* Load `htmlToReact` component dynamically

* Dynamically load `Juniper`
2023-01-25 17:30:41 +01:00
Marcus Blättermann
11f10fff60
Fix frontpage image (#12184) 2023-01-25 13:17:35 +01:00
Marcus Blättermann
5a6000fb8b
Fix text color in docs (#12183)
* Fix text color on landing page

* Fix code color
2023-01-25 13:14:32 +01:00
Adriane Boyd
8ea15240ca
Update binder version to v3.5 (#12153) 2023-01-25 13:14:23 +01:00