Commit Graph

16269 Commits

Author SHA1 Message Date
Magdalena Aniol
1c0205967d fix training.batch_size example (#12963) 2023-09-06 16:39:35 +02:00
Magdalena Aniol
cc78847688
fix training.batch_size example (#12963) 2023-09-06 16:38:13 +02:00
Sofie Van Landeghem
807f36eaa1 Fix LLM usage example (#12950)
* fix usage example

* revert back to v2 to allow hot fix on main
2023-09-05 08:56:00 +02:00
Sofie Van Landeghem
6d1f6d9a23
Fix LLM usage example (#12950)
* fix usage example

* revert back to v2 to allow hot fix on main
2023-09-04 09:05:50 +02:00
Sofie Van Landeghem
642a4de63f fix typo in link (#12948)
* fix typo in link

* fix REL.v1 parameter
2023-09-01 13:49:03 +02:00
Sofie Van Landeghem
5c1f9264c2
fix typo in link (#12948)
* fix typo in link

* fix REL.v1 parameter
2023-09-01 13:47:20 +02:00
David Berenstein
d501b819ce updated add_pipe docs (#12947) 2023-09-01 11:06:58 +02:00
David Berenstein
065ead4eed
updated add_pipe docs (#12947) 2023-09-01 11:05:36 +02:00
vincent d warmerdam
238434b6b4 Update large-language-models.mdx (#12944) 2023-08-30 11:59:23 +02:00
vincent d warmerdam
3e4264899c
Update large-language-models.mdx (#12944) 2023-08-30 11:58:14 +02:00
Ines Montani
117c8f1e0e Add headers to netlify.toml [ci skip] 2023-08-30 11:55:38 +02:00
Ines Montani
52758e1afa Add headers to netlify.toml [ci skip] 2023-08-30 11:55:23 +02:00
Vinit Ravishankar
c2303858e6
Documentation for spacy-curated-transformers (#12677)
* initial

* initial documentation run

* fix typo

* Remove mentions of Torchscript and quantization

Both are disabled in the initial release of `spacy-curated-transformers`.

* Fix `piece_encoder` entries

* Remove `spacy-transformers`-specific warning

* Fix duplicate entries in tables

* Doc fixes

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Remove type aliases

* Fix copy-paste typo

* Change `debug pieces` version tag to `3.7`

* Set curated transformers API version to `3.7`

* Fix transformer listener naming

* Add docs for `init fill-config-transformer`

* Update CLI command invocation syntax

* Update intro section of the pipeline component docs

* Fix source URL

* Add a note to the architectures section about the `init fill-config-transformer` CLI command

* Apply suggestions from code review

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Update CLI command name, args

* Remove hyphen from the `curated-transformers.mdx` filename

* Fix links

* Remove placeholder text

* Add text to the model/tokenizer loader sections

* Fill in the `DocTransformerOutput` section

* Formatting fixes

* Add curated transformer page to API docs sidebar

* More formatting fixes

* Remove TODO comment

* Remove outdated info about default config

* Apply suggestions from code review

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Add link to HF model hub

* `prettier`

---------

Co-authored-by: Madeesh Kannan <shadeMe@users.noreply.github.com>
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
2023-08-29 17:52:16 +02:00
PD Hall
439a5165a6 docs: fix ngram_range_suggester max_size description (#12939) 2023-08-29 11:12:16 +02:00
PD Hall
d8a32c1050
docs: fix ngram_range_suggester max_size description (#12939) 2023-08-29 11:10:58 +02:00
Sofie Van Landeghem
869cc4ab0b
warn when an unsupported/unknown key is given to the dependency matcher (#12928) 2023-08-22 09:03:35 +02:00
Connor Brinton
ea2bb91e9b 📝 Fix formula for receptive field in docs (#12918)
spaCy's HashEmbedCNN layer performs convolutions over tokens to produce
contextualized embeddings using a `MaxoutWindowEncoder` layer. These
convolutions are implemented using Thinc's `expand_window` layer, which
concatenates `window_size` neighboring sequence items on either side of
the sequence item being processed. This is repeated across `depth`
convolutional layers.

For example, consider the sequence "ABCDE" and a `MaxoutWindowEncoder`
layer with a context window of 1 and a depth of 2. We'll focus on the
token "C". We can visually represent the contextual embedding produced
for "C" as:
```mermaid
flowchart LR
A0(A<sub>0</sub>)
B0(B<sub>0</sub>)
C0(C<sub>0</sub>)
D0(D<sub>0</sub>)
E0(E<sub>0</sub>)
B1(B<sub>1</sub>)
C1(C<sub>1</sub>)
D1(D<sub>1</sub>)
C2(C<sub>2</sub>)
A0 --> B1
B0 --> B1
C0 --> B1
B0 --> C1
C0 --> C1
D0 --> C1
C0 --> D1
D0 --> D1
E0 --> D1
B1 --> C2
C1 --> C2
D1 --> C2
```

Described in words, this graph shows that before the first layer of the
convolution, the "receptive field" centered at each token consists only
of that same token. That is to say, we have a receptive field of 1.
The first layer of the convolution adds one neighboring token on either
side to the receptive field. Since this is done on both sides, the
receptive field increases by 2, giving the first layer a receptive field
of 3. The second layer of the convolution adds an _additional_
neighboring token on either side to the receptive field, giving a final
receptive field of 5.

However, this doesn't match the formula currently given in the docs,
which read:
> The receptive field of the CNN will be
> `depth * (window_size * 2 + 1)`, so a 4-layer network with a window
> size of `2` will be sensitive to 20 words at a time.

Substituting in our depth of 2 and window size of 1, this formula gives
us a receptive field of:
```
depth * (window_size * 2 + 1)
= 2 * (1 * 2 + 1)
= 2 * (2 + 1)
= 2 * 3
= 6
```

This not only doesn't match our computations from above, it's also an
even number! This is suspicious, since the receptive field is supposed
to be centered on a token, and not between tokens. Generally, this
formula results in an even number for any even value of `depth`.

The error in this formula is that the adjustment for the center token
is multiplied by the depth, when it should occur only once. The
corrected formula, `depth * window_size * 2 + 1`, gives the correct
value for our small example from above:
```
depth * window_size * 2 + 1
= 2 * 1 * 2 + 1
= 4 + 1
= 5
```
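
As a quick check of the arithmetic above, here is a minimal Python sketch comparing the two formulas. The function names are hypothetical and are not part of spaCy or Thinc; the code only restates the formulas quoted in this message.

```python
def receptive_field_old(depth: int, window_size: int) -> int:
    """Formula previously given in the docs (counts the center token once per layer)."""
    return depth * (window_size * 2 + 1)


def receptive_field(depth: int, window_size: int) -> int:
    """Corrected formula (counts the center token exactly once)."""
    return depth * window_size * 2 + 1


# Small example from this message: depth 2, window size 1.
print(receptive_field_old(depth=2, window_size=1))  # 6
print(receptive_field(depth=2, window_size=1))      # 5

# Example quoted from the docs: 4 layers, window size 2.
print(receptive_field_old(depth=4, window_size=2))  # 20
print(receptive_field(depth=4, window_size=2))      # 17
```

For the 4-layer, window-size-2 case quoted from the docs, the original formula gives 20 and the corrected formula gives 17.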

These changes update the docs to correct the receptive field formula and
the example receptive field size.
2023-08-21 10:53:14 +02:00
Connor Brinton
6dd56868de
📝 Fix formula for receptive field in docs (#12918)
spaCy's HashEmbedCNN layer performs convolutions over tokens to produce
contextualized embeddings using a `MaxoutWindowEncoder` layer. These
convolutions are implemented using Thinc's `expand_window` layer, which
concatenates `window_size` neighboring sequence items on either side of
the sequence item being processed. This is repeated across `depth`
convolutional layers.

For example, consider the sequence "ABCDE" and a `MaxoutWindowEncoder`
layer with a context window of 1 and a depth of 2. We'll focus on the
token "C". We can visually represent the contextual embedding produced
for "C" as:
```mermaid
flowchart LR
A0(A<sub>0</sub>)
B0(B<sub>0</sub>)
C0(C<sub>0</sub>)
D0(D<sub>0</sub>)
E0(E<sub>0</sub>)
B1(B<sub>1</sub>)
C1(C<sub>1</sub>)
D1(D<sub>1</sub>)
C2(C<sub>2</sub>)
A0 --> B1
B0 --> B1
C0 --> B1
B0 --> C1
C0 --> C1
D0 --> C1
C0 --> D1
D0 --> D1
E0 --> D1
B1 --> C2
C1 --> C2
D1 --> C2
```

Described in words, this graph shows that before the first layer of the
convolution, the "receptive field" centered at each token consists only
of that same token. That is to say, we have a receptive field of 1.
The first layer of the convolution adds one neighboring token on either
side to the receptive field. Since this is done on both sides, the
receptive field increases by 2, giving the first layer a receptive field
of 3. The second layer of the convolution adds an _additional_
neighboring token on either side to the receptive field, giving a final
receptive field of 5.
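
The propagation described above can also be checked with a small simulation. This is a minimal sketch in plain Python, not spaCy or Thinc code: `receptive_fields` is a hypothetical helper that tracks which input tokens each position depends on after every layer, assuming each layer merges `window_size` neighbors on either side.

```python
from typing import List, Set


def receptive_fields(n_tokens: int, window_size: int, depth: int) -> List[Set[int]]:
    """Return, for each position, the set of input indices it depends on."""
    # Before any convolution, each position sees only itself.
    fields = [{i} for i in range(n_tokens)]
    for _ in range(depth):
        new_fields = []
        for i in range(n_tokens):
            lo = max(0, i - window_size)
            hi = min(n_tokens, i + window_size + 1)
            merged: Set[int] = set()
            # Merge what the neighbors already saw in the previous layer.
            for j in range(lo, hi):
                merged |= fields[j]
            new_fields.append(merged)
        fields = new_fields
    return fields


tokens = list("ABCDE")
fields = receptive_fields(len(tokens), window_size=1, depth=2)
c_field = sorted(tokens[i] for i in fields[2])  # index 2 is the token "C"
print(c_field)       # ['A', 'B', 'C', 'D', 'E']
print(len(c_field))  # 5
```

Tracking sets of indices rather than embeddings keeps the sketch focused on which tokens contribute, which is all the receptive field measures.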

However, this doesn't match the formula currently given in the docs,
which read:
> The receptive field of the CNN will be
> `depth * (window_size * 2 + 1)`, so a 4-layer network with a window
> size of `2` will be sensitive to 20 words at a time.

Substituting in our depth of 2 and window size of 1, this formula gives
us a receptive field of:
```
depth * (window_size * 2 + 1)
= 2 * (1 * 2 + 1)
= 2 * (2 + 1)
= 2 * 3
= 6
```

This not only doesn't match our computations from above, it's also an
even number! This is suspicious, since the receptive field is supposed
to be centered on a token, and not between tokens. Generally, this
formula results in an even number for any even value of `depth`.

The error in this formula is that the adjustment for the center token
is multiplied by the depth, when it should occur only once. The
corrected formula, `depth * window_size * 2 + 1`, gives the correct
value for our small example from above:
```
depth * window_size * 2 + 1
= 2 * 1 * 2 + 1
= 4 + 1
= 5
```

These changes update the docs to correct the receptive field formula and
the example receptive field size.
2023-08-21 10:52:32 +02:00
Adriane Boyd
198488ee86
Extend to weasel v0.3 (#12908)
* Extend to weasel v0.3

* Clean up unused imports in test_cli
2023-08-16 17:36:53 +02:00
Adriane Boyd
47a2b58af2 Docs: clarify abstract spacy.load examples (#12889) 2023-08-16 17:30:43 +02:00
William Mattingly
94c390d349 Update universe.json (#12904)
* Update universe.json

added hobbit-spacy to the universe json

* Update universe.json

removed displacy from hobbit-spacy and added a default text.
2023-08-16 17:30:30 +02:00
Adriane Boyd
76a9f9c6c6
Docs: clarify abstract spacy.load examples (#12889) 2023-08-16 17:28:34 +02:00
William Mattingly
64b8ee2dbe
Update universe.json (#12904)
* Update universe.json

added hobbit-spacy to the universe json

* Update universe.json

removed displacy from hobbit-spacy and added a default text.
2023-08-14 16:44:14 +02:00
denizcodeyaa
d50b8d51e2
Update examples.py (#12895)
Add: example sentences to improve the Turkish model. Let's get the tr_web_core_sm out in the world yaa
2023-08-11 15:38:06 +02:00
Adriane Boyd
6a4aa43164
Extend to thinc v8.2 (#12897) 2023-08-11 13:05:46 +02:00
Adriane Boyd
9622c11529
Extend to weasel v0.2 (#12902) 2023-08-11 10:59:51 +02:00
Adriane Boyd
6ef29c4115
Merge pull request #12901 from adrianeboyd/feature/spacy-transformers-v1.3-revert
Revert "Extend to spacy-transformers v1.3.x (#12877)"
2023-08-10 16:43:10 +02:00
Adriane Boyd
060241a8d5 Revert "Extend to spacy-transformers v1.3.x (#12877)"
This reverts commit e5773e0c69.
2023-08-10 11:42:09 +02:00
Raphael Mitsch
a20f54fb91 Update incorrect example config. (#12893) 2023-08-09 10:18:04 +02:00
Adriane Boyd
458bc5f45c
Set version to v3.6.1 (#12892) 2023-08-08 15:04:13 +02:00
Adriane Boyd
c4e378df97
Update CuPy extras (#12890)
* Add `cuda12x` for `cupy-cuda12x`.
* Drop `cuda-autodetect` from quickstart, set default to `cuda11x`
instead.
2023-08-08 12:58:28 +02:00
Adriane Boyd
245e2ddc25
Allow pydantic v2 using transitional v1 support (#12888) 2023-08-08 11:27:28 +02:00
Adriane Boyd
45af8a5dcf
Update br tags (#12882)
* Fix displacy br tag

* Prefer <br>, also update package CLI
2023-08-04 10:52:41 +02:00
Sofie Van Landeghem
0ea7e22ba8 fix (#12881) 2023-08-03 08:38:16 +02:00
Sofie Van Landeghem
3b7faf4f5e
fix (#12881) 2023-08-03 08:37:43 +02:00
Arman Mohammadi
827cea6fc3 fix the regular expression matching on the full text (#12883)
The regex pattern contained a mistake that prevented it from matching all the desired tokens. When a pattern is written with the `r` raw string prefix, a backslash must not be doubled to represent a single backslash.
2023-08-02 16:52:50 +02:00
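
The raw-string pitfall described in the commit above (#12883) can be illustrated with a minimal sketch. The pattern and text here are hypothetical examples chosen for illustration; they are not the pattern that the commit changed.

```python
import re

text = "Version 3 of 10"

# In a raw string, "\\d" reaches the regex engine as an escaped backslash
# followed by "d", so it matches a literal backslash and then the letter d.
doubled = re.compile(r"\\d+")

# A single backslash in a raw string gives the intended digit class.
single = re.compile(r"\d+")

doubled_matches = doubled.findall(text)
single_matches = single.findall(text)

print(doubled_matches)  # []
print(single_matches)   # ['3', '10']
```

The doubled form only matches text that contains an actual backslash character, which is why it silently misses the tokens the pattern was meant to find.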
Arman Mohammadi
07407e07ab
fix the regular expression matching on the full text (#12883)
The regex pattern contained a mistake that prevented it from matching all the desired tokens. When a pattern is written with the `r` raw string prefix, a backslash must not be doubled to represent a single backslash.
2023-08-02 16:52:26 +02:00
Adriane Boyd
e5773e0c69
Extend to spacy-transformers v1.3.x (#12877) 2023-08-02 09:35:16 +02:00
Sofie Van Landeghem
0737443096
feat: add example stubs (3) (#12801)
* feat: add example stubs

* fix: add required annotations

* fix: mypy issues

* fix: use Py36-compatible Protocol

* Minor reformatting

* adding further type specifications and removing internal methods

* black formatting

* widen type to iterable

* add private methods that are being used by the built-in convertors

* revert changes to corpus.py

* fixes

* fixes

* fix typing of PlainTextCorpus

---------

Co-authored-by: Basile Dura <basile@bdura.me>
Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
2023-08-02 08:15:12 +02:00
Madeesh Kannan
222bd3c5b1
Display model's full base version string in incompatibility warning (#12857) 2023-08-02 08:06:41 +02:00
Adriane Boyd
0fe43f40f1
Support registered vectors (#12492)
* Support registered vectors

* Format

* Auto-fill [nlp] on load from config and from bytes/disk

* Only auto-fill [nlp]

* Undo all changes to Language.from_disk

* Expand BaseVectors

These methods are needed in various places for training and vector
similarity.

* isort

* More linting

* Only fill [nlp.vectors]

* Update spacy/vocab.pyx

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Revert changes to test related to auto-filling [nlp]

* Add vectors registry

* Rephrase error about vocab methods for vectors

* Switch to dummy implementation for BaseVectors.to_ops

* Add initial draft of docs

* Remove example from BaseVectors docs

* Apply suggestions from code review

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Update website/docs/api/basevectors.mdx

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Fix type and lint bpemb example

* Update website/docs/api/basevectors.mdx

---------

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
2023-08-01 15:46:08 +02:00
Peter Baumgartner
a0a195688f
Tests for CLI app - init config generates train-able config (#12173)
* remove migration support form

* initial test commit

* add fixture

* add combo test

* pull out parameter example data

* fix formatting on examples

* remove unused import

* remove unnecessary fmt:off instructions

* only set logger level if verbose flag is explicitly set

---------

Co-authored-by: svlandeg <svlandeg@github.com>
2023-07-31 14:45:04 +02:00
Andy Friedman
5027b7d0a8 added entry for SaysWho (#12828)
* Update universe.json

added entry for Sayswho

* Update universe.json

updated sayswho entry

* Update universe.json

* Update website/meta/universe.json

* Update website/meta/universe.json

---------

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
2023-07-31 10:52:58 +02:00
Andy Friedman
186889ec9c
added entry for SaysWho (#12828)
* Update universe.json

added entry for Sayswho

* Update universe.json

updated sayswho entry

* Update universe.json

* Update website/meta/universe.json

* Update website/meta/universe.json

---------

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
2023-07-31 10:52:32 +02:00
Sofie Van Landeghem
c9e9dccf79
Add displaCy data structures to docs (2) (#12875)
* Add data structures to docs

* Adjusted descriptions for more consistency

* Add _optional_ flag to parameters

* Add tests and adjust optional title key in doc

* Add title to dep visualizations

* fix typo

---------

Co-authored-by: thomashacker <EdwardSchmuhl@web.de>
2023-07-31 10:47:57 +02:00
Victoria
49055ed7c8
Add cli for finding locations of registered func (#12757)
* Add cli for finding locations of registered func

* fixes: naming and typing

* isort

* update naming

* remove to find-function

* remove file:// bit

* use registry name if given and exit gracefully if a registry was not found

* clean up failure msg

* specify registry_name options

* mypy fixes

* return location for internal usage

* add documentation

* more mypy fixes

* clean up example

* add section to menu

* add tests

---------

Co-authored-by: svlandeg <svlandeg@github.com>
2023-07-31 09:39:00 +02:00
Adriane Boyd
9ffa5d8a15
Remove ray extra (#12870) 2023-07-28 15:48:36 +02:00
Márton Kardos
4341881a05 Added OdyCy to spaCy Universe (#12826)
* Added OdyCy to spaCy Universe

* Replaced template tags

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

---------

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
2023-07-26 16:06:58 +02:00
Márton Kardos
51b9655470
Added OdyCy to spaCy Universe (#12826)
* Added OdyCy to spaCy Universe

* Replaced template tags

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

---------

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
2023-07-26 16:05:53 +02:00
Madeesh Kannan
d729a61a7d SpanCat: Remove invalid threshold config argument (#12860) 2023-07-26 13:57:19 +02:00