Commit Graph

3152 Commits

Author SHA1 Message Date
Adriane Boyd
e467573550
Docs: update trf_data examples and pipeline design info (#13164) 2023-12-04 15:15:54 +01:00
Daniël de Kok
da7ad97519
Update TextCatBOW to use the fixed SparseLinear layer (#13149)
* Update `TextCatBOW` to use the fixed `SparseLinear` layer

A while ago, we fixed the `SparseLinear` layer to use all available
parameters: https://github.com/explosion/thinc/pull/754

This change updates `TextCatBOW` to `v3` which uses the new
`SparseLinear_v2` layer. This results in a sizeable improvement on a
text categorization task that was tested.

While at it, this `spacy.TextCatBOW.v3` also adds the `length_exponent`
option to make it possible to change the hidden size. Ideally, we'd just
have an option called `length`. But the way that `TextCatBOW` uses
hashes results in a non-uniform distribution of parameters when the
length is not a power of two.

* Replace TexCatBOW `length_exponent` parameter by `length`

We now round up the length to the next power of two if it isn't
a power of two.

* Remove some tests for TextCatBOW.v2

* Fix missing import
2023-11-29 09:11:54 +01:00
Ines Montani
8f69e56a5a Add swag [ci skip] 2023-11-20 14:42:01 +01:00
Lise
b6e022381d
Feature/nn and fo language extensions (#13116)
* add language extensions for norwegian nynorsk and faroese

* update docstring for nn/examples.py

* use relative imports

* add fo and nn tokenizers to pytest fixtures

* add unittests for fo and nn and fix bug in nn

* remove module docstring from fo/__init__.py

* add comments about example sentences' origin

* add license information to faroese data credit

* format unittests using black

* add __init__ files to test/lang/nn and tests/lang/fo

* fix import order and use relative imports in fo/__nit__.py and nn/__init__.py

* Make the tests a bit more compact

* Add fo and nn to website languages

* Add note about jul.

* Add "jul." as exception

---------

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
2023-11-20 07:49:59 +01:00
ajbond
9f2ce6bb00
Add Redfield NLP Nodes to the Spacy Universe (#13133) 2023-11-17 09:48:02 +01:00
Adriane Boyd
513bbd5fa3
Add preferred use of build for package CLI (#13109)
Build with `build` if available. Warn and fall back to previous
`setup.py`-based builds if `build` build fails.
2023-11-08 17:35:24 +01:00
Sofie Van Landeghem
a804b83a4b
Update llm docs to clarify task-specific factories (#13082)
* fix typo

* add examples to specify custom model for task-specific factory
2023-10-31 22:07:07 +01:00
Sofie Van Landeghem
48248c62b6
Clarify EL example in docs (#13071)
* add comment that pipeline is a custom one

* add link to NEL tutorial

* prettier

* revert prettier reformat

* revert prettier reformat (2)

* fix typo

Co-authored-by: Raphael Mitsch <r.mitsch@outlook.com>

---------

Co-authored-by: Raphael Mitsch <r.mitsch@outlook.com>
2023-10-31 21:58:29 +01:00
Raphael Mitsch
0c15876502
Fix spancat typo. (#13095) 2023-10-31 13:45:10 +01:00
Raphael Mitsch
9deaac9786
Add note in docs on score_weight config if using a non-default spans_key for SpanCat (#13093)
* Add note on score_weight if using a non-default span_key for SpanCat.

* Fix formatting.

* Fix formatting.

* Fix typo.

* Use warning infobox.

* Fix infobox formatting.
2023-10-30 17:02:08 +01:00
Raphael Mitsch
d72029d9c8
Add binary examples for Textcat task in spacy-llm (#13051)
* Add examples for binary classification.

* Fix example.

* Remove binary textcat example. Format.

* Rephrase.
2023-10-11 12:23:38 +02:00
Ines Montani
65e7bd54f5 Update usage sidebar and nav alert [ci skip] 2023-10-06 14:36:37 +02:00
Ines Montani
b83f1e3724
Inline displaCy visualizations in docs (#13050) [ci skip] 2023-10-06 14:22:43 +02:00
Raphael Mitsch
be29216fe2
Merge pull request #13044 from explosion/docs/llm_main
Sync `master` with `docs/llm_main`
2023-10-05 16:10:19 +02:00
Raphael Mitsch
1162fcf099
Add Mistral mentions. (#13037) 2023-10-05 14:44:38 +02:00
Raphael Mitsch
862f8254e8
Add docs on Azure OpenAI support in spacy-llm (#13043)
* Add gpt-3.5-turbo-instruct to list of supported OpenAI models.

* Update `spacy-llm` task argument docs w.r.t. task refactoring (#12995)

* Update task arguments w.r.t. task refactoring in 0.5.0.

* Add disclaimer w.r.t. gated models/Llama 2.

* Update website/docs/api/large-language-models.mdx

* Update website/docs/api/large-language-models.mdx

* Update docs w.r.t. PaLM support. (#13018)

* Add info on spacy.Azure.v1.

* Attempt to fix netlify check fails.

* Attempt to fix netlify check fails.

* Attempt to fix netlify check fails.

* Attempt to fix netlify check fails.

* Attempt to fix netlify check fails.

* Attempt to fix netlify check fails.

* Attempt to fix netlify check fails.

* Attempt to fix netlify check fails.

* Attempt to fix netlify check fails.

* Format.
2023-10-05 13:18:27 +02:00
Raphael Mitsch
1dec138e61
Update docs w.r.t. PaLM support. (#13018) 2023-10-05 08:50:41 +02:00
Adriane Boyd
6e54360a3d
Remove pathy dependency, update docs for cloudpathlib in Weasel (#13035) 2023-10-05 08:50:22 +02:00
Raphael Mitsch
734826db79
Update spacy-llm task argument docs w.r.t. task refactoring (#12995)
* Update task arguments w.r.t. task refactoring in 0.5.0.

* Add disclaimer w.r.t. gated models/Llama 2.

* Update website/docs/api/large-language-models.mdx

* Update website/docs/api/large-language-models.mdx
2023-10-05 08:45:25 +02:00
Adriane Boyd
92ce32aa3f
Update binder version to v3.7 (#13034) 2023-10-02 12:53:46 +02:00
Adriane Boyd
160e61772e
Docs for v3.7.0 (#13029)
* Docs for v3.7.0

* Minor fixes

* Extend Weasel notes

* Minor edits

* Update version in README
2023-10-01 21:40:07 +02:00
Adriane Boyd
406794a081 Merge remote-tracking branch 'upstream/master' into chore/update-develop-from-master-v3.7-1 2023-09-28 15:09:06 +02:00
Sergiu Nisioi
6255e38695
Adding rolegal model to the spaCy universe (#13017)
* adding rolegal model to the spaCy universe

* Fix formatting

* Use raw URL

* update image url and example

* fix pip and update url to raw

* okay, let's add thumb instead of image 🐙

* Update website/meta/universe.json

---------

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
2023-09-28 11:06:50 +02:00
Madeesh Kannan
b4501db6f8
Update emoji library in rule-based matcher example (#13014) 2023-09-25 18:20:30 +02:00
Adriane Boyd
ff4215f1c7
Drop support for python 3.6 (#13009)
* Drop support for python 3.6

* Update docs
2023-09-25 14:48:38 +02:00
Adriane Boyd
935a5455b6
Docs: add new tag for evaluate CLI --spans-keys (#13013) 2023-09-25 11:49:28 +02:00
Eliana Vornov
4e3360ad12
add --spans-key option for CLI spancat evaluation (#12981)
* add span key option for CLI evaluation

* Rephrase CLI help to refer to Doc.spans instead of spancat

* Rephrase docs to refer to Doc.spans instead of spancat

---------

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
2023-09-25 11:25:41 +02:00
Raphael Mitsch
bef9f63e13 Add gpt-3.5-turbo-instruct to list of supported OpenAI models. 2023-09-21 11:28:58 +02:00
Sofie Van Landeghem
8f0d6b0a8c
Fix in BertTokenizer docs (#12955)
* fix BertWordPieceTokenizer constructor call

* fix

* Update website/docs/usage/linguistic-features.mdx

---------

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
2023-09-13 13:21:58 +02:00
Sofie Van Landeghem
013762be41
Few spacy-llm doc fixes (#12969)
* fix construction example

* shorten task-specific factory list

* small edits to HF models

* small edit to API models

* typo

* fix space

Co-authored-by: Raphael Mitsch <r.mitsch@outlook.com>

---------

Co-authored-by: Raphael Mitsch <r.mitsch@outlook.com>
2023-09-08 11:35:38 +02:00
Sofie Van Landeghem
def7013eec
Docs for spacy-llm 0.5.0 (#12968)
* Update incorrect example config. (#12893)

* spacy-llm docs cleanup (#12945)

* Shorten NER section

* fix template references

* simplify sections

* set temperature to 0.0 in examples

* condense model information

* fix parameters for REST models

* set temperature to 0.0

* spelling fix

* trigger preview

* fix quotes

* add small note on noop.v1

* move up example noop config

* set appropriate model example configs

* explain config

* fix

Co-authored-by: Raphael Mitsch <r.mitsch@outlook.com>

---------

Co-authored-by: Raphael Mitsch <r.mitsch@outlook.com>

* Docs for ner.v3 and spancat.v3 spacy-llm tasks (#12949)

* formatting

* update usage table with NER.v3

* fix typo in links

* v3 overview of parameters

* add spancat.v3

* add further v3 explanations

* remove TODO comment

* few more small fixes

* Add doc section on LLM + task factories (#12905)

* Add section on LLM + task factories.

* Apply suggestions from code review

---------

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* add default config to openai models (#12961)

* Docs for spacy-llm 0.5.0 (#12967)

* simplify Python example

* simplify Python example

* Refer only to latest OpenAI model versions from usage doc

* Typo fix

Co-authored-by: Raphael Mitsch <r.mitsch@outlook.com>

* clarify accuracy claim

---------

Co-authored-by: Raphael Mitsch <r.mitsch@outlook.com>

---------

Co-authored-by: Raphael Mitsch <r.mitsch@outlook.com>
2023-09-08 10:25:14 +02:00
Magdalena Aniol
cc78847688
fix training.batch_size example (#12963) 2023-09-06 16:38:13 +02:00
Sofie Van Landeghem
6d1f6d9a23
Fix LLM usage example (#12950)
* fix usage example

* revert back to v2 to allow hot fix on main
2023-09-04 09:05:50 +02:00
Sofie Van Landeghem
5c1f9264c2
fix typo in link (#12948)
* fix typo in link

* fix REL.v1 parameter
2023-09-01 13:47:20 +02:00
David Berenstein
065ead4eed
updated add_pipe docs (#12947) 2023-09-01 11:05:36 +02:00
vincent d warmerdam
3e4264899c
Update large-language-models.mdx (#12944) 2023-08-30 11:58:14 +02:00
Ines Montani
52758e1afa Add headers to netlify.toml [ci skip] 2023-08-30 11:55:23 +02:00
Vinit Ravishankar
c2303858e6
Documentation for spacy-curated-transformers (#12677)
* initial

* initial documentation run

* fix typo

* Remove mentions of Torchscript and quantization

Both are disabled in the initial release of `spacy-curated-transformers`.

* Fix `piece_encoder` entries

* Remove `spacy-transformers`-specific warning

* Fix duplicate entries in tables

* Doc fixes

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Remove type aliases

* Fix copy-paste typo

* Change `debug pieces` version tag to `3.7`

* Set curated transformers API version to  `3.7`

* Fix transformer listener naming

* Add docs for `init fill-config-transformer`

* Update CLI command invocation syntax

* Update intro section of the pipeline component docs

* Fix source URL

* Add a note to the architectures section about the `init fill-config-transformer` CLI command

* Apply suggestions from code review

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Update CLI command name, args

* Remove hyphen from the `curated-transformers.mdx` filename

* Fix links

* Remove placeholder text

* Add text to the model/tokenizer loader sections

* Fill in the `DocTransformerOutput` section

* Formatting fixes

* Add curated transformer page to API docs sidebar

* More formatting fixes

* Remove TODO comment

* Remove outdated info about default config

* Apply suggestions from code review

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Add link to HF model hub

* `prettier`

---------

Co-authored-by: Madeesh Kannan <shadeMe@users.noreply.github.com>
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
2023-08-29 17:52:16 +02:00
PD Hall
d8a32c1050
docs: fix ngram_range_suggester max_size description (#12939) 2023-08-29 11:10:58 +02:00
Connor Brinton
6dd56868de
📝 Fix formula for receptive field in docs (#12918)
SpaCy's HashEmbedCNN layer performs convolutions over tokens to produce
contextualized embeddings using a `MaxoutWindowEncoder` layer. These
convolutions are implemented using Thinc's `expand_window` layer, which
concatenates `window_size` neighboring sequence items on either side of
the sequence item being processed. This is repeated across `depth`
convolutional layers.

For example, consider the sequence "ABCDE" and a `MaxoutWindowEncoder`
layer with a context window of 1 and a depth of 2. We'll focus on the
token "C". We can visually represent the contextual embedding produced
for "C" as:
```mermaid
flowchart LR
A0(A<sub>0</sub>)
B0(B<sub>0</sub>)
C0(C<sub>0</sub>)
D0(D<sub>0</sub>)
E0(E<sub>0</sub>)
B1(B<sub>1</sub>)
C1(C<sub>1</sub>)
D1(D<sub>1</sub>)
C2(C<sub>2</sub>)
A0 --> B1
B0 --> B1
C0 --> B1
B0 --> C1
C0 --> C1
D0 --> C1
C0 --> D1
D0 --> D1
E0 --> D1
B1 --> C2
C1 --> C2
D1 --> C2
```

Described in words, this graph shows that before the first layer of the
convolution, the "receptive field" centered at each token consists only
of that same token. That is to say, that we have a receptive field of 1.
The first layer of the convolution adds one neighboring token on either
side to the receptive field. Since this is done on both sides, the
receptive field increases by 2, giving the first layer a receptive field
of 3. The second layer of the convolutions adds an _additional_
neighboring token on either side to the receptive field, giving a final
receptive field of 5.

However, this doesn't match the formula currently given in the docs,
which read:
> The receptive field of the CNN will be
> `depth * (window_size * 2 + 1)`, so a 4-layer network with a window
> size of `2` will be sensitive to 20 words at a time.

Substituting in our depth of 2 and window size of 1, this formula gives
us a receptive field of:
```
depth * (window_size * 2 + 1)
= 2 * (1 * 2 + 1)
= 2 * (2 + 1)
= 2 * 3
= 6
```

This not only doesn't match our computations from above, it's also an
even number! This is suspicious, since the receptive field is supposed
to be centered on a token, and not between tokens. Generally, this
formula results in an even number for any even value of `depth`.

The error in this formula is that the adjustment for the center token
is multiplied by the depth, when it should occur only once. The
corrected formula, `depth * window_size * 2 + 1`, gives the correct
value for our small example from above:
```
depth * window_size * 2 + 1
= 2 * 1 * 2 + 1
= 4 + 1
= 5
```

These changes update the docs to correct the receptive field formula and
the example receptive field size.
2023-08-21 10:52:32 +02:00
Adriane Boyd
76a9f9c6c6
Docs: clarify abstract spacy.load examples (#12889) 2023-08-16 17:28:34 +02:00
William Mattingly
64b8ee2dbe
Update universe.json (#12904)
* Update universe.json

added hobbit-spacy to the universe json

* Update universe.json

removed displacy from hobbit-spacy and added a default text.
2023-08-14 16:44:14 +02:00
Adriane Boyd
c4e378df97
Update CuPy extras (#12890)
* Add `cuda12x` for `cupy-cuda12x`.
* Drop `cuda-autodetect` from quickstart, set default to `cuda11x`
instead.
2023-08-08 12:58:28 +02:00
Sofie Van Landeghem
3b7faf4f5e
fix (#12881) 2023-08-03 08:37:43 +02:00
Arman Mohammadi
07407e07ab
fix the regular expression matching on the full text (#12883)
There was a mistake in the regex pattern which caused not matching all the desired tokens. The problem was that when we use r string literal prefix to suppose a raw text, we should not use two backslashes to demonstrate a backslash.
2023-08-02 16:52:26 +02:00
Adriane Boyd
0fe43f40f1
Support registered vectors (#12492)
* Support registered vectors

* Format

* Auto-fill [nlp] on load from config and from bytes/disk

* Only auto-fill [nlp]

* Undo all changes to Language.from_disk

* Expand BaseVectors

These methods are needed in various places for training and vector
similarity.

* isort

* More linting

* Only fill [nlp.vectors]

* Update spacy/vocab.pyx

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Revert changes to test related to auto-filling [nlp]

* Add vectors registry

* Rephrase error about vocab methods for vectors

* Switch to dummy implementation for BaseVectors.to_ops

* Add initial draft of docs

* Remove example from BaseVectors docs

* Apply suggestions from code review

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Update website/docs/api/basevectors.mdx

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Fix type and lint bpemb example

* Update website/docs/api/basevectors.mdx

---------

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
2023-08-01 15:46:08 +02:00
Andy Friedman
186889ec9c
added entry for SaysWho (#12828)
* Update universe.json

added entry for Sayswho

* Update universe.json

updated sayswho entry

* Update universe.json

* Update website/meta/universe.json

* Update website/meta/universe.json

---------

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
2023-07-31 10:52:32 +02:00
Sofie Van Landeghem
c9e9dccf79
Add displaCy data structures to docs (2) (#12875)
* Add data structures to docs

* Adjusted descriptions for more consistency

* Add _optional_ flag to parameters

* Add tests and adjust optional title key in doc

* Add title to dep visualizations

* fix typo

---------

Co-authored-by: thomashacker <EdwardSchmuhl@web.de>
2023-07-31 10:47:57 +02:00
Victoria
49055ed7c8
Add cli for finding locations of registered func (#12757)
* Add cli for finding locations of registered func

* fixes: naming and typing

* isort

* update naming

* remove to find-function

* remove file:// bit

* use registry name if given and exit gracefully if a registry was not found

* clean up failure msg

* specify registry_name options

* mypy fixes

* return location for internal usage

* add documentation

* more mypy fixes

* clean up example

* add section to menu

* add tests

---------

Co-authored-by: svlandeg <svlandeg@github.com>
2023-07-31 09:39:00 +02:00
Márton Kardos
51b9655470
Added OdyCy to spaCy Universe (#12826)
* Added OdyCy to spaCy Universe

* Replaced template tags

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

---------

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
2023-07-26 16:05:53 +02:00