Commit Graph

16056 Commits

Author SHA1 Message Date
PD Hall
439a5165a6 docs: fix ngram_range_suggester max_size description (#12939) 2023-08-29 11:12:16 +02:00
Connor Brinton
ea2bb91e9b 📝 Fix formula for receptive field in docs (#12918)
SpaCy's HashEmbedCNN layer performs convolutions over tokens to produce
contextualized embeddings using a `MaxoutWindowEncoder` layer. These
convolutions are implemented using Thinc's `expand_window` layer, which
concatenates `window_size` neighboring sequence items on either side of
the sequence item being processed. This is repeated across `depth`
convolutional layers.

For example, consider the sequence "ABCDE" and a `MaxoutWindowEncoder`
layer with a context window of 1 and a depth of 2. We'll focus on the
token "C". We can visually represent the contextual embedding produced
for "C" as:
```mermaid
flowchart LR
A0(A<sub>0</sub>)
B0(B<sub>0</sub>)
C0(C<sub>0</sub>)
D0(D<sub>0</sub>)
E0(E<sub>0</sub>)
B1(B<sub>1</sub>)
C1(C<sub>1</sub>)
D1(D<sub>1</sub>)
C2(C<sub>2</sub>)
A0 --> B1
B0 --> B1
C0 --> B1
B0 --> C1
C0 --> C1
D0 --> C1
C0 --> D1
D0 --> D1
E0 --> D1
B1 --> C2
C1 --> C2
D1 --> C2
```

Described in words, this graph shows that before the first layer of the
convolution, the "receptive field" centered at each token consists only
of that same token. That is to say, we have a receptive field of 1.
The first layer of the convolution adds one neighboring token on either
side to the receptive field. Since this is done on both sides, the
receptive field increases by 2, giving the first layer a receptive field
of 3. The second layer of the convolution adds an _additional_
neighboring token on either side to the receptive field, giving a final
receptive field of 5.

However, this doesn't match the formula currently given in the docs,
which reads:
> The receptive field of the CNN will be
> `depth * (window_size * 2 + 1)`, so a 4-layer network with a window
> size of `2` will be sensitive to 20 words at a time.

Substituting in our depth of 2 and window size of 1, this formula gives
us a receptive field of:
```
depth * (window_size * 2 + 1)
= 2 * (1 * 2 + 1)
= 2 * (2 + 1)
= 2 * 3
= 6
```

This not only doesn't match our computations from above, it's also an
even number! This is suspicious, since the receptive field is supposed
to be centered on a token, and not between tokens. Generally, this
formula results in an even number for any even value of `depth`.

The error in this formula is that the adjustment for the center token
is multiplied by the depth, when it should occur only once. The
corrected formula, `depth * window_size * 2 + 1`, gives the correct
value for our small example from above:
```
depth * window_size * 2 + 1
= 2 * 1 * 2 + 1
= 4 + 1
= 5
```

These changes update the docs to correct the receptive field formula and
the example receptive field size.
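
The arithmetic above can be checked with a short Python sketch. The function names below are illustrative only and are not part of spaCy or Thinc; `receptive_field_by_expansion` simply replays the layer-by-layer growth described in words, assuming each layer adds `window_size` tokens on either side.
```python
def old_receptive_field(depth: int, window_size: int) -> int:
    # Formula previously given in the docs: the "+ 1" for the center
    # token is (incorrectly) multiplied by the depth.
    return depth * (window_size * 2 + 1)


def receptive_field(depth: int, window_size: int) -> int:
    # Corrected formula: the center token is counted only once.
    return depth * window_size * 2 + 1


def receptive_field_by_expansion(depth: int, window_size: int) -> int:
    # Start with just the center token, then let every layer add
    # `window_size` neighbors on both sides.
    size = 1
    for _ in range(depth):
        size += 2 * window_size
    return size


print(old_receptive_field(depth=2, window_size=1))           # 6
print(receptive_field(depth=2, window_size=1))               # 5
print(receptive_field_by_expansion(depth=2, window_size=1))  # 5
```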
2023-08-21 10:53:14 +02:00
Adriane Boyd
47a2b58af2 Docs: clarify abstract spacy.load examples (#12889) 2023-08-16 17:30:43 +02:00
William Mattingly
94c390d349 Update universe.json (#12904)
* Update universe.json

added hobbit-spacy to the universe json

* Update universe.json

removed displacy from hobbit-spacy and added a default text.
2023-08-16 17:30:30 +02:00
Raphael Mitsch
a20f54fb91 Update incorrect example config. (#12893) 2023-08-09 10:18:04 +02:00
Sofie Van Landeghem
0ea7e22ba8 fix (#12881) 2023-08-03 08:38:16 +02:00
Arman Mohammadi
827cea6fc3 fix the regular expression matching on the full text (#12883)
The regex pattern contained a mistake that kept it from matching all the desired tokens. With the `r` raw string literal prefix, a backslash should be written only once; doubling it makes the pattern match a literal backslash instead.
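
A minimal sketch of the pitfall, using a made-up pattern and text rather than the ones from the docs: inside a raw string, `\\d` is a regex for a literal backslash followed by `d`, while `\d` is the digit class.
```python
import re

text = "Call 555 1234 now"

# In a raw string, "\\" reaches the regex engine as an escaped backslash,
# so this pattern looks for a literal backslash followed by one or more "d".
doubled = re.compile(r"\\d+")

# A single backslash gives the intended digit class.
single = re.compile(r"\d+")

print(doubled.findall(text))  # []
print(single.findall(text))   # ['555', '1234']
```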
2023-08-02 16:52:50 +02:00
Andy Friedman
5027b7d0a8 added entry for SaysWho (#12828)
* Update universe.json

added entry for Sayswho

* Update universe.json

updated sayswho entry

* Update universe.json

* Update website/meta/universe.json

* Update website/meta/universe.json

---------

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
2023-07-31 10:52:58 +02:00
Márton Kardos
4341881a05 Added OdyCy to spaCy Universe (#12826)
* Added OdyCy to spaCy Universe

* Replaced template tags

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

---------

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
2023-07-26 16:06:58 +02:00
Madeesh Kannan
d729a61a7d SpanCat: Remove invalid threshold config argument (#12860) 2023-07-26 13:57:19 +02:00
Victoria
0d68d5bc33 Add spacy-llm docs to website (#12782)
* initial commit

* update for v0.4.0

* Apply suggestions from code review

* Fix formatting

* Apply suggestions from code review

* Update website/docs/api/large-language-models.mdx

* Update website/docs/api/large-language-models.mdx

* update usage page

* Apply suggestions from review

* Apply suggestions from review

* fix links

* fix relative links

* Apply suggestions from code review

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Apply suggestions from code review

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Apply suggestions from review

* Add section on Llama 2. Format.

---------

Co-authored-by: Raphael Mitsch <r.mitsch@outlook.com>
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
2023-07-24 14:46:44 +02:00
Adriane Boyd
21e47853b4 Update pex Makefile defaults (#12832)
* Update pex Makefile defaults

- switch to python 3.8
- only install spacy-lookups-data for extra packages

* Update website for pex defaults
2023-07-18 09:30:18 +02:00
Sofie Van Landeghem
58c54916f4 Trainable lemmatizer docs link (#12795)
* add an anchor to the trainable lemmatizer section

* add requirement for morphologizer,tagger to rule-based lemmatizer

* morphologizer only
2023-07-07 15:18:41 +02:00
Adriane Boyd
763e0b4106 Update website binder version to v3.6 (#12805) 2023-07-07 10:53:08 +02:00
Adriane Boyd
29c0c76448 Update max_length default in span finder docs (#12803) 2023-07-07 10:18:05 +02:00
Adriane Boyd
afe03898ed Merge branch 'master' into spacy.io 2023-07-07 10:07:56 +02:00
svlandeg
d26e4e0849 Revert "feat: add example stubs (#12679)"
This reverts commit 30bb34533a.
2023-07-06 17:02:38 +02:00
Basile Dura
30bb34533a
feat: add example stubs (#12679)
* feat: add example stubs

* fix: add required annotations

* fix: mypy issues

* fix: use Py36-compatible Protocol

* Minor reformatting

---------

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
Co-authored-by: svlandeg <svlandeg@github.com>
2023-07-06 16:49:43 +02:00
Adriane Boyd
6fc153a266
Merge pull request #12794 from adrianeboyd/chore/v3.6.0-2
Reenable compat+models tests for v3.6.0
2023-07-06 13:22:21 +02:00
Adriane Boyd
4e19ec7eb8
Docs for v3.6.0 (#12792)
* Docs for v3.6.0

* Add sl performance

* Add da trf note
2023-07-06 12:58:25 +02:00
Adriane Boyd
76329e1dde Revert "Temporarily skip download CLI related tests in CI"
This reverts commit 46ce66021a.
2023-07-06 12:48:06 +02:00
Adriane Boyd
a1191146f5 Revert "Temporarily skip tests for compat table"
This reverts commit dd5e00c735.
2023-07-06 12:47:50 +02:00
Adriane Boyd
830dcca367
SpanFinder: set default max_length to 25 (#12791)
When the default `max_length` is not set and there are longer training
documents, it can be difficult to train and evaluate the span finder due
to memory limits and the time it takes to evaluate a huge number of
predicted spans.
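
To give a rough sense of why an unset `max_length` gets expensive, the sketch below counts how many contiguous token spans a document can contain with and without a length cap. It is only an upper-bound illustration with a hypothetical helper, `count_candidate_spans`; it is not how the span finder itself produces its predictions.
```python
from typing import Optional


def count_candidate_spans(n_tokens: int, max_length: Optional[int] = None) -> int:
    # Hypothetical helper: number of contiguous spans in a document of
    # `n_tokens` tokens, optionally capped at `max_length` tokens per span.
    longest = n_tokens if max_length is None else min(max_length, n_tokens)
    # There are (n_tokens - length + 1) spans of each given length.
    return sum(n_tokens - length + 1 for length in range(1, longest + 1))


print(count_candidate_spans(1000))      # 500500
print(count_candidate_spans(1000, 25))  # 24700
```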
2023-07-06 09:55:34 +02:00
Tom Aarsen
c772481b37 Use 'exclude' instead of 'disable' (#12783)
as suggested by @svlandeg
2023-07-04 11:46:15 +02:00
Tom Aarsen
eab929361d
Use 'exclude' instead of 'disable' (#12783)
as suggested by @svlandeg
2023-07-04 11:45:13 +02:00
Marcus Blättermann
5660261321 Fix problem with missing syntax highlighting languages causing runtime crash on the website (#12781)
* Fix problem with universe pages using `docker` language

* Fix problem with universe pages using `r` language

* Add fallback, in case code language is unknown
2023-07-03 10:24:50 +02:00
Marcus Blättermann
bd239511a4
Fix problem with missing syntax highlighting languages causing runtime crash on the website (#12781)
* Fix problem with universe pages using `docker` language

* Fix problem with universe pages using `r` language

* Add fallback, in case code language is unknown
2023-07-03 10:24:25 +02:00
Daniël de Kok
5e4fdfc233 Remove section about parallel training with Ray (#12770)
The Ray integration is currently broken, and having these docs around
suggests that this functionality is currently available.
2023-06-28 17:10:32 +02:00
Daniël de Kok
57a230c6e4
Remove section about parallel training with Ray (#12770)
The Ray integration is currently broken, and having these docs around
suggests that this functionality is currently available.
2023-06-28 17:09:57 +02:00
Adriane Boyd
fb0da3e097
Support custom token/lexeme attribute for vectors (#12625)
* Support custom token/lexeme attribute for vectors

* Fix imports

* Back off to ORTH without Vectors.attr

* Fallback if vectors.attr doesn't exist

* Update docs
2023-06-28 09:43:14 +02:00
Adriane Boyd
337a360cc7
Use spans_ prefix for default span finder scores (#12753) 2023-06-27 19:32:17 +02:00
Adriane Boyd
65f6c9cd10
Support overriding registered functions in configs (#12623)
Support overriding registered functions in configs. Previously the registry name was parsed as a section name rather than as a registry name.
2023-06-27 17:36:33 +02:00
Adriane Boyd
c067b5264c
Address issues with source with component names and replacing listeners (#12701)
When sourcing a component, the object from the original pipeline is added to the new pipeline as the same object. This creates a situation where there are several attributes that cannot be in sync between the original pipeline and the new pipeline at the same time for this one object:

* component.name
* component.listener_map / component.listening_components for tok2vec and transformer

When running replace_listeners on a component, the config is not updated correctly if the state of the component is incorrect for the current pipeline (in particular changes that should be applied from model.attrs["replace_listener_cfg"] as used in spacy-transformers) due to the fact that:

* find_listeners relies on component.name to set the name in the listener_map
* replace_listeners relies on listener_map to determine how to modify the configs

In addition, there are several places where pipeline components are modified and the listener map and/or internal component names aren't currently updated.

In cases where there is a component shared by two pipelines that cannot be in sync, this PR chooses to prioritize the most recently modified or initialized pipeline. There is no actual solution with the current source behavior that will make both pipelines usable, so the current pipeline is updated whenever components are added/renamed/removed or the pipeline is initialized for training.
2023-06-27 10:47:07 +02:00
Adriane Boyd
e1664217f5
Add spancat_singlelabel to debug data CLI (#12749) 2023-06-26 10:25:20 +02:00
Adriane Boyd
cb4fdc83e4
Merge pull request #12742 from adrianeboyd/chore/v3.6.0
Set version to v3.6.0
2023-06-21 15:34:28 +02:00
Adriane Boyd
34971bcbd1 Set version to v3.6.0 2023-06-21 12:59:36 +02:00
Adriane Boyd
dd5e00c735 Temporarily skip tests for compat table 2023-06-21 12:59:36 +02:00
Sofie Van Landeghem
d3ac8e897c
default value for phrasematcher in pyi (#12714) 2023-06-21 10:10:13 +02:00
Tom Aarsen
88ba050b76 Add SpanMarker for NER to spaCy universe (#12730)
* Add SpanMarker for NER to spaCy universe

* Escape the newlines in the text in the code example

Or at least, attempt to

* Remove now unnecessary import

* Disable NER pipeline component in code example
2023-06-20 16:49:07 +02:00
Tom Aarsen
93983f08fc
Add SpanMarker for NER to spaCy universe (#12730)
* Add SpanMarker for NER to spaCy universe

* Escape the newlines in the text in the code example

Or at least, attempt to

* Remove now unnecessary import

* Disable NER pipeline component in code example
2023-06-20 16:47:44 +02:00
David Berenstein
aaadd22941 docs: added reference to spacy-setfit to the spaCy Universe (#12737)
* docs: added reference to spacy-setfit

* removed package import after adding factory entry points to packages
2023-06-19 15:53:16 +02:00
David Berenstein
53c400bd7a
docs: added reference to spacy-setfit to the spaCy Universe (#12737)
* docs: added reference to spacy-setfit

* removed package import after adding factory entry points to packages
2023-06-19 15:52:07 +02:00
Ziad Amerr
3125b97ace
Fixed e941 link rendering by removing the dot (#12735) 2023-06-19 13:31:08 +02:00
Marcus Blättermann
1eb2de5ccf Fix #12716 does not update the config generation section (#12718)
This is a really odd bug, where Firefox doesn't re-render the `code` element, even though `children` changed.

Two things fixed that:
- remove the `language-ini` `className`
- replace the `code` block with a `div`

Neither is ideal. Therefore, this solution adds an inner `div` that now has the classes while still maintaining the semantic `code` element.

I couldn't find any explanation for why this is happening and why it only happens in Firefox. I assume it is a bug caused by one of our many dependencies (or their interplay).

To make matters worse: This bug *doesn't* occur when running the site in dev mode. You have to build and serve the site to recreate it.
2023-06-19 09:36:20 +02:00
Marcus Blättermann
7e4b38c841
Fix #12716 does not update the config generation section (#12718)
This is a really odd bug, where Firefox doesn't re-render the `code` element, even though `children` changed.

Two things fixed that:
- remove the `language-ini` `className`
- replace the `code` block with a `div`

Neither is ideal. Therefore, this solution adds an inner `div` that now has the classes while still maintaining the semantic `code` element.

I couldn't find any explanation for why this is happening and why it only happens in Firefox. I assume it is a bug caused by one of our many dependencies (or their interplay).

To make matters worse: This bug *doesn't* occur when running the site in dev mode. You have to build and serve the site to recreate it.
2023-06-19 09:34:28 +02:00
Daniël de Kok
e73c1a89bf
CI: add isort --check to validate job (#12727) 2023-06-15 23:10:25 +01:00
Daniël de Kok
e2b70df012
Configure isort to use the Black profile, recursively isort the spacy module (#12721)
* Use isort with Black profile

* isort all the things

* Fix import cycles as a result of import sorting

* Add DOCBIN_ALL_ATTRS type definition

* Add isort to requirements

* Remove isort from build dependencies check

* Typo
2023-06-14 17:48:41 +02:00
Jacobo Myerston
4f5dd710b9 Update universe.json (#12709)
* Update universe.json

* Update universe.json

add some missing commas in greCy's description.
2023-06-12 13:56:05 +02:00
Jacobo Myerston
daa6e0339f
Update universe.json (#12709)
* Update universe.json

* Update universe.json

add some missing commas in greCy's description.
2023-06-12 13:55:20 +02:00
Sofie Van Landeghem
d65e3c31a6
use system-independent commands (#12693) 2023-06-08 11:43:36 +02:00