Commit Graph

228 Commits

Author SHA1 Message Date
svlandeg
e32a394ff0 fix the fix for textcat init functionality 2024-05-14 18:45:51 +02:00
svlandeg
5992e927b9 fix textcat init functionality 2024-05-14 18:38:11 +02:00
svlandeg
c27679f210 Merge branch 'master' into feat/update_v4 2024-05-14 17:42:48 +02:00
Raphael Mitsch
304b9331e6
Modify EL batching to doc-wise streaming approach (#12367)
* Convert Candidate from Cython to Python class.

* Format.

* Fix .entity_ typo in _add_activations() usage.

* Change type for mentions to look up entity candidates for to SpanGroup from Iterable[Span].

* Update docs.

* Update spacy/kb/candidate.py

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Update doc string of BaseCandidate.__init__().

* Update spacy/kb/candidate.py

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Rename Candidate to InMemoryCandidate, BaseCandidate to Candidate.

* Adjust Candidate to support and mandate numerical entity IDs.

* Format.

* Fix docstring and docs.

* Update website/docs/api/kb.mdx

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Rename alias -> mention.

* Refactor Candidate attribute names. Update docs and tests accordingly.

* Refacor Candidate attributes and their usage.

* Format.

* Fix mypy error.

* Update error code in line with v4 convention.

* Modify EL batching system.

* Update leftover get_candidates() mention in docs.

* Format docs.

* Format.

* Update spacy/kb/candidate.py

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Updated error code.

* Simplify interface for int/str representations.

* Update website/docs/api/kb.mdx

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Rename 'alias' to 'mention'.

* Port Candidate and InMemoryCandidate to Cython.

* Remove redundant entry in setup.py.

* Add abstract class check.

* Drop storing mention.

* Update spacy/kb/candidate.pxd

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Fix entity_id refactoring problems in docstrings.

* Drop unused InMemoryCandidate._entity_hash.

* Update docstrings.

* Move attributes out of Candidate.

* Partially fix alias/mention terminology usage. Convert Candidate to interface.

* Remove prior_prob from supported properties in Candidate. Introduce KnowledgeBase.supports_prior_probs().

* Update docstrings related to prior_prob.

* Update alias/mention usage in doc(strings).

* Update spacy/ml/models/entity_linker.py

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Update spacy/ml/models/entity_linker.py

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Mention -> alias renaming. Drop Candidate.mentions(). Drop InMemoryLookupKB.get_alias_candidates() from docs.

* Update docstrings.

* Fix InMemoryCandidate attribute names.

* Update spacy/kb/kb.pyx

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Update spacy/ml/models/entity_linker.py

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Update W401 test.

* Update spacy/errors.py

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Update spacy/kb/kb.pyx

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Use Candidate output type for toy generators in the test suite to mimick best practices

* fix docs

* fix import

* Fix merge leftovers.

* Update spacy/kb/kb.pyx

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Update spacy/kb/kb.pyx

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Update website/docs/api/kb.mdx

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Update website/docs/api/entitylinker.mdx

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Update spacy/kb/kb_in_memory.pyx

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Update website/docs/api/inmemorylookupkb.mdx

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Update get_candidates() docstring.

* Reformat imports in entity_linker.py.

* Drop valid_ent_idx_per_doc.

* Update docs.

* Format.

* Simplify doc loop in predict().

* Remove E1044 comment.

* Fix merge errors.

* Format.

* Format.

* Format.

* Fix merge error & tests.

* Format.

* Apply suggestions from code review

Co-authored-by: Madeesh Kannan <shadeMe@users.noreply.github.com>

* Use type alias.

* isort.

* isort.

* Lint.

* Add typedefs.pyx.

* Fix typedef import.

* Fix type aliases.

* Format.

* Update docstring and type usage.

* Add info on get_candidates(), get_candidates_batched().

* Readd get_candidates info to v3 changelog.

* Update website/docs/api/entitylinker.mdx

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Update factory functions for backwards compatibility.

* Format.

* Ignore mypy error.

* Fix mypy error.

* Format.

* Add test for multiple docs with multiple entities.

---------

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
Co-authored-by: Madeesh Kannan <shadeMe@users.noreply.github.com>
Co-authored-by: svlandeg <svlandeg@github.com>
2024-04-09 11:39:18 +02:00
Daniël de Kok
2dbb332cea
TextCatParametricAttention.v1: set key transform dimensions (#13249)
* TextCatParametricAttention.v1: set key transform dimensions

This is necessary for tok2vec implementations that initialize
lazily (e.g. curated transformers).

* Add lazily-initialized tok2vec to simulate transformers

Add a lazily-initialized tok2vec to the tests and test the current
textcat models with it.

Fix some additional issues found using this test.

* isort

* Add `test.` prefix to `LazyInitTok2Vec.v1`
2024-02-02 13:01:59 +01:00
Daniël de Kok
e722284ff4 Construct TextCatEnsemble.v2 using helper function 2024-01-24 14:59:01 +01:00
Daniël de Kok
81beaea70e Merge remote-tracking branch 'upstream/master' into maintenance/v4-merge-master-20240119 2024-01-19 12:34:29 +01:00
Daniël de Kok
e2a3952de5
Add spacy.TextCatParametricAttention.v1 (#13201)
* Add spacy.TextCatParametricAttention.v1

This layer provides is a simplification of the ensemble classifier that
only uses paramteric attention. We have found empirically that with a
sufficient amount of training data, using the ensemble classifier with
BoW does not provide significant improvement in classifier accuracy.
However, plugging in a BoW classifier does reduce GPU training and
inference performance substantially, since it uses a GPU-only kernel.

* Fix merge fallout
2024-01-02 10:03:06 +01:00
Daniël de Kok
7ebba86402
Add TextCatReduce.v1 (#13181)
* Add TextCatReduce.v1

This is a textcat classifier that pools the vectors generated by a
tok2vec implementation and then applies a classifier to the pooled
representation. Three reductions are supported for pooling: first, max,
and mean. When multiple reductions are enabled, the reductions are
concatenated before providing them to the classification layer.

This model is a generalization of the TextCatCNN model, which only
supports mean reductions and is a bit of a misnomer, because it can also
be used with transformers. This change also reimplements TextCatCNN.v2
using the new TextCatReduce.v1 layer.

* Doc fixes

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Fully specify `TextCatCNN` <-> `TextCatReduce` equivalence

* Move TextCatCNN docs to legacy, in prep for moving to spacy-legacy

* Add back a test for TextCatCNN.v2

* Replace TextCatCNN in pipe configurations and templates

* Add an infobox to the `TextCatReduce` section with an `TextCatCNN` anchor

* Add last reduction (`use_reduce_last`)

* Remove non-working TextCatCNN Netlify redirect

* Revert layer changes for the quickstart

* Revert one more quickstart change

* Remove unused import

* Fix docstring

* Fix setting name in error message

---------

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
2023-12-21 11:00:06 +01:00
Daniël de Kok
7b689bde44 No need for Literal compat, since we only support >= 3.8 2023-12-21 09:47:38 +01:00
Daniël de Kok
9b36729cbd Fix Cython lints 2023-12-18 20:02:15 +01:00
Daniël de Kok
e2591cda36 isort 2023-12-08 20:24:09 +01:00
Daniël de Kok
e5ec45cb7e Revert "Merge the parser refactor into v4 (#10940)"
This reverts commit a183db3cef.
2023-12-08 20:23:08 +01:00
Daniël de Kok
05803cfe76 Revert "Reimplement distillation with oracle cut size (#12214)"
This reverts commit e27c60a702.
2023-12-08 14:38:05 +01:00
Daniël de Kok
da7ad97519
Update TextCatBOW to use the fixed SparseLinear layer (#13149)
* Update `TextCatBOW` to use the fixed `SparseLinear` layer

A while ago, we fixed the `SparseLinear` layer to use all available
parameters: https://github.com/explosion/thinc/pull/754

This change updates `TextCatBOW` to `v3` which uses the new
`SparseLinear_v2` layer. This results in a sizeable improvement on a
text categorization task that was tested.

While at it, this `spacy.TextCatBOW.v3` also adds the `length_exponent`
option to make it possible to change the hidden size. Ideally, we'd just
have an option called `length`. But the way that `TextCatBOW` uses
hashes results in a non-uniform distribution of parameters when the
length is not a power of two.

* Replace TexCatBOW `length_exponent` parameter by `length`

We now round up the length to the next power of two if it isn't
a power of two.

* Remove some tests for TextCatBOW.v2

* Fix missing import
2023-11-29 09:11:54 +01:00
Adriane Boyd
55614d6799 Add profile=False to currently unprofiled cython 2023-09-28 17:09:41 +02:00
Adriane Boyd
406794a081 Merge remote-tracking branch 'upstream/master' into chore/update-develop-from-master-v3.7-1 2023-09-28 15:09:06 +02:00
Connor Brinton
6dd56868de
📝 Fix formula for receptive field in docs (#12918)
SpaCy's HashEmbedCNN layer performs convolutions over tokens to produce
contextualized embeddings using a `MaxoutWindowEncoder` layer. These
convolutions are implemented using Thinc's `expand_window` layer, which
concatenates `window_size` neighboring sequence items on either side of
the sequence item being processed. This is repeated across `depth`
convolutional layers.

For example, consider the sequence "ABCDE" and a `MaxoutWindowEncoder`
layer with a context window of 1 and a depth of 2. We'll focus on the
token "C". We can visually represent the contextual embedding produced
for "C" as:
```mermaid
flowchart LR
A0(A<sub>0</sub>)
B0(B<sub>0</sub>)
C0(C<sub>0</sub>)
D0(D<sub>0</sub>)
E0(E<sub>0</sub>)
B1(B<sub>1</sub>)
C1(C<sub>1</sub>)
D1(D<sub>1</sub>)
C2(C<sub>2</sub>)
A0 --> B1
B0 --> B1
C0 --> B1
B0 --> C1
C0 --> C1
D0 --> C1
C0 --> D1
D0 --> D1
E0 --> D1
B1 --> C2
C1 --> C2
D1 --> C2
```

Described in words, this graph shows that before the first layer of the
convolution, the "receptive field" centered at each token consists only
of that same token. That is to say, that we have a receptive field of 1.
The first layer of the convolution adds one neighboring token on either
side to the receptive field. Since this is done on both sides, the
receptive field increases by 2, giving the first layer a receptive field
of 3. The second layer of the convolutions adds an _additional_
neighboring token on either side to the receptive field, giving a final
receptive field of 5.

However, this doesn't match the formula currently given in the docs,
which read:
> The receptive field of the CNN will be
> `depth * (window_size * 2 + 1)`, so a 4-layer network with a window
> size of `2` will be sensitive to 20 words at a time.

Substituting in our depth of 2 and window size of 1, this formula gives
us a receptive field of:
```
depth * (window_size * 2 + 1)
= 2 * (1 * 2 + 1)
= 2 * (2 + 1)
= 2 * 3
= 6
```

This not only doesn't match our computations from above, it's also an
even number! This is suspicious, since the receptive field is supposed
to be centered on a token, and not between tokens. Generally, this
formula results in an even number for any even value of `depth`.

The error in this formula is that the adjustment for the center token
is multiplied by the depth, when it should occur only once. The
corrected formula, `depth * window_size * 2 + 1`, gives the correct
value for our small example from above:
```
depth * window_size * 2 + 1
= 2 * 1 * 2 + 1
= 4 + 1
= 5
```

These changes update the docs to correct the receptive field formula and
the example receptive field size.
2023-08-21 10:52:32 +02:00
Adriane Boyd
0fe43f40f1
Support registered vectors (#12492)
* Support registered vectors

* Format

* Auto-fill [nlp] on load from config and from bytes/disk

* Only auto-fill [nlp]

* Undo all changes to Language.from_disk

* Expand BaseVectors

These methods are needed in various places for training and vector
similarity.

* isort

* More linting

* Only fill [nlp.vectors]

* Update spacy/vocab.pyx

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Revert changes to test related to auto-filling [nlp]

* Add vectors registry

* Rephrase error about vocab methods for vectors

* Switch to dummy implementation for BaseVectors.to_ops

* Add initial draft of docs

* Remove example from BaseVectors docs

* Apply suggestions from code review

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Update website/docs/api/basevectors.mdx

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Fix type and lint bpemb example

* Update website/docs/api/basevectors.mdx

---------

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
2023-08-01 15:46:08 +02:00
Sofie Van Landeghem
f293386d3e
remove unnecessary line
Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
2023-07-20 14:08:29 +02:00
Adriane Boyd
4f37e4031c
Update spacy/ml/tb_framework.pyx
Co-authored-by: Raphael Mitsch <r.mitsch@outlook.com>
2023-07-20 09:59:19 +02:00
svlandeg
96f2e30c4b cython fixes and cleanup 2023-07-19 17:41:29 +02:00
svlandeg
0e3b6a87d6 Merge branch 'upstream_master' into sync_v4 2023-07-19 16:37:31 +02:00
Basile Dura
b0228d8ea6
ci: add cython linter (#12694)
* chore: add cython-linter dev dependency

* fix: lexeme.pyx

* fix: morphology.pxd

* fix: tokenizer.pxd

* fix: vocab.pxd

* fix: morphology.pxd (line length)

* ci: add cython-lint

* ci: fix cython-lint call

* Fix kb/candidate.pyx.

* Fix kb/kb.pyx.

* Fix kb/kb_in_memory.pyx.

* Fix kb.

* Fix training/ partially.

* Fix training/. Ignore trailing whitespaces and too long lines.

* Fix ml/.

* Fix matcher/.

* Fix pipeline/.

* Fix tokens/.

* Fix build errors. Fix vocab.pyx.

* Fix cython-lint install and run.

* Fix lexeme.pyx, parts_of_speech.pxd, vectors.pyx. Temporarily disable cython-lint execution.

* Fix attrs.pyx, lexeme.pyx, symbols.pxd, isort issues.

* Make cython-lint install conditional. Fix tokenizer.pyx.

* Fix remaining files. Reenable cython-lint check.

* Readded parentheses.

* Fix test_build_dependencies().

* Add explanatory comment to cython-lint execution.

---------

Co-authored-by: Raphael Mitsch <r.mitsch@outlook.com>
2023-07-19 12:03:31 +02:00
Adriane Boyd
fb0da3e097
Support custom token/lexeme attribute for vectors (#12625)
* Support custom token/lexeme attribute for vectors

* Fix imports

* Back off to ORTH without Vectors.attr

* Fallback if vectors.attr doesn't exist

* Update docs
2023-06-28 09:43:14 +02:00
Daniël de Kok
2468742cb8 isort all the things 2023-06-26 11:41:03 +02:00
Daniël de Kok
e2b70df012
Configure isort to use the Black profile, recursively isort the spacy module (#12721)
* Use isort with Black profile

* isort all the things

* Fix import cycles as a result of import sorting

* Add DOCBIN_ALL_ATTRS type definition

* Add isort to requirements

* Remove isort from build dependencies check

* Typo
2023-06-14 17:48:41 +02:00
Daniël de Kok
50c5e9a2dd Merge remote-tracking branch 'upstream/master' into sync-v4-master-20230612 2023-06-12 15:57:10 +02:00
kadarakos
c003aac29a
SpanFinder into spaCy from experimental (#12507)
* span finder integrated into spacy from experimental

* black

* isort

* black

* default spankey constant

* black

* Update spacy/pipeline/spancat.py

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

* rename

* rename

* max_length and min_length as Optional[int] and strict checking

* black

* mypy fix for integer type infinity

* revert line order

* implement all comparison operators for inf int

* avoid two for loops over all docs by not precomputing

* interleave thresholding with span creation

* black

* revert to not interleaving (relized its faster)

* black

* Update spacy/errors.py

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

* update dosctring

* enforce that the gold and predicted documents have the same text

* new error for ensuring reference and predicted texts are the same

* remove todo

* adjust test

* black

* handle misaligned tokenization

* return correct variable

* failing overfit test

* only use a single spans_key like in spancat

* black

* remove debug lines

* typo

* remove comment

* remove near duplicate reduntant method

* use the 'spans_key' variable name everywhere

* Update spacy/pipeline/span_finder.py

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

* flaky test fix suggestion, hand set bias terms

* only test suggester and test result exhaustively

* make it clear that the span_finder_suggester is more general (not specific to span_finder)

* Update spacy/tests/pipeline/test_span_finder.py

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

* Apply suggestions from code review

* remove question comment

* move preset_spans_suggester test to spancat tests

* Add docs and unify default configs for spancat and span finder

* Add `allow_overlap=True` to span finder scorer

* Fix offset bug in set_annotations

* Ignore labels in span finder scorer

* Format

* Add span_finder to quickstart template

* Move settings to self.cfg, store min/max unset as None

* Remove debugging

* Update docstrings and docs

* Update spacy/pipeline/span_finder.py

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Fix imports

---------

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
2023-06-07 15:52:28 +02:00
kadarakos
34d1164b0e
Spancat speed improvement (#12577)
* avoid nesting then flattening

* mypy fix

* Apply suggestions from code review

* Add type for indices

* Run full matrix for mypy

* Add back modified type: ignore

* Revert "Run full matrix for mypy"

This reverts commit e218873d04.

---------

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
2023-04-27 15:27:13 +02:00
Adriane Boyd
fac457a509
Support floret for PretrainVectors (#12435)
* Support floret for PretrainVectors

* Format
2023-03-24 16:28:51 +01:00
Raphael Mitsch
3102e2e27a
Entity linking: use SpanGroup instead of Iterable[Span] for mentions (#12344)
* Convert Candidate from Cython to Python class.

* Format.

* Fix .entity_ typo in _add_activations() usage.

* Change type for mentions to look up entity candidates for to SpanGroup from Iterable[Span].

* Update docs.

* Update spacy/kb/candidate.py

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Update doc string of BaseCandidate.__init__().

* Update spacy/kb/candidate.py

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Rename Candidate to InMemoryCandidate, BaseCandidate to Candidate.

* Adjust Candidate to support and mandate numerical entity IDs.

* Format.

* Fix docstring and docs.

* Update website/docs/api/kb.mdx

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Rename alias -> mention.

* Refactor Candidate attribute names. Update docs and tests accordingly.

* Refacor Candidate attributes and their usage.

* Format.

* Fix mypy error.

* Update error code in line with v4 convention.

* Reverse erroneous changes during merge.

* Update return type in EL tests.

* Re-add Candidate to setup.py.

* Format updated docs.

---------

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
2023-03-20 12:25:18 +01:00
Raphael Mitsch
9340eb8ad2
Introduce hierarchy for EL Candidate objects (#12341)
* Convert Candidate from Cython to Python class.

* Format.

* Fix .entity_ typo in _add_activations() usage.

* Update spacy/kb/candidate.py

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Update doc string of BaseCandidate.__init__().

* Update spacy/kb/candidate.py

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Rename Candidate to InMemoryCandidate, BaseCandidate to Candidate.

* Adjust Candidate to support and mandate numerical entity IDs.

* Format.

* Fix docstring and docs.

* Update website/docs/api/kb.mdx

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Rename alias -> mention.

* Refactor Candidate attribute names. Update docs and tests accordingly.

* Refacor Candidate attributes and their usage.

* Format.

* Fix mypy error.

* Update error code in line with v4 convention.

* Update spacy/kb/candidate.py

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Updated error code.

* Simplify interface for int/str representations.

* Update website/docs/api/kb.mdx

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Rename 'alias' to 'mention'.

* Port Candidate and InMemoryCandidate to Cython.

* Remove redundant entry in setup.py.

* Add abstract class check.

* Drop storing mention.

* Update spacy/kb/candidate.pxd

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Fix entity_id refactoring problems in docstrings.

* Drop unused InMemoryCandidate._entity_hash.

* Update docstrings.

* Move attributes out of Candidate.

* Partially fix alias/mention terminology usage. Convert Candidate to interface.

* Remove prior_prob from supported properties in Candidate. Introduce KnowledgeBase.supports_prior_probs().

* Update docstrings related to prior_prob.

* Update alias/mention usage in doc(strings).

* Update spacy/ml/models/entity_linker.py

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Update spacy/ml/models/entity_linker.py

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Mention -> alias renaming. Drop Candidate.mentions(). Drop InMemoryLookupKB.get_alias_candidates() from docs.

* Update docstrings.

* Fix InMemoryCandidate attribute names.

* Update spacy/kb/kb.pyx

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Update spacy/ml/models/entity_linker.py

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Update W401 test.

* Update spacy/errors.py

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Update spacy/kb/kb.pyx

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Use Candidate output type for toy generators in the test suite to mimick best practices

* fix docs

* fix import

---------

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
2023-03-20 00:34:35 +01:00
Adriane Boyd
260cb9c6fe
Raise error for non-default vectors with PretrainVectors (#12366) 2023-03-06 18:06:31 +01:00
Raphael Mitsch
1ea31552be Merge branch 'master' into sync/master-into-v4
# Conflicts:
#	requirements.txt
#	spacy/pipeline/entity_linker.py
#	spacy/util.py
#	website/docs/api/entitylinker.mdx
2023-03-02 16:24:15 +01:00
Raphael Mitsch
6aa6b86d49
Make generation of empty KnowledgeBase instances configurable in EntityLinker (#12320)
* Make empty_kb() configurable.

* Format.

* Update docs.

* Be more specific in KB serialization test.

* Update KB serialization tests. Update docs.

* Remove doc update for batched candidate generation.

* Fix serialization of subclassed KB in tests.

* Format.

* Update docstring.

* Update docstring.

* Switch from pickle to json for custom field serialization.
2023-03-01 16:02:55 +01:00
Daniël de Kok
e27c60a702
Reimplement distillation with oracle cut size (#12214)
* Improve the correctness of _parse_patch

* If there are no more actions, do not attempt to make further
  transitions, even if not all states are final.
* Assert that the number of actions for a step is the same as
  the number of states.

* Reimplement distillation with oracle cut size

The code for distillation with an oracle cut size was not reimplemented
after the parser refactor. We did not notice, because we did not have
tests for this functionality. This change brings back the functionality
and adds this to the parser tests.

* Rename states2actions to _states_to_actions for consistency

* Test distillation max cuts in NER

* Mark parser/NER tests as slow

* Typo

* Fix invariant in _states_diff_to_actions

* Rename _init_batch -> _init_batch_from_teacher

* Ninja edit the ninja edit

* Check that we raise an exception when we pass the incorrect number or actions

* Remove unnecessary get

Co-authored-by: Madeesh Kannan <shadeMe@users.noreply.github.com>

* Write out condition more explicitly

---------

Co-authored-by: Madeesh Kannan <shadeMe@users.noreply.github.com>
2023-02-21 15:47:18 +01:00
Adriane Boyd
ec45f704b1
Drop python 3.6/3.7, remove unneeded compat (#12187)
* Drop python 3.6/3.7, remove unneeded compat

* Remove unused import

* Minimal python 3.8+ docs updates
2023-01-27 15:48:20 +01:00
Daniël de Kok
a183db3cef
Merge the parser refactor into v4 (#10940)
* Try to fix doc.copy

* Set dev version

* Make vocab always own lexemes

* Change version

* Add SpanGroups.copy method

* Fix set_annotations during Parser.update

* Fix dict proxy copy

* Upd version

* Fix copying SpanGroups

* Fix set_annotations in parser.update

* Fix parser set_annotations during update

* Revert "Fix parser set_annotations during update"

This reverts commit eb138c89ed.

* Revert "Fix set_annotations in parser.update"

This reverts commit c6df0eafd0.

* Fix set_annotations during parser update

* Inc version

* Handle final states in get_oracle_sequence

* Inc version

* Try to fix parser training

* Inc version

* Fix

* Inc version

* Fix parser oracle

* Inc version

* Inc version

* Fix transition has_gold

* Inc version

* Try to use real histories, not oracle

* Inc version

* Upd parser

* Inc version

* WIP on rewrite parser

* WIP refactor parser

* New progress on parser model refactor

* Prepare to remove parser_model.pyx

* Convert parser from cdef class

* Delete spacy.ml.parser_model

* Delete _precomputable_affine module

* Wire up tb_framework to new parser model

* Wire up parser model

* Uncython ner.pyx and dep_parser.pyx

* Uncython

* Work on parser model

* Support unseen_classes in parser model

* Support unseen classes in parser

* Cleaner handling of unseen classes

* Work through tests

* Keep working through errors

* Keep working through errors

* Work on parser. 15 tests failing

* Xfail beam stuff. 9 failures

* More xfail. 7 failures

* Xfail. 6 failures

* cleanup

* formatting

* fixes

* pass nO through

* Fix empty doc in update

* Hackishly fix resizing. 3 failures

* Fix redundant test. 2 failures

* Add reference version

* black formatting

* Get tests passing with reference implementation

* Fix missing prints

* Add missing file

* Improve indexing on reference implementation

* Get non-reference forward func working

* Start rigging beam back up

* removing redundant tests, cf #8106

* black formatting

* temporarily xfailing issue 4314

* make flake8 happy again

* mypy fixes

* ensure labels are added upon predict

* cleanup remnants from merge conflicts

* Improve unseen label masking

Two changes to speed up masking by ~10%:

- Use a bool array rather than an array of float32.

- Let the mask indicate whether a label was seen, rather than
  unseen. The mask is most frequently used to index scores for
  seen labels. However, since the mask marked unseen labels,
  this required computing an intermittent flipped mask.

* Write moves costs directly into numpy array (#10163)

This avoids elementwise indexing and the allocation of an additional
array.

Gives a ~15% speed improvement when using batch_by_sequence with size
32.

* Temporarily disable ner and rehearse tests

Until rehearse is implemented again in the refactored parser.

* Fix loss serialization issue (#10600)

* Fix loss serialization issue

Serialization of a model fails with:

TypeError: array(738.3855, dtype=float32) is not JSON serializable

Fix this using float conversion.

* Disable CI steps that require spacy.TransitionBasedParser.v2

After finishing the refactor, TransitionBasedParser.v2 should be
provided for backwards compat.

* Add back support for beam parsing to the refactored parser (#10633)

* Add back support for beam parsing

Beam parsing was already implemented as part of the `BeamBatch` class.
This change makes its counterpart `GreedyBatch`. Both classes are hooked
up in `TransitionModel`, selecting `GreedyBatch` when the beam size is
one, or `BeamBatch` otherwise.

* Use kwarg for beam width

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Avoid implicit default for beam_width and beam_density

* Parser.{beam,greedy}_parse: ensure labels are added

* Remove 'deprecated' comments

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Parser `StateC` optimizations (#10746)

* `StateC`: Optimizations

Avoid GIL acquisition in `__init__`
Increase default buffer capacities on init
Reduce C++ exception overhead

* Fix typo

* Replace `set::count` with `set::find`

* Add exception attribute to c'tor

* Remove unused import

* Use a power-of-two value for initial capacity
Use default-insert to init `_heads` and `_unshiftable`

* Merge `cdef` variable declarations and assignments

* Vectorize `example.get_aligned_parses` (#10789)

* `example`: Vectorize `get_aligned_parse`
Rename `numpy` import

* Convert aligned array to lists before returning

* Revert import renaming

* Elide slice arguments when selecting the entire range

* Tagger/morphologizer alignment performance optimizations (#10798)

* `example`: Unwrap `numpy` scalar arrays before passing them to `StringStore.__getitem__`

* `AlignmentArray`: Use native list as staging buffer for offset calculation

* `example`: Vectorize `get_aligned`

* Hoist inner functions out of `get_aligned`

* Replace inline `if..else` clause in assignment statement

* `AlignmentArray`: Use raw indexing into offset and data `numpy` arrays

* `example`: Replace array unique value check with `groupby`

* `example`: Correctly exclude tokens with no alignment in `_get_aligned_vectorized`
Simplify `_get_aligned_non_vectorized`

* `util`: Update `all_equal` docstring

* Explicitly use `int32_t*`

* Restore C CPU inference in the refactored parser (#10747)

* Bring back the C parsing model

The C parsing model is used for CPU inference and is still faster for
CPU inference than the forward pass of the Thinc model.

* Use C sgemm provided by the Ops implementation

* Make tb_framework module Cython, merge in C forward implementation

* TransitionModel: raise in backprop returned from forward_cpu

* Re-enable greedy parse test

* Return transition scores when forward_cpu is used

* Apply suggestions from code review

Import `Model` from `thinc.api`

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Use relative imports in tb_framework

* Don't assume a default for beam_width

* We don't have a direct dependency on BLIS anymore

* Rename forwards to _forward_{fallback,greedy_cpu}

* Require thinc >=8.1.0,<8.2.0

* tb_framework: clean up imports

* Fix return type of _get_seen_mask

* Move up _forward_greedy_cpu

* Style fixes.

* Lower thinc lowerbound to 8.1.0.dev0

* Formatting fix

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

* Reimplement parser rehearsal function (#10878)

* Reimplement parser rehearsal function

Before the parser refactor, rehearsal was driven by a loop in the
`rehearse` method itself. For each parsing step, the loops would:

1. Get the predictions of the teacher.
2. Get the predictions and backprop function of the student.
3. Compute the loss and backprop into the student.
4. Move the teacher and student forward with the predictions of
   the student.

In the refactored parser, we cannot perform search stepwise rehearsal
anymore, since the model now predicts all parsing steps at once.
Therefore, rehearsal is performed in the following steps:

1. Get the predictions of all parsing steps from the student, along
   with its backprop function.
2. Get the predictions from the teacher, but use the predictions of
   the student to advance the parser while doing so.
3. Compute the loss and backprop into the student.

To support the second step a new method, `advance_with_actions` is
added to `GreedyBatch`, which performs the provided parsing steps.

* tb_framework: wrap upper_W and upper_b in Linear

Thinc's Optimizer cannot handle resizing of existing parameters. Until
it does, we work around this by wrapping the weights/biases of the upper
layer of the parser model in Linear. When the upper layer is resized, we
copy over the existing parameters into a new Linear instance. This does
not trigger an error in Optimizer, because it sees the resized layer as
a new set of parameters.

* Add test for TransitionSystem.apply_actions

* Better FIXME marker

Co-authored-by: Madeesh Kannan <shadeMe@users.noreply.github.com>

* Fixes from Madeesh

* Apply suggestions from Sofie

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Remove useless assignment

Co-authored-by: Madeesh Kannan <shadeMe@users.noreply.github.com>
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Rename some identifiers in the parser refactor (#10935)

* Rename _parseC to _parse_batch

* tb_framework: prefix many auxiliary functions with underscore

To clearly state the intent that they are private.

* Rename `lower` to `hidden`, `upper` to `output`

* Parser slow test fixup

We don't have TransitionBasedParser.{v1,v2} until we bring it back as a
legacy option.

* Remove last vestiges of PrecomputableAffine

This does not exist anymore as a separate layer.

* ner: re-enable sentence boundary checks

* Re-enable test that works now.

* test_ner: make loss test more strict again

* Remove commented line

* Re-enable some more beam parser tests

* Remove unused _forward_reference function

* Update for CBlas changes in Thinc 8.1.0.dev2

Bump thinc dependency to 8.1.0.dev3.

* Remove references to spacy.TransitionBasedParser.{v1,v2}

Since they will not be offered starting with spaCy v4.

* `tb_framework`: Replace references to `thinc.backends.linalg` with `CBlas`

* dont use get_array_module (#11056) (#11293)

Co-authored-by: kadarakos <kadar.akos@gmail.com>

* Move `thinc.extra.search` to `spacy.pipeline._parser_internals` (#11317)

* `search`: Move from `thinc.extra.search`
Fix NPE in `Beam.__dealloc__`

* `pytest`: Add support for executing Cython tests
Move `search` tests from thinc and patch them to run with `pytest`

* `mypy` fix

* Update comment

* `conftest`: Expose `register_cython_tests`

* Remove unused import

* Move `argmax` impls to new `_parser_utils` Cython module (#11410)

* Parser does not have to be a cdef class anymore

This also fixes validation of the initialization schema.

* Add back spacy.TransitionBasedParser.v2

* Fix a rename that was missed in #10878.

So that rehearsal tests pass.

* Remove module from setup.py that got added during the merge

* Bring back support for `update_with_oracle_cut_size` (#12086)

* Bring back support for `update_with_oracle_cut_size`

This option was available in the pre-refactor parser, but was never
implemented in the refactored parser. This option cuts transition
sequences that are longer than `update_with_oracle_cut` size into
separate sequences that have at most `update_with_oracle_cut`
transitions. The oracle (gold standard) transition sequence is used to
determine the cuts and the initial states for the additional sequences.

Applying this cut makes the batches more homogeneous in the transition
sequence lengths, making forward passes (and as a consequence training)
much faster.

Training time 1000 steps on de_core_news_lg:

- Before this change: 149s
- After this change: 68s
- Pre-refactor parser: 81s

* Fix a rename that was missed in #10878.

So that rehearsal tests pass.

* Apply suggestions from @shadeMe

* Use chained conditional

* Test with update_with_oracle_cut_size={0, 1, 5, 100}

And fix a git that occurs with a cut size of 1.

* Fix up some merge fall out

* Update parser distillation for the refactor

In the old parser, we'd iterate over the transitions in the distill
function and compute the loss/gradients on the go. In the refactored
parser, we first let the student model parse the inputs. Then we'll let
the teacher compute the transition probabilities of the states in the
student's transition sequence. We can then compute the gradients of the
student given the teacher.

* Add back spacy.TransitionBasedParser.v1 references

- Accordion in the architecture docs.
- Test in test_parse, but disabled until we have a spacy-legacy release.

Co-authored-by: Matthew Honnibal <honnibal+gh@gmail.com>
Co-authored-by: svlandeg <svlandeg@github.com>
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
Co-authored-by: Madeesh Kannan <shadeMe@users.noreply.github.com>
Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
Co-authored-by: kadarakos <kadar.akos@gmail.com>
2023-01-18 11:27:45 +01:00
Daniël de Kok
5e297aa20e
Add TrainablePipe.{distill,get_teacher_student_loss} (#12016)
* Add `TrainablePipe.{distill,get_teacher_student_loss}`

This change adds two methods:

- `TrainablePipe::distill` which performs a training step of a
   student pipe on a teacher pipe, giving a batch of `Doc`s.
- `TrainablePipe::get_teacher_student_loss` computes the loss
  of a student relative to the teacher.

The `distill` or `get_teacher_student_loss` methods are also implemented
in the tagger, edit tree lemmatizer, and parser pipes, to enable
distillation in those pipes and as an example for other pipes.

* Fix stray `Beam` import

* Fix incorrect import

* Apply suggestions from code review

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Apply suggestions from code review

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* TrainablePipe.distill: use `Iterable[Example]`

* Add Pipe.is_distillable method

* Add `validate_distillation_examples`

This first calls `validate_examples` and then checks that the
student/teacher tokens are the same.

* Update distill documentation

* Add distill documentation for all pipes that support distillation

* Fix incorrect identifier

* Apply suggestions from code review

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Add comment to explain `is_distillable`

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
2023-01-16 10:25:53 +01:00
Daniël de Kok
f9308aae13
Fix v4 branch to build against Thinc v9 (#11921)
* Move `thinc.extra.search` to `spacy.pipeline._parser_internals`

Backport of:
https://github.com/explosion/spaCy/pull/11317

Co-authored-by: Madeesh Kannan <shadeMe@users.noreply.github.com>

* Replace references to `thinc.backends.linalg` with `CBlas`

Backport of:
https://github.com/explosion/spaCy/pull/11292

Co-authored-by: Madeesh Kannan <shadeMe@users.noreply.github.com>

* Use cross entropy from `thinc.legacy`

* Require thinc>=9.0.0.dev0,<9.1.0

Co-authored-by: Madeesh Kannan <shadeMe@users.noreply.github.com>
2022-12-17 14:32:19 +01:00
Adriane Boyd
68b8fa2df2 Merge remote-tracking branch 'upstream/master' into chore/update-v4-from-master-4 2022-11-03 09:42:36 +01:00
Paul O'Leary McCann
d61e742960
Handle Docs with no entities in EntityLinker (#11640)
* Handle docs with no entities

If a whole batch contains no entities it won't make it to the model, but
it's possible for individual Docs to have no entities. Before this
commit, those Docs would cause an error when attempting to concatenate
arrays because the dimensions didn't match.

It turns out the process of preparing the Ragged at the end of the span
maker forward was a little different from list2ragged, which just uses
the flatten function directly. Letting list2ragged do the conversion
avoids the dimension issue.

This did not come up before because in NEL demo projects it's typical
for data with no entities to be discarded before it reaches the NEL
component.

This includes a simple direct test that shows the issue and checks it's
resolved. It doesn't check if there are any downstream changes, so a
more complete test could be added. A full run was tested by adding an
example with no entities to the Emerson sample project.

* Add a blank instance to default training data in tests

Rather than adding a specific test, since not failing on instances with
no entities is basic functionality, it makes sense to add it to the
default set.

* Fix without modifying architecture

If the architecture is modified this would have to be a new version, but
this change isn't big enough to merit that.
2022-10-28 10:25:34 +02:00
svlandeg
e3027c65b8 Merge branch 'copy_develop' into copy_v4 2022-10-03 14:12:16 +02:00
svlandeg
9c8cdb403e Merge branch 'master_copy' into develop_copy 2022-09-30 15:40:26 +02:00
Madeesh Kannan
0ec9a696e6
Fix config validation failures caused by NVTX pipeline wrappers (#11460)
* Enable Cython<->Python bindings for `Pipe` and `TrainablePipe` methods

* `pipes_with_nvtx_range`: Skip hooking methods whose signature cannot be ascertained

When loading pipelines from a config file, the arguments passed to individual pipeline components is validated by `pydantic` during init. For this, the validation model attempts to parse the function signature of the component's c'tor/entry point so that it can check if all mandatory parameters are present in the config file.

When using the `models_and_pipes_with_nvtx_range` as a `after_pipeline_creation` callback, the methods of all pipeline components get replaced by a NVTX range wrapper **before** the above-mentioned validation takes place. This can be problematic for components that are implemented as Cython extension types - if the extension type is not compiled with Python bindings for its methods, they will have no signatures at runtime. This resulted in `pydantic` matching the *wrapper's* parameters with the those in the config and raising errors.

To avoid this, we now skip applying the wrapper to any (Cython) methods that do not have signatures.
2022-09-12 14:55:41 +02:00
Raphael Mitsch
1f23c615d7
Refactor KB for easier customization (#11268)
* Add implementation of batching + backwards compatibility fixes. Tests indicate issue with batch disambiguation for custom singular entity lookups.

* Fix tests. Add distinction w.r.t. batch size.

* Remove redundant and add new comments.

* Adjust comments. Fix variable naming in EL prediction.

* Fix mypy errors.

* Remove KB entity type config option. Change return types of candidate retrieval functions to Iterable from Iterator. Fix various other issues.

* Update spacy/pipeline/entity_linker.py

Co-authored-by: Paul O'Leary McCann <polm@dampfkraft.com>

* Update spacy/pipeline/entity_linker.py

Co-authored-by: Paul O'Leary McCann <polm@dampfkraft.com>

* Update spacy/kb_base.pyx

Co-authored-by: Paul O'Leary McCann <polm@dampfkraft.com>

* Update spacy/kb_base.pyx

Co-authored-by: Paul O'Leary McCann <polm@dampfkraft.com>

* Update spacy/pipeline/entity_linker.py

Co-authored-by: Paul O'Leary McCann <polm@dampfkraft.com>

* Add error messages to NotImplementedErrors. Remove redundant comment.

* Fix imports.

* Remove redundant comments.

* Rename KnowledgeBase to InMemoryLookupKB and BaseKnowledgeBase to KnowledgeBase.

* Fix tests.

* Update spacy/errors.py

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Move KB into subdirectory.

* Adjust imports after KB move to dedicated subdirectory.

* Fix config imports.

* Move Candidate + retrieval functions to separate module. Fix other, small issues.

* Fix docstrings and error message w.r.t. class names. Fix typing for candidate retrieval functions.

* Update spacy/kb/kb_in_memory.pyx

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Update spacy/ml/models/entity_linker.py

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Fix typing.

* Change typing of mentions to be Span instead of Union[Span, str].

* Update docs.

* Update EntityLinker and _architecture docs.

* Update website/docs/api/entitylinker.md

Co-authored-by: Paul O'Leary McCann <polm@dampfkraft.com>

* Adjust message for E1046.

* Re-add section for Candidate in kb.md, add reference to dedicated page.

* Update docs and docstrings.

* Re-add section + reference for KnowledgeBase.get_alias_candidates() in docs.

* Update spacy/kb/candidate.pyx

* Update spacy/kb/kb_in_memory.pyx

* Update spacy/pipeline/legacy/entity_linker.py

* Remove canididate.md. Remove mistakenly added config snippet in entity_linker.py.

Co-authored-by: Paul O'Leary McCann <polm@dampfkraft.com>
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
2022-09-08 10:38:07 +02:00
Adriane Boyd
98a916e01a
Make stable private modules public and adjust names (#11353)
* Make stable private modules public and adjust names

* `spacy.ml._character_embed` -> `spacy.ml.character_embed`
* `spacy.ml._precomputable_affine` -> `spacy.ml.precomputable_affine`
* `spacy.tokens._serialize` -> `spacy.tokens.doc_bin`
* `spacy.tokens._retokenize` -> `spacy.tokens.retokenize`
* `spacy.tokens._dict_proxies` -> `spacy.tokens.span_groups`

* Skip _precomputable_affine

* retokenize -> retokenizer

* Fix imports
2022-08-30 13:56:35 +02:00
Daniël de Kok
e581eeac34
precompute_hiddens/Parser: look up CPU ops once (v4) (#11068)
* precompute_hiddens/Parser: look up CPU ops once

* precompute_hiddens: make cpu_ops private
2022-07-29 15:12:19 +02:00
Daniël de Kok
4ee8a06149
Fix compatibility with CuPy 9.x (#11194)
After the precomputable affine table of shape [nB, nF, nO, nP] is
computed, padding with shape [1, nF, nO, nP] is assigned to the first
row of the precomputed affine table. However, when we are indexing the
precomputed table, we get a row of shape [nF, nO, nP]. CuPy versions
before 10.0 cannot paper over this shape difference.

This change fixes compatibility with CuPy < 10.0 by squeezing the first
dimension of the padding before assignment.
2022-07-26 10:52:01 +02:00