Adriane Boyd
5d0f48fe69
Enforce that Span.start/end(_char) remain valid and in sync ( #12268 )
...
* Enforce that Span.start/end(_char) remain valid and in sync
Allowing span attributes to be writable starting in v3 has made it
possible for the internal `Span.start/end/start_char/end_char` to get
out-of-sync or have invalid values.
This checks that the values are valid and syncs the token and char
offsets if any attributes are modified directly. It does not yet handle
the case where the underlying doc is modified.
* Format
2023-04-06 16:01:59 +02:00
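The syncing behaviour described in the commit above can be illustrated with a small toy model. This is a minimal sketch and not spaCy's implementation: `ToyToken` and `ToySpan` are hypothetical names, and the assumption is that character offsets are always derived from token offsets, so the two cannot drift apart, and that every write is validated.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ToyToken:
    """Hypothetical stand-in for a token: character offset and length only."""
    idx: int     # character offset of the token in the text
    length: int  # number of characters in the token


class ToySpan:
    """Toy span (not spaCy's Span) whose token and char offsets stay in sync."""

    def __init__(self, tokens: List[ToyToken], start: int, end: int) -> None:
        self._tokens = tokens
        self._start = 0
        self._end = 0
        self._set_bounds(start, end)

    def _set_bounds(self, start: int, end: int) -> None:
        # Reject offsets that are inverted or fall outside the token list.
        if not 0 <= start <= end <= len(self._tokens):
            raise IndexError(f"invalid span bounds: [{start}, {end})")
        self._start, self._end = start, end

    @property
    def start(self) -> int:
        return self._start

    @start.setter
    def start(self, value: int) -> None:
        self._set_bounds(value, self._end)

    @property
    def end(self) -> int:
        return self._end

    @end.setter
    def end(self, value: int) -> None:
        self._set_bounds(self._start, value)

    @property
    def start_char(self) -> int:
        # Derived from the token offset, so it cannot get out of sync.
        if self._start < len(self._tokens):
            return self._tokens[self._start].idx
        if not self._tokens:
            return 0
        last = self._tokens[-1]
        return last.idx + last.length

    @start_char.setter
    def start_char(self, value: int) -> None:
        # Map the char offset back to a token index; it must be a token start.
        for i, token in enumerate(self._tokens):
            if token.idx == value:
                self._set_bounds(i, self._end)
                return
        raise ValueError(f"char offset {value} is not the start of a token")

    @property
    def end_char(self) -> int:
        if self._end == self._start:
            return self.start_char  # empty span
        last = self._tokens[self._end - 1]
        return last.idx + last.length


# Tokens for the text "Hello big world".
tokens = [ToyToken(0, 5), ToyToken(6, 3), ToyToken(10, 5)]
span = ToySpan(tokens, 0, 2)  # covers "Hello big"
span.start = 1                # char offsets follow the token offset
span.start_char = 0           # and the token offset follows the char offset
```

As in the commit message, this sketch only covers direct attribute writes; it does nothing if the underlying token list is modified afterwards.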
Madeesh Kannan
6db20b354f
Docs: Fix rule-based matching example that expands named entities ( #12495 )
2023-04-06 11:45:58 +02:00
Edward
c95d320d28
Add more information to custom code docs ( #12491 )
...
* Add info to sections
* Update website/docs/usage/training.mdx
---------
Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
2023-04-06 11:45:19 +02:00
Will Frey
8d4129e177
Fix invalid ConsoleLogger.v3 example config ( #12498 )
...
Replace `progress_bar = "all_steps"` with `progress_bar = "eval"`, which is consistent with the default behavior for `spacy.ConsoleLogger.v1` and `spacy.ConsoleLogger.v2`.
2023-04-04 20:53:07 +02:00
Edward
de32011e4c
Add model-last saving mechanism to pretraining ( #12459 )
...
* Adjust pretrain command
* change naming and add finally block
* Add unit test
* Add unit test assertions
* Update spacy/training/pretrain.py
Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
* change finally block
* Add to docs
* Update website/docs/usage/embeddings-transformers.mdx
* Add flag to skip saving model-last
---------
Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
2023-04-03 15:24:03 +02:00
Adriane Boyd
4a1ec332de
Add Span.kb_id/Span.id strings to Doc/DocBin serialization if set ( #12493 )
...
* Add Span.kb_id/Span.id strings to Doc/DocBin serialization if set
* Format
2023-04-03 15:11:12 +02:00
Adriane Boyd
4538ceb507
Remove redundant strings.add for Doc.char_span ( #12429 )
2023-04-03 11:38:56 +02:00
Adriane Boyd
476a2e7a0a
Allow cupy 12.0 for extras ( #12490 )
2023-03-31 13:48:15 +02:00
Adriane Boyd
69e20ce03d
Fix pickle for ngram suggester ( #12486 )
2023-03-31 13:43:51 +02:00
Adriane Boyd
140d53649d
Convert values to numpy for label smoothing tests ( #12472 )
2023-03-31 13:41:41 +02:00
Ye Lei (叶磊)
ce258670b7
Allow passing a Span to displacy.parse_deps ( #12477 )
...
* Allow passing a Span to displacy.parse_deps
* Update docstring
Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
* Update API docs
---------
Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
2023-03-31 09:44:01 +02:00
Daniël de Kok
b734e5314d
Avoid TrainablePipe.finish_update getting called twice during training ( #12450 )
...
* Avoid `TrainablePipe.finish_update` getting called twice during training
PR #12136 fixed an issue where the tok2vec pipe was updated before
gradients were accumulated. However, it introduced a new bug that causes
`finish_update` to be called twice when using the training loop. This
causes a fairly large slowdown.
The `Language.update` method accepts the `sgd` argument for passing an
optimizer. This argument has three possible values:
- `Optimizer`: use the given optimizer to finish pipe updates.
- `None`: use a default optimizer to finish pipe updates.
- `False`: do not finish pipe updates.
However, the last option was not documented and not valid with the
existing type of `sgd`. I assumed that this was a remnant of earlier
spaCy versions and removed handling of `False`.
However, with that change, we are passing `None` to `Language.update`.
As a result, we were calling `finish_update` in both `Language.update`
and in the training loop after all subbatches are processed.
This change restores proper handling/use of `False`. Moreover, the role
of `False` is now documented and added to the type to avoid future
accidents.
* Fix typo
* Document defaults for `Language.update`
2023-03-30 09:30:42 +02:00
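The three `sgd` values described in the commit above can be sketched with a toy training loop. This is a minimal illustration under stated assumptions, not spaCy's code: `ToyOptimizer`, `ToyPipe`, `toy_update` and `toy_train_step` are hypothetical names, and the "gradient" is just a running sum. The point is that passing `False` for each subbatch lets the loop call `finish_update` exactly once after all gradients are accumulated.

```python
from typing import List, Literal, Union


class ToyOptimizer:
    """Hypothetical optimizer that only carries a learning rate."""

    def __init__(self, lr: float = 0.1) -> None:
        self.lr = lr


class ToyPipe:
    """Hypothetical trainable pipe with one weight and one gradient."""

    def __init__(self) -> None:
        self.weight = 0.0
        self.grad = 0.0
        self.finish_calls = 0

    def accumulate(self, example: float) -> None:
        # Stand-in for backprop: add the example's contribution to the gradient.
        self.grad += example

    def finish_update(self, optimizer: ToyOptimizer) -> None:
        # Apply the accumulated gradient and reset it.
        self.weight -= optimizer.lr * self.grad
        self.grad = 0.0
        self.finish_calls += 1


def toy_update(
    pipe: ToyPipe,
    batch: List[float],
    sgd: Union[ToyOptimizer, None, Literal[False]] = None,
) -> None:
    for example in batch:
        pipe.accumulate(example)
    if sgd is False:
        return  # do not finish the update; the caller will do it later
    # None means "use a default optimizer"; otherwise use the given one.
    optimizer = sgd if sgd is not None else ToyOptimizer()
    pipe.finish_update(optimizer)


def toy_train_step(
    pipe: ToyPipe, subbatches: List[List[float]], optimizer: ToyOptimizer
) -> None:
    # Accumulate gradients over all subbatches without finishing the update...
    for subbatch in subbatches:
        toy_update(pipe, subbatch, sgd=False)
    # ...then finish it exactly once.
    pipe.finish_update(optimizer)


pipe = ToyPipe()
toy_train_step(pipe, [[1.0, 2.0], [3.0]], ToyOptimizer(lr=0.5))
print(pipe.finish_calls, pipe.weight)
```

If `toy_train_step` passed `None` instead of `False`, each subbatch would trigger its own `finish_update` and the loop would add another call on top, which is the double-update pattern the commit describes.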
Raphael Mitsch
d85df9d577
Fix Span.sents for edge case of Span being the only Span in the last sentence of a Doc. ( #12484 )
2023-03-29 18:54:47 +02:00
kadarakos
372a90885e
Fix spancat-singlelabel score ( #12469 )
...
* debug argmax sort and add span scores
* add missing tests for spanscores
2023-03-29 08:38:11 +02:00
Edward
dba4e7bece
Add info to stringstore and vocab ( #12471 )
2023-03-27 13:15:14 +02:00
Adriane Boyd
2fba21be63
Restrict github workflows to explosion ( #12470 )
2023-03-27 12:44:04 +02:00
sloev / Johannes Valbjørn
fd072533e7
add spacy_onnx_sentiment_english to universe ( #12422 )
...
* add spacy_onnx_sentiment_english to universe
* rename to sentimental-onix
* fix comma json error
* fix typo
* typo fix
Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
* mention need to download model before example works
Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
---------
Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
2023-03-27 11:35:14 +02:00
Prajakta Darade
ae7779e830
corrected example code ( #12466 )
2023-03-27 11:32:49 +02:00
kadarakos
d1474fdd91
add explanation about overwriting behaviour ( #12464 )
...
* add explanation about overwriting behaviour
* Update website/docs/api/spancategorizer.mdx
Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
* Update website/docs/api/spancategorizer.mdx
Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
* Update website/docs/api/spancategorizer.mdx
Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
* format
---------
Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
2023-03-27 10:27:11 +02:00
Edward
a653dec654
Add info that Vocab and StringStore are not static in docs ( #12427 )
...
* Add size increase info about vocab and stringstore
* Update website/docs/api/stringstore.mdx
Co-authored-by: Raphael Mitsch <r.mitsch@outlook.com>
* Update website/docs/api/vocab.mdx
Co-authored-by: Raphael Mitsch <r.mitsch@outlook.com>
* Change wording
---------
Co-authored-by: Raphael Mitsch <r.mitsch@outlook.com>
2023-03-27 09:18:23 +02:00
Adriane Boyd
fac457a509
Support floret for PretrainVectors ( #12435 )
...
* Support floret for PretrainVectors
* Format
2023-03-24 16:28:51 +01:00
Adriane Boyd
d0bd3f5ee4
Update Serbian tokenization for UD Serbian SET ( #12442 )
2023-03-24 16:26:40 +01:00
Vinit Ravishankar
28de85737f
Tagger label smoothing ( #12293 )
...
* add label smoothing
* use True/False instead of floats
* add entropy to debug data
* formatting
* docs
* change test to check difference in distributions
* Update website/docs/api/tagger.mdx
Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
* Update spacy/pipeline/tagger.pyx
Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
* bool -> float
* update docs
* fix seed
* black
* update tests to use label_smoothing = 0.0
* set default to 0.0, update quickstart
* Update spacy/pipeline/tagger.pyx
Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
* update morphologizer, tagger test
* fix morph docs
* add url to docs
---------
Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
2023-03-22 12:17:56 +01:00
Ines Montani
b479f8bfa5
Add user survey alert to the top ( #12452 )
...
* Add user survey alert to the top
* Shorter
---------
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
2023-03-22 11:09:37 +01:00
Raphael Mitsch
3102e2e27a
Entity linking: use SpanGroup instead of Iterable[Span] for mentions ( #12344 )
...
* Convert Candidate from Cython to Python class.
* Format.
* Fix .entity_ typo in _add_activations() usage.
* Change the type of the mentions used to look up entity candidates from Iterable[Span] to SpanGroup.
* Update docs.
* Update spacy/kb/candidate.py
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
* Update doc string of BaseCandidate.__init__().
* Update spacy/kb/candidate.py
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
* Rename Candidate to InMemoryCandidate, BaseCandidate to Candidate.
* Adjust Candidate to support and mandate numerical entity IDs.
* Format.
* Fix docstring and docs.
* Update website/docs/api/kb.mdx
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
* Rename alias -> mention.
* Refactor Candidate attribute names. Update docs and tests accordingly.
* Refactor Candidate attributes and their usage.
* Format.
* Fix mypy error.
* Update error code in line with v4 convention.
* Reverse erroneous changes during merge.
* Update return type in EL tests.
* Re-add Candidate to setup.py.
* Format updated docs.
---------
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
2023-03-20 12:25:18 +01:00
Raphael Mitsch
e5be5d6092
Merge branch 'v4' into feature/docwise-generator-batching
...
# Conflicts:
# spacy/kb/kb.pyx
# spacy/kb/kb_in_memory.pyx
# spacy/ml/models/entity_linker.py
# spacy/pipeline/entity_linker.py
# spacy/tests/pipeline/test_entity_linker.py
# website/docs/api/inmemorylookupkb.mdx
# website/docs/api/kb.mdx
2023-03-20 10:50:54 +01:00
Raphael Mitsch
cb79af3a10
Fix merge leftovers.
2023-03-20 10:31:11 +01:00
Raphael Mitsch
73bdeb01e4
Merge branch 'refactor/el-candidates' into feature/docwise-generator-batching
...
# Conflicts:
# spacy/kb/candidate.py
# spacy/kb/kb.pyx
# spacy/kb/kb_in_memory.pyx
# spacy/ml/models/entity_linker.py
# spacy/pipeline/entity_linker.py
# spacy/tests/pipeline/test_entity_linker.py
# website/docs/api/inmemorylookupkb.mdx
# website/docs/api/kb.mdx
2023-03-20 10:24:17 +01:00
Raphael Mitsch
9340eb8ad2
Introduce hierarchy for EL Candidate objects ( #12341 )
...
* Convert Candidate from Cython to Python class.
* Format.
* Fix .entity_ typo in _add_activations() usage.
* Update spacy/kb/candidate.py
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
* Update doc string of BaseCandidate.__init__().
* Update spacy/kb/candidate.py
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
* Rename Candidate to InMemoryCandidate, BaseCandidate to Candidate.
* Adjust Candidate to support and mandate numerical entity IDs.
* Format.
* Fix docstring and docs.
* Update website/docs/api/kb.mdx
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
* Rename alias -> mention.
* Refactor Candidate attribute names. Update docs and tests accordingly.
* Refactor Candidate attributes and their usage.
* Format.
* Fix mypy error.
* Update error code in line with v4 convention.
* Update spacy/kb/candidate.py
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
* Updated error code.
* Simplify interface for int/str representations.
* Update website/docs/api/kb.mdx
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
* Rename 'alias' to 'mention'.
* Port Candidate and InMemoryCandidate to Cython.
* Remove redundant entry in setup.py.
* Add abstract class check.
* Drop storing mention.
* Update spacy/kb/candidate.pxd
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
* Fix entity_id refactoring problems in docstrings.
* Drop unused InMemoryCandidate._entity_hash.
* Update docstrings.
* Move attributes out of Candidate.
* Partially fix alias/mention terminology usage. Convert Candidate to interface.
* Remove prior_prob from supported properties in Candidate. Introduce KnowledgeBase.supports_prior_probs().
* Update docstrings related to prior_prob.
* Update alias/mention usage in doc(strings).
* Update spacy/ml/models/entity_linker.py
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
* Update spacy/ml/models/entity_linker.py
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
* Mention -> alias renaming. Drop Candidate.mentions(). Drop InMemoryLookupKB.get_alias_candidates() from docs.
* Update docstrings.
* Fix InMemoryCandidate attribute names.
* Update spacy/kb/kb.pyx
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
* Update spacy/ml/models/entity_linker.py
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
* Update W401 test.
* Update spacy/errors.py
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
* Update spacy/kb/kb.pyx
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
* Use Candidate output type for toy generators in the test suite to mimic best practices
* fix docs
* fix import
---------
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
2023-03-20 00:34:35 +01:00
Adriane Boyd
6ae7618418
Clean up Vocab constructor ( #12290 )
...
* Clean up Vocab constructor
* Change effective type of `strings` from `Iterable[str]` to `Optional[StringStore]`
* Don't automatically add strings to vocab
* Change default values to `None`
* Remove `**deprecated_kwargs`
* Format
2023-03-19 23:41:20 +01:00
Sofie Van Landeghem
b83407388a
fix import
2023-03-19 23:34:00 +01:00
Sofie Van Landeghem
0365d3d2e2
fix docs
2023-03-19 23:31:02 +01:00
Sofie Van Landeghem
9e71adc074
Use Candidate output type for toy generators in the test suite to mimic best practices
2023-03-19 23:27:20 +01:00
Raphael Mitsch
faede7155c
Update spacy/kb/kb.pyx
...
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
2023-03-17 11:32:41 +01:00
Raphael Mitsch
4d8dce5ba2
Update spacy/errors.py
...
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
2023-03-17 11:28:18 +01:00
Adriane Boyd
54c614e116
CI: Separate spacy universe validation into a separate workflow ( #12440 )
...
* Separate spacy universe validation into a separate workflow
* Fix new workflow name
2023-03-17 10:59:53 +01:00
Adriane Boyd
5f72d6c836
CI: Switch PR back to paths-ignore ( #12438 )
...
Switch PR tests back to paths-ignore but include changes to `.github`
for all PRs rather than trying to figure out complicated
includes+excludes. Changes to `.github` are relatively rare and should
not be a huge burden for the CI.
2023-03-17 10:01:49 +01:00
Adriane Boyd
4c5a3a2a7b
Remove autoblack workflow ( #12437 )
...
Now that all PRs have `black` formatting validation, we no longer need the
autoblack workflow.
2023-03-17 09:35:00 +01:00
Raphael Mitsch
2377b67f81
Update W401 test.
2023-03-17 08:59:52 +01:00
Raphael Mitsch
307bbab285
Update spacy/ml/models/entity_linker.py
...
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
2023-03-17 08:58:28 +01:00
Raphael Mitsch
978fbdcee1
Update spacy/kb/kb.pyx
...
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
2023-03-17 08:58:17 +01:00
Raphael Mitsch
830939ee64
Fix InMemoryCandidate attribute names.
2023-03-15 10:51:34 +01:00
Raphael Mitsch
80fb0666b9
Update docstrings.
2023-03-15 09:25:41 +01:00
Raphael Mitsch
3cfc1c6acc
Mention -> alias renaming. Drop Candidate.mentions(). Drop InMemoryLookupKB.get_alias_candidates() from docs.
2023-03-15 09:23:31 +01:00
Raphael Mitsch
961795d9f1
Update spacy/ml/models/entity_linker.py
...
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
2023-03-15 09:20:25 +01:00
Raphael Mitsch
b7b4282821
Update spacy/ml/models/entity_linker.py
...
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
2023-03-15 09:20:07 +01:00
Raphael Mitsch
96b61d0671
Fix EL failure with sentence-crossing entities ( #12398 )
...
* Add test reproducing EL failure in sentence-crossing entities.
* Format.
* Draft fix.
* Format.
* Fix case for len(ent.sents) == 1.
* Format.
* Format.
* Format.
* Fix mypy error.
* Merge EL sentence crossing tests.
* Remove unneeded sentencizer component.
* Fix or ignore mypy issues in test.
* Simplify ent.sents handling.
* Format. Update assert in ent.sents handling.
* Small rewrite
---------
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
2023-03-14 22:02:49 +01:00
Adriane Boyd
2ce9a220db
Fix --verbose for spacy find-threshold ( #12418 )
2023-03-14 17:16:49 +01:00
Adriane Boyd
377f601bff
CI: Add all paths before excluding patterns ( #12419 )
2023-03-14 16:06:08 +01:00
Raphael Mitsch
28dbed64cb
Update alias/mention usage in doc(strings).
2023-03-14 13:33:05 +01:00