Commit Graph

1580 Commits

Author SHA1 Message Date
Sofie Van Landeghem
02a6a5fea0
Fix 'debug model' for transformers + generalize (#7973)
* add overrides to docs

* fix debug model with transformer

* assume training data is set in config
2021-05-06 18:43:32 +10:00
Paul O'Leary McCann
66bfabd839
Fix pretraining objectives fragment (#8005)
* Fix pretraining objectives fragment

The fragment here is reused from a heading higher up, so you couldn't
link to this section.

* Fix section link to new fragment
2021-05-06 08:27:36 +02:00
Adriane Boyd
2320791f6d
Fix Transformer.initialize example (#7963) 2021-04-30 12:21:31 +02:00
Adriane Boyd
95c0833656
Add training option to set annotations on update (#7767)
* Add training option to set annotations on update

Add a `[training]` option called `set_annotations_on_update` to specify
a list of components for which the predicted annotations should be set
on `example.predicted` immediately after that component has been
updated. The predicted annotations can be accessed by later components
in the pipeline during the processing of the batch in the same `update`
call.

* Rename to annotates / annotating_components

* Add test for `annotating_components` when training from config

* Add documentation
2021-04-26 16:53:53 +02:00
Adriane Boyd
bdb485cc80
Add callback to copy vocab/tokenizer from model (#7750)
* Add callback to copy vocab/tokenizer from model

Add callback `spacy.copy_from_base_model.v1` to copy the tokenizer
settings and/or vocab (including vectors) from a base model.

* Move spacy.copy_from_base_model.v1 to spacy.training.callbacks

* Add documentation

* Modify to specify model as tokenizer and vocab params
2021-04-22 12:36:50 +02:00
Adriane Boyd
f68fc29130
Update sent_starts in Example.from_dict (#7847)
* Update sent_starts in Example.from_dict

Update `sent_starts` for `Example.from_dict` so that `Optional[bool]`
values have the same meaning as for `Token.is_sent_start`.

Use `Optional[bool]` as the type for sent start values in the docs.

* Use helper function for conversion to ternary ints
2021-04-22 11:32:45 +02:00
Adriane Boyd
d2bdaa7823
Replace negative rows with 0 in StaticVectors (#7674)
* Replace negative rows with 0 in StaticVectors

Replace negative row indices with 0-vectors in `StaticVectors`.

* Increase versions related to StaticVectors

* Increase versions of all architctures and layers related to
`StaticVectors`
* Improve efficiency of 0-vector operations

Parallel `spacy-legacy` PR: https://github.com/explosion/spacy-legacy/pull/5

* Update config defaults to new versions

* Update docs
2021-04-22 18:04:15 +10:00
Sofie Van Landeghem
6f565cf39d
fix typo in entity_linker docs 2021-04-22 09:59:24 +02:00
Sofie Van Landeghem
2e746dbf32
update EL training data format in docs (#7839)
* update EL training data format

* fix typo

* all -1 because reasons
2021-04-22 08:50:09 +02:00
Shantam Raj
6017fcf693
Default code for Setting Entity annotations on the website errors (#7738)
* the default example for "Setting entity annotations" errors on Binder

* updating contributer info

* using a new variable to store original entities
2021-04-21 09:16:32 +02:00
langdonholmes
df541c6b5e
Update processing-pipelines.md to mention method for doc metadata (#7480)
* Update processing-pipelines.md

Under "things to try," inform users they can save metadata when using nlp.pipe(foobar, as_tuples=True)

Link to a new example on the attributes page detailing the following:

> ```
> data = [
>   ("Some text to process", {"meta": "foo"}),
>   ("And more text...", {"meta": "bar"})
> ]
> 
> for doc, context in nlp.pipe(data, as_tuples=True):
>     # Let's assume you have a "meta" extension registered on the Doc
>     doc._.meta = context["meta"]
> ```

from https://stackoverflow.com/questions/57058798/make-spacy-nlp-pipe-process-tuples-of-text-and-additional-information-to-add-as

* Updating the attributes section

Update the attributes section with example of how extensions can be used to store metadata.

* Update processing-pipelines.md

* Update processing-pipelines.md

Made as_tuples example executable and relocated to the end of the "Processing Text" section.

* Update processing-pipelines.md

* Update processing-pipelines.md

Removed extra line

* Reformat and rephrase

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
2021-04-19 11:58:12 +02:00
Adriane Boyd
0e7f94b247
Update Tokenizer.explain with special matches (#7749)
* Update Tokenizer.explain with special matches

Update `Tokenizer.explain` and the pseudo-code in the docs to include
the processing of special cases that contain affixes or whitespace.

* Handle optional settings in explain

* Add test for special matches in explain

Add test for `Tokenizer.explain` for special cases containing affixes.
2021-04-19 19:08:20 +10:00
Sofie Van Landeghem
c786e98e56
assemble CLI command (#7783)
* assemble CLI command

* ensure assemble runs even without training section

* cleanup
2021-04-19 18:39:11 +10:00
Bram Vanroy
ed561cf428
Terminology: deprecated vs obsolete (#7621)
* Terminology: deprecated vs obsolete

Typically, deprecated is used for functionality that is bound to become unavailable but that can still be used. Obsolete is used for features that have been removed. In E941, I think what is meant is "obsolete" since loading a model by a shortcut simply does not work anymore (and throws an error). This is different from downloading a model with a shortcut, which is deprecated but still works.

In light of this, perhaps all other error codes should be checked as well.

* clarify that the link command is removed and not just deprecated

Co-authored-by: svlandeg <sofie.vanlandeghem@gmail.com>
2021-04-12 14:37:00 +02:00
Adriane Boyd
673e2bc4c0
Add usage docs for streamed train corpora (#7693) 2021-04-09 16:15:38 +02:00
Sofie Van Landeghem
204c2f116b
Extend score_spans for overlapping & non-labeled spans (#7209)
* extend span scorer with consider_label and allow_overlap

* unit test for spans y2x overlap

* add score_spans unit test

* docs for new fields in scorer.score_spans

* rename to include_label

* spell out if-else for clarity

* rename to 'labeled'

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
2021-04-08 12:19:17 +02:00
broaddeep
ee159b8543
Support match alignments (#7321)
* Support match alignments

* change naming from match_alignments to with_alignments, add conditional flow if with_alignments is given, validate with_alignments, add related test case

* remove added errors, utilize bint type, cleanup whitespace

* fix no new line in end of file

* Minor formatting

* Skip alignments processing if as_spans is set

* Add with_alignments to Matcher API docs

* Update website/docs/api/matcher.md

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
2021-04-08 18:10:14 +10:00
Ines Montani
de4f4c9b8a Add more link anchors [ci skip] 2021-04-06 14:15:21 +10:00
Ines Montani
5bbdd7dc4c Update pipeline design docs [ci skip] 2021-04-06 14:13:22 +10:00
Ines Montani
1d1cfadbca Fix formatting [ci skip] 2021-04-06 14:13:13 +10:00
Ayush Chaurasia
3c2ce41dd8
W&B integration: Optional support for dataset and model checkpoint logging and versioning (#7429)
* Add optional artifacts logging

* Update docs

* Update spacy/training/loggers.py

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Update spacy/training/loggers.py

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Update spacy/training/loggers.py

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Bump WandbLogger Version

* Add documentation of v1 to legacy docs

* bump spacy-legacy to 3.0.2 (to be released)

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
Co-authored-by: svlandeg <sofie.vanlandeghem@gmail.com>
2021-04-01 19:36:23 +02:00
Sofie Van Landeghem
59c2069eb1
Legacy docs (#7601)
* document legacy Tok2Vec architectures

* add TextCatEnsemble.v1 legacy documentation

* Separate legacy section in side bar
2021-03-30 12:43:14 +02:00
Santiago Castro
af07fc3bc1
Add support for CUDA 11.2 (#7583)
* Add support for CUDA 11.2

* Update the docs

* Format

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
2021-03-30 09:47:33 +02:00
Álvaro Abella Bascarán
5b4dde38a3
fix fn name: tokenizer.infixes_finditer -> tokenizer.infix_finditer (#7606) 2021-03-30 09:45:49 +02:00
Ines Montani
be55f43163
Merge pull request #7473 from adrianeboyd/docs/v3-pipeline-deps-order 2021-03-22 12:43:07 +01:00
Ines Montani
3ee2fcfba0
Merge pull request #7483 from adrianeboyd/docs/various-v3-4 [ci skip] 2021-03-22 12:37:06 +01:00
Ines Montani
88e5a0dc16
Merge pull request #7504 from polm/fix/lexeme-docs [ci skip]
Fix mismatched backtick in Lexeme docs
2021-03-22 12:36:44 +01:00
Adriane Boyd
0d2b723e8d Update entity setting section 2021-03-20 11:38:55 +01:00
Paul O'Leary McCann
e39c0dcf33 Fix mismatched backtick in Lexeme docs 2021-03-20 18:40:00 +09:00
Adriane Boyd
c771ec22f0 Update matcher errors and docs
* Mention `tagger+attribute_ruler` in `POS`/`MORPH` error messages for
`Matcher` and `PhraseMatcher`
* Document `Matcher.__call__(allow_missing=)`
2021-03-19 10:11:18 +01:00
Adriane Boyd
6a9a467766
Update website/docs/usage/processing-pipelines.md
Co-authored-by: Ines Montani <ines@ines.io>
2021-03-19 08:12:49 +01:00
Adriane Boyd
6354b642c5
Fix typo 2021-03-18 19:01:10 +01:00
Adriane Boyd
40e5d3a980 Update saving/loading example 2021-03-18 16:56:10 +01:00
Adriane Boyd
0fb1881f36 Reformat processing pipelines 2021-03-18 13:31:42 +01:00
Adriane Boyd
acc58719da Update custom similarity hooks example 2021-03-18 13:31:42 +01:00
Adriane Boyd
c9e1a9ac17 Add multiprocessing section 2021-03-18 13:31:42 +01:00
Adriane Boyd
9a254d3995 Include all en_core_web_sm components in examples 2021-03-18 13:31:42 +01:00
Adriane Boyd
83c1b919a7 Fix positional/option in CLI types 2021-03-18 13:31:42 +01:00
Adriane Boyd
9fd41d6742 Remove Language.pipe cleanup arg 2021-03-18 13:31:42 +01:00
Adriane Boyd
5da323fd86
Minor edits 2021-03-17 12:59:05 +01:00
Adriane Boyd
a5ffe8dfed Add details about pretrained pipeline design 2021-03-17 11:31:26 +01:00
bsweileh
61472e7cb3
Update _training.md - Fix broken link on backpropagation (#7431)
* Update _training.md

Fix broken link on backpropagation

* Add agreement

add spacy contributor agreement
2021-03-15 09:21:35 +01:00
Ines Montani
c67d5a6eb0
Merge pull request #7394 from adrianeboyd/docs/ner-example-data-readme 2021-03-13 04:26:18 +01:00
Ines Montani
068b97a617
Merge pull request #7408 from adrianeboyd/bugfix/load-keyword-only 2021-03-13 04:25:50 +01:00
Adriane Boyd
3168103605 Fix type of spacy train --output in docs 2021-03-12 10:04:57 +01:00
Adriane Boyd
03e9e7b567 Add --code option to init fill-config 2021-03-12 10:03:57 +01:00
Adriane Boyd
124304b146 Add vocab kwarg back to spacy.load
* Additional minor formatting and docs cleanup
2021-03-11 10:58:59 +01:00
Adriane Boyd
84470d9b9e Incorporate BILUO note from #7407 2021-03-11 10:11:21 +01:00
Adriane Boyd
4294bcf4ab Align keyword-only in docs for init/util 2021-03-11 09:52:40 +01:00
Adriane Boyd
28726c25a1 Update docs for convert CLI and NER examples 2021-03-10 11:42:02 +01:00
Adriane Boyd
d746ea6278
Add warning about GPU selection in Jupyter notebooks (#7075)
* Initial warning

* Update check

* Redo edit

* Move jupyter warning to helper method

* Add link with details to warnings
2021-03-09 15:35:21 +01:00
Sofie Van Landeghem
932887b950
textcat scoring fix and multi_label docs (#6974)
* add multi-label textcat to menu

* add infobox on textcat API

* add info to v3 migration guide

* small edits

* further fixes in doc strings

* add infobox to textcat architectures

* add textcat_multilabel to overview of built-in components

* spelling

* fix unrelated warn msg

* Add textcat_multilabel to quickstart [ci skip]

* remove separate documentation page for multilabel_textcategorizer

* small edits

* positive label clarification

* avoid duplicating information in self.cfg and fix textcat.score

* fix multilabel textcat too

* revert threshold to storage in cfg

* revert threshold stuff for multi-textcat

Co-authored-by: Ines Montani <ines@ines.io>
2021-03-09 23:04:22 +11:00
Sofie Van Landeghem
cd70c3cb79
Fixing pretrain (#7342)
* initialize NLP with train corpus

* add more pretraining tests

* more tests

* function to fetch tok2vec layer for pretraining

* clarify parameter name

* test different objectives

* formatting

* fix check for static vectors when using vectors objective

* clarify docs

* logger statement

* fix init_tok2vec and proc.initialize order

* test training after pretraining

* add init_config tests for pretraining

* pop pretraining block to avoid config validation errors

* custom errors
2021-03-09 14:01:13 +11:00
Ines Montani
dfb23a419e Merge branch 'spacy.io' [ci skip] 2021-03-06 17:38:54 +11:00
graue70
7d085d5b1c
Fix typo in docs 2021-03-05 18:30:09 +01:00
svlandeg
682a6232e3 fix typo 2021-03-02 17:59:13 +01:00
svlandeg
d900c55061 consistently use registry as callable 2021-03-02 17:56:28 +01:00
graue70
0fddc0447c
Fix copy & paste error in API docs 2021-03-02 14:00:14 +01:00
Ines Montani
8f7c7b2658
Merge pull request #7211 from svlandeg/docs/el_update [ci skip]
kb.get_candidates renamed to get_alias_candidates
2021-02-27 11:51:22 +11:00
Ines Montani
408b94887a
Merge pull request #7207 from adrianeboyd/docs/get-noun-chunks [ci skip]
Extend docs related to Vocab.get_noun_chunks
2021-02-27 11:51:08 +11:00
svlandeg
248339039e fix type in docs 2021-02-26 14:27:10 +01:00
svlandeg
08fd901a1b kb.get_candidates renamed to get_alias_candidates 2021-02-25 20:09:36 +01:00
Adriane Boyd
6a37f343d5 Extend docs related to Vocab.get_noun_chunks 2021-02-25 16:38:21 +01:00
Ines Montani
24cecbb3f4
Merge pull request #7126 from adrianeboyd/docs/gpu-id-opt [ci skip]
Add tip about --gpu-id to training quickstart
2021-02-24 22:34:17 +11:00
Ken
fa7ddc7f88
Update sentencizer documentation example with sentencizer pipe name (#7185) 2021-02-24 08:06:54 +01:00
Tocic
b1996a51a1
fix typo in models.md (#7157) 2021-02-22 09:00:38 +01:00
Sofie Van Landeghem
b92f81d5da
fix NEL config and IO, and n_sents functionality (#7100)
* fix NEL config and IO, and n_sents functionality

* add docs

* fix test
2021-02-22 14:49:52 +11:00
Sofie Van Landeghem
ba5a50f62b
NEL docs & UX (#7129)
* EL set_kb docs fix

* custom warning for set_kb mistake
2021-02-22 11:04:22 +11:00
Adriane Boyd
7198be0f4b Add tip about --gpu-id to training quickstart 2021-02-19 14:07:51 +01:00
Sofie Van Landeghem
709c9e75af
span.ent only returns first sentence (#7084)
* return first sentence when span contains sentence boundary

* docs fix

* small fixes

* cleanup
2021-02-19 23:02:38 +11:00
palandlom
9b82586699
var batch is useless (#7111)
It seems that nlp.update(examples) should be nlp.update(batch)
2021-02-18 09:44:22 +01:00
Ines Montani
fc4fb6eb3a Make v2.x docs more prominent [ci skip] 2021-02-17 23:42:27 +11:00
Ines Montani
6b9026a219
Merge pull request #7000 from explosion/feature/project-yml-overrides
Support env vars and CLI overrides for project.yml
2021-02-11 12:31:45 +11:00
Peter Baumann
61b04a70d5
Run PhraseMatcher on Spans (#6918)
* Add regression test

* Run PhraseMatcher on Spans

* Add test for PhraseMatcher on Spans and Docs

* Add SCA

* Add test with 3 matches in Doc, 1 match in Span

* Update docs

* Use doc.length for find_matches in tokenizer

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
2021-02-10 23:43:32 +11:00
Ines Montani
c08b3f294c Support env vars and CLI overrides for project.yml 2021-02-10 13:45:27 +11:00
svlandeg
9a7f33c916 final 3.0 benchmark numbers 2021-02-09 21:28:33 +01:00
Ines Montani
ca3f8386d7
Merge pull request #6975 from svlandeg/fix/link [ci skip]
fix link
2021-02-09 14:34:11 +11:00
tarskiandhutch
e897e7aaad
Line 70: syntax error
Original config definition treated dictionary key as a function argument.
2021-02-08 15:24:57 -05:00
svlandeg
bb7482bef8 fix link 2021-02-08 18:39:59 +01:00
Sofie Van Landeghem
6ed423c16c
reduce memory load when reading all vectors from file (#6945)
* reduce memory load when reading all vectors from file

* one more small typo fix
2021-02-07 08:05:43 +08:00
Ines Montani
433835d9b0
Merge pull request #6889 from adrianeboyd/docs/source-install-dup [ci skip] 2021-02-05 13:35:16 +11:00
svlandeg
7cda5605a0 add type 2021-02-03 13:13:58 +01:00
svlandeg
94929c2b98 small doc fixes 2021-02-03 13:10:22 +01:00
Ines Montani
2cdfcd2d19 Update naming [ci skip] 2021-02-03 12:48:31 +11:00
Adriane Boyd
37a68a06ab Update to recommend editable installs for source installs 2021-02-02 16:51:27 +01:00
Adriane Boyd
3a3e4daf60 Update install instructions
* Remove duplicate section about compiling from source
2021-02-02 14:44:15 +01:00
Pengcheng YIN
6fdc33203a
Fix a typo 2021-02-01 17:26:28 -05:00
Ines Montani
a59f3fcf5d Make wheel the default format and update docs [ci skip] 2021-02-01 23:18:43 +11:00
Ines Montani
31b842d6ce Update table [ci skip] 2021-02-01 14:17:52 +11:00
Ines Montani
7752f80f39 Update docs [ci skip] 2021-01-31 16:11:24 +11:00
Ines Montani
a8a1231ccd Update README and docs [ci skip] 2021-01-31 12:36:04 +11:00
Ines Montani
45c551037d Update CLI docs [ci skip] 2021-01-30 21:50:23 +11:00
Ines Montani
ae07416fda Merge branch 'website/v3-launch' into develop 2021-01-30 20:31:06 +11:00
Ines Montani
2332c4280b Update and use unified --build option 2021-01-30 13:11:36 +11:00
Ines Montani
2609ba4e89 Support building wheel in spacy package 2021-01-30 11:54:02 +11:00
Ines Montani
95e958a229
Merge pull request #6852 from explosion/feature/replace-listeners 2021-01-30 00:58:08 +11:00
Ines Montani
7694f76dd1 Update warning and mention replace_listeners 2021-01-29 23:46:01 +11:00
Adriane Boyd
8b76cb8095 Rephrase transformers PyTorch instructions 2021-01-29 13:36:56 +01:00
Ines Montani
095055ac48
Merge pull request #6855 from adrianeboyd/docs/trf-sentencepiece [ci skip]
Update transfomers install docs
2021-01-29 23:34:01 +11:00
Adriane Boyd
e3e87e7275 Update transfomers install docs
* Recommend installing PyTorch separately
* Add instructions for `sentencepiece`
2021-01-29 13:27:43 +01:00