spaCy

mirror of https://github.com/explosion/spaCy.git synced 2026-01-21 07:44:32 +03:00

Author	SHA1	Message	Date
Raphael Mitsch	575c405ae3	Fix LLM docs on task factories.	2024-01-19 16:48:54 +01:00
Raphael Mitsch	256468c414	Merge branch 'docs/llm_main' into chore/sync-master-with-llm_main # Conflicts: # website/docs/api/large-language-models.mdx	2024-01-19 16:34:35 +01:00
Raphael Mitsch	91c24c0285	Merge pull request #13251 from explosion/docs/llm_develop Sync `docs/llm_main` with `docs/llm_develop`	2024-01-19 12:56:38 +01:00
Raphael Mitsch	0062c22c35	Updated docs w.r.t. infinite doc length changes (#13214 ) * Updated docs w.r.t. infinite doc length. * Fix typo. * fix typo's * Fix table formatting. * Update formatting. --------- Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>	2024-01-05 14:20:58 +01:00
Daniël de Kok	e2a3952de5	Add spacy.TextCatParametricAttention.v1 (#13201) * Add spacy.TextCatParametricAttention.v1 This layer is a simplification of the ensemble classifier that only uses parametric attention. We have found empirically that with a sufficient amount of training data, using the ensemble classifier with BoW does not provide significant improvement in classifier accuracy. However, plugging in a BoW classifier does reduce GPU training and inference performance substantially, since it uses a GPU-only kernel. * Fix merge fallout	2024-01-02 10:03:06 +01:00
Daniël de Kok	7ebba86402	Add TextCatReduce.v1 (#13181 ) * Add TextCatReduce.v1 This is a textcat classifier that pools the vectors generated by a tok2vec implementation and then applies a classifier to the pooled representation. Three reductions are supported for pooling: first, max, and mean. When multiple reductions are enabled, the reductions are concatenated before providing them to the classification layer. This model is a generalization of the TextCatCNN model, which only supports mean reductions and is a bit of a misnomer, because it can also be used with transformers. This change also reimplements TextCatCNN.v2 using the new TextCatReduce.v1 layer. * Doc fixes Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> * Fully specify `TextCatCNN` <-> `TextCatReduce` equivalence * Move TextCatCNN docs to legacy, in prep for moving to spacy-legacy * Add back a test for TextCatCNN.v2 * Replace TextCatCNN in pipe configurations and templates * Add an infobox to the `TextCatReduce` section with an `TextCatCNN` anchor * Add last reduction (`use_reduce_last`) * Remove non-working TextCatCNN Netlify redirect * Revert layer changes for the quickstart * Revert one more quickstart change * Remove unused import * Fix docstring * Fix setting name in error message --------- Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>	2023-12-21 11:00:06 +01:00
Raphael Mitsch	d56ee65ddf	Document `spacy-llm`'s `TranslationTask` (#13183 ) * Describe translation task. * Fix references to examples and template. * Format.	2023-12-11 17:41:04 +01:00
Raphael Mitsch	e79a9c5acd	Document `spacy-llm`'s `RawTask` (#13180 ) * Add section on RawTask. * Fix API docs. * Update website/docs/api/large-language-models.mdx Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> --------- Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>	2023-12-11 17:14:12 +01:00
Raphael Mitsch	9fcd2bfa08	Add info on endpoint arg. (#13169 )	2023-12-05 12:46:29 +01:00
Raphael Mitsch	a25a3b996b	Merge pull request #13173 from explosion/docs/llm_main Sync `llm_develop` with `llm_main`	2023-12-04 16:46:21 +01:00
Raphael Mitsch	55ed2b4e82	Add documentation for EL task (#12988 ) * Add documentation for EL task. * Fix EL factory name. * Add llm_entity_linker_mentio. * Apply suggestions from code review Co-authored-by: Madeesh Kannan <shadeMe@users.noreply.github.com> * Update EL task docs. * Update EL task docs. * Update EL task docs. * Update EL task docs. * Update EL task docs. * Update EL task docs. * Update EL task docs. * Update EL task docs. * Update EL task docs. * Apply suggestions from code review Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> * Incorporate feedback. * Format. * Fix link to KB data. --------- Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> Co-authored-by: Madeesh Kannan <shadeMe@users.noreply.github.com>	2023-12-04 15:23:28 +01:00
Adriane Boyd	e467573550	Docs: update trf_data examples and pipeline design info (#13164 )	2023-12-04 15:15:54 +01:00
Raphael Mitsch	0e43fca036	Add Claude-2.1 mention. (#13167 )	2023-12-01 16:48:35 +01:00
Daniël de Kok	da7ad97519	Update `TextCatBOW` to use the fixed `SparseLinear` layer (#13149) * Update `TextCatBOW` to use the fixed `SparseLinear` layer A while ago, we fixed the `SparseLinear` layer to use all available parameters: https://github.com/explosion/thinc/pull/754 This change updates `TextCatBOW` to `v3` which uses the new `SparseLinear_v2` layer. This results in a sizeable improvement on a text categorization task that was tested. While at it, this `spacy.TextCatBOW.v3` also adds the `length_exponent` option to make it possible to change the hidden size. Ideally, we'd just have an option called `length`. But the way that `TextCatBOW` uses hashes results in a non-uniform distribution of parameters when the length is not a power of two. * Replace TextCatBOW `length_exponent` parameter by `length` We now round up the length to the next power of two if it isn't a power of two. * Remove some tests for TextCatBOW.v2 * Fix missing import	2023-11-29 09:11:54 +01:00
Raphael Mitsch	b2e831d966	LLM docs: OpenAI model update (#13119 ) * Update supported OpenAI models. * Update with new GPT-3.5 and GPT-4 versions. * Add links to OpenAI model docs.	2023-11-08 17:55:16 +01:00
Adriane Boyd	513bbd5fa3	Add preferred use of build for package CLI (#13109 ) Build with `build` if available. Warn and fall back to previous `setup.py`-based builds if `build` build fails.	2023-11-08 17:35:24 +01:00
Sofie Van Landeghem	a804b83a4b	Update llm docs to clarify task-specific factories (#13082 ) * fix typo * add examples to specify custom model for task-specific factory	2023-10-31 22:07:07 +01:00
Sofie Van Landeghem	48248c62b6	Clarify EL example in docs (#13071 ) * add comment that pipeline is a custom one * add link to NEL tutorial * prettier * revert prettier reformat * revert prettier reformat (2) * fix typo Co-authored-by: Raphael Mitsch <r.mitsch@outlook.com> --------- Co-authored-by: Raphael Mitsch <r.mitsch@outlook.com>	2023-10-31 21:58:29 +01:00
Raphael Mitsch	0c15876502	Fix spancat typo. (#13095 )	2023-10-31 13:45:10 +01:00
Raphael Mitsch	9deaac9786	Add note in docs on `score_weight` config if using a non-default `spans_key` for SpanCat (#13093 ) * Add note on score_weight if using a non-default span_key for SpanCat. * Fix formatting. * Fix formatting. * Fix typo. * Use warning infobox. * Fix infobox formatting.	2023-10-30 17:02:08 +01:00
Raphael Mitsch	d72029d9c8	Add binary examples for Textcat task in `spacy-llm` (#13051 ) * Add examples for binary classification. * Fix example. * Remove binary textcat example. Format. * Rephrase.	2023-10-11 12:23:38 +02:00
Ines Montani	b83f1e3724	Inline displaCy visualizations in docs (#13050 ) [ci skip]	2023-10-06 14:22:43 +02:00
Raphael Mitsch	be29216fe2	Merge pull request #13044 from explosion/docs/llm_main Sync `master` with `docs/llm_main`	2023-10-05 16:10:19 +02:00
Raphael Mitsch	1162fcf099	Add Mistral mentions. (#13037 )	2023-10-05 14:44:38 +02:00
Raphael Mitsch	862f8254e8	Add docs on Azure OpenAI support in `spacy-llm` (#13043 ) * Add gpt-3.5-turbo-instruct to list of supported OpenAI models. * Update `spacy-llm` task argument docs w.r.t. task refactoring (#12995) * Update task arguments w.r.t. task refactoring in 0.5.0. * Add disclaimer w.r.t. gated models/Llama 2. * Update website/docs/api/large-language-models.mdx * Update website/docs/api/large-language-models.mdx * Update docs w.r.t. PaLM support. (#13018) * Add info on spacy.Azure.v1. * Attempt to fix netlify check fails. * Attempt to fix netlify check fails. * Attempt to fix netlify check fails. * Attempt to fix netlify check fails. * Attempt to fix netlify check fails. * Attempt to fix netlify check fails. * Attempt to fix netlify check fails. * Attempt to fix netlify check fails. * Attempt to fix netlify check fails. * Format.	2023-10-05 13:18:27 +02:00
Raphael Mitsch	1dec138e61	Update docs w.r.t. PaLM support. (#13018 )	2023-10-05 08:50:41 +02:00
Adriane Boyd	6e54360a3d	Remove pathy dependency, update docs for cloudpathlib in Weasel (#13035 )	2023-10-05 08:50:22 +02:00
Raphael Mitsch	734826db79	Update `spacy-llm` task argument docs w.r.t. task refactoring (#12995 ) * Update task arguments w.r.t. task refactoring in 0.5.0. * Add disclaimer w.r.t. gated models/Llama 2. * Update website/docs/api/large-language-models.mdx * Update website/docs/api/large-language-models.mdx	2023-10-05 08:45:25 +02:00
Adriane Boyd	160e61772e	Docs for v3.7.0 (#13029 ) * Docs for v3.7.0 * Minor fixes * Extend Weasel notes * Minor edits * Update version in README	2023-10-01 21:40:07 +02:00
Adriane Boyd	406794a081	Merge remote-tracking branch 'upstream/master' into chore/update-develop-from-master-v3.7-1	2023-09-28 15:09:06 +02:00
Madeesh Kannan	b4501db6f8	Update emoji library in rule-based matcher example (#13014 )	2023-09-25 18:20:30 +02:00
Adriane Boyd	ff4215f1c7	Drop support for python 3.6 (#13009 ) * Drop support for python 3.6 * Update docs	2023-09-25 14:48:38 +02:00
Adriane Boyd	935a5455b6	Docs: add new tag for evaluate CLI --spans-keys (#13013 )	2023-09-25 11:49:28 +02:00
Eliana Vornov	4e3360ad12	add --spans-key option for CLI spancat evaluation (#12981 ) * add span key option for CLI evaluation * Rephrase CLI help to refer to Doc.spans instead of spancat * Rephrase docs to refer to Doc.spans instead of spancat --------- Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>	2023-09-25 11:25:41 +02:00
Raphael Mitsch	bef9f63e13	Add gpt-3.5-turbo-instruct to list of supported OpenAI models.	2023-09-21 11:28:58 +02:00
Sofie Van Landeghem	8f0d6b0a8c	Fix in BertTokenizer docs (#12955 ) * fix BertWordPieceTokenizer constructor call * fix * Update website/docs/usage/linguistic-features.mdx --------- Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>	2023-09-13 13:21:58 +02:00
Sofie Van Landeghem	013762be41	Few spacy-llm doc fixes (#12969 ) * fix construction example * shorten task-specific factory list * small edits to HF models * small edit to API models * typo * fix space Co-authored-by: Raphael Mitsch <r.mitsch@outlook.com> --------- Co-authored-by: Raphael Mitsch <r.mitsch@outlook.com>	2023-09-08 11:35:38 +02:00
Sofie Van Landeghem	def7013eec	Docs for spacy-llm 0.5.0 (#12968 ) * Update incorrect example config. (#12893) * spacy-llm docs cleanup (#12945) * Shorten NER section * fix template references * simplify sections * set temperature to 0.0 in examples * condense model information * fix parameters for REST models * set temperature to 0.0 * spelling fix * trigger preview * fix quotes * add small note on noop.v1 * move up example noop config * set appropriate model example configs * explain config * fix Co-authored-by: Raphael Mitsch <r.mitsch@outlook.com> --------- Co-authored-by: Raphael Mitsch <r.mitsch@outlook.com> * Docs for ner.v3 and spancat.v3 spacy-llm tasks (#12949) * formatting * update usage table with NER.v3 * fix typo in links * v3 overview of parameters * add spancat.v3 * add further v3 explanations * remove TODO comment * few more small fixes * Add doc section on LLM + task factories (#12905) * Add section on LLM + task factories. * Apply suggestions from code review --------- Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> * add default config to openai models (#12961) * Docs for spacy-llm 0.5.0 (#12967) * simplify Python example * simplify Python example * Refer only to latest OpenAI model versions from usage doc * Typo fix Co-authored-by: Raphael Mitsch <r.mitsch@outlook.com> * clarify accuracy claim --------- Co-authored-by: Raphael Mitsch <r.mitsch@outlook.com> --------- Co-authored-by: Raphael Mitsch <r.mitsch@outlook.com>	2023-09-08 10:25:14 +02:00
Magdalena Aniol	cc78847688	fix training.batch_size example (#12963 )	2023-09-06 16:38:13 +02:00
Sofie Van Landeghem	6d1f6d9a23	Fix LLM usage example (#12950 ) * fix usage example * revert back to v2 to allow hot fix on main	2023-09-04 09:05:50 +02:00
Sofie Van Landeghem	5c1f9264c2	fix typo in link (#12948 ) * fix typo in link * fix REL.v1 parameter	2023-09-01 13:47:20 +02:00
vincent d warmerdam	3e4264899c	Update large-language-models.mdx (#12944 )	2023-08-30 11:58:14 +02:00
Vinit Ravishankar	c2303858e6	Documentation for spacy-curated-transformers (#12677 ) * initial * initial documentation run * fix typo * Remove mentions of Torchscript and quantization Both are disabled in the initial release of `spacy-curated-transformers`. * Fix `piece_encoder` entries * Remove `spacy-transformers`-specific warning * Fix duplicate entries in tables * Doc fixes Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> * Remove type aliases * Fix copy-paste typo * Change `debug pieces` version tag to `3.7` * Set curated transformers API version to `3.7` * Fix transformer listener naming * Add docs for `init fill-config-transformer` * Update CLI command invocation syntax * Update intro section of the pipeline component docs * Fix source URL * Add a note to the architectures section about the `init fill-config-transformer` CLI command * Apply suggestions from code review Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> * Update CLI command name, args * Remove hyphen from the `curated-transformers.mdx` filename * Fix links * Remove placeholder text * Add text to the model/tokenizer loader sections * Fill in the `DocTransformerOutput` section * Formatting fixes * Add curated transformer page to API docs sidebar * More formatting fixes * Remove TODO comment * Remove outdated info about default config * Apply suggestions from code review Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> * Add link to HF model hub * `prettier` --------- Co-authored-by: Madeesh Kannan <shadeMe@users.noreply.github.com> Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>	2023-08-29 17:52:16 +02:00
PD Hall	d8a32c1050	docs: fix ngram_range_suggester max_size description (#12939 )	2023-08-29 11:10:58 +02:00
Connor Brinton	6dd56868de	📝 Fix formula for receptive field in docs (#12918 ) SpaCy's HashEmbedCNN layer performs convolutions over tokens to produce contextualized embeddings using a `MaxoutWindowEncoder` layer. These convolutions are implemented using Thinc's `expand_window` layer, which concatenates `window_size` neighboring sequence items on either side of the sequence item being processed. This is repeated across `depth` convolutional layers. For example, consider the sequence "ABCDE" and a `MaxoutWindowEncoder` layer with a context window of 1 and a depth of 2. We'll focus on the token "C". We can visually represent the contextual embedding produced for "C" as: ```mermaid flowchart LR A0(A<sub>0</sub>) B0(B<sub>0</sub>) C0(C<sub>0</sub>) D0(D<sub>0</sub>) E0(E<sub>0</sub>) B1(B<sub>1</sub>) C1(C<sub>1</sub>) D1(D<sub>1</sub>) C2(C<sub>2</sub>) A0 --> B1 B0 --> B1 C0 --> B1 B0 --> C1 C0 --> C1 D0 --> C1 C0 --> D1 D0 --> D1 E0 --> D1 B1 --> C2 C1 --> C2 D1 --> C2 ``` Described in words, this graph shows that before the first layer of the convolution, the "receptive field" centered at each token consists only of that same token. That is to say, that we have a receptive field of 1. The first layer of the convolution adds one neighboring token on either side to the receptive field. Since this is done on both sides, the receptive field increases by 2, giving the first layer a receptive field of 3. The second layer of the convolutions adds an _additional_ neighboring token on either side to the receptive field, giving a final receptive field of 5. However, this doesn't match the formula currently given in the docs, which read: > The receptive field of the CNN will be > `depth * (window_size * 2 + 1)`, so a 4-layer network with a window > size of `2` will be sensitive to 20 words at a time. 
Substituting in our depth of 2 and window size of 1, this formula gives us a receptive field of: ``` depth * (window_size * 2 + 1) = 2 * (1 * 2 + 1) = 2 * (2 + 1) = 2 * 3 = 6 ``` This not only doesn't match our computations from above, it's also an even number! This is suspicious, since the receptive field is supposed to be centered on a token, and not between tokens. Generally, this formula results in an even number for any even value of `depth`. The error in this formula is that the adjustment for the center token is multiplied by the depth, when it should occur only once. The corrected formula, `depth * window_size * 2 + 1`, gives the correct value for our small example from above: ``` depth * window_size * 2 + 1 = 2 * 1 * 2 + 1 = 4 + 1 = 5 ``` These changes update the docs to correct the receptive field formula and the example receptive field size.	2023-08-21 10:52:32 +02:00
Adriane Boyd	76a9f9c6c6	Docs: clarify abstract spacy.load examples (#12889 )	2023-08-16 17:28:34 +02:00
Sofie Van Landeghem	3b7faf4f5e	fix (#12881 )	2023-08-03 08:37:43 +02:00
Arman Mohammadi	07407e07ab	fix the regular expression matching on the full text (#12883) There was a mistake in the regex pattern that kept it from matching all the desired tokens. When the `r` string-literal prefix is used to mark a raw string, a backslash should be written once, not as two backslashes.	2023-08-02 16:52:26 +02:00
Adriane Boyd	0fe43f40f1	Support registered vectors (#12492 ) * Support registered vectors * Format * Auto-fill [nlp] on load from config and from bytes/disk * Only auto-fill [nlp] * Undo all changes to Language.from_disk * Expand BaseVectors These methods are needed in various places for training and vector similarity. * isort * More linting * Only fill [nlp.vectors] * Update spacy/vocab.pyx Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> * Revert changes to test related to auto-filling [nlp] * Add vectors registry * Rephrase error about vocab methods for vectors * Switch to dummy implementation for BaseVectors.to_ops * Add initial draft of docs * Remove example from BaseVectors docs * Apply suggestions from code review Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> * Update website/docs/api/basevectors.mdx Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> * Fix type and lint bpemb example * Update website/docs/api/basevectors.mdx --------- Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>	2023-08-01 15:46:08 +02:00
Sofie Van Landeghem	c9e9dccf79	Add displaCy data structures to docs (2) (#12875 ) * Add data structures to docs * Adjusted descriptions for more consistency * Add _optional_ flag to parameters * Add tests and adjust optional title key in doc * Add title to dep visualizations * fix typo --------- Co-authored-by: thomashacker <EdwardSchmuhl@web.de>	2023-07-31 10:47:57 +02:00
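
The receptive-field correction described in commit 6dd56868de can be checked with a few lines of Python. This is only an illustrative sketch and not code from spaCy or Thinc: the function names are made up for this example, and the third function simply widens a span by `window_size` on each side per layer, as the commit message describes.

```python
def receptive_field(depth: int, window_size: int) -> int:
    """Corrected formula from the commit message: depth * window_size * 2 + 1."""
    return depth * window_size * 2 + 1


def receptive_field_old(depth: int, window_size: int) -> int:
    """Formula the docs gave before the fix: depth * (window_size * 2 + 1)."""
    return depth * (window_size * 2 + 1)


def receptive_field_by_expansion(depth: int, window_size: int) -> int:
    """Count the tokens one position depends on by widening the span per layer."""
    left = right = 0  # before any layer, a token only sees itself
    for _ in range(depth):
        left -= window_size   # each layer adds window_size neighbours on the left
        right += window_size  # ... and the same number on the right
    return right - left + 1


# The "ABCDE" example from the commit message: depth 2, window size 1.
print(receptive_field(2, 1))               # corrected formula
print(receptive_field_old(2, 1))           # old formula
print(receptive_field_by_expansion(2, 1))  # layer-by-layer count
```

For depth 2 and window size 1, the corrected formula and the layer-by-layer count agree, while the old formula yields an even number, which cannot be centered on a token.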

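Commit da7ad97519 states that the `TextCatBOW` length is rounded up to the next power of two when it is not one already. The snippet below is a minimal sketch of that rounding rule only; the helper name `round_up_to_power_of_two` is hypothetical and this is not the implementation used in spaCy.

```python
def round_up_to_power_of_two(length: int) -> int:
    """Return the smallest power of two that is >= length (hypothetical helper)."""
    if length < 1:
        raise ValueError("length must be a positive integer")
    # (length - 1).bit_length() is the number of bits needed for length - 1,
    # so shifting 1 by that amount gives the next power of two at or above length.
    return 1 << (length - 1).bit_length()


for requested in (1, 5, 8, 262144, 262145):
    print(requested, "->", round_up_to_power_of_two(requested))
```

A value that is already a power of two is returned unchanged, so only other lengths are adjusted.
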

1896 Commits