	Updated docs w.r.t. infinite doc length.
parent d56ee65ddf
commit d41050baba
				|  | @ -9,8 +9,8 @@ menu: | |||
|   - ['Various Functions', 'various-functions'] | ||||
| --- | ||||
| 
 | ||||
| [The spacy-llm package](https://github.com/explosion/spacy-llm) integrates Large | ||||
| Language Models (LLMs) into spaCy, featuring a modular system for **fast | ||||
| [The `spacy-llm` package](https://github.com/explosion/spacy-llm) integrates | ||||
| Large Language Models (LLMs) into spaCy, featuring a modular system for **fast | ||||
| prototyping** and **prompting**, and turning unstructured responses into | ||||
| **robust outputs** for various NLP tasks, **no training data** required. | ||||
| 
 | ||||
|  | @ -202,13 +202,82 @@ not require labels. | |||
| 
 | ||||
| ## Tasks {id="tasks"} | ||||
| 
 | ||||
| ### Task implementation {id="task-implementation"} | ||||
| In `spacy-llm`, a _task_ defines an NLP problem or question and its solution | ||||
| using an LLM. It does so by implementing the following responsibilities: | ||||
| 
 | ||||
| A _task_ defines an NLP problem or question, that will be sent to the LLM via a | ||||
| prompt. Further, the task defines how to parse the LLM's responses back into | ||||
| structured information. All tasks are registered in the `llm_tasks` registry. | ||||
| 1. Loading a prompt template and injecting documents' data into the prompt, | ||||
|    optionally including few-shot examples in the prompt. | ||||
| 2. Splitting the prompt into several pieces following a map-reduce paradigm, | ||||
|    _if_ the prompt is too long to fit into the model's context and the task | ||||
|    supports sharding prompts. | ||||
| 3. Parsing the LLM's responses back into structured information and validating | ||||
|    the parsed output. | ||||
| 
 | ||||
| #### task.generate_prompts {id="task-generate-prompts"} | ||||
| Two different task interfaces are supported: `ShardingLLMTask` and | ||||
| `NonShardingLLMTask`. Only the former supports the sharding of documents, i.e. | ||||
| splitting up prompts if they are too long. | ||||
| 
 | ||||
| All tasks are registered in the `llm_tasks` registry. | ||||
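| 
 | ||||
| For illustration, a minimal custom non-sharding task covering responsibilities | ||||
| 1 and 3 might look like the following sketch. The task name, prompt template | ||||
| and `my_sentiment` extension are hypothetical, not part of `spacy-llm`: | ||||
| 
 | ||||
| ```python | ||||
| from typing import Any, Iterable | ||||
| 
 | ||||
| from spacy.tokens import Doc | ||||
| from spacy_llm.registry import registry | ||||
| 
 | ||||
| Doc.set_extension("my_sentiment", default=None) | ||||
| 
 | ||||
| 
 | ||||
| class MySentimentTask: | ||||
|     _TEMPLATE = 'Estimate the sentiment of this text:\n"{text}"\nEstimated sentiment:' | ||||
| 
 | ||||
|     def generate_prompts(self, docs: Iterable[Doc]) -> Iterable[str]: | ||||
|         # Responsibility 1: inject each doc's data into the prompt template. | ||||
|         for doc in docs: | ||||
|             yield self._TEMPLATE.format(text=doc.text) | ||||
| 
 | ||||
|     def parse_responses( | ||||
|         self, docs: Iterable[Doc], responses: Iterable[Any] | ||||
|     ) -> Iterable[Doc]: | ||||
|         # Responsibility 3: parse and validate the raw LLM responses. | ||||
|         for doc, response in zip(docs, responses): | ||||
|             try: | ||||
|                 doc._.my_sentiment = float(str(response).strip()) | ||||
|             except ValueError: | ||||
|                 doc._.my_sentiment = None | ||||
|             yield doc | ||||
| 
 | ||||
| 
 | ||||
| @registry.llm_tasks("my.Sentiment.v1") | ||||
| def make_my_sentiment_task() -> MySentimentTask: | ||||
|     return MySentimentTask() | ||||
| ``` | ||||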
| 
 | ||||
| ### On Sharding {id="sharding"} | ||||
| 
 | ||||
| "Sharding" describes, generally speaking, the process of distributing parts of a | ||||
| dataset across multiple storage units for easier processing and lookups. In | ||||
| `spacy-llm` we use this term (synonymously: "mapping") to describe the splitting | ||||
| up of prompts if they are too long for a model to handle, and "fusing" | ||||
| (synonymously: "reducing") to describe how the model responses for several shars | ||||
| are merged back together into a single document. | ||||
| 
 | ||||
| Prompts are broken up in a manner that _always_ keeps the prompt in the template | ||||
| intact, meaning that the instructions to the LLM will always stay complete. The | ||||
| document content, however, will be split if the length of the fully rendered | ||||
| prompt exceeds the model's context length. | ||||
| 
 | ||||
| A toy example: let's assume a model has a context window of 25 tokens and the | ||||
| prompt template for our fictional, sharding-supporting task looks like this: | ||||
| 
 | ||||
| ``` | ||||
| Estimate the sentiment of this text: | ||||
| "{text}" | ||||
| Estimated sentiment: | ||||
| ``` | ||||
| 
 | ||||
| Depending on how exactly tokens are counted (this is a config setting), we might | ||||
| come up with `n = 12` tokens for the number of tokens in the prompt | ||||
| instructions. Furthermore, let's assume that our `text` is "This has been | ||||
| amazing - I can't remember the last time I left the cinema so impressed." - | ||||
| which has roughly 19 tokens. | ||||
| 
 | ||||
| Considering we only have 13 tokens to add to our prompt before we hit the | ||||
| context limit, we'll have to split our prompt into two parts. Thus `spacy-llm`, | ||||
| assuming the used task supports sharding, will split the prompt into two (the | ||||
| default strategy splits by tokens, but alternative strategies, e.g. splitting | ||||
| by sentences, can be configured): | ||||
| 
 | ||||
| _(Prompt 1/2)_ | ||||
| 
 | ||||
| ``` | ||||
| Estimate the sentiment of this text: | ||||
| "This has been amazing - I can't remember " | ||||
| Estimated sentiment: | ||||
| ``` | ||||
| 
 | ||||
| _(Prompt 2/2)_ | ||||
| 
 | ||||
| ``` | ||||
| Estimate the sentiment of this text: | ||||
| "the last time I left the cinema so impressed." | ||||
| Estimated sentiment: | ||||
| ``` | ||||
| 
 | ||||
| The reduction step is task-specific - a sentiment estimation task might e.g. do | ||||
| a weighted average of the sentiment scores. Note that prompt sharding introduces | ||||
| potential inaccuracies, as the LLM won't have access to the entire document at | ||||
| once. Depending on your use case this might or might not be problematic. | ||||
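| 
 | ||||
| To make the fusion step concrete: assuming the two shards above were scored 0.9 | ||||
| and 0.7 respectively, a length-weighted average could fuse the scores as | ||||
| follows (a toy sketch, not actual `spacy-llm` code): | ||||
| 
 | ||||
| ```python | ||||
| # Toy fusion ("reduce") step: length-weighted average of per-shard scores. | ||||
| shard_scores = [0.9, 0.7]  # hypothetical sentiment score per shard | ||||
| shard_lengths = [10, 9]    # approximate token count of each text shard | ||||
| fused = sum(s * n for s, n in zip(shard_scores, shard_lengths)) / sum(shard_lengths) | ||||
| print(round(fused, 2))  # 0.81 | ||||
| ``` | ||||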
| 
 | ||||
| ### `NonShardingLLMTask` {id="task-nonsharding"} | ||||
| 
 | ||||
| #### task.generate_prompts {id="task-nonsharding-generate-prompts"} | ||||
| 
 | ||||
| Takes a collection of documents, and returns a collection of "prompts", which | ||||
| can be of type `Any`. Often, prompts are of type `str` - but this is not | ||||
|  | @ -219,7 +288,7 @@ enforced to allow for maximum flexibility in the framework. | |||
| | `docs`      | The input documents. ~~Iterable[Doc]~~   | | ||||
| | **RETURNS** | The generated prompts. ~~Iterable[Any]~~ | | ||||
| 
 | ||||
| #### task.parse_responses {id="task-parse-responses"} | ||||
| #### task.parse_responses {id="task-non-sharding-parse-responses"} | ||||
| 
 | ||||
| Takes a collection of LLM responses and the original documents, parses the | ||||
| responses into structured information, and sets the annotations on the | ||||
|  | @ -231,18 +300,43 @@ The `responses` are of type `Iterable[Any]`, though they will often be `str` | |||
| objects. This depends on the return type of the [model](#models). | ||||
| 
 | ||||
| | Argument    | Description                                            | | ||||
| | ----------- | ------------------------------------------ | | ||||
| | ----------- | ------------------------------------------------------ | | ||||
| | `docs`      | The input documents. ~~Iterable[Doc]~~                 | | ||||
| | `responses` | The generated prompts. ~~Iterable[Any]~~   | | ||||
| | `responses` | The responses received from the LLM. ~~Iterable[Any]~~ | | ||||
| | **RETURNS** | The annotated documents. ~~Iterable[Doc]~~             | | ||||
| 
 | ||||
| ### Raw prompting {id="raw"} | ||||
| ### `ShardingLLMTask` {id="task-sharding"} | ||||
| 
 | ||||
| Different to all other tasks `spacy.Raw.vX` doesn't provide a specific prompt, | ||||
| wrapping doc data, to the model. Instead it instructs the model to reply to the | ||||
| doc content. This is handy for use cases like question answering (where each doc | ||||
| contains one question) or if you want to include customized prompts for each | ||||
| doc. | ||||
| #### task.generate_prompts {id="task-sharding-generate-prompts"} | ||||
| 
 | ||||
| Takes a collection of documents, breaks them up into shards if necessary to fit | ||||
| all content into the model's context, and returns a collection of collections of | ||||
| "prompts" (i. e. each doc can have multiple shards, each of which have exactly | ||||
| one prompt), which can be of type `Any`. Often, prompts are of type `str` - but | ||||
| this is not enforced to allow for maximum flexibility in the framework. | ||||
| 
 | ||||
| | Argument    | Description                                        | | ||||
| | ----------- | -------------------------------------------------- | | ||||
| | `docs`      | The input documents. ~~Iterable[Doc]~~             | | ||||
| | **RETURNS** | The generated prompts. ~~Iterable[Iterable[Any]]~~ | | ||||
| 
 | ||||
| #### task.parse_responses {id="task-sharding-parse-responses"} | ||||
| 
 | ||||
| Receives a collection of collections of LLM responses (i.e. each doc can have | ||||
| multiple shards, each of which has exactly one prompt / prompt response) and | ||||
| the original shards, parses the responses into structured information, sets the | ||||
| annotations on the shards, and merges the doc shards back into single docs. The | ||||
| `parse_responses` function is free to set the annotations in any way, including | ||||
| `Doc` fields like `ents`, `spans` or `cats`, or using custom-defined fields. | ||||
| 
 | ||||
| The `responses` are of type `Iterable[Iterable[Any]]`, though they will often be | ||||
| `str` objects. This depends on the return type of the [model](#models). | ||||
| 
 | ||||
| | Argument    | Description                                                      | | ||||
| | ----------- | ---------------------------------------------------------------- | | ||||
| | `shards`    | The input document shards. ~~Iterable[Iterable[Doc]]~~           | | ||||
| | `responses` | The responses received from the LLM. ~~Iterable[Iterable[Any]]~~ | | ||||
| | **RETURNS** | The annotated documents. ~~Iterable[Doc]~~                       | | ||||
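| 
 | ||||
| As a rough sketch, a `parse_responses` for a sentiment-style sharding task might | ||||
| look like this (reusing the hypothetical `my_sentiment` extension from the task | ||||
| sketch above, and assuming each shard response is a single score): | ||||
| 
 | ||||
| ```python | ||||
| from typing import Any, Iterable, List | ||||
| 
 | ||||
| from spacy.tokens import Doc | ||||
| 
 | ||||
| 
 | ||||
| def parse_responses( | ||||
|     shards: Iterable[Iterable[Doc]], responses: Iterable[Iterable[Any]] | ||||
| ) -> Iterable[Doc]: | ||||
|     for doc_shards, doc_responses in zip(shards, responses): | ||||
|         shard_list: List[Doc] = list(doc_shards) | ||||
|         scores = [float(str(r).strip()) for r in doc_responses] | ||||
|         # Fuse the doc shards back into a single doc... | ||||
|         fused = Doc.from_docs(shard_list) | ||||
|         # ...and reduce the per-shard scores with a length-weighted average. | ||||
|         weights = [len(shard) for shard in shard_list] | ||||
|         fused._.my_sentiment = sum(s * w for s, w in zip(scores, weights)) / sum(weights) | ||||
|         yield fused | ||||
| ``` | ||||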
| 
 | ||||
| ### Translation {id="translation"} | ||||
| 
 | ||||
|  | @ -295,6 +389,14 @@ target_lang = "Spanish" | |||
| path = "translation_examples.yml" | ||||
| ``` | ||||
| 
 | ||||
| ### Raw prompting {id="raw"} | ||||
| 
 | ||||
| Unlike all other tasks, `spacy.Raw.vX` doesn't wrap the doc data in a | ||||
| task-specific prompt. Instead it sends the doc content itself to the model as | ||||
| the prompt. This is handy for use cases like question answering (where each doc | ||||
| contains one question) or if you want to include customized prompts for each | ||||
| doc. | ||||
| 
 | ||||
| #### spacy.Raw.v1 {id="raw-v1"} | ||||
| 
 | ||||
| Note that since this task may request arbitrary information, it doesn't do any | ||||
|  | @ -1239,9 +1341,15 @@ A _model_ defines which LLM model to query, and how to query it. It can be a | |||
| simple function taking a collection of prompts (consistent with the output type | ||||
| of `task.generate_prompts()`) and returning a collection of responses | ||||
| (consistent with the expected input of `parse_responses`). Generally speaking, | ||||
| it's a function of type `Callable[[Iterable[Any]], Iterable[Any]]`, but specific | ||||
| it's a function of type | ||||
| `Callable[[Iterable[Iterable[Any]]], Iterable[Iterable[Any]]]`, but specific | ||||
| implementations can have other signatures, like | ||||
| `Callable[[Iterable[str]], Iterable[str]]`. | ||||
| `Callable[[Iterable[Iterable[str]]], Iterable[Iterable[str]]]`. | ||||
| 
 | ||||
| Note: the model signature expects a nested iterable so it's able to deal with | ||||
| sharded docs. Unsharded docs (i.e. those produced by | ||||
| [non-sharding tasks](/api/large-language-models#task-nonsharding)) are reshaped | ||||
| to fit the expected data structure. | ||||
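| 
 | ||||
| As an illustration, a trivial custom model honoring this nested signature could | ||||
| be registered as follows (a sketch with hypothetical names - it merely echoes | ||||
| the prompts instead of querying an actual LLM): | ||||
| 
 | ||||
| ```python | ||||
| from typing import Iterable | ||||
| 
 | ||||
| from spacy_llm.registry import registry | ||||
| 
 | ||||
| 
 | ||||
| @registry.llm_models("my.EchoModel.v1") | ||||
| def make_echo_model(): | ||||
|     def echo(prompts: Iterable[Iterable[str]]) -> Iterable[Iterable[str]]: | ||||
|         # One inner iterable per doc, one response per prompt shard. | ||||
|         for shard_prompts in prompts: | ||||
|             yield [f"Received: {p}" for p in shard_prompts] | ||||
| 
 | ||||
|     return echo | ||||
| ``` | ||||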
| 
 | ||||
| ### Models via REST API {id="models-rest"} | ||||
| 
 | ||||
|  |  | |||
|  | @ -340,15 +340,45 @@ A _task_ defines an NLP problem or question, that will be sent to the LLM via a | |||
| prompt. Further, the task defines how to parse the LLM's responses back into | ||||
| structured information. All tasks are registered in the `llm_tasks` registry. | ||||
| 
 | ||||
| Practically speaking, a task should adhere to the `Protocol` `LLMTask` defined | ||||
| in [`ty.py`](https://github.com/explosion/spacy-llm/blob/main/spacy_llm/ty.py). | ||||
| It needs to define a `generate_prompts` function and a `parse_responses` | ||||
| function. | ||||
| Practically speaking, a task should adhere to the `Protocol` named `LLMTask` | ||||
| defined in | ||||
| [`ty.py`](https://github.com/explosion/spacy-llm/blob/main/spacy_llm/ty.py). It | ||||
| needs to define a `generate_prompts` function and a `parse_responses` function. | ||||
| 
 | ||||
| | Task                                                                        | Description                                                                                                                                                  | | ||||
| | --------------------------------------------------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------ | | ||||
| | [`task.generate_prompts`](/api/large-language-models#task-generate-prompts) | Takes a collection of documents, and returns a collection of "prompts", which can be of type `Any`.                                                          | | ||||
| | [`task.parse_responses`](/api/large-language-models#task-parse-responses)   | Takes a collection of LLM responses and the original documents, parses the responses into structured information, and sets the annotations on the documents. | | ||||
| Tasks may support prompt sharding (for more info see the API docs on | ||||
| [sharding](/api/large-language-models#task-sharding) and | ||||
| [non-sharding](/api/large-language-models#task-nonsharding) tasks). The function | ||||
| signatures for `generate_prompts` and `parse_responses` depend on whether they | ||||
| do. | ||||
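| 
 | ||||
| The rough shape of the two interfaces is sketched below (signatures simplified; | ||||
| see `ty.py` for the authoritative protocol definitions): | ||||
| 
 | ||||
| ```python | ||||
| from typing import Any, Iterable, Protocol | ||||
| 
 | ||||
| from spacy.tokens import Doc | ||||
| 
 | ||||
| 
 | ||||
| class NonShardingLLMTask(Protocol): | ||||
|     def generate_prompts(self, docs: Iterable[Doc]) -> Iterable[Any]: | ||||
|         ... | ||||
| 
 | ||||
|     def parse_responses( | ||||
|         self, docs: Iterable[Doc], responses: Iterable[Any] | ||||
|     ) -> Iterable[Doc]: | ||||
|         ... | ||||
| 
 | ||||
| 
 | ||||
| class ShardingLLMTask(Protocol): | ||||
|     def generate_prompts(self, docs: Iterable[Doc]) -> Iterable[Iterable[Any]]: | ||||
|         ... | ||||
| 
 | ||||
|     def parse_responses( | ||||
|         self, shards: Iterable[Iterable[Doc]], responses: Iterable[Iterable[Any]] | ||||
|     ) -> Iterable[Doc]: | ||||
|         ... | ||||
| ``` | ||||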
| 
 | ||||
| _For tasks *not supporting* sharding:_ | ||||
| 
 | ||||
| | Task                                                                                     | Description                                                                                                                                                   | | ||||
| | ---------------------------------------------------------------------------------------- | --------------------------------------------------------------------------------------------------------------------------------------------------------------- | | ||||
| | [`task.generate_prompts`](/api/large-language-models#task-nonsharding-generate-prompts) | Takes a collection of documents, and returns a collection of prompts, which can be of type `Any`.                                                             | | ||||
| | [`task.parse_responses`](/api/large-language-models#task-nonsharding-parse-responses)   | Takes a collection of LLM responses and the original documents, parses the responses into structured information, and sets the annotations on the documents. | | ||||
| 
 | ||||
| _For tasks *supporting* sharding:_ | ||||
| 
 | ||||
| | Task                                                                                  | Description                                                                                                                                                                                                                                                   | | ||||
| | --------------------------------------------------------------------------------------- | ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | | ||||
| | [`task.generate_prompts`](/api/large-language-models#task-sharding-generate-prompts) | Takes a collection of documents, and returns a collection of collections of prompt shards, which can be of type `Any`.                                                                                                                                       | | ||||
| | [`task.parse_responses`](/api/large-language-models#task-sharding-parse-responses)   | Takes a collection of collections of LLM responses (one per prompt shard) and the original documents, parses the responses into structured information, sets the annotations on the doc shards, and merges those doc shards back into a single doc instance. | | ||||
| 
 | ||||
| Moreover, the task may define an optional [`scorer` method](/api/scorer#score). | ||||
| It should accept an iterable of `Example` objects as input and return a score | ||||
|  | @ -370,7 +400,9 @@ evaluate the component. | |||
| | [`spacy.TextCat.v2`](/api/large-language-models#textcat-v2)             | Version 2 builds on v1 and includes an improved prompt template.                                                  | | ||||
| | [`spacy.TextCat.v1`](/api/large-language-models#textcat-v1)             | Version 1 of the built-in TextCat task supports both zero-shot and few-shot prompting.                            | | ||||
| | [`spacy.Lemma.v1`](/api/large-language-models#lemma-v1)                 | Lemmatizes the provided text and updates the `lemma_` attribute of the tokens accordingly.                        | | ||||
| | [`spacy.Raw.v1`](/api/large-language-models#raw-v1)                     | Sends the raw doc content to the LLM as the prompt.                                                               | | ||||
| | [`spacy.Sentiment.v1`](/api/large-language-models#sentiment-v1)         | Performs sentiment analysis on provided texts.                                                                    | | ||||
| | [`spacy.Translation.v1`](/api/large-language-models#translation-v1)     | Translates doc content into the specified target language.                                                        | | ||||
| | [`spacy.NoOp.v1`](/api/large-language-models#noop-v1)                   | This task is only useful for testing - it tells the LLM to do nothing, and does not set any fields on the `docs`. | | ||||
| 
 | ||||
| #### Providing examples for few-shot prompts {id="few-shot-prompts"} | ||||
|  |  | |||