mirror of https://github.com/explosion/spaCy.git (synced 2025-10-31 07:57:35 +03:00)
Updated docs w.r.t. infinite doc length.

commit d41050baba (parent d56ee65ddf)
@@ -9,8 +9,8 @@ menu:
  - ['Various Functions', 'various-functions']
---

[The `spacy-llm` package](https://github.com/explosion/spacy-llm) integrates
Large Language Models (LLMs) into spaCy, featuring a modular system for **fast
prototyping** and **prompting**, and turning unstructured responses into
**robust outputs** for various NLP tasks, **no training data** required.
@@ -202,13 +202,82 @@ not require labels.

## Tasks {id="tasks"}

In `spacy-llm`, a _task_ defines an NLP problem or question and its solution
using an LLM. It does so by implementing the following responsibilities:

1. Loading a prompt template and injecting documents' data into the prompt.
   Optionally, including few-shot examples in the prompt.
2. Splitting the prompt into several pieces following a map-reduce paradigm,
   _if_ the prompt is too long to fit into the model's context and the task
   supports sharding prompts.
3. Parsing the LLM's responses back into structured information and validating
   the parsed output.

Two different task interfaces are supported: `ShardingLLMTask` and
`NonShardingLLMTask`. Only the former supports sharding documents, i.e.
splitting up prompts if they are too long.

All tasks are registered in the `llm_tasks` registry, as sketched below.
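
For illustration, here is a minimal registration sketch. The `EchoTask` class
and the `"my.EchoTask.v1"` name are hypothetical; `registry.llm_tasks` is the
registry mentioned above:

```python
from typing import Any, Iterable

from spacy.tokens import Doc
from spacy_llm.registry import registry


class EchoTask:
    """Hypothetical toy task: prompts with the raw doc text, stores the reply."""

    def generate_prompts(self, docs: Iterable[Doc]) -> Iterable[Any]:
        # One prompt (here: the doc text itself) per input doc.
        return (doc.text for doc in docs)

    def parse_responses(
        self, docs: Iterable[Doc], responses: Iterable[Any]
    ) -> Iterable[Doc]:
        for doc, response in zip(docs, responses):
            # Store the raw reply; a real task would parse and validate it.
            doc.user_data["llm_reply"] = str(response)
            yield doc


@registry.llm_tasks("my.EchoTask.v1")
def make_echo_task() -> EchoTask:
    return EchoTask()
```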

### On Sharding {id="sharding"}

"Sharding" describes, generally speaking, the process of distributing parts of
a dataset across multiple storage units for easier processing and lookups. In
`spacy-llm` we use this term (synonymously: "mapping") to describe the
splitting up of prompts if they are too long for a model to handle, and
"fusing" (synonymously: "reducing") to describe how the model responses for
several shards are merged back together into a single document.

Prompts are broken up in a manner that _always_ keeps the instructions in the
prompt template intact, meaning that the instructions to the LLM will always
stay complete. The document content, however, will be split if the length of
the fully rendered prompt exceeds the model's context length.

A toy example: let's assume a model has a context window of 25 tokens and the
prompt template for our fictional, sharding-supporting task looks like this:

```
Estimate the sentiment of this text:
"{text}"
Estimated sentiment:
```

Depending on how exactly tokens are counted (this is a config setting), we
might come up with `n = 12` tokens for the prompt instructions. Furthermore,
let's assume that our `text` is "This has been amazing - I can't remember the
last time I left the cinema so impressed." - which has roughly 19 tokens.

Considering we only have 13 tokens left to add to our prompt before we hit the
context limit, we'll have to split our prompt into two parts. Thus `spacy-llm`,
assuming the task used supports sharding, will split the prompt into two (the
default splitting strategy splits by tokens, but alternative splitting
strategies, e.g. splitting by sentences, can be configured):

_(Prompt 1/2)_

```
Estimate the sentiment of this text:
"This has been amazing - I can't remember "
Estimated sentiment:
```

_(Prompt 2/2)_

```
Estimate the sentiment of this text:
"the last time I left the cinema so impressed."
Estimated sentiment:
```

The reduction step is task-specific - a sentiment estimation task might, for
example, compute a weighted average of the per-shard sentiment scores. Note
that prompt sharding introduces potential inaccuracies, as the LLM won't have
access to the entire document at once. Depending on your use case this might
or might not be problematic.
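
To make the arithmetic above concrete, here is a minimal, purely illustrative
sketch of the map step. It is not the `spacy-llm` implementation: it assumes
whitespace tokenization and hard-codes the toy numbers from this example:

```python
from typing import List

TEMPLATE = 'Estimate the sentiment of this text:\n"{text}"\nEstimated sentiment:'
CONTEXT_LEN = 25   # toy context window, in tokens
TEMPLATE_LEN = 12  # tokens taken up by the prompt instructions (see above)


def split_into_prompts(text: str) -> List[str]:
    """Render one prompt per chunk, so each prompt fits the context window."""
    budget = CONTEXT_LEN - TEMPLATE_LEN  # 13 tokens of text per shard
    tokens = text.split()  # crude whitespace "tokenization", for illustration
    chunks = [" ".join(tokens[i : i + budget]) for i in range(0, len(tokens), budget)]
    return [TEMPLATE.format(text=chunk) for chunk in chunks]


prompts = split_into_prompts(
    "This has been amazing - I can't remember the last time I left the "
    "cinema so impressed."
)
for i, prompt in enumerate(prompts, start=1):
    print(f"(Prompt {i}/{len(prompts)})\n{prompt}\n")
```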

### `NonShardingLLMTask` {id="task-nonsharding"}

#### task.generate_prompts {id="task-nonsharding-generate-prompts"}

Takes a collection of documents, and returns a collection of "prompts", which
can be of type `Any`. Often, prompts are of type `str` - but this is not

@@ -219,7 +288,7 @@ enforced to allow for maximum flexibility in the framework.
| `docs`      | The input documents. ~~Iterable[Doc]~~   |
| **RETURNS** | The generated prompts. ~~Iterable[Any]~~ |

#### task.parse_responses {id="task-nonsharding-parse-responses"}

Takes a collection of LLM responses and the original documents, parses the
responses into structured information, and sets the annotations on the

@@ -230,19 +299,44 @@ defined fields.
The `responses` are of type `Iterable[Any]`, though they will often be `str`
objects. This depends on the return type of the [model](#models).

| Argument    | Description                                            |
| ----------- | ------------------------------------------------------ |
| `docs`      | The input documents. ~~Iterable[Doc]~~                 |
| `responses` | The responses received from the LLM. ~~Iterable[Any]~~ |
| **RETURNS** | The annotated documents. ~~Iterable[Doc]~~             |
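
As a concrete (and hypothetical) example of this interface, a sentiment task
matching the two signatures above might look like the following sketch. The
prompt, the score clamping, and the `"sentiment"` category name are
illustrative assumptions, not part of `spacy-llm`:

```python
from typing import Any, Iterable

from spacy.tokens import Doc


class SimpleSentimentTask:
    """Illustrative non-sharding task: one sentiment prompt per doc."""

    _template = 'Estimate the sentiment of this text:\n"{text}"\nEstimated sentiment:'

    def generate_prompts(self, docs: Iterable[Doc]) -> Iterable[str]:
        # One rendered prompt (here: a str) per input doc.
        return (self._template.format(text=doc.text) for doc in docs)

    def parse_responses(
        self, docs: Iterable[Doc], responses: Iterable[Any]
    ) -> Iterable[Doc]:
        for doc, response in zip(docs, responses):
            try:
                # Validate: the reply must parse as a number, clamped to [0, 1].
                score = min(max(float(str(response).strip()), 0.0), 1.0)
            except ValueError:
                score = 0.5  # Fall back to a neutral score on unparsable replies.
            doc.cats["sentiment"] = score
            yield doc
```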

### `ShardingLLMTask` {id="task-sharding"}

#### task.generate_prompts {id="task-sharding-generate-prompts"}

Takes a collection of documents, breaks them up into shards if necessary to
fit all content into the model's context, and returns a collection of
collections of "prompts" (i.e. each doc can have multiple shards, each of
which has exactly one prompt), which can be of type `Any`. Often, prompts are
of type `str` - but this is not enforced to allow for maximum flexibility in
the framework.

| Argument    | Description                                        |
| ----------- | -------------------------------------------------- |
| `docs`      | The input documents. ~~Iterable[Doc]~~             |
| **RETURNS** | The generated prompts. ~~Iterable[Iterable[Any]]~~ |

#### task.parse_responses {id="task-sharding-parse-responses"}

Receives a collection of collections of LLM responses (i.e. each doc can have
multiple shards, each of which has exactly one prompt / prompt response) and
the original shards, parses the responses into structured information, sets
the annotations on the shards, and merges the doc shards back into single
docs. The `parse_responses` function is free to set the annotations in any
way, including `Doc` fields like `ents`, `spans` or `cats`, or using custom
defined fields.

The `responses` are of type `Iterable[Iterable[Any]]`, though they will often
be `str` objects. This depends on the return type of the [model](#models).

| Argument    | Description                                                      |
| ----------- | ---------------------------------------------------------------- |
| `shards`    | The input document shards. ~~Iterable[Iterable[Doc]]~~           |
| `responses` | The responses received from the LLM. ~~Iterable[Iterable[Any]]~~ |
| **RETURNS** | The annotated documents. ~~Iterable[Doc]~~                       |
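
A hedged sketch of what such a `parse_responses` could look like for the toy
sentiment task above, using spaCy's `Doc.from_docs` to merge the shards; the
per-shard score parsing and the token-length weighting are illustrative
assumptions:

```python
from typing import Any, Iterable, List

from spacy.tokens import Doc


def parse_responses(
    shards: Iterable[Iterable[Doc]], responses: Iterable[Iterable[Any]]
) -> Iterable[Doc]:
    for doc_shards, doc_responses in zip(shards, responses):
        shard_list: List[Doc] = list(doc_shards)
        # One sentiment score per shard, parsed from the raw LLM replies.
        scores = [float(str(r).strip()) for r in doc_responses]
        # Fuse: merge the shards back into a single doc ...
        merged = Doc.from_docs(shard_list)
        # ... and attach a token-length-weighted average of the shard scores.
        weights = [len(shard) for shard in shard_list]
        merged.cats["sentiment"] = (
            sum(s * w for s, w in zip(scores, weights)) / sum(weights)
        )
        yield merged
```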

### Translation {id="translation"}

@@ -295,6 +389,14 @@ target_lang = "Spanish"
path = "translation_examples.yml"

### Raw prompting {id="raw"}

Unlike all other tasks, `spacy.Raw.vX` doesn't send the model a specific
prompt wrapping the doc data. Instead it instructs the model to reply to the
doc content directly. This is handy for use cases like question answering
(where each doc contains one question) or if you want to include customized
prompts for each doc.
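
For example, a pipeline could select this task via its registered name in the
`llm_tasks` registry. This sketch omits the model block, so a default model
(and its credentials) would be assumed at runtime:

```python
import spacy

nlp = spacy.blank("en")
# Pick the task via its registered name; model configuration omitted here.
nlp.add_pipe("llm", config={"task": {"@llm_tasks": "spacy.Raw.v1"}})
# Running nlp("What is the capital of France?") would now send the doc
# content itself to the model as the prompt.
```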

#### spacy.Raw.v1 {id="raw-v1"}

Note that since this task may request arbitrary information, it doesn't do any
@@ -1239,9 +1341,15 @@ A _model_ defines which LLM model to query, and how to query it. It can be a
simple function taking a collection of prompts (consistent with the output
type of `task.generate_prompts()`) and returning a collection of responses
(consistent with the expected input of `parse_responses`). Generally speaking,
it's a function of type
`Callable[[Iterable[Iterable[Any]]], Iterable[Iterable[Any]]]`, but specific
implementations can have other signatures, like
`Callable[[Iterable[Iterable[str]]], Iterable[Iterable[str]]]`.

Note that the model signature expects a nested iterable so it's able to deal
with sharded docs. Unsharded docs (i.e. those produced by
[non-sharding tasks](/api/large-language-models#task-nonsharding)) are
reshaped to fit the expected data structure.
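
A hedged sketch of such a model callable; `call_backend` is a placeholder for
whatever LLM client is actually used:

```python
from typing import Iterable


def call_backend(prompt: str) -> str:
    # Placeholder for an actual LLM client call.
    raise NotImplementedError


def custom_model(prompts: Iterable[Iterable[str]]) -> Iterable[Iterable[str]]:
    # Outer iterable: docs; inner iterable: one prompt per shard of that doc.
    for prompts_for_doc in prompts:
        yield [call_backend(prompt) for prompt in prompts_for_doc]
```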

### Models via REST API {id="models-rest"}
@@ -340,15 +340,45 @@ A _task_ defines an NLP problem or question that will be sent to the LLM via a
prompt. Further, the task defines how to parse the LLM's responses back into
structured information. All tasks are registered in the `llm_tasks` registry.

Practically speaking, a task should adhere to the `Protocol` named `LLMTask`
defined in
[`ty.py`](https://github.com/explosion/spacy-llm/blob/main/spacy_llm/ty.py).
It needs to define a `generate_prompts` function and a `parse_responses`
function (a structural sketch follows after the tables below).

Tasks may support prompt sharding (for more info see the API docs on
[sharding](/api/large-language-models#task-sharding) and
[non-sharding](/api/large-language-models#task-nonsharding) tasks). The
function signatures for `generate_prompts` and `parse_responses` depend on
whether they do.
For tasks _not supporting_ sharding:

| Task                                                                                     | Description                                                                                                                                                   |
| ---------------------------------------------------------------------------------------- | --------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| [`task.generate_prompts`](/api/large-language-models#task-nonsharding-generate-prompts) | Takes a collection of documents, and returns a collection of prompts, which can be of type `Any`.                                                             |
| [`task.parse_responses`](/api/large-language-models#task-nonsharding-parse-responses)   | Takes a collection of LLM responses and the original documents, parses the responses into structured information, and sets the annotations on the documents. |

For tasks _supporting_ sharding:

| Task                                                                                  | Description                                                                                                                                                                                                                    |
| -------------------------------------------------------------------------------------- | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| [`task.generate_prompts`](/api/large-language-models#task-sharding-generate-prompts) | Takes a collection of documents, and returns a collection of collections of prompt shards, which can be of type `Any`.                                                                                                        |
| [`task.parse_responses`](/api/large-language-models#task-sharding-parse-responses)   | Takes a collection of collections of LLM responses (one per prompt shard) and the original documents, parses the responses into structured information, sets the annotations on the doc shards, and merges those doc shards back into a single doc instance. |
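
Structurally, the two shapes look like the following sketch. This is
illustrative only: the authoritative `Protocol` definitions live in `ty.py`,
and the class names here are made up:

```python
from typing import Any, Iterable, Protocol

from spacy.tokens import Doc


class NonShardingTaskShape(Protocol):
    # Duck-typed shape of a task that does not support sharding.
    def generate_prompts(self, docs: Iterable[Doc]) -> Iterable[Any]:
        ...

    def parse_responses(
        self, docs: Iterable[Doc], responses: Iterable[Any]
    ) -> Iterable[Doc]:
        ...


class ShardingTaskShape(Protocol):
    # Duck-typed shape of a task that supports sharding: nested iterables,
    # one inner collection of prompts/responses per doc.
    def generate_prompts(self, docs: Iterable[Doc]) -> Iterable[Iterable[Any]]:
        ...

    def parse_responses(
        self, shards: Iterable[Iterable[Doc]], responses: Iterable[Iterable[Any]]
    ) -> Iterable[Doc]:
        ...
```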

Moreover, the task may define an optional [`scorer` method](/api/scorer#score).
It should accept an iterable of `Example` objects as input and return a score

@@ -370,7 +400,9 @@ evaluate the component.
| [`spacy.TextCat.v2`](/api/large-language-models#textcat-v2)         | Version 2 builds on v1 and includes an improved prompt template.                                                  |
| [`spacy.TextCat.v1`](/api/large-language-models#textcat-v1)         | Version 1 of the built-in TextCat task supports both zero-shot and few-shot prompting.                            |
| [`spacy.Lemma.v1`](/api/large-language-models#lemma-v1)             | Lemmatizes the provided text and updates the `lemma_` attribute of the tokens accordingly.                        |
| [`spacy.Raw.v1`](/api/large-language-models#raw-v1)                 | Executes raw doc content as a prompt to the LLM.                                                                  |
| [`spacy.Sentiment.v1`](/api/large-language-models#sentiment-v1)     | Performs sentiment analysis on provided texts.                                                                    |
| [`spacy.Translation.v1`](/api/large-language-models#translation-v1) | Translates doc content into the specified target language.                                                        |
| [`spacy.NoOp.v1`](/api/large-language-models#noop-v1)               | This task is only useful for testing - it tells the LLM to do nothing, and does not set any fields on the `docs`. |

#### Providing examples for few-shot prompts {id="few-shot-prompts"}