Mirror of https://github.com/explosion/spaCy.git (synced 2025-10-31 16:07:41 +03:00)

Update docs [ci skip]

commit 9c25656ccc
parent 2285e59765

@@ -179,6 +179,7 @@ of objects by referring to creation functions, including functions you register
 yourself. For details on how to get started with training your own model, check
 out the [training quickstart](/usage/training#quickstart).
 
+<!-- TODO:
 <Project id="en_core_bert">
 
 The easiest way to get started is to clone a transformers-based project
@@ -186,6 +187,7 @@ template. Swap in your data, edit the settings and hyperparameters and train,
 evaluate, package and visualize your model.
 
 </Project>
+-->
 
 The `[components]` section in the [`config.cfg`](/api/data-formats#config)
 describes the pipeline components and the settings used to construct them,
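
The hunk above ends on the `[components]` section of the training config. For context, a minimal sketch of what such a block can look like (illustrative only: the pipeline and factory names here are an example, not something this commit adds):

```ini
# Minimal, illustrative config.cfg sketch: one built-in component
[nlp]
lang = "en"
pipeline = ["tagger"]

[components.tagger]
# Construct the component from the built-in "tagger" factory with default settings
factory = "tagger"
```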

@@ -33,6 +33,7 @@ and prototypes and ship your models into production.
 
 <!-- TODO: decide how to introduce concept -->
 
+<!-- TODO:
 <Project id="some_example_project">
 
 Lorem ipsum dolor sit amet, consectetur adipiscing elit. Phasellus interdum
@@ -40,6 +41,7 @@ sodales lectus, ut sodales orci ullamcorper id. Sed condimentum neque ut erat
 mattis pretium.
 
 </Project>
+-->
 
 spaCy projects make it easy to integrate with many other **awesome tools** in
 the data science and machine learning ecosystem to track and manage your data

@@ -92,6 +92,7 @@ spaCy's binary `.spacy` format. You can either include the data paths in the
 $ python -m spacy train config.cfg --output ./output --paths.train ./train.spacy --paths.dev ./dev.spacy
 ```
 
+<!-- TODO:
 <Project id="some_example_project">
 
 The easiest way to get started with an end-to-end training process is to clone a
@@ -99,6 +100,7 @@ The easiest way to get started with an end-to-end training process is to clone a
 workflows, from data preprocessing to training and packaging your model.
 
 </Project>
+-->
 
 ## Training config {#config}
 
@@ -656,32 +658,74 @@ factor = 1.005
 
 #### Example: Custom data reading and batching {#custom-code-readers-batchers}
 
-Some use-cases require streaming in data or manipulating datasets on the fly,
-rather than generating all data beforehand and storing it to file. Instead of
-using the built-in reader `"spacy.Corpus.v1"`, which uses static file paths, you
-can create and register a custom function that generates
+Some use-cases require **streaming in data** or manipulating datasets on the
+fly, rather than generating all data beforehand and storing it to file. Instead
+of using the built-in [`Corpus`](/api/corpus) reader, which uses static file
+paths, you can create and register a custom function that generates
 [`Example`](/api/example) objects. The resulting generator can be infinite. When
 using this dataset for training, stopping criteria such as maximum number of
 steps, or stopping when the loss does not decrease further, can be used.
 
-In this example we assume a custom function `read_custom_data()` which loads or
-generates texts with relevant textcat annotations. Then, small lexical
-variations of the input text are created before generating the final `Example`
-objects.
-
-We can also customize the batching strategy by registering a new "batcher" which
-turns a stream of items into a stream of batches. spaCy has several useful
-built-in batching strategies with customizable sizes<!-- TODO: link  -->, but
-it's also easy to implement your own. For instance, the following function takes
-the stream of generated `Example` objects, and removes those which have the
-exact same underlying raw text, to avoid duplicates within each batch. Note that
-in a more realistic implementation, you'd also want to check whether the
-annotations are exactly the same.
+In this example we assume a custom function `read_custom_data` which loads or
+generates texts with relevant text classification annotations. Then, small
+lexical variations of the input text are created before generating the final
+[`Example`](/api/example) objects. The `@spacy.registry.readers` decorator lets
+you register the function creating the custom reader in the `readers`
+[registry](/api/top-level#registry) and assign it a string name, so it can be
+used in your config. All arguments on the registered function become available
+as **config settings** – in this case, `source`.
 
+> #### config.cfg
+>
 > ```ini
 > [training.train_corpus]
 > @readers = "corpus_variants.v1"
+> source = "s3://your_bucket/path/data.csv"
+> ```
+
+```python
+### functions.py {highlight="7-8"}
+from typing import Callable, Iterator, List
+import spacy
+from spacy.gold import Example
+from spacy.language import Language
+import random
+
+@spacy.registry.readers("corpus_variants.v1")
+def stream_data(source: str) -> Callable[[Language], Iterator[Example]]:
+    def generate_stream(nlp):
+        for text, cats in read_custom_data(source):
+            # Create a random variant of the example text
+            i = random.randint(0, len(text) - 1)
+            variant = text[:i] + text[i].upper() + text[i + 1:]
+            doc = nlp.make_doc(variant)
+            example = Example.from_dict(doc, {"cats": cats})
+            yield example
+
+    return generate_stream
+```
+
+<Infobox variant="warning">
+
+Remember that a registered function should always be a function that spaCy
+**calls to create something**. In this case, it **creates the reader function**
+– it's not the reader itself.
+
+</Infobox>
+
+We can also customize the **batching strategy** by registering a new batcher
+function in the `batchers` [registry](/api/top-level#registry). A batcher turns
+a stream of items into a stream of batches. spaCy has several useful built-in
+[batching strategies](/api/top-level#batchers) with customizable sizes, but it's
+also easy to implement your own. For instance, the following function takes the
+stream of generated [`Example`](/api/example) objects, and removes those which
+have the exact same underlying raw text, to avoid duplicates within each batch.
+Note that in a more realistic implementation, you'd also want to check whether
+the annotations are exactly the same.
+
+> #### config.cfg
 >
+> ```ini
 > [training.batcher]
 > @batchers = "filtering_batch.v1"
 > size = 150
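
The snippet above calls a `read_custom_data` helper that the docs only describe ("loads or generates texts with relevant text classification annotations") without defining. A minimal sketch of what such a loader could look like, assuming a plain local CSV with `text` and `label` columns rather than the S3 path shown in the config example:

```python
### read_custom_data sketch (hypothetical helper, not part of this commit)
import csv
from typing import Dict, Iterator, Tuple

def read_custom_data(source: str) -> Iterator[Tuple[str, Dict[str, float]]]:
    # Yield (text, cats) pairs with one-hot category annotations,
    # matching what Example.from_dict expects under "cats"
    with open(source, encoding="utf8") as file_:
        for row in csv.DictReader(file_):
            cats = {"POSITIVE": 0.0, "NEGATIVE": 0.0}
            cats[row["label"]] = 1.0
            yield row["text"], cats
```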

@@ -689,39 +733,26 @@ annotations are exactly the same.
 
 ```python
 ### functions.py
-from typing import Callable, Iterable, List
+from typing import Callable, Iterable, Iterator
 import spacy
 from spacy.gold import Example
-import random
-
-@spacy.registry.readers("corpus_variants.v1")
-def stream_data() -> Callable[["Language"], Iterable[Example]]:
-    def generate_stream(nlp):
-        for text, cats in read_custom_data():
-            random_index = random.randint(0, len(text) - 1)
-            variant = text[:random_index] + text[random_index].upper() + text[random_index + 1:]
-            doc = nlp.make_doc(variant)
-            example = Example.from_dict(doc, {"cats": cats})
-            yield example
-    return generate_stream
-
 
 @spacy.registry.batchers("filtering_batch.v1")
-def filter_batch(size: int) -> Callable[[Iterable[Example]], Iterable[List[Example]]]:
-    def create_filtered_batches(examples: Iterable[Example]) -> Iterable[List[Example]]:
+def filter_batch(size: int) -> Callable[[Iterable[Example]], Iterator[List[Example]]]:
+    def create_filtered_batches(examples):
         batch = []
         for eg in examples:
+            # Remove duplicate examples with the same text from batch
             if eg.text not in [x.text for x in batch]:
                 batch.append(eg)
             if len(batch) == size:
                 yield batch
                 batch = []
+
     return create_filtered_batches
 ```
 
-### Wrapping PyTorch and TensorFlow {#custom-frameworks}
-
-<!-- TODO:  -->
+<!-- TODO:
 
 <Project id="example_pytorch_model">
 
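
For the snippets above to work at training time, the `functions.py` module has to be importable so that the `corpus_variants.v1` and `filtering_batch.v1` registry entries can be resolved. A usage sketch, assuming spaCy v3's `--code` argument for passing a path to a Python file (check the CLI docs for the exact option in your version):

```
$ python -m spacy train config.cfg --output ./output --code functions.py
```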

@@ -731,12 +762,17 @@ mattis pretium.
 
 </Project>
 
+-->
+
 ### Defining custom architectures {#custom-architectures}
 
 <!-- TODO: this could maybe be a more general example of using Thinc to compose some layers? We don't want to go too deep here and probably want to focus on a simple architecture example to show how it works -->
+<!-- TODO: Wrapping PyTorch and TensorFlow -->
 
 ## Transfer learning {#transfer-learning}
 
+<!-- TODO: link to embeddings and transformers page -->
+
 ### Using transformer models like BERT {#transformers}
 
 spaCy v3.0 lets you use almost any statistical model to power your pipeline. You
@@ -748,6 +784,8 @@ do the required plumbing. It also provides a pipeline component,
 [`Transformer`](/api/transformer), that lets you do multi-task learning and lets
 you save the transformer outputs for later use.
 
+<!-- TODO:
+
 <Project id="en_core_bert">
 
 Try out a BERT-based model pipeline using this project template: swap in your
@@ -755,6 +793,7 @@ data, edit the settings and hyperparameters and train, evaluate, package and
 visualize your model.
 
 </Project>
+-->
 
 For more details on how to integrate transformer models into your training
 config and customize the implementations, see the usage guide on
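
The hunks above reference the [`Transformer`](/api/transformer) component while the project example stays commented out. As a rough sketch of how such a component can be declared in `config.cfg` (the architecture string, model name and settings below are illustrative and may differ from the released docs; see the embeddings and transformers guide referenced above):

```ini
# Illustrative sketch of a Transformer component block (names may differ)
[components.transformer]
factory = "transformer"

[components.transformer.model]
@architectures = "spacy-transformers.TransformerModel.v1"
name = "bert-base-uncased"
tokenizer_config = {"use_fast": true}
```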

@@ -766,7 +805,8 @@ config and customize the implementations, see the usage guide on
 
 ## Parallel Training with Ray {#parallel-training}
 
-<!-- TODO: document Ray integration -->
+<!-- TODO:
+
 
 <Project id="some_example_project">
 
@@ -776,6 +816,8 @@ mattis pretium.
 
 </Project>
 
+-->
+
 ## Internal training API {#api}
 
 <Infobox variant="warning">

@@ -444,6 +444,8 @@ values. You can then use the auto-generated `config.cfg` for training:
 + python -m spacy train ./config.cfg --output ./output
 ```
 
+<!-- TODO:
+
 <Project id="some_example_project">
 
 The easiest way to get started with an end-to-end training process is to clone a
@@ -452,6 +454,8 @@ workflows, from data preprocessing to training and packaging your model.
 
 </Project>
 
+-->
+
 #### Training via the Python API {#migrating-training-python}
 
 For most use cases, you **shouldn't** have to write your own training scripts

@@ -396,7 +396,7 @@ body [id]:target
     margin-right: -1.5em
     margin-left: -1.5em
     padding-right: 1.5em
-    padding-left: 1.25em
+    padding-left: 1.2em
 
     &:empty:before
         // Fix issue where empty lines would disappear