mirror of https://github.com/explosion/spaCy.git
synced 2025-11-04 09:57:26 +03:00

Update details [ci skip]

This commit is contained in:
parent e9b68d4f4c
commit ca0d904faa

BIN  website/docs/images/prodigy_spans-manual.jpg  (Normal file)
Binary file not shown. After: Size: 304 KiB

@@ -12,7 +12,30 @@ menu:

### Using predicted annotations during training {#predicted-annotations-training}

By default, components are updated in isolation during training, which means
that they don't see the predictions of any earlier components in the pipeline.
The new
[`[training.annotating_components]`](/usage/training#annotating-components)
config setting lets you specify pipeline component names that should set
annotations on the predicted docs during training. This makes it easy to use the
predictions of a previous component in the pipeline as features for a subsequent
component, e.g. the dependency labels in the tagger:

```ini
### config.cfg (excerpt) {highlight="7,12"}
[nlp]
pipeline = ["parser", "tagger"]

[components.tagger.model.tok2vec.embed]
@architectures = "spacy.MultiHashEmbed.v1"
width = ${components.tagger.model.tok2vec.encode.width}
attrs = ["NORM","DEP"]
rows = [5000,2500]
include_static_vectors = false

[training]
annotating_components = ["parser"]
```
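
The same mechanism is also exposed on [`nlp.update`](/api/language#update) via
its `annotates` argument. A minimal sketch, assuming spaCy v3.1+ and an
installed small English pipeline; the example text and tags are only for
illustration:

```python
import spacy
from spacy.training import Example

nlp = spacy.load("en_core_web_sm")
doc = nlp.make_doc("I like eggs.")
example = Example.from_dict(doc, {"tags": ["PRP", "VBP", "NNS", "."]})

optimizer = nlp.resume_training()
# The parser predicts on the docs first, so its dependency labels are
# available as features while the tagger is being updated
nlp.update([example], sgd=optimizer, annotates=["parser"])
```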

<Project id="pipelines/tagger_parser_predicted_annotations">

@@ -41,7 +64,7 @@ available via the [`Doc.spans`](/api/doc#spans) container.

<Infobox title="Tip: Create data with Prodigy's new span annotation UI">

[![Prodigy's new manual spans UI](../images/prodigy_spans-manual.jpg)](https://support.prodi.gy/t/3861)

The upcoming version of our annotation tool [Prodigy](https://prodi.gy)
(currently available as a [pre-release](https://support.prodi.gy/t/3861) for all

@@ -66,11 +89,11 @@ for spaCy's `SpanCategorizer` component.

The [`EntityRecognizer`](/api/entityrecognizer) can now be updated with known
incorrect annotations, which lets you take advantage of partial and sparse data.
For example, you'll be able to use the information that certain spans of text
are definitely **not** `PERSON` entities, without having to provide the complete
gold-standard annotations for the given example. The incorrect span annotations
can be added via the [`Doc.spans`](/api/doc#spans) container in the training
data, under the key defined as
[`incorrect_spans_key`](/api/entityrecognizer#init) in the component config.

```python
train_doc = nlp.make_doc("Barack Obama was born in Hawaii.")
```
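
Continuing the snippet above, a minimal sketch of how such negative annotations
could be attached, assuming the component config sets
`incorrect_spans_key = "incorrect_spans"`:

```python
from spacy.tokens import Span

# Spans known to be wrong for the given label; the key must match
# the component's incorrect_spans_key setting
train_doc.spans["incorrect_spans"] = [
    Span(train_doc, 5, 6, label="PERSON"),  # "Hawaii" is not a PERSON
]
```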

@@ -104,7 +127,12 @@ your own.

### Resizable text classification architectures {#resizable-textcat}

Previously, trained [`TextCategorizer`](/api/textcategorizer) architectures
could not be resized, meaning that you couldn't add new labels to an already
trained text classifier. In spaCy v3.1, the
[TextCatCNN](/api/architectures#TextCatCNN) and
[TextCatBOW](/api/architectures#TextCatBOW) architectures are now resizable,
while ensuring that the predictions for the old labels remain the same.
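
In practice, resizing amounts to adding labels to the trained component. A
minimal sketch; the pipeline path and label name are hypothetical:

```python
import spacy

# A trained pipeline whose textcat uses a resizable architecture
# such as TextCatBOW or TextCatCNN (path is hypothetical)
nlp = spacy.load("./my_textcat_pipeline")
textcat = nlp.get_pipe("textcat")

# The output layer is resized in place; predictions for the
# existing labels are unchanged
textcat.add_label("FEEDBACK")
```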

### CLI command to assemble pipeline from config {#assemble}

@@ -119,11 +147,39 @@ $ python -m spacy assemble config.cfg ./output

### Support for streaming large or infinite corpora {#streaming-corpora}

> #### config.cfg (excerpt)
>
> ```ini
> [training]
> max_epochs = -1
> ```

The training process now supports streaming large or infinite corpora
out-of-the-box, controlled via the
[`[training.max_epochs]`](/api/data-formats#training) config setting. Setting
it to `-1` means that the train corpus is streamed rather than loaded into
memory, with no shuffling within the training loop. For details on how to
implement a custom corpus loader, e.g. to stream in data from remote storage,
see the usage guide on
[custom data reading](/usage/training#custom-code-readers-batchers).
					When streaming a corpus, only the first 100 examples will be used for
 | 
				
			||||||
 | 
					[initialization](/usage/training#config-lifecycle). This is no problem if you're
 | 
				
			||||||
 | 
					training a component like the text classifier with data that specifies all
 | 
				
			||||||
 | 
					available labels in every example. If necessary, you can use the
 | 
				
			||||||
 | 
					[`init labels`](/api/cli#init-labels) command to pre-generate the labels for
 | 
				
			||||||
 | 
					your components using a representative sample so the model can be initialized
 | 
				
			||||||
 | 
					correctly before training.
 | 
				
			||||||
 | 
					
 | 
				
			||||||
### New lemmatizers for Catalan and Italian {#pos-lemmatizers}

The trained pipelines for [Catalan](/models/ca) and [Italian](/models/it) now
include lemmatizers that use the predicted part-of-speech tags as part of the
lookup lemmatization for higher lemmatization accuracy. If you're training your
own pipelines for these languages and you want to include a lemmatizer, make
sure you have the
[`spacy-lookups-data`](https://github.com/explosion/spacy-lookups-data) package
installed, which provides the relevant tables.
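
These lemmatizers correspond to the lemmatizer's `pos_lookup` mode. A minimal
sketch for a blank Italian pipeline, assuming `spacy-lookups-data` is
installed:

```python
import spacy

# In a full pipeline, a tagger or morphologizer would run first so
# that predicted POS tags are available to the lemmatizer
nlp = spacy.blank("it")
lemmatizer = nlp.add_pipe("lemmatizer", config={"mode": "pos_lookup"})
nlp.initialize()  # loads the lookup tables from spacy-lookups-data
```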

## Notes about upgrading from v3.0 {#upgrading}