### Using predicted annotations during training {#predicted-annotations-training}

By default, components are updated in isolation during training, which means
that they don't see the predictions of any earlier components in the pipeline.
The new
[`[training.annotating_components]`](/usage/training#annotating-components)
config setting lets you specify pipeline component names that should set
annotations on the predicted docs during training. This makes it easy to use the
predictions of a previous component in the pipeline as features for a subsequent
component, e.g. the dependency labels in the tagger:

```ini
### config.cfg (excerpt) {highlight="7,12"}
[nlp]
pipeline = ["parser", "tagger"]

[components.tagger.model.tok2vec.embed]
@architectures = "spacy.MultiHashEmbed.v1"
width = ${components.tagger.model.tok2vec.encode.width}
attrs = ["NORM","DEP"]
rows = [5000,2500]
include_static_vectors = false

[training]
annotating_components = ["parser"]
```

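The same behavior is available at runtime via the `annotates` argument of
[`Language.update`](/api/language#update). A minimal sketch, assuming the
trained pipeline `en_core_web_sm` is installed (the example text and tags are
illustrative only):

```python
import spacy
from spacy.training import Example

nlp = spacy.load("en_core_web_sm")  # assumes this pipeline is installed
doc = nlp.make_doc("I like trees.")
example = Example.from_dict(doc, {"tags": ["PRP", "VBP", "NNS", "."]})
# The parser sets its predictions on the doc first, so the tagger can use
# the dependency labels as features during this update
losses = nlp.update([example], annotates=["parser"])
```
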
<Project id="pipelines/tagger_parser_predicted_annotations">

<Infobox title="Tip: Create data with Prodigy's new span annotation UI">

[![Prodigy: example of the new manual spans UI](../images/prodigy_spans-manual.jpg)](https://support.prodi.gy/t/3861)

The upcoming version of our annotation tool [Prodigy](https://prodi.gy)
(currently available as a [pre-release](https://support.prodi.gy/t/3861) for all

The [`EntityRecognizer`](/api/entityrecognizer) can now be updated with known
incorrect annotations, which lets you take advantage of partial and sparse data.
For example, you'll be able to use the information that certain spans of text
are definitely **not** `PERSON` entities, without having to provide the complete
gold-standard annotations for the given example. The incorrect span annotations
can be added via the [`Doc.spans`](/api/doc#spans) in the training data under
the key defined as [`incorrect_spans_key`](/api/entityrecognizer#init) in the
component config.

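In the component config, that key might be declared like this (a sketch:
`"incorrect_spans"` is an arbitrary example key, not a built-in default):

```ini
[components.ner]
factory = "ner"
incorrect_spans_key = "incorrect_spans"
```
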
```python
from spacy.tokens import Span

train_doc = nlp.make_doc("Barack Obama was born in Hawaii.")
# A sketch of the idea: mark "Hawaii" as definitely NOT a PERSON entity,
# stored in the training data under the key set as incorrect_spans_key
train_doc.spans["incorrect_spans"] = [Span(train_doc, 5, 6, label="PERSON")]
```

### Resizable text classification architectures {#resizable-textcat}

Previously, trained [`TextCategorizer`](/api/textcategorizer) architectures
could not be resized, meaning that you couldn't add new labels to an already
trained text classifier. In spaCy v3.1, the
[TextCatCNN](/api/architectures#TextCatCNN) and
[TextCatBOW](/api/architectures#TextCatBOW) architectures are now resizable,
while ensuring that the predictions for the old labels remain the same.

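In practice, you resize a text classifier by adding labels via
[`TextCategorizer.add_label`](/api/textcategorizer#add_label). A minimal
sketch, assuming a pipeline at the hypothetical path `./my_textcat_model` that
was trained with one of the resizable architectures:

```python
import spacy

nlp = spacy.load("./my_textcat_model")  # hypothetical trained textcat pipeline
textcat = nlp.get_pipe("textcat")
# Resizes the output layer in place; predictions for the old labels
# are preserved
textcat.add_label("NEW_LABEL")
```
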
### CLI command to assemble pipeline from config {#assemble}

```cli
$ python -m spacy assemble config.cfg ./output
```

### Support for streaming large or infinite corpora {#streaming-corpora}

> #### config.cfg (excerpt)
>
> ```ini
> [training]
> max_epochs = -1
> ```

The training process now supports streaming large or infinite corpora out of
the box, which can be controlled via the
[`[training.max_epochs]`](/api/data-formats#training) config setting. Setting it
to `-1` means that the train corpus should be streamed rather than loaded into
memory, with no shuffling within the training loop. For details on how to
implement a custom corpus loader, e.g. to stream in data from remote storage,
see the usage guide on
[custom data reading](/usage/training#custom-code-readers-batchers).

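A minimal sketch of such a custom reader, registered under the hypothetical
name `"stream_jsonl.v1"` and reading a newline-delimited JSON file with `text`
and `cats` fields (the name, path and fields are assumptions for illustration):

```python
from typing import Callable, Iterator

import spacy
import srsly
from spacy.language import Language
from spacy.training import Example

@spacy.registry.readers("stream_jsonl.v1")  # hypothetical reader name
def stream_jsonl(path: str) -> Callable[[Language], Iterator[Example]]:
    def generate_stream(nlp: Language) -> Iterator[Example]:
        # Yield examples one at a time instead of loading everything into memory
        for line in srsly.read_jsonl(path):
            doc = nlp.make_doc(line["text"])
            yield Example.from_dict(doc, {"cats": line["cats"]})
    return generate_stream
```

The reader can then be referenced from the `[corpora.train]` block of your
config via `@readers = "stream_jsonl.v1"`.
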
When streaming a corpus, only the first 100 examples will be used for
[initialization](/usage/training#config-lifecycle). This is no problem if you're
training a component like the text classifier with data that specifies all
available labels in every example. If necessary, you can use the
[`init labels`](/api/cli#init-labels) command to pre-generate the labels for
your components using a representative sample so the model can be initialized
correctly before training.

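For example (a sketch: the output path is arbitrary):

```cli
$ python -m spacy init labels config.cfg ./labels
```

The generated label data can then be referenced via the
`[initialize.components]` settings in your config.
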
### New lemmatizers for Catalan and Italian {#pos-lemmatizers}

The trained pipelines for [Catalan](/models/ca) and [Italian](/models/it) now
include lemmatizers that use the predicted part-of-speech tags as part of the
lookup lemmatization, for higher accuracy. If you're training your own
pipelines for these languages and you want to include a lemmatizer, make sure
you have the
[`spacy-lookups-data`](https://github.com/explosion/spacy-lookups-data) package
installed, which provides the relevant tables.

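For example, via pip:

```bash
$ pip install spacy-lookups-data
```
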
## Notes about upgrading from v3.0 {#upgrading}