| title | tag | source | new | 
|---|---|---|---|
| TextCategorizer | class | spacy/pipeline.pyx | 2 | 
This class is a subclass of Pipe and follows the same API. The pipeline
component is available in the processing pipeline
via the ID "textcat".
TextCategorizer.Model
Initialize a model for the pipe. The model should implement the
thinc.neural.Model API. Wrappers are under development for most major machine
learning libraries.
| Name | Type | Description | 
|---|---|---|
| **kwargs | - | Parameters for initializing the model | 
| RETURNS | object | The initialized model. | 
TextCategorizer.__init__
Create a new pipeline instance. In your application, you would normally use a
shortcut for this and instantiate the component using its string name and
nlp.create_pipe.
Example

    # Construction via create_pipe
    textcat = nlp.create_pipe("textcat")
    textcat = nlp.create_pipe("textcat", config={"exclusive_classes": True})

    # Construction from class
    from spacy.pipeline import TextCategorizer
    textcat = TextCategorizer(nlp.vocab)
    textcat.from_disk("/path/to/model")
| Name | Type | Description | 
|---|---|---|
| vocab | Vocab | The shared vocabulary. | 
| model | thinc.neural.Model / True | The model powering the pipeline component. If no model is supplied, the model is created when you call begin_training, from_disk or from_bytes. | 
| exclusive_classes | bool | Make categories mutually exclusive. Defaults to False. | 
| architecture | unicode | Model architecture to use, see architectures for details. Defaults to "ensemble". | 
| RETURNS | TextCategorizer | The newly constructed object. | 
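
To illustrate what exclusive_classes changes, the sketch below contrasts two common ways of turning raw scores into category scores: a softmax, where categories compete and the results sum to 1, and an independent logistic function per category. It is a plain-Python illustration that assumes the component follows this convention. The function names and the raw scores are hypothetical and not part of spaCy.

```python
import math

def exclusive_scores(raw):
    """Softmax: categories compete, so the scores sum to 1."""
    peak = max(raw.values())  # subtract the max for numerical stability
    exps = {label: math.exp(value - peak) for label, value in raw.items()}
    total = sum(exps.values())
    return {label: value / total for label, value in exps.items()}

def independent_scores(raw):
    """Logistic per category: each label is scored on its own."""
    return {label: 1.0 / (1.0 + math.exp(-value)) for label, value in raw.items()}

# Hypothetical raw model outputs for one document.
raw = {"SPORTS": 2.0, "POLITICS": 1.0, "TECH": -1.0}
exclusive = exclusive_scores(raw)
independent = independent_scores(raw)
```

With mutually exclusive categories, raising one score necessarily lowers the others. With independent scoring, several labels can score above 0.5 for the same document.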
Architectures
Text classification models can be used to solve a wide variety of problems.
Differences in text length, number of labels, difficulty, and runtime
performance constraints mean that no single algorithm performs well on all types
of problems. To handle a wider variety of problems, the TextCategorizer object
allows configuration of its model architecture, using the architecture keyword
argument.
| Name | Description | 
|---|---|
| "ensemble" | Default: Stacked ensemble of a unigram bag-of-words model and a neural network model. The neural network uses a CNN with mean pooling and attention. | 
| "simple_cnn" | A neural network model where token vectors are calculated using a CNN. The vectors are mean pooled and used as features in a feed-forward network. | 
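
The mean pooling step mentioned for both architectures can be sketched in a few lines. This is only an illustration of the pooling idea with hypothetical 2-dimensional token vectors; it leaves out the CNN that would produce the vectors and the feed-forward network that would consume the result.

```python
def mean_pool(token_vectors):
    """Average equal-length token vectors into a single document vector."""
    if not token_vectors:
        raise ValueError("need at least one token vector")
    width = len(token_vectors[0])
    count = len(token_vectors)
    return [sum(vec[i] for vec in token_vectors) / count for i in range(width)]

# Three hypothetical token vectors for one document.
tokens = [[1.0, 0.0], [3.0, 2.0], [2.0, 4.0]]
doc_vector = mean_pool(tokens)
```

The pooled vector has the same width regardless of document length, which is what lets a fixed-size network use it as input features.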
TextCategorizer.__call__
Apply the pipe to one document. The document is modified in place, and returned.
This usually happens under the hood when you call the nlp object on a text and
all pipeline components are applied to the Doc in order. Both __call__ and pipe
delegate to the predict and set_annotations methods.
Example

    textcat = TextCategorizer(nlp.vocab)
    doc = nlp(u"This is a sentence.")
    # This usually happens under the hood
    processed = textcat(doc)
| Name | Type | Description | 
|---|---|---|
| doc | Doc | The document to process. | 
| RETURNS | Doc | The processed document. | 
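
The delegation described above can be sketched with a stand-in class. SketchPipe, its dict-based documents and the "LONG" category are all hypothetical; the point is only the split between computing scores (predict) and writing them onto the document (set_annotations).

```python
class SketchPipe:
    """Hypothetical stand-in for a pipeline component, not spaCy's class."""

    def predict(self, docs):
        # Compute scores without touching the documents.
        return [{"LONG": float(len(doc["text"]) > 10)} for doc in docs]

    def set_annotations(self, docs, scores):
        # Write the pre-computed scores onto each document.
        for doc, doc_scores in zip(docs, scores):
            doc["cats"] = doc_scores

    def __call__(self, doc):
        scores = self.predict([doc])
        self.set_annotations([doc], scores)
        return doc  # the same object, modified in place

pipe = SketchPipe()
doc = {"text": "This is a sentence."}
processed = pipe(doc)
```

Keeping the two steps separate means scores can be inspected or post-processed before any document is changed.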
TextCategorizer.pipe
Apply the pipe to a stream of documents. Both __call__ and pipe delegate to the
predict and set_annotations methods.
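Example

    texts = [u"One doc", u"...", u"Lots of docs"]
    textcat = TextCategorizer(nlp.vocab)
    # pipe expects a stream of Doc objects, so tokenize the texts first
    docs = (nlp.make_doc(text) for text in texts)
    for doc in textcat.pipe(docs, batch_size=50):
        pass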
| Name | Type | Description | 
|---|---|---|
| stream | iterable | A stream of documents. | 
| batch_size | int | The number of texts to buffer. Defaults to 128. | 
| YIELDS | Doc | Processed documents in the order of the original text. | 
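
The buffering controlled by batch_size can be pictured as follows. This is a generic sketch of order-preserving batching, not the component's own code; the minibatch helper is a hypothetical name.

```python
from itertools import islice

def minibatch(stream, batch_size=128):
    """Yield lists of up to batch_size items, preserving input order."""
    iterator = iter(stream)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch

# Five items buffered two at a time; the last batch is shorter.
batches = list(minibatch(range(5), batch_size=2))
```

Because items are consumed lazily, the stream can be a generator that is too large to hold in memory at once.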
TextCategorizer.predict
Apply the pipeline's model to a batch of docs, without modifying them.
Example

    textcat = TextCategorizer(nlp.vocab)
    scores = textcat.predict([doc1, doc2])
| Name | Type | Description | 
|---|---|---|
| docs | iterable | The documents to predict. | 
| RETURNS | - | Scores from the model. | 
TextCategorizer.set_annotations
Modify a batch of documents, using pre-computed scores.
Example

    textcat = TextCategorizer(nlp.vocab)
    scores = textcat.predict([doc1, doc2])
    textcat.set_annotations([doc1, doc2], scores)
| Name | Type | Description | 
|---|---|---|
| docs | iterable | The documents to modify. | 
| scores | - | The scores to set, produced by TextCategorizer.predict. | 
TextCategorizer.update
Learn from a batch of documents and gold-standard information, updating the
pipe's model. Delegates to predict and
get_loss.
Example

    textcat = TextCategorizer(nlp.vocab)
    losses = {}
    optimizer = nlp.begin_training()
    textcat.update([doc1, doc2], [gold1, gold2], losses=losses, sgd=optimizer)
| Name | Type | Description | 
|---|---|---|
| docs | iterable | A batch of documents to learn from. | 
| golds | iterable | The gold-standard data. Must have the same length as docs. | 
| drop | float | The dropout rate. | 
| sgd | callable | The optimizer. Should take two arguments, weights and gradient, and an optional ID. | 
| losses | dict | Optional record of the loss during training. The value keyed by the model's name is updated. | 
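
The losses argument can be read as a running record keyed by component name. The sketch below assumes the value is accumulated across batches rather than overwritten; record_loss is a hypothetical helper, and "textcat" is used as the key because it is the component's ID.

```python
def record_loss(losses, name, batch_loss):
    """Add batch_loss to the running total stored under name."""
    if losses is not None:  # the record is optional
        losses.setdefault(name, 0.0)
        losses[name] += batch_loss

losses = {}
for batch_loss in [0.5, 0.25]:
    record_loss(losses, "textcat", batch_loss)
```

Passing the same dict to every update call within an epoch then gives one total per component, which can be reset by creating a fresh dict.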
TextCategorizer.get_loss
Compute the loss, and the gradient of the loss, for a batch of documents and their predicted scores.
Example

    textcat = TextCategorizer(nlp.vocab)
    scores = textcat.predict([doc1, doc2])
    loss, d_loss = textcat.get_loss([doc1, doc2], [gold1, gold2], scores)
| Name | Type | Description | 
|---|---|---|
| docs | iterable | The batch of documents. | 
| golds | iterable | The gold-standard data. Must have the same length as docs. | 
| scores | - | Scores representing the model's predictions. | 
| RETURNS | tuple | The loss and the gradient, i.e. (loss, gradient). | 
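
As a worked illustration of a (loss, gradient) pair, the sketch below assumes a mean squared error objective over per-label scores. The objective actually used by the component may differ, and the function name and numbers are hypothetical.

```python
def mse_loss_and_gradient(scores, truths):
    """Return (loss, gradient) for rows of per-label scores.

    loss     = (1 / N) * sum over docs and labels of (score - truth) ** 2
    gradient = d(loss) / d(score) = 2 * (score - truth) / N
    """
    n_docs = len(scores)
    errors = [
        [s - t for s, t in zip(score_row, truth_row)]
        for score_row, truth_row in zip(scores, truths)
    ]
    loss = sum(e * e for row in errors for e in row) / n_docs
    gradient = [[2.0 * e / n_docs for e in row] for row in errors]
    return loss, gradient

# Two documents, two labels; the second prediction is already correct.
scores = [[0.5, 0.5], [1.0, 0.0]]
truths = [[1.0, 0.0], [1.0, 0.0]]
loss, d_loss = mse_loss_and_gradient(scores, truths)
```

The gradient has the same shape as the scores, and a correct prediction contributes zeros, so only the mistaken scores drive the update.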
TextCategorizer.begin_training
Initialize the pipe for training, using data examples if available. If no model has been initialized yet, the model is added.
Example

    textcat = TextCategorizer(nlp.vocab)
    nlp.add_pipe(textcat)
    optimizer = textcat.begin_training(pipeline=nlp.pipeline)
| Name | Type | Description | 
|---|---|---|
| gold_tuples | iterable | Optional gold-standard annotations from which to construct GoldParse objects. | 
| pipeline | list | Optional list of pipeline components that this component is part of. | 
| sgd | callable | An optional optimizer. Should take two arguments, weights and gradient, and an optional ID. Will be created via create_optimizer if not set. | 
| RETURNS | callable | An optimizer. | 
TextCategorizer.create_optimizer
Create an optimizer for the pipeline component.
Example

    textcat = TextCategorizer(nlp.vocab)
    optimizer = textcat.create_optimizer()
| Name | Type | Description | 
|---|---|---|
| RETURNS | callable | The optimizer. | 
TextCategorizer.use_params
Modify the pipe's model, to use the given parameter values.
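Example

    textcat = TextCategorizer(nlp.vocab)
    # Assumes `optimizer` was returned by begin_training; its averages
    # are passed as the parameter values to use inside the block.
    with textcat.use_params(optimizer.averages):
        textcat.to_disk('/best_model')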
| Name | Type | Description | 
|---|---|---|
| params | - | The parameter values to use in the model. At the end of the context, the original parameters are restored. | 
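
The swap-and-restore behaviour described for params follows the usual context manager pattern. The sketch below shows that pattern with a hypothetical SketchModel that keeps its weights in a plain dict; it is not the component's implementation.

```python
from contextlib import contextmanager

class SketchModel:
    """Hypothetical model holding its weights in a plain dict."""

    def __init__(self, weights):
        self.weights = weights

    @contextmanager
    def use_params(self, params):
        original = self.weights
        self.weights = params  # temporarily swap in the given values
        try:
            yield
        finally:
            self.weights = original  # restored even if the block raises

model = SketchModel({"W": 1.0})
with model.use_params({"W": 0.5}):
    inside = model.weights["W"]
after = model.weights["W"]
```

This is why saving inside the with block captures the temporary parameters, while training can continue afterwards from the original ones.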
TextCategorizer.add_label
Add a new label to the pipe.
Example

    textcat = TextCategorizer(nlp.vocab)
    textcat.add_label('MY_LABEL')
| Name | Type | Description | 
|---|---|---|
| label | unicode | The label to add. | 
TextCategorizer.to_disk
Serialize the pipe to disk.
Example

    textcat = TextCategorizer(nlp.vocab)
    textcat.to_disk('/path/to/textcat')
| Name | Type | Description | 
|---|---|---|
| path | unicode / Path | A path to a directory, which will be created if it doesn't exist. Paths may be either strings or Path-like objects. | 
TextCategorizer.from_disk
Load the pipe from disk. Modifies the object in place and returns it.
Example

    textcat = TextCategorizer(nlp.vocab)
    textcat.from_disk('/path/to/textcat')
| Name | Type | Description | 
|---|---|---|
| path | unicode / Path | A path to a directory. Paths may be either strings or Path-like objects. | 
| RETURNS | TextCategorizer | The modified TextCategorizer object. | 
TextCategorizer.to_bytes
Serialize the pipe to a bytestring.
Example

    textcat = TextCategorizer(nlp.vocab)
    textcat_bytes = textcat.to_bytes()
| Name | Type | Description | 
|---|---|---|
| **exclude | - | Named attributes to prevent from being serialized. | 
| RETURNS | bytes | The serialized form of the TextCategorizer object. | 
TextCategorizer.from_bytes
Load the pipe from a bytestring. Modifies the object in place and returns it.
Example

    textcat_bytes = textcat.to_bytes()
    textcat = TextCategorizer(nlp.vocab)
    textcat.from_bytes(textcat_bytes)
| Name | Type | Description | 
|---|---|---|
| bytes_data | bytes | The data to load from. | 
| **exclude | - | Named attributes to prevent from being loaded. | 
| RETURNS | TextCategorizer | The TextCategorizer object. | 
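
The round trip through to_bytes and from_bytes, including the effect of excluding a named attribute, can be sketched as below. SketchComponent, its two attributes and the JSON encoding are hypothetical choices for the illustration; the real component uses its own attribute names and serialization format.

```python
import json

class SketchComponent:
    """Hypothetical component with two serializable attributes."""

    def __init__(self):
        self.cfg = {"labels": []}
        self.weights = [0.0]

    def to_bytes(self, **exclude):
        data = {"cfg": self.cfg, "weights": self.weights}
        kept = {key: value for key, value in data.items() if key not in exclude}
        return json.dumps(kept).encode("utf8")

    def from_bytes(self, bytes_data, **exclude):
        data = json.loads(bytes_data.decode("utf8"))
        for key, value in data.items():
            if key not in exclude:  # skip attributes the caller excluded
                setattr(self, key, value)
        return self  # modified in place and returned

component = SketchComponent()
component.cfg["labels"].append("MY_LABEL")
component.weights = [0.1]
data = component.to_bytes()
restored = SketchComponent().from_bytes(data)
partial = SketchComponent().from_bytes(data, weights=False)
```

Here restored carries both attributes, while partial keeps its freshly initialized weights because that attribute was named in the exclusions.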
TextCategorizer.labels
The labels currently added to the component.
Example

    textcat.add_label("MY_LABEL")
    assert "MY_LABEL" in textcat.labels
| Name | Type | Description | 
|---|---|---|
| RETURNS | tuple | The labels added to the component. |