title | tag | source | new |
---|---|---|---|
TextCategorizer | class | spacy/pipeline.pyx | 2 |
This class is a subclass of Pipe and follows the same API. The pipeline
component is available in the processing pipeline via the ID "textcat".
TextCategorizer.Model
Initialize a model for the pipe. The model should implement the
thinc.neural.Model API. Wrappers are under development for most major machine
learning libraries.
Name | Type | Description |
---|---|---|
**kwargs | - | Parameters for initializing the model. |
RETURNS | object | The initialized model. |
TextCategorizer.__init__
Create a new pipeline instance. In your application, you would normally use a
shortcut for this and instantiate the component using its string name and
nlp.create_pipe.
Example
    # Construction via create_pipe
    textcat = nlp.create_pipe("textcat")
    textcat = nlp.create_pipe("textcat", config={"exclusive_classes": True})

    # Construction from class
    from spacy.pipeline import TextCategorizer
    textcat = TextCategorizer(nlp.vocab)
    textcat.from_disk("/path/to/model")

Name | Type | Description |
---|---|---|
vocab | Vocab | The shared vocabulary. |
model | thinc.neural.Model / True | The model powering the pipeline component. If no model is supplied, the model is created when you call begin_training, from_disk or from_bytes. |
exclusive_classes | bool | Make categories mutually exclusive. Defaults to False. |
architecture | unicode | Model architecture to use, see architectures for details. Defaults to "ensemble". |
RETURNS | TextCategorizer | The newly constructed object. |
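To make the effect of exclusive_classes concrete, here is a minimal standard-library sketch. It assumes that mutually exclusive categories are scored with a softmax (scores compete and sum to 1) and non-exclusive categories with an independent logistic function per label. That pairing is an assumption for illustration, not a statement about the component's internals, and the labels and raw outputs are hypothetical.

```python
import math


def softmax(logits):
    """Mutually exclusive scores: non-negative and summing to 1."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def logistic(logits):
    """Independent scores: each label gets its own value in (0, 1)."""
    return [1.0 / (1.0 + math.exp(-x)) for x in logits]


labels = ["POSITIVE", "NEGATIVE", "NEUTRAL"]  # hypothetical labels
logits = [2.0, 1.5, -1.0]  # hypothetical raw model outputs

exclusive = dict(zip(labels, softmax(logits)))
independent = dict(zip(labels, logistic(logits)))
```

With exclusive scoring, raising one category's score necessarily lowers the others. With independent scoring, several categories can score highly for the same text.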
Architectures
Text classification models can be used to solve a wide variety of problems.
Differences in text length, number of labels, difficulty, and runtime
performance constraints mean that no single algorithm performs well on all types
of problems. To handle a wider variety of problems, the TextCategorizer object
allows configuration of its model architecture, using the architecture keyword
argument.
Name | Description |
---|---|
"ensemble" | Default: Stacked ensemble of a unigram bag-of-words model and a neural network model. The neural network uses a CNN with mean pooling and attention. |
"simple_cnn" | A neural network model where token vectors are calculated using a CNN. The vectors are mean pooled and used as features in a feed-forward network. |
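The mean pooling step mentioned in the table can be sketched in a few lines of plain Python. This only illustrates the pooling operation on hypothetical token vectors; the CNN that would produce the vectors and the feed-forward network that would consume the result are left out.

```python
def mean_pool(token_vectors):
    """Average equal-length token vectors into a single document vector."""
    if not token_vectors:
        raise ValueError("need at least one token vector")
    width = len(token_vectors[0])
    count = len(token_vectors)
    return [sum(vec[i] for vec in token_vectors) / count for i in range(width)]


# Hypothetical 2-dimensional vectors for a 3-token document.
token_vectors = [[1.0, 0.0], [3.0, 2.0], [2.0, 4.0]]
doc_vector = mean_pool(token_vectors)
```

Pooling turns a variable number of token vectors into one fixed-width vector, which is what lets documents of different lengths share the same downstream layers.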
TextCategorizer.__call__
Apply the pipe to one document. The document is modified in place, and returned.
This usually happens under the hood when you call the nlp object on a text and
all pipeline components are applied to the Doc in order. Both __call__ and pipe
delegate to the predict and set_annotations methods.
Example
    textcat = TextCategorizer(nlp.vocab)
    doc = nlp(u"This is a sentence.")
    # This usually happens under the hood
    processed = textcat(doc)

Name | Type | Description |
---|---|---|
doc | Doc | The document to process. |
RETURNS | Doc | The processed document. |
TextCategorizer.pipe
Apply the pipe to a stream of documents. Both __call__ and pipe delegate to the
predict and set_annotations methods.
Example
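    texts = [u"One doc", u"...", u"Lots of docs"]
    # pipe expects Doc objects, so tokenize the raw texts first
    docs = [nlp.make_doc(text) for text in texts]
    textcat = TextCategorizer(nlp.vocab)
    for doc in textcat.pipe(docs, batch_size=50):
        pass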
Name | Type | Description |
---|---|---|
stream | iterable | A stream of documents. |
batch_size | int | The number of texts to buffer. Defaults to 128. |
YIELDS | Doc | Processed documents in the order of the original text. |
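The way __call__ and pipe delegate to predict and set_annotations can be sketched with a toy component. ToyDoc and ToyCategorizer are hypothetical stand-ins written for this illustration, and the keyword-matching "model" is only a placeholder; the point is the split between computing scores and writing them to the documents, plus the batching in pipe.

```python
from itertools import islice


class ToyDoc:
    """Hypothetical stand-in for a Doc: some text plus a cats dict."""

    def __init__(self, text):
        self.text = text
        self.cats = {}


class ToyCategorizer:
    """Hypothetical stand-in that mirrors the predict/set_annotations split."""

    def __init__(self, labels):
        self.labels = tuple(labels)

    def predict(self, docs):
        # Compute scores without modifying the docs.
        return [
            [float(label.lower() in doc.text.lower()) for label in self.labels]
            for doc in docs
        ]

    def set_annotations(self, docs, scores):
        # Write pre-computed scores onto the docs.
        for doc, row in zip(docs, scores):
            doc.cats = dict(zip(self.labels, row))

    def __call__(self, doc):
        self.set_annotations([doc], self.predict([doc]))
        return doc

    def pipe(self, stream, batch_size=128):
        stream = iter(stream)
        while True:
            batch = list(islice(stream, batch_size))  # buffer one batch
            if not batch:
                break
            self.set_annotations(batch, self.predict(batch))
            yield from batch  # original order is preserved


textcat = ToyCategorizer(["SPORTS", "POLITICS"])
docs = [ToyDoc("sports news"), ToyDoc("politics today"), ToyDoc("weather")]
processed = list(textcat.pipe(docs, batch_size=2))
```

Keeping predict free of side effects means the same scores can be inspected, cached, or reused for training before anything is written to a document.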
TextCategorizer.predict
Apply the pipeline's model to a batch of docs, without modifying them.
Example
    textcat = TextCategorizer(nlp.vocab)
    scores = textcat.predict([doc1, doc2])

Name | Type | Description |
---|---|---|
docs | iterable | The documents to predict. |
RETURNS | - | Scores from the model. |
TextCategorizer.set_annotations
Modify a batch of documents, using pre-computed scores.
Example
    textcat = TextCategorizer(nlp.vocab)
    scores = textcat.predict([doc1, doc2])
    textcat.set_annotations([doc1, doc2], scores)

Name | Type | Description |
---|---|---|
docs | iterable | The documents to modify. |
scores | - | The scores to set, produced by TextCategorizer.predict. |
TextCategorizer.update
Learn from a batch of documents and gold-standard information, updating the
pipe's model. Delegates to predict and get_loss.
Example
    textcat = TextCategorizer(nlp.vocab)
    losses = {}
    optimizer = nlp.begin_training()
    textcat.update([doc1, doc2], [gold1, gold2], losses=losses, sgd=optimizer)

Name | Type | Description |
---|---|---|
docs | iterable | A batch of documents to learn from. |
golds | iterable | The gold-standard data. Must have the same length as docs. |
drop | float | The dropout rate. |
sgd | callable | The optimizer. Should take two arguments weights and gradient, and an optional ID. |
losses | dict | Optional record of the loss during training. The value keyed by the model's name is updated. |
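The shape of the sgd callable and the losses record can be illustrated with a small standalone sketch. The make_sgd helper, the plain gradient-descent rule, the "toy_layer" ID and the "textcat" key are all assumptions chosen for this example; the optimizer you get from begin_training is more sophisticated.

```python
def make_sgd(learn_rate=0.1):
    """Return a callable taking weights, gradient and an optional ID."""

    def sgd(weights, gradient, key=None):
        # Plain gradient descent, applied in place so the caller sees the update.
        for i, grad in enumerate(gradient):
            weights[i] -= learn_rate * grad
        return weights

    return sgd


optimizer = make_sgd(learn_rate=0.1)
weights = [1.0, -2.0]  # hypothetical parameters
optimizer(weights, [0.5, -1.0], key="toy_layer")

# Accumulate per-batch losses under the component's name, as a losses dict would.
losses = {}
for batch_loss in (0.4, 0.25):  # hypothetical batch losses
    losses.setdefault("textcat", 0.0)
    losses["textcat"] += batch_loss
```

Because the loss is accumulated rather than overwritten, the dict is typically reset at the start of each epoch before it is reported.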
TextCategorizer.get_loss
Find the loss and gradient of loss for the batch of documents and their predicted scores.
Example
    textcat = TextCategorizer(nlp.vocab)
    scores = textcat.predict([doc1, doc2])
    loss, d_loss = textcat.get_loss([doc1, doc2], [gold1, gold2], scores)

Name | Type | Description |
---|---|---|
docs | iterable | The batch of documents. |
golds | iterable | The gold-standard data. Must have the same length as docs. |
scores | - | Scores representing the model's predictions. |
RETURNS | tuple | The loss and the gradient, i.e. (loss, gradient). |
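As an illustration of what a (loss, gradient) pair can look like, the sketch below computes a mean squared error over a batch and its derivative with respect to each score. The choice of mean squared error, the function name toy_get_loss, and the sample values are assumptions for this example; the component's actual loss may be defined differently.

```python
def toy_get_loss(truths, scores):
    """Return (loss, gradient) for a batch of per-label scores.

    loss     = (1 / n) * sum over docs and labels of (score - truth) ** 2
    gradient = d(loss) / d(score) = 2 * (score - truth) / n
    """
    n = len(scores)
    loss = 0.0
    gradient = []
    for score_row, truth_row in zip(scores, truths):
        grad_row = []
        for score, truth in zip(score_row, truth_row):
            error = score - truth
            loss += error ** 2 / n
            grad_row.append(2.0 * error / n)
        gradient.append(grad_row)
    return loss, gradient


truths = [[1.0, 0.0], [0.0, 1.0]]  # hypothetical gold categories per doc
scores = [[0.8, 0.2], [0.4, 0.6]]  # hypothetical predicted scores per doc
loss, d_loss = toy_get_loss(truths, scores)
```

The gradient has the same shape as the scores, one row per document and one column per label, which is what allows it to be passed back through the model during an update.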
TextCategorizer.begin_training
Initialize the pipe for training, using data examples if available. If no model has been initialized yet, the model is added.
Example
    textcat = TextCategorizer(nlp.vocab)
    nlp.pipeline.append(textcat)
    optimizer = textcat.begin_training(pipeline=nlp.pipeline)
Name | Type | Description |
---|---|---|
gold_tuples | iterable | Optional gold-standard annotations from which to construct GoldParse objects. |
pipeline | list | Optional list of pipeline components that this component is part of. |
sgd | callable | An optional optimizer. Should take two arguments weights and gradient, and an optional ID. Will be created via TextCategorizer.create_optimizer if not set. |
RETURNS | callable | An optimizer. |
TextCategorizer.create_optimizer
Create an optimizer for the pipeline component.
Example
    textcat = TextCategorizer(nlp.vocab)
    optimizer = textcat.create_optimizer()
Name | Type | Description |
---|---|---|
RETURNS | callable | The optimizer. |
TextCategorizer.use_params
Modify the pipe's model, to use the given parameter values.
Example
    textcat = TextCategorizer(nlp.vocab)
    # optimizer is the one used for training; its averages are the params to apply
    with textcat.use_params(optimizer.averages):
        textcat.to_disk('/best_model')
Name | Type | Description |
---|---|---|
params | - | The parameter values to use in the model. At the end of the context, the original parameters are restored. |
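The swap-and-restore behaviour described for params can be sketched with a context manager. ToyModel and its flat weights list are hypothetical and exist only to show the pattern of temporarily replacing parameters and restoring the originals on exit, even if the body raises.

```python
from contextlib import contextmanager


class ToyModel:
    """Hypothetical model holding a flat list of weights."""

    def __init__(self, weights):
        self.weights = list(weights)

    @contextmanager
    def use_params(self, params):
        original = self.weights
        self.weights = list(params)  # temporarily use the given values
        try:
            yield self
        finally:
            self.weights = original  # always restore the original parameters


model = ToyModel([0.9, -0.3])
averaged = [0.5, -0.1]  # hypothetical averaged parameters

with model.use_params(averaged):
    inside = list(model.weights)  # what would be saved or evaluated here
outside = list(model.weights)
```

This is why saving a model inside the with block writes the temporary parameters to disk while training continues afterwards from the original ones.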
TextCategorizer.add_label
Add a new label to the pipe.
Example
    textcat = TextCategorizer(nlp.vocab)
    textcat.add_label('MY_LABEL')

Name | Type | Description |
---|---|---|
label | unicode | The label to add. |
TextCategorizer.to_disk
Serialize the pipe to disk.
Example
    textcat = TextCategorizer(nlp.vocab)
    textcat.to_disk('/path/to/textcat')

Name | Type | Description |
---|---|---|
path | unicode / Path | A path to a directory, which will be created if it doesn't exist. Paths may be either strings or Path-like objects. |
TextCategorizer.from_disk
Load the pipe from disk. Modifies the object in place and returns it.
Example
    textcat = TextCategorizer(nlp.vocab)
    textcat.from_disk('/path/to/textcat')

Name | Type | Description |
---|---|---|
path | unicode / Path | A path to a directory. Paths may be either strings or Path-like objects. |
RETURNS | TextCategorizer | The modified TextCategorizer object. |
TextCategorizer.to_bytes
Serialize the pipe to a bytestring.
Example
    textcat = TextCategorizer(nlp.vocab)
    textcat_bytes = textcat.to_bytes()

Name | Type | Description |
---|---|---|
**exclude | - | Named attributes to prevent from being serialized. |
RETURNS | bytes | The serialized form of the TextCategorizer object. |
TextCategorizer.from_bytes
Load the pipe from a bytestring. Modifies the object in place and returns it.
Example
    textcat_bytes = textcat.to_bytes()
    textcat = TextCategorizer(nlp.vocab)
    textcat.from_bytes(textcat_bytes)

Name | Type | Description |
---|---|---|
bytes_data | bytes | The data to load from. |
**exclude | - | Named attributes to prevent from being loaded. |
RETURNS | TextCategorizer | The TextCategorizer object. |
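A to_bytes and from_bytes round trip with excluded attributes can be sketched as follows. ToyPipe, its cfg and weights attributes, and the JSON encoding are hypothetical. The sketch also assumes that naming an attribute as a keyword argument is enough to exclude it, which may not match how the real methods interpret **exclude.

```python
import json


class ToyPipe:
    """Hypothetical component with two serializable attributes."""

    def __init__(self):
        self.cfg = {"labels": []}
        self.weights = []

    def to_bytes(self, **exclude):
        data = {"cfg": self.cfg, "weights": self.weights}
        # Assumption: any attribute named as a keyword is left out.
        kept = {key: value for key, value in data.items() if key not in exclude}
        return json.dumps(kept, sort_keys=True).encode("utf8")

    def from_bytes(self, bytes_data, **exclude):
        data = json.loads(bytes_data.decode("utf8"))
        for key, value in data.items():
            if key not in exclude:
                setattr(self, key, value)
        return self  # modified in place and returned


source = ToyPipe()
source.cfg["labels"].append("MY_LABEL")
source.weights = [0.25, -0.5]

restored = ToyPipe().from_bytes(source.to_bytes())
partial = ToyPipe().from_bytes(source.to_bytes(weights=True))
```

Excluding an attribute on either side leaves the receiving object's existing value untouched, which is useful when a large shared resource is loaded separately.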
TextCategorizer.labels
The labels currently added to the component.
Example
    textcat.add_label("MY_LABEL")
    assert "MY_LABEL" in textcat.labels
Name | Type | Description |
---|---|---|
RETURNS | tuple | The labels added to the component. |