spaCy/website/docs/api/textcategorizer.md
Ines Montani e597110d31
💫 Update website (#3285)

## Description

The new website is implemented using [Gatsby](https://www.gatsbyjs.org) with [Remark](https://github.com/remarkjs/remark) and [MDX](https://mdxjs.com/). This allows authoring content in **straightforward Markdown** without the usual limitations. Standard elements can be overwritten with powerful [React](http://reactjs.org/) components and wherever Markdown syntax isn't enough, JSX components can be used. Hopefully, this update will also make it much easier to contribute to the docs. Once this PR is merged, I'll implement auto-deployment via [Netlify](https://netlify.com) on a specific branch (to avoid building the website on every PR). There's a bunch of other cool stuff that the new setup will allow us to do – including writing front-end tests, service workers, offline support, implementing a search and so on.

This PR also includes various new docs pages and content.
Resolves #3270. Resolves #3222. Resolves #2947. Resolves #2837.


### Types of change
enhancement

2019-02-17 19:31:19 +01:00


| title | tag | source | new |
| --- | --- | --- | --- |
| TextCategorizer | class | spacy/pipeline.pyx | 2 |

This class is a subclass of Pipe and follows the same API. The pipeline component is available in the processing pipeline via the ID "textcat".

TextCategorizer.Model

Initialize a model for the pipe. The model should implement the thinc.neural.Model API. Wrappers are under development for most major machine learning libraries.

| Name | Type | Description |
| --- | --- | --- |
| `**kwargs` | - | Parameters for initializing the model. |
| RETURNS | `object` | The initialized model. |

TextCategorizer.__init__

Create a new pipeline instance. In your application, you would normally use a shortcut for this and instantiate the component using its string name and nlp.create_pipe.

Example

# Construction via create_pipe
textcat = nlp.create_pipe("textcat")

# Construction from class
from spacy.pipeline import TextCategorizer
textcat = TextCategorizer(nlp.vocab)
textcat.from_disk("/path/to/model")

| Name | Type | Description |
| --- | --- | --- |
| `vocab` | `Vocab` | The shared vocabulary. |
| `model` | `thinc.neural.Model` or `True` | The model powering the pipeline component. If no model is supplied, the model is created when you call `begin_training`, `from_disk` or `from_bytes`. |
| `**cfg` | - | Configuration parameters. |
| RETURNS | `TextCategorizer` | The newly constructed object. |

TextCategorizer.__call__

Apply the pipe to one document. The document is modified in place, and returned. Both __call__ and pipe delegate to the predict and set_annotations methods.

Example

textcat = TextCategorizer(nlp.vocab)
doc = nlp(u"This is a sentence.")
processed = textcat(doc)

| Name | Type | Description |
| --- | --- | --- |
| `doc` | `Doc` | The document to process. |
| RETURNS | `Doc` | The processed document. |

TextCategorizer.pipe

Apply the pipe to a stream of documents. Both __call__ and pipe delegate to the predict and set_annotations methods.

Example

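texts = [u"One doc", u"...", u"Lots of docs"]
textcat = TextCategorizer(nlp.vocab)
# pipe expects a stream of Doc objects, so tokenize the raw texts first
docs = (nlp.make_doc(text) for text in texts)
for doc in textcat.pipe(docs, batch_size=50):
    pass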

| Name | Type | Description |
| --- | --- | --- |
| `stream` | iterable | A stream of documents. |
| `batch_size` | int | The number of texts to buffer. Defaults to `128`. |
| `n_threads` | int | The number of worker threads to use. If `-1`, OpenMP will decide how many to use at run time. Default is `-1`. |
| YIELDS | `Doc` | Processed documents in the order of the original text. |
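
The delegation described above can be sketched without spaCy. The following is a minimal, hypothetical stand-in: `ToyPipe`, its length-based scoring rule and the dict-based "documents" are all assumptions made for illustration, not spaCy's implementation. It shows how `__call__` and `pipe` can both be built on `predict` and `set_annotations`, with `pipe` buffering the stream in batches of `batch_size` and yielding documents in their original order.

```python
from itertools import islice


class ToyPipe:
    """Illustrative stand-in for a pipeline component (not spaCy code)."""

    def predict(self, docs):
        # Hypothetical scoring rule: one score per doc, docs are not modified.
        return [len(doc["text"]) for doc in docs]

    def set_annotations(self, docs, scores):
        # Write the pre-computed scores onto the docs in place.
        for doc, score in zip(docs, scores):
            doc["score"] = score

    def __call__(self, doc):
        # Process a single doc by treating it as a batch of one.
        scores = self.predict([doc])
        self.set_annotations([doc], scores)
        return doc

    def pipe(self, stream, batch_size=128):
        # Buffer the stream into batches and yield docs in input order.
        stream = iter(stream)
        while True:
            batch = list(islice(stream, batch_size))
            if not batch:
                break
            scores = self.predict(batch)
            self.set_annotations(batch, scores)
            yield from batch


toy = ToyPipe()
docs = [{"text": t} for t in ["One doc", "...", "Lots of docs"]]
processed = list(toy.pipe(docs, batch_size=2))
```

Because `pipe` is a generator, nothing is processed until the caller iterates over it, which is why the example wraps the call in `list(...)`.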

TextCategorizer.predict

Apply the pipeline's model to a batch of docs, without modifying them.

Example

textcat = TextCategorizer(nlp.vocab)
scores = textcat.predict([doc1, doc2])

| Name | Type | Description |
| --- | --- | --- |
| `docs` | iterable | The documents to predict. |
| RETURNS | - | Scores from the model. |

TextCategorizer.set_annotations

Modify a batch of documents, using pre-computed scores.

Example

textcat = TextCategorizer(nlp.vocab)
scores = textcat.predict([doc1, doc2])
textcat.set_annotations([doc1, doc2], scores)

| Name | Type | Description |
| --- | --- | --- |
| `docs` | iterable | The documents to modify. |
| `scores` | - | The scores to set, produced by `TextCategorizer.predict`. |
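
As a rough illustration of what setting annotations from scores means: in spaCy, the text categorizer's predictions end up in `doc.cats`, a dict mapping each label to a score. The sketch below mimics that mapping with plain Python objects. The function name `scores_to_cats` and the list-of-lists score layout (one row per document, one column per label) are assumptions made for this example only.

```python
def scores_to_cats(labels, scores):
    """Turn one row of scores per document into a label -> score dict."""
    cats_per_doc = []
    for row in scores:
        if len(row) != len(labels):
            raise ValueError("Expected one score per label")
        cats = {label: float(score) for label, score in zip(labels, row)}
        cats_per_doc.append(cats)
    return cats_per_doc


labels = ("POSITIVE", "NEGATIVE")
scores = [[0.9, 0.1], [0.2, 0.8]]
cats = scores_to_cats(labels, scores)
```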

TextCategorizer.update

Learn from a batch of documents and gold-standard information, updating the pipe's model. Delegates to predict and get_loss.

Example

textcat = TextCategorizer(nlp.vocab)
losses = {}
optimizer = nlp.begin_training()
textcat.update([doc1, doc2], [gold1, gold2], losses=losses, sgd=optimizer)

| Name | Type | Description |
| --- | --- | --- |
| `docs` | iterable | A batch of documents to learn from. |
| `golds` | iterable | The gold-standard data. Must have the same length as `docs`. |
| `drop` | float | The dropout rate. |
| `sgd` | callable | The optimizer. Should take two arguments `weights` and `gradient`, and an optional ID. |
| `losses` | dict | Optional record of the loss during training. The value keyed by the model's name is updated. |
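
To make the `sgd` and `losses` contracts concrete, here is a minimal sketch in plain Python. The optimizer is a callable taking weights, a gradient and an optional ID, and the `losses` dict accumulates a value under the component's name. The learning rate, the plain gradient-descent rule, the in-place list update and the use of `"textcat"` as the dict key are assumptions for illustration; real optimizers operate on arrays and are more elaborate.

```python
def make_sgd(learn_rate=0.1):
    """Return an optimizer callable with the (weights, gradient, key) shape."""

    def sgd(weights, gradient, key=None):
        # Plain gradient descent, updating the weights in place.
        for i, grad in enumerate(gradient):
            weights[i] -= learn_rate * grad

    return sgd


def record_loss(losses, name, loss):
    """Accumulate a loss under the component's name, if a dict was passed."""
    if losses is not None:
        losses.setdefault(name, 0.0)
        losses[name] += loss


sgd = make_sgd(learn_rate=0.5)
weights = [1.0, -2.0]
sgd(weights, [2.0, -4.0], key="toy-layer")

losses = {}
record_loss(losses, "textcat", 0.25)
record_loss(losses, "textcat", 0.5)
```

Passing the same `losses` dict across several update calls is what lets a training loop report a running total per component.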

TextCategorizer.get_loss

Find the loss and gradient of loss for the batch of documents and their predicted scores.

Example

textcat = TextCategorizer(nlp.vocab)
scores = textcat.predict([doc1, doc2])
loss, d_loss = textcat.get_loss([doc1, doc2], [gold1, gold2], scores)

| Name | Type | Description |
| --- | --- | --- |
| `docs` | iterable | The batch of documents. |
| `golds` | iterable | The gold-standard data. Must have the same length as `docs`. |
| `scores` | - | Scores representing the model's predictions. |
| RETURNS | tuple | The loss and the gradient, i.e. `(loss, gradient)`. |
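
The `(loss, gradient)` return value can be illustrated with a small worked example. This sketch assumes a mean-squared-error objective over per-label scores. That choice, the function name `toy_get_loss` and the list-based data layout are assumptions for illustration, not a statement of the loss spaCy actually uses.

```python
def toy_get_loss(truths, scores):
    """Return (loss, gradient) for squared error, averaged over the batch."""
    n_docs = len(scores)
    loss = 0.0
    gradient = []
    for truth_row, score_row in zip(truths, scores):
        diffs = [s - t for s, t in zip(score_row, truth_row)]
        loss += sum(d * d for d in diffs) / n_docs
        # Gradient with respect to the scores, up to a constant factor of 2.
        gradient.append([d / n_docs for d in diffs])
    return loss, gradient


truths = [[1.0, 0.0], [0.0, 1.0]]
scores = [[0.5, 0.5], [0.0, 1.0]]
loss, d_scores = toy_get_loss(truths, scores)
```

The gradient has the same shape as the scores, so it can be passed backwards through whatever produced them.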

TextCategorizer.begin_training

Initialize the pipe for training, using data examples if available. If no model has been initialized yet, the model is added.

Example

textcat = TextCategorizer(nlp.vocab)
nlp.pipeline.append(textcat)
optimizer = textcat.begin_training(pipeline=nlp.pipeline)

| Name | Type | Description |
| --- | --- | --- |
| `gold_tuples` | iterable | Optional gold-standard annotations from which to construct `GoldParse` objects. |
| `pipeline` | list | Optional list of pipeline components that this component is part of. |
| `sgd` | callable | An optional optimizer. Should take two arguments `weights` and `gradient`, and an optional ID. Will be created via `TextCategorizer.create_optimizer` if not set. |
| RETURNS | callable | An optimizer. |
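
The two fallbacks described here (add a model only if none exists, and create an optimizer only if none is passed) follow a common pattern. The sketch below shows that pattern in isolation. `ToyComponent`, its dict-based "model" and its no-op optimizer are hypothetical stand-ins, not spaCy's implementation.

```python
class ToyComponent:
    """Illustrative stand-in for a trainable component (not spaCy code)."""

    def __init__(self):
        self.model = None

    def create_optimizer(self):
        # Hypothetical default optimizer: a no-op with the expected signature.
        def default_sgd(weights, gradient, key=None):
            return None

        return default_sgd

    def begin_training(self, sgd=None):
        # Add a model only if none has been initialized yet.
        if self.model is None:
            self.model = {"weights": [0.0, 0.0]}
        # Fall back to the component's own optimizer if none was passed in.
        if sgd is None:
            sgd = self.create_optimizer()
        return sgd


def custom_sgd(weights, gradient, key=None):
    return None


component = ToyComponent()
default_optimizer = component.begin_training()
first_model = component.model
returned = component.begin_training(sgd=custom_sgd)
```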

TextCategorizer.create_optimizer

Create an optimizer for the pipeline component.

Example

textcat = TextCategorizer(nlp.vocab)
optimizer = textcat.create_optimizer()

| Name | Type | Description |
| --- | --- | --- |
| RETURNS | callable | The optimizer. |

TextCategorizer.use_params

Modify the pipe's model, to use the given parameter values.

Example

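textcat = TextCategorizer(nlp.vocab)
optimizer = textcat.begin_training()
# Temporarily swap in the optimizer's averaged parameters
with textcat.use_params(optimizer.averages):
    textcat.to_disk('/best_model')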

| Name | Type | Description |
| --- | --- | --- |
| `params` | - | The parameter values to use in the model. At the end of the context, the original parameters are restored. |
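
The swap-and-restore behaviour can be sketched as a small context manager. `ToyModel` and its single `weights` attribute are hypothetical and serve only to show the pattern; a real model holds many parameter arrays.

```python
from contextlib import contextmanager


class ToyModel:
    """Illustrative stand-in for a model with swappable parameters."""

    def __init__(self, weights):
        self.weights = weights

    @contextmanager
    def use_params(self, params):
        # Temporarily swap in the given values, then restore the originals.
        original = self.weights
        self.weights = params
        try:
            yield
        finally:
            self.weights = original


model = ToyModel(weights=[1.0, 2.0])
with model.use_params([0.5, 1.5]):
    inside = list(model.weights)
after = list(model.weights)
```

The `try`/`finally` ensures the original parameters come back even if the body of the `with` block raises.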

TextCategorizer.add_label

Add a new label to the pipe.

Example

textcat = TextCategorizer(nlp.vocab)
textcat.add_label('MY_LABEL')

| Name | Type | Description |
| --- | --- | --- |
| `label` | unicode | The label to add. |

TextCategorizer.to_disk

Serialize the pipe to disk.

Example

textcat = TextCategorizer(nlp.vocab)
textcat.to_disk('/path/to/textcat')

| Name | Type | Description |
| --- | --- | --- |
| `path` | unicode / `Path` | A path to a directory, which will be created if it doesn't exist. Paths may be either strings or `Path`-like objects. |

TextCategorizer.from_disk

Load the pipe from disk. Modifies the object in place and returns it.

Example

textcat = TextCategorizer(nlp.vocab)
textcat.from_disk('/path/to/textcat')

| Name | Type | Description |
| --- | --- | --- |
| `path` | unicode / `Path` | A path to a directory. Paths may be either strings or `Path`-like objects. |
| RETURNS | `TextCategorizer` | The modified `TextCategorizer` object. |

TextCategorizer.to_bytes

Serialize the pipe to a bytestring.

Example

textcat = TextCategorizer(nlp.vocab)
textcat_bytes = textcat.to_bytes()

| Name | Type | Description |
| --- | --- | --- |
| `**exclude` | - | Named attributes to prevent from being serialized. |
| RETURNS | bytes | The serialized form of the `TextCategorizer` object. |

TextCategorizer.from_bytes

Load the pipe from a bytestring. Modifies the object in place and returns it.

Example

textcat_bytes = textcat.to_bytes()
textcat = TextCategorizer(nlp.vocab)
textcat.from_bytes(textcat_bytes)

| Name | Type | Description |
| --- | --- | --- |
| `bytes_data` | bytes | The data to load from. |
| `**exclude` | - | Named attributes to prevent from being loaded. |
| RETURNS | `TextCategorizer` | The `TextCategorizer` object. |
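
A byte round trip with excluded attributes might look like the following sketch. `ToySerializable` is a hypothetical class. JSON is used only because it ships with Python, and the `exclude` argument is simplified to a tuple of names rather than keyword arguments, so none of this reflects spaCy's actual byte format.

```python
import json


class ToySerializable:
    """Illustrative stand-in for a serializable component (not spaCy code)."""

    def __init__(self, labels=(), weights=()):
        self.labels = list(labels)
        self.weights = list(weights)

    def to_bytes(self, exclude=()):
        # Serialize every named attribute that is not excluded.
        data = {"labels": self.labels, "weights": self.weights}
        for name in exclude:
            data.pop(name, None)
        return json.dumps(data).encode("utf8")

    def from_bytes(self, bytes_data, exclude=()):
        # Load in place, skipping excluded attributes, and return self.
        data = json.loads(bytes_data.decode("utf8"))
        for name, value in data.items():
            if name not in exclude:
                setattr(self, name, value)
        return self


source = ToySerializable(labels=["MY_LABEL"], weights=[0.1, 0.2])
restored = ToySerializable().from_bytes(source.to_bytes())
partial = ToySerializable().from_bytes(source.to_bytes(exclude=("weights",)))
```

Returning `self` from `from_bytes` mirrors the "modifies the object in place and returns it" behaviour described above and allows the construct-then-load one-liner.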

TextCategorizer.labels

The labels currently added to the component.

Example

textcat.add_label("MY_LABEL")
assert "MY_LABEL" in textcat.labels

| Name | Type | Description |
| --- | --- | --- |
| RETURNS | tuple | The labels added to the component. |