spaCy/textcategorizer.md at c91577db028e3343e7280f0614f7bd89451f93f0

mirror of https://github.com/explosion/spaCy.git synced 2024-11-14 13:47:13 +03:00

Tidy up and improve docs and docstrings (#3370)

## Description
* tidy up and adjust Cython code to code style
* improve docstrings and make calling `help()` nicer
* add URLs to new docs pages to docstrings wherever possible, mostly to user-facing objects
* fix various typos and inconsistencies in docs

### Types of change
enhancement, docs

## Checklist
- [x] I have submitted the spaCy Contributor Agreement.
- [x] I ran the tests, and all new and existing tests passed.
- [x] My changes don't require a change to the documentation, or if they do, I've added all required information.

2019-03-08 11:42:26 +01:00

title	tag	source	new
TextCategorizer	class	spacy/pipeline/pipes.pyx	2

This class is a subclass of Pipe and follows the same API. The pipeline component is available in the processing pipeline via the ID "textcat".

TextCategorizer.Model

Initialize a model for the pipe. The model should implement the thinc.neural.Model API. Wrappers are under development for most major machine learning libraries.

Name	Type	Description
`**kwargs`	-	Parameters for initializing the model
RETURNS	object	The initialized model.

TextCategorizer.__init__

Create a new pipeline instance. In your application, you would normally use a shortcut for this and instantiate the component using its string name and nlp.create_pipe.

Example

# Construction via create_pipe
textcat = nlp.create_pipe("textcat")
textcat = nlp.create_pipe("textcat", config={"exclusive_classes": True})

# Construction from class
from spacy.pipeline import TextCategorizer
textcat = TextCategorizer(nlp.vocab)
textcat.from_disk("/path/to/model")

Name	Type	Description
`vocab`	`Vocab`	The shared vocabulary.
`model`	`thinc.neural.Model` / `True`	The model powering the pipeline component. If no model is supplied, the model is created when you call `begin_training`, `from_disk` or `from_bytes`.
`exclusive_classes`	bool	Make categories mutually exclusive. Defaults to `False`.
`architecture`	unicode	Model architecture to use, see architectures for details. Defaults to `"ensemble"`.
RETURNS	`TextCategorizer`	The newly constructed object.
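
To illustrate what `exclusive_classes` changes, the sketch below contrasts two common ways of turning raw model outputs into category scores. It assumes the usual technique (a softmax when categories are mutually exclusive, an independent sigmoid per category otherwise) and is not a copy of spaCy's model code. The function names and label names are hypothetical.

```python
import math

def exclusive_scores(raw):
    """Softmax: categories compete, so the scores sum to 1."""
    shift = max(raw.values())  # subtract the max for numerical stability
    exps = {label: math.exp(value - shift) for label, value in raw.items()}
    total = sum(exps.values())
    return {label: e / total for label, e in exps.items()}

def independent_scores(raw):
    """Sigmoid per category: each score is judged on its own."""
    return {label: 1.0 / (1.0 + math.exp(-value)) for label, value in raw.items()}

# Hypothetical raw outputs for three hypothetical labels
raw = {"POSITIVE": 2.0, "NEGATIVE": 0.5, "NEUTRAL": -1.0}
exclusive = exclusive_scores(raw)
independent = independent_scores(raw)
```

With mutually exclusive categories exactly one label is expected to win, whereas the independent scores can all be high (or all low) at the same time.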

Architectures

Text classification models can be used to solve a wide variety of problems. Differences in text length, number of labels, difficulty, and runtime performance constraints mean that no single algorithm performs well on all types of problems. To handle a wider variety of problems, the TextCategorizer object allows configuration of its model architecture, using the architecture keyword argument.

Name	Description
`"ensemble"`	Default: Stacked ensemble of a unigram bag-of-words model and a neural network model. The neural network uses a CNN with mean pooling and attention.
`"simple_cnn"`	A neural network model where token vectors are calculated using a CNN. The vectors are mean pooled and used as features in a feed-forward network.
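
Both architectures mention mean pooling. As a rough illustration of that single step only, the sketch below averages per-token vectors into one fixed-width document vector. It uses plain Python lists and a hypothetical `mean_pool` helper, and says nothing about how the CNN or the feed-forward layers are implemented.

```python
def mean_pool(token_vectors):
    """Average a list of equal-width token vectors into one document vector."""
    if not token_vectors:
        raise ValueError("cannot pool an empty document")
    n_tokens = len(token_vectors)
    width = len(token_vectors[0])
    return [sum(vec[i] for vec in token_vectors) / n_tokens for i in range(width)]

# Three hypothetical 2-dimensional token vectors
token_vectors = [[1.0, 0.0], [3.0, 2.0], [2.0, 4.0]]
doc_vector = mean_pool(token_vectors)
```

The pooled vector has the same width regardless of document length, which is what lets a fixed-size network consume it.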

TextCategorizer.__call__

Apply the pipe to one document. The document is modified in place, and returned. This usually happens under the hood when the nlp object is called on a text and all pipeline components are applied to the Doc in order. Both __call__ and pipe delegate to the predict and set_annotations methods.

Example

textcat = TextCategorizer(nlp.vocab)
doc = nlp(u"This is a sentence.")
# This usually happens under the hood
processed = textcat(doc)

Name	Type	Description
`doc`	`Doc`	The document to process.
RETURNS	`Doc`	The processed document.
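
The delegation described above can be pictured with a small stand-in class. This is a sketch under assumptions, not spaCy's code: `ToyDoc`, `ToyCategorizer` and the keyword-matching "model" are hypothetical, and only the split between a side-effect-free `predict` and a mutating `set_annotations` mirrors the text.

```python
class ToyDoc:
    """Hypothetical stand-in for a document with a `cats` dict."""
    def __init__(self, text):
        self.text = text
        self.cats = {}

class ToyCategorizer:
    def __init__(self, labels):
        self.labels = tuple(labels)

    def predict(self, docs):
        # Placeholder "model": 1.0 if the label occurs in the text, else 0.0.
        # The docs themselves are not modified here.
        return [
            [1.0 if label.lower() in doc.text.lower() else 0.0 for label in self.labels]
            for doc in docs
        ]

    def set_annotations(self, docs, scores):
        # Write the pre-computed scores onto each doc.
        for doc, row in zip(docs, scores):
            for label, score in zip(self.labels, row):
                doc.cats[label] = score

    def __call__(self, doc):
        # Delegate to predict and set_annotations, then return the same doc.
        scores = self.predict([doc])
        self.set_annotations([doc], scores)
        return doc

toy = ToyCategorizer(["SPORTS", "POLITICS"])
untouched = ToyDoc("sports news")
toy.predict([untouched])           # leaves untouched.cats empty
processed = toy(ToyDoc("sports news"))
```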

TextCategorizer.pipe

Apply the pipe to a stream of documents. This usually happens under the hood when the nlp object is called on a text and all pipeline components are applied to the Doc in order. Both __call__ and pipe delegate to the predict and set_annotations methods.

Example

textcat = TextCategorizer(nlp.vocab)
for doc in textcat.pipe(docs, batch_size=50):
    pass

Name	Type	Description
`stream`	iterable	A stream of documents.
`batch_size`	int	The number of texts to buffer. Defaults to `128`.
YIELDS	`Doc`	Processed documents in the order of the original text.
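
The buffering behaviour implied by `batch_size` can be sketched with a generator that collects items into batches, processes each batch, and yields the items in their original order. The helper names `minibatch` and `pipe` and the `process_batch` callback are hypothetical; this illustrates the pattern and is not the library's implementation.

```python
from itertools import islice

def minibatch(stream, batch_size):
    """Yield lists of up to batch_size items from any iterable."""
    iterator = iter(stream)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch

def pipe(stream, process_batch, batch_size=128):
    """Process a stream batch by batch, yielding items in input order."""
    for batch in minibatch(stream, batch_size):
        process_batch(batch)  # e.g. predict + set_annotations on the batch
        for item in batch:
            yield item

# Record the batch sizes seen by the (hypothetical) processing callback
seen_sizes = []
items = ["a", "b", "c", "d", "e"]
out = list(pipe(items, lambda batch: seen_sizes.append(len(batch)), batch_size=2))
```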

TextCategorizer.predict

Apply the pipeline's model to a batch of docs, without modifying them.

Example

textcat = TextCategorizer(nlp.vocab)
scores, tensors = textcat.predict([doc1, doc2])

Name	Type	Description
`docs`	iterable	The documents to predict.
RETURNS	tuple	A `(scores, tensors)` tuple where `scores` is the model's prediction for each document and `tensors` is the token representations used to predict the scores. Each tensor is an array with one row for each token in the document.

TextCategorizer.set_annotations

Modify a batch of documents, using pre-computed scores.

Example

textcat = TextCategorizer(nlp.vocab)
scores, tensors = textcat.predict([doc1, doc2])
textcat.set_annotations([doc1, doc2], scores)

Name	Type	Description
`docs`	iterable	The documents to modify.
`scores`	-	The scores to set, produced by `TextCategorizer.predict`.

TextCategorizer.update

Learn from a batch of documents and gold-standard information, updating the pipe's model. Delegates to predict and get_loss.

Example

textcat = TextCategorizer(nlp.vocab)
losses = {}
optimizer = nlp.begin_training()
textcat.update([doc1, doc2], [gold1, gold2], losses=losses, sgd=optimizer)

Name	Type	Description
`docs`	iterable	A batch of documents to learn from.
`golds`	iterable	The gold-standard data. Must have the same length as `docs`.
`drop`	float	The dropout rate.
`sgd`	callable	The optimizer. Should take two arguments `weights` and `gradient`, and an optional ID.
`losses`	dict	Optional record of the loss during training. The value keyed by the model's name is updated.
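
The `losses` argument is a plain dict that the caller owns. A minimal sketch of the bookkeeping, assuming the loss of each batch is added to the entry keyed by the component's name (the helper `record_loss` is hypothetical):

```python
def record_loss(losses, name, batch_loss):
    """Accumulate a batch loss under the component's name, if a dict was given."""
    if losses is None:
        return
    losses.setdefault(name, 0.0)
    losses[name] += batch_loss

losses = {}
for batch_loss in (0.5, 0.25):  # hypothetical per-batch losses
    record_loss(losses, "textcat", batch_loss)
```

Because the dict is shared, several components can report into the same `losses` object during one training step.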

TextCategorizer.get_loss

Find the loss and gradient of loss for the batch of documents and their predicted scores.

Example

textcat = TextCategorizer(nlp.vocab)
scores, tensors = textcat.predict([doc1, doc2])
loss, d_loss = textcat.get_loss([doc1, doc2], [gold1, gold2], scores)

Name	Type	Description
`docs`	iterable	The batch of documents.
`golds`	iterable	The gold-standard data. Must have the same length as `docs`.
`scores`	-	Scores representing the model's predictions.
RETURNS	tuple	The loss and the gradient, i.e. `(loss, gradient)`.
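
To make the `(loss, gradient)` pair concrete, here is a small worked sketch using a halved mean squared error over a batch. The choice of loss function is an assumption made for illustration, and `mse_loss` is a hypothetical helper; the source does not state which loss the component uses.

```python
def mse_loss(scores, truths):
    """Return (loss, gradient) for halved squared error averaged over docs.

    loss     = 0.5 * sum((score - truth) ** 2) / n_docs
    gradient = (score - truth) / n_docs, one value per score
    """
    n_docs = len(scores)
    gradient = [
        [(s - t) / n_docs for s, t in zip(score_row, truth_row)]
        for score_row, truth_row in zip(scores, truths)
    ]
    squared = sum(
        (s - t) ** 2
        for score_row, truth_row in zip(scores, truths)
        for s, t in zip(score_row, truth_row)
    )
    return 0.5 * squared / n_docs, gradient

# Two docs, two hypothetical categories each
scores = [[0.75, 0.25], [0.5, 0.5]]
truths = [[1.0, 0.0], [0.0, 1.0]]
loss, d_loss = mse_loss(scores, truths)
```

The gradient has the same shape as `scores`, so it can be passed straight back through the model during the update.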

TextCategorizer.begin_training

Initialize the pipe for training, using data examples if available. If no model has been initialized yet, the model is added.

Example

textcat = TextCategorizer(nlp.vocab)
nlp.pipeline.append(textcat)
optimizer = textcat.begin_training(pipeline=nlp.pipeline)

Name	Type	Description
`gold_tuples`	iterable	Optional gold-standard annotations from which to construct `GoldParse` objects.
`pipeline`	list	Optional list of pipeline components that this component is part of.
`sgd`	callable	An optional optimizer. Should take two arguments `weights` and `gradient`, and an optional ID. Will be created via `create_optimizer` if not set.
RETURNS	callable	An optimizer.

TextCategorizer.create_optimizer

Create an optimizer for the pipeline component.

Example

textcat = TextCategorizer(nlp.vocab)
optimizer = textcat.create_optimizer()

Name	Type	Description
RETURNS	callable	The optimizer.

TextCategorizer.use_params

Modify the pipe's model to use the given parameter values.

Example

textcat = TextCategorizer(nlp.vocab)
# Assumes `optimizer` is the optimizer that was used during training
with textcat.use_params(optimizer.averages):
    textcat.to_disk("/best_model")

Name	Type	Description
`params`	-	The parameter values to use in the model. At the end of the context, the original parameters are restored.
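
The "swap in, then restore" behaviour described for `params` is a standard context-manager pattern. The sketch below shows it on a hypothetical `ToyModel` that stores its weights in a dict. It is an illustration of the pattern only and makes no claim about how the real model stores or swaps parameters.

```python
from contextlib import contextmanager

class ToyModel:
    """Hypothetical model whose parameters live in a plain dict."""
    def __init__(self, weights):
        self.weights = dict(weights)

    @contextmanager
    def use_params(self, params):
        backup = dict(self.weights)   # remember the original parameters
        self.weights.update(params)   # temporarily use the given values
        try:
            yield self
        finally:
            self.weights = backup     # restore, even if the body raised

model = ToyModel({"W": 1.0, "b": 0.0})
with model.use_params({"W": 0.9}):
    inside = dict(model.weights)      # what a save inside the block would see
after = dict(model.weights)
```

Anything done inside the `with` block, such as saving to disk, sees the temporary values, while training can continue afterwards with the original ones.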

TextCategorizer.add_label

Add a new label to the pipe.

Example

textcat = TextCategorizer(nlp.vocab)
textcat.add_label("MY_LABEL")

Name	Type	Description
`label`	unicode	The label to add.

TextCategorizer.to_disk

Serialize the pipe to disk.

Example

textcat = TextCategorizer(nlp.vocab)
textcat.to_disk("/path/to/textcat")

Name	Type	Description
`path`	unicode / `Path`	A path to a directory, which will be created if it doesn't exist. Paths may be either strings or `Path`-like objects.

TextCategorizer.from_disk

Load the pipe from disk. Modifies the object in place and returns it.

Example

textcat = TextCategorizer(nlp.vocab)
textcat.from_disk("/path/to/textcat")

Name	Type	Description
`path`	unicode / `Path`	A path to a directory. Paths may be either strings or `Path`-like objects.
RETURNS	`TextCategorizer`	The modified `TextCategorizer` object.

TextCategorizer.to_bytes

Serialize the pipe to a bytestring.

Example

textcat = TextCategorizer(nlp.vocab)
textcat_bytes = textcat.to_bytes()

Name	Type	Description
`**exclude`	-	Named attributes to prevent from being serialized.
RETURNS	bytes	The serialized form of the `TextCategorizer` object.

TextCategorizer.from_bytes

Load the pipe from a bytestring. Modifies the object in place and returns it.

Example

textcat_bytes = textcat.to_bytes()
textcat = TextCategorizer(nlp.vocab)
textcat.from_bytes(textcat_bytes)

Name	Type	Description
`bytes_data`	bytes	The data to load from.
`**exclude`	-	Named attributes to prevent from being loaded.
RETURNS	`TextCategorizer`	The `TextCategorizer` object.
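
A `to_bytes` / `from_bytes` round trip with excluded attributes can be sketched as follows. `ToyPipe`, its two attributes, the JSON encoding, and the convention of passing excluded names as keyword arguments are all assumptions made for this illustration; the real component serializes different data in a different format.

```python
import json

class ToyPipe:
    """Hypothetical component with two serializable attributes."""
    def __init__(self, labels=(), cfg=None):
        self.labels = tuple(labels)
        self.cfg = dict(cfg or {})

    def to_bytes(self, **exclude):
        data = {"labels": list(self.labels), "cfg": self.cfg}
        # Drop every attribute named in `exclude` before encoding
        kept = {key: value for key, value in data.items() if key not in exclude}
        return json.dumps(kept, sort_keys=True).encode("utf8")

    def from_bytes(self, bytes_data, **exclude):
        data = json.loads(bytes_data.decode("utf8"))
        if "labels" in data and "labels" not in exclude:
            self.labels = tuple(data["labels"])
        if "cfg" in data and "cfg" not in exclude:
            self.cfg = data["cfg"]
        return self  # modified in place and returned

source = ToyPipe(labels=["A", "B"], cfg={"exclusive_classes": True})
restored = ToyPipe().from_bytes(source.to_bytes())
partial = ToyPipe().from_bytes(source.to_bytes(cfg=True))  # cfg left out
```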

TextCategorizer.labels

The labels currently added to the component.

Example

textcat.add_label("MY_LABEL")
assert "MY_LABEL" in textcat.labels

Name	Type	Description
RETURNS	tuple	The labels added to the component.
