spaCy/textcategorizer.md at e597110d31c41642a01d4c22e485bd5204ac7c81

mirror of https://github.com/explosion/spaCy.git synced 2025-07-02 19:03:14 +03:00

## Description

The new website is implemented using [Gatsby](https://www.gatsbyjs.org) with [Remark](https://github.com/remarkjs/remark) and [MDX](https://mdxjs.com/). This allows authoring content in **straightforward Markdown** without the usual limitations. Standard elements can be overwritten with powerful [React](http://reactjs.org/) components and wherever Markdown syntax isn't enough, JSX components can be used. Hopefully, this update will also make it much easier to contribute to the docs. Once this PR is merged, I'll implement auto-deployment via [Netlify](https://netlify.com) on a specific branch (to avoid building the website on every PR). There's a bunch of other cool stuff that the new setup will allow us to do – including writing front-end tests, service workers, offline support, implementing a search and so on.

This PR also includes various new docs pages and content.
Resolves #3270. Resolves #3222. Resolves #2947. Resolves #2837.


### Types of change
enhancement

## Checklist
- [x] I have submitted the spaCy Contributor Agreement.
- [x] I ran the tests, and all new and existing tests passed.
- [x] My changes don't require a change to the documentation, or if they do, I've added all required information.

2019-02-17 19:31:19 +01:00

title	tag	source	new
TextCategorizer	class	spacy/pipeline.pyx	2

This class is a subclass of Pipe and follows the same API. The pipeline component is available in the processing pipeline via the ID "textcat".

TextCategorizer.Model

Initialize a model for the pipe. The model should implement the thinc.neural.Model API. Wrappers are under development for most major machine learning libraries.

Name	Type	Description
`**kwargs`	-	Parameters for initializing the model
RETURNS	object	The initialized model.

TextCategorizer.__init__

Create a new pipeline instance. In your application, you would normally use a shortcut for this and instantiate the component using its string name and nlp.create_pipe.

Example

# Construction via create_pipe
textcat = nlp.create_pipe("textcat")

# Construction from class
from spacy.pipeline import TextCategorizer
textcat = TextCategorizer(nlp.vocab)
textcat.from_disk("/path/to/model")

Name	Type	Description
`vocab`	`Vocab`	The shared vocabulary.
`model`	`thinc.neural.Model` or `True`	The model powering the pipeline component. If no model is supplied, the model is created when you call `begin_training`, `from_disk` or `from_bytes`.
`**cfg`	-	Configuration parameters.
RETURNS	`TextCategorizer`	The newly constructed object.

TextCategorizer.__call__

Apply the pipe to one document. The document is modified in place, and returned. Both __call__ and pipe delegate to the predict and set_annotations methods.

Example

textcat = TextCategorizer(nlp.vocab)
doc = nlp(u"This is a sentence.")
processed = textcat(doc)

Name	Type	Description
`doc`	`Doc`	The document to process.
RETURNS	`Doc`	The processed document.
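
The delegation described in this section can be sketched with a toy component. This is a minimal illustration and not spaCy's implementation: `ToyCategorizer`, the dict-based documents and the keyword scoring rule are all hypothetical stand-ins chosen so the example runs without a trained model.

```python
class ToyCategorizer:
    """Hypothetical stand-in for a text categorizer (not spaCy's class)."""

    def __init__(self, labels):
        self.labels = tuple(labels)

    def predict(self, docs):
        # Score every label for every doc without modifying the docs.
        # The keyword match is a placeholder for a real model.
        return [
            {label: float(label.lower() in doc["text"].lower())
             for label in self.labels}
            for doc in docs
        ]

    def set_annotations(self, docs, scores):
        # Write the pre-computed scores onto each doc in place.
        for doc, doc_scores in zip(docs, scores):
            doc["cats"] = dict(doc_scores)

    def __call__(self, doc):
        # Delegate to predict and set_annotations, then return the same doc.
        scores = self.predict([doc])
        self.set_annotations([doc], scores)
        return doc


toy = ToyCategorizer(["SPORTS", "POLITICS"])
doc = {"text": "Sports news today"}
processed = toy(doc)
```

Splitting prediction from annotation keeps `predict` free of side effects, so scores can be inspected or post-processed before they are written to the documents.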

TextCategorizer.pipe

Apply the pipe to a stream of documents. Both __call__ and pipe delegate to the predict and set_annotations methods.

Example

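texts = [u"One doc", u"...", u"Lots of docs"]
textcat = TextCategorizer(nlp.vocab)
# The component expects Doc objects, so tokenize the raw texts first
docs = (nlp.make_doc(text) for text in texts)
for doc in textcat.pipe(docs, batch_size=50):
    pass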

Name	Type	Description
`stream`	iterable	A stream of documents.
`batch_size`	int	The number of texts to buffer. Defaults to `128`.
`n_threads`	int	The number of worker threads to use. If `-1`, OpenMP will decide how many to use at run time. Default is `-1`.
YIELDS	`Doc`	Processed documents in the order of the original text.
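
The buffering behaviour controlled by `batch_size` can be sketched as a small generator. This is an assumption-laden illustration rather than spaCy's code: `pipe_in_batches` and the `process_batch` callable are hypothetical names, and the items can be any objects.

```python
from itertools import islice


def pipe_in_batches(stream, process_batch, batch_size=128):
    """Buffer `stream` into batches and yield processed items in order."""
    iterator = iter(stream)
    while True:
        # Take up to batch_size items; the last batch may be smaller.
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        # process_batch is a hypothetical callable that handles a whole batch.
        yield from process_batch(batch)


batch_sizes = []


def record(batch):
    # Remember how large each batch was and pass the items through unchanged.
    batch_sizes.append(len(batch))
    return batch


results = list(pipe_in_batches(range(5), record, batch_size=2))
```

Because the generator consumes the stream lazily, only one batch is held in memory at a time, and the output order matches the input order.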

TextCategorizer.predict

Apply the pipeline's model to a batch of docs, without modifying them.

Example

textcat = TextCategorizer(nlp.vocab)
scores = textcat.predict([doc1, doc2])

Name	Type	Description
`docs`	iterable	The documents to predict.
RETURNS	-	Scores from the model.

TextCategorizer.set_annotations

Modify a batch of documents, using pre-computed scores.

Example

textcat = TextCategorizer(nlp.vocab)
scores = textcat.predict([doc1, doc2])
textcat.set_annotations([doc1, doc2], scores)

Name	Type	Description
`docs`	iterable	The documents to modify.
`scores`	-	The scores to set, produced by `TextCategorizer.predict`.

TextCategorizer.update

Learn from a batch of documents and gold-standard information, updating the pipe's model. Delegates to predict and get_loss.

Example

textcat = TextCategorizer(nlp.vocab)
losses = {}
optimizer = nlp.begin_training()
textcat.update([doc1, doc2], [gold1, gold2], losses=losses, sgd=optimizer)

Name	Type	Description
`docs`	iterable	A batch of documents to learn from.
`golds`	iterable	The gold-standard data. Must have the same length as `docs`.
`drop`	float	The dropout rate.
`sgd`	callable	The optimizer. Should take two arguments `weights` and `gradient`, and an optional ID.
`losses`	dict	Optional record of the loss during training. The value keyed by the model's name is updated.
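
How the optional `losses` record might be maintained can be sketched as follows. The helper `record_loss` is hypothetical, and the sketch assumes the value is accumulated across batches under the component's name rather than overwritten.

```python
def record_loss(losses, name, loss):
    """Add `loss` to the running total stored under `name`."""
    if losses is not None:
        # Create the entry on first use, then accumulate.
        losses.setdefault(name, 0.0)
        losses[name] += loss


losses = {}
for batch_loss in [0.5, 0.25]:
    record_loss(losses, "textcat", batch_loss)
```

Passing the same dict to every update call lets a training loop report one running figure per component.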

TextCategorizer.get_loss

Find the loss and gradient of loss for the batch of documents and their predicted scores.

Example

textcat = TextCategorizer(nlp.vocab)
scores = textcat.predict([doc1, doc2])
loss, d_loss = textcat.get_loss([doc1, doc2], [gold1, gold2], scores)

Name	Type	Description
`docs`	iterable	The batch of documents.
`golds`	iterable	The gold-standard data. Must have the same length as `docs`.
`scores`	-	Scores representing the model's predictions.
RETURNS	tuple	The loss and the gradient, i.e. `(loss, gradient)`.
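
The shape of the returned `(loss, gradient)` tuple can be illustrated with a squared-error objective. The choice of loss here is an assumption made for the sketch, since this page does not state which objective the component uses, and `squared_error_loss` is a hypothetical helper operating on plain lists.

```python
def squared_error_loss(scores, truths):
    """Return (loss, gradient) for per-label scores against 0/1 truths.

    loss     = sum((score - truth) ** 2) / (2 * n_docs)
    gradient = (score - truth) / n_docs, one value per score
    """
    n_docs = len(scores)
    gradient = [
        [(s - t) / n_docs for s, t in zip(score_row, truth_row)]
        for score_row, truth_row in zip(scores, truths)
    ]
    total = sum(
        (s - t) ** 2
        for score_row, truth_row in zip(scores, truths)
        for s, t in zip(score_row, truth_row)
    )
    return total / (2 * n_docs), gradient


# Two docs, two labels each; the second doc is already predicted correctly.
scores = [[0.5, 0.5], [1.0, 0.0]]
truths = [[1.0, 0.0], [1.0, 0.0]]
loss, d_loss = squared_error_loss(scores, truths)
```

The gradient has the same shape as the scores, which is what lets it be passed back through the model during an update.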

TextCategorizer.begin_training

Initialize the pipe for training, using data examples if available. If no model has been initialized yet, the model is added.

Example

textcat = TextCategorizer(nlp.vocab)
nlp.pipeline.append(textcat)
optimizer = textcat.begin_training(pipeline=nlp.pipeline)

Name	Type	Description
`gold_tuples`	iterable	Optional gold-standard annotations from which to construct `GoldParse` objects.
`pipeline`	list	Optional list of pipeline components that this component is part of.
`sgd`	callable	An optional optimizer. Should take two arguments `weights` and `gradient`, and an optional ID. Will be created via `create_optimizer` if not set.
RETURNS	callable	An optimizer.

TextCategorizer.create_optimizer

Create an optimizer for the pipeline component.

Example

textcat = TextCategorizer(nlp.vocab)
optimizer = textcat.create_optimizer()

Name	Type	Description
RETURNS	callable	The optimizer.

TextCategorizer.use_params

Modify the pipe's model, to use the given parameter values.

Example

textcat = TextCategorizer(nlp.vocab)
# `optimizer` is assumed to come from training, e.g. textcat.begin_training()
with textcat.use_params(optimizer.averages):
    textcat.to_disk('/best_model')

Name	Type	Description
`params`	-	The parameter values to use in the model. At the end of the context, the original parameters are restored.
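
The swap-and-restore behaviour of the context can be sketched with `contextlib`. This is a simplified illustration, not spaCy's implementation: `ToyModel` and the free function `use_params` below are hypothetical, and the parameters are a plain list.

```python
from contextlib import contextmanager


class ToyModel:
    """Hypothetical model that holds its parameters in one attribute."""

    def __init__(self, weights):
        self.weights = weights


@contextmanager
def use_params(model, params):
    # Swap in the given parameters for the duration of the block.
    backup = model.weights
    model.weights = params
    try:
        yield
    finally:
        # Restore the original parameters, even if the block raised.
        model.weights = backup


model = ToyModel([1.0, 2.0])
with use_params(model, [9.0, 9.0]):
    inside = list(model.weights)
after = list(model.weights)
```

A typical use is to save or evaluate a model with averaged parameters while leaving the parameters used for further training untouched.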

TextCategorizer.add_label

Add a new label to the pipe.

Example

textcat = TextCategorizer(nlp.vocab)
textcat.add_label('MY_LABEL')

Name	Type	Description
`label`	unicode	The label to add.

TextCategorizer.to_disk

Serialize the pipe to disk.

Example

textcat = TextCategorizer(nlp.vocab)
textcat.to_disk('/path/to/textcat')

Name	Type	Description
`path`	unicode / `Path`	A path to a directory, which will be created if it doesn't exist. Paths may be either strings or `Path`-like objects.

TextCategorizer.from_disk

Load the pipe from disk. Modifies the object in place and returns it.

Example

textcat = TextCategorizer(nlp.vocab)
textcat.from_disk('/path/to/textcat')

Name	Type	Description
`path`	unicode / `Path`	A path to a directory. Paths may be either strings or `Path`-like objects.
RETURNS	`TextCategorizer`	The modified `TextCategorizer` object.

TextCategorizer.to_bytes

Serialize the pipe to a bytestring.

Example

textcat = TextCategorizer(nlp.vocab)
textcat_bytes = textcat.to_bytes()

Name	Type	Description
`**exclude`	-	Named attributes to prevent from being serialized.
RETURNS	bytes	The serialized form of the `TextCategorizer` object.

TextCategorizer.from_bytes

Load the pipe from a bytestring. Modifies the object in place and returns it.

Example

textcat_bytes = textcat.to_bytes()
textcat = TextCategorizer(nlp.vocab)
textcat.from_bytes(textcat_bytes)

Name	Type	Description
`bytes_data`	bytes	The data to load from.
`**exclude`	-	Named attributes to prevent from being loaded.
RETURNS	`TextCategorizer`	The `TextCategorizer` object.
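
A round trip through bytes, including the effect of `**exclude`, can be sketched with a toy object. Everything here is an assumption for illustration: `ToyPipe`, its `cfg` and `labels` attributes and the JSON encoding are hypothetical and do not reflect spaCy's actual serialization format.

```python
import json


class ToyPipe:
    """Hypothetical component with two serializable attributes."""

    def __init__(self):
        self.cfg = {}
        self.labels = []

    def to_bytes(self, **exclude):
        data = {"cfg": self.cfg, "labels": self.labels}
        # Skip any attribute named in the keyword arguments.
        kept = {key: value for key, value in data.items() if key not in exclude}
        return json.dumps(kept).encode("utf8")

    def from_bytes(self, bytes_data, **exclude):
        data = json.loads(bytes_data.decode("utf8"))
        for key, value in data.items():
            if key not in exclude:
                setattr(self, key, value)
        # Modify in place and return the object, as described above.
        return self


source = ToyPipe()
source.cfg = {"threshold": 0.5}
source.labels = ["MY_LABEL"]

full_copy = ToyPipe().from_bytes(source.to_bytes())
partial_copy = ToyPipe().from_bytes(source.to_bytes(), cfg=True)
```

Returning the object from `from_bytes` allows construction and loading to be chained in a single expression.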

TextCategorizer.labels

The labels currently added to the component.

Example

textcat.add_label("MY_LABEL")
assert "MY_LABEL" in textcat.labels

Name	Type	Description
RETURNS	tuple	The labels added to the component.
