spaCy/examples/training/train_textcat.py

#!/usr/bin/env python
# coding: utf8
"""Train a convolutional neural network text classifier on the
IMDB dataset, using the TextCategorizer component. The dataset will be loaded
automatically via Thinc's built-in dataset loader. The model is added to
spacy.pipeline, and predictions are available via `doc.cats`. For more details,
see the documentation:
* Training: https://spacy.io/usage/training

Compatible with: spaCy v2.0.0+
"""
from __future__ import unicode_literals, print_function
import plac
import random
from pathlib import Path
import thinc.extra.datasets

import spacy
from spacy.util import minibatch, compounding


@plac.annotations(
    model=("Model name. Defaults to blank 'en' model.", "option", "m", str),
    output_dir=("Optional output directory", "option", "o", Path),
    n_texts=("Number of texts to train from", "option", "t", int),
    n_iter=("Number of training iterations", "option", "n", int))
def main(model=None, output_dir=None, n_iter=20, n_texts=2000):
    if model is not None:
        nlp = spacy.load(model)  # load existing spaCy model
        print("Loaded model '%s'" % model)
    else:
        nlp = spacy.blank('en')  # create blank Language class
        print("Created blank 'en' model")

    # add the text classifier to the pipeline if it doesn't exist
    # nlp.create_pipe works for built-ins that are registered with spaCy
    if 'textcat' not in nlp.pipe_names:
        textcat = nlp.create_pipe('textcat')
        nlp.add_pipe(textcat, last=True)
    # otherwise, get it, so we can add labels to it
    else:
        textcat = nlp.get_pipe('textcat')

    # add label to text classifier
    textcat.add_label('POSITIVE')

    # load the IMDB dataset
    print("Loading IMDB data...")
    (train_texts, train_cats), (dev_texts, dev_cats) = load_data(limit=n_texts)
    print("Using {} examples ({} training, {} evaluation)"
          .format(len(train_texts) + len(dev_texts), len(train_texts),
                  len(dev_texts)))
    train_data = list(zip(train_texts,
                          [{'cats': cats} for cats in train_cats]))

    # get names of other pipes to disable them during training
    other_pipes = [pipe for pipe in nlp.pipe_names if pipe != 'textcat']
    with nlp.disable_pipes(*other_pipes):  # only train textcat
        optimizer = nlp.begin_training()
        print("Training the model...")
        print('{:^5}\t{:^5}\t{:^5}\t{:^5}'.format('LOSS', 'P', 'R', 'F'))
        for i in range(n_iter):
            losses = {}
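            # shuffle each epoch, then batch up the examples using spaCy's
            # minibatch helper with a compounding batch size (4 up to 32)
            random.shuffle(train_data)
            batches = minibatch(train_data, size=compounding(4., 32., 1.001))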
            for batch in batches:
                texts, annotations = zip(*batch)
                nlp.update(texts, annotations, sgd=optimizer, drop=0.2,
                           losses=losses)
            with textcat.model.use_params(optimizer.averages):
                # evaluate on the dev data split off in load_data()
                scores = evaluate(nlp.tokenizer, textcat, dev_texts, dev_cats)
            print('{0:.3f}\t{1:.3f}\t{2:.3f}\t{3:.3f}'  # print a simple table
                  .format(losses['textcat'], scores['textcat_p'],
                          scores['textcat_r'], scores['textcat_f']))

    # test the trained model
    test_text = "This movie sucked"
    doc = nlp(test_text)
    print(test_text, doc.cats)

    if output_dir is not None:
        output_dir = Path(output_dir)
        if not output_dir.exists():
            output_dir.mkdir(parents=True)  # also create missing parents
        nlp.to_disk(output_dir)
        print("Saved model to", output_dir)

        # test the saved model
        print("Loading from", output_dir)
        nlp2 = spacy.load(output_dir)
        doc2 = nlp2(test_text)
        print(test_text, doc2.cats)


def load_data(limit=0, split=0.8):
    """Load data from the IMDB dataset.

    limit: number of examples to use; 0 uses the whole training set.
    split: fraction of those examples kept for training; the rest is
        held out for evaluation.
    """
    # Partition off part of the train data for evaluation
    train_data, _ = thinc.extra.datasets.imdb()
    random.shuffle(train_data)
    train_data = train_data[-limit:]  # note: [-0:] keeps the full list
    texts, labels = zip(*train_data)
    cats = [{'POSITIVE': bool(y)} for y in labels]
    split = int(len(train_data) * split)
    return (texts[:split], cats[:split]), (texts[split:], cats[split:])


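def evaluate(tokenizer, textcat, texts, cats):
    """Score the text classifier on held-out data.

    Returns precision, recall and F-score at a 0.5 decision threshold.
    """
    docs = (tokenizer(text) for text in texts)
    # counts start at 1e-8 rather than 0 to avoid division by zero below
    tp = 1e-8  # True positives
    fp = 1e-8  # False positives
    fn = 1e-8  # False negatives
    tn = 1e-8  # True negatives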
    for i, doc in enumerate(textcat.pipe(docs)):
        gold = cats[i]
        for label, score in doc.cats.items():
            if label not in gold:
                continue
            if score >= 0.5 and gold[label] >= 0.5:
                tp += 1.
            elif score >= 0.5 and gold[label] < 0.5:
                fp += 1.
            elif score < 0.5 and gold[label] < 0.5:
                tn += 1
            elif score < 0.5 and gold[label] >= 0.5:
                fn += 1
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f_score = 2 * (precision * recall) / (precision + recall)
    return {'textcat_p': precision, 'textcat_r': recall, 'textcat_f': f_score}


if __name__ == '__main__':
    plac.call(main)