spaCy/spacy/tests/regression/test_issue3611.py

# coding: utf8
from __future__ import unicode_literals

import spacy
from spacy.util import minibatch, compounding


def test_issue3611():
    """Test that n-gram features in the textcat work even when n exceeds the token count of some docs."""
    unique_classes = ["offensive", "inoffensive"]
    x_train = [
        "This is an offensive text",
        "This is the second offensive text",
        "inoff",  # single-token doc, shorter than ngram_size=2
    ]
    y_train = ["offensive", "offensive", "inoffensive"]

    # Prepare the data: one {"cats": {label: bool}} annotation per training text
    pos_cats = [
        {label: label == train_instance for label in unique_classes}
        for train_instance in y_train
    ]
    train_data = list(zip(x_train, [{"cats": cats} for cats in pos_cats]))

    # Set up a blank spaCy pipeline with a text categorizer component
    nlp = spacy.blank("en")

    textcat = nlp.create_pipe(
        "textcat",
        config={"exclusive_classes": True, "architecture": "bow", "ngram_size": 2},
    )

    for label in unique_classes:
        textcat.add_label(label)
    nlp.add_pipe(textcat, last=True)

    # Train only the textcat; the test passes if no update raises an error
    with nlp.disable_pipes([p for p in nlp.pipe_names if p != "textcat"]):
        optimizer = nlp.begin_training()
        for _ in range(3):
            losses = {}
            batches = minibatch(train_data, size=compounding(4.0, 32.0, 1.001))

            for batch in batches:
                nlp.update(
                    examples=batch,
                    sgd=optimizer,
                    drop=0.1,
                    losses=losses,
                )