"""This example shows how to add a multi-task objective that is trained
|
2018-01-21 21:46:37 +03:00
|
|
|
alongside the entity recognizer. This is an alternative to adding features
|
|
|
|
to the model.
|
|
|
|
|
|
|
|
The multi-task idea is to train an auxiliary model to predict some attribute,
|
|
|
|
with weights shared between the auxiliary model and the main model. In this
|
|
|
|
example, we're predicting the position of the word in the document.
|
|
|
|
|
|
|
|
The model that predicts the position of the word encourages the convolutional
|
|
|
|
layers to include the position information in their representation. The
|
|
|
|
information is then available to the main model, as a feature.
|
|
|
|
|
|
|
|
The overall idea is that we might know something about what sort of features
|
|
|
|
we'd like the CNN to extract. The multi-task objectives can encourage the
|
|
|
|
extraction of this type of feature. The multi-task objective is only used
|
|
|
|
during training. We discard the auxiliary model before run-time.
|
|
|
|
|
|
|
|
The specific example here is not necessarily a good idea --- but it shows
|
|
|
|
how an arbitrary objective function for some word can be used.
|
|
|
|
|
2019-10-27 18:01:32 +03:00
|
|
|
Developed for spaCy 2.0.6 and last tested for 2.2.2
|
2018-12-02 06:26:26 +03:00
|
|
|
"""
import random
import plac
import spacy
import os.path
from spacy.gold import read_json_file, GoldParse
from spacy.tokens import Doc

random.seed(0)

PWD = os.path.dirname(__file__)

TRAIN_DATA = list(read_json_file(os.path.join(PWD, "training-data.json")))


def get_position_label(i, words, tags, heads, labels, ents):
    """Return labels indicating the position of the word in the document."""
    if len(words) < 20:
        return "short-doc"
    elif i == 0:
        return "first-word"
    elif i < 10:
        return "early-word"
    elif i < 20:
        return "mid-word"
    elif i == len(words) - 1:
        return "last-word"
    else:
        return "late-word"


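# The position objective above is only one possibility. As a purely
# hypothetical illustration (it is not wired up below), any function with the
# same signature could be passed to add_multitask_objective instead; for
# example, an objective that predicts a coarse word-length bucket:
def get_word_length_label(i, words, tags, heads, labels, ents):
    """Hypothetical alternative objective: label each word by a length bucket."""
    if len(words[i]) < 4:
        return "short-word"
    elif len(words[i]) < 8:
        return "medium-word"
    else:
        return "long-word"

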
def main(n_iter=10):
    nlp = spacy.blank("en")
    ner = nlp.create_pipe("ner")
    # The auxiliary model trained by this objective shares weights with the
    # entity recognizer; it is only used during training and is discarded
    # before run-time.
    ner.add_multitask_objective(get_position_label)
    nlp.add_pipe(ner)

    _, sents = TRAIN_DATA[0]
    print("Create data, # of sentences =", len(sents) - 1)  # not counting the cats attribute
    optimizer = nlp.begin_training(get_gold_tuples=lambda: TRAIN_DATA)
    for itn in range(n_iter):
        random.shuffle(TRAIN_DATA)
        losses = {}
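        # Each TRAIN_DATA entry is a (raw_text, annots_brackets) pair loaded
        # by read_json_file: the last element of annots_brackets holds the
        # cats attribute, and the remaining elements are (annotations,
        # brackets) tuples with the token words stored in annotations[1].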
        for raw_text, annots_brackets in TRAIN_DATA:
            cats = annots_brackets.pop()
            for annotations, _ in annots_brackets:
                annotations.append(cats)  # temporarily add it here for from_annot_tuples to work
                doc = Doc(nlp.vocab, words=annotations[1])
                gold = GoldParse.from_annot_tuples(doc, annotations)
                annotations.pop()  # restore data

                nlp.update(
                    [doc],  # batch of texts
                    [gold],  # batch of annotations
                    drop=0.2,  # dropout - make it harder to memorise data
                    sgd=optimizer,  # callable to update weights
                    losses=losses,
                )
            annots_brackets.append(cats)  # restore data
        print(losses.get("nn_labeller", 0.0), losses["ner"])  # auxiliary loss, NER loss

    # test the trained model
    for text, _ in TRAIN_DATA:
        doc = nlp(text)
        print("Entities", [(ent.text, ent.label_) for ent in doc.ents])
        print("Tokens", [(t.text, t.ent_type_, t.ent_iob) for t in doc])
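
    # The auxiliary objective is only needed during training; at run-time you
    # only use the regular NER pipeline. A minimal, optional sketch of
    # persisting the trained pipeline (the directory name is illustrative):
    #
    #     nlp.to_disk("multitask_ner_model")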


if __name__ == "__main__":
    plac.call(main)