diff --git a/website/docs/usage/_data.json b/website/docs/usage/_data.json index c8c85af1d..f81fb245f 100644 --- a/website/docs/usage/_data.json +++ b/website/docs/usage/_data.json @@ -22,6 +22,7 @@ "Custom tokenization": "customizing-tokenizer", "Training": "training", "Adding languages": "adding-languages" + "Training NER": "training-ner", }, "Examples": { "Tutorials": "tutorials", @@ -106,6 +107,10 @@ "training": { "title": "Training the tagger, parser and entity recognizer" + "training-ner": { + "title": "Training the Named Entity Recognizer", + "next": "saving-loading" + }, }, "pos-tagging": { diff --git a/website/docs/usage/training-ner.jade b/website/docs/usage/training-ner.jade new file mode 100644 index 000000000..78eb4905e --- /dev/null +++ b/website/docs/usage/training-ner.jade @@ -0,0 +1,174 @@ +include ../../_includes/_mixins + +p + | All #[+a("/docs/usage/models") spaCy models] support online learning, so + | you can update a pre-trained model with new examples. You can even add + | new classes to an existing model, to recognise a new entity type, + | part-of-speech, or syntactic relation. Updating an existing model is + | particularly useful as a "quick and dirty solution", if you have only a + | few corrections or annotations. + ++h(2, "improving-accuracy") Improving accuracy on existing entity types + +p + | To update the model, you first need to create an instance of + | #[+api("goldparse") #[code spacy.gold.GoldParse]], with the entity labels + | you want to learn. You will then pass this instance to the + | #[+api("entityrecognizer#update") #[code EntityRecognizer.update()]] + | method. For example: + ++code. + import spacy + from spacy.gold import GoldParse + + nlp = spacy.load('en') + doc = nlp.make_doc(u'Facebook released React in 2014') + gold = GoldParse(doc, entities=['U-ORG', 'O', 'U-TECHNOLOGY', 'O', 'U-DATE']) + nlp.entity.update(doc, gold) + +p + | You'll usually need to provide many examples to meaningfully improve the + | system — a few hundred is a good start, although more is better. You + | should avoid iterating over the same few examples multiple times, or the + | model is likely to "forget" how to annotate other examples. If you + | iterate over the same few examples, you're effectively changing the loss + | function. The optimizer will find a way to minimize the loss on your + | examples, without regard for the consequences on the examples it's no + | longer paying attention to. + +p + | One way to avoid this "catastrophic forgetting" problem is to "remind" + | the model of other examples by augmenting your annotations with sentences + | annotated with entities automatically recognised by the original model. + | Ultimately, this is an empirical process: you'll need to + | #[strong experiment on your own data] to find a solution that works best + | for you. + ++h(2, "adding") Adding a new entity type + +p + | You can add new entity types to an existing model. Let's say we want to + | recognise the category #[code TECHNOLOGY]. The new category will include + | programming languages, frameworks and platforms. First, we need to + | register the new entity type: + ++code. + nlp.entity.add_label('TECHNOLOGY') + +p + | Next, iterate over your examples, calling #[code entity.update()]. As + | above, we want to avoid iterating over only a small number of sentences. + | A useful compromise is to run the model over a number of plain-text + | sentences, and pass the entities to #[code GoldParse], as "true" + | annotations. This encourages the optimizer to find a solution that + | predicts the new category with minimal difference from the previous + | output. + ++h(2, "saving-loading") Saving and loading + +p + | After training our model, you'll usually want to save its state, and load + | it back later. You can do this with the #[code Language.save_to_directory()] + | method: + ++code. + nlp.save_to_directory('/home/me/data/en_technology') + +p + | To make the model more convenient to deploy, we recommend wrapping it as + | a Python package, so that you can install it via pip and load it as a + | module. spaCy comes with a handy #[+a("/docs/usage/cli#package") CLI command] + | to create all required files and directories. + ++code(false, "bash"). + python -m spacy package /home/me/data/en_technology /home/me/my_models + +p + | To build the package and create a #[code .tar.gz] archive, run + | #[code python setup.py sdist] from within its directory. + ++infobox("Saving and loading models") + | For more information and a detailed guide on how to package your model, + | see the documentation on + | #[+a("/docs/usage/saving-loading") saving and loading models]. + +p + | After you've generated and installed the package, you'll be able to + | load the model as follows: + ++code. + import en_technology + nlp = en_technology.load() + ++h(2, "example") Example: Adding and training an #[code ANIMAL] entity + +p + | This script shows how to add a new entity type to an existing pre-trained + | NER model. To keep the example short and simple, only four sentences are + | provided as examples. In practice, you'll need many more — + | #[strong a few hundred] would be a good start. You will also likely need + | to mix in #[strong examples of other entity types], which might be + | obtained by running the entity recognizer over unlabelled sentences, and + | adding their annotations to the training set. + +p + | For the full, runnable script of this example, see + | #[+src(gh("spacy", "examples/training/train_new_entity_type.py")) train_new_entity_type.py]. + ++code("Training the entity recognizer"). + import spacy + from spacy.pipeline import EntityRecognizer + from spacy.gold import GoldParse + from spacy.tagger import Tagger + import random + + model_name = 'en' + entity_label = 'ANIMAL' + output_directory = '/path/to/model' + train_data = [ + ("Horses are too tall and they pretend to care about your feelings", + [(0, 6, 'ANIMAL')]), + ("horses are too tall and they pretend to care about your feelings", + [(0, 6, 'ANIMAL')]), + ("horses pretend to care about your feelings", + [(0, 6, 'ANIMAL')]), + ("they pretend to care about your feelings, those horses", + [(48, 54, 'ANIMAL')]) + ] + + nlp = spacy.load(model_name) + nlp.entity.add_label(entity_label) + ner = train_ner(nlp, train_data, output_directory) + + def train_ner(nlp, train_data, output_dir): + # Add new words to vocab + for raw_text, _ in train_data: + doc = nlp.make_doc(raw_text) + for word in doc: + _ = nlp.vocab[word.orth] + + for itn in range(20): + random.shuffle(train_data) + for raw_text, entity_offsets in train_data: + gold = GoldParse(doc, entities=entity_offsets) + doc = nlp.make_doc(raw_text) + nlp.tagger(doc) + loss = nlp.entity.update(doc, gold) + nlp.end_training() + nlp.save_to_directory(output_dir) + +p + +button(gh("spaCy", "examples/training/train_new_entity_type.py"), false, "secondary") Full example + +p + | The actual training is performed by looping over the examples, and + | calling #[code nlp.entity.update()]. The #[code update()] method steps + | through the words of the input. At each word, it makes a prediction. It + | then consults the annotations provided on the #[code GoldParse] instance, + | to see whether it was right. If it was wrong, it adjusts its weights so + | that the correct action will score higher next time. + +p + | After training your model, you can + | #[+a("/docs/usage/saving-loading") save it to a directory]. We recommend wrapping + | models as Python packages, for ease of deployment.