//- 💫 DOCS > USAGE > TRAINING > BASICS

include ../_spacy-101/_training

+h(3, "training-data") How do I get training data?

p
    | Collecting training data may sound incredibly painful – and it can be,
    | if you're planning a large-scale annotation project. However, if your main
    | goal is to update an existing model's predictions – for example, spaCy's
    | named entity recognition – the hard part is usually not creating the
    | actual annotations. It's finding representative examples and
    | #[strong extracting potential candidates]. The good news is, if you've
    | been noticing bad performance on your data, you likely
    | already have some relevant text, and you can use spaCy to
    | #[strong bootstrap a first set of training examples]. For example,
    | after processing a few sentences, you may end up with the following
    | entities, some correct, some incorrect.

+aside("How many examples do I need?")
    | As a rule of thumb, you should allocate at least 10% of your project
    | resources to creating training and evaluation data. If you're looking to
    | improve an existing model, you might be able to start off with only a
    | handful of examples. Keep in mind that you'll always want a lot more than
    | that for #[strong evaluation] – especially previous errors the model has
    | made. Otherwise, you won't be able to sufficiently verify that the model
    | has actually made the #[strong correct generalisations] required for your
    | use case.

+table(["Text", "Entity", "Start", "End", "Label", ""])
    - var style = [0, 0, 1, 1, 1]
    +annotation-row(["Uber blew through $1 million a week", "Uber", 0, 4, "ORG"], style)
        +cell #[+procon("yes", "right", true)]

    +annotation-row(["Android Pay expands to Canada", "Android", 0, 7, "PERSON"], style)
        +cell #[+procon("no", "wrong", true)]

    +annotation-row(["Android Pay expands to Canada", "Canada", 23, 29, "GPE"], style)
        +cell #[+procon("yes", "right", true)]

    +annotation-row(["Spotify steps up Asia expansion", "Spotify", 0, 7, "ORG"], style)
        +cell #[+procon("yes", "right", true)]

    +annotation-row(["Spotify steps up Asia expansion", "Asia", 17, 21, "NORP"], style)
        +cell #[+procon("no", "wrong", true)]

p
    | Alternatively, the
    | #[+a("/usage/linguistic-features#rule-based-matching") rule-based matcher]
    | can be a useful tool to extract tokens or combinations of tokens, as
    | well as their start and end index in a document. In this case, we'll
    | extract mentions of Google and assume they're an #[code ORG].

+table(["Text", "Entity", "Start", "End", "Label", ""])
    - var style = [0, 0, 1, 1, 1]
    +annotation-row(["let me google this for you", "google", 7, 13, "ORG"], style)
        +cell #[+procon("no", "wrong", true)]

    +annotation-row(["Google Maps launches location sharing", "Google", 0, 6, "ORG"], style)
        +cell #[+procon("no", "wrong", true)]

    +annotation-row(["Google rebrands its business apps", "Google", 0, 6, "ORG"], style)
        +cell #[+procon("yes", "right", true)]

    +annotation-row(["look what i found on google! 😂", "google", 21, 27, "ORG"], style)
        +cell #[+procon("no", "wrong", true)]

p
    | Based on the few examples above, you can already create six training
    | sentences with eight entities in total. Of course, what you consider a
    | "correct annotation" will always depend on
    | #[strong what you want the model to learn].
    | While there are some entity
    | annotations that are more or less universally correct – like Canada being
    | a geopolitical entity – your application may have its very own definition
    | of the #[+a("/api/annotation#named-entities") NER annotation scheme].

+code.
    train_data = [
        ("Uber blew through $1 million a week", [(0, 4, 'ORG')]),
        ("Android Pay expands to Canada", [(0, 11, 'PRODUCT'), (23, 29, 'GPE')]),
        ("Spotify steps up Asia expansion", [(0, 7, "ORG"), (17, 21, "LOC")]),
        ("Google Maps launches location sharing", [(0, 11, "PRODUCT")]),
        ("Google rebrands its business apps", [(0, 6, "ORG")]),
        ("look what i found on google! 😂", [(21, 27, "PRODUCT")])]

+infobox("Tip: Try the Prodigy annotation tool")
    +infobox-logos(["prodigy", 100, 29, "https://prodi.gy"])
    | If you need to label a lot of data, check out
    | #[+a("https://prodi.gy", true) Prodigy], a new, active learning-powered
    | annotation tool we've developed. Prodigy is fast and extensible, and
    | comes with a modern #[strong web application] that helps you collect
    | training data faster. It integrates seamlessly with spaCy, pre-selects
    | the #[strong most relevant examples] for annotation, and lets you
    | train and evaluate ready-to-use spaCy models.

+h(3, "annotations") Training with annotations

p
    | The #[+api("goldparse") #[code GoldParse]] object collects the annotated
    | training examples, also called the #[strong gold standard]. It's
    | initialised with the #[+api("doc") #[code Doc]] object it refers to,
    | and keyword arguments specifying the annotations, like #[code tags]
    | or #[code entities]. Its job is to encode the annotations, keep them
    | aligned and create the C-level data structures required for efficient access.
    | Here's an example of a simple #[code GoldParse] for part-of-speech tags:

+code.
    from spacy.vocab import Vocab
    from spacy.tokens import Doc
    from spacy.gold import GoldParse

    vocab = Vocab(tag_map={'N': {'pos': 'NOUN'}, 'V': {'pos': 'VERB'}})
    doc = Doc(vocab, words=['I', 'like', 'stuff'])
    gold = GoldParse(doc, tags=['N', 'V', 'N'])

p
    | Using the #[code Doc] and its gold-standard annotations, the model can be
    | updated to learn a sentence of three words with their assigned
    | part-of-speech tags. The #[+a("/usage/adding-languages#tag-map") tag map]
    | is part of the vocabulary and defines the annotation scheme. If you're
    | training a new language model, this will let you map the tags present in
    | the treebank you train on to spaCy's tag scheme.

+code.
    doc = Doc(Vocab(), words=['Facebook', 'released', 'React', 'in', '2014'])
    gold = GoldParse(doc, entities=['U-ORG', 'O', 'U-TECHNOLOGY', 'O', 'U-DATE'])

p
    | The same goes for named entities. The letters added before the labels
    | refer to the tags of the
    | #[+a("/usage/linguistic-features#updating-biluo") BILUO scheme] –
    | #[code O] is a token outside an entity, #[code U] a single-token entity unit,
    | #[code B] the beginning of an entity, #[code I] a token inside an entity
    | and #[code L] the last token of an entity.

+aside
    | #[strong Training data]: The training examples.#[br]
    | #[strong Text and label]: The current example.#[br]
    | #[strong Doc]: A #[code Doc] object created from the example text.#[br]
    | #[strong GoldParse]: A #[code GoldParse] object of the #[code Doc] and label.#[br]
    | #[strong nlp]: The #[code nlp] object with the model.#[br]
    | #[strong Optimizer]: A function that holds state between updates.#[br]
    | #[strong Update]: Update the model's weights.

+graphic("/assets/img/training-loop.svg")
    include ../../assets/img/training-loop.svg

p
    | Of course, it's not enough to only show a model a single example once.
    | Especially if you only have a few examples, you'll want to train for a
    | #[strong number of iterations]. At each iteration, the training data is
    | #[strong shuffled] to ensure the model doesn't make any generalisations
    | based on the order of examples.
    | Another technique to improve the learning
    | results is to set a #[strong dropout rate], a rate at which to randomly
    | "drop" individual features and representations. This makes it harder for
    | the model to memorise the training data. For example, a #[code 0.25]
    | dropout means that each feature or internal representation has a 1/4
    | likelihood of being dropped.

+aside
    | #[+api("language#begin_training") #[code begin_training()]]: Start the
    | training and return an optimizer function to update the model's weights.
    | Can take an optional function converting the training data to spaCy's
    | training format.#[br]
    | #[+api("language#update") #[code update()]]: Update the model with the
    | training example and gold data.#[br]
    | #[+api("language#to_disk") #[code to_disk()]]: Save the updated model to
    | a directory.

+code("Example training loop").
    import random
    from spacy.gold import GoldParse

    # Assumes nlp, train_data and the optional get_data function are already defined
    optimizer = nlp.begin_training(get_data)
    for itn in range(100):
        random.shuffle(train_data)
        for raw_text, entity_offsets in train_data:
            doc = nlp.make_doc(raw_text)
            gold = GoldParse(doc, entities=entity_offsets)
            nlp.update([doc], [gold], drop=0.5, sgd=optimizer)
    nlp.to_disk('/model')

p
    | The #[+api("language#update") #[code nlp.update]] method takes the
    | following arguments:

+table(["Name", "Description"])
    +row
        +cell #[code docs]
        +cell
            | #[+api("doc") #[code Doc]] objects. The #[code update] method
            | takes a sequence of them, so you can batch up your training
            | examples. Alternatively, you can also pass in a sequence of
            | raw texts.

    +row
        +cell #[code golds]
        +cell
            | #[+api("goldparse") #[code GoldParse]] objects. The #[code update]
            | method takes a sequence of them, so you can batch up your
            | training examples. Alternatively, you can also pass in a
            | dictionary containing the annotations.

    +row
        +cell #[code drop]
        +cell
            | Dropout rate. Makes it harder for the model to just memorise
            | the data.

    +row
        +cell #[code sgd]
        +cell
            | An optimizer, i.e. a callable to update the model's weights.
            | If not set, spaCy will create a new one and save it for further use.

p
    | Instead of writing your own training loop, you can also use the
    | built-in #[+api("cli#train") #[code train]] command, which expects data
    | in spaCy's #[+a("/api/annotation#json-input") JSON format]. On each epoch,
    | a model will be saved out to the directory. After training, you can
    | use the #[+api("cli#package") #[code package]] command to generate an
    | installable Python package from your model.

+code(false, "bash").
    python -m spacy convert /tmp/train.conllu /tmp/data
    python -m spacy train en /tmp/model /tmp/data/train.json -n 5

+h(3, "training-simple-style") Simple training style
    +tag-new(2)

p
    | Instead of sequences of #[code Doc] and #[code GoldParse] objects,
    | you can also use the "simple training style" and pass
    | #[strong raw texts] and #[strong dictionaries of annotations]
    | to #[+api("language#update") #[code nlp.update]].
    | The dictionaries can have the keys #[code entities], #[code heads],
    | #[code deps], #[code tags] and #[code cats]. This is generally
    | recommended, as it removes one layer of abstraction, and avoids
    | unnecessary imports. It also makes it easier to structure and load
    | your training data.

+aside-code("Example Annotations").
    {
        'entities': [(0, 4, 'ORG')],
        'heads': [1, 1, 1, 5, 5, 2, 7, 5],
        'deps': ['nsubj', 'ROOT', 'prt', 'quantmod', 'compound', 'pobj', 'det', 'npadvmod'],
        'tags': ['PROPN', 'VERB', 'ADP', 'SYM', 'NUM', 'NUM', 'DET', 'NOUN'],
        'cats': {'BUSINESS': 1.0}
    }

+code("Simple training loop").
    import random
    import spacy

    TRAIN_DATA = [
        ("Uber blew through $1 million a week", {'entities': [(0, 4, 'ORG')]}),
        ("Google rebrands its business apps", {'entities': [(0, 6, "ORG")]})]

    nlp = spacy.blank('en')
    optimizer = nlp.begin_training()
    for i in range(20):
        random.shuffle(TRAIN_DATA)
        for text, annotations in TRAIN_DATA:
            nlp.update([text], [annotations], sgd=optimizer)
    nlp.to_disk('/model')

p
    | The above training loop leaves out a few details that can really
    | improve accuracy – but the principle really is #[em that] simple. Once
    | you've got your pipeline together and you want to tune the accuracy,
    | you usually want to process your training examples in batches, and
    | experiment with #[+api("top-level#util.minibatch") #[code minibatch]]
    | sizes and dropout rates, set via the #[code drop] keyword argument. See
    | the #[+api("language") #[code Language]] and #[+api("pipe") #[code Pipe]]
    | API docs for available options.
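
p
    | To make the idea of batching more concrete, here is a minimal sketch in
    | plain Python. The helpers #[code minibatch_sketch] and
    | #[code iter_training_batches] are hypothetical names used only for
    | illustration. They stand in for spaCy's own #[code minibatch] utility and
    | are not the library's implementation. The sketch reshuffles the examples
    | on each iteration and groups them into batches of texts and annotation
    | dictionaries, which is roughly the shape you would pass to
    | #[code nlp.update].

```python
import random
from itertools import islice


def minibatch_sketch(items, size):
    """Yield successive lists of up to `size` items (hypothetical helper)."""
    if size < 1:
        raise ValueError("size must be a positive integer")
    iterator = iter(items)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch


TRAIN_DATA = [
    ("Uber blew through $1 million a week", {"entities": [(0, 4, "ORG")]}),
    ("Google rebrands its business apps", {"entities": [(0, 6, "ORG")]}),
    ("Google Maps launches location sharing", {"entities": [(0, 11, "PRODUCT")]}),
]


def iter_training_batches(train_data, n_iter, batch_size, seed=0):
    """Shuffle a copy of the data each iteration and yield (texts, annotations)."""
    rng = random.Random(seed)
    examples = list(train_data)  # copy, so the caller's list keeps its order
    for _ in range(n_iter):
        rng.shuffle(examples)
        for batch in minibatch_sketch(examples, batch_size):
            texts, annotations = zip(*batch)
            # In a real training loop, this is where the batch would be passed
            # on, e.g. nlp.update(texts, annotations, drop=..., sgd=optimizer)
            yield list(texts), list(annotations)


batches = list(iter_training_batches(TRAIN_DATA, n_iter=2, batch_size=2))
```

p
    | With three examples and a batch size of 2, each iteration produces one
    | batch of two examples and one batch of one. The batch size and the dropout
    | rate are the two settings the paragraph above suggests experimenting with.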