mirror of
https://github.com/explosion/spaCy.git
* Write more notes about spaCy's NER
This commit is contained in:
parent 7d16f25218
commit 0ec4df6d7c

@@ -3,15 +3,43 @@ include ./meta.jade

+WritePost(Meta)
p Until machines can think for themselves, we have to either program them explicitly, or give them examples. Let's say you want to write a program that tags mentions of a certain type of thing in text --- let's say, food. You can't write this program without supplying some definition of what you consider to be food, and what you don't. That much is fundamental. But, you don't necessarily need to sit down with an arbitrary sample of text, and start tagging it manually, word-by-word. Sometimes you do --- but sometimes there are easier ways.
p In this post I'll describe how to adapt spaCy's named entity recogniser to recognise categories of your choosing, including a quick and dirty way to produce training data semi-automatically. The strategy is to define a handful of seed words or phrases, and then use word2vec to generate a more expansive list. Before using Gensim to train the word vectors, we merge base noun phrases and previously recognised entities into single tokens, using spaCy's annotations. This gives a vector space that includes phrases as well as words. To train our tagger, we can manually classify some of these phrases in context, or we can use an old data generation trick adapted from word sense disambiguation.
h4 Adding a new class to a pre-trained model
p spaCy provides a named entity recogniser trained on OntoNotes 5, which recognises the usual classes --- person, location, organization, etc. In this tutorial we'll look for video games, using comments posted to Reddit in 2015. The Reddit comment corpus is some of my favourite data. It's a good amount of text, and it's quite clean. We don't have to do any HTML or MediaWiki processing, and there's very little automatically generated text. Most of the comments are around the same length, and people mostly type in full sentences.

p Specifically, I'll show how to:

ol
li Add classes to an existing entity recognition model;
li Update the model to fix errors on your data;
li Train a new named entity recognition model from scratch.

p First, let's have a look at what spaCy does out-of-the-box, with the pre-trained statistical model. Install spaCy and download the data:
h4 How the entity recognizer works
p Named entity recognition is typically modelled as a problem of assigning labels to contiguous spans of text. The spans are usually encoded into structured tags that indicate the span boundaries and types, so that the entities can be assigned like part-of-speech tags, using a linear-chain CRF or a Hidden Markov Model.
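p As a concrete example, under the #[code BILOU] scheme discussed below, a sentence like #[em I studied at the University of Sydney in Australia] would be encoded token-by-token roughly as follows (an illustrative encoding, not spaCy output):

pre
code
| I           O
| studied     O
| at          O
| the         O
| University  B-ORG
| of          I-ORG
| Sydney      L-ORG
| in          O
| Australia   U-GPE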
p All three parts of this design are bad. Named entities are not actually flat-structured. For instance, #[em the University of Sydney] is an organization name that contains a city name. This is very common. Entities also don't fall into neat exclusive hierarchies. Finally, and of least significance, encoding the problem into per-word tags fails to make the right thing easy. It makes it tricky to write loss functions and features that refer to the actual structures we're trying to predict: the spans.
p Addressing the first two problems would require us to depart significantly from most of the existing data and research. We'll have richly labelled, internally structured entities eventually. But we can at least set up the learning problem better, without the pointless tag encoding.
p Instead of a tagger, spaCy's entity recognizer is an incremental transition-based parser, with a simple transition system that matches the #[code BILOU] tagging scheme. We have the following actions (a toy Python sketch of them follows the list):

ul
li Begin(T): Open a new entity of type T at position i. Advance to position i+1. Only valid if no entity is currently open.
li In(T): Advance to position i+1. Only valid if an entity of type T is currently open.
li Last(T): Close the current open entity, setting its end-point to i+1. Advance to i+1. Only valid if an entity of type T is currently open.
li Unitary(T): Add an entity of type T, beginning at position i and ending at i+1. Advance to position i+1. Only valid if no entity is currently open.
li Out: Advance to i+1. Only valid if no entity is currently open.
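p To make that concrete, here's a toy Python sketch of the transition system: the validity checks and state updates for each action. It's purely illustrative; spaCy's real implementation lives in Cython:

pre
code
| class EntityState(object):
|     def __init__(self):
|         self.i = 0             # current token position
|         self.open_type = None  # label of the currently open entity, if any
|         self.start = None      # where the open entity began
|         self.entities = []     # finished (start, end, label) spans
|
| def is_valid(state, move, label):
|     if move in ('B', 'U', 'O'):
|         return state.open_type is None
|     else:  # 'I' or 'L'
|         return state.open_type == label
|
| def apply_action(state, move, label):
|     if move == 'B':
|         state.open_type, state.start = label, state.i
|     elif move == 'L':
|         state.entities.append((state.start, state.i + 1, label))
|         state.open_type = None
|     elif move == 'U':
|         state.entities.append((state.i, state.i + 1, label))
|     state.i += 1  # every action advances to the next token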
p The BILOU system introduces some redundancy, to make the decision boundary between the classes a little sharper. The idea is that this might make them a bit more linearly separable. If an entity of type T is currently open, the relevant decision is In(T) vs Last(T). If no entity is open, the decision is Begin(x) vs. Unitary(x) vs. Out. The accuracy advantage of this over the IOB scheme is well documented.
p A well set up tagging model will be equivalent to our parsing model. It's not very difficult to write your tagger such that invalid tag sequences are given zero probability. But to do that, you usually have to break the abstraction provided by the tagging set up. The linear chain CRF or HMM model invites you to consider the tags you're assigning as atomic. They're not --- they're structured. Similarly, unless you break abstraction, you'll probably write suboptimal features. The standard approach invites an Nth-order Markov assumption: you only get to ask questions about the last N tags. A typical N is 2 or 3. This means you can't ask a simple question, like "What's the first word of the current entity?". This is a natural question to ask, and a model that stops you from asking it is nonsense.
p So: spaCy's entity recognizer is an instance of class #[code spacy.parser.Parser]. It holds an instance of #[code spacy.parser.ner.BiluoPushDown], which owns an array of #[code Transition] structs. Each struct holds three function pointers, which are used to apply the action, determine whether the action is valid given the current state, and determine what its cost would be with respect to a gold-standard analysis.
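p In pure Python terms you can picture each of those structs as a record like the one below. This is just a mental model of the layout described above, not the actual Cython definition:

pre
code
| from collections import namedtuple
|
| # One entry per (action, label) pair, bundling the three functions the
| # parser needs: check validity, apply the action, and score its cost
| # against a gold-standard analysis.
| Transition = namedtuple('Transition', ['move', 'label', 'is_valid', 'apply', 'get_cost'])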
p Entity recognition proceeds word-by-word. At each word we ask the model to score all the actions, loop over the structs asking each whether it is valid, and apply the highest-scoring valid action, updating the state in place. At the end of the sentence we apply the analysis to the document.
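p Continuing the toy sketch from above, the decoding loop is just a few lines. The #[code model.score] call here is a stand-in for the real statistical model:

pre
code
| def decode(doc, model, labels):
|     moves = [('O', None)] + [(m, label) for label in labels for m in 'BILU']
|     state = EntityState()
|     while state.i < len(doc):
|         scores = model.score(state, doc)  # stand-in: dict mapping (move, label) to a score
|         valid = [a for a in moves if is_valid(state, a[0], a[1])]
|         move, label = max(valid, key=lambda a: scores[a])
|         apply_action(state, move, label)
|     return state.entities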
p We can condition the decisions on the whole document, and on all previous decisions. So we can easily ask questions about what entities have been frequent, and how we previously tagged some word. We don't currently have any features that do this, although research suggests they can be effective. We can also easily ask any questions about the current open entity, including whether it currently matches a gazetteer.
p The parser's model employs sparse data structures, so we don't have to resize anything if we add an additional class, and start providing examples of it. This is probably not an optimal way to fit the model, but it's often convenient.
pre
code
@@ -35,7 +63,13 @@ include ./meta.jade
p This prints:
p To add a new class to the existing model, first call the #[code nlp.entity.add_label()] method:
pre
code
| nlp.entity.add_label('VGAME')
p This registers the label, so you can now supply examples of it for the model to learn from. The #[code nlp.entity.train] method requires two arguments: an instance of #[code spacy.tokens.Doc], and an instance of #[code spacy.gold.GoldParse]. The #[code Doc] instance needs to #[em not] have existing named entity annotations, so it should be created via #[code nlp.tokenizer] directly:
pre
code
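p The example itself is elided from this diff hunk, but a minimal update step looks roughly like the sketch below. I'm assuming #[code GoldParse] accepts the entity annotation as a list of BILOU tags via its #[code entities] keyword; check the signature in your version if this doesn't match:

pre
code
| from spacy.gold import GoldParse
|
| # Assumes nlp.entity.add_label('VGAME') has already been called.
| doc = nlp.tokenizer(u'I spent all weekend playing Rocket League with my brother.')
| tags = ['O', 'O', 'O', 'O', 'O', 'B-VGAME', 'L-VGAME', 'O', 'O', 'O', 'O']
| gold = GoldParse(doc, entities=tags)
| nlp.entity.train(doc, gold)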
@@ -54,18 +88,166 @@ include ./meta.jade
p This really isn't a good solution. It's just a quick hack that requires no set up, and very few lines of code. If we started from scratch with the old data and the new examples, we could almost certainly do better. And retraining the entity recognition model doesn't take very long. However, we can't distribute the training data to you, and the best solution will depend on the specifics of your problem. If you need better accuracy, you should get in touch.
h4 Boot-strapping examples with nlp2vec
p We still need more examples. One solution is to sit down with a bunch of comments, and start annotating them from start to finish. For evaluation data, this is the only really satisfactory approach. But for training data, it's pretty inefficient.
p What we'll do is make a little seed dictionary of food terms. We'll then use a distributional similarity model to rank words and phrases for their similarity to our seed set. Then, we'll page through each of these words and phrases in turn, tagging examples of them in context.
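p Once we have the phrase-aware vectors described below, ranking candidates against the seed set is a single nearest-neighbour query. A quick sketch with Gensim; the seed keys and #[code topn] value here are arbitrary:

pre
code
| seeds = ['pizza|NOUN', 'hot_dogs|NOUN', 'scrambled_eggs|NOUN']
| candidates = model.most_similar(positive=seeds, topn=500)
| for key, similarity in candidates:
|     print('%.3f %s' % (similarity, key))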
p Mostly, things are food or they aren't. Most words and phrases are unambiguous in this respect, which means they can be tagged in bulk, with one decision applying to many examples. And for the ambiguous entries, annotation decisions can be made much more quickly if you make lots of the same type of decision in a row.
p It's important that we're able to do this over #[em phrases], not just individual words. We want to be tagging instances of "hot dog", not just "dog". A simple way to achieve this is to use spaCy's syntactic parser to identify base noun phrases. We then merge these noun phrases into single tokens, before we send the data to Gensim's word2vec implementation. I did this over all comments posted to Reddit in 2015.
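p Here's one way to do that preprocessing, sketched in Python: merge each base noun phrase and entity into a single underscore-joined key, tag everything with its part of speech or entity label, and write one space-separated sentence per line for Gensim to read. The key format mirrors the #[em word|TAG] keys queried below; details like trimming determiners off the noun chunks are left out of this sketch:

pre
code
| def sentence_keys(doc):
|     # Collect the spans we want to treat as single tokens.
|     spans = [(np.start, np.end, np.root.pos_) for np in doc.noun_chunks]
|     spans += [(ent.start, ent.end, ent.label_) for ent in doc.ents]
|     starts = {}
|     for start, end, tag in spans:
|         starts[start] = (end, tag)
|     keys = []
|     i = 0
|     while i < len(doc):
|         if i in starts:
|             end, tag = starts[i]
|             keys.append('_'.join(t.text for t in doc[i:end]) + '|' + tag)
|             i = end
|         else:
|             keys.append(doc[i].text + '|' + doc[i].pos_)
|             i += 1
|     return keys

p Each line of the preprocessed corpus is then just #[code ' '.join(sentence_keys(doc))], which is the whitespace-delimited format the (currently commented-out) training script below expects.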
// pre
// code
// | from __future__ import print_function, unicode_literals, division
// | import io
// | import bz2
// | import logging
// | from os import path
// | import os
// |
// | import plac
// | import ujson
// | from gensim.models import Word2Vec
// |
// |
// | class Corpus(object):
// |     def __init__(self, directory):
// |         self.directory = directory
// |
// |     def __iter__(self):
// |         for filename in os.listdir(self.directory):
// |             text_loc = path.join(self.directory, filename)
// |             with io.open(text_loc, 'r', encoding='utf8') as file_:
// |                 for sent_str in file_:
// |                     yield sent_str.split()
// |
// |
// | @plac.annotations(
// |     in_dir=("Location of input directory"),
// |     out_loc=("Location of output file"),
// |     n_workers=("Number of workers", "option", "n", int),
// |     size=("Dimension of the word vectors", "option", "d", int),
// |     window=("Context window size", "option", "w", int),
// |     min_count=("Min count", "option", "m", int))
// | def main(in_dir, out_loc, n_workers=4, window=5, size=128, min_count=10):
// |     logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)
// |     corpus = Corpus(in_dir)
// |     model = Word2Vec(
// |         corpus,
// |         size=size,
// |         window=window,
// |         min_count=min_count,
// |         workers=n_workers)
// |     model.save(out_loc)
// |
// |
// | if __name__ == '__main__':
// |     plac.call(main)
//p Note that the approach that I'm advocating here is actually fairly different from "active learning", where you try to get the classifier's uncertainty to prioritise a queue of examples to be annotated. This can minimise the total number of decisions you need to make to achieve some given accuracy. However, it makes the annotation task much more difficult, and introduces a whole new machine learning problem to tinker with. Active learning is well researched because it's generalisable: it doesn't require any specific insights about your annotation task. But this also makes it much less powerful.
p I'd been thinking about learning noun phrase vectors for a long time, but it only left the idea queue when I read the Trask et al. "sense2vec" paper, where they show that simply using part-of-speech tags to make the keys a little more context-specific helps a lot. The addition of phrases seems to make things even better.
p As a simple example, consider a common word like #[em take], which has a large number of distinct senses, as both a noun and a verb. Vanilla word2vec collapses all uses of this word into a single entry. Splitting these entries out even a little bit turns out to be very useful. Here we see the nearest neighbours of the nominal sense:
pre
code
| >>> model = gensim.models.Word2Vec.load('np_ner_tag_reddit_2015_01-300d.model')
| >>> model.most_similar(['take|NOUN'])
| [(u'personal_stance|NOUN', 0.49355706572532654), (u'best_guess|NOUN', 0.48488736152648926), (u'personal_favorite_song|NOUN', 0.4821690320968628), (u'stance|NOUN', 0.48033151030540466), (u'favorite_track|NOUN', 0.47577959299087524), (u'personal_opinion|NOUN', 0.47202107310295105), (u'favorite_community|NOUN', 0.4700315594673157), (u'personal_view|NOUN', 0.4675809144973755), (u'overall_opinion|NOUN', 0.4594742953777313), (u'first_impression|NOUN', 0.458611398935318)]
p The model has learned that as a noun, #[em take] is similar to phrases like #[em personal stance] and #[em best guess]. Its similarity to the verbal sense is quite low:
pre
code
| >>> model.similarity('take|NOUN', 'take|VERB')
| 0.13804844694430313
p When we ask the model for entries similar to the verbal sense of take, the result is interesting:
pre
code
| >>> model.most_similar(['take|VERB'])
| [(u'taking|VERB', 0.8178646564483643), (u'taken|VERB', 0.7978687286376953), (u'takes|VERB', 0.7828329205513), (u'tale|VERB', 0.7796445488929749), (u'Take|VERB', 0.7672088146209717), (u'give|VERB', 0.7639298439025879), (u'take|ADJ', 0.7631379961967468), (u'Taking|VERB', 0.7620075345039368), (u'put|VERB', 0.7561365962028503), (u'took|VERB', 0.7524303793907166)]
| >>> model.similarity('take|VERB', 'tale|VERB')
| 0.49611725495669901
p The first suggestion is #[em tale], tagged as a verb. This is surely a misspelling of #[em take], which the part-of-speech tagger has still managed to tag as intended, even though the actual word #[em tale] cannot be used as a verb.
p A few bonus queries on the chunked model.
p 1) The vector space seems like it'll give a good way to show compositionality:
p "fair game" is not a type of game:
pre
code
| >>> model.similarity('fair_game|NOUN', 'game|NOUN')
| 0.034977455677555599
| >>> model.similarity('multiplayer_game|NOUN', 'game|NOUN')
| 0.54464530644393849

p A "class action" is only very weakly a type of action:

pre
code
| >>> model.similarity('class_action|NOUN', 'action|NOUN')
| 0.14957825452335169

p But a class action lawsuit is definitely a type of lawsuit:

pre
code
| >>> model.similarity('class_action_lawsuit|NOUN', 'lawsuit|NOUN')
| 0.69595765453644187

p 2) Similarity between entities can be kind of fun.
p Here's what Reddit thinks of Donald Trump:
pre
code
| >>> model.most_similar(['Donald_Trump|PERSON'])
| [(u'Sarah_Palin|PERSON', 0.5510910749435425), (u'Rick_Perry|PERSON', 0.5508972406387329), (u'Stephen_Colbert|PERSON', 0.5499709844589233), (u'Alex_Jones|PERSON', 0.5492554306983948), (u'Michael_Moore|PERSON', 0.5363447666168213), (u'Charles_Manson|PERSON', 0.5363028645515442), (u'Dick_Cheney|PERSON', 0.5348431468009949), (u'Mark_Zuckerberg|PERSON', 0.5258212089538574), (u'Mark_Wahlberg|PERSON', 0.5251839756965637), (u'Michael_Jackson|PERSON', 0.5229078531265259)]
p Discussion of Bill Cosby makes some obvious (and some less obvious) comparisons:
pre
code
| >>> model.most_similar(['Bill_Cosby|PERSON'])
| [(u'Cosby|ORG', 0.6004706621170044), (u'Cosby|PERSON', 0.5874950885772705), (u'Roman_Polanski|PERSON', 0.5478169918060303), (u'George_Zimmerman|PERSON', 0.5398542881011963), (u'Charles_Manson|PERSON', 0.5387344360351562), (u'OJ_Simpson|PERSON', 0.5228893160820007), (u'Trayvon_Martin|PERSON', 0.514190137386322), (u'Adnan_Syed|PERSON', 0.49992451071739197), (u'rapist|NOUN', 0.49792540073394775), (u'srhbutts|NOUN', 0.49792492389678955)]
p Some queries produce more confusing results:
pre
code
| >>> model.most_similar(['Carrot_Top|PERSON'])
| [(u'Kate_Mara|PERSON', 0.5347248911857605), (u'Andy_Samberg|PERSON', 0.5336876511573792), (u'Ryan_Gosling|PERSON', 0.5287898182868958), (u'Emma_Stone|PERSON', 0.5243821740150452), (u'Charlie_Sheen|PERSON', 0.5209298133850098), (u'Joseph_Gordon_Levitt|PERSON', 0.5196050405502319), (u'Jonah_Hill|PERSON', 0.5151286125183105), (u'Zooey_Deschanel|PERSON', 0.514430582523346), (u'Gerard_Butler|PERSON', 0.5115377902984619), (u'Ellen_Page|PERSON', 0.5094753503799438)]
p I can't say the connection between Carrot Top and Kate Mara is obvious to me. I suppose this is true of most things about Carrot Top, so... fair play.
p 3) Reddit talks about food a lot, and those regions of the vector space seem very well defined:
pre
code
| >>> model.most_similar(['onion_rings|NOUN'])
| [(u'hashbrowns|NOUN', 0.8040812611579895), (u'hot_dogs|NOUN', 0.7978234887123108), (u'chicken_wings|NOUN', 0.793393611907959), (u'sandwiches|NOUN', 0.7903584241867065), (u'fries|NOUN', 0.7885469198226929), (u'tater_tots|NOUN', 0.7821801900863647), (u'bagels|NOUN', 0.7788236141204834), (u'chicken_nuggets|NOUN', 0.7787706255912781), (u'coleslaw|NOUN', 0.7771176099777222), (u'nachos|NOUN', 0.7755396366119385)]
p Some of Reddit's ideas about food are kind of... interesting. It seems to think bacon and broccoli are very similar:
pre
code
| >>> model.similarity('bacon|NOUN', 'broccoli|NOUN')
| 0.83276615202851845
p Reddit also thinks hot dogs are practically salad:
pre
code
| >>> model.similarity('hot_dogs|NOUN', 'salad|NOUN')
| 0.76765100035460465
| >>> model.similarity('hot_dogs|NOUN', 'entrails|NOUN')
| 0.28360725445449464
p Just keep telling yourself that, Reddit.
//p Our next step is to cluster phrases. We will use the skip-grams with negative sampling algorithm, generally known as "word2vec". However, the vectors we'll learn will be keyed by larger units, discovered using linguistic annotations. Trask et al. describe an adaptation of word2vec that uses part-of-speech tagged keys as "sense2vec". I like the sound of "nlp2vec".