mirror of
				https://github.com/explosion/spaCy.git
				synced 2025-10-25 21:21:10 +03:00 
			
		
		
		
	
		
			
				
	
	
		
			263 lines
		
	
	
		
			12 KiB
		
	
	
	
		
			ReStructuredText
		
	
	
	
	
	
			
		
		
	
	
			263 lines
		
	
	
		
			12 KiB
		
	
	
	
		
			ReStructuredText
		
	
	
	
	
	
| How spaCy Works
 | |
| ===============
 | |
| 
 | |
| The following are some hasty preliminary notes on how spaCy works.  The short
 | |
| story is, there are no new killer algorithms.  The way that the tokenizer works
 | |
| is novel and a bit neat, and the parser has a new feature set, but otherwise
 | |
| the key algorithms are well known in the recent literature.
 | |
| 
 | |
| Some might also wonder how I get Python code to run so fast.  I don't --- spaCy
 | |
| is written in `Cython`_, an optionally statically-typed language that compiles
 | |
| to C or C++, which is then loaded as a C extension module.
 | |
| This makes it `easy to achieve the performance of native C code`_, but allows the
 | |
| use of Python language features, via the Python C API.  The Python unicode
 | |
| library was particularly useful to me.  I think it would have been much more
 | |
| difficult to write spaCy in another language.
 | |
| 
 | |
| .. _Cython: http://cython.org/
 | |
| 
 | |
| .. _easy to achieve the performance of native C code: https://honnibal.wordpress.com/2014/10/21/writing-c-in-cython/
 | |
| 
 | |
| Tokenizer and Lexicon
 | |
| ---------------------
 | |
| 
 | |
| Tokenization is the task of splitting a string into meaningful pieces, called
 | |
| tokens, which you can then compute with.  In practice, the task is usually to
 | |
| match the tokenization performed in some treebank, or other corpus.  If we want
 | |
| to apply a tagger, entity recogniser, parser etc, then we want our run-time
 | |
| text to match the training conventions.  If we want to use a model that's been
 | |
| trained to expect "isn't" to be split into two tokens, ["is", "n't"], then that's
 | |
| how we need to prepare our data.
 | |
| 
 | |
| In order to train spaCy's models with the best data available, I therefore
 | |
| tokenize English according to the Penn Treebank scheme.  It's not perfect, but
 | |
| it's what everybody is using, and it's good enough.
 | |
| 
 | |
| What we don't do
 | |
| ################
 | |
| 
 | |
| The Penn Treebank was distributed with a script called tokenizer.sed, which
 | |
| tokenizes ASCII newswire text roughly according to the Penn Treebank standard.
 | |
| Almost all tokenizers are based on these regular expressions, with various
 | |
| updates to account for unicode characters, and the fact that it's no longer
 | |
| 1986 --- today's text has URLs, emails, emoji, etc.
 | |
| 
 | |
| Usually, the resulting regular expressions are applied in multiple passes, which
 | |
| is quite inefficient.  Often no care is taken to preserve indices into the original
 | |
| string.  If you lose these indices, it'll be difficult to calculate mark-up based
 | |
| on your annotations.
 | |
| 
 | |
| Tokenizer Algorithm
 | |
| ###################
 | |
| 
 | |
| spaCy's tokenizer assumes that no tokens will cross whitespace --- there will
 | |
| be no multi-word tokens.  If we want these, we can post-process the
 | |
| token-stream later, merging as necessary.  This assumption allows us to deal
 | |
| only with small chunks of text.  We can cache the processing of these, and
 | |
| simplify our expressions somewhat.
 | |
| 
 | |
| Here is what the outer-loop would look like in Python. (You can see the
 | |
| production implementation, in Cython, here.)
 | |
| 
 | |
| .. code:: python
 | |
| 
 | |
|     cache = {}
 | |
|     def tokenize(text):
 | |
|         tokens = []
 | |
|         for substring in text.split(' '):
 | |
|             if substring in cache:
 | |
|                 tokens.extend(cache[substring])
 | |
|             else:
 | |
|                 subtokens = _tokenize_substring(substring)
 | |
|                 tokens.extend(subtokens)
 | |
|                 cache[substring] = subtokens
 | |
|         return tokens
 | |
| 
 | |
| The actual work is performed in _tokenize_substring.  For this, I divide the
 | |
| tokenization rules into three pieces:
 | |
| 
 | |
| 1. A prefixes expression, which matches from the start of the string;
 | |
| 2. A suffixes expression, which matches from the end of the string;
 | |
| 3. A special-cases table, which matches the whole string.
 | |
| 
 | |
| The algorithm then proceeds roughly like this (consider this like pseudo-code;
 | |
| this was written quickly and has not been executed):
 | |
| 
 | |
| .. code:: python
 | |
| 
 | |
|     # Tokens which can be attached at the beginning or end of another
 | |
|     prefix_re = _make_re([",", '"', '(', ...])
 | |
|     suffix_re = _make_re(s[",", "'", ":", "'s", ...])
 | |
| 
 | |
|     # Contractions etc are simply enumerated, since they're a finite set.  We
 | |
|     # can also specify anything we like here, which is nice --- different data
 | |
|     # has different quirks, so we want to be able to add ad hoc exceptions.
 | |
|     special_cases = {
 | |
|         "can't": ("ca", "n't"),
 | |
|         "won't": ("wo", "n't"),
 | |
|         "he'd've": ("he", "'d", "'ve"),
 | |
|         ...
 | |
|         ":)": (":)",) # We can add any arbitrary thing to this list.
 | |
|     }
 | |
| 
 | |
|     def _tokenize_substring(substring):
 | |
|         prefixes = []
 | |
|         suffixes = []
 | |
|         while substring not in special_cases:
 | |
|             prefix, substring = _apply_re(substring, prefix_re)
 | |
|             if prefix:
 | |
|                 prefixes.append(prefix)
 | |
|             else:
 | |
|                 suffix, substring = _apply_re(substring, suffix_re)
 | |
|                 if suffix:
 | |
|                     suffixes.append(suffix)
 | |
|                 else:
 | |
|                     break
 | |
| 
 | |
| 
 | |
| This procedure splits off tokens from the start and end of the string, at each
 | |
| point checking whether the remaining string is in our special-cases table.  If
 | |
| it is, we stop splitting, and return the tokenization at that point.
 | |
| 
 | |
| The advantage of this design is that the prefixes, suffixes and special-cases
 | |
| can be declared separately, in easy-to-understand files.  If a new entry is
 | |
| added to the special-cases, you can be sure that it won't have some unforeseen
 | |
| consequence to a complicated regular-expression grammar.
 | |
| 
 | |
| Coupling the Tokenizer and Lexicon
 | |
| ##################################
 | |
| 
 | |
| As mentioned above, the tokenizer is designed to support easy caching.  If all
 | |
| we were caching were the matched substrings, this would not be so advantageous.
 | |
| Instead, what we do is create a struct which houses all of our lexical
 | |
| features, and cache *that*.  The tokens are then simply pointers to these rich
 | |
| lexical types.
 | |
| 
 | |
| In a sample of text, vocabulary size grows exponentially slower than word
 | |
| count.  So any computations we can perform over the vocabulary and apply to the
 | |
| word count are efficient.
 | |
| 
 | |
| 
 | |
| Part-of-speech Tagger
 | |
| ---------------------
 | |
| 
 | |
| .. _how to write a good part of speech tagger: https://honnibal.wordpress.com/2013/09/11/a-good-part-of-speechpos-tagger-in-about-200-lines-of-python/ .
 | |
| 
 | |
| In 2013, I wrote a blog post describing `how to write a good part of speech
 | |
| tagger`_.
 | |
| My recommendation then was to use greedy decoding with the averaged perceptron.
 | |
| I think this is still the best approach, so it's what I implemented in spaCy.
 | |
| 
 | |
| The tutorial also recommends the use of Brown cluster features, and case
 | |
| normalization features, as these make the model more robust and domain
 | |
| independent.  spaCy's tagger makes heavy use of these features.
 | |
| 
 | |
| Dependency Parser
 | |
| -----------------
 | |
| 
 | |
| .. _2014 blog post: https://honnibal.wordpress.com/2013/12/18/a-simple-fast-algorithm-for-natural-language-dependency-parsing/
 | |
| 
 | |
| The parser uses the algorithm described in my `2014 blog post`_.
 | |
| This algorithm, shift-reduce dependency parsing, is becoming widely adopted due
 | |
| to its compelling speed/accuracy trade-off.
 | |
| 
 | |
| Some quick details about spaCy's take on this, for those who happen to know
 | |
| these models well.  I'll write up a better description shortly.
 | |
| 
 | |
| 1. I use greedy decoding, not beam search;
 | |
| 2. I use the arc-eager transition system;
 | |
| 3. I use the Goldberg and Nivre (2012) dynamic oracle.
 | |
| 4. I use the non-monotonic update from my CoNLL 2013 paper (Honnibal, Goldberg
 | |
|    and Johnson 2013).
 | |
| 
 | |
| So far, this is exactly the configuration from the CoNLL 2013 paper, which
 | |
| scored 91.0. So how have I gotten it to 92.4?  The following tweaks:
 | |
| 
 | |
| 1. I use Brown cluster features --- these help a lot;
 | |
| 2. I redesigned the feature set. I've long known that the Zhang and Nivre
 | |
|    (2011) feature set was suboptimal, but a few features don't make a very
 | |
|    compelling publication.  Still, they're important.
 | |
| 3. When I do the dynamic oracle training, I also make
 | |
|    the upate cost-sensitive: if the oracle determines that the move the parser
 | |
|    took has a cost of N, then the weights for the gold class are incremented by
 | |
|    +N, and the weights for the predicted class are incremented by -N.  This
 | |
|    only made a small (0.1-0.2%) difference.
 | |
| 
 | |
| Implementation
 | |
| ##############
 | |
| 
 | |
| I don't do anything algorithmically novel to improve the efficiency of the
 | |
| parser.  However, I was very careful in the implementation.
 | |
| 
 | |
| A greedy shift-reduce parser with a linear model boils down to the following
 | |
| loop:
 | |
| 
 | |
| .. code:: python
 | |
| 
 | |
|     def parse(words, model, feature_funcs, n_classes):
 | |
|         state = init_state(words)
 | |
|         for _ in range(len(words) * 2):
 | |
|             features = [templ(state) for templ in feature_funcs]
 | |
|             scores = [0 for _ in range(n_classes)]
 | |
|             for feat in features:
 | |
|                 weights = model[feat]
 | |
|                 for i, weight in enumerate(weights):
 | |
|                     scores[i] += weight
 | |
|             class_, score = max(enumerate(scores), key=lambda item: item[1])
 | |
|             transition(state, class_)
 | |
| 
 | |
| The parser makes 2N transitions for a sentence of length N.  In order to select
 | |
| the transition, it extracts a vector of K features from the state.  Each feature
 | |
| is used as a key into a hash table managed by the model.  The features map to
 | |
| a vector of weights, of length C.  We then dot product the feature weights to the
 | |
| scores vector we are building for that instance.
 | |
| 
 | |
| The inner-most loop here is not so bad: we only have a few dozen classes, so
 | |
| it's just a short dot product.  Both of the vectors are in the cache, so this
 | |
| is a snack to a modern CPU.
 | |
| 
 | |
| The bottle-neck in this algorithm is the 2NK look-ups into the hash-table that
 | |
| we must make, as these almost always have to hit main memory.  The feature-set
 | |
| is enormously large, because all of our features are one-hot boolean
 | |
| indicators.  Some of the features will be common, so they'll lurk around in the
 | |
| CPU's cache hierarchy.  But a lot of them won't be, and accessing main memory
 | |
| takes a lot of cycles.
 | |
| 
 | |
| .. _Jeff Preshing's excellent post: http://preshing.com/20130107/this-hash-table-is-faster-than-a-judy-array/ .
 | |
| 
 | |
| I used to use the Google dense_hash_map implementation.  This seemed a solid
 | |
| choice: it came from a big brand, it was in C++, and it seemed very
 | |
| complicated.  Later, I read `Jeff Preshing's excellent post`_ on open-addressing
 | |
| with linear probing.
 | |
| This really spoke to me.  I had assumed that a fast hash table implementation
 | |
| would necessarily be very complicated, but no --- this is another situation
 | |
| where the simple strategy wins.
 | |
| 
 | |
| I've packaged my Cython implementation separately from spaCy, in the package
 | |
| `preshed`_ --- for "pre-hashed", but also as a nod to Preshing.  I've also taken
 | |
| great care over the feature extraction and perceptron code, which I'm distributing
 | |
| in a package named `thinc`_ (since it's for learning very sparse models with
 | |
| Cython).
 | |
| 
 | |
| .. _preshed: https://github.com/syllog1sm/preshed
 | |
| 
 | |
| .. _thinc: https://github.com/honnibal/thinc
 | |
| 
 | |
| By the way: from comparing notes with a few people, it seems common to
 | |
| implement linear models in a way that's suboptimal for multi-class
 | |
| classification.  The mistake is to store in the hash-table one weight per
 | |
| (feature, class) pair, rather than mapping the feature to a vector of weights,
 | |
| for all of the classes.  This is bad because it means you need to hit the table
 | |
| C times, one per class, as you always need to evaluate a feature against all of
 | |
| the classes.  In the case of the parser, this means the hash table is accessed
 | |
| 2NKC times, instead of the 2NK times if you have a weights vector.  You should
 | |
| also be careful to store the weights contiguously in memory --- you don't want
 | |
| a linked list here.  I use a block-sparse format, because my problems tend to
 | |
| have a few dozen classes.
 | |
| 
 | |
| I guess if I had to summarize my experience, I'd say that the efficiency of
 | |
| these models is really all about the data structures.  We want to stay small,
 | |
| and stay contiguous.  Minimize redundancy and minimize pointer chasing.
 | |
| That's why Cython is so well suited to this: we get to lay out our data
 | |
| structures, and manage the memory ourselves, with full C-level control.
 |