===================================
Tutorial: Extractive Summarization
===================================

This tutorial will go through the implementation of several extractive
summarization models with spaCy.

An *extractive* summarization system is a filter over the original document/s:
most of the text is removed, and the remaining text is formatted as a summary.
In contrast, an *abstractive* summarization system generates new text.

Application Context
-------------------

Extractive summarization systems need an application context.  We can't ask how
to design the system without some concept of what sort of summary will be
useful for a given application.  (Contrast with speech recognition, where
a notion of "correct" is much less application-sensitive.)

For this, I've adopted the application context that `Flipboard`_ discuss in a
recent blog post: they want to display lead-text to readers on mobile devices,
so that readers can easily choose interesting links.

I've chosen this application context for two reasons.  First, `Flipboard`_ say
they're putting something like this into production.  Second, there's a ready
source of evaluation data.  We can look at the lead-text that human editors
have chosen, and evaluate whether our automatic system chooses similar text.

Experimental Setup
------------------

Instead of scraping data, I'm using articles from the New York Times Annotated
Corpus, which is a handy dump of XML-annotated articles distributed by the LDC.
The annotations come with a field named "online lead paragraph".  Our
summarization systems will be evaluated on their Rouge-1 overlap with this
field.

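As a point of reference for the scores reported below, here is a minimal
sketch of Rouge-1 recall: the fraction of the reference's unigrams that also
appear in the system summary, with clipped counts.  It assumes both texts are
already tokenized and lower-cased, and is an illustration rather than the
official Rouge scorer:

.. code:: python

    from collections import Counter

    def rouge_1_recall(summary_tokens, reference_tokens):
        # Clipped unigram overlap: a reference unigram can only be matched
        # as many times as it occurs in the summary.
        if not reference_tokens:
            return 0.0
        summary_counts = Counter(summary_tokens)
        reference_counts = Counter(reference_tokens)
        matched = sum(min(freq, summary_counts[term])
                      for term, freq in reference_counts.items())
        return float(matched) / len(reference_tokens)
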
Further details of the experimental setup can be found in the appendices.

.. _newyorktimes.com: http://newyorktimes.com

.. _Flipboard: http://engineering.flipboard.com/2014/10/summarization/

.. _vector-space model: https://en.wikipedia.org/wiki/Vector_space_model

.. _LexRank algorithm: https://www.cs.cmu.edu/afs/cs/project/jair/pub/volume22/erkan04a-html/erkan04a.html

.. _PageRank: https://en.wikipedia.org/wiki/PageRank

Summarizer API
--------------

Each summarization model will have the following API:

.. py:function:: summarize(nlp: spacy.en.English, headline: unicode, paragraphs: List[unicode], target_length: int) -> unicode

We receive the headline, a list of paragraphs, and a target length.  We have to
produce a block of text where ``len(text) < target_length``.  We want summaries
that users will click on, and not bounce back out of.  Long-term, we want
summaries that would keep people using the app.

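For example, using the character-truncation baseline defined in the next
section (the article text here is just a stand-in):

.. code:: python

    from spacy.en import English

    nlp = English()
    headline = u'Example headline'
    paragraphs = [u'First paragraph of the article.',
                  u'Second paragraph of the article.']
    summary = truncate_chars(nlp, headline, paragraphs, target_length=140)
    assert len(summary) <= 140
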
Baselines: Truncate
-------------------

.. code:: python

    def truncate_chars(nlp, headline, paragraphs, target_length):
        # Cut the text mid-word, leaving room for the ellipsis.
        text = ' '.join(paragraphs)
        return text[:target_length - 3] + '...'

    def truncate_words(nlp, headline, paragraphs, target_length):
        text = ' '.join(paragraphs)
        tokens = text.split()
        n_words = 0
        n_chars = 0
        # Add whole words while the next one still fits, leaving room
        # for the ellipsis.
        while (n_words < len(tokens)
               and n_chars + len(tokens[n_words]) + 1 <= target_length - 3):
            n_chars += len(tokens[n_words]) + 1  # Word plus trailing space
            n_words += 1
        return ' '.join(tokens[:n_words]) + '...'

    def truncate_sentences(nlp, headline, paragraphs, target_length):
        summary = ''
        for para in paragraphs:
            tokens = nlp(para)
            for sentence in tokens.sentences():
                sent_text = str(sentence)
                # Stop before the sentence that would push us past the limit.
                if len(summary) + len(sent_text) >= target_length:
                    return summary
                summary += sent_text
        return summary

I'd be surprised if Flipboard never had something like this in production.  Details
like lead-text take a while to float up the priority list.  This strategy also has
the advantage of transparency: it's obvious to users how the decision is being
made, so nobody is likely to complain about the feature if it works this way.

Instead of cutting off the text mid-word, we can tokenize the text and truncate
on a word boundary, or parse it and truncate on a sentence boundary, as in the
``truncate_words`` and ``truncate_sentences`` functions above.  The Rouge-1
recall scores against the editors' lead-text are:

+----------------+----------------+
| System         | Rouge-1 recall |
+----------------+----------------+
| Truncate chars | 69.3           |
+----------------+----------------+
| Truncate words | 69.8           |
+----------------+----------------+
| Truncate sents | 48.5           |
+----------------+----------------+

Sentence Vectors
----------------

A simple bag-of-words model can be created using the ``count_by`` method, which
produces a dictionary of frequencies, keyed by string IDs:

.. code:: python

    >>> from spacy.en import English
    >>> from spacy.en.attrs import SIC
    >>> nlp = English()
    >>> tokens = nlp(u'a a a. b b b b.')
    >>> tokens.count_by(SIC)
    {41L: 4, 11L: 3, 5L: 2}
    >>> [s.count_by(SIC) for s in tokens.sentences()]
    [{11L: 3, 5L: 1}, {41L: 4, 5L: 1}]

Similar functionality is provided by `scikit-learn`_, but with a different
style of API design.  With spaCy, functions generally have more limited
responsibility.  The advantage of this is that spaCy's APIs are much simpler,
and it's often easier to compose functions in a more flexible way.

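For comparison, a rough scikit-learn equivalent of the bag-of-words step above
might look like the following sketch, using ``CountVectorizer`` (the example
text is arbitrary):

.. code:: python

    from sklearn.feature_extraction.text import CountVectorizer

    vectorizer = CountVectorizer()
    # Rows are documents, columns are vocabulary entries, values are counts.
    counts = vectorizer.fit_transform([u'The cat sat on the mat.',
                                       u'The dog lay on the rug.'])

Here tokenization, the vocabulary and the counting are all bundled into the
vectorizer object, whereas spaCy exposes the counts directly and leaves the
rest up to you.
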
One particularly powerful feature of spaCy is its support for
`word embeddings`_ --- the dense vectors introduced by deep learning models, and
now commonly produced by `word2vec`_ and related systems.

Once a set of word embeddings has been installed, the vectors are available
from any token:

.. code:: python

    >>> from spacy.en import English
    >>> from scipy.spatial.distance import cosine
    >>> nlp = English()
    >>> tokens = nlp(u'Apple banana Batman hero')
    >>> cosine(tokens[0].vec, tokens[1].vec)

.. _word embeddings: https://colah.github.io/posts/2014-07-NLP-RNNs-Representations/

.. _word2vec: https://code.google.com/p/word2vec/

Putting the pieces together, the top-level driver for the experiments looks
roughly like this:

.. code:: python

    import spacy.en

    def main(db_loc, output_dir, stops_loc, feat_type="tfidf"):
        nlp = spacy.en.English()

        # Read the stop list and make TF-IDF weights --- data needed for the
        # feature extraction.
        with open(stops_loc) as file_:
            stop_words = set(nlp.vocab.strings[word.strip()] for word in file_)
        idf_weights = get_idf_weights(nlp, iter_docs(db_loc))
        if feat_type == 'tfidf':
            feature_extractor = tfidf_extractor(stop_words, idf_weights)
        elif feat_type == 'vec':
            feature_extractor = vec_extractor(stop_words, idf_weights)
        else:
            raise ValueError("Unknown feat_type: %s" % feat_type)

        for i, text in enumerate(iter_docs(db_loc)):
            tokens = nlp(text)
            sentences = tokens.sentences()
            summary = summarize(sentences, feature_extractor)
            write_output(summary, output_dir, i)

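The helper functions referenced here are defined elsewhere in the tutorial.
As an illustration of what one of them might look like, here is a sketch of
``get_idf_weights``, which computes inverse document frequencies over the
corpus (the real implementation may differ):

.. code:: python

    import math

    from spacy.en.attrs import LEMMA

    def get_idf_weights(nlp, texts):
        # Count how many documents each lemma appears in.
        doc_freqs = {}
        n_docs = 0
        for text in texts:
            n_docs += 1
            tokens = nlp(text)
            for term in tokens.count_by(LEMMA):
                doc_freqs[term] = doc_freqs.get(term, 0) + 1
        return dict((term, math.log(float(n_docs) / freq))
                    for term, freq in doc_freqs.items())
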
.. _scikit-learn: http://scikit-learn.org/stable/modules/classes.html#module-sklearn.feature_extraction.text

The LexRank Algorithm
---------------------

The `LexRank algorithm`_ is described as a graph-based algorithm, derived from
Google's `PageRank`_.  The nodes are sentences, and the edges are the
similarities between one sentence and another.  The "graph" is fully-connected,
and its edges are undirected --- so, it's natural to represent this as a matrix:

.. code:: python

    from scipy.spatial.distance import cosine
    import numpy


    def lexrank(sent_vectors):
        n = len(sent_vectors)
        # Build the cosine similarity matrix.  scipy's cosine() returns a
        # distance, so subtract it from 1 to get a similarity.
        matrix = numpy.ndarray(shape=(n, n))
        for i in range(n):
            for j in range(n):
                matrix[i, j] = 1.0 - cosine(sent_vectors[i], sent_vectors[j])
        # Normalize each row so that it sums to 1
        for i in range(n):
            matrix[i] /= sum(matrix[i])
        return _pagerank(matrix)

The rows are normalized (i.e. rows sum to 1), allowing the PageRank algorithm
to be applied.  Unfortunately the PageRank implementation is rather opaque ---
it's easier to just read the Wikipedia page:

.. code:: python

    def _pagerank(matrix, d=0.85):
        # This is admittedly opaque --- just read the Wikipedia page.
        n = len(matrix)
        rank = numpy.ones(shape=(n,)) / n
        new_rank = numpy.zeros(shape=(n,))
        while not _has_converged(rank, new_rank):
            rank, new_rank = new_rank, rank
            for i in range(n):
                new_rank[i] = ((1.0 - d) / n) + (d * sum(rank * matrix[i]))
        return rank

    def _has_converged(x, y, epsilon=0.0001):
        return all(abs(x[i] - y[i]) < epsilon for i in range(len(x)))

Initial Processing
------------------

Feature Extraction
------------------

.. code:: python

    from spacy.en.attrs import LEMMA

    def sentence_vectors(sentences, idf_weights):
        vectors = []
        for sent in sentences:
            tf_idf = {}
            for term, freq in sent.count_by(LEMMA).items():
                tf_idf[term] = freq * idf_weights[term]
            vectors.append(tf_idf)
        return vectors

The LexRank paper models each sentence as a bag-of-words, as in the function
above.  This is simple and fairly standard, but often gives underwhelming
results.  My idea is to instead calculate vectors from `word embeddings`_,
which have been one of the exciting outcomes of the recent work on
deep-learning.  I had a quick look at the literature, and found a
`recent workshop paper`_ that suggested the idea was plausible.

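Here is a sketch of what that embedding-based feature extractor might look
like: represent each sentence as the IDF-weighted average of its tokens'
vectors, skipping stop words.  The function name is mine, and while ``.vec``
is used as shown earlier, treating ``.lemma`` as an integer ID that can be
looked up in ``idf_weights`` and ``stop_words`` is an assumption about the
API rather than something established above:

.. code:: python

    def embedding_sentence_vectors(sentences, idf_weights, stop_words):
        # Each sentence becomes the IDF-weighted average of its tokens'
        # embedding vectors, skipping stop words.
        vectors = []
        for sent in sentences:
            weighted = []
            weights = []
            for token in sent:
                if token.lemma in stop_words:
                    continue
                weight = idf_weights.get(token.lemma, 0.0)
                weighted.append(weight * token.vec)
                weights.append(weight)
            if weights and sum(weights) > 0:
                vectors.append(sum(weighted) / sum(weights))
            else:
                vectors.append(None)
        return vectors
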
Taking the feature representation and similarity function as parameters, the
LexRank function looks like this:

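A sketch of that parameterized version, following the shape of the earlier
``lexrank`` function (the exact signature here is my assumption):

.. code:: python

    import numpy

    def lexrank(sentences, feature_extractor, similarity, d=0.85):
        vectors = [feature_extractor(sent) for sent in sentences]
        n = len(vectors)
        # Pairwise similarity matrix over the sentence vectors.
        matrix = numpy.ndarray(shape=(n, n))
        for i in range(n):
            for j in range(n):
                matrix[i, j] = similarity(vectors[i], vectors[j])
        # Normalize each row so that it sums to 1, then run PageRank.
        for i in range(n):
            matrix[i] /= sum(matrix[i])
        return _pagerank(matrix, d=d)
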
Given a list of N sentences, a function that maps a sentence to a feature
vector, and a function that computes a similarity measure between two feature
vectors, this produces a vector of N floats, which indicate how well each
sentence represents the document as a whole.

.. _Rouge: https://en.wikipedia.org/wiki/ROUGE_%28metric%29

.. _recent workshop paper: https://www.aclweb.org/anthology/W/W14/W14-1504.pdf


Document Model
--------------